Why were Athlon/VIA systems so notoriously flaky back in the day?

I remember fondly the NASDAQ bubble days, when it seemed like everyone was upgrading their PC every six months to be able to do useful and fun things, like play the newest 3D games (or to take advantage of the emerging high-speed internet without being slowed down by the piles of cable/DSL ISP crapware that came on the modem install CD).

Back then, custom PC shops competed with the big-box system builders by offering cheaper AMD Athlon based systems that frequently outperformed the much more expensive Intel Pentium III based systems, especially when unlocked and overclocked. But a common thread in the day was debating whether a motherboard with an Athlon CPU and a VIA chipset would be stable enough for the customer’s needs, or if his system should be specified with an Intel CPU and a 440BX based motherboard instead.

Among the issues that arose with the introduction of the AMD “Thunderbird” core Athlon and Duron CPUs was the need for extremely high pressure between the heatsink/fan (HSF) assembly and the exposed CPU core to ensure complete contact. Incorrect installation or removal of the HSF resulted in broken CPU sockets and ruined motherboard traces. A high-quality phase-change thermal compound was also needed, because pressure and heat would otherwise turn the thermal compound into melted goo that ran off the CPU, leaving it to fry.

The Thunderbird was the first PC CPU family that could not survive even a quick power-on self-test (POST) without an attached heatsink; it would self-immolate if part or all of the core was left without heatsink contact. Because of this, motherboards began to incorporate shutdown logic that refused to power the CPU unless the HSF assembly included a newly required speed sensor, and sounded an audible alarm when it was missing.

These CPUs produced a lot of heat, necessitating active case cooling in many instances, and also introduced builders to the variation in quality of power supplies, previously almost a non-issue. The difference between a no-name 250W power supply and a quality Sparkle 300W power supply could literally mean the difference between a stable system and an unstable system.

This was exacerbated by newer AGP-based video cards like the GeForce 256. These cards carried many of the typical AGP stability problems of Super Socket 7 platforms over to the Athlon platforms, in most cases necessitating disabling of AGP or limiting speed to 1X, and they also consumed inordinate amounts of power for the day. It was not uncommon to have to replace an underpowered power supply along with a video card upgrade. If the motherboard’s AGP slot could not supply sufficient power for such a card and lacked appropriate overcurrent protection, it was also not uncommon to have to replace the motherboard soon after the upgrade, once its voltage regulators let out their magic blue smoke and ceased to work.

Of course, these were also the days when the capacitor problems that would plague PC motherboards for the next decade or so were first starting to emerge. Even boards from quality brands like Abit and Micro-Star (MSI) would come back within the year for unstable behavior, and would be found to have bulging capacitors around the CPU socket. Unless the tech was otherwise idle, replacing the motherboard was a more effective use of time than replacing the capacitors.

I found myself on the losing end of the price/performance/stability debate several times when I built an Athlon-based system that should, by the specs, have been far in excess of the customer’s needs, and yet in the end, the customer was unhappy due to misbehavior that I could never seem to reproduce in the shop. Frequently, drivers, customer-installed software like the AOL client, Windows itself, or PEBKAC were offered as explanations for the problem. Back then, the line of thought was that surely, vendors wouldn’t be selling defective hardware, would they?

But looking in the rear-view mirror, it is easy to see why those early Athlon systems were notoriously unreliable.

After all, in addition to the seemingly endless array of newly developed systems-integration problems described above, we now know that those Athlon CPUs, the VIA motherboard chipsets that they were commonly paired with, common PCI peripherals of the day like the Highpoint ATA-66 RAID controllers and the Sound Blaster Live! family of sound cards, and motherboard BIOSes were positively riddled with bugs.

AMD’s revision guide for the “Model 4” Thunderbird-core Athlon, which also applies to the “Model 3” Thunderbird-core Duron, describes no fewer than four errata (#11, 13, 14, and 16) that, if not worked around by an aware systems programmer, would render the system unstable and possibly corrupt data. (A fix for #16 was eventually published by Microsoft, but by then it just about didn’t matter anymore. BIOS vendors were expected to work around the power-management bugs, and many simply never bothered to.)

On top of the bugs in the AMD CPUs, George Breese’s famous VIA PCI fix addressed several more issues with the VIA platform. These ranged from problems caused by poor third-party BIOS engineering, to bugs in the VIA 686B ATA controller that caused data corruption (at first denied by VIA, and then eventually admitted when denial was no longer possible), to problematic interactions between the VIA PCI arbiter logic and common PCI peripheral chips including video capture and sound cards that would corrupt data, cause poor performance, or hang the system.

The bus parking issue was primarily a problem with PCI expansion card controller chips that assumed the PCI arbiter supported bus parking when it did not always do so. The chip designers made this assumption even though bus parking is not a part of the PCI bus standard, because the Intel 82440BX chipset, one of the most popular Pentium II chipsets, supported it. The device assumed that it would be able to stay on the PCI bus indefinitely as long as no other device requested the bus (this was the BX behavior), while the older VIA PCI controller design would cut the device off without exception once its PCI latency timer expired. Since VIA south bridges prior to the VT8235 (which was paired with KT333 and later, but sometimes with a KT266A) did not support this bus parking, these devices would corrupt data, crackle, or otherwise behave unpredictably on a VIA platform, even while the same device worked perfectly on an Intel or SiS platform with no special workarounds. Increasing the PCI latency timer of the device, as Breese’s patch did, sometimes helped make these devices work on chipsets like the KT133, KT133A, KT266, and KT266A, but not perfectly; for KT400 and later, the patch enabled the new bus parking feature instead. (One exception to this problem would be MSI motherboards that incorporate MSI’s custom PCI arbiter chip, which originally appeared on the BX Master. Since these motherboards substitute their own PCI arbiter for the VIA southbridge’s arbiter logic, they should not exhibit this problem.)
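To make the difference concrete, here is a deliberately simplified toy model in Python of the behavior described above. It is not cycle-accurate and is not taken from Breese's patch or any chipset documentation; the function name `split_transfer` and its parameters are made up for illustration. It only counts how many separate bursts a device gets for one long transfer, depending on whether the arbiter leaves the device on the bus (the BX behavior) or cuts it off every time the latency timer expires (the older VIA behavior as described here).

```python
from typing import List


def split_transfer(total_clocks: int, latency_timer: int, grant_stays: bool) -> List[int]:
    """Toy model: lengths (in PCI clocks) of the bursts one transfer is split into.

    total_clocks  -- clocks the device needs on the bus to finish its transfer
    latency_timer -- value of the device's PCI latency timer, in clocks
    grant_stays   -- True if the arbiter leaves an uncontested device on the bus
                     (BX-style), False if it always cuts the device off when the
                     latency timer expires (older VIA-style, per the text)
    """
    if latency_timer <= 0:
        raise ValueError("latency_timer must be positive")
    if total_clocks <= 0:
        return []
    if grant_stays:
        # Nobody else wants the bus, so the device finishes in one burst.
        return [total_clocks]
    bursts = []
    remaining = total_clocks
    while remaining > 0:
        # The device is forced off the bus after latency_timer clocks
        # and has to re-arbitrate for the rest.
        burst = min(remaining, latency_timer)
        bursts.append(burst)
        remaining -= burst
    return bursts


if __name__ == "__main__":
    print(split_transfer(256, 32, grant_stays=True))
    print(split_transfer(256, 32, grant_stays=False))
    print(split_transfer(256, 128, grant_stays=False))
```

In this model, raising the latency timer from 32 to 128 cuts the number of interruptions from eight bursts to two, but never down to the single uninterrupted burst the device was designed around. That is consistent with the observation that a larger latency timer helped, but not perfectly.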

Problems with the VIA PCI arbiter date way back to the Apollo VP3 and MVP3 days. VIA’s PCI IRQ Routing Miniport driver, and later the “4in1” driver, was the solution even recommended by Creative themselves. (While the problem didn’t have anything to do with IRQ routing itself, the VIA driver modified and optimized PCI bridge registers as a side effect.)
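For anyone curious what these registers look like, the latency timer is a single byte at offset 0x0D of the standard PCI configuration header. The following is a minimal Python sketch, assuming a Linux machine that exposes configuration space under `/sys/bus/pci/devices`; the helper names `latency_timer` and `read_all` are my own. It only reads values and does not reproduce anything the VIA drivers or Breese's patch actually did.

```python
from pathlib import Path
from typing import Dict

# Offset of the Latency Timer byte in the standard PCI configuration header.
LATENCY_TIMER_OFFSET = 0x0D


def latency_timer(config: bytes) -> int:
    """Return the latency timer value from a raw PCI config-space dump."""
    if len(config) <= LATENCY_TIMER_OFFSET:
        raise ValueError("config space dump too short")
    return config[LATENCY_TIMER_OFFSET]


def read_all(sysfs_root: str = "/sys/bus/pci/devices") -> Dict[str, int]:
    """Map each PCI device address to its latency timer (Linux sysfs only)."""
    result: Dict[str, int] = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # Not Linux, or sysfs is not mounted: nothing to report.
        return result
    for dev in sorted(root.iterdir()):
        try:
            result[dev.name] = latency_timer((dev / "config").read_bytes())
        except (OSError, ValueError):
            # Skip devices whose config space cannot be read.
            continue
    return result


if __name__ == "__main__":
    for address, value in read_all().items():
        print(f"{address}: latency timer = {value}")
```

On a modern machine most entries will likely read 0, since PCI Express devices do not use the latency timer; a period Athlon/VIA box with conventional PCI cards is where the values become interesting.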

Other problems existed on AMD chipsets like the AMD751/756 and 761/766. AGP compatibility problems and USB errata were typical, and along with them came bugs in the integrated ATA controller (leading to poor performance and system hangs on the 766) and bugs in the implementation of the Programmable Interval Timer that would present compatibility problems with legacy software.

Software can only reduce how often many of the above problems occur; it cannot eliminate them, which leaves a system that crashes or corrupts data unpredictably.

Given all of the above issues, it is simply amazing that these systems managed to work as well as they did! Never underestimate the ability of vendors to cover up defects in the products they sell. View statements that minimize your observed issues with a critical eye, and when theory does not map to practice, change the theory. After all, your customers deserve your full attention to your own personal experiences, so that they do not have to relive those experiences for themselves!
