Archive for September, 2006

Anatomy of a PC BIOS

Tuesday, September 26th, 2006

Start from either a “live” image (dumped from memory on a running system) or a “dead” image (the ROM contents as stored on the chip or in an update file). A “dead” image may have a header that must be removed before disassembly because the BIOS code itself uses absolute offsets. A “live” image may be misleading because the BIOS code can self-modify, overwrite components or map them out using system hardware, and any static data reflects the state of the system after boot, not before boot. Also, the 64K segment at $F0000-$FFFFF is not necessarily large enough to map the entire BIOS ROM chip, meaning that parts of the BIOS ROM may be hidden from the system after initialization.

BIOS components

POST (Power On Self Test) routine

This is executed after initial CPU, cache, and chipset/memory setup.
The POST routine, since it is only used at boot, may be mapped into the option ROM region and then mapped out by the boot code and chipset after it is finished. This means it can be located outside the 64K BIOS segment in the BIOS ROM (which is larger than 64K in this case).
This routine will make external notification of its status by writes to port 0x80.

BIOS interrupt vectors

The real mode Interrupt Vector Table (IVT) contains pointers to the ISRs (Interrupt Service Routines) in BIOS code that implement the various software interrupts. For analysis, you should also dump the IVT, preferably from a protected mode operating system such as Linux which has not installed its own real-mode interrupt handlers. It is usually located at the start of memory (a protected mode operating system can relocate it, but Linux does not), and is $400 bytes long: 256 DWORD vectors in segment:offset format, stored little endian with the offset word first. Usually only the vectors for INT $0 through INT $77 are initialized. Vectors that are not initialized contain 0:0, which is the address of the IVT itself rather than of handler code; calling one may drop into a debugger or simply crash the system. Most initialized interrupt vectors will point to an 'iret' instruction, making them a no-op.

For installed vectors, if the OS has not installed its own handlers, the segment should always be a BIOS segment (i.e. the system BIOS at F000, or the video or HDD controller BIOS). In your code viewer, disassemble at the vector's offset in the BIOS image to see what the handler does. It will check the contents of various registers, commonly ax. Then the BIOS will either call its own functions or simply exit depending on the register contents. You can use Ralf Brown's Interrupt List to figure out what the BIOS is doing here based on the register contents passed to it.

The NMI vector is INT $2 and the CPU exception vectors are scattered from INT $0-$11. Hardware IRQ handlers for PIC interrupts 0-7 are mapped to INT $8-$F, and those for interrupts 8-15 are mapped to INT $70-$77 (IVT offset 0x1C0).
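To make the vector layout concrete, here is a minimal sketch that decodes one entry from a raw 1K IVT dump. The names `ivt_entry`, `ivt_decode` and `ivt_bios_image_offset` are made up for this illustration, and the second helper assumes a 64K system BIOS image that corresponds to segment F000.

```c
#include <stddef.h>
#include <stdint.h>

/* One decoded real-mode interrupt vector (hypothetical helper type). */
struct ivt_entry {
    uint16_t segment;
    uint16_t offset;
    uint32_t linear;   /* segment * 16 + offset */
};

/*
 * Decode vector n from a raw IVT dump.
 * Each entry is 4 bytes: offset word first, then segment word,
 * both stored little endian.
 */
struct ivt_entry ivt_decode(const uint8_t *ivt, unsigned n)
{
    const uint8_t *p = ivt + (size_t)n * 4;
    struct ivt_entry e;

    e.offset  = (uint16_t)(p[0] | (p[1] << 8));
    e.segment = (uint16_t)(p[2] | (p[3] << 8));
    e.linear  = ((uint32_t)e.segment << 4) + e.offset;
    return e;
}

/*
 * Offset of the handler inside a 64K system BIOS image that maps to
 * F000:0000, or -1 if the vector does not point into that segment.
 */
long ivt_bios_image_offset(struct ivt_entry e)
{
    return e.segment == 0xF000 ? (long)e.offset : -1L;
}
```

The value returned by `ivt_bios_image_offset` is where you would start disassembling in the image; a result of -1 means the handler lives in another ROM or in RAM.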

You will notice that hardware IRQ handlers work slightly differently. When the PIC receives an interrupt, it asserts INT#. The CPU finishes what it is doing and asserts INTA#. The PIC then deasserts INT# and sends the interrupt number (as a byte) to the CPU, and, in real mode, the CPU jumps to the code the IVT vector points to. At the end of an IRQ handler, the BIOS will write an EOI (End Of Interrupt) code to the 8259 PIC (Programmable Interrupt Controller). The EOI code is $20, the command port of the master PIC is $20, and that of the slave PIC is $A0; a handler for IRQ 8-15 must send EOI to both, since the slave is cascaded through the master. Sending EOI to the PIC allows the PIC to handle new IRQ events that are of equal or lower priority. Also, an ISR will always be exited with an 'iret' instruction, which pops the saved IP, CS, and FLAGS off the stack (EIP, CS, and EFLAGS in 32-bit code). On entry to a real mode handler the CPU clears the interrupt flag, so maskable interrupts stay disabled until the handler executes 'sti'; once it does, the PIC can reassert INT# in response to a higher priority IRQ. A final 'sti' before 'iret' is not required, because 'iret' restores the interrupted code's flags, including the interrupt flag. Receiving a NMI or SMI during an ISR is possible even if interrupts are disabled. However, while a NMI handler is executing, further NMIs are blocked until 'iret', and while a SMI handler is executing, further SMIs are blocked until RSM.
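The IRQ-to-vector mapping and the EOI rule can be written down as a small sketch. The function names are invented for illustration, and the actual port writes (an `outb` of $20 to each returned port) are left out so that the code stays portable.

```c
#include <stdint.h>

#define PIC_MASTER_CMD 0x20   /* master 8259 command port */
#define PIC_SLAVE_CMD  0xA0   /* slave 8259 command port */
#define PIC_EOI        0x20   /* End Of Interrupt code */

/* Real-mode interrupt vector the BIOS uses for an IRQ (0-15), or -1. */
int irq_to_vector(int irq)
{
    if (irq >= 0 && irq <= 7)
        return 0x08 + irq;          /* INT $08-$0F */
    if (irq >= 8 && irq <= 15)
        return 0x70 + (irq - 8);    /* INT $70-$77 */
    return -1;
}

/* Byte offset of a vector's 4-byte entry within the IVT. */
int vector_to_ivt_offset(int vector)
{
    return vector * 4;
}

/*
 * Fill ports[] with the command ports that need an EOI write for this
 * IRQ and return how many there are. IRQs 8-15 arrive through the
 * slave PIC, so both the slave and the master must be acknowledged.
 */
int irq_eoi_ports(int irq, uint16_t ports[2])
{
    int n = 0;

    if (irq < 0 || irq > 15)
        return 0;
    if (irq >= 8)
        ports[n++] = PIC_SLAVE_CMD;
    ports[n++] = PIC_MASTER_CMD;
    return n;
}
```

When reading a disassembly, a handler that ends with writes of $20 to both $A0 and $20 is therefore almost certainly servicing one of IRQ 8-15.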

BIOS32

This is a generic entry point for BIOS services that can be used by 32-bit callers. It, and thus the installed services, can be called directly from a protected mode program (using CALL FAR). To detect its presence, scan for the string _32_ on a paragraph (16-byte) boundary from $E0000-$FFFFF AND validate the checksum of the structure present there.
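The detection step might look like the sketch below when run over a memory dump. It assumes the header layout from the Phoenix proposal cited below (signature, 32-bit physical entry point at offset 4, length in paragraphs at offset 9, checksum at offset 10, all bytes summing to zero); the function names are mine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum len bytes modulo 256; a valid structure sums to zero. */
uint8_t bios32_byte_sum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;

    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/*
 * Scan a dump of $E0000-$FFFFF (mem[0] corresponds to $E0000) for the
 * BIOS32 Service Directory header. Returns the offset of the header
 * within mem, or -1 if no header with a valid checksum is found.
 */
long bios32_find(const uint8_t *mem, size_t len)
{
    size_t off;

    for (off = 0; off + 16 <= len; off += 16) {   /* paragraph boundaries */
        const uint8_t *h = mem + off;
        size_t hlen;

        if (memcmp(h, "_32_", 4) != 0)
            continue;
        hlen = (size_t)h[9] * 16;                 /* length in paragraphs */
        if (hlen == 0 || off + hlen > len)
            continue;
        if (bios32_byte_sum(h, hlen) == 0)
            return (long)off;
    }
    return -1;
}

/* 32-bit physical address of the entry point (header offset 4). */
uint32_t bios32_entry(const uint8_t *h)
{
    return (uint32_t)h[4] | ((uint32_t)h[5] << 8) |
           ((uint32_t)h[6] << 16) | ((uint32_t)h[7] << 24);
}
```

Checking the checksum matters in practice: the four signature bytes can occur by chance in code or data, and a false match would give a bogus entry point.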

Reference:

Standard BIOS 32-bit Service Directory Proposal, v0.4, Phoenix Technologies

Legacy configuration tables

Various programs may rely on the presence of these lists, tables, and interrupts, so they should be present and accurately reflect the state of the system hardware and firmware.

  • BIOS equipment list (Call INT $11 to obtain as value in AX), also stored at 0:410
  • Video Parameter Table (VPT), address stored in INT $1D (not a vector!), located at $FF0A4 in 100% PC compatibles
  • Diskette Parameter Table (DPT), address stored in INT $1E (not a vector!), located at $FEFC7 in 100% PC compatibles
  • Video Graphics Character Table (VGCT), address stored in INT $1F (not a vector!), located at $FFA6E in 100% PC compatibles
  • Fixed Disk Parameter Table (FDPT) for first hard drive, address stored in INT $41 (not a vector!), located at $FE401 in 100% PC compatibles
  • Fixed Disk Parameter Table (FDPT) for second hard drive, address stored in INT $46 (not a vector!)
  • EGA video graphics character table, address stored in INT $43 (not a vector!)
  • PCjr compatibles should also have INT $44, INT $48, and INT $49 locations pointing to valid tables

The RTC (CMOS) RAM (described below) may also contain system configuration information that has been stored by the BIOS or a configuration utility.

The BIOS Data Area (BDA, or BIOS Communication Area), 0:400 to 0:4AC, contains volatile BIOS state as well as hardware configuration data. It is left over from when the BIOS was always contained in a read only ROM. It too must reflect valid state. PC Convertible compatibles must also fill in the locations between 0:4AC and 0:4F0. 0:4F0 is a 16 byte scratch area for inter-application communication. 0:500-0:5FF is a scratch area for ROM POST and ROM BASIC.

References:

Ralf Brown's Interrupt List

http://www.frontiernet.net/~fys/rombios.htm

Extended BIOS Data Area (XBDA/EBDA)

This is a 1K data area for data such as user defined drive parameters (Type 47) which do not fit in the BIOS communication area. The BIOS has to store these somewhere besides inside its own image, because it has traditionally been fixed in a memory mapped ROM which cannot be written to. (The other option was to store these two 16-byte lists at 0:31D and 0:32D, but doing so precludes the use of software interrupts corresponding to the overwritten vectors.) A pointer to the XBDA is stored in the word at 0:40E. It is a real mode segment and is usually equal to $9FC0 (9FC0:0, linear address $9FC00), which is the top 1K of the 640K region. The XBDA eventually evolved to include information about pointing devices and peripheral ports among other configuration information.

With systems that contain more than 640K of memory (at least 1MB low memory), and the advent of chipsets that support shadow RAM (being able to switch between hardware memory/ROM and underlying RAM in $A0000-FFFFF on demand), the System BIOS could then be copied to RAM and the chipset used to switch from the ROM (mapped at boot) to the copy in the underlying RAM which will be used at runtime. The BIOS copy in Shadow RAM obsoletes the XBDA, as well as the BIOS communication area, because any volatile data can now be written to the BIOS image in RAM.

Even if the XBDA is empty (because the BIOS image is shadowed and the information is stored internally), it still exists and consumes 1K of conventional memory, i.e. the BIOS memory size at 0:413 reflects the size of conventional memory minus the size of this area.
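The relationship between the XBDA pointer and the memory size word is simple arithmetic. In this sketch the word at 0:40E is treated as a real mode segment, and the function names are illustrative only.

```c
#include <stdint.h>

/* Linear address of the XBDA, given the segment word read from 0:40E. */
uint32_t xbda_linear(uint16_t segment_at_40e)
{
    return (uint32_t)segment_at_40e << 4;
}

/*
 * Conventional memory size in KB that 0:413 should report when the
 * XBDA sits directly above the memory that is left for programs.
 */
uint16_t conventional_kb_below_xbda(uint16_t segment_at_40e)
{
    return (uint16_t)(xbda_linear(segment_at_40e) / 1024);
}
```

For the usual segment of $9FC0 this gives a linear address of $9FC00 and a memory size of 639K, i.e. 640K minus the 1K area.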

SMBIOS (System Management BIOS, previously DMI [Desktop Management Interface])

SMBIOS is a standard that supersedes the legacy equipment list, parameter tables, and XBDA. Search for the string “_SM_” at paragraph boundaries in the $F0000-$FFFFF region. It should be followed by “_DMI_” at the beginning of the next paragraph. Verify the checksum of this structure. The DWORD located 8 bytes after the start of “_DMI_” is then a pointer to the DMI structure table ($000Fxxxx). This structure should be filled out and have a correct checksum so that client software can use it.
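The search can be sketched as follows for a memory dump. It assumes the SMBIOS 2.x entry point layout from the specification cited below (entry length at offset 5, a second checksum over the 15 bytes starting at “_DMI_”, table address at offset $18); the function names are my own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum len bytes modulo 256; a valid structure sums to zero. */
uint8_t smbios_byte_sum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;

    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/*
 * Scan a dump of $F0000-$FFFFF (mem[0] corresponds to $F0000) for the
 * SMBIOS entry point. Returns its offset within mem, or -1.
 */
long smbios_find(const uint8_t *mem, size_t len)
{
    size_t off;

    for (off = 0; off + 0x20 <= len; off += 16) {  /* paragraph boundaries */
        const uint8_t *h = mem + off;
        size_t eps_len = h[5];                     /* entry point length */

        if (memcmp(h, "_SM_", 4) != 0)
            continue;
        if (memcmp(h + 0x10, "_DMI_", 5) != 0)
            continue;
        if (eps_len == 0 || off + eps_len > len)
            continue;
        if (smbios_byte_sum(h, eps_len) != 0)      /* whole entry point */
            continue;
        if (smbios_byte_sum(h + 0x10, 15) != 0)    /* "_DMI_" part */
            continue;
        return (long)off;
    }
    return -1;
}

/* Physical address of the structure table (DWORD at offset $18). */
uint32_t smbios_table_address(const uint8_t *h)
{
    return (uint32_t)h[0x18] | ((uint32_t)h[0x19] << 8) |
           ((uint32_t)h[0x1A] << 16) | ((uint32_t)h[0x1B] << 24);
}
```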

Reference: System Management BIOS (SMBIOS) Reference Specification, Version 2.4, DMTF

PnP (Plug & Play) BIOS

The real mode and 16-bit protected mode entry points will usually call the same internal function.

PCI

PCI BIOS supports 16-bit real and protected-mode callers through INT $1A, AH=$B1. Protected mode callers use the BIOS32 interface with $PCI or ICP$ service id.

Reference:

PCI BIOS Specification, 2.0, PCI SIG

Accesses to $CF8-$CFF are PCI controller accesses. The BIOS will use $CF8 to select a configuration register on a particular PCI target, and $CFC to read or write that configuration data. A DWORD write to $CF8 contains the following:

bus = bits 16-23
device = bits 11-15
function = bits 8-10
register = bits 2-7

For a configuration cycle, the high bit should always be set ($8xxxxxxx). After setting the target correctly, write or read the desired byte, word, or dword from $CFC.

A configuration structure being written to $CF8 can be decoded with this program:

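#include <stdio.h>
#include <stdlib.h>

/* Decode a DWORD written to $CF8 (given in hex on the command line). */
int main(int argc, char **argv)
{
        unsigned long num;
        unsigned int bus, device, function, reg;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <hex value written to CF8>\n", argv[0]);
                exit(EXIT_FAILURE);
        }

        /* Unsigned, so that the enable bit (bit 31) is not sign-extended. */
        num = strtoul(argv[1], NULL, 16);

        reg = (num >> 2) & 0x3f;        /* bits 2-7: DWORD register index */
        function = (num >> 8) & 0x7;    /* bits 8-10 */
        device = (num >> 11) & 0x1f;    /* bits 11-15 */
        bus = (num >> 16) & 0xff;       /* bits 16-23 */

        printf("Bus %x  Device %x  Function %x  Reg %x\n", bus, device, function, reg);
        exit(EXIT_SUCCESS);
}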
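Going the other way is just as easy. This sketch builds the DWORD for $CF8 from its fields using the bit layout given above; `pci_config_address` is a name chosen for this example, and the register is passed as a byte offset whose low two bits are dropped.

```c
#include <stdint.h>

/*
 * Build the value to write to $CF8 for a configuration cycle.
 * reg_offset is the byte offset of the register; bits 0-1 are masked
 * off because $CF8 selects a whole DWORD.
 */
uint32_t pci_config_address(unsigned bus, unsigned device,
                            unsigned function, unsigned reg_offset)
{
    return 0x80000000u                          /* enable bit */
         | ((uint32_t)(bus & 0xff) << 16)       /* bits 16-23 */
         | ((uint32_t)(device & 0x1f) << 11)    /* bits 11-15 */
         | ((uint32_t)(function & 0x7) << 8)    /* bits 8-10 */
         | (uint32_t)(reg_offset & 0xfc);       /* bits 2-7 */
}
```

Bus 1, device 2, function 3, register offset $40 comes out as $80011340, which is the kind of constant to look for in the disassembly just before an access to $CFC.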

APM

SMM

SMRAM is located at $38000-$3FFFF (SMBASE=$30000) or $A8000-$AFFFF (SMBASE=$A0000) when in SMM. SMRAM is initialized and cloaked by the system firmware and chipset. The chipset snoops SMIACT# (asserted by the CPU in response to a SMI) and maps SMRAM in or out of these regions accordingly. The CPU then flushes the pipeline, dumps its state to the top of SMRAM (the SMM state save map occupies SMBASE+$FE00 to SMBASE+$FFFF, i.e. offsets $7E00-$7FFF above the handler entry point), invalidates write back cache, and executes the code at SMBASE+$8000. When it encounters an RSM instruction, it reads its state back in from the state save map, and returns to the next program instruction as if nothing ever happened. The only visible evidence of an SMM invocation from the perspective of the operating system is missed timer ticks.

We are interested in analyzing the SMM code. The SMM code should be contained in the “dead” BIOS image somewhere. SMBASE can only be changed while in SMM, so, at boot time, the system firmware installs a dummy SMI handler at $38000 which changes SMBASE to $A0000, and then invokes SMM. Afterwards, the real SMI handler is installed at $A0000. $A0000 is the logical location for SMBASE anyway since this would otherwise be wasted RAM (hidden underneath mapped video card memory). Note: Upon entering SMM, SMI handler's CS is always $3000 even if SMBASE has been relocated, so correct handler code needs to check the current value of SMBASE and update CS accordingly.
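When hunting for the handler in an image it helps to have the SMRAM layout as numbers. This sketch derives the addresses from SMBASE; the state save map is placed at the top 512 bytes of the 32K window, which is the same thing as offsets $7E00-$7FFF counted from SMBASE+$8000. The macro and function names are mine.

```c
#include <stdint.h>

#define SMM_ENTRY_OFFSET      0x8000u   /* first instruction of the handler */
#define SMM_SAVE_START_OFFSET 0xFE00u   /* bottom of the state save map */
#define SMM_SAVE_END_OFFSET   0xFFFFu   /* top of the state save map */

/* Address at which the CPU starts executing the SMI handler. */
uint32_t smm_entry(uint32_t smbase)
{
    return smbase + SMM_ENTRY_OFFSET;
}

/* Lowest address of the CPU state save map. */
uint32_t smm_save_start(uint32_t smbase)
{
    return smbase + SMM_SAVE_START_OFFSET;
}

/* Highest address of the CPU state save map. */
uint32_t smm_save_end(uint32_t smbase)
{
    return smbase + SMM_SAVE_END_OFFSET;
}
```

With the default SMBASE of $30000 the handler starts at $38000 and the state is saved at $3FE00-$3FFFF; after relocation to $A0000 the same structures appear at $A8000 and $AFE00-$AFFFF.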

VBE

NVRAM (RTC, ESCD, System configuration)

The original PC/AT RTC chip was a Motorola MC146818. The RTC cells in later machines are compatible with it. It has 64 bytes of RAM (commonly referred to as CMOS memory) and a battery-backed real-time clock. Later chips have 128 bytes of RAM. You can recognize RTC RAM accesses by the use of port $70 (index) and $71 (data). The user region of the RTC RAM (indices $0E-$3F, or -$7F on 128-byte chips) is used to store user-defined BIOS settings as well as internal firmware state. The BIOS code will usually read from interesting RTC RAM locations while starting up, for example to get user-defined CPU/FSB settings, or to see if the system needs to be resumed from a BIOS suspend. It will also eventually perform some sort of checksum over the region, to ensure that low supply voltage has not allowed the stored data to become corrupt.
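The checksum varies between BIOS vendors, so treat the following only as an example of what to look for. It implements the conventional PC/AT scheme, which is an assumption on my part and not something taken from a particular BIOS: a 16-bit sum of bytes $10-$2D, stored high byte first at $2E-$2F. The function names are invented.

```c
#include <stdint.h>

/* 16-bit sum of the bytes covered by the conventional AT checksum. */
uint16_t cmos_at_sum(const uint8_t *cmos)
{
    uint16_t sum = 0;
    int i;

    for (i = 0x10; i <= 0x2D; i++)
        sum = (uint16_t)(sum + cmos[i]);
    return sum;
}

/*
 * Return 1 if the checksum stored at $2E (high byte) and $2F (low
 * byte) matches the computed sum, 0 otherwise. cmos must hold at
 * least the first 64 bytes of RTC RAM.
 */
int cmos_at_checksum_ok(const uint8_t *cmos)
{
    uint16_t stored = (uint16_t)((cmos[0x2E] << 8) | cmos[0x2F]);

    return cmos_at_sum(cmos) == stored;
}
```

In a disassembly this shows up as a loop that writes consecutive indices to $70, accumulates the bytes read from $71, and then compares against two more locations.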

When reading or writing RTC RAM, there should be a delay in between the write of $70 and the subsequent access of $71. However, this delay should not be too long because the RTC chip can become confused. A good rule of thumb is to always read from $71 after setting $70, even if the value is not interesting.

When the RTC IRQ handler is installed, the RTC index port is usually set to $0D (so that the status register can quickly be read). If you play with the RTC yourself, remember to disable interrupts first and restore $0D to the index before re-enabling interrupts, or the IRQ8 handler may become confused. Another good reason to leave the index at $0D is that if it is set to a RAM location, that location could be corrupted when power is removed from the system.

The system NMI mask bit is bit 7 of port $70. Yes, the Non-Maskable Interrupt is actually maskable, through the 8042 keyboard controller – the high bit of that port is latched by the 8042. This was overlapped with the index port so that the index could be written simultaneously with masking NMI (so that an NMI would not occur while RTC RAM was being updated – a potentially hazardous situation since the RTC RAM is preserved across reset even if it is in a corrupt state). However, it also means that being able to index more than 128 locations is impossible. It also means that if it is possible for the NMI to be masked when you go to perform your RTC access, you must first read the index port and preserve the high bit when writing your index value, otherwise you may inadvertently unmask the NMI.
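Preserving the high bit comes down to one line of bit arithmetic. The sketch below only computes the byte to write to $70; the port accesses themselves, and whether the index port can be read back at all on a given chipset, are left to the caller. `rtc_index_byte` is an illustrative name.

```c
#include <stdint.h>

/*
 * Combine the NMI mask bit (bit 7) from the current value of port $70
 * with a new RTC index (bits 0-6), so that selecting a location does
 * not change the NMI mask state.
 */
uint8_t rtc_index_byte(uint8_t current_port70, uint8_t index)
{
    return (uint8_t)((current_port70 & 0x80) | (index & 0x7F));
}
```

For example, if the port currently reads $8D (bit 7 set, index $0D), selecting location $0E means writing $8E rather than $0E.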

Some MC146818-compatible RTC chips have even more RAM stored in extra banks, but the access method is implementation defined. Some use registers within $70 to access the extra banks, others (such as Intel PIIX) implement $72 and $73 as another 128-byte bank with identical access strategy to $70 and $71.

Reference: http://www.maxim-ic.com/appnotes.cfm/an_pk/77

MPS
APIC
IRQ routing table
ACPI
Boot flag (RSDT: “BOOT”)
Reference: Simple Boot Flag Specification, 2.1, Microsoft
System setup program

Code Trace analysis

Execution starts at offset $FFF0 of the system BIOS segment (F000:FFF0), and the code there jumps to the real start routine. The system BIOS is not relocatable, so use $F000 as the segment for disassembly. Other BIOS images must be relocatable either by DIP switch (ISA) or by PCI configuration, so they will either employ relative addressing or have relocation fixups at the beginning of the code.

Accesses to $60-$68 are KBC accesses (i.e. a self test, or checking for the key combination to enter setup or to invoke a BIOS upgrade)

cr0 reads or writes are CPU configuration (enable/disable cache or FPU)

TSC, power management, and program correctness

Wednesday, September 20th, 2006

The RDTSC instruction to read the Pentium Time Stamp Counter (internal cycle counter of the Pentium and greater Intel CPUs, all compatibles, and some 486-class CPUs from other manufacturers such as the 5x86) is a syscall-free way for userspace programs on Intel platforms to do timing. However, the system CPU can be slowed or halted by things like cpufreq, APM BIOS, or ACPI power methods, and this causes the TSC count to slow as well. After a power management event, the TSC will be an incorrect reflection of real time to the program and the program will malfunction.

The correct way to deal with this is to create a thread that opens /dev/apm_bios and /var/run/acpid.socket if they exist, and when a power management event is detected from either source, recalibrate the program's TSC timing based on gettimeofday again.
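The recalibration itself can be kept separate from the event handling. In this sketch the caller supplies two pairs of samples, a TSC reading and the matching gettimeofday() result converted to microseconds, taken a short interval apart. The struct and function names are made up for the example, and reading the TSC is not shown.

```c
#include <stdint.h>

/* Calibration state for converting TSC readings to wall-clock time. */
struct tsc_calib {
    uint64_t base_tsc;        /* TSC value at the last calibration */
    uint64_t base_usec;       /* wall-clock microseconds at that moment */
    double   ticks_per_usec;  /* measured TSC rate */
};

/*
 * Recompute the TSC rate from two (TSC, wall clock) samples.
 * Returns 0 on success, or -1 if the samples are unusable, in which
 * case the old calibration is left untouched.
 */
int tsc_recalibrate(struct tsc_calib *c,
                    uint64_t tsc0, uint64_t usec0,
                    uint64_t tsc1, uint64_t usec1)
{
    if (tsc1 <= tsc0 || usec1 <= usec0)
        return -1;
    c->ticks_per_usec = (double)(tsc1 - tsc0) / (double)(usec1 - usec0);
    c->base_tsc = tsc1;
    c->base_usec = usec1;
    return 0;
}

/* Convert a TSC reading taken after the calibration to microseconds. */
uint64_t tsc_to_usec(const struct tsc_calib *c, uint64_t tsc)
{
    double elapsed = (double)(tsc - c->base_tsc) / c->ticks_per_usec;

    return c->base_usec + (uint64_t)elapsed;
}
```

The power management thread would call `tsc_recalibrate` each time it sees an event, so that later `tsc_to_usec` conversions use the new rate and a fresh base point.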

CPUFreq can be dealt with by polling /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq and watching for any change there. Also, on recent kernels, when cpufreq is in effect, /proc/cpuinfo will reflect the current CPU speed, so it can also be used to detect CPU speed changes.
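A polling loop needs little more than the following sketch. The cpufreq files in sysfs hold a single decimal number in kHz; the helper names here are invented, and how often to poll is left open.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the contents of a cpufreq sysfs file; returns kHz or -1. */
long cpufreq_parse_khz(const char *text)
{
    char *end;
    long khz = strtol(text, &end, 10);

    if (end == text || khz < 0)
        return -1;
    return khz;
}

/* Read the current frequency in kHz from the given file, or -1. */
long cpufreq_read_khz(const char *path)
{
    char buf[64];
    long khz = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) != NULL)
        khz = cpufreq_parse_khz(buf);
    fclose(f);
    return khz;
}

/* True if a valid new reading differs from the previous one. */
int cpufreq_changed(long previous_khz, long current_khz)
{
    return current_khz >= 0 && current_khz != previous_khz;
}
```

Each time `cpufreq_changed` reports a difference, the program would redo its TSC calibration against gettimeofday and remember the new frequency.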

TSC is handy for timing in user programs because it is so cheap, but for correct program behavior it can be a hazard. As long as any event that could cause a CPU frequency change is accounted for, it can still be useful on modern systems.

HOWTO: A simple driver for the Linux CPUFreq framework

Tuesday, September 12th, 2006

Background

So I've got this oldish Toshiba laptop with a Pentium-120 processor. The laptop gets quite hot during normal operation, and the fan can be noisy. Fortunately, there is a way to change the CPU speed on the fly, as well as toggle the L2 cache and system fan on and off. This is accomplished through a userspace program called 'toshset', which uses the 'toshiba' driver through /dev/toshiba to enter the SMM mode of the laptop and execute BIOS code (known as the SCI and HCI) to fiddle with the appropriate setting. (The Toshiba SCI/HCI details were reverse engineered by Jonathan Buzzard from the Windows driver.)

What is SMM? (off-topic yet interesting)

SMM is a special mode of the processor triggered by an external interrupt on the SMI# pin. In the Toshiba laptop hardware, this pin is asserted by a read from I/O port $B2. Inside SMM, the CPU has access to a reserved area called SMRAM, usually of 32KB size and located at $38000-$3FFFF or $A8000-$AFFFF. At boot time, the BIOS loads SMBASE+$8000 with the program (SMI handler) that is to be run when SMM is entered, and then cloaks SMRAM from system software using the system core logic. When SMM is entered, the CPU state (excluding FPU and test registers) is written out from $3FFFF down to $3FE00, and restored from that area when SMM is exited. It is only possible to recover the SMI handler code by disassembling the BIOS or by snooping the CPU's address and data bus after SMIACT# is asserted in response to SMI#, so under normal circumstances the SMM code can only be “black-box” analyzed by reverse engineering the system software that uses whatever API the SMM program provides in a particular machine. However, some system chipsets allow SMRAM to be made visible again after it has been opened for the handler installation and cloaked. On these systems, a program can recover or modify the SMI handler code.

The code stored in SMRAM is quite literally the most arcane and buried embedded software within a PC. It can also be the source of much hidden latency, since a SMI interrupt causes the processor to write its entire internal state to SMRAM, flush its cache as the first action of the SMI handler (in write-back mode, this can take thousands of cycles), execute the SMI handler, and restore its internal state and exit SMM once a RSM instruction is encountered in the SMI handler. Unfortunately, there is no way to know how long the processor will be in SMM mode once entered because the length of the code it executes cannot be determined externally. It can also be quite difficult to discover all of the sources of SMI interrupts, since a SMI interrupt could come from any chipset component, and because it could potentially be invoked by software or by purely external hardware events depending on the specific design of the system. Also, when SMM is entered, the CPU pipeline is flushed; only pending I/O and HALT instructions can be restarted.

APM event

Since this laptop propagates power status events through the APM BIOS, my first “optimization” was to put a script in /etc/apm/event.d. When the power cord is plugged in, it sets the machine to maximum performance settings, and when the power cord is removed, it sets the machine to minimal performance settings.

#!/bin/sh

# Place this script in /etc/apm/event.d to automatically manage CPU
# power consumption in response to APM events.
# Debian: requires toshset and powermgmt-base packages

set -e

TOSHSET=/usr/bin/toshset
ON_AC_POWER=/usr/bin/on_ac_power

[ -x "${TOSHSET}" ] || exit 0
[ -x "${ON_AC_POWER}" ] || exit 0

cpu_fast() {
    logger "CPU going to performance settings"
    ${TOSHSET} -bs user
    ${TOSHSET} -cpu fast
    ${TOSHSET} -cpucache on
    ${TOSHSET} -lcd bright
    ${TOSHSET} -fan on
    ${TOSHSET} -d 30
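    if [ -f /var/run/xbattbar.pid ]; then
        # The battery bar may already be gone; don't let 'set -e' abort on that.
        kill -TERM "$(cat /var/run/xbattbar.pid)" 2>/dev/null || true
        rm -f /var/run/xbattbar.pid
    fi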
    exit 0
}

cpu_slow() {
    logger "CPU going to power-saving settings"
    ${TOSHSET} -bs user
    ${TOSHSET} -cpu slow
    ${TOSHSET} -cpucache off
    ${TOSHSET} -lcd semi
    ${TOSHSET} -fan off
    ${TOSHSET} -d 3
    # Displaying the battery bar requires 'local:' to be entered into
    # /etc/X0.hosts - a security risk on a multiuser system but probably okay
    # for a portable workstation.
    if [ -x /usr/X11R6/bin/xbattbar ]; then
        DISPLAY=:0 /usr/X11R6/bin/xbattbar >/dev/null 2>&1 &
        echo $! > /var/run/xbattbar.pid;
    fi
    exit 0
}

# Query the power source once and act on the result.
if ${ON_AC_POWER}; then
    cpu_fast
else
    cpu_slow
fi

apmiser

Then, I wanted a solution to manage the CPU speed based on load, because the system gets quite hot while sitting on {couch|lap}, and is idle or nearly idle most of the time anyway.

IBM Thinkpad laptops have a similar configuration interface called SMAPI BIOS (not to be confused with SMBIOS, the System Management BIOS extensions) which also uses SMM to configure the laptop. The program 'tpctl' and the /dev/thinkpad driver were written in similar fashion to the Toshiba driver.

There exists a daemon program called 'apmiser' for use with the IBM Thinkpad, which, despite its name, has nothing to do with APM. apmiser uses the information in /proc/stat to calculate the current CPU usage and changes the CPU speed on the fly, based on the system load, using the 'tpctl' program. It is trivial to modify apmiser to call toshset instead of tpctl. Unfortunately, apmiser is written in Perl and so has a large memory footprint. (Translating it to C should improve the memory usage.)
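The CPU usage calculation is the part worth translating first. This is my own sketch of the general technique, not apmiser's code: take two samples of the aggregate "cpu" line from /proc/stat and compare the busy time to the total time between them. Only the first four counters (user, nice, system, idle) are used, and the names are invented.

```c
#include <stdio.h>
#include <string.h>

/* Busy and total jiffies taken from one reading of /proc/stat. */
struct cpu_sample {
    unsigned long long busy;
    unsigned long long total;
};

/*
 * Parse the aggregate "cpu ..." line (not the per-CPU "cpu0" lines).
 * Returns 0 on success, -1 if the line does not match.
 */
int cpu_sample_parse(const char *line, struct cpu_sample *s)
{
    unsigned long long user, nice, sys, idle;

    if (strncmp(line, "cpu ", 4) != 0)
        return -1;
    if (sscanf(line + 4, "%llu %llu %llu %llu",
               &user, &nice, &sys, &idle) != 4)
        return -1;
    s->busy = user + nice + sys;
    s->total = s->busy + idle;
    return 0;
}

/*
 * Fraction of time the CPU was busy between two samples (0.0 to 1.0),
 * or -1.0 if no time has elapsed.
 */
double cpu_usage(const struct cpu_sample *prev, const struct cpu_sample *cur)
{
    unsigned long long elapsed = cur->total - prev->total;

    if (elapsed == 0)
        return -1.0;
    return (double)(cur->busy - prev->busy) / (double)elapsed;
}
```

A daemon would sample every second or so and switch the CPU to fast when the fraction stays above some threshold, and back to slow when it stays below another.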

Cron/load average

I decided I wanted to try a non-daemon approach: a program that is run once per minute from cron and checks the 1 minute load average to decide whether the CPU should be sped up, and the 5 minute load average to decide whether it should be slowed down. It also manages the fan based on the 1 minute load.

This was a first try:

#!/bin/sh

LOAD=` uptime | sed 's/.*\(load.*\).*/\1/g' | sed 's/,//g' `

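AVGOK=` echo $LOAD | awk '{ print ($3 < [...]
# [...] the rest of the line above, up to the first command of the
# if-branch, is missing from the source
        [...] >/dev/null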
        /usr/bin/toshset -cpu fast
        /usr/bin/toshset -cpucache on;
else
        /usr/bin/fan -f >/dev/null;
        if [ $FIVEMINIDLE -eq 1 ]; then
                /usr/bin/toshset -cpu slow
                /usr/bin/toshset -cpucache off
        fi
fi

This version, when run from cron once per minute, turned out to have a “hiccup” every minute because of bash starting and the several calls to toshset.
This was a rewrite in C for speed:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
        float min1, min5;
        FILE *f;
        float user1, user5;

        if (argc < 3) user5 = 1.00;
        else if (sscanf(argv[2], "%f", &user5) != 1) {
                puts("Error in 5 minute argument");
                exit(EXIT_FAILURE);
        }
        if (argc < 2) user1 = 0.50;
        else if (sscanf(argv[1], "%f", &user1) != 1) {
                puts("Error in 1 minute argument");
                exit(EXIT_FAILURE);
        }

        f = fopen("/proc/loadavg", "r");
        if (f == NULL)
        {
                perror("fopen");
                exit(EXIT_FAILURE);
        }

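        if (fscanf(f, "%f %f", &min1, &min5) != 2)
        {
                /* fscanf does not necessarily set errno on a parse failure */
                fputs("Could not parse /proc/loadavg\n", stderr);
                fclose(f);
                exit(EXIT_FAILURE);
        }
        fclose(f);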

        if (! (min1 < user1))
        {
                system("/usr/bin/toshset -fan on -cpu fast -cpucache on > /dev/null");
        }
        else
        {
                system("/usr/bin/toshset -fan off >/dev/null;");
                if (min5 < user5) {
                        system("/usr/bin/toshset -cpu slow -cpucache off");
                }
        }
        exit(EXIT_SUCCESS);
}

However, I still wasn't happy with this because the response to a change in load could not happen for around a minute, and it was still quite a “heavy” program with the several toshset invocations.

cpufreq

TODO

/sys/devices/system/cpu/cpu0/cpufreq:
scaling_governor: performance, powersave, conservative, etc (cpufreq_powersave.ko, cpufreq_conservative.ko)
scaling_cur_freq:
conservative/*

toshiba_freq.ko depends on freq_table.ko

/sys/module/cpufreq/parameters/debug: set to 1 to enable debug output of cpufreq-core and freq-table, 2 for debug output of driver, 4 for governors. This only produces output if CONFIG_CPU_FREQ_DEBUG=y. It is a bit field, so if you want all 3 outputs for example, set it to 7.

Building pentium optimized Debian packages

Friday, September 8th, 2006

You might have noticed, if you are maintaining an older system, that it is not exactly obvious in Debian how one should go about building optimized versions of some frequently used packages (like gzip, libraries, bash, perl etc). There are various solutions like apt-build, pentium-builder, and apt-get source --compile, but none of these did what I wanted: to maintain a locally optimized package that is only replaced with a non-optimized version by the package manager when necessary (not on every upgrade).

1. Change debian/control
Change the Package: line from Package: foo to Package: foo-586
Add Provides: foo, Replaces: foo, and Conflicts: foo lines (see http://www.gatago.com/linux/debian/www/14642325.html for why this should be done)
If any other packages have versioned dependencies on the foo package, instead of Conflicts: foo and Provides: foo, you will need to add another Package stanza for foo to build a versioned dummy package. Then make that dummy package Depends: foo-586

2. Change debian/rules
Make sure that the CFLAGS passed to configure include -mpentium or whatever options you need to compile optimized for your specific machine
Also, if the package builder uses debian/foo to refer to the package build directory in the script, you need to replace all occurrences of this with debian/foo-586. Better yet, replace them with debian/$(PACKAGE), define PACKAGE=foo-586 at the top, and submit a bug to improve the rules file.

3. Rename debian control files
Anything like foo.dirs, foo.install, foo.lintian, or foo.preinst/postinst/prerm/postrm will have to be renamed to foo-586.dirs, etc., and any references in debian/rules will have to be updated to reflect that. (Again, $(PACKAGE) is handy.)

Then run dpkg-buildpackage -rfakeroot and install the package. dpkg should automatically figure out what is to be done, but you may have to use --force-depends if you know exactly why dpkg is confused and not installing your package (hint: make sure your Provides: is correct).

Running apt-get upgrade will not replace this package, but apt-get dist-upgrade may. You might want to add a line to the postrm script to notify you via email when the package has been replaced so that you know it is time to build a new one.