Re: Character Set Code Pages



On Friday, 24th October 1997, Ron Heiby wrote:

> I need to specify a sort of "minimal" set of characters to be
> represented by any 8-bit bytes. It's pretty clear to me that I want
> US-ASCII for the 0x00 through 0x7f values. 

If the eight-bit constraint didn't apply, I'd recommend the sixteen-bit
version of ISO 10646 (a.k.a. Unicode); but since it does, I'd go with
ISO 8859-1, which is a superset of ISO 646 (a.k.a. US-ASCII). 

> I was thinking that there is probably a set of characters in the 0x80
> through 0xff range that is very common to IBM-compatible personal
> computers. I'm looking for pointers to information on such a set of
> characters (code page), why I would choose one over the others, and what
> it/they is/are officially called. 

ISO 8859-1 is the first of a series of standards (currently numbering ten)
of eight-bit character sets published by ISO. The IBM code page equivalent
is 819, although it's not usually found in DOS (but there is a freeware
819 code page available - check out the http://www.kostis.de/ site, and
look for [I believe] isocp101.zip). Perhaps someone (Paul?) knows if OS/2 
uses the same sort of code page information files as DOS does.
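
To illustrate that 819 and ISO 8859-1 are two names for the same character
set, here is a minimal Python sketch. It assumes CPython's standard codec
registry, which lists "cp819" as an alias for ISO 8859-1; it says nothing
about DOS or OS/2 code page files.

```python
import codecs

# Assumption: CPython's codec registry treats "cp819" as an alias of
# ISO 8859-1, so both lookups should resolve to the same codec.
ibm = codecs.lookup("cp819")
iso = codecs.lookup("iso-8859-1")
same_codec = ibm.name == iso.name

# Byte 0xE9 (e with acute accent) should decode identically under both names.
sample = bytes([0xE9])
same_text = sample.decode("cp819") == sample.decode("iso-8859-1")

print(same_codec, same_text)
```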

The North American/western European Windows character set (code page 1252)
is a superset of ISO 8859-1: Windows assigns extra printable characters to
0x80 through 0x9F, and if you remove those from code page 1252, you'll
have ISO 8859-1. Most UNIX boxes support ISO 8859-1. The Amiga and the
Acorn use ISO 8859-1. The first 256 characters of Unicode correspond to
those of ISO 8859-1 (Unicode 0x0020 is the 16-bit analogue of ISO 8859-1
0x20, and so on through Unicode 0x00FF for ISO 8859-1 0xFF).
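
The two relationships above can be checked with a short Python sketch. It
relies on the "latin-1" and "cp1252" codecs as shipped with Python, which
may not match every vendor's tables exactly; the variable names are only
illustrative.

```python
# Sketch: ISO 8859-1 maps byte N to Unicode code point N, and code page
# 1252 (as implemented by Python's codec) differs only in 0x80-0x9F.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
identity = all(ord(ch) == n for n, ch in enumerate(latin1_text))

differing = []
for n in range(256):
    one = bytes([n])
    # Python's cp1252 codec leaves a few positions undefined, so
    # substitute U+FFFD there instead of raising an error.
    if one.decode("cp1252", errors="replace") != one.decode("latin-1"):
        differing.append(n)

print(identity)
print(hex(differing[0]), hex(differing[-1]), len(differing))
```

Any byte outside the listed range can be treated the same way under either
character set, which is why text without those Windows extras moves
between the two unchanged.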

Note that if you need characters from other regions, one of the other ISO
8859 standards might be a better fit: for eastern European text, ISO
8859-2; for Cyrillic text, ISO 8859-5; for Arabic text, ISO 8859-6; for
Greek text, ISO 8859-7; for Hebrew text, ISO 8859-8. All of the ISO 8859
character sets are supersets of ISO 646, so US-ASCII is always available.
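
As a rough illustration of that last point, the Python sketch below decodes
the same bytes under each of the parts mentioned. It assumes Python's
bundled "iso-8859-N" codecs; the dictionary of region labels is just for
display.

```python
# Sketch: every ISO 8859 part listed here keeps 0x00-0x7F as US-ASCII,
# while the same high byte means a different character in each part.
ascii_bytes = bytes(range(0x80))
ascii_text = ascii_bytes.decode("ascii")

parts = {
    "iso-8859-1": "western European",
    "iso-8859-2": "eastern European",
    "iso-8859-5": "Cyrillic",
    "iso-8859-6": "Arabic",
    "iso-8859-7": "Greek",
    "iso-8859-8": "Hebrew",
}

# True for a part if its lower half decodes exactly like US-ASCII.
ascii_preserved = {
    name: ascii_bytes.decode(name) == ascii_text for name in parts
}

# Decode one byte from the upper half under each part.
high_byte = {name: bytes([0xE1]).decode(name) for name in parts}

for name, region in parts.items():
    code_point = hex(ord(high_byte[name]))
    print(name, region, ascii_preserved[name], code_point)
```

The lower half is safe to interpret without knowing which part was used;
only bytes from 0xA0 upward need the character set to be labelled.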

Chris.
-- 
Christian CAREY <ccarey@capaccess.org>