Linux FAQ's & Manuals


unicode encodings

unicode is first of all just a large table which assigns a unique integer number (a ``code point'') to each character. the term unicode by itself does not yet specify how these integer numbers are stored in the computer; that is defined separately by an encoding.
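
the distinction between the table and the storage format can be illustrated with a small python sketch. the euro sign is only an example character picked for this illustration, and the three encodings shown are just a sample: the integer number stays the same, while the bytes differ from encoding to encoding.

```python
# one character, one unicode number, several possible byte representations
ch = "\u20ac"  # EURO SIGN, chosen only as an example

# the entry in the unicode table: a plain integer
code_point = ord(ch)

# three different ways of storing that same integer as bytes
# (the "-be" variants are big-endian and write no byte order mark)
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")
as_utf32 = ch.encode("utf-32-be")

print(f"code point: U+{code_point:04X}")
print("utf-8 :", as_utf8.hex(" "))
print("utf-16:", as_utf16.hex(" "))
print("utf-32:", as_utf32.hex(" "))
```

decoding any of the three byte sequences with the matching encoding gives back the same character.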

markus kuhn explains nicely what utf-8 is and what other encodings of unicode exist in his ``utf-8 and unicode faq for unix/linux'' pages:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

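the only unicode encoding a typical user will encounter on a linux system is utf-8. because it is ascii compatible (ascii text is already valid utf-8) and file system safe (the bytes for nul and '/' never occur inside a multi-byte character), it is often used for file contents and file names.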
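
both properties can be checked with a short python sketch. this is only an illustration: the helper name ``all_bytes_high'' is made up for this example, and the check relies on the fact that every byte of a multi-byte utf-8 sequence has its high bit set.

```python
# ascii compatibility: pure ascii text has identical bytes in ascii and utf-8
text = "hello/world.txt"
ascii_compatible = text.encode("ascii") == text.encode("utf-8")


def all_bytes_high(cp):
    """Hypothetical helper: True if every utf-8 byte of code point cp is >= 0x80."""
    return all(b >= 0x80 for b in chr(cp).encode("utf-8"))


# file system safety: no non-ascii character produces a byte below 0x80,
# so the bytes 0x00 (nul) and 0x2f ('/') can never appear by accident.
# the surrogate range U+D800..U+DFFF is skipped because it cannot be encoded.
non_ascii = list(range(0x80, 0xD800)) + list(range(0xE000, 0x110000))
file_system_safe = all(all_bytes_high(cp) for cp in non_ascii)

print("ascii compatible:", ascii_compatible)
print("file system safe:", file_system_safe)
```

an encoding such as utf-16 has neither property: the letter ``a'' becomes the two bytes 00 61 in big-endian utf-16, which contains a nul byte.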

other unicode encodings are mostly used only as internal encodings. for example, glibc uses ucs-4 for its internal wide character representation, and java uses utf-16. some applications also use utf-8 as their internal encoding, for example the text editor ``vim''. users who are not also programmers usually don't need to bother with these encodings.
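
for programmers, the practical difference between these internal encodings is the number of code units per character. the following python sketch only looks at the byte layout and does not touch glibc or java itself; it assumes that python's ``utf-32-be'' codec produces the same 4-byte values as ucs-4, and the helper ``code_units'' and the sample characters are made up for this example.

```python
def code_units(text, encoding, unit_size):
    """Hypothetical helper: number of fixed-size code units used for text."""
    return len(text.encode(encoding)) // unit_size


# an ascii letter, the euro sign, and a character beyond U+FFFF
samples = ["a", "\u20ac", "\U0001F600"]

for ch in samples:
    ucs4_units = code_units(ch, "utf-32-be", 4)   # 32-bit units, as in ucs-4
    utf16_units = code_units(ch, "utf-16-be", 2)  # 16-bit units, as in utf-16
    print(f"U+{ord(ch):04X}: ucs-4 units = {ucs4_units}, utf-16 units = {utf16_units}")
```

ucs-4 always needs exactly one unit per character, while utf-16 needs two units (a surrogate pair) for characters beyond U+FFFF, so code that counts or indexes utf-16 units has to take these pairs into account.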

2005-03-09