Charset

Latest revision as of 08:38, 1 September 2024

Unicode isn't hard if you know the history and where it comes from

Teleprinter / Telegraph

Baudot Encoding 5-bit, ITA-1

ASCII 7-bit C0 set of Control Characters / G0 set of Graphic Characters

International Encoding

ITU-T T.50 (IA5, the ASN.1 IA5String type)

7-bit ASCII

8 bit, lower half 00000000-01111111 (0x00-0x7F) is compatible with ASCII

IBM / Windows code pages: Windows-1252 (assigns printable characters, such as the euro sign at 0x80, to the C1 range 0x80-0x9F)

ISO-8859-1 Latin-1 (leaves C1 empty, upper half contains regionally significant characters)

ISO-8859-2 Latin-2 upper half is mostly for Central and Eastern European languages (Slavic languages written in Latin script, among others)
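To illustrate the 8-bit charsets listed above, here is a minimal Python sketch that decodes the same byte values under three charsets. The helper name `decode_byte` is made up for this example. Note that Python's `latin-1` codec follows the IANA mapping, which places the C1 control codes in 0x80-0x9F.

```python
# The same byte value can mean different characters in different 8-bit charsets.
SAMPLES = [0x41, 0x80, 0xA3]
CHARSETS = ["cp1252", "latin-1", "iso8859_2"]


def decode_byte(value: int, charset: str) -> str:
    """Decode a single byte under the given 8-bit charset (hypothetical helper)."""
    return bytes([value]).decode(charset)


for value in SAMPLES:
    decoded = {cs: decode_byte(value, cs) for cs in CHARSETS}
    # repr() keeps invisible control characters readable in the output.
    print(f"0x{value:02X}", {cs: repr(ch) for cs, ch in decoded.items()})
```

The byte 0x41 is "A" everywhere because the lower half is ASCII. The byte 0x80 is the euro sign only in Windows-1252, and 0xA3 is the pound sign in Latin-1 but "Ł" in Latin-2.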

Code points beyond the 8-bit range are covered by Unicode: 2-byte UCS-2 or 4-byte UCS-4

EURO SIGN: U+20AC
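As a small worked example, the euro sign can be serialized in the fixed-width forms and in UTF-8. This sketch uses Python's `utf-16-be` codec as a stand-in for UCS-2, which is an assumption that only holds for characters in the Basic Multilingual Plane (U+0000 to U+FFFF).

```python
euro = "\u20ac"  # EURO SIGN

# For BMP characters, UTF-16 and UCS-2 produce the same two bytes.
ucs2 = euro.encode("utf-16-be")
# UTF-32 is the modern name for the 4-byte UCS-4 form.
ucs4 = euro.encode("utf-32-be")
utf8 = euro.encode("utf-8")

print("UCS-2:", ucs2.hex())  # 20ac
print("UCS-4:", ucs4.hex())  # 000020ac
print("UTF-8:", utf8.hex())  # e282ac
```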

UTF-8 uses 8-bit code units; one character takes 1 to 4 bytes

0bbbbbbb => us-ascii

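110bbbbb, 1110bbbb, 11110bbb => lead byte of a 2-, 3- or 4-byte character

10bbbbbb => continuation byte, belongs to the character started by the preceding lead byte (a character is at most 4 bytes since RFC 3629; the original design allowed up to 6)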
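The UTF-8 bit layout can be sketched as a hand-written encoder. This is an illustration, not a replacement for the built-in codec. It follows the RFC 3629 form, where lead bytes 110bbbbb, 1110bbbb and 11110bbb announce 2, 3 and 4 bytes, and every following byte has the form 10bbbbbb. The function name `utf8_encode` is made up for this example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 0bbbbbbb: plain US-ASCII, one byte.
        return bytes([cp])
    if cp < 0x800:
        # 110bbbbb 10bbbbbb
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110bbbb 10bbbbbb 10bbbbbb
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    # Print each byte in binary to make the prefix bits visible.
    print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```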

Byte order mark (BOM): U+FEFF, serialized in UTF-16 as FE FF (big-endian) or FF FE (little-endian)
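A minimal sketch of BOM sniffing, using the BOM constants from Python's standard `codecs` module. The function name `sniff_bom` is hypothetical. The UTF-32 marks are checked first because the UTF-32 little-endian BOM starts with the same two bytes (FF FE) as the UTF-16 little-endian one.

```python
import codecs
from typing import Optional

# Order matters: longer BOMs first, so UTF-32-LE is not mistaken for UTF-16-LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("\ufeffhi".encode("utf-16-le")))
print(sniff_bom("\ufeffhi".encode("utf-16-be")))
print(sniff_bom(b"hi"))
```

A missing BOM says nothing about the encoding, so the `None` case still needs a default or out-of-band information such as an HTTP header.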

HTML Escaping
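HTML escaping covers two separate concerns: markup characters such as `<` and `&`, and characters that the output charset cannot represent. The sketch below uses Python's standard `html` module and the `xmlcharrefreplace` error handler. The function name `to_ascii_html` is made up for this example.

```python
import html


def to_ascii_html(text: str) -> str:
    """Escape markup characters, then turn non-ASCII characters into numeric references."""
    escaped = html.escape(text)  # & < > " ' become entities
    # Characters outside ASCII become decimal references such as &#8364;
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


source = '<b>5 \u20ac & "more"</b>'
safe = to_ascii_html(source)
print(safe)
# html.unescape reverses both named entities and numeric references.
print(html.unescape(safe) == source)
```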

Java internal: UTF-16 (UCS-2 before Java 5; characters above U+FFFF are stored as surrogate pairs)
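Since a Java `char` is a 16-bit code unit, a character above U+FFFF occupies two of them. The arithmetic can be sketched in Python as follows. The helper names `to_surrogates` and `utf16_length` are made up for this example, and `utf16_length` is only meant to mirror what `String.length()` counts in Java.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def utf16_length(text: str) -> int:
    """Number of 16-bit code units needed for the text."""
    return len(text.encode("utf-16-le")) // 2


high, low = to_surrogates(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
print(utf16_length("\u20ac"), utf16_length("\U0001F600"))
```

The euro sign fits in one code unit, while U+1F600 needs two, which is why code-unit counts and character counts can differ.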