Charset

From Braindump
Latest revision as of 08:38, 1 September 2024

Unicode isn't hard if you know the history and where it comes from

https://mcilloni.ovh/2023/07/23/unicode-is-hard/

Teleprinter / Telegraph

Baudot code: 5-bit, standardized as ITA-1 (International Telegraph Alphabet No. 1)

https://cryptii.com/pipes/baudot

Baudot.png

ASCII is 7-bit: the C0 set of control characters (0x00-0x1F) and the G0 set of graphic characters (0x21-0x7E); 0x20 is SPACE and 0x7F is DEL
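The C0/G0 split can be illustrated with a small sketch. The function name `classify_ascii` and the label strings are made up for this example; SPACE and DEL are reported separately because they sit at the edges of the graphic range.

```python
def classify_ascii(byte: int) -> str:
    """Classify a 7-bit code along the C0 / G0 split (labels are illustrative)."""
    if not 0 <= byte <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    if byte <= 0x1F:
        return "C0 control"
    if byte == 0x20:
        return "SPACE"
    if byte == 0x7F:
        return "DEL"
    return "G0 graphic"


# Print the class of a few sample characters.
for ch in ("\n", " ", "A", "~", "\x7f"):
    print(f"0x{ord(ch):02X} {classify_ascii(ord(ch))}")
```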

International Encoding

ITU-T T.50 IA5 (International Alphabet No. 5); the ASN.1 type IA5String is named after it

https://www.itu.int/rec/T-REC-T.50

7-bit ASCII

Codepages

8-bit; the lower half 00000000-01111111 (0x00-0x7F) is compatible with ASCII

IBM / Windows code pages: Windows-1252 (assigns printable characters to most of the C1 range 0x80-0x9F)

Latin-1.png

ISO-8859-1 Latin-1 (leaves the C1 range 0x80-0x9F without graphic characters; the upper half 0xA0-0xFF contains regionally significant characters)

ISO-8859-2 Latin-2: the upper half is mostly for Slavic languages

https://www.charset.org/charsets/iso-8859-1
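To see how the same bytes mean different things under different code pages, here is a minimal sketch using Python's built-in codecs. The helper name `code_points` and the sample bytes are chosen for illustration only.

```python
def code_points(raw: bytes, codec: str) -> list[str]:
    """Decode raw bytes with the given codec and list the resulting code points."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(codec)]


# 0x41 is in the ASCII half; 0x80 and 0xA3 are in the upper half.
sample = bytes([0x41, 0x80, 0xA3])

for codec in ("cp1252", "latin-1", "iso-8859-2"):
    print(f"{codec:10} {code_points(sample, codec)}")
```

The ASCII byte decodes identically everywhere. Byte 0x80 becomes the euro sign only in Windows-1252 (Python's Latin codecs map it to the C1 control U+0080), and 0xA3 is the pound sign in Latin-1 but U+0141 (L with stroke) in Latin-2.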

Unicode

Unicode assigns code points far beyond the 8-bit range; fixed-width encodings are 2-byte UCS-2 (covers only the BMP) or 4-byte UCS-4

Unicode BMP.png

EURO SIGN: U+20AC

https://www.fileformat.info/info/unicode/char/20ac/index.htm
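As a quick illustration, the one code point U+20AC turns into different byte sequences depending on the encoding. The helper `hex_bytes` is a name invented for this sketch; the codec names are Python's.

```python
EURO = "\u20ac"  # EURO SIGN


def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes as space-separated upper-case hex."""
    return text.encode(encoding).hex(" ").upper()


for encoding in ("utf-8", "utf-16-be", "utf-32-be", "cp1252"):
    print(f"{encoding:10} {hex_bytes(EURO, encoding)}")

# Latin-1 has no euro sign, so encoding fails there.
try:
    EURO.encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin-1    cannot encode:", exc.reason)
```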

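UTF-8 uses 8-bit code units; a character takes one or more bytes

0bbbbbbb => single byte, identical to US-ASCII

110bbbbb / 1110bbbb / 11110bbb => lead byte of a 2-, 3- or 4-byte character

10bbbbbb => continuation byte, belongs to the character started by the preceding lead byte (the original design allowed up to 6 bytes per character; RFC 3629 limits UTF-8 to 4)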
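The bit patterns of UTF-8 as defined in RFC 3629 (lead bytes 0bbbbbbb, 110bbbbb, 1110bbbb, 11110bbb, each followed by 10bbbbbb continuation bytes) can be spelled out in a hand-written encoder. This is only a sketch for illustration; the name `utf8_encode` is made up, and real code should simply call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point to UTF-8 by hand."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp < 0x80:  # 0bbbbbbb
        return bytes([cp])
    if cp < 0x800:  # 110bbbbb 10bbbbbb
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 1110bbbb 10bbbbbb 10bbbbbb
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Show the bit pattern of each byte for code points of increasing size.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X}", " ".join(f"{b:08b}" for b in encoded))
```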

Byte order: the BOM is U+FEFF, serialized as FE FF in big-endian and FF FE in little-endian UTF-16 (EF BB BF in UTF-8)
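A byte order mark can be sniffed from the first bytes of a stream. The sketch below uses the BOM constants from Python's `codecs` module; the function name `detect_bom` is hypothetical, and a real detector would need a fallback for data without a BOM.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# starts with the UTF-16-LE BOM (FF FE).
_BOMS = (
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-8", codecs.BOM_UTF8),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
)


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for name, bom in _BOMS:
        if data.startswith(bom):
            return name
    return None


# The same euro sign, prefixed with different BOMs.
print(detect_bom(codecs.BOM_UTF16_LE + "\u20ac".encode("utf-16-le")))
print(detect_bom(codecs.BOM_UTF16_BE + "\u20ac".encode("utf-16-be")))
print(detect_bom(b"plain ascii"))
```

Note the ambiguity this ordering accepts: UTF-16-LE text that begins with a BOM followed by U+0000 looks the same as a UTF-32-LE BOM.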

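HTML escaping: characters can be written as character references, e.g. &euro;, &#8364; or &#x20AC; for the euro sign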
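HTML escaping sidesteps the charset question by writing characters as references in plain ASCII. A small sketch with Python's standard `html` module and the `xmlcharrefreplace` error handler; the sample strings are invented for illustration.

```python
import html

# Escape markup-significant characters; non-ASCII text is left alone.
escaped = html.escape('<a href="x">\u20ac & co</a>')
print(escaped)

# Encode to pure ASCII, replacing everything else with numeric references.
ascii_only = "price: 5 \u20ac".encode("ascii", "xmlcharrefreplace")
print(ascii_only)

# Named, decimal and hexadecimal references all resolve to the same character.
print(html.unescape("&euro; &#8364; &#x20AC;"))
```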

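Java strings are internally UTF-16 (originally UCS-2); characters outside the BMP need a surrogate pair of two 16-bit code units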
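The difference between UCS-2 and UTF-16 shows up for code points above U+FFFF, which UTF-16 represents as a surrogate pair. The sketch below is in Python rather than Java and only illustrates the arithmetic; the function names are made up for this example.

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (above U+FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def utf16_code_units(text: str) -> list[int]:
    """List the 16-bit code units of the UTF-16 encoding of text."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A BMP character is one code unit; U+1F600 needs two.
print([f"{u:04X}" for u in utf16_code_units("\u20ac")])
print([f"{u:04X}" for u in utf16_code_units("\U0001F600")])
print([f"{u:04X}" for u in surrogate_pair(0x1F600)])
```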