The inputenc package

Have you ever wondered how the inputenc package works? If so, you should read JLDiaz’ wonderful answer:

TeX does not know about Unicode. For TeX, a character is simply one byte in the input, but a Unicode character encoded as UTF-8 may take several bytes. \usepackage[utf8]{inputenc} is an elaborate hack to fool TeX into accepting those multibyte characters.

When you write a file using UTF-8 encoding, each character in the file is coded in 1, 2 or 3 bytes (or 4 bytes for characters outside the Basic Multilingual Plane, such as rare scripts and emoji). If the character is in the ASCII standard, it takes only 1 byte, and everything is OK for TeX. Those characters have a binary code of the form 0xxxxxxx, i.e. the first bit is zero (because the ASCII standard comprises codes only up to 127).
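To see these byte counts for yourself, here is a small Python sketch. The sample characters and the helper names (utf8_length, is_ascii_byte) are my own choices for illustration and have nothing to do with inputenc itself.

```python
# How many bytes does a character occupy once it is encoded as UTF-8?
samples = {
    "a": "a",                  # plain ASCII letter
    "e-acute": "\u00e9",       # é
    "delta": "\u1e9f",         # ẟ
    "emoji": "\U0001f600",     # a character outside the Basic Multilingual Plane
}


def utf8_length(ch):
    """Return the number of bytes in the UTF-8 encoding of ch."""
    return len(ch.encode("utf-8"))


def is_ascii_byte(b):
    """A byte of the form 0xxxxxxx has its highest bit cleared."""
    return (b & 0x80) == 0


lengths = {name: utf8_length(ch) for name, ch in samples.items()}
print(lengths)  # {'a': 1, 'e-acute': 2, 'delta': 3, 'emoji': 4}
```

Only the first sample consists of a byte of the form 0xxxxxxx, so it is the only one that classic TeX reads as a single character without help.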

The Unicode character ẟ that you used in your input is coded in UTF-8 as three bytes, with binary values 11100001, 10111010 and 10011111. Note that all of them begin with a bit of value 1, which is the “mark” UTF-8 uses to denote that they are not ASCII but part of a multibyte character.
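The three bytes are easy to check in Python. This is only a quick verification sketch; the variable names are arbitrary.

```python
delta = "\u1e9f"  # ẟ, the character discussed above
encoded = delta.encode("utf-8")

# Each byte as an 8-digit binary string and as two hexadecimal digits.
bits = [format(b, "08b") for b in encoded]
hexes = [format(b, "02X") for b in encoded]

print(bits)   # ['11100001', '10111010', '10011111']
print(hexes)  # ['E1', 'BA', '9F']

# Every byte of a multibyte character has its highest bit set.
all_high = all(b & 0x80 for b in encoded)
```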

However, for TeX those three bytes are simply three characters, with codes "E1, "BA and "9F respectively (" is the hexadecimal prefix in TeX). What inputenc basically does is make the character with code "E1 an active character, and define the command associated with it in such a way that, if it is followed by the characters "BA and "9F, the TeX command \delta is issued.
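The idea can be mimicked in a few lines of Python. This is a toy model under my own assumptions, not the way inputenc is actually written (the real thing consists of TeX macros): the table SEQUENCES and the functions expected_length and expand are invented for this illustration, and the table contains only the one mapping mentioned in the answer.

```python
# Toy model: the input is a stream of bytes. A lead byte of a multibyte
# sequence plays the role of an "active character": it grabs the bytes
# that follow it and looks up what the whole sequence should become.

# Hypothetical table: byte sequence -> TeX command to issue.
SEQUENCES = {
    (0xE1, 0xBA, 0x9F): r"\delta",
}


def expected_length(lead):
    """Length of the UTF-8 sequence announced by a lead byte."""
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # ASCII (or a stray byte, which this toy does not diagnose)


def expand(data):
    """Turn a byte string into a list of characters and commands."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 1:
            out.append(chr(data[i]))  # ordinary one-byte character
        else:
            key = tuple(data[i:i + n])  # lead byte plus its "arguments"
            out.append(SEQUENCES.get(key, "<undefined>"))
        i += n
    return out


tokens = expand("x = \u1e9f".encode("utf-8"))
print(tokens)  # ['x', ' ', '=', ' ', '\\delta']
```

The point of the model is that the byte "E1 does all the work, while "BA and "9F are merely consumed as its arguments.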

I guess you can now understand why you can’t alter the catcode of a Unicode character: for TeX there is no such single character, only a sequence of bytes, each with its own catcode.

XeTeX and LuaTeX, on the other hand, are TeX engines capable of accepting “characters” 32 bits wide, and the input phase “translates” UTF-8 to the appropriate Unicode code point, which is what TeX’s “eyes” see.
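What that “translation” amounts to for a three-byte sequence can be sketched in Python. The function name decode_three_bytes is made up for this example, and the sketch only illustrates the UTF-8 bit layout; it says nothing about how XeTeX or LuaTeX implement their input phase.

```python
def decode_three_bytes(b0, b1, b2):
    """Combine 1110xxxx 10yyyyyy 10zzzzzz into the code point xxxxyyyyyyzzzzzz."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_bytes(0xE1, 0xBA, 0x9F)
print(f"U+{code_point:04X}")  # U+1E9F

# Python's own decoder agrees with the manual bit arithmetic.
same = chr(code_point) == bytes([0xE1, 0xBA, 0x9F]).decode("utf-8")
```

After this step there is a single number per character, so a single catcode can be attached to it.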

(One reason for me posting this here is to remember the answer without having to search for it, which can be a tedious task even if you know exactly what you’re looking for.)
