Have you ever wondered how the inputenc package works? If so, you should read JLDiaz’s wonderful answer on TeX.sx:
TeX does not know about Unicode. For TeX, a character is simply 1 byte of input, but a Unicode character can take several bytes. [cce inline=”true” lang=”latex”]\usepackage[utf8]{inputenc}[/cce] is an elaborate hack to fool TeX into accepting those multibyte characters.
When you write a file using the [cce inline=”true” lang=”latex”]utf8[/cce] encoding, each character in the file is coded into 1, 2 or 3 bytes (or 4 bytes for characters outside the Basic Multilingual Plane). If the character is in the ASCII standard, it takes only 1 byte, and everything is OK for TeX. Those characters have a binary code of the form [cce inline=”true” lang=”latex”]0xxxxxxx[/cce], i.e. the first bit is zero (because the ASCII standard comprises codes only up to 127).
The Unicode character ẟ that you used in your input is coded in UTF-8 as three bytes, with binary values [cce inline=”true” lang=”latex”]11100001[/cce], [cce inline=”true” lang=”latex”]10111010[/cce] and [cce inline=”true” lang=”latex”]10011111[/cce]. Note that all of them begin with a bit of value [cce inline=”true” lang=”latex”]1[/cce], which is the “mark” UTF-8 uses to denote that they are not ASCII but part of a multibyte character. For TeX, however, those three bytes are simply three characters, with codes [cce inline=”true” lang=”latex”]"E1[/cce], [cce inline=”true” lang=”latex”]"BA[/cce] and [cce inline=”true” lang=”latex”]"9F[/cce] respectively ([cce inline=”true” lang=”latex”]"[/cce] is the hexadecimal prefix in TeX).
What [cce inline=”true” lang=”latex”]inputenc[/cce] basically does is make the character with code [cce inline=”true” lang=”latex”]"E1[/cce] an active character and define the command associated with it in such a way that, if it is followed by the characters [cce inline=”true” lang=”latex”]"BA[/cce] and [cce inline=”true” lang=”latex”]"9F[/cce], the TeX command [cce inline=”true” lang=”latex”]\delta[/cce] is issued. I guess you can now understand why you can’t alter the catcodes of Unicode characters: to TeX there is no single character ẟ whose catcode could be changed, only three separate bytes.
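The mechanism can be tried out in a few lines of plain TeX. The following is only a minimal sketch of the idea, not the actual inputenc source: it assumes an 8-bit engine (for example pdftex), a file saved as UTF-8, and a made-up naming scheme [cce inline=”true” lang=”latex”]u8:<bytes>[/cce] for the lookup table.
[cce lang=”latex”]
% Plain TeX, 8-bit engine (e.g. pdftex); save this file as UTF-8.
% Illustration only: the "u8:" table names are an assumption of this sketch.
\catcode"E1=\active   % turn the lead byte "E1 into an active character
% The active byte grabs the next two bytes as #1 and #2 and looks up
% a control sequence whose name is built from all three bytes.
\def^^e1#1#2{\csname u8:\string^^e1\string#1\string#2\endcsname}
% Table entry for the byte sequence "E1 "BA "9F, i.e. the character ẟ:
\expandafter\def\csname u8:\string^^e1\string^^ba\string^^9f\endcsname{$\delta$}
A small delta: ẟ.
\bye
[/cce]
Any other three-byte sequence starting with [cce inline=”true” lang=”latex”]"E1[/cce] has no table entry in this sketch and silently produces nothing, whereas a real implementation would need a proper error message at that point.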
XeTeX and LuaTeX, on the other hand, are TeX engines that accept “characters” up to 32 bits wide; their input phase “translates” UTF-8 into the appropriate Unicode code point, which is what TeX’s “eyes” see.
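For comparison, here is a minimal plain TeX sketch for a Unicode engine. It assumes the file is saved as UTF-8 and compiled with xetex or luatex; the definition given to ẟ is just an illustrative choice. Because the engine sees ẟ as one character, its catcode can be changed directly.
[cce lang=”latex”]
% Plain TeX, Unicode engine (xetex or luatex); save this file as UTF-8.
\message{Code point of ẟ: \number`ẟ}  % writes 7839 (hex 1E9F) to the terminal
\catcode`ẟ=\active   % one character, one catcode
\defẟ{$\delta$}      % illustrative definition, chosen for this sketch
A small delta: ẟ.
\bye
[/cce]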
(One reason for posting this here is so that I can remember the answer without having to search for it on TeX.sx, which can be a tedious task even if you know exactly what you’re looking for.)
