Is it time for a new charset in the Digest? [telecom]

I've been using the ISO-8859-1 "Latin1" character set in the Digest for a few years now: we adopted it as the standard after a reader made me aware that there are no accented characters in ASCII, so I figured that I'd implement a way for him to spell his name properly.

I'm wondering if it's time for another change, either to one of the "transitional" Unicode formats, such as UTF-8, or perhaps to a permanent solution such as UCS-16.

I'd like to hear opinions from you, particularly if you have expertise in choosing character sets for online publications such as The Telecom Digest. TIA.

Bill Horne Moderator

Reply to
Telecom Digest Moderator

Keep using it...

...because even though Unicode is hawked as the "universal" character set, ISO is supported by, like, everything. Client Unicode support is more widespread than it was 10 years ago but still far from universal. If you absolutely have to use Unicode, stick with the zeroth page of UTF-8 (it mirrors the standard charset). A lot of software (e.g., I occasionally access this newsgroup directly via Eternal-September in a telnet session at a full-screen command line) reacts strangely to characters in higher-numbered pages. Lynx, for example, renders those stupid "curly quotes" as ~R, ~S and ~T, respectively. (A lesson I wish the Wordpress bozos would learn!)

Strictly my own opinion, of course. But if what you're using works, has proven itself, and is still an industry standard, why change it?

Reply to
MotoFox

There is no UCS-16. There are UCS-2 and UTF-16. The "TF" in "UTF" stands for "Transformation Format", not "Transitional Format". Another thing that uses the term "transitional" is HTML, which is not related to character sets.

I recommend that you go to UTF-8, or stick with ISO-8859-1 (or Windows-1252, which is a superset of ISO-8859-1). I don't think the other choices are reasonable. Trying to go with ISO-8859-*, where 15 different charsets with lots of overlap are distinguished only by charset tags, is going to cause problems when someone using ISO-8859-X quotes someone using ISO-8859-Y, where X != Y, and characters outside the common subset are used.

I like UTF-8. I hope it becomes permanent for things like the web and email. It has the advantage that the byte sequence for one character is never a substring of the byte sequence for any other character, so a pattern search designed for ASCII still works. Actually, a lot of things "just work" with UTF-8 for programs expecting ASCII. That won't happen for UTF-16.
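
As a quick illustration of that property (a minimal Python check; the sample address below is made up), an ASCII byte pattern can only match real ASCII characters in UTF-8 text, because every byte of a multibyte sequence has its high bit set:

    text = "café résumé From: émile@example.net".encode("utf-8")
    print(text.count(b"From:"))                          # 1 -- only the genuine ASCII occurrence
    print(all(b >= 0x80 for b in "é".encode("utf-8")))   # True -- no ASCII-range bytes inside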

I hope UTF-16 and UCS-2 die out. They encourage a halfway solution in which characters with codes that won't fit in 16 bits aren't supported. They also have the byte-order abomination. They do *NOT* solve the issue of variable-width characters. Even UCS-4 or UTF-32 does not do that, due to the existence of "combining characters".

The byte order mark of UTF-16 is a problem for mail and news articles. Where do you put it? If it's before the headers, then almost every mail and news server currently running will interpret it as part of the headers, mangling one of them, or, worse, interpret it as the division between (no) headers and the body of the message, ending up with a lot of rejected mail due to "missing" headers like From:, Subject: or Newsgroups:. If you put it at the start of the body, well, I can imagine the mess you end up with in replies to articles with quoting, even if everyone is using UTF-16. No BOM. Multiple conflicting BOMs. BOMs in the middle of text where they aren't looked at.

How often have you needed to translate something to be posted from whatever character set it was in to ISO-8859-1, and ended up with untranslatable characters? If the answer is "never", there's probably no pressing need to change. If your only concern is people's names, there may be no need to change, unless you get a lot of contributors with Japanese, Chinese, Korean, or Vietnamese names who still write in English. But if you are going to change, please choose UTF-8, not UTF-16.

One problem that often arises from using multiple charsets in a newsgroup or mailing list is that quoted text with charset A included in a post with charset B often results in a mess on the screens of readers. Using UTF-8 won't solve this, but it will reduce it. It's even worse when characters in charset A used in the quoted post have no equivalent in charset B (possible with, for example, ISO-8859-1 vs. ISO-8859-5). At least if charset B includes all the characters, translation is possible. Unless you try putting your foot down and claiming that all submissions must be in UTF-8, you'll probably still have to translate parts of some submissions.
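
For example, assuming the quoted bytes can still be isolated and their charset is known, translating them into UTF-8 loses nothing (a rough Python sketch, not anyone's actual workflow):

    quoted = "Привет".encode("iso-8859-5")   # Cyrillic text as it appeared in the charset-A post
    merged = quoted.decode("iso-8859-5")     # back to characters
    print(merged.encode("utf-8"))            # safe to embed in a UTF-8 (charset-B) post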

You should check out browser and mail reader support for various charsets. I believe the only required charsets for browsers are: ASCII, ISO-8859-1 ("Latin1"), Windows-1252 (a superset of ISO-8859-1), and UTF-8. I may be wrong about "required"; it may just mean "essential for the success of the program". In any case, a browser that does not support UTF-8 is going to miss out on a lot of the web.

In a survey of character sets used on the web in August 2014
formatting link
these are some of the results (a web site may use more than one character set, so results may add to more than 100%, but not by much):

#1  UTF-8          81.4%
#2  ISO-8859-1      9.9%
#3  Windows-1251    2.3%
#4  GB2312          1.4%
#5  Shift JIS       1.3%
#6  Windows-1252    1.2%
#7  GBK             0.4%
...
#18 US-ASCII        0.1%
...
    UTF-16          less than 0.1%

Web sites with unidentified character sets aren't counted. I presume that means that HTML with no charset tag is treated as "unidentified", not UTF-8, even if there's a rule that says untagged HTML should be treated as UTF-8.

However, even if a browser supports an encoding of Unicode (or several), it probably won't have installed all the fonts needed to render every character. That situation may also exist if a browser is trying to support all of the ISO-8859-* character sets.

Characters covered by ISO-8859-* (accented letters, etc.) will probably be well-supported. Characters used by dead languages (e.g. Egyptian hieroglyphics and Linear B) will likely not be. You'll also have problems with unofficial additions to Unicode in the Private Use Areas (e.g. the Klingon language) due to lack of an official registrar. Somehow I doubt that you will have any submissions about Ancient Egyptian area codes or long-distance rates to the Klingon homeworld.

Well, I'm no expert, but from the survey you can see what webmasters have chosen for web pages. Given that lots of mail readers are web-based, this is probably significant.

Gordon L. Burditt

Reply to
Gordon Burditt

Because it was designed specifically to have that property, of course. (UTF-8 is intellectually descended from FSS-UTF -- "file system safe" -- which was invented by the Plan 9 people at Bell Labs, at a time when the Unicode Consortium was dead set on 16-bit characters. The actual encoding is slightly different.)

Other than that, I agree with pretty much everything that Gordon says. (And I say that as someone whose universe is pretty much all still ISO 8859-1.) Because UTF-8 degrades gracefully to ASCII (erm, ISO 646), for most purposes, in English-language documents, there is no penalty to using it.

-GAWollman

Reply to
Garrett Wollman

Thanks for the correction: I had not known that. May I infer that UTF-8 is a "permanent" format that is here to stay?

(BTW, what is being "transformed"?)

I had not known that either. What part of html is "transitional"?

Good point, and that explains the "blank space" in UTF-8 which other character sets use for "High ASCII" characters. This is sometimes non-intuitive, however, as for example with the acute-accented "e":

In ISO-8859-1, the acute-accented "e" is a single byte with value 0xE9. In UTF-8, it is a two-byte sequence listed at
formatting link
as

Code point   Char   Hex values   Description
U+00E9       é      0xC3 0xA9    LATIN SMALL LETTER E WITH ACUTE

... which confuses me somewhat, since the "code point" is the same value as ISO-8859-1, but the actual byte sequence is very different. IIRC, the "C3" is an "escape" value that says "go to the two-byte table", but I may need instruction.
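
A minimal check of that distinction (plain Python, nothing Digest-specific) shows the code point is the same number ISO-8859-1 uses, while the UTF-8 bytes differ:

    ch = "\u00e9"                      # é
    print(hex(ord(ch)))                # 0xe9 -- the code point
    print(ch.encode("iso-8859-1"))     # b'\xe9'     -- one byte, equal to the code point
    print(ch.encode("utf-8"))          # b'\xc3\xa9' -- lead byte C3 plus continuation byte A9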

OK, you've convinced me: I didn't know that there *was* such a thing as a "Byte Order Mark", and having to add it to incoming posts which are /not/ in UCS-2 would be a PITA. So, I'll stay away from UCS-2 and UTF-16.

AHA! The crux of the issue!

I am compelled to translate "mystery meat" characters several times each week, and they *always* come in emails which have *NO* "charset" specified.

Some email clients send all characters out as whatever-charset-the-user-is-using, which in most cases is "windows-12xx", but without any clue for *other* operating systems or email clients as to what kind of mystery meat is in the can.

Moreover, quoted material which the sender /received/ as ISO-8859-1 is usually returned unmarked and unconverted, and is lumped in with the "default" character set of the email client being used, so that what arrives here is often labeled with the wrong character set, if it's labeled at all.
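
For what it's worth, a rough guessing heuristic for unlabeled text looks something like this (a Python sketch, not the Digest's actual tooling): try the strict decoders first, then fall back to ISO-8859-1, which maps every byte value and so never fails.

    def guess_charset(raw: bytes) -> str:
        for enc in ("us-ascii", "utf-8"):    # strict decoders reject anything that doesn't fit
            try:
                raw.decode(enc)
                return enc
            except UnicodeDecodeError:
                pass
        return "iso-8859-1"                  # always "works", even on mystery meat

    print(guess_charset(b"plain text"))      # us-ascii
    print(guess_charset(b"caf\xc3\xa9"))     # utf-8
    print(guess_charset(b"caf\xe9"))         # iso-8859-1 (could equally be windows-1252)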

I will, and thanks again.

OK, that's pretty powerful evidence that UTF-8 has become the default, at least on the web.

However, AFAIK, there is no "default" for Usenet. The biggest problem I have when trying to come up with a one-size-fits-all solution to the charset dilemma is that so few email clients bother to mark outgoing messages (either Usenet posts or emails) with the character set that was used to create them, and that means a lot of guesswork here at Digest Central whenever accented characters are used.

Take a look at these stats, which are drawn from Digest posts received between Aug 1 and Sep 20:

There were 271 posts, not including "service" messages from other sites, status reports from the Majordomo robot, etc.

Of those 271, only 214 contained a "Content-Type: text/plain" header. The remainder were either "multipart" MIME messages (27 of them; see note 1), which aren't counted here, or "text/html" messages, which were discarded (see note 2).

US-ASCII        48.13%
ISO-8859-1      30.37%
UTF-8            9.81%
ANSI_X3.4-1968   6.07%
Windows-1252     5.61%

There are two things to note:

  • "Multipart" messages are converted to plain text before I see them, but they aren't counted in the figures above.

  • I'm unable to verify if the Charset info is correct, i.e., if the email client which created each post actually used the character set the client reported. So, I think I can use these percentages as a first approximation to draw these conclusions:

  1. The majority of posts are marked as "ASCII" when they are created.
  2. "ISO-8859-1" is a clear second choice.
  3. All others are distant third-place finishers.

However, I can't tell just from these numbers if the readers /want/ to use ASCII, ISO-8859-1, UTF-8, or something else. In other words, the large percentage of messages which used "ISO-8859-1" might be a result of readers setting their newsreaders or email clients to use that standard. Still, the low percentage of "UTF-8" submissions gives me pause.

Thanks again for your insight. I'm going to do more research.

Bill

1.) Multipart messages which have a "text/plain" component are stripped of other sections and sent to me as "text/plain" posts. They're not counted here because I didn't have time to go through the incoming emails and add up what character set each was using in the "text/plain" section. There were 27 multipart messages, less than 10% of the total.

2.) BTW, sorry if you sent a post with a "text/html" Content-type, but we don't have any way to convert HTML into plain text, and the Digest doesn't publish HTML posts. Every one I've ever looked at was spam.

Reply to
Telecom Digest Moderator

This is by design. Quoting from the utf8(5) manual page:

The UTF-8 encoding represents UCS-4 characters as a sequence of octets, using between 1 and 6 octets for each character. It is backwards compatible with ASCII, so 0x00-0x7f refer to the ASCII character set. The multibyte encoding of non-ASCII characters consists entirely of bytes whose high-order bit is set. The actual encoding is represented by the following table:

[0x00000000 - 0x0000007f] [00000000.0bbbbbbb]                   -> 0bbbbbbb
[0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb]                   -> 110bbbbb, 10bbbbbb
[0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb]                   -> 1110bbbb, 10bbbbbb, 10bbbbbb
[0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] -> 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb

If more than a single representation of a value exists (for example, 0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always used. Longer ones are detected as an error as they pose a potential security risk, and destroy the 1:1 character:octet sequence mapping.
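
Most decoders enforce that rule today; for instance, Python's UTF-8 codec rejects the overlong two-byte encoding of NUL (shown here purely as an illustration of the paragraph above):

    print(repr(b"\x00".decode("utf-8")))      # '\x00' -- the valid one-byte form
    try:
        b"\xc0\x80".decode("utf-8")           # overlong form of the same code point
    except UnicodeDecodeError as e:
        print("rejected:", e.reason)          # "invalid start byte"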

Actually, there is, and it's UTF-8. This was standardized in the most recent update of the article format RFC (the number of which I've forgotten). But many clients have not yet been updated to meet the new RFC (because the most functional clients are often old abandonware that can't do character set conversion at all).

-GAWollman

Reply to
Garrett Wollman

It's not that the "t" in "html" means "transitional" (it doesn't) -- it's that HTML has various acceptable standards, some of them transitional, as shown in the corresponding DOCTYPE declarations (with which a valid HTML document must begin), for example, in this one:

There you plainly see 'the term "transitional" ', n'est-ce pas?

Cheers, -- tlvp

Reply to
tlvp
[snip]

[snip]

[snip]

The above-quoted lines of text indicate how Google treats diacritic characters in the Google Groups archive.

formatting link

Judging from previous comments in this thread, I may be the only T-D reader who reads messages in, and posts from, the Google Groups archive. While I don't have any opinion about what character set T-D should use in the future, I suggest that the Google Groups archive shouldn't be left out of the discussion.

Neal McLain

***** Moderator's Note *****

I'd like to use a character set which *Everyone* can understand, but that's not always possible. I don't like it, but windoze is the default standard, and redmond gets to dictate what works whether I like it or not, so if I try to please Google and mickeysoft, I'm done before I start.

I'd much rather choose a standards-based solution, which everyone can agree on, and which will, at least, allow those with evil-empire software (of *ANY* kind) to adapt and find workarounds.

Bill Horne Moderator

Reply to
Neal McLain

UTF-8 is as permanent as any other format, hopefully more so. If someone seriously proposed UTF-666 to support all intergalactic languages when we're accepted into the Intergalactic Federation of Planets, I'd expect it would die quickly because of the massive waste of bits per character. Also, a character is not a whole number of bytes.

Character sets, especially Unicode, are defined in terms of "code points". For 8-bit and smaller character sets, the "code point" and the representation are the same. A possible exception is Baudot, with its shift characters, where a character is sent as 5 bits with the shift state implicitly included as a 6th bit. In that case there is effectively a 6-bit code point for each Baudot character, whether or not it was defined that way, and whether or not the term "code point" had even been invented when Baudot was in wide use.

In Unicode, there are at least 5 different representations (transformations): UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE. These are obtained by re-arranging bits and maybe adding more. You may argue about UTF-7, and whether UTF-16 is equivalent to one of UTF-16BE or UTF-16LE or it's a new representation by itself (the rules for BOMs are different).

UTF-32LE and UTF-32BE: you take the code point, turn it into a 32-bit number, break it up into 4 bytes in the appropriate order, and ship it. There are 22 possible byte orders that Unicode doesn't support. That was easy.
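
In Python terms (a trivial check, using the explicit-endian codecs so no BOM is added):

    print("é".encode("utf-32-be").hex())   # 000000e9
    print("é".encode("utf-32-le").hex())   # e9000000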

UTF-16BE and UTF-16LE: You take the code point, and if it fits in 16 bits, you break it into 2 bytes in the appropriate order, and ship it.

Otherwise, a code point is turned into a High Surrogate (D800-DBFF) followed by a Low Surrogate (DC00-DFFF). Take the code point value, and subtract 0x10000. (If you get a negative number here, then it fits in 16 bits, and you weren't supposed to get here.) Treat it as a 20-bit number (if it doesn't fit in 20 bits, i.e. the code is >= 0x110000, it's invalid), and split it into a 10-bit high half and a 10-bit low half. Add 0xD800 to the high half, and that gives the high surrogate. Add 0xDC00 to the low half, and that gives the low surrogate. Break these into 4 bytes in the appropriate order, and ship them.

The code points D800-DFFF are reserved (not only in UTF-16, but in all of Unicode) to avoid ambiguities between real characters and surrogates.
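
Here is that arithmetic worked through for one code point outside the 16-bit range, U+1F600 (a small Python sketch to check the description above):

    cp = 0x1F600
    v = cp - 0x10000                  # 20-bit value
    hi = 0xD800 + (v >> 10)           # high surrogate
    lo = 0xDC00 + (v & 0x3FF)         # low surrogate
    print(hex(hi), hex(lo))           # 0xd83d 0xde00
    print("\U0001F600".encode("utf-16-be").hex())   # d83dde00 -- the same bytes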

UTF-8: You take the code point, and if it fits in 7 bits, you ship it as-is (with a 0 high bit). Otherwise, a character consists of one leader byte (binary pattern 11XXXXXX) followed by one or more following bytes (binary pattern 10XXXXXX).

00XXXXXX                               Single-byte ASCII character
01XXXXXX                               Single-byte ASCII character
110XXXXX 10XXXXXX                      2-byte sequence (0x0080 - 0x07FF)
1110XXXX 10XXXXXX 10XXXXXX             3-byte sequence (0x0800 - 0xFFFF)
11110XXX 10XXXXXX 10XXXXXX 10XXXXXX    4-byte sequence (0x10000 - 0x10FFFF)
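
A toy encoder following that table (standard 1- to 4-byte sequences only; real code should just call str.encode("utf-8"), this is only to show where the code-point bits land):

    def utf8_encode(cp: int) -> bytes:
        if cp < 0x80:
            return bytes([cp])
        if cp < 0x800:
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:
            return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
        return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

    assert utf8_encode(0x2013) == "\u2013".encode("utf-8")        # E2 80 93
    assert utf8_encode(0x10FFFF) == "\U0010FFFF".encode("utf-8")  # F4 8F BF BF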

Unicode only goes up to 0x10FFFF (21 bits) mostly because UTF-16 won't go any higher. It could be extended to use a 7-byte sequence, giving a 36-bit code point. If you're willing to use 0xFF as a leader byte, you can go up to 41-bit code points. Following bytes carry 6 data bits. Leading bytes carry 3-5, or if you want to extend it, 0-5.

UTF-7: This was an attempt to combine Unicode and 7-bit-safe email in a way more space-efficient than base64. I think they've pretty much given up on it.

A Byte Order Mark is the code point U+FEFF translated into whatever format is being used. The byte-reversed character, U+FFFE, is not used, making a backwards byte order detectable. There *is* a UTF-8 byte order mark, EF BB BF, but it really serves no purpose except to screw up PHP pages. (You're supposed to put certain code in PHP before outputting any text, and that invisible BOM before the opening <?php tag counts as output, which defeats that.)
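
For reference, this is what U+FEFF looks like in each encoding (a quick Python check; the explicit-endian codecs don't prepend a BOM of their own):

    for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        print(enc, "\ufeff".encode(enc).hex())
    # utf-8 efbbbf, utf-16-be feff, utf-16-le fffe, utf-32-be 0000feff, utf-32-le fffe0000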

Reply to
Gordon Burditt

Yes, you can be confident that UTF-8 is not going away. Everything in the IETF that needs characters beyond ASCII uses UTF-8.

Unicode is a 32 bit character set. UTF-8 is a very clever way of encoding Unicode characters into variable length groups of 8 bit bytes. The 0-127 ASCII-compatible range is represented as itself, and the longer encodings are self-synchronizing.

Quite right.

Not quite. For Unicode values between 0x080 and 0x7ff, if the bits in the character are ABCDEFGHIJK, the UTF-8 bytes are 110ABCDE 10FGHIJK. The value in the high bits of the first byte tells you how many more bytes follow. The Wikipedia article explains this well.
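
Checking that layout with U+00E9 (binary 000 1110 1001), purely as an illustration:

    cp = 0xE9
    print(format(0xC0 | (cp >> 6), "08b"), format(0x80 | (cp & 0x3F), "08b"))
    # 11000011 10101001  ->  bytes C3 A9, matching "é".encode("utf-8")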

For this application BOMs aren't important.

R's, John

Reply to
John Levine
