Is it time for a new charset in the Digest? [telecom]

I've been using the ISO-8859-1 "Latin1" character set in the Digest
for a few years now: we adopted it as the standard after a reader made
me aware that there are no accented characters in ASCII, so I figured
that I'd implement a way for him to spell his name properly, and also


I'm wondering if it's time for another change, either to one of the
"transitional" Unicode formats, such as UTF-8, or perhaps to a
permanent solution such as UCS-16.

I'd like to hear opinions from you, particularly if you have expertise
in choosing character sets for online publications such as The Telecom
Digest. TIA.

Bill Horne
Moderator

Re: Is it time for a new charset in the Digest? [telecom]


Keep using it...


...because even though Unicode is hawked as the "universal" character set,
ISO-8859-1 is supported by, like, everything. Client Unicode support is more
widespread than it was 10 years ago, but it's still far from universal. If you
absolutely have to use Unicode, stick with the zeroth page of UTF-8 (which
mirrors the standard ASCII charset). Much software reacts strangely to
characters in higher-numbered pages (e.g., I occasionally access this
newsgroup directly via Eternal-September in a telnet session at a
full-screen command line). Lynx, for example, renders those stupid "curly
quotes" as ~R, ~S and ~T, respectively. (A lesson I wish the WordPress
bozos would learn!)


Strictly my own opinion, of course. But if what you're using works, has
proven itself, and is still an industry standard, why change it?

Re: Is it time for a new charset in the Digest? [telecom]

There is no UCS-16.  There are UCS-2 and UTF-16.  The "TF" in "UTF"
stands for "Transformation Format", not "Transitional Format".
Another thing that uses the term "transitional" is HTML, which
is not related to character sets.

I recommend that you go to UTF-8, or stick with ISO-8859-1 (or
Windows-1252, which is a superset of ISO-8859-1).  I don't think
the other choices are reasonable.  Trying to go with ISO-8859-*,
where 15 different charsets with lots of overlap are distinguished
only by charset tags, is going to cause problems when someone using
ISO-8859-X quotes someone using ISO-8859-Y, where X != Y, and
characters outside the common subset are used.

I like UTF-8.  I hope it becomes permanent for things like the web
and email.  It has the advantage that no byte sequence for any
character is a substring of the byte sequence for any other character,
so a pattern search designed for ASCII still works.  Actually, a
lot of things "just work" with UTF-8 in programs expecting ASCII.
That won't happen with UTF-16.
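
To see that property in action, here's a minimal sketch in Python (the
byte values are standard UTF-8; the sample strings are just illustration):

    # An ASCII pattern search still works on UTF-8 bytes, because no
    # character's UTF-8 encoding appears inside another's encoding.
    text = "café menu".encode("utf-8")      # b'caf\xc3\xa9 menu'
    assert text.find(b"menu") == 6          # ASCII search works unchanged

    # The same search on UTF-16 bytes fails: every ASCII letter is
    # interleaved with 0x00 bytes, so a byte-wise ASCII matcher misses it.
    text16 = "café menu".encode("utf-16-le")
    assert text16.find(b"menu") == -1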

I hope UTF-16 and UCS-2 die out.  They encourage a halfway solution
in which characters with codes that won't fit in 16 bits aren't
supported.  They also have the byte-order abomination.  They do
*NOT* solve the issue of variable-width characters.  Even UCS-4 or
UTF-32 does not do that, due to the existence of "combining
characters".  The byte order mark of UTF-16 is a problem for mail
and news articles.  Where do you put it?  If it's before the headers,
then most every mail and news server currently running will interpret
it as part of the headers, mangling one of them, or worse, interpret
it as a division between (no) headers and the body of the message,
ending up with a lot of rejected mail due to "missing" headers
like From:, Subject: or Newsgroups:.  If you put it at the start
of the body, well, I can imagine the mess you'd end up with when
replies quote articles, even if everyone is using UTF-16.  No
BOM.  Multiple conflicting BOMs.  BOMs in the middle of text where
they aren't looked at.
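
To make the mark concrete, here's a minimal Python sketch (the byte
values shown are fixed by the Unicode standard; the rest is illustration):

    # The Byte Order Mark is just U+FEFF encoded in the target format.
    bom = "\ufeff"
    print(bom.encode("utf-16-be").hex())   # 'feff'   -> big-endian UTF-16
    print(bom.encode("utf-16-le").hex())   # 'fffe'   -> little-endian UTF-16
    print(bom.encode("utf-8").hex())       # 'efbbbf' -> the UTF-8 "BOM"

    # Python's bare "utf-16" codec prepends a BOM automatically; that
    # first byte pair is exactly what would land in front of a header.
    print("From: bill".encode("utf-16")[:2].hex())  # 'fffe' on little-endian hosts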

How often have you needed to translate something to be posted from
whatever character set it was in to ISO-8859-1, and ended up with
untranslatable characters?  If the answer is "never", there's
probably no pressing need to change.  If your only concern is
people's names, there may be no need to change, unless you get a
lot of contributors with Japanese, Chinese, Korean, or Vietnamese
names who still write in English.  But if you are going to change,
please choose UTF-8, not UTF-16.

One problem that often arises from using multiple charsets in a
newsgroup or mailing list is that quoted text with charset A included
in a post with charset B often results in a mess on the screens of
readers.  Using UTF-8 won't solve this, but it will reduce it.  It's
even worse when characters in charset A used in the quoted post
have no equivalent in charset B (possible with, for example,
ISO-8859-1 vs. ISO-8859-5).  At least if charset B includes all the
characters, translation is possible.  Unless you try putting your
foot down and claiming that all submissions must be in UTF-8,
you'll probably still have to translate parts of some submissions.

You should check out browser and mail reader support for various
charsets.  I believe the only required charsets for browsers are:
ASCII, ISO-8859-1 ("Latin1"), Windows-1252 (a superset of ISO-8859-1),
and UTF-8.  I may be wrong about "required"; it may just mean
"essential for the success of the program".  In any case, a browser
that does not support UTF-8 is going to miss out on a lot of the
web.

In a survey of character sets used on the web in August, 2014
(http://w3techs.com/technologies/overview/character_encoding/all), these
are some of the results (a web site may use more than one character
set, so results may add to more than 100%, but not by much):

    #1   UTF-8           81.4%
    #2   ISO-8859-1       9.9%
    #3   Windows-1251     2.3%
    #4   GB2312           1.4%
    #5   Shift JIS        1.3%
    #6   Windows-1252     1.2%
    #7   GBK              0.4%
    ...
    #18  US-ASCII         0.1%
    ...
         UTF-16          less than 0.1%

Web sites with unidentified character sets aren't counted.
I presume that means that HTML with no charset tag is treated
as "unidentified", not UTF-8, even if there's a rule that says
untagged HTML should be treated as UTF-8.

However, even if a browser supports an encoding of Unicode (or
several), it probably won't have installed all the fonts needed to
render every character.  That situation may also exist if a
browser is trying to support all of the ISO-8859-* character sets.

Characters covered by ISO-8859-* (accented letters, etc.) will
probably be well-supported.  Characters used by dead languages (e.g.
Egyptian hieroglyphics and Linear B) will likely not be.  You'll
also have problems with unofficial additions to Unicode in the
Private Use Areas (e.g.  the Klingon language) due to lack of an
official registrar.  Somehow I doubt that you will have any
submissions about Ancient Egyptian area codes or long-distance rates
to the Klingon homeworld.


Well, I'm no expert, but from the survey you can see what webmasters have
chosen for web pages.  Given that lots of mail readers are web-based,  
this is probably significant.

Gordon L. Burditt

Re: Is it time for a new charset in the Digest? [telecom]

Because it was designed specifically to have that property, of course.
(UTF-8 is intellectually descended from FSS-UTF -- "file system safe"
-- which was invented by the Plan 9 people at Bell Labs, at a time
when the Unicode Consortium was dead set on 16-bit characters.  The
actual encoding is slightly different.)

Other than that, I agree with pretty much everything that Gordon
says.  (And I say that as someone whose universe is pretty much all
still ISO 8859-1.)  Because UTF-8 degrades gracefully to ASCII (erm,
ISO 646), for most purposes, in English-language documents, there is
no penalty to using it.

-GAWollman
--  
Garrett A. Wollman    | What intellectual phenomenon can be older, or more oft
wollman@bimajority.org| repeated, than the story of a large research program
Opinions not shared by| that impaled itself upon a false central assumption
my employers.         | accepted by all practitioners? - S.J. Gould, 1993

Re: Is it time for a new charset in the Digest? [telecom]
On Fri, Sep 19, 2014 at 12:52:37AM -0500, Gordon Burditt wrote:


Thanks for the correction: I had not known that. May I infer that
UTF-8 is a "permanent" format that is here to stay?

(BTW, what is being "transformed"?)


I had not known that either. What part of html is "transitional"?


Good point, and that explains the "blank space" in UTF-8 which other
character sets use for "High ASCII" characters. This is sometimes
non-intuitive, however, as for example with the acute-accent "e":

In ISO-8859-1, the acute-accent "e" is a single byte with value
0xE9. In UTF-8, it is a two-byte sequence listed at  
http://www.utf8-chartable.de/ as

Code point  Char  Hex values  Description
U+00E9      é     c3 a9       LATIN SMALL LETTER E WITH ACUTE

... which confuses me somewhat, since the "code point" is the same
value as ISO-8859-1, but the actual byte sequence is very
different. IIRC, the "C3" is an "escape" value that says "go to the
two-byte table", but I may need instruction.


OK, you've convinced me: I didn't know that there *was* such a thing
as a "Byte Order Mark", and having to add it to incoming posts which
are /not/ in UCS-2 would be a PITA. So, I'll stay away from UCS-2 and
UTF-16.


AHA! The crux of the issue!  

I am compelled to translate "mystery meat" characters several times
each week, and they *always* come in emails which have *NO* "charset"
specified.

Some email clients send all characters out as whatever-charset-the-
user-is-using, which in most cases is "windows-12xx", but without any
clue for *other* operating systems or email clients as to what kind of
mystery meat is in the can.

Moreover, quoted material which the sender /received/ as ISO-8859-1 is
usually returned unmarked and unconverted, and is lumped in with the
"default" character set of the email client being used, so that what
arrives here is often mislabeled.



I will, and thanks again.


OK, that's pretty powerful evidence that UTF-8 has become the default,
at least on the web.

However, AFAIK, there is no "default" for Usenet. The biggest problem
I have when trying to come up with a one-size-fits-all solution to the
charset dilemma is that so few email clients bother to mark outgoing
messages (either Usenet posts or emails) with the character set that
was used to create them, and that means a lot of guesswork here at
Digest Central whenever accented characters are used.

Take a look at these stats, which are drawn from Digest posts received
between Aug 1 and Sep 20:

There were 271 posts, not including "service" messages from other
sites, status reports from the Majordomo robot, etc.

Of those 271, only 214 contained the "Content-Type: text/plain"
header. We received 27 "multipart" MIME messages (1), which aren't
counted here, plus some "text/html" messages, which were discarded. (2)

US-ASCII         48.13%
ISO-8859-1       30.37%
UTF-8             9.81%
ANSI_X3.4-1968    6.07%
Windows-1252      5.61%

There are two things to note:  

* "Multipart" messages are converted to plain text before I see them, but they aren't  
  counted in the figures above.  

* I'm unable to verify if the Charset info is correct, i.e., if the
  email client which created each post actually used the character set
  the client reported.
    
So, I think I can use these percentages as a first approximation to draw these conclusions:

1. The majority of posts are marked as "ASCII" when they are created.  
2. "ISO-8859-1" is a clear second choice.
3. All others are distant third-place finishers.    
    
However, I can't tell just from these numbers if the readers /want/ to
use ASCII, ISO-8859-1, UTF-8, or something else. In other words, the
large percentage of messages which used "ISO-8859-1" might be a result
of readers setting their newsreaders or email clients to use that
standard. Still, the low percentage of "UTF-8" submissions gives me pause.

Thanks again for your insight. I'm going to do more research.  

Bill

1.) Multipart messages which have a "text/plain" component are
    stripped of other sections and sent to me as "text/plain"
    posts. They're not counted here because I didn't have time to go
    through the incoming emails and add up what character set each was
    using in the "text/plain" section. There were 27 multipart
    messages, less than 10% of the total.

2.) BTW, sorry if you sent a post with a "text/html" Content-type, but
    we don't have any way to convert HTML into plain text, and the
    Digest doesn't publish HTML posts. Every one I've ever looked at
    was spam.

--  
Bill Horne
Moderator

Re: Is it time for a new charset in the Digest? [telecom]
On Sat, 20 Sep 2014 12:14:29 -0400, Telecom Digest Moderator wrote:


It's not that the "t" in "html" means "transitional" (it doesn't) -- it's
that HTML has various acceptable standards, some of them transitional, as
shown in the corresponding DOCTYPE declarations (with which a valid HTML
document must begin), for example, in this one:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

There you plainly see 'the term "transitional" ', n'est-ce pas?

Cheers, -- tlvp
--  
Avant de repondre, jeter la poubelle, SVP.

Re: Is it time for a new charset in the Digest? [telecom]
On Saturday, September 20, 2014 11:14:29 AM UTC-5, Telecom Digest Moderator
wrote:

[snip]


[snip]


[snip]


The lines quoted above indicate how Google treats diacritic characters
in the Google Groups archive:
https://groups.google.com/forum/?hl=en#!topic/comp.dcom.telecom/yqHnudP1Ia4

Judging from previous comments in this thread, I may be the only T-D reader
who reads messages in, and posts from, the Google Groups archive.  While I
don't have any opinion about what character set T-D should use in the future,
I suggest that the Google Groups archive shouldn't be left out of the
discussion.

Neal McLain

***** Moderator's Note *****

I'd like to use a character set which *Everyone* can understand, but
that's not always possible. I don't like it, but windoze is the
default standard, and redmond gets to dictate what works whether I
like it or not, so if I try to please Google and mickeysoft, I'm done
before I start.

I'd much rather choose a standards-based solution, which everyone can
agree on, and which will, at least, allow those with evil-empire
software (of *ANY* kind) to adapt and find workarounds.

Bill Horne
Moderator

Re: Is it time for a new charset in the Digest? [telecom]

This is by design.  Quoting from the utf8(5) manual page:

     The UTF-8 encoding represents UCS-4 characters as a sequence of octets,
     using between 1 and 6 for each character.  It is backwards compatible
     with ASCII, so 0x00-0x7f refer to the ASCII character set.  The multibyte
     encoding of non-ASCII characters consist entirely of bytes whose high
     order bit is set.  The actual encoding is represented by the following
     table:

     [0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb
     [0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
     [0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] ->
         1110bbbb, 10bbbbbb, 10bbbbbb
     [0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] ->
         11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
     [0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
         111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
     [0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
         1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb

     If more than a single representation of a value exists (for example,
     0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always
     used.  Longer ones are detected as an error as they pose a potential
     security risk, and destroy the 1:1 character:octet sequence mapping.


Actually, there is, and it's UTF-8.  This was standardized in the most
recent update of the article format RFC (the number of which I've
forgotten).  But many clients have not yet been updated to meet the
new RFC (because the most functional clients are often old
abandonware that can't do character set conversion at all).

-GAWollman

--  
Garrett A. Wollman    | What intellectual phenomenon can be older, or more oft
wollman@bimajority.org| repeated, than the story of a large research program
Opinions not shared by| that impaled itself upon a false central assumption
my employers.         | accepted by all practitioners? - S.J. Gould, 1993

Re: Is it time for a new charset in the Digest? [telecom]

UTF-8 is as permanent as any other format, hopefully more so.  If
someone seriously proposed UTF-666 to support all intergalactic
languages when we're accepted into the Intergalactic Federation of
Planets, I'd expect it to die quickly because of the massive
waste of bits per character.  (Besides, 666 bits per character is
not a whole number of bytes.)


Character sets, especially Unicode, are defined in terms of "code
points".  For 8-bit character sets and smaller, the "code point"
and the representation are the same.  A possible exception is Baudot,
with its shift characters, where a character is sent as 5 bits with
the shift state implicitly included as a 6th bit.  There is, in
this case, a 6-bit code point for each Baudot character, whether it
was defined that way or not, and whether or not the term "code point"
had even been invented when Baudot was in wide use.

In Unicode, there are at least 5 different representations
(transformations): UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.
These are obtained by re-arranging bits and maybe adding more.  You
may argue about UTF-7, and about whether UTF-16 is equivalent to one
of UTF-16BE or UTF-16LE or is a new representation all by itself (the
rules for BOMs are different).

UTF-32LE and UTF-32BE:  you take the code point, turn it into a
32-bit number, break it up into 4 bytes in the appropriate order,
and ship it.  There are 22 possible byte orders that Unicode doesn't
support.  That was easy.
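
In Python the packing is literally a 4-byte integer write (a minimal
sketch; struct is only used here to show the bytes, and the example
code point is arbitrary):

    import struct

    cp = 0x1F600                        # a code point above 0xFFFF
    print(struct.pack(">I", cp).hex())  # '0001f600' -> UTF-32BE
    print(struct.pack("<I", cp).hex())  # '00f60100' -> UTF-32LE

    # Same bytes the real codec produces:
    assert "\U0001F600".encode("utf-32-be") == struct.pack(">I", cp)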

UTF-16BE and UTF-16LE:  You take the code point, and if it fits in
16 bits, you break it into 2 bytes in the appropriate order, and
ship it.

Otherwise, a code point is turned into a High Surrogate (D800-DBFF)
followed by a Low Surrogate (DC00-DFFF).  Take the code point value,
and subtract 0x10000.  (If you get a negative number here, then it
fits in 16 bits, and you weren't supposed to get here.)  Treat it
as a 20-bit number (if it doesn't fit in 20 bits, i.e. the code
point is >= 0x110000, it's invalid), and split it into a 10-bit
high half and a 10-bit low half.  Add 0xD800 to the high half, and
that gives the high surrogate.  Add 0xDC00 to the low half, and that
gives the low surrogate.  Break these into 4 bytes in the appropriate
order, and ship it.

The code points D800-DFFF are reserved (not only in UTF-16, but in
all of Unicode) to avoid ambiguities between real characters and
surrogates.
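
Worked through in Python as a sanity check (a minimal sketch of the
arithmetic just described; the example code point is arbitrary):

    def utf16_surrogates(cp):
        """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000               # now a 20-bit number
        high = 0xD800 + (v >> 10)      # top 10 bits -> high surrogate
        low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits -> low surrogate
        return high, low

    # U+1F600 becomes the pair D83D DE00...
    assert utf16_surrogates(0x1F600) == (0xD83D, 0xDE00)
    # ...which is exactly what the utf-16-be codec emits.
    assert "\U0001F600".encode("utf-16-be") == b"\xd8\x3d\xde\x00"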

UTF-8:  You take the code point, and if it fits in 7 bits, you
ship it as-is (with a 0 high bit).  Otherwise, a character
consists of one leader byte (binary pattern 11XXXXXX) followed
by one or more following bytes (binary pattern 10XXXXXX).

00XXXXXX                             single-byte ASCII character
01XXXXXX                             single-byte ASCII character
110XXXXX 10XXXXXX                    2-byte sequence (0x0080 - 0x07FF)
1110XXXX 10XXXXXX 10XXXXXX           3-byte sequence (0x0800 - 0xFFFF)
11110XXX 10XXXXXX 10XXXXXX 10XXXXXX  4-byte sequence (0x10000 - 0x10FFFF)

Unicode only goes up to 0x10FFFF (21 bits) mostly because UTF-16
won't go any higher.  UTF-8 could be extended to use a 7-byte
sequence, giving a 36-bit code point.  If you're willing to use
0xFF as a leader byte, you can go up to 41-bit code points.
Following bytes carry 6 data bits.  Leader bytes carry 3-5 or,
if you want to extend it, 0-5.
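
Here is the table above turned directly into code (a minimal Python
sketch, standard 4-byte Unicode range only, no extensions):

    def utf8_encode(cp):
        """Encode one code point per the table above."""
        if cp < 0x80:                  # 0XXXXXXX
            return bytes([cp])
        if cp < 0x800:                 # 110XXXXX 10XXXXXX
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:               # 1110XXXX 10XXXXXX 10XXXXXX
            return bytes([0xE0 | cp >> 12,
                          0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    # Agrees with the built-in codec across all four lengths:
    for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF):
        assert utf8_encode(cp) == chr(cp).encode("utf-8")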

UTF-7:  This was an attempt to combine Unicode and 7-bit-safe emails
in a more space-efficient way than base64.  I think they've pretty
much given up on it.

A Byte Order Mark is the code point U+FEFF translated into whatever
format is being used.  The byte-reversed character, U+FFFE, is not
used, making a backwards byte order detectable.  There *is* a UTF-8
byte order mark, EF BB BF, but it really serves no purpose except to
screw up PHP pages.  (You're supposed to put certain code in PHP
before outputting any text, and that invisible BOM before the <?php
counts as text to be output, which causes errors.)


HTML defines a number of DOCTYPE lines (and corresponding DTDs) to
put at the front of your HTML to indicate how it is to be parsed.
Two of these are called "HTML 4.01 Strict" and "HTML 4.01 Transitional".
The transitional one allows some older constructs in process of
being phased out to stay around while the strict version doesn't.
This may affect how the browser renders the page.



There aren't any high-bit bytes that are part of a single-byte
character sequence, if that's what you mean.


UTF-8 and ISO-8859-1 have the same code points, but different
representation, for all of the characters in the "upper half" of
ISO-8859-1.  The "lower half" corresponds to ASCII, and the code
point and the representation are the same.  The rest of UTF-8 has
no equivalents in ISO-8859-1 (and uses multiple bytes per character).

There isn't *one* 2-byte table, there are 32 of them corresponding
to the 32 leader bytes for 2-byte sequences starting with C0 thru
DF.  And you really aren't supposed to use C0 and C1.

You may consider it an "escape", which is one valid way of looking
at it, but it contains 5 high-order bits of the value, and the
following byte contains 6 low-order bits, so the pattern of a
two-byte sequence 110XXXXX 10XXXXXX can cover values up to 0x7FF
(11 bits, corresponding to the X's).

Point of further confusion:  *EVERY* code point within the Unicode
range (0 - 0x10FFFF) has a 4-byte-long UTF-8 encoding.  Every code
point in the range 0 - 0xFFFF has a 3-byte long UTF-8 encoding.
Every code point in the range 0 - 0x7FF has a 2-byte long UTF-8
encoding.  That's 4 different ways of encoding an 'X'!  "Overlong"
encodings are supposed to be treated as errors.  Some people have
used the byte sequence C0 80 as a way of sneaking an ASCII NUL into
a C NUL-terminated string without it being a terminator.
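
Modern strict decoders do reject the overlong forms; a quick check in
Python, for what it's worth:

    # A strict UTF-8 decoder rejects the overlong form C0 80 for NUL...
    try:
        b"\xc0\x80".decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)                       # 'invalid start byte'

    # ...while the honest one-byte encoding is accepted.
    assert b"\x00".decode("utf-8") == "\x00"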


You have to do more than that:  if it's pure ASCII you are inserting
into an article to be sent as UTF-16, you need to add a NUL byte
alongside each character in the text.  Running it through "iconv"
makes this easier than it sounds.
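
What that interleaving looks like, sketched in Python (iconv does the
same job from the shell; the sample header line is made up):

    ascii_text = "Subject: test"

    # Inserting ASCII into UTF-16BE means pairing every byte with a NUL:
    print(ascii_text.encode("utf-16-be"))
    # b'\x00S\x00u\x00b\x00j\x00e\x00c\x00t\x00:\x00 \x00t\x00e\x00s\x00t'

    # Roughly equivalent shell pipeline:
    #   echo "Subject: test" | iconv -f ASCII -t UTF-16BE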


I have a program that tries to identify the charset of a file,
largely by rejecting impossible sequences, assuming it's a text
file.  It still usually comes up with several ranked possibilities.
(The UNIX "file" command might be more practical for this purpose.
I think a variant of it has been ported to Mac and Windows.)  No,
it can't deal with hodgepodge mixtures.

It's easy to identify pure 7-bit ASCII.

It's easy to identify something that is NOT UTF-8, because the
sequence of leader/following bytes has to be right.  One
high-bit character from ISO-8859-* or Windows-* with surrounding
7-bit ASCII chars is enough to reject something as being UTF-8.

Something that it calls UTF-8 with over a dozen or two non-7-bit
characters almost certainly is UTF-8.

It's easy to identify Windows-* character sets if they use characters
in the range 0x80 - 0x9f.  Which variant?  Not so easy.

Telling the difference between ISO-8859-* variants is not easy.
ISO-8859-16 has no unassigned characters in the range used by
ISO-8859-*.  The others have only a few.

The same applies to Windows-12XX variants.
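
A toy version of that triage, as a minimal Python sketch (far cruder
than the program described above, and the labels are guesses by
construction):

    def guess_charset(data):
        """Very rough charset triage along the lines described above."""
        if all(b < 0x80 for b in data):
            return "US-ASCII"              # pure 7-bit: unambiguous
        try:
            data.decode("utf-8")           # leader/following bytes must line up
            return "probably UTF-8"
        except UnicodeDecodeError:
            pass
        if any(0x80 <= b <= 0x9F for b in data):
            return "Windows-125x (which one? not so easy)"
        return "some ISO-8859-* variant (telling them apart is hard)"

    print(guess_charset("café".encode("iso-8859-1")))  # ISO-8859-* guess
    print(guess_charset("café".encode("utf-8")))       # probably UTF-8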



Well, if they send the tag, you can translate it to UTF-8, but if
they don't, you may have no choice but to label it all "rat meat".



It sounds like there is room for a lot of improvement in mail clients.

Gordon L. Burditt

Re: Is it time for a new charset in the Digest? [telecom]

Yes, you can be confident that UTF-8 is not going away.  Everything in
the IETF that needs characters beyond ASCII uses UTF-8.

Unicode is a character set with code points up to 21 bits (0x10FFFF).
UTF-8 is a very clever way of encoding Unicode characters into
variable-length groups of 8-bit bytes.  The 0-127 ASCII-compatible
range is represented as itself, and the longer encodings are
self-synchronizing.
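
Self-synchronizing means a decoder can find the next character boundary
starting from any byte offset; a minimal Python sketch of the idea:

    data = "naïve café".encode("utf-8")

    def next_boundary(buf, i):
        """Skip continuation bytes (10xxxxxx) to the next character start."""
        while i < len(buf) and (buf[i] & 0xC0) == 0x80:
            i += 1
        return i

    # Land mid-character at offset 3 (inside the two bytes of 'ï'),
    # resynchronize, and decoding resumes cleanly:
    i = next_boundary(data, 3)
    print(data[i:].decode("utf-8"))    # 've café'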


Quite right.


Not quite.  For Unicode values between 0x080 and 0x7ff, if the bits in
the character are ABCDEFGHIJK, the UTF-8 bytes are 110ABCDE 10FGHIJK.
The value in the high bits of the first byte tells you how many more
bytes follow.  The Wikipedia article explains this well.


For this application BOMs aren't important.

R's,
John
