NetBSD Blog

=?iso-8859-8-i?Q? Handling non-UTF-8 Hebrew email

June 10, 2018 posted by Maya Rashish

I like to use CLI email clients (mutt). This by itself is not unusual, but I happen to do this while speaking a language written right-to-left, Hebrew.
Decent bidi support in CLI tools is rare, so my impression is that very few people do this.

In the dark ages before Unicode, Hebrew used its own encodings which allowed typing both Latin and Hebrew letters: Windows-1255, ISO-8859-8.
I speculate that people initially expected input to be written in reverse order (aka "visual order"), assuming that everything will display text left to right.

When people wanted to use e-mail, they decided to state the charset in a header line as others do, and to use quoted-printable or base64 to avoid the content being mangled by clueless servers (8BITMIME wasn't around then).
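
As a rough sketch of what that wrapping does (my own illustration, not taken from any particular mail client), here is a four-letter Hebrew word turned into ISO-8859-8 bytes and then into the two 7-bit-safe transfer encodings, using only Python's standard library. The variable names are made up for the example.

```python
import base64
import quopri

# Four Hebrew letters (mem, alef, yod, he), written as escapes so the
# snippet does not depend on the encoding of the source file itself.
word = '\u05de\u05d0\u05d9\u05d4'

# One byte per letter in ISO-8859-8, each with the high bit set.
raw = word.encode('iso-8859-8')

# Two ways to get those bytes through a 7-bit mail path.
qp = quopri.encodestring(raw)
b64 = base64.b64encode(raw)

# Both results are plain ASCII, so printing them is safe anywhere.
print(qp.decode('ascii'))
print(b64.decode('ascii'))
```

Either form survives a server that strips the eighth bit; the receiving client undoes the transfer encoding first and only then applies the charset.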

But then they thought about bidi, and realized that writing in reverse isn't that great when you can have some bidi support. I've yet to write a bidi algorithm, but I suspect it makes line-wrapping illogical.

To avoid conflicts with existing emails, they decided on separate encoding names for the purpose of conveying that the information isn't in reverse:
iso-8859-8-i: the content is in logical order, and Hebrew is assumed to be rtl.
iso-8859-8-e: the text direction is explicit using control codes.

The latter is a neat idea, but hasn't caught on. Now it's common to assume logical order, and even iso-8859-8 might be in that format.
While defining this, they've also done the same for Arabic (iso-8859-6).
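
To make the visual/logical distinction concrete, here is a small sketch of my own. It only covers a line made purely of Hebrew letters, where visual order is simply the reverse of logical order; mixed Hebrew and Latin text needs a real bidi algorithm, which this does not attempt.

```python
# Logical order: the letters in the order they are read and typed
# (mem, alef, yod, he).
logical = '\u05de\u05d0\u05d9\u05d4'

# Visual order for a pure-Hebrew line: the same letters stored reversed,
# so that a display which only draws left to right still looks correct.
visual = logical[::-1]

# The byte values are identical in both conventions; only the order differs.
logical_bytes = logical.encode('iso-8859-8')
visual_bytes = visual.encode('iso-8859-8')

print(logical_bytes.hex())
print(visual_bytes.hex())
```

Nothing in the bytes themselves says which convention was used, which is why the charset label had to carry that information.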

This is a discussion that should've been part of the past - Unicode is now a thing, and I can send messages that contain Hebrew, Arabic, Chinese, English - without flipping back and forth in encoding (if that was ever even possible?), and out of the box! Never a need to enable support for specifying charset. Unicode has a detailed algorithm for handling bidi.
Unicode is love. Unicode is life. Use Unicode.
But I recently was looking for work, and HR's presumed Microsoft Outlook MUA did not use Unicode.

One of the emails I got was encoded as iso-8859-8-i.
It turns out my MUA setup cannot handle this charset. The body ended up looking like \344 things (octal escapes for the raw bytes), and the subject as boxes.
Email is a plaintext format with extensions hacked into it, so you can view the raw content as a file. I used 'e' in mutt to open it:

Subject: =?iso-8859-8-i?B?base64stuff

(The magical 'encode this in a different way' for email subjects)

Content-Type: text/plain; charset="iso-8859-8-i"
Content-Transfer-Encoding: quoted-printable

So this is an iso-8859-8-i file.
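
The Subject header above uses the encoded-word syntax, which Python's email.header.decode_header can split into bytes plus a charset label. The sketch below is only an illustration: decode_subject and LEGACY_LABELS are names I made up, the sample header is invented rather than taken from the real email, and mapping the -i/-e labels onto plain iso-8859-8 is my assumption that the byte values are the same.

```python
from email.header import decode_header

# Hypothetical mapping: labels treated as plain ISO-8859-8 for decoding.
LEGACY_LABELS = {'iso-8859-8-i': 'iso-8859-8', 'iso-8859-8-e': 'iso-8859-8'}

def decode_subject(raw_header):
    """Decode an encoded-word header, tolerating the -i/-e charset labels."""
    pieces = []
    for data, charset in decode_header(raw_header):
        if isinstance(data, str):
            # Headers without any encoded-word come back as str already.
            pieces.append(data)
            continue
        label = (charset or 'ascii').lower()
        label = LEGACY_LABELS.get(label, label)
        pieces.append(data.decode(label, errors='replace'))
    return ''.join(pieces)

# Invented sample: the base64 of the bytes EE E0 E9 E4.
subject = decode_subject('=?iso-8859-8-i?B?7uDp5A==?=')
print(ascii(subject))  # ascii() keeps the output printable on any terminal
```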

OK, let's just read this file. I've got python.
I saved the file, which looked like this in its raw format:

=EE=E0=E9=E4

That is, quoted-printable. Gotta turn that into raw bytes, then convert ISO-8859-8 to UTF-8.

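import quopri
import sys

# quopri works on bytes, so read stdin as bytes rather than text
rawmsg = sys.stdin.buffer.read()
# Undo quoted-printable: "=EE=E0" becomes the bytes 0xEE 0xE0
notutf8msg = quopri.decodestring(rawmsg)
# Interpret the bytes as ISO-8859-8 (same byte values as iso-8859-8-i)
utf8msg = notutf8msg.decode('iso-8859-8')
print(utf8msg)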

Cool. I can read the message. I even discover 'fribidi' isn't just a library, but also provides a command I can pipe this into and see nicely-formatted Hebrew even without using weirdo terminal emulators.

But let's not leave bugs like that lurking around. It is my duty as an RTL warrior to fix it.

One of the perks of using pkgsrc/NetBSD and open source is that I can immediately look at mutt's source code. I knew it could handle iso-8859-8, so that's what I looked for.

The number of results (combined with experience) quickly suggested that the encoding is handled by the OS, NetBSD in this case.
NetBSD didn't know about iso-8859-8-i.

Experience meant I knew to look in either src/lib/ (wasn't there) or src/share/ for 'data used by things'. I searched for 'iso-8859-8' to see if it appears anywhere, and found it. It was good to see that NetBSD has a way to alias charsets as being equivalent, so I added iso-8859-8-i there and did a full build, because I didn't know how the files are used.
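
I won't reproduce the NetBSD data files here. Purely as an analogy, Python's codec registry can be taught the same kind of equivalence with codecs.register. This is a sketch of the aliasing idea, not the NetBSD mechanism; HEBREW_ALIASES and hebrew_alias_search are names invented for the example, and a given Python version may or may not already know these labels.

```python
import codecs

# Labels treated as equivalent to plain ISO-8859-8 (assumption: the byte
# values are the same and only the promised text direction differs).
HEBREW_ALIASES = {'iso_8859_8_i', 'iso_8859_8_e'}

def hebrew_alias_search(name):
    """Codec search function mapping the -i/-e labels onto iso-8859-8."""
    # Depending on the Python version, the registry passes the name with
    # hyphens kept or already turned into underscores, so normalise first.
    normalised = name.lower().replace('-', '_')
    if normalised in HEBREW_ALIASES:
        return codecs.lookup('iso-8859-8')
    return None  # not ours; let other search functions try

codecs.register(hebrew_alias_search)

decoded = b'\xee\xe0\xe9\xe4'.decode('iso-8859-8-i')
print(ascii(decoded))  # ascii() keeps the output printable on any terminal
```

The appeal of an alias, in either system, is that every program using the shared lookup picks it up without being patched individually.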

Testing locally, I could now read the email with mutt! But what about replying?
Then again, I have a weird email setup. I had a hard time setting up a remote POP/IMAP thing, so I ssh to sdf.org and email from there. And I can't change their libc or their install.
Hoping to just elide all the corrupted characters and reply with UTF-8 was too optimistic - mutt wanted to reply in the original encoding, and again could not handle it properly.

Well, I'll just put in my updated libc, and LD_PRELOAD it, then!
Except, after ktracing it (via 'ktruss -i mutt |grep esdb'), it turns out that it opens a file in /usr/share/i18n/ to figure out charset aliases, so preloading a new libc isn't enough.
I'll need to tell it to look somewhere I can modify.
I edited paths.h, which is where the lookup path is stored, changed it to my home on sdf.org, and then built myself a fresh libc.
(It was during this that I realized I could've just edited the email to say it's iso-8859-8, rather than iso-8859-8-i)
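
That lazier route can be sketched in a few lines. This is a blunt illustration under my own assumptions: relabel_charset is a made-up helper, the sample message is invented, and the substitution would also rewrite the label if it happened to appear literally in the message body.

```python
import re

# Matches the -i / -e labels case-insensitively wherever they appear,
# e.g. in a Content-Type parameter or inside an encoded-word header.
_LABEL = re.compile(rb'iso-8859-8-[ie]\b', re.IGNORECASE)

def relabel_charset(raw_message: bytes) -> bytes:
    """Return the raw message with -i/-e labels rewritten to iso-8859-8."""
    return _LABEL.sub(b'iso-8859-8', raw_message)

# Invented sample message, not the real email.
sample = (
    b'Subject: =?iso-8859-8-i?B?7uDp5A==?=\r\n'
    b'Content-Type: text/plain; charset="iso-8859-8-i"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'=EE=E0=E9=E4\r\n'
)
fixed = relabel_charset(sample)
print(fixed.decode('ascii'))
```

The encoded bytes are left untouched; only the label changes, which is all a client that already knows iso-8859-8 needs.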

A few minor setbacks, and I could finally reply to the email, saying that yes, I will show up to the job interview.

I leave you with this tidbit from the RFC announcing these encodings and that finally, emails in Hebrew are possible:
"Within Israel there are in excess of 40 Listserv lists which will now start using Hebrew for part of their conversations."
Hurray!

Comments:

Excellent work and fun to read ;-) After that you get your job, right? Because if not........

Posted by x on June 15, 2018 at 06:03 PM UTC #