Encoding is a Headache

I have to spend way too much of my programming time worrying about character encodings. Take my latest module, Text::Markup, for example. The purpose of the module is very simple: give it the name of a file, and it will figure out the markup it uses (HTML, Markdown, Textile, whatever) and return a string containing the HTML generated from the file. Simple, right?

But, hang on. Should the HTML it returns be decoded to Perl’s internal form? I’m thinking not, because the HTML itself might declare the encoding, either in an XML declaration or via something like

<meta http-equiv="Content-type" content="text/html;charset=Big5" />

And as you can see, it’s not UTF-8. So if the string were decoded, that declaration would be lying. So it should be returned encoded, right? Parsers like XML::LibXML::Parser are smart enough to see such declarations and decode as appropriate.
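To make the lying-declaration problem concrete, here’s a minimal sketch using only the core Encode module (the Chinese sample text is just an illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# An HTML fragment whose meta element declares Big5, with a body that
# really is Big5-encoded (the string "中文").
my $bytes = qq{<meta http-equiv="Content-type" content="text/html;charset=Big5" />\n}
          . encode('Big5', "\x{4E2D}\x{6587}\n");

# Decoding to Perl's internal form gives us characters...
my $chars = decode('Big5', $bytes);

# ...but the meta element still says Big5. Anyone who now writes the
# string out as UTF-8 produces a document whose declaration lies.
my $utf8_octets = encode('UTF-8', $chars);
```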

But wait a minute! Some markup languages, like Markdown, don’t have XML declarations or headers. They’re HTML fragments. So there’s no way to tell the encoding of the resulting HTML unless it’s decoded. So maybe it should be decoded. Or perhaps it should be decoded, given an XML declaration that declares the encoding as UTF-8, and then encoded as UTF-8 before being returned.
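That last option might look something like this; a hedged sketch using only the core Encode module, with a made-up `to_utf8_html()` name:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Take the decoded Perl string a parser like Text::Markdown returns,
# prepend a declaration that pins down the encoding, and hand back
# UTF-8 octets. The function name is hypothetical.
sub to_utf8_html {
    my $html = shift;
    my $decl = qq{<?xml version="1.0" encoding="UTF-8"?>\n};
    return encode('UTF-8', $decl . $html);
}
```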

But, hold the phone! When reading in a markup file, should it be decoded before it’s passed to the parser? Does Text::Markdown know or care about encodings? And if it should be decoded, what encoding should one assume the source file uses? Unless it uses a BOM, how do you know what its encoding is?

Text::Markup is a dead simple idea, but virtually all of my time is going into thinking about this stuff. It drives me nuts. When will the world cease to be this way?

Oh, and if you have answers to any of these questions, please do feel free to leave a comment. I hate having to spend so much time on this, but I’d much rather do so and get things right (or close to right) than get them wrong.


Michael Peters wrote:

Reasonable defaults, configurable overrides

Whenever I'm faced with a similar API problem, I just make the default something that most people will want, and then allow them to override it somehow.

In this case, I think the default that most people these days want from an HTML fragment is UTF-8. But if they need something else, just let them pass an extra encoding parameter.
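Something like this, say; a sketch in which the `parse()` name and the `encoding` parameter are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Default to UTF-8, but let callers override with an encoding parameter.
sub parse {
    my ($octets, %opts) = @_;
    my $enc = $opts{encoding} || 'UTF-8';
    return decode($enc, $octets);
}

# Most callers just say parse($octets); the rest say
# parse($octets, encoding => 'Big5');
```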

dagolden wrote:

tried CPAN?

Any time I hit a problem that I think must have been solved before, I search CPAN. Have you tried any of the detectors for things that don't declare their encoding explicitly? E.g. Encode::Guess or Encode::Detect?
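For reference, a small sketch of how Encode::Guess behaves (it ships with the core Encode distribution); the sample strings are illustrative:

```perl
use strict;
use warnings;
use Encode::Guess;

# Encode::Guess tries a list of suspect encodings and returns an
# encoding object on success, or an error string when the data fits
# more than one suspect.
my $euro = "\xE2\x82\xAC";                 # the Euro sign as UTF-8 octets
my $enc  = guess_encoding($euro, 'euc-jp');
die "Can't guess: $enc" unless ref $enc;   # utf8 wins; invalid as EUC-JP
my $char = $enc->decode($euro);

# But hand it octets valid in more than one suspect and it refuses to
# choose: "café" in UTF-8 is also well-formed Latin-1, so this fails.
my $guess = guess_encoding("caf\xC3\xA9", 'latin1');
# ref $guess is false; $guess is an error string naming the candidates.
```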

Jerome Eteve wrote:

Encoding is not your concern

Encoding and decoding is a dangerous ground where trying to be clever can cause more damage than good.

Imho, the sanest way to deal with encoding/decoding is 'as close to the output/input as possible'.

In the Perl space, text strings must be Perl character strings.

Your module is in Perl and it creates HTML? Fine, let it just do that. You should return some HTML as a Perl string.

Now if your module's users want to output it as UTF-64, let them do so.

You may object that a lot of people don't know about encoding. Don't attempt to implement some magic to make their life easy, it never works.

If your module is capable of outputting entire HTML pages, provide a mechanism that allows users to inject their own fragments.
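Jerome's boundary principle in miniature, using core Encode (the sample string is illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Inside the program, work with Perl character strings only.
my $html = "<p>na\x{EF}ve</p>";

# Encode at the output boundary, in whatever encoding the caller wants.
my $octets = encode('UTF-8', $html);

# Equivalently, push the boundary into the filehandle:
#   binmode $fh, ':encoding(UTF-8)';
#   print {$fh} $html;
```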

Theory wrote:


@Michael—Yep, that's what I'm doing. Mostly. I'm using File::BOM to open files, and if there is no BOM, falling back on UTF-8. I'm not sure I want to add a parameter, though, because some files should be read as raw bytes (like HTML).
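The File::BOM-with-UTF-8-fallback approach can be approximated in core Perl alone; a rough sketch with a made-up `open_markup()` name (File::BOM itself handles more cases, such as stripping a UTF-8 BOM):

```perl
use strict;
use warnings;

# Sniff the file's first octets for a BOM, pick an :encoding() layer
# accordingly, and default to UTF-8 when there is no BOM.
sub open_markup {
    my $file = shift;
    open my $raw, '<:raw', $file or die "Cannot open $file: $!";
    read $raw, my $lead, 4;
    close $raw;
    my $enc = $lead =~ /^\xFF\xFE\x00\x00/ ? 'UTF-32LE'
            : $lead =~ /^\x00\x00\xFE\xFF/ ? 'UTF-32BE'
            : $lead =~ /^\xFF\xFE/         ? 'UTF-16LE'
            : $lead =~ /^\xFE\xFF/         ? 'UTF-16BE'
            :                                'UTF-8';
    open my $fh, "<:encoding($enc)", $file or die "Cannot open $file: $!";
    return $fh;
}
```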

@dagolden—I think those are kind of error-prone. I've used them in the past, along with Encode::Detect::Detector. They can get things wrong, and disagree with one another. I'm more comfortable insisting on UTF-8 or being unambiguous (by using a BOM or HTML header or something).

@Jerome—Yep, wanting to avoid cleverness. But different parsers expect different things. And my module doesn't generate HTML. It uses a bunch of HTML-generating parsers that have different expectations for their inputs. HTML, for example, should not be decoded to Perl's internal form. It should have an encoding declaration (header or meta element), and the parser will decode it for you. So, no magic, with you there, but gotta do my best to give the various parsers the values they'll work best with.
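Those differing expectations could be captured in a small dispatch table; a sketch with hypothetical entries, not Text::Markup's actual internals:

```perl
use strict;
use warnings;

# Each format says whether its parser wants raw octets or decoded
# Perl character strings.
my %wants = (
    html     => 'raw',    # carries its own declaration; parser decodes
    markdown => 'chars',  # Text::Markdown works on decoded strings
    textile  => 'chars',
);
```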

david nicol wrote:

the tipjar way

I have long held that a future Perl should abandon the string-is-a-sequence-of-octets paradigm for a more rope-like abstraction in which fragments each get their own encodings. Not only would such a refactoring change the entire Unicode headache from a matter of conditional code paths depending on flags to a matter of parallel methods, but it also solves headaches such as the ones you are kvetching about in this post. So the document declares that the middle 321 octets are encoded as Big5 Chinese? Store those octets as a Big5 string within the larger rope.

I don't have the thousand spare tuits to make this happen, but it's my considered recommendation.