Sharing with the community: text transcoding libraries


#1

Lately I found myself having to process HTML that was not UTF-8 encoded (it was encoded in the Cyrillic code page known as cp1251 or Windows-1251). It struck me as very odd that I couldn’t easily convert the text to UTF-8 – since Elixir’s String only works with that – so I figured I’d dig for an hour or two. Here are my results with the three libraries I tried.

Codepagex (Elixir)

I very much liked the idea of the library, but after trying every possible way to configure it to include cp1251, it still didn’t work (UPDATE: see the end of this section). Basically, what happens is this (copied from the GitHub page):

iex> Codepagex.from_string("æøåÆØÅ", :iso_8859_1)
{:ok, <<230, 248, 229, 198, 216, 197>>}

iex> Codepagex.to_string(<<230, 248, 229, 198, 216, 197>>, :iso_8859_1)
{:ok, "æøåÆØÅ"}

If you want to inspect what codings are available:

Codepagex.encoding_list() # Pass :all to also get the encodings that are not enabled.

Supposedly you can add other encodings which aren’t enabled by default through the config – check the linked GitHub page for that. I wasn’t able to make cp1251 even appear in the list of currently loaded encodings, even though I used all the formats the author says are supported. So I gave up on it.

Still, this is the one I like the most. If the encodings supported out of the box are good enough for you, I recommend the library.

UPDATE by @michalmuskala: you actually can enable those code pages. Minimal step-by-step:

  1. Add this to your config/config.exs:

config :codepagex, :encodings, [
  "VENDORS/MICSFT/WINDOWS/CP1251"
]

  2. Force-recompile the dependency:

mix deps.compile codepagex --force

  3. Try it in iex:

iex> Codepagex.to_string(<<196, 224, 242, 224>>, :"VENDORS/MICSFT/WINDOWS/CP1251")
{:ok, "Дата"}

(CC-ing the author @tallakt, with apologies for the misunderstanding.)

elixir-mbcs (Elixir)

An Elixir wrapper around erlang-mbcs.

To install it I had to include this in my mix.exs:

{:elixir_mbcs, github: "woxtu/elixir-mbcs", tag: "0.1.3"}

It goes like this (copied from the GitHub page):

# Start mbcs server
iex> Mbcs.start
:ok

# Convert UTF-8 to Shift_JIS
iex> Mbcs.encode!("九条カレン", :cp932)
<<139, 227, 143, 240, 131, 74, 131, 140, 131, 147>>

# Convert Shift_JIS to UTF-8, and return as a list
iex> Mbcs.decode!([139, 227, 143, 240, 131, 74, 131, 140, 131, 147], :cp932, return: :list)
[20061, 26465, 12459, 12524, 12531]

It seems to support more encodings than Codepagex.

Why I skipped it:

  • I was doing a hobby project on Windows and, as hard as I tried, I couldn’t get MinGW’s make to work well enough to compile the .erl files. This is entirely Windows-specific and in no way damning of the library!

  • I dislike having to explicitly start a server so I can use a library. Then again, you can make a very thin wrapper around the starting function, insert it in your supervision tree and forget about that nuisance until the end of time. This didn’t suit me for a quick hobby project but again, is in no way a real drawback of the library.

All in all, if I did my hobby projects on Linux or macOS (working on it; the 2015 MBP still patiently awaits me getting used to a laptop keyboard, which I still can’t :roll_eyes:) and if I had to cover more encodings, then I would definitely go for this one, even though I liked Codepagex’s API better.
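The “thin wrapper in your supervision tree” idea from the second bullet can be sketched roughly like this. This is my own toy module, not part of elixir-mbcs; the start function is passed in as an argument (e.g. `&Mbcs.start/0`) so the sketch stays self-contained:

```elixir
defmodule TranscoderStarter do
  # A thin Task wrapper that runs a library's one-off start function
  # so the "start a server first" step can live in a supervision tree
  # and be forgotten about.
  use Task, restart: :transient

  def start_link(start_fun) do
    Task.start_link(fn -> :ok = start_fun.() end)
  end
end

# In your application's supervision tree (hypothetical):
# children = [{TranscoderStarter, &Mbcs.start/0}]
```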

erlyconv (Erlang)

This is the one I ended up using. Install it like so in your mix.exs:

{:erlyconv, github: "eugenehr/erlyconv"}

(I learned that day that you can just import Erlang projects in your Elixir projects and was very pleasantly surprised!)

My example usage:

iex> :erlyconv.from_unicode(:cp1251, "Дата")
<<196, 224, 242, 224>>

iex> :erlyconv.to_unicode(:cp1251, <<196, 224, 242, 224>>) 
"Дата"

No server to start, no wrappers, no configuration needed. Do note it looks like it supports somewhat fewer encodings than elixir-mbcs / erlang-mbcs.


Takeaways

  • I would definitely try to contribute to Codepagex if or when I get the time, because its approach looks really good – it works directly with files downloaded from the Unicode consortium. If parsing those reliably can lead to maximum encoding support, then I’d be all for that.

  • For now I cannot contribute to Erlang libraries since I don’t know the language well enough, but I wouldn’t try to work with erlang-mbcs yet anyway. It’s a bit strange to me that a text transcoding library needs a server, or why it needs to invoke make – I’m guessing it’s old and would make better use of Erlang’s tooling if it were written today. But of course I might be ignorant here, so I can’t really claim anything as fact. Subjective opinion: erlang-mbcs is the clunkiest of the three. Still, it looks like it supports the most encodings.

  • I’d try to help erlyconv because I liked its very minimalistic approach. I am using it in my hobby project right now and am very happy to have something that JustWorks™.

I’d love additional input from Erlang folks (@rvirding and @joeerl if they don’t mind being mentioned) or anybody else who has struggled with text transcoding.

Thanks for reading! Hope this was helpful to you.


#2

Have you explicitly recompiled the codepagex package with mix deps.compile codepagex --force after changing the configuration? Dependencies don’t pick up configuration changes automatically if they read the config at compile time.
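A minimal illustration of why this bites (the module name and the `[:default_only]` fallback are made up for this sketch, not codepagex’s actual internals): a dependency that reads its config through a module attribute or `Application.compile_env/3` bakes the value into its .beam when the dep is compiled, so later config changes are invisible until you force a recompile.

```elixir
defmodule CompileTimeConfigDemo do
  # The value is read once, when THIS module is compiled,
  # and frozen into the generated .beam file.
  @encodings Application.compile_env(:codepagex, :encodings, [:default_only])

  def encodings, do: @encodings
end

# Changing the :codepagex config afterwards does not affect this call
# until the module above is recompiled:
CompileTimeConfigDemo.encodings()
```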


#3

Updated the post, and yes, it worked. I am a bit disappointed, though: how many people will actually try this? I am in no way an advanced user, but must I really be one to catch this? :frowning:

At the very least, I can make a PR suggesting this step in the GitHub README.md so others will know what to do.


#4

Updated the post, and yes, it worked. I am a bit disappointed, though: how many people will actually try this? I am in no way an advanced user, but must I really be one to catch this?

I think it’s mostly up to the library authors. It should be possible to change the config at runtime, I think.

In the case of codepagex, I’d try building the mapping modules at runtime with compile:forms, and then expose some config functions which would update the app env and trigger recompilation (code:soft_purge, compile:forms, code:load_binary).
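In Elixir terms, an analogue of that suggestion could look something like this. It is a toy sketch with an invented module name and a two-entry mapping, not codepagex code; `Module.create/3` compiles and loads the module at runtime, and re-running it with new data redefines the module (with a redefinition warning):

```elixir
# Build a decoding module at runtime from a byte -> string mapping.
mapping = [{<<0xC4>>, "Д"}, {<<0xE0>>, "а"}]

# One quoted function clause per mapping entry.
clauses =
  for {byte, char} <- mapping do
    quote do
      def decode(unquote(byte)), do: unquote(char)
    end
  end

# Compile and load the module now, at runtime.
Module.create(RuntimeMapping, {:__block__, [], clauses}, Macro.Env.location(__ENV__))

RuntimeMapping.decode(<<0xC4>>)
# => "Д"
```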

In general, I try to design my own libraries in such a way that every config option can be changed at runtime.


#5

I opened a PR: https://github.com/tallakt/codepagex/pull/19/


#6

I am never going to heavily criticize people who do open source. They do it in their own time, voluntarily. The minimum viable improvement here is a few lines of documentation via a contributed PR – which I just did.


#7

This is IMO one of the negative consequences of the library taking user options at compilation time via config.exs. In most cases, I’d prefer if these options were taken at runtime, preferably as plain function arguments. If this is not a viable option (e.g. performance implications, or too complicated an interface) and the library team opts for compile-time generation, it would be better if the options were taken through a macro. So, for example, in this particular case something like the following would work better:

defmodule MyCodepagex do
  use Codepagex, encodings: [...]

  # ...
end

With such an interface, even if you take the encodings from config.exs (I don’t really see the need for that, but YMMV), changing the config script would lead to this module being recompiled, and everything would work as expected.

Not saying it’s the best approach though. If taking options at runtime works fine, this would be my preferred option.
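For the curious, the macro interface above could be implemented along these lines. This is a toy sketch: Codepagex does not actually provide such a `__using__` macro, the module names are invented, and the real work of generating mapping functions is elided.

```elixir
defmodule EncodingsViaMacro do
  # Hypothetical compile-time interface: the encodings are baked into
  # the CALLER's module, so editing the caller recompiles them.
  defmacro __using__(opts) do
    encodings = Keyword.fetch!(opts, :encodings)

    quote do
      def supported_encodings, do: unquote(encodings)
      # ...a real library would also generate the mapping functions here...
    end
  end
end

defmodule MyCodepagex do
  use EncodingsViaMacro, encodings: ["VENDORS/MICSFT/WINDOWS/CP1251"]
end

MyCodepagex.supported_encodings()
# => ["VENDORS/MICSFT/WINDOWS/CP1251"]
```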


#8

I imagine that since every code page must be compiled, the author opted for (a) a minimal set of code pages, so as not to introduce huge compilation times into people’s CI/CD workflows, and (b) not providing a runtime change option, since code pages must be compiled. (Although I would opt for that, à la the fastglobal library’s style.)

Still, it’s a nasty gotcha, and even though I am not an advanced Erlang / Elixir user, I am no idiot either – yet this wasn’t at all apparent to me until @michalmuskala pointed it out.

Some people would dislike your approach due to introducing one more module in the project but I feel it’s a very reasonable compromise for the benefits it gives you.

I am pondering a bigger PR along the lines of your suggestion. Thanks for validating. :023:


#9

The problem is that the current version is unfriendly to CI/CD, if you use caching. If you have a previously built version of the lib, then change the config, and push, the old version of the lib is going to be used from the cache.

If you’re not aware of this problem, and there are no tests related to encoding, you might unknowingly bump into the classical “works on my machine, but not in production” situation :slight_smile:

To fix this you’ll need to clear all the relevant caches manually, or otherwise somehow manually force rebuilding of the dep for each relevant build (e.g. target branch and open PRs).

If you want this to work automatically whenever an option is changed, I see the following options:

  • don’t use cache
  • force rebuild the dep on every build

Both of which will in fact increase the compilation time on CI/CD :slight_smile:


#10

Hm, yes. Especially the part where the dep is cached and everybody is wondering why the hell the thing is not working after a deployment… :frowning:

Hence I am wondering about a PR which would (1) introduce a config option for eager / lazy compilation of the code pages and (2) allow the code pages to be compiled at runtime (again, à la fastglobal).


#11

One problem with such an improvisation is that there will be some non-obvious gotchas. Some unexpected runtime recompilation might affect latency. Another subtle issue is that invoking functions at compile time becomes harder. If, as a client of the lib, I want to do something like this:

require Codepagex
@compile_time_data Codepagex.some_fun(...)

It still wouldn’t work, or even worse, it might silently work in a wrong (old) way.

This is why I prefer passing options as parameters to funs or macros. No special gotchas there. If you pass the option it works. If you don’t, it doesn’t :smiley:


#12

In the case of Codepagex, I am kind of sad that Elixir can’t compile the code pages to something like Lisp’s s-expressions (an OS-neutral format) and just bundle that with the library itself. That would rid us of these dilemmas forever.

As I see it, having to invoke the compiler in this particular case is more a peculiarity of the BEAM than desired behaviour; I view the compilation as a hack here. Will the .beam files really be different under Windows, Linux, and macOS? For what is basically a module with a bunch of functions that pattern-match on one and the same file input, no matter the OS? If the answer is “yes” then I’ll feel stupid, but do tell me if so! :003:

I can understand timex having to periodically update timezones, for example. It’s a different case however; I don’t think the coding page files from Unicode.org even change (might be wrong though).


#13

AFAIK beam files are portable across different OSes (probably not if HiPE is used, but I’m not really sure). I don’t have any official reference for that, but I can confirm that at my first production job I regularly used macOS as the build machine while production ran on Ubuntu, and I never had problems.

That said, building on one OS and deploying on another is not something I’d advise, since if you use some NIFs, they won’t work on the other OS (and you’ll only notice it when it hits production :slight_smile:).


#14

In fact, it just occurred to me that Elixir is shipped like this too. You can find precompiled beams on the releases page. If I’m not mistaken, asdf also just fetches the precompiled beam files.


#15

I’m wondering if I could just compile all the code pages locally and ship them as .beam files, then.


#16

The OS where you compile is not an issue - .beam files are portable. What is more problematic is the OTP version and Elixir version - you’d need to compile on the lowest version you support, but that means you sometimes can’t take advantage of improved APIs or improved compiler.


#17

The library could of course eagerly generate the beam files for all the code pages when it’s compiled. Whether that’s a good option depends on the size of the generated code. For example, the overhead might be too big for those deploying on embedded devices. No idea if this is the case though :smiley:


#18

That’s partially the problem: I’m not educated enough to know what a reasonable lowest version to support is.

What we have here is a library that generates modules and functions by parsing a text file. Does having a bunch of functions that pattern-match on textual input call for a certain minimum Elixir version? Is there a version that does this faster?

I seriously don’t know.
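To make the question concrete, here is a toy version of what such a library generates (the two-line mapping and its format are invented for this sketch, not the actual Unicode.org file format): pattern-matching function heads built at compile time with plain unquote fragments – nothing here needs a particularly recent Elixir.

```elixir
defmodule ToyCp1251 do
  # Two lines of a made-up "byte -> codepoint" mapping file.
  @mapping """
  0xC4 0x0414
  0xE0 0x0430
  """

  # One pattern-matching function head per mapping line,
  # generated at compile time via unquote fragments.
  for line <- String.split(@mapping, "\n", trim: true) do
    [byte_hex, cp_hex] = String.split(line)
    byte = String.to_integer(String.trim_leading(byte_hex, "0x"), 16)
    cp = String.to_integer(String.trim_leading(cp_hex, "0x"), 16)

    def to_utf8(unquote(byte)), do: unquote(<<cp::utf8>>)
  end
end

ToyCp1251.to_utf8(0xC4)
# => "Д"
```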


#19

I’m not sure I understand what the problem is here. The library can rely on metaprogramming to generate a bunch of modules during compilation. However, these modules will only be generated when the client project is compiled, so it’s up to the user to ensure the proper Erlang version is used (and the user always has this responsibility anyway, lib or no lib).


#20

^ That’s exactly my problem.

If you have a module with dozens of pattern-matching function heads, what’s the recommended minimum Erlang / Elixir version so you get maximum performance and maximum backwards portability?

I am clueless on these nuances still.