Lately I found myself having to process HTML that was not UTF-8 encoded (it was encoded in the Cyrillic code page known as cp1251 or Windows-1251). It struck me as very odd that I couldn’t easily convert the text to UTF-8, since Elixir’s String module only works with that, so I figured I’d dig for an hour or two. Here are the results for the 3 libraries I tried.
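To make the problem concrete, here is a minimal sketch using only what ships with Elixir and OTP (no extra libraries). The variable names are made up for illustration. It shows that cp1251 bytes are not valid UTF-8, and that the built-in :unicode module can convert Latin-1 but offers no cp1251 option, which is why a library is needed.

```elixir
# The word "Дата" encoded as cp1251 (the same bytes used in the examples further down).
cp1251_bytes = <<196, 224, 242, 224>>

# These bytes are not valid UTF-8, so String functions cannot treat them as text.
IO.inspect(String.valid?(cp1251_bytes))
# prints: false

# OTP's :unicode module handles Latin-1 (plus the UTF-8/16/32 family),
# so a Latin-1 binary can be converted without any dependency.
latin1_bytes = <<230, 248, 229>>
IO.inspect(:unicode.characters_to_binary(latin1_bytes, :latin1, :utf8))
# prints: "æøå"
```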
Codepagex (Elixir)
I very much liked the idea of the library, but after trying every way I could think of to configure it to include cp1251, it still didn’t (UPDATE: see the end of this section). Basically, it works like this (copied from the GitHub page):
iex> Codepagex.from_string("æøåÆØÅ", :iso_8859_1)
{:ok, <<230, 248, 229, 198, 216, 197>>}
iex> Codepagex.to_string(<<230, 248, 229, 198, 216, 197>>, :iso_8859_1)
{:ok, "æøåÆØÅ"}
If you want to inspect which encodings are available:
Codepagex.encoding_list() # Pass :all to also list the encodings that are not enabled.
Supposedly you can add other encodings that aren’t enabled by default through the config; check the linked GitHub page for that. I wasn’t able to make cp1251 even appear in the list of currently loaded encodings, even though I used all the formats the author says are supported. So I gave up on it.
Still, this is the one I like the most. If the encodings that are supported out of the box are good enough for you, I recommend the library.
UPDATE by @michalmuskala: you actually can enable those code pages. Minimal step-by-step:
- Add this to your config/config.exs:
config :codepagex, :encodings, [
"VENDORS/MICSFT/WINDOWS/CP1251"
]
- Run this:
mix deps.compile codepagex --force
- Do this in iex:
iex> Codepagex.to_string(<<196, 224, 242, 224>>, :"VENDORS/MICSFT/WINDOWS/CP1251")
{:ok, "Дата"}
(CC-ing the author, @tallakt, with apologies for the misunderstanding.)
elixir-mbcs (Elixir)
An Elixir wrapper around erlang-mbcs.
To install it I had to include this in my mix.exs:
{:elixir_mbcs, github: "woxtu/elixir-mbcs", tag: "0.1.3"}
It goes like this (copied from the GitHub page):
# Start mbcs server
iex> Mbcs.start
:ok
# Convert UTF-8 to Shift_JIS
iex> Mbcs.encode!("九条カレン", :cp932)
<<139, 227, 143, 240, 131, 74, 131, 140, 131, 147>>
# Convert Shift_JIS to UTF-8, and return as a list
iex> Mbcs.decode!([139, 227, 143, 240, 131, 74, 131, 140, 131, 147], :cp932, return: :list)
[20061, 26465, 12459, 12524, 12531]
It seems to support more encodings than Codepagex.
Why I skipped it:
- I was doing a hobby project on Windows and, as hard as I tried, I couldn’t get MINGW’s make to work well enough to compile the .erl files. This is entirely Windows-specific and in no way a mark against the library!
- I dislike having to explicitly start a server so I can use a library. Then again, you can make a very thin wrapper around the starting function, insert it in your supervision tree and forget about that nuisance until the end of time. This didn’t suit me for a quick hobby project but, again, it is in no way a real drawback of the library.
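The thin wrapper mentioned in the list above could look roughly like this. It is only a sketch: MbcsStarter and the :start_fun option are hypothetical names, not part of elixir-mbcs, and it assumes Mbcs.start/0 returns :ok as in the example copied from the GitHub page. The start function is injectable so the wrapper can be tried out without the library installed.

```elixir
defmodule MbcsStarter do
  # Hypothetical helper, not part of elixir-mbcs.
  # Lets a supervisor run the library's start function once at boot.

  def child_spec(opts) do
    # Defaults to Mbcs.start/0; :start_fun exists only so a stub can be injected.
    start_fun = Keyword.get(opts, :start_fun, &Mbcs.start/0)

    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [start_fun]},
      # Run once; never restart.
      restart: :temporary
    }
  end

  # Called by the supervisor while it starts its children.
  # Returning :ignore tells the supervisor there is no process to watch.
  def start_link(start_fun) do
    :ok = start_fun.()
    :ignore
  end
end

# Usage, assuming elixir-mbcs is in your deps:
#   Supervisor.start_link([MbcsStarter], strategy: :one_for_one)
```

Because the start function runs inside the supervisor while it boots its children, anything listed after MbcsStarter in the child list starts only once the call has returned.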
All in all, if I did my hobby projects on Linux or macOS (working on it; the MBP 2015 is still patiently waiting for me to get used to a laptop keyboard, which I still can’t) and if I had to cover more encodings, then I would definitely go for this one, even though I liked Codepagex’s API better.
erlyconv (Erlang)
This is the one I ended up using. Install it like so in your mix.exs:
{:erlyconv, github: "eugenehr/erlyconv"}
(I learned that day that you can just import Erlang projects in your Elixir projects and was very pleasantly surprised!)
My example usage:
iex> :erlyconv.from_unicode(:cp1251, "Дата")
<<196, 224, 242, 224>>
iex> :erlyconv.to_unicode(:cp1251, <<196, 224, 242, 224>>)
"Дата"
No server to start, no wrappers, no need for configuration. Do note that it looks like it supports somewhat fewer encodings than elixir-mbcs / erlang-mbcs.
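If the byte values in these examples look opaque: in Windows-1251 the bytes 0xC0..0xFF map linearly onto the code points U+0410..U+044F (А to я), so that slice of the table can be sketched by hand. The module below is purely illustrative. Cp1251Sketch is a made-up name, it covers only ASCII plus that one range, and it is not a substitute for any of the libraries above.

```elixir
defmodule Cp1251Sketch do
  # Hypothetical module, for illustration only.
  # Handles ASCII and the bytes 0xC0..0xFF (the letters А..я);
  # every other byte is reported as unsupported.

  def to_utf8(bytes) when is_binary(bytes), do: decode(bytes, [])

  # End of input: turn the collected code points into a UTF-8 string.
  defp decode(<<>>, acc), do: {:ok, acc |> Enum.reverse() |> List.to_string()}

  # ASCII bytes mean the same thing in cp1251 and in Unicode.
  defp decode(<<b, rest::binary>>, acc) when b < 0x80,
    do: decode(rest, [b | acc])

  # 0xC0..0xFF map linearly onto U+0410..U+044F.
  defp decode(<<b, rest::binary>>, acc) when b >= 0xC0,
    do: decode(rest, [b - 0xC0 + 0x0410 | acc])

  # 0x80..0xBF hold punctuation and other letters; not covered by this sketch.
  defp decode(<<b, _rest::binary>>, _acc),
    do: {:error, {:unsupported_byte, b}}
end

IO.inspect(Cp1251Sketch.to_utf8(<<196, 224, 242, 224>>))
# prints: {:ok, "Дата"}
```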
Takeaways
- I would definitely try to contribute to Codepagex if or when I get the time, because its approach looks really good: it works directly with files downloaded from the Unicode Consortium. If parsing those reliably can lead to maximum support for encodings, then I’d be all for that.
- For now I cannot contribute to Erlang libraries because I don’t know the language well enough, but I wouldn’t try to work with erlang-mbcs yet anyway. It’s a bit strange to me that a text transcoding library needs a server, or that it needs to invoke make. I am guessing it’s old and would make better use of Erlang’s tooling if it were written today. But of course I might be ignorant here, so I can’t really claim anything as fact. Subjective opinion: erlang-mbcs is the clunkiest of the three. Still, it looks like it supports the most encodings.
- I’d try to help erlyconv because I liked its very minimalistic approach. I am using it in my hobby project right now and I am very happy to have something that JustWorks™.
I’d love additional input from Erlang folks (@rvirding and @joeerl if they don’t mind being mentioned) or anybody else who has struggled with text transcoding.
Thanks for reading! Hope this was helpful to you.