String transliteration - best strategy?

kip · November 4, 2018, 12:54am

One of my projects involves transliterating digits from one number system to another. For example:

iex> roman_digits = "0123456789"
iex> arab_digits = "٠١٢٣٤٥٦٧٨٩"
iex> transliterate("123", from: roman_digits, to: arab_digits)
"١٢٣"

What is the fastest transliteration strategy? Currently I generate a set of functions that translate each code point from one number system to another and then join the results. Is there a faster or better method? Would map lookups be faster? These would all be “small maps” (< 32 entries), since the symbol set is defined to be 0..9, +, -, \s, . and E.
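For comparison, the map-lookup alternative asked about above could look roughly like the sketch below. The module name `MapTranslit` and the `build_map/2` helper are hypothetical, made up for illustration, and this is not the implementation used in the project. It assumes valid UTF-8 input and passes any code point without a mapping through unchanged.

```elixir
defmodule MapTranslit do
  # Hypothetical helper: builds a code point -> code point lookup map
  # from two digit strings of equal length.
  def build_map(from, to) do
    from
    |> String.to_charlist()
    |> Enum.zip(String.to_charlist(to))
    |> Map.new()
  end

  # Walks the binary one code point at a time and builds the result on the fly.
  def transliterate(string, map), do: do_transliterate(string, map, <<>>)

  defp do_transliterate(<<>>, _map, acc), do: acc

  defp do_transliterate(<<char::utf8, rest::binary>>, map, acc) do
    # Code points without an entry in the map are copied through as-is.
    do_transliterate(rest, map, <<acc::binary, Map.get(map, char, char)::utf8>>)
  end
end

map = MapTranslit.build_map("0123456789", "٠١٢٣٤٥٦٧٨٩")
IO.puts(MapTranslit.transliterate("123", map))
```

Whether this beats generated function clauses would have to be measured; the map can be built once and reused across calls.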

NobbZ · November 4, 2018, 9:02am

I think helper functions that pattern-match directly on the binary, building the new binary on the fly, should be the most efficient way. Roughly:

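# Base case: the input is consumed, so return the accumulated binary.
def western_to_eastern(<<>>, acc), do: acc

# Generate one function clause per digit pair at compile time.
~c[0123456789]
|> Enum.zip(~c[٠١٢٣٤٥٦٧٨٩])
|> Enum.each(fn {western, eastern} ->
  def western_to_eastern(<<unquote(western)::utf8, r::binary>>, acc),
    do: western_to_eastern(r, <<acc::binary, unquote(eastern)::utf8>>)
end)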

Also, remember that both 0-9 and ٠-٩ are Arabic digits. The first is the European or Western style, the other the Eastern style. Roman numerals are I, M, C, X, L, etc.

PS: The code snippet is not tested, but it should give you an idea. I’m pretty sure I got the quoting/unquoting wrong, as I always do.

Schultzer · November 4, 2018, 6:28pm

Take a look at https://github.com/Schultzer/unidekode/blob/master/lib/unidekode.ex I do something similar.

kip · November 5, 2018, 3:12am

Thanks, much appreciated. This is basically what I’m doing now, except that you’re using a binary comprehension, which may be faster than joining at the end, so I’ll benchmark the difference for sure.
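To make the two variants mentioned here concrete, below is a minimal sketch that contrasts a binary comprehension with mapping and joining at the end. The module and function names (`ComprehensionTranslit`, `with_comprehension/1`, `with_join/1`, `map_char/1`) are hypothetical and not taken from either library; which variant is faster is left to a benchmark.

```elixir
defmodule ComprehensionTranslit do
  @western ~c[0123456789]
  @eastern ~c[٠١٢٣٤٥٦٧٨٩]

  # One clause per digit pair, generated at compile time.
  for {w, e} <- Enum.zip(@western, @eastern) do
    defp map_char(unquote(w)), do: unquote(e)
  end

  # Any other code point is passed through unchanged.
  defp map_char(other), do: other

  # Variant 1: binary comprehension that collects directly into a binary.
  def with_comprehension(string) do
    for <<char::utf8 <- string>>, into: "", do: <<map_char(char)::utf8>>
  end

  # Variant 2: map over a list of code points, then join once at the end.
  def with_join(string) do
    string
    |> String.to_charlist()
    |> Enum.map(&map_char/1)
    |> List.to_string()
  end
end

IO.puts(ComprehensionTranslit.with_comprehension("123"))
IO.puts(ComprehensionTranslit.with_join("123"))
```

Both functions are expected to return the same string for valid UTF-8 input, so they can be compared directly in a benchmark.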

There are about 80 different number systems I’m transliterating between. Most of them have digits 1…10 though a few have algorithmic transliteration (like the Hebrew numbering system). Fun with numbers!

kip · November 5, 2018, 3:14am

Thanks! That’s very similar to my current implementation. I’ll need to run some benchmarks to check what works out best. At least you and @NobbZ agree on the multi-clause function approach, which I’m also using.

dimitarvp · December 8, 2018, 6:54pm

I am very late here but kept this thread open since this interested me.

This is what I came up with (~4.82 μs per function call with the parameter "01xyz234abc567def890!"):

defmodule Translit do
  def western_to_eastern(string, acc \\ "")

  def western_to_eastern(<<>>, acc), do: acc

  ~c[0123456789]
  |> Enum.zip(~c[٠١٢٣٤٥٦٧٨٩])
  |> Enum.each(fn {western, eastern} ->
    def western_to_eastern(<<unquote(western)::utf8, rest::binary>>, acc), do: western_to_eastern(rest, acc <> unquote(to_string([eastern])))
  end)

  def western_to_eastern(<<any::utf8, rest::binary>>, acc), do: western_to_eastern(rest, acc <> to_string([any]))
end

Why unquote works without a quote block, though, I have no idea.

OvermindDL1 · December 10, 2018, 4:38pm

That’s because defs are implicit quote blocks.
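The Elixir documentation describes this under the name "unquote fragments". A minimal sketch of the same mechanism outside the transliteration code, using a hypothetical module `UnquoteDemo` and made-up name/value pairs:

```elixir
defmodule UnquoteDemo do
  # Hypothetical pairs, used only to illustrate the mechanism.
  # Each iteration defines one zero-arity function; unquote/1 injects the
  # compile-time values of `name` and `value` into the definition.
  for {name, value} <- [one: 1, two: 2] do
    def unquote(name)(), do: unquote(value)
  end
end

IO.inspect(UnquoteDemo.one())
IO.inspect(UnquoteDemo.two())
```

No explicit quote block is needed because def treats its contents as quoted code, which is what makes unquote legal in the snippets above.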