JS/Elixir string indexing interoperability

stefanluptak · April 19, 2020, 11:47am

Hi all!

I would like to ask for ideas on how to deal with the different ways JavaScript and Elixir approach strings and character indexes.

I want to store the same (user entered) text on the Elixir and also the JavaScript side. I also want to exchange messages describing operations like {:delete, from, to}, {:insert, position, text}, etc. to ensure the same text is on both sides.

The thing is that this becomes interesting as soon as the text contains emojis, characters with diacritics and so on.

Working with strings (binaries) in Elixir is pretty straightforward. Unfortunately (but not surprisingly), the JavaScript behavior is (in my opinion) a bit unintuitive.

// JavaScript
> String.fromCharCode(97, 769).slice(0, 1)
'a'
> String.fromCharCode(225).slice(0, 1)
'á'

# Elixir
iex> [97, 769] |> to_string() |> String.slice(0, 1)
"á"
iex> [225] |> to_string() |> String.slice(0, 1)    
"á"

What strategy should I use? Should I work with the text as a charlist on the Elixir side or should I normalize all strings everywhere? Or is there a better strategy I should take a look at?
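
For reference, here is a small JavaScript sketch of the mismatch described above. It counts the same text in three different units; the `units` helper is just an illustrative name, not an existing API, and it assumes a runtime where `TextEncoder` is available as a global (modern browsers and Node.js).

```javascript
// Count the same text in the three units that matter here.
// `units` is a hypothetical helper name, not an existing API.
function units(text) {
  return {
    utf16: text.length,                               // what slice/substring index by
    codePoints: Array.from(text).length,              // string iteration goes by code point
    utf8Bytes: new TextEncoder().encode(text).length, // comparable to Elixir's byte_size/1
  };
}

const decomposed = String.fromCharCode(97, 769); // "a" + combining acute accent
const precomposed = String.fromCharCode(225);    // single code point U+00E1
const thumbsUp = "\u{1F44D}";                    // one code point outside the BMP

console.log(units(decomposed));  // { utf16: 2, codePoints: 2, utf8Bytes: 3 }
console.log(units(precomposed)); // { utf16: 1, codePoints: 1, utf8Bytes: 2 }
console.log(units(thumbsUp));    // { utf16: 2, codePoints: 1, utf8Bytes: 4 }
```

None of these three numbers is a count of user-perceived characters, which is what String.slice works with on the Elixir side.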

Thank you all for your advice.

hauleth · April 19, 2020, 12:25pm

Unicode is hard.
Unicode is very hard.
Unicode is enormously hard.

If you want consistent behaviour, then operate on bytes: use Blob or Uint8Array on the JS side and binaries on the Elixir side. I think that would be the simplest way. I hope that what you are describing (I assume you want to create a distributed concurrent text editor) uses a CRDT like LSEQ.
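
One practical detail on the JS side: positions coming from the browser (a textarea's selectionStart, for example) are UTF-16 indexes, so they would have to be converted to byte offsets before being sent. A minimal sketch, assuming UTF-8 on both sides; `byteOffset` is a hypothetical helper name:

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: convert a JavaScript string index (UTF-16 code units)
// into a UTF-8 byte offset by encoding everything before that index.
// Assumes the index does not fall between the two halves of a surrogate pair.
function byteOffset(text, utf16Index) {
  return encoder.encode(text.slice(0, utf16Index)).length;
}

const text = String.fromCharCode(97, 769, 98); // "áb" with a combining accent
console.log(byteOffset(text, 0)); // 0
console.log(byteOffset(text, 2)); // 3 (after "a" + accent)
console.log(byteOffset(text, 3)); // 4 (after "b")
```

Encoding the whole prefix is linear in the position, so for long documents it may be worth caching offsets rather than recomputing them on every keystroke.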

stefanluptak · April 19, 2020, 1:03pm

Thank you for your reply @hauleth. I am not sure I understand your solution. I will try to write a pseudo-code example:

// JavaScript message to Elixir, somebody wrote "áb"
{:insert, 0, String.fromCharCode(97, 769, 98)} 
// JavaScript message to Elixir, somebody deleted "á" (but not b)
{:delete, 0, 2}

This will delete the “b” too, because Elixir is considering “á” to be 1 character and doesn’t care how many codepoints it has:

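text_from_javascript = to_string([97, 769, 98])
# String.slice/3 counts graphemes, so a length of 2 covers "á" and "b"
deleted_text = String.slice(text_from_javascript, 0, 2)
# => "áb", so the "b" gets deleted as well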

From my point of view, it might be safe (on the Elixir side) to convert all the strings to charlists and then use the List / Enum operations. Am I wrong?

P.S.: Yes, the concurrent/distributed operations are handled with CRDT. It’s just this unicode stuff I am trying to solve.

NobbZ · April 19, 2020, 1:09pm

The big question is, do you want to work on “graphemes” or on “codepoints”? If the latter, do you normalize first?

JavaScript string methods such as slice index by UTF-16 code units, which match “codepoints” only inside the Basic Multilingual Plane. JavaScript also does not normalise strings on its own; it just keeps whatever it gets from the operating system or the input.

stefanluptak · April 19, 2020, 1:12pm

I want to work on whichever is simpler and more consistent, or safer. I don’t do any normalization (yet?). But maybe it will be necessary. I am not sure. That’s why I am asking. I don’t want to overcomplicate it.
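
To show what normalization would and would not buy here, a short JavaScript sketch using the built-in String.prototype.normalize (Elixir has String.normalize/2 for the same purpose). The variable names are only for illustration:

```javascript
const decomposed = "a\u0301"; // "a" followed by a combining acute accent
const precomposed = "\u00e1"; // precomposed "á"

console.log(decomposed === precomposed);                  // false
console.log(decomposed.normalize("NFC") === precomposed); // true
console.log(decomposed.normalize("NFC").length);          // 1

// Normalization does not turn every user-perceived character into one unit:
// thumbs up + skin tone modifier stays two code points (four UTF-16 units).
const thumbs = "\u{1F44D}\u{1F3FD}";
console.log(thumbs.normalize("NFC").length);             // 4
console.log(Array.from(thumbs.normalize("NFC")).length); // 2
```

So NFC makes the two spellings of "á" agree, but indexes still mean different things on each side as soon as emojis are involved.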

NobbZ · April 19, 2020, 1:18pm

Do you want a and ^ to be considered separate, or as â?

If the latter, work on graphemes.

Read docs of string functions carefully to know if they work on graphemes or codepoints.

The same is true for the JS functions and methods you use.
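
As a rough illustration on the JS side: slice counts UTF-16 code units, while grapheme-based slicing needs something like Intl.Segmenter. This is a sketch that assumes a runtime which ships Intl.Segmenter (recent browsers and Node.js versions); `graphemeSlice` is a made-up helper name that takes a start and a length, like Elixir's String.slice/3.

```javascript
// Hypothetical helper: slice by grapheme clusters instead of UTF-16 units.
function graphemeSlice(text, start, length) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const graphemes = Array.from(segmenter.segment(text), (s) => s.segment);
  return graphemes.slice(start, start + length).join("");
}

const text = "a\u0301b"; // "áb" with a combining accent

console.log(text.slice(0, 1));                // "a" (the accent is cut off)
console.log(graphemeSlice(text, 0, 1));       // "á" (still a + combining accent)
console.log(graphemeSlice(text, 0, 1).length); // 2 UTF-16 units in one grapheme
console.log(graphemeSlice(text, 1, 1));       // "b"
```

Segmenting the whole string on every call is O(n), which is the same cost warning as for random access by grapheme in Elixir.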

Though I have to disappoint you. It will be complicated, no matter what. Most developers of string handling libraries either do not care about or do not even understand the differences.

Anyway, try to avoid random access of strings by codepoint or grapheme, it’s an O(n) operation!

hauleth · April 19, 2020, 1:33pm

Operating on bytes is the safest way; in short, you do not use the String module at all. This makes you independent of the encoding of your data. It would look like this:

// JavaScript message to Elixir, somebody wrote "áb"
["insert", 0, Uint8Array.of(97, 204, 129, 98)]
// JavaScript message to Elixir, somebody deleted "á" (but not b)
["delete", 0, 3]

And then in Elixir:

text_from_javascript = <<97, 204, 129, 98>>
deleted_text = binary_part(text_from_javascript, 0, 3)
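
On the JavaScript side, applying the same two messages to a local copy could be sketched like this. The helper names `insertBytes` and `deleteBytes` are made up for illustration, and the sketch assumes every position is a UTF-8 byte offset that falls on a code point boundary.

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder(); // UTF-8 by default

// Hypothetical helpers mirroring the ["insert", pos, bytes] and
// ["delete", pos, length] messages; all positions are byte offsets.
function insertBytes(doc, position, bytes) {
  const out = new Uint8Array(doc.length + bytes.length);
  out.set(doc.subarray(0, position), 0);                    // bytes before the insert
  out.set(bytes, position);                                 // the inserted bytes
  out.set(doc.subarray(position), position + bytes.length); // the rest
  return out;
}

function deleteBytes(doc, position, length) {
  const out = new Uint8Array(doc.length - length);
  out.set(doc.subarray(0, position), 0);              // bytes before the deleted range
  out.set(doc.subarray(position + length), position); // bytes after it
  return out;
}

let doc = new Uint8Array(0);
doc = insertBytes(doc, 0, encoder.encode("a\u0301b")); // bytes 97, 204, 129, 98
doc = deleteBytes(doc, 0, 3);                          // remove "á" (3 bytes)
console.log(decoder.decode(doc)); // b
```

A delete that starts or ends in the middle of a multi-byte sequence would leave invalid UTF-8 behind, so both sides still have to agree on where the boundaries are.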

Alternatively, use the LSEQ mentioned earlier, which is representation independent (it generates its own indices instead of using string positions).

stefanluptak · April 19, 2020, 1:35pm

Thanks a lot. Now I understand.

stefanluptak · April 19, 2020, 3:28pm

I tried a little benchmark here and I am quite surprised that binary_part(binary, from, length) is 46x faster than Enum.slice(charlist, from, to).
Of course String.slice(string, from, to) is extremely slow. That’s not surprising.
Do you have any tips for doing that even faster?

benwilson512 · April 19, 2020, 6:02pm

Do what, binary_part? binary_part is about as fast as it gets on the BEAM for that specific operation I think.

stefanluptak · April 19, 2020, 6:08pm

Yes, I meant that one. Sorry for not being clear enough. OK, good to know. Thanks.

hauleth · April 19, 2020, 6:10pm

Binaries are in reality just byte arrays and binary_part is just make_binary(old + start, length), so I highly doubt that you can get any faster than that.

stefanluptak · April 19, 2020, 6:17pm

That makes sense. I thought that a charlist is in fact the same thing, so I don’t understand the speed difference.
I just started watching Johanna’s talk about string processing performance. I hope I will learn something new and useful.

axelson · April 19, 2020, 6:20pm

Charlists are lists which are implemented as linked lists. And to iterate through the linked list you need to follow a bunch of pointers which is O(n).
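
A loose JavaScript analogy of the two layouts, only to illustrate the access pattern and not how the BEAM actually implements either one. The `cons` and `nth` names are made up for this sketch:

```javascript
// A charlist-like structure: each cell holds one code point and a pointer
// to the next cell. `cons` and `nth` are illustrative names only.
const cons = (head, tail) => ({ head, tail });
const charlist = cons(97, cons(769, cons(98, null))); // 'a', accent, 'b'

// Reaching index n means following n pointers, one cell at a time.
function nth(list, n) {
  let steps = 0;
  let cell = list;
  while (n > 0) {
    cell = cell.tail;
    n -= 1;
    steps += 1;
  }
  return { value: cell.head, steps };
}

// A binary-like structure: contiguous bytes, so an offset is just arithmetic.
const binary = Uint8Array.of(97, 204, 129, 98);
const part = binary.subarray(1, 3); // a view over the same memory, no copy

console.log(nth(charlist, 2));              // { value: 98, steps: 2 }
console.log(Array.from(part));              // [ 204, 129 ]
console.log(part.buffer === binary.buffer); // true
```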

stefanluptak · April 19, 2020, 6:39pm

Ah OK. Thanks a lot. So if I understand this correctly, a charlist is a standard Elixir list, implemented as a linked list, which in this case contains integers that represent code points. A String, on the other hand, is a contiguous piece of memory, and to “iterate” over its content we just need to increment the memory address. Am I right?