stefanluptak
JS/Elixir string indexing interoperability
Hi all!
I would like to ask for ideas on how to handle the different ways JavaScript and Elixir approach strings and character indexes.
I want to store the same (user entered) text on the Elixir and also the JavaScript side. I also want to exchange messages describing operations like {:delete, from, to}, {:insert, position, text}, etc. to ensure the same text is on both sides.
The thing is that this becomes interesting as soon as the text contains emojis, characters with diacritics, and so on.
Working with strings (binaries) in Elixir is pretty straightforward. Unfortunately (but not surprisingly), the JavaScript behavior is (in my opinion) a bit unintuitive.
// JavaScript
> String.fromCharCode(97, 769).slice(0, 1)
'a'
> String.fromCharCode(225).slice(0, 1)
'á'
# Elixir
iex> [97, 769] |> to_string() |> String.slice(0, 1)
"á"
iex> [225] |> to_string() |> String.slice(0, 1)
"á"
What strategy should I use? Should I work with the text as a charlist on the Elixir side or should I normalize all strings everywhere? Or is there a better strategy I should take a look at?
Thank you all for your advice.
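For reference, the "normalize all strings everywhere" option can be sketched on the JavaScript side with the built-in `String.prototype.normalize`. This is only an illustration using the two strings from the examples above. NFC helps only where a precomposed form exists, so it does not make emoji sequences occupy a single index.

```javascript
// Sketch: bring both spellings of "á" to the same NFC form before slicing.
const decomposed = String.fromCharCode(97, 769); // "a" + U+0301 combining acute accent
const precomposed = String.fromCharCode(225);    // U+00E1, precomposed "á"

const normalized = decomposed.normalize("NFC");

console.log(decomposed === precomposed);  // false
console.log(normalized === precomposed);  // true
console.log(normalized.slice(0, 1) === precomposed.slice(0, 1)); // true
```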
Marked As Solved
hauleth
Operating on bytes is the safest way; in short, you do not use the String module. This makes you independent of the encoding of your data. It would look like this:
// JavaScript message to Elixir, somebody wrote "áb"
["insert", 0, Uint8Array.of(97, 204, 129, 98)]
// JavaScript message to Elixir, somebody deleted "á" (but not b)
["delete", 0, 3]
And then in Elixir:
text_from_javascript = <<97, 204, 129, 98>>
deleted_text = binary_part(text_from_javascript, 0, 3)
Alternatively, use LSEQ, mentioned earlier, which is representation-independent (it generates its own indices instead of using string positions).
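To mirror the byte-offset messages on the JavaScript side, something like the following sketch could apply the same operations to a `Uint8Array`. The helper names `insertBytes` and `deleteBytes` are made up for illustration, and the third element of the delete message is assumed to be a length, matching the `binary_part/3` call above.

```javascript
// Sketch only: helper names are hypothetical, offsets are UTF-8 byte offsets.
const decoder = new TextDecoder();

// Return a new buffer with `bytes` inserted at byte offset `position`.
function insertBytes(doc, position, bytes) {
  const out = new Uint8Array(doc.length + bytes.length);
  out.set(doc.subarray(0, position), 0);
  out.set(bytes, position);
  out.set(doc.subarray(position), position + bytes.length);
  return out;
}

// Return a new buffer with `length` bytes removed, starting at byte offset `start`.
function deleteBytes(doc, start, length) {
  const out = new Uint8Array(doc.length - length);
  out.set(doc.subarray(0, start), 0);
  out.set(doc.subarray(start + length), start);
  return out;
}

let doc = new Uint8Array(0);
doc = insertBytes(doc, 0, Uint8Array.of(97, 204, 129, 98)); // ["insert", 0, ...]
doc = deleteBytes(doc, 0, 3);                               // ["delete", 0, 3]
console.log(decoder.decode(doc)); // "b"
```

Decoding is only needed for display; the document itself stays a byte buffer so both sides agree on offsets.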
Also Liked
hauleth
Unicode is hard.
Unicode is very hard.
Unicode is enormously hard.
If you want consistent behaviour, then operate on bytes: use Blob or Uint8Array on the JS side and binaries on the Elixir side. I think that would be the simplest way. I hope that what you are describing (I assume you want to create a distributed concurrent text editor) is a CRDT like LSEQ.
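One practical detail when operating on bytes: positions that JS string APIs report are UTF-16 code unit indexes, so they need converting to UTF-8 byte offsets before being sent. A minimal sketch, where `utf16IndexToByteOffset` is a hypothetical helper and the index is assumed not to fall in the middle of a surrogate pair:

```javascript
// Sketch only: convert a UTF-16 code unit index into a UTF-8 byte offset.
// Assumes `index` does not split a surrogate pair; a lone surrogate would be
// encoded as U+FFFD and give a wrong offset.
const utf8 = new TextEncoder();

function utf16IndexToByteOffset(text, index) {
  return utf8.encode(text.slice(0, index)).length;
}

console.log(utf16IndexToByteOffset("a\u0301b", 2));   // 3 ("a" is 1 byte, U+0301 is 2 bytes)
console.log(utf16IndexToByteOffset("\u{1F600}b", 2)); // 4 (the emoji is 2 code units, 4 bytes)
```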
NobbZ
Do you want a and ^ to be considered as separate, or as â?
If the latter, work on graphemes.
Read docs of string functions carefully to know if they work on graphemes or codepoints.
The same is true for the JS functions and methods you use.
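As a rough illustration of that difference on the JS side, the sketch below counts the same text in UTF-16 code units, code points, and graphemes. The `countUnits` helper is hypothetical, and it assumes a runtime that provides `Intl.Segmenter`.

```javascript
// Sketch only: three ways of counting "characters" in a JS string.
// Assumes Intl.Segmenter is available in the runtime.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

function countUnits(text) {
  return {
    codeUnits: text.length,                          // what slice() and indexing use
    codePoints: [...text].length,                    // string iteration is by code point
    graphemes: [...segmenter.segment(text)].length,  // user-perceived characters
  };
}

console.log(countUnits("a\u0301"));   // { codeUnits: 2, codePoints: 2, graphemes: 1 }
console.log(countUnits("\u{1F44D}")); // { codeUnits: 2, codePoints: 1, graphemes: 1 }
```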
I have to disappoint you, though: it will be complicated, no matter what. Most developers of string handling libraries either do not care about the differences or do not even understand them.
Anyway, try to avoid random access into strings by codepoint or grapheme; it's an O(n) operation!
benwilson512
Do what, binary_part? binary_part is about as fast as it gets on the BEAM for that specific operation I think.
axelson
Charlists are lists, which are implemented as linked lists. To reach a given element of a linked list you need to follow a chain of pointers, which is O(n).
stefanluptak
Ah OK. Thanks a lot. So if I understand this correctly, a charlist is a standard Elixir List implemented as a linked list, in this case containing integers that represent code points, while a String is a contiguous piece of memory, and to "iterate" over its content we just need to increment the memory address. Am I right?









