Fun with unicode chardata

derek-zhou · September 28, 2021, 12:27am

I just figured out a tricky bug that I'd like to share with you. I read binary data from a port, and I know the program on the other side sends nothing but UTF-8 strings. However, the list of binaries I receive over time may not be valid chardata!

The reason is that a Unicode multi-byte sequence may straddle a packet boundary. This usually doesn't happen if the writer is slow enough, because in all likelihood the writer emits one complete, valid string at a time and each arrives as its own packet. However, if the writer is fast or the reader is slow, the byte stream gets split at arbitrary points, and sooner or later a split lands in the middle of a character.
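To make the failure concrete, here is a minimal sketch. The string and the split position are made up for illustration; a real port can split the stream anywhere.

```elixir
# "é" encodes as two bytes in UTF-8: 0xC3 0xA9.
full = "café!"

# Hypothetical packet boundary after byte 4, in the middle of "é".
<<first::binary-size(4), second::binary>> = full

# false: the first packet ends with a lone lead byte
IO.inspect(String.valid?(first))
# false: the second packet starts with a continuation byte
IO.inspect(String.valid?(second))
# true: the bytes are fine once rejoined
IO.inspect(String.valid?(first <> second))
```

Each packet is a perfectly good binary, but neither is a valid string on its own.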

kip · September 28, 2021, 1:36am

Good point! I've had a similar situation in the past and found I had to use String.chunk/2 with the :valid trait to accumulate valid UTF-8 while holding over the trailing invalid bytes until the next packet comes in, and so on.
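A rough sketch of that holdover idea, not the exact code I used. The module name `Utf8Holdover` and the function `feed/2` are hypothetical, and it assumes the sender only ever emits valid UTF-8, so the only invalid chunk can be a truncated character at the very end of the data.

```elixir
defmodule Utf8Holdover do
  # Hypothetical helper. Takes the bytes held over from the previous packet
  # plus the new packet, and returns {valid_text, new_holdover}.
  def feed(holdover, packet) when is_binary(holdover) and is_binary(packet) do
    # String.chunk/2 splits the data into alternating valid and invalid runs.
    chunks = String.chunk(holdover <> packet, :valid)

    case List.last(chunks) do
      # Empty input: nothing to emit, nothing to hold over.
      nil ->
        {"", ""}

      last ->
        if String.valid?(last) do
          {IO.iodata_to_binary(chunks), ""}
        else
          # Assumption: a trailing invalid run is a character cut in half,
          # so keep it back until the next packet arrives.
          {IO.iodata_to_binary(Enum.drop(chunks, -1)), last}
        end
    end
  end
end

# Usage: thread the holdover through the packets as they arrive.
packets = [<<"caf", 0xC3>>, <<0xA9, "!">>]

{texts, leftover} =
  Enum.map_reduce(packets, "", fn packet, holdover ->
    Utf8Holdover.feed(holdover, packet)
  end)

IO.inspect({texts, leftover})
```

If the sender can produce genuinely invalid bytes, the holdover would need a length cap (a UTF-8 character is at most 4 bytes), otherwise garbage could be held back indefinitely.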

derek-zhou · September 28, 2021, 3:25pm

Yeah. To parse UTF-8 input one character at a time, I still have to concatenate binaries together, which means a memory copy. Before, I was falsely assuming that I could just call :string.next_codepoint/1 on the list of binaries directly and skip the memory copy.
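A small sketch of what goes wrong. The two packets are made up, and `walk` is just a throwaway helper for this example.

```elixir
# Two hypothetical packets with "é" (0xC3 0xA9) split across them.
first = <<"caf", 0xC3>>
second = <<0xA9, "!">>

# Throwaway helper: pull codepoints one at a time until the end or an error.
walk = fn walk, data, acc ->
  case :string.next_codepoint(data) do
    [] -> {:ok, Enum.reverse(acc)}
    {:error, rest} -> {:error, Enum.reverse(acc), rest}
    [cp | rest] -> walk.(walk, rest, [cp | acc])
  end
end

# Walking the list of packets as-is stops with an error at the split character,
# because each binary in the list is decoded on its own.
IO.inspect(walk.(walk, [first, second], []))

# Walking the concatenated binary decodes every codepoint.
IO.inspect(walk.(walk, first <> second, []))
```

So the bytes of one character have to sit in the same binary before they can be decoded.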