How to sanitize string input containing <<194, 160>> and maybe other junk?

I’m parsing a lot of government web pages which were produced on ancient Microsoft systems, like the .aspx Oregon Revised Statutes homepage, or awkward not-really-HTML files like this “windows-1252” one, converted using MS Word and then put online. (!)

My parsing code is failing and I realized it’s because I have byte sequences like <<194, 160>> in the strings. (This might be Unicode nbsp?)

Here, that space between the 5 and the ( is actually <<194, 160>>:

iex(58)> :erlang.binary_to_list "1-55 (48)"
[49, 45, 53, 53, 194, 160, 40, 52, 56, 41]

One problem is that it doesn’t split like a space, and it doesn’t get caught by regex character classes the way a space would.

This has plagued me for as long as I’ve worked with these docs, which is well over a decade :) Here, I downloaded the file with curl, then read it in with File.read!.

What’s a good way to go about working with these docs? What I want is to have just simple whitespace, <<32>>, not nbsp.

EDIT: I found that Regex will work with this data if I give it the :unicode (u) modifier.

pry(10)> raw_string
"Volume : 01 - Courts, Oregon Rules of Civil Procedure - Chapters 1-55 (48)"
pry(12)> Regex.run ~r/\w+-\w+/, raw_string
[<<49, 45, 53, 53, 194>>]
pry(12)> Regex.run ~r/\w+-\w+/u, raw_string
["1-55"]

I’m confused — I’m unsure whether raw_string is valid Unicode or not.

EDIT 2: I believe now that this is valid Unicode, which was correctly read in with File.read!. Unicode codepoint 160 is nbsp. But in UTF-8 it’s <<194, 160>> which is what I’m finding in my strings.
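
A quick way to check the claim in EDIT 2 from iex might look like the sketch below. It only uses the byte values from the post; the variable names are made up for illustration.

```elixir
# The two bytes found in the scraped strings.
nbsp = <<194, 160>>

# Is this binary valid UTF-8?
valid = String.valid?(nbsp)

# Is it the same binary as the escaped codepoint U+00A0 (decimal 160)?
same = nbsp == "\u00A0"

# Decode it back to codepoints; a single codepoint is expected.
codepoints = String.to_charlist(nbsp)

# Rebuild the sample from the post and look at its raw bytes.
raw = "1-55" <> nbsp <> "(48)"
bytes = :erlang.binary_to_list(raw)
```

Here valid and same should both be true, codepoints should be [160], and bytes should match the list shown in the first iex session.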

The puzzling part to me is that Regex requires the :unicode option to work properly with Unicode, since we’re Unicode-first. Another question I have is why String.split/1 doesn’t split on it.
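
For the original goal of getting plain <<32>> whitespace, one possible approach is to replace U+00A0 before splitting. As far as I understand the String.split/1 documentation, divisions deliberately do not occur on non-breaking whitespace, which would explain the behaviour above. This is only a sketch: the Sanitize module and its function names are hypothetical, not something from the post or from Elixir itself.

```elixir
defmodule Sanitize do
  # Hypothetical helper module, for illustration only.
  @nbsp "\u00A0"

  # Replace only U+00A0 (bytes <<194, 160>>) with a plain ASCII space.
  def nbsp_to_space(string), do: String.replace(string, @nbsp, " ")

  # Broader variant for "other junk": collapse any run of Unicode
  # space separators (category Zs) into a single ASCII space.
  def normalize_spaces(string), do: String.replace(string, ~r/\p{Zs}+/u, " ")
end

raw = "1-55\u00A0(48)"

# Without cleanup, the nbsp keeps the string in one piece.
unsplit = String.split(raw)

# After cleanup, String.split/1 sees an ordinary space.
clean = Sanitize.nbsp_to_space(raw)
parts = String.split(clean)
```

With this, parts should be ["1-55", "(48)"]. Whether to use the narrow or the broad variant depends on how much other whitespace debris shows up in the Word-exported pages.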

I think if José was starting again he might well default the regex to be unicode-compatible. But that would now be a breaking change, hence the issue that you see. Elixir wraps Erlang’s re module and uses the same defaults.
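
To illustrate the point about defaults, here is a small sketch that opts in to Unicode handling explicitly with the u modifier. The sample string is a shortened version of the one in the question, with the nbsp written as an escape so it is visible.

```elixir
raw = "Chapters 1-55\u00A0(48)"

# With `u`, \w is Unicode-aware, so the match stops cleanly before the nbsp.
match = Regex.run(~r/\w+-\w+/u, raw)

# With `u`, \s also matches the nbsp, so it works as a split separator.
words = String.split(raw, ~r/\s+/u)
```

Here match should be ["1-55"], as in the pry session above, and words should be ["Chapters", "1-55", "(48)"].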
