I’m parsing a lot of government web pages that were produced on ancient Microsoft systems, like the .aspx Oregon Revised Statutes homepage, or awkward not-really-HTML files like this “windows-1252” one, converted using MS Word and then put online. (!)
My parsing code is failing, and I realized it’s because I have byte sequences like <<194, 160>> in the strings. (This might be Unicode nbsp?)
Here, that space between the 5 and the ( is actually those two bytes, 194 and 160:
iex(58)> :erlang.binary_to_list "1-55 (48)"
[49, 45, 53, 53, 194, 160, 40, 52, 56, 41]
One problem is that it isn’t split on, or matched by regex character classes, the way a space is.
This has plagued me for as long as I’ve worked with these docs, well over a decade. Here, I downloaded the file with curl, then read it in with File.read!.
What’s a good way to go about working with these docs? What I want is just simple whitespace, <<32>>, not nbsp.
EDIT: I found that Regex will work with this data if I give it the :unicode (u) modifier.
pry(10)> raw_string
"Volume : 01 - Courts, Oregon Rules of Civil Procedure - Chapters 1-55 (48)"
pry(12)> Regex.run ~r/\w+-\w+/, raw_string
[<<49, 45, 53, 53, 194>>]
pry(12)> Regex.run ~r/\w+-\w+/u, raw_string
["1-55"]
I’m confused: I’m unsure whether raw_string is valid Unicode or not.
EDIT 2: I now believe this is valid Unicode, and that it was read in correctly with File.read!. Unicode codepoint 160 is nbsp, and its UTF-8 encoding is <<194, 160>>, which is what I’m finding in my strings.
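As a quick sanity check of that reading, here is a minimal sketch using the same bytes as in the iex session above. The variable names (raw, nbsp, codepoints, cleaned) are just mine for illustration, and the last step is only one possible way to end up with plain spaces, not necessarily the best one.

```elixir
# The bytes from the session above: "1-55", then 194 160, then "(48)".
raw = <<49, 45, 53, 53, 194, 160, 40, 52, 56, 41>>

# String.valid?/1 returns true only for well-formed UTF-8.
IO.inspect(String.valid?(raw), label: "valid UTF-8?")

# U+00A0 (no-break space) written as a UTF-8 binary is exactly these two bytes.
nbsp = <<0xA0::utf8>>
IO.inspect(nbsp == <<194, 160>>, label: "nbsp is <<194, 160>>?")

# Decoded to codepoints, the two bytes collapse into the single codepoint 160.
codepoints = String.to_charlist(raw)
IO.inspect(codepoints, label: "codepoints", charlists: :as_lists)

# One possible cleanup: swap every no-break space for a plain <<32>> space.
cleaned = String.replace(raw, nbsp, " ")
IO.inspect(:erlang.binary_to_list(cleaned), label: "cleaned bytes")
```

If the checks hold, the data is fine as UTF-8 and the only thing left to handle is the nbsp codepoint itself.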
The puzzling part to me is that Regex requires the :unicode option to work properly with Unicode. (Since we’re Unicode-first.) Another question I have: why doesn’t String.split/1 split on it?