I’m parsing a lot of Government web pages which were produced on ancient Microsoft systems. Like the .aspx Oregon Revised Statutes homepage. Or awkward not-really-html files like this “windows-1252” converted using MS Word and then put online. (!)
My parsing code is failing, and I realized it’s because I have byte sequences like <<194, 160>> in the strings. (This might be the Unicode nbsp?)
Here, that space between the 5 and the ( is actually <<194, 160>>:
iex(58)> :erlang.binary_to_list "1-55 (48)"
[49, 45, 53, 53, 194, 160, 40, 52, 56, 41]
One problem is that it doesn’t get split on, or matched by regex character classes, the way a regular space does.
This has plagued me ever since I started working with these docs, well over a decade ago. Here, I downloaded the file with curl, then read it in with File.read!.
What’s a good way to go about working with these docs? What I want is just plain whitespace, <<32>>, not nbsp.
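To make the goal concrete, here is a minimal sketch of the normalization I'm after, assuming the input is already valid UTF-8 and that nbsp (U+00A0) is the only troublesome character. The variable names `raw` and `clean` are just illustrative.

```elixir
# Sample text with a non-breaking space (U+00A0) between "55" and "(".
raw = "1-55\u00A0(48)"

# Replace every nbsp with an ordinary space (<<32>>).
clean = String.replace(raw, "\u00A0", " ")

# Print the bytes as a plain list rather than as a charlist.
IO.inspect(:erlang.binary_to_list(clean), charlists: :as_lists)
# → [49, 45, 53, 53, 32, 40, 52, 56, 41]
```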
EDIT: I found that Regex will work with this data if I give it the :unicode (u) modifier.
pry(10)> raw_string
"Volume : 01 - Courts, Oregon Rules of Civil Procedure - Chapters 1-55 (48)"
pry(12)> Regex.run ~r/\w+-\w+/, raw_string
[<<49, 45, 53, 53, 194>>]
pry(12)> Regex.run ~r/\w+-\w+/u, raw_string
["1-55"]
I’m confused: I’m unsure whether raw_string is valid Unicode or not.
EDIT 2: I now believe this is valid Unicode, and that File.read! read it in correctly. Unicode codepoint 160 is nbsp, but in UTF-8 it’s encoded as <<194, 160>>, which is what I’m finding in my strings.
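A quick way to check this reasoning, as a sketch using a hand-built sample string rather than the actual downloaded file:

```elixir
# Same bytes as in the example above: "1-55", nbsp as UTF-8, "(48)".
raw = "1-55" <> <<194, 160>> <> "(48)"

# Valid UTF-8 passes String.valid?/1.
IO.inspect(String.valid?(raw))
# → true

# Encoding codepoint 160 as UTF-8 yields the two bytes seen in the strings.
IO.inspect(<<160::utf8>> == <<194, 160>>)
# → true

# A lone byte 160 (nbsp in windows-1252 / Latin-1) is not valid UTF-8.
IO.inspect(String.valid?("1-55" <> <<160>> <> "(48)"))
# → false
```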
The puzzling part to me is that Regex requires the :unicode option to work properly with Unicode. (Since we’re Unicode-first.) Another question I have: why doesn’t String.split/1 split on it?
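One workaround sketch, assuming it's acceptable to name the separators explicitly: pass both the regular space and nbsp as patterns to String.split/3. The sample string here is made up for illustration.

```elixir
raw = "Chapters 1-55\u00A0(48)"

# String.split/1 splits on the regular space but leaves the nbsp in place.
IO.inspect(length(String.split(raw)))
# → 2

# Listing both separators explicitly splits on either one.
parts = String.split(raw, [" ", "\u00A0"], trim: true)
IO.inspect(parts)
# → ["Chapters", "1-55", "(48)"]
```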