Regex, Named captures. Where did I go wrong?

drakvuf · February 4, 2019, 7:20pm

I’m scraping articles and different news sites have different DateTime formats. It is a mess.
So I wrote a date parser module to handle the situation. I’m not really good with regex but I managed to write something useful: https://regexr.com/47ne9

The problem is that I’m not able to use this pattern in my code.

The function uses Regex.named_captures:

Scraper.Parser.Date.parse_date("2019. 01. 11.")
%{"day" => "", "month" => "", "time" => "", "year" => "2019"}

This is the default generated pattern:

~r/(?<time>([\d]{1,2}):[\d]{2})|(?<year>([\d]{4}))|(?<month>([0-1][0-2]|jan|január|feb|február|már|március|ápr|április|már|március|jún|június|júl|július|aug|augusztus|szep|szeptember|okt|október|nov|november|dec|december))|(?<day>([0-3][0-9]))/

I’ve tried it with Regex.scan and my result was this:
["2019", "", "", "2019", "2019"]

I’ve tried running it in JavaScript and it seemed to work fine:

"2019. 01. 11.".match(/(([\d]{1,2}):[\d]{2})|([\d]{4})|([0-1][0-2]|jan)|([0-3][0-9])/g)

[ '2019', '01', '11' ]

My guess is that I’m missing the global option or something like that, but from what I read in the documentation I can’t pass it as an option in Elixir.
I would really like to use named captures here, or at least get the same result as with that simple JavaScript code.

Note: the DateTime format in this example is just a simple one, so I’m not looking for a solution that parses only that specific format. You can see all the DateTime formats in the linked gist.

OvermindDL1 · February 4, 2019, 8:12pm

First of all, I’d recommend using http://elixre.lpil.uk/ to test the regex; it shows you what matches and what doesn’t.

So it looks like your full generated regex is (?<time>([\d]{1,2}):[\d]{2})|(?<year>([\d]{4}))|(?<month>([0-1][0-2]|jan|január|feb|február|már|március|ápr|április|már|március|jún|június|júl|július|aug|augusztus|szep|szeptember|okt|október|nov|november|dec|december))|(?<day>([0-3][0-9])), and your first test date is / 2018.08.03., péntek 13:00 /. Running that through the site above (which uses Elixir’s regex for everything) shows that it skips the leading /, matches 2018, then stops. With 2019-01-14 it matches 2019 and then stops, because the pattern is one big alternation and a single branch is enough to complete the match. Etc…

That website makes it really easy to build up regexes for Elixir.
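To make the "matches the year, then stops" behaviour concrete, here is a minimal sketch. It assumes a simplified, ASCII-only stand-in for the generated pattern (the month names are left out), and the module and function names are hypothetical. It contrasts Regex.named_captures, which only looks at the first match, with Regex.scan, which walks the whole string much like the JavaScript g flag.

```elixir
defmodule CaptureSketch do
  # Hypothetical, simplified version of the generated pattern:
  # same four named branches, but without the month-name alternatives.
  def pattern do
    ~r/(?<time>\d{1,2}:\d{2})|(?<year>\d{4})|(?<month>[0-1][0-2])|(?<day>[0-3][0-9])/
  end

  # Regex.named_captures/2 reports the named groups of the FIRST match only.
  # Because the branches are alternatives, only one of them is ever filled.
  def first_match(string), do: Regex.named_captures(pattern(), string)

  # Regex.scan/3 returns every non-overlapping match. `capture: :first`
  # keeps just the whole match, so each inner list has one element.
  def all_matches(string) do
    pattern()
    |> Regex.scan(string, capture: :first)
    |> List.flatten()
  end
end

IO.inspect(CaptureSketch.first_match("2019. 01. 11."))
# %{"day" => "", "month" => "", "time" => "", "year" => "2019"}

IO.inspect(CaptureSketch.all_matches("2019. 01. 11."))
# ["2019", "01", "11"]
```

The flat list no longer says which branch produced each value. With this pattern, 11 also satisfies the month branch, so the labels cannot be recovered reliably from an alternation alone.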

However, since this is a context-free grammar you are building, and it looks like it is going to end up quite sizeable given all the cases you need to handle, I’d recommend using NimbleParsec instead. It will generate faster code and it will be far more maintainable in the long term, far far more maintainable!

axelson · February 4, 2019, 8:14pm

Do you know of any gentle introductions to building up a grammar with NimbleParsec? I’ve looked at it before but, not having worked with grammars (lex/yacc, etc.) before, it seemed a bit daunting.

OvermindDL1 · February 4, 2019, 8:26pm

Hmm, it’s pretty new so I don’t know if anyone has made an article for it yet.

The resident experts on it are of course the author @josevalim and the major user @tmbb, so anything beyond the docs they can explain here.

In general I’d probably start with a ‘main parser’ that is just an alternative over lots of ‘specific’ parsers, each of which tries to parse the date in one of the odd formats you need to handle, and each of those in turn calling specialized parsers for a numerical year and so forth.
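As a rough illustration of that layout, here is a small sketch of a 'main parser' built from two format-specific parsers that share small year/month/day combinators. The module name, the two formats covered, and the tag names are assumptions made for the example, and the version constraint in Mix.install is only a guess at something reasonable. It is a starting point rather than a complete date parser.

```elixir
# Fetches NimbleParsec when run as a script (requires Elixir 1.12+).
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule DateSketch do
  import NimbleParsec

  # Specialized parsers: fixed-width integers tagged with their meaning.
  year = integer(4) |> unwrap_and_tag(:year)
  month = integer(2) |> unwrap_and_tag(:month)
  day = integer(2) |> unwrap_and_tag(:day)

  # Specific parser for the "2019. 01. 11." style.
  dotted =
    year
    |> ignore(string(". "))
    |> concat(month)
    |> ignore(string(". "))
    |> concat(day)
    |> ignore(string("."))

  # Specific parser for the "2019-01-14" style.
  dashed =
    year
    |> ignore(string("-"))
    |> concat(month)
    |> ignore(string("-"))
    |> concat(day)

  # Main parser: try each specific format in order.
  defparsec :date, choice([dotted, dashed])
end

# The parsed values are the second element of the returned tuple.
{:ok, parts, _rest, _context, _line, _offset} = DateSketch.date("2019. 01. 11.")
IO.inspect(parts)
# [year: 2019, month: 1, day: 11]
```

Supporting another site's format would then mean adding one more specific parser to the choice list, while the year, month, and day pieces stay shared.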

l00ker · February 5, 2019, 4:13pm

Oooh nice! Thanks for that!

OvermindDL1 · February 5, 2019, 5:22pm

I’m pretty sure it was our own amazing @lpil here that made it. ^.^