Hi folks,
Currently I’m considering passing most, if not all, of my site’s user inputs through a parser specified using NimbleParsec, but I’m truly ill-equipped (as per the Dunning-Kruger effect) to judge whether that’s brilliant or the worst idea ever, or to figure out how to make it safer. Having read through most of the threads about NimbleParsec on this forum, I must admit most of it went well over my head, so I’m terribly sorry if, in my ignorance, I’m repeating an already-answered question, but I could really use some help from the obvious experts on the topic right about now.
I currently have the following definitions:
```elixir
url =
  repeat(
    choice([
      ascii_string([?A..?Z, ?a..?z, ?0..?9, ?-, ?., ?_, ?~, ?:, ?/, ??, ?#, ?[, ?], ?@, ?!, ?$, ?&, ?', ?*, ?+, ?,, ?;, ?=], min: 1),
      replace(string("%28"), "("),
      replace(string("%29"), ")"),
      ascii_string([?%], 1) |> ascii_string([?0..?9], min: 1, max: 3),
      replace(string("%%"), "%")
    ])
  )
  |> reduce({:binary, :list_to_bin, []})

text = ascii_string([not: ?<, not: ?>, not: ?=, not: ?[, not: ?], not: ?(, not: ?)], min: 1)

lhook = string("<<")
equiv = string("==")
rhook = string(">>")
delims = choice([lhook, equiv, rhook])

defcombinator :text,
  empty()
  |> lookahead_not(delims)
  |> concat(text)
  |> unwrap_and_tag(:text)

defparsec :ribbon,
  ignore(lhook)
  |> tag(repeat(lookahead_not(equiv) |> parsec(:node)), :label)
  |> ignore(equiv)
  |> tag(repeat(lookahead_not(rhook) |> parsec(:node)), :content)
  |> ignore(rhook)
  |> tag(:ribbon)

defparsec :ribref,
  ignore(lhook)
  |> unwrap_and_tag(integer(min: 1, max: 19), :tid)
  |> ignore(rhook)
  |> unwrap_and_tag(:ribref)

defparsec :link,
  ignore(string("["))
  |> tag(repeat(lookahead_not(string("](")) |> parsec(:node)), :label)
  |> ignore(string("]("))
  |> unwrap_and_tag(lookahead_not(string(")")) |> concat(url), :url)
  |> ignore(string(")"))
  |> tag(:link)

defcombinator :node,
  choice([
    parsec(:ribbon),
    parsec(:ribref),
    parsec(:link),
    parsec(:text)
  ])
  |> post_traverse({:node_handler, []})

defparsec :nlt,
  repeat(parsec(:node))
  |> eos()
```
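To make the intent concrete, and ignoring for the moment whatever the `:node_handler` post-traversal ends up doing, my expectation is that a call along these lines produces a tagged tree (module name `NLT.Parser` is just a placeholder for wherever these combinators live):

```elixir
# Placeholder module name; the defparsec calls above would sit inside it.
{:ok, nodes, "", _context, _line, _offset} =
  NLT.Parser.nlt("see <<x==y>> and [go](https://e.x)")

# I expect nodes to be shaped roughly like:
# [
#   text: "see ",
#   ribbon: [label: [text: "x"], content: [text: "y"]],
#   text: " and ",
#   link: [label: [text: "go"], url: "https://e.x"]
# ]
```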
The ultimate objective is to parse “NLT”, which stands for Non-linear Text, a type of rich/structured text endemic to my system. The idea is to create an experience somewhat similar to Markdown, in that users would type regular text but have buttons on the form to inject the required syntax when they wish to type up some recursive non-linear text to be saved to the database.
I realise the application domain will be unfamiliar to most, so I’ll explain in English what the grammar is supposed to recognise. In essence:

- A plain old string is the simplest form of non-linear text.
- The string might contain, much as in Markdown, a hyperlink in the format [label](url). That isn’t all that challenging, but it’s a good reference point because it means a sequence of nodes, where each node is either plain text or a hyperlink, is also valid non-linear text.
- Then we introduce two more options. The first is <<label==content>>, which is called a ribbon. Ribbons are (at least two) separate entities, and both the label and content portions are recursively sequences of nodes. The user can create these ribbons on the fly by using the syntax in the text they input.
- As we process a ribbon, we’d create the required ribbon component entities and replace the original ribbon definition with an abbreviated one in the (fourth and final) form <<id>>, where the id is simply a bigint database id of the leading component. The <<id>> format may refer to ribbon data created by the current form or to pre-existing ribbon data.

And that is it. I don’t intend to allow any user text to be “executed” or sent directly to the database, and I have the database infrastructure to validate that the user is allowed access to referenced links and ribbons. But this is very much parsing user text, so it would no longer be true to say that what the user inputs isn’t parsed, and that’s where I perceive the danger I cannot quite wrap my head around.
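In grammar terms, what I’m describing is (as I understand my own spec) roughly:

```
nlt    ::= node*
node   ::= ribbon | ribref | link | text
ribbon ::= "<<" node* "==" node* ">>"    (label and content are recursive)
ribref ::= "<<" digits ">>"              (digits = a bigint id, 1..19 digits)
link   ::= "[" node* "]" "(" url ")"
text   ::= one or more characters other than  < > = [ ] ( )
```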
As an aside, I first tried using the Md library, because what I’m trying to achieve felt so much like Markdown that I thought it would be a good place to start. But I never really felt I could control the grammar well enough to have any confidence that it wasn’t leaving gaping holes for all sorts of deliberate or accidental abusers to exploit. It seemed some Markdown syntax was being recognised even by a supposedly empty spec, so I pivoted to NimbleParsec.
The plan is to incorporate the parsing into the Changeset validations, build the required fully recursive changeset to allow persisting the whole lot to the database in a single transaction if it’s valid, and provide useful visual feedback as to where in the input the syntax stops being valid. That part I’m nowhere near done with, so I can’t share it here for a more complete picture.
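The validation hook I have in mind looks something like the sketch below (names like `NLT.Parser` and `validate_nlt` are placeholders, and I haven’t wired this up yet):

```elixir
# Sketch only: run the parser inside a changeset validation and turn the
# parser's error position into a field error for visual feedback.
def validate_nlt(changeset, field) do
  Ecto.Changeset.validate_change(changeset, field, fn ^field, value ->
    case NLT.Parser.nlt(value) do
      {:ok, _nodes, "", _context, _line, _offset} ->
        []

      {:error, reason, _rest, _context, _line, offset} ->
        [{field, "invalid NLT at byte #{offset}: #{reason}"}]
    end
  end)
end
```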
At this stage I was really just hoping for someone smarter than myself to take a look at the parsers and combinators I’ve defined and point out ways to guard against them becoming an attack vector that could cripple my site. For example, if someone types in nothing but a massive run of < characters, could that overrun the stack and crash the system? Or point out why no such parser can ever be hardened enough to be exposed directly to user input. I can still back out of this approach if I have to, even though it seems really elegant and effective from my current (poorly informed) perspective.
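The only guard I’ve come up with so far is a cheap pre-check that rejects input that is too long or nested too deeply before it ever reaches the parser, something like this (the limits are arbitrary, and I have no idea whether this is actually sound):

```elixir
defmodule NLT.Guard do
  @moduledoc "Sketch: cheap size/nesting pre-checks before parsing. Limits are arbitrary."

  @max_bytes 10_000
  @max_depth 20

  def safe_to_parse?(input) when byte_size(input) > @max_bytes, do: false
  def safe_to_parse?(input), do: max_nesting(input, 0, 0) <= @max_depth

  # Walk the binary, tracking the current and deepest << >> nesting seen.
  defp max_nesting(<<"<<", rest::binary>>, depth, deepest),
    do: max_nesting(rest, depth + 1, max(depth + 1, deepest))

  defp max_nesting(<<">>", rest::binary>>, depth, deepest),
    do: max_nesting(rest, max(depth - 1, 0), deepest)

  defp max_nesting(<<_, rest::binary>>, depth, deepest),
    do: max_nesting(rest, depth, deepest)

  defp max_nesting(<<>>, _depth, deepest), do: deepest
end
```

A long run of `<` characters would count as many `<<` openings, so it would at least trip the depth limit here rather than reaching the recursive combinators. But I don’t know if this kind of pre-filter is enough, or even the right idea.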
Thank you kindly, in advance.