Nimbleparsec for user input?

MarthinL · January 13, 2025, 10:21am

Hi folks,

Currently I’m considering passing most if not all my site’s user inputs through a parser specified using NimbleParsec but I’m truly ill-equipped (as per the Dunning-Kruger Effect) to judge whether that’s brilliant or the worst idea ever, or figure out how to make it safer. Having read through most of the threads about NimbleParsec on this forum I must admit most of it went well over my head, so I’m terribly sorry if in my ignorance I am repeating an already answered question but I can really use some help from the obvious experts on the topic right about now.

I currently have the following definitions:

  url = repeat(
          choice([
            ascii_string([?A..?Z, ?a..?z, ?0..?9, ?-, ?., ?_, ?~, ?:, ?/, ??, ?#, ?[, ?], ?@, ?!, ?$, ?&, ?', ?*, ?+, ?,, ?;, ?=], min: 1),
            replace(string("%28"), "("),
            replace(string("%29"), ")"),
            ascii_string([?%], 1) |> ascii_string([?0..?9], min: 1, max: 3),
            replace(string("%%"), "%")
          ]))
        |> reduce({:binary, :list_to_bin, []})
  text = ascii_string([not: ?<, not: ?>, not: ?=, not: ?[, not: ?], not: ?(, not: ?)], min: 1)
  lhook = string("<<")
  equiv = string("==")
  rhook = string(">>")
  delims = choice([lhook, equiv, rhook])

  defcombinator :text,
            empty()
            |> lookahead_not(delims)
            |> concat(text)
            |> unwrap_and_tag(:text)

  defparsec :ribbon,
            ignore(lhook)
            |> tag(repeat(lookahead_not(equiv) |> parsec(:node)), :label)
            |> ignore(equiv)
            |> tag(repeat(lookahead_not(rhook) |> parsec(:node)), :content)
            |> ignore(rhook)
            |> tag(:ribbon)

  defparsec :ribref,
            ignore(lhook)
            |> unwrap_and_tag(integer(min: 1, max: 19), :tid)
            |> ignore(rhook)
            |> unwrap_and_tag(:ribref)

  defparsec :link,
            ignore(string("["))
            |> tag(repeat(lookahead_not(string("](")) |> parsec(:node)), :label)
            |> ignore(string("]("))
            |> unwrap_and_tag(lookahead_not(string(")")) |> concat(url), :url)
            |> ignore(string(")"))
            |> tag(:link)

  defcombinator :node,
            choice([
              parsec(:ribbon),
              parsec(:ribref),
              parsec(:link),
              parsec(:text)
            ])
            |> post_traverse({:node_handler, []})

  defparsec :nlt,
            repeat(parsec(:node))
            |> eos()

The ultimate objective being to parse “NLT” which stands for Non-linear Text - a type of rich/structured text endemic to my system. The idea is to create a somewhat similar experience to Markdown in that users would type regular text but have buttons on the form which would inject the required syntax when they wish to type up some recursive non-linear text to be saved to the database.

I realise the application domain would be unfamiliar to most, so I’ll explain in english what the grammar is supposed to recignise. In essense, a plain old string is the simplest form of non-linear text. Then the string might contain, much like in Markdown, a hyperlink in the format [label](url) which isn’t all that challenging but a good reference point because it means a sequence of nodes, where each node is either just text or a hyperlink is also valid non-linear text. Then we introduce two more options. The first being <<label==content>> which is called a ribbon. Ribbons are (at least two) separate entities and both the label and content portions are recursively sequences of nodes. The user can create these ribbon on the fly by using the syntax in the text they input. As we process it, we’d create the required ribbon component entities and replace the original ribbon definition with an abbreviated ribbon definition in the (fourth and final) form <<id>> where the id is simply a bigint database id of the leading component. The <<id>> format may refer to ribbon data that was created by the current form or to pre-existing ribbon data. And that is it. I don’t intend to allow any user text to be “executed” or sent directly to database and have the database infrastructure to validate that the user is allowed access to referenced links and ribbons, but this is very much parsing user text so it would no longer be true to say that what the user inputs isn’t parsed and that’s where I perceive the danger I cannot quite wrap my head around.

As an asside, I first tried using the Md library because what I’m trying to achieve felt so much like Markdown that I thought it would be a good place to start. But I never really felt I was able to control the grammar well enough to have any confidence that it wasn’t leaving gaping holes for all sorts of delierate or accidental abusers to exploit. It seemed there were Markdown syntax that was being recognised even by a supposedly empty spec, so I pivoted to NimbleParsec.

The plan is to incorporate the parsing into the Changeset validations and build the required fully recursive changeset to allow persisting the whole lot to database in a single transaction if it’s valid and have useful visual feedback as to where in the input the syntax stops being valid. That part I’m nowhere near done with so I can’t share that here for a more complete picture.

At this stage I was really just hoping for someone smarter than myself to take a look at the parsers and combinators I’ve defined and point out ways to guard against it become an attack vector to cripple my site. Something like if someone types in just a massive list of < characters it may overrun the stack and crash the system. Or point out why no such parser can ever be hardened enough to get exposed directly to user inputs. I can still back out of this approach if I have to even though it really seems elegant and effective from my current (poorly-informed) perspective.

Thank you kindly, in advance.

Eiji · January 13, 2025, 4:04pm

As long as you do not write SQL expressions by hand or allow user to write it (in short nothing related to nimble_parsec) as long I would not worry about security problems.

While there is no rule to cover all edge cases, every time you think about manipulating large structures in Elixir (which does not support mutable variables) you should consider simplifying especially having in mind Elixir is best for working with a huge number of small messages (in opposite to small number of huge messages). If you are in edge case and believe this is best for you then all good.

The only place where things could go really wrong is repeat, but not because it’s bad. Think that user input is not normal, but have few, dozens, hundreds or even thousands of MB. The nimble_parsec would not solve for you every possible problem. If you have problem in other part of app (SQL injection, missing input validation before sending to server like limit number of characters) then no parser could help you no mater how well written it is or what library it use.

For example when you would inject a parsed value as table name then you could have a SQL injection. Think that parser is supposed to parse <table-name> by taking the table name between less than (<) and greater than (>) characters ignoring hyphen (-) character. In normal case everything works as expected, but one time you could receive partial SQL which closes table identifier and ends with something like delete * from users; - that would happen no matter how well you write parser. However if you pass insecure variable to ecto’s query API then the adapter should escape input correctly, so you don’t have to worry about SQL injections.

As you can see unless your parser would be huge and catches all possible edge-cases (like SQL injections) no matter if the input is correctly parsed or not the other part of app could have a problem. In general you should double-check if user input is properly secured for example by setting maxlength to input element. Best if the validation would happen on client side, so the users would not lose their time re-entering all form fields. parser (this or other) is always well seen. nimble_parsec should generate one of the fastest if not fastest parsers, but as said it would not cover literally everything which is really fine.

MarthinL · January 13, 2025, 5:26pm

Hey, thanks for that. I really needed some sober external perspective to guide my thoughts about this, which is what you provided.

I’ve been taking a lot of care to send my user input through the same checks and safe-guards as they would have gone without the parsing. That in itself is no guarantee of course, but it does make it easier to identify where if anywhere the fact that I am parsing the text could result in something untoward getting ubscured and scnuck through. As it stands right now none of the text being parsed ends up hitting the database in any way other that what would have been possible through regular forms.

So I think I should be OK once I’ve added in some sane limits on text length, rate of input and levels of recursion allowed in one edit session. Logically my database is indefinitely recursive but since it’s impossible to fit too many levels onto the screen at any given time all the navigation around the database is already windowed to show only a set maximum levels at a time. It would make sense to enfore a similar rule when entering content.

I must say though, this rabbit hole has forced me to take another look as recursive changesets and cast_assoc in particular. I’ve not had much luck using it before not least of all because I got discouraged by the documentation describing it as opinionated recommending Multi instead. Multi isn’t too hard to use for nested structures, but for recursive inserts and updates the naming of the nested parts becomes a problem. But once I started figuring out how to create the conditions cast_assoc expects it’s come into its own as a really useful tool. Essentially, I use a post_traverse function to build a recursive nested changeset in the parse context which does does all the validation and when it’s valid on commit persisting the whole lot to database is a single Repo operation. I haven’t quite managed to use build the changeset as the args parameter returned by the parsec call but that too might be possible and if it is would simplify matters considerably as it would avoid having to keep the changeset in the context and the AST in sync. But you know, baby-steps.

Thanks for responding.

Stefano1990 · January 13, 2025, 5:30pm

Simply setting max length on an input element doesn’t protect you at all. I can edit the html and whatever I want.

MarthinL · January 13, 2025, 5:33pm

I’ve taken that to mean that the server-side code would perform sanity checks of its own before throwing text at a parser. Good luck overriding that by editing the HTML.

Eiji · January 13, 2025, 5:36pm

Of course it’s not, but that’s another level of potential problems. What I described was possible without client modification and/or custom requests. I was mainly talking about that developers should not trust even regular user input. Depending on solutions used you could block (or rather wait for some time and reject) a connection which is too big, but this is completely another topic where Elixir can, but does not need to pay key role …

Stefano1990 · January 13, 2025, 5:51pm

I meant it more like for completeness sake, so if a beginner reads this thread they don’t think that simply setting the max length on the input is some kind of validation.

MarthinL · January 13, 2025, 5:52pm

I’ve long wanted to write this error message into a system:

Your attempt at breaking or abusing the system was noted and duly logged against your profile. You’ve agreed to conditions of services which explicitly prohibits such behaviour and allows us to respond in a variety of ways ranging from messing with the service we provide you to full on legal action.

MarthinL · January 13, 2025, 5:56pm

It does have value though. If you receive input that exceeds the maximum input length you’ve specified it almost certainly indicates a deliberate attempt at disrupting your system.

Eiji · January 13, 2025, 5:58pm

Oh, that’s very bad practice … This way you instruct attacker that this was already spotted. You should wait some time, so the attacker only wastes time and resources waiting for reply. If you would simulate a “positive” response without telling the attacker that his attack was spotted then he may try again and again waste time and resources. Didn’t working on security devops stuff for years, but if I remember correctly you have to return 400 error response after few seconds of waiting or something like that …

MarthinL · January 13, 2025, 6:03pm

You’d have noticed me saying I’ve long wanted to write that, not that I have. I’m rather well aware that feedback is an enabler, most for good people but also for those with bad intentions. Detecting, with some degree of certainty, attempts to disrupt service, but either withholding feedback or deliberately giving misleading feedback is a powerful approach. In the context of my dream error message, that means that the error would not show on the attacker’s screeen while they’re messing about. It would appear as an email well after the fact. As for the immediate response to an eror like that, it’s probably best to let the attacker believe they’ve caused the chaos they intended even if it didn’t, like just never sending a response and forcing a restart of the session some minutes later.