Elixir style vs. "Parse don't validate" (from Haskell)

EDIT: related discussion: “Parse, don’t validate” (2019) | Hacker News

I finished this working code which takes in an HTML page and outputs legal citations it finds. I like the code, and I think it’s pretty much Elixir style. E.g., it uses the shape of the data when it can. But I realized that it has an ordering dependency in the logic. In bigger projects, this can lead to bugs:

  @doc """
  Find citations in a string of HTML.
  """
  @spec find_citations(binary()) :: list()
  def find_citations(html) do
    {:ok, document} = Floki.parse_document(html)

    leginfo_urls =
      document
      |> Floki.attribute("a", "href")
      |> List.flatten()
      |> Enum.map(&URI.parse/1)
      |> Enum.filter(&leginfo_url?/1)

    leginfo_urls
    |> Enum.map(&leginfo_url_to_cite/1)
    |> Enum.sort()
    |> Enum.uniq()
  end

  defp leginfo_url?(%{host: "leginfo.legislature.ca.gov"}), do: true
  defp leginfo_url?(_), do: false

  defp leginfo_url_to_cite(%{query: query}) do
    query
    |> URI.decode_query()
    |> make_cite()
  end

  defp make_cite(%{"lawCode" => code, "sectionNum" => section}) do
    "CA #{@code_abbrevs[@cal_codes[code]]} Section #{section}"
    |> String.replace_suffix(".", "")
  end
That is, my leginfo_url?/1 predicate validates the data and just returns a boolean. So the code only works if it happens to be written correctly, with the predicate called before leginfo_url_to_cite/1.

I think that the “parse don’t validate” idea is meant to remedy this. Instead of returning a boolean, one would return a type that can only be obtained by a valid parse. This way, instead of the programmer remembering to check for implicit dependencies, we enable the compiler to do it for us.
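As a sketch of what I mean (module and function names here are just illustrative, not my actual code): the parsing step returns a tagged tuple that can only be produced by a URI that passed every check, so downstream code has to pattern-match on `{:ok, ...}` rather than trusting that a boolean check already ran somewhere:

```elixir
# Illustrative sketch: the parser returns evidence of a successful
# parse instead of a boolean. {:ok, {code, section}} can only come
# from a URI that matched the host AND carried both query params.
defmodule ParseSketch do
  def parse_leginfo(%URI{host: "leginfo.legislature.ca.gov", query: query})
      when is_binary(query) do
    case URI.decode_query(query) do
      %{"lawCode" => code, "sectionNum" => section} -> {:ok, {code, section}}
      _ -> {:error, :missing_params}
    end
  end

  def parse_leginfo(_uri), do: {:error, :not_leginfo}
end
```

A caller then has to handle both shapes explicitly, so forgetting the validation step becomes a visible `case`/`with` branch instead of a silent ordering dependency.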

Does anybody here use that approach with Elixir? I suppose that with the above code, that’d mean creating a struct with statically defined keys :law_code and :section_num. And then the function heads would be written to only accept the named struct.
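Roughly like this, I imagine (a sketch with a made-up module name; I’ve dropped the @code_abbrevs/@cal_codes lookups for brevity):

```elixir
# Hypothetical sketch of the struct approach. Only parse_uri/1 builds
# a %LeginfoRef{}, so any function head matching on the struct is
# guaranteed its input already parsed successfully.
defmodule LeginfoRef do
  @enforce_keys [:law_code, :section_num]
  defstruct [:law_code, :section_num]

  def parse_uri(%URI{host: "leginfo.legislature.ca.gov", query: query})
      when is_binary(query) do
    case URI.decode_query(query) do
      %{"lawCode" => code, "sectionNum" => section} ->
        {:ok, %LeginfoRef{law_code: code, section_num: section}}

      _ ->
        :error
    end
  end

  def parse_uri(_), do: :error

  # This head only accepts the struct: the enforced shape replaces
  # the "remember to validate first" ordering dependency.
  def to_cite(%LeginfoRef{law_code: code, section_num: section}) do
    String.replace_suffix("CA #{code} Section #{section}", ".", "")
  end
end
```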

For this small piece of code, which also has complete test coverage, I’m not sure it’s worth the work. But in larger codebases, maybe it’d make sense?

  def find_citations(html) do
    html
    |> Floki.parse_document!()
    |> Floki.attribute("a", "href")
    |> List.flatten()
    |> Enum.map(&URI.parse/1)
    |> Enum.reduce([], &parse_valid/2)
    |> Enum.sort()
    |> Enum.uniq()
  end

  def parse_valid(uri, valids) do
    with %{host: "leginfo.legislature.ca.gov", query: query} when is_binary(query) <- uri,
         %{"lawCode" => code, "sectionNum" => section} <- URI.decode_query(query) do
      [String.replace_suffix("CA #{@code_abbrevs[@cal_codes[code]]} Section #{section}", ".", "") | valids]
    else
      _ -> valids
    end
  end
Is this something like what you had in mind (not tested)? In the code above nothing happens in the else clause of the with; the accumulator is just passed through. But you could add different pattern matches there, and maybe also parse that data and add it to the accumulator.
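For example (again an untested sketch; the module wrapper is just so it’s self-contained, and I’ve dropped the attribute lookups): the else clause receives whichever value broke the chain, so you can tell the failure cases apart by their shape:

```elixir
defmodule ParseValidSketch do
  # Sketch: the else clause sees the value that failed to match, so a
  # %URI{} there means the host/query check failed, while a plain map
  # means the query decoded but lacked the expected params.
  def parse_valid(uri, valids) do
    with %URI{host: "leginfo.legislature.ca.gov", query: query} when is_binary(query) <- uri,
         %{"lawCode" => code, "sectionNum" => section} <- URI.decode_query(query) do
      [String.replace_suffix("CA #{code} Section #{section}", ".", "") | valids]
    else
      # first clause failed: wrong host, or no query string at all
      %URI{} -> valids
      # second clause failed: a leginfo URL missing lawCode/sectionNum;
      # you could log it, or try a different parse, before moving on
      %{} -> valids
    end
  end
end
```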

Also, since there are a lot of Enum calls, consider using Stream. :cowboy_hat_face:

Maybe I don’t understand what “parse, don’t validate” from Haskell means. :confused:


As @ken-kost said, you might want to use Stream. Instead of List.flatten/1 you can use Stream.transform/3, and your final call can be Enum.sort/1, which will turn the stream into a list.
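A rough sketch of that shape (illustrative; I’ve left Floki out so the snippet is self-contained and starts from chunks of hrefs, which is what the flatten step was dealing with):

```elixir
defmodule StreamSketch do
  # Sketch: Stream.transform/3 flattens the per-chunk href lists lazily,
  # replacing List.flatten/1; every step stays lazy until Enum.sort/1
  # finally realizes the stream as a list.
  def citations(href_chunks) do
    href_chunks
    |> Stream.transform(nil, fn hrefs, acc -> {hrefs, acc} end)
    |> Stream.map(&URI.parse/1)
    |> Stream.filter(&match?(%URI{host: "leginfo.legislature.ca.gov"}, &1))
    |> Stream.flat_map(fn %URI{query: q} ->
      # emit a citation only when both expected params are present
      case URI.decode_query(q || "") do
        %{"lawCode" => c, "sectionNum" => s} ->
          [String.replace_suffix("CA #{c} Section #{s}", ".", "")]

        _ ->
          []
      end
    end)
    |> Enum.sort()
    |> Enum.uniq()
  end
end
```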

While types are very useful, it’s still a good idea to do this in untyped (“dynamically typed”) languages like Elixir and Erlang. Transforming external data to a known structure that your application understands and refusing to operate on anything but that eliminates lots of possible issues on its own, and is frankly less work in the long run than validating and operating on more-or-less raw data. Don’t let the absence of types stop you from implementing a good idea. :slightly_smiling_face:

You don’t have to go so far as to declare a whole new struct every time, either, as you get many of the benefits just from consistently combining functions like leginfo_url? and leginfo_url_to_cite into a function that returns {:error, reason} | {:ok, {:ad_hoc_tag, value}}. For small things that can be good enough.
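A minimal sketch of that combined function (the tag and module name are made up, and the abbreviation lookups are dropped):

```elixir
defmodule TaggedSketch do
  # Sketch: one function replacing leginfo_url?/1 + leginfo_url_to_cite/1.
  # The tagged {:ok, {:leginfo_cite, _}} value can only exist after a
  # successful parse, so there is no boolean check to forget.
  def leginfo_cite(%URI{host: "leginfo.legislature.ca.gov", query: query})
      when is_binary(query) do
    case URI.decode_query(query) do
      %{"lawCode" => code, "sectionNum" => section} ->
        {:ok, {:leginfo_cite, String.replace_suffix("CA #{code} Section #{section}", ".", "")}}

      _ ->
        {:error, :missing_params}
    end
  end

  def leginfo_cite(_), do: {:error, :not_leginfo}
end
```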


There are lots of data validation / casting libraries in Elixir that help with this sort of thing in different use cases. I wrote one myself: data_schema (GitHub: Adzz/data_schema), declarative schemas for data transformations.

But as mentioned, it doesn’t mean everything has to be a non-primitive type. It just means you should do type casting at the edge of the system and be confident that, once you’re further in, that casting has already happened.