Code critique: group_after/2 for parsing flat HTML

My problem

When parsing old government webpages, my input often looks like this:

<p><b>Section 1, name, and text</b></p>
<p>Section 1 more text</p>
<p>Section 1 more text</p>
<p><b>Section 2, name, and text</b></p>
<p>Section 2 more text</p>
<p><b>Section 3, name, and text</b></p>
// etc.

I’d really like feedback about the approach I came up with last night:

This looks like a take/while/scan kind of problem. But I couldn’t find an Enum or Floki function that seemed to handle this kind of repeated pattern.

I decided to write a function that would generically group items and following items using a predicate. In the case of the HTML above, the predicate would be “Does the element contain a <b>?” So, abstractly:

input =  [1,2,2,2,2,1,1,1,2]
output = [[1,2,2,2,2], [1], [1], [1, 2]]

I realized that’s not too hard with Enum.reduce:

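  def group_after(list, predicate) do
    list
    |> Enum.reduce([], fn e, acc ->
      case predicate.(e) do
        true ->
          # Start a new group at the head of the accumulator.
          [[e] | acc]

        false ->
          # Append to the most recent group (assumes one exists).
          {curr, rest} = List.pop_at(acc, 0)
          [curr ++ [e] | rest]
      end
    end)
    # Groups are prepended, so reverse to restore input order.
    |> Enum.reverse()
  end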

It works fine, although the reduce callback is quite procedural and not very expressive. What do you all think? Is there another approach I’m not considering?
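One candidate from the standard library is Enum.chunk_while/4. Below is a minimal sketch of the same grouping built on it; the module name ChunkSketch is made up for illustration, and skipping elements that appear before the first match is my assumption about the desired behavior.

```elixir
defmodule ChunkSketch do
  # Hypothetical module name; only Enum.chunk_while/4 and Enum.reverse/1
  # come from the standard library.
  def group_after(list, predicate) do
    chunk_fun = fn e, acc ->
      cond do
        # First leader seen: start collecting a group.
        predicate.(e) and acc == [] -> {:cont, [e]}
        # Later leader: emit the finished group, start a new one.
        predicate.(e) -> {:cont, Enum.reverse(acc), [e]}
        # No group started yet: skip this element (assumption).
        acc == [] -> {:cont, []}
        # Follower: add it to the current group (kept reversed).
        true -> {:cont, [e | acc]}
      end
    end

    after_fun = fn
      [] -> {:cont, []}
      acc -> {:cont, Enum.reverse(acc), []}
    end

    Enum.chunk_while(list, [], chunk_fun, after_fun)
  end
end

ChunkSketch.group_after([1, 2, 2, 2, 2, 1, 1, 1, 2], &(&1 == 1))
# => [[1, 2, 2, 2, 2], [1], [1], [1, 2]]
```

Each group is built by prepending and reversed once when it is emitted, which avoids the repeated `++` on a growing list.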

An alternate idea: treat a string like "tfffftttf" as an isomorphic image of map(list, predicate). Then use an expressive regex like ~r/tf*/ to group the trues and falses, instead of the procedural reduce. Finally, map the matches back to the original list elements.
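A rough sketch of that regex idea, to make it concrete. The module name RegexGrouping is hypothetical. It relies on "t" and "f" being single-byte characters, so the byte offsets returned by Regex.scan/3 with `return: :index` line up with list positions.

```elixir
defmodule RegexGrouping do
  # Hypothetical module name, for illustration only.
  def group_after(list, predicate) do
    # Encode each element as "t" (predicate holds) or "f" (it does not).
    flags = Enum.map_join(list, fn e -> if predicate.(e), do: "t", else: "f" end)

    # Each match of ~r/tf*/ is one leader followed by its followers.
    # With `return: :index`, every match is a [{start, length}] pair.
    ~r/tf*/
    |> Regex.scan(flags, return: :index)
    |> Enum.map(fn [{start, len}] -> Enum.slice(list, start, len) end)
  end
end

RegexGrouping.group_after([1, 2, 2, 2, 2, 1, 1, 1, 2], &(&1 == 1))
# => [[1, 2, 2, 2, 2], [1], [1], [1, 2]]
```

Leading elements that fail the predicate simply never match, so they are dropped. Enum.slice/3 walks the list for every group, so this is probably slower than the reduce on long inputs.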

I’m not sure I understand correctly. In your real-world example the grouping is already done, because the <b> elements are children of the <p>.

But your number example seems to be a different problem (chunk by 1s, with all the 2s that follow each 1).

Slow load (possibly timeout). I know how you feel. I hope those aren’t pages with invalid HTML. :smiling_imp:

Anyway, here is my solution:


defmodule Example do
  def sample(list, acc \\ [])

  # Empty input after parsing: nothing to group.
  def sample([], []), do: []

  # All <p> elements consumed: reverse the last section's texts and
  # wrap them in a list, so the result stays a list of sections where
  # each section is a list of texts.
  def sample([], acc), do: [:lists.reverse(acc)]

  # First bold paragraph: start the accumulator with its text
  # and recurse.
  def sample([{"p", _, [{"b", [], [text]}]} | tail], []) when is_binary(text) do
    sample(tail, [text])
  end

  # Bold paragraph with data already in acc: emit the reversed acc
  # as a finished section, then recurse with a fresh accumulator.
  def sample([{"p", _, [{"b", [], [text]}]} | tail], acc) when is_binary(text) do
    [:lists.reverse(acc) | sample(tail, [text])]
  end

  # Plain paragraph: add its text to acc and recurse.
  def sample([{"p", _, [text]} | tail], acc) when is_binary(text) do
    sample(tail, [text | acc])
  end
end

# `html` is assumed to hold the sample markup from the question.
html
|> Floki.parse_fragment!()
|> Example.sample()
|> dbg()

Pattern matching is the fastest solution. You can take a look at this post to see possible alternative solutions.

Sorry, yeah - I left it abstract. I used the 1’s to represent paragraphs with a <b>, and 2’s for <p>'s without.

Here’s a screenshot of the page.

And I’m trying to produce a list of %Section{}.

The actual HTML looks like this. All of the <p>'s just run on; there’s no hierarchy. The only clue is that the first line of a Section has a <b>. :joy:

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><b><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      838.025
Election laws apply.</span></b><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>
(1) ORS chapter 255 governs the following:</span></p>

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      (a)
The nomination and election of district board members.</span></p>

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      (b)
The conduct of district elections.</span></p>

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      (2)
The electors of a district may exercise the powers of the initiative and
referendum regarding a district measure, in accordance with ORS 255.135 to
255.205. [Formerly 494.043]</span></p>
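For markup like this, the "does the paragraph contain a <b>?" predicate has to look through the nested <span>s. Here is a minimal sketch against the `{tag, attrs, children}` tuples that Floki produces, without calling Floki itself. The module name BoldSketch and the function contains_bold?/1 are made up for illustration.

```elixir
defmodule BoldSketch do
  # Hypothetical helper: true if the node is, or contains, a <b> element.
  def contains_bold?({"b", _attrs, _children}), do: true

  def contains_bold?({_tag, _attrs, children}) when is_list(children) do
    Enum.any?(children, &contains_bold?/1)
  end

  # Text nodes, comments, and anything else never count as bold.
  def contains_bold?(_other), do: false
end

# Hand-written trees shaped like parsed paragraphs (attributes shortened).
heading =
  {"p", [{"class", "MsoNormal"}],
   [
     {"b", [], [{"span", [], ["      838.025\nElection laws apply."]}]},
     {"span", [], ["\n(1) ORS chapter 255 governs the following:"]}
   ]}

body = {"p", [{"class", "MsoNormal"}], [{"span", [], ["      (a)\nThe nomination..."]}]}

BoldSketch.contains_bold?(heading)
# => true
BoldSketch.contains_bold?(body)
# => false
```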

Thanks! That’s pretty interesting. This meta tag is pretty chilling. :slight_smile:

<meta name=Generator content="Microsoft Word 15 (filtered)">

It seems they’re somehow scripting MS Word to “Save as HTML”.

sounds like fun … not :grin:

Since you did not give a struct definition, I wrote one myself:


defmodule Section do
  defstruct ~w[contents id title]a

  # Prepend a plain paragraph's trimmed text to the section contents.
  def add_text(%__MODULE__{contents: contents} = section, {"span", _, [text]})
      when is_binary(text) do
    %{section | contents: [String.trim(text) | contents]}
  end

  # Build a new section from a bold "id\ntitle" span plus its first text span.
  def first_paragraph({"b", _, [{"span", _, [id_title]}]}, {"span", _, [text]})
      when is_binary(id_title) and is_binary(text) do
    [id, title] = id_title |> String.trim() |> String.split("\n", parts: 2)
    %__MODULE__{contents: [String.trim(text)], id: id, title: title}
  end

  # Contents are accumulated in reverse; restore document order.
  def reverse_contents(%__MODULE__{contents: contents} = section) do
    %{section | contents: :lists.reverse(contents)}
  end
end

defmodule Example do
  def sample(list, section \\ nil)

  # No paragraphs at all: the section accumulator is still nil.
  def sample([], nil), do: []

  def sample([], section), do: [Section.reverse_contents(section)]

  def sample([{"p", _, [title, text]} | tail], nil) do
    sample(tail, Section.first_paragraph(title, text))
  end

  def sample([{"p", _, [title, text]} | tail], section) do
    new_section = Section.first_paragraph(title, text)
    [Section.reverse_contents(section) | sample(tail, new_section)]
  end

  def sample([{"p", _, [text]} | tail], section) do
    sample(tail, Section.add_text(section, text))
  end
end

# `html` is assumed to hold the actual markup shown above.
html
|> Floki.parse_fragment!()
|> Example.sample()
|> dbg()

What do you think about it?

This is not too bad; at least all the <p> elements have a closing </p>. They don’t have to, you know, and I’ve seen HTML that freely mixes the two styles, with and without the closing </p>.


That’s very cool. Thanks for taking a whack at it. The <span>'s and other attributes aren’t important, though, because they’re the same on every element. (!) The big picture is, we want mostly plain text: simplified HTML. This code’s purpose is to produce well-formed JSON with all the important info from the original texts. I publish the JSON to a public datasets repo.
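To illustrate the "mostly plain text" goal, here is a small dependency-free sketch that flattens a parsed node to whitespace-normalized text, ignoring tags and attributes. The module name PlainTextSketch and both function names are made up; in a real project Floki.text/1 would likely cover the first step.

```elixir
defmodule PlainTextSketch do
  # Hypothetical helper: concatenate all text found inside a node or node list.
  def plain_text(text) when is_binary(text), do: text
  def plain_text({_tag, _attrs, children}), do: plain_text(children)
  def plain_text(nodes) when is_list(nodes), do: Enum.map_join(nodes, "", &plain_text/1)
  # Comments and other node kinds contribute no text.
  def plain_text(_other), do: ""

  # Collapse runs of whitespace (including newlines) into single spaces.
  def normalize(node) do
    node |> plain_text() |> String.split() |> Enum.join(" ")
  end
end

node =
  {"p", [{"class", "MsoNormal"}],
   [
     {"b", [], [{"span", [], ["      838.025\nElection laws apply."]}]},
     {"span", [], ["\n(1) ORS chapter 255 governs the following:"]}
   ]}

PlainTextSketch.normalize(node)
# => "838.025 Election laws apply. (1) ORS chapter 255 governs the following:"
```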

Here’s how I solved it. The actual Section struct:

defmodule Crawlers.ORS.Models.Section do
  @moduledoc """
  An ORS Section.
  """
  use TypedStruct

  typedstruct enforce: true do
    @typedoc "An ORS Section"

    field :kind, String.t(), default: "section"
    field :name, String.t()
    field :number, String.t()
    field :text, String.t()
    field :chapter_number, String.t()
  end
end

The group_with/2 function:

  @doc """
  Group a list of elements into sub-lists, where each sub-list is
  led by an element that satisfies the predicate. It skips initial
  elements that do not satisfy the predicate.

      iex> group_with([1, 2, 3, 1, 4], &(&1 == 1))
      [[1, 2, 3], [1, 4]]

      iex> group_with(["a", "b", "x", "c", "d"], &(&1 == "x"))
      [["x", "c", "d"]]
  """
  def group_with(list, predicate) do
    result_reversed =
      Enum.reduce(list, [], fn e, acc ->
        case predicate.(e) do
          true ->
            # Start a new group.
            [[e] | acc]

          false ->
            # Try to extract the current group.
            {curr, rest} = List.pop_at(acc, 0)

            case curr do
              # Skip until predicate is true.
              nil -> acc
              # Append to current group.
              group -> [group ++ [e] | rest]
            end
        end
      end)

    # Groups were prepended, so reverse to restore input order.
    Enum.reverse(result_reversed)
  end


I’ll just pray that Floki can handle it. I’m sure I’ll be dealing with that at some point.