Code critique: group_after/2 for parsing flat HTML

dogweather · September 24, 2022, 10:05pm

My problem

When parsing old government webpages, my input often looks like this:

<p><b>Section 1, name, and text</b></p>
<p>Section 1 more text</p>
<p>Section 1 more text</p>
<p><b>Section 2, name, and text</b></p>
<p>Section 2 more text</p>
<p><b>Section 3,  name, and text</b></p>
// etc.

I’d really like feedback about the approach I came up with last night:

This looks like a take/while/scan kind of problem. But I couldn’t find an Enum or Floki function that seemed to handle this kind of repeated pattern.

I decided to write a function that would generically group each matching item with the items that follow it, using a predicate. In the case of the HTML above, the predicate would be “Does the element contain a <b>?” So, abstractly:

input =  [1,2,2,2,2,1,1,1,2]
output = [[1,2,2,2,2], [1], [1], [1, 2]]

I realized that’s not too hard with Enum.reduce:

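  def group_after(list, predicate) do
    list
    |> Enum.reduce([], fn e, acc ->
      case predicate.(e) do
        true ->
          # Start a new group at the front of the accumulator.
          [[e] | acc]

        false ->
          # Append to the most recent group (assumes one already exists).
          {curr, rest} = List.pop_at(acc, 0)
          [curr ++ [e] | rest]
      end
    end)
    # Groups were prepended, so restore the input order.
    |> Enum.reverse()
  end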

It works fine, although the function passed to reduce is very procedural and not very expressive. What do you all think? Is there another approach I’m not considering?
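One other option is Enum.chunk_while/4, which is built for this kind of "emit a chunk when something changes" pattern. The following is only a minimal sketch: the module name ChunkSketch is made up for illustration, and it assumes the predicate returns strictly true or false.

```elixir
defmodule ChunkSketch do
  # Hypothetical module name, for illustration only.
  # Groups each element that satisfies `predicate` with the elements after it.
  def group_after(list, predicate) do
    chunk_fun = fn e, acc ->
      case {predicate.(e), acc} do
        # First leader seen: open a group, nothing to emit yet.
        {true, []} -> {:cont, [e]}
        # New leader: emit the open group (stored reversed) and start another.
        {true, group} -> {:cont, Enum.reverse(group), [e]}
        # No group open yet: skip elements before the first leader.
        {false, []} -> {:cont, []}
        # Follower: push onto the open group.
        {false, group} -> {:cont, [e | group]}
      end
    end

    after_fun = fn
      # Nothing left open at the end of the input.
      [] -> {:cont, []}
      # Emit the final open group.
      group -> {:cont, Enum.reverse(group), []}
    end

    Enum.chunk_while(list, [], chunk_fun, after_fun)
  end
end

ChunkSketch.group_after([1, 2, 2, 2, 2, 1, 1, 1, 2], &(&1 == 1))
# => [[1, 2, 2, 2, 2], [1], [1], [1, 2]]
```

Each group is built by prepending and reversed once when it is emitted, so no `++` appends are needed.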

An alternate idea: Consider a string "tfffftttf" as an isomorph of map(list, predicate). Then use an expressive regex like ~r/tf*/ to group the true & false — instead of the procedural reduce. Finally, undo the mapping back into the original list elements.
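A rough sketch of that regex idea, to make it concrete. The module name RegexGroupSketch is made up for illustration. It relies on "t" and "f" being single bytes, so the byte offsets reported by Regex.scan/3 line up with list positions.

```elixir
defmodule RegexGroupSketch do
  # Hypothetical module name, for illustration only.
  def group_after(list, predicate) do
    # Map every element to "t" or "f", e.g. "tfffftttf".
    flags = Enum.map_join(list, fn e -> if predicate.(e), do: "t", else: "f" end)

    # Each match of ~r/tf*/ is one leader plus its followers.
    # `return: :index` yields {start, length} pairs instead of substrings.
    ~r/tf*/
    |> Regex.scan(flags, return: :index)
    |> Enum.map(fn [{start, len}] -> Enum.slice(list, start, len) end)
  end
end

RegexGroupSketch.group_after([1, 2, 2, 2, 2, 1, 1, 1, 2], &(&1 == 1))
# => [[1, 2, 2, 2, 2], [1], [1], [1, 2]]
```

The grouping rule reads nicely as a regex, but Enum.slice/3 walks the list from the start for every group, so this is likely slower than a single reduce on long inputs.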

Sebb · September 25, 2022, 8:28am

I’m not sure if I understand correctly. In your real-world example the grouping is already done, because the <b> elements are children of the <p>.

But your number example seems to be a different problem (chunk by 1s and all the 2s that follow a 1).

Eiji · September 25, 2022, 9:05am

Slow load (possibly a timeout). I know how you feel. I hope those are not ASP.NET pages with invalid HTML.

Anyway, here is my solution:

Mix.install([:floki])

defmodule Example do
  def sample(list, acc \\ [])

  # for empty input after parsing
  def sample([], []), do: []

  # when all p elements have been consumed,
  # reverse the last section's texts and wrap them in a list,
  # so that the result stays a list of sections,
  # where each section is a list of texts
  def sample([], acc), do: [:lists.reverse(acc)]

  # for the first bold text, simply make it
  # the only element of a new acc
  # and call the function recursively
  def sample([{"p", _, [{"b", [], [text]}]} | tail], []) when is_binary(text) do
    sample(tail, [text])
  end

  # however, if there is already data in acc,
  # reverse it and return it as the finished section's texts,
  # followed by the result of the recursive call
  def sample([{"p", _, [{"b", [], [text]}]} | tail], acc) when is_binary(text) do
    [:lists.reverse(acc) | sample(tail, [text])]
  end

  # when we got a normal text simply add it to acc
  # and call function recursively
  def sample([{"p", _, [text]} | tail], acc) when is_binary(text) do
    sample(tail, [text | acc])
  end
end

"""
<p><b>Section 1, name, and text</b></p>
<p>Section 1 more text</p>
<p>Section 1 more text</p>
<p><b>Section 2, name, and text</b></p>
<p>Section 2 more text</p>
<p><b>Section 3,  name, and text</b></p>
"""
|> Floki.parse_fragment!()
|> Example.sample()
|> dbg()

Pattern matching is the fastest solution. You can take a look at this post to see possible alternative solutions.

dogweather · September 25, 2022, 11:26am

Sorry, yeah - I left it abstract. I used the 1’s to represent paragraphs with a <b>, and 2’s for <p>'s without.

Here’s a screenshot of the page.

And I’m trying to produce a list of %Section{}.

The actual HTML looks like this. All of the <p>'s just run on; there’s no hierarchy. The only clue is that the first line of a Section has a <b>.

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><b><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      838.025
Election laws apply.</span></b><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>
(1) ORS chapter 255 governs the following:</span></p>

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      (a)
The nomination and election of district board members.</span></p>

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      (b)
The conduct of district elections.</span></p>

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      (2)
The electors of a district may exercise the powers of the initiative and
referendum regarding a district measure, in accordance with ORS 255.135 to
255.205. [Formerly 494.043]</span></p>
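Given that structure, the "does this paragraph start a Section?" predicate could be sketched as below. This is only an illustration: the module name LeaderSketch is made up, and it assumes the parsed nodes are {tag, attrs, children} tuples with plain strings for text.

```elixir
defmodule LeaderSketch do
  # Hypothetical module name, for illustration only.

  # A <b> element itself counts as bold.
  def has_bold?({"b", _attrs, _children}), do: true

  # Any other element is bold if one of its descendants is a <b>.
  def has_bold?({_tag, _attrs, children}) when is_list(children) do
    Enum.any?(children, &has_bold?/1)
  end

  # Text nodes and anything else never count.
  def has_bold?(_other), do: false
end

first = {"p", [], [{"b", [], [{"span", [], ["838.025\nElection laws apply."]}]}, {"span", [], ["(1) ..."]}]}
other = {"p", [], [{"span", [], ["(a) ..."]}]}

LeaderSketch.has_bold?(first)
# => true
LeaderSketch.has_bold?(other)
# => false
```

A function like this can be passed as the predicate to the grouping function, so the grouping logic stays independent of the HTML details.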

dogweather · September 25, 2022, 11:35am

Thanks! That’s pretty interesting. This meta tag is pretty chilling.

<meta name=Generator content="Microsoft Word 15 (filtered)">

It seems they’re somehow scripting MS Word to “Save as HTML”.

Sebb · September 25, 2022, 11:53am

sounds like fun … not

Eiji · September 25, 2022, 12:11pm

Since you did not give a struct definition I wrote it myself:

Mix.install([:floki])

defmodule Section do
  defstruct ~w[contents id title]a

  def add_text(%__MODULE__{contents: contents} = section, {"span", _, [text]})
      when is_binary(text) do
    %{section | contents: [String.trim(text) | contents]}
  end

  def first_paragraph({"b", _, [{"span", _, [id_title]}]}, {"span", _, [text]})
      when is_binary(id_title) and is_binary(text) do
    [id, title] = id_title |> String.trim() |> String.split("\n", parts: 2)
    %Section{contents: [String.trim(text)], id: id, title: title}
  end

  def reverse_contents(%__MODULE__{contents: contents} = section) do
    %{section | contents: :lists.reverse(contents)}
  end
end

defmodule Example do
  def sample(list, section \\ nil)

  # empty input: no section has been started (the default is nil)
  def sample([], nil), do: []

  def sample([], section), do: [Section.reverse_contents(section)]

  def sample([{"p", _, [title, text]} | tail], nil) do
    sample(tail, Section.first_paragraph(title, text))
  end

  def sample([{"p", _, [title, text]} | tail], section) do
    new_section = Section.first_paragraph(title, text)
    [Section.reverse_contents(section) | sample(tail, new_section)]
  end

  def sample([{"p", _, [text]} | tail], section) do
    sample(tail, Section.add_text(section, text))
  end
end

"""
<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><b><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      838.025
Election laws apply.</span></b><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>
(1) ORS chapter 255 governs the following:</span></p>

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      (a)
The nomination and election of district board members.</span></p>

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      (b)
The conduct of district elections.</span></p>

<p class=MsoNormal style='margin-bottom:0in;line-height:normal;text-autospace:
none'><span style='font-size:12.0pt;font-family:"Times New Roman",serif'>      (2)
The electors of a district may exercise the powers of the initiative and
referendum regarding a district measure, in accordance with ORS 255.135 to
255.205. [Formerly 494.043]</span></p>
"""
|> Floki.parse_fragment!()
|> Example.sample()
|> dbg()

What do you think about it?

derek-zhou · September 25, 2022, 12:35pm

This is not too bad; at least all the <p> have closing </p>. You know they don’t have to, and I’ve seen HTML that freely mixes the two styles, with or without closing </p>.

dogweather · September 25, 2022, 10:31pm

That’s very cool. Thanks for taking a whack at it. The <span>'s and other attributes aren’t important, though, because they’re the same on every element. (!) The big picture is, we want mostly plain text, i.e. simplified HTML. This code’s purpose is to produce well-formed JSON with all the important info from the original texts. I publish the JSON to a public datasets repo.

You can see how I solved it: The actual Section:

defmodule Crawlers.ORS.Models.Section do
  @moduledoc """
  An ORS Section.
  """
  use TypedStruct

  typedstruct enforce: true do
    @typedoc "An ORS Section"

    field :kind, String.t(), default: "section"
    field :name, String.t()
    field :number, String.t()
    field :text, String.t()
    field :chapter_number, String.t()
  end
end

The group_with/2 function:

  @doc """
  Group a list of elements into sub-lists, where each sub-list is
  led by an element that satisfies the predicate. It skips initial
  elements that do not satisfy the predicate.

  iex> group_with([1, 2, 3, 1, 4], &(&1 == 1))
  [[1, 2, 3], [1, 4]]

  iex> group_with(["a", "b", "x", "c", "d"], &(&1 == "x"))
  [["x", "c", "d"]]
  """
  def group_with(list, predicate) do
    result_reversed =
      Enum.reduce(list, [], fn e, acc ->
        case predicate.(e) do
          true ->
            # Start a new group.
            [[e]] ++ acc

          false ->
            # Try to extract the current group.
            {curr, rest} = List.pop_at(acc, 0)

            case curr do
              # Skip until predicate is true.
              nil -> acc
              # Append to current group.
              group -> [group ++ [e] | rest]
            end
        end
      end)

    Enum.reverse(result_reversed)
  end
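
A possible variation on the same reduce, sketched under the assumption that the predicate returns strictly true or false. The module name GroupWithSketch is made up for illustration. It pattern-matches on the accumulator instead of calling List.pop_at/2, and prepends to the current group instead of appending with `++`, which copies the whole group on every step.

```elixir
defmodule GroupWithSketch do
  # Hypothetical module name, for illustration only.
  def group_with(list, predicate) do
    list
    |> Enum.reduce([], fn e, acc ->
      case {predicate.(e), acc} do
        # Leader: start a new group at the front.
        {true, _} -> [[e] | acc]
        # No group yet: skip until the predicate is true.
        {false, []} -> []
        # Follower: prepend to the current group (stored reversed).
        {false, [group | rest]} -> [[e | group] | rest]
      end
    end)
    # Undo the prepends: first inside each group, then across groups.
    |> Enum.map(&Enum.reverse/1)
    |> Enum.reverse()
  end
end

GroupWithSketch.group_with([1, 2, 3, 1, 4], &(&1 == 1))
# => [[1, 2, 3], [1, 4]]
```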

dogweather · September 25, 2022, 10:35pm

I’ll just pray that Floki can handle it. I’m sure I’ll be dealing with that at some point.