Floki parse and group by heading

neuone · July 18, 2018, 2:18pm

The HTML has hardly any attributes; it’s very plain:

<!doctype html>
<html>
<body>
  <section class="body-copy">
  <h2>Topic 1</h2>
  <p>data-a</p>
  <p>data-b</p>
  <p>data-c</p>
  <h2>Topic 2</h2>
  <p>data-d</p>
  <p>data-e</p>
  <p>data-f</p>
  <h2>Topic 3</h2>
  <p>data-g</p>
  <p>data-h</p>
  <p>data-i</p>
  </section>
</body>
</html>

I’m using Floki and I’m trying to parse it so I can create a list of maps like so:

%{ topic: "Topic 1", data: "data-a" }
%{ topic: "Topic 1", data: "data-b" }
%{ topic: "Topic 1", data: "data-c" }
%{ topic: "Topic 2", data: "data-d" }
%{ topic: "Topic 2", data: "data-e" }
%{ topic: "Topic 2", data: "data-f" }

I’m struggling to get all the P tags under each H2 with Floki.

# Try loading this HTML (the match fails loudly if the file cannot be read)
path = "/Users/Foo/Desktop/test.html"
{:ok, local_file} = File.read(path)

# This will return me all the h2
Floki.find(local_file, "h2")               
[{"h2", [], ["Topic 1"]}, {"h2", [], ["Topic 2"]}, {"h2", [], ["Topic 3"]}]

# This returns the first p after a specific h2, but not all of them
Floki.find(local_file, "h2:nth-of-type(1) + p")
[{"p", [], ["data-a"]}]

# This returns the first p for Topic 2, but I need 2 more p tags (data-e, data-f)
Floki.find(local_file, "h2:nth-of-type(2) + p")
[{"p", [], ["data-d"]}]

# This returns the first p for Topic 3, but I need 2 more p tags (data-h, data-i)
Floki.find(local_file, "h2:nth-of-type(3) + p")
[{"p", [], ["data-g"]}]

Question
I cannot figure out how to get ONLY the P tags for each H2.

mischov · July 18, 2018, 3:38pm

The problem is that the data is not actually grouped by anything other than intent and order, which CSS selectors don’t handle well.

If each item looked like <div><h2>Header</h2><p>Item 1</p><p>Item 2</p></div> then trying to select with CSS selectors would be a lot easier because that’s the type of data those selectors work best with.

Here are a couple ways you could solve the problem (based on the example data):

1. If there are an identical number of p tags after each h2 tag, you could get a list of all the children of the <section> and do an Enum.chunk_by to group the nodes that belong together and go from there.
2. If there aren’t an identical number of p tags after each h2 tag, or there are other unwanted children, you could get a list of all the children of the <section> and write a reducer that builds up a map from each h2 element to a list of the nodes that occur before the next h2, and go from there.
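As a rough illustration of the chunking idea, here is a minimal sketch that works on already-parsed nodes in Floki's {tag, attributes, children} tuple shape. The module name ChunkSketch and the function topic_maps/1 are made up for this example. It assumes the list starts with an h2, that every h2 is followed by at least one p, and that each element has a single text child.

```elixir
defmodule ChunkSketch do
  # Hypothetical helper: `nodes` is a flat list of {tag, attrs, children} tuples.
  def topic_maps(nodes) do
    nodes
    # Split into alternating runs: [h2], [p, p, ...], [h2], [p, ...], ...
    |> Enum.chunk_by(fn {tag, _attrs, _children} -> tag == "h2" end)
    # Pair each heading run with the paragraph run that follows it
    |> Enum.chunk_every(2)
    |> Enum.flat_map(fn
      [[{"h2", _, [topic]}], paragraphs] ->
        for {"p", _, [data]} <- paragraphs, do: %{topic: topic, data: data}

      # Anything that does not look like heading + paragraphs is ignored
      _unpaired ->
        []
    end)
  end
end
```

Because chunk_by only looks at whether a node is an h2, this variant does not actually need the same number of p tags under each heading, but it silently drops anything that breaks the stated assumptions.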

There might be better solutions I’m not thinking of, but those are places where you could start.

neuone · July 18, 2018, 4:36pm

The situation is like your point number 2. After each H2 there is just a random number of p tags, nothing else.
I will need to figure out how to write a reducer first, and then a reducer for situation number 2.

idi527 · July 18, 2018, 5:22pm

Maybe

def grouped_by_topics(body_nodes) do
  # nil here is a hack, you can run this function in two phases instead
  # 1. find the first `h2` with a topic
  # 2. then start this function with the rest of `body_nodes`
  grouped_by_topics(body_nodes, nil, [], [])
end

# when we meet an `h2` tag, start a new `inner_acc` for collecting the data for the topic in `h2`
defp grouped_by_topics([{"h2", [], ["Topic" <> _ = next_topic]} | rest], prev_topic, prev_inner_acc, outer_acc) do
  grouped_by_topics(rest, next_topic, [], [%{prev_topic => prev_inner_acc} | outer_acc])
end

# when we meet a new `p` tag, add it to the `inner_acc` for the current topic
defp grouped_by_topics([{"p", [], ["data" <> _ = new_data]} | rest], current_topic, inner_acc, outer_acc) do
  grouped_by_topics(rest, current_topic, [new_data | inner_acc], outer_acc)
end

# neither a `p` nor an `h2` tag -- skip
defp grouped_by_topics([_other | rest], current_topic, inner_acc, outer_acc) do
  grouped_by_topics(rest, current_topic, inner_acc, outer_acc)
end

# no more html nodes -- finish
defp grouped_by_topics([], last_topic, inner_acc, outer_acc) do
  [%{last_topic => inner_acc} | outer_acc]
end

inner_acc is for collecting data-* in p tags
outer_acc is for collecting %{topic => data (aka final_inner_acc)} maps
current_topic is for keeping the topic from the last h2 tag
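The two-phase alternative mentioned in the first comment might look roughly like the sketch below. The names TwoPhaseSketch and collect/4 are invented for illustration. Unlike the version above, it reverses the accumulators so the result comes out in document order, and it matches any single-child h2 or p rather than only text starting with "Topic" or "data".

```elixir
defmodule TwoPhaseSketch do
  # Phase 1: drop everything before the first single-child h2,
  # so no nil placeholder topic is needed.
  def grouped_by_topics(nodes) do
    case Enum.drop_while(nodes, fn node -> not match?({"h2", _, [_]}, node) end) do
      [{"h2", _, [topic]} | rest] -> collect(rest, topic, [], [])
      [] -> []
    end
  end

  # Phase 2: a new h2 closes the current topic and starts the next one
  defp collect([{"h2", _, [next_topic]} | rest], topic, data, acc),
    do: collect(rest, next_topic, [], [%{topic => Enum.reverse(data)} | acc])

  # a p tag adds its text to the current topic
  defp collect([{"p", _, [text]} | rest], topic, data, acc),
    do: collect(rest, topic, [text | data], acc)

  # anything else is skipped
  defp collect([_other | rest], topic, data, acc),
    do: collect(rest, topic, data, acc)

  # no more nodes: close the last topic and restore document order
  defp collect([], topic, data, acc),
    do: Enum.reverse([%{topic => Enum.reverse(data)} | acc])
end
```

With this shape there is no leading entry for a nil topic, because collection only starts once a heading has been found.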

mischov · July 18, 2018, 5:35pm

This is brittle as all get-out because it makes some serious assumptions about the shape of the data, but:

def extract_topic_maps(html) do
  nodes = Floki.find(html, "section.body-copy > *")

  {_, topic_maps} =
    Enum.reduce(nodes, {nil, []}, fn
      {"p", _, _}, {nil, _} -> raise "Invalid state: no topic"
      # New topic, set in accumulator
      {"h2", _, [topic]}, {_, topic_maps} -> {topic, topic_maps}
      # New value in topic, add appropriate topic_map to topic_maps in accumulator
      {"p", _, [data]}, {topic, topic_maps} -> {topic, [%{topic: topic, data: data} | topic_maps]}
      node, _ -> raise "Invalid state: unexpected node #{inspect(node)}"
    end)

  Enum.reverse(topic_maps)
end

With your input that returns something like

[
  %{data: "data-a", topic: "Topic 1"},
  %{data: "data-b", topic: "Topic 1"}, 
  %{data: "data-c", topic: "Topic 1"},
  %{data: "data-d", topic: "Topic 2"},
  ...
]
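If a grouped shape is more convenient than one map per paragraph, a flat list like this can be folded with Enum.group_by/3. This is only a sketch; the module GroupSketch and the function by_topic/1 are names made up for the example.

```elixir
defmodule GroupSketch do
  # Hypothetical helper: turns [%{topic: t, data: d}, ...] into %{t => [d, ...]}.
  # Enum.group_by/3 keeps the values of each group in their original order.
  def by_topic(topic_maps) do
    Enum.group_by(topic_maps, fn m -> m.topic end, fn m -> m.data end)
  end
end
```

Note that the result is a plain map, so the order of the topics themselves is not guaranteed to follow the document.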

neuone · July 18, 2018, 6:39pm

This is the final result

[
  %{"Topic 3" => ["data-i", "data-h", "data-g"]},
  %{"Topic 2" => ["data-f", "data-e", "data-d"]},
  %{"Topic 1" => ["data-c", "data-b", "data-a"]},
  %{nil: []}
]

Is this solution a good example of recursion?
I’m still learning Elixir, that’s why I am asking.

neuone · July 18, 2018, 6:50pm

This result is what I wanted to achieve.
Thank you for this solution.
I don’t completely understand exactly how it works, but I’ll look into it this evening.
I’m reading up on reducers in Elixir. I’m hoping this, along with my tutorials, will get me further along.

mischov · July 18, 2018, 7:04pm

Simply, reducers are functions that take 1) the current item from the collection being reduced over, and 2) the current accumulator, and that return a value that will be used as the accumulator in the next call of the reducer (for the next item in the collection being reduced over).

For example, fn n, sum -> sum + n end is a reducer that takes a number and adds it to the running sum. It could be used like

iex> Enum.reduce([1, 2, 3], 0, fn n, sum -> sum + n end)
6

In the solution, my accumulator takes the shape of a tuple that holds the current topic (or nil) as the first element, and a list of topic maps as the second element.

The reducer has four clauses:

If it encounters a <p> element before it has a topic, it raises.
If it encounters an <h2> element, it assumes that the element has one child, which will be the topic, and sets the topic in the accumulator to that topic.
If it encounters a <p> element and has a topic, it creates a topic map with the data from the <p> element (again assuming the element only has one child) and the topic, and adds that topic map to the topic maps in the accumulator.
If it encounters any other shape of node (whether it’s a <span>, or a <p> element with no or multiple children, or anything else), it raises.
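To watch the accumulator change clause by clause, the same four clauses can be fed to Enum.scan/3, which returns every intermediate accumulator instead of only the last one. This is a sketch on hand-written node tuples; the module name ReducerTrace and its functions are invented for the example.

```elixir
defmodule ReducerTrace do
  # The four reducer clauses described above, as a named function.
  def step({"p", _, _}, {nil, _}), do: raise("Invalid state: no topic")
  def step({"h2", _, [topic]}, {_, topic_maps}), do: {topic, topic_maps}

  def step({"p", _, [data]}, {topic, topic_maps}),
    do: {topic, [%{topic: topic, data: data} | topic_maps]}

  def step(node, _acc), do: raise("Invalid state: unexpected node #{inspect(node)}")

  # Returns the accumulator after each node, starting from {nil, []}.
  def trace(nodes), do: Enum.scan(nodes, {nil, []}, &step/2)
end

nodes = [
  {"h2", [], ["Topic 1"]},
  {"p", [], ["data-a"]},
  {"h2", [], ["Topic 2"]},
  {"p", [], ["data-d"]}
]

# Prints one {current_topic, topic_maps} tuple per node
Enum.each(ReducerTrace.trace(nodes), fn acc -> IO.inspect(acc) end)
```

The topic in the first tuple element changes only on h2 nodes, while the list in the second element grows (newest first) only on p nodes, which is why the solution reverses it at the end.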

If you have any other questions about how some part of that works feel free to ask.

OvermindDL1 · July 18, 2018, 8:13pm

If you don’t need to write data ‘out’, then Meeseeks has some far better selectors that can grab entire ranges much more easily, via XPath selectors and also some extended CSS selectors.