Help with using sweet_xml and nested xpath

Newxan · January 19, 2020, 12:28pm

Hi, I’ve got a database XML dump from an old WordPress site that I want to import into a project.
I’m using sweet_xml and it’s working fine for the simple structures.
However, when I encounter lists of nested fields such as WordPress’s metadata fields, I can’t get sweet_xml to pick up the values. I’ve reproduced my problem in a minimal way:
import.ex

defmodule Mix.Tasks.Import do
  use Mix.Task

  import SweetXml

  @target_file "./data.xml"

  def run(_args) do
    Mix.Task.run("app.start")
    import_dump()
  end

  defp import_dump do
    tree = File.stream!(@target_file)

    rows =
      SweetXml.xpath(
        tree,
        ~x"/rss/channel/item"l,
        slug: ~x"./wp:post_name/text()"s,
        company_logo: ~x"./wp:postmeta[wp:meta_key = '_thumbnail_id']/wp:meta_value/text()"s,
        edited_at: ~x"./wp:postmeta[wp:meta_key = '_edit_last']/wp:meta_value/text()"s
      )

    IO.inspect(rows)
  end
end

data.xml

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <wp:post_id>7068</wp:post_id>
      <wp:post_name>
<![CDATA[a-post-name]]>
      </wp:post_name>
      <wp:postmeta>
        <wp:meta_key>
<![CDATA[_edit_last]]>
        </wp:meta_key>
        <wp:meta_value>
<![CDATA[1]]>
        </wp:meta_value>
      </wp:postmeta>
      <wp:postmeta>
        <wp:meta_key>
<![CDATA[_thumbnail_id]]>
        </wp:meta_key>
        <wp:meta_value>
<![CDATA[11111]]>
        </wp:meta_value>
      </wp:postmeta>
    </item>
  </channel>
</rss>

But the output I’m getting no matter what I try is:

[%{company_logo: "", edited_at: "", slug: "\na-post-name\n      "}]

As you can see the company_logo and edited_at fields are always empty.

I’m not really sure where to go from here. As far as I can tell the XPath syntax itself is correct, but I am unsure how you’re supposed to use the sweet_xml sigil modifiers when there is nested data such as this.

srowley · January 19, 2020, 2:00pm

There is a hint in your output: the CDATA values are wrapped in newline characters and have trailing spaces, i.e. the value for slug in your example is "\na-post-name\n      ".

That means your XPath query is not finding anything for the company_logo or edited_at nodes because, for example, it is looking for the key _edit_last when the text of that key is actually "\n_edit_last\n" followed by indentation spaces.
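To see the mismatch outside of XPath, here is a minimal sketch in plain Elixir. The raw_key value is an assumption: it mimics the text found inside a wp:meta_key element in the sample data (a newline, the key, a newline, and the indentation).

```elixir
# Assumed raw text of a <wp:meta_key> element from the sample data:
# newline, the key itself, newline, then eight spaces of indentation.
raw_key = "\n_edit_last\n        "

# Exact comparison fails because of the surrounding whitespace.
exact_match = raw_key == "_edit_last"

# A substring test succeeds, which is what XPath's contains() checks.
substring_match = String.contains?(raw_key, "_edit_last")

# Trimming first makes the exact comparison succeed as well.
trimmed_match = String.trim(raw_key) == "_edit_last"

IO.inspect({exact_match, substring_match, trimmed_match})
# prints {false, true, true}
```

Keep in mind that a substring test would also match a longer key that merely contains _edit_last.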

Your query would work like this:

SweetXml.xpath(
      tree,
      ~x"/rss/channel/item"l,
      slug: ~x"./wp:post_name/text()"s,
      company_logo: ~x"./wp:postmeta[contains(wp:meta_key, '_thumbnail_id')]/wp:meta_value/text()"s,
      edited_at: ~x"./wp:postmeta[contains(wp:meta_key, '_edit_last')]/wp:meta_value/text()"s
    )

I don’t know enough about XML to understand why the newlines/spaces are there; if there is some rule associated with that I suppose you could also just apply it and keep using = instead of contains, akin to (I didn’t test this one):

company_logo: ~x"./wp:postmeta[wp:meta_key = '\n_thumbnail_id\n        ']/wp:meta_value/text()"s,

rjk · January 19, 2020, 2:39pm

I can confirm that the suggestion from @srowley works. This code was tested in an IEx REPL:

    # \s matches whitespace only; \W would also strip the hyphens from the slug
    clean_newlines_and_ws = fn x -> String.replace(x, ~r/\s/, "") end
    rows =
      SweetXml.xpath(
        tree,
        ~x"/rss/channel/item"l,
        slug: ~x"./wp:post_name/text()"s |> transform_by(clean_newlines_and_ws),
        company_logo:
          ~x"./wp:postmeta[contains(wp:meta_key, '_thumbnail_id')]/wp:meta_value/text()"s
          |> transform_by(clean_newlines_and_ws),
        edited_at:
          ~x"./wp:postmeta[contains(wp:meta_key, '_edit_last')]/wp:meta_value/text()"s
          |> transform_by(clean_newlines_and_ws)
      )

I also added a transform_by function from sweet_xml that cleans up those newlines and extra whitespace.
I’m also not sure why those are added in the first place, as I can’t see them in the XML itself. It seems like the underlying xmerl library (which sweet_xml uses) or sweet_xml itself does something weird to those CDATA nodes?

But at least with this code you can go on with the task at hand.
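If you need several metadata keys, another option is to collect every wp:postmeta pair with a nested spec and trim the values in Elixir, so the lookup no longer relies on a substring match. This is only a sketch: the inline XML is a cut-down copy of the sample data, the version requirement passed to Mix.install is an assumption, and the key/value field names are made up for illustration.

```elixir
# Assumption: pull sweet_xml from Hex for a standalone script (Elixir 1.12+).
Mix.install([{:sweet_xml, "~> 0.7"}])

import SweetXml

xml = """
<rss xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <wp:postmeta>
        <wp:meta_key>
<![CDATA[_edit_last]]>
        </wp:meta_key>
        <wp:meta_value>
<![CDATA[1]]>
        </wp:meta_value>
      </wp:postmeta>
      <wp:postmeta>
        <wp:meta_key>
<![CDATA[_thumbnail_id]]>
        </wp:meta_key>
        <wp:meta_value>
<![CDATA[11111]]>
        </wp:meta_value>
      </wp:postmeta>
    </item>
  </channel>
</rss>
"""

trim = &String.trim/1

rows =
  xpath(
    xml,
    ~x"/rss/channel/item"l,
    # Nested spec: one %{key: ..., value: ...} map per <wp:postmeta>.
    postmeta: [
      ~x"./wp:postmeta"l,
      key: ~x"./wp:meta_key/text()"s |> transform_by(trim),
      value: ~x"./wp:meta_value/text()"s |> transform_by(trim)
    ]
  )

# Turn each item's list of pairs into a map keyed by the trimmed meta_key.
metas =
  Enum.map(rows, fn %{postmeta: pairs} ->
    Map.new(pairs, fn %{key: k, value: v} -> {k, v} end)
  end)

IO.inspect(metas)
```

Each item then becomes a plain map, so a field such as the thumbnail id can be read with Map.get(meta, "_thumbnail_id") instead of adding one XPath predicate per key.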

Newxan · January 19, 2020, 4:15pm

Using ‘contains’ solves my problem. Thanks for the help!

srowley · January 23, 2020, 12:45pm

As a postscript, it occurred to me later that all that is happening here is that the whitespace within the markup is being preserved. Looking at the sample, within the offending tags there is a newline, the CDATA node, another newline, and then some spaces. The :xmerl documentation confirms this and provides an example very similar to this one.
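The effect is easy to reproduce with :xmerl alone. A small sketch, assuming a made-up <k> element rather than the real export: it concatenates every text node under the element, once with the pretty-printed whitespace and once without.

```elixir
# Concatenate the values of all text nodes directly under the root <k>.
# An xmlText record is the tuple {:xmlText, parents, pos, language, value, type},
# so the value sits at index 4.
text_of = fn xml ->
  {doc, _rest} = :xmerl_scan.string(String.to_charlist(xml))

  ~c"/k/text()"
  |> :xmerl_xpath.string(doc)
  |> Enum.map(fn node -> elem(node, 4) end)
  |> IO.chardata_to_string()
end

# Hypothetical inputs: the same CDATA value with and without formatter whitespace.
formatted = "<k>\n<![CDATA[_edit_last]]>\n    </k>"
original = "<k><![CDATA[_edit_last]]></k>"

formatted_text = text_of.(formatted)
original_text = text_of.(original)

IO.inspect(formatted_text)
# prints "\n_edit_last\n    "
IO.inspect(original_text)
# prints "_edit_last"
```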

Newxan · January 23, 2020, 1:06pm

Doh! That makes a lot of sense.
I always use an auto-formatter when viewing code. In this case I used https://docs.python.org/2/library/xml.dom.minidom.html to make the dump readable.
But doing that broke the export from WordPress. I can confirm that the original has no newlines wherever there is CDATA…

Thanks for the follow-up, it’s been bugging me ever since, not knowing why this occurred.