Hi, I’ve got a database XML dump from an old WordPress site that I want to import into a project.
I’m using sweet_xml and it’s working fine for the simple structures.
However, when I encounter lists of nested fields, such as WordPress’s metadata fields, I can’t get sweet_xml to pick up the values. I’ve reproduced my problem in a minimal way:
import.ex
defmodule Mix.Tasks.Import do
  use Mix.Task
  import SweetXml

  @target_file "./data.xml"

  def run(_args) do
    Mix.Task.run("app.start")
    import_dump()
  end

  defp import_dump do
    tree = File.stream!(@target_file)

    rows =
      SweetXml.xpath(
        tree,
        ~x"/rss/channel/item"l,
        slug: ~x"./wp:post_name/text()"s,
        company_logo: ~x"./wp:postmeta[wp:meta_key = '_thumbnail_id']/wp:meta_value/text()"s,
        edited_at: ~x"./wp:postmeta[wp:meta_key = '_edit_last']/wp:meta_value/text()"s
      )

    IO.inspect(rows)
  end
end
As you can see, the company_logo and edited_at fields are always empty.
I’m not really sure where to go from here. As far as I can tell the XPath syntax itself is correct, but I am unsure how you’re supposed to use the sweet_xml sigil modifiers when there is nested data such as this.
There is a hint in your output: the CDATA values are wrapped in newline characters and have trailing spaces, i.e. the value for slug in your example is "\napost-name\n ".
That means that your XPath query is not finding anything for the company_logo or edited_at nodes because, for example, it is looking for the key _edit_last when that key is actually "\n_edit_last\n ".
I don’t know enough about XML to understand why the newlines/spaces are there; if there is some rule associated with that I suppose you could also just apply it and keep using = instead of contains, akin to (I didn’t test this one):
I also added a transform_by function from sweet_xml that cleans up those newlines and the extra whitespace.
I’m also not sure why those are added in the first place, as I can’t see them in the XML itself. It seems like the underlying xmerl library (that sweet_xml uses) or sweet_xml itself does something weird to those CDATA nodes?
But at least with this code you could go on with the task at hand.
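For anyone who wants to see the cleanup in isolation, here is a minimal sketch of the same idea. It is not the code from the post above: it uses OTP's bundled :xmerl directly instead of sweet_xml so that it runs as a plain script, and the module name MetaLookup, its helpers, and the sample XML are all made up for illustration. Rather than matching the key inside the XPath predicate, it trims every key and value in Elixir and builds a map.

```elixir
defmodule MetaLookup do
  require Record

  # Field layout of xmerl's xmlText record (mirrors xmerl.hrl).
  Record.defrecord(:xmlText, parents: [], pos: nil, language: [], value: nil, type: :text)

  # Builds a map of trimmed meta_key => trimmed meta_value for one item.
  def meta_map(xml) do
    {doc, _rest} = xml |> String.to_charlist() |> :xmerl_scan.string()

    ~c"/item/wp:postmeta"
    |> :xmerl_xpath.string(doc)
    |> Map.new(fn meta ->
      {text(meta, ~c"./wp:meta_key/text()"), text(meta, ~c"./wp:meta_value/text()")}
    end)
  end

  # Joins all text nodes under `path` and strips the surrounding whitespace.
  defp text(node, path) do
    path
    |> :xmerl_xpath.string(node)
    |> Enum.map(fn t -> xmlText(t, :value) end)
    |> IO.chardata_to_string()
    |> String.trim()
  end
end

# Hypothetical sample that mimics a pretty-printed export (padded CDATA).
xml = """
<item xmlns:wp="http://wordpress.org/export/1.2/">
  <wp:postmeta>
    <wp:meta_key>
<![CDATA[_edit_last]]>
    </wp:meta_key>
    <wp:meta_value>
<![CDATA[1]]>
    </wp:meta_value>
  </wp:postmeta>
  <wp:postmeta>
    <wp:meta_key>
<![CDATA[_thumbnail_id]]>
    </wp:meta_key>
    <wp:meta_value>
<![CDATA[42]]>
    </wp:meta_value>
  </wp:postmeta>
</item>
"""

meta = MetaLookup.meta_map(xml)
IO.inspect(meta)
```

Trimming outside the XPath keeps the query simple and avoids depending on how the XPath engine compares padded strings.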
As a postscript, it occurred to me later that all that is happening here is that the whitespace within the markup is being preserved. Looking at the sample, within the offending tags there is a newline, the CDATA node, some spaces, and then another newline. The :xmerl documentation confirms this and provides an example very similar to this one.
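To see the preservation for yourself, a small sketch along these lines should do. It again calls :xmerl directly so it runs without dependencies; the module WhitespaceDemo and both sample strings are hypothetical. It reads the same CDATA value from a padded and an unpadded element and prints the raw text of each.

```elixir
defmodule WhitespaceDemo do
  require Record

  # Field layout of xmerl's xmlText record (mirrors xmerl.hrl).
  Record.defrecord(:xmlText, parents: [], pos: nil, language: [], value: nil, type: :text)

  # Returns the concatenated, untrimmed text of every node matched by `path`.
  def raw_text(xml, path) do
    {doc, _rest} = xml |> String.to_charlist() |> :xmerl_scan.string()

    path
    |> String.to_charlist()
    |> :xmerl_xpath.string(doc)
    |> Enum.map(fn t -> xmlText(t, :value) end)
    |> IO.chardata_to_string()
  end
end

# Padded: newline, CDATA section, newline and spaces (as after pretty-printing).
pretty = "<item><name>\n<![CDATA[apost-name]]>\n  </name></item>"
# Unpadded: the CDATA section is the element's only content.
compact = "<item><name><![CDATA[apost-name]]></name></item>"

raw_pretty = WhitespaceDemo.raw_text(pretty, "/item/name/text()")
raw_compact = WhitespaceDemo.raw_text(compact, "/item/name/text()")

IO.inspect(raw_pretty, label: "pretty")
IO.inspect(raw_compact, label: "compact")
```

If the padded value comes back with its newlines and spaces intact while the unpadded one does not, the whitespace is coming from the file itself and not from the parser.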
Doh! That makes a lot of sense
I always use an auto-formatter when viewing code. In this case I used https://docs.python.org/2/library/xml.dom.minidom.html to make it look readable.
But doing that broke the export from WordPress. I can confirm that the original has no newlines wherever there is CDATA…
Thanks for the follow-up, it’s been bugging me ever since, not knowing why this occurred.