UTF-8 issue with Erlang :xmerl_scan function

Hello, Elixir community,

I have an issue with non-ASCII UTF-8 characters when parsing XML with the :xmerl_scan.string function:

   iex(2)> "<a>Д</a>" |> String.to_char_list |> :xmerl_scan.string

13:51:23.904 [error] 3416- fatal: {:error, {:wfc_Legal_Character, {:error, {:bad_character, 1044}}}}

** (exit) {:fatal, {{:error, {:wfc_Legal_Character, {:error, {:bad_character, 1044}}}}, {:file, :file_name_unknown}, {:line, 1}, {:col, 6}}}
    xmerl_scan.erl:4111: :xmerl_scan.fatal/2
    xmerl_scan.erl:2705: :xmerl_scan.scan_char_data/5
    xmerl_scan.erl:2617: :xmerl_scan.scan_content/11
    xmerl_scan.erl:2130: :xmerl_scan.scan_element/12
    xmerl_scan.erl:572: :xmerl_scan.scan_document/2
    xmerl_scan.erl:288: :xmerl_scan.string/2

Is this a bug or an inconsistency in the Erlang library? Is there any other way to work with XML from Elixir?

I would be glad for any ideas or help.

Thank you,
Andrey

I can’t reproduce. Which version of xmerl are you using?

iex(1)> '<?xml version="1.0" encoding="utf-8"?><a>Д</a>' |> :xmerl_scan.string
{{:xmlElement, :a, :a, [], {:xmlNamespace, [], []}, [], 1, [],
  [{:xmlText, [a: 1], 1, [], '?', :text}], [],
  'some path', :undeclared}, []}
iex(2)> '<a>Д</a>' |> :xmerl_scan.string
{{:xmlElement, :a, :a, [], {:xmlNamespace, [], []}, [], 1, [],
  [{:xmlText, [a: 1], 1, [], '?', :text}], [],
  'some path', :undeclared}, []}
iex(3)> :xmerl_scan.module_info[:attributes]
[vsn: [:"0.20"], date: [:"03-09-16"]]

Hi NobbZ,

I am using the version that ships with OTP 19:

iex(15)> :xmerl_scan.module_info()
[module: :xmerl_scan,
 exports: [user_state: 1, event_state: 1, hook_state: 1, rules_state: 1,
  fetch_state: 1, cont_state: 1, user_state: 2, event_state: 2, hook_state: 2,
  rules_state: 2, fetch_state: 2, cont_state: 2, file: 1, file: 2, string: 1,
  string: 2, accumulate_whitespace: 4, module_info: 0, module_info: 1],
 attributes: [vsn: [:"0.20"], date: [:"03-09-16"]],
 compile: [options: [{:outdir,
    '/home/vagrant/build-dir_16-06-21_15-00-31/otp-support/lib/xmerl/src/../ebin'},
   {:i,
    '/home/vagrant/build-dir_16-06-21_15-00-31/otp-support/lib/xmerl/src/../include'},
   :warn_unused_vars, :debug_info], version: '6.0.3',
  source: '/home/vagrant/build-dir_16-06-21_15-00-31/otp-support/lib/xmerl/src/xmerl_scan.erl'],
 native: false,
 md5: <<215, 59, 34, 93, 178, 182, 193, 100, 254, 84, 90, 181, 220, 160, 84,
   232>>]

To avoid a console encoding problem:

    iex(27)> '<?xml version="1.0" encoding="utf-8"?><a>#{<< 220 :: utf8 >>}</a>' |> :xmerl_scan.string

17:16:54.401 [error] 3416- fatal: {:error, {:wfc_Legal_Character, {:error, {:bad_character, 220}}}}

** (exit) {:fatal, {{:error, {:wfc_Legal_Character, {:error, {:bad_character, 220}}}}, {:file, :file_name_unknown}, {:line, 1}, {:col, 44}}}
    xmerl_scan.erl:4111: :xmerl_scan.fatal/2
    xmerl_scan.erl:2705: :xmerl_scan.scan_char_data/5
    xmerl_scan.erl:2617: :xmerl_scan.scan_content/11
    xmerl_scan.erl:2130: :xmerl_scan.scan_element/12
    xmerl_scan.erl:572: :xmerl_scan.scan_document/2
    xmerl_scan.erl:288: :xmerl_scan.string/2

While:

iex(32)> '<?xml version="1.0" encoding="utf-8"?><a>#{<< ?u :: utf8 >>}</a>' |> :xmerl_scan.string
{{:xmlElement, :a, :a, [], {:xmlNamespace, [], []}, [], 1, [],
  [{:xmlText, [a: 1], 1, [], 'u', :text}], [],
  '/home/andrey/Projects/elixir-try/xml_parsing/test', :undeclared}, []}

OK, looking at it more closely, I realize that it never worked for me.

I’m on Windows at the moment, working in an IntelliJ-elixir project, so I fired up iex in IntelliJ’s terminal, where the Cyrillic A-like letter was displayed in the line I copied and pasted. It seems it was only displayed that way, though: look at xmerl’s return value in my tests. There it was a question mark!

When I try your current examples, they fail for me as well!

Perhaps you should try to reproduce it in pure Erlang and file a bug at https://bugs.erlang.org/

Also it could be that there is something on hex which may help: https://hex.pm/packages?_utf8=✓&search=xml&sort=downloads

Thank you NobbZ,

It seems that it’s a regression in OTP. I filed a ticket:

https://bugs.erlang.org/browse/ERL-255

Considering the answer on the ticket above, this seems to be an issue on the Elixir side:

'<f>#{<<195, 156>>}</f>' |> :xmerl_scan.string

17:32:12.217 [error] 3416- fatal: {:error, {:wfc_Legal_Character, {:error, {:bad_character, 220}}}}

** (exit) {:fatal, {{:error, {:wfc_Legal_Character, {:error, {:bad_character, 220}}}}, {:file, :file_name_unknown}, {:line, 1}, {:col, 6}}}
    xmerl_scan.erl:4111: :xmerl_scan.fatal/2
    xmerl_scan.erl:2705: :xmerl_scan.scan_char_data/5
    xmerl_scan.erl:2617: :xmerl_scan.scan_content/11
    xmerl_scan.erl:2130: :xmerl_scan.scan_element/12
    xmerl_scan.erl:572: :xmerl_scan.scan_document/2
    xmerl_scan.erl:288: :xmerl_scan.string/2

It looks like Elixir passes the Unicode code point of the umlaut instead of its UTF-8 bytes, while Erlang expects UTF-8 bytes.

Passing it through the :unicode.characters_to_binary function gives an unclear error:

iex(6)> '<f>#{<<195, 156>>}</f>' |> :unicode.characters_to_binary  |> :xmerl_scan.string
** (FunctionClauseError) no function clause matching in :lists.prefix/2
(stdlib) lists.erl:192: :lists.prefix('<', "<f>Ü</f>")
         xmerl_scan.erl:3897: :xmerl_scan.scan_mandatory/5
         xmerl_scan.erl:569: :xmerl_scan.scan_document/2
         xmerl_scan.erl:288: :xmerl_scan.string/2
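
The difference between code points and UTF-8 bytes described above can be checked directly in iex. This is only a minimal sketch: the variable names are made up for illustration, and `~c"..."` is used as the sigil form of a single-quoted charlist. It also shows that `:unicode.characters_to_binary` returns a binary rather than a list, which would be consistent with the `:lists.prefix/2` clause error in the trace, assuming `:xmerl_scan.string` only handles lists.

```elixir
# "Ü" is U+00DC (decimal 220); its UTF-8 encoding is the two bytes 195, 156.
binary = <<195, 156>>

# One integer per Unicode code point.
code_points = String.to_charlist(binary)
# One integer per UTF-8 byte.
bytes = :erlang.binary_to_list(binary)

IO.inspect(code_points, charlists: :as_lists)
IO.inspect(bytes, charlists: :as_lists)

# Interpolating a binary into a charlist also produces code points,
# matching the 220 reported in the :bad_character error.
interpolated = ~c"<f>#{binary}</f>"
IO.inspect(interpolated, charlists: :as_lists)

# characters_to_binary/1 yields a binary, not a list.
IO.inspect(is_binary(:unicode.characters_to_binary(interpolated)))
```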

Was there a solution for this?

For parsing valid UTF-8 with xmerl, encode to a list of bytes (not codepoints), as xmerl expects:

xml_string |> :erlang.binary_to_list |> :xmerl_scan.string
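
As a rough sketch of the byte-list approach above, the conversion can be wrapped in a small helper. The module name `XmlBytes` and the function `parse/1` are hypothetical, not part of xmerl or Elixir, and the tuple positions used to read the result are taken from the `:xmlElement` and `:xmlText` tuples printed earlier in this thread.

```elixir
defmodule XmlBytes do
  # Hypothetical helper: turns a UTF-8 binary into a list of bytes
  # (not code points) before handing it to :xmerl_scan.string/1.
  def parse(xml) when is_binary(xml) do
    xml
    |> :erlang.binary_to_list()
    |> :xmerl_scan.string()
  end
end

# <<195, 156>> is the UTF-8 encoding of "Ü".
{doc, _rest} = XmlBytes.parse("<f>" <> <<195, 156>> <> "</f>")

# Position 1 of the :xmlElement tuple is the element name,
# position 8 is its content (a list of child nodes).
IO.inspect(elem(doc, 1))
[text_node | _] = elem(doc, 8)

# Position 4 of the :xmlText tuple is the text value.
IO.inspect(elem(text_node, 4), charlists: :as_lists)
```

Guarding on `is_binary/1` keeps charlists of code points from slipping into the helper by accident, which is the situation that triggered the original error.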

Alternatively, if for some reason you really want to pass codepoints instead, you can use a UTF-1 encoding, as mentioned in a message from 2009 (thanks Mikkel Jensen):

xml_string |> String.to_charlist |> :xmerl_scan.string([encoding: 'iso-10646-utf-1'])

Don’t change that encoding; it really is supposed to be iso-10646-utf-1.
