Can't use international characters in Phoenix Expug templates

edjroot · September 17, 2019, 5:58pm

Hello!

First of all, I apologize if this is not the right place to ask about this problem. I’ve just got started with Elixir and so far it’s been a pleasure!

I’m trying to use Expug as a templating engine for my new Phoenix project.

I’m using Phoenix 1.4.10, Elixir 1.9.0, Erlang 22. Tested on Ubuntu 19.04 and Arch Linux.

When I try to use accented characters in attribute values, e.g. button(title="helló") world (‘ó’ in this case), I get this error:

Compiling 1 file (.ex)

== Compilation error in file lib/hello_web/views/page_view.ex ==
** (UnicodeConversionError) invalid encoding starting at <<195, 34, 41, 41, 32, 37, 62, 62, 60, 37, 61, 32, 34, 92, 110, 34, 32, 37, 62, 104, 101, 108, 108, 111, 60, 37, 61, 32, 34, 92, 110, 34, 32, 37, 62, 60, 47, 112, 62, 10>>
    (elixir) lib/string.ex:2180: String.to_charlist/1
    (eex) lib/eex/tokenizer.ex:33: EEx.Tokenizer.tokenize/3
    (eex) lib/eex/compiler.ex:18: EEx.Compiler.compile/2
    (phoenix) lib/phoenix/template.ex:354: Phoenix.Template.compile/3
    (phoenix) lib/phoenix/template.ex:165: anonymous fn/4 in Phoenix.Template."MACRO-__before_compile__"/2
    (elixir) lib/enum.ex:1948: Enum."-reduce/3-lists^foldl/2-0-"/3
    (phoenix) expanding macro: Phoenix.Template.__before_compile__/1
    lib/hello_web/views/page_view.ex:1: HelloWeb.PageView (module)
    (elixir) lib/kernel/parallel_compiler.ex:229: anonymous fn/4 in Kernel.ParallelCompiler.spawn_workers/7

The very same character works elsewhere (i.e. outside attribute values), and it also works in EEx templates, so I guess it’s a bug in Expug.
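For anyone trying to reproduce this outside Expug, here is a minimal sketch. It assumes (based on the bytes in the error, which start with 195 followed by 34, a double quote) that the second byte of ‘ó’ was dropped somewhere before the string reached EEx. The variable names are made up for illustration.

```elixir
# "ó" in a UTF-8 source file is two bytes: <<195, 179>>.
full = "ó"
<<first, _second>> = full

# Assumption: the template string ends up with only the first byte of "ó",
# immediately followed by the closing double quote (byte 34).
truncated = <<first, 34>>

# The intact string converts to a single code point.
IO.inspect(String.to_charlist(full))   # [243]

# The truncated one is not valid UTF-8, so the conversion raises.
result =
  try do
    String.to_charlist(truncated)
  rescue
    e in UnicodeConversionError -> {:error, e.message}
  end

IO.inspect(result)
```

The rescued message has the same "invalid encoding starting at" shape as the compilation error above.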

I don’t mind getting my hands dirty to fix this myself; I just need some directions. I guess it’s just a matter of fixing the encoding somewhere between Expug’s stringifier and EEx’s parser, but I still don’t understand much about Elixir or EEx’s internals so I’m not sure and can’t find the culprit.

Thanks in advance!

benwilson512 · September 17, 2019, 6:08pm

Hey @edjroot, you’ve come to the right place! I would definitely double check that the character you’ve entered into the code file is UTF-8 encoded. If you copy and paste a Latin-1 encoded character it won’t be valid.

edjroot · September 17, 2019, 6:18pm

Hey @benwilson512, thanks for the quick reply!

Sorry, I forgot to mention, but these accented characters (and other stuff like greek letters) seem to work everywhere in Expug templates except in attribute values, and they also work everywhere in EEx templates, so I really think it’s something with Expug.

How do I check if the character is UTF8? I tried String.valid?("ó") and got true. However, this page says the same character is invalid.

benwilson512 · September 17, 2019, 6:33pm

If String.valid? returns true, then it’s fine. This does seem like it could be a bug with Expug, I’d consider filing an issue on the project.
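A small sketch of that check, in case it helps: it compares the UTF-8 form of ‘ó’ with the Latin-1 byte for the same letter. The variable names are only for illustration.

```elixir
# "ó" typed into a UTF-8 source file is code point U+00F3, stored as two bytes.
utf8 = "ó"
IO.inspect(String.valid?(utf8))         # true
IO.inspect(:binary.bin_to_list(utf8))   # [195, 179]

# In Latin-1 the same letter is the single byte 243, which is not valid UTF-8.
latin1 = <<243>>
IO.inspect(String.valid?(latin1))       # false

# If the bytes really were Latin-1, they could be converted explicitly.
converted = :unicode.characters_to_binary(latin1, :latin1, :utf8)
IO.inspect(converted == utf8)           # true
```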

edjroot · September 17, 2019, 6:36pm

Thanks, I did file a bug, but unfortunately the project seems kinda dead and the maintainer took months to respond to previously posted issues.
But as I said, I don’t mind trying to fix it myself, I just don’t know where to look even though I’ve been trying.

OvermindDL1 · September 17, 2019, 6:40pm

Sounds worth forking?

edjroot · September 17, 2019, 6:44pm

Definitely!
I love Pug (and I’m starting to love Elixir and Phoenix and…)! I’d gladly maintain the project myself, but I know next to nothing about parsers and this kind of stuff (and that’s why I’m struggling with something seemingly so simple to solve).

kip · September 18, 2019, 1:23am

Although <<195 :: utf8>> is a valid Unicode character representing "Ã", it’s not the same as <<195>> (i.e. a single byte holding the integer 195) because the byte encoding is different.
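To make the difference concrete, here is a short sketch using only the binary syntax and standard library calls. The names with_modifier and raw_byte are illustrative.

```elixir
# With the utf8 modifier, 195 is treated as a code point (U+00C3, "Ã")
# and encoded as two bytes.
with_modifier = <<195::utf8>>
IO.inspect(:binary.bin_to_list(with_modifier))   # [195, 131]
IO.inspect(byte_size(with_modifier))             # 2

# Without the modifier, 195 is just one raw byte, which on its own
# is an incomplete UTF-8 sequence.
raw_byte = <<195>>
IO.inspect(byte_size(raw_byte))                  # 1
IO.inspect(String.valid?(raw_byte))              # false
```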

One thing you might try is calling String.normalize(string, :nfd) before inserting it in your template. This will decompose the code point into 2 code points (and 3 bytes) all of which are < 128. Codepoints less than 128 are compatible with ANSI 7-bit since the byte encoding is the same. And therefore you should at least pass the error you are getting.

There are some good “tools” for diving into utf8, here are some examples related to your error message:

# Composed normalisation (your current code)
# Codepoints above 128 are not byte encoded the same as an integer
# which is the underlying source of your error
iex> String.normalize <<195 :: utf8>>, :nfc
"Ã"
iex> String.to_charlist "Ã"
[195]
# Decomposition normalisation
# Multiple code points are used and each
# of them is less than 128 so have the same byte
# encoding as an integer. Note this is not true for
# all of Unicode - but it is largely true for Latin derivative
# alphabets
iex> String.normalize <<195 :: utf8>>, :nfd
"Ã"
iex> String.to_charlist(String.normalize(<<195 :: utf8>>, :nfd))
[65, 771]
iex> List.to_string([65,771]) |> :erlang.byte_size
3

edjroot · September 18, 2019, 5:32am

Thanks a lot for the pointers!

I just saw your reply, but after many hours looking in the wrong places I found out there was a specific regex (~r/^(?:(?:\\")|[^"])/) causing the problem, and I got it to work by just adding a u flag to it. I suppose this is the right thing to do, but please do tell me if there’s a more appropriate way! (The updated repo is here if anyone is interested)
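For reference, a minimal sketch of what the u flag changes for this pattern when it is run directly against ‘ó’. This is not Expug’s actual tokenizer code, only the regex quoted above applied with Regex.run/2.

```elixir
# The pattern from the post, without and with the unicode flag.
without_u = ~r/^(?:(?:\\")|[^"])/
with_u = ~r/^(?:(?:\\")|[^"])/u

# Without `u`, the regex works byte by byte, so [^"] consumes only
# the first byte of "ó" and leaves an invalid fragment.
IO.inspect(Regex.run(without_u, "ó"))   # [<<195>>]

# With `u`, the subject is treated as UTF-8, so [^"] consumes the
# whole character.
IO.inspect(Regex.run(with_u, "ó"))      # ["ó"]
```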

Anyway, thank you all for the support! I promise I’ll do my homework and learn more about Elixir and the ecosystem.

kip · September 18, 2019, 5:53am

Your approach is definitely more appropriate; I was just suggesting a potential workaround. Unfortunately the u flag is not the default, so errors like this do crop up from time to time.

NobbZ · September 18, 2019, 5:57am

I’m not sure if your workaround would work at all. Even after normalizing into distinct codepoints, as there is at least one codepoint spanning more than one byte, those bytes will both be greater than 127 and therefore not in ASCII anymore and make the regex fail.
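A quick sketch to check this, using the "Ã" from the earlier example. Variable names are illustrative.

```elixir
# Start from the composed form (U+00C3) and decompose it.
decomposed = String.normalize("\u00C3", :nfd)

# Two code points: "A" (65) and the combining tilde U+0303 (771).
IO.inspect(String.to_charlist(decomposed))    # [65, 771]

# Three bytes; the combining tilde takes two, and both are above 127.
bytes = :binary.bin_to_list(decomposed)
IO.inspect(bytes)                             # [65, 204, 131]

# A byte-wise regex (no `u` flag) would still split the combining mark.
IO.inspect(Regex.run(~r/^[^"]/, "\u0303"))    # [<<204>>]
```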