Odd Ecto behavior with latin1 and utf8

johnford2002 · March 9, 2021, 8:08pm

Hey everyone!

I have a rather perplexing question. Our production app has been generating the following error message:

(MyXQL.Error) (1267) (ER_CANT_AGGREGATE_2COLLATIONS) Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8mb4_general_ci,COERCIBLE) for operation '='

That’s not terribly surprising, as the underlying DB was originally created on MySQL 5.6 (now on 5.7) and the latin1 defaults were what we used. We’re now processing events that contain UTF-8 data, like محافظه البحي.

We’re not quite ready to make an overall character set and collation change to the DB - that may come in the following weeks. For now, we’re just looking to capture those specific errors and get a sense of how often they happen. Only, when attempting to run some test cases on this, we can’t replicate the behavior!

We’ve eliminated some possibilities that were obviously impacting the tests. Local MySQL was getting the Ecto defaults for UTF-8 initially on each run, so character set and collation were overridden to get back to what we have in prod. We disabled the SQL sandbox using unboxed_run(repo, function) to allow these specific tests to persist their data so we could inspect it afterward. Somewhere, at a level we don’t understand yet, the tests are converting that value to what appears to be some other encoding.

MySQL client hitting the test DB shows this value being inserted for the above Arabic string:

Ù…ØØ§ÙØ¸Ù‡ Ø§Ù„Ø¨ØÙ

We’ve no idea where or how that’s happening, and more-so, it’s rather annoying how it seems to pass silently on test but raises errors in production. Any thoughts or ideas to follow up on?

johnford2002 · March 9, 2021, 8:31pm

I was able to somewhat figure out part of this.

The error occurs if I change the adapter back to using the defaults, UTF-8, while operating on a latin1 DB and table.

I still do not understand what encoding/translation the adapter is performing when it’s set to latin1 but detects UTF8 input, resulting in the weird string above.

I can tell it’s doing something as a raw SQL query to insert the Arabic value just results in a string of ?'s without any other conversion being applied.

Nicd · March 10, 2021, 11:12am

This is the input string “محافظه البحي” encoded as UTF-8, but then treated as if it were a win-1252 byte stream. You can confirm by pasting it in this encoder and selecting encode as UTF-8 and decode as Windows-1252. Win-1252 is also known as cp1252 and is the character set MySQL uses for the latin1 collation.

I believe Ecto is sending your data as UTF-8 and MySQL is treating the input as win-1252 as that is the DB’s character set, resulting in this mojibake.

I would focus on converting the database to UTF-8 and using that throughout all your stack to avoid such issues.

Also note! If you decide to convert your MySQL DB to UTF-8, notice that utf8 and utf8mb4 are different things in MySQL. The former will only accept characters up to 3 bytes wide, whereas the latter will accept the whole 4-byte range of UTF-8. 3 bytes will not be enough for many characters, such as emoji.