Solve can I solve in invalid byte 0xFA in <<119, 199, ...>> error in ecto

Pull out bulk data from the database using ecto and sending it as json.
Somewhere within the data is an invalid byte 0xFA utf8 code, my code to pull the data

data = Transactions
|> where([a], a.serviceType in [6, 7] and is_nil(a.deleted_at))
|> join(:left, [a], b in User, on: a.transactingUser_id == b.id)
|> select([a], %{
        channel: a.channel,
        plaza: a.startRoute,
        ref: a.transactionRef,
        cashier: fragment("concat(?, concat(' ', ?))", b.firstName, b.lastName)
}
|> Repo.all()

json(conn, data)

How can I solve this problem?
If possible, how can I convert it to string within the ecto query itself?

Please paste code between triple back-ticks (``` before and after) so people have a chance to understand it.

The code point 0xfa is valid unicode:

iex> <<0xfa :: utf8 >> == "Ăş"
true

So there is something about the sequence of bytes that is making the string be invalid UTF8. That suggests that the encoding of the data in the database is not UTF8.

You could confirm the encoding of your database with:

kip=# show server_encoding;
 server_encoding 
-----------------
 UTF8
(1 row)

It would also be helpful if you can post the complete binary that is invalid so we can investigate further.

3 Likes

The codepoint U+00fa is valid unicode as you have shown, though the problem is the byte.

The byte 0xFA alone is not valid UTF-8, as can be seen in this table:

0xFA is marked red, which means “is not allowed to appear in a UTF-8 encoded sequence of bytes”.

3 Likes

Can you please share the full error? that migh help to understand where it comes from.

Also please share your relevant schema and migrations.

(exit) an exception was raised:
     ** (Jason.EncodeError) invalid byte 0xFA in <<77, 105, 108, 100, 114, 101, 100, 32, 67, 104, 105, 110, 103, 250, 109, 98, 97>>
         (jason) lib/jason.ex:199: Jason.encode_to_iodata!/2
         (phoenix) lib/phoenix/controller.ex:284: Phoenix.Controller.json/2
         (nrfa_elixir) lib/elixir_web/UI CONTROLLERS/transactions/transactions_controller.ex:1: ElixirWeb.TransactionsController.action/2
         (nrfa_elixir) lib/elixir_web/UI CONTROLLERS/transactions/transactions_controller.ex:1: ElixirWeb.TransactionsController.phoenix_controller_pipeline/2
         (phoenix) lib/phoenix/router.ex:352: Phoenix.Router.__call__/2
         (nrfa_elixir) lib/elixir_web/endpoints/80/endpoint.ex:1: ElixirWeb.Endpoint.plug_builder_call/2
         (nrfa_elixir) lib/elixir_web/endpoints/80/endpoint.ex:1: ElixirWeb.Endpoint.call/2
         (phoenix) lib/phoenix/endpoint/cowboy2_handler.ex:65: Phoenix.Endpoint.Cowboy2Handler.init/4

thats the error am giving


(exit) an exception was raised:
     ** (Jason.EncodeError) invalid byte 0xFA in <<77, 105, 108, 100, 114, 101, 100, 32, 67, 104, 105, 110, 103, 250, 109, 98, 97>>
         (jason) lib/jason.ex:199: Jason.encode_to_iodata!/2
         (phoenix) lib/phoenix/controller.ex:284: Phoenix.Controller.json/2
         (nrfa_elixir) lib/elixir_web/UI CONTROLLERS/transactions/transactions_controller.ex:1: ElixirWeb.TransactionsController.action/2
         (nrfa_elixir) lib/elixir_web/UI CONTROLLERS/transactions/transactions_controller.ex:1: ElixirWeb.TransactionsController.phoenix_controller_pipeline/2
         (phoenix) lib/phoenix/router.ex:352: Phoenix.Router.__call__/2
         (nrfa_elixir) lib/elixir_web/endpoints/80/endpoint.ex:1: ElixirWeb.Endpoint.plug_builder_call/2
         (nrfa_elixir) lib/elixir_web/endpoints/80/endpoint.ex:1: ElixirWeb.Endpoint.call/2
         (phoenix) lib/phoenix/endpoint/cowboy2_handler.ex:65: Phoenix.Endpoint.Cowboy2Handler.init/4

There is a byte 250 close to the end. That is the 0xFA from the error.

Please check your database directly what byte there is and fix the data in the database.

The following shows what is exactly in fornt of and after that byte:

iex(1)> <<77, 105, 108, 100, 114, 101, 100, 32, 67, 104, 105, 110, 103>>
"Mildred Ching"
iex(2)> <<109, 98, 97>>
"mba"
1 Like

So how can I detect the error and replace it with a string directly from the database because my app is crashing when it encounters that error, code example please

As @NobbZ says, the primary issue here is that you have invalid UTF8-encoded data in your database. Therefore the correct fix is to correct the data in your database.

If you are unable to correct the data in your database then:

  1. Change your ecto schema type to :binary (not :string).
  2. After retrieving from the database do String.chunk(binary, :valid) to split the string into valid and invalid parts.
  3. Decide what policy you want to apply when there is invalid UTF-8 encoding in your data. Possibly something like Enum.filter(chunks, &String.valid?/1) |> Enum.join()
2 Likes

Thanks alot.

Can’t manage correcting my database manual

iex(5)> Transactions |> Repo.aggregate(:count, :id)
2466060

Thats the count of transactions I have in the DB. Can’t manage going through it one by one.

thanks

No one said it has to be done manually.

Though in my opinion it has to happen!

Make sure that your database is encoding aware and not just assuming binary/opaque blob of data.

Then write a program that fixes all the malformed data you have according to the policy you have decided to go with.

Keeping malformed data in the DB and reinventing fixes for each consumer of the database is what made us see all the questionmarks and backslashes in the early 2000s, as so many PHP folks tried to fix data before pushing to DB and another set of developers tried to fix when pulling out, leading to “double fixed” data which again was often worse, as the roundtrip of fixing started again with each data edit.

On top of that, developers just didn’t understand all the levels of escaping…

2 Likes

FWIW, the byte 0xFA represents a “middle dot” character ( · ) on the old DOS code page 437.

Untangling old data like this is exceptionally difficult in general; I once had to migrate a database that had been storing whatever native character set was sent by clients (due to an old IE6 bug) and thus had everything from Arabic to Cyrillic ISO-8859 codepages mixed together. We had to hand-code a translation function from “customer country” to “likely character set” to deal with all that mess…

2 Likes

Hello, I have almost the same issue.
I’m trying to encode a 5 digit number with sha256 and set as a parameter in Guardian JTW token.
I get this error:

** (exit) an exception was raised:
** (Jason.EncodeError) invalid byte 0xE6 in <<39, 230, 80, 212, 221, 113, 28, 70, 32, 224, 187, 31, 151, 119, 135, 64, 98, 182, 156, 148, 203, 167, 9, 232, 6, 132, 32, 193, 154, 195, 82, 14, 255, 75, 121, 167, 68, 155, 55, 111, 108, 62, 12, 85, 98, 82, 225, 222, 11, 147, …>>
(jason 1.2.2) lib/jason.ex:150: Jason.encode!/2
(jose 1.11.1) src/jwt/jose_jwt.erl:89: :jose_jwt.to_binary/1
(jose 1.11.1) src/jwt/jose_jwt.erl:172: :jose_jwt.sign/3
(guardian 2.1.1) lib/guardian/token/jwt.ex:259: Guardian.Token.Jwt.create_token/3
(guardian 2.1.1) lib/guardian.ex:765: Guardian.returning_tuple/1
(guardian 2.1.1) lib/guardian.ex:594: Guardian.encode_and_sign/4
(guardian 2.1.1) lib/guardian/plug.ex:204: Guardian.Plug.sign_in/5

After I added : encrypted = String.chunk(encrypted, :valid) the I get this error:

Request: POST /api/user/authenticate/mobile
** (exit) an exception was raised:
** (Jason.EncodeError) invalid byte 0xAC in <<172, 255>>
(jason 1.2.2) lib/jason.ex:150: Jason.encode!/2
(jose 1.11.1) src/jwt/jose_jwt.erl:89: :jose_jwt.to_binary/1
(jose 1.11.1) src/jwt/jose_jwt.erl:172: :jose_jwt.sign/3
(guardian 2.1.1) lib/guardian/token/jwt.ex:259: Guardian.Token.Jwt.create_token/3
(guardian 2.1.1) lib/guardian.ex:765: Guardian.returning_tuple/1
(guardian 2.1.1) lib/guardian.ex:594: Guardian.encode_and_sign/4
(guardian 2.1.1) lib/guardian/plug.ex:204: Guardian.Plug.sign_in/5

Many invalid chars are gone but Looks like still there are some invalid chars remaining.
What do you recommend?

Two things:

  • String.chunk returns a list with two binaries (“valid” and “invalid” characters), but this passes them BOTH along to be serialized to JSON, thus the error
  • dropping the characters wouldn’t be desirable anyways - since you mention the value is calculated with sha256, that’d be like using scissors to trim a big floppy disk to fit in a small disk drive

The values in JSON need to be valid UTF-8; if you’re trying to pass arbitrary binary data, consider using an encoding format like hexadecimal (34f89e4c23a0d) or base64. Elixir supports them through the Base module in stdlib.

4 Likes