I'm pulling bulk data out of the database using Ecto and sending it as JSON. Somewhere within the data is a byte (0xFA) that is invalid UTF-8. Here is my code to pull the data:
```elixir
data =
  Transactions
  |> where([a], a.serviceType in [6, 7] and is_nil(a.deleted_at))
  |> join(:left, [a], b in User, on: a.transactingUser_id == b.id)
  |> select([a, b], %{
    channel: a.channel,
    plaza: a.startRoute,
    ref: a.transactionRef,
    cashier: fragment("concat(?, concat(' ', ?))", b.firstName, b.lastName)
  })
  |> Repo.all()

json(conn, data)
```
How can I solve this problem? If possible, how can I convert it to a string within the Ecto query itself?
Please paste code between triple back-ticks (``` before and after) so people have a chance to understand it.
The code point 0xFA is valid Unicode:

```elixir
iex> <<0xfa::utf8>> == "ú"
true
```
So there is something about the sequence of bytes that is making the string invalid UTF-8. That suggests the data in the database is not encoded as UTF-8.
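The distinction between the *code point* 0xFA and a raw 0xFA *byte* can be checked directly in Elixir:

```elixir
# U+00FA ("ú") as a code point encodes to TWO bytes in UTF-8 (0xC3 0xBA):
two_bytes = <<0xFA::utf8>>
IO.inspect(two_bytes == <<0xC3, 0xBA>>)  # => true

# ...whereas a lone 0xFA byte is not valid UTF-8 on its own:
IO.inspect(String.valid?(<<0xFA>>))      # => false
```

A lone 0xFA in the data therefore cannot come from well-formed UTF-8; it is likely a byte from some single-byte encoding (e.g. Latin-1 or a DOS code page) stored without conversion.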
You could confirm the encoding of your database with:

```
kip=# show server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)
```
It would also be helpful if you can post the complete binary that is invalid so we can investigate further.
So how can I detect the error and replace it with a valid string directly from the database? My app is crashing when it encounters that error. A code example, please.
As @NobbZ says, the primary issue here is that you have invalid UTF-8-encoded data in your database, so the proper fix is to correct the data there.

If you are unable to correct the data in your database, then:

1. Change your Ecto schema type to `:binary` (not `:string`).
2. After retrieving from the database, call `String.chunk(binary, :valid)` to split the binary into valid and invalid chunks.
3. Decide what policy you want to apply to invalid UTF-8 in your data. Possibly something like `Enum.filter(chunks, &String.valid?/1) |> Enum.join()`.
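The chunk-and-filter approach above could be wrapped in a small helper, roughly like this (a sketch: the module and function names are hypothetical, and the drop-invalid-chunks policy is just one possible choice):

```elixir
defmodule Sanitize do
  # Split a binary into valid/invalid UTF-8 chunks, keep only the
  # valid ones, and join them back together. Note this silently
  # DROPS the bad bytes -- decide whether that policy suits your data.
  def to_valid_utf8(binary) when is_binary(binary) do
    binary
    |> String.chunk(:valid)
    |> Enum.filter(&String.valid?/1)
    |> Enum.join()
  end
end

Sanitize.to_valid_utf8(<<"caf", 0xFA, "e">>)
# the stray 0xFA byte is dropped, leaving "cafe"
```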
Make sure that your database is encoding-aware and not just assuming a binary/opaque blob of data.

Then write a program that fixes all the malformed data you have, according to the policy you have decided to go with.
Keeping malformed data in the DB and reinventing fixes in each consumer of the database is what gave us all the question marks and backslashes of the early 2000s: many PHP folks tried to fix data before pushing it to the DB while another set of developers tried to fix it when pulling it out, leading to “double-fixed” data that was often worse, as the round trip of fixing started again with each edit.
On top of that, developers just didn’t understand all the levels of escaping…
FWIW, the byte 0xFA represents a “middle dot” character ( · ) on the old DOS code page 437.
Untangling old data like this is exceptionally difficult in general; I once had to migrate a database that had been storing whatever native character set was sent by clients (due to an old IE6 bug) and thus had everything from Arabic to Cyrillic ISO-8859 codepages mixed together. We had to hand-code a translation function from “customer country” to “likely character set” to deal with all that mess…
Hello, I have almost the same issue. I’m trying to encode a 5-digit number with SHA-256 and set it as a parameter in a Guardian JWT token. I get this error:
`String.chunk` returns a list containing both the “valid” and “invalid” chunks, but this code passes them BOTH along to be serialized to JSON, hence the error.

Dropping the characters wouldn’t be desirable anyway: since you mention the value is calculated with SHA-256, that’d be like using scissors to trim a big floppy disk to fit in a small disk drive.
Values in JSON need to be valid UTF-8; if you’re trying to pass arbitrary binary data, consider an encoding like hexadecimal (34f89e4c23a0d) or Base64. Elixir supports both through the `Base` module in the standard library.
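For example, a SHA-256 digest is 32 arbitrary bytes and will almost never be valid UTF-8, so encode it before putting it in a JWT claim or JSON payload (a sketch using `:crypto.hash/2` from Erlang and the `Base` module):

```elixir
# Raw digest: 32 arbitrary bytes, generally NOT valid UTF-8.
digest = :crypto.hash(:sha256, "12345")

# Encode before embedding in a token claim or JSON:
hex = Base.encode16(digest, case: :lower)  # 64 hex characters
b64 = Base.encode64(digest)                # Base64 string

# Both encodings are plain ASCII, hence always valid UTF-8.
String.valid?(hex) and String.valid?(b64)
```

Use `Base.decode16!/2` or `Base.decode64!/1` on the other side if you ever need the raw bytes back.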