Is there any way of string normalize automatically when insert to the database?

Hello people.
I’ve been having some problems trying to save a string word with limited varchar(9).

create database big_text
    LOCALE 'en_US.utf8'
    ENCODING UTF8
    
    
create table big_text(
    description VARCHAR(9) not null
)

# OK
insert into big_text (description) values ('sintético')

# I Got error here
insert into big_text (description) values ('sintético')

I already know that the problem is because one word is using ‘é’ → Latin small letter E with Acute (this case only have 1 codepoint) and another word is using ‘é’ → Latin Small Letter E + Combining Acute Accent Modifier. (this case I have 2 codepoint) .

Where should I normalize? In elixir code or set some config on postgres using some feature of database like plugin etc…
Is there any way of string normalize automatically before insert to the database? I’m using postgres

Thank you

1 Like

Which database are you using? Postgres has a normalize function:

postgres=# select 'sintético' = 'sintético';
 ?column? 
----------
 f
(1 row)

postgres=# select 'sintético' = normalize('sintético');
 ?column? 
----------
 t
(1 row)
1 Like

And it looks like Elixir has a normalize function as well. But it asks you to specify how to normalize it so you have to know something about that:

iex(6)> "sintético" == "sintético"                         
false
iex(7)> "sintético" == String.normalize("sintético", :nfd) 
false
iex(8)> "sintético" == String.normalize("sintético", :nfc) 
true
iex(9)> "sintético" == String.normalize("sintético", :nfkd)
false
iex(10)> "sintético" == String.normalize("sintético", :nfkc)
true
1 Like

I’m using postgres.
Thank you joey. I discovered this function yesterday but I would like to know where is more appropriate to normalize user input?
On Elixir before sent data to database or using this function when data reach Postgres? If the better way is set on postgres. Is there any way to set some config that always I insert a non-normalized string it normalize automatically for me?

Ah I probably can’t help you too much with those because I never use this functionality myself. Hopefully someone else chimes in. But here are my naive $0.02:

  • Both ways are “appropriate” as in they should both give you the correct result. So it should boil down to which way is easier to understand/maintain in your code. I think it’s cleaner/clearer to normalize in Elixir before performing the query. This way you have clear delineation between the query step and the transformation step.
  • I personally wouldn’t automate it on the SQL. You might not want it for all cases now or in the future and having a transformation like that hidden could come back to bite you.
3 Likes

Is there any way to set some config that always I insert a non-normalized string it normalize automatically for me?

Personally I define a custom Ecto type for these things, they’re pretty simple to define.

You could also (probably?) write an INSERT trigger, see this tutorial and this SO post.

I mostly define it on the Elixir side because writing SQL functions ain’t my jam and when I am thinking “how or why did that happen?” I am going to look at “code” first, not my migrations or sql dump to see if any trigger was inserted.

Defining an SQL trigger is probably more desired if you have multiple clients or schemas, etc and you want to be sure anything thrown at the DB from anywhere gets processed in a specific way.

3 Likes

I don’t know your use case so maybe I am way off base, but I feel you are somehow too low level into the implementation details of Unicode. Unicode string normalization is not a zero-cost effort and unless you really really need it, shouldn’t be something you have to spend any attention on.

In the case of Postgres (but not necessarily other databases), the implementation of CHAR(n) is the same as VARCHAR so you aren’t saving any storage space. VARCHAR would be preferred in nearly all cases.

If you want to limit the length of the input for some reason, then you can use String.length/1 in Elixir or CHAR_LENGTH(string) in Postgres which is Unicode aware.

4 Likes