Ecto Query that removes spaces in table value before comparison

CONTEXT:

  • I have a geolocation table with a column that holds postal_codes
  • Those postal codes may have spaces like “AZ 0112”
  • I have a Type-To-Select form field that allows the user to type in the Postal code and then select from what is in the database

PROBLEM:

  • a user might type “az0112” or “az 0112” or “AZ0112” so I need to normalize the input with the table values by downcasing everything AND removing all spaces from input and table values
  • Downcasing the input and table value is no problem. And I can remove any spaces from the input. But I don’t know how to remove spaces from the table values inside the where clause.

Here’s what I have so far:

def postal_code_search_list(postal_code) when is_binary(postal_code) and postal_code != "" do
  postal_code =
    postal_code
    |> String.replace(" ", "")   # remove spaces
    |> String.downcase()

  postal_like = "#{postal_code}%"  # input is now downcased and has no spaces

  query =
    from g in Geolocation,
      where: like(fragment("lower(?)", g.postal_code), ^postal_like)
      # ISSUE: g.postal_code may have spaces --- need to remove them in the where clause.
      # Is there another fragment call I could make?

  Repo.all(query)
end

I think these are the SQL functions referenced in the Ecto fragment hexdocs, so maybe this will work?

where: like(fragment("replace(lower(?), ' ', '')", g.postal_code), ^postal_like)

Will give it a try. I’m really not well versed in Ecto or SQL. Any recommendations on how to improve this code would be most welcome.

Even if you knew how to write this, best practice would be to NOT write it - doing that kind of computation per-row makes it impossible for the DB to use indexes on postal_code.

There are two better options, depending on whether your application cares about preserving the specific format entered by the user:

  • if that format isn’t significant, then when you store a postal_code you could change it into a standard capitalization / spacing / etc
  • if that format is significant, you could store a second field with the standard capitalization / spacing / etc that’s only used for searching
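For the second option, a rough sketch of keeping such a search field in sync at write time (the postal_code_search column and this changeset are assumptions, with import Ecto.Changeset in scope):

def changeset(geolocation, attrs) do
  geolocation
  |> cast(attrs, [:postal_code])
  |> put_search_code()
end

# Derive the search value from whatever format the user entered
defp put_search_code(changeset) do
  case get_change(changeset, :postal_code) do
    nil ->
      changeset

    code ->
      put_change(changeset, :postal_code_search, code |> String.downcase() |> String.replace(" ", ""))
  end
end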

Ahhhhh. I see. I had not thought about indexing. That makes sense. I’ll just add another column with the postal codes all upper-cased and without spaces. That way, I have one value for presentation and one for searching.

Thank you! I would have never thought about the impact on indexing.

Not really. Just create an expression index where these functions are applied:

create(
  index(:geolocation_table, ["replace(lower(postal_code), ' ', '')"],
    name: :geolocation_postal_code_lower_without_spaces_index
  )
)
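A where clause whose expression matches the index definition exactly would then look like this (a sketch; note the ' ' and '' need to be literals inside the SQL string, not bound parameters, so the planner can recognize the indexed expression):

where: like(fragment("replace(lower(?), ' ', '')", g.postal_code), ^postal_like)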

Is an index just another column added to a table behind the scenes? And how do I tell the query to use the index instead of the regular column for searching? Sorry … I have very minimal understanding of databases.

Is an index just another column added to a table behind the scenes?

In general, no, but I think I understand what you’re asking.
An index over an applied function (an expression index) has the same effect as a new field where you store the results of the function application, plus a regular index on that field.
The database will automatically use the index if you use

replace(lower(postal_code), ' ', '')

in your query and you have defined the index as suggested above.
You can inspect the query with an EXPLAIN ANALYZE statement. The database will tell you whether the index was used or not. But keep in mind that the index can be ignored by the DB on a small number of records, so you’d have to either temporarily disable sequential scans for this check or insert more data into the table.
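For example, from IEx (a sketch; the table name and search pattern are assumed from this thread, and Repo.query is Ecto’s raw-SQL escape hatch):

# Run EXPLAIN ANALYZE through Ecto and print the plan
{:ok, result} =
  Repo.query("""
  EXPLAIN ANALYZE
  SELECT * FROM geolocation_table
  WHERE replace(lower(postal_code), ' ', '') LIKE 'az0112%'
  """)

Enum.each(result.rows, fn [line] -> IO.puts(line) end)

# To rule out the small-table effect you can SET enable_seqscan = off first,
# though with a pooled Repo it's easiest to do that check directly in psql.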


So I would add the index instead of a new column:

AND … my query would still use the following to format the table values according to the index:

where: like(fragment("replace(lower(?), ' ', '')", g.postal_code), ^postal_like)

But the query will leverage the new index (that you defined above), so I will not lose performance? Is that correct?

Almost. I missed that your query contains like. The index should use GIN in this case:

index(:geolocation_table, ..., name: ..., using: "gin")
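For completeness: a plain GIN index on text has no default operator class for LIKE, so something like the following might be needed (a sketch, assuming Postgres with the pg_trgm extension; the module name is hypothetical, the index name is reused from the earlier snippet):

defmodule MyApp.Repo.Migrations.AddPostalCodeSearchIndex do
  use Ecto.Migration

  def up do
    # pg_trgm supplies gin_trgm_ops, the operator class that lets a GIN index serve LIKE
    execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")

    create(
      index(:geolocation_table, ["replace(lower(postal_code), ' ', '') gin_trgm_ops"],
        name: :geolocation_postal_code_lower_without_spaces_index,
        using: "GIN"
      )
    )
  end

  def down do
    drop(
      index(:geolocation_table, [:postal_code],
        name: :geolocation_postal_code_lower_without_spaces_index
      )
    )

    execute("DROP EXTENSION IF EXISTS pg_trgm")
  end
end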

According to this site, postal codes for Azerbaijan may also have a - (hyphen) between the AZ and the digits. (I’m deducing Azerbaijan postal codes only from the example you gave - apologies if that’s not the case).

Given there is an official format, I wonder if only storing and presenting a canonical format makes the most sense. Then having a changeset function (or Ecto type) that casts user input to the canonical format. Query handling is also greatly simplified.
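An Ecto type version of that could look roughly like this (a sketch; the module name and the chosen canonical form are assumptions):

defmodule MyApp.PostalCode do
  use Ecto.Type

  def type, do: :string

  # Cast any user input to the canonical format, e.g. "az 1000" -> "AZ1000"
  def cast(value) when is_binary(value) do
    {:ok, value |> String.upcase() |> String.replace(" ", "")}
  end

  def cast(_), do: :error

  def load(value) when is_binary(value), do: {:ok, value}
  def load(_), do: :error

  def dump(value) when is_binary(value), do: {:ok, value}
  def dump(_), do: :error
end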

The user enters a “postal code” (or “city, country” combo) in order to search for events nearby. So there really isn’t a case for a changeset function. The user will type the first 2 characters and then be presented with a basic select-list.

I was a bit worried about the corner cases that @fuelen mentioned in his post. I just don’t have enough experience with SQL or Ecto to navigate that. So, I decided to go the super safe path of keeping the columns with display formats and creating additional columns with canonical formatting that matches the index formats. This is a huge table and it will not change much after deployment, so now is the best time to add columns.

So, I now have “postal_code” and “postal_code_idx”. I used an Ecto migration to create an index that matches the format of the postal_code_idx column.
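For anyone following along, the migration might look something like this (a sketch; the backfill UPDATE reflects the upper-cased, space-free format described above):

defmodule MyApp.Repo.Migrations.AddPostalCodeIdx do
  use Ecto.Migration

  def up do
    alter table(:geolocation_table) do
      add(:postal_code_idx, :string)
    end

    flush()  # make sure the column exists before backfilling

    # Backfill existing rows with the canonical search format
    execute("UPDATE geolocation_table SET postal_code_idx = replace(upper(postal_code), ' ', '')")

    create(index(:geolocation_table, [:postal_code_idx]))
  end

  def down do
    alter table(:geolocation_table) do
      remove(:postal_code_idx)
    end
  end
end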

If the user is a lazy typer like me (LOL) and doesn’t cap the postal code letters, then the app will format the input accordingly. That way, it is compared properly against postal_code_idx.

I decided not to remove any hyphens. I saw them used in other postal codes as well and I’m assuming a user may enter it as a required character. I only removed spaces for the search to eliminate situations where the user doesn’t enter a space or enters more than 1 space.

The user will already get a list of possible choices after 2 characters, so the input will then function like a select list.

This made me read up more on indexes!! Thank you all for all the advice!


In my (limited) experience this kind of “nearby” search gets more complicated than expected. For example, there are 88 cities in the USA called “Washington”. 41 Springfields. 73 Newtowns in the UK and 39 Mount Pleasants.

Then Unicode has a good laugh at us programmers because there are 20 different “space” characters, and at least 11 dash-like characters.

Of course I don’t know your use case or user base at all so I hope your solution works and scales as much as you need. I just think separating the canonicalisation of the search term in Elixir is much easier to implement and maintain than updating a SQL query whenever you need to make adjustments.


I’m realizing from your last post that I’m not fully understanding your recommendation. When you talk about “separating the canonicalisation of the search term” from the database entries, do you mean this:

To address the unicode issues:

Database: First I need to create a column of postal codes that itself has been canonicalized. So I need to take every entry of postal code and only keep letters and numbers. I then need to standardize the capitalization of the letters. Everything else needs to be deleted. So AZ-1000 or AZ 1000 or az 1000 or az-1000 becomes … AZ1000.

User Input: This input then needs to be put through the same canonicalization. So I would run each input through the same function that cleaned up the database column.

Now that the database entries and user input are in the same format, I can just use a straight up where clause in my query (thereby eliminating any updates on the SQL query itself):

where: g.postal_code_idx == ^input

Is that correct?
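Something like this hypothetical helper is what I have in mind (ASCII-only; the Unicode spaces and dashes mentioned above would need more than this regex):

defmodule MyApp.Canonical do
  # Keep only letters and digits, upcased: "az-1000" -> "AZ1000"
  def postal_code(input) when is_binary(input) do
    input
    |> String.upcase()
    |> String.replace(~r/[^A-Z0-9]/, "")
  end
end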

Are there any libraries that exist to “canonicalize” strings to just alphanumeric characters?

The challenge of cities with the same name will have to be addressed by encouraging the user to first enter Country and State/Region. I’m not sure how else to address the case of cities with same names.

Can you recommend anything that I should read for this kind of functionality? I know I’m re-inventing the wheel because this has been done a bazillion times. I leverage tz_datetime for my date entries. I would jump at any library that helps with this kind of canonicalization problem.

Wow @kip! I just found all the work you have done on unicode!

Posting this for others

Ex_unicode post on Elixir Forum with a ton of pointers to unicode libraries

Unicode String

Yes, this is pretty much what I was thinking. It separates the canonicalisation process from the database query. And since the canonicalisation process might vary over time (as you learn new things about the crazy ways people enter data), you only have to change it in one place, in Elixir code. And I expect Elixir code will be easier to express intent in and to maintain than a complex SQL query. Lastly, you’re giving the Postgres planner and optimiser a better chance to select indices for efficiency.