Hey!
I have a list of urls and I would want to ask Postgres to remove the urls which are already present in the database. I know I can do this with cte and unnest like this in pure sql:
WITH new_urls_to_be_crawled AS (
SELECT UNNEST(ARRAY[
'https://example.com/site/to/be/crawled',
'https://example.com/site/not/yet/crawled',
]) as source_url
)
SELECT source_url
FROM new_urls_to_be_crawled
LEFT OUTER JOIN crawled_pages USING(source_url)
WHERE crawled_pages.source_url IS NULL;
Would there be a nice way to do this in ecto as well? I’m currently loading all urls which were in the database and filtering them in elixir with new_urls = urls -- old_urls
but I thought it’s probably better to do it directly in the PostgreSQL.
Let me see if i get this right.
You have a list of url in the db and another one in memory.
You want to select only those url from the db which are not in the in-memory list?
Why not just do a not in
query?
from crawled in "crawled_pages",
where: crawled.url not in ^list_of_url`
I want the inverse of that so to keep the strings in memory and remove the ones which already exist in the database. The query you provided is the one I’m already using and then still need to remove the database urls from the in-memory urls.
I surmised it couldn’t be that easy.
I would probably keep the code you have, I do not think there is much to gain by moving it to the database.
I saw this blog post about UNNESTing WITH ORDINALITY and learned how I can give arrays to fragments and then it was easy.
Pure SQL would probably be readable though.
defmodule MyApp.CrawledURL do
use Ecto.Schema
import Ecto.Changeset
schema "crawled_urls" do
field :source_url, :string
end
end
urls = ["https://example.com/site/to/be/crawled", "https://example.com/site/not/yet/crawled"]
from fragment(
"SELECT * FROM UNNEST(?::text[]) source_url",
^urls
), select: [:source_url]
|> except_all(^from(CrawledURL, select: [:source_url]))
|> Repo.all
2 Likes