How to get first x elements from the list?

MatijaL · October 26, 2022, 12:11pm

Hello,

I have an Ecto query which gets me some data from the database, then I use Enum.filter to get filtered results and now I want to limit that to first 3 elements.

query |> Enum.filter(fn x -> x.created_by == "user" end) |> ???

Enum.filter returns the list of elements, how do I get the first 3 elements from it?

I have to do this on 2 sets of elements, I know I can filter them on db level and get them with 2 separate queries. I decided to get the data from the db with one larger query and then filter the data on app level as I believed this should be easier on the db and probably faster. Was I wrong here?

NeutronStein · October 26, 2022, 12:32pm

... |> Enum.take(3)

MatijaL · October 26, 2022, 12:32pm

I just found out about Enum.slice which can help me with that…

query |> Enum.filter(fn x -> x.created_by == "user" end) |> Enum.slice(0..2)

Enum.take() as @NeutronStein suggested is even nicer solution.

So this is the answer to my first question… if anyone can share their thoughts about getting and filtering data this way compared to the db way, I would be very grateful.

NeutronStein · October 26, 2022, 12:41pm

query |> where([x], x.created_by == ^"user") |> order_by([x], x.id) |> limit(3)

kartheek · October 26, 2022, 12:41pm

https://hexdocs.pm/ecto/Ecto.Query.html#limit/3

If you are not using rest of the rows - may be you can add limit directly to the query ?

Generally a single DB level query will always will be faster than writing two queries - you have to consider the overhead of serialisation. You can check query plans and then add indexes if needed.

sbuttgereit · October 26, 2022, 12:42pm

I don’t have any right to ask what I’m about to ask because I don’t know really anything about what you’re actually doing… but I’m not going to let that stop me.

Why not do the filtering in database query? It seems to me that you’re going to incur unnecessary latency transferring data between the database and the application; especially if they’re different servers and the data has to cross a network; and you’ll be processing the data (filtering) in a less efficient manner given that database servers are carefully designed to perform well with that workload. There are other reasons to do as much data processing in the database during retrieval but they kinda fall into the category of dealing with more data than necessary across the steps of the process.

MatijaL · October 26, 2022, 12:49pm

Yes, that’s the query I have in place but was thinking about changing it to a bigger query filtered by app.

I could, in fact, I’ve built it this way but now I’m thinking that having 2 queries would be slower then 1 bigger query filtered by the app. As I said, I may be wrong and I want to learn and understand those things better and that’s why I’m asking.

sbuttgereit · October 26, 2022, 12:55pm

I don’t see how this follows. If you can write 1 query for the database to get unfiltered data, why do you think you need 2 queries to achieve the desired filtering at the database?

MatijaL · October 26, 2022, 1:03pm

I’m sorry for not making it clear enough…

I have users and each user can be a project’s creator or contributor. In a user’s account page, I want to list projects where user is a creator but also list projects where user is a contributor.
So, is it better to create 1 query with all user projects(both creator and contributor) and then on an app level, filter them and show them in different groups or is it better to have 2 separate queries, 1 for projects where user is a creator and 1 for projects where user is a contributor?

dimitarvp · October 26, 2022, 1:30pm

That’s what I would do. Get both kinds of users with one DB query and then programmatically split the list into two – author / contributor. Super easy when you add Enum.group_by at the end of your pipe.

sbuttgereit · October 26, 2022, 1:32pm

I would need to see the actual table structures to give you the best advice. Having said that, based on my understanding of your description, I’ve heard nothing that would require you to use two queries to get the complete filtered (and even sorted) data from the database in a single query.

Since you’re getting the unfiltered data via a single query, it sounds like there is a single row description that works for both the creator and contributor data which can be a sticking point in these cases. This could be within the realm of a simple join query with appropriate where predicates or a union query depending on the details of the information architecture you’re dealing with.

So for example. I have a very enterprisy application which has simple firewall like rules allowing a tenant to specify which hosts or networks their users can be seen to be connecting from. There is a table for global rules which apply to all tenants, there is a table for the rules of each tenant, and then a single tenant can have multiple application instances each of which can define their own rules which is kept in its own table… and if no applicable rules are found there’s a default rule to apply. When a user wants to authenticate, I need to find which, if any, of the global, tenant, or instance network rules matches the host we see the user originating from (and yes, there are many problems with this sort of thing in practice… but, let’s stick to databases for now). So I query the database in a single database query which queries all three tables filtered for the user’s tenant, target instance, and originating host, sorts the rules of all three tables in order of precedence, and returns the single row which will be applied for the specific scenario given the variables of the query or returns the default rule if no records match the criteria. In this case I do this with a union query since I can order based on the precedence of the tables compared to each other. The application only ever sees the single record which is the rule that governs that particular request.

Anyway, I say all that to demonstrate that mixed data queries can be made into a single query which can filter down to the precise records the application needs. The biggest obstacle is if the shape of the returned records can’t be sensibly reconciled, but it sounds like you’ve achieved that already.

addendum

Finally to be clear, I’m speaking about the filtering scenario:

Enum.filter(fn x -> x.created_by == "user" end) |> Enum.take(3)

What @dimitarvp describes is related, but a little bit different and I don’t necessarily take issue with what he’s saying. But bringing rows from the database to the application to just decide what rows to discard is what I would suggest you reconsider.

dimitarvp · October 26, 2022, 1:50pm

Well I don’t have a lot of context on OP’s task but if they are worried about load spikes – or generally high load – then I’d definitely resort to “get what you need by a clever complex SQL / Ecto query and reshape it with Elixir to match your business code needs”.

Until we hear something more this is what it looks like they need.

MatijaL · October 26, 2022, 5:56pm

Yes, I believe this is what I need and that’s why I even started thinking about refactoring my queries. In your opinion, would using Enum.group_by be a better choice then using Enum.filter

query |> Enum.filter(fn x -> x.created_by == "user" end) |> Enum.take(3)

Well, I don’t really worry about load spikes right now but I want to understand those things now so I will be able to make better decisions when I will face those problems.

@sbuttgereit Thanks for taking the time to write such a thoughtful response, I will definitely take the time to read it carefully, try to fully understand it and learn from it.

dimitarvp · October 26, 2022, 7:47pm

I am still not very clear on what you exactly do you need but no, don’t use Enum.take, use Ecto’s limit.

Eiji · October 26, 2022, 8:27pm

Ecto.Query.API based solution

Here goes an example script that shows how you can filter by associations. It should be the most optimal way.

Mix.install([:ecto_sql, :postgrex])

defmodule Repo do
  use Ecto.Repo, adapter: Ecto.Adapters.Postgres, otp_app: :my_app
end

defmodule Migration do
  use Ecto.Migration

  def change do
    create table("users") do
      add(:name, :string)
      timestamps()
    end

    create table("projects") do
      add(:creator_id, references(:users))
      add(:name, :string)
      timestamps()
    end

    create table("user_projects", primary_key: false) do
      add(:project_id, references(:projects))
      add(:user_id, references(:users))
    end
  end
end

defmodule User do
  use Ecto.Schema

  schema "users" do
    field(:name)
    timestamps()
  end
end

defmodule Project do
  use Ecto.Schema

  schema "projects" do
    belongs_to(:creator, User)
    field(:name, :string)
    many_to_many(:contributors, User, join_through: "user_projects")
    timestamps()
  end
end

defmodule Example do
  alias Ecto.Query
  require Query

  def cleanup do
    Repo.stop()
  end

  def prepare do
    Application.put_env(:my_app, Repo,
      database: "example",
      password: "postgres",
      pool_size: 10,
      show_sensitive_data_on_connection_error: true,
      username: "postgres"
    )

    Application.ensure_all_started(:ecto_sql)
    Application.ensure_all_started(:ecto_sqlite3)
    Repo.__adapter__().storage_down(Repo.config())
    Repo.__adapter__().storage_up(Repo.config())
    Repo.start_link()
    Ecto.Migrator.up(Repo, 1, Migration)
  end

  def sample do
    Project
    |> Query.from(as: :project)
    # prevent duplicates if creator is also a contributor
    |> Query.distinct([project: project], project.name)
    # join assocs
    |> Query.join(:inner, [project: project], assoc(project, :creator), as: :creator)
    |> Query.join(:inner, [project: project], assoc(project, :contributors), as: :contributors)
    # filter projects based on its assocs
    |> Query.where(
      [contributors: contributors, creator: creator],
      contributors.name == "Foo" or creator.name == "Foo"
    )
    # limit number of projects
    |> Query.limit(2)
    |> Repo.all()
    |> IO.inspect()
  end

  def seed do
    foo = Repo.insert!(%User{name: "Foo"})
    bar = Repo.insert!(%User{name: "Bar"})
    Repo.insert(%Project{contributors: [foo, bar], creator: foo, name: "both"})
    Repo.insert(%Project{contributors: [bar], creator: foo, name: "creator"})
    Repo.insert(%Project{contributors: [foo, bar], creator: bar, name: "contributor"})
    Repo.insert(%Project{contributors: [bar], creator: bar, name: "none"})
  end
end

Example.prepare()
Example.seed()
Example.sample()
Example.cleanup()

Please keep in mind that not all databases support distinct like it was used in code above, for example SQLite. I send a PostgreSQL based example as this is a default choice for an Elixir/Phoenix database.

Enum based solution

If for some reason you can’t use ecto’s query API like mentioned by others already then my example may be interesting for you:

defmodule Example do
  def sample(list, func \\ &Function.identity/1, max \\ :infinity)

  def sample(list, func, :infinity) when is_list(list) and is_function(func, 1) do
    Enum.filter(list, func)
  end

  def sample(list, func, max)
      when is_list(list) and is_function(func, 1) and is_integer(max) and max > 0 do
    list
    |> Enum.reduce_while({0, []}, fn element, {count, acc} ->
      case {count + 1, func.(element)} do
        {_count, result} when result in [nil, false] -> {:cont, {count, acc}}
        {^max, _result} -> {:halt, {count, [element | acc]}}
        {count, _result} -> {:cont, {count, [element | acc]}}
      end
    end)
    |> then(fn {_count, acc} -> Enum.reverse(acc) end)
  end
end

The code above is a bit big, but that’s because it’s really flexible. In short it’s a combination of Enum.filter/2 + Enum.take/2 in one function. With it you can filter only first n matching elements which is especially useful when working with big lists.

MatijaL · October 27, 2022, 8:31am

Users can create many projects, also, they can contribute to many projects. Each project can have only one creator and only one contributor so there are two separate table fields (:created_by and :contributed_by).

project.ex
schema "projects" do
    fields ...

    belongs_to :created_by, User
    belongs_to :contributed_by, User
    timestamps()
end

Here I’m getting all user projects with one query and then use Enum functions to filter them

account_controller.ex
def index(conn, _params, current_user) do
   profile = Profiles.get_profile!(current_user.id)
   projects = Projects.list_user_projects(current_user)
   created_projects = Enum.filter(projects, fn x -> x.created_by_id == current_user.id end) |> Enum.sort_by(&(&1.inserted_at), Date) |> Enum.take(3)
   contributed_projects = Enum.filter(projects, fn x -> x.contributed_by_id == current_user.id end) |> Enum.sort_by(&(&1.inserted_at), Date) |> Enum.take(3)

   render(conn, "index.html", created_projects: created_projects, contributed_projects: contributed_projects)
end

projects.ex
def list_user_projects(user) do
    query = from p in Project, where: p.created_by_id == ^user.id, or_where: p.contributed_by_id == ^user.id, select: p

    Repo.all(query)
end

Alternative which I’ve created first is that I have two queries, one for created projects and one for contributed projects.

account_controller.ex
def index(conn, _params, current_user) do
   profile = Profiles.get_profile!(current_user.id)
   created_projects = Projects.list_created_projects(current_user)
   contributed_projects = Projects.list_contributed_projects(current_user)

   render(conn, "index.html", created_projects: created_projects, contributed_projects: contributed_projects)
end

projects.ex
def list_created_projects(user) do
    query = from p in Project, where: p.created_by_id == ^user.id, order_by: [desc: p.inserted_at], limit: 3, select: p
    Repo.all(query)
end
def list_contributed_projects(user) do
    query = from p in Project, where: p.contributed_by_id == ^user.id, order_by: [desc: p.inserted_at], limit: 3, select: p
    Repo.all(query)
end

In my template, I want to show 3 latest created projects and 3 latest contributed projects.

dimitarvp · October 27, 2022, 1:21pm

I see. I’d go for your first approach plus add sorting by inserted_at in list_user_projects so you don’t have to sort in the Elixir code.

dimitarvp · October 27, 2022, 1:30pm

Wait no, you should also add Ecto’s limit.

Hm, it seems you just need a rather more specialized function for this and not the generic list_user_projects:

def list_created_projects(user) do
  query =
      from p in Project,
      where: p.created_by_id == ^user.id,
      order_by: [asc: :inserted at],
      limit: 3,
      select: p

  Repo.all(query)
end

def list_contributed_projects(user) do
  query =
      from p in Project,
      where: p.contributed_by_id == ^user.id,
      order_by: [asc: :inserted at],
      limit: 3,
      select: p

  Repo.all(query)
end

This can probably be made with only one Ecto/SQL query but right now I can’t figure out how (just got up from a nap, lol).

MatijaL · October 27, 2022, 1:49pm

What if I create list_user projects() function where I make two queries, one for created projects, one for contributed projects and use union_all to combine the results?

dimitarvp · October 27, 2022, 2:01pm

Problem is that I don’t remember now. I’d advise you to try really hard to do it with one query in general, yes.