How to correctly using Distinct function in ecto?

freewebwithme · October 18, 2022, 5:01am

This is simplified data structures

schema "campaigns" do
  belongs_to :created_by, User, type: :id
  has_one :analytics
end

schema "analytics" do
  belongs_to :campaign, Campaign
end

schema "message_records" do
  belongs_to :campaign, Campaign
end

Campaign has_one analytics. But more than one message_records are belongs_to one campaign.
Because campaign will have many(like 10-1000) meesages which has one message_records each.

What I want is that return this value grouped_by campaign.created_by_id

user_id: campaign.created_by_id
total_campaigns_count: count(campaign.id),
total_messages_count: sum(a.sent_messages),
total_credit_used: sum(mdr.segments),

I tried

 campaign_with_mdr =
      base_query()
      |> where([c], ^filters_with(filters))
      |> join(:inner, [c], mdr in MessageDetailsRecord, on: c.id == mdr.campaign_id)
      |> distinct(:id)

query = 
      from cm in subquery(campaign_with_mdr),
        left_join: a in assoc(cm, :analytics),
        group_by: [c.created_by_id],
        select: %{
          user_id: cm.created_by_id,
          total_campaigns_count: count(c.id),
          total_messages_count: sum(a.sent_messages),
          total_credit_used: sum(mdr.segments)
        }
Repo.all(query)

This query won’t return duplicate campaigns but can’t get a total_credit_used: sum(mdr.segments)

Then I tried other method

 query =
      from c in Campaign,
        where: ^filters_with(filters),
        left_join: a in assoc(c, :analytics),
        join: mdr in MessageDetailsRecord,
        on: c.id == mdr.campaign_id,
        distinct: c.id,
        group_by: [c.created_by_id, c.id],
        select: %{
          user_id: c.created_by_id,
          total_campaigns_count: count(c.id),
          total_messages_count: sum(a.sent_messages),
          total_credit_used: sum(mdr.segments),

        }

    Repo.all(query)

This method’s problem is that I need to include c.id into group_by clause or not get an error says
ERROR 42803 (grouping_error) column "c0.id" must appear in the GROUP BY clause or be used in an aggregate function

Can you help?

andreaseriksson · October 18, 2022, 6:08am

Cant you just split it up in two or three queries?

al2o3cr · October 19, 2022, 12:28am

My interpretation of what you’re selecting suggests that there are two different kinds of aggregate calculations:

per campaign, to add up all the mdr.segments
per user, to count campaigns and add up all the sent_messages

The first part is a small query:

ids_with_mdr_counts_query =
  from(c in Campaign,
    where: ^filters_with(filters),
    join: mdr in MessageDetailsRecord,
    on: c.id == mdr.campaign_id,
    group_by: c.id,
    select: %{
      campaign_id: c.id,
      user_id: c.created_by_id,
      total_credit_used: sum(mdr.segments)
    }
  )

Then the rest:

from(data in subquery(ids_with_mdr_counts_query),
  left_join: a in Analytics,
  on: a.campaign_id == data.campaign_id,
  group_by: data.user_id,
  select: %{
    user_id: data.user_id,
    total_campaigns_count: count(data.campaign_id),
    total_messages_count: sum(a.sent_messages),
    total_credit_used: sum(data.total_credit_used)
  }
)

This takes the now-aggregated {campaign_id, user_id, total_credit_used} tuples for each campaign matching filters and aggregates them further by user_id.

(apologies if there are errors in the above, I write most Ecto queries through incremental refinement in iex so this is trickier)

freewebwithme · October 19, 2022, 3:54pm

Thanks for the detailed answer. But I still don’t understand. which point does it remove duplicate message detail record?

 campaign_with_mdr =
        base_query()
        |> where([c], ^filters_with(filters))
        |> join(:inner, [c], mdr in MessageDetailsRecord, as: :records, 
            on: c.id == mdr.campaign_id)
        |> distinct(true)
  
      final_query =
        from [cm, records: mdr] in campaign_with_mdr,
          left_join: a in assoc(cm, :analytics),
          group_by: [cm.created_by_id],
          select: %{
            user_id: cm.created_by_id,
            total_campaigns_count: count(cm.id),
            total_messages_count: sum(a.sent_messages),
            total_credit_used: sum(mdr.segments),

          }

This is what I tried and what is difference between your code and mine?

joey_the_snake · October 19, 2022, 4:25pm

freewebwithme:

query = 
      from cm in subquery(campaign_with_mdr),
        left_join: a in assoc(cm, :analytics),
        group_by: [c.created_by_id],
        select: %{
          user_id: cm.created_by_id,
          total_campaigns_count: count(c.id),
          total_messages_count: sum(a.sent_messages),
          total_credit_used: sum(mdr.segments)
        }

The problem is there’s no source in the query called mdr. You have cm and a. It will probably work if you put that in. But not sure what you want it to be. There should be an issue with c as well but maybe it’s masked by the mdr error.

edit: Now that I read it more carefully, you actually probably want to reference them both with cm because they are coming from the subquery.

al2o3cr · October 19, 2022, 5:40pm

The results of your campaign_with_mdr query aren’t changed by the distinct(true) because there aren’t any duplicates - each row has a distinct pair of (campaigns.id, message_details_records.id) values. More concretely, imagine we have this data:

campaigns:

id	created_by_id
1	4
2	4
3	5

message detail records:

id	campaign_id	segments
6	1	1
7	1	2
8	1	7
9	2	1
10	2	1
11	3	1

analytics:

id	campaign_id	sent_messages
12	1	17
13	2	42
14	3	100

Then the rows from your campaign_with_mdr subquery look like:

`campaigns.id`	`campaigns.created_by_id`	`mdrs.id`	`mdrs.segments`
1	4	6	1
1	4	7	2
1	4	8	7
2	4	9	1
2	4	10	1
3	5	11	1

Joining the analytics table adds those columns to each candidate row:

`campaigns.id`	`campaigns.created_by_id`	`mdrs.id`	`mdrs.segments`	`a.id`	`a.sent_messages`
1	4	6	1	12	17
1	4	7	2	12	17
1	4	8	7	12	17
2	4	9	1	13	42
2	4	10	1	13	42
3	5	11	1	14	100

I’m assuming the “duplicates” you refer to are the multiple rows with the same campaigns.id value here, which will cause things like total_campaigns_count to be incorrect.

The version in my previous post avoids that by adding up mdr.segments first and then joining and counting. The rows from the ids_with_mdr_counts_query look like:

`campaign_id`	`user_id`	`total_credit_used`
1	4	10
2	4	2
3	5	1

This is where the “duplicates” went: they’ve already been summed, producing exactly one row per campaigns.id.

Joining analytics works just like before:

`campaign_id`	`user_id`	`total_credit_used`	`a.id`	`a.sent_messages`
1	4	10	12	17
2	4	2	13	42
3	5	1	14	100

Now the sums and counts are correct - user 4 has 2 total campaigns, 12 total credits used, and 59 messages sent.

freewebwithme · October 19, 2022, 6:17pm

Thanks @al2o3cr !!! I really appreciated!