Calculate total percentage of an Explorer.DataFrame.group_by

My dataframe contains a knowledge graph of each person’s proficiency level for a given skill (person, skill, proficiency). I would like to show the count of a person’s proficiencies at a specific level, along with that count’s percentage of the person’s total proficiencies.

For example: Edward Smarts has 129 total proficiencies, and 79 of those are at skill level == 4. The percentage of his proficiencies at level 4 is therefore 79 / 129 ≈ 0.6124.
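For reference, here is a hypothetical toy version of the dataframe to run the snippets below against (the column names match my code; the people, skills, and levels are made up):

df =
  Explorer.DataFrame.new(
    person: ["Edward Smarts", "Edward Smarts", "Ada Vance"],
    skill: ["Elixir", "SQL", "Elixir"],
    level: [4, 3, 4]
  )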

My working code uses multiple interim dataframes to bring the information together, but I am looking for a more streamlined method. I made several attempts with summarise but could not get it to produce the desired results. I’d appreciate any suggestions! I am still learning the power of Livebook and the Explorer library.

require Explorer.DataFrame, as: DF

# Total proficiency count per person
total_profs_by_person =
  df
  |> DF.group_by(["person"])
  |> DF.summarise(total: count(person))
  |> DF.arrange(desc: total)

# Count per (person, level) pair, keeping only level 4
grouped = DF.group_by(df, ["person", "level"])

sme_count_by_person =
  grouped
  |> DF.summarise(count: count(level))
  |> DF.arrange(desc: count, asc: person)
  |> DF.filter(level == 4)

# Join on the shared "person" column, then compute the percentage
sme_percentage =
  DF.join(sme_count_by_person, total_profs_by_person)
  |> DF.mutate(percentage: count / total)
  |> DF.select(["person", "total", "level", "count", "percentage"])
  |> DF.arrange(desc: percentage)

I would probably use a pivot to have a column per level. Then you’re simply dividing one column by the sum of a set of columns.


Is that a pivot_wider on level? I’m still trying to understand the differences between group_by and pivot_longer/wider. Can I do that on the original df without needing the interim group_bys?


Sorry for giving you a too-terse first answer, but I’m glad you followed up.

It would be pivot_wider, to transform the values in your “level” series into one column per distinct level.

I don’t think you can remove the [person, level] group_by (bound to grouped), but the other one (total_profs_by_person) becomes unnecessary. Whether the group_by or the pivot_wider is applied first shouldn’t matter for the result, though.

(edit: when I say “doesn’t matter”, I mean the result wouldn’t be sensitive to the ordering. I’d probably do the group_by/count first, because that reduces the size of the dataframe you’re applying the pivot_wider to, so it’s a minor memory optimization.)
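To make that concrete, here is a rough sketch under a couple of assumptions: the column names match your snippet, and the distinct levels are exactly 1 through 4 (pivot_wider names the new columns after the values it finds in the data, and any missing person/level combination comes back as nil, which you’d want to fill before the arithmetic):

require Explorer.DataFrame, as: DF

wide =
  df
  |> DF.group_by(["person", "level"])
  |> DF.summarise(count: count(level))
  |> DF.ungroup()
  # one row per (person, level); pivot to one count column per level
  |> DF.pivot_wider("level", "count")

wide
|> DF.mutate(total: col("1") + col("2") + col("3") + col("4"))
|> DF.mutate(percentage: col("4") / total)
|> DF.arrange(desc: percentage)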


Thanks for elaborating. After weeding through much complexity, my original two-step approach seems to be the most straightforward. The pivot_wider may help, but because I need to pivot on the skill literal, I am hitting the roadblock that the values_from column must be an integer.


It’s fairly common in data analysis to need dummy variables or other ways of mapping categorical variables to ordinal values (i.e. natural numbers), so even if you’re content with your first solution, I hope you don’t mind me sharing a way around what’s blocking you from using pivot_wider.

You could add a column/series of ones to your initial dataframe. This new column is what you’d use for the values_from argument, and your existing skill column would be the names_from argument. Then you can group your data by [person] and perform a sum aggregation to obtain the count by skills.
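As a sketch of the mechanics (I’ll pivot on the level column from your earlier snippet, since that matches the level-4 goal; swap in skill and the same shape applies, and I’m assuming the distinct levels are 1 through 4):

require Explorer.DataFrame, as: DF

wide =
  df
  |> DF.mutate(one: 1)               # dummy column of ones
  |> DF.pivot_wider("level", "one")  # one numeric column per distinct level

# Summing the dummy columns per person recovers the counts by level
# (sum skips the nils left behind by the pivot).
wide
|> DF.group_by(["person"])
|> DF.summarise(
  level_1: sum(col("1")),
  level_2: sum(col("2")),
  level_3: sum(col("3")),
  level_4: sum(col("4"))
)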

Alternatively, if you want to start with the [person, skill] grouping with a count aggregation, your values_from argument would be whatever new column is holding the count. This would skip the dummy column of ones & the subsequent summation.

In any case, best of luck with your learning journey!


Another idea is to use select+sum for counting: sum(select(level == 4, 1, 0)).
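Spelled out against the same df and column names as above, that collapses the whole computation into a single grouped summarise:

require Explorer.DataFrame, as: DF

df
|> DF.group_by(["person"])
|> DF.summarise(
  total: count(level),
  sme_count: sum(select(level == 4, 1, 0))
)
|> DF.mutate(percentage: sme_count / total)
|> DF.arrange(desc: percentage)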