Getting duplicates instead of unique values

I know that uniq/1 and uniq_by/2 return a list of unique values, but what is the best way to get a list of the duplicate values?

I noticed that the code for uniq and uniq_by could easily collect the duplicates at the same time as it collects the unique values, so I copied that code and modified it to return a tuple with a list of unique values and a list of duplicates.

But have I missed some existing functionality to do this?

Nope. And I have never missed a function that returns the duplicates. If I really need to know counts or duplicates, I usually just do something like this (even though it’s probably not the most efficient way):

list
|> Enum.group_by(fn x -> x end)   # group equal elements together
|> Enum.filter(fn
  {_, [_, _ | _]} -> true         # keep only groups with at least two members
  _ -> false
end)
|> Enum.map(fn {x, _} -> x end)   # take the duplicated value from each group
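For example, in iex (illustrative session; the grouping step makes the mechanics visible, and for a map this small the keys happen to come out sorted):

iex> groups = Enum.group_by([1, 2, 2, 3, 3, 3], fn x -> x end)
%{1 => [1], 2 => [2, 2], 3 => [3, 3, 3]}
iex> groups |> Enum.filter(fn {_, [_, _ | _]} -> true; _ -> false end) |> Enum.map(fn {x, _} -> x end)
[2, 3]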

Agree with NobbZ, I don’t know of any built-in functionality for this.

The most efficient way I know to do it is what it sounds like you’ve already done:

def duplicates(list) do
  # Fold over the list with two maps: `elems` holds every value seen so far,
  # `dupes` holds values that have been seen more than once.
  acc_dupes = fn x, {elems, dupes} ->
    if Map.has_key?(elems, x) do
      {elems, Map.put(dupes, x, nil)}
    else
      {Map.put(elems, x, nil), dupes}
    end
  end

  list |> Enum.reduce({%{}, %{}}, acc_dupes) |> elem(1) |> Map.keys()
end
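For instance (illustrative; assumes duplicates/1 above is compiled and in scope, and note that Map.keys/1 order is not something to rely on in general):

duplicates([1, 2, 2, 3, 3, 3])
#=> [2, 3]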

I’m not sure if you’re aware, but we have the excellent match?/2 macro for these cases:

|> Enum.filter(&match?({_, [_, _ | _]}, &1))

or if you want to make it more descriptive:

|> Enum.filter(&match?({_, list} when is_list(list) and length(list) > 1, &1))

(although is_list(list) shouldn’t be necessary in this case)
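For example (illustrative):

Enum.filter([{:a, [1]}, {:b, [1, 2]}, {:c, [1, 2, 3]}], &match?({_, [_, _ | _]}, &1))
#=> [b: [1, 2], c: [1, 2, 3]]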


Thanks @nobbz and @jfeng for taking the time to look at this.

I copied the uniq/1 and uniq_by/2 code and modified it slightly to come up with the following:

  def split_uniq(enumerable) do
    split_uniq_by(enumerable, fn x -> x end)
  end

  def split_uniq_by(enumerable, fun) when is_list(enumerable) do
    split_uniq_list(enumerable, %{}, fun)
  end

  # Walk the list, tracking keys seen so far in a map: first occurrences are
  # prepended to the uniques list, repeats to the duplicates list.
  defp split_uniq_list([head | tail], set, fun) do
    value = fun.(head)

    case set do
      %{^value => true} ->
        {uniq, dupl} = split_uniq_list(tail, set, fun)
        {uniq, [head | dupl]}

      %{} ->
        {uniq, dupl} = split_uniq_list(tail, Map.put(set, value, true), fun)
        {[head | uniq], dupl}
    end
  end

  defp split_uniq_list([], _set, _fun),
    do: {[], []}

It returns a tuple containing a list of unique values and a list of duplicated values. I could have made it more efficient by returning only the duplicates (something like the sketch below); as written it’s about 20% slower than Enum.uniq/1.
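A minimal sketch of what that duplicates-only variant might look like (my guess at the simplification, not benchmarked):

  defp dupl_list([head | tail], set, fun) do
    value = fun.(head)

    case set do
      # key seen before: this element is a duplicate
      %{^value => true} -> [head | dupl_list(tail, set, fun)]
      # first occurrence: remember the key, drop the element
      %{} -> dupl_list(tail, Map.put(set, value, true), fun)
    end
  end

  defp dupl_list([], _set, _fun), do: []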

The benefit of this approach is that it returns all duplicate elements in the original order. Given [2, 2, 2, 1, 1], your solutions will return [1, 2] as the duplicates, while my (copied) solution will return [2, 2, 1], which can then be run through Enum.uniq/1 or Enum.sort/1 if required. Your solutions’ results appear in sorted order for smaller lists and in an effectively random order for longer lists, presumably because they fall out of map key order: small Erlang maps keep their keys sorted, while larger ones are hashed.
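For example (illustrative, with split_uniq/1 from above):

split_uniq([2, 2, 2, 1, 1])
#=> {[2, 1], [2, 2, 1]}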

Because I copied the uniq_by/2 functionality, given a list of maps such as [%{id: 1, name: "a"}, %{id: 2, name: "b"}, %{id: 3, name: "b"}, %{id: 4, name: "b"}], my (copied) solution can return all elements with a duplicated name, for example [%{id: 3, name: "b"}, %{id: 4, name: "b"}]. I only copied the functionality for lists, not for other enumerables.
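Concretely (illustrative, with split_uniq_by/2 from above):

people = [%{id: 1, name: "a"}, %{id: 2, name: "b"}, %{id: 3, name: "b"}, %{id: 4, name: "b"}]
{_uniq, dupl} = split_uniq_by(people, & &1.name)
# dupl == [%{id: 3, name: "b"}, %{id: 4, name: "b"}]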

As could be expected, given that I copied it from the Elixir source code, my implementation is quite efficient compared to your solutions. Below are some Benchee benchmarks using a random list of 1000 integers between 0 and 99 (generated with StreamData). I assume I could match the efficiency of Enum.uniq/1 if I returned only the duplicates instead of both lists.

Name                ips        average  deviation         median         99th %
Enum.uniq       21.74 K       45.99 μs     ±7.19%          46 μs          49 μs
ryzey           17.95 K       55.72 μs     ±6.29%          54 μs          64 μs
jfeng            7.81 K      127.97 μs     ±9.64%         124 μs         173 μs
nobbz            6.41 K      156.03 μs    ±15.14%         151 μs         245 μs

Comparison: 
Enum.uniq       21.74 K
ryzey           17.95 K - 1.21x slower
jfeng            7.81 K - 2.78x slower
nobbz            6.41 K - 3.39x slower
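Roughly, the harness looked like this (a sketch rather than the exact script; the StreamData call and the way each solution is wired in are approximations):

# assumes split_uniq/1 and duplicates/1 from earlier in the thread are in scope
list =
  StreamData.integer(0..99)
  |> Enum.take(1000)

nobbz = fn input ->
  input
  |> Enum.group_by(fn x -> x end)
  |> Enum.filter(fn {_, [_, _ | _]} -> true; _ -> false end)
  |> Enum.map(fn {x, _} -> x end)
end

Benchee.run(%{
  "Enum.uniq" => fn -> Enum.uniq(list) end,
  "ryzey" => fn -> split_uniq(list) end,
  "jfeng" => fn -> duplicates(list) end,
  "nobbz" => fn -> nobbz.(list) end
})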

It would be nice to have a standard implementation of this so that duplicates could be retrieved in an efficient and consistent way. It seems like missing functionality when unique values can be obtained so easily. I guess looking for unique values must be a lot more common than looking for duplicates!
