Getting duplicates instead of unique values

I know that uniq/1 and uniq_by/2 return a list of unique values, but what is the best way to get a list of duplicate values?

I noticed that the code for uniq/1 and uniq_by/2 could easily collect a list of duplicates while it collects the unique values, so I copied this code and modified it to return a tuple with a list of unique values and a list of duplicates.

But have I missed some existing functionality to do this?

Nope. And I have never missed a function that returns the duplicates. If I really need to know counts or duplicates, I usually just do something like this (even though it's probably not the most efficient way):

list
|> Enum.group_by(fn x -> x end)
|> Enum.filter(fn
  {_, [_, _ | _]} -> true
  _ -> false
end)
|> Enum.map(fn {x, _} -> x end)

Agree with NobbZ; I don’t know of any built-in functionality for this.

The most efficient way I know to do it is what it sounds like you’ve already done:

def duplicates(list) do
  # `elems` tracks every value seen so far; `dupes` collects values seen again.
  acc_dupes = fn x, {elems, dupes} ->
    if Map.has_key?(elems, x) do
      {elems, Map.put(dupes, x, nil)}
    else
      {Map.put(elems, x, nil), dupes}
    end
  end

  list |> Enum.reduce({%{}, %{}}, acc_dupes) |> elem(1) |> Map.keys()
end

I’m not sure if you’re aware, but we have the excellent match?/2 macro for these cases:

|> Enum.filter(&match?({_, [_, _ | _]}, &1))

or if you want to make it more descriptive:

|> Enum.filter(&match?({_, list} when is_list(list) and length(list) > 1, &1))

(although is_list(list) shouldn’t be necessary in this case)
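
Putting it together with the group_by pipeline from above, that would look something like this:

list
|> Enum.group_by(fn x -> x end)
|> Enum.filter(&match?({_, [_, _ | _]}, &1))
|> Enum.map(fn {x, _} -> x end)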


Thanks @nobbz and @jfeng for taking the time to look at this.

I copied the uniq/1 and uniq_by/2 code and modified it slightly to come up with the following:

  def split_uniq(enumerable) do
    split_uniq_by(enumerable, fn x -> x end)
  end

  def split_uniq_by(enumerable, fun) when is_list(enumerable) do
    split_uniq_list(enumerable, %{}, fun)
  end

  defp split_uniq_list([head | tail], set, fun) do
    value = fun.(head)

    case set do
      %{^value => true} ->
        {uniq, dupl} = split_uniq_list(tail, set, fun)
        {uniq, [head | dupl]}

      %{} ->
        {uniq, dupl} = split_uniq_list(tail, Map.put(set, value, true), fun)
        {[head | uniq], dupl}
    end
  end

  defp split_uniq_list([], _set, _fun),
    do: {[], []}

It returns a tuple containing a list of unique values and a list of duplicated values. I could have made it more efficient by returning only the duplicates; as written, it is about 20% slower than Enum.uniq/1.

The benefit of this approach is that it returns all duplicated elements in their original order. Given [2, 2, 2, 1, 1], your solutions will return [1, 2] as the duplicates, while my (copied) solution will return [2, 2, 1], which can then be run through Enum.uniq/1 or Enum.sort/1 if required. Your solutions' results appear in sorted order for smaller lists and in an apparently random order for longer lists, an artifact of how Erlang maps order their keys.

Because I copied the uniq_by/2 functionality, given a list of maps [%{id: 1, name: "a"}, %{id: 2, name: "b"}, %{id: 3, name: "b"}, %{id: 4, name: "b"}], my (copied) solution can return all elements with a duplicated name, for example [%{id: 3, name: "b"}, %{id: 4, name: "b"}]. Note that I only copied the functionality for lists, not for other enumerables.
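
For example, in an iex session using the code above:

iex> list = [%{id: 1, name: "a"}, %{id: 2, name: "b"}, %{id: 3, name: "b"}, %{id: 4, name: "b"}]
iex> split_uniq_by(list, & &1.name)
{[%{id: 1, name: "a"}, %{id: 2, name: "b"}],
 [%{id: 3, name: "b"}, %{id: 4, name: "b"}]}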

As could be expected, given that I copied it from the Elixir source code, my implementation is quite efficient compared to your solutions. Below are some Benchee benchmarks using a random list of 1000 integers between 0 and 99 (generated with StreamData). I assume I could match the efficiency of Enum.uniq/1 if I returned only the duplicates instead of both lists.

Name                ips        average  deviation         median         99th %
Enum.uniq       21.74 K       45.99 μs     ±7.19%          46 μs          49 μs
ryzey           17.95 K       55.72 μs     ±6.29%          54 μs          64 μs
jfeng            7.81 K      127.97 μs     ±9.64%         124 μs         173 μs
nobbz            6.41 K      156.03 μs    ±15.14%         151 μs         245 μs

Comparison: 
Enum.uniq       21.74 K
ryzey           17.95 K - 1.21x slower
jfeng            7.81 K - 2.78x slower
nobbz            6.41 K - 3.39x slower
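
A setup along these lines reproduces the comparison (the exact StreamData generator used for the list is an assumption here, and the entry names just mirror the table above):

# Sketch of the benchmark setup; the generator call is illustrative,
# and the solution functions are assumed to be in scope.
list = Enum.take(StreamData.integer(0..99), 1000)

Benchee.run(%{
  "Enum.uniq" => fn -> Enum.uniq(list) end,
  "ryzey" => fn -> split_uniq(list) end,
  "jfeng" => fn -> duplicates(list) end,
  "nobbz" => fn ->
    list
    |> Enum.group_by(fn x -> x end)
    |> Enum.filter(&match?({_, [_, _ | _]}, &1))
    |> Enum.map(fn {x, _} -> x end)
  end
})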

It would be nice to have a standard implementation of this so that duplicates could be retrieved in an efficient and consistent way. It feels like missing functionality when unique values can be obtained so easily. I guess looking for unique values must be a lot more common than looking for duplicates!


Just had a need for this and coded it up like so:

def duplicates(enumerable) do
  enumerable
  |> Enum.uniq()
  # delete one occurrence of each unique value; whatever survives was duplicated
  |> Enum.reduce(enumerable, fn e, acc -> List.delete(acc, e) end)
  |> Enum.uniq()
end

FWIW, that approach can have bad performance for specific inputs.

For instance, imagine passing it a large list of already-unique values - it will List.delete them one at a time, traversing the entire list each time. Quadratic slowdown!

Here’s an alternative approach that uses Map and Enum.frequencies:

  def duplicates(enumerable) do
    enumerable
    |> Enum.frequencies()
    |> Map.reject(fn {_, c} -> c == 1 end)
    |> Map.keys()
  end

One caveat: this version does not make any promises about the order that the duplicates appear in, while the version in your post will return them in order of their first appearance in the input.
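
If first-appearance order matters, a small variation (a quick sketch, not benchmarked) keeps Enum.frequencies/1 for the counting but walks the original input for the ordering:

# Sketch: O(n) like the frequencies version, but returns duplicates
# in order of their first appearance in the input.
def duplicates_in_order(enumerable) do
  freqs = Enum.frequencies(enumerable)

  enumerable
  |> Enum.uniq()
  |> Enum.filter(fn e -> freqs[e] > 1 end)
end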
