Thanks @nobbz and @jfeng for taking the time to look at this.
I copied the `uniq/1` and `uniq_by/2` code and modified it slightly to come up with the following:
```elixir
def split_uniq(enumerable) do
  split_uniq_by(enumerable, fn x -> x end)
end

def split_uniq_by(enumerable, fun) when is_list(enumerable) do
  split_uniq_list(enumerable, %{}, fun)
end

defp split_uniq_list([head | tail], set, fun) do
  value = fun.(head)

  case set do
    %{^value => true} ->
      {uniq, dupl} = split_uniq_list(tail, set, fun)
      {uniq, [head | dupl]}

    %{} ->
      {uniq, dupl} = split_uniq_list(tail, Map.put(set, value, true), fun)
      {[head | uniq], dupl}
  end
end

defp split_uniq_list([], _set, _fun), do: {[], []}
```
It returns a tuple containing a list of unique values and a list of duplicated values. I could have made it more efficient by only returning the duplicates (as written it is about 20% slower than `Enum.uniq/1`).
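For what it's worth, a duplicates-only variant might look like the sketch below. This is my own guess at the simplification, not code from the Elixir source: it tracks seen keys in the same map but discards unique elements instead of accumulating them.

```elixir
defmodule DuplOnly do
  # Hypothetical duplicates-only variant: track seen keys in a map and
  # keep only elements whose key has already been seen, in original order.
  def duplicates(list, fun \\ fn x -> x end) when is_list(list) do
    dup(list, %{}, fun)
  end

  defp dup([head | tail], set, fun) do
    value = fun.(head)

    case set do
      # Key already seen: head is a duplicate, so keep it.
      %{^value => true} -> [head | dup(tail, set, fun)]
      # First occurrence: record the key and drop head from the result.
      %{} -> dup(tail, Map.put(set, value, true), fun)
    end
  end

  defp dup([], _set, _fun), do: []
end

DuplOnly.duplicates([2, 2, 2, 1, 1])
# => [2, 2, 1]
```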
The benefit of this approach is that it returns all duplicate elements in their original order. Given `[2, 2, 2, 1, 1]`, your solutions will return `[1, 2]` as the duplicates. My (copied) solution will return `[2, 2, 1]`, which can then be run through `Enum.uniq/1` or `Enum.sort/1` if required. Your solutions' results appear in sort order for smaller lists and in a seemingly random order for longer lists.
Because I copied the `uniq_by/2` functionality, given a list of maps such as `[%{id: 1, name: "a"}, %{id: 2, name: "b"}, %{id: 3, name: "b"}, %{id: 4, name: "b"}]`, my (copied) solution can return all elements with a duplicated name, for example `[%{id: 3, name: "b"}, %{id: 4, name: "b"}]`. I only copied the functionality for lists, not for other enumerables.
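To make these behaviours concrete, here is the same code wrapped in a module (the `SplitUniq` name is mine, purely for illustration) together with both calls:

```elixir
defmodule SplitUniq do
  # Split a list into {unique_values, duplicated_values},
  # preserving the original order in both lists.
  def split_uniq(enumerable) do
    split_uniq_by(enumerable, fn x -> x end)
  end

  def split_uniq_by(enumerable, fun) when is_list(enumerable) do
    split_uniq_list(enumerable, %{}, fun)
  end

  defp split_uniq_list([head | tail], set, fun) do
    value = fun.(head)

    case set do
      # Key already seen: head is a duplicate.
      %{^value => true} ->
        {uniq, dupl} = split_uniq_list(tail, set, fun)
        {uniq, [head | dupl]}

      # First occurrence: record the key and keep head as unique.
      %{} ->
        {uniq, dupl} = split_uniq_list(tail, Map.put(set, value, true), fun)
        {[head | uniq], dupl}
    end
  end

  defp split_uniq_list([], _set, _fun), do: {[], []}
end

SplitUniq.split_uniq([2, 2, 2, 1, 1])
# => {[2, 1], [2, 2, 1]}

maps = [%{id: 1, name: "a"}, %{id: 2, name: "b"}, %{id: 3, name: "b"}, %{id: 4, name: "b"}]
SplitUniq.split_uniq_by(maps, & &1.name)
# => {[%{id: 1, name: "a"}, %{id: 2, name: "b"}],
#     [%{id: 3, name: "b"}, %{id: 4, name: "b"}]}
```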
As might be expected, given that I copied it from the Elixir source code, my implementation is quite efficient compared to your solutions. Below are some Benchee benchmarks using a random list of 1000 integers between 0 and 99 (generated with StreamData). I assume I could match the efficiency of `Enum.uniq/1` if I only returned duplicates, instead of both unique values and duplicates.
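The setup was roughly along these lines; this is an illustrative sketch rather than my exact script, and it assumes the `:benchee` and `:stream_data` packages are installed and the candidate implementations are compiled (the `SplitUniq` module name here is a placeholder):

```elixir
# Generate a random list of 1000 integers between 0 and 99.
list =
  StreamData.integer(0..99)
  |> Enum.take(1000)

Benchee.run(%{
  "Enum.uniq" => fn -> Enum.uniq(list) end,
  "ryzey" => fn -> SplitUniq.split_uniq(list) end
  # plus the jfeng and nobbz implementations, omitted here
})
```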
```
Name           ips        average  deviation    median    99th %
Enum.uniq  21.74 K       45.99 μs     ±7.19%     46 μs     49 μs
ryzey      17.95 K       55.72 μs     ±6.29%     54 μs     64 μs
jfeng       7.81 K      127.97 μs     ±9.64%    124 μs    173 μs
nobbz       6.41 K      156.03 μs    ±15.14%    151 μs    245 μs

Comparison:
Enum.uniq  21.74 K
ryzey      17.95 K - 1.21x slower
jfeng       7.81 K - 2.78x slower
nobbz       6.41 K - 3.39x slower
```
It would be nice to have a standard implementation of this so that duplicates could be retrieved in an efficient and consistent way. It feels like missing functionality, given how easily unique values can be obtained. I guess looking for unique values must be a lot more common than looking for duplicates!