I had some spare time, so here are three implementations.
I have created an updated gist for the benchmark which you can run just as before as elixir benchmark.exs
.
I’ve also added a few larger inputs, so you can see the difference in how solutions behave if they get a list with 100 elements vs a list with 1_000_000 elements.
1. Using Okasaki
The following is a 1:1 conversion of @100phlecs ’ solution but using the afore-mentioned Okasaki library for the queue rather than using Erlang’s :queue
module directly. I’ve added the three different backing queue implementations as separate functions to the benchmark so you can see what the difference in performance and memory usage is (for this particular problem).
The main takeaway (at least for me) is that in these kinds of situations you can see the overhead of an extra Elixir struct wrapping the thing you are manipulating, and the overhead of dispatching through a protocol (even though it is consolidated).
I expect that these overheads are especially visible because there are many inter-module calls which are not JIT-friendly. Nonetheless, in everything but the most performance-critical code I think that using a library that has Elixir-friendly features (e.g. pipe-able functions, data structures implementing Inspect
, Enumerable
, Collectable
, Access
) is preferable to using :queue
or :array
directly (or rolling your own).
defmodule QqwyOkasaki do
@moduledoc """
A straight-up 1:1 port of 100phlecs' solution but using queues from Okasaki library
instead of Erlang's `:queue` module.
Unfortunately, benchmarking shows that there is (a constant time) overhead,
probably because of the extra Elixir struct wrapping the queue, and dispatching happening through protocols.
"""
def solution([num | rst], target, queue_implementation \\ Okasaki.Implementations.ConstantQueue) do
q = Okasaki.Queue.empty(implementation: queue_implementation)
find(rst, Okasaki.Queue.insert(q, num), num, target)
end
defp find(lst, q, sum, target) when sum == target do
case {lst, Okasaki.Queue.empty?(q)} do
{[], true} ->
nil
{[num | rst], true} ->
find(rst, Okasaki.Queue.insert(q, num), num, target)
{_, false} ->
:ok
end
end
defp find([num | rest], q, sum, target) when sum < target,
do: find(rest, Okasaki.Queue.insert(q, num), sum + num, target)
defp find(lst, q, sum, target) when sum > target do
{:ok, {prev, q}} = Okasaki.Queue.remove(q)
find(lst, q, sum - prev, target)
end
defp find([], _, _, _), do: nil
end
2. Using Nx
This was mostly written for fun because I haven’t had the opportunity to use Nx much yet.
It is not actually fast. I think this is because of two reasons:
- The overhead of JIT is visible if your input collections are small.
- The problem is not SIMD-friendly nor easily paralellizable, so loop unrolling/branch predicting etc. will not help.
defmodule QqwyNx do
import Nx.Defn
@moduledoc """
A fun implementation using Nx tensors.
Not actually fast, since the problem is not SIMD-friendly and JIT-compilation
has a significant overhead when used on small input collections.
"""
def valid?(list, target) do
found =
do_valid?(target, Nx.tensor([0 | list], backend: EXLA.Backend))
|> Nx.to_number
found == 1
end
defn do_valid?(target, tensor) do
sums = Nx.cumulative_sum(tensor)
{_, _, _, _, found} =
while {target, sums, lower_index = 0, higher_index = 1, found = 0},
(not found) and higher_index < Nx.flat_size(sums)
do
h = sums[higher_index]
l = sums[lower_index]
cond do
h - l == target and lower_index < higher_index ->
# Answer found: short-circuit
found = 1
{target, sums, lower_index, higher_index, found}
h - l <= target ->
# Below target: add next element to window
higher_index = higher_index + 1
{target, sums, lower_index, higher_index, found}
h - l >= target ->
# Above target: remove oldest element from window
lower_index = lower_index + 1
{target, sums, lower_index, higher_index, found}
true ->
# Unreachable but Nx compiler requires it
{target, sums, lower_index, higher_index, found}
end
end
found
end
end
3. Using lists properly
After tinkering a little, I realized that there is a much simpler solution! We only need to walk over the input list once, and can build up the window implicitly while doing this.
The trick is that we pass the same list twice to the recursive function. Inside the recursion, one of the two parameters will become the ‘tail’ of the list used by the other argument. Because of how immutable, persistent datastructures work, the same underlying memory is used without any copying so this is fast and memory-efficient.
And because we are only looking at the head of the list, access is O(1).
This is my preferred solution, and I think it is reasonably readable.
EDIT: The new solution by @wanton7 implements the same idea. I believe both of our solutions compile down to virtually identical bytecode. 
defmodule Qqwy do
def valid?(list, target) do
do_valid?(target, list, list, 0, 0)
end
defp do_valid?(target, _, _, current, length) when current == target and length > 0, do: true
defp do_valid?(_, [], _, _, _), do: false
defp do_valid?(_, _, [], _, _), do: false
defp do_valid?(target, [h | hs], [l | ls], current, length) do
cond do
current <= target ->
do_valid?(target, hs, [l | ls], current + h, length + 1)
current >= target ->
do_valid?(target, [h | hs], ls, current - l, length - 1)
end
end
end
Benchmarking results:
➜ elixir benchmarks.exs
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.15.0
Erlang 25.0
Benchmark suite executing with the following configuration:
warmup: 500 ms
time: 1 s
memory time: 1 s
reduction time: 1 s
parallel: 1
inputs: 100, 10_000, 1_000_000
Estimated total run time: 2.27 min
Benchmarking 100phlecs with input 100 ...
Benchmarking 100phlecs with input 10_000 ...
Benchmarking 100phlecs with input 1_000_000 ...
Benchmarking eksperimental with input 100 ...
Benchmarking eksperimental with input 10_000 ...
Benchmarking eksperimental with input 1_000_000 ...
Benchmarking nikiiv with input 100 ...
Benchmarking nikiiv with input 10_000 ...
Benchmarking nikiiv with input 1_000_000 ...
Benchmarking nikiiv2 with input 100 ...
Benchmarking nikiiv2 with input 10_000 ...
Benchmarking nikiiv2 with input 1_000_000 ...
Benchmarking qqwy with input 100 ...
Benchmarking qqwy with input 10_000 ...
Benchmarking qqwy with input 1_000_000 ...
Benchmarking qqwy_nx with input 100 ...
14:33:07.171 [info] TfrtCpuClient created.
Benchmarking qqwy_nx with input 10_000 ...
Benchmarking qqwy_nx with input 1_000_000 ...
Benchmarking qqwy_okasaki_amortized with input 100 ...
Benchmarking qqwy_okasaki_amortized with input 10_000 ...
Benchmarking qqwy_okasaki_amortized with input 1_000_000 ...
Benchmarking qqwy_okasaki_constant with input 100 ...
Benchmarking qqwy_okasaki_constant with input 10_000 ...
Benchmarking qqwy_okasaki_constant with input 1_000_000 ...
Benchmarking qqwy_okasaki_erlang with input 100 ...
Benchmarking qqwy_okasaki_erlang with input 10_000 ...
Benchmarking qqwy_okasaki_erlang with input 1_000_000 ...
Benchmarking stefanluptak with input 100 ...
Benchmarking stefanluptak with input 10_000 ...
Benchmarking stefanluptak with input 1_000_000 ...
Benchmarking tomekowal with input 100 ...
Benchmarking tomekowal with input 10_000 ...
Benchmarking tomekowal with input 1_000_000 ...
Benchmarking wanton7 with input 100 ...
Benchmarking wanton7 with input 10_000 ...
Benchmarking wanton7 with input 1_000_000 ...
Benchmarking wanton72 with input 100 ...
Benchmarking wanton72 with input 10_000 ...
Benchmarking wanton72 with input 1_000_000 ...
##### With input 100 #####
Name ips average deviation median 99th %
qqwy 4272.33 K 0.23 μs ±52.28% 0.25 μs 0.38 μs
wanton72 4209.73 K 0.24 μs ±39.10% 0.25 μs 0.38 μs
100phlecs 1413.34 K 0.71 μs ±1352.57% 0.58 μs 0.92 μs
wanton7 756.03 K 1.32 μs ±897.87% 1.21 μs 1.46 μs
nikiiv 750.64 K 1.33 μs ±563.77% 1.25 μs 1.58 μs
tomekowal 734.15 K 1.36 μs ±497.97% 1.25 μs 1.54 μs
qqwy_okasaki_erlang 345.05 K 2.90 μs ±352.25% 2.67 μs 4.33 μs
qqwy_okasaki_amortized 336.25 K 2.97 μs ±250.51% 2.88 μs 3.21 μs
nikiiv2 295.09 K 3.39 μs ±67.50% 3.29 μs 3.83 μs
qqwy_okasaki_constant 270.46 K 3.70 μs ±319.20% 3.42 μs 4.92 μs
eksperimental 113.38 K 8.82 μs ±154.92% 8.17 μs 40 μs
stefanluptak 1.48 K 677.72 μs ±16.20% 667.33 μs 818.48 μs
qqwy_nx 0.0477 K 20956.23 μs ±5.44% 20558.48 μs 25253.79 μs
Comparison:
qqwy 4272.33 K
wanton72 4209.73 K - 1.01x slower +0.00348 μs
100phlecs 1413.34 K - 3.02x slower +0.47 μs
wanton7 756.03 K - 5.65x slower +1.09 μs
nikiiv 750.64 K - 5.69x slower +1.10 μs
tomekowal 734.15 K - 5.82x slower +1.13 μs
qqwy_okasaki_erlang 345.05 K - 12.38x slower +2.66 μs
qqwy_okasaki_amortized 336.25 K - 12.71x slower +2.74 μs
nikiiv2 295.09 K - 14.48x slower +3.15 μs
qqwy_okasaki_constant 270.46 K - 15.80x slower +3.46 μs
eksperimental 113.38 K - 37.68x slower +8.59 μs
stefanluptak 1.48 K - 2895.44x slower +677.49 μs
qqwy_nx 0.0477 K - 89531.96x slower +20956.00 μs
Memory usage statistics:
Name average deviation median 99th %
qqwy 0 KB ±0.00% 0 KB 0 KB
wanton72 0 KB ±0.00% 0 KB 0 KB
100phlecs 5.20 KB ±0.00% 5.20 KB 5.20 KB
wanton7 1.16 KB ±0.00% 1.16 KB 1.16 KB
nikiiv 13.52 KB ±0.00% 13.52 KB 13.52 KB
tomekowal 13.52 KB ±0.00% 13.52 KB 13.52 KB
qqwy_okasaki_erlang 9.34 KB ±0.00% 9.34 KB 9.34 KB
qqwy_okasaki_amortized 9.62 KB ±0.00% 9.62 KB 9.62 KB
nikiiv2 0 KB ±0.00% 0 KB 0 KB
qqwy_okasaki_constant 11.13 KB ±0.00% 11.13 KB 11.13 KB
eksperimental 57.15 KB ±0.00% 57.15 KB 57.15 KB
stefanluptak 387.20 KB ±0.00% 387.20 KB 387.20 KB
qqwy_nx 12714.51 KB ±0.02% 12715.55 KB 12719.06 KB
Comparison:
qqwy 0 KB
wanton72 0 KB - 1.00x memory usage +0 KB
100phlecs 5.20 KB - ∞ x memory usage +5.20 KB
wanton7 1.16 KB - ∞ x memory usage +1.16 KB
nikiiv 13.52 KB - ∞ x memory usage +13.52 KB
tomekowal 13.52 KB - ∞ x memory usage +13.52 KB
qqwy_okasaki_erlang 9.34 KB - ∞ x memory usage +9.34 KB
qqwy_okasaki_amortized 9.62 KB - ∞ x memory usage +9.62 KB
nikiiv2 0 KB - 1.00x memory usage +0 KB
qqwy_okasaki_constant 11.13 KB - ∞ x memory usage +11.13 KB
eksperimental 57.15 KB - ∞ x memory usage +57.15 KB
stefanluptak 387.20 KB - ∞ x memory usage +387.20 KB
qqwy_nx 12714.51 KB - ∞ x memory usage +12714.51 KB
Reduction count statistics:
Name average deviation median 99th %
qqwy 77 ±0.00% 77 77
wanton72 76 ±0.00% 76 76
100phlecs 475 ±0.00% 475 475
wanton7 896 ±0.00% 896 896
nikiiv 324 ±0.00% 324 324
tomekowal 325 ±0.00% 325 325
qqwy_okasaki_erlang 1414 ±0.00% 1414 1414
qqwy_okasaki_amortized 1173 ±0.00% 1173 1173
nikiiv2 439 ±0.00% 439 439
qqwy_okasaki_constant 1139 ±0.00% 1139 1139
eksperimental 5109 ±0.00% 5109 5109
stefanluptak 121249 ±0.00% 121249 121249
qqwy_nx 811508.05 ±0.13% 811185 815392
Comparison:
qqwy 77
wanton72 76 - 0.99x reduction count -1
100phlecs 475 - 6.17x reduction count +398
wanton7 896 - 11.64x reduction count +819
nikiiv 324 - 4.21x reduction count +247
tomekowal 325 - 4.22x reduction count +248
qqwy_okasaki_erlang 1414 - 18.36x reduction count +1337
qqwy_okasaki_amortized 1173 - 15.23x reduction count +1096
nikiiv2 439 - 5.70x reduction count +362
qqwy_okasaki_constant 1139 - 14.79x reduction count +1062
eksperimental 5109 - 66.35x reduction count +5032
stefanluptak 121249 - 1574.66x reduction count +121172
qqwy_nx 811508.05 - 10539.07x reduction count +811431.05
##### With input 10_000 #####
Name ips average deviation median 99th %
wanton72 5403.52 K 0.185 μs ±38.37% 0.167 μs 0.29 μs
qqwy 5305.83 K 0.188 μs ±84.23% 0.167 μs 0.29 μs
100phlecs 2157.44 K 0.46 μs ±1217.80% 0.42 μs 0.58 μs
wanton7 958.76 K 1.04 μs ±1150.24% 0.92 μs 1.13 μs
nikiiv 929.89 K 1.08 μs ±668.00% 1 μs 1.29 μs
tomekowal 678.60 K 1.47 μs ±599.21% 1.04 μs 5.21 μs
qqwy_okasaki_erlang 471.32 K 2.12 μs ±384.63% 2 μs 2.33 μs
qqwy_okasaki_amortized 421.06 K 2.37 μs ±442.91% 2.29 μs 2.63 μs
qqwy_okasaki_constant 330.69 K 3.02 μs ±502.10% 2.75 μs 3.83 μs
eksperimental 134.32 K 7.44 μs ±234.34% 6.67 μs 38.67 μs
nikiiv2 26.69 K 37.46 μs ±61.26% 39 μs 51.71 μs
stefanluptak 0.26 K 3853.09 μs ±6.45% 3823.07 μs 4040.31 μs
qqwy_nx 0.0564 K 17733.98 μs ±5.05% 17346.58 μs 21416.13 μs
Comparison:
wanton72 5403.52 K
qqwy 5305.83 K - 1.02x slower +0.00341 μs
100phlecs 2157.44 K - 2.50x slower +0.28 μs
wanton7 958.76 K - 5.64x slower +0.86 μs
nikiiv 929.89 K - 5.81x slower +0.89 μs
tomekowal 678.60 K - 7.96x slower +1.29 μs
qqwy_okasaki_erlang 471.32 K - 11.46x slower +1.94 μs
qqwy_okasaki_amortized 421.06 K - 12.83x slower +2.19 μs
qqwy_okasaki_constant 330.69 K - 16.34x slower +2.84 μs
eksperimental 134.32 K - 40.23x slower +7.26 μs
nikiiv2 26.69 K - 202.42x slower +37.28 μs
stefanluptak 0.26 K - 20820.23x slower +3852.90 μs
qqwy_nx 0.0564 K - 95825.89x slower +17733.80 μs
Memory usage statistics:
Name average deviation median 99th %
wanton72 0 KB ±0.00% 0 KB 0 KB
qqwy 0 KB ±0.00% 0 KB 0 KB
100phlecs 3.96 KB ±0.00% 3.96 KB 3.96 KB
wanton7 0.92 KB ±0.00% 0.92 KB 0.92 KB
nikiiv 11.02 KB ±0.00% 11.02 KB 11.02 KB
tomekowal 11.02 KB ±0.00% 11.02 KB 11.02 KB
qqwy_okasaki_erlang 7.24 KB ±0.00% 7.24 KB 7.24 KB
qqwy_okasaki_amortized 7.48 KB ±0.00% 7.48 KB 7.48 KB
qqwy_okasaki_constant 9.13 KB ±0.00% 9.13 KB 9.13 KB
eksperimental 46.35 KB ±0.00% 46.35 KB 46.35 KB
nikiiv2 0 KB ±0.00% 0 KB 0 KB
stefanluptak 2967.06 KB ±0.00% 2967.06 KB 2967.06 KB
qqwy_nx 11293.81 KB ±0.01% 11293.92 KB 11296.37 KB
Comparison:
wanton72 0 KB
qqwy 0 KB - 1.00x memory usage +0 KB
100phlecs 3.96 KB - ∞ x memory usage +3.96 KB
wanton7 0.92 KB - ∞ x memory usage +0.92 KB
nikiiv 11.02 KB - ∞ x memory usage +11.02 KB
tomekowal 11.02 KB - ∞ x memory usage +11.02 KB
qqwy_okasaki_erlang 7.24 KB - ∞ x memory usage +7.24 KB
qqwy_okasaki_amortized 7.48 KB - ∞ x memory usage +7.48 KB
qqwy_okasaki_constant 9.13 KB - ∞ x memory usage +9.13 KB
eksperimental 46.35 KB - ∞ x memory usage +46.35 KB
nikiiv2 0 KB - 1.00x memory usage +0 KB
stefanluptak 2967.06 KB - ∞ x memory usage +2967.06 KB
qqwy_nx 11293.81 KB - ∞ x memory usage +11293.81 KB
Reduction count statistics:
Name average deviation median 99th %
wanton72 61 ±0.00% 61 61
qqwy 62 ±0.00% 62 62
100phlecs 219 ±0.00% 219 219
wanton7 706 ±0.00% 706 706
nikiiv 101 ±0.00% 101 101
tomekowal 102 ±0.00% 102 102
qqwy_okasaki_erlang 814 ±0.00% 814 814
qqwy_okasaki_amortized 620 ±0.00% 620 620
qqwy_okasaki_constant 772 ±0.00% 772 772
eksperimental 3814 ±0.00% 3814 3814
nikiiv2 3687 ±0.00% 3687 3687
stefanluptak 923069 ±0.00% 923069 923069
qqwy_nx 825682.49 ±0.19% 825310 830201
Comparison:
wanton72 61
qqwy 62 - 1.02x reduction count +1
100phlecs 219 - 3.59x reduction count +158
wanton7 706 - 11.57x reduction count +645
nikiiv 101 - 1.66x reduction count +40
tomekowal 102 - 1.67x reduction count +41
qqwy_okasaki_erlang 814 - 13.34x reduction count +753
qqwy_okasaki_amortized 620 - 10.16x reduction count +559
qqwy_okasaki_constant 772 - 12.66x reduction count +711
eksperimental 3814 - 62.52x reduction count +3753
nikiiv2 3687 - 60.44x reduction count +3626
stefanluptak 923069 - 15132.28x reduction count +923008
qqwy_nx 825682.49 - 13535.78x reduction count +825621.49
##### With input 1_000_000 #####
Name ips average deviation median 99th %
wanton72 13.49 K 74.14 μs ±2.81% 74.13 μs 79.95 μs
qqwy 12.67 K 78.90 μs ±4.57% 79.33 μs 83.04 μs
tomekowal 6.51 K 153.59 μs ±37.01% 149.58 μs 224.16 μs
nikiiv 6.51 K 153.62 μs ±36.85% 152.21 μs 177.45 μs
100phlecs 5.59 K 178.82 μs ±33.56% 177.46 μs 198.25 μs
wanton7 4.35 K 229.73 μs ±30.08% 224.58 μs 310.48 μs
eksperimental 1.73 K 577.66 μs ±35.37% 497 μs 1121.22 μs
qqwy_okasaki_amortized 1.25 K 801.76 μs ±17.34% 792.71 μs 907.51 μs
qqwy_okasaki_constant 1.11 K 900.52 μs ±16.05% 889.00 μs 1001.37 μs
qqwy_okasaki_erlang 0.89 K 1120.28 μs ±23.66% 1078.62 μs 1518.26 μs
nikiiv2 0.21 K 4661.71 μs ±16.51% 4206.83 μs 7317.24 μs
qqwy_nx 0.00015 K 6882592.79 μs ±0.00% 6882592.79 μs 6882592.79 μs
stefanluptak 0.00000 K336956857.20 μs ±0.00%336956857.20 μs336956857.20 μs
Comparison:
wanton72 13.49 K
qqwy 12.67 K - 1.06x slower +4.76 μs
tomekowal 6.51 K - 2.07x slower +79.44 μs
nikiiv 6.51 K - 2.07x slower +79.48 μs
100phlecs 5.59 K - 2.41x slower +104.68 μs
wanton7 4.35 K - 3.10x slower +155.59 μs
eksperimental 1.73 K - 7.79x slower +503.52 μs
qqwy_okasaki_amortized 1.25 K - 10.81x slower +727.62 μs
qqwy_okasaki_constant 1.11 K - 12.15x slower +826.38 μs
qqwy_okasaki_erlang 0.89 K - 15.11x slower +1046.14 μs
nikiiv2 0.21 K - 62.88x slower +4587.57 μs
qqwy_nx 0.00015 K - 92831.87x slower +6882518.65 μs
stefanluptak 0.00000 K - 4544847.56x slower +336956783.06 μs
Memory usage statistics:
Name Memory usage
wanton72 0 MB
qqwy 0 MB - 1.00x memory usage +0 MB
tomekowal 0.40 MB - ∞ x memory usage +0.40 MB
nikiiv 0.40 MB - ∞ x memory usage +0.40 MB
100phlecs 1.18 MB - ∞ x memory usage +1.18 MB
wanton7 0.30 MB - ∞ x memory usage +0.30 MB
eksperimental 2.21 MB - ∞ x memory usage +2.21 MB
qqwy_okasaki_amortized 2.54 MB - ∞ x memory usage +2.54 MB
qqwy_okasaki_constant 2.88 MB - ∞ x memory usage +2.88 MB
qqwy_okasaki_erlang 2.37 MB - ∞ x memory usage +2.37 MB
nikiiv2 0 MB - 1.00x memory usage +0 MB
qqwy_nx 3235.34 MB - ∞ x memory usage +3235.34 MB
stefanluptak 148456.41 MB - ∞ x memory usage +148456.41 MB
**All measurements for memory usage were the same**
Reduction count statistics:
Name average deviation median 99th %
wanton72 19.56 K ±0.00% 19.56 K 19.56 K
qqwy 19.56 K ±0.00% 19.56 K 19.56 K
tomekowal 29.34 K ±0.00% 29.34 K 29.34 K
nikiiv 29.34 K ±0.00% 29.34 K 29.34 K
100phlecs 79.94 K ±0.32% 79.97 K 80.73 K
wanton7 163.27 K ±0.00% 163.27 K 163.27 K
eksperimental 239.54 K ±0.21% 239.65 K 240.88 K
qqwy_okasaki_amortized 218.31 K ±0.10% 218.28 K 218.95 K
qqwy_okasaki_constant 272.71 K ±0.16% 272.56 K 274.43 K
qqwy_okasaki_erlang 265.57 K ±0.08% 265.45 K 266.45 K
nikiiv2 99.64 K ±0.88% 99.53 K 106.20 K
qqwy_nx 213890.61 K ±0.00% 213890.61 K 213890.61 K
stefanluptak 49206910.82 K ±0.00% 49206910.82 K 49206910.82 K
Comparison:
wanton72 19.56 K
qqwy 19.56 K - 1.00x reduction count +0.00100 K
tomekowal 29.34 K - 1.50x reduction count +9.78 K
nikiiv 29.34 K - 1.50x reduction count +9.78 K
100phlecs 79.94 K - 4.09x reduction count +60.38 K
wanton7 163.27 K - 8.35x reduction count +143.72 K
eksperimental 239.54 K - 12.25x reduction count +219.98 K
qqwy_okasaki_amortized 218.31 K - 11.16x reduction count +198.75 K
qqwy_okasaki_constant 272.71 K - 13.94x reduction count +253.15 K
qqwy_okasaki_erlang 265.57 K - 13.58x reduction count +246.01 K
nikiiv2 99.64 K - 5.09x reduction count +80.08 K
qqwy_nx 213890.61 K - 10936.78x reduction count +213871.05 K
stefanluptak 49206910.82 K - 2516076.64x reduction count +49206891.26 K