Need help improving a large CubDB query

Hi,

I am running into a CubDB timeout and I'm not sure where to improve:

CubDB.select(ref, [min_key: {:tests, 1}, max_key: {:tests, 5}, min_key_inclusive: true, max_key_inclusive: true]) |> Enum.count()
5

CubDB.select(ref, [min_key: {:tests, 1}, max_key: {:tests, 100000}, min_key_inclusive: true, max_key_inclusive: true]) |> Enum.count()
100000

CubDB.select(ref, [min_key: {:tests, 1}, max_key: {:tests, 1000000}, min_key_inclusive: true, max_key_inclusive: true]) |> Enum.count()
<timeout>

The element’s format is simple, just {:tests, <number>}. I inserted 10 million elements, keyed 1..10_000_000, to test the query. Am I missing anything, or is there a better way to run queries over many elements with CubDB?

Thanks a lot!

Here’s what I found:

select … returns a lazy stream

Additionally:

The returned lazy stream can be filtered, mapped, and transformed with standard functions in the Stream and Enum modules. The actual database read is deferred to when the stream is iterated or evaluated

And:

If we want to obtain the sum of the first 10 positive numeric values associated to keys from :a to :f, we can do:

sum =
  CubDB.select(db,
    min_key: :a,
    max_key: :f
  )
  |> Stream.map(fn {_key, value} -> value end) # map values
  |> Stream.filter(fn n -> is_number(n) and n > 0 end) # only positive numbers
  |> Stream.take(10) # take only the first 10 entries in the range
  |> Enum.sum() # sum the selected values

Using functions from the Stream module for mapping and filtering ensures that we do not fetch unnecessary items from the database. In the example above, for example, after fetching the first 10 entries satisfying the filter condition, no further entry is fetched from the database.
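To see that laziness concretely, here is a small self-contained sketch. A plain range stands in for the CubDB select stream (no database involved), and an Erlang `:counters` ref tracks how many elements the source actually produces:

```elixir
# The counter records how many elements the "source" yields.
counter = :counters.new(1, [])

stream =
  Stream.map(1..1_000_000, fn n ->
    :counters.add(counter, 1, 1)
    {:tests, n}
  end)

# Building the stream performs no work yet:
IO.inspect(:counters.get(counter, 1), label: "before take")  # 0

# Taking 10 entries evaluates exactly 10 elements, not the whole range:
_first_ten = Enum.take(stream, 10)
IO.inspect(:counters.get(counter, 1), label: "after take")   # 10
```

The same principle applies to the real select stream: as long as something like `Stream.take/2` cuts the pipeline short, CubDB never walks the full key range. An `Enum.count/1` over the whole range, on the other hand, must iterate every entry.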

You might be able to increase the timeout. Also, why are you reading 1_000_000 entries as a test?

This might work, but why do it in the first place?

Task.async(fn ->
  CubDB.select(ref, min_key: {:tests, 1}, max_key: {:tests, 1_000_000}) 
  |> Enum.count()
end)
|> Task.await(:infinity)

Might be able to process in chunks?

CubDB.select(ref, min_key: {:tests, 1}, max_key: {:tests, 1_000_000})
|> Stream.chunk_every(10_000)
|> Enum.reduce(0, fn chunk, acc -> acc + length(chunk) end)
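If the goal is only a count, the intermediate chunk lists can be skipped entirely. A sketch, again with a plain stream standing in for the CubDB select stream:

```elixir
# Each chunk above is a 10_000-element list built in memory; a bare
# reduce counts entries one at a time without allocating any lists.
stream = Stream.map(1..1_000_000, &{:tests, &1})

count = Enum.reduce(stream, 0, fn _entry, acc -> acc + 1 end)
```

This avoids the chunking allocations, though the full range still has to be iterated either way, so it won't change the asymptotics of the read.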

Thanks @allen-munsch. To answer your question: I occasionally get DB read timeouts in prod. I wanted to do some stress testing to see where the problem is, hence the {:tests, <number>} keys.

In prod I also tried streaming and chunking, but the timeout persists. Of course increasing the timeout can help, but that is not a real solution. Now, if I simply run this:

Interactive Elixir (1.18.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> 1..1_000_000 |> Enum.map(&{:tests, &1}) |> Enum.count()
1000000

It returns very fast, but that is not the case for my original CubDB query above. So my naive stress test is just meant to find out whether I can improve anything on the CubDB side, assuming 1 million records must be returned even after filtering, mapping, etc.
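For reference, `:timer.tc` from the Erlang stdlib can put a number on that baseline; wrapping the CubDB select in the same way would show where the time actually goes:

```elixir
# :timer.tc returns {elapsed_microseconds, result} for the given function.
{micros, count} =
  :timer.tc(fn ->
    1..1_000_000 |> Enum.map(&{:tests, &1}) |> Enum.count()
  end)

IO.puts("counted #{count} entries in #{micros} µs")
```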

Thanks a lot!