MosaicDB - a hobby project that uses Nx (distributed SQL vectordb rag)

I wanted to stop by to say thank you

I was recently working through a hobby learning project for retrieval-augmented generation, with some bells and whistles (Elixir is not my first language).

It’s probably reinventing the wheel, but hey, wheels can be interesting:

And well, Nx made it so easy to do!

So, again, Thanks!

13 Likes

Really cool project! Kudos!

3 Likes

Looks very cool!

I’ve had an idea for a while now that you could get passable vector search with Nx using xor and popcnt on 1-bit quantized vectors. For those unaware, xor/popcnt are essentially 1-bit cosine distance.

Instead of building an index the idea is that you just brute-force search very quickly with native vectorized instructions, which I figure Nx should be able to do in batches without eating any significant Elixir speed penalty. Unlike vector indexes (e.g. HNSW) this approach has the advantage of being extremely simple and there is no need for external dependencies (just Nx) which I find appealing.
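For anyone who wants to play with the idea, here's a minimal sketch of that brute-force pass. It's in Python/NumPy rather than Nx just to keep it self-contained, and the function names are made up; the Nx version would be the same shape: quantize to sign bits, then scan with xor + popcount.

```python
import numpy as np

def quantize_1bit(vectors):
    """Quantize float vectors to 1 bit per dimension (the sign), bit-packed into uint8."""
    return np.packbits(vectors > 0, axis=-1)

def hamming_search(query_bits, db_bits, k=5):
    """Brute-force nearest neighbors by Hamming distance (xor + popcount)."""
    xored = np.bitwise_xor(db_bits, query_bits)          # (n, d/8) uint8
    dists = np.unpackbits(xored, axis=-1).sum(axis=-1)   # popcount per row
    return np.argsort(dists)[:k]

# Tiny demo on random vectors: plant a near-duplicate of row 42 as the query
rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64)).astype(np.float32)
query = db[42] + 0.01 * rng.standard_normal(64)

db_bits = quantize_1bit(db)
q_bits = quantize_1bit(query[None, :])
print(hamming_search(q_bits, db_bits, k=3))  # row 42 should rank first
```

No index, no dependencies beyond the tensor library; the whole "search engine" is two vectorized operations over the dataset.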

I don’t know all that much about vector search and I haven’t kept up with the state of things for a few months (and I know things are changing fast). Do you have any thoughts on this?

1 Like

I like the idea. I haven’t personally tried it out; I’ve tried some of the other configurable distance types via Redis vector search, but didn’t see much of a difference in top-k results for the data types I’ve been working with.

Were you thinking just a FLAT index, with xor/popcnt?

I believe there is precedent? Somewhere in the features in qdrant maybe?: qdrant/lib/quantization/src/encoded_vectors_binary.rs at e293ab1f08d3eb8fb8d9c3380c1de71b7a76570c · qdrant/qdrant · GitHub

Something like that?

I need to do some more reading on it.

1 Like

I see Redis has a “FLAT” index, and I think it’s what I’m talking about but it’s hard to tell because their docs don’t explain what it does at all (unfortunate). What’s weird about that name is a) the brute-force approach is kinda not an index by definition and b) there is a type of vector index called IVFFLAT but I guess this is just a case of unfortunate naming?

But yeah, my understanding is that up to a certain dataset size xor/popcnt on 1-bit quantized vectors as a first pass is actually surprisingly performant. Then you can do reranking either with cosine similarity or, apparently, LLM reranking is pretty good. The benefit of this approach is simplicity; HNSW and co are extremely complicated and expensive to update incrementally. Keeping everything in RAM helps but then you have to keep everything in RAM.
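That two-stage shape (cheap 1-bit pass over everything, exact cosine rerank over the survivors) can be sketched in a few lines. Hypothetical names, Python/NumPy for brevity; an Nx version would look much the same:

```python
import numpy as np

def two_stage_search(query, db, db_bits, prefilter=50, k=5):
    # Stage 1: cheap xor/popcnt scan over the 1-bit sign vectors of the whole dataset
    q_bits = np.packbits(query > 0)
    hamming = np.unpackbits(np.bitwise_xor(db_bits, q_bits), axis=-1).sum(axis=-1)
    candidates = np.argsort(hamming)[:prefilter]
    # Stage 2: exact cosine similarity, but only on the surviving candidates
    cand = db[candidates]
    sims = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    return candidates[np.argsort(-sims)[:k]]

# Demo: plant a noisy copy of row 7 as the query
rng = np.random.default_rng(1)
db = rng.standard_normal((5000, 128)).astype(np.float32)
db_bits = np.packbits(db > 0, axis=-1)
query = db[7] + 0.05 * rng.standard_normal(128)
top = two_stage_search(query, db, db_bits)
```

The prefilter size is the knob: larger means stage 2 does more exact work but the quantization error matters less.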

Keep in mind my knowledge is limited and probably like 6mo out of date here so I have no idea what the SOTA is with on-disk vector indexes right now. I think there has been progress.

But as far as “shortest path to a passable pure(ish)-Elixir vector search” this seemed to be the way to go. Of course I have yet to pursue this.

1 Like

I did some thinking about this in the past week.

In the spirit of Riak: rings/vnodes.

Basically the idea I was looking at was to “quantize” a vector slice (could be a sampling?) into a directory structure.

think: mkdir -p 000/001/071 for vec = [0.0, 0.008, 0.71]

This creates a 3-digit coarse quantization bin per slice of the vector.

Combined into:

  • 000/001/071/blah.db
  • 000/001/071/1-bit-quant.id.blah.bin

It’d help with locality/neighborhood search and binning.

And then at the leaf there could be the SQLite DBs and 1-bit quant bins.

vector → coarse bucket path + bitpacked sign vector

So speed-wise I think 1-bit quantization could be done; I was just thinking in an architectural sense about how to translate vectors onto a POSIX-compliant file system in a distributed way, and exploring that stuff.

# assumes slice values are normalized to [0.0, 1.0]
bin_idx = trunc(Enum.sum(slice) / length(slice) * (config.bins - 1))
String.pad_leading(Integer.to_string(bin_idx), 3, "0")
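In Python (since the experiment is in Python anyway), the full vector-to-path derivation might look like this. The slice count and 1000-bins-per-slice are assumptions on my part, and it assumes entries normalized to [0, 1], so the exact paths won't match the mkdir example above:

```python
import os

def coarse_path(vec, slices=3, bins=1000):
    """Map a vector to a directory path by binning the mean of each slice.
    Assumes vector entries are normalized to [0.0, 1.0]."""
    n = len(vec)
    parts = []
    for i in range(slices):
        slice_ = vec[i * n // slices:(i + 1) * n // slices]
        mean = sum(slice_) / len(slice_)
        bin_idx = min(int(mean * (bins - 1)), bins - 1)  # clamp in case of mean == 1.0+
        parts.append(f"{bin_idx:03d}")
    return os.path.join(*parts)

print(coarse_path([0.0, 0.008, 0.71]))  # "000/007/709" on POSIX
```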

I added a sketch here:

1 Like

The index you are building here with this directory trick is one-dimensional; you’re storing the vectors in a 1D btree (offloaded to the filesystem). This is not going to work because the vectors are not one-dimensional; they are N-dimensional (N being the vector size). It’s possible to build trees of higher dimensions, e.g. a quadtree, but this approach does not scale to higher dimensions (“curse of dimensionality”).

Vector indexes like IVFFLAT (which remember is apparently not the Redis FLAT) cluster the vectors, e.g. with kmeans. This is an approximate method, which is apparently the best we can do with an index (that’s the curse again). Unfortunately IVFFLAT can’t be incrementalized (it has to be rebuilt for each addition), which is kinda bad for databases, so that’s where stuff like HNSW comes in.

HNSW is a lot more complicated and I don’t have a deep understanding (tbh I don’t really even remember my shallow understanding), but it’s easier to incrementalize. It’s structured with a lot of random memory access though, so not great for on-disk storage. There are variants designed for disk (SSD), but I don’t know how they work. I assume it’s analogous to a binary tree vs. a btree.

Quantization is just reducing the (bit) size of the vectors’ entries. You are making them smaller by giving up precision. A 1-bit quantization is very extreme, but allows you to abuse really fast vectorized instructions to compare the vectors bitwise (xor/popcnt).

So the 1-bit quantization approach isn’t really an index per se. You are actually doing an exact nearest-neighbor search (not approximate) but on the quantized vectors, which turns out to be a decent approximate search on the real vectors.

This is about the extent of my knowledge as I’m not a practitioner. I did a bunch of research earlier this year as I wanted to be prepared to add vector indexes to my database in the future. In hindsight this was probably a waste of time because by the time I get around to it the SOTA is gonna be completely different lol.

Actually, I know @jstimps was interested in this topic as well so I wonder if he’s kept up with developments.

1 Like

To clarify what I said above, if you perform kmeans and then store the bins on the filesystem as you describe, you have essentially invented IVFFLAT (as I understand it). But you still have to brute-force search the bins. You can’t use the filesystem itself as an index, as in “give me all vectors starting with 001”, because that is a one-dimensional search.
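To make that concrete, here's a toy IVF-style sketch in Python/NumPy (my reading of the idea, not any particular implementation). The centroids stand in for kmeans output; searching means picking the nearest bucket(s) and brute-forcing inside them, never asking the structure for a 1D prefix:

```python
import numpy as np

def build_ivf(db, centroids):
    """Assign every vector to its nearest centroid: these are the coarse 'bins'."""
    d = np.linalg.norm(db[:, None, :] - centroids[None, :, :], axis=-1)
    assign = d.argmin(axis=1)
    return {c: np.flatnonzero(assign == c) for c in range(len(centroids))}

def ivf_search(query, db, centroids, buckets, nprobe=1, k=3):
    """Probe the nprobe nearest buckets, then brute-force only inside them."""
    nearest = np.linalg.norm(centroids - query, axis=1).argsort()[:nprobe]
    ids = np.concatenate([buckets[c] for c in nearest])
    dists = np.linalg.norm(db[ids] - query, axis=1)
    return ids[dists.argsort()[:k]]

# Demo: two well-separated clusters standing in for kmeans output
rng = np.random.default_rng(2)
a = rng.standard_normal((100, 8)) + 10.0
b = rng.standard_normal((100, 8)) - 10.0
db = np.vstack([a, b])
centroids = np.array([a.mean(axis=0), b.mean(axis=0)])  # kmeans would find roughly these
buckets = build_ivf(db, centroids)
hits = ivf_search(db[3] + 0.01, db, centroids, buckets)
```

The approximation error lives entirely in the first step: if the true neighbor sits in a bucket you didn't probe, you miss it, which is why nprobe is the recall knob.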

As a side note, down the road this is the sort of thing I hope people will reach for my database/library Hobbes for (once it’s ready). It will allow you to store a large amount of ordered KV data, similar to what you’re doing with the filesystem, except with sharding, replication, and strongly consistent transactions :slight_smile:

1 Like

Not sure if this will be helpful, but I can share my experience so far.

I’m still on this journey, more of a side quest. I have a working implementation of HNSW in Erlang, using FDB for storage and Nx for vector operations. Benchmarked against the IRISA SIFT datasets, recall is good, but predictably performance is quite poor. Having to do hundreds of network round trips for graph traversal is a death knell. I knew this going in, but I wanted to see it with my own eyes. If only there were an embedded data store that I could drop in as a replacement… :smiley:

I’m not up to date on the SOTA, but I am interested in continuing my side quest by studying SPANN + SPFresh. I’m sympathetic to the idea that the vector store changes dynamically over time rather than being a static graph computed by some periodic job, as is typical elsewhere.

I also believe that the need for optimizing compute over a graph will only increase, both in hardware accelerated compute and also in database query execution, and perhaps even some combination of the two someday.

2 Likes

Hah, I know you’re joking but architecturally Hobbes is FDB. The performance characteristics will be exactly the same. You could run it in-memory on a single node but then it would just be a slower :ets with concurrency control and I’m not sure that’s very useful on its own.

I respect that you actually tried with HNSW. I was very confused until I realized all of the graph DB companies are literally just holding the entire thing in RAM. That might scale to terabytes but you’re certainly not getting to petabytes, especially now that DRAM prices have randomly doubled lol. So it’s a no-go for web scale search (i.e. a literal web search engine) until they figure out how to get it onto disk.

As a digression, I have also been thinking about whether I can dodge vector search altogether. My interest was always in small-scale search for consumer apps (e.g. a note-taking app or a chat app, not web scale). I have been critical of AI stuff in some contexts, but for search this stuff is like actual magic. What used to take a research team and millions of dollars is now just straight up trivial, and that is 100% a good thing (and will probably end up slaying some monopolies).

But for search, I’ve been thinking: what if, instead of vector search, you just use a traditional keyword search engine (trivial to build), ask a stupid LLM to rewrite the query 20 times, and then rank the results?

I have no evidence but I am like 90% sure this would totally work, and I know exactly how to build a keyword search engine (it’s literally spelled out in the record layer paper), so I find this approach even more enticing than the vector stuff.

Vectors are still very interesting, though.

1 Like

Is that like this one? https://www.tensorflow.org/datasets/catalog/sift1m

Is this a nod to something in particular? Oh, a nod to Hobbes?

Have you looked at StreamingDiskANN? Any thoughts on the comparisons?

Honestly, I am interested in this piece. I would love to see a locally runnable semantic + keyword engine that is easy to run for the everyday person.

This is interesting, like just throw out the explicit semantic part and use the implicit semantics inside the model? I also have no evidence, but I would second a motion on your hypothesis.


In regards to the half baked idea (filesystem as vector store) I was thinking about from a few weeks ago, I got around to running an experiment.

I was trying to come up with some fancy name. I dunno, it’s just a sketch:

HNQT: Hierarchical Navigable Quantization Trees (maybe there’s prior work, I’m just not that studied)

@garrison I even had a look at TigerBeetle to figure out how I could add some learnings.
@jstimps I also think there might be something to some dynamism in the graph; I messed around with teleporting and some beam search.

I have the experiment here:

@garrison what do you do with TigerBeetle’s batch window? Like sure, it might be only 5ms, but that’s 5ms of data that could go missing? I was thinking of doing an explicit wait for inserts, or fire-and-forget for less important stuff? (edit: I mean the fsync flush window for batching, going from hot, warm, cold)

(edit2: also apologies, I know it’s in Python, but I’m just trying to work out the idea first)

Here are the datasets I was referring to.

My comment about embedded key value stores was made as a reference to some recent new projects announced on the forum, including Hobbes and Bedrock. Both are projects I’m following closely.

Regarding your new approach, for my own use cases for ANN, I wouldn’t be able to accept the heavy compromise on recall. But perhaps there are other use cases where that’s ok.

2 Likes

I did skim these papers after this discussion, and I was surprised to find that DiskANN/SPANN/SPFresh are all attempts to incrementalize IVF-style indexes rather than to batch HNSW-style indexes as I had imagined.

I definitely need to read the paper more carefully but SPFresh actually seemed quite straightforward. If you’re familiar with autosharding systems (e.g. FDB) it’s literally the same idea but with buckets of vectors. Split and merge.

I had just taken for granted what others had said, that you can’t incrementalize IVF, but clearly you actually can lol. SPFresh would work just fine on an FDB-style database, and Turbopuffer seems to have proven that it works at scale. In that context betting on HNSW seems insane. I wonder what the other providers are up to.
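For intuition, the split side of that split-and-merge idea can be sketched in a few lines. Toy Python, split-on-overflow only, and definitely not the actual SPFresh protocol:

```python
import numpy as np

MAX_BUCKET = 8  # hypothetical split threshold

def insert(vec, centroids, buckets):
    """Append vec to its nearest bucket, splitting the bucket when it overflows.
    A toy sketch of incrementalized IVF, not the real SPFresh protocol."""
    c = int(np.linalg.norm(np.array(centroids) - vec, axis=1).argmin())
    buckets[c].append(vec)
    if len(buckets[c]) > MAX_BUCKET:
        pts = np.array(buckets[c])
        # split along the axis of largest spread (a stand-in for a local 2-means)
        axis = int(pts.var(axis=0).argmax())
        order = pts[:, axis].argsort()
        lo, hi = pts[order[: len(pts) // 2]], pts[order[len(pts) // 2:]]
        centroids[c], buckets[c] = lo.mean(axis=0), list(lo)
        centroids.append(hi.mean(axis=0))
        buckets.append(list(hi))

# Demo: start from a single bucket and stream points in; splits keep buckets bounded
rng = np.random.default_rng(3)
centroids, buckets = [np.zeros(4)], [[]]
for _ in range(50):
    insert(rng.standard_normal(4), centroids, buckets)
```

The database-shaped part is that each insert touches one bucket locally, which is exactly the autosharding split/merge picture.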

1 Like

What I was describing is essentially just a simple version of a “deep research” agent. This approach has also seen success with coding agents, where giving them tools to explore the codebase in addition to vector search has benefits. It’s definitely a strategy with some merit, and the simplicity is attractive to me.

Can you elaborate? There is definitely no window for committed data to go missing in Tigerbeetle, and they have the strongest guarantees there that I’m aware of. Uncommitted data can always go missing no matter how good your guarantees are. That problem is unsolvable (e.g. the data could go missing in the network).

1 Like

Same

I think this answers my question.

I was wondering if there was something deeper before the commit occurs, but it sounds like standard error/retry, or whatever pattern gets you to: hey, yep, the data is committed.

Thanks

I’d like to do some more thinking/experimenting around recall, that makes sense.

I’ll need to do some more reading on Hobbes, Bedrock, etc.

Thank you for the dataset pointer.

In the case of Tigerbeetle specifically all transactions are bound to a globally unique idempotency key so they can be fearlessly retried. This avoids a situation where a transaction was successfully committed but the reply to the client is lost so the client cannot know if it’s safe to retry. Actually, Tigerbeetle’s idempotency keys are meant to be globally unique all the way to the end-user (they call this end-to-end idempotency), meaning the browser or app can generate and retry with the key. You can see how this would be useful for financial transactions (don’t want to send money twice!).
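The retry story can be sketched generically (toy Python, not Tigerbeetle's actual API): the client generates the key once, the server deduplicates on it, and a lost reply becomes safe to retry:

```python
import uuid

class Ledger:
    """Toy server that deduplicates transfers by client-supplied idempotency key."""
    def __init__(self):
        self.applied = {}   # idempotency key -> cached result
        self.balance = 0

    def transfer(self, key, amount):
        if key in self.applied:          # retry of an already-committed transfer
            return self.applied[key]
        self.balance += amount
        self.applied[key] = ("ok", self.balance)
        return self.applied[key]

ledger = Ledger()
key = str(uuid.uuid4())                  # generated once, end to end
ledger.transfer(key, 100)
ledger.transfer(key, 100)                # lost-reply retry: applied only once
print(ledger.balance)                    # 100, not 200
```

The end-to-end part is just that the key is minted at the outermost client (browser/app), so every hop in between can retry blindly.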

FoundationDB has a similar feature but it’s experimental. I am not aware of any other databases with transaction idempotency as a first-class feature, but it’s a great idea.

This is a very nice feature, thank you for sharing this important detail. I need to look even more into Tigerbeetle.

Tangentially related, but this is one of the reasons why I started digging deeper on this stuff: qdrant’s HNSW checkpoint rebuilding on cluster restart takes such a long time. Even a smaller (10-20 GB) qdrant cluster takes absolutely forever to restart, rebuilding the HNSW graph/index from the checkpoint.

All stuff I’m sure you understand, given what you’ve stated about the goals of Hobbes.

Kafka restarts where data can go missing center around various quorum and timing issues. I’m not saying there is anything wrong with Raft/Paxos, etc. It’s just maybe not the tool for strict transaction guarantees?

Regarding Kafka KRaft: oh, you mean we can’t completely trust that the data made it to Kafka, or is in Kafka?

  • Producer sends record
  • Leader appends to its log
  • Leader responds immediately
  • Followers haven’t replicated yet
  • Leader crashes before replication
  • New leader elected from ISR
  • Record never existed on replicas
  • Good game, pack it in, we’re done

So I say tangentially because it sure would be nice to have end-to-end idempotency for various systems, where it’s like: yeah, just retry it, easy, done.

1 Like

Consensus algorithms (VSR/Paxos/Raft/etc.) are the only tool for fault-tolerant strong consistency.

The bug in your example is that in consensus the leader is not allowed to reply before replicating to a quorum, all of whom must persist to disk. Of course there is nothing database companies love to do more than lie about guarantees, and the exceptions (e.g. Tigerbeetle) are rare.

But this is not a technical problem.

1 Like