I’m trying to go through a massive number of short text documents and extract certain keywords from them to make filtering the documents easier.
This same problem has probably been solved thousands of times already by different companies.
I’m able to do this with ChatGPT by asking it to extract technology-related keywords, giving it some sample keywords first, and asking for the response in JSON, but I feel there’s probably an easier and less expensive way to achieve the same thing that I’m just not aware of.
How would you solve a problem like this with Elixir?
This problem space is pretty much an entire field of study, that of Natural Language Processing. Some techniques/prior art you could look at as jumping-off points are TF-IDF, word2vec, and HuggingFace Tokenizers. The latter has a Hex package and is used internally by Bumblebee. As a casual bystander to the field, my understanding is that this kind of tokenization is pretty much the state-of-the-art tactic right now, and I know it is the kind of data representation used by LLMs.
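To get a feel for what the Tokenizers package gives you, here is a minimal sketch; it assumes the `tokenizers` Hex package is installed, and the model name is just an example of a tokenizer hosted on Hugging Face:

```elixir
# Minimal sketch using the `tokenizers` Hex package; "bert-base-cased" is an
# arbitrary example tokenizer, not a recommendation.
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

{:ok, encoding} =
  Tokenizers.Tokenizer.encode(tokenizer, "Extract keywords from short text documents")

# The encoding exposes the subword tokens and their integer ids, which is the
# representation downstream models (and embedding steps) work with.
Tokenizers.Encoding.get_tokens(encoding)
Tokenizers.Encoding.get_ids(encoding)
```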
You might combine that output with pgvector or another data store that can natively store and index vector data; strictly speaking, the vectors come from running the tokenized text through an embedding model (which Bumblebee can do), rather than from tokenization alone. There are a variety of techniques for comparing vectors, but cosine similarity/distance seems to be one of the common suggestions, and pgvector, for example, can compute it.
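A rough sketch of how those pieces could fit together, assuming Bumblebee/Nx for the embeddings and the `pgvector` Hex package for the Ecto side; the model repo, table name, column name, and vector size below are all illustrative, not prescribed:

```elixir
# Embed a document with Bumblebee (the model choice is an example only).
{:ok, model_info} = Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

serving = Bumblebee.Text.text_embedding(model_info, tokenizer)
%{embedding: embedding} = Nx.Serving.run(serving, "short text document to index")

# Query the closest stored documents by cosine distance. This assumes a
# hypothetical "documents" table with a pgvector column named "embedding"
# (e.g. `add :embedding, :vector, size: 384` in a migration) and a MyApp.Repo.
import Ecto.Query
import Pgvector.Ecto.Query

query_vector = Pgvector.new(Nx.to_flat_list(embedding))

from(d in "documents",
  order_by: cosine_distance(d.embedding, ^query_vector),
  limit: 10,
  select: %{id: d.id, body: d.body}
)
|> MyApp.Repo.all()
```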
The Fly folks have a write-up on a superficially similar ask, indexing and searching Hex packages:
If you have a finite, fixed set of labels to apply to each matching document, such as a short list of discrete topics, that might be doable as zero-shot classification with a pre-trained model.
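If that fits your case, Bumblebee ships a zero-shot classification serving. A hedged sketch, where the model repo and the label set are placeholders for whatever suits your documents:

```elixir
# Zero-shot classification sketch with Bumblebee; "facebook/bart-large-mnli"
# and the labels below are examples, not requirements.
{:ok, model_info} = Bumblebee.load_model({:hf, "facebook/bart-large-mnli"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "facebook/bart-large-mnli"})

labels = ["databases", "machine learning", "web development", "devops"]

serving = Bumblebee.Text.zero_shot_classification(model_info, tokenizer, labels)

# Returns scored predictions, roughly %{predictions: [%{label: ..., score: ...}, ...]},
# so you can keep the top label(s) per document for filtering.
Nx.Serving.run(serving, "Postgres adds better support for logical replication")
```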