I’m trying to go through a massive number of short text documents and extract certain keywords from them to make filtering the documents easier.
This same problem has probably been solved thousands of times already by different companies.
I’m able to do this with ChatGPT by asking it to extract technology-related keywords, giving it some sample keywords first, and asking for the response in JSON, but I feel there’s probably an easier and less expensive way to achieve the same thing that I’m just not aware of.
How would you solve a problem like this with Elixir?
This problem space is pretty much an entire field of study, that of Natural Language Processing. Some techniques/prior art you could look at as jumping-off points are TF-IDF, word2vec, and HuggingFace Tokenizers. The latter has a Hex package and is used internally by Bumblebee. As a casual bystander to the field, my understanding is that this kind of tokenization is pretty much the state-of-the-art tactic right now, and I know it is the kind of data representation used by LLMs.
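To get a feel for what the Tokenizers package gives you, here is a minimal sketch; it assumes the `tokenizers` Hex package is installed, and the model name is just an example of a tokenizer hosted on Hugging Face:

```elixir
# Minimal sketch using the `tokenizers` Hex package; "bert-base-cased" is an
# arbitrary example tokenizer, not a recommendation.
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

{:ok, encoding} =
  Tokenizers.Tokenizer.encode(tokenizer, "Extract keywords from short text documents")

# The encoding exposes the subword tokens and their integer ids, which is the
# representation downstream models (and embedding steps) work with.
Tokenizers.Encoding.get_tokens(encoding)
Tokenizers.Encoding.get_ids(encoding)
```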
You might combine that output with pgvector or another data store that can natively store and index vector data; strictly speaking, the vectors come from running the tokenized text through an embedding model (which Bumblebee can do), rather than from tokenization alone. There are a variety of techniques for comparing vectors, but cosine similarity/distance seems to be one of the common suggestions, and pgvector, for example, can compute it.
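A rough sketch of how those pieces could fit together, assuming Bumblebee/Nx for the embeddings and the `pgvector` Hex package for the Ecto side; the model repo, table name, column name, and vector size below are all illustrative, not prescribed:

```elixir
# Embed a document with Bumblebee (the model choice is an example only).
{:ok, model_info} = Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

serving = Bumblebee.Text.text_embedding(model_info, tokenizer)
%{embedding: embedding} = Nx.Serving.run(serving, "short text document to index")

# Query the closest stored documents by cosine distance. This assumes a
# hypothetical "documents" table with a pgvector column named "embedding"
# (e.g. `add :embedding, :vector, size: 384` in a migration) and a MyApp.Repo.
import Ecto.Query
import Pgvector.Ecto.Query

query_vector = Pgvector.new(Nx.to_flat_list(embedding))

from(d in "documents",
  order_by: cosine_distance(d.embedding, ^query_vector),
  limit: 10,
  select: %{id: d.id, body: d.body}
)
|> MyApp.Repo.all()
```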
The Fly folks have a write-up on a superficially similar ask, indexing and searching Hex packages:
If you have a finite, fixed set of labels to apply to each matching document, such as a short list of discrete topics, that might be doable as zero-shot classification with a pre-trained model.
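If that fits your case, Bumblebee ships a zero-shot classification serving. A hedged sketch, where the model repo and the label set are placeholders for whatever suits your documents:

```elixir
# Zero-shot classification sketch with Bumblebee; "facebook/bart-large-mnli"
# and the labels below are examples, not requirements.
{:ok, model_info} = Bumblebee.load_model({:hf, "facebook/bart-large-mnli"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "facebook/bart-large-mnli"})

labels = ["databases", "machine learning", "web development", "devops"]

serving = Bumblebee.Text.zero_shot_classification(model_info, tokenizer, labels)

# Returns scored predictions, roughly %{predictions: [%{label: ..., score: ...}, ...]},
# so you can keep the top label(s) per document for filtering.
Nx.Serving.run(serving, "Postgres adds better support for logical replication")
```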