Planning an AI (LLM) app with RAG and PEFT, based on the newest open-source models (Llama-2, Mixtral, tbd).
However, I dislike Python/JS (the usual languages for such jobs).
Given Elixir’s strengths in concurrency, scalability, fault tolerance, immutable data, and stateless functions, I believe it could be ideal as a programming environment/platform for developing complex LLM apps with multi-agent and multi-threaded capabilities.
Does Elixir possess the requisite maturity and toolset/ecosystem to build such a project effectively?
Did somebody explore this path?
This talk might be useful, as it examines the ecosystem and features that make Elixir powerful for MLOps.
Thanks, I was aware of these. They are very promising presentations, but they are still just initiatives.
(I can’t propose a project based on just conference presentations. The client’s CTO will need to see “facts”).
Wondering if some of you have actual experience on it at the business/production level.
I’m also interested in RAG and systems to query/chat with a document collection. I have been testing PrivateGPT and hope to find an Elixir equivalent.
I am aware of PrivateGPT; I’ve been playing a bit with it. But it’s only RAG (chat with documents).
Our client’s needs include prior fine-tuning/PEFT (LoRA, etc.) of the model for their specific domain. A complete tool for this is LangChain.
Unfortunately, our trials show that LangChain, limited to Python/JS deployments, falls short for large-scale deployment with many concurrent clients/agents.
I know it. It’s not usable for me, as it deals only with the GPT API and their proprietary models ($$$).
Our client prefers an implementation based solely on open-source models.
Besides… no offence, but this Elixir implementation of LangChain is very limited (still too young for a production deployment).
It’s a pity, because Elixir (running on Erlang’s BEAM) is the ideal platform (IMHO) for such AI applications. I don’t see much interest in extending it in this direction.
I would suggest actually taking the models you want to use for a spin in Livebook.
Right now, today, you can import models developed in, say, Python and operationalize them using Elixir with Bumblebee, Ortex, AxonOnnx, Axon, Scholar, etc., all underpinned by Nx. Nx.Serving can provide distributed serving of models using every GPU in your cluster.
You can’t make robust decisions from the armchair; you will need to do some validation yourself. While Elixir is relatively new to the ML space, it has the underpinnings to be a compelling solution for deployment. The weakest area currently is model development, but for serving models, Elixir has tools to import existing ones. That’s not to say there are no gaps; the important thing is to identify whether those gaps are untenable for your current use cases, and, if you do use Elixir, which parts of your current environment it will replace. I would hazard a guess that you would pick a scoped part of the overall solution, try Elixir there, get some experience, then consolidate and expand from that.
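To make that concrete, here is a rough sketch (untested; the model name and serving options are illustrative assumptions, not a recommendation) of what serving a model through Bumblebee and Nx.Serving looks like:

```elixir
# Load a Hugging Face model, tokenizer, and generation config via Bumblebee.
repo = {:hf, "meta-llama/Llama-2-7b-chat-hf"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# Build a text-generation serving.
serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)

# In your application's supervision tree; Nx.Serving batches concurrent
# requests and can serve across the nodes/GPUs in your cluster.
children = [
  {Nx.Serving, serving: serving, name: MyApp.LLMServing, batch_size: 4}
]

# From anywhere in the cluster:
# Nx.Serving.batched_run(MyApp.LLMServing, "Hello!")
```

The supervised `Nx.Serving` process is what gives you the concurrency story: many BEAM processes can call `batched_run/2` and have their requests batched onto the GPU transparently.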
One of the most advanced intellectual-property search systems migrated all their models and processing to Elixir to operationalize their solution. In doing so, they halved their AWS costs through reduced complexity, processing hundreds of millions of patents on a weekly basis when updating their models. Their original talk is here:
Their latest talk is here:
It is definitely possible. A RAG system has three components:
- Models for generating embeddings
- An index
- An LLM
You will find support for generating embeddings in Bumblebee. You need to pick a model, though, and SBERT is a starting point: https://www.sbert.net/
Indexes are the area we have developed the least. There are both ExFAISS and hnswlib bindings on GitHub; we want to officially release the latter at some point. Alternatively, you can pick a vector database, or even Postgres with pgvector, for this step, which is what I would recommend.
Then you need to pick an LLM, either with Bumblebee or off the shelf.
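Putting those three components together, a rough Elixir sketch might look like this (untested; the model name, table/column names, and the pgvector usage are my assumptions, not a known-good implementation):

```elixir
# 1. Embeddings via Bumblebee (SBERT-family model, as suggested above).
repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
embedder = Bumblebee.Text.text_embedding(model_info, tokenizer)

%{embedding: query_vec} = Nx.Serving.run(embedder, "How do I renew my license?")

# 2. Index: nearest neighbours in Postgres with pgvector
#    (`<=>` is pgvector's cosine-distance operator).
import Ecto.Query

top_chunks =
  from(c in "chunks",
    order_by: fragment("embedding <=> ?", ^Pgvector.new(Nx.to_list(query_vec))),
    limit: 5,
    select: c.text
  )
  |> MyApp.Repo.all()

# 3. The retrieved chunks are then injected into the LLM prompt.
```

Step 3 is just string assembly plus a call to whichever LLM you picked, so the Elixir-specific work is really steps 1 and 2.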
Here is a post, a bit dated, that gives you more pointers: Semantic Search with Phoenix, Axon, Bumblebee, and ExFaiss - DockYard
Honestly, implementing this has both technical moving parts and business-building parts. What is the best model for your use case? The best embeddings? How do you generate embeddings for your documents? Etc. My suggestion would be to pick an off-the-shelf solution to evaluate the results and build a prototype, and only then evaluate what makes sense to bring in-house for performance, value, or security reasons.
In case it matters, I am speaking both as a library author and as someone who has built more than one proof of concept RAG system.
Here is an article that shows how to implement steps 1 and 2 with Elixir: Real World ™ Machine Learning on Fly GPU's · The Phoenix Files
There are usually a few more steps involved that wrap around the LLM, in the indexing, retrieval, or response-formulation phases, and I am not sure how many of these tools we have in our ecosystem. If you look at LlamaIndex, for example, you can pick from several strategies for querying/retrieval, pre-processing and post-processing of data, summarization steps, alignment-verification steps, context/window tracking, logging, and such. They work out of the box, for the most part.
I think it’s feasible to build a RAG tool in plain Elixir, but you have to be prepared to build more of these building blocks yourself.
Oh, definitely. If you have PDFs, you need to extract text from them for embeddings. Once you choose an LLM, you need to consider prompt engineering, window size, etc. Some models make it easier than others.
But those considerations all exist around the three main blocks I have mentioned and will vary per use case and per technology. That’s another reason why I would start with something off the shelf and then break it apart based on your needs.
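To make the "inject retrieved context into the prompt" part concrete, here is a minimal sketch in plain Elixir. The module name and prompt wording are my own assumptions; real systems tune this heavily per use case:

```elixir
defmodule PromptBuilder do
  @doc """
  Builds a retrieval-augmented prompt by injecting retrieved chunks
  as a bulleted context section ahead of the user's question.
  """
  def build(question, chunks) do
    context = Enum.map_join(chunks, "\n", &("- " <> &1))

    """
    Answer the question using only the context below. If the context is
    insufficient, say so instead of guessing.

    Context:
    #{context}

    Question: #{question}
    """
  end
end
```

The resulting string is what gets sent to the LLM in place of the raw question, which is the whole trick behind RAG.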
This is a great example. I will try to contact fly.io maybe get some insights on their experience.
Thanks José; clearly RAG with Elixir is feasible.
The big challenge I’m facing is persuading the customer to adopt this new (and exotic) approach, as they prefer to stick to tools and technologies that are “well-established” and “industry-tested”, and to take no risks.
For what it’s worth, my team at Revelry has been building a custom RAG-based application with Elixir/Phoenix over the past year, and I am really happy we decided to stick with Elixir.
I gave a brief breakdown of the steps needed to build a RAG flow (without LangChain) in this blog post (see the section about “How to build a RAG flow”). That article is primarily about comparing OpenAI’s API offerings, but the tangent about RAG is relevant, especially given that we did exactly that using Phoenix and Elixir. We are currently using OpenAI for our LLM, but the RAG part is really LLM-agnostic. We could plug in open-source models in place of GPT-4 if we wanted to, but right now it is getting us the best results.
Directly from the blog post linked above are the general steps to build a RAG flow:
- Set up a Vector Database
- Options range from building it yourself in Postgres with pgvector, to open-source vector DBs such as Chroma, to a nicely managed solution like Pinecone.
- Enable uploading of documents to your system that need to go into the vectorDB (probably via some web interface)
- Extract plain text from the files (can be more involved depending on the file type)
- For each uploaded document, chunk the text based on content type
- There are a lot of decisions to be made here in terms of how large the chunks are, what to split the chunks on, how much overlap there should be, etc.
- Convert those chunks into vector embeddings
- You can use OpenAI’s embedding models via API, but you can also use any embedding model of your choice (open-source or proprietary).
- Store those vector embeddings in your vector DB
- Query against the vector DB using semantic search to pull relevant pieces of information out, and inject that info into a prompt before it’s sent to the LLM
- Send the “retrieval augmented” prompt to the LLM to generate the stuff. Hence “Retrieval Augmented Generation”.
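As an illustration of the chunking step above, here is a naive fixed-size chunker with overlap in plain Elixir. It is a hypothetical helper, not our production code; real pipelines usually split on semantic boundaries (paragraphs, headings) and tune size/overlap per content type:

```elixir
defmodule Chunker do
  @doc """
  Splits `text` into chunks of at most `size` graphemes, where each chunk
  overlaps the previous one by `overlap` graphemes. Trailing partial
  chunks are kept (the last one may be shorter than `size`).
  """
  def chunk(text, size, overlap) when size > overlap and overlap >= 0 do
    step = size - overlap

    text
    |> String.graphemes()
    |> Enum.chunk_every(size, step, [])
    |> Enum.map(&Enum.join/1)
  end
end
```

Each chunk produced here would then go through the embedding model and into the vector DB, as described in the steps above.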
Very interesting experience indeed. Many thanks for sharing it.
Did you use this in production, or was it just an internal experiment?
How did you perform the fine-tuning (PEFT/LoRA) of your model(s)? Did you also use Elixir for that?
This is definitely more than an experiment. We are building an application called ProdOps.AI, which has the goal of augmenting software/product delivery teams using generative AI. It’s very much in its early stages, primarily being used by our internal teams and close partners, but we plan to release it out in the wild in the very near future.

I don’t want to promote it here, but just for context: the primary use cases (currently) include generating product roadmaps, backlogs, user stories, and implementation plans, augmented with proprietary data that can be either synced via an external data source (e.g. GitHub/Slack/Google Drive) or manually uploaded. We also built a pretty useful prompt-template management system, which allows for building prompts that can query against a given organization’s proprietary data based on user inputs at generation time. We initially created this prompt management system to let our team easily iterate on prompts for specific use cases, but it turned into a pretty useful generic RAG-based prompt management system. We hope to abstract some of the less opinionated, product-development-specific parts out into open-source tooling at some point, but that will take some time and effort.
As far as fine-tuning goes, we haven’t crossed that bridge yet. So far, we have gotten what we needed from RAG + GPT-4 without fine-tuning. We will definitely be digging more into fine-tuning soon, but it doesn’t seem necessary for this application at this juncture.
That said, I asked my team if they had any opinions on training LoRAs, etc., and was reminded that Sean Moriarity mentioned in a talk a while back that he’d fine-tuned a model using Python tooling but then deployed it with Elixir for production inference. Check out his talk (it should be timestamped where he discusses the fine-tuning).