I have just published v0.4.1 with
- A first cut at adding all the new beta api calls to the user guide.
- DALL-E-3 support in the image API.
- Changes to the FQN for some modules (although I did not bump the major version).
If there’s anything missing or not working as it should, please file an issue (or better yet, a PR).
Still catching up with the new features.
I just published v0.4.2
- Added the Text-To-Speech endpoint
- Changed the FQN of the Audio and Image functions to match the equivalent Python functions (and yet, I left the major and minor versions unchanged).
- Updated the docs.
Has anybody built on top of the new assistants/threads API yet? What’s the sequence of calls you should be making to get the threads use case working? It’s not linear anymore, since it requires you to periodically check whether a run is complete. OpenAI can now suggest that you make multiple function calls, and these can be executed in parallel. Would using a GenServer for every thread, which keeps track of all this state and acts as a bridge between UI ↔ Phoenix App ↔ OpenAI, be a good idea?
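To make the idea concrete, here’s a rough sketch of such a per-thread GenServer. The polling loop and parallel tool execution are the point; all of the API-facing helpers at the bottom (`retrieve_run/1`, `required_tool_calls/1`, etc.) are hypothetical placeholders, not actual openai_ex functions.

```elixir
defmodule ThreadRunner do
  # Sketch: one GenServer per assistant thread, polling until the run
  # completes and executing any requested tool calls in parallel.
  use GenServer

  @poll_interval_ms 1_000

  def start_link(args), do: GenServer.start_link(__MODULE__, args)

  @impl true
  def init(args) do
    schedule_poll()
    {:ok, args}
  end

  @impl true
  def handle_info(:poll, state) do
    case retrieve_run(state) do
      %{"status" => "completed"} = run ->
        notify_ui(state, run)
        {:stop, :normal, state}

      %{"status" => "requires_action"} = run ->
        # the model may request several tool calls; run them in parallel
        run
        |> required_tool_calls()
        |> Task.async_stream(&execute_tool_call/1)
        |> Enum.to_list()

        schedule_poll()
        {:noreply, state}

      _in_progress ->
        schedule_poll()
        {:noreply, state}
    end
  end

  defp schedule_poll, do: Process.send_after(self(), :poll, @poll_interval_ms)

  # ---- placeholders: wire these to the actual beta endpoints / your UI ----
  defp retrieve_run(_state), do: %{"status" => "in_progress"}
  defp required_tool_calls(_run), do: []
  defp execute_tool_call(_call), do: :ok
  defp notify_ui(_state, _run), do: :ok
end
```

One process per thread keeps the run state isolated and lets a LiveView subscribe to just its own thread’s updates.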
I’ve been hearing good things about the Zephyr model created by HF, and was curious whether something that capable could be run locally on my machine.
Since there are multiple apps that proxy the OpenAI API to other models (including local models), this seems to be quite straightforward. The only thing that needs to change is the API base URL.
I have done this now with the function OpenaiEx.with_base_url/2, which parameterizes the API base URL. I tested it against the llama-cpp-python library, and the non-streaming versions of the completion and chat-completion APIs worked on the first try. The streaming versions seem to be causing some difficulty still. I will try the streaming calls against another proxy when I get a chance, to check whether it’s a bug in the proxy.
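For anyone wanting to try the same thing, the setup looks roughly like this. The base URL, API key, and model name are placeholders for whatever your local proxy expects:

```elixir
# Point the client at a local proxy that speaks the OpenAI API.
# Most local proxies ignore the API key, but the client still wants one.
openai =
  "sk-ignored-by-local-proxy"
  |> OpenaiEx.new()
  |> OpenaiEx.with_base_url("http://localhost:8000/v1")

chat_req =
  OpenaiEx.ChatCompletion.new(
    model: "zephyr-7b-beta",
    messages: [OpenaiEx.ChatMessage.user("Say hello in one sentence.")]
  )

response = OpenaiEx.ChatCompletion.create(openai, chat_req)
```

Everything else in the request pipeline stays exactly as it would for the hosted OpenAI endpoint.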
My current dev machine is a Mac with 8 GB of RAM, and it turns out that’s not enough to run a 7B model as well as a Docker dev container (and assorted other apps). I suspect 16 GB would work.
If anyone wants to try local LLMs via an OpenAI API proxy, please give the library a whirl and let me know what you think.
I have not published this to Hex yet, so the mix entry has to point to the GitHub main branch for now.
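Concretely, that mix entry would look something like this (the repo path and branch name here are assumptions; use the project’s actual repository):

```elixir
# mix.exs — pending a Hex release, point at the GitHub main branch.
defp deps do
  [
    {:openai_ex, github: "restlessronin/openai_ex", branch: "main"}
  ]
end
```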
Thanks for the work on this library. I have a couple of questions.
- Do you think openai_ex is a good fit for handling requests in a production app, instead of just Livebooks?
- Is exponential backoff supported via Finch, to handle rate-limiting issues?
@arton sorry for the delay, I’m not on the forum every day, so I didn’t realize you had asked the questions. Here are my 2 cents.
- I don’t see why it couldn’t be used for production. It’s just a very thin wrapper over the HTTP JSON API. If there are changes that need to be made, I’m happy to make them. For instance, there may need to be some way to handle Finch pools, but these design decisions are best driven by an actual use case, so I haven’t tried to add them ex ante.
- The library doesn’t do any exponential backoff. I’d prefer to leave those kinds of decisions to the library user and/or whatever Finch does.
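If you do want backoff on the caller’s side, a minimal sketch might look like this. Nothing like it is built into openai_ex; this simply retries on raised errors, doubling the delay each attempt:

```elixir
defmodule Backoff do
  # Caller-side exponential backoff: retry `fun` up to `attempts` times,
  # sleeping `delay_ms` before the first retry and doubling it each time.
  def retry(fun, attempts \\ 5, delay_ms \\ 500)

  def retry(fun, 1, _delay_ms), do: fun.()

  def retry(fun, attempts, delay_ms) when attempts > 1 do
    fun.()
  rescue
    _error ->
      Process.sleep(delay_ms)
      retry(fun, attempts - 1, delay_ms * 2)
  end
end

# usage, wrapping any openai_ex call:
# Backoff.retry(fn -> OpenaiEx.ChatCompletion.create(openai, chat_req) end)
```

A production version would want to match only on rate-limit responses (HTTP 429) rather than rescuing everything, and add jitter.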