Open sourcing LLM-proxy

Sep 15, 2024

Do you want to save 50% on LLM costs? Read on

I run a somewhat ambitious hobbyist project that needs tons of compute. Thousands of audio files are transcribed in my home lab every day. Then I use a mix of Google Gemini and OpenAI APIs to summarize, extract key data, etc. I could run a model locally, but I’d rather use the local GPUs to crunch these audios. Since I don’t plan to monetize this project, keeping costs low is crucial. If costs rise, I need to find ways to take them down again.

Back in April OpenAI announced BatchAPI. “The API gives a 50% discount on regular completions… Results guaranteed to come back within 24hrs and often much sooner.”.
That seemed promising, as my use-case is not real-time.
Batch APIs are generally asynchronous, you just can’t sit waiting for the operations to complete. OpenAI provides an endpoint where you can query the status of the request. Once the request is finished, you can use another endpoint to fetch the results.

My first intention was to use the Batch API properly: create a table in the database to store all these outstanding batches, ask for periodic updates, and act on finished batches. I even created the tables, or actually had Claude create them for me. But for my first prototype I just held the batches in-memory, and kept polling the API until they were finished. This approach worked well, because the latency was under a minute, very far from the 24h limit. Still, when your calls go from a few seconds to a minute (an hour at peak times), you need some non-trivial re-architecture. I ended up spending a good chunk of my Sunday on that.

The solution worked great, but I have many small utilities that use the OpenAI API, and I didn’t want to spend a Sunday on each. What if I could create something where my code didn’t need any changes and it would use the Batch API under the covers? That’s how llm-proxy was born.

llm-proxy takes requests for an LLM. It groups them into batches based on configurable criteria (time window, batch size, etc.). It sends the batch of requests to the batch API and waits for a response. The batch can have size 1, which is fine. Once it receives a batch of responses, it will unpack them and send a response for each of the requests.
Here’s the nice part: because the proxy API is “backwards compatible” with the synchronous API, you can use the off-the-shelf SDKs you were already using. Luckily all SDKs support overriding the base URL, it is a requirement to work on Azure. You just need to use the proxy URL as your base URL. The repo includes working examples for the most popular programming languages.

The code itself is written in Go. Not that you need to care, it’s transparent to you. It is familiar to me, and its goroutines and channels are perfect abstractions for the job.

While the happy path is quite straightforward, the error conditions are a bit trickier. I devoted significant time to that path, the most likely bugs will come from unexpected failure modes.

Security considerations

llm-proxy never stores credentials -or anything else- on disk, and it doesn’t have its own authentication against OpenAI. It just uses the keys of the request to relay them to OpenAI.
It’s intelligent enough to not mix requests with different keys: all requests in a batch must use the same key and call the same endpoint, llm-proxy honors this.

Future work

Implement a configurable grace period. If a batch doesn’t complete within the allotted time, the batch will be canceled, partial results will be returned, and the remaining requests will be sent via the synchronous API.
Add support for Gemini, which also supports batch processing.
Enable usage tracking (counting tokens, estimating cost).

Try it!

Find llm-proxy in https://github.com/xdrudis/llm-proxy and start saving in minutes.