Imagine, for a moment, a world where there is no library for image manipulation, not even for cropping or resizing. Eventually, the good folks at Flickerestagram, who need to crop and resize a never-ending stream of images, come up with an open source solution, CropResizer. Benchmarks agree that it crops and resizes one full gigabyte of images per second on a single machine. Amazing!
Our team needs to manage user avatars in our project, written in language A: crop the original to a square, resize it to 64x64, and save it. Seems like a great use case for CropResizer.
The first problem is that most of CropResizer is written in language X, the go-to choice for working with images. The hard parts needed performance beyond what language X supports, so they were rewritten in low-level language Y instead. But language Y is known to be quite unsafe, so there is quite a bit of code in more recent low-level language Z as well. All three languages have their own package repositories, build systems and runtimes. It takes a few weeks, but our Ops team manages to build most of CropResizer, to download pre-built binary blobs for the rest from a source that hopefully won't make the supply chain attack headlines next year, and to get it working in production. The development machines, however, are a lost cause, as language Z doesn't officially support the operating system, and language Y uses a build pipeline that is completely different, so I guess we'll be working inside virtual machines when images are involved.
The second problem is that there are no bindings to call CropResizer-as-a-library from our application in language A, so we instead invoke CropResizer-as-a-program every time we need to crop and resize an image. The performance is terrible, as CropResizer was never intended to run on single images. Our most senior programmer spends an entire week getting some semblance of integration tests to work on the CI server (like CropResizer itself, they won't run on our development machines).
Also, the CI server now has a state-of-the-art GPU installed! Image processing is, after all, a great task for GPUs: CropResizer obviously uses a GPU to achieve its gigabyte-per-second throughput, and although it could run in CPU-only mode up to version 0.7.3, we're now on version 0.9.2-preview and inactivity-bot has just auto-closed the CPU-only issue on GitHub for the third time this year. The price is a bit steep for a server that will only resize a few megabytes of images per day, but we managed to convince Finance that such was the cost of progress.
What we couldn't convince them of, however, was that we needed a GPU on each of our web servers. In the end, we now have «The Image Server» running on a single GPU-enabled machine, and web servers push images into a queue to be cropped and resized by the Image Server.
The senior who implemented the integration tests is crying silently in the corner, and the Ops team is sending us death threats about having one more machine profile to manage.
The following Monday, we discover that one of our team members knows a bit of language X, and that he spent the entire weekend writing «The New Image Server» in that language.
Junior developers hail him as a hero, and senior developers are asking HR for a «no language X» clause in their work contracts to avoid the inevitable maintenance fallout.
The X rewrite improved performance significantly, as we can now keep a hot instance of CropResizer-as-a-library ready to crop and resize. However, since we had no blessed patterns or libraries for language X, The New Image Server lacks proper logging, contains unsanitary amounts of hard-coded credentials, and sometimes gets stuck in some kind of deadlock that we mitigate with an hourly kill -9. It's a bit of a mess, by our team's usual standards, but the CTO promised us we'd have some time to fix that in the sprint after next.
It has been three years. The New Image Server processes a gigabyte of images per day and accounts for 10% of our total cloud spend. Given the low volume and high performance, we have yet to observe more than one image at a time in the queue. So it has been decided that we will shut it down and use the new CropResizer-as-a-service from our cloud provider.
Back to the real world.
Every serious programming language has an image processing library. Using it creates no additional complexity for deploying the application, does not require dedicated hardware, and does not restrict which operating systems the application can run on. Cropping and resizing an image happens in-process with the overhead of a function call. It is definitely not the kind of thing that requires a dedicated server, or calls to a web API. A niche need for a CropResizer still exists, in those rare situations where a gigabyte per second is necessary, but most teams will never reach a point where CropResizer makes financial sense, let alone one where it is the only solution.
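In Python, for instance, the avatar use case from the parable fits in a few lines with the Pillow library (assuming it is available; the function name and defaults here are illustrative, not from any particular codebase):

```python
from PIL import Image

def make_avatar(original: Image.Image, size: int = 64) -> Image.Image:
    """Crop an image to a centered square, then resize it."""
    w, h = original.size
    side = min(w, h)
    left = (w - side) // 2
    top = (h - side) // 2
    square = original.crop((left, top, left + side, top + side))
    return square.resize((size, size))

# In-process, with the overhead of a function call:
# no queue, no GPU, no Image Server.
avatar = make_avatar(Image.new("RGB", (300, 200), "white"))
print(avatar.size)  # (64, 64)
```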
I believe the same should be true of embeddings.
Adding more parameters to an embedding model yields diminishing returns, and the overall trend is towards smaller models that rely on a better design (and better tokenization) to improve performance with minimal losses in quality. We're reaching a point where a CPU can compute the embedding of a query in real time and still produce results that are good enough.
For now, the main distribution channel for embeddings is a combination of Python scripts and dependencies in C and Rust. We're still in the CropResizer world.
Tokenizers are now simple enough that they can be ported to other languages. Going from Rust to C# or Java is a good idea when operational complexity matters more than nanoseconds of performance.
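To give an idea of how small such a port can be, here is a toy greedy longest-match tokenizer in the spirit of WordPiece, sketched in Python. The vocabulary and the «##» continuation convention are purely illustrative; a real port would load the vocabulary file shipped with the model:

```python
# Toy vocabulary: word-initial pieces, and "##"-prefixed continuation pieces.
VOCAB = {"crop", "resi", "##ze", "##r", "image", "##s", "[UNK]"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest matching vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation piece
            if piece in VOCAB:
                tokens.append(piece)
                start = end
                break
            end -= 1
        else:  # no piece matched at this position
            return ["[UNK]"]
    return tokens

print(tokenize("resizer"))  # ['resi', '##ze', '##r']
```

There is nothing here that would be hard to express in C# or Java: a dictionary lookup and two loops, no Rust-specific machinery required.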
For the models themselves, something like ONNX is quite promising: a standard that packs the model weights together with a program for which a virtual machine can be implemented in many different languages. Most models on HuggingFace have an ONNX version (though not always from the original author), and if enough companies start consuming ONNX-format models, the equilibrium might move from «the model is published when there's a PyTorch script» to «the model is published when there's an ONNX file» not only for embedding models, but for other machine learning domains as well.
And while a GPU-enabled ONNX runtime can be huge (the package is 99 MB!), there is also room for lightweight, CPU-only alternatives that serve companies that need operational simplicity rather than raw performance.
Looking forward to it.