W4D5 - Faster inference with quantized models

Today I attended LangChain's 'hacking hours' event, which presumably was organized to get people ramped up on LangGraph. I had a few interesting chats, though amusingly not about LangChain. I came across Sasha Rush's GPU Puzzles, a set of notebooks that teach GPU programming through puzzles. This led me to Tensor Puzzles, which teaches you how to think in tensors.

I had been using Whisper to transcribe audio files. At the event I was introduced to whisper.cpp, a C/C++ port of Whisper's inference code that supports quantized models, where weights are stored in lower-precision float and integer formats. Quantization reduces model size and speeds up inference, and performance on CPUs in particular improves. Where Whisper took 8 hours to transcribe 1 hour of audio, with whisper.cpp I needed just 2 hours!
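To get a feel for what quantization does, here's a minimal sketch (not whisper.cpp's actual scheme, which uses block-wise formats like q5_0) of symmetric int8 quantization with NumPy: weights are rescaled into the int8 range, stored at a quarter of the size, and dequantized back with bounded rounding error.

```python
import numpy as np

# Toy fp32 "weights" standing in for one model tensor.
rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)

# Symmetric int8 quantization: map [-max|w|, max|w|] onto [-127, 127].
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)

# Dequantize to approximate the original weights.
w_hat = q.astype(np.float32) * scale

print(w.nbytes / q.nbytes)            # 4x smaller: int8 vs fp32
print(np.abs(w - w_hat).max() <= scale / 2)  # rounding error within half a step
```

The accuracy cost is just the rounding error per weight, which is why quantized models transcribe nearly as well while running much faster on CPUs that have fast integer arithmetic.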

It's a bit harder to motivate building a web app now that my use case can be done locally...