Hi Giovanni and the Orama team! Thanks so much for replying.
It’s Karolus, from Flow AI. I’m in the picture above, the second one from the right.
Behind Flow AI is our hard-working team, proudly remote and async, from 🇫🇮🇪🇸🇦🇹🇵🇱🇵🇰.
With this team we previously built the email assistant product Flowrite, one of the first products of its kind in the world to use LLMs in production. Being that early meant we gathered huge interest and scaled our user base organically to around 200,000 power users. We competed in the productivity space, with highly demanding users, the majority of them from the US and many from companies such as Google, Tesla, and Snap.
The largest problem we faced in our technical execution was with our GenAI. We obsessed over figuring out how to optimize and monitor the fit and accuracy of our generator (what is nowadays fancily called an 'LM system') for all of our different customer groups and use-cases. We had so many of them. After all, we had set out to free absolutely everyone from tedious typing, not just a single segment.
With a vast number of users, a wide-open segment, and the hugely unreliable base LLMs of that era, we had ourselves a hair-on-fire problem.
Our story with evaluation began then and there. To get ourselves out of the rut, we developed our first simple toolchain, and we needed it badly. We wanted to improve our ability to control the length and style of generations, as well as eliminate various types of what we categorized as 'catastrophic failures'.
Since then a lot has changed. The field of research has advanced tremendously, with ever more capable techniques becoming available each month. There is now an abundance of tools. However, since LMs are being used in new places, there is also a growing number of hard, novel problems to solve.
This sea change spurred us to switch markets from B2C to B2B last year. Now we finally get to develop the solution we wish we had had access to earlier on.
Against that background, we believe we could help you greatly at Orama, and you could be the beneficiaries of our lived experience and scar tissue.
At this point in our journey, we haven't yet completed the full vision of our platform, but we have already created a very strong proposition for LM system evaluation. We would like to begin with you in a limited fashion by experimenting with an LM-as-judge in one of your use-cases.
It is what we can offer right away, off the shelf. It is opinionated and channels the expertise we have picked up along our journey.
This is how we do it differently from everybody else:
a. Quick on our feet: we use auto-evaluators and a specially trained rubric inference engine on small datasets, so you save on labeling and on process.
b. Accuracy: we have specially fine-tuned, capable LM-as-judge models in multiple sizes, purpose-built to adapt to a changing set of rubrics, and they do it remarkably well. On top of this, we understand the hard-to-notice pitfalls, caveats, and biases that make it very challenging to build your own robust judge models that generalize; to mitigate these errors, we apply a plethora of proprietary data mixing, training methods, and other fixes. Not only that, but all our judges are meta-evaluated, so they are guaranteed to align with your human experts.
c. Managed transparently: we don't like to brag, but we have some of the best deployment options because we build our judges ourselves and can provide you with full observability and control as needed. Choose from latency options for offline and online use-cases.
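To make point b above a bit more concrete, here is a rough, generic sketch of what scoring a generation against a rubric with an LM-as-judge can look like. It is purely illustrative and not our actual API; the `call_judge_model` function is a hypothetical placeholder for whatever judge model endpoint you run, and the rubric text is invented for the example.

```python
import json

# Hypothetical placeholder: in practice this would call your judge model
# endpoint of choice (e.g. a fine-tuned LM served behind an API).
def call_judge_model(prompt: str) -> str:
    raise NotImplementedError("plug in your own LM-as-judge endpoint here")

def judge(instruction: str, response: str, rubric: str) -> dict:
    """Ask an LM-as-judge to score a response against a rubric.

    Expects the judge to answer with JSON like {"score": 1-5, "rationale": "..."}.
    """
    prompt = (
        "You are an evaluator. Score the response against the rubric.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Response:\n{response}\n\n"
        'Reply with JSON only: {"score": <1-5>, "rationale": "<one sentence>"}'
    )
    return json.loads(call_judge_model(prompt))

# Illustrative rubric for concise, polite email replies (invented for the example):
rubric = (
    "5 = concise (under 120 words), polite, and answers every question; "
    "1 = misses questions or is rude."
)
# result = judge("Reply to this customer email ...", "Hi Anna, thanks for reaching out ...", rubric)
```

Swapping in a new rubric string is all it takes to change the evaluation criteria, which is why we care so much about judges that adapt well to changing rubrics.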
What we will offer you soon, if you stay with us and become our customer or co-development partner: