ntonozzi 4 days ago [-]
Why do they need to run benchmarks to confirm performance? Can't they run an example prompt and verify they get the exact same output token probabilities for all prompts? The fact that they are not doing this makes me suspicious that they are in fact not doing the exact same thing as vLLM.
It is also a bit weird that they are not incorporating speculative decoding, that seems like a critical performance optimization, especially for decode heavy workloads.
lukebechtel 4 days ago [-]
Yes, speculative decoding would make both us and vLLM faster, but we believe it would be a relatively even bump on both sides, so we didn't include it in this comparison. Worth another test!
nyrikki 3 days ago [-]
> Can't they run an example prompt and verify they get the exact same output token probabilities for all prompts?
You don’t even get that with GPUs in general, or really floating point in general.
The Art of Computer Programming, Volume 2: Seminumerical Algorithms, section 4.2.2 explains where floating-point addition loses the associativity property.
Apartness relations are another possible lens: https://arxiv.org/abs/2506.09501
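Concretely, the associativity loss is visible in a few lines of plain Python (standard doubles here, but float32/float16 on a GPU behave the same way):

    # Floating-point addition is not associative, so summation order
    # (which changes with batching / reduction strategy) changes the result.
    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)                  # 0.6000000000000001
    print(a + (b + c))                  # 0.6
    print((a + b) + c == a + (b + c))   # False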
> However, as the name “batch-invariant” suggests, the technique is currently limited to handling variations related only to the batch dimension, making it robust to continuous batching and other batch-size–related changes, but not to other forms of nondeterminism like changing the TP sizes or GPU types.
> It is also a bit weird that they are not incorporating speculative decoding
Wouldn’t speculative decoding decrease overall throughput, but optimise (perceived) responsiveness?
YetAnotherNick 4 days ago [-]
For the compute-bound region (high batch size), yes, but for low batch sizes it could improve throughput.
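For intuition, here is a minimal sketch of the speculative-decoding accept/reject loop; draft_sample and target_prob are hypothetical callables standing in for a cheap draft model and the full target model, not any particular engine's API:

    import random

    def speculative_decode_step(draft_sample, target_prob, prefix, k=4):
        # One speculative step: the draft proposes k tokens, the target scores
        # them (in practice in a single batched forward pass), and each token is
        # accepted with probability min(1, p_target / p_draft).
        # draft_sample(ctx) -> (token, p_draft); target_prob(ctx, token) -> p_target.
        proposed, ctx = [], list(prefix)
        for _ in range(k):
            tok, p_draft = draft_sample(ctx)
            proposed.append((tok, p_draft))
            ctx.append(tok)

        accepted, ctx = [], list(prefix)
        for tok, p_draft in proposed:
            p_target = target_prob(ctx, tok)
            if random.random() < min(1.0, p_target / max(p_draft, 1e-9)):
                accepted.append(tok)
                ctx.append(tok)
            else:
                break  # on rejection, resample from the target's residual distribution
        return accepted  # several tokens per expensive target pass when drafts are good

    # Toy usage with stub models, just to exercise the loop:
    draft = lambda ctx: ("the", 0.9)
    target = lambda ctx, tok: 0.8
    print(speculative_decode_step(draft, target, prefix=["<s>"]))

At high batch sizes the verification pass competes for the same compute, which is why the benefit shrinks once you are compute-bound.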
2001zhaozhao 4 days ago [-]
Every example like this makes it obvious that you can now use ML-like optimization approaches on well-specified, very-well-tested software problems with a clear optimization goal. Keep if it improves the objective while maintaining correctness, discard if it doesn't. AI-descent strikes again.
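A minimal sketch of that loop; propose_change, run_tests, and benchmark are hypothetical stand-ins for an LLM-driven patch generator, a correctness suite, and a performance harness:

    def optimize(candidate, propose_change, run_tests, benchmark, steps=100):
        # Greedy accept/reject search: keep a change only if it still passes the
        # test suite AND improves the measured objective; otherwise discard it.
        best, best_score = candidate, benchmark(candidate)
        for _ in range(steps):
            new = propose_change(best)
            if not run_tests(new):       # correctness gate first
                continue
            score = benchmark(new)
            if score > best_score:       # objective gate second
                best, best_score = new, score
        return best, best_score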
Maybe I should learn more about ML to have a better instinct on optimization methods in general, so I can actually build AI optimizers like these.
lukebechtel 3 days ago [-]
The bitter lesson strikes again, I suppose!
storus 4 days ago [-]
Does it support paged attention like vLLM though? Without that they will run into memory fragmentation quickly.
lukebechtel 4 days ago [-]
Yes, great question!
The system started without paged attention, and automatically recreated its own paged-attention implementation once it identified the lack of one as a bottleneck.
Pretty cool!
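For anyone unfamiliar with the technique, here is a toy sketch of a paged KV-cache block table (the general idea vLLM popularized, not the code our system generated):

    class PagedKVCache:
        # Sequences map to fixed-size blocks drawn from a shared pool, so KV
        # memory is allocated on demand and freed blocks are reused, avoiding
        # the fragmentation of contiguous per-sequence buffers.
        def __init__(self, num_blocks, block_size):
            self.block_size = block_size
            self.free_blocks = list(range(num_blocks))
            self.block_tables = {}   # seq_id -> list of physical block ids
            self.seq_lens = {}       # seq_id -> tokens written so far

        def append_token(self, seq_id):
            table = self.block_tables.setdefault(seq_id, [])
            n = self.seq_lens.get(seq_id, 0)
            if n % self.block_size == 0:   # current block full (or first token)
                if not self.free_blocks:
                    raise MemoryError("KV cache exhausted; preempt or swap a sequence")
                table.append(self.free_blocks.pop())
            self.seq_lens[seq_id] = n + 1
            return table[-1], n % self.block_size   # (physical block, slot in block)

        def free(self, seq_id):
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))
            self.seq_lens.pop(seq_id, None)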
rfw300 4 days ago [-]
OK... we need way more information than this to validate this claim! I can run Qwen-8B at 1 billion tokens per second if you don't check the model's output quality. No information is given about the source code, correctness, batching, benchmark results, quantization, etc. etc. etc.
lukebechtel 4 days ago [-]
We currently validate with MMLU and HellaSwag, and are getting this independently verified by a third party.
We have considered open-sourcing some of our optimized inference libraries in the future, but have not yet come to a decision on this.
Also if you need a rough intuition as to why this is possible: it's because this entire inference stack was built for exactly one model, and thus we can really tune the entire framework accordingly.
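To make that concrete: when every shape is known ahead of time, buffer sizes and launch parameters can be baked in rather than handled generically. A sketch with illustrative, hypothetical dimensions (not the actual model's):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FixedModelConfig:
        # Illustrative dimensions for a ~8B decoder-only model; because they
        # never change, kernels and buffers can be specialized to them.
        hidden_size: int = 4096
        num_layers: int = 32
        num_kv_heads: int = 8
        head_dim: int = 128
        max_seq_len: int = 8192

    CFG = FixedModelConfig()

    # KV-cache bytes per token can be precomputed once (FP8 -> 1 byte per element):
    KV_BYTES_PER_TOKEN = 2 * CFG.num_layers * CFG.num_kv_heads * CFG.head_dim * 1
    print(KV_BYTES_PER_TOKEN)   # 65536 bytes = 64 KiB per token in this sketch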
rfw300 4 days ago [-]
I've no problem with the intuition. But I would hope for a lot more focus in the marketing materials on proving the (statistical) correctness of the implementation. A 15% inference-speed improvement is not worth switching to a completely unknown inference engine that hasn't been tested across a wide range of generation scenarios.
lukebechtel 4 days ago [-]
This is a fair critique! We plan to use our system to generate many more inference libraries of this nature, and I'll make it a point to release better, broader correctness measures when we do so.
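One simple form such a measure could take; ref_logprobs and new_logprobs are hypothetical helpers that return per-position (top_token, logprob) pairs from the reference and optimized engines:

    def compare_engines(prompts, ref_logprobs, new_logprobs, tol=1e-2):
        # Report the mean absolute logprob difference and the share of positions
        # whose argmax token agrees between the two engines.
        diffs, agree, total = [], 0, 0
        for p in prompts:
            for (rt, rlp), (nt, nlp) in zip(ref_logprobs(p), new_logprobs(p)):
                diffs.append(abs(rlp - nlp))
                agree += (rt == nt)
                total += 1
        mad = sum(diffs) / max(len(diffs), 1)
        return {"mean_abs_logprob_diff": mad,
                "top1_agreement": agree / max(total, 1),
                "within_tol": mad <= tol}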
hoerzu 3 days ago [-]
What's the jitter? What's the std? What about 1:1 output equality?
What's the POST request latency? What's the TTFT?
lukebechtel 3 days ago [-]
Good questions! It's clear I need to gather more metrics from our next generated inference library.
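For reference, the kind of harness that would answer those questions; stream_completion is a hypothetical generator that yields tokens from the server, assumed to yield at least one token per request:

    import time, statistics

    def measure(stream_completion, prompt, runs=20):
        # Time-to-first-token (TTFT) and total request latency over repeated
        # runs; jitter reported as p99 - p50 of total latency.
        ttfts, totals = [], []
        for _ in range(runs):
            t0, first = time.perf_counter(), None
            for tok in stream_completion(prompt):
                if first is None:
                    first = time.perf_counter() - t0
            ttfts.append(first)
            totals.append(time.perf_counter() - t0)
        q = statistics.quantiles(totals, n=100)
        return {"ttft_mean": statistics.mean(ttfts),
                "ttft_std": statistics.stdev(ttfts),
                "latency_mean": statistics.mean(totals),
                "latency_std": statistics.stdev(totals),
                "jitter_p99_minus_p50": q[98] - q[49]}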
acuozzo 4 days ago [-]
Luke: Do you have benchmarks for BF16?
lukebechtel 4 days ago [-]
Unfortunately, not at present; we went with FP8 because we believed it was generally the best tradeoff of quality and speed. It allowed faster iteration as well.
We believe our improvements would hold on BF16, but let me check.
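Rough back-of-the-envelope for the FP8 choice (simple arithmetic, not a measurement):

    params = 8e9                     # ~8B parameters
    gb_bf16 = params * 2 / 1e9       # BF16: 2 bytes/param -> ~16 GB of weights
    gb_fp8  = params * 1 / 1e9       # FP8:  1 byte/param  -> ~8 GB of weights
    print(gb_bf16, gb_fp8)           # halving weight traffic roughly doubles decode
                                     # speed when memory-bandwidth bound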
ismailmaj 3 days ago [-]
Any place we can find the code?
lukebechtel 3 days ago [-]
Unfortunately it hasn't been open sourced. We're debating how / when to do this right now.
ismailmaj 3 days ago [-]
Confusing, since this is specific to an architecture that no one making money will use (8B is consumer space, not enterprise).
The produced code shouldn't hold much interesting IP?