GodelNumbering 1 day ago [-]
As an inference-hungry human, I am obviously hooked. Quick feedback:
1. The models/pricing page should perhaps be linked from the top, as that is the most interesting part for most users. You mention some impressive numbers (e.g. GLM5 ~220 tok/s, $1.20 in · $3.50 out) but they are way down the page and many would miss them.
2. When looking for inference, I always look at three things: which models are supported, at which quantization, and what the cached input pricing is (this matters far more than headline pricing for agentic loops). The site covers the first but not the second and third. Would definitely like to know!
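The cached-input point can be made concrete with a quick sketch. Every price and token count below is a made-up assumption, not any provider's actual rate:

```python
# Sketch: why cached-input pricing dominates headline pricing in agentic loops.
# All prices (per million tokens) and token counts are hypothetical.
PRICE_IN = 1.20      # $/M uncached input tokens
PRICE_CACHED = 0.12  # $/M cached input tokens (often ~10x cheaper)
PRICE_OUT = 3.50     # $/M output tokens

def loop_cost(turns, context_tokens, new_in_per_turn, out_per_turn, cached):
    """Cost of an agentic loop that re-sends a growing context each turn."""
    total = 0.0
    ctx = context_tokens
    for _ in range(turns):
        if cached:
            # Previously seen context hits the cache; only new tokens pay full price.
            total += ctx * PRICE_CACHED / 1e6 + new_in_per_turn * PRICE_IN / 1e6
        else:
            total += (ctx + new_in_per_turn) * PRICE_IN / 1e6
        total += out_per_turn * PRICE_OUT / 1e6
        ctx += new_in_per_turn + out_per_turn  # context grows every turn
    return total

uncached = loop_cost(50, 8_000, 1_000, 500, cached=False)
cached = loop_cost(50, 8_000, 1_000, 500, cached=True)
print(f"50-turn loop: ${uncached:.2f} uncached vs ${cached:.2f} with cache hits")
```

With these assumed numbers the cached run is several times cheaper, which is why cache pricing, not headline pricing, decides the bill for long agent loops.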
2uryaa 1 day ago [-]
Thank you for the feedback! We will definitely redo the info on the front page to reorganize it and show quantizations better. For reference, Kimi and Minimax are NVFP4; the rest are FP8. But I will make this more obvious on the site itself.
bethekind 1 day ago [-]
I love the phrase "inference hunger"
nnx 1 day ago [-]
Are you `Ionstream` on OpenRouter?
If so, it would be great to provide more models through OpenRouter.
This looks interesting but not enough to make me go through the trouble of setting up a separate account, funding it, etc.
hazelnut 1 day ago [-]
second that.
for smaller startups, it's easier to go through one provider (OpenRouter) instead of the hassle of managing different endpoints and accounts. you might get access to many more users that way.
mid to large companies might want to go directly to the source (you) if they want to really optimize the last mile but even that is debatable for many.
vshah1016 11 hours ago [-]
Hey @nnx & @hazelnut, good question, but no, we're not IonStream on OpenRouter.
The purpose of IonRouter is to let people publicly see the speed of our engine firsthand. It makes the sales pipeline a lot easier when a prospect can just go try it themselves before committing. Signup is low friction ($10 minimum to load, and we preload $0.10) so you can test right away.
That said, we do plan to offer this as a usage-based service within our own cloud. We own every layer of the stack: inference engine, GPU orchestration, scheduling, routing, billing, all of it. No third-party inference runtime, no off-the-shelf serving framework. So there's no reason for us to go through a middleman.
No plans to be an OpenRouter provider right now.
jakestevens2 6 hours ago [-]
Since you're using GH200s for these optimizations, you're restricted to single-device workloads (the GH series is an SoC architecture). Kimi K2 (and many other large MoE models) requires multiple devices. Does that mean you can't scale these optimizations to multi-device workloads?
2uryaa 5 hours ago [-]
Hey Jake, we use GB200s for these workloads. Feel free to check those big models out on our site! We are doing Kimi, GLM, Minimax, etc.
Oras 1 day ago [-]
The problem is well articulated, and it's a nice story for both cofounders.
One thing I don't get is why anyone would use a direct service that does the same thing as others when services such as OpenRouter let you use the same model from different providers. I would understand if your landing page mentioned only fine-tuning and custom models, but just listing the same open-source models, TPS, and pricing doesn't tell me how you're different from other providers.
I remember using banana.dev a few years ago, and it had a very clear proposition at the time (serverless GPU with fast cold starts).
I suppose positioning will take multiple iterations before you land on the right one. Good luck!
2uryaa 1 day ago [-]
Hey Oras, thank you for the feedback! We could definitely list on OpenRouter, but as you point out, our end goal is to host fine-tuned models for individuals. The IonRouter product is mostly to showcase our engine. On the backend, we are multiplexing fine-tuned and open-source models on a homogeneous fleet of GPUs. So if you see similar or better performance on our cloud, we're already proving what we set out to show.
I do think we will lean harder into hosting fine-tuned models, though; this is a good insight.
thegeomaster 9 hours ago [-]
Tried on a few of our production prompts and got comparable speeds to what we normally get with Fireworks Serverless (Kimi K2.5), but at a better price. Rooting for you!
2uryaa 5 hours ago [-]
That's really awesome to hear!!
Frannky 1 day ago [-]
I have no idea how big the demand for fine-tuned models is. Is it big? Are people actively looking for endpoints for fine-tuned models? Why? Mostly out of curiosity; I personally never had the need.
What I want from an LLM is smart, super cheap, fast, and private. I wonder if we will ever get there. Like having a cheap Cerebras machine at home with oss 400B models on it.
2uryaa 5 hours ago [-]
For consumers, we want to just pass on the price-to-performance ratio. For enthusiasts and companies, we do see that people want their own models and the ability to use the massive amounts of data they have.
rationably 1 day ago [-]
From the Privacy Policy:
> When you use the Service, we collect and store:
> Input prompts and parameters submitted to the API
For how long and what for apart from the transient compliance/safety checks?
reactordev 1 day ago [-]
“Pricing is per token, no idle costs: GPT-OSS-120B is $0.02 in / $0.095 out, Qwen3.5-122B is $0.20 in / $1.60 out. Full model list and pricing at https://ionrouter.io.”
Man you had me panicking there for a second. Per token?!? Turns out, it’s per million according to their site.
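For anyone else who did the same double-take, the per-million convention works out like this. A tiny sketch using the GPT-OSS-120B numbers quoted above; the request sizes are made up:

```python
# Sanity check: "$0.02 in / $0.095 out" is per MILLION tokens, not per token.
# Prices come from the quote above; the request size is an illustrative assumption.
def request_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one request under per-million-token pricing."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

# A 4k-in / 1k-out request on the quoted GPT-OSS-120B pricing:
cost = request_cost(4_000, 1_000, 0.02, 0.095)
print(f"${cost:.6f}")  # fractions of a cent, not $175 per request
```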
Cool concept. I used to run a Fortune 500's cloud, and hot-and-ready GPU instances were the biggest ask. We weren't ready for that, cost-wise, so we would only spin them up when absolutely necessary.
2uryaa 1 day ago [-]
Haha sorry for the typo! Your F500 use case is exactly who we want to target, especially as they start serving finetunes on their own data. Thanks for the feedback!
reactordev 1 day ago [-]
The issue now is they are convinced OpenClaw can solve all their business process problems without touching Conway’s law.
nylonstrung 1 day ago [-]
Unless I misunderstood, it seems like this is trailing the Pareto frontier in cost and speed.
Compared to providers like Fireworks, and even accounting for the OpenRouter 5% charge, it's not competitive.
linolevan 1 day ago [-]
According to the providers that I keep track of, Cumulus is typically pretty price competitive, except for MiniMax where DeepInfra and Together are much cheaper and GLM-5 where DeepInfra and z.AI's own hosting is much cheaper.
(Also technically qwen3 8b w/ novita being first place but barely)
2uryaa 1 day ago [-]
Our SLA is actually higher and we are lower priced. We are also using this as a step toward serving fine-tuned models much more cheaply than Fireworks/Together, without the horrible cold starts of Modal. We're essentially trying to prove that our engine can hang with the best providers while multiplexing models.
jeff_antseed 23 hours ago [-]
the p50 latency gap is the thing i'd push on here. 1.46s vs 0.74s is a 2x difference and for interactive use cases that's basically a dealbreaker regardless of throughput wins.
curious how much of that is a fundamental tradeoff of the GH200 architecture vs something you're still optimizing. like, the coherent CPU-GPU link is genuinely interesting for batch workloads but i'd imagine the memory access patterns for single-request latency look pretty different.
the throughput numbers on VLM are impressive though. if your use case is async batch pipelines or offline processing, the cost math could work out well even at that p50.
2uryaa 5 hours ago [-]
Yep, we are actively working on getting this down. We can meet SLAs with tuning for the real time vision workloads but trying to get rid of this compromise is our next big development task.
Yes, we operate on GB200s and GH200s. Usually we are cheaper for many models and can get up to double the TPS.
ibgeek 1 day ago [-]
Since you are very focused on specific Nvidia hardware, I wonder if Nvidia would either buy you out to benefit from your tech or implement their own version without your involvement. Seems risky to me as a potential customer.
cmrdporcupine 1 day ago [-]
Very cool. I see that "Deploy your finetunes, custom LoRAs, or any open-source model on our fleet." is "Book a call" -- any sense of what pricing will actually look like here? This seems like where your approach wins out: the ability to swap in a custom model more easily and cheaply.
Just curious how close we are to a world where I can fine tune for my (low volume calls) domain and then get it hosted. Right now this is not practical anywhere I've seen, at the volumes I would be doing it at (which are really hobby level).
2uryaa 1 day ago [-]
We usually charge by GPU hour for those finetunes, around $8-10 depending on GPU type and volume! This is similar to Modal, but since the engine is fully ours, you don't wait ~1 min for cold starts. Ideally we will make onboarding super frictionless and self-serve, but we're onboarding people manually for now.
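For anyone weighing GPU-hour pricing against per-token serverless pricing, a rough break-even sketch. Every number here is an assumption for illustration, not a quoted rate:

```python
# Back-of-envelope: when does a dedicated $/GPU-hour finetune deployment beat
# per-token pricing? All numbers below are hypothetical assumptions.
GPU_HOUR = 9.00      # $/hour, mid-point of the $8-10 range mentioned above
PER_M_OUT = 1.50     # $/M output tokens on a hypothetical serverless tier
TOKS_PER_SEC = 100   # assumed sustained decode throughput for one finetune

tokens_per_hour = TOKS_PER_SEC * 3600
serverless_cost_per_hour = tokens_per_hour / 1e6 * PER_M_OUT
breakeven_tok_s = GPU_HOUR / PER_M_OUT * 1e6 / 3600  # tok/s where dedicated wins

print(f"{tokens_per_hour:,} tok/h -> "
      f"${serverless_cost_per_hour:.2f}/h serverless vs ${GPU_HOUR:.2f}/h dedicated; "
      f"break-even at ~{breakeven_tok_s:.0f} tok/s sustained")
```

At low, bursty volumes serverless wins on cost but may suffer cold starts; a dedicated GPU-hour deployment only pays off once sustained utilization is high.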
linolevan 1 day ago [-]
Can we get context length / output length docs (looks like you mention "Max tokens (chat)" of 128k but it's unclear what that means)? Also it looks like your docs page is out of date from your playground page.
Also, a piece of feedback: it kind of sucks to have GLM/MiniMax/Kimi on separate API endpoints. I assume it's a game you play to get lower routing latency for popular models, but from a consumer perspective it's not great.
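To illustrate the endpoint complaint: with per-model base URLs, client code ends up carrying a routing table like the sketch below. The URLs are hypothetical placeholders, not IonRouter's real endpoints:

```python
# With one endpoint per model family, clients must route requests themselves.
# These URLs are hypothetical placeholders, not any provider's real endpoints.
ENDPOINTS = {
    "glm-5":      "https://glm.example.com/v1",
    "minimax-m2": "https://minimax.example.com/v1",
    "kimi-k2":    "https://kimi.example.com/v1",
}

def base_url_for(model: str) -> str:
    """Pick the per-model endpoint; a single unified endpoint would make this a constant."""
    for name, url in ENDPOINTS.items():
        family = name.split("-")[0]  # match on the model family prefix
        if model.startswith(family):
            return url
    raise KeyError(f"no endpoint configured for {model!r}")

print(base_url_for("glm-5"))  # https://glm.example.com/v1
```

With a unified OpenAI-compatible endpoint, the model name alone selects the backend and this routing table disappears from every client.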
2uryaa 5 hours ago [-]
Thank you for the feedback. Taking note of this!
erichocean 1 day ago [-]
> what would make this actually useful for you?
A privacy policy that's at least as good as Vertex AI's at Google.
Otherwise it's a non-starter at any price.
2uryaa 1 day ago [-]
Also curious about this. We have a 30-day content retention policy, and we have to have access to your fine-tuned model/LoRA if we're deploying it. If there's anything we can change, happy to hear it out.
Oras 1 day ago [-]
What's unique about Vertex's privacy policy?
erichocean 1 day ago [-]
They don't read the things you send them, not even for "safety checks" or sys-admins accessing the system. Totally opaque (as it should be).
Keeping chat content around for 30 days might as well mean "forever." Anyone at the company could steal your customers' chats.
My agreements with customers would prevent me from using any service that did that.