vladgur 16 hours ago [-]
That window seat with the 14” laptop seems extremely claustrophobic.
That’s the real limitation on an economy flight - space rather than power or the internet… at least it would be for me.
The only times I was able to get my laptop out and do some productive work were when I was sitting in a premium economy aisle seat with room to spare, or when there was an empty seat next to me
rootusrootus 16 hours ago [-]
I'd probably choose the window seat myself, because while it is cramped, it is predictably so. When I sit in an aisle seat, it's not as cramped but I regularly get shoulder checked by passing people or beverage carts.
What really makes me nervous if I'm in an economy seat is the seat in front of me. Depending on how the seat is designed, if the person suddenly reclines (or hell, just flexes the seat a bunch while moving around), it can come pretty close to pinching the laptop screen. That would be bad news.
ryandrake 15 hours ago [-]
That was the first thing I thought of when I saw the image. That's a very expensive computer that you risk destroying when the 300lb guy in front of you decides to lean back.
The ergonomics of using a laptop on an economy-class tray table are not worth it. You're sitting there like a T-rex trying to make your arms as small as possible to tap on the keys. And the vertical viewing angle to your screen sometimes prevents you from even seeing anything. I wouldn't even bring my laptop out during a flight.
democracy 4 hours ago [-]
You just learn how to type with your toes. It is an easy skill to master and then comes in handy when you drive, or sit in important meetings (you can just keep coding away) and no one's the wiser.
sweetjuly 15 hours ago [-]
>The ergonomics of using a laptop on an economy-class tray table are not worth it. You're sitting there like a T-rex
The trick I've found is to pack a bluetooth keyboard. If you put your laptop on the tray table, you can put the bluetooth keyboard on your legs _under_ the tray table and have your arms fully and comfortably extended. This works especially well if you're a vim/emacs/other keyboard-driven editor user, as you very rarely need to reach up to poke the trackpad.
walthamstow 15 hours ago [-]
In the image it's on his lap, not the tray table. I agree, using the tray is not worth it. The ideal is a tray that folds in half so I can use that to hold a drink and keep the machine on my lap.
The tradeoff of poor comfort is insane productivity, for me anyway. Being restricted in place, no wifi, inconvenient toilet breaks, not in control of meal times, all means I get a lot of work done
Der_Einzige 14 hours ago [-]
Obese people (250lb+) shouldn't even be allowed in Economy.
hansvm 14 hours ago [-]
Maybe.
On the other hand, in economy on some planes I just literally don't fit in a forward direction because of femur length and cycling muscles, I don't fit in a sideways direction either because of broad shoulders and arm muscles, and I don't comfortably fit vertically on some planes with fixed-position headrests which push into the middle of my shoulder blades and have me hunched the whole flight.
I'm also not _that_ big. I'm 6'2" and have lived my life moderately actively. That's it. I'm biased, but I believe economy should be designed so that I can fit too.
If you agree with that premise, that'd leave plenty of space for most 250lb people too, and there'd be no reason to exclude them.
zrail 14 hours ago [-]
Sweet, free upgrades!
Edit to be slightly less obtuse: surely you're not implying that a common carrier be allowed to discriminate based on facts about a passenger's body without making reasonable accommodations. Surely you're not implying that obese people not be allowed to fly at all. Surely you're not suggesting that fat people should just remove themselves from society so you don't have to deal with them.
Therefore, obese people should get free upgrades to economy plus or better. Thanks for the idea!
0xffff2 10 hours ago [-]
I'll imply those things. "If you don't fit in the seat, you should have to buy two seats" is a not-very-controversial opinion on the internet, IMO. I think that opinion basically violates all of your "surely"s already.
Where do you draw the line? A 250lb person probably mostly fits in their seat still, but at some point a person is just physically going to take up two seats. Do you really think the airline should be responsible for flying them in business class (premium economy doesn't give you more width on most/all airlines)? Does it matter if their weight is due to a medical condition or just laziness? What if they're so big that even a first class seat won't contain them?
rootusrootus 9 hours ago [-]
The issue is for the airline to solve, since they are the ones trying to make seats comically small.
Also, you have to include other attributes. E.g. not my problem that you have freakishly long legs; if you have to prevent me from reclining, then maybe you should have to pay for premium economy. And what if you are broad-shouldered? Same deal, not my problem; you have to stay inside the boundaries of your own seat.
I would rather we used regulation to make economy seats a bit larger. Call it a safety issue, since it is.
0x1ch 13 hours ago [-]
Man, I hate being rude because I myself weighed 230lbs once upon a time, I get it. I just dealt with a 200lb+ man who spread his legs past the arm rests. Pissed me off the whole flight because I had to contort my body in my own seat so he wasn't spilling into me.
phantomathkg 5 hours ago [-]
How about having a law that bans super-tight economy seats?
verzali 40 minutes ago [-]
What are you, a communist? /s
bs7280 15 hours ago [-]
I have a 16" M1 Max that I only got because it was $1500 cheaper than MSRP, and it sucks on planes. I have really long arms and I can barely get it out of my bag without elbowing my neighbor.
A few years ago I saw some very interesting custom ergonomic setups optimized for traveling + flying.
One person with a thinkpad is able to get the monitor to be 180 degrees flat w/ the keyboard, and can hang it off the seat. He also brings a split ergo keyboard with a lap mount.
Another person did something similar with an M1 laptop, but uses an iPad as the external monitor (the laptop stays in the bag), with a split ergo keyboard designed and built from scratch.
zdw 15 hours ago [-]
That's a 16" (from the size of the speaker grille on each side of the keyboard), so even more claustrophobic.
stavros 15 hours ago [-]
I got some Xreal glasses and it's made flights so much more enjoyable. I can watch movies or work on something lying back, and the "screen" looks massive.
JSR_FDED 15 hours ago [-]
I’ve been so tempted but some of the reviews say it’s not good for reading code. What’s been your experience? What is the effective resolution of the screen you get? Is it sharp enough for coding?
garethsprice 13 hours ago [-]
Currently working on an Amtrak with XReal One Pro glasses and a ThinkPad bluetooth keyboard from my Macbook Pro that is folded up in the seat pocket.
They are "OK enough" that it will be a matter of taste if they are acceptable or not at this point for you to use.
For coding they work fine for me, terminal tools work particularly well as I can bump the font size up. IDEs and web browsing aren't bad either, it's about the equivalent of a single 1080p screen. They are nicer than hunching over a laptop for travel use but I still prefer a proper monitor when available.
The optics are a generation or two from being where they need to be to market these as productivity devices, but if you like being an early adopter with all the quirks that come with it, they're fun.
mrbonner 14 hours ago [-]
Don’t listen to anyone saying it is fine for reading or writing extensively with the Xreal. I have one and it is a PITA to do that over a long period. You're better off sticking to watching videos or playing games with it.
SkyPuncher 14 hours ago [-]
They're like reading off a projector. It's not very good, but it's better than awkwardly staring at the computer screen.
stavros 15 hours ago [-]
It's a definite "it depends". The resolution is fine, but I think it's more about the specific pair of glasses you get? I got the same model three times (long story), and the first two were fine, the third has some blurring in the middle of the right eye.
It's also uncomfortable to look at the very bottom of the screen (which is where all the chat text boxes are), so I usually resize all my windows to be a bit smaller. With that, it's very good (and you can always just increase the font size).
I would like glasses with smaller fov, so I didn't have to look around so much, but that's probably just me, since everyone else likes them larger.
asimovDev 40 minutes ago [-]
My 14in M3 Max turns the fans way up when running local agentic coding, too loud for me to be comfortable using it in a public place.
gck1 24 minutes ago [-]
I don't use my M3 Max in public, but running local models on it makes me uncomfortable because of fans too.
I also have a Dell laptop that spins up its fans if I open a text editor, and that feels normal, but on the Mac, I feel like the fans spinning up means I'm somehow abusing it and shortening its lifespan.
cube00 13 hours ago [-]
> MacBook cable: 94W delivered
> Return flight will test this with the correct cable. I expect at least 16% improvement against the 70W cap
Some plane sockets cut out completely if you attempt to draw more than the limit, rather than continuing to provide power at the limit.
davidcox143 13 hours ago [-]
The MBP also throttles surprisingly easily. I have a 16 MBP and ended up buying a cooling stand that uses a 20w peltier cooler (solid state heat pump). It’s fixed the throttling completely, although I’m somewhat nervous about forgetting to turn the cooler off and creating condensation inside the case…
ntcho 5 hours ago [-]
Have you tried MacsFanControl? The MBP's fans rarely turn on by themselves, but controlling them yourself makes throttling a non-issue.
djsavvy 4 hours ago [-]
I’ve long suspected as such. Is there a way to power limit what my macbook pulls from the outlet?
deanc 16 hours ago [-]
This has been exactly my experience too. I've tried multiple harnesses (pi, claude code, codex) with multiple variants of qwen3.6 and gemma4, driven by both mlx and ollama, and every single time I try to do anything meaningful I end up in a loop. On a 64GB MacBook Pro M3 Max.
I really don't know what the hell people are doing locally, and suspect a lot of the hype around running these models locally is bullshit. Sure, you can make it do something but certainly nothing useful or substantial.
sleepyeldrazi 14 hours ago [-]
I have been testing and using Qwen3.6 27B (running from my 3090) since it dropped and I genuinely think this is the first consumer hardware-grade model that can actually replace frontiers for a lot of workloads.
I ran 8 tests on a variety of open-weights models, and opus 4.7 (1mil ctx version) and the little dense model was right behind it: https://github.com/sleepyeldrazi/llm_programming_tests/tree/...
Of note is that opus was the only model to push back against the spec on the hardest challenge, saying 'that's not possible', when there are links in the spec to examples of it being done.
There may be problems with the mlx versions, as I haven't had any looping in all the testing I've done, which is all my agentic and coding work over the last couple of days (since it dropped). I have had tool_call misses 4 or 5 times so far, which isn't ideal, but no looping. First I used it in pi-mono, and later, when I realized it's a serious model, I switched to opencode.
My setup is llama.cpp running on a 3090 in WSL, unsloth IQ4_NL with those flags:
--ctx-size 128000 \
--jinja \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--repeat-penalty 1.0 \
--presence-penalty 0.0 \
--threads 12 \
--gpu-layers 99 \
--no-warmup \
--no-mmap \
-fa on
Does anyone have tips to optimise prompt processing, as that's the slowest part? It takes a few minutes before OpenCode with ~20k initial context first responds, but subsequent responses are pretty fast due to caching.
note: 27b is going to be slow; use the 35b MoE if you want decent token/sec speed.
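For anyone reproducing this, the flags above would hang off a llama-server invocation roughly like the following. The GGUF filename is a placeholder, not the commenter's actual file; llama-server serves an OpenAI-compatible API (port 8080 by default).

```shell
# Sketch of the described setup; model filename is a placeholder.
llama-server \
  -m ./qwen3.6-27b-IQ4_NL.gguf \
  --ctx-size 128000 \
  --jinja \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --repeat-penalty 1.0 \
  --presence-penalty 0.0 \
  --threads 12 \
  --gpu-layers 99 \
  --no-warmup \
  --no-mmap \
  -fa on
```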
sleepyeldrazi 9 hours ago [-]
I haven't honestly dug around to figure out if there's a hardware reason for it, but prompt processing has always been a lot slower for me on macs in general. I mostly use MLX on my 24GB M4 Pro though, so I will pull llama.cpp on it as well to see what the prefill is like.
I've gotten around 16 t/s generation with 4bit and mxfp4 on that model. The 3090 I mentioned has a little over 900 GB/s, while those Macs I think are around 270 GB/s. If my understanding is correct, Macs do utilize the bandwidth better in this case, but it still doesn't make up the difference (on the 3090 it's around 30-35 t/s depending on the size of ctx).
Also, do run a quick experiment removing the cache quants if you want to tinker with it a bit more, iirc KV quant does add a small overhead during prefill.
I would be very interested to know your prefill and generation numbers.
cadamsdotcom 5 hours ago [-]
> I have been testing
With local models which are often benchmaxxed, testing unfortunately isn’t as predictive as you’d like.
usagisushi 13 hours ago [-]
If the "loop" you mean is the infinite reasoning cycle ("Wait, actually... On second thought..."), you might want to try setting a reasoning budget. For llama.cpp, use `--reasoning-budget 1024 --reasoning-budget-message "Proceed to final answer."` to force the model to reach a conclusion.
I admit I sometimes get caught up in the tooling for its own sake, but I find local models useful for specific tasks like migrating configuration schemas, writing homelab scripts, or exploring financial data.
It might sound a bit paranoid, but privacy is another major driver for me. Keeping credentials and private information off cloud services is worth the extra friction.
mtrifonov 6 hours ago [-]
This matches what I've seen. It's a variance-on-hard-tasks problem.
Local 30B models do fine on isolated, well-scoped tasks. Autocomplete, single-function refactors, naming, type fixes, stuff like that. They fall apart on anything that requires sustained reasoning over multiple steps.
Frontier cloud models stay coherent across the chain because they have higher peak capability AND lower variance per step. Local models compound their per-step variance and end up in loops.
The "run everything locally" hype mostly assumes you're doing the easy half. Once you try the hard half, the cloud-vs-local ratio is way bigger than the parameter-count comparison suggests.
The play is hybrid routing. Local for the easy stuff, cloud for the hard stuff. Anyone doing "100% local" at scale is either cherry-picking tasks or eating losses they don't measure, or just straight up full of shit.
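The hybrid-routing idea above can be sketched in a few lines. Everything here is illustrative: the endpoints are placeholders and the scope heuristic is just one plausible rule, not a measured threshold.

```python
# Hedged sketch of hybrid routing: keep small, well-scoped tasks on a
# local endpoint; escalate multi-step work to a cloud model. The URLs
# and the heuristic are assumptions for illustration only.
from dataclasses import dataclass

LOCAL_URL = "http://localhost:8080/v1"    # e.g. a local llama-server (assumed)
CLOUD_URL = "https://api.example.com/v1"  # hypothetical frontier endpoint

@dataclass
class Task:
    prompt: str
    files_touched: int    # rough proxy for scope
    needs_planning: bool  # sustained multi-step reasoning required?

def route(task: Task) -> str:
    """Pick a backend: local models compound per-step variance across
    long chains, so anything that chains many steps goes to the cloud."""
    if task.needs_planning or task.files_touched > 3:
        return CLOUD_URL
    return LOCAL_URL

print(route(Task("rename this variable", files_touched=1, needs_planning=False)))
# -> http://localhost:8080/v1
print(route(Task("refactor the auth flow", files_touched=12, needs_planning=True)))
# -> https://api.example.com/v1
```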
rgovostes 5 hours ago [-]
And was this written by a local model or a frontier cloud one?
NitpickLawyer 15 hours ago [-]
> a lot of the hype around running these models locally is bullshit. Sure, you can make it do something but certainly nothing useful or substantial.
There is certainly a lot of hype around local models. Some of it is overhype, and some of it is just people finding out and discovering what cool stuff you can do. I suspect the post is a reply to the other one a few days ago, where someone from HF posted a pic of themselves on a plane, using a local model, and said it's really, really close to opus. That was BS.
That being said, I've been working with local LMs since before chatgpt launched. The progress we've made from the likes of gpt-j (6B) and gpt-neoX (22B) (some of the first models you could run on regular consumer hardware) is absolutely amazing. It has gone way above my expectations. We're past "we have chatgpt at home" (as it was when launched), and now it is actually usable in a lot of tasks. Nowhere near SotA, but "good enough".
I will push back a bit on the "substantial" part, and I will push a lot on "nothing useful". You can, absolutely get useful stuff out of these models. Not in a claude-code leave it to cook for 6 hours and get a working product, but with a bit of hand holding and scope reduction you can get useful stuff. When devstral came out (24B) I ran it for about a week as a "daily driver" just to see where it's at. It was ok-ish. Lots of hand holding, figured out I can't use it for planning much (looked fine at a glance, but either didn't make sense, or used outdated stuff). But with a better plan, it could handle implementation fine. I coded 2 small services that have been running in prod for ~6mo without any issues. That is useful, imo. And the current models are waaay better than devstral1.
As to substantial, eh... Your substantial can be someone else's taj mahal, and their substantial could be your toy project. It all depends. I draw the line at useful. If you can string together a couple of useful tasks, it starts to become substantial.
ryandrake 16 hours ago [-]
Same here. Every time a new local model comes out, I give it a spin with a pretty vanilla coding task ("refactor this method to take two parameters instead of one", or "fix this class of compiler warning across the ~20 file codebase") and more often than not, they get in endless loops, or fail in very unusual ways. They don't yet even approach the usefulness of SOTA models. It's obviously not a fair comparison, though. My 20GB GPU is never going to beat whatever enormous backend Google or Anthropic have.
2ndorderthought 16 hours ago [-]
You can do this with really small models, but you have to do a bit more legwork. I wouldn't expect most trivially small models to handle anything more than 1 file reliably. The new qwen 3.6 is different though; I have heard cases where it behaves close to sonnet.
That said, I don't see why people are so scared to touch code, even if it saves them 500 euro a month. Using my IDE's find across my repo and auto-replacing 2 patterns is trivial and way faster to do by hand. I mostly use small models; it prevents a lot of the issues I've seen with large models and vibe/agentic coding medium to long term. I also write a lot of code.
I've been using qwen3.5 (122b) with claude code for months, and it's definitely dumber than sonnet/opus, but it works through things reasonably well (i.e. writes half-decent code and tool calls usually work), and I pretty much never run into loops now.
and make sure you're following Unsloth's recommendations for temperature/etc.
proxysna 16 hours ago [-]
You need to set sampling parameters for the llm. Had the same issue with Qwen3.5 when i first started. You can grab them off the model card page usually.
From Qwen3.6 page:
Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
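For reference, those recommended parameters map onto a chat-completions request to a local OpenAI-compatible server (llama-server, vLLM, etc.) roughly like this. The URL and model name are placeholders; `top_k`/`min_p` are extensions that llama.cpp accepts but the strict OpenAI schema does not, and the model card's `repetition_penalty` is spelled `repeat_penalty` in llama.cpp.

```python
# Sketch: the "thinking mode" sampling settings as a request payload.
# Endpoint and model name are assumptions, not from the thread.
payload = {
    "model": "qwen3.6",  # placeholder model name
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "presence_penalty": 0.0,
    # llama.cpp-style extension fields:
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.0,
}

# To actually send it (server assumed at localhost:8080):
# import requests
# r = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
```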
Yes, I have tried all of these (as per the docs). Have you actually tried them? Because I have tried all 3 configurations you mentioned with agentic coding and get the same result: loops.
proxysna 14 hours ago [-]
I've used only Qwen3.5 so far for work and was, after initial struggles, successful with a GPU setup, no mlx. Ngl, the fact that they are using `presence_penalty: 0` and no `max_tokens` is weird, since that exact setup caused my "initial struggles", but I've set up a simple docker-compose with vllm and qwen3.6 right now to test it out and it worked perfectly fine for me.
min_p author here. min_p is strictly better than top_p and top_k. The big labs don't know shit about sampling, and give absolutely nuts recommendations like this.
set min_p to like 0.3 and ignore top_p and top_k and you'll be fine.
There's better samplers now like top N sigma, top-h, P-less decoding, etc, but they're often not available in your LLM inference engine (i.e. vLLM)
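The min-p rule being described is simple enough to sketch: keep every token whose probability is at least `min_p` times the probability of the most likely token, then renormalize. A toy pure-Python version (the logit values are made up for illustration):

```python
# Minimal min-p truncation over a toy logit vector: the cutoff scales
# with the top token's probability, which is what makes it
# "partially distribution aware" compared to a fixed top-p/top-k.
import math

def min_p_filter(logits, min_p=0.3):
    """Return renormalized probabilities after min-p truncation."""
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]  # softmax, numerically stable
    total = sum(probs)
    probs = [p / total for p in probs]
    threshold = min_p * max(probs)             # cutoff relative to top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    z = sum(kept)
    return [p / z for p in kept]

probs = min_p_filter([2.0, 1.0, -1.0, -5.0], min_p=0.3)
# The two low-probability tail tokens are zeroed; the rest renormalize to 1.
```

With a confident distribution the threshold is high and the tail is cut aggressively; with a flat distribution the threshold drops and more candidates survive.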
JSR_FDED 6 hours ago [-]
I’m wondering though, what does extra creativity in code generation actually look like? How is the creativity expressed in code? Does the LLM reach for Bubble Sort instead of Quicksort? Maybe it decides that sorting only the first 10 elements of an array is enough? Funny variable names? Cursing in comments?
Der_Einzige 4 hours ago [-]
In this case, we are not arguing that min_p is better for "creative code" (you really don't want high temperature anywhere near your code generation, despite the "turning up the heat" framing of our paper) - at least in my post claiming min_p is strictly better than top_p above.
We are instead arguing that min_p handles truncating tokens that are more likely to lead to degeneration/looping because it is partially distribution aware. Fully distribution aware samplers like the ones I mentioned above (i.e. P-less decoding) are strictly superior due to using the whole distribution to decide the truncation at every time step.
Code hallucinations, like many LLM hallucinations, can be seen as accumulation of small amounts of "sampling errors".
proxysna 10 hours ago [-]
Cool, I am mostly a plumber for these things, but do you have any sort of reading I can go through to wrap my head around it to some degree?
bachmeier 14 hours ago [-]
> Sure, you can make it do something but certainly nothing useful or substantial.
It works great for me. But I like to review the code and understand what it's doing, which doesn't appear to be how people do "useful or substantial" programming these days.
2ndorderthought 12 hours ago [-]
Every time I am on here I am baffled by how many people just spin the wheel these days. The most important part of the SDLC for me is having humans involved in the code base. You can't plan improvements, features, refactors, etc. if you don't know what the code looks like. But here we are, I guess.
bityard 14 hours ago [-]
Hosted models are big, and there is a lot going on behind the scenes that we users have no visibility into. OpenAI, Anthropic, Google, etc do much more than just feed raw prompt tokens straight into a big 1-2TB static model and pipe the output tokens back to the web browser. The result of this is that they can do more, and end-users can get away with a lot more in terms of vague prompts and missing background.
The biggest lesson I've learned working with local models so far is: with the smaller models, you have to understand their limitations, be willing to run experiments, and fine-tune the heck out of everything. There are endless choices to be made: which model to use, which quant, thinking or not, sampling parameters, llama.cpp vs vLLM, etc. They're much more fiddly for serious work than just downloading Claude Code and having it one-shot your application. But some of us enjoy fiddling, so it all works out in the end.
2ndorderthought 12 hours ago [-]
I've done zero fine tuning in the local models I use. I also didn't do a lot of experiments except asking the 4 or 5 I downloaded what version of x package was the newest. For my work flows small models are king.
2ndorderthought 16 hours ago [-]
In the article the author describes what they made. It's definitely not bullshit, but it's also not as reliable or as handsfree as the 1t models.
For people who aren't completely vibe- or agent-coding, these models are better than, say, copilot or the free models that appear after a Google search. Probably better than chatgpt's flagships in some ways.
I mostly use 4b to 9b models for basic inquiries and code examples from libraries I haven't used before. Many of them can solve pretty hard math problems, and these are several steps away from say qwen3.6.
I would not discount running models locally. It's the best case scenario of a future with LLMs from a human rights and ecological perspective.
mft_ 15 hours ago [-]
I’m frequently surprised how little I can find online about exactly this - different harnesses for local models and how to set them up. The documentation for opencode with local models is (IMO) pretty bad - and even Claude Opus (!) struggled to get it running. And so far I’ve not found a decent alternative to Claude Desktop.
(I’ve recently discovered that you can pipe local models into Claude Code and Claude Desktop, so this is on my list to try).
2ndorderthought 15 hours ago [-]
Qwen3.6 is brand new. But also, search engines are so plastered with AI slop that is written by tools and companies that have no interest in you using local models. Ollama makes it 1 command to run local small models, but with the newest ones there can be kinks to work out first.
r/localllama is okay for some information, but beyond that there is so much noise and very little signal. I think it's intentional.
mft_ 14 hours ago [-]
Thanks. I’ve been experimenting with local models for over a year now, on and off, so this isn’t just limited to the latest Qwen. Anyway, I have no problem running them, but there’s a huge difference between running something via a chat interface and running it a la Claude Code so that it can interact with the local environment and create/edit files. This is the aspect that’s difficult, in my experience.
RALaBarge 14 hours ago [-]
It’s all about tooling: if the AI can fetch data, it can do something rad with it. Use something like an AI harness with an MCP server and other tooling to improve the harness and the tools. I made this for my own learning: github.com/ralabarge/beigebox
slowmovintarget 11 hours ago [-]
Task-driven repo, clear your context (restart the harness), check the results.
Don't try for a rambling session where you let the thing grind for hours on a huge system. It will predictably choke or end in those loops. But do a few small chunks of work, exit the harness, then pick up the next few small chunks... It doesn't feel as magical, but it seems to be more effective, even when your model is Claude.
xienze 15 hours ago [-]
It's probably a combination of things:
* New models running in llama.cpp (what's under the hood of ollama et al) frequently require bug fixes.
* The GGUF models that run in llama.cpp frequently require bug fixes (Unsloth is notorious for this -- they release GGUF models about 10 minutes after official .safetensors releases).
* You're probably running a <Q8 quantization of the model, and there's a good chance a <BF16 quantization for the KV cache. This makes for compounding issues as context grows and tool calls multiply.
Local models really are great but I think a major problem are the people in groups like r/localllama who run models at absurd quantization levels in order to cram them on their underpowered hardware and convince themselves that they're running SOTA at home.
The best way to run these models is, frankly, a lot of VRAM and vLLM (which is what the people developing these models are almost certainly targeting).
scastiel 15 hours ago [-]
Interesting, I did and documented the same kind of experiment a few months ago [1]; it looks like so much has changed since then!
Yes. The author is really sloppy if that wasn’t clear from the article.
Johnny_Bonk 16 hours ago [-]
So I have an RTX 3080 with 10GB VRAM, which I've been using with Qwen2.5 Coder and Gemma 4 E2B. I'm wondering what models you have tried, and with which quants.
mikeatlas 16 hours ago [-]
yes
tamimio 15 hours ago [-]
Can’t wait for more people to do the same and laptops eventually getting banned on board due to fear of them catching fire..
seattle_spring 14 hours ago [-]
With more and more flights offering Starlink, I don't see why this would really ever be necessary.
Also, agreed with the other commenters: just read a damn book and take a nap.
AntiUSAbah 14 hours ago [-]
I hope starlink will not be the default.
Not only should we not support someone like Elon Musk, but also, don't you find it hypocritical to respond with 'just read a damn book' while suggesting Starlink?
builderminkyu 15 hours ago [-]
tried doing exactly this with ollama on a cross-country flight last month. my macbook basically turned into a jet engine and the battery died in under an hour.
curious if you had to heavily throttle the cpu or stick to super small quants (like 4 bit phi3) to actually make it through 10 hours without a power outlet?
walrus01 15 hours ago [-]
As much as it's a fun gimmick to run a relatively good sized LLM like qwen 3.6 35B locally, I would much rather have the ability to run it remotely on a piece of hardware I control via VPN session. Much better on battery life and heat. If I'm on an airplane I care about having as much battery life as possible.
Let's say you have a basic setup like llama.cpp and llama-server on a remote server (even if it's just sitting under your home office desk) running a 35GB Q8 quantized model of qwen 3.6 35B, it's not difficult to make llama-server available to your laptop over just about any form of internet connection and VPN.
Having the ability to run that same model locally is nice if you really need it because no internet connection whatsoever is available, but the times when you simultaneously have no internet and a serious need for something the model can output are fairly rare these days.
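A minimal sketch of that remote setup, assuming Tailscale or WireGuard with a hostname of `homebox` (placeholder) and llama-server's default OpenAI-compatible endpoint:

```shell
# On the home-office box (model filename is a placeholder):
llama-server -m ./qwen3.6-35b-q8_0.gguf --host 0.0.0.0 --port 8080

# From the laptop, over the VPN:
curl http://homebox:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hello"}]}'
```

Binding to the VPN interface only (rather than `0.0.0.0`) would be the safer choice if the box also sits on an untrusted network.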
garethsprice 9 hours ago [-]
This is what I am doing. It is rare that I'm in a situation with no Internet while traveling, but very often there is an intermittent connection. Using local models or even hosted foundation models is a frustrating exercise in cancelled jobs and timeouts, but Tailscale + mosh + tmux is a really nice way to connect to a workstation and resume from where the session left off, or leave it running doing its thing and come back to it later.
Same with running my local dev environment's docker containers, now they run on that workstation and my battery life is far higher, treating my portable device as a dumb terminal.
datadrivenangel 5 hours ago [-]
Agreed. I got a beefy M5 MBP for local llms and for sustained inference it gets hot enough that I worry it may end up shortening its life.
j1000 15 hours ago [-]
To be honest, I think the possibility to work while travelling is a con rather than a perk of current times.
HoldOnAMinute 15 hours ago [-]
They keep removing the ability for you to have any downtime.
ryandrake 15 hours ago [-]
It hit different at different points in my life. When I was in my 20s I thought "Wow! I get to go on an international trip to a place I've never been, and work is paying for everything?!? I'll go whenever you need me to go!" Now that I'm almost 50, it's "Fuck. Another 14 hour international flight, to somewhere I'll likely only have time to see the inside of two buildings. What's the local language again? Do I drive on the left or right? Wait, how long do I need to stay? Please no."
jamesu 14 hours ago [-]
I'm not a huge fan of working while travelling, it's way too distracting while trying to do intricate development work.
JDPy 14 hours ago [-]
[dead]
bobro 16 hours ago [-]
Can’t you guys just read a book and take a nap?
3form 16 hours ago [-]
I suppose the ones that do, wouldn't consider such a turn of events postworthy.
cpursley 15 hours ago [-]
I'm jealous of people who can actually get comfortable enough to sleep on flights.
koolba 15 hours ago [-]
With enough drinks and a long enough flight, it’s unavoidable.
fernie 15 hours ago [-]
The keyword being "comfortable".
Most certainly avoidable, unfortunately.
15 hours ago [-]
mdni007 15 hours ago [-]
But then how can I show random people how productive I am?
stavros 15 hours ago [-]
Why would I do that when making things is so much fun?
dude250711 14 hours ago [-]
If you nap, then you might end up living in a world where someone else is making the world a better place better than you are.
ducttape12 15 hours ago [-]
Yeah, for real. Imagine being so addicted to the AI slot machine that you can't be without it for 10 hours.
fhn 10 hours ago [-]
Imagine criticizing people for what they do with their time.
yjadsfgasdf 8 hours ago [-]
[dead]
yjadsfgasdf 8 hours ago [-]
[dead]
itsuckslol 11 hours ago [-]
[dead]
bilekas 15 hours ago [-]
Trying an LLM in the air with a 6.200 EUR laptop... Sorry if it's not exactly relatable.
dgacmu 13 hours ago [-]
on one hand, I'm typing this on a 5 year old M1 MBP that I still can't bring myself to replace because it just continues to do the things I need;
on the other hand, $6200 every few years is pretty tiny compared to a typical US developer salary, so is this really that crazy if it's your primary work machine?
zahlman 14 hours ago [-]
You got downvoted, but my reaction was much the same honestly.
bilekas 10 hours ago [-]
I don't mind being downvoted, if we all had the same opinion it would be quite boring around here. I guess I'm just jaded from AI news.
The tradeoff of poor comfort is insane productivity, for me anyway. Being restricted in place, no wifi, inconvenient toilet breaks, not in control of meal times, all means I get a lot of work done
On the other hand, in economy on some planes I just literally don't fit in a forward direction because of femur length and cycling muscles, I don't fit in a sideways direction either because of broad shoulders and arm muscles, and I don't comfortably fit vertically on some planes with fixed-position headrests which push into the middle of my shoulder blades and have me hunched the whole flight.
I'm also not _that_ big. I'm 6'2" and have lived my life moderately actively. That's it. I'm biased, but I believe economy should be designed so that I can fit too.
If you agree with that premise, that'd leave plenty of space for most 250lb people too, and there'd be no reason to exclude them.
Edit to be slightly less obtuse: surely you're not implying that a common carrier be allowed to discriminate based on facts about a passenger's body without making reasonable accommodations. Surely you're not implying that obese people not be allowed to fly at all. Surely you're not suggesting that fat people should just remove themselves from society so you don't have to deal with them.
Therefore, obese people should get free upgrades to economy plus or better. Thanks for the idea!
Where do you draw the line? A 250lb person probably mostly fits in their seat still, but at some point a person is just physically going to take up two seats. Do you really think the airline should be responsible for flying them in business class (premium economy doesn't give you more width on most/all airlines)? Does it matter if their weight is due to a medical condition or just laziness? What if they're so big that even a first class seat won't contain them?
Also, you have to include other attributes. E.g. Not my problem that you have freakishly long legs, if you have to prevent me reclining then maybe you should have to pay for premium economy. And what if you are broad shouldered? Same deal, not my problem, you have to stay inside the boundaries of your own seat.
I would rather we used regulation to make economy seats a bit larger. Call it a safety issue, since it is.
A few years ago I saw some very interesting custom ergonomic setups optimized for traveling + flying.
One person with a ThinkPad is able to fold the screen 180 degrees flat with the keyboard and hang it off the seat. He also brings a split ergo keyboard with a lap mount.
Another person did something similar with an M1 laptop, but uses an iPad as the external monitor (the laptop stays in the bag), with a split ergo keyboard built and designed from scratch.
They are "OK enough" that whether they're acceptable to use is, at this point, a matter of taste.
For coding they work fine for me, terminal tools work particularly well as I can bump the font size up. IDEs and web browsing aren't bad either, it's about the equivalent of a single 1080p screen. They are nicer than hunching over a laptop for travel use but I still prefer a proper monitor when available.
The optics are a generation or two from being where they need to be to market these as productivity devices, but if you like being an early adopter with all the quirks that come with it, they're fun.
It's also uncomfortable to look at the very bottom of the screen (which is where all the chat text boxes are), so I usually resize all my windows to be a bit smaller. With that, it's very good (and you can always just increase the font size).
I would like glasses with smaller fov, so I didn't have to look around so much, but that's probably just me, since everyone else likes them larger.
I also have a Dell laptop that spins up its fans if I open a text editor, and it feels normal, but on a Mac, I feel like fans spinning up means I'm somehow abusing it and shortening its lifespan.
The return flight will test this with the correct cable. I expect at least a 16% improvement against the 70W cap.
Some plane sockets cut out completely if you attempt to draw more than the limit, rather than continuing to provide power at the limit.
I really don't know what the hell people are doing locally, and suspect a lot of the hype around running these models locally is bullshit. Sure, you can make it do something but certainly nothing useful or substantial.
I ran 8 tests on a variety of open-weights models; Opus 4.7 (1M-ctx version) came out on top, and the little dense model was right behind it: https://github.com/sleepyeldrazi/llm_programming_tests/tree/... Of note, Opus was the only model to push back against the spec on the hardest challenge, saying "that's not possible", when there are links in the spec to examples of it being done.
There may be problems with the MLX versions, as I haven't had any looping in all the testing I've done, which is all my agentic and coding work over the last couple of days (since it dropped). I have had tool-call misses 4 or 5 times so far, which isn't ideal, but no looping. First I used it in pi-mono, and later, when I realized it's a serious model, I switched to opencode.
My setup is llama.cpp running on a 3090 in WSL, unsloth IQ4_NL with these flags: --ctx-size 128000 --jinja --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 --threads 12 --gpu-layers 99 --no-warmup --no-mmap -fa on
note: 27b is going to be slow; use the 35b MoE if you want decent token/sec speed.
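For readability, here is the same setup spelled out as one command. I'm assuming llama-server and a placeholder model path; the flags themselves are copied verbatim from the setup described above:

```shell
# Placeholder model path; all flags are from the setup described above.
./llama-server \
  --model ./models/unsloth-IQ4_NL.gguf \
  --ctx-size 128000 \
  --jinja \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --repeat-penalty 1.0 --presence-penalty 0.0 \
  --threads 12 --gpu-layers 99 \
  --no-warmup --no-mmap -fa on
```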
I've gotten around 16 t/s generation with 4-bit and mxfp4 on that model. The 3090 I mentioned has a little over 900 GB/s of memory bandwidth, while those Macs are, I think, around 270 GB/s. If my understanding is correct, Macs do utilize the bandwidth better in this case, but it still doesn't make up the difference (on the 3090 it's around 30-35 t/s depending on ctx size).
Also, do run a quick experiment removing the cache quants if you want to tinker with it a bit more, iirc KV quant does add a small overhead during prefill.
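As a rough sanity check on those numbers: for a dense model, decode speed is bounded by memory bandwidth divided by the bytes streamed per token (roughly the quantized weight size). A back-of-the-envelope sketch, with an illustrative efficiency fudge factor that is an assumption, not a measurement:

```python
def rough_decode_tps(bandwidth_gbps: float, model_bytes_gb: float,
                     efficiency: float = 0.6) -> float:
    """Rough tokens/sec estimate for a dense model: each generated token
    streams the full quantized weights once; `efficiency` is an
    illustrative utilization factor, not a measured value."""
    return efficiency * bandwidth_gbps / model_bytes_gb

# A 27B dense model at ~4.25 bits/weight is roughly 14 GB of weights.
weights_gb = 27e9 * 4.25 / 8 / 1e9  # = 14.34375 GB

print(round(rough_decode_tps(900, weights_gb), 1))  # 3090-class bandwidth, ≈ 37.6
print(round(rough_decode_tps(270, weights_gb), 1))  # Mac-class bandwidth, ≈ 11.3
```

The estimates land in the same ballpark as the figures quoted above (30-35 t/s vs ~16 t/s), which is about all this kind of napkin math is good for.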
I would be very interested to know your prefill and generation numbers.
With local models which are often benchmaxxed, testing unfortunately isn’t as predictive as you’d like.
I admit I sometimes get caught up in the tooling for its own sake, but I find local models useful for specific tasks like migrating configuration schemas, writing homelab scripts, or exploring financial data.
It might sound a bit paranoid, but privacy is another major driver for me. Keeping credentials and private information off cloud services is worth the extra friction.
Local 30B models do fine on isolated, well-scoped tasks. Autocomplete, single-function refactors, naming, type fixes, stuff like that. They fall apart on anything that requires sustained reasoning over multiple steps. Frontier cloud models stay coherent across the chain because they have higher peak capability AND lower variance per step. Local models compound their per-step variance and end up in loops.
The "run everything locally" hype mostly assumes you're doing the easy half. Once you try the hard half, the cloud-vs-local ratio is way bigger than the parameter-count comparison suggests.
The play is hybrid routing. Local for the easy stuff, cloud for the hard stuff. Anyone doing "100% local" at scale is either cherry-picking tasks or eating losses they don't measure, or just straight up full of shit.
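A hybrid router doesn't have to be sophisticated; a heuristic dispatch already captures most of the value. A minimal sketch, where the task attributes and thresholds are illustrative placeholders rather than anything tuned:

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    files_touched: int    # rough proxy for scope
    needs_planning: bool  # requires sustained multi-step reasoning?

def route(task: Task) -> str:
    """Heuristic dispatch: keep small, well-scoped work local and
    escalate anything needing multi-step reasoning. Thresholds are
    illustrative, not tuned."""
    if task.needs_planning or task.files_touched > 3:
        return "cloud"   # frontier model: lower per-step variance
    return "local"       # 30B-class model: fine for isolated edits

print(route(Task("rename this function", 1, False)))   # local
print(route(Task("refactor the auth flow", 8, True)))  # cloud
```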
There is certainly a lot of hype around local models. Some of it is overhype, some of it is just "people finding out" and discovering what cool stuff you can do. I suspect the post is a reply to the other one a few days ago where someone from hf posted a pic with them in the plane, using a local model, and saying it's really really close to opus. That was BS.
That being said, I've been working with local LMs since before chatgpt launched. The progress we've made from the likes of gpt-j (6B) and gpt-neoX (22B) (some of the first models you could run on regular consumer hardware) is absolutely amazing. It has gone way above my expectations. We're past "we have chatgpt at home" (as it was when launched), and now it is actually usable in a lot of tasks. Nowhere near SotA, but "good enough".
I will push back a bit on the "substantial" part, and I will push a lot on "nothing useful". You can, absolutely get useful stuff out of these models. Not in a claude-code leave it to cook for 6 hours and get a working product, but with a bit of hand holding and scope reduction you can get useful stuff. When devstral came out (24B) I ran it for about a week as a "daily driver" just to see where it's at. It was ok-ish. Lots of hand holding, figured out I can't use it for planning much (looked fine at a glance, but either didn't make sense, or used outdated stuff). But with a better plan, it could handle implementation fine. I coded 2 small services that have been running in prod for ~6mo without any issues. That is useful, imo. And the current models are waaay better than devstral1.
As to substantial, eh... Your substantial can be someone else's Taj Mahal, and their substantial could be your toy project. It all depends. I draw the line at useful. If you can string together a couple of useful tasks, it starts to become substantial.
That said, I don't see why people are so scared to touch code, even if it saves them 500 euro a month. Using my IDE's find across my repo and auto-replacing two patterns is trivial and way faster to do by hand. I mostly use small models; it prevents a lot of the issues I've seen with large models and vibe/agentic coding over the medium to long term. I also write a lot of code.
I've been using qwen3.5 (122b) with claude code for months, and it's definitely dumber than sonnet/opus, but it works through things reasonably well (i.e. writes half-decent code and tool calls usually work), and I pretty much never run into loops now.
and make sure you're following Unsloth's recommendations for temperature/etc.
From the Qwen3.6 page:
Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
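If you're serving the model through an OpenAI-compatible endpoint (llama-server, vLLM), the thinking-mode coding settings above map onto a request like this. The endpoint URL and model name are placeholders, and `top_k`/`min_p`/`repetition_penalty` are server-specific extension fields rather than part of the core OpenAI schema:

```python
import json
import urllib.request

# Thinking mode, precise coding settings from the model page above.
payload = {
    "model": "qwen3.6",  # placeholder model name
    "messages": [{"role": "user", "content": "Write a binary search."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,                # extension field on some servers
    "min_p": 0.0,               # extension field on some servers
    "presence_penalty": 0.0,
    "repetition_penalty": 1.0,  # non-standard; not all servers accept it
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # placeholder endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment against a running server
```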
Gist with the compose and example of an output. https://gist.github.com/meaty-popsicle/f883f4a118ff345b430c3...
set min_p to like 0.3 and ignore top_p and top_k and you'll be fine.
There are better samplers now, like top-N sigma, top-h, P-less decoding, etc., but they're often not available in your LLM inference engine (e.g. vLLM).
We are instead arguing that min_p handles truncating tokens that are more likely to lead to degeneration/looping because it is partially distribution aware. Fully distribution-aware samplers like the ones I mentioned above (e.g. P-less decoding) are strictly superior due to using the whole distribution to decide the truncation at every time step.
Code hallucinations, like many LLM hallucinations, can be seen as accumulation of small amounts of "sampling errors".
It works great for me. But I like to review the code and understand what it's doing, which doesn't appear to be how people do "useful or substantial" programming these days.
The biggest lesson I've learned working with local models so far is: with the smaller models, you have to understand their limitations, be willing to run experiments, and fine-tune the heck out of everything. There are endless choices to be made: which model to use, which quant, thinking or not, sampling parameters, llama.cpp vs vLLM, etc. They're much more fiddly for serious work than just downloading Claude Code and having it one-shot your application. But some of us enjoy fiddling, so it all works out in the end.
For people who aren't completely vibe or agent coding, these models are better than, say, Copilot or the free models appearing after a Google search. Probably better than ChatGPT's flagships in some ways.
I mostly use 4B to 9B models for basic inquiries and code examples from libraries I haven't used before. Many of them can solve pretty hard math problems, and these are several steps away from, say, Qwen3.6.
I would not discount running models locally. It's the best case scenario of a future with LLMs from a human rights and ecological perspective.
(I’ve recently discovered that you can pipe local models into Claude Code and Claude Desktop, so this is on my list to try).
r/LocalLLaMA is okay for some information, but beyond that there is so much noise and very little signal. I think it's intentional.
Don't try for a rambling session where you let the thing grind for hours on a huge system. It will predictably choke or end in those loops. But do a few small chunks of work, exit the harness, then pick up the next few small chunks... It doesn't feel as magical, but it seems to be more effective, even when your model is Claude.
* New models running in llama.cpp (what's under the hood of ollama et al) frequently require bug fixes.
* The GGUF models that run in llama.cpp frequently require bug fixes (Unsloth is notorious for this -- they release GGUF models about 10 minutes after official .safetensors releases).
* You're probably running a <Q8 quantization of the model, and there's a good chance a <BF16 quantization for the KV cache. This makes for compounding issues as context grows and tool calls multiply.
Local models really are great but I think a major problem are the people in groups like r/localllama who run models at absurd quantization levels in order to cram them on their underpowered hardware and convince themselves that they're running SOTA at home.
The best way to run these models is, frankly, a lot of VRAM and vLLM (which is what the people developing these models are almost certainly targeting).
[1] https://betweentheprompts.com/40000-feet/
Did the author mean Qwen3.6-27B? Qwen3.6-35B-A3B?
Also, agreed with the other commenters: just read a damn book and take a nap.
Not only shouldn't we support someone like Elon Musk, but don't you find it hypocritical to respond with "just read a damn book" while suggesting Starlink?
Curious if you had to heavily throttle the CPU or stick to super small quants (like 4-bit Phi-3) to actually make it through 10 hours without a power outlet?
Say you have a basic setup like llama.cpp and llama-server on a remote machine (even if it's just sitting under your home office desk) running a 35GB Q8 quant of Qwen3.6 35B: it's not difficult to make llama-server available to your laptop over just about any form of internet connection and VPN.
Having the ability to run that same model locally is nice for when no internet connection whatsoever is available, but the times that you simultaneously have no internet and a serious need for something the model can output are fairly rare these days.
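The remote setup described above is only two commands: serve on the workstation, then hit the OpenAI-compatible endpoint over the VPN. The model path and the VPN address are placeholders:

```shell
# On the remote workstation: serve the model on all interfaces so it is
# reachable over the VPN. Model path is a placeholder.
./llama-server --model ./models/qwen3.6-35b-q8_0.gguf \
  --host 0.0.0.0 --port 8080 --gpu-layers 99

# On the laptop, over the VPN (e.g. WireGuard/Tailscale); 10.0.0.2 is a
# placeholder VPN address for the workstation.
curl http://10.0.0.2:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"hello"}]}'
```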
Same with running my local dev environment's docker containers, now they run on that workstation and my battery life is far higher, treating my portable device as a dumb terminal.