Last year when Framework announced the Framework Desktop I immediately ordered one. I’d been wanting a new gaming PC, but I’d also been kicking around the idea of running a local LLM. When it finally arrived it worked great for gaming… but there wasn’t much that would run on the AMD hardware from an LLM standpoint. Over the next few months more tools became available, but it was very slow going. I had many long nights where I’d work and work and work and end up right back where I started.
So I got a Claude Code subscription and used it to help me build out my LLM setup. I made a lot of progress, but now I was comparing my local LLM to Claude, and there was no comparison.
Then I started messing with OpenClaw. First with Claude (expensive, fast), then with my local llama.cpp (cheap, frustrating). I didn’t know enough about it, so I used Claude to help me build a custom app around my llama.cpp. That was fun and I learned a lot, but I was spending most of my time chasing bugs instead of actually optimizing anything.
Around that time I heard about Qwen3-Coder-Next, dropped it into llama.cpp, and wow that was a huge step forward. Better direction-following, better tool calls, just better. I felt like my homegrown app was now holding the model back, so I converted over to OpenClaw. Some growing pains, but once things settled I was impressed again.
We built a lot of tooling along the way: a vector database memory system that cleans itself up each night, a filesystem-based context system, speech-to-text and text-to-speech, and a vision model. At this point my local LLM could see me, hear me, speak to me, and remember things about me, and all of it was built to be LLM-agnostic so Claude and my local system could share the same tools.
I was still leaning on Claude heavily for coding, because honestly it’s amazing at it. I decided to give Qwen a small test project: build a web-based kanban board, desktop and mobile friendly. It built it… but it sucked. Drag between columns? Broken. Fixed that, now you can’t add items. Fixed that, dragging broke on mobile. I kept asking Claude to help troubleshoot and it kept just wanting to rewrite the app. Finally I gave in and said “just fix it,” and Claude rewrote the whole thing and it was great. I was disheartened. On top of that, Qwen kept getting into these loops, sometimes running for hours doing nothing productive.
So about a week and a half ago I decided to rethink what I even wanted my local LLM to do. Coding was obviously out. I decided to start fresh and use it to help me journal. A few times a day it reaches out, asks what I’m doing, and if it’s relevant, adds an entry to my journal.
I went through a couple more model swaps trying to get it stable. Qwen3.5 was better than Coder-Next for this use case, but I was still hitting loop issues. It was consistently prompting me and doing a decent job with the journal, which was at least a step in the right direction.
Then Qwen3.6 dropped. I put the Q6 quant on the same day it released and immediately I could tell it was faster and the output quality was much higher. And I realized earlier today that since I switched to Qwen3.6 I haven’t had to ask Claude to check in on Qwen even once. The looping is gone. It’s actually following the anti-loop protocols I’ve been trying to get models to follow for months.
I haven’t tried coding with it yet (I don’t have high hopes there) but I’ve given it the ability to create and modify its own skills and it’s been doing that beautifully. Scheduled tasks, multiple agents (voice assistant, primary, Home Assistant), all running smoothly.
My reliance on Claude has dropped off sharply since moving to Qwen3.6, and my system resource usage has gone down significantly too. If you’ve tried to get a local LLM setup running and gave up out of frustration… now might be a good time to jump back in, especially if you know your hardware should be able to handle it.
On the coding front: Bijan Bowen likes to test various LLMs with coding challenges, and Qwen 3.6 is the first small model I’ve seen actually one-shot a few of them (browser OS test, 3D flight sim).
Sadly, it borked the third (C++ skateboarding game), but it definitely shows progress.
The way it failed is pathognomonic: CoT collapse / looping. They need to figure that shit out.
Keep going, Qwen.
This is very exciting to hear! I might see about creating a dedicated coding agent to test it out.
Creating dedicated agents isn’t something I’ve really done much of yet, I’ve mostly just kept it as a single agent, but this might be a good use case for it.
If you do, please write back and let us all know how it fares. I think a lot of us are pinning our hopes on the Qwen line as “Claude at Home”. I don’t think 3.6 is it… but 3.7? 3.8?
I did some analysis on OpenCode, Little Coder, Aider, OpenClaude, and ClawCode, and each one had potential issues with my specific setup (OpenCode and OpenClaude have a looping bug at the moment; Little Coder is brand new and has potential, but I’m leery of new projects; Aider is promising but had some unknowns). Anyway, as I was evaluating these I realized that I already use Claude Code, so all I had to do was point it at my model instead of theirs. A few quick changes and now I have Qwen3.6 in Claude Code. I asked it to create a web-based Tetris game… 7m13s later I had a fully functional, minimally buggy Tetris game in my browser!
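For anyone wanting to try the same thing, a rough sketch of what “a few quick changes” can look like. This assumes your local server (or a translating proxy in front of llama.cpp) exposes an Anthropic-compatible endpoint on port 8080; the port, token, and model name here are placeholders for your own setup, not anything from the post above.

```shell
# Point Claude Code at a local endpoint instead of Anthropic's API.
# Assumes an Anthropic-compatible server/proxy at localhost:8080 (adjust!).
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_AUTH_TOKEN="local-dummy-key"   # most local servers ignore it
export ANTHROPIC_MODEL="qwen3.6"                # whatever name your server registers
claude
```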
That’s awesome! Full circle :)
I have to admit, once you have your work environment set to how you like it, it can be difficult to transition away. I find OpenCode actively hostile to how I work, to say nothing of how shit the OOBE / on-ramping is. I was willing to eat shit to get it set up with Little Coder (because, as you say, if that works as advertised, it will be a hell of a thing), but time will tell.
I might need Little Coder eventually / as I try to step back from cloud-based, because the best local coding models I can run at good speeds are in the 8-14B range. Anything that can automatically enforce structured output discipline to get more out of smaller models would be a huge win.
Did you happen to see the tests on the new Qwen3.6 27B dense?
I am the target audience for this post, hah. Same kit as you but I threw in a 3090 via OcuLink for specialized tasks. At the moment it’s primarily a gaming rig.
I’d love to hear more about the various tools you’ve found - I stopped using local models a while back because it felt like the Strix Halo support just wasn’t ready.
I hadn’t heard of OcuLink before the Framework event yesterday… and now I’m intrigued. How does that connect to the Framework Desktop?
And as far as tools go, I tend to borrow ideas from others and then build them specific for my setup. For instance, a few months ago I came across a project called OpenBrain that uses a vector database for memory storage and retrieval. I leaned on Claude and asked it to evaluate the OpenBrain project and to let me know how we could use the concepts for my local system. Claude chugged away and then gave me its recommendations, which included spinning up PostgreSQL, creating an ‘Observer’ to watch my OpenClaw and Claude sessions to pull memories into the database, setting up an embedding LLM server to generate the embeddings, and then a process that runs nightly to remove duplicate memories, combine similar memories and archive old memories. If I recall almost all of these were components of OpenBrain, but I’m cautious about using other people’s code and appreciate that Claude is a good tool for using ideas from other projects without actually needing to use the whole project itself.
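The nightly deduplication step described above could be sketched roughly like this. This is my own minimal illustration, not OpenBrain’s or the poster’s actual code: the function names and the 0.95 similarity threshold are assumptions, and a real version would read embeddings from PostgreSQL (e.g. pgvector) rather than a Python list.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dedupe_memories(memories, threshold=0.95):
    """memories: list of (text, embedding) pairs, oldest first.
    Keep a memory only if it isn't a near-duplicate of one already kept."""
    kept = []
    for text, emb in memories:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((text, emb))
    return [text for text, _ in kept]
```

A nightly cron job would run this over the day’s new rows, then handle the “combine similar” and “archive old” passes separately.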
One instance recently where I’m glad I did this was Milla Jovovich’s MemPalace… since I already had a memory system in place I didn’t need to use her project, but one thing I did find very intriguing was the AAAK, which she had described as a lossless compression language for AI agents. I asked Claude to evaluate all of MemPalace and see if there was anything that would be an improvement to my setup, and Claude also got excited about AAAK. Claude said that everything else we were doing was as good or better than MemPalace, but if we could add an AAAK version of each memory that would help lower the token usage (but we kept the full prose memory to help with search). We implemented it and I asked Claude to pull some random memories from the db… and it was immediately clear that there was no way AAAK was lossless because a lot of my memories had to do with system details, including IP addresses, and the AAAK didn’t include any IP addresses. Of the 1800 memories that I had stored, Claude was only able to find 3 where enough meaning had been preserved with AAAK to be usable. It was easy to remove from my system - and I just checked and the project no longer refers to AAAK as lossless.
Using this method I also built a filesystem based context system that allows any LLM on my system to use the same skills, context, agents, projects, and memories (although the memory folder just has instructions on how to access the memory database). This was another project I saw from someone else and I used a lot of their ideas, but tweaked things to fit my needs.
Other tools where I am using others’ projects are ‘Faster-Whisper’ for speech-to-text, ‘Piper TTS’ for text-to-speech, ‘Moondream2’ for image analysis, and QMD Search for indexing my Obsidian Vault and putting it into the memory system.
Qwen kept getting into these loops, sometimes running for hours doing nothing productive.
Use a timeout mechanism and then retry or fail as appropriate. I’m doing that to avoid getting stuck forever in loops with a lower-quantization Qwen3.6 model that I’m currently using for image analysis at work.
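A minimal version of that timeout-and-retry wrapper, assuming the agent task runs as a subprocess (the timeout and retry counts are arbitrary defaults; the commenter’s actual setup isn’t shown):

```python
import subprocess

def run_with_timeout(cmd, timeout_s=300, retries=2):
    """Run an agent task as a subprocess; kill it if it runs past the
    timeout (likely looping), retry a couple of times, then fail loudly."""
    for attempt in range(retries + 1):
        try:
            return subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt + 1} timed out after {timeout_s}s")
    raise RuntimeError(f"still looping after {retries + 1} attempts")
```

The key point is that the watchdog is deterministic code outside the model, so a looping LLM can’t talk its way past it.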
I was trying that with Coder-Next but it would often just ignore those instructions. I didn’t change any of my instructions when I moved to Qwen3.6 and it has been following them solidly.
Running the timer and retry logic from a deterministic (non-LLM-based) control script? Hmm. Was the context in a bad state after the timeout, and it thought it had already done the failed instructions, perhaps? I haven’t used Coder-Next specifically.
If you’ve got things working solidly now without it then great! Maybe revisit the idea though if you do start bumping into long unproductive loops again. In my case, I was seeing it happen occasionally with a complicated one-shot image analysis prompt and a Q4 version of the model.
I had high hopes for Gemma4 and, for the most part, for non-coding tasks, it’s been great. I really want to find a relatively reliable local coding model, though.
Same, I believe we’ll eventually get there with open models, but right now it’s hard to beat the frontier models for coding.
Started trying out GLM 4.7 Flash last night. Fits in 24 GB VRAM with 64k context window in Ollama and seems to be working well with OpenCode so far! Might have a new winner…
What kind of hardware are you using on the desktop?
I went with the decked out Framework Desktop: AMD Strix Halo APU (Zen 5 + RDNA 3.5 iGPU), 128GB shared system RAM, running Nobara Linux.
The only thing I skimped on was the SSD. I only have 2TB. I was going to go with 4TB but figured I should hold off because ‘storage prices only go down’.
No discrete GPU? When my 4070 was slightly out of its depth on context for Mistral-NeMo 12B a while back, it offloaded some to my CPU and system RAM and it was so slow. Recipe generation timed out after 10 minutes, while on VRAM it takes 30 seconds.
It’s a tradeoff between RAM and computing power. Discrete GPU RAM is limited: 24 GB is typically the most you can get as a consumer. That means you can’t run the big models, and running several at the same time can be tricky as well. Offloading from GPU to system RAM is very slow because the memory bandwidth between GPU and CPU becomes the bottleneck.
As long as the model fits on the GPU, they’re fast. Once you hit the memory limit, it’s game over.
With unified memory you can run bigger models and also have higher memory bandwidth (depending on CPU).
Memory bandwidth is super important for running LLMs. That’s also the reason Macs can be good for running LLMs. The M series have unified memory, and the Pro, Max, and Ultra chips have seriously high memory bandwidth. An M-series Mac with 48 GB or more is a serious machine for LLMs.
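The back-of-the-envelope version of why bandwidth dominates: generating one token streams every active weight through memory once, so bandwidth divided by model size gives a hard ceiling on tokens per second. The numbers below are purely illustrative, not measurements from anyone in this thread:

```python
def max_gen_tps(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    # Each generated token reads all active weights once, so memory
    # bandwidth / model bytes bounds the generation rate from above.
    return bandwidth_gb_s / active_weights_gb

# Illustrative: ~256 GB/s unified-memory bandwidth, ~5.3 GB of active
# weights (e.g. a quantized MoE's active experts).
print(f"~{max_gen_tps(256, 5.3):.0f} t/s ceiling")
```

Real throughput lands below this ceiling (attention, KV-cache reads, and compute all cost extra), but it explains why a 24 GB card is fast right up until the model spills into system RAM.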
The AMD Strix Halo itself should be a beast for AI, no need for a dGPU. However, you’re sometimes limited because NVIDIA has pushed CUDA so hard that it’s the go-to for a lot of tools if you don’t want to run the LLM on the CPU.
Haven’t tried it yet.
“ZLUDA is a drop-in replacement for CUDA on non-NVIDIA GPUs. ZLUDA allows running unmodified CUDA applications using non-NVIDIA GPUs with near-native performance”
I need to try that, tbh. See what things like Ollama do with it and what the performance is like.
I’m considering this rig. What kind of tok/s are you getting with Qwen 3.6?
For ingesting I’m getting around 200 t/s, and for generation it is around 48 t/s.
That’s excellent throughput. If you hook it up to Little Coder (or pi-coder with the Little Coder extension when it drops), let me know. Little Coder is meant to make small models much more obedient.
Something that hits 79% on Aider and gets 74% on SWE puts it in the same benchmark league as GPT 5.4 mini. If you can run that at near 50 tok/s, that’s really impressive.
Thanks, I plan on checking out Little Coder later this weekend. It looks really cool — thanks for sharing!
Congrats! Love that you’ve been able to experiment so much locally. Do you have any experience renting processing from something like Vast or RunPod?
No, I feel fortunate that I was able to get the Framework Desktop before the tariffs and chip shortages hit and so I have the computational power that I need without needing to rent it. I’m sure if I did the math I would probably come out ahead by renting but I don’t like monthly bills, especially ones that fluctuate.
For sure. Thanks! Happy coding.
Anyone able to comment on qwen-coder-3 vs any of the new qwen3.6 models?
Sounds cool. The most recent Qwen I’ve used is 3. It writes well, but thinking takes like 20 minutes. 😭
I have mine set up to only use thinking for complex tasks, and that can take significantly longer (2-8 minutes with thinking versus <30s without), but when I look at its thought process, the logic steps it goes through to solve an issue are impressive.
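One way that kind of routing can be wired up, sketched below. This is my own guess at an approach, not the poster’s setup: the keyword heuristic is arbitrary, and the `chat_template_kwargs` / `enable_thinking` fields follow a convention some OpenAI-compatible servers use for Qwen-style hybrid-thinking models — check what your server actually accepts.

```python
def needs_thinking(prompt: str) -> bool:
    # Crude heuristic: long prompts, or ones with planning/debugging
    # keywords, get thinking mode; everything else stays fast.
    keywords = ("debug", "plan", "refactor", "prove", "step by step")
    return len(prompt) > 400 or any(k in prompt.lower() for k in keywords)

def build_payload(prompt: str) -> dict:
    # Field names are an assumption about the serving API, not universal.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": needs_thinking(prompt)},
    }
```

A fancier version could ask a small classifier model to make the call, but a deterministic heuristic keeps the latency of simple requests predictable.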