Last year when Framework announced the Framework Desktop I immediately ordered one. I’d been wanting a new gaming PC, but I’d also been kicking around the idea of running a local LLM. When it finally arrived it worked great for gaming… but there wasn’t much that would run on the AMD hardware from an LLM standpoint. Over the next few months more tools became available, but it was very slow going. I had many long nights where I’d work and work and work and end up right back where I started.
So I got a Claude Code subscription and used it to help me build out my LLM setup. I made a lot of progress, but now I was comparing my local LLM to Claude, and there was no comparison.
Then I started messing with OpenClaw. First with Claude (expensive, fast), then with my local llama.cpp (cheap, frustrating). I didn’t know enough about it, so I used Claude to help me build a custom app around my llama.cpp. That was fun and I learned a lot, but I was spending most of my time chasing bugs instead of actually optimizing anything.
Around that time I heard about Qwen3-Coder-Next, dropped it into llama.cpp, and wow that was a huge step forward. Better direction-following, better tool calls, just better. I felt like my homegrown app was now holding the model back, so I converted over to OpenClaw. Some growing pains, but once things settled I was impressed again.
We built a lot of tooling along the way: a vector database memory system that cleans itself up each night, a filesystem-based context system, speech-to-text and text-to-speech, and a vision model. At this point my local LLM could see me, hear me, speak to me, and remember things about me, and all of it was built to be LLM-agnostic so Claude and my local system could share the same tools.
I was still leaning on Claude heavily for coding, because honestly it’s amazing at it. I decided to give Qwen a small test project: build a web-based kanban board, desktop and mobile friendly. It built it… but it sucked. Drag between columns? Broken. Fixed that, now you can’t add items. Fixed that, dragging broke on mobile. I kept asking Claude to help troubleshoot and it kept just wanting to rewrite the app. Finally I gave in and said “just fix it,” and Claude rewrote the whole thing and it was great. I was disheartened. On top of that, Qwen kept getting into these loops, sometimes running for hours doing nothing productive.
So about a week and a half ago I decided to rethink what I even wanted my local LLM to do. Coding was obviously out. I decided to start fresh and use it to help me journal. A few times a day it reaches out, asks what I’m doing, and if it’s relevant, adds an entry to my journal.
I went through a couple more model swaps trying to get it stable. Qwen3.5 was better than Coder-Next for this use case, but I was still hitting loop issues. It was consistently prompting me and doing a decent job with the journal, which was at least a step in the right direction.
Then Qwen3.6 dropped. I put the Q6 quant on the same day it released and immediately I could tell it was faster and the output quality was much higher. And I realized earlier today that since I switched to Qwen3.6 I haven’t had to ask Claude to check in on Qwen even once. The looping is gone. It’s actually following the anti-loop protocols I’ve been trying to get models to follow for months.
I haven’t tried coding with it yet (I don’t have high hopes there) but I’ve given it the ability to create and modify its own skills and it’s been doing that beautifully. Scheduled tasks, multiple agents (voice assistant, primary, Home Assistant), all running smoothly.
My reliance on Claude has dropped off sharply since moving to Qwen3.6, and my system resource usage has gone down significantly too. If you’ve tried to get a local LLM setup running and gave up out of frustration… now might be a good time to jump back in, especially if you know your hardware should be able to handle it.
On the coding front: Bijan Bowen likes to test various LLMs with coding challenges, and Qwen 3.6 is the first small model I’ve seen actually one-shot a few of them (browser OS test, 3D flight sim).
Sadly, it borked the third (C++ skateboarding game), but it definitely shows progress.
The way it failed is pathognomonic: CoT collapse / looping. They need to figure that shit out.
Keep going, Qwen.
This is very exciting to hear! I might see about creating a dedicated coding agent to test it out.
Creating dedicated agents isn’t something I’ve really done much of yet, I’ve mostly just kept it as a single agent, but this might be a good use case for it.
If you do, please write back and let us all know how it fares. I think a lot of us are pinning our hopes on the Qwen line as “Claude at Home”. I don’t think 3.6 is it… but 3.7? 3.8?
I did some analysis on OpenCode, Little Coder, Aider, OpenClaude, and ClawCode, and each one had potential issues with my specific setup (OpenCode and OpenClaude have a looping bug at the moment; Little Coder is brand new and has potential, but I’m leery of new projects; Aider is promising but had some unknowns). Anyway, as I was evaluating these I realized that I already use Claude Code, so all I had to do was point it at my model instead of theirs. A few quick changes and now I have Qwen3.6 in Claude Code. I asked it to create a web-based Tetris game… 7m13s later I had a fully functional, minimally buggy Tetris game in my browser!
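For anyone wanting to try the same thing, a rough sketch of what “a few quick changes” can look like. This assumes your local server (or a translating proxy in front of llama.cpp) exposes an Anthropic-compatible endpoint on port 8080; the port, token, and model name here are placeholders for your own setup, not anything from the post above.

```shell
# Point Claude Code at a local endpoint instead of Anthropic's API.
# Assumes an Anthropic-compatible server/proxy at localhost:8080 (adjust!).
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_AUTH_TOKEN="local-dummy-key"   # most local servers ignore it
export ANTHROPIC_MODEL="qwen3.6"                # whatever name your server registers
claude
```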
That’s awesome! Full circle :)
I have to admit, once you have your work environment set to how you like it, it can be difficult to transition away. I find OpenCode actively hostile to how I work, to say nothing of how shit the OOBE / on-ramping is. I was willing to eat shit to get it set up with Little Coder (because, as you say, if that works as advertised, it will be a hell of a thing), but time will tell.
I might need Little Coder eventually / as I try to step back from cloud-based, because the best local coding models I can run at good speeds are in the 8-14B range. Anything that can automatically enforce structured output discipline to get more out of smaller models would be a huge win.
Did you happen to see the tests on the new Qwen3.6 27B dense?
I am the target audience for this post, hah. Same kit as you but I threw in a 3090 via OcuLink for specialized tasks. At the moment it’s primarily a gaming rig.
I’d love to hear more about the various tools you’ve found - I stopped using local models a while back because it felt like the Strix Halo support just wasn’t ready.
I hadn’t heard of OcuLink before the Framework event yesterday… and now I’m intrigued. How does that connect to the Framework Desktop?
And as far as tools go, I tend to borrow ideas from others and then build them specific for my setup. For instance, a few months ago I came across a project called OpenBrain that uses a vector database for memory storage and retrieval. I leaned on Claude and asked it to evaluate the OpenBrain project and to let me know how we could use the concepts for my local system. Claude chugged away and then gave me its recommendations, which included spinning up PostgreSQL, creating an ‘Observer’ to watch my OpenClaw and Claude sessions to pull memories into the database, setting up an embedding LLM server to generate the embeddings, and then a process that runs nightly to remove duplicate memories, combine similar memories and archive old memories. If I recall almost all of these were components of OpenBrain, but I’m cautious about using other people’s code and appreciate that Claude is a good tool for using ideas from other projects without actually needing to use the whole project itself.
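The nightly deduplication step described above could be sketched roughly like this. This is my own minimal illustration, not OpenBrain’s or the poster’s actual code: the function names and the 0.95 similarity threshold are assumptions, and a real version would read embeddings from PostgreSQL (e.g. pgvector) rather than a Python list.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dedupe_memories(memories, threshold=0.95):
    """memories: list of (text, embedding) pairs, oldest first.
    Keep a memory only if it isn't a near-duplicate of one already kept."""
    kept = []
    for text, emb in memories:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((text, emb))
    return [text for text, _ in kept]
```

A nightly cron job would run this over the day’s new rows, then handle the “combine similar” and “archive old” passes separately.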
One instance recently where I’m glad I did this was Milla Jovovich’s MemPalace… since I already had a memory system in place I didn’t need to use her project, but one thing I did find very intriguing was the AAAK, which she had described as a lossless compression language for AI agents. I asked Claude to evaluate all of MemPalace and see if there was anything that would be an improvement to my setup, and Claude also got excited about AAAK. Claude said that everything else we were doing was as good or better than MemPalace, but if we could add an AAAK version of each memory that would help lower the token usage (but we kept the full prose memory to help with search). We implemented it and I asked Claude to pull some random memories from the db… and it was immediately clear that there was no way AAAK was lossless because a lot of my memories had to do with system details, including IP addresses, and the AAAK didn’t include any IP addresses. Of the 1800 memories that I had stored, Claude was only able to find 3 where enough meaning had been preserved with AAAK to be usable. It was easy to remove from my system - and I just checked and the project no longer refers to AAAK as lossless.
Using this method I also built a filesystem based context system that allows any LLM on my system to use the same skills, context, agents, projects, and memories (although the memory folder just has instructions on how to access the memory database). This was another project I saw from someone else and I used a lot of their ideas, but tweaked things to fit my needs.
Other tools where I am using others’ projects are ‘Faster-Whisper’ for speech-to-text, ‘Piper TTS’ for text-to-speech, ‘Moondream2’ for image analysis, and QMD Search for indexing my Obsidian Vault and putting it into the memory system.
Qwen kept getting into these loops, sometimes running for hours doing nothing productive.
Use a timeout mechanism and then retry or fail as appropriate. I’m doing that to avoid getting stuck forever in loops with a lower-quantization Qwen3.6 model that I’m currently using for image analysis at work.
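A minimal version of that timeout-and-retry wrapper, assuming the agent task runs as a subprocess (the timeout and retry counts are arbitrary defaults; the commenter’s actual setup isn’t shown):

```python
import subprocess

def run_with_timeout(cmd, timeout_s=300, retries=2):
    """Run an agent task as a subprocess; kill it if it runs past the
    timeout (likely looping), retry a couple of times, then fail loudly."""
    for attempt in range(retries + 1):
        try:
            return subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt + 1} timed out after {timeout_s}s")
    raise RuntimeError(f"still looping after {retries + 1} attempts")
```

The key point is that the watchdog is deterministic code outside the model, so a looping LLM can’t talk its way past it.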
I was trying that with Coder-Next but it would often just ignore those instructions. I didn’t change any of my instructions when I moved to Qwen3.6 and it has been following them solidly.
Running the timer and retry logic from a deterministic (non-LLM-based) control script? Hmm. Was the context in a bad state after the timeout, and it thought it had already done the failed instructions, perhaps? I haven’t used Coder-Next specifically.
If you’ve got things working solidly now without it then great! Maybe revisit the idea though if you do start bumping into long unproductive loops again. In my case, I was seeing it happen occasionally with a complicated one-shot image analysis prompt and a Q4 version of the model.
I had high hopes for Gemma4 and, for the most part, for non-coding tasks, it’s been great. I really want to find a relatively reliable local coding model, though.
Same, I believe we’ll eventually get there with open models, but right now it’s hard to beat the frontier models for coding.
Started trying out GLM 4.7 Flash last night. Fits in 24 GB VRAM with 64k context window in Ollama and seems to be working well with OpenCode so far! Might have a new winner…
What kind of hardware are you using on the desktop?
I went with the decked out Framework Desktop: AMD Strix Halo APU (Zen 5 + RDNA 3.5 iGPU), 128GB shared system RAM, running Nobara Linux.
The only thing I skimped on was the SSD. I only have 2TB. I was going to go with 4TB but figured I should hold off because ‘storage prices only go down’.
No discrete GPU? When my 4070 was slightly out of its depth on context for Mistral-NeMo 12B a while back, it offloaded some to my CPU and system RAM and it was so slow. Recipe generation timed out after 10 minutes, while on VRAM it takes 30 seconds.
It’s a tradeoff between RAM and computing power. Discrete GPU RAM is limited: 24 GB is typically the most you can get as a consumer. That means you can’t run the big models, and running several at the same time can be tricky as well. Offloading from GPU to system RAM is very slow because the memory bandwidth between GPU and CPU becomes the bottleneck.
As long as the model fits on the GPU, they’re fast. Once you hit the memory limit, it’s game over.
With unified memory you can run bigger models and also have higher memory bandwidth (depending on CPU).
Memory bandwidth is super important for running LLMs. That’s also the reason Macs can be good for running LLMs. The M series have unified memory, and the Pro, Max, and Ultra chips have seriously high memory bandwidth. An M-series Mac with 48 GB or more is a serious machine for LLMs.
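The back-of-the-envelope version of why bandwidth dominates: generating one token streams every active weight through memory once, so bandwidth divided by model size gives a hard ceiling on tokens per second. The numbers below are purely illustrative, not measurements from anyone in this thread:

```python
def max_gen_tps(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    # Each generated token reads all active weights once, so memory
    # bandwidth / model bytes bounds the generation rate from above.
    return bandwidth_gb_s / active_weights_gb

# Illustrative: ~256 GB/s unified-memory bandwidth, ~5.3 GB of active
# weights (e.g. a quantized MoE's active experts).
print(f"~{max_gen_tps(256, 5.3):.0f} t/s ceiling")
```

Real throughput lands below this ceiling (attention, KV-cache reads, and compute all cost extra), but it explains why a 24 GB card is fast right up until the model spills into system RAM.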
The AMD Strix Halo itself should be a beast for AI, no need for a dGPU. However, you’re sometimes limited because NVIDIA has pushed CUDA so hard that it’s the go-to for a lot of tools if you don’t want to run the LLM on the CPU.
Haven’t tried it yet.
“ZLUDA is a drop-in replacement for CUDA on non-NVIDIA GPUs. ZLUDA allows running unmodified CUDA applications using non-NVIDIA GPUs with near-native performance”
I need to try that, tbh. See what things like Ollama do with it and what the performance is like.
I’m considering this rig. What kind of tok/s are you getting with Qwen 3.6?
For ingesting I’m getting around 200 t/s, and for generation it is around 48 t/s.
That’s excellent throughput. If you hook it up to Little Coder (or pi-coder with the Little Coder extension when it drops), let me know. Little Coder is meant to make small models much more obedient.
Something that hits 79% on Aider and gets 74% on SWE puts it in the same benchmark league as GPT 5.4 mini. If you can run that at near 50 tok/s, that’s really impressive.
Thanks, I plan on checking out Little Coder later this weekend. It looks really cool — thanks for sharing!
Congrats! Love that you’ve been able to experiment so much locally. Do you have any experience renting processing from something like Vast or RunPod?
No, I feel fortunate that I was able to get the Framework Desktop before the tariffs and chip shortages hit and so I have the computational power that I need without needing to rent it. I’m sure if I did the math I would probably come out ahead by renting but I don’t like monthly bills, especially ones that fluctuate.
For sure. Thanks! Happy coding.
Anyone able to comment on qwen-coder-3 vs any of the new qwen3.6 models?
Sounds cool. The most recent Qwen I’ve used is 3. It writes well, but thinking takes like 20 minutes. 😭
I have mine set up to only use thinking for complex tasks, and that can take significantly longer (2-8 minutes with thinking versus <30s without), but when I look at its thought process, the logic steps it goes through to solve an issue are impressive.
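One way that kind of routing can be wired up, sketched below. This is my own guess at an approach, not the poster’s setup: the keyword heuristic is arbitrary, and the `chat_template_kwargs` / `enable_thinking` fields follow a convention some OpenAI-compatible servers use for Qwen-style hybrid-thinking models — check what your server actually accepts.

```python
def needs_thinking(prompt: str) -> bool:
    # Crude heuristic: long prompts, or ones with planning/debugging
    # keywords, get thinking mode; everything else stays fast.
    keywords = ("debug", "plan", "refactor", "prove", "step by step")
    return len(prompt) > 400 or any(k in prompt.lower() for k in keywords)

def build_payload(prompt: str) -> dict:
    # Field names are an assumption about the serving API, not universal.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": needs_thinking(prompt)},
    }
```

A fancier version could ask a small classifier model to make the call, but a deterministic heuristic keeps the latency of simple requests predictable.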