Running Local LLMs

Running models locally - Cursor@Home

| Model | Size | Speed | Comments |
|---|---|---|---|
| qwen3-14b-claude-4.5-opus-high-reasoning-distill | 9GB | 80 tok/sec | LM Studio 4.12 |
| unsloth/qwen3-coder-30b-a3b-instruct | 11GB/12.4GB | 55.46 tok/sec, 1019 tokens, 0.03s to first token | LM Studio 3.6 |
| llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-NVFP4-Experts-Only-GGUF | 12.5GB | 10 tok/sec, 2898 tokens, 1.16s to first token | LM Studio 4.12 |

# Qwen3.6 ⮺

This release delivers substantial upgrades, particularly in

  • Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.

  • Thinking Preservation: we’ve introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

See also:

# Qwen3.5 ⮺

Qwen3.5 is Alibaba’s new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B and 9B. These multimodal hybrid-reasoning LLMs deliver the strongest performance for their sizes. They support a 256K context across 201 languages, offer both thinking and non-thinking modes, and excel at agentic coding, vision, chat, and long-context tasks. The 35B and 27B models run on a Mac or other device with 22GB of RAM.

See all GGUFs here.

# Qwen3-Coder-30B-A3B-Instruct-GGUF

This is the standard instruction-tuned code model.
Typical strengths:

  • generating code from prompts
  • explaining code
  • quick coding tasks
  • IDE autocomplete / chat coding

It’s also easier to run locally because the total model size is smaller (30B).

RTX 5070 Ti

# Qwen3-Coder-Next ⮺

Much larger model: ~80B parameters total / ~3B active per token

  • very sparse Mixture-of-Experts
  • optimized for fast repeated calls in agent loops

Designed for coding agents that run loops:

  • read repo
  • plan changes
  • edit files
  • run tests
  • debug
  • repeat
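
The loop above can be sketched in a few lines. Everything here is a hypothetical stand-in (a fake test harness, no real model or file edits), just to show the plan/edit/test/repeat shape:

```python
# Toy coding-agent loop: plan, edit, run tests, repeat until they pass.
# All helpers are hypothetical stand-ins for real tool calls.

def run_agent(task, run_tests, max_iters=5):
    """Loop until run_tests() reports success or we give up."""
    history = []
    for step in range(1, max_iters + 1):
        plan = f"step {step}: attempt '{task}'"  # the model would plan/edit here
        history.append(plan)
        ok, feedback = run_tests()               # execute the test suite
        history.append(feedback)                 # feedback drives the next step
        if ok:
            return True, history                 # tests pass: done
    return False, history                        # give up after max_iters

# Fake test harness that fails twice, then passes.
attempts = {"n": 0}
def fake_tests():
    attempts["n"] += 1
    passed = attempts["n"] >= 3
    return passed, f"run {attempts['n']}: {'pass' if passed else 'fail'}"

ok, log = run_agent("fix the bug", fake_tests)
```

The point is that the model sees test output between iterations, which is exactly the feedback signal Qwen3-Coder-Next was trained on.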

It was trained with environment interaction and executable coding tasks, so it learns from test results and feedback loops.
Typical strengths:

  • repo-scale refactors
  • debugging from logs
  • multi-step tool use
  • IDE agents (Cline, Claude Code, etc.)

# Quantisation

# Q4_K_M ⮺

Q4_K_M is a compressed 4-bit version of an AI model that balances small size with good accuracy.

Q4 means each model weight is stored in 4 bits instead of 16, reducing memory usage by roughly 4×:

  • smaller files
  • faster inference
  • lower RAM/VRAM requirements
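
The rule of thumb is simply parameters × bits per weight. A minimal sketch (the 4.5 bits/weight figure for Q4_K_M is an approximation; real GGUF files add some overhead and quantise some tensors at higher precision):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model size: parameter count times bits per weight, in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(30, 16.0)  # unquantised half precision
q4 = model_size_gb(30, 4.5)     # Q4_K_M lands near 4.5-5 bits/weight in practice

print(f"FP16: {fp16:.1f} GB, Q4_K_M: {q4:.1f} GB")
```

For a 30B model this gives about 60 GB at FP16 versus roughly 17 GB at 4-bit, which is why quantisation is what makes these models fit on consumer GPUs at all.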

K stands for K-block quantisation, an improved quantisation method used in GGML/GGUF models.
Instead of compressing each weight independently, it compresses blocks of weights together, which:

  • preserves more information
  • improves accuracy compared to older Q4 formats

M denotes the size/precision variant of the scheme: S (small), M (medium) or L (large).
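
A minimal sketch of the per-block idea: quantise a block of weights to signed 4-bit integers with one shared scale. This is heavily simplified relative to the real K-quant format (which uses super-blocks with per-sub-block scales and minima), but shows why grouping weights preserves information:

```python
def quantize_block(weights, bits=4):
    """Quantise one block of floats to signed ints with a shared scale."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]          # ints in [-8, 7]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.12, -0.40, 0.33, 0.05, -0.21, 0.40, -0.07, 0.18]
q, s = quantize_block(block)
restored = dequantize_block(q, s)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Because the scale is fitted to each small block rather than the whole tensor, the reconstruction error stays bounded by half the block's own scale, which is what gives K-quants their accuracy edge over older flat Q4 formats.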

# Model Parameters

# A3B ⮺

Some newer models use a Mixture-of-Experts (MoE) architecture.
A3B → Active 3B parameters during inference.

Instead of using all parameters every time, the model:

  • Contains a large total parameter count (e.g., 35B).
  • Activates only a subset of them per token using a router that selects a few “experts.”

Example

| Model name | Total parameters | Active parameters |
|---|---|---|
| Qwen3.5-35B-A3B | 35B | ~3B active |
| ERNIE-4.5-21B-A3B | 21B | ~3B active |
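
A toy top-k router shows the mechanism (the expert count and sizes here are made up for illustration, not Qwen's actual architecture):

```python
def route(token_scores, k=2):
    """Pick the k experts with the highest router scores for this token."""
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return ranked[:k]

n_experts = 8
params_per_expert = 4  # billions, hypothetical
scores = [0.1, 0.9, 0.05, 0.7, 0.2, 0.01, 0.3, 0.15]  # router output, one token

active = route(scores, k=2)                 # only these experts run
active_params = len(active) * params_per_expert
total_params = n_experts * params_per_expert
```

Only the selected experts' weights do compute for that token, so inference speed tracks the active parameter count (~3B) even though the full model (and its memory footprint) is much larger.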

# instruct ⮺

This refers to a model that is specifically trained or fine-tuned to follow instructions from users in a helpful, safe, and coherent way.

# DFlash ⮺

DFlash is a novel speculative decoding method that utilizes a lightweight block diffusion model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
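
Setting DFlash's block-diffusion drafter aside, speculative decoding in general follows this pattern: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with. A toy sketch with deterministic stand-in "models" (real systems verify all draft tokens in a single batched forward pass rather than one call per token):

```python
def speculative_decode(prefix, draft, target, k=4):
    """One speculative step: draft k tokens, accept while the target agrees."""
    seq = list(prefix)
    proposals = []
    for _ in range(k):                              # k cheap draft calls
        proposals.append(draft(seq + proposals))
    accepted = []
    for tok in proposals:
        if target(seq + accepted) == tok:           # target verifies the draft
            accepted.append(tok)
        else:
            accepted.append(target(seq + accepted)) # fix the first mismatch
            break
    return seq + accepted

# Toy models: the target doubles the last token; the draft does too, but
# saturates at 8, so it diverges once values get large.
target = lambda s: s[-1] * 2
draft = lambda s: min(s[-1] * 2, 8)

out = speculative_decode([1], draft, target, k=4)
```

When the draft agrees with the target, one verification pass yields several tokens at once, which is where the speedup comes from; drafter quality (DFlash's contribution) determines how often that happens.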

Written on March 8, 2026, Last update on March 11, 2026
LLM at_home