Running Local LLM

Running Model locally - Cursor@Home

Qwen3.5

Qwen3.5 is Alibaba’s new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B and the new Small series: Qwen3.5-0.8B, 2B, 4B and 9B. The multimodal hybrid reasoning LLMs deliver the strongest performances for their sizes. They support 256K context across 201 languages, have thinking + non-thinking, and excel in agentic coding, vision, chat, and long-context tasks. The 35B and 27B models work on a 22GB Mac / RAM device.

See all GGUFs here.

How to run Qwen 3.5 locally

Qwen3-Coder-30B-A3B-Instruct-GGUF

This is the standard instruction-tuned code model.
Typical strengths:

generating code from prompts
explaining code
quick coding tasks
IDE autocomplete / chat coding

It’s also easier to run locally because the total model size is smaller (30B).

use Qwen3 Coder 30B A3B Instruct. This model delivers strong coding performance and reliable tool use.
- Which local models actually work with Cline? AMD tested them all

RTX 5070 Ti

Running 3.5 9B on my ASUS 5070ti 16G with lm studio

Qwen3-Coder-Next

Much larger model: ~80B parameters total / ~3B active per token

very sparse Mixture-of-Experts
optimized for fast repeated calls in agent loops

Designed for coding agents that run loops:

read repo
plan changes
edit files
run tests
debug
repeat

It was trained with environment interaction and executable coding tasks, so it learns from test results and feedback loops.
Typical strengths:

repo-scale refactors
debugging from logs
multi-step tool use
IDE agents (Cline, Claude Code, etc.)

Quantisation

Q4_K_M

Q4_K_M is a compressed 4-bit version of an AI model that balances small size with good accuracy.

Q4 - means each model weight uses 4 bits instead of 16 bits, reducing memory usage by about 4×.

smaller files
faster inference
lower RAM/VRAM requirements

K stands for K-block quantisation, an improved quantisation method used in GGML/GGUF models.
Instead of compressing each weight independently, it compresses blocks of weights together, which:

preserves more information
improves accuracy compared to older Q4 formats

M a variant of the quantisation scheme (S (small) / M (Medium) / L (Large) precision)

Model Parameters

A3B

Some newer models use a Mixture-of-Experts (MoE) architecture.
A3B → Active 3B parameters during inference.

Instead of using all parameters every time, the model:

Contains a large total parameter count (e.g., 35B).
Activates only a subset of them per token using a router that selects a few “experts.”

Example

Model name	Total parameters	Active parameters
Qwen3.5-35B-A3B	35B	~3B active
ERNIE-4.5-21B-A3B	21B	~3B active

instruct

This refers to a model that is specifically trained or fine-tuned to follow instructions from users in a helpful, safe, and coherent way.

Written on March 8, 2026, Last update on March 11, 2026

LLM at_home