Running Local LLM

Running a model locally - Cursor@Home

Qwen3.5

Qwen3.5 is Alibaba’s new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B and 9B. These multimodal hybrid-reasoning LLMs deliver the strongest performance for their sizes. They support 256K context across 201 languages, offer both thinking and non-thinking modes, and excel at agentic coding, vision, chat, and long-context tasks. The 35B and 27B models run on a Mac or other machine with 22GB of RAM.
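As a rough sanity check on that 22GB claim, here is a back-of-the-envelope sketch. The ~4.5 bits/weight average for a Q4-class GGUF is an assumption, and `gguf_ram_gb` is our own helper name; real usage adds KV cache and runtime overhead on top of this:

```python
def gguf_ram_gb(params_b, bits_per_weight):
    """Weights-only memory estimate: ignores KV cache and runtime overhead."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# A 35B model quantised to ~4.5 bits/weight vs. kept at FP16
print(f"~4.5 bpw (Q4-class): {gguf_ram_gb(35, 4.5):.1f} GB")  # ≈ 19.7 GB
print(f"16 bpw (FP16):       {gguf_ram_gb(35, 16):.1f} GB")   # 70.0 GB
```

So a ~22GB machine can just hold the quantised 35B weights, with only a little headroom left for context.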

See all GGUFs here.

Qwen3-Coder-30B-A3B-Instruct-GGUF

This is the standard instruction-tuned code model.
Typical strengths:

  • generating code from prompts
  • explaining code
  • quick coding tasks
  • IDE autocomplete / chat coding

It’s also easier to run locally because the total model size is smaller (30B).

RTX 5070 Ti

Qwen3-Coder-Next

Much larger model: ~80B parameters total / ~3B active per token

  • very sparse Mixture-of-Experts
  • optimized for fast repeated calls in agent loops

Designed for coding agents that run loops:

  • read repo
  • plan changes
  • edit files
  • run tests
  • debug
  • repeat

It was trained with environment interaction and executable coding tasks, so it learns from test results and feedback loops.
Typical strengths:

  • repo-scale refactors
  • debugging from logs
  • multi-step tool use
  • IDE agents (Cline, Claude Code, etc.)
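The loop above can be sketched as plain Python. Everything here is stubbed out: `plan_changes`, `apply_edits` and `run_tests` are hypothetical stand-ins for real model and tool calls, there to show the control flow rather than a working agent:

```python
def run_tests(repo):
    """Stand-in for actually running the test suite."""
    return repo.get("bug_fixed", False)

def plan_changes(repo, feedback):
    """Stand-in for the model reading the repo and planning edits."""
    return {"fix": "patch the failing function"}

def apply_edits(repo, plan):
    """Stand-in for the model editing files on disk."""
    repo["bug_fixed"] = True
    return repo

def coding_agent(repo, max_iters=5):
    feedback = None
    for step in range(max_iters):
        if run_tests(repo):                  # stop as soon as tests pass
            return step
        plan = plan_changes(repo, feedback)  # model reads repo + last feedback
        repo = apply_edits(repo, plan)       # model edits files
        feedback = "tests still failing"     # results fed back into next plan
    return max_iters

print(coding_agent({"bug_fixed": False}))  # → 1: one edit round fixed it
```

The feedback variable is the point: test output from one iteration becomes input to the next plan, which is what the environment-interaction training targets.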

Quantisation

Q4_K_M

Q4_K_M is a compressed 4-bit version of an AI model that balances small size with good accuracy.

Q4 - means each model weight uses 4 bits instead of 16 bits, reducing memory usage by about 4×.

  • smaller files
  • faster inference
  • lower RAM/VRAM requirements

K stands for K-block quantisation, an improved quantisation method used in GGML/GGUF models.
Instead of compressing each weight independently, it compresses blocks of weights together, which:

  • preserves more information
  • improves accuracy compared to older Q4 formats

M - a variant of the quantisation scheme (S = small / M = medium / L = large precision)
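A toy illustration of why per-block scales beat one global scale. This is a simplified absmax scheme, not the actual Q4_K_M layout, and the weight distribution is made up for the demo:

```python
import numpy as np

def quantize_blockwise(w, bits=4, block=32):
    """One scale per block of `block` weights (simplified absmax scheme)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 levels each side for 4-bit
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.round(blocks / scales)                    # integer codes in [-7, 7]
    return (q * scales).reshape(-1)                  # dequantized weights

def quantize_global(w, bits=4):
    """One scale for the whole tensor (cruder, older-style scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 1024)        # mostly small weights...
w[::128] = rng.normal(0, 0.5, 8)     # ...plus a few large outliers

err_block = np.mean((w - quantize_blockwise(w)) ** 2)
err_global = np.mean((w - quantize_global(w)) ** 2)
print(f"block-wise MSE {err_block:.1e} vs global MSE {err_global:.1e}")
```

The outliers only inflate the scale of their own 32-weight block, so the other blocks keep a fine step size; with a single global scale, a handful of outliers coarsen every weight in the tensor.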

Model Parameters

A3B

Some newer models use a Mixture-of-Experts (MoE) architecture.
A3B → Active 3B parameters during inference.

Instead of using all parameters every time, the model:

  • Contains a large total parameter count (e.g., 35B).
  • Activates only a subset of them per token using a router that selects a few “experts.”

Example

Model name          Total parameters   Active parameters
Qwen3.5-35B-A3B     35B                ~3B
ERNIE-4.5-21B-A3B   21B                ~3B
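A minimal sketch of the routing idea, with toy numbers. Real MoE layers use learned routers, load-balancing losses and far more parameters; `moe_forward` and the shapes here are our own illustration:

```python
import numpy as np

def moe_forward(x, expert_weights, router_weights, top_k=2):
    """Route one token to its top_k experts; the others stay inactive."""
    logits = router_weights @ x                  # one routing score per expert
    chosen = np.argsort(logits)[-top_k:]         # indices of the top_k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                         # softmax over chosen experts only
    # Only the chosen experts' weight matrices are ever multiplied
    out = sum(g * (expert_weights[i] @ x) for g, i in zip(gates, chosen))
    return out, chosen

rng = np.random.default_rng(1)
n_experts, d = 8, 16
experts = rng.normal(size=(n_experts, d, d))     # total parameters: all experts
router = rng.normal(size=(n_experts, d))
x = rng.normal(size=d)                           # one token's hidden state

out, chosen = moe_forward(x, experts, router)
print(f"active experts: {sorted(chosen.tolist())} of {n_experts}")
```

All expert weights still have to be stored (the "35B total"), but each token only pays the compute cost of the few experts the router picks (the "~3B active").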

instruct

This refers to a model that has been specifically trained or fine-tuned to follow user instructions in a helpful, safe, and coherent way.

Written on March 8, 2026; last updated on March 11, 2026
LLM at_home