Running Local LLMs

Running models locally - Cursor@Home

| Model | Size | Speed | Comments |
|---|---|---|---|
| qwen3-14b-claude-4.5-opus-high-reasoning-distill | 9GB | 80 tok/sec | LM Studio 4.12 |
| unsloth/qwen3-coder-30b-a3b-instruct | 11GB/12.4GB | 55.46 tok/sec, 1019 tokens, 0.03s to first token | LM Studio 3.6 |
| llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-NVFP4-Experts-Only-GGUF | 12.5GB | 10 tok/sec, 2898 tokens, 1.16s to first token | LM Studio 4.12 |

# Qwen3.6 ⮺

This release delivers substantial upgrades, particularly in

  • Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.

  • Thinking Preservation: we’ve introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

See also:

# Qwen3.5 ⮺

Qwen3.5 is Alibaba’s new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B and 9B. These multimodal hybrid-reasoning LLMs deliver the strongest performance for their sizes. They support a 256K context across 201 languages, offer both thinking and non-thinking modes, and excel at agentic coding, vision, chat, and long-context tasks. The 35B and 27B models run on a Mac or other device with 22GB of RAM.

See all GGUFs here.

# Qwen3-Coder-30B-A3B-Instruct-GGUF

This is the standard instruction-tuned code model.
Typical strengths:

  • generating code from prompts
  • explaining code
  • quick coding tasks
  • IDE autocomplete / chat coding

It’s also easier to run locally because the total model size is smaller (30B).

RTX 5070 Ti

# Qwen3-Coder-Next ⮺

Much larger model: ~80B parameters total / ~3B active per token

  • very sparse Mixture-of-Experts
  • optimized for fast repeated calls in agent loops

Designed for coding agents that run loops:

  • read repo
  • plan changes
  • edit files
  • run tests
  • debug
  • repeat
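
The loop above can be sketched in a few lines. Everything here is a hypothetical stand-in (a fake test harness, no real model or file edits), just to show the plan/edit/test/repeat shape:

```python
# Toy coding-agent loop: plan, edit, run tests, repeat until they pass.
# All helpers are hypothetical stand-ins for real tool calls.

def run_agent(task, run_tests, max_iters=5):
    """Loop until run_tests() reports success or we give up."""
    history = []
    for step in range(1, max_iters + 1):
        plan = f"step {step}: attempt '{task}'"  # the model would plan/edit here
        history.append(plan)
        ok, feedback = run_tests()               # execute the test suite
        history.append(feedback)                 # feedback drives the next step
        if ok:
            return True, history                 # tests pass: done
    return False, history                        # give up after max_iters

# Fake test harness that fails twice, then passes.
attempts = {"n": 0}
def fake_tests():
    attempts["n"] += 1
    passed = attempts["n"] >= 3
    return passed, f"run {attempts['n']}: {'pass' if passed else 'fail'}"

ok, log = run_agent("fix the bug", fake_tests)
```

The point is that the model sees test output between iterations, which is exactly the feedback signal Qwen3-Coder-Next was trained on.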

It was trained with environment interaction and executable coding tasks, so it learns from test results and feedback loops.
Typical strengths:

  • repo-scale refactors
  • debugging from logs
  • multi-step tool use
  • IDE agents (Cline, Claude Code, etc.)

# Quantisation

# Q4_K_M ⮺

Q4_K_M is a compressed 4-bit version of an AI model that balances small size with good accuracy.

Q4 means each model weight is stored in 4 bits instead of 16, reducing memory usage by roughly 4×:

  • smaller files
  • faster inference
  • lower RAM/VRAM requirements
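
The rule of thumb is simply parameters × bits per weight. A minimal sketch (the 4.5 bits/weight figure for Q4_K_M is an approximation; real GGUF files add some overhead and quantise some tensors at higher precision):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model size: parameter count times bits per weight, in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(30, 16.0)  # unquantised half precision
q4 = model_size_gb(30, 4.5)     # Q4_K_M lands near 4.5-5 bits/weight in practice

print(f"FP16: {fp16:.1f} GB, Q4_K_M: {q4:.1f} GB")
```

For a 30B model this gives about 60 GB at FP16 versus roughly 17 GB at 4-bit, which is why quantisation is what makes these models fit on consumer GPUs at all.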

K stands for K-block quantisation, an improved quantisation method used in GGML/GGUF models.
Instead of compressing each weight independently, it compresses blocks of weights together, which:

  • preserves more information
  • improves accuracy compared to older Q4 formats

M denotes the size/precision variant of the scheme: S (small), M (medium) or L (large).
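
A minimal sketch of the per-block idea: quantise a block of weights to signed 4-bit integers with one shared scale. This is heavily simplified relative to the real K-quant format (which uses super-blocks with per-sub-block scales and minima), but shows why grouping weights preserves information:

```python
def quantize_block(weights, bits=4):
    """Quantise one block of floats to signed ints with a shared scale."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]          # ints in [-8, 7]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.12, -0.40, 0.33, 0.05, -0.21, 0.40, -0.07, 0.18]
q, s = quantize_block(block)
restored = dequantize_block(q, s)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Because the scale is fitted to each small block rather than the whole tensor, the reconstruction error stays bounded by half the block's own scale, which is what gives K-quants their accuracy edge over older flat Q4 formats.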

# Model Parameters

# A3B ⮺

Some newer models use a Mixture-of-Experts (MoE) architecture.
A3B → Active 3B parameters during inference.

Instead of using all parameters every time, the model:

  • Contains a large total parameter count (e.g., 35B).
  • Activates only a subset of them per token using a router that selects a few “experts.”

Example

| Model name | Total parameters | Active parameters |
|---|---|---|
| Qwen3.5-35B-A3B | 35B | ~3B active |
| ERNIE-4.5-21B-A3B | 21B | ~3B active |
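
A toy top-k router shows the mechanism (the expert count and sizes here are made up for illustration, not Qwen's actual architecture):

```python
def route(token_scores, k=2):
    """Pick the k experts with the highest router scores for this token."""
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return ranked[:k]

n_experts = 8
params_per_expert = 4  # billions, hypothetical
scores = [0.1, 0.9, 0.05, 0.7, 0.2, 0.01, 0.3, 0.15]  # router output, one token

active = route(scores, k=2)                 # only these experts run
active_params = len(active) * params_per_expert
total_params = n_experts * params_per_expert
```

Only the selected experts' weights do compute for that token, so inference speed tracks the active parameter count (~3B) even though the full model (and its memory footprint) is much larger.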

# instruct ⮺

This refers to a model that is specifically trained or fine-tuned to follow instructions from users in a helpful, safe, and coherent way.

# DFlash ⮺

DFlash is a novel speculative decoding method that utilizes a lightweight block diffusion model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
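
Setting DFlash's block-diffusion drafter aside, speculative decoding in general follows this pattern: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with. A toy sketch with deterministic stand-in "models" (real systems verify all draft tokens in a single batched forward pass rather than one call per token):

```python
def speculative_decode(prefix, draft, target, k=4):
    """One speculative step: draft k tokens, accept while the target agrees."""
    seq = list(prefix)
    proposals = []
    for _ in range(k):                              # k cheap draft calls
        proposals.append(draft(seq + proposals))
    accepted = []
    for tok in proposals:
        if target(seq + accepted) == tok:           # target verifies the draft
            accepted.append(tok)
        else:
            accepted.append(target(seq + accepted)) # fix the first mismatch
            break
    return seq + accepted

# Toy models: the target doubles the last token; the draft does too, but
# saturates at 8, so it diverges once values get large.
target = lambda s: s[-1] * 2
draft = lambda s: min(s[-1] * 2, 8)

out = speculative_decode([1], draft, target, k=4)
```

When the draft agrees with the target, one verification pass yields several tokens at once, which is where the speedup comes from; drafter quality (DFlash's contribution) determines how often that happens.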

Written on March 8, 2026, Last update on March 11, 2026
LLM at_home