Running Claude Code with Local LLMs: A Practical Guide
Claude Code is Anthropic's CLI-based agentic coding assistant. It can read files, write code, run shell commands, and iterate on complex tasks autonomously. What most people don't realize is that you can point it at locally hosted models instead of Anthropic's API — giving you offline capability, zero API costs, and full control over your inference stack.
Here's how I set it up on my Mac using Ollama and LM Studio, and what I learned along the way.
Why Run Local Models with Claude Code?
Claude Code's strength is its agentic loop: it reasons about a task, uses tools (file reads, shell commands, grep, etc.), observes the results, and iterates. That loop doesn't strictly require Anthropic's hosted models. Any model that speaks the right API protocol can drive it.
Running locally gives you:
- No round-trip latency to a remote API — everything stays on your machine
- No usage costs — unlimited iterations without watching a token meter
- Privacy — your code never leaves your network
- Experimentation — try different models, quantizations, and context lengths freely
The tradeoff is capability. Local models (even large ones like Qwen3 Coder 480B MoE) won't match Claude Opus on complex reasoning tasks. But for focused coding work — scaffolding, refactoring, writing tests, exploring a codebase — they're surprisingly effective.
My Hardware
Running LLMs locally is heavily dependent on your hardware. Here's what I'm working with:
- Machine: MacBook Pro (M4 Pro, Late 2024)
- Chip: Apple M4 Pro — 14 cores (10 performance + 4 efficiency)
- GPU: 20-core integrated GPU with Metal 4 support
- Unified Memory: 48 GB
- Storage: 1 TB SSD (~485 GB free after models and everything else)
- OS: macOS 26.3 (Tahoe)
The M4 Pro's unified memory architecture is what makes this setup viable. Unlike discrete GPU setups where you're limited by VRAM, Apple Silicon shares the full 48 GB between CPU and GPU. That means I can load models up to ~40 GB in full and still have headroom for the OS and other apps. The 20-core GPU handles matrix operations at respectable speeds through Metal, and the memory bandwidth (~273 GB/s on M4 Pro) keeps token generation fluid.
For reference, the qwen3-coder:latest model (~16 GB) runs comfortably with room to spare. The larger Qwen3-30B-A3B at Q5 quantization (~40 GB) pushes close to the limit but still works — just don't expect to run Chrome with 50 tabs alongside it. The 480B cloud variant is too large to run locally at full precision, which is where Ollama Cloud comes in (more on that below).
If you're considering this setup, 32 GB of unified memory is the practical minimum for useful coding models. 48 GB or 64 GB gives you access to the larger, more capable models that really shine at code generation.
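A quick way to sanity-check whether a model fits in memory: weight size scales with parameter count times bits per weight. This back-of-envelope sketch ignores KV cache and runtime overhead, which add several more GB on top:

```shell
# Rough weight footprint in GB: params_in_billions * bits_per_weight / 8.
# Ignores KV cache, activations, and runtime overhead.
PARAMS_B=30
echo "Q4: $(( PARAMS_B * 4 / 8 )) GB  Q8: $(( PARAMS_B * 8 / 8 )) GB  FP16: $(( PARAMS_B * 16 / 8 )) GB"
# → Q4: 15 GB  Q8: 30 GB  FP16: 60 GB
```

This is why a dense 30B model at 4-bit quantization is comfortable on 32 GB of unified memory, while higher precisions quickly are not.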
The Architecture
The setup is straightforward:
Claude Code CLI → HTTP (OpenAI-compatible API) → Local Inference Server
                                                    ├── Ollama (port 11434)
                                                    └── LM Studio (port 1234)
Both Ollama and LM Studio expose OpenAI-compatible chat completion endpoints. Claude Code can talk to them by overriding a few environment variables.
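Stripped to its core, the redirection is just a few exports before launching the CLI — a minimal sketch assuming Ollama's default port and an already-pulled model:

```shell
# Minimal redirect: point Claude Code at local Ollama instead of api.anthropic.com.
# The token is a placeholder; local servers don't validate auth.
export ANTHROPIC_BASE_URL="http://localhost:11434/"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_DEFAULT_SONNET_MODEL="qwen3-coder:latest"
# claude --model qwen3-coder:latest   # then launch as usual
```

The shell functions later in this post wrap exactly this pattern with validation and tuning on top.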
Setting Up Ollama
Ollama is a lightweight runtime for running LLMs locally. Install it with Homebrew:
brew install ollama
Pull a model — I primarily use Qwen3 Coder, which is excellent for code tasks:
ollama pull qwen3-coder:latest
Start the Ollama server:
ollama serve
Ollama now listens on http://localhost:11434 and serves an OpenAI-compatible chat completions API.
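Before pointing Claude Code at it, it's worth confirming the server is actually listening. A small reachability check (the `/api/tags` endpoint and port are Ollama defaults) degrades gracefully when the server is down:

```shell
# Report whether the local Ollama server is reachable (default port 11434).
ollama_up() {
  curl -s -o /dev/null --max-time 2 "http://localhost:11434/api/tags" \
    && echo up || echo down
}
ollama_up
```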
Setting Up LM Studio (Alternative)
LM Studio provides a GUI for managing and running local models. It's useful for loading GGUF quantized models and offers a built-in server mode.
Once installed, enable the local server in settings (default port 1234) and load your desired model. LM Studio also exposes an OpenAI-compatible API at http://localhost:1234/v1.
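The same kind of sanity check works for LM Studio — listing loaded models over its OpenAI-compatible endpoint. This sketch assumes the default port; the fallback just prints an empty list if the server is down:

```shell
# List models the LM Studio server has loaded (default port 1234).
# Falls back to an empty OpenAI-style list if the server isn't running.
lmstudio_models() {
  curl -s --max-time 2 "http://localhost:1234/v1/models" || echo '{"data":[]}'
}
lmstudio_models
```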
The Shell Functions
The key insight is that Claude Code respects environment variables for its API configuration. By overriding ANTHROPIC_BASE_URL and a few related variables, you redirect all inference calls to your local server.
I created two shell functions in my .zshrc:
Ollama Launcher
function start_claude() {
  if [ -z "$1" ]; then
    echo "Usage: start_claude <model_name>"
    return 1
  fi
  local MODEL="$1"

  # Verify the model exists locally
  if ! ollama list | grep -q "^$MODEL"; then
    echo "Error: Model '$MODEL' not found."
    ollama list
    return 1
  fi

  # Point Claude Code at the local Ollama server
  export ANTHROPIC_BASE_URL="http://localhost:11434/"
  export ANTHROPIC_AUTH_TOKEN="ollama"
  export ANTHROPIC_DEFAULT_SONNET_MODEL="$MODEL"
  export ANTHROPIC_DEFAULT_HAIKU_MODEL="$MODEL"
  export CLAUDE_CODE_SUBAGENT_MODEL="$MODEL"
  export CLAUDE_CODE_STREAMING=true

  # Tune for local model constraints
  export OLLAMA_CONTEXT_LENGTH=32768
  export CLAUDE_CODE_MAX_CONTEXT_FILES=20
  export CLAUDE_CODE_MAX_LINES_PER_FILE=500
  export CLAUDE_CODE_MAX_TOOL_ITERATIONS=3
  export CLAUDE_CODE_RETRY_TOOLS=false

  # Warm up the model (loads it into memory)
  echo "Warming up model '$MODEL'..."
  ollama run "$MODEL" "Hello"

  # Launch Claude Code
  claude --model "$MODEL"
}
LM Studio Launcher
function start_lmstudio() {
  if [ -z "$1" ]; then
    echo "Usage: start_lmstudio <model_name>"
    return 1
  fi
  local MODEL="$1"

  # Verify LM Studio server is running
  if ! curl -s http://localhost:1234/v1/models > /dev/null; then
    echo "Error: LM Studio server not running on port 1234"
    return 1
  fi

  export OPENAI_BASE_URL="http://localhost:1234/"
  export OPENAI_API_KEY="lm-studio"
  export ANTHROPIC_BASE_URL="http://localhost:1234/"
  export ANTHROPIC_AUTH_TOKEN="lm-studio"
  export CLAUDE_CODE_STREAMING=true
  export CLAUDE_CODE_MAX_TOOL_ITERATIONS=3
  export CLAUDE_CODE_RETRY_TOOLS=false

  # Warm up with a quick inference call
  curl -s http://localhost:1234/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer lm-studio" \
    -d '{"model":"'"$MODEL"'","messages":[{"role":"user","content":"Hello"}],"max_tokens":5}' > /dev/null

  claude --model "$MODEL"
}
Usage is simple:
# With Ollama
start_claude qwen3-coder:latest
# With LM Studio
start_lmstudio Qwen3-Coder-30B-A3B-Instruct-GGUF
Key Environment Variables Explained
| Variable | Purpose |
|---|---|
| ANTHROPIC_BASE_URL | Redirects Claude Code's API calls to your local server |
| ANTHROPIC_AUTH_TOKEN | Dummy auth token (local servers don't need real auth) |
| ANTHROPIC_DEFAULT_SONNET_MODEL | Which model to use for the main agent |
| ANTHROPIC_DEFAULT_HAIKU_MODEL | Which model to use for lightweight sub-tasks |
| CLAUDE_CODE_SUBAGENT_MODEL | Model for spawned sub-agents |
| CLAUDE_CODE_MAX_TOOL_ITERATIONS | Limits agentic loops to prevent runaway inference |
| CLAUDE_CODE_RETRY_TOOLS | Disables automatic retries (local models can get stuck) |
| OLLAMA_CONTEXT_LENGTH | Sets the context window size for Ollama |
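Because these overrides live in your shell environment, it's easy to lose track of which ones are active. A small helper (my own addition, not part of Claude Code) lists them:

```shell
# List every Claude Code / Ollama override currently exported in this shell.
show_claude_env() {
  env | grep -E '^(ANTHROPIC|CLAUDE_CODE|OLLAMA)_' | sort
}
show_claude_env
```

Running it after `start_claude` shows exactly what the CLI will see; an empty result means you're back on the stock Anthropic configuration.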
My Model Collection
Over time I've accumulated several models for different use cases, totaling about 173 GB of model weights:
- qwen3-coder:latest — My default for coding tasks. Fast, capable, great tool use.
- qwen3-coder:480b-cloud — The larger MoE variant. Better reasoning, slower inference.
- qwen3-coder-flash:30b — Quick iterations when speed matters more than depth.
- Qwen3-30B-A3B — General-purpose tasks and exploration.
- llama3.3 and llama3.1:8b — Good for quick, lightweight tasks.
- qwq — Strong at reasoning and math-heavy problems.
Ollama Cloud: The Best of Both Worlds
Running everything locally is great until you hit a model that won't fit in memory. The full qwen3-coder:480b model is a 480-billion parameter Mixture-of-Experts model — far too large to run on a laptop, even with quantization. This is where Ollama Cloud comes in.
Ollama Cloud lets you reference models that run on Ollama's remote infrastructure while using the exact same local Ollama interface. From Claude Code's perspective, nothing changes — it still talks to http://localhost:11434, and Ollama transparently routes inference to the cloud when needed.
How It Works
When you pull a cloud-tagged model, Ollama stores a lightweight manifest locally that points to the remote endpoint:
ollama pull qwen3-coder:480b-cloud
Instead of downloading 400+ GB of weights, this creates a manifest that tells Ollama to proxy requests to https://ollama.com:443 for the full qwen3-coder:480b model running in BF16 precision with a 262k token context window. Authentication is handled via an SSH key pair that Ollama generates automatically at ~/.ollama/id_ed25519.
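To verify the key pair exists, you can print the public half — assuming the conventional `.pub` companion file next to the private key at the default install location:

```shell
# Print the public half of Ollama's auth key pair, if present.
# Path assumes a default install; the .pub suffix is the conventional
# companion to the id_ed25519 private key.
cat ~/.ollama/id_ed25519.pub 2>/dev/null || echo "no Ollama key pair found"
```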
Configuring It as a Claude Code Backup
I've configured the cloud model in Ollama's integration config at ~/.ollama/config/config.json:
{
  "integrations": {
    "claude": {
      "models": [
        "qwen3-coder:480b-cloud"
      ]
    }
  }
}
This registers the cloud model as available to Claude Code through the Ollama integration. In practice, I use it as a step-up option: when a local 30B model isn't cutting it on a complex task but I don't want to switch to the Anthropic API, I can launch Claude Code with the cloud model:
start_claude qwen3-coder:480b-cloud
The same start_claude shell function works — it sees the model in ollama list, sets the environment variables, and launches. The only difference is that inference happens on Ollama's servers instead of my laptop's GPU.
Why This Matters
This gives me a three-tier setup:
- Local small models (qwen3-coder:latest, llama3.1:8b) — fast, free, fully offline
- Ollama Cloud (qwen3-coder:480b-cloud) — more capable, still uses the local Ollama interface, good for harder tasks
- Anthropic API (Claude Opus/Sonnet) — the full-powered option for complex architectural work
The key advantage is that tiers 1 and 2 use the exact same tooling and shell functions. There's no context switch between "local mode" and "cloud mode" — just a different model name. And because Ollama Cloud uses SSH key authentication rather than API tokens, there's no key management overhead either.
Practical Tips and Lessons Learned
1. Always Warm Up the Model
The first inference call loads the model into GPU/CPU memory. This can take 10-30 seconds for larger models. The warm-up step in the shell functions handles this so Claude Code doesn't time out on its first tool call.
2. Limit Tool Iterations
Local models can sometimes enter loops — generating a tool call, misinterpreting the result, and retrying endlessly. Setting CLAUDE_CODE_MAX_TOOL_ITERATIONS=3 and CLAUDE_CODE_RETRY_TOOLS=false keeps things under control.
3. Tune Context Length Carefully
More context means more memory consumption. I found 32k tokens to be a good balance for most coding tasks. Going beyond that on consumer hardware leads to slowdowns or OOM crashes.
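The memory cost is easy to estimate: the KV cache grows linearly with context length. A back-of-envelope calculation — with illustrative architecture numbers, not Qwen3's actual config — shows why 32k tokens is already substantial:

```shell
# KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes/elem x tokens.
# Architecture numbers below are illustrative, not a real model's config.
LAYERS=48 KV_HEADS=8 HEAD_DIM=128 BYTES=2 CTX=32768
echo "$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * CTX / 1024 / 1024 )) MiB"
# → 6144 MiB (~6 GiB) for the cache alone, on top of the model weights
```

Doubling the context doubles this figure, which is how a model that "fits" at 32k can OOM at 64k.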
4. Local Models Struggle with Complex Multi-Step Reasoning
Where Claude Opus might plan a 10-step refactor and execute it flawlessly, a local 30B model might lose track after 3-4 steps. Use local models for focused, well-scoped tasks. Save the hosted API for the complex architectural work.
5. Model Validation Matters
The start_claude function checks that the requested model actually exists in Ollama before proceeding. This prevents confusing errors when you mistype a model name.
When to Use Local vs. Hosted
| Use Case | Local LLM | Hosted Claude |
|---|---|---|
| Quick code edits | Great | Overkill |
| Scaffolding boilerplate | Great | Overkill |
| Complex multi-file refactors | Struggles | Excellent |
| Writing tests | Good | Great |
| Codebase exploration | Good | Great |
| Offline / air-gapped work | Only option | Not available |
| Cost-sensitive iteration | Free | Metered |
Conclusion
Running local LLMs with Claude Code gives you a powerful, private, and cost-free development assistant. The setup is surprisingly simple — just redirect the API endpoint and tune a few parameters. While local models won't replace hosted Claude for the hardest tasks, they handle a huge portion of day-to-day coding work remarkably well.
The combination of Ollama's simplicity, the Qwen3 Coder model family's coding capability, and Claude Code's agentic framework creates a local AI coding setup that would have been unimaginable just a couple of years ago. Give it a try — your GPU will thank you for the exercise.
Adhip Gupta
Senior Staff Production Engineer