Local Agentic Programming on the Cheap: Claude Code + Ollama + Gemma 4

The escalating operational costs, privacy concerns, and restrictive rate limits associated with cloud-hosted large language model (LLM) agentic workflows have prompted a significant shift towards local inference solutions. A compelling new stack, combining Google DeepMind’s recently released Gemma 4, the local LLM runner Ollama, and Anthropic’s Claude Code CLI, offers developers a robust and cost-effective alternative for complex coding tasks, effectively bringing sophisticated AI agents to the desktop. This setup addresses the core challenges faced by engineers deploying multi-agent systems that can make hundreds of API calls in a single session, often leading to unexpected billing spikes and proprietary code exposure.

The Rise of Local Agentic AI: Addressing Cloud LLM Challenges

In the current landscape of AI-powered development, multi-agent systems have demonstrated remarkable capabilities, from automated code generation and refactoring to complex debugging and testing. However, the reliance on commercial cloud LLM providers introduces a litany of practical hurdles. Every token exchanged with a cloud API incurs a cost, which can rapidly accumulate in iterative, agentic workflows. A single afternoon of a multi-agent system reading files, writing patches, running tests, and iterating across four services, potentially making 400 API calls, can quickly exceed soft limits, demanding higher expenditure. Beyond cost, transmitting proprietary source code to third-party servers raises significant data privacy and intellectual property concerns for many enterprises. Furthermore, rate limits frequently interrupt long-running sessions, hindering productivity and forcing developers into more expensive tier subscriptions. The emergence of powerful, open-weight models like Gemma 4, coupled with efficient local inference engines like Ollama, presents a viable path to mitigate these issues, offering a private, uncapped, and zero-per-token-cost environment for agentic development.

Gemma 4: Google DeepMind’s Strategic Open-Weight Advance

Released by Google DeepMind on April 2, 2026, Gemma 4 represents a pivotal moment in the open-weight LLM ecosystem. It is the most capable model family to date from Google DeepMind to be released under the permissive Apache 2.0 license. This licensing change is strategically significant, as previous Gemma versions utilized a custom Google license with commercial use restrictions that were ambiguous enough to cause enterprise legal teams to flag them as potential blockers. The shift to Apache 2.0 dramatically simplifies integration for internal tooling, product development, and production pipelines, removing a major operational and legal overhead for businesses.

The Gemma 4 family shipped in four variants: E2B (2 billion effective parameters), E4B (4 billion effective), 26B Mixture-of-Experts (MoE), and 31B Dense. The 26B MoE variant, particularly relevant for its efficiency and performance, employs 128 small experts but activates only 8 per token plus one shared expert. This innovative architecture allows it to deliver near-31B quality at a dramatically lower computational cost, making it an ideal candidate for local deployment on capable workstations. This MoE design is a key differentiator, enabling high performance without the prohibitive resource demands typically associated with larger dense models.
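
To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. The 128 experts, 8 active per token, and one shared expert come from the description above. The random router scores, the softmax over the selected experts, and the function name route_token are illustrative assumptions, not Gemma 4's actual implementation.

```python
import math
import random

NUM_EXPERTS = 128   # routed experts, per the description above
TOP_K = 8           # routed experts activated per token

def route_token(router_scores, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and normalize their weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    selected = ranked[:top_k]
    # Softmax over the selected scores only (an assumption; real routers vary).
    peak = max(router_scores[i] for i in selected)
    exps = [math.exp(router_scores[i] - peak) for i in selected]
    total = sum(exps)
    weights = [e / total for e in exps]
    return selected, weights

random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router logits
experts, weights = route_token(scores)

# The shared expert always runs, so 8 + 1 = 9 of 129 expert blocks do work per token.
active_fraction = (TOP_K + 1) / (NUM_EXPERTS + 1)
print(f"selected experts: {sorted(experts)}")
print(f"fraction of expert blocks active per token: {active_fraction:.1%}")
```

Only the selected experts run a forward pass for a given token, which is where the compute savings come from. All expert weights still have to be resident in memory.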

Performance Metrics for Code Agents: A Deep Dive

For coding agents, the ability to reliably call tools, execute multi-step workflows, and handle errors is paramount. Gemma 4 demonstrates significant advancements in these critical areas, as evidenced by its performance across specialized benchmarks:

τ2-bench (agentic tool use): This benchmark tests a model’s proficiency in multi-step workflows involving tool calls, execution, and error handling. Gemma 4 31B Dense achieved 86.4%, while the 26B MoE scored approximately 79%. The previous generation, Gemma 3 27B, scored just 6.6% on the same benchmark. A gap that large marks the difference between a model that struggles with tool invocation and one that can manage a complex agentic loop while consistently formatting function call parameters correctly.
LiveCodeBench v6: Focused on live coding scenarios, Gemma 4 31B Dense scored 80.0%, with the 26B MoE close behind at 77.1%. Gemma 3 27B’s score of 29.1% highlights the substantial progress in code generation and comprehension.
GPQA Diamond: A challenging general-purpose question-answering benchmark, Gemma 4 31B Dense achieved 84.3%, and 26B MoE 82.3%, far surpassing Gemma 3 27B’s 42.4%. While not directly coding-related, strong general reasoning underpins an agent’s ability to understand complex requirements and diagnose issues.
AIME 2026 (math): Demonstrating advanced mathematical reasoning, Gemma 4 31B Dense scored 89.2%, and 26B MoE 88.3%, a dramatic improvement over Gemma 3 27B’s 20.8%. Accurate mathematical capabilities are crucial for tasks involving algorithms, data structures, and performance optimization in coding.
Arena AI ELO: A competitive ranking system for LLMs, Gemma 4 31B Dense reached an ELO of 1452, with 26B MoE at 1441. Gemma 3 27B’s 1365 illustrates the competitive edge gained by the new family.

These benchmark results collectively underscore Gemma 4’s suitability for agentic coding tasks, particularly its enhanced ability to interact with tools and maintain state across multi-turn conversations.

Hardware Considerations for Local Deployment

Before downloading an 18 GB model, check what your hardware can handle. The Gemma 4 family spans edge devices to high-end workstations. The table below lists the VRAM requirements and active parameters for three of the four variants:

Variant	Ollama tag	Active params	VRAM at Q4	Context window
Edge 4B	gemma4:e4b	4B	~6 GB	128K
26B MoE	gemma4:26b	3.8B	~16–18 GB	256K
31B Dense	gemma4:31b	31B	~24–32 GB	256K

For the recommended 26B MoE variant, approximately 16–18 GB of VRAM is needed for quantized (Q4) operation. In practice that points to a 24 GB card such as an NVIDIA RTX 3090, or an equivalent AMD card. A 16 GB card such as the RTX 4080 sits at the bottom of that range and may spill into system RAM, which severely degrades inference speed. Verify your GPU’s VRAM before pulling the model.
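
The VRAM figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 4.5 bits per weight for a Q4-style quantization and a 2 GB allowance for KV cache and runtime buffers. Both numbers are illustrative assumptions rather than measured values, and the function names are hypothetical.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9

def fits_in_vram(total_params: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float) -> bool:
    """True if weights plus a caller-supplied overhead budget fit in VRAM."""
    return weight_memory_gb(total_params, bits_per_weight) + overhead_gb <= vram_gb

# Assumptions: ~4.5 bits per weight and a 2 GB overhead budget (illustrative only).
# An MoE model keeps all 26B parameters resident even though few are active per token.
weights_gb = weight_memory_gb(26e9, 4.5)
print(f"26B weights at ~4.5 bits: {weights_gb:.1f} GB")
print("fits in 16 GB:", fits_in_vram(26e9, 4.5, vram_gb=16, overhead_gb=2.0))
print("fits in 24 GB:", fits_in_vram(26e9, 4.5, vram_gb=24, overhead_gb=2.0))
```

Under these assumptions the weights alone land a little under 15 GB, which is consistent with the 16–18 GB figure once cache and buffers are included.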

Setting Up the Local Agent Stack: Ollama, Gemma 4, and Claude Code

The foundation of this local agentic programming setup involves three key components: Ollama for serving the local LLM, Gemma 4 as the powerful model, and Claude Code as the orchestrating agent.

Step 1: Install Ollama
Ollama is a lightweight, extensible framework for running LLMs locally. Its ease of installation and growing support for various model APIs make it an ideal choice.

# macOS and Linux -- one-line install
curl -fsSL https://ollama.com/install.sh | sh

# Verify version -- must be 0.14.0+ for Anthropic Messages API support
# The Anthropic-compatible endpoint was added in January 2026, making it critical for Claude Code.
ollama --version
# Expected: ollama version is 0.22.x or higher (as of May 2026)

# Windows users: download the native installer from https://ollama.com.
# WSL2 (Windows Subsystem for Linux 2) is highly recommended for GPU passthrough on Windows.

Once installed, Ollama runs as a background service, typically listening on port 11434. A quick curl http://localhost:11434 should return "Ollama is running," confirming its operational status.
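
If the version check needs to be scripted, for example in a team onboarding script, a small parser is enough. The sketch below assumes the CLI prints a string containing a dotted version number, as in the expected output above. The sample strings and function names are made up for illustration.

```python
import re

MIN_VERSION = (0, 14, 0)  # minimum stated above for Anthropic Messages API support

def parse_ollama_version(output: str):
    """Extract (major, minor, patch) from text such as 'ollama version is 0.22.1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return tuple(int(part) for part in match.groups())

def meets_minimum(output: str, minimum=MIN_VERSION) -> bool:
    """Compare versions as integer tuples so 0.9.x sorts below 0.14.0."""
    return parse_ollama_version(output) >= minimum

# Sample strings for illustration; in practice pass in the CLI's real output.
print(meets_minimum("ollama version is 0.22.1"))
print(meets_minimum("ollama version is 0.9.6"))
```

Comparing integer tuples avoids the classic string-comparison trap in which "0.9" appears newer than "0.14".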

Step 2: Pull Gemma 4
With Ollama active, the next step is to download the Gemma 4 model.

# The 26B MoE variant is recommended for its balance of performance and VRAM usage (~18 GB download).
ollama pull gemma4:26b

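# Note: `ollama pull` prints its own progress bar while downloading.
# `ollama ps` lists models currently loaded in memory (empty until a model is first run):
ollama ps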

# Optional: For comparison or if you have ample VRAM, also pull the 31B Dense variant:
ollama pull gemma4:31b

# Confirm successful download:
ollama list

Step 3: Install Claude Code
Claude Code, Anthropic’s command-line interface for agentic programming, orchestrates the interactions between the user’s codebase and the LLM.

# Prerequisites: Node.js 18 or later is required.
node --version # Confirm your Node.js version.

# Install Claude Code CLI globally:
npm install -g @anthropic-ai/claude-code

# Verify installation:
claude --version

At this point, with Ollama running and Gemma 4 downloaded, the intuitive next step might be to export environment variables and immediately launch Claude Code. However, a crucial intermediary step, the Modelfile, is essential for optimal agentic performance.

Crafting the Agent’s Brain: The Essential Modelfile

One of the most critical aspects of configuring Gemma 4 for agentic sessions with Claude Code is overriding Ollama’s default context window. While Gemma 4’s actual context window can range from 128K to 256K tokens, Ollama defaults to a mere 4K tokens unless explicitly configured otherwise. In an agentic session that requires reading multiple source files, maintaining extensive conversation history, and tracking tool call results across numerous turns, a 4K token limit is exhausted almost instantly. This limitation leads to silent failures: the agent loses track of file contents mid-edit, forgets earlier instructions, and produces fragmented or incomplete changes. For example, attempting to refactor a 200-line service class will result in the agent forgetting the second half of the file, leading to partially correct and often broken output.
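
A rough budget calculation shows how quickly a small window fills up. The sketch below uses a four-characters-per-token heuristic, which is an assumption (real tokenizers vary), along with stand-in strings for a 200-line file and some conversation history.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, an assumption; real tokenizers vary

def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_context(texts, num_ctx: int, reserve_for_output: int = 1024) -> bool:
    """Check whether all texts plus an output reserve fit in the window."""
    used = sum(estimate_tokens(t) for t in texts)
    return used + reserve_for_output <= num_ctx

# A stand-in for a 200-line service class at about 80 characters per line.
service_class = ("x" * 79 + "\n") * 200
history = "y" * 8000  # stand-in for prior turns and tool results

print("estimated tokens (file):", estimate_tokens(service_class))
print("fits in 4K: ", fits_context([service_class, history], 4096))
print("fits in 64K:", fits_context([service_class, history], 65536))
```

Even this single file plus a modest history overflows a 4K window under the heuristic, while a 64K window leaves ample headroom.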

The solution is a custom Modelfile. This file allows developers to bake specific inference parameters, including the correct context size, temperature, and a guiding system prompt, directly into a named model variant. This ensures every Claude Code session starts with the optimal configuration.

Create the Modelfile (e.g., ~/.ollama/Modelfiles/gemma4-claude):

# ~/.ollama/Modelfiles/gemma4-claude
# Gemma 4 26B MoE variant tuned for Claude Code agentic sessions.
# Bakes context window, temperature, and system prompt into the model
# so every Claude Code session starts with the correct configuration.
#
# Build with:
#   mkdir -p ~/.ollama/Modelfiles
#   ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

FROM gemma4:26b

# PARAMETER num_ctx: This sets the context window size. 65536 tokens (64K) is a tested-safe floor
# for real-world codebases, preventing VRAM swap on systems with 16-18 GB VRAM.
# For systems with 24 GB+ VRAM, increasing to 131072 (128K) is possible.
# Exceeding 131072 should only be done after careful memory profiling, as Ollama pre-allocates
# the full KV cache upfront, potentially leading to out-of-memory errors if too large.
PARAMETER num_ctx 65536

# PARAMETER temperature: Set deliberately low at 0.2 for agentic coding.
# Higher temperatures introduce variability in tool call parameter formatting,
# causing Claude Code's robust tool validator to reject calls, leading to agent loops.
# For creative or generative tasks, a higher temperature would be appropriate, but for
# precise agentic loops, consistency is key.
PARAMETER temperature 0.2

# PARAMETER top_p: Nucleus sampling threshold. 0.9 keeps generation focused
# while preventing the repetitive loops that top_p=1.0 can sometimes produce
# during long agentic sessions, especially when the model struggles with a task.
PARAMETER top_p 0.9

# PARAMETER repeat_penalty: Penalizes the model for repeating tokens.
# A value of 1.15 is crucial for preventing "tool call loops" where Gemma 4
# might repeatedly attempt the same failed tool call with nearly identical parameters,
# rather than diagnosing and adapting.
PARAMETER repeat_penalty 1.15

# PARAMETER num_predict: Maximum tokens per response. 4096 is typically
# sufficient for most code patches and responses. Increase to 8192 if
# generating very large files or extensive documentation in a single turn.
PARAMETER num_predict 4096

# SYSTEM prompt: Reinforces critical coding agent behaviors and explicit
# tool use discipline. Gemma 4, like many LLMs, benefits significantly from
# being reminded to commit to tool calls directly rather than describing
# what it *would* do. This prompt encourages methodical and precise actions.
SYSTEM """You are a senior software engineer operating as a coding agent.

When working with code:
- Read files before editing them. Never assume file contents.
- Make one focused change at a time and verify it before proceeding.
- When a tool call fails, examine the error carefully before retrying.
  Do not retry with identical parameters. Diagnose first.
- Prefer surgical edits over full file rewrites.
- Run tests after each meaningful change, not after a batch of changes.
- If you are uncertain about the codebase structure, read more files
  rather than guessing.

Be precise and methodical. Avoid explaining what you are about to do
when you could simply do it."""

Build the variant:

mkdir -p ~/.ollama/Modelfiles
ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

# Verify the variant was created:
ollama list
# Should show gemma4-claude alongside gemma4:26b

# Quick smoke test to ensure it loads and responds:
ollama run gemma4-claude "What is the time complexity of binary search and why?"
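
To confirm that the parameters were actually baked into the variant, the model's metadata can be inspected programmatically. The sketch below assumes Ollama's /api/show endpoint accepts a JSON body with a model field and returns a plain-text parameters field of name and value pairs. Treat both as assumptions to check against your Ollama version. The helper names are hypothetical.

```python
import json
import urllib.request

def parse_parameters(parameters_text: str) -> dict:
    """Turn 'name  value' lines into a dict; repeated names keep the last value."""
    params = {}
    for line in parameters_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params

def fetch_parameters(model: str, base_url: str = "http://localhost:11434") -> dict:
    """Ask a running Ollama instance for a model's parameters via /api/show."""
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.load(response)
    return parse_parameters(body.get("parameters", ""))

# Offline illustration with sample text shaped like the assumed field.
sample = "num_ctx                        65536\ntemperature                    0.2\n"
print(parse_parameters(sample))
```

With Ollama running, fetch_parameters("gemma4-claude") should report the num_ctx value set in the Modelfile if the assumptions above hold.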

Connecting the Tools: Wiring Claude Code to Ollama

With the custom model variant prepared, the next step is to point Claude Code at the local Ollama endpoint by setting a few environment variables, either globally or per project. Note that the base URL must be http://localhost:11434, not http://localhost:11434/v1. Claude Code appends the Anthropic Messages API path (/v1/messages) to the base URL itself, so a base URL that already ends in /v1 sends requests to a path that does not exist, which surfaces as errors or unexpected behavior.
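
One way to see why the base URL must not end in /v1 is to build the request URL by hand. The sketch below assumes an Anthropic-style client appends /v1/messages to whatever base URL it is given. The function name is hypothetical.

```python
def messages_url(base_url: str) -> str:
    """Build the request URL the way an Anthropic-style client is assumed to."""
    return base_url.rstrip("/") + "/v1/messages"

print(messages_url("http://localhost:11434"))     # -> http://localhost:11434/v1/messages
print(messages_url("http://localhost:11434/v1"))  # -> http://localhost:11434/v1/v1/messages
```

The second URL contains a doubled /v1 segment, so the request never reaches the Messages endpoint.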

Global Settings – ~/.claude/settings.json
This configuration applies to every Claude Code session, making it suitable for consistent local inference across all your projects.

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}

ANTHROPIC_BASE_URL: Directs Claude Code to your local Ollama instance.
ANTHROPIC_AUTH_TOKEN: A placeholder credential. A token value must be present in the request, and "ollama" is sufficient because local inference performs no actual authentication.
ANTHROPIC_API_KEY: Set to an empty string to prevent Claude Code from trying to use a real Anthropic API key, ensuring it defaults to the local setup.
ANTHROPIC_MODEL: Specifies the exact model variant (from your Modelfile) that Claude Code should use.
ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, ANTHROPIC_DEFAULT_OPUS_MODEL: These variables explicitly map Claude’s different pricing/performance tiers to your local gemma4-claude model. This is crucial as Claude Code often defaults to these tier names internally.
CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: Set to "1" to disable any experimental features that might not be fully compatible with a local Ollama setup, ensuring stability.
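
A small linter can catch the most common mistakes in this file before a session starts. The sketch below is a hypothetical helper, not part of Claude Code. The choice of which keys to treat as required is an assumption based on the descriptions above.

```python
import json

REQUIRED_ENV_KEYS = [
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_MODEL",
]

def check_settings(settings_text: str) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    env = json.loads(settings_text).get("env", {})
    for key in REQUIRED_ENV_KEYS:
        if not env.get(key):
            problems.append(f"missing or empty: {key}")
    if env.get("ANTHROPIC_BASE_URL", "").rstrip("/").endswith("/v1"):
        problems.append("ANTHROPIC_BASE_URL should not end in /v1")
    return problems

# Example with a deliberately wrong base URL.
example = json.dumps({"env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434/v1",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_MODEL": "gemma4-claude",
}})
print(check_settings(example))
```

Passing the contents of a real settings.json to check_settings returns an empty list when the three keys are set and the base URL has no /v1 suffix.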

Per-Project Configuration – .claude/settings.json
For scenarios requiring isolated local inference, such as private repositories, sensitive codebases, or projects with unique model requirements, a project-level settings.json is preferable. This file, placed in a .claude directory at the project root, overrides global settings.

# In your project root
mkdir -p .claude

cat > .claude/settings.json << 'EOF'
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}
EOF

It is good practice to add .claude/settings.json to your .gitignore if it contains environment-specific settings. Alternatively, it can be committed if the entire team is expected to use this local inference configuration.

Ensuring Readiness: A Comprehensive Verification Process

Before entrusting Claude Code with a real codebase, it is imperative to verify the entire stack. This involves confirming Ollama’s service status, the model’s availability, and, most critically, its ability to correctly handle Anthropic Messages API calls, especially tool invocations. The verification script provided below is designed to perform these checks systematically. Tool calling is the bedrock of Claude Code’s functionality; without it, the agent cannot read files, write patches, or execute commands, leading to continuous failures.
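
For context on what a tool invocation looks like on the wire: in the Anthropic Messages format, the model emits a tool_use content block, and the client answers with a user message containing a matching tool_result block. The sketch below builds that reply. The block id, file path, and helper name are made up for illustration.

```python
def build_tool_result_message(tool_use_block: dict, output: str,
                              is_error: bool = False) -> dict:
    """Wrap a tool's output in the user message that answers a tool_use block."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_block["id"],  # links the result to the call
            "content": output,
            "is_error": is_error,
        }],
    }

# A hand-written tool_use block; the id and path are made up for illustration.
tool_use = {
    "type": "tool_use",
    "id": "toolu_example_01",
    "name": "read_file",
    "input": {"path": "/tmp/test.py"},
}
reply = build_tool_result_message(tool_use, "print('hello')")
print(reply["content"][0]["tool_use_id"])
```

If the model describes a tool call in prose instead of emitting a tool_use block, there is no id to answer, and the loop stalls. That is the failure the verification below is meant to catch.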

Prerequisites:

pip install httpx # HTTP client used by the verification script.

The Full Verification Script (verify_local_setup.py):


#!/usr/bin/env python3
"""
verify_local_setup.py

Verifies the full Claude Code + Ollama + Gemma 4 stack before use.
Runs four checks in sequence:
  1. Ollama health
  2. Model availability
  3. Basic Anthropic Messages API call
  4. Tool calling round-trip

Prerequisites:
  pip install httpx

How to run:
  python verify_local_setup.py

Expected output on a working setup:
  [PASS] Ollama is running on localhost:11434
  [PASS] Model 'gemma4-claude' is available
  [PASS] Anthropic Messages API call successful
  [PASS] Tool calling: model produced a valid tool_use block
  All checks passed -- Claude Code + Ollama + Gemma 4 is ready.
"""

import httpx
import json
import sys

# --- Configuration ---
OLLAMA_BASE_URL = "http://localhost:11434"
MODEL_NAME      = "gemma4-claude"   # Must match your Modelfile variant name
TIMEOUT         = 120.0             # Seconds -- generation can be slow on first call

def check_ollama_health() -> bool:
    """
    Check 1: Verify Ollama is running and responding.
    Hits the root endpoint which returns 'Ollama is running' when healthy.
    """
    print("\nCheck 1: Ollama health")
    try:
        response = httpx.get(OLLAMA_BASE_URL, timeout=5.0)
        if "Ollama is running" in response.text:
            print(f"  [PASS] Ollama is running on {OLLAMA_BASE_URL}")
            return True
        else:
            print(f"  [FAIL] Unexpected response: {response.text[:100]}")
            return False
    except httpx.ConnectError:
        print(f"  [FAIL] Cannot connect to {OLLAMA_BASE_URL}")
        print("         Is Ollama running? Try: ollama serve")
        return False

def check_model_available() -> bool:
    """
    Check 2: Verify the specific model variant is available in Ollama.
    Uses the /api/tags endpoint which lists all pulled models.
    """
    print("\nCheck 2: Model availability")
    try:
        response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5.0)
        data     = response.json()
        models   = [m["name"] for m in data.get("models", [])]

        # Normalize: Ollama may add ":latest" if not specified
        normalized = [m.split(":")[0] for m in models]

        if MODEL_NAME in models or MODEL_NAME in normalized:
            print(f"  [PASS] Model '{MODEL_NAME}' is available")
            return True
        else:
            print(f"  [FAIL] Model '{MODEL_NAME}' not found")
            print(f"         Available models: {', '.join(models) or 'none'}")
            print(f"         Run: ollama create {MODEL_NAME} -f ~/.ollama/Modelfiles/gemma4-claude")
            return False
    except Exception as e:
        print(f"  [FAIL] Error checking model list: {e}")
        return False

def check_messages_api() -> bool:
    """
    Check 3: Send a basic Anthropic Messages API call to the local endpoint.
    Verifies the request format, model routing, and basic generation work.
    Uses the same /v1/messages path and request schema that Claude Code uses.
    Note: Claude Code is configured with the root URL (http://localhost:11434)
    as its base and appends /v1/messages itself, so the base URL must not
    already end in /v1.
    """
    print("\nCheck 3: Anthropic Messages API call")

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 100,
        "messages": [
            {
                "role": "user",
                "content": "Reply with exactly: VERIFICATION_OK"
            }
        ]
    }

    headers = {
        "Content-Type":      "application/json",
        "x-api-key":         "ollama",            # Required by the API spec; value ignored locally
        "anthropic-version": "2023-06-01"         # Required version header
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data = response.json()

        # Anthropic Messages API response structure:
        # {"content": [{"type": "text", "text": "..."}], "stop_reason": "..."}
        content_blocks = data.get("content", [])
        text_blocks    = [b for b in content_blocks if b.get("type") == "text"]

        if not text_blocks:
            print(f"  [FAIL] No text content in response: {json.dumps(data, indent=2)}")
            return False

        response_text = text_blocks[0].get("text", "")
        print("  [PASS] Anthropic Messages API call successful")
        print(f"         Model response: {response_text[:80]}")
        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False

def check_tool_calling() -> bool:
    """
    Check 4: Verify tool calling works end-to-end.
    This is the most important check for Claude Code agentic use.
    Claude Code relies on the model correctly producing tool_use blocks
    for every file operation, shell command, and code execution.

    Sends a simple tool definition and a prompt that should trigger it.
    Verifies the model returns a tool_use block (not just text describing the call).
    """
    print("\nCheck 4: Tool calling verification")

    # A minimal tool definition using the Anthropic function calling schema
    tools = [
        {
            "name": "read_file",
            "description": "Read the contents of a file at the given path.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "The absolute or relative file path to read"
                    }
                },
                "required": ["path"]
            }
        }
    ]

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 256,
        "tools": tools,
        # Force the model to call a tool rather than respond in text.
        # tool_choice {"type": "any"} requests that some tool be used.
        # Remove this if testing whether the model self-selects tools.
        "tool_choice": {"type": "any"},
        "messages": [
            {
                "role": "user",
                "content": "Read the file at /tmp/test.py and show me its contents."
            }
        ]
    }

    headers = {
        "Content-Type":      "application/json",
        "x-api-key":         "ollama",
        "anthropic-version": "2023-06-01"
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data           = response.json()
        content_blocks = data.get("content", [])
        tool_blocks    = [b for b in content_blocks if b.get("type") == "tool_use"]

        if not tool_blocks:
            print("  [FAIL] Model did not produce a tool_use block")
            print("         This means tool calling is not working correctly.")
            print("         Agentic Claude Code sessions will fail on file operations.")
            print(f"         Full response: {json.dumps(data, indent=2)}")
            return False

        tool_call  = tool_blocks[0]
        tool_name  = tool_call.get("name", "")
        tool_input = tool_call.get("input", {})

        print("  [PASS] Tool calling: model produced a valid tool_use block")
        print(f"         Tool called: {tool_name}")
        print(f"         Parameters:  {json.dumps(tool_input)}")

        # Sanity check: did it call the right tool with the right parameter?
        if tool_name == "read_file" and "path" in tool_input:
            print("         Tool name and parameter are correct.")
        else:
            print("         WARNING: Unexpected tool name or missing 'path' parameter.")
            print("         The model called a tool but not the expected one.")

        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False

def main():
    print("=" * 60)
    print("Claude Code + Ollama + Gemma 4 Setup Verification")
    print("=" * 60)

    checks = [
        check_ollama_health,
        check_model_available,
        check_messages_api,