Building Local AI Systems: Qwen3.6 + MCPs

The landscape of artificial intelligence is rapidly evolving, with a significant shift towards more localized and specialized deployments. A groundbreaking development in this arena is the powerful synergy between Qwen3.6-35B-A3B, a highly capable local large language model, and the Model Context Protocol (MCP), an open standard designed to revolutionize AI tool connectivity. This combination promises to empower developers to build sophisticated, cloud-independent AI agents, exemplified by a local GitHub developer assistant that can read issues, analyze code, draft fixes, and create pull requests entirely on local hardware.

The Unmet Need: Bridging LLMs and Real-World Actions

For a considerable period, developers experimenting with local AI models encountered a fundamental limitation: while these models excelled at reasoning, generating code, and answering complex queries, their utility often ended at the conceptual level. A local AI model, despite its intellectual prowess, could not natively interact with external systems – it couldn’t query a database, open a GitHub issue, or invoke an internal API. This necessitated a laborious process of writing custom Python wrappers for every tool, hardcoding the intricate "glue" between the model’s textual output and the tool’s execution, and enduring the maintenance burden each time an API underwent changes. This bespoke integration process was a significant bottleneck, hindering the development of truly autonomous and actionable AI agents in local environments. The demand for robust, secure, and privacy-preserving AI solutions has surged, with enterprises increasingly seeking to deploy AI closer to their data sources. However, the lack of standardized tooling for local LLMs presented a formidable barrier to widespread adoption.

Model Context Protocol: A Universal Standard for AI Tooling

Addressing this critical gap, the Model Context Protocol (MCP) has emerged as a transformative open standard. Conceived by Anthropic, a leading AI research company known for its commitment to safe and interpretable AI, MCP provides a universal, pluggable framework for AI tool connectivity. Its core innovation lies in abstracting the complexity of tool integration. Instead of writing custom code for each model and tool, developers define a tool once as an MCP server. Any MCP-compatible client, irrespective of the underlying model or framework, can then seamlessly discover and invoke these tools with zero custom integration code. This establishes a "write once, use everywhere" paradigm for AI tools, significantly accelerating development and fostering an interoperable AI ecosystem.

MCP operates as a JSON-RPC 2.0 protocol, communicating over standard I/O (stdio) or HTTP. When an MCP client initiates a connection with a server, its first action is to call tools/list to enumerate the available functionalities. Each tool is described with a name, a comprehensive description, and an input schema defined in JSON Schema. This schema serves as the model’s formal contract with the tool, enabling precise understanding of required inputs. When the model determines the need to use a tool, it generates a structured tool call object. Crucially, the MCP client – not the model itself – executes this call by dispatching a tools/call request to the server. The server performs the actual operation and returns a result, which the client then injects back into the conversation as a tool role message. This clear separation of concerns – the model decides, the client executes, the server performs – eliminates hardwired dependencies and significantly enhances the flexibility and maintainability of AI agents. The advent of MCP, initially announced in late 2023, has been widely welcomed by the developer community as a pivotal step towards democratizing access to advanced AI capabilities beyond proprietary cloud platforms. Industry analysts, such as those at IDC and Gartner, have highlighted the importance of open standards like MCP in fostering innovation and preventing vendor lock-in within the rapidly expanding AI landscape.

Qwen3.6-35B-A3B: Architecting for Local Agentic Intelligence

Complementing MCP’s standardization efforts is Qwen3.6-35B-A3B, an Alibaba Cloud-developed model that currently stands out as one of the most capable local models for agentic workloads. Its architectural innovations directly address the hardware and performance constraints typically associated with large models, making sophisticated local AI agents a practical reality.

The model’s name itself offers key insights: 35 billion total parameters, with "A3B" denoting that only 3 billion parameters are activated per forward pass. This is achieved through a Mixture of Experts (MoE) architecture, a design that allows the model to harness the vast knowledge capacity of a 35B model while incurring the inference compute cost typically associated with a much smaller 3B model. Specifically, Qwen3.6-35B-A3B employs 256 experts per layer, routing 8 plus 1 shared experts per token. This efficient trade-off enables the model to run on hardware configurations that would be overwhelmed by a dense 35B parameter model, making it accessible to a broader range of developers and local deployments.

Further differentiating Qwen3.6 is its hidden layer layout. The 40-layer stack features a 3:1 ratio of Gated DeltaNet layers to Gated Attention layers. DeltaNet, a linear attention mechanism, processes sequences with superior efficiency compared to full quadratic attention, especially critical for long context lengths. The interleaved full Gated Attention layers provide the deep relational reasoning capabilities that linear attention alone might miss. For an AI agent tasked with navigating a large codebase, potentially spanning hundreds of files, this combination is invaluable: it ensures efficient processing of extensive data while maintaining precise reasoning on relevant sections, avoiding "context blindness" common in less optimized models.

Perhaps one of Qwen3.6’s most compelling features for agentic work is its colossal context window. Natively, it supports 262,144 tokens, and with YaRN scaling, this can be extended to an astounding 1,010,000 tokens. In agent development, context length is not merely a convenience; it’s an operational imperative. An agent engaged in complex tasks like reading source files, maintaining a comprehensive history of tool calls, tracking a multi-step plan, and injecting tool results back into its context requires significant headroom. Many 7B and 13B models often cap at 8k or 32k tokens, which can lead to agents running out of context mid-task, losing their operational history, and consequently hallucinating tool results – a critical failure point for automated systems. Qwen3.6’s expansive context window mitigates this risk, ensuring the agent retains a holistic understanding throughout its operation.

Crucially, Qwen3.6 was explicitly trained and evaluated on MCP-based agentic benchmarks. This specialized training ensures the model inherently understands and is optimized for the patterns of tool invocation and response interpretation defined by MCP, leading to more reliable and effective agent behavior. This intentional design choice by Alibaba Cloud underscores a strategic focus on enabling robust, real-world AI applications.

Architecting a Local GitHub Developer Assistant

The practical application of Qwen3.6 and MCP is vividly demonstrated through the creation of a local GitHub developer assistant. This agent is designed to perform a sequence of complex development tasks: identifying open issues in a GitHub repository, locating relevant code sections, drafting a targeted fix, and subsequently creating a pull request – all executed locally without any cloud dependencies.

To set up this powerful assistant, several software components are required. Python 3.11+ is essential, along with core packages such as openai (for model interaction), qwen-agent (for simplified agent orchestration), mcp (the core protocol library), and httpx. For serving the Qwen3.6 model locally, developers can choose between vllm or sglang for NVIDIA GPUs (with SGLang recommended for its faster prefill for long contexts) or ktransformers for CPU/hybrid deployments. Node.js 18+ is also necessary for installing pre-built MCP servers via npx.

The local serving endpoint for Qwen3.6 is configured to expose an OpenAI-compatible API, allowing the MCP integration layer to communicate seamlessly. Whether using SGLang, vLLM, or even a smaller model via Ollama, the principle remains the same: the agent points to localhost rather than api.openai.com. Specific flags are crucial for optimal agent performance, such as --reasoning-parser qwen3 to correctly handle think blocks and --tool-call-parser qwen3_coder to format tool call outputs. The --enable-prefix-caching (or --enable-prefix-caching-v2 for vLLM) flag is particularly vital for agent workloads, as it enables KV cache reuse across turns, dramatically improving efficiency in long conversational sessions.

The GitHub developer assistant can be implemented in two primary ways:

Qwen-Agent Implementation: This approach, leveraging the qwen-agent library, provides the fastest path to a working agent. qwen-agent automatically manages the full agentic loop, including starting MCP servers as subprocesses, managing sessions, and orchestrating tool calls. The LLM_CONFIG dictionary directs the agent to the local Qwen3.6 serving endpoint, while MCP_SERVERS defines the necessary MCP servers, such as filesystem (to interact with local files) and github (to interact with GitHub APIs). A detailed SYSTEM_PROMPT guides the agent’s behavior, outlining its role as a senior software engineer and its step-by-step methodology for issue resolution and pull request creation. This high-level abstraction simplifies development, allowing rapid prototyping and deployment.
Raw MCP SDK Implementation: For development teams requiring granular control over every protocol message, custom error handling, per-tool retry logic, and detailed audit logging of tool calls and results, the raw MCP SDK offers maximum flexibility. This involves directly interacting with the mcp library to start server processes, initialize sessions, discover tools, and manage the conversation history manually. The tool_to_session dictionary is a critical component, mapping each discovered tool name to its owning MCP session, ensuring the agent can invoke any tool without needing to know which specific server provides it. This low-level approach provides unparalleled insight and control, albeit with increased implementation complexity. Both implementations rely on the GITHUB_TOKEN environment variable for authentication with GitHub APIs, underscoring the agent’s capability to perform real-world actions.

Extending Functionality with Custom MCP Servers

The power of MCP truly shines when developers need to extend agent capabilities beyond pre-built servers. For tasks involving internal databases, proprietary CI/CD APIs, or specialized code analysis tools, custom MCP servers can be developed. The provided code_quality_server.py example demonstrates this, exposing ruff (a Python linter) and pytest (a Python testing framework) as MCP tools.

Using the FastMCP framework, a high-level abstraction for MCP server development, developers can quickly define new tools with Python functions. The run_linter tool, for instance, executes ruff on a specified file, returning structured linting results, including issue counts and potential fixes. Similarly, run_tests invokes pytest on a module or directory, providing detailed pass/fail outcomes. These custom servers can then be seamlessly integrated into either the Qwen-Agent or raw MCP SDK configurations, expanding the agent’s operational scope. The ability to create and integrate custom tools is a cornerstone of MCP’s design, fostering a rich and adaptable ecosystem for AI agent development. The npx @modelcontextprotocol/inspector utility allows developers to test custom servers standalone, ensuring their functionality before integration with the agent, further streamlining the development process.

Optimizing Agent Performance: Thinking Mode and Reasoning Preservation

The efficiency and effectiveness of agentic AI systems often hinge on nuanced configuration choices, particularly concerning the model’s "thinking mode" and how it preserves reasoning across turns. These decisions significantly impact both latency and the quality of complex multi-step tasks.

In "thinking mode," Qwen3.6 generates an explicit chain-of-thought reasoning trace, encapsulated within <think>...</think> tags, before producing its final action or tool call. For a multi-step agent task, this trace can add anywhere from 1,000 to 5,000 tokens per turn, depending on the complexity of the task. While these additional tokens consume context budget and increase generation time, the investment is often justified for complex operations such as planning, debugging, or tasks requiring deep understanding across multiple files. The reasoning trace acts as an internal monologue, allowing the model to systematically break down problems, evaluate options, and catch potential mistakes before they manifest as incorrect tool calls.

Conversely, for mechanical tool-call loops where each step is unambiguous – for instance, a sequence like list directory -> read file -> write file -> commit – the model may not require extensive internal deliberation. In such scenarios, a non-thinking mode can be significantly faster, producing equivalent quality output without the overhead of generating detailed reasoning traces. Qwen3.6 allows developers to switch between thinking and non-thinking modes on a per-request basis, providing fine-grained control over computational resources and performance. This can be achieved by adjusting sampling parameters or by explicitly adding /no_think to the system prompt to suppress thinking.

A unique and impactful feature of Qwen3.6 is the preserve_thinking flag, which directly enhances inference efficiency when prefix caching is active in the serving infrastructure (e.g., SGLang with --enable-prefix-caching). When preserve_thinking=True, the complete reasoning trace from prior turns is retained within the conversation history. The server’s KV cache intelligently recognizes and reuses this shared prefix across turns, avoiding redundant computation. Practically, this translates to a meaningfully higher effective tokens-per-second rate for long agent sessions, as the model doesn’t need to re-process its past thoughts from scratch. The practical guideline is to enable preserve_thinking=True for agent sessions expected to run for more than five turns, where the benefits of KV cache efficiency outweigh the marginal overhead. For single-turn queries or very short pipelines, the overhead might not be justified, and preserve_thinking=False (or non-thinking mode) would be more appropriate.

Broader Implications and the Future of Local AI Agents

The convergence of Qwen3.6-35B-A3B and the Model Context Protocol signifies a pivotal moment for the AI ecosystem. It empowers developers and enterprises to transcend the limitations of proprietary cloud-based AI, fostering a new era of decentralized, private, and highly customized AI solutions.

This architecture has profound implications:

Decentralization of AI: By enabling sophisticated AI agents to run locally, this technology reduces reliance on centralized cloud providers. This enhances data privacy and security, as sensitive information remains within an organization’s control, addressing a major concern for industries like finance, healthcare, and government.
Developer Empowerment: The open standard nature of MCP, coupled with powerful local models like Qwen3.6, lowers the barrier to entry for building complex AI applications. Developers are no longer restricted by vendor-specific tool integrations or prohibitive cloud costs, enabling greater experimentation and innovation.
Enterprise Adoption: Businesses can deploy AI agents that interact directly with internal systems – from legacy databases to specialized analytics platforms – without exposing data to external cloud environments. This unlocks automation opportunities across a wide array of business processes, from IT operations and customer support to supply chain management and data analysis.
Future of AI Agents: The GitHub developer assistant is merely one example. The same underlying architecture can power a research assistant capable of searching academic databases and drafting literature reviews, a DevOps agent monitoring CloudWatch logs and automatically opening incident tickets, or a data pipeline agent that reads SQL schemas, writes transformation code, and validates outputs. The robust context handling of Qwen3.6 combined with MCP’s plug-and-play tool capabilities creates a fertile ground for developing highly specialized and efficient autonomous agents tailored to specific enterprise needs.

As the MCP ecosystem continues its rapid expansion, offering hundreds of pre-built servers and a straightforward path for custom tool development, the capability for sophisticated local AI agents is no longer a distant vision but a tangible reality. This synergy between advanced local models and open, interoperable protocols is set to redefine how AI is developed, deployed, and utilized across industries, ushering in an era of intelligent automation that is both powerful and private.

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.