This is part two of an extended version of a talk I gave at TheiaCon 2025. That talk covered my experiences with Ollama and Theia AI over the previous months. In part one, I provided an overview of Ollama and how to use it to drive Theia AI agents, and presented the results of my experiments with different local large language models.
In this part, I will draw conclusions from these results and provide a look into the future of local LLM usage.
Considerations Regarding Performance
The experiment described in part one of this article showed that working with local LLMs is already possible, but still limited due to relatively slow performance.
Technical Measures: Context
The first observation is that the LLM becomes slower as the context grows, because it needs to parse the entire context for each message. At the same time, a context window that is too small leads to the LLM forgetting parts of the conversation: as soon as the context window is full, the LLM engine starts discarding the earliest messages in the conversation while retaining the system prompt. So, if an agent seems to forget the initial instructions you gave it in the chat, the context window has most likely been exceeded. In this case the agent might become unusable, so it is a good idea to use a context window that is large enough to hold the system prompt, the instructions, and the tool calls made during processing. On the other hand, at a certain point in long conversations or reasoning chains, the context can become so large that each message takes more than a minute to process.
Consequently, as users, we need to develop an intuition for the necessary context length: long enough for the task at hand, but not excessive.
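To make this concrete, here is what a direct request to Ollama's REST API looks like when the context window is set explicitly via the num_ctx option; the response also reports how many tokens were actually consumed, which helps to build that intuition. A minimal TypeScript sketch, where the model name and window size are just examples:

```typescript
// Minimal sketch: calling Ollama's chat API with an explicit context window
// (options.num_ctx) and inspecting the reported token counts afterwards.
// Model name and window size are examples, not recommendations.
async function chatWithContextWindow(): Promise<void> {
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen3:8b',
      stream: false,
      // Large enough for system prompt, instructions, and tool calls,
      // but not excessive: each message re-parses the entire context.
      options: { num_ctx: 16384 },
      messages: [
        { role: 'system', content: 'You are a coding agent for a Theia workspace.' },
        { role: 'user', content: 'Summarize the open TODOs in the workspace.' }
      ]
    })
  });
  const result = await response.json();
  // prompt_eval_count = input tokens parsed, eval_count = tokens generated;
  // watching these values helps to develop an intuition for context usage.
  console.log(result.message.content, result.prompt_eval_count, result.eval_count);
}

chatWithContextWindow();
```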
Also, it is a good idea to reduce the necessary context by
- adding paths in the workspace to the context beforehand, so that instead of letting the agent browse and search the workspace for the files to modify via tool calls, we already provide that information. In my experiments, this reduced token consumption from about 60,000 tokens to about 20,000 for the bug analysis task. (Plus, this also speeds up the analysis process as a whole, because the initial steps of searching and browsing the workspace do not need to be performed by the agent).
- keeping conversations and tasks short. Theia AI recommends this even for non-local LLMs and provides tools such as Task Context and Chat Summary. So, it is a good idea to follow Theia AI's advice and use these features regularly.
- defining specialized agents. It is very easy to define a new agent with a custom prompt and toolset in Theia AI. If we can identify a recurring task that needs several specialized tools, it is a good idea to define a dedicated agent with exactly this toolset. With the support for MCP servers in particular, it might be tempting to start five or more MCP servers and simply throw all the available tools into the Architect or Coder agent's prompt. This is a bad idea, though, because each tool definition is added to the system prompt and thus consumes a part of the context window (see the sketch after this list).
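To illustrate why tool definitions are not free: when Ollama's chat API is called with tools, every tool is passed as a JSON schema in the request and ends up in the prompt. A small sketch; the readFile tool shown here is illustrative, not an actual Theia AI tool definition:

```typescript
// Sketch: every tool handed to the model is serialized into the request and
// ends up in the prompt, whether the agent ever calls it or not. The readFile
// tool below is illustrative, not an actual Theia AI tool definition.
const tools = [
  {
    type: 'function',
    function: {
      name: 'readFile',
      description: 'Read a file from the workspace and return its content.',
      parameters: {
        type: 'object',
        properties: {
          path: { type: 'string', description: 'Workspace-relative file path' }
        },
        required: ['path']
      }
    }
  }
  // ...each additional MCP tool adds another definition like this to every request
];

async function chatWithTools(question: string): Promise<unknown> {
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen3:8b',
      stream: false,
      tools,
      messages: [{ role: 'user', content: question }]
    })
  });
  return response.json();
}

chatWithTools('Open README.md and summarize it.').then(console.log);
```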
Note that unloading and loading models is rather expensive as well, typically taking several seconds, and in Ollama even changing the context window size causes a model reload. Therefore, as VRAM is usually limited, it is a good idea to stick to one or two models that fit into the available memory, and not to change context window sizes too often.
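When the Ollama REST API is used directly, the keep_alive parameter controls how long a model stays resident between requests, and the /api/ps endpoint (or `ollama ps` on the command line) shows what is currently loaded. A small sketch:

```typescript
// Sketch: keep_alive keeps the model resident in memory between requests,
// so it is not unloaded and reloaded for every interaction.
async function warmUpModel(): Promise<void> {
  await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen3:8b',
      stream: false,
      // Keep the model loaded for 30 minutes; a negative value keeps it
      // loaded indefinitely.
      keep_alive: '30m',
      messages: [{ role: 'user', content: 'ping' }]
    })
  });

  // Which models are currently loaded (and how much memory they occupy) can
  // be checked with `ollama ps` on the command line or via /api/ps.
  const loaded = await (await fetch('http://localhost:11434/api/ps')).json();
  console.log(loaded.models);
}

warmUpModel();
```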
Organizational Measures
Even when these considerations regarding context length are taken into account, local LLMs will always be slower than their cloud counterparts.
Therefore, we should compensate for this at the organizational level by adjusting the way we work; for example, while waiting for the LLM to complete a task,
- we can start writing and editing prompts for the next features
- we can review the previous task contexts or code modifications and adjust them
- we can do other things in parallel, like go to lunch, grab a coffee, go to a meeting, etc., and let the LLM finish its work while we are away.
Considerations Regarding Accuracy
As mentioned in part one, local LLMs are usually quantized (which basically means their values are rounded) so that the weights (or parameters) consume less memory. A quantized model can therefore be less accurate than the original. The typical symptom is that the agent does not do the correct thing, or does not use the correct arguments when calling a tool.
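To give a feel for what quantization does, here is a toy sketch that maps float weights to 4-bit integers and back. It only illustrates the rounding effect, not how real LLM quantizers work in detail:

```typescript
// Toy illustration of quantization: a float weight is mapped to a 4-bit
// signed integer and back, trading precision for much less memory.
// Real quantizers pack values and work per block, but the rounding idea
// is the same.
function quantize4bit(weights: number[]): { q: Int8Array; scale: number } {
  const max = Math.max(...weights.map(Math.abs));
  const scale = max / 7; // signed 4-bit range is roughly -8..7
  const q = Int8Array.from(weights.map(w => Math.round(w / scale)));
  return { q, scale };
}

function dequantize(q: Int8Array, scale: number): number[] {
  return Array.from(q, v => v * scale);
}

const original = [0.1234, -0.4567, 0.0021, 0.9876];
const { q, scale } = quantize4bit(original);
console.log(dequantize(q, scale)); // close to the originals, but visibly "rounded"
```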
In my experience, analyzing the reasoning/thinking content and checking the actual tool calls an agent makes is a good way to determine what goes wrong. Depending on the results of such an analysis,
- we can modify the prompt; for example by giving more details, more examples, or by emphasizing important things the model needs to consider
- we can modify the implementation of the provided tools. This, of course, requires building a custom version of the Theia IDE or of the affected MCP server. But if a tool call regularly fails because the LLM does not get the arguments 100% correct, and these errors could be compensated for in the tool implementation, it might be worth investing in making the tool implementation more robust (see the sketch after this list).
- we can provide more specific tools. Theia AI, for example, only provides general file modification tools, such as writeFileReplacements. If you work mostly with TypeScript code, it might be a better approach to implement and use a specialized TypeScript file modification tool that automatically takes care of linting, formatting, etc. on the fly.
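To illustrate the robustness point: a hypothetical, more forgiving tool implementation could normalize slightly wrong arguments instead of failing the tool call outright. The types and names below are made up for this sketch and do not correspond to Theia AI's actual tool API:

```typescript
// Hypothetical sketch of a more forgiving tool implementation: normalize
// slightly wrong arguments from the LLM instead of failing the tool call.
interface WriteFileArgs {
  path: string;
  content: string;
}

function normalizeArgs(raw: Record<string, unknown>): WriteFileArgs {
  // Models sometimes use alternative key names or wrap values unexpectedly.
  const path = String(raw.path ?? raw.file ?? raw.filePath ?? '')
    .trim()
    .replace(/^\.\//, ''); // tolerate a leading './'
  const content =
    typeof raw.content === 'string'
      ? raw.content
      : JSON.stringify(raw.content ?? ''); // tolerate non-string content

  if (!path) {
    throw new Error('writeFile: no usable "path" argument provided');
  }
  return { path, content };
}
```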
Considerations Regarding Complexity
During my experiments, I tried to give the agent more complex tasks to work on and let it run overnight. This failed, however, because sooner or later the agent becomes unable to continue due to the limited context size: it starts forgetting the beginning of the conversation and thus its primary objective.
One way to overcome this limitation is to split complex tasks into several smaller, lower-level ones. Starting with version 1.63.0, Theia AI supports agent-to-agent delegation. Based on this idea, we could implement a special Orchestrator agent (or a more programmatic workflow) that is capable of splitting complex tasks into a series of simpler ones. These simpler tasks could then be delegated to specialized agents (refined versions of Coder, AppTester, etc.) one by one. This would have the advantage that each step could start with a fresh, empty context window, in line with the considerations regarding context discussed above.
This is something that would need to be implemented and experimented with.
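To make the idea a bit more tangible, here is a very rough sketch of such a delegation loop. All names in it are hypothetical and do not correspond to existing Theia AI APIs; the two helper functions are placeholders for real planner and agent calls:

```typescript
// Very rough sketch of the orchestration idea; all names are hypothetical.
interface SubTask {
  description: string;
  agent: 'coder' | 'appTester' | 'reviewer';
}

// Placeholder: ask a planner model to split the complex task into small steps.
async function planSubTasks(complexTask: string): Promise<SubTask[]> {
  return [{ description: `First step of: ${complexTask}`, agent: 'coder' }];
}

// Placeholder: delegate one subtask to a specialized agent. Each call starts
// a fresh conversation, i.e. an empty context window containing only the
// subtask description (plus, at most, a short summary of earlier results).
async function delegate(task: SubTask): Promise<string> {
  return `Result of "${task.description}" from agent ${task.agent}`;
}

async function orchestrate(complexTask: string): Promise<string[]> {
  const results: string[] = [];
  for (const subTask of await planSubTasks(complexTask)) {
    results.push(await delegate(subTask));
  }
  return results;
}

orchestrate('Implement feature X and add tests').then(console.log);
```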
Odds and Ends
This blog article has presented my experiences and considerations about using local LLMs with Theia AI.
Several topics have only been touched on briefly, or not at all, and are subject to further investigation and experimentation:
- For a long time, I considered Ollama too slow for code completion, mostly because the TTFT (time to first token) is usually rather high. Recently, however, I found that at least with the model zdolny/qwen3-coder58k-tools:latest, the response time feels acceptable. So, I will start experimenting with this and other models for code completion.
- Also, Ollama supports fill-in-the-middle completion, meaning that the completion API accepts not only a prefix but also a suffix as input (a minimal request sketch follows after this list). This API is currently not supported by Theia AI directly: the Code Completion Agent in Theia provides the prefix and suffix context as part of the user prompt, so Theia AI would have to be enhanced to support fill-in-the-middle completion natively. It also remains to be determined whether this would help to improve performance and accuracy.
- Next, there are multiple approaches to optimizing and fine-tuning models for better accuracy and performance, such as quantization, knowledge distillation, reinforcement learning, and model fine-tuning, which can be used to make models more accurate and performant for one's personal use cases. The Unsloth and MLX projects, for example, aim at providing optimized, local options for performing these tasks.
- Finally, regarding Apple Silicon processors in particular, there are two technologies that could boost performance, if they were supported:
- CoreML is a proprietary Apple framework for using the native Apple Neural Engine (which would provide another performance boost if an LLM could run fully on it). The bad news is that using the Apple Neural Engine currently seems to be limited by several factors, so there are no prospects of running a heavier LLM, such as gpt-oss:20b, on the ANE at the moment.
- MLX is an open framework, also developed by Apple, that runs very efficiently on Apple Silicon processors by combining CPU, GPU, and Apple Neural Engine resources in a hybrid approach. Yet there is still very limited support for running LLMs in MLX format. At least, there are several projects and enhancements in development:
- there is a pull request in development to add MLX support to Ollama, which would be the basis for using the Neural Engine
- other projects, such as LM Studio, swama, and mlx-lm, support models in the optimized MLX format, but in my experiments, tool call processing was unfortunately unstable.
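Coming back to the fill-in-the-middle point above: with Ollama's generate API, the code before the cursor is sent as the prompt and the code after it as the suffix. A minimal sketch; the model is just an example, and not every model supports fill-in-the-middle:

```typescript
// Sketch of a fill-in-the-middle request against Ollama's generate API:
// the code before the cursor goes into `prompt`, the code after it into
// `suffix`.
async function fillInTheMiddle(): Promise<void> {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen2.5-coder:7b',
      prompt: 'function add(a: number, b: number): number {\n  return ',
      suffix: ';\n}\n',
      stream: false,
      options: { num_predict: 32 } // keep completions short for low latency
    })
  });
  const result = await response.json();
  console.log(result.response); // e.g. "a + b"
}

fillInTheMiddle();
```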
Outlook
The evolution of running LLMs locally and using them for agentic development in Theia AI has been moving fast recently. The progress made in 2025 alone suggests that locally run LLMs will continue to get better over time:
- better models keep appearing: from deepseek-r1 to qwen3 and gpt-oss, we can be excited about what will come next
- context management is getting better: every other week, we can observe discussions around enhancing or using the context window more effectively in one way or another, be it the Model Context Protocol, giving LLMs some form of persistent memory, choosing more compact representations of data (for example by using TOON), or utilizing more intelligent context compression techniques, to name just a few
- hardware is becoming better, cheaper, and more available; I performed my experiments with a five-year-old processor (an Apple M1 Max) and already achieved acceptable results. Today's processors are already much better, and there is more to come in the future
- software is becoming better: Ollama is being actively developed and enhanced, Microsoft has recently published BitNet, an engine for running 1-bit LLMs, and so on
We can be excited to see what 2026 will bring…
