Dr. Stefan Winkler
Freelance Software Developer and IT Consultant

This blog article is an extended version of a talk I gave at TheiaCon 2025. The talk covered my experiences with Ollama and Theia AI over the previous months.

What is Ollama?

Ollama is an open-source project that makes it possible to run Large Language Models (LLMs) locally on your own hardware with a Docker-like experience. This means that, as long as your hardware is supported, it is detected and used without further configuration.
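To illustrate the Docker-like workflow: models are pulled from a registry and run locally, much like container images. The commands below are standard Ollama CLI calls; the model name llama3.2 is just an example from the Ollama library.

    # pull a model from the Ollama library (analogous to "docker pull")
    ollama pull llama3.2

    # start an interactive chat session with the model (analogous to "docker run")
    ollama run llama3.2

    # list the models available locally (analogous to "docker images")
    ollama list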

Advantages

Running LLMs locally has several advantages:

  • Unlimited tokens: you only pay for the power you consume, and for the hardware if you do not already own it.
  • Full confidentiality and privacy: the data (code, prompts, etc.) never leaves your network. You do not have to worry about providers using your confidential data to train their models.
  • Custom models: You can choose from a large number of pre-configured models, or you can download and import new models, for example from Hugging Face. You can also take a model and tweak or fine-tune it for your specific needs.
  • Vendor neutrality: No matter who wins the AI race in a few months, you will always be able to run the model you are used to locally.
  • Offline: You can use a local LLM on a suitable laptop even when traveling, for example by train or on the plane. No Internet connection required. (A power outlet might be good, though...)

Disadvantages

Of course, all of this also comes at a cost. The most important disadvantages are:

  • Size limitations: Both the model size (number of parameters) and context size are heavily limited by the available VRAM.
  • Quantization: As a compromise to allow for larger models or contexts, quantization sacrifices weight precision. In other words, a quantized model can fit more parameters into the same amount of memory, but at the cost of lower inference accuracy, as we will see further below. (A rough back-of-envelope calculation follows below.)
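To get a feeling for the numbers, consider the weights of a hypothetical 20-billion-parameter model; this is an illustrative back-of-envelope calculation, not a measurement of any specific model:

    20B parameters × 2 bytes   (16-bit weights)      ≈ 40 GB of (V)RAM
    20B parameters × 1 byte    (8-bit, e.g. q8_0)    ≈ 20 GB of (V)RAM
    20B parameters × 0.5 bytes (4-bit, e.g. q4_K_M)  ≈ 10 GB of (V)RAM

On top of the weights, the KV cache grows with the context size, which is why the num_ctx setting discussed below matters so much.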

Until recently, the list of disadvantages also included the lack of support for local multimodal models, so reasoning about images, video, audio, etc. was not possible. That changed last week, when Ollama 0.12.7 was released along with locally runnable qwen3-vl model variants.

Development in 2025

A lot has happened in 2025 alone. At the beginning of the year, there was no good local LLM for agentic use (in particular, reasoning and tool calling were not really usable), and the support for Ollama in Theia AI was limited.

But since then, in the last nine months:

With the combination of these changes, it is now entirely possible to use Theia AI agents backed by local models.

Getting Started

To get started with Ollama, you need to follow these steps:

  1. Download and install the most recent version of Ollama. Be sure to check for updates regularly, as every Ollama release brings new models, new features, and performance improvements.
  2. Start Ollama using a command line like this:

    OLLAMA_NEW_ESTIMATES="1" OLLAMA_FLASH_ATTENTION="1" OLLAMA_KV_CACHE_TYPE="q8_0" ollama serve

    These environment variables enable the new memory estimation, flash attention, and an 8-bit quantized KV cache, which reduces the memory footprint of large context windows. Keep an eye on the Ollama release changelogs, as the available environment settings can change over time, and make sure to enable and experiment with new features.

  3. Download a model using:

    ollama pull gpt-oss:20b

  4. Configure the model in Theia AI by adding it to the Ollama settings under Settings > AI Features > Ollama
  5. Finally, as described in my previous blog post, you need to add request settings for the Ollama models in the settings.json file to adjust the context window size (num_ctx), as the default context window in Ollama is not suitable for agentic usage. A sketch of such an entry is shown below.
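For illustration, such an entry in settings.json could look roughly like the following sketch. The preference key and property names are written from memory and may differ between Theia versions, so treat them as assumptions and refer to the blog post linked above for the authoritative format:

    {
      "ai-features.modelSettings.requestSettings": [
        {
          "modelId": "gpt-oss:20b",
          "providerId": "ollama",
          "requestSettings": { "num_ctx": 131072 }
        }
      ]
    }

Here, num_ctx = 131072 corresponds to the 128k-token context window used for gpt-oss:20b in the experiments below.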

Experiments

In preparation for TheiaCon, I conducted several non-scientific experiments on my MacBook Pro M1 Max with 64 GB of RAM. Note that this is a 5-year-old processor.

The task I gave the LLM was to locate and fix a small bug: A few months ago, I had created Ciddle - a Daily City Riddle, a daily geographical quiz, mostly written in NestJS and React using Theia AI. In this quiz, the user has to guess a city. After some initial guesses, the letters of the city name are partially revealed as a hint, while some letters remain masked with underscores. As it turned out, this masking algorithm had a bug related to a regular expression that was not Unicode-friendly: it matched only ASCII letters, but not special characters such as é. So special characters were never masked with underscores.
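The following TypeScript sketch illustrates the nature of the bug with a simplified masking function. It is not the actual Ciddle code (the real algorithm only masks some of the letters), and the function names are made up for this example:

    // Buggy variant: [A-Za-z] only matches ASCII letters,
    // so accented letters such as "é" are never masked.
    function maskCityNameBuggy(name: string): string {
      return name.replace(/[A-Za-z]/g, "_");
    }

    // Fixed variant: \p{L} with the "u" flag matches any Unicode letter.
    function maskCityNameFixed(name: string): string {
      return name.replace(/\p{L}/gu, "_");
    }

    console.log(maskCityNameBuggy("Orléans")); // "___é___" (the "é" leaks through)
    console.log(maskCityNameFixed("Orléans")); // "_______"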

Therefore, I wrote a prompt explaining the issue and asked Theia AI to identify the bug and fix it. I followed the process described in this post:

  1. I asked the Architect agent to analyze the bug and plan a fix
    • once without giving the agent the file containing the bug, so the agent had to analyze and crawl the workspace to locate it
    • once after providing the file containing the bug via the "add path to context" feature of Theia AI
  2. I asked Theia AI to summarize the chat into a task context
  3. I asked Coder to implement the task (in agent mode, so it directly changes files, runs tasks, writes tests, etc.)
    • once with the unedited summary (which contained instructions to create a test case)
    • once with a summary from which all references to an automated unit test had been removed, so the agent would only fix the actual bug and not write any tests for it

The table below shows the comparison of different models and settings:

| Model | Architect | Architect (file path provided) | Summarize | Coder (fix and create test) | Coder (fix only) |
|---|---|---|---|---|---|
| gpt-oss:20b, num_ctx = 16k | 175s | 33s | 32s | 2.5m (3) | 43s |
| gpt-oss:20b, num_ctx = 128k | 70s | 50s | 32s | 6m | 56s |
| qwen3:14b, num_ctx = 40k | (1) | 143s | 83s | (4) | (4) |
| qwen3-coder:30b, num_ctx = 128k | (2) | (2) | 64s | 21m (3) | 13m |
| gpt-oss:120b-cloud | 39s | 16s | 10s | 90s (5) | 38s |

(1) Without the file path provided, the agent identified the wrong file and bugfix location.
(2) With or without the file path provided, the qwen3-coder "Architect" agent ran in circles trying to apply fixes instead of providing an implementation plan.
(3) Implemented the fix correctly, but did not write a test case, although instructed to do so.
(4) Stopped in the middle of the process without any output.
(5) In one test, gpt-oss:120b-cloud did not manage to get the test file right and failed when the hourly usage limit was exceeded.

Observations

I performed multiple runs per configuration; the table reports roughly the best-case times. As usual when working with LLMs, the results are not fully deterministic. In general, however, when a given model produces similar output, the processing time is also similar within a few seconds, so the table above shows typical results for runs with an acceptable outcome, where one was achievable at all.

In general, I achieved the best results with gpt-oss:20b and a context window of 128k tokens (the maximum for this model). A smaller context window can result in faster response times, but at the risk of not completing the task; for example, when running with a 16k context, the Coder agent would fix the bug but not provide a test, even though the task context contained this instruction.

Also, in my first experiments, the TypeScript/Jest configuration contained an error that caused the model (even with 128k context) to run in circles for 20 minutes and eventually delete the test again before finishing.

The other two local models I used in the tests, qwen3:14b and qwen3-coder:30b, were able to perform some of the agentic tasks, but usually with lower performance, and they failed outright in some scenarios.

Besides the models listed in the table above, I also tried a few other models that are popular in the Ollama model repository, such as granite4:small-h and gemma3:27b. But they either behaved similarly to qwen3:14b and just stopped at some point without any output, or they did not use the provided tools at all and just replied with a general answer.

Also note that some models (such as deepseek-r1) do not support tool calling in their local variants (yet...?). There are variants of common models that have been modified by users to support tool calling in theory, but in practice the tool calls are either not properly detected by Ollama, or the provided tools are not used at all.

Finally, just for comparison, I also used the recently released Ollama cloud model feature to run the same tasks with gpt-oss:120b-cloud. As expected, the performance is much better than with local models, but the gpt-oss:120b-cloud model also started running in circles once. So even that is not perfect in all cases.

To summarize, the best model for local agentic development with Ollama is currently gpt-oss:20b. When everything works, it is surprisingly fast even on my 5-year-old hardware. But if something goes wrong, it usually goes fatally wrong, and the model entangles itself in endless considerations and fruitless attempts to fix the situation.

Stay tuned for the second part of this article, where I will describe the conclusions I draw from my experiences and experiments, discuss consequences, and provide a look into the future of local LLMs in the context of agentic software development.