How to run a local coding LLM in your editor, step by step

the short answer

To run a local coding LLM, install a runtime (Ollama or llama.cpp), pull an open coding model like Qwen2.5-Coder or DeepSeek-Coder, then point oi at the runtime. oi's CLI and VS Code extension then give you completion, chat, and refactors against that local model, with no API key and nothing leaving your machine.

Running a coding model locally used to mean wrestling with CUDA, build flags, and quantization formats. It's much easier now: a runtime like Ollama or llama.cpp handles the hard parts, and you just pull a model and point your tools at it.

Pick a runtime first

A runtime is the program that loads the model weights and runs inference. Ollama is the easiest start: one install, then `ollama pull <model>`, and the Ollama server exposes the model on a local port (11434 by default). llama.cpp is the lower-level option: more control over quantization and flags, a bit more setup, and the engine a lot of other tools (including Ollama) build on. oi works with either.

Whichever you pick, the model itself is a separate download — an open-weights coding model you pull once and keep. After that it's all local: the runtime serves the model on localhost, and oi talks to it there.
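To see what "serves the model on localhost" looks like in practice, here is a minimal Python sketch that asks a local Ollama server which models it has pulled. It assumes Ollama's default address (`http://localhost:11434`) and its `/api/tags` listing endpoint. The function names are made up for illustration, and this is not a description of how oi itself talks to the runtime.

```python
import json
import urllib.request

# Assumption: Ollama's default local address. Change it if yours differs.
OLLAMA_URL = "http://localhost:11434"


def parse_model_names(payload: dict) -> list[str]:
    """Pull model names out of an /api/tags style response body."""
    return [entry["name"] for entry in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the runtime which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))
```

With the server running, `list_local_models()` should include the model you pulled. A connection error usually means the runtime isn't running yet.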

Then wire your editor and terminal to it

Once the runtime is serving a model, you want it where you actually work: inline in the editor and on hand in the terminal. oi's VS Code extension gives you chat and edits against the local model in the sidebar, and the CLI lets you ask questions or run edits from the shell without leaving it. Both point at the same local runtime, so you configure the model once.

Because there's no remote endpoint, there's nothing to authenticate and no rate limit to hit. The only limits are your hardware and the model you chose, both of which you control.
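As a rough illustration of "nothing to authenticate", the sketch below builds a completion request for a local Ollama server and sends it with no API key or auth header. It assumes Ollama's `/api/generate` endpoint on the default port and the `qwen2.5-coder` model tag; the helper names are hypothetical, and oi's own requests may look different.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "qwen2.5-coder",
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming completion request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # note: no Authorization header
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the prompt to the local runtime and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=120) as resp:
        return json.load(resp)["response"]
```

How fast `complete(...)` returns depends only on your hardware and the model size, which is the trade-off described above.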

how it works

  1. install a runtime

    Install Ollama (easiest) or build llama.cpp. This is what loads the model and serves inference on a local port.

  2. pull a coding model

    Download an open coding model, e.g. `ollama pull qwen2.5-coder` or a DeepSeek-Coder build. Pick a size that fits your RAM/VRAM.

  3. install oi

    Install the oi CLI, then run `oi setup` to walk through configuration. oi is free and local-first.

  4. link the runtime

    Point oi at your local runtime (Ollama or llama.cpp) and choose the model you pulled. Everything stays on localhost.

  5. add the VS Code extension

    Install the oi VS Code extension for inline chat, completion, and refactors against the same local model, with no API key.
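If you want a quick sanity check that the first and third steps worked, a small preflight script can confirm the tools are on your PATH. This sketch assumes the executables are named `ollama` and `oi` (taken from the commands above) and that you chose Ollama as the runtime; the function names are illustrative.

```python
import shutil
from typing import Callable, Iterable, Optional


def missing_tools(tools: Iterable[str],
                  which: Callable[[str], Optional[str]] = shutil.which) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def preflight_report(missing: list[str]) -> str:
    """Turn the list of missing tools into a one-line status message."""
    if not missing:
        return "ok: runtime and oi are on PATH"
    return "missing: " + ", ".join(missing)
```

For example, `print(preflight_report(missing_tools(["ollama", "oi"])))` tells you which install step to revisit, if any.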

frequently asked

Do I need a GPU?

Not strictly. Smaller coding models run on CPU, and run noticeably better on Apple Silicon (unified memory) or with a discrete GPU. Bigger models effectively need a GPU with enough VRAM to hold the quantized weights. Match the model size to your machine.
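A back-of-the-envelope way to match model size to memory: the weights take roughly (parameter count × bits per weight ÷ 8) bytes. The sketch below encodes that rule of thumb. The 20% headroom for context cache and runtime overhead is an assumption, not a measured figure, and real usage grows with context length.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         memory_gb: float, headroom: float = 1.2) -> bool:
    """True if the weights plus an assumed 20% headroom fit in memory_gb."""
    return estimate_weights_gb(params_billions, bits_per_weight) * headroom <= memory_gb
```

By this estimate, a 7B model quantized to 4 bits needs about 3.5 GB for weights, while the same model at 16 bits needs about 14 GB.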

Ollama or llama.cpp — which should I use?

Start with Ollama: it's the least setup and pulls models with one command. Reach for llama.cpp when you want finer control over quantization and runtime flags. oi links to either, so you can switch later.
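Part of why switching is cheap: both runtimes can expose an OpenAI-style chat endpoint over HTTP, so mostly the base URL changes. The sketch below assumes each runtime's default port (11434 for Ollama, 8080 for llama.cpp's `llama-server`). The mapping and function name are illustrative, and how oi links to a runtime internally isn't covered here.

```python
# Assumed defaults: Ollama listens on 11434, llama.cpp's llama-server on 8080.
DEFAULT_BASE_URLS = {
    "ollama": "http://localhost:11434",
    "llama.cpp": "http://localhost:8080",
}


def chat_endpoint(runtime: str) -> str:
    """Return the OpenAI-style chat endpoint for the given runtime."""
    try:
        base = DEFAULT_BASE_URLS[runtime]
    except KeyError:
        raise ValueError(f"unknown runtime: {runtime!r}") from None
    return f"{base}/v1/chat/completions"
```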

Is there any per-token cost?

No. Once the model is on your disk and the runtime is serving it locally, inference is free — you're using your own hardware. oi itself is free and local-first.

Will it work offline?

Yes, once the runtime and model are installed. Inference runs on localhost, so after the initial downloads you can code with no network at all.

Does my code get sent anywhere?

No. oi talks to a runtime on localhost, so prompts and code stay on your machine. That's the point of running the model locally.
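If you'd like to verify this for your own setup, one simple check is that the runtime URL you configured resolves to a loopback address. This is a minimal sketch using only the Python standard library; the function name is illustrative, and it checks the URL you pass in, not oi's configuration.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback(url: str) -> bool:
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (e.g. a public hostname), so treat it as remote.
        return False
```

Here `is_loopback("http://localhost:11434")` and `is_loopback("http://127.0.0.1:8080")` are true, while a URL pointing at a public hostname is not.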

Last updated June 19, 2026
