Pick a runtime first
A runtime is the thing that actually loads the model weights and runs inference. Ollama is the easiest start: one install, then `ollama pull <model>` and it's serving on a local port. llama.cpp is the lower-level option — more control over quantization and flags, a bit more setup, and the engine a lot of other tools (including Ollama) build on. oi works with either.
Whichever you pick, the model itself is a separate download — an open-weights coding model you pull once and keep. After that it's all local: the runtime serves the model on localhost, and oi talks to it there.
Then wire your editor and terminal to it
Once the runtime is serving a model, you want it where you actually work: inline in the editor and on hand in the terminal. oi's VS Code extension gives you chat and edits against the local model in the sidebar, and the CLI lets you ask questions or run edits from the shell without leaving it. Both point at the same local runtime, so you configure the model once.
Because there's no remote endpoint, there's nothing to authenticate and no rate limit to hit. The only limits are your hardware and the model you chose, both of which you control.
how it works
- 01
install a runtime
Install Ollama (easiest) or build llama.cpp. This is what loads the model and serves inference on a local port.
- 02
pull a coding model
Download an open coding model, e.g. `ollama pull qwen2.5-coder` or a DeepSeek-Coder build. Pick a size that fits your RAM/VRAM.
- 03
install oi
Install the oi CLI, then run `oi setup` to walk through configuration. oi is free and local-first.
- 04
link the runtime
Point oi at your local runtime (Ollama or llama.cpp) and choose the model you pulled. Everything stays on localhost.
- 05
add the VS Code extension
Install the oi VS Code extension for inline chat, completion, and refactors against the same local model — no API key.
frequently asked
Do I need a GPU?
Not strictly. Smaller coding models run on CPU, and run noticeably better on Apple Silicon (unified memory) or with a discrete GPU. Bigger models effectively need a GPU with enough VRAM. Match the model size to your machine.
Ollama or llama.cpp — which should I use?
Start with Ollama: it's the least setup and pulls models with one command. Reach for llama.cpp when you want finer control over quantization and runtime flags. oi links to either, so you can switch later.
Is there any per-token cost?
No. Once the model is on your disk and the runtime is serving it locally, inference is free — you're using your own hardware. oi itself is free and local-first.
Will it work offline?
Yes, once the runtime and model are installed. Inference runs on localhost, so after the initial downloads you can code with no network at all.
Does my code get sent anywhere?
No. oi talks to a runtime on localhost, so prompts and code stay on your machine. That's the point of running the model locally.
Last updated June 19, 2026