Dale Nguyen Posted on May 30 • Originally published at dalenguyen.me Local-first: a Model on Your Own Machine, Zero Cloud # ai # python # ollama # llm This is the concrete, runnable walkthrough for Post 1 of the Portway series . The goal: stand up a single model behind an OpenAI-compatible endpoint on hardware you already own, call it from the official OpenAI SDK, and internalize the stateless contract. Everything here runs locally for $0. What this post covers A demo.py script with two blocks: Round-trip — one chat call via the OpenAI SDK, printing the content and the usage object. Stateless proof — the same final question sent as a 1-turn message and as the last turn of a 5-turn fabricated history; both prompt_tokens values are printed alongside an explanation of the delta. Engine choice on this machine Apple Silicon Mac, 48 GB unified memory, Ollama already installed. The demo uses Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1 and the gpt-oss:20b model (~14 GB). The wider Portway series uses llama.cpp on Mac (Ollama is called out as problematic for Qwen3.5 in Post 2). For Post 1 — one model, prove the contract — Ollama is fine and already on the box. Model options by available RAM The demo script works with any Ollama-served model — just substitute the model name in demo.py . The table below covers machines from 9 GB unified memory upward. Model Pull command Approx size Min RAM Notes llama3.2:3b ollama pull llama3.2:3b ~2 GB 8 GB Fastest; good for testing the contract gemma3:4b ollama pull gemma3:4b ~3 GB 8 GB Google; solid instruction-following mistral:7b ollama pull mistral:7b ~4.1 GB 8 GB Classic 7B baseline llama3.1:8b ollama pull llama3.1:8b ~4.7 GB 9 GB Best quality under 10 GB qwen2.5:7b ollama pull qwen2.5:7b ~4.4 GB 9 GB Strong at instruction + reasoning gpt-oss:20b ollama pull gpt-oss:20b ~14 GB 24 GB Used in this post's sample output On a 9 GB machine, replace gpt-oss:20b in demo.py with llama3.1:8b or qwen2.5:7b — the contract
LIVE
