DEV Community

Nrk Raju Guthikonda

The Developer's Guide to Running LLMs Locally: Ollama, Gemma 4, and Why Your Side Projects Don't Need an API Key

Every tutorial about building with LLMs starts the same way: "First, get your OpenAI API key." But what if I told you that you can build production-quality AI applications without ever making a cloud API call?

I've built over 90 applications using local LLMs — no API keys, no cloud costs, no rate limits. Here's a practical guide to getting started with Ollama and Gemma 4 for your own projects.


Why Local LLMs?

Before diving into the how, let's talk about why:

1. Zero Cost Per Request

Cloud APIs charge per token. A moderate application making 1,000 requests/day costs $30-100/month. Scale to production and you're looking at thousands per month. Local inference costs electricity — pennies per hour.

2. No Rate Limits

I've hit OpenAI rate limits at 3 AM on a Sunday during a hackathon. With local models, you can generate as fast as your hardware allows, 24/7.

3. Privacy by Default

No data leaves your machine. This isn't just nice-to-have — it's essential for healthcare (HIPAA), legal (attorney-client privilege), finance (PCI), and education (FERPA) applications.

4. Offline Capability

Once the model is downloaded, you need zero internet. Build on a plane. Demo without WiFi. Deploy in air-gapped environments.

5. Reproducibility

Cloud models change without notice. GPT-4 in January behaves differently than GPT-4 in June. Local models are frozen — same model, same behavior, always.
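The cost claim in point 1 is easy to sanity-check with back-of-envelope numbers. This sketch uses illustrative figures only (the per-token price, GPU wattage, and electricity rate are assumptions, not quotes from any provider's current pricing page):

```python
# Back-of-envelope cost comparison. All prices are illustrative assumptions.
REQUESTS_PER_DAY = 1_000
TOKENS_PER_REQUEST = 1_500        # prompt + completion, a moderate exchange
PRICE_PER_1K_TOKENS = 0.002       # assumed blended cloud rate, USD

def monthly_cloud_cost(days: int = 30) -> float:
    # Cloud billing scales linearly with token volume
    tokens = REQUESTS_PER_DAY * TOKENS_PER_REQUEST * days
    return tokens / 1_000 * PRICE_PER_1K_TOKENS

def monthly_local_cost(gpu_watts: int = 320, hours_per_day: float = 4,
                       usd_per_kwh: float = 0.15, days: int = 30) -> float:
    # Local inference cost is roughly just GPU power draw
    return gpu_watts / 1_000 * hours_per_day * usd_per_kwh * days

print(f"cloud: ${monthly_cloud_cost():.2f}/mo, local: ${monthly_local_cost():.2f}/mo")
```

With these assumptions the cloud bill lands around $90/month while local electricity stays in single digits, which is consistent with the $30-100/month range above.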

Getting Started: 5 Minutes to Your First Local LLM

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Step 2: Pull Gemma 4

ollama pull gemma4

This downloads the model (~5GB). One-time cost, then it's on your machine forever.

Step 3: Test It

ollama run gemma4 "Explain quantum computing in one paragraph"

That's it. You now have a local LLM running on your machine.

Building Applications with Python + Ollama

Here's a minimal Python application:

import ollama

def ask(question: str) -> str:
    response = ollama.generate(
        model="gemma4",
        prompt=question,
        options={"temperature": 0.3}
    )
    return response["response"]

# That's literally it
print(ask("What are the SOLID principles in software engineering?"))

Adding Structure: The Pattern I Use in 90+ Projects

class LocalLLMApp:
    def __init__(self, model: str = "gemma4"):
        self.client = ollama.Client()
        self.model = model

    def generate(self, prompt: str, temperature: float = 0.3,
                 system: str | None = None) -> str:
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        response = self.client.chat(
            model=self.model,
            messages=messages,
            options={"temperature": temperature}
        )
        return response["message"]["content"]

This base class pattern is the foundation of every application I've built. Domain-specific logic goes in subclasses — the LLM integration stays clean and swappable.

Adding a Web Interface: Streamlit

import streamlit as st

app = LocalLLMApp()

st.title("My Local AI Tool")
user_input = st.text_area("Enter your text:")

if st.button("Analyze"):
    with st.spinner("Thinking..."):
        result = app.generate(user_input)
    st.write(result)

Ten lines of Streamlit and you have a full web interface for your local AI tool.

Adding an API: FastAPI

from fastapi import FastAPI
from pydantic import BaseModel

api = FastAPI()
app = LocalLLMApp()

class Query(BaseModel):
    text: str
    temperature: float = 0.3

@api.post("/analyze")
async def analyze(query: Query):
    result = app.generate(query.text, temperature=query.temperature)
    return {"result": result}

Now you have a REST API that any frontend, mobile app, or service can call — all running locally.
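Calling it from Python needs nothing beyond the standard library. A sketch, assuming the api app above is served by uvicorn on its default port 8000:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/analyze"  # FastAPI service from the snippet above

def build_request(text: str, temperature: float = 0.3) -> urllib.request.Request:
    # Pure: constructs the POST request without sending it (testable offline)
    body = json.dumps({"text": text, "temperature": temperature}).encode()
    return urllib.request.Request(API_URL, data=body,
                                  headers={"Content-Type": "application/json"})

def analyze(text: str) -> str:
    # Sends the request; requires the FastAPI service to be running
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.loads(resp.read())["result"]
```

Any other client works the same way: POST a JSON body with `text` (and optionally `temperature`) to `/analyze`.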

Docker: One-Command Deployment

Every project I build ships with this docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  app:
    build: .
    ports:
      - "8501:8501"
      - "8000:8000"
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434

volumes:
  ollama-data:

docker compose up — that's the entire deployment story. It works on any machine with Docker and a GPU. One caveat: the Ollama container starts with no models, so pull one once with docker compose exec ollama ollama pull gemma4; the named volume keeps it across restarts.
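The compose file's `build: .` expects a Dockerfile for the app service. A minimal sketch, assuming a `requirements.txt`, a `main.py` exposing the FastAPI `api` object, and a `ui.py` Streamlit script (all hypothetical names):

```dockerfile
# Hypothetical Dockerfile for the `app` service referenced by `build: .`
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501 8000
# Run the FastAPI backend and the Streamlit UI side by side
CMD ["sh", "-c", "uvicorn main:api --host 0.0.0.0 --port 8000 & \
     streamlit run ui.py --server.port 8501 --server.address 0.0.0.0"]
```

In a larger project you would likely split the UI and API into separate services, but one container keeps side projects simple.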

Performance: What to Expect

On consumer hardware (RTX 3080, 16GB RAM):

  • Simple Q&A: 0.5-1 second
  • Paragraph generation: 2-5 seconds
  • Document analysis (2-3 pages): 5-15 seconds
  • Long-form generation (1000+ words): 15-30 seconds

These are practical, usable response times for interactive applications.
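Your numbers will vary with hardware, quantization, and prompt length, so it is worth measuring rather than guessing. A minimal timing wrapper you can put around any generate call:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage against the LocalLLMApp above:
#   answer, secs = timed(app.generate, "Summarize this document...")
#   print(f"generated in {secs:.2f}s")
```

Run it a few times per prompt category and you get your own version of the table above for your hardware.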

When to Use Cloud vs. Local

| Use case                 | Local                  | Cloud                 |
|--------------------------|------------------------|-----------------------|
| Prototyping              | ✅ Zero cost           | ❌ Token costs add up |
| Sensitive data           | ✅ Privacy by default  | ❌ Requires BAA/DPA   |
| Production (small scale) | ✅ Fixed hardware cost | ✅ Easy to scale      |
| Production (large scale) | ❌ Hardware limits     | ✅ Elastic scaling    |
| Offline/air-gapped       | ✅ Works anywhere      | ❌ Requires internet  |
| Cutting-edge capability  | ❌ Smaller models      | ✅ Latest models      |

My rule: start local, move to cloud only when you've proven the concept and need scale that local hardware can't handle.

90+ Projects and Counting

I've applied this pattern across:

  • Healthcare: Patient intake, lab results, EHR de-identification
  • Legal: Contract analysis, brief generation, compliance checking
  • Education: Study bots, exam generators, flashcard creators
  • Creative: Story generators, poetry engines, mood journals
  • Developer Tools: Code review, API docs, performance profiling
  • Finance: Budget analyzers, financial report summarizers
  • Security: Vulnerability scanners, alert summarizers

Every single one follows the same pattern: Ollama + Gemma 4 + Python + FastAPI + Streamlit + Docker.

The code is open source: github.com/kennedyraju55

Start building locally. Your AI projects don't need an API key.


*Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He maintains 116+ original open-source repositories built with local LLMs. Read more on dev.to.*

Top comments (4)

Victor Okefie

The rule that matters: start local, move to cloud only when you've proven the concept. Most developers invert that, they reach for an API key before they know if the thing should exist at all. Local inference forces honesty. No rate limits to hide behind. No API costs to justify. Just the model and the problem. If it works there, scaling is an infrastructure decision, not a leap of faith.

AuraCore Cognitive Field AI Developer

Great paper. Now go check this project to run a local cognitive runtime. It's the "mind" between the prompts. This allows you to have an AI that learns and grows from/with you. This is early proto-AI-OS. LLMs are now plug-and-play language renderers from Aura's live systems payload.
AuraCoreCF.github.io

Ethan Frost

This is the article I wish existed when I started building AI tools. The "you don't need an API key" framing is powerful because it reframes the whole cost conversation.

The open source LLM ecosystem (Ollama + Gemma/Llama/etc) is one of the best examples of community-driven distribution I've seen. Google and Meta releasing these models openly isn't charity — it's a distribution strategy. But it benefits everyone.

For indie developers and side projects, local LLMs are a game-changer specifically because they remove the variable cost anxiety. You can experiment freely without watching a usage dashboard tick up. That psychological freedom matters more than most people realize — it's the difference between cautious prompting and creative exploration.

Great guide. The Ollama setup section is especially clean.

Mykola Kondratiuk

side projects yes, but 90 local apps at any real usage scale still hit memory walls fast.