2024 · Python / Docker / LLM

Steve Collins

A lightweight conversational AI assistant with RAG (Retrieval Augmented Generation) capabilities, packaged in a Docker container. Named after the bouncer from Lock, Stock and Two Smoking Barrels. Features multiple AI personalities and runs entirely offline with local LLM inference using Ollama.

Python · FastAPI · Docker · Ollama · RAG · LLM · ChromaDB

Overview

Steve Collins is a self-contained AI assistant that brings the power of large language models to your local environment without relying on external APIs or cloud services. Built with privacy and offline capability as core principles, it demonstrates how modern LLM technology can be deployed in resource-constrained environments.

The project showcases a complete RAG (Retrieval Augmented Generation) implementation where the AI assistant can reason over your documents while maintaining conversation context. All processing happens locally on CPU, making it accessible without expensive GPU hardware.

Key Features

  • 7 AI Personalities: Friendly, Professional, Casual, Medical, Coach, and two creative personalities (Samuel L. Jackson and Gollum styles)
  • Fully Offline: No external API calls - runs entirely on your local machine with complete privacy
  • CPU Optimized: Runs on consumer hardware without requiring GPU acceleration
  • RAG Implementation: Retrieval augmented generation with ChromaDB vector store for document understanding
  • Docker Containerized: Self-contained deployment with all dependencies included
  • Session Management: Persistent conversations with full history tracking
  • RESTful API: Built with FastAPI including interactive documentation
  • Built-in Web UI: No CORS issues, seamless browser-based interaction

System Architecture

Architecture overview (summarized from the diagram): the web browser client sends HTTP requests to the FastAPI server on port 8000, which hosts the session manager, personality system, and RAG pipeline. FastAPI calls the Ollama service (Mistral LLM, local inference) on port 11434, and data flows down to the ChromaDB vector store and the JSON storage used for conversations.

Component Breakdown

FastAPI Server: RESTful API handling session management, personality injection, and conversation routing. Includes built-in web interface and interactive documentation.
Ollama Service: Local LLM inference engine running Mistral model. Handles text generation with personality-specific prompts and maintains context awareness.
ChromaDB: Vector database for document embeddings using sentence-transformers. Enables semantic search for RAG capabilities.
Conversation Storage: Persistent JSON files tracking session history, personality context, and user interactions with timestamps.
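
To make the breakdown above concrete, here is a minimal sketch of how the FastAPI layer might forward a chat message to Ollama. The endpoint path follows the API documented below; the Ollama hostname, model name, and response handling are assumptions rather than the repository's exact code.

# Illustrative sketch only: FastAPI endpoint forwarding a message to a local Ollama service.
import requests
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
OLLAMA_URL = "http://ollama:11434/api/chat"  # Compose service name and port (assumed)

class ChatRequest(BaseModel):
    session_id: str
    message: str

@app.post("/session/chat")
def chat(req: ChatRequest) -> dict:
    # In the real service the system prompt would come from the session's stored personality.
    messages = [
        {"role": "system", "content": "You are a warm, encouraging assistant."},
        {"role": "user", "content": req.message},
    ]
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "mistral", "messages": messages, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return {"session_id": req.session_id, "reply": resp.json()["message"]["content"]}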

Technology Stack

Core Technologies

  • Python 3.11: Application runtime
  • FastAPI: Web framework with async support
  • Ollama: Local LLM inference (Mistral)
  • ChromaDB: Vector database for RAG
  • Pydantic: Data validation and settings

Infrastructure

  • Docker: Container runtime
  • Docker Compose: Multi-service orchestration
  • Uvicorn: ASGI server
  • Health Checks: Service monitoring
  • Volume Mounts: Data persistence

Quick Start

Docker Deployment (Recommended)

# Clone repository
git clone https://github.com/gstrezoski/steve-collins.git
cd steve-collins

# Start all services
docker-compose up -d

# Monitor initialization
docker-compose logs -f

# Access the application
open http://localhost:8000/app

System Requirements

  • RAM: Minimum 8GB (Ollama LLM requires ~4-5GB)
  • Disk: 10GB free space (for Docker images and models)
  • CPU: Modern multi-core processor (no GPU required)
  • Docker: Docker Desktop or Docker Engine with Compose

AI Personalities

Each personality is implemented through carefully crafted system prompts that shape the AI's communication style while maintaining consistent functionality.

  • 👋 Friendly: Warm, enthusiastic, encouraging tone
  • 💼 Professional: Courteous, direct, systematic approach
  • 😎 Casual: Relaxed, conversational, approachable
  • 🏥 Medical: Health-focused with professional terminology
  • 💪 Coach: Motivational, energetic, goal-oriented
  • 🎬 Samuel: Direct and confident communication style
  • 💍 Gollum: Unique character-based interaction
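
As an illustration of this prompt-injection approach, a hypothetical prompt table and helper are shown below; the repository's actual prompt wording will differ.

# Hypothetical personality prompt table; the real prompts are more carefully crafted.
SYSTEM_PROMPTS = {
    "friendly":     "You are warm, enthusiastic and encouraging.",
    "professional": "You are courteous, direct and systematic.",
    "casual":       "You are relaxed, conversational and approachable.",
    "medical":      "You focus on health topics and use professional terminology.",
    "coach":        "You are motivational, energetic and goal-oriented.",
    "samuel":       "You communicate in a direct, confident style.",
    "gollum":       "You answer in a distinctive character-based voice.",
}

def build_messages(personality: str, history: list[dict], user_message: str) -> list[dict]:
    """Prepend the selected personality's system prompt to the stored conversation history."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPTS[personality]}]
        + history
        + [{"role": "user", "content": user_message}]
    )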

API Usage

Start a Session

POST /session/start
Content-Type: application/json

{
  "personality": "friendly"
}

Send a Message

POST /session/chat
Content-Type: application/json

{
  "session_id": "uuid",
  "message": "Hello, how can you help me?"
}
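
The same two calls from Python, assuming the stack is running on localhost:8000 (the session_id field name in the start response is an assumption based on the request schema above):

# Minimal Python client for the two endpoints above.
import requests

BASE = "http://localhost:8000"

session = requests.post(f"{BASE}/session/start", json={"personality": "friendly"}).json()
session_id = session["session_id"]  # assumed response field

reply = requests.post(
    f"{BASE}/session/chat",
    json={"session_id": session_id, "message": "Hello, how can you help me?"},
).json()
print(reply)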

Available Endpoints

  • GET /health - Health check
  • GET /app - Built-in web interface
  • GET /docs - Interactive API documentation
  • GET /session/{session_id} - Retrieve session details

Development Story

The challenge was to create an AI assistant that could run anywhere without dependencies on cloud services or expensive hardware. Many LLM projects require GPU acceleration or external APIs, making them inaccessible for deployment in resource-constrained or air-gapped environments.

I started with the constraint of "must run on a laptop with no internet" and worked backwards. Ollama provided the local LLM inference, but the real innovation was in the orchestration - ensuring the services could discover each other, handling model initialization gracefully, and maintaining conversation state across container restarts.

The personality system emerged from user testing. Generic AI responses felt sterile for an onboarding assistant. By implementing personality-driven prompting, the AI could adapt its communication style to different user preferences - professional for business users, casual for general audiences, or even entertaining character-based personalities for engagement.

The entire stack fits in a Docker Compose file with automatic health checks, graceful startup ordering, and persistent storage. It demonstrates that sophisticated AI applications don't require complex infrastructure - just thoughtful architecture and the right tools.

Technical Highlights

Containerization Strategy

Multi-stage Docker setup with service health checks ensures proper startup ordering. The Ollama service initializes first, pulls the required model, then signals readiness before the API container starts. This prevents race conditions and ensures reliability.
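
The repository handles this ordering with Compose health checks; as an illustration of the same idea expressed in application code, a readiness gate could poll Ollama's /api/tags endpoint until the service responds (hostname and timeout here are assumptions):

# Sketch of a readiness gate: block until the Ollama service answers.
import time
import requests

def wait_for_ollama(base_url: str = "http://ollama:11434", timeout_s: int = 300) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/api/tags", timeout=5).ok:
                return  # Ollama is up and its model list is reachable
        except requests.ConnectionError:
            pass  # service not accepting connections yet
        time.sleep(2)
    raise RuntimeError("Ollama did not become ready in time")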

RAG Implementation

Documents are chunked and embedded using sentence-transformers (all-MiniLM-L6-v2) for efficient semantic search. ChromaDB provides the vector store with persistence, enabling the AI to retrieve relevant context from your documents during conversations.
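
A minimal sketch of that indexing and retrieval flow using ChromaDB's built-in sentence-transformers embedding function; the collection name, storage path, and chunking are placeholders, not the repository's actual values.

# Index document chunks and retrieve context for a question (illustrative only).
import chromadb
from chromadb.utils import embedding_functions

embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./chroma_data")
docs = client.get_or_create_collection(name="documents", embedding_function=embedder)

# Add pre-chunked documents (chunking strategy omitted for brevity).
docs.add(
    ids=["guide-chunk-0", "guide-chunk-1"],
    documents=["First chunk of the onboarding guide...", "Second chunk..."],
)

# Retrieve the most relevant chunks and pass them to the LLM as context.
hits = docs.query(query_texts=["How do I reset my password?"], n_results=2)
context = "\n".join(hits["documents"][0])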

Session Management

Each conversation is assigned a unique session ID with full history tracking stored as JSON. The system maintains context across multiple interactions, enabling natural multi-turn conversations while keeping data organized and queryable.
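
A minimal sketch of this kind of per-session JSON persistence; the file layout and field names are assumptions, not the repository's actual schema.

# One JSON file per session, holding personality and timestamped history.
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

SESSIONS_DIR = Path("./conversations")
SESSIONS_DIR.mkdir(exist_ok=True)

def save_session(session_id: str, data: dict) -> None:
    (SESSIONS_DIR / f"{session_id}.json").write_text(json.dumps(data, indent=2))

def start_session(personality: str) -> str:
    session_id = str(uuid.uuid4())
    save_session(session_id, {"personality": personality, "history": []})
    return session_id

def append_turn(session_id: str, role: str, content: str) -> None:
    data = json.loads((SESSIONS_DIR / f"{session_id}.json").read_text())
    data["history"].append({
        "role": role,
        "content": content,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    save_session(session_id, data)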

Use Cases

  • Onboarding Assistants: Guide users through application features with personality-driven interactions
  • Document Q&A: Query internal documentation with RAG-powered semantic search
  • Privacy-First AI: Deploy in air-gapped or regulated environments requiring data isolation
  • Prototyping Platform: Rapid AI assistant development without cloud dependencies
  • Educational Tool: Learn LLM deployment, RAG architecture, and containerization

Source Code

Full source code and documentation are available on GitHub: https://github.com/gstrezoski/steve-collins