2024 · Python / Docker / LLM

Steve Collins

A lightweight conversational AI assistant with RAG (Retrieval Augmented Generation) capabilities, packaged in a Docker container. Named after the bouncer from Lock, Stock and Two Smoking Barrels. Features multiple AI personalities and runs entirely offline with local LLM inference using Ollama.

Python · FastAPI · Docker · Ollama · RAG · LLM · ChromaDB

Overview

Steve Collins is a self-contained AI assistant that brings the power of large language models to your local environment without relying on external APIs or cloud services. Built with privacy and offline capability as core principles, it demonstrates how modern LLM technology can be deployed in resource-constrained environments.

The project showcases a complete RAG (Retrieval Augmented Generation) implementation where the AI assistant can reason over your documents while maintaining conversation context. All processing happens locally on CPU, making it accessible without expensive GPU hardware.

Key Features

  • 7 AI Personalities: Friendly, Professional, Casual, Medical, Coach, and two creative personalities (Samuel L. Jackson and Gollum styles)
  • Fully Offline: No external API calls - runs entirely on your local machine with complete privacy
  • CPU Optimized: Runs on consumer hardware without requiring GPU acceleration
  • RAG Implementation: Retrieval augmented generation with ChromaDB vector store for document understanding
  • Docker Containerized: Self-contained deployment with all dependencies included
  • Session Management: Persistent conversations with full history tracking
  • RESTful API: Built with FastAPI including interactive documentation
  • Built-in Web UI: No CORS issues, seamless browser-based interaction

System Architecture

Architecture overview (summarized from the diagram): the web browser client sends HTTP requests to the FastAPI server on port 8000, which hosts the session manager, personality system, and RAG pipeline. FastAPI calls the Ollama service (Mistral LLM, local inference) on port 11434, and data flows down to the ChromaDB vector store and the JSON storage used for conversations.

Component Breakdown

FastAPI Server: RESTful API handling session management, personality injection, and conversation routing. Includes built-in web interface and interactive documentation.
Ollama Service: Local LLM inference engine running Mistral model. Handles text generation with personality-specific prompts and maintains context awareness.
ChromaDB: Vector database for document embeddings using sentence-transformers. Enables semantic search for RAG capabilities.
Conversation Storage: Persistent JSON files tracking session history, personality context, and user interactions with timestamps.
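
To make the breakdown above concrete, here is a minimal sketch of how the FastAPI layer might forward a chat message to Ollama. The endpoint path follows the API documented below; the Ollama hostname, model name, and response handling are assumptions rather than the repository's exact code.

# Illustrative sketch only: FastAPI endpoint forwarding a message to a local Ollama service.
import requests
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
OLLAMA_URL = "http://ollama:11434/api/chat"  # Compose service name and port (assumed)

class ChatRequest(BaseModel):
    session_id: str
    message: str

@app.post("/session/chat")
def chat(req: ChatRequest) -> dict:
    # In the real service the system prompt would come from the session's stored personality.
    messages = [
        {"role": "system", "content": "You are a warm, encouraging assistant."},
        {"role": "user", "content": req.message},
    ]
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "mistral", "messages": messages, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return {"session_id": req.session_id, "reply": resp.json()["message"]["content"]}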

Technology Stack

Core Technologies

  • Python 3.11: Application runtime
  • FastAPI: Web framework with async support
  • Ollama: Local LLM inference (Mistral)
  • ChromaDB: Vector database for RAG
  • Pydantic: Data validation and settings

Infrastructure

  • Docker: Container runtime
  • Docker Compose: Multi-service orchestration
  • Uvicorn: ASGI server
  • Health Checks: Service monitoring
  • Volume Mounts: Data persistence

Quick Start

Docker Deployment (Recommended)

# Clone repository
git clone https://github.com/gstrezoski/steve-collins.git
cd steve-collins

# Start all services
docker-compose up -d

# Monitor initialization
docker-compose logs -f

# Access the application
open http://localhost:8000/app

System Requirements

  • RAM: Minimum 8GB (Ollama LLM requires ~4-5GB)
  • Disk: 10GB free space (for Docker images and models)
  • CPU: Modern multi-core processor (no GPU required)
  • Docker: Docker Desktop or Docker Engine with Compose

AI Personalities

Each personality is implemented through carefully crafted system prompts that shape the AI's communication style while maintaining consistent functionality.

  • 👋 Friendly: Warm, enthusiastic, encouraging tone
  • 💼 Professional: Courteous, direct, systematic approach
  • 😎 Casual: Relaxed, conversational, approachable
  • 🏥 Medical: Health-focused with professional terminology
  • 💪 Coach: Motivational, energetic, goal-oriented
  • 🎬 Samuel: Direct and confident communication style
  • 💍 Gollum: Unique character-based interaction
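
As an illustration of this prompt-injection approach, a hypothetical prompt table and helper are shown below; the repository's actual prompt wording will differ.

# Hypothetical personality prompt table; the real prompts are more carefully crafted.
SYSTEM_PROMPTS = {
    "friendly":     "You are warm, enthusiastic and encouraging.",
    "professional": "You are courteous, direct and systematic.",
    "casual":       "You are relaxed, conversational and approachable.",
    "medical":      "You focus on health topics and use professional terminology.",
    "coach":        "You are motivational, energetic and goal-oriented.",
    "samuel":       "You communicate in a direct, confident style.",
    "gollum":       "You answer in a distinctive character-based voice.",
}

def build_messages(personality: str, history: list[dict], user_message: str) -> list[dict]:
    """Prepend the selected personality's system prompt to the stored conversation history."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPTS[personality]}]
        + history
        + [{"role": "user", "content": user_message}]
    )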

API Usage

Start a Session

POST /session/start
Content-Type: application/json

{
  "personality": "friendly"
}

Send a Message

POST /session/chat
Content-Type: application/json

{
  "session_id": "uuid",
  "message": "Hello, how can you help me?"
}
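
The same two calls from Python, assuming the stack is running on localhost:8000 (the session_id field name in the start response is an assumption based on the request schema above):

# Minimal Python client for the two endpoints above.
import requests

BASE = "http://localhost:8000"

session = requests.post(f"{BASE}/session/start", json={"personality": "friendly"}).json()
session_id = session["session_id"]  # assumed response field

reply = requests.post(
    f"{BASE}/session/chat",
    json={"session_id": session_id, "message": "Hello, how can you help me?"},
).json()
print(reply)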

Available Endpoints

  • GET /health - Health check
  • GET /app - Built-in web interface
  • GET /docs - Interactive API documentation
  • GET /session/{session_id} - Retrieve session details

Development Story

The challenge was to create an AI assistant that could run anywhere without dependencies on cloud services or expensive hardware. Many LLM projects require GPU acceleration or external APIs, making them inaccessible for deployment in resource-constrained or air-gapped environments.

I started with the constraint of "must run on a laptop with no internet" and worked backwards. Ollama provided the local LLM inference, but the real innovation was in the orchestration - ensuring the services could discover each other, handling model initialization gracefully, and maintaining conversation state across container restarts.

The personality system emerged from user testing. Generic AI responses felt sterile for an onboarding assistant. By implementing personality-driven prompting, the AI could adapt its communication style to different user preferences - professional for business users, casual for general audiences, or even entertaining character-based personalities for engagement.

The entire stack fits in a Docker Compose file with automatic health checks, graceful startup ordering, and persistent storage. It demonstrates that sophisticated AI applications don't require complex infrastructure - just thoughtful architecture and the right tools.

Technical Highlights

Containerization Strategy

Multi-stage Docker setup with service health checks ensures proper startup ordering. The Ollama service initializes first, pulls the required model, then signals readiness before the API container starts. This prevents race conditions and ensures reliability.
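
The repository handles this ordering with Compose health checks; as an illustration of the same idea expressed in application code, a readiness gate could poll Ollama's /api/tags endpoint until the service responds (hostname and timeout here are assumptions):

# Sketch of a readiness gate: block until the Ollama service answers.
import time
import requests

def wait_for_ollama(base_url: str = "http://ollama:11434", timeout_s: int = 300) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/api/tags", timeout=5).ok:
                return  # Ollama is up and its model list is reachable
        except requests.ConnectionError:
            pass  # service not accepting connections yet
        time.sleep(2)
    raise RuntimeError("Ollama did not become ready in time")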

RAG Implementation

Documents are chunked and embedded using sentence-transformers (all-MiniLM-L6-v2) for efficient semantic search. ChromaDB provides the vector store with persistence, enabling the AI to retrieve relevant context from your documents during conversations.
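
A minimal sketch of that indexing and retrieval flow using ChromaDB's built-in sentence-transformers embedding function; the collection name, storage path, and chunking are placeholders, not the repository's actual values.

# Index document chunks and retrieve context for a question (illustrative only).
import chromadb
from chromadb.utils import embedding_functions

embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./chroma_data")
docs = client.get_or_create_collection(name="documents", embedding_function=embedder)

# Add pre-chunked documents (chunking strategy omitted for brevity).
docs.add(
    ids=["guide-chunk-0", "guide-chunk-1"],
    documents=["First chunk of the onboarding guide...", "Second chunk..."],
)

# Retrieve the most relevant chunks and pass them to the LLM as context.
hits = docs.query(query_texts=["How do I reset my password?"], n_results=2)
context = "\n".join(hits["documents"][0])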

Session Management

Each conversation is assigned a unique session ID with full history tracking stored as JSON. The system maintains context across multiple interactions, enabling natural multi-turn conversations while keeping data organized and queryable.
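
A minimal sketch of this kind of per-session JSON persistence; the file layout and field names are assumptions, not the repository's actual schema.

# One JSON file per session, holding personality and timestamped history.
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

SESSIONS_DIR = Path("./conversations")
SESSIONS_DIR.mkdir(exist_ok=True)

def save_session(session_id: str, data: dict) -> None:
    (SESSIONS_DIR / f"{session_id}.json").write_text(json.dumps(data, indent=2))

def start_session(personality: str) -> str:
    session_id = str(uuid.uuid4())
    save_session(session_id, {"personality": personality, "history": []})
    return session_id

def append_turn(session_id: str, role: str, content: str) -> None:
    data = json.loads((SESSIONS_DIR / f"{session_id}.json").read_text())
    data["history"].append({
        "role": role,
        "content": content,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    save_session(session_id, data)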

Use Cases

  • Onboarding Assistants: Guide users through application features with personality-driven interactions
  • Document Q&A: Query internal documentation with RAG-powered semantic search
  • Privacy-First AI: Deploy in air-gapped or regulated environments requiring data isolation
  • Prototyping Platform: Rapid AI assistant development without cloud dependencies
  • Educational Tool: Learn LLM deployment, RAG architecture, and containerization

Source Code

Full source code and documentation are available on GitHub: https://github.com/gstrezoski/steve-collins