LLM Provider Integration
This document describes Pierre’s LLM (Large Language Model) provider abstraction layer, which enables pluggable AI model integration with streaming support for chat functionality and recipe generation.
Overview
The LLM module provides a trait-based abstraction that allows Pierre to integrate with multiple AI providers through a unified interface. The design mirrors the fitness provider SPI pattern for consistency.
┌──────────────────────────────────────────────────────────────┐
│                         ChatProvider                         │
│             Runtime provider selector (from env)             │
│      PIERRE_LLM_PROVIDER=groq|gemini|local|ollama|vllm       │
└───────────────────────────────┬──────────────────────────────┘
                                │
            ┌───────────────────┼───────────────────┐
            │                   │                   │
            ▼                   ▼                   ▼
     ┌─────────────┐     ┌─────────────┐   ┌─────────────────┐
     │   Gemini    │     │    Groq     │   │     OpenAI-     │
     │  Provider   │     │  Provider   │   │   Compatible    │
     │  (vision,   │     │  (fast LPU  │   │ (Ollama, vLLM,  │
     │   tools)    │     │ inference)  │   │    LocalAI)     │
     └──────┬──────┘     └──────┬──────┘   └────────┬────────┘
            │                   │                   │
            │                   │            ┌──────┴──────┐
            │                   │            │             │
            │                   │       ┌────┴────┐   ┌────┴────┐
            │                   │       │ Ollama  │   │  vLLM   │
            │                   │       │localhost│   │localhost│
            │                   │       │ :11434  │   │  :8000  │
            │                   │       └────┬────┘   └────┬────┘
            └───────────────────┴─────┬──────┴─────────────┘
                                      │
                                      ▼
                   ┌───────────────────────────────────┐
                   │         LlmProvider Trait         │
                   │        (shared interface)         │
                   └───────────────────────────────────┘
Quick Start
Option 1: Cloud Providers (No Local Setup Required)
# Groq (default, cost-effective, fast)
export GROQ_API_KEY="your-groq-api-key"
export PIERRE_LLM_PROVIDER=groq
# Gemini (full-featured with vision)
export GEMINI_API_KEY="your-gemini-api-key"
export PIERRE_LLM_PROVIDER=gemini
Option 2: Local LLM (Privacy-First, No API Costs)
# Use local Ollama instance
export PIERRE_LLM_PROVIDER=local
export LOCAL_LLM_MODEL=qwen2.5:14b-instruct
# Start Pierre
./bin/start-server.sh
Local LLM Setup Guide
Running a local LLM gives you complete privacy, eliminates API costs, and works offline. This section covers setting up Ollama (recommended) on macOS.
Hardware Requirements
| Model Size | RAM Required | GPU VRAM | Recommended Hardware |
|---|---|---|---|
| 7B-8B (Q4) | 8GB+ | 8GB | MacBook Air M1/M2 16GB |
| 14B (Q4) | 12GB+ | 12GB | MacBook Air M2 24GB, MacBook Pro |
| 32B (Q4) | 20GB+ | 20-24GB | MacBook Pro M2/M3 Pro 32GB+ |
| 70B (Q4) | 40GB+ | 40-48GB | Mac Studio, High-end workstation |
Example: Apple Silicon with 24GB unified memory:
- ✅ Qwen 2.5 7B (~30 tokens/sec)
- ✅ Qwen 2.5 14B (~15-20 tokens/sec) ← Recommended
- ⚠️ Qwen 2.5 32B (~5-8 tokens/sec, tight fit)
Step 1: Install Ollama
# macOS (Homebrew)
brew install ollama
# Or download from https://ollama.ai/download
Step 2: Start Ollama Server
# Start the Ollama service (runs in background)
ollama serve
# Verify it's running
curl http://localhost:11434/api/version
# Should return: {"version":"0.x.x"}
Step 3: Pull a Model
Recommended models for function calling:
# Best for 24GB RAM (recommended)
ollama pull qwen2.5:14b-instruct
# Faster, lighter alternative
ollama pull qwen2.5:7b-instruct
# If you have 32GB+ RAM
ollama pull qwen2.5:32b-instruct
# Alternative: Llama 3.1 (also excellent)
ollama pull llama3.1:8b-instruct
Step 4: Test the Model
# Interactive test
ollama run qwen2.5:14b-instruct "What are the benefits of interval training?"
# API test
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:14b-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Step 5: Configure Pierre
# Set environment variables
export PIERRE_LLM_PROVIDER=local
export LOCAL_LLM_BASE_URL=http://localhost:11434/v1
export LOCAL_LLM_MODEL=qwen2.5:14b-instruct
# Or add to .envrc:
echo 'export PIERRE_LLM_PROVIDER=local' >> .envrc
echo 'export LOCAL_LLM_MODEL=qwen2.5:14b-instruct' >> .envrc
direnv allow
Step 6: Start Pierre and Test
# Start Pierre server
./bin/start-server.sh
# Test chat endpoint
curl -X POST http://localhost:8081/api/chat/conversations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
-d '{"title": "Test Chat"}'
Model Recommendations
For Function Calling (Pierre’s 14+ Tools)
| Model | Size | Function Calling | Speed | Notes |
|---|---|---|---|---|
| Qwen 2.5 14B-Instruct | 14B | ⭐⭐⭐⭐⭐ | Fast | Best balance for most hardware |
| Qwen 2.5 32B-Instruct | 32B | ⭐⭐⭐⭐⭐ | Medium | Best quality, needs 24GB+ |
| Qwen 2.5 7B-Instruct | 7B | ⭐⭐⭐⭐ | Very Fast | Good for lighter hardware |
| Llama 3.1 8B-Instruct | 8B | ⭐⭐⭐⭐ | Very Fast | Strong 8B option from Meta |
| Llama 3.3 70B-Instruct | 70B | ⭐⭐⭐⭐⭐ | Slow | Best quality, needs 48GB+ |
| Mistral 7B-Instruct | 7B | ⭐⭐⭐⭐ | Very Fast | Fast and versatile |
Ollama Model Commands
# List installed models
ollama list
# Pull a model
ollama pull qwen2.5:14b-instruct
# Remove a model
ollama rm qwen2.5:7b-instruct
# Show model info
ollama show qwen2.5:14b-instruct
Configuration Reference
Environment Variables
| Variable | Description | Default | Required |
|---|---|---|---|
| PIERRE_LLM_PROVIDER | Provider: groq, gemini, local, ollama, vllm, localai | groq | No |
| GROQ_API_KEY | Groq API key | - | Yes (for Groq) |
| GEMINI_API_KEY | Google Gemini API key | - | Yes (for Gemini) |
| LOCAL_LLM_BASE_URL | Local LLM API endpoint | http://localhost:11434/v1 | No |
| LOCAL_LLM_MODEL | Model name for local provider | qwen2.5:14b-instruct | No |
| LOCAL_LLM_API_KEY | API key for local provider | (empty) | No |
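As a rough illustration of how these defaults combine, here is a hedged sketch of resolving the local-provider settings from the environment. This is not Pierre's actual configuration code (that lives in src/config/environment.rs); the helper name and fallback handling are assumptions based on the table above.
use std::env;

// Sketch only: resolve local-provider settings using the defaults from the table.
fn local_llm_settings() -> (String, String, String, Option<String>) {
    let provider = env::var("PIERRE_LLM_PROVIDER").unwrap_or_else(|_| "groq".to_string());
    let base_url = env::var("LOCAL_LLM_BASE_URL")
        .unwrap_or_else(|_| "http://localhost:11434/v1".to_string());
    let model = env::var("LOCAL_LLM_MODEL")
        .unwrap_or_else(|_| "qwen2.5:14b-instruct".to_string());
    let api_key = env::var("LOCAL_LLM_API_KEY").ok(); // unset is fine for Ollama
    (provider, base_url, model, api_key)
}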
Provider Capabilities
| Capability | Groq | Gemini | Local (Ollama) |
|---|---|---|---|
| Streaming | ✅ | ✅ | ✅ |
| Function/Tool Calling | ✅ | ✅ | ✅ |
| Vision/Image Input | ❌ | ✅ | ❌ |
| JSON Mode | ✅ | ✅ | ❌ |
| System Messages | ✅ | ✅ | ✅ |
| Offline Operation | ❌ | ❌ | ✅ |
| Privacy (No Data Sent) | ❌ | ❌ | ✅ |
Supported Models by Provider
Groq (Cloud)
| Model | Description | Default |
|---|---|---|
| llama-3.3-70b-versatile | High-quality general purpose | ✓ |
| llama-3.1-8b-instant | Fast responses for simple tasks | |
| llama-3.1-70b-versatile | Versatile 70B model | |
| mixtral-8x7b-32768 | Long context window (32K tokens) | |
| gemma2-9b-it | Google’s Gemma 2 instruction-tuned | |
Rate Limits: The free tier has a 12,000 tokens-per-minute limit.
Gemini (Cloud)
| Model | Description | Default |
|---|---|---|
| gemini-2.5-flash | Latest fast model with improved capabilities | ✓ |
| gemini-2.0-flash-exp | Experimental fast model | |
| gemini-1.5-pro | Advanced reasoning capabilities | |
| gemini-1.5-flash | Balanced performance and cost | |
| gemini-1.0-pro | Legacy pro model | |
Local (Ollama/vLLM)
| Model | Description | Recommended For |
|---|---|---|
| qwen2.5:14b-instruct | Excellent function calling | 24GB RAM (default) |
| qwen2.5:7b-instruct | Fast, good function calling | 16GB RAM |
| qwen2.5:32b-instruct | Best quality function calling | 32GB+ RAM |
| llama3.1:8b-instruct | Meta’s latest 8B | 16GB RAM |
| llama3.1:70b-instruct | Meta’s latest 70B | 48GB+ RAM |
| mistral:7b-instruct | Fast and versatile | 16GB RAM |
Testing
Run All LLM Tests
# LLM module unit tests
cargo test --test llm_test -- --nocapture
# LLM provider abstraction tests
cargo test --test llm_provider_test -- --nocapture
Test Local Provider Specifically
# Ensure Ollama is running first
ollama serve &
# Test provider initialization
cargo test test_llm_provider_type -- --nocapture
# Test chat functionality (requires running server)
cargo test --test llm_local_integration_test -- --nocapture
Manual Testing
# 1. Start Ollama
ollama serve
# 2. Pull test model
ollama pull qwen2.5:7b-instruct
# 3. Set environment
export PIERRE_LLM_PROVIDER=local
export LOCAL_LLM_MODEL=qwen2.5:7b-instruct
# 4. Start Pierre
./bin/start-server.sh
# 5. Test health endpoint
curl http://localhost:8081/health
# 6. Test chat (requires authentication)
# Create admin token first:
cargo run --bin admin-setup -- generate-token --service test --expires-days 1
Validation Checklist for Local LLM
Before deploying with local LLM, verify the following (a scripted sketch of the first two checks appears after the list):
- Ollama server is running (curl http://localhost:11434/api/version)
- Model is pulled (ollama list)
- Model supports function calling (use Qwen 2.5 or Llama 3.1)
- Environment variables are set correctly
- Pierre can connect to Ollama (curl http://localhost:8081/health)
- Chat streaming works
- Tool execution works (test with fitness tools)
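A minimal preflight sketch for the first two checks, assuming reqwest is available as an HTTP client (this helper is illustrative and not part of Pierre's codebase):
// Preflight sketch: confirm Ollama answers on its version endpoint before starting Pierre.
async fn ollama_is_up(base: &str) -> bool {
    // e.g. base = "http://localhost:11434"
    match reqwest::get(format!("{base}/api/version")).await {
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}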
Alternative Local Backends
vLLM (Production)
For production deployments with high throughput:
# Install vLLM
pip install vllm
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-14B-Instruct \
--port 8000
# Configure Pierre
export PIERRE_LLM_PROVIDER=vllm
export LOCAL_LLM_BASE_URL=http://localhost:8000/v1
export LOCAL_LLM_MODEL=Qwen/Qwen2.5-14B-Instruct
vLLM advantages:
- Parallel function calls
- Streaming tool calls
- Higher throughput via PagedAttention
- Better for multiple concurrent users
LocalAI
# Run LocalAI with Docker
docker run -p 8080:8080 localai/localai
# Configure Pierre
export PIERRE_LLM_PROVIDER=localai
export LOCAL_LLM_BASE_URL=http://localhost:8080/v1
Basic Usage
Using ChatProvider (Recommended)
The ChatProvider enum automatically selects the provider based on environment configuration:
use pierre_mcp_server::llm::{ChatProvider, ChatMessage, ChatRequest};

// Create provider from environment (reads PIERRE_LLM_PROVIDER)
let provider = ChatProvider::from_env()?;

// Build a chat request
let request = ChatRequest::new(vec![
    ChatMessage::system("You are a helpful fitness assistant."),
    ChatMessage::user("What's a good warm-up routine?"),
])
.with_temperature(0.7)
.with_max_tokens(1000);

// Get a response (call from an async context)
let response = provider.complete(&request).await?;
println!("{}", response.content);
Explicit Provider Selection
// Force Gemini
let provider = ChatProvider::gemini()?;

// Force Groq
let provider = ChatProvider::groq()?;

// Force Local
let provider = ChatProvider::local()?;
Streaming Responses
use futures_util::StreamExt;

let request = ChatRequest::new(vec![
    ChatMessage::user("Explain the benefits of interval training"),
])
.with_streaming();

let mut stream = provider.complete_stream(&request).await?;

while let Some(chunk) = stream.next().await {
    match chunk {
        Ok(chunk) => {
            print!("{}", chunk.delta);
            if chunk.is_final {
                println!("\n[Done]");
            }
        }
        Err(e) => eprintln!("Error: {e}"),
    }
}
Tool/Function Calling
All three providers (Gemini, Groq, Local) support tool calling:
use pierre_mcp_server::llm::{Tool, FunctionDeclaration};

let tools = vec![Tool {
    function_declarations: vec![FunctionDeclaration {
        name: "get_weather".to_string(),
        description: "Get current weather for a location".to_string(),
        parameters: Some(serde_json::json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        })),
    }],
}];

let response = provider.complete_with_tools(&request, Some(tools)).await?;

if response.has_function_calls() {
    for call in response.function_calls.unwrap() {
        println!("Call function: {} with args: {}", call.name, call.args);
    }
}
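What you do with a returned call is up to the application. A hedged sketch of dispatching on the call name (the weather lookup is a placeholder, and call.args is assumed to be a serde_json::Value object matching the declared schema):
for call in response.function_calls.unwrap_or_default() {
    match call.name.as_str() {
        "get_weather" => {
            // Pull the declared "location" argument out of the JSON args.
            let location = call
                .args
                .get("location")
                .and_then(|v| v.as_str())
                .unwrap_or("unknown");
            println!("would fetch weather for {location}");
        }
        other => eprintln!("unhandled tool call: {other}"),
    }
}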
Recipe Generation Integration
Pierre uses LLM providers for the “Combat des Chefs” recipe generation architecture. The workflow differs based on whether the client has LLM capabilities:
LLM Clients (Claude, ChatGPT, etc.)
When an LLM client connects to Pierre, it generates recipes itself:
 ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
 │  LLM Client │─────▶│  Pierre MCP │─────▶│     USDA    │
 │   (Claude)  │      │    Server   │      │   Database  │
 └─────────────┘      └─────────────┘      └─────────────┘
        │                    │                    │
        │ 1. get_recipe_     │                    │
        │    constraints     │                    │
        │───────────────────▶│                    │
        │                    │                    │
        │ 2. Returns macro   │                    │
        │    targets, hints  │                    │
        │◀───────────────────│                    │
        │                    │                    │
        │ [LLM generates     │                    │
        │  recipe locally]   │                    │
        │                    │                    │
        │ 3. validate_recipe │                    │
        │───────────────────▶│                    │
        │                    │ Lookup nutrition   │
        │                    │───────────────────▶│
        │                    │◀───────────────────│
        │ 4. Validation      │                    │
        │    result + macros │                    │
        │◀───────────────────│                    │
        │                    │                    │
        │ 5. save_recipe     │                    │
        │───────────────────▶│                    │
Non-LLM Clients
For clients without LLM capabilities, Pierre uses its internal LLM (via ChatProvider):
// The suggest_recipe tool uses Pierre's configured LLM
let provider = ChatProvider::from_env()?;
let recipe = generate_recipe_with_llm(&provider, constraints).await?;
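generate_recipe_with_llm above is a placeholder name. A hedged sketch of what such a helper could look like using the public ChatProvider API; the constraints type, prompt wording, and return type are assumptions:
// Sketch: ask the configured LLM for a recipe that satisfies the macro constraints.
async fn generate_recipe_with_llm(
    provider: &ChatProvider,
    constraints: &serde_json::Value, // e.g. output of get_recipe_constraints
) -> Result<String, AppError> {
    let request = ChatRequest::new(vec![
        ChatMessage::system("You are a chef who designs recipes that hit macro targets."),
        ChatMessage::user(format!("Generate one recipe for these constraints: {constraints}")),
    ])
    .with_temperature(0.7);

    let response = provider.complete(&request).await?;
    Ok(response.content)
}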
Recipe Tools
| Tool | Description |
|---|---|
| get_recipe_constraints | Get macro targets and prompt hints for LLM recipe generation |
| validate_recipe | Validate recipe nutrition via USDA FoodData Central |
| suggest_recipe | Uses Pierre’s internal LLM to generate recipes |
| save_recipe | Save validated recipes to user collection |
| list_recipes | List user’s saved recipes |
| get_recipe | Get recipe by ID |
| search_recipes | Search recipes by name, tags, or ingredients |
API Reference
LlmCapabilities
Bitflags indicating provider features:
| Flag | Description |
|---|---|
| STREAMING | Supports streaming responses |
| FUNCTION_CALLING | Supports function/tool calling |
| VISION | Supports image input |
| JSON_MODE | Supports structured JSON output |
| SYSTEM_MESSAGES | Supports system role messages |
// Check capabilities
let caps = provider.capabilities();
if caps.supports_streaming() {
    // Use streaming API
}
ChatMessage
Message structure for conversations:
// Constructor methods
let system = ChatMessage::system("You are helpful");
let user = ChatMessage::user("Hello!");
let assistant = ChatMessage::assistant("Hi there!");
ChatRequest
Request configuration with builder pattern:
let request = ChatRequest::new(messages)
    .with_model("gemini-1.5-pro")   // Override default model
    .with_temperature(0.7)          // 0.0 to 1.0
    .with_max_tokens(2000)          // Max output tokens
    .with_streaming();              // Enable streaming
ChatResponse
Response structure:
| Field | Type | Description |
|---|---|---|
| content | String | Generated text |
| model | String | Model used |
| usage | Option<TokenUsage> | Token counts |
| finish_reason | Option<String> | Why generation stopped |
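A small usage sketch for these fields (assumes TokenUsage and the Option fields derive Debug, which is not confirmed by this document):
let response = provider.complete(&request).await?;
println!("model: {}", response.model);
println!("text:  {}", response.content);
if let Some(usage) = &response.usage {
    println!("tokens: {usage:?}");
}
println!("finish_reason: {:?}", response.finish_reason);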
StreamChunk
Streaming chunk structure:
| Field | Type | Description |
|---|---|---|
| delta | String | Incremental text |
| is_final | bool | Whether this is the last chunk |
| finish_reason | Option<String> | Reason if final |
Module Structure
src/llm/
├── mod.rs # Trait definitions, types, registry, exports
├── provider.rs # ChatProvider enum (runtime selector)
├── gemini.rs # Google Gemini implementation
├── groq.rs # Groq LPU implementation
├── openai_compatible.rs # Generic OpenAI-compatible provider (Ollama, vLLM, LocalAI)
└── prompts/
└── mod.rs # System prompts (pierre_system.md)
Adding New Providers
To implement a new LLM provider:
- Implement the trait:
use async_trait::async_trait;
use pierre_mcp_server::llm::{
    LlmProvider, LlmCapabilities, ChatRequest, ChatResponse,
    ChatStream, AppError,
};

pub struct MyProvider {
    api_key: String,
    // ...
}

#[async_trait]
impl LlmProvider for MyProvider {
    fn name(&self) -> &'static str {
        "myprovider"
    }

    fn display_name(&self) -> &'static str {
        "My Custom Provider"
    }

    fn capabilities(&self) -> LlmCapabilities {
        LlmCapabilities::STREAMING | LlmCapabilities::SYSTEM_MESSAGES
    }

    fn default_model(&self) -> &'static str {
        "my-model-v1"
    }

    fn available_models(&self) -> &'static [&'static str] {
        &["my-model-v1", "my-model-v2"]
    }

    async fn complete(&self, request: &ChatRequest) -> Result<ChatResponse, AppError> {
        todo!("call the provider's chat completion API")
    }

    async fn complete_stream(&self, request: &ChatRequest) -> Result<ChatStream, AppError> {
        todo!("return a stream of StreamChunk values")
    }

    async fn health_check(&self) -> Result<bool, AppError> {
        todo!("ping the provider and return true if reachable")
    }
}
- Add to the ChatProvider enum in src/llm/provider.rs:
pub enum ChatProvider {
    Gemini(GeminiProvider),
    Groq(GroqProvider),
    Local(OpenAiCompatibleProvider),
    MyProvider(MyProvider), // Add variant
}
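The enum's methods then need a matching arm so calls reach the new provider. A hedged sketch of the delegation shape (the actual code in src/llm/provider.rs may structure this differently):
impl ChatProvider {
    pub fn capabilities(&self) -> LlmCapabilities {
        match self {
            Self::Gemini(p) => p.capabilities(),
            Self::Groq(p) => p.capabilities(),
            Self::Local(p) => p.capabilities(),
            Self::MyProvider(p) => p.capabilities(), // new arm
        }
    }
    // complete, complete_stream, health_check, etc. follow the same pattern
}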
- Update environment config in src/config/environment.rs
- Register tests in tests/llm_test.rs (a minimal test sketch follows below)
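A minimal test sketch for the new provider. The struct literal assumes MyProvider only holds api_key, as in the sketch above; adjust to the real fields, and note supports_streaming() is the helper shown in the API reference:
#[test]
fn my_provider_reports_expected_metadata() {
    let provider = MyProvider {
        api_key: "test-key".to_string(),
    };
    assert_eq!(provider.name(), "myprovider");
    assert!(provider.capabilities().supports_streaming());
    assert!(provider.available_models().contains(&provider.default_model()));
}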
Error Handling
All provider methods return Result<T, AppError>:
match provider.complete(&request).await {
    Ok(response) => println!("{}", response.content),
    Err(AppError { code, message, .. }) => match code {
        ErrorCode::RateLimitExceeded => { /* back off and retry */ }
        ErrorCode::AuthenticationFailed => { /* check API key configuration */ }
        _ => eprintln!("LLM error: {message}"),
    },
}
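For transient failures such as rate limiting, a simple retry loop can help. A sketch under stated assumptions: ErrorCode implements PartialEq, AppError exposes a code field as in the match above, and the backoff values are illustrative:
use std::time::Duration;

let mut attempts: u32 = 0;
let response = loop {
    match provider.complete(&request).await {
        Ok(resp) => break resp,
        Err(err) if err.code == ErrorCode::RateLimitExceeded && attempts < 3 => {
            attempts += 1;
            // Back off before retrying (2s, 4s, 6s).
            tokio::time::sleep(Duration::from_secs(u64::from(attempts) * 2)).await;
        }
        Err(err) => return Err(err),
    }
};
println!("{}", response.content);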
Common Local LLM Errors
| Error | Cause | Solution |
|---|---|---|
| “Cannot connect to Ollama” | Ollama not running | Run ollama serve |
| “Model not found” | Model not pulled | Run ollama pull MODEL_NAME |
| “Connection refused” | Wrong port/URL | Check LOCAL_LLM_BASE_URL |
| “Timeout” | Model loading or slow inference | Wait, or use smaller model |
Troubleshooting
Ollama Won’t Start
# Check if already running
pgrep -f ollama
# Kill existing instance
pkill ollama
# Start fresh
ollama serve
Model Too Slow
# Use a smaller quantization
ollama pull qwen2.5:14b-instruct-q4_K_M
# Or use a smaller model
ollama pull qwen2.5:7b-instruct
Out of Memory
# Check model size
ollama show qwen2.5:14b-instruct --modelfile
# Use smaller model
ollama pull qwen2.5:7b-instruct
# Or reduce context length in requests
Function Calling Not Working
- Ensure you’re using a model trained for function calling (Qwen 2.5, Llama 3.1)
- Verify the model is the instruct/chat variant, not base
- Check tool definitions are valid JSON Schema (a provider capability check sketch follows below)
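You can also confirm that the configured provider advertises tool support before debugging model behavior. A sketch, assuming the bitflags-style contains check and that display_name() is exposed on ChatProvider (neither is confirmed by this document):
let provider = ChatProvider::from_env()?;
if !provider.capabilities().contains(LlmCapabilities::FUNCTION_CALLING) {
    eprintln!(
        "{} does not advertise function calling; switch models or providers",
        provider.display_name()
    );
}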