LLM Provider Integration

This document describes Pierre’s LLM (Large Language Model) provider abstraction layer, which enables pluggable AI model integration with streaming support for chat functionality and recipe generation.

Overview

The LLM module provides a trait-based abstraction that allows Pierre to integrate with multiple AI providers through a unified interface. The design mirrors the fitness provider SPI pattern for consistency.

┌─────────────────────────────────────────────────────────────────────────────┐
│                             ChatProvider                                     │
│                  Runtime provider selector (from env)                        │
│           PIERRE_LLM_PROVIDER=groq|gemini|local|ollama|vllm                 │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
           ┌───────────────────────┼───────────────────────┐
           │                       │                       │
           ▼                       ▼                       ▼
    ┌─────────────┐         ┌─────────────┐         ┌─────────────────┐
    │   Gemini    │         │    Groq     │         │ OpenAI-         │
    │  Provider   │         │  Provider   │         │ Compatible      │
    │  (vision,   │         │  (fast LPU  │         │ (Ollama, vLLM,  │
    │   tools)    │         │  inference) │         │  LocalAI)       │
    └──────┬──────┘         └──────┬──────┘         └────────┬────────┘
           │                       │                         │
           │                       │              ┌──────────┴──────────┐
           │                       │              │                     │
           │                       │         ┌────┴────┐          ┌────┴────┐
           │                       │         │ Ollama  │          │  vLLM   │
           │                       │         │localhost│          │localhost│
           │                       │         │ :11434  │          │ :8000   │
           └───────────────────────┴─────────┴────┬────┴──────────┴────┬────┘
                                                  │                    │
                                                  ▼                    ▼
                                         ┌───────────────────────────────────┐
                                         │      LlmProvider Trait            │
                                         │      (shared interface)           │
                                         └───────────────────────────────────┘

Quick Start

Option 1: Cloud Providers (No Setup Required)

# Groq (default, cost-effective, fast)
export GROQ_API_KEY="your-groq-api-key"
export PIERRE_LLM_PROVIDER=groq

# Gemini (full-featured with vision)
export GEMINI_API_KEY="your-gemini-api-key"
export PIERRE_LLM_PROVIDER=gemini

Option 2: Local LLM (Privacy-First, No API Costs)

# Use local Ollama instance
export PIERRE_LLM_PROVIDER=local
export LOCAL_LLM_MODEL=qwen2.5:14b-instruct

# Start Pierre
./bin/start-server.sh

Local LLM Setup Guide

Running a local LLM gives you complete privacy, zero API costs, and offline operation. This section covers setting up Ollama (recommended) on macOS.

Hardware Requirements

| Model Size | RAM Required | GPU VRAM | Recommended Hardware |
|------------|--------------|----------|----------------------|
| 7B-8B (Q4) | 8GB+ | 8GB | MacBook Air M1/M2 16GB |
| 14B (Q4) | 12GB+ | 12GB | MacBook Air M2 24GB, MacBook Pro |
| 32B (Q4) | 20GB+ | 20-24GB | MacBook Pro M2/M3 Pro 32GB+ |
| 70B (Q4) | 40GB+ | 40-48GB | Mac Studio, high-end workstation |

Example: Apple Silicon with 24GB unified memory:

  • ✅ Qwen 2.5 7B (~30 tokens/sec)
  • ✅ Qwen 2.5 14B (~15-20 tokens/sec) ← Recommended
  • ⚠️ Qwen 2.5 32B (~5-8 tokens/sec, tight fit)

Step 1: Install Ollama

# macOS (Homebrew)
brew install ollama

# Or download from https://ollama.ai/download

Step 2: Start Ollama Server

# Start the Ollama service (runs in background)
ollama serve

# Verify it's running
curl http://localhost:11434/api/version
# Should return: {"version":"0.x.x"}

Step 3: Pull a Model

Recommended models for function calling:

# Best for 24GB RAM (recommended)
ollama pull qwen2.5:14b-instruct

# Faster, lighter alternative
ollama pull qwen2.5:7b-instruct

# If you have 32GB+ RAM
ollama pull qwen2.5:32b-instruct

# Alternative: Llama 3.1 (also excellent)
ollama pull llama3.1:8b-instruct

Step 4: Test the Model

# Interactive test
ollama run qwen2.5:14b-instruct "What are the benefits of interval training?"

# API test
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:14b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Step 5: Configure Pierre

# Set environment variables
export PIERRE_LLM_PROVIDER=local
export LOCAL_LLM_BASE_URL=http://localhost:11434/v1
export LOCAL_LLM_MODEL=qwen2.5:14b-instruct

# Or add to .envrc:
echo 'export PIERRE_LLM_PROVIDER=local' >> .envrc
echo 'export LOCAL_LLM_MODEL=qwen2.5:14b-instruct' >> .envrc
direnv allow

Step 6: Start Pierre and Test

# Start Pierre server
./bin/start-server.sh

# Test chat endpoint
curl -X POST http://localhost:8081/api/chat/conversations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"title": "Test Chat"}'

Model Recommendations

For Function Calling (Pierre’s 14+ Tools)

| Model | Size | Function Calling | Speed | Notes |
|-------|------|------------------|-------|-------|
| Qwen 2.5 14B-Instruct | 14B | ⭐⭐⭐⭐⭐ | Fast | Best balance for most hardware |
| Qwen 2.5 32B-Instruct | 32B | ⭐⭐⭐⭐⭐ | Medium | Best quality, needs 24GB+ |
| Qwen 2.5 7B-Instruct | 7B | ⭐⭐⭐⭐ | Very Fast | Good for lighter hardware |
| Llama 3.1 8B-Instruct | 8B | ⭐⭐⭐⭐ | Very Fast | Meta’s latest, excellent |
| Llama 3.3 70B-Instruct | 70B | ⭐⭐⭐⭐⭐ | Slow | Best quality, needs 48GB+ |
| Mistral 7B-Instruct | 7B | ⭐⭐⭐⭐ | Very Fast | Fast and versatile |

Ollama Model Commands

# List installed models
ollama list

# Pull a model
ollama pull qwen2.5:14b-instruct

# Remove a model
ollama rm qwen2.5:7b-instruct

# Show model info
ollama show qwen2.5:14b-instruct

Configuration Reference

Environment Variables

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| PIERRE_LLM_PROVIDER | Provider: groq, gemini, local, ollama, vllm, localai | groq | No |
| GROQ_API_KEY | Groq API key | - | Yes (for Groq) |
| GEMINI_API_KEY | Google Gemini API key | - | Yes (for Gemini) |
| LOCAL_LLM_BASE_URL | Local LLM API endpoint | http://localhost:11434/v1 | No |
| LOCAL_LLM_MODEL | Model name for local provider | qwen2.5:14b-instruct | No |
| LOCAL_LLM_API_KEY | API key for local provider | (empty) | No |

Provider Capabilities

| Capability | Groq | Gemini | Local (Ollama) |
|------------|------|--------|----------------|
| Streaming | ✅ | ✅ | ✅ |
| Function/Tool Calling | ✅ | ✅ | ✅ |
| Vision/Image Input | ❌ | ✅ | ❌ |
| JSON Mode | ✅ | ✅ | ✅ |
| System Messages | ✅ | ✅ | ✅ |
| Offline Operation | ❌ | ❌ | ✅ |
| Privacy (No Data Sent) | ❌ | ❌ | ✅ |

Supported Models by Provider

Groq (Cloud)

| Model | Description | Default |
|-------|-------------|---------|
| llama-3.3-70b-versatile | High-quality general purpose | ✅ |
| llama-3.1-8b-instant | Fast responses for simple tasks | |
| llama-3.1-70b-versatile | Versatile 70B model | |
| mixtral-8x7b-32768 | Long context window (32K tokens) | |
| gemma2-9b-it | Google’s Gemma 2 instruction-tuned | |

Rate Limits: The free tier has a 12,000 tokens-per-minute limit.

Gemini (Cloud)

| Model | Description | Default |
|-------|-------------|---------|
| gemini-2.5-flash | Latest fast model with improved capabilities | ✅ |
| gemini-2.0-flash-exp | Experimental fast model | |
| gemini-1.5-pro | Advanced reasoning capabilities | |
| gemini-1.5-flash | Balanced performance and cost | |
| gemini-1.0-pro | Legacy pro model | |

Local (Ollama/vLLM)

| Model | Description | Recommended For |
|-------|-------------|-----------------|
| qwen2.5:14b-instruct | Excellent function calling | 24GB RAM (default) |
| qwen2.5:7b-instruct | Fast, good function calling | 16GB RAM |
| qwen2.5:32b-instruct | Best quality function calling | 32GB+ RAM |
| llama3.1:8b-instruct | Meta’s latest 8B | 16GB RAM |
| llama3.1:70b-instruct | Meta’s latest 70B | 48GB+ RAM |
| mistral:7b-instruct | Fast and versatile | 16GB RAM |

Testing

Run All LLM Tests

# LLM module unit tests
cargo test --test llm_test -- --nocapture

# LLM provider abstraction tests
cargo test --test llm_provider_test -- --nocapture

Test Local Provider Specifically

# Ensure Ollama is running first
ollama serve &

# Test provider initialization
cargo test test_llm_provider_type -- --nocapture

# Test chat functionality (requires running server)
cargo test --test llm_local_integration_test -- --nocapture

Manual Testing

# 1. Start Ollama
ollama serve

# 2. Pull test model
ollama pull qwen2.5:7b-instruct

# 3. Set environment
export PIERRE_LLM_PROVIDER=local
export LOCAL_LLM_MODEL=qwen2.5:7b-instruct

# 4. Start Pierre
./bin/start-server.sh

# 5. Test health endpoint
curl http://localhost:8081/health

# 6. Test chat (requires authentication)
# Create admin token first:
cargo run --bin admin-setup -- generate-token --service test --expires-days 1

Validation Checklist for Local LLM

Before deploying with a local LLM, verify the following (a scripted version of these checks appears after the list):

  • Ollama server is running (curl http://localhost:11434/api/version)
  • Model is pulled (ollama list)
  • Model supports function calling (use Qwen 2.5 or Llama 3.1)
  • Environment variables are set correctly
  • Pierre can connect to Ollama (curl http://localhost:8081/health)
  • Chat streaming works
  • Tool execution works (test with fitness tools)
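
These checks can be scripted for repeatable validation; a minimal sketch under the defaults assumed in this guide (the ports, model name, and script name are assumptions taken from the steps above, not part of the project):

#!/usr/bin/env bash
# validate-local-llm.sh — pre-deployment checks for the local LLM setup.
# Hypothetical helper script; ports and model match the defaults above.
set -euo pipefail

echo "1) Ollama server reachable?"
curl -sf http://localhost:11434/api/version

echo "2) Model pulled?"
ollama list | grep -q "qwen2.5:14b-instruct"

echo "3) Environment variables set?"
test "${PIERRE_LLM_PROVIDER:-}" = "local"
test -n "${LOCAL_LLM_MODEL:-}"

echo "4) Pierre healthy?"
curl -sf http://localhost:8081/health

echo "All checks passed."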

Alternative Local Backends

vLLM (Production)

For production deployments with high throughput:

# Install vLLM
pip install vllm

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --port 8000

# Configure Pierre
export PIERRE_LLM_PROVIDER=vllm
export LOCAL_LLM_BASE_URL=http://localhost:8000/v1
export LOCAL_LLM_MODEL=Qwen/Qwen2.5-14B-Instruct

vLLM advantages:

  • Parallel function calls
  • Streaming tool calls
  • Higher throughput via PagedAttention
  • Better for multiple concurrent users

LocalAI

# Run LocalAI with Docker
docker run -p 8080:8080 localai/localai

# Configure Pierre
export PIERRE_LLM_PROVIDER=localai
export LOCAL_LLM_BASE_URL=http://localhost:8080/v1

Basic Usage

The ChatProvider enum automatically selects the provider based on environment configuration:

use pierre_mcp_server::llm::{ChatProvider, ChatMessage, ChatRequest};

// Create provider from environment (reads PIERRE_LLM_PROVIDER)
let provider = ChatProvider::from_env()?;

// Build a chat request
let request = ChatRequest::new(vec![
    ChatMessage::system("You are a helpful fitness assistant."),
    ChatMessage::user("What's a good warm-up routine?"),
])
.with_temperature(0.7)
.with_max_tokens(1000);

// Get a response
let response = provider.complete(&request).await?;
println!("{}", response.content);

Explicit Provider Selection

// Force Gemini
let provider = ChatProvider::gemini()?;

// Force Groq
let provider = ChatProvider::groq()?;

// Force Local
let provider = ChatProvider::local()?;

Streaming Responses

use futures_util::StreamExt;

let request = ChatRequest::new(vec![
    ChatMessage::user("Explain the benefits of interval training"),
])
.with_streaming();

let mut stream = provider.complete_stream(&request).await?;

while let Some(chunk) = stream.next().await {
    match chunk {
        Ok(chunk) => {
            print!("{}", chunk.delta);
            if chunk.is_final {
                println!("\n[Done]");
            }
        }
        Err(e) => eprintln!("Error: {e}"),
    }
}

Tool/Function Calling

All three providers (Gemini, Groq, Local) support tool calling:

use pierre_mcp_server::llm::{Tool, FunctionDeclaration};

let tools = vec![Tool {
    function_declarations: vec![FunctionDeclaration {
        name: "get_weather".to_string(),
        description: "Get current weather for a location".to_string(),
        parameters: Some(serde_json::json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        })),
    }],
}];

let response = provider.complete_with_tools(&request, Some(tools)).await?;

if response.has_function_calls() {
    for call in response.function_calls.unwrap() {
        println!("Call function: {} with args: {}", call.name, call.args);
    }
}
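
A function call in the response is a request from the model, not an execution: the application runs the tool and sends the result back so the model can compose a final answer. A minimal sketch of that round trip — execute_tool is a hypothetical application-level dispatcher, and packing the tool result into a user message is an assumed convention, not a documented one:

if response.has_function_calls() {
    for call in response.function_calls.unwrap() {
        // Run the tool on the application side (hypothetical helper).
        let result = execute_tool(&call.name, &call.args).await?;

        // Feed the result back so the model can produce a final answer.
        let followup = ChatRequest::new(vec![ChatMessage::user(format!(
            "Tool `{}` returned: {result}",
            call.name
        ))]);
        let final_response = provider.complete(&followup).await?;
        println!("{}", final_response.content);
    }
}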

Recipe Generation Integration

Pierre uses LLM providers for the “Combat des Chefs” recipe generation architecture. The workflow differs based on whether the client has LLM capabilities:

LLM Clients (Claude, ChatGPT, etc.)

When an LLM client connects to Pierre, it generates recipes itself:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  LLM Client  │────▶│ Pierre MCP   │────▶│    USDA      │
│  (Claude)    │     │   Server     │     │  Database    │
└──────────────┘     └──────────────┘     └──────────────┘
       │                    │                    │
       │  1. get_recipe_    │                    │
       │     constraints    │                    │
       │───────────────────▶│                    │
       │                    │                    │
       │  2. Returns macro  │                    │
       │     targets, hints │                    │
       │◀───────────────────│                    │
       │                    │                    │
       │  [LLM generates    │                    │
       │   recipe locally]  │                    │
       │                    │                    │
       │  3. validate_      │                    │
       │     recipe         │                    │
       │───────────────────▶│                    │
       │                    │  Lookup nutrition  │
       │                    │───────────────────▶│
       │                    │◀───────────────────│
       │  4. Validation     │                    │
       │     result + macros│                    │
       │◀───────────────────│                    │
       │                    │                    │
       │  5. save_recipe    │                    │
       │───────────────────▶│                    │

Non-LLM Clients

For clients without LLM capabilities, Pierre uses its internal LLM (via ChatProvider):

// The suggest_recipe tool uses Pierre's configured LLM
let provider = ChatProvider::from_env()?;
let recipe = generate_recipe_with_llm(&provider, constraints).await?;

Recipe Tools

| Tool | Description |
|------|-------------|
| get_recipe_constraints | Get macro targets and prompt hints for LLM recipe generation |
| validate_recipe | Validate recipe nutrition via USDA FoodData Central |
| suggest_recipe | Uses Pierre’s internal LLM to generate recipes |
| save_recipe | Save validated recipes to user collection |
| list_recipes | List user’s saved recipes |
| get_recipe | Get recipe by ID |
| search_recipes | Search recipes by name, tags, or ingredients |

API Reference

LlmCapabilities

Bitflags indicating provider features:

| Flag | Description |
|------|-------------|
| STREAMING | Supports streaming responses |
| FUNCTION_CALLING | Supports function/tool calling |
| VISION | Supports image input |
| JSON_MODE | Supports structured JSON output |
| SYSTEM_MESSAGES | Supports system role messages |

// Check capabilities
let caps = provider.capabilities();
if caps.supports_streaming() {
    // Use streaming API
}
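
The documentation does not show the flag type itself; a plausible definition, assuming the bitflags crate (the real type lives in src/llm/mod.rs and may differ):

use bitflags::bitflags;

bitflags! {
    // Hypothetical reconstruction of the capability flags listed above.
    pub struct LlmCapabilities: u8 {
        const STREAMING        = 1 << 0;
        const FUNCTION_CALLING = 1 << 1;
        const VISION           = 1 << 2;
        const JSON_MODE        = 1 << 3;
        const SYSTEM_MESSAGES  = 1 << 4;
    }
}

impl LlmCapabilities {
    // Convenience accessor like the `supports_streaming` call above.
    pub fn supports_streaming(self) -> bool {
        self.contains(Self::STREAMING)
    }
}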

ChatMessage

Message structure for conversations:

// Constructor methods
let system = ChatMessage::system("You are helpful");
let user = ChatMessage::user("Hello!");
let assistant = ChatMessage::assistant("Hi there!");

ChatRequest

Request configuration with builder pattern:

let request = ChatRequest::new(messages)
    .with_model("gemini-1.5-pro")    // Override default model
    .with_temperature(0.7)            // 0.0 to 1.0
    .with_max_tokens(2000)            // Max output tokens
    .with_streaming();                // Enable streaming

ChatResponse

Response structure:

| Field | Type | Description |
|-------|------|-------------|
| content | String | Generated text |
| model | String | Model used |
| usage | Option<TokenUsage> | Token counts |
| finish_reason | Option<String> | Why generation stopped |

StreamChunk

Streaming chunk structure:

| Field | Type | Description |
|-------|------|-------------|
| delta | String | Incremental text |
| is_final | bool | Whether this is the last chunk |
| finish_reason | Option<String> | Reason if final |
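
The two field tables map directly onto plain structs; the shapes below are illustrative only (the real definitions in src/llm/mod.rs may carry extra derives or fields):

// Illustrative only — mirrors the field tables above.
pub struct ChatResponse {
    pub content: String,               // Generated text
    pub model: String,                 // Model used
    pub usage: Option<TokenUsage>,     // Token counts
    pub finish_reason: Option<String>, // Why generation stopped
}

pub struct StreamChunk {
    pub delta: String,                 // Incremental text
    pub is_final: bool,                // Whether this is the last chunk
    pub finish_reason: Option<String>, // Reason if final
}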

Module Structure

src/llm/
├── mod.rs              # Trait definitions, types, registry, exports
├── provider.rs         # ChatProvider enum (runtime selector)
├── gemini.rs           # Google Gemini implementation
├── groq.rs             # Groq LPU implementation
├── openai_compatible.rs # Generic OpenAI-compatible provider (Ollama, vLLM, LocalAI)
└── prompts/
    └── mod.rs          # System prompts (pierre_system.md)

Adding New Providers

To implement a new LLM provider:

  1. Implement the trait:
use async_trait::async_trait;
use pierre_mcp_server::llm::{
    LlmProvider, LlmCapabilities, ChatRequest, ChatResponse,
    ChatStream, AppError,
};

pub struct MyProvider {
    api_key: String,
    // ...
}

#[async_trait]
impl LlmProvider for MyProvider {
    fn name(&self) -> &'static str {
        "myprovider"
    }

    fn display_name(&self) -> &'static str {
        "My Custom Provider"
    }

    fn capabilities(&self) -> LlmCapabilities {
        LlmCapabilities::STREAMING | LlmCapabilities::SYSTEM_MESSAGES
    }

    fn default_model(&self) -> &'static str {
        "my-model-v1"
    }

    fn available_models(&self) -> &'static [&'static str] {
        &["my-model-v1", "my-model-v2"]
    }

    async fn complete(&self, request: &ChatRequest) -> Result<ChatResponse, AppError> {
        todo!("send the request to the provider API and map the response")
    }

    async fn complete_stream(&self, request: &ChatRequest) -> Result<ChatStream, AppError> {
        todo!("open a streaming connection and yield chunks")
    }

    async fn health_check(&self) -> Result<bool, AppError> {
        todo!("ping the provider endpoint")
    }
}
  2. Add to ChatProvider enum in src/llm/provider.rs:
pub enum ChatProvider {
    Gemini(GeminiProvider),
    Groq(GroqProvider),
    Local(OpenAiCompatibleProvider),
    MyProvider(MyProvider),  // Add variant
}
  3. Update environment config in src/config/environment.rs

  4. Register tests in tests/llm_test.rs
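
Step 3 usually amounts to teaching the runtime selector about the new variant; an illustrative sketch of what the from_env dispatch could look like (the constructor names and the AppError::config helper are assumptions for this example, not the actual source):

impl ChatProvider {
    pub fn from_env() -> Result<Self, AppError> {
        // PIERRE_LLM_PROVIDER picks the backend; groq is the documented default.
        let name = std::env::var("PIERRE_LLM_PROVIDER")
            .unwrap_or_else(|_| "groq".to_owned());
        match name.as_str() {
            "gemini" => Ok(Self::Gemini(GeminiProvider::from_env()?)),
            "groq" => Ok(Self::Groq(GroqProvider::from_env()?)),
            "local" | "ollama" | "vllm" | "localai" => {
                Ok(Self::Local(OpenAiCompatibleProvider::from_env()?))
            }
            // New arm for the custom provider:
            "myprovider" => Ok(Self::MyProvider(MyProvider::from_env()?)),
            other => Err(AppError::config(format!("unknown LLM provider: {other}"))),
        }
    }
}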


Error Handling

All provider methods return Result<T, AppError>:

match provider.complete(&request).await {
    Ok(response) => println!("{}", response.content),
    Err(AppError { code, message, .. }) => match code {
        ErrorCode::RateLimitExceeded => {
            // Handle rate limit (e.g., back off and retry)
            eprintln!("rate limited: {message}");
        }
        ErrorCode::AuthenticationFailed => {
            // Handle auth error (e.g., check API key configuration)
            eprintln!("authentication failed: {message}");
        }
        _ => {
            // Handle other errors
            eprintln!("llm error: {message}");
        }
    },
}

Common Local LLM Errors

| Error | Cause | Solution |
|-------|-------|----------|
| “Cannot connect to Ollama” | Ollama not running | Run ollama serve |
| “Model not found” | Model not pulled | Run ollama pull MODEL_NAME |
| “Connection refused” | Wrong port/URL | Check LOCAL_LLM_BASE_URL |
| “Timeout” | Model loading or slow inference | Wait, or use a smaller model |
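
Most of these only surface on the first chat request. One way to fail fast instead is to probe the provider at startup; a hedged sketch, assuming the ChatProvider enum delegates the trait’s health_check and name methods (see Adding New Providers above):

use pierre_mcp_server::llm::ChatProvider;

// Probe the configured provider before serving traffic. The delegation
// of `health_check` and `name` through ChatProvider is an assumption
// of this sketch.
let provider = ChatProvider::from_env()?;
if !provider.health_check().await? {
    eprintln!("LLM provider '{}' failed its health check", provider.name());
    std::process::exit(1);
}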

Troubleshooting

Ollama Won’t Start

# Check if already running
pgrep -f ollama

# Kill existing instance
pkill ollama

# Start fresh
ollama serve

Model Too Slow

# Use a smaller quantization
ollama pull qwen2.5:14b-instruct-q4_K_M

# Or use a smaller model
ollama pull qwen2.5:7b-instruct

Out of Memory

# Check model size
ollama show qwen2.5:14b-instruct --modelfile

# Use smaller model
ollama pull qwen2.5:7b-instruct

# Or reduce context length in requests

Function Calling Not Working

  • Ensure you’re using a model trained for function calling (Qwen 2.5, Llama 3.1)
  • Verify the model is the instruct/chat variant, not base
  • Check tool definitions are valid JSON Schema

See Also