LLM Provider Integration
This document describes Pierre’s LLM (Large Language Model) provider abstraction layer, which enables pluggable AI model integration with streaming support for chat functionality and recipe generation.
Overview
The LLM module provides a trait-based abstraction that allows Pierre to integrate with multiple AI providers through a unified interface. The design mirrors the fitness provider SPI pattern for consistency.
┌──────────────────────────────────────────────────────────────┐
│                         ChatProvider                         │
│             Runtime provider selector (from env)             │
│      PIERRE_LLM_PROVIDER=groq|gemini|local|ollama|vllm       │
└───────────────────────────────┬──────────────────────────────┘
                                │
            ┌───────────────────┼───────────────────┐
            │                   │                   │
            ▼                   ▼                   ▼
     ┌─────────────┐     ┌─────────────┐   ┌─────────────────┐
     │   Gemini    │     │    Groq     │   │     OpenAI-     │
     │  Provider   │     │  Provider   │   │   Compatible    │
     │  (vision,   │     │  (fast LPU  │   │ (Ollama, vLLM,  │
     │   tools)    │     │ inference)  │   │    LocalAI)     │
     └──────┬──────┘     └──────┬──────┘   └────────┬────────┘
            │                   │                   │
            │                   │            ┌──────┴──────┐
            │                   │            │             │
            │                   │       ┌────┴────┐   ┌────┴────┐
            │                   │       │ Ollama  │   │  vLLM   │
            │                   │       │localhost│   │localhost│
            │                   │       │ :11434  │   │  :8000  │
            │                   │       └────┬────┘   └────┬────┘
            └───────────────────┴─────┬──────┴─────────────┘
                                      │
                                      ▼
                   ┌───────────────────────────────────┐
                   │         LlmProvider Trait         │
                   │        (shared interface)         │
                   └───────────────────────────────────┘
Quick Start
Option 1: Cloud Providers (No Local Setup Required)
# Groq (default, cost-effective, fast)
export GROQ_API_KEY="your-groq-api-key"
export PIERRE_LLM_PROVIDER=groq
# Gemini (full-featured with vision)
export GEMINI_API_KEY="your-gemini-api-key"
export PIERRE_LLM_PROVIDER=gemini
Option 2: Local LLM (Privacy-First, No API Costs)
# Use local Ollama instance
export PIERRE_LLM_PROVIDER=local
export LOCAL_LLM_MODEL=qwen2.5:14b-instruct
# Start Pierre
./bin/start-server.sh
Local LLM Setup Guide
Running a local LLM gives you complete privacy, eliminates API costs, and works offline. This section covers setting up Ollama (recommended) on macOS.
Hardware Requirements
| Model Size | RAM Required | GPU VRAM | Recommended Hardware |
|---|---|---|---|
| 7B-8B (Q4) | 8GB+ | 8GB | MacBook Air M1/M2 16GB |
| 14B (Q4) | 12GB+ | 12GB | MacBook Air M2 24GB, MacBook Pro |
| 32B (Q4) | 20GB+ | 20-24GB | MacBook Pro M2/M3 Pro 32GB+ |
| 70B (Q4) | 40GB+ | 40-48GB | Mac Studio, High-end workstation |
Example: Apple Silicon with 24GB unified memory:
- ✅ Qwen 2.5 7B (~30 tokens/sec)
- ✅ Qwen 2.5 14B (~15-20 tokens/sec) ← Recommended
- ⚠️ Qwen 2.5 32B (~5-8 tokens/sec, tight fit)
Step 1: Install Ollama
# macOS (Homebrew)
brew install ollama
# Or download from https://ollama.ai/download
Step 2: Start Ollama Server
# Start the Ollama service (runs in background)
ollama serve
# Verify it's running
curl http://localhost:11434/api/version
# Should return: {"version":"0.x.x"}
Step 3: Pull a Model
Recommended models for function calling:
# Best for 24GB RAM (recommended)
ollama pull qwen2.5:14b-instruct
# Faster, lighter alternative
ollama pull qwen2.5:7b-instruct
# If you have 32GB+ RAM
ollama pull qwen2.5:32b-instruct
# Alternative: Llama 3.1 (also excellent)
ollama pull llama3.1:8b-instruct
Step 4: Test the Model
# Interactive test
ollama run qwen2.5:14b-instruct "What are the benefits of interval training?"
# API test
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:14b-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Step 5: Configure Pierre
# Set environment variables
export PIERRE_LLM_PROVIDER=local
export LOCAL_LLM_BASE_URL=http://localhost:11434/v1
export LOCAL_LLM_MODEL=qwen2.5:14b-instruct
# Or add to .envrc:
echo 'export PIERRE_LLM_PROVIDER=local' >> .envrc
echo 'export LOCAL_LLM_MODEL=qwen2.5:14b-instruct' >> .envrc
direnv allow
Step 6: Start Pierre and Test
# Start Pierre server
./bin/start-server.sh
# Test chat endpoint
curl -X POST http://localhost:8081/api/chat/conversations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
-d '{"title": "Test Chat"}'
Model Recommendations
For Function Calling (Pierre’s 14+ Tools)
| Model | Size | Function Calling | Speed | Notes |
|---|---|---|---|---|
| Qwen 2.5 14B-Instruct | 14B | ⭐⭐⭐⭐⭐ | Fast | Best balance for most hardware |
| Qwen 2.5 32B-Instruct | 32B | ⭐⭐⭐⭐⭐ | Medium | Best quality, needs 24GB+ |
| Qwen 2.5 7B-Instruct | 7B | ⭐⭐⭐⭐ | Very Fast | Good for lighter hardware |
| Llama 3.1 8B-Instruct | 8B | ⭐⭐⭐⭐ | Very Fast | Strong 8B option from Meta |
| Llama 3.3 70B-Instruct | 70B | ⭐⭐⭐⭐⭐ | Slow | Best quality, needs 48GB+ |
| Mistral 7B-Instruct | 7B | ⭐⭐⭐⭐ | Very Fast | Fast and versatile |
Ollama Model Commands
# List installed models
ollama list
# Pull a model
ollama pull qwen2.5:14b-instruct
# Remove a model
ollama rm qwen2.5:7b-instruct
# Show model info
ollama show qwen2.5:14b-instruct
Configuration Reference
Environment Variables
| Variable | Description | Default | Required |
|---|---|---|---|
| PIERRE_LLM_PROVIDER | Provider: groq, gemini, local, ollama, vllm, localai | groq | No |
| GROQ_API_KEY | Groq API key | - | Yes (for Groq) |
| GEMINI_API_KEY | Google Gemini API key | - | Yes (for Gemini) |
| LOCAL_LLM_BASE_URL | Local LLM API endpoint | http://localhost:11434/v1 | No |
| LOCAL_LLM_MODEL | Model name for local provider | qwen2.5:14b-instruct | No |
| LOCAL_LLM_API_KEY | API key for local provider | (empty) | No |
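As a rough illustration of how these defaults combine, here is a hedged sketch of resolving the local-provider settings from the environment. This is not Pierre's actual configuration code (that lives in src/config/environment.rs); the helper name and fallback handling are assumptions based on the table above.
use std::env;

// Sketch only: resolve local-provider settings using the defaults from the table.
fn local_llm_settings() -> (String, String, String, Option<String>) {
    let provider = env::var("PIERRE_LLM_PROVIDER").unwrap_or_else(|_| "groq".to_string());
    let base_url = env::var("LOCAL_LLM_BASE_URL")
        .unwrap_or_else(|_| "http://localhost:11434/v1".to_string());
    let model = env::var("LOCAL_LLM_MODEL")
        .unwrap_or_else(|_| "qwen2.5:14b-instruct".to_string());
    let api_key = env::var("LOCAL_LLM_API_KEY").ok(); // unset is fine for Ollama
    (provider, base_url, model, api_key)
}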
Provider Capabilities
| Capability | Groq | Gemini | Local (Ollama) |
|---|---|---|---|
| Streaming | ✅ | ✅ | ✅ |
| Function/Tool Calling | ✅ | ✅ | ✅ |
| Vision/Image Input | ❌ | ✅ | ❌ |
| JSON Mode | ✅ | ✅ | ❌ |
| System Messages | ✅ | ✅ | ✅ |
| Offline Operation | ❌ | ❌ | ✅ |
| Privacy (No Data Sent) | ❌ | ❌ | ✅ |
Supported Models by Provider
Groq (Cloud)
| Model | Description | Default |
|---|---|---|
| llama-3.3-70b-versatile | High-quality general purpose | ✓ |
| llama-3.1-8b-instant | Fast responses for simple tasks | |
| llama-3.1-70b-versatile | Versatile 70B model | |
| mixtral-8x7b-32768 | Long context window (32K tokens) | |
| gemma2-9b-it | Google’s Gemma 2 instruction-tuned | |
Rate Limits: The free tier has a 12,000 tokens-per-minute limit.
Gemini (Cloud)
| Model | Description | Default |
|---|---|---|
| gemini-2.5-flash | Latest fast model with improved capabilities | ✓ |
| gemini-2.0-flash-exp | Experimental fast model | |
| gemini-1.5-pro | Advanced reasoning capabilities | |
| gemini-1.5-flash | Balanced performance and cost | |
| gemini-1.0-pro | Legacy pro model | |
Local (Ollama/vLLM)
| Model | Description | Recommended For |
|---|---|---|
| qwen2.5:14b-instruct | Excellent function calling | 24GB RAM (default) |
| qwen2.5:7b-instruct | Fast, good function calling | 16GB RAM |
| qwen2.5:32b-instruct | Best quality function calling | 32GB+ RAM |
| llama3.1:8b-instruct | Meta’s latest 8B | 16GB RAM |
| llama3.1:70b-instruct | Meta’s latest 70B | 48GB+ RAM |
| mistral:7b-instruct | Fast and versatile | 16GB RAM |
Testing
Run All LLM Tests
# LLM module unit tests
cargo test --test llm_test -- --nocapture
# LLM provider abstraction tests
cargo test --test llm_provider_test -- --nocapture
Test Local Provider Specifically
# Ensure Ollama is running first
ollama serve &
# Test provider initialization
cargo test test_llm_provider_type -- --nocapture
# Test chat functionality (requires running server)
cargo test --test llm_local_integration_test -- --nocapture
Manual Testing
# 1. Start Ollama
ollama serve
# 2. Pull test model
ollama pull qwen2.5:7b-instruct
# 3. Set environment
export PIERRE_LLM_PROVIDER=local
export LOCAL_LLM_MODEL=qwen2.5:7b-instruct
# 4. Start Pierre
./bin/start-server.sh
# 5. Test health endpoint
curl http://localhost:8081/health
# 6. Test chat (requires authentication)
# Create admin token first:
cargo run --bin admin-setup -- generate-token --service test --expires-days 1
Validation Checklist for Local LLM
Before deploying with local LLM, verify the following (a scripted sketch of the first two checks appears after the list):
- Ollama server is running (curl http://localhost:11434/api/version)
- Model is pulled (ollama list)
- Model supports function calling (use Qwen 2.5 or Llama 3.1)
- Environment variables are set correctly
- Pierre can connect to Ollama (curl http://localhost:8081/health)
- Chat streaming works
- Tool execution works (test with fitness tools)
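A minimal preflight sketch for the first two checks, assuming reqwest is available as an HTTP client (this helper is illustrative and not part of Pierre's codebase):
// Preflight sketch: confirm Ollama answers on its version endpoint before starting Pierre.
async fn ollama_is_up(base: &str) -> bool {
    // e.g. base = "http://localhost:11434"
    match reqwest::get(format!("{base}/api/version")).await {
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}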
Alternative Local Backends
vLLM (Production)
For production deployments with high throughput:
# Install vLLM
pip install vllm
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-14B-Instruct \
--port 8000
# Configure Pierre
export PIERRE_LLM_PROVIDER=vllm
export LOCAL_LLM_BASE_URL=http://localhost:8000/v1
export LOCAL_LLM_MODEL=Qwen/Qwen2.5-14B-Instruct
vLLM advantages:
- Parallel function calls
- Streaming tool calls
- Higher throughput via PagedAttention
- Better for multiple concurrent users
LocalAI
# Run LocalAI with Docker
docker run -p 8080:8080 localai/localai
# Configure Pierre
export PIERRE_LLM_PROVIDER=localai
export LOCAL_LLM_BASE_URL=http://localhost:8080/v1
Basic Usage
Using ChatProvider (Recommended)
The ChatProvider enum automatically selects the provider based on environment configuration:
use pierre_mcp_server::llm::{ChatProvider, ChatMessage, ChatRequest};

// Create provider from environment (reads PIERRE_LLM_PROVIDER)
let provider = ChatProvider::from_env()?;

// Build a chat request
let request = ChatRequest::new(vec![
    ChatMessage::system("You are a helpful fitness assistant."),
    ChatMessage::user("What's a good warm-up routine?"),
])
.with_temperature(0.7)
.with_max_tokens(1000);

// Get a response (call from an async context)
let response = provider.complete(&request).await?;
println!("{}", response.content);
Explicit Provider Selection
// Force Gemini
let provider = ChatProvider::gemini()?;

// Force Groq
let provider = ChatProvider::groq()?;

// Force Local
let provider = ChatProvider::local()?;
Streaming Responses
use futures_util::StreamExt;

let request = ChatRequest::new(vec![
    ChatMessage::user("Explain the benefits of interval training"),
])
.with_streaming();

let mut stream = provider.complete_stream(&request).await?;

while let Some(chunk) = stream.next().await {
    match chunk {
        Ok(chunk) => {
            print!("{}", chunk.delta);
            if chunk.is_final {
                println!("\n[Done]");
            }
        }
        Err(e) => eprintln!("Error: {e}"),
    }
}
Tool/Function Calling
All three providers (Gemini, Groq, Local) support tool calling:
use pierre_mcp_server::llm::{Tool, FunctionDeclaration};

let tools = vec![Tool {
    function_declarations: vec![FunctionDeclaration {
        name: "get_weather".to_string(),
        description: "Get current weather for a location".to_string(),
        parameters: Some(serde_json::json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        })),
    }],
}];

let response = provider.complete_with_tools(&request, Some(tools)).await?;

if response.has_function_calls() {
    for call in response.function_calls.unwrap() {
        println!("Call function: {} with args: {}", call.name, call.args);
    }
}
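What you do with a returned call is up to the application. A hedged sketch of dispatching on the call name (the weather lookup is a placeholder, and call.args is assumed to be a serde_json::Value object matching the declared schema):
for call in response.function_calls.unwrap_or_default() {
    match call.name.as_str() {
        "get_weather" => {
            // Pull the declared "location" argument out of the JSON args.
            let location = call
                .args
                .get("location")
                .and_then(|v| v.as_str())
                .unwrap_or("unknown");
            println!("would fetch weather for {location}");
        }
        other => eprintln!("unhandled tool call: {other}"),
    }
}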
Recipe Generation Integration
Pierre uses LLM providers for the “Combat des Chefs” recipe generation architecture. The workflow differs based on whether the client has LLM capabilities:
LLM Clients (Claude, ChatGPT, etc.)
When an LLM client connects to Pierre, it generates recipes itself:
 ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
 │  LLM Client │─────▶│  Pierre MCP │─────▶│     USDA    │
 │   (Claude)  │      │    Server   │      │   Database  │
 └─────────────┘      └─────────────┘      └─────────────┘
        │                    │                    │
        │ 1. get_recipe_     │                    │
        │    constraints     │                    │
        │───────────────────▶│                    │
        │                    │                    │
        │ 2. Returns macro   │                    │
        │    targets, hints  │                    │
        │◀───────────────────│                    │
        │                    │                    │
        │ [LLM generates     │                    │
        │  recipe locally]   │                    │
        │                    │                    │
        │ 3. validate_recipe │                    │
        │───────────────────▶│                    │
        │                    │ Lookup nutrition   │
        │                    │───────────────────▶│
        │                    │◀───────────────────│
        │ 4. Validation      │                    │
        │    result + macros │                    │
        │◀───────────────────│                    │
        │                    │                    │
        │ 5. save_recipe     │                    │
        │───────────────────▶│                    │
Non-LLM Clients
For clients without LLM capabilities, Pierre uses its internal LLM (via ChatProvider):
// The suggest_recipe tool uses Pierre's configured LLM
let provider = ChatProvider::from_env()?;
let recipe = generate_recipe_with_llm(&provider, constraints).await?;
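generate_recipe_with_llm above is a placeholder name. A hedged sketch of what such a helper could look like using the public ChatProvider API; the constraints type, prompt wording, and return type are assumptions:
// Sketch: ask the configured LLM for a recipe that satisfies the macro constraints.
async fn generate_recipe_with_llm(
    provider: &ChatProvider,
    constraints: &serde_json::Value, // e.g. output of get_recipe_constraints
) -> Result<String, AppError> {
    let request = ChatRequest::new(vec![
        ChatMessage::system("You are a chef who designs recipes that hit macro targets."),
        ChatMessage::user(format!("Generate one recipe for these constraints: {constraints}")),
    ])
    .with_temperature(0.7);

    let response = provider.complete(&request).await?;
    Ok(response.content)
}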
Recipe Tools
| Tool | Description |
|---|---|
| get_recipe_constraints | Get macro targets and prompt hints for LLM recipe generation |
| validate_recipe | Validate recipe nutrition via USDA FoodData Central |
| suggest_recipe | Uses Pierre’s internal LLM to generate recipes |
| save_recipe | Save validated recipes to user collection |
| list_recipes | List user’s saved recipes |
| get_recipe | Get recipe by ID |
| search_recipes | Search recipes by name, tags, or ingredients |
API Reference
LlmCapabilities
Bitflags indicating provider features:
| Flag | Description |
|---|---|
| STREAMING | Supports streaming responses |
| FUNCTION_CALLING | Supports function/tool calling |
| VISION | Supports image input |
| JSON_MODE | Supports structured JSON output |
| SYSTEM_MESSAGES | Supports system role messages |
// Check capabilities
let caps = provider.capabilities();
if caps.supports_streaming() {
    // Use streaming API
}
ChatMessage
Message structure for conversations:
// Constructor methods
let system = ChatMessage::system("You are helpful");
let user = ChatMessage::user("Hello!");
let assistant = ChatMessage::assistant("Hi there!");
ChatRequest
Request configuration with builder pattern:
let request = ChatRequest::new(messages)
    .with_model("gemini-1.5-pro")   // Override default model
    .with_temperature(0.7)          // 0.0 to 1.0
    .with_max_tokens(2000)          // Max output tokens
    .with_streaming();              // Enable streaming
ChatResponse
Response structure:
| Field | Type | Description |
|---|---|---|
| content | String | Generated text |
| model | String | Model used |
| usage | Option<TokenUsage> | Token counts |
| finish_reason | Option<String> | Why generation stopped |
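A small usage sketch for these fields (assumes TokenUsage and the Option fields derive Debug, which is not confirmed by this document):
let response = provider.complete(&request).await?;
println!("model: {}", response.model);
println!("text:  {}", response.content);
if let Some(usage) = &response.usage {
    println!("tokens: {usage:?}");
}
println!("finish_reason: {:?}", response.finish_reason);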
StreamChunk
Streaming chunk structure:
| Field | Type | Description |
|---|---|---|
| delta | String | Incremental text |
| is_final | bool | Whether this is the last chunk |
| finish_reason | Option<String> | Reason if final |
Module Structure
src/llm/
├── mod.rs # Trait definitions, types, registry, exports
├── provider.rs # ChatProvider enum (runtime selector)
├── gemini.rs # Google Gemini implementation
├── groq.rs # Groq LPU implementation
├── openai_compatible.rs # Generic OpenAI-compatible provider (Ollama, vLLM, LocalAI)
└── prompts/
└── mod.rs # System prompts (pierre_system.md)
Adding New Providers
To implement a new LLM provider:
- Implement the trait:
use async_trait::async_trait;
use pierre_mcp_server::llm::{
    LlmProvider, LlmCapabilities, ChatRequest, ChatResponse,
    ChatStream, AppError,
};

pub struct MyProvider {
    api_key: String,
    // ...
}

#[async_trait]
impl LlmProvider for MyProvider {
    fn name(&self) -> &'static str {
        "myprovider"
    }

    fn display_name(&self) -> &'static str {
        "My Custom Provider"
    }

    fn capabilities(&self) -> LlmCapabilities {
        LlmCapabilities::STREAMING | LlmCapabilities::SYSTEM_MESSAGES
    }

    fn default_model(&self) -> &'static str {
        "my-model-v1"
    }

    fn available_models(&self) -> &'static [&'static str] {
        &["my-model-v1", "my-model-v2"]
    }

    async fn complete(&self, request: &ChatRequest) -> Result<ChatResponse, AppError> {
        todo!("call the provider's chat completion API")
    }

    async fn complete_stream(&self, request: &ChatRequest) -> Result<ChatStream, AppError> {
        todo!("return a stream of StreamChunk values")
    }

    async fn health_check(&self) -> Result<bool, AppError> {
        todo!("ping the provider and return true if reachable")
    }
}
- Add to the ChatProvider enum in src/llm/provider.rs:
pub enum ChatProvider {
    Gemini(GeminiProvider),
    Groq(GroqProvider),
    Local(OpenAiCompatibleProvider),
    MyProvider(MyProvider), // Add variant
}
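The enum's methods then need a matching arm so calls reach the new provider. A hedged sketch of the delegation shape (the actual code in src/llm/provider.rs may structure this differently):
impl ChatProvider {
    pub fn capabilities(&self) -> LlmCapabilities {
        match self {
            Self::Gemini(p) => p.capabilities(),
            Self::Groq(p) => p.capabilities(),
            Self::Local(p) => p.capabilities(),
            Self::MyProvider(p) => p.capabilities(), // new arm
        }
    }
    // complete, complete_stream, health_check, etc. follow the same pattern
}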
- Update environment config in src/config/environment.rs
- Register tests in tests/llm_test.rs (a minimal test sketch follows below)
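A minimal test sketch for the new provider. The struct literal assumes MyProvider only holds api_key, as in the sketch above; adjust to the real fields, and note supports_streaming() is the helper shown in the API reference:
#[test]
fn my_provider_reports_expected_metadata() {
    let provider = MyProvider {
        api_key: "test-key".to_string(),
    };
    assert_eq!(provider.name(), "myprovider");
    assert!(provider.capabilities().supports_streaming());
    assert!(provider.available_models().contains(&provider.default_model()));
}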
Error Handling
All provider methods return Result<T, AppError>:
match provider.complete(&request).await {
    Ok(response) => println!("{}", response.content),
    Err(AppError { code, message, .. }) => match code {
        ErrorCode::RateLimitExceeded => { /* back off and retry */ }
        ErrorCode::AuthenticationFailed => { /* check API key configuration */ }
        _ => eprintln!("LLM error: {message}"),
    },
}
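For transient failures such as rate limiting, a simple retry loop can help. A sketch under stated assumptions: ErrorCode implements PartialEq, AppError exposes a code field as in the match above, and the backoff values are illustrative:
use std::time::Duration;

let mut attempts: u32 = 0;
let response = loop {
    match provider.complete(&request).await {
        Ok(resp) => break resp,
        Err(err) if err.code == ErrorCode::RateLimitExceeded && attempts < 3 => {
            attempts += 1;
            // Back off before retrying (2s, 4s, 6s).
            tokio::time::sleep(Duration::from_secs(u64::from(attempts) * 2)).await;
        }
        Err(err) => return Err(err),
    }
};
println!("{}", response.content);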
Common Local LLM Errors
| Error | Cause | Solution |
|---|---|---|
| “Cannot connect to Ollama” | Ollama not running | Run ollama serve |
| “Model not found” | Model not pulled | Run ollama pull MODEL_NAME |
| “Connection refused” | Wrong port/URL | Check LOCAL_LLM_BASE_URL |
| “Timeout” | Model loading or slow inference | Wait, or use smaller model |
Troubleshooting
Ollama Won’t Start
# Check if already running
pgrep -f ollama
# Kill existing instance
pkill ollama
# Start fresh
ollama serve
Model Too Slow
# Use a smaller quantization
ollama pull qwen2.5:14b-instruct-q4_K_M
# Or use a smaller model
ollama pull qwen2.5:7b-instruct
Out of Memory
# Check model size
ollama show qwen2.5:14b-instruct --modelfile
# Use smaller model
ollama pull qwen2.5:7b-instruct
# Or reduce context length in requests
Function Calling Not Working
- Ensure you’re using a model trained for function calling (Qwen 2.5, Llama 3.1)
- Verify the model is the instruct/chat variant, not base
- Check tool definitions are valid JSON Schema (a provider capability check sketch follows below)
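You can also confirm that the configured provider advertises tool support before debugging model behavior. A sketch, assuming the bitflags-style contains check and that display_name() is exposed on ChatProvider (neither is confirmed by this document):
let provider = ChatProvider::from_env()?;
if !provider.capabilities().contains(LlmCapabilities::FUNCTION_CALLING) {
    eprintln!(
        "{} does not advertise function calling; switch models or providers",
        provider.display_name()
    );
}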