Shipping AI Features in Production: GPT-4o Inside a Live Platform
We didn't just add a chatbot — we embedded AI into the core prioritization engine at Fygurs. Here's how we designed the prompts, managed async load, and deployed GPT-4o-mini on Azure without adding latency to our critical path.
The first time I plugged GPT-4o-mini into our prioritization engine at Fygurs, the latency was unacceptable — 4 seconds on a request path that needed to stay under 800ms. The model was fine. The architecture around it wasn't. This covers what we changed: async processing, prompt design, and how we structured the Azure OpenAI integration so AI is a service dependency, not a bottleneck.
Large Language Models: Fundamentals
Large Language Models (LLMs) are neural networks trained on massive text datasets to predict the next token in a sequence. Understanding how they work helps you use them effectively in production.
LLM ARCHITECTURE
Input Text        Tokenization          Embedding
"Hello world"  →  [15496, 995]  →  [0.1, -0.3, ...]
                                          │
                                          ▼
┌─────────────────────────────────────┐
│ Transformer Layers │
│ │
│ Self-Attention → Feed-Forward │
│ ↓ ↓ │
│ Self-Attention → Feed-Forward │
│ ↓ ↓ │
│ Self-Attention → Feed-Forward │
│ │
└─────────────────┬───────────────────┘
│
▼
Output Probabilities → Next Token
"!" (0.3), "." (0.2), "," (0.1)...
Key Concepts
- Tokens: Text is split into subwords (GPT-4 uses ~100k vocabulary). "unhappiness" → ["un", "happiness"]
- Context Window: Maximum tokens the model can process at once (GPT-4o: 128k tokens)
- Temperature: Controls randomness. 0 = focused and near-deterministic, 1 = creative, 2 = chaotic
- Top-p (Nucleus Sampling): Samples only from the smallest set of tokens whose cumulative probability reaches p
- Max Tokens: Limits response length (input + output must fit context window)
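To make the token budget concrete, here is a minimal pre-flight check before calling the API. The ~4 characters/token ratio is a rough heuristic for English text (use a real tokenizer like tiktoken for exact counts); the 128k window matches GPT-4o. The function names are ours, for illustration:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Check that the prompt plus reserved output space fits the context window."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window

print(estimate_tokens("Hello world"))     # → 2
print(fits_context("Hello world", 4000))  # → True
```

A check like this is cheap insurance: the API rejects requests that overflow the window, and failing fast locally is better than burning a round trip.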
Chat Completions API
Modern LLMs use a message-based API with roles that shape the conversation:
MESSAGE ROLES
┌──────────────────────────────────────────────────────────────┐
│ System Message │
│ "You are a helpful assistant that responds in JSON format" │
│ → Sets behavior, persona, and output format │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ User Message │
│ "Generate 3 product recommendations for electronics" │
│ → The actual request or question │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Assistant Message │
│ {"recommendations": [...]} │
│ → The model's response │
└──────────────────────────────────────────────────────────────┘
JSON Mode
For production systems, you need structured output. JSON mode guarantees syntactically valid JSON (note: the API requires the word "JSON" to appear somewhere in your messages, which the system prompt below satisfies):
response = await client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Return a JSON object with 'items' array"},
{"role": "user", "content": "List 3 programming languages"}
],
response_format={"type": "json_object"} # Enforces valid JSON
)
Azure OpenAI Service
Azure OpenAI provides enterprise-grade access to OpenAI models with additional security, compliance, and regional deployment options.
AZURE OPENAI ARCHITECTURE
┌─────────────────┐ ┌─────────────────────────────────────────┐
│ Application │ │ Azure OpenAI Service │
│ │ │ │
│ AsyncAzureOAI │────▶│ ┌─────────────┐ ┌─────────────────┐ │
│ Client │ │ │ Endpoint │ │ Deployments │ │
│ │ │ │ (Regional) │ │ │ │
└─────────────────┘ │ └─────────────┘ │ • gpt-4o │ │
│ │ • gpt-4o-mini │ │
│ │ • embeddings │ │
│ └─────────────────┘ │
└─────────────────────────────────────────┘
Key Differences from OpenAI
- Deployments: You create named deployments of models (not just model names)
- Endpoints: Resource-specific URLs (e.g., https://your-resource.openai.azure.com)
- API Versions: Azure uses dated API versions for stability
- Authentication: API keys or Microsoft Entra ID (formerly Azure Active Directory)
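The deployment name and API version both end up in the request URL, which is the clearest way to see how Azure differs from openai.com. The SDK builds this URL for you; the sketch below just makes it visible (the resource and deployment names are placeholders, not real endpoints):

```python
def azure_chat_url(endpoint: str, deployment: str, api_version: str) -> str:
    """Build the REST URL the Azure OpenAI SDK targets under the hood."""
    return (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
            f"/chat/completions?api-version={api_version}")

url = azure_chat_url(
    "https://your-resource.openai.azure.com",
    "gpt-4o-mini-prod",  # your deployment name, not the model name
    "2024-02-01",
)
print(url)
# https://your-resource.openai.azure.com/openai/deployments/gpt-4o-mini-prod
#     /chat/completions?api-version=2024-02-01
```

Note that `model=` in the Azure SDK call takes this deployment name, while the openai.com SDK takes the model name directly.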
Fine-Tuning: Customizing Models
Fine-tuning adapts a pre-trained model to your specific domain or task. Instead of training from scratch, you adjust the model's weights using your own examples.
FINE-TUNING PIPELINE
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Training Data │ │ Base Model │ │ Fine-Tuned │
│ │ │ │ │ Model │
│ {"prompt": ... │────▶│ GPT-4o │────▶│ │
│ "completion"} │ │ (frozen) │ │ Your Domain │
│ │ │ │ │ Specialist │
└─────────────────┘ └─────────────────┘ └─────────────────┘
   100-1000                                     Faster + Cheaper
   examples                                     + More Accurate
When to Fine-Tune
- Consistent format: Always output in a specific JSON schema
- Domain terminology: Use industry-specific language correctly
- Style/tone: Match your brand voice consistently
- Cost reduction: Fine-tuned small model can match large model quality
Training Data Format (JSONL)
{"messages": [{"role": "system", "content": "You are an assistant that extracts structured data."}, {"role": "user", "content": "Extract entities from: John visited Paris last Monday"}, {"role": "assistant", "content": "{\"person\": \"John\", \"location\": \"Paris\", \"date\": \"last Monday\"}"}]}
{"messages": [{"role": "system", "content": "You are an assistant that extracts structured data."}, {"role": "user", "content": "Extract entities from: The meeting is at 3pm in Room 101"}, {"role": "assistant", "content": "{\"time\": \"3pm\", \"location\": \"Room 101\"}"}]}
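A malformed training file fails the fine-tuning job after upload, so it pays to lint the JSONL locally first. A minimal sketch (the function name and the last-message-must-be-assistant rule are our own opinionated checks, not an official validator):

```python
import json

def lint_jsonl(lines) -> list:
    """Return a list of (line_number, error) for malformed training rows."""
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append((i, f"invalid JSON: {e}"))
            continue
        messages = row.get("messages")
        if not isinstance(messages, list) or not messages:
            errors.append((i, "missing or empty 'messages' array"))
            continue
        if messages[-1].get("role") != "assistant":
            errors.append((i, "last message should be from the assistant"))
    return errors

good = '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}'
bad = '{"messages": []}'
print(lint_jsonl([good, bad]))
```

Run it over `open("train.jsonl")` before uploading; catching a broken row locally is much faster than waiting for the job to reject the file.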
Parameter-Efficient Fine-Tuning (PEFT)
Full fine-tuning updates all model weights, which is expensive. PEFT methods like LoRA (Low-Rank Adaptation) freeze most weights and only train small adapter layers. This is used for self-hosted open-source models (Llama, Mistral). For Azure OpenAI, use the managed fine-tuning API instead.
LoRA ARCHITECTURE
┌───────────────────────────────────────────────────────────┐
│                     Transformer Layer                     │
│                                                           │
│   ┌───────────────────┐       ┌───────────────────┐       │
│   │     Original      │       │   LoRA Adapter    │       │
│   │    Weights (W)    │       │                   │       │
│   │                   │       │  ┌─────┐ ┌─────┐  │       │
│   │     [Frozen]      │   +   │  │  A  │×│  B  │  │       │
│   │                   │       │  │ r×d │ │ d×r │  │       │
│   │                   │       │  └─────┘ └─────┘  │       │
│   │                   │       │                   │       │
│   │                   │       │    [Trainable]    │       │
│   └─────────┬─────────┘       └─────────┬─────────┘       │
│             │                           │                 │
│             └────────────┬──────────────┘                 │
│                          │                                │
│                   Output = Wx + BAx                       │
└───────────────────────────────────────────────────────────┘
Full Fine-Tune: Update all ~7B parameters
LoRA (r=16):    Update only ~4M parameters (~0.06%)
LoRA with Hugging Face
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
# Load base model (use the -hf checkpoint, which is in transformers format)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Configure LoRA
lora_config = LoraConfig(
r=16, # Rank of adaptation matrices
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj"], # Which layers to adapt
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Create PEFT model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%
RAG: Retrieval-Augmented Generation
RAG combines vector search with LLM generation. Instead of relying solely on the model's training data, you retrieve relevant documents and include them in the prompt.
RAG ARCHITECTURE
User Query                                                      Response
    │                                                               ▲
    ▼                                                               │
┌───────────────────┐                                 ┌───────────────────┐
│     Embedding     │                                 │        LLM        │
│       Model       │                                 │    Generation     │
└─────────┬─────────┘                                 └─────────┬─────────┘
          │                                                     ▲
          ▼                                                     │
┌───────────────────┐    ┌───────────────────┐        ┌─────────┴─────────┐
│   Query Vector    │───▶│   Vector Search   │───────▶│     Augmented     │
│   [0.1, -0.3...]  │    │     (Top K=3)     │        │      Prompt       │
└───────────────────┘    └─────────┬─────────┘        │                   │
                                   │                  │   Context: ...    │
                         ┌─────────┴─────────┐        │   Question: ...   │
                         │   Vector Store    │        └───────────────────┘
                         │                   │
                         │   ● Doc 1 [0.2]   │
                         │   ● Doc 2 [-0.1]  │
                         │   ● Doc 3 [0.4]   │
                         └───────────────────┘
RAG Implementation
class RAGService:
def __init__(self, vector_store):
self.vector_store = vector_store
async def query(self, question: str, top_k: int = 3) -> str:
# 1. Embed the question
query_embedding = await get_embedding(question)
# 2. Search for relevant documents
results = self.vector_store.search(
vector=query_embedding,
top_k=top_k
)
# 3. Build context from retrieved documents
context = "\n\n".join([
f"Document {i+1}:\n{doc.content}"
for i, doc in enumerate(results)
])
# 4. Generate answer with context
prompt = f"""Answer the question based on the context below.
Context:
{context}
Question: {question}
Answer:"""
response = await AIClient.generate(
system_prompt="You are a helpful assistant. Only use information from the provided context.",
user_prompt=prompt,
temperature=0.3
)
return response
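The `vector_store.search` call above assumes a real store (Azure AI Search, pgvector, etc.). For tests or small corpora, a brute-force in-memory stand-in is enough to exercise the RAG flow; this sketch uses our own names and plain cosine similarity:

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    content: str
    vector: list

class InMemoryVectorStore:
    """Minimal stand-in for a vector store: brute-force cosine similarity."""

    def __init__(self):
        self.docs = []

    def add(self, content: str, vector: list):
        self.docs.append(Doc(content, vector))

    def search(self, vector: list, top_k: int = 3):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.docs, key=lambda d: cosine(vector, d.vector),
                        reverse=True)
        return ranked[:top_k]

store = InMemoryVectorStore()
store.add("Python docs", [1.0, 0.0])
store.add("Rust docs", [0.0, 1.0])
print(store.search([0.9, 0.1], top_k=1)[0].content)  # → Python docs
```

Brute force is O(n) per query, which is fine up to tens of thousands of documents; beyond that you want an ANN index, which is exactly what the managed stores provide.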
When to Use RAG vs Fine-Tuning
| Use Case | RAG | Fine-Tuning |
|---|---|---|
| Up-to-date information | Best choice | Requires retraining |
| Citing sources | Built-in | Not built in |
| Consistent style/format | Limited | Best choice |
| Domain terminology | Good | Best choice |
| Cost per query | Higher (retrieval) | Lower |
Django Integration Patterns
Integrating AI into Django requires careful initialization. Loading prompts and models at startup prevents repeated file I/O and ensures fast request handling.
AppConfig: Startup Initialization
Django's AppConfig.ready() runs once when the application starts. Use it to load prompts into memory:
# apps/ai_service/apps.py
from django.apps import AppConfig
class AIServiceConfig(AppConfig):
default_auto_field = 'django.db.models.BigAutoField'
name = 'apps.ai_service'
def ready(self):
# Import here to avoid circular imports
from .services.prompt_engine import PromptEngine
# Load all prompts into memory at startup
PromptEngine.load_prompts()
Management Commands: Async Initialization
For async clients that need the event loop, use management commands:
# apps/ai_service/management/commands/init_models.py
from django.core.management.base import BaseCommand
import asyncio
class Command(BaseCommand):
help = 'Initialize AI models'
def handle(self, *args, **options):
from apps.ai_service.services.ai_client import AIClient
        # asyncio.run creates and closes its own loop (get_event_loop is deprecated)
        asyncio.run(AIClient.initialize())
self.stdout.write(self.style.SUCCESS('AI models initialized'))
Prompt Engine: File-Based Management
Managing prompts as string literals in code is unmaintainable. A proper prompt engine loads templates from files, supports multiple languages, and enables non-developers to modify prompts.
PROMPT FILE STRUCTURE
resources/
└── prompts/
├── system/
│ ├── generator_en.txt
│ ├── generator_fr.txt
│ ├── summarizer_en.txt
│ └── summarizer_fr.txt
└── user/
├── generation_template_en.txt
├── generation_template_fr.txt
├── summary_template_en.txt
└── summary_template_fr.txt
PromptEngine Implementation
# services/prompt_engine.py
import os
from string import Template
from typing import Dict
class PromptEngine:
# Class-level storage for loaded prompts
SYSTEM_PROMPTS: Dict[str, Dict[str, str]] = {}
USER_TEMPLATES: Dict[str, Dict[str, str]] = {}
PROMPTS_DIR = os.path.join(os.path.dirname(__file__), '..', 'resources', 'prompts')
@classmethod
def load_prompts(cls):
"""Load all prompts from files into memory at startup."""
# Load system prompts
system_dir = os.path.join(cls.PROMPTS_DIR, 'system')
for filename in os.listdir(system_dir):
name, lang = cls._parse_filename(filename)
if name not in cls.SYSTEM_PROMPTS:
cls.SYSTEM_PROMPTS[name] = {}
with open(os.path.join(system_dir, filename), 'r') as f:
cls.SYSTEM_PROMPTS[name][lang] = f.read()
# Load user templates
user_dir = os.path.join(cls.PROMPTS_DIR, 'user')
for filename in os.listdir(user_dir):
name, lang = cls._parse_filename(filename)
if name not in cls.USER_TEMPLATES:
cls.USER_TEMPLATES[name] = {}
with open(os.path.join(user_dir, filename), 'r') as f:
cls.USER_TEMPLATES[name][lang] = f.read()
@classmethod
def _parse_filename(cls, filename: str) -> tuple:
"""Extract prompt name and language from filename."""
# generator_en.txt → ('generator', 'En')
name = filename.rsplit('_', 1)[0]
lang = filename.rsplit('_', 1)[1].replace('.txt', '').capitalize()
return name, lang
@classmethod
def get_system_prompt(cls, name: str, language: str = 'En') -> str:
"""Get a system prompt by name and language."""
return cls.SYSTEM_PROMPTS.get(name, {}).get(language, '')
@classmethod
def render_user_prompt(cls, name: str, language: str = 'En', **kwargs) -> str:
"""Render a user template with variables."""
template_str = cls.USER_TEMPLATES.get(name, {}).get(language, '')
return Template(template_str).safe_substitute(**kwargs)
Template Variables
User prompts use Python's Template syntax for variable substitution:
# resources/prompts/user/generation_template_en.txt
Generate $count content items for the following topic:
Category: $category
Format: $item_type
Requirements:
$requirements
Additional context:
$context
Return a JSON object with an "items" array containing id, title, content, and type fields.
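This is the same `$`-substitution that `render_user_prompt` performs, shown standalone. The key property of `safe_substitute` is that an unknown variable is left in place instead of raising, so a missing field never crashes a request:

```python
from string import Template

template_str = ("Generate $count content items for the following topic:\n"
                "Category: $category")
rendered = Template(template_str).safe_substitute(count=3, category="Science")
print(rendered)
# Generate 3 content items for the following topic:
# Category: Science

# Unknown placeholders survive untouched rather than raising KeyError:
print(Template("$missing stays").safe_substitute())  # → $missing stays
```

The trade-off is silence: a typo in a template variable ships `$catgory` to the model instead of failing loudly, so log rendered prompts in development.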
AsyncAzureOpenAI Client
For high-throughput services, use the async client to avoid blocking the event loop:
# services/ai_client.py
from openai import AsyncAzureOpenAI
import os
class AIClient:
    _instance: AsyncAzureOpenAI | None = None
@classmethod
async def initialize(cls):
"""Initialize the singleton client."""
if cls._instance is None:
cls._instance = AsyncAzureOpenAI(
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-01"),
azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)
@classmethod
def get_client(cls) -> AsyncAzureOpenAI:
"""Get the singleton client instance."""
if cls._instance is None:
raise RuntimeError("AIClient not initialized. Call initialize() first.")
return cls._instance
@classmethod
async def generate(cls, system_prompt: str, user_prompt: str,
temperature: float = 0.5) -> str:
"""Generate a completion with JSON mode."""
client = cls.get_client()
response = await client.chat.completions.create(
model=os.getenv("AZURE_DEPLOYMENT_NAME"),
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=temperature,
max_tokens=4000,
response_format={"type": "json_object"}
)
return response.choices[0].message.content
Response Validation
LLMs are probabilistic. They can return malformed JSON, missing fields, or incorrect types. A validation layer with auto-correction ensures reliability:
VALIDATION FLOW
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ AI Response│────▶│ Validate │────▶│ Valid? │────▶│ Return │
│ (JSON) │ │ Schema │ │ Yes │ │ Result │
└────────────┘ └────────────┘ └─────┬──────┘ └────────────┘
│ No
▼
┌────────────┐ ┌────────────┐
│ Auto │────▶│ Retry │──┐
│ Correct │ │ (3x) │ │
└────────────┘ └────────────┘ │
▲ │
└────────────────────────────┘
Validator Implementation
# services/validator.py
import json
import re
from typing import List, Dict, Any, Optional
class Validator:
REQUIRED_FIELDS = ['id', 'title', 'content', 'format']
VALID_FORMATS = ['summary', 'analysis', 'tutorial']
def validate(self, items: List[Dict]) -> bool:
"""Validate a list of items."""
if not isinstance(items, list) or len(items) == 0:
return False
for item in items:
if not self._validate_item(item):
return False
return True
def _validate_item(self, item: Dict) -> bool:
"""Validate a single item."""
# Check required fields exist
for field in self.REQUIRED_FIELDS:
if field not in item:
return False
# Validate format enum
if item.get('format', '').lower() not in self.VALID_FORMATS:
return False
# Validate string fields are not empty
if not item.get('title') or not item.get('content'):
return False
return True
def auto_correct(self, raw_response: str) -> Optional[Dict]:
"""Attempt to fix common JSON issues."""
try:
# Try direct parse first
return json.loads(raw_response)
except json.JSONDecodeError:
pass
# Fix common issues
corrected = raw_response
# Remove markdown code blocks
        corrected = re.sub(r'```(?:json)?\n?', '', corrected)  # non-capturing group so "json" is optional as a unit
corrected = re.sub(r'```', '', corrected)
# Fix trailing commas
corrected = re.sub(r',\s*}', '}', corrected)
corrected = re.sub(r',\s*]', ']', corrected)
# Fix unquoted keys (simple cases)
corrected = re.sub(r'(\{|,)\s*(\w+)\s*:', r'\1"\2":', corrected)
try:
return json.loads(corrected)
except json.JSONDecodeError:
return None
Parallel Processing with asyncio
Generating many outputs sequentially takes minutes. By splitting work into batches and running them in parallel, we reduce latency dramatically.
PARALLEL BATCH PROCESSING
Sequential (120 seconds)            Parallel (15 seconds)
────────────────────────            ─────────────────────
[Batch 1] ──▶ [Batch 2] ──▶         [Batch 1] ──┐
[Batch 3] ──▶ [Batch 4] ──▶         [Batch 2] ──┤
[Batch 5] ──▶ [Batch 6] ──▶         [Batch 3] ──┤
[Batch 7] ──▶ [Batch 8]             [Batch 4] ──┼──▶ Combine
                                    [Batch 5] ──┤
Total: 8 × 15s = 120s               [Batch 6] ──┤
                                    [Batch 7] ──┤
                                    [Batch 8] ──┘
                                    Total: max(15s) = 15s
Batch Generator Implementation
# services/generator.py
import asyncio
import json
from typing import List, Dict, Any
class BatchGenerator:
def __init__(self, language: str = 'En'):
self.language = language
self.validator = Validator()
async def generate_items(self, context: Dict[str, Any]) -> List[Dict]:
"""Generate items using parallel batch processing."""
# Define categories for each batch (ensures diversity)
batch_configs = [
{'category': 'Technology', 'count': 3},
{'category': 'Science', 'count': 3},
{'category': 'Business', 'count': 3},
{'category': 'Education', 'count': 3},
{'category': 'Healthcare', 'count': 2},
{'category': 'Environment', 'count': 2},
]
# Create parallel tasks
tasks = [
self._generate_batch(config, context)
for config in batch_configs
]
# Execute all batches in parallel
results = await asyncio.gather(*tasks, return_exceptions=True)
# Combine results, handling failures gracefully
all_items = []
for i, result in enumerate(results):
if isinstance(result, Exception):
print(f"Batch {i+1} failed: {result}")
continue
all_items.extend(result)
return all_items
async def _generate_batch(self, config: Dict, context: Dict,
retries: int = 3) -> List[Dict]:
"""Generate a single batch with retry logic."""
system_prompt = PromptEngine.get_system_prompt('generator', self.language)
user_prompt = PromptEngine.render_user_prompt(
'generation_template',
self.language,
count=config['count'],
category=config['category'],
item_type=context.get('format', 'summary'),
requirements=context.get('requirements', ''),
context=json.dumps(context.get('data', {}))
)
response = await AIClient.generate(system_prompt, user_prompt)
# Parse and validate
parsed = self.validator.auto_correct(response)
if parsed is None:
if retries > 0:
return await self._generate_batch(config, context, retries - 1)
raise ValueError("Failed to parse AI response")
items = parsed.get('items', [])
if not self.validator.validate(items):
if retries > 0:
print(f"Validation failed, retrying... ({retries} left)")
return await self._generate_batch(config, context, retries - 1)
raise ValueError("AI produced invalid output")
return items
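One caveat with `asyncio.gather`: it fires every batch at once, which can trip Azure's tokens-per-minute limits as batch counts grow. A semaphore caps in-flight requests without serializing everything. This is a sketch with our own names, not part of the code above:

```python
import asyncio

async def gather_limited(coro_fns, limit: int = 3):
    """Run zero-arg coroutine factories with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:  # blocks here once `limit` calls are in flight
            return await fn()

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run(fn) for fn in coro_fns))

async def main():
    async def fake_batch(i):
        await asyncio.sleep(0.01)  # stands in for an API call
        return f"batch-{i}"
    results = await gather_limited(
        [lambda i=i: fake_batch(i) for i in range(8)], limit=3
    )
    print(results)

asyncio.run(main())
```

Passing factories (`lambda: fake_batch(i)`) instead of coroutines matters: a coroutine created eagerly starts consuming resources before the semaphore admits it.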
Parameter Tuning
Different use cases require different model parameters:
| Parameter | Creative Tasks | Analytical Tasks | Classification |
|---|---|---|---|
| Temperature | 0.7-1.0 | 0.3-0.5 | 0.0-0.2 |
| Top-p | 0.9-1.0 | 0.5-0.7 | 0.1-0.3 |
| Frequency Penalty | 0.3-0.5 | 0.0-0.2 | 0.0 |
| Presence Penalty | 0.5-0.8 | 0.0-0.3 | 0.0 |
- Temperature: Higher = more random/creative, lower = more focused/deterministic
- Top-p: Limits token selection to top probability mass. Lower = more predictable
- Frequency Penalty: Reduces repetition of already-used tokens
- Presence Penalty: Encourages the model to discuss new topics
Error Handling Patterns
Production AI services must handle various failure modes:
async def safe_generate(self, prompt: str, max_retries: int = 3) -> Dict:
"""Generate with comprehensive error handling."""
for attempt in range(max_retries):
try:
response = await AIClient.generate(
self.system_prompt,
prompt,
temperature=0.5
)
parsed = json.loads(response)
                if self.validator.validate(parsed.get('items', [])):  # validate() expects the items list, not the wrapper object
return parsed
# Validation failed, retry with lower temperature
print(f"Validation failed, attempt {attempt + 1}")
except json.JSONDecodeError as e:
print(f"JSON parse error: {e}")
# Try auto-correction
corrected = self.validator.auto_correct(response)
                if corrected and self.validator.validate(corrected.get('items', [])):
return corrected
except Exception as e:
print(f"API error: {e}")
if "rate_limit" in str(e).lower():
await asyncio.sleep(2 ** attempt) # Exponential backoff
continue
raise RuntimeError(f"Failed after {max_retries} attempts")
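The plain `2 ** attempt` backoff above makes every worker retry in lockstep, so they all hit the rate limiter again at the same instant. Adding full jitter spreads retries out. A minimal helper (names and defaults are our own, not from the code above):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter, capped at `cap` seconds.

    attempt 0 → up to 1s, attempt 1 → up to 2s, attempt 2 → up to 4s, ...
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Sampled delays grow on average but never synchronize across workers
print([round(backoff_delay(a), 2) for a in range(5)])
```

Swap `await asyncio.sleep(2 ** attempt)` for `await asyncio.sleep(backoff_delay(attempt))` to get the same average wait without the thundering-herd effect.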
Best Practices
Prompt Engineering
- Be specific: "Return exactly 5 items" not "Return some items"
- Show format: Include example JSON in system prompt
- Constrain output: List valid values for enums
- Separate concerns: System prompt for behavior, user prompt for data
Performance
- Batch parallel calls: Use asyncio.gather for independent requests
- Pre-load prompts: Load from files at startup, not per-request
- Use appropriate models: GPT-4o-mini for simple tasks, GPT-4o for complex reasoning
- Stream responses: For user-facing chat, stream tokens as they arrive
Reliability
- Always validate: Never trust LLM output without validation
- Implement retries: 3 retries with exponential backoff
- Auto-correct JSON: Handle common formatting errors
- Graceful degradation: Return partial results if some batches fail
Conclusion
Production AI engineering spans multiple disciplines. LLM fundamentals (tokens, temperature, context windows) inform how you interact with models. Fine-tuning with PEFT/LoRA lets you specialize models for your domain without massive compute costs. RAG enables knowledge-grounded responses that cite sources.
On the implementation side, Django patterns like AppConfig ensure efficient initialization. The PromptEngine separates content from code, enabling non-developers to modify prompts. Validation with auto-correction handles the probabilistic nature of LLMs. The parallel processing pattern with asyncio.gather reduces latency from minutes to seconds. Combined with retry logic and graceful error handling, this architecture delivers reliable AI features at production scale.
Related Reading
These posts cover the microservices and cloud infrastructure that this AI layer runs on top of:
- How We Architected a Production SaaS: A Microservices Deep Dive — the NestJS service architecture that the GPT-4o-mini integration is embedded into, including the gateway pattern and inter-service communication.
- How We Deployed and Scaled on Azure: A Production Playbook — the Azure Container Apps environment that hosts the AI service, including the scaling configuration needed to handle burst inference load.
See the Fygurs project in the projects section for the full product context around the AI prioritization feature.
Written by
Technical Lead and Full Stack Engineer leading a 5-engineer team at Fygurs (Paris, Remote) on Azure cloud-native SaaS. Graduate of 1337 Coding School (42 Network / UM6P). Writes about architecture, cloud infrastructure, and engineering leadership.