Shipping AI Features in Production: GPT-4o Inside a Live Platform
We didn't just add a chatbot — we embedded AI into the core prioritization engine at Fygurs. Here's how we designed the prompts, managed async load, and deployed GPT-4o-mini on Azure without adding latency to our critical path.
The first time I plugged GPT-4o-mini into our prioritization engine at Fygurs, the latency was unacceptable — 4 seconds on a request path that needed to stay under 800ms. The model was fine. The architecture around it wasn't. This covers what we changed: async processing, prompt design, and how we structured the Azure OpenAI integration so AI is a service dependency, not a bottleneck.
Large Language Models: Fundamentals
Large Language Models (LLMs) are neural networks trained on massive text datasets to predict the next token in a sequence. Understanding how they work helps you use them effectively in production.
LLM ARCHITECTURE
Input Text        Tokenization          Embedding
"Hello world"  →  [15496, 995]  →  [0.1, -0.3, ...]
                                          │
                                          ▼
┌─────────────────────────────────────┐
│ Transformer Layers │
│ │
│ Self-Attention → Feed-Forward │
│ ↓ ↓ │
│ Self-Attention → Feed-Forward │
│ ↓ ↓ │
│ Self-Attention → Feed-Forward │
│ │
└─────────────────┬───────────────────┘
│
▼
Output Probabilities → Next Token
"!" (0.3), "." (0.2), "," (0.1)...
Key Concepts
- Tokens: Text is split into subwords (GPT-4 uses ~100k vocabulary). "unhappiness" → ["un", "happiness"]
- Context Window: Maximum tokens the model can process at once (GPT-4o: 128k tokens)
- Temperature: Controls randomness. 0 = focused and near-deterministic, 1 = creative, 2 = chaotic
- Top-p (Nucleus Sampling): Samples only from the smallest set of tokens whose cumulative probability reaches p
- Max Tokens: Limits response length (input + output must fit context window)
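To make the token budget concrete, here is a minimal pre-flight check before calling the API. The ~4 characters/token ratio is a rough heuristic for English text (use a real tokenizer like tiktoken for exact counts); the 128k window matches GPT-4o. The function names are ours, for illustration:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Check that the prompt plus reserved output space fits the context window."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window

print(estimate_tokens("Hello world"))     # → 2
print(fits_context("Hello world", 4000))  # → True
```

A check like this is cheap insurance: the API rejects requests that overflow the window, and failing fast locally is better than burning a round trip.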
Chat Completions API
Modern LLMs use a message-based API with roles that shape the conversation:
MESSAGE ROLES
┌──────────────────────────────────────────────────────────────┐
│ System Message │
│ "You are a helpful assistant that responds in JSON format" │
│ → Sets behavior, persona, and output format │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ User Message │
│ "Generate 3 product recommendations for electronics" │
│ → The actual request or question │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Assistant Message │
│ {"recommendations": [...]} │
│ → The model's response │
└──────────────────────────────────────────────────────────────┘
JSON Mode
For production systems, you need structured output. JSON mode guarantees syntactically valid JSON (note: the API requires the word "JSON" to appear somewhere in your messages, which the system prompt below satisfies):
response = await client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Return a JSON object with 'items' array"},
{"role": "user", "content": "List 3 programming languages"}
],
response_format={"type": "json_object"} # Enforces valid JSON
)
Azure OpenAI Service
Azure OpenAI provides enterprise-grade access to OpenAI models with additional security, compliance, and regional deployment options.
AZURE OPENAI ARCHITECTURE
┌─────────────────┐ ┌─────────────────────────────────────────┐
│ Application │ │ Azure OpenAI Service │
│ │ │ │
│ AsyncAzureOAI │────▶│ ┌─────────────┐ ┌─────────────────┐ │
│ Client │ │ │ Endpoint │ │ Deployments │ │
│ │ │ │ (Regional) │ │ │ │
└─────────────────┘ │ └─────────────┘ │ • gpt-4o │ │
│ │ • gpt-4o-mini │ │
│ │ • embeddings │ │
│ └─────────────────┘ │
└─────────────────────────────────────────┘
Key Differences from OpenAI
- Deployments: You create named deployments of models (not just model names)
- Endpoints: Resource-specific URLs (e.g., https://your-resource.openai.azure.com)
- API Versions: Azure uses dated API versions for stability
- Authentication: API keys or Microsoft Entra ID (formerly Azure Active Directory)
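The deployment name and API version both end up in the request URL, which is the clearest way to see how Azure differs from openai.com. The SDK builds this URL for you; the sketch below just makes it visible (the resource and deployment names are placeholders, not real endpoints):

```python
def azure_chat_url(endpoint: str, deployment: str, api_version: str) -> str:
    """Build the REST URL the Azure OpenAI SDK targets under the hood."""
    return (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
            f"/chat/completions?api-version={api_version}")

url = azure_chat_url(
    "https://your-resource.openai.azure.com",
    "gpt-4o-mini-prod",  # your deployment name, not the model name
    "2024-02-01",
)
print(url)
# https://your-resource.openai.azure.com/openai/deployments/gpt-4o-mini-prod
#     /chat/completions?api-version=2024-02-01
```

Note that `model=` in the Azure SDK call takes this deployment name, while the openai.com SDK takes the model name directly.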
Fine-Tuning: Customizing Models
Fine-tuning adapts a pre-trained model to your specific domain or task. Instead of training from scratch, you adjust the model's weights using your own examples.
FINE-TUNING PIPELINE
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Training Data │ │ Base Model │ │ Fine-Tuned │
│ │ │ │ │ Model │
│ {"prompt": ... │────▶│ GPT-4o │────▶│ │
│ "completion"} │ │ (frozen) │ │ Your Domain │
│ │ │ │ │ Specialist │
└─────────────────┘ └─────────────────┘ └─────────────────┘
   100-1000                                     Faster + Cheaper
   examples                                     + More Accurate
When to Fine-Tune
- Consistent format: Always output in a specific JSON schema
- Domain terminology: Use industry-specific language correctly
- Style/tone: Match your brand voice consistently
- Cost reduction: Fine-tuned small model can match large model quality
Training Data Format (JSONL)
{"messages": [{"role": "system", "content": "You are an assistant that extracts structured data."}, {"role": "user", "content": "Extract entities from: John visited Paris last Monday"}, {"role": "assistant", "content": "{\"person\": \"John\", \"location\": \"Paris\", \"date\": \"last Monday\"}"}]}
{"messages": [{"role": "system", "content": "You are an assistant that extracts structured data."}, {"role": "user", "content": "Extract entities from: The meeting is at 3pm in Room 101"}, {"role": "assistant", "content": "{\"time\": \"3pm\", \"location\": \"Room 101\"}"}]}
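A malformed training file fails the fine-tuning job after upload, so it pays to lint the JSONL locally first. A minimal sketch (the function name and the last-message-must-be-assistant rule are our own opinionated checks, not an official validator):

```python
import json

def lint_jsonl(lines) -> list:
    """Return a list of (line_number, error) for malformed training rows."""
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append((i, f"invalid JSON: {e}"))
            continue
        messages = row.get("messages")
        if not isinstance(messages, list) or not messages:
            errors.append((i, "missing or empty 'messages' array"))
            continue
        if messages[-1].get("role") != "assistant":
            errors.append((i, "last message should be from the assistant"))
    return errors

good = '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}'
bad = '{"messages": []}'
print(lint_jsonl([good, bad]))
```

Run it over `open("train.jsonl")` before uploading; catching a broken row locally is much faster than waiting for the job to reject the file.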
Parameter-Efficient Fine-Tuning (PEFT)
Full fine-tuning updates all model weights, which is expensive. PEFT methods like LoRA (Low-Rank Adaptation) freeze most weights and only train small adapter layers. This is used for self-hosted open-source models (Llama, Mistral). For Azure OpenAI, use the managed fine-tuning API instead.
LoRA ARCHITECTURE
┌───────────────────────────────────────────────────────────┐
│                     Transformer Layer                     │
│                                                           │
│   ┌───────────────────┐       ┌───────────────────┐       │
│   │     Original      │       │   LoRA Adapter    │       │
│   │    Weights (W)    │       │                   │       │
│   │                   │       │  ┌─────┐ ┌─────┐  │       │
│   │     [Frozen]      │   +   │  │  A  │×│  B  │  │       │
│   │                   │       │  │ r×d │ │ d×r │  │       │
│   │                   │       │  └─────┘ └─────┘  │       │
│   │                   │       │                   │       │
│   │                   │       │    [Trainable]    │       │
│   └─────────┬─────────┘       └─────────┬─────────┘       │
│             │                           │                 │
│             └────────────┬──────────────┘                 │
│                          │                                │
│                   Output = Wx + BAx                       │
└───────────────────────────────────────────────────────────┘
Full Fine-Tune: Update all ~7B parameters
LoRA (r=16):    Update only ~4M parameters (~0.06%)
LoRA with Hugging Face
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
# Load base model (use the -hf checkpoint, which is in transformers format)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Configure LoRA
lora_config = LoraConfig(
r=16, # Rank of adaptation matrices
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj"], # Which layers to adapt
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Create PEFT model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%
RAG: Retrieval-Augmented Generation
RAG combines vector search with LLM generation. Instead of relying solely on the model's training data, you retrieve relevant documents and include them in the prompt.
RAG ARCHITECTURE
User Query                                                      Response
    │                                                               ▲
    ▼                                                               │
┌───────────────────┐                                 ┌───────────────────┐
│     Embedding     │                                 │        LLM        │
│       Model       │                                 │    Generation     │
└─────────┬─────────┘                                 └─────────┬─────────┘
          │                                                     ▲
          ▼                                                     │
┌───────────────────┐    ┌───────────────────┐        ┌─────────┴─────────┐
│   Query Vector    │───▶│   Vector Search   │───────▶│     Augmented     │
│   [0.1, -0.3...]  │    │     (Top K=3)     │        │      Prompt       │
└───────────────────┘    └─────────┬─────────┘        │                   │
                                   │                  │   Context: ...    │
                         ┌─────────┴─────────┐        │   Question: ...   │
                         │   Vector Store    │        └───────────────────┘
                         │                   │
                         │   ● Doc 1 [0.2]   │
                         │   ● Doc 2 [-0.1]  │
                         │   ● Doc 3 [0.4]   │
                         └───────────────────┘
RAG Implementation
class RAGService:
def __init__(self, vector_store):
self.vector_store = vector_store
async def query(self, question: str, top_k: int = 3) -> str:
# 1. Embed the question
query_embedding = await get_embedding(question)
# 2. Search for relevant documents
results = self.vector_store.search(
vector=query_embedding,
top_k=top_k
)
# 3. Build context from retrieved documents
context = "\n\n".join([
f"Document {i+1}:\n{doc.content}"
for i, doc in enumerate(results)
])
# 4. Generate answer with context
prompt = f"""Answer the question based on the context below.
Context:
{context}
Question: {question}
Answer:"""
response = await AIClient.generate(
system_prompt="You are a helpful assistant. Only use information from the provided context.",
user_prompt=prompt,
temperature=0.3
)
return response
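The `vector_store.search` call above assumes a real store (Azure AI Search, pgvector, etc.). For tests or small corpora, a brute-force in-memory stand-in is enough to exercise the RAG flow; this sketch uses our own names and plain cosine similarity:

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    content: str
    vector: list

class InMemoryVectorStore:
    """Minimal stand-in for a vector store: brute-force cosine similarity."""

    def __init__(self):
        self.docs = []

    def add(self, content: str, vector: list):
        self.docs.append(Doc(content, vector))

    def search(self, vector: list, top_k: int = 3):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.docs, key=lambda d: cosine(vector, d.vector),
                        reverse=True)
        return ranked[:top_k]

store = InMemoryVectorStore()
store.add("Python docs", [1.0, 0.0])
store.add("Rust docs", [0.0, 1.0])
print(store.search([0.9, 0.1], top_k=1)[0].content)  # → Python docs
```

Brute force is O(n) per query, which is fine up to tens of thousands of documents; beyond that you want an ANN index, which is exactly what the managed stores provide.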
When to Use RAG vs Fine-Tuning
| Use Case | RAG | Fine-Tuning |
|---|---|---|
| Up-to-date information | Best choice | Requires retraining |
| Citing sources | Built-in | Not built in |
| Consistent style/format | Limited | Best choice |
| Domain terminology | Good | Best choice |
| Cost per query | Higher (retrieval) | Lower |
Django Integration Patterns
Integrating AI into Django requires careful initialization. Loading prompts and models at startup prevents repeated file I/O and ensures fast request handling.
AppConfig: Startup Initialization
Django's AppConfig.ready() runs once when the application starts. Use it to load prompts into memory:
# apps/ai_service/apps.py
from django.apps import AppConfig
class AIServiceConfig(AppConfig):
default_auto_field = 'django.db.models.BigAutoField'
name = 'apps.ai_service'
def ready(self):
# Import here to avoid circular imports
from .services.prompt_engine import PromptEngine
# Load all prompts into memory at startup
PromptEngine.load_prompts()
Management Commands: Async Initialization
For async clients that need the event loop, use management commands:
# apps/ai_service/management/commands/init_models.py
from django.core.management.base import BaseCommand
import asyncio
class Command(BaseCommand):
help = 'Initialize AI models'
def handle(self, *args, **options):
from apps.ai_service.services.ai_client import AIClient
        # asyncio.run creates and closes its own loop (get_event_loop is deprecated)
        asyncio.run(AIClient.initialize())
self.stdout.write(self.style.SUCCESS('AI models initialized'))
Prompt Engine: File-Based Management
Managing prompts as string literals in code is unmaintainable. A proper prompt engine loads templates from files, supports multiple languages, and enables non-developers to modify prompts.
PROMPT FILE STRUCTURE
resources/
└── prompts/
├── system/
│ ├── generator_en.txt
│ ├── generator_fr.txt
│ ├── summarizer_en.txt
│ └── summarizer_fr.txt
└── user/
├── generation_template_en.txt
├── generation_template_fr.txt
├── summary_template_en.txt
└── summary_template_fr.txt
PromptEngine Implementation
# services/prompt_engine.py
import os
from string import Template
from typing import Dict
class PromptEngine:
# Class-level storage for loaded prompts
SYSTEM_PROMPTS: Dict[str, Dict[str, str]] = {}
USER_TEMPLATES: Dict[str, Dict[str, str]] = {}
PROMPTS_DIR = os.path.join(os.path.dirname(__file__), '..', 'resources', 'prompts')
@classmethod
def load_prompts(cls):
"""Load all prompts from files into memory at startup."""
# Load system prompts
system_dir = os.path.join(cls.PROMPTS_DIR, 'system')
for filename in os.listdir(system_dir):
name, lang = cls._parse_filename(filename)
if name not in cls.SYSTEM_PROMPTS:
cls.SYSTEM_PROMPTS[name] = {}
with open(os.path.join(system_dir, filename), 'r') as f:
cls.SYSTEM_PROMPTS[name][lang] = f.read()
# Load user templates
user_dir = os.path.join(cls.PROMPTS_DIR, 'user')
for filename in os.listdir(user_dir):
name, lang = cls._parse_filename(filename)
if name not in cls.USER_TEMPLATES:
cls.USER_TEMPLATES[name] = {}
with open(os.path.join(user_dir, filename), 'r') as f:
cls.USER_TEMPLATES[name][lang] = f.read()
@classmethod
def _parse_filename(cls, filename: str) -> tuple:
"""Extract prompt name and language from filename."""
# generator_en.txt → ('generator', 'En')
name = filename.rsplit('_', 1)[0]
lang = filename.rsplit('_', 1)[1].replace('.txt', '').capitalize()
return name, lang
@classmethod
def get_system_prompt(cls, name: str, language: str = 'En') -> str:
"""Get a system prompt by name and language."""
return cls.SYSTEM_PROMPTS.get(name, {}).get(language, '')
@classmethod
def render_user_prompt(cls, name: str, language: str = 'En', **kwargs) -> str:
"""Render a user template with variables."""
template_str = cls.USER_TEMPLATES.get(name, {}).get(language, '')
return Template(template_str).safe_substitute(**kwargs)
Template Variables
User prompts use Python's Template syntax for variable substitution:
# resources/prompts/user/generation_template_en.txt
Generate $count content items for the following topic:
Category: $category
Format: $item_type
Requirements:
$requirements
Additional context:
$context
Return a JSON object with an "items" array containing id, title, content, and type fields.
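This is the same `$`-substitution that `render_user_prompt` performs, shown standalone. The key property of `safe_substitute` is that an unknown variable is left in place instead of raising, so a missing field never crashes a request:

```python
from string import Template

template_str = ("Generate $count content items for the following topic:\n"
                "Category: $category")
rendered = Template(template_str).safe_substitute(count=3, category="Science")
print(rendered)
# Generate 3 content items for the following topic:
# Category: Science

# Unknown placeholders survive untouched rather than raising KeyError:
print(Template("$missing stays").safe_substitute())  # → $missing stays
```

The trade-off is silence: a typo in a template variable ships `$catgory` to the model instead of failing loudly, so log rendered prompts in development.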
AsyncAzureOpenAI Client
For high-throughput services, use the async client to avoid blocking the event loop:
# services/ai_client.py
from openai import AsyncAzureOpenAI
import os
class AIClient:
    _instance: AsyncAzureOpenAI | None = None
@classmethod
async def initialize(cls):
"""Initialize the singleton client."""
if cls._instance is None:
cls._instance = AsyncAzureOpenAI(
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-01"),
azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)
@classmethod
def get_client(cls) -> AsyncAzureOpenAI:
"""Get the singleton client instance."""
if cls._instance is None:
raise RuntimeError("AIClient not initialized. Call initialize() first.")
return cls._instance
@classmethod
async def generate(cls, system_prompt: str, user_prompt: str,
temperature: float = 0.5) -> str:
"""Generate a completion with JSON mode."""
client = cls.get_client()
response = await client.chat.completions.create(
model=os.getenv("AZURE_DEPLOYMENT_NAME"),
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=temperature,
max_tokens=4000,
response_format={"type": "json_object"}
)
return response.choices[0].message.content
Response Validation
LLMs are probabilistic. They can return malformed JSON, missing fields, or incorrect types. A validation layer with auto-correction ensures reliability:
VALIDATION FLOW
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ AI Response│────▶│ Validate │────▶│ Valid? │────▶│ Return │
│ (JSON) │ │ Schema │ │ Yes │ │ Result │
└────────────┘ └────────────┘ └─────┬──────┘ └────────────┘
│ No
▼
┌────────────┐ ┌────────────┐
│ Auto │────▶│ Retry │──┐
│ Correct │ │ (3x) │ │
└────────────┘ └────────────┘ │
▲ │
└────────────────────────────┘
Validator Implementation
# services/validator.py
import json
import re
from typing import List, Dict, Any, Optional
class Validator:
REQUIRED_FIELDS = ['id', 'title', 'content', 'format']
VALID_FORMATS = ['summary', 'analysis', 'tutorial']
def validate(self, items: List[Dict]) -> bool:
"""Validate a list of items."""
if not isinstance(items, list) or len(items) == 0:
return False
for item in items:
if not self._validate_item(item):
return False
return True
def _validate_item(self, item: Dict) -> bool:
"""Validate a single item."""
# Check required fields exist
for field in self.REQUIRED_FIELDS:
if field not in item:
return False
# Validate format enum
if item.get('format', '').lower() not in self.VALID_FORMATS:
return False
# Validate string fields are not empty
if not item.get('title') or not item.get('content'):
return False
return True
def auto_correct(self, raw_response: str) -> Optional[Dict]:
"""Attempt to fix common JSON issues."""
try:
# Try direct parse first
return json.loads(raw_response)
except json.JSONDecodeError:
pass
# Fix common issues
corrected = raw_response
# Remove markdown code blocks
        corrected = re.sub(r'```(?:json)?\n?', '', corrected)  # non-capturing group so "json" is optional as a unit
corrected = re.sub(r'```', '', corrected)
# Fix trailing commas
corrected = re.sub(r',\s*}', '}', corrected)
corrected = re.sub(r',\s*]', ']', corrected)
# Fix unquoted keys (simple cases)
corrected = re.sub(r'(\{|,)\s*(\w+)\s*:', r'\1"\2":', corrected)
try:
return json.loads(corrected)
except json.JSONDecodeError:
return None
Parallel Processing with asyncio
Generating many outputs sequentially takes minutes. By splitting work into batches and running them in parallel, we reduce latency dramatically.
PARALLEL BATCH PROCESSING
Sequential (120 seconds)            Parallel (15 seconds)
────────────────────────            ─────────────────────
[Batch 1] ──▶ [Batch 2] ──▶         [Batch 1] ──┐
[Batch 3] ──▶ [Batch 4] ──▶         [Batch 2] ──┤
[Batch 5] ──▶ [Batch 6] ──▶         [Batch 3] ──┤
[Batch 7] ──▶ [Batch 8]             [Batch 4] ──┼──▶ Combine
                                    [Batch 5] ──┤
Total: 8 × 15s = 120s               [Batch 6] ──┤
                                    [Batch 7] ──┤
                                    [Batch 8] ──┘
                                    Total: max(15s) = 15s
Batch Generator Implementation
# services/generator.py
import asyncio
import json
from typing import List, Dict, Any
class BatchGenerator:
def __init__(self, language: str = 'En'):
self.language = language
self.validator = Validator()
async def generate_items(self, context: Dict[str, Any]) -> List[Dict]:
"""Generate items using parallel batch processing."""
# Define categories for each batch (ensures diversity)
batch_configs = [
{'category': 'Technology', 'count': 3},
{'category': 'Science', 'count': 3},
{'category': 'Business', 'count': 3},
{'category': 'Education', 'count': 3},
{'category': 'Healthcare', 'count': 2},
{'category': 'Environment', 'count': 2},
]
# Create parallel tasks
tasks = [
self._generate_batch(config, context)
for config in batch_configs
]
# Execute all batches in parallel
results = await asyncio.gather(*tasks, return_exceptions=True)
# Combine results, handling failures gracefully
all_items = []
for i, result in enumerate(results):
if isinstance(result, Exception):
print(f"Batch {i+1} failed: {result}")
continue
all_items.extend(result)
return all_items
async def _generate_batch(self, config: Dict, context: Dict,
retries: int = 3) -> List[Dict]:
"""Generate a single batch with retry logic."""
system_prompt = PromptEngine.get_system_prompt('generator', self.language)
user_prompt = PromptEngine.render_user_prompt(
'generation_template',
self.language,
count=config['count'],
category=config['category'],
item_type=context.get('format', 'summary'),
requirements=context.get('requirements', ''),
context=json.dumps(context.get('data', {}))
)
response = await AIClient.generate(system_prompt, user_prompt)
# Parse and validate
parsed = self.validator.auto_correct(response)
if parsed is None:
if retries > 0:
return await self._generate_batch(config, context, retries - 1)
raise ValueError("Failed to parse AI response")
items = parsed.get('items', [])
if not self.validator.validate(items):
if retries > 0:
print(f"Validation failed, retrying... ({retries} left)")
return await self._generate_batch(config, context, retries - 1)
raise ValueError("AI produced invalid output")
return items
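One caveat with `asyncio.gather`: it fires every batch at once, which can trip Azure's tokens-per-minute limits as batch counts grow. A semaphore caps in-flight requests without serializing everything. This is a sketch with our own names, not part of the code above:

```python
import asyncio

async def gather_limited(coro_fns, limit: int = 3):
    """Run zero-arg coroutine factories with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:  # blocks here once `limit` calls are in flight
            return await fn()

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run(fn) for fn in coro_fns))

async def main():
    async def fake_batch(i):
        await asyncio.sleep(0.01)  # stands in for an API call
        return f"batch-{i}"
    results = await gather_limited(
        [lambda i=i: fake_batch(i) for i in range(8)], limit=3
    )
    print(results)

asyncio.run(main())
```

Passing factories (`lambda: fake_batch(i)`) instead of coroutines matters: a coroutine created eagerly starts consuming resources before the semaphore admits it.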
Parameter Tuning
Different use cases require different model parameters:
| Parameter | Creative Tasks | Analytical Tasks | Classification |
|---|---|---|---|
| Temperature | 0.7-1.0 | 0.3-0.5 | 0.0-0.2 |
| Top-p | 0.9-1.0 | 0.5-0.7 | 0.1-0.3 |
| Frequency Penalty | 0.3-0.5 | 0.0-0.2 | 0.0 |
| Presence Penalty | 0.5-0.8 | 0.0-0.3 | 0.0 |
- Temperature: Higher = more random/creative, lower = more focused/deterministic
- Top-p: Limits token selection to top probability mass. Lower = more predictable
- Frequency Penalty: Reduces repetition of already-used tokens
- Presence Penalty: Encourages the model to discuss new topics
Error Handling Patterns
Production AI services must handle various failure modes:
async def safe_generate(self, prompt: str, max_retries: int = 3) -> Dict:
"""Generate with comprehensive error handling."""
for attempt in range(max_retries):
try:
response = await AIClient.generate(
self.system_prompt,
prompt,
temperature=0.5
)
parsed = json.loads(response)
                if self.validator.validate(parsed.get('items', [])):  # validate() expects the items list, not the wrapper object
return parsed
# Validation failed, retry with lower temperature
print(f"Validation failed, attempt {attempt + 1}")
except json.JSONDecodeError as e:
print(f"JSON parse error: {e}")
# Try auto-correction
corrected = self.validator.auto_correct(response)
                if corrected and self.validator.validate(corrected.get('items', [])):
return corrected
except Exception as e:
print(f"API error: {e}")
if "rate_limit" in str(e).lower():
await asyncio.sleep(2 ** attempt) # Exponential backoff
continue
raise RuntimeError(f"Failed after {max_retries} attempts")
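The plain `2 ** attempt` backoff above makes every worker retry in lockstep, so they all hit the rate limiter again at the same instant. Adding full jitter spreads retries out. A minimal helper (names and defaults are our own, not from the code above):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter, capped at `cap` seconds.

    attempt 0 → up to 1s, attempt 1 → up to 2s, attempt 2 → up to 4s, ...
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Sampled delays grow on average but never synchronize across workers
print([round(backoff_delay(a), 2) for a in range(5)])
```

Swap `await asyncio.sleep(2 ** attempt)` for `await asyncio.sleep(backoff_delay(attempt))` to get the same average wait without the thundering-herd effect.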
Best Practices
Prompt Engineering
- Be specific: "Return exactly 5 items" not "Return some items"
- Show format: Include example JSON in system prompt
- Constrain output: List valid values for enums
- Separate concerns: System prompt for behavior, user prompt for data
Performance
- Batch parallel calls: Use asyncio.gather for independent requests
- Pre-load prompts: Load from files at startup, not per-request
- Use appropriate models: GPT-4o-mini for simple tasks, GPT-4o for complex reasoning
- Stream responses: For user-facing chat, stream tokens as they arrive
Reliability
- Always validate: Never trust LLM output without validation
- Implement retries: 3 retries with exponential backoff
- Auto-correct JSON: Handle common formatting errors
- Graceful degradation: Return partial results if some batches fail
Conclusion
Production AI engineering spans multiple disciplines. LLM fundamentals (tokens, temperature, context windows) inform how you interact with models. Fine-tuning with PEFT/LoRA lets you specialize models for your domain without massive compute costs. RAG enables knowledge-grounded responses that cite sources.
On the implementation side, Django patterns like AppConfig ensure efficient initialization. The PromptEngine separates content from code, enabling non-developers to modify prompts. Validation with auto-correction handles the probabilistic nature of LLMs. The parallel processing pattern with asyncio.gather reduces latency from minutes to seconds. Combined with retry logic and graceful error handling, this architecture delivers reliable AI features at production scale.
Related Reading
These posts cover the microservices and cloud infrastructure that this AI layer runs on top of:
- How We Architected a Production SaaS: A Microservices Deep Dive — the NestJS service architecture that the GPT-4o-mini integration is embedded into, including the gateway pattern and inter-service communication.
- How We Deployed and Scaled on Azure: A Production Playbook — the Azure Container Apps environment that hosts the AI service, including the scaling configuration needed to handle burst inference load.
See the Fygurs project in the projects section for the full product context around the AI prioritization feature.
Written by
Technical Lead and Full Stack Engineer leading a 5-engineer team at Fygurs (Paris, Remote) on Azure cloud-native SaaS. Graduate of 1337 Coding School (42 Network / UM6P). Writes about architecture, cloud infrastructure, and engineering leadership.