RAG & Enterprise Chatbots

Enterprise RAG Architecture Patterns: From Proof of Concept to Production

Comprehensive guide to building production-ready Retrieval-Augmented Generation systems. Learn architectural patterns, Azure implementation strategies, and best practices for scaling RAG applications in enterprise environments.

Mohd Ali
14 min read
#RAG #Azure AI Search #Azure OpenAI #Enterprise AI #Vector Databases #Production Architecture

Executive Summary

TL;DR: Retrieval-Augmented Generation (RAG) combines semantic retrieval with LLM generation to deliver up-to-date, auditable answers from large knowledge bases. Start with Basic RAG for quick validation, add Hybrid search (vector + keyword) to handle exact matches, and move to Multi-Stage or Agentic RAG for high-accuracy enterprise scenarios. Focus early on chunking strategy, observability, and citation generation — these drive real-world reliability.

Pattern Selection Guide:

  • Basic RAG: Quick POC, static knowledge sets, low budget (~$50-200/month)
  • Hybrid RAG: Production systems needing exact-match support and higher recall (recommended starting point)
  • Multi-Stage RAG: Mission-critical accuracy where top-k precision matters (financial, legal, healthcare)
  • Agentic RAG: Complex workflows requiring dynamic tool use and decision-making

Typical Performance Benchmarks (Azure AI Search + GPT-4o):

  • Retrieval latency: p50 ~150ms, p95 ~400ms
  • Generation latency: p50 ~2s, p95 ~4.5s
  • End-to-end: p50 ~2.5s, p95 ~5s
  • Cost per 1K queries: $2-8 (depends on context size, caching, model choice)

Introduction

Retrieval-Augmented Generation (RAG) has emerged as the dominant pattern for building AI applications that need to reference external knowledge bases. While the concept is simple - retrieve relevant context, then generate a response - production implementations face challenges around accuracy, latency, cost, and scale.

This comprehensive guide walks through proven architectural patterns for enterprise RAG systems, with a focus on Azure-native implementations. Whether you’re building a customer support chatbot, internal knowledge assistant, or document analysis tool, these patterns will help you move from proof of concept to production-ready system.

Understanding RAG: Core Concepts

The RAG Pipeline

  1. User Query
  2. Query Processing & Embedding
  3. Retrieval (Vector + Keyword Search)
  4. Context Ranking & Filtering
  5. Prompt Construction
  6. LLM Generation
  7. Response Post-Processing
  8. User Response

Why RAG Over Fine-Tuning?

| Aspect | RAG | Fine-Tuning |
| --- | --- | --- |
| Data Updates | Real-time, just add documents | Requires retraining |
| Cost | Lower (no training runs) | Higher (GPU hours) |
| Explainability | Citations to source docs | Black box |
| Accuracy | High for factual queries | Variable |
| Latency | Higher (retrieval overhead) | Lower (single LLM call) |
| Use Case | Dynamic knowledge bases | Fixed behavior patterns |

Verdict: RAG is better for most enterprise scenarios where knowledge changes frequently and traceability matters.

Architectural Pattern 1: Basic RAG

When to use: Initial POC, small knowledge bases (<10K documents), internal tools with <100 users, budget-constrained projects.

Expected performance: 2-4s latency, 85-90% accuracy for straightforward queries, ~$50-200/month at 10K queries.

Architecture Overview

interface BasicRAGConfig {
  vectorStore: VectorDatabase;
  embedding: EmbeddingModel;
  llm: LanguageModel;
  chunkSize: number;
  topK: number;
}

async function basicRAG(query: string, config: BasicRAGConfig): Promise<string> {
  // 1. Embed query
  const queryEmbedding = await config.embedding.embed(query);
  
  // 2. Retrieve similar documents
  const documents = await config.vectorStore.similaritySearch(
    queryEmbedding,
    config.topK
  );
  
  // 3. Construct prompt
  const context = documents.map(doc => doc.content).join('\n\n');
  const prompt = `Context:\n${context}\n\nQuestion: ${query}\n\nAnswer:`;
  
  // 4. Generate response
  const response = await config.llm.generate(prompt);
  
  return response;
}

Azure Implementation

Components:

  • Azure OpenAI: text-embedding-3-large for embeddings, gpt-4o for generation
  • Azure AI Search: Vector + keyword search engine
  • Azure Blob Storage: Document storage

Infrastructure as Code (Bicep):

resource searchService 'Microsoft.Search/searchServices@2024-03-01-preview' = {
  name: 'rag-search-${uniqueString(resourceGroup().id)}'
  location: location
  sku: {
    name: 'standard'
  }
  properties: {
    replicaCount: 1
    partitionCount: 1
    hostingMode: 'default'
    semanticSearch: 'standard' // Enable semantic ranking
  }
}

resource openAIAccount 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: 'rag-openai-${uniqueString(resourceGroup().id)}'
  location: location
  kind: 'OpenAI'
  sku: {
    name: 'S0'
  }
  properties: {
    customSubDomainName: 'rag-openai-${uniqueString(resourceGroup().id)}'
    publicNetworkAccess: 'Enabled'
  }
}

resource embeddingDeployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = {
  parent: openAIAccount
  name: 'text-embedding-3-large'
  properties: {
    model: {
      format: 'OpenAI'
      name: 'text-embedding-3-large'
      version: '1'
    }
  }
  sku: {
    name: 'Standard'
    capacity: 120 // Tokens per minute (thousands)
  }
}

Document Ingestion Pipeline

import { SearchClient } from '@azure/search-documents';
import { OpenAIClient } from '@azure/openai';

interface Document {
  id: string;
  content: string;
  metadata: Record<string, any>;
}

async function ingestDocuments(documents: Document[]) {
  const openAI = new OpenAIClient(endpoint, credential);
  const searchClient = new SearchClient(searchEndpoint, indexName, credential);
  
  for (const doc of documents) {
    // 1. Chunk document (semantic chunking preferred)
    const chunks = await chunkDocument(doc.content, {
      maxTokens: 512,
      overlap: 50,
      preserveSentences: true
    });
    
    // 2. Generate embeddings (text-embedding-3-large returns 3072-dim vectors by default)
    const embeddings = await openAI.getEmbeddings(
      'text-embedding-3-large',
      chunks.map(c => c.text)
    );
    
    // 3. Index documents
    const searchDocuments = chunks.map((chunk, i) => ({
      id: `${doc.id}_chunk_${i}`,
      content: chunk.text,
      contentVector: embeddings.data[i].embedding,
      documentId: doc.id,
      metadata: doc.metadata,
      chunkIndex: i
    }));
    
    await searchClient.uploadDocuments(searchDocuments);
  }
}

Limitations of Basic RAG

  • Poor retrieval accuracy for complex queries
  • No keyword fallback (fails on exact matches like codes and names)
  • Context window waste (irrelevant chunks consume tokens)
  • No source attribution (can't cite where answers come from)
  • Single-shot retrieval (can't refine based on initial results)

Architectural Pattern 2: Hybrid RAG

When to use: Production systems, knowledge bases with codes/IDs/proper names, customer-facing applications, need for >92% recall.

Expected performance: 2-3s latency, 92-95% accuracy, ~$200-800/month at 50K queries (with caching).

Architecture Overview

Combines vector search (semantic similarity) with keyword search (BM25) for best-of-both-worlds retrieval.

interface HybridSearchConfig {
  vectorWeight: number;    // 0.0 - 1.0
  keywordWeight: number;   // 0.0 - 1.0
  minScore: number;        // Relevance threshold
}

async function hybridSearch(
  query: string, 
  config: HybridSearchConfig
): Promise<SearchResult[]> {
  
  const searchClient = new SearchClient(endpoint, indexName, credential);
  
  const results = await searchClient.search(query, {
    vectorQueries: [{
      kind: 'vector',
      vector: await embedQuery(query),
      kNearestNeighborsCount: 50,
      fields: ['contentVector']
    }],
    searchFields: ['content', 'title', 'metadata'],
    select: ['id', 'content', 'documentId', 'metadata'],
    top: 10,
    
    // Hybrid ranking formula
    scoringProfile: 'hybrid-profile',
    scoringParameters: [
      `vectorWeight-${config.vectorWeight}`,
      `keywordWeight-${config.keywordWeight}`
    ]
  });
  
  return results.results
    .filter(r => r.score >= config.minScore)
    .map(r => r.document);
}

Azure AI Search Hybrid Configuration

Index Schema:

{
  "name": "hybrid-rag-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    { "name": "title", "type": "Edm.String", "searchable": true },
    { "name": "contentVector", "type": "Collection(Edm.Single)", 
      "searchable": true, "dimensions": 3072, 
      "vectorSearchProfile": "vector-profile" },
    { "name": "documentId", "type": "Edm.String", "filterable": true },
    { "name": "metadata", "type": "Edm.ComplexType", "fields": [
      { "name": "category", "type": "Edm.String", "filterable": true },
      { "name": "date", "type": "Edm.DateTimeOffset", "filterable": true, "sortable": true },
      { "name": "author", "type": "Edm.String", "filterable": true }
    ]}
  ],
  "vectorSearch": {
    "algorithms": [
      { "name": "hnsw-config", "kind": "hnsw", 
        "hnswParameters": { "m": 4, "efConstruction": 400, "efSearch": 500 } }
    ],
    "profiles": [
      { "name": "vector-profile", "algorithm": "hnsw-config" }
    ]
  },
  "semantic": {
    "configurations": [
      {
        "name": "semantic-config",
        "prioritizedFields": {
          "titleField": { "fieldName": "title" },
          "contentFields": [{ "fieldName": "content" }]
        }
      }
    ]
  }
}

Semantic Ranking

Azure AI Search’s semantic ranking uses a Microsoft-trained model to rerank results:

const results = await searchClient.search(query, {
  queryType: 'semantic',
  semanticConfiguration: 'semantic-config',
  queryAnswer: 'extractive', // Get direct answer extraction
  captions: 'extractive',    // Highlight relevant passages
  top: 10
});

// Results now include semantic captions
for await (const result of results.results) {
  console.log('Score:', result.score);
  console.log('Caption:', result.captions?.[0]?.text);
  console.log('Highlights:', result.captions?.[0]?.highlights);
}

Benefits Over Basic RAG

  • Better recall: Finds documents missed by vector-only search
  • Exact match support: Handles codes, IDs, proper names
  • Semantic reranking: Microsoft's model improves top results
  • Answer extraction: Highlights the specific passages that answer the query
  • Configurable weights: Tune vector vs keyword importance per use case

Architectural Pattern 3: Multi-Stage RAG

When to use: High-stakes domains (financial, legal, healthcare), citation requirements, need for >95% precision, compliance/audit needs.

Expected performance: 3-5s latency, 95-98% accuracy, ~$500-2000/month at 50K queries (compression helps reduce costs).

Architecture Overview

Uses multiple retrieval stages with increasing specificity:

  • Stage 1: Broad Retrieval (100 candidates)
  • Stage 2: Reranking (Top 20)
  • Stage 3: Relevance Filtering (Top 5-10)
  • Stage 4: Context Compression
  • LLM Generation

Implementation

interface MultiStageRAGConfig {
  stage1_topK: number;      // Broad retrieval
  stage2_topK: number;      // After reranking
  stage3_minScore: number;  // Relevance threshold
  useCompression: boolean;
}

async function multiStageRAG(
  query: string,
  config: MultiStageRAGConfig
): Promise<RAGResponse> {
  
  // Stage 1: Broad hybrid retrieval
  const candidates = await hybridSearch(query, {
    vectorWeight: 0.5,
    keywordWeight: 0.5,
    top: config.stage1_topK
  });
  
  // Stage 2: Semantic reranking
  const reranked = await semanticRerank(query, candidates, {
    top: config.stage2_topK
  });
  
  // Stage 3: LLM-based relevance filtering
  const filtered = await llmFilter(query, reranked, {
    minScore: config.stage3_minScore,
    prompt: `Rate relevance of each document to query on scale 0-1.`
  });
  
  // Stage 4: Context compression
  let context: string;
  if (config.useCompression) {
    context = await compressContext(query, filtered);
  } else {
    context = filtered.map(doc => doc.content).join('\n\n');
  }
  
  // Final generation
  const response = await generateWithCitations(query, context, filtered);
  
  return response;
}
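
The semanticRerank and llmFilter helpers above are assumed rather than defined. One way to implement the Stage 3 LLM-based relevance filter is to score each candidate with a small model and keep only documents above the threshold; a minimal sketch (the prompt wording and the choice of gpt-4o-mini are illustrative assumptions):

// Sketch: score each candidate 0-1 with a small model, keep documents above
// the threshold. Reuses the shared openAI client and Document type from earlier.
async function llmFilter(
  query: string,
  documents: Document[],
  options: { minScore: number; prompt?: string }
): Promise<Document[]> {
  const instruction = options.prompt ??
    'Rate the relevance of the document to the query on a scale from 0 to 1. Respond with only the number.';

  const scored = await Promise.all(
    documents.map(async (doc) => {
      const completion = await openAI.chat.completions.create({
        model: 'gpt-4o-mini',
        temperature: 0,
        messages: [
          { role: 'system', content: instruction },
          { role: 'user', content: `Query: ${query}\n\nDocument:\n${doc.content}` }
        ]
      });

      const score = parseFloat(completion.choices[0].message.content ?? '0');
      return { doc, score: Number.isNaN(score) ? 0 : score };
    })
  );

  return scored
    .filter(item => item.score >= options.minScore)
    .map(item => item.doc);
}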

Context Compression

Reduces token usage while preserving relevant information:

async function compressContext(
  query: string,
  documents: Document[]
): Promise<string> {
  
  const compressionPrompt = `
Given the query: "${query}"

For each document below, extract ONLY the sentences directly relevant to answering the query.
Preserve exact wording. Omit irrelevant details.

Documents:
${documents.map((doc, i) => `[${i+1}] ${doc.content}`).join('\n\n')}

Compressed context:
`;

  const compressed = await openAI.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You extract relevant information concisely.' },
      { role: 'user', content: compressionPrompt }
    ],
    temperature: 0
  });
  
  return compressed.choices[0].message.content;
}

Citation Generation

interface CitedResponse {
  answer: string;
  citations: Citation[];
}

interface Citation {
  documentId: string;
  chunkId: string;
  text: string;
  relevanceScore: number;
}

async function generateWithCitations(
  query: string,
  context: string,
  sourceDocuments: Document[]
): Promise<CitedResponse> {
  
  const prompt = `
Context with source markers:
${sourceDocuments.map((doc, i) => `[Source ${i+1}]: ${doc.content}`).join('\n\n')}

Question: ${query}

Instructions:
1. Answer the question using ONLY information from the context above
2. Cite sources using [Source N] format after each claim
3. If information is not in context, say "I don't have enough information"

Answer:
`;

  const response = await openAI.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a helpful assistant that answers questions with citations.' },
      { role: 'user', content: prompt }
    ],
    temperature: 0.3
  });
  
  const answer = response.choices[0].message.content;
  
  // Extract citations from response
  const citationPattern = /\[Source (\d+)\]/g;
  const citationMatches = [...answer.matchAll(citationPattern)];
  
  const citations: Citation[] = citationMatches.map(match => {
    const sourceIndex = parseInt(match[1]) - 1;
    const doc = sourceDocuments[sourceIndex];
    return {
      documentId: doc.documentId,
      chunkId: doc.id,
      text: doc.content.substring(0, 200) + '...',
      relevanceScore: doc.score
    };
  });
  
  return { answer, citations };
}

Architectural Pattern 4: Agentic RAG

When to use: Complex multi-step queries, dynamic tool selection needed, research/analysis workflows, need to combine multiple data sources.

Expected performance: 5-15s latency (multiple LLM calls), 96-99% accuracy for complex queries, ~$1000-5000/month at 20K queries.

Architecture Overview

An AI agent decides the retrieval strategy dynamically:

interface AgenticRAGConfig {
  tools: Tool[];
  maxIterations: number;
  reasoningModel: string;
}

interface Tool {
  name: string;
  description: string;
  execute: (params: any) => Promise<any>;
}

async function agenticRAG(
  query: string,
  config: AgenticRAGConfig
): Promise<string> {
  
  const tools: Tool[] = [
    {
      name: 'vector_search',
      description: 'Semantic search for conceptually similar documents',
      execute: async ({ query, topK }) => vectorSearch(query, topK)
    },
    {
      name: 'keyword_search',
      description: 'Exact keyword matching for codes, names, IDs',
      execute: async ({ query, topK }) => keywordSearch(query, topK)
    },
    {
      name: 'filter_by_metadata',
      description: 'Filter documents by category, date range, author',
      execute: async ({ filter }) => metadataFilter(filter)
    },
    {
      name: 'summarize_documents',
      description: 'Summarize long documents before answering',
      execute: async ({ documentIds }) => summarizeDocs(documentIds)
    }
  ];
  
  let iteration = 0;
  let finalAnswer = '';
  
  while (iteration < config.maxIterations && !finalAnswer) {
    // Agent decides next action
    const action = await decideNextAction(query, tools, iteration);
    
    if (action.type === 'use_tool') {
      const tool = tools.find(t => t.name === action.toolName);
      const result = await tool.execute(action.parameters);
      
      // Agent evaluates if it has enough information
      const evaluation = await evaluateInformation(query, result);
      
      if (evaluation.sufficient) {
        finalAnswer = await generateFinalAnswer(query, result);
      }
    } else if (action.type === 'answer') {
      finalAnswer = action.answer;
    }
    
    iteration++;
  }
  
  return finalAnswer;
}
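
The decideNextAction helper drives the loop above but isn't shown. One way to implement it is to expose the tools to the model via OpenAI function calling and let it either pick a tool or answer directly; a sketch (the return shape and the simplified parameter schema are assumptions made for this example):

// Sketch: let the model choose a tool via function calling, or answer directly.
// Each tool's real parameter schema would be declared individually in practice.
async function decideNextAction(
  query: string,
  tools: Tool[],
  iteration: number
): Promise<{ type: 'use_tool' | 'answer'; toolName?: string; parameters?: any; answer?: string }> {
  const response = await openAI.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      { role: 'system', content: 'Decide which retrieval tool to call next, or answer directly if you already have enough information.' },
      { role: 'user', content: `Question: ${query}\nIteration: ${iteration}` }
    ],
    tools: tools.map(t => ({
      type: 'function' as const,
      function: {
        name: t.name,
        description: t.description,
        parameters: {
          type: 'object',
          properties: {
            query: { type: 'string' },
            topK: { type: 'number' }
          }
        }
      }
    })),
    tool_choice: 'auto'
  });

  const message = response.choices[0].message;
  const toolCall = message.tool_calls?.[0];

  if (toolCall) {
    return {
      type: 'use_tool',
      toolName: toolCall.function.name,
      parameters: JSON.parse(toolCall.function.arguments)
    };
  }

  return { type: 'answer', answer: message.content ?? '' };
}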

Azure Implementation with Semantic Kernel

import { Kernel, KernelArguments } from '@microsoft/semantic-kernel';
import { AzureOpenAIChatCompletion } from '@microsoft/semantic-kernel';

// Initialize kernel
const kernel = new Kernel();

kernel.addService(
  'chat',
  new AzureOpenAIChatCompletion({
    deploymentName: 'gpt-4o',
    endpoint: process.env.AZURE_OPENAI_ENDPOINT,
    apiKey: process.env.AZURE_OPENAI_KEY
  })
);

// Define retrieval functions as plugins
kernel.importPluginFromObject({
  vectorSearch: async (query: string, topK: number = 5) => {
    return await performVectorSearch(query, topK);
  },
  filterByDate: async (startDate: string, endDate: string) => {
    return await filterDocumentsByDateRange(startDate, endDate);
  }
}, 'RAGPlugin');

// Agent reasoning loop
const planner = kernel.createPlanner('sequential');

const plan = await planner.createPlan(
  `Answer the following question using available tools: ${userQuery}`
);

const result = await plan.invoke(kernel, new KernelArguments());

Production Considerations

1. Chunk Size Optimization

// Experiment with different strategies
const chunkingStrategies = {
  fixed: { size: 512, overlap: 50 },
  
  semantic: {
    // Split on sentence boundaries
    preserveSentences: true,
    maxTokens: 512,
    minTokens: 128
  },
  
  sliding_window: {
    windowSize: 256,
    stride: 128 // 50% overlap
  },
  
  hierarchical: {
    // Parent chunks (1024 tokens) for retrieval
    // Child chunks (256 tokens) for context
    parentSize: 1024,
    childSize: 256
  }
};

Recommendation: Start with semantic chunking at 512 tokens with 50-token overlap. Optimize based on your domain.
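
The chunkDocument helper used in the ingestion pipeline isn't defined in this article. A minimal sentence-preserving chunker might look like the sketch below; the characters-divided-by-four token estimate is a rough assumption, so swap in a real tokenizer (for example tiktoken) before relying on exact chunk sizes.

interface Chunk {
  text: string;
}

// Minimal sentence-preserving chunker. Token counts are approximated as
// characters / 4; use a real tokenizer for production.
async function chunkDocument(
  content: string,
  options: { maxTokens: number; overlap: number; preserveSentences: boolean }
): Promise<Chunk[]> {
  const approxTokens = (text: string) => Math.ceil(text.length / 4);
  const sentences = content.match(/[^.!?]+[.!?]+(\s|$)/g) ?? [content];

  const chunks: Chunk[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    const tokens = approxTokens(sentence);
    if (currentTokens + tokens > options.maxTokens && current.length > 0) {
      chunks.push({ text: current.join(' ').trim() });
      // Carry the tail of the previous chunk forward as overlap
      while (current.length > 0 && approxTokens(current.join(' ')) > options.overlap) {
        current.shift();
      }
      currentTokens = approxTokens(current.join(' '));
    }
    current.push(sentence.trim());
    currentTokens += tokens;
  }

  if (current.length > 0) {
    chunks.push({ text: current.join(' ').trim() });
  }

  return chunks;
}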

2. Embedding Model Selection

| Model | Dimensions | Performance | Cost | Best For |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | 1536 | Good | Low | High-volume, cost-sensitive |
| text-embedding-3-large | 3072 | Excellent | Medium | Production, accuracy-critical |
| text-embedding-ada-002 | 1536 | Good | Low | Legacy compatibility |

Recommendation: Use text-embedding-3-large for production. The improved accuracy justifies the cost.

3. Caching Strategy

import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

async function cachedRAG(query: string): Promise<string> {
  // Check cache
  const cacheKey = `rag:${hashQuery(query)}`;
  const cached = await redis.get(cacheKey);
  
  if (cached) {
    return JSON.parse(cached);
  }
  
  // Perform RAG
  const result = await performRAG(query);
  
  // Cache for 1 hour
  await redis.setEx(cacheKey, 3600, JSON.stringify(result));
  
  return result;
}
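
The hashQuery helper above isn't defined here; a minimal version normalizes the query and hashes it with Node's crypto module so cache keys stay short and fixed-length:

import { createHash } from 'crypto';

// Normalize, then hash, so near-identical queries map to the same cache key.
function hashQuery(query: string): string {
  return createHash('sha256')
    .update(query.trim().toLowerCase())
    .digest('hex');
}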

Cache Strategies:

  • Query-level: Cache full responses (high hit rate for common questions)
  • Retrieval-level: Cache search results (reuse across similar queries)
  • Embedding-level: Cache embeddings (avoid recomputation)

4. Monitoring & Observability

import * as applicationInsights from 'applicationinsights';

applicationInsights.setup(process.env.APPINSIGHTS_CONNECTION_STRING).start();
const appInsights = applicationInsights.defaultClient;

async function instrumentedRAG(query: string): Promise<string> {
  const startTime = Date.now();
  
  try {
    // Track custom event
    appInsights.trackEvent({
      name: 'RAG_Query',
      properties: {
        query: sanitize(query),
        timestamp: new Date().toISOString()
      }
    });
    
    // Perform retrieval
    const retrievalStart = Date.now();
    const documents = await retrieveDocuments(query);
    const retrievalTime = Date.now() - retrievalStart;
    
    appInsights.trackMetric({
      name: 'RetrievalLatency',
      value: retrievalTime
    });
    
    // Perform generation
    const generationStart = Date.now();
    const response = await generateResponse(query, documents);
    const generationTime = Date.now() - generationStart;
    
    appInsights.trackMetric({
      name: 'GenerationLatency',
      value: generationTime
    });
    
    // Track success
    appInsights.trackMetric({
      name: 'TotalLatency',
      value: Date.now() - startTime
    });
    
    return response;
    
  } catch (error) {
    appInsights.trackException({ exception: error });
    throw error;
  }
}

Key Metrics to Track:

  • Retrieval latency (p50, p95, p99; see the percentile sketch after this list)
  • Generation latency
  • Retrieval accuracy (requires human evaluation dataset)
  • Cache hit rate
  • Token usage (cost monitoring)
  • Error rates by type
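
Application Insights can chart percentiles for you, but if you aggregate latency samples yourself, the calculation behind the p50/p95 targets is simple nearest-rank selection; a small sketch (how you collect the samples is up to you):

// Nearest-rank percentile over a window of latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const retrievalLatencies = [120, 145, 160, 180, 420, 150, 135];
console.log('p50:', percentile(retrievalLatencies, 50), 'ms');
console.log('p95:', percentile(retrievalLatencies, 95), 'ms');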

5. Cost Optimization

interface CostOptimizationConfig {
  cacheEnabled: boolean;
  compressionEnabled: boolean;
  tierByComplexity: boolean;
}

async function costOptimizedRAG(
  query: string,
  config: CostOptimizationConfig
): Promise<string> {
  
  // Use cache if enabled
  if (config.cacheEnabled) {
    const cached = await getFromCache(query);
    if (cached) return cached;
  }
  
  // Retrieve documents
  const documents = await hybridSearch(query, { top: 10 });
  
  // Build context, compressing it first if enabled
  let context = documents.map(d => d.content).join('\n\n');
  if (config.compressionEnabled) {
    context = await compressContext(query, documents);
  }
  
  // Route to appropriate model based on complexity
  let model = 'gpt-4o';
  if (config.tierByComplexity) {
    const complexity = await assessQueryComplexity(query);
    model = complexity < 0.5 ? 'gpt-4o-mini' : 'gpt-4o';
  }
  
  const response = await generate(query, context, { model });
  
  return response;
}
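
The assessQueryComplexity helper above is left undefined. A cheap heuristic, query length plus a few reasoning keywords, is often enough to decide between gpt-4o-mini and gpt-4o; the weights and keyword list in this sketch are arbitrary starting points, not tuned values:

// Cheap routing heuristic: longer queries and comparison/reasoning keywords
// push the score up. Thresholds here are illustrative, not tuned.
async function assessQueryComplexity(query: string): Promise<number> {
  const words = query.trim().split(/\s+/).length;
  const reasoningSignals = /\b(compare|why|difference|analyze|trend|summarize|versus|impact)\b/i;

  let score = Math.min(words / 30, 0.6);        // length contributes up to 0.6
  if (reasoningSignals.test(query)) score += 0.3;
  if (query.includes('?') && words > 20) score += 0.1;

  return Math.min(score, 1);
}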

Cost Reduction Strategies:

  • ✅ Use gpt-4o-mini for simple queries (85% cheaper)
  • ✅ Enable prompt caching (50% savings on repeated context)
  • ✅ Compress context before generation (30-50% token savings)
  • ✅ Batch embeddings API calls (up to 16 inputs per request; see the sketch after this list)
  • ✅ Use Azure Reserved Capacity for predictable workloads (savings up to 50%)
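
Batching embedding calls is the easiest of these wins during ingestion. A sketch that reuses the OpenAIClient from the ingestion example and the 16-input batch size mentioned above (raise it if your deployment's limits allow):

// Embed texts in batches instead of one call per chunk.
async function embedInBatches(
  openAI: OpenAIClient,
  texts: string[],
  batchSize = 16
): Promise<number[][]> {
  const vectors: number[][] = [];

  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const result = await openAI.getEmbeddings('text-embedding-3-large', batch);
    // Each result item carries its embedding in input order
    for (const item of result.data) {
      vectors.push(item.embedding);
    }
  }

  return vectors;
}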

6. Security & Compliance

Data Privacy & Encryption

// Azure AI Search with encryption and private endpoints
resource searchService 'Microsoft.Search/searchServices@2024-03-01-preview' = {
  name: 'secure-rag-search'
  location: location
  sku: { name: 'standard' }
  properties: {
    replicaCount: 2
    partitionCount: 1
    publicNetworkAccess: 'Disabled' // Force private endpoint
    encryptionWithCmk: {
      enforcement: 'Enabled'
      encryptionComplianceStatus: 'Compliant'
    }
  }
}

// Private endpoint for search
resource searchPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-04-01' = {
  name: 'search-pe'
  location: location
  properties: {
    subnet: { id: subnetId }
    privateLinkServiceConnections: [{
      name: 'search-connection'
      properties: {
        privateLinkServiceId: searchService.id
        groupIds: ['searchService']
      }
    }]
  }
}

// Azure OpenAI with managed identity
resource openAI 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: 'secure-rag-openai'
  location: location
  kind: 'OpenAI'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    publicNetworkAccess: 'Disabled'
    networkAcls: {
      defaultAction: 'Deny'
      virtualNetworkRules: [{
        id: subnetId
        ignoreMissingVnetServiceEndpoint: false
      }]
    }
    customSubDomainName: 'secure-rag-openai'
  }
}

PII Detection & Redaction

import { TextAnalyticsClient, PiiEntity } from '@azure/ai-text-analytics';

async function detectAndRedactPII(
  text: string
): Promise<{ redacted: string; entities: PiiEntity[] }> {
  
  const client = new TextAnalyticsClient(endpoint, credential);
  
  const [result] = await client.recognizePiiEntities([text]);
  if (result.error) {
    throw new Error(`PII detection failed: ${result.error.message}`);
  }
  const piiEntities = [...result.entities];
  
  // Redact PII, working from the end so earlier offsets stay valid
  let redacted = text;
  for (const entity of piiEntities.sort((a, b) => b.offset - a.offset)) {
    const before = redacted.substring(0, entity.offset);
    const after = redacted.substring(entity.offset + entity.length);
    redacted = `${before}[REDACTED:${entity.category}]${after}`;
  }
  
  return { redacted, entities: piiEntities };
}

async function secureRAG(query: string): Promise<RAGResponse> {
  // 1. Detect PII in query
  const { redacted: safeQuery, entities: queryPII } = await detectAndRedactPII(query);
  
  // 2. Log PII detection event (for compliance audit)
  await auditLog({
    timestamp: new Date().toISOString(),
    action: 'pii_detection',
    userId: currentUser.id,
    piiDetected: queryPII.length > 0,
    categories: queryPII.map(e => e.category)
  });
  
  // 3. Perform RAG with redacted query
  const response = await performRAG(safeQuery);
  
  return response;
}

Compliance & Audit Logging

interface AuditLog {
  timestamp: string;
  userId: string;
  action: 'query' | 'retrieval' | 'generation' | 'pii_detection';
  queryHash: string;        // SHA-256 of query (never store raw)
  documentsRetrieved: number;
  tokensUsed: number;
  responseTime: number;
  piiDetected: boolean;
  complianceFlags: string[];
}

async function logRAGActivity(log: AuditLog): Promise<void> {
  // Store in Azure Monitor Logs for HIPAA/SOC2/GDPR compliance
  await appInsights.trackEvent({
    name: 'RAG_Activity',
    properties: log,
    measurements: {
      latency: log.responseTime,
      tokens: log.tokensUsed
    }
  });
  
  // For regulations requiring long-term retention
  await cosmosClient
    .database('compliance')
    .container('audit_logs')
    .items.create(log);
}

Data Residency & Sovereignty

Key considerations for enterprise deployments:

  • Azure region selection: Deploy Azure OpenAI and AI Search in the same region as your data (EU: West Europe/North Europe, US: East US/West US)
  • Customer-managed keys (CMK): Use Azure Key Vault for encryption keys (commonly required by HIPAA and GDPR compliance programs)
  • Private endpoints: Disable public internet access and use VNet integration
  • Data retention policies: Configure TTL on indexed documents per compliance requirements
  • Access controls: Use Azure RBAC + Managed Identity; never use API keys in production

// Example: Managed Identity authentication (no keys)
import { DefaultAzureCredential } from '@azure/identity';

const credential = new DefaultAzureCredential();

const searchClient = new SearchClient(
  endpoint,
  indexName,
  credential // Uses managed identity, not API key
);

const openAIClient = new OpenAIClient(
  endpoint,
  credential // Same for OpenAI
);

Testing & Evaluation

Retrieval Quality Metrics

interface EvaluationDataset {
  queries: EvaluationQuery[];
}

interface EvaluationQuery {
  query: string;
  relevantDocIds: string[]; // Ground truth
}

async function evaluateRetrieval(
  dataset: EvaluationDataset
): Promise<RetrievalMetrics> {
  
  let totalPrecisionAtK = 0;
  let totalRecallAtK = 0;
  let totalMRR = 0;
  
  for (const item of dataset.queries) {
    const results = await hybridSearch(item.query, { top: 10 });
    const retrievedIds = results.map(r => r.documentId);
    
    // Precision@K
    const relevantRetrieved = retrievedIds.filter(id => 
      item.relevantDocIds.includes(id)
    );
    const precision = relevantRetrieved.length / retrievedIds.length;
    totalPrecisionAtK += precision;
    
    // Recall@K
    const recall = relevantRetrieved.length / item.relevantDocIds.length;
    totalRecallAtK += recall;
    
    // Mean Reciprocal Rank
    const firstRelevantIndex = retrievedIds.findIndex(id => 
      item.relevantDocIds.includes(id)
    );
    const mrr = firstRelevantIndex >= 0 ? 1 / (firstRelevantIndex + 1) : 0;
    totalMRR += mrr;
  }
  
  return {
    precision_at_10: totalPrecisionAtK / dataset.queries.length,
    recall_at_10: totalRecallAtK / dataset.queries.length,
    mean_reciprocal_rank: totalMRR / dataset.queries.length
  };
}

End-to-End Quality Metrics

async function evaluateRAGQuality(
  testQueries: TestQuery[]
): Promise<QualityMetrics> {
  
  const results = await Promise.all(
    testQueries.map(async (test) => {
      const response = await performRAG(test.query);
      
      // LLM-as-judge evaluation
      const evaluation = await evaluateResponse({
        query: test.query,
        response: response,
        groundTruth: test.expectedAnswer,
        criteria: ['accuracy', 'completeness', 'relevance', 'citation_quality']
      });
      
      return evaluation;
    })
  );
  
  return {
    accuracy: average(results.map(r => r.accuracy)),
    completeness: average(results.map(r => r.completeness)),
    relevance: average(results.map(r => r.relevance)),
    citation_quality: average(results.map(r => r.citation_quality))
  };
}
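
The evaluateResponse helper above is assumed. A common implementation is an LLM-as-judge prompt that returns structured scores per criterion; a sketch using JSON mode (the prompt wording is illustrative, and the criteria keys are expected to match the call above):

interface ResponseEvaluation {
  accuracy: number;
  completeness: number;
  relevance: number;
  citation_quality: number;
}

// LLM-as-judge: ask GPT-4o to score the answer against the ground truth on
// each criterion (0-1) and return JSON.
async function evaluateResponse(input: {
  query: string;
  response: string;
  groundTruth: string;
  criteria: string[];
}): Promise<ResponseEvaluation> {
  const judgePrompt = `
Query: ${input.query}
Expected answer: ${input.groundTruth}
Actual answer: ${input.response}

Score the actual answer on each criterion from 0 to 1: ${input.criteria.join(', ')}.
Respond with a JSON object whose keys are the criteria names.
`;

  const completion = await openAI.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    response_format: { type: 'json_object' },
    messages: [{ role: 'user', content: judgePrompt }]
  });

  return JSON.parse(completion.choices[0].message.content ?? '{}');
}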

Common Pitfalls & Solutions

Pitfall 1: Hallucination Despite Context

Problem: LLM generates information not present in retrieved documents

Solution: Stricter prompt engineering + verification

const strictPrompt = `
CRITICAL INSTRUCTIONS:
1. Use ONLY information from the provided context
2. If the context doesn't contain the answer, respond: "I don't have enough information to answer this question."
3. Never make assumptions or use external knowledge
4. Cite sources for every claim using [Source N] format

Context:
${context}

Question: ${query}

Answer (following instructions above):
`;

Pitfall 2: Poor Retrieval Quality

Problem: Relevant documents not retrieved in top results

Solutions:

  1. Improve chunking: Use semantic chunking instead of fixed-size
  2. Add metadata: Enrich documents with category, date, author for filtering
  3. Tune hybrid weights: Experiment with vector vs keyword ratios
  4. Use query expansion: Reformulate query with synonyms/variations
async function queryExpansion(originalQuery: string): Promise<string[]> {
  const expansionPrompt = `
Generate 3 alternative phrasings of this question that mean the same thing:
"${originalQuery}"

Alternative phrasings:
`;

  const response = await openAI.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: expansionPrompt }],
    temperature: 0.7
  });
  
  const alternatives = response.choices[0].message.content
    .split('\n')
    .filter(line => line.trim().length > 0);
  
  return [originalQuery, ...alternatives];
}
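
Once you have the variants, a common pattern is to run the search for each one and merge the result lists, for example with reciprocal rank fusion; a sketch (k = 60 is the conventional RRF smoothing constant, and the id field is assumed to uniquely identify a chunk):

// Search with every query variant and merge via reciprocal rank fusion (RRF).
async function expandedSearch(originalQuery: string): Promise<SearchResult[]> {
  const variants = await queryExpansion(originalQuery);
  const k = 60;
  const scores = new Map<string, { result: SearchResult; score: number }>();

  for (const variant of variants) {
    const results = await hybridSearch(variant, { top: 10 });
    results.forEach((result, rank) => {
      const contribution = 1 / (k + rank + 1);
      const existing = scores.get(result.id);
      if (existing) {
        existing.score += contribution;
      } else {
        scores.set(result.id, { result, score: contribution });
      }
    });
  }

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map(entry => entry.result);
}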

Pitfall 3: High Latency

Problem: RAG responses take 5+ seconds

Solutions:

  1. Parallel retrieval + generation: Don’t wait for all chunks to process
  2. Streaming responses: Start showing answer before completion
  3. Precompute embeddings: Index ahead of time, not on-demand
  4. Optimize chunk count: More isn’t always better (diminishing returns after 5-10 chunks)
async function streamingRAG(query: string): Promise<ReadableStream> {
  const documents = await hybridSearch(query, { top: 5 });
  const context = documents.map(d => d.content).join('\n\n');
  
  // Stream response token by token
  const stream = await openAI.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: `Context: ${context}\n\nQuestion: ${query}` }
    ],
    stream: true
  });
  
  return new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content || '';
        controller.enqueue(content);
      }
      controller.close();
    }
  });
}
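
Consuming that stream on the caller's side is a plain reader loop; a minimal usage sketch (writing tokens to stdout stands in for appending to a chat UI):

// Minimal consumer: read tokens as they arrive and surface them immediately.
async function demoStreaming() {
  const stream = await streamingRAG('How do I reset my password?');
  const reader = stream.getReader();

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    process.stdout.write(value);
  }
}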

Troubleshooting Checklist

When Retrieval Quality is Poor

Symptom: Relevant documents not appearing in top 5 results

Check chunking strategy:

  • Try semantic chunking instead of fixed-size
  • Reduce chunk size if concepts span boundaries (512→256 tokens)
  • Increase overlap (50→100 tokens)

Tune HNSW parameters:

// Increase recall at cost of latency
{
  "hnswParameters": {
    "m": 8,              // Default 4, increase to 8-16 for better recall
    "efConstruction": 800, // Default 400, higher = better index quality
    "efSearch": 1000      // Default 500, higher = better search recall
  }
}

Adjust hybrid weights:

// If exact matches failing: increase keyword weight
const config = {
  vectorWeight: 0.3,   // Reduce from 0.5
  keywordWeight: 0.7   // Increase from 0.5
};

// If semantic similarity failing: increase vector weight
const config = {
  vectorWeight: 0.8,   // Increase from 0.5
  keywordWeight: 0.2   // Reduce from 0.5
};

Verify embeddings:

// Test embedding similarity
const query = "How do I reset my password?";
const doc = "Password reset instructions: ...";

const queryEmb = await embed(query);
const docEmb = await embed(doc);

// Should be >0.7 for relevant doc/query pairs
const similarity = cosineSimilarity(queryEmb, docEmb);
console.log('Similarity:', similarity);

if (similarity < 0.7) {
  // Problem: embeddings not capturing semantic relationship
  // Fix: Try text-embedding-3-large, check for domain mismatch
}
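
cosineSimilarity isn't defined above; a plain implementation over two equal-length vectors:

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}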

Add metadata filters:

// Filter by category, date, or tags to narrow search space
const results = await searchClient.search(query, {
  filter: `category eq 'technical_docs' and date ge 2024-01-01T00:00:00Z`,
  vectorQueries: [/* ... */],
  top: 10
});

When Responses are Hallucinating

Symptom: LLM inventing information not in context

Strengthen prompt instructions:

const strictPrompt = `
CRITICAL RULES (you will be penalized for violations):
1. Use ONLY information from Context below
2. If Context doesn't answer the question, respond EXACTLY: "I don't have enough information to answer this question."
3. Never use external knowledge or make assumptions
4. Cite sources using [Source N] for every claim
5. If uncertain, say "The context suggests..." not "The answer is..."

Context:
${context}

Question: ${query}

Answer (following rules above):
`;

Use lower temperature:

const response = await openAI.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0,  // Use 0 for factual tasks (default 1)
  messages: [/* ... */]
});

Add verification step:

// Two-pass approach: generate, then verify
const answer = await generateAnswer(query, context);

const verificationPrompt = `
Context: ${context}
Answer: ${answer}

Is every claim in the Answer supported by the Context?
Respond with JSON: { "verified": true/false, "unsupported_claims": [] }
`;

const verification = await openAI.chat.completions.create({
  model: 'gpt-4o',
  response_format: { type: 'json_object' },
  messages: [{ role: 'user', content: verificationPrompt }]
});

const verdict = JSON.parse(verification.choices[0].message.content ?? '{}');

if (!verdict.verified) {
  return "I couldn't verify all claims against the source documents.";
}

When Latency is Too High

Symptom: Responses taking >5 seconds

Profile the pipeline:

const timings: Record<string, number> = {};

const start = Date.now();
const docs = await retrieve(query);
timings.retrieval = Date.now() - start;

const genStart = Date.now();
const response = await generate(query, docs);
timings.generation = Date.now() - genStart;

console.log('Retrieval:', timings.retrieval, 'ms');
console.log('Generation:', timings.generation, 'ms');

// If retrieval >500ms: check HNSW parameters, reduce top-k
// If generation >3s: reduce context size, use streaming

Optimize retrieval:

  • Reduce top from 50→20 (less reranking)
  • Lower efSearch if recall is acceptable
  • Use semantic caching for common queries (see the sketch under "Implement caching" below)

Optimize generation:

  • Use streaming responses (perceived latency)
  • Compress context (reduce tokens)
  • Switch to gpt-4o-mini for simple queries (3x faster)

Implement caching:

// Cache at multiple levels: embedding-level example
const cacheKey = `emb:${hashQuery(query)}`;
const cachedEmbedding = await redis.get(cacheKey);

let embedding: number[];
if (cachedEmbedding) {
  embedding = JSON.parse(cachedEmbedding);
} else {
  embedding = await embed(query);
  await redis.setEx(cacheKey, 86400, JSON.stringify(embedding)); // 24-hour TTL
}
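
The semantic caching mentioned under retrieval optimizations goes one step further than these exact-key caches: embed the incoming query and reuse a previous answer when an earlier query is close enough. A rough in-memory sketch (the 0.95 threshold is a guess to tune, and production setups would back this with Redis or a small vector index):

interface SemanticCacheEntry {
  embedding: number[];
  response: string;
}

// In-memory semantic cache: reuse a previous answer when a new query's
// embedding is very close to one already answered.
const semanticCache: SemanticCacheEntry[] = [];

async function semanticCachedRAG(query: string): Promise<string> {
  const queryEmbedding = await embed(query);

  for (const entry of semanticCache) {
    if (cosineSimilarity(queryEmbedding, entry.embedding) > 0.95) {
      return entry.response; // Close enough: reuse the cached answer
    }
  }

  const response = await performRAG(query);
  semanticCache.push({ embedding: queryEmbedding, response });
  return response;
}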

Advanced Patterns

1. Multi-Modal RAG

Retrieve and reason over text, images, tables, charts:

interface MultiModalDocument {
  text: string;
  images: string[]; // URIs
  tables: TableData[];
  charts: ChartData[];
}

async function multiModalRAG(query: string): Promise<string> {
  // Retrieve documents with all modalities
  const documents: MultiModalDocument[] = await hybridSearch(query, { top: 5 });
  
  // Use GPT-4o vision to describe each document's images
  const enrichedContext = await Promise.all(
    documents.map(async (doc) => {
      const imageDescriptions = await Promise.all(
        doc.images.map(async (imageUrl) => {
          const completion = await openAI.chat.completions.create({
            model: 'gpt-4o',
            messages: [{
              role: 'user',
              content: [
                { type: 'text', text: 'Describe this image in detail.' },
                { type: 'image_url', image_url: { url: imageUrl } }
              ]
            }]
          });
          return completion.choices[0]?.message?.content ?? '';
        })
      );
      
      // Combine text + image descriptions for generation
      return { text: doc.text, imageContext: imageDescriptions.join('\n') };
    })
  );
  
  return await generate(query, enrichedContext);
}

2. Graph RAG

Combine vector search with knowledge graph traversal:

// Azure Cosmos DB for Apache Gremlin (graph database)
import gremlin from 'gremlin';

async function graphRAG(query: string): Promise<string> {
  // 1. Vector search for initial nodes
  const initialNodes = await vectorSearch(query, { top: 3 });
  
  // 2. Traverse graph to find related entities
  const graphClient = getGremlinClient();
  const relatedEntities = await graphClient.submit(
    `g.V(${initialNodes.map(n => `'${n.id}'`).join(',')})
       .out('RELATED_TO')
       .dedup()
       .limit(10)
       .valueMap()`
  );
  
  // 3. Retrieve full documents for related entities
  const documents = await fetchDocuments(relatedEntities);
  
  // 4. Generate response with graph-enriched context
  return await generate(query, documents);
}

3. Adaptive RAG

System learns optimal retrieval strategies per query type:

interface AdaptiveRAGModel {
  predict(query: string): Promise<RAGStrategy>;
  train(query: string, strategy: RAGStrategy, feedback: number): Promise<void>;
}

async function adaptiveRAG(
  query: string,
  model: AdaptiveRAGModel
): Promise<string> {
  
  // Predict optimal strategy based on query characteristics
  const strategy = await model.predict(query);
  
  // Execute predicted strategy
  let response: string;
  switch (strategy.type) {
    case 'hybrid':
      response = await hybridRAG(query, strategy.config);
      break;
    case 'multi_stage':
      response = await multiStageRAG(query, strategy.config);
      break;
    case 'agentic':
      response = await agenticRAG(query, strategy.config);
      break;
  }
  
  // Collect feedback for continuous learning
  const feedback = await getUserFeedback(response);
  await model.train(query, strategy, feedback);
  
  return response;
}

References & Further Reading

Tools & Libraries

  • Semantic Kernel - Microsoft’s AI orchestration framework
  • LangChain - RAG framework with Azure integrations
  • Shiki - Syntax highlighting (used in this blog)

Conclusion

Building production-ready RAG systems requires more than basic vector search + LLM generation. The patterns we’ve covered - from hybrid search to agentic RAG - represent the current state of the art in enterprise AI applications.

Key architectural decisions:

  • Start with Hybrid RAG (vector + keyword + semantic ranking)
  • Add multi-stage retrieval when accuracy is critical
  • Consider agentic patterns for complex, multi-step queries
  • Invest in observability from day one
  • Optimize for cost with caching, compression, and model routing

Azure provides a comprehensive platform for RAG:

  • Azure AI Search: Best-in-class hybrid search
  • Azure OpenAI: GPT-4o and embeddings with enterprise SLAs
  • Azure Cosmos DB: Scalable metadata and graph storage
  • Azure Monitor: End-to-end observability

Key Takeaways

RAG beats fine-tuning for most enterprise knowledge applications

Hybrid search (vector + keyword + semantic) dramatically improves retrieval quality

Multi-stage retrieval with reranking and compression optimizes accuracy and cost

Citations are non-negotiable for enterprise trust and compliance

Monitor everything: retrieval quality, latency, cost, and user satisfaction

Azure AI Search + Azure OpenAI provide production-ready RAG infrastructure

Next Steps

Ready to implement these patterns? Here’s your roadmap:

  1. Start Simple: Build basic RAG, measure baseline metrics
  2. Add Hybrid Search: Implement vector + keyword with Azure AI Search
  3. Enable Semantic Ranking: Significant quality boost for minimal effort
  4. Iterate on Chunking: Experiment with strategies for your domain
  5. Add Observability: Track metrics to guide optimization
  6. Scale Progressively: Add multi-stage or agentic patterns as needed

Want to dive deeper into specific RAG use cases? Check out:

Need help architecting your RAG system? Get in touch - I specialize in Azure-native AI solutions for enterprises.


This article is part of the RAG & Enterprise Chatbots series. Subscribe below for in-depth technical guides on AI architecture.