Executive Summary
TL;DR: Retrieval-Augmented Generation (RAG) combines semantic retrieval with LLM generation to deliver up-to-date, auditable answers from large knowledge bases. Start with Basic RAG for quick validation, add Hybrid search (vector + keyword) to handle exact matches, and move to Multi-Stage or Agentic RAG for high-accuracy enterprise scenarios. Focus early on chunking strategy, observability, and citation generation — these drive real-world reliability.
Pattern Selection Guide:
- Basic RAG: Quick POC, static knowledge sets, low budget (~$50-200/month)
- Hybrid RAG: Production systems needing exact-match support and higher recall (recommended starting point)
- Multi-Stage RAG: Mission-critical accuracy where top-k precision matters (financial, legal, healthcare)
- Agentic RAG: Complex workflows requiring dynamic tool use and decision-making
Typical Performance Benchmarks (Azure AI Search + GPT-4o):
- Retrieval latency: p50 ~150ms, p95 ~400ms
- Generation latency: p50 ~2s, p95 ~4.5s
- End-to-end: p50 ~2.5s, p95 ~5s
- Cost per 1K queries: $2-8 (depends on context size, caching, model choice)
Introduction
Retrieval-Augmented Generation (RAG) has emerged as the dominant pattern for building AI applications that need to reference external knowledge bases. While the concept is simple - retrieve relevant context, then generate a response - production implementations face challenges around accuracy, latency, cost, and scale.
This comprehensive guide walks through proven architectural patterns for enterprise RAG systems, with a focus on Azure-native implementations. Whether you’re building a customer support chatbot, internal knowledge assistant, or document analysis tool, these patterns will help you move from proof of concept to production-ready system.
Understanding RAG: Core Concepts
The RAG Pipeline
User Query
↓
Query Processing & Embedding
↓
Retrieval (Vector + Keyword Search)
↓
Context Ranking & Filtering
↓
Prompt Construction
↓
LLM Generation
↓
Response Post-Processing
↓
User Response
Why RAG Over Fine-Tuning?
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Data Updates | Real-time, just add documents | Requires retraining |
| Cost | Lower (no training runs) | Higher (GPU hours) |
| Explainability | Citations to source docs | Black box |
| Accuracy | High for factual queries | Variable |
| Latency | Higher (retrieval overhead) | Lower (single LLM call) |
| Use Case | Dynamic knowledge bases | Fixed behavior patterns |
Verdict: RAG is better for most enterprise scenarios where knowledge changes frequently and traceability matters.
Architectural Pattern 1: Basic RAG
When to use: Initial POC, small knowledge bases (<10K documents), internal tools with <100 users, budget-constrained projects.
Expected performance: 2-4s latency, 85-90% accuracy for straightforward queries, ~$50-200/month at 10K queries.
Architecture Overview
interface BasicRAGConfig {
vectorStore: VectorDatabase;
embedding: EmbeddingModel;
llm: LanguageModel;
chunkSize: number;
topK: number;
}
async function basicRAG(query: string, config: BasicRAGConfig): Promise<string> {
// 1. Embed query
const queryEmbedding = await config.embedding.embed(query);
// 2. Retrieve similar documents
const documents = await config.vectorStore.similaritySearch(
queryEmbedding,
config.topK
);
// 3. Construct prompt
const context = documents.map(doc => doc.content).join('\n\n');
const prompt = `Context:\n${context}\n\nQuestion: ${query}\n\nAnswer:`;
// 4. Generate response
const response = await config.llm.generate(prompt);
return response;
}
Azure Implementation
Components:
- Azure OpenAI: `text-embedding-3-large` for embeddings, `gpt-4o` for generation
- Azure AI Search: Vector + keyword search engine
- Azure Blob Storage: Document storage
Infrastructure as Code (Bicep):
resource searchService 'Microsoft.Search/searchServices@2024-03-01-preview' = {
name: 'rag-search-${uniqueString(resourceGroup().id)}'
location: location
sku: {
name: 'standard'
}
properties: {
replicaCount: 1
partitionCount: 1
hostingMode: 'default'
semanticSearch: 'standard' // Enable semantic ranking
}
}
resource openAIAccount 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
name: 'rag-openai-${uniqueString(resourceGroup().id)}'
location: location
kind: 'OpenAI'
sku: {
name: 'S0'
}
properties: {
customSubDomainName: 'rag-openai-${uniqueString(resourceGroup().id)}'
publicNetworkAccess: 'Enabled'
}
}
resource embeddingDeployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = {
parent: openAIAccount
name: 'text-embedding-3-large'
properties: {
model: {
format: 'OpenAI'
name: 'text-embedding-3-large'
version: '1'
}
}
sku: {
name: 'Standard'
capacity: 120 // Tokens per minute (thousands)
}
}
Document Ingestion Pipeline
import { BlobServiceClient } from '@azure/storage-blob';
import { SearchClient, SearchIndexClient } from '@azure/search-documents';
import { AzureOpenAI } from 'openai';
import { getBearerTokenProvider } from '@azure/identity';
interface Document {
id: string;
content: string;
metadata: Record<string, any>;
}
async function ingestDocuments(documents: Document[]) {
  const openAI = new AzureOpenAI({
    endpoint,
    apiVersion: '2024-06-01',
    // Entra ID token via the shared credential (no API key)
    azureADTokenProvider: getBearerTokenProvider(credential, 'https://cognitiveservices.azure.com/.default')
  });
const searchClient = new SearchClient(searchEndpoint, indexName, credential);
for (const doc of documents) {
// 1. Chunk document (semantic chunking preferred)
const chunks = await chunkDocument(doc.content, {
maxTokens: 512,
overlap: 50,
preserveSentences: true
});
// 2. Generate embeddings
const embeddings = await openAI.embeddings.create({
model: 'text-embedding-3-large',
input: chunks.map(c => c.text),
dimensions: 3072 // Full dimensions for maximum accuracy
});
// 3. Index documents
const searchDocuments = chunks.map((chunk, i) => ({
id: `${doc.id}_chunk_${i}`,
content: chunk.text,
contentVector: embeddings.data[i].embedding,
documentId: doc.id,
metadata: doc.metadata,
chunkIndex: i
}));
await searchClient.uploadDocuments(searchDocuments);
}
}
Limitations of Basic RAG
❌ Poor retrieval accuracy for complex queries
❌ No keyword fallback (fails on exact matches like codes, names)
❌ Context window waste (irrelevant chunks consume tokens)
❌ No source attribution (can’t cite where answers come from)
❌ Single-shot retrieval (can’t refine based on initial results)
Architectural Pattern 2: Hybrid RAG
When to use: Production systems, knowledge bases with codes/IDs/proper names, customer-facing applications, need for >92% recall.
Expected performance: 2-3s latency, 92-95% accuracy, ~$200-800/month at 50K queries (with caching).
Architecture Overview
Combines vector search (semantic similarity) with keyword search (BM25) for best-of-both-worlds retrieval.
interface HybridSearchConfig {
  vectorWeight?: number;  // 0.0 - 1.0 (defaults to 0.5)
  keywordWeight?: number; // 0.0 - 1.0 (defaults to 0.5)
  minScore?: number;      // Relevance threshold (defaults to 0)
  top?: number;           // Number of results to return (defaults to 10)
}
async function hybridSearch(
query: string,
config: HybridSearchConfig
): Promise<SearchResult[]> {
const searchClient = new SearchClient(endpoint, indexName, credential);
const results = await searchClient.search(query, {
vectorQueries: [{
kind: 'vector',
vector: await embedQuery(query),
kNearestNeighborsCount: 50,
fields: ['contentVector']
}],
searchFields: ['content', 'title', 'metadata'],
select: ['id', 'content', 'documentId', 'metadata'],
    top: config.top ?? 10,
    // Hybrid ranking formula
    scoringProfile: 'hybrid-profile',
    scoringParameters: [
      `vectorWeight-${config.vectorWeight ?? 0.5}`,
      `keywordWeight-${config.keywordWeight ?? 0.5}`
    ]
  });

  // results.results is an async iterator, so collect and filter with for await
  const documents: SearchResult[] = [];
  for await (const result of results.results) {
    if ((result.score ?? 0) >= (config.minScore ?? 0)) {
      documents.push(result.document as SearchResult);
    }
  }
  return documents;
}
Azure AI Search Hybrid Configuration
Index Schema:
{
"name": "hybrid-rag-index",
"fields": [
{ "name": "id", "type": "Edm.String", "key": true },
{ "name": "content", "type": "Edm.String", "searchable": true },
{ "name": "title", "type": "Edm.String", "searchable": true },
{ "name": "contentVector", "type": "Collection(Edm.Single)",
"searchable": true, "dimensions": 3072,
"vectorSearchProfile": "vector-profile" },
{ "name": "documentId", "type": "Edm.String", "filterable": true },
{ "name": "metadata", "type": "Edm.ComplexType", "fields": [
{ "name": "category", "type": "Edm.String", "filterable": true },
{ "name": "date", "type": "Edm.DateTimeOffset", "filterable": true, "sortable": true },
{ "name": "author", "type": "Edm.String", "filterable": true }
]}
],
"vectorSearch": {
"algorithms": [
{ "name": "hnsw-config", "kind": "hnsw",
"hnswParameters": { "m": 4, "efConstruction": 400, "efSearch": 500 } }
],
"profiles": [
{ "name": "vector-profile", "algorithm": "hnsw-config" }
]
},
"semantic": {
"configurations": [
{
"name": "semantic-config",
"prioritizedFields": {
"titleField": { "fieldName": "title" },
"contentFields": [{ "fieldName": "content" }]
}
}
]
}
}
Semantic Ranking
Azure AI Search’s semantic ranking uses a Microsoft-trained model to rerank results:
const results = await searchClient.search(query, {
queryType: 'semantic',
semanticConfiguration: 'semantic-config',
queryAnswer: 'extractive', // Get direct answer extraction
captions: 'extractive', // Highlight relevant passages
top: 10
});
// Results now include semantic captions
for await (const result of results.results) {
console.log('Score:', result.score);
console.log('Caption:', result.captions?.[0]?.text);
console.log('Highlights:', result.captions?.[0]?.highlights);
}
Benefits Over Basic RAG
✅ Better recall: Finds documents missed by vector-only search
✅ Exact match support: Handles codes, IDs, proper names
✅ Semantic reranking: Microsoft’s model improves top results
✅ Answer extraction: Highlights specific passages that answer query
✅ Configurable weights: Tune vector vs keyword importance per use case
Architectural Pattern 3: Multi-Stage RAG
When to use: High-stakes domains (financial, legal, healthcare), citation requirements, need for >95% precision, compliance/audit needs.
Expected performance: 3-5s latency, 95-98% accuracy, ~$500-2000/month at 50K queries (compression helps reduce costs).
Architecture Overview
Uses multiple retrieval stages with increasing specificity:
Stage 1: Broad Retrieval (100 candidates)
↓
Stage 2: Reranking (Top 20)
↓
Stage 3: Relevance Filtering (Top 5-10)
↓
Stage 4: Context Compression
↓
LLM Generation
Implementation
interface MultiStageRAGConfig {
stage1_topK: number; // Broad retrieval
stage2_topK: number; // After reranking
stage3_minScore: number; // Relevance threshold
useCompression: boolean;
}
async function multiStageRAG(
query: string,
config: MultiStageRAGConfig
): Promise<RAGResponse> {
// Stage 1: Broad hybrid retrieval
const candidates = await hybridSearch(query, {
vectorWeight: 0.5,
keywordWeight: 0.5,
top: config.stage1_topK
});
// Stage 2: Semantic reranking
const reranked = await semanticRerank(query, candidates, {
top: config.stage2_topK
});
// Stage 3: LLM-based relevance filtering
const filtered = await llmFilter(query, reranked, {
minScore: config.stage3_minScore,
prompt: `Rate relevance of each document to query on scale 0-1.`
});
// Stage 4: Context compression
let context: string;
if (config.useCompression) {
context = await compressContext(query, filtered);
} else {
context = filtered.map(doc => doc.content).join('\n\n');
}
// Final generation
const response = await generateWithCitations(query, context, filtered);
return response;
}
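The `semanticRerank` and `llmFilter` helpers above are left abstract. As an illustration (not the canonical implementation), `llmFilter` can be a cheap LLM-as-judge pass that scores each candidate and keeps those above the threshold; the sketch below assumes the same `openAI` chat client and `Document` shape used elsewhere in this article:
async function llmFilter(
  query: string,
  documents: Document[],
  options: { minScore: number; prompt?: string }
): Promise<Document[]> {
  // Score each candidate 0-1 with a cheap model, then keep those above minScore
  const scored = await Promise.all(
    documents.map(async (doc) => {
      const response = await openAI.chat.completions.create({
        model: 'gpt-4o-mini',
        temperature: 0,
        messages: [
          { role: 'system', content: options.prompt ?? 'Rate how relevant the document is to the query on a scale from 0 to 1. Reply with a single number.' },
          { role: 'user', content: `Query: ${query}\n\nDocument:\n${doc.content}` }
        ]
      });
      const score = parseFloat(response.choices[0].message.content ?? '0');
      return { doc, score: Number.isNaN(score) ? 0 : score };
    })
  );
  return scored
    .filter(s => s.score >= options.minScore)
    .map(s => s.doc);
}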
Context Compression
Reduces token usage while preserving relevant information:
async function compressContext(
query: string,
documents: Document[]
): Promise<string> {
const compressionPrompt = `
Given the query: "${query}"
For each document below, extract ONLY the sentences directly relevant to answering the query.
Preserve exact wording. Omit irrelevant details.
Documents:
${documents.map((doc, i) => `[${i+1}] ${doc.content}`).join('\n\n')}
Compressed context:
`;
const compressed = await openAI.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: 'You extract relevant information concisely.' },
{ role: 'user', content: compressionPrompt }
],
temperature: 0
});
return compressed.choices[0].message.content;
}
Citation Generation
interface CitedResponse {
answer: string;
citations: Citation[];
}
interface Citation {
documentId: string;
chunkId: string;
text: string;
relevanceScore: number;
}
async function generateWithCitations(
query: string,
context: string,
sourceDocuments: Document[]
): Promise<CitedResponse> {
const prompt = `
Context with source markers:
${sourceDocuments.map((doc, i) => `[Source ${i+1}]: ${doc.content}`).join('\n\n')}
Question: ${query}
Instructions:
1. Answer the question using ONLY information from the context above
2. Cite sources using [Source N] format after each claim
3. If information is not in context, say "I don't have enough information"
Answer:
`;
const response = await openAI.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: 'You are a helpful assistant that answers questions with citations.' },
{ role: 'user', content: prompt }
],
temperature: 0.3
});
const answer = response.choices[0].message.content;
// Extract citations from response
const citationPattern = /\[Source (\d+)\]/g;
const citationMatches = [...answer.matchAll(citationPattern)];
const citations: Citation[] = citationMatches.map(match => {
const sourceIndex = parseInt(match[1]) - 1;
const doc = sourceDocuments[sourceIndex];
return {
documentId: doc.documentId,
chunkId: doc.id,
text: doc.content.substring(0, 200) + '...',
relevanceScore: doc.score
};
});
return { answer, citations };
}
Architectural Pattern 4: Agentic RAG
When to use: Complex multi-step queries, dynamic tool selection needed, research/analysis workflows, need to combine multiple data sources.
Expected performance: 5-15s latency (multiple LLM calls), 96-99% accuracy for complex queries, ~$1000-5000/month at 20K queries.
Architecture Overview
AI agent decides retrieval strategy dynamically:
interface AgenticRAGConfig {
tools: Tool[];
maxIterations: number;
reasoningModel: string;
}
interface Tool {
name: string;
description: string;
execute: (params: any) => Promise<any>;
}
async function agenticRAG(
query: string,
config: AgenticRAGConfig
): Promise<string> {
const tools: Tool[] = [
{
name: 'vector_search',
description: 'Semantic search for conceptually similar documents',
execute: async ({ query, topK }) => vectorSearch(query, topK)
},
{
name: 'keyword_search',
description: 'Exact keyword matching for codes, names, IDs',
execute: async ({ query, topK }) => keywordSearch(query, topK)
},
{
name: 'filter_by_metadata',
description: 'Filter documents by category, date range, author',
execute: async ({ filter }) => metadataFilter(filter)
},
{
name: 'summarize_documents',
description: 'Summarize long documents before answering',
execute: async ({ documentIds }) => summarizeDocs(documentIds)
}
];
let iteration = 0;
let finalAnswer = '';
while (iteration < config.maxIterations && !finalAnswer) {
// Agent decides next action
const action = await decideNextAction(query, tools, iteration);
if (action.type === 'use_tool') {
const tool = tools.find(t => t.name === action.toolName);
const result = await tool.execute(action.parameters);
// Agent evaluates if it has enough information
const evaluation = await evaluateInformation(query, result);
if (evaluation.sufficient) {
finalAnswer = await generateFinalAnswer(query, result);
}
} else if (action.type === 'answer') {
finalAnswer = action.answer;
}
iteration++;
}
return finalAnswer;
}
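The `decideNextAction`, `evaluateInformation`, and `generateFinalAnswer` helpers are assumed above. One way to sketch `decideNextAction` is to let the model pick a tool via tool calling or answer directly; the parameter schema below is deliberately simplified and assumes the same `openAI` chat client used elsewhere:
type AgentAction =
  | { type: 'use_tool'; toolName: string; parameters: any }
  | { type: 'answer'; answer: string };

async function decideNextAction(
  query: string,
  tools: Tool[],
  iteration: number
): Promise<AgentAction> {
  const response = await openAI.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      { role: 'system', content: 'You answer questions by calling retrieval tools. Answer directly only when you already have enough information.' },
      { role: 'user', content: `Question: ${query}\nIteration: ${iteration}` }
    ],
    // Expose each retrieval tool to the model with a simplified parameter schema
    tools: tools.map(t => ({
      type: 'function' as const,
      function: {
        name: t.name,
        description: t.description,
        parameters: {
          type: 'object',
          properties: { query: { type: 'string' }, topK: { type: 'number' } }
        }
      }
    }))
  });

  const message = response.choices[0].message;
  const toolCall = message.tool_calls?.[0];
  if (toolCall) {
    return {
      type: 'use_tool',
      toolName: toolCall.function.name,
      parameters: JSON.parse(toolCall.function.arguments || '{}')
    };
  }
  return { type: 'answer', answer: message.content ?? '' };
}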
Azure Implementation with Semantic Kernel
import { Kernel, KernelArguments } from '@microsoft/semantic-kernel';
import { AzureOpenAIChatCompletion } from '@microsoft/semantic-kernel';
// Initialize kernel
const kernel = new Kernel();
kernel.addService(
'chat',
new AzureOpenAIChatCompletion({
deploymentName: 'gpt-4o',
endpoint: process.env.AZURE_OPENAI_ENDPOINT,
apiKey: process.env.AZURE_OPENAI_KEY
})
);
// Define retrieval functions as plugins
kernel.importPluginFromObject({
vectorSearch: async (query: string, topK: number = 5) => {
return await performVectorSearch(query, topK);
},
filterByDate: async (startDate: string, endDate: string) => {
return await filterDocumentsByDateRange(startDate, endDate);
}
}, 'RAGPlugin');
// Agent reasoning loop
const planner = kernel.createPlanner('sequential');
const plan = await planner.createPlan(
`Answer the following question using available tools: ${userQuery}`
);
const result = await plan.invoke(kernel, new KernelArguments());
Production Considerations
1. Chunk Size Optimization
// Experiment with different strategies
const chunkingStrategies = {
fixed: { size: 512, overlap: 50 },
semantic: {
// Split on sentence boundaries
preserveSentences: true,
maxTokens: 512,
minTokens: 128
},
sliding_window: {
windowSize: 256,
stride: 128 // 50% overlap
},
hierarchical: {
// Parent chunks (1024 tokens) for retrieval
// Child chunks (256 tokens) for context
parentSize: 1024,
childSize: 256
}
};
Recommendation: Start with semantic chunking at 512 tokens with 50-token overlap. Optimize based on your domain.
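The `chunkDocument` helper used in the ingestion pipeline was left abstract; a minimal sentence-preserving chunker along these lines is enough to start (word count stands in for token count here, an approximation you would replace with a real tokenizer such as tiktoken):
interface Chunk { text: string; }

interface ChunkOptions {
  maxTokens: number;
  overlap: number;
  preserveSentences: boolean;
}

// Sentence-preserving chunker sketch: accumulate sentences until the budget is hit,
// then start the next chunk with roughly `overlap` tokens of trailing context.
function chunkDocument(content: string, options: ChunkOptions): Chunk[] {
  const sentences = content.split(/(?<=[.!?])\s+/);
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    const sentenceTokens = sentence.split(/\s+/).length;
    if (currentTokens + sentenceTokens > options.maxTokens && current.length > 0) {
      chunks.push({ text: current.join(' ') });
      const overlapWords = current.join(' ').split(/\s+/).slice(-options.overlap);
      current = [overlapWords.join(' ')];
      currentTokens = overlapWords.length;
    }
    current.push(sentence);
    currentTokens += sentenceTokens;
  }
  if (current.length > 0) {
    chunks.push({ text: current.join(' ') });
  }
  return chunks;
}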
2. Embedding Model Selection
| Model | Dimensions | Performance | Cost | Best For |
|---|---|---|---|---|
| `text-embedding-3-small` | 1536 | Good | Low | High-volume, cost-sensitive |
| `text-embedding-3-large` | 3072 | Excellent | Medium | Production, accuracy-critical |
| `text-embedding-ada-002` | 1536 | Good | Low | Legacy compatibility |
Recommendation: Use text-embedding-3-large for production. The improved accuracy justifies the cost.
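The text-embedding-3 models also accept a `dimensions` parameter, so you can trade a little accuracy for a smaller, cheaper index without switching models. A brief example (the exact quality impact depends on your corpus):
// Request reduced dimensions from text-embedding-3-large to shrink the index.
// The index field's `dimensions` setting must match the value chosen here.
const reduced = await openAI.embeddings.create({
  model: 'text-embedding-3-large',
  input: 'How do I reset my password?',
  dimensions: 1024 // vs. the full 3072 used earlier
});
console.log(reduced.data[0].embedding.length); // 1024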
3. Caching Strategy
import { createClient } from 'redis';
const redis = createClient({ url: process.env.REDIS_URL });
async function cachedRAG(query: string): Promise<string> {
// Check cache
const cacheKey = `rag:${hashQuery(query)}`;
const cached = await redis.get(cacheKey);
if (cached) {
return JSON.parse(cached);
}
// Perform RAG
const result = await performRAG(query);
// Cache for 1 hour
await redis.setEx(cacheKey, 3600, JSON.stringify(result));
return result;
}
Cache Strategies:
- Query-level: Cache full responses (high hit rate for common questions)
- Retrieval-level: Cache search results (reuse across similar queries)
- Embedding-level: Cache embeddings (avoid recomputation)
4. Monitoring & Observability
import * as applicationInsights from 'applicationinsights';

applicationInsights
  .setup(process.env.APPINSIGHTS_CONNECTION_STRING)
  .start();

// TelemetryClient exposes the trackEvent/trackMetric/trackException calls used below
const appInsights = applicationInsights.defaultClient;
async function instrumentedRAG(query: string): Promise<string> {
const startTime = Date.now();
try {
// Track custom event
appInsights.trackEvent({
name: 'RAG_Query',
properties: {
query: sanitize(query),
timestamp: new Date().toISOString()
}
});
// Perform retrieval
const retrievalStart = Date.now();
const documents = await retrieveDocuments(query);
const retrievalTime = Date.now() - retrievalStart;
appInsights.trackMetric({
name: 'RetrievalLatency',
value: retrievalTime
});
// Perform generation
const generationStart = Date.now();
const response = await generateResponse(query, documents);
const generationTime = Date.now() - generationStart;
appInsights.trackMetric({
name: 'GenerationLatency',
value: generationTime
});
// Track success
appInsights.trackMetric({
name: 'TotalLatency',
value: Date.now() - startTime
});
return response;
} catch (error) {
appInsights.trackException({ exception: error });
throw error;
}
}
Key Metrics to Track:
- Retrieval latency (p50, p95, p99; a small percentile helper follows this list)
- Generation latency
- Retrieval accuracy (requires human evaluation dataset)
- Cache hit rate
- Token usage (cost monitoring)
- Error rates by type
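If your metrics backend does not compute the percentiles above for you, a small helper over recorded latency samples is enough to get started (a minimal sketch):
// Compute a latency percentile from raw samples collected by the instrumentation above
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, index)];
}

const retrievalLatencies = [120, 145, 180, 390, 150]; // ms, collected per query
console.log('p50:', percentile(retrievalLatencies, 50), 'ms'); // 150
console.log('p95:', percentile(retrievalLatencies, 95), 'ms'); // 390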
5. Cost Optimization
interface CostOptimizationConfig {
cacheEnabled: boolean;
compressionEnabled: boolean;
tierByComplexity: boolean;
}
async function costOptimizedRAG(
query: string,
config: CostOptimizationConfig
): Promise<string> {
// Use cache if enabled
if (config.cacheEnabled) {
const cached = await getFromCache(query);
if (cached) return cached;
}
  // Retrieve documents
  const documents = await hybridSearch(query, { top: 10 });

  // Build the context, compressing it first if enabled (compressContext returns a string)
  let context = documents.map(d => d.content).join('\n\n');
  if (config.compressionEnabled) {
    context = await compressContext(query, documents);
  }
// Route to appropriate model based on complexity
let model = 'gpt-4o';
if (config.tierByComplexity) {
const complexity = await assessQueryComplexity(query);
model = complexity < 0.5 ? 'gpt-4o-mini' : 'gpt-4o';
}
  const response = await generate(query, context, { model });
return response;
}
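The `assessQueryComplexity` helper above is assumed; one lightweight option is a cheap classifier call with a heuristic fallback, sketched here:
// Score query complexity 0-1 with a cheap model so simple questions route to gpt-4o-mini
async function assessQueryComplexity(query: string): Promise<number> {
  const response = await openAI.chat.completions.create({
    model: 'gpt-4o-mini',
    temperature: 0,
    messages: [
      { role: 'system', content: 'Rate how complex this question is to answer, from 0 (simple factual lookup) to 1 (multi-step reasoning). Reply with a single number.' },
      { role: 'user', content: query }
    ]
  });
  const score = parseFloat(response.choices[0].message.content ?? '');
  if (!Number.isNaN(score)) return Math.min(1, Math.max(0, score));
  // Fallback heuristic: longer, multi-clause questions tend to be more complex
  return Math.min(1, query.split(/\s+/).length / 50);
}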
Cost Reduction Strategies:
- ✅ Use `gpt-4o-mini` for simple queries (85% cheaper)
- ✅ Enable prompt caching (50% savings on repeated context)
- ✅ Compress context before generation (30-50% token savings)
- ✅ Batch embeddings API calls (up to 16 inputs per request; see the batching sketch after this list)
- ✅ Use Azure Reserved Capacity for predictable workloads (savings up to 50%)
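A minimal batching helper for the embeddings call might look like this (the 16-input batch size follows the guidance above; adjust it to your deployment's documented limit):
// Batch chunk texts into groups per embeddings call to cut request overhead
async function embedInBatches(texts: string[], batchSize = 16): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const response = await openAI.embeddings.create({
      model: 'text-embedding-3-large',
      input: batch
    });
    // The response preserves input order within each batch
    vectors.push(...response.data.map(d => d.embedding));
  }
  return vectors;
}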
6. Security & Compliance
Data Privacy & Encryption
// Azure AI Search with encryption and private endpoints
resource searchService 'Microsoft.Search/searchServices@2024-03-01-preview' = {
name: 'secure-rag-search'
location: location
sku: { name: 'standard' }
properties: {
replicaCount: 2
partitionCount: 1
publicNetworkAccess: 'Disabled' // Force private endpoint
encryptionWithCmk: {
enforcement: 'Enabled'
encryptionComplianceStatus: 'Compliant'
}
}
}
// Private endpoint for search
resource searchPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-04-01' = {
name: 'search-pe'
location: location
properties: {
subnet: { id: subnetId }
privateLinkServiceConnections: [{
name: 'search-connection'
properties: {
privateLinkServiceId: searchService.id
groupIds: ['searchService']
}
}]
}
}
// Azure OpenAI with managed identity
resource openAI 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
name: 'secure-rag-openai'
location: location
kind: 'OpenAI'
identity: {
type: 'SystemAssigned'
}
properties: {
publicNetworkAccess: 'Disabled'
networkAcls: {
defaultAction: 'Deny'
virtualNetworkRules: [{
id: subnetId
ignoreMissingVnetServiceEndpoint: false
}]
}
customSubDomainName: 'secure-rag-openai'
}
}
PII Detection & Redaction
import { TextAnalyticsClient } from '@azure/ai-text-analytics';
async function detectAndRedactPII(
text: string
): Promise<{ redacted: string; entities: PIIEntity[] }> {
const client = new TextAnalyticsClient(endpoint, credential);
  const results = await client.recognizePiiEntities([text]);
const piiEntities = results[0].entities;
// Redact PII
let redacted = text;
for (const entity of piiEntities.sort((a, b) => b.offset - a.offset)) {
const before = redacted.substring(0, entity.offset);
const after = redacted.substring(entity.offset + entity.length);
redacted = `${before}[REDACTED:${entity.category}]${after}`;
}
return { redacted, entities: piiEntities };
}
async function secureRAG(query: string): Promise<RAGResponse> {
// 1. Detect PII in query
const { redacted: safeQuery, entities: queryPII } = await detectAndRedactPII(query);
// 2. Log PII detection event (for compliance audit)
await auditLog({
timestamp: new Date().toISOString(),
action: 'pii_detection',
userId: currentUser.id,
piiDetected: queryPII.length > 0,
categories: queryPII.map(e => e.category)
});
// 3. Perform RAG with redacted query
const response = await performRAG(safeQuery);
return response;
}
Compliance & Audit Logging
interface AuditLog {
timestamp: string;
userId: string;
action: 'query' | 'retrieval' | 'generation' | 'pii_detection';
queryHash: string; // SHA-256 of query (never store raw)
documentsRetrieved: number;
tokensUsed: number;
responseTime: number;
piiDetected: boolean;
complianceFlags: string[];
}
async function logRAGActivity(log: AuditLog): Promise<void> {
// Store in Azure Monitor Logs for HIPAA/SOC2/GDPR compliance
await appInsights.trackEvent({
name: 'RAG_Activity',
properties: log,
measurements: {
latency: log.responseTime,
tokens: log.tokensUsed
}
});
// For regulations requiring long-term retention
await cosmosClient
.database('compliance')
.container('audit_logs')
.items.create(log);
}
Data Residency & Sovereignty
Key considerations for enterprise deployments:
✅ Azure region selection: Deploy Azure OpenAI and AI Search in same region as data (EU: West Europe/North Europe, US: East US/West US)
✅ Customer-managed keys (CMK): Use Azure Key Vault for encryption keys (required for HIPAA, GDPR)
✅ Private endpoints: Disable public internet access, use VNet integration
✅ Data retention policies: Configure TTL on indexed documents per compliance requirements
✅ Access controls: Use Azure RBAC + Managed Identity, never API keys in production
// Example: Managed Identity authentication (no keys)
import { DefaultAzureCredential } from '@azure/identity';
const credential = new DefaultAzureCredential();
const searchClient = new SearchClient(
endpoint,
indexName,
credential // Uses managed identity, not API key
);
const openAIClient = new OpenAIClient(
endpoint,
credential // Same for OpenAI
);
Testing & Evaluation
Retrieval Quality Metrics
interface EvaluationDataset {
queries: EvaluationQuery[];
}
interface EvaluationQuery {
query: string;
relevantDocIds: string[]; // Ground truth
}
async function evaluateRetrieval(
dataset: EvaluationDataset
): Promise<RetrievalMetrics> {
let totalPrecisionAtK = 0;
let totalRecallAtK = 0;
let totalMRR = 0;
for (const item of dataset.queries) {
const results = await hybridSearch(item.query, { top: 10 });
const retrievedIds = results.map(r => r.documentId);
// Precision@K
const relevantRetrieved = retrievedIds.filter(id =>
item.relevantDocIds.includes(id)
);
const precision = relevantRetrieved.length / retrievedIds.length;
totalPrecisionAtK += precision;
// Recall@K
const recall = relevantRetrieved.length / item.relevantDocIds.length;
totalRecallAtK += recall;
// Mean Reciprocal Rank
const firstRelevantIndex = retrievedIds.findIndex(id =>
item.relevantDocIds.includes(id)
);
const mrr = firstRelevantIndex >= 0 ? 1 / (firstRelevantIndex + 1) : 0;
totalMRR += mrr;
}
return {
precision_at_10: totalPrecisionAtK / dataset.queries.length,
recall_at_10: totalRecallAtK / dataset.queries.length,
mean_reciprocal_rank: totalMRR / dataset.queries.length
};
}
End-to-End Quality Metrics
async function evaluateRAGQuality(
testQueries: TestQuery[]
): Promise<QualityMetrics> {
const results = await Promise.all(
testQueries.map(async (test) => {
const response = await performRAG(test.query);
// LLM-as-judge evaluation
const evaluation = await evaluateResponse({
query: test.query,
response: response,
groundTruth: test.expectedAnswer,
criteria: ['accuracy', 'completeness', 'relevance', 'citation_quality']
});
return evaluation;
})
);
return {
accuracy: average(results.map(r => r.accuracy)),
completeness: average(results.map(r => r.completeness)),
relevance: average(results.map(r => r.relevance)),
citation_quality: average(results.map(r => r.citation_quality))
};
}
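The `evaluateResponse` judge used above is assumed; a minimal LLM-as-judge sketch that scores the listed criteria and returns them as JSON could look like this:
interface JudgeResult {
  accuracy: number;
  completeness: number;
  relevance: number;
  citation_quality: number;
}

// LLM-as-judge sketch: scores are 0-1 per criterion; the reference answer is optional
async function evaluateResponse(input: {
  query: string;
  response: string;
  groundTruth?: string;
  criteria: string[];
}): Promise<JudgeResult> {
  const judgePrompt = `
Query: ${input.query}
Model answer: ${input.response}
${input.groundTruth ? `Reference answer: ${input.groundTruth}` : ''}

Score the model answer from 0 to 1 on each criterion: ${input.criteria.join(', ')}.
Respond with a JSON object whose keys are the criteria.
`;
  const judgment = await openAI.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    response_format: { type: 'json_object' },
    messages: [{ role: 'user', content: judgePrompt }]
  });
  return JSON.parse(judgment.choices[0].message.content ?? '{}') as JudgeResult;
}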
Common Pitfalls & Solutions
Pitfall 1: Hallucination Despite Context
Problem: LLM generates information not present in retrieved documents
Solution: Stricter prompt engineering + verification
const strictPrompt = `
CRITICAL INSTRUCTIONS:
1. Use ONLY information from the provided context
2. If the context doesn't contain the answer, respond: "I don't have enough information to answer this question."
3. Never make assumptions or use external knowledge
4. Cite sources for every claim using [Source N] format
Context:
${context}
Question: ${query}
Answer (following instructions above):
`;
Pitfall 2: Poor Retrieval Quality
Problem: Relevant documents not retrieved in top results
Solutions:
- Improve chunking: Use semantic chunking instead of fixed-size
- Add metadata: Enrich documents with category, date, author for filtering
- Tune hybrid weights: Experiment with vector vs keyword ratios
- Use query expansion: Reformulate query with synonyms/variations
async function queryExpansion(originalQuery: string): Promise<string[]> {
const expansionPrompt = `
Generate 3 alternative phrasings of this question that mean the same thing:
"${originalQuery}"
Alternative phrasings:
`;
const response = await openAI.chat.completions.create({
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: expansionPrompt }],
temperature: 0.7
});
const alternatives = response.choices[0].message.content
.split('\n')
.filter(line => line.trim().length > 0);
return [originalQuery, ...alternatives];
}
Pitfall 3: High Latency
Problem: RAG responses take 5+ seconds
Solutions:
- Parallel retrieval + generation: Don’t wait for all chunks to process
- Streaming responses: Start showing answer before completion
- Precompute embeddings: Index ahead of time, not on-demand
- Optimize chunk count: More isn’t always better (diminishing returns after 5-10 chunks)
async function streamingRAG(query: string): Promise<ReadableStream> {
const documents = await hybridSearch(query, { top: 5 });
const context = documents.map(d => d.content).join('\n\n');
// Stream response token by token
const stream = await openAI.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: `Context: ${context}\n\nQuestion: ${query}` }
],
stream: true
});
return new ReadableStream({
async start(controller) {
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || '';
controller.enqueue(content);
}
controller.close();
}
});
}
Troubleshooting Checklist
When Retrieval Quality is Poor
Symptom: Relevant documents not appearing in top 5 results
✅ Check chunking strategy:
- Try semantic chunking instead of fixed-size
- Reduce chunk size if concepts span boundaries (512→256 tokens)
- Increase overlap (50→100 tokens)
✅ Tune HNSW parameters:
// Increase recall at cost of latency
{
"hnswParameters": {
"m": 8, // Default 4, increase to 8-16 for better recall
"efConstruction": 800, // Default 400, higher = better index quality
"efSearch": 1000 // Default 500, higher = better search recall
}
}
✅ Adjust hybrid weights:
// If exact matches failing: increase keyword weight
const exactMatchConfig = {
vectorWeight: 0.3, // Reduce from 0.5
keywordWeight: 0.7 // Increase from 0.5
};
// If semantic similarity failing: increase vector weight
const semanticConfig = {
vectorWeight: 0.8, // Increase from 0.5
keywordWeight: 0.2 // Reduce from 0.5
};
✅ Verify embeddings:
// Test embedding similarity
const query = "How do I reset my password?";
const doc = "Password reset instructions: ...";
const queryEmb = await embed(query);
const docEmb = await embed(doc);
// Should be >0.7 for relevant doc/query pairs
const similarity = cosineSimilarity(queryEmb, docEmb);
console.log('Similarity:', similarity);
if (similarity < 0.7) {
// Problem: embeddings not capturing semantic relationship
// Fix: Try text-embedding-3-large, check for domain mismatch
}
✅ Add metadata filters:
// Filter by category, date, or tags to narrow search space
const results = await searchClient.search(query, {
  filter: `category eq 'technical_docs' and date ge 2024-01-01T00:00:00Z`,
vectorQueries: [/* ... */],
top: 10
});
When Responses are Hallucinating
Symptom: LLM inventing information not in context
✅ Strengthen prompt instructions:
const strictPrompt = `
CRITICAL RULES (you will be penalized for violations):
1. Use ONLY information from Context below
2. If Context doesn't answer the question, respond EXACTLY: "I don't have enough information to answer this question."
3. Never use external knowledge or make assumptions
4. Cite sources using [Source N] for every claim
5. If uncertain, say "The context suggests..." not "The answer is..."
Context:
${context}
Question: ${query}
Answer (following rules above):
`;
✅ Use lower temperature:
const response = await openAI.chat.completions.create({
model: 'gpt-4o',
temperature: 0, // Use 0 for factual tasks (default 1)
messages: [/* ... */]
});
✅ Add verification step:
// Two-pass approach: generate, then verify
const answer = await generateAnswer(query, context);
const verificationPrompt = `
Context: ${context}
Answer: ${answer}
Is every claim in the Answer supported by the Context?
Respond with JSON: { "verified": true/false, "unsupported_claims": [] }
`;
const verificationResponse = await openAI.chat.completions.create({
  model: 'gpt-4o',
  response_format: { type: 'json_object' },
  messages: [{ role: 'user', content: verificationPrompt }]
});

// Parse the JSON verdict out of the completion before checking it
const verification = JSON.parse(verificationResponse.choices[0].message.content ?? '{}');
if (!verification.verified) {
  return "I couldn't verify all claims against the source documents.";
}
When Latency is Too High
Symptom: Responses taking >5 seconds
✅ Profile the pipeline:
const timings: Record<string, number> = {};
const start = Date.now();
const docs = await retrieve(query);
timings.retrieval = Date.now() - start;
const genStart = Date.now();
const response = await generate(query, docs);
timings.generation = Date.now() - genStart;
console.log('Retrieval:', timings.retrieval, 'ms');
console.log('Generation:', timings.generation, 'ms');
// If retrieval >500ms: check HNSW parameters, reduce top-k
// If generation >3s: reduce context size, use streaming
✅ Optimize retrieval:
- Reduce `top` from 50→20 (less reranking)
- Lower `efSearch` if recall is acceptable
- Use semantic caching for common queries (see the sketch at the end of this checklist)
✅ Optimize generation:
- Use streaming responses (perceived latency)
- Compress context (reduce tokens)
- Switch to `gpt-4o-mini` for simple queries (3x faster)
✅ Implement caching:
// Cache at multiple levels
const cacheKey = `emb:${hashQuery(query)}`;
const cachedEmbedding = await redis.get(cacheKey);

let embedding = cachedEmbedding ? JSON.parse(cachedEmbedding) : null;
if (!embedding) {
  embedding = await embed(query);
  await redis.setEx(cacheKey, 86400, JSON.stringify(embedding));
}
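Semantic caching (flagged in the retrieval checklist above) goes a step further than exact-key caching: it reuses a cached answer when a new query's embedding lands close enough to one already answered. A hedged in-memory sketch, reusing the `embed` and `performRAG` helpers from earlier (a production version would persist entries in Redis or a vector store):
// Semantic cache sketch: reuse an answer when a new query embeds close to a cached one
interface SemanticCacheEntry { embedding: number[]; response: string; }
const semanticCache: SemanticCacheEntry[] = [];

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function semanticCachedRAG(query: string, threshold = 0.95): Promise<string> {
  const queryEmbedding = await embed(query);
  const hit = semanticCache.find(
    entry => cosineSimilarity(entry.embedding, queryEmbedding) >= threshold
  );
  if (hit) return hit.response;

  const response = await performRAG(query);
  semanticCache.push({ embedding: queryEmbedding, response });
  return response;
}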
Future Trends in RAG
1. Multi-Modal RAG
Retrieve and reason over text, images, tables, charts:
interface MultiModalDocument {
text: string;
images: string[]; // URIs
tables: TableData[];
charts: ChartData[];
}
async function multiModalRAG(query: string): Promise<string> {
// Retrieve documents with all modalities
const documents = await hybridSearch(query, { top: 5 });
// Use GPT-4o vision for image understanding
const imageDescriptions = await Promise.all(
documents.flatMap(doc => doc.images).map(async (imageUrl) => {
return await openAI.chat.completions.create({
model: 'gpt-4o',
messages: [{
role: 'user',
content: [
{ type: 'text', text: 'Describe this image in detail.' },
{ type: 'image_url', image_url: { url: imageUrl } }
]
}]
});
})
);
// Combine text + image descriptions for generation
const enrichedContext = documents.map((doc, i) => ({
text: doc.text,
imageContext: imageDescriptions[i]?.choices[0]?.message?.content
}));
return await generate(query, enrichedContext);
}
2. Graph RAG
Combine vector search with knowledge graph traversal:
// Azure Cosmos DB for Apache Gremlin (graph database)
import gremlin from 'gremlin';
async function graphRAG(query: string): Promise<string> {
// 1. Vector search for initial nodes
const initialNodes = await vectorSearch(query, { top: 3 });
// 2. Traverse graph to find related entities
const graphClient = getGremlinClient();
const relatedEntities = await graphClient.submit(
    `g.V(${initialNodes.map(n => `'${n.id}'`).join(',')})
.out('RELATED_TO')
.dedup()
.limit(10)
.valueMap()`
);
// 3. Retrieve full documents for related entities
const documents = await fetchDocuments(relatedEntities);
// 4. Generate response with graph-enriched context
return await generate(query, documents);
}
3. Adaptive RAG
System learns optimal retrieval strategies per query type:
interface AdaptiveRAGModel {
predict(query: string): Promise<RAGStrategy>;
train(query: string, strategy: RAGStrategy, feedback: number): Promise<void>;
}
async function adaptiveRAG(
query: string,
model: AdaptiveRAGModel
): Promise<string> {
// Predict optimal strategy based on query characteristics
const strategy = await model.predict(query);
// Execute predicted strategy
let response: string;
switch (strategy.type) {
case 'hybrid':
response = await hybridRAG(query, strategy.config);
break;
case 'multi_stage':
response = await multiStageRAG(query, strategy.config);
break;
case 'agentic':
response = await agenticRAG(query, strategy.config);
break;
}
// Collect feedback for continuous learning
const feedback = await getUserFeedback(response);
await model.train(query, strategy, feedback);
return response;
}
References & Further Reading
Azure Documentation
- Azure AI Search - Hybrid Search
- Azure OpenAI Service
- Vector Search in Azure AI Search
- Semantic Ranking
- Azure AI Search Security Best Practices
Research Papers
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
- HNSW: Efficient and robust approximate nearest neighbor search (Malkov & Yashunin, 2016)
- Improving Retrieval Performance with Query Expansion
Tools & Libraries
- Semantic Kernel - Microsoft’s AI orchestration framework
- LangChain - RAG framework with Azure integrations
- Shiki - Syntax highlighting (used in this blog)
Conclusion
Building production-ready RAG systems requires more than basic vector search + LLM generation. The patterns we’ve covered - from hybrid search to agentic RAG - represent the current state of the art in enterprise AI applications.
Key architectural decisions:
- Start with Hybrid RAG (vector + keyword + semantic ranking)
- Add multi-stage retrieval when accuracy is critical
- Consider agentic patterns for complex, multi-step queries
- Invest in observability from day one
- Optimize for cost with caching, compression, and model routing
Azure provides a comprehensive platform for RAG:
- Azure AI Search: Best-in-class hybrid search
- Azure OpenAI: GPT-4o and embeddings with enterprise SLAs
- Azure Cosmos DB: Scalable metadata and graph storage
- Azure Monitor: End-to-end observability
Key Takeaways
✅ RAG beats fine-tuning for most enterprise knowledge applications
✅ Hybrid search (vector + keyword + semantic) dramatically improves retrieval quality
✅ Multi-stage retrieval with reranking and compression optimizes accuracy and cost
✅ Citations are non-negotiable for enterprise trust and compliance
✅ Monitor everything: retrieval quality, latency, cost, and user satisfaction
✅ Azure AI Search + Azure OpenAI provide production-ready RAG infrastructure
Next Steps
Ready to implement these patterns? Here’s your roadmap:
- Start Simple: Build basic RAG, measure baseline metrics
- Add Hybrid Search: Implement vector + keyword with Azure AI Search
- Enable Semantic Ranking: Significant quality boost for minimal effort
- Iterate on Chunking: Experiment with strategies for your domain
- Add Observability: Track metrics to guide optimization
- Scale Progressively: Add multi-stage or agentic patterns as needed
Want to dive deeper into specific RAG use cases? Check out:
- Healthcare Document RAG: HIPAA-Compliant Chatbots - Sector-specific implementation
- Legal Document AI: Building RAG for Case Law - Handling complex legal documents
- Human-in-the-Loop RAG: Escalation Patterns - Enterprise support workflows
Need help architecting your RAG system? Get in touch - I specialize in Azure-native AI solutions for enterprises.
This article is part of the RAG & Enterprise Chatbots series. Subscribe below for in-depth technical guides on AI architecture.