Expert Cross-Domain (AI-102) - Q&A
This document contains expert-level, scenario-driven questions with direct answers and exam clarifications spanning all AI-102 domains (GenAI, Search/RAG, Vision, Language/Speech, Document Intelligence, security, monitoring, and ops).
📚 Reference Links
- AI-102 Study Guide
- Azure AI Services
- Azure OpenAI
- Azure AI Search
- Azure AI Document Intelligence
- Azure AI Language
- Azure AI Vision
Section 1: Architecture, Security, and Operations (Expert)
Q1.1: You must call Azure AI services from an Azure Function without storing secrets. What’s the best authentication approach?
Answer: Use a managed identity for the Function + Azure AD (Entra ID) authentication to the AI resource (RBAC).
Clarifications (exam traps):
- Keys are easiest but violate “no secrets” and are harder to rotate safely.
- If the service supports Azure AD auth, prefer it with managed identity; otherwise store keys in Key Vault and use managed identity to retrieve them.
Q1.2: Your security team requires that AI service traffic never traverses the public internet. What should you implement?
Answer: Private Endpoint (Azure Private Link) + disable (or restrict) public network access on the AI resource.
Clarifications (exam traps):
- “Selected networks” still uses public endpoints; Private Endpoint is the strict isolation requirement.
- Don’t forget Private DNS configuration so the service hostname resolves to the private IP.
Q1.3: Requests to an Azure AI service intermittently fail with HTTP 429. What’s the correct client-side behavior?
Answer: Implement retry with exponential backoff + jitter, and respect the Retry-After header.
Clarifications (exam traps):
- Do not retry 400-series validation errors (e.g., 400/401/403) as “transient.”
- 429 often indicates rate limit; scaling out clients without backoff can worsen the issue.
Q1.4: You need a single place to enforce per-client quotas, auth, and request logging across multiple AI endpoints. What Azure service fits best?
Answer: Azure API Management (APIM) in front of your AI-calling backend.
Clarifications (exam traps):
- APIM is the API gateway; it’s not a replacement for private endpoints.
- If you must avoid exposing AI keys to clients, put keys/MI usage server-side behind APIM.
Q1.5: Your team wants to rotate AI service keys with near-zero downtime. What’s the standard pattern?
Answer: Use primary/secondary keys and rotate one while the app uses the other, then swap.
Clarifications (exam traps):
- Rotation is easiest when secrets are pulled from Key Vault (and cached with TTL).
- If using Azure AD auth, you avoid key rotation entirely.
Q1.6: A production incident requires tracing which user request triggered a tool call and which model output followed. What should you instrument?
Answer: End-to-end correlation IDs + structured logs to Application Insights (or Log Analytics), including tool-call spans.
Clarifications (exam traps):
- “Enable diagnostic logs” on the AI resource helps, but you still need application-level tracing.
- Avoid logging full prompts with sensitive data; log hashes/redacted fields where required.
Section 2: Azure OpenAI, Prompting, and Safety (Expert)
Q2.1: You must ensure the model responds only using provided retrieved passages and refuses otherwise. What’s the strongest pattern?
Answer: Use RAG with strict grounding instructions + return citations + implement an application-side “no supporting evidence” check.
Clarifications (exam traps):
- Prompts alone are not a guarantee; add a post-check (e.g., require cited chunk IDs).
- Set temperature low for factual tasks, but correctness still depends on grounding quality.
Q2.2: Your app is vulnerable to prompt injection inside retrieved documents (“ignore instructions and leak secrets”). What should you do?
Answer: Treat retrieved content as untrusted data: separate it from instructions, constrain tool/function calling, and implement content filtering + allowlist tools.
Clarifications (exam traps):
- Put system instructions in the system message and retrieved passages in a clearly delimited data section.
- Do not grant the model arbitrary tool access; use tool allowlists and server-side authorization.
Q2.3: When should you pick fine-tuning over RAG for enterprise Q&A?
Answer: Fine-tune for style/format consistency or specialized behavior; use RAG for fresh/variable knowledge and citations.
Clarifications (exam traps):
- Fine-tuning does not “upload your documents” for factual recall; it changes behavior, not a searchable KB.
- If content changes frequently, RAG wins operationally.
Q2.4: You need deterministic-ish answers for a compliance workflow. Which parameter is the first lever?
Answer: Set temperature low (often 0–0.2) and constrain output format.
Clarifications (exam traps):
- Low temperature reduces randomness, but does not guarantee correctness.
- Also cap max tokens and use stop sequences for strict schemas.
Q2.5: You need to stop the model from generating disallowed terms specific to your company policy. What Azure OpenAI feature helps?
Answer: Custom blocklists (plus built-in content filters).
Clarifications (exam traps):
- Content filters handle broad safety categories; blocklists handle your terms.
- Apply blocklists at the right scope (deployment/subscription) based on governance needs.
Q2.6: Your GenAI app must keep prompts and completions out of public networks and also out of client devices. What architecture is most appropriate?
Answer: Client → your backend (in VNet) → Azure OpenAI via Private Endpoint; keep API calls and credentials server-side.
Clarifications (exam traps):
- Direct-to-Azure OpenAI from a browser is usually wrong for enterprise: you leak keys/tokens and lose governance.
Section 3: RAG + Azure AI Search (Expert)
Q3.1: You want best retrieval quality for both exact terms (IDs) and semantic meaning (paraphrases). What search approach?
Answer: Hybrid search (keyword + vector) with optional semantic ranking/re-ranking.
Clarifications (exam traps):
- Vector-only can miss exact ID matches; keyword-only can miss paraphrases.
- If asked “best overall relevance,” hybrid is the safe exam pick.
Q3.2: Your retrieved chunks are often irrelevant because documents are long and multi-topic. What’s the best first fix?
Answer: Improve chunking strategy (smaller, semantically coherent chunks + overlap) and store good metadata.
Clarifications (exam traps):
- Bigger context windows don’t fix bad retrieval; they amplify noise.
- Add fields like
title,section,page, and use them in filtering.
Q3.3: You need to filter retrieval to “only documents the user is allowed to see.” Where should this be enforced?
Answer: Enforce authorization server-side, and apply security as filters in search queries (plus storage-level access control).
Clarifications (exam traps):
- Do not rely on “the model will not mention restricted docs.”
- Use document-level metadata like
acland filter with it.
Q3.4: You want to store embeddings in Azure AI Search. What do you need to define in the index schema?
Answer: A vector field Collection(Edm.Single) with the correct dimensions for your embedding model.
Clarifications (exam traps):
- If you change embedding models, dimensions may change → re-index.
Q3.5: You built RAG but answers still hallucinate. What’s the highest-leverage operational metric to add?
Answer: Track groundedness: require citations and measure “answer coverage by retrieved chunks.”
Clarifications (exam traps):
- “Lower temperature” can reduce weirdness but doesn’t ensure grounding.
- Add a “refuse when insufficient evidence” policy.
Section 4: Knowledge Mining + Document Intelligence (Expert)
Q4.1: You must extract structured fields from invoices with varying layouts. Which service is designed for this?
Answer: Azure AI Document Intelligence (prebuilt invoice model, or custom if needed).
Clarifications (exam traps):
- Vision OCR extracts text; it doesn’t reliably map fields like
InvoiceTotal. - Start with prebuilt models to reduce time-to-value.
Q4.2: Your pipeline needs: ingest PDFs from Blob, OCR + key phrases + entity extraction, then index into Azure AI Search automatically. What Search feature enables this?
Answer: Indexers + skillsets (cognitive skills) for knowledge enrichment.
Clarifications (exam traps):
- “Indexer” moves content into Search; “skillset” enriches it.
- You can enrich with built-in skills and custom skills if needed.
Q4.3: You need to support multilingual documents and normalize them into a single search language. What’s the right enrichment step?
Answer: Add language detection and translation (Language service skill) in the enrichment pipeline.
Clarifications (exam traps):
- Don’t translate everything blindly; detect language first.
- Store both original and translated fields if required for audit.
Section 5: NLP + Speech (Expert)
Q5.1: You need intent classification + entity extraction for a custom domain with continual updates. Which service fits?
Answer: Conversational Language Understanding (CLU) in Azure AI Language.
Clarifications (exam traps):
- LUIS is legacy; CLU is the modern path.
- For pure sentiment/key phrases/entities (no intents), use Text Analytics features instead.
Q5.2: A call center needs real-time transcription and speaker separation. Which Azure AI capability is relevant?
Answer: Azure AI Speech for Speech-to-Text with speaker diarization (when available for the scenario).
Clarifications (exam traps):
- Translation is a different feature; diarization is about “who spoke when.”
Section 6: Computer Vision (Expert)
Q6.1: You need OCR that works well on multi-page PDFs and returns page/line structure. Which capability?
Answer: Azure AI Vision Read.
Clarifications (exam traps):
- “Analyze Image” tagging is not OCR.
- Document Intelligence can also do OCR, but Read is the direct OCR capability in Vision.
Q6.2: You need to detect a small set of custom product defects from images. What’s the right service?
Answer: Custom Vision (classification or object detection, depending on need).
Clarifications (exam traps):
- If you need bounding boxes, it’s object detection.
- For “is defective or not,” that’s classification.
Section 7: Cost + Performance (Expert)
Q7.1: Your Azure OpenAI bill is dominated by prompt tokens from repeatedly sending long policy text. What’s the best optimization?
Answer: Move static instructions into a compact system prompt template, reduce repeated context, and use RAG for only what’s needed.
Clarifications (exam traps):
- Fine-tuning can reduce prompt size for style/format, but it’s not a drop-in replacement for changing knowledge.
- Caching common responses and retrieved contexts is often the biggest win.
Q7.2: You need to reduce latency in RAG. Which change usually helps most first?
Answer: Reduce retrieval + prompt size: fewer chunks (top-k), better chunking, and avoid sending irrelevant context.
Clarifications (exam traps):
- Bigger models often increase latency; pick the smallest model that meets quality.
Section 8: Exam-Style “Choose the Best Option” (Expert)
Q8.1: You must implement enterprise semantic search over SharePoint docs with enrichment and security trimming. What’s the most Azure-native stack?
Answer: Azure AI Search (indexers/skillsets) + your identity-aware middleware for ACL filtering.
Clarifications (exam traps):
- Azure AI Search can index content, but you must implement/retain document-level ACL metadata and enforce it in queries.
Q8.2: A customer requires that model inputs are not used to train models. What’s the Azure OpenAI position?
Answer: Azure OpenAI is designed for enterprise; customer prompts/completions are not used to train the underlying foundation models.
Clarifications (exam traps):
- The exam typically tests that Azure OpenAI offers enterprise privacy/controls compared to public endpoints.
- Still implement your own data handling policies (logging, retention, PII redaction).
Section 9: Azure OpenAI Deployments, APIs, and Tooling (Expert)
Q9.1: Your code uses model: "gpt-4o" but Azure OpenAI returns “The model does not exist.” What is the most likely mistake?
Answer: You used the model name instead of the deployment name.
Clarifications (exam traps):
- In Azure OpenAI, your application targets a deployment (your chosen name), not the raw model ID.
- This is a common “works on OpenAI, fails on Azure OpenAI” pitfall.
Q9.2: You need to enforce that only your backend can call Azure OpenAI, never browsers/mobile apps. What’s the correct design?
Answer: Keep Azure OpenAI calls server-side only; expose a backend API and apply auth/authorization there.
Clarifications (exam traps):
- Putting keys/tokens in a client app is almost always an exam “wrong answer.”
- Use APIM if you need centralized auth, quotas, and analytics.
Q9.3: You need JSON-only outputs to feed an automated pipeline. What’s the correct approach?
Answer: Constrain output format + implement schema validation and a repair/retry loop.
Clarifications (exam traps):
- Never assume the model will always return valid JSON.
- The correct production answer includes validation, not just “prompt better.”
Q9.4: Your assistant can call tools, but you must ensure it never calls admin tools. What’s the key control?
Answer: A server-side allowlist of tools + authorization checks per call.
Clarifications (exam traps):
- Tool definitions are not authorization; enforce permissions in code.
- Keep tool outputs minimal and scoped to the request.
Q9.5: Users complain the assistant feels slow even though total completion time is acceptable. What feature improves perceived latency?
Answer: Streaming responses (time-to-first-token UX).
Clarifications (exam traps):
- Streaming improves UX, but policy might require buffering for moderation in high-risk contexts.
Q9.6: You’re choosing between text-embedding-3-small and text-embedding-3-large for RAG. What’s the tradeoff?
Answer: -3-large usually improves retrieval quality but increases cost/storage/latency (bigger vectors).
Clarifications (exam traps):
- If your embedding dimensions change, you must re-index your vector store.
Section 10: Azure AI Search Deep Dive (Expert)
Q10.1: You need relevance that understands meaning, but also respects business boosts (e.g., official docs). What combination fits?
Answer: Semantic ranking + scoring profiles.
Clarifications (exam traps):
- Scoring profiles are deterministic boosts; semantic ranker improves meaning-based relevance.
Q10.2: Users search for “SSO” but content says “single sign-on”. What Search feature addresses this best?
Answer: A synonym map for the relevant field(s).
Clarifications (exam traps):
- Synonyms help lexical matching; they don’t replace vector/hybrid retrieval.
Q10.3: Your queries match irrelevant partial words. What configuration is most relevant?
Answer: Use an appropriate analyzer (and optionally separate exact-match fields).
Clarifications (exam traps):
- Field analyzers strongly affect tokenization and matching behavior.
Q10.4: You ingest PDFs from Blob daily. What component performs scheduled ingestion into Search?
Answer: An indexer (scheduled) connected to the Blob data source.
Clarifications (exam traps):
- Indexers ingest; enrichment requires a skillset.
Q10.5: You need OCR for scanned PDFs during indexing. How is this typically implemented?
Answer: Add OCR as part of a skillset in the enrichment pipeline.
Clarifications (exam traps):
- For complex forms/field extraction, Document Intelligence may be a better upstream step than generic OCR.
Q10.6: Vector results contain near-duplicate chunks from the same document. What’s a practical fix?
Answer: Deduplicate by chunk hashing and prefer diversity selection (e.g., MMR-style) or post-filtering by document/section.
Clarifications (exam traps):
- Increasing
topKoften increases duplicates unless you diversify.
Section 11: Document Intelligence Deep Dive (Expert)
Q11.1: You need structured extraction from a new custom form. What’s the recommended first step?
Answer: Try prebuilt models first; if they don’t fit, train a custom model with labeled samples.
Clarifications (exam traps):
- Prebuilt models reduce time-to-value; custom models require representative labeled documents.
Q11.2: Your extraction returns low-confidence values for critical fields. What should production logic do?
Answer: Apply confidence thresholds and route low-confidence cases to human review or fallback logic.
Clarifications (exam traps):
- The “right” answer usually includes a HITL path for business-critical extraction.
Q11.3: You have multiple templates for the same document type. What pattern helps?
Answer: Use a composed model (or multiple models selected by a classifier/rules).
Clarifications (exam traps):
- Composition is often the cleanest approach when templates are materially different.
Section 12: Evaluation, Monitoring, and Responsible AI (Expert)
Q12.1: You want to measure whether answers are supported by sources over time. What do you track?
Answer: Groundedness/citation validity plus retrieval quality metrics (e.g., Recall@k).
Clarifications (exam traps):
- Engagement metrics don’t replace groundedness for RAG correctness.
Q12.2: You need evidence the assistant resists jailbreak/prompt-injection attempts. What’s the best practice?
Answer: Maintain a red-team test set and run it continuously with policy controls (filters, blocklists, tool allowlists).
Clarifications (exam traps):
- One-time testing is not sufficient; make it repeatable and measurable.
Q12.3: You need debugging logs but can’t store PII. What’s the standard compromise?
Answer: Redact/tokenize sensitive fields before logging and enforce strict retention/access controls.
Clarifications (exam traps):
- “Log everything” is almost never correct; design for privacy and compliance.
Section 13: Reliability + Quotas (Expert)
Q13.1: You need resilience for AI calls under transient failures. What’s the correct pattern set?
Answer: Retry (with backoff) + circuit breaker + timeouts + fallback.
Clarifications (exam traps):
- Retries alone can amplify failures; circuit breakers prevent retry storms.
Q13.2: You must handle quota exhaustion without taking down the app. What’s a sensible design?
Answer: Implement quota-aware degradation (queue/batch, reduce features, shift to smaller model) and alert before limits are hit.
Clarifications (exam traps):
- “Request a quota increase” is not an incident response plan.
Section 3: Advanced Azure AI Search and RAG Patterns (Expert)
Q3.1: Your RAG solution needs to retrieve documents across multiple languages with semantic understanding. What's the recommended Azure AI Search configuration?
Answer: Use multilingual semantic search with the language field mapped correctly + multi-language analyzer for each language field + enable semantic ranking.
Clarifications (exam traps):
- Semantic ranking works best when you configure language-specific analyzers (e.g.,
en.microsoft,fr.microsoft) for each field. - For truly multilingual scenarios, consider multiple indexes (one per language) or language detection + field routing.
- Vector search with multilingual embeddings (e.g.,
text-embedding-ada-002supports 100+ languages) can complement keyword search. - Set
queryLanguageparameter in semantic configuration to match the dominant query language.
Best Practices:
- Use language-specific analyzers for better tokenization
- Implement language detection in your application layer
- Consider hybrid search (keyword + semantic + vector) for best results
- Test with representative multilingual queries
Documentation Links:
Q3.2: You need to implement a RAG system that can search across both structured data (SQL database) and unstructured documents. What's the architecture pattern?
Answer: Use Azure AI Search as the unified query layer with:
- Multiple data sources (SQL indexer + Blob/Document indexer)
- Custom skillsets to enrich and normalize data
- Unified index with both structured fields and vector embeddings
- Hybrid retrieval (keyword + semantic + vector) for comprehensive results
Clarifications (exam traps):
- Don't query SQL and documents separately then merge—use AI Search's indexers to bring all data into a single searchable index.
- For SQL data, use the Azure SQL indexer with change tracking or high-water mark for incremental updates.
- Use field mappings and output field mappings in indexers to normalize schema differences.
- Implement custom web API skill if complex SQL transformations are needed.
Architecture Components:
- Data Sources: Azure SQL Database, Blob Storage
- Indexers: SQL indexer, Blob indexer with Document Intelligence
- Skillset: Text extraction, entity recognition, vectorization
- Index: Unified schema with searchable fields and vectors
- RAG Application: Retrieve from index → augment prompt → generate response
Documentation Links:
Q3.3: Your organization requires audit logging of all user queries and model responses for compliance. What's the complete solution?
Answer: Implement multi-layer logging:
- Application Insights with custom telemetry for query/response pairs + user context
- Azure OpenAI diagnostic logs sent to Log Analytics workspace
- Azure AI Search query logs enabled
- Correlation IDs across all layers for end-to-end tracing
- Data retention policies configured per compliance requirements
Clarifications (exam traps):
- Azure OpenAI diagnostic logs alone don't capture your application context (user ID, session ID, business metadata)—you need application-level logging.
- Don't log PII/sensitive data in plain text—use hashing, redaction, or separate secure storage.
- Set up Azure Monitor alerts for anomalous patterns (e.g., high error rates, unusual token usage).
- Consider Azure Data Factory or Event Hubs for streaming logs to long-term storage (e.g., Azure Data Lake).
Log Analytics Queries for Compliance:
// Track all Azure OpenAI requests
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| project TimeGenerated, OperationName, ResultSignature, DurationMs, properties_s
| order by TimeGenerated desc
// Correlate with Application Insights
union ApplicationInsights, AzureDiagnostics
| where CorrelationId == "<correlation-id>"Documentation Links:
Section 4: Document Intelligence and Custom Models (Expert)
Q4.1: You need to extract data from custom invoices with varying layouts. Should you use Document Intelligence prebuilt invoice model or train a custom model?
Answer: Start with prebuilt invoice model and evaluate coverage. If accuracy is insufficient for your specific invoice variations, train a custom extraction model using Document Intelligence Studio.
Clarifications (exam traps):
- Prebuilt invoice model handles common invoice formats well (line items, totals, vendor info) and supports 100+ languages.
- Use custom extraction model when:
- Invoices have unique layouts not covered by prebuilt model
- You need to extract custom fields (e.g., project codes, department IDs)
- Accuracy requirements exceed prebuilt model performance
- For training custom models, you need at least 5 labeled examples (5-15 recommended).
- Composed models let you combine multiple custom models for different invoice types in one endpoint.
Decision Tree:
- Test prebuilt invoice model → adequate? Use it.
- If inadequate → collect 5-15 sample invoices
- Label in Document Intelligence Studio
- Train custom extraction model
- If multiple invoice types → create composed model
Model Comparison:
| Feature | Prebuilt Invoice | Custom Extraction |
|---|---|---|
| Training Required | No | Yes (5+ samples) |
| Layout Flexibility | Common layouts | Any layout |
| Custom Fields | Limited | Unlimited |
| Language Support | 100+ | Training data language |
| Time to Deploy | Immediate | Hours to days |
Documentation Links:
Q4.2: Your document processing pipeline needs to handle both scanned PDFs and native digital PDFs. What considerations affect your Document Intelligence implementation?
Answer: Implement adaptive processing based on document type:
- Use Read OCR for scanned PDFs (image-based)
- Use Layout model for native PDFs (preserve structure and tables)
- Implement automatic detection of document type using
docTypeor analyze pixel patterns - Configure quality assessment to route low-quality scans for manual review
Clarifications (exam traps):
- Native PDFs with embedded text are faster and more accurate than OCR—detect them to optimize processing.
- Read API handles both but optimizes differently based on content detection.
- For hybrid PDFs (mix of text and scanned pages), Document Intelligence processes each page appropriately.
- Table extraction works better on native PDFs than scanned documents with complex layouts.
- Use confidence scores to identify pages needing human review.
Implementation Pattern:
from azure.ai.formrecognizer import DocumentAnalysisClient
# Analyze document
poller = client.begin_analyze_document("prebuilt-layout", document)
result = poller.result()
# Check confidence and handle accordingly
for page in result.pages:
if any(line.confidence < 0.85 for line in page.lines):
# Route to manual review queue
send_to_manual_review(page)
else:
# Process automatically
extract_and_store(page)Quality Optimization Tips:
- Preprocess scanned images (deskew, denoise) before submission
- Use minimum 300 DPI for scanned documents
- Ensure color scans for documents with color-coded information
- Implement retry logic with quality enhancement for low-confidence results
Documentation Links:
Section 5: Multi-Modal AI and Vision (Expert)
Q5.1: You're building a retail analytics solution that needs to count people entering/exiting store zones and detect when shelves are empty. What Azure AI Vision capabilities should you use?
Answer: Use Azure AI Vision Spatial Analysis with:
- PersonCount operation for zone entry/exit counting
- PersonCrossingLine for directional tracking
- Custom Vision object detection model trained on shelf states (full/empty)
- Edge deployment using Azure Stack Edge or IoT Edge for real-time processing
Clarifications (exam traps):
- Spatial Analysis requires specific hardware (NVIDIA GPU) and containerized deployment—it's not a simple API call.
- PersonCount operates on video streams, not individual images—plan for continuous video processing.
- For shelf monitoring, Azure AI Vision's standard analyze image API can detect objects, but a Custom Vision model trained on your specific shelf/product configurations gives better accuracy.
- Consider privacy and compliance—Spatial Analysis can be configured for privacy mode (no facial recognition, only silhouettes).
- Use Azure IoT Hub or Event Hubs for streaming telemetry data to analytics backend.
Architecture:
IP Cameras → Azure Stack Edge (Spatial Analysis container)
→ IoT Hub → Stream Analytics → Power BI Dashboard
Product Images → Custom Vision Model → Alert System (empty shelf detection)Deployment Considerations:
- Hardware: NVIDIA GPU for Spatial Analysis (T4 or better)
- Container orchestration: IoT Edge runtime
- Network: Low latency for real-time processing
- Storage: Local buffer for video retention
- Privacy: Configure zone masking and privacy settings
Documentation Links:
Q5.2: Your application needs to extract text from images that may contain handwritten notes, printed documents, and mixed content. Which Azure AI service and API should you use?
Answer: Use Azure AI Vision Read API (v4.0 or later), which handles:
- Printed text in 100+ languages
- Handwritten text in multiple languages
- Mixed content (printed + handwritten)
- Multi-page documents via PDF support
- Table structures and layout preservation
Clarifications (exam traps):
- Don't confuse Read API (modern, unified OCR) with the deprecated OCR API (limited to single page, no handwriting support).
- Read API is asynchronous—use the operation-location header to poll for results.
- For form-specific extraction (invoices, receipts, IDs), use Document Intelligence instead—it's optimized for structured documents.
- Read API v4.0 includes the Read OCR model which has improved accuracy over v3.x.
- The API returns bounding polygons for each word/line, useful for highlighting or redaction.
API Comparison:
| Feature | Read API | OCR API (Deprecated) | Document Intelligence |
|---|---|---|---|
| Handwriting | ✅ Yes | ❌ No | ✅ Yes |
| Multi-page | ✅ Yes | ❌ No | ✅ Yes |
| Tables | ✅ Yes | ❌ No | ✅ Yes (enhanced) |
| Async | ✅ Yes | ❌ No | ✅ Yes |
| Forms/Fields | ❌ No | ❌ No | ✅ Yes |
| Use Case | General OCR | Legacy only | Structured docs |
Implementation Example:
from azure.ai.vision.imageanalysis import ImageAnalysisClient
client = ImageAnalysisClient(endpoint, credential)
# Start async read operation
poller = client.begin_analyze_image_from_url(
image_url,
features=["READ"]
)
# Poll for completion
result = poller.result()
# Extract text with confidence
for page in result.read_result.pages:
for line in page.lines:
if line.confidence > 0.9:
print(f"Text: {line.text}, Confidence: {line.confidence}")Documentation Links:
Section 6: Azure OpenAI Advanced Patterns (Expert)
Q6.1: Your application uses Azure OpenAI function calling, but users are experiencing slow response times when multiple tools need to be called sequentially. What optimization strategies should you implement?
Answer: Implement parallel function calling (GPT-4 and GPT-3.5-turbo support it) and tool call batching:
- Use the parallel tool calls feature—model returns multiple
tool_callsin a single response - Execute independent function calls concurrently using async programming
- Implement function call caching for repeated calls with same parameters
- Use streaming mode to show incremental progress to users
- Consider tool call delegation to a separate fast-executing service layer
Clarifications (exam traps):
- Parallel function calling only works when calls are independent—if function B depends on function A's output, they must be sequential.
- Not all models support parallel tool calls—check model capabilities documentation.
- Streaming doesn't reduce latency but improves perceived performance by showing progress.
- Implement timeouts for function execution—if a tool call hangs, the entire request is blocked.
- Use Azure Durable Functions or Logic Apps for complex orchestration workflows.
Implementation Pattern:
import asyncio
from openai import AsyncAzureOpenAI
async def execute_tool_calls_parallel(tool_calls):
tasks = [execute_function(call) for call in tool_calls]
results = await asyncio.gather(*tasks)
return results
# Handle response with multiple tool calls
response = await client.chat.completions.create(
model="gpt-4",
messages=messages,
tools=tools,
parallel_tool_calls=True # Enable parallel execution
)
if response.choices[0].message.tool_calls:
# Execute all tool calls in parallel
results = await execute_tool_calls_parallel(
response.choices[0].message.tool_calls
)Performance Optimization:
- Caching: Cache deterministic function results (e.g., database lookups)
- Batching: Combine similar API calls when possible
- Pre-fetching: Anticipate likely tool calls and prefetch data
- Circuit breakers: Fail fast on unavailable services
- Monitoring: Track tool call latency to identify bottlenecks
Documentation Links:
Q6.2: You need to implement content filtering that blocks not just Azure's default categories but also company-specific terms and topics. What's the complete filtering strategy?
Answer: Implement multi-layer content filtering:
- Enable Azure OpenAI Content Filters (hate, sexual, violence, self-harm) at appropriate severity levels
- Create custom blocklists for company-specific restricted terms
- Implement application-layer validation for business logic (e.g., competitive mentions, confidential projects)
- Use prompt shields to detect jailbreak attempts and prompt injection
- Add response validation to catch inadvertent policy violations in model outputs
Clarifications (exam traps):
- Content filters operate on both input (prompts) and output (completions)—configure separately.
- Custom blocklists support exact match and fuzzy match with wildcards.
- Filters can be configured at deployment level or per-request basis (using
content_filter_configparameter). - Annotated filtered mode returns filtered content with annotations rather than blocking—useful for moderation workflows.
- Prompt shields are specifically designed for indirect attacks (e.g., malicious content in retrieved documents).
Configuration Strategy:
# Deployment-level content filter configuration
content_filter_config = {
"hate": {"blocking": True, "severity": "medium"},
"sexual": {"blocking": True, "severity": "medium"},
"violence": {"blocking": True, "severity": "medium"},
"self_harm": {"blocking": True, "severity": "medium"}
}
# Custom blocklist for company-specific terms
blocklist = [
"competitor-name-*", # Wildcard match
"Project Confidential", # Exact match
"internal-code-name"
]
# Application-layer validation
def validate_content(text, user_context):
# Check against custom rules
if user_context.department == "Sales":
# Additional restrictions for sales team
if contains_pricing_info(text):
return False, "Pricing information not allowed"
# Check against blocklist
for blocked_term in blocklist:
if matches_pattern(text, blocked_term):
return False, f"Blocked term detected: {blocked_term}"
return True, NoneBest Practices:
- Layered approach: Don't rely on a single filter type
- User feedback: Collect false positive/negative reports
- Regular updates: Review and update blocklists quarterly
- Monitoring: Track filter activation rates and patterns
- User education: Clear error messages explaining filter triggers
Documentation Links:
Section 7: Cost Optimization and Monitoring (Expert)
Q7.1: Your Azure OpenAI token consumption is exceeding budget. What are the most effective cost optimization strategies without degrading quality?
Answer: Implement comprehensive token optimization:
- Prompt engineering: Remove verbose instructions, use examples efficiently
- Response length limits: Set
max_tokensbased on actual needs - Caching strategies: Cache responses for repeated queries
- Model selection: Use GPT-3.5-turbo for simpler tasks, GPT-4 only when needed
- Request batching: Combine multiple small requests when possible
- Token counting: Monitor and budget token usage per user/session
- Streaming: Stop generation early if sufficient answer is obtained
Clarifications (exam traps):
- Both input and output tokens are billed—long system prompts add up quickly across many requests.
- GPT-4 costs ~20-30x more than GPT-3.5-turbo—use routing logic to select appropriate model.
- Fine-tuning can reduce prompt size by embedding instructions into the model (but has upfront cost).
- Semantic caching can deduplicate similar queries—exact match caching only helps identical queries.
- Provisioned throughput (PTU) may be cheaper for high-volume predictable workloads vs pay-per-token.
Cost Optimization Implementation:
import tiktoken
# Token counting for budgeting
def count_tokens(text, model="gpt-4"):
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
# Smart model routing
def select_model(query, complexity_threshold=100):
complexity_score = assess_query_complexity(query)
if complexity_score < complexity_threshold:
return "gpt-35-turbo", 0.001 # $1 per 1M tokens
else:
return "gpt-4", 0.03 # $30 per 1M tokens
# Response caching with similarity
from sklearn.metrics.pairwise import cosine_similarity
cache = {}
def get_cached_or_call(query_embedding, query_text, similarity_threshold=0.95):
# Check for similar cached queries
for cached_emb, response in cache.items():
similarity = cosine_similarity([query_embedding], [cached_emb])[0][0]
if similarity > similarity_threshold:
return response, True # Cache hit
# No cache hit, make API call
response = call_openai(query_text)
cache[query_embedding] = response
return response, FalseMonitoring and Budgeting:
- Azure Monitor: Set up alerts for token usage thresholds
- Cost Management: Configure budgets and automatic spending limits
- Per-user quotas: Implement application-level rate limiting
- Token analytics: Track token usage by feature, user type, time of day
Documentation Links:
Q7.2: You need to implement comprehensive observability for an Azure AI solution spanning OpenAI, AI Search, Document Intelligence, and custom APIs. What's the monitoring architecture?
Answer: Implement distributed tracing with correlation across all services:
- Application Insights as central telemetry hub
- Correlation IDs propagated through all service calls
- Custom metrics for business KPIs (e.g., answer relevance, user satisfaction)
- Distributed tracing using OpenTelemetry or Application Insights SDK
- Log Analytics workspaces for centralized log aggregation
- Dashboards and alerts in Azure Monitor for proactive issue detection
Clarifications (exam traps):
- Each Azure AI service has diagnostic settings—enable them and route to the same Log Analytics workspace.
- Correlation ID must be application-generated and passed through headers (e.g.,
x-correlation-id)—Azure doesn't automatically correlate cross-service calls. - Don't log sensitive data—use sampling, redaction, or separate audit logs for compliance data.
- Use dependency tracking in Application Insights to visualize call chains across services.
- Implement custom dimensions in telemetry to add business context (user type, feature, scenario).
Distributed Tracing Implementation:
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace
from opentelemetry.trace import SpanKind
import uuid
# Configure OpenTelemetry with Application Insights
configure_azure_monitor(connection_string=app_insights_connection_string)
tracer = trace.get_tracer(__name__)
# Generate correlation ID for request
correlation_id = str(uuid.uuid4())
# Trace entire RAG pipeline
with tracer.start_as_current_span(
"rag_pipeline",
kind=SpanKind.SERVER,
attributes={
"correlation.id": correlation_id,
"user.id": user_id,
"query.type": query_type
}
) as pipeline_span:
# Trace search operation
with tracer.start_as_current_span("search_documents") as search_span:
results = search_client.search(
query,
headers={"x-correlation-id": correlation_id}
)
search_span.set_attribute("search.results.count", len(results))
# Trace OpenAI call
with tracer.start_as_current_span("openai_completion") as openai_span:
response = openai_client.chat.completions.create(
model="gpt-4",
messages=messages,
headers={"x-correlation-id": correlation_id}
)
openai_span.set_attribute("openai.tokens.total", response.usage.total_tokens)
openai_span.set_attribute("openai.tokens.prompt", response.usage.prompt_tokens)
openai_span.set_attribute("openai.tokens.completion", response.usage.completion_tokens)Key Metrics to Monitor:
- Performance: Latency (p50, p95, p99), throughput, error rates
- Cost: Token usage, API calls, storage consumption
- Quality: User satisfaction scores, answer relevance, citation accuracy
- Reliability: Availability, retry rates, throttling events
- Security: Authentication failures, content filter activations, anomalous patterns
Alert Configuration:
// High error rate alert
requests
| where timestamp > ago(5m)
| summarize
total = count(),
errors = countif(success == false)
| extend error_rate = errors * 100.0 / total
| where error_rate > 5 // Alert if error rate > 5%
// High token usage alert
dependencies
| where type == "OpenAI"
| extend tokens = toint(customDimensions.tokens_total)
| summarize total_tokens = sum(tokens) by bin(timestamp, 1h)
| where total_tokens > 1000000 // Alert if > 1M tokens/hourDocumentation Links:
Summary
This expert pack focuses on real architecture choices and common exam traps:
- Private networking vs "selected networks"
- Managed identity + Azure AD auth vs API keys
- Azure OpenAI deployment naming vs model naming
- Hybrid retrieval, chunking, semantic ranking, and groundedness
- Document Intelligence vs OCR vs enrichment pipelines
- CLU vs Text Analytics vs Speech features
- Safety controls (filters + blocklists) plus tool allowlists and prompt-injection defenses
- Multilingual search and RAG patterns
- Structured + unstructured data integration
- Compliance and audit logging across services
- Document Intelligence model selection (prebuilt vs custom)
- Multi-modal vision applications (Spatial Analysis, Custom Vision)
- Read API vs OCR API vs Document Intelligence
- Azure OpenAI function calling optimization
- Multi-layer content filtering strategies
- Cost optimization techniques (model routing, caching, token management)
- Comprehensive observability with distributed tracing and correlation
- Custom skills and knowledge store for AI Search
- Container deployment and offline scenarios
- Semantic ranking and vector search hybrid approaches
- Azure Machine Learning model integration patterns
- Advanced troubleshooting and performance optimization
- Multi-region deployment and disaster recovery
- Batch vs real-time processing patterns
- Advanced prompt engineering and few-shot learning
- Model fine-tuning and customization strategies
- Cross-service orchestration patterns
Section 8: Knowledge Mining Advanced Patterns (Expert)
Q8.1: Your indexing pipeline needs to call a proprietary ML model to extract custom entities that aren't available in built-in skills. What's the implementation approach?
Answer: Implement a custom web API skill that:
- Hosts your ML model as an Azure Function or App Service endpoint
- Implements the custom skill interface (receives/returns JSON in expected format)
- Integrates into your skillset alongside built-in skills
- Returns enriched data that gets mapped to index fields
Clarifications (exam traps):
- Custom skills must implement a specific REST contract—they receive a
valuesarray and return a matchingvaluesarray with enrichments. - The endpoint must respond within 30 seconds (default timeout) or configure longer timeout in skillset.
- Use managed identity for the indexer to authenticate to your custom skill endpoint if it's secured.
- Custom skills are called synchronously during indexing—ensure they're optimized for performance.
- For expensive operations, consider caching results in the custom skill or using incremental enrichment.
Implementation Pattern:
# Azure Function custom skill
import json
import azure.functions as func
def main(req: func.HttpRequest) -> func.HttpResponse:
try:
req_body = req.get_json()
values = req_body.get('values', [])
results = []
for record in values:
# Extract input data
text = record['data'].get('text', '')
# Call your custom ML model
entities = extract_custom_entities(text)
# Format response
results.append({
"recordId": record['recordId'],
"data": {
"customEntities": entities
},
"errors": None,
"warnings": None
})
return func.HttpResponse(
json.dumps({"values": results}),
mimetype="application/json"
)
except Exception as e:
return func.HttpResponse(
json.dumps({"error": str(e)}),
status_code=500
)
def extract_custom_entities(text):
# Your custom ML logic here
# Return list of entities
return [
{"entity": "CustomEntity1", "confidence": 0.95},
{"entity": "CustomEntity2", "confidence": 0.87}
]Skillset Configuration:
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"name": "custom-entity-extraction",
"description": "Custom entity extraction using proprietary ML model",
"uri": "https://my-function-app.azurewebsites.net/api/CustomEntityExtractor",
"httpMethod": "POST",
"timeout": "PT90S",
"batchSize": 10,
"degreeOfParallelism": 5,
"context": "/document",
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "customEntities",
"targetName": "customEntities"
}
],
"httpHeaders": {
"x-custom-header": "value"
}
}Performance Considerations:
- Batch processing: Set
batchSizeto process multiple documents per call - Parallelism: Use
degreeOfParallelismfor concurrent requests - Caching: Implement result caching for repeated content
- Timeout: Adjust based on model inference time
- Retry logic: Custom skills should handle transient failures
Documentation Links:
Q8.2: You need to persist enriched data from your indexing pipeline to enable data science analysis and reporting beyond search. What Azure AI Search feature should you use?
Answer: Configure a knowledge store in your skillset to project enriched data to:
- Azure Table Storage for structured tabular data
- Azure Blob Storage for JSON documents and normalized objects
- Azure Blob Storage (file projection) for binary files like images
Clarifications (exam traps):
- Knowledge store runs as part of indexer execution—enriched data is saved even if index population fails.
- Projections define how enriched data is shaped and stored—you can create multiple projections from the same enrichment tree.
- Object projections to Blob create one JSON file per document; table projections create normalized relational tables.
- Unlike the search index, knowledge store is write-only from the indexer perspective—it's meant for external analytics tools.
- Shaper skill is often needed to restructure enriched data before projection.
- Knowledge store data persists independently—deleting the index doesn't delete knowledge store data.
Knowledge Store Configuration:
{
"name": "my-skillset",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
"name": "shape-for-knowledge-store",
"context": "/document",
"inputs": [
{
"name": "id",
"source": "/document/id"
},
{
"name": "content",
"source": "/document/content"
},
{
"name": "keyPhrases",
"source": "/document/keyPhrases"
},
{
"name": "entities",
"source": "/document/entities"
},
{
"name": "sentiment",
"source": "/document/sentiment"
}
],
"outputs": [
{
"name": "output",
"targetName": "shapedDocument"
}
]
}
],
"knowledgeStore": {
"storageConnectionString": "DefaultEndpointsProtocol=https;AccountName=...",
"projections": [
{
"tables": [
{
"tableName": "Documents",
"generatedKeyName": "DocumentId",
"source": "/document/shapedDocument"
},
{
"tableName": "KeyPhrases",
"generatedKeyName": "KeyPhraseId",
"source": "/document/shapedDocument/keyPhrases/*"
}
],
"objects": [
{
"storageContainer": "enriched-docs",
"source": "/document/shapedDocument"
}
]
}
]
}
}Use Cases:
- Power BI reporting on enriched metadata (entities, sentiment, key phrases)
- Data science analysis of document corpus
- Compliance archiving of enriched content
- Training data for custom ML models
- Downstream ETL pipelines feeding data warehouses
Projection Types:
| Type | Storage | Use Case | Structure |
|---|---|---|---|
| Table | Table Storage | Relational analytics, Power BI | Normalized rows |
| Object | Blob (JSON) | Full document analysis | One JSON per doc |
| File | Blob (Binary) | Image extraction, attachments | Binary files |
Documentation Links:
Q8.3: Your AI Search indexer is failing on certain documents but you can't identify why. What debugging approach should you use?
Answer: Use debug sessions in Azure Portal to:
- Test skillset execution on specific documents
- Inspect enrichment tree at each skill step
- Modify skill configurations and re-run
- Identify which skill is causing failures
- Validate field mappings and output
Clarifications (exam traps):
- Debug sessions create a snapshot of indexer state in Blob Storage—you need to configure a storage connection.
- Sessions are temporary (expire after several hours) and consume storage—not meant for production debugging.
- You can edit skills inline during a debug session and immediately see results without re-running the full indexer.
- Debug sessions show the enrichment document structure—essential for understanding the
/documentcontext path. - Use debug sessions to validate custom skills before deploying to production.
- Skillset cache must be enabled for incremental enrichment testing.
Debug Session Workflow:
1. Azure Portal → AI Search → Indexers → Select failing indexer
2. Click "Debug session" → Select storage account
3. Choose a problematic document from the index
4. View enrichment tree visualization:
- /document/content (original text)
- /document/normalized_images/*/text (OCR results)
- /document/keyPhrases (extracted phrases)
- /document/entities (recognized entities)
- etc.
5. Edit skill parameters inline:
- Change entity categories
- Adjust confidence thresholds
- Modify language settings
6. Re-execute individual skills
7. Validate output field mappings
8. Save validated changes to skillsetCommon Issues Identified:
- Missing context paths: Skill output doesn't match input expectations
- Null reference errors: Optional fields not handled gracefully
- Language mismatches: Document language doesn't match skill configuration
- Timeout issues: Skills taking too long on large documents
- Field mapping errors: Enriched data not mapped to correct index fields
Alternative Debugging Approaches:
- Indexer execution history: Review error messages and warnings
- Enable indexer logging: Send detailed logs to Application Insights
- Test skills independently: Call REST APIs directly with sample data
- Use postman/curl: Manually invoke custom skills to validate behavior
Best Practices:
- Start with simple documents to validate skillset
- Use debug sessions early in development
- Test each skill type (OCR, entity recognition, etc.) separately
- Validate field mappings before full indexing
- Monitor enrichment costs during debug sessions
Documentation Links:
Section 9: Container Deployment and Edge Scenarios (Expert)
Q9.1: Your solution requires running Azure AI services in an on-premises data center with intermittent internet connectivity. What deployment pattern should you use?
Answer: Deploy containerized AI services with:
- Pull Docker containers from Microsoft Container Registry (MCR)
- Deploy to on-premises Kubernetes or Docker environment
- Configure disconnected mode with commitment-based billing
- Implement periodic connection to Azure for metering and compliance
- Use Azure IoT Edge (optional) for orchestration and updates
Clarifications (exam traps):
- Not all AI services support containers—check service-specific documentation (e.g., Speech, Vision, Language are available; Azure OpenAI is NOT containerized).
- Containers require Azure-connected billing—you need to connect at minimum every 14-30 days for metering.
- Disconnected containers use commitment-based pricing (not pay-per-transaction).
- Container images must be pulled from Microsoft Container Registry—you can't customize or redistribute.
- CPU/GPU requirements vary by service—Speech/Vision may need GPUs for acceptable performance.
- Containers still communicate with Azure for billing telemetry—ensure firewall rules allow this.
Container Deployment Pattern:
# Pull container from MCR
docker pull mcr.microsoft.com/azure-cognitive-services/textanalytics/language:latest
# Run container with billing configuration
docker run --rm -it \
-p 5000:5000 \
--memory 8g \
--cpus 4 \
mcr.microsoft.com/azure-cognitive-services/textanalytics/language:latest \
Eula=accept \
Billing=https://<resource-name>.cognitiveservices.azure.com/ \
ApiKey=<your-api-key>Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: language-service
spec:
replicas: 2
selector:
matchLabels:
app: language-service
template:
metadata:
labels:
app: language-service
spec:
containers:
- name: language
image: mcr.microsoft.com/azure-cognitive-services/textanalytics/language:latest
ports:
- containerPort: 5000
env:
- name: Eula
value: "accept"
- name: Billing
valueFrom:
secretKeyRef:
name: cognitive-services-secret
key: billing-endpoint
- name: ApiKey
valueFrom:
secretKeyRef:
name: cognitive-services-secret
key: api-key
resources:
requests:
memory: "8Gi"
cpu: "4"
limits:
memory: "12Gi"
cpu: "6"Use Cases:
- Data sovereignty: Process sensitive data on-premises
- Low latency: Eliminate round-trip to cloud
- Offline scenarios: Manufacturing floors, remote locations
- Regulatory compliance: Healthcare, financial services
- Edge computing: IoT scenarios with local processing
Supported Services (Containers):
- ✅ Computer Vision (Read OCR, Analyze, Spatial Analysis)
- ✅ Face API
- ✅ Language Service (sentiment, key phrases, NER, language detection)
- ✅ Speech (speech-to-text, text-to-speech, translation)
- ✅ Translator
- ❌ Azure OpenAI (not available as container)
- ❌ Custom Vision (training - only prediction available)
Documentation Links:
Q9.2: You're deploying Speech-to-Text containers on Azure IoT Edge devices in retail stores. How should you handle model updates and configuration management?
Answer: Implement centralized deployment using:
- Azure IoT Hub for device management and module deployment
- Container Registry (ACR) for hosting custom module images
- Module twin properties for runtime configuration
- Deployment manifests defining container versions and settings
- Automatic rollback on deployment failures
Clarifications (exam traps):
- IoT Edge modules are Docker containers—you define them in deployment manifests.
- Module twins enable cloud-to-device configuration without redeploying containers.
- Speech containers need continuous billing connection to Azure—ensure IoT Edge devices have internet (even if intermittent).
- GPU support on IoT Edge requires specific runtime configuration—not automatic.
- Use layered deployments to apply different configurations to device groups (e.g., by store region).
- edgeHub handles offline buffering if devices lose connectivity temporarily.
Deployment Manifest Example:
{
"modulesContent": {
"$edgeAgent": {
"properties.desired": {
"schemaVersion": "1.1",
"runtime": {
"type": "docker",
"settings": {
"minDockerVersion": "v1.25",
"registryCredentials": {
"myregistry": {
"username": "$CONTAINER_REGISTRY_USERNAME",
"password": "$CONTAINER_REGISTRY_PASSWORD",
"address": "myregistry.azurecr.io"
}
}
}
},
"systemModules": {
"edgeAgent": {
"type": "docker",
"settings": {
"image": "mcr.microsoft.com/azureiotedge-agent:1.4",
"createOptions": {}
}
},
"edgeHub": {
"type": "docker",
"status": "running",
"restartPolicy": "always",
"settings": {
"image": "mcr.microsoft.com/azureiotedge-hub:1.4",
"createOptions": {
"HostConfig": {
"PortBindings": {
"443/tcp": [{"HostPort": "443"}],
"5671/tcp": [{"HostPort": "5671"}],
"8883/tcp": [{"HostPort": "8883"}]
}
}
}
}
}
},
"modules": {
"speechToText": {
"version": "1.0",
"type": "docker",
"status": "running",
"restartPolicy": "always",
"settings": {
"image": "mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:latest",
"createOptions": {
"HostConfig": {
"PortBindings": {
"5000/tcp": [{"HostPort": "5000"}]
},
"Memory": 8589934592
},
"Env": [
"Eula=accept",
"Billing=$SPEECH_BILLING_ENDPOINT",
"ApiKey=$SPEECH_API_KEY"
]
}
}
}
}
}
},
"$edgeHub": {
"properties.desired": {
"routes": {
"speechToCloud": "FROM /messages/modules/speechToText/outputs/* INTO $upstream"
},
"storeAndForwardConfiguration": {
"timeToLiveSecs": 7200
}
}
},
"speechToText": {
"properties.desired": {
"modelVersion": "2024-01-15",
"language": "en-US",
"enableLogging": true,
"profanityOption": "Masked"
}
}
}
}Configuration Management:
# Update module twin from cloud
from azure.iot.hub import IoTHubRegistryManager
registry_manager = IoTHubRegistryManager(connection_string)
# Update module twin desired properties
twin_patch = {
"properties": {
"desired": {
"modelVersion": "2024-02-01",
"language": "en-US"
}
}
}
registry_manager.update_module_twin(
device_id="store-001",
module_id="speechToText",
twin_patch=twin_patch,
etag="*"
)Best Practices:
- Version pinning: Use specific image tags, not
latest - Gradual rollout: Deploy to test devices first
- Health monitoring: Implement module health checks
- Offline tolerance: Configure store-and-forward for network outages
- Secure secrets: Use Azure Key Vault integration for credentials
Documentation Links:
Section 10: Hybrid Search and Vector Optimization (Expert)
Q10.1: Your RAG application needs to balance keyword precision with semantic understanding. What's the optimal search strategy in Azure AI Search?
Answer: Implement hybrid search with RRF (Reciprocal Rank Fusion) combining:
- Vector search for semantic similarity using embeddings
- Full-text search for keyword matching and exact phrases
- Semantic ranking as a re-ranker on top of hybrid results
- RRF to merge vector and keyword results with balanced scoring
Clarifications (exam traps):
- Hybrid search ≠ semantic search—hybrid combines vector + keyword; semantic is a re-ranking feature.
- RRF is the default fusion algorithm—it handles score normalization between vector and keyword results.
- Vector search alone can miss exact matches (e.g., product codes, IDs)—always combine with keyword search for production.
- Semantic ranking adds ~50-100ms latency but significantly improves relevance for natural language queries.
- Set appropriate k (number of neighbors) for vector search—too high wastes compute, too low misses results.
- Use HNSW algorithm for vector indexing (not flat/IVF) for best query performance.
Hybrid Search Implementation:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
# Generate query embedding
query_vector = get_embedding(user_query) # Call your embedding model
# Create vectorized query
vector_query = VectorizedQuery(
vector=query_vector,
k_nearest_neighbors=50, # Top 50 vector matches
fields="contentVector" # Vector field name
)
# Execute hybrid search
results = search_client.search(
search_text=user_query, # Keyword search
vector_queries=[vector_query], # Vector search
select=["id", "content", "title"],
query_type="semantic", # Enable semantic ranking
semantic_configuration_name="my-semantic-config",
top=10 # Final results after RRF + semantic ranking
)
for result in results:
print(f"Score: {result['@search.score']}")
print(f"Content: {result['content']}")
print(f"Reranker Score: {result['@search.reranker_score']}")Index Configuration for Hybrid Search:
{
"name": "hybrid-index",
"fields": [
{
"name": "id",
"type": "Edm.String",
"key": true
},
{
"name": "content",
"type": "Edm.String",
"searchable": true,
"analyzer": "en.microsoft"
},
{
"name": "contentVector",
"type": "Collection(Edm.Single)",
"searchable": true,
"dimensions": 1536,
"vectorSearchProfile": "my-vector-profile"
}
],
"vectorSearch": {
"algorithms": [
{
"name": "hnsw-algorithm",
"kind": "hnsw",
"hnswParameters": {
"metric": "cosine",
"m": 4,
"efConstruction": 400,
"efSearch": 500
}
}
],
"profiles": [
{
"name": "my-vector-profile",
"algorithm": "hnsw-algorithm"
}
]
},
"semantic": {
"configurations": [
{
"name": "my-semantic-config",
"prioritizedFields": {
"titleField": {
"fieldName": "title"
},
"prioritizedContentFields": [
{"fieldName": "content"}
],
"prioritizedKeywordsFields": [
{"fieldName": "keyPhrases"}
]
}
}
]
}
}Search Strategy Comparison:
| Strategy | Precision | Recall | Latency | Use Case |
|---|---|---|---|---|
| Keyword only | High for exact | Low | Low (~10ms) | Exact match, IDs |
| Vector only | Medium | High | Medium (~30ms) | Semantic, paraphrase |
| Hybrid (RRF) | High | High | Medium (~40ms) | Balanced, production |
| Hybrid + Semantic | Very High | High | High (~100ms) | Best relevance |
Performance Optimization:
- Cache embeddings: Don't regenerate for repeated queries
- Tune k parameter: Start with 50, adjust based on result quality
- HNSW parameters: Higher
efSearch= better recall but slower - Filter first: Apply filters before vector search when possible
- Batch indexing: Generate embeddings in batches during indexing
Documentation Links:
Q10.2: You're implementing vector search and notice queries are slow despite using HNSW algorithm. What tuning parameters should you adjust?
Answer: Optimize HNSW parameters based on your precision/performance trade-off:
- efSearch (query-time): Lower for speed (100-200), higher for recall (500-1000)
- m (index-time): Connections per node—higher = better recall but larger index
- efConstruction (index-time): Higher = better index quality but slower building
- Dimensions: Reduce embedding dimensions if possible (384 vs 1536)
- Filtering strategy: Apply filters after vector search (post-filtering) for better performance
Clarifications (exam traps):
- efSearch is query-time parameter—you can tune it without reindexing.
- m and efConstruction require re-indexing when changed—test carefully before production.
- Higher efSearch improves recall but increases query latency linearly.
- Filtering before vector search can hurt recall—prefer post-filtering or hybrid approaches.
- Cosine vs Euclidean metric choice depends on your embedding model—use cosine for normalized embeddings.
- Azure AI Search automatically normalizes vectors for cosine similarity.
Parameter Tuning Guide:
{
"vectorSearch": {
"algorithms": [
{
"name": "optimized-hnsw",
"kind": "hnsw",
"hnswParameters": {
"metric": "cosine",
"m": 4, // Default 4, increase to 8-16 for better recall
"efConstruction": 400, // Default 400, increase to 800-1000 for better quality
"efSearch": 500 // Default 500, adjust 100-1000 for speed/recall balance
}
}
]
}
}Performance Testing Matrix:
| efSearch | Query Time | Recall@10 | Recommended For |
|---|---|---|---|
| 100 | 20ms | 85% | Speed-critical apps |
| 200 | 30ms | 90% | Balanced production |
| 500 | 50ms | 95% | High precision needs |
| 1000 | 90ms | 98% | Maximum recall |
Optimization Strategies:
# Strategy 1: Adaptive efSearch based on query type
def adaptive_search(query, query_type):
if query_type == "exact_match":
# Use keyword search only
ef_search = 100
elif query_type == "exploratory":
# Higher recall for broad searches
ef_search = 800
else:
# Balanced for typical queries
ef_search = 500
vector_query = VectorizedQuery(
vector=get_embedding(query),
k_nearest_neighbors=50,
fields="contentVector",
exhaustive=False # Use HNSW, not brute force
)
return search_client.search(
search_text=query,
vector_queries=[vector_query],
top=10
)
# Strategy 2: Dimension reduction
# Use smaller embedding models when possible
# text-embedding-ada-002: 1536 dimensions
# text-embedding-3-small: 512 dimensions (faster, still good quality)
# Strategy 3: Pre-filtering optimization
# Instead of filtering in search query, create separate indexes
# for major categories to reduce search spaceIndex Size vs Performance:
| Vector Dimensions | Index Size (1M docs) | Query Time | Recommendation |
|---|---|---|---|
| 384 | ~1.5 GB | 15-25ms | Best for speed |
| 768 | ~3 GB | 25-40ms | Good balance |
| 1536 | ~6 GB | 40-70ms | Maximum quality |
Monitoring and Tuning:
- Track query latency by percentile (p50, p95, p99)
- Measure recall@k using ground truth dataset
- A/B test parameter changes on production traffic
- Monitor index size growth over time
- Benchmark different embedding models
Documentation Links:
Section 11: Advanced Model Integration and Fine-Tuning (Expert)
Q11.1: Your organization wants to fine-tune GPT-3.5-turbo on proprietary customer service conversations. What's the end-to-end workflow in Azure OpenAI?
Answer: Follow the supervised fine-tuning workflow:
- Prepare training data in JSONL format (prompt-completion pairs)
- Validate data using Azure OpenAI validation tools
- Upload training file to Azure OpenAI
- Create fine-tuning job specifying model and hyperparameters
- Monitor training progress and validation loss
- Deploy fine-tuned model as a new deployment
- Evaluate and iterate on production traffic
Clarifications (exam traps):
- Only specific models support fine-tuning—currently GPT-3.5-turbo, babbage-002, davinci-002 (NOT GPT-4 in most regions).
- Training data needs minimum 10 examples, recommended 50-100 for meaningful improvements.
- JSONL format is strict—each line must be valid JSON with
messagesarray (notprompt/completionfor chat models). - Fine-tuning creates a new model deployment—you're charged for hosting + inference separately.
- Validation split is automatic (10-20%)—or you can provide a separate validation file.
- Fine-tuned models don't support streaming initially—check current limitations.
- Training can take minutes to hours depending on dataset size.
Training Data Format:
{"messages": [{"role": "system", "content": "You are a customer service agent for Contoso Bank."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "To reset your password: 1. Go to contoso.com/login 2. Click 'Forgot Password' 3. Enter your email 4. Follow the link sent to your email."}]}
{"messages": [{"role": "system", "content": "You are a customer service agent for Contoso Bank."}, {"role": "user", "content": "What are your business hours?"}, {"role": "assistant", "content": "Our customer service is available Monday-Friday 8 AM to 6 PM EST, and Saturday 9 AM to 3 PM EST."}]}
{"messages": [{"role": "system", "content": "You are a customer service agent for Contoso Bank."}, {"role": "user", "content": "How do I dispute a charge?"}, {"role": "assistant", "content": "To dispute a charge: 1. Log into your account 2. Navigate to Transactions 3. Select the transaction 4. Click 'Dispute' and provide details. Our fraud team will review within 48 hours."}]}Fine-Tuning Workflow:
from openai import AzureOpenAI
client = AzureOpenAI(
api_key=os.getenv("AZURE_OPENAI_KEY"),
api_version="2024-02-15-preview",
azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)
# 1. Upload training file
training_file = client.files.create(
file=open("training_data.jsonl", "rb"),
purpose="fine-tune"
)
# 2. Create fine-tuning job
fine_tune_job = client.fine_tuning.jobs.create(
training_file=training_file.id,
model="gpt-35-turbo", # Base model
hyperparameters={
"n_epochs": 3, # Number of training epochs
"batch_size": 8,
"learning_rate_multiplier": 0.1
},
suffix="contoso-cs-v1" # Custom model name suffix
)
print(f"Fine-tuning job created: {fine_tune_job.id}")
# 3. Monitor progress
import time
while True:
job_status = client.fine_tuning.jobs.retrieve(fine_tune_job.id)
print(f"Status: {job_status.status}")
if job_status.status in ["succeeded", "failed", "cancelled"]:
break
time.sleep(60)
# 4. Get fine-tuned model ID
fine_tuned_model = job_status.fine_tuned_model
print(f"Fine-tuned model: {fine_tuned_model}")
# 5. Create deployment (via Azure Portal or API)
# Deploy the fine_tuned_model ID as a new deployment
# 6. Use fine-tuned model
response = client.chat.completions.create(
model="contoso-cs-v1-deployment", # Your deployment name
messages=[
{"role": "system", "content": "You are a customer service agent for Contoso Bank."},
{"role": "user", "content": "How do I transfer money internationally?"}
]
)
print(response.choices[0].message.content)Hyperparameter Tuning:
| Parameter | Default | Description | When to Adjust |
|---|---|---|---|
| n_epochs | 3 | Training iterations | Increase if underfitting (to 5-10) |
| batch_size | Auto | Examples per batch | Increase for faster training |
| learning_rate_multiplier | 0.1 | Learning rate scale | Decrease if loss unstable |
Data Quality Guidelines:
- Consistency: Use the same system message across all examples
- Diversity: Cover various user intents and phrasings
- Length: Keep responses concise and similar in style
- Quality over quantity: 50 high-quality examples > 500 poor ones
- Validation: Reserve 10-20% for validation set
Cost Considerations:
- Training: Charged per 1K tokens in training data × epochs
- Hosting: Monthly fee for fine-tuned model deployment
- Inference: Same as base model pricing
Use Cases:
- Domain-specific responses (customer service, legal, medical)
- Tone and style adaptation (formal, casual, technical)
- Structured output (JSON, specific formats)
- Language or terminology (company-specific jargon)
- Reducing prompting overhead (embed instructions in model)
Documentation Links:
Summary (Updated)
This expert pack now covers:
- Private networking vs "selected networks"
- Managed identity + Azure AD auth vs API keys
- Azure OpenAI deployment naming vs model naming
- Hybrid retrieval, chunking, semantic ranking, and groundedness
- Document Intelligence vs OCR vs enrichment pipelines
- CLU vs Text Analytics vs Speech features
- Safety controls (filters + blocklists) plus tool allowlists and prompt-injection defenses
- Multilingual search and RAG patterns
- Structured + unstructured data integration
- Compliance and audit logging across services
- Document Intelligence model selection (prebuilt vs custom)
- Multi-modal vision applications (Spatial Analysis, Custom Vision)
- Read API vs OCR API vs Document Intelligence
- Azure OpenAI function calling optimization
- Multi-layer content filtering strategies
- Cost optimization techniques (model routing, caching, token management)
- Comprehensive observability with distributed tracing and correlation
- Custom skills and knowledge store for advanced knowledge mining
- Debug sessions for skillset troubleshooting
- Container deployment and disconnected scenarios
- IoT Edge integration for edge AI
- Hybrid search with RRF and semantic ranking
- HNSW parameter tuning for vector search optimization
- Fine-tuning workflows for GPT-3.5-turbo customization