Evaluation, Observability, Reliability (Expert)
Section E1: Evaluation (RAG + GenAI)
QE1.1: You need to detect regressions after changing chunking or embedding models. What's the correct evaluation approach?
Answer: Maintain a golden query set and run offline retrieval + answer evaluations (Recall@k, citation validity, pass/fail rubrics).
Clarifications (exam traps):
- "Try it manually" does not scale; you need repeatable evaluation.
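A minimal sketch of such an offline regression check, assuming a hypothetical retrieve(query, k) function and a golden set of queries with known relevant document IDs; the recall threshold is illustrative:

```python
# Offline regression check over a golden query set (illustrative sketch).
# retrieve(query, k) -> list of document IDs is a hypothetical hook into your pipeline.

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of known-relevant documents that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def run_golden_set(golden_set, retrieve, k=5, min_recall=0.8):
    """golden_set: list of {"query": str, "relevant_ids": [str, ...]} items."""
    scores = [recall_at_k(retrieve(item["query"], k), item["relevant_ids"], k)
              for item in golden_set]
    mean_recall = sum(scores) / len(scores)
    # Fail the run (e.g., in CI) when a new chunking or embedding config regresses.
    return {"mean_recall_at_k": mean_recall, "passed": mean_recall >= min_recall}
```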
QE1.2: What are the most important metrics for a RAG system?
Answer: Retrieval quality (e.g., Recall@k), groundedness/citation validity, and task success.
Clarifications (exam traps):
- User satisfaction alone won't tell you if the system is factually grounded.
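One way to operationalize citation validity is to check that every citation in an answer points to a chunk that was actually retrieved; the [doc:ID] citation format and the function below are assumptions, not a standard:

```python
import re

def citation_validity(answer_text, retrieved_chunk_ids):
    """Share of citations in the answer that reference an actually retrieved chunk.
    Assumes citations look like [doc:chunk-42]; adapt the pattern to your format."""
    cited = re.findall(r"\[doc:([\w-]+)\]", answer_text)
    if not cited:
        return 0.0  # an answer with no citations cannot be verified as grounded
    return sum(1 for c in cited if c in set(retrieved_chunk_ids)) / len(cited)
```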
QE1.3: You need to evaluate safety policy adherence (jailbreak attempts). What's the best practice?
Answer: Keep an adversarial test set, run it continuously, and track block/allow outcomes + false positives.
Clarifications (exam traps):
- Safety evaluation must be continuous; content policies and prompts evolve.
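A sketch of such a continuous safety run, assuming hypothetical adversarial and benign prompt sets and a classify(prompt) call into your system that returns "blocked" or "allowed":

```python
def evaluate_safety(adversarial_prompts, benign_prompts, classify):
    """classify(prompt) -> "blocked" | "allowed" is a hypothetical call into your system.
    Run this on every prompt/policy change and track both rates over time."""
    blocked_attacks = sum(1 for p in adversarial_prompts if classify(p) == "blocked")
    false_positives = sum(1 for p in benign_prompts if classify(p) == "blocked")
    return {
        "attack_block_rate": blocked_attacks / len(adversarial_prompts),
        "false_positive_rate": false_positives / len(benign_prompts),
    }
```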
Section E2: Observability
QE2.1: You need to trace a single user request across API gateway → retrieval → model call → tool calls. What's required?
Answer: A single correlation ID propagated end-to-end + structured logs/spans.
Clarifications (exam traps):
- Resource diagnostic logs alone can't connect user intent to tool invocations.
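A stdlib-only sketch of correlation-ID propagation with structured logs; in practice you would more likely use a tracing library such as OpenTelemetry, and the stage names below are placeholders:

```python
import contextvars, json, logging, time, uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
correlation_id = contextvars.ContextVar("correlation_id", default=None)

def log_event(stage, **fields):
    """One structured log line per stage, all carrying the same correlation ID."""
    record = {"ts": time.time(), "correlation_id": correlation_id.get(),
              "stage": stage, **fields}
    logging.getLogger("rag").info(json.dumps(record))

def handle_request(user_query):
    correlation_id.set(str(uuid.uuid4()))            # assigned once at the gateway
    log_event("gateway", query_len=len(user_query))
    log_event("retrieval", chunk_ids=["c1", "c2"])   # placeholder values
    log_event("model_call", prompt_tokens=812)
    log_event("tool_call", tool="search")
```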
QE2.2: What should you log for RAG requests if prompts might contain sensitive data?
Answer: Log minimal metadata (token counts, timings, chunk IDs, doc IDs) and redact sensitive text.
Clarifications (exam traps):
- "Log everything" is a privacy failure.
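A sketch of a privacy-aware log record with illustrative field names; hashing the prompt lets you correlate repeated queries without storing the raw text:

```python
import hashlib, json, time

def build_log_record(prompt, chunk_ids, prompt_tokens, completion_tokens, latency_ms):
    """Metadata-only log record: references and counts, never the raw prompt text."""
    return json.dumps({
        "ts": time.time(),
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),  # correlate repeats without storing text
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "chunk_ids": chunk_ids,  # IDs only, not chunk content
    })
```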
QE2.3: You need alerts that catch quality drops before users complain. What signals are best?
Answer: Rising refusal rates, falling citation validity, a rising retrieval no-hit rate, and elevated 429/5xx error rates.
Clarifications (exam traps):
- Latency-only alerts miss "silent" quality failures.
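An illustrative way to compute these signals over a rolling window of recent requests; all thresholds below are placeholders you would tune against your own baselines:

```python
from collections import deque

class QualitySignals:
    """Rolling window of recent request outcomes, turned into alertable signals."""
    def __init__(self, window=500):
        self.events = deque(maxlen=window)

    def record(self, refused, citations_valid, retrieval_hit, status_code):
        self.events.append((refused, citations_valid, retrieval_hit, status_code))

    def alerts(self):
        n = len(self.events) or 1
        refusal_rate = sum(e[0] for e in self.events) / n
        citation_validity = sum(e[1] for e in self.events) / n
        no_hit_rate = sum(not e[2] for e in self.events) / n
        throttle_error_rate = sum(e[3] == 429 or e[3] >= 500 for e in self.events) / n
        return {
            "refusal_rate_high": refusal_rate > 0.15,
            "citation_validity_low": citation_validity < 0.85,
            "no_hit_rate_high": no_hit_rate > 0.10,
            "throttle_or_error_high": throttle_error_rate > 0.05,
        }
```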
Section E3: Reliability Patterns
QE3.1: Your system retries 429s aggressively and causes cascading failures. What should you add?
Answer: Respect Retry-After, use exponential backoff + jitter, and add a circuit breaker.
Clarifications (exam traps):
- Retrying harder can amplify outages.
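A sketch of that policy, assuming a hypothetical call() that returns a status code, an optional Retry-After value in seconds, and a payload:

```python
import random, time

class CircuitBreaker:
    """Fail fast when the dependency keeps failing, instead of hammering it."""
    def __init__(self, failure_threshold=5, cooldown_s=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown_s
        self.opened_at = None

    def allow(self):
        # While open and still cooling down, reject calls immediately.
        return not (self.opened_at and time.time() - self.opened_at < self.cooldown)

    def record(self, success):
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()  # trip the breaker

def call_with_retry(call, breaker, max_attempts=5, base_delay=1.0):
    """call() is assumed to return (status_code, retry_after_seconds_or_None, payload)."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open; failing fast")
        status, retry_after, payload = call()
        if status < 400:
            breaker.record(True)
            return payload
        breaker.record(False)
        if status != 429 and status < 500:
            raise RuntimeError(f"non-retryable error: {status}")
        # Honor the server's Retry-After hint; otherwise exponential backoff with jitter.
        delay = retry_after if retry_after is not None else base_delay * (2 ** attempt) + random.uniform(0, 1)
        time.sleep(delay)
    raise RuntimeError("retries exhausted")
```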
QE3.2: You need graceful degradation when Azure OpenAI is unavailable. What's a practical fallback strategy?
Answer: Reduce features (no tools), switch to a smaller model, queue async processing, or return a partial answer with clear status.
Clarifications (exam traps):
- Fallbacks should be explicit and measurable.
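A sketch of explicit fallback tiers, assuming hypothetical primary_answer, small_model_answer, and enqueue_for_later callables; the point is that every tier returns a clear, measurable status:

```python
def answer_with_fallback(query, primary_answer, small_model_answer, enqueue_for_later):
    """Degrade in explicit, observable tiers instead of failing hard."""
    try:
        return {"status": "full", "answer": primary_answer(query, tools_enabled=True)}
    except Exception:
        pass  # primary model unavailable; try the reduced-feature tier
    try:
        # Smaller model, no tool calls: cheaper and more likely to be available.
        return {"status": "degraded", "answer": small_model_answer(query)}
    except Exception:
        ticket = enqueue_for_later(query)  # async processing as the last resort
        return {"status": "queued", "ticket_id": ticket,
                "answer": "Your request was queued; results will follow."}
```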
QE3.3: You need to avoid throttling when usage spikes. What's the right architecture lever?
Answer: Use queueing/batching where possible, cache results, and implement per-user rate limits.
Clarifications (exam traps):
- "Scale out clients" without backoff increases 429 pressure.
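An illustrative per-user token bucket combined with a small result cache, so repeated queries and bursty users are absorbed before they reach the model endpoint; capacities and the in-memory stores are placeholders:

```python
import time

class TokenBucket:
    """Simple per-user rate limiter: refills steadily, rejects when empty."""
    def __init__(self, capacity=10, refill_per_s=1.0):
        self.capacity, self.refill = capacity, refill_per_s
        self.tokens, self.last = float(capacity), time.time()

    def allow(self):
        now = time.time()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets, cache = {}, {}  # in-memory stand-ins for a shared store

def handle(user_id, query, call_model):
    if query in cache:
        return cache[query]  # serve repeats without a model call
    if not buckets.setdefault(user_id, TokenBucket()).allow():
        return {"status": 429, "detail": "per-user rate limit exceeded"}
    result = call_model(query)
    cache[query] = result
    return result
```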