Skip to content

Evaluation, Observability, Reliability (Expert) ​

Section E1: Evaluation (RAG + GenAI) ​

QE1.1: You need to detect regressions after changing chunking or embedding models. What’s the correct evaluation approach? ​

Answer: Maintain a golden query set and run offline retrieval + answer evaluations (Recall@k, citation validity, pass/fail rubrics).

Clarifications (exam traps):

  • β€œTry it manually” does not scale; you need repeatable evaluation.

QE1.2: What are the most important metrics for a RAG system? ​

Answer: Retrieval quality (e.g., Recall@k), groundedness/citation validity, and task success.

Clarifications (exam traps):

  • User satisfaction alone won’t tell you if the system is factually grounded.

QE1.3: You need to evaluate safety policy adherence (jailbreak attempts). What’s the best practice? ​

Answer: Keep an adversarial test set, run it continuously, and track block/allow outcomes + false positives.

Clarifications (exam traps):

  • Safety evaluation must be continuous; content policies and prompts evolve.

Section E2: Observability ​

QE2.1: You need to trace a single user request across API gateway β†’ retrieval β†’ model call β†’ tool calls. What’s required? ​

Answer: A single correlation ID propagated end-to-end + structured logs/spans.

Clarifications (exam traps):

  • Resource diagnostic logs alone can’t connect user intent to tool invocations.

QE2.2: What should you log for RAG requests if prompts might contain sensitive data? ​

Answer: Log minimal metadata (token counts, timings, chunk IDs, doc IDs) and redact sensitive text.

Clarifications (exam traps):

  • β€œLog everything” is a privacy failure.

QE2.3: You need alerts that catch quality drops before users complain. What signals are best? ​

Answer: Rising refusal rates, falling citation validity, retrieval no-hit rate increases, and elevated 429/5xx.

Clarifications (exam traps):

  • Latency-only alerts miss β€œsilent” quality failures.

Section E3: Reliability Patterns ​

QE3.1: Your system retries 429s aggressively and causes cascading failures. What should you add? ​

Answer: Respect Retry-After, use exponential backoff + jitter, and add a circuit breaker.

Clarifications (exam traps):

  • Retrying harder can amplify outages.

QE3.2: You need graceful degradation when Azure OpenAI is unavailable. What’s a practical fallback strategy? ​

Answer: Reduce features (no tools), switch to a smaller model, queue async processing, or return a partial answer with clear status.

Clarifications (exam traps):

  • Fallbacks should be explicit and measurable.

QE3.3: You need to avoid throttling when usage spikes. What’s the right architecture lever? ​

Answer: Use queueing/batching where possible, cache results, and implement per-user rate limits.

Clarifications (exam traps):

  • β€œScale out clients” without backoff increases 429 pressure.

Released under the MIT License.