Evaluation, Observability, Reliability (Expert)

Section E1: Evaluation (RAG + GenAI)

QE1.1: You need to detect regressions after changing chunking or embedding models. What’s the correct evaluation approach?

Answer: Maintain a golden query set and run offline retrieval + answer evaluations (Recall@k, citation validity, pass/fail rubrics).

Clarifications (exam traps):

“Try it manually” does not scale; you need repeatable evaluation.

QE1.2: What are the most important metrics for a RAG system?

Answer: Retrieval quality (e.g., Recall@k), groundedness/citation validity, and task success.

Clarifications (exam traps):

User satisfaction alone won’t tell you if the system is factually grounded.

QE1.3: You need to evaluate safety policy adherence (jailbreak attempts). What’s the best practice?

Answer: Keep an adversarial test set, run it continuously, and track block/allow outcomes + false positives.

Clarifications (exam traps):

Safety evaluation must be continuous; content policies and prompts evolve.

Section E2: Observability

QE2.1: You need to trace a single user request across API gateway → retrieval → model call → tool calls. What’s required?

Answer: A single correlation ID propagated end-to-end + structured logs/spans.

Clarifications (exam traps):

Resource diagnostic logs alone can’t connect user intent to tool invocations.

QE2.2: What should you log for RAG requests if prompts might contain sensitive data?

Answer: Log minimal metadata (token counts, timings, chunk IDs, doc IDs) and redact sensitive text.

Clarifications (exam traps):

“Log everything” is a privacy failure.

QE2.3: You need alerts that catch quality drops before users complain. What signals are best?

Answer: Rising refusal rates, falling citation validity, retrieval no-hit rate increases, and elevated 429/5xx.

Clarifications (exam traps):

Latency-only alerts miss “silent” quality failures.

Section E3: Reliability Patterns

QE3.1: Your system retries 429s aggressively and causes cascading failures. What should you add?

Answer: Respect Retry-After, use exponential backoff + jitter, and add a circuit breaker.

Clarifications (exam traps):

Retrying harder can amplify outages.

QE3.2: You need graceful degradation when Azure OpenAI is unavailable. What’s a practical fallback strategy?

Answer: Reduce features (no tools), switch to a smaller model, queue async processing, or return a partial answer with clear status.

Clarifications (exam traps):

Fallbacks should be explicit and measurable.

QE3.3: You need to avoid throttling when usage spikes. What’s the right architecture lever?

Answer: Use queueing/batching where possible, cache results, and implement per-user rate limits.

Clarifications (exam traps):

“Scale out clients” without backoff increases 429 pressure.

Evaluation, Observability, Reliability (Expert) ​

Section E1: Evaluation (RAG + GenAI) ​

QE1.1: You need to detect regressions after changing chunking or embedding models. What’s the correct evaluation approach? ​

QE1.2: What are the most important metrics for a RAG system? ​

QE1.3: You need to evaluate safety policy adherence (jailbreak attempts). What’s the best practice? ​

Section E2: Observability ​

QE2.1: You need to trace a single user request across API gateway → retrieval → model call → tool calls. What’s required? ​

QE2.2: What should you log for RAG requests if prompts might contain sensitive data? ​

QE2.3: You need alerts that catch quality drops before users complain. What signals are best? ​

Section E3: Reliability Patterns ​

QE3.1: Your system retries 429s aggressively and causes cascading failures. What should you add? ​

QE3.2: You need graceful degradation when Azure OpenAI is unavailable. What’s a practical fallback strategy? ​

QE3.3: You need to avoid throttling when usage spikes. What’s the right architecture lever? ​

Evaluation, Observability, Reliability (Expert)

Section E1: Evaluation (RAG + GenAI)

QE1.1: You need to detect regressions after changing chunking or embedding models. What’s the correct evaluation approach?

QE1.2: What are the most important metrics for a RAG system?

QE1.3: You need to evaluate safety policy adherence (jailbreak attempts). What’s the best practice?

Section E2: Observability

QE2.1: You need to trace a single user request across API gateway → retrieval → model call → tool calls. What’s required?

QE2.2: What should you log for RAG requests if prompts might contain sensitive data?

QE2.3: You need alerts that catch quality drops before users complain. What signals are best?

Section E3: Reliability Patterns

QE3.1: Your system retries 429s aggressively and causes cascading failures. What should you add?

QE3.2: You need graceful degradation when Azure OpenAI is unavailable. What’s a practical fallback strategy?

QE3.3: You need to avoid throttling when usage spikes. What’s the right architecture lever?