Section 10: Evaluation & Safety Operations
Q10.1: How do you automate continuous evaluation of agent quality?
Answer: Register built-in evaluators (Relevance, Fluency, Coherence, Task Adherence) against captured conversations, run them on every build or nightly batch, and gate deployments on aggregated scores. Pair automated scoring with targeted human spot checks to ensure edge-case coverage.
Detailed Explanation:
- Evaluator Setup: Use the Azure AI Projects SDK to instantiate evaluators via EvaluatorIds, defining thresholds that classify pass/fail per metric (see the sketch after this list).
- Dataset Selection: Reuse logged threads and synthetic scenarios that mirror high-risk workflows; tag each scenario with expected outcomes.
- Automation: Trigger evaluation jobs in CI/CD or scheduled pipelines, store results with run metadata, and expose dashboards that track trends over time.
- Quality Gates: Fail the release if accuracy or task-adherence scores drop below agreed baselines; require manual approval for borderline cases.
- Feedback Loop: Feed evaluator findings into prompt revisions, planner tuning, or tool updates, and re-run the suite after fixes to confirm improvements.
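The sketch below wires these steps together: it registers the built-in quality evaluators against a dataset of logged conversations and submits a cloud evaluation run that a CI/CD job can poll and gate on. Treat the identifiers as assumptions to verify against your installed SDK version: the EvaluatorIds members, Dataset, EvaluatorConfiguration, the project endpoint, and the dataset ID follow the Azure AI Projects SDK preview samples and are placeholders here.

```python
# Hedged sketch: endpoint, dataset ID, and model names are placeholders;
# EvaluatorIds / EvaluatorConfiguration follow the Azure AI Projects SDK
# preview and may differ in the version you have installed.
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    Dataset,
    Evaluation,
    EvaluatorConfiguration,
    EvaluatorIds,
)
from azure.identity import DefaultAzureCredential

project_client = AIProjectClient(
    endpoint="https://<your-resource>.services.ai.azure.com/api/projects/<project>",
    credential=DefaultAzureCredential(),
)

# Register the built-in quality evaluators against a dataset of logged threads
# and synthetic scenarios that was previously uploaded to the project.
evaluation = Evaluation(
    display_name="nightly-agent-quality",
    description="Scheduled evaluation of captured agent conversations",
    data=Dataset(id="<logged-conversations-dataset-id>"),
    evaluators={
        "relevance": EvaluatorConfiguration(id=EvaluatorIds.RELEVANCE),
        "fluency": EvaluatorConfiguration(id=EvaluatorIds.FLUENCY),
        "coherence": EvaluatorConfiguration(id=EvaluatorIds.COHERENCE),
        "task_adherence": EvaluatorConfiguration(id=EvaluatorIds.TASK_ADHERENCE),
    },
)

# Submit the run; a CI/CD job can poll its status, read the aggregated scores,
# and fail the release when any metric drops below the agreed baseline.
run = project_client.evaluations.create(evaluation=evaluation)
print(f"Evaluation run submitted: {run.id}")
```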
Q10.2: How do you operationalize red teaming for Azure AI agents?
Answer: Use the Azure AI red teaming agent to scan production-like deployments with curated attack scenarios, export findings to JSON, and integrate remediation work into sprint planning. Schedule scans after major updates and before go-live to catch regressions.
Detailed Explanation:
- Scenario Library: Maintain prompt templates for jailbreaks, data exfiltration, policy bypass, and abuse patterns relevant to your domain.
- Automated Scans: Invoke red_team_agent.scan with the target endpoint and store results in version-controlled artifacts for traceability (a sketch follows this list).
- Triaging: Classify findings by severity, assign owners, and document mitigations (prompt hardening, tool restrictions, policy updates).
- Retesting: Re-run the affected scenarios to verify fixes and keep historical comparisons for audit purposes.
- Governance: Report scan outcomes to responsible AI councils or risk boards to prove continuous monitoring.
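A minimal sketch of an automated scan, assuming the red teaming agent shipped with the azure-ai-evaluation package (RedTeam, RiskCategory, AttackStrategy) and a callback that forwards attack prompts to your production-like deployment; the callback body, scan name, output path, and project endpoint are illustrative placeholders.

```python
# Hedged sketch: class and enum names follow the azure-ai-evaluation red
# teaming preview; verify them against the version you have installed.
import asyncio

from azure.ai.evaluation.red_team import AttackStrategy, RedTeam, RiskCategory
from azure.identity import DefaultAzureCredential


def agent_target(query: str) -> str:
    # Placeholder: forward the attack prompt to your deployed agent and return
    # its reply so the scan can score it.
    return call_deployed_agent(query)  # hypothetical helper


async def run_scan() -> None:
    red_team_agent = RedTeam(
        azure_ai_project="https://<your-resource>.services.ai.azure.com/api/projects/<project>",
        credential=DefaultAzureCredential(),
        risk_categories=[RiskCategory.Violence, RiskCategory.HateUnfairness],
        num_objectives=5,  # attack objectives generated per risk category
    )
    # Export findings to JSON so they can be version-controlled, triaged, and
    # compared against earlier scans.
    await red_team_agent.scan(
        target=agent_target,
        scan_name="pre-release-scan",
        attack_strategies=[AttackStrategy.Base64, AttackStrategy.Flip],
        output_path="redteam-findings.json",
    )


asyncio.run(run_scan())
```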
Q10.3: How do you compose specialized agents while preserving control boundaries?
Answer: Create connected agents for high-risk capabilities (finance, compliance, ops), expose them as ConnectedAgentTool definitions, and require callers to supply descriptive metadata so actions remain auditable. Each connected agent enforces its own guardrails while the orchestrator logs every invocation.
Detailed Explanation:
- Modular Design: Build domain-specific agents with tightly scoped instructions and tools, then register them as connected tools that other agents can invoke on demand.
- Access Control: Apply RBAC and approval workflows to the connected agent definitions so only authorized orchestrators can link them.
- Auditability: Persist invocation details (caller agent, parameters, outcome) for every connected-agent call, and correlate them with the parent conversation thread.
- Safety Layers: Require connected agents to validate inputs, respect their own evaluation policies, and decline requests outside remit.
- Lifecycle Management: Version connected agents independently, enabling safe rollbacks when compliance requirements change.
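A minimal sketch of the composition pattern above, assuming ConnectedAgentTool from the azure-ai-agents models and the agents surface exposed by AIProjectClient; the project endpoint, model deployment name, instructions, and the pre-created compliance agent ID are illustrative placeholders.

```python
# Hedged sketch: ConnectedAgentTool and the agents client surface follow the
# azure-ai-projects / azure-ai-agents SDKs and may differ across versions.
from azure.ai.agents.models import ConnectedAgentTool
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project_client = AIProjectClient(
    endpoint="https://<your-resource>.services.ai.azure.com/api/projects/<project>",
    credential=DefaultAzureCredential(),
)

# Reference an existing, tightly scoped compliance agent. The description is
# the metadata the orchestrator's model uses to decide when to delegate, so
# keep it specific and auditable.
compliance_agent = project_client.agents.get_agent("<compliance-agent-id>")
connected_compliance = ConnectedAgentTool(
    id=compliance_agent.id,
    name="compliance_reviewer",
    description="Reviews proposed actions against regulatory policy and returns an approve/deny rationale.",
)

# The orchestrator only receives the tool definition; the connected agent keeps
# its own instructions, tools, and guardrails, and each invocation is recorded
# on the parent thread so it can be correlated during audits.
orchestrator = project_client.agents.create_agent(
    model="gpt-4o",  # placeholder deployment name
    name="operations-orchestrator",
    instructions="Delegate any policy or regulatory question to compliance_reviewer.",
    tools=connected_compliance.definitions,
)
```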