Implement Generative AI Solutions - Q&A

This document contains comprehensive questions and answers for the Implement Generative AI Solutions domain of the AI-102 exam.


Section 1: Azure OpenAI Service Basics

Q1.1: What is Azure OpenAI Service, and how does it differ from OpenAI's direct API?

Answer: Azure OpenAI Service is Microsoft's managed offering that provides access to OpenAI's large language models (GPT-4, GPT-3.5, Embeddings, DALL-E, etc.) with enterprise-grade features, security, and compliance.

Key Differences:

  1. Enterprise Features:

    • Azure AD integration for authentication
    • Managed identity support
    • Private endpoints for network isolation
    • Data residency and compliance options
  2. Security and Compliance:

    • Data encrypted at rest and in transit
    • Regional availability for data residency
    • Integration with Azure security services
    • Audit logging and monitoring
  3. Cost Management:

    • Azure billing and cost management integration
    • Quota management through Azure subscriptions
    • Budget alerts and cost tracking
  4. Support and SLA:

    • Microsoft support options
    • Enterprise SLA guarantees
    • Integration with Azure support channels

Detailed Explanation: Azure OpenAI Service combines OpenAI's powerful models with Microsoft's enterprise infrastructure, providing a secure, compliant, and scalable way to deploy generative AI solutions in enterprise environments.
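For example, the Azure AD (Microsoft Entra ID) integration mentioned above lets applications authenticate without API keys. A minimal Python sketch, assuming the openai and azure-identity packages and a hypothetical resource endpoint:

```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Request Entra ID (Azure AD) tokens for the Cognitive Services scope instead of passing an API key
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint="https://my-openai-resource.openai.azure.com/",  # hypothetical resource name
    azure_ad_token_provider=token_provider,
    api_version="2024-02-01",
)
```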

Use Cases:

  • Enterprise chatbots and virtual assistants
  • Content generation and summarization
  • Code generation and assistance
  • Natural language processing applications
  • Embeddings for semantic search


Q1.2: How do you provision an Azure OpenAI resource?

Answer: To provision an Azure OpenAI resource:

  1. Request Access:

    • Submit access request through Azure Portal
    • Provide business justification
    • Wait for approval (may take time)
  2. Create Resource:

    • Navigate to Azure Portal
    • Search for "Azure OpenAI"
    • Click "Create"
    • Fill in:
      • Subscription and resource group
      • Region (select based on availability)
      • Name for the resource
      • Pricing tier (Pay-as-you-go or other)
  3. Deploy Models:

    • After resource creation, go to Azure OpenAI Studio
    • Navigate to "Deployments"
    • Create new deployment for desired model (GPT-4, GPT-3.5, etc.)
    • Specify deployment name and model version
  4. Configure Access:

    • Set up authentication (keys or Azure AD)
    • Configure network access if needed
    • Set up monitoring and logging

Detailed Explanation: Azure OpenAI requires explicit access approval due to high demand and responsible AI considerations. Once approved, provisioning follows standard Azure resource creation patterns.

Important Considerations:

  • Access Approval: Required before provisioning
  • Regional Availability: Limited to specific regions
  • Model Deployment: Models must be deployed before use
  • Quotas: Initial quotas may be limited

Step-by-Step Process:

  1. Request access at https://aka.ms/oai/access
  2. Wait for approval email
  3. Create resource in approved region
  4. Deploy models in Azure OpenAI Studio
  5. Get endpoint and keys for API access
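With the endpoint and key from step 5, a minimal sketch of calling a deployment with the Python openai SDK (the resource name and deployment name below are hypothetical):

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-openai-resource.openai.azure.com/",  # from the Keys and Endpoint blade
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # the deployment name you created, not the base model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what Azure OpenAI Service provides."},
    ],
)
print(response.choices[0].message.content)
```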


Section 2: Prompt Engineering

Q2.1: What is prompt engineering, and why is it important?

Answer: Prompt engineering is the practice of designing and optimizing input prompts (text instructions) to get the desired output from generative AI models. It's important because:

  • Model behavior is highly dependent on prompt quality
  • Well-crafted prompts improve accuracy and relevance
  • Reduces need for fine-tuning in many cases
  • Significantly impacts user experience and model effectiveness

Detailed Explanation: The quality of prompts directly determines the quality of outputs. Effective prompt engineering involves:

  • Clear instructions and context
  • Examples (few-shot learning)
  • Structured formatting
  • Constraint specification
  • Iterative refinement

Prompt Engineering Best Practices:

  1. Be Specific and Clear:

    • Avoid ambiguity
    • Use explicit instructions
    • Define expected output format
  2. Provide Context:

    • Include relevant background information
    • Set appropriate context window
    • Reference relevant domain knowledge
  3. Use Examples (Few-Shot Learning):

    • Show desired input/output patterns
    • Provide diverse examples
    • Illustrate edge cases
  4. Structure Prompts:

    • Use clear sections (role, task, examples)
    • Format with markdown or structure
    • Separate instructions from data
  5. Iterate and Refine:

    • Test prompts with various inputs
    • Measure output quality
    • Refine based on results

Prompt Techniques:

  • Zero-Shot: Direct instructions without examples
  • Few-Shot: Instructions with 1-5 examples
  • Chain-of-Thought: Breaking down reasoning steps
  • Role-Based: Assigning specific roles to the model
  • Template-Based: Using consistent prompt structures
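A short sketch combining the role-based and few-shot techniques above in a chat completions call (client is assumed to be an AzureOpenAI client as shown in Section 1; the deployment name is hypothetical):

```python
messages = [
    # Role-based: assign the model a persona and a strict output format
    {"role": "system", "content": "You are a support triage assistant. Reply with exactly one word: Billing, Technical, or Other."},
    # Few-shot: input/output examples demonstrating the desired pattern
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "Billing"},
    {"role": "user", "content": "The app crashes when I upload a file."},
    {"role": "assistant", "content": "Technical"},
    # The actual query
    {"role": "user", "content": "How do I change my invoice address?"},
]

response = client.chat.completions.create(model="gpt-35-turbo", messages=messages)
print(response.choices[0].message.content)  # expected: "Billing"
```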


Q2.2: What are the key parameters for controlling generative AI model behavior?

Answer: Key parameters for controlling model behavior include:

  1. Temperature (0.0 - 2.0):

    • Controls randomness in outputs
    • Lower = more focused and deterministic
    • Higher = more creative and diverse
    • Default: 1.0
  2. Max Tokens:

    • Maximum length of generated response
    • Prevents excessive generation
    • Cost control mechanism
    • Must account for input + output tokens
  3. Top P (Nucleus Sampling, 0.0 - 1.0):

    • Alternative to temperature
    • Controls diversity via probability mass
    • Filters out low probability tokens
    • Default: 1.0
  4. Top K:

    • Limits sampling to the K most likely tokens
    • Reduces randomness
    • Offered by some model providers but not exposed in the Azure OpenAI API
  5. Frequency Penalty (-2.0 to 2.0):

    • Reduces likelihood of repeating tokens
    • Higher values reduce repetition
    • Default: 0.0
  6. Presence Penalty (-2.0 to 2.0):

    • Encourages talking about new topics
    • Higher values promote topic diversity
    • Default: 0.0
  7. Stop Sequences:

    • Specifies sequences where generation stops
    • Useful for structured outputs
    • Can specify multiple stop sequences

Detailed Explanation: These parameters fine-tune model behavior without retraining. Understanding their effects is crucial for optimizing outputs for specific use cases.

Parameter Selection Guidelines:

  • Creative Tasks: Higher temperature (0.7-1.2), higher top_p
  • Factual Tasks: Lower temperature (0.0-0.3), lower top_p
  • Code Generation: Lower temperature (0.1-0.3) for consistency
  • Conversation: Moderate temperature (0.7-0.9)
  • Summarization: Lower temperature (0.2-0.5)
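For illustration, a sketch of how these parameters map onto a chat completions call, using values consistent with the guidelines above (client and deployment name are assumed):

```python
response = client.chat.completions.create(
    model="gpt-35-turbo",          # deployment name
    messages=[{"role": "user", "content": "Write a tagline for a hiking app."}],
    temperature=0.9,               # creative task: allow more randomness
    max_tokens=60,                 # cap response length (and cost)
    top_p=0.95,                    # nucleus sampling; usually tune this OR temperature, not both
    frequency_penalty=0.2,         # discourage repeating the same tokens
    presence_penalty=0.3,          # encourage introducing new topics
    stop=["\n\n"],                 # stop generation at a blank line
)
print(response.choices[0].message.content)
```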

Trade-offs:

  • Higher creativity vs. consistency
  • More tokens vs. cost control
  • Less repetition vs. coherence
  • Novelty vs. relevance


Q2.3: How do you implement chain-of-thought prompting?

Answer: Chain-of-thought prompting guides the model to show its reasoning process step-by-step:

  1. Explicit Instruction:

    • Instruct model to think step-by-step
    • Show reasoning process in output
    • Example: "Let's solve this step by step:"
  2. Few-Shot Examples:

    • Provide examples with reasoning steps
    • Show how to break down complex problems
    • Demonstrate thought process
  3. Structured Format:

    • Use consistent format for steps
    • Number steps or use bullets
    • Clearly separate reasoning from conclusion
  4. Iterative Refinement:

    • Refine based on observed reasoning quality
    • Adjust complexity of reasoning steps
    • Balance detail with conciseness

Detailed Explanation: Chain-of-thought prompting improves performance on complex reasoning tasks by encouraging the model to break problems into intermediate steps, similar to human problem-solving.

Example Prompt Structure:

Question: [Problem]

Let's think step by step:
1. First, I need to...
2. Then, I should consider...
3. Based on this, I can conclude...

Answer: [Final answer]
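A small sketch applying this structure through the API: the step-by-step instruction lives in the system message and a low temperature keeps the reasoning focused (client and deployment name are assumed):

```python
response = client.chat.completions.create(
    model="gpt-4",  # deployment name; reasoning-heavy tasks benefit from a more capable model
    messages=[
        {"role": "system", "content": "Solve the problem. Think step by step, numbering each step, then give the final answer on a line starting with 'Answer:'."},
        {"role": "user", "content": "A train travels 120 km in 1.5 hours, then 80 km in 1 hour. What is its average speed?"},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```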

Benefits:

  • Improved accuracy on complex problems
  • Better explainability of outputs
  • Easier to debug incorrect reasoning
  • More reliable results for mathematical/logical tasks

When to Use:

  • Mathematical problems
  • Logical reasoning tasks
  • Multi-step problem solving
  • Complex analysis requirements
  • Tasks requiring explanation


Section 3: Retrieval-Augmented Generation (RAG)

Q3.1: What is Retrieval-Augmented Generation (RAG), and when should you use it?

Answer: Retrieval-Augmented Generation (RAG) is a pattern that combines:

  • Retrieval: Finding relevant information from external data sources
  • Augmentation: Adding retrieved information to the prompt
  • Generation: Using the augmented prompt for generating responses

When to Use RAG:

  1. Domain-Specific Knowledge:

    • Need information not in training data
    • Enterprise knowledge bases
    • Product documentation
    • Internal policies and procedures
  2. Up-to-Date Information:

    • Current events
    • Real-time data
    • Frequently changing information
    • News and articles
  3. Factual Accuracy:

    • Reducing hallucinations
    • Grounding answers in source material
    • Providing citations
    • Verifiable information
  4. Cost Optimization:

    • Avoid fine-tuning for domain knowledge
    • Faster updates than retraining
    • Leverage pre-trained models with specific data

Detailed Explanation: RAG addresses limitations of generative models:

  • Training data cutoff dates
  • Lack of domain-specific knowledge
  • Hallucination issues
  • Need for citations and sources

RAG Architecture:

  1. Document Processing:

    • Ingest and chunk documents
    • Create embeddings
    • Store in vector database
  2. Query Processing:

    • Convert user query to embedding
    • Retrieve similar documents
    • Rank and filter results
  3. Context Augmentation:

    • Add retrieved context to prompt
    • Structure prompt with context
    • Include source citations
  4. Generation:

    • Generate response with augmented context
    • Include citations in response
    • Verify against source material

Components:

  • Vector Database: Azure AI Search, Pinecone, Qdrant
  • Embeddings: Azure OpenAI embeddings, text-embedding-ada-002
  • Chunking Strategy: Fixed-size, semantic, hierarchical
  • Retrieval Strategy: Semantic search, hybrid search, re-ranking
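As a concrete example of the chunking strategies listed above, a minimal sketch of fixed-size chunking with overlap (sizes here are in characters purely for illustration; token-based chunking is more common in practice):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with overlap between consecutive chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks
```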


Q3.2: How do you implement RAG with Azure OpenAI and Azure AI Search?

Answer: To implement RAG with Azure OpenAI and Azure AI Search:

  1. Set Up Azure AI Search:

    • Create Azure AI Search resource
    • Create index with vector field for embeddings
    • Configure search capabilities (full-text, vector, hybrid)
  2. Prepare Data:

    • Chunk documents into appropriate sizes
    • Generate embeddings using Azure OpenAI embeddings API
    • Create metadata for each chunk (source, page, etc.)
  3. Index Documents:

    • Upload chunks and embeddings to Azure AI Search
    • Store metadata for retrieval context
    • Index configuration for hybrid search
  4. Implement Retrieval:

    • Convert user query to embedding
    • Perform vector similarity search in Azure AI Search
    • Retrieve top-k most relevant chunks
    • Include metadata for citations
  5. Augment Prompts:

    • Add retrieved chunks to prompt context
    • Structure prompt with system message
    • Include source citations in format
  6. Generate Response:

    • Call Azure OpenAI with augmented prompt
    • Include citations in response
    • Verify against source material

Detailed Explanation: Azure AI Search provides enterprise-grade vector search capabilities with hybrid search support (combining vector and keyword search) for optimal retrieval performance.

Implementation Steps:

Step 1: Create Azure AI Search Index

json
{
  "name": "rag-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "content", "type": "Edm.String" },
    { "name": "contentVector", "type": "Collection(Edm.Single)", "dimensions": 1536 },
    { "name": "source", "type": "Edm.String" },
    { "name": "page", "type": "Edm.Int32" }
  ],
  "vectorSearch": {
    "algorithmConfigurations": [
      {
        "name": "vector-search-config",
        "kind": "hnsw"
      }
    ]
  }
}

Step 2: Generate Embeddings and Index

  • Use Azure OpenAI embeddings API (text-embedding-ada-002 or text-embedding-3-small/large)
  • Chunk documents appropriately (512-1024 tokens)
  • Index chunks with embeddings and metadata

Step 3: Retrieve Relevant Chunks

  • Convert query to embedding
  • Perform vector search for similar chunks
  • Optionally combine with keyword search (hybrid)
  • Retrieve top-k chunks (typically 3-5)

Step 4: Augment Prompt

System: You are a helpful assistant that answers questions based on the provided context. Always cite your sources.

Context:
[Retrieved chunk 1] (Source: document1.pdf, page 5)
[Retrieved chunk 2] (Source: document1.pdf, page 6)
[Retrieved chunk 3] (Source: document2.pdf, page 2)

User: [User question]
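A condensed Python sketch of steps 3-4 (retrieve, then augment the prompt), assuming an index shaped like the one above, the azure-search-documents and openai packages, and hypothetical resource names, deployments, and keys:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

openai_client = AzureOpenAI(
    azure_endpoint="https://my-openai-resource.openai.azure.com/",
    api_key="<openai-key>",
    api_version="2024-02-01",
)
search_client = SearchClient(
    endpoint="https://my-search.search.windows.net",
    index_name="rag-index",
    credential=AzureKeyCredential("<search-key>"),
)

question = "What is the refund policy?"

# Step 3: embed the query and run a hybrid (keyword + vector) search
query_vector = openai_client.embeddings.create(
    model="text-embedding-ada-002",  # embeddings deployment name
    input=question,
).data[0].embedding

results = search_client.search(
    search_text=question,
    vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=3, fields="contentVector")],
    select=["content", "source", "page"],
    top=3,
)

# Step 4: build the augmented prompt with citations
context = "\n".join(f"{doc['content']} (Source: {doc['source']}, page {doc['page']})" for doc in results)

answer = openai_client.chat.completions.create(
    model="gpt-35-turbo",  # chat deployment name
    messages=[
        {"role": "system", "content": "Answer using only the provided context. Always cite your sources."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0.2,
)
print(answer.choices[0].message.content)
```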

Best Practices:

  • Chunking: Optimal size (512-1024 tokens), overlap between chunks
  • Embeddings: Use appropriate model for domain
  • Retrieval: Use hybrid search for better results
  • Context Window: Balance retrieved chunks with model limits
  • Citations: Always include source information


Q3.3: What are embeddings, and how do you use them in RAG implementations?

Answer: Embeddings are vector representations of text that capture semantic meaning. Words or phrases with similar meanings have similar vectors, enabling semantic search and similarity calculations.

Using Embeddings in RAG:

  1. Generate Document Embeddings:

    • Convert document chunks to embeddings using Azure OpenAI embeddings API
    • Store embeddings with original text
    • Include metadata (source, position, etc.)
  2. Generate Query Embeddings:

    • Convert user queries to embeddings using same model
    • Ensure same model used for documents and queries
  3. Calculate Similarity:

    • Use cosine similarity or dot product
    • Find most similar document chunks to query
    • Rank results by similarity score
  4. Retrieve Relevant Chunks:

    • Select top-k most similar chunks
    • Include metadata for context
    • Use for prompt augmentation

Detailed Explanation: Embeddings transform text into dense vector representations where semantically similar text has similar vectors. This enables finding relevant information even when exact keywords don't match.

Embedding Models:

  • text-embedding-ada-002: Standard model, 1536 dimensions
  • text-embedding-3-small: Improved model, 1536 dimensions
  • text-embedding-3-large: Higher quality, 3072 dimensions

Key Concepts:

  • Vector Dimensions: Higher dimensions typically mean better quality (more expensive)
  • Normalization: Vectors often normalized for cosine similarity
  • Model Consistency: Same model must be used for documents and queries
  • Semantic Understanding: Captures meaning, not just keywords

Best Practices:

  • Use same embedding model for indexing and querying
  • Normalize vectors for cosine similarity calculations
  • Consider domain-specific embeddings for specialized domains
  • Balance embedding quality with cost and performance
  • Cache embeddings for frequently accessed documents

Similarity Metrics:

  • Cosine Similarity: Measures the angle between vectors (-1 to 1, higher is more similar)
  • Dot Product: Combines magnitude and direction; equals cosine similarity for normalized vectors
  • Euclidean Distance: Measures straight-line distance in vector space (lower is more similar)
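A small sketch computing cosine similarity between a query and document chunks with NumPy (client is assumed to be an AzureOpenAI client and the embeddings deployment name is hypothetical):

```python
import numpy as np

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

chunks = [
    "Reset your password from the sign-in page.",
    "Our offices are closed on public holidays.",
]
chunk_vectors = embed(chunks)
query_vector = embed(["How do I change my password?"])[0]

# Cosine similarity = dot product of L2-normalized vectors
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
best = int(np.argmax(scores))
print(f"Best match: {chunks[best]!r} (score={scores[best]:.3f})")
```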


Section 4: Fine-Tuning

Q4.1: What is fine-tuning, and when should you use it instead of prompt engineering?

Answer: Fine-tuning is training a pre-trained model on a custom dataset to adapt it to specific tasks, domains, or styles. Use fine-tuning when:

  1. Prompt Engineering Limitations:

    • Prompts become too long or complex
    • Need consistent style/format across outputs
    • Specific domain terminology required
    • Prompt engineering doesn't achieve desired results
  2. Performance Requirements:

    • Need faster inference (fewer tokens in prompt)
    • Lower latency requirements
    • Cost optimization (shorter prompts)
  3. Consistency Needs:

    • Specific output format required
    • Consistent tone and style
    • Domain-specific terminology
    • Brand voice requirements
  4. Domain Specialization:

    • Highly specialized domain knowledge
    • Technical or scientific content
    • Legal or medical terminology
    • Company-specific processes

Detailed Explanation: Fine-tuning adapts models to specific use cases by updating model weights, making them more effective for particular tasks than general-purpose models with prompts.

Fine-Tuning vs. Prompt Engineering:

| Aspect | Prompt Engineering | Fine-Tuning |
| --- | --- | --- |
| Speed to deploy | Fast | Slower (requires training) |
| Cost | Lower (no training) | Higher (training + inference) |
| Flexibility | High (easy to change) | Lower (requires retraining) |
| Customization | Limited by prompt | High (model adapts) |
| Consistency | Variable | More consistent |
| Performance | Good for general tasks | Better for specific tasks |

Fine-Tuning Process:

  1. Prepare training data (structured format)
  2. Validate data quality and format
  3. Submit fine-tuning job
  4. Monitor training progress
  5. Evaluate fine-tuned model
  6. Deploy fine-tuned model

When NOT to Fine-Tune:

  • General-purpose use cases work well with prompts
  • Need quick iteration and experimentation
  • Don't have sufficient high-quality training data
  • Budget constraints (fine-tuning is expensive)
  • Frequently changing requirements


Q4.2: How do you prepare training data for fine-tuning?

Answer: Prepare training data for fine-tuning:

  1. Format Data:

    • Use JSONL (JSON Lines) format
    • Each line is a JSON object with "messages" array
    • Messages have "role" and "content" fields
    • Include system, user, and assistant messages
  2. Data Structure:

    json
    {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Question"}, {"role": "assistant", "content": "Answer"}]}
  3. Data Quality:

    • High-quality examples (at least 10; 50 or more recommended)
    • Diverse examples covering use cases
    • Accurate and correct responses
    • Consistent format and style
  4. Data Size:

    • Minimum: 10 examples
    • Recommended: 50-100+ examples
    • More data generally improves performance
    • Balance with cost and time
  5. Validation:

    • Split data into training and validation sets
    • Review examples for accuracy
    • Check format compliance
    • Ensure diversity in examples

Detailed Explanation: Training data quality directly impacts fine-tuning results. Well-structured, diverse, high-quality data produces better fine-tuned models.

Data Preparation Steps:

Step 1: Collect Examples

  • Gather real examples of desired interactions
  • Cover various scenarios and edge cases
  • Include examples showing how the model should handle difficult or out-of-scope requests

Step 2: Format Data

json
{"messages": [{"role": "system", "content": "You are a customer support agent."}, {"role": "user", "content": "I can't log into my account"}, {"role": "assistant", "content": "I can help you with that. Can you provide your username or email?"}]}
{"messages": [{"role": "system", "content": "You are a customer support agent."}, {"role": "user", "content": "My order hasn't arrived"}, {"role": "assistant", "content": "I apologize for the delay. Let me check your order status. Can you provide your order number?"}]}

Step 3: Validate Format

  • Check JSONL syntax
  • Verify message structure
  • Ensure role consistency
  • Check for duplicates

Step 4: Upload to Azure

  • Upload the JSONL file through the Azure OpenAI Files API or Azure OpenAI Studio (importing from Azure Blob Storage is also supported)
  • Make it accessible to the fine-tuning service
  • Verify the upload succeeded

Step 5: Create Fine-Tuning Job

  • Submit fine-tuning job with data file
  • Specify base model
  • Monitor training progress
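A hedged sketch of steps 4-5 using the Python openai SDK against an Azure OpenAI resource that supports fine-tuning (the file name and base model name are assumptions; available base models vary by region):

```python
# Step 4: upload the validated JSONL training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Step 5: create the fine-tuning job against a base model that supports fine-tuning
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-35-turbo-0613",  # assumed base model; check regional availability
)

# Poll the job status until training completes
status = client.fine_tuning.jobs.retrieve(job.id).status
print(job.id, status)
```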

Best Practices:

  • Quality over Quantity: Better to have fewer high-quality examples than many poor ones
  • Diversity: Cover various scenarios and edge cases
  • Consistency: Maintain consistent style and format
  • Representation: Include examples representative of actual usage
  • Validation: Always validate on held-out data

Common Issues:

  • Format errors (incorrect JSON structure)
  • Insufficient examples
  • Low-quality or incorrect examples
  • Lack of diversity in examples
  • Data leakage (test data in training)


Section 5: Content Filtering and Safety

Q5.1: How do you implement content filtering in Azure OpenAI Service?

Answer: Implement content filtering in Azure OpenAI:

  1. Default Content Filters:

    • Content filters enabled by default
    • Automatically evaluate prompts and completions
    • Categories: Hate, Sexual, Violence, Self-Harm
    • Severity levels: Safe, Low, Medium, High
  2. Filter Configuration:

    • Configure filter severity via API or Azure Portal
    • Set per-category filters
    • Customize based on use case requirements
  3. Custom Blocklists:

    • Create custom blocklists for prohibited terms
    • Apply at deployment or subscription level
    • Manage through REST API or Azure Portal
  4. Filter Response Handling:

    • Check content filter results in API response
    • Handle filtered content appropriately
    • Log filtered content for monitoring
    • Implement fallback behavior

Detailed Explanation: Azure OpenAI includes built-in content filters that automatically evaluate content for safety. These filters can be configured and supplemented with custom blocklists.

Content Filter Categories:

  • Hate: Hate speech, discriminatory content
  • Sexual: Sexual content, explicit material
  • Violence: Violent content, harmful actions
  • Self-Harm: Self-harm, suicide-related content

Severity Levels:

  • Safe: Content is safe
  • Low: Mild content, may be inappropriate
  • Medium: Likely inappropriate content
  • High: Highly inappropriate, should be blocked

Filter Configuration Options:

  1. Deployment-Level Configuration:

    • Content filter configurations are created and then associated with deployments (there is no per-request content_filter parameter)
    • Specify severity thresholds per category
    • API responses surface results in content_filter_results and the content_filter finish reason
  2. Azure Portal:

    • Configure filters in Azure OpenAI Studio
    • Set deployment-level filters
    • Manage blocklists
  3. REST API:

    • Create and manage blocklists
    • Configure filter settings
    • Monitor filter statistics

Best Practices:

  • Understand filter behavior for your use case
  • Test filters with representative content
  • Implement appropriate error handling
  • Monitor filter statistics regularly
  • Adjust filters based on false positives/negatives
  • Combine with custom business rules
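A minimal sketch of the error handling recommended above: a filtered prompt is rejected with a 400 error, while a filtered completion is signalled through the finish reason (client and deployment name are assumed):

```python
import openai

user_input = "Tell me about your return policy."

try:
    response = client.chat.completions.create(
        model="gpt-35-turbo",
        messages=[{"role": "user", "content": user_input}],
    )
    choice = response.choices[0]
    if choice.finish_reason == "content_filter":
        # The completion was blocked or truncated by the content filter
        print("The response was filtered; show a fallback message or rephrase the request.")
    else:
        print(choice.message.content)
except openai.BadRequestError as err:
    # Prompts that trip the filter are rejected before generation
    print(f"Prompt was filtered or invalid: {err}")
```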


Q5.2: What are blocklists, and how do you implement them?

Answer: Blocklists are custom lists of terms or phrases that should be blocked or flagged when found in prompts or completions. Implement blocklists:

  1. Create Blocklist:

    • Define list of prohibited terms
    • Choose blocklist type (Prompt or Completion)
    • Name and describe blocklist
  2. Add Terms:

    • Add terms or phrases to blocklist
    • Supports exact match and pattern matching
    • Case-sensitive or case-insensitive matching
    • Support for wildcards and patterns
  3. Apply Blocklist:

    • Apply to deployment or subscription level
    • Can have multiple blocklists
    • Combine with default content filters
  4. Monitor and Update:

    • Monitor blocklist hits
    • Update based on new requirements
    • Remove false positives
    • Adjust patterns for better matching

Detailed Explanation: Blocklists allow customization beyond default content filters, enabling blocking of specific terms relevant to your organization or use case.

Blocklist Types:

  • Prompt Blocklists: Applied to user prompts
  • Completion Blocklists: Applied to model completions

Use Cases:

  • Company-specific prohibited terms
  • Competitor names or products
  • Sensitive information patterns
  • Regulatory compliance requirements
  • Brand protection

Implementation Example (blocklists are managed through the Azure control-plane REST API on the Azure OpenAI resource):

  1. Create a blocklist:

    http
    PUT https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/Microsoft.CognitiveServices/accounts/{accountName}/raiBlocklists/{blocklistName}?api-version={api-version}
  2. Add items to the blocklist:

    http
    PUT https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/Microsoft.CognitiveServices/accounts/{accountName}/raiBlocklists/{blocklistName}/raiBlocklistItems/{itemName}?api-version={api-version}
  3. Apply to a deployment:

    • Reference the blocklist from a content filter configuration and assign that configuration to the deployment, via Azure OpenAI Studio or the same management API

Best Practices:

  • Start with high-priority terms
  • Test blocklists with sample content
  • Monitor for false positives
  • Use patterns for variations (e.g., "company-name", "Company Name", "COMPANY NAME")
  • Document reasons for blocked terms
  • Regularly review and update blocklists


Section 6: Model Selection and Deployment

Q6.1: What Azure OpenAI models are available, and how do you choose the right one?

Answer: Azure OpenAI models available:

  1. GPT-4 Models:

    • GPT-4: Most capable model, best for complex tasks
    • GPT-4 Turbo: Faster and cheaper, improved context window
    • GPT-4o: Optimized for performance and cost
    • Best for: Complex reasoning, code generation, advanced tasks
  2. GPT-3.5 Models:

    • GPT-3.5 Turbo: Fast and cost-effective, good general-purpose
    • Best for: Most common tasks, cost-sensitive applications
    • Good balance of capability and cost
  3. Embedding Models:

    • text-embedding-ada-002: Standard embeddings
    • text-embedding-3-small: Improved quality, same size
    • text-embedding-3-large: Highest quality, larger vectors
    • Best for: Semantic search, RAG implementations
  4. DALL-E Models:

    • DALL-E 2: Image generation
    • DALL-E 3: Improved quality and capabilities
    • Best for: Image generation from text descriptions

Choosing the Right Model:

Considerations:

  1. Task Complexity:

    • Simple tasks: GPT-3.5 Turbo
    • Complex reasoning: GPT-4 or GPT-4 Turbo
    • Code generation: GPT-4 models
  2. Cost Requirements:

    • Cost-sensitive: GPT-3.5 Turbo
    • Quality priority: GPT-4 models
    • Balance: GPT-4 Turbo
  3. Performance Needs:

    • Fast responses: GPT-3.5 Turbo or GPT-4 Turbo
    • Maximum capability: GPT-4 or GPT-4o
  4. Context Window:

    • Small context: GPT-3.5 Turbo
    • Large documents: GPT-4 Turbo or GPT-4o
    • Very large: Check latest model capabilities
  5. Use Case:

    • General conversation: GPT-3.5 Turbo
    • Complex analysis: GPT-4
    • Embeddings: text-embedding-3-small/large
    • Images: DALL-E 3

Detailed Explanation: Model selection impacts cost, performance, and capability. Understanding trade-offs helps choose the right model for each use case.

Model Comparison:

| Model | Capability | Speed | Cost | Best For |
| --- | --- | --- | --- | --- |
| GPT-4 | Highest | Slower | Highest | Complex reasoning |
| GPT-4 Turbo | High | Fast | Medium | Balanced performance |
| GPT-4o | High | Fast | Medium | Optimized performance |
| GPT-3.5 Turbo | Good | Fastest | Lowest | General purpose |

Best Practices:

  • Start with GPT-3.5 Turbo for prototyping
  • Upgrade to GPT-4 models only if needed
  • Use GPT-4 Turbo for better cost/performance balance
  • Consider embeddings models for semantic search
  • Test multiple models to find optimal fit
  • Monitor costs and adjust as needed


Q6.2: How do you deploy and manage models in Azure OpenAI?

Answer: Deploy and manage models:

  1. Deploy Model via Azure OpenAI Studio:

    • Navigate to Azure OpenAI Studio
    • Go to "Deployments" section
    • Click "Create new deployment"
    • Select model (GPT-4, GPT-3.5, etc.)
    • Provide deployment name
    • Configure advanced options if needed
  2. Deploy via REST API (Azure control plane):

    http
    PUT https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/Microsoft.CognitiveServices/accounts/{accountName}/deployments/{deploymentName}?api-version={api-version}
    {
      "sku": {
        "name": "Standard",
        "capacity": 1
      },
      "properties": {
        "model": {
          "format": "OpenAI",
          "name": "gpt-4",
          "version": "0613"
        }
      }
    }
  3. Manage Deployments:

    • View all deployments in Azure OpenAI Studio
    • Monitor usage and performance
    • Update deployment configurations
    • Delete unused deployments
  4. Model Versioning:

    • Specify model version in deployment
    • Update to new versions when available
    • Test new versions before production
    • Maintain multiple versions if needed

Detailed Explanation: Models must be deployed before use. Deployments provide named endpoints for accessing models, allowing versioning, scaling, and management.

Deployment Configuration:

  • Deployment Name: Unique identifier for deployment
  • Model: Base model to deploy (GPT-4, GPT-3.5, etc.)
  • Version: Specific model version (optional)
  • SKU: Capacity and pricing tier
  • Content Filters: Filter configuration for deployment

Deployment Best Practices:

  • Use descriptive deployment names
  • Document deployment purposes
  • Monitor usage and costs per deployment
  • Use separate deployments for different environments (dev, test, prod)
  • Clean up unused deployments
  • Test new versions before promoting to production

Management Tasks:

  1. Monitoring:

    • Track request counts per deployment
    • Monitor error rates
    • Analyze usage patterns
    • Cost tracking per deployment
  2. Scaling:

    • Adjust capacity based on demand
    • Use multiple deployments for load distribution
    • Consider regional deployments for latency
  3. Versioning:

    • Deploy new versions alongside existing
    • A/B test new versions
    • Gradually migrate traffic
    • Roll back if issues occur
  4. Security:

    • Apply content filters per deployment
    • Configure network access
    • Set up authentication
    • Monitor for abuse


Summary

This document covers key aspects of implementing generative AI solutions with Azure OpenAI Service, including service basics, prompt engineering, RAG patterns, fine-tuning, content filtering, and model selection. Each topic is essential for success in the AI-102 exam and real-world Azure OpenAI implementations.

Released under the MIT License.