Implement Knowledge Mining and Information Extraction Solutions - Q&A
This document contains comprehensive questions and answers for the Implement Knowledge Mining and Information Extraction Solutions domain of the AI-102 exam.
📚 Reference Links
- Azure AI Search Documentation
- Document Intelligence (Form Recognizer) Documentation
- Knowledge Mining Patterns
- AI-102 Study Guide
Section 1: Azure AI Search Basics
Q1.1: What is Azure AI Search, and how is it used for knowledge mining?
Answer: Azure AI Search (formerly Azure Cognitive Search) is a cloud search service that enables building rich search experiences over heterogeneous content using AI-powered indexing and querying. For knowledge mining:
Content Indexing:
- Index documents from various sources (Blob Storage, SQL, Cosmos DB, etc.)
- Extract text, metadata, and structured data
- Support multiple file formats (PDF, Word, images, etc.)
AI-Enhanced Indexing:
- Use cognitive skills to enrich content
- Extract entities, key phrases, sentiment
- OCR for images and scanned documents
- Language detection and translation
Intelligent Search:
- Full-text search capabilities
- Vector search for semantic similarity
- Hybrid search (keyword + vector)
- Faceted navigation and filtering
Knowledge Extraction:
- Extract insights from unstructured data
- Organize and structure information
- Create searchable knowledge bases
- Enable discovery of hidden information
Detailed Explanation: Azure AI Search transforms unstructured and semi-structured content into searchable knowledge by extracting, enriching, and indexing information from various sources using AI capabilities.
Knowledge Mining Workflow:
- Data Ingestion: Connect to data sources
- Indexing: Extract and index content
- Enrichment: Apply cognitive skills
- Storage: Store enriched data
- Querying: Search and retrieve information
- Applications: Build search interfaces
Use Cases:
- Enterprise search
- Document intelligence
- Content discovery
- E-commerce search
- Knowledge bases
- Compliance and auditing
Key Components:
- Indexer: Automated data ingestion
- Index: Searchable data structure
- Skillset: AI enrichment pipeline
- DataSource: Source of content
- Synonym Maps: Query expansion
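Once these components are in place, querying an index is a single call. A minimal sketch using the Python SDK (the endpoint, key, and index name are placeholders):
```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Query client for an existing index; values below are placeholders
search_client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="my-index",
    credential=AzureKeyCredential("<query-key>")
)

# Simple full-text query
results = search_client.search(search_text="knowledge mining", top=5)
for result in results:
    print(result["@search.score"], result["id"])
```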
Q1.2: How do you create a search index in Azure AI Search?
Answer: Create a search index:
Create Index Definition:
```python
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchFieldDataType
)
from azure.core.credentials import AzureKeyCredential

client = SearchIndexClient(
    endpoint=search_endpoint,
    credential=AzureKeyCredential(admin_key)
)

# Define index fields: SearchableField for full-text search,
# SimpleField for filter/sort/facet-only fields
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SimpleField(name="category", type=SearchFieldDataType.String, filterable=True),
    SimpleField(name="date", type=SearchFieldDataType.DateTimeOffset, sortable=True),
    SimpleField(name="score", type=SearchFieldDataType.Double, facetable=True)
]

# Create index
index = SearchIndex(name="my-index", fields=fields)
client.create_index(index)
```
Using REST API:
```http
PUT https://{service-name}.search.windows.net/indexes/{index-name}?api-version=2023-11-01
Content-Type: application/json
api-key: {admin-key}

{
  "name": "my-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "title", "type": "Edm.String", "searchable": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    { "name": "category", "type": "Edm.String", "filterable": true, "facetable": true }
  ]
}
```
Using Azure Portal:
- Navigate to Azure AI Search resource
- Go to "Indexes" section
- Click "Create index"
- Define fields and settings
- Save index
Detailed Explanation: An index is a schema that defines fields, their types, and search behaviors. Indexes store searchable content and enable fast retrieval through various query capabilities.
Field Types:
- Edm.String: Text fields
- Edm.Int32/Int64: Integer numbers
- Edm.Double: Floating-point numbers
- Edm.Boolean: Boolean values
- Edm.DateTimeOffset: Date/time values
- Edm.GeographyPoint: Geographic coordinates
- Collection(Edm.String): Arrays of strings
- Edm.ComplexType: Nested objects
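The collection and complex types map directly onto the Python SDK's field helpers; a brief sketch (field names are illustrative):
```python
from azure.search.documents.indexes.models import (
    ComplexField, SearchableField, SimpleField, SearchFieldDataType
)

# Collection(Edm.String): an array of strings, e.g. document tags
tags_field = SearchableField(
    name="tags",
    collection=True,
    type=SearchFieldDataType.String,
    facetable=True
)

# Edm.ComplexType: a nested object with its own sub-fields
author_field = ComplexField(
    name="author",
    fields=[
        SearchableField(name="name", type=SearchFieldDataType.String),
        SimpleField(name="email", type=SearchFieldDataType.String, filterable=True)
    ]
)
```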
Field Attributes:
- key: Unique identifier (required)
- searchable: Full-text search enabled
- filterable: Can be used in filters
- sortable: Can be used for sorting
- facetable: Can be used for faceting
- retrievable: Returned in search results
- analyzer: Text analysis configuration
Index Best Practices:
- Design Schema: Plan fields based on query needs
- Field Types: Use appropriate types for data
- Attributes: Configure attributes for search behavior
- Naming: Use clear, consistent field names
- Documentation: Document index purpose and usage
Q1.3: What is a skillset, and how do you create one for knowledge enrichment?
Answer: A skillset is a collection of cognitive skills (AI enrichment steps) applied to documents during indexing. Create a skillset:
Define Skillset:
```python
from azure.search.documents.indexes.models import (
    SearchIndexerSkillset, EntityRecognitionSkill, KeyPhraseExtractionSkill,
    SentimentSkill, ImageAnalysisSkill, OcrSkill,
    InputFieldMappingEntry, OutputFieldMappingEntry, CognitiveServicesAccountKey
)

skills = [
    # Text skills
    EntityRecognitionSkill(
        name="entity-recognition",
        description="Extract entities from text",
        context="/document/content",
        inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
        outputs=[OutputFieldMappingEntry(name="entities", target_name="entities")]
    ),
    KeyPhraseExtractionSkill(
        name="key-phrase-extraction",
        description="Extract key phrases",
        context="/document/content",
        inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
        outputs=[OutputFieldMappingEntry(name="keyPhrases", target_name="keyPhrases")]
    ),
    SentimentSkill(
        name="sentiment-analysis",
        description="Analyze sentiment",
        context="/document",
        inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
        outputs=[OutputFieldMappingEntry(name="sentiment", target_name="sentiment")]
    ),
    # Image skills
    OcrSkill(
        name="ocr-skill",
        description="Extract text from images",
        context="/document/normalized_images/*",
        inputs=[InputFieldMappingEntry(name="image", source="/document/normalized_images/*")],
        outputs=[OutputFieldMappingEntry(name="text", target_name="ocrText")]
    ),
    ImageAnalysisSkill(
        name="image-analysis",
        description="Analyze images",
        context="/document/normalized_images/*",
        inputs=[InputFieldMappingEntry(name="image", source="/document/normalized_images/*")],
        outputs=[OutputFieldMappingEntry(name="tags", target_name="imageTags")]
    )
]

# Create skillset (an Azure AI services key is attached for billing)
skillset = SearchIndexerSkillset(
    name="my-skillset",
    description="Knowledge mining skillset",
    skills=skills,
    cognitive_services_account=CognitiveServicesAccountKey(key=cognitive_services_key)
)

# skillset_client is a SearchIndexerClient
skillset_client.create_skillset(skillset)
```
Available Skills:
- Text Skills: Entity recognition, key phrase extraction, sentiment analysis, language detection, text translation, PII detection
- Image Skills: OCR, image analysis, face detection
- Document Skills: Document cracking (text extraction), merge text
- Custom Skills: Custom web API skills
Detailed Explanation: Skillsets enrich documents during indexing by applying AI capabilities, extracting structured information from unstructured content, and enhancing searchability.
Skillset Workflow:
- Document Cracking: Extract text and images from documents
- Image Skills: Process images (OCR, analysis)
- Text Skills: Process text (entities, key phrases, sentiment)
- Skill Chaining: Outputs of one skill as inputs to another
- Shaping: Organize enriched data into index fields
Skill Types:
Built-in Skills:
- Pre-built cognitive skills
- No custom code required
- Available out-of-the-box
Custom Skills:
- Custom web API skills
- Deploy your own processing logic
- Integrate external services
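A custom skill is simply an HTTP endpoint that the enrichment pipeline calls with batches of documents; the skill definition points the pipeline at it. A minimal sketch (the function URL and key header are placeholders):
```python
from azure.search.documents.indexes.models import (
    WebApiSkill, InputFieldMappingEntry, OutputFieldMappingEntry
)

# Custom skill backed by your own web API (e.g., an Azure Function)
custom_skill = WebApiSkill(
    name="custom-classifier",
    description="Calls an external API to classify document text",
    uri="https://my-function-app.azurewebsites.net/api/classify",  # placeholder
    http_headers={"x-functions-key": "<function-key>"},  # placeholder
    context="/document",
    batch_size=4,
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="label", target_name="docLabel")]
)
```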
Skill Execution:
- Sequential: Skills execute in order
- Parallel: Independent skills run in parallel
- Context: Define document scope for skill execution
- Error Handling: Configure error handling policies
Best Practices:
- Skill Selection: Choose relevant skills for use case
- Skill Order: Order skills logically (dependencies)
- Context Scoping: Use appropriate context paths
- Performance: Minimize unnecessary skills
- Cost Optimization: Use skills efficiently
Q1.4: How do you create an indexer for automated data ingestion?
Answer: Create an indexer:
Define Data Source:
```python
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataSourceConnection, SearchIndexerDataContainer
)

indexer_client = SearchIndexerClient(
    endpoint=search_endpoint,
    credential=AzureKeyCredential(admin_key)
)

data_source = SearchIndexerDataSourceConnection(
    name="my-datasource",
    type="azureblob",
    connection_string=storage_connection_string,
    container=SearchIndexerDataContainer(name="documents")
)
indexer_client.create_data_source_connection(data_source)
```
Create Indexer:
```python
from datetime import datetime, timedelta
from azure.search.documents.indexes.models import (
    SearchIndexer, IndexingSchedule, IndexingParameters
)

indexer = SearchIndexer(
    name="my-indexer",
    description="Automated document indexing",
    data_source_name="my-datasource",
    target_index_name="my-index",
    skillset_name="my-skillset",
    schedule=IndexingSchedule(
        interval=timedelta(hours=1),  # run every hour
        start_time=datetime(2024, 1, 1)
    ),
    parameters=IndexingParameters(
        batch_size=5,
        max_failed_items=10,
        max_failed_items_per_batch=5
    )
)
indexer_client.create_indexer(indexer)
```
Run Indexer:
```python
# Manual run
indexer_client.run_indexer("my-indexer")

# Get indexer status
status = indexer_client.get_indexer_status("my-indexer")
print(f"Status: {status.last_result.status}")
print(f"Items processed: {status.last_result.item_count}")
print(f"Items failed: {status.last_result.failed_item_count}")
```
Detailed Explanation: Indexers automate data ingestion by connecting to data sources, extracting content, applying skillsets for enrichment, and populating search indexes.
Indexer Features:
- Automated Ingestion: Regularly pull data from sources
- Change Detection: Only process changed documents
- Incremental Updates: Update only modified content
- Error Handling: Handle failures gracefully
- Scheduling: Run on schedule or on-demand
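For most sources, change detection is configured on the data source. Blob Storage tracks LastModified automatically; for Azure SQL or Cosmos DB, a high-water-mark column drives it. A sketch for a SQL source (the column name is an assumption):
```python
from azure.search.documents.indexes.models import (
    SearchIndexerDataSourceConnection, SearchIndexerDataContainer,
    HighWaterMarkChangeDetectionPolicy
)

sql_data_source = SearchIndexerDataSourceConnection(
    name="sql-datasource",
    type="azuresql",
    connection_string=sql_connection_string,
    container=SearchIndexerDataContainer(name="Documents"),
    # Re-process only rows whose watermark column has advanced
    data_change_detection_policy=HighWaterMarkChangeDetectionPolicy(
        high_water_mark_column_name="RowVersion"  # assumed rowversion column
    )
)
```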
Supported Data Sources:
- Azure Blob Storage: Documents in blob containers
- Azure Table Storage: Table data
- Azure SQL Database: SQL tables
- Azure Cosmos DB: Cosmos DB collections
- SharePoint Online: SharePoint documents
Indexer Configuration:
- Schedule: ISO 8601 interval (e.g., PT1H; minimum PT5M) for automatic runs
- Batch Size: Documents per batch
- Error Tolerance: Failed items per batch
- Field Mappings: Map source fields to index fields
- Output Field Mappings: Map skill outputs to index fields
Field Mappings:
```python
from azure.search.documents.indexes.models import FieldMapping, FieldMappingFunction

field_mappings = [
    # Encode the blob path so it can be used as a key/URL field
    FieldMapping(
        source_field_name="metadata_storage_path",
        target_field_name="url",
        mapping_function=FieldMappingFunction(name="base64Encode")
    ),
    FieldMapping(
        source_field_name="metadata_creation_date",
        target_field_name="created_date"
    )
]
indexer.field_mappings = field_mappings
```
Output Field Mappings:
```python
output_field_mappings = [
    # Map enriched skill outputs into index fields
    FieldMapping(
        source_field_name="/document/content/entities/*",
        target_field_name="entities"
    ),
    FieldMapping(
        source_field_name="/document/content/keyPhrases/*",
        target_field_name="keyPhrases"
    )
]
indexer.output_field_mappings = output_field_mappings
```
Best Practices:
- Scheduling: Use appropriate schedule intervals
- Error Handling: Configure tolerance levels
- Performance: Optimize batch sizes
- Monitoring: Track indexer status regularly
- Incremental Updates: Enable change detection
Section 2: Vector Search and Semantic Search
Q2.1: What is vector search, and how do you implement it in Azure AI Search?
Answer: Vector search enables semantic similarity search by comparing vector representations (embeddings) of text. Implement vector search:
Add Vector Field to Index:
```python
from azure.search.documents.indexes.models import (
    SearchField, SearchFieldDataType, VectorSearch,
    VectorSearchProfile, HnswAlgorithmConfiguration
)

# Add a vector field (dimensions must match the embedding model)
vector_field = SearchField(
    name="contentVector",
    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
    searchable=True,
    vector_search_dimensions=1536,  # e.g., text-embedding-ada-002
    vector_search_profile_name="vector-profile"
)

# Configure vector search (HNSW approximate nearest neighbors)
vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],
    profiles=[
        VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw-config"
        )
    ]
)

index.fields.append(vector_field)
index.vector_search = vector_search
```
Generate Embeddings:
```python
# Built-in Azure OpenAI embedding skill, shown here as a REST-style
# skill definition; it vectorizes content during indexing
embedding_skill = {
    "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
    "name": "embedding-skill",
    "description": "Generate embeddings",
    "context": "/document",
    "resourceUri": openai_resource_uri,
    "deploymentId": openai_deployment_id,
    "apiKey": openai_api_key,
    "inputs": [
        {"name": "text", "source": "/document/content"}
    ],
    "outputs": [
        {"name": "embedding", "targetName": "contentVector"}
    ]
}
```
Vector Search Query:
```python
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# Generate query embedding (helper sketched below)
query_embedding = generate_embedding("What is Azure AI Search?")

# Vector search
vector_query = VectorizedQuery(
    vector=query_embedding,
    k_nearest_neighbors=5,
    fields="contentVector"
)

results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    top=5
)
```
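The generate_embedding helper above is not part of the Search SDK; a minimal sketch using the openai package against an Azure OpenAI deployment (the deployment name and API version are assumptions):
```python
from openai import AzureOpenAI

# Hypothetical Azure OpenAI configuration; endpoint, key, and
# deployment name must match your own resource
openai_client = AzureOpenAI(
    azure_endpoint=openai_resource_uri,
    api_key=openai_api_key,
    api_version="2024-02-01"
)

def generate_embedding(text: str) -> list[float]:
    """Return the embedding vector for a piece of text."""
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",  # Azure deployment name
        input=text
    )
    return response.data[0].embedding
```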
Detailed Explanation: Vector search enables finding semantically similar content even when exact keywords don't match, using embeddings to understand meaning and context.
Vector Search Benefits:
- Semantic Understanding: Finds content by meaning
- Language Agnostic: Works across languages
- Context Awareness: Understands context and relationships
- Synonym Handling: Finds related concepts automatically
Embedding Models:
- Azure OpenAI: text-embedding-ada-002, text-embedding-3-small/large
- Azure AI Vision: Multimodal embeddings for image and text retrieval
- Custom Models: Your own embedding models
Vector Dimensions:
- text-embedding-ada-002: 1536 dimensions
- text-embedding-3-small: 1536 dimensions
- text-embedding-3-large: 3072 dimensions
- Match dimensions in index field
Hybrid Search: Combine keyword and vector search in a single request:
```python
results = search_client.search(
    search_text="Azure AI Search",  # keyword component
    vector_queries=[vector_query],  # vector component
    top=10,
    query_type="semantic",          # optional semantic reranking
    semantic_configuration_name="my-semantic-config"
)
```
Best Practices:
- Use appropriate embedding models
- Match vector dimensions
- Consider hybrid search for best results
- Optimize HNSW parameters
- Test similarity thresholds
Q2.2: What is semantic search, and how do you enable it?
Answer: Semantic search provides AI-powered relevance ranking that understands query intent and content meaning. Enable semantic search:
Enable Semantic Ranker:
```python
from azure.search.documents.indexes.models import (
    SemanticConfiguration, SemanticPrioritizedFields, SemanticField, SemanticSearch
)

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="content")],
        keywords_fields=[SemanticField(field_name="category")]
    )
)

index = SearchIndex(
    name="my-index",
    fields=fields,
    semantic_search=SemanticSearch(configurations=[semantic_config])
)
```
Semantic Search Query:
```python
results = search_client.search(
    search_text="What is Azure AI Search?",
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",
    query_caption="extractive",  # generate captions
    query_answer="extractive",   # generate answers
    top=5
)

# Semantic answers are returned at the result-set level
for answer in results.get_answers() or []:
    print(f"Answer: {answer.text}")
    print(f"Highlights: {answer.highlights}")

# Captions are attached to each individual result
for result in results:
    print(f"Title: {result['title']}")
    for caption in result.get("@search.captions") or []:
        print(f"Caption: {caption.text}")
```
Detailed Explanation: Semantic search uses language understanding to improve search relevance, generating natural language answers and highlighting relevant passages.
Semantic Search Features:
- Relevance Ranking: AI-powered ranking
- Answer Generation: Natural language answers
- Caption Generation: Relevant passage highlights
- Query Understanding: Understands query intent
Semantic Configuration:
- Title Field: Field used for title
- Content Fields: Fields to search in
- Keywords Field: Field for keyword extraction
Query Options:
- query_type: "semantic" enables semantic ranking
- semantic_configuration_name: Name of the semantic configuration
- query_language: Query language (en-us, etc.)
- query_caption: "extractive" for passage highlights
- query_answer: "extractive" for answer generation
Best Practices:
- Configure semantic config properly
- Use meaningful title and content fields
- Specify query language
- Combine with vector search for best results
- Test and tune semantic configuration
Section 3: Document Intelligence (Form Recognizer)
Q3.1: What is Azure AI Document Intelligence, and what capabilities does it provide?
Answer: Azure AI Document Intelligence (formerly Form Recognizer) extracts structured data from documents using AI. Capabilities include:
Prebuilt Models:
- Invoice: Extract invoice data
- Receipt: Extract receipt information
- Business Card: Extract contact information
- ID Document: Extract ID information
- W-2 Form: Extract tax form data
- Vaccination Certificate: Extract vaccination records
Custom Models:
- Train custom document models
- Extract domain-specific information
- Label-based training
- Neural-based training
Layout Analysis:
- Text extraction with layout preservation
- Table extraction
- Selection mark detection
- Signature detection
Document Understanding:
- Key-value pair extraction
- Table extraction
- Structure preservation
- Multi-page document support
Detailed Explanation: Document Intelligence automates document processing by extracting structured information from forms, invoices, receipts, and other documents, reducing manual data entry.
Use Cases:
- Invoice processing
- Receipt digitization
- Form processing
- Document automation
- Compliance and auditing
- Data extraction from documents
Document Formats:
- PDF files
- Images (JPEG, PNG)
- TIFF files
- Multi-page documents
- Scanned documents
Supported Languages:
- Multiple languages for layout analysis
- Language-specific models for prebuilt models
- Custom model training supports various languages
Q3.2: How do you use prebuilt models to extract data from documents?
Answer: Use prebuilt models:
Invoice Model:
```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(api_key)
)

# Analyze invoice (the document body is passed positionally because
# the keyword name varies across SDK versions)
with open("invoice.pdf", "rb") as invoice_file:
    poller = client.begin_analyze_document(
        "prebuilt-invoice",
        invoice_file,
        content_type="application/octet-stream"
    )
result = poller.result()

def field_content(fields, name):
    """Safely read a field's text content; fields are DocumentField objects."""
    field = fields.get(name)
    return field.content if field else None

# Extract invoice data
for document in result.documents:
    f = document.fields
    print(f"Invoice ID: {field_content(f, 'InvoiceId')}")
    print(f"Vendor Name: {field_content(f, 'VendorName')}")
    print(f"Customer Name: {field_content(f, 'CustomerName')}")
    print(f"Total Amount: {field_content(f, 'InvoiceTotal')}")
    print(f"Due Date: {field_content(f, 'DueDate')}")
```
Receipt Model:
pythonwith open("receipt.jpg", "rb") as receipt_file: poller = client.begin_analyze_document( model_id="prebuilt-receipt", analyze_request=receipt_file, content_type="image/jpeg" ) result = poller.result() for document in result.documents: receipt = document.fields print(f"Merchant: {receipt.get('MerchantName')}") print(f"Date: {receipt.get('TransactionDate')}") print(f"Total: {receipt.get('Total')}") print(f"Items:") for item in receipt.get('Items', []): print(f" - {item.get('Description')}: {item.get('TotalPrice')}")Business Card Model:
pythonwith open("business-card.jpg", "rb") as card_file: poller = client.begin_analyze_document( model_id="prebuilt-businessCard", analyze_request=card_file, content_type="image/jpeg" ) result = poller.result() for document in result.documents: card = document.fields print(f"Name: {card.get('ContactNames')}") print(f"Company: {card.get('CompanyNames')}") print(f"Phone: {card.get('Phones')}") print(f"Email: {card.get('Emails')}") print(f"Address: {card.get('Addresses')}")Layout Model (General Document Analysis):
pythonwith open("document.pdf", "rb") as doc_file: poller = client.begin_analyze_document( model_id="prebuilt-layout", analyze_request=doc_file, content_type="application/pdf" ) result = poller.result() # Extract pages for page in result.pages: print(f"Page {page.page_number}: {page.width}x{page.height}") # Extract tables for table_idx, table in enumerate(result.tables): print(f"Table {table_idx}:") for row in table.rows: row_data = [cell.content for cell in row.cells] print(row_data) # Extract text for paragraph in result.paragraphs: print(f"Paragraph: {paragraph.content}")
Detailed Explanation: Prebuilt models provide ready-to-use document processing for common document types without training, enabling quick implementation of document intelligence solutions.
Prebuilt Models:
- Invoice: Vendor, customer, amounts, dates, line items
- Receipt: Merchant, date, items, totals, taxes
- Business Card: Contact info, company, addresses
- ID Document: IDs, passports, driver's licenses
- W-2: Tax form data extraction
- Vaccination Certificate: Vaccination records
- Layout: General text, tables, structure
Extracted Fields: Each prebuilt model extracts specific fields relevant to the document type, providing structured data ready for use in applications.
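Because each extracted field is a DocumentField with a confidence score, the data can be validated before use. A minimal sketch (the 0.8 threshold is an arbitrary assumption):
```python
def low_confidence_fields(document, threshold=0.8):
    """Return extracted fields whose confidence falls below a threshold."""
    flagged = []
    for name, field in document.fields.items():
        if field.confidence is not None and field.confidence < threshold:
            flagged.append((name, field.content, field.confidence))
    return flagged

for document in result.documents:
    for name, content, confidence in low_confidence_fields(document):
        print(f"Review needed - {name}: '{content}' (confidence {confidence:.2f})")
```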
Best Practices:
- Use appropriate model for document type
- Ensure document quality (resolution, orientation)
- Handle multi-page documents
- Validate extracted data
- Implement error handling
Q3.3: How do you train a custom document model?
Answer: Train a custom model:
Prepare Training Data:
```python
# Create a training dataset in Azure Blob Storage and label each
# document with the expected fields (Document Intelligence Studio
# is the usual labeling tool)
```
Create Document Model:
```python
from azure.ai.documentintelligence import DocumentIntelligenceAdministrationClient

admin_client = DocumentIntelligenceAdministrationClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(api_key)
)

# Labeling projects are created and managed in Document Intelligence
# Studio; the SDK builds models from labeled data in Blob Storage
```
Upload Training Documents:
```python
# Upload labeled documents to the Blob Storage container
# Use Document Intelligence Studio for labeling,
# or the REST API for batch upload
```
Train Model:
```python
from azure.ai.documentintelligence.models import (
    BuildDocumentModelRequest, AzureBlobContentSource, DocumentBuildMode
)

# Train a custom model from labeled documents in Blob Storage
poller = admin_client.begin_build_document_model(
    BuildDocumentModelRequest(
        model_id="my-model-001",
        build_mode=DocumentBuildMode.TEMPLATE,  # or DocumentBuildMode.NEURAL
        azure_blob_source=AzureBlobContentSource(
            container_url="https://storage.blob.core.windows.net/documents",
            prefix="training-data/"
        )
    )
)
model = poller.result()
print(f"Model ID: {model.model_id}")
```
Use Custom Model:
```python
# Analyze a document with the custom model
with open("document.pdf", "rb") as doc_file:
    poller = client.begin_analyze_document(
        "my-model-001",
        doc_file,
        content_type="application/octet-stream"
    )
result = poller.result()

for document in result.documents:
    # Extract custom fields
    for field_name, field in document.fields.items():
        print(f"{field_name}: {field.content} (confidence: {field.confidence})")
```
Detailed Explanation: Custom models enable extracting domain-specific information from documents by training on labeled examples, providing accurate extraction for unique document types.
Training Approaches:
Template-Based (Label-Based):
- Label fields in documents
- Train on labeled examples
- Good for structured forms
- Requires fewer examples
Neural-Based:
- Use neural models for learning
- Good for unstructured documents
- Requires more training data
- Better for complex layouts
Training Requirements:
- Minimum Examples: 5 labeled documents
- Recommended: 15-30 labeled documents
- Diverse Examples: Various formats and layouts
- Quality: High-quality, representative documents
Labeling Tools:
- Document Intelligence Studio: Web-based labeling tool
- REST API: Programmatic labeling
- Sample Labeling Tool: Open-source tool
Best Practices:
- Provide diverse training examples
- Ensure accurate labeling
- Test model before production
- Continuously improve with more examples
- Monitor model performance
Section 4: Knowledge Mining Patterns
Q4.1: What are common knowledge mining patterns with Azure AI Search?
Answer: Common knowledge mining patterns:
Content Discovery:
- Index diverse content sources
- Enable full-text and semantic search
- Provide faceted navigation
- Surface relevant content
Enterprise Search:
- Search across enterprise documents
- Enable knowledge workers to find information
- Integrate with existing systems
- Provide unified search experience
Document Intelligence:
- Extract structured data from documents
- Enable document-based Q&A
- Support compliance and auditing
- Automate document processing
Content Enrichment:
- Apply AI skills to content
- Extract entities and insights
- Enhance searchability
- Create searchable knowledge bases
E-commerce Search:
- Product catalog search
- Faceted navigation
- Recommendation systems
- Filter and sort capabilities
Question Answering:
- Document-based Q&A
- Knowledge base search
- Context-aware responses
- Multi-turn conversations
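For the e-commerce and content discovery patterns above, facets and filters are expressed directly in the query; a brief sketch (field names are illustrative):
```python
# Faceted product search: free text plus an OData filter, with facet counts
results = search_client.search(
    search_text="wireless headphones",
    filter="category eq 'Electronics' and price lt 200",
    facets=["brand", "price,interval:50"],
    order_by=["price asc"],
    top=20
)

# Facet counts drive the navigation UI
for facet in results.get_facets().get("brand", []):
    print(f"{facet['value']}: {facet['count']}")
```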
Detailed Explanation: Knowledge mining patterns leverage Azure AI Search capabilities to extract insights from content, enable discovery, and build intelligent applications.
Pattern Implementation:
- Data Ingestion: Connect to data sources
- Content Enrichment: Apply AI skills
- Indexing: Create searchable indexes
- Query Processing: Enable search and retrieval
- Applications: Build user interfaces
Integration Patterns:
- REST API: Direct API integration
- SDKs: Language-specific SDKs
- Azure Functions: Serverless integration
- Logic Apps: Workflow integration
Best Practices:
- Design indexes for query patterns
- Use appropriate enrichment skills
- Implement efficient query strategies
- Monitor and optimize performance
- Test with real user queries
Summary
This document covers key aspects of implementing knowledge mining and information extraction solutions, including Azure AI Search, Document Intelligence, vector search, and semantic search. Each topic is essential for success in the AI-102 exam and real-world knowledge mining implementations.