Create an Ontology
Journey: Create and Register an Ontology
Goal: Define a schema for document classification and data extraction
Prerequisites:
- Understanding of Key Concepts
- Valid JWT token for API authentication
- Knowledge of what data you want to extract from documents
Overview
An ontology defines:
- Taxonomy - Hierarchical labels for document classification
- Fields - Structured data fields to extract from documents
Extraction plugins use ontologies to know what data to extract and where to store it.
Step 1: Design Your Ontology Schema
First, decide what you're modeling. For this example, we'll create an Invoice Ontology.
Key Questions:
- What document types will you classify? (e.g., invoice, receipt, contract)
- What sub-categories exist? (e.g., utility_invoice, service_invoice)
- What data fields do you need? (e.g., total, date, vendor, line items)
- Which fields are required vs optional?
Our Example:
invoice
  - invoice_total (required)
  - invoice_date (required)
  - vendor_name (required)
  - utility_invoice (sub-type)
    - utility_type (required)
Step 2: Write Ontology YAML
Create invoice_ontology.yaml:
# Metadata
name: "Invoice Ontology"
description: "Schema for invoice document classification and data extraction"
version: "1.0.0"
# Taxonomy - Hierarchical labels
taxonomy:
  label: "invoice"
  description: "General invoice document"
  # Fields for invoice documents
  fields:
    - name: "invoice_total"
      description: "Total amount on the invoice"
      type: "number"
      required: true
    - name: "invoice_date"
      description: "Date the invoice was issued"
      type: "date"
      required: true
    - name: "invoice_number"
      description: "Invoice number or ID"
      type: "string"
      required: false
    - name: "vendor_name"
      description: "Name of the vendor or service provider"
      type: "string"
      required: true
    - name: "vendor_address"
      description: "Vendor's address"
      type: "string"
      required: false
    - name: "due_date"
      description: "Payment due date"
      type: "date"
      required: false
    - name: "line_items"
      description: "Individual line items on the invoice"
      type: "array"
      required: false
  # Child labels (sub-types of invoice)
  children:
    - label: "utility_invoice"
      description: "Invoice for utility services (electricity, water, gas)"
      # Additional fields specific to utility invoices
      fields:
        - name: "utility_type"
          description: "Type of utility (electricity, water, gas, internet)"
          type: "string"
          required: true
        - name: "account_number"
          description: "Utility account number"
          type: "string"
          required: false
        - name: "usage_amount"
          description: "Amount of utility consumed (kWh, gallons, etc.)"
          type: "number"
          required: false
        - name: "usage_period_start"
          description: "Start date of billing period"
          type: "date"
          required: false
        - name: "usage_period_end"
          description: "End date of billing period"
          type: "date"
          required: false
    - label: "service_invoice"
      description: "Invoice for professional services"
      fields:
        - name: "service_type"
          description: "Type of service provided"
          type: "string"
          required: true
        - name: "hourly_rate"
          description: "Hourly rate for service"
          type: "number"
          required: false
        - name: "hours_worked"
          description: "Total hours worked"
          type: "number"
          required: false
Step 3: Validate Your Schema
You can validate your ontology using the SDK CLI:
pip install bizsupply-sdk
bizsupply init ontology --name my_ontology # Scaffold a template
bizsupply validate my_ontology.yaml # Validate the schema
Check These Rules:
- Field Types: Must be one of string, number, date, boolean, array
- Label Names: Use snake_case (e.g., utility_invoice, not Utility Invoice)
- Hierarchy: Parent fields are inherited by children
- Required Fields: Mark critical fields as required: true
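The rules above can also be checked locally before uploading. The following is a minimal sketch, not the SDK's validator: it assumes the taxonomy has already been loaded into a plain Python dict with the same keys as the YAML, and the names check_taxonomy, ALLOWED_TYPES and SNAKE_CASE are hypothetical helpers introduced for illustration.

```python
import re

# Field types listed in the rules above.
ALLOWED_TYPES = {"string", "number", "date", "boolean", "array"}
# Lowercase words separated by single underscores, e.g. utility_invoice.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_taxonomy(node, path=""):
    """Return a list of problems found in a taxonomy node and its children."""
    problems = []
    label = node.get("label", "")
    here = f"{path}/{label}" if path else label
    if not SNAKE_CASE.match(label):
        problems.append(f"{here}: label is not snake_case")
    for field in node.get("fields", []):
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(
                f"{here}.{field.get('name')}: invalid type {field.get('type')!r}"
            )
    for child in node.get("children", []):
        problems.extend(check_taxonomy(child, here))
    return problems


# Deliberately broken example: an unsupported type and a non-snake_case label.
taxonomy = {
    "label": "invoice",
    "fields": [{"name": "invoice_total", "type": "decimal", "required": True}],
    "children": [{"label": "Utility Invoice", "fields": []}],
}
for problem in check_taxonomy(taxonomy):
    print(problem)
```

An empty list means the sketch found nothing to report; bizsupply validate remains the authoritative check.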
Flattened Labels:
The system flattens hierarchical labels into the full path from the root label:
invoice -> ["invoice"]
utility_invoice -> ["invoice", "utility_invoice"]
This allows querying documents by parent or child labels.
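To make the flattening concrete, here is a small sketch of how such paths could be computed from a taxonomy dict. It only illustrates the mapping shown above; how the system implements it internally is not documented here, and flatten_labels is a hypothetical name.

```python
def flatten_labels(node, ancestors=()):
    """Map each label to the list of labels from the root down to it."""
    path = [*ancestors, node["label"]]
    flattened = {node["label"]: path}
    for child in node.get("children", []):
        flattened.update(flatten_labels(child, path))
    return flattened


taxonomy = {
    "label": "invoice",
    "children": [{"label": "utility_invoice"}, {"label": "service_invoice"}],
}
paths = flatten_labels(taxonomy)
print(paths["invoice"])          # ['invoice']
print(paths["utility_invoice"])  # ['invoice', 'utility_invoice']
```

Because a child's path contains its parent label, a query for invoice also matches documents labeled utility_invoice.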
Step 4: Register Ontology via API
Request:
POST /api/v1/ontologies
Authorization: Bearer <your-jwt-token>
Content-Type: multipart/form-data
Fields:
- ontology_file: invoice_ontology.yaml
Using curl:
curl -X POST "https://api.bizsupply.com/api/v1/ontologies" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -F "ontology_file=@invoice_ontology.yaml"
Response:
{
  "ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V",
  "name": "Invoice Ontology",
  "message": "Ontology registered successfully"
}
Save the ontology_id - you'll need it for pipelines!
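The same upload can be prepared from Python. This sketch uses only the standard library and builds the multipart request without sending it; the endpoint, form field name and header come from the curl example above, while build_register_request, its parameters and the application/octet-stream part type are assumptions made for illustration.

```python
import urllib.request
import uuid


def build_register_request(yaml_bytes, token,
                           base_url="https://api.bizsupply.com",
                           filename="invoice_ontology.yaml"):
    """Build (but do not send) the multipart POST that uploads an ontology file."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="ontology_file"; '
        f'filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",  # assumed part type
        b"",
        yaml_bytes,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        f"{base_url}/api/v1/ontologies",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_register_request(b'name: "Invoice Ontology"\n', "YOUR_JWT_TOKEN")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; the JSON response shown above carries the ontology_id to store.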
Step 5: Verify Ontology Registration
Request:
GET /api/v1/ontologies
Authorization: Bearer <your-jwt-token>
Response:
{
  "count": 1,
  "ontologies": [
    {
      "ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V",
      "name": "Invoice Ontology",
      "description": "Schema for invoice document classification and data extraction",
      "version": "1.0.0",
      "taxonomy": {
        "label": "invoice",
        "fields": [
          {
            "name": "invoice_total",
            "type": "number",
            "required": true
          }
        ],
        "children": [
          {
            "label": "utility_invoice",
            "fields": [...]
          }
        ]
      }
    }
  ]
}
Step 6: Use Ontology in Extraction Plugin
Create an extraction plugin that uses your ontology. Install the SDK first:
pip install bizsupply-sdk
Create invoice_extractor.py:
from bizsupply_sdk import ExtractionPlugin, ExtractionResult


class InvoiceExtractorPlugin(ExtractionPlugin):
    """
    Extracts invoice data using ontology field definitions.
    The Engine resolves fields from the ontology based on document labels
    and injects them as the 'fields' parameter.
    """

    configurable_parameters = [
        {
            "parameter_name": "extraction_prompt_id",
            "parameter_type": "str",
            "default_value": None,
            "description": "Prompt ID for extraction",
        },
    ]

    async def extract(self, document, file_data, mime_type, fields, configs):
        """
        Extract invoice data from a single document.
        Args:
            document: The classified document
            file_data: Raw file bytes (pre-fetched by Engine)
            mime_type: MIME type of the file
            fields: OntologyFields resolved from document.labels by Engine
            configs: Runtime configuration
        Returns:
            ExtractionResult with extracted data
        """
        # Format fields for prompt (e.g., invoice_total, vendor_name, etc.)
        fields_json = self.format_fields_for_prompt(fields)
        prompt = f"""Extract the following invoice data from this document.
Fields to extract:
{fields_json}
Return a JSON object with the field names as keys and extracted values.
[See attached document file]"""
        # Call LLM with the document file
        result = await self.prompt_llm(
            prompt=prompt,
            file_data=file_data,
            mime_type=mime_type,
        )
        # Treat an empty or non-dict LLM response as "nothing extracted"
        if not result or not isinstance(result, dict):
            return ExtractionResult(data={})
        # Return ExtractionResult - Engine handles persistence
        return ExtractionResult(
            data=result,
            llm_fields=list(result.keys()),
        )
Validate your plugin before registering:
bizsupply validate invoice_extractor.py
Step 7: Test Ontology in Pipeline
Create a pipeline that uses your ontology:
Request:
POST /api/v1/pipelines
Authorization: Bearer <your-jwt-token>
Content-Type: application/json
{
  "name": "Invoice Processing Pipeline",
  "description": "Classify and extract invoice data",
  "plugin_ids": [
    "YOUR_SOURCE_PLUGIN_ID",
    "YOUR_CLASSIFIER_PLUGIN_ID",
    "YOUR_EXTRACTOR_PLUGIN_ID"
  ],
  "ontology_catalogs_ids": [
    "01HQZX3K4M2N5P7Q8R9S0T1U2V"
  ]
}
Response:
{
  "pipeline_id": "01HQZY5M7N8P9Q0R1S2T3U4V5W",
  "message": "Pipeline created successfully"
}
Step 8: Execute Pipeline and Verify Extraction
Execute Pipeline:
POST /api/v1/engine/execute/pipeline
Authorization: Bearer <your-jwt-token>
Content-Type: application/json
{
  "pipeline_id": "01HQZY5M7N8P9Q0R1S2T3U4V5W"
}
Check Results:
GET /api/v1/documents?labels=invoice
Authorization: Bearer <your-jwt-token>
Response:
{
  "count": 5,
  "documents": [
    {
      "id": "01HR001A2B3C4D5E6F7G8H9J0K",
      "metadata": {
        "source": "gmail",
        "subject": "Invoice #12345"
      },
      "labels": ["invoice"],
      "data": {
        "invoice_total": 1500.00,
        "invoice_date": "2025-10-15",
        "invoice_number": "INV-12345",
        "vendor_name": "Acme Corp",
        "vendor_address": "123 Main St, City, State 12345",
        "due_date": "2025-11-15"
      },
      "created_at": "2025-10-28T10:30:00Z"
    }
  ]
}
Your ontology is extracting structured data from documents.
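A quick way to spot-check the results is to confirm that every returned document carries the fields the ontology marks as required. The sketch below works on the response shape shown above, already parsed into a dict; missing_required and the REQUIRED list are hypothetical and would normally be derived from your own ontology.

```python
def missing_required(document, required_fields):
    """Return the required field names that are absent or null in document['data']."""
    data = document.get("data") or {}
    return [name for name in required_fields if data.get(name) is None]


# Required fields of the invoice label in this guide's example ontology.
REQUIRED = ["invoice_total", "invoice_date", "vendor_name"]

# Trimmed-down response with one incomplete document for illustration.
response = {
    "count": 1,
    "documents": [
        {
            "id": "01HR001A2B3C4D5E6F7G8H9J0K",
            "labels": ["invoice"],
            "data": {"invoice_total": 1500.00, "invoice_date": "2025-10-15"},
        }
    ],
}
for doc in response["documents"]:
    print(doc["id"], missing_required(doc, REQUIRED))
```

Documents that report missing fields are good candidates for reviewing the field descriptions or the extraction prompt.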
Step 9: Update Ontology (Optional)
To add fields or modify your ontology:
Request:
PUT /api/v1/ontologies/{ontology_id}
Authorization: Bearer <your-jwt-token>
Content-Type: multipart/form-data
Fields:
- ontology_file: updated_invoice_ontology.yaml
Note: The version auto-increments. Existing pipelines using this ontology will use the new version.
Field Type Reference
| Type | Description | Example |
|---|---|---|
| string | Text value | "Acme Corp" |
| number | Numeric value (int or float) | 1500.00 |
| date | ISO 8601 date string | "2025-10-15" |
| boolean | True/false value | true |
| array | List of values | ["item1", "item2"] |
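When post-processing LLM output, it can help to check each value against the declared type from the table above. This is a rough sketch of such a check for values already parsed from JSON; matches_type is a hypothetical helper, and the platform's own validation may be stricter or more lenient.

```python
from datetime import date


def matches_type(value, field_type):
    """Return True if a JSON-decoded value fits the given ontology field type."""
    if field_type == "string":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)  # accepts e.g. "2025-10-15"
        except ValueError:
            return False
        return True
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "array":
        return isinstance(value, list)
    raise ValueError(f"unsupported field type: {field_type!r}")


print(matches_type(1500.00, "number"))     # True
print(matches_type("15/10/2025", "date"))  # False
```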
Common Issues
Invalid Field Type
Error: Invalid field type: 'decimal'
Solution: Use number instead. Supported types: string, number, date, boolean, array
Label Hierarchy Not Working
Problem: Child labels not inheriting parent fields
Solution: Ensure child labels are nested under children: in parent taxonomy
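The effect of nesting can be illustrated with a short sketch that collects the fields visible to a label by walking down from the root. It is an illustration of the inheritance rule only, not the Engine's resolution logic, and resolve_fields is a hypothetical name.

```python
def resolve_fields(node, label, inherited=()):
    """Return field names for `label`, including inherited ones, or None if absent."""
    fields = [*inherited, *(f["name"] for f in node.get("fields", []))]
    if node["label"] == label:
        return fields
    for child in node.get("children", []):
        found = resolve_fields(child, label, fields)
        if found is not None:
            return found
    return None


taxonomy = {
    "label": "invoice",
    "fields": [{"name": "invoice_total"}, {"name": "vendor_name"}],
    "children": [
        {"label": "utility_invoice", "fields": [{"name": "utility_type"}]},
    ],
}
print(resolve_fields(taxonomy, "utility_invoice"))
# ['invoice_total', 'vendor_name', 'utility_type']
```

A label defined outside the parent's children: list has no ancestors in this walk, which is why it would see only its own fields.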
Fields Not Extracted
Check:
- Extraction plugin is querying ontology fields correctly
- Document has appropriate labels (classification ran first)
- LLM response schema matches ontology field types
- Ontology ID is included in the pipeline's ontology_catalogs_ids
Ontology Not Found in Pipeline
Verify:
GET /api/v1/ontologies/{ontology_id}
Authorization: Bearer <your-jwt-token>
Ensure ontology_id exists and belongs to your user/tenant.
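If a pipeline references several ontologies, the check can be done in one pass against the list response from Step 5. The sketch below assumes both payloads are already parsed into dicts with the shapes shown earlier in this guide; unknown_ontology_ids is a hypothetical helper.

```python
def unknown_ontology_ids(pipeline, list_response):
    """Return pipeline ontology IDs that do not appear in GET /api/v1/ontologies."""
    known = {o["ontology_id"] for o in list_response.get("ontologies", [])}
    return [
        oid for oid in pipeline.get("ontology_catalogs_ids", [])
        if oid not in known
    ]


list_response = {
    "count": 1,
    "ontologies": [{"ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V"}],
}
pipeline = {
    "name": "Invoice Processing Pipeline",
    "ontology_catalogs_ids": ["01HQZX3K4M2N5P7Q8R9S0T1U2V", "NOT_REGISTERED_ID"],
}
print(unknown_ontology_ids(pipeline, list_response))  # ['NOT_REGISTERED_ID']
```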
Best Practices
- Start Simple: Begin with core required fields, add optional fields later
- Clear Descriptions: Write detailed field descriptions - LLMs use these
- Use Hierarchies: Organize labels hierarchically (invoice -> utility_invoice)
- Version Carefully: Test ontology changes before updating production pipelines
- Required vs Optional: Only mark fields as required if they're critical
- Consistent Naming: Use snake_case for all labels and field names
Next Steps
- Create extraction plugin: Follow Create a Plugin
- Build complete pipeline: Follow Use Plugins
- Process documents: Follow Process Documents