Create an Ontology

Journey: Create and Register an Ontology

Goal: Define a schema for document classification and data extraction

Prerequisites:

  • Understanding of Key Concepts
  • Valid JWT token for API authentication
  • Knowledge of what data you want to extract from documents

Overview

An ontology defines:

  1. Taxonomy - Hierarchical labels for document classification
  2. Fields - Structured data fields to extract from documents

Extraction plugins use ontologies to know what data to extract and where to store it.


Step 1: Design Your Ontology Schema

First, decide what you're modeling. For this example, we'll create an Invoice Ontology.

Key Questions:

  • What document types will you classify? (e.g., invoice, receipt, contract)
  • What sub-categories exist? (e.g., utility_invoice, service_invoice)
  • What data fields do you need? (e.g., total, date, vendor, line items)
  • Which fields are required vs optional?

Our Example:

invoice
  - invoice_total (required)
  - invoice_date (required)
  - vendor_name (required)
  - utility_invoice (sub-type)
      - utility_type (required)
  - service_invoice (sub-type)
      - service_type (required)

Step 2: Write Ontology YAML

Create invoice_ontology.yaml:

# Metadata
name: "Invoice Ontology"
description: "Schema for invoice document classification and data extraction"
version: "1.0.0"

# Taxonomy - Hierarchical labels
taxonomy:
  label: "invoice"
  description: "General invoice document"

  # Fields for invoice documents
  fields:
    - name: "invoice_total"
      description: "Total amount on the invoice"
      type: "number"
      required: true

    - name: "invoice_date"
      description: "Date the invoice was issued"
      type: "date"
      required: true

    - name: "invoice_number"
      description: "Invoice number or ID"
      type: "string"
      required: false

    - name: "vendor_name"
      description: "Name of the vendor or service provider"
      type: "string"
      required: true

    - name: "vendor_address"
      description: "Vendor's address"
      type: "string"
      required: false

    - name: "due_date"
      description: "Payment due date"
      type: "date"
      required: false

    - name: "line_items"
      description: "Individual line items on the invoice"
      type: "array"
      required: false

  # Child labels (sub-types of invoice)
  children:
    - label: "utility_invoice"
      description: "Invoice for utility services (electricity, water, gas)"

      # Additional fields specific to utility invoices
      fields:
        - name: "utility_type"
          description: "Type of utility (electricity, water, gas, internet)"
          type: "string"
          required: true

        - name: "account_number"
          description: "Utility account number"
          type: "string"
          required: false

        - name: "usage_amount"
          description: "Amount of utility consumed (kWh, gallons, etc.)"
          type: "number"
          required: false

        - name: "usage_period_start"
          description: "Start date of billing period"
          type: "date"
          required: false

        - name: "usage_period_end"
          description: "End date of billing period"
          type: "date"
          required: false

    - label: "service_invoice"
      description: "Invoice for professional services"

      fields:
        - name: "service_type"
          description: "Type of service provided"
          type: "string"
          required: true

        - name: "hourly_rate"
          description: "Hourly rate for service"
          type: "number"
          required: false

        - name: "hours_worked"
          description: "Total hours worked"
          type: "number"
          required: false

Step 3: Validate Your Schema

You can validate your ontology using the SDK CLI:

pip install bizsupply-sdk
bizsupply init ontology --name my_ontology   # Scaffold a template
bizsupply validate my_ontology.yaml          # Validate the schema

Check These Rules:

  1. Field Types: Must be one of: string, number, date, boolean, array
  2. Label Names: Use snake_case (e.g., utility_invoice, not Utility Invoice)
  3. Hierarchy: Parent fields are inherited by children
  4. Required Fields: Mark critical fields as required: true
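
If you want a quick local check before running the CLI, rules 1 and 2 can be sketched in plain Python. This is an illustration only, not the SDK's validator: check_label is a hypothetical helper, and it assumes the YAML has already been loaded into nested dicts shaped like the taxonomy in Step 2.

```python
import re

# Supported field types, as listed in rule 1.
ALLOWED_TYPES = {"string", "number", "date", "boolean", "array"}
# snake_case: lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_label(node, path=()):
    """Hypothetical helper: list the problems in one taxonomy node and its children."""
    problems = []
    label = node.get("label", "")
    here = path + (label,)
    where = " > ".join(here)
    if not SNAKE_CASE.match(label):
        problems.append(f"{where}: label is not snake_case")
    for field in node.get("fields", []):
        name = field.get("name", "")
        if not SNAKE_CASE.match(name):
            problems.append(f"{where}.{name}: field name is not snake_case")
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(f"{where}.{name}: unsupported type {field.get('type')!r}")
    for child in node.get("children", []):
        problems.extend(check_label(child, here))
    return problems


# A deliberately broken taxonomy: one bad type, one bad label.
taxonomy = {
    "label": "invoice",
    "fields": [{"name": "invoice_total", "type": "decimal", "required": True}],
    "children": [{"label": "Utility Invoice", "fields": []}],
}
problems = check_label(taxonomy)
for problem in problems:
    print(problem)
```

The sample input triggers both rules: the decimal type is rejected, and the label "Utility Invoice" fails the snake_case check.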

Flattened Labels:

The system will flatten hierarchical labels:

  • invoice -> ["invoice"]
  • utility_invoice -> ["invoice", "utility_invoice"]

This allows querying documents by parent or child labels.
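
The flattening can be pictured with a short recursive walk. This is a sketch of the idea, not the system's implementation: flatten_labels is a hypothetical name, and the input is assumed to be the taxonomy loaded as nested dicts.

```python
def flatten_labels(node, ancestors=()):
    """Hypothetical helper: map each label to the path of labels from the root down to it."""
    path = list(ancestors) + [node["label"]]
    flattened = {node["label"]: path}
    for child in node.get("children", []):
        flattened.update(flatten_labels(child, path))
    return flattened


taxonomy = {
    "label": "invoice",
    "children": [{"label": "utility_invoice"}, {"label": "service_invoice"}],
}
flat = flatten_labels(taxonomy)
print(flat["invoice"])          # ['invoice']
print(flat["utility_invoice"])  # ['invoice', 'utility_invoice']
```

Because a utility_invoice document carries both labels, a query for invoice also matches it.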


Step 4: Register Ontology via API

Request:

POST /api/v1/ontologies
Authorization: Bearer <your-jwt-token>
Content-Type: multipart/form-data

Fields:
- ontology_file: invoice_ontology.yaml

Using curl:

curl -X POST "https://api.bizsupply.com/api/v1/ontologies" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -F "ontology_file=@invoice_ontology.yaml"
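
If you prefer Python to curl, the same upload can be sketched with the standard library. This is an illustration under assumptions: build_register_request is a hypothetical helper, the multipart body is assembled by hand, and the part's content type (application/octet-stream) is a guess rather than a documented requirement.

```python
import urllib.request
import uuid

API_URL = "https://api.bizsupply.com/api/v1/ontologies"  # same URL as the curl example


def build_register_request(token, yaml_bytes, filename="invoice_ontology.yaml"):
    """Hypothetical helper: build (but do not send) the multipart upload request."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="ontology_file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"  # assumed part content type
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=head + yaml_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# In practice, read the bytes from invoice_ontology.yaml instead.
request = build_register_request("YOUR_JWT_TOKEN", b'name: "Invoice Ontology"\n')
print(request.get_method(), request.full_url)
```

To send it, pass the request to urllib.request.urlopen and parse the JSON body of the response.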

Response:

{
  "ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V",
  "name": "Invoice Ontology",
  "message": "Ontology registered successfully"
}

Save the ontology_id. You'll need it when you create a pipeline in Step 7.


Step 5: Verify Ontology Registration

Request:

GET /api/v1/ontologies
Authorization: Bearer <your-jwt-token>

Response:

{
  "count": 1,
  "ontologies": [
    {
      "ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V",
      "name": "Invoice Ontology",
      "description": "Schema for invoice document classification and data extraction",
      "version": "1.0.0",
      "taxonomy": {
        "label": "invoice",
        "fields": [
          {
            "name": "invoice_total",
            "type": "number",
            "required": true
          }
        ],
        "children": [
          {
            "label": "utility_invoice",
            "fields": [...]
          }
        ]
      }
    }
  ]
}
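
Once you have the list response as a parsed JSON object, a few lines of Python can confirm that your ontology is present and show which labels it defines. A minimal sketch: find_ontology and collect_labels are hypothetical helpers, and the sample data is a trimmed copy of the response above.

```python
def find_ontology(listing, name):
    """Hypothetical helper: return the first ontology with a matching name, or None."""
    for ontology in listing.get("ontologies", []):
        if ontology.get("name") == name:
            return ontology
    return None


def collect_labels(node):
    """Hypothetical helper: list a taxonomy node's label followed by its descendants' labels."""
    labels = [node["label"]]
    for child in node.get("children", []):
        labels.extend(collect_labels(child))
    return labels


# Trimmed copy of the list response.
listing = {
    "count": 1,
    "ontologies": [
        {
            "ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V",
            "name": "Invoice Ontology",
            "taxonomy": {
                "label": "invoice",
                "children": [{"label": "utility_invoice"}],
            },
        }
    ],
}
ontology = find_ontology(listing, "Invoice Ontology")
print(ontology["ontology_id"], collect_labels(ontology["taxonomy"]))
```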

Step 6: Use Ontology in Extraction Plugin

Create an extraction plugin that uses your ontology. Install the SDK first:

pip install bizsupply-sdk

Then create invoice_extractor.py:

from bizsupply_sdk import ExtractionPlugin, ExtractionResult


class InvoiceExtractorPlugin(ExtractionPlugin):
    """
    Extracts invoice data using ontology field definitions.

    The Engine resolves fields from the ontology based on document labels
    and injects them as the 'fields' parameter.
    """

    configurable_parameters = [
        {
            "parameter_name": "extraction_prompt_id",
            "parameter_type": "str",
            "default_value": None,
            "description": "Prompt ID for extraction",
        },
    ]

    async def extract(self, document, file_data, mime_type, fields, configs):
        """
        Extract invoice data from a single document.

        Args:
            document: The classified document
            file_data: Raw file bytes (pre-fetched by Engine)
            mime_type: MIME type of the file
            fields: OntologyFields resolved from document.labels by Engine
            configs: Runtime configuration

        Returns:
            ExtractionResult with extracted data
        """
        # Format fields for prompt (e.g., invoice_total, vendor_name, etc.)
        fields_json = self.format_fields_for_prompt(fields)

        prompt = f"""Extract the following invoice data from this document.

Fields to extract:
{fields_json}

Return a JSON object with the field names as keys and extracted values.
[See attached document file]"""

        # Call LLM with the document file
        result = await self.prompt_llm(
            prompt=prompt,
            file_data=file_data,
            mime_type=mime_type,
        )

        if not result or not isinstance(result, dict):
            return ExtractionResult(data={})

        # Return ExtractionResult - Engine handles persistence
        return ExtractionResult(
            data=result,
            llm_fields=list(result.keys()),
        )

Validate your plugin before registering:

bizsupply validate invoice_extractor.py
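
LLMs sometimes return keys that the ontology does not define. One way to guard against that before building the ExtractionResult is sketched below. It is not part of the SDK: keep_known_fields is a hypothetical helper, and it assumes the field definitions are available as a list of dicts with a name key, which may differ from the real shape of the fields parameter.

```python
def keep_known_fields(result, field_defs):
    """Hypothetical helper: keep only keys defined in the ontology and report the rest."""
    known = {field["name"] for field in field_defs}
    kept = {key: value for key, value in result.items() if key in known}
    dropped = sorted(set(result) - known)
    return kept, dropped


# Assumed shape of the field definitions (not confirmed by the SDK).
field_defs = [
    {"name": "invoice_total", "type": "number", "required": True},
    {"name": "vendor_name", "type": "string", "required": True},
]
llm_result = {"invoice_total": 1500.0, "vendor_name": "Acme Corp", "confidence": "high"}

kept, dropped = keep_known_fields(llm_result, field_defs)
print(kept)     # {'invoice_total': 1500.0, 'vendor_name': 'Acme Corp'}
print(dropped)  # ['confidence']
```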

Step 7: Test Ontology in Pipeline

Create a pipeline that uses your ontology:

Request:

POST /api/v1/pipelines
Authorization: Bearer <your-jwt-token>
Content-Type: application/json

{
  "name": "Invoice Processing Pipeline",
  "description": "Classify and extract invoice data",
  "plugin_ids": [
    "YOUR_SOURCE_PLUGIN_ID",
    "YOUR_CLASSIFIER_PLUGIN_ID",
    "YOUR_EXTRACTOR_PLUGIN_ID"
  ],
  "ontology_catalogs_ids": [
    "01HQZX3K4M2N5P7Q8R9S0T1U2V"
  ]
}

Response:

{
  "pipeline_id": "01HQZY5M7N8P9Q0R1S2T3U4V5W",
  "message": "Pipeline created successfully"
}
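
The same request can be assembled in Python, which makes it easy to pass along the ontology_id saved in Step 4. A sketch under assumptions: build_pipeline_request is a hypothetical helper, and the host is taken from the curl example in Step 4.

```python
import json
import urllib.request

BASE_URL = "https://api.bizsupply.com"  # host from the curl example in Step 4 (assumed to apply here)


def build_pipeline_request(token, plugin_ids, ontology_id):
    """Hypothetical helper: build (but do not send) the pipeline creation request."""
    payload = {
        "name": "Invoice Processing Pipeline",
        "description": "Classify and extract invoice data",
        "plugin_ids": list(plugin_ids),
        "ontology_catalogs_ids": [ontology_id],
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/pipelines",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_pipeline_request(
    "YOUR_JWT_TOKEN",
    ["YOUR_SOURCE_PLUGIN_ID", "YOUR_CLASSIFIER_PLUGIN_ID", "YOUR_EXTRACTOR_PLUGIN_ID"],
    "01HQZX3K4M2N5P7Q8R9S0T1U2V",
)
print(json.loads(request.data)["ontology_catalogs_ids"])
```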

Step 8: Execute Pipeline and Verify Extraction

Execute Pipeline:

POST /api/v1/engine/execute/pipeline
Authorization: Bearer <your-jwt-token>
Content-Type: application/json

{
  "pipeline_id": "01HQZY5M7N8P9Q0R1S2T3U4V5W"
}

Check Results:

GET /api/v1/documents?labels=invoice
Authorization: Bearer <your-jwt-token>

Response:

{
  "count": 5,
  "documents": [
    {
      "id": "01HR001A2B3C4D5E6F7G8H9J0K",
      "metadata": {
        "source": "gmail",
        "subject": "Invoice #12345"
      },
      "labels": ["invoice"],
      "data": {
        "invoice_total": 1500.00,
        "invoice_date": "2025-10-15",
        "invoice_number": "INV-12345",
        "vendor_name": "Acme Corp",
        "vendor_address": "123 Main St, City, State 12345",
        "due_date": "2025-11-15"
      },
      "created_at": "2025-10-28T10:30:00Z"
    }
  ]
}
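
To check the results in bulk, you can scan each returned document for missing required fields. A minimal sketch: missing_required is a hypothetical helper, the required list is copied from the invoice label in Step 2, and the second sample document is invented to show a failure.

```python
# Required fields on the invoice label in Step 2.
REQUIRED = ["invoice_total", "invoice_date", "vendor_name"]


def missing_required(document, required=REQUIRED):
    """Hypothetical helper: list required fields that are absent or empty in a document."""
    data = document.get("data", {})
    return [name for name in required if data.get(name) in (None, "")]


documents = [
    {
        "id": "01HR001A2B3C4D5E6F7G8H9J0K",
        "data": {
            "invoice_total": 1500.00,
            "invoice_date": "2025-10-15",
            "vendor_name": "Acme Corp",
        },
    },
    # Invented example of an incomplete extraction.
    {"id": "EXAMPLE_INCOMPLETE_DOC", "data": {"invoice_total": 80.0}},
]
report = {doc["id"]: missing_required(doc) for doc in documents}
for doc_id, missing in report.items():
    print(doc_id, "OK" if not missing else f"missing: {missing}")
```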

If each document's data object contains the fields defined in your ontology, extraction is working end to end.


Step 9: Update Ontology (Optional)

To add fields or modify your ontology:

Request:

PUT /api/v1/ontologies/{ontology_id}
Authorization: Bearer <your-jwt-token>
Content-Type: multipart/form-data

Fields:
- ontology_file: updated_invoice_ontology.yaml

Note: The version auto-increments on each update. Existing pipelines that reference this ontology will use the new version.


Field Type Reference

Type      Description                    Example
string    Text value                     "Acme Corp"
number    Numeric value (int or float)   1500.00
date      ISO 8601 date string           "2025-10-15"
boolean   True/false value               true
array     List of values                 ["item1", "item2"]
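
When debugging extracted values, it can help to check them against these types in Python. The sketch below is one possible interpretation, not the platform's own validation: matches_type is a hypothetical helper, and it treats a date as a string that Python's date.fromisoformat accepts.

```python
from datetime import date


def matches_type(value, field_type):
    """Hypothetical helper: check a parsed JSON value against an ontology field type."""
    if field_type == "string":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "array":
        return isinstance(value, list)
    if field_type == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)
        except ValueError:
            return False
        return True
    raise ValueError(f"unsupported field type: {field_type!r}")


print(matches_type(1500.00, "number"))      # True
print(matches_type("2025-10-15", "date"))   # True
print(matches_type("15/10/2025", "date"))   # False
print(matches_type(True, "number"))         # False
```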

Common Issues

Invalid Field Type

Error: Invalid field type: 'decimal'

Solution: Use number instead. Supported types: string, number, date, boolean, array

Label Hierarchy Not Working

Problem: Child labels not inheriting parent fields

Solution: Ensure child labels are nested under the parent's children: key in the taxonomy.
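
To see why nesting matters, here is a sketch of how inherited fields could be resolved by walking down the taxonomy. It illustrates the idea only and is not the Engine's actual logic: resolve_fields is a hypothetical helper operating on the taxonomy loaded as nested dicts.

```python
def resolve_fields(node, label, inherited=()):
    """Hypothetical helper: return field names for `label`, own plus ancestors', or None."""
    fields = list(inherited) + [field["name"] for field in node.get("fields", [])]
    if node["label"] == label:
        return fields
    for child in node.get("children", []):
        found = resolve_fields(child, label, fields)
        if found is not None:
            return found
    return None  # label is not reachable from this node


taxonomy = {
    "label": "invoice",
    "fields": [{"name": "invoice_total"}],
    "children": [
        {"label": "utility_invoice", "fields": [{"name": "utility_type"}]},
    ],
}
print(resolve_fields(taxonomy, "utility_invoice"))  # ['invoice_total', 'utility_type']
print(resolve_fields(taxonomy, "receipt"))          # None
```

A label that is not nested under children: is never reached by the walk, so it picks up no parent fields.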

Fields Not Extracted

Check:

  1. Extraction plugin uses the fields parameter injected by the Engine (see Step 6)
  2. Document has appropriate labels (classification ran first)
  3. LLM response schema matches ontology field types
  4. Ontology ID is included in pipeline ontology_catalogs_ids

Ontology Not Found in Pipeline

Verify:

GET /api/v1/ontologies/{ontology_id}
Authorization: Bearer <your-jwt-token>

Ensure the ontology_id exists and belongs to your user or tenant.


Best Practices

  1. Start Simple: Begin with core required fields, add optional fields later
  2. Clear Descriptions: Write detailed field descriptions - LLMs use these
  3. Use Hierarchies: Organize labels hierarchically (invoice -> utility_invoice)
  4. Version Carefully: Test ontology changes before updating production pipelines
  5. Required vs Optional: Only mark fields as required if they're critical
  6. Consistent Naming: Use snake_case for all labels and field names

Next Steps