Journey: Create and Register an Ontology

Goal: Define a schema for document classification and data extraction

Prerequisites:

Understanding of Key Concepts
Valid JWT token for API authentication
Knowledge of what data you want to extract from documents

Overview

An ontology defines:

Taxonomy - Hierarchical labels for document classification
Fields - Structured data fields to extract from documents

Extraction plugins use ontologies to know what data to extract and where to store it.

Step 1: Design Your Ontology Schema

First, decide what you're modeling. For this example, we'll create an Invoice Ontology.

Key Questions:

What document types will you classify? (e.g., invoice, receipt, contract)
What sub-categories exist? (e.g., utility_invoice, service_invoice)
What data fields do you need? (e.g., total, date, vendor, line items)
Which fields are required vs optional?

Our Example (abridged; the full schema in Step 2 adds optional fields and a service_invoice sub-type):

invoice
  - invoice_total (required)
  - invoice_date (required)
  - vendor_name (required)
  - utility_invoice (sub-type)
      - utility_type (required)

Step 2: Write Ontology YAML

Create invoice_ontology.yaml:

# Metadata
name: "Invoice Ontology"
description: "Schema for invoice document classification and data extraction"
version: "1.0.0"

# Taxonomy - Hierarchical labels
taxonomy:
  label: "invoice"
  description: "General invoice document"

  # Fields for invoice documents
  fields:
    - name: "invoice_total"
      description: "Total amount on the invoice"
      type: "number"
      required: true

    - name: "invoice_date"
      description: "Date the invoice was issued"
      type: "date"
      required: true

    - name: "invoice_number"
      description: "Invoice number or ID"
      type: "string"
      required: false

    - name: "vendor_name"
      description: "Name of the vendor or service provider"
      type: "string"
      required: true

    - name: "vendor_address"
      description: "Vendor's address"
      type: "string"
      required: false

    - name: "due_date"
      description: "Payment due date"
      type: "date"
      required: false

    - name: "line_items"
      description: "Individual line items on the invoice"
      type: "array"
      required: false

  # Child labels (sub-types of invoice)
  children:
    - label: "utility_invoice"
      description: "Invoice for utility services (electricity, water, gas)"

      # Additional fields specific to utility invoices
      fields:
        - name: "utility_type"
          description: "Type of utility (electricity, water, gas, internet)"
          type: "string"
          required: true

        - name: "account_number"
          description: "Utility account number"
          type: "string"
          required: false

        - name: "usage_amount"
          description: "Amount of utility consumed (kWh, gallons, etc.)"
          type: "number"
          required: false

        - name: "usage_period_start"
          description: "Start date of billing period"
          type: "date"
          required: false

        - name: "usage_period_end"
          description: "End date of billing period"
          type: "date"
          required: false

    - label: "service_invoice"
      description: "Invoice for professional services"

      fields:
        - name: "service_type"
          description: "Type of service provided"
          type: "string"
          required: true

        - name: "hourly_rate"
          description: "Hourly rate for service"
          type: "number"
          required: false

        - name: "hours_worked"
          description: "Total hours worked"
          type: "number"
          required: false

Step 3: Validate Your Schema

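You can validate your ontology using the SDK CLI. The commands below scaffold a fresh template and validate it; to check the file from Step 2, pass invoice_ontology.yaml to bizsupply validate instead: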

pip install bizsupply-sdk
bizsupply init ontology --name my_ontology   # Scaffold a template
bizsupply validate my_ontology.yaml          # Validate the schema

Check These Rules:

Field Types: Must be one of: string, number, date, boolean, array
Label Names: Use snake_case (e.g., utility_invoice, not Utility Invoice)
Hierarchy: Parent fields are inherited by children
Required Fields: Mark critical fields as required: true
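
To make the first two rules concrete, here is a minimal local check written in plain Python. It is only a sketch: check_taxonomy is a hypothetical helper, not part of the SDK, and it assumes the taxonomy has already been loaded into a nested dict with the same keys as the YAML (label, fields, children). The bizsupply validate command remains the authoritative check.

```python
import re

# Field types listed in the rules above
ALLOWED_TYPES = {"string", "number", "date", "boolean", "array"}
# Lowercase words separated by single underscores
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_taxonomy(node, path=""):
    """Return a list of human-readable problems found in a taxonomy node."""
    problems = []
    label = node.get("label", "")
    here = f"{path}/{label}" if path else label
    if not SNAKE_CASE.match(label):
        problems.append(f"{here}: label is not snake_case")
    for field in node.get("fields", []):
        name = field.get("name", "")
        if not SNAKE_CASE.match(name):
            problems.append(f"{here}.{name}: field name is not snake_case")
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(
                f"{here}.{name}: invalid field type {field.get('type')!r}"
            )
    # Recurse into child labels so the whole hierarchy is checked
    for child in node.get("children", []):
        problems.extend(check_taxonomy(child, here))
    return problems


# Deliberately broken example: unsupported type and a non-snake_case label
taxonomy = {
    "label": "invoice",
    "fields": [{"name": "invoice_total", "type": "decimal", "required": True}],
    "children": [{"label": "Utility Invoice", "fields": []}],
}
problems = check_taxonomy(taxonomy)
for problem in problems:
    print(problem)
```

Running a check like this before uploading can catch naming and type mistakes early, but it does not replace the CLI validation.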

Flattened Labels:

The system flattens each hierarchical label into the full path from the root label down to that label:

invoice -> ["invoice"]
utility_invoice -> ["invoice", "utility_invoice"]

Because a document classified as utility_invoice also carries the invoice label, you can query documents by either the parent or the child label.
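
The flattening can be sketched in a few lines of Python. This is an illustration of the idea only, not the Engine's actual implementation; flatten_labels is a hypothetical helper and the taxonomy is assumed to be a nested dict mirroring the YAML.

```python
def flatten_labels(node, ancestors=()):
    """Map every label in the taxonomy to the path from the root to that label."""
    path = [*ancestors, node["label"]]
    flattened = {node["label"]: path}
    for child in node.get("children", []):
        flattened.update(flatten_labels(child, path))
    return flattened


taxonomy = {
    "label": "invoice",
    "children": [{"label": "utility_invoice"}, {"label": "service_invoice"}],
}
flat = flatten_labels(taxonomy)

# A document classified as utility_invoice carries both labels,
# so a query for the parent label "invoice" also matches it.
document_labels = flat["utility_invoice"]
matches_parent_query = "invoice" in document_labels
```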

Step 4: Register Ontology via API

Request:

POST /api/v1/ontologies
Authorization: Bearer <your-jwt-token>
Content-Type: multipart/form-data

Fields:
- ontology_file: invoice_ontology.yaml

Using curl:

curl -X POST "https://api.bizsupply.com/api/v1/ontologies" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -F "ontology_file=@invoice_ontology.yaml"

Response:

{
  "ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V",
  "name": "Invoice Ontology",
  "message": "Ontology registered successfully"
}

Save the ontology_id; you'll need it when you create a pipeline in Step 7.
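
If you prefer Python to curl, the same upload can be prepared with the standard library alone. The sketch below only builds the request; build_register_request is a hypothetical helper, and the application/octet-stream part type is an assumption chosen for illustration. The URL and the ontology_file field name are taken from the request above.

```python
import urllib.request
import uuid

API_URL = "https://api.bizsupply.com/api/v1/ontologies"


def build_register_request(token, yaml_bytes, filename="invoice_ontology.yaml"):
    """Build (but do not send) the multipart POST that uploads an ontology file."""
    boundary = uuid.uuid4().hex
    # One multipart part named "ontology_file", followed by the closing boundary
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        (
            'Content-Disposition: form-data; name="ontology_file"; '
            f'filename="{filename}"'
        ).encode(),
        b"Content-Type: application/octet-stream",  # assumed part type
        b"",
        yaml_bytes,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_register_request("YOUR_JWT_TOKEN", b'name: "Invoice Ontology"\n')
# To send it and read the JSON response:
#   import json
#   with urllib.request.urlopen(request) as response:
#       ontology_id = json.load(response)["ontology_id"]
```

Keeping request construction separate from sending makes it easy to inspect the headers and body before any network call is made.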

Step 5: Verify Ontology Registration

Request:

GET /api/v1/ontologies
Authorization: Bearer <your-jwt-token>

Response:

{
  "count": 1,
  "ontologies": [
    {
      "ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V",
      "name": "Invoice Ontology",
      "description": "Schema for invoice document classification and data extraction",
      "version": "1.0.0",
      "taxonomy": {
        "label": "invoice",
        "fields": [
          {
            "name": "invoice_total",
            "type": "number",
            "required": true
          }
        ],
        "children": [
          {
            "label": "utility_invoice",
            "fields": [...]
          }
        ]
      }
    }
  ]
}
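
Once you have the listing, a short script can confirm that your ontology is present. The following sketch works on an already-parsed response; find_ontology and child_labels are hypothetical helpers, and the sample dict is a trimmed version of the response above.

```python
def find_ontology(listing, name):
    """Return the first ontology entry with the given name, or None."""
    for entry in listing.get("ontologies", []):
        if entry.get("name") == name:
            return entry
    return None


def child_labels(entry):
    """List the labels nested directly under the root taxonomy label."""
    return [child["label"] for child in entry.get("taxonomy", {}).get("children", [])]


# Trimmed sample of the GET /api/v1/ontologies response
listing = {
    "count": 1,
    "ontologies": [
        {
            "ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V",
            "name": "Invoice Ontology",
            "version": "1.0.0",
            "taxonomy": {
                "label": "invoice",
                "children": [{"label": "utility_invoice"}],
            },
        }
    ],
}

entry = find_ontology(listing, "Invoice Ontology")
if entry is None:
    print("Ontology not registered")
else:
    print(entry["ontology_id"], child_labels(entry))
```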

Step 6: Use Ontology in Extraction Plugin

Create an extraction plugin that uses your ontology and save it as invoice_extractor.py. Install the SDK first (skip this if you already installed it in Step 3):

pip install bizsupply-sdk

from bizsupply_sdk import ExtractionPlugin, ExtractionResult


class InvoiceExtractorPlugin(ExtractionPlugin):
    """
    Extracts invoice data using ontology field definitions.

    The Engine resolves fields from the ontology based on document labels
    and injects them as the 'fields' parameter.
    """

    configurable_parameters = [
        {
            "parameter_name": "extraction_prompt_id",
            "parameter_type": "str",
            "default_value": None,
            "description": "Prompt ID for extraction",
        },
    ]

    async def extract(self, document, file_data, mime_type, fields, configs):
        """
        Extract invoice data from a single document.

        Args:
            document: The classified document
            file_data: Raw file bytes (pre-fetched by Engine)
            mime_type: MIME type of the file
            fields: OntologyFields resolved from document.labels by Engine
            configs: Runtime configuration

        Returns:
            ExtractionResult with extracted data
        """
        # Format fields for prompt (e.g., invoice_total, vendor_name, etc.)
        fields_json = self.format_fields_for_prompt(fields)

        prompt = f"""Extract the following invoice data from this document.

Fields to extract:
{fields_json}

Return a JSON object with the field names as keys and extracted values.
[See attached document file]"""

        # Call LLM with the document file
        result = await self.prompt_llm(
            prompt=prompt,
            file_data=file_data,
            mime_type=mime_type,
        )

        # Treat an empty or non-dict LLM response as "nothing extracted"
        if not result or not isinstance(result, dict):
            return ExtractionResult(data={})

        # Return ExtractionResult - Engine handles persistence
        return ExtractionResult(
            data=result,
            llm_fields=list(result.keys()),
        )

Validate your plugin before registering:

bizsupply validate invoice_extractor.py

Step 7: Test Ontology in Pipeline

Create a pipeline that uses your ontology:

Request:

POST /api/v1/pipelines
Authorization: Bearer <your-jwt-token>
Content-Type: application/json

{
  "name": "Invoice Processing Pipeline",
  "description": "Classify and extract invoice data",
  "plugin_ids": [
    "YOUR_SOURCE_PLUGIN_ID",
    "YOUR_CLASSIFIER_PLUGIN_ID",
    "YOUR_EXTRACTOR_PLUGIN_ID"
  ],
  "ontology_catalogs_ids": [
    "01HQZX3K4M2N5P7Q8R9S0T1U2V"
  ]
}

Response:

{
  "pipeline_id": "01HQZY5M7N8P9Q0R1S2T3U4V5W",
  "message": "Pipeline created successfully"
}
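
As a small sketch of how the ontology_id from Step 4 feeds into this request, the hypothetical helper below assembles the same JSON body in Python. The helper name is an assumption made for illustration; the keys and placeholder plugin IDs are the ones shown in the request above.

```python
import json


def build_pipeline_payload(register_response, plugin_ids):
    """Assemble the pipeline request body from a Step 4 registration response."""
    return {
        "name": "Invoice Processing Pipeline",
        "description": "Classify and extract invoice data",
        # Plugin IDs in the order shown above: source, classifier, extractor
        "plugin_ids": list(plugin_ids),
        "ontology_catalogs_ids": [register_response["ontology_id"]],
    }


register_response = {
    "ontology_id": "01HQZX3K4M2N5P7Q8R9S0T1U2V",
    "name": "Invoice Ontology",
    "message": "Ontology registered successfully",
}
payload = build_pipeline_payload(
    register_response,
    ["YOUR_SOURCE_PLUGIN_ID", "YOUR_CLASSIFIER_PLUGIN_ID", "YOUR_EXTRACTOR_PLUGIN_ID"],
)
body = json.dumps(payload, indent=2)
print(body)
```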

Step 8: Execute Pipeline and Verify Extraction

Execute Pipeline:

POST /api/v1/engine/execute/pipeline
Authorization: Bearer <your-jwt-token>
Content-Type: application/json

{
  "pipeline_id": "01HQZY5M7N8P9Q0R1S2T3U4V5W"
}

Check Results:

GET /api/v1/documents?labels=invoice
Authorization: Bearer <your-jwt-token>

Response:

{
  "count": 5,
  "documents": [
    {
      "id": "01HR001A2B3C4D5E6F7G8H9J0K",
      "metadata": {
        "source": "gmail",
        "subject": "Invoice #12345"
      },
      "labels": ["invoice"],
      "data": {
        "invoice_total": 1500.00,
        "invoice_date": "2025-10-15",
        "invoice_number": "INV-12345",
        "vendor_name": "Acme Corp",
        "vendor_address": "123 Main St, City, State 12345",
        "due_date": "2025-11-15"
      },
      "created_at": "2025-10-28T10:30:00Z"
    }
  ]
}

If the data object of each document contains the fields defined in your ontology, extraction is working.
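
To check results in bulk rather than by eye, you could scan the parsed response for documents that are missing required fields. The sketch below assumes the response shape shown above; missing_required is a hypothetical helper, and the required field names come from the invoice ontology in Step 2.

```python
# Required fields on the root "invoice" label (from Step 2)
REQUIRED = ("invoice_total", "invoice_date", "vendor_name")


def missing_required(document, required=REQUIRED):
    """Return the required field names that are absent or empty in a document."""
    data = document.get("data", {})
    return [name for name in required if data.get(name) in (None, "")]


# Two sample documents: one complete, one missing vendor_name
response = {
    "count": 2,
    "documents": [
        {
            "id": "01HR001A2B3C4D5E6F7G8H9J0K",
            "labels": ["invoice"],
            "data": {
                "invoice_total": 1500.00,
                "invoice_date": "2025-10-15",
                "vendor_name": "Acme Corp",
            },
        },
        {
            "id": "example-incomplete-document",  # hypothetical ID
            "labels": ["invoice"],
            "data": {"invoice_total": 80.0, "invoice_date": "2025-10-20"},
        },
    ],
}

report = {doc["id"]: missing_required(doc) for doc in response["documents"]}
incomplete = {doc_id: gaps for doc_id, gaps in report.items() if gaps}
print(incomplete)
```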

Step 9: Update Ontology (Optional)

To add fields or modify your ontology:

Request:

PUT /api/v1/ontologies/{ontology_id}
Authorization: Bearer <your-jwt-token>
Content-Type: multipart/form-data

Fields:
- ontology_file: updated_invoice_ontology.yaml

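Note: The version auto-increments on each update. Existing pipelines that reference this ontology will use the new version, so test changes before updating an ontology that production pipelines depend on.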

Field Type Reference

Type	Description	Example
`string`	Text value	`"Acme Corp"`
`number`	Numeric value (int or float)	`1500.00`
`date`	ISO 8601 date string	`"2025-10-15"`
`boolean`	True/false value	`true`
`array`	List of values	`["item1", "item2"]`
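
The table can be read as a set of checks on extracted values. The sketch below shows one possible mapping from ontology types to Python checks; matches_type is a hypothetical helper, and the platform's own validation may be stricter or more lenient than this.

```python
from datetime import date


def matches_type(value, field_type):
    """Check whether an extracted value fits one of the ontology field types."""
    if field_type == "string":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "array":
        return isinstance(value, list)
    if field_type == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)  # accepts ISO 8601 dates such as 2025-10-15
        except ValueError:
            return False
        return True
    raise ValueError(f"Unsupported field type: {field_type!r}")


examples = [
    ("Acme Corp", "string"),
    (1500.00, "number"),
    ("2025-10-15", "date"),
    (True, "boolean"),
    (["item1", "item2"], "array"),
]
all_valid = all(matches_type(value, field_type) for value, field_type in examples)
```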

Common Issues

Invalid Field Type

Error: Invalid field type: 'decimal'

Solution: Use number instead. Supported types: string, number, date, boolean, array

Label Hierarchy Not Working

Problem: Child labels not inheriting parent fields

Solution: Ensure each child label is nested under the children: key of its parent label in the taxonomy.
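
To see which fields a label should end up with, you can model the inheritance rule locally. This is a sketch of the rule described in Step 3 (parent fields are inherited by children), not the Engine's resolution code; resolve_fields is a hypothetical helper operating on a nested dict that mirrors the YAML.

```python
def resolve_fields(node, label, inherited=()):
    """Return field names that apply to `label`, parent fields first.

    Returns None when the label does not exist anywhere below `node`.
    """
    fields = [*inherited, *(field["name"] for field in node.get("fields", []))]
    if node["label"] == label:
        return fields
    for child in node.get("children", []):
        found = resolve_fields(child, label, fields)
        if found is not None:
            return found
    return None


taxonomy = {
    "label": "invoice",
    "fields": [{"name": "invoice_total"}, {"name": "vendor_name"}],
    "children": [
        {"label": "utility_invoice", "fields": [{"name": "utility_type"}]},
    ],
}

utility_fields = resolve_fields(taxonomy, "utility_invoice")
print(utility_fields)
```

If a child label is not nested under children:, a lookup like this finds no path from the parent to it, which is the situation that produces missing inherited fields.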

Fields Not Extracted

Check:

Extraction plugin uses the fields parameter that the Engine injects (see Step 6)
Document has appropriate labels (classification ran first)
LLM response schema matches ontology field types
Ontology ID is included in pipeline ontology_catalogs_ids

Ontology Not Found in Pipeline

Verify:

GET /api/v1/ontologies/{ontology_id}
Authorization: Bearer <your-jwt-token>

Ensure ontology_id exists and belongs to your user/tenant.

Best Practices

Start Simple: Begin with core required fields, add optional fields later
Clear Descriptions: Write detailed field descriptions - LLMs use these
Use Hierarchies: Organize labels hierarchically (invoice -> utility_invoice)
Version Carefully: Test ontology changes before updating production pipelines
Required vs Optional: Only mark fields as required if they're critical
Consistent Naming: Use snake_case for all labels and field names

Next Steps

Create extraction plugin: Follow Create a Plugin
Build complete pipeline: Follow Use Plugins
Process documents: Follow Process Documents