Create an Extraction Plugin

Goal: Build a plugin that extracts structured data from documents (total, date, vendor, line items, etc.)

Use When: You need to pull specific data fields from documents after they've been classified.


What Extraction Plugins Do

Extraction plugins analyze classified documents and extract structured data based on ontology field definitions. For example:

  • From invoices: total, date, vendor_name, line_items
  • From contracts: parties, effective_date, termination_clause
  • From receipts: merchant, amount, payment_method

Key Method: Return ExtractionResult from extract() - the Engine handles field resolution and persistence.


Prerequisites

Before starting:

  • Install the SDK: pip install bizsupply-sdk
  • Read Plugin Interface Specification
  • Documents must be classified first (extraction uses labels to find field definitions)
  • Create an Ontology with field definitions for your document types
  • Create a Prompt for extraction instructions

How Extraction Works

1. Document has labels: ["invoice"]
2. Engine resolves fields from ontology for those labels
3. Engine fetches document file_data and mime_type
4. Engine calls extract(document, file_data, mime_type, fields, configs)
5. Plugin builds prompt with fields, calls LLM, returns ExtractionResult
6. Engine persists the extracted data to the database
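
The field-resolution step (2) can be pictured as a walk over the ontology taxonomy that collects the fields attached to each label the document carries. This is a simplified sketch of the idea, not the Engine's actual code:

```python
# Simplified sketch of how the Engine might resolve fields from labels.
# The real resolution logic lives inside the Engine; this only illustrates
# the taxonomy walk.

def resolve_fields(taxonomy: dict, labels: list[str]) -> list[dict]:
    """Collect field definitions from every taxonomy node whose label matches."""
    resolved = []

    def walk(node: dict) -> None:
        if node.get("label") in labels:
            resolved.extend(node.get("fields", []))
        for child in node.get("children", []):
            walk(child)

    walk(taxonomy)
    return resolved

taxonomy = {
    "label": "document",
    "children": [
        {
            "label": "invoice",
            "fields": [
                {"name": "invoice_total", "dtype": "number"},
                {"name": "vendor_name", "dtype": "string"},
            ],
        }
    ],
}

fields = resolve_fields(taxonomy, labels=["invoice"])
# fields now holds the two invoice field definitions
```

These resolved fields are what arrives in your plugin's extract() as the fields argument.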

Step 1: Create the Plugin Code

Create invoice_extractor.py:

from bizsupply_sdk import ExtractionPlugin, ExtractionResult


class InvoiceExtractorPlugin(ExtractionPlugin):
    """
    Extracts structured data from classified documents.

    The Engine calls extract() with pre-resolved fields based on document labels.
    Just return ExtractionResult - Engine handles persistence!
    """

    # Define configurable parameters as class attributes
    configurable_parameters = [
        {
            "parameter_name": "extraction_prompt_id",
            "parameter_type": "str",
            "default_value": None,
            "description": "Prompt ID for extraction (REQUIRED)",
        },
        {
            "parameter_name": "confidence_threshold",
            "parameter_type": "float",
            "default_value": 0.8,
            "description": "Minimum confidence for extracted values",
        },
    ]

    async def extract(
        self,
        document,
        file_data,
        mime_type,
        fields,
        configs,
    ):
        """
        Extract data from a single document.

        Args:
            document: The document to extract from (already classified)
            file_data: Raw file bytes (Gemini reads PDFs directly)
            mime_type: MIME type of file_data
            fields: OntologyFields to extract (resolved from document.labels)
            configs: Runtime configuration (prompt IDs, thresholds, etc.)

        Returns:
            ExtractionResult with extracted data
        """
        self.logger.info(
            f"Extracting from {document.document_id}, "
            f"labels={document.labels}, fields={len(fields)}"
        )

        # Get configuration
        prompt_id = configs.get("extraction_prompt_id")
        if not prompt_id:
            self.logger.error("'extraction_prompt_id' config is required")
            raise ValueError("extraction_prompt_id configuration is required")

        prompt_template = await self.get_prompt(prompt_id)

        # Build field definitions for prompt
        fields_json = self.format_fields_for_prompt(fields)

        # Build prompt and call LLM
        prompt = prompt_template.format(
            fields=fields_json,
            document_content="[See attached document file]"
        )

        llm_response = await self.prompt_llm(
            prompt=prompt,
            file_data=file_data,
            mime_type=mime_type,
        )

        if not llm_response:
            self.logger.error(f"Empty LLM response for {document.document_id}")
            return ExtractionResult(data={})

        if isinstance(llm_response, dict):
            extracted_data = llm_response
        else:
            self.logger.error(f"Expected dict, got {type(llm_response)}")
            return ExtractionResult(data={})

        self.logger.info(
            f"Extracted {len(extracted_data)} fields from {document.document_id}"
        )

        # Return ExtractionResult - Engine handles persistence
        return ExtractionResult(
            data=extracted_data,
            llm_fields=list(extracted_data.keys()),
            document_type=document.labels[-1] if document.labels else None,
        )
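
Note that the example declares a confidence_threshold parameter but never applies it. If your extraction prompt asks the LLM to return a per-field confidence alongside each value, the threshold could be applied before building the ExtractionResult. The response shape below is an assumption for illustration, not part of the SDK contract:

```python
# Hypothetical post-processing step, assuming the LLM is prompted to return
# {"field": {"value": ..., "confidence": ...}} instead of bare values.

def apply_threshold(response: dict, threshold: float) -> dict:
    """Keep only values whose confidence meets the threshold."""
    return {
        name: entry["value"]
        for name, entry in response.items()
        if entry.get("confidence", 0.0) >= threshold
    }

llm_response = {
    "invoice_total": {"value": 1500.00, "confidence": 0.95},
    "vendor_name": {"value": "ACME Corp", "confidence": 0.55},
}

extracted_data = apply_threshold(llm_response, threshold=0.8)
# only invoice_total survives the 0.8 cutoff
```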

Step 2: Create an Ontology with Fields

The extraction plugin needs an ontology that defines what fields to extract:

POST /ontologies
Content-Type: application/json

{
  "name": "Invoice Ontology",
  "taxonomy": {
    "label": "document",
    "children": [
      {
        "label": "invoice",
        "fields": [
          {
            "name": "invoice_total",
            "dtype": "number",
            "description": "Total amount due"
          },
          {
            "name": "invoice_date",
            "dtype": "date",
            "description": "Date on the invoice"
          },
          {
            "name": "vendor_name",
            "dtype": "string",
            "description": "Name of the vendor/supplier"
          },
          {
            "name": "line_items",
            "dtype": "array",
            "description": "List of items with description, quantity, price"
          }
        ]
      }
    ]
  }
}
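
Inside the plugin, self.format_fields_for_prompt(fields) turns these definitions into text the LLM can read. Its exact output is SDK-internal; a plausible stand-in is simply the resolved field list serialized as JSON:

```python
import json

# Illustrative stand-in for self.format_fields_for_prompt(fields): serialize
# the field definitions so they can fill the {fields} slot in the prompt.
# The actual SDK formatting may differ.
fields = [
    {"name": "invoice_total", "dtype": "number", "description": "Total amount due"},
    {"name": "invoice_date", "dtype": "date", "description": "Date on the invoice"},
]

fields_json = json.dumps(fields, indent=2)
```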

Step 3: Create an Extraction Prompt

POST /prompts
Content-Type: application/json

{
  "name": "Invoice Extraction Prompt",
  "prompt": "Extract the following fields from this document.\n\nFields to extract:\n{fields}\n\nDocument:\n{document_content}\n\nRespond with JSON containing only the requested field names as keys."
}
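
Because the plugin fills this template with Python's str.format, the {fields} and {document_content} placeholders must appear in the prompt exactly as written. A quick check of the substitution:

```python
# The prompt template above, substituted the same way the plugin does in
# Step 1 via prompt_template.format(...).
prompt_template = (
    "Extract the following fields from this document.\n\n"
    "Fields to extract:\n{fields}\n\n"
    "Document:\n{document_content}\n\n"
    "Respond with JSON containing only the requested field names as keys."
)

prompt = prompt_template.format(
    fields='[{"name": "invoice_total", "dtype": "number"}]',
    document_content="[See attached document file]",
)
```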

Step 4: Validate and Register

Validate your plugin before registering:

bizsupply validate invoice_extractor.py

Then register:

curl -X POST "https://api.bizsupply.com/api/v1/plugins" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "name=Invoice Extractor" \
  -F "description=Extracts structured data from classified documents" \
  -F "code_file=@invoice_extractor.py"

The plugin type and configurable parameters are automatically extracted from your code.


Key Methods

Method | Purpose
------ | -------
await self.prompt_llm(prompt, file_data, mime_type) | Call LLM for extraction
await self.get_prompt(prompt_id) | Load prompt template
self.format_fields_for_prompt(fields) | Format OntologyFields as JSON for prompts
self.logger | Plugin-specific logger

The Engine handles document content fetching, field resolution from the ontology, and data persistence automatically.


Field Types

Ontology fields support these types:

Type | Description | Example Value
---- | ----------- | -------------
string | Text value | "ACME Corp"
number | Numeric value | 1500.00
date | Date value | "2025-01-15"
boolean | True/false | true
array | List of objects | [{"item": "Widget", "qty": 10}]
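
A lightweight way to sanity-check LLM output against these dtypes before returning it. This is an optional sketch of how the ontology dtypes map onto Python types, not something the SDK requires:

```python
# Illustrative dtype check for extracted values.
PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "date": str,        # dates are exchanged as ISO-8601 strings
    "boolean": bool,
    "array": list,
}

def matches_dtype(value, dtype: str) -> bool:
    expected = PYTHON_TYPES.get(dtype)
    # bool is a subclass of int, so guard "number" against True/False
    if dtype == "number" and isinstance(value, bool):
        return False
    return expected is not None and isinstance(value, expected)

assert matches_dtype("ACME Corp", "string")
assert matches_dtype(1500.00, "number")
assert matches_dtype([{"item": "Widget", "qty": 10}], "array")
```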

Common Mistakes

Wrong Return Type

# WRONG - extract() must return ExtractionResult, not dict
return {"invoice_total": 1500.00}

# CORRECT
from bizsupply_sdk import ExtractionResult

return ExtractionResult(data={"invoice_total": 1500.00})

Using Old execute() Method

# WRONG - old v1.0 API
async def execute(self, context: PluginContext):
    for doc in context.documents:
        fields = self.get_ontology_fields_by_labels(doc.labels, context.ontologies)
        content = await self.get_document_content(doc)
        # ... extract ...
        await self.add_document_data(doc, extracted_data)
    return context.documents

# CORRECT - implement extract(), return ExtractionResult
async def extract(self, document, file_data, mime_type, fields, configs):
    # fields are pre-resolved by the Engine
    # file_data is pre-fetched by the Engine
    result = await self.prompt_llm(prompt=..., file_data=file_data, mime_type=mime_type)
    return ExtractionResult(data=result or {})

Missing SDK Import

# WRONG
class MyPlugin(ExtractionPlugin):  # NameError!
    ...

# CORRECT
from bizsupply_sdk import ExtractionPlugin, ExtractionResult

class MyPlugin(ExtractionPlugin):
    ...

Pipeline Order

Extraction plugins should run AFTER classification:

Pipeline:
  1. Classification Plugin -> assigns labels
  2. Extraction Plugin -> extracts data based on labels

Next Steps