Create an Extraction Plugin
Goal: Build a plugin that extracts structured data from documents (total, date, vendor, line items, etc.)
Use When: You need to pull specific data fields from documents after they've been classified.
What Extraction Plugins Do
Extraction plugins analyze classified documents and extract structured data based on ontology field definitions. For example:
- From invoices: total, date, vendor_name, line_items
- From contracts: parties, effective_date, termination_clause
- From receipts: merchant, amount, payment_method
Key Method: Return ExtractionResult from extract() - the Engine handles field resolution and persistence.
Prerequisites
Before starting:
- Install the SDK: pip install bizsupply-sdk
- Read the Plugin Interface Specification
- Documents must be classified first (extraction uses labels to find field definitions)
- Create an Ontology with field definitions for your document types
- Create a Prompt for extraction instructions
How Extraction Works
1. Document has labels: ["invoice"]
2. Engine resolves fields from ontology for those labels
3. Engine fetches document file_data and mime_type
4. Engine calls extract(document, file_data, mime_type, fields, configs)
5. Plugin builds prompt with fields, calls LLM, returns ExtractionResult
6. Engine persists the extracted data to the database
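The flow above can be approximated in a few lines. The sketch below is a stand-in for the Engine, not its actual implementation: `Document`, `resolve_fields`, `EchoPlugin`, and `run_extraction` are hypothetical names, and the taxonomy dict only mirrors the shape of the ontology shown in Step 2. It covers steps 2 to 5: fields are resolved from the document's labels and then handed to `extract()`.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the SDK's document object."""
    document_id: str
    labels: list = field(default_factory=list)


# Same shape as the ontology taxonomy in Step 2 (trimmed).
TAXONOMY = {
    "label": "document",
    "children": [
        {
            "label": "invoice",
            "fields": [
                {"name": "invoice_total", "dtype": "number"},
                {"name": "invoice_date", "dtype": "date"},
            ],
        }
    ],
}


def resolve_fields(node, labels):
    """Step 2: collect fields from every taxonomy node whose label matches."""
    found = list(node.get("fields", [])) if node["label"] in labels else []
    for child in node.get("children", []):
        found.extend(resolve_fields(child, labels))
    return found


class EchoPlugin:
    """Hypothetical plugin: returns None for each requested field, no LLM call."""

    async def extract(self, document, file_data, mime_type, fields, configs):
        return {f["name"]: None for f in fields}


async def run_extraction(plugin, document, file_data, mime_type, configs):
    fields = resolve_fields(TAXONOMY, document.labels)  # step 2
    # Steps 4-5: the plugin receives pre-resolved fields and the file bytes.
    return await plugin.extract(document, file_data, mime_type, fields, configs)


doc = Document(document_id="doc-1", labels=["invoice"])
result = asyncio.run(
    run_extraction(EchoPlugin(), doc, b"%PDF-", "application/pdf", {})
)
print(sorted(result))  # ['invoice_date', 'invoice_total']
```

A document with no labels resolves to an empty field list in this sketch, which is why classification has to run before extraction.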
Step 1: Create the Plugin Code
Create invoice_extractor.py:
from bizsupply_sdk import ExtractionPlugin, ExtractionResult


class InvoiceExtractorPlugin(ExtractionPlugin):
    """
    Extracts structured data from classified documents.

    The Engine calls extract() with pre-resolved fields based on document labels.
    Just return ExtractionResult - Engine handles persistence!
    """

    # Define configurable parameters as class attributes
    configurable_parameters = [
        {
            "parameter_name": "extraction_prompt_id",
            "parameter_type": "str",
            "default_value": None,
            "description": "Prompt ID for extraction (REQUIRED)",
        },
        {
            "parameter_name": "confidence_threshold",
            "parameter_type": "float",
            "default_value": 0.8,
            "description": "Minimum confidence for extracted values",
        },
    ]

    async def extract(
        self,
        document,
        file_data,
        mime_type,
        fields,
        configs,
    ):
        """
        Extract data from a single document.

        Args:
            document: The document to extract from (already classified)
            file_data: Raw file bytes (Gemini reads PDFs directly)
            mime_type: MIME type of file_data
            fields: OntologyFields to extract (resolved from document.labels)
            configs: Runtime configuration (prompt IDs, thresholds, etc.)

        Returns:
            ExtractionResult with extracted data
        """
        self.logger.info(
            f"Extracting from {document.document_id}, "
            f"labels={document.labels}, fields={len(fields)}"
        )

        # Get configuration
        prompt_id = configs.get("extraction_prompt_id")
        if not prompt_id:
            self.logger.error("'extraction_prompt_id' config is required")
            raise ValueError("extraction_prompt_id configuration is required")
        prompt_template = await self.get_prompt(prompt_id)

        # Build field definitions for prompt
        fields_json = self.format_fields_for_prompt(fields)

        # Build prompt and call LLM
        prompt = prompt_template.format(
            fields=fields_json,
            document_content="[See attached document file]"
        )
        llm_response = await self.prompt_llm(
            prompt=prompt,
            file_data=file_data,
            mime_type=mime_type,
        )

        if not llm_response:
            self.logger.error(f"Empty LLM response for {document.document_id}")
            return ExtractionResult(data={})

        if isinstance(llm_response, dict):
            extracted_data = llm_response
        else:
            self.logger.error(f"Expected dict, got {type(llm_response)}")
            return ExtractionResult(data={})

        self.logger.info(
            f"Extracted {len(extracted_data)} fields from {document.document_id}"
        )

        # Return ExtractionResult - Engine handles persistence
        return ExtractionResult(
            data=extracted_data,
            llm_fields=list(extracted_data.keys()),
            document_type=document.labels[-1] if document.labels else None,
        )

Step 2: Create an Ontology with Fields
The extraction plugin needs an ontology that defines what fields to extract:
POST /ontologies
Content-Type: application/json

{
  "name": "Invoice Ontology",
  "taxonomy": {
    "label": "document",
    "children": [
      {
        "label": "invoice",
        "fields": [
          {
            "name": "invoice_total",
            "dtype": "number",
            "description": "Total amount due"
          },
          {
            "name": "invoice_date",
            "dtype": "date",
            "description": "Date on the invoice"
          },
          {
            "name": "vendor_name",
            "dtype": "string",
            "description": "Name of the vendor/supplier"
          },
          {
            "name": "line_items",
            "dtype": "array",
            "description": "List of items with description, quantity, price"
          }
        ]
      }
    ]
  }
}

Step 3: Create an Extraction Prompt
POST /prompts
Content-Type: application/json

{
  "name": "Invoice Extraction Prompt",
  "prompt": "Extract the following fields from this document.\n\nFields to extract:\n{fields}\n\nDocument:\n{document_content}\n\nRespond with JSON containing only the requested field names as keys."
}

Step 4: Validate and Register
Validate your plugin before registering:
bizsupply validate invoice_extractor.py
Then register:
curl -X POST "https://api.bizsupply.com/api/v1/plugins" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "name=Invoice Extractor" \
  -F "description=Extracts structured data from classified documents" \
  -F "code_file=@invoice_extractor.py"
The plugin type and configurable parameters are automatically extracted from your code.
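One way to think about `configurable_parameters` is as declared defaults that the runtime `configs` can override. The sketch below illustrates that idea only: `resolve_configs` is a hypothetical helper, `prompt-123` is a placeholder ID, and this guide does not state how the Engine actually merges defaults with runtime values.

```python
# Parameter declarations copied from InvoiceExtractorPlugin (descriptions omitted).
CONFIGURABLE_PARAMETERS = [
    {
        "parameter_name": "extraction_prompt_id",
        "parameter_type": "str",
        "default_value": None,
    },
    {
        "parameter_name": "confidence_threshold",
        "parameter_type": "float",
        "default_value": 0.8,
    },
]


def resolve_configs(parameters, overrides):
    """Hypothetical helper: start from declared defaults, apply runtime overrides."""
    resolved = {p["parameter_name"]: p["default_value"] for p in parameters}
    unknown = set(overrides) - set(resolved)
    if unknown:
        # Reject keys the plugin never declared, so typos surface early.
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    resolved.update(overrides)
    return resolved


configs = resolve_configs(
    CONFIGURABLE_PARAMETERS,
    {"extraction_prompt_id": "prompt-123"},  # placeholder prompt ID
)
print(configs["confidence_threshold"])  # 0.8
```

Under this assumption, a parameter whose default is `None`, such as `extraction_prompt_id`, still has to be supplied at run time, which matches the explicit check in `extract()`.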
Key Methods
| Method | Purpose |
|---|---|
| await self.prompt_llm(prompt, file_data, mime_type) | Call LLM for extraction |
| await self.get_prompt(prompt_id) | Load prompt template |
| self.format_fields_for_prompt(fields) | Format OntologyFields as JSON for prompts |
| self.logger | Plugin-specific logger |
The Engine handles document content fetching, field resolution from the ontology, and data persistence automatically.
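To make the prompt-building step concrete, the sketch below fills the Step 3 template with a JSON rendering of the fields. `OntologyField` and `fields_to_json` are hypothetical stand-ins written for this example; the SDK's `format_fields_for_prompt` may produce a different layout.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OntologyField:
    """Hypothetical stand-in for the SDK's field object."""
    name: str
    dtype: str
    description: str


def fields_to_json(fields):
    """Assumed equivalent of format_fields_for_prompt: a JSON list of field dicts."""
    return json.dumps([asdict(f) for f in fields], indent=2)


# Template text from Step 3; {fields} and {document_content} are the placeholders.
TEMPLATE = (
    "Extract the following fields from this document.\n\n"
    "Fields to extract:\n{fields}\n\n"
    "Document:\n{document_content}\n\n"
    "Respond with JSON containing only the requested field names as keys."
)

fields = [
    OntologyField("invoice_total", "number", "Total amount due"),
    OntologyField("invoice_date", "date", "Date on the invoice"),
]

# str.format only parses braces in the template itself, so the JSON braces
# inside the substituted value are inserted verbatim.
prompt = TEMPLATE.format(
    fields=fields_to_json(fields),
    document_content="[See attached document file]",
)
print(prompt)
```

If the stored template itself needs literal braces (for example, an inline JSON sample), they have to be doubled as `{{` and `}}`, otherwise `str.format` treats them as placeholders.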
Field Types
Ontology fields support these types:
| Type | Description | Example Value |
|---|---|---|
| string | Text value | "ACME Corp" |
| number | Numeric value | 1500.00 |
| date | Date value | "2025-01-15" |
| boolean | True/false | true |
| array | List of objects | [{"item": "Widget", "qty": 10}] |
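LLM output does not always arrive in the declared type (a total may come back as the string "1500.00"), so a plugin may want to normalize values before returning them. The helper below is a hypothetical sketch, not part of the SDK, and this guide does not say whether the Engine performs its own coercion.

```python
from datetime import date


def coerce_value(dtype, value):
    """Convert a raw LLM value to the Python type matching an ontology dtype."""
    if value is None:
        return None
    if dtype == "string":
        return str(value)
    if dtype == "number":
        return float(value)
    if dtype == "date":
        return date.fromisoformat(value)  # expects "YYYY-MM-DD"
    if dtype == "boolean":
        if isinstance(value, bool):
            return value
        return str(value).strip().lower() == "true"
    if dtype == "array":
        if not isinstance(value, list):
            raise TypeError(f"expected list, got {type(value).__name__}")
        return value
    raise ValueError(f"unsupported dtype: {dtype}")


# Example raw response and the dtypes declared in the ontology.
raw = {
    "invoice_total": "1500.00",
    "invoice_date": "2025-01-15",
    "vendor_name": "ACME Corp",
}
dtypes = {"invoice_total": "number", "invoice_date": "date", "vendor_name": "string"}

clean = {name: coerce_value(dtypes[name], value) for name, value in raw.items()}
print(clean)
```

Raising on an unknown dtype, instead of passing the value through, keeps a misspelled type in the ontology from silently producing untyped data.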
Common Mistakes
Wrong Return Type
# WRONG - extract() must return ExtractionResult, not dict
return {"invoice_total": 1500.00}

# CORRECT
from bizsupply_sdk import ExtractionResult
return ExtractionResult(data={"invoice_total": 1500.00})

Using Old execute() Method
# WRONG - old v1.0 API
async def execute(self, context: PluginContext):
    for doc in context.documents:
        fields = self.get_ontology_fields_by_labels(doc.labels, context.ontologies)
        content = await self.get_document_content(doc)
        # ... extract ...
        await self.add_document_data(doc, extracted_data)
    return context.documents

# CORRECT - implement extract(), return ExtractionResult
async def extract(self, document, file_data, mime_type, fields, configs):
    # fields are pre-resolved by the Engine
    # file_data is pre-fetched by the Engine
    result = await self.prompt_llm(prompt=..., file_data=file_data, mime_type=mime_type)
    return ExtractionResult(data=result or {})

Missing SDK Import
# WRONG
class MyPlugin(ExtractionPlugin):  # NameError!
    ...

# CORRECT
from bizsupply_sdk import ExtractionPlugin, ExtractionResult

class MyPlugin(ExtractionPlugin):
    ...

Pipeline Order
Extraction plugins should run AFTER classification:
Pipeline:
1. Classification Plugin -> assigns labels
2. Extraction Plugin -> extracts data based on labels
Next Steps
- Use Plugins - Execute your plugin in a pipeline
- Create an Ontology - Define extraction schemas
- Plugin Service API - All available service methods