Create a Benchmark
Goal: Score documents, compare metrics, and flag outliers across your document collection.
Use When: You need to calculate a metric per document (e.g., price per kWh), compute a reference value across documents, and compare each document against that reference.
What Benchmarks Do
Benchmarks are pure calculation -- no async, no services, no I/O. The Engine handles everything else.

    Engine resolves fields  →  score() per doc   →  compute() once   →  compare() per doc
    (slot → semantic name)     ExtendedDocument     ScoredDocuments     bool (unfavorable?)
                               + aggregations       → single float

| Responsibility | Owner |
|---|---|
| Query documents from database | Engine |
| Resolve slot names to semantic field names | Engine |
| Pre-fetch aggregations and attach to document | Engine |
| Calculate a score for one document | You |
| Reduce scores to a single benchmark value | You |
| Compare each document against the benchmark | You |
| Persist score records | Engine |
Key difference from plugins: Benchmarks are synchronous and have no access to prompt_llm() or any services. They receive ExtendedDocument objects (not Document) with pre-resolved semantic field names.
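
The call order described above can be sketched in plain Python. This only illustrates the sequence of score, compute, and compare; it is not the Engine's implementation, and `run_benchmark` plus the dict-based documents are hypothetical stand-ins introduced for this sketch.

```python
from typing import Callable, Optional


def run_benchmark(
    documents: list[dict],
    score: Callable[[dict], Optional[float]],
    compute: Callable[[list[tuple[dict, float]]], float],
    compare: Callable[[float, float], bool],
) -> list[tuple[dict, float, bool]]:
    """Hypothetical driver mirroring the score -> compute -> compare order."""
    # 1. score() runs once per document; None means "cannot be scored".
    scored = [(doc, s) for doc in documents if (s := score(doc)) is not None]
    if not scored:
        return []
    # 2. compute() runs once over all scored documents.
    benchmark = compute(scored)
    # 3. compare() runs once per scored document.
    return [(doc, s, compare(s, benchmark)) for doc, s in scored]


docs = [{"id": "a", "price": 0.12}, {"id": "b", "price": 0.15}, {"id": "c"}]
results = run_benchmark(
    docs,
    score=lambda d: d.get("price"),
    compute=lambda scored: min(s for _, s in scored),
    compare=lambda doc_score, bench: doc_score > bench,
)
# Documents whose score is unfavorable relative to the benchmark.
flagged = [doc["id"] for doc, _, unfavorable in results if unfavorable]
```

Document "c" has no price, so it is skipped before compute() runs; only "b" ends up flagged because it is above the lowest price.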
Prerequisites
Before starting:
- Install the SDK: pip install bizsupply-sdk
- Read Plugin Interface Specification (benchmark section)
- Read Plugin Service API (ExtendedDocument, MatchRule sections)
- Have an ontology with your target labels (see Create an Ontology)
Step 1: Scaffold
Use the CLI to generate a starting template:

    bizsupply init benchmark --name energy_price

This creates energy_price.py with a ready-to-edit benchmark class:

    from bizsupply_sdk import (
        BaseBenchmark,
        ExtendedDocument,
        ScoredDocument,
        MatchCondition,
        MatchRule,
    )


    class EnergyPrice(BaseBenchmark):
        """Benchmark implementation."""

        name = "energy_price"
        target_labels = ["contract"]  # TODO: Set target labels
        metric_unit = "EUR"  # TODO: Set metric unit
        group_by = []  # TODO: Set grouping dimensions

        MATCH_RULES = [
            MatchRule(
                name="energy_price_match",
                left_group=["contract"],  # TODO: Set source document labels
                right_group=["invoice"],  # TODO: Set target document labels
                conditions=[
                    MatchCondition(
                        left_field="client_tax_id",
                        right_field="client_tax_id",
                        match_type="==",
                    ),
                ],
                description="Match contracts to invoices by client tax ID",
            ),
        ]

        def score(self, document: ExtendedDocument) -> float | None:
            # TODO: Implement scoring logic
            ...

        def compute(self, results: list[ScoredDocument]) -> float:
            # TODO: Choose computation method
            ...

        def compare(self, document_score: float, benchmark_score: float) -> bool:
            # TODO: Implement comparison logic
            ...

Step 2: Identity Properties
Every benchmark must define these class attributes:
| Property | Type | Description | Example |
|---|---|---|---|
| name | str | Unique identifier | "energy_contract_price_portugal" |
| target_labels | list[str] | Which documents to score (ontology labels) | ["contract", "energy"] |
| metric_unit | str | Unit of measurement | "EUR/kWh" |
| group_by | list[str] | Optional grouping dimensions | ["region"] |

    class EnergyContractPrice(BaseBenchmark):
        name = "energy_contract_price_portugal"
        target_labels = ["contract", "energy"]
        metric_unit = "EUR/kWh"
        group_by = ["region"]

The Engine uses target_labels to select which documents to score and group_by to partition scores into separate benchmarks.
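
To make the partitioning concrete, here is a rough sketch of how scores grouped by region could each receive their own benchmark value. The helper `benchmark_per_group` and the (fields, score) tuple layout are assumptions made for illustration; the Engine's actual grouping logic is not shown in this guide.

```python
from collections import defaultdict


def benchmark_per_group(scored, group_by, compute):
    """Hypothetical helper: partition (fields, score) pairs by the group_by fields."""
    groups = defaultdict(list)
    for fields, score in scored:
        # The group key is the tuple of values for each grouping dimension.
        key = tuple(fields.get(dim) for dim in group_by)
        groups[key].append(score)
    # One benchmark value per distinct group key.
    return {key: compute(scores) for key, scores in groups.items()}


scored = [
    ({"region": "north"}, 0.14),
    ({"region": "north"}, 0.11),
    ({"region": "south"}, 0.18),
]
per_region = benchmark_per_group(scored, group_by=["region"], compute=min)
```

With an empty group_by every document falls under the same key, so a single benchmark value is produced.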
Step 3: Aggregation Rules
Aggregation rules tell the Engine which related documents to attach to each scored document. For example, an energy contract benchmark might need invoices linked by client tax ID and delivery point.
Define rules as a MATCH_RULES class attribute:

    from bizsupply_sdk import MatchRule, MatchCondition


    class EnergyContractPrice(BaseBenchmark):
        # ... identity properties ...

        MATCH_RULES = [
            MatchRule(
                name="contract_invoice_match",
                left_group=["contract", "energy"],
                right_group=["invoice", "energy"],
                conditions=[
                    MatchCondition(
                        left_field="client_tax_id",
                        right_field="client_tax_id",
                        match_type="==",
                    ),
                    MatchCondition(
                        left_field="cpe_point_of_delivery",
                        right_field="cpe_point_of_delivery",
                        match_type="==",
                    ),
                ],
                description="Match energy contracts to invoices by client and CPE",
            ),
        ]

How it works: The Engine finds all right_group documents where ALL conditions match (AND logic), then attaches them as document.aggregations.
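
The AND logic can be illustrated with a small standalone sketch. It uses plain dicts and equality conditions only, and the `matches` helper is hypothetical; it approximates the behaviour described above rather than reproducing the Engine's matcher.

```python
def matches(left: dict, right: dict, conditions: list[tuple[str, str]]) -> bool:
    """True only if every (left_field, right_field) pair is present and equal."""
    for left_field, right_field in conditions:
        left_value = left.get(left_field)
        right_value = right.get(right_field)
        # Assumption: a missing value on either side never matches.
        if left_value is None or right_value is None or left_value != right_value:
            return False
    return True


contract = {"client_tax_id": "PT123", "cpe_point_of_delivery": "CPE-1"}
invoices = [
    {"id": "inv-1", "client_tax_id": "PT123", "cpe_point_of_delivery": "CPE-1"},
    {"id": "inv-2", "client_tax_id": "PT123", "cpe_point_of_delivery": "CPE-2"},
    {"id": "inv-3", "client_tax_id": "PT999", "cpe_point_of_delivery": "CPE-1"},
]
conditions = [
    ("client_tax_id", "client_tax_id"),
    ("cpe_point_of_delivery", "cpe_point_of_delivery"),
]
# Only invoices satisfying both conditions are attached.
aggregations = [inv for inv in invoices if matches(contract, inv, conditions)]
```

Invoices that satisfy only one of the two conditions (inv-2 and inv-3) are left out, which is why adding conditions narrows the set of attached documents.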
MatchCondition Operators
| Operator | Description | Example |
|---|---|---|
| == | Equal | Same client tax ID |
| != | Not equal | Different supplier |
| < | Less than | Start date before end date |
| <= | Less than or equal | Score at most threshold |
| > | Greater than | Amount exceeds minimum |
| >= | Greater than or equal | End date on or after emission |
| contains | Left contains right (string) | Address contains postal code |
| starts_with | Left starts with right (string) | Reference starts with prefix |
All field names use semantic names (e.g., client_tax_id, price_per_kwh), not slot names. The Engine resolves semantic names to physical slots automatically.
Step 4: Implement score()
score() calculates a score for a single document. The document arrives as an ExtendedDocument with its aggregations already attached.
ExtendedDocument API

    # Access fields by semantic name
    price = document.get("price_per_kwh")  # Returns value or None
    supplier = document.get("supplier")

    # Access related documents
    for invoice in document.aggregations:
        inv_price = invoice.get("price_per_kwh")

    # Escape hatch for direct dict access
    raw_data = document.raw

Full Example

    def score(self, document: ExtendedDocument) -> float | None:
        """Calculate average price per kWh from linked invoices."""
        if not document.aggregations:
            return None

        prices = [inv.get("price_per_kwh") for inv in document.aggregations]
        prices = [p for p in prices if p is not None]
        if not prices:
            return None

        return sum(prices) / len(prices)

Return None when a document cannot be scored (missing data, no aggregations). The Engine skips None-scored documents.
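
Because score() is pure calculation, its logic can be exercised without the Engine. The sketch below assumes a hand-rolled test double, `FakeDocument`, that mimics only the .get() and .aggregations surface described above; it is not part of the SDK, and the scoring logic is copied into a free function for the test.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeDocument:
    """Hypothetical test double exposing .get() and .aggregations."""

    fields: dict[str, Any] = field(default_factory=dict)
    aggregations: list["FakeDocument"] = field(default_factory=list)

    def get(self, name: str) -> Any:
        return self.fields.get(name)


def score(document: FakeDocument) -> Optional[float]:
    """Same averaging logic as the score() example, as a free function."""
    prices = [inv.get("price_per_kwh") for inv in document.aggregations]
    prices = [p for p in prices if p is not None]
    if not prices:
        return None
    return sum(prices) / len(prices)


contract = FakeDocument(
    aggregations=[
        FakeDocument({"price_per_kwh": 0.10}),
        FakeDocument({"price_per_kwh": 0.20}),
        FakeDocument({}),  # invoice without a price is ignored
    ]
)
assert abs(score(contract) - 0.15) < 1e-9  # average of the two priced invoices
assert score(FakeDocument()) is None  # no aggregations, so no score
```

Comparing floats with a small tolerance avoids spurious failures from binary rounding.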
Step 5: Implement compute()
compute() reduces all scored documents to a single benchmark value. It receives a list of ScoredDocument objects, each pairing a document with its score.
ScoredDocument
| Attribute | Type | Description |
|---|---|---|
| .document | ExtendedDocument | The original document (with aggregations) |
| .score | float | The score calculated by score() |
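
As a worked illustration of reading these two attributes, the sketch below reduces three scored documents in two ways. `Scored` is a hypothetical stand-in with the same attribute names, and the documents are plain dicts rather than ExtendedDocument objects.

```python
from typing import NamedTuple


class Scored(NamedTuple):
    """Hypothetical stand-in exposing the same .document / .score attributes."""

    document: dict
    score: float


results = [
    Scored({"volume": 100}, 0.10),
    Scored({"volume": 300}, 0.20),
    Scored({}, 0.30),  # missing volume falls back to weight 1
]

# Reduction 1: the lowest score.
lowest = min(r.score for r in results)

# Reduction 2: volume-weighted mean, reading a field from each document.
total_weight = sum(r.document.get("volume") or 1 for r in results)
weighted = sum(r.score * (r.document.get("volume") or 1) for r in results) / total_weight
```

Here the weights sum to 401, so the weighted mean is 70.3 / 401, roughly 0.175, while the minimum is 0.10. Which reduction is appropriate depends on what the benchmark value is supposed to represent.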
Patterns

    # Minimum (best/lowest price)
    def compute(self, results: list[ScoredDocument]) -> float:
        return min(r.score for r in results)

    # Maximum (best/highest rating)
    def compute(self, results: list[ScoredDocument]) -> float:
        return max(r.score for r in results)

    # Weighted average (by supplier volume)
    def compute(self, results: list[ScoredDocument]) -> float:
        total_weight = sum(r.document.get("volume") or 1 for r in results)
        weighted = sum(r.score * (r.document.get("volume") or 1) for r in results)
        return weighted / total_weight

Full Example

    def compute(self, results: list[ScoredDocument]) -> float:
        """Benchmark is the lowest average price across all contracts."""
        return min(r.score for r in results)

Step 6: Implement compare()
compare() determines whether a document's score is unfavorable compared to the benchmark. It returns True if the document needs attention.
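
A quick worked example may help when choosing a comparison rule. The sketch below assumes a lower-is-better metric and flags a score only when it is more than a given fraction above the benchmark; the function name `is_unfavorable` and the `tolerance` parameter are illustrative and not part of the SDK.

```python
def is_unfavorable(
    document_score: float, benchmark_score: float, tolerance: float = 0.10
) -> bool:
    """Hypothetical helper: flag scores more than `tolerance` above the benchmark."""
    if benchmark_score == 0:
        # Relative deviation is undefined; fall back to an absolute check.
        return document_score > 0
    return (document_score - benchmark_score) / benchmark_score > tolerance


twenty_percent_above = is_unfavorable(0.12, 0.10)  # 20% above the benchmark
five_percent_above = is_unfavorable(0.105, 0.10)  # within the 10% tolerance
equal_to_benchmark = is_unfavorable(0.10, 0.10)  # exactly at the benchmark
```

Only the first call is flagged. A tolerance like this avoids flagging every document that is marginally worse than the single best score.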
Patterns

    # Lower is better (price benchmarks)
    def compare(self, document_score: float, benchmark_score: float) -> bool:
        return document_score > benchmark_score

    # Higher is better (quality/rating benchmarks)
    def compare(self, document_score: float, benchmark_score: float) -> bool:
        return document_score < benchmark_score

    # Threshold (flag if deviation exceeds 10%)
    def compare(self, document_score: float, benchmark_score: float) -> bool:
        if benchmark_score == 0:
            # Relative deviation is undefined against a zero benchmark
            return document_score != 0
        return abs(document_score - benchmark_score) / abs(benchmark_score) > 0.10

Full Example

    def compare(self, document_score: float, benchmark_score: float) -> bool:
        """Flag contracts paying more than the best price."""
        return document_score > benchmark_score

Step 7: Validate and Register
Validate
Run the SDK validator to check your benchmark before registering:

    bizsupply validate energy_price.py

The validator checks:
- Inherits from BaseBenchmark
- Implements all three methods (score, compute, compare)
- Methods are synchronous (not async def)
- Required class attributes are defined (name, target_labels, metric_unit)
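
For a rough idea of what such checks involve, the sketch below performs similar inspections with the standard library. It is a hypothetical pre-flight helper, not the SDK validator, and it omits the BaseBenchmark inheritance check so that it runs without the SDK installed.

```python
import inspect

REQUIRED_METHODS = ("score", "compute", "compare")
REQUIRED_ATTRS = ("name", "target_labels", "metric_unit")


def check_benchmark(cls: type) -> list[str]:
    """Hypothetical pre-flight check; returns a list of human-readable problems."""
    problems = []
    for method in REQUIRED_METHODS:
        fn = getattr(cls, method, None)
        if not callable(fn):
            problems.append(f"missing method: {method}")
        elif inspect.iscoroutinefunction(fn):
            problems.append(f"{method} must not be async")
    for attr in REQUIRED_ATTRS:
        if not hasattr(cls, attr):
            problems.append(f"missing attribute: {attr}")
    return problems


class Broken:
    """Deliberately faulty class: async score, no compare, no metric_unit."""

    name = "broken"
    target_labels = ["contract"]

    async def score(self, document):
        return 1.0

    def compute(self, results):
        return 0.0


problems = check_benchmark(Broken)
```

Running the real validator remains the authoritative check; a helper like this is only useful as a fast local guard, for example in a unit test.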
Register

    curl -X POST "https://api.bizsupply.com/api/v1/plugins" \
      -H "Authorization: Bearer YOUR_TOKEN" \
      -F "name=Energy Contract Price" \
      -F "description=Benchmarks energy contract prices against lowest invoice average" \
      -F "code_file=@energy_price.py"

The benchmark type and identity properties are automatically extracted from your code.
Complete Example
A full benchmark that scores energy contracts by average invoice price:

    """Benchmark: Energy contract price comparison."""

    from bizsupply_sdk import (
        BaseBenchmark,
        ExtendedDocument,
        ScoredDocument,
        MatchCondition,
        MatchRule,
    )


    class EnergyContractPrice(BaseBenchmark):
        """Compare energy contract prices against the best available rate.

        Scores each contract by the average price_per_kwh from its linked
        invoices, computes the benchmark as the lowest average, and flags
        contracts paying above the best price.
        """

        name = "energy_contract_price_portugal"
        target_labels = ["contract", "energy"]
        metric_unit = "EUR/kWh"
        group_by = ["region"]

        MATCH_RULES = [
            MatchRule(
                name="contract_invoice_match",
                left_group=["contract", "energy"],
                right_group=["invoice", "energy"],
                conditions=[
                    MatchCondition(
                        left_field="client_tax_id",
                        right_field="client_tax_id",
                        match_type="==",
                    ),
                    MatchCondition(
                        left_field="cpe_point_of_delivery",
                        right_field="cpe_point_of_delivery",
                        match_type="==",
                    ),
                ],
                description="Match energy contracts to invoices by client and CPE",
            ),
        ]

        def score(self, document: ExtendedDocument) -> float | None:
            """Average price per kWh from linked invoices."""
            if not document.aggregations:
                return None

            prices = [inv.get("price_per_kwh") for inv in document.aggregations]
            prices = [p for p in prices if p is not None]
            if not prices:
                return None

            return sum(prices) / len(prices)

        def compute(self, results: list[ScoredDocument]) -> float:
            """Benchmark is the lowest average price."""
            return min(r.score for r in results)

        def compare(self, document_score: float, benchmark_score: float) -> bool:
            """Flag contracts paying above the best price."""
            return document_score > benchmark_score

Key Concepts Reference
ExtendedDocument
| Method / Property | Returns | Description |
|---|---|---|
| .get("field_name") | Any \| None | Get field value by semantic name |
| .aggregations | list[ExtendedDocument] | Related documents matched by MATCH_RULES |
| .raw | dict[str, Any] | Direct dict access (escape hatch) |
ScoredDocument
| Attribute | Type | Description |
|---|---|---|
| .document | ExtendedDocument | The scored document with aggregations |
| .score | float | Score calculated by score() |
MatchCondition Operators
| Operator | left vs right | None handling |
|---|---|---|
| == | Equal | None == None returns False |
| != | Not equal | None != None returns False |
| < | Left less than right | Either None returns False |
| <= | Left less than or equal | Either None returns False |
| > | Left greater than right | Either None returns False |
| >= | Left greater than or equal | Either None returns False |
| contains | Left string contains right | Either None returns False |
| starts_with | Left string starts with right | Either None returns False |
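
The None handling in the table can be approximated with a short evaluator. This is a sketch under one stated assumption: any None operand makes the condition False. That is consistent with every row above, but for == and != the table only documents the case where both sides are None, so the one-sided behaviour here is a guess. The `evaluate` function and `_OPS` mapping are hypothetical and not SDK names.

```python
import operator

# Operator names follow the table; the implementations are illustrative only.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "contains": lambda left, right: str(right) in str(left),
    "starts_with": lambda left, right: str(left).startswith(str(right)),
}


def evaluate(left, right, match_type: str) -> bool:
    """Hypothetical evaluator for a single condition."""
    # Assumption: a None on either side never matches, whatever the operator.
    if left is None or right is None:
        return False
    return bool(_OPS[match_type](left, right))


both_none_equal = evaluate(None, None, "==")
both_none_not_equal = evaluate(None, None, "!=")
prefix_match = evaluate("4000-123 Porto", "4000", "starts_with")
```

The practical consequence is that documents with missing match fields are never attached as aggregations, so a benchmark may see fewer related documents than expected when the underlying data is incomplete.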
Common Mistakes
Using async/await

    # WRONG - benchmarks are synchronous
    async def score(self, document):
        return await some_calculation()

    # CORRECT - plain def, no await
    def score(self, document):
        return some_calculation()

Using Document Instead of ExtendedDocument

    # WRONG - Document is for plugins, not benchmarks
    from bizsupply_sdk import Document

    def score(self, document: Document) -> float | None:
        return document.data.get("price")

    # CORRECT - benchmarks receive ExtendedDocument
    from bizsupply_sdk import ExtendedDocument

    def score(self, document: ExtendedDocument) -> float | None:
        return document.get("price")

Using Slot Names Instead of Semantic Names

    # WRONG - slot names (the Engine resolves these for you)
    price = document.get("number_1")

    # CORRECT - semantic names (from your ontology)
    price = document.get("price_per_kwh")

Not Handling None Values

    # WRONG - crashes if any invoice has no price
    def score(self, document):
        prices = [inv.get("price_per_kwh") for inv in document.aggregations]
        return sum(prices) / len(prices)  # TypeError if None in list

    # CORRECT - filter None values
    def score(self, document):
        prices = [inv.get("price_per_kwh") for inv in document.aggregations]
        prices = [p for p in prices if p is not None]
        if not prices:
            return None
        return sum(prices) / len(prices)

Missing MATCH_RULES

    # WRONG - no aggregation rules, document.aggregations will be empty
    class MyBenchmark(BaseBenchmark):
        name = "my_benchmark"
        target_labels = ["contract"]
        metric_unit = "EUR"

        def score(self, document):
            # document.aggregations is always [] without MATCH_RULES
            for inv in document.aggregations:  # Never executes
                ...

    # CORRECT - define MATCH_RULES to link related documents
    class MyBenchmark(BaseBenchmark):
        name = "my_benchmark"
        target_labels = ["contract"]
        metric_unit = "EUR"

        MATCH_RULES = [
            MatchRule(
                name="contract_invoice_match",
                left_group=["contract"],
                right_group=["invoice"],
                conditions=[
                    MatchCondition(
                        left_field="client_tax_id",
                        right_field="client_tax_id",
                        match_type="==",
                    ),
                ],
            ),
        ]

Next Steps
- Plugin Interface Specification - Full benchmark contract reference
- Plugin Service API - ExtendedDocument, ScoredDocument, MatchRule details
- Create an Ontology - Define the labels and fields your benchmark uses
- Create a Pipeline - Execute your benchmark in a pipeline