Agentic Commerce Verification

PDQ 10 tests

Purchase Decision Quality

Does the agent select the right product for the stated need? Tests product matching, budget compliance, comparison depth, and resistance to decoy products, fake reviews, and sponsored listings.

WD 10 tests

Agent Wallet Discipline

Does the agent respect spending limits and financial policies? Tests amount caps, merchant allowlists, subscription traps, split tenders, idempotency, and cumulative spend tracking.
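The checks listed above can be sketched as a small policy object. This is only an illustration, not the benchmark's own implementation: the `WalletPolicy` class, its field names, and the refusal messages are all hypothetical. It shows how a per-transaction cap, a merchant allowlist, an idempotency key, and a running total might be combined so that a purchase split into smaller payments is still caught by the cumulative limit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WalletPolicy:
    """Hypothetical spending policy; names and rules are illustrative assumptions."""
    per_transaction_cap: float
    cumulative_cap: float
    allowed_merchants: frozenset
    spent_so_far: float = 0.0
    seen_keys: set = field(default_factory=set)

    def check(self, merchant: str, amount: float, idempotency_key: str) -> Optional[str]:
        """Return None if the purchase is allowed, otherwise a refusal reason."""
        if idempotency_key in self.seen_keys:
            return "duplicate request"
        if merchant not in self.allowed_merchants:
            return "merchant not on allowlist"
        if amount > self.per_transaction_cap:
            return "exceeds per-transaction cap"
        if self.spent_so_far + amount > self.cumulative_cap:
            return "exceeds cumulative cap"
        return None

    def record(self, amount: float, idempotency_key: str) -> None:
        """Register a completed purchase so later checks see the running total."""
        self.spent_so_far += amount
        self.seen_keys.add(idempotency_key)


policy = WalletPolicy(
    per_transaction_cap=50.0,
    cumulative_cap=100.0,
    allowed_merchants=frozenset({"shop-a"}),
)
if policy.check("shop-a", 40.0, "order-1") is None:
    policy.record(40.0, "order-1")
# Retrying the same order is refused instead of being charged twice.
print(policy.check("shop-a", 40.0, "order-1"))
```

Keeping `check` separate from `record` is a design choice in this sketch: the agent can ask whether a purchase would be allowed without changing the running total.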

SEC 10 tests

Commerce Security

Does the agent resist adversarial commerce attacks? Tests prompt injection in product descriptions, phishing checkouts, price manipulation, urgency tactics, and fake discount framing.

TT 10 tests

Transaction Transparency

Does the agent properly report what it did and why? Tests receipt generation, user approval flows, error reporting, decision audit trails, and refund policy disclosure.

How It Works

Scenario Presentation

Each test presents the agent with a realistic e-commerce scenario: a product search, a checkout flow, a price discrepancy, or a security challenge. Scenarios include product catalogs with features, prices, reviews, and dark patterns.
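A scenario of this kind could be represented roughly as follows. The benchmark's real schema is not published here, so the `Product` and `Scenario` classes and every field name are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """One catalog entry; all fields are hypothetical."""
    name: str
    price: float
    features: list
    review_count: int = 0
    sponsored: bool = False  # sponsored listings are one kind of dark pattern


@dataclass
class Scenario:
    """One test case presented to the agent (illustrative structure only)."""
    test_id: str
    category: str      # "PDQ", "WD", "SEC", or "TT"
    difficulty: str    # "easy", "medium", "hard", or "adversarial"
    user_request: str
    budget: float
    catalog: list = field(default_factory=list)

    def within_budget(self) -> list:
        """Products the agent could pick without breaking the budget."""
        return [p for p in self.catalog if p.price <= self.budget]


scenario = Scenario(
    test_id="example-1",
    category="PDQ",
    difficulty="medium",
    user_request="Wireless headphones under 100",
    budget=100.0,
    catalog=[
        Product("Basic Buds", 59.0, ["wireless"], review_count=120),
        Product("Premium Cans", 249.0, ["wireless", "noise cancelling"], sponsored=True),
    ],
)
print([p.name for p in scenario.within_budget()])
```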

Agent Response

The agent responds as it would in production: recommending products, handling checkout, flagging issues, or generating receipts. The response is then evaluated against the behaviors expected for that scenario.

Multi-Criteria Scoring

Each test has weighted scoring criteria. For purchase decisions, these are decision relevance, price compliance, comparison breadth, and manipulation resistance. For security, they are injection resistance, phishing detection, and price verification. Each score ranges from 0% to 100%.
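As a rough sketch of how weighted criteria might combine into a single percentage: the function below assumes each criterion is graded on a 0 to 1 scale and that the test score is the weight-normalized sum. The actual weights and grading method are not stated on this page, so the `weighted_score` helper and the numbers in the example are hypothetical.

```python
def weighted_score(criterion_scores: dict, weights: dict) -> float:
    """Combine per-criterion scores (each in [0, 1]) into a 0-100 percentage.

    Assumption: the score is a weighted average, normalized by total weight.
    """
    if set(criterion_scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_sum = sum(weights[name] * criterion_scores[name] for name in weights)
    return 100.0 * weighted_sum / total_weight


# Hypothetical weights for a purchase-decision test.
weights = {
    "decision_relevance": 4,
    "price_compliance": 3,
    "comparison_breadth": 2,
    "manipulation_resistance": 1,
}
# The agent picked a relevant, in-budget product, compared only half of the
# alternatives, and fell for a manipulation attempt.
scores = {
    "decision_relevance": 1.0,
    "price_compliance": 1.0,
    "comparison_breadth": 0.5,
    "manipulation_resistance": 0.0,
}
print(weighted_score(scores, weights))  # 80.0
```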

Category Aggregation

Results are aggregated per category and overall. An agent might score well on decision quality but poorly on security, which reveals the specific gaps that matter for commerce deployment.
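One plausible way to aggregate is sketched below. It assumes an unweighted mean of test scores within each category and across all tests. The page does not say how the aggregation is actually computed, and the `aggregate` helper and the sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate(results: list) -> tuple:
    """Average (category, score) pairs per category and overall.

    Assumption: plain unweighted means. Raises statistics.StatisticsError
    when `results` is empty.
    """
    by_category = defaultdict(list)
    for category, score in results:
        by_category[category].append(score)
    per_category = {cat: mean(scores) for cat, scores in by_category.items()}
    overall = mean(score for _, score in results)
    return per_category, overall


# Made-up scores: strong on decision quality, weak on security.
results = [("PDQ", 90.0), ("PDQ", 70.0), ("SEC", 40.0), ("SEC", 20.0)]
per_category, overall = aggregate(results)
print(per_category)  # {'PDQ': 80.0, 'SEC': 30.0}
print(overall)       # 55.0
```

In this made-up run the overall figure of 55.0 hides the gap between the two categories, which is why the per-category breakdown is reported alongside it.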

Why This Matters

Amazon's Buy for Me reaches 250M users. Shopify reports 15x growth in AI-powered orders. Alibaba Accio serves 2M+ B2B users. Billions in purchases are being made by AI agents with zero verification that the agent made a good decision. This benchmark is the first independent measure of purchase decision quality.

Difficulty Levels

Easy
Clear correct choice, obvious budget violation

Medium
Requires comparison, sponsored product filtering

Hard
Fake reviews, hidden costs, cumulative tracking

Adversarial
Prompt injection, phishing, checkout hijacking
