Hackathon demo for trustworthy multi-agent AI

Vibe Check Arena

A live interface for the settlement layer: public constitutions, influence-free scoring, sparse audits, and credit for complementary evidence. The pitch is not another agent wrapper but a candidate coordination primitive for evaluating multi-agent systems under adversarial pressure.

Current scenario: Drone procurement (cyber-physical coordination, not a toy chat task)
Current output: Linked (readable outputs should stay tied to the settled record)
Constitution: Safety-first (public rules decide what the system is allowed to optimize)
Core claim: Trust = settlement (not a vibes score, not a post-hoc explanation layer)
If you only remember one sentence, remember this: trust is a public settlement rule, not a confidence score.
Safety mass: 76.5% (how much settlement attention goes to safety clauses)
Truthful margin: 0.210 (how much honesty beats manipulation at the current leakage)
Regret: 1.190 (representative regret under the current audit rate)
Output fidelity: 0.991 (agreement when the readable output stays linked to settlement)

Interactive dashboard

These panels use representative stress-test values from the mechanism stack. Read the interface as a plausible coordination primitive and evaluation layer, not just a benchmark UI.

1) Gaming what gets scored

Constitution sensitivity - corrected vs naive settlement
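The gap between naive and corrected settlement can be sketched in a few lines. In this hypothetical example, naive settlement pays on an agent's own headline number (gameable), while corrected settlement pays only on clause-level evidence weighted by the public constitution. The clause names, weights, and report fields are illustrative, not the demo's actual schema.

```python
# Public constitution: clause weights are visible to every agent up front.
# Weights here are illustrative (safety-first, echoing the 76.5% safety mass).
CONSTITUTION = {"safety": 0.765, "cost": 0.135, "speed": 0.10}

def naive_settlement(report):
    # Pays on the agent's self-reported headline number: trivially gameable.
    return report["headline_score"]

def corrected_settlement(report):
    # Pays only on clause-level evidence, weighted by the public constitution.
    return sum(CONSTITUTION[c] * report["evidence"][c] for c in CONSTITUTION)

honest = {"headline_score": 0.70,
          "evidence": {"safety": 0.9, "cost": 0.6, "speed": 0.5}}
gamed  = {"headline_score": 0.99,  # inflated headline...
          "evidence": {"safety": 0.2, "cost": 0.6, "speed": 0.9}}  # ...weak safety

# Naive scoring rewards the gamed report; corrected scoring does not.
assert naive_settlement(gamed) > naive_settlement(honest)
assert corrected_settlement(honest) > corrected_settlement(gamed)
```

The point of the correction is that the thing being optimized is fixed by public rules, not by whatever number an agent chooses to surface.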

2) Own-label influence kills honesty

Leakage -> influence -> truthful margin
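A minimal sketch of the leakage chain, under the assumption that manipulation's expected payoff grows linearly with how much an agent's own label leaks into its score. Both constants are invented for illustration (tuned so the margin is roughly the dashboard's 0.210 at a leakage of 0.21); the real mechanism's influence model is not specified here.

```python
def truthful_margin(leakage, honest_payoff=1.0, manipulation_gain=3.76):
    # margin = payoff(honest strategy) - payoff(best manipulation).
    # Manipulation pays more the more an agent's own label leaks into
    # its score; the constants are illustrative assumptions only.
    return honest_payoff - manipulation_gain * leakage

assert truthful_margin(0.0) == 1.0   # zero leakage: honesty strictly dominates
assert truthful_margin(0.30) < 0     # past ~0.266 leakage, manipulation wins
```

The qualitative story is what matters: the margin is a decreasing function of leakage, and past a threshold the system is training manipulation, which is why influence-free settlement is treated as a first principle rather than a patch.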

3) Sparse audits keep proxies anchored

Audit-efficient learning - lower regret
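The audit panel's intuition can be shown with a deliberately simplified deterministic model: the proxy's error grows between audits, and each higher-fidelity audit re-anchors it to the true objective. The linear-drift assumption and the numbers are illustrative, not the demo's actual regret model.

```python
def avg_regret(audit_every, rounds=1000, drift=0.01):
    # Toy model (assumption): proxy error grows by `drift` each round
    # and a high-fidelity audit resets it to zero.
    gap, regret = 0.0, 0.0
    for t in range(rounds):
        gap += drift                 # proxy drifts away from the true objective
        if t % audit_every == 0:
            gap = 0.0                # sparse audit re-anchors the proxy
        regret += gap                # per-round loss from proxy misalignment
    return regret / rounds

# Auditing every 10 rounds keeps average regret an order of magnitude
# lower than auditing every 100 rounds, at 10x the audit cost.
assert avg_regret(10) < avg_regret(100)
```

The trade being managed is audit budget versus proxy drift: you almost never observe the true objective, so the question is how sparse the audits can be before regret blows up.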

4) Complementary information deserves credit

Shapley-style credit vs naive sequential scoring
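Why Shapley-style credit differs from naive sequential scoring is easiest to see on a two-contributor toy case: two pieces of evidence that are worthless alone but decisive together. The evidence names and value function are hypothetical; the demo's real characteristic function is not reproduced here.

```python
from itertools import permutations

def shapley(players, v):
    # Exact Shapley value: average marginal contribution over all orderings.
    credit = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        seen = frozenset()
        for p in order:
            credit[p] += v(seen | {p}) - v(seen)
            seen = seen | {p}
    return {p: c / len(perms) for p, c in credit.items()}

def v(coalition):
    # Hypothetical complementary evidence: either report alone proves
    # nothing; together they resolve the claim (value 1).
    return 1.0 if {"sensor_log", "maintenance_record"} <= coalition else 0.0

def naive_sequential(order, v):
    # Scores each contributor by marginal value in arrival order.
    credit, seen = {}, frozenset()
    for p in order:
        credit[p] = v(seen | {p}) - v(seen)
        seen = seen | {p}
    return credit

players = ["sensor_log", "maintenance_record"]
assert naive_sequential(players, v)["sensor_log"] == 0.0   # first mover unpaid
assert shapley(players, v)["sensor_log"] == 0.5            # complementarity paid
```

Naive sequential scoring hands all the credit to whoever arrives last; averaging over orderings is what makes complementary contributors visible to the reward layer.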

5) Human-readable output should stay faithful

Readable resolution artifact - provenance linkage
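One plausible way to keep a readable output tied to the settled record is a content digest: the memo carries a hash of the exact record it summarizes, and fidelity is a cheap equality check. The scheme, field names, and values below are illustrative assumptions, not the demo's actual provenance format.

```python
import hashlib
import json

def digest(record):
    # Digest of a canonical JSON serialization of the settled record
    # (sort_keys makes the serialization deterministic).
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def render(record):
    # Readable artifact carries the digest of the record it summarizes.
    return {"summary": f"Settled: {record['decision']}",
            "provenance": digest(record)}

def verify(output, record):
    # Fidelity check: the memo must link back to this exact settled record.
    return output["provenance"] == digest(record)

settled = {"decision": "approve vendor B", "safety_mass": 0.765}  # hypothetical
memo = render(settled)
assert verify(memo, settled)                      # linked memo passes
tampered = dict(settled, decision="approve vendor A")
assert not verify(memo, tampered)                 # any edit breaks the link
```

Under this kind of scheme, "output fidelity" stops being a vibe and becomes a checkable property of the artifact.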

Live resolution artifact

What a room of builders and researchers can read in twenty seconds
Settlement memo

    Why this matters

    Candidate coordination primitive

    The right read is not "look, agents do a task." It is "here is a plausible scoring and settlement layer an arena could actually use, critique, and reuse."

    Trust starts with what gets paid

    If the scoring rule can be steered, the system trains manipulation. Influence-free settlement is a first principle, not an optional patch.

    Audits rescue weak proxies

    Real deployments rarely observe the true objective every round. Sparse higher-fidelity audits are how you stop the proxy from drifting too far.

    Complementarity is everywhere

    Useful evidence is often only decisive in combination. If the reward layer cannot see that, it underpays exactly the contributors you need.