Open-Sourcing pca-eval
I put pca-eval on GitHub. It’s the benchmark evaluation rig for my paper on Proof-Carrying Answers, the work I’m building into a company at Detent.ai. A Proof-Carrying Answer packages every AI-generated claim with the specific evidence span that supports it, verified by a natural language inference (NLI) model, then wraps the result in a signed audit trail. An NLI model’s job is to decide whether one sentence entails another. This post is about the measurement slice.
What pca-eval is
pca-eval is an Apache 2.0 licensed Python 3.12 package that runs on a laptop CPU. No GPU required. It contains:
- A multi-tier NLI model wrapper (DeBERTa-v3-base, DeBERTa-v3-large, and DeBERTa-v3-large with a 512-token window).
- An XGBoost aggregator that combines per-tier signals into a single score.
- Six benchmark scripts: SciFact, FEVER, QASPER, HAGRID, FActScore, and AttributionBench.
- A tiered CLI runner so you can fail fast before spending minutes on NLI inference.
- Statistical tooling (bootstrap confidence intervals and McNemar’s test) for comparing two result files against each other.
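For concreteness, here is roughly what those two statistics compute. This is an illustrative, stdlib-only sketch, not the repo’s actual code, and the function names are made up:

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def mcnemar(a_correct, b_correct):
    """McNemar's chi-square (with continuity correction) on paired
    per-example outcomes from two systems. Only the discordant pairs
    matter: b = A right / B wrong, c = B right / A wrong."""
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)
```

The point of the paired test over two independent CIs: two verifiers evaluated on the same examples share most of their variance, so comparing only the examples where they disagree is a much more sensitive comparison.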
The whole thing exists to answer one question: does the verifier actually detect when a claim is not supported by the evidence in front of it? That’s the component the paper’s headline numbers come from.
What pca-eval is not
The repo is the measurement rig, not the full pipeline. It does not contain claim decomposition, per-claim retrieval, the grounded composition step, the Ed25519 signing step, or the fine-tuned DeBERTa checkpoints from the paper. It also does not contain ClaimVerify, the insurance coverage-analysis product I’m building around the methodology.
People ask why the checkpoints aren’t in the repo. Because the checkpoints are the work. The evaluation tooling is generic—it runs on any NLI verifier, any benchmark, any threshold—and that’s why it belongs in the open. The weights, the retrieval engineering, and the product are where specific time and judgment went, and the paper’s Appendix A covers the training recipe for anyone who wants to reproduce it.
Fine-tuned vs pre-trained
The README has two columns: a “Fine-tuned” column (what the paper reports) and a “Pre-trained” column (what you get out of the box with the default cross-encoder/nli-deberta-v3-base model). The gap between them is in-domain fine-tuning on a public NLI training mix. Table 1 of the paper has the numbers; Appendix A has the training recipe.
One thing worth flagging before you read those columns: some of the benchmarks in the training mix overlap with the evaluation mix, so part of the fine-tuned lift is in-distribution, not generalization. The paper discusses this explicitly in its in-distribution disclosure; don’t read the biggest gains as pure architectural wins.
What you can reproduce from this repo:
- Baseline measurements on any Hugging Face NLI model.
- Architecture ablations: swap the base model, drop a tier, remove the XGBoost aggregator, and measure the delta.
- Threshold sweeps and the optimal operating point for each benchmark.
- Bootstrap confidence intervals and head-to-head comparisons between result files.
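The threshold sweep in that list is conceptually simple: score every example, try a grid of cutoffs, keep the one with the best F1. A minimal illustrative sketch (the helper names are hypothetical, not the repo’s API):

```python
def f1(preds, labels):
    """F1 over boolean predictions and 0/1 labels."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def sweep(scores, labels, steps=101):
    """Return (best_threshold, best_f1) over an even grid in [0, 1]."""
    best_t, best_f1 = 0.0, -1.0
    for i in range(steps):
        t = i / (steps - 1)
        preds = [s >= t for s in scores]
        f = f1(preds, labels)
        if f > best_f1:
            best_t, best_f1 = t, f
    return best_t, best_f1
```

The thing the sweep makes visible is that the optimal cutoff differs per benchmark, which is exactly why the repo treats it as a tier rather than a constant.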
What you can’t reproduce without the fine-tuned checkpoints: the paper’s exact Table 1 numbers. That’s the reproducibility boundary. Evaluation scripts generalize; fine-tuned weights on one specific model mix don’t.
The tiered execution model
Verifying a benchmark is slow. Iterating on the verifier is slower. The CLI has tiers so you can fail fast—each one exists because I got tired of spending five minutes at the next tier up just to find out I’d broken the data load.
- dry-run loads the data, validates the format, and prints stats. No model, no downloads, no inference. It finishes in seconds and catches data bugs before an NLI run wastes your time telling you the same thing.
- nli-only runs the local DeBERTa NLI model against the provided evidence passages. This is the default tier for headline numbers, and the cleanest “does the verifier work on the verification task” measurement.
- nli-abstract runs NLI against full abstracts on SciFact. This tests retrieval and verification together on the one benchmark where that combination makes sense.
- sweep runs a threshold sweep to find the F1-maximizing operating point. Thresholds are benchmark-specific; you need this tier to pick one defensibly.
- calibrate applies Platt scaling on top of the sweep, producing calibrated probability estimates instead of raw softmax outputs.
- ais computes the AIS attribution metric, the standard score for attribution tasks on HAGRID and AttributionBench.
The whole thing runs through one CLI: python -m benchmarks.run <benchmark> --tier <tier>. The tier you pick is the biggest lever on how long you’ll wait.
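Since the calibrate tier applies Platt scaling, it is worth seeing how little machinery that involves: fit a one-dimensional logistic map from raw verifier scores to probabilities on held-out data. An illustrative stdlib-only sketch using plain gradient descent (not the repo’s implementation, which presumably uses a standard solver):

```python
import math

def fit_platt(scores, labels, lr=0.5, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss,
    using held-out (score, 0/1 label) pairs."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def calibrate(a, b, score):
    """Map a raw score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

The output is still a number between 0 and 1, but now “0.9” is meant to behave like a 90% chance of being right on held-out data, rather than an arbitrary softmax value.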
Running it yourself
Requires Python 3.12+ and network access for the initial dataset and model downloads. Everything else runs locally.
Clone, set up a venv, install in editable mode:
git clone https://github.com/conjfrnk/pca-eval.git
cd pca-eval
python -m venv .venv && source .venv/bin/activate
pip install -e .

Download the benchmark datasets (~100 MB on disk):
python -m benchmarks.download

Make sure the data actually loads. This takes seconds and uses no model:
python -m benchmarks.run scifact --tier dry-run

Run the real thing on one benchmark. On first run, the NLI model (~400 MB) is downloaded from Hugging Face Hub and cached locally; a single-benchmark nli-only run takes about five minutes on a laptop:
python -m benchmarks.run scifact --tier nli-only

Run all six benchmarks at once:
python -m benchmarks.run all --tier nli-only

Results land in results/ as JSON files. Run bootstrap confidence intervals on a specific result file:
python -m benchmarks.run_stats results/scifact_nli-only_*.json

That’s the entire day-to-day workflow. Everything runs locally. The only network traffic is the one-time dataset download and the one-time model download from Hugging Face.
Further reading
One thing worth remembering while using pca-eval: NLI entailment scores measure whether a claim is supported by the evidence you gave the model. They do not measure whether the claim is true in the world. The verifier can certify that a statement is grounded in a policy document, and the document itself can still be wrong. That’s the whole point of decoupling verification from factual correctness.
For the full methodology and the product side of the work, go to Detent.ai. The paper is Proof-Carrying Answers: Machine-Verifiable AI Outputs with Typed Claims and Structured Evidence Chains, and the companion blog post that explains the application to insurance is AI That Shows Its Work.