CanonicAI
The canonical-data factory

Corpus in. Canonical data out.

CanonicAI turns unstructured knowledge — books, research papers, domain documents — into canonical, queryable datasets. At scale, with provenance.

“Anyone can write a prompt. The defensible thing is running thousands of multi-step extractions reliably, idempotently, and with lineage — at a cost you control. That’s why factory is the honest metaphor: a production line with QA and inventory control, not a clever prompt.”

— THE OPERATING THESIS

The production lines

Book Factory

Whole books deconstructed chapter by chapter into tagged summaries, argument models, and factor structures, rather than naive chunks.

Article Factory

Peer-reviewed research distilled into instruments, constructs, citations, and effect data — feeding a living evidence engine.

Compendium Factory

Reference catalogs of validated measurement scales, extracted item by item and checked against known ground truth at 95% recall.
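For readers unfamiliar with the metric, recall is the share of ground-truth items that the extraction recovered. A minimal sketch follows, assuming items can be compared by identifier; the `recall` function and the item identifiers are hypothetical, and the numbers are made up for illustration, not CanonicAI results.

```python
def recall(extracted: set[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth items that appear in the extracted set."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return len(extracted & ground_truth) / len(ground_truth)


# Illustrative data only: 20 known items, 19 recovered, 1 spurious extra.
truth = {f"item-{i}" for i in range(1, 21)}
found = {f"item-{i}" for i in range(1, 20)} | {"item-99"}

score = recall(found, truth)  # 19 of 20 ground-truth items recovered
```

Note that the spurious `item-99` does not lower recall; catching over-extraction would need a separate precision check.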

Schema Authority

One canonical measurement vocabulary — constructs, items, instruments, effect sizes — defined once, conformed to by every consumer.
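One way to picture "defined once, conformed to by every consumer" is a single shared type plus a conformance check. This is only a sketch under stated assumptions: the `EffectSize` fields and the `conforms` helper are hypothetical and are not taken from CanonicAI's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EffectSize:
    # Hypothetical field names chosen for illustration.
    construct: str    # what is being measured
    instrument: str   # the scale used to measure it
    metric: str       # e.g. a correlation or standardized difference
    value: float


def conforms(record: dict, schema: type = EffectSize) -> bool:
    """True if the record has exactly the schema's fields, with matching types."""
    expected = {f.name: f.type for f in fields(schema)}
    if set(record) != set(expected):
        return False
    return all(isinstance(record[name], typ) for name, typ in expected.items())


good = {"construct": "job satisfaction", "instrument": "hypothetical-scale",
        "metric": "r", "value": 0.31}
bad = {"construct": "job satisfaction", "value": 0.31}  # missing fields

good_ok = conforms(good)
bad_ok = conforms(bad)
```

A consumer that runs a check like this at its boundary rejects records that drift from the shared vocabulary instead of silently absorbing them.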

The line, running

8,519
Registered assets
560+
Instruments extracted
120+
Books deconstructed
SHA-256
Provenance, every source

Glass box, not black box

Every dataset CanonicAI ships traces to its source — file hashes, extraction lineage, model and prompt provenance, idempotent re-derivation. If a number is in the output, you can walk it back to the page it came from.
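The lineage idea above can be sketched in a few lines: hash the source file, hash the prompt, record the model, and derive a deterministic key so that re-running the same extraction is a no-op. The `Lineage` fields, the `extract_once` helper, the in-memory `store`, and the model name are all assumptions made for illustration, not CanonicAI's implementation.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Lineage:
    # Field names are illustrative assumptions.
    source_sha256: str   # hash of the source file's bytes
    model: str           # identifier of the extraction model
    prompt_sha256: str   # hash of the exact prompt text
    page: int            # page the extracted value came from

    def derivation_key(self) -> str:
        """Deterministic key: identical inputs always produce the same key."""
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return sha256_hex(canonical)


def extract_once(store: dict, lineage: Lineage, run_extraction):
    """Run the extraction only if this exact derivation is not already stored."""
    key = lineage.derivation_key()
    if key not in store:
        store[key] = run_extraction()
    return store[key]


lineage = Lineage(
    source_sha256=sha256_hex(b"example source bytes"),
    model="hypothetical-model-v1",
    prompt_sha256=sha256_hex(b"Extract every scale item."),
    page=12,
)

store = {}
calls = []  # records how many times the extraction actually ran


def fake_extraction():
    calls.append(1)
    return {"value": 0.42}  # placeholder output, not real data


first = extract_once(store, lineage, fake_extraction)
second = extract_once(store, lineage, fake_extraction)  # served from store
```

Because the key covers the source hash, the model, and the prompt hash, changing any one of them yields a new derivation rather than overwriting the old one.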

The engine is the producer and source of truth; everything downstream is a consumer. It powers the PeopleAnalyst family:

PeopleAnalyst: the destination
Principia: the evidence engine
People Analytics Toolbox