Provenance-first data licensing · Built on 13 years of data

The origin layer for frontier AI.

Proprietary, provenance-tracked data — created by identified experts under documented consent — licensed to the labs, enterprises, and sovereign programs building models that have to be right.

Curriculum-aligned questions · expert video · lines of production code · operational data · egocentric video, and counting

A two-sided data platform

Data with a verifiable past,
licensed for the long run.

One principle on both sides of the platform: data used for ongoing training deserves ongoing compensation — and a chain of custody you can show your lawyers.

01 — For Model Builders

License rights-cleared, AI-ready data by vertical.

  • Curated assets across education, STEM video, egocentric video, code, and operational workflows
  • Request samples before any commitment
  • Ongoing-supply and quality-certification options — never one-time dumps
  • Full provenance documentation with every delivery
02 — For Data Providers

Turn proprietary data into recurring revenue.

  • Ethical, governed licensing of underutilized data assets
  • We handle provenance, IP protection, and compliance
  • Recurring compensation when your data trains models on an ongoing basis
  • You control who licenses what, and for how long
Data assets

Organized by vertical.
Verified at the source.

Every available asset is rights-cleared, provenance-tracked, and ready for training or evaluation. Roadmap assets ship only after consent and compliance review.

Education · Available now

Multimodal K-12

4M+ curriculum-aligned questions across 12 Indian languages, 33 boards, Classes 1–12 — tagged by board, class, subject, chapter, topic, and difficulty (L1–L10), with step-by-step solutions by 400+ subject-matter experts. Plus 3.8M expert video walkthroughs, 5,100 hours of lectures, and 4,500 hours of mentor audio.

4M+ questions · 12 languages · 33 boards · 3.8M walkthroughs

Best for: vernacular reasoning benchmarks, hallucination reduction, difficulty-calibrated evaluation.
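
For teams planning difficulty-calibrated evaluation, the tagging described above (board, class, subject, chapter, topic, difficulty L1–L10) could be modeled roughly as follows. This is only an illustrative sketch: the `QuestionRecord` class, its field names, and the `filter_by_difficulty` helper are assumptions made for the example, not the delivered schema, and the sample values are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class QuestionRecord:
    """Hypothetical shape of one tagged question (field names are assumed)."""
    board: str
    grade: int        # Classes 1-12
    subject: str
    chapter: str
    topic: str
    difficulty: int   # 1-10, corresponding to the L1-L10 tags
    language: str

    def __post_init__(self) -> None:
        # Reject values outside the ranges stated in the asset description.
        if not 1 <= self.grade <= 12:
            raise ValueError(f"grade must be 1-12, got {self.grade}")
        if not 1 <= self.difficulty <= 10:
            raise ValueError(f"difficulty must be 1-10, got {self.difficulty}")


def filter_by_difficulty(
    records: Iterable[QuestionRecord], low: int, high: int
) -> List[QuestionRecord]:
    """Return the records whose difficulty lies in [low, high], inclusive."""
    return [r for r in records if low <= r.difficulty <= high]


# Placeholder data, not real catalog entries.
sample = [
    QuestionRecord("BOARD_A", 8, "Mathematics", "Chapter 1", "Topic 1", 3, "LANG_A"),
    QuestionRecord("BOARD_A", 8, "Mathematics", "Chapter 2", "Topic 4", 9, "LANG_A"),
]
easy_band = filter_by_difficulty(sample, 1, 5)
```

A buyer could build an evaluation slice per difficulty band this way and compare model accuracy across bands.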

STEM Video · Available now

STEM Video Library

~80,000 PhD-level lecture videos — roughly 40,000 hours of native 1080p footage, primarily English-language sciences. Deep, structured exposition of advanced STEM at a scale the open web can't assemble.

~80,000 videos · ~40,000 hours · Native 1080p

Best for: multimodal pretraining, scientific reasoning, video-language alignment.

Egocentric Video · Available now

First-Person POV Corpus

9,200+ first-person videos spanning 30+ rarely captured domains — tech repair, cloud kitchens, gaming, skiing, restoration, craft and industrial work. The value is access: domains the open web doesn't cover, recorded under documented consent with full provenance.

9,200+ videos · 30+ domains · Consent-documented

Best for: embodied AI, robotics perception, multimodal action understanding.

Engineering · Available now

Production Codebase

16.8M lines of production code across 95 repositories with 13 years of continuous git history — JavaScript, PHP, TypeScript, and Python. Real engineering decisions, refactors, and fixes. Not synthetic snippets.

16.8M lines · 95 repos · 13 yrs history

Best for: coding-agent training, eval benchmarks, RLHF on real-world code.

Operations · Available · NDA

Operational Workflow Data

Anonymized enterprise communication and CRM workflow data capturing how cross-functional teams actually decide — escalations, handoffs, approvals, resolutions over time. Shared under NDA.

Anonymized · CRM + comms · NDA required

Best for: enterprise & agentic AI, workflow modeling, decision-pattern learning.

Healthcare · In development

Healthcare Data

Being developed under data-protection and consent review. Released only once governance, anonymization, and regulatory requirements are fully satisfied.

Register interest: we'll notify you ahead of general availability.

Voice & Audio · Coming soon

Conversational Audio

Large-scale real-world voice conversations, being prepared for release under documented consent and in compliance with India's Digital Personal Data Protection (DPDP) Act. Natural, multi-speaker, real-context audio, not studio recordings.

Register interest: early-access samples for qualified teams.

Why DataOrigin

Hard to replicate.
Easy to defend.

Anyone can scrape. Almost no one can show where their data came from, who made it, and that they had the right to license it. We can — for every asset, every time.

Full provenance

Every asset traces to identified creators, with documented chain of custody from creation to delivery.

Documented consent

Content created by experts under explicit, recorded consent — no ambiguity about rights or downstream use.

Multilingual depth

Twelve Indian languages with curriculum-grade tagging — vernacular depth the open web simply doesn't have.


Governance & compliance

NDAs, governed delivery, anonymization where required, and DPDP-aware processing throughout.

Ongoing supply

Pipelines that keep producing — refreshed, expanded, and quality-certified across the life of a license.

Recurring licensing

If data trains models on an ongoing basis, the holder is compensated on an ongoing basis. That's the model.
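
As a rough illustration of what "documented chain of custody from creation to delivery" can mean in data form, the sketch below models a provenance record and checks it for gaps. Everything here is an assumption made for the example: the `ProvenanceRecord` and `CustodyEvent` classes, the action names, and the checks themselves are hypothetical and do not describe the documentation format shipped with a delivery.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass(frozen=True)
class CustodyEvent:
    """One hand-off in an asset's history (hypothetical structure)."""
    on: date
    holder: str
    action: str  # assumed vocabulary: "created", "reviewed", "delivered"


@dataclass
class ProvenanceRecord:
    """Hypothetical per-asset provenance record."""
    asset_id: str
    creator_id: str
    consent_ref: str
    events: List[CustodyEvent] = field(default_factory=list)


def custody_problems(record: ProvenanceRecord) -> List[str]:
    """Return human-readable gaps in the record; an empty list means none found."""
    problems: List[str] = []
    if not record.creator_id:
        problems.append("missing creator")
    if not record.consent_ref:
        problems.append("missing consent reference")
    if not record.events:
        problems.append("no custody events")
        return problems
    if record.events[0].action != "created":
        problems.append("chain does not start at creation")
    if record.events[-1].action != "delivered":
        problems.append("chain does not end at delivery")
    # Each event must be dated on or after the one before it.
    for earlier, later in zip(record.events, record.events[1:]):
        if later.on < earlier.on:
            problems.append("events out of chronological order")
            break
    return problems


# Placeholder identifiers, not real assets or people.
complete = ProvenanceRecord(
    asset_id="asset-001",
    creator_id="expert-042",
    consent_ref="consent-2024-0001",
    events=[
        CustodyEvent(date(2024, 1, 10), "expert-042", "created"),
        CustodyEvent(date(2024, 2, 1), "review-team", "reviewed"),
        CustodyEvent(date(2024, 3, 5), "licensee-a", "delivered"),
    ],
)
incomplete = ProvenanceRecord("asset-002", "expert-007", "", complete.events[1:])
```

A legal or procurement reviewer could run a check of this kind over every record in a delivery and ask for clarification on any asset that reports a problem.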

How it works

From sample to ongoing license.

A governed path designed for legal and procurement teams as much as research teams.

Sample review

Request representative samples of any asset. Evaluate quality, coverage, and fit on your own benchmarks.

NDA

Mutual NDA covering deeper access, documentation, and any asset shared under restricted terms.

Commercial terms

Scope, exclusivity, refresh cadence, and recurring-licensing structure agreed up front.

Governed delivery

Secure, audited delivery with provenance documentation attached to every dataset shipped.

Ongoing licensing

Continuous supply, quality certification, and renewals — a relationship, not a transaction.

Who it's for

Matched to what you're building.

For data providers

Your data is an asset.
We make it behave like one.

Most organizations sit on proprietary data AI labs would pay for — but lack the provenance documentation, legal scaffolding, and buyer relationships to license it safely. That's exactly what we built.

13 years of data history · 400+ subject-matter experts · 33 education boards · Institution-verified · Compliance-first · Compliance badge (coming soon)

Contact

Start with a sample.

Tell us what you're building or what data you hold. We'll respond with relevant samples, documentation, or next steps — usually within two business days.

partnerships@dataorigin.ai