Every available asset is rights-cleared, provenance-tracked, and ready for training or evaluation. Roadmap assets ship only after consent and compliance review.
Education · Available now
Multimodal K-12
4M+ curriculum-aligned questions across 12 Indian languages, 33 boards, Classes 1–12 — tagged by board, class, subject, chapter, topic, and difficulty (L1–L10), with step-by-step solutions by 400+ subject-matter experts. Plus 3.8M expert video walkthroughs, 5,100 hours of lectures, and 4,500 hours of mentor audio.
4M+ questions · 12 languages · 33 boards · 3.8M walkthroughs
Best for: vernacular reasoning benchmarks, hallucination reduction, difficulty-calibrated evaluation.
STEM Video · Available now
STEM Video Library
~80,000 PhD-level lecture videos: roughly 40,000 hours of native 1080p footage, primarily in English and focused on the sciences. Deep, structured exposition of advanced STEM at a scale the open web can't assemble.
~80,000 videos · ~40,000 hours · Native 1080p
Best for: multimodal pretraining, scientific reasoning, video-language alignment.
Egocentric Video · Available now
First-Person POV Corpus
9,200+ first-person videos spanning 30+ rarely captured domains — tech repair, cloud kitchens, gaming, skiing, restoration, craft and industrial work. The value is access: domains the open web doesn't cover, recorded under documented consent with full provenance.
9,200+ videos · 30+ domains · Consent-documented
Best for: embodied AI, robotics perception, multimodal action understanding.
Engineering · Available now
Production Codebase
16.8M lines of production code across 95 repositories with 13 years of continuous git history — JavaScript, PHP, TypeScript, and Python. Real engineering decisions, refactors, and fixes. Not synthetic snippets.
16.8M lines · 95 repos · 13 yrs history
Best for: coding-agent training, eval benchmarks, RLHF on real-world code.
Operations · Available (NDA)
Operational Workflow Data
Anonymized enterprise communication and CRM workflow data capturing how cross-functional teams actually decide — escalations, handoffs, approvals, resolutions over time. Shared under NDA.
Anonymized · CRM + comms · NDA required
Best for: enterprise & agentic AI, workflow modeling, decision-pattern learning.
Healthcare · In development
Healthcare Data
Being developed under data-protection and consent review. Released only once governance, anonymization, and regulatory requirements are fully satisfied.
Register interest: we'll notify you ahead of general availability.
Voice & Audio · Coming soon
Conversational Audio
Large-scale real-world voice conversations, currently going through consent and privacy compliance review (DPDP) ahead of release. Natural, multi-speaker, real-context audio, not studio recordings.
Register interest: early-access samples for qualified teams.