croissant-baker
croissant-baker automatically generates Croissant JSON-LD metadata for ML datasets. Point it at a dataset directory and it produces a standards-compliant .jsonld file — ready for submission to repositories like PhysioNet, NeurIPS Datasets & Benchmarks, or any platform that benefits from standardized dataset metadata.
Installation
or with uv:
Quick start
croissant-baker \
--input ./my-dataset \
--creator "Jane Smith,jane@example.com" \
--description "My dataset" \
--license "CC-BY-4.0"
See the Getting Started guide for a full walkthrough.
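Once a metadata file has been written, a quick way to sanity-check it is to load it as plain JSON and look at a few top-level fields. The sketch below assumes the output follows the Croissant convention of top-level name, license, distribution, and recordSet keys. The summarize_croissant helper, the metadata.jsonld file name, and the sample document are hypothetical stand-ins for illustration, not real croissant-baker output.

```python
import json
import tempfile
from pathlib import Path


def summarize_croissant(path):
    """Return a few top-level facts from a Croissant JSON-LD file (hypothetical helper)."""
    doc = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        "name": doc.get("name"),
        "license": doc.get("license"),
        "files": len(doc.get("distribution", [])),
        "record_sets": len(doc.get("recordSet", [])),
    }


# Hand-written stand-in document; a real file would contain many more fields.
sample = {
    "@type": "sc:Dataset",
    "name": "my-dataset",
    "description": "My dataset",
    "license": "CC-BY-4.0",
    "distribution": [{"@type": "cr:FileObject", "name": "train.csv"}],
    "recordSet": [{"@type": "cr:RecordSet", "name": "train"}],
}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "metadata.jsonld"  # assumed file name, for illustration only
    out.write_text(json.dumps(sample), encoding="utf-8")
    summary = summarize_croissant(out)

print(summary)
```

Reading the file as ordinary JSON keeps the check dependency-free; it does not replace validation against the Croissant spec.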
Features
- Automatic type inference — reads CSV/TSV, Parquet, FHIR NDJSON, JSON/JSONL, WFDB, and images; maps column and field types to the Croissant schema
- Metadata overrides — sensible defaults for every field; use flags to set name, description, license, creators, citation, and more
- RAI metadata — document responsible AI fields (data collection, biases, limitations, sensitive information) via CLI flags or a YAML config
- Validation built-in — validates against the Croissant spec via mlcroissant before writing; use --no-validate to skip
- Dry-run mode — preview which files would be processed with --dry-run before committing
- Include / exclude filters — glob patterns to include or exclude files by name
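To illustrate how include and exclude glob patterns on file names typically combine, here is a minimal sketch using Python's standard fnmatch module. It is not croissant-baker's actual implementation: the select_files function and its parameters are hypothetical, and the rule that exclude patterns take precedence over include patterns is an assumption.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_files(paths, include=("*",), exclude=()):
    """Keep paths whose file name matches any include pattern and no exclude pattern."""
    selected = []
    for path in paths:
        name = PurePosixPath(path).name  # match on the file name, not the full path
        if not any(fnmatchcase(name, pattern) for pattern in include):
            continue
        if any(fnmatchcase(name, pattern) for pattern in exclude):
            continue  # assumption: exclude wins over include
        selected.append(path)
    return selected


files = [
    "data/train.csv",
    "data/test.csv",
    "notes/README.md",
    "data/tmp_scratch.csv",
]
kept = select_files(files, include=["*.csv"], exclude=["tmp_*"])
print(kept)  # ['data/train.csv', 'data/test.csv']
```

With the default arguments every path is kept, which matches the idea of filters being optional.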