croissant-baker
croissant-baker automatically generates Croissant JSON-LD metadata for ML datasets. Point it at a dataset directory and it produces a standards-compliant .jsonld file — ready for submission to repositories like PhysioNet, NeurIPS Datasets & Benchmarks, or any platform that benefits from standardized dataset metadata.
Installation
or with uv:
Quick start
croissant-baker \
--input ./my-dataset \
--creator "Jane Smith,jane@example.com" \
--description "My dataset" \
--license "CC-BY-4.0"
See the Getting Started guide for a full walkthrough.
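To check what a run produced, the generated file can be read as ordinary JSON. The sketch below is a hedged illustration, not part of croissant-baker: `summarize_croissant` is a hypothetical helper, and the inline `example` document is hand-written for demonstration. It assumes only the standard Croissant keys `distribution`, `recordSet`, and `field`.

```python
import json


def summarize_croissant(metadata):
    """Return a small summary of a parsed Croissant JSON-LD document."""
    return {
        "name": metadata.get("name"),
        "license": metadata.get("license"),
        # Each distribution entry describes a file or a set of files.
        "files": [d.get("name") for d in metadata.get("distribution", [])],
        # Each record set lists the fields (columns) it exposes.
        "record_sets": {
            rs.get("name"): [f.get("name") for f in rs.get("field", [])]
            for rs in metadata.get("recordSet", [])
        },
    }


# Hypothetical, hand-written document; real output will carry more detail.
example = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "my-dataset",
  "license": "CC-BY-4.0",
  "distribution": [{"@type": "cr:FileObject", "name": "train.csv"}],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "name": "train",
      "field": [{"@type": "cr:Field", "name": "id", "dataType": "sc:Integer"}]
    }
  ]
}
""")

print(summarize_croissant(example))
```

For a real file, replace the inline string with `json.load(open(path))`, where `path` points at whatever .jsonld file your run wrote.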
Features
- Automatic type inference — reads CSV/TSV, Parquet, FHIR NDJSON, JSON/JSONL, WFDB, and images; maps column and field types to the Croissant schema
- Metadata overrides — sensible defaults for every field; use flags to set name, description, license, creators, citation, and more
- RAI metadata — document responsible AI fields (data collection, biases, limitations, sensitive information) via CLI flags or a YAML config
- Validation built-in — validates against the Croissant spec via mlcroissant before writing; use --no-validate to skip
- Dry-run mode — preview which files would be processed with --dry-run before committing
- Include / exclude filters — glob patterns to include or exclude files by name
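As a rough illustration of two of the features above, glob-based file filtering and column type inference, here is a minimal sketch using only the Python standard library. It is not croissant-baker's implementation: `select_files`, `infer_value_type`, `infer_column_types`, and the `CROISSANT_TYPES` mapping are hypothetical names, and the widening rule (integers mixed with floats become floats, any other mix becomes text) is an assumption chosen for simplicity.

```python
import csv
import io
from fnmatch import fnmatch

# Assumed mapping from Python types to the schema.org data types
# that Croissant fields use for their dataType.
CROISSANT_TYPES = {
    bool: "sc:Boolean",
    int: "sc:Integer",
    float: "sc:Float",
    str: "sc:Text",
}


def select_files(paths, include=("*",), exclude=()):
    """Keep paths matching any include pattern and no exclude pattern."""
    return [
        p for p in paths
        if any(fnmatch(p, pat) for pat in include)
        and not any(fnmatch(p, pat) for pat in exclude)
    ]


def infer_value_type(value):
    """Return the narrowest Python type that can parse one CSV cell."""
    if value.lower() in ("true", "false"):
        return bool
    for caster in (int, float):
        try:
            caster(value)
            return caster
        except ValueError:
            pass
    return str


def infer_column_types(csv_text):
    """Map each column name to a data type, widening when values disagree."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if name in seen and value:  # skip empty cells and stray columns
                seen[name].add(infer_value_type(value))
    result = {}
    for name, types in seen.items():
        if len(types) == 1:
            chosen = next(iter(types))
        elif types == {int, float}:
            chosen = float  # integers widen to floats
        else:
            chosen = str  # mixed or empty columns fall back to text
        result[name] = CROISSANT_TYPES[chosen]
    return result


paths = ["data/train.csv", "data/test.csv", "notes/readme.md"]
print(select_files(paths, include=("*.csv",), exclude=("*test*",)))
print(infer_column_types("id,score,label\n1,0.5,cat\n2,3,dog\n"))
```

Note that `fnmatch` lets `*` match across path separators, so `*.csv` also matches files in subdirectories. The tool's own pattern semantics may differ.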
Links
Citation
If you use Croissant Baker in your research, please cite our arXiv preprint:
@misc{attrach2026croissantbakermetadatageneration,
title={Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets},
author={Rafi Al Attrach and Rajna Fani and Sebastian Lobentanzer and Joan Giner-Miguelez and Debanshu Das and Varuni H. K. and Nobin Sarwar and Rajat Ghosh and Anwai Archit and Surbhi Motghare and Christina Conrad Parry and Luis Oala and Lara Grosso and Joaquin Vanschoren and Steffen Vogler and Sujata Goswami and Eric S. Rosenthal and Marzyeh Ghassemi and Matthew McDermott and Tom Pollard},
year={2026},
eprint={2605.15079},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.15079},
}