croissant-baker

croissant-baker automatically generates Croissant JSON-LD metadata for ML datasets. Point it at a dataset directory and it produces a standards-compliant .jsonld file, ready for submission to repositories and venues such as PhysioNet or NeurIPS Datasets & Benchmarks, or to any platform that benefits from standardized dataset metadata.

Installation

pip install croissant-baker

or with uv:

uv add croissant-baker

Quick start

croissant-baker \
  --input ./my-dataset \
  --creator "Jane Smith,jane@example.com" \
  --description "My dataset" \
  --license "CC-BY-4.0"

See the Getting Started guide for a full walkthrough.
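
To run the Quick start command from a script (for example, over several dataset directories), something like the following sketch may help. It uses only the flags shown above. The helper names build_command and bake are hypothetical and not part of croissant-baker, and the sketch assumes the croissant-baker executable is on your PATH.

```python
import subprocess


def build_command(input_dir, creator, description, license_id):
    """Assemble the Quick start invocation as an argument list.

    Hypothetical helper: only the flags shown in the Quick start are used.
    """
    return [
        "croissant-baker",
        "--input", str(input_dir),
        "--creator", creator,
        "--description", description,
        "--license", license_id,
    ]


def bake(input_dir, creator, description, license_id):
    """Run croissant-baker; raises CalledProcessError on a non-zero exit.

    Assumes the croissant-baker executable is installed and on PATH.
    """
    cmd = build_command(input_dir, creator, description, license_id)
    return subprocess.run(cmd, check=True)


# Example call (not executed here):
# bake("./my-dataset", "Jane Smith,jane@example.com", "My dataset", "CC-BY-4.0")
```

Passing the arguments as a list rather than a single shell string means values containing spaces or commas, such as the creator field, need no extra quoting.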

Features

Automatic type inference — reads CSV/TSV, Parquet, FHIR NDJSON, JSON/JSONL, WFDB, and images; maps column and field types to the Croissant schema
Metadata overrides — sensible defaults for every field; use flags to set name, description, license, creators, citation, and more
RAI metadata — document responsible AI fields (data collection, biases, limitations, sensitive information) via CLI flags or a YAML config
Validation built-in — validates against the Croissant spec via mlcroissant before writing; use --no-validate to skip
Dry-run mode — preview which files would be processed with --dry-run before committing
Include / exclude filters — glob patterns to include or exclude files by name
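
To give a feel for what automatic type inference involves, here is a minimal sketch for a CSV file. It is an illustration only and not croissant-baker's actual implementation: the function names are hypothetical, the rules are deliberately simple, and the target labels (sc:Boolean, sc:Integer, sc:Float, sc:Text) are assumed to be the schema.org types used as Croissant data types.

```python
import csv
import io


def _parses(cast, value):
    """Return True if cast(value) succeeds (for example int or float)."""
    try:
        cast(value)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Pick the narrowest assumed type label that fits every non-empty value."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "sc:Text"  # nothing to go on, fall back to text
    if all(v.lower() in {"true", "false"} for v in non_empty):
        return "sc:Boolean"
    if all(_parses(int, v) for v in non_empty):
        return "sc:Integer"
    if all(_parses(float, v) for v in non_empty):
        return "sc:Float"
    return "sc:Text"


def infer_columns(csv_text):
    """Map each CSV column name to an inferred type label."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return {
        name: infer_type([row[i] for row in body if i < len(row)])
        for i, name in enumerate(header)
    }


sample = (
    "subject_id,age,weight_kg,sex,consented\n"
    "1,34,70.5,F,true\n"
    "2,51,82,M,false\n"
)
print(infer_columns(sample))
```

In this sketch the weight_kg column comes out as sc:Float because one value has a decimal part, while sex falls back to sc:Text. A real tool would likely also handle dates, missing-value markers, and sampling of large files.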

Citation

If you use Croissant Baker in your research, please cite our arXiv preprint:

@misc{attrach2026croissantbakermetadatageneration,
      title={Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets},
      author={Rafi Al Attrach and Rajna Fani and Sebastian Lobentanzer and Joan Giner-Miguelez and Debanshu Das and Varuni H. K. and Nobin Sarwar and Rajat Ghosh and Anwai Archit and Surbhi Motghare and Christina Conrad Parry and Luis Oala and Lara Grosso and Joaquin Vanschoren and Steffen Vogler and Sujata Goswami and Eric S. Rosenthal and Marzyeh Ghassemi and Matthew McDermott and Tom Pollard},
      year={2026},
      eprint={2605.15079},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.15079},
}