croissant-baker

croissant-baker automatically generates Croissant JSON-LD metadata for ML datasets. Point it at a dataset directory and it produces a standards-compliant .jsonld file, ready for submission to repositories and venues such as PhysioNet or the NeurIPS Datasets & Benchmarks track, or to any platform that benefits from standardized dataset metadata.

Installation

pip install croissant-baker

or with uv:

uv add croissant-baker

Quick start

croissant-baker \
  --input ./my-dataset \
  --creator "Jane Smith,jane@example.com" \
  --description "My dataset" \
  --license "CC-BY-4.0"

See the Getting Started guide for a full walkthrough.
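
If you want to drive the quick-start command from a script (for example, to run it over several datasets), the following is a minimal sketch in Python. It assembles an argument list using only the flags named on this page (--input, --creator, --description, --license, --dry-run, --no-validate). The helper name build_command and its parameters are hypothetical, and the sketch assumes --dry-run and --no-validate can be combined with the other flags, which this page does not confirm.

```python
import shlex


def build_command(input_dir, creator, description, license_id,
                  dry_run=False, validate=True):
    """Assemble a croissant-baker argument list (hypothetical helper).

    Only flags that appear on this page are used; their exact
    interaction is an assumption, not documented behavior.
    """
    cmd = [
        "croissant-baker",
        "--input", input_dir,
        "--creator", creator,
        "--description", description,
        "--license", license_id,
    ]
    if dry_run:
        cmd.append("--dry-run")      # preview which files would be processed
    if not validate:
        cmd.append("--no-validate")  # skip the built-in validation step
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "./my-dataset",
        "Jane Smith,jane@example.com",
        "My dataset",
        "CC-BY-4.0",
        dry_run=True,
    )
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems with values that contain spaces or commas, such as the creator field.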

Features

  • Automatic type inference — reads CSV/TSV, Parquet, FHIR NDJSON, JSON/JSONL, WFDB, and images; maps column and field types to the Croissant schema
  • Metadata overrides — sensible defaults for every field; use flags to set name, description, license, creators, citation, and more
  • RAI metadata — document responsible AI fields (data collection, biases, limitations, sensitive information) via CLI flags or a YAML config
  • Validation built-in — validates against the Croissant spec via mlcroissant before writing; use --no-validate to skip
  • Dry-run mode — preview which files would be processed with --dry-run before committing
  • Include / exclude filters — glob patterns to include or exclude files by name
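
To illustrate how include and exclude glob patterns typically combine, here is a small sketch using Python's standard fnmatch module. It is not croissant-baker's implementation: the function select_files, the sample file names, and the rule "keep a file if it matches any include pattern and no exclude pattern" are assumptions for illustration, and the tool's actual flag names and precedence rules may differ.

```python
from fnmatch import fnmatch


def select_files(names, include=("*",), exclude=()):
    """Keep names that match any include pattern and no exclude pattern.

    Hypothetical helper showing one common include/exclude convention.
    """
    return [
        name for name in names
        if any(fnmatch(name, pat) for pat in include)
        and not any(fnmatch(name, pat) for pat in exclude)
    ]


# Illustrative file names only.
files = ["patients.csv", "notes.jsonl", "scan_001.png", "patients_backup.csv"]

selected = select_files(
    files,
    include=["*.csv", "*.jsonl"],
    exclude=["*_backup*"],
)
print(selected)
```

Running a preview with --dry-run is the reliable way to confirm which files a given set of patterns actually selects.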