Examples
MIMIC-IV Demo (bundled test data)
The repo includes a small MIMIC-IV demo dataset under tests/data/. This is the easiest way to try the tool immediately after cloning:
# Minimal
croissant-baker \
--input tests/data/input/mimiciv_demo/physionet.org/files/mimic-iv-demo/ \
--creator "Me" \
--output mimic-iv-demo-croissant.jsonld
With full metadata (suitable for publishing):
croissant-baker \
--input tests/data/input/mimiciv_demo/physionet.org/files/mimic-iv-demo/ \
--name "MIMIC-IV Clinical Database Demo" \
--description "Demo subset of MIMIC-IV containing 100 de-identified patients from Beth Israel Deaconess Medical Center (2008-2019), intended for workshops and rapid prototyping" \
--url "https://physionet.org/content/mimic-iv-demo/2.2/" \
--license "https://opendatacommons.org/licenses/odbl/1-0/" \
--dataset-version "2.2" \
--date-published "2023-06-22" \
--citation "Johnson, A., et al. (2023). MIMIC-IV Clinical Database Demo (version 2.2). PhysioNet. https://doi.org/10.13026/dp1f-ex47" \
--creator "Alistair Johnson,aewj@mit.edu,https://physionet.org/" \
--creator "Lucas Bulgarelli,,https://mit.edu/" \
--creator "Tom Pollard,tpollard@mit.edu,https://physionet.org/" \
--creator "Leo Anthony Celi,lceli@mit.edu,https://lcp.mit.edu/" \
--creator "Roger Mark,,https://lcp.mit.edu/" \
--output mimic-iv-demo-croissant.jsonld
MIMIC-IV Full dataset
croissant-baker \
--input /path/to/mimic-iv/ \
--name "MIMIC-IV" \
--description "MIMIC-IV v3.1: de-identified EHR data from ~300,000 patients at Beth Israel Deaconess Medical Center (2008-2022)" \
--url "https://physionet.org/content/mimiciv/3.1/" \
--license "PhysioNet Credentialed Health Data License 1.5.0" \
--dataset-version "3.1" \
--date-published "2024-11-12" \
--citation "Johnson, A., et al. (2024). MIMIC-IV (version 3.1). PhysioNet. https://doi.org/10.13026/kpb9-mt58" \
--creator "Alistair Johnson,aewj@mit.edu,https://physionet.org/" \
--creator "Tom Pollard,tpollard@mit.edu,https://physionet.org/" \
--creator "Leo Anthony Celi,lceli@mit.edu,https://lcp.mit.edu/" \
--output mimic-iv-croissant.jsonld
MIMIC-IV MEDS Demo
croissant-baker \
--input tests/data/input/mimiciv_demo_meds/physionet.org/files/mimic-iv-demo-meds/0.0.1/ \
--name "MIMIC-IV Demo Data in MEDS Format" \
--description "MIMIC-IV demo data converted to the Medical Event Data Standard (MEDS) format, providing a standardized longitudinal patient event representation for 100 demo patients" \
--url "https://physionet.org/content/mimic-iv-ext-meds/" \
--license "https://opendatacommons.org/licenses/odbl/1-0/" \
--dataset-version "0.0.1" \
--date-published "2024-06-18" \
--creator "Alistair Johnson,aewj@mit.edu,https://physionet.org/" \
--creator "Tom Pollard,tpollard@mit.edu,https://physionet.org/" \
--output mimic-iv-demo-meds-croissant.jsonld
Multiple creators
Creator format: Name[,Email[,URL]]. Use --creator once per person.
croissant-baker \
--input ./dataset \
--creator "Alice Lim,alice@example.com,https://alice.example.com" \
--creator "Bob Chen,bob@example.com" \
--creator "Carol Wu"
Custom output path
croissant-baker \
--input ./dataset \
--creator "Jane Smith" \
--output metadata/my-dataset-croissant.jsonld
Parent directories are created automatically.
Include / exclude file patterns
Process only CSV files:
Exclude temporary files:
Dry run — preview without writing
croissant-baker --input ./dataset --dry-run
# Dry run: 4 file(s) would be processed in './dataset':
# patients.csv
# vitals.parquet
# images/scan_001.png
# images/scan_002.png
Skip validation
Useful during development when the dataset is incomplete:
croissant-baker --input ./dataset --creator "Jane Smith" --no-validate
# Tip: Run `croissant-baker validate output.jsonld` to validate later
Validate an existing file
croissant-baker validate dataset-croissant.jsonld
# Validating: dataset-croissant.jsonld
# Valid! Croissant file passed validation
# Dataset: my-dataset
# Files: 3
# Record sets: 2
Programmatic usage (Python)
from croissant_baker.metadata_generator import MetadataGenerator
gen = MetadataGenerator(
dataset_path="./my-dataset",
creators=[{"name": "Jane Smith", "email": "jane@example.com"}],
description="My dataset description",
license="CC-BY-4.0",
)
# Generate and save (validates by default)
gen.save_metadata("my-dataset-croissant.jsonld")
# Or inspect the dict first
metadata = gen.generate_metadata()
print(metadata["name"])