`croissant-baker`

🥐 Generate Croissant metadata for datasets with automatic type inference

Usage:

$ croissant-baker [OPTIONS] COMMAND [ARGS]...

Options:

-i, --input TEXT: Directory containing dataset files
-o, --output TEXT: Output file path
--validate / --no-validate: Validate metadata before saving [default: validate]
--version: Show version and exit
--name TEXT: Dataset name (defaults to directory name)
--description TEXT: Dataset description
--url TEXT: Dataset URL (e.g., https://example.com/dataset)
--license TEXT: License URL or SPDX identifier (e.g., CC-BY-4.0)
--citation TEXT: Citation text (preferably BibTeX format)
--dataset-version TEXT: Dataset version (e.g., 1.0.0)
--date-published TEXT: Publication date (e.g., 2023-12-15 or 2023-12-15T10:30:00)
--date-created TEXT: Creation date (e.g., 2023-12-15 or 2023-12-15T10:30:00).
--date-modified TEXT: Last-modified date (e.g., 2023-12-15 or 2023-12-15T10:30:00).
--creator TEXT: Creator information. Format: 'Name[,Email[,URL]]'. Use multiple times for multiple creators. Examples: --creator 'John Doe' --creator 'Jane Smith,jane@example.com,https://jane.com'
--publisher TEXT: Publishing organization name (e.g., 'PhysioNet').
--keywords TEXT: Topical keywords for dataset discovery. Repeat (--keywords a --keywords b) or comma-delimit (--keywords 'a,b').
--in-language TEXT: BCP 47 language code (e.g., 'en'). Repeat or comma-delimit for multilingual datasets.
--same-as TEXT: URL of an equivalent record (e.g., DOI, mirror landing page). Repeat or comma-delimit.
--sd-license TEXT: License of the metadata description itself (e.g., 'CC0-1.0'), distinct from the data license.
--sd-version TEXT: Version of the metadata description (e.g., '1.0.0'), distinct from --dataset-version.
--alternate-name TEXT: Short alias for the dataset (e.g., 'MIMIC-IV').
--is-live-dataset: Mark the dataset as a live, evolving stream (e.g., a continuously-appended log).
--temporal-coverage TEXT: Time period the data covers. ISO 8601 recommended: '2008/2019' (interval) or '2023-01-15' (point).
--usage-info TEXT: URI pointing to a usage or consent policy. Any RFC 3986 scheme (http(s), urn, did, mailto). Example: 'http://purl.obolibrary.org/obo/DUO_0000042' (DUO term).
--field-mappings FILE: YAML file mapping columns to external vocabularies (Wikidata, SNOMED, LOINC). Schema: a top-level 'fields' mapping keyed by column name, where each entry has 'equivalent_property' and 'data_types' (a list) keys. Note: column names match across ALL RecordSets, so 'id' applies to every 'id' column in the dataset.
--field-mapping TEXT: Link one column to an external vocabulary URI. Format: 'COLUMN=URI'. Example: --field-mapping 'age=http://www.wikidata.org/entity/Q11464'. Matches by bare column name across all RecordSets; a warning prints if a name resolves to multiple fields. Repeatable; combine with --field-mappings (flags override YAML).
--count-csv-rows: Count the exact number of rows in each CSV file (slow for large datasets)
--rai-data-collection TEXT: How and where the data was gathered.
--rai-data-collection-type TEXT: Collection type, e.g. 'observational'. Can be used multiple times.
--rai-data-collection-missing-data TEXT: How missing data was handled during collection.
--rai-data-collection-raw-data TEXT: Description of the raw data before processing.
--rai-data-collection-timeframe TEXT: Collection date or datetime in ISO 8601 format. Can be used multiple times.
--rai-data-imputation-protocol TEXT: How missing values were imputed.
--rai-data-preprocessing-protocol TEXT: Preprocessing step. Can be used multiple times.
--rai-data-manipulation-protocol TEXT: Transformations applied to the data.
--rai-data-annotation-protocol TEXT: Annotation procedure. Can be used multiple times.
--rai-data-annotation-platform TEXT: Annotation platform or tool. Can be used multiple times.
--rai-data-annotation-analysis TEXT: Annotation quality or agreement analysis. Can be used multiple times.
--rai-annotations-per-item TEXT: Annotation density, e.g. '3 annotators per item'.
--rai-annotator-demographics TEXT: Annotator demographic note. Can be used multiple times.
--rai-machine-annotation-tools TEXT: Automated annotation tool. Can be used multiple times.
--rai-data-biases TEXT: Known bias description. Can be used multiple times.
--rai-data-use-cases TEXT: Intended use case. Can be used multiple times.
--rai-data-limitations TEXT: Known limitation. Can be used multiple times.
--rai-data-social-impact TEXT: Potential social impact of using the dataset.
--rai-personal-sensitive-information TEXT: Sensitive information note. Can be used multiple times.
--rai-data-release-maintenance-plan TEXT: How the dataset release will be maintained over time.
--rai-config FILE: Path to a RAI config YAML file (see rai-example.yaml for the template)
-I, --include TEXT: Glob pattern to include (e.g., '*.csv'). Can be used multiple times.
-E, --exclude TEXT: Glob pattern to exclude (e.g., '*.tmp'). Can be used multiple times.
--dry-run: Perform a dry run to list matching files without generating metadata.
--help: Show this message and exit.
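
When the generator is driven from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of croissant-baker itself: the helper names `format_creator` and `build_bake_command`, and the paths `./my-dataset` and `metadata.json`, are hypothetical. It uses only flags listed above, repeats `--creator`, `--keywords`, `--include`, and `--field-mapping` once per value, and prints the command rather than running it.

```python
from __future__ import annotations

import shlex


def format_creator(name: str, email: str | None = None, url: str | None = None) -> str:
    """Render one --creator value in the documented 'Name[,Email[,URL]]' form."""
    if url is not None and email is None:
        # The documented format nests URL inside Email, so this sketch refuses
        # URL-only input rather than guess how the CLI treats an empty email slot.
        raise ValueError("url requires email in 'Name[,Email[,URL]]'")
    parts = [name]
    if email is not None:
        parts.append(email)
    if url is not None:
        parts.append(url)
    return ",".join(parts)


def build_bake_command(
    input_dir: str,
    output: str,
    *,
    name: str | None = None,
    creators: tuple[str, ...] | list[str] = (),
    keywords: tuple[str, ...] | list[str] = (),
    include: tuple[str, ...] | list[str] = (),
    field_mappings: dict[str, str] | None = None,
    dry_run: bool = False,
) -> list[str]:
    """Build an argv list for the top-level croissant-baker command."""
    argv = ["croissant-baker", "--input", input_dir, "--output", output]
    if name is not None:
        argv += ["--name", name]
    for creator in creators:  # repeatable flag: one occurrence per creator
        argv += ["--creator", creator]
    for keyword in keywords:
        argv += ["--keywords", keyword]
    for pattern in include:
        argv += ["--include", pattern]
    for column, uri in (field_mappings or {}).items():
        argv += ["--field-mapping", f"{column}={uri}"]  # documented 'COLUMN=URI' form
    if dry_run:
        argv.append("--dry-run")
    return argv


argv = build_bake_command(
    "./my-dataset",  # hypothetical dataset directory
    "metadata.json",  # hypothetical output path
    name="My Dataset",
    creators=[
        format_creator("John Doe"),
        format_creator("Jane Smith", "jane@example.com", "https://jane.com"),
    ],
    include=["*.csv"],
    field_mappings={"age": "http://www.wikidata.org/entity/Q11464"},
    dry_run=True,
)
print(shlex.join(argv))  # shell-quoted command line, ready to inspect or paste
```

Passing the list to `subprocess.run(argv, check=True)` would execute it, assuming `croissant-baker` is installed and on `PATH`. Building a list instead of a single string avoids shell-quoting problems with values such as creator names that contain spaces.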

Commands:

rai-apply: Apply RAI attributes from a config YAML to an existing Croissant file.
validate: Validate a Croissant metadata file.

`croissant-baker rai-apply`

Apply RAI attributes from a config YAML to an existing Croissant file.

Usage:

$ croissant-baker rai-apply [OPTIONS] FILE_PATH

Arguments:

FILE_PATH: Croissant metadata file to update [required]

Options:

--rai-config FILE: RAI config YAML file [required]
-o, --output TEXT: Output path (defaults to overwriting the input file)
--validate / --no-validate: Validate after applying RAI attributes [default: validate]
--help: Show this message and exit.
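
As an illustration of the options above, here is a small sketch that builds a `rai-apply` invocation. The function name `build_rai_apply_command` and the file names `metadata.json`, `rai.yaml`, and `metadata-rai.json` are hypothetical; only the documented flags are used, and nothing is executed.

```python
from __future__ import annotations


def build_rai_apply_command(
    file_path: str,
    rai_config: str,
    *,
    output: str | None = None,
    validate: bool = True,
) -> list[str]:
    """Build an argv list for 'croissant-baker rai-apply'."""
    argv = ["croissant-baker", "rai-apply", "--rai-config", rai_config]
    if output is not None:
        # Without --output, the documented default is to overwrite the input file.
        argv += ["--output", output]
    if not validate:
        # Validation is on by default, so only the negative flag is ever emitted.
        argv.append("--no-validate")
    argv.append(file_path)  # positional FILE_PATH goes last
    return argv


# Hypothetical file names, for illustration only.
command = build_rai_apply_command("metadata.json", "rai.yaml", output="metadata-rai.json")
print(command)
```

Writing to a separate `--output` path keeps the original metadata file untouched, which is convenient while a RAI config is still being drafted.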

`croissant-baker validate`

Validate a Croissant metadata file.

Usage:

$ croissant-baker validate [OPTIONS] FILE_PATH

Arguments:

FILE_PATH: Path to Croissant metadata file [required]

Options:

--help: Show this message and exit.
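
A script or CI job might wrap this command as in the sketch below. It assumes, without confirmation from this reference, that `croissant-baker validate` exits with status 0 on success and non-zero on failure; the function name `validate_croissant` is hypothetical. The `run` parameter exists only so the wrapper can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable


def validate_croissant(
    file_path: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if 'croissant-baker validate FILE_PATH' reports success.

    Assumption: the CLI signals validation failure through a non-zero exit
    status. Check this against the installed version before relying on it.
    """
    result = run(["croissant-baker", "validate", file_path])
    return result.returncode == 0
```

A call such as `validate_croissant("metadata.json")` would then return a boolean that a pipeline can branch on, where `metadata.json` stands in for the actual metadata file.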