croissant-baker

🥐 Generate Croissant metadata for datasets with automatic type inference

Usage:

$ croissant-baker [OPTIONS] COMMAND [ARGS]...

Options:

  • -i, --input TEXT: Directory containing dataset files
  • -o, --output TEXT: Output file path
  • --validate / --no-validate: Validate metadata before saving [default: validate]
  • --version: Show version and exit
  • --name TEXT: Dataset name (defaults to directory name)
  • --description TEXT: Dataset description
  • --url TEXT: Dataset URL (e.g., https://example.com/dataset)
  • --license TEXT: License URL or SPDX identifier (e.g., CC-BY-4.0)
  • --citation TEXT: Citation text (preferably BibTeX format)
  • --dataset-version TEXT: Dataset version (e.g., 1.0.0)
  • --date-published TEXT: Publication date (e.g., 2023-12-15 or 2023-12-15T10:30:00)
  • --date-created TEXT: Creation date (e.g., 2023-12-15 or 2023-12-15T10:30:00).
  • --date-modified TEXT: Last-modified date (e.g., 2023-12-15 or 2023-12-15T10:30:00).
  • --creator TEXT: Creator information. Format: 'Name[,Email[,URL]]'. Use multiple times for multiple creators. Examples: --creator 'John Doe' --creator 'Jane Smith,jane@example.com,https://jane.com'
  • --publisher TEXT: Publishing organization name (e.g., 'PhysioNet').
  • --keywords TEXT: Topical keywords for dataset discovery. Repeat (--keywords a --keywords b) or comma-delimit (--keywords 'a,b').
  • --in-language TEXT: BCP 47 language code (e.g., 'en'). Repeat or comma-delimit for multilingual datasets.
  • --same-as TEXT: URL of an equivalent record (e.g., a DOI or mirror landing page). Repeat or comma-delimit.
  • --sd-license TEXT: License of the metadata description itself (e.g., 'CC0-1.0'), distinct from the data license.
  • --sd-version TEXT: Version of the metadata description (e.g., '1.0.0'), distinct from --dataset-version.
  • --alternate-name TEXT: Short alias for the dataset (e.g., 'MIMIC-IV').
  • --is-live-dataset: Mark the dataset as a live, evolving stream (e.g., a continuously-appended log).
  • --temporal-coverage TEXT: Time period the data covers. ISO 8601 recommended: '2008/2019' (interval) or '2023-01-15' (point).
  • --usage-info TEXT: URI pointing to a usage or consent policy. Any RFC 3986 scheme (http(s), urn, did, mailto). Example: 'http://purl.obolibrary.org/obo/DUO_0000042' (DUO term).
  • --field-mappings FILE: YAML file mapping columns to external vocabularies (Wikidata, SNOMED, LOINC). Schema: 'fields:\n <column>:\n equivalent_property: <URI>\n data_types: [<type>, ...]'. Note: column names match across ALL RecordSets, so 'id' applies to every 'id' column in the dataset.
  • --field-mapping TEXT: Link one column to an external vocabulary URI. Format: 'COLUMN=URI'. Example: --field-mapping 'age=http://www.wikidata.org/entity/Q11464'. Matches by bare column name across all RecordSets; a warning prints if a name resolves to multiple fields. Repeatable; combine with --field-mappings (flags override YAML).
  • --count-csv-rows: Count the exact number of rows in each CSV file (slow for large datasets)
  • --rai-data-collection TEXT: How and where the data was gathered.
  • --rai-data-collection-type TEXT: Collection type, e.g. 'observational'. Can be used multiple times.
  • --rai-data-collection-missing-data TEXT: How missing data was handled during collection.
  • --rai-data-collection-raw-data TEXT: Description of the raw data before processing.
  • --rai-data-collection-timeframe TEXT: Collection date or datetime in ISO format. Can be used multiple times.
  • --rai-data-imputation-protocol TEXT: How missing values were imputed.
  • --rai-data-preprocessing-protocol TEXT: Preprocessing step. Can be used multiple times.
  • --rai-data-manipulation-protocol TEXT: Transformations applied to the data.
  • --rai-data-annotation-protocol TEXT: Annotation procedure. Can be used multiple times.
  • --rai-data-annotation-platform TEXT: Annotation platform or tool. Can be used multiple times.
  • --rai-data-annotation-analysis TEXT: Annotation quality or agreement analysis. Can be used multiple times.
  • --rai-annotations-per-item TEXT: Annotation density, e.g. '3 annotators per item'.
  • --rai-annotator-demographics TEXT: Annotator demographic note. Can be used multiple times.
  • --rai-machine-annotation-tools TEXT: Automated annotation tool. Can be used multiple times.
  • --rai-data-biases TEXT: Known bias description. Can be used multiple times.
  • --rai-data-use-cases TEXT: Intended use case. Can be used multiple times.
  • --rai-data-limitations TEXT: Known limitation. Can be used multiple times.
  • --rai-data-social-impact TEXT: Potential social impact of using the dataset.
  • --rai-personal-sensitive-information TEXT: Sensitive information note. Can be used multiple times.
  • --rai-data-release-maintenance-plan TEXT: How the dataset release will be maintained over time.
  • --rai-config FILE: Path to a RAI config YAML file (see rai-example.yaml for the template)
  • -I, --include TEXT: Glob pattern to include (e.g., '*.csv'). Can be used multiple times.
  • -E, --exclude TEXT: Glob pattern to exclude (e.g., '*.tmp'). Can be used multiple times.
  • --dry-run: Perform a dry run to list matching files without generating metadata.
  • --help: Show this message and exit.
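
The options above combine into a single invocation. As a minimal sketch, the Python snippet below assembles such a command line using only flags listed here. The helper names (`format_creator`, `build_bake_command`), the input directory, and the output path are hypothetical and chosen for illustration; the snippet prints the command rather than running it.

```python
import shlex


def format_creator(name, email=None, url=None):
    """Build a --creator value in the documented 'Name[,Email[,URL]]' form."""
    if url and not email:
        # The URL slot follows the email slot, so it cannot be given alone.
        raise ValueError("a creator URL requires an email before it")
    return ",".join(part for part in (name, email, url) if part)


def build_bake_command(input_dir, output_path, creators=(), keywords=(), dry_run=False):
    """Assemble an argv list for the top-level croissant-baker command."""
    cmd = ["croissant-baker", "-i", input_dir, "-o", output_path]
    for creator in creators:
        # --creator is repeated once per creator.
        cmd += ["--creator", creator]
    for keyword in keywords:
        # --keywords may be repeated (or comma-delimited in a single value).
        cmd += ["--keywords", keyword]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = build_bake_command(
    "data/my-dataset",  # hypothetical input directory
    "croissant.json",   # hypothetical output path
    creators=[format_creator("Jane Smith", "jane@example.com", "https://jane.com")],
    keywords=["ecg", "cardiology"],
    dry_run=True,
)
# shlex.join quotes arguments that contain spaces, so the result can be pasted into a shell.
print(shlex.join(cmd))
```

Starting with --dry-run is a cheap way to check which files the -I/-E patterns select before any metadata is generated.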

Commands:

  • rai-apply: Apply RAI attributes from a config YAML to an existing Croissant file.
  • validate: Validate a Croissant metadata file.

croissant-baker rai-apply

Apply RAI attributes from a config YAML to an existing Croissant file.

Usage:

$ croissant-baker rai-apply [OPTIONS] FILE_PATH

Arguments:

  • FILE_PATH: Croissant metadata file to update [required]

Options:

  • --rai-config FILE: RAI config YAML file [required]
  • -o, --output TEXT: Output path (defaults to overwriting the input file)
  • --validate / --no-validate: Validate after applying RAI attributes [default: validate]
  • --help: Show this message and exit.
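
Because -o defaults to overwriting the input file, it can be safer to write the updated metadata to a separate path. The sketch below assembles such a rai-apply command; the helper name and all three file names are hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_rai_apply_command(croissant_file, rai_config, output_path=None, validate=True):
    """Assemble an argv list for `croissant-baker rai-apply`."""
    cmd = ["croissant-baker", "rai-apply", croissant_file, "--rai-config", rai_config]
    if output_path is not None:
        # Without -o, the input file is overwritten (per the option description).
        cmd += ["-o", output_path]
    if not validate:
        # Validation is on by default; pass --no-validate to skip it.
        cmd.append("--no-validate")
    return cmd


cmd = build_rai_apply_command(
    "croissant.json",                 # hypothetical existing Croissant file
    "rai.yaml",                       # hypothetical RAI config file
    output_path="croissant-rai.json", # hypothetical output path
)
print(shlex.join(cmd))
```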

croissant-baker validate

Validate a Croissant metadata file.

Usage:

$ croissant-baker validate [OPTIONS] FILE_PATH

Arguments:

  • FILE_PATH: Path to Croissant metadata file [required]

Options:

  • --help: Show this message and exit.
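
In a script or CI job, the validate command can be wrapped as follows. This is a sketch under one assumption that this reference does not confirm: that the command exits with status 0 on success and a non-zero status on failure. The function name and file path are hypothetical.

```python
import shutil
import subprocess


def validate_croissant(file_path):
    """Run `croissant-baker validate`; return True/False, or None if the CLI is missing."""
    if shutil.which("croissant-baker") is None:
        return None
    result = subprocess.run(
        ["croissant-baker", "validate", file_path],
        capture_output=True,
        text=True,
    )
    # Assumption: exit status 0 means the file validated successfully.
    return result.returncode == 0


status = validate_croissant("croissant.json")  # hypothetical file path
messages = {None: "croissant-baker not found on PATH", True: "valid", False: "invalid"}
print(messages[status])
```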