RAI Metadata

Croissant Baker supports the Responsible AI (RAI) extension for documenting data provenance, fairness considerations, and collection activities. RAI metadata is embedded directly in the .jsonld output.
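
To confirm which RAI properties ended up in a generated file, one option is to load the `.jsonld` output and list its RAI entries. The sketch below assumes the properties appear as top-level keys carrying the `rai:` prefix, as in the Croissant RAI vocabulary. The sample document and the helper name `rai_properties` are illustrative and not part of Croissant Baker.

```python
import json


def rai_properties(metadata: dict) -> dict:
    """Return the top-level entries whose key starts with the 'rai:' prefix."""
    return {key: value for key, value in metadata.items() if key.startswith("rai:")}


# Stand-in for the parsed contents of dataset-croissant.jsonld.
# The keys and values here are illustrative, not real tool output.
sample = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "example",
  "rai:dataCollection": "Retrospective EHR data",
  "rai:dataBiases": ["Single-site cohort"]
}
""")

for key, value in rai_properties(sample).items():
    print(key, "->", value)
```

For a real file, replace `sample` with `json.load(...)` on the generated output.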

There are two separate workflows, and they cannot be combined in the same command:

| Workflow | When to use |
| --- | --- |
| Native `--rai-*` CLI flags | Quick, dataset-level fields already supported by `mlcroissant` |
| `--rai-config` YAML file | Richer documentation: provenance, activities, lineage, annotator info |

Native CLI flags

Pass any combination of `--rai-*` flags directly on the generate command:

croissant-baker \
  --input ./dataset \
  --creator "Jane Smith" \
  --rai-data-collection "Retrospective EHR data collected 2010–2019 at Example Hospital" \
  --rai-data-collection-type "observational" \
  --rai-data-biases "Single-site cohort; skews toward English-speaking adults" \
  --rai-data-limitations "Adults only; not suitable for direct clinical decision-making" \
  --rai-data-social-impact "May improve clinical AI research but risks amplifying disparities" \
  --rai-personal-sensitive-information "De-identified patient records under HIPAA Safe Harbor" \
  --output dataset-croissant.jsonld

Flags that accept multiple values can be repeated:

--rai-data-preprocessing-protocol "Outlier removal" \
--rai-data-preprocessing-protocol "Unit normalization"
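
When the values live in a script rather than on the command line, the repeated-flag rule can be sketched as follows. The helper `build_command` is hypothetical; it only assembles an argument list from the flags documented here and does not call Croissant Baker itself.

```python
import shlex


def build_command(input_dir, output, creator, rai_fields):
    """Assemble a croissant-baker argument list.

    rai_fields maps a flag suffix (the part after '--rai-') to either a
    single string or a list of strings; lists produce one flag per value.
    """
    argv = ["croissant-baker", "--input", input_dir, "--creator", creator]
    for suffix, value in rai_fields.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            argv += [f"--rai-{suffix}", item]
    argv += ["--output", output]
    return argv


argv = build_command(
    "./dataset",
    "dataset-croissant.jsonld",
    "Jane Smith",
    {
        "data-collection-type": "observational",
        "data-preprocessing-protocol": ["Outlier removal", "Unit normalization"],
    },
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)`; building a list rather than a single string avoids shell-quoting problems with free-text descriptions.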

Available flags

| Flag | Description |
| --- | --- |
| `--rai-data-collection` | How and where the data was gathered. |
| `--rai-data-collection-type` | Collection type, e.g. 'observational'. Can be used multiple times. |
| `--rai-data-collection-missing-data` | How missing data was handled during collection. |
| `--rai-data-collection-raw-data` | Description of the raw data before processing. |
| `--rai-data-collection-timeframe` | Collection date or datetime in ISO format. Can be used multiple times. |
| `--rai-data-imputation-protocol` | How missing values were imputed. |
| `--rai-data-preprocessing-protocol` | Preprocessing step. Can be used multiple times. |
| `--rai-data-manipulation-protocol` | Transformations applied to the data. |
| `--rai-data-annotation-protocol` | Annotation procedure. Can be used multiple times. |
| `--rai-data-annotation-platform` | Annotation platform or tool. Can be used multiple times. |
| `--rai-data-annotation-analysis` | Annotation quality or agreement analysis. Can be used multiple times. |
| `--rai-annotations-per-item` | Annotation density, e.g. '3 annotators per item'. |
| `--rai-annotator-demographics` | Annotator demographic note. Can be used multiple times. |
| `--rai-machine-annotation-tools` | Automated annotation tool. Can be used multiple times. |
| `--rai-data-biases` | Known bias description. Can be used multiple times. |
| `--rai-data-use-cases` | Intended use case. Can be used multiple times. |
| `--rai-data-limitations` | Known limitation. Can be used multiple times. |
| `--rai-data-social-impact` | Potential social impact of using the dataset. |
| `--rai-personal-sensitive-information` | Sensitive information note. Can be used multiple times. |
| `--rai-data-release-maintenance-plan` | How the dataset release will be maintained over time. |

YAML config file

For richer RAI documentation, write a YAML config and pass it with `--rai-config`:

croissant-baker \
  --input ./dataset \
  --creator "Jane Smith" \
  --rai-config rai.yaml \
  --output dataset-croissant.jsonld

The YAML file has three top-level sections: `ai_fairness`, `lineage`, and `activities`.

`ai_fairness`

ai_fairness:
  data_limitations: >
    Single-site cohort from one academic medical centre.
    Findings may not generalise to other hospital systems.

  data_biases: >
    Skews toward English-speaking adults; paediatric patients under-represented.

  personal_sensitive_information: >
    De-identified patient records. Re-identification risk minimised via HIPAA Safe Harbor.

  data_use_cases: >
    Benchmarking clinical NLP and ML models. Not for direct clinical decision-making.

  data_social_impact: >
    May improve clinical AI research, but risks amplifying health disparities
    if deployed without careful evaluation.

  has_synthetic_data: false

`lineage`

lineage:
  source_datasets:
    - url: https://physionet.org/content/mimiciii/
      name: MIMIC-III
      organisation: PhysioNet
  models: []

`activities`

activities:
  - id: ACT-001
    type: data_collection
    description: >
      Retrospective EHR data collected during routine clinical care.
    start_at: "2011-01-01"
    end_at: "2019-12-31"
    collection_types:
      - observations
      - existing_datasets
    agents:
      - name: Beth Israel Deaconess Medical Center
        url: https://www.bidmc.org
        is_synthetic: false

  - id: ACT-002
    type: data_preprocessing
    description: De-identification via HIPAA Safe Harbor procedures.
    agents:
      - name: MIT Laboratory for Computational Physiology
        url: https://lcp.mit.edu

A complete working example is at `tests/data/input/mimiciv_demo/physionet.org/mimiciv_demo-rai-example.yaml`.
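
Before passing a config to `--rai-config`, it can help to sanity-check the `activities` list. The sketch below works on the already-parsed YAML (a list of dicts) and uses the field names from the example above. The two checks (unique `id` values, `start_at` not after `end_at`) are assumptions about what a reasonable config looks like, not rules that Croissant Baker is known to enforce, and `check_activities` is a hypothetical helper.

```python
from datetime import date


def check_activities(activities):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen_ids = set()
    for activity in activities:
        act_id = activity.get("id", "<missing id>")
        if act_id in seen_ids:
            problems.append(f"{act_id}: duplicate id")
        seen_ids.add(act_id)

        # Dates are expected as quoted ISO strings, as in the example config.
        start, end = activity.get("start_at"), activity.get("end_at")
        if start and end and date.fromisoformat(start) > date.fromisoformat(end):
            problems.append(f"{act_id}: start_at is after end_at")
    return problems


# Parsed form of the two activities shown above (description and agents omitted).
activities = [
    {"id": "ACT-001", "type": "data_collection",
     "start_at": "2011-01-01", "end_at": "2019-12-31"},
    {"id": "ACT-002", "type": "data_preprocessing"},
]

print(check_activities(activities))  # prints []
```

Note that unquoted dates may be parsed by a YAML loader as date objects rather than strings; this sketch handles only the quoted form used in the example.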

Apply RAI to an existing file

You can inject RAI into a .jsonld file that was already generated:

croissant-baker rai-apply dataset-croissant.jsonld \
  --rai-config rai.yaml \
  --output dataset-croissant-rai.jsonld

Omit `--output` to overwrite the input file in place.
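
Because the in-place form replaces the original file, a cautious script might copy it first. The following is a minimal sketch using only the Python standard library; the `backup` helper and the `.bak` suffix are illustrative choices, and the demo runs against a temporary stand-in file rather than a real Croissant output.

```python
import shutil
import tempfile
from pathlib import Path


def backup(path):
    """Copy a file to '<name>.bak' next to it and return the backup path."""
    source = Path(path)
    destination = source.with_name(source.name + ".bak")
    shutil.copy2(source, destination)  # copy2 also preserves timestamps
    return destination


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "dataset-croissant.jsonld"
    target.write_text('{"name": "example"}')  # stand-in content

    saved = backup(target)
    # At this point the in-place command from this section could be run:
    #   croissant-baker rai-apply dataset-croissant.jsonld --rai-config rai.yaml
    print(saved.name)
```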