RAI Metadata

Croissant Baker supports the Responsible AI (RAI) extension for documenting data provenance, fairness considerations, and collection activities. RAI metadata is embedded directly in the .jsonld output.

There are two separate workflows, and they cannot be combined in the same command:

| Workflow | When to use |
| --- | --- |
| Native --rai-* CLI flags | Quick, dataset-level fields already supported by mlcroissant |
| --rai-config YAML file | Richer documentation: provenance, activities, lineage, annotator info |

Native CLI flags

Pass any combination of --rai-* flags directly on the generate command:

croissant-baker \
  --input ./dataset \
  --creator "Jane Smith" \
  --rai-data-collection "Retrospective EHR data collected 2010–2019 at Example Hospital" \
  --rai-data-collection-type "observational" \
  --rai-data-biases "Single-site cohort; skews toward English-speaking adults" \
  --rai-data-limitations "Adults only; not suitable for direct clinical decision-making" \
  --rai-data-social-impact "May improve clinical AI research but risks amplifying disparities" \
  --rai-personal-sensitive-information "De-identified patient records under HIPAA Safe Harbor" \
  --output dataset-croissant.jsonld

Flags that accept multiple values can be repeated:

--rai-data-preprocessing-protocol "Outlier removal" \
--rai-data-preprocessing-protocol "Unit normalization"
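
When the command is assembled by a script rather than typed by hand, repeatable flags can be expanded from a list. The following is a minimal sketch, not part of Croissant Baker: `build_rai_args` is a hypothetical helper, and only flags listed on this page are used.

```python
import shlex


def build_rai_args(fields):
    """Expand a mapping of flag -> value(s) into a flat argument list.

    A string value yields one flag/value pair. A list yields one pair per
    item, which is how a repeatable flag is passed on the command line.
    """
    args = []
    for flag, value in fields.items():
        values = [value] if isinstance(value, str) else list(value)
        for item in values:
            args.extend([flag, item])
    return args


rai_fields = {
    "--rai-data-collection-type": "observational",
    "--rai-data-preprocessing-protocol": ["Outlier removal", "Unit normalization"],
}

command = ["croissant-baker", "--input", "./dataset", "--creator", "Jane Smith"]
command += build_rai_args(rai_fields)
command += ["--output", "dataset-croissant.jsonld"]

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Keeping the command as a list also lets it be passed to `subprocess.run` directly, without shell quoting.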

Available flags

| Flag | Description |
| --- | --- |
| --rai-data-collection | How and where the data was gathered. |
| --rai-data-collection-type | Collection type, e.g. 'observational'. Can be used multiple times. |
| --rai-data-collection-missing-data | How missing data was handled during collection. |
| --rai-data-collection-raw-data | Description of the raw data before processing. |
| --rai-data-collection-timeframe | Collection date or datetime in ISO format. Can be used multiple times. |
| --rai-data-imputation-protocol | How missing values were imputed. |
| --rai-data-preprocessing-protocol | Preprocessing step. Can be used multiple times. |
| --rai-data-manipulation-protocol | Transformations applied to the data. |
| --rai-data-annotation-protocol | Annotation procedure. Can be used multiple times. |
| --rai-data-annotation-platform | Annotation platform or tool. Can be used multiple times. |
| --rai-data-annotation-analysis | Annotation quality or agreement analysis. Can be used multiple times. |
| --rai-annotations-per-item | Annotation density, e.g. '3 annotators per item'. |
| --rai-annotator-demographics | Annotator demographic note. Can be used multiple times. |
| --rai-machine-annotation-tools | Automated annotation tool. Can be used multiple times. |
| --rai-data-biases | Known bias description. Can be used multiple times. |
| --rai-data-use-cases | Intended use case. Can be used multiple times. |
| --rai-data-limitations | Known limitation. Can be used multiple times. |
| --rai-data-social-impact | Potential social impact of using the dataset. |
| --rai-personal-sensitive-information | Sensitive information note. Can be used multiple times. |
| --rai-data-release-maintenance-plan | How the dataset release will be maintained over time. |
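
Because --rai-data-collection-timeframe expects an ISO date or datetime, it can help to check values before building the command. The sketch below uses Python's standard `datetime.fromisoformat` as a stand-in. This is an assumption: the parser Croissant Baker actually uses may accept a wider or narrower set of forms. `is_iso_timeframe` is a hypothetical helper.

```python
from datetime import datetime


def is_iso_timeframe(value):
    """Return True if value parses as an ISO date or datetime string."""
    try:
        # Accepts both "YYYY-MM-DD" and "YYYY-MM-DDTHH:MM:SS".
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


for candidate in ["2011-01-01", "2019-12-31T23:59:59", "Jan 2011"]:
    print(candidate, is_iso_timeframe(candidate))
```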

YAML config file

For richer RAI documentation, write a YAML config and pass it with --rai-config:

croissant-baker \
  --input ./dataset \
  --creator "Jane Smith" \
  --rai-config rai.yaml \
  --output dataset-croissant.jsonld

The YAML has three top-level sections: ai_fairness, lineage, and activities.

ai_fairness

ai_fairness:
  data_limitations: >
    Single-site cohort from one academic medical centre.
    Findings may not generalise to other hospital systems.

  data_biases: >
    Skews toward English-speaking adults; paediatric patients under-represented.

  personal_sensitive_information: >
    De-identified patient records. Re-identification risk minimised via HIPAA Safe Harbor.

  data_use_cases: >
    Benchmarking clinical NLP and ML models. Not for direct clinical decision-making.

  data_social_impact: >
    May improve clinical AI research, but risks amplifying health disparities
    if deployed without careful evaluation.

  has_synthetic_data: false

lineage

lineage:
  source_datasets:
    - url: https://physionet.org/content/mimiciii/
      name: MIMIC-III
      organisation: PhysioNet
  models: []

activities

activities:
  - id: ACT-001
    type: data_collection
    description: >
      Retrospective EHR collected during routine clinical care.
    start_at: "2011-01-01"
    end_at: "2019-12-31"
    collection_types:
      - observations
      - existing_datasets
    agents:
      - name: Beth Israel Deaconess Medical Center
        url: https://www.bidmc.org
        is_synthetic: false

  - id: ACT-002
    type: data_preprocessing
    description: De-identification via HIPAA Safe Harbor procedures.
    agents:
      - name: MIT Laboratory for Computational Physiology
        url: https://lcp.mit.edu

A complete working example is at tests/data/input/mimiciv_demo/physionet.org/mimiciv_demo-rai-example.yaml.
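
Before passing a config to --rai-config, a few consistency checks on the activities list can catch copy-paste mistakes. The sketch below is an illustration only: the two rules it checks (unique id values, and start_at not later than end_at) are assumptions chosen for the example, not documented requirements of Croissant Baker. The data is written as Python literals so the snippet needs no YAML parser.

```python
from datetime import date


def check_activities(activities):
    """Return a list of human-readable problems found in an activities list."""
    problems = []
    seen_ids = set()
    for activity in activities:
        act_id = activity.get("id", "<missing id>")
        if act_id in seen_ids:
            problems.append(f"{act_id}: duplicate id")
        seen_ids.add(act_id)

        # Dates are optional; compare them only when both are present.
        start, end = activity.get("start_at"), activity.get("end_at")
        if start and end and date.fromisoformat(start) > date.fromisoformat(end):
            problems.append(f"{act_id}: start_at is after end_at")
    return problems


# Same ids, types, and dates as the activities example above.
activities = [
    {"id": "ACT-001", "type": "data_collection",
     "start_at": "2011-01-01", "end_at": "2019-12-31"},
    {"id": "ACT-002", "type": "data_preprocessing"},
]
print(check_activities(activities))
```

With a YAML parser installed, the same function could be applied to the `activities` key of the loaded config.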

Apply RAI to an existing file

You can inject RAI into a .jsonld file that was already generated:

croissant-baker rai-apply dataset-croissant.jsonld \
  --rai-config rai.yaml \
  --output dataset-croissant-rai.jsonld

Omit --output to overwrite the input file in place.
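
To confirm that RAI fields reached the output, the .jsonld file can be inspected with any JSON reader. The sketch below assumes the fields appear as top-level keys with the `rai:` prefix used by the Croissant RAI vocabulary. Anything written under another prefix or nested deeper would not be listed. `rai_properties` is a hypothetical helper, and the sample document stands in for a real output file.

```python
import json


def rai_properties(metadata):
    """Return the top-level keys of a JSON-LD document that start with 'rai:'."""
    return sorted(key for key in metadata if key.startswith("rai:"))


# Stand-in for the parsed contents of dataset-croissant-rai.jsonld.
sample = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "example",
  "rai:dataBiases": "Single-site cohort",
  "rai:dataLimitations": "Adults only"
}
""")
print(rai_properties(sample))
```

For a real file, replace the sample with `json.load` on the opened output path.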