RAI Metadata
Croissant Baker supports the Responsible AI (RAI) extension for documenting data provenance, fairness considerations, and collection activities. RAI metadata is embedded directly in the .jsonld output.
There are two separate workflows, and they cannot be combined in the same command:
| Workflow | When to use |
|---|---|
| Native --rai-* CLI flags | Quick, dataset-level fields already supported by mlcroissant |
| --rai-config YAML file | Richer documentation: provenance, activities, lineage, annotator info |
Native CLI flags
Pass any combination of --rai-* flags directly on the generate command:
croissant-baker \
--input ./dataset \
--creator "Jane Smith" \
--rai-data-collection "Retrospective EHR data collected 2010–2019 at Example Hospital" \
--rai-data-collection-type "observational" \
--rai-data-biases "Single-site cohort; skews toward English-speaking adults" \
--rai-data-limitations "Adults only; not suitable for direct clinical decision-making" \
--rai-data-social-impact "May improve clinical AI research but risks amplifying disparities" \
--rai-personal-sensitive-information "De-identified patient records under HIPAA Safe Harbor" \
--output dataset-croissant.jsonld
Flags that accept multiple values can be repeated:
--rai-data-preprocessing-protocol "Outlier removal" \
--rai-data-preprocessing-protocol "Unit normalization"
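When the command is assembled by a script, repeated flags are easy to get wrong. The sketch below shows one way to expand a mapping into a flat argument list, emitting one flag occurrence per list item. The helper name `build_rai_args` is hypothetical and not part of Croissant Baker; only flag names listed in this document are used.

```python
import shlex


def build_rai_args(fields):
    """Turn a mapping of RAI flag suffixes to values into a flat argument list.

    A list value produces one flag occurrence per item, mirroring how
    repeatable flags are passed on the command line.
    """
    args = []
    for flag, value in fields.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            args.extend([f"--rai-{flag}", str(item)])
    return args


# Hypothetical input: a single-valued flag and a repeatable one.
rai_args = build_rai_args({
    "data-collection-type": "observational",
    "data-preprocessing-protocol": ["Outlier removal", "Unit normalization"],
})

command = [
    "croissant-baker",
    "--input", "./dataset",
    "--creator", "Jane Smith",
    *rai_args,
    "--output", "dataset-croissant.jsonld",
]

# Print a shell-quoted version for inspection; pass `command` to
# subprocess.run() to actually execute it.
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems with values that contain spaces or semicolons.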
Available flags
| Flag | Description |
|---|---|
| --rai-data-collection | How and where the data was gathered. |
| --rai-data-collection-type | Collection type, e.g. 'observational'. Can be used multiple times. |
| --rai-data-collection-missing-data | How missing data was handled during collection. |
| --rai-data-collection-raw-data | Description of the raw data before processing. |
| --rai-data-collection-timeframe | Collection date or datetime in ISO format. Can be used multiple times. |
| --rai-data-imputation-protocol | How missing values were imputed. |
| --rai-data-preprocessing-protocol | Preprocessing step. Can be used multiple times. |
| --rai-data-manipulation-protocol | Transformations applied to the data. |
| --rai-data-annotation-protocol | Annotation procedure. Can be used multiple times. |
| --rai-data-annotation-platform | Annotation platform or tool. Can be used multiple times. |
| --rai-data-annotation-analysis | Annotation quality or agreement analysis. Can be used multiple times. |
| --rai-annotations-per-item | Annotation density, e.g. '3 annotators per item'. |
| --rai-annotator-demographics | Annotator demographic note. Can be used multiple times. |
| --rai-machine-annotation-tools | Automated annotation tool. Can be used multiple times. |
| --rai-data-biases | Known bias description. Can be used multiple times. |
| --rai-data-use-cases | Intended use case. Can be used multiple times. |
| --rai-data-limitations | Known limitation. Can be used multiple times. |
| --rai-data-social-impact | Potential social impact of using the dataset. |
| --rai-personal-sensitive-information | Sensitive information note. Can be used multiple times. |
| --rai-data-release-maintenance-plan | How the dataset release will be maintained over time. |
YAML config file
For richer RAI documentation, write a YAML config and pass it with --rai-config:
croissant-baker \
--input ./dataset \
--creator "Jane Smith" \
--rai-config rai.yaml \
--output dataset-croissant.jsonld
The YAML covers three sections:
ai_fairness
ai_fairness:
  data_limitations: >
    Single-site cohort from one academic medical centre.
    Findings may not generalise to other hospital systems.
  data_biases: >
    Skews toward English-speaking adults; paediatric patients under-represented.
  personal_sensitive_information: >
    De-identified patient records. Re-identification risk minimised via HIPAA Safe Harbor.
  data_use_cases: >
    Benchmarking clinical NLP and ML models. Not for direct clinical decision-making.
  data_social_impact: >
    May improve clinical AI research, but risks amplifying health disparities
    if deployed without careful evaluation.
  has_synthetic_data: false
lineage
lineage:
  source_datasets:
    - url: https://physionet.org/content/mimiciii/
      name: MIMIC-III
      organisation: PhysioNet
  models: []
activities
activities:
  - id: ACT-001
    type: data_collection
    description: >
      Retrospective EHR collected during routine clinical care.
    start_at: "2011-01-01"
    end_at: "2019-12-31"
    collection_types:
      - observations
      - existing_datasets
    agents:
      - name: Beth Israel Deaconess Medical Center
        url: https://www.bidmc.org
    is_synthetic: false
  - id: ACT-002
    type: data_preprocessing
    description: De-identification via HIPAA Safe Harbor procedures.
    agents:
      - name: MIT Laboratory for Computational Physiology
        url: https://lcp.mit.edu
A complete working example is at tests/data/input/mimiciv_demo/physionet.org/mimiciv_demo-rai-example.yaml.
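Before passing a config to --rai-config, it can help to sanity-check it. The following is a minimal sketch, not Croissant Baker's own validation: it assumes the three top-level sections shown above are the only ones expected, that activity ids should be unique, and that start_at and end_at are date-only ISO strings. The function name `check_rai_config` is hypothetical, and the input is an already-parsed mapping (for example, the result of a YAML loader).

```python
from datetime import date

# Assumed from the examples above; the real loader may accept more keys.
KNOWN_SECTIONS = {"ai_fairness", "lineage", "activities"}


def check_rai_config(config):
    """Return a list of human-readable problems found in a parsed RAI config."""
    problems = []

    # Flag top-level keys that are not one of the documented sections.
    for key in config:
        if key not in KNOWN_SECTIONS:
            problems.append(f"unknown top-level section: {key}")

    seen_ids = set()
    for activity in config.get("activities") or []:
        act_id = activity.get("id")
        # Assumption: ids such as ACT-001 are meant to be unique.
        if act_id in seen_ids:
            problems.append(f"duplicate activity id: {act_id}")
        seen_ids.add(act_id)

        # Compare dates only when both ends of the range are present.
        start, end = activity.get("start_at"), activity.get("end_at")
        if start and end:
            if date.fromisoformat(str(start)) > date.fromisoformat(str(end)):
                problems.append(f"{act_id}: start_at is after end_at")

    return problems


example = {
    "activities": [
        {"id": "ACT-001", "type": "data_collection",
         "start_at": "2011-01-01", "end_at": "2019-12-31"},
        {"id": "ACT-002", "type": "data_preprocessing"},
    ],
}
print(check_rai_config(example))
```

An empty list means none of these assumed checks failed; it does not guarantee that the tool will accept the file.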
Apply RAI to an existing file
You can inject RAI metadata into a .jsonld file that was already generated:
croissant-baker rai-apply dataset-croissant.jsonld \
--rai-config rai.yaml \
--output dataset-croissant-rai.jsonld
Omit --output to overwrite the input file in place.
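To confirm that RAI metadata ended up in the output, one option is to list the RAI-prefixed keys in the JSON-LD document. The sketch below assumes that RAI properties appear as top-level keys with a `rai:` prefix (for example `rai:dataBiases`), as in the Croissant RAI vocabulary. Fields written from the YAML workflow, such as activities or lineage, may be serialized under other keys and would not be picked up by this check. The helper name `rai_fields` is hypothetical.

```python
import json


def rai_fields(metadata):
    """Return the top-level entries whose key starts with the 'rai:' prefix."""
    return {k: v for k, v in metadata.items() if k.startswith("rai:")}


# Stand-in document for illustration. In practice, load the generated file:
#   with open("dataset-croissant-rai.jsonld") as f:
#       metadata = json.load(f)
sample = json.loads(
    '{"name": "demo", "rai:dataBiases": "Single-site cohort"}'
)

for key in sorted(rai_fields(sample)):
    print(key)
```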