Supported Formats
croissant-baker detects file types automatically. Unrecognized files are skipped silently. Handlers are checked in the order listed below — the first match wins.
File types
| Format | Extensions | What's extracted |
|---|---|---|
| CSV | .csv, .csv.gz, .csv.bz2, .csv.xz | Column names, inferred types, optional row count |
| TSV | .tsv, .tsv.gz, .tsv.bz2, .tsv.xz | Column names, inferred types, optional row count |
| FHIR | .ndjson, .ndjson.gz, .json, .json.gz | Resource types, field names and types per resource |
| JSON / JSONL | .json, .json.gz, .jsonl, .jsonl.gz | Schema inferred from a sample of records |
| WFDB | .hea | Signal names, sampling frequency, duration, number of signals |
| Parquet | .parquet | Arrow schema, column names and types, row count |
| Images | .jpg, .jpeg, .png, .gif, .bmp, .webp, .ico, .tiff, .tif | Dimensions, color mode, encoding format |
| DICOM | .dcm, .dicom | Image geometry, modality, pixel encoding, acquisition parameters |
| NIfTI | .nii, .nii.gz | Spatial dimensions, voxel spacing, data type, TR for fMRI volumes |
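
The first-match-wins rule can be pictured as an ordered lookup over the extensions in the table. The sketch below is illustrative only: the HANDLERS list, detect_handler, and the looks_like_fhir callback are hypothetical names, not croissant-baker's internals, and it assumes case-sensitive suffix matching. Extensionless DICOM files are not covered here.

```python
# Hypothetical handler table in the documented order. The suffixes come from
# the table above; the names and structure are assumptions for illustration.
HANDLERS = [
    ("csv", (".csv", ".csv.gz", ".csv.bz2", ".csv.xz")),
    ("tsv", (".tsv", ".tsv.gz", ".tsv.bz2", ".tsv.xz")),
    ("fhir", (".ndjson", ".ndjson.gz", ".json", ".json.gz")),
    ("json", (".json", ".json.gz", ".jsonl", ".jsonl.gz")),
    ("wfdb", (".hea",)),
    ("parquet", (".parquet",)),
    ("image", (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp", ".ico", ".tiff", ".tif")),
    ("dicom", (".dcm", ".dicom")),
    ("nifti", (".nii", ".nii.gz")),
]


def detect_handler(path, looks_like_fhir=lambda p: False):
    """Return the name of the first handler that accepts the path, or None."""
    name = str(path).rsplit("/", 1)[-1]
    for handler, suffixes in HANDLERS:
        if not name.endswith(suffixes):
            continue
        # FHIR shares .json with the JSON handler, so a .json file is only
        # accepted as FHIR when the content check passes.
        if handler == "fhir" and name.endswith((".json", ".json.gz")):
            if not looks_like_fhir(path):
                continue
        return handler
    return None  # unrecognized files are skipped
```

Because FHIR is listed before JSON, a .json file reaches the JSON handler only after the FHIR content check has declined it.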
CSV and TSV
CSV and TSV files are read with PyArrow's streaming reader — memory is constant regardless of file size. Type inference runs in two passes: an initial sweep, then per-column promotion when the first pass hits a type conflict.
Row counts are omitted by default for speed. Pass --count-csv-rows to do a full scan for exact counts (slow on large datasets).
Compressed variants (.gz, .bz2, .xz) are handled transparently.
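
To make the streaming and compression behaviour concrete, here is a minimal standard-library sketch. It is not the PyArrow-based reader the tool uses; it swaps in Python's csv, gzip, bz2, and lzma modules to show the same idea: pick a decompressor from the suffix, then walk the rows one at a time so memory stays flat. The function names are hypothetical.

```python
import bz2
import csv
import gzip
import lzma

# Map compression suffixes to the stdlib opener that understands them.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(path):
    """Open a plain or compressed file as text, chosen by suffix."""
    for suffix, opener in _OPENERS.items():
        if str(path).endswith(suffix):
            return opener(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def header_and_row_count(path, delimiter=","):
    """Return (column names, number of data rows) without loading the file."""
    with open_text(path) as fh:
        reader = csv.reader(fh, delimiter=delimiter)
        header = next(reader, [])
        # Full scan: one row in memory at a time, but every byte is read,
        # which is why exact counts are slow on large datasets.
        return header, sum(1 for _ in reader)
```

Passing delimiter="\t" covers the TSV case in the same way.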
FHIR (.ndjson, .json Bundle)
Two FHIR serialization formats are supported:
- NDJSON bulk export (.ndjson, .ndjson.gz): one resource per line, all of the same resourceType. Produced by FHIR Bulk Data servers.
- JSON Bundle (.json, .json.gz): a FHIR Bundle whose entry[] may contain mixed resource types.
Field names and types are inferred from a sample of resources. OperationOutcome resources (error markers) are skipped.
Note
FHIR .json files are detected by content — the handler looks for "resourceType": "<UpperCase…" before accepting. Plain JSON files that happen to use .json are handled by the JSON handler instead.
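
A content check of this kind might look like the sketch below. The exact pattern the handler uses is not spelled out here, so the regular expression is an assumption: it accepts any "resourceType" key whose value begins with an uppercase ASCII letter. The function name and the probe size are also hypothetical.

```python
import gzip
import re

# Assumed pattern, not the handler's actual one: a "resourceType" key whose
# string value starts with an uppercase letter (e.g. "Patient", "Bundle").
_RESOURCE_TYPE = re.compile(rb'"resourceType"\s*:\s*"[A-Z]')


def looks_like_fhir(path, probe_bytes=65536):
    """Return True if the first probe_bytes of the file look like FHIR JSON."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as fh:
        return _RESOURCE_TYPE.search(fh.read(probe_bytes)) is not None
```

Reading only a bounded prefix keeps the check cheap on large files, at the cost of missing a resourceType key that appears later than the probe window.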
JSON and JSONL
- JSON (.json, .json.gz): an array of objects (one object per record) or a single object (treated as one record).
- JSONL (.jsonl, .jsonl.gz): newline-delimited JSON, one object per line.
Schema is inferred from a sample of records. FHIR .json files are excluded — they go to the FHIR handler.
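
As a rough illustration of sample-based inference, the sketch below collects the set of JSON types seen for each top-level field across the first few records. The type names, the sample size, and both function names are assumptions; the tool's real inference may differ, for example in how it handles nested objects.

```python
import json
from itertools import islice


def _type_name(value):
    """Map a decoded JSON value to a JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def load_records(text):
    """An array yields its elements; a single object is treated as one record."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def infer_schema(records, sample_size=100):
    """Return {field: sorted list of observed type names} for the sample."""
    seen = {}
    for record in islice(records, sample_size):
        for key, value in record.items():
            seen.setdefault(key, set()).add(_type_name(value))
    return {key: sorted(types) for key, types in seen.items()}
```

A field that is missing or null in the sampled records shows up with fewer or different types than the full file would give, which is the usual trade-off of sampling.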
Parquet
Schema is read from Parquet metadata only — the file data is never loaded. Partitioned datasets (a directory containing two or more .parquet files) are grouped into a single logical cr:FileSet and cr:RecordSet.
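
The grouping rule can be sketched as follows. This is a hypothetical helper, and it assumes that "a directory containing two or more .parquet files" refers to files directly inside that directory; how nested partition directories are treated is not specified here.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_parquet(paths):
    """Split .parquet paths into partitioned directories and standalone files."""
    by_dir = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix == ".parquet":
            by_dir[str(path.parent)].append(path.name)

    partitioned = {}  # directory -> file names, described as one logical dataset
    standalone = []   # files that are alone in their directory
    for directory, names in by_dir.items():
        if len(names) >= 2:
            partitioned[directory] = sorted(names)
        else:
            standalone.append(str(PurePosixPath(directory) / names[0]))
    return partitioned, sorted(standalone)
```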
WFDB
WFDB (WaveForm DataBase) is the standard physiological signal format on PhysioNet. The handler reads the .hea header file and records signal channel names, sampling frequency, number of samples, and duration. Associated .dat binary files are listed as related files.
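
For readers unfamiliar with the header layout, the sketch below parses the common single-segment case with the standard library alone. It is an illustration, not the handler's code: parse_hea is a hypothetical name, and multi-segment records and the optional counter-frequency and base-time fields are ignored.

```python
def parse_hea(text):
    """Parse a simple single-segment WFDB header into a summary dict."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]  # drop comments

    # Record line: name, number of signals, sampling frequency, sample count.
    record = lines[0].split()
    n_signals = int(record[1])
    fs = float(record[2].split("/")[0]) if len(record) > 2 else None
    n_samples = int(record[3]) if len(record) > 3 else None

    signal_names, dat_files = [], []
    for ln in lines[1:1 + n_signals]:
        # Signal line: file, format, gain, ADC resolution, ADC zero,
        # initial value, checksum, block size, description.
        fields = ln.split(maxsplit=8)
        if fields[0] not in dat_files:
            dat_files.append(fields[0])
        signal_names.append(fields[8] if len(fields) > 8 else None)

    duration = n_samples / fs if fs and n_samples is not None else None
    return {
        "n_signals": n_signals,
        "sampling_frequency": fs,
        "n_samples": n_samples,
        "duration_seconds": duration,
        "signal_names": signal_names,
        "dat_files": dat_files,
    }
```

Duration follows directly from the record line: the number of samples divided by the sampling frequency.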
Images
Standard images are read with Pillow. Multi-band or scientific TIFFs fall back to tifffile. All images in a dataset are grouped into one cr:FileSet with a single summary cr:RecordSet covering width, height, color mode, and encoding format.
DICOM
DICOM (.dcm, .dicom) is the standard format for medical imaging (CT, MRI, PET, etc.). The handler uses pydicom with stop_before_pixels=True — only the file header is read, so large pixel arrays are never loaded into memory.
Extracted metadata: image dimensions (rows, columns), number of frames, bits allocated per pixel, photometric interpretation, pixel spacing, slice thickness, modality, study/series description, manufacturer, and SOP class UID.
Files with no extension are also accepted if they carry the DICOM magic bytes (DICM at byte offset 128), which is common in PACS exports.
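
The magic-byte check is small enough to show in full. This is a minimal sketch with a hypothetical function name; it tests only the four bytes after the 128-byte preamble and says nothing about whether the rest of the file is valid DICOM.

```python
def has_dicom_magic(path):
    """Return True if the file carries 'DICM' at byte offset 128."""
    with open(path, "rb") as fh:
        fh.seek(128)  # skip the 128-byte preamble
        return fh.read(4) == b"DICM"
```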
All DICOM files in a dataset are grouped into one cr:FileSet with a summary cr:RecordSet covering modality counts and dimension ranges.
NIfTI
NIfTI (.nii, .nii.gz) is the standard format for neuroimaging data (structural MRI, fMRI, CT). The handler uses nibabel and reads the header only — the voxel data array is never loaded.
Extracted metadata: spatial dimensions (x, y, z), number of timepoints for 4D volumes, voxel spacing in mm, stored data type, NIfTI version (1 or 2), and repetition time (TR) for fMRI data.
All NIfTI files in a dataset are grouped into one cr:FileSet with a summary cr:RecordSet. The tr_seconds field is only added when at least one 4D volume is present.
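
To show why a header-only read is cheap, the sketch below pulls dimensions and spacing out of the fixed 348-byte NIfTI-1 header using only the standard library. It is a stand-in for illustration, not the nibabel-based handler: the function name is hypothetical, NIfTI-2 is not handled, and the spacing values are returned as stored, without applying the unit codes that say whether they are in mm.

```python
import gzip
import struct


def read_nifti1_dims(path):
    """Read shape and voxel spacing from a NIfTI-1 header (.nii or .nii.gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as fh:
        header = fh.read(348)  # the voxel data after the header is never read
    if len(header) < 348:
        raise ValueError("file too short for a NIfTI-1 header")

    # sizeof_hdr must be 348; try both byte orders to find the right one.
    for endian in ("<", ">"):
        (sizeof_hdr,) = struct.unpack_from(endian + "i", header, 0)
        if sizeof_hdr == 348:
            break
    else:
        raise ValueError("not a NIfTI-1 header")

    dim = struct.unpack_from(endian + "8h", header, 40)     # dim[0] = rank
    pixdim = struct.unpack_from(endian + "8f", header, 76)  # per-axis spacing
    ndim = dim[0]
    if not 1 <= ndim <= 7:
        raise ValueError("invalid number of dimensions")
    shape = dim[1:1 + ndim]
    return {
        "spatial_dims": tuple(shape[:3]),
        "timepoints": shape[3] if ndim >= 4 else None,
        "voxel_spacing": tuple(pixdim[1:4]),
    }
```

A file with timepoints set to None is a 3D volume, which is the case where no repetition time applies.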
Hidden files and directories
Files inside hidden directories (any path component starting with .) are always skipped. Use --include and --exclude glob patterns to further control which files are processed.
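
The filtering described above can be approximated as follows. The hidden-path rule follows the text directly; how --include and --exclude combine is an assumption here (an empty include list means everything, and exclude wins over include), and the function names are hypothetical. Note that fnmatch lets * match across path separators.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_hidden(rel_path):
    """True if any component of the relative path starts with a dot."""
    return any(part.startswith(".") for part in PurePosixPath(rel_path).parts)


def select_files(rel_paths, include=(), exclude=()):
    """Drop hidden paths, then apply include and exclude glob patterns."""
    kept = []
    for path in rel_paths:
        if is_hidden(path):
            continue  # always skipped, regardless of patterns
        if include and not any(fnmatch(path, pat) for pat in include):
            continue
        if any(fnmatch(path, pat) for pat in exclude):
            continue
        kept.append(path)
    return kept
```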