Skip to content

Supported Formats

croissant-baker detects file types automatically. Unrecognized files are skipped silently. Handlers are checked in the order listed below — the first match wins.

File types

Format Extensions What's extracted
CSV .csv, .csv.gz, .csv.bz2, .csv.xz Column names, inferred types, optional row count
TSV .tsv, .tsv.gz, .tsv.bz2, .tsv.xz Column names, inferred types, optional row count
FHIR .ndjson, .ndjson.gz, .json, .json.gz Resource types, field names and types per resource
JSON / JSONL .json, .json.gz, .jsonl, .jsonl.gz Schema inferred from a sample of records
WFDB .hea Signal names, sampling frequency, duration, number of signals
Parquet .parquet Arrow schema, column names and types, row count
Images .jpg, .jpeg, .png, .gif, .bmp, .webp, .ico, .tiff, .tif Dimensions, color mode, encoding format
DICOM .dcm, .dicom Image geometry, modality, pixel encoding, acquisition parameters
NIfTI .nii, .nii.gz Spatial dimensions, voxel spacing, data type, TR for fMRI volumes

CSV and TSV

CSV and TSV files are read with PyArrow's streaming reader — memory is constant regardless of file size. Type inference runs in two passes: an initial sweep, then per-column promotion when the first pass hits a type conflict.

Row counts are omitted by default for speed. Pass --count-csv-rows to do a full scan for exact counts (slow on large datasets).

Compressed variants (.gz, .bz2, .xz) are handled transparently.

FHIR (.ndjson, .json Bundle)

Two FHIR serialization formats are supported:

  • NDJSON bulk export (.ndjson, .ndjson.gz): one resource per line, all of the same resourceType. Produced by FHIR Bulk Data servers.
  • JSON Bundle (.json, .json.gz): a FHIR Bundle whose entry[] may contain mixed resource types.

Field names and types are inferred from a sample of resources. OperationOutcome resources (error markers) are skipped.

Note

FHIR .json files are detected by content — the handler looks for "resourceType": "<UpperCase…" before accepting. Plain JSON files that happen to use .json are handled by the JSON handler instead.

JSON and JSONL

  • JSON (.json, .json.gz): an array of objects (one object per record) or a single object (treated as one record).
  • JSONL (.jsonl, .jsonl.gz): newline-delimited JSON, one object per line.

Schema is inferred from a sample of records. FHIR .json files are excluded — they go to the FHIR handler.

Parquet

Schema is read from Parquet metadata only — the file data is never loaded. Partitioned datasets (a directory containing two or more .parquet files) are grouped into a single logical cr:FileSet and cr:RecordSet.

WFDB

WFDB (WaveForm DataBase) is the standard physiological signal format on PhysioNet. The handler reads the .hea header file and records signal channel names, sampling frequency, number of samples, and duration. Associated .dat binary files are listed as related files.

Images

Standard images are read with Pillow. Multi-band or scientific TIFFs fall back to tifffile. All images in a dataset are grouped into one cr:FileSet with a single summary cr:RecordSet covering width, height, color mode, and encoding format.

DICOM

DICOM (.dcm, .dicom) is the standard format for medical imaging (CT, MRI, PET, etc.). The handler uses pydicom with stop_before_pixels=True — only the file header is read, so large pixel arrays are never loaded into memory.

Extracted metadata: image dimensions (rows, columns), number of frames, bits allocated per pixel, photometric interpretation, pixel spacing, slice thickness, modality, study/series description, manufacturer, and SOP class UID.

Files with no extension are also accepted if they carry the DICOM magic bytes (DICM at byte offset 128), which is common in PACS exports.

All DICOM files in a dataset are grouped into one cr:FileSet with a summary cr:RecordSet covering modality counts and dimension ranges.

NIfTI

NIfTI (.nii, .nii.gz) is the standard format for neuroimaging data (structural MRI, fMRI, CT). The handler uses nibabel and reads the header only — the voxel data array is never loaded.

Extracted metadata: spatial dimensions (x, y, z), number of timepoints for 4D volumes, voxel spacing in mm, stored data type, NIfTI version (1 or 2), and repetition time (TR) for fMRI data.

All NIfTI files in a dataset are grouped into one cr:FileSet with a summary cr:RecordSet. The tr_seconds field is only added when at least one 4D volume is present.

Hidden files and directories

Files inside hidden directories (any path component starting with .) are always skipped. Use --include and --exclude glob patterns to further control which files are processed.