API Reference

Use croissant-baker as a Python library to generate Croissant metadata programmatically, without going through the CLI.

MetadataGenerator

croissant_baker.metadata_generator.MetadataGenerator

Generates Croissant metadata for datasets with automatic type inference.

Discovers files, delegates format-specific logic to registered handlers via the build_croissant protocol, and assembles the final JSON-LD.

__init__(dataset_path, name=None, description=None, url=None, license=None, citation=None, version=None, date_published=None, date_created=None, date_modified=None, creators=None, publisher=None, keywords=None, in_language=None, same_as=None, sd_license=None, sd_version=None, alternate_name=None, is_live_dataset=None, temporal_coverage=None, usage_info=None, field_mappings=None, count_csv_rows=False, includes=None, excludes=None, rai_fields=None)

Initialize the metadata generator for a dataset.

Parameters:

Name Type Description Default
dataset_path str

Path to the directory containing dataset files.

required
name Optional[str]

Dataset name (defaults to directory name).

None
description Optional[str]

Dataset description.

None
url Optional[str]

Dataset URL.

None
license Optional[str]

License URL or SPDX identifier (e.g. "CC-BY-4.0").

None
citation Optional[str]

Citation text, preferably BibTeX format.

None
version Optional[str]

Dataset version string.

None
date_published Optional[str]

Publication date in ISO format ("2023-12-15" or "2023-12-15T10:30:00").

None
date_created Optional[str]

Creation date in ISO format.

None
date_modified Optional[str]

Last-modified date in ISO format.

None
creators Optional[List[Dict[str, str]]]

List of dicts with "name", "email", and/or "url" keys.

None
publisher Optional[str]

Name of the publishing organization (schema.org/Organization).

None
keywords Optional[List[str]]

Topical keywords for dataset discovery (schema.org/keywords).

None
in_language Optional[List[str]]

BCP 47 language code(s) (e.g. "en"). Multiple supported.

None
same_as Optional[List[str]]

URLs of equivalent dataset records (e.g. DOI, mirror landing pages). Multiple values supported per schema.org/sameAs.

None
sd_license Optional[str]

License of the metadata description itself, distinct from the data license (schema.org/sdLicense).

None
sd_version Optional[str]

Version of the metadata description, distinct from version. Defaults to None — only emitted when set.

None
alternate_name Optional[str]

Short alias for the dataset (schema.org/alternateName).

None
is_live_dataset Optional[bool]

Mark dataset as a live, evolving stream.

None
temporal_coverage Optional[str]

Time period the data covers — schema.org accepts free text or ISO 8601 (e.g., "2008/2019", "2023-01-15").

None
usage_info Optional[str]

URL of a usage/consent policy (e.g., a DUO term URL, ODRL Offer URL).

None
field_mappings Optional[Dict[str, Dict[str, object]]]

Per-column overrides keyed by field name. Each value is a dict with optional equivalent_property (vocab URI) and data_types (list of vocab URIs). Used to link columns to external vocabularies like Wikidata/SNOMED/LOINC.

None
count_csv_rows bool

If True, scan each CSV fully for exact row counts. Defaults to False for performance.

False
includes Optional[List[str]]

Glob patterns to include. Applied before excludes.

None
excludes Optional[List[str]]

Glob patterns to exclude. Applied after includes.

None
rai_fields Optional[Dict[str, object]]

Native mlcroissant RAI metadata fields, passed through to mlc.Metadata unchanged.

None

Raises:

Type Description
ValueError

If dataset_path is not a directory.
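
As a rough illustration of how the constructor parameters above fit together, the sketch below collects a few of them into a keyword-argument dict. Every value (dataset name, creator, the `heart_rate` column, the `example.org` vocabulary URI) is a made-up placeholder, and `build_generator` is a hypothetical helper that imports the package lazily so the dict can be inspected without croissant-baker installed.

```python
from typing import Any, Dict

# Keyword arguments mirroring the documented __init__ parameters.
# All values are illustrative placeholders, not real dataset details.
generator_kwargs: Dict[str, Any] = {
    "name": "example-dataset",
    "description": "Illustrative dataset used to show the constructor arguments.",
    "license": "CC-BY-4.0",
    "version": "1.0.0",
    "date_published": "2023-12-15",
    "creators": [{"name": "Jane Doe", "email": "jane@example.org"}],
    "keywords": ["example", "croissant"],
    "in_language": ["en"],
    "field_mappings": {
        # Hypothetical column name linked to a placeholder vocabulary URI.
        "heart_rate": {
            "equivalent_property": "https://example.org/vocab/heartRate",
            "data_types": ["https://schema.org/Integer"],
        }
    },
    "includes": ["*.csv"],
    "excludes": ["tmp/*"],
    "count_csv_rows": False,
}


def build_generator(dataset_path: str):
    """Hypothetical helper: create a MetadataGenerator from the kwargs above."""
    # Imported lazily so this module loads even without croissant-baker installed.
    from croissant_baker.metadata_generator import MetadataGenerator

    return MetadataGenerator(dataset_path, **generator_kwargs)
```

Calling `build_generator("path/to/dataset")` would raise ValueError if the path is not a directory, as documented above.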

generate_metadata(progress_callback=None)

Generate complete Croissant metadata for the dataset.

Parameters:

Name Type Description Default
progress_callback

Optional callback with signature (current: int, total: int, file_path: str) -> None, called before each file is processed.

None
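
A callback matching the documented signature might look like the sketch below. `make_progress_logger` is a hypothetical helper, and the simulated calls assume a 1-based `current` counter, which the reference does not specify.

```python
from typing import Callable, List


def make_progress_logger(lines: List[str]) -> Callable[[int, int, str], None]:
    """Return a callback matching (current, total, file_path) -> None."""

    def on_progress(current: int, total: int, file_path: str) -> None:
        # Record one readable line per file; swap for print() or a progress bar.
        lines.append(f"[{current}/{total}] {file_path}")

    return on_progress


progress_lines: List[str] = []
callback = make_progress_logger(progress_lines)

# Simulate the calls a generator might make for a two-file dataset
# (assumption: current starts at 1).
callback(1, 2, "data/train.csv")
callback(2, 2, "data/test.csv")
```

With the library installed, the same callable would be passed as `generator.generate_metadata(progress_callback=callback)`.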

save_metadata(output_path, validate=True)

Generate and save Croissant metadata to a file.

Parameters:

Name Type Description Default
output_path str

Path where the JSON-LD metadata file will be written.

required
validate bool

If True (default), validates with mlcroissant before saving.

True

Raises:

Type Description
ValueError

If validation fails or the file cannot be saved.
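
Because save_metadata reports both validation and write failures as ValueError, callers may want to catch that one exception type. The sketch below assumes nothing beyond the documented signature; `try_save` and the two stand-in classes are hypothetical and exist only so the error path can be exercised without the library.

```python
from typing import Optional


def try_save(generator, output_path: str) -> Optional[str]:
    """Save metadata; return None on success or the error message on failure."""
    try:
        generator.save_metadata(output_path, validate=True)
    except ValueError as exc:
        # Documented failure mode: validation failed or the file could not be saved.
        return str(exc)
    return None


class FailingGenerator:
    """Hypothetical stand-in whose save_metadata always fails."""

    def save_metadata(self, output_path: str, validate: bool = True) -> None:
        raise ValueError("validation failed")


class RecordingGenerator:
    """Hypothetical stand-in that only remembers where it was asked to save."""

    def __init__(self) -> None:
        self.saved_to: Optional[str] = None

    def save_metadata(self, output_path: str, validate: bool = True) -> None:
        self.saved_to = output_path


failure_message = try_save(FailingGenerator(), "croissant.json")
recorder = RecordingGenerator()
success_result = try_save(recorder, "croissant.json")
```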

File Discovery

croissant_baker.files.discover_files(dir_path, include_patterns=None, exclude_patterns=None)

Recursively discover all files in a directory (skipping hidden directories) and return their relative paths.

Parameters:

Name Type Description Default
dir_path str

Path to the directory to scan.

required
include_patterns Optional[List[str]]

Optional list of glob patterns to include.

None
exclude_patterns Optional[List[str]]

Optional list of glob patterns to exclude.

None

Returns:

Type Description
List[Path]

List of relative file paths found in the directory.

Raises:

Type Description
FileNotFoundError

If the path does not exist or is not a directory.

PermissionError

If the directory cannot be accessed.
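
To make the documented behaviour concrete, here is an illustrative re-implementation using only the standard library. It is not croissant-baker's code: `sketch_discover` is a hypothetical name, and matching patterns against the POSIX-style relative path with `fnmatchcase` is an assumption about how the glob patterns are interpreted.

```python
import os
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path
from typing import List, Optional


def sketch_discover(
    dir_path: str,
    include_patterns: Optional[List[str]] = None,
    exclude_patterns: Optional[List[str]] = None,
) -> List[Path]:
    """Illustrative only; the real discover_files may match patterns differently."""
    root = Path(dir_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {dir_path}")

    found: List[Path] = []
    for current, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for filename in filenames:
            rel = (Path(current) / filename).relative_to(root)
            text = rel.as_posix()
            # Includes are checked first, then excludes remove matches.
            if include_patterns and not any(fnmatchcase(text, p) for p in include_patterns):
                continue
            if exclude_patterns and any(fnmatchcase(text, p) for p in exclude_patterns):
                continue
            found.append(rel)
    return sorted(found)


# Build a throwaway directory tree to exercise the sketch.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "data").mkdir()
    (base / ".git").mkdir()
    (base / "data" / "train.csv").write_text("a,b\n1,2\n")
    (base / "data" / "notes.txt").write_text("scratch\n")
    (base / ".git" / "config.csv").write_text("x\n")
    (base / "README.md").write_text("# demo\n")
    demo_result = [
        p.as_posix() for p in sketch_discover(tmp, ["*.csv", "*.md"], ["README*"])
    ]
```

In this demo only `data/train.csv` survives: the `.git` directory is skipped, `notes.txt` matches no include pattern, and `README.md` is included and then excluded.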

Handler Interface

croissant_baker.handlers.base_handler.FileTypeHandler

Bases: ABC

Abstract base class for file type handlers.

Each handler is responsible for three things:
  • can_handle: decide if this handler owns a given file
  • extract_metadata: extract raw metadata from a single file
  • build_croissant: turn that metadata into Croissant FileSets + RecordSets

All three are required. The generator owns FileObject creation and ID assignment; build_croissant returns only FileSets and RecordSets.

Adding a new format: subclass this, implement all three methods, register the instance in registry.py — no other files need to change.

Subclasses should also set these class attributes for documentation:
  • EXTENSIONS: tuple of file extensions this handler claims (e.g. (".csv",))
  • FORMAT_NAME: short display name for docs (e.g. "CSV")
  • FORMAT_DESCRIPTION: one-line summary of what metadata is extracted

can_handle(file_path) abstractmethod

Check if the handler can process the given file.

Parameters:

Name Type Description Default
file_path Path

Path to the file to check

required

Returns:

Type Description
bool

True if this handler can process the file, False otherwise

extract_metadata(file_path, **kwargs) abstractmethod

Extract comprehensive metadata from a single file.

Should return a dictionary containing file information, structure, types, and any format-specific metadata needed for Croissant generation.

Subclasses may declare additional named parameters before **kwargs to support handler-specific options (e.g. count_rows for CSV).

Parameters:

Name Type Description Default
file_path Path

Path to the file to process

required
**kwargs object

Handler-specific options forwarded from MetadataGenerator

{}

Returns:

Type Description
dict

Dictionary containing extracted metadata. For tabular data, should include:
  • column_types: Dict mapping column names to Croissant types
  • Basic file info (path, name, size, hash, encoding_format)

Raises:

Type Description
Exception

If the file cannot be processed
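
The shape of such a dictionary can be sketched for a CSV file as follows. This is an assumption-heavy illustration: the key spellings are taken from the description above but not confirmed against the real handlers, `sketch_extract_csv_metadata` is a hypothetical function, SHA-256 is an assumed hash algorithm, and every column is labelled `sc:Text` instead of being type-inferred.

```python
import csv
import hashlib
import tempfile
from pathlib import Path
from typing import Any, Dict


def sketch_extract_csv_metadata(file_path: Path) -> Dict[str, Any]:
    """Illustrative only: real handlers may use other keys and infer types."""
    raw = file_path.read_bytes()
    # Read just the header row to learn the column names.
    with file_path.open(newline="", encoding="utf-8") as fh:
        header = next(csv.reader(fh), [])
    return {
        "path": str(file_path),
        "name": file_path.name,
        "size": len(raw),
        # Assumption: SHA-256 hex digest of the file contents.
        "hash": hashlib.sha256(raw).hexdigest(),
        "encoding_format": "text/csv",
        # Assumption: every column is treated as text; no type inference here.
        "column_types": {column: "sc:Text" for column in header},
    }


# Write a tiny CSV to a temporary directory and extract its metadata.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "people.csv"
    sample.write_bytes(b"name,age\nAda,36\n")
    sample_meta = sketch_extract_csv_metadata(sample)
```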

build_croissant(file_metas, file_ids) abstractmethod

Build Croissant FileSets and RecordSets for all files this handler processed.

Called once per handler after the FileObject loop. Receives all metadata dicts this handler produced for the dataset, with pre-assigned FileObject IDs aligned by position.

Parameters:

Name Type Description Default
file_metas list[dict]

metadata dicts from extract_metadata, one per file

required
file_ids list[str]

FileObject IDs assigned by the generator, aligned with file_metas

required

Returns:

Type Description
tuple[list, list]

(additional_distributions, record_sets). additional_distributions contains FileSets only. FileObjects are always owned by the generator.
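
Putting the three methods together, a new handler could be sketched as below. To keep the example runnable without croissant-baker, it defines a local stand-in for `FileTypeHandler` that mirrors the documented interface; `TxtHandler` is hypothetical, and plain dicts stand in for the mlcroissant RecordSet objects a real handler would return. Registration in registry.py is not shown because its API is not described here.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Tuple


class FileTypeHandler(ABC):
    """Local stand-in mirroring the documented interface (not the real base class)."""

    @abstractmethod
    def can_handle(self, file_path: Path) -> bool: ...

    @abstractmethod
    def extract_metadata(self, file_path: Path, **kwargs: object) -> dict: ...

    @abstractmethod
    def build_croissant(
        self, file_metas: List[dict], file_ids: List[str]
    ) -> Tuple[list, list]: ...


class TxtHandler(FileTypeHandler):
    """Hypothetical handler for plain-text files."""

    EXTENSIONS = (".txt",)
    FORMAT_NAME = "TXT"
    FORMAT_DESCRIPTION = "File name and encoding format only"

    def can_handle(self, file_path: Path) -> bool:
        # Claim a file purely by its extension, case-insensitively.
        return file_path.suffix.lower() in self.EXTENSIONS

    def extract_metadata(self, file_path: Path, **kwargs: object) -> dict:
        # Minimal metadata; a real handler would also record size and hash.
        return {
            "path": str(file_path),
            "name": file_path.name,
            "encoding_format": "text/plain",
        }

    def build_croissant(
        self, file_metas: List[dict], file_ids: List[str]
    ) -> Tuple[list, list]:
        # file_metas and file_ids are aligned by position, so zip pairs them up.
        record_sets = [
            {"id": f"{file_id}-records", "source": meta["name"]}
            for meta, file_id in zip(file_metas, file_ids)
        ]
        # No FileSets are produced here; FileObjects belong to the generator.
        return [], record_sets


handler = TxtHandler()
metas = [
    handler.extract_metadata(Path("docs/a.txt")),
    handler.extract_metadata(Path("docs/b.txt")),
]
distributions, record_sets = handler.build_croissant(metas, ["file_0", "file_1"])
```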