API Reference

Use croissant-baker as a Python library to generate Croissant metadata programmatically, without going through the CLI.

MetadataGenerator

croissant_baker.metadata_generator.MetadataGenerator

Generates Croissant metadata for datasets with automatic type inference.

Discovers files, delegates format-specific logic to registered handlers via the build_croissant protocol, and assembles the final JSON-LD.

__init__(dataset_path, name=None, description=None, url=None, license=None, citation=None, version=None, date_published=None, creators=None, count_csv_rows=False, includes=None, excludes=None, rai_fields=None)

Initialize the metadata generator for a dataset.

Parameters:

Name Type Description Default
dataset_path str

Path to the directory containing dataset files.

required
name Optional[str]

Dataset name (defaults to directory name).

None
description Optional[str]

Dataset description.

None
url Optional[str]

Dataset URL.

None
license Optional[str]

License URL or SPDX identifier (e.g. "CC-BY-4.0").

None
citation Optional[str]

Citation text, preferably BibTeX format.

None
version Optional[str]

Dataset version string.

None
date_published Optional[str]

Publication date in ISO format ("2023-12-15" or "2023-12-15T10:30:00").

None
creators Optional[List[Dict[str, str]]]

List of dicts with "name", "email", and/or "url" keys.

None
count_csv_rows bool

If True, scan each CSV fully for exact row counts. Defaults to False for performance.

False
includes Optional[List[str]]

Glob patterns to include. Applied before excludes.

None
excludes Optional[List[str]]

Glob patterns to exclude. Applied after includes.

None
rai_fields Optional[Dict[str, object]]

Native mlcroissant RAI metadata fields to pass through to mlc.Metadata unchanged.

None

Raises:

Type Description
ValueError

If dataset_path is not a directory.
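
As a minimal sketch of how the parameters above fit together, the snippet below collects constructor arguments and builds a generator. All values are placeholders, the helper names build_generator_kwargs and make_generator are hypothetical, and the glob patterns are illustrative only, since the exact pattern-matching rules are not described here. The import is deferred so the snippet loads even when croissant-baker is not installed.

```python
from typing import Any, Dict


def build_generator_kwargs(dataset_path: str) -> Dict[str, Any]:
    """Collect keyword arguments for MetadataGenerator.__init__ (placeholder values)."""
    return {
        "dataset_path": dataset_path,
        "name": "example-dataset",
        "description": "Placeholder description.",
        "license": "CC-BY-4.0",
        "version": "1.0.0",
        "date_published": "2023-12-15",
        "creators": [{"name": "Jane Doe", "email": "jane@example.org"}],
        "count_csv_rows": False,    # True scans every CSV fully for exact row counts
        "includes": ["*.csv"],      # applied before excludes
        "excludes": ["*_tmp.csv"],  # applied after includes
    }


def make_generator(dataset_path: str):
    """Build a MetadataGenerator; ValueError is raised if dataset_path is not a directory."""
    # Deferred import: this line requires croissant-baker to be installed.
    from croissant_baker.metadata_generator import MetadataGenerator

    return MetadataGenerator(**build_generator_kwargs(dataset_path))
```

Parameters left out of the dict (url, citation, rai_fields) keep their default of None.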

generate_metadata(progress_callback=None)

Generate complete Croissant metadata for the dataset.

Parameters:

Name Type Description Default
progress_callback

Optional callback with signature (current: int, total: int, file_path: str) -> None, called before each file is processed.

None
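
A callback matching the documented signature might look like the sketch below. The factory name make_progress_logger is hypothetical, and the two calls at the end only simulate what the generator would do. Whether current starts at 0 or 1 is not stated here, so the sketch simply echoes whatever it receives.

```python
from typing import Callable, List


def make_progress_logger(sink: List[str]) -> Callable[[int, int, str], None]:
    """Return a callback that appends one formatted line per file to `sink`."""

    def on_progress(current: int, total: int, file_path: str) -> None:
        sink.append(f"[{current}/{total}] {file_path}")

    return on_progress


messages: List[str] = []
callback = make_progress_logger(messages)

# Simulated invocations; in real use the generator makes these calls.
callback(1, 2, "data/a.csv")
callback(2, 2, "data/b.csv")
print("\n".join(messages))
```

In real use the callback would be passed as generate_metadata(progress_callback=callback).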

save_metadata(output_path, validate=True)

Generate and save Croissant metadata to a file.

Parameters:

Name Type Description Default
output_path str

Path where the JSON-LD metadata file will be written.

required
validate bool

If True (default), validates with mlcroissant before saving.

True

Raises:

Type Description
ValueError

If validation fails or the file cannot be saved.
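
Because save_metadata reports both validation and write failures as ValueError, callers may want to catch it in one place. The sketch below shows one way to do that. save_or_report is a hypothetical helper, and _StubGenerator is a stand-in used only so the snippet runs without the package installed.

```python
import sys


def save_or_report(generator, output_path: str) -> bool:
    """Call save_metadata and turn a ValueError into a False return value."""
    try:
        generator.save_metadata(output_path, validate=True)
    except ValueError as exc:
        print(f"Could not save {output_path}: {exc}", file=sys.stderr)
        return False
    return True


class _StubGenerator:
    """Hypothetical stand-in for MetadataGenerator; it writes nothing."""

    def __init__(self, fail: bool) -> None:
        self.fail = fail

    def save_metadata(self, output_path: str, validate: bool = True) -> None:
        if self.fail:
            raise ValueError("validation failed")


ok = save_or_report(_StubGenerator(fail=False), "croissant.json")
failed = save_or_report(_StubGenerator(fail=True), "croissant.json")
```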

File Discovery

croissant_baker.files.discover_files(dir_path, include_patterns=None, exclude_patterns=None)

Recursively discover all files in a directory (skipping hidden directories) and return their relative paths.

Parameters:

Name Type Description Default
dir_path str

Path to the directory to scan.

required
include_patterns Optional[List[str]]

Optional list of glob patterns to include.

None
exclude_patterns Optional[List[str]]

Optional list of glob patterns to exclude.

None

Returns:

Type Description
List[Path]

List of relative file paths found in the directory.

Raises:

Type Description
FileNotFoundError

If the directory does not exist or is not a directory.

PermissionError

If the directory cannot be accessed.
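
To illustrate the documented behavior (recursive scan, hidden directories skipped, relative paths returned), here is a standard-library approximation. It is not the library's implementation. It assumes patterns are matched with fnmatch against the POSIX-style relative path and that includes are applied before excludes. It mirrors only the FileNotFoundError case, not PermissionError.

```python
import fnmatch
import os
import tempfile
from pathlib import Path
from typing import List, Optional


def discover_files_sketch(
    dir_path: str,
    include_patterns: Optional[List[str]] = None,
    exclude_patterns: Optional[List[str]] = None,
) -> List[Path]:
    root = Path(dir_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {dir_path}")
    found: List[Path] = []
    for current, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for filename in filenames:
            rel = (Path(current) / filename).relative_to(root)
            rel_str = rel.as_posix()
            if include_patterns and not any(
                fnmatch.fnmatch(rel_str, p) for p in include_patterns
            ):
                continue
            if exclude_patterns and any(
                fnmatch.fnmatch(rel_str, p) for p in exclude_patterns
            ):
                continue
            found.append(rel)
    return sorted(found)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / ".cache").mkdir()
    for rel in ("a.csv", "notes.txt", "sub/b.csv", ".cache/c.csv"):
        (root / rel).write_text("x")
    all_files = discover_files_sketch(tmp)
    csv_only = discover_files_sketch(tmp, include_patterns=["*.csv"])
    csv_top = discover_files_sketch(
        tmp, include_patterns=["*.csv"], exclude_patterns=["sub/*"]
    )
```

With fnmatch, "*" also matches path separators, so "*.csv" selects CSV files at any depth in this sketch. The real function may treat patterns differently.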

Handler Interface

croissant_baker.handlers.base_handler.FileTypeHandler

Bases: ABC

Abstract base class for file type handlers.

Each handler is responsible for three things:

  • can_handle: decide if this handler owns a given file
  • extract_metadata: extract raw metadata from a single file
  • build_croissant: turn that metadata into Croissant FileSets + RecordSets

All three are required. The generator owns FileObject creation and ID assignment; build_croissant returns only FileSets and RecordSets.

Adding a new format: subclass this, implement all three methods, and register the instance in registry.py. No other files need to change.

Subclasses should also set these class attributes for documentation:

  • EXTENSIONS: tuple of file extensions this handler claims (e.g. (".csv",))
  • FORMAT_NAME: short display name for docs (e.g. "CSV")
  • FORMAT_DESCRIPTION: one-line summary of what metadata is extracted

can_handle(file_path) abstractmethod

Check if the handler can process the given file.

Parameters:

Name Type Description Default
file_path Path

Path to the file to check.

required

Returns:

Type Description
bool

True if this handler can process the file, False otherwise.

extract_metadata(file_path, **kwargs) abstractmethod

Extract comprehensive metadata from a single file.

Should return a dictionary containing file information, structure, types, and any format-specific metadata needed for Croissant generation.

Subclasses may declare additional named parameters before **kwargs to support handler-specific options (e.g. count_rows for CSV).

Parameters:

Name Type Description Default
file_path Path

Path to the file to process.

required
**kwargs object

Handler-specific options forwarded from MetadataGenerator.

{}

Returns:

Type Description
dict

Dictionary containing extracted metadata. For tabular data, it should include:

  • column_types: Dict mapping column names to Croissant types
  • Basic file info (path, name, size, hash, encoding_format)

Raises:

Type Description
Exception

If the file cannot be processed.
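
The pattern of declaring a named option ahead of **kwargs can be sketched as a standalone function. Several things here are assumptions made for illustration and are not confirmed by this reference: the dictionary keys (taken literally from the list above), the num_rows key, SHA-256 as the hash, and "sc:Text" as a placeholder type for every column.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def extract_csv_metadata(
    file_path: Path, count_rows: bool = False, **kwargs: object
) -> dict:
    """Sketch of a CSV extract_metadata; unknown options in **kwargs are ignored."""
    data = file_path.read_bytes()
    with file_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # Count data rows only when asked, since it requires a full scan.
        num_rows = sum(1 for _ in reader) if count_rows else None
    return {
        "path": str(file_path),
        "name": file_path.name,
        "size": len(data),
        "hash": hashlib.sha256(data).hexdigest(),  # hash algorithm is an assumption
        "encoding_format": "text/csv",
        # A real handler infers types; this sketch labels every column as text.
        "column_types": {column: "sc:Text" for column in header},
        "num_rows": num_rows,  # hypothetical key
    }


content = b"id,label\n1,a\n2,b\n"
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(content)
    # `unused_option` shows that extra keyword arguments are accepted and ignored.
    meta = extract_csv_metadata(sample, count_rows=True, unused_option=1)
    meta_fast = extract_csv_metadata(sample)
```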

build_croissant(file_metas, file_ids) abstractmethod

Build Croissant FileSets and RecordSets for all files this handler processed.

Called once per handler after the FileObject loop. Receives all metadata dicts this handler produced for the dataset, with pre-assigned FileObject IDs aligned by position.

Parameters:

Name Type Description Default
file_metas list[dict]

Metadata dicts from extract_metadata, one per file.

required
file_ids list[str]

FileObject IDs assigned by the generator, aligned with file_metas.

required

Returns:

Type Description
list, list

A pair (additional_distributions, record_sets). additional_distributions contains FileSets only; FileObjects are always owned by the generator.
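
Putting the three methods together, a handler for a hypothetical plain-text format could be sketched as follows. FileTypeHandlerSketch is a stand-in that mirrors the documented interface so the example runs without the package. A real handler would subclass croissant_baker.handlers.base_handler.FileTypeHandler and return mlcroissant objects, not the placeholder dicts used here. The metadata keys are also assumptions.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Tuple


class FileTypeHandlerSketch(ABC):
    """Stand-in mirroring the documented interface; not the real base class."""

    @abstractmethod
    def can_handle(self, file_path: Path) -> bool: ...

    @abstractmethod
    def extract_metadata(self, file_path: Path, **kwargs: object) -> dict: ...

    @abstractmethod
    def build_croissant(
        self, file_metas: List[dict], file_ids: List[str]
    ) -> Tuple[list, list]: ...


class TextHandler(FileTypeHandlerSketch):
    EXTENSIONS = (".txt",)
    FORMAT_NAME = "Text"
    FORMAT_DESCRIPTION = "File name and size only"

    def can_handle(self, file_path: Path) -> bool:
        return file_path.suffix.lower() in self.EXTENSIONS

    def extract_metadata(self, file_path: Path, **kwargs: object) -> dict:
        return {
            "path": str(file_path),
            "name": file_path.name,
            "size": file_path.stat().st_size,
            "encoding_format": "text/plain",
        }

    def build_croissant(
        self, file_metas: List[dict], file_ids: List[str]
    ) -> Tuple[list, list]:
        additional_distributions: list = []  # FileSets only; none in this sketch
        record_sets: list = []
        # file_metas and file_ids are aligned by position, so zip pairs them up.
        for meta, file_id in zip(file_metas, file_ids):
            # Placeholder dict standing in for a real RecordSet object.
            record_sets.append({"source_file_id": file_id, "name": meta["name"]})
        return additional_distributions, record_sets


handler = TextHandler()
metas = [{"name": "a.txt"}, {"name": "b.txt"}]
distributions, records = handler.build_croissant(metas, ["file_0", "file_1"])
```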