API Reference
Use croissant-baker as a Python library to generate Croissant metadata
programmatically — without the CLI.
MetadataGenerator
croissant_baker.metadata_generator.MetadataGenerator
Generates Croissant metadata for datasets with automatic type inference.
Discovers files, delegates format-specific logic to registered handlers via the build_croissant protocol, and assembles the final JSON-LD.
__init__(dataset_path, name=None, description=None, url=None, license=None, citation=None, version=None, date_published=None, creators=None, count_csv_rows=False, includes=None, excludes=None, rai_fields=None)
Initialize the metadata generator for a dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_path` | `str` | Path to the directory containing dataset files. | required |
| `name` | `Optional[str]` | Dataset name (defaults to directory name). | `None` |
| `description` | `Optional[str]` | Dataset description. | `None` |
| `url` | `Optional[str]` | Dataset URL. | `None` |
| `license` | `Optional[str]` | License URL or SPDX identifier (e.g. "CC-BY-4.0"). | `None` |
| `citation` | `Optional[str]` | Citation text, preferably BibTeX format. | `None` |
| `version` | `Optional[str]` | Dataset version string. | `None` |
| `date_published` | `Optional[str]` | Publication date in ISO format ("2023-12-15" or "2023-12-15T10:30:00"). | `None` |
| `creators` | `Optional[List[Dict[str, str]]]` | List of dicts with "name", "email", and/or "url" keys. | `None` |
| `count_csv_rows` | `bool` | If True, scan each CSV fully for exact row counts. Defaults to False for performance. | `False` |
| `includes` | `Optional[List[str]]` | Glob patterns to include. Applied before excludes. | `None` |
| `excludes` | `Optional[List[str]]` | Glob patterns to exclude. Applied after includes. | `None` |
| `rai_fields` | `Optional[Dict[str, object]]` | Native mlcroissant RAI metadata fields to pass through to mlc.Metadata unchanged. | `None` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If dataset_path is not a directory. |
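As a rough illustration of the constructor arguments documented above, the sketch below builds a set of options and, only if croissant-baker happens to be importable, passes them to `MetadataGenerator`. All values (dataset name, creator, glob patterns) are made up for the example, and the exact glob-matching semantics of `includes` and `excludes` are an assumption rather than something this page specifies.

```python
import tempfile
from pathlib import Path

try:
    from croissant_baker.metadata_generator import MetadataGenerator
except ImportError:
    # Library not installed: the example then only assembles the arguments.
    MetadataGenerator = None

# Illustrative values only; none of them describe a real dataset.
options = {
    "name": "example-dataset",
    "description": "Toy dataset used to illustrate the constructor arguments.",
    "license": "CC-BY-4.0",
    "version": "1.0.0",
    "date_published": "2023-12-15",
    "creators": [{"name": "Jane Doe", "email": "jane@example.org"}],
    "count_csv_rows": True,         # exact row counts, at the cost of a full scan
    "includes": ["*.csv"],          # hypothetical pattern
    "excludes": ["*_backup.csv"],   # hypothetical pattern
}

generator = None
with tempfile.TemporaryDirectory() as tmp:
    # dataset_path must be an existing directory, otherwise ValueError is raised.
    (Path(tmp) / "data.csv").write_text("id,value\n1,a\n2,b\n")
    if MetadataGenerator is not None:
        generator = MetadataGenerator(tmp, **options)
```

Keeping the optional arguments in a dict makes it easy to load them from a config file; they could equally be written inline as keyword arguments.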
generate_metadata(progress_callback=None)
Generate complete Croissant metadata for the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `progress_callback` | | Optional callback with signature `(current: int, total: int, file_path: str) -> None`, called before processing each file. | `None` |
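A callback compatible with the documented signature could look like the following sketch. The helper name `make_progress_logger` is hypothetical, and the callback is invoked by hand here so the example runs without the library; this page does not state whether `current` starts at 0 or 1, so the values below are illustrative only.

```python
from typing import Callable, List


def make_progress_logger(sink: List[str]) -> Callable[[int, int, str], None]:
    """Return a callback matching (current, total, file_path) -> None."""

    def on_progress(current: int, total: int, file_path: str) -> None:
        # Record one line per file; a real callback might update a progress bar.
        sink.append(f"[{current}/{total}] {file_path}")

    return on_progress


messages: List[str] = []
callback = make_progress_logger(messages)

# With croissant-baker installed, the callback would be passed as:
#   generator.generate_metadata(progress_callback=callback)
# Here it is called directly to show the shape of the arguments.
callback(1, 2, "data/train.csv")
callback(2, 2, "data/test.csv")
```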
save_metadata(output_path, validate=True)
Generate and save Croissant metadata to a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `output_path` | `str` | Path where the JSON-LD metadata file will be written. | required |
| `validate` | `bool` | If True (default), validates with mlcroissant before saving. | `True` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If validation fails or the file cannot be saved. |
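Because `save_metadata` raises `ValueError` for both validation failures and write failures, callers may want to catch it in one place. The sketch below wraps the call in a hypothetical helper, `save_or_report`; `FailingGeneratorStub` is a stand-in invented for this example so that it runs without croissant-baker, not part of the library.

```python
from typing import Optional


def save_or_report(generator, output_path: str) -> Optional[str]:
    """Call save_metadata and return an error message instead of raising."""
    try:
        generator.save_metadata(output_path, validate=True)
    except ValueError as exc:
        # Raised when validation fails or the file cannot be saved.
        return f"could not write {output_path}: {exc}"
    return None


class FailingGeneratorStub:
    """Hypothetical stand-in that mimics a failing save_metadata call."""

    def save_metadata(self, output_path, validate=True):
        raise ValueError("validation failed")


error = save_or_report(FailingGeneratorStub(), "croissant.json")
```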
File Discovery
croissant_baker.files.discover_files(dir_path, include_patterns=None, exclude_patterns=None)
Recursively discover all files in a directory (skipping hidden directories) and return their relative paths.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dir_path` | `str` | Path to the directory to scan. | required |
| `include_patterns` | `Optional[List[str]]` | Optional list of glob patterns to include. | `None` |
| `exclude_patterns` | `Optional[List[str]]` | Optional list of glob patterns to exclude. | `None` |
Returns:
| Type | Description |
|---|---|
| `List[Path]` | List of relative file paths found in the directory. |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the directory does not exist or is not a directory. |
| `PermissionError` | If the directory cannot be accessed. |
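To make the described behavior concrete, here is a minimal standalone sketch of a discovery function with the same shape. It is not the library's implementation: the name `discover_files_sketch` is hypothetical, and matching patterns with `fnmatch` against the POSIX-style relative path is an assumption about how the glob patterns are interpreted.

```python
import fnmatch
import os
import tempfile
from pathlib import Path
from typing import List, Optional


def discover_files_sketch(
    dir_path: str,
    include_patterns: Optional[List[str]] = None,
    exclude_patterns: Optional[List[str]] = None,
) -> List[Path]:
    """Walk dir_path, skip hidden directories, return sorted relative paths."""
    root = Path(dir_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {dir_path}")
    found: List[Path] = []
    for current, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for filename in filenames:
            rel = (Path(current) / filename).relative_to(root)
            rel_str = rel.as_posix()
            # Includes are applied first, then excludes.
            if include_patterns and not any(
                fnmatch.fnmatch(rel_str, p) for p in include_patterns
            ):
                continue
            if exclude_patterns and any(
                fnmatch.fnmatch(rel_str, p) for p in exclude_patterns
            ):
                continue
            found.append(rel)
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    (base / ".cache").mkdir()
    (base / "a.csv").write_text("x\n1\n")
    (base / "sub" / "b.csv").write_text("x\n2\n")
    (base / "sub" / "notes.txt").write_text("hi")
    (base / ".cache" / "c.csv").write_text("x\n3\n")

    everything = discover_files_sketch(tmp)
    filtered = discover_files_sketch(
        tmp, include_patterns=["*.csv"], exclude_patterns=["sub/*"]
    )
```

In this sketch the file under `.cache` never appears because its directory is hidden, and `sub/b.csv` passes the include filter but is then dropped by the exclude filter.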
Handler Interface
croissant_baker.handlers.base_handler.FileTypeHandler
Bases: ABC
Abstract base class for file type handlers.
Each handler is responsible for three things:

- can_handle: decide if this handler owns a given file
- extract_metadata: extract raw metadata from a single file
- build_croissant: turn that metadata into Croissant FileSets + RecordSets

All three are required. The generator owns FileObject creation and ID assignment; build_croissant returns only FileSets and RecordSets.
Adding a new format: subclass this, implement all three methods, register the instance in registry.py — no other files need to change.
Subclasses should also set these class attributes for documentation:

- EXTENSIONS: tuple of file extensions this handler claims (e.g. (".csv",))
- FORMAT_NAME: short display name for docs (e.g. "CSV")
- FORMAT_DESCRIPTION: one-line summary of what metadata is extracted

can_handle(file_path)
abstractmethod
Check if the handler can process the given file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to the file to check. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if this handler can process the file, False otherwise. |
extract_metadata(file_path, **kwargs)
abstractmethod
Extract comprehensive metadata from a single file.
Should return a dictionary containing file information, structure, types, and any format-specific metadata needed for Croissant generation.
Subclasses may declare additional named parameters before **kwargs to support handler-specific options (e.g. count_rows for CSV).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to the file to process. | required |
| `**kwargs` | `object` | Handler-specific options forwarded from MetadataGenerator. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary containing extracted metadata. For tabular data, should include: |
Raises:
| Type | Description |
|---|---|
| `Exception` | If the file cannot be processed. |
build_croissant(file_metas, file_ids)
abstractmethod
Build Croissant FileSets and RecordSets for all files this handler processed.
Called once per handler after the FileObject loop. Receives all metadata dicts this handler produced for the dataset, with pre-assigned FileObject IDs aligned by position.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_metas` | `list[dict]` | Metadata dicts from extract_metadata, one per file. | required |
| `file_ids` | `list[str]` | FileObject IDs assigned by the generator, aligned with file_metas. | required |
Returns:
| Type | Description |
|---|---|
| `list` | `(additional_distributions, record_sets)`. `additional_distributions` contains FileSets only. FileObjects are always owned by the generator. |
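The three-method protocol described above can be sketched with a toy plain-text handler. To keep the example runnable without croissant-baker or mlcroissant, it subclasses a local stand-in (`HandlerSketch`) rather than `croissant_baker.handlers.base_handler.FileTypeHandler`, and `build_croissant` returns plain dicts as placeholders where a real handler would return mlcroissant FileSet and RecordSet objects. The metadata keys (`file_name`, `num_lines`) are hypothetical.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class HandlerSketch(ABC):
    """Local stand-in mirroring the three abstract methods of FileTypeHandler."""

    @abstractmethod
    def can_handle(self, file_path: Path) -> bool: ...

    @abstractmethod
    def extract_metadata(self, file_path: Path, **kwargs) -> dict: ...

    @abstractmethod
    def build_croissant(self, file_metas: list, file_ids: list) -> tuple: ...


class PlainTextHandler(HandlerSketch):
    # Documentation attributes, as suggested for subclasses.
    EXTENSIONS = (".txt",)
    FORMAT_NAME = "Plain text"
    FORMAT_DESCRIPTION = "File name and line count"

    def can_handle(self, file_path: Path) -> bool:
        return file_path.suffix.lower() in self.EXTENSIONS

    def extract_metadata(self, file_path: Path, **kwargs) -> dict:
        text = file_path.read_text(encoding="utf-8")
        # Hypothetical keys; the real handlers define their own metadata layout.
        return {"file_name": file_path.name, "num_lines": len(text.splitlines())}

    def build_croissant(self, file_metas: list, file_ids: list) -> tuple:
        # file_metas and file_ids are aligned by position.
        record_sets = [
            {"source": file_id, "name": meta["file_name"], "num_lines": meta["num_lines"]}
            for meta, file_id in zip(file_metas, file_ids)
        ]
        # No extra FileSets in this toy case; FileObjects belong to the generator.
        return [], record_sets


handler = PlainTextHandler()
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    path.write_text("first\nsecond\n", encoding="utf-8")
    meta = handler.extract_metadata(path)
    extra_distributions, record_sets = handler.build_croissant([meta], ["file_0"])
```

A real handler would additionally be registered in registry.py, as noted above; that step is omitted here because the registry's interface is not documented on this page.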