API Reference

Use croissant-baker as a Python library to generate Croissant metadata programmatically — without the CLI.

MetadataGenerator

`croissant_baker.metadata_generator.MetadataGenerator`

Generates Croissant metadata for datasets with automatic type inference.

Discovers files, delegates format-specific logic to registered handlers via the build_croissant protocol, and assembles the final JSON-LD.

`__init__(dataset_path, name=None, description=None, url=None, license=None, citation=None, version=None, date_published=None, date_created=None, date_modified=None, creators=None, publisher=None, keywords=None, in_language=None, same_as=None, sd_license=None, sd_version=None, alternate_name=None, is_live_dataset=None, temporal_coverage=None, usage_info=None, field_mappings=None, count_csv_rows=False, max_workers=None, includes=None, excludes=None, rai_fields=None)`

Initialize the metadata generator for a dataset.

Parameters:

Name	Type	Description	Default
`dataset_path`	`str`	Path to the directory containing dataset files.	required
`name`	`Optional[str]`	Dataset name (defaults to directory name).	`None`
`description`	`Optional[str]`	Dataset description.	`None`
`url`	`Optional[str]`	Dataset URL.	`None`
`license`	`Optional[str]`	License URL or SPDX identifier (e.g. "CC-BY-4.0").	`None`
`citation`	`Optional[str]`	Citation text, preferably BibTeX format.	`None`
`version`	`Optional[str]`	Dataset version string.	`None`
`date_published`	`Optional[str]`	Publication date in ISO format ("2023-12-15" or "2023-12-15T10:30:00").	`None`
`date_created`	`Optional[str]`	Creation date in ISO format.	`None`
`date_modified`	`Optional[str]`	Last-modified date in ISO format.	`None`
`creators`	`Optional[List[Dict[str, str]]]`	List of dicts with "name", "email", and/or "url" keys.	`None`
`publisher`	`Optional[str]`	Name of the publishing organization (schema.org/Organization).	`None`
`keywords`	`Optional[List[str]]`	Topical keywords for dataset discovery (schema.org/keywords).	`None`
`in_language`	`Optional[List[str]]`	BCP 47 language code(s) (e.g. "en"). Multiple supported.	`None`
`same_as`	`Optional[List[str]]`	URLs of equivalent dataset records (e.g. DOI, mirror landing pages). Multiple values supported per schema.org/sameAs.	`None`
`sd_license`	`Optional[str]`	License of the metadata description itself, distinct from the data license (schema.org/sdLicense).	`None`
`sd_version`	`Optional[str]`	Version of the metadata description, distinct from `version`. Defaults to None — only emitted when set.	`None`
`alternate_name`	`Optional[str]`	Short alias for the dataset (schema.org/alternateName).	`None`
`is_live_dataset`	`Optional[bool]`	Mark dataset as a live, evolving stream.	`None`
`temporal_coverage`	`Optional[str]`	Time period the data covers — schema.org accepts free text or ISO 8601 (e.g., "2008/2019", "2023-01-15").	`None`
`usage_info`	`Optional[str]`	URL of a usage/consent policy (e.g., a DUO term URL, ODRL Offer URL).	`None`
`field_mappings`	`Optional[Dict[str, Dict[str, object]]]`	Per-column overrides keyed by field name. Each value is a dict with optional `equivalent_property` (vocab URI) and `data_types` (list of vocab URIs). Used to link columns to external vocabularies like Wikidata/SNOMED/LOINC.	`None`
`count_csv_rows`	`bool`	If True, scan each CSV fully for exact row counts. Defaults to False for performance.	`False`
`max_workers`	`Optional[int]`	Maximum worker threads for per-file metadata extraction. None (default) auto-sizes from the CPU count; 1 forces serial. Output is identical regardless of this value.	`None`
`includes`	`Optional[List[str]]`	Glob patterns to include. Applied before excludes.	`None`
`excludes`	`Optional[List[str]]`	Glob patterns to exclude. Applied after includes.	`None`
`rai_fields`	`Optional[Dict[str, object]]`	Native mlcroissant RAI metadata fields, passed through to `mlc.Metadata` unchanged.	`None`

Raises:

Type	Description
`ValueError`	If dataset_path is not a directory.
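
A minimal sketch of constructing a generator from the parameters documented above. Every value is a placeholder (the dataset name, creator, glob patterns, and the `example.org` vocabulary URIs), and the helper names `generator_kwargs` and `build_generator` are hypothetical. The import is deferred into the function so the block can be read and run without croissant-baker installed.

```python
from typing import Any, Dict


def generator_kwargs(dataset_path: str) -> Dict[str, Any]:
    """Collect constructor arguments; every value here is a placeholder."""
    return {
        "dataset_path": dataset_path,
        "name": "Example Dataset",
        "description": "Placeholder description.",
        "license": "CC-BY-4.0",
        "version": "1.0.0",
        "date_published": "2023-12-15",
        "creators": [{"name": "Jane Doe", "email": "jane@example.org"}],
        "keywords": ["example", "tabular"],
        "in_language": ["en"],
        # Per-column overrides keyed by field name (URIs are placeholders).
        "field_mappings": {
            "age": {
                "equivalent_property": "https://example.org/vocab/age",
                "data_types": ["https://example.org/vocab/Integer"],
            }
        },
        "count_csv_rows": False,  # skip the full CSV scan
        "max_workers": 1,         # force serial extraction
        "includes": ["*.csv"],
        "excludes": ["archive/*"],
    }


def build_generator(dataset_path: str):
    # Deferred import: requires croissant-baker to be installed.
    from croissant_baker.metadata_generator import MetadataGenerator

    # Raises ValueError if dataset_path is not a directory.
    return MetadataGenerator(**generator_kwargs(dataset_path))
```

Keeping the arguments in a plain dict makes them easy to load from a config file or to inspect in tests before any files are touched.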

`generate_metadata(progress_callback=None)`

Generate complete Croissant metadata for the dataset.

Per-file metadata extraction (handler selection, whole-file SHA-256, header/schema reads) is I/O-bound and independent across files, so it runs on a thread pool sized by max_workers. Results are reassembled in discovery order before any FileObject @id is assigned, so the output — including the order of warnings — is identical regardless of worker count.
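
The ordering guarantee described above can be sketched with the standard library alone. This is not the library's implementation: `extract_in_order` and `sha256_hex` are hypothetical names, and in-memory byte strings stand in for files. The sketch only shows one way results that finish in arbitrary order can be written back into discovery order.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence


def extract_in_order(
    items: Sequence,
    extract: Callable,
    max_workers: Optional[int] = None,
) -> list:
    """Run extract(item) on a thread pool; return results in input order."""
    results: list = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember each future's position so completion order does not matter.
        futures = {pool.submit(extract, item): index for index, item in enumerate(items)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


def sha256_hex(payload: bytes) -> str:
    """Whole-payload SHA-256, standing in for per-file extraction work."""
    return hashlib.sha256(payload).hexdigest()


payloads = [b"a", b"bb", b"ccc", b"dddd"]
serial = extract_in_order(payloads, sha256_hex, max_workers=1)
parallel = extract_in_order(payloads, sha256_hex, max_workers=4)
print(serial == parallel)
```

Because each result is stored at the index of its input, the output list is the same whether one worker or several are used.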

Parameters:

Name	Type	Description	Default
`progress_callback`	`Optional[Callable[[int, int, str], None]]`	Optional callback with signature `(completed: int, total: int, file_path: str) -> None`, invoked once per file as it finishes extraction.	`None`
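
A small sketch of a callback matching the documented `(completed, total, file_path)` signature. The factory name `make_progress_logger` and the message format are hypothetical. The source does not say which thread invokes the callback, so the sketch keeps it to a single list append.

```python
from typing import Callable, List


def make_progress_logger(sink: List[str]) -> Callable[[int, int, str], None]:
    """Return a callback that records one progress line per finished file."""

    def on_progress(completed: int, total: int, file_path: str) -> None:
        # Guard against total == 0 to avoid a division error.
        percent = 100 * completed // total if total else 100
        sink.append(f"[{completed}/{total}] {percent}% {file_path}")

    return on_progress


# Intended use, assuming `generator` is a MetadataGenerator instance:
#     lines: List[str] = []
#     generator.generate_metadata(progress_callback=make_progress_logger(lines))
```

Since files finish extraction in no fixed order, the `file_path` values may arrive in a different order than discovery even though `completed` counts upward.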

`save_metadata(output_path, validate=True)`

Generate and save Croissant metadata to a file.

Parameters:

Name	Type	Description	Default
`output_path`	`str`	Path where the JSON-LD metadata file will be written.	required
`validate`	`bool`	If True (default), validates with mlcroissant before saving.	`True`

Raises:

Type	Description
`ValueError`	If validation fails or the file cannot be saved.
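
A hedged sketch of handling the documented `ValueError`. The wrapper `try_save` and the `SupportsSave` protocol are hypothetical names; the wrapper accepts any object exposing `save_metadata(output_path, validate=True)`, so it can be exercised without croissant-baker installed.

```python
from typing import Optional, Protocol


class SupportsSave(Protocol):
    """Structural type for anything with the documented save_metadata method."""

    def save_metadata(self, output_path: str, validate: bool = True) -> object: ...


def try_save(generator: SupportsSave, output_path: str, validate: bool = True) -> Optional[str]:
    """Save metadata; return None on success or the error message on failure."""
    try:
        generator.save_metadata(output_path, validate=validate)
    except ValueError as exc:
        # Documented failure modes: validation failed or the file could not be saved.
        return str(exc)
    return None


# Intended use, assuming `generator` is a MetadataGenerator instance:
#     error = try_save(generator, "croissant.json")
#     if error is not None:
#         print(f"Could not write metadata: {error}")
```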

File Discovery

`croissant_baker.files.discover_files(dir_path, include_patterns=None, exclude_patterns=None)`

Recursively discover all files in a directory (skipping hidden directories) and return their relative paths.

Parameters:

Name	Type	Description	Default
`dir_path`	`str`	Path to the directory to scan.	required
`include_patterns`	`Optional[List[str]]`	Optional list of glob patterns to include.	`None`
`exclude_patterns`	`Optional[List[str]]`	Optional list of glob patterns to exclude.	`None`

Returns:

Type	Description
`List[Path]`	List of relative file paths found in the directory.

Raises:

Type	Description
`FileNotFoundError`	If the path does not exist or is not a directory.
`PermissionError`	If the directory cannot be accessed.
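
To illustrate the behavior described above, here is a standard-library sketch of recursive discovery that skips hidden directories and applies include patterns before exclude patterns. It is not the library's code: `discover_files_sketch` is a hypothetical name, and matching patterns with `fnmatch` against the POSIX-style relative path is an assumption about glob semantics that the source does not specify.

```python
import fnmatch
import os
from pathlib import Path
from typing import List, Optional


def discover_files_sketch(
    dir_path: str,
    include_patterns: Optional[List[str]] = None,
    exclude_patterns: Optional[List[str]] = None,
) -> List[Path]:
    """Return sorted relative paths of files under dir_path."""
    root = Path(dir_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {dir_path}")

    found: List[Path] = []
    for current, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for filename in filenames:
            relative = (Path(current) / filename).relative_to(root)
            text = relative.as_posix()
            # Includes are applied first (assumed fnmatch semantics).
            if include_patterns and not any(fnmatch.fnmatch(text, p) for p in include_patterns):
                continue
            # Excludes are applied to whatever the includes kept.
            if exclude_patterns and any(fnmatch.fnmatch(text, p) for p in exclude_patterns):
                continue
            found.append(relative)
    return sorted(found)
```

With the real function, the equivalent call would be `discover_files(dir_path, include_patterns=[...], exclude_patterns=[...])` imported from `croissant_baker.files`.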

Handler Interface

`croissant_baker.handlers.base_handler.FileTypeHandler`

Bases: ABC

Abstract base class for file type handlers.

Each handler is responsible for three things:

- can_handle: decide if this handler owns a given file
- extract_metadata: extract raw metadata from a single file
- build_croissant: turn that metadata into Croissant FileSets + RecordSets

All three are required. The generator owns FileObject creation and ID assignment; build_croissant returns only FileSets and RecordSets.

Adding a new format: subclass this, implement all three methods, register the instance in registry.py — no other files need to change.

Subclasses should also set these class attributes for documentation:

- EXTENSIONS: tuple of file extensions this handler claims (e.g. (".csv",))
- FORMAT_NAME: short display name for docs (e.g. "CSV")
- FORMAT_DESCRIPTION: one-line summary of what metadata is extracted

`can_handle(file_path)` `abstractmethod`

Check if the handler can process the given file.

Parameters:

Name	Type	Description	Default
`file_path`	`Path`	Path to the file to check	required

Returns:

Type	Description
`bool`	True if this handler can process the file, False otherwise

`extract_metadata(file_path, **kwargs)` `abstractmethod`

Extract comprehensive metadata from a single file.

Thread-safety: extract_metadata may be called concurrently across files on a single shared handler instance — handlers are registered as singletons and extraction is parallelised (see MetadataGenerator). Implementations must be safe to call concurrently: do not mutate shared or instance state during extraction, and use only parsers that are safe for concurrent reads of independent files. Read-only state set once at construction is fine; mutable per-call state must stay local — never on self (nor on the class or module).

Should return a dictionary containing file information, structure, types, and any format-specific metadata needed for Croissant generation.

Subclasses may declare additional named parameters before **kwargs to support handler-specific options (e.g. count_rows for CSV).

Parameters:

Name	Type	Description	Default
`file_path`	`Path`	Path to the file to process	required
`**kwargs`	`object`	Handler-specific options forwarded from MetadataGenerator	`{}`

Returns:

Type	Description
`dict`	Dictionary containing extracted metadata. For tabular data, should include `column_types` (a dict mapping column names to Croissant types) and basic file info (path, name, size, hash, encoding_format).

Raises:

Type	Description
`Exception`	If the file cannot be processed
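
The thread-safety rules above can be illustrated with a sketch in which configuration is read-only after construction and all per-call state stays local. `CsvHeaderHandlerSketch` is a hypothetical class that does not subclass the real base class, so the block runs without croissant-baker. The `num_rows` key, the `"sc:Text"` type label, and the naive line-based CSV parsing are assumptions made for illustration.

```python
import csv
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class CsvHeaderHandlerSketch:
    """Illustrative only; a real handler would subclass FileTypeHandler."""

    def __init__(self, delimiter: str = ",") -> None:
        # Read-only configuration, set once at construction.
        self._delimiter = delimiter

    def extract_metadata(self, file_path: Path, count_rows: bool = False, **kwargs: object) -> dict:
        # Everything below is local to this call; nothing is stored on self.
        data = file_path.read_bytes()
        lines = data.decode("utf-8").splitlines()
        rows = list(csv.reader(lines, delimiter=self._delimiter))
        header = rows[0] if rows else []
        meta = {
            "path": str(file_path),
            "name": file_path.name,
            "size": len(data),
            "hash": hashlib.sha256(data).hexdigest(),
            "encoding_format": "text/csv",
            # Placeholder type label for every column (assumption).
            "column_types": {column: "sc:Text" for column in header},
        }
        if count_rows:
            # Hypothetical key: data rows, excluding the header line.
            meta["num_rows"] = max(len(rows) - 1, 0)
        return meta


def extract_concurrently(handler: CsvHeaderHandlerSketch, paths: list) -> list:
    """Call one shared handler instance from several threads."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda p: handler.extract_metadata(p, count_rows=True), paths))
```

A version that accumulated rows on `self` instead of in local variables would let concurrent calls overwrite each other's data, which is the failure mode the thread-safety note warns against.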

`build_croissant(file_metas, file_ids)` `abstractmethod`

Build Croissant FileSets and RecordSets for all files this handler processed.

Called once per handler after the FileObject loop. Receives all metadata dicts this handler produced for the dataset, with pre-assigned FileObject IDs aligned by position.

Parameters:

Name	Type	Description	Default
`file_metas`	`list[dict]`	metadata dicts from extract_metadata, one per file	required
`file_ids`	`list[str]`	FileObject IDs assigned by the generator, aligned with file_metas	required

Returns:

Type	Description
`list`	`(additional_distributions, record_sets)`. `additional_distributions` contains FileSets only; FileObjects are always owned by the generator.
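
As a rough skeleton of the three-method contract, the sketch below defines a local stand-in for the abstract base (so it runs without croissant-baker) and a toy handler for `.txt` files. All class names are hypothetical. The plain dicts returned from `build_croissant` are placeholders for the mlcroissant RecordSet objects a real handler would build, and the length check is an added safeguard, not documented behavior.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Tuple


class FileTypeHandlerSketch(ABC):
    """Local stand-in mirroring the three documented abstract methods."""

    @abstractmethod
    def can_handle(self, file_path: Path) -> bool: ...

    @abstractmethod
    def extract_metadata(self, file_path: Path, **kwargs: object) -> dict: ...

    @abstractmethod
    def build_croissant(self, file_metas: List[dict], file_ids: List[str]) -> Tuple[list, list]: ...


class TextHandlerSketch(FileTypeHandlerSketch):
    EXTENSIONS = (".txt",)
    FORMAT_NAME = "Plain text"
    FORMAT_DESCRIPTION = "File name and size only (illustrative)"

    def can_handle(self, file_path: Path) -> bool:
        return file_path.suffix.lower() in self.EXTENSIONS

    def extract_metadata(self, file_path: Path, **kwargs: object) -> dict:
        return {
            "path": str(file_path),
            "name": file_path.name,
            "size": file_path.stat().st_size,
            "encoding_format": "text/plain",
        }

    def build_croissant(self, file_metas: List[dict], file_ids: List[str]) -> Tuple[list, list]:
        if len(file_metas) != len(file_ids):
            raise ValueError("file_metas and file_ids must align by position")
        additional_distributions: list = []  # FileSets only; none in this sketch
        record_sets: list = []
        # Position i of file_ids is the FileObject ID for position i of file_metas.
        for meta, file_id in zip(file_metas, file_ids):
            # Placeholder dict standing in for an mlcroissant RecordSet.
            record_sets.append({"name": meta["name"], "source_file_id": file_id})
        return additional_distributions, record_sets
```

In the real package, the handler would subclass `croissant_baker.handlers.base_handler.FileTypeHandler` and an instance would be registered in registry.py, as described in the class overview.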
