# Analysis Configuration

`AnalyzeConfig` controls how Oumi analyzes datasets. See Dataset Analysis for usage examples.

## Core Settings

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_name` | str | Conditional | None | Dataset name (HuggingFace Hub or registered) |
| `dataset_path` | str | Conditional | None | Path to local dataset file |
| `split` | str | No | `"train"` | Dataset split to analyze |
| `subset` | str | No | None | Dataset subset/config name |
| `sample_count` | int | No | None | Max samples to analyze (None = all) |
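
A minimal sketch combining the core settings above; the dataset name is the one used in the examples on this page, and the sample cap is an arbitrary illustration:

```yaml
dataset_name: "argilla/databricks-dolly-15k-curated-en"
split: "train"
sample_count: 1000  # cap analysis at 1000 samples; omit (None) to analyze all
```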

## Dataset Specification

Provide either a named dataset or a local file path:

```yaml
# Option 1: named dataset (HuggingFace Hub or registered)
dataset_name: "argilla/databricks-dolly-15k-curated-en"
split: train
subset: null  # Optional
```

```yaml
# Option 2: local dataset file
dataset_path: data/dataset_examples/oumi_format.jsonl
is_multimodal: false  # Required
```

**Tip:** You can also pass a pre-loaded dataset directly to `DatasetAnalyzer`:

```python
from oumi.core.analyze.dataset_analyzer import DatasetAnalyzer

analyzer = DatasetAnalyzer(config, dataset=my_dataset)
```

## Output Settings

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `output_path` | str | `"."` | Directory for output files |

Set the output directory in the config file:

```yaml
output_path: "./analysis_results"
```

or pass it on the command line:

```bash
oumi analyze --config config.yaml --output /custom/path
```

## Analyzers

Configure analyzers as a list with an `id` and optional `params`:

```yaml
analyzers:
  - id: length
    params:
      char_count: true
      word_count: true
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | str | Yes | Analyzer identifier (must be registered) |
| `params` | dict | No | Analyzer-specific parameters |

### length Analyzer

Computes text length metrics:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `char_count` | bool | `true` | Character count |
| `word_count` | bool | `true` | Word count |
| `sentence_count` | bool | `true` | Sentence count |
| `token_count` | bool | `false` | Token count (requires tokenizer) |
| `include_special_tokens` | bool | `true` | Include special tokens in count |
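
For instance, token counting can be switched on like this (a sketch using only the parameters documented above; excluding special tokens is an illustrative choice, not a requirement):

```yaml
analyzers:
  - id: length
    params:
      token_count: true              # requires tokenizer_config (next section)
      include_special_tokens: false  # exclude special tokens from the counts
```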

### Tokenizer Configuration

Required when `token_count: true`:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model_name` | str | Yes | HuggingFace model/tokenizer name |
| `tokenizer_kwargs` | dict | No | Additional tokenizer arguments |
| `trust_remote_code` | bool | No | Allow remote code execution |

```yaml
tokenizer_config:
  model_name: openai-community/gpt2
  tokenizer_kwargs:
    use_fast: true
```
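
Putting the pieces together, a token-counting setup might look like the following sketch (dataset and tokenizer names are reused from the examples above; substitute your own):

```yaml
dataset_name: "argilla/databricks-dolly-15k-curated-en"
split: train

analyzers:
  - id: length
    params:
      token_count: true  # token counts require the tokenizer configured below

tokenizer_config:
  model_name: openai-community/gpt2
```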

## Multimodal Settings

For vision-language datasets:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `is_multimodal` | bool | None | Whether dataset is multimodal |
| `processor_name` | str | None | Processor name for VL datasets |
| `processor_kwargs` | dict | `{}` | Processor arguments |
| `trust_remote_code` | bool | `false` | Allow remote code |

```yaml
dataset_path: "/path/to/vl_data.jsonl"
is_multimodal: true
processor_name: "llava-hf/llava-1.5-7b-hf"
```

**Note:** Multimodal datasets require a valid `processor_name`.
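
A sketch that spells out the remaining multimodal fields with their documented defaults (the values shown are placeholders, not settings the processor necessarily needs):

```yaml
dataset_path: "/path/to/vl_data.jsonl"
is_multimodal: true
processor_name: "llava-hf/llava-1.5-7b-hf"
processor_kwargs: {}      # extra processor arguments, if any
trust_remote_code: false  # enable only if the processor requires custom code
```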

## Example Configuration

Run the example from the Oumi repository root:

```bash
oumi analyze --config configs/examples/analyze/analyze.yaml
```

The example config at `configs/examples/analyze/analyze.yaml` demonstrates all available options with detailed comments explaining each setting.
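
For quick reference, an abbreviated configuration assembling the options documented on this page might look like the sketch below; the values are illustrative, and the shipped example config remains the authoritative, fully commented version:

```yaml
dataset_name: "argilla/databricks-dolly-15k-curated-en"
split: "train"
sample_count: 500  # illustrative cap; omit to analyze every sample

output_path: "./analysis_results"

analyzers:
  - id: length
    params:
      char_count: true
      word_count: true
      sentence_count: true
```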

## See Also