Analysis Configuration
AnalyzeConfig controls how Oumi analyzes datasets. See Dataset Analysis for usage examples.
Core Settings
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_name | | Conditional | | Dataset name (HuggingFace Hub or registered) |
| dataset_path | | Conditional | | Path to local dataset file |
| split | | No | | Dataset split to analyze |
| subset | | No | | Dataset subset/config name |
| | | No | | Max samples to analyze (None = all) |
Dataset Specification
Provide either a named dataset or a local file path.
Named dataset:
dataset_name: "argilla/databricks-dolly-15k-curated-en"
split: train
subset: null # Optional
Local file:
dataset_path: data/dataset_examples/oumi_format.jsonl
is_multimodal: false # Required
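The either/or rule above can be checked before an analysis is started. The following is a minimal sketch, assuming the configuration has already been loaded into a plain dictionary. `resolve_dataset_source` is a hypothetical helper for illustration, not part of Oumi, and treating "both set" as an error is an assumption of this sketch; Oumi's own validation may behave differently.

```python
from typing import Any


def resolve_dataset_source(cfg: dict[str, Any]) -> tuple[str, str]:
    """Return ("name", value) or ("path", value) for a config dictionary.

    Hypothetical helper mirroring the "named dataset or local file" rule.
    """
    name = cfg.get("dataset_name")
    path = cfg.get("dataset_path")
    if name and path:
        # Assumption of this sketch: setting both is treated as ambiguous.
        raise ValueError("Set only one of dataset_name or dataset_path")
    if name:
        return ("name", name)
    if path:
        return ("path", path)
    raise ValueError("Set either dataset_name or dataset_path")


named = {"dataset_name": "argilla/databricks-dolly-15k-curated-en", "split": "train"}
local = {"dataset_path": "data/dataset_examples/oumi_format.jsonl", "is_multimodal": False}

print(resolve_dataset_source(named))
print(resolve_dataset_source(local))
```

Returning a tagged tuple keeps the caller's branching explicit: one code path for Hub or registered datasets, another for local files.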
Tip
You can also pass a pre-loaded dataset directly to DatasetAnalyzer:
from oumi.core.analyze.dataset_analyzer import DatasetAnalyzer
analyzer = DatasetAnalyzer(config, dataset=my_dataset)
Output Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| output_path | | | Directory for output files |
output_path: "./analysis_results"
The --output flag sets the directory from the command line:
oumi analyze --config config.yaml --output /custom/path
Analyzers
Configure analyzers as a list with id and optional params:
analyzers:
  - id: length
    params:
      char_count: true
      word_count: true
| Field | Type | Required | Description |
|---|---|---|---|
| id | | Yes | Analyzer identifier (must be registered) |
| params | | No | Analyzer-specific parameters |
length Analyzer
Computes text length metrics:
| Parameter | Type | Default | Description |
|---|---|---|---|
| char_count | | | Character count |
| word_count | | | Word count |
| | | | Sentence count |
| token_count | | | Token count (requires tokenizer) |
| | | | Include special tokens in count |
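To make these metrics concrete, here is a rough sketch of what character, word, and sentence counts could look like for a single text. The function `length_metrics`, its result keys, and its splitting rules (whitespace for words; `.`, `!`, `?` for sentences) are assumptions for illustration. Oumi's length analyzer may count differently, and token counting is left out because it depends on the configured tokenizer.

```python
import re


def length_metrics(text: str) -> dict[str, int]:
    """Compute simple length metrics for one text sample (illustrative only)."""
    # Sentences: non-empty chunks separated by runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "chars": len(text),          # every character, including spaces
        "words": len(text.split()),  # whitespace-separated tokens
        "sentences": len(sentences),
    }


print(length_metrics("Hello world. How are you?"))
```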
Tokenizer Configuration
Required when token_count: true:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_name | | Yes | HuggingFace model/tokenizer name |
| tokenizer_kwargs | | No | Additional tokenizer arguments |
| | | No | Allow remote code execution |
tokenizer_config:
  model_name: openai-community/gpt2
  tokenizer_kwargs:
    use_fast: true
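Because token counting depends on a tokenizer, a configuration can be checked for this dependency up front. The sketch below assumes the config is a plain dictionary with `analyzers` and `tokenizer_config` as top-level keys, as the YAML snippets on this page suggest. `missing_tokenizer` is a hypothetical helper, not an Oumi function.

```python
from typing import Any


def missing_tokenizer(cfg: dict[str, Any]) -> bool:
    """Return True if an analyzer enables token_count but no tokenizer model is set."""
    wants_tokens = any(
        (analyzer.get("params") or {}).get("token_count", False)
        for analyzer in cfg.get("analyzers", [])
    )
    model_name = (cfg.get("tokenizer_config") or {}).get("model_name")
    return wants_tokens and not model_name


cfg = {
    "analyzers": [{"id": "length", "params": {"token_count": True}}],
    "tokenizer_config": {"model_name": "openai-community/gpt2"},
}
print(missing_tokenizer(cfg))  # tokenizer is configured

cfg_without = {"analyzers": [{"id": "length", "params": {"token_count": True}}]}
print(missing_tokenizer(cfg_without))  # tokenizer is missing
```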
Multimodal Settings
For vision-language datasets:
| Parameter | Type | Default | Description |
|---|---|---|---|
| is_multimodal | | | Whether dataset is multimodal |
| processor_name | | | Processor name for VL datasets |
| | | | Processor arguments |
| | | | Allow remote code |
dataset_path: "/path/to/vl_data.jsonl"
is_multimodal: true
processor_name: "llava-hf/llava-1.5-7b-hf"
Note
Multimodal datasets require a valid processor_name.
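The requirement in the note can be expressed as a small check at construction time. This is a sketch only: `MultimodalSettings` is a hypothetical dataclass, not Oumi's `AnalyzeConfig`, and the defaults used here (`False` and `None`) are assumptions rather than documented values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimodalSettings:
    """Hypothetical container for two of the settings in the table above."""

    is_multimodal: bool = False            # assumed default
    processor_name: Optional[str] = None   # assumed default

    def __post_init__(self) -> None:
        # Mirror the note: a multimodal dataset needs a processor name.
        if self.is_multimodal and not self.processor_name:
            raise ValueError("processor_name is required when is_multimodal is true")


settings = MultimodalSettings(
    is_multimodal=True,
    processor_name="llava-hf/llava-1.5-7b-hf",
)
print(settings)
```

Validating in `__post_init__` means an invalid combination fails as soon as the settings object is created, before any dataset is loaded.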
Example Configuration
Run the example from the Oumi repository root:
oumi analyze --config configs/examples/analyze/analyze.yaml
The example config at configs/examples/analyze/analyze.yaml demonstrates all available options with detailed comments explaining each setting.
See Also
Dataset Analysis - Main analysis guide
AnalyzeConfig - API reference
SampleAnalyzerParams - Analyzer params