oumi.core.analyze#
Sample analyzer plugin system for Oumi.
This package provides a plugin-based architecture for analyzing conversation data with different types of sample analyzers (length, safety, etc.).
- class oumi.core.analyze.DatasetAnalysisResult(dataset_name: str, total_conversations: int, conversations_analyzed: int, total_messages: int, messages: list[MessageAnalysisResult])[source]#
Bases:
object
Complete result of dataset analysis.
- Variables:
dataset_name (str) – Name of the analyzed dataset
total_conversations (int) – Total number of conversations in the dataset
conversations_analyzed (int) – Number of conversations actually analyzed
total_messages (int) – Total number of messages analyzed
messages (list[oumi.core.analyze.dataset_analyzer.MessageAnalysisResult]) – List of analysis results for each individual message
- conversations_analyzed: int#
- dataset_name: str#
- messages: list[MessageAnalysisResult]#
- to_dataframe() DataFrame [source]#
Convert the analysis results to a pandas DataFrame.
- Returns:
DataFrame with flattened analyzer metrics for easy querying. Each row represents one message with all its analysis metrics.
- to_dict() dict[str, Any] [source]#
Convert the analysis result to a dictionary.
- Returns:
Dictionary representation of the analysis result
- total_conversations: int#
- total_messages: int#
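The flattened layout produced by to_dataframe() can be sketched with plain pandas. The rows and the metric column name below are hypothetical (actual columns depend on which analyzers are configured; "length_word_count" assumes a LengthAnalyzer registered under the "length" analyzer ID):

```python
import pandas as pd

# Hypothetical flattened rows, one per analyzed message.
rows = [
    {"conversation_id": "conv_0", "message_index": 0, "role": "user",
     "length_word_count": 4},
    {"conversation_id": "conv_0", "message_index": 1, "role": "assistant",
     "length_word_count": 42},
]
df = pd.DataFrame(rows)

# A DataFrame in this shape supports the same pandas query expressions
# used by DatasetAnalyzer.query() and DatasetAnalyzer.filter().
short = df.query("length_word_count < 10")
```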
- class oumi.core.analyze.DatasetAnalyzer(config: AnalyzeConfig)[source]#
Bases:
object
Orchestrates dataset analysis by creating and managing sample analyzers.
- property analysis_results: DatasetAnalysisResult | None#
Get the analysis results if available.
- Returns:
DatasetAnalysisResult if analysis has been run, None otherwise
- analyze_dataset() None [source]#
Analyze the dataset and store results internally.
This method performs sample-level analysis using the configured sample analyzers. Each sample analyzer processes individual messages and returns metrics for each message. Results are stored internally and can be accessed via the query() method.
- Raises:
ValueError – If no analyzers are configured for analysis.
- filter(query_expression: str) BaseMapDataset [source]#
Filter the original dataset based on analysis results.
This method uses analysis results to filter the original dataset, returning a new dataset object containing only the conversations that match the query.
- Parameters:
query_expression – Pandas query expression to filter analysis results
- Returns:
A new dataset object containing only the filtered conversations
Examples:
# Filter for conversations with short messages
short_dataset = analyzer.filter("length_word_count < 10")

# Filter for conversations with assistant messages
assistant_dataset = analyzer.filter("role == 'assistant'")

# Filter for conversations with long user messages
long_user_dataset = analyzer.filter(
    "role == 'user' and length_word_count > 100")
- query(query_expression: str) DataFrame [source]#
Query analysis results using pandas query expression.
- Parameters:
query_expression – Pandas query expression to filter analysis results
Please see the pandas DataFrame query documentation for more information:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
- Returns:
DataFrame with filtered analysis results
Examples:
# Filter for short messages
short_messages = analyzer.query("length_word_count < 10")

# Filter for assistant messages
assistant_messages = analyzer.query("role == 'assistant'")

# Filter for long user messages
long_user = analyzer.query("role == 'user' and length_word_count > 100")
- class oumi.core.analyze.LengthAnalyzer(*, char_count: bool = True, word_count: bool = True, sentence_count: bool = True, token_count: bool = False, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, include_special_tokens: bool = True)[source]#
Bases:
SampleAnalyzer
Analyzer that computes various length metrics for text content.
- analyze_message(text_content: str, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None) dict[str, Any] [source]#
Analyze text content and return length metrics.
- Parameters:
text_content – The text content to analyze
tokenizer – Optional tokenizer to use for token counting
- Returns:
Dictionary containing requested length metrics
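For intuition, the character, word, and sentence metrics can be approximated in plain Python. This is an illustrative sketch only, not the actual LengthAnalyzer implementation, which may differ in details such as sentence splitting and token counting:

```python
import re


def length_metrics(text: str) -> dict:
    """Illustrative length metrics; the real LengthAnalyzer may differ."""
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        # Naive sentence split on terminal punctuation.
        "sentence_count": len(
            [s for s in re.split(r"[.!?]+", text) if s.strip()]
        ),
    }


metrics = length_metrics("Hi there. How are you?")
```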
- class oumi.core.analyze.MessageAnalysisResult(conversation_id: str, conversation_index: int, message_index: int, role: str, message_id: str, text_content: str, analyzer_metrics: dict[str, Any])[source]#
Bases:
object
Result of analyzing a single message in a conversation.
- Variables:
conversation_id (str) – Unique identifier for the conversation
conversation_index (int) – Index of the conversation in the dataset
message_index (int) – Index of the message within the conversation
role (str) – Role of the message sender (e.g., ‘user’, ‘assistant’)
message_id (str) – Unique identifier for the message
text_content (str) – The text content of the message
analyzer_metrics (dict[str, Any]) – Dictionary of metrics computed by sample analyzers, with keys prefixed by analyzer ID to avoid conflicts
- ANALYZER_METRICS_FIELD = 'analyzer_metrics'#
- analyzer_metrics: dict[str, Any]#
- conversation_id: str#
- conversation_index: int#
- message_id: str#
- message_index: int#
- role: str#
- text_content: str#
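The analyzer-ID prefixing convention for analyzer_metrics can be illustrated with a stand-in dataclass. The field names mirror MessageAnalysisResult, but the class and all values below are hypothetical:

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class MessageResult:  # stand-in with the same fields as MessageAnalysisResult
    conversation_id: str
    conversation_index: int
    message_index: int
    role: str
    message_id: str
    text_content: str
    # Keys are prefixed with the analyzer ID (here "length") to avoid
    # collisions between analyzers that emit the same metric name.
    analyzer_metrics: dict[str, Any] = field(default_factory=dict)


result = MessageResult(
    conversation_id="conv_0",
    conversation_index=0,
    message_index=1,
    role="assistant",
    message_id="conv_0_msg_1",
    text_content="Hello!",
    analyzer_metrics={"length_char_count": 6, "length_word_count": 1},
)
as_dict = asdict(result)
```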
- class oumi.core.analyze.SampleAnalyzer[source]#
Bases:
ABC
Base class for sample analyzer plugins that analyze individual samples.
- abstractmethod analyze_message(text_content: str, tokenizer: Any | None = None) dict[str, Any] [source]#
Analyze a single message and return metrics.
- Parameters:
text_content – The text content to analyze
tokenizer – Optional tokenizer to use for tokenization-based analysis
- Returns:
Dictionary containing analysis metrics
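A custom analyzer plugin only needs to implement analyze_message(). The sketch below mirrors that interface with a hypothetical WordStatsAnalyzer; in real use it would subclass oumi.core.analyze.SampleAnalyzer and be wired in through an AnalyzeConfig:

```python
from typing import Any, Optional


class WordStatsAnalyzer:
    """Hypothetical analyzer mirroring the SampleAnalyzer interface."""

    def analyze_message(
        self, text_content: str, tokenizer: Optional[Any] = None
    ) -> dict[str, Any]:
        # Return one metric dict per message, as the interface requires.
        words = text_content.split()
        return {
            "word_count": len(words),
            "avg_word_length": (
                sum(len(w) for w in words) / len(words) if words else 0.0
            ),
        }


metrics = WordStatsAnalyzer().analyze_message("plugins are easy to add")
```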