
<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Finetuning a Vision-Language Model (Overview)

In this tutorial, we'll use supervised fine-tuning (SFT) with LoRA to guide a large vision-language model to produce short, concise answers grounded in visual input.

Specifically, we'll use the Oumi framework to streamline the process and achieve high-quality results fast.

We'll cover the following topics:
1. Prerequisites
2. Data Preparation & Sanity Checks
3. Training Config Preparation
4. Launching Training
5. Inference

# 📋 Prerequisites

## Machine requirements

Running this notebook requires a machine with CUDA support and a GPU with at least 24GB of memory. It therefore cannot run on the free Colab tier, whose T4 GPU has only 16GB of memory.

In [None]:
import sys

import torch

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"GPU type: {torch.cuda.get_device_name()}")
    total_memory_gb = float(torch.cuda.mem_get_info()[1]) / float(1024 * 1024 * 1024)
    print(f"GPU memory: {total_memory_gb:.1f}GB")
    if total_memory_gb < 24.0 * 0.99:
        print(
            "Error! The notebook requires at least 24GB of memory. "
            f"Got: {total_memory_gb:.3f}GB",  # f-string so the value is interpolated
            file=sys.stderr,
        )
    elif total_memory_gb < 30.0 * 0.99:
        print(
            "You may have to reduce batch size to 1 for LoRA fine-tuning "
            "to prevent CUDA OOM (out-of-memory) errors.\n"
            f"Your GPU has only {total_memory_gb:.1f}GB of VRAM.",
            file=sys.stderr,
        )
else:
    print(
        "Error! The notebook will NOT run on a machine without CUDA.", file=sys.stderr
    )
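
To see where the 24GB requirement comes from, it helps to estimate the memory taken by the model weights alone. The sketch below is a back-of-the-envelope calculation, assuming roughly 11 billion parameters (taken from the model name) stored in bfloat16 at 2 bytes each. The helper name is illustrative, and the estimate ignores activations, LoRA parameters, and optimizer state, which all add to the total.

```python
def weights_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB (1 GiB = 1024**3 bytes)."""
    return num_params * bytes_per_param / 1024**3


# Assumptions: ~11e9 parameters, 2 bytes per parameter (bfloat16).
bf16_gib = weights_memory_gib(num_params=11e9, bytes_per_param=2)
print(f"bfloat16 weights alone: ~{bf16_gib:.1f} GiB")
```

Under these assumptions the weights already exceed the 16GB of a T4, before any training overhead is counted.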

## Oumi Installation

First, let's install Oumi. You can find more detailed instructions [here](https://oumi.ai/docs/en/latest/get_started/installation.html). Here, we include Oumi's GPU dependencies.

In [None]:
%pip install oumi[gpu]

In [None]:
# Additionally, install the following packages for widget visualization.
%pip install ipywidgets

# And deactivate the parallelism warning from the tokenizers library.
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Deactivate relevant HF warnings

## Configure HuggingFace Access Token

Llama models are gated on HuggingFace Hub. To run this notebook, you must first complete the agreement for [Llama 3.2](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) on HuggingFace, and wait for it to be accepted. Then, specify `HF_TOKEN` below to enable access to the model if it's not already set.

Usually, you can get the token by running this command `cat ~/.cache/huggingface/token` on your local machine.

In [5]:
if not os.environ.get("HF_TOKEN"):
    # NOTE: Set your Hugging Face token here if not already set.
    os.environ["HF_TOKEN"] = "<MY_HF_TOKEN>"

hf_token = os.environ.get("HF_TOKEN")
print(f"Using HF Token: '{hf_token}'")
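
Printing a token in full can leak it if the notebook is later shared. As an alternative, here is a minimal sketch that reads the cached token file mentioned above and displays only a masked form. The helper names `read_cached_hf_token` and `mask_token` are illustrative (not part of Oumi or `huggingface_hub`), and the cache path is the default location assumed by the `cat` command above.

```python
from pathlib import Path
from typing import Optional


def read_cached_hf_token(token_file: Path) -> Optional[str]:
    """Return the token stored in `token_file`, or None if it is absent or empty."""
    if not token_file.is_file():
        return None
    token = token_file.read_text().strip()
    return token or None


def mask_token(token: str, visible: int = 4) -> str:
    """Keep the first `visible` characters and replace the rest with '*'."""
    if len(token) <= visible:
        return "*" * len(token)
    return token[:visible] + "*" * (len(token) - visible)


# Assumption: the token lives in the default Hugging Face cache location.
cached = read_cached_hf_token(Path.home() / ".cache" / "huggingface" / "token")
print(mask_token(cached) if cached else "No cached token found.")
```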

## Creating our working directory

For our experiments, we'll use the following folder to save the model, training artifacts, and our inference and training configs.

In [2]:
from pathlib import Path

tutorial_dir = Path("vision_language_tutorial").resolve()
tutorial_dir.mkdir(parents=True, exist_ok=True)
tutorial_dir = str(tutorial_dir)  # Convert back to `str` for simplicity.
print(f"Using the directory: '{tutorial_dir}'")

Using the directory: '/home/user/oumi/notebooks/vision_language_tutorial'


In what follows we use Meta's [Llama-3.2-11B-Vision-Instruct](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) model.

Llama-3.2-11B-Vision-Instruct is a high-performing, instruction-tuned multimodal model that uses a moderate amount of resources (11B parameters).

We will finetune this model on the [vqav2-small](https://huggingface.co/datasets/merve/vqav2-small) dataset, which will help the model respond in __a succinct manner__ to visually grounded questions.

The principles presented here are generic and "Oumi-flexible". 

To repeat this experiment with other models/data you can simply replace e.g., the `model_name` (a string) with the names of other supported models (see [here](https://oumi.ai/docs/en/latest/resources/models/supported_models.html)) and adapt the configurations.

## First, let's initialize our dataset and build a tokenizer and an underlying data processor.


In [3]:
from oumi.builders import build_tokenizer
from oumi.core.configs import ModelParams
from oumi.datasets.vision_language.vqav2_small import Vqav2SmallDataset

model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"

tokenizer = build_tokenizer(ModelParams(model_name=model_name))

dataset = Vqav2SmallDataset(
    tokenizer=tokenizer,
    processor_name=model_name,
    limit=1000,  # Limit the number of examples for demonstration purposes (!)
)

print("\nExamples included:", len(dataset))

[2025-01-28 16:02:56,348][oumi][rank0][pid:865958][MainThread][INFO]][models.py:437] Using the chat template 'llama3-instruct', which is the default for model 'meta-llama/Llama-3.2-11B-Vision-Instruct'.
[2025-01-28 16:02:56,350][oumi][rank0][pid:865958][MainThread][INFO]][base_map_dataset.py:68] Creating map dataset (type: Vqav2SmallDataset) dataset_name: 'None', dataset_path: 'None'...
[2025-01-28 16:02:59,913][oumi][rank0][pid:865958][MainThread][INFO]][base_map_dataset.py:472] Dataset Info:
	Split: validation
	Version: 0.0.0
	Dataset size: 3391008667
	Download size: 3376516283
	Size: 6767524950 bytes
	Rows: 21435
	Columns: ['multiple_choice_answer', 'question', 'image']
[2025-01-28 16:03:01,259][oumi][rank0][pid:865958][MainThread][INFO]][base_map_dataset.py:411] Loaded DataFrame with shape: (21435, 3). Columns:
multiple_choice_answer    object
question                  object
image                     object
dtype: object

Examples included: 1000


### Now let's see a few examples to get a feel for the dataset we are going to use.

In [None]:
import io

from PIL import Image

from oumi.core.types.conversation import Type

num_examples_to_display = 3

for i in range(num_examples_to_display):
    conversation = dataset.conversation(i)  # Retrieve the i-th example (conversation)

    print(f"Example {i}:")

    for message in conversation.messages:
        if message.role == "user":  # The `user` poses a question, regarding an image
            img_content = message.image_content_items[-1]
            assert img_content.binary is not None
            image = Image.open(io.BytesIO(img_content.binary))
            image.save(f"{tutorial_dir}/example_{i}.png")  # Save the image locally
            display(image)

        print(f"{message.role}: {message.content}")
    print("\n")

As we can see above, the ground-truth answers are **very short and succinct**, which is an advantage in scenarios where we want the model to generate concise answers.

In [5]:
# Furthermore, if you want to see directly the underlying stored data, stored in a
# pandas DataFrame, you can do so by running the following command:
dataset.data.head()

Unnamed: 0,multiple_choice_answer,question,image
0,carnival ride,Where are the kids riding?,{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...
1,yes,Is this boy a good pitcher?,{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...
2,wetsuit,What is the person wearing?,{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...
3,4,How many sinks are in this bathroom?,{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...
4,soccer,What sport are the girls playing?,{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...
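
The `image` column above holds raw encoded bytes rather than decoded pixels, which is why the earlier cell wraps them in `io.BytesIO` before calling `Image.open`. As a small illustration, the sketch below identifies the encoding from the leading "magic" bytes. The helper name is hypothetical and it covers only the JPEG and PNG signatures.

```python
def sniff_image_format(data: bytes) -> str:
    """Guess the image encoding from its leading signature bytes."""
    if data.startswith(b"\xff\xd8\xff"):  # JPEG start-of-image marker
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):  # 8-byte PNG signature
        return "png"
    return "unknown"


# The first bytes shown in the DataFrame preview above.
header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
print(sniff_image_format(header))  # prints "jpeg"
```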


## Initial Model Responses

Let's now see how this model performs on a given prompt without any finetuning.
- For this, we will create and execute an inference configuration stored in a YAML file.

In [None]:
%%writefile "{tutorial_dir}/infer.yaml"

model:
  model_name: "meta-llama/Llama-3.2-11B-Vision-Instruct"
  torch_dtype_str: "bfloat16" # Good choice if you have access to Ampere or newer GPU
  chat_template: "llama3-instruct"
  model_max_length: 1024
  trust_remote_code: False # For other models this might need to be set to True
  
generation:
  max_new_tokens: 128
  batch_size: 1
  
engine: NATIVE
# Let's use the `native` engine (i.e., the underlying machine's default)
# for inference.
# You can also consider VLLM for much faster inference if you are working with a GPU.

In [7]:
from oumi.core.configs import InferenceConfig
from oumi.core.types.conversation import Conversation, Message, Role
from oumi.inference import NativeTextInferenceEngine

# Note: the *first* inference call will take a few minutes, because the model must
# be downloaded and cached (assuming you do not already have it locally).
inference_config = InferenceConfig.from_yaml(str(Path(tutorial_dir) / "infer.yaml"))
inference_engine = NativeTextInferenceEngine(inference_config.model)

example = dataset.conversation(1)
example = Conversation(messages=example.filter_messages(role=Role.USER))
inference_engine.infer([example], inference_config)

[2025-01-28 16:03:04,292][oumi][rank0][pid:865958][MainThread][INFO]][models.py:185] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-01-28 16:03:04,292][oumi][rank0][pid:865958][MainThread][INFO]][models.py:255] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForVision2Seq'> to instantiate model.


INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

[2025-01-28 16:03:09,567][oumi][rank0][pid:865958][MainThread][INFO]][models.py:428] Using the chat template 'llama3-instruct' specified in model config!
[2025-01-28 16:03:11,495][oumi][rank0][pid:865958][MainThread][INFO]][native_text_inference_engine.py:111] Setting EOS token id to `128009`


[USER: <IMAGE_BINARY> | Is this boy a good pitcher?
 ASSISTANT: The boy in the image is wearing a baseball uniform and appears to be pitching, but it's difficult to determine if he's a good pitcher based on this image alone.]
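
Note that the cell above keeps only the user turn before calling `infer`, presumably so that the ground-truth answer stored in the dataset is not handed to the model. The idea can be sketched with plain dataclasses. `Turn` and `keep_role` are hypothetical stand-ins for illustration, not Oumi's `Message` class or `filter_messages` method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """Hypothetical stand-in for one message in a conversation."""

    role: str
    text: str


def keep_role(turns: List[Turn], role: str) -> List[Turn]:
    """Return only the turns spoken by `role`, preserving their order."""
    return [turn for turn in turns if turn.role == role]


full_example = [
    Turn("user", "Is this boy a good pitcher?"),
    Turn("assistant", "yes"),  # ground-truth answer from the dataset
]
prompt_only = keep_role(full_example, "user")
print([turn.text for turn in prompt_only])
```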

In [8]:
# Clean up to free-up GPU memory used for inference above
import gc

import torch


def cleanup_memory():
    """Delete the inference_engine and collect garbage."""
    global inference_engine
    if inference_engine:
        del inference_engine
        inference_engine = None
    for _ in range(3):
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()


cleanup_memory()

In [18]:
# Note: You can run the same inference directly with our CLI (terminal) instead of
# the Python API, e.g., by executing this cell:

conversation_id = 1
query = dataset.conversation(conversation_id).messages[0].text_content_items[0].content
print(f"\n{query}")
image_file = f"{tutorial_dir}/example_{conversation_id}.png"

!echo "{query}" | oumi infer -c "{tutorial_dir}/infer.yaml" -i --image="{image_file}"


Is this boy a good pitcher?

@@@@@@@@@@@@@@@@@@@
@                 @
@   @@@@@  @  @   @
@   @   @  @  @   @
@   @@@@@  @@@@   @
@                 @
@   @@@@@@@   @   @
@   @  @  @   @   @
@   @  @  @   @   @
@                 @
@@@@@@@@@@@@@@@@@@@

[2025-01-28 16:39:46,835][oumi][rank0][pid:876355][MainThread][INFO]][models.py:185] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-01-28 16:39:47,027][oumi][rank0][pid:876355][MainThread][INFO]][models.py:255] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForVision2Seq'> to instantiate model.
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████████████| 5/5 [00:06<00:00,  1.34s/it]
[2025-01-28 16:39

OK! As you can see, by default this model gives quite __verbose__ responses. Can we change this behavior?

## Preparing our training experiment
 - Specifically, let's create and execute a YAML file with our _training_ config!
 - You can find many more details about the listed hyper-parameters in our [docs](https://oumi.ai/docs/en/latest/user_guides/train/training_methods.html).

In [10]:
%%writefile "{tutorial_dir}/train.yaml"

model:
  model_name: "meta-llama/Llama-3.2-11B-Vision-Instruct"
  torch_dtype_str: "bfloat16"
  model_max_length: 1024  
  attn_implementation: "sdpa"
  chat_template: "llama3-instruct"
  freeze_layers:
    - "vision_model"     # Let's finetune only the language component of the model

data:
  train:
    collator_name: "vision_language_with_padding" # Simple padding collator
    use_torchdata: True

    datasets:
      - dataset_name: "merve/vqav2-small"
        split: "validation" # This dataset has only a validation split
        shuffle: True
        seed: 42
        transform_num_workers: "auto"
        dataset_kwargs:
          # The default for our model:
          processor_name: "meta-llama/Llama-3.2-11B-Vision-Instruct"           
          limit: 1000 # Again, we downsample to 1000 examples for demonstration 
                      # purposes only.
          return_tensors: True      

training:
  output_dir: "vision_language_tutorial"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: True
  # You can decrease the following two params if you run out of memory
  per_device_train_batch_size: 2 # Use batch size 1 if you only have 24GB of GPU VRAM.
  gradient_accumulation_steps: 8 # Thus effective batch size is 2x8=16 on a single GPU
  use_peft: True
  
  # **NOTE**
  # We set `max_steps` to 40 steps to first verify that training works.
  # Swap to `num_train_epochs: 1` to get more meaningful results.
  # (One training epoch will take ~25 mins on a single A100-40GB GPU)
  max_steps: 40
  # num_train_epochs: 1

  gradient_checkpointing_kwargs:
    # Reentrant docs: https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint
    use_reentrant: False
  ddp_find_unused_parameters: False
  empty_device_cache_steps: 1

  optimizer: "adamw_torch_fused"
  learning_rate: 2e-5
  warmup_ratio: 0.03
  weight_decay: 0.0
  lr_scheduler_type: "cosine"

  logging_steps: 5
  save_steps: 0
  dataloader_main_process_only: False
  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 16  
  enable_wandb: True # Set to False if you don't want to use Weights & Biases
  
peft: # Our LoRA configuration; we target several layers  
  lora_r: 8
  lora_alpha: 8
  lora_dropout: 0.1
  lora_target_modules:
    - "q_proj"
    - "v_proj"
    - "o_proj"
    - "k_proj"
    - "gate_proj"
    - "up_proj"
    - "down_proj"
  lora_init_weights: GAUSSIAN

# The lines below take effect only if you have access to multiple GPUs.
# If you do, please uncomment them to train with all available GPUs:

# fsdp:
#   enable_fsdp: True
#   sharding_strategy: "HYBRID_SHARD"
#   forward_prefetch: True
#   auto_wrap_policy: "TRANSFORMER_BASED_WRAP"
#   transformer_layer_cls: "MllamaSelfAttentionDecoderLayer,MllamaCrossAttentionDecoderLayer,MllamaVisionEncoderLayer"

Overwriting /home/user/oumi/notebooks/vision_language_tutorial/train.yaml
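
The numbers implied by the config above can be worked out by hand. The sketch below assumes a single GPU and the 1000-example limit set in `dataset_kwargs`. The helper names are illustrative, the exact number of steps per epoch depends on how the trainer handles a final partial batch, and the LoRA scale uses the standard `lora_alpha / lora_r` formulation.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus


def steps_per_epoch(num_examples: int, batch: int) -> int:
    """Optimizer steps per pass over the data, counting a final partial batch."""
    return math.ceil(num_examples / batch)


batch = effective_batch_size(per_device=2, grad_accum=8)  # 2 x 8 x 1 GPU
epoch_steps = steps_per_epoch(num_examples=1000, batch=batch)
examples_seen = 40 * batch  # examples consumed with max_steps: 40
lora_scale = 8 / 8  # lora_alpha / lora_r

print(f"Effective batch size: {batch}")
print(f"Approx. steps per epoch: {epoch_steps}")
print(f"Examples seen in 40 steps: {examples_seen} of 1000")
print(f"LoRA scaling factor: {lora_scale}")
```

With these settings, 40 steps cover roughly two thirds of the 1000 examples, which is why switching to `num_train_epochs: 1` is suggested for more meaningful results.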


In [11]:
# Re-claim memory!
cleanup_memory()

In [None]:
## Let's launch the training!

!oumi train -c "{tutorial_dir}/train.yaml"

# Or, if you have multiple GPUs you want to use:
# !oumi distributed torchrun -m oumi train -c "{tutorial_dir}/train.yaml"

## Finally, let's use the Fine-tuned Model and see the effect of training!

Once we're happy with the results, we can serve the fine-tuned model for inference:

In [13]:
%%writefile "{tutorial_dir}/trained_infer.yaml"

model:
  model_name: "meta-llama/Llama-3.2-11B-Vision-Instruct"  
  adapter_model: "vision_language_tutorial" # Directory with our saved LoRA parameters!
  torch_dtype_str: "bfloat16"
  chat_template: "llama3-instruct"
  model_max_length: 1024
  trust_remote_code: False

generation:
  max_new_tokens: 256
  batch_size: 1
  
engine: NATIVE

Overwriting /home/user/oumi/notebooks/vision_language_tutorial/trained_infer.yaml
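
Before loading the adapter, it can be worth confirming that training actually wrote one. The following is a minimal sketch, assuming the adapter was saved under PEFT's usual file names (`adapter_config.json` plus `adapter_model.safetensors` or `adapter_model.bin`). Whether Oumi writes exactly these files into `output_dir` is an assumption here, and `find_adapter_files` is a hypothetical helper.

```python
from pathlib import Path

ADAPTER_CONFIG = "adapter_config.json"
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")


def find_adapter_files(output_dir: str) -> dict:
    """Report which of the expected adapter files exist in `output_dir`."""
    root = Path(output_dir)
    weights = [name for name in ADAPTER_WEIGHTS if (root / name).is_file()]
    return {"config": (root / ADAPTER_CONFIG).is_file(), "weights": weights}


# Same relative directory as `adapter_model` in the config above.
status = find_adapter_files("vision_language_tutorial")
if status["config"] and status["weights"]:
    print("LoRA adapter found:", status["weights"])
else:
    print("No complete adapter found; has training finished?", status)
```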


In [14]:
cleanup_memory()

In [15]:
config = InferenceConfig.from_yaml(str(Path(tutorial_dir) / "trained_infer.yaml"))
inference_engine = NativeTextInferenceEngine(config.model)

example = dataset.conversation(1)
example = Conversation(messages=example.filter_messages(role=Role.USER))
inference_engine.infer([example], config)  # Use the config of the fine-tuned model

[2025-01-28 16:21:51,146][oumi][rank0][pid:865958][MainThread][INFO]][models.py:185] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-01-28 16:21:51,147][oumi][rank0][pid:865958][MainThread][INFO]][models.py:255] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForVision2Seq'> to instantiate model.


INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

[2025-01-28 16:21:56,531][oumi][rank0][pid:865958][MainThread][INFO]][models.py:236] Loading PEFT adapter from: vision_language_tutorial ...
[2025-01-28 16:21:57,627][oumi][rank0][pid:865958][MainThread][INFO]][models.py:428] Using the chat template 'llama3-instruct' specified in model config!


[USER: <IMAGE_BINARY> | Is this boy a good pitcher?
 ASSISTANT: yes]

In [16]:
# Or, if you want to test it with your own image/question pair:
from oumi.core.types.conversation import ContentItem
from oumi.utils.image_utils import load_image_png_bytes_from_path

your_image_path = f"{tutorial_dir}/example_1.png"  # Replace with your image path!
image_bytes = load_image_png_bytes_from_path(your_image_path)

conversation = Conversation(
    messages=[
        Message(
            role=Role.USER,
            content=[
                ContentItem(type=Type.IMAGE_BINARY, binary=image_bytes),
                # Replace the question below with your own question!
                ContentItem(type=Type.TEXT, content="Is this boy a good pitcher?"),
            ],
        )
    ]
)

inference_engine.infer([conversation], config)

[2025-01-28 16:22:01,541][oumi][rank0][pid:865958][MainThread][INFO]][native_text_inference_engine.py:111] Setting EOS token id to `128009`


[USER: <IMAGE_BINARY> | Is this boy a good pitcher?
 ASSISTANT: yes]
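
To put the before/after difference in numbers, the two responses shown in this notebook can be compared against the dataset's ground-truth answer. The sketch below uses a simplified exact-match check with light normalization. It is not the official VQA accuracy metric, and the helper names are illustrative.

```python
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True if the two answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


reference = "yes"  # ground-truth answer for example 1 in the dataset
baseline = (
    "The boy in the image is wearing a baseball uniform and appears to be "
    "pitching, but it's difficult to determine if he's a good pitcher based "
    "on this image alone."
)
finetuned = "yes"

for name, answer in [("baseline", baseline), ("fine-tuned", finetuned)]:
    words = len(answer.split())
    print(f"{name}: {words} words, exact match: {exact_match(answer, reference)}")
```

A single example is only a sanity check; a held-out set would be needed to draw conclusions about answer quality.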