Overview

An overview of Kubeflow Training Operator

What is Kubeflow Training Operator ?

Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, Tensorflow, XGBoost, and others.

User can integrate other ML libraries such as HuggingFace, DeepSpeed, or Megatron with Training Operator to orchestrate their ML training on Kubernetes.

Training Operator allows you to use Kubernetes workloads to effectively train your large models via Kubernetes Custom Resources APIs or using Training Operator Python SDK.

Training Operator implements centralized Kubernetes controller to orchestrate distributed training jobs.

Users can run High-performance computing (HPC) tasks with Training Operator and MPIJob since it supports running Message Passing Interface (MPI) on Kubernetes which is heavily used for HPC. Training Operator implements V1 API version of MPI Operator. For MPI Operator V2 version, please follow this guide to install MPI Operator V2.

Architecture

This diagram shows the major features of Training Operator and supported ML frameworks.

Training Operator Overview

Custom Resources for ML Frameworks

To perform distributed training Training Operator implements the following Custom Resources for each ML framework:

ML Framework	Custom Resource
PyTorch	PyTorchJob
Tensorflow	TFJob
XGBoost	XGBoostJob
MPI	MPIJob
PaddlePaddle	PaddleJob

Getting Started

You can create your first Training Operator job using Python SDK. Define the training function that implements end-to-end model training. Training Operator schedules appropriate resources to run this training function on every Worker.

Install Training Operator SDK:

pip install kubeflow-training

You can implement your training loop in the train function. Each Worker will execute this function on the appropriate Kubernetes Pod. Usually, this function contains logic to download dataset, create model, and train the model.

World Size and Rank will be set automatically in env variables by Training Operator controller to perform PyTorch DDP.

For example:

def train_func():
    import torch
    import os

    # Create model.
    class Net(torch.nn.Module):
        """Create the Pytorch model"""
        ...
    model = Net()

    # Download dataset.
    train_loader = torch.utils.data.DataLoader(...)

    # Attach model to PyTorch distributor.
    torch.distributed.init_process_group(backend="nccl")
    Distributor = torch.nn.parallel.DistributedDataParallel
    model = Distributor(model)

    # Start model training.
    model.train()

# Start PyTorchJob with 100 Workers and 2 GPUs per Worker.
from kubeflow.training import TrainingClient
TrainingClient().create_job(
    name="pytorch-ddp",
    func=train_func,
    num_workers=100,
    resources_per_worker={"gpu": "2"},
)

Next steps

Learn more about the PyTorchJob APIs.
Follow the scheduling guide to configure various job schedulers for Training Operator jobs.

Feedback

Was this page helpful?

Glad to hear it! Please tell us how we can improve.

Sorry to hear that. Please tell us how we can improve.

Last modified January 3, 2024: Training: Add Training Operator Overview Page (#3651) (1e6c362)