Today, we are excited to announce support for Amazon Elastic Kubernetes Service (EKS) in Amazon SageMaker HyperPod, purpose-built infrastructure with built-in resiliency for foundation model (FM) development. With this new capability, customers can manage HyperPod clusters using EKS, combining the power of Kubernetes with the resilient environment of Amazon SageMaker HyperPod, which is designed for training models at scale. Amazon SageMaker HyperPod efficiently scales across more than 1,000 AI accelerators, reducing training times by up to 40%.
Amazon SageMaker HyperPod now allows customers to manage their clusters using a Kubernetes-based interface. This integration lets you seamlessly switch between Slurm and Amazon EKS to optimize a variety of workloads, including training, fine-tuning, experimentation, and inference. The CloudWatch Observability EKS add-on provides comprehensive monitoring, surfacing CPU, network, disk, and other low-level node metrics in a unified dashboard. This enhanced observability extends to cluster-wide resource utilization, node-level metrics, pod-level performance, and container-level utilization data, facilitating efficient troubleshooting and optimization.
Launched at re:Invent 2023, Amazon SageMaker HyperPod has become an essential solution for AI startups and enterprises looking to efficiently train and deploy large-scale models. It is compatible with SageMaker’s distributed training library and offers Model Parallel and Data Parallel software optimizations to reduce training times by up to 20%. SageMaker HyperPod automatically detects and repairs or replaces failed instances, allowing data scientists to train models without interruption for weeks or months. This allows data scientists to focus on model development instead of managing infrastructure.
The integration of Amazon EKS and Amazon SageMaker HyperPod leverages the benefits of Kubernetes, which is gaining popularity for machine learning (ML) workloads due to its scalability and rich open source tooling. Organizations often standardize on Kubernetes to build applications, including those required for generative AI use cases, because it allows them to reuse functionality across environments while meeting compliance and governance standards. Today’s announcement enables customers to scale and optimize resource utilization across over 1,000 AI accelerators. This flexibility improves the developer experience, manages containerized apps, and dynamically scales FM training and inference workloads.
Amazon EKS support for Amazon SageMaker HyperPod strengthens resiliency with deep health checks, automated node recovery, and automatic job resumption, ensuring uninterrupted training for large-scale and long-running jobs. Task management is simplified with the optional HyperPod CLI, designed for Kubernetes environments, though customers can also use their own CLI tools. Integration with Amazon CloudWatch Container Insights delivers deep insights into cluster performance, health, and utilization. Data scientists can also use tools such as Kubeflow for automated ML workflows. The integration also includes Amazon SageMaker Managed MLflow, a powerful solution for experiment tracking and model management.
At a high level, Amazon SageMaker HyperPod clusters are created by cloud administrators using the HyperPod cluster API and are fully managed by the HyperPod service, eliminating the undifferentiated heavy lifting required to build and optimize ML infrastructure. Amazon EKS is used to orchestrate these HyperPod nodes similar to how Slurm orchestrates HyperPod nodes, providing a familiar Kubernetes-based administrative experience for customers.
Let’s take a look at how to get started with Amazon EKS support on Amazon SageMaker HyperPod.
Follow the Amazon SageMaker HyperPod EKS workshop to prepare for the scenario by creating an Amazon EKS cluster with a single AWS CloudFormation stack, configuring the VPC and storage resources, and checking prerequisites.
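Before moving on, it can help to confirm that the EKS cluster created by the CloudFormation stack is active and that kubectl can reach it. A quick check might look like the following; the cluster name here is an assumption from my setup, so substitute your own.

```shell
# Confirm the EKS cluster is ACTIVE (cluster name is an assumption; use yours)
aws eks describe-cluster --name hyperpod-eks-cluster \
    --query 'cluster.status' --output text

# Point kubectl at the cluster and verify connectivity
aws eks update-kubeconfig --name hyperpod-eks-cluster
kubectl get svc
```

If describe-cluster reports ACTIVE and kubectl returns the default kubernetes service, the cluster is ready to serve as the HyperPod orchestrator.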
To create and manage an Amazon SageMaker HyperPod cluster, you can use the AWS Management Console or the AWS Command Line Interface (AWS CLI). Using the AWS CLI, specify the cluster configuration in a JSON file. Choose the Amazon EKS cluster you created earlier as the orchestrator for your SageMaker HyperPod cluster, then create a cluster with a worker node group named worker-group-1. To enable automatic node recovery, I set NodeRecovery to Automatic, and to enable deep health checks, I add InstanceStress and InstanceConnectivity to OnStartDeepHealthChecks.
cat > eli-cluster-config.json << EOL
{
    "ClusterName": "example-hp-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 32,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        ...
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "ResilienceConfig": {
        "NodeRecovery": "Automatic"
    }
}
EOL
You can provision and mount additional Amazon EBS volumes to your HyperPod nodes by adding InstanceStorageConfigs.
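For example, to attach an additional 500 GiB Amazon EBS volume to each node in the worker group, you could extend the instance group definition along these lines (the volume size is an arbitrary choice for illustration):

```json
"InstanceStorageConfigs": [
    {
        "EbsVolumeConfig": {
            "VolumeSizeInGB": 500
        }
    }
]
```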
To create a cluster using the SageMaker HyperPod API, run the following AWS CLI command.
aws sagemaker create-cluster \
--cli-input-json file://eli-cluster-config.json
The command returns the Amazon Resource Name (ARN) of the new HyperPod cluster.
{
"ClusterArn": "arn:aws:sagemaker:us-east-2:ACCOUNT-ID:cluster/wccy5z4n4m49"
}
Then, check the HyperPod cluster status in the SageMaker console and wait until the status changes to InService.
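If you prefer the CLI, you can also poll the cluster status with describe-cluster until it reports InService, using the cluster name from the configuration file above:

```shell
# Returns the current status, for example Creating or InService
aws sagemaker describe-cluster \
    --cluster-name example-hp-cluster \
    --query 'ClusterStatus' --output text
```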
You can use Amazon CloudWatch Container Insights to monitor cluster performance and health metrics.
Things to know
Here are some key things to know about Amazon EKS support for Amazon SageMaker HyperPod:
Resilient environment – This integration provides a more resilient training environment with deep health checks, automated node recovery, and automatic job resumption. SageMaker HyperPod automatically detects, diagnoses, and recovers from errors, allowing you to continuously train foundation models for weeks or months without interruption. This can reduce training times by up to 40%.
Improved GPU observability – Amazon CloudWatch Container Insights provides detailed metrics and logs for containerized applications and microservices, enabling comprehensive monitoring of cluster performance and health.
Scientist-friendly tools – This release includes integration with the custom HyperPod CLI for job management, Kubeflow Training Operators for distributed training, Kueue for scheduling, and SageMaker Managed MLflow for experiment tracking. It also works with SageMaker’s distributed training libraries, which provide model-parallel and data-parallel optimizations to significantly reduce training times. These libraries, combined with automatic job resumption, enable efficient, uninterrupted training of large models.
Flexible resource utilization – This integration improves the developer experience and scalability for FM workloads. Data scientists can efficiently share compute capacity for training and inference jobs. You can use existing Amazon EKS clusters or create new clusters, connect to HyperPod compute, and bring your own tools for job submission, queuing, and monitoring.
To get started with Amazon SageMaker HyperPod on Amazon EKS, you can explore resources such as the SageMaker HyperPod EKS workshop, the aws-do-hyperpod project, and the awsome-distributed-training project. This release is generally available in AWS Regions where Amazon SageMaker HyperPod is available, except Europe (London). For pricing information, visit the Amazon SageMaker pricing page.
This blog post was a collaborative effort. I would like to thank Manoj Ravi, Adhesh Garg, Tomonori Shimomura, Alex Iankoulski, Anoop Saha, and the entire team for their contributions in gathering and refining the information presented here. Their collective expertise was crucial in making this comprehensive article.
-Ellie.