Resilient Fully Sharded Data Parallel
Abstract

Systems and methods are provided for failure resiliency in distributed training of machine learning (ML) models. Examples include a plurality of compute nodes storing shards of a plurality of shards of model states of an ML model, and a first compute node storing a first shard of model states of the ML model. The first compute node can store a plurality of shard portions. Each shard portion can be received from a respective compute node of the plurality of compute nodes and can be a replica of a portion of a respective shard, of the plurality of shards, stored at the respective compute node. Responsive to a failure of a compute node of the plurality of compute nodes, the first compute node can update the first shard with a shard portion corresponding to the failed compute node and the ML model can be trained based on the updated first shard.
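The recovery flow described in the abstract can be illustrated with a minimal single-process sketch. It is not the claimed implementation: the abstract does not say how a shard is divided into portions, how replicas are transported, or what "update" means in detail. The sketch assumes that each shard is split into contiguous, near-equal portions (one per peer) and that updating a shard means merging the failed node's replica portion into it. The names `Node`, `split_even`, `replicate`, and `recover` are hypothetical.

```python
from dataclasses import dataclass, field


def split_even(values, parts):
    """Split a list into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


@dataclass
class Node:
    rank: int
    shard: dict  # state index -> value: the model states this node owns
    # peer rank -> replica of a portion of that peer's shard
    replicas: dict = field(default_factory=dict)


def replicate(nodes):
    """Each node sends a different portion of its shard to each peer (assumed scheme)."""
    for owner in nodes:
        peers = [n for n in nodes if n.rank != owner.rank]
        keys = sorted(owner.shard)
        for peer, chunk in zip(peers, split_even(keys, len(peers))):
            peer.replicas[owner.rank] = {k: owner.shard[k] for k in chunk}


def recover(nodes, failed_rank):
    """On failure, every survivor merges the failed node's portion into its own shard."""
    survivors = [n for n in nodes if n.rank != failed_rank]
    for node in survivors:
        node.shard.update(node.replicas.pop(failed_rank, {}))
    return survivors


# Illustrative run: 4 nodes, 12 model-state entries, 3 entries per shard.
states = {i: float(i) for i in range(12)}
nodes = [Node(r, {i: states[i] for i in range(3 * r, 3 * r + 3)}) for r in range(4)]
replicate(nodes)
survivors = recover(nodes, failed_rank=2)
```

Under these assumptions, the surviving nodes together hold every entry of the failed node's shard, so training could continue on the updated shards without reloading a checkpoint. A real system would also need to refresh the replicas after the update, which this sketch omits.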

Full Text

What is claimed is:

Timeline
Filed: 02/20/2026
Published: 06/25/2026
Granted: Not Available
IPC Codes (2)
G06F 11/20: using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
G06N 3/098: Distributed learning, e.g. federated learning