Resilient Fully Sharded Data Parallel

Abstract

Systems and methods are provided for failure resiliency in distributed training of machine learning (ML) models. Examples include a plurality of compute nodes storing shards of a plurality of shards of model states of an ML model, and a first compute node storing a first shard of model states of the ML model. The first compute node can store a plurality of shard portions. Each shard portion can be received from a respective compute node of the plurality of compute nodes and can be a replica of a portion of a respective shard, of the plurality of shards, stored at the respective compute node. Responsive to a failure of a compute node of the plurality of compute nodes, the first compute node can update the first shard with a shard portion corresponding to the failed compute node and the ML model can be trained based on the updated first shard.
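The recovery flow described in the abstract can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the claimed implementation: the `Node` class, the `split`, `replicate`, and `recover` helpers, the even split of each shard across all peers, and the choice to append the recovered portion to the local shard are all hypothetical choices made for illustration. Model states are stood in for by plain integers.

```python
from dataclasses import dataclass, field


def split(values, parts):
    """Split a list into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(len(values), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


@dataclass
class Node:
    """Hypothetical compute node: its own shard plus replica portions from peers."""
    rank: int
    shard: list
    replicas: dict = field(default_factory=dict)  # peer rank -> replica portion


def replicate(nodes):
    """Each node sends a different portion of its shard to each peer (assumed scheme)."""
    for owner in nodes:
        peers = [n for n in nodes if n.rank != owner.rank]
        for peer, portion in zip(peers, split(owner.shard, len(peers))):
            peer.replicas[owner.rank] = list(portion)


def recover(nodes, failed_rank):
    """Each survivor updates its shard with the portion replicated from the failed node."""
    survivors = [n for n in nodes if n.rank != failed_rank]
    for node in survivors:
        node.shard = node.shard + node.replicas.pop(failed_rank)
    return survivors


# Twelve stand-in model states sharded across four nodes.
state = list(range(12))
nodes = [Node(rank, shard) for rank, shard in enumerate(split(state, 4))]
replicate(nodes)

# Node 2 fails; the three survivors absorb its shard between them.
survivors = recover(nodes, failed_rank=2)
recovered = sorted(v for n in survivors for v in n.shard)
```

In this sketch the survivors jointly hold every model state again after the failure, so training could continue on three nodes. A real system would presumably also refresh the replica portions after recovery, since the surviving shards have changed; that step is omitted here.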

Full Text

What is claimed is:

Timeline

Filed: 02/20/2026

Published: 06/25/2026

Granted: Not Available

IPC Codes (2)

G06F 11/20: using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements

G06N 3/098: Distributed learning, e.g. federated learning