Systems and methods are provided for failure resiliency in distributed training of machine learning (ML) models. Examples include a plurality of compute nodes storing shards of a plurality of shards of model states of an ML model, and a first compute node storing a first shard of model states of the ML model. The first compute node can store a plurality of shard portions. Each shard portion can be received from a respective compute node of the plurality of compute nodes and can be a replica of a portion of a respective shard, of the plurality of shards, stored at the respective compute node. Responsive to a failure of a compute node of the plurality of compute nodes, the first compute node can update the first shard with a shard portion corresponding to the failed compute node and the ML model can be trained based on the updated first shard.
Full Text
What is claimed is: