Changes between Initial Version and Version 1 of Public/User_Guide/FTI


Timestamp:
Jun 13, 2019, 2:41:58 PM (5 years ago)
Author:
Kai Keller
= What is FTI?

FTI stands for Fault Tolerance Interface. It is a library that aims to give computational scientists the means to perform fast and efficient multilevel checkpointing on large-scale supercomputers. FTI leverages local storage plus data replication and erasure codes to provide several levels of reliability and performance. FTI performs application-level checkpointing and lets users select which datasets need to be protected, which improves efficiency and avoids wasting space, time and energy. In addition, it offers a direct data interface so that users do not need to deal with file or directory names. All metadata is managed by FTI transparently for the user. If desired, users can dedicate one process per node to overlap the fault tolerance workload with scientific computation, so that post-checkpoint tasks are executed asynchronously.

= Loading the FTI module

module load Intel/2018.2.199-GCC-5.5.0 ParaStationMPI/5.2.1-1 FTI

= Documentation and User Guide

Doxygen: http://leobago.github.io/fti/

Manual: https://github.com/leobago/fti/wiki

= Quick Start Guide

== Before the Execution
Before running the application, the user needs to provide an FTI configuration file (see attachments: config.fti). Its path is passed to '''FTI_Init'''.

== FTI Calls inside the Application
* '''FTI_Init''' has to be called before any other FTI API call.
* '''FTI_Protect''' informs FTI about a buffer that needs to be checkpointed.
* '''FTI_Checkpoint''' writes the checkpoint file, which contains all protected buffers.
* '''FTI_Status''' returns 1 if a previous execution failed and the current run is a recovery.
* '''FTI_Recover''' updates all protected buffers by loading the data from the checkpoint file.
* '''FTI_Finalize''' finalizes FTI.

== Example
{{{#!C
#include <stdlib.h>
#include <mpi.h>
#include <fti.h>

#define ITER_CHECK 10  /* take a checkpoint every 10 iterations */

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    char path[] = "config.fti";  /* config file path */
    FTI_Init(path, MPI_COMM_WORLD);

    /* rank and size within FTI_COMM_WORLD */
    int world_rank, world_size;
    MPI_Comm_rank(FTI_COMM_WORLD, &world_rank);
    MPI_Comm_size(FTI_COMM_WORLD, &world_size);

    int *array = malloc(sizeof(int) * world_size);
    if (array == NULL) {
        MPI_Abort(FTI_COMM_WORLD, 1);
    }
    int number = world_rank;
    int i = 0;

    /* register the variables to protect */
    FTI_Protect(1, &i, 1, FTI_INTG);
    FTI_Protect(2, &number, 1, FTI_INTG);

    /* on a restart, load i and number from the last checkpoint */
    if (FTI_Status() != 0) {
        FTI_Recover();
    }

    for (; i < 100; i++) {
        if (i % ITER_CHECK == 0) {
            /* checkpoint ID i / ITER_CHECK + 1, checkpoint level 2 */
            FTI_Checkpoint(i / ITER_CHECK + 1, 2);
        }
        MPI_Allgather(&number, 1, MPI_INT, array,
                      1, MPI_INT, FTI_COMM_WORLD);
        number += 1;
    }

    free(array);
    FTI_Finalize();
    MPI_Finalize();
    return 0;
}
}}}

= Feature List
* Multilevel checkpointing
    * Node-local
        * Single checkpoint per node (level 1)
        * Buddy checkpoints (level 2)
        * Encoded checkpoints that tolerate failures of half of the nodes/processes (level 3)
    * Checkpoints on the parallel file system (PFS) (level 4)
* Differential checkpointing (available for POSIX/FTI-FF at level 4).
  Differential checkpointing means the differential update of a
  checkpoint (CP) file. Application data that is unchanged since the last
  checkpoint is not rewritten, which saves I/O time.
* Incremental checkpointing. Incremental checkpointing means the incremental completion
  of a CP file. This technique serves primarily to avoid the overhead
  caused by oversaturated network channels. It may also be used to overlap
  computation and checkpoint I/O.
* Single-file checkpoints on the PFS using HDF5 or MPI-IO
* Asynchronous post-processing of higher-level checkpoints