Changes between Initial Version and Version 1 of Public/User_Guide/SCR


Timestamp:
Sep 16, 2016, 1:51:24 PM (8 years ago)
Author:
Andreas Galonska

== SCR (Scalable Checkpoint Restart) ==

'''Configuration'''


SCR can be configured using environment variables or via configuration files.
Environment variables have to be exported in a batch script via

{{{export SCR_VAR=<value>}}}

while in configuration files they only need to be defined via

{{{SCR_VAR=<value>}}}
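
The two forms can be compared side by side. The following is a minimal sketch, not a recommended setup: the temporary directory and the file name `scr.conf` are assumptions made for illustration, and the values are the ones used in the configuration file example on this page. In practice one of the two forms is usually enough.

```shell
#!/bin/sh
# Form 1: environment variables, exported in the batch script.
export SCR_FLUSH=1
export SCR_DEBUG=0

# Form 2: the same settings as plain assignments in a configuration file.
# The temporary directory and file name are hypothetical.
conf_dir=$(mktemp -d)
conf_file="$conf_dir/scr.conf"
cat > "$conf_file" <<'EOF'
SCR_FLUSH=1
SCR_DEBUG=0
EOF

# Point SCR at the file via SCR_CONF_FILE (see the list of variables).
export SCR_CONF_FILE="$conf_file"
```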

'''List of variables'''


{{{
SCR_CONF_FILE=<path to conf file> # let SCR use this configuration file
SCR_PREFIX=<directory> # should point to the global file system (typically /sdv-work/$USER/your_checkpoints); all checkpoints are stored there when flushing
SCR_USER_NAME=$USER # mandatory
SCR_JOB_NAME=$PBS_JOBNAME # mandatory
SCR_JOB_ID=`echo $PBS_JOBID | awk -v FS="." '{print $1}'` # mandatory: extracts the job ID from PBS, used to identify jobs

SCR_COPY_TYPE="FILE"|"LOCAL"|"PARTNER"|"BUDDY"
# "FILE" instructs SCR to use multiple checkpoint descriptors defined in the configuration file
# "LOCAL" instructs SCR to use only _1_ node-local checkpointing
# "PARTNER" instructs SCR to also use a partner node to store the checkpoints (SCR-intrinsic buddy checkpointing)
# "BUDDY" instructs SCR to use the SIONlib buddy checkpointing. Further information on this can be obtained from k.thust@fz-juelich.de or a.galonska@fz-juelich.de

SCR_FLUSH=X # let SCR flush from the node-local directory to the global file system every X checkpoints
SCR_CACHE_BASE=<cache-directory> # directory to store node-local files, only used when setting SCR_COPY_TYPE="SINGLE" (default=/tmp)
SCR_CACHE_SIZE=X # only store X checkpoints inside the cache directory (default=2)
SCR_FETCH=<1/0> # enable (1) or disable (0) fetching from the global file system during restart
SCR_DEBUG=X # use verbosity level X when running with SCR. Higher X => more output!
}}}
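
As a sketch of how the mandatory variables might be set at the top of a PBS batch script: the helper `scr_job_id` and the fallback values for `PBS_JOBID` and `PBS_JOBNAME` are hypothetical and only there so that the snippet also runs outside a PBS job. The helper uses POSIX parameter expansion instead of `awk`; both keep the part of the job ID before the first dot.

```shell
#!/bin/sh
# Strip everything from the first "." onward, e.g. "12345.server" -> "12345".
# Same result as the awk call in the variable list, without an extra process.
scr_job_id() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical fallbacks so the sketch also runs outside a PBS job.
PBS_JOBID=${PBS_JOBID:-12345.server}
PBS_JOBNAME=${PBS_JOBNAME:-scr_test}

export SCR_USER_NAME="${USER:-$(id -un)}"
export SCR_JOB_NAME="$PBS_JOBNAME"
export SCR_JOB_ID="$(scr_job_id "$PBS_JOBID")"
# Typical location on the global file system, as named in the variable list.
export SCR_PREFIX="/sdv-work/$SCR_USER_NAME/your_checkpoints"
```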


'''Configuration File Example'''

{{{
SCR_FLUSH=1 # flush every checkpoint
SCR_DEBUG=0 # no debug output needed
SCR_MODE=PIO # use node-local parallel I/O with e.g. SIONlib or HDF5. The other option is "DEFAULT", to use task-local serial I/O
SCR_FLUSH_ASYNC=1 # use asynchronous transfer (only available when using STORE TYPE=BEEGFS)

STORE=/mnt/beeond COUNT=10 TYPE=BEEGFS
# using /mnt/beeond as a cache enables asynchronous transfers of CP files to the global FS (synchronous transfers work as well)
# COUNT=X: store X CPs inside this store
# TYPE=DEFAULT/BEEGFS defines whether this store is a BEEGFS cache or not

#STORE=/tmp COUNT=10 TYPE=DEFAULT # when using DEFAULT, no BeeGFS flushing is available; only synchronous flushing, which copies the files from node-local to global storage

STORE=/tmp # mandatory: needed to store metadata for SCR

# Specify the different types of checkpoints for a job
# 1 is used to store a CP to the BEEGFS cache every time in this case
# 2 could be used to create a PARTNER checkpoint every 4th time.
# 3 could be used to create a BUDDY checkpoint with SIONlib every 6th time.
# "times" counts the calls to SCR_Need_checkpoint(int *flag) in your code that return flag=1
# every CKPT descriptor needs an appropriately configured STORE (see above)

CKPT=1 STORE=/mnt/beeond INTERVAL=1 TYPE=SINGLE
#CKPT=2 STORE=/tmp INTERVAL=4 TYPE=PARTNER
#CKPT=3 STORE=/tmp INTERVAL=6 TYPE=BUDDY

}}}
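
To illustrate how the INTERVAL values interact, the following sketch prints which CKPT descriptor would apply at each checkpoint if all three descriptors above were enabled. It assumes that the descriptor with the largest INTERVAL that divides the checkpoint count evenly is the one applied; this selection rule is an assumption and is not stated on this page. The function `select_ckpt` is a hypothetical helper and not part of SCR.

```shell
#!/bin/sh
# select_ckpt N: print the CKPT id applied at the N-th checkpoint, assuming
# the descriptor with the largest INTERVAL that divides N evenly wins.
select_ckpt() {
    n=$1
    best_id=""
    best_interval=0
    # "id:interval" pairs taken from the CKPT lines of the example file.
    for pair in 1:1 2:4 3:6; do
        id=${pair%%:*}
        interval=${pair##*:}
        if [ $((n % interval)) -eq 0 ] && [ "$interval" -gt "$best_interval" ]; then
            best_id=$id
            best_interval=$interval
        fi
    done
    printf '%s\n' "$best_id"
}

# Show the descriptor chosen for the first eight checkpoints.
for count in 1 2 3 4 5 6 7 8; do
    echo "checkpoint $count -> CKPT=$(select_ckpt "$count")"
done
```

Under this assumption, most checkpoints go to the BEEGFS cache via CKPT=1, while every 4th and every 6th checkpoint is written with the PARTNER or BUDDY descriptor instead.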