
== SCR (Scalable Checkpoint/Restart) ==

'''Configuration'''

SCR can be configured through environment variables or through a configuration file.
In a batch script, environment variables have to be exported:

{{{export SCR_VAR=<value>}}}

In a configuration file, a plain assignment is sufficient:

{{{SCR_VAR=<value>}}}
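
The difference between the two forms matters in a batch script because only exported variables are inherited by child processes such as the application ranks. The following is a minimal sketch in plain POSIX shell that illustrates this; the value 1 for SCR_DEBUG is just a sample.

{{{
#!/bin/sh
# Start from a clean state so the result does not depend on the caller's environment.
unset SCR_DEBUG

# Plain assignment: the variable exists only in this shell.
SCR_DEBUG=1
before=$(sh -c 'echo "${SCR_DEBUG:-unset}"')

# After export, child processes inherit the variable.
export SCR_DEBUG
after=$(sh -c 'echo "${SCR_DEBUG:-unset}"')

echo "child without export: $before"
echo "child with export:    $after"
}}}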

'''List of variables'''

{{{
SCR_CONF_FILE=<path to conf file> # let SCR use this configuration file
SCR_PREFIX=<directory> # should point to the global file system (typically /sdv-work/$USER/your_checkpoints); checkpoints are written there when they are flushed
SCR_USER_NAME=$USER # mandatory
SCR_JOB_NAME=$PBS_JOBNAME # mandatory
SCR_JOB_ID=`echo $PBS_JOBID | awk -v FS="." '{print $1}'` # mandatory: extracts the job ID from PBS so that jobs can be identified

SCR_COPY_TYPE="FILE"|"LOCAL"|"PARTNER"|"BUDDY"
# "FILE" instructs SCR to use the (possibly multiple) checkpoint descriptors defined in the configuration file
# "LOCAL" instructs SCR to use only node-local checkpointing
# "PARTNER" instructs SCR to also store the checkpoints on a partner node (SCR's built-in buddy checkpointing)
# "BUDDY" instructs SCR to use the SIONlib buddy checkpointing. For further information, contact k.thust@fz-juelich.de or a.galonska@fz-juelich.de

SCR_FLUSH=X # flush from the node-local directory to the global file system every X checkpoints
SCR_CACHE_BASE=<cache-directory> # directory for node-local files, only used when setting SCR_COPY_TYPE="SINGLE" (default=/tmp)
SCR_CACHE_SIZE=X # keep only X checkpoints in the cache directory (default=2)
SCR_FETCH=<1/0> # enable (1) or disable (0) fetching from the global file system during restart
SCR_DEBUG=X # verbosity level X when running with SCR; a higher X produces more output
}}}
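
The variables listed above could be set in a PBS batch script roughly as follows. This is a sketch, not a tested job script: the fallback values for PBS_JOBID and PBS_JOBNAME are hypothetical samples so the snippet also runs outside a job, and the values chosen for SCR_COPY_TYPE, SCR_FLUSH and SCR_CACHE_SIZE are examples only. The shell parameter expansion is used here as an alternative to the awk call and gives the same result.

{{{
#!/bin/sh
# Hypothetical fallbacks; inside a real PBS job these are set by the scheduler.
PBS_JOBID=${PBS_JOBID:-12345.example-server}
PBS_JOBNAME=${PBS_JOBNAME:-scr_test}
USER=${USER:-$(id -un)}

# Mandatory identification of the job.
export SCR_USER_NAME="$USER"
export SCR_JOB_NAME="$PBS_JOBNAME"
# Keep everything before the first dot of the PBS job ID.
export SCR_JOB_ID="${PBS_JOBID%%.*}"

# Global file system target and example checkpoint settings.
export SCR_PREFIX="/sdv-work/$USER/your_checkpoints"
export SCR_COPY_TYPE="PARTNER"
export SCR_FLUSH=10
export SCR_CACHE_SIZE=2

echo "SCR job $SCR_JOB_ID ($SCR_JOB_NAME) flushes to $SCR_PREFIX"
}}}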

'''Configuration File Example'''

{{{
SCR_FLUSH=1 # flush every checkpoint
SCR_DEBUG=0 # no debug output needed
SCR_MODE=PIO # use node-local parallel I/O with e.g. SIONlib or HDF5. The other option is "DEFAULT", which uses task-local serial I/O
SCR_FLUSH_ASYNC=1 # use asynchronous transfer (only available when using STORE TYPE=BEEGFS)

STORE=/mnt/beeond COUNT=10 TYPE=BEEGFS
# using /mnt/beeond as a cache enables asynchronous transfers of CP files to the global FS (synchronous transfers work as well)
# COUNT=X: store X CPs inside this store
# TYPE=DEFAULT/BEEGFS defines whether this store is a BEEGFS cache or not

#STORE=/tmp COUNT=10 TYPE=DEFAULT # with DEFAULT, no BeeGFS flushing is available; only synchronous flushing, which copies the files from node-local to global storage

STORE=/tmp # mandatory: needed to store metadata for SCR

# Specify the different types of checkpoints for a job
# 1 stores a CP in the BEEGFS cache every time in this case
# 2 could be used to create a PARTNER checkpoint every 4th time
# 3 could be used to create a BUDDY checkpoint with SIONlib every 6th time
# a "time" is a call to SCR_Need_checkpoint(int *flag) in your code that returns flag=1
# every CKPT descriptor needs an appropriately configured STORE (see above)

CKPT=1 STORE=/mnt/beeond INTERVAL=1 TYPE=SINGLE
#CKPT=2 STORE=/tmp INTERVAL=4 TYPE=PARTNER
#CKPT=3 STORE=/tmp INTERVAL=6 TYPE=BUDDY

}}}
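
To connect the configuration file with the batch script, the file can be written (or simply kept on disk) and announced to SCR through SCR_CONF_FILE. The sketch below generates a reduced version of the example above with a here-document; the directory name scr_conf_example and the file name scr.conf are placeholders chosen for illustration.

{{{
#!/bin/sh
# Placeholder location; any directory readable by the job would do.
conf_dir="${TMPDIR:-/tmp}/scr_conf_example"
mkdir -p "$conf_dir"
conf_file="$conf_dir/scr.conf"

# Inside the file, plain assignments are used (no export).
cat > "$conf_file" <<'EOF'
SCR_FLUSH=1
SCR_DEBUG=0
STORE=/mnt/beeond COUNT=10 TYPE=BEEGFS
STORE=/tmp
CKPT=1 STORE=/mnt/beeond INTERVAL=1 TYPE=SINGLE
EOF

# The path to the file itself is passed through the environment.
export SCR_CONF_FILE="$conf_file"
echo "SCR_CONF_FILE=$SCR_CONF_FILE"
}}}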