Changes between Version 12 and Version 13 of Public/Energy


Timestamp: Sep 27, 2021, 2:30:40 PM
Author: Alessio Netti

  • Public/Energy

    v12 v13  
    15  15  precise time-stamp ranges and by leveraging access to measured data from different components like CPU, GPU or memory, instead of using only single accumulated values. Hence, it offers a convenient way for analysis of SLURM jobs.
    16  16
    17      DCDB is currently deployed on the DEEP-EST prototype and it is continuously collecting sensor data. A total of 62595 sensors are available, each with a time-to-live of 30 days. Please refer to D5.4 for the full list of monitored sensors and further details about the DCDB deployment. The DCDB Gitlab [https://dcdb.it repository] is also a good source of documentation and usage examples. Two types of data are available:
        17  DCDB is currently deployed on the DEEP-EST prototype and it is continuously collecting sensor data. A total of 62595 sensors are available, each with a time-to-live of 1 year. Please refer to D5.4 for the full list of monitored sensors and further details about the DCDB deployment. The DCDB Gitlab [https://dcdb.it repository] is also a good source of documentation and usage examples. Two types of data are available:
    18  18
    19  19
     
    140 140 Finally, the {{{<STATNAME>}}} field identifies the actual type of aggregation performed from the readings of the queried sensor, by combining the data of all compute nodes on which the job was running. It can be one of the following:
    141 141
    142     * **.sum**: sum of all 10s measurements in a certain 30s time window.
    143     * **.avg**: average of all 10s measurements in a certain 30s time window.
    144     * **.med**: median of all 10s measurements in a certain 30s time window.
    145
    146     In order to get a cumulative measure of a job's performance (e.g., total energy spent or total amount of instructions computed), {{{.sum}}} should be used. Moreover, for a job to be measured by DCDB, its duration has to be longer than 30s. We currently expect to store job data for the entire lifetime of the system, and hence the time-to-live of 30 days does not apply to it.
        142 * **.sum**: sum of all measurements of a metric, in a certain 30s time window.
        143 * **.avg**: average of the measurements of a metric across nodes, in a certain 30s time window.
        144 * **.med**: median of the measurements of a metric across nodes, in a certain 30s time window.
        145
        146 In order to get a cumulative measure of a job's performance (e.g., total energy spent or total amount of instructions computed), {{{.sum}}} should be used. The {{{.avg}}} and {{{.med}}} aggregations, on the other hand, indicate single-node performance in each 30s time window. It should be noted that, for a job to be measured by DCDB, its duration has to be longer than 30s. We currently expect to store job data for the entire lifetime of the system, and hence the time-to-live of 1 year does not apply to it.
    147 147
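The v13 wording of the three aggregations can be illustrated with a small Python sketch. It is not DCDB code: the function names, the node names and the sample values are hypothetical, and it assumes that each node contributes exactly one reading of the metric per 30s window and that the readings are per-window amounts (e.g., energy consumed in that window) rather than running counters.

```python
from statistics import mean, median

def aggregate_window(node_readings):
    """Combine the per-node readings of one 30s window (hypothetical helper).

    node_readings maps a node name to that node's reading in the window.
    The returned keys mirror the <STATNAME> suffixes described above.
    """
    values = list(node_readings.values())
    return {
        ".sum": sum(values),     # job-wide amount in this window
        ".avg": mean(values),    # typical single-node value (mean)
        ".med": median(values),  # typical single-node value (median)
    }

def job_total(windows):
    """Cumulative measure of the job: add up the .sum value of every window."""
    return sum(aggregate_window(w)[".sum"] for w in windows)

# Made-up readings for a job running on three nodes over two 30s windows.
windows = [
    {"node01": 100, "node02": 120, "node03": 140},
    {"node01": 110, "node02": 130, "node03": 90},
]

per_window = [aggregate_window(w) for w in windows]
total = job_total(windows)
```

Under these assumptions, only the `.sum` series can be accumulated over time into a job-wide total, while `.avg` and `.med` stay at the scale of a single node in each window.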
    148 148 ==== BeeGFS Performance Metrics
     
    158 158 ==== Long-term Sub-sampled Sensor Data
    159 159
    160     The sensor data collected by DCDB (excluding per-job data) has a time-to-live of 30 days. After this interval the data will be automatically erased from the database. However, to keep track of the system’s history, DCDB automatically computes sub-sampled versions of all sensors. These are kept forever in the database, and can be queried as follows:
        160 The sensor data collected by DCDB (excluding per-job data) has a time-to-live of 1 year. After this interval the data will be automatically erased from the database. However, to keep track of the system’s history, DCDB automatically computes sub-sampled versions of all sensors. These are kept forever in the database, and can be queried as follows:
    161 161
    162 162 {{{