Changes between Version 12 and Version 13 of Public/Energy

Timestamp: Sep 27, 2021, 2:30:40 PM
Author: Alessio Netti
Comment: (none)

Public/Energy

-                      v12
+                      v13
 precise time-stamp ranges and by leveraging access to measured data from different components like CPU, GPU or memory, instead of using only single accumulated values. Hence, it offers a convenient way for analysis of SLURM jobs.
-DCDB is currently deployed on the DEEP-EST prototype and it is continuously collecting sensor data. A total of 62595 sensors are available, each with a time-to-live of 30 days. Please refer to D5.4 for the full list of monitored sensors and further details about the DCDB deployment. The DCDB Gitlab [https://dcdb.it repository] is also a good source of documentation and usage examples. Two types of data are available:
+DCDB is currently deployed on the DEEP-EST prototype and it is continuously collecting sensor data. A total of 62595 sensors are available, each with a time-to-live of 1 year. Please refer to D5.4 for the full list of monitored sensors and further details about the DCDB deployment. The DCDB Gitlab [https://dcdb.it repository] is also a good source of documentation and usage examples. Two types of data are available:
 …
 Finally, the {{{<STATNAME>}}} field identifies the actual type of aggregation performed from the readings of the queried sensor, by combining the data of all compute nodes on which the job was running. It can be one of the following:
-* **.sum**: sum of all 10s measurements in a certain 30s time window.
-* **.avg**: average of all 10s measurements in a certain 30s time window.
-* **.med**: median of all 10s measurements in a certain 30s time window.
-In order to get a cumulative measure of a job's performance (e.g., total energy spent or total amount of instructions computed), {{{.sum}}} should be used. Moreover, for a job to be measured by DCDB, its duration has to be longer than 30s. We currently expect to store job data for the entire lifetime of the system, and hence the time-to-live of 30 days does not apply to it.
+* **.sum**: sum of all measurements of a metric, in a certain 30s time window.
+* **.avg**: average of the measurements of a metric across nodes, in a certain 30s time window.
+* **.med**: median of the measurements of a metric across nodes, in a certain 30s time window.
+In order to get a cumulative measure of a job's performance (e.g., total energy spent or total amount of instructions computed), {{{.sum}}} should be used. The {{{.avg}}} and {{{.med}}} aggregations, on the other hand, indicate single-node performance in each 30s time window. It should be noted that, for a job to be measured by DCDB, its duration has to be longer than 30s. We currently expect to store job data for the entire lifetime of the system, and hence the time-to-live of 1 year does not apply to it.
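The v13 wording of the three aggregations can be illustrated with a small sketch. It is not DCDB code: the input layout (tuples of timestamp in seconds, node name, value) and the function names `aggregate_job` and `job_total` are hypothetical, and the sketch assumes that every reading falling into a 30s window contributes equally to that window's statistics.

```python
from collections import defaultdict
from statistics import mean, median

WINDOW_S = 30  # window length stated in the text


def aggregate_job(readings, window_s=WINDOW_S):
    """Group readings into fixed windows and compute sum, avg and med.

    readings: iterable of (timestamp_s, node, value) tuples.
    This layout is an assumption made for illustration only.
    """
    windows = defaultdict(list)
    for ts, _node, value in readings:
        # Align each reading to the start of its window.
        windows[int(ts // window_s) * window_s].append(value)
    return {
        start: {"sum": sum(vals), "avg": mean(vals), "med": median(vals)}
        for start, vals in sorted(windows.items())
    }


def job_total(aggregates):
    """Cumulative measure for the whole job: add up the per-window sums."""
    return sum(window["sum"] for window in aggregates.values())


# Two nodes, two windows (made-up values).
readings = [
    (0, "node1", 10), (0, "node2", 20), (10, "node1", 30),
    (30, "node1", 5), (30, "node2", 7),
]
aggregates = aggregate_job(readings)
total = job_total(aggregates)
```

Adding up the per-window sums gives the cumulative figure for the job, whereas the avg and med values stay on the scale of a single reading in each window.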
 ==== BeeGFS Performance Metrics
 …
 ==== Long-term Sub-sampled Sensor Data
-The sensor data collected by DCDB (excluding per-job data) has a time-to-live of 30 days. After this interval the data will be automatically erased from the database. However, to keep track of the system’s history, DCDB automatically computes sub-sampled versions of all sensors. These are kept forever in the database, and can be queried as follows:
+The sensor data collected by DCDB (excluding per-job data) has a time-to-live of 1 year. After this interval the data will be automatically erased from the database. However, to keep track of the system’s history, DCDB automatically computes sub-sampled versions of all sensors. These are kept forever in the database, and can be queried as follows:
 {{{