Changes between Version 9 and Version 10 of Public/Energy


Timestamp:
Dec 17, 2020, 8:44:41 AM
Author:
Alessio Netti

  • Public/Energy

    v9 v10  
    8181}}}
    8282
     83==== Energy and Power Sensors
     84
     85DCDB provides several types of energy and power measurements that can be useful for performance analysis. Units and scales differ depending on the data source; the {{{dcdbconfig sensor show}}} command reports them for a given sensor. The following energy sensors are available for compute nodes in the DEEP-EST prototype:
     86
     87* **energy** [Joules]: energy consumption of a compute node as a whole.
     88* **pkg-energy** [!MicroJoules]: energy consumption of a compute node's CPUs.
     89* **dram-energy** [!MicroJoules]: energy consumption of a compute node's memory components.
     90* **gpu0/energy** (ESB, DAM) [!MilliJoules]: energy consumption of Nvidia V100 GPUs.
     91* **gpu0/sysfs-energy** (ESB) [Joules]: alternative sensor for the energy consumption of Nvidia V100 GPUs, not including the energy drawn via PCIe.
     92* **sysfs-energy** (ESB) [Joules]: energy consumption of an ESB compute node, excluding the GPU.
     93
     94The following sensors are available for estimating power consumption:
     95
     96* **power** [!MilliWatts]: power consumption of a compute node as a whole. On DAM nodes this sensor is reported in Watts.
     97* **gpu0/power** (ESB, DAM) [!MilliWatts]: power consumption of Nvidia V100 GPUs.
     98* **gpu0/sysfs-power** (ESB) [!MilliWatts]: alternative sensor for the power consumption of Nvidia V100 GPUs, not including the power drawn via PCIe.
     99
     100On multi-socket systems (i.e., CN and DAM), the pkg and dram energy and power sensors are available both per socket (e.g., under {{{socket0/pkg-energy}}}) and for the entire compute node. The same aggregation scheme applies to sensors associated with CPU performance counters. Aside from those on compute nodes, many other energy and power sensors are available on the DEEP-EST prototype: these are associated with the system's infrastructure and cooling equipment.
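
Since the sensors above use different units, readings usually need to be normalized before they can be compared. The following is a minimal Python sketch of that step, using the units listed above. The helper names ({{{to_joules}}}, {{{average_power_watts}}}) are hypothetical and not part of DCDB, and the units of a specific sensor should still be confirmed with {{{dcdbconfig sensor show}}}.

```python
# Divisors that convert a raw reading to Joules, based on the units
# listed in this section. Verify them with `dcdbconfig sensor show`.
TO_JOULES_DIVISOR = {
    "energy": 1,              # Joules
    "pkg-energy": 1_000_000,  # MicroJoules
    "dram-energy": 1_000_000, # MicroJoules
    "gpu0/energy": 1_000,     # MilliJoules
    "gpu0/sysfs-energy": 1,   # Joules
    "sysfs-energy": 1,        # Joules
}


def to_joules(sensor, raw):
    """Convert a raw energy reading to Joules (hypothetical helper)."""
    # Per-socket sensors such as socket0/pkg-energy share the unit of
    # the node-level sensor, so the socket prefix is dropped.
    name = sensor.split("/", 1)[1] if sensor.startswith("socket") else sensor
    return raw / TO_JOULES_DIVISOR[name]


def average_power_watts(sensor, raw_energy, seconds):
    """Average power, given the energy consumed over an interval."""
    return to_joules(sensor, raw_energy) / seconds
```

For example, {{{to_joules("socket0/pkg-energy", 2_500_000)}}} yields 2.5 Joules, which can then be compared directly with a reading of the node-level {{{energy}}} sensor.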
     101
    83102==== Using dcdbquery for Job Queries
    84103
     
    101120The {{{<SENSORNAME>}}} field, instead, identifies the aggregated metric and can be one of the following:
    102121
    103 * energy
    104 * dram-energy
    105 * pkg-energy
    106 * MemUsed
    107 * instructions
    108 * cpu-cycles
    109 * cache-misses
    110 * cache-misses-l2
    111 * cache-misses-l3
    112 * scalar-double
    113 * scalar-double-128
    114 * scalar-double-256
    115 * scalar-double-512
    116 * gpu-energy (DAM, ESB)
    117 * ib-portXmitData (CN, ESB)
    118 * ib-portRcvData (CN, ESB)
     122* **energy** [Joules]: energy consumption of nodes, as computed from the respective energy sensors.
     123* **dram-energy** [!MicroJoules]: energy consumption of the nodes' memory, as computed from the dram-energy sensors.
     124* **pkg-energy** [!MicroJoules]: energy consumption of the nodes' CPUs, as computed from the pkg-energy sensors.
     125* **!MemUsed** [!KiloBytes]: amount of used RAM on the compute nodes.
     126* **instructions**: number of CPU instructions executed on the compute nodes.
     127* **cpu-cycles**: number of CPU cycles (affected by frequency scaling) on the compute nodes.
     128* **cache-misses**: total number of CPU cache misses on the compute nodes.
     129* **cache-misses-l2**: number of CPU L2 cache misses on the compute nodes.
     130* **cache-misses-l3**: number of CPU L3 cache misses on the compute nodes.
     131* **scalar-double**: number of double-precision floating-point operations on the compute nodes' CPUs.
     132* **scalar-double-128**: number of double-precision floating-point operations using 128-bit vectorization.
     133* **scalar-double-256**: number of double-precision floating-point operations using 256-bit vectorization.
     134* **scalar-double-512**: number of double-precision floating-point operations using 512-bit vectorization.
     135* **gpu-energy** (DAM, ESB) [!MilliJoules]: energy consumption of the compute nodes' GPUs, as computed from the gpu0/energy sensors.
     136* **ib-portXmitData** (CN, ESB) [Bytes]: amount of data transmitted over the !InfiniBand network.
     137* **ib-portRcvData** (CN, ESB) [Bytes]: amount of data received over the !InfiniBand network.
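
Several of the metrics above are most useful in combination. As an illustration only, the Python sketch below derives instructions per cycle and L3 misses per thousand instructions from job-level totals. The function names are hypothetical, and the inputs are assumed to be cumulative values for the same job and time range.

```python
def instructions_per_cycle(instructions, cpu_cycles):
    """Ratio of the instructions total to the cpu-cycles total."""
    if cpu_cycles == 0:
        return None  # no cycles recorded, ratio undefined
    return instructions / cpu_cycles


def l3_misses_per_kilo_instruction(cache_misses_l3, instructions):
    """L3 cache misses per 1000 executed instructions."""
    if instructions == 0:
        return None  # no instructions recorded, ratio undefined
    return 1000 * cache_misses_l3 / instructions
```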
    119138
    120139Finally, the {{{<STATNAME>}}} field identifies the type of aggregation applied to the readings of the queried sensor, combining the data of all compute nodes on which the job was running. It can be one of the following:
    121140
    122 * .sum (sum of all 10s measurements in a certain 30s time window)
    123 * .avg (average of all 10s measurements in a certain 30s time window)
    124 * .med (median of all 10s measurements in a certain 30s time window)
     141* **.sum**: sum of all 10s measurements in a certain 30s time window.
     142* **.avg**: average of all 10s measurements in a certain 30s time window.
     143* **.med**: median of all 10s measurements in a certain 30s time window.
    125144
    126145To get a cumulative measure of a job's performance (e.g., total energy spent or total number of instructions executed), {{{.sum}}} should be used. A job is measured by DCDB only if it runs for longer than 30s. We currently expect to store job data for the entire lifetime of the system, and hence the time-to-live of 30 days does not apply to it.
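
The three aggregations and the cumulative use of {{{.sum}}} can be sketched in Python as follows. This is not DCDB's implementation: the node names and readings are made up, each reading is assumed to be the value accrued during its own 10s interval, and the readings of all nodes are assumed to be pooled within each 30s window.

```python
from statistics import mean, median

# Hypothetical 10s energy readings (Joules) of a job on two nodes.
readings = {
    "node-a": [100, 110, 120, 130, 140, 150],
    "node-b": [90, 95, 100, 105, 110, 115],
}

READINGS_PER_WINDOW = 3  # three 10s readings per 30s window


def aggregate(readings, per_window=READINGS_PER_WINDOW):
    """Pool all nodes' readings per window and compute sum, avg, med."""
    n = min(len(values) for values in readings.values())
    windows = []
    # Only complete windows are aggregated.
    for start in range(0, n - per_window + 1, per_window):
        pooled = [
            x
            for values in readings.values()
            for x in values[start:start + per_window]
        ]
        windows.append(
            {"sum": sum(pooled), "avg": mean(pooled), "med": median(pooled)}
        )
    return windows


windows = aggregate(readings)

# Cumulative measure for the job: add up the .sum value of every window.
total_energy = sum(w["sum"] for w in windows)
```

Under these assumptions, only the {{{.sum}}} values add up to a job-wide total; adding {{{.avg}}} or {{{.med}}} values across windows does not yield a meaningful cumulative figure.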
     146
     147==== BeeGFS Performance Metrics
     148
     149DCDB also exposes a series of performance metrics associated with the BeeGFS distributed file system. Comprehensive documentation about the available metrics and their meaning is available [https://www.beegfs.io/wiki/MonDatabaseReference here]. All BeeGFS sensors in the DCDB database are identified by the {{{/deepest/beegfs/}}} prefix, and they can be listed by using the following command:
     150
     151{{{
     152dcdbconfig sensor listpublic | grep beegfs
     153}}}
     154
     155Most BeeGFS sensors do not directly map to users and compute nodes in the DEEP-EST prototype, and they require knowledge of the system's architecture to be interpreted.
    127156
    128157==== Long-term Sub-sampled Sensor Data