Otto-von-Guericke-University Magdeburg Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods for Systems and Control Theory

Dr. Jens Saak, Dipl.-Math. Martin Köhler Website: http://www.mpi-magdeburg.mpg.de/mpcsc/lehre/2012\_WS\_SC/

# Scientific Computing 1 8th Homework

#### Handout: 11/22/2012

Return: 11/29/2012

### Exercise 1:

Consider an electric signal can move at light speed  $(2.997 \, 10^8 \frac{\text{m}}{\text{s}})$ . Compute the distance the electric signal can travel during one clock cycle in the following applications:

• The memory-io clock rate on a mainboard:

| RAM - Type | Clock Rate |
|------------|------------|
| EDO-RAM    | 66 MHz     |
| SD-RAM     | 133 MHz    |
| DDR-RAM    | 200 MHz    |
| DDR2-RAM   | 533 MHz    |
| DDR3-RAM   | 1066 MHz   |
| DDR4-RAM   | 1600 MHz   |

• The CPU and L1/2/3 cache clock rate of:

| CPU                | Clock Rate |
|--------------------|------------|
| Intel Core 2       | 2.66 GHz   |
| Intel Xeon E5-2690 | 2.90 GHz   |
| Intel Xeon E3-1990 | 3.70 GHz   |
| IBM POWER7         | 4.25 GHz   |

• The clock rate of different network techniques:

| Technique        | Clock Rate |
|------------------|------------|
| Ethernet 100Mbit | 31.25 MHz  |
| Ethernet 10Gbit  | 417Mhz     |

Does the covered distance influence the design of electric circuits? Which problems will occur when the clock rates are driven even higher?

# Exercise 2:

(8 Points)

We consider the following C function:

```
void axpy(int n, double *a, double *b, double *c) {
    int i;
    for ( i = 0 ; i < n ; i++) {
        c[i]=a[i]+b[i];
    }
}</pre>
```

#### (6 Points)



computing c = a + b with  $a, b, c \in \mathbb{R}^n$ . We ran this for various  $n \in \{1000, 2000, \dots, 2 \cdot 10^6\}$  and took the average time for one vector add operation. On four different computers we got the following plots:

What can you recognize in the plots? Explain this behavior. Is it possible to determine any details of the memory hierarchy? Why is the runtime not linear?

**Hint:** The complete source of the benchmark program is available at: http://www.mpi-magdeburg.mpg.de/mpcsc/lehre/2012\_WS\_SC/data/axpy.c

### Exercise 3:

Consider a CPU with a clock rate of 2.13 GHz and a cache bus width of 128 bit. Compute the bandwidth between CPU and cache. A hardware manufacture combines this CPU with dual-channel DDR3 main memory (1066 Mhz, approximate bandwidth 34 GB/s).

- Compare the cache transfer rate with the main memory transfer rate.
- Assume that the main memory has a CAS-Latency of 3. How many CPU cycles are gone at least

#### (5 Points)

until a requested dataset arrives from the main memory in the CPU cache? What does this mean in the design of efficient algorithms?

• In which sense does the problem increase when a data portion has to be fetched from local storage, e.g., because it has been swapped?

# Exercise 4:

### (4 Points)

Consider the matrix-matrix product  $C = A \cdot B$  with  $A, B, C \in \mathbb{R}^{n \times n}$ . An easy pseudo code algorithm to compute the product is

```
for i = 1, \ldots, n do
for j = 1, \ldots, n do
C_{ij} = 0
for k = 1, \ldots, n do
C_{ij} = C_{ij} + A_{ik}B_{kj}
end for
end for
end for
```

Think about how many times the entries of A and B are accessed. Why does this way of evaluating the matrix-matrix product not have a good data locality? Reformulate the matrix-matrix product to get a better data locality in the memory hierarchy.

### **Overall Points: 23**