Summary
(a) simulate_cr function: Simulates multilevel C/R and outputs the results.
(b) optimize_cr function: Finds the interval and L2ckpt_freq that maximize the efficiency reported by simulate_cr, and outputs the simulation results for that combination.
Development Details
simulate_cr function
tuple simulate_cr(interval, L2ckpt_freq, L1ckpt_overhead, L2ckpt_latency, ckptRestartTimes, failRates, N, SN, G, g, alpha, check_interval, n_check_ok, n_failure_max, efficiency_log)
Argument (Input)
| Argument | Description | Type | Required / Default |
|---|---|---|---|
| interval | L1 checkpoint interval | int | Required |
| L2ckpt_freq | L2 checkpoint frequency | int | Required |
| L1ckpt_overhead | Synchronous L1 checkpoint time | int | Required |
| L2ckpt_latency | Asynchronous L2 checkpoint time | int | Required |
| ckptRestartTimes | Array of length 2 holding the time required for L1 and L2 recovery: [L1 recovery time, L2 recovery time] | List[int, int] | Required |
| failRates | Array of length 2 holding the number of failures per unit time that require L1 and L2 recovery: [L1 failure count, L2 failure count] | List[float, float] | Required |
| N | Total number of computation nodes | int | Required |
| SN | Number of spare nodes (parameter added to the specification) | int | Required |
| G | L1 checkpoint group size | int | Required |
| g | L1 checkpoint fault tolerance | int | Required |
| alpha | Threshold on the change in efficiency at which the simulation terminates | float | Required |
| check_interval | Frequency of efficiency-change checks (parameter added to the specification) | int | Optional, Default=1 |
| n_check_ok | Number of consecutive efficiency-change checks that must pass before the simulation is judged to have finished (parameter added to the specification) | int | Optional, Default=1 |
| n_failure_max | Maximum number of failures (parameter added to the specification) | int | Optional, Default=500000 |
| efficiency_log | Turns the history output of the efficiency-change check on or off (parameter added to the specification) | bool | Optional, Default=False |
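The descriptions of alpha, check_interval and n_check_ok do not spell out exactly how the three interact. One plausible reading, sketched below as an assumption rather than as the actual implementation, is that the efficiency is sampled once every check_interval steps and the simulation stops once the change between successive samples has stayed below alpha for n_check_ok consecutive checks. The function name has_converged and the sample values are hypothetical.

```python
def has_converged(checked_efficiencies, alpha, n_check_ok=1):
    """Return True when the last n_check_ok consecutive changes are all below alpha.

    checked_efficiencies holds one efficiency value per check, i.e. one value
    every check_interval steps of the simulation (assumed interpretation).
    """
    # n_check_ok consecutive changes need n_check_ok + 1 samples.
    if len(checked_efficiencies) < n_check_ok + 1:
        return False
    recent = checked_efficiencies[-(n_check_ok + 1):]
    return all(abs(later - earlier) < alpha
               for earlier, later in zip(recent, recent[1:]))


# Made-up efficiency samples, for illustration only.
samples = [0.50, 0.60, 0.605, 0.606]
print(has_converged(samples, alpha=0.01, n_check_ok=2))
print(has_converged(samples, alpha=0.01, n_check_ok=3))
```

Under this reading, n_failure_max would act as a hard cap that ends the simulation even if the check never passes.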
Return value (output): a tuple (X, A, B, C, D, E, F)
| Name | Description | Type |
|---|---|---|
| X | Efficiency = A/(B+C+D+F) | float |
| A | Real computation time | float |
| B | Time spent in the computation state | float |
| C | Time spent on L1 checkpoints | float |
| D | Time spent on L1 recovery | float |
| E | Time spent on L2 checkpoints | float |
| F | Time spent on L2 recovery | float |
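As a small illustration of how the returned tuple relates to the efficiency formula, the sketch below unpacks a result and recomputes X from the other fields. The helper name efficiency_from_result and the numbers in the example tuple are made up for illustration; they are not output from simulate_cr.

```python
def efficiency_from_result(result):
    """Recompute the efficiency from a (X, A, B, C, D, E, F) tuple."""
    _x, a, b, c, d, _e, f = result
    # As written in the table, the denominator is B + C + D + F (E does not appear).
    total = b + c + d + f
    return a / total if total > 0 else 0.0


# Made-up values in the order (X, A, B, C, D, E, F).
example_result = (0.8, 800.0, 900.0, 50.0, 20.0, 40.0, 30.0)
print(efficiency_from_result(example_result))
```

With these values the denominator is 900 + 50 + 20 + 30 = 1000, so the recomputed efficiency matches the X field of 0.8.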
optimize_cr function
tuple optimize_cr(L1ckpt_overhead, L2ckpt_latency, ckptRestartTimes, failRates, N, SN, G, g, alpha, check_interval, n_check_ok, n_failure_max, n_steps, log_interval)
Argument (Input)
| Argument | Description | Type | Required / Default |
|---|---|---|---|
| L1ckpt_overhead | Synchronous L1 checkpoint time | int | Required |
| L2ckpt_latency | Asynchronous L2 checkpoint time | int | Required |
| ckptRestartTimes | Array of length 2 holding the time required for L1 and L2 recovery: [L1 recovery time, L2 recovery time] | List[int, int] | Required |
| failRates | Array of length 2 holding the number of failures per unit time that require L1 and L2 recovery: [L1 failure count, L2 failure count] | List[float, float] | Required |
| N | Total number of computation nodes | int | Required |
| SN | Number of spare nodes (parameter added to the specification) | int | Required |
| G | L1 checkpoint group size | int | Required |
| g | L1 checkpoint fault tolerance | int | Required |
| alpha | Threshold on the change in efficiency at which the simulation terminates | float | Required |
| check_interval | Frequency of efficiency-change checks (parameter added to the specification) | int | Optional, Default=1 |
| n_check_ok | Number of consecutive efficiency-change checks that must pass before the simulation is judged to have finished (parameter added to the specification) | int | Optional, Default=1 |
| n_failure_max | Maximum number of failures (parameter added to the specification) | int | Optional, Default=500000 |
| n_steps | Number of optimization iterations (parameter added to the specification) | int | Optional, Default=5000 |
| log_interval | Log output interval for the optimization; 0 means no output (parameter added to the specification) | int | Optional, Default=100 |
Return value (output): a tuple (X, A, B, C, D, E, F, interval, L2ckpt_freq)
| Name | Description | Type |
|---|---|---|
| X | Efficiency = A/(B+C+D+F) at the optimized interval and L2ckpt_freq | float |
| A | Real computation time at the optimized interval and L2ckpt_freq | float |
| B | Time spent in the computation state at the optimized interval and L2ckpt_freq | float |
| C | Time spent on L1 checkpoints at the optimized interval and L2ckpt_freq | float |
| D | Time spent on L1 recovery at the optimized interval and L2ckpt_freq | float |
| E | Time spent on L2 checkpoints at the optimized interval and L2ckpt_freq | float |
| F | Time spent on L2 recovery at the optimized interval and L2ckpt_freq | float |
| interval | L1 checkpoint interval found by the optimization | int |
| L2ckpt_freq | L2 checkpoint frequency found by the optimization | int |
Optimization Methodology
Simulated annealing was used as the optimization technique.
Initial state
Of the following 24 combinations of interval and L2ckpt_freq, the one with the highest efficiency is used as the initial state.
interval = 1000, 2500, 5000, 8000, 12000, 24000
L2ckpt_freq = 1, 2, 5, 10
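The selection of the initial state can be sketched as an exhaustive search over the 6 x 4 grid above. This is a minimal sketch, not the actual source: pick_initial_state and toy_efficiency are hypothetical names, and toy_efficiency is a stand-in for the efficiency that a real run would take from the first element of the simulate_cr result.

```python
from itertools import product

INTERVALS = [1000, 2500, 5000, 8000, 12000, 24000]
L2CKPT_FREQS = [1, 2, 5, 10]


def pick_initial_state(evaluate):
    """Return the (interval, L2ckpt_freq) pair with the highest efficiency.

    evaluate(interval, L2ckpt_freq) must return an efficiency value.
    """
    candidates = list(product(INTERVALS, L2CKPT_FREQS))  # 24 combinations
    return max(candidates, key=lambda state: evaluate(*state))


def toy_efficiency(interval, l2ckpt_freq):
    # Hypothetical stand-in for simulate_cr(...)[0]; peaks at (5000, 5).
    return -abs(interval - 5000) / 5000 - abs(l2ckpt_freq - 5) / 5


initial_state = pick_initial_state(toy_efficiency)
print(initial_state)
```

In actual use, evaluate would wrap a call to simulate_cr with the remaining arguments fixed.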
State Transition
The following four methods were considered for state transitions.
Method 1: Randomly select either interval or L2ckpt_freq, then increase or decrease the selected parameter by 2%.
Method 2: Randomly select either interval or L2ckpt_freq, then increase or decrease the selected parameter by a random amount of up to 5%.
Method 3: Increase or decrease both interval and L2ckpt_freq by a random amount between 0% and 5%.
Method 4: Randomly select either interval or L2ckpt_freq, then increase or decrease the selected parameter by a fixed value.
In the evaluation, Methods 1 to 3 showed little difference from one another and only Method 4 stood out (*), so Method 1 was adopted.
(*) Because interval spans a wide range, a fixed step does not work well: a small step needs too many moves to cross the range, while a large step is too coarse a change at the low end of the range.
The transition method can be switched to any of the methods above with a simple source code modification. The 2% and 5% values are likewise set in the source code and can be changed by editing the corresponding parts.
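Putting the pieces of this section together, the overall search can be sketched as a generic simulated annealing loop. The source only states that an annealing method with Method 1 transitions was used; the linear cooling schedule, the starting temperature t0, the exp(delta / T) acceptance rule, and the names anneal, neighbor and toy_efficiency are all assumptions made for this illustration. toy_efficiency stands in for the efficiency returned by simulate_cr.

```python
import math
import random


def toy_efficiency(interval, l2ckpt_freq):
    # Hypothetical stand-in for simulate_cr(...)[0]; peaks at (6000, 4).
    return 1.0 - ((interval - 6000) / 6000) ** 2 - ((l2ckpt_freq - 4) / 4) ** 2


def neighbor(interval, l2ckpt_freq, rng, rate=0.02):
    # Method 1 style move: change one randomly chosen parameter by 2%.
    # Assumption: integer parameters, step of at least 1, values kept >= 1.
    sign = rng.choice((-1, 1))
    if rng.random() < 0.5:
        return max(1, interval + sign * max(1, round(interval * rate))), l2ckpt_freq
    return interval, max(1, l2ckpt_freq + sign * max(1, round(l2ckpt_freq * rate)))


def anneal(evaluate, initial, n_steps=5000, log_interval=100, t0=0.05, seed=0):
    rng = random.Random(seed)
    current, current_eff = initial, evaluate(*initial)
    best, best_eff = current, current_eff
    for step in range(1, n_steps + 1):
        # Assumed linear cooling schedule; the source does not specify one.
        temperature = max(t0 * (1.0 - step / n_steps), 1e-12)
        candidate = neighbor(*current, rng)
        candidate_eff = evaluate(*candidate)
        delta = candidate_eff - current_eff
        # Always accept improvements; accept worse states with probability exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_eff = candidate, candidate_eff
            if current_eff > best_eff:
                best, best_eff = current, current_eff
        # log_interval = 0 disables logging, matching the argument description.
        if log_interval and step % log_interval == 0:
            print(f"step={step} current={current} efficiency={current_eff:.4f}")
    return best, best_eff


best_state, best_efficiency = anneal(toy_efficiency, (5000, 5), n_steps=2000, log_interval=0)
print(best_state, best_efficiency)
```

Because each call to the real simulation is itself stochastic and comparatively expensive, n_steps bounds the total number of evaluations, which is the role the n_steps argument plays in optimize_cr.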