Summary

(a) simulate_cr function: Simulates multilevel checkpoint/restart (C/R) and outputs the results.

(b) optimize_cr function: Finds the interval and L2ckpt_freq that maximize the efficiency reported by simulate_cr, and outputs the simulation results for those values.

Development Details

simulate_cr function

tuple simulate_cr(interval, L2ckpt_freq, L1ckpt_overhead, L2ckpt_latency, ckptRestartTimes, failRates, N, SN, G, g, alpha, check_interval, n_check_ok, n_failure_max, efficiency_log)

Argument (Input)

| Argument name | Description | Type | Required / default |
|---|---|---|---|
| interval | L1 checkpoint interval | int | Required |
| L2ckpt_freq | L2 checkpoint frequency | int | Required |
| L1ckpt_overhead | Synchronous L1 checkpoint time | int | Required |
| L2ckpt_latency | Asynchronous L2 checkpoint time | int | Required |
| ckptRestartTimes | Array of length 2 containing the time required for L1 and L2 recovery: [L1 recovery time, L2 recovery time] | list [int, int] | Required |
| failRates | Array of length 2 containing the number of failures per unit time that require L1 and L2 recovery: [L1 failure count, L2 failure count] | list [float, float] | Required |
| N | Total number of compute nodes | int | Required |
| SN | Number of spare nodes (parameter added to the specification) | int | Required |
| G | L1 checkpoint group size | int | Required |
| g | L1 checkpoint fault tolerance | int | Required |
| alpha | Threshold on the change in efficiency at which the simulation terminates | float | Required |
| check_interval | Frequency of efficiency-change checks (parameter added to the specification) | int | Optional, default = 1 |
| n_check_ok | Number of consecutive efficiency-change checks that must pass before the simulation is judged finished (parameter added to the specification) | int | Optional, default = 1 |
| n_failure_max | Maximum number of failures (parameter added to the specification) | int | Optional, default = 500000 |
| efficiency_log | Turns output of the efficiency-change check history on or off (parameter added to the specification) | bool | Optional, default = False |
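The interplay of alpha, check_interval, n_check_ok and n_failure_max can be sketched as follows. This is only one plausible reading of the table, not the actual implementation: the class name ConvergenceChecker, the use of an absolute (rather than relative) difference, the strict "<" comparison, and counting one update per simulated failure are all assumptions.

```python
class ConvergenceChecker:
    """Hypothetical stop criterion built from the simulate_cr arguments."""

    def __init__(self, alpha, check_interval=1, n_check_ok=1, n_failure_max=500000):
        self.alpha = alpha
        self.check_interval = check_interval
        self.n_check_ok = n_check_ok
        self.n_failure_max = n_failure_max
        self.n_events = 0      # failures simulated so far (assumed unit)
        self.prev = None       # efficiency at the previous check
        self.ok_streak = 0     # consecutive checks with a small change

    def update(self, efficiency):
        """Record one event; return True when the simulation should stop."""
        self.n_events += 1
        # Hard cap on the number of simulated failures.
        if self.n_events >= self.n_failure_max:
            return True
        # Only compare efficiencies every check_interval events.
        if self.n_events % self.check_interval != 0:
            return False
        if self.prev is not None and abs(efficiency - self.prev) < self.alpha:
            self.ok_streak += 1
        else:
            self.ok_streak = 0
        self.prev = efficiency
        return self.ok_streak >= self.n_check_ok
```

Under these assumptions, a larger n_check_ok guards against stopping on a single coincidentally small change, at the cost of a longer simulation.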

Return value (output): a tuple (X, A, B, C, D, E, F)

| Name | Description | Type |
|---|---|---|
| X | Efficiency = A/(B+C+D+F) | float |
| A | Real computation time | float |
| B | Time spent in the computation state | float |
| C | Time spent on L1 checkpoints | float |
| D | Time spent on L1 recovery | float |
| E | Time spent on L2 checkpoints | float |
| F | Time spent on L2 recovery | float |
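As a reading aid for the return tuple, the sketch below names its fields and recomputes the efficiency from the components using the formula exactly as stated above (E does not appear in the stated denominator). The names CRResult and efficiency are hypothetical and not part of the original code, and the numbers in the usage note are made up for illustration.

```python
from typing import NamedTuple


class CRResult(NamedTuple):
    """Hypothetical named view of the (X, A, B, C, D, E, F) tuple."""
    X: float  # efficiency
    A: float  # real computation time
    B: float  # time spent in the computation state
    C: float  # time spent on L1 checkpoints
    D: float  # time spent on L1 recovery
    E: float  # time spent on L2 checkpoints
    F: float  # time spent on L2 recovery


def efficiency(A, B, C, D, F):
    """Efficiency = A / (B + C + D + F), as written in the specification."""
    total = B + C + D + F
    if total <= 0:
        raise ValueError("total time must be positive")
    return A / total
```

For example, with illustrative values A=800, B=900, C=50, D=30, F=20, efficiency(800, 900, 50, 30, 20) gives 0.8, and a raw result tuple can be wrapped as CRResult(*raw) to access fields by name.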

optimize_cr function

tuple optimize_cr (L1ckpt_overhead, L2ckpt_latency, ckptRestartTimes, failRates, N, SN, G, g, alpha, check_interval, n_check_ok, n_failure_max, n_steps, log_interval)

Argument (Input)

| Argument name | Description | Type | Required / default |
|---|---|---|---|
| L1ckpt_overhead | Synchronous L1 checkpoint time | int | Required |
| L2ckpt_latency | Asynchronous L2 checkpoint time | int | Required |
| ckptRestartTimes | Array of length 2 containing the time required for L1 and L2 recovery: [L1 recovery time, L2 recovery time] | list [int, int] | Required |
| failRates | Array of length 2 containing the number of failures per unit time that require L1 and L2 recovery: [L1 failure count, L2 failure count] | list [float, float] | Required |
| N | Total number of compute nodes | int | Required |
| SN | Number of spare nodes (parameter added to the specification) | int | Required |
| G | L1 checkpoint group size | int | Required |
| g | L1 checkpoint fault tolerance | int | Required |
| alpha | Threshold on the change in efficiency at which the simulation terminates | float | Required |
| check_interval | Frequency of efficiency-change checks (parameter added to the specification) | int | Optional, default = 1 |
| n_check_ok | Number of consecutive efficiency-change checks that must pass before the simulation is judged finished (parameter added to the specification) | int | Optional, default = 1 |
| n_failure_max | Maximum number of failures (parameter added to the specification) | int | Optional, default = 500000 |
| n_steps | Number of optimization iterations (parameter added to the specification) | int | Optional, default = 5000 |
| log_interval | Log output interval for the optimization; 0 means no output (parameter added to the specification) | int | Optional, default = 100 |

Return value (output): a tuple (X, A, B, C, D, E, F, interval, L2ckpt_freq)

| Name | Description | Type |
|---|---|---|
| X | Efficiency = A/(B+C+D+F) at the optimized interval and L2ckpt_freq | float |
| A | Real computation time at the optimized interval and L2ckpt_freq | float |
| B | Time spent in the computation state at the optimized interval and L2ckpt_freq | float |
| C | Time spent on L1 checkpoints at the optimized interval and L2ckpt_freq | float |
| D | Time spent on L1 recovery at the optimized interval and L2ckpt_freq | float |
| E | Time spent on L2 checkpoints at the optimized interval and L2ckpt_freq | float |
| F | Time spent on L2 recovery at the optimized interval and L2ckpt_freq | float |
| interval | L1 checkpoint interval found by the optimization | int |
| L2ckpt_freq | L2 checkpoint frequency found by the optimization | int |

Optimization Methodology

Simulated annealing was used as the optimization technique.
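For orientation, a generic simulated annealing loop might look like the sketch below. The source does not describe the acceptance rule or the cooling schedule, so the Metropolis criterion, the geometric cooling, the t_start and t_end parameters, and the toy objective and neighbor functions are all assumptions made for illustration; only n_steps and log_interval correspond to documented arguments.

```python
import math
import random


def anneal(objective, initial, neighbor, n_steps=5000, log_interval=100,
           t_start=0.05, t_end=1e-4, rng=None):
    """Maximize objective(state); returns (best_state, best_score)."""
    rng = rng or random.Random()
    current, current_score = initial, objective(initial)
    best, best_score = current, current_score
    for step in range(n_steps):
        # Assumed geometric cooling from t_start down to t_end.
        frac = step / max(n_steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        candidate = neighbor(current, rng)
        score = objective(candidate)
        delta = score - current_score
        # Metropolis rule: always accept improvements, sometimes regressions.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, score
            if current_score > best_score:
                best, best_score = current, current_score
        # log_interval == 0 disables logging, as in the argument table.
        if log_interval and (step + 1) % log_interval == 0:
            print(f"step {step + 1}: best={best} efficiency={best_score:.4f}")
    return best, best_score


def toy_efficiency(state):
    """Hypothetical smooth objective peaking at interval=5000, freq=5."""
    interval, l2_freq = state
    return 1.0 - ((interval - 5000) / 5000) ** 2 - ((l2_freq - 5) / 5) ** 2


def toy_neighbor(state, rng):
    """Placeholder move: nudge one of the two parameters."""
    interval, l2_freq = state
    if rng.random() < 0.5:
        return max(1, interval + rng.choice((-100, 100))), l2_freq
    return interval, max(1, l2_freq + rng.choice((-1, 1)))
```

In optimize_cr the objective would presumably be the efficiency returned by simulate_cr for a given (interval, L2ckpt_freq) pair, which is a noisy quantity; the toy functions here only stand in for it.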

Initial state

Of the following combinations of interval and L2ckpt_freq (24 combinations), the one with the highest efficiency is used as the initial state.

interval = 1000, 2500, 5000, 8000, 12000, 24000

L2ckpt_freq = 1, 2, 5, 10
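The initial-state selection amounts to an exhaustive search over a small grid. A minimal sketch, assuming a caller-supplied function efficiency_of(interval, l2ckpt_freq) that returns the efficiency for one combination (in practice presumably the first element of the simulate_cr result); the names pick_initial_state and efficiency_of are hypothetical:

```python
from itertools import product

# Candidate values taken from the specification.
INTERVALS = (1000, 2500, 5000, 8000, 12000, 24000)
L2CKPT_FREQS = (1, 2, 5, 10)


def pick_initial_state(efficiency_of):
    """Return the (interval, L2ckpt_freq) pair with the highest efficiency."""
    candidates = list(product(INTERVALS, L2CKPT_FREQS))  # 6 x 4 = 24 pairs
    return max(candidates, key=lambda pair: efficiency_of(*pair))
```

Starting the annealing from the best of these coarse grid points means the search begins in a reasonable region instead of at an arbitrary corner of the parameter space.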

State Transition

The following four methods were considered for state transitions.

Method 1: (1) Randomly select whether to change interval or L2ckpt_freq. (2) Increase or decrease the selected parameter by 2%.

Method 2: (1) Randomly select whether to change interval or L2ckpt_freq. (2) Increase or decrease the selected parameter by a random amount within 5%.

Method 3: (1) Increase or decrease both interval and L2ckpt_freq by a random amount in the range 0-5%.

Method 4: (1) Randomly select whether to change interval or L2ckpt_freq. (2) Increase or decrease the selected parameter by a fixed value.

After comparing them, Method 1 was adopted: apart from Method 4 (*), the methods showed little difference in results.

(*) Because interval spans a wide range, a fixed step works poorly: a small step requires too many moves to traverse the range, while a large step is too coarse a change at the small end of the range.

Switching to any of the other state transition methods requires only a simple source code modification. The 2% and 5% values can likewise be changed by editing the corresponding parts of the source code.