robotmk-controller
A) Situation
In the current two working modes (async / spooldir = triggered externally) there is always the possibility that the plugin does not get executed properly or crashes somehow - in short: it won't return at least the "robotmk" section header.
This leads to problems:
- For Checkmk it then looks like the host is not an end2end host at all, which is not true ("missing agent data section").
- There is no validation of the agent: presence of robotdir, allowed Python files, ...
B) Idea / basic concept
The idea is to expand the possibilities of the RobotMK agent with some metadata coming from a "controller" plugin. Furthermore, the agent should handle three different operation modes:
- agent_serial
  - robotmk is called once; all suites are executed in series - there is always max. one suite running
  - cache_time depends on the total runtime of all suites
- agent_parallel
  - multiple robotmk executions, suites are executed in parallel - max. parallel executions can be limited by max_parallel_executions (robotmk.yml)
  - cache_time on suite basis
- external
  - robotmk is either called once (executes all suites in the robotdir) or with a suite name as argument - multiple executions possible; max. parallel executions can be limited by max_parallel_executions (robotmk.yml)
  - cache_time on suite or overall basis
robotmk.py has two operation modes:
- called without any argument by the agent every minute => "controller" mode (see "Tasks of robotmk-controller")
- called by the controller with an argument => "runner" mode (see the dispatch sketch below):
  - in mode "agent_serial" called with --run => execute all suites
  - in mode "agent_parallel" called with --run SUITENAME => individual execution of suites (NOT YET IMPLEMENTED)
  - in mode "external": not called at all; this is done externally.
C) Tasks of robotmk-controller
There are two main tasks for the controller plugin:
- I. Execution (in mode agent_serial + agent_parallel, done as a detached process):
  - serial mode: robotmk runs the suites in series within global_execution_interval, cache time global_cache_time
  - parallel mode: robotmk runs each suite individually within suite_execution_interval, cache time suite_cache_time
- II. Output generation, spoolfile monitoring:
  - Reads the spool files written after each execution; prints out the <<<robotmk>>> header, followed by a JSON containing all suite results (see the sketch below)
  - Continuously monitors the spoolfiles for stale spoolfiles (explained below)
  - Writes a log file about each action and decision to make debugging easier
  - Housekeeping: removes spoolfiles of suites which are not defined in the config
  - In addition:
    - validates the agent config; alerts if there are configuration errors:
      - no robotdir
      - no robotmk.yml
      - spooldir not readable
      - presence of the suite dirs which are named to be executed
    - reports the version of the robotmk deployment
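As an illustration of task II, a minimal sketch of how the spoolfiles could be merged into one <<<robotmk>>> section; the tmp dir path and the filename pattern are assumptions based on the examples further below:

```python
# Hypothetical sketch of task II: merge all spoolfiles from the agent tmp
# dir and print them as a single <<<robotmk>>> agent section.
import glob
import json
import os

TMPDIR = "tmp"  # assumed agent tmp dir, see the spoolfile examples below

def print_robotmk_section():
    results = {}
    for spoolfile in sorted(glob.glob(os.path.join(TMPDIR, "robotmk_*.json"))):
        with open(spoolfile) as f:
            results.update(json.load(f))  # each file holds one suite result
    print("<<<robotmk>>>")
    print(json.dumps(results))

if __name__ == "__main__":
    print_robotmk_section()
```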
These tasks are explained in the following with respect to the execution modes:
C.I Execution
C.I.0 General
In any of the three operation modes (agent_serial|agent_parallel|external) the controller will first load the configuration. It contains global parameters as well as RF parameters for the suite execution:
robotmk.yml:

```yaml
global_cache_time: 4000
global_execution_interval: 3600
execution_mode: agent_serial|agent_parallel|external
suites:
  suite1_with_vars:
    suite: suite1
    suite_cache_time: 240
    suite_execution_interval: 180
    variable:
      var1: foo
      var2: bar
  suite1_without_vars:
    suite: suite1
    suite_cache_time: 1100
    suite_execution_interval: 900
  suite2:
    suite_cache_time: 2000
    suite_execution_interval: 1800
```
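A minimal sketch of how the controller could load this configuration; the fallback from missing suite values to the global values is an assumption, not something stated above:

```python
# Hypothetical sketch: load robotmk.yml and fall back to the global timing
# values for suites that do not define their own (assumed behaviour).
import yaml  # PyYAML

def load_config(path="robotmk.yml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for suite in cfg.get("suites", {}).values():
        suite.setdefault("suite_cache_time", cfg["global_cache_time"])
        suite.setdefault("suite_execution_interval", cfg["global_execution_interval"])
    return cfg
```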
The results of RF executions will always be written as spoolfiles into the agent tmp dir.
Example:
```text
tmp/
  robotmk_suite1_with_vars.json
  robotmk_suite_two_bar.json
  robotmk_suite_three_baz.json
```
Each spoolfile contains the suite ID and, below it:
- the file path to the XML result
- the file path to the HTML log
cat tmp/robotmk_suite1_with_vars.json

```json
{
  "suite1_with_vars": {
    "last_start_time": "2020-11-20 11:45:33",
    "last_end_time": "2020-11-20 11:47:29",
    "last_rc": 0,
    "last_message": "OK: suite1_with_vars (00:01:46)",
    "xml": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-output.xml",
    "html": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-log.html"
  }
}
```
C.I.1 Execution in agent_serial mode
This mode comes closest to the "async" execution of Checkmk plugins. The difference is that the controller, when called every minute (= default check interval), decides each time whether robotmk should be executed.
```python
# pseudocode
if last_start_time < last_end_time:                     # last run has finished
    if now > last_start_time + global_execution_interval:
        execute_robotmk_detached()                      # execution is due
    else:
        pass  # last result is still fresh, do nothing
else:                                                   # a run is still in progress
    if now > last_start_time + global_execution_interval:
        pass  # do not throw an error; the monitoring part will detect the outdated spoolfile!
    else:
        pass  # run is still within the allowed cache_time, do nothing
```
The start/end times are read from robotmk-state.json (written at the begin and end of robotmk):
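The exact format of robotmk-state.json is not spelled out above; a minimal sketch of reading it, assuming the field names mirror the spoolfile example:

```python
# Hypothetical sketch: read the start/end times from robotmk-state.json
# (field names assumed to mirror the spoolfile example above).
import json
from datetime import datetime

def read_runner_state(path="robotmk-state.json"):
    with open(path) as f:
        state = json.load(f)
    fmt = "%Y-%m-%d %H:%M:%S"
    last_start = datetime.strptime(state["last_start_time"], fmt)
    last_end = datetime.strptime(state["last_end_time"], fmt)
    return last_start, last_end
```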
C.I.2 Execution in agent_parallel mode
In this mode all RF tests are allowed to run at different intervals and, if configured, in parallel.
At first, the list of scheduled suites is read from robotmk.yml:
```yaml
...
suites:
  suite1_with_vars:
    suite: suite1
    suite_cache_time: 240
    suite_execution_interval: 180
    variable:
      var1: foo
      var2: bar
  suite1_without_vars:
    suite: suite1
    suite_cache_time: 1100
    suite_execution_interval: 900
  suite2:
    suite_cache_time: 2000
    suite_execution_interval: 1800
```
The controller plugin is responsible for deciding which suites of this list have to be executed.
This splits up into two main tasks, explained below:
- task file creation
- task processing
Task file creation
Task files are created by the controller to build a queue of RF suites to execute.
Whenever a task is finished, its file gets deleted.
Task files are always empty; they represent the execution order of the suites by an epoch timestamp in the filename:
```text
#1         #2      #3
1606459120_robotmk_suite1_with_vars
1606459460_robotmk_suite1_without_vars
1606459700_robotmk_suite2
```
#1: next execution timestamp
#2: responsible for execution (default: robotmk; feature: place an OS username here)
#3: suite ID
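A minimal sketch of splitting such a filename into its three parts (the helper name is hypothetical):

```python
# Hypothetical helper: split a task filename into timestamp, executor and
# suite ID; maxsplit=2 keeps underscores inside the suite ID intact.
def parse_taskfile_name(filename):
    timestamp, executor, suite_id = filename.split("_", 2)
    return int(timestamp), executor, suite_id

# parse_taskfile_name("1606459120_robotmk_suite1_with_vars")
# -> (1606459120, "robotmk", "suite1_with_vars")
```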
Tasks can be divided into three groups:
| Group | yml entry? | state file? | meaning | action |
|---|---|---|---|---|
| New | X | | Task has never run before (config change) | schedule |
| Inventory | X | X | Task has run before | schedule |
| Orphaned | | X | Task not planned to run anymore (config change) | cleanup statefile |
Whether "New" and "Inventory" tasks are candidates to execute now, is determined like follows:
```python
# pseudocode
def read_statefile(suite):
    if statefile_exists(suite):
        return read(statefile(suite))    # Inventory task, already ran
    else:
        return State(last_start_time=0)  # New task, never ran

# create taskfiles for all suites which should run before the next agent run
for suite in suites:
    state = read_statefile(suite)
    agent_interval = now - controllerstate.last_run
    sched_window_start = now
    sched_window_end = now + agent_interval
    next_exec_time = state.last_start_time + suite['suite_execution_interval']
    if sched_window_start <= next_exec_time < sched_window_end:
        create_taskfile(next_exec_time, suite)
```
Task processing
Now that the task files are written, it is time to process them.
Given these three task files (and an agent interval of 5 min = 300 seconds):
```text
1606459120_robotmk_suite1_with_vars
# 1606459420 = now
1606459460_robotmk_suite1_without_vars
1606459700_robotmk_suite2
```
1606459120 has already passed, hence robotmk will execute it immediately.
1606459460 is 40 seconds ahead, hence robotmk will be called with a 40s delay.
1606459700 is 280 seconds ahead, hence robotmk will be called with a 280s delay.
All three robotmk calls are started at the same time (now) as detached processes, each with its delay parameter.
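A rough sketch of this step, assuming a hypothetical --delay runner argument and POSIX process detachment via start_new_session (Windows would need e.g. DETACHED_PROCESS creation flags instead):

```python
# Hypothetical sketch: start one detached runner per task file, each told
# how long to wait before it actually executes its suite.
import subprocess
import time

def start_runners(taskfile_names):
    now = int(time.time())
    for name in taskfile_names:
        timestamp, _executor, suite_id = name.split("_", 2)
        delay = max(0, int(timestamp) - now)  # 0 = execute immediately
        subprocess.Popen(
            ["python", "robotmk.py", "--run", suite_id, "--delay", str(delay)],
            start_new_session=True,  # detach from the controller (POSIX only)
        )
```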
After the delay inside the plugin has passed and the execution time of a test has come, robotmk has to pass resource constraints; these constraints are checks which guarantee that the machine never gets overloaded with tests (see the sketch after this list):
- max. currently running tests < max_parallel_executions (robotmk.yml)
- (TBD) min. CPU idle
- (TBD) min. free RAM
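A minimal sketch of the first constraint, counting running runner processes; the use of psutil and the process-matching heuristic are assumptions:

```python
# Hypothetical sketch: check the "max. currently running tests" constraint
# by counting robotmk.py runner processes (heuristic, assumed approach).
import psutil

def resources_available(max_parallel_executions):
    running = sum(
        1 for p in psutil.process_iter(["cmdline"])
        if p.info["cmdline"] and "robotmk.py" in " ".join(p.info["cmdline"])
    )
    return running < max_parallel_executions
```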
If there is any constraint violation, robotmk will abort and update the suite statefile:
cat tmp/robotmk_suite1_with_vars.json

```json
{
  "suite1_with_vars": {
    "last_start_time": "2020-11-20 11:45:33",
    "last_end_time": "2020-11-20 11:47:29",
    "last_rc": 255,
    "last_message": "UNKNOWN: suite1_with_vars did not have resources to start (4 of max. 4 parallel tests are already running!).",
    "xml": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-output.xml",
    "html": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-log.html"
  }
}
```
C.I.3 Execution in external mode
C.II Monitoring
As stated above, there are three working modes:
- agent_serial
- agent_parallel
- external.
| . | agent_serial | agent_parallel | external |
|---|---|---|---|
| runner:suites | one for all suites | one per suite | one for n |
| global: | | | |
| cache_time | O | O 1) | |
| execution_interval | O | | |
| suite: | | | |
| cache_time | O | O | |
| execution_interval | O | | |
| monitor runner cache headroom | last_end_time | | |
| monitor suite cache headroom | | | |
| monitor runner result staleness | | | |
| monitor suite result staleness | | | |
When does it make sense to monitor the headroom time?
A non-selective (= complete) run happens whenever the runner gets started with no suite args. That is the case in:
- serial mode (the controller itself starts the runner with no suite args)
- external mode (a scheduled task starts the runner with no suite args)
A selective (non-complete) run happens in:
- parallel mode (the controller starts one runner per suite)
- external mode (a scheduled task starts the runner with specific suites as arguments to --run)
This table shows which mode supports selective runs:
| exec_mode | selective | non-selective |
|---|---|---|
| agent_serial | no | yes |
| agent_parallel | yes | yes 1) |
| external | yes | yes |
1) Remark: it is questionable whether this makes sense; it is the same as serial mode, which also executes all suites as defined.
D) Notes to myself
- Change the WATO page so that the key of a suite is a unique identifier. The suite directory is a separate field. This helps to distinguish suite executions if there are two planned executions of the same suite which only differ in the variables!
- if now > state.last_start_time + suite['suite_cache_time']
- cache_time != execution_interval! Introduce a factor which extends the caching time.