
Redesign: robotmk-master, serial/parallel/external execution #59

@simonmeggle

Description


robotmk-controller

A) Situation

In the current two working modes (async / spooldir = triggered externally) there is always the possibility that the plugin does not get executed properly or crashes, in short: it does not even return the "robotmk" section header.
This leads to problems:

  • To Checkmk it then looks as if the host is not an end2end host at all, which is not true ("missing agent data section").
  • There is no validation on the agent side: presence of the robotdir, whether Python files are allowed, ...

B) idea / basic concept

The idea is to enrich the RobotMK agent output with some metadata coming from a "controller" plugin. Furthermore, the agent should handle three different operation modes:

  • agent_serial
    • robotmk called once, all suites are executed in series
    • there is always max. one suite running
    • cache_time depends on total runtime of all suites
  • agent_parallel
    • multiple robotmk executions, suites are executed in parallel
    • max. parallel executions can be limited by max_parallel_executions (robotmk.yml)
    • cache_time on suite basis
  • external
    • robotmk either called once (executes all suites in robotdir) or with a suite name as argument
    • max. parallel executions can be limited by max_parallel_executions (robotmk.yml)
    • cache_time on suite or overall basis

robotmk.py has two operation modes (a dispatch sketch follows below):

  • called by the agent every minute without any argument => "controller" mode (see "Tasks of robotmk-controller")
  • called by the controller with an argument => "runner" mode:
    • in mode "agent_serial": called with --run => execute all suites
    • in mode "agent_parallel": called with --run SUITENAME => individual execution of suites (NOT YET IMPLEMENTED)
    • in mode "external": not called by the controller at all, execution is triggered externally.

C) Tasks of robotmk-controller

There are two main tasks for the controller plugin:

  • I. Execution (in modes agent_serial + agent_parallel, done as a detached process):
    • serial mode: robotmk runs the suites in series every global_execution_interval, cache time global_cache_time
    • parallel mode: robotmk runs each suite individually every suite_execution_interval, cache time suite_cache_time
  • II. Output generation, spoolfile monitoring:
    • reads the spoolfiles written after each execution; prints the <<<robotmk>>> section header, followed by a JSON containing all suite results
    • continuously monitors the spoolfiles for stale spoolfiles (explained below)
    • writes a log file about each action and decision to make debugging easier
    • housekeeping: removes spoolfiles of suites which are not defined in the config
    • in addition:
      • validates the agent config; alerts if there are configuration errors:
        • no robotdir
        • no robotmk.yml
        • spooldir not readable
        • missing suite dirs which are configured for execution
    • version of the robotmk deployment

These tasks are explained below with respect to the execution modes:


C.I Execution

C.I.0 General

In any of the three operation modes (agent_serial|agent_parallel|external) the controller will first load the configuration. It contains global parameters as well as RF parameters for the suite execution:

robotmk.yml

global_cache_time: 4000
global_execution_interval: 3600
execution_mode : agent_serial|agent_parallel|external
suites: 
  suite1_with_vars:
    suite: suite1 
    suite_cache_time: 240
    suite_execution_interval: 180    
    variable:
      var1: foo
      var2: bar
  suite1_without_vars:
    suite: suite1 
    suite_cache_time: 1100
    suite_execution_interval: 900    
  suite2: 
    suite_cache_time: 2000
    suite_execution_interval: 1800    
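
A minimal sketch, assuming PyYAML and illustrative paths, of how the controller could load this configuration and run the basic validation mentioned in section C (path names and error strings are assumptions):

# Sketch (assumed): loading and validating robotmk.yml
import os
import yaml   # PyYAML

ROBOTDIR = "C:\\ProgramData\\checkmk\\agent\\robot"          # path assumed
CONFIGFILE = "C:\\ProgramData\\checkmk\\agent\\robotmk.yml"  # path assumed

def load_config():
    errors = []
    if not os.path.isdir(ROBOTDIR):
        errors.append("no robotdir")
    if not os.path.isfile(CONFIGFILE):
        errors.append("no robotmk.yml")
        return None, errors
    with open(CONFIGFILE) as f:
        cfg = yaml.safe_load(f)
    # every configured suite must have a matching suite dir in the robotdir
    for suite_id, suite_cfg in cfg.get("suites", {}).items():
        suite_dir = suite_cfg.get("suite", suite_id)
        if not os.path.isdir(os.path.join(ROBOTDIR, suite_dir)):
            errors.append("missing suite dir: %s" % suite_dir)
    return cfg, errors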

The results of RF executions will always be written as spoolfiles into the agent tmp dir.

Example:

tmp/
    robotmk_suite1_with_vars.json
    robotmk_suite1_without_vars.json
    robotmk_suite2.json

Each spoolfile contains the suite ID and, below it:

  • the last start/end time, return code and message of the suite run
  • the file path to the XML result
  • the file path to the HTML log

cat tmp/robotmk_suite1_with_vars.json
{
    "suite1_with_vars": {
        "last_start_time": "2020-11-20 11:45:33",
        "last_end_time": "2020-11-20 11:47:29",
        "last_rc": 0,
        "last_message": "OK: suite1_with_vars (00:01:46)",
        "xml": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-output.xml",
        "html": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-log.html"
    }
}
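
A minimal sketch (assumed, paths illustrative) of how the controller could implement task II and assemble the <<<robotmk>>> section from these spoolfiles:

# Sketch (assumed): print the <<<robotmk>>> section from all spoolfiles
import glob
import json
import os

TMPDIR = "tmp"   # agent tmp dir, path assumed

def print_robotmk_section():
    results = {}
    for spoolfile in glob.glob(os.path.join(TMPDIR, "robotmk_*.json")):
        with open(spoolfile) as f:
            results.update(json.load(f))   # one suite ID -> result dict per file
    print("<<<robotmk>>>")
    print(json.dumps(results))

print_robotmk_section()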

C.I.1 Execution in agent_serial mode

This mode closely resembles the "async" execution of Checkmk plugins. The difference is that the controller, when called every minute (= default check interval), decides each time whether robotmk should be executed:

# pseudocode
if state.last_start_time < state.last_end_time:        # last run has finished
    if now > state.last_start_time + global_execution_interval:
        execute_robotmk_detached()                      # execution is due
    else:
        pass    # last result is still fresh, do nothing
else:                                                   # a run is still in progress
    if now > state.last_start_time + global_execution_interval:
        pass    # do not throw an error; the monitoring part will detect the stale spoolfile!
    else:
        pass    # run is still within the allowed cache_time, do nothing

The start/end times are read from robotmk-state.json, which is written at the begin and end of each robotmk run.
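
The exact layout of robotmk-state.json is not specified in this issue; a minimal sketch, assuming it mirrors the timestamp fields of the suite spoolfiles, could look like this:

cat tmp/robotmk-state.json
{
    "last_start_time": "2020-11-20 11:45:33",
    "last_end_time": "2020-11-20 11:47:29"
}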


C.I.2 Execution in agent_parallel mode

In this mode all RF tests are allowed to run at different intervals and, if configured, in parallel.

At first, the list of scheduled suites is read from robotmk.yml:

...
suites: 
  suite1_with_vars:
    suite: suite1 
    suite_cache_time: 240
    suite_execution_interval: 180    
    variable:
      var1: foo
      var2: bar
  suite1_without_vars:
    suite: suite1 
    suite_cache_time: 1100
    suite_execution_interval: 900    
  suite2: 
    suite_cache_time: 2000
    suite_execution_interval: 1800  

The controller plugin is responsible for deciding which suites in this list have to be executed.

This splits up into two main tasks, explained below:

  • task file creation
  • task processing

task file creation

Task files are created by the controller to build a queue of RF suites to execute.

Whenever a task is finished, its file gets deleted.

Task files are always empty; they encode the execution order of the suites with an epoch timestamp in the filename:

1)         2)      3)
1606459120_robotmk_suite1_with_vars
1606459460_robotmk_suite1_without_vars
1606459700_robotmk_suite2

#1 next execution timestamp
#2 responsible for execution (default: robotmk; feature: place OS username here)
#3 suite ID
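
A small sketch (helper names assumed) for building and parsing these task file names:

# Sketch (assumed): build and parse task file names
def taskfile_name(next_exec_time, suite_id, user="robotmk"):
    # <epoch>_<responsible user>_<suite ID>, e.g. 1606459120_robotmk_suite1_with_vars
    return "%d_%s_%s" % (int(next_exec_time), user, suite_id)

def parse_taskfile_name(filename):
    epoch, user, suite_id = filename.split("_", 2)
    return int(epoch), user, suite_id

print(taskfile_name(1606459120, "suite1_with_vars"))
# -> 1606459120_robotmk_suite1_with_vars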

Tasks can be divided into three groups:

| Group | yml entry? | state file? | Meaning | Action |
| --- | --- | --- | --- | --- |
| New | X | | Task has never run before (config change) | schedule |
| Inventory | X | X | Task has run before | schedule |
| Orphaned | | X | Task not planned to run anymore (config change) | cleanup statefile |

Whether "New" and "Inventory" tasks are candidates to execute now, is determined like follows:

# pseudocode

def read_statefile(suite):
    if statefile_exists(suite):
        return read(statefile(suite))   # Inventory task, already ran
    else:
        return None                     # New task, never ran

# create task files for all suites which should run before the next agent run
for suite in suites:
    state = read_statefile(suite)
    agent_interval = now - controllerstate.last_run
    sched_window_start = now
    sched_window_end = now + agent_interval
    if state is None:
        next_exec_time = now            # never ran before: schedule immediately
    else:
        next_exec_time = state.last_start_time + suite['suite_execution_interval']
    if sched_window_start <= next_exec_time < sched_window_end:
        create_taskfile(next_exec_time, suite)

task processing

Now that the task files are written, it is time to process them.

Given these three task files (and an agent interval of 5 min = 300 seconds):

1606459120_robotmk_suite1_with_vars
# 1606459420 = now
1606459460_robotmk_suite1_without_vars
1606459700_robotmk_suite2

1606459120 has already passed, hence robotmk will execute this suite immediately.

1606459460 is 40 seconds ahead, hence robotmk will be called with a 40s delay.

1606459700 is 180 seconds ahead, hence robotmk will be called with a 180s delay.

Each of these three robotmk calls is started at the same time (now) as a detached process with its delay parameter.
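
A minimal sketch (an assumption) of how the controller could start such detached runner processes; the --delay flag is not specified in this issue and merely stands for "the runner receives its delay as a parameter":

# Sketch (assumed): start one detached runner per task file, with its delay
import os
import subprocess
import sys

def start_detached_runner(suite_id, delay_s):
    # --run SUITENAME comes from the issue text, --delay is an assumption
    cmd = [sys.executable, "robotmk.py", "--run", suite_id, "--delay", str(delay_s)]
    kwargs = dict(stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if os.name == "nt":
        # detach from the agent process on Windows
        kwargs["creationflags"] = subprocess.DETACHED_PROCESS
    else:
        # detach by starting a new session on POSIX
        kwargs["start_new_session"] = True
    subprocess.Popen(cmd, **kwargs)

for suite_id, delay_s in [("suite1_with_vars", 0),
                          ("suite1_without_vars", 40),
                          ("suite2", 180)]:
    start_detached_runner(suite_id, delay_s)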

After the delay within the plugin has elapsed and the execution time of a test has come, robotmk has to pass the resource constraints; these constraints are checks which guarantee that the machine never gets overloaded with tests (see the sketch after this list):

  • max. currently running tests < max_parallel_executions (robotmk.yml)
  • (TBD) min. CPU idle
  • (TBD) min. free RAM
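
A sketch of how the parallel-execution constraint could be checked, assuming psutil is available on the agent host (the CPU/RAM checks are still TBD and therefore omitted):

# Sketch (assumed): check the parallel-execution constraint before starting a suite
import psutil   # assumption: psutil is available on the agent host

MAX_PARALLEL_EXECUTIONS = 4   # would come from robotmk.yml

def count_running_runners():
    # heuristic: count robotmk processes started in runner mode (--run)
    count = 0
    for proc in psutil.process_iter(attrs=["cmdline"]):
        cmdline = proc.info.get("cmdline") or []
        if any("robotmk" in part for part in cmdline) and "--run" in cmdline:
            count += 1
    return count

def resources_available():
    return count_running_runners() < MAX_PARALLEL_EXECUTIONS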

If there is any constraint violation, robotmk will abort and update the suite statefile:

cat tmp/robotmk_suite1_with_vars.json
{
    "suite1_with_vars": {
        "last_start_time": "2020-11-20 11:45:33",
        "last_end_time": "2020-11-20 11:47:29",
        "last_rc": 255,
        "last_message": "UNKNOWN: suite1_with_vars did not have resources to start (4 of max. 4 parallel tests are already running!).",
        "xml": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-output.xml",
        "html": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-log.html"
    }
}

C.I.3 Execution in external mode


C.II Monitoring

As stated above, there are three working modes:

  • agent_serial
  • agent_parallel
  • external.
|                                 | agent_serial       | agent_parallel | external  |
| ---                             | ---                | ---            | ---       |
| runner: suites                  | one for all suites | one per suite  | one for n |
| global: cache_time              | O                  |                | O 1)      |
| global: execution_interval      | O                  |                |           |
| suite: cache_time               |                    | O              | O         |
| suite: execution_interval       |                    | O              |           |
| monitor runner cache headroom   | last_end_time      |                |           |
| monitor suite cache headroom    |                    |                |           |
| monitor runner result staleness |                    |                |           |
| monitor suite result staleness  |                    |                |           |
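
A rough sketch (an assumption) of how the controller could detect a stale suite result by comparing the spoolfile's last_end_time against its cache time:

# Sketch (assumed): flag a suite result as stale once its cache time has expired
import datetime
import json

def is_stale(spoolfile_path, cache_time_s, now=None):
    now = now or datetime.datetime.now()
    with open(spoolfile_path) as f:
        suite_result = next(iter(json.load(f).values()))   # one suite per spoolfile
    last_end = datetime.datetime.strptime(suite_result["last_end_time"],
                                          "%Y-%m-%d %H:%M:%S")
    return now > last_end + datetime.timedelta(seconds=cache_time_s)

# e.g. is_stale("tmp/robotmk_suite1_with_vars.json", 240)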

When does it make sense to monitor the headroom time?

A non-selective (= complete) run is whenever the runner gets started without suite arguments. That is the case in:

  • serial mode (the controller itself starts the runner with no suite args)
  • external mode (a scheduled task starts the runner with no suite args)

A selective (non-complete) run happens in:

  • parallel mode (the controller starts one runner per suite)
  • external mode (a scheduled task starts the runner with specific suites as arguments to --run)

This table shows which mode supports selective runs:

| exec_mode | selective | non-selective |
| --- | --- | --- |
| agent_serial | no | yes |
| agent_parallel | yes | yes 1) |
| external | yes | yes |

1) Remark: It's questionable whether this makes sense. It's the same as serial mode, which also executes all suites as defined.

D) Notes to myself

  • Change the WATO page so that the key of a suite is a unique identifier; the suite directory is a separate field. This helps to distinguish suite executions if there are two planned executions of the same suite which differ only in their variables!
  • Staleness check: if now > state.last_start_time + suite['suite_cache_time']. Keep in mind that cache_time != execution_interval; introduce a factor which extends the caching time.
