robotmk-controller
A) Situation
In the current two working modes (async / spooldir = triggered externally) there is always the possibility that the plugin does not get executed properly or crashes somehow - in short: it won't return at least the "robotmk" section header.
This leads to problems:
- For Checkmk it then looks like the host is not an end2end host at all, which is not true ("missing agent data section").
- There is no validation of the agent: presence of robotdir, allowed Python files, ...
B) Idea / basic concept
The idea is to expand the possibilities of the RobotMK agent with some metadata coming from a "controller" plugin. Furthermore, the agent should handle three different operation modes:
- agent_serial
  - robotmk is called once; all suites are executed in series - there is always max. one suite running
  - cache_time depends on the total runtime of all suites
- agent_parallel
  - multiple robotmk executions, suites are executed in parallel - max. parallel executions can be limited by max_parallel_executions (robotmk.yml)
  - cache_time on suite basis
- external
  - robotmk is either called once (executes all suites in the robotdir) or with a suite name as argument - multiple executions possible; max. parallel executions can be limited by max_parallel_executions (robotmk.yml)
  - cache_time on suite or overall basis
robotmk.py has two operation modes:
- called without any argument by the agent every minute => "controller" mode (see "Tasks of robotmk-controller")
- called by the controller with an argument => "runner" mode (see the dispatch sketch below):
  - in mode "agent_serial" called with --run => execute all suites
  - in mode "agent_parallel" called with --run SUITENAME => individual execution of suites (NOT YET IMPLEMENTED)
  - in mode "external": not called at all; this is done externally.
C) Tasks of robotmk-controller
There are two main tasks for the controller plugin:
- I. Execution (in mode agent_serial + agent_parallel, done as a detached process):
  - serial mode: robotmk runs the suites in series within global_execution_interval, cache time global_cache_time
  - parallel mode: robotmk runs each suite individually within suite_execution_interval, cache time suite_cache_time
- II. Output generation, spoolfile monitoring:
  - Reads the spool files written after each execution; prints out the <<<robotmk>>> header, followed by a JSON containing all suite results (see the sketch below)
  - Continuously monitors the spoolfiles for stale spoolfiles (explained below)
  - Writes a log file about each action and decision to make debugging easier
  - Housekeeping: removes spoolfiles of suites which are not defined in the config
  - In addition:
    - validates the agent config; alerts if there are configuration errors:
      - no robotdir
      - no robotmk.yml
      - spooldir not readable
      - presence of the suite dirs which are named to be executed
    - reports the version of the robotmk deployment
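As an illustration of task II, a minimal sketch of how the spoolfiles could be merged into one <<<robotmk>>> section; the tmp dir path and the filename pattern are assumptions based on the examples further below:

```python
# Hypothetical sketch of task II: merge all spoolfiles from the agent tmp
# dir and print them as a single <<<robotmk>>> agent section.
import glob
import json
import os

TMPDIR = "tmp"  # assumed agent tmp dir, see the spoolfile examples below

def print_robotmk_section():
    results = {}
    for spoolfile in sorted(glob.glob(os.path.join(TMPDIR, "robotmk_*.json"))):
        with open(spoolfile) as f:
            results.update(json.load(f))  # each file holds one suite result
    print("<<<robotmk>>>")
    print(json.dumps(results))

if __name__ == "__main__":
    print_robotmk_section()
```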
These tasks are explained in the following with respect to the execution modes:
C.I Execution
C.I.0 General
In any of the three operation modes (agent_serial|agent_parallel|external) the controller will first load the configuration. It contains global parameters as well as RF parameters for the suite execution:
robotmk.yml:

```yaml
global_cache_time: 4000
global_execution_interval: 3600
execution_mode: agent_serial|agent_parallel|external
suites:
  suite1_with_vars:
    suite: suite1
    suite_cache_time: 240
    suite_execution_interval: 180
    variable:
      var1: foo
      var2: bar
  suite1_without_vars:
    suite: suite1
    suite_cache_time: 1100
    suite_execution_interval: 900
  suite2:
    suite_cache_time: 2000
    suite_execution_interval: 1800
```
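A minimal sketch of how the controller could load this configuration; the fallback from missing suite values to the global values is an assumption, not something stated above:

```python
# Hypothetical sketch: load robotmk.yml and fall back to the global timing
# values for suites that do not define their own (assumed behaviour).
import yaml  # PyYAML

def load_config(path="robotmk.yml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for suite in cfg.get("suites", {}).values():
        suite.setdefault("suite_cache_time", cfg["global_cache_time"])
        suite.setdefault("suite_execution_interval", cfg["global_execution_interval"])
    return cfg
```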
The results of RF executions will always be written as spoolfiles into the agent tmp dir.
Example:
```text
tmp/
  robotmk_suite1_with_vars.json
  robotmk_suite_two_bar.json
  robotmk_suite_three_baz.json
```
Each spoolfile contains the suite ID and, below it:
- the file path to the XML result
- the file path to the HTML log
cat tmp/robotmk_suite1_with_vars.json

```json
{
  "suite1_with_vars": {
    "last_start_time": "2020-11-20 11:45:33",
    "last_end_time": "2020-11-20 11:47:29",
    "last_rc": 0,
    "last_message": "OK: suite1_with_vars (00:01:46)",
    "xml": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-output.xml",
    "html": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-log.html"
  }
}
```
C.I.1 Execution in agent_serial mode
This mode comes closest to the "async" execution of Checkmk plugins. The difference is that the controller, when called every minute (= default check interval), decides each time whether robotmk should be executed.
```python
# pseudocode
if last_start_time < last_end_time:                     # last run has finished
    if now > last_start_time + global_execution_interval:
        execute_robotmk_detached()                      # execution is due
    else:
        pass  # last result is still fresh, do nothing
else:                                                   # a run is still in progress
    if now > last_start_time + global_execution_interval:
        pass  # do not throw an error; the monitoring part will detect the outdated spoolfile!
    else:
        pass  # run is still within the allowed cache_time, do nothing
```
The start/end times are read from robotmk-state.json (written at the begin and end of robotmk):
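The exact format of robotmk-state.json is not spelled out above; a minimal sketch of reading it, assuming the field names mirror the spoolfile example:

```python
# Hypothetical sketch: read the start/end times from robotmk-state.json
# (field names assumed to mirror the spoolfile example above).
import json
from datetime import datetime

def read_runner_state(path="robotmk-state.json"):
    with open(path) as f:
        state = json.load(f)
    fmt = "%Y-%m-%d %H:%M:%S"
    last_start = datetime.strptime(state["last_start_time"], fmt)
    last_end = datetime.strptime(state["last_end_time"], fmt)
    return last_start, last_end
```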
C.I.2 Execution in agent_parallel mode
In this mode all RF tests are allowed to run at different intervals and, if configured, in parallel.
At first, the list of scheduled suites is read from robotmk.yml:
```yaml
...
suites:
  suite1_with_vars:
    suite: suite1
    suite_cache_time: 240
    suite_execution_interval: 180
    variable:
      var1: foo
      var2: bar
  suite1_without_vars:
    suite: suite1
    suite_cache_time: 1100
    suite_execution_interval: 900
  suite2:
    suite_cache_time: 2000
    suite_execution_interval: 1800
```
The controller plugin is responsible for deciding which suites of this list have to be executed.
This splits up into two main tasks, explained below:
- task file creation
- task processing
Task file creation
Task files are created by the controller to build a queue of RF suites to execute.
Whenever a task is finished, its file gets deleted.
Task files are always empty; they represent the execution order of the suites by an epoch timestamp in the filename:
```text
#1         #2      #3
1606459120_robotmk_suite1_with_vars
1606459460_robotmk_suite1_without_vars
1606459700_robotmk_suite2
```
#1: next execution timestamp
#2: responsible for execution (default: robotmk; feature: place an OS username here)
#3: suite ID
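A minimal sketch of splitting such a filename into its three parts (the helper name is hypothetical):

```python
# Hypothetical helper: split a task filename into timestamp, executor and
# suite ID; maxsplit=2 keeps underscores inside the suite ID intact.
def parse_taskfile_name(filename):
    timestamp, executor, suite_id = filename.split("_", 2)
    return int(timestamp), executor, suite_id

# parse_taskfile_name("1606459120_robotmk_suite1_with_vars")
# -> (1606459120, "robotmk", "suite1_with_vars")
```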
Tasks can be divided into three groups:
| Group | yml entry? | state file? | meaning | action |
|---|---|---|---|---|
| New | X | | Task has never run before (config change) | schedule |
| Inventory | X | X | Task has run before | schedule |
| Orphaned | | X | Task not planned to run anymore (config change) | cleanup statefile |
Whether "New" and "Inventory" tasks are candidates to execute now, is determined like follows:
```python
# pseudocode
def read_statefile(suite):
    if statefile_exists(suite):
        return read(statefile(suite))    # Inventory task, already ran
    else:
        return State(last_start_time=0)  # New task, never ran

# create taskfiles for all suites which should run before the next agent run
for suite in suites:
    state = read_statefile(suite)
    agent_interval = now - controllerstate.last_run
    sched_window_start = now
    sched_window_end = now + agent_interval
    next_exec_time = state.last_start_time + suite['suite_execution_interval']
    if sched_window_start <= next_exec_time < sched_window_end:
        create_taskfile(next_exec_time, suite)
```
Task processing
Now that the task files are written, it is time to process them.
Given these three task files (and an agent interval of 5 min = 300 seconds):
```text
1606459120_robotmk_suite1_with_vars
# 1606459420 = now
1606459460_robotmk_suite1_without_vars
1606459700_robotmk_suite2
```
1606459120 has already passed, hence robotmk will execute it immediately.
1606459460 is 40 seconds ahead, hence robotmk will be called with a 40s delay.
1606459700 is 280 seconds ahead, hence robotmk will be called with a 280s delay.
All three robotmk calls are started at the same time (now) as detached processes, each with its delay parameter.
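A rough sketch of this step, assuming a hypothetical --delay runner argument and POSIX process detachment via start_new_session (Windows would need e.g. DETACHED_PROCESS creation flags instead):

```python
# Hypothetical sketch: start one detached runner per task file, each told
# how long to wait before it actually executes its suite.
import subprocess
import time

def start_runners(taskfile_names):
    now = int(time.time())
    for name in taskfile_names:
        timestamp, _executor, suite_id = name.split("_", 2)
        delay = max(0, int(timestamp) - now)  # 0 = execute immediately
        subprocess.Popen(
            ["python", "robotmk.py", "--run", suite_id, "--delay", str(delay)],
            start_new_session=True,  # detach from the controller (POSIX only)
        )
```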
After the delay inside the plugin has passed and the execution time of a test has come, robotmk has to pass resource constraints; these constraints are checks which guarantee that the machine never gets overloaded with tests (see the sketch after this list):
- max. currently running tests < max_parallel_executions (robotmk.yml)
- (TBD) min. CPU idle
- (TBD) min. free RAM
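A minimal sketch of the first constraint, counting running runner processes; the use of psutil and the process-matching heuristic are assumptions:

```python
# Hypothetical sketch: check the "max. currently running tests" constraint
# by counting robotmk.py runner processes (heuristic, assumed approach).
import psutil

def resources_available(max_parallel_executions):
    running = sum(
        1 for p in psutil.process_iter(["cmdline"])
        if p.info["cmdline"] and "robotmk.py" in " ".join(p.info["cmdline"])
    )
    return running < max_parallel_executions
```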
If there is any constraint violation, robotmk will abort and update the suite statefile:
cat tmp/robotmk_suite1_with_vars.json

```json
{
  "suite1_with_vars": {
    "last_start_time": "2020-11-20 11:45:33",
    "last_end_time": "2020-11-20 11:47:29",
    "last_rc": 255,
    "last_message": "UNKNOWN: suite1_with_vars did not have resources to start (4 of max. 4 parallel tests are already running!).",
    "xml": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-output.xml",
    "html": "C:\\Windows\\temp\\robotmk_suite1_with_vars_12134242-log.html"
  }
}
```
C.I.3 Execution in external mode
C.II Monitoring
As stated above, there are three working modes:
- agent_serial
- agent_parallel
- external.
| . | agent_serial | agent_parallel | external |
|---|---|---|---|
| runner:suites | one for all suites | one per suite | one for n |
| global: | | | |
| cache_time | O | O 1) | |
| execution_interval | O | | |
| suite: | | | |
| cache_time | O | O | |
| execution_interval | O | | |
| monitor runner cache headroom | last_end_time | | |
| monitor suite cache headroom | | | |
| monitor runner result staleness | | | |
| monitor suite result staleness | | | |
When does it make sense to monitor the headroom time?
A non-selective (= complete) run happens whenever the runner gets started with no suite args. That is the case in:
- serial mode (the controller itself starts the runner with no suite args)
- external mode (a scheduled task starts the runner with no suite args)
A selective (non-complete) run happens in:
- parallel mode (the controller starts one runner per suite)
- external mode (a scheduled task starts the runner with specific suites as arguments to --run)
This table shows which mode supports selective runs:
| exec_mode | selective | non-selective |
|---|---|---|
| agent_serial | no | yes |
| agent_parallel | yes | yes 1) |
| external | yes | yes |
1) Remark: it is questionable whether this makes sense; it is the same as serial mode, which also executes all suites as defined.
D) Notes to myself
- Change the WATO page so that the key of a suite is a unique identifier. The suite directory is a separate field. This helps to distinguish suite executions if there are two planned executions of the same suite which only differ in the variables!
- if now > state.last_start_time + suite['suite_cache_time']
- cache_time != execution_interval! Introduce a factor which extends the caching time.