Hi,
For the past few days, we have been hitting a new problem related to livestatus.
We're running version 2.4.1 on Debian 8 with the following architecture:
- 1 shinken master (with no poller running on it)
- 4 pollers
- 4 realms
- Thruk 2.0 connected to 5 livestatus instances (master and 4 realms).
When we launch an arbiter-reload, the broker starts throwing errors from the livestatus module. The Thruk interface becomes unusable, although livestatus still seems to be up.
Here is an example of the traceback in brokerd.log:
[1447026298] ERROR: [broker-1] [Livestatus] Unexpected error during process of request 'GET services\nColumns: accept_passive_checks acknowledged action_url action_url_expanded active_checks_enabled check_command check_interval check_options check_period check_type checks_enabled comments current_attempt current_notification_number description event_handler event_handler_enabled custom_variable_names custom_variable_values execution_time first_notification_delay flap_detection_enabled groups has_been_checked high_flap_threshold host_acknowledged host_action_url_expanded host_active_checks_enabled host_address host_alias host_checks_enabled host_check_type host_latency host_plugin_output host_perf_data host_current_attempt host_check_command host_comments host_groups host_has_been_checked host_icon_image_expanded host_icon_image_alt host_is_executing host_is_flapping host_name host_notes_url_expanded host_notifications_enabled host_scheduled_downtime_depth host_state host_accept_passive_checks host_last_state_change icon_image icon_image_alt icon_image_expanded is_executing is_flapping last_check last_notification last_state_change latency long_plugin_output low_flap_threshold max_check_attempts next_check notes notes_expanded notes_url notes_url_expanded notification_interval notification_period notifications_enabled obsess_over_service percent_state_change perf_data plugin_output process_performance_data retry_interval scheduled_downtime_depth state state_type modified_attributes_list last_time_critical last_time_ok last_time_unknown last_time_warning display_name host_display_name host_custom_variable_names host_custom_variable_values in_check_period in_notification_period host_parents\nFilter: host_has_been_checked = 0\nFilter: host_has_been_checked = 1\nFilter: host_state = 0\nAnd: 2\nOr: 2\nFilter: host_scheduled_downtime_depth = 0\nFilter: host_acknowledged = 0\nAnd: 2\nFilter: has_been_checked = 1\nFilter: state = 1\nAnd: 2\nFilter: has_been_checked = 
1\nFilter: state = 3\nAnd: 2\nFilter: has_been_checked = 1\nFilter: state = 2\nAnd: 2\nOr: 3\nFilter: scheduled_downtime_depth = 0\nFilter: acknowledged = 0\nAnd: 2\nAnd: 4\nOutputFormat: json\nResponseHeader: fixed16\n\n' : 115536
[1447026298] ERROR: [broker-1] [Livestatus] Back trace of this exception: Traceback (most recent call last):
File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 74, in handle_request
return self.handle_request_and_fail(data)
File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 135, in handle_request_and_fail
output, keepalive = query.process_query()
File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 283, in process_query
return self.response.respond()
File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 147, in respond
responselength = 1 + self.get_response_len() # 1 for the final '\n'
File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 142, in get_response_len
if isinstance(rsp, LiveStatusListResponse)
File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 83, in total_len
for generated_data in value:
File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 278, in make_live_data_generator
for value in self.make_live_data_generator2(result, columns, aliases):
File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 224, in make_live_data_generator2
item = next(result)
File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 46, in gen_filtered
for val in values:
File "/var/lib/shinken/modules/livestatus/livestatus_regenerator.py", line 125, in itersorted
yield self.items[id]
KeyError: 115536
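The traceback ends in `itersorted` looking up an id that is no longer in `self.items`. Below is a minimal sketch of how that could happen. It assumes (we have not confirmed this in the module code) that the sorted id list is computed ahead of time and that the reload replaces the items while a generator from an earlier query is still being consumed. `Store`, `_sorted_ids` and the ids used are hypothetical names for illustration only, not the real regenerator classes.

```python
class Store:
    """Hypothetical stand-in for the container used by the regenerator."""

    def __init__(self, items):
        self.items = dict(items)
        # Assumption: ids are sorted once, ahead of the iteration.
        self._sorted_ids = sorted(self.items)

    def itersorted(self):
        # Same shape as the failing line: yield self.items[id]
        for id_ in self._sorted_ids:
            yield self.items[id_]


def simulate_reload_during_iteration():
    store = Store({115535: "svc-a", 115536: "svc-b"})
    gen = store.itersorted()
    first = next(gen)  # a query starts consuming the generator

    # Assumption: the reload swaps in new objects with new ids
    # while the old generator is still alive.
    store.items = {200001: "svc-a", 200002: "svc-b"}

    try:
        next(gen)
    except KeyError as exc:
        return first, exc.args[0]
    return first, None
```

If this is what happens, it would also fit the fact that the error only shows up when queries are in flight during the reload.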
The only workaround we have found is to restart the broker each time we reload the arbiter (and this workaround leads to heavy memory leaks).
To avoid trading one problem for another, we searched and found that our issue could be related to issue #47.
We tried running the GET requests manually while everything is working, and livestatus answers correctly:
echo -e "GET hosts\n\n" | netcat localhost 50000
(Queries on contacts, services, etc. work too.)
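For anyone who wants to repeat the manual check from a script, here is a rough Python equivalent of the netcat command. It is only a sketch: `build_query` and `query_livestatus` are helper names made up for this example, and the host and port are the ones from the command above.

```python
import socket


def build_query(table, headers=()):
    """Build a livestatus query: 'GET <table>', optional header lines,
    then a blank line that terminates the request."""
    lines = ["GET %s" % table]
    lines.extend(headers)
    return "\n".join(lines) + "\n\n"


def query_livestatus(query, host="localhost", port=50000, timeout=10.0):
    """Send one query over TCP and return the raw response as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(query.encode("utf-8"))
        # Close the write side so the server knows the request is complete,
        # like netcat does when its stdin reaches EOF.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(query_livestatus(build_query("hosts")))
```

Running the same script in a loop during an arbiter-reload might help confirm whether the error depends on queries being in flight.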
We also noticed that it seems to occur when Thruk queries livestatus frequently: we never see these errors at night or on weekends. So it might be related to the number of users/operators connected to Thruk.
Any help would be appreciated,
Regards