Simple karma points to exclude collectors that keep crashing #50

@tsuna

Sometimes a new collector gets deployed and it doesn't work or, more commonly, it only works on a small subset of hosts and doesn't properly exit(13) on the hosts where it's not supposed to run. It would be nice to have a dead-simple karma point system:

  • When the collector is first discovered and first started, it gets X karma points.
  • Each time the collector crashes, it loses C karma points.
  • Every N seconds that elapse, the collector gains G karma points, up to an upper bound of Gmax points.
  • Whenever a collector crashes, we check its karma. If it's negative, we mark it as dead and don't restart it anymore.

The idea is that if a collector crashes too often, we want to give up on it, instead of spamming the logs. But if a collector has been up for a while, and all of a sudden it starts crashing a few times in a row, it's worth trying some more before giving up on it.
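
To make the rules above concrete, here is a minimal sketch of how the bookkeeping could look. It is not tcollector code: the `Karma` class, its parameter names, and the sample values are all hypothetical. It also assumes the crash penalty is deducted before the "is it negative?" check, and that a dead collector stops accruing points; the proposal leaves both open.

```python
import time


class Karma:
    """Karma bookkeeping for one collector (hypothetical helper, not tcollector's API)."""

    def __init__(self, initial, crash_cost, gain, gain_interval, max_karma, now=None):
        self.points = initial                # X: points granted on first start
        self.crash_cost = crash_cost         # C: points lost per crash
        self.gain = gain                     # G: points gained per interval
        self.gain_interval = gain_interval   # N: interval length in seconds
        self.max_karma = max_karma           # Gmax: cap on points earned over time
        self.dead = False
        self._last_gain = time.monotonic() if now is None else now

    def _accrue(self, now):
        """Credit G points for every full N-second interval elapsed so far."""
        intervals = int((now - self._last_gain) // self.gain_interval)
        if intervals > 0:
            capped = min(self.max_karma, self.points + intervals * self.gain)
            # Never lower the balance, even if X was configured above Gmax.
            self.points = max(self.points, capped)
            self._last_gain += intervals * self.gain_interval

    def on_crash(self, now=None):
        """Record a crash. Returns True if the collector should be restarted."""
        if self.dead:
            return False
        now = time.monotonic() if now is None else now
        self._accrue(now)
        self.points -= self.crash_cost   # assumption: deduct first, then check
        if self.points < 0:
            self.dead = True             # give up: no more restarts
        return not self.dead


# Illustrative values only: X=3, C=2, G=1 point per N=60 seconds, Gmax=10.
fresh = Karma(initial=3, crash_cost=2, gain=1, gain_interval=60, max_karma=10, now=0)
print([fresh.on_crash(now=t) for t in (0, 1)])
# [True, False]

veteran = Karma(initial=3, crash_cost=2, gain=1, gain_interval=60, max_karma=10, now=0)
print([veteran.on_crash(now=3600 + i) for i in range(6)])
# [True, True, True, True, True, False]
```

With these sample values, a freshly deployed collector is abandoned on its second quick crash, while one that has been up for an hour has reached the cap and gets five restarts before being marked dead.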
