Description
Summary
When new paths are added to a filestream input, Filebeat may re-ingest the full content of files it has already processed but whose state has since been deleted. This happens because the state is tied to the file being tracked by filestream: once the file is removed, its state is also removed, because clean_removed: true is the current default. The file is therefore treated as new if it is discovered under a different path after being marked as “deleted”.
A primary example is an existing Kubernetes integration being updated to ingest rotated log files, which duplicates the content of every previously rotated file. This issue aims to investigate and define a robust strategy to prevent this re-ingestion, so that updating the monitored paths is seamless for users.
It's similar to #43650.
Background
The issue shows up in the following sequence when our Kubernetes integration is updated:
- Legacy Path: The original configuration monitored /var/log/containers/*, which contains symlinks to the active log file for each container.
- Rotation: On log rotation, kubelet renames the active file (e.g., log.0 becomes log.0.timestamp). Filebeat, tracking the symlink, correctly registers this event as a file "deletion".
- State Purge: With the default clean_removed: true setting, Filebeat removes the state for this "deleted" file from its registry to prevent indefinite growth.
- Data Re-ingestion: When the configuration is updated to monitor /var/log/pods/*, Filebeat discovers the rotated files (e.g., log.0.timestamp). As there is no state associated with this specific path in its registry, it treats them as new files and ingests their entire content again.
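The sequence above can be reproduced with a toy model. This is only a sketch under simplifying assumptions: the class PathKeyedRegistry, the keys "file-1" and "file-2", and the dict-based store are hypothetical and do not reflect Filebeat's actual registry format. It shows only that dropping the state of a file that is no longer visible causes a full re-read once the file becomes visible again.

```python
from typing import Dict, List


class PathKeyedRegistry:
    """Toy registry (hypothetical): maps a file key to the number of bytes read."""

    def __init__(self, clean_removed: bool = True):
        self.clean_removed = clean_removed
        self.offsets: Dict[str, int] = {}

    def scan(self, visible_files: Dict[str, str]) -> List[str]:
        """visible_files maps a file key to the content currently visible."""
        if self.clean_removed:
            # Drop the state of every tracked file that is no longer visible.
            for key in list(self.offsets):
                if key not in visible_files:
                    del self.offsets[key]
        ingested = []
        for key, content in visible_files.items():
            start = self.offsets.get(key, 0)
            if start < len(content):
                ingested.append(content[start:])
            self.offsets[key] = len(content)
        return ingested


registry = PathKeyedRegistry(clean_removed=True)
# Legacy path: only the active file is visible through the symlink.
first = registry.scan({"file-1": "a\nb\n"})
# Rotation: the symlink now points at a fresh file; file-1 looks deleted.
second = registry.scan({"file-2": "c\n"})
# Config update: the rotated file becomes visible again under the new path.
third = registry.scan({"file-1": "a\nb\n", "file-2": "c\n"})
```

In this model, the third scan returns the whole content of the rotated file, which is the duplication described above.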
Proposed Avenues for Investigation
We have identified two potential approaches to solve this issue.
1. Persist State for Removed Files (Preferred)
This is considered the most robust and promising solution.
Concept: The core idea is to modify filestream’s registry to retain the state of files that have been marked as deleted (e.g., because they were moved out of the monitored paths). Because the fingerprint identity does not depend on the file's path, it can recognise a file that was previously tracked, was marked as deleted, and later appears under a new path. If a file with a known fingerprint reappears, its previous state (offset) can be restored, and ingestion can resume from the correct position.
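A minimal sketch of this concept follows, assuming the fingerprint is a SHA-256 hash over a fixed-size prefix of the file. The class FingerprintRegistry, its fields, and the paths used are hypothetical names for illustration, not filestream's implementation. State for a file that disappears is flagged as removed instead of being deleted, so a later scan that finds the same fingerprint at another path resumes from the stored offset.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(content: bytes, length: int) -> Optional[str]:
    """Hash the first `length` bytes; shorter files get no identity yet."""
    if len(content) < length:
        return None
    return hashlib.sha256(content[:length]).hexdigest()


class FingerprintRegistry:
    """Hypothetical registry keyed by fingerprint that keeps removed states."""

    def __init__(self, length: int = 1024):
        self.length = length
        self.states: Dict[str, dict] = {}

    def scan(self, visible_files: Dict[str, bytes]) -> Dict[str, bytes]:
        seen = set()
        ingested: Dict[str, bytes] = {}
        for path, content in visible_files.items():
            fp = fingerprint(content, self.length)
            if fp is None:
                continue  # too small to fingerprint, try again on a later scan
            seen.add(fp)
            state = self.states.setdefault(fp, {"offset": 0, "removed": False})
            state["removed"] = False  # the file is visible (again)
            new_data = content[state["offset"]:]
            if new_data:
                ingested[path] = new_data
            state["offset"] = len(content)
        for fp, state in self.states.items():
            if fp not in seen:
                state["removed"] = True  # retained instead of deleted
        return ingested


# A short prefix length keeps the demo small; it is not a recommended value.
registry = FingerprintRegistry(length=8)
old = b"line-001\nline-002\n"
new = b"line-003\n"
registry.scan({"/var/log/containers/app.log": old})
registry.scan({"/var/log/containers/app.log": new})  # rotation: old disappears
after_update = registry.scan({
    "/var/log/pods/app/log.0": new,
    "/var/log/pods/app/log.0.timestamp": old,
})
resumed = registry.scan({
    "/var/log/pods/app/log.0": new,
    "/var/log/pods/app/log.0.timestamp": old + b"late\n",
})
```

Here after_update is empty because both fingerprints are already known, and resumed contains only the bytes appended to the rotated file. The sketch never purges anything, which is exactly the unbounded growth listed among the challenges.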
Challenges:
- A new, sophisticated state management strategy is required to allow state to be kept for deleted files and to eventually purge this historical state to prevent the indefinite growth of the registry.
- This approach is fundamentally incompatible with the legacy behaviour enabled by clean_inactive: 0 and legacy_clean_inactive: true. A way for the two to coexist must be found.
- Using clean_inactive instead of clean_removed is not a viable option. The effectiveness of clean_inactive depends on environment-specific tuning, and it's impossible to establish a one-size-fits-all default value that would work reliably. Furthermore, any proposed solution must be compatible with the default behaviour, which is clean_removed: true.
2. New ignore_older_for_path Configuration
This approach was considered as a potential mitigation but has significant drawbacks.
Concept: Introduce a new configuration directive where one can specify a path glob and a timestamp. Filebeat would ignore any file matching the glob if its creation time is before the given timestamp.
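The check this directive would perform could look roughly like the sketch below. The function should_ignore and the rule format (a list of glob and cutoff pairs) are assumptions made for illustration, since no syntax has been defined for ignore_older_for_path. Creation times are passed in explicitly because not every filesystem exposes them.

```python
from datetime import datetime, timezone
from fnmatch import fnmatchcase
from typing import List, Tuple

Rule = Tuple[str, datetime]  # (path glob, cutoff timestamp), hypothetical format


def should_ignore(path: str, created_at: datetime, rules: List[Rule]) -> bool:
    """Ignore a file if it matches a glob and was created before that cutoff."""
    return any(
        fnmatchcase(path, glob) and created_at < cutoff
        for glob, cutoff in rules
    )


cutoff = datetime(2025, 1, 1, tzinfo=timezone.utc)
rules = [("/var/log/pods/*", cutoff)]

# A file rotated before the cutoff is skipped.
old_rotated = should_ignore(
    "/var/log/pods/app/log.0.timestamp",
    datetime(2024, 12, 31, tzinfo=timezone.utc),
    rules,
)
# A file rotated after the cutoff is not skipped, whether or not it was
# already ingested through the old path.
late_rotated = should_ignore(
    "/var/log/pods/app/log.0.timestamp",
    datetime(2025, 1, 2, tzinfo=timezone.utc),
    rules,
)
```

Note that fnmatchcase lets * match path separators, which is a simplification compared with real path globbing.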
Drawbacks:
- Incomplete Solution: It only mitigates the issue. A small risk of re-ingestion remains for files that are rotated after the specified timestamp but before Filebeat restarts.
- Lacks Automation: There is currently no mechanism to dynamically set the current timestamp when generating an integration policy, making it incompatible with our requirement for a fully automatic, transparent user experience.
Requirements for Investigation
A successful solution must meet the following requirements:
- Automatic Prevention: The solution must automatically prevent the re-ingestion of previously processed files when new monitoring paths are added to a filestream input.
- User Transparency for Integrations: The mechanism must be entirely transparent when Filebeat runs under Elastic Agent (e.g., the Kubernetes integration), requiring no manual configuration or intervention.
- Fingerprint-Based Identity: This state persistence mechanism must apply only to inputs that use the fingerprint file identity.
- State Persistence and Updates: When a file is no longer on a monitored path, its fingerprint-keyed state must be retained in the registry for a configurable period and continue to be updated as events are acknowledged by the output.
- State Restoration: If a file is discovered (potentially at a new path) whose fingerprint matches the retained state of a deleted file, Filebeat must resume ingestion from the last known offset associated with that fingerprint.
- Bounded State Store: A mechanism must exist to prevent the registry from growing indefinitely with historical fingerprint states.
- Configurable State TTL: The solution should include a time-based eviction policy for states for deleted files. This would purge the state for a deleted file if no corresponding file has been seen for a configurable duration.
- Performance: The mechanism for managing and loading states for removed files must keep the retained state small enough that loading it into memory remains feasible.
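The bounded-store and TTL requirements could be combined along the lines of the sketch below. Everything here is an assumption for illustration: the RetainedState fields, the evict function, and the max_removed cap are hypothetical names, and the issue does not specify whether a size cap should complement the TTL.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RetainedState:
    offset: int
    last_seen: float       # time of the last scan that saw a matching file
    removed: bool = False  # True once the file left the monitored paths


def evict(states: Dict[str, RetainedState], now: float, ttl: float,
          max_removed: Optional[int] = None) -> List[str]:
    """Purge removed states older than `ttl`, then enforce an optional cap."""
    evicted = [fp for fp, s in states.items()
               if s.removed and now - s.last_seen > ttl]
    for fp in evicted:
        del states[fp]
    if max_removed is not None:
        # Oldest removed states go first; states of live files are never touched.
        removed = sorted((fp for fp, s in states.items() if s.removed),
                         key=lambda fp: states[fp].last_seen)
        overflow = max(len(removed) - max_removed, 0)
        for fp in removed[:overflow]:
            del states[fp]
            evicted.append(fp)
    return evicted


states = {
    "a": RetainedState(offset=10, last_seen=0.0, removed=True),
    "b": RetainedState(offset=20, last_seen=50.0, removed=True),
    "c": RetainedState(offset=30, last_seen=80.0, removed=True),
    "live": RetainedState(offset=5, last_seen=0.0),
}
evicted = evict(states, now=100.0, ttl=60.0, max_removed=1)
```

With these inputs, "a" is purged by the TTL and "b" by the cap, while the state of the live file is kept regardless of its age.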
Acceptance Criteria
The primary use case driving this investigation is the Kubernetes integration upgrade. The solution will be considered successful if the following scenario holds:
- GIVEN an existing Kubernetes integration configured to monitor /var/log/containers/*.
- AND the cluster contains pods with log files that have already been rotated and now reside in /var/log/pods/*.
- WHEN the integration policy is updated to include the /var/log/pods/* path.
- THEN Filebeat must ingest only new data from both the active and rotated log files.
- AND the full content of the previously rotated log files must NOT be ingested again.