
Commit e6238bc

khalillilahkjriguera
authored and committed
Add ability to mute all errors (mainly due to access rights) coming from process scraper of the hostmetricsreceiver (open-telemetry#34981)
**Description:** The `process` scraper in the `hostmetricsreceiver` currently produces a large number of verbose error logs, primarily because the collector lacks access rights to certain processes, such as system processes. Most of these errors come from the `process.open_file_descriptors` metric, but other metrics are affected as well. To address this, we added a flag, `mute_process_all_errors`, that mutes errors coming from the process scraper metrics, as these errors are predominantly associated with processes that we should not be monitoring anyway.

**Link to tracking Issue:** open-telemetry#20435

**Testing:** Added unit tests

**Documentation:**

**Errors**:

- Permission denied errors:

```
go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:176 2024-09-02T17:24:10.341+0200 error scraping metrics {"kind": "receiver", "name": "hostmetrics/linux/localhost", "data_type": "metrics", "error": "error reading open file descriptor count for process \"systemd\" (pid 1): open /proc/1/fd: permission denied;
```

- File not found errors:

```
go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:176 2024-09-02T17:25:38.688+0200 error scraperhelper/scrapercontroller.go:200 Error scraping metrics {"kind": "receiver", "name": "hostmetrics/process", "data_type": "metrics", "error": "error reading cpu times for process \"java\" (pid 466650): open /proc/466650/stat: no such file or directory; error reading memory info for process \"java\" (pid 466650): open /proc/466650/statm: no such file or directory; error reading thread info for process \"java\" (pid 466650): open /proc/466650/status: no such file or directory; error reading cpu times for process \"java\" (pid 474774): open /proc/474774/stat: no such file or directory; error reading memory info for process \"java\" (pid 474774): open /proc/474774/statm: no such file or directory; error reading thread info for process \"java\" (pid 474774): open /proc/474774/status: no such file or directory; error reading cpu times for process \"java\" (pid 481780): open /proc/481780/stat: no such file or directory; error reading memory info for process \"java\" (pid 481780): open /proc/481780/statm: no such file or directory; error reading thread info for process \"java\" (pid 481780): open /proc/481780/status: no such file or directory", "scraper": "process"}
```

**Config**:

```
receivers:
  hostmetrics/process:
    collection_interval: ${PROCESSES_COLLECTION_INTERVAL}s
    scrapers:
      process:
        mute_process_name_error: true
        mute_process_exe_error: true
        mute_process_io_error: true
        mute_process_user_error: true
        resource_attributes:
          # disable non_used default attributes
          process.command:
            enabled: false
          process.command_line:
            enabled: false
          process.executable.path:
            enabled: false
          process.owner:
            enabled: false
          process.parent_pid:
            enabled: false
        metrics:
          # disable non-used default metrics
          process.cpu.time:
            enabled: false
          process.memory.virtual:
            enabled: false
          # enable used optional metrics
          process.cpu.utilization:
            enabled: true
          process.open_file_descriptors:
            enabled: true
          process.threads:
            enabled: true
```
1 parent 89f054a commit e6238bc

5 files changed, +77 -10 lines changed

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Use this changelog template to create an entry for release notes.
+
+# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
+change_type: enhancement
+
+# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
+component: hostmetricsreceiver
+
+# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
+note: Add ability to mute all errors (mainly due to access rights) coming from process scraper of the hostmetricsreceiver
+
+# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
+issues: [20435]
+
+# (Optional) One or more lines of additional information to render under the primary note.
+# These lines will be padded with 2 spaces and then inserted directly into the document.
+# Use pipe (|) for multiline entries.
+subtext:
+
+# If your change doesn't affect end users or the exported elements of any package,
+# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
+# Optional: The change log or logs in which this entry should be included.
+# e.g. '[user]' or '[user, api]'
+# Include 'user' if the change is relevant to end users.
+# Include 'api' if there is a change to a library API.
+# Default: '[user]'
+change_logs: []

receiver/hostmetricsreceiver/README.md

Lines changed: 7 additions & 6 deletions
@@ -114,6 +114,7 @@ process:
   <include|exclude>:
     names: [ <process name>, ... ]
     match_type: <strict|regexp>
+  mute_process_all_errors: <true|false>
   mute_process_name_error: <true|false>
   mute_process_exe_error: <true|false>
   mute_process_io_error: <true|false>
@@ -123,12 +124,12 @@ process:
 ```
 
 The following settings are optional:
-
-- `mute_process_name_error` (default: false): mute the error encountered when trying to read a process name the collector does not have permission to read
-- `mute_process_io_error` (default: false): mute the error encountered when trying to read IO metrics of a process the collector does not have permission to read
-- `mute_process_cgroup_error` (default: false): mute the error encountered when trying to read the cgroup of a process the collector does not have permission to read
-- `mute_process_exe_error` (default: false): mute the error encountered when trying to read the executable path of a process the collector does not have permission to read (Linux only)
-- `mute_process_user_error` (default: false): mute the error encountered when trying to read a uid which doesn't exist on the system, eg. is owned by a user that only exists in a container.
+- `mute_process_all_errors` (default: false): mute all the errors encountered when trying to read metrics of a process. When this flag is enabled, there is no need to activate any other error suppression flags.
+- `mute_process_name_error` (default: false): mute the error encountered when trying to read a process name the collector does not have permission to read. This flag is ignored when `mute_process_all_errors` is set to true as all errors are muted.
+- `mute_process_io_error` (default: false): mute the error encountered when trying to read IO metrics of a process the collector does not have permission to read. This flag is ignored when `mute_process_all_errors` is set to true as all errors are muted.
+- `mute_process_cgroup_error` (default: false): mute the error encountered when trying to read the cgroup of a process the collector does not have permission to read. This flag is ignored when `mute_process_all_errors` is set to true as all errors are muted.
+- `mute_process_exe_error` (default: false): mute the error encountered when trying to read the executable path of a process the collector does not have permission to read (Linux only). This flag is ignored when `mute_process_all_errors` is set to true as all errors are muted.
+- `mute_process_user_error` (default: false): mute the error encountered when trying to read a uid which doesn't exist on the system, eg. is owned by a user that only exists in a container. This flag is ignored when `mute_process_all_errors` is set to true as all errors are muted.
 
 ## Advanced Configuration
 
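
The precedence described in the README text can be modeled with a short sketch. Everything here is hypothetical: `MuteConfig`, `ErrorKind`, and `IsMuted` are illustrative names that do not exist in the receiver, and the sketch only mirrors the documented rule that the per-error flags are ignored once `mute_process_all_errors` is true.

```go
package main

import "fmt"

// MuteConfig is a hypothetical, trimmed-down stand-in for the scraper settings
// documented in the README. It is not the receiver's actual Config type.
type MuteConfig struct {
	MuteProcessAllErrors   bool
	MuteProcessNameError   bool
	MuteProcessIOError     bool
	MuteProcessCgroupError bool
	MuteProcessExeError    bool
	MuteProcessUserError   bool
}

// ErrorKind is an illustrative label for the category of a read failure.
type ErrorKind int

const (
	NameError ErrorKind = iota
	IOError
	CgroupError
	ExeError
	UserError
	OtherError // any failure without a dedicated flag
)

// IsMuted reports whether an error of the given kind is hidden.
// The catch-all flag is checked first, so the specific flags only matter
// when MuteProcessAllErrors is false.
func (c MuteConfig) IsMuted(kind ErrorKind) bool {
	if c.MuteProcessAllErrors {
		return true
	}
	switch kind {
	case NameError:
		return c.MuteProcessNameError
	case IOError:
		return c.MuteProcessIOError
	case CgroupError:
		return c.MuteProcessCgroupError
	case ExeError:
		return c.MuteProcessExeError
	case UserError:
		return c.MuteProcessUserError
	default:
		return false
	}
}

func main() {
	onlyName := MuteConfig{MuteProcessNameError: true}
	all := MuteConfig{MuteProcessAllErrors: true}

	fmt.Println(onlyName.IsMuted(NameError), onlyName.IsMuted(OtherError)) // true false
	fmt.Println(all.IsMuted(NameError), all.IsMuted(OtherError))           // true true
}
```

In this model, a failure with no dedicated flag (such as the open file descriptor errors quoted in the commit message) can only be silenced through the catch-all flag.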

receiver/hostmetricsreceiver/internal/scraper/processscraper/config.go

Lines changed: 12 additions & 3 deletions
@@ -22,29 +22,38 @@ type Config struct {
 	Include MatchConfig `mapstructure:"include"`
 	Exclude MatchConfig `mapstructure:"exclude"`
 
+	// MuteProcessAllErrors is a flag that will mute all the errors encountered when trying to read metrics of a process.
+	// When this flag is enabled, there is no need to activate any other error suppression flags.
+	MuteProcessAllErrors bool `mapstructure:"mute_process_all_errors,omitempty"`
+
 	// MuteProcessNameError is a flag that will mute the error encountered when trying to read a process name the
 	// collector does not have permission to read.
 	// See https://github.com/open-telemetry/opentelemetry-collector/issues/3004 for more information.
+	// This flag is ignored when MuteProcessAllErrors is set to true as all errors are muted.
 	MuteProcessNameError bool `mapstructure:"mute_process_name_error,omitempty"`
 
 	// MuteProcessIOError is a flag that will mute the error encountered when trying to read IO metrics of a process
 	// the collector does not have permission to read.
+	// This flag is ignored when MuteProcessAllErrors is set to true as all errors are muted.
 	MuteProcessIOError bool `mapstructure:"mute_process_io_error,omitempty"`
 
 	// MuteProcessCgroupError is a flag that will mute the error encountered when trying to read the cgroup of a process
 	// the collector does not have permission to read.
+	// This flag is ignored when MuteProcessAllErrors is set to true as all errors are muted.
 	MuteProcessCgroupError bool `mapstructure:"mute_process_cgroup_error,omitempty"`
 
 	// MuteProcessExeError is a flag that will mute the error encountered when trying to read the executable path of a process
-	// the collector does not have permission to read (Linux)
+	// the collector does not have permission to read (Linux).
+	// This flag is ignored when MuteProcessAllErrors is set to true as all errors are muted.
 	MuteProcessExeError bool `mapstructure:"mute_process_exe_error,omitempty"`
 
 	// MuteProcessUserError is a flag that will mute the error encountered when trying to read uid which
-	// doesn't exist on the system, eg. is owned by user existing in container only
+	// doesn't exist on the system, eg. is owned by user existing in container only.
+	// This flag is ignored when MuteProcessAllErrors is set to true as all errors are muted.
 	MuteProcessUserError bool `mapstructure:"mute_process_user_error,omitempty"`
 
 	// ScrapeProcessDelay is used to indicate the minimum amount of time a process must be running
-	// before metrics are scraped for it. The default value is 0 seconds (0s)
+	// before metrics are scraped for it. The default value is 0 seconds (0s).
 	ScrapeProcessDelay time.Duration `mapstructure:"scrape_process_delay"`
 }
 
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper.go

Lines changed: 4 additions & 0 deletions
@@ -187,6 +187,10 @@ func (s *scraper) scrape(ctx context.Context) (pmetric.Metrics, error) {
 		}
 	}
 
+	if s.config.MuteProcessAllErrors {
+		return s.mb.Emit(), nil
+	}
+
 	return s.mb.Emit(), errs.Combine()
 }
 

receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper_test.go

Lines changed: 27 additions & 1 deletion
@@ -1003,6 +1003,7 @@ func TestScrapeMetrics_MuteErrorFlags(t *testing.T) {
 		muteProcessExeError  bool
 		muteProcessIOError   bool
 		muteProcessUserError bool
+		muteProcessAllErrors bool
 		skipProcessNameError bool
 		omitConfigField      bool
 		expectedError        string
@@ -1093,6 +1094,30 @@ func TestScrapeMetrics_MuteErrorFlags(t *testing.T) {
 				return 4
 			}(),
 		},
+		{
+			name:                 "All Process Errors Muted",
+			muteProcessNameError: false,
+			muteProcessExeError:  false,
+			muteProcessIOError:   false,
+			muteProcessUserError: false,
+			muteProcessAllErrors: true,
+			expectedCount:        0,
+		},
+		{
+			name:                 "Process User Error Enabled and All Process Errors Muted",
+			muteProcessUserError: false,
+			skipProcessNameError: true,
+			muteProcessExeError:  true,
+			muteProcessNameError: true,
+			muteProcessAllErrors: true,
+			expectedCount: func() int {
+				if runtime.GOOS == "darwin" {
+					// disk.io is not collected on darwin
+					return 3
+				}
+				return 4
+			}(),
+		},
 	}
 
 	for _, test := range testCases {
@@ -1106,6 +1131,7 @@ func TestScrapeMetrics_MuteErrorFlags(t *testing.T) {
 				config.MuteProcessExeError = test.muteProcessExeError
 				config.MuteProcessIOError = test.muteProcessIOError
 				config.MuteProcessUserError = test.muteProcessUserError
+				config.MuteProcessAllErrors = test.muteProcessAllErrors
 			}
 			scraper, err := newProcessScraper(receivertest.NewNopSettings(), config)
 			require.NoError(t, err, "Failed to create process scraper: %v", err)
@@ -1135,7 +1161,7 @@ func TestScrapeMetrics_MuteErrorFlags(t *testing.T) {
 
 			assert.Equal(t, test.expectedCount, md.MetricCount())
 
-			if config.MuteProcessNameError && config.MuteProcessExeError && config.MuteProcessUserError {
+			if (config.MuteProcessNameError && config.MuteProcessExeError && config.MuteProcessUserError) || config.MuteProcessAllErrors {
 				assert.NoError(t, err)
 			} else {
 				assert.EqualError(t, err, test.expectedError)
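
The updated assertion condition can be checked in isolation with a small table-driven sketch. The `flags` type and `expectNoError` helper are hypothetical and exist only for this illustration; the boolean expression is copied from the test diff.

```go
package main

import "fmt"

// flags is a hypothetical subset of the test-case fields used in the
// assertion at the end of TestScrapeMetrics_MuteErrorFlags.
type flags struct {
	name, exe, user, all bool
}

// expectNoError reproduces the condition the updated test uses to choose
// between assert.NoError and assert.EqualError.
func expectNoError(f flags) bool {
	return (f.name && f.exe && f.user) || f.all
}

func main() {
	cases := []struct {
		label string
		f     flags
	}{
		{"nothing muted", flags{}},
		{"name, exe and user muted", flags{name: true, exe: true, user: true}},
		{"only all muted", flags{all: true}},
		{"name and exe muted plus all", flags{name: true, exe: true, all: true}},
	}
	for _, c := range cases {
		fmt.Printf("%-30s expect no error: %v\n", c.label, expectNoError(c.f))
	}
}
```

The last two rows correspond to the two test cases added in this commit: with the catch-all flag set, no error is expected even though `muteProcessUserError` is false.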
