
[processor/spanmetrics] Fix getting key from cache error #15687


Closed

Conversation

Frapschen
Contributor

Description:
I get an error in my app when it sends spans to an OTel Collector with a spanmetrics processor configured with defaultDimensionsCacheSize:

2022/10/27 02:37:46 rpc error: code = Unknown desc = value not found in metricKeyToDimensions cache by key "amamba\x00amamba.io.api.pipeline.v1alpha1.Pipelines/ReplayPipelineRun\x00SPAN_KIND_SERVER\x00STATUS_CODE_OK\x00cd7b102e-fbc5-4556-a0fd-718298df3de9\x00amamba-system\x00demo-dev-worker-03\x00amamba-apiserver-66c9486b55-5l4n8"

I traced the error to:

func (p *processorImp) getDimensionsByMetricKey(k metricKey) (*pcommon.Map, error) {
	if item, ok := p.metricKeyToDimensions.Get(k); ok {
		if attributeMap, ok := item.(pcommon.Map); ok {
			return &attributeMap, nil
		}
		return nil, fmt.Errorf("type assertion of metricKeyToDimensions attributes failed, the key is %q", k)
	}
	return nil, fmt.Errorf("value not found in metricKeyToDimensions cache by key %q", k)
}

I think the error occurs when the number of added keys exceeds processorImp.defaultDimensionsCacheSize, while processorImp.callSum has no such limit: it keeps every key together with its value. I added a patch that falls back to metricKeyToDimensions.EvictedItems to handle this case.
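Based on that description, the patched lookup would look roughly like the sketch below. This is only a sketch: the EvictedItems map on the cache matches what the diff further down shows, but the overall structure and the second error message are illustrative, not the exact change in this PR.

func (p *processorImp) getDimensionsByMetricKey(k metricKey) (*pcommon.Map, error) {
	// First try the bounded LRU cache.
	if item, ok := p.metricKeyToDimensions.Get(k); ok {
		if attributeMap, ok := item.(pcommon.Map); ok {
			return &attributeMap, nil
		}
		return nil, fmt.Errorf("type assertion of metricKeyToDimensions attributes failed, the key is %q", k)
	}
	// Fall back to entries the LRU has evicted but the cache still remembers.
	if item, ok := p.metricKeyToDimensions.EvictedItems[k]; ok {
		if attributeMap, ok := item.(pcommon.Map); ok {
			return &attributeMap, nil
		}
		return nil, fmt.Errorf("type assertion of evicted metricKeyToDimensions attributes failed, the key is %q", k)
	}
	return nil, fmt.Errorf("value not found in metricKeyToDimensions cache by key %q", k)
}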

@Frapschen requested review from a team and TylerHelmuth on October 27, 2022 08:16
@TylerHelmuth
Member

pinging @albertteoh as code owner

@@ -339,8 +339,13 @@ func (p *processorImp) getDimensionsByMetricKey(k metricKey) (*pcommon.Map, error) {
 		}
 		return nil, fmt.Errorf("type assertion of metricKeyToDimensions attributes failed, the key is %q", k)
 	}
+	if item, ok := p.metricKeyToDimensions.EvictedItems[k]; ok {
Contributor

I thought metricKeyToDimensions.Get(...) already does this?

Contributor Author

Yes, it does use evictedItems, but that still can't explain why this error occurs.

Contributor Author

I debugged for a while, and this time I caught the point:

for key := range p.callSum {
	mCalls := ilm.Metrics().AppendEmpty()
	mCalls.SetName("calls_total")
	mCalls.SetEmptySum().SetIsMonotonic(true)
	mCalls.Sum().SetAggregationTemporality(p.config.GetAggregationTemporality())
	dpCalls := mCalls.Sum().DataPoints().AppendEmpty()
	dpCalls.SetStartTimestamp(pcommon.NewTimestampFromTime(p.startTime))
	dpCalls.SetTimestamp(pcommon.NewTimestampFromTime(time.Now()))
	dpCalls.SetIntValue(p.callSum[key])
	dimensions, err := p.getDimensionsByMetricKey(key)

This loops over p.callSum, which contains every key-value pair accumulated since the collector started, and looks each key up in the cache, but the cache only holds 1000 items by default. Once more than 1000 distinct keys have been built, the next loop over p.callSum will raise this error.
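To illustrate the failure mode in isolation, here is a small self-contained Go sketch (not the processor's code): an unbounded counter map paired with a bounded cache. Once the number of distinct keys exceeds the cache capacity, iterating the counter map hits keys the cache has already evicted.

package main

import "fmt"

func main() {
	const cacheSize = 3 // stand-in for dimensions_cache_size (1000 by default)

	counts := map[string]int{}   // like p.callSum: grows without bound
	cache := map[string]string{} // like metricKeyToDimensions: bounded, evicts oldest here
	var order []string           // insertion order, used to pick an eviction victim

	for i := 0; i < 5; i++ {
		key := fmt.Sprintf("key-%d", i)
		counts[key]++
		if len(cache) == cacheSize {
			// Evict the oldest entry, mimicking the LRU reaching capacity.
			oldest := order[0]
			order = order[1:]
			delete(cache, oldest)
		}
		cache[key] = "dimensions for " + key
		order = append(order, key)
	}

	// Flushing metrics iterates every key in counts; early keys now miss the cache.
	for key := range counts {
		if _, ok := cache[key]; !ok {
			fmt.Printf("value not found in cache by key %q\n", key)
		}
	}
}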

Contributor

Ah yes, I think you're right about the fact that p.callSum accumulates keys indefinitely since startup, unless the user has configured delta temporality.

For cumulative temporality, while I think we should still keep all keys since collector startup (so the counts are correct), I don't think we should loop over this entire set of metric keys. Instead we should only loop over those that relate to the current batch of spans received, by somehow marking a key as "dirty" if it relates to a span's set of metric keys.

What do you think?

I can put together some tests locally to test this theory.
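A rough, self-contained sketch of that "dirty keys" idea; the type, field, and function names below are hypothetical illustrations, not the processor's actual API:

package main

import "fmt"

// Hypothetical aggregator: keep cumulative counts forever, but only emit data
// points for keys touched by the current batch of spans.
type aggregator struct {
	callSum   map[string]int64    // cumulative counts, kept since collector startup
	dirtyKeys map[string]struct{} // keys touched by the current batch only
}

func newAggregator() *aggregator {
	return &aggregator{callSum: map[string]int64{}, dirtyKeys: map[string]struct{}{}}
}

func (a *aggregator) aggregateSpan(key string) {
	a.callSum[key]++
	a.dirtyKeys[key] = struct{}{}
}

func (a *aggregator) buildMetrics() {
	// Loop only over keys from the current batch instead of all of callSum,
	// so their dimensions are still present in the bounded cache.
	for key := range a.dirtyKeys {
		fmt.Printf("calls_total{%s} = %d\n", key, a.callSum[key])
	}
	a.dirtyKeys = map[string]struct{}{} // reset for the next batch
}

func main() {
	a := newAggregator()
	a.aggregateSpan("svcA|opA")
	a.aggregateSpan("svcA|opA")
	a.aggregateSpan("svcB|opB")
	a.buildMetrics() // emits only svcA|opA and svcB|opB

	a.aggregateSpan("svcC|opC")
	a.buildMetrics() // emits only svcC|opC, not the earlier keys
}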

Contributor Author

I think so; we should just loop over the keys that relate to the current batch of spans received. Do I need to close this PR and wait for your PR, or should I fix it in this PR?

Contributor

Please go ahead with the fix 👍🏼

Note that #15710 involves a fairly major refactor of the spanmetrics processor, so I suggest waiting for #15710 to be merged first.

As it's quite a large refactor, it might make more sense to close this PR and create a new one based off the refactored version.

Contributor

@Frapschen FYI #15710 has been merged.
