Skip to content

Conversation

@tbg
Copy link
Member

@tbg tbg commented May 16, 2023

Extracted from #99191, where I ran a bunch of experiments to make sure the logging around connection failures is clean and helpful.

I did manual testing on the changes in this PR. It's tricky to test vmodule-depth-thingies programmatically and I don't want to spend time building new infra here.

  • cli: don't lower gRPC severity threshold
  • grpcutil: adopt DepthLoggerV2
  • sidetransport: suppress logging when peer down

Release note: None
Epic: CRDB-21710

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg force-pushed the grpc-logging-improvements branch from 3db58c5 to 3c8f67e Compare May 17, 2023 17:17
@tbg tbg marked this pull request as ready for review May 17, 2023 19:51
@tbg tbg requested a review from a team May 17, 2023 19:51
@tbg tbg requested review from a team as code owners May 17, 2023 19:51
@tbg tbg requested review from a team and aliher1911 May 17, 2023 19:51
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)


pkg/util/grpcutil/grpc_log.go line 120 at r1 (raw file):

	// has a multi-line JSON object in it (but not in any structured way),
	// so there's really no way to work with it. Best we can do is throw
	// away the many newlines.

Q: why throw away the newlines? how do they hurt us?


pkg/util/grpcutil/grpc_log.go line 123 at r1 (raw file):

	sl := make([]interface{}, 0, len(args))
	for _, arg := range args {
		sl = append(sl, strings.Replace(fmt.Sprint(arg), "\n", "", -1))

nit: replace sequences of multiple newlines with a single space, to avoid two separate words from being concatenated together.

tbg added 3 commits May 18, 2023 16:09
grpc has a `DepthLogger` interface in which the methods called get told about
the number of stackframes to skip to get the true caller. This takes a lot of
guesswork (which had since gone stale anyway) out of our grpc logger
implementation.

Now we'll (sort of) reliably get the log lines annotated with the stack frames
from which they came.

The old interface still needs to be supported, but it's no longer called, so
I simplified it down to something that still logs but makes no attempt to get
the "correct" caller.

Since I was here I couldn't help but also massage the output a bit.
Unfortunately gRPC gives us totally unstructured logging, so I can't add
redaction or anything of that sort, but I at least removed newlines so that
grpc logs are more compact and easier to grep out.

I also improved vmodule so that you can opt into *only* gRPC warnings and
errors, etc, on a per-file basis (3=INFO, 2=WARNING, 1=ERROR).

Epic: none
Release note: None

squash gRPC logging rework

- remove warning output filter
- force messages into single line to avoid blobs
- rework vmodule logging to make it more useful
It's just noise. `rpc.Context` already logs the connection errors for us.

Epic: none
Release note: None
It's duplicating info already logged by the rpc.Context.

Epic: none
Release note: None
@tbg tbg force-pushed the grpc-logging-improvements branch from 3c8f67e to 004b471 Compare May 18, 2023 14:09
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR, appreciate it!

bors r=knz

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @knz)


pkg/util/grpcutil/grpc_log.go line 120 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Q: why throw away the newlines? how do they hurt us?

I find these log messages pretty irritating to work with since they are difficult to grep out (merge-logs makes it better, but that's not always at hand). The details printed on these lines also barely add any detail but consume a lot of vertical real estate.

@craig
Copy link
Contributor

craig bot commented May 18, 2023

Build failed:

@tbg
Copy link
Member Author

tbg commented May 19, 2023

bors r=knz

bors failure is

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/115b56b7b7bd2dfb6f4ea1f5f6513edb/logTestStatusEngineStatsJson1920103392
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:280: using admin URL https://127.0.0.1:34309/
    status_test.go:287: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/115b56b7b7bd2dfb6f4ea1f5f6513edb/logTestStatusEngineStatsJson1920103392
--- FAIL: TestStatusEngineStatsJson (64.59s)

I was able to reproduce this on the merge-base 5af7d05 via

./dev test --stress ./pkg/server/ --filter TestStatusEngineStatsJson:

--- FAIL: TestStatusEngineStatsJson (68.44s)
    test_log_scope.go:161: test logs captured to: /Users/tobias/go/src/github.com/cockroachdb/cockroach/tmp/_tmp/eeba368469fff078dbc179626aa9b86d/logTestStatusEngineStatsJson2601962448
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:280: using admin URL https://127.0.0.1:51397
    status_test.go:287: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
FAIL
I230519 16:09:42.012033 1 (gostd) testmain.go:528  [-] 1  Test //pkg/server:server_test exited with error code 1


ERROR: exit status 1

58 runs completed, 1 failures, over 2m39s

So it isn't introduced in this PR. Still, it's topical to what I'm working on (#99191) so I'll double check what's going on here. This strikes me as completely unexpected in two ways: first, the connection should "just work", second, it shouldn't be unhealthy for >45s, and third, we're not supposed to see "connection unexpectedly re-dialed" pretty much ever, or so I thought.

Touches #99261
Touches #95491

@craig craig bot merged commit 67ff506 into cockroachdb:master May 19, 2023
@craig
Copy link
Contributor

craig bot commented May 19, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants