-
Notifications
You must be signed in to change notification settings - Fork 111
Added functionality for export of failure logs #591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: telemetry
Are you sure you want to change the base?
Conversation
Signed-off-by: Sai Shree Pradhan <[email protected]>
raise Error("Cannot create cursor from closed connection") | ||
raise Error( | ||
"Cannot create cursor from closed connection", | ||
connection_uuid=self.get_session_id_hex(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there plans to include the statement ID too across this PR?
Let's elaborate on testing details, maybe include a redacted log JSON ? |
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
From your JSON, can we skip null fields while building the telemetry request? |
Signed-off-by: Sai Shree Pradhan <[email protected]>
…r operations Signed-off-by: Sai Shree Pradhan <[email protected]>
…ze and get telemetry client Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds enhanced error logging functionality by exporting failure logs via telemetry and integrating the export call into the Error class. Key changes include:
- New export_failure_log method in TelemetryClient with corresponding tests.
- Enhanced exception raising throughout the code to include a connection_uuid for improved traceability.
- Updates to JSON conversion utilities to use a more compact form for telemetry events.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
tests/unit/test_telemetry.py | Adds test cases for the new export_failure_log functionality (note: duplicate test functions detected). |
src/databricks/sql/thrift_backend.py | Propagates connection_uuid through error handling for improved error context. |
src/databricks/sql/telemetry/utils.py | Introduces helper functions to filter None values and convert dataclasses to compact JSON. |
src/databricks/sql/telemetry/telemetry_client.py | Implements telemetry export for failure logs and adds try/except blocks to handle export errors. |
src/databricks/sql/telemetry/models/* | Updates to_json methods to use the new compact JSON conversion. |
src/databricks/sql/exc.py & src/databricks/sql/client.py | Updates Error and client exception raises to include connection_uuid in error messages. |
Comments suppressed due to low confidence (2)
tests/unit/test_telemetry.py:86
- Duplicate test function 'test_export_failure_log' detected; consider renaming or consolidating the two tests to avoid test overrides.
def test_export_failure_log(self, noop_telemetry_client):
src/databricks/sql/telemetry/telemetry_client.py:395
- [nitpick] The assignment syntax in the TelemetryClientFactory initialization appears confusing due to extra bracket indentation; please review and simplify the assignment for clarity.
TelemetryClientFactory._clients[connection_uuid] = TelemetryClient(
host_url=host_url, | ||
executor=TelemetryClientFactory._executor, | ||
with TelemetryClientFactory._lock: | ||
if connection_uuid not in TelemetryClientFactory._clients: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually not the connection_uuid but the hex right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the session id of ThriftBackend converted to a hex string. Every connection has only one session, so I used it as a connection identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes however let's not use the variable name connection_uuid
since that is misleading, we should name it exactly what it is
@@ -83,6 +83,12 @@ def test_export_initial_telemetry_log(self, noop_telemetry_client): | |||
driver_connection_params=MagicMock(), user_agent="test" | |||
) | |||
|
|||
def test_export_failure_log(self, noop_telemetry_client): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test and the test at line 141 have the same name, let's dedup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it's alright since they are in different classes? Have done the same thing for test_export_initial_telemetry_log and test_close. Should I change these too?
Signed-off-by: Sai Shree Pradhan <[email protected]>
cls._executor.shutdown( | ||
wait=True | ||
) # This waits for all submitted work to complete | ||
logger.debug("Thread pool shutdown completed successfully") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
misleading log message, let's add logs appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry I don't quite understand. Do you mean the submitted work make have completed but with a failure so I shouldn't add "successfully" in the log message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 things here :
- We can mention that telemetry client has been shut down successfully.
- Telemetry logs should not be too verbose; debug/error should be avoided as much as possible.
Another thing that I noticed across this PR is : let's not try to 1:1 replicate what we have in JDBC telemetry. We can follow python best practices instead. For example :
- you can explore using a mixin base class or a utility function to make JSON serialization reusable in all the telemetry models that you have - that will make the code prettier.
- Another thing is :
TelemetryClientFactory
is overly Java specific, we could dofactory functions
instead
Another thing that is missing in this PR is coverage. Can we add coverage on the PR description?
src/databricks/sql/client.py
Outdated
@@ -1156,7 +1181,10 @@ def fetchall(self) -> List[Row]: | |||
if self.active_result_set: | |||
return self.active_result_set.fetchall() | |||
else: | |||
raise Error("There is no active result set") | |||
raise Error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we plan to have custom tags for each message?
cls._executor.shutdown( | ||
wait=True | ||
) # This waits for all submitted work to complete | ||
logger.debug("Thread pool shutdown completed successfully") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 things here :
- We can mention that telemetry client has been shut down successfully.
- Telemetry logs should not be too verbose; debug/error should be avoided as much as possible.
Another thing that I noticed across this PR is : let's not try to 1:1 replicate what we have in JDBC telemetry. We can follow python best practices instead. For example :
- you can explore using a mixin base class or a utility function to make JSON serialization reusable in all the telemetry models that you have - that will make the code prettier.
- Another thing is :
TelemetryClientFactory
is overly Java specific, we could dofactory functions
instead
Another thing that is missing in this PR is coverage. Can we add coverage on the PR description?
), | ||
) | ||
self.export_event(telemetry_frontend_log) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there a try except block here. From what I can see all the errors can stem from the flush() function and there anyway we are logging and cathching the exception. It does't make sense to have try blocks where there is no error thrown
self.export_event(telemetry_frontend_log) | ||
def export_failure_log(self, error_name, error_message): | ||
logger.debug("Exporting failure log for connection %s", self._connection_uuid) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment on error handling. In telemetry the only error is the network call and that is handled in flush error handling. I don't feel random try blocks are needed cc @vikrantpuppala
src/databricks/sql/thrift_backend.py
Outdated
@@ -737,13 +763,15 @@ def _results_message_to_execute_response(self, resp, operation_state): | |||
or direct_results.resultSet.hasMoreRows | |||
) | |||
description = self._hive_schema_to_description( | |||
t_result_set_metadata_resp.schema | |||
t_result_set_metadata_resp.schema, self._connection_uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is connection_uuid an argument to this function _hive_schema_to_description
is it just for telemetry logging ? Can't we use thread local because we are polluting random functions for logging cc @vikrantpuppala
3 broad suggestions
|
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
What type of PR is this?
Description
Added error logging to the init function of Error class
Added export_failure_log to TelemetryClient
How is this tested?
Unit tests
E2E Tests
Manually
Queried from a non-existent table
Request Body Summary:
uploadTime: 1749620674701
items: 0 items
protoLogs: 2 logs
N/A
Related Tickets & Documents
PECOBLR-524