Skip to content

Added functionality for export of failure logs #591

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: telemetry
Choose a base branch
from

Conversation

saishreeeee
Copy link
Collaborator

@saishreeeee saishreeeee commented Jun 10, 2025

What type of PR is this?

  • Refactor
  • Feature
  • Bug Fix
  • Other

Description

Added error logging to the init function of Error class
Added export_failure_log to TelemetryClient

How is this tested?

  • Unit tests

  • E2E Tests

  • Manually
    Queried from a non-existent table
    Request Body Summary:
    uploadTime: 1749620674701
    items: 0 items
    protoLogs: 2 logs

        Proto Log #1: Initial connection log
        Proto Log #2: Error log
        {
          "frontend_log_event_id": "b251bea7-c9a8-42e4-b00c-5ea4011d7d17",
          "context": {
            "client_context": {
              "timestamp_millis": 1749716213881,
              "user_agent": "PyDatabricksSqlConnector/4.0.3"
            }
          },
          "entry": {
            "sql_driver_log": {
              "session_id": "<REDACTED>",
              "system_configuration": {
                "driver_version": "4.0.3",
                "os_name": "Darwin",
                "os_version": "24.5.0",
                "os_arch": "arm64",
                "runtime_name": "Python 3.13.3",
                "runtime_version": "3.13.3",
                "runtime_vendor": "CPython",
                "driver_name": "Databricks SQL Python Connector",
                "char_set_encoding": "utf-8",
                "locale_name": "en_US"
              },
              "driver_connection_params": {
                "http_path": "<REDACTED>",
                "mode": "THRIFT",
                "host_info": {
                  "host_url": "<REDACTED>",
                  "port": 443
                },
                "auth_mech": "PAT"
              },
              "error_info": {
                "error_name": "ServerOperationError",
                "stack_trace": "[TABLE_OR_VIEW_NOT_FOUND] The table or view `non_existent_table` cannot be found. Verify the spelling and correctness of the schema and catalog.\nIf you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.\nTo tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS. SQLSTATE: 42P01; line 2 pos 33"
              }
            }
          }
        }          
        === Response Details ===
        Status Code: 200
        Response Body:
        {
          "errors": [],
          "numSuccess": 0,
          "numProtoSuccess": 2,
          "numRealtimeSuccess": 0
        }
    
  • N/A

Related Tickets & Documents

PECOBLR-524

raise Error("Cannot create cursor from closed connection")
raise Error(
"Cannot create cursor from closed connection",
connection_uuid=self.get_session_id_hex(),
Copy link
Contributor

@samikshya-db samikshya-db Jun 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there plans to include the statement ID too across this PR?

@samikshya-db
Copy link
Contributor

Let's elaborate on testing details, maybe include a redacted log JSON ?

@samikshya-db
Copy link
Contributor

From your JSON, can we skip null fields while building the telemetry request?

Signed-off-by: Sai Shree Pradhan <[email protected]>
…ze and get telemetry client

Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds enhanced error logging functionality by exporting failure logs via telemetry and integrating the export call into the Error class. Key changes include:

  • New export_failure_log method in TelemetryClient with corresponding tests.
  • Enhanced exception raising throughout the code to include a connection_uuid for improved traceability.
  • Updates to JSON conversion utilities to use a more compact form for telemetry events.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/unit/test_telemetry.py Adds test cases for the new export_failure_log functionality (note: duplicate test functions detected).
src/databricks/sql/thrift_backend.py Propagates connection_uuid through error handling for improved error context.
src/databricks/sql/telemetry/utils.py Introduces helper functions to filter None values and convert dataclasses to compact JSON.
src/databricks/sql/telemetry/telemetry_client.py Implements telemetry export for failure logs and adds try/except blocks to handle export errors.
src/databricks/sql/telemetry/models/* Updates to_json methods to use the new compact JSON conversion.
src/databricks/sql/exc.py & src/databricks/sql/client.py Updates Error and client exception raises to include connection_uuid in error messages.
Comments suppressed due to low confidence (2)

tests/unit/test_telemetry.py:86

  • Duplicate test function 'test_export_failure_log' detected; consider renaming or consolidating the two tests to avoid test overrides.
def test_export_failure_log(self, noop_telemetry_client):

src/databricks/sql/telemetry/telemetry_client.py:395

  • [nitpick] The assignment syntax in the TelemetryClientFactory initialization appears confusing due to extra bracket indentation; please review and simplify the assignment for clarity.
TelemetryClientFactory._clients[connection_uuid] = TelemetryClient(

host_url=host_url,
executor=TelemetryClientFactory._executor,
with TelemetryClientFactory._lock:
if connection_uuid not in TelemetryClientFactory._clients:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually not the connection_uuid but the hex right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the session id of ThriftBackend converted to a hex string. Every connection has only one session, so I used it as a connection identifier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes however let's not use the variable name connection_uuid since that is misleading, we should name it exactly what it is

@@ -83,6 +83,12 @@ def test_export_initial_telemetry_log(self, noop_telemetry_client):
driver_connection_params=MagicMock(), user_agent="test"
)

def test_export_failure_log(self, noop_telemetry_client):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test and the test at line 141 have the same name, let's dedup

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it's alright since they are in different classes? Have done the same thing for test_export_initial_telemetry_log and test_close. Should I change these too?

cls._executor.shutdown(
wait=True
) # This waits for all submitted work to complete
logger.debug("Thread pool shutdown completed successfully")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

misleading log message, let's add logs appropriately.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry I don't quite understand. Do you mean the submitted work make have completed but with a failure so I shouldn't add "successfully" in the log message?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 things here :

  1. We can mention that telemetry client has been shut down successfully.
  2. Telemetry logs should not be too verbose; debug/error should be avoided as much as possible.

Another thing that I noticed across this PR is : let's not try to 1:1 replicate what we have in JDBC telemetry. We can follow python best practices instead. For example :

  • you can explore using a mixin base class or a utility function to make JSON serialization reusable in all the telemetry models that you have - that will make the code prettier.
  • Another thing is : TelemetryClientFactory is overly Java specific, we could do factory functions instead

Another thing that is missing in this PR is coverage. Can we add coverage on the PR description?

@@ -1156,7 +1181,10 @@ def fetchall(self) -> List[Row]:
if self.active_result_set:
return self.active_result_set.fetchall()
else:
raise Error("There is no active result set")
raise Error(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we plan to have custom tags for each message?

cls._executor.shutdown(
wait=True
) # This waits for all submitted work to complete
logger.debug("Thread pool shutdown completed successfully")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 things here :

  1. We can mention that telemetry client has been shut down successfully.
  2. Telemetry logs should not be too verbose; debug/error should be avoided as much as possible.

Another thing that I noticed across this PR is : let's not try to 1:1 replicate what we have in JDBC telemetry. We can follow python best practices instead. For example :

  • you can explore using a mixin base class or a utility function to make JSON serialization reusable in all the telemetry models that you have - that will make the code prettier.
  • Another thing is : TelemetryClientFactory is overly Java specific, we could do factory functions instead

Another thing that is missing in this PR is coverage. Can we add coverage on the PR description?

),
)
self.export_event(telemetry_frontend_log)
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a try except block here. From what I can see all the errors can stem from the flush() function and there anyway we are logging and cathching the exception. It does't make sense to have try blocks where there is no error thrown

self.export_event(telemetry_frontend_log)
def export_failure_log(self, error_name, error_message):
logger.debug("Exporting failure log for connection %s", self._connection_uuid)
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment on error handling. In telemetry the only error is the network call and that is handled in flush error handling. I don't feel random try blocks are needed cc @vikrantpuppala

@@ -737,13 +763,15 @@ def _results_message_to_execute_response(self, resp, operation_state):
or direct_results.resultSet.hasMoreRows
)
description = self._hive_schema_to_description(
t_result_set_metadata_resp.schema
t_result_set_metadata_resp.schema, self._connection_uuid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is connection_uuid an argument to this function _hive_schema_to_description is it just for telemetry logging ? Can't we use thread local because we are polluting random functions for logging cc @vikrantpuppala

@jprakash-db
Copy link
Contributor

3 broad suggestions

  • Can we not throw blanket Error , try to use specific errors such as ConnectionError,etc as telemetry uses the error name
  • Add try except blocks only where you expect an error, or atleast re throw the error from the child functions. Random try blocks don't feel makes sense
  • Passing arguments like connection_uuid to functions that don't need it and just for telemetry doesn't feel. Either store this like a class variable during class initialisation or can we explore thread locals

Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants