Add --wait option to databricks runs submit CLI command #487
Codecov Report
@@ Coverage Diff @@
## main #487 +/- ##
==========================================
+ Coverage 85.29% 85.38% +0.09%
==========================================
Files 42 42
Lines 3291 3326 +35
==========================================
+ Hits 2807 2840 +33
- Misses 484 486 +2
databricks_cli/runs/cli.py
    if run_state['result_state'] == 'SUCCESS':
        sys.exit(0)
    else:
        error_and_quit('job failed with state ' + run_state['result_state'] +
QQ: error_and_quit currently does not echo to stderr. See the definition below. Should I change it to do so?
    def error_and_quit(message):
        ctx = click.get_current_context()
        context_object = ctx.ensure_object(ContextObject)
        if context_object.debug_mode:
            traceback.print_exc()
        click.echo(u'Error: {}'.format(message))
        sys.exit(1)
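For illustration, a minimal sketch of what an stderr variant could look like. It is not the code from this PR: to stay self-contained it uses `print(..., file=sys.stderr)` where the click-based CLI would presumably pass `err=True` to `click.echo` (as the diff already does for the progress message), and the `debug_mode` parameter is a hypothetical stand-in for `context_object.debug_mode`.

```python
import sys
import traceback


def error_and_quit(message, debug_mode=False):
    """Write an error message to stderr and exit with status 1.

    `debug_mode` is an assumed stand-in for context_object.debug_mode.
    """
    if debug_mode:
        # Print the active exception's traceback, if any, to stderr.
        traceback.print_exc()
    # In the click-based CLI this would be click.echo(..., err=True).
    print(u'Error: {}'.format(message), file=sys.stderr)
    sys.exit(1)
```

With this shape, stdout stays free for machine-readable output (such as the JSON returned by the submit call) while failures remain visible in CI logs.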
databricks_cli/runs/cli.py
                   ' and state message ' + run_state['state_message'])
    click.echo('Job still running with lifecycle state ' + run_state['life_cycle_state'] +
               '. URL: ' + run['run_page_url'], err=True)
    time.sleep(5)
I wonder if we're OK with this polling interval to start with, or if we need to add more complex backoff/jitter logic upfront. I'd prefer to keep this simple for now: skip the backoff logic and hardcode a constant polling interval that the JAWS stability reviewer (@shivamdixit) is comfortable with, even if it's more than 5 seconds. I think anything up to 10s, for example, would be fine for detecting whether a run submitted to integration test a notebook succeeded or failed.
I think this is fine. The same is used elsewhere (e.g. dbx and your GH action).
Beware that there are uses for this beyond integration tests.
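For reference, a rough sketch of how the constant interval and the backoff/jitter alternative discussed above could share one code path. Nothing here is in the PR: `poll_delays` and all of its parameters are hypothetical names, and the defaults simply reproduce the hardcoded 5-second sleep.

```python
import random


def poll_delays(base=5.0, factor=1.0, cap=60.0, jitter=0.0, rng=random):
    """Yield successive sleep durations in seconds (hypothetical helper).

    factor=1.0 and jitter=0.0 give a constant interval of `base` seconds.
    factor > 1 gives exponential backoff, capped at `cap`.
    jitter adds a random extra delay of up to `jitter * delay` seconds.
    """
    delay = base
    while True:
        yield delay + rng.uniform(0, jitter * delay)
        delay = min(cap, delay * factor)
```

A polling loop would then call `time.sleep(next(delays))` on each iteration, so moving from a constant interval to backoff later would only mean changing the arguments.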
Just minor comments, but mostly looks good!
    run_id = submit_res['run_id']
    completed_states = set(['TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'])
    # Wait for run to complete
    while True:
I wonder if we should have a time-out? So it's not perpetually waiting if something goes wrong.
Leaning towards not having a time-out for now. I believe the submitted runs themselves have internal timeouts in Databricks. Users can also force an exit on their own.
+1 Users can always CTRL-C
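If a time-out is ever revisited, one possible shape is sketched below. This is an assumption-laden illustration, not part of the PR: `wait_until_complete`, `WaitTimeout`, and the injected `get_state`, `clock`, and `sleep` callables are all hypothetical. With `timeout=None` it behaves like the current loop and waits indefinitely.

```python
import time


class WaitTimeout(Exception):
    """Raised when the run does not complete within the time-out (hypothetical)."""


def wait_until_complete(get_state, timeout=None, interval=5.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until the run reaches a completed lifecycle state.

    get_state is assumed to return a dict with a 'life_cycle_state' key.
    timeout=None means wait forever, matching the behavior discussed above.
    """
    completed_states = {'TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'}
    deadline = None if timeout is None else clock() + timeout
    while True:
        state = get_state()
        if state['life_cycle_state'] in completed_states:
            return state
        if deadline is not None and clock() >= deadline:
            raise WaitTimeout('run did not complete within {}s'.format(timeout))
        sleep(interval)
```

Using a monotonic clock keeps the deadline unaffected by wall-clock adjustments during a long wait.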
Last few comments, after that LGTM, thanks @jerrylian-db !
One final nit. Everything else LGTM.
We can release this as 0.16.10.
It would be great if we could also print the stdout/stderr of the job after it has finished. If you e.g. run the job in order to run some unit tests you will be interested in the outcome without having to go to the Databricks web UI.
But this should be in a follow up PR.
databricks_cli/runs/cli.py
    else:
        error_and_quit('Run failed with state ' + run_state['result_state'] +
                       ' and state message ' + run_state['state_message'])
    click.echo('Run is still active, with lifecycle state ' +
This seems to print text with every single ping? Is it not more user friendly to just print on state changes as seen in the Airbreeze version?
I agree that it is a lot of text. However, maybe given my experience with the run-notebook GitHub action, I also find printing the ping results comforting. It gives me a sense of transparency that the CLI is still working and pinging and that the Databricks run is still active.
I can envision a fancier user experience where there is a countdown text and spinner in the CLI that shows how many seconds until the next ping and what the latest state result is.
@lennartkats-db how would you feel about going with this rudimentary printing for now?
> However, maybe given my experience with the run-notebook GitHub action, I also find printing the ping results comforting. It gives me a sense of transparency that the CLI is still working and pinging and that the Databricks run is still active.
@jerrylian-db let's discuss this quickly - I like Lennart's suggestion since I think the use case you describe can still be achieved by clicking into the job run UI from the URL we print out and verifying that the job is still running. IMO the default should be to print less info and we could later add a debug option for printing more verbose output (e.g. on each ping to the jobs REST API)
Sounds good. I've implemented a more concise printing now.
👍 Ok, thanks for changing this. I do like the new version better!
Note that getting this printing right is important for the developer experience. As a non-blocking comment, I'd consider tinkering with it a bit more to get it just right:
- I wouldn't print the very long URL with every message printed
- I'd consider printing a "." without a newline every now and then to show progress. A spinner would be even better, but then you do need to make sure that no one is capturing the output (all the ANSI codes would otherwise mess up CI/CD logs). And I'm not sure you want to build that yourself, and pulling in dependencies is also something we'd rather avoid at this point since we install in the global Python namespace.
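To make the suggestions above concrete, here is a small sketch of printing only on state changes, with a "." for each unchanged poll. It is an illustration under assumptions rather than the implemented version: `ProgressPrinter` and `is_interactive` are hypothetical names, and it uses only the standard library, in line with the concern about pulling in dependencies.

```python
import sys


def is_interactive(stream):
    """Return True only if the stream is a terminal, so a spinner or ANSI
    codes would not end up in captured CI/CD logs."""
    isatty = getattr(stream, 'isatty', None)
    return bool(isatty and isatty())


class ProgressPrinter(object):
    """Print a line when the lifecycle state changes, a dot otherwise."""

    def __init__(self, stream=None):
        self.stream = stream if stream is not None else sys.stderr
        self.last_state = None
        self.dots_pending = False

    def update(self, life_cycle_state):
        if life_cycle_state != self.last_state:
            if self.dots_pending:
                # Finish the current row of dots before the new message.
                self.stream.write('\n')
                self.dots_pending = False
            self.stream.write('Run lifecycle state: {}\n'.format(life_cycle_state))
            self.last_state = life_cycle_state
        else:
            self.stream.write('.')
            self.dots_pending = True
        self.stream.flush()
```

The run URL could be printed once before the loop starts instead of with every message.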
@fjakobs wondering if it would be best to ask users to call |
@jerrylian-db I've created a separate ticket for the output handling. #489 When customers provide |
Description
This PR follows #475 to implement a --wait option for the runs submit CLI. When this option is enabled, the CLI submits a one-time job run and continues to poll for the job run state until the job run completes.
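As a rough sketch of the behavior described, assuming the run object returned by the API exposes a `state` dict with `life_cycle_state`, `result_state`, and `state_message` (the fields used in the diff under review). `wait_for_run` and the injected `get_run` callable are hypothetical and only illustrate the control flow; the actual implementation lives in databricks_cli/runs/cli.py.

```python
import time

# Lifecycle states after which the run will not change any further,
# taken from the diff under review.
COMPLETED_STATES = frozenset(['TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'])


def wait_for_run(get_run, run_id, interval=5, sleep=time.sleep):
    """Poll get_run(run_id) until the run completes.

    Returns (exit_code, message): 0 on SUCCESS, 1 otherwise.
    get_run is an assumed callable returning the run as a dict.
    """
    while True:
        state = get_run(run_id)['state']
        if state['life_cycle_state'] in COMPLETED_STATES:
            if state.get('result_state') == 'SUCCESS':
                return 0, 'Run succeeded'
            return 1, 'Run failed with state {} and state message {}'.format(
                state.get('result_state'), state.get('state_message'))
        sleep(interval)
```

A caller would pass the returned exit code to `sys.exit`, so a CI step fails exactly when the submitted run does not succeed.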
Testing