Skip to content

Conversation

jerrylian-db
Copy link
Contributor

@jerrylian-db jerrylian-db commented Jun 15, 2022

Description

This PR follows #475 to implement a --wait option for the runs submit CLI. When this option is enabled, the CLI submits a one-time job run and continues to poll for the job run state until the job run completes.

Testing

  • I added unit tests to cover different branches in the added code.
  • I manually tested submitting and waiting for a job run (for both API version 2.0 and 2.1). See screenshot:

Screen Shot 2022-06-17 at 11 11 48 AM

@pietern pietern changed the title [ML-22876] Add --wait option to databricks runs submit CLI command Add --wait option to databricks runs submit CLI command Jun 15, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 15, 2022

Codecov Report

Merging #487 (2c4382d) into main (6cdde61) will increase coverage by 0.09%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##             main     #487      +/-   ##
==========================================
+ Coverage   85.29%   85.38%   +0.09%     
==========================================
  Files          42       42              
  Lines        3291     3326      +35     
==========================================
+ Hits         2807     2840      +33     
- Misses        484      486       +2     
Impacted Files Coverage Δ
databricks_cli/runs/cli.py 96.49% <93.10%> (-1.24%) ⬇️
databricks_cli/utils.py 92.85% <100.00%> (+0.85%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6cdde61...2c4382d. Read the comment docs.

@jerrylian-db jerrylian-db requested a review from pietern June 15, 2022 19:30
if run_state['result_state'] == 'SUCCESS':
sys.exit(0)
else:
error_and_quit('job failed with state ' + run_state['result_state'] +
Copy link
Contributor Author

@jerrylian-db jerrylian-db Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: error_and_quit currently doesn't not echo to stderr. See definition below. Should I change it to do so?

def error_and_quit(message):
    ctx = click.get_current_context()
    context_object = ctx.ensure_object(ContextObject)
    if context_object.debug_mode:
        traceback.print_exc()
    click.echo(u'Error: {}'.format(message))
    sys.exit(1)

' and state message ' + run_state['state_message'])
click.echo('Job still running with lifecycle state ' + run_state['life_cycle_state'] +
'. URL: ' + run['run_page_url'], err=True)
time.sleep(5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we're ok with this polling interval to start with, or if we need to add more complex backoff/jitter logic upfront. To start with, I'd prefer to keep this simple and not implement backoff logic but just hardcode a constant polling interval that the JAWS stability reviewer (@shivamdixit) is comfortable with, even if it's >5 seconds (I think e.g. up to 10s would be fine for detecting that run submitted in order to integration test a notebook succeeded/failed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine. The same is used elsewhere (e.g. dbx and your GH action).

Beware that there are uses for this beyond integration tests.

Copy link
Contributor

@smurching smurching left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor comments, but mostly looks good!

run_id = submit_res['run_id']
completed_states = set(['TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'])
# Wait for run to complete
while True:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should have a time-out? So it's not perpetually waiting if something goes wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaning towards not having a time-out for now. I believe that the submitted run themselves have an internal timeouts in Databricks. Users can also force exit on their own.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Users can always CTRL-C

Copy link
Contributor

@smurching smurching left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last few comments, after that LGTM, thanks @jerrylian-db !

Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final nit. Everything else LGTM.

We can release this as 0.16.10.

' and state message ' + run_state['state_message'])
click.echo('Job still running with lifecycle state ' + run_state['life_cycle_state'] +
'. URL: ' + run['run_page_url'], err=True)
time.sleep(5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine. The same is used elsewhere (e.g. dbx and your GH action).

Beware that there are uses for this beyond integration tests.

Copy link
Contributor

@fjakobs fjakobs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could also print the stdout/stderr of the job after it has finished. If you e.g. run the job in order to run some unit tests you will be interested in the outcome without having to go to the Databricks web UI.

But this should be in a follow up PR.

else:
error_and_quit('Run failed with state ' + run_state['result_state'] +
' and state message ' + run_state['state_message'])
click.echo('Run is still active, with lifecycle state ' +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to print text with every single ping? Is it not more user friendly to just print on state changes as seen in the Airbreeze version?

Copy link
Contributor Author

@jerrylian-db jerrylian-db Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is a lot of text. However, maybe given my experience with the run-notebook GitHub action, I also find printing the ping results comforting. It gives me a sense of transparency that the CLI is still working and pinging and that the Databricks run is still active.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can envision a fancier user experience where there is a countdown text and spinner in the CLI that shows how many seconds until the next ping and what the latest state result is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lennartkats-db how would you feel about going with this rudimentary printing for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, maybe given my experience with the run-notebook GitHub action, I also find printing the ping results comforting. It gives me a sense of transparency that the CLI is still working and pinging and that the Databricks run is still active.

@jerrylian-db let's discuss this quickly - I like Lennart's suggestion since I think the use case you describe can still be achieved by clicking into the job run UI from the URL we print out and verifying that the job is still running. IMO the default should be to print less info and we could later add a debug option for printing more verbose output (e.g. on each ping to the jobs REST API)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I've implemented a more concise printing now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Ok, thanks for changing this. I do like the new version better!

Note that getting this printing right is important for the developer experience. As a non-blocking comment, I'd consider tinkering a bit more with it even to get it just right:

  • I wouldn't print the very long URL with every message printed
  • I'd consider printing a "." without a newline every now and then to show progress. A spinner would be even better, but then you do need to make sure that no one is capturing the output (all the ANSI codes would otherwise mess up CI/CD logs). And I'm not sure you want to build that yourself, and pulling in dependencies is also something we'd rather avoid at this point since we install in the global Python namespace.

@jerrylian-db
Copy link
Contributor Author

@fjakobs wondering if it would be best to ask users to call runs get_output after the run completes rather than we print the output in runs submit --wait. Right now, the only stdout print of this runs submit --wait is the run-id in JSON format, which is easy for users to parse in workflows. Printing more stuff to stout would complicate that.

@fjakobs
Copy link
Contributor

fjakobs commented Jun 17, 2022

@jerrylian-db I've created a separate ticket for the output handling. #489

When customers provide --wait I would expect the CLI to print the response from the last "get run" API call, which is a superset of the "submit job" response.

@pietern pietern merged commit 121c2a6 into databricks:main Jun 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants