Add --wait option to databricks runs submit CLI command #487

jerrylian-db · 2022-06-15T18:15:20Z

Description

This PR follows #475 to implement a --wait option for the runs submit CLI. When this option is enabled, the CLI submits a one-time job run and continues to poll for the job run state until the job run completes.
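For readers skimming the description, the polling behaviour can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `wait_for_run` and the `get_run` callable are hypothetical stand-ins for the CLI's "get run" API call, and only the terminal state names are taken from the diff under review.

```python
import time

# Terminal lifecycle states, as listed in the diff under review.
COMPLETED_STATES = {'TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'}


def wait_for_run(get_run, run_id, poll_interval=5, sleep=time.sleep):
    """Poll get_run(run_id) until the run reaches a terminal lifecycle state.

    get_run is a hypothetical callable standing in for the "get run" API
    call; it is assumed to return a dict with a 'state' entry.
    """
    while True:
        run = get_run(run_id)
        if run['state']['life_cycle_state'] in COMPLETED_STATES:
            return run
        # Not finished yet: wait a fixed interval before asking again.
        sleep(poll_interval)
```

Passing `sleep` as a parameter is only there so the loop can be exercised in a unit test without real delays.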

Testing

I added unit tests to cover different branches in the added code.
I manually tested submitting and waiting for a job run (for both API version 2.0 and 2.1). See screenshot:

codecov-commenter · 2022-06-15T19:02:09Z

Codecov Report

Merging #487 (2c4382d) into main (6cdde61) will increase coverage by 0.09%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##             main     #487      +/-   ##
==========================================
+ Coverage   85.29%   85.38%   +0.09%     
==========================================
  Files          42       42              
  Lines        3291     3326      +35     
==========================================
+ Hits         2807     2840      +33     
- Misses        484      486       +2

Impacted Files	Coverage Δ
databricks_cli/runs/cli.py	`96.49% <93.10%> (-1.24%)`	⬇️
databricks_cli/utils.py	`92.85% <100.00%> (+0.85%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6cdde61...2c4382d. Read the comment docs.

jerrylian-db · 2022-06-15T19:31:59Z

databricks_cli/runs/cli.py

+                if run_state['result_state'] == 'SUCCESS':
+                    sys.exit(0)
+                else:
+                    error_and_quit('job failed with state ' + run_state['result_state'] +


QQ: error_and_quit currently does not echo to stderr. See the definition below. Should I change it to do so?

def error_and_quit(message):
    ctx = click.get_current_context()
    context_object = ctx.ensure_object(ContextObject)
    if context_object.debug_mode:
        traceback.print_exc()
    click.echo(u'Error: {}'.format(message))
    sys.exit(1)
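For comparison, a variant that reports on stderr could look roughly like the sketch below. It is a stdlib-only illustration: `error_and_quit_stderr` is a hypothetical name, the debug-mode traceback handling is omitted, and in the CLI itself the equivalent change would presumably be passing `err=True` to `click.echo`, as the progress messages in this diff already do.

```python
import sys


def error_and_quit_stderr(message, exit_code=1):
    """Hypothetical variant of error_and_quit that reports on stderr."""
    # print(..., file=sys.stderr) plays the role of click.echo(..., err=True).
    print('Error: {}'.format(message), file=sys.stderr)
    # A non-zero exit code signals failure to the calling shell or workflow.
    sys.exit(exit_code)
```

Keeping errors off stdout means a caller that parses the JSON printed on stdout never sees error text mixed into it.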

smurching · 2022-06-15T19:46:50Z

databricks_cli/runs/cli.py

+                                   ' and state message ' + run_state['state_message'])
+            click.echo('Job still running with lifecycle state ' + run_state['life_cycle_state'] +
+                        '. URL: ' + run['run_page_url'], err=True)
+            time.sleep(5)


I wonder if we're OK with this polling interval to start with, or if we need to add more complex backoff/jitter logic upfront. To start with, I'd prefer to keep this simple: skip the backoff logic and hardcode a constant polling interval that the JAWS stability reviewer (@shivamdixit) is comfortable with, even if it's >5 seconds. (I think up to 10s, for example, would be fine for detecting whether a run submitted to integration-test a notebook succeeded or failed.)

I think this is fine. The same is used elsewhere (e.g. dbx and your GH action).

Beware that there are uses for this beyond integration tests.
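For reference, the "more complex backoff/jitter logic" mentioned in this thread might look something like the sketch below. It is not part of this PR, which keeps a constant interval: `next_poll_delay`, `base` and `cap` are hypothetical names and values, and the strategy shown is capped exponential backoff with full jitter.

```python
import random


def next_poll_delay(attempt, base=5.0, cap=60.0, rng=random):
    """Return a sleep duration in seconds for the given 0-based attempt.

    The upper bound doubles with each attempt until it reaches `cap`;
    the actual delay is drawn uniformly from [0, upper bound] so that
    many clients polling at once do not stay synchronised.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)
```

A constant `time.sleep(5)` is simpler to reason about and to review, which is the trade-off the thread settles on.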

smurching

Just minor comments, but mostly looks good!

vladimirk-db · 2022-06-15T20:18:44Z

databricks_cli/runs/cli.py

+        run_id = submit_res['run_id']
+        completed_states = set(['TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'])
+        # Wait for run to complete
+        while True:


I wonder if we should have a time-out? So it's not perpetually waiting if something goes wrong.

Leaning towards not having a time-out for now. I believe the submitted runs themselves have internal timeouts in Databricks. Users can also force an exit on their own.

+1 Users can always CTRL-C
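If a time-out were added later, one way to bound the wait is a deadline check inside the polling loop. The sketch below is only an illustration of that idea: `poll_until`, `WaitTimeout` and the `check` callable are hypothetical and do not exist in the CLI.

```python
import time


class WaitTimeout(Exception):
    """Raised when polling gives up before a result arrives (hypothetical)."""


def poll_until(check, timeout, poll_interval=5,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns a non-None value or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        # Give up once the deadline has passed instead of waiting forever.
        if clock() >= deadline:
            raise WaitTimeout('gave up after {} seconds'.format(timeout))
        sleep(poll_interval)
```

`time.monotonic` is used rather than `time.time` so that system clock adjustments cannot shorten or extend the wait.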

smurching

Last few comments, after that LGTM, thanks @jerrylian-db !

pietern

One final nit. Everything else LGTM.

We can release this as 0.16.10.

pietern · 2022-06-16T06:53:23Z

databricks_cli/runs/cli.py

+                                   ' and state message ' + run_state['state_message'])
+            click.echo('Job still running with lifecycle state ' + run_state['life_cycle_state'] +
+                        '. URL: ' + run['run_page_url'], err=True)
+            time.sleep(5)


I think this is fine. The same is used elsewhere (e.g. dbx and your GH action).

Beware that there are uses for this beyond integration tests.

fjakobs

It would be great if we could also print the stdout/stderr of the job after it has finished. If you e.g. run the job in order to run some unit tests you will be interested in the outcome without having to go to the Databricks web UI.

But this should be in a follow up PR.

lennartkats-db · 2022-06-16T15:31:32Z

databricks_cli/runs/cli.py

+                else:
+                    error_and_quit('Run failed with state ' + run_state['result_state'] +
+                                   ' and state message ' + run_state['state_message'])
+            click.echo('Run is still active, with lifecycle state ' +


This seems to print text with every single ping? Is it not more user friendly to just print on state changes as seen in the Airbreeze version?

I agree that it is a lot of text. However, maybe given my experience with the run-notebook GitHub action, I also find printing the ping results comforting. It gives me a sense of transparency that the CLI is still working and pinging and that the Databricks run is still active.

I can envision a fancier user experience where there is a countdown text and spinner in the CLI that shows how many seconds until the next ping and what the latest state result is.

@lennartkats-db how would you feel about going with this rudimentary printing for now?

> However, maybe given my experience with the run-notebook GitHub action, I also find printing the ping results comforting. It gives me a sense of transparency that the CLI is still working and pinging and that the Databricks run is still active.

@jerrylian-db let's discuss this quickly - I like Lennart's suggestion since I think the use case you describe can still be achieved by clicking into the job run UI from the URL we print out and verifying that the job is still running. IMO the default should be to print less info and we could later add a debug option for printing more verbose output (e.g. on each ping to the jobs REST API)

Sounds good. I've implemented a more concise printing now.

👍 Ok, thanks for changing this. I do like the new version better!

Note that getting this printing right is important for the developer experience. As a non-blocking comment, I'd consider tinkering with it a bit more to get it just right:

- I wouldn't print the very long URL with every message.
- I'd consider printing a "." without a newline every now and then to show progress. A spinner would be even better, but then you need to make sure that no one is capturing the output (the ANSI codes would otherwise mess up CI/CD logs). I'm not sure you want to build that yourself, and pulling in dependencies is something we'd rather avoid at this point since we install into the global Python namespace.
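The two ideas discussed in this thread (print a line only when the state changes, and print a "." between changes) could be combined roughly as in the sketch below. `ProgressReporter` and the message wording are hypothetical; it uses plain ASCII only, so it should stay readable in captured CI/CD logs, and it needs no extra dependencies.

```python
import sys


class ProgressReporter:
    """Print a line when the lifecycle state changes, and a dot otherwise."""

    def __init__(self, stream=None):
        # Progress goes to stderr by default so stdout stays parseable.
        self.stream = stream if stream is not None else sys.stderr
        self.last_state = None
        self.dots_pending = False

    def update(self, life_cycle_state):
        if life_cycle_state != self.last_state:
            if self.dots_pending:
                # Finish the current line of dots before the new message.
                self.stream.write('\n')
                self.dots_pending = False
            self.stream.write('Run state: {}\n'.format(life_cycle_state))
            self.last_state = life_cycle_state
        else:
            self.stream.write('.')
            self.dots_pending = True
        self.stream.flush()
```

Calling `update` once per poll keeps the output short while still showing that the CLI is alive.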

jerrylian-db · 2022-06-16T17:35:40Z

@fjakobs I wonder whether it would be best to ask users to call runs get_output after the run completes, rather than printing the output in runs submit --wait. Right now, the only thing runs submit --wait prints to stdout is the run ID in JSON format, which is easy for users to parse in workflows. Printing more to stdout would complicate that.
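The stdout/stderr split described here can be illustrated with a small sketch. The `report` function and its parameters are hypothetical; the point is only that human-readable progress goes to stderr while stdout carries a single JSON document.

```python
import json
import sys


def report(run_id, progress_messages, out=None, err=None):
    """Write progress to stderr and one JSON document to stdout."""
    out = out if out is not None else sys.stdout
    err = err if err is not None else sys.stderr
    for message in progress_messages:
        # Human-readable progress: safe to ignore or redirect.
        err.write(message + '\n')
    # Machine-readable result: the only thing written to stdout.
    out.write(json.dumps({'run_id': run_id}) + '\n')
```

A workflow can then capture stdout and parse it directly, regardless of how chatty the progress output is.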

fjakobs · 2022-06-17T07:53:26Z

@jerrylian-db I've created a separate ticket for the output handling. #489

When customers provide --wait I would expect the CLI to print the response from the last "get run" API call, which is a superset of the "submit job" response.

jerrylian-db added 2 commits June 15, 2022 17:49

wip

654301e

lint

8781f67

pietern reviewed Jun 15, 2022

pietern changed the title ~~[ML-22876] Add --wait option to databricks runs submit CLI command~~ Add --wait option to databricks runs submit CLI command Jun 15, 2022

jerrylian-db added 2 commits June 15, 2022 18:50

remove 2.1 check

950fce6

fix tests

7fb6c5c

jerrylian-db added 4 commits June 15, 2022 19:18

adapt feedback

54838e5

wip

6a92534

wip

a81db2b

wip

1ab5c09

jerrylian-db requested a review from pietern June 15, 2022 19:30

jerrylian-db commented Jun 15, 2022

smurching reviewed Jun 15, 2022

databricks_cli/runs/cli.py Outdated

smurching reviewed Jun 15, 2022

databricks_cli/runs/cli.py Outdated

smurching reviewed Jun 15, 2022

tests/runs/test_cli.py Outdated

smurching reviewed Jun 15, 2022

vladimirk-db reviewed Jun 15, 2022

jerrylian-db added 2 commits June 15, 2022 21:09

wip

6e92837

adapt feedback

c1bc00c

smurching reviewed Jun 16, 2022

tests/runs/test_cli.py Outdated

smurching reviewed Jun 16, 2022

tests/runs/test_cli.py Outdated

smurching reviewed Jun 16, 2022

update tests

6f4d5b4

smurching approved these changes Jun 16, 2022

pietern reviewed Jun 16, 2022

fjakobs requested changes Jun 16, 2022

fjakobs approved these changes Jun 16, 2022

fjakobs mentioned this pull request Jun 16, 2022

Also add --wait to the databricks jobs run-now API #488

Open

lennartkats-db requested changes Jun 16, 2022

databricks_cli/runs/cli.py Outdated

lennartkats-db requested changes Jun 16, 2022

jerrylian-db added 2 commits June 16, 2022 17:54

wip

4862b46

adapt feedback

b4bd046

smurching reviewed Jun 16, 2022

databricks_cli/runs/cli.py Outdated

adapt feedback

561e8c5

jerrylian-db requested a review from lennartkats-db June 16, 2022 18:49

jerrylian-db added 2 commits June 16, 2022 18:51

spelling

3b4012e

wording

2c4382d

lennartkats-db approved these changes Jun 17, 2022

Merge branch 'main' into add_wait

e28bcfe

pietern merged commit 121c2a6 into databricks:main Jun 17, 2022

This was referenced Jun 17, 2022

Add --wait option to databricks runs submit CLI command #475

Closed

Tests for backoff_with_jitter #494

Closed
