-
-
Notifications
You must be signed in to change notification settings - Fork 17
fix: add retry to network settings update #262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: add retry to network settings update #262
Conversation
…ork-retry # Conflicts: # go.mod
savme
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the PR, @JEETDESAI25! This definitely solves the core issue. I think we should tweak the approach slightly, because retrying on any 500 might capture unrelated errors we probably don't want to retry.
We might be better off verifying that the project is in the ACTIVE state before moving forward with any related resources (in both createProject and updateProject). That way we'd avoid broad retries and still solve the underlying problem.
Happy to chat through the details if you'd like - just let me know!
|
@savme Thanks for the review! You’re right about the blanket 500 retry and it could hide unrelated errors. My plan is to add a |
|
Yeah, this is pretty much exactly what I was thinking @JEETDESAI25 👍 |
43c93fd to
e9e5fbf
Compare
|
Thank you for guiding me, really appreciate your feedback. Please let me know if there's anything else I can adjust. @savme |
| const projectActiveTimeout = 5 * time.Minute | ||
|
|
||
| func waitForProjectActive(ctx context.Context, projectRef string, client *api.ClientWithResponses) diag.Diagnostics { | ||
| err := retry.RetryContext(ctx, projectActiveTimeout, func() *retry.RetryError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if the more specific WaitForStateContext would work better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, we use StateChangeConf.WaitForStateContext with explicit Pending/Target states. Also moved the helper to utils.go so both project_resource and settings_resource can share it.
| }) | ||
|
|
||
| switch status { | ||
| case api.V1ProjectWithDatabaseResponseStatusACTIVEHEALTHY, api.V1ProjectWithDatabaseResponseStatusACTIVEUNHEALTHY: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you come across a project being in an unhealthy state during testing?
I’m not 100% sure, but I assume that subsequent updates to an unhealthy project would be rejected. If that’s the case, we probably shouldn’t treat this status as a successful update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, an unhealthy project might reject subsequent updates. Changed to only target ACTIVE_HEALTHY. ACTIVE_UNHEALTHY now stays in the Pending list so it keeps polling until the project becomes fully healthy.
| switch status { | ||
| case api.V1ProjectWithDatabaseResponseStatusACTIVEHEALTHY, api.V1ProjectWithDatabaseResponseStatusACTIVEUNHEALTHY: | ||
| return nil | ||
| case api.V1ProjectWithDatabaseResponseStatusINITFAILED, api.V1ProjectWithDatabaseResponseStatusREMOVED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| case api.V1ProjectWithDatabaseResponseStatusINITFAILED, api.V1ProjectWithDatabaseResponseStatusREMOVED: | |
| case api.V1ProjectWithDatabaseResponseStatusINITFAILED, api.V1ProjectWithDatabaseResponseStatusREMOVED, api.V1ProjectWithDatabaseResponseStatusGOINGDOWN: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, added GOING_DOWN also included INACTIVE , PAUSE_FAILED and RESTORE_FAILED since these are terminal states that require operator intervention.
| var knownProjectStatuses = map[api.V1ProjectWithDatabaseResponseStatus]bool{ | ||
| // Target | ||
| api.V1ProjectWithDatabaseResponseStatusACTIVEHEALTHY: true, | ||
| // Pending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets move this to a separate array for easier reuse
|
|
||
| const projectActiveTimeout = 5 * time.Minute | ||
|
|
||
| const statusUnknownTransient = "UNKNOWN_TRANSIENT" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove this custom status
| }) | ||
|
|
||
| switch httpResp.JSON200.Status { | ||
| case api.V1ProjectWithDatabaseResponseStatusGOINGDOWN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace with a check using terminal states array
| if !knownProjectStatuses[httpResp.JSON200.Status] { | ||
| tflog.Warn(ctx, "Unrecognized project status, treating as transient", map[string]interface{}{ | ||
| "project_ref": projectRef, | ||
| "status": status, | ||
| }) | ||
| return httpResp.JSON200, statusUnknownTransient, nil | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if !knownProjectStatuses[httpResp.JSON200.Status] { | |
| tflog.Warn(ctx, "Unrecognized project status, treating as transient", map[string]interface{}{ | |
| "project_ref": projectRef, | |
| "status": status, | |
| }) | |
| return httpResp.JSON200, statusUnknownTransient, nil | |
| } |
we can assume all status returned by api have a corresponding enum value
What kind of change does this PR introduce?
Adds retry handling for network settings updates.
What is the current behavior?
Closes #239. Applying supabase_settings immediately after supabase_project frequently fails with HTTP 500 (error adding pooler tenant…) because the pooler tenant isn’t fully provisioned yet; users must rerun manually.
What is the new behavior?
Additional context
Retry window is five minutes to give the pooler tenant time to finish provisioning; the issue noted one-minute sleeps were no longer sufficient, and other Terraform providers use 2–5 minute waits for similar propagation delays.