
Conversation

@JEETDESAI25
Collaborator

What kind of change does this PR introduce?

Adds retry handling for network settings updates.

What is the current behavior?

Closes #239. Applying supabase_settings immediately after supabase_project frequently fails with HTTP 500 (error adding pooler tenant…) because the pooler tenant isn’t fully provisioned yet; users must rerun manually.

What is the new behavior?

  • updateNetworkConfig now wraps the API call in retry.RetryContext, retrying transient 500s for up to 5 minutes with debug logging on each retry (see the sketch below).
  • Added TestAccSettingsResource_NetworkRetry, which mocks a 500→201 flow to validate the retry path end-to-end.

Additional context

The retry window is five minutes to give the pooler tenant time to finish provisioning; the issue noted that one-minute sleeps were no longer sufficient, and other Terraform providers use 2–5 minute waits for similar propagation delays.
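
For illustration, a minimal sketch of that retry wrapper follows. The updateWithRetry helper and its doUpdate callback are hypothetical stand-ins for the provider's actual settings call; retry and tflog are the terraform-plugin-sdk and terraform-plugin-log packages.

package provider

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"github.com/hashicorp/terraform-plugin-log/tflog"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/retry"
)

// networkRetryTimeout mirrors the five-minute window described above.
const networkRetryTimeout = 5 * time.Minute

// updateWithRetry retries the supplied update call while the API keeps
// returning HTTP 500, and stops on success or on any other failure.
func updateWithRetry(ctx context.Context, doUpdate func(context.Context) (int, error)) error {
	return retry.RetryContext(ctx, networkRetryTimeout, func() *retry.RetryError {
		status, err := doUpdate(ctx)
		if err != nil {
			return retry.NonRetryableError(err)
		}
		switch status {
		case http.StatusOK, http.StatusCreated:
			return nil
		case http.StatusInternalServerError:
			// The pooler tenant may still be provisioning; log and retry.
			tflog.Debug(ctx, "network settings update returned 500, retrying")
			return retry.RetryableError(fmt.Errorf("transient HTTP 500"))
		default:
			return retry.NonRetryableError(fmt.Errorf("unexpected HTTP status %d", status))
		}
	})
}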

@JEETDESAI25 JEETDESAI25 requested a review from a team as a code owner December 4, 2025 01:57
@savme savme self-requested a review December 10, 2025 07:41
Collaborator

@savme savme left a comment

Thanks so much for the PR, @JEETDESAI25! This definitely solves the core issue. I think we should tweak the approach slightly, because retrying on any 500 might capture unrelated errors we probably don't want to retry.

We might be better off verifying that the project is in the ACTIVE state before moving forward with any related resources (in both createProject and updateProject). That way we'd avoid broad retries and still solve the underlying problem.

Happy to chat through the details if you'd like - just let me know!

@JEETDESAI25
Collaborator Author

@savme Thanks for the review! You're right about the blanket 500 retry; it could hide unrelated errors. My plan is to add a waitForProjectActive helper that polls GET /v1/projects/{ref} until the status is ACTIVE_HEALTHY or ACTIVE_UNHEALTHY, call it at the end of createProject and after updateInstanceSize, and then remove the retry from updateNetworkConfig. That matches the current 5-minute window but makes the readiness check explicit. Does this sound good?

@savme
Collaborator

savme commented Dec 12, 2025

Yeah, this is pretty much exactly what I was thinking @JEETDESAI25 👍

@JEETDESAI25 JEETDESAI25 force-pushed the feat/issue-239-network-retry branch from 43c93fd to e9e5fbf Compare December 12, 2025 22:33
@JEETDESAI25
Collaborator Author

Thank you for guiding me; I really appreciate your feedback. Please let me know if there's anything else I can adjust. @savme

const projectActiveTimeout = 5 * time.Minute

func waitForProjectActive(ctx context.Context, projectRef string, client *api.ClientWithResponses) diag.Diagnostics {
	err := retry.RetryContext(ctx, projectActiveTimeout, func() *retry.RetryError {
Collaborator

I'm wondering if the more specific WaitForStateContext would work better here?

Collaborator Author

We now use StateChangeConf.WaitForStateContext with explicit Pending/Target states. I also moved the helper to utils.go so both project_resource and settings_resource can share it.
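
In sketch form, that pattern looks roughly like this; the V1GetProjectWithResponse method and the COMINGUP constant are assumed names following the generated API's conventions, not necessarily the exact diff:

// Assumes the imports from the sketch above plus the generated api package.
func waitForProjectActive(ctx context.Context, projectRef string, client *api.ClientWithResponses) error {
	stateConf := &retry.StateChangeConf{
		Pending: []string{string(api.V1ProjectWithDatabaseResponseStatusCOMINGUP)},
		Target: []string{
			string(api.V1ProjectWithDatabaseResponseStatusACTIVEHEALTHY),
			string(api.V1ProjectWithDatabaseResponseStatusACTIVEUNHEALTHY),
		},
		Timeout: projectActiveTimeout,
		Refresh: func() (interface{}, string, error) {
			httpResp, err := client.V1GetProjectWithResponse(ctx, projectRef)
			if err != nil {
				return nil, "", err
			}
			if httpResp.JSON200 == nil {
				return nil, "", fmt.Errorf("unexpected status %s fetching project", httpResp.Status())
			}
			return httpResp.JSON200, string(httpResp.JSON200.Status), nil
		},
	}
	_, err := stateConf.WaitForStateContext(ctx)
	return err
}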

})

switch status {
case api.V1ProjectWithDatabaseResponseStatusACTIVEHEALTHY, api.V1ProjectWithDatabaseResponseStatusACTIVEUNHEALTHY:
Collaborator

Have you come across a project being in an unhealthy state during testing?

I’m not 100% sure, but I assume that subsequent updates to an unhealthy project would be rejected. If that’s the case, we probably shouldn’t treat this status as a successful update.

Collaborator Author

You're right: an unhealthy project might reject subsequent updates. I changed the helper to target only ACTIVE_HEALTHY; ACTIVE_UNHEALTHY now stays in the Pending list, so the wait keeps polling until the project becomes fully healthy.
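
The resulting split, roughly (the COMINGUP constant is again an assumed name):

// Only a fully healthy project counts as ready; an active-but-unhealthy
// project keeps the wait polling until it recovers or the timeout fires.
var (
	projectPendingStates = []string{
		string(api.V1ProjectWithDatabaseResponseStatusCOMINGUP),
		string(api.V1ProjectWithDatabaseResponseStatusACTIVEUNHEALTHY),
	}
	projectTargetStates = []string{
		string(api.V1ProjectWithDatabaseResponseStatusACTIVEHEALTHY),
	}
)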

switch status {
case api.V1ProjectWithDatabaseResponseStatusACTIVEHEALTHY, api.V1ProjectWithDatabaseResponseStatusACTIVEUNHEALTHY:
	return nil
case api.V1ProjectWithDatabaseResponseStatusINITFAILED, api.V1ProjectWithDatabaseResponseStatusREMOVED:
Collaborator

Suggested change
- case api.V1ProjectWithDatabaseResponseStatusINITFAILED, api.V1ProjectWithDatabaseResponseStatusREMOVED:
+ case api.V1ProjectWithDatabaseResponseStatusINITFAILED, api.V1ProjectWithDatabaseResponseStatusREMOVED, api.V1ProjectWithDatabaseResponseStatusGOINGDOWN:

Collaborator Author

Done. I added GOING_DOWN and also included INACTIVE, PAUSE_FAILED, and RESTORE_FAILED, since these are terminal states that require operator intervention.

var knownProjectStatuses = map[api.V1ProjectWithDatabaseResponseStatus]bool{
	// Target
	api.V1ProjectWithDatabaseResponseStatusACTIVEHEALTHY: true,
	// Pending
Collaborator

let's move this to a separate array for easier reuse
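
For instance, a shared slice along these lines; the INACTIVE, PAUSEFAILED, and RESTOREFAILED constant spellings are assumed to follow the enum pattern above:

// Terminal statuses shared by project_resource and settings_resource;
// a project in any of these needs operator intervention to recover.
var projectTerminalStatuses = []api.V1ProjectWithDatabaseResponseStatus{
	api.V1ProjectWithDatabaseResponseStatusINITFAILED,
	api.V1ProjectWithDatabaseResponseStatusREMOVED,
	api.V1ProjectWithDatabaseResponseStatusGOINGDOWN,
	api.V1ProjectWithDatabaseResponseStatusINACTIVE,
	api.V1ProjectWithDatabaseResponseStatusPAUSEFAILED,
	api.V1ProjectWithDatabaseResponseStatusRESTOREFAILED,
}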


const projectActiveTimeout = 5 * time.Minute

const statusUnknownTransient = "UNKNOWN_TRANSIENT"
Collaborator

we can remove this custom status

})

switch httpResp.JSON200.Status {
case api.V1ProjectWithDatabaseResponseStatusGOINGDOWN,
Collaborator

replace with a check using terminal states array
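
With a terminal-states slice in place, the refresh check might become (using the standard library slices package, Go 1.21+):

// Fail fast when the project has entered a terminal status.
if slices.Contains(projectTerminalStatuses, httpResp.JSON200.Status) {
	return nil, "", fmt.Errorf("project %s is in terminal status %q", projectRef, httpResp.JSON200.Status)
}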

Comment on lines +98 to +105
if !knownProjectStatuses[httpResp.JSON200.Status] {
	tflog.Warn(ctx, "Unrecognized project status, treating as transient", map[string]interface{}{
		"project_ref": projectRef,
		"status":      status,
	})
	return httpResp.JSON200, statusUnknownTransient, nil
}

Collaborator

Suggested change
- if !knownProjectStatuses[httpResp.JSON200.Status] {
- 	tflog.Warn(ctx, "Unrecognized project status, treating as transient", map[string]interface{}{
- 		"project_ref": projectRef,
- 		"status":      status,
- 	})
- 	return httpResp.JSON200, statusUnknownTransient, nil
- }

we can assume all statuses returned by the API have a corresponding enum value


Development

Successfully merging this pull request may close these issues.

Creating supabase_settings with network settings directly after supabase_project almost always fails

3 participants