fix duplicate nats jobs #189


Draft · wants to merge 5 commits into master

Conversation


@shellphy shellphy commented Jan 20, 2025

Reason for This PR

Avoid duplicate task executions.

Description of Changes

The NATS consumer implements at-least-once message delivery semantics. By default, if a task is being processed but no "in-progress" signal is sent back to the NATS server within the acknowledgment window (AckWait, 30 seconds by default), the server re-delivers the same message every 30 seconds. This leads to duplicate task executions whenever processing time exceeds 30 seconds.

The issue can be reproduced by running the test case TestNATSLongTaskErr against the original version. The test fails because the task is executed twice, which validates my concern about duplicate task executions.

To prevent duplicate task executions, I implemented a concurrency-safe solution using a sync.Map named inProgressItems to track the status of running tasks. This map maintains the state of currently executing tasks, and entries are promptly cleaned up after task completion (ACK), failure (NACK), or termination (TERM) operations.
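In isolation, the tracking pattern described above can be sketched as follows; `tracker`, `begin`, and `finish` are illustrative names for this sketch, not the actual driver code:

```go
package main

import (
	"fmt"
	"sync"
)

// tracker mirrors the idea behind inProgressItems: a concurrency-safe
// set of IDs for messages that are currently being processed.
type tracker struct {
	inProgress sync.Map
}

// begin reports whether this delivery is the first one for the given ID.
// A redelivery of a still-running message returns false and can be skipped.
func (t *tracker) begin(id string) bool {
	_, loaded := t.inProgress.LoadOrStore(id, struct{}{})
	return !loaded
}

// finish removes the ID after ACK, NACK, or TERM so the map does not grow.
func (t *tracker) finish(id string) {
	t.inProgress.Delete(id)
}

func main() {
	t := &tracker{}
	fmt.Println(t.begin("job-1")) // first delivery: true
	fmt.Println(t.begin("job-1")) // redelivery while still running: false
	t.finish("job-1")             // cleanup after ACK
	fmt.Println(t.begin("job-1")) // a genuinely new delivery: true again
}
```

In the PR itself, the same LoadOrStore/Delete calls are threaded through the ack, nack, and requeue paths via wrapCleanupFn.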

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the MIT license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not.]

  • All commits in this PR are signed (git commit -s).
  • The reason for this PR is clearly provided (issue no. or explanation).
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any user-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced job tracking with in-progress item management.
    • Added a new method to ensure cleanup of in-progress items after job execution.
    • Introduced a configuration file for long-running NATS job tasks.
  • Tests

    • Added a new test case for long-running task error handling.
    • Updated test configurations to validate NATS job processing.

The changes improve the robustness of job processing by tracking in-progress items and preventing duplicate job executions.


coderabbitai bot commented Jan 20, 2025

Walkthrough

The pull request introduces enhancements to the NATS jobs driver in the RoadRunner project, focusing on improving state management and handling of in-progress jobs. A new inProgressItems field of type sync.Map is added to the Driver struct to track currently processing jobs. The changes include a new wrapCleanupFn method to manage job lifecycle, updated listener logic for job processing, and a new configuration file along with test infrastructure to validate the new functionality.

Changes

File Change Summary
natsjobs/driver.go Added inProgressItems sync.Map field to Driver struct
natsjobs/listener.go - Added wrapCleanupFn method to manage job cleanup
- Modified job processing logic to use inProgressItems
- Updated handling of acknowledgment and requeue functions
tests/configs/.rr-nats-long-task.yaml New configuration file for long-running NATS job tests
tests/jobs_nats_test.go Added TestNATSLongTaskErr test function to validate job processing
tests/php_test_files/jobs/jobs_long_task.php PHP script for simulating long-running job processing

Sequence Diagram

sequenceDiagram
    participant Driver
    participant Listener
    participant Job
    
    Driver->>Listener: Start Job Processing
    Listener->>Listener: Check inProgressItems
    alt Job Not In Progress
        Listener->>Job: Process Job
        Listener->>Driver: Update inProgressItems
    else Job Already In Progress
        Listener->>Listener: Log Duplicate Job
        Listener->>Driver: Skip Processing
    end
    Listener->>Driver: Cleanup Job

Poem

🐰 In the warren of code, a map takes flight,
Tracking jobs with synchronized might.
No duplicate tasks shall slip through the cracks,
With RoadRunner's clever tracking hacks.
A rabbit's efficiency, pure delight! 🚀

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
natsjobs/listener.go (1)

106-119: Consider refactoring nested function declarations for better readability.

While the implementation is functionally correct, the nested function declarations could be simplified for better maintainability.

Consider extracting the requeue cleanup wrapper:

-item.Options.requeueFn = func(item *Item) error {
-    return c.wrapCleanupFn(item.ID(), func() error {
-        return c.requeue(item)
-    })()
-}
+func (c *Driver) wrapRequeueCleanup(item *Item) func() error {
+    return c.wrapCleanupFn(item.ID(), func() error {
+        return c.requeue(item)
+    })
+}
+
+item.Options.requeueFn = func(item *Item) error {
+    return c.wrapRequeueCleanup(item)()
+}
tests/php_test_files/jobs/jobs_long_task.php (1)

1-25: Consider enhancing error handling in the test script.

While the script serves its purpose for testing, the error handling could be more robust.

Consider adding more specific error handling:

 try {
     sleep(35);
     $task->ack();
 } catch (\Throwable $e) {
-    $task->error((string)$e);
+    $error_message = sprintf(
+        "Task failed: %s\nFile: %s\nLine: %d",
+        $e->getMessage(),
+        $e->getFile(),
+        $e->getLine()
+    );
+    $task->error($error_message);
 }
tests/jobs_nats_test.go (1)

1258-1339: Consider adding more specific test assertions.

While the test covers the basic functionality, it could benefit from more detailed assertions.

Consider adding assertions for the exact message content and timing:

 require.Equal(t, 1, oLogger.FilterMessageSnippet("job processing was started").Len())
 require.Equal(t, 1, oLogger.FilterMessageSnippet("job already in progress").Len())
+
+// Verify timing of the duplicate job detection
+messages := oLogger.FilterMessageSnippet("job already in progress")
+require.Equal(t, 1, len(messages))
+firstMessage := messages[0]
+require.Contains(t, firstMessage.Message, "test")  // Verify job ID
+
+// Verify the time difference between start and duplicate detection
+startMessages := oLogger.FilterMessageSnippet("job processing was started")
+require.Equal(t, 1, len(startMessages))
+timeDiff := messages[0].Time.Sub(startMessages[0].Time)
+require.Greater(t, timeDiff.Seconds(), 30.0)  // Verify minimum processing time
tests/configs/.rr-nats-long-task.yaml (1)

1-40: Consider documenting the test configuration.

The configuration file would benefit from comments explaining:

  1. The purpose of this test configuration
  2. The relationship with the duplicate jobs fix
  3. Why specific values were chosen (e.g., timeouts, prefetch)

Add comments like:

 version: '3'
+# Test configuration for verifying NATS duplicate jobs fix
+# - Uses long-running tasks (35s) to trigger potential duplicates
+# - Single worker with prefetch=1 to control message delivery
+# - Configured for at-least-once delivery semantics testing
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7355284 and 9636acb.

⛔ Files ignored due to path filters (1)
  • go.work.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • natsjobs/driver.go (2 hunks)
  • natsjobs/listener.go (3 hunks)
  • tests/configs/.rr-nats-long-task.yaml (1 hunks)
  • tests/jobs_nats_test.go (1 hunks)
  • tests/php_test_files/jobs/jobs_long_task.php (1 hunks)
🔇 Additional comments (7)
natsjobs/listener.go (3)

45-54: Well-structured cleanup wrapper implementation!

The implementation is concise and handles the cleanup of inProgressItems correctly. The error handling ensures the map is only cleaned up on successful operations.


84-99: Effective duplicate job prevention implementation!

The use of LoadOrStore provides thread-safe checking for in-progress items. The error handling is comprehensive, and the logging provides clear visibility into duplicate job scenarios.


Line range hint 134-152: Well-implemented auto-ack handling!

The implementation properly handles auto-ack scenarios with appropriate error handling and cleanup. The sequence of operations (ack → delete → cleanup) is correct.

natsjobs/driver.go (1)

39-47: Clean integration of inProgressItems tracking!

The addition of inProgressItems to the Driver struct and its initialization is well-implemented. The use of sync.Map ensures thread-safe operations.

Also applies to: 134-140

tests/configs/.rr-nats-long-task.yaml (3)

31-38: Configuration aligns well with duplicate job prevention.

The NATS pipeline configuration is well-structured for testing the duplicate jobs fix:

  • prefetch: 1 ensures controlled message delivery
  • deliver_new: "true" prioritizes new messages
  • delete_after_ack: true helps maintain clean state

23-27: Worker pool configuration supports the test scenario.

The worker pool settings are appropriate for testing the duplicate jobs fix:

  • Single worker ensures sequential processing
  • No max jobs limit (max_jobs: 0) prevents artificial constraints
  • Generous timeouts (60s) accommodate the long-running tasks
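Taken together, the settings discussed in these two comments would correspond to a pipeline definition along these lines. The subject and stream names are placeholders, and the key names follow RoadRunner's NATS driver configuration conventions, so check the docs for your version:

```yaml
jobs:
  pool:
    num_workers: 1        # single worker -> strictly sequential processing
    max_jobs: 0           # no artificial per-worker job limit
    allocate_timeout: 60s # generous timeouts for the 35s test task
    destroy_timeout: 60s
  pipelines:
    test-long-task:
      driver: nats
      config:
        prefetch: 1              # controlled, one-at-a-time delivery
        subject: default         # placeholder subject
        stream: long-task-stream # placeholder stream name
        deliver_new: "true"      # only consume messages published after start
        delete_after_ack: true   # keep the stream clean between runs
```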

7-9: Verify the relay timeout configuration.

The relay timeout is set to 20s but according to the summary, the test involves tasks running for 35s. This mismatch could cause premature termination of long-running tasks.

Run this script to check the task duration in the PHP file:

✅ Verification successful

The relay timeout configuration is correct for this test case

The 20-second relay timeout combined with a 35-second task duration is intentionally configured this way to test the system's behavior with long-running tasks that exceed the timeout period. This setup is appropriate for verifying the duplicate jobs issue mentioned in the PR objectives.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the sleep duration in the PHP test file
rg "sleep|usleep" "php_test_files/jobs/jobs_long_task.php"



Script:

#!/bin/bash
# Find PHP files containing "long" or "task" in their names
fd -e php -E vendor '(long|task)'



Script:

#!/bin/bash
# Check sleep duration in the test file
cat "tests/php_test_files/jobs/jobs_long_task.php"


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
natsjobs/listener.go (4)

84-99: Simplify error handling pattern.

The nested error handling could be simplified to improve readability while maintaining the same functionality.

Consider this more concise approach:

 if _, loaded := c.inProgressItems.LoadOrStore(item.ID(), struct{}{}); loaded {
-	err = m.InProgress()
-	if err != nil {
-		errn := m.Nak()
-		if errn != nil {
-			c.log.Error("failed to send Nak state", zap.Error(errn), zap.Error(err))
-			continue
-		}
-		c.log.Error("failed to send InProgress state", zap.Error(err))
-		continue
-	}
+	if err := m.InProgress(); err != nil {
+		if errn := m.Nak(); errn != nil {
+			c.log.Error("failed to send Nak state", zap.Error(errn), zap.Error(err))
+		} else {
+			c.log.Error("failed to send InProgress state", zap.Error(err))
+		}
+		continue
+	}

 	c.log.Info("job already in progress", zap.String("id", item.ID()))
 	span.End()
 	continue
 }

106-119: Reduce code duplication in wrapper functions.

The wrapper pattern for nakWithDelay and requeueFn is duplicated. Consider extracting a helper function to make the code more DRY.

Consider this approach:

+// wrapDelayedCleanupFn helper for operations that need additional setup
+func (c *Driver) wrapDelayedCleanupFn(id string, setupFn func() func() error) func() error {
+	return c.wrapCleanupFn(id, func() error {
+		return setupFn()()
+	})
+}

 // in listenerStart:
-item.Options.nakWithDelay = func(delay time.Duration) error {
-	return c.wrapCleanupFn(item.ID(), func() error {
-		return m.NakWithDelay(delay)
-	})()
-}
+item.Options.nakWithDelay = func(delay time.Duration) error {
+	return c.wrapDelayedCleanupFn(item.ID(), func() func() error {
+		return func() error { return m.NakWithDelay(delay) }
+	})()
+}

134-137: Maintain consistent error handling patterns.

The error handling pattern here differs from other parts of the code. Consider maintaining consistency with the simplified error handling pattern suggested earlier.

Consider this approach:

 if item.Options.AutoAck {
 	c.log.Debug("auto_ack option enabled")
-	err := item.Options.ack()
-	if err != nil {
+	if err := item.Options.ack(); err != nil {
 		item = nil
 		c.log.Error("message acknowledge", zap.Error(err))
 		span.RecordError(err)

Line range hint 1-180: Overall implementation effectively addresses task duplication.

The implementation successfully addresses the task duplication issue by using a thread-safe sync.Map to track in-progress items and implementing proper cleanup mechanisms. The code is well-structured and includes appropriate error handling and logging.

While there are some suggested improvements for code organization and error handling patterns, the core functionality is solid and achieves the PR objectives.

Consider adding metrics/monitoring for:

  • Number of in-progress items
  • Frequency of duplicate job attempts
  • Cleanup success/failure rates

This would help track the effectiveness of the solution in production.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9636acb and 6032aab.

📒 Files selected for processing (1)
  • natsjobs/listener.go (3 hunks)

Comment on lines +45 to +54
// wrapCleanupFn wrap the cleanup function to ensure the inProgressItems map is updated correctly
func (c *Driver) wrapCleanupFn(id string, fn func() error) func() error {
	return func() error {
		err := fn()
		if err == nil {
			c.inProgressItems.Delete(id)
		}
		return err
	}
}

🛠️ Refactor suggestion

Consider cleanup on all execution paths.

The cleanup function only removes items from inProgressItems when err == nil. This could lead to memory leaks if a job consistently fails, as failed items will remain in the map indefinitely.

Consider cleaning up regardless of the error state:

 func (c *Driver) wrapCleanupFn(id string, fn func() error) func() error {
 	return func() error {
 		err := fn()
+		// Always clean up, regardless of error
+		c.inProgressItems.Delete(id)
-		if err == nil {
-			c.inProgressItems.Delete(id)
-		}
 		return err
 	}
 }

@shellphy shellphy mentioned this pull request Jan 20, 2025
@rustatian rustatian self-requested a review January 21, 2025 20:42
@rustatian rustatian added the enhancement New feature or request label Jan 21, 2025
@rustatian rustatian marked this pull request as draft January 21, 2025 20:42
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
natsjobs/listener.go (2)

87-102: Enhance error handling and span management.

Consider these improvements:

  1. Add job ID to error logs for better debugging
  2. Move span.End() after error handling to capture error context
 if _, loaded := c.inProgressItems.LoadOrStore(item.ID(), struct{}{}); loaded {
     err = m.InProgress()
     if err != nil {
         errn := m.Nak()
         if errn != nil {
-            c.log.Error("failed to send Nak state", zap.Error(errn), zap.Error(err))
+            c.log.Error("failed to send Nak state", 
+                zap.String("job_id", item.ID()),
+                zap.Error(errn), 
+                zap.Error(err))
+            span.RecordError(errn)
+            span.End()
             continue
         }
-        c.log.Error("failed to send InProgress state", zap.Error(err))
+        c.log.Error("failed to send InProgress state",
+            zap.String("job_id", item.ID()),
+            zap.Error(err))
+        span.RecordError(err)
+        span.End()
         continue
     }

     c.log.Info("job already in progress", zap.String("id", item.ID()))
     span.End()
     continue
 }

109-122: Simplify nested function creation.

The nested function creation in nakWithDelay and requeueFn could be simplified for better readability.

-item.Options.nakWithDelay = func(delay time.Duration) error {
-    return c.wrapCleanupFn(item.ID(), func() error {
-        return m.NakWithDelay(delay)
-    })()
-}
+item.Options.nakWithDelay = func(delay time.Duration) error {
+    nakFn := func() error { return m.NakWithDelay(delay) }
+    return c.wrapCleanupFn(item.ID(), nakFn)()
+}

-item.Options.requeueFn = func(item *Item) error {
-    return c.wrapCleanupFn(item.ID(), func() error {
-        return c.requeue(item)
-    })()
-}
+item.Options.requeueFn = func(item *Item) error {
+    requeueFn := func() error { return c.requeue(item) }
+    return c.wrapCleanupFn(item.ID(), requeueFn)()
+}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6032aab and c4a9f85.

📒 Files selected for processing (1)
  • natsjobs/listener.go (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Analyze (go)
  • GitHub Check: NATS plugin (Go stable, PHP 8.3, OS ubuntu-latest)
  • GitHub Check: NATS durability plugin (Go stable, PHP 8.3, OS ubuntu-latest)
🔇 Additional comments (3)
natsjobs/listener.go (3)

18-20: Great documentation!

The comment clearly explains the NATS server's message redelivery behavior, which is crucial context for understanding the duplicate jobs issue being fixed.


48-57: Consider cleanup on all execution paths.

The cleanup function only removes items from inProgressItems when err == nil. This could lead to memory leaks if a job consistently fails, as failed items will remain in the map indefinitely.


137-140: ⚠️ Potential issue

Ensure cleanup happens even when message deletion fails.

When auto-ack is enabled, the inProgressItems entry should be cleaned up even if the subsequent message deletion fails. Currently, if message deletion fails, the item remains in the map.

 if item.Options.AutoAck {
     c.log.Debug("auto_ack option enabled")
     err := item.Options.ack()
     if err != nil {
         item = nil
         c.log.Error("message acknowledge", zap.Error(err))
         span.RecordError(err)
         span.End()
         continue
     }

     if item.Options.deleteAfterAck {
         err = c.stream.DeleteMsg(context.Background(), meta.Sequence.Stream)
         if err != nil {
+            // Clean up the item from inProgressItems even if deletion fails
+            c.inProgressItems.Delete(item.ID())
             c.log.Error("delete message", zap.Error(err))
             item = nil
             span.RecordError(err)
             span.End()
             continue
         }
     }

Likely invalid or redundant comment.

@rustatian
Member

Hey @shellphy 👋🏻
Thank you for the PR 👍🏻

I'm not sure this is the correct way to handle the re-delivery case:

  1. By creating a map of IDs, you're interrupting the natural way of delivering messages.
  2. The map is not cleared if the Jobs plugin errors or when the pipeline is stopped.

I think that, while a message is being processed, something like a ticker should be implemented that keeps track of the currently in-progress messages and naturally notifies the NATS server about them, instead of skipping redeliveries.

I understand that this solution won't work for everyone, but it's worth saying that increasing the ackwait timeout would also solve this problem.
