Conversation

@akramcodez
Contributor

Closes #10737

Overview

This PR implements a robust retry mechanism for Solr index updates to prevent data loss during transient Solr outages or network issues. Previously, failed Solr updates were dropped; this change ensures they are recorded, retried, and archived if they persistently fail.

Technical Details

1. Failure Persistence

New database tables:

  • solr_update_failures
  • solr_update_failures_archived

These store failed Solr update attempts instead of dropping them.
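
For reference, a minimal sketch of what the primary table might look like, inferred from the columns quoted in the review excerpts below; types and the id column are assumptions, and the authoritative definition lives in openlibrary/core/schema.sql:

-- Sketch only: columns inferred from the PR excerpts, not the exact DDL
CREATE TABLE solr_update_failures (
    id serial PRIMARY KEY,
    keys text[] NOT NULL,        -- entity keys in the failed batch
    entity_type text NOT NULL,   -- work, author, etc.
    error_type text,
    error_message text,
    stack_trace text,
    solr_response_code int,
    retry_count INT DEFAULT 0 NOT NULL,
    max_retries INT DEFAULT 10 NOT NULL,
    next_retry_at timestamp without time zone NOT NULL,
    first_failed_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,
    last_attempted_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL
);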

2. Retry Logic

Added record_failure in openlibrary/solr/utils.py (see the backoff sketch after this list) with:

  • Exponential backoff (starting at 1 minute, doubling after each failure)
  • Random jitter to avoid retry bursts
  • Entity type tracking (work, author, etc.) for observability
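
For intuition, here is the retry schedule this produces; a minimal sketch that just replays the 2**retry_count formula from the record_failure excerpt quoted in the review below, not the PR's code itself:

import random

# Delay grows 1 -> 2 -> 4 -> 8 -> 16 minutes, each with +/-10% jitter
for retry_count in range(5):
    base_delay_minutes = 2**retry_count
    jitter = random.uniform(-0.1, 0.1) * base_delay_minutes
    print(f"retry {retry_count}: ~{base_delay_minutes + jitter:.1f} min")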

3. Retry Worker

Created a dedicated retry daemon:

scripts/solr_updater/solr_retry_worker.py

It (see the loop sketch after this list):

  • Polls the failure queue
  • Retries Solr updates
  • Archives entries after 10 failed attempts
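
In outline, the worker's main loop looks roughly like this; a simplified sketch where fetch_ready_failures and the retry_failure signature are illustrative assumptions, not the PR's exact code:

import asyncio

async def main_loop(poll_interval: int = 30, batch_size: int = 100) -> None:
    """Poll the failure queue and retry batches that are due."""
    while True:
        # Hypothetical helper: fetch rows whose next_retry_at has passed
        for failure in fetch_ready_failures(limit=batch_size):
            # Re-runs the Solr update; per the excerpts reviewed below,
            # retry_failure archives the row once max_retries is exceeded
            await retry_failure(failure)
        await asyncio.sleep(poll_interval)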

4. Deployment

Added the retry worker to the startup script:

docker/ol-solr-updater-start.sh

Testing Instructions

  1. Simulate a Solr outage
    Block access to Solr or stop the Solr service.

  2. Trigger an update

    python scripts/solr_updater/solr_update.py --keys /works/OL123W

  3. Verify failure persistence
    Confirm the failure is saved in the solr_update_failures table (an example query follows this list).

  4. Restore Solr
    Once access is restored, the retry worker should:

    • Pick up the queued failure
    • Successfully process the update
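
For step 3, a query along these lines should show the queued row (column names taken from the schema and code excerpts quoted in the review below):

SELECT id, keys, entity_type, retry_count, next_retry_at
FROM solr_update_failures
ORDER BY first_failed_at DESC
LIMIT 5;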

I also verified this logic using a dedicated local test script that simulates multiple failure scenarios; happy to attach it if needed.

Stakeholders

@cdrini

Copilot AI review requested due to automatic review settings December 11, 2025 15:54
Contributor

Copilot AI left a comment


Pull request overview

This PR implements a retry mechanism for failed Solr updates to prevent data loss during transient Solr outages or network issues. The implementation adds database tables for tracking failures, a retry worker daemon, and failure recording logic integrated into the existing Solr update flow.

Key Changes:

  • Database tables (solr_update_failures and solr_update_failures_archived) to persist failed update attempts
  • Retry worker daemon (solr_retry_worker.py) with exponential backoff and automatic archival after 10 attempts
  • Integration of failure recording into solr_update() when max retries are exceeded

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.

Summary per file:

  • openlibrary/core/schema.sql — Adds two new tables for tracking and archiving failed Solr updates, with appropriate indexes for query performance
  • openlibrary/solr/utils.py — Implements record_failure() with exponential backoff calculation and database insertion
  • openlibrary/solr/update.py — Extracts the entity type from updater class names and passes it to solr_update() for failure tracking
  • scripts/solr_updater/solr_retry_worker.py — New daemon that polls for ready failures, retries them, and archives those exceeding max retry attempts
  • docker/ol-solr-updater-start.sh — Starts the retry worker as a background process during container initialization


max_retries INT DEFAULT 10 NOT NULL,
next_retry_at timestamp without time zone NOT NULL,
first_failed_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,
last_attempted_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,

Copilot AI Dec 11, 2025


The last_attempted_at column has a default value that only sets on row creation. When retries occur, this timestamp is not updated, making it impossible to track when the last retry attempt was made. Consider removing the default or updating this field during retry attempts for better observability.

Suggested change
- last_attempted_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,
+ last_attempted_at timestamp without time zone NOT NULL,

Comment on lines +99 to +118
logger.error(
    f"Retry failed for batch {failure_id}: {e}",
    extra={
        'failure_id': failure_id,
        'retry_count': retry_count,
        'error_type': type(e).__name__,
    },
    exc_info=True,
)

new_retry_count = retry_count + 1
if new_retry_count >= max_retries:
    archive_failure(failure_id)
else:
    logger.warning(
        f"Batch {failure_id} failed retry {new_retry_count}/{max_retries}"
    )

return False


Copilot AI Dec 11, 2025


When a retry fails, the retry_count and next_retry_at fields in the database are not updated. This means the same failure will be retried immediately on the next poll instead of respecting the exponential backoff strategy. The function should update these fields in the database to schedule the next retry attempt appropriately and increment the retry count.
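
Concretely, the fix described here could look like the following; a sketch that reuses the db.query call shape quoted elsewhere in this PR, with compute_next_retry_at as a hypothetical helper applying the same 2**n-plus-jitter formula:

new_retry_count = retry_count + 1
next_retry_at = compute_next_retry_at(new_retry_count)  # hypothetical helper

# Persist the bumped count and the backoff-scheduled retry time
db.query(
    """
    UPDATE solr_update_failures
    SET retry_count = $2,
        next_retry_at = $3,
        last_attempted_at = (current_timestamp at time zone 'utc')
    WHERE id = $1
    """,
    failure_id,
    new_retry_count,
    next_retry_at,
)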

Comment on lines +85 to +87

db.query("DELETE FROM solr_update_failures WHERE id = $1", failure_id)


Copilot AI Dec 11, 2025


The retry_failure function mixes async and sync operations. It calls an async function (update.update_keys) but then performs blocking synchronous database operations (db.query). The db.query calls should be made asynchronous or the function should use asyncio.to_thread() to avoid blocking the event loop.
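
For instance, the blocking call could be moved onto a worker thread with the standard-library asyncio.to_thread (Python 3.9+); a minimal sketch, assuming the module-level db object used in the PR:

import asyncio

async def delete_failure(failure_id: int) -> None:
    # Run the synchronous db call without blocking the event loop
    await asyncio.to_thread(
        db.query, "DELETE FROM solr_update_failures WHERE id = $1", failure_id
    )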


try:
    await update.update_keys(keys, commit=True)


Copilot AI Dec 11, 2025


When update_keys is called during retry, it may fail again and call record_failure, which would create a new entry with retry_count=0 instead of updating the existing failure record. This could lead to duplicate failure entries for the same keys. The retry logic should either pass the existing failure_id to be updated, or handle this case differently to avoid duplication.

updater_keys = uniq(k for k in keys if updater.key_test(k))
await updater.preload_keys(updater_keys)

entity_type = updater.__class__.__name__.replace('SolrUpdater', '').lower()

Copilot AI Dec 11, 2025


The entity_type extraction relies on a naming convention (removing 'SolrUpdater' suffix). If an updater class doesn't follow this naming pattern, the entity type would be incorrect. Consider using a more explicit approach like adding an entity_type property to the AbstractSolrUpdater class to avoid coupling to naming conventions.

Suggested change
- entity_type = updater.__class__.__name__.replace('SolrUpdater', '').lower()
+ entity_type = getattr(updater, "entity_type", "unknown")
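
Spelled out on the class hierarchy, the suggestion amounts to declaring the type explicitly; a sketch, where the subclass names are assumptions based on the naming convention the current code relies on:

class AbstractSolrUpdater:
    entity_type: str = "unknown"  # explicit, rather than derived from the class name

class WorkSolrUpdater(AbstractSolrUpdater):
    entity_type = "work"

class AuthorSolrUpdater(AbstractSolrUpdater):
    entity_type = "author"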

Comment on lines +40 to +61
"""Archive a failure that exceeded max retries."""
try:
db.query(
"""
INSERT INTO solr_update_failures_archived
SELECT *, NOW() as archived_at, 'max_retries_exceeded' as archived_reason,
NULL as manual_resolution_notes, NULL as resolved_at, NULL as resolved_by
FROM solr_update_failures
WHERE id = $1
""",
failure_id,
)

db.query("DELETE FROM solr_update_failures WHERE id = $1", failure_id)

logger.critical(
f"Archived failure {failure_id} after exceeding max retries",
extra={'failure_id': failure_id},
)
except Exception as e:
logger.error(f"Failed to archive failure {failure_id}: {e}", exc_info=True)


Copilot AI Dec 11, 2025


The archive_failure function also mixes async and sync operations by being called from an async context but performing synchronous database operations. This should be made async or use asyncio.to_thread() to prevent blocking the event loop.

base_delay_minutes = 2**retry_count
jitter = random.uniform(-base_delay_minutes * 0.1, base_delay_minutes * 0.1)
total_delay_minutes = base_delay_minutes + jitter
next_retry_at = datetime.utcnow() + timedelta(minutes=total_delay_minutes)

Copilot AI Dec 11, 2025


Using datetime.utcnow() is deprecated in Python 3.12+ and will be removed in future versions. Consider using datetime.now(timezone.utc) instead for better compatibility with future Python versions.
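
A minimal sketch of the replacement, with total_delay_minutes standing in for the value computed in the quoted code; note the result becomes timezone-aware, so it must be stored consistently with the `timestamp without time zone` columns:

from datetime import datetime, timedelta, timezone

total_delay_minutes = 1.0  # stand-in for the computed backoff delay
next_retry_at = datetime.now(timezone.utc) + timedelta(minutes=total_delay_minutes)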

Comment on lines +13 to +18
echo "Starting Solr retry worker"
PYTHONPATH=. python scripts/solr_updater/solr_retry_worker.py "$OL_CONFIG" \
--ol-url "$OL_URL" \
--poll-interval 30 \
--batch-size 100 &


Copilot AI Dec 11, 2025


The retry worker is started in the background without any health check or restart mechanism. If the worker crashes or exits unexpectedly, failed Solr updates will accumulate without being retried. Consider adding monitoring, health checks, or using a process supervisor like supervisord to ensure the worker restarts on failure.

Suggested change
- echo "Starting Solr retry worker"
- PYTHONPATH=. python scripts/solr_updater/solr_retry_worker.py "$OL_CONFIG" \
-     --ol-url "$OL_URL" \
-     --poll-interval 30 \
-     --batch-size 100 &
+ echo "Starting Solr retry worker (with auto-restart)"
+ (
+     while true; do
+         PYTHONPATH=. python scripts/solr_updater/solr_retry_worker.py "$OL_CONFIG" \
+             --ol-url "$OL_URL" \
+             --poll-interval 30 \
+             --batch-size 100
+         EXIT_CODE=$?
+         echo "Solr retry worker exited with code $EXIT_CODE. Restarting in 5 seconds..." >&2
+         sleep 5
+     done
+ ) &

Comment on lines +81 to +143
def record_failure(
    keys: list[str],
    entity_type: str,
    exception: Exception,
    retry_count: int = 0,
    max_retries: int = 10,
) -> None:
    """Record a failed Solr update for later retry with exponential backoff."""
    try:
        error_type = type(exception).__name__
        error_message = str(exception)
        stack_trace = traceback.format_exc()

        # Exponential backoff: 2^retry_count minutes + jitter
        base_delay_minutes = 2**retry_count
        jitter = random.uniform(-base_delay_minutes * 0.1, base_delay_minutes * 0.1)
        total_delay_minutes = base_delay_minutes + jitter
        next_retry_at = datetime.utcnow() + timedelta(minutes=total_delay_minutes)

        solr_response_code = getattr(exception, 'status_code', None)

        db.query(
            """
            INSERT INTO solr_update_failures
                (keys, entity_type, error_type, error_message, stack_trace,
                 retry_count, max_retries, next_retry_at, solr_response_code)
            VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
            """,
            keys,
            entity_type,
            error_type,
            error_message,
            stack_trace,
            retry_count,
            max_retries,
            next_retry_at,
            solr_response_code,
        )

        logger.warning(
            f"Recorded failed Solr update for {len(keys)} keys ({entity_type}). "
            f"Next retry in {total_delay_minutes:.1f} minutes.",
            extra={
                'keys_sample': keys[:5],
                'total_keys': len(keys),
                'entity_type': entity_type,
                'error_type': error_type,
                'retry_count': retry_count,
                'next_retry_at': next_retry_at.isoformat(),
                'delay_minutes': total_delay_minutes,
            },
        )
    except Exception as e:
        logger.error(
            f"Failed to record Solr update failure: {e}",
            exc_info=True,
            extra={
                'original_error': str(exception),
                'keys_count': len(keys),
                'entity_type': entity_type,
            },
        )


Copilot AI Dec 11, 2025


The new retry mechanism functionality (record_failure, retry_failure, archive_failure) lacks test coverage. The existing test file tests solr_update but doesn't cover the failure recording and retry logic. Consider adding tests to verify: failure recording with exponential backoff calculation, retry worker logic, archival of failures after max retries, and handling of duplicate failures during retry.
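
As a starting point, a test for the backoff window could look like this; a sketch that assumes openlibrary.solr.utils exposes the module-level db object used in the quoted code, and patches it out:

from datetime import datetime, timedelta
from unittest.mock import patch

from openlibrary.solr.utils import record_failure


def test_record_failure_schedules_first_retry_about_a_minute_out():
    with patch('openlibrary.solr.utils.db') as mock_db:
        before = datetime.utcnow()
        record_failure(['/works/OL123W'], 'work', RuntimeError('boom'))
        # next_retry_at is the 8th positional value after the SQL string
        next_retry_at = mock_db.query.call_args.args[8]
        # retry_count=0 gives 1 minute +/-10% jitter
        assert timedelta(minutes=0.85) <= next_retry_at - before <= timedelta(minutes=1.25)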


import asyncio
import logging


Copilot AI Dec 11, 2025


Import of 'datetime' is not used.

@akramcodez
Contributor Author

Not ready for review yet. I'll let you know once the errors are fixed!

@github-actions github-actions bot added the Needs: Response label (Issues which require feedback from lead) on Dec 12, 2025
