Conversation

@akramcodez
Contributor

Closes #10737

Overview

This PR implements a robust retry mechanism for Solr index updates to prevent data loss during transient Solr outages or network issues. Previously, failed Solr updates were dropped; this change ensures they are recorded, retried, and archived if they persistently fail.

Technical Details

1. Failure Persistence

New database tables:

  • solr_update_failures
  • solr_update_failures_archived

These store failed Solr update attempts instead of dropping them.
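
For reference, a minimal sketch of what the primary table might look like, inferred from the columns quoted in the review excerpts below; types and the id column are assumptions, and the authoritative definition lives in openlibrary/core/schema.sql:

-- Sketch only: columns inferred from the PR excerpts, not the exact DDL
CREATE TABLE solr_update_failures (
    id serial PRIMARY KEY,
    keys text[] NOT NULL,        -- entity keys in the failed batch
    entity_type text NOT NULL,   -- work, author, etc.
    error_type text,
    error_message text,
    stack_trace text,
    solr_response_code int,
    retry_count INT DEFAULT 0 NOT NULL,
    max_retries INT DEFAULT 10 NOT NULL,
    next_retry_at timestamp without time zone NOT NULL,
    first_failed_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,
    last_attempted_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL
);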

2. Retry Logic

Added record_failure in openlibrary/solr/utils.py (see the backoff sketch after this list) with:

  • Exponential backoff (starting at 1 minute, doubling after each failure)
  • Random jitter to avoid retry bursts
  • Entity type tracking (work, author, etc.) for observability
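
For intuition, here is the retry schedule this produces; a minimal sketch that just replays the 2**retry_count formula from the record_failure excerpt quoted in the review below, not the PR's code itself:

import random

# Delay grows 1 -> 2 -> 4 -> 8 -> 16 minutes, each with +/-10% jitter
for retry_count in range(5):
    base_delay_minutes = 2**retry_count
    jitter = random.uniform(-0.1, 0.1) * base_delay_minutes
    print(f"retry {retry_count}: ~{base_delay_minutes + jitter:.1f} min")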

3. Retry Worker

Created a dedicated retry daemon:

scripts/solr_updater/solr_retry_worker.py

It (see the loop sketch after this list):

  • Polls the failure queue
  • Retries Solr updates
  • Archives entries after 10 failed attempts
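
In outline, the worker's main loop looks roughly like this; a simplified sketch where fetch_ready_failures and the retry_failure signature are illustrative assumptions, not the PR's exact code:

import asyncio

async def main_loop(poll_interval: int = 30, batch_size: int = 100) -> None:
    """Poll the failure queue and retry batches that are due."""
    while True:
        # Hypothetical helper: fetch rows whose next_retry_at has passed
        for failure in fetch_ready_failures(limit=batch_size):
            # Re-runs the Solr update; per the excerpts reviewed below,
            # retry_failure archives the row once max_retries is exceeded
            await retry_failure(failure)
        await asyncio.sleep(poll_interval)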

4. Deployment

Added the retry worker to the startup script:

docker/ol-solr-updater-start.sh

Testing Instructions

  1. Simulate a Solr outage
    Block access to Solr or stop the Solr service.

  2. Trigger an update

    python scripts/solr_updater/solr_update.py --keys /works/OL123W

  3. Verify failure persistence
    Confirm the failure is saved in the solr_update_failures table (an example query follows this list).

  4. Restore Solr
    Once access is restored, the retry worker should:

    • Pick up the queued failure
    • Successfully process the update
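
For step 3, a query along these lines should show the queued row (column names taken from the schema and code excerpts quoted in the review below):

SELECT id, keys, entity_type, retry_count, next_retry_at
FROM solr_update_failures
ORDER BY first_failed_at DESC
LIMIT 5;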

I also verified this logic using a dedicated local test script that simulates multiple failure scenarios; happy to attach it if needed.

Stakeholders

@cdrini

Copilot AI review requested due to automatic review settings December 11, 2025 15:54
Contributor

Copilot AI left a comment


Pull request overview

This PR implements a retry mechanism for failed Solr updates to prevent data loss during transient Solr outages or network issues. The implementation adds database tables for tracking failures, a retry worker daemon, and failure recording logic integrated into the existing Solr update flow.

Key Changes:

  • Database tables (solr_update_failures and solr_update_failures_archived) to persist failed update attempts
  • Retry worker daemon (solr_retry_worker.py) with exponential backoff and automatic archival after 10 attempts
  • Integration of failure recording into solr_update() when max retries are exceeded

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.

Summary per file:

  • openlibrary/core/schema.sql — Adds two new tables for tracking and archiving failed Solr updates, with appropriate indexes for query performance
  • openlibrary/solr/utils.py — Implements record_failure() with exponential backoff calculation and database insertion
  • openlibrary/solr/update.py — Extracts the entity type from updater class names and passes it to solr_update() for failure tracking
  • scripts/solr_updater/solr_retry_worker.py — New daemon that polls for ready failures, retries them, and archives those exceeding max retry attempts
  • docker/ol-solr-updater-start.sh — Starts the retry worker as a background process during container initialization


max_retries INT DEFAULT 10 NOT NULL,
next_retry_at timestamp without time zone NOT NULL,
first_failed_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,
last_attempted_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,

Copilot AI Dec 11, 2025


The last_attempted_at column has a default value that only sets on row creation. When retries occur, this timestamp is not updated, making it impossible to track when the last retry attempt was made. Consider removing the default or updating this field during retry attempts for better observability.

Suggested change
- last_attempted_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,
+ last_attempted_at timestamp without time zone NOT NULL,

Comment on lines +99 to +118
logger.error(
    f"Retry failed for batch {failure_id}: {e}",
    extra={
        'failure_id': failure_id,
        'retry_count': retry_count,
        'error_type': type(e).__name__,
    },
    exc_info=True,
)

new_retry_count = retry_count + 1
if new_retry_count >= max_retries:
    archive_failure(failure_id)
else:
    logger.warning(
        f"Batch {failure_id} failed retry {new_retry_count}/{max_retries}"
    )

return False


Copilot AI Dec 11, 2025


When a retry fails, the retry_count and next_retry_at fields in the database are not updated. This means the same failure will be retried immediately on the next poll instead of respecting the exponential backoff strategy. The function should update these fields in the database to schedule the next retry attempt appropriately and increment the retry count.
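
Concretely, the fix described here could look like the following; a sketch that reuses the db.query call shape quoted elsewhere in this PR, with compute_next_retry_at as a hypothetical helper applying the same 2**n-plus-jitter formula:

new_retry_count = retry_count + 1
next_retry_at = compute_next_retry_at(new_retry_count)  # hypothetical helper

# Persist the bumped count and the backoff-scheduled retry time
db.query(
    """
    UPDATE solr_update_failures
    SET retry_count = $2,
        next_retry_at = $3,
        last_attempted_at = (current_timestamp at time zone 'utc')
    WHERE id = $1
    """,
    failure_id,
    new_retry_count,
    next_retry_at,
)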

Comment on lines +85 to +87

db.query("DELETE FROM solr_update_failures WHERE id = $1", failure_id)


Copilot AI Dec 11, 2025


The retry_failure function mixes async and sync operations. It calls an async function (update.update_keys) but then performs blocking synchronous database operations (db.query). The db.query calls should be made asynchronous or the function should use asyncio.to_thread() to avoid blocking the event loop.
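
For instance, the blocking call could be moved onto a worker thread with the standard-library asyncio.to_thread (Python 3.9+); a minimal sketch, assuming the module-level db object used in the PR:

import asyncio

async def delete_failure(failure_id: int) -> None:
    # Run the synchronous db call without blocking the event loop
    await asyncio.to_thread(
        db.query, "DELETE FROM solr_update_failures WHERE id = $1", failure_id
    )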


try:
    await update.update_keys(keys, commit=True)


Copilot AI Dec 11, 2025


When update_keys is called during retry, it may fail again and call record_failure, which would create a new entry with retry_count=0 instead of updating the existing failure record. This could lead to duplicate failure entries for the same keys. The retry logic should either pass the existing failure_id to be updated, or handle this case differently to avoid duplication.

updater_keys = uniq(k for k in keys if updater.key_test(k))
await updater.preload_keys(updater_keys)

entity_type = updater.__class__.__name__.replace('SolrUpdater', '').lower()

Copilot AI Dec 11, 2025


The entity_type extraction relies on a naming convention (removing 'SolrUpdater' suffix). If an updater class doesn't follow this naming pattern, the entity type would be incorrect. Consider using a more explicit approach like adding an entity_type property to the AbstractSolrUpdater class to avoid coupling to naming conventions.

Suggested change
- entity_type = updater.__class__.__name__.replace('SolrUpdater', '').lower()
+ entity_type = getattr(updater, "entity_type", "unknown")
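
Spelled out on the class hierarchy, the suggestion amounts to declaring the type explicitly; a sketch, where the subclass names are assumptions based on the naming convention the current code relies on:

class AbstractSolrUpdater:
    entity_type: str = "unknown"  # explicit, rather than derived from the class name

class WorkSolrUpdater(AbstractSolrUpdater):
    entity_type = "work"

class AuthorSolrUpdater(AbstractSolrUpdater):
    entity_type = "author"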

Comment on lines +40 to +61
"""Archive a failure that exceeded max retries."""
try:
db.query(
"""
INSERT INTO solr_update_failures_archived
SELECT *, NOW() as archived_at, 'max_retries_exceeded' as archived_reason,
NULL as manual_resolution_notes, NULL as resolved_at, NULL as resolved_by
FROM solr_update_failures
WHERE id = $1
""",
failure_id,
)

db.query("DELETE FROM solr_update_failures WHERE id = $1", failure_id)

logger.critical(
f"Archived failure {failure_id} after exceeding max retries",
extra={'failure_id': failure_id},
)
except Exception as e:
logger.error(f"Failed to archive failure {failure_id}: {e}", exc_info=True)


Copilot AI Dec 11, 2025


The archive_failure function also mixes async and sync operations by being called from an async context but performing synchronous database operations. This should be made async or use asyncio.to_thread() to prevent blocking the event loop.

base_delay_minutes = 2**retry_count
jitter = random.uniform(-base_delay_minutes * 0.1, base_delay_minutes * 0.1)
total_delay_minutes = base_delay_minutes + jitter
next_retry_at = datetime.utcnow() + timedelta(minutes=total_delay_minutes)

Copilot AI Dec 11, 2025


Using datetime.utcnow() is deprecated in Python 3.12+ and will be removed in future versions. Consider using datetime.now(timezone.utc) instead for better compatibility with future Python versions.
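
A minimal sketch of the replacement, with total_delay_minutes standing in for the value computed in the quoted code; note the result becomes timezone-aware, so it must be stored consistently with the `timestamp without time zone` columns:

from datetime import datetime, timedelta, timezone

total_delay_minutes = 1.0  # stand-in for the computed backoff delay
next_retry_at = datetime.now(timezone.utc) + timedelta(minutes=total_delay_minutes)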

Comment on lines +13 to +18
echo "Starting Solr retry worker"
PYTHONPATH=. python scripts/solr_updater/solr_retry_worker.py "$OL_CONFIG" \
--ol-url "$OL_URL" \
--poll-interval 30 \
--batch-size 100 &


Copilot AI Dec 11, 2025


The retry worker is started in the background without any health check or restart mechanism. If the worker crashes or exits unexpectedly, failed Solr updates will accumulate without being retried. Consider adding monitoring, health checks, or using a process supervisor like supervisord to ensure the worker restarts on failure.

Suggested change
- echo "Starting Solr retry worker"
- PYTHONPATH=. python scripts/solr_updater/solr_retry_worker.py "$OL_CONFIG" \
-     --ol-url "$OL_URL" \
-     --poll-interval 30 \
-     --batch-size 100 &
+ echo "Starting Solr retry worker (with auto-restart)"
+ (
+     while true; do
+         PYTHONPATH=. python scripts/solr_updater/solr_retry_worker.py "$OL_CONFIG" \
+             --ol-url "$OL_URL" \
+             --poll-interval 30 \
+             --batch-size 100
+         EXIT_CODE=$?
+         echo "Solr retry worker exited with code $EXIT_CODE. Restarting in 5 seconds..." >&2
+         sleep 5
+     done
+ ) &

Comment on lines +81 to +143
def record_failure(
    keys: list[str],
    entity_type: str,
    exception: Exception,
    retry_count: int = 0,
    max_retries: int = 10,
) -> None:
    """Record a failed Solr update for later retry with exponential backoff."""
    try:
        error_type = type(exception).__name__
        error_message = str(exception)
        stack_trace = traceback.format_exc()

        # Exponential backoff: 2^retry_count minutes + jitter
        base_delay_minutes = 2**retry_count
        jitter = random.uniform(-base_delay_minutes * 0.1, base_delay_minutes * 0.1)
        total_delay_minutes = base_delay_minutes + jitter
        next_retry_at = datetime.utcnow() + timedelta(minutes=total_delay_minutes)

        solr_response_code = getattr(exception, 'status_code', None)

        db.query(
            """
            INSERT INTO solr_update_failures
                (keys, entity_type, error_type, error_message, stack_trace,
                 retry_count, max_retries, next_retry_at, solr_response_code)
            VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
            """,
            keys,
            entity_type,
            error_type,
            error_message,
            stack_trace,
            retry_count,
            max_retries,
            next_retry_at,
            solr_response_code,
        )

        logger.warning(
            f"Recorded failed Solr update for {len(keys)} keys ({entity_type}). "
            f"Next retry in {total_delay_minutes:.1f} minutes.",
            extra={
                'keys_sample': keys[:5],
                'total_keys': len(keys),
                'entity_type': entity_type,
                'error_type': error_type,
                'retry_count': retry_count,
                'next_retry_at': next_retry_at.isoformat(),
                'delay_minutes': total_delay_minutes,
            },
        )
    except Exception as e:
        logger.error(
            f"Failed to record Solr update failure: {e}",
            exc_info=True,
            extra={
                'original_error': str(exception),
                'keys_count': len(keys),
                'entity_type': entity_type,
            },
        )


Copilot AI Dec 11, 2025


The new retry mechanism functionality (record_failure, retry_failure, archive_failure) lacks test coverage. The existing test file tests solr_update but doesn't cover the failure recording and retry logic. Consider adding tests to verify: failure recording with exponential backoff calculation, retry worker logic, archival of failures after max retries, and handling of duplicate failures during retry.
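
As a starting point, a test for the backoff window could look like this; a sketch that assumes openlibrary.solr.utils exposes the module-level db object used in the quoted code, and patches it out:

from datetime import datetime, timedelta
from unittest.mock import patch

from openlibrary.solr.utils import record_failure


def test_record_failure_schedules_first_retry_about_a_minute_out():
    with patch('openlibrary.solr.utils.db') as mock_db:
        before = datetime.utcnow()
        record_failure(['/works/OL123W'], 'work', RuntimeError('boom'))
        # next_retry_at is the 8th positional value after the SQL string
        next_retry_at = mock_db.query.call_args.args[8]
        # retry_count=0 gives 1 minute +/-10% jitter
        assert timedelta(minutes=0.85) <= next_retry_at - before <= timedelta(minutes=1.25)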


import asyncio
import logging


Copilot AI Dec 11, 2025


Import of 'datetime' is not used.

@akramcodez
Contributor Author

Not ready for review yet. I'll let you know once the errors are fixed!

@github-actions github-actions bot added the Needs: Response label (Issues which require feedback from lead) on Dec 12, 2025
