fix: Implement Solr update retry #11581
base: master
Conversation
Pull request overview
This PR implements a retry mechanism for failed Solr updates to prevent data loss during transient Solr outages or network issues. The implementation adds database tables for tracking failures, a retry worker daemon, and failure recording logic integrated into the existing Solr update flow.
Key Changes:
- Database tables (`solr_update_failures` and `solr_update_failures_archived`) to persist failed update attempts
- Retry worker daemon (`solr_retry_worker.py`) with exponential backoff and automatic archival after 10 attempts
- Integration of failure recording into `solr_update()` when max retries are exceeded
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| `openlibrary/core/schema.sql` | Adds two new tables for tracking and archiving failed Solr updates with appropriate indexes for query performance |
| `openlibrary/solr/utils.py` | Implements `record_failure()` with exponential backoff calculation and database insertion |
| `openlibrary/solr/update.py` | Extracts entity type from updater class names and passes it to `solr_update()` for failure tracking |
| `scripts/solr_updater/solr_retry_worker.py` | New daemon that polls for ready failures, retries them, and archives those exceeding max retry attempts |
| `docker/ol-solr-updater-start.sh` | Starts the retry worker as a background process during container initialization |
```sql
max_retries INT DEFAULT 10 NOT NULL,
next_retry_at timestamp without time zone NOT NULL,
first_failed_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,
last_attempted_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,
```
Copilot AI, Dec 11, 2025
The last_attempted_at column has a default value that is only applied on row creation. When retries occur, this timestamp is not updated, making it impossible to track when the last retry attempt was made. Consider removing the default or updating this field during retry attempts for better observability.
Suggested change:
```diff
-last_attempted_at timestamp without time zone DEFAULT (current_timestamp at time zone 'utc') NOT NULL,
+last_attempted_at timestamp without time zone NOT NULL,
```
```python
logger.error(
    f"Retry failed for batch {failure_id}: {e}",
    extra={
        'failure_id': failure_id,
        'retry_count': retry_count,
        'error_type': type(e).__name__,
    },
    exc_info=True,
)

new_retry_count = retry_count + 1
if new_retry_count >= max_retries:
    archive_failure(failure_id)
else:
    logger.warning(
        f"Batch {failure_id} failed retry {new_retry_count}/{max_retries}"
    )

return False
```
Copilot AI, Dec 11, 2025
When a retry fails, the retry_count and next_retry_at fields in the database are not updated. This means the same failure will be retried immediately on the next poll instead of respecting the exponential backoff strategy. The function should update these fields in the database to schedule the next retry attempt appropriately and increment the retry count.
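For illustration, a minimal sketch (not code from this PR) of how the worker could reschedule a failed retry; it reuses the table and column names from the schema above, and the helper name `reschedule_failure` is hypothetical:

```python
import random
from datetime import datetime, timedelta, timezone


def reschedule_failure(db, failure_id: int, retry_count: int) -> None:
    """Bump retry_count and push next_retry_at out with exponential backoff."""
    new_retry_count = retry_count + 1
    base_delay_minutes = 2**new_retry_count
    jitter = random.uniform(-base_delay_minutes * 0.1, base_delay_minutes * 0.1)
    next_retry_at = datetime.now(timezone.utc) + timedelta(
        minutes=base_delay_minutes + jitter
    )
    db.query(
        """
        UPDATE solr_update_failures
        SET retry_count = $1,
            next_retry_at = $2,
            last_attempted_at = (current_timestamp at time zone 'utc')
        WHERE id = $3
        """,
        new_retry_count,
        next_retry_at,
        failure_id,
    )
```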
```python
db.query("DELETE FROM solr_update_failures WHERE id = $1", failure_id)
```
Copilot AI, Dec 11, 2025
The retry_failure function mixes async and sync operations. It calls an async function (update.update_keys) but then performs blocking synchronous database operations (db.query). The db.query calls should be made asynchronous or the function should use asyncio.to_thread() to avoid blocking the event loop.
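For example, a minimal sketch of offloading the blocking call with `asyncio.to_thread()`, assuming the same synchronous `db.query` interface shown in these snippets:

```python
import asyncio


async def delete_failure_row(db, failure_id: int) -> None:
    # Run the blocking database call in a worker thread so the event loop
    # stays free to service other coroutines.
    await asyncio.to_thread(
        db.query, "DELETE FROM solr_update_failures WHERE id = $1", failure_id
    )
```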
```python
try:
    await update.update_keys(keys, commit=True)
```
Copilot AI, Dec 11, 2025
When update_keys is called during retry, it may fail again and call record_failure, which would create a new entry with retry_count=0 instead of updating the existing failure record. This could lead to duplicate failure entries for the same keys. The retry logic should either pass the existing failure_id to be updated, or handle this case differently to avoid duplication.
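One possible shape for this, as a sketch rather than the PR's actual code (`reschedule_failure` refers to the hypothetical helper sketched above, and `db`/`update` are the objects used in these snippets):

```python
async def retry_existing_failure(
    db, update, failure_id: int, keys: list[str], retry_count: int
) -> bool:
    try:
        await update.update_keys(keys, commit=True)
    except Exception:
        # Reschedule the existing row instead of letting record_failure
        # insert a fresh entry with retry_count=0.
        reschedule_failure(db, failure_id, retry_count)
        return False
    # Success: the recorded failure is no longer needed.
    db.query("DELETE FROM solr_update_failures WHERE id = $1", failure_id)
    return True
```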
```python
updater_keys = uniq(k for k in keys if updater.key_test(k))
await updater.preload_keys(updater_keys)

entity_type = updater.__class__.__name__.replace('SolrUpdater', '').lower()
```
Copilot AI, Dec 11, 2025
The entity_type extraction relies on a naming convention (removing 'SolrUpdater' suffix). If an updater class doesn't follow this naming pattern, the entity type would be incorrect. Consider using a more explicit approach like adding an entity_type property to the AbstractSolrUpdater class to avoid coupling to naming conventions.
Suggested change:
```diff
-entity_type = updater.__class__.__name__.replace('SolrUpdater', '').lower()
+entity_type = getattr(updater, "entity_type", "unknown")
```
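A sketch of the more explicit approach, where each updater declares its entity type instead of relying on class-name parsing; the class bodies here are illustrative, not the actual Open Library implementations:

```python
class AbstractSolrUpdater:
    # Subclasses override this; callers read updater.entity_type directly.
    entity_type: str = "unknown"


class WorkSolrUpdater(AbstractSolrUpdater):
    entity_type = "work"


class AuthorSolrUpdater(AbstractSolrUpdater):
    entity_type = "author"
```

With this in place, the call site reduces to `entity_type = updater.entity_type`, and the `getattr(..., "unknown")` fallback in the suggestion above is only needed during a transition.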
| """Archive a failure that exceeded max retries.""" | ||
| try: | ||
| db.query( | ||
| """ | ||
| INSERT INTO solr_update_failures_archived | ||
| SELECT *, NOW() as archived_at, 'max_retries_exceeded' as archived_reason, | ||
| NULL as manual_resolution_notes, NULL as resolved_at, NULL as resolved_by | ||
| FROM solr_update_failures | ||
| WHERE id = $1 | ||
| """, | ||
| failure_id, | ||
| ) | ||
|
|
||
| db.query("DELETE FROM solr_update_failures WHERE id = $1", failure_id) | ||
|
|
||
| logger.critical( | ||
| f"Archived failure {failure_id} after exceeding max retries", | ||
| extra={'failure_id': failure_id}, | ||
| ) | ||
| except Exception as e: | ||
| logger.error(f"Failed to archive failure {failure_id}: {e}", exc_info=True) | ||
|
|
Copilot AI, Dec 11, 2025
The archive_failure function also mixes async and sync operations by being called from an async context but performing synchronous database operations. This should be made async or use asyncio.to_thread() to prevent blocking the event loop.
```python
base_delay_minutes = 2**retry_count
jitter = random.uniform(-base_delay_minutes * 0.1, base_delay_minutes * 0.1)
total_delay_minutes = base_delay_minutes + jitter
next_retry_at = datetime.utcnow() + timedelta(minutes=total_delay_minutes)
```
Copilot AI, Dec 11, 2025
Using datetime.utcnow() is deprecated in Python 3.12+ and will be removed in future versions. Consider using datetime.now(timezone.utc) instead for better compatibility with future Python versions.
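A minimal sketch of the replacement, with a stand-in for the delay value computed above; note that if the `timestamp without time zone` columns are kept, the tzinfo may need to be stripped before insertion:

```python
from datetime import datetime, timedelta, timezone

total_delay_minutes = 8.0  # stand-in for the value computed above
# Timezone-aware "now" in UTC, the documented replacement for datetime.utcnow()
next_retry_at = datetime.now(timezone.utc) + timedelta(minutes=total_delay_minutes)
```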
| echo "Starting Solr retry worker" | ||
| PYTHONPATH=. python scripts/solr_updater/solr_retry_worker.py "$OL_CONFIG" \ | ||
| --ol-url "$OL_URL" \ | ||
| --poll-interval 30 \ | ||
| --batch-size 100 & | ||
|
|
Copilot AI, Dec 11, 2025
The retry worker is started in the background without any health check or restart mechanism. If the worker crashes or exits unexpectedly, failed Solr updates will accumulate without being retried. Consider adding monitoring, health checks, or using a process supervisor like supervisord to ensure the worker restarts on failure.
| echo "Starting Solr retry worker" | |
| PYTHONPATH=. python scripts/solr_updater/solr_retry_worker.py "$OL_CONFIG" \ | |
| --ol-url "$OL_URL" \ | |
| --poll-interval 30 \ | |
| --batch-size 100 & | |
| echo "Starting Solr retry worker (with auto-restart)" | |
| ( | |
| while true; do | |
| PYTHONPATH=. python scripts/solr_updater/solr_retry_worker.py "$OL_CONFIG" \ | |
| --ol-url "$OL_URL" \ | |
| --poll-interval 30 \ | |
| --batch-size 100 | |
| EXIT_CODE=$? | |
| echo "Solr retry worker exited with code $EXIT_CODE. Restarting in 5 seconds..." >&2 | |
| sleep 5 | |
| done | |
| ) & |
```python
def record_failure(
    keys: list[str],
    entity_type: str,
    exception: Exception,
    retry_count: int = 0,
    max_retries: int = 10,
) -> None:
    """Record a failed Solr update for later retry with exponential backoff."""
    try:
        error_type = type(exception).__name__
        error_message = str(exception)
        stack_trace = traceback.format_exc()

        # Exponential backoff: 2^retry_count minutes + jitter
        base_delay_minutes = 2**retry_count
        jitter = random.uniform(-base_delay_minutes * 0.1, base_delay_minutes * 0.1)
        total_delay_minutes = base_delay_minutes + jitter
        next_retry_at = datetime.utcnow() + timedelta(minutes=total_delay_minutes)

        solr_response_code = getattr(exception, 'status_code', None)

        db.query(
            """
            INSERT INTO solr_update_failures
                (keys, entity_type, error_type, error_message, stack_trace,
                 retry_count, max_retries, next_retry_at, solr_response_code)
            VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
            """,
            keys,
            entity_type,
            error_type,
            error_message,
            stack_trace,
            retry_count,
            max_retries,
            next_retry_at,
            solr_response_code,
        )

        logger.warning(
            f"Recorded failed Solr update for {len(keys)} keys ({entity_type}). "
            f"Next retry in {total_delay_minutes:.1f} minutes.",
            extra={
                'keys_sample': keys[:5],
                'total_keys': len(keys),
                'entity_type': entity_type,
                'error_type': error_type,
                'retry_count': retry_count,
                'next_retry_at': next_retry_at.isoformat(),
                'delay_minutes': total_delay_minutes,
            },
        )
    except Exception as e:
        logger.error(
            f"Failed to record Solr update failure: {e}",
            exc_info=True,
            extra={
                'original_error': str(exception),
                'keys_count': len(keys),
                'entity_type': entity_type,
            },
        )
```
Copilot AI, Dec 11, 2025
The new retry mechanism functionality (record_failure, retry_failure, archive_failure) lacks test coverage. The existing test file tests solr_update but doesn't cover the failure recording and retry logic. Consider adding tests to verify: failure recording with exponential backoff calculation, retry worker logic, archival of failures after max retries, and handling of duplicate failures during retry.
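As a starting point, a minimal pytest-style sketch for `record_failure`, assuming it lives in `openlibrary/solr/utils.py` and uses a module-level `db` object as in the snippets above; the test name and the mocking approach are illustrative, not existing project code:

```python
from datetime import datetime, timedelta
from unittest.mock import MagicMock, patch

from openlibrary.solr import utils


def test_record_failure_schedules_exponential_backoff():
    fake_db = MagicMock()
    with patch.object(utils, 'db', fake_db):
        before = datetime.utcnow()
        utils.record_failure(
            keys=['/works/OL1W'],
            entity_type='work',
            exception=ValueError('solr down'),
            retry_count=3,
        )
    # Exactly one INSERT should have been issued
    assert fake_db.query.call_count == 1
    # Positional args: sql, keys, entity_type, error_type, error_message,
    # stack_trace, retry_count, max_retries, next_retry_at, solr_response_code
    next_retry_at = fake_db.query.call_args.args[8]
    # 2**3 = 8 minutes base delay, plus or minus 10% jitter
    assert before + timedelta(minutes=7) < next_retry_at
    assert next_retry_at < before + timedelta(minutes=9)
```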
```python
import asyncio
import logging
```
Copilot AI, Dec 11, 2025
Import of 'datetime' is not used.
Not ready for review yet. I'll inform you after solving the errors!
Closes #10737
Overview
This PR implements a robust retry mechanism for Solr index updates to prevent data loss in case of transient Solr outages or network issues. Previously, failed Solr updates were dropped; this change ensures they are recorded, retried, and archived if they persistently fail.
Technical Details
1. Failure Persistence
New database tables:
- `solr_update_failures`
- `solr_update_failures_archived`

These store failed Solr update attempts instead of dropping them.
2. Retry Logic
Added `record_failure()` in `solr/utils.py`, which computes an exponential backoff schedule and persists the failed batch to the database.
3. Retry Worker
Created a dedicated retry daemon: `scripts/solr_updater/solr_retry_worker.py`.
It polls for failures that are ready to retry, retries them, and archives those exceeding the maximum number of retry attempts.
4. Deployment
Added the retry worker to the startup script (`docker/ol-solr-updater-start.sh`).
Testing Instructions
1. Simulate a Solr outage: block access to Solr or stop the Solr service.
2. Trigger an update.
3. Verify failure persistence: confirm the failure is saved in the `solr_update_failures` table.
4. Restore Solr: once access is restored, the retry worker should pick up the recorded failure, retry the batch, and clear it from `solr_update_failures` once it succeeds.
I also verified this logic using a dedicated local test script that simulates multiple failure scenarios; happy to attach it if needed.
Stakeholders
@cdrini