
Commit 1eae847

mathinic and jannisborn authored
Full text download fallback implementation (#72)
* Adding full text download fallback options
* Fix bug when not providing api_keys
* Fix missing log message with no citation_pdf_url
* Adding Table of Contents to ReadMe for better overview
* Fix ReadMe formatting
* Fix type hints and logging based on review feedback

  Co-authored-by: Jannis Born <[email protected]>

* Refactor PDF download functions to remove logger parameter and improve type hints
* Add tests for fallback functions when doing full text download
* Bump version to 0.3.0 and add contributor info in README
* Fix test case comment to ignore codespell warning for DOI
* Update codespell ignore list for false-positive (smll in DOI)
* chore: apply formatting and slightly lower the logging verbosity
* chore: cleanup
* ci: lower codecov requirement
* Add additional API tests

---------

Co-authored-by: Jannis Born <[email protected]>
Co-authored-by: Jannis Born <[email protected]>
1 parent b611d65 commit 1eae847
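
The headline change is a fallback chain for full-text downloads plus optional publisher API keys. A minimal usage sketch of what the commit enables (file paths are placeholders; `save_pdf_from_dump` and its `api_keys` argument are the interfaces documented in the README changes below):

```py
from paperscraper.pdf import save_pdf_from_dump

# Download full texts for a previously scraped metadata dump.
# PDFs are fetched where possible; the new fallbacks (BioC-PMC, eLife,
# optional Wiley/Elsevier TDM APIs) kick in for papers behind publisher sites.
save_pdf_from_dump(
    'pubmed_query_results.jsonl',      # placeholder metadata dump
    pdf_path='./papers',
    key_to_save='doi',
    api_keys='path/to/api_keys.txt',   # optional; omit to use only the free fallbacks
)
```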

File tree

7 files changed, +612 -45 lines changed


.codespellrc

Lines changed: 1 addition & 1 deletion
@@ -3,4 +3,4 @@
 skip = .git*,.codespellrc
 check-hidden = true
 # ignore-regex =
-ignore-words-list = vor
+ignore-words-list = vor,smll

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -142,4 +142,7 @@ dmypy.json
 __pycache__/

 # Specific folders
-paperscraper/server_dumps/*.json
+paperscraper/server_dumps/*.json
+
+# ignore api keys
+api_keys.txt

README.md

Lines changed: 86 additions & 26 deletions
@@ -9,12 +9,27 @@ MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.or
 [![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
 # paperscraper

-`paperscraper` is a `python` package for scraping publication metadata or full PDF files from
+`paperscraper` is a `python` package for scraping publication metadata or full text files (PDF or XML) from
 **PubMed** or preprint servers such as **arXiv**, **medRxiv**, **bioRxiv** and **chemRxiv**.
 It provides a streamlined interface to scrape metadata, allows to retrieve citation counts
 from Google Scholar, impact factors from journals and comes with simple postprocessing functions
 and plotting routines for meta-analysis.

+## Table of Contents
+
+1. [Getting Started](#getting-started)
+   - [Download X-rxiv Dumps](#download-x-rxiv-dumps)
+   - [Arxiv Local Dump](#arxiv-local-dump)
+2. [Examples](#examples)
+   - [Publication Keyword Search](#publication-keyword-search)
+   - [Full-Text Retrieval (PDFs & XMLs)](#full-text-retrieval-pdfs--xmls)
+   - [Citation Search](#citation-search)
+   - [Journal Impact Factor](#journal-impact-factor)
+3. [Plotting](#plotting)
+   - [Barplots](#barplots)
+   - [Venn Diagrams](#venn-diagrams)
+4. [Citation](#citation)
+5. [Contributions](#contributions)

 ## Getting started

@@ -43,6 +58,21 @@ medrxiv(start_date="2023-04-01", end_date="2023-04-08")
 ```
 But watch out. The resulting `.jsonl` file will be labelled according to the current date and all your subsequent searches will be based on this file **only**. If you use this option you might want to keep an eye on the source files (`paperscraper/server_dumps/*jsonl`) to ensure they contain the paper metadata for all papers you're interested in.

+#### Arxiv local dump
+If you prefer local search rather than using the arxiv API:
+
+```py
+from paperscraper.get_dumps import arxiv
+arxiv(start_date='2024-01-01', end_date=None) # scrapes all metadata from 2024 until today.
+```
+
+Afterwards you can search the local arxiv dump just like the other x-rxiv dumps.
+The direct endpoint is `paperscraper.arxiv.get_arxiv_papers_local`. You can also specify the
+backend directly in the `get_and_dump_arxiv_papers` function:
+```py
+from paperscraper.arxiv import get_and_dump_arxiv_papers
+get_and_dump_arxiv_papers(..., backend='local')
+```

 ## Examples
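
To illustrate the local backend introduced above: once the local arXiv dump exists, the usual keyword-query workflow can be pointed at it via `backend='local'`. A minimal sketch (query terms and output filename are placeholders; the keyword-list semantics are assumed to match the other `get_and_dump_*_papers` examples in this README):

```py
from paperscraper.arxiv import get_and_dump_arxiv_papers

# Each inner list holds synonyms (combined with OR); the outer list is combined with AND.
diffusion = ['diffusion model', 'score-based generative model']
chemistry = ['molecule', 'chemistry']

get_and_dump_arxiv_papers(
    [diffusion, chemistry],
    output_filepath='arxiv_diffusion_chemistry.jsonl',
    backend='local',  # search the local dump instead of querying the arXiv API
)
```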

@@ -113,27 +143,71 @@ get_and_dump_scholar_papers(topic)
 ```
 *NOTE*: The scholar endpoint does not require authentication but since it regularly prompts with captchas, it's difficult to apply large scale.

-### Scrape PDFs
+### Full-Text Retrieval (PDFs & XMLs)
+
+`paperscraper` allows you to download full text of publications using DOIs. The basic functionality works reliably for preprint servers (arXiv, bioRxiv, medRxiv, chemRxiv), but retrieving papers from PubMed dumps is more challenging due to publisher restrictions and paywalls.
+
+#### Standard Usage

-`paperscraper` also allows you to download the PDF files.
+The main download functions work for all paper types with automatic fallbacks:

 ```py
 from paperscraper.pdf import save_pdf
 paper_data = {'doi': "10.48550/arXiv.2207.03928"}
 save_pdf(paper_data, filepath='gt4sd_paper.pdf')
 ```

-If you want to batch download all PDFs for your previous metadata search, use the wrapper.
-Here we scrape the PDFs for the metadata obtained in the previous example.
+To batch download full texts from your metadata search results:

 ```py
 from paperscraper.pdf import save_pdf_from_dump

-# Save PDFs in current folder and name the files by their DOI
+# Save PDFs/XMLs in current folder and name the files by their DOI
 save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')
 ```
-*NOTE*: This works robustly for preprint servers, but if you use it on a PubMed dump, dont expect to obtain all PDFs.
-Many publishers detect and block scraping and many publications are simply behind paywalls.
+
+#### Automatic Fallback Mechanisms
+
+When the standard text retrieval fails, `paperscraper` automatically tries these fallbacks:
+
+- **BioC-PMC**: For biomedical papers in [PubMed Central](https://pmc.ncbi.nlm.nih.gov/) (open-access repository), it retrieves open-access full-text XML from the [BioC-PMC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/).
+- **eLife Papers**: For [eLife](https://elifesciences.org/) journal papers, it fetches XML files from eLife's open [GitHub repository](https://github.com/elifesciences/elife-article-xml).
+
+These fallbacks are tried automatically without requiring any additional configuration.
+
+#### Enhanced Retrieval with Publisher APIs
+
+For more comprehensive access to papers from major publishers, you can provide API keys for:
+
+- **Wiley TDM API**: Enables access to [Wiley](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) publications (2,000+ journals).
+- **Elsevier TDM API**: Enables access to [Elsevier](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) publications (The Lancet, Cell, ...).
+
+To use publisher APIs:
+
+1. Create a file with your API keys:
+```
+WILEY_TDM_API_TOKEN=your_wiley_token_here
+ELSEVIER_TDM_API_KEY=your_elsevier_key_here
+```
+
+2. Pass the file path when calling retrieval functions:
+
+```py
+from paperscraper.pdf import save_pdf_from_dump
+
+save_pdf_from_dump(
+    'pubmed_query_results.jsonl',
+    pdf_path='./papers',
+    key_to_save='doi',
+    api_keys='path/to/your/api_keys.txt'
+)
+```
+
+For obtaining API keys:
+- Wiley TDM API: Visit [Wiley Text and Data Mining](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) (free for academic users with institutional subscription)
+- Elsevier TDM API: Visit [Elsevier's Text and Data Mining](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) (free for academic users with institutional subscription)
+
+*NOTE*: While these fallback mechanisms improve retrieval success rates, they cannot guarantee access to all papers due to various access restrictions.


 ### Citation search
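
For context on the BioC-PMC fallback added in this hunk: it fetches open-access full text as XML from the BioC API. A standalone sketch of such a request, independent of `paperscraper` internals (the endpoint pattern should be verified against the BioC-PMC documentation; the PMC ID is a placeholder):

```py
import requests

pmcid = "PMC7204011"  # placeholder ID of an open-access article
# Endpoint pattern from the BioC-PMC docs; returns BioC-formatted XML for open-access papers.
url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/{pmcid}/unicode"

response = requests.get(url, timeout=30)
if response.ok and response.text.strip():
    with open(f"{pmcid}.xml", "w", encoding="utf-8") as handle:
        handle.write(response.text)
else:
    print(f"No open-access BioC XML available for {pmcid}")
```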
@@ -185,28 +259,13 @@ i.search("quantum information", threshold=90, return_all=True)
 # ]
 ```

-## Arxiv local dump
-If you prefer local search rather than using the arxiv API:
-
-```py
-from paperscraper.get_dumps import arxiv
-arxiv(start_date='2024-01-01', end_date=None) # scrapes all metadata from 2024 until today.
-```
-
-Afterwards you can search the local arxiv dump just like the other x-rxiv dumps.
-The direct endpoint is `paperscraper.arxiv.get_arxiv_papers_local`. You can also specify the
-backend directly in the `get_and_dump_arxiv_papers` function:
-```py
-from paperscraper.arxiv import get_and_dump_arxiv_papers
-get_and_dump_arxiv_papers(..., backend='local')
-```

-### Plotting
+## Plotting

 When multiple query searches are performed, two types of plots can be generated
 automatically: Venn diagrams and bar plots.

-#### Barplots
+### Barplots

 Compare the temporal evolution of different queries across different servers.

@@ -264,7 +323,7 @@ plot_comparison(
 ![molreps](https://github.com/jannisborn/paperscraper/blob/main/assets/molreps.png?raw=true "MolReps")


-#### Venn Diagrams
+### Venn Diagrams

 ```py
 from paperscraper.plotting import (
@@ -323,6 +382,7 @@ If you use `paperscraper`, please cite a paper that motivated our development of

 ## Contributions
 Thanks to the following contributors:
+- [@mathinic](https://github.com/mathinic): Since `v0.3.0` improved PubMed full text retrieval with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
 - [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
 - [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
 - [@daenuprobst](https://github.com/daenuprobst): Since `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)

codecov.yml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ coverage:
   patch: off
   project:
     default:
-      target: 90%
+      target: 80%
       threshold: 2% # Up to 2% drop/fluctuation is OK

 ignore:

paperscraper/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 """Initialize the module."""

 __name__ = "paperscraper"
-__version__ = "0.2.16"
+__version__ = "0.3.0"

 import logging
 import os
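
After upgrading, the installed package can be checked against the version bumped here:

```py
import paperscraper

print(paperscraper.__version__)  # expected: '0.3.0' for this release
```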
