
Commit 1eae847

mathinic and jannisborn authored
Full text download fallback implementation (#72)
* Adding full text download fallback options
* Fix bug when not providing api_keys
* Fix missing log message with no citation_pdf_url
* Adding Table of Contents to ReadMe for better overview
* Fix ReadMe formatting
* Fix type hints and logging based on review feedback

  Co-authored-by: Jannis Born <[email protected]>

* Refactor PDF download functions to remove logger parameter and improve type hints
* Add tests for fallback functions when doing full text download
* Bump version to 0.3.0 and add contributor info in README
* Fix test case comment to ignore codespell warning for DOI
* Update codespell ignore list for false-positive (smll in DOI)
* chore: apply formatting and slightly lower the logging verbosity
* chore: cleanup
* ci: lower codecov requirement
* Add additional API tests

---------

Co-authored-by: Jannis Born <[email protected]>
Co-authored-by: Jannis Born <[email protected]>
1 parent b611d65 commit 1eae847
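
The headline change is a fallback chain for full-text downloads plus optional publisher API keys. A minimal usage sketch of what the commit enables (file paths are placeholders; `save_pdf_from_dump` and its `api_keys` argument are the interfaces documented in the README changes below):

```py
from paperscraper.pdf import save_pdf_from_dump

# Download full texts for a previously scraped metadata dump.
# PDFs are fetched where possible; the new fallbacks (BioC-PMC, eLife,
# optional Wiley/Elsevier TDM APIs) kick in for papers behind publisher sites.
save_pdf_from_dump(
    'pubmed_query_results.jsonl',      # placeholder metadata dump
    pdf_path='./papers',
    key_to_save='doi',
    api_keys='path/to/api_keys.txt',   # optional; omit to use only the free fallbacks
)
```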

File tree

7 files changed, +612 -45 lines changed


.codespellrc

Lines changed: 1 addition & 1 deletion
@@ -3,4 +3,4 @@
 skip = .git*,.codespellrc
 check-hidden = true
 # ignore-regex =
-ignore-words-list = vor
+ignore-words-list = vor,smll

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -142,4 +142,7 @@ dmypy.json
 __pycache__/

 # Specific folders
-paperscraper/server_dumps/*.json
+paperscraper/server_dumps/*.json
+
+# ignore api keys
+api_keys.txt

README.md

Lines changed: 86 additions & 26 deletions
@@ -9,12 +9,27 @@ MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.or
 [![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
 # paperscraper

-`paperscraper` is a `python` package for scraping publication metadata or full PDF files from
+`paperscraper` is a `python` package for scraping publication metadata or full text files (PDF or XML) from
 **PubMed** or preprint servers such as **arXiv**, **medRxiv**, **bioRxiv** and **chemRxiv**.
 It provides a streamlined interface to scrape metadata, allows to retrieve citation counts
 from Google Scholar, impact factors from journals and comes with simple postprocessing functions
 and plotting routines for meta-analysis.

+## Table of Contents
+
+1. [Getting Started](#getting-started)
+   - [Download X-rxiv Dumps](#download-x-rxiv-dumps)
+   - [Arxiv Local Dump](#arxiv-local-dump)
+2. [Examples](#examples)
+   - [Publication Keyword Search](#publication-keyword-search)
+   - [Full-Text Retrieval (PDFs & XMLs)](#full-text-retrieval-pdfs--xmls)
+   - [Citation Search](#citation-search)
+   - [Journal Impact Factor](#journal-impact-factor)
+3. [Plotting](#plotting)
+   - [Barplots](#barplots)
+   - [Venn Diagrams](#venn-diagrams)
+4. [Citation](#citation)
+5. [Contributions](#contributions)

 ## Getting started

@@ -43,6 +58,21 @@ medrxiv(start_date="2023-04-01", end_date="2023-04-08")
 ```
 But watch out. The resulting `.jsonl` file will be labelled according to the current date and all your subsequent searches will be based on this file **only**. If you use this option you might want to keep an eye on the source files (`paperscraper/server_dumps/*jsonl`) to ensure they contain the paper metadata for all papers you're interested in.

+#### Arxiv local dump
+If you prefer local search rather than using the arxiv API:
+
+```py
+from paperscraper.get_dumps import arxiv
+arxiv(start_date='2024-01-01', end_date=None) # scrapes all metadata from 2024 until today.
+```
+
+Afterwards you can search the local arxiv dump just like the other x-rxiv dumps.
+The direct endpoint is `paperscraper.arxiv.get_arxiv_papers_local`. You can also specify the
+backend directly in the `get_and_dump_arxiv_papers` function:
+```py
+from paperscraper.arxiv import get_and_dump_arxiv_papers
+get_and_dump_arxiv_papers(..., backend='local')
+```

 ## Examples
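
To illustrate the local backend introduced above: once the local arXiv dump exists, the usual keyword-query workflow can be pointed at it via `backend='local'`. A minimal sketch (query terms and output filename are placeholders; the keyword-list semantics are assumed to match the other `get_and_dump_*_papers` examples in this README):

```py
from paperscraper.arxiv import get_and_dump_arxiv_papers

# Each inner list holds synonyms (combined with OR); the outer list is combined with AND.
diffusion = ['diffusion model', 'score-based generative model']
chemistry = ['molecule', 'chemistry']

get_and_dump_arxiv_papers(
    [diffusion, chemistry],
    output_filepath='arxiv_diffusion_chemistry.jsonl',
    backend='local',  # search the local dump instead of querying the arXiv API
)
```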

@@ -113,27 +143,71 @@ get_and_dump_scholar_papers(topic)
 ```
 *NOTE*: The scholar endpoint does not require authentication but since it regularly prompts with captchas, it's difficult to apply large scale.

-### Scrape PDFs
+### Full-Text Retrieval (PDFs & XMLs)
+
+`paperscraper` allows you to download full text of publications using DOIs. The basic functionality works reliably for preprint servers (arXiv, bioRxiv, medRxiv, chemRxiv), but retrieving papers from PubMed dumps is more challenging due to publisher restrictions and paywalls.
+
+#### Standard Usage

-`paperscraper` also allows you to download the PDF files.
+The main download functions work for all paper types with automatic fallbacks:

 ```py
 from paperscraper.pdf import save_pdf
 paper_data = {'doi': "10.48550/arXiv.2207.03928"}
 save_pdf(paper_data, filepath='gt4sd_paper.pdf')
 ```

-If you want to batch download all PDFs for your previous metadata search, use the wrapper.
-Here we scrape the PDFs for the metadata obtained in the previous example.
+To batch download full texts from your metadata search results:

 ```py
 from paperscraper.pdf import save_pdf_from_dump

-# Save PDFs in current folder and name the files by their DOI
+# Save PDFs/XMLs in current folder and name the files by their DOI
 save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')
 ```
-*NOTE*: This works robustly for preprint servers, but if you use it on a PubMed dump, dont expect to obtain all PDFs.
-Many publishers detect and block scraping and many publications are simply behind paywalls.
+
+#### Automatic Fallback Mechanisms
+
+When the standard text retrieval fails, `paperscraper` automatically tries these fallbacks:
+
+- **BioC-PMC**: For biomedical papers in [PubMed Central](https://pmc.ncbi.nlm.nih.gov/) (open-access repository), it retrieves open-access full-text XML from the [BioC-PMC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/).
+- **eLife Papers**: For [eLife](https://elifesciences.org/) journal papers, it fetches XML files from eLife's open [GitHub repository](https://github.com/elifesciences/elife-article-xml).
+
+These fallbacks are tried automatically without requiring any additional configuration.
+
+#### Enhanced Retrieval with Publisher APIs
+
+For more comprehensive access to papers from major publishers, you can provide API keys for:
+
+- **Wiley TDM API**: Enables access to [Wiley](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) publications (2,000+ journals).
+- **Elsevier TDM API**: Enables access to [Elsevier](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) publications (The Lancet, Cell, ...).
+
+To use publisher APIs:
+
+1. Create a file with your API keys:
+```
+WILEY_TDM_API_TOKEN=your_wiley_token_here
+ELSEVIER_TDM_API_KEY=your_elsevier_key_here
+```
+
+2. Pass the file path when calling retrieval functions:
+
+```py
+from paperscraper.pdf import save_pdf_from_dump
+
+save_pdf_from_dump(
+    'pubmed_query_results.jsonl',
+    pdf_path='./papers',
+    key_to_save='doi',
+    api_keys='path/to/your/api_keys.txt'
+)
+```
+
+For obtaining API keys:
+- Wiley TDM API: Visit [Wiley Text and Data Mining](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) (free for academic users with institutional subscription)
+- Elsevier TDM API: Visit [Elsevier's Text and Data Mining](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) (free for academic users with institutional subscription)
+
+*NOTE*: While these fallback mechanisms improve retrieval success rates, they cannot guarantee access to all papers due to various access restrictions.


 ### Citation search
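
For context on the BioC-PMC fallback added in this hunk: it fetches open-access full text as XML from the BioC API. A standalone sketch of such a request, independent of `paperscraper` internals (the endpoint pattern should be verified against the BioC-PMC documentation; the PMC ID is a placeholder):

```py
import requests

pmcid = "PMC7204011"  # placeholder ID of an open-access article
# Endpoint pattern from the BioC-PMC docs; returns BioC-formatted XML for open-access papers.
url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/{pmcid}/unicode"

response = requests.get(url, timeout=30)
if response.ok and response.text.strip():
    with open(f"{pmcid}.xml", "w", encoding="utf-8") as handle:
        handle.write(response.text)
else:
    print(f"No open-access BioC XML available for {pmcid}")
```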
@@ -185,28 +259,13 @@ i.search("quantum information", threshold=90, return_all=True)
 # ]
 ```

-## Arxiv local dump
-If you prefer local search rather than using the arxiv API:
-
-```py
-from paperscraper.get_dumps import arxiv
-arxiv(start_date='2024-01-01', end_date=None) # scrapes all metadata from 2024 until today.
-```
-
-Afterwards you can search the local arxiv dump just like the other x-rxiv dumps.
-The direct endpoint is `paperscraper.arxiv.get_arxiv_papers_local`. You can also specify the
-backend directly in the `get_and_dump_arxiv_papers` function:
-```py
-from paperscraper.arxiv import get_and_dump_arxiv_papers
-get_and_dump_arxiv_papers(..., backend='local')
-```

-### Plotting
+## Plotting

 When multiple query searches are performed, two types of plots can be generated
 automatically: Venn diagrams and bar plots.

-#### Barplots
+### Barplots

 Compare the temporal evolution of different queries across different servers.

@@ -264,7 +323,7 @@ plot_comparison(
 ![molreps](https://github.com/jannisborn/paperscraper/blob/main/assets/molreps.png?raw=true "MolReps")


-#### Venn Diagrams
+### Venn Diagrams

 ```py
 from paperscraper.plotting import (
@@ -323,6 +382,7 @@ If you use `paperscraper`, please cite a paper that motivated our development of

 ## Contributions
 Thanks to the following contributors:
+- [@mathinic](https://github.com/mathinic): Since `v0.3.0` improved PubMed full text retrieval with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
 - [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
 - [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
 - [@daenuprobst](https://github.com/daenuprobst): Since `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)

codecov.yml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ coverage:
   patch: off
   project:
     default:
-      target: 90%
+      target: 80%
       threshold: 2% # Up to 2% drop/fluctuation is OK

 ignore:

paperscraper/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 """Initialize the module."""

 __name__ = "paperscraper"
-__version__ = "0.2.16"
+__version__ = "0.3.0"

 import logging
 import os
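
After upgrading, the installed package can be checked against the version bumped here:

```py
import paperscraper

print(paperscraper.__version__)  # expected: '0.3.0' for this release
```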
