* Adding full text download fallback options
* Fix bug when not providing api_keys
* Fix missing log message with no citation_pdf_url
* Adding Table of Contents to ReadMe for better overview
* Fix ReadMe formatting
* Fix type hints and logging based on review feedback
Co-authored-by: Jannis Born <[email protected]>
* Refactor PDF download functions to remove logger parameter and improve type hints
* Add tests for fallback functions when doing full text download
* Bump version to 0.3.0 and add contributor info in README
* Fix test case comment to ignore codespell warning for DOI
* Update codespell ignore list for false-positive (smll in DOI)
* chore: apply formatting and slightly lower the logging verbosity
* chore: cleanup
* ci: lower codecov requirement
* Add additional API tests
---------
Co-authored-by: Jannis Born <[email protected]>
But watch out. The resulting `.jsonl` file will be labelled according to the current date and all your subsequent searches will be based on this file **only**. If you use this option you might want to keep an eye on the source files (`paperscraper/server_dumps/*jsonl`) to ensure they contain the paper metadata for all papers you're interested in.
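To keep an eye on them, you can list the available dump files programmatically. A minimal sketch, assuming the dumps live in the `server_dumps` folder of the installed package (as the path above suggests):

```py
import os
from glob import glob

import paperscraper

# Print the available metadata dumps (their filenames carry the dump date).
dump_dir = os.path.join(os.path.dirname(paperscraper.__file__), 'server_dumps')
for path in sorted(glob(os.path.join(dump_dir, '*.jsonl'))):
    print(os.path.basename(path))
```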

#### Arxiv local dump

If you prefer local search rather than using the arxiv API:

```py
from paperscraper.get_dumps import arxiv

arxiv(start_date='2024-01-01', end_date=None)  # scrapes all metadata from 2024 until today.
```
Afterwards you can search the local arxiv dump just like the other x-rxiv dumps.
The direct endpoint is `paperscraper.arxiv.get_arxiv_papers_local`. You can also specify the
backend directly in the `get_and_dump_arxiv_papers` function:

```py
from paperscraper.arxiv import get_and_dump_arxiv_papers

query = [['machine learning'], ['molecules']]  # illustrative query (AND of OR-groups)
get_and_dump_arxiv_papers(query, output_filepath='ml_molecules.jsonl', backend='local')
```
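
The direct endpoint `get_arxiv_papers_local` mentioned above can also be used for purely programmatic searches. A sketch, assuming it accepts the same keyword-list queries as the other helpers and returns a `pandas.DataFrame`:

```py
from paperscraper.arxiv import get_arxiv_papers_local

# Illustrative query; outer list = AND, inner lists = OR over synonyms (assumed format).
query = [['machine learning', 'deep learning'], ['molecules']]
papers = get_arxiv_papers_local(query)
print(papers.head())
```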
*NOTE*: The scholar endpoint does not require authentication, but since it regularly prompts with captchas, it's difficult to apply at scale.
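
For reference, a minimal scholar query might look like the sketch below; the topic is illustrative and the call is assumed to mirror the other `get_and_dump_*` helpers:

```py
from paperscraper.scholar import get_and_dump_scholar_papers

# Illustrative topic; no API key needed, but expect captchas when scraping at scale.
topic = 'Machine Learning'
get_and_dump_scholar_papers(topic)
```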

### Full-Text Retrieval (PDFs & XMLs)

`paperscraper` allows you to download full text of publications using DOIs. The basic functionality works reliably for preprint servers (arXiv, bioRxiv, medRxiv, chemRxiv), but retrieving papers from PubMed dumps is more challenging due to publisher restrictions and paywalls.

#### Standard Usage

The main download functions work for all paper types with automatic fallbacks:

```py
from paperscraper.pdf import save_pdf

paper_data = {'doi': "10.48550/arXiv.2207.03928"}
save_pdf(paper_data, filepath='gt4sd_paper.pdf')
```

To batch download full texts from your metadata search results:

```py
from paperscraper.pdf import save_pdf_from_dump

# Save PDFs/XMLs in current folder and name the files by their DOI
# (the filename below is a placeholder for a dump from an earlier metadata search)
save_pdf_from_dump('metadata_dump.jsonl', pdf_path='.', key_to_save='doi')
```

#### Automatic Fallback Mechanisms

When the standard text retrieval fails, `paperscraper` automatically tries these fallbacks:

- **BioC-PMC**: For biomedical papers in [PubMed Central](https://pmc.ncbi.nlm.nih.gov/) (open-access repository), it retrieves open-access full-text XML from the [BioC-PMC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/).
- **eLife Papers**: For [eLife](https://elifesciences.org/) journal papers, it fetches XML files from eLife's open [GitHub repository](https://github.com/elifesciences/elife-article-xml).

These fallbacks are tried automatically without requiring any additional configuration.
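
The fallbacks are exercised through the same `save_pdf` call shown above, so nothing changes on the caller's side. A minimal sketch (the DOI is a placeholder for an eLife paper):

```py
from paperscraper.pdf import save_pdf

# Placeholder eLife DOI (10.7554 is eLife's DOI prefix). If no PDF can be fetched,
# the BioC-PMC / eLife fallbacks are tried and an XML full text may be saved instead.
save_pdf({'doi': '10.7554/eLife.XXXXX'}, filepath='elife_paper.pdf')
```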

#### Enhanced Retrieval with Publisher APIs

For more comprehensive access to papers from major publishers, you can provide API keys for:

- **Wiley TDM API**: Enables access to [Wiley](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) publications (2,000+ journals).
- **Elsevier TDM API**: Enables access to [Elsevier](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) publications (The Lancet, Cell, ...).

To use publisher APIs:

1. Create a file with your API keys:

```
WILEY_TDM_API_TOKEN=your_wiley_token_here
ELSEVIER_TDM_API_KEY=your_elsevier_key_here
```

2. Pass the file path when calling retrieval functions:

```py
from paperscraper.pdf import save_pdf_from_dump

save_pdf_from_dump(
    'pubmed_query_results.jsonl',
    pdf_path='./papers',
    key_to_save='doi',
    api_keys='path/to/your/api_keys.txt'
)
```

For obtaining API keys:

- Wiley TDM API: Visit [Wiley Text and Data Mining](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) (free for academic users with institutional subscription)
- Elsevier TDM API: Visit [Elsevier's Text and Data Mining](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) (free for academic users with institutional subscription)

*NOTE*: While these fallback mechanisms improve retrieval success rates, they cannot guarantee access to all papers due to various access restrictions.

## Contributions
Thanks to the following contributors:
- [@mathinic](https://github.com/mathinic): Since `v0.3.0`, PubMed full-text retrieval is improved with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
- [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
- [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
- [@daenuprobst](https://github.com/daenuprobst): Since `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`).