Commit b3a205a

Merge pull request #20 from weblyzard/feature/improved-documentation
Feature/improved documentation
2 parents 4e04742 + 2ef9012 commit b3a205a

1 file changed

+85
-47
lines changed

README.md

Lines changed: 85 additions & 47 deletions
@@ -2,27 +2,35 @@
22

33
[![Build Status](https://www.travis-ci.org/weblyzard/inscriptis.png?branch=master)](https://www.travis-ci.org/weblyzard/inscriptis)
44

5-
A python based HTML to text converter with support for nested tables and a subset of CSS.
5+
A Python-based HTML to text conversion library, command line client and Web service with support for nested tables and a subset of CSS.
66
Please take a look at the [Rendering](https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md) document for a demonstration of inscriptis' conversion quality.
77

8-
## Requirements
8+
##### Table of Contents
9+
1. [Requirements and installation](#requirements-and-installation)
10+
2. [Command line client](#command-line-client)
11+
3. [Python library](#python-library)
12+
4. [Web service](#flask-web-service)
13+
5. [Fine tuning](#fine-tuning)
14+
6. [Testing, benchmarking and evaluation](#testing-benchmarking-and-evaluation)
15+
7. [Changelog](#changelog)
16+
17+
## Requirements and installation
18+
19+
### Requirements
920
* Python 3.5+ (preferred) or Python 2.7+
1021
* lxml
1122
* requests
1223

13-
## Usage
14-
15-
### Command line
16-
The command line client converts text files or text retrieved from Web pages to the
17-
corresponding text representation.
18-
19-
#### Installation
20-
24+
### Installation
2125
``` {.sourceCode .bash}
2226
sudo python3 setup.py install
2327
```
28+
## Command line client
29+
The command line client converts text files or text retrieved from Web pages to the
30+
corresponding text representation.
31+
2432

25-
#### Command line parameters
33+
### Command line parameters
2634

2735
``` {.sourceCode .bash}
2836
usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-l] [-d] input
@@ -44,23 +52,27 @@ optional arguments:
4452
Display link targets (default:false).
4553
-d, --deduplicate-image-captions
4654
Deduplicate image captions (default:false).
55+
--indentation
56+
How to handle indentation (extended or standard; default: extended)
4757
```
4858

49-
#### Examples
59+
### Examples
5060

5161
```
5262
# convert the given page to text and output the result to the screen
53-
inscript.py http://www.htwchur.ch
63+
inscript.py https://www.fhgr.ch
5464
5565
# convert the file to text and save the output to output.txt
56-
inscript.py htwchur.html -o htwchur.txt
66+
inscript.py fhgr.html -o fhgr.txt
5767
5868
# convert the text provided via stdin and save the output to output.txt
59-
echo '<body><p>Make it so!</p>></body>' | inscript.py -o htwchur.txt
69+
echo '<body><p>Make it so!</p></body>' | inscript.py -o output.txt
6070
```
6171

6272

63-
### Library
73+
## Python library
74+
75+
Embedding inscriptis into your code is easy, as outlined below:
6476

6577
```python
6678
import urllib.request
@@ -74,7 +86,56 @@ text = get_text(html)
7486
print(text)
7587
```
7688

77-
## Unit tests
89+
## Flask Web Service
90+
91+
The Flask Web Service translates HTML pages to the corresponding plain text.
92+
93+
### Additional Requirements
94+
95+
* python3-flask
96+
97+
### Startup
98+
99+
``` {.sourceCode .bash}
100+
export FLASK_APP="web-service.py"
101+
python3 -m flask run
102+
```
103+
104+
### Usage
105+
The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified
106+
in the `Content-Type` header (`UTF-8` in the example below).
107+
108+
``` {.sourceCode .bash}
109+
curl -X POST -H "Content-Type: text/html; encoding=UTF8" -d @test.html http://localhost:5000/get_text
110+
```
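
For callers that prefer Python over `curl`, the same request can be sketched with the standard library. This is a minimal sketch, not part of inscriptis: the helper name `build_get_text_request` is hypothetical, and the URL and the `encoding=` header parameter are simply copied from the `curl` example above. The request is only built here, not sent.

```python
from urllib.request import Request


def build_get_text_request(html, url="http://localhost:5000/get_text",
                           encoding="UTF8"):
    """Build (but do not send) a POST request carrying the HTML document.

    Hypothetical helper: URL and header layout mirror the curl example.
    """
    body = html.encode(encoding)
    headers = {"Content-Type": "text/html; encoding={}".format(encoding)}
    return Request(url, data=body, headers=headers, method="POST")


request = build_get_text_request("<body><p>Make it so!</p></body>")

# To send it to a running service (assumes the service listens locally):
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       raw_text = response.read()
```

Unlike `curl -d @test.html`, which strips carriage returns and newlines from the file before sending it, this sketch transmits the document bytes unchanged; with `curl`, `--data-binary @test.html` gives the same behaviour.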
111+
112+
## Fine tuning
113+
114+
The following options are available for fine tuning the way inscriptis translates HTML to text.
115+
116+
1. **More rigorous indentation:** call `get_text()` with the parameter `indentation='extended'` to also use indentation for tags such as `<div>` and `<span>` that do not provide indentation in their standard definition. This strategy is the default in `inscript.py` and many other tools such as lynx. If you do not want extended indentation you can use the parameter `indentation='standard'` instead.
117+
118+
2. **Overwriting the default CSS definition:** inscriptis uses CSS definitions that are maintained in `inscriptis.css.CSS` for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:
119+
120+
```python
121+
from inscriptis.css import CSS, HtmlElement
122+
from inscriptis.html_properties import Display
123+
124+
# change the rendering of `div` and `span` elements
125+
CSS['div'] = HtmlElement('div', display=Display.block, padding=2)
126+
CSS['span'] = HtmlElement('span', prefix=' ', suffix=' ')
127+
```
128+
The following code snippet restores the standard behaviour:
129+
```python
130+
from inscriptis.css import CSS, DEFAULT_CSS
131+
132+
# restore standard behaviour
133+
CSS = DEFAULT_CSS.copy()
134+
```
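
One Python detail is worth keeping in mind when resetting the definitions: assigning to a name obtained via `from ... import` only rebinds that name in the importing module, whereas item assignment and `clear()`/`update()` modify the shared dictionary itself. The sketch below illustrates the difference with a hypothetical stand-in for the `inscriptis.css` module (plain strings replace the `HtmlElement` objects); whether inscriptis reads the shared dictionary at render time is an assumption, not something this README states.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the `inscriptis.css` module; the values are
# placeholder strings instead of HtmlElement objects.
css_module = SimpleNamespace(
    CSS={"div": "block", "span": "inline"},
    DEFAULT_CSS={"div": "block", "span": "inline"},
)

# Equivalent of `from inscriptis.css import CSS`: a second name for the
# same dictionary object.
CSS = css_module.CSS
CSS["div"] = "custom"  # in-place change, visible through the module

# Rebinding the local name leaves the module's dictionary untouched.
CSS = css_module.DEFAULT_CSS.copy()
still_overridden = css_module.CSS["div"]

# Resetting in place changes the dictionary the module itself holds.
css_module.CSS.clear()
css_module.CSS.update(css_module.DEFAULT_CSS)
restored = css_module.CSS["div"]
```

If overrides appear to persist after a reset, an in-place reset along these lines is one thing to try.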
135+
136+
## Testing, benchmarking and evaluation
137+
138+
### Unit tests
78139

79140
Test cases concerning the html to text conversion are located in the `tests/html` directory and consist of two files:
80141

@@ -83,13 +144,13 @@ Test cases concerning the html to text conversion are located in the `tests/html
83144

84145
the latter one containing the reference text output for the given html file.
85146

86-
## Text convertion output comparison and speed benchmarking
87-
inscriptis offers a small benchmarking script that can compare different HTML to txt convertion approaches.
88-
The script will run the different approaches on a list of URLs, ```url_list.txt```, and save the text output into a time stamped folder in ```benchmarking/benchmarking_results``` for manual comparison.
89-
Additionally the processing speed of every approach per URL is measured and saved in a text file called ```speed_comparisons.txt``` in the respective time stamped folder.
147+
### Text conversion output comparison and speed benchmarking
148+
inscriptis offers a small benchmarking script that can compare different HTML to text conversion approaches.
149+
The script will run the different approaches on a list of URLs, `url_list.txt`, and save the text output into a time stamped folder in `benchmarking/benchmarking_results` for manual comparison.
150+
Additionally, the processing speed of every approach per URL is measured and saved in a text file called `speed_comparisons.txt` in the respective time stamped folder.
90151

91-
To run the benchmarking script execute ```run_benchmarking.py``` from within the folder ```benchmarking```.
92-
In ```def pipeline()``` set the which HTML -> Text algorithms to be executed by modifying
152+
To run the benchmarking script execute `run_benchmarking.py` from within the folder `benchmarking`.
153+
In `def pipeline()`, select which HTML -> text algorithms are executed by modifying
93154
```python
94155
run_lynx = True
95156
run_justext = True
@@ -98,37 +159,14 @@ run_beautifulsoup = True
98159
run_inscriptis = True
99160
```
100161

101-
In ```url_list.txt``` the URLs to be parsed can be specified by adding them to the file, one per line with no additional formatting. URLs need to be complete (including http:// or https://)
162+
List the URLs to be parsed in `url_list.txt`, one per line with no additional formatting. URLs need to be complete (including http:// or https://),
102163
e.g.
103164
```
104165
http://www.informationscience.ch
105166
https://en.wikipedia.org/wiki/Information_science
106167
...
107168
```
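
Because incomplete URLs are easy to overlook, the list can be checked before a benchmarking run. The following is a minimal sketch using only the standard library; the function `read_url_list` is hypothetical and not part of the benchmarking script.

```python
from urllib.parse import urlparse


def read_url_list(lines):
    """Split lines into complete http(s) URLs and rejected entries.

    Hypothetical helper; blank lines are skipped.
    """
    urls, rejected = [], []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue
        parts = urlparse(candidate)
        # A complete URL needs an http/https scheme and a host name.
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(candidate)
        else:
            rejected.append(candidate)
    return urls, rejected


sample = [
    "http://www.informationscience.ch\n",
    "https://en.wikipedia.org/wiki/Information_science\n",
    "www.example.org\n",  # missing scheme, so it is rejected
    "\n",
]
urls, rejected = read_url_list(sample)
```

To check the real file, pass an open file object instead of the sample list, for example `with open("url_list.txt") as handle: read_url_list(handle)`.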
108169

109-
## Flask Web Service
110-
111-
The Flask Web Service translates HTML pages to the corresponding plain text.
112-
113-
### Requirements
114-
115-
* python3-flask
116-
117-
### Startup
118-
119-
``` {.sourceCode .bash}
120-
export FLASK_APP="web-service.py"
121-
python3 -m flask run
122-
```
123-
124-
### Usage
125-
The Web services receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified
126-
in the `Content-Type` header (`UTF-8` in the example below).
127-
128-
``` {.sourceCode .bash}
129-
curl -X POST -H "Content-Type: text/html; encoding=UTF8" -d @test.html http://localhost:5000/get_text
130-
```
131-
132170
## Changelog
133171

134172
see [Release notes](https://github.com/weblyzard/inscriptis/releases).
