A python based HTML to text converter with support for nested tables and a subset of CSS.
5
+
A Python-based HTML to text conversion library, command line client and Web service with support for nested tables and a subset of CSS.
6
6
Please take a look at the [Rendering](https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md) document for a demonstration of inscriptis' conversion quality.
7
7
8
-
## Requirements
8
+
##### Table of Contents
9
+
1. [Requirements and installation](#requirements-and-installation)
10
+
2. [Command line client](#command-line-client)
11
+
3. [Python library](#python-library)
12
+
4. [Web service](#flask-web-service)
13
+
5. [Fine tuning](#fine-tuning)
14
+
6. [Testing, benchmarking and evaluation](#testing-benchmarking-and-evaluation)
15
+
7. [Changelog](#changelog)
16
+
17
+
## Requirements and installation
18
+
19
+
### Requirements
9
20
* Python 3.5+ (preferred) or Python 2.7+
10
21
* lxml
11
22
* requests
12
23
13
-
## Usage
14
-
15
-
### Command line
16
-
The command line client converts text files or text retrieved from Web pages to the
17
-
corresponding text representation.
18
-
19
-
#### Installation
20
-
24
+
### Installation
21
25
```{.sourceCode .bash}
22
26
sudo python3 setup.py install
23
27
```
28
+
## Command line client
29
+
The command line client converts text files or text retrieved from Web pages to the
corresponding text representation.

[...]

```
How to handle indentation (extended or standard; default: extended)
47
57
```
48
58
49
-
####Examples
59
+
### Examples
50
60
51
61
```
52
62
# convert the given page to text and output the result to the screen
53
-
inscript.py http://www.htwchur.ch
63
+
inscript.py https://www.fhgr.ch
54
64
55
65
# convert the file to text and save the output to fhgr.txt
56
-
inscript.py htwchur.html -o htwchur.txt
66
+
inscript.py fhgr.html -o fhgr.txt
57
67
58
68
# convert the text provided via stdin and save the output to output.txt
59
-
echo '<body><p>Make it so!</p>></body>' | inscript.py -o htwchur.txt
69
+
echo '<body><p>Make it so!</p></body>' | inscript.py -o output.txt
60
70
```
61
71
62
72
63
-
### Library
73
+
## Python library
74
+
75
+
Embedding inscriptis into your code is easy, as outlined below:
64
76
65
77
```python
66
78
import urllib.request
@@ -74,7 +86,56 @@ text = get_text(html)
74
86
print(text)
75
87
```
76
88
77
-
## Unit tests
89
+
## Flask Web Service
90
+
91
+
The Flask Web Service translates HTML pages to the corresponding plain text.
92
+
93
+
### Additional Requirements
94
+
95
+
* python3-flask
96
+
97
+
### Startup
98
+
99
+
```{.sourceCode .bash}
100
+
export FLASK_APP="web-service.py"
101
+
python3 -m flask run
102
+
```
103
+
104
+
### Usage
105
+
The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified
106
+
in the `Content-Type` header (`UTF-8` in the example below).
107
+
108
+
```{.sourceCode .bash}
109
+
curl -X POST -H "Content-Type: text/html; encoding=UTF8" -d @test.html http://localhost:5000/get_text
110
+
```
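
The same request can also be issued from Python. The following is a minimal sketch that uses only the standard library; the helper names `build_get_text_request` and `fetch_text` are hypothetical, the URL and the `encoding=UTF8` header parameter are copied from the `curl` example above, and decoding the response as UTF-8 is an assumption.

```python
import urllib.request

# URL taken from the curl example; adjust it if the service runs elsewhere.
DEFAULT_URL = "http://localhost:5000/get_text"


def build_get_text_request(html, url=DEFAULT_URL):
    """Build a POST request that carries the HTML document in its body.

    Hypothetical helper: the Content-Type header mirrors the curl example.
    """
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html; encoding=UTF8"},
        method="POST",
    )


def fetch_text(html, url=DEFAULT_URL):
    """Send the request and return the response body as a string.

    Assumes the service answers with UTF-8 encoded text.
    """
    request = build_get_text_request(html, url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the Web service running locally, `fetch_text('<body><p>Make it so!</p></body>')` should return the converted text; `build_get_text_request` can be inspected without a running server.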
111
+
112
+
## Fine tuning
113
+
114
+
The following options are available for fine tuning the way inscriptis translates HTML to text.
115
+
116
+
1. **More rigorous indentation:** call `get_text()` with the parameter `indentation='extended'` to also use indentation for tags such as `<div>` and `<span>` that do not provide indentation in their standard definition. This strategy is the default in `inscript.py` and many other tools such as lynx. If you do not want extended indentation you can use the parameter `indentation='standard'` instead.
117
+
118
+
2. **Overwriting the default CSS definition:** inscriptis uses CSS definitions that are maintained in `inscriptis.css.CSS` for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:
119
+
120
+
```python
121
+
from inscriptis.css import CSS, HtmlElement
122
+
from inscriptis.html_properties import Display
123
+
124
+
# change the rendering of `div` and `span` elements
# ...
```
The following code snippet restores the standard behaviour:
129
+
```python
130
+
from inscriptis.css import CSS, DEFAULT_CSS
131
+
132
+
# restore standard behaviour
133
+
CSS = DEFAULT_CSS.copy()
134
+
```
135
+
136
+
## Testing, benchmarking and evaluation
137
+
138
+
### Unit tests
78
139
79
140
Test cases concerning the html to text conversion are located in the `tests/html` directory and consist of two files:
80
141
@@ -83,13 +144,13 @@ Test cases concerning the html to text conversion are located in the `tests/html
83
144
84
145
the latter one containing the reference text output for the given html file.
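
A check over such file pairs could look roughly like the sketch below. It is not the project's own test runner: the function `check_test_cases` is hypothetical, the converter is passed in as a plain callable, and the pairing of `<name>.html` with `<name>.txt` is an assumption about the naming scheme.

```python
from pathlib import Path


def check_test_cases(directory, convert):
    """Compare convert(html) with the reference text of every test case.

    Assumes each case is a pair <name>.html / <name>.txt (an assumption
    about the naming scheme). Returns the names of the failing cases.
    """
    failures = []
    for html_file in sorted(Path(directory).glob("*.html")):
        reference_file = html_file.with_suffix(".txt")
        html = html_file.read_text(encoding="utf-8")
        reference = reference_file.read_text(encoding="utf-8")
        # Trailing whitespace is ignored so that a final newline in the
        # reference file does not cause a spurious failure.
        if convert(html).rstrip() != reference.rstrip():
            failures.append(html_file.stem)
    return failures
```

Calling `check_test_cases("tests/html", get_text)` would then list the cases whose output differs from the reference; an empty list means every pair matched.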
85
146
86
-
## Text convertion output comparison and speed benchmarking
87
-
inscriptis offers a small benchmarking script that can compare different HTML to txt convertion approaches.
88
-
The script will run the different approaches on a list of URLs, ```url_list.txt```, and save the text output into a time stamped folder in ```benchmarking/benchmarking_results``` for manual comparison.
89
-
Additionally the processing speed of every approach per URL is measured and saved in a text file called ```speed_comparisons.txt``` in the respective time stamped folder.
147
+
### Text conversion output comparison and speed benchmarking
148
+
inscriptis offers a small benchmarking script that can compare different HTML to text conversion approaches.
149
+
The script runs the different approaches on a list of URLs, `url_list.txt`, and saves the text output into a timestamped folder in `benchmarking/benchmarking_results` for manual comparison.
150
+
Additionally, the processing speed of every approach per URL is measured and saved in a text file called `speed_comparisons.txt` in the respective timestamped folder.
90
151
91
-
To run the benchmarking script execute ```run_benchmarking.py``` from within the folder ```benchmarking```.
92
-
In ```def pipeline()``` set the which HTML -> Text algorithms to be executed by modifying
152
+
To run the benchmarking script execute `run_benchmarking.py` from within the folder `benchmarking`.
153
+
In `def pipeline()`, set which HTML -> text algorithms are executed by modifying
93
154
```python
94
155
run_lynx = True
95
156
run_justext = True
@@ -98,37 +159,14 @@ run_beautifulsoup = True
98
159
run_inscriptis = True
99
160
```
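
How such switches and the speed measurement might fit together is sketched below. This is not the code of `run_benchmarking.py`: the function `time_converters` and the two toy converters are hypothetical stand-ins, and a real run would call lynx, justext, BeautifulSoup or inscriptis instead.

```python
import re
import time


def strip_tags(html):
    """Toy converter: drop everything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", html)


def collapse_whitespace(html):
    """Toy converter: strip tags and normalise whitespace."""
    return " ".join(strip_tags(html).split())


def time_converters(converters, html):
    """Run every enabled converter on html and time it.

    converters maps a name to an (enabled, function) pair, mirroring
    the run_* switches. Returns {name: (text, seconds)}.
    """
    results = {}
    for name, (enabled, convert) in converters.items():
        if not enabled:
            continue
        start = time.perf_counter()
        text = convert(html)
        results[name] = (text, time.perf_counter() - start)
    return results
```

Disabling an approach then only requires flipping its flag, for example `{"strip": (True, strip_tags), "collapse": (False, collapse_whitespace)}`, and the returned timings can be written out per URL.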
100
161
101
-
In ```url_list.txt``` the URLs to be parsed can be specified by adding them to the file, one per line with no additional formatting. URLs need to be complete (including http:// or https://)
162
+
The URLs to be parsed are listed in `url_list.txt`, one per line with no additional formatting. URLs need to be complete (including `http://` or `https://`),
102
163
e.g.
103
164
```
104
165
http://www.informationscience.ch
105
166
https://en.wikipedia.org/wiki/Information_science
106
167
...
107
168
```
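
The format rules above (one URL per line, scheme included) can be checked before a benchmarking run. The following is a minimal sketch under those stated rules; the function `load_urls` is hypothetical and is not part of the benchmarking script.

```python
from urllib.parse import urlparse


def load_urls(lines):
    """Return the complete URLs found in an iterable of lines.

    Blank lines are skipped; a line without an http:// or https://
    scheme and a host name raises a ValueError.
    """
    urls = []
    for number, line in enumerate(lines, start=1):
        url = line.strip()
        if not url:
            continue
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError("line {}: incomplete URL {!r}".format(number, url))
        urls.append(url)
    return urls
```

A file object can be passed directly, for example `with open("url_list.txt") as handle: urls = load_urls(handle)`.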
108
169
109
-
## Flask Web Service
110
-
111
-
The Flask Web Service translates HTML pages to the corresponding plain text.
112
-
113
-
### Requirements
114
-
115
-
* python3-flask
116
-
117
-
### Startup
118
-
119
-
```{.sourceCode .bash}
120
-
export FLASK_APP="web-service.py"
121
-
python3 -m flask run
122
-
```
123
-
124
-
### Usage
125
-
The Web services receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified
126
-
in the `Content-Type` header (`UTF-8` in the example below).
127
-
128
-
```{.sourceCode .bash}
129
-
curl -X POST -H "Content-Type: text/html; encoding=UTF8" -d @test.html http://localhost:5000/get_text
130
-
```
131
-
132
170
## Changelog
133
171
134
172
See [Release notes](https://github.com/weblyzard/inscriptis/releases).