Closed
For example, we can't parse the Wikipedia XML dump: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
REXML::ParseException: #<RangeError: integer 2147501889 too big to convert to 'int'>
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/source.rb:127:in 'StringScanner#pos='
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/source.rb:127:in 'REXML::Source#position='
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/parsers/baseparser.rb:447:in 'REXML::Parsers::BaseParser#pull_event'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/parsers/baseparser.rb:207:in 'REXML::Parsers::BaseParser#pull'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/parsers/streamparser.rb:20:in 'REXML::Parsers::StreamParser#parse'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/wikipedia.rb:51:in 'block in Datasets::Wikipedia#each'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:78:in 'block (2 levels) in Datasets::Dataset#extract_bz2'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:56:in 'IO.pipe'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:56:in 'block in Datasets::Dataset#extract_bz2'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:55:in 'IO.pipe'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:55:in 'Datasets::Dataset#extract_bz2'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/wikipedia.rb:71:in 'Datasets::Wikipedia#open_data'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/wikipedia.rb:48:in 'Datasets::Wikipedia#each'
...
Exception parsing
Line: -1
Position: -1
Last 80 unconsumed characters:
/text>
```
(`ruby -r datasets -e 'Datasets::Wikipedia.new.each {}'` will reproduce this.)
To parse large XML, we need to drop already-parsed content from the StringScanner in REXML::Source, so that the scanner position does not grow past the C `int` range.
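A minimal sketch of the idea with a plain StringScanner (not the actual REXML::Source internals): drop the already-consumed prefix of the buffer after processing each chunk, and keep the absolute byte offset in a separate counter so position reporting still works. The variable names here are illustrative, not from REXML.

```ruby
require "strscan"

scanner = StringScanner.new(+"")
absolute_pos = 0  # total bytes consumed across all drops

# Feed the input in chunks instead of one huge string.
chunks = ["abc", "def", "ghi"]
chunks.each do |chunk|
  scanner << chunk            # append newly read data
  scanner.scan(/\w+/)         # consume what we can (stand-in for parsing)
  absolute_pos += scanner.pos
  # Drop the consumed prefix; scanner.pos resets to 0, staying small
  # no matter how much total input has been processed.
  scanner.string = scanner.rest
end
```

With this scheme, `scanner.pos` is always bounded by the size of a single unconsumed tail, while `absolute_pos` carries the true offset for error messages.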