Issues on English Wikipedia #1

@tgalery

Hi there, and thanks for coming up with the tool. I tried running the extraction on a decompressed dump of the English Wikipedia, and the process got stuck on the third parse of the dump for a few hours.

Here is the output:

/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 15200000 pagesdone in 1601421ms
too many redirections: KWV,,KWV  Koöperatieve Wijnbouwers Vereniging van Zuid-Afrika Bpkt,,KWV,,13
3512031 elements in the interesting sf
start parsing the dump: /home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml

/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 15200000 pagesdone in 2852276ms
start parsing the dump: /home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml

/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 100000 pages

I specified 14 GB of RAM. Is there anything I might be doing wrong?
