Hi there, and thanks for coming up with the tool. I tried to run the extraction on a decompressed dump of the English Wikipedia, and the process got stuck on the third parse of the dump for a few hours.
Here is the output:
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 15200000 pages
done in 1601421ms
too many redirections: KWV,,KWV Koöperatieve Wijnbouwers Vereniging van Zuid-Afrika Bpkt,,KWV,,13
3512031 elements in the interesting sf
start parsing the dump: /home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 15200000 pages
done in 2852276ms
start parsing the dump: /home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 100000 pages
I specified 14 GB of RAM. Is there anything I might be doing wrong?
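One thing worth ruling out is whether the JVM actually received the 14 GB heap (e.g. whether the `-Xmx` flag was passed to the right process). A minimal sketch for checking this from inside a JVM, not specific to this tool:

```java
// Prints the maximum heap the JVM was started with.
// Run with e.g.:  java -Xmx14g HeapCheck
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.2f GB%n",
                maxBytes / (1024.0 * 1024 * 1024));
    }
}
```

If the printed value is much smaller than expected, the slowdown on the later passes could simply be garbage-collection pressure rather than a bug in the parser.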