Friday 1 June 2007

dumping wikipedia

Finally I have found out how to dump all my MediaWiki pages and process them with a Perl script!!
(you can even do that with Wikipedia, if you want)

After dumping everything with


./maintenance/dumpBackup.php

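dumpBackup.php writes the XML to standard output, so you redirect it into a file yourself. If I remember the flags right (--full dumps every revision, --current only the latest one), a run looks something like this:


php maintenance/dumpBackup.php --current > pages.xml
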

I have found a CPAN module to process the XML file:

http://en.wikipedia.org/wiki/Wikipedia:Computer_help_desk/ParseMediaWikiDump

The latest version of Parse::MediaWikiDump is available at http://www.cpan.org/modules/by-authors/id/T/TR/TRIDDLE/
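
If you don't have the module yet, the usual CPAN one-liner should take care of it:


perl -MCPAN -e 'install Parse::MediaWikiDump'
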

Examples

Find uncategorized articles in the main namespace


#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a Mediawiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;

while(defined($page = $pages->page)) {
    # main namespace only: namespace() returns '' for it
    next unless $page->namespace eq '';

    # categories() returns undef when the page has no category links
    print $page->title, "\n" unless defined($page->categories);
}
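
Along the same lines, here is a little sketch of mine that prints every main-namespace title together with the size of its text; text() returns a reference to a scalar, per the module docs (adjust as needed):


#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a Mediawiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);

while (defined(my $page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';

    # text() returns a reference to a scalar holding the wikitext
    my $text = $page->text;
    printf "%s: %d bytes\n", $page->title, length($$text);
}


You run both scripts the same way, passing the dump file as the only argument, e.g. perl uncategorized.pl pages.xml (the file names are just whatever you saved the scripts as).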
