Jun 16

Getting all the links from a MediaWiki format using PyParsing

Hi,
Just sharing a snippet of code. As part of a project I'm working on, I need to analyse the links in the Wikipedia corpus. While using the API is one solution, it doesn't retain the order in which links appear in the page. It also returns links that are not part of the main text, which makes the linkage DB very cluttered.
So, I set out to parse the raw MediaWiki format that all Wikipedia articles are written in, to get only the relevant links, in order. I call them contextual because they live inside the text and have context.
Initially I used string matching and other convoluted string-scraping methods. It was a bust: there are too many edge cases to deal with. That's when I discovered PyParsing, the excellent parsing library. It did the job, and here are the results.

The grammar

This is the grammar I have so far:


	from pyparsing import (Combine, Forward, Literal, OneOrMore, Optional,
		QuotedString, Regex, ZeroOrMore)

	debug = False  # assumed flag: True keeps every token, False keeps only the links

	# plain text: any run without whitespace, curly braces or square brackets
	textNoStop = Regex(r'[^\s\{\}\[\]]+')
	myHtmlComment = QuotedString("<!--",endQuoteChar="-->",multiline=True)
	regularText = (textNoStop ^ Literal("[") ^ Literal("]") ) 

	# nested ( ... ) in the running text
	regularBrackets = Forward() 
	regularBrackets << Combine(Literal("(") + ZeroOrMore(Regex(r'[^\(\)]+') ^ regularBrackets) + Literal(")"))

	# nested [[ ... ]] links; the brackets themselves are dropped from the result
	link = Forward()
	link << Combine( Literal("[[").suppress() + ZeroOrMore(Regex(r'[^\[\]]+') ^ link) + Literal("]]").suppress()) 

	# nested {{ ... }} templates
	curlyShit = Forward()
	curlyShit << Combine( Literal("{{") + ZeroOrMore( Regex(r'[^\{\}]+') ^ curlyShit ) + Literal("}}") , joinString=" ") 

	# {| ... |} tables
	curlyCurlyBar = QuotedString("{|",endQuoteChar="|}",multiline=True)+Optional(QuotedString("}",endQuoteChar="|}",multiline=True))
	strangeCurlyBar = QuotedString("|",endQuoteChar="|}",multiline=True) #+NotAny(Literal("}")) # strangely it may also appear like this...
	curlyBar = curlyCurlyBar ^ strangeCurlyBar

	strangeBeginRemark = Combine(Literal(":") + QuotedString("''") , joinString=" ")

	if debug:
		wikiMarkup = OneOrMore(regularText ^ strangeBeginRemark ^ curlyBar ^ curlyShit ^ myHtmlComment ^ link ^ regularBrackets)
	else:
		wikiMarkup = Optional(OneOrMore(regularText.suppress() ^ strangeBeginRemark.suppress() ^ curlyBar.suppress() ^ curlyShit.suppress() ^ myHtmlComment.suppress() ^ link ^ regularBrackets.suppress()))

If you're not familiar with PyParsing, these are simply rules for each element in the format that I considered relevant.
So first I had to get rid of any "double curly" things in the MediaWiki format (MWf). They can be nested, which rules out the simple QuotedString method, so I had to build a recursive grammar:

	curlyShit = Forward()
	curlyShit << Combine( Literal("{{") + ZeroOrMore( Regex(r'[^\{\}]+') ^ curlyShit ) + Literal("}}") , joinString=" ") 

As you can see I wasn't so happy with it...
Anyway, first I have to make an empty declaration of the double-curly (the Forward()), so I can use it recursively, and then work it into the grammar. It basically looks for an opening '{{', consumes everything up to the next '{' or '}', then allows for another nested double-curly, and just keeps going until it reaches the closing '}}'. It works.

The rules for the other nested things in the MWf are implemented the same way.
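
To make the recursion concrete, here is a rough hand-written equivalent of that rule in plain Python, without PyParsing. The names NON_BRACE and match_template and the sample string are made up for illustration; treat it as a sketch of the same idea, not as the code from the project.

```python
import re

# A run of characters that are neither '{' nor '}',
# playing the role of Regex('[^\{\}]+') in the grammar.
NON_BRACE = re.compile(r"[^{}]+")

def match_template(text, pos=0):
    """Return the index just past the balanced {{...}} starting at pos, or None."""
    if not text.startswith("{{", pos):
        return None
    pos += 2
    while True:
        run = NON_BRACE.match(text, pos)
        if run:                      # plain content inside the template
            pos = run.end()
            continue
        inner = match_template(text, pos)
        if inner is not None:        # a nested {{...}}, handled recursively
            pos = inner
            continue
        break                        # neither: we must be at the closer (or stuck)
    return pos + 2 if text.startswith("}}", pos) else None

sample = "{{Infobox|name={{lang-ga|Fianna Fáil}}|x=1}} trailing text"
end = match_template(sample)
matched = sample[:end]
print(matched)
```

Like the grammar, this gives up (returns None) as soon as the braces don't balance, which is exactly what bites later on.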

All done?

Alright, everything seems fine and dandy. Well, it isn't, because Wikipedia is a cesspool of formatting errors. Take this article: http://en.wikipedia.org/wiki/Fianna_Fail (keep in mind it might be updated and fixed by the time you read this)
It has an unbalanced parenthesis:

'''Fianna Fáil – The Republican Party''' ({{lang-ga|Fianna Fáil – An Páirtí Poblachtánach}}), more commonly known as '''Fianna Fáil''' ({{IPA-ga|ˌfʲiənə ˈfɔːlʲ}} is a [[political party]] in the [[Republic of Ireland]],

The closing parenthesis is missing right after the IPA double-curly.

And since my grammar only works for balanced, correctly formatted documents, it fails on this one.
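
One way to spot this kind of problem before parsing is a simple stack-based balance check. This is only a sketch in plain Python; the helper name first_unclosed is hypothetical and is not part of the parser above.

```python
def first_unclosed(text, opener="(", closer=")"):
    """Return the index of the earliest opener that is never closed, or None."""
    stack = []
    for i, ch in enumerate(text):
        if ch == opener:
            stack.append(i)
        elif ch == closer and stack:
            stack.pop()
    return stack[0] if stack else None

# The Fianna Fáil excerpt quoted above.
snippet = (
    "'''Fianna Fáil – The Republican Party''' "
    "({{lang-ga|Fianna Fáil – An Páirtí Poblachtánach}}), "
    "more commonly known as '''Fianna Fáil''' "
    "({{IPA-ga|ˌfʲiənə ˈfɔːlʲ}} is a [[political party]] "
    "in the [[Republic of Ireland]],"
)

idx = first_unclosed(snippet)
print(snippet[idx:idx + 10])
```

On this excerpt it points at the parenthesis opened just before the IPA template. It only detects the problem, it doesn't repair it.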

There are more anomalies in the format, like this one (from http://en.wikipedia.org/wiki/Fredrich_Wilhelm_I_of_Prussia):

{{Refimprove|date=December 2009}}
{|align=right
|{{Infobox royalty|monarch
| name           = Frederick William I
| title          =King in Russia; Elector of Brandenburg
| image          =Friedrich Wilhelm I 1713.jpg
....

| death_place =[[Berlin]], [[Kingdom of Prussia|Prussia]]
| place of burial=[[Sanssouci]], [[Potsdam]]
| religion         =[[Calvinism]]
|}}
|-
|{{House of Hohenzollern (Prussia)|frederickwilliam1}}
|}

While it may seem OK, it's really parser's hell... See that opening "{|"? That's cool, but then comes the closing "|}}", whose first two characters also match the real closer "|}", which only appears later...

There are more anomalies like that, it's a whole mess.
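
The clash is easy to reproduce with plain string searching. The snippet below uses an abridged copy of the excerpt above (most infobox rows left out), and the lookahead pattern at the end is just one possible heuristic I'm assuming for illustration, not something the grammar does.

```python
import re

# Abridged version of the table quoted above.
table = "\n".join([
    "{|align=right",
    "|{{Infobox royalty|monarch",
    "| name           = Frederick William I",
    "|}}",
    "|-",
    "|{{House of Hohenzollern (Prussia)|frederickwilliam1}}",
    "|}",
])

first_close = table.find("|}")    # what a naive "up to the first |}" rule sees
real_close = table.rfind("|}")    # the actual table closer in this sample

print(repr(table[first_close:first_close + 3]))   # the "|}}" line, not the closer

# Heuristic: only accept "|}" when it is not followed by another "}".
heuristic = re.search(r"\|\}(?!\})", table).start()
print(heuristic == real_close)
```

The heuristic happens to work here, but it would misfire on a table that legitimately ends right before a template's closing braces, so it trades one edge case for another.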

Results

Anyway, looking at the big picture I was able to parse ~99.9% of the articles, which is fine by me. But Wikipedia has ~10,000,000 articles (incl. redirects and disambigs), so the remaining ~0.1% is still around 10,000 articles. I know I'm missing a lot.

So, here's a result:
Original: Machine

{{About|devices that perform tasks|other uses|Machine (disambiguation)}}
A '''machine''' manages power to accomplish a task, examples include, a [[mechanical system]], a [[computer|computing system]], an [[electronic system]], a [[molecular machine]] and a [[biological machine]]. In common usage, the meaning is that of a device having parts that perform or assist in performing any type of [[Work (physics)|work]]. A [[simple machine]] is a device that transforms the direction or magnitude of a [[force]].

Links found after parsing

['mechanical system', 'computer|computing system', 'electronic system', 'molecular machine', 'biological machine', 'Work (physics)|work', 'simple machine', 'force']
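
Note that piped links come back raw, as 'target|label'. If you need the two parts separately, something like this would do; split_link is a hypothetical helper, not part of the parser, and it ignores special cases such as an empty label.

```python
def split_link(raw):
    """Split 'target|label' into (target, label); the label defaults to the target."""
    target, sep, label = raw.partition("|")
    return target, (label if sep else target)

links = ['mechanical system', 'computer|computing system', 'Work (physics)|work']
pairs = [split_link(raw) for raw in links]
print(pairs)
```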

Another one:
Original: Agrippina the elder

{{Infobox royalty
| name        = Agrippina the Elder
| image       = Agripina Maior (M.A.N. Madrid) 01.jpg
| caption     = Agrippina, wife of Germanicus
| imgw        = 200px
| spouse      = [[Germanicus]]
| issue       = [[Nero (son of Germanicus)|Nero Caesar]]<br>[[Drusus Caesar]]<br>[[Caligula]], Roman Emperor<br>[[Agrippina the Younger]], Roman Empress<br>[[Drusilla (sister of Caligula)|Julia Drusilla]]<br>[[Julia Livilla]]
| father      = [[Marcus Vipsanius Agrippa]]
| mother      = [[Julia the Elder]]
| birth_date   = 14 BC
| birth_place  = [[Athens]], [[Greece]]
| death_date   = 18 October 33 (aged 47)
| death_place  = [[Ventotene|Pandataria]]
| place of burial = [[Ventotene|Pandataria]], later the <br>[[Mausoleum of Augustus]]
}}

'''Vipsania Agrippina''' or most commonly known as '''Agrippina Major''' or '''Agrippina the Elder''' (''Major'' Latin for ''the elder'', [[Classical Latin]]: <small>AGRIPPINAGERMANICI</small>,<ref>{{Aut|E. Groag, A. Stein, L. Petersen - e.a.}} (edd.), ''Prosopographia Imperii Romani saeculi I, II et III'' ('''[[PIR]]'''), Berlin, 1933 - V 463</ref> 14 BC  [[18 October]] [[33]]) was a distinguished and prominent granddaughter of the Emperor Augustus. Agrippina was the wife of the general, statesman [[Germanicus]] and a relative to the first [[Roman Emperors]]. She was the second granddaughter of the Emperor [[Augustus]], sister-in-law, stepdaughter and daughter-in-law of the Emperor [[Tiberius]], mother of the Emperor [[Caligula]], maternal second cousin and sister-in-law of the Emperor [[Claudius]] and the maternal grandmother of the Emperor [[Nero]].

Links found:

['Germanicus', 'Roman Emperors', 'Augustus', 'Tiberius', 'Caligula', 'Claudius', 'Nero']

See how it handled all the curlies with grace.

So, code is up:

svn checkout http://morethantechnical.googlecode.com/svn/trunk/simpleWikiParser/simpleWiki.py

Enjoy,
Roy.
