Just sharing a snippet of code. Part of a project I’m doing, I need to analyse the links in the Wikipedia corpus. While using the API is one solution, it doesn’t retain the order of where links appear in the page. It also returns links that are not part of the main text, which makes the linkage DB very cluttered.
So, I set out to parse the raw MediaWiki format all Wikipedia articles are written in, to get only the relevant links and in order. I call them contextual because they live inside the text and have context.
Initially I used string matching, and other complex string scraping parsing methods. It was a bust. There are too many end-cases to deal with. That is when I discovered PyParsing, the excellent parsing library. It did the job, and here are the results.