Ignore Wikipedia Redirects with mwlib - python

I'm using mwlib in Python to iterate over a Wikipedia dump. I want to ignore redirects and just look at page contents with the actual full title. I've already run mw-buildcdb, and I'm loading that:
wiki_env = wiki.makewiki(wiki_conf_file)
When I loop over wiki_env.wiki.articles(), the strings appear to contain redirect titles (I've checked this on a couple of samples against Wikipedia). I don't see an accessor that skips these, and wiki_env.wiki.redirects is an empty dictionary, so I can't check which article titles are actually just redirects that way.
I've tried looking through the mwlib code, but if I use
page = wiki_env.wiki.get_page(page_title)
wiki_env.wiki.nshandler.redirect_matcher(page.rawtext)
the page.rawtext appears to already have the redirect resolved (it contains the full page content, with no indication of a title mismatch). Similarly, the Article node returned by getParsedArticle() does not appear to contain the "true" title to check against.
Does anyone know how to do this? Do I need to run mw-buildcdb in a way that doesn't store redirects? As far as I can tell, that command just takes an input dump file and an output CDB, with no other options.

When in doubt, patch it yourself. :o)
mw-buildcdb now takes an --ignore-redirects command-line option: https://github.com/pediapress/mwlib/commit/f9198fa8288faf4893b25a6b1644e4997a8ff9b2
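For reference, a minimal sketch of how this could be wired up end to end. The flag name comes from the linked commit, but the input/output argument style and the config path are assumptions; check mw-buildcdb --help for the exact spelling:
# Sketch only: argument names other than --ignore-redirects are assumptions.
import subprocess
from mwlib import wiki

# Rebuild the CDB without storing redirects.
subprocess.run(
    ["mw-buildcdb", "--input", "enwiki-dump.xml.bz2",
     "--output", "wiki_cdb", "--ignore-redirects"],
    check=True,
)

# Load the rebuilt environment and iterate; every title should now be a real article.
wiki_env = wiki.makewiki("wiki_cdb/wikiconf.txt")  # hypothetical config path
for title in wiki_env.wiki.articles():
    page = wiki_env.wiki.get_page(title)
    print(title, len(page.rawtext))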

Related

Accessing a hidden form using MechanicalSoup will result in "Value Error: No closing quotation"

First of all, English is not my native language.
Problem
I'm trying to access and manipulate a form using MechanicalSoup, as described in the docs. I successfully logged in to the page using the login form, which I found using the developer tools ("debug mode", F12) built into Chrome:
form action="https://www.thegoodwillout.de/customer/account/loginPost/"
The form can be found using the Chrome debugger.
This works fine and does not produce any error. I then tried to up my game and move on to a more complicated form on this site. I managed to track the form down to this snippet:
form action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0cHM6Ly93d3cudGhlZ29vZHdpbGxvdXQuZGUvbmlrZS1haXItdm9ydGV4LXNjaHdhcnotd2Vpc3MtYW50aHJheml0LTkwMzg5Ni0wMTA_X19fU0lEPVU,/product/115178/form_key/r19gQi8K03l21bYk/"
This results in a
ValueError: No closing quotation
which is odd, since the selector does not use any special characters, and I double-checked that every quotation mark is closed correctly.
What I have tried
I tried tracking down a more specific form that applies to the given shoe size, but this one form seems to handle all the content on the website. I searched the web and found several articles pointing to a bug inside Python itself, which I can hardly believe is true!
Source Code with attached error log
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.thegoodwillout.de/nike-air-vortex-schwarz-weiss-anthrazit-903896-010")
browser.select_form('form[action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0cHM6Ly93d3cudGhlZ29vZHdpbGxvdXQuZGUvbmlrZS1haXItdm9ydGV4LXNjaHdhcnotd2Vpc3MtYW50aHJheml0LTkwMzg5Ni0wMTA_X19fU0lEPVU,/product/115178/form_key/r19gQi8K03l21bYk/"]')
NOTE: it all seems to trace back to a module called shlex, which is raising the error.
Finally, the error log:
It would be really helpful if you could point me in the right direction and link some websites I may not have fully investigated yet.
It's actually an issue with BeautifulSoup4, the library used by MechanicalSoup to navigate within HTML documents, related to the fact that you use a comma (,) in the CSS selector.
BeautifulSoup splits CSS selectors on commas, and therefore treats your query as two selectors: form[action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0cHM6Ly93d3cudGhlZ29vZHdpbGxvdXQuZGUvbmlrZS1haXItdm9ydGV4LXNjaHdhcnotd2Vpc3MtYW50aHJheml0LTkwMzg5Ni0wMTA_X19fU0lEPVU and /product/115178/form_key/r19gQi8K03l21bYk/"], each parsed separately. When parsing the first one, it finds an opening " but no closing ", and errors out.
It's somewhat of a feature (you can pass multiple CSS selectors as an argument to select), but it's useless here (there's no point in providing several selectors when you expect a single object).
Solution: don't use commas in CSS selectors. You probably have other criteria to match your form.
You may try using %2C instead of the comma (untested).
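As a sketch of that last point, you could match the form by a substring of its action attribute instead of the full URL, so the comma never appears in the selector. The substring below is taken from the URL in your question; adjust it to whatever uniquely identifies the form on the page:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.thegoodwillout.de/nike-air-vortex-schwarz-weiss-anthrazit-903896-010")

# CSS attribute substring match: selects the form whose action contains this path,
# without ever putting a comma inside the selector.
browser.select_form('form[action*="/checkout/cart/add/"]')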

Accessing Hovertext with html

I am trying to access hover text found on graph points at this site (bottom):
http://matchhistory.na.leagueoflegends.com/en/#match-details/TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview
I have the full site HTML, but I am unable to find the values displayed in the hover text. All that can be seen when inspecting a point are x and y values that are transformed versions of these values. The mapping can be determined with manual input taken from the hover text, but this defeats the purpose of looking at the HTML. Additionally, the mapping changes with each match history, so it is not feasible to do this for a large number of games.
Is there any way around this?
thank you
Explanation
Nearly everything on this webpage is loaded as JSON via JavaScript, so we don't even have to request the original page. You will, however, have to piece the page back together via the ids of items, masteries, and so on, which won't be too hard, because you can request masteries much like we fetch items below.
So, I went through the network tab in inspect and I noticed that it loaded the following JSON formatted URL:
https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4
Notice that it contains the same gameHash and id as the link you posted. This page contains everything you need to rebuild the match history, provided you also fetch the JSON files it relies on.
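As a quick sketch, you can pull that match JSON directly with requests and inspect its structure (the exact key names are whatever the endpoint returns; I'm not listing them authoritatively here):
import requests, json

match_url = "https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4"
match = json.loads(requests.get(match_url).text)
print(sorted(match.keys()))  # inspect the top-level structure of the match data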
Dealing with JSON
You can use json.loads in Python to load it, but a great tool I would recommend is:
https://jsonformatter.curiousconcept.com/
You copy and paste JSON in there and it will help you understand the data structure.
Fetching items
The webpage loads all this information via a JSON file:
https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json
It contains all of the information and tooltips for each item in the game. You can access your desired item via itemJson['data']['1001']. The file name of each item image on the page is the item's id (1001 in this example).
For instance, for 'Boots of Speed':
import requests, json
itemJson = json.loads(requests.get('https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json').text)
print(itemJson['data']['1001'])
An alternative: Selenium
Selenium could also be used for this; it's worth looking up. It has been ported to several programming languages, Python being one of them. It may work as you want it to here, but I sincerely think that the JSON method (described above), although a little more convoluted, will perform faster (and speed, based on your post, seems to be an important factor).

Parse HTML, 'ValueError: stat: path too long for Windows'

I'm trying to scrape data from NYSE's website, from this URL:
nyse = "http://www1.nyse.com/about/listed/IPO_Index.html"
Using requests, I've set my request up like this:
import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get(nyse)
soup = BeautifulSoup(page.text)
tables = soup.findAll('table')
test = pandas.io.html.read_html(str(tables))
However, I keep getting this error
'ValueError: stat: path too long for Windows'
I don't understand how to interpret this error, and furthermore, how to solve the problem. I've seen one other post in this area (Copy a file with a too long path to another directory in Python), but I don't fully understand the workaround, and I'm not sure which path is the problem in this case.
The error is thrown at the test = pandas.io.... line, but it isn't clear which path the error refers to, since I'm not storing the table locally anywhere. Do I need to use pywin32? Why does this error only show up for some URLs and not others? How do I solve this problem?
For reference, I'm using Python 3.4.
Update:
The error only appears with the nyse website, and not for others that I'm also scraping. In all cases, I'm doing the str(tables) conversion.
The pandas read_html method accepts URLs, files, or raw HTML strings as its first argument. It definitely looks like it's trying to interpret the str(tables) argument as a file path or URL, which would of course be quite long and overrun whatever path-length limit Windows has.
Are you certain that str(tables) produces raw, parseable HTML? tables looks like it's a list of abstract node objects, and it seems likely that calling str() on it would not produce what you're looking for.
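If that's the case, a couple of hedged alternatives (assuming pandas with lxml or bs4 installed) would be to hand read_html the URL directly, or to wrap the downloaded HTML in a file-like object so pandas never tries to treat it as a path:
import io
import requests
import pandas as pd

url = "http://www1.nyse.com/about/listed/IPO_Index.html"

# Option 1: read_html accepts a URL and does the fetching itself.
tables = pd.read_html(url)

# Option 2: fetch with requests, then pass a file-like object rather than a bare string.
page = requests.get(url)
tables = pd.read_html(io.StringIO(page.text))
print(len(tables), "tables found")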

Automatically preventing wiki-rot in Trac?

Hi guys: Is there a way to improve Trac wiki quality using a plugin that deals with artifacts like obsolete pages, pages that refer to code which no longer exists, pages that are unlinked, or pages with a low update rate? I think there are several heuristics which could be used to prevent wiki-rot:
Number of recent edits
Number of recent views
Whether or not a page links to a source file
Whether a wiki page's last update is older or newer than the source files it links to
Whether entire directories in the wiki have been used/edited/ignored over the last "n" days
etc. etc. etc.
If nothing else, just these metrics alone would be useful for each page and each directory from an administrative standpoint.
I don't know of an existing plugin that does this, but everything you mentioned certainly sounds do-able in one way or another.
You can use the trac-admin CLI command to get a list of wiki pages and to dump the contents of a particular wiki page (as plain text) to a file or stdout. Using this, you can write a script that reads in all of the wiki pages, parses the content for links, and generates a graph of which pages link to what. This should pinpoint "orphans" (pages that aren't linked to), pages that link to source files, and pages that link to external resources. Running external links through something like wget can help you identify broken links.
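A rough sketch of that approach, assuming a local Trac environment path and that [wiki:PageName] links are the main link style (CamelCase links and macros would need extra handling):
import re
import subprocess

TRAC_ENV = "/path/to/trac"  # hypothetical environment path

def trac_admin(*args):
    result = subprocess.run(["trac-admin", TRAC_ENV, *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# "wiki list" prints a table; keep only the first column and skip header/separator rows as needed.
listing = trac_admin("wiki", "list").splitlines()
pages = [line.split()[0] for line in listing
         if line.strip() and not line.startswith(("Title", "-"))]

# Dump each page as plain text and collect its outgoing wiki links.
link_re = re.compile(r"\[wiki:([\w/.\-]+)")
links = {page: set(link_re.findall(trac_admin("wiki", "export", page)))
         for page in pages}

linked_to = set().union(*links.values()) if links else set()
orphans = [p for p in pages if p not in linked_to]
print("Possible orphans:", orphans)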
To access last-edited dates, you'll want to query Trac's database. The query you'll need will depend on the particular database type that you're using. For playing with the database in a (relatively) safe and easy manner, I find the WikiTableMacro and TracSql plugins quite useful.
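For example, with a SQLite-backed Trac, the wiki table stores one row per page version, so the most recent edit per page is just the maximum time value. A hedged sketch (the timestamp unit differs between Trac versions; recent ones use microseconds since the epoch):
import datetime
import sqlite3

conn = sqlite3.connect("/path/to/trac/db/trac.db")  # hypothetical database path
rows = conn.execute("SELECT name, MAX(time) FROM wiki GROUP BY name").fetchall()
for name, ts in rows:
    # Adjust the divisor if your Trac version stores plain seconds.
    print(name, datetime.datetime.fromtimestamp(ts / 1_000_000))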
The hardest feature in your question to implement would be the one regarding page views. I don't think Trac keeps track of page views; you'll probably have to parse your web server's logs for that sort of information.
How about these:
BadLinksPlugin: This plugin logs bad local links found in wiki content.
It's quite a new one; it mainly deals with dangling links, but as far as I can tell from the source code it handles any bad links. This is at least one building block for the solution you're asking for.
VisitCounterMacro: This macro displays how many times a wiki page has been viewed.
This is a rather old one. You only get the statistic per page, and an administrative overview is missing, but that could be built rather easily, e.g. as a custom PageIndex.

python method to extract content (excluding navigation) from an HTML page

Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods for extracting information from an HTML file.
Trying to generically extract data from webpages would require people to write their pages in a similar way, but there's an almost infinite number of ways to lay out a page that looks identical, let alone all the combinations you can use to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and comparing the relative sizes of all the information on the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information, for example:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
[EDIT2] Looking at readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try to understand tables better?
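A toy sketch of that scoring idea, using BeautifulSoup (my choice of parser, not something the suggestion above depends on): score each div and p by word count, penalize nav/advert-ish class or id names, and keep the best-scoring node.
from bs4 import BeautifulSoup

def score(tag):
    words = len(tag.get_text(" ", strip=True).split())
    points = words // 100  # +1 for every 100 words
    # +1 for every child element with more than 100 words
    points += sum(1 for child in tag.find_all(["div", "p"])
                  if len(child.get_text(" ", strip=True).split()) > 100)
    name = " ".join(tag.get("class", []) + [tag.get("id") or ""]).lower()
    if "nav" in name:
        points -= 1
    if "advert" in name:
        points -= 2
    return points

html = open("page.html").read()  # hypothetical input file
soup = BeautifulSoup(html, "html.parser")
best = max(soup.find_all(["div", "p"]), key=score)
print(best.get_text(" ", strip=True)[:500])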
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google App Engine.)
What is meaningful and what is not depends on the semantics of the page. If the semantics are poor, your code won't be able to "guess" what is meaningful. I use readability, which you linked in the comment, and on many of the pages I try to read it does not produce any result, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpBB forum and you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
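A hedged usage sketch based on the python-goose README (attribute names may differ between Goose ports and versions):
from goose import Goose

url = "http://example.com/some-article"  # hypothetical article URL
article = Goose().extract(url=url)
print(article.title)
print(article.cleaned_text[:300])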
