Parsing wikipedia stubs using python wikitools - python

I implemented the example from: Mediawiki and Python
I read Get wikipedia abstract using python and How to parse/extract data from a mediawiki marked-up article via python and several others.
I am trying to get a dump of some Wikipedia stubs associated with a category and insert them into an internal Semantic MediaWiki site. For the purpose of this example I am using the "Somali_Region" category. The script uses the MediaWiki API to fetch the data, then parses it and removes all template information, which is what I want.
from wikitools import wiki
from wikitools import category
import mwparserfromhell
wikisite = "http://en.wikipedia.org/w/api.php"
parse_category = "Somali_Region"
wikiObject = wiki.Wiki(wikisite)
wikiCategory = category.Category(wikiObject, parse_category)
articles = wikiCategory.getAllMembersGen(namespaces=[0])
for article in articles:
    wikiraw = article.getWikiText()
    parsedWikiText = mwparserfromhell.parse(wikiraw)
    for template in parsedWikiText.filter_templates():
        parsedWikiText.remove(template)
    print parsedWikiText
The internal Semantic MediaWiki site fails if I try to import a dump from Wikipedia, so that is not an option. Is it possible to use the API to insert data into the Semantic MediaWiki site? I read the MediaWiki API edit page, but I could not find a Python example.

If I understand correctly, you want to take your parsedWikiText and save it into a private wiki.
Here's what I have for doing that kind of thing (you'll need to store USERNAME and PASSWORD somewhere; I use a config file, but there are more secure ways). I'll pick up from right before your for loop...
# Set up and authenticate into the target wiki if you need to.
from wikitools import wiki, page
target_wiki = wiki.Wiki('http://wiki.example.com/w/api.php')
target_wiki.login(USERNAME, PASSWORD)
for article in articles:
    wikiraw = article.getWikiText()
    parsedWikiText = mwparserfromhell.parse(wikiraw)
    for template in parsedWikiText.filter_templates():
        parsedWikiText.remove(template)
    # Use the API's edit function to save the new content.
    target_title = article.title
    target_page = page.Page(target_wiki, target_title)
    result = target_page.edit(text=parsedWikiText, summary="Imported text")
    # Check to see if it worked.
    if result['edit']['result'] == 'Success':
        print 'Saved', target_title
    else:
        print 'Save failed', target_title
I'm assuming here you want to save parsedWikiText into a new page. If there's already something on the page in your wiki, you'll have to read it first with target_page.getWikiText() and then mix the new text in somehow. I've also assumed the article will have the same name it had in Wikipedia; if not then change target_title.
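For the case where the target page already has content, the MediaWiki edit API accepts an appendtext parameter as an alternative to text. The sketch below only builds the POST parameters for action=edit, without sending anything; the helper name build_edit_params, the page title and the token value are placeholders for illustration, and it is written for Python 3 rather than the Python 2 used above.

```python
def build_edit_params(title, text, token, summary="Imported text", append=False):
    """Return the POST parameters for a MediaWiki action=edit request."""
    params = {
        "action": "edit",
        "format": "json",
        "title": title,
        "summary": summary,
        "token": token,  # CSRF token obtained from the target wiki beforehand
    }
    # "text" replaces the whole page; "appendtext" adds to the existing content.
    params["appendtext" if append else "text"] = text
    return params


# Placeholder title and token, for illustration only.
params = build_edit_params("Jijiga", "Imported stub text", token="PLACEHOLDER", append=True)
print(sorted(params))
```

A wrapper library such as wikitools normally takes care of the token for you, so this is mainly useful for seeing what ends up in the request.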

Related

Python: a script to find the website language

Hello everyone,
I am trying to write a program in Python to automatically check a website's language. My script looks at the HTML header, identifies where the string 'lang' appears, and prints the corresponding language. I use the module 'requests'.
import requests
request = requests.get('https://en.wikipedia.org/wiki/Main_Page')
splitted_text = request.text.split()
matching = [s for s in splitted_text if "lang=" in s]
language_website = matching[0].split('=')[1]
print(language_website[1:3])
>>> en
I have tested it over several websites, and it works (given the language is correctly configured in the HTML in the first place, which is likely for the websites I consider in my research).
My question is: is there a more straightforward / consistent / systematic way to achieve the same thing? How would one look at the HTML using Python and return the language the website is written in? Is there a quicker way, using lxml for instance, that does not involve parsing strings like I do?
I know the question of how to find a website language has been asked before, and the method using the HTML header to retrieve the language was mentioned, but it was not developed and no code was suggested, so I think this post is reasonably different.
Thank you so very much! Have a wonderful day,
Berti
You can try this :
import requests
request = requests.head('https://en.wikipedia.org/wiki/Main_Page')
print(request.headers["Content-language"])
If you would rather get the language from the page source, this might help.
import requests
import lxml.html
request = requests.get('https://en.wikipedia.org/wiki/Main_Page')
root = lxml.html.fromstring(request.text)
language_construct = root.xpath("//html/@lang")  # this XPath is reliable in the long term, since the lang attribute on <html> is a standard construct
language = "Not found in page source"
if language_construct:
    language = language_construct[0]
print(language)
Note: This approach will not give a result for all webpages, only for those which contain an HTML language code.
Refer to https://www.w3schools.com/tags/ref_language_codes.asp for more.
Combining the above responses
import requests
request = requests.head('https://en.wikipedia.org/wiki/Main_Page')
print(request.headers.get("Content-language", "Not found in page source"))
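If neither the Content-language header nor lxml is available, the same lang attribute can be read with the standard library alone. This is a minimal sketch using html.parser; the class and function names are made up for illustration, and the sample markup is a hand-written stand-in for a downloaded page.

```python
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Record the lang attribute of the first <html> tag encountered."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def extract_html_lang(html_text):
    """Return the declared page language, or None if <html> has no lang."""
    parser = HtmlLangParser()
    parser.feed(html_text)
    return parser.lang


# Hand-written sample; in practice pass requests.get(url).text instead.
sample = '<!DOCTYPE html><html class="client-nojs" lang="en" dir="ltr"><head></head><body></body></html>'
print(extract_html_lang(sample))
```

As with the other approaches, this only reports what the page declares, so a missing or wrong lang attribute gives None or a misleading value.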

Getting author's articles from Scopus using Scopus API (AUTHENTICATION_ERROR)

I've registered at http://www.developers.elsevier.com/action/devprojects. I created a project and got my scopus key:
Now, using this generated key, I would like to find an author by first name, last name and subject area. I make requests from my university network, which is allowed to access Scopus (I have full manual access to Scopus search and use it from Firefox with no problem). However, I wanted to automate my Scopus mining by writing a simple script that finds the publications of an author given his/her first name, last name and subject area.
Here's my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import json
from scopus import SCOPUS_API_KEY
scopus_author_search_url = 'http://api.elsevier.com/content/search/author?'
headers = {'Accept':'application/json', 'X-ELS-APIKey': SCOPUS_API_KEY}
search_query = 'query=AUTHFIRST(%s) AND AUTHLASTNAME(%s) AND SUBJAREA(%s)' % ('John', 'Kitchin', 'COMP')
# api_resource = "http://api.elsevier.com/content/search/author?apiKey=%s&" % (SCOPUS_API_KEY)
# request with first searching page
page_request = requests.get(scopus_author_search_url + search_query, headers=headers)
print page_request.url
# response to json
page = json.loads(page_request.content.decode("utf-8"))
print page
Where SCOPUS_API_KEY looks just like this: SCOPUS_API_KEY="xxxxxxxx".
Although I have full access to scopus from my university network, I'm getting such response:
{u'service-error': {u'status': {u'statusText': u'Requestor
configuration settings insufficient for access to this resource.',
u'statusCode': u'AUTHENTICATION_ERROR'}}}
The generated link looks like this: http://api.elsevier.com/content/search/author?query=AUTHFIRST(John)%20AND%20AUTHLASTNAME(Kitchin)%20AND%20SUBJAREA(COMP) and when I click it, it shows an XML file:
<service-error><status>
<statusCode>AUTHORIZATION_ERROR</statusCode>
<statusText>No APIKey provided for request</statusText>
</status></service-error>
Or, when I change the scopus_author_search_url to "http://api.elsevier.com/content/search/author?apiKey=%s&" % (SCOPUS_API_KEY) I'm getting:
{u'service-error': {u'status': {u'statusText': u'Requestor configuration settings insufficient for access to this resource.', u'statusCode': u'AUTHENTICATION_ERROR'}}} and the XML file:
<service-error>
<status>
<statusCode>AUTHENTICATION_ERROR</statusCode>
<statusText>Requestor configuration settings insufficient for access to this resource.</statusText>
</status>
</service-error>
What can be the cause of this problem and how can I fix it?
I have just registered for an API key and tested it first with this URL:
http://api.elsevier.com/content/search/author?apikey=4xxxxxxxxxxxxxxxxxxxxxxxxxxxxx43&query=AUTHFIRST%28John%29+AND+AUTHLASTNAME%28Kitchin%29+AND+SUBJAREA%28COMP%29
This works fine from my university network. I also tested a second API key, so I have verified one key registered with a website on my university domain and one registered with the website http://apitest.example.com. That rules out the domain name used at registration as the source of your problem.
I tested this:
1. in the browser,
2. using your Python code with the API key in the headers. The only change I made to your code is removing
from scopus import SCOPUS_API_KEY
and adding
SCOPUS_API_KEY ='4xxxxxxxxxxxxxxxxxxxxxxxxxxxxx43'
3. using your Python code adapted to put the apikey in the URL instead of the headers.
In all cases, the query returns two authors, one at Carnegie Mellon and one at Palo Alto.
I can't replicate your error message. If I try to use the API key from an IP address unregistered with elsevier (e.g. my home computer), I see a different error:
<service-error>
<status>
<statusCode>AUTHENTICATION_ERROR</statusCode>
<statusText>Client IP Address: xxx.yyy.aaa.bbb does not resolve to an account</statusText>
</status>
</service-error>
If I use a random (wrong) API key from the university network, I see
<service-error>
<status>
<statusCode>AUTHORIZATION_ERROR</statusCode>
<statusText>APIKey <mad3upa1phanum3r1ck3y> with IP address <my.uni.IP.add> is unrecognized or has insufficient privileges for access to this resource</statusText>
</status>
</service-error>
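Since the different failure modes above are distinguished only by statusCode and statusText, it can help to pull those two fields out in code rather than reading the XML by eye. A minimal sketch with the standard library, assuming the error body has exactly the un-namespaced shape shown in this answer; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_service_error(xml_text):
    """Return (statusCode, statusText) from a service-error body, else None."""
    root = ET.fromstring(xml_text)
    status = root.find("status")
    if root.tag != "service-error" or status is None:
        return None
    return status.findtext("statusCode"), status.findtext("statusText")


# Sample copied from the unregistered-IP error shown above.
sample_error = """<service-error>
<status>
<statusCode>AUTHENTICATION_ERROR</statusCode>
<statusText>Client IP Address: xxx.yyy.aaa.bbb does not resolve to an account</statusText>
</status>
</service-error>"""

code, text = parse_service_error(sample_error)
print(code)
print(text)
```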
Debug steps
As I can't replicate your problem, here are some diagnostic steps you can use to resolve it:
1. Use your browser at uni to actually submit the API query with your key in the URL (i.e. copy the URL above, paste it into the address bar, substitute your key and see whether you get the XML back).
2. If 1 returns the XML you expect, move on to submitting the request via Python. First, copy the exact URL straight into Python (no variable substitution via %s, no apikey in the header) and simply do a .get() on it.
3. If 2 returns correctly, ensure that your SCOPUS_API_KEY holds the exact key value, no more, no less, i.e. print SCOPUS_API_KEY should return your apikey: 4xxxxxxxxxxxxxxxxxxxxxxxxxxxxx43
4. If 1 returns the error, it looks like your uni (for whatever reason) has not got access to the authors query API. This doesn't make much sense given that you can perform a manual search, but that is all I can conclude.
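To compare the URL your script generates with the one that works in the browser, it may be easier to build it with proper percent-encoding instead of string concatenation. This is a sketch in Python 3 using urllib.parse.urlencode; the function name and the YOUR_API_KEY value are placeholders, and the endpoint is the one used in the question.

```python
from urllib.parse import urlencode

AUTHOR_SEARCH_URL = "http://api.elsevier.com/content/search/author"


def build_author_search_url(first, last, subject_area, api_key):
    """Build the author search URL with the key and query percent-encoded."""
    query = "AUTHFIRST(%s) AND AUTHLASTNAME(%s) AND SUBJAREA(%s)" % (
        first, last, subject_area)
    return AUTHOR_SEARCH_URL + "?" + urlencode({"apikey": api_key, "query": query})


# Substitute your real key for the placeholder before pasting the URL anywhere.
url = build_author_search_url("John", "Kitchin", "COMP", "YOUR_API_KEY")
print(url)
```

The printed URL can then be pasted into the browser (step 1) or passed unchanged to a plain .get() (step 2).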
Docs
For reference the authentication algorithm documentation is here, but it is not very simple to follow. You are following authentication option 1 and your method should just work.
N.B. The API is limited to 5000 author retrievals per week. If you have run a lot of queries in a loop, even if they have failed, it is possible that you have exceeded that...
For future reference: the OP was using the package scopus, which has long since been renamed to pybliometrics.
Nowadays you can do
from pybliometrics.scopus import AuthorSearch
q = "AUTHFIRST(John) AND AUTHLASTNAME(Kitchin) AND SUBJAREA(COMP)"
s = AuthorSearch(q) # handles access, retrieval, parsing and even caches results
print(s)
results = s.authors # Holds all the information as a list of namedtuples
print(results) # You can put this into a pandas DataFrame as well

Wikipedia Python API - Prevent hidden categories

I want to find topics related to a given topic and also the degree of relationship between multiple topics. For this, I tried to extract the Wiki Page of the Topic and build a taxonomy using the Categories of the topic (given at the bottom of the page). I want to use Python API of Wikipedia for this (https://wikipedia.readthedocs.org/en/latest/code.html#api). But when I extract categories, it returns the hidden categories too that are normally not visible on the Wiki Page.
import wikipedia
import requests
import pprint
from bs4 import BeautifulSoup
wikipedia.set_lang("en")
query = raw_input()
WikiPage = wikipedia.page(title = query,auto_suggest = True)
cat = WikiPage.categories
for i in cat:
    print i
I know the other option is to use a scraper. But I want to use the API to do this.
You can definitely use the API for this. Just append &clshow=!hidden to your category query, like this:
http://en.wikipedia.org/w/api.php?action=query&titles=Stack%20Overflow&prop=categories&clshow=!hidden
(I'm assuming English Wikipedia here, but the API is the same everywhere.)
Also, just to be clear: There is no such thing as a “Python API” to Wikipedia, just the MediaWiki API, that you can call from any programming language. In your example code you are using a Python library (one of many) to access the Wikipedia API. This library does not seem to have an option for excluding hidden categories. For a list of other, perhaps more flexible, Python libraries, see http://www.mediawiki.org/wiki/API:Client_code#Python. Personally I quite like wikitools for simple tasks like yours. It would then look something like this:
from wikitools.wiki import Wiki
from wikitools.api import APIRequest
site = Wiki("http://fa.wikipedia.org/w/api.php")
site.login("username", "password")
params = {
    "action": "query",
    "titles": "سرریز_پشته",
    "prop": "categories",
    "clshow": "!hidden",
}
request = APIRequest(site, params)
result = request.query()
print result
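Without any wrapper library, the same query can be sketched with the standard library: one helper builds the URL with clshow=!hidden, another pulls the category titles out of the JSON reply. The helper names are made up for illustration, and the sample response is a hand-written stand-in that follows the usual query/pages/categories layout of the API's JSON output, not a real reply.

```python
from urllib.parse import urlencode


def build_category_query(api_url, title):
    """Build a query URL that lists only the non-hidden categories of a page."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "categories",
        "clshow": "!hidden",
    }
    return api_url + "?" + urlencode(params)


def extract_categories(response):
    """Collect category titles from a decoded JSON response."""
    names = []
    for page in response.get("query", {}).get("pages", {}).values():
        for cat in page.get("categories", []):
            names.append(cat["title"])
    return names


url = build_category_query("http://en.wikipedia.org/w/api.php", "Stack Overflow")
print(url)

# Illustrative response only; page id and category names are invented.
sample_response = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "ns": 0,
                "title": "Stack Overflow",
                "categories": [
                    {"ns": 14, "title": "Category:Example category A"},
                    {"ns": 14, "title": "Category:Example category B"},
                ],
            }
        }
    }
}
print(extract_categories(sample_response))
```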

How do I get the HTML of a wiki page with Pywikibot?

I'm using pywikibot-core. Before that I used another Python MediaWiki API wrapper, Wikipedia.py (which has a .HTML method). I switched to pywikibot-core because I think it has many more features, but I can't find a similar method.
(beware: I'm not very skilled).
I'll post here user283120's second answer, which is more precise than the first one:
Pywikibot core doesn't support any direct (HTML) way to interact with a wiki, so you should use the API.
If you need to, you can do it easily by using urllib2.
This is an example I used to get the HTML of a wiki page on Commons:
import urllib2
...
url = "https://commons.wikimedia.org/wiki/" + page.title().replace(" ","_")
html = urllib2.urlopen(url).read().decode('utf-8')
"[saveHTML.py] downloads the HTML-pages of articles and images and saves the interesting parts, i.e. the article-text and the footer to a file"
source: https://git.wikimedia.org/blob/pywikibot%2Fcompat.git/HEAD/saveHTML.py
IIRC you want the HTML of the entire page, so you need something that uses api.php?action=parse. In Python I'd often just use wikitools for such a thing; I don't know about PWB or the other requirements you have.
In general you should use pywikibot instead of wikipedia (e.g. instead of "import wikipedia" you should use "import pywikibot"). If you are looking for methods and classes that used to be in wikipedia.py, they are now separated and can be found in the pywikibot folder (mainly in page.py and site.py).
If you want to run scripts that you wrote for compat, you can use a script in pywikibot-core named compat2core.py (in the scripts folder). There is also detailed help on the conversion named README-conversion.txt; read it carefully.
The MediaWiki API has a parse action which returns the HTML snippet for the wiki markup as rendered by the MediaWiki markup parser.
For the pywikibot library there is already a function implemented which you can use like this:
def getHtml(self,pageTitle):
    '''
    get the HTML code for the given page Title
    Args:
        pageTitle(str): the title of the page to retrieve
    Returns:
        str: the rendered HTML code for the page
    '''
    page=self.getPage(pageTitle)
    html=page._get_parsed_page()
    return html
When using the mwclient python library there is a generic api method see:
https://github.com/mwclient/mwclient/blob/master/mwclient/client.py
Which can be used to retrieve the html code like this:
def getHtml(self,pageTitle):
    '''
    get the HTML code for the given page Title
    Args:
        pageTitle(str): the title of the page to retrieve
    '''
    api=self.getSite().api("parse",page=pageTitle)
    if "parse" not in api:
        raise Exception("could not retrieve html for page %s" % pageTitle)
    html=api["parse"]["text"]["*"]
    return html
As shown above, this gives a duck-typed interface, which is implemented in the py-3rdparty-mediawiki library, for which I am a committer. This was resolved by closing issue 38 - add html page retrieval
With Pywikibot you may use http.request() to get the html content:
import pywikibot
from pywikibot.comms import http
site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Elvis Presley')
path = '{}/index.php?title={}'.format(site.scriptpath(), page.title(as_url=True))
r = http.request(site, path)
print(r[94:135])
This should give the html content
'<title>Elvis Presley – Wikipedia</title>\n'
With Pywikibot 6.0, http.request() returns a requests.Response object rather than plain text. In this case you must use its text attribute:
print(r.text[94:135])
to get the same result.
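Independently of the wrapper library, the api.php?action=parse route mentioned in the answers above boils down to one request and one dictionary lookup. The sketch below keeps the network call out and shows only the parameters and the extraction step; the helper names are made up, and the sample response is a hand-written stand-in that mirrors the ["parse"]["text"]["*"] layout used in the mwclient example.

```python
def build_parse_params(title):
    """Parameters for api.php that ask for the rendered HTML of a page."""
    return {"action": "parse", "page": title, "prop": "text", "format": "json"}


def extract_html(response, title):
    """Return the HTML from a decoded action=parse response."""
    if "parse" not in response:
        raise ValueError("could not retrieve html for page %s" % title)
    return response["parse"]["text"]["*"]


params = build_parse_params("Elvis Presley")
print(params)

# Illustrative response only; a real reply carries the full rendered page.
sample_response = {
    "parse": {"title": "Elvis Presley", "text": {"*": "<div><p>Example</p></div>"}}
}
print(extract_html(sample_response, "Elvis Presley"))
```

The params dictionary can be handed to whichever HTTP client or API wrapper you already use, for example as the params argument of requests.get.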

Scraping with Python?

I'd like to grab all the index words and their definitions from here. Is it possible to scrape web content with Python?
Firebug exploration shows the following URL returns my desirable contents including both index and its definition as to 'a'.
http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined
Which modules are used for this? Is there any tutorial available?
I do not know how many words are indexed in the dictionary. I'm an absolute beginner in programming.
You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML.
Example - retrieving all questions from the StackOverflow.com main page:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print
This code sample was adapted from the BeautifulSoup documentation.
You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything.
http://www.crummy.com/software/BeautifulSoup/
The documentation is built like a tutorial. Sorta:
http://www.crummy.com/software/BeautifulSoup/documentation.html
In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:
import urllib2
def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count)).read()
    # TODO:
    # parse xml code here (using BeautifulSoup or an xml parser like Python's
    # own xml.etree). We should at least have the name and ID for each article.
    # article = (article_name, article_id)
    articles = []  # placeholder: fill with (article_name, article_id) tuples parsed from the XML
    return articles
def getArticleContent(article_id):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id).read()
    # TODO: parse xml
    parsed_article = xml  # placeholder: raw XML until the parsing is written
    return parsed_article
Now you can loop over things. For instance, to get all articles starting in 'ana', use the wildcard 'ana*', and loop until you get no results:
query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break
    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)
Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).
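The loop-until-empty pattern above can be tried out without touching the site by swapping the network call for a stand-in. In this sketch fetch_all and fake_fetch are made-up names, and fake_fetch simply slices an in-memory list of 250 invented (name, id) pairs the way a paged search would.

```python
def fetch_all(fetch_page, page_size=100):
    """Call fetch_page(start, count) repeatedly until it returns nothing."""
    results = []
    start = 0
    while True:
        batch = fetch_page(start, page_size)
        if not batch:
            break
        results.extend(batch)
        start += page_size
    return results


def fake_fetch(start, count):
    """Stand-in for getArticles: serves slices of 250 invented entries."""
    data = [("word%d" % i, i) for i in range(250)]
    return data[start:start + count]


all_items = fetch_all(fake_fetch)
print(len(all_items))
```

Once the parsing in getArticles is written, passing a small wrapper around it in place of fake_fetch gives the same behaviour against the real dictionary.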
A word of warning though:
You should check the site's usage policy (and maybe robots.txt) before trying to scrape articles heavily. If you're just getting a few articles for yourself they may not care (the dictionary copyright owner, if it's not public domain, may care though), but if you're going to scrape the entire dictionary, this is going to be some heavy usage.
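The robots.txt check can be automated with Python 3's standard urllib.robotparser. The sketch below parses rules from a string so it runs offline; the rules, the example.org URLs and the helper name is_allowed are all invented for illustration and say nothing about what the dictionary site actually permits.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented rules for illustration; fetch the real file from the site instead.
sample_robots = "User-agent: *\nDisallow: /cgi-bin/\n"

print(is_allowed(sample_robots, "http://example.org/cgi-bin/cpd/pali"))
print(is_allowed(sample_robots, "http://example.org/index.html"))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file itself; robots.txt does not replace reading the usage policy.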
