Query Wikipedia data page - python

I'm having trouble understanding the Wikipedia API.
I have isolated a link by processing the JSON I got as a response to a request to http://en.wikipedia.org/w/api.php.
Assuming I have that link, how do I get access to information like the date of birth, etc.?
I'm using Python. I tried doing a
import urllib2,simplejson
search_req = urllib2.Request(direct_url_to_required_wikipedia_page)
response = urllib2.urlopen(search_req)
I have tried reading the API documentation, but I can't figure out how to extract data from specific pages.

Try:
import urllib
import urllib2
import simplejson

url = 'http://en.wikipedia.org/w/api.php'
values = {'action': 'query',
          'prop': 'revisions',
          'titles': 'Jennifer_Aniston',
          'rvprop': 'content',
          'format': 'json'}

# urlencode the parameters and POST them to the API endpoint (Python 2)
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
json = response.read()
The variable json now holds the JSON returned for the Wikipedia page. You can parse it with simplejson, the standard library's json module, or whatever you prefer.
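For what it's worth, here is a minimal sketch of the same query on Python 3 with the requests library, including pulling the wikitext out of the parsed response. It assumes the API's default JSON layout, where the page content sits under revisions[0]['*']:

import requests

params = {'action': 'query',
          'prop': 'revisions',
          'titles': 'Jennifer_Aniston',
          'rvprop': 'content',
          'format': 'json'}
resp = requests.get('https://en.wikipedia.org/w/api.php', params=params)
data = resp.json()  # parsed JSON as a Python dict

# 'pages' is keyed by page ID, so iterate over the values
for page in data['query']['pages'].values():
    print(page['title'])
    print(page['revisions'][0]['*'][:300])  # first 300 chars of wikitext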

Go to the MediaWiki API documentation. It's better organized and friendlier for humans :-).

You won't get information like date of birth from the API, at least not directly. The best you can do is to get the wikitext of the page (or the rendered HTML) and parse that to extract the information you need.
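For example, here is a rough sketch of fishing a date of birth out of fetched wikitext. The regex assumes the article's infobox uses the common {{Birth date}} or {{Birth date and age}} template; infobox conventions vary between articles, so treat this as fragile:

import re

def birth_date(wikitext):
    # Matches {{Birth date|YYYY|MM|DD}} and {{Birth date and age|YYYY|MM|DD}}
    m = re.search(r'\{\{[Bb]irth date(?: and age)?\|(\d{4})\|(\d{1,2})\|(\d{1,2})', wikitext)
    return '%s-%s-%s' % m.groups() if m else None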
As an alternative, you might want to look at DBpedia.


Decoding title returned by Wikipedia API for Python requests library

The code below queries the Wikipedia API for pages in the "Physics" category and converts the response into a Python dictionary.
import ast
import requests

url = "https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&cmlimit=500&cmcontinue="
response = requests.get(url)
text = response.text
pages = ast.literal_eval(text)  # naming this "dict" would shadow the built-in
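One caveat: ast.literal_eval cannot parse the JSON literals true, false and null (Python spells them True, False and None), so it can break on perfectly valid API responses. Since the body is JSON, letting requests decode it is safer; a minimal sketch:

import requests

url = "https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&cmlimit=500&cmcontinue="
response = requests.get(url)
data = response.json()  # parses the JSON body into a dict
for member in data['query']['categorymembers']:
    print(member['title'])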
Here is one of the results returned by the Wikipedia API:
{
"pageid": 50724262,
"ns": 0,
"title": "Blasius\u2013Chaplygin formula"
},
The Wikipedia page that "Blasius\u2013Chaplygin formula" corresponds to is https://en.wikipedia.org/wiki/Blasius–Chaplygin_formula.
I want to use the "title" to download pages from Wikipedia. I've replaced all spaces with underscores, but the request still fails. I'm doing:
import requests
url = "https://en.wikipedia.org/wiki/Blasius\u2013Chaplygin_formula"
response = requests.get(url)
This gives me:
requests.exceptions.HTTPError: 404 Client Error:
Not Found for url: https://en.wikipedia.org/wiki/Blasius%5Cu2013Chaplygin_formula
How do I change the title Blasius\u2013Chaplygin formula into a URL that can be successfully called by requests?
When I tried to insert the Wikipedia link into this question on Stack Overflow, Stack Overflow automatically converted it to https://en.wikipedia.org/wiki/Blasius%E2%80%93Chaplygin_formula.
When I did:
import requests
url = "https://en.wikipedia.org/wiki/Blasius%E2%80%93Chaplygin_formula"
response = requests.get(url)
it was successful, so I want a library that will do a conversion like this that I can use in Python.
That "\u2013" is a unicode character. It gets automatically turned into an en-dash by python, but you can't put en-dashes in wikipedia links, so you have to url encode it, which is what stackoverflow did for you earlier.
You can do it yourself by using something like this:
import requests
import urllib.parse

title = "Blasius\u2013Chaplygin_formula"
response = requests.get("https://en.wikipedia.org/wiki/" + urllib.parse.quote(title))
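For reference, urllib.parse.quote percent-encodes the en-dash as its UTF-8 byte sequence, which matches the URL that worked for you; a quick check:

import urllib.parse

print(urllib.parse.quote("Blasius\u2013Chaplygin_formula"))
# prints: Blasius%E2%80%93Chaplygin_formula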
See also: How to urlencode a querystring in Python?
To make your life easier, you can always use an existing wrapper around the Wikipedia API, such as Wikipedia-API.
import wikipediaapi

api = wikipediaapi.Wikipedia('en')

# it will shield you from URL encoding problems
p = api.page('Blasius\u2013Chaplygin formula')
print(p.summary)

# and it can make your code shorter
physics = api.page('Category:Physics')
for p in physics.categorymembers.values():
    print(f'[{p.title}]\t{p.summary}')

PUT method using python3 and urllib - headers

I am trying to receive the data from this JSON service. I am able to use POST and GET on any link except the one I am currently trying to read, which needs PUT. So I wanted to know whether I am calling this URL correctly via urllib, or whether I am missing something.
Request
{"DataType":"Word","Params":["1234"], "ID":"22"}
Response
{
    JSON DATA IN HERE
}
I feel like I am doing the PUT method call wrong since it is wrapped around Request{}.
import urllib.request, json
from pprint import pprint

header = {"DataType": "Word", "Params": ["1234"], "ID": "22"}
req = urllib.request.Request(url="website/api/json.service",
                             headers=header, method='PUT')
with urllib.request.urlopen(req) as url:
    data = json.loads(url.read().decode())
    pprint(data)
I am able to print JSON data as long as the method is anything but PUT. As soon as I hit a site that requires PUT with the JSON template above, I get an Internal Error 500, so I assumed the problem was my header.
Thank you in advance!
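One likely fix: the JSON document belongs in the request body, not in the headers. A sketch under that assumption (the URL is the placeholder from the question, given a scheme so urlopen accepts it):

import json
import urllib.request
from pprint import pprint

payload = {"DataType": "Word", "Params": ["1234"], "ID": "22"}
body = json.dumps(payload).encode('utf-8')

req = urllib.request.Request(url="http://website/api/json.service",  # placeholder from the question
                             data=body,
                             headers={"Content-Type": "application/json"},
                             method='PUT')
with urllib.request.urlopen(req) as resp:
    pprint(json.loads(resp.read().decode()))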

Posting user input to web form via Python

I have a web form that I would like to be filled out by users through a Python bot. Everything I can find on this pulls all of the data from a pre-defined payload via requests or mechanize; my situation differs in that I'd like users to trigger it with their own text (for example: .submit Ticket #1234 - Blah blah blah).
The page they are submitting to is a simple form - 1 text area and 1 submit button.
Could anyone shine some light on some tutorials or how I'd go about this?
Here's my attempt:
import urllib.parse
import urllib.request

#hook.command("addquote")
def addquote(text, bot):
    """<query> -- adds a quote to the qdb."""
    url = 'http://example.com/?add'
    values = {'addquote': text}
    data = urllib.parse.urlencode(values).encode('utf-8')
    req = urllib.request.Request(url, data)
    response = urllib.request.urlopen(req)
    the_page = response.read()
Thanks!
Here's an example using HTTP POST. If your web form uses PUT, just change the method. It outputs the page that is returned (which probably isn't needed unless you want to check for success) along with the HTTP status code and reason. This is Python 3; only small changes are needed for older versions:
import urllib.parse, urllib.request

url = "http://hskhsk.pythonanywhere.com/cidian"
params = {"q": "apple", "secondparam": "ignored"}
encoded = urllib.parse.urlencode(params)
data = bytes(encoded, 'utf-8')

req = urllib.request.Request(url=url, data=data, method="POST")
with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))
    print(f.status)
    print(f.reason)
The web form that it calls is a simple English-Chinese dictionary that accepts HTTP GET or POST requests. The text to search for (apple) is the q parameter and the other parameter is ignored.
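Since the form also accepts GET, the same query can go in the URL's query string instead; a minimal sketch:

import urllib.parse, urllib.request

url = "http://hskhsk.pythonanywhere.com/cidian?" + urllib.parse.urlencode({"q": "apple"})
with urllib.request.urlopen(url) as f:
    print(f.read().decode('utf-8'))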
I would recommend reading this book http://learnpythonthehardway.org/book/ex51.html
Here is a very simple example, using GrabLib:

# we will use GrabLib (http://docs.grablib.org/en/latest/)
from grab import Grab

g = Grab()
g.go('http://someurl.com')
g.set_input('input_name', 'text to be inputted')  # name attribute of the text area
g.submit()

That's all.

POST request via requests (python) not returning data

I have another question about posts.
This should be almost identical to the Stack Overflow question 'Using request.post to post multipart form data via python not working', but for some reason I can't get it to work. The website is http://www.camp.bicnirrh.res.in/predict/. I want to post a file that is already in FASTA format to this website and select the 'SVM' option, using requests in Python. This is based on what @NorthCat gave me previously, which worked like a charm:
import requests

session = requests.session()
file = {'file': open('Bishop/newdenovo2.txt', 'r').read()}
url = 'http://www.camp.bicnirrh.res.in/predict/hii.php'
payload = {"algo[]": "svm"}
response = session.post(url, files=file, data=payload)
print(response.text)
Since it's not working, I assumed the payload was the problem. I've been playing with the payload, but I can't get any of these to work.
payload = {'S1':str(data), 'filename':'', 'algo[]':'svm'} # where I tried just reading the file in, called 'data'
payload = {'svm':'svm'} # not actually in the headers, but I tried this too
payload = {'S1': '', 'algo[]':'svm', 'B1': 'Submit'}
None of these payloads resulted in data.
Any help is appreciated. Thanks so much!
You need to set the file post variable name to "userfile", i.e.
file={'userfile':(open('Bishop/newdenovo2.txt','r').read())}
Note that the read() is unnecessary, but it doesn't prevent the file upload succeeding. Here is some code that should work for you:
import requests

session = requests.session()
response = session.post('http://www.camp.bicnirrh.res.in/predict/hii.php',
                        files={'userfile': ('fasta.txt', open('fasta.txt'), 'text/plain')},
                        data={'algo[]': 'svm'})
response.text contains the HTML results. Save it to a file and view it in your browser, or parse it with something like Beautiful Soup and extract the results.
In the request I've specified a mime type of "text/plain" for the file. This is not necessary, but it serves as documentation and might help the receiving server.
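Continuing from the code above, a minimal parsing sketch with Beautiful Soup (assuming beautifulsoup4 is installed; which tags hold the prediction results depends on the returned HTML, so start by dumping the text and inspecting it):

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
# Dump the visible text first, then narrow down to the element
# that actually holds the prediction results.
print(soup.get_text())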
The content of my fasta.txt file is:
>24.6jsd2.Tut
GGTGTTGATCATGGCTCAGGACAAACGCTGGCGGCGTGCTTAATACATGCAAGTCGAACGGGCTACCTTCGGGTAGCTAGTGGCGGACGGGTGAGTAACACGTAGGTTTTCTGCCCAATAGTGGGGAATAACAGCTCGAAAGAGTTGCTAATACCGCATAAGCTCTCTTGCGTGGGCAGGAGAGGAAACCCCAGGAGCAATTCTGGGGGCTATAGGAGGAGCCTGCGGCGGATTAGCTAGATGGTGGGGTAAAGGCCTACCATGGCGACGATCCGTAGCTGGTCTGAGAGGACGGCCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAAGGAATATTCCACAATGGCCGAAAGCGTGATGGAGCGAAACCGCGTGCGGGAGGAAGCCTTTCGGGGTGTAAACCGCTTTTAGGGGAGATGAAACGCCACCGTAAGGTGGCTAAGACAGTACCCCCTGAATAAGCATCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGATGCAAGCGTTGTCCGGATTTACTGGGCGTAAAGCGCGCGCAGGCGGCAGGTTAAGTAAGGTGTGAAATCTCCCTGCTCAACGGGGAGGGTGCACTCCAGACTGACCAGCTAGAGGACGGTAGAGGGTGGTGGAATTGCTGGTGTAGCGGTGAAATGCGTAGAGATCAGCAGGAACACCCGTGGCGAAGGCGGCCACCTGGGCCGTACCTGACGCTGAGGCGCGAAGGCTAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTAGCAGTAAACGATGTCCACTAGGTGTGGGGGGTTGTTGACCCCTTCCGTGCCGAAGCCAACGCATTAAGTGGACCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGCGGAGCGTGTGGTTTAATTCGATGCGACGCGAAGAACCTTACCTGGGCTTGACATGCTATCGCAACACCCTGAAAGGGGTGCCTCCTTCGGGACGGTAGCACAGATGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGTATATCTAAGGAGACTGCCGGAGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCAGCATGGCTCTTACGTCCAGGGCTACACATACGCTACAATGGCCGTTACAGTGAGATGCCACACCGCGAGGTGGAGCAGATCTCCAAAGGCGGCCTCAGTTCAGATTGCACTCTGCAACCCGAGTGCATGAAGTCGGAGTTGCTAGTAACCGCGTGTCAGCATAGCGCGGTGAATATGTTCCCGGGTCTTGTACACACCGCCCGTCACGTCATGGGAGCCGGCAACACTTCGAGTCCGTGAGCTAACCCCCCCTTTCGAGGGTGTGGGAGGCAGCGGCCGAGGGTGGGGCTGGTGACTGGGACGAAGTCGTAACAAGGT

Python urllib POST response

I am trying to write a script that searches for an InChIKey (e.g. OBSSCZVQJAGPOE-KMKNQKDISA-N) to get a chemical structure from this website:
http://www.chemspider.com/inchi-resolver/Resolver.aspx
From the documentation my code looks like it should work, but instead it just returns the original search page.
Thanks for the help,
import urllib

inchi = 'OBSSCZVQJAGPOE-KMKNQKDISA-N'
url = 'http://www.chemspider.com/inchi-resolver/Resolver.aspx'
data = urllib.urlencode({'ctl00$ContentPlaceHolder1$TextBox1': inchi})
response = urllib.urlopen(url, data)
print response.read()
Passing data to urllib.urlopen does make this a POST request, but the form is an ASP.NET page and contains various hidden fields (__VIEWSTATE, __EVENTVALIDATION and friends) whose values are most likely required for the request to be processed.
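A sketch of one way to deal with those hidden fields, using requests and Beautiful Soup: fetch the form page first, copy every hidden input into the POST data, then add the text box value. The page may also expect the name of the submit button, so inspect the form to confirm:

import requests
from bs4 import BeautifulSoup

url = 'http://www.chemspider.com/inchi-resolver/Resolver.aspx'
session = requests.Session()

# Collect the hidden ASP.NET fields (__VIEWSTATE and friends) from the form page
soup = BeautifulSoup(session.get(url).text, 'html.parser')
data = {field.get('name'): field.get('value', '')
        for field in soup.find_all('input', type='hidden')}

# Add the search box value and post the whole form back
data['ctl00$ContentPlaceHolder1$TextBox1'] = 'OBSSCZVQJAGPOE-KMKNQKDISA-N'
response = session.post(url, data=data)
print(response.text)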
