I am trying to write a script that looks up an InChIKey (e.g. OBSSCZVQJAGPOE-KMKNQKDISA-N) to get a chemical structure from this website:
http://www.chemspider.com/inchi-resolver/Resolver.aspx
From the documentation my code looks like it should work, but instead it just returns the original search page.
Thanks for the help,
import urllib
inchi = 'OBSSCZVQJAGPOE-KMKNQKDISA-N'
url = 'http://www.chemspider.com/inchi-resolver/Resolver.aspx'
data = urllib.urlencode({'"ctl00$ContentPlaceHolder1$TextBox1"':inchi})
response = urllib.urlopen(url, data)
print response.read()
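Because you pass data, urllib.urlopen already sends a POST rather than a GET, so the request method is not the problem. The field name is: the dictionary key contains literal double quotes, so the server receives a parameter name that does not match the form field. Apart from that, the form contains various hidden fields with opaque values which might be necessary for the processing as well.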
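The hidden fields mentioned above can be read from the form page and sent back together with the search field. Below is a minimal Python 3 sketch of that idea, using only the standard library. The helper names (hidden_fields, build_form_post, resolve_inchikey) are made up for illustration, and it assumes the server accepts the hidden values plus the text box field; the site may also expect the submit button's name and value, which I have not verified.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def build_form_post(url, html, extra_fields):
    """Build a POST request carrying the hidden fields plus our own fields."""
    fields = hidden_fields(html)
    fields.update(extra_fields)
    body = urlencode(fields).encode("utf-8")
    return Request(url, data=body, method="POST")


def resolve_inchikey(inchikey):
    """Hypothetical helper: fetch the form, then post it back (needs network)."""
    url = "http://www.chemspider.com/inchi-resolver/Resolver.aspx"
    page = urlopen(url).read().decode("utf-8", errors="replace")
    # Field name taken from the question, without the extra double quotes.
    request = build_form_post(
        url, page, {"ctl00$ContentPlaceHolder1$TextBox1": inchikey}
    )
    return urlopen(request).read().decode("utf-8", errors="replace")
```

Calling resolve_inchikey('OBSSCZVQJAGPOE-KMKNQKDISA-N') would then return whatever page the server sends back; if it is still the search page, compare the posted fields with what the browser sends in the dev tools network tab.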
Related
I've been working on a project to reverse-engineer Twitter's app in order to scrape public posts from Twitter through an unofficial API, with Python. (I want to create an "alternative" app, which is simply a localhost page that can search for a user and get their posts.)
I've been searching and reading everything related to REST, AJAX, and the python modules requests, requests-html, BeautifulSoup, and more.
I can see when looking at twitter on the devtools (for example on Marvel's profile page) that the only relevant requests being sent (by POST and GET) are the following: client_event.json and UserTweets?variables=... .
I understood that these are the relevant messages being received by cleaning the network tab and recording only when I scroll down and load new tweets - these are the only messages that came up which aren't random videos (I cleaned the search using -video -init -csp_report -config -ondemand -like -pageview -recommendations -prefetch -jot -key_live_kn -svg -jpg -jpeg -png -ico -analytics -loader -sharedCore -Hebrew).
I am new to this field, so I am probably doing something wrong. I can see on UserTweets the response I'm looking for - a beautiful JSON with all the data I need - but I am unable, no matter how much I've been trying to, to access it.
I tried different modules and different headers, and I get nothing. I DON'T want to use Selenium since it's tiresome, and I know where the data I need is stored.
I've been trying to send a GET request to:
https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D
by doing:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
response = session.get('https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D')
response.html.render()
s = BeautifulSoup(response.html.html, 'lxml')
but I get back an HTML script that either says Chromium is unsupported, or just a static page without the javascript updating the DOM.
All help appreciated.
Thank you
P.S
I've posted the same question on reverseengineering.stackexchange, just to be safe (overflow has more appropriate tags :-))
Before you deep dive into the actual code, I would first start building the correct request to twitter. I would use a 3rd party tool focused on REST and APIs such as Postman to build and test the required request - and only then would write the actual code.
From your question it seems that you'll be using an open API of Twitter, which means you only need to send an x-guest-token and the basic Bearer authorization in your request headers.
The Bearer token is static: you can just browse to Twitter and copy it from the dev tools network monitor.
The x-guest-token needs something dynamic because it expires. I would suggest sending a curl request to Twitter, parsing the token from the response, and putting it in your header before sending the request. You can see something very similar in: Python Downloading twitter video using python (without using twitter api).
After you have both of the above, build the required GET request in Postman and test if you get back the correct response. Only after you have everything working in Postman - write the same in Python, or any other language**
**You can use Postman snippets which automatically generates the code needed in many programming languages.
@TripleS, here is an example of how one might extract the JSON data from __INITIAL_STATE__ and write it to a text file.
import requests
import re
import json
from contextlib import suppress
# get page
result = requests.get('https://twitter.com/ThePSF')
# Extract json from "window.__INITIAL_STATE__={....};
json_string = re.search(r"window.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text).group(1)
# convert text string to structured json data
twitter_json = json.loads(json_string)
# Save structured json data to a text file that may help
# you to orient yourself and possible pick some parts you
# are interested in (if there are any)
with open('twitter_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(twitter_json, indent=4, sort_keys=True))
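One caveat with the non-greedy regex above: it stops at the first "};" it meets, which can cut the JSON short if that sequence appears inside a string value. A possible alternative, sketched here under the assumption that the page still embeds valid JSON right after "window.__INITIAL_STATE__ =", is to let the json module find the end of the object itself. The function name extract_initial_state is hypothetical.

```python
import json
import re


def extract_initial_state(html):
    """Return the object assigned to window.__INITIAL_STATE__, or None."""
    match = re.search(r"window\.__INITIAL_STATE__\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so no closing pattern is needed.
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state
```

With the code above you would call extract_initial_state(result.text) instead of the re.search(...).group(1) plus json.loads pair.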
I've just tried the same, but with the requests module rather than requests_html. I could get the whole site content, but I would not call it "beautiful".
Also, I am now blocked from accessing the site without logging in.
Here is my small example.
Use the official Twitter API instead.
I also think I would probably be blocked after a few more runs of this script; I've only tried it twice.
import requests
import bs4
def example():
    result = requests.get("https://twitter.com/childrightscnct")
    soup = bs4.BeautifulSoup(result.text, "lxml")
    print(soup)

if __name__ == '__main__':
    example()
To select a single element with bs4, use
some_text = soup.select_one('locator').get_text()
I found one tool for scraping Twitter that has quite a lot of stars on GitHub (https://github.com/twintproject/twint). I have not tried it myself, and I hope it is legal.
What you're missing is the bearer and guest token needed to make your request. If I just hit your endpoint with curl and no headers I get no response. However, if I add headers for the bearer token and guest token then I get that json you're looking for:
curl 'https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D' -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' -H 'x-guest-token: 1452696114205847552'
You can get the bearer token (which may not expire that often) and the guest token (which does expire, I think) like this:
The HTML of the Twitter page you visit links to a file called main.<some random numbers>.js. That JavaScript file contains the bearer token. You can recognize it because it is a long string starting with lots of A's.
Take the bearer token and call https://api.twitter.com/1.1/guest/activate.json, using the bearer token as the authorization header:
curl 'https://api.twitter.com/1.1/guest/activate.json' -X POST -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'
In python this looks like:
import requests
import json
url = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D"
headers = {"authorization": "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA", "x-guest-token": "1452696114205847552"}
resp = requests.get(url, headers=headers)
j = json.loads(resp.text)
And now the variable j holds your beautiful JSON. One warning: sometimes the response can be so big that it doesn't seem to fit into a single response. If this happens, you'll notice that resp.text isn't valid JSON but just a portion of a big blob of JSON. To fix this, adapt the request to use stream=True and read the whole response before you try to parse it as JSON.
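The long percent-encoded URL is easier to work with if you build it from a plain dictionary, which also makes it simple to change the user ID or the cursor. Here is a rough sketch using only the standard library. The function names are my own, the flag values are copied from the URL in the question, and fetch_guest_token assumes the activate endpoint answers with JSON containing a "guest_token" key, which you should confirm in your own response.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request, urlopen

BASE_URL = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets"
ACTIVATE_URL = "https://api.twitter.com/1.1/guest/activate.json"


def build_user_tweets_url(user_id, count=20, cursor=None):
    """Build the UserTweets URL from plain Python values."""
    variables = {"userId": str(user_id), "count": count}
    if cursor is not None:
        variables["cursor"] = cursor
    # The remaining flags are copied from the URL in the question.
    variables.update({
        "withHighlightedLabel": True,
        "withTweetQuoteCount": True,
        "includePromotedContent": True,
        "withTweetResult": False,
        "withUserResults": False,
        "withVoice": False,
        "withNonLegacyCard": True,
    })
    compact = json.dumps(variables, separators=(",", ":"))
    return BASE_URL + "?" + urlencode({"variables": compact})


def decode_variables(url):
    """Inverse helper: read the 'variables' JSON back out of a URL."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["variables"][0])


def fetch_guest_token(bearer_token):
    """POST to the activate endpoint (needs network).

    Assumption: the reply is JSON with a "guest_token" key.
    """
    request = Request(
        ACTIVATE_URL,
        data=b"",
        method="POST",
        headers={"authorization": "Bearer " + bearer_token},
    )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["guest_token"]
```

The URL from build_user_tweets_url can be passed straight to requests.get together with the two headers shown earlier; decode_variables is handy for inspecting a URL copied from the network tab.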
I am using Python requests to get information from the mobile website of the German railway company (https://mobile.bahn.de/bin/mobil/query.exe/dox).
For instance:
import requests
query = {'S':'Stuttgart Hbf', 'Z':'München Hbf'}
rsp = requests.get('https://mobile.bahn.de/bin/mobil/query.exe/dox', params=query)
which in this case gives the correct page.
However, using the following query:
query = {'S':'Cottbus', 'Z':'München Hbf'}
It gives a different response, in which the user is asked to choose one of the given options (the server is unsure about the starting station, since there are many beginning with 'Cottbus').
Now, my question is: given this response, how can I choose one of the given options and then repeat the request without getting this error?
I tried to look at the cookies, to use a session instead of a simple get request. But nothing worked so far.
I hope you can help me.
Thanks.
You can use BeautifulSoup to parse the response and get the options if the response contains a select element:
import requests
from bs4 import BeautifulSoup
query = {'S': u'Cottbus', 'Z': u'München Hbf'}
rsp = requests.get('https://mobile.bahn.de/bin/mobil/query.exe/dox', params=query)
soup = BeautifulSoup(rsp.content, 'lxml')
# check whether the page contains a choice dropdown
if soup.find('select'):
    # Get a list of (value, text) tuples; you will need the values in the next POST request
    options_value = [(option['value'], option.text) for option in soup.find_all('option')]
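To actually repeat the request with one of the options, you also need the name of the select element, since that is the field the chosen value belongs to. The sketch below shows the idea with the standard library only. It assumes, without having checked the real page, that the server accepts the chosen value under the select's own name alongside the original parameters; parse_selects and follow_up_params are illustrative names.

```python
from html.parser import HTMLParser


class SelectOptionParser(HTMLParser):
    """Collect the options of every <select>, grouped by the select's name."""

    def __init__(self):
        super().__init__()
        self.selects = {}    # select name -> list of [value, text]
        self._select = None  # name of the select currently open
        self._option = None  # [value, text] of the option currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self._select = attrs.get("name") or ""
            self.selects[self._select] = []
        elif tag == "option" and self._select is not None:
            self._option = [attrs.get("value") or "", ""]
            self.selects[self._select].append(self._option)

    def handle_data(self, data):
        if self._option is not None:
            self._option[1] += data

    def handle_endtag(self, tag):
        if tag == "option":
            self._option = None
        elif tag == "select":
            self._select = None
            self._option = None


def parse_selects(html):
    """Return {select name: [(value, text), ...]} for the given HTML."""
    parser = SelectOptionParser()
    parser.feed(html)
    parser.close()
    return {
        name: [(value, text.strip()) for value, text in options]
        for name, options in parser.selects.items()
    }


def follow_up_params(params, selects, pick=0):
    """Copy the original params and add the picked option of each select."""
    updated = dict(params)
    for name, options in selects.items():
        if options:
            updated[name] = options[pick][0]
    return updated
```

You would feed it rsp.text, pass the result of follow_up_params to a second request, and check in the browser's network tab whether the site expects GET or POST for that step.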
I have a web form that I would like to be filled out via users through a python bot. Everything that I can find regarding this pulls all of the data from a pre-defined payload via request or mechanize, however my situation differs in that I'd like users to be able to trigger this with their own text (for example - .submit Ticket #1234 - Blah blah blah).
The page they are submitting to is a simple form - 1 text area and 1 submit button.
Could anyone shine some light on some tutorials or how I'd go about this?
Here's my attempt:
import re
import urllib.parse
import requests
from lxml import etree
@hook.command("addquote")
def addquote(text, bot):
    """<query> -- adds a quote to the qdb."""
    url = 'example.com/?add'
    values = {'addquote' : text}
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()
Thanks!
Here's an example using HTTP POST. If your web form uses PUT just change the method. It outputs the page that is returned (which probably isn't needed unless you want to check for success) and the HTTP status code and reason. This is Python 3, only small changes are needed for older versions:
import urllib.parse, urllib.request
url="http://hskhsk.pythonanywhere.com/cidian"
params={"q":"apple","secondparam":"ignored"}
encoded=urllib.parse.urlencode(params)
data=bytes(encoded, 'utf-8')
req = urllib.request.Request(url=url, data=data, method="POST")
with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))
    print(f.status)
    print(f.reason)
The web form that it calls is a simple English-Chinese dictionary that accepts HTTP GET or POST requests. The text to search for (apple) is the q parameter and the other parameter is ignored.
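To connect this with the bot from the question, the same POST can be built from whatever text the user typed after the command. Here is a small Python 3 sketch. The ".submit " prefix and the "addquote" field name come from the question, while parse_command and build_submit_request are made-up helper names; sending the request is left to urllib.request.urlopen as in the example above.

```python
import urllib.parse
import urllib.request


def parse_command(message, prefix=".submit "):
    """Return the text after the command prefix, or None if it is absent."""
    if not message.startswith(prefix):
        return None
    text = message[len(prefix):].strip()
    return text or None


def build_submit_request(url, field_name, user_text):
    """Build (but do not send) a form POST with a single text field."""
    body = urllib.parse.urlencode({field_name: user_text}).encode("utf-8")
    return urllib.request.Request(url=url, data=body, method="POST")
```

For a message such as ".submit Ticket #1234 - Blah blah blah", parse_command gives the ticket text, and build_submit_request(url, "addquote", text) returns a request you can hand to urlopen.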
I would recommend reading this book http://learnpythonthehardway.org/book/ex51.html
Here is a very simple example, using Grablib:
# we will use GrabLib (http://docs.grablib.org/en/latest/)
from grab import Grab
g = Grab()
g.go('someurl.com')
g.set_input('form name', 'text to be inputted')
g.submit()
That's all.
I have another question about posts.
This question should be almost identical to an existing Stack Overflow question, 'Using request.post to post multipart form data via python not working', but for some reason I can't get it to work. The website is http://www.camp.bicnirrh.res.in/predict/. I want to post a file that is already in FASTA format to this website and select the 'SVM' option using requests in Python. This is based on what @NorthCat gave me previously, which worked like a charm:
import requests
import urllib
file={'file':(open('Bishop/newdenovo2.txt','r').read())}
url = 'http://www.camp.bicnirrh.res.in/predict/hii.php'
payload = {"algo[]":"svm"}
raw = urllib.urlencode(payload)
response = session.post(url, files=file, data=payload)
print(response.text)
Since it's not working, I assumed the payload was the problem. I've been playing with the payload, but I can't get any of these to work.
payload = {'S1':str(data), 'filename':'', 'algo[]':'svm'} # where I tried just reading the file in, called 'data'
payload = {'svm':'svm'} # not actually in the headers, but I tried this too)
payload = {'S1': '', 'algo[]':'svm', 'B1': 'Submit'}
None of these payloads resulted in data.
Any help is appreciated. Thanks so much!
You need to set the file post variable name to "userfile", i.e.
file={'userfile':(open('Bishop/newdenovo2.txt','r').read())}
Note that the read() is unnecessary, but it doesn't prevent the file upload succeeding. Here is some code that should work for you:
import requests
session = requests.session()
response = session.post('http://www.camp.bicnirrh.res.in/predict/hii.php',
                        files={'userfile': ('fasta.txt', open('fasta.txt'), 'text/plain')},
                        data={'algo[]':'svm'})
response.text contains the HTML results. Save it to a file and view it in your browser, or parse it with something like Beautiful Soup and extract the results.
In the request I've specified a mime type of "text/plain" for the file. This is not necessary, but it serves as documentation and might help the receiving server.
The content of my fasta.txt file is:
>24.6jsd2.Tut
GGTGTTGATCATGGCTCAGGACAAACGCTGGCGGCGTGCTTAATACATGCAAGTCGAACGGGCTACCTTCGGGTAGCTAGTGGCGGACGGGTGAGTAACACGTAGGTTTTCTGCCCAATAGTGGGGAATAACAGCTCGAAAGAGTTGCTAATACCGCATAAGCTCTCTTGCGTGGGCAGGAGAGGAAACCCCAGGAGCAATTCTGGGGGCTATAGGAGGAGCCTGCGGCGGATTAGCTAGATGGTGGGGTAAAGGCCTACCATGGCGACGATCCGTAGCTGGTCTGAGAGGACGGCCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAAGGAATATTCCACAATGGCCGAAAGCGTGATGGAGCGAAACCGCGTGCGGGAGGAAGCCTTTCGGGGTGTAAACCGCTTTTAGGGGAGATGAAACGCCACCGTAAGGTGGCTAAGACAGTACCCCCTGAATAAGCATCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGATGCAAGCGTTGTCCGGATTTACTGGGCGTAAAGCGCGCGCAGGCGGCAGGTTAAGTAAGGTGTGAAATCTCCCTGCTCAACGGGGAGGGTGCACTCCAGACTGACCAGCTAGAGGACGGTAGAGGGTGGTGGAATTGCTGGTGTAGCGGTGAAATGCGTAGAGATCAGCAGGAACACCCGTGGCGAAGGCGGCCACCTGGGCCGTACCTGACGCTGAGGCGCGAAGGCTAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTAGCAGTAAACGATGTCCACTAGGTGTGGGGGGTTGTTGACCCCTTCCGTGCCGAAGCCAACGCATTAAGTGGACCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGCGGAGCGTGTGGTTTAATTCGATGCGACGCGAAGAACCTTACCTGGGCTTGACATGCTATCGCAACACCCTGAAAGGGGTGCCTCCTTCGGGACGGTAGCACAGATGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGTATATCTAAGGAGACTGCCGGAGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCAGCATGGCTCTTACGTCCAGGGCTACACATACGCTACAATGGCCGTTACAGTGAGATGCCACACCGCGAGGTGGAGCAGATCTCCAAAGGCGGCCTCAGTTCAGATTGCACTCTGCAACCCGAGTGCATGAAGTCGGAGTTGCTAGTAACCGCGTGTCAGCATAGCGCGGTGAATATGTTCCCGGGTCTTGTACACACCGCCCGTCACGTCATGGGAGCCGGCAACACTTCGAGTCCGTGAGCTAACCCCCCCTTTCGAGGGTGTGGGAGGCAGCGGCCGAGGGTGGGGCTGGTGACTGGGACGAAGTCGTAACAAGGT
I have trouble understanding the Wikipedia API.
I have isolated a link by processing the JSON that I got as a response after sending a request to http://en.wikipedia.org/w/api.php
Assuming that I have the following link, how do I get access to information like date of birth, etc.?
I'm using Python. I tried doing:
import urllib2,simplejson
search_req = urllib2.Request(direct_url_to_required_wikipedia_page)
response = urllib2.urlopen(search_req)
I have tried reading the API documentation, but I can't figure out how to extract data from specific pages.
Try:
import urllib
import urllib2
import simplejson
url = 'http://en.wikipedia.org/w/api.php'
values = {'action' : 'query',
          'prop' : 'revisions',
          'titles' : 'Jennifer_Aniston',
          'rvprop' : 'content',
          'format' : 'json'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
json = response.read()
The variable json now holds the API's JSON response for the Wikipedia page. You can now parse it with simplejson or whatever...
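For reference, a Python 3 version of the same query, plus the step of pulling the page source out of the response, could look roughly like this. It is only a sketch: build_query_url and extract_wikitext are names I made up, and it assumes the response uses the legacy layout in which each revision stores its text under the "*" key, so check that against what you actually get back.

```python
import json
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build the same revisions query as above as a GET URL."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "content",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_wikitext(response_text):
    """Return {page title: wikitext} from the API's JSON response.

    Assumption: revisions keep their content under the "*" key.
    """
    data = json.loads(response_text)
    texts = {}
    for page in data["query"]["pages"].values():
        revisions = page.get("revisions")
        if revisions:
            texts[page["title"]] = revisions[0]["*"]
    return texts
```

The URL can be fetched with urllib.request.urlopen or requests.get, and the response body handed to extract_wikitext.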
Go to MediaWiki API. It's better organized, and friendly for humans :-).
You won't get information like date of birth from the API, at least not directly. The best you can do is to get the code of the page (or rendered HTML) and parse that to get the information you need.
As an alternative, you might want to look at DBpedia.
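If you do go the route of parsing the page source, a regular expression over the infobox is often enough for a single field. The sketch below assumes the article writes the date with a "Birth date" or "Birth date and age" template in year|month|day order, which is common but not guaranteed; find_birth_date is a hypothetical helper and the sample text is invented.

```python
import re
from datetime import date

# Matches e.g. {{Birth date and age|1970|1|2}}, optionally with extra
# parameters such as mf=yes before the year.
BIRTH_DATE = re.compile(
    r"\{\{\s*birth date(?: and age)?\s*\|[^}]*?(\d{4})\|(\d{1,2})\|(\d{1,2})",
    re.IGNORECASE,
)


def find_birth_date(wikitext):
    """Return the first birth date found in the wikitext, or None."""
    match = BIRTH_DATE.search(wikitext)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


sample = (
    "{{Infobox person\n"
    "| name = Example Person\n"
    "| birth_date = {{Birth date and age|1970|1|2}}\n"
    "}}"
)
print(find_birth_date(sample))
```

Pages that write the date as free text will not match, so treat a None result as "not found" rather than "no birth date".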