I'm new to this whole requests thing. I tried to get the response from a URL, and there were a few times when I was able to. However, after several tries and adding new lines of code, it stopped working.
This is the part of my code where I'm having trouble:
api_request = requests.get("https://www.airnowapi.org/aq/observation/zipCode/current/?format=application/json&zipCode=12345&distance=5&API_KEY=1234-1234-1234-1234")
api = json.loads(api_request.content)
city = api[0]["Reporting Area"]
It tells me that the variable city is not defined, so I assume the requests.get() part has been unsuccessful.
I read about other people who had the same problem, but theirs had something to do with headers (I tried copying their solutions, but they didn't seem to work). I also tried pasting the URL into the browser and it opened the JSON file correctly, so it's a Python problem.
Thanks!
I don't have enough reputation to comment yet, but wanted to warn you that you accidentally leaked your API key. If you don't want others to exploit it, try to keep it private as much as possible ;)
Try this -
api_request = requests.get("https://www.airnowapi.org/aq/observation/zipCode/current/?format=application/json&zipCode=89129&distance=5&API_KEY=1369737F-5361-4CCC-B7C1-F52625548A41")
api = api_request.json()
city = api[0]["ReportingArea"]
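If it stops working again, it is worth checking the HTTP status and the response body before indexing into it. Here is a minimal sketch, with the zip code and API key as placeholders:

import requests

# zipCode and API_KEY are placeholders; substitute your own values
url = ("https://www.airnowapi.org/aq/observation/zipCode/current/"
       "?format=application/json&zipCode=12345&distance=5&API_KEY=YOUR_API_KEY")

api_request = requests.get(url)
api_request.raise_for_status()   # raises HTTPError for 4xx/5xx responses

api = api_request.json()
if api:                          # the API can return an empty list if there is no data for the zip code
    city = api[0]["ReportingArea"]
    print(city)
else:
    print("Empty response - check the zip code and API key")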
I am not experienced in web development, and I am trying to use requests.get to fetch some authenticated data. So far the internet appears to tell me to just do it, and I think I am formatting it wrong, but I'm unsure how. After some trial and error, I was able to grab my cookie for the website. The following is a made-up version of what I grabbed, with similar formatting:
cookie = "s:abcDEfGHIJ12k34LMNopqRst5UvW-6xy.ZAbCd/eFGhi7j8KlmnoPqrstUvWXYZ90a1BCDE2fGH3"
Then, in Python, I am trying to send a request. The following is a bit more pseudo-code for what I am doing:
r = requests.get('https://www.website.com/api/getData', cookies={"connect.sid": cookie})
After all this, the site keeps sending me a 400 error. I'm wondering whether I am putting in the wrong cookie (or the wrong part of the cookie), or whether everything looks right and it is probably the site at fault.
I grabbed a Wireshark capture and found there were other fields in the cookie being sent that I had not filled out:
_ga
_gid
___gads
I filled those out with the relevant values, and it works.
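For reference, here is a minimal sketch of sending several cookies in one requests call (all values below are made-up placeholders):

import requests

# placeholder values; copy the real ones from your browser or the Wireshark capture
cookies = {
    "connect.sid": "s:abcDEfGHIJ12k34...",
    "_ga": "GA1.2.1111111111.2222222222",
    "_gid": "GA1.2.3333333333.4444444444",
    "___gads": "ID=aaaabbbbccccdddd:T=1111111111:S=ALNI_MA",
}

r = requests.get("https://www.website.com/api/getData", cookies=cookies)
print(r.status_code)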
I am trying to collect tweets and extract the text part for my project. I tried many ways, and most of them work just fine for me. Then I stumbled upon the nltk.twitter package and some code snippets that do the same work. The code is pretty clean and I want to use it. But even the simplest example gives me a 401 error, even though I have an account at the Twitter developers' site and have all four required keys.
from nltk.twitter import Twitter
tw = Twitter()
tw.tweets(keywords='love, hate', limit=10)
I took this example from http://www.nltk.org/howto/twitter.html#simple and tried every example that is given there. None of them works, and I cannot figure out why. Thank you for your help in advance.
There are a few things that may have caused this, but I bet it is a time issue: nltk is trying to use the streamer, and the clock of your computer/server is out of sync.
Also make sure you install nltk completely. Try
import nltk

# point the downloader at the nltk_data index and fetch the packages
dl = nltk.downloader.Downloader("http://nltk.github.com/nltk_data/")
dl.download()
Using nltk.twitter requires the path to your credentials.txt file in the TWITTER environment variable, and the data inside the text file has to be entered correctly.
For example:
app_key=YOUR CONSUMER KEY
app_secret=YOUR CONSUMER SECRET
oauth_token=YOUR ACCESS TOKEN
oauth_token_secret=YOUR ACCESS TOKEN SECRET
There should be no space before or after the '='. Also, don't put the keys in quotes like "YOUR CONSUMER KEY".
This solved my issue with 401.
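Putting the pieces together, here is a minimal sketch (assuming, as the nltk howto describes, that TWITTER points to the directory containing credentials.txt):

import os

# assumption: TWITTER names the directory that holds credentials.txt
os.environ["TWITTER"] = "/path/to/twitter-files"

from nltk.twitter import Twitter

tw = Twitter()
tw.tweets(keywords='love, hate', limit=10)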
This is my first post here. I have been learning Python from scratch, on my own, for 5 months, and I acquired most of my knowledge thanks to this forum. I am now able to create web bots that can easily scrape all types of data, especially from sports betting sites.
However, for this particular need, there is one site from which I cannot extract what I am looking for:
winamax
I would like to get the links of all football events (on the left side), for example:
"https://www.winamax.fr/paris-sportifs#!/match/prelive/7894014"
but when I look at the source code, or when I print my soup, I just get nothing.
url = "https://www.winamax.fr/paris-sportifs#!/sports"
urlRequest = requests.get(url, proxies=proxies, headers=headers)
#of course, proxies and headers are defined beforehand
soup = BeautifulSoup(urlRequest.content)
print(soup)
For every bookmaker I have dealt with so far, there was always either a simple HTML tree structure in which all items were easy to find, or a hidden JavaScript file, or a JSON link.
But for this one, even when trying to catch the flow with Firebug, I cannot find anything relevant.
Thanks in advance if someone has an idea on how to get that (I have considered using PhantomJS but haven't tried it yet).
EDIT:
@ssundarraj: Below are the headers, the same ones I have been using in all my projects, so not relevant in my opinion, but anyway, here they are:
import random

AgentsFile = 'UserAgents.txt'
lines = open(AgentsFile).read().splitlines()
myline = random.choice(lines)  # pick a random User-Agent string
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'gzip,deflate,sdch',
           'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
           'Referer': 'https://www.winamax.fr',
           'User-Agent': myline}
EDIT 2:
@Chris Lear wrote:
Using Firebug, in the Net panel, you can search through all the response bodies (there's a checkbox called "Response Bodies" that appears when you click the search box). That will show you that the data is being fetched as JSON. I'll leave you to try to make sense of it, but that might give you a start (searching for ids is probably best).
I checked the box you mentioned above, but with no effect :( With or without a filter, nothing is displayed in my Net panel, as you can see in the screenshot:
nothing caught
I used Firebug and found out the following: make a POST request to https://www.winamax.fr/betting/slider/slider.php with these parameters:
key=050e42fb0761c96526e8510eda89248f
lang=FR
I don't know whether the key changes, but for now it works.
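A minimal sketch of that request with requests (the key is copied from the capture above and may change; that the endpoint accepts form-encoded POST data is an assumption):

import requests

# key copied from the Firebug capture above; it may rotate over time
payload = {
    "key": "050e42fb0761c96526e8510eda89248f",
    "lang": "FR",
}

resp = requests.post("https://www.winamax.fr/betting/slider/slider.php", data=payload)
print(resp.status_code)
print(resp.text[:500])   # peek at the beginning of the response body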
I'm trying to read the following URL into Python:
http://www.google.com/trends/fetchComponent?q=nepal&cid=TIMESERIES_GRAPH_0&export=3
with the code:
import urllib2

trend_url = 'http://www.google.com/trends/fetchComponent?q=nepal&cid=TIMESERIES_GRAPH_0&export=3'
response = urllib2.urlopen(trend_url)
the_page = response.read()
The resulting value of the_page, for reasons that I don't understand, is an error page.
UPDATE: I think that the problem is related to some authentication issue: when I try to open the link in the browser's incognito window, it also returns an error page.
Use requests:
import requests
a = requests.get('http://www.google.com/trends/fetchComponent?q=nepal&cid=TIMESERIES_GRAPH_0&export=3')
a.text
u'// Data table response\ngoogle.visualization.Query.setResponse({"version":" ....
I tested your example and it works.
I know this is kind of late, but I think Google does that in order to protect their data. You would have to build a scraper that goes to the interface, enters the word you want, and lets the page generate the URL; that is not the same as requesting the generated URL directly.
I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En
So, two questions. In general, is there a way to tell whether a URL links to a PDF/DOC etc. file when it doesn't do so explicitly (e.g. www.domain.com/file.pdf)? And is there a way to get Python to grab that file?
Edit:
Thanks for the replies, several of which suggest downloading the file to see if it's of the correct type. The only problem is... I don't know how to do that (see question #2, above). urlretrieve(<above url>) gives only an HTML file with an href containing that same URL.
There's no way to tell from the URL what it's going to give you. Even if it ends in .pdf it could still give you HTML or anything it likes.
You could do a HEAD request and look at the content-type, which, if the server isn't lying to you, will tell you if it's a PDF.
Alternatively you can download it and then work out whether what you got is a PDF.
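A minimal sketch of such a HEAD request with urllib2 (Python 2, to match the question; it assumes the server answers HEAD requests at all):

import urllib2

class HeadRequest(urllib2.Request):
    # urllib2 has no built-in HEAD support, so override the method
    def get_method(self):
        return "HEAD"

url = "http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En"
response = urllib2.urlopen(HeadRequest(url))
content_type = response.info().get("Content-Type", "")
print content_type
if content_type.lower().startswith("application/pdf"):
    print "the URL points at a PDF"

Given the redirect behaviour described in the other answer, this particular server may not cooperate, but for well-behaved servers this avoids downloading the body.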
In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, python's urllib will automatically follow these redirects, so that you end up with the right file. (and - as others have already mentioned - you can check the response's mime-type header to see if it's a pdf).
However, the server in question is doing something strange here. You request the url, and it redirects you to another url. You request the other url, and it redirects you again... to the same url! And again... And again... At some point, urllib decides that this is enough already, and will stop following the redirect, to avoid getting caught in an endless loop.
So how come you are able to get the pdf when you use your browser? Because apparently, the server will only serve the pdf if you have cookies enabled. (why? you have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.
(check the urllib2 and cookielib modules to get support for cookies, this tutorial might help)
At least, that is what I think is causing the problem. I haven't actually tried doing it with cookies yet. It could also be that the server does not "want" to serve the pdf because it detects you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but that would be a strange way of doing it. So my guess is that it uses a "session cookie" somewhere, and if you haven't got one yet, it keeps trying to redirect.
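Here is a minimal sketch of a cookie-aware opener with urllib2 and cookielib (Python 2), under the assumption that the redirect loop really is caused by a missing session cookie:

import cookielib
import urllib2

# the CookieJar keeps whatever session cookie the server sets across redirects
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

url = "http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En"
response = opener.open(url)
print response.info().get("Content-Type", "")

data = response.read()
if data.startswith("%PDF"):      # PDF files begin with the "%PDF" magic bytes
    with open("document.pdf", "wb") as f:
        f.write(data)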
As has been said, there is no way to tell the content type from the URL alone. But if you don't mind fetching the headers for every URL, you can do this:
import urllib

obj = urllib.urlopen(URL)
headers = obj.info()
if headers['Content-Type'].find('pdf') != -1:
    # we have a pdf file, download the whole thing
    ...
This way you won't have to download each URL, just its headers. It's still not exactly saving network traffic, but you won't do better than that.
Also, you should compare against proper MIME types instead of my crude find('pdf').
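For example, a slightly stricter check might strip any parameters from the header and compare against the exact MIME type (a sketch, reusing the snippet above):

import urllib

obj = urllib.urlopen(URL)
# Content-Type may carry parameters, e.g. "application/pdf; charset=binary"
content_type = obj.info().get('Content-Type', '').split(';')[0].strip().lower()
if content_type == 'application/pdf':
    data = obj.read()   # now download the whole document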
No. It is impossible to tell what kind of resource is referenced by a URL just by looking at it. It is entirely up to the server to decide what it gives you when you request a certain URL.
Check the MIME type with the info() method of the object returned by urllib.urlopen(). This might not be 100% accurate; it really depends on what the site returns as a Content-Type header. If it's well behaved, it will return the proper MIME type.
A PDF should return application/pdf, but that may not be the case.
Otherwise you might just have to download it and try it.
You can't see it from the URL directly. You could try downloading only the headers of the HTTP response and looking at the Content-Type header. However, you have to trust the server on this: it could respond with a wrong Content-Type header that doesn't match the data provided in the body.
To detect the file type in Python 3.x (for example in a web app), given a URL to a file that may have no extension or a fake extension, you can use python-magic. Install it with
pip3 install python-magic
For Mac OS X, you should also install libmagic using
brew install libmagic
Code snippet
import magic
from urllib.request import Request, urlopen

url = "http://...url to the file ..."
request = Request(url)
response = urlopen(request)
# mime=True returns a MIME type such as "application/pdf"
# instead of a human-readable description; the first bytes are enough for sniffing
mime_type = magic.from_buffer(response.read(2048), mime=True)
print(mime_type)