I am using Python request library to scrape robots.txt data from a list of URLs:
for url in urls:
    url = urllib.parse.urljoin(url, "robots.txt")
    try:
        r = requests.get(url, headers=headers, allow_redirects=False)
        r.raise_for_status()
        extract_robots(r)
    except (exceptions.RequestException, exceptions.HTTPError, exceptions.Timeout) as err:
        handle_exception(err)
In my list of urls, I have this webpage: https://reward.ff.garena.com. When I request https://reward.ff.garena.com/robots.txt, I am directly redirected to https://reward.ff.garena.com/en, even though I specified allow_redirects=False in my request parameters to say that I don't want redirects.
How can I skip this kind of redirect and make sure my extract_robots(data) method is only ever called with domain/robots.txt data?
Do you know for sure that there is a robots.txt at that location?
I note that if I request https://reward.ff.garena.com/NOSUCHFILE.txt, I get the same result as for robots.txt.
Setting allow_redirects=False only stops requests from automatically following 302/Location: responses - i.e. it doesn't stop the server you're trying to access from returning a redirect as the response to the request you're making.
If you get this type of response, it probably indicates that the file you requested isn't available or that some other error is preventing you from accessing it. In the general case of file access a redirect might indicate a need for authentication, but for robots.txt that shouldn't be the problem - it's simplest to assume the robots.txt isn't there.
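One way to act on that in the loop from the question (a sketch, reusing the urls, headers, extract_robots and handle_exception names from above, and treating any redirect response as "no robots.txt here"):
import urllib.parse
import requests
from requests import exceptions

for url in urls:
    robots_url = urllib.parse.urljoin(url, "robots.txt")
    try:
        r = requests.get(robots_url, headers=headers, allow_redirects=False, timeout=10)
        if r.is_redirect:
            # The server answered with a 3xx instead of the file; skip it.
            continue
        r.raise_for_status()  # raises for 4xx/5xx only, not for redirect codes
        extract_robots(r)
    except exceptions.RequestException as err:
        handle_exception(err)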
I'm making a GET request with the requests module of Python. This request returns a 302 code and redirects to another URL; locally I am able to capture that new URL with:
r = requests.get(URL)
finalURL = r.url
But when this code is executed on Heroku, the redirection is not carried out and the original URL is returned to me.
I've tried everything, including forcing the redirection with this code:
r = requests.get(URL, allow_redirects=True)
I have also tried to pick up the url from the response headers, such as location or X-Originating-URL, but when the request is made from Heroku, the response does not return those values in the header.
Redirects are on by default. Try viewing the content of the response; there has to be something in it that is redirecting you. You could also inspect the traffic with a tool like Fiddler.
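If you want to see what is actually happening on Heroku (a sketch, reusing the URL name from the question), print the redirect chain and any Location header instead of relying on r.url alone:
import requests

r = requests.get(URL, allow_redirects=True, timeout=10)

# Each entry in r.history is an intermediate redirect response.
for hop in r.history:
    print(hop.status_code, hop.url, "->", hop.headers.get("Location"))

print("Final status:", r.status_code)
print("Final URL:", r.url)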
I'm trying to check multiple URLs against the Google Safe Browsing API, but it returns an empty response every time. I have been googling for quite a few hours with no results, and I don't need some overkill library for a simple POST request.
Edit: Using Python 3.5.2
import requests
import json
api_key = '123456'
url = "https://safebrowsing.googleapis.com/v4/threatMatches:find"
payload = {'client': {'clientId': "mycompany", 'clientVersion': "0.1"},
           'threatInfo': {'threatTypes': ["SOCIAL_ENGINEERING", "MALWARE"],
                          'platformTypes': ["ANY_PLATFORM"],
                          'threatEntryTypes': ["URL"],
                          'threatEntries': [{'url': "http://www.thetesturl.com"}]}}
params = {'key': api_key}
r = requests.post(url, params=params, json=payload)
# Print response
print(r)
print(r.json())
This is my code; it returns HTTP 200 OK, but the response is empty.
What am I doing wrong?
I have the feeling that the API is not working properly. It returns 200 with an empty result even for URLs that are marked as dangerous. For example, I checked one URL using Google's form and got the result "Some pages on this site are unsafe". But using the API, it returns 200 with an empty body... I believe it returns results only for specific pages. If only some pages are infected/dangerous, then you won't get any data for the main domain... Not very useful if you ask me, but hey, it's free.
It would be nice if someone from Google could confirm this and add it to the documentation.
A real malware test url would have been much appreciated also, so we can test with real data.
According to the Safe Browsing API documentation, if you receive an empty object it is because no match was found:
Note: If there are no matches (that is, if none of the URLs specified in the request are found on any of the lists specified in a request), the HTTP POST response simply returns an empty object in the response body.
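So an empty body is the "no match" case, not an error. A minimal sketch of handling it, reusing the url, params and payload from the question (the match fields follow the v4 response format described in the docs):
r = requests.post(url, params=params, json=payload)
r.raise_for_status()

# The v4 API returns an empty JSON object when none of the URLs are on a threat list.
matches = r.json().get('matches', [])
if matches:
    for match in matches:
        print(match['threatType'], match['threat']['url'])
else:
    print("No matches for the submitted URLs.")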
I have a huge list of URLs which redirect to different URLs.
I am supplying them in a for loop from a list and trying to print the redirected URLs.
The first redirected URL prints fine.
But from the second one onward, requests just stops giving me redirected URLs and just prints the given URL.
I tried implementing this with urllib, urllib2, and mechanize.
They give the first redirected URL fine, then throw an error at the second one and stop.
Can anyone please let me know why this is happening?
Below is the pseudo code/implementation:
for given_url in url_list:
    print("Given URL: " + given_url)
    s = requests.Session()
    r = s.get(given_url, allow_redirects=True)
    redirected_url = r.url
    print("Redirected URL: " + redirected_url)
Output:
Given URL: www.xyz.com
Redirected URL: www.123456789.com
Given URL: www.abc.com
Redirected URL: www.abc.com
Given URL: www.pqr.com
Redirected URL: www.pqr.com
Try a HEAD request; it won't follow redirects or download the entire body:
r = requests.head('http://www.google.com/')
print(r.headers['Location'])
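One caveat: the Location header is only present when the response actually is a redirect, so it's safer (a sketch, reusing the url_list from the question) to check before printing:
import requests

for given_url in url_list:
    r = requests.head(given_url, allow_redirects=False, timeout=10)
    # 3xx responses carry the target in the Location header; anything else has none.
    target = r.headers.get('Location')
    print(given_url, "->", target if target else "no redirect (status %d)" % r.status_code)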
There is nothing wrong with the code snippet you provided, but as you mentioned in the comments you are getting HTTP 400 and 401 responses. HTTP 401 means Unauthorized, which means the site is blocking you. HTTP 400 means Bad Request which typically means the site doesn't understand your request, but it can also be returned when you are being blocked, which I suspect is the case on those too.
When I run your code for the ABC website I get redirected properly, which leads me to believe they are blocking your IP address for sending too many requests in a short period of time and/or for having no User-Agent set.
Since you mentioned you can open the links correctly in a browser, you can try setting your User-Agent string to match that of a browser, however this is not guaranteed to work since it is one of many parameters a site may use to detect whether you are a bot or not.
For example:
headers = {'User-agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
I wish to make requests with the Python requests module. I have a large database of URLs I wish to download. The URLs are stored in the database in the form page.be/something/something.html.
I get a lot of ConnectionErrors. If I open the URL in my browser, the page exists.
My Code:
if not webpage.url.startswith('http://www.'):
    new_html = requests.get(webpage.url, verify=True, timeout=10).text
An example of a page I'm trying to download is carlier.be/categorie/jobs.html. This gives me a ConnectionError, logged as below:
Connection error, Webpage not available for
"carlier.be/categorie/jobs.html" with webpage_id "229998"
What seems to be the problem here? Why can't requests make the connection, while I can find the page in the browser?
The Requests library requires that you supply a scheme for it to connect with (the 'http://' part of the URL). Make sure that every URL has http:// or https:// in front of it. You may want a try/except block where you catch a requests.exceptions.MissingSchema and try again with "http://" prepended to the URL.
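A minimal sketch of that fallback, reusing webpage.url and the parameters from the question (fetch_html is just an illustrative name):
import requests

def fetch_html(raw_url):
    # Try the URL as stored; if the scheme is missing, retry with http:// prepended.
    try:
        return requests.get(raw_url, verify=True, timeout=10).text
    except requests.exceptions.MissingSchema:
        return requests.get('http://' + raw_url, verify=True, timeout=10).text

new_html = fetch_html(webpage.url)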
I'm writing a script to DL the entire collection of BBC podcasts from various show hosts. My script uses BS4, Mechanize, and wget.
I would like to know how I can test whether a request for a URL yields a response code of '404' from the server. I have written the function below:
def getResponseCode(br, url):
    print("Opening: " + url)
    try:
        response = br.open(url)
        print("Response code: " + str(response.code))
        return True
    except (mechanize.HTTPError, mechanize.URLError) as e:
        if isinstance(e, mechanize.HTTPError):
            print("Mechanize error: " + str(e.code))
        else:
            print("Mechanize error: " + str(e.reason.args))
        return False
I pass into it my Browser() object and a URL string. It returns either True or False depending on whether the response is a '404' or '200' (well actually, Mechanize throws an exception if it is anything other than a '200', hence the exception handling).
In main() I am basically looping over this function passing in a number of URLs from a list of URLs that I have scraped with BS4. When the function returns True I proceed to download the MP3 with wget.
However, my problem is:
The URLs are direct paths to the podcast MP3 files on the remote server, and I have noticed that when the URL is available, br.open(<URL>) will hang. I suspect this is because Mechanize is caching/downloading the actual data from the server. I do not want this because I merely want to return True if the response code is '200'. How can I skip the caching/downloading and just test the response code?
I have tried using br.open_novisit(url, data=None) however the hang still persists...
I don't think there's any good way to get Mechanize to do what you want. The whole point of Mechanize is that it's trying to simulate a browser visiting a URL, and a browser visiting a URL downloads the page. If you don't want to do that, don't use an API designed for that.
On top of that, whatever API you're using, by sending a GET request for the URL, you're asking the server to send you the entire response. Why do that just to hang up on it as soon as possible? Use a HEAD request to ask the server whether the URL is available. (Sometimes servers won't respond to HEAD even when they should, so you'll have to fall back to GET. But cross that bridge if you come to it.)
For example:
req = urllib.request.Request(url, method='HEAD')
resp = urllib.request.urlopen(req)
return 200 <= resp.code < 300
But this raises a question:
When the function returns True I proceed to download the MP3 with wget.
Why? Why not just use wget in the first place? If the URL is gettable, it will get the URL; if not, it will give you an error—just as easily as Mechanize will. And that avoids hitting each URL twice.
For that matter, why try to script wget, instead of using the built-in support in the stdlib or a third-party module like requests?
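For example, a minimal sketch of fetching one MP3 with requests instead of shelling out to wget (download_mp3 and dest_path are just illustrative names, not part of the original script):
import requests

def download_mp3(url, dest_path):
    # Stream the response so the whole file isn't held in memory at once.
    with requests.get(url, stream=True, timeout=30) as r:
        r.raise_for_status()
        with open(dest_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)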
If you're just looking for a way to parallelize things, that's easy to do in Python:
from concurrent import futures
import urllib.request

def is_good_url(url):
    req = urllib.request.Request(url, method='HEAD')
    resp = urllib.request.urlopen(req)
    return url, 200 <= resp.code < 300

with futures.ThreadPoolExecutor(max_workers=8) as executor:
    fs = [executor.submit(is_good_url, url) for url in urls]
    results = (f.result() for f in futures.as_completed(fs))
    good_urls = [url for (url, good) in results if good]
And to change this to actually download the valid URLs instead of just making a note of which ones are valid, just change the task function to something that fetches and saves the data from a GET instead of doing the HEAD thing. The ThreadPoolExecutor Example in the docs does almost exactly what you want.
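A sketch of what that task function could look like with the stdlib (the filename derivation here is just an assumption for illustration; adapt it to however you want to name your MP3s):
import os
import urllib.parse
import urllib.request

def download_url(url):
    # GET the body and save it under the last path component of the URL.
    filename = os.path.basename(urllib.parse.urlparse(url).path) or 'download.mp3'
    with urllib.request.urlopen(url) as resp, open(filename, 'wb') as f:
        f.write(resp.read())
    return url, filename

Then executor.submit(download_url, url) takes the place of is_good_url in the pool above.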