I'm getting a 406 error with Mechanize when trying to open a URL:
for url in urls:
    if "http://" not in url:
        url = "http://" + url
    print url
    try:
        page = mech.open("%s" % url)
    except urllib2.HTTPError, e:
        print "there was an error opening the URL, logging it"
        print e.code
        logfile = open("log/urlopenlog.txt", "a")
        logfile.write(url + "," + "couldn't open this page" + "\n")
        continue
    else:
        print "opening this URL..."
        page = mech.open(url)
Any idea what would cause a 406 error to occur? If I go to the URL in question I can open it in the browser.
Try adding headers to your request based on what your browser sends; start with an Accept header (406 Not Acceptable normally means the server cannot produce a response matching what your Accept header says you will accept).
See "Adding headers" in the documentation:
req = mechanize.Request(url)
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
page = mechanize.urlopen(req)
The Accept header value there is based on the header sent by Chrome.
If you want to find out which headers your browser sends, this webpage shows them to you: https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending
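If you are driving a mechanize.Browser (as mech in the question appears to be), you can also set the headers once on the browser object instead of building individual Request objects. A minimal sketch, with header values borrowed from Chrome and a placeholder URL:

import mechanize

br = mechanize.Browser()
# Replace mechanize's default headers with browser-like ones
br.addheaders = [
    ('User-Agent', 'Mozilla/5.0'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]
page = br.open("http://example.com")  # substitute one of your URLs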
The 'Accept' and 'User-Agent' headers should be enough. This is what I did to get rid of the error:
#establish counter
j = 0

#Create headers for webpage
headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

#Create for loop to get through list of URLs
for url in URLs:
    #Send a browser-like agent so that web security systems don't block the scrape, with j as a counter
    req = mechanize.Request(URLs[j], headers=headers)
    #Open the url
    page = mechanize.urlopen(req)
    #increase counter
    j += 1
You could also try the "urllib2" or "urllib" libraries to open these URLs. The syntax is similar.
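For example, a rough urllib2 equivalent using the same headers dict defined above (url here stands for one of your URLs):

import urllib2

req = urllib2.Request(url, headers=headers)  # headers is the dict from the snippet above
page = urllib2.urlopen(req)
html = page.read()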
Related
I get data from an API in Django.
The data comes from an order form on another website.
The data also includes a URL, for example example.com, but I can't validate the input because I don't have access to the order form.
The URLs I receive can also take different forms. More examples:
example.de
http://example.de
www.example.com
https://example.de
http://www.example.de
https://www.example.de
Now I would like to open the URL in order to get its correct form.
For example, if I open example.com in my browser, I get the correct URL http://example.com/, and that is what I want for all URLs.
How can I do that quickly in Python?
If you get status code 200, you know that you have a valid address.
Regarding https://: you will get an SSL error unless you follow the answers in this guide. Once you have that in place, the program will find the correct URL for you.
import requests
import traceback

validProtocols = ["https://www.", "http://www.", "https://", "http://"]

def removeAnyProtocol(url):
    url = url.replace("www.", "")  # remove any bare "www." since we aren't planning on using it regardless
    for protocol in validProtocols:
        url = url.replace(protocol, "")
    return url

def validateUrl(url):
    for protocol in validProtocols:
        if protocol not in url:
            pUrl = protocol + removeAnyProtocol(url)
            try:
                req = requests.head(pUrl, allow_redirects=True)
                if req.status_code == 200:
                    return pUrl
                else:
                    continue
            except Exception:
                print(traceback.format_exc())
                continue
        else:
            try:
                req = requests.head(url, allow_redirects=True)
                if req.status_code == 200:
                    return url
            except Exception:
                print(traceback.format_exc())
                continue
Usage:
correctUrl = validateUrl("google.com") # https://www.google.com
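If what you ultimately want is the canonical URL the browser ends up on (e.g. http://example.com/ for example.com), you can also let requests follow the redirects and read the final URL back from the response. A minimal sketch along the same lines (the helper name and prefix choices are assumptions):

import requests

def resolveUrl(url, timeout=5):
    """Return the final URL after redirects, or None if nothing answers with 200."""
    if url.startswith(("http://", "https://")):
        candidates = [url]
    else:
        candidates = ["https://" + url, "http://" + url]
    for candidate in candidates:
        try:
            resp = requests.head(candidate, allow_redirects=True, timeout=timeout)
            if resp.status_code == 200:
                return resp.url  # the URL requests ended up at after redirects
        except requests.RequestException:
            continue
    return None

print(resolveUrl("example.com"))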
I have gone through the relevant questions, and I did not find an answer to this one:
I want to open a URL and parse its contents.
When I do that on, say, google.com, no problem.
When I do it on a URL that does not have a file name, I often get told that I read an empty string.
See the code below as an example:
import urllib.request

#urls = ["http://www.google.com", "http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
#urls = ["http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
urls = ["http://www.whoscored.com/LiveScores"]

print("Type of urls: {0}.".format(str(type(urls))))

for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
    sock = urllib.request.urlopen(url)
    print("I have this sock: {0}.".format(sock))
    htmlSource = sock.read()
    print("I read the source code...")
    htmlSourceLine = sock.readlines()
    sock.close()
    htmlSourceString = str(htmlSource)
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\\r\\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))
I do not know why I sometimes get an empty string back for URLs that don't have a file name, such as "www.whoscored.com/LiveScores" in the example, whereas "google.com" or "www.whoscored.com" seem to work all the time.
I hope that my formulation is understandable...
It looks like the site is coded to explicitly reject requests from non-browser clients. You'll have to spoof a browser and maintain a session, making sure that cookies are passed back and forth as required. The third-party requests library can help you with these tasks, but the bottom line is that you are going to have to find out more about how that site operates.
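A minimal sketch of that approach with requests, assuming a Session plus browser-like headers is enough for this particular site (the header values are an assumption):

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
})

# The Session object stores any cookies the site sets and sends them back on later requests
response = session.get("http://www.whoscored.com/LiveScores")
print(response.status_code, len(response.content))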
Your code worked intermittently for me, but using requests and sending a browser User-Agent worked perfectly:
import requests

headers = {
    'User-agent': 'Mozilla/5.0,(X11; U; Linux i686; en-GB; rv:1.9.0.1): Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'}

urls = ["http://www.whoscored.com/LiveScores"]

print("Type of urls: {0}.".format(str(type(urls))))

for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
    sock = requests.get(url, headers=headers)
    print("I have this sock: {0}.".format(sock))
    htmlSource = sock.content
    print("I read the source code...")
    htmlSourceString = str(htmlSource)
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\\r\\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))
I'm going through a set of pages and I'm not certain how many there are, but the current page is represented by a simple number present in the url (e.g. "http://www.website.com/page/1")
I would like to use a for loop in scrapy to increment the current guess at the page and stop when it reaches a 404. I know the response that is returned from the request contains this information, but I'm not sure how to automatically get a response from a request.
Any ideas on how to do this?
Currently my code is something along the lines of:
def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while stillExists:
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404:  # This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False
You can do something like this:
from __future__ import print_function
import urllib2

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    try:
        req = urllib2.Request(fullURL)
        resp = urllib2.urlopen(req)
        # Do your normal stuff here if the page is found.
        print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
    except urllib2.HTTPError as e:
        if e.code == 404:
            # Do whatever you want if a 404 is found
            print("404 Found!")
        else:
            print("HTTP error {0} for URL: {1}".format(e.code, fullURL))
    except urllib2.URLError:
        print("Could not connect to URL: {0}".format(fullURL))
This iterates through the range and attempts to connect to each URL via urllib2. I don't know Scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2. Note that urllib2.urlopen raises an HTTPError for a 404, which is why the 404 case is handled in the except block above.
Note that most sites using this type of URL format are running a CMS that redirects non-existent pages to a custom "404 - Not Found" page, which may still be served with an HTTP status code of 200. In that case, the best way to detect a page that loads but is really just the custom 404 page is to do some screen scraping: look for text that would not appear on a "normal" page, such as "Page not found" or something similar that is unique to the custom 404 page.
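A rough sketch of that kind of check, reusing resp from the loop above; the marker phrases are assumptions and should be replaced with distinctive text copied from the site's real error page:

def looks_like_soft_404(html):
    # Marker phrases are guesses; copy text from the site's actual error page
    markers = ["page not found", "the page you requested does not exist"]
    lowered = html.lower()
    return any(marker in lowered for marker in markers)

html = resp.read()
if looks_like_soft_404(html):
    print("Custom 404 page returned with HTTP status 200")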
You need to yield/return the request in order to check the status; creating a Request object does not actually send it.
class MySpider(BaseSpider):
    name = 'website.com'
    baseUrl = "http://website.com/page/"

    def start_requests(self):
        yield Request(self.baseUrl + '0')

    def parse(self, response):
        if response.status != 404:
            page = response.meta.get('page', 0) + 1
            return Request('%s%s' % (self.baseUrl, page), meta=dict(page=page))
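One caveat: by default Scrapy's HttpError middleware filters out non-2xx responses before they reach your callback, so for the 404 check above to ever fire you will probably need to allow that status explicitly, roughly like this:

class MySpider(BaseSpider):
    name = 'website.com'
    baseUrl = "http://website.com/page/"
    handle_httpstatus_list = [404]  # let 404 responses reach the parse() callback
    # start_requests() and parse() as above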
Please tell me why these two similar pieces of code get different results.
The first one (yandex.ru) gets the requested page, while the other one (moyareklama.ru) gets the site's main page.
import urllib
base = "http://www.moyareklama.ru/single_ad_new.php?"
data = {"id":"201623465"}
url = base + urllib.urlencode(data)
print url
page = urllib.urlopen(url).read()
f = open ("1.html", "w")
f.write(page)
f.close()
print page
##base = "http://yandex.ru/yandsearch?"
##data = (("text","python"),("lr","192"))
##url = base + urllib.urlencode(data)
##print url
##page = urllib.urlopen(url).read()
##f = open ("1.html", "w")
##f.write(page)
##f.close()
##print page
Most likely the reason you get something different with urllib.urlopen than in your browser is that your browser can be redirected by JavaScript and meta refresh tags as well as by standard HTTP 301/302 responses. I'm pretty sure the urllib module will only follow HTTP 301/302 redirects.
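One way to see what is going on is to compare the URL you requested with the one urllib actually ends up at (which reflects only HTTP-level redirects), and to grep the returned HTML for a meta refresh that a browser would follow. A rough sketch, reusing the url from the snippet above:

import re
import urllib

resp = urllib.urlopen(url)
print "Requested:   " + url
print "Ended up at: " + resp.geturl()   # changes only on HTTP 301/302 redirects

html = resp.read()
# A browser would follow this; urllib will not
match = re.search(r'<meta[^>]+http-equiv=["\']?refresh["\']?', html, re.I)
if match:
    print "Page contains a meta refresh: " + match.group(0)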
I ran into a problem which I cannot solve. I am able to successfully get the cookies and use them after logging in to a web application. The problem is that the web application sets new cookies after a couple of clicks, and I need those as well.
How do I get hold of these additional cookies after the login? Here is my code so far:
def _login_to_page(self, url):
    cj = cookielib.CookieJar()
    cookiehandler = urllib2.HTTPCookieProcessor(cj)
    proxy_support = urllib2.ProxyHandler({"https": self._proxy})
    opener = urllib2.build_opener(cookiehandler, proxy_support)
    try:
        login_post_data = {'op': 'login', 'user': self._username, 'passwd': self._password, 'api_type': 'json'}
        response = opener.open(str(self._path_to_login_url), urllib.urlencode(login_post_data), self._request_timeout).read()
        if response:
            print "[+] Login successful"
            self._login_cookies = cj
        else:
            print "[-] Login has probably failed. Wrong Credentials?"
    except Exception, e:
        print "[-] Could not log in: " + repr(e)

def get_url_loggedin(self, url):
    # the variable self._login_cookies holds the cookies from the previous login
    cookiehandler = urllib2.HTTPCookieProcessor(self._login_cookies)
    proxy_support = urllib2.ProxyHandler({"http": self._proxy})
    opener = urllib2.build_opener(cookiehandler, proxy_support)
    urllib2.install_opener(opener)
    try:
        url_response = opener.open(url, None, self._request_timeout).read()
    except Exception, e:
        print "[-] Could not read page: "
        print "[??] Error: " + repr(e)
Sorry if my English is a bit weird; I'm not a native speaker.
After the application has set the cookies you want, save the current set of cookies to a file and load them again at application start. Note that the plain cookielib.CookieJar does not implement save() or load(); use a FileCookieJar subclass such as cookielib.LWPCookieJar or cookielib.MozillaCookieJar, then call cj.save('cookies.txt') to write the cookies out and cj.load('cookies.txt') to restore them.
See the cookielib documentation
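A minimal sketch of the save/load round trip using cookielib.LWPCookieJar (ignore_discard keeps session cookies that would otherwise be thrown away):

import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# ... log in and click around with opener.open(...) so the site sets all its cookies ...

cj.save('cookies.txt', ignore_discard=True, ignore_expires=True)

# Later, at application start:
cj = cookielib.LWPCookieJar()
cj.load('cookies.txt', ignore_discard=True, ignore_expires=True)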