I am having an issue with mechanize's timeout feature. On most pages it works perfectly: if the URL fails to load in a reasonable amount of time, it raises urllib2.URLError: <urlopen error timed out>. However, on certain pages the timer does not work and the program becomes unresponsive, even to a keyboard interrupt. Here is an example page where that occurs:
import mechanize
url = 'https://web.archive.org/web/20141104183547/http://www.dallasnews.com/'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
html = br.open(url, timeout=0.01).read()  # hangs on this page; timeout set extremely low so it triggers on all pages for debugging
First, does this script hang for other people for this particular URL? Second, what could be going wrong/how do I debug?
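One thing worth noting while debugging: mechanize's timeout is applied per socket operation (connect, each read), not to the request as a whole, so a server that keeps the connection alive and trickles data can evade it. As a blunt workaround, a hard wall-clock deadline can be imposed with SIGALRM (Unix only; fetch_with_deadline and HardTimeout are made-up names for this sketch):

```python
import signal

class HardTimeout(Exception):
    pass

def _raise_timeout(signum, frame):
    raise HardTimeout("deadline exceeded")

def fetch_with_deadline(browser, url, seconds):
    # SIGALRM fires even while the process is stuck in a blocking socket
    # read, which is exactly where a per-operation timeout fails to help.
    old = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(seconds)
    try:
        return browser.open(url).read()
    finally:
        signal.alarm(0)                    # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)  # restore the previous handler
```

Whatever the root cause turns out to be, this at least turns a silent hang into a catchable exception.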
I don't know why that URL request hangs for mechanize, but with urllib2 the request comes back fine. Maybe they have some code that recognizes mechanize despite robots handling being set to false.
I think urllib2 should be a good solution for your situation.
import mechanize
import urllib2
url = 'https://web.archive.org/web/20141104183547/http://www.dallasnews.com/'
try:
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    html = br.open(url).read()
except Exception:
    # fall back to a plain urllib2 request if mechanize fails
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16'})
    con = urllib2.urlopen(req)
    html = con.read()
print html
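One caveat about the fallback above: the except branch drops the timeout entirely, so it can hang just like the mechanize call. urllib2.urlopen accepts a per-call timeout, and the socket module offers a process-wide default for code paths that don't expose one (the 10-second value here is an arbitrary example):

```python
import socket

try:
    from urllib.request import urlopen  # Python 3 name
except ImportError:
    from urllib2 import urlopen         # Python 2

# Per-call deadline: raises URLError on a slow connect or read.
# html = urlopen(req, timeout=10).read()

# Process-wide fallback for libraries that don't take a timeout argument:
socket.setdefaulttimeout(10)
```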
Related
I have used mechanize to successfully log in to a site's user login page. Now I want to navigate to a specific page in the submenus. When I try this by opening that page's URL after logging in, a different login page comes up, for which I do not have a username and password. This login page does not normally appear when I navigate the site in a web browser.
How can I do this?
import mechanize
import webbrowser
import cookielib

usern = '****'
passw = '****'

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)  # attach the cookie jar before the first br.open()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

r = br.open("https://myunihub-1.mdx.ac.uk/cas-web/login?service=https%3A%2F%2Fmyunihub.mdx.ac.uk%2Fc%2Fportal%2Flogin")
br.select_form(nr=0)
br.form['username'] = usern
br.form['password'] = passw
br.submit()  # was br.submit, which only references the method and never submits

url = "https://misis.mdx.ac.uk/mislve/bwskfshd.P_CrseSchd"
webbrowser.open_new(url)  # launches a separate browser that knows nothing about mechanize's session cookies
Try using cookies and pretending to be an actual browser. Some sites don't allow automated scripts/robots to crawl them, but you can always tell them: no, no, I'm an actual browser.
import cookielib
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
And let's pretend we are not a robot but an actual browser:
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
I am receiving the following response from the server:
ctrlDateTime%24txtSpecifyFromDate=05%2F02%2F2015&
ctrlDateTime%24rgApplicable=rdoApplicableFor&
ctrlDateTime%24txtSpecifyToDate=05%2F02%2F2015&
I am trying to set them with:
br["ctrlDateTime%24txtSpecifyFromDate"] = "05%2F02%2F2015"
br["ctrlDateTime%24rgApplicable"] = "rdoApplicableFor"
br["ctrlDateTime%24txtSpecifyToDate"] = "05%2F02%2F2015"
How can I fix the ControlNotFoundError? Here is my code:
import mechanize
import re

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0')]
response = br.open("http://marketinformation.natgrid.co.uk/gas/frmDataItemExplorer.aspx")
html = response.read()
br.select_form(nr=0)
br.set_all_readonly(False)
mnext = re.search("""<a id="lnkNext" href="javascript:__doPostBack('(.*?)','(.*?)')">XML""", html)
br["tvDataItem_ExpandState"] = "cccccccceennncennccccccccc"
br["tvDataItem_SelectedNode"] = ""
br["__EVENTTARGET"] = "lbtnCSVDaily"
br["__EVENTARGUMENT"] = ""
br["tvDataItem_PopulateLog"] = ""
br["__VIEWSTATE"] = "%2FwEP.....SNIP....%2F90SB9E%3D"
br["__VIEWSTATEGENERATOR"] = "B2D04314"
br["__EVENTVALIDATION"] = "%2FwEW...SNIP...uPceSw%3D%3D"
br["txtSearch"] = ""
br["tvDataItemn11CheckBox"] = "on"
br["tvDataItemn15CheckBox"] = "on"
br["ctrlDateTime%24txtSpecifyFromDate"] = "05%2F02%2F2015"
br["ctrlDateTime%24rgApplicable"] = "rdoApplicableFor"
br["ctrlDateTime%24txtSpecifyToDate"] = "05%2F02%2F2015"
br["btnViewData"] = "View+Data+for+Data+Items"
br["hdnIsAddToList"] = ""
response = br.submit()
print(response.read())
Thanks in advance.
P.
This was solved in two steps: 1) I replaced %24 with '$' in the control names; 2) some of the parameters had to be passed as a plain value and some as a one-element list like ['',].
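To spell out step 1: the names copied from the POST body are URL-encoded, while mechanize wants the decoded control names and values. A small sketch of the decoding (unquote lives in a different module on Python 3):

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote        # Python 2

pair = 'ctrlDateTime%24txtSpecifyFromDate=05%2F02%2F2015'
name, value = pair.split('=', 1)
print(unquote(name))   # ctrlDateTime$txtSpecifyFromDate
print(unquote(value))  # 05/02/2015
```

So the assignment becomes br['ctrlDateTime$txtSpecifyFromDate'] = '05/02/2015'.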
Hi, I'm using python mechanize to get data from webpages.
I'm trying to get the imgurl values from the Google image search page so I can download the search-result images.
Here's my code.
I fill the search form with 'dog' and submit (i.e., search for 'dog').
import mechanize
import cookielib
import urllib2
import urllib
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time = 1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'), ('Accept', '*/*') ,('Accept-Language', 'ko-KR')]
br.open('http://www.google.com/imghp?hl=en')
br.select_form(nr=0)
br.form['q'] = 'dog'
a = br.submit()
searched_url = br.geturl()
file0 = open("1.html", "wb")
file0.write(a.read())
file0.close()
When I view the page source in the Chrome browser, there are 'imgurl' entries in it. But when I read the data from
python mechanize, there is no such thing.
Also, the size of 1.html (which I wrote from python) is much smaller than the HTML file downloaded from Chrome.
How can I get exactly the same HTML data as a web browser by using python?
Do I have to set the request headers to be the same as a web browser's?
thanks
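On the header question: you can inspect exactly what a Request would send before opening it, which makes it easy to compare against the headers Chrome shows in its network tab (the User-Agent string below is just an example). Note, though, that the missing imgurl entries are most likely built by JavaScript in the browser, which no header change will recover:

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request         # Python 2

req = Request('http://www.google.com/imghp?hl=en')
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux i686; rv:34.0) Gecko/20100101 Firefox/34.0')
req.add_header('Accept-Language', 'ko-KR')

# add_header stores keys in capitalized form, e.g. 'User-agent'
print(req.get_header('User-agent'))
```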
This question already has answers here:
Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"
So, I created a Django website to web-scrape news webpages for articles.
Even though I use mechanize, they are still telling me:
HTTP Error 403: request disallowed by robots.txt
I tried everything; look at my code (just the scraping part):
br = mechanize.Browser()
page = br.open(web)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
#BeautifulSoup
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)
I also tried using br.open before set_handle_robots(False), etc. It didn't work either.
Is there any way to get through these sites?
You're calling br.set_handle_robots(False) after br.open(), so robots.txt has already been fetched and enforced for that request.
It should be:
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(web)
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)
I'm trying to use python mechanize to retrieve the list of apps on iTunes Connect. Once this list is retrieved, further work will be done with those links.
Logging in succeeds, but when I follow the "Manage Your Applications" link I get redirected back to the login page, as if the session were lost.
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text
filename = 'itunes.html'
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('https://itunesconnect.apple.com/WebObjects/iTunesConnect.woa')
br.select_form(name='appleConnectForm')
br.form['theAccountName'] = username
br.form['theAccountPW'] = password
br.submit()
apps_link = br.find_link(text='Manage Your Applications')
print "Manage Your Apps link = ", apps_link
req = br.follow_link(text='Manage Your Applications')
for app_link in br.links():
print "link is ", app_link
Any ideas what could be wrong?
You need to save/load the cookie jar.
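A minimal sketch of what that looks like with an LWPCookieJar (the filename is arbitrary and the login steps are elided):

```python
try:
    from http.cookiejar import LWPCookieJar  # Python 3
except ImportError:
    from cookielib import LWPCookieJar       # Python 2

jar = LWPCookieJar('itunes_cookies.txt')
# ... attach with br.set_cookiejar(jar), fill in the form, br.submit() ...

# Session cookies are skipped by default, so force them to be written:
jar.save(ignore_discard=True, ignore_expires=True)

# In a later run, restore them before opening the protected page:
jar.load(ignore_discard=True, ignore_expires=True)
```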
Figured this out after further investigation. This was due to a known bug in cookielib, documented here: http://bugs.python.org/issue3924
Basically, some sites (notably itunesconnect) set the cookie version as a string, not an int, which causes an error in cookielib since it does not handle that condition. The fix at the bottom of that issue thread worked for me.