This question already has answers here:
Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"
(8 answers)
Closed 8 years ago.
So, I created a Django website to web-scrape news webpages for articles.
Even though I use mechanize, they still tell me:
HTTP Error 403: request disallowed by robots.txt
I tried everything; look at my code (just the scraping part):
br = mechanize.Browser()
page = br.open(web)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
#BeautifulSoup
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)
I also tried using br.open() before set_handle_robots(False), etc. It didn't work either.
Any way to get through these sites?
You're calling br.set_handle_robots(False) after br.open().
It should be:
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(web)
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)
Related
I have used mechanize and successfully logged in through a site's user login page. Now I want to navigate to a specific page in the submenus. When I try this by opening the URL of that page after logging in, another login page comes up, one I do not have a username and password for. That login page does not normally show up when I am navigating the site in a web browser.
How can I do this?
import mechanize
import webbrowser
import cookielib
usern = '****'
passw = '****'
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
r = br.open("https://myunihub-1.mdx.ac.uk/cas-web/login?service=https%3A%2F%2Fmyunihub.mdx.ac.uk%2Fc%2Fportal%2Flogin")
br.select_form(nr=0)
br.form['username'] = usern
br.form['password'] = passw
br.set_cookiejar(cj)
br.submit()
url = "https://misis.mdx.ac.uk/mislve/bwskfshd.P_CrseSchd"
webbrowser.open_new(url)
Try using cookies and pretending to be an actual browser. Some sites don't allow automated scripts/robots to crawl them, but you can always tell them you are an actual browser.
import cookielib
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
And let's pretend we are not a robot but an actual browser:
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
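Putting these pieces together with the login code from the question, a minimal sketch might look like this. The CAS URL and form field names are taken from the question; whether the submenu page then opens without the second login prompt depends on the site keeping the session in cookies.
import mechanize
import cookielib

# placeholder credentials from the question -- fill in your own
usern = '****'
passw = '****'

br = mechanize.Browser()

# attach the cookie jar *before* opening any page so the session cookie
# set at login is kept for later requests
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# ignore robots.txt and send a browser-like User-Agent
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# log in through the CAS page from the question
br.open("https://myunihub-1.mdx.ac.uk/cas-web/login?service=https%3A%2F%2Fmyunihub.mdx.ac.uk%2Fc%2Fportal%2Flogin")
br.select_form(nr=0)
br.form['username'] = usern
br.form['password'] = passw
br.submit()  # note the parentheses -- br.submit alone does nothing

# reuse the same (cookie-carrying) browser object for the protected page
# instead of handing the URL to an external web browser
page = br.open("https://misis.mdx.ac.uk/mislve/bwskfshd.P_CrseSchd")
print page.read()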
I am having an issue with mechanize's timeout feature. On most pages it works perfectly: if the URL fails to load in a reasonable amount of time, it raises an error: urllib2.URLError: <urlopen error timed out>. However, on certain pages the timer does not work and the program becomes unresponsive, even to a keyboard interrupt. Here is an example page where that occurs:
import mechanize
url = 'https://web.archive.org/web/20141104183547/http://www.dallasnews.com/'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
html = br.open(url, timeout=0.01).read() #hangs on this page, timeout set extremely low to trigger timeout on all pages for debugging
First, does this script hang for other people for this particular URL? Second, what could be going wrong/how do I debug?
I don't know why that URL request hangs for mechanize, but with urllib2 the request comes back fine. Maybe they have some code that recognizes mechanize despite robots handling being turned off.
I think urllib2 should be a good solution for your situation:
import mechanize
import urllib2

url = 'https://web.archive.org/web/20141104183547/http://www.dallasnews.com/'

try:
    # try mechanize first
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    html = br.open(url).read()
except:
    # fall back to a plain urllib2 request with a browser-like User-Agent
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16'})
    con = urllib2.urlopen(req)
    html = con.read()

print html
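As a side note, if the real concern is a hanging request rather than robots.txt, urllib2.urlopen() also accepts a timeout argument, so the fallback can be given a hard limit. A sketch, using the same URL; the 10-second value is arbitrary:
import urllib2
import socket

url = 'https://web.archive.org/web/20141104183547/http://www.dallasnews.com/'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    # the timeout (in seconds) applies to the socket, so a stalled
    # connection raises an error instead of hanging forever
    con = urllib2.urlopen(req, timeout=10)
    html = con.read()
except (urllib2.URLError, socket.timeout) as e:
    html = None
    print 'request failed:', e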
Hi, I'm using Python mechanize to get data from webpages.
I'm trying to get the imgurl values from the Google image search page so I can download the search result images.
Here's my code.
I fill in the search form with 'dog' and submit (i.e. search for 'dog'):
import mechanize
import cookielib
import urllib2
import urllib
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time = 1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (x11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'), ('Accept', '*/*') ,('Accept-Language', 'ko-KR')]
br.open('http://www.google.com/imghp?hl=en')
br.select_form(nr=0)
br.form['q'] = 'dog'
a = br.submit()
searched_url = br.geturl()
file0 = open("1.html", "wb")
file0.write(a.read())
file0.close()
When I view the page source in the Chrome browser, there are 'imgurl' entries in it. But when I read the data with Python mechanize, there is no such thing.
Also, the size of 1.html (which I wrote out from Python) is much smaller than the HTML file downloaded from Chrome.
How can I get exactly the same HTML data as web browsers do, using Python?
Do I have to set the request headers to be the same as a web browser's?
Thanks.
I am trying to do some automation for the site http://www.beistle.com/, but to visit the search page I need a pid which is generated on the server.
I looked at the response the browser gets and tried to reproduce the same request with mechanize in Python.
import mechanize
import cookielib
import urllib2
import lxml.html
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(False)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'),('Referer', 'http://www.beistle.com/'), ('Connection', 'keep-alive'), ('Host','www.beistle.com'), ('Accept-Language', 'en-US,en;q=0.5'), ('Accept-Encoding', 'gzip, deflate'), ('Accept','text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8') ]
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
br.open("http://www.beistle.com/");
br.select_form(name='aspnetForm');
br.submit();
In the browser this returns a 302 with the correct Location header, which contains the pid. However, mechanize returns "Location: /Search.aspx", which can't be used for searching.
How can I get the input fields from HTML forms on other sites?
I want it to return a dictionary such as:
form = [{'name': 'somename', 'type': 'text', 'value': ''}, {'name': 'somename', 'type': 'submit', 'value': 'submit'}]
Sorry for my English.
You probably won't be able to retrieve form data entered by other users on other sites. If you wish to use a script to send data to a form, mechanize is one tool that makes this quite easy.
Yeah, mechanize is sweet!
import mechanize
# Browser
br = mechanize.Browser()
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# inspect all the form elements on http://stackoverflow.com
br.open('http://stackoverflow.com')
for form in br.forms():
    print form
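If you want something closer to the dictionary format from the question, you can walk each form's controls; mechanize exposes name, type and value on every control. A sketch (note that some control types, e.g. checkbox lists, return a list for value):
forms = []
for form in br.forms():
    for control in form.controls:
        # each control carries its name, type and current value
        forms.append({'name': control.name,
                      'type': control.type,
                      'value': control.value})
print forms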
Look at mechanize, lxml.html and BeautifulSoup.