I've built a simple Python web scraper that works as expected locally but does not work on AWS Lambda -- specifically and only for the website I would like to scrape. I've tested out just the scraping portion of the code and can confirm that it is a Cloudflare anti-bot issue.
I've combed through relevant SO and Medium articles and tried:
adding the appropriate headers
specifying user agent
using different libraries (urllib, cloudscraper, selenium) -- see the cloudscraper sketch after this list
using a virtual display (pyvirtualdisplay with xvfb), as described in this post: How to bypass Cloudflare bot protection in selenium
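For reference, the cloudscraper attempt looked roughly like this (a minimal sketch, assuming the cloudscraper package is bundled into the Lambda deployment package):
import cloudscraper

def lambda_handler(event, context):
    # cloudscraper mimics a browser and tries to pass Cloudflare's
    # JavaScript challenge transparently
    scraper = cloudscraper.create_scraper()
    resp = scraper.get('https://disboard.org/servers/tag/python/15')
    return {'status': resp.status_code, 'body': resp.text}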
Example code of the urllib version to illustrate the question:
import json
import urllib.request

def lambda_handler(event, context):
    url = 'https://disboard.org/servers/tag/python/15'
    headers = {'User-Agent': "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"}
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    return respData
The above code returns a 403 status + reCAPTCHA.
I understand that data center IP ranges are treated more strictly by anti-bot services than residential IPs -- is there any workaround for this?
Thank you in advance.
What is the proper way to Google something in Python 3? I have tried requests and urllib for a Google page. When I simply run res = requests.get("https://www.google.com/#q=" + query), that doesn't come back with the same HTML as when I inspect the Google page in Safari. The same happens with urllib. A similar thing happens when I use Bing. I am familiar with AJAX. However, it seems that it is now deprecated.
In Python, if you do not set the User-Agent header on your HTTP requests yourself, Python adds a default one (something like "Python-urllib/3.x"), which Google can detect and may block.
Try the following and see if it helps:
import urllib.request

yourUrl = "post it here"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(yourUrl, headers=headers)
page = urllib.request.urlopen(req)
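If you prefer the requests library (which the question already tried), a roughly equivalent sketch is below. Note that everything after '#' in a URL is a fragment and is never sent to the server, so "https://www.google.com/#q=" + query only fetches the Google home page; the search endpoint is /search?q=...
import requests

# Query the /search endpoint with a browser-like User-Agent;
# Google may still serve different HTML to non-browser clients.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
res = requests.get('https://www.google.com/search',
                   params={'q': 'your query here'},
                   headers=headers)
print(res.status_code)
html = res.text  # HTML of the results page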
I'm trying to put together a Python program that will periodically gather course data from a university website and save it to a CSV file for personal use. After some searching around I stumbled across mechanize. I'm trying to set it up so I can log into my account, but I've run into a stumbling block. The code I put together here is supposed to submit the form containing the login information. The website is https://idp-prod.cc.ucf.edu/idp/Authn/UserPassword. The response page is supposed to display a red error message when a form is submitted with the wrong login information. The response I keep getting lacks such an error message and I can't figure out what I'm doing wrong.
import mechanize
from bs4 import BeautifulSoup
import cookielib
import urllib2
# log in page url
url = "https://idp-prod.cc.ucf.edu/idp/Authn/UserPassword"
# create the browser object and give it fake headers
myBrowser = mechanize.Browser()
myBrowser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
myBrowser.set_handle_robots(False)
# have the browser open the site
myBrowser.open(url)
# handle the cookies
cj = cookielib.LWPCookieJar()
myBrowser.set_cookiejar(cj)
# select third form down
myBrowser.select_form(nr=2)
# fill out the form
myBrowser["j_username"] = "somestudentusername"
myBrowser["pwd"] = "somestudentpassword"
# submit the form and save the response
response = myBrowser.submit()
# upon the submission of incorrect login information an error message is displayed in red
soup = BeautifulSoup(response.read(), 'html.parser')
print (soup.find(color="red")) #find the error message in the response
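One way to narrow this down is to dump every form on the page with its control names before selecting one, so the form index (nr=2) and the field names ("j_username", "pwd") can be double-checked. A small debugging sketch using the same browser object:
# List each form's index, name, action, and controls for inspection
for i, form in enumerate(myBrowser.forms()):
    print(i, form.name, form.action)
    for control in form.controls:
        print("   ", control.type, control.name)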
I am trying to open the URL to parse for content using the following code, but I receive a 403 error when I try it through Python and not when using the same URL through a web browser. Any help to overcome this?
import urllib2
URL = 'http://www.google.com/search?q=something%20unusual'
response = urllib2.urlopen(URL)
Response from Py Interpreter: HTTPError: HTTP Error 403: Forbidden
Google is using User-Agent filtering to prevent bots from interacting with its search service. You can observe this by comparing these results with curl(1) and optionally using the -A flag to change the User-Agent string:
$ curl -I 'http://www.google.com/search?q=something%20unusual'
HTTP/1.1 403 Forbidden
...
$ curl -I 'http://www.google.com/search?q=something%20unusual' -A 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'
HTTP/1.1 200 OK
You should probably instead be using the Google Custom Search service to automate Google searches. Alternatively, you could set your own User-Agent header with the urllib2 library (instead of the default of something like "Python-urllib/2.6"), but this may contravene Google's terms of service.
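For completeness, a minimal sketch of the Custom Search JSON API using the requests library; the API key and search engine ID below are placeholders you would have to create in the Google developer console:
import requests

API_KEY = 'your-api-key'          # placeholder
SEARCH_ENGINE_ID = 'your-cx-id'   # placeholder

params = {'key': API_KEY, 'cx': SEARCH_ENGINE_ID, 'q': 'something unusual'}
resp = requests.get('https://www.googleapis.com/customsearch/v1', params=params)
resp.raise_for_status()

# Each result item has a title, link, and snippet
for item in resp.json().get('items', []):
    print(item['title'], item['link'])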
The User-Agent header is the one giving you problems. It seems the page forbids any request made from a non-browser client by checking the User-Agent header. The key is setting a User-Agent that simulates a browser in Python.
This worked for me:
In [1]: import urllib2
In [2]: URL = 'http://www.google.com/search?q=something%20unusual'
In [4]: opener = urllib2.build_opener()
In [5]: opener.addheaders = [('User-agent', 'Mozilla/5.0')]
In [6]: response = opener.open(URL)
In [7]: response
Out[7]: <addinfourl at 47799472 whose fp = <socket._fileobject object at 0x02D7F5B0>>
In [8]: response.read()
Hope this helps!
I am quite new to Python and I have been stuck for a couple of days now trying to send a cookie with urllib2. Basically, on the page I want to get, I see from Firebug that there is a "sent cookie" which looks like:
list_type=height
.. which basically arranges the list on the page in a certain order.
I would like to send the above cookie via urllib2, so that the rendered page reflects this setting - here is the code I am trying to write to make it work:
import urllib
import urllib2
import cookielib

class Networksx(object):
    def __init__(self):
        self.cj = cookielib.CookieJar()
        self.opener = urllib2.build_opener()  # socks handler omitted
        self.opener.addheaders = [
            ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'),
            ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'),
            ('Keep-Alive', '115'),
            ('Connection', 'keep-alive'),
            ('Cache-Control', 'max-age=0'),
            ('Referer', 'http://www.google.com'),
            ('Cookie', 'list_type=height'),
        ]
        urllib2.install_opener(self.opener)
        self.params = {'Set-Cookie': 'list_type=height'}
        self.encoded_params = urllib.urlencode(self.params)

    def fullinfo(self, url):
        return self.opener.open(url, self.encoded_params).read()
As you can see, I have tried a couple of things:
setting the parameter via a header
setting a cookie
However, these do not seem to render the page in the desired list order (height). I was wondering if someone could point me in the right direction as to how to send the cookie information with urllib2.
Thanks.
An easy way to generate a cookies.txt file is this Chrome extension: https://chrome.google.com/webstore/detail/cookietxt-export/lopabhfecdfhgogdbojmaicoicjekelh
import urllib2, cookielib
url = 'https://example.com/path/default.aspx'
txheaders = {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
cj = cookielib.LWPCookieJar()
# cj.load signature: filename=None, ignore_discard=False, ignore_expires=False
cj.load('/path/to/my/cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
req = urllib2.Request(url, None, txheaders)
handle = urllib2.urlopen(req)
[update]
Sorry, I was pasting from an old code snippet long forgotten. From the LWPCookieJar docstring:
The LWPCookieJar saves a sequence of "Set-Cookie3" lines. "Set-Cookie3" is the format used by the libwww-perl library, not known to be compatible with any browser, but which is easy to read and doesn't lose information about RFC 2965 cookies.
So it is not compatible with the cookies.txt generated by modern browsers. If you try to load such a file with it, you will get: LoadError: 'cookies.txt' does not look like a Set-Cookie3 (LWP) format file.
You can do as the OP and convert the file:
There is something wrong with the format of the output from the Chrome extension. I just googled the LWP problem and found code.activestate.com/recipes/302930-cookielib-example; the code spits out the cookie in LWP format, and then I follow your steps as is. - James W
You can also use this Firefox add-on, and then "Tools->Export cookies". Make sure the first line in the cookies.txt file is "# Netscape HTTP Cookie File" and use:
cj = cookielib.MozillaCookieJar('/path/to/my/cookies.txt')
cj.load()
You would be better off looking into the 'requests' module for Python, which makes HTTP much more approachable than the low-level urllib modules.
See
http://docs.python-requests.org/en/latest/user/quickstart/#cookies
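For the original list_type cookie, a minimal requests sketch would be (the URL is a placeholder, since the question does not name the site):
import requests

# requests builds the Cookie header from this dict for you
resp = requests.get('http://example.com/the-list-page',  # placeholder URL
                    cookies={'list_type': 'height'})
print(resp.status_code)
html = resp.text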