I am trying to open a URL and parse its content using the code below, but I receive a 403 error when I make the request from Python, while the same URL works fine in a web browser. Any suggestions for getting around this?
import urllib2
URL = 'http://www.google.com/search?q=something%20unusual'
response = urllib2.urlopen(URL)
Response from Py Interpreter: HTTPError: HTTP Error 403: Forbidden
Google is using User-Agent filtering to prevent bots from interacting with its search service. You can observe this with curl(1) by comparing the results with and without the -A flag, which overrides the User-Agent string:
$ curl -I 'http://www.google.com/search?q=something%20unusual'
HTTP/1.1 403 Forbidden
...
$ curl -I 'http://www.google.com/search?q=something%20unusual' -A 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'
HTTP/1.1 200 OK
You should probably instead be using the Google Custom Search service to automate Google searches. Alternatively, you could set your own User-Agent header with the urllib2 library (instead of the default of something like "Python-urllib/2.6"), but this may contravene Google's terms of service.
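For example, a minimal sketch of the second option, setting your own User-Agent with a urllib2.Request (the header value is just a sample browser string):
import urllib2
URL = 'http://www.google.com/search?q=something%20unusual'
# Pretend to be a regular browser instead of the default "Python-urllib/2.x"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'}
request = urllib2.Request(URL, headers=headers)
response = urllib2.urlopen(request)
print response.read()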
The User-Agent header is the one giving you the problem. It seems the site forbids any request that does not come from a browser, which it detects by checking the User-Agent header. The key is to set a User-Agent in Python that simulates a browser.
This worked for me:
In [1]: import urllib2
In [2]: URL = 'http://www.google.com/search?q=something%20unusual'
In [4]: opener = urllib2.build_opener()
In [5]: opener.addheaders = [('User-agent', 'Mozilla/5.0')]
In [6]: response = opener.open(URL)
In [7]: response
Out[7]: <addinfourl at 47799472 whose fp = <socket._fileobject object at 0x02D7F5B0>>
In [8]: response.read()
Hope this helps!
Related
I've built a simple Python web scraper that works as expected locally but does not work on AWS Lambda -- specifically and only for the website I would like to scrape. I've tested out just the scraping portion of the code and can confirm that it is a Cloudflare anti-bot issue.
I've combed through relevant SO and Medium articles and tried:
adding the appropriate headers
specifying user agent
using different libraries (urllib, cloudscraper, selenium)
using a virtual display (pyvirtualdisplay with xvfb) as suggested in this post: How to bypass Cloudflare bot protection in selenium
Example code of the urllib version to illustrate the question:
import json
import urllib.request

def lambda_handler(event, context):
    url = 'https://disboard.org/servers/tag/python/15'
    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    return respData
The above code returns a 403 status + reCAPTCHA.
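For reference, a rough sketch of the cloudscraper variant mentioned in the list above (not the exact code I ran):
# Hypothetical sketch of the cloudscraper attempt; cloudscraper wraps
# requests and tries to solve Cloudflare's JavaScript challenge.
import cloudscraper

scraper = cloudscraper.create_scraper()
resp = scraper.get('https://disboard.org/servers/tag/python/15')
print(resp.status_code)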
I understand that data center IP ranges get handled more carefully by antispam than residential IPs -- is there any workaround for this?
Thank you in advance.
I am trying to download a file using the Python requests module. My code works for some URLs/hosts, but I've come across one that does not work.
Based on other similar questions, it may be related to the User-Agent request header. I have tried to remedy this by adding the Chrome user agent, but the connection still times out for this particular URL (it does work for others).
I have tested opening the URL in the Chrome browser (which works fine) and inspecting the request headers, but I still can't figure out why my code is failing:
import requests
url = 'http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-2020-03.csv'
headers = {'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
session = requests.Session()
session.headers.update(headers)
response = session.get(url, stream=True)
# !!! code fails here for this particular url !!!
with open('test.csv', "wb") as fh:
    for x in response.iter_content(chunk_size=1024):
        if x: fh.write(x)
Update 2020-08-14
I have figured out what was wrong: in the instances where the code was working, the URLs used the HTTPS protocol. This URL uses plain HTTP, and my proxy settings were configured only for HTTPS, not HTTP. After providing an HTTP proxy to requests (see the sketch below), my code did work as written.
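For anyone hitting the same thing, a minimal sketch of the fix; the proxy address below is just a placeholder for your own proxy:
import requests

# Hypothetical proxy address; replace with your own HTTP proxy
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
}

session = requests.Session()
session.proxies.update(proxies)
response = session.get('http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-2020-03.csv', stream=True)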
The code you posted worked for me; it saved the file (129007 lines). It could be that the host is rate-limiting you, so try again later to see if it works.
# count lines
$ wc -l test.csv
129007 test.csv
# inspect headers
$ head -n 4 test.csv
Date,Region_Name,Area_Code,Index
1968-04-01,Wales,W92000004,2.11932727
1968-04-01,Scotland,S92000003,2.108087275
1968-04-01,Northern Ireland,N92000001,3.300419757
You can disable requests' timeouts by passing timeout=None. Here is the official documentation: https://requests.readthedocs.io/en/master/user/advanced/#timeouts
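A small sketch of both options, reusing the session and url from the question's code:
# Wait at most 10 seconds for the server to respond...
response = session.get(url, stream=True, timeout=10)
# ...or wait indefinitely (timeout=None, which is also requests' default).
response = session.get(url, stream=True, timeout=None)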
I want to get data from this site.
When I get data from the main URL, I get an HTML file that contains the structure but not the values.
import requests
from bs4 import BeautifulSoup
url ='http://option.ime.co.ir/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I found out that the site gets its values from
url1 = 'http://option.ime.co.ir/GetTime'
url2 = 'http://option.ime.co.ir/GetMarketData'
When I look at the responses from those URLs in the browser, I see a JSON-format response and the time in a specific format.
But when I use requests to get the data, it gives me the same HTML that I get from the main URL.
Do you know what the reason is? How can I get the responses that I see in the browser?
I checked the headers for all the URLs and didn't find anything special that I should send with my request.
You have to provide the proper HTTP headers in the request. In my case, I was able to make it work using the following headers. Note that in my testing the HTTP response was a 200 OK rather than a redirect to the root website (which is what happens when no extra HTTP headers are provided in the request).
Raw HTTP Request:
GET http://option.ime.co.ir/GetTime HTTP/1.1
Host: option.ime.co.ir
Referer: "http://option.ime.co.ir/"
Accept: "application/json, text/plain, */*"
User-Agent: "Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0"
This should give you the proper JSON response you need.
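For instance, a rough equivalent of that raw request using the requests library (the header values are the same ones shown above):
import requests

headers = {
    'Referer': 'http://option.ime.co.ir/',
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
}

r = requests.get('http://option.ime.co.ir/GetTime', headers=headers)
print(r.status_code)  # should be 200 rather than a redirect, per the note above
print(r.text)         # the JSON body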
Your first connection from the browser gets a 302 redirect response (to the same URL).
The page then runs some JS, so the second request doesn't redirect anymore and gets the expected JSON.
It is a common technique to keep other people from using an API without permission.
Tick the "preserve log" checkbox in the dev tools so you can see it for yourself.
What is the proper way to Google something in Python 3? I have tried requests and urllib for a Google page. When I simply run res = requests.get("https://www.google.com/#q=" + query), that doesn't come back with the same HTML as when I inspect the Google page in Safari. The same happens with urllib, and a similar thing happens when I use Bing. I am familiar with AJAX; however, it seems that it is now deprecated.
In Python, if you do not specify the User-Agent header in HTTP requests manually, Python adds one for you by default, which Google can detect and may block.
Try the following to see if it helps.
import urllib.request
yourUrl = "post it here"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(yourUrl, headers = headers)
page = urllib.request.urlopen(req)
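You can then read and decode the response body, for example:
html = page.read().decode('utf-8')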
Hello everybody (first post here).
I am trying to send data to a webpage. This webpage requests two fields (a file and an e-mail address); if everything is OK, the webpage returns a page saying "everything is ok" and sends a file to the provided e-mail address. I execute the code below and get nothing in my e-mail account.
import urllib, urllib2
params = urllib.urlencode({'uploaded': open('file'),'email': 'user#domain.com'})
req = urllib2.urlopen('http://webpage.com', params)
print req.read()
The print command gives me the code of the home page (I assume it should instead give the code of the "everything is ok" page).
I think (based on a Google search) the poster module should do the trick, but I need to keep dependencies to a minimum, hence I would like a solution using standard libraries (if that is possible).
Thanks in advance.
Thanks everybody for your answers. I solved my problem using the mechanize library.
import mechanize
br = mechanize.Browser()
br.open('http://webpage.com')
email = 'user#domain.com'
# Select the first form on the page and fill in the e-mail field
br.select_form(nr=0)
br['email'] = email
# Attach the file to the form's upload control
br.form.add_file(open('filename'), 'mime-type', 'filename')
br.form.set_all_readonly(False)
br.submit()
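For reference, a rough standard-library-only sketch of the same multipart upload (the kind of solution I originally asked about) might look like this; it is untested, and the field names and placeholders are the ones from my original code:
import urllib2
import uuid

boundary = uuid.uuid4().hex
file_data = open('filename', 'rb').read()

# Build the multipart/form-data body by hand
body = '\r\n'.join([
    '--' + boundary,
    'Content-Disposition: form-data; name="email"',
    '',
    'user#domain.com',
    '--' + boundary,
    'Content-Disposition: form-data; name="uploaded"; filename="filename"',
    'Content-Type: application/octet-stream',
    '',
    file_data,
    '--' + boundary + '--',
    '',
])

req = urllib2.Request('http://webpage.com', data=body)
req.add_header('Content-Type', 'multipart/form-data; boundary=' + boundary)
print urllib2.urlopen(req).read()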
This site could be checking the Referer, User-Agent and Cookie headers.
The way to handle all of this is to use a urllib2.OpenerDirector, which you can get from urllib2.build_opener.
import cookielib
import urllib2

# Cookie handling
cj = cookielib.CookieJar()
cookie_processor = urllib2.HTTPCookieProcessor(cj)
# Build the OpenerDirector
opener = urllib2.build_opener(cookie_processor)
# Valid User-Agent from Firefox 3.6.8 on Ubuntu 10.04
user_agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8'
# Referer says that the request was sent from the site's title page
referer = 'http://webpage.com'
opener.addheaders = [
    ('User-Agent', user_agent),
    ('Referer', referer),
    ('Accept-Charset', 'utf-8')
]
Then prepare the parameters with urllib.urlencode and send the request with opener.open(url, params), for example:
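import urllib

# Encode the form fields (the file itself would still need multipart
# encoding; this sketch only covers the plain e-mail field from the question).
params = urllib.urlencode({'email': 'user#domain.com'})

# Send the POST through the opener so the cookies and headers above apply
response = opener.open('http://webpage.com', params)
print response.read()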
Documentation for Python 2.7: cookielib, OpenerDirector