I'm currently working on occasional data-intensive projects and need to gather data from e-commerce platforms like Amazon, so I created a web scraping program in Python. I'm using the requests library along with a list of user agents and proxies, but I think the proxies are not being applied, which is causing the program to fail. Note that the Amazon API is too limiting in terms of content and access rates and is not suitable for my needs.
Here's how I send requests:
import requests
import random
session = requests.session()
proxies = [{'https:': 'https://' + item.rstrip(), 'http':
'http://' + item.rstrip()} for item in open('proxies.txt').readlines()]
user_agent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
print(session.get('https://icanhazip.com', proxies=random.choice(proxies), headers=user_agent).text)
However, I keep getting the same IP address printed, which means the proxies are not being used this way. The proxies.txt file contains proxies in this format, e.g.:
178.168.19.139:30736
342.552.34.456:8080
...
What is the best way to work around the captchas and robot checks presented by Amazon using these tools (or additional tools, if you have suggestions), and why are the proxies failing to work?
I'm not sure if this will work for you, but I found that rewriting the proxy dictionary solved the problem. Note that the original 'https:' key contains a stray colon (it should be 'https'), so requests never matches it for HTTPS requests; the version below also drops the scheme prefix from the proxy values, which requests accepts:
proxies = [{'https': item.rstrip(), 'http': item.rstrip()} for item in open('proxies.txt').readlines()]
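A fuller sketch (assuming proxies.txt holds live proxies, one host:port per line) that rotates the proxy per request and verifies the exit IP changes:
import random
import requests

# Build one proxy dict per line of proxies.txt; requests assumes http:// when
# no scheme is given, which works for plain HTTP proxies.
with open('proxies.txt') as fh:
    proxies = [{'https': line.strip(), 'http': line.strip()} for line in fh if line.strip()]

user_agent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) '
                            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}

session = requests.Session()
for _ in range(3):
    proxy = random.choice(proxies)    # rotate: pick a random proxy for each request
    resp = session.get('https://icanhazip.com', proxies=proxy, headers=user_agent, timeout=10)
    print(resp.text.strip())          # should print the proxy's IP, not your own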
I've built a simple Python web scraper that works as expected locally but does not work on AWS Lambda, specifically and only for the website I would like to scrape. I've tested just the scraping portion of the code and can confirm that it is a Cloudflare anti-bot issue.
I've combed through relevant SO and Medium articles and tried:
adding the appropriate headers
specifying user agent
using different libraries (urllib, cloudscraper, selenium); a rough cloudscraper sketch follows this list
using a virtual display (pyvirtualdisplay with xvfb), following this post: How to bypass Cloudflare bot protection in selenium
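A minimal sketch of the cloudscraper approach (for illustration; not the exact Lambda code used):
import cloudscraper

def lambda_handler(event, context):
    # cloudscraper wraps requests and tries to solve Cloudflare's JavaScript
    # challenge automatically; it does not help when a reCAPTCHA is served.
    scraper = cloudscraper.create_scraper()
    resp = scraper.get('https://disboard.org/servers/tag/python/15')
    return {'statusCode': resp.status_code, 'body': resp.text[:500]}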
Example code of the urllib version to illustrate the question:
import json
import urllib.request
def lambda_handler(event, context):
    url = 'https://disboard.org/servers/tag/python/15'
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'}
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    return respData
The above code returns a 403 status + reCAPTCHA.
I understand that data center IP ranges get scrutinized more heavily by anti-bot systems than residential IPs. Is there any workaround for this?
Thank you in advance.
I need to download ~50 CSV files in Python. Based on the Google Chrome network stats, the download itself takes only 0.1 seconds, while the request takes about 7 seconds to process.
I am currently using headless Chrome to make the requests.
I tried multithreading, but from what I can tell, the browser doesn't support that (it can't make another request before the first request finishes processing). I don't think multiprocessing is an option, as this script will be hosted on a virtual server.
My next idea is to use the requests module instead of headless Chrome, but I am having issues connecting to the company network without a browser. Will this work, though? Any other solutions? Could I do something with multiple driver instances or multiple tabs on a single driver? Thanks!
Here's my code:
from multiprocessing.pool import ThreadPool

driver = ChromeDriver()   # placeholder: however the headless Chrome driver is created
Login(driver)             # placeholder: existing login step

def getFile(item):
    driver.get(url.format(item))   # url is the CSV download URL template

updateSet = blah          # placeholder: the set of items to download

pool = ThreadPool(len(updateSet))
for item in updateSet:
    pool.apply_async(getFile, (item,))
pool.close()
pool.join()
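The requests-based idea I'm considering would look roughly like this (a sketch, assuming the site's login is cookie-based; Login, url, and updateSet are the same placeholders as in the code above):
from multiprocessing.pool import ThreadPool
import requests
from selenium import webdriver

driver = webdriver.Chrome()
Login(driver)                                  # placeholder: existing login helper

session = requests.Session()
for cookie in driver.get_cookies():            # copy the browser's auth cookies into requests
    session.cookies.set(cookie['name'], cookie['value'])

def getFile(item):
    resp = session.get(url.format(item))       # plain HTTP request, no browser rendering
    with open('{}.csv'.format(item), 'wb') as fh:
        fh.write(resp.content)

pool = ThreadPool(8)                           # a handful of concurrent downloads
pool.map(getFile, updateSet)
pool.close()
pool.join()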
For requests, try setting the user agent string to a browser like Chrome, e.g. Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36.
Some example code:
import requests
url = 'SOME URL'
headers = {
    'User-Agent': 'user agent here',
    'From': 'youremail@domain.com'  # This is another valid field
}
response = requests.get(url, headers=headers)
Another example, this time with BeautifulSoup imported as well:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get(url, headers=headers)
I am trying to scrape this website "https://allegro.pl/uzytkownik/feni44/lampy-przednie-i-elementy-swiatla-do-jazdy-dziennej-drl-255102?bmatch=cl-e2101-d3793-c3792-fd-60-aut-1-3-0412"
Everything was working fine until yesterday, but suddenly I get a 403 error.
I have tried proxies and a VPN, but the error persists.
When scraping a website, you must be careful of the website's anti-DDoS protection. One form of DDoS is submitting many requests at once, which increases the server's load and hinders its performance. A web scraper does exactly that as it goes through each link, so the website can mistake your bot for a DDoS attack and block its IP address, making it FORBIDDEN (error 403) to access the website from that IP address.
Usually this is only temporary, so after 12 or 24 hours (or however long the website's block period is) it should be good to go. If you want to avoid a future 403 FORBIDDEN error, consider sleeping for 10 seconds between each request.
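For example, a minimal sketch of pacing requests (the URL list and the 10-second delay are only illustrative):
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}        # illustrative browser-like user agent
urls = ['https://allegro.pl/...']              # placeholder list of pages to fetch

for page_url in urls:
    resp = requests.get(page_url, headers=headers)
    print(page_url, resp.status_code)
    time.sleep(10)                             # pause between requests to stay under rate limits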
Try using a proxy service like Bright Data. They have 72 million+ proxies. I think this issue will be resolved by rotating the proxy and user agent.
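A rough sketch of per-request rotation (the proxy addresses and user-agent strings below are placeholders, not real working proxies):
import random
import requests

# Placeholder pools; in practice these would come from a proxy provider and a UA list.
proxies_pool = ['203.0.113.10:8080', '203.0.113.11:8080']
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]

def fetch(url):
    proxy = random.choice(proxies_pool)
    return requests.get(
        url,
        headers={'User-Agent': random.choice(user_agents)},  # rotate the user agent
        proxies={'http': proxy, 'https': proxy},             # rotate the proxy
        timeout=15,
    )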
I am using Python to scrape pages. Until now I haven't had any issues. I use Selenium for this purpose, but I also hear that people get IP banned from some websites. I haven't faced that; those people used the beautifulsoup, lxml, and requests libraries.
Selenium makes it look like a real user is driving the browser rather than a bot, but can it also get IP banned from some sites?
I am also setting a User-Agent header:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
'Chrome/80.0.3987.132 Safari/537.36'
Yes, it depends on the requests you send to a website. Data scraping can usually get you banned either way; setting the user agent is a plus, because some websites won't let you in if it is not set.
If you don't want to get banned, use a proxy IP.
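For example, a minimal sketch of setting both a user agent and a proxy with Selenium's Chrome options (the proxy address is a placeholder):
from selenium import webdriver

options = webdriver.ChromeOptions()
# Spoof the user agent the browser reports.
options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36')
# Route traffic through a proxy (placeholder address).
options.add_argument('--proxy-server=http://203.0.113.10:8080')

driver = webdriver.Chrome(options=options)
driver.get('https://example.com')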
I want to open a URL using urllib.request.urlopen('someurl'):
with urllib.request.urlopen('someurl') as url:
b = url.read()
I keep getting the following error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I understand the error to be due to the site not letting Python access it, to stop bots from wasting their network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for changing the user agent are for urllib2, and I am using Python 3, so those solutions don't work.
How can I fix this problem with Python 3?
From the Python docs:
import urllib.request
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
from urllib.request import urlopen, Request
urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))
I just answered a similar question here: https://stackoverflow.com/a/43501438/206820
In case you not only want to open the URL but also want to download the resource (say, a PDF file), you can use the code below:
from urllib.request import ProxyHandler, build_opener, install_opener, urlretrieve

# proxy = ProxyHandler({'http': 'http://192.168.1.31:8888'})
proxy = ProxyHandler({})
opener = build_opener(proxy)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30')]
install_opener(opener)
result = urlretrieve(url=file_url, filename=file_name)
The reason I added the proxy was so I could monitor the traffic in Charles.
The host site rejection is coming from the OWASP ModSecurity Core Rules for Apache mod-security. Rule 900002 has a list of "bad" user agents, and one of them is "python-urllib2". That's why requests with the default user agent fail.
Unfortunately, if you use Python's robotparser module,
https://docs.python.org/3.5/library/urllib.robotparser.html?highlight=robotparser#module-urllib.robotparser
it uses the default Python user agent, and there's no parameter to change that. If robotparser's attempt to read robots.txt is refused (not just URL not found), it then treats all URLs from that site as disallowed.
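A possible workaround (a sketch; example.com stands in for the real site) is to fetch robots.txt yourself with a browser-like user agent and hand the lines to RobotFileParser.parse():
import urllib.request
import urllib.robotparser

# Fetch robots.txt with a browser-like User-Agent so the request isn't rejected.
req = urllib.request.Request(
    'https://example.com/robots.txt',
    headers={'User-Agent': 'Mozilla/5.0'}
)
with urllib.request.urlopen(req) as resp:
    lines = resp.read().decode('utf-8').splitlines()

# Feed the fetched rules to the parser instead of letting it download robots.txt itself.
rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)
print(rp.can_fetch('Mozilla/5.0', 'https://example.com/some/page'))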