I am using Python to scrape pages, and until now I haven't had any issues. I use Selenium for this, but I hear that people get IP-banned from some websites; those people were using the BeautifulSoup, lxml and requests libraries, and I haven't faced that myself.
Selenium feels like a real user driving the browser rather than a bot, but can it also get IP-banned from some sites?
I am also setting a User-Agent header:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
'Chrome/80.0.3987.132 Safari/537.36'
Yes, it depends on the requests you send to a website. Scraping a site can get you banned, and setting the user agent is a plus, because some websites won't let you in at all if it isn't set.
If you don't want to get banned, use a proxy IP.
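Here's a minimal sketch of how you might set both in Selenium itself, assuming Chrome (the proxy address below is a placeholder, not a real endpoint):
from selenium import webdriver

user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36')

options = webdriver.ChromeOptions()
options.add_argument(f'--user-agent={user_agent}')
options.add_argument('--proxy-server=http://203.0.113.5:3128')  # placeholder proxy address

driver = webdriver.Chrome(options=options)
driver.get('https://example.com')
print(driver.page_source[:200])
driver.quit()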
I am using the Python requests module to send some requests to Google, but after a few requests a reCAPTCHA pops up. I am using a user agent, but it still pops up!
What should I do?
Setting the user agent did change the reported browser, but it had no effect on the captcha problem:
from time import sleep
import requests

user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
# note: this proxies dict is defined but never passed to requests.get below
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

sleep(2)  # pause between requests
# keyword and site are defined earlier in my script
file = requests.get(f'https://www.google.com/search?q=contact+email+{keyword}+site:{site}&num=100', headers=headers)
I used sleep but in vain. Any suggestions?
That's kind of the entire point of captchas: they exist to deter bots and spammers. Most captchas can't be bypassed easily, so just changing the user agent won't make the captcha go away. Since it sounds like the captchas only appear after a certain number of requests, you could use rotating residential proxies and change the session's IP address whenever a captcha is detected.
Alternatively, you can use a captcha-solving service like Anti-Captcha or DeathByCaptcha, which involves parsing information about the captcha and sending it to a service whose workers manually complete it for you. It's not exactly convenient or efficient, though, and it can often take up to ~30 seconds for a worker to complete a single captcha. Both options cost money.
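If you go the rotating-proxy route, here is a minimal sketch of the idea (the proxy addresses are placeholders, and the "unusual traffic" check is just one rough way of spotting Google's captcha page):
import random
import requests

user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
# Placeholder proxy endpoints; a rotating-residential-proxy provider would supply real ones.
proxy_pool = ['http://203.0.113.5:3128', 'http://203.0.113.6:3128']

def fetch(url):
    for _ in range(len(proxy_pool)):
        proxy = random.choice(proxy_pool)
        try:
            r = requests.get(url, headers={'User-Agent': user_agent},
                             proxies={'http': proxy, 'https': proxy}, timeout=10)
        except requests.RequestException:
            continue  # this proxy failed, try another one
        # Google's captcha interstitial mentions "unusual traffic"; rotate if we see it.
        if 'unusual traffic' not in r.text.lower():
            return r
    return None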
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url_ = 'https://example.com'  # placeholder for the page you want to fetch
r = requests.get(url_, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
I am trying to scrape this website: "https://allegro.pl/uzytkownik/feni44/lampy-przednie-i-elementy-swiatla-do-jazdy-dziennej-drl-255102?bmatch=cl-e2101-d3793-c3792-fd-60-aut-1-3-0412"
Everything was working well until yesterday, but suddenly I get a 403 error.
I have used proxies/VPNs, but the error still persists.
When scraping a website, you must be careful of the site's anti-DDoS protection. One form of DDoS is submitting many requests at once (e.g. by constantly refreshing), which increases the server's load and hinders its performance. A web scraper does exactly that as it goes through each link, so the website can mistake your bot for a DDoS attack and block its IP address, making the site FORBIDDEN (error 403) from that IP.
Usually this is only temporary, so after 12 or 24 hours (or however long a block period the website sets) it should be good to go again. If you want to avoid a future 403 FORBIDDEN error, consider sleeping for 10 seconds between each request.
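For instance, a minimal sketch of spacing out requests like that (the URL list and user agent are only illustrative):
import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder pages to scrape
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}

for url in urls:
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    time.sleep(10)  # wait 10 seconds before the next request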
Try a proxy service like Bright proxy; they have more than 72 million proxies. I think rotating the proxy and the user agent will resolve this issue.
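A rough sketch of that rotation idea with requests (the proxy addresses and user-agent strings below are placeholders, and the exact proxy format depends on your provider):
import random
import requests

proxy_list = ['http://203.0.113.5:3128', 'http://203.0.113.6:8080']  # placeholders
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
]

def get_rotated(url):
    # Pick a different proxy and user agent for each request.
    proxy = random.choice(proxy_list)
    headers = {'User-Agent': random.choice(user_agents)}
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy}, timeout=10)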
I've recently been working on occasional data-intensive projects, and I need to gather data from e-commerce platforms like Amazon, so I created a web-scraping program in Python. I'm using the requests library along with a list of user agents and proxies, but I think the proxies are not working, which is causing the program to fail. Note that the Amazon API is limited in terms of content and access rates and is not suitable for my needs.
Here's how I send requests:
import requests
import random
session = requests.session()
proxies = [{'https:': 'https://' + item.rstrip(), 'http': 'http://' + item.rstrip()}
           for item in open('proxies.txt').readlines()]
user_agent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
print(session.get('https://icanhazip.com', proxies=random.choice(proxies), headers=user_agent).text)
However, I keep getting the same IP address printed, which means the proxies are not working this way. proxies.txt contains proxies in this format, for example:
178.168.19.139:30736
342.552.34.456:8080
...
What is the best way to work around captchas and robot checks presented by Amazon using these tools (or extra tools, if you have any suggestions), and why are the proxies failing to work?
I'm not sure if this will work for you, but I found that removing the protocol at the start of the IP within the dictionary solved the problem.
proxies = [{'https': item.rstrip(), 'http': item.rstrip()} for item in open('proxies.txt').readlines()]
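With that change, a quick sanity check (building on the icanhazip.com call from the question) might look like this; if every proxy is alive you should see a different IP per line:
import requests

with open('proxies.txt') as f:
    proxies = [{'https': line.rstrip(), 'http': line.rstrip()} for line in f if line.strip()]

session = requests.session()
for proxy in proxies:
    try:
        # Each proxy should report a different public IP if rotation is working.
        print(session.get('https://icanhazip.com', proxies=proxy, timeout=10).text.strip())
    except requests.RequestException as exc:
        print('proxy failed:', proxy['http'], exc)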
I'm trying to scrape the Home Depot website using Python and requests. Selenium Webdriver works fine, but takes way too much time, as the goal is to make a time-sensitive price comparison tool between local paint shops and power tool shops.
When I send a request to any other website, it works as normal, and if I navigate to the Home Depot site manually in a browser it also works fine (with or without session/cookie data). I tried adding randomized headers to the request, but it does not seem to help. From what I can see, it's not an issue of sending too many requests per time period (Selenium and manual browsing still work at any time), so I am confident this specific issue is NOT a rate limitation.
my code:
from random import choice
import requests
import traceback

list_desktopagents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36']

def random_headers():
    return {'User-Agent': choice(list_desktopagents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

response = requests.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    headers=random_headers(),
    timeout=10)
my error:
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.homedepot.com', port=443): Read timed out. (read timeout=10)
Does anyone have a suggestion on what else I could do to successfully receive a response? I would prefer to use requests, but anything that runs fast (unlike Selenium) will be suitable. I understand that I'm being blocked; my question is not so much "what's happening to stop me from scraping?" but rather "what can I do to further humanize my scraper so it lets me continue?"
The error is coming from the User-Agent. The reason Selenium works and requests doesn't is that Selenium drives a real browser, so its requests look more human, while requests is much easier to detect as a script. Also, from Home Depot's robots.txt it doesn't look like product pages are allowed to be scraped. That said, I got a response with this code:
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get('https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629', headers=headers)
print(response.content)
By using these user agents you can "trick" the site into thinking you are an actual person, which is what the Selenium web driver does.
I'm trying to find out which browsers my users are using, and I'm running into a problem.
If I read the "User-Agent" header, it usually gives me lots of text and tells me nothing.
For example, if I visit the site with Chrome, the "User-Agent" header contains:
User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36"
As you can see, this tells me nothing, since it mentions Mozilla, Safari, Chrome, etc., even though I visited with Chrome.
The framework I'm using is Bottle (Python).
Any help would be appreciated, thanks.
User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36".
As you can see, this tells me nothing since there is mention of
Mozzila, Safari, Chrome etc.. even though I visited with Chrome.
Your conclusion above is wrong: the UA string actually tells you many things, including the type and version of the web browser.
The post below explains why Mozilla and Safari exist in Chrome's UA.
History of the browser user-agent string
You can try to analyze it manually on user-agent-string-db.
There's a Python API for it.
from uasparser2 import UASparser

uas_parser = UASparser()
# Instead of fetching data via the network every time, you can cache the db locally:
# uas_parser = UASparser('/path/to/your/cache/folder', mem_cache_size=1000)
# Updating the data is simple: uas_parser.updateData()

result = uas_parser.parse('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36')
# result
{'os_company': u'',
'os_company_url': u'',
'os_family': u'Linux',
'os_icon': u'linux.png',
'os_name': u'Linux',
'os_url': u'http://en.wikipedia.org/wiki/Linux',
'typ': u'Browser',
'ua_company': u'Google Inc.',
'ua_company_url': u'http://www.google.com/',
'ua_family': u'Chrome',
'ua_icon': u'chrome.png',
'ua_info_url': u'http://user-agent-string.info/list-of-ua/browser-detail?browser=Chrome',
'ua_name': u'Chrome 31.0.1650.57',
'ua_url': u'http://www.google.com/chrome'}
Thank you everyone for your answers. I found something really simple that works.
Download the httpagentparser module from:
https://pypi.python.org/pypi/httpagentparser
After that, just import it in your Python program:
import httpagentparser
Then you can write a function like this that returns the browser; it works like a charm:
def detectBrowser(request):
    agent = request.environ.get('HTTP_USER_AGENT')
    browser = httpagentparser.detect(agent)
    if not browser:
        browser = agent.split('/')[0]
    else:
        browser = browser['browser']['name']
    return browser
That's it
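If it helps, here's how you might wire that helper into a Bottle route (the route path and response text are just for illustration):
from bottle import route, request, run

@route('/browser')
def show_browser():
    # detectBrowser is the helper defined above
    return 'You appear to be using: ' + detectBrowser(request)

run(host='localhost', port=8080)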
As you can see, this tells me nothing, since it mentions Mozilla, Safari, Chrome, etc., even though I visited with Chrome.
It's not that the User-Agent string tells you "nothing"; it's that it's telling you too much.
If you want a report that breaks down your users' browsers, your best bet is to analyze your logs. Several programs are available to help. (One caveat, if you're using Bottle's "raw" web server, is that it won't log in Common Log Format out of the box. You have options.)
If you need to know in real time, you'll need to spend time learning user agent strings (useragentstring.com might help here) or use an API like this one.
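As a rough illustration of the log-analysis route, here is a sketch that tallies browsers using the httpagentparser module from the earlier answer; it assumes a combined-format access log at access.log, so adjust the path and field layout to your setup:
import collections
import httpagentparser

counts = collections.Counter()
with open('access.log') as log:   # assumed log location
    for line in log:
        parts = line.split('"')
        if len(parts) < 6:
            continue              # line doesn't look like combined log format
        ua_string = parts[5]      # the quoted User-Agent field
        detected = httpagentparser.detect(ua_string)
        counts[detected.get('browser', {}).get('name', 'Unknown')] += 1

for browser, total in counts.most_common():
    print(browser, total)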