I am using the Python requests module to send some requests to Google, but after a few requests, a reCAPTCHA pops up. I am setting a user agent, but it still pops up!
What should I do?
Setting a user agent did change which browser the site sees, but it had no effect on the captcha problem:
import requests
from time import sleep

user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

sleep(2)  # pause between requests
# keyword and site are defined earlier in the script
file = requests.get(
    f'https://www.google.com/search?q=contact+email+{keyword}+site:{site}&num=100',
    headers=headers,
    proxies=proxies)  # the proxies dict was defined but never passed before
I also tried using sleep, but in vain. Any suggestions?
That's kind of the entire point of captchas. They help deter bots and spammers. Most captchas can't be bypassed easily, so just changing the user agent won't make the captcha go away. Since it sounds like the captchas only appear after a certain number of requests, you could use rotating residential proxies and change the session's IP address whenever a captcha is detected.
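A minimal sketch of that rotation idea, assuming you have a list of proxy endpoints to cycle through (the PROXIES list and the captcha marker strings below are placeholders, not a specific provider's API):

import requests

# Placeholder proxy endpoints; in practice these would come from a
# rotating residential proxy provider.
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

def fetch_with_rotation(url, headers):
    """Try each proxy until a response arrives without a captcha page."""
    for proxy in PROXIES:
        try:
            resp = requests.get(
                url,
                headers=headers,
                proxies={'http': proxy, 'https': proxy},
                timeout=10,
            )
        except requests.RequestException:
            continue  # proxy failed; move on to the next one
        # Guess: Google's interstitial mentions "unusual traffic" and embeds
        # a reCAPTCHA form; treat either as a signal to rotate.
        if 'unusual traffic' in resp.text or 'g-recaptcha' in resp.text:
            continue
        return resp
    return None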
Alternatively, you can use a captcha-solving service like Anti-Captcha or DeathByCaptcha, which involves parsing information about the captcha and sending it to a service whose workers manually complete it for you. It's not exactly convenient or efficient, though, and it can often take up to ~30 seconds for a worker to complete a single captcha. Both options cost money.
I am trying to read a web page using a GET request in Python.
The original URL is given below. I found out that the information I am interested in is in a subpage, whose URL I also give below (I replaced the authenticity token with XXX).
I tried using the second URL in my script, but I get a 406 error. Can you suggest what I am doing wrong? Is the authenticity token there to prevent scraping? If so, can I work around it?
import urllib.request

# The subpage URL from the developer tools (authenticity token replaced with XXX)
url = 'https://www.goodreads.com/book/reviews/385228?csm_scope=&hide_last_page=true&language_code=en&page=2&authenticity_token=XXX'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = urllib.request.Request(url, headers=headers)
data = urllib.request.urlopen(req).read()
Thanks!
PS: This is how I got the URL using Chrome:
First I browse to https://www.goodreads.com/book/show/385228.On_Liberty
Then I open Chrome's developer tools: three dots -> more tools -> developer tools. Choose the network tab.
Then I go to the bottom of the page (just after the last review) and click "next".
In the tool window I choose the request, and in its headers I see the Request URL: https://www.goodreads.com/book/reviews/385228?csm_scope=&hide_last_page=true&language_code=en&page=2&authenticity_token=XXX
Can you try to update your headers to include one more item, like:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3',
    'X-Requested-With': 'XMLHttpRequest',
}
req = urllib.request.Request(url, headers=headers)
I managed to get 200 OK back when adding that header. However, the response you get from this endpoint might not really be what you need in the end, since it is a piece of JavaScript code that in turn updates the HTML page. You can still use it in some way, but it's a very dirty approach and might complicate things a lot.
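For reference, a minimal end-to-end sketch of that suggestion, using urllib as in the question (the URL still has the token masked as XXX, so a real token has to be substituted):

import urllib.request

# URL from the question; replace XXX with a real authenticity token.
url = ('https://www.goodreads.com/book/reviews/385228'
       '?csm_scope=&hide_last_page=true&language_code=en&page=2'
       '&authenticity_token=XXX')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3',
    'X-Requested-With': 'XMLHttpRequest',
}
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as resp:
    print(resp.status)                # expect 200 instead of 406
    body = resp.read().decode('utf-8')
print(body[:500])                     # JavaScript that updates the page's HTML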
What information do you need exactly? There might be a different approach than using that "problematic" response from your second URL.
I'm currently working on scraping some HTML files from an electronic medical system that I use for work. I already have a Python bot that logs into the system and can download and send faxes for me, but there are some pages I want my bot to quickly grab once it is logged in, before it even starts sending faxes. These pages are basic HTML with extremely predictable URLs, and I have verified that I can call the pages manually from my browser, so once I do get my session established it should be easy work.
The website is: https://kinnser.net/
Login URL: https://kinnser.net/login.cfm
second URL: https://kinnser.net/AM/Message/inbox.cfm
import requests
import json
import logging
from requests.auth import HTTPBasicAuth
from lxml import html

# This URL will be the URL that your login form points to with the "action" tag.
POST_LOGIN_URL = 'https://kinnser.net/loginlogic.cfm'
# This URL is the page you actually want to pull down with requests.
REQUEST_URL = 'https://kinnser.net/AM/Message/inbox.cfm'

# 'username' and 'password' are the "name" tags associated with the
# username and password input fields of the login form.
payload = {
    'username': 'XXXXXXXX',
    'password': 'XXXXXXXXX'}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'}

with requests.Session() as session:
    post = session.post(POST_LOGIN_URL, data=payload, headers=headers)
    print(post)
    r = session.get(REQUEST_URL)
    print(r.text)  # or whatever else you want to do with the request data!
I played around with the username and password fields by setting them equal to the input's name/ID, but that wouldn't work. So I tried this script on the old EMR we used, just to confirm it wasn't broken, and it did indeed work perfectly. Then I began to play around with the headers in my request, and it was still no dice. I'm not sure if my login is just failing or if they're detecting that I'm a bot and serving me the login page over and over again, but I have spent about 10 hours trying to research a solution and I've hit a wall with my project.
If anyone sees any mistakes in my code or has a workable solution, please feel free to suggest it. Thanks for the help, and hopefully I'll soon grow to understand more about RESTful web services.
Think the HTML might actually be in post.text?
edit:
try the request with these headers:
...
user_agent_str = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/78.0.3904.97 "
                  "Safari/537.36")
content_type_str = "application/json"

headers = {
    "user-agent": user_agent_str,
    "content-type": content_type_str,
}
...
Another edit:
I'm not sure if requests already handles this, but payload isn't valid JSON as written. You might also try using double quotes instead of single quotes.
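As an aside on the JSON point: if the endpoint really does expect a JSON body (which the content-type suggestion above assumes), requests can serialize the dict itself via the json= parameter, which also sets the Content-Type header for you. Whether kinnser.net actually expects JSON is an open assumption; this reuses the names from the snippets above:

# Assumption: the login endpoint accepts JSON. If it expects a regular
# form post (the more common case for login forms), keep data=payload.
post = session.post(
    POST_LOGIN_URL,
    json=payload,  # requests serializes the dict and sets Content-Type: application/json
    headers={"user-agent": user_agent_str},
)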
I would suggest trying these two things:
From the network calls, it looks like kinnser.net/loginlogic.cfm is the POST URL.
Change 'Username' to 'username' and 'Password' to 'password' and try again.
Since I don't have access to a username and password, I cannot verify this, but these two things might be causing the problem.
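Building on the post.text hint above, one hedged way to check whether the login actually succeeded is to look for markers of the login form in the response (the marker string is a guess, since I can't log in to verify; the names reuse the question's script):

import requests

with requests.Session() as session:
    post = session.post(POST_LOGIN_URL, data=payload, headers=headers)
    print(post.status_code)  # a 200 here does not necessarily mean the login worked
    # Guess: a failed login usually serves the login form again, so the
    # password field would still be present in the returned HTML.
    if 'name="password"' in post.text.lower():
        print('Login probably failed; inspect post.text to see what came back')
    else:
        r = session.get(REQUEST_URL)
        print(r.text[:500])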
I'm trying to scrape the Home Depot website using Python and requests. Selenium Webdriver works fine but takes way too much time, as the goal is to make a time-sensitive price comparison tool between local paint shops and power tool shops.
When I send a request to any other website, it works like normal. If I use any browser to navigate to the website manually, it also works fine (with or without session or cookie data). I tried adding randomized headers to the request, but it does not seem to help. From what I can see, it's not an issue of sending too many requests per time period, considering that Selenium and manual browsing still work at any time. I am confident that this specific issue is NOT caused by rate limiting.
My code:
from random import choice
import requests
import traceback

list_desktopagents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36']

def random_headers():
    return {'User-Agent': choice(list_desktopagents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

response = requests.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    headers=random_headers(),  # was headers=myheaders, which is not defined anywhere
    timeout=10)
My error:
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.homedepot.com', port=443): Read timed out. (read timeout=10)
Does anyone have a suggestion on what else I could do to successfully receive my response? I would prefer to use requests, but anything that runs fast, unlike Selenium, will be suitable. I understand that I'm being blocked; my question is not so much 'what's happening to stop me from scraping?' as 'what can I do to further humanize my scraper so it lets me continue?'
The error is coming from the user agent. The reason Selenium works and requests doesn't is that Selenium uses a web driver to make the request, so it is more humanlike, while requests is much easier to detect as a script. Also, from Home Depot's robots.txt page it doesn't look like products are allowed to be scraped. I got a response by using this code:
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get('https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629', headers=headers)
print(response.content)
By using these user agents you can "trick" the site into thinking you are an actual person, which is what the web driver with Selenium does.
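If a single user agent stops working, one hedged extension of the approach from the question is to rotate through several common desktop agents, picking one per request; the agent strings below are just examples:

from random import choice
import requests

# Example desktop user-agent strings; any reasonably current ones should do.
DESKTOP_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
]

def get_with_random_agent(url):
    """Send a GET with a randomly chosen user agent each call."""
    headers = {'User-Agent': choice(DESKTOP_AGENTS),
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
    return requests.get(url, headers=headers, timeout=10)

response = get_with_random_agent(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629')
print(response.status_code)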
I have been trying to do a GET request on a YouTube video page in order to read simple information off of the page. I have done this many times before, and generally it is quite easy to reverse engineer a GET request with the help of Google Chrome's developer tools.
To demonstrate, here is a screenshot of the request I get when I reload a YouTube video in a fresh incognito window (to prevent cookies from being sent), as seen from the developer menu:
[Chrome screenshot of the request]
Every time I close the window and reload the page, I receive nearly identical HTML (apart from authorization keys and the like), the bottom of which can be seen here:
[another Chrome screenshot of the response HTML]
First I tried recreating this request using a header-less get with Requests in Python:
import requests

sesh = requests.Session()
print(sesh.get("https://www.youtube.com/watch?v=5eA8IVrQWn8").content)
This returns a different page, which still contains some of the data present on the page I get from Chrome, but not nearly all of it. Next, I tried including all the headers I saw in the Chrome request, using the following code:
import requests

sesh = requests.Session()
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.8",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}
print(sesh.get("https://www.youtube.com/watch?v=5eA8IVrQWn8", headers=headers).content)
However, this very strangely returns a seemingly random short paragraph of unicode characters of varying lengths, sometimes around 10 characters long, sometimes closer to 50. I couldn't think of any other ways to make this closer to the request I was seeing from Chrome. I fiddled with this for a couple of hours, doing things like running the request multiple times in the same session and messing with the headers a bit, but to no avail.
Finally, out of desperation, I tried dropping everything except the user agent, using the following code:
import requests

sesh = requests.Session()
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}
print(sesh.get("https://www.youtube.com/watch?v=5eA8IVrQWn8", headers=headers).content)
And this got me the page I wanted.
However, I am left unsatisfied with the knowledge that somehow replicating the GET I was seeing in Chrome didn't work. What am I missing from my second attempt?
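One likely culprit, offered as an assumption since the thread leaves the question open: the accept-encoding header in the second attempt advertises br (Brotli), and if requests cannot decompress Brotli in your environment (it needs the separate brotli package), the raw compressed bytes would look exactly like a short blob of random characters. A quick hedged check is to drop br from that header:

import requests

sesh = requests.Session()
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    # Assumption: advertising only encodings requests can always decode
    # avoids receiving a Brotli-compressed body it can't decompress.
    "accept-encoding": "gzip, deflate",
    "accept-language": "en-US,en;q=0.8",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
}
resp = sesh.get("https://www.youtube.com/watch?v=5eA8IVrQWn8", headers=headers)
print(resp.text[:500])  # should now be readable HTML rather than a short blob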
I'm running Scrapy 0.24.4, and have encountered quite a few sites that shut down the crawl very quickly, typically within 5 requests. The sites return 403 or 503 for every request, and Scrapy gives up. I'm running through a pool of 100 proxies, with the RotateUserAgentMiddleware enabled.
Does anybody know how a site could identify Scrapy that quickly, even with the proxies and user agents changing? Scrapy doesn't add anything to the request headers that gives it away, does it?
Some sites incorporate JavaScript code that needs to be run. Scrapy doesn't execute JavaScript, so the web app knows very quickly that it's a bot.
http://scraping.pro/javascript-protected-content-scrape/
Try using Selenium for the sites that return 403. If crawling with Selenium works, you can assume the problem is JavaScript. I think crunchbase.com uses that kind of protection against scraping.
It appears that the primary problem was not having cookies enabled. Having enabled cookies, I'm having more success now. Thanks.
For me, cookies were already enabled. What fixed it was using another user agent, one that is common. Replace USER_AGENT in your project's settings.py file with this:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
I simply set AUTOTHROTTLE_ENABLED to True and my script was able to run.
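Putting the thread's suggestions together, a hedged settings.py sketch; whether each setting is needed depends on the site, and these are the standard Scrapy setting names:

# settings.py -- combining the fixes suggested in this thread

# Cookies are on by default, but make it explicit, since having them
# disabled was the primary problem for the original poster.
COOKIES_ENABLED = True

# A common desktop user agent, as suggested above.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

# Let Scrapy adapt its request rate to the server's responses.
AUTOTHROTTLE_ENABLED = True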