(fake_useragent) UserAgent() will not connect - python

Essentially, I have code that has been working for a few months. I tried to run the program today and, as the title says, the connection for UserAgent() is timing out. I've tried upgrading the package with "pip install --upgrade fake_useragent" and I'm told it is up to date. I've also tried to delete the package (in order to re-install it), but I am unable to for some reason. Does anyone have any ideas as to how else I can approach this issue?
from fake_useragent import UserAgent
...
ua = UserAgent()  # program cannot progress past this point

You should add a fallback user agent to the ua object. That way, if the server is down, the fallback user agent kicks in; a working but outdated user agent is better than a complete program crash.
from fake_useragent import UserAgent
ua = UserAgent(fallback='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')
headers = {'User-Agent':ua.chrome}
I learned this from this question:
Scrapy FakeUserAgentError: Error occurred during getting browser

The fake_useragent package connects to http://useragentstring.com/ to get the list of up-to-date user agent strings. It looks like http://useragentstring.com/ is down, hopefully only temporarily.
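If you would rather keep the program running even when the remote list cannot be fetched at all, here is a minimal sketch, assuming any working User-Agent string is acceptable; the hard-coded fallback below is just an example, not part of the original question:
from fake_useragent import UserAgent

FALLBACK_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
               'AppleWebKit/537.36 (KHTML, like Gecko) '
               'Chrome/56.0.2924.87 Safari/537.36')

try:
    ua = UserAgent(fallback=FALLBACK_UA)
    user_agent = ua.chrome
except Exception:
    # If even the constructor fails (e.g. the data source is unreachable),
    # fall back to the hard-coded string so the program can keep going.
    user_agent = FALLBACK_UA

headers = {'User-Agent': user_agent}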

Related

Getting reCaptcha in python requests

I am using the Python requests module to send some requests to Google, but after a few requests a reCAPTCHA pops up. I am setting a user agent, but it still pops up!
What should I do?
Setting the user agent did change which browser the request appears to come from, but it had no effect on the CAPTCHA problem:
from time import sleep
import requests

user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
sleep(2)
headers = {'User-Agent': user_agent}
proxies = {  # defined here but never passed to requests.get below
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
file = requests.get(f'https://www.google.com/search?q=contact+email+{keyword}+site:{site}&num=100', headers=headers)
I used sleep as well, but in vain. Any suggestions?
That's kind of the entire point of CAPTCHAs: they help deter bots and spammers. Most CAPTCHAs can't be bypassed easily, so just changing the user agent won't make the CAPTCHA go away. Since it sounds like the CAPTCHAs only appear after a certain number of requests, you could use rotating residential proxies and change the session's IP address whenever a CAPTCHA is detected, as in the sketch below.
Alternatively, you can use a CAPTCHA-solving service like Anti-Captcha or DeathByCaptcha, which involves parsing information about the CAPTCHA and sending it to a service whose workers complete it manually for you. It's not exactly convenient or efficient, though, and it can often take up to ~30 seconds for a worker to complete a single CAPTCHA. Both options cost money.
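A minimal sketch of the proxy-rotation idea, assuming you have a list of proxy URLs from a residential proxy provider; the endpoints and the CAPTCHA check below are placeholders, not part of the original answer:
import random
import requests

# Placeholder proxy endpoints; substitute the ones your provider gives you.
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}

def fetch(url, max_attempts=5):
    """Retry through different proxies until the response shows no CAPTCHA."""
    for _ in range(max_attempts):
        proxy = random.choice(PROXIES)
        resp = requests.get(url, headers=HEADERS,
                            proxies={'http': proxy, 'https': proxy},
                            timeout=10)
        # Crude check: Google's block page mentions its reCAPTCHA widget.
        if 'recaptcha' not in resp.text.lower():
            return resp
    raise RuntimeError('Blocked on every proxy attempted')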

Getting past ReadTimeout from Python Requests

I'm trying to scrape the Home Depot website using Python and requests. Selenium WebDriver works fine but takes far too much time, and the goal is to build a time-sensitive price comparison tool between local paint shops and power tool shops.
When I send a request to any other website, it works normally. If I use any browser to navigate to the site manually, it also works fine (with or without session/cookie data). I tried adding randomized headers to the request, but it does not seem to help. From what I can see, it's not an issue of sending too many requests in a given time period, considering that Selenium and manual browsing still work at any time, so I am confident this specific issue is NOT a rate limitation.
my code:
from random import choice
import requests
import traceback

list_desktopagents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36']

def random_headers():
    return {'User-Agent': choice(list_desktopagents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

response = requests.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    headers=random_headers(),
    timeout=10)
my error:
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.homedepot.com', port=443): Read timed out. (read timeout=10)
Does anyone have a suggestion on what else I could do to successfully receive a response? I would prefer to use requests, but anything that runs fast, unlike Selenium, will be suitable. I understand that I'm being blocked; my question is not so much 'what's happening to stop me from scraping?' but rather 'what can I do to further humanize my scraper so it lets me continue?'
The error is coming from the user agent. The reason Selenium works and requests does not is that Selenium uses a web driver to make the request, so it looks more humanlike, while requests is much easier to detect as a script. Also, from Home Depot's robots.txt it doesn't look like product pages are allowed to be scraped. I got a response by using this code:
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get('https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629', headers=headers)
print(response.content)
By using these user agents you can "trick" the site into thinking you are an actual person, which is what the web driver with Selenium does.
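If a single user agent is not enough, a slightly fuller set of browser-like headers and a persistent session sometimes helps. This is a sketch of that idea, not something from the original answer:
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
})

# Reusing one session keeps cookies between requests, which looks more like
# a real browser than a fresh connection every time.
response = session.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    timeout=10)
print(response.status_code)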

Cannot open PhantomJS webpages in desktop mode (always in mobile mode)

I have been trying to fix this issue through Stack Overflow posts, but I cannot find any topics relevant to my issue.
I am creating an automated python script that would automatically login to my facebook account and would utilize some features that facebook offers.
When I use selenium, I usually have the program run on the Chrome browser and I use the code as following
driver = webdriver.Chrome()
And I program the rest of what I want to do from there, since it's easy to visually see what's going on with the program. However, when I switch to the PhantomJS browser, the program loads Facebook as the mobile version of the website (like an Android/iOS version of Facebook).
I was wondering if anyone could help me understand how to switch this to desktop mode, since the mobile version of Facebook is coded differently from the desktop version and I don't want to redo the code to handle that difference. I need this to run on PhantomJS, because it will be running on a low-powered Raspberry Pi that can barely open Google Chrome.
I have also tried the following to see if it worked, and it didn't help.
headers = {'Accept': '*/*',
           'Accept-Encoding': 'gzip, deflate, sdch',
           'Accept-Language': 'en-US,en;q=0.8',
           'Cache-Control': 'max-age=0',
           'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
driver = webdriver.PhantomJS(desired_capabilities=headers)
driver.set_window_size(1366, 768)
Any help would be greatly appreciated!!
I had the same problem with PhantomJS, Selenium, and Python, and the following code resolved it:
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
desired_capabilities['phantomjs.page.customHeaders.User-Agent'] = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/39.0.2171.95 Safari/537.36'
)
driver = webdriver.PhantomJS('./phantom/bin/phantomjs.exe', desired_capabilities=desired_capabilities)
driver.get('http://facebook.com')
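To confirm the override took effect, a quick check against the driver from the snippet above (not part of the original answer) is:
print(driver.execute_script('return navigator.userAgent'))  # should print the desktop Chrome string set above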

Changing User Agent in Python 3 for urllib.request.urlopen

I want to open a URL using urllib.request.urlopen('someurl'):
with urllib.request.urlopen('someurl') as url:
    b = url.read()
I keep getting the following error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I understand the error to be due to the site not letting Python access it, to stop bots wasting its network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for changing the user agent use urllib2, and I am using Python 3, so those solutions don't work.
How can I fix this problem with Python 3?
From the Python docs:
import urllib.request
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
from urllib.request import urlopen, Request
urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))
I just answered a similar question here: https://stackoverflow.com/a/43501438/206820
In case you not only want to open the URL but also want to download the resource (say, a PDF file), you can use the code below:
from urllib.request import ProxyHandler, build_opener, install_opener, urlretrieve

# proxy = ProxyHandler({'http': 'http://192.168.1.31:8888'})
proxy = ProxyHandler({})
opener = build_opener(proxy)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30')]
install_opener(opener)
result = urlretrieve(url=file_url, filename=file_name)
The reason I added the proxy is so that I could monitor the traffic in Charles.
The host site's rejection comes from the OWASP ModSecurity Core Rules for Apache mod_security. Rule 900002 has a list of "bad" user agents, and one of them is "python-urllib2". That's why requests with the default user agent fail.
Unfortunately, if you use Python's "robotparser" module,
https://docs.python.org/3.5/library/urllib.robotparser.html?highlight=robotparser#module-urllib.robotparser
it uses the default Python user agent, and there's no parameter to change that. If "robotparser"'s attempt to read "robots.txt" is refused (not just "URL not found"), it then treats all URLs from that site as disallowed. A workaround is sketched below.
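A possible workaround, sketched here rather than taken from the answer above, is to download robots.txt yourself with a custom User-Agent and feed the text to RobotFileParser.parse(), so the parser never makes a request with the blocked default agent (example.com is a placeholder):
import urllib.request
import urllib.robotparser

UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'

# Fetch robots.txt with a custom User-Agent instead of letting
# RobotFileParser.read() use the default Python-urllib agent.
req = urllib.request.Request('https://example.com/robots.txt',
                             headers={'User-Agent': UA})
with urllib.request.urlopen(req) as resp:
    robots_txt = resp.read().decode('utf-8')

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch(UA, 'https://example.com/some/page'))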

Trouble downloading website using urllib(2) and requests - Bad status line

I'm trying to download pages from the site
http://statsheet.com/
like this
url = 'http://statsheet.com'
urllib2.urlopen(url)
I have tried the Python modules urllib, urllib2, and requests, but I only get error messages like "got a bad status line", "BadStatusLine", or similar.
Is there any way to get around this?
You need to specify a common browser user agent, e.g.
wget -U "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.34 Safari/537.36" http://statsheet.com
Related question/answer:
Changing user agent on urllib2.urlopen
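Since the question was about Python, here is the same idea sketched with requests instead of wget (the user agent string is just an example):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.34 Safari/537.36'
}

response = requests.get('http://statsheet.com', headers=headers)
print(response.status_code)
print(response.text[:200])  # first part of the page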
