Get user browser info in Python Bottle

I'm trying to find out which browsers my users are using, and I'm running into a problem.
If I try to read the "User-Agent" header, it usually gives me lots of text and tells me nothing.
For example, if I visit the site with Chrome, the "User-Agent" header contains:
User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36"
As you can see, this tells me nothing, since it mentions Mozilla, Safari, Chrome etc., even though I visited with Chrome.
The framework I've been using is Bottle (Python).
Any help would be appreciated, thanks.

User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36".
As you can see, this tells me nothing since there is mention of
Mozzila, Safari, Chrome etc.. even though I visited with Chrome.
Your conclusion above is wrong. The UA string tells you many things, including the type and version of the web browser.
The post below explains why both Mozilla and Safari appear in Chrome's UA:
History of the browser user-agent string
You can try to analyze it manually on user-agent-string-db.
There's a Python API for it:
from uasparser2 import UASparser

uas_parser = UASparser()
# Instead of fetching data over the network every time, you can cache the db locally:
# uas_parser = UASparser('/path/to/your/cache/folder', mem_cache_size=1000)
# Updating the data is simple: uas_parser.updateData()
result = uas_parser.parse('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36')
# result
{'os_company': u'',
 'os_company_url': u'',
 'os_family': u'Linux',
 'os_icon': u'linux.png',
 'os_name': u'Linux',
 'os_url': u'http://en.wikipedia.org/wiki/Linux',
 'typ': u'Browser',
 'ua_company': u'Google Inc.',
 'ua_company_url': u'http://www.google.com/',
 'ua_family': u'Chrome',
 'ua_icon': u'chrome.png',
 'ua_info_url': u'http://user-agent-string.info/list-of-ua/browser-detail?browser=Chrome',
 'ua_name': u'Chrome 31.0.1650.57',
 'ua_url': u'http://www.google.com/chrome'}

Thank you everyone for your answers; I found something really simple that works.
Download the httpagentparser module from:
https://pypi.python.org/pypi/httpagentparser
After that, just import it in your Python program:
import httpagentparser
Then you can write a function like this that returns the browser; it works like a charm:
def detectBrowser(request):
    agent = request.environ.get('HTTP_USER_AGENT')
    browser = httpagentparser.detect(agent)
    if not browser:
        browser = agent.split('/')[0]
    else:
        browser = browser['browser']['name']
    return browser
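For example, wired into a Bottle route (a minimal sketch; the route path and port are just for illustration):

from bottle import route, run, request

@route('/whoami')
def whoami():
    # Bottle exposes the WSGI environ on the thread-local request object
    return detectBrowser(request)

run(host='localhost', port=8080)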
That's it.

As you can see, this tells me nothing, since it mentions Mozilla, Safari, Chrome etc., even though I visited with Chrome.
It's not that the User-Agent string tells you "nothing"; it's that it's telling you too much.
If you want a report that breaks down your users' browsers, your best bet is to analyze your logs. Several programs are available to help. (One caveat: if you're using Bottle's "raw" web server, it won't log in Common Log Format out of the box. You have options.)
If you need to know in real time, you'll need to spend time learning user-agent strings (useragentstring.com might help here) or use an API like this one.
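If you do want a quick ad-hoc breakdown yourself, something along these lines works on a combined-format access log, where the user agent is the last quoted field (a rough sketch; the log path and format are assumptions, and it reuses httpagentparser from the answer above):

import collections
import re

import httpagentparser

# In the combined log format the User-Agent is the final quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

counts = collections.Counter()
with open('access.log') as f:  # log path is an assumption
    for line in f:
        match = UA_RE.search(line)
        if not match:
            continue
        detected = httpagentparser.detect(match.group(1))
        counts[detected.get('browser', {}).get('name', 'Unknown')] += 1

for name, total in counts.most_common():
    print(name, total)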

Related

Can we get IP banned even using Selenium?

I am using Python to scrape pages, and so far I haven't had any issues. I use Selenium for this purpose, but I also hear that people get IP banned from some websites. I haven't faced that; those people used the beautifulsoup, lxml and requests libraries...
Selenium feels like a real user driving the browser rather than a bot, but can it also get IP banned from some sites?
I am also setting a user_agent header:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/80.0.3987.132 Safari/537.36'
Yes, it depends on the requests you send to a website. Scraping can get you banned; setting the user agent is a plus, because some websites won't let you in if it isn't set.
If you don't want to get banned, use a proxy IP.
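If it helps, here is one way to set both a proxy and a custom user agent in Selenium with Chrome (a minimal sketch; the proxy address is a placeholder, and the user agent is the one from the question):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Placeholder proxy address -- substitute your own.
options.add_argument('--proxy-server=http://203.0.113.5:8080')
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36')

driver = webdriver.Chrome(options=options)
driver.get('https://example.com')  # illustrative URL
print(driver.title)
driver.quit()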

Python requests in IIS always times out

I have a flask application running on an IIS server. Everything works fine; however, I always get a timeout error when using requests.
import requests
r = requests.get('https://github.com')
Using web services is therefore impossible.
I have tried sending headers with the request, but I get the same result:
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
r = requests.get('https://github.com', headers=headers)
I also tried increasing the timeout limits, both in code and in IIS, and tried changing the Identity field under the Process Model section to LocalSystem.
I'm not familiar with IIS and I cannot think of anything else. I need help.
According to your description, I don't think this issue is related to IIS; it seems to be a network issue.
I suggest you first check your server's firewall to make sure the server can access the internet.
If you need a proxy to access the internet, try adding the settings below to the web.config for your flask application.
<system.net>
  <defaultProxy>
    <proxy
      proxyaddress="The IP address"
      bypassonlocal="true"
    />
  </defaultProxy>
</system.net>
For details, you can see this article.
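Alternatively, if a proxy turns out to be the issue, you can pass it straight to requests in code instead of going through web.config (a sketch; the proxy address is a placeholder):

import requests

# Placeholder proxy address -- replace with your actual proxy.
proxies = {
    'http': 'http://203.0.113.5:8080',
    'https': 'http://203.0.113.5:8080',
}
r = requests.get('https://github.com', proxies=proxies, timeout=10)
print(r.status_code)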

Cookies not showing in Scrapy output even I have enabled it

By default, cookies are enabled in Python Scrapy.
I have this in settings.py:
COOKIES_DEBUG = True
It works in all my other projects and shows the cookies in the terminal when I run the code.
But for one specific project it is not showing the received cookies in the terminal.
I have searched the internet, but I am not sure what to do.
PS:
The website I am scraping does, of course, set cookies; I can see them when I visit the site in a browser.
What could I be missing?
From the discussions with the OP, it appears that this website does not send Set-Cookie headers when Scrapy's default User-Agent string is used.
Changing the User-Agent string to something like this (in settings.py, for example):
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36'
fixes the issue.
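If you'd rather not change the project-wide default, the same override can be scoped to a single spider via custom_settings (a sketch; the spider name and URL are illustrative):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36'),
        'COOKIES_DEBUG': True,
    }

    def start_requests(self):
        yield scrapy.Request('https://example.com', callback=self.parse)

    def parse(self, response):
        self.logger.info('Cookies should now appear in the debug log for %s', response.url)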

urllib getting HTML but missing data

Basically, I'm fetching and displaying HTML; the data I'm looking for displays just fine in a normal browser, but not in an HTML dump made with urllib.
Example URL: https://betfred.mobi/sports/horses/event/4315034.2
Example data: horse names like "She Is No Lady"
It displays just fine in a browser and doesn't need any login, preexisting cookies, or anything.
I thought maybe it was waiting to see an actual user agent or something, but that should be fine as well: I'm setting one, and I've checked that it's working.
# Python 2 / urllib2
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36')]
response = opener.open("https://betfred.mobi/sports/horses/event/4315034.2")
print response.read()
It's showing something all right, and I'm getting an HTML dump of the site, but horse names, for example, are not showing up.
Am I missing something blindingly obvious here?
If you need to handle pages with JavaScript, try WATIR or Selenium; those drive a real web browser and can thus handle any JavaScript. WATIR Classic requires either IE or Firefox with a certain extension installed, and you will see the pages flash on the screen as it works.
At present, Mechanize doesn't handle JavaScript.
Your other option would be understanding what the JavaScript on the offending page does and bypassing it manually, but that seems onerous.
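For example, with Selenium the rendered DOM (after any JavaScript has run) is available as page_source (a minimal sketch; it assumes a Firefox driver is installed):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://betfred.mobi/sports/horses/event/4315034.2')
html = driver.page_source  # the DOM after JavaScript execution, unlike urllib's raw response
driver.quit()
print('She Is No Lady' in html)  # the horse name from the question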

Scrapy crawl blocked with 403/503

I'm running Scrapy 0.24.4, and have encountered quite a few sites that shut down the crawl very quickly, typically within 5 requests. The sites return 403 or 503 for every request, and Scrapy gives up. I'm running through a pool of 100 proxies, with the RotateUserAgentMiddleware enabled.
Does anybody know how a site could identify Scrapy that quickly, even with the proxies and user agents changing? Scrapy doesn't add anything to the request headers that gives it away, does it?
Some sites incorporate JavaScript code that needs to be run.
Scrapy doesn't execute JavaScript, so the web app very quickly knows it's a bot.
http://scraping.pro/javascript-protected-content-scrape/
Try using Selenium for the sites that return 403. If crawling with Selenium works, you can assume the problem is the JavaScript.
I think crunchbase.com uses that kind of protection against scraping.
It appears that the primary problem was not having cookies enabled. Having enabled cookies, I'm having more success now. Thanks.
For me, cookies were already enabled. What fixed it was using another user agent, one that is common.
Replace USER_AGENT in your project's settings.py file with this:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
I simply set AUTOTHROTTLE_ENABLED to True and my script was able to run.
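Putting the suggestions from this thread together, a settings.py along these lines is a reasonable starting point (a sketch combining the fixes above, not a guaranteed recipe):

# settings.py
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36')
COOKIES_ENABLED = True        # Scrapy's default, but explicit here since it mattered above
AUTOTHROTTLE_ENABLED = True   # slow down when the site starts pushing back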
