Use scrapy in simple pages - python

I'm scraping a page that have a simple structure, I use chrome to get an idea of the xpaths I need to use, but in this case isn't working.
I got this kind of xpaths:
/html/body/text()[1]
/html/body/div[9]/p/span[2]/text()
But when I try:
response.xpath('/html/body/div[9]/p/span[2]/text()')
or
response.xpath('/html/body/div[9]/p/span[2]/text()').extract()
I don't get any response, just an empty list

You need to fix your XPath expression. Demo from the Shell:
$ scrapy shell "http://www.bbb.org/boston/business-reviews/appliances-major-dealers/dracut-appliance-center-inc-in-dracut-ma-76793/ReadReviews?page=1&exp=1" -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36"
>>> print(response.xpath("//span[. = 'Comment from the Business']/following-sibling::span/text()").extract_first())
Mr, ********,
Thank you very much for your positive review. It's great to hear your install went smoothly. *** (our sales manager of over 45 years) and *** (Sales for over 10 years) have been notified of this positive response and truly appreciated it. We look forward to service you again in the future!!

Related

Return Google Desktop version with Scrapy

How could I do to get the Google Desktop Version with Scrapy?
The issue is this:
When you scrapy to google, it returns the mobile version. So, I set up an user agent, but now returns a weird html, like the mobile version.
import scrapy
class Searcher(scrapy.Spider):
name='rast'
start_urls=[
'https://google.com/search?q=lawyers'
]
user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
def parse(self, response):
data={}
body=response.css('div').getall()
data['body']=body
yield data
With Selenium/Requests, it's easy, but I've reading that the right way to scrape websites is with Scrapy.
And BTW, the reason I need the Desktop Version is because Google gives some bold terms called variations to the keywords.
Thank you!

Can we get IP banned even using Selenium?

I am using Python to scrape pages. Until now I didn't have any issues. I use Selenium for this purpose, but i also do hear that people get IP banned from some websites. I didn't faced that. Those people used beautifulsoup, lxml and requests libraries...
Selenium feels like a user is using the browser and not the bots, but can it also IP banned from some sites?
I am also using a header user_agent as:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
'Chrome/80.0.3987.132 Safari/537.36'
yes, It depends on the requests you send to a website, usually while datascraping a website can get you banned using the user agent is a plus because some websites wont let you in if that is not set up
if you dont want to get banned use a proxy IP.

Why do I not get any content with python requests get, but still a 200 response?

I'm doing a requests.get(url='url', verify=False), from my django application hosted on an Ubuntu server from AWS, to a url that has a Django Rest Framework. There are no permissions or authentication on the DRF, because I'm the one that made it. I've added headers such as
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}, but wasn't able to get any content.
BUT when I do run ./manage.py shell and run the exact same command, I get the output that I need!
EDIT 1:
So I've started using subprocess.get_output("curl <url> --insecure", shell=True) and it works, but I know this is not a very "nice" way to do things.
I know what the problem was.
My application when it was being deployed was single threaded, not multithreaded.
I changed my worker number and that fixed everything.

Scrapy view returns a blank page

I'm new at Scrapy and I was just trying to scrape http://www.diseasesdatabase.com/
When I type scrapy view http://www.diseasesdatabase.com/, it displays a blank page but if I download the page and do it on the local file, it displays as usual. Why is this happening?
Pretend being a real browser providing a User-Agent header:
scrapy view http://www.diseasesdatabase.com/ -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36"
Worked for me.
Note that -s option here helps to override the built-in USER_AGENT setting.

Get user browser info in Python Bottle

I'm trying to find out which browsers are my users using and I'm running into a problem.
If I try to read header "User-Agent" it usually gives me lots of text, and tells me nothing.
For example, if I visit the site with Chrome, in "User-Agent" header there is:
User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36".
As you can see, this tells me nothing since there is mention of Mozzila, Safari, Chrome etc.. even though I visited with Chrome.
Framework I've been using is Bottle (Python).
Any help would be appreciated, thanks.
User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36".
As you can see, this tells me nothing since there is mention of
Mozzila, Safari, Chrome etc.. even though I visited with Chrome.
Your conclusion above is wrong. The UA tells you many things including the type and version of the web browser.
The post below explains why Mozilla and Safari exist in Chrome's UA.
History of the browser user-agent string
You can try to analyze it manually on user-agent-string-db.
There's a Python API for it.
from uasparser2 import UASparser
uas_parser = UASparser()
# Instead of fecthing data via network every time, you can cache the db in local
# uas_parser = UASparser('/path/to/your/cache/folder', mem_cache_size=1000)
# Updating data is simple: uas_parser.updateData()
result = ua_parser.parse('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36')
# result
{'os_company': u'',
'os_company_url': u'',
'os_family': u'Linux',
'os_icon': u'linux.png',
'os_name': u'Linux',
'os_url': u'http://en.wikipedia.org/wiki/Linux',
'typ': u'Browser',
'ua_company': u'Google Inc.',
'ua_company_url': u'http://www.google.com/',
'ua_family': u'Chrome',
'ua_icon': u'chrome.png',
'ua_info_url': u'http://user-agent-string.info/list-of-ua/browser-detail?browser=Chrome',
'ua_name': u'Chrome 31.0.1650.57',
'ua_url': u'http://www.google.com/chrome'}
Thank you everyone for your answers, I found something really simple that works.
Download httpagentparser module from:
https://pypi.python.org/pypi/httpagentparser
after that, just import it in your pythong program
import httpagentparser
Then you can write a function like this that returns browser, works like a charm:
def detectBrowser(request):
agent = request.environ.get('HTTP_USER_AGENT')
browser = httpagentparser.detect(agent)
if not browser:
browser = agent.split('/')[0]
else:
browser = browser['browser']['name']
return browser
That's it
As you can see, this tells me nothing since there is mention of
Mozzila, Safari, Chrome etc.. even though I visited with Chrome.
It's not that the User Agent string tells you "nothing;" it's that it's telling you too much.
If you want a report that breaks down your users browser, your best bet is to analyze your logs. Several programs are available to help. (One caveat, if you're using Bottle's "raw" web server, is that it won't log in Common Log Format out of the box. You have options.)
If you need to know in real time, you'll need to spend time learning user agent strings (useragentstring.com might help here) or use an API like this one.

Categories