Return Google Desktop version with Scrapy - python

How can I get the desktop version of Google with Scrapy?
The issue is this:
When you request Google with Scrapy, it returns the mobile version. So I set a user agent, but now it returns odd HTML that still looks like the mobile version.
import scrapy
class Searcher(scrapy.Spider):
    name='rast'
    start_urls=[
        'https://google.com/search?q=lawyers'
    ]
    user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
    def parse(self, response):
        data={}
        body=response.css('div').getall()
        data['body']=body
        yield data
With Selenium or Requests it's easy, but I've been reading that the right way to scrape websites is with Scrapy.
And BTW, the reason I need the desktop version is that it shows some terms in bold, called variations, for the keywords.
Thank you!

Related

Python web scraping, requests object hangs

I am trying to scrape the website https://www.nseindia.com/ in Python.
However, when I try to load the website using Requests, the call simply hangs. Below is the code I am using.
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get('https://www.nseindia.com/',headers=headers)
The requests.get call simply hangs, and I'm not sure what I am doing wrong. The same URL works perfectly in Chrome or any other browser.
Appreciate any help.

How do I know which browser is used to crawl in Scrapy framework?

What's my context:
As you know, the HTML structure of a website can differ quite a bit between Chrome, Firefox, and Safari. So when I use a CSS selector to get data from an element, sometimes that tag exists in Chrome but not in the other browsers. Therefore, I want to focus on only one browser to reduce my effort.
When I crawl URLs with the Scrapy framework, I don't know which browser Scrapy uses to crawl data, so I also don't know what kind of HTML response body will be returned. I checked the responses and found that sometimes the structure is the same as what I get from Chrome, but sometimes it's not. It seems that Scrapy uses many different web browsers to crawl data.
What I want:
I want to use only the Chrome browser for crawling data in the Scrapy framework.
The structure of the HTML response body must be the one obtained from Chrome.
What I ask:
Does anyone have any ideas or tips to help me deal with this issue?
Can I configure a WebDriver in the Scrapy framework as Selenium does? (If it's possible, please show me where and how.)
Thank you!
Scrapy does not use a browser. It fetches the HTML and parses it statically, like BeautifulSoup, without running JavaScript. If you want to parse a dynamic (JavaScript-generated) page, use Selenium; if you want, you can then send the page source to Scrapy.
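To illustrate what "static" means here, the following is a minimal sketch using only Python's standard-library html.parser rather than Scrapy itself. The page markup, the DivTextCollector class, and the results_text helper are made up for this example. The container that a browser would fill in via JavaScript stays empty, because the script is never executed:

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container plus a script
# that would populate it in a real browser.
RAW_HTML = """
<html><body>
  <div id="results"></div>
  <script>document.getElementById("results").textContent = "42 lawyers";</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collects the text found inside <div id="results">."""

    def __init__(self):
        super().__init__()
        self.inside_results = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "results":
            self.inside_results = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_results = False

    def handle_data(self, data):
        # Script source arrives here too, but only as inert text outside the div.
        if self.inside_results:
            self.text.append(data)


def results_text(html):
    """Return the text a static parser sees inside the results container."""
    parser = DivTextCollector()
    parser.feed(html)
    return "".join(parser.text).strip()


print(repr(results_text(RAW_HTML)))  # prints ''
```

A browser (or Selenium driving one) would show "42 lawyers" for the same page, which is why the response body seen by a static client can differ from what Chrome's inspector shows.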
To make Scrapy send a custom (Chrome) user agent, add this to settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
or, in my_spider.py:
import scrapy
class MySpider(scrapy.Spider):
    def start_requests(self):
        # scrapy.Request takes a single URL, so loop over start_urls
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, headers={"User-Agent": "Your Custom User Agent"})
You can set the user agent in your settings file, something like this:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
To the web server, the request will then look like it comes from Chrome.

Can't access this webpage

I am trying to write a Python parser for this website: 'http://www.topuniversities.com/university-rankings/world-university-rankings/2015#sorting=rank+region=+country=+faculty=+stars=false+search='
Every time I do a regular urlopen and print the result, it says
'Access denied | www.topuniversities.com used CloudFlare to restrict access'.
Then I tried this method:
from urllib import FancyURLopener  # Python 2 import; the snippet uses Python 2's print statement
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
url = 'http://www.topuniversities.com/university-rankings/world-university-rankings/2015#sorting=rank+region=+country=+faculty=+stars=false+search='
myopener = MyOpener()
page = myopener.open(url).read()
print page
But this prints something other than what Chrome's Inspect Element shows. I need to parse the names of the universities, their rankings, and the URL that leads to each university's page.
What should I do? Please help.

Use scrapy in simple pages

I'm scraping a page that has a simple structure. I used Chrome to get an idea of the XPaths I need, but in this case they aren't working.
I got XPaths like these:
/html/body/text()[1]
/html/body/div[9]/p/span[2]/text()
But when I try:
response.xpath('/html/body/div[9]/p/span[2]/text()')
or
response.xpath('/html/body/div[9]/p/span[2]/text()').extract()
I don't get any results, just an empty list.
You need to fix your XPath expression. Demo from the Shell:
$ scrapy shell "http://www.bbb.org/boston/business-reviews/appliances-major-dealers/dracut-appliance-center-inc-in-dracut-ma-76793/ReadReviews?page=1&exp=1" -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36"
>>> print(response.xpath("//span[. = 'Comment from the Business']/following-sibling::span/text()").extract_first())
Mr, ********,
Thank you very much for your positive review. It's great to hear your install went smoothly. *** (our sales manager of over 45 years) and *** (Sales for over 10 years) have been notified of this positive response and truly appreciated it. We look forward to service you again in the future!!

Scrapy view returns a blank page

I'm new at Scrapy and I was just trying to scrape http://www.diseasesdatabase.com/
When I type scrapy view http://www.diseasesdatabase.com/, it displays a blank page but if I download the page and do it on the local file, it displays as usual. Why is this happening?
Pretend to be a real browser by providing a User-Agent header:
scrapy view http://www.diseasesdatabase.com/ -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36"
Worked for me.
Note that the -s option here overrides the built-in USER_AGENT setting.
