I'm new at Scrapy and I was just trying to scrape http://www.diseasesdatabase.com/
When I type scrapy view http://www.diseasesdatabase.com/, it displays a blank page but if I download the page and do it on the local file, it displays as usual. Why is this happening?
Pretend to be a real browser by providing a User-Agent header:
scrapy view http://www.diseasesdatabase.com/ -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36"
Worked for me.
Note that the -s option here overrides the built-in USER_AGENT setting for this one command.
Related
I am writing code that has to use a headless browser, but to access a specific website I also need to send a user agent. I currently do this with the following snippet (Python/Selenium/ChromeDriver).
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("--headless")
opts.add_argument("--no-sandbox")
opts.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
But I want the user agent to be genuine rather than the same for every browser/device where the code runs, so I need to know the user agent of the browser on the user's device.
So is there any way to find a browser's user agent using Python/Selenium code or the command prompt?
httpagentparser extracts OS, browser, and similar information from an HTTP user agent string, so try this:
import httpagentparser as agent
s = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
print(agent.detect(s))
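If installing httpagentparser is not an option, a rough standard-library sketch of the same idea could look like the following. The helper names product_tokens and platform_comment are hypothetical, and the regular expressions only assume the common "Name/version" and "(platform; details)" layout of browser user agent strings; this is not how httpagentparser itself works.

```python
import re

# Sample string taken from the question above.
UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
)


def product_tokens(user_agent):
    """Return {product: version} for every Name/1.2.3 token in the string."""
    return dict(re.findall(r"([A-Za-z]+)/([\d.]+)", user_agent))


def platform_comment(user_agent):
    """Return the first parenthesised comment, which usually names the OS."""
    match = re.search(r"\(([^)]*)\)", user_agent)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(product_tokens(UA).get("Chrome"))
    print(platform_comment(UA))
```

This only splits the string into its parts; it makes no attempt to decide which product token is the "real" browser, which is the kind of heuristic a dedicated parser adds on top.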
I am trying to scrape the website https://www.nseindia.com/ in Python.
However, when I try to load the website using Requests, the call simply hangs. Below is the code I am using.
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get('https://www.nseindia.com/',headers=headers)
The requests.get call simply hangs, and I'm not sure what I am doing wrong here. The same URL works perfectly in Chrome or any other browser.
Appreciate any help.
How can I get the Google desktop version with Scrapy?
The issue is this:
When you send a Scrapy request to Google, it returns the mobile version. So I set a user agent, but now it returns strange HTML that still looks like the mobile version.
import scrapy

class Searcher(scrapy.Spider):
    name = 'rast'
    start_urls = [
        'https://google.com/search?q=lawyers'
    ]
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'

    def parse(self, response):
        data = {}
        body = response.css('div').getall()
        data['body'] = body
        yield data
With Selenium/Requests it's easy, but I've been reading that the right way to scrape websites is with Scrapy.
And BTW, the reason I need the desktop version is that Google shows some bolded terms, called variations, for the keywords.
Thank you!
What's my context:
As you know, the HTML structure of a website can differ between Chrome, Firefox, and Safari. So when I use a CSS selector to get data from an element, sometimes the tag exists in the HTML served to Chrome but not in the HTML served to another browser. I therefore want to focus on a single browser to reduce my effort.
When I crawl URLs with the Scrapy framework, I don't know which browser Scrapy uses, so I also don't know what kind of HTML response body will be returned. I checked the responses and found that the structure sometimes matches what Chrome receives and sometimes does not. It seems as if Scrapy uses many different web browsers to crawl data.
What I want:
I want to use only Chrome browser for crawling data in Scrapy framework
The structure of the HTML response body must be obtained from Chrome
What I ask:
Does anyone have any ideas or tips to help me deal with this issue?
Can I configure a WebDriver in Scrapy the way Selenium does? (If it's possible, please show me where and how.)
Thank you!
Scrapy does not use a browser; it downloads and parses static HTML, much like BeautifulSoup. If you want to parse a dynamic (JavaScript-generated) page, use Selenium, and if you want you can pass the page source to Scrapy.
To make Scrapy send a custom (Chrome) user agent, add this to settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
or set it per request in my_spider.py:
class MySpider(scrapy.Spider):
    def start_requests(self):
        # start_urls is a list, so issue one request per URL
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, headers={"User-Agent": "Your Custom User Agent"})
You can set the user agent in your settings file, something like this:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
To the web server, the request will then look as if it came from Chrome.
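To see what a server actually receives, here is a minimal standard-library sketch that does not involve Scrapy at all: it starts a throwaway local server that echoes back the User-Agent header, then sends it one request. The names EchoUserAgent and seen_by_server are hypothetical, and the sketch assumes loopback connections on 127.0.0.1 are allowed on the machine running it.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
)


class EchoUserAgent(BaseHTTPRequestHandler):
    """Hypothetical handler: replies with the User-Agent value it received."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def seen_by_server(user_agent):
    """Start a local server, send one request, return the header it saw."""
    server = HTTPServer(("127.0.0.1", 0), EchoUserAgent)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        # An empty ProxyHandler keeps the request off any configured proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
        thread.join()


if __name__ == "__main__":
    print(seen_by_server(CHROME_UA))
```

The server sees the header value verbatim, so a value that itself begins with the literal text "User-Agent: " would be echoed back with that prefix included, which no real browser sends.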
I'm trying to download pages from the site
http://statsheet.com/
like this
url = 'http://statsheet.com'
urllib2.urlopen(url)
I have tried the Python modules urllib, urllib2 and "requests", but I only get error messages like "got a bad status line", "BadStatusLine" or similar.
Is there any way to get around this?
You need to specify a common browser user agent, e.g.
wget -U "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.34 Safari/537.36" http://statsheet.com
Related question/answer:
Changing user agent on urllib2.urlopen
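For Python 3, where urllib2 became urllib.request, a minimal sketch of the same idea might look like this. The helper names build_request and fetch are hypothetical, the user agent string is the one from the wget example, and whether the site accepts it today is an assumption.

```python
import urllib.request

# Same browser-like user agent as in the wget example.
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/31.0.1650.34 Safari/537.36"
)


def build_request(url, user_agent=CHROME_UA):
    """Create a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download the page body; the timeout keeps a silent server from hanging the call."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    request = build_request("http://statsheet.com")
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
```

Calling fetch("http://statsheet.com") would perform the actual download; it is left out of the demo section so the sketch runs without network access.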