Scrapy shell doesn't crawl web page - python

I am trying to use Scrapy shell to figure out the selectors for zone-h.org. I run scrapy shell 'webpage' and afterwards try to view the content to be sure it was downloaded, but all I can see is a dash (-); the page isn't downloaded. I visited the website in a browser to check whether my connection was somehow blocked, but it was reachable. I also tried setting the user agent to something more generic, like Chrome, but no luck there either. The website is blocking me somehow, but I don't know how to bypass it. I dug through the website to see whether crawling is forbidden, and nothing says it is. Can anyone help out?

There is a cookie issue with your spider; if you send your cookies with the request, you will get the desired data.
You can see that in the attached picture.
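For illustration, a minimal sketch of sending cookies explicitly from the Scrapy shell follows; the cookie name and value are placeholders you would copy from your own browser session, not confirmed values for zone-h.org:
from scrapy import Request

url = "http://zone-h.org/archive"
cookies = {"PHPSESSID": "value-copied-from-your-browser"}  # placeholder cookie

# Build the request yourself and run it with the shell's fetch() shortcut
req = Request(url, cookies=cookies)
fetch(req)
view(response)  # check in the browser that the real page came back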

Can you use scrapy shell "webpage" on another webpage that you know works/doesn't block scraping?
Have you tried using the view(response) command to open up what scrapy sees in a web browser?
When you go to the webpage using a normal browser, are you redirected to another, final homepage?
- if so, try using the final homepage's URL in your scrapy shell command
Do you have firewalls that could prevent a Python/command-line app from connecting to the internet?

Related

Source code is not complete because "JS is disabled in your browser"

I'm writing Python code to first get the full source code of a web page and later scrape it. But when I try to get the source code, I see the aforementioned message ("If you're seeing this message, that means JavaScript has been disabled on your browser, please enable JS to make this app work") with partial HTML code. When I press F12 to inspect the elements, the entire code appears; meanwhile, pressing Ctrl + U to view the source yields the same result as fetching it with the Python script below:
import requests
from bs4 import BeautifulSoup

source = requests.get(link).text
soup = BeautifulSoup(source, 'lxml').prettify()
I've seen similar questions to mine, but none of them had a satisfactory solution. For example, it was recommended to use Selenium to open the page and then work with it, but that would take additional time. JS is enabled in my browser.
It is as you have seen in the other answers: you have to use Selenium (or another browser automation tool) to enable JavaScript rendering. The web page you are trying to access uses client-side rendering, which means that the first thing it sends when you access the URL is a bunch of JavaScript code. The browser then executes that JavaScript to build the DOM of the page.
You say that JavaScript is enabled in your browser, but that has nothing to do with your Python code. The library you are using, requests, sends an HTTP GET request to the server to fetch the page, and the server replies as it would to any other request: with the JavaScript that knows how to render the page. That's why you need something like Selenium, which runs a real browser instead of making a simple HTTP request.
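A minimal sketch of that approach, assuming Selenium 4 with Chrome installed locally ('link' is the page URL variable from the question):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")  # render without opening a window

driver = webdriver.Chrome(options=options)
driver.get(link)
html = driver.page_source  # the DOM after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "lxml")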

Scrapy - Page Access Denied

I am looking for some help scraping www.mobile.de; I get an "Access Denied" page.
A regular spider results in the attached picture.
So far I have tried/recognized:
- I am not blocked, since I can open the page in Firefox/Chrome
- I allowed cookies
- I used the same headers Firefox currently uses
- I used a referer
- I enabled/disabled "Obey robots.txt"
- I used Splash to activate/render JavaScript
So right now, I cannot conclude how the page detects that my program is a bot or how to avoid that.
https://ibb.co/7RsMkM3
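For reference, the mitigations listed above map to standard Scrapy settings. A sketch of how they might be wired into a spider follows; the header values are illustrative assumptions, not a confirmed fix for mobile.de:
import scrapy

class MobileDeSpider(scrapy.Spider):
    name = "mobile_de"
    start_urls = ["https://www.mobile.de/"]

    custom_settings = {
        "ROBOTSTXT_OBEY": False,   # the "Obey robots.txt" toggle
        "COOKIES_ENABLED": True,   # "I allowed cookies"
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
        "DEFAULT_REQUEST_HEADERS": {
            "Referer": "https://www.mobile.de/",  # "I used a referer"
            "Accept-Language": "de,en;q=0.7",
        },
    }

    def parse(self, response):
        self.logger.info("Got status %s", response.status)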

empty list response extract on scrapy

I'm new to Scrapy and I have to crawl a webpage for a test. So I use the code below in a terminal, but it returns an empty list and I don't understand why. When I use the same command on another website, like Amazon, with the right selector, it works. Can someone shed light on it? Thank you so much.
scrapy shell "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"
response.css('.tileList-title').extract()
First of all, when I consulted the source code of the page, you seem to be interested in scraping the title Iced Teas, which sits in an <h1> header tag. Am I right?
Second, I tried some scrapy shell sessions to understand the issue. It seems to be a matter of setting the user-agent request header. Look at the shell sessions below:
Without user-agent set
scrapy shell https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas
In [1]: response.css('.tileList-title').extract()
Out[1]: []
view(response) #open the given response in your local web browser, for inspection.
With user-agent set
scrapy shell https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas -s USER_AGENT='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
In [1]: response.css('.tileList-title').extract()
Out[1]: ['<h1 class="tileList-title" ng-if="$ctrl.listTitle" tabindex="-1">Iced Teas</h1>']
#now as you can see it does not return an empty list.
view(response)
So to improve your future practice, know that you can use -s KEYWORDSETTING=value in your scrapy shell sessions. Here are the settings keywords for Scrapy.
Also, check with view(response) whether the request returned the expected content, even if it came back with a 200. In my experience, with view(response) you can see that the page content, and sometimes even the source code, differs slightly between a scrapy shell session and a normal browser. So it's good practice to check with this shortcut. Here are the shortcuts for scrapy shell; they are also mentioned at the start of each scrapy shell session.

Scrape aspx site with python

I want to download Supreme Court cases. Below is the code I am trying:
import requests

page = requests.get('http://judis.nic.in/supremecourt/Chrseq.aspx').text
I am getting below contents in page:
u'<html><p><hr></hr></p><b><center>The Problem may be due to 500 Server Error/404 Page Not Found.Please contact your system administrator.</center></b><p><hr></hr></p></html><!--0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234-->\r\n'
Is the site not scrapable or do I need to use some other method?
I checked this answer: How to scrape aspx pages with python, but the solution uses Selenium.
Is it possible to do it in python and Beautiful soup?
The reason is that you are hitting a URL which may no longer be served by the server. I am able to get data from all pages. I checked the response from scrapy shell with
scrapy shell "http://judis.nic.in/supremecourt/chejudis.asp"
and using XPath you can retrieve whatever data you want from that page.
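Since the question asked about requests and Beautiful Soup specifically, a minimal sketch against the URL that responds is below; the parsing step is a placeholder, as the page structure isn't shown here:
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://judis.nic.in/supremecourt/chejudis.asp")
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "lxml")
print(soup.title)  # sanity check that real HTML came back, not an error page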
I'm not able to open the website through my browser; I'm getting the same response there. Maybe that's why you're getting that response back.

Scrapy for dynamic content

Can we use Scrapy to get content from a web page that is loaded by JavaScript?
I'm trying to scrape usage examples from this page,
but since they are loaded via JavaScript as a JSON object, I'm not able to get them with Scrapy.
Could you suggest what is the best way to deal with such issues?
Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough times, it sends out a new request.
After removing the JSONP parameter, the URL is pretty straightforward:
https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0
By making the minimal number of requests, your spider will be fast.
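A sketch of a spider that hits that JSON endpoint directly follows; the key names accessed in parse() are assumptions about the API's response shape, so adjust them to what you actually see in the Network tab:
import json
import scrapy

class ExamplesSpider(scrapy.Spider):
    name = "vocab_examples"
    start_urls = [
        "https://corpus.vocabulary.com/api/1.0/examples.json"
        "?query=unalienable&maxResults=24&startOffset=0&filter=0"
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # "result"/"sentences" are assumed field names; inspect the real JSON first
        for item in data.get("result", {}).get("sentences", []):
            yield {"sentence": item.get("sentence")}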
If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).
