scrapy fetching incomplete html

scrapy fetching incomplete html - python

I'm a retired programmer but new to scrapy. Actually, this is my first python project so I could be doing anything wrong.
I brought up scrapy under anaconda and started a shell with :
scrapy shell "https://sailing-channels.com/by-subscribers"
Looks like everything is working fine and I can get some querys to work.
Here is my problem:
when I enter :
response.css('body').extract()
I get:['<body><noscript>If you\'re seeing this message, that means <strong>JavaScript has been disabled on your browser</strong>, please <strong>enable JS</strong> to make this app work.</noscript><div id="app"></div><script src="//apis.google.com/js/platform.js" async></script><script>!function(e,a,n,t,g,c,i){e.GoogleAnalyticsObject="ga",e.ga=e.ga||function(){(e.ga.q=e.ga.q||[]).push(arguments)},e.ga.l=1*new Date,c=a.createElement(n),i=a.getElementsByTagName(n)[0],c.async=1,c.src="//www.google-analytics.com/analytics.js",i.parentNode.insertBefore(c,i)}(window,document,"script"),ga("create","UA-15981085-17","auto"),ga("require","linkid"),ga("set","anonymizeIp",!0),ga("send","pageview")</script><script type="application/ld+json">{\n\t\t\t"#context": "http://schema.org",\n\t\t\t"#type": "Organization",\n\t\t\t"name": "Sailing Channels"\n\t\t\t"url": "https://www.sailing-channels.com",\n\t\t\t"logo": "https://sailing-channels.com/img/banner.png",\n\t\t\t"sameAs" : [\n\t\t\t\t"https://www.facebook.com/sailingchannels",\n\t\t\t\t"https://twitter.com/sailchannels"\n\t\t\t]\n\t }</script><script type="text/javascript" src="https://cdn.sailing-channels.com/1.15.9/main.1dad65fcb7a507930e1f.js"></script></body>']
My problem is I expect a lot more. When I do an inspect on chrome I see a lot more /div sections inside <div id="app"></div>
Could someone shine some light on what I'm doing wrong? I want to scrape the channel name, subscriber count, and views
Thanks

Understandable. It is because of they rendering the data through another script during loading of the page.
In normal scrapy setting, dynamic page loading content doesn't appear. For scraping that data you can use selenium.
selenium-with-scrapy-for-dynamic-page
For an alternative way, you can use splash for handling javascript enabled content.handling-javascript-in-scrapy-with-splash

Related

How to Bypass confirm age model through scrapy

If you go to the site, you'd notice that there is an age confirmation window which I want to bypass through scrapy but I messed up with that and I had to move on to selenium webdriver and now I'm using
driver.find_element_by_xpath('xpath').click()
to bypass that age confirmation window. Honestly I don't want to go with selenium webdriver because of its time consumption. Is there any way to bypass that window?
I searched a lot in stackoverflow and google
but didn't get any answer which may resolves my problem. If you've any link or idea of resolving it by Scrapy, that'd be appreciated. A single helpful comment will be up-voted!

To expand on Chillie's answer.
The age verification is irrelavant here. The data you are looking for is loaded via AJAX request:
See related question: Can scrapy be used to scrape dynamic content from websites that are using AJAX? to understand how they work.
You need to figure out how https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1&x-algolia-application-id=NS5BWTAI8M&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c url works and how can you retrieve in it scrapy.

But the age verification "window" is just a div that gets hidden when you press the button, not a real separate window:
<div class="age-check-modal" id="age-check-modal">
You can use the browser's Network tab in developer tools to see that no new info is uploaded or sent when you press the button. So everything is already loaded when you request a page. The "popup" is not even a popup, just an element whose display is changed to none when you click the button.
So Scrapy doesn't really care what's meant to be displayed as long as all html is loaded. If the elements are loaded, they are accessible. Or have you seen some information being unavailable without pressing the button?
You should inspect the html code more to see what each website does, this might make your scraping tasks easier.
Edit: After inspecting the original html you can see the following:
<div class="products-list">
<div class="products-container-block">
<div class="products-container">
<div id="hits" class='row'>
</div>
</div>
</div>
</div>
You can also see a lot of JS script tags.
The browser element inspector shows us the following:
The ::before part gives away that this was manipulated by JS, as you cannot do this with simple CSS. See Granitosaurus' answer for details on this.
What this means is that you need to somehow execute the arbitrary JS code on those pages. So you either need a solution with Scrapy, or just use Selenium, as many do, and as you already have.

Scraping webpage generated by javascript

I have a problem getting javascript content into HTML to use it for scripting. I used multiple methods as phantomjs or python QT library and they all get most of the content in nicely but the problem is that there are javascript buttons inside the page like this:
Pls see screenshot here
Now when I load this page from a script these buttons won't default to any value so I am getting back 0 for all SELL/NEUTRAL/BUY values below. Is there a way to set these values when you load the page from a script?
Example page with all the values is: https://www.tradingview.com/symbols/NEBLBTC/technicals/
Any help would be greatly appreciated.

If you are trying to achieve this with scrapy or with derivation of cURL or urrlib I am afraid that you can't do this. Python has another external packages such selenium that allow you to interact with the javascript of the page, but the problem with selenium is too slow, if you want something similar to scrapy you could check how the site works (as i can see it works through ajax or websockets) and fetch the info that you want through urllib, like you would do with an API.
Please let me know if you understand me or i misunderstood your question

I used seleneum which was perfect for this job, it is indeed slow but fits my purpose. I also used the seleneum firefox plugin to generate the python script as it was very challenging to find where exactly in the code as the button I had to press.

How to read a HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using python and beautiful soup. I encountered that in some sites, the image links although seen on the browser is cannot be seen in the source code. However on using Chrome Inspect or Fiddler, we can see the the corresponding codes.
What I see in the source code is:
<div id="cntnt"></div>
But on Chrome Inspect, I can see a whole bunch of HTML\CSS code generated within this div class. Is there a way to load the generated content also within python? I am using the regular urllib in python and I am able to get the source but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !

You need JavaScript Engine to parse and run JavaScript code inside the page.
There are a bunch of headless browsers that can help you
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

The Content of the website may be generated after load via javascript, In order to obtain the generated script via python refer to this answer

A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you rather need a Headless browser that would also generate the DOM, load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products of those are abandoned now.

TRY THIS FIRST!
Perhaps the data technically could be in the javascript itself and all this javascript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention attention on any/all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)

Trying to view html on yikyak.com, get "browser out of date" page

Question: yikyak.com returns some sort of "browser not supported" landing page when I try to view source code in chrome (even for the page I'm logged in on) or when I write it out to the Python terminal. Why is this and what can I do to get around it?
Edit for clarification: I'm using the chrome webdriver. I can navigate around the yik yak website by clicking on it just fine. But whenever I try to see what html is on the page, I get an html page for a "browser not reported" page.
Background: I'm trying to access yikyak.com with selenium for python to download yaks and do fun things with them. I know fairly little about web programming.
Thanks!
Secondary, less important question: If you're already here, are there particularly great free resources for a super-quick intro to the certification knowledge I need to store logins and stuff like that to use my logged in account? That would be awesome.

I figured it out. I was being dumb. I saved off the html as a file and opened that file with chrome and it displayed the normal page. I just didn't see the fact that it was a normal page looking at it directly. Thanks all 15 people for your time.

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.

The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.

Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

scrapy fetching incomplete html - python

Related

How to Bypass confirm age model through scrapy

Scraping webpage generated by javascript

How to read a HTML page that takes some time to load? [duplicate]

Trying to view html on yikyak.com, get "browser out of date" page

Parsing from a website -- source code does not contain the info I need

Categories

Resources