In general, it is possible to download the video by right-clicking, but I don't know how to find this video's src using Selenium.
<div id="video-processing" class="video-processing hidden">Processing video, please check back in a while</div>
<video id="video-player" class="video-js vjs-default-skin hidden" controls preload="auto" width="640" height="264" poster="http://i3.ruliweb.com/profile/16/12/01/158b9f7cb02326425.jpeg"></video>
I need your help. Thanks.
I had a problem like yours.
If you cannot find the src by inspecting in the browser, or in the page source fetched with Python (for various reasons, such as serving several video qualities, or for security and policy reasons), then it is one hundred percent being set by JavaScript; otherwise it would be visible in the HTML source.
So I have a very simple recommendation, and that is to run this code in the console:
var x = document.getElementById("video-player").getAttribute("src");
console.log(x);
alert(x);
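If you want the same thing from Selenium in Python, here is a minimal sketch (assuming chromedriver is installed; the URL is a placeholder, since the question doesn't give the site address):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://example.com/video-page")  # placeholder URL

# Wait for the player, since the src may be set late by JavaScript.
player = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "video-player"))
)
src = player.get_attribute("src")

# Some players put the stream in a nested <source> tag instead.
if not src:
    src = driver.execute_script(
        "var s = document.querySelector('#video-player source');"
        "return s ? s.src : null;"
    )
print(src)
driver.quit()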
Of course, it would be much better if you gave the site address.
If you go to the site, you'll notice that there is an age confirmation window which I want to bypass through Scrapy. I messed that up and had to move on to Selenium WebDriver, and now I'm using
driver.find_element_by_xpath('xpath').click()
to bypass that age confirmation window. Honestly, I don't want to go with Selenium WebDriver because of its time consumption. Is there any way to bypass that window?
I searched a lot on Stack Overflow and Google but didn't get any answer that resolves my problem. If you have any link or idea for resolving it with Scrapy, that'd be appreciated. A single helpful comment will be up-voted!
To expand on Chillie's answer.
The age verification is irrelevant here. The data you are looking for is loaded via an AJAX request.
See the related question Can scrapy be used to scrape dynamic content from websites that are using AJAX? to understand how such requests work.
You need to figure out how the https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1&x-algolia-application-id=NS5BWTAI8M&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c url works and how you can retrieve it in Scrapy.
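As a rough sketch of how that might look in Scrapy (the index name and query parameters below are guesses; copy the real JSON payload from the request shown in the browser's Network tab):

import json
import scrapy

class AlgoliaSpider(scrapy.Spider):
    name = "algolia"

    def start_requests(self):
        url = ("https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries"
               "?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1"
               "&x-algolia-application-id=NS5BWTAI8M"
               "&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c")
        # "products" is a hypothetical index name -- use the one from the
        # real request payload.
        payload = {"requests": [{"indexName": "products",
                                 "params": "query=&hitsPerPage=20"}]}
        yield scrapy.Request(url, method="POST", body=json.dumps(payload),
                             headers={"Content-Type": "application/json"},
                             callback=self.parse_hits)

    def parse_hits(self, response):
        for hit in json.loads(response.text)["results"][0]["hits"]:
            yield hit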
But the age verification "window" is just a div that gets hidden when you press the button, not a real separate window:
<div class="age-check-modal" id="age-check-modal">
You can use the browser's Network tab in the developer tools to see that no new information is requested or sent when you press the button. So everything is already loaded when you request the page. The "popup" is not even a popup, just an element whose display is changed to none when you click the button.
So Scrapy doesn't really care what's meant to be displayed, as long as all the HTML is loaded. If the elements are loaded, they are accessible. Or have you seen some information being unavailable without pressing the button?
You should inspect the HTML more closely to see what each website does; this can make your scraping tasks easier.
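For example, inside a Scrapy callback you can select the "hidden" modal directly, because it is present in the raw HTML even though CSS hides it (a quick sketch):

def parse(self, response):
    # The age-check div is in the page source; display:none only
    # affects rendering, not what Scrapy receives.
    modal = response.css('div#age-check-modal').get()
    self.logger.info('modal found: %s', modal is not None)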
Edit: After inspecting the original html you can see the following:
<div class="products-list">
<div class="products-container-block">
<div class="products-container">
<div id="hits" class='row'>
</div>
</div>
</div>
</div>
You can also see a lot of JS script tags.
The browser element inspector, however, shows the #hits div filled with product elements and a ::before pseudo-element.
The ::before part gives away that this was manipulated by JS, as you cannot do this with simple CSS. See Granitosaurus' answer for details on this.
What this means is that you need to somehow execute the JS code on those pages. So you either need a solution within Scrapy, or you can just use Selenium, as many do, and as you already are.
I am trying to scrape a web site using Python and Beautiful Soup. I have encountered sites where the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content as well within Python? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
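For instance, here is a minimal sketch with Selenium driving headless Chrome (assuming chromedriver is installed) and handing the rendered DOM to Beautiful Soup; the URL is a placeholder:

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("http://example.com/page")  # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")  # DOM after JS ran
driver.quit()

print(soup.select_one("#cntnt"))  # the generated content is now present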
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that would also build the DOM and load and run the scripts, like a regular browser would. The Wikipedia article and some other pages on the net have lists of those and their capabilities.
Keep in mind when choosing that some formerly major products in this space are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, in which case all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
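A minimal sketch of re-creating such a request with the requests library (the URL and parameters are placeholders; copy the real ones from your traffic logger):

import requests

headers = {
    # Pretend to be a "real" browser, as some servers check this.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "X-Requested-With": "XMLHttpRequest",  # many sites mark AJAX calls this way
}
resp = requests.get("http://example.com/api/data", params={"id": 123},
                    headers=headers)
resp.raise_for_status()
print(resp.json())  # the data the page's JavaScript would have rendered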
I am trying to include a YouTube video on a website that I'm developing using GAE and Python.
I know I should use this in my HTML:
<iframe width="420" height="345"
src="http://www.youtube.com/watch?v=MYSVMgRr6pw">
</iframe>
but I am also guessing I have to make some changes in the app.yaml file. I can't figure out how to amend my app.yaml correctly. Currently I can only see a square box and no video. Here is a link to a web page with a video: http://www.firstpiproject.appspot.com/learninglinux
Thanks
I believe, per http://www.w3schools.com/html/html_youtube.asp, that the canonical form is something like, and I quote:
<iframe width="420" height="315"
src="http://www.youtube.com/embed/XGSy3_Czz8k">
</iframe>
Note the slightly different format for the src= URL, with .../embed/ -- your page has src="http://www.youtube.com/watch?v=hBvaB8aAp1I&feature=youtu.be", which is a somewhat-different format.
I don't think this has anything to do with App Engine, python, app.yaml, and the like -- it's all about what, exactly, you put in that src= parameter of the iframe you serve as part of your HTML page. Try the w3schools-recommended format with .../embed/... and let us know!
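Concretely, using the video ID from your page, that would be:

<iframe width="420" height="315"
src="http://www.youtube.com/embed/hBvaB8aAp1I">
</iframe>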
Preferably using Python (currently using urllib and BeautifulSoup), given a URL.
For example I'm trying to scrape the main photo on this page: http://www.marcjacobs.com/marc-jacobs/womens/bags-and-accessories/c3122001/the-single#?p=1&s=12
In Firefox under Tools > Page Info > Media lists all the visible images, including the link to the image I want to grab ( http://imagesec.mj.ctscdn.com/image/336/504/6ace6aac-c049-4d7e-9465-c19b5cd8e4ac.jpg )
Two interrelated problems:

1. If I do a view source, the image path retrieved from the Firefox tool is not found in the HTML document. Is there any way I can retrieve this path without going through Firefox Page Info? Perhaps through either Python and/or Javascript/jQuery?

2. I'm trying to get the photo of the product in "Orange", and notice the page always loads the black color by default.
A working example is probably Google Shopping: if you type the name of this product and select a color, the image shows up in the correct color (from the exact same page) in the search results.
Basically, I want to be able to scrape color- and style/variation-specific images from mostly shopping sites.
Selecting the right color seems more complicated, so in that case I'll settle for just the main product image in black for now...
So far I've tried selecting images based on img height tags, and also trying to read dimensions when there are no height/width tags... but it occurred to me there has to be a better way.
This can be a bit complex, but most of the solutions that work in this particular situation are pretty much the same.
First, let me tell you why using Beautiful Soup or lxml won't work. You need to retrieve some information which is available only after you click on the orange bag thumbnail, right? That is loaded using JavaScript, so the orange bag image won't be available to Beautiful Soup and friends (because they don't parse JavaScript, nor elements that are absent from the parsed tree).
So that is a dead end.
However, there are other screen-scraping tools, like Selenium or PhantomJS. I have tested both and they work great. They basically integrate a browser, so they obviously are capable of managing JavaScript. I don't know if you need to scrape this automatically from your server or you want to start the scraping process at will. With Selenium (after you tell it what page you want to open, what element you want to click, etc.), you will see your browser doing all that stuff by itself. There are other options available, such as using a headless browser. In my opinion it's very powerful, but it can get quite complex to get working.
A far easier solution is using PhantomJS. It's similar to Selenium although, as its name indicates, you give the instructions via JavaScript (which can be a bit more comfortable since you're already dealing with web elements). I recommend you use CasperJS: it eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks...
Let me give you a feel for what it looks like:
var casper = require('casper').create({
    verbose: true
});

casper.start('yourwebpage'); // loading the webpage

casper.then(function(){ // after loading...
    var value = this.evaluate(function(){ // get me some element's value
        return document.getElementById('yourelement').value;
    });
    this.echo(value);
});

casper.then(function(){ // after that, click on this other element
    this.click('#id_of_other_element');
});

casper.wait(7000); // wait for some processing... this can be quite
                   // useful if you need to wait a few seconds in
                   // order to retrieve your orange bag later

casper.run(); // actually runs the whole thing
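Save that as, say, scrape.js (any filename works) and run it from the command line with:

casperjs scrape.js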
There you have most of the things you need to accomplish your task.
By the way, let me remind you that you usually need to ask for permission to retrieve that kind of content.
Hope that helps.
I want to save a web page. I use Python urllib to download the web page, but in the saved file some content is missing. The missing part is a block from the source web page, such as <div style="display: block;" id="GeneInts">...</div>.
I don't know how to fetch a whole page without blocks missing from it. Could you help me figure it out? Thank you!
This is my program:
import urllib

url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor'
f = urllib.urlretrieve(url, 'test.html')
Whenever I need to let JavaScript operate on a page before I can scrape it, the first thing I always turn to is SeleniumRC -- while it's mainly designed for testing purposes, I've never found a better tool for this challenging task. For the "using it from Python" part, see here and the links therefrom.
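With today's Selenium WebDriver (the modern successor to SeleniumRC) the same idea looks roughly like this sketch, assuming chromedriver is available:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp'
           '?Param=4502931&ProtId=1&ProtType=Receptor')

# page_source holds the DOM after the load-time JavaScript has run
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)
driver.quit()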
That page generates a great deal of its content with JavaScript executed at load-time, including, I think, the part you're trying to extract. You need a screen-scraper that can run JavaScript and then save out the modified DOM. I don't know where you get one of those.