Screen scraping images (i.e. Firefox Page Info / Google Images) - python

Preferably using Python (currently urllib and BeautifulSoup), given a URL.
For example I'm trying to scrape the main photo on this page: http://www.marcjacobs.com/marc-jacobs/womens/bags-and-accessories/c3122001/the-single#?p=1&s=12
In Firefox under Tools > Page Info > Media lists all the visible images, including the link to the image I want to grab ( http://imagesec.mj.ctscdn.com/image/336/504/6ace6aac-c049-4d7e-9465-c19b5cd8e4ac.jpg )
Two interrelated problems:
1 - If I do a view-source, the image path retrieved from the Firefox tool is not found in the HTML document. Is there any way I can retrieve this path without going through Firefox Page Info, perhaps through Python and/or JavaScript/jQuery?
2 - I'm trying to get the photo of the product in "Orange", but the page always loads the black color by default.
A working example is probably Google Shopping: if you type the name of this product and select a color, the image shows up in the correct color (from the exact same page) in the search results.
Basically, I want to be able to scrape color and style/variation specific images from mostly shopping sites.
Selecting the right color seems more complicated; in that case I'll settle for just the main product image in black for now.
So far I've tried selecting images based on their height attributes, and reading dimensions when there are no height/width attributes... but it occurred to me that there has to be a better way.

This can be a bit complex, but most of the solutions that work in this particular situation are pretty much the same.
First, let me tell you why Beautiful Soup or lxml won't work. You need to retrieve information that is only available after you click on the orange bag thumbnail, right? That content is loaded using JavaScript, so the orange bag image won't be available to Beautiful Soup and friends (they don't run JavaScript, and they can't see elements that are absent from the parsed tree).
So that is a dead end.
However, there are other screen-scraping tools, like Selenium or PhantomJS. I have tested both and they work great. They drive an actual browser, so they obviously are capable of handling JavaScript. I don't know whether you need to scrape automatically from your server or want to start the scraping process at will. With Selenium (after you tell it which page to open, which element to click, etc.), you will see your browser doing all that stuff by itself. There are other options available, such as using a headless browser. In my opinion it's very powerful, but it can get quite complex to set up.
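To give a rough idea, here is a minimal Selenium sketch in Python; the two CSS selectors ("a.swatch-orange" and "img.product-image") are pure guesses at the page's markup, not selectors taken from the real site:

# Sketch only: the selectors below are assumptions about the page markup.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.marcjacobs.com/marc-jacobs/womens/"
           "bags-and-accessories/c3122001/the-single")
driver.find_element(By.CSS_SELECTOR, "a.swatch-orange").click()  # pick the colour
image = driver.find_element(By.CSS_SELECTOR, "img.product-image")
print(image.get_attribute("src"))  # URL of the now-visible orange image
driver.quit()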
A far easier solution is PhantomJS. It's similar to Selenium except that, as its name indicates, you give the instructions via JavaScript (which can be a bit more comfortable, since you're already dealing with web elements). I recommend using CasperJS on top of it: it eases the process of defining a full navigation scenario and provides useful high-level functions, methods and syntactic sugar for common tasks...
Let me give you a feel for what it looks like:
var casper = require('casper').create({
    verbose: true
});

casper.start('yourwebpage'); // load the page

casper.then(function(){ // after loading...
    var value = this.evaluate(function(){ // read an element's value in page context
        return document.getElementById('yourelement').value;
    });
    this.echo(value);
});

casper.then(function(){ // after that, click on this other element
    this.click('#id_of_other_element');
});

casper.wait(7000); // wait for some processing... this can be quite
                   // useful if you need to wait a few seconds in
                   // order to retrieve your orange bag later

casper.run(); // actually runs the whole scenario
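Save that to a file and run it with the casperjs command-line tool (casperjs yourscript.js); CasperJS drives PhantomJS under the hood.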
There you have most of the things you need to accomplish your task.
By the way, let me remind you that you usually need to ask for permission to retrieve that kind of content.
Hope that helps.

Related

<img> tag mechanics in HTML, Selenium scraping

This is a bit of a long theoretical question about how img tags really work, for the purposes of web scraping. I've done a lot of research and have seen a bunch of working solutions, but I haven't felt that the core question was answered.
First off, my task: I wish to efficiently scrape ~100k HTML pages from a website and also download images on these pages, while respecting their robots.txt crawl rate of 3 seconds per page.
First, I built a scraper intending to just crawl the HTML and collect a long list of image URLs to download on a second pass. But then I realized that, with ~10 images per page, this would be ~1M images. At a 3-second crawl rate, that would take ~30 days.
So I thought: "if I'm scraping using Selenium, the images are getting downloaded anyway! I can just download the images on page-scrape."
Now, my background research: This sent me down a rabbit hole, and I learned that the following options exist to download images on a page without making additional calls:
You can right-click and "Save as" (SO post)
You can screenshot the image (SO post)
Sometimes, weirdly, the image data is loaded into src anyway (SO post)
Selenium Wire exists, which is really the best way to address this. (SO Post)
These all seem like viable answers, but on a deeper level, they all (even Selenium Wire**) seem like hacks.
** Selenium Wire allows you to access the data in the requests made by Selenium. This is great, but I naively assumed that when a page is rendered and the images are placed in the img tags, they're in the page and I shouldn't have to worry about the requests that retrieved them.
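For concreteness, a minimal sketch of the Selenium Wire approach (assuming the selenium-wire package; the URL and the extension filter are placeholders):

# Sketch only: assumes the selenium-wire package; URL is a placeholder.
from seleniumwire import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/gallery")

for request in driver.requests:  # every request the browser made
    if request.response and request.url.endswith((".jpg", ".png")):
        image_bytes = request.response.body  # already fetched -- no second call
        # ...write image_bytes to disk, keyed by request.url...

driver.quit()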
Now, finally, my question. I clearly have a fundamental misunderstanding about how the img tag works.
Why can't one directly access image data through the Selenium driver, which is loading and rendering the images anyway? The images are there; I see the images when the driver loads. Theoretically, naively, I would expect to be able to download whatever is loaded on the page.
The one parallel I know of is with iframes -- you can visually see the content of the iframe, but you can only scrape it after directing Selenium to switch into the frame (background). So naively I assumed there would be a switch method for img's as well. The fact that there isn't, and it's not clear how to use Selenium to download the image data, tells me that I'm not really understanding how a browser handles an img tag.
I understand all the hacks and the workarounds, but my question here is why?

Python Data extraction from a pop-up window

I'm trying to get some specific data from a website, but it's a little bit complicated to explain, so here are some images.
So, first, I'm on this page (screenshot: Image1), then I click on the icon in the middle and something pops up (screenshot: popup), then I have to click on this (screenshot: almost there), and finally I land here (screenshot: arrival).
And I want to get all the names of the people here.
So, my question is: is there a way to get this list directly with requests? If yes, how do I do it? I can't find the URL of this kind of pop-up, and I'm a complete beginner with requests and all these kinds of things.
(To get the names, I have to be logged in to my account, by the way.)
So, since I don't know how to access the pop-up window, this is the only code I've got:
import requests
x = requests.get('https://www.tiktok.com/#programm___r?lang=en', headers={'User-Agent':'test'})
print(x.text)
I checked what it prints, and I didn't see any sign of the pop-up window.
You can use a network-interception tool like Burp Suite and watch the traffic that comes through each time you click each link along the way to your final destination; this should give you an endpoint you may be able to send your request to. The same network information is also available in the browser's developer tools (Network tab). A potential issue here is that tokens and other information usually have to be passed down the chain along the way, which might make scripting something like this too hard.
Aside from that, with browser-automation software like Selenium, you could automate the process of getting to that point on the page and pull out the list you want once you're there. I've used Selenium myself and it's really usable and well documented!
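For the Selenium route, a minimal sketch; every selector below is a placeholder, since the real ones depend on the page's markup, and the logged-in state would still need handling (e.g. a manual login step first):

# Sketch only: all CSS selectors are assumptions about the page markup.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.tiktok.com/#programm___r?lang=en")
wait = WebDriverWait(driver, 10)

# click through to the pop-up -- both selectors are assumptions
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span.middle-icon"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div.open-list"))).click()

# once the list is rendered, pull the names -- selector again assumed
names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "p.user-name")]
print(names)
driver.quit()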

How to read a HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using Python and Beautiful Soup. I found that on some sites the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript; in order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that also builds the DOM and loads and runs scripts like a regular browser would. The Wikipedia article on headless browsers and some other pages on the net have lists of them and their capabilities.
Keep in mind when choosing that some previously major products are abandoned now.
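As a sketch of that idea with one still-maintained option (headless Chrome driven by Selenium; the URL is a placeholder, and the div id is taken from the question):

# Sketch only: placeholder URL; waits for the scripts to fill the empty div.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless")  # no visible window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/page-with-generated-content")

# wait until JavaScript has actually populated the div from the question
WebDriverWait(driver, 10).until(
    lambda d: d.find_element(By.ID, "cntnt").get_attribute("innerHTML").strip())

html = driver.page_source  # the DOM after JavaScript has run
driver.quit()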
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that request, you'll probably get everything you need handed right to you, without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on your network-traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
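A minimal sketch of that replay step (the endpoint and the response shape are placeholders you would copy out of the network log):

# Sketch only: replace the URL with the XmlHTTPRequest endpoint you traced.
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # look like a "real" browser

resp = requests.get("https://example.com/api/content", headers=headers)
resp.raise_for_status()
data = resp.json()  # the payload is often JSON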

How to scrape content rendered in popup window with javascript: links using scrapy

I'm trying to use scrapy to get content that is rendered only after a javascript: link is clicked. As the links don't appear to follow a systematic numbering scheme, I don't know how to:
1 - activate a javascript: link to expand a collapsed panel
2 - activate a (now visible) javascript: link to cause the popup to be rendered so that its content (the abstract) can be scraped
The site https://b-com.mci-group.com/EventProgramme/EHA19.aspx contains links to abstracts that will be presented at a conference I plan to attend. The site's export to PDF is buggy, in that it duplicates a lot of data at PDF generation time. Rather than dealing with the bug, I turned to scrapy only to realize that I'm in over my head. I've read:
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
and
How to scrape coupon code of coupon site (coupon code comes on clicking button)
But I don't think I'm able to connect the dots. I've also seen mentions to Selenium, but am not sure that I must resort to that.
I have made little progress, and wonder if I can get a push in the right direction, with the following information in hand:
In order to make the POST request that will expand the collapsed panel (item 1 above), I have traced that the on-page JS call javascript:ShowCollapsiblePanel(116114,1695,44,191); results in a POST request to TARGETURLOFWEBSITE/EventSessionAjaxService/GetSessionDetailsHtml with payload:
{"eventSessionID":116114,"eventSessionWebSiteSetupViewID":191}
The parameters for eventSessionID and eventSessionWebSiteSetupViewID are clearly in the javascript:ShowCollapsiblePanel text.
How do I use scrapy to iterate over all of the links of form javascript:ShowCollapsiblePanel? I tried to use SgmlLinkExtractor, but that didn't return any of the javascript:ShowCollapsiblePanel() links - I suspect that they don't meet the criteria for "links".
UPDATE
Making progress, I've found that SgmlLinkExtractor is not the right way to go, and the much simpler:
sel.xpath('//a[contains(@href, "javascript:ShowCollapsiblePanel")]').re(r'(\d+),(\d+),(\d+),(\d+)')
in scrapy console returns me all of the numeric parameters for each javascript:ShowCollapsiblePanel() (of course, right now they are all in one long string, but I'm just messing around in the console).
The next step will be to take the first javascript:ShowCollapsiblePanel() and generate the POST request and analyze the response to see if the response contains what I see when I click the link in the browser.
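For illustration, that next step might look roughly like this with a current scrapy API (a sketch only: the spider and callback names are made up, the base URL for the endpoint is a guess, and the payload mapping just follows the trace above):

# Sketch only: spider/callback names and the urljoin base are assumptions.
import json
import re
import scrapy

class PanelSpider(scrapy.Spider):
    name = "eha_panels"  # hypothetical spider name
    start_urls = ["https://b-com.mci-group.com/EventProgramme/EHA19.aspx"]

    def parse(self, response):
        hrefs = response.xpath(
            '//a[contains(@href, "javascript:ShowCollapsiblePanel")]/@href'
        ).getall()
        for href in hrefs:
            m = re.search(r'ShowCollapsiblePanel\((\d+),(\d+),(\d+),(\d+)\)', href)
            if not m:
                continue
            session_id, _, _, view_id = m.groups()
            # the first and fourth arguments map to the traced payload fields
            yield scrapy.Request(
                response.urljoin("EventSessionAjaxService/GetSessionDetailsHtml"),
                method="POST",
                body=json.dumps({
                    "eventSessionID": int(session_id),
                    "eventSessionWebSiteSetupViewID": int(view_id),
                }),
                headers={"Content-Type": "application/json"},
                callback=self.parse_panel,
            )

    def parse_panel(self, response):
        # compare response.text with what the browser shows after the click
        pass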
I fought with a similar problem and, after much pulling out of hair, I pulled the data set I needed with import.io, which has a visual-type scraper but is able to run with JavaScript enabled, which did just what I needed, and it's free. There's also a project on GitHub I saw last night, from the scrapy people, that looked just like the import.io scraper... it's called... give me a minute...
Portia. But I don't know if it'll do what you want:
https://codeload.github.com/scrapinghub/portia/zip/master
Good luck!

Is there a way to save a captcha image and view it later in python?

I am scripting in Python for some web automation. I know I cannot automate captchas, but here is what I want to do:
I want to automate everything I can up to the captcha. When I open the page (using urllib2) and parse it to find that it contains a captcha, I want to open the captcha using Tkinter. I know that I will have to save the image to my hard drive first, then open it, but there is an issue before that. The captcha image that is on screen is not directly in the source anywhere. There is a variable in the source, inside some JavaScript, that points to another page that has the link to the image, BUT if you load that middle page, the captcha picture for that link changes, so the image associated with that JavaScript variable is no longer valid. It may be impossible to gather the image using this method, so please enlighten me if you have any ideas.
Now, if I use Firebug to load the page, there is a "GET" that is a direct link to the current captcha image I am seeing, and I'm wondering if there is any way to make Python or urllib2 see the "GET"s that happen when a page loads, because if that were possible, this would be simple.
Please let me know if you have any suggestions.
Of course the captcha is served by a page that serves a new one each time (if it were repeated, then once it was solved for one fake userid, a spammer could automatically make a million!). I think you need some "screenshot" functionality to capture the image you want -- there is no cross-platform way to invoke such functionality, but each platform (or desktop manager, in the case of Linux, BSD, etc.) tends to have one. Or you could automate the browser (e.g. via Selenium RC) to "screenshot" (e.g. "print to PDF") things at the right time. (I believe what you're seeing in Firebug may be misleading you, because it is showing a snapshot... just at the HTML-source or DOM level rather than at the screen/bitmap level.)
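The answer predates modern tooling; with today's Selenium bindings, screenshotting just the captcha element is nearly a one-liner (a sketch; the URL and the CSS selector are placeholders):

# Sketch only: placeholder URL and an assumed selector for the captcha <img>.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/signup")  # the page that shows the captcha
captcha = driver.find_element(By.CSS_SELECTOR, "img.captcha")
captcha.screenshot("captcha.png")  # saves exactly the bitmap rendered on screen
driver.quit()

The saved captcha.png can then be displayed in Tkinter as the question intends.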
