How to scrape the images from this javascript website? - python

link text
This is a link from a digital book library.There are forward and backward buttons to see next and previous page.I want to download these pictures automatically. I have once used urllib in python but the website baned it soon. I just want to download this book for study purpose so can anyone recommend me some programming tools such as web-spiders which can simulate the process of turning pages and get the pictures automatically. Thanks!

That site uses Javascript, so you can't easily scrape it with Python. Two suggestions:
Work out what requests are being made when clicking the next button. You can do this with a tool like firebug. You might then find you can scrape it without processing any JS.
Use a tool such as Selenium which allows for browser scripting which lets you "execute" the JS.
As for the site blocking you, there are two ways to reduce the chance of being blocked:
Change your user-agent to that of a common browser, e.g. Firefox.
Add random delays between accessing the next image, so that you appear more human-like.

wget is an excellent web spider
http://linux.die.net/man/1/wget

You need a real browser to work with this (kind of) site. Selenium is one option, but it is more geared towards web testing. For web scraping iMacros is really nice. I had a quick test and it works well with iMacros for Firefox/IE.
Chris

Related

How to read a HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using python and beautiful soup. I encountered that in some sites, the image links although seen on the browser is cannot be seen in the source code. However on using Chrome Inspect or Fiddler, we can see the the corresponding codes.
What I see in the source code is:
<div id="cntnt"></div>
But on Chrome Inspect, I can see a whole bunch of HTML\CSS code generated within this div class. Is there a way to load the generated content also within python? I am using the regular urllib in python and I am able to get the source but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !
You need JavaScript Engine to parse and run JavaScript code inside the page.
There are a bunch of headless browsers that can help you
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The Content of the website may be generated after load via javascript, In order to obtain the generated script via python refer to this answer
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you rather need a Headless browser that would also generate the DOM, load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products of those are abandoned now.
TRY THIS FIRST!
Perhaps the data technically could be in the javascript itself and all this javascript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention attention on any/all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)

Scraping a web page as you manually navigate

Is there a way, using some library or method, to scrape a webpage in real time as a user navigates it manually? Most scrapers I know of such as python mechanize create a browser object that emulates a browser - of course this is not what I am looking for since if I have a browser open, it will be different than the one mechanize creates.
If there is no solution, my problem is I want to scrape elements from a HTML5 game to make an intelligent agent of sorts. I won't go into more detail, but I suspect if others are trying to do the same in the future (or any real time scraping with a real user), a solution to this could be useful for them as well.
Thanks in advance!
Depending on what your use-case is, you could set up a SOCKS proxy or some other form of proxy and configure it to log all traffic, then instruct your browser to use it. You'd then scrape that log somehow.
Similarly, if you have control over your router, you could configure capture and logging there, e.g. using tcpdump. This wouldn't decrypt encrypted traffic, of course.
If you are working with just one browser, there may be a way to instruct it to do something at each action via a custom browser plugin, but I'd have to guess you'd be running into security model issues a lot.
The problem with a HTML5 game is that typically most of its "navigation" is done using a lot of Javascript. The Javascript is typically doing a lot -- manipulating the DOM, triggering requests for new content to fit into the DOM, etc...
Because of this you might be better off looking into OS-level or browser-level scripting services that can "drive" keyboard and mouse events, take screenshots, or possibly even take a snapshot of the current page DOM and query it.
You might investigate browser automation and testing frameworks like Selenium for this.
I am not sure if this would work in your situation but it is possible to create a simple web browser using PyQt which will work with HTML5 and from this it might be possible to capture what is going on when a live user plays the game.
I have used PyQt for a simple browser window (for a completely different application) and it seems to handle simple, sample HTML5 games. How one would delve into the details of what is going on the game is a question for PyQt experts, not me.

Can scrapy be used to complete form submission and do everything what a browser can do

I have a task where in I need to submit a form to website but they dont provide any API. I am currently using webdriver and faced many problems because of asynchronous nature between my code and browser. I am looking for a light weight reliable library/tool using with I can do all the tasks a user do with a browser.
Casperjs is one of the option which can do my job but I am more familiar with python and scrapy has larger developer community compare to casperjs.
Navigation utility without browser, light weight and fail-proof is one of the related question.
in short answer is No.
scrapy can't render java script but browser can.
you can use Selenium.
if you are sure to use scrapy and there is javascript that you need to run you can use
scrapy with selenium
scrapy with gtk/webkit/jswebkit
scrapy with webdrivers
If you like CasperJS but want to stick with Python, you should have a look at Ghost.py

How can I input data into a webpage to scrape the resulting output using Python?

I am familiar with BeautifulSoup and urllib2 to scrape data from a webpage. However, what if a parameter needs to be entered into the page before the result that I want to scrape is returned?
I'm trying to obtain the geographic distance between two addresses using this website: http://www.freemaptools.com/how-far-is-it-between.htm
I want to be able to go to the page, enter two addresses, click "Show", and then extract the "Distance as the Crow Flies" and "Distance by Land Transport" values and save them to a dictionary.
Is there any way to input data into a webpage using Python?
Take a look at tools like mechanize or scrape:
http://pypi.python.org/pypi/mechanize
http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
http://www.ibm.com/developerworks/linux/library/l-python-mechanize-beautiful-soup/
http://zesty.ca/scrape/
Packt Publishing has an article on that matter, too:
http://www.packtpub.com/article/web-scraping-with-python
Yes! Try mechanize for this kind of Web screen-scraping task.
I think you can also use PySide/PyQt, because they have a browser core of qtwebkit, you can control the browser to open pages, simulate human actions(fill, click...), then scrape data from pages. FMiner is work on this way, it's a web scraping software I developed with PySide.
Or you can try phantomjs, it's an easy library to control browser, but not it's javascript not python lanuage.
In addition with the answers already given, you could simply do a request on that page. Using your browser you could always inspect the Network (under Tools/Web Developer tools) behaviors and actions when you interact with the page. E.g. http://www.freemaptools.com/ajax/getaandb.php?a=Florida_Usa&b=New%20York_Usa&c=6052 -> request query for getting the results page you are expecting. Request that page and scrape the field you wanted to. IMHO, page requests are way faster than screen scraping (case-to-case basis).
But of course, you could always do screen scraping/browser simulation also (Mechanize, Splinter) and use headless browsers (PhantomJS, etc.) or the browser driver of the browser you want to use.
The query may have been resolved.
You can use Selenium WebDriver for this purpose. A web page can be interacted using programming language. All the operations can be performed as if a human user is accessing the web page.

Python and webbrowser form fill

Hello how can i make changes in my web browser with python? Like filling forms and pressing Submit?
What lib's should i use? And maybe someone of you have some examples?
Using urllib does not make any changes in opened browser for me
Urllib is not intended to do anyting with your browser, but rather to get contents from urls.
To fill in forms and this kind of things, have a look into mechanize, to scrap the webpages, consider using pyquery.
Selenium is great for this. It's a browser automation tool that you can use to launch a browser (any major browser or a 'headless' one), navigate to a url, and interact with the page.
It's used primarily for testing web code against multiple browsers, but is also very useful for 'scraping' pages and automating mundane tasks.
Here are the python docs: http://selenium-python.readthedocs.org/en/latest/index.html

Categories