Downloading a webpage and associated resources to a WARC in Python

I'm interested in downloading for later analysis a bunch of webpages. There are two things that I would like to do:
1. Download the page and its associated resources (images, multiple pages associated with an article, etc.) to a WARC file.
2. Change all links to point to the now-local files.
I would like to do this in Python.
Are there any good libraries for doing this? Scrapy seems designed to scrape websites rather than single pages, and I'm not sure how to generate WARC files. Calling out to wget is a doable solution if there isn't something more Python-native. Heritrix is complete overkill, and not much of a Python solution. wpull would be ideal if it had a well-documented Python library, but it seems instead to be mostly an application.
Any other ideas?

Just use wget; it is the simplest and most stable tool you can have to crawl the web and save the result into a WARC.
See man wget, or, just to start:
--warc-file=FILENAME save request/response data to a .warc.gz file
-p, --page-requisites get all images, etc. needed to display HTML page
Please note that you don't have to change any links; the WARC preserves the original web pages. It is the job of replay software (OpenWayback, pywb) to make the WARC content browsable again.
If you need to go with Python:
internetarchive/warc is the go-to library for reading and writing WARC files
Take a look at ampoffcom/htmlwarc if you want to craft a WARC file by hand
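If you go the wget route from Python, a rough sketch (assuming wget is installed and on your PATH; the URL and output name below are just placeholders) is simply to wrap those two flags with subprocess:

import subprocess

def archive_page(url, warc_basename="example"):
    # Fetch one page plus its requisites and write example.warc.gz;
    # wget appends the .warc.gz extension itself.
    subprocess.run(
        [
            "wget",
            "--page-requisites",              # images, CSS, etc. needed to render the page
            f"--warc-file={warc_basename}",   # request/response data goes into the WARC
            url,
        ],
        check=True,
    )

archive_page("https://example.com/article")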

Related

Scraping PDFs from a password-protected website

I work in tech support and currently have to keep our product manuals updated manually, periodically checking each one for an update and, if there is one, replacing the copy saved on our network.
I was wondering whether it would be possible to build a small program to download all the files on a supplier's website and have them automatically sorted into the right folders for those products, replacing the current PDFs in each folder. I should also note that the website is password-protected and organized into folders.
Is this possible with Python? I figured a small program I could perhaps run once a week or something to automatically update our manuals would be super useful (and a learning experience).
Apologies if I haven't explained the requirement well, any questions let me know.
It's certainly possible. As the other answer suggests, you will want to use libraries like Requests (handles HTTP requests) or Selenium (automates browser activity) to navigate through the login.
You'll need to sort through the links on a given page, which is ideally done with BeautifulSoup (an HTML parser) but could also be done with Selenium. You'll then want Requests for downloading the PDFs, and the os module for sorting the downloads into the right folders and replacing old files.
I strongly urge you to think through the steps, but I hope that gives an idea of the libraries you'll need to learn a bit about. The most challenging one to learn will be Selenium, so if you can use Requests to do the login, that is much better.
If you've got a basic grasp of Python, then Requests, the os module, and BeautifulSoup are not difficult things to pick up.
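To make that concrete, here is a rough sketch of those pieces, assuming a plain form-based login with no CAPTCHA; every URL, form field, and folder path below is hypothetical and would need to match the real supplier site:

import os
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

BASE = "https://supplier.example.com"   # hypothetical supplier site

with requests.Session() as session:
    # 1. Log in so later requests carry the session cookie.
    session.post(BASE + "/login", data={"username": "me", "password": "secret"})

    # 2. Find every PDF link on a product folder page.
    folder_url = BASE + "/manuals/product-a/"
    soup = BeautifulSoup(session.get(folder_url).text, "html.parser")
    pdf_links = [a["href"] for a in soup.find_all("a", href=True)
                 if a["href"].lower().endswith(".pdf")]

    # 3. Download each PDF into the matching folder, replacing old copies.
    target_dir = os.path.join("manuals", "product-a")   # hypothetical destination folder
    os.makedirs(target_dir, exist_ok=True)
    for href in pdf_links:
        pdf = session.get(urljoin(folder_url, href))
        with open(os.path.join(target_dir, os.path.basename(href)), "wb") as fh:
            fh.write(pdf.content)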
You can use Selenium for browser automation. It can fill in the password (although an "are you a robot" check might stop you), and then you can download the PDFs simply by setting a default download location and clicking the download button. The browser will save the files to that default download location.
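As a sketch of that browser-automation route (the folder path and site details here are hypothetical), you can point Chrome's default download directory at your manuals folder before clicking any download links:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/path/to/manuals",   # where the PDFs will land
    "download.prompt_for_download": False,               # save without asking
    "plugins.always_open_pdf_externally": True,          # download PDFs instead of previewing
})

driver = webdriver.Chrome(options=options)
driver.get("https://supplier.example.com/login")
# ...fill in the login form, navigate to a product page, then click the
# download link/button; Chrome saves the file into the directory above.
driver.quit()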

What's the best way to disable image download in scrapy?

It is not disabled by default.
I have written a spider which consumes almost 2 GB of data per hour. Now I want to cut my data consumption; images are of no use to me, so I want to make sure they are not being fetched.
Given that this is a P0 scenario, it should be a simple flag in settings.py, but surprisingly I couldn't find one in the docs. I found a lot of detail about ImagesPipeline, enabling those pipelines, their storage, etc., but no flag for people not interested in images. Let me know if I am missing anything.
Scrapy does not download images unless you explicitly tell it to do so.
You can check the runtime logs for the URLs that Scrapy downloads. If an image URL does not appear in the logs, it is not being downloaded, even if a webpage that contains images is.
When you open a downloaded page in a web browser, images are downloaded on the fly by the browser. They do not come from the downloaded webpage and are not (usually) embedded in it; the webpage only indicates where on the Internet they live, and the web browser fetches them in order to display them, but Scrapy does not.
The only exception is images actually embedded in the HTML code as base64. This is uncommon, and probably not your case. And when that happens, there is no way to prevent their download: you cannot download a webpage while excluding part of its content.
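To illustrate, a minimal spider (example.com and the selectors are placeholders) only fetches the pages it explicitly requests; image URLs can be collected as plain strings without the image files ever being downloaded, as long as no ImagesPipeline is enabled:

import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Image URLs are extracted as text only; Scrapy will not fetch them
        # unless you yield a Request for them or enable ImagesPipeline.
        yield {
            "url": response.url,
            "image_urls": response.css("img::attr(src)").getall(),
        }
        # Follow only HTML links, so bandwidth goes to pages, not images.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)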

Are Selenium or other web scraper tools mandatory for scraping data from Chrome into a Python script?

So I wanted to scrape a website's data, and I have used Selenium in my Python script to do it. But I have noticed that in the Network section of Google Chrome's Inspect tool, Chrome records the XmlHttpRequests and reveals the JSON/XML files a website loads. So I was wondering: can I use this data directly in my Python script, since Selenium is quite heavyweight and needs more bandwidth? Do Selenium or other web scraper tools have to be used as a medium to communicate with the browser? If not, please share some information about scraping data for my Python script using only Chrome itself.
Definitely! Check out the requests module.
From there you can access the page source, and using data from it you can access the different aspects separately. Here are the things to consider though:
Pros:
Faster, less to download. For things like AJAX requests, it is far more efficient.
Does not require a graphical UI like Selenium
More precise; get exactly what you need
The ability to set headers/cookies/etc. before making requests
Images may be downloaded separately, with no obligation to download any of them.
Allows as many sessions as you want to be opened in parallel; each can have different options (proxies, no cookies, consistent cookies, custom headers, blocked redirects, etc.) without affecting the others.
Cons:
Harder to get into than Selenium; requires at least minimal knowledge of HTTP GET and POST, and a library like re or BeautifulSoup to extract data.
For pages with JavaScript-generated data, depending on how the JavaScript is implemented (or obfuscated), extracting the wanted data, while always possible, can be extremely difficult.
Conclusion:
I suggest you definitely learn requests and use it for most cases; however, if the JavaScript gets too complicated, switch to Selenium for an easier solution. Look for some tutorials online, and then check the official page for an overview of what you've learned.
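As a small sketch of the requests-based approach: reproduce the XmlHttpRequest you saw in Chrome's Network tab and read the JSON directly. The endpoint URL, headers, and response fields below are hypothetical stand-ins for whatever DevTools actually shows you:

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",              # some servers reject the default client
    "X-Requested-With": "XMLHttpRequest",     # mimic the browser's AJAX call
})

resp = session.get("https://example.com/api/items", params={"page": 1})
resp.raise_for_status()

for item in resp.json()["items"]:             # hypothetical response shape
    print(item["name"], item["price"])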

How to read an HTML page that takes some time to load? [duplicate]

I am trying to scrape a website using Python and Beautiful Soup. I have found that on some sites the image links, although visible in the browser, cannot be seen in the source code. However, when using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that will also build the DOM and load and run the scripts, just like a regular browser would. The Wikipedia article and some other pages on the net have lists of these and their capabilities.
Keep in mind when choosing that some formerly major products in this space are now abandoned.
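As one concrete option (Selenium driving headless Chrome plays the same role as the headless browsers listed above; the URL is a placeholder), you can wait until JavaScript has actually filled the cntnt div from the question before reading the page source:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/slow-page")
    # Block until the JS-generated content appears inside the div (up to 15 s).
    WebDriverWait(driver, 15).until(
        lambda d: d.find_element(By.ID, "cntnt").get_attribute("innerHTML").strip()
    )
    html = driver.page_source   # now includes the generated markup
finally:
    driver.quit()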
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, in which case all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XmlHttpRequests. The data you need should be found somewhere in one of those responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)

Downloading a web page and all of its resource files in Python

I want to be able to download a page and all of its associated resources (images, style sheets, script files, etc) using Python. I am (somewhat) familiar with urllib2 and know how to download individual urls, but before I go and start hacking at BeautifulSoup + urllib2 I wanted to be sure that there wasn't already a Python equivalent to "wget --page-requisites http://www.google.com".
Specifically I am interested in gathering statistical information about how long it takes to download an entire web page, including all resources.
Thanks
Mark
Websucker? See http://effbot.org/zone/websucker.htm
websucker.py doesn't follow CSS @import links. HTTrack is not Python, it's C/C++, but it's a good, maintained utility for downloading a website for offline browsing.
http://www.mail-archive.com/python-bugs-list@python.org/msg13523.html
[issue1124] Webchecker not parsing css "@import url"
Guido> This is essentially unsupported and unmaintained example code. Feel free to submit a patch though!
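For what it's worth, the "BeautifulSoup + urllib2" plan from the question is not much code. Here is a rough sketch (using requests for brevity, and deliberately covering only the obvious requisite tags) that also measures the total download time and size the question asks about:

import time
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def download_page_with_requisites(url):
    start = time.monotonic()
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    # Collect the obvious requisites: images, scripts, and linked stylesheets.
    resource_urls = set()
    for tag, attr in (("img", "src"), ("script", "src"), ("link", "href")):
        for node in soup.find_all(tag):
            if node.get(attr):
                resource_urls.add(urljoin(url, node[attr]))

    total_bytes = len(page.content)
    for res_url in resource_urls:
        try:
            total_bytes += len(requests.get(res_url).content)
        except requests.RequestException:
            pass   # skip resources that fail to load

    return time.monotonic() - start, total_bytes

elapsed, size = download_page_with_requisites("https://www.google.com")
print(f"{size} bytes in {elapsed:.2f} s")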
