I want to download a zip file using Python.
With a URL of this type,
http://server.com/file.zip
it is quite simple: use urllib2.urlopen and write the response to a local file.
But in my case I have a URL of this type:
http://server.com/customer/somedata/download?id=121&m=zip
where the download is launched after a form validation.
It may be useful to mention that I want to deploy this on Heroku, so I can't use spynner, which is built with C++. The download is launched after a scraping step that uses Scrapy.
From a browser the download works fine and I get a proper zip file with its name. Using Python I just get HTML and header data...
Is there any way to get a file from this type of URL in Python?
This site is serving JavaScript which then invokes the download.
You have no choice but to: a) evaluate the JavaScript in a simulated browser environment, or b) parse manually what the JS does and re-implement it in Python, e.g. string-extract the URL and download key, possibly issue an AJAX request, and finally download the file.
I generally recommend Mechanize for webpage-related automation, but it cannot deal with JavaScript either, so I guess you can stick with Scrapy if you want to go for plan b).
When you do the download in the browser, open the network tab of the developer console and record the HTTP method (probably POST), the POST parameters, the cookies, and everything else that is part of the validation; then use a library to replicate that request.
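For instance, here is a rough sketch of that approach with urllib2 and cookielib; the validation URL and form fields are placeholders you would replace with whatever the network tab actually shows:

import cookielib
import urllib
import urllib2

# Keep cookies across the form validation and the download request
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

# Hypothetical validation form -- take the real URL and fields from the network tab
form_url = 'http://server.com/customer/somedata/validate'
form_data = urllib.urlencode({'id': '121', 'm': 'zip', 'accept': 'yes'})
opener.open(form_url, form_data)  # POST the validation form

# With the session cookie set, the download URL should now return the zip itself
response = opener.open('http://server.com/customer/somedata/download?id=121&m=zip')
with open('file.zip', 'wb') as f:
    f.write(response.read())

This only works if the "form validation" is an ordinary HTTP request; if the site builds the request in JavaScript, you still have to reproduce that logic by hand.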
Is it possible to upload and manipulate a photo in the browser with GitHub Pages? The photo doesn't need to be stored anywhere beyond that session.
PS. I'm new to this area and I am using Python to manipulate the photo.
GitHub pages allows users to create static HTML sites. This means you have no control over the server which hosts the HTML files - it is essentially a file server.
Even if you did have full control over the server (e.g. if you hosted your own website), it would not be possible to allow the client to run Python code in the browser since the browser only interprets JavaScript.
Therefore the easiest solution is to rewrite your code in JavaScript.
Failing this, you could offer a download link to your Python script, and have users trust you enough to run it on their computer.
I am downloading information for a research project from a site that uses AJAX to load URLs and does not allow serial downloading. I dump the URLs from CasperJS into a file, read that file, and use browser.retrieve(url, dump_filename) to download the information with Mechanize. I mostly get blank file downloads, but they are periodically filled with content. Is there a way to modify the headers so that I always get data? A CasperJS download alternative is also welcome; I have tried CasperJS's download(), but it saves a blank file as well. I think it has something to do with the headers. File downloads always work in a browser.
I prefer Selenium over Mechanize when it comes to more "sophisticated" websites that use AJAX, JS, etc.
You said downloading works when you're using your browser. Well, Selenium does the same thing: it drives Firefox on your desktop to fulfil its tasks.
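As a minimal sketch (assuming the Selenium 2-era Python API and a hypothetical page/link; adjust the MIME types to match your files), you can configure the Firefox profile so matching downloads are saved straight to disk instead of raising a dialog:

from selenium import webdriver

# Firefox preferences so downloads are saved to disk without a dialog
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)           # use a custom directory
profile.set_preference("browser.download.dir", "/tmp/downloads")   # where files end up
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "application/octet-stream,application/zip,text/csv")

driver = webdriver.Firefox(firefox_profile=profile)
driver.get("http://example.com/page-with-download-link")   # hypothetical URL
driver.find_element_by_link_text("Download").click()        # triggers the AJAX download
# ...wait for the file to appear in /tmp/downloads, then process it...
driver.quit()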
I am trying to grab a PNG image which is being dynamically generated with JSP in a web service.
I have tried visiting the web page it is contained in and grabbing the image's src attribute, but the link leads to a .jsp file, and reading the response with urllib2 just shows a lot of gibberish.
I also need to do this while logged into the web service in question, using mechanize. This seems to rule out grabbing a screenshot with webkit2png or similar.
Thanks for any suggestions.
If you use urllib correctly (for example, making sure your User-Agent resembles a browser, etc.), the "gibberish" you get back is the actual file, so you just need to write it out to disk (open the file with "wb" for writing in binary mode) and re-read it with an image-manipulation library if you need to work with it. Or you can use urlretrieve to save it directly to the filesystem.
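For example, a minimal sketch (the JSP URL and the User-Agent string are just illustrative):

import urllib2

url = 'http://example.com/chart.jsp?width=400&height=300'   # hypothetical image URL
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
data = urllib2.urlopen(req).read()    # raw PNG bytes ("gibberish" if printed)

with open('chart.png', 'wb') as f:    # binary mode so the bytes are written unchanged
    f.write(data)

Since you are already logged in with mechanize, the same idea applies there: br.open(url).read() returns the bytes within the authenticated session, and you write them out the same way.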
If that's a JSP, chances are it takes parameters, which might be appended by the browser via JavaScript before the request is made; you should look at the real request your browser makes before trying to reproduce it. You can do that with the Chrome Developer Tools, Firefox's LiveHTTPHeaders, etc.
I do hope you're not trying to break a captcha.
I am trying to retrieve query results from AJAX-based sites like www.snapbird.org using Python. Since the results don't show up in the page source, I am not sure how to proceed.
I am a Python newbie, so it would be great if I could get a pointer in the right direction.
I am also open to another approach to the task if that is easier.
This is going to be complex, but as a start, open Firebug and find the URL that gets called when the AJAX request is made. You can call that URL directly in your Python program and parse the output.
You could use Selenium's Python client driver to parse the page source. I usually use this in conjunction with PyQuery to make web scraping easier.
Here's the basic tutorial for Selenium's Python driver. Be sure to follow the instructions for Selenium version 2 instead of version 1 (unless you're using version 1 for some reason).
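As a small sketch of that combination (the URL and the CSS selector are placeholders, and this assumes the classic WebDriver API):

from pyquery import PyQuery as pq
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.snapbird.org/")   # let the browser execute the JS/AJAX
html = driver.page_source                # the DOM after the scripts have run
driver.quit()

doc = pq(html)
for link in doc("a.result"):             # hypothetical selector for result rows
    print pq(link).text()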
You could also configure Chrome/Firefox to use an HTTP proxy and then log/extract the necessary content with the proxy. I've tinkered with Python proxies that save/log requests/content based on content type or URI globs.
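These days something like mitmproxy (my substitution here, not part of the original setup) makes that approach fairly painless; a short addon script can dump every response whose Content-Type matches what you're after:

# saver.py -- run with: mitmdump -s saver.py, then point the browser at the proxy
class Saver:
    def __init__(self):
        self.count = 0

    def response(self, flow):
        # flow.response.content holds the raw body bytes of each proxied response
        ctype = flow.response.headers.get("content-type", "")
        if "application/json" in ctype:
            self.count += 1
            with open("capture_%d.json" % self.count, "wb") as f:
                f.write(flow.response.content)

addons = [Saver()]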
For other projects I've written site-specific JavaScript bookmarklets which poll for new data and then POST it to my server (by dynamically creating both a form and an iframe, and setting myform.target = myiframe).
Other JavaScript scripts/bookmarklets simulate a user interacting with a site, so instead of polling every few seconds the JavaScript automates clicking buttons, submitting forms, etc. These scripts are always very site-specific, of course, but they've been hugely useful for me, especially when iterating over all the paginated results of a given search.
Here is a stripped-down version that walks over a list of "paginated" results and prepares to send the data off to my server (which then further parses it with BeautifulSoup). In particular, this was designed for YouTube's Sent/Inbox messages.
var tables = [];
function process_and_repeat(){
    // Bail out if the inbox pane isn't available
    if(!(inbox && inbox.message_pane_ && inbox.message_pane_.innerHTML)){
        alert("We've got no data!");
        return;
    }
    if(inbox.message_pane_.innerHTML.indexOf('<table') === 0){
        // Collect this page's table, then move on to the next page
        tables.push(inbox.message_pane_.innerHTML);
        inbox.next_page();
        setTimeout(process_and_repeat, 3000);  // give the next page time to load
    }
    else{
        alert("Finished, [" + tables.length + " processed]");
        // POST the collected HTML to my server for parsing
        document.write('<form action=http://curl.sente.cc method=POST><textarea name=sent.html>' + escape(tables.join('\n')) + '</textarea><input type=submit></form>');
    }
}
process_and_repeat(); // now we wait and watch as all the paginated pages are viewed :)
This is a stripped-down example without any fancy iframes or other non-essentials, which would only add complexity.
Adding to what Liam said, Selenium is a great tool, too, which has aided in my various scraping needs. I'd be more than happy to help you out with this if you'd like.
One easy solution might be using a browser-emulating library like Mechanize. With it you can browse a site, follow links, make searches, and do nearly everything you can do in a browser with a user interface.
But for a very specific job you may not even need such a library; you can use Python's urllib and urllib2 to make a connection and read the response... You can use Firebug to see the data structure of a search and the response body, then use urllib to make a request with the relevant parameters...
As an example:
I made a search for joyvalencia and checked the request URL with Firebug to see:
http://api.twitter.com/1/statuses/user_timeline.json?screen_name=joyvalencia&count=100&page=2&include_rts=true&callback=twitterlib1321017083330
So calling this URL with urllib2.urlopen() is the same as making the query on Snapbird. The response body is:
twitterlib1321017083330([{"id_str":"131548107799396357","place":null,"geo":null,"in_reply_to_user_id_str":null,"coordinates":.......
When you use urlopen() and read the response, the string above is what you get... You can then use Python's json library to parse it into a Pythonic data structure. Note that the response is wrapped in a JSONP callback (twitterlib...(...)), so strip the wrapper before parsing.
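A minimal sketch of that flow, using the old Twitter endpoint from the example (it may no longer work, but the JSONP-stripping pattern is the point):

import json
import urllib2

url = ('http://api.twitter.com/1/statuses/user_timeline.json'
       '?screen_name=joyvalencia&count=100&page=2&include_rts=true'
       '&callback=twitterlib1321017083330')

raw = urllib2.urlopen(url).read()

# The body looks like: twitterlib1321017083330([...]); keep only the [...] part
body = raw[raw.index('(') + 1:raw.rindex(')')]
tweets = json.loads(body)

for tweet in tweets:
    print tweet['id_str']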
I am writing a script which will run on my server. Its purpose is to let people download a document: if anyone hits a particular URL, he/she should be able to download the document. I am using urllib.urlretrieve, but it downloads the document on the server side, not on the client. How can I make the download happen on the client side in Python?
If the script runs on your server, its purpose is to serve a document, not to download it (downloading is what the urllib solution does).
Depending on your needs you can:
Set up static file serving with e.g. Apache
Make the script execute on a certain URL (e.g. with mod_wsgi); the script should then set the Content-Type header (the document type, such as "text/plain") and the Content-Disposition header (the download filename) and send the document data, as in the sketch below
As your question is not more specific, this answer can't be either.
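A minimal sketch of that second option as a WSGI application (the file path and filename are placeholders):

# Minimal WSGI app that sends a file as a browser download
def application(environ, start_response):
    path = '/srv/docs/report.pdf'   # hypothetical document on the server
    with open(path, 'rb') as f:
        data = f.read()

    start_response('200 OK', [
        ('Content-Type', 'application/pdf'),                           # document type
        ('Content-Disposition', 'attachment; filename="report.pdf"'),  # download filename
        ('Content-Length', str(len(data))),
    ])
    return [data]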
Set the appropriate Content-type header, then send the file contents.
If the document is on your server and your intention is that the user should be able to download it, couldn't you just serve the URL to that resource as a hyperlink in your HTML? Sorry if I'm being obtuse, but this seems the most logical step given your explanation.
You might want to take a look at the SocketServer module.