downloading files to user's machine? - python

I am trying to download an mp3 file to the user's machine without his/her consent while they are listening to the song. So, next time they visit that web page they would not have to download the same mp3, but play it back from the local file. This will save some bandwidth for me and for them. It's something Pandora used to do, but I really don't know how.
any ideas?

You can't forcefully download files to a user's machine without their consent. If that were possible, you can only imagine what a severe security flaw that would be.
You can do one of two things:
count on the browser to cache the media file (see the sketch after this list)
serve the media via some 3rd party plugin (Flash, for example)
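
For the first option, here is a minimal hypothetical sketch (using Flask purely as an example; the route and file paths are assumptions) of serving the mp3 with long-lived cache headers, so the browser keeps its own local copy instead of re-downloading on every visit:

# Hypothetical Flask sketch: serve the mp3 with cache headers so the
# browser can reuse a local copy instead of re-downloading it each visit.
from flask import Flask, send_file

app = Flask(__name__)

@app.route("/media/<name>.mp3")
def song(name):
    response = send_file(f"media/{name}.mp3", mimetype="audio/mpeg")
    # Tell the browser this file is safe to reuse for 30 days.
    response.headers["Cache-Control"] = "public, max-age=2592000"
    return response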

Don't do this.
Most files are cached anyway.
But if you really want to add this (because users asked for it), make it optional (default off).

I have no experience with this, but you could try your luck with DOM storage in newer browsers:
IE8
FF>=2
Opera > 9.5, as I have read somewhere
Safari and Chrome should implement it, too, I guess
Here's an article by John Resig about that.
However, if a question contains something like 'without the user's knowledge' or 'without the user having to agree', the one asking should typically reconsider their view of the visitors and what is or isn't good for them.
Cheers,

Related

Is it wrong to use modules like wget to let users download files on django?

I am working on a Django application in which users can download their own files. I need to make the files secure and only let the owner download them.
At first, I was thinking of using something like
{% if files %}
<a href='/media/files/pics/photo.png' download>
Then I realised that anyone could brute force my site and get any file. So I thought of handling the download through views. I am a complete beginner and don't know how to write my own download view. So I used something like:
in views.py:

import wget
from .models import data

def download(request, id):
    file = data.objects.get(pk=id)
    url = file.fileurl
    # Note: this saves the file on the server, not on the user's machine.
    filename = wget.download(url)
and call the function when the user wants to download the file. I am using the wget module. I think I am doing it wrong, so I decided to ask for some suggestions.
Finally, my question is:
Is it wrong to use other modules to download files? Or how to write a download view on Django?
Thank you!!
How to download a file in Django depends on your needs; any library that can download and store the file somewhere should be fine. How to serve files to other users, on the other hand, is a different story.
As you mentioned, you need URLs to be private so only the authorized user can download the file. In this case, you can set up a view that authorizes the user and then lets your web server know it should serve a specific file to this user.
You can use X-Accel-Redirect if you're using Nginx. Other web servers such as Apache have a similar option named X-Sendfile.
Take a look here for nginx example: Django and Nginx X-accel-redirect
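For illustration, a minimal hypothetical sketch of an authorizing view that hands the file off to Nginx via X-Accel-Redirect (the UserFile model, its fields, and the /protected/ internal location are all assumptions):

# views.py -- hypothetical protected download via X-Accel-Redirect.
# Assumes an Nginx "internal" location /protected/ mapped to the file
# storage, and a UserFile model with an owner foreign key.
from django.contrib.auth.decorators import login_required
from django.http import HttpResponse
from django.shortcuts import get_object_or_404

from .models import UserFile  # hypothetical model

@login_required
def download(request, file_id):
    # Only the owner may download the file.
    f = get_object_or_404(UserFile, pk=file_id, owner=request.user)
    response = HttpResponse()
    # Let Nginx serve the actual bytes from the internal location.
    response["X-Accel-Redirect"] = f"/protected/{f.relative_path}"
    response["Content-Disposition"] = f'attachment; filename="{f.name}"'
    # Clear the default content type so Nginx can set it from the file.
    del response["Content-Type"]
    return response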
As for saving files, you should save each file somewhere, store the path to it in the database, and link it to a user (with a ForeignKey, for example), so that only the owner of the file can download it. It is also worth mentioning that downloading files might take a long time, so it's a good idea to launch a background task for the download and let the user fetch the file once it's finished. You can take a look at Celery for background tasks.
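If it helps, a hypothetical sketch of such a background task, reusing the wget module from the question (the model, its fields, and the storage path are assumptions):

# tasks.py -- hypothetical Celery task that fetches a remote file in the
# background and records where it landed on the owning record.
import wget
from celery import shared_task

from .models import data  # hypothetical model with fileurl/filepath fields

@shared_task
def fetch_file(file_id):
    record = data.objects.get(pk=file_id)
    # Download onto the server's storage, then remember the local path.
    record.filepath = wget.download(record.fileurl, out="/srv/media/files/")
    record.save()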

Scraping PDF's from a password protected website

I work in tech support and currently have to keep our product manuals updated manually, by periodically checking whether there is an update and, if there is, replacing the current one saved on our network.
I was wondering if it would be possible to build a small program to download all files on a supplier's website and have them automatically sorted into the given folders for those products, replacing the current PDFs in those folders. I must also note that the website is password protected and is sorted into folders.
Is this possible with Python? I figured a small program I could perhaps run once a week or something to automatically update our manuals would be super useful (and a learning experience).
Apologies if I haven't explained the requirement well, any questions let me know.
It's certainly possible. As the other answer suggests, you will want to use libraries like Requests (handles HTTP requests) or Selenium (automates browser activity) to get through the login.
You'll need to sort through the links on a given page, which is ideally done with BeautifulSoup (an HTML parser) but could also be done with Selenium. You'll want requests for downloading the PDFs, and the os module for sorting them into specific folders and replacing files.
I strongly urge you to think through the steps, but I hope that gives an idea of the libraries you'll need to learn a bit about. The most challenging one to learn will be Selenium, so if you can use requests to do the login, that is much better.
If you've got a basic grasp of Python, the requests, os, and BeautifulSoup libraries are not difficult to pick up.
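To give a feel for how the pieces fit together, a hypothetical sketch (the supplier URL, login form fields, and folder layout are all assumptions):

# Hypothetical sketch: log in with requests, find PDF links with
# BeautifulSoup, and save each PDF over the old local copy.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://supplier.example.com"  # hypothetical supplier site

session = requests.Session()
# The login path and form field names depend on the actual site.
session.post(urljoin(BASE_URL, "/login"),
             data={"username": "...", "password": "..."})

page = session.get(urljoin(BASE_URL, "/manuals/"))
soup = BeautifulSoup(page.text, "html.parser")

os.makedirs("manuals", exist_ok=True)
for link in soup.select("a[href$='.pdf']"):
    pdf_url = urljoin(BASE_URL, link["href"])
    target = os.path.join("manuals", os.path.basename(link["href"]))
    # Overwrites the existing copy, which is the point here.
    with open(target, "wb") as f:
        f.write(session.get(pdf_url).content)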
You can use Selenium for browser automation. It could enter the password (although "are you a robot" checks might stop you), and then you can download the PDFs simply by setting a default download location and clicking the download button. The browser will then save the files to that location.
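A hypothetical sketch of that approach (the URL, form field names, and link text are assumptions):

# Hypothetical Selenium sketch: set a default download directory, log in,
# and click a PDF link so the browser saves it there without prompting.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/path/to/manuals",
    "download.prompt_for_download": False,
})
driver = webdriver.Chrome(options=options)

driver.get("https://supplier.example.com/login")  # hypothetical URL
driver.find_element(By.NAME, "username").send_keys("user")
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# Clicking a PDF link now saves it to the download directory above.
driver.find_element(By.LINK_TEXT, "Product manual").click()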

Trying to automate downloading campaign disclosure reports from an ASP server but it uses an encrypted __VIEWSTATE to handle important data

In my line of work, I often need to look at campaign disclosure reports for my state from ethics.ga.gov. However, the state system is one of the shittiest webapps I've ever dealt with.
It only provides contribution data per report. There are six reports per election cycle. And to add insult to injury, the system is slow. Not only do you have to download a shit ton of files, you have to wait a good minute for the damn thing to generate each one.
This is an obvious opportunity to automate the process. What I had planned on doing is writing a program where I can input the URL of the page that links to all the disclosure reports, and have it download all the contribution reports.
For a given candidate, I would input a link to this page - http://media.ethics.ga.gov/Search/Campaign/Campaign_Name.aspx?NameID=5753&FilerID=C2009000086&Type=candidate (the view report links are in the dropdown list titled "campaign contribution reports"). I then plan on following each of those links to the report page, following that link to the contributions page, and downloading the csv file. Once I have the csv file, (I think) the project comes under the scope of my coding ability.
The problem I am stuck on right now is that I can't figure out how to follow the view report links. The system is written in ASP. The links call a JavaScript postback function, something like javascript:__doPostBack('ctl02',''), behind link text of the sort "View Report"; ctl02 is the identifier of the control. It appears that the information to map that control identifier to the URL I need (in this case http://media.ethics.ga.gov/search/Campaign/Campaign_ReportOptions.aspx?NameID=5753&FilerID=C2009000086&CDRID=85776) is embedded in an encrypted __VIEWSTATE field.
I installed the Firebug debugger to try and get data that way. While I am very new to Firebug, all I could find is that in the net tab it shows a GET request to the URL that I need.
Obviously, somehow my browser is getting the next page, which means it should be automatable, but I am now at a loss. I've been working this up in Python because I'm really starting to like it, but everything's negotiable. I am doing this on a Mac (with a full GNU environment), and would prefer to keep working in the environment I am familiar with, but I do have a Windows XP VM with Visual C++ '10 if I have to go that route.
What do y'all think?
Turns out the data wasn't in the encrypted __VIEWSTATE at all. There was a POST operation that Firebug was clearing on a redirect (despite having it set not to clear things). I ran it with the Chrome dev console instead, and I was able to capture the POST data and replicate the POST operation in my application. That got me the URL I was looking for.
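For anyone attempting the same, a hypothetical sketch of replaying such a postback with requests (the ctl02 value comes from the question; the rest is standard ASP.NET hidden-field echoing, and the exact fields must be taken from the live page):

# Hypothetical sketch: replay an ASP.NET __doPostBack as a plain POST.
import requests
from bs4 import BeautifulSoup

url = ("http://media.ethics.ga.gov/Search/Campaign/Campaign_Name.aspx"
       "?NameID=5753&FilerID=C2009000086&Type=candidate")

session = requests.Session()
soup = BeautifulSoup(session.get(url).text, "html.parser")

# Echo every hidden ASP.NET state field (__VIEWSTATE and friends) back
# to the server, exactly as the browser's postback does.
form_data = {f["name"]: f.get("value", "")
             for f in soup.select("input[type=hidden]")
             if f.has_attr("name")}
form_data["__EVENTTARGET"] = "ctl02"  # the control behind "View Report"

# The response (or the redirect it triggers) carries the report page URL.
report_page = session.post(url, data=form_data)
print(report_page.url)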
Thanks to everyone that looked at this!

Scraping a web page as you manually navigate

Is there a way, using some library or method, to scrape a webpage in real time as a user navigates it manually? Most scrapers I know of, such as Python's mechanize, create a browser object that emulates a browser; of course this is not what I am looking for, since if I have a browser open, it will be a different one than the one mechanize creates.
If there is no general solution: my specific problem is that I want to scrape elements from an HTML5 game to make an intelligent agent of sorts. I won't go into more detail, but I suspect that if others try to do the same in the future (or any real-time scraping with a real user), a solution could be useful for them as well.
Thanks in advance!
Depending on what your use-case is, you could set up a SOCKS proxy or some other form of proxy and configure it to log all traffic, then instruct your browser to use it. You'd then scrape that log somehow.
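One concrete way to try the proxy-logging idea is a small addon script for mitmproxy; mitmproxy is just one hypothetical stand-in for "some form of proxy" here, and the log format is an assumption:

# log_pages.py -- run with: mitmdump -s log_pages.py
# Logs the URL and body of every HTML response the browser loads
# through the proxy, for later scraping.
def response(flow):
    content_type = flow.response.headers.get("content-type", "")
    if "text/html" in content_type:
        with open("traffic.log", "a") as log:
            log.write(flow.request.pretty_url + "\n")
            log.write(flow.response.get_text() + "\n\n")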
Similarly, if you have control over your router, you could configure capture and logging there, e.g. using tcpdump. This wouldn't decrypt encrypted traffic, of course.
If you are working with just one browser, there may be a way to instruct it to do something at each action via a custom browser plugin, but I'd guess you'd run into security-model issues a lot.
The problem with an HTML5 game is that typically most of its "navigation" is done using a lot of JavaScript. The JavaScript is typically doing a lot: manipulating the DOM, triggering requests for new content to fit into the DOM, etc...
Because of this you might be better off looking into OS-level or browser-level scripting services that can "drive" keyboard and mouse events, take screenshots, or possibly even take a snapshot of the current page DOM and query it.
You might investigate browser automation and testing frameworks like Selenium for this.
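As a hypothetical sketch of the "drive the browser and snapshot the DOM" idea with Selenium (the game URL and polling interval are assumptions):

# Hypothetical sketch: launch a browser with Selenium, let the user play
# in that window, and periodically snapshot the live DOM.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/game")  # hypothetical game URL

while True:
    snapshot = driver.page_source  # reflects JS-driven DOM changes
    # ... hand `snapshot` to your agent / parser here ...
    time.sleep(1)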
I am not sure if this would work in your situation, but it is possible to create a simple web browser using PyQt which will work with HTML5, and from this it might be possible to capture what is going on when a live user plays the game.
I have used PyQt for a simple browser window (for a completely different application) and it seems to handle simple, sample HTML5 games. How one would delve into the details of what is going on in the game is a question for PyQt experts, not me.
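For what it's worth, a minimal hypothetical sketch of such a PyQt browser window (requires PyQtWebEngine; the game URL is an assumption):

# Hypothetical minimal PyQt browser that can load an HTML5 page;
# hooking into the page's internals is left to the reader.
import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView

app = QApplication(sys.argv)
view = QWebEngineView()
view.load(QUrl("https://example.com/game"))  # hypothetical game URL
view.show()
sys.exit(app.exec_())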

Crawling web for specific file type

As part of a research project, I need to download as many freely available RDF (Resource Description Framework, *.rdf) files as possible from the web. What are the ideal libraries/frameworks available in Python for doing this?
Are there any websites/search engines capable of doing this? I've tried a Google filetype:RDF search. Initially, Google shows 6,960,000 results. However, as you browse the individual result pages, the count drastically drops to 205 results. I wrote a script to screen-scrape and download files, but 205 is not enough for my research and I am sure there are more than 205 files on the web. So, I really need a file crawler. I'd like to know whether there are any online or offline tools that can be used for this purpose, or frameworks/sample scripts in Python to achieve this. Any help in this regard is highly appreciated.
Crawling RDF content from the Web is no different than crawling any other content. That said, if your question is "what is a good Python web crawler", then you should read this question: Anyone know of a good Python based web crawler that I could use?. If your question is related to processing RDF with Python, then there are several options, one being RDFLib.
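As a hypothetical sketch of the RDFLib route once you have crawled some files (the URL is a placeholder):

# Hypothetical sketch: fetch and parse one RDF/XML file with rdflib.
from rdflib import Graph

g = Graph()
# rdflib can parse straight from a URL or a local path.
g.parse("http://example.org/data.rdf", format="xml")
print(len(g), "triples loaded")
for subject, predicate, obj in g:
    print(subject, predicate, obj)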
Did you notice the text at the bottom of the page saying something like "Google has hidden similar results, click here to show all results"? That might help.
I know that I'm a bit late with this answer - but for future searchers - http://sindice.com/ is a great index of rdf documents
Teleport Pro, although it maybe can't copy from Google itself (too big), can probably handle proxy sites that return Google results. I know for a fact I could download 10,000 PDFs within a day if I wanted to. It has filetype specifiers and many options.
Here's one workaround:
get "Download Master" from the Chrome extensions (or a similar program)
search on Google (or elsewhere) for results, with Google set to 100 results per page
select "show all files"
type in your file extension, .rdf, and press enter
press download
You can grab 100 files per click; not bad.
