Python/Mechanize: Preventing images from being downloaded/opened

Python/Mechanize: Preventing images from being downloaded/opened - python

When I open a page, I'm assuming all images are also downloaded with Mechanize.
Is there a way for me to just retrieve the source code?
If mechanize doesn't allow for this, is there an alternative that does?

Related

Scraping PDF's from a password protected website

I work in tech support and currently have to manually keep our manuals for products updated manually by periodically checking to see if it has an update and if it does replacing the current one saved on our network.
I was wondering if it would be possible to build a small program to quickly download all files on a suppliers website and have them automatically download and be sorted into the given folders for those products, replacing the current PDF's in that file. I must also note that the website is password protected and is sorted into folders.
Is this possible with Python? I figured a small program I could perhaps run once a week or something to automatically update our manuals would be super useful (and a learning experience).
Apologies if I haven't explained the requirement well, any questions let me know.

It's certainly possible. As the other answer suggests you will want to use libaries like Requests (Handle HTTP requests) or Selenium (AUtomated browser activity) to navigate through the login.
You'll need to sort through the links on a given page, could be done with beautifulsoup ideally (An HTML parser) but could be done with selenium (Automated Browser activity).You'll need to check out libraries like requests (To handle HTTP requests) for downloading the pdf's, the OS module for sorting the folders out into specific folders and replacing files.
I strongly urge you to think through the steps, But I hope that gives an idea about the libaries that you'll need to learn abit about. The most challenging thing to learn will be using selenium, so if you can use requests to do the login that is much better.
If you've got a basic grasp of python the requests, OS module and beautifulsoup libraries are not difficult things to pick up.

You can use selenium for browser automation. This could insert the password (although the are you a robot stuff might stop you), and then you can download the pdf's simply by setting a default download location and clicking the download button. This will make the browser download the files to the default download location.

Download a .xls file from a aspx page using Python

I'm trying to download a .xls from this Site
I need to somehow click on the second button("Exporta informácion diária") on the grid and download the .xls file.
I tried with requests and beautifulsoup but didnt work.
After that, tried with selenium just for some tests and i managed to do what i needed.
Can someone please explain how can i download the .xls file without using a headless browser?
Thank You.

To do this, you first need to understand what the flow of network requests that performs the download.
The easiest way is to open the developer tools in the browser you are using. And follow the appropriate requests.
In your case, there is an POST Request, Which returns the exact address to the file.
Download it with a GET request.

Python write web scraper to get pdf files without several logins

How can I download all the pdf (or specific extension files like .tif or .pdf) from a webpage that requires login. I dont want to log in everytime for every pdf so I cant use link generation and pushing to browser scheme

The solution was simple: just posting it for others may have the same question
mydriver.get("https://username:password#www.somewebsite.com/somelink")

Grabbing a .jsp generated PNG in Python

I am trying to grab a PNG image which is being dynamically generated with JSP in a web service.
I have tried visiting the web page it is contained in and grabbing the image src attribute; but the link leads to a .jsp file. Reading the response with urllib2 just shows a lot of gibberish.
I also need to do this while logged into the web service in question, using mechanize. This seems to exclude the option of grabbing a screenshot with webkit2png or similar.
Thanks for any suggestions.

If you use urllib correctly (for example, making sure your User-Agent resembles a browser etc), the "gibberish" you get back is the actual file, so you just need to write it out to disk (open the file with "wb" for writing in binary mode) and re-read it with some image-manipulation library if you need to play with it. Or you can use urlretrieve to save it directly on the filesystem.
If that's a jsp, chances are that it takes parameters, which might be appended by the browser via javascript before the request is done; you should look at the real request your browser makes, before trying to reproduce it. You can do that with the Chrome Developer Tools, Firefox LiveHTTPHeaders, etc etc.
I do hope you're not trying to break a captcha.

Is there a way to save a captcha image and view it later in python?

I am scripting in python for some web automation. I know i can not automate captchas but here is what i want to do:
I want to automate everything i can up to the captcha. When i open the page (usuing urllib2) and parse it to find that it contains a captcha, i want to open the captcha using Tkinter. Now i know that i will have to save the image to my harddrive first, then open it but there is an issue before that. The captcha image that is on screen is not directly in the source anywhere. There is a variable in the source, inside some javascript, that points to another page that has the link to the image, BUT if you load that middle page, the captcha picture for that link changes, so the image associated with that javascript variable is no longer valid. It may be impossible to gather the image using this method, so please enlighten me if you have any ideas on this.
Now if I use firebug to load the page, there is a "GET" that is a direct link to the current Captcha image that i am seeing, and i'm wondering if there is anyway to make python or ullib2 see the "GET"s that are going on when a page is loaded, because if that was possible, this would be simple.
Please let me know if you have any suggestions.

Of course the captcha's served by a page which will serve a new one each time (if it was repeated, then once it was solved for one fake userid, a spammer could automatically make a million!). I think you need some "screenshot" functionality to capture the image you want -- there is no cross-platform way to invoke such functionality, but each platform (or desktop manager in the case of Linux, BSD, etc) tends to have one. Or, you could automate the browser (e.g. via SeleniumRC) to "screenshot" (e.g. "print to PDF") things at the right time. (I believe what you're seeing in firebug may be misleading you because it is "showing a snapshot"... just at the html source or DOM level rather than at a screen/bitmap level).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python/Mechanize: Preventing images from being downloaded/opened - python

When I open a page, I'm assuming all images are also downloaded with Mechanize. Is there a way for me to just retrieve the source code? If mechanize doesn't allow for this, is there an alternative that does?

Related

Scraping PDF's from a password protected website

Download a .xls file from a aspx page using Python

Python write web scraper to get pdf files without several logins

Grabbing a .jsp generated PNG in Python

Is there a way to save a captcha image and view it later in python?

Categories

Resources