I'm playing with Python for some crawling work. I know urllib.urlopen("http://XXXX") can get me the HTML of a target website. However, the image links in that page usually point back to the original server, so the images are unavailable in a backed-up copy. Is there a way to also save the images locally, so the full content of the page can be read without an internet connection? Essentially I want to back up the whole webpage, but I'm not sure how to do that in Python. Also, if it could strip out the advertisements, that would be even better. Thanks.
If you're looking to back up a single webpage, you're well on your way.
Since you mention crawling: if you want to back up an entire website, you'll need to do some real crawling, and you'll want Scrapy for that.
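For reference, here's a minimal Scrapy spider sketch for backing up a whole site; the class name, domain, and file naming are placeholder assumptions, not a tested implementation:

```python
# backup_spider.py -- run with:  scrapy runspider backup_spider.py
import scrapy

class BackupSpider(scrapy.Spider):
    name = "backup"
    allowed_domains = ["example.com"]        # placeholder target site
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Save the raw HTML of each page, then follow the links it contains.
        filename = response.url.rstrip("/").split("/")[-1] or "index"
        with open(filename + ".html", "wb") as f:
            f.write(response.body)
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```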
There are several ways of downloading files off the interwebs; see these questions:
Python File Download
How to download a file in python
Automate file download from http using python
Hope this helps
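To give a concrete starting point for the single-page case, here's a rough standard-library sketch (Python 3, where urllib.urlopen became urllib.request.urlopen) that saves a page's HTML plus its images. The URL and the backup/ directory are placeholders, and rewriting the saved HTML to point at the local image copies (and stripping ads) is left out:

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

page_url = "https://example.com"  # placeholder
html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="replace")

os.makedirs("backup", exist_ok=True)
with open(os.path.join("backup", "page.html"), "w", encoding="utf-8") as f:
    f.write(html)

parser = ImgCollector()
parser.feed(html)
for i, src in enumerate(parser.sources):
    # Resolve relative links against the page URL before downloading.
    img_url = urljoin(page_url, src)
    urllib.request.urlretrieve(img_url, os.path.join("backup", f"img_{i}"))
```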
I work in tech support and currently have to keep our product manuals updated by hand, periodically checking whether each one has an update and, if it does, replacing the current copy saved on our network.
I was wondering whether it would be possible to build a small program that downloads all the files on a supplier's website and sorts them into the given folders for those products, replacing the current PDFs in each folder. I should also note that the website is password protected and is organized into folders.
Is this possible with Python? I figured a small program I could run once a week or so to automatically update our manuals would be super useful (and a learning experience).
Apologies if I haven't explained the requirement well; any questions, let me know.
It's certainly possible. As the other answer suggests, you will want libraries like Requests (handles HTTP requests) or Selenium (automates browser activity) to get through the login.
You'll need to sort through the links on a given page, which is ideally done with BeautifulSoup (an HTML parser) but could also be done with Selenium. You'll then want Requests for downloading the PDFs, and the os module for sorting the downloads into specific folders and replacing the existing files.
I strongly urge you to think through the steps, but I hope that gives an idea of the libraries you'll need to learn a bit about. The most challenging one to learn is Selenium, so if you can do the login with Requests, that is much better.
If you've got a basic grasp of Python, the requests, os, and BeautifulSoup libraries are not difficult to pick up. A sketch of those pieces together follows.
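Here is that sketch, assuming the site uses a plain HTML login form; the login URL, form field names, and folder layout are all assumptions you'd need to check against the real site (your browser's network tab will show the actual form fields):

```python
import os
import requests
from bs4 import BeautifulSoup

BASE = "https://supplier.example.com"  # hypothetical supplier site
session = requests.Session()
session.post(BASE + "/login", data={"username": "me", "password": "secret"})

# Fetch a product page and collect every link ending in .pdf.
page = session.get(BASE + "/manuals/product-a")
soup = BeautifulSoup(page.text, "html.parser")
pdf_links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].lower().endswith(".pdf")]

os.makedirs("manuals/product-a", exist_ok=True)
for href in pdf_links:
    url = requests.compat.urljoin(BASE, href)
    name = os.path.basename(url)
    # Overwriting the existing copy handles the "replace the current PDF" step.
    with open(os.path.join("manuals/product-a", name), "wb") as f:
        f.write(session.get(url).content)
```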
You can use Selenium for browser automation. It can enter the password for you (although are-you-a-robot checks might stop you), and then you can download the PDFs simply by setting a default download location and clicking the download button. The browser will then save the files to that location.
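For the download-location part, a sketch of how that looks with Chrome (the URL, link text, and paths are hypothetical):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/path/to/manuals",   # where files land
    "plugins.always_open_pdf_externally": True,         # download PDFs, don't preview
})
driver = webdriver.Chrome(options=options)
driver.get("https://supplier.example.com/manuals")             # hypothetical URL
driver.find_element(By.LINK_TEXT, "Product A manual").click()  # hypothetical link
```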
I am new to this subject, so my question may prove stupid; sorry in advance.
My challenge is to do web-scraping, say for this page: link (google)
I'm trying to web-scrape it using Python.
My problem is that when I use Python's requests.get, I don't seem to get the full content of the page. I guess this is because the page has many resources and Python does not fetch them all. (More than that, when I scroll in Chrome, more data is revealed, yet that extra data does not appear in the page source I downloaded.)
How can I get the full content of a web page? What am I missing?
Thanks.
requests.get will get you the web page, but only what the page decides to give a robot. If you want the full web page as you'd see it as a human, you need to trick the site by changing your headers. If you need to scroll or click buttons to see the whole page, which is what I think you'll need to do here, I suggest you take a look at Selenium.
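A minimal sketch of the headers trick: present a browser-like User-Agent so the site serves what it would give a human. Note this won't help when content is loaded by JavaScript as you scroll; that's the case where Selenium earns its keep:

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
resp = requests.get("https://www.example.com", headers=headers)  # placeholder URL
print(len(resp.text))  # compare with the response size you get without headers
```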
Sorry if this is not a valid question; I personally feel it borders on the edge.
Assuming the website involved has given full permission
How could I download the ENTIRE contents (HTML) of a website using a Python data scraper? By entire contents I mean not only the current page you are on, but any other directory that branches off that main website. E.g.
Using the link:
https://www.dogs.com
could I pull info from:
https://www.dogs.com/about-us
and any other directory attached to the "https://www.dogs.com/"
(I have no idea whether dogs.com is a real website or not; it's just an example.)
I have already made a scraper that pulls info from a single given link (nothing further than that), but I want to improve it so I don't have to supply heaps of links. I understand I could use an API, but if this is possible I would rather do it this way. Cheers!
While there is Scrapy to do this professionally, you can use requests to fetch the URL data and bs4 to parse the HTML and look into it. It's also easier for a beginner, I guess.
Whichever way you go, you need a starting point; from there you just follow the links in the page, and then the links within those pages.
You might need to check whether a URL links to another website or is still within the targeted website. Find the pages one by one and scrape them, as in the sketch below.
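Putting that together, a rough same-site crawler sketch with requests and bs4; dogs.com stands in for whatever site you actually have permission to crawl, and the page cap is only there to keep the sketch from running away:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start = "https://www.dogs.com"
domain = urlparse(start).netloc
to_visit, seen = [start], set()

while to_visit and len(seen) < 100:          # cap pages for the example
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    html = requests.get(url).text
    # ... scrape whatever you need from html here ...
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == domain:  # skip links leaving the target site
            to_visit.append(link)
```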
Newbie here. I have a lot of different functions in a Python program that download a bunch of data from the internet, manipulate it, and display it to the public. I have a bunch of links to different tables of data, and I figure it would be very inconvenient for people to have to wait for the data to download when they click a link on my website. How can I configure Django to run the scripts that download the data at, say, 6am, and save some kind of cached template of the data so people can quickly view it for the entire day, then refresh and update it for the next day? Your insight and guidance would be very appreciated. Thank you, and happy holidays!
I'd suggest Celery for any recurring tasks in Django. Their docs are great and already have a using-Celery-with-Django tutorial right in them.
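A minimal sketch of the daily-6am shape of this with Celery beat; refresh_tables and the module paths are hypothetical, and the CELERY_BEAT_SCHEDULE settings name assumes the usual Django setup where the Celery app reads settings with namespace='CELERY':

```python
# myapp/tasks.py
from celery import shared_task

@shared_task
def refresh_tables():
    # Call your existing download/manipulation functions here, then cache the
    # rendered tables (e.g. via Django's cache framework) for fast page views.
    ...

# settings.py
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "refresh-tables-daily": {
        "task": "myapp.tasks.refresh_tables",
        "schedule": crontab(hour=6, minute=0),  # every day at 06:00
    },
}
```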
How can I download all the PDFs (or files with a specific extension, like .tif or .pdf) from a webpage that requires a login? I don't want to log in every time for every PDF, so I can't use the scheme of generating links and pushing them to the browser.
The solution was simple; just posting it for others who may have the same question:
mydriver.get("https://username:password@www.somewebsite.com/somelink")
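If the site really does accept user:password in the URL like that, it is using HTTP Basic auth, so an equivalent browser-free sketch with requests (URL and file names are placeholders) would be:

```python
import requests

session = requests.Session()
session.auth = ("username", "password")  # Basic auth, same idea as user:pass@host

for name in ["manual1.pdf", "manual2.pdf"]:  # placeholder file list
    resp = session.get(f"https://www.somewebsite.com/files/{name}")
    with open(name, "wb") as f:
        f.write(resp.content)
```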