Download all the links (related documents) on a webpage using Python

I have to download a lot of documents from a webpage: WMV files, PDFs, BMPs, etc. All of them are reachable through links on the page, so each time I have to right-click a file, select 'Save Link As', and then save it as type 'All Files'. Is it possible to do this in Python? I searched SO and found answers on how to get the links from a webpage, but I want to download the actual files. Thanks in advance. (This is not a homework question :)).

Here is an example of how you could download selected files from http://pypi.python.org/pypi/xlwt
You will need to install mechanize first: http://wwwsearch.sourceforge.net/mechanize/download.html
import mechanize
from time import sleep

# Make a Browser (think of this as Chrome, Firefox, etc.)
br = mechanize.Browser()
# Visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
# for more ways to set up your br object, e.g. so it looks like Mozilla,
# and if you need to fill out forms with passwords.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')
with open("source.html", "wb") as f:
    f.write(br.response().read())  # can be helpful for debugging

# You will need to do some kind of pattern matching on your files
filetypes = [".zip", ".exe", ".tar.gz"]
myfiles = []
for l in br.links():  # you can also iterate through br.forms() to print forms on the page
    for t in filetypes:
        if t in str(l):  # check whether this link has a file extension we want (a regex also works)
            myfiles.append(l)
            break  # avoid adding the same link twice

def downloadlink(l):
    br.follow_link(l)  # opens the link; click_link(l) would only build a Request
    # Perhaps open the file more carefully and ensure it doesn't already exist
    with open(l.text, "wb") as f:
        f.write(br.response().read())
    print("%s has been downloaded" % l.text)
    br.back()  # return to the listing page before following the next link

for l in myfiles:
    sleep(1)  # throttle so you don't hammer the site
    downloadlink(l)
Note: br.click_link(l) only builds and returns a Request object without fetching it, whereas br.follow_link(l) opens the link directly, which is why follow_link is used above. See Mechanize difference between br.click_link() and br.follow_link()
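If you would rather not depend on mechanize, the link-filtering step can be sketched with the standard library alone. This is only a rough sketch: the names LinkCollector and document_links, and the default extension list, are made up for illustration and are not part of the answer above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def document_links(html, base_url, extensions=(".wmv", ".pdf", ".bmp")):
    """Return absolute URLs of links whose target ends with a wanted extension."""
    parser = LinkCollector()
    parser.feed(html)
    wanted = tuple(ext.lower() for ext in extensions)
    return [
        urljoin(base_url, href)  # resolve relative links against the page URL
        for href in parser.hrefs
        if href.lower().endswith(wanted)
    ]
```

Each returned URL could then be saved with urllib.request.urlretrieve(url, filename), with a short sleep between requests as in the mechanize example.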

See the Python code in this link: wget-vs-urlretrieve-of-python.
You can also do this very easily with Wget. Try the --level, --recursive and --accept command-line options. For example:
wget --accept wmv,doc --level 2 --recursive http://www.example.com/files/

Related

Getting filename from link and downloading it. Python

I'm trying to make an automated program that downloads a certain file from a link.
The problem is, I don't know what this file will be called. It's always a .zip, for example filename_4213432.zip. The link does not include this filename; it looks something like https://link.com/api/download/433265902. Therefore it's impossible to get the filename through the link. Is there a way to fetch this name and download the file?
print("link:")
url = input("> ")
request = requests.get(url, allow_redirects=True)
I'm stuck at this point because I don't know what to put in my open() now.
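One possible approach, assuming the server sends the name in a Content-Disposition response header (which download endpoints often do, but this one may not): read the name from that header and fall back to the last path segment of the final URL. The function names and the "download.zip" default below are made up for illustration.

```python
import os
from email.message import Message
from urllib.parse import unquote, urlparse


def filename_from_headers(headers, url, default="download.zip"):
    """Return the server-suggested file name, falling back to the URL path."""
    disposition = headers.get("Content-Disposition")
    if disposition:
        msg = Message()
        msg["Content-Disposition"] = disposition
        suggested = msg.get_filename()
        if suggested:
            # basename() drops any directory part a server might send
            return os.path.basename(suggested)
    name = os.path.basename(unquote(urlparse(url).path))
    return name or default


def download(url):
    """Download url and save it under the name derived above."""
    import requests  # third-party; imported here so the helper above has no dependencies

    response = requests.get(url, allow_redirects=True)
    response.raise_for_status()
    # response.url is the final URL after any redirects
    file_name = filename_from_headers(response.headers, response.url)
    with open(file_name, "wb") as f:
        f.write(response.content)
    return file_name
```

With this, the open() call simply uses whatever name filename_from_headers returns. If the header is missing, the redirect target in response.url is the next best source for the name.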

I'm using selenium to scrape a local web

I need to upload a file using the 'upload' button. After that a window appears, but I can't find the exact ID in the HTML code. Here are the screenshots and my code:
time.sleep(1)
driver.find_element_by_id("Upload-Action-Ico").click()
driver.find_element_by_xpath("//*[contains(text(), 'File')]").send_keys("file path")
I think that the ID is 'file' so I think this should work
time.sleep(1)
element=driver.find_element_by_id("file").click()
Try clicking on it and inspecting the HTML code; there should be the word "Button" or similar. Can you share the URL of the site?
I hope this helps, and excuse my English (it isn't my native language).
The input field does not contain any text. And its id is explicitly mentioned in the html, so you can try find_element_by_id:
driver.find_element_by_id("file").send_keys("file path")
If this doesn't work for you, then you can try using the xpath:
driver.find_element_by_xpath("//*[@id='file']").send_keys("file path")
You can use this code to select the file; after that, click the upload button.
import os
filePath = os.path.join(os.getcwd(), 'img.jpg')
driver.find_element_by_id('Upload-Action-Ico').send_keys(filePath)
os.getcwd() returns the current working directory.
img.jpg is located right next to the running script in the same directory.
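Note that the find_element_by_* helpers were removed in Selenium 4.3, so on a current install the same idea has to go through find_element. Here is a small sketch; upload_file is a made-up helper name, and it assumes the file input really has the id "file" as suggested above.

```python
import os


def upload_file(driver, input_id, file_name):
    """Send an absolute file path to an <input type="file"> element."""
    file_path = os.path.abspath(file_name)
    # "id" is the value of selenium.webdriver.common.by.By.ID;
    # with Selenium imported you would normally write By.ID here.
    file_input = driver.find_element("id", input_id)
    file_input.send_keys(file_path)
    return file_path
```

Usage would look like upload_file(driver, "file", "img.jpg"), followed by a click on the upload button. Sending an absolute path matters because the browser resolves the path, not the script.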

Using python to open various links

This is my first post here and I hope I get some answers.
I want to open various links from my FTP server and do some work in each of them. My links are http://mypage/photos0001/ , /photos002/, /photos003/ etc.
How can I write a script to open all of them and do the same job in each one?
I tried:
Link = 'http://mypage/photos0001/' + 1
To do something like loop, but this doesn't work of course.
Any help?
Without being able to see your actual FTP directory tree, this may be a little difficult, but hopefully the following can get you started.
Consider reading up on ftplib for more information (see Docs)
import ftplib

ftp = ftplib.FTP('mypage')
ftp.login()
for dir in ftp.nlst():
    if 'photos' in dir:
        ftp.cwd('/mypage/{}'.format(dir))
        for file in ftp.nlst():
            if file.endswith('.jpg'):
                try:
                    print('Attempting to download {}...'.format(file), end=' ')
                    with open(file, 'wb') as f:
                        # retrbinary streams the file in 8 KB blocks to f.write
                        ftp.retrbinary('RETR ' + file, f.write, 8*1024)
                    print('[SUCCESS]')
                except Exception as e:
                    print('[FAILED]')
                    print(e)
ftp.close()
So let's try and run through what is going on here:
Log in to your FTP server mypage.
List all the directories found in the root directory of your server.
If the folder name contains 'photos' then change working directory into that folder.
List all the files in this photos sub-folder.
If the file ends in .jpg, it's probably a picture we want.
Create a file on your system with the same name, and download the picture into it.
Repeat.
Now, expect to run into problems if your directory tree turns out to be slightly different from what you've described to us here; however, you should be able to modify the example to fit your server. I do know this code works, as I have been able to use it to recursively download .html files from ftp.debian.org.
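If the folders are reachable over plain HTTP as written in the question, the loop the asker was attempting with 'http://mypage/photos0001/' + 1 can be sketched with string formatting. The four-digit zero padding is an assumption (the question shows both photos0001 and photos002), and photo_urls is a made-up name.

```python
def photo_urls(first, last, template="http://mypage/photos{:04d}/"):
    """Build the numbered folder URLs from first to last, inclusive."""
    return [template.format(number) for number in range(first, last + 1)]


for url in photo_urls(1, 3):
    # Replace the print with whatever job has to be done for each folder
    print(url)
```

Adding an integer to a string fails in Python, which is why the number is formatted into the URL instead.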

Python: Save Excel File As-Is To Folder

I'm downloading Excel files from a website using beautifulsoup4.
I only need to download the files. I don't need to rename them, just download them to a folder, relative to where the code is.
The function takes a BeautifulSoup object, searches for <a> tags, then requests each matching link.
def save_excel_files(sfile):
    print("starting")
    for link in sfile.find_all("a"):
        candidate_link = link.get("href")
        if (candidate_link is not None
                and "Flat.File" in candidate_link):
            xfile = requests.get(candidate_link)
            if xfile:
                ### I just don't know what to do...
I've tried using os.path, with open("xtest", "wb") as f:, and many other variations. I've been at this for two evenings and am totally stuck.
The first issue is that I can't even get the files to download and save anywhere. xfile resolves to <Response [200]>, so that part is working; I'm just having a hard time coding the actual download and save.
Something like this should work:
xfile = requests.get(candidate_link)
file_name = candidate_link.split('/')[-1]
if xfile:
    with open(file_name, "wb") as f:
        f.write(xfile.content)
Tested with the following link, which I found at random on Google:
candidate_link = "http://berkeleycollege.edu/browser_check/samples/excel.xls"
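Since the question asks for the files to land in a folder, here is a slightly fuller sketch of the saving step. The helper name save_bytes, the "downloads" folder and the "download.bin" fallback are all made up for illustration; the folder is resolved relative to the current working directory.

```python
import os
from urllib.parse import unquote, urlparse


def save_bytes(content, url, folder="downloads"):
    """Write content into folder, keeping the file name found in the URL."""
    os.makedirs(folder, exist_ok=True)
    # Use only the path part of the URL so query strings do not end up in the name
    name = os.path.basename(unquote(urlparse(url).path)) or "download.bin"
    path = os.path.join(folder, name)
    with open(path, "wb") as f:
        f.write(content)
    return path
```

Inside the loop this would be called as save_bytes(xfile.content, candidate_link). If the hrefs on the page are relative, they need to be joined with the page URL (urllib.parse.urljoin) before calling requests.get.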

Download Lone Image From a Set of URLs

I have a set of URLs and names in a file as follows:
www.test.yom/something/somethingelse/Profile.aspx?id=1
John Doe
www.test.yom/something/somethingelse/Profile.aspx?id=24
John Benjamin
www.test.yom/something/somethingelse/Profile.aspx?id=307
Benjamin Franklin
....
Each URL page contains normal HTML and any amount of text, tables, etc., but will always have exactly one image in an <img> tag.
My goal is to download this image somehow to my drive, renaming it with the second line name (i.e. "John Doe.jpg" and "John Benjamin.jpg").
Is there an easy way to accomplish this? I parsed out the URL-Name file from raw HTML on a different page using UNIX commands (grep, tr, sed), but I'm guessing this will require something a bit more intricate. Right now I'm thinking Python script, but I'm not exactly sure which libraries to look at or where to start in general (although I am familiar with Python language itself). I would also be down to use Java or any other language if it would make the process easier. Any advice?
Edit: So... ran into a problem where the urls require authentication to access. This is fine but the problem is that it is two-step authentication, and the second step is a passcode sent to mobile. :-( But thanks for the help!
You can put the links in a list or a file and use requests to get the html, then use BeautifulSoup to find the image you want, extract the src attribute and use requests again to download the file. Both libraries are quite simple to use, you won't have a problem doing that simple script :).
Pseudo-code to help you start:
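import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url_list = ['url1', 'url2']  # placeholders for the profile page URLs
for i, url in enumerate(url_list):
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    img_element = soup.find('img')
    image_url = urljoin(url, img_element['src'])  # src may be relative
    image = requests.get(image_url)
    with open('image_{}.jpg'.format(i), 'wb') as f:  # rename to the person's name as needed
        f.write(image.content)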
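To get from the URL/name file in the question to a list you can loop over, something like the following might do. It assumes the file strictly alternates URL and name lines, that "http://" is the right scheme to prepend (the sample URLs have none), and that the images are JPEGs; read_url_name_pairs is a made-up name.

```python
def read_url_name_pairs(lines):
    """Pair each URL line with the name line that follows it."""
    cleaned = [line.strip() for line in lines if line.strip()]
    pairs = []
    # Even positions hold URLs, odd positions hold the matching names
    for url, name in zip(cleaned[0::2], cleaned[1::2]):
        if not url.startswith(("http://", "https://")):
            url = "http://" + url  # assumption: plain HTTP
        pairs.append((url, name + ".jpg"))
    return pairs
```

Reading the file is then read_url_name_pairs(open("urls.txt")) with your own file name, and each pair gives the page to fetch and the file name to save the image under.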
You can use the extraction module together with the requests module:
pip install requests
pip install extraction
Then:
import extraction
import requests
url = "http://google.com/"
html = requests.get(url).text
extracted = extraction.Extractor().extract(html, source_url=url)
print(extracted.image) # If you know that there is only one image in your page
print(extracted.images) # List of images on page
http://google.com/images/srpr/logo9w.png
['http://google.com/images/srpr/logo9w.png']
Note that source_url is optional in extract, but is recommended, as it makes it possible to rewrite relative URLs and image URLs into absolute paths.
Also, extracted.image is the first item of extracted.images if one exists, or None otherwise.
This is what I ended up doing to bypass the two-step authentication. Note that for the URLs I had if I log into one of the URLs and click the "Remember Me" option on login, this avoids the login page for the following method.
Download the "Save images" extension on Firefox. Restart Firefox.
In Tools -> "Save images" -> Options. Go to "Save" tab. In "Folder Options", pick folder to save files. In "File Names", pick "Use file name:". Enter appropriate file name.
Go to "http://tejji.com/ip/open-multiple-urls.aspx" in Firefox (not Chrome).
Copy and paste only the URLs into the textbox. Click "Submit". After all tabs load, close the tejji.com tab.
On the first profile page, right click -> "Save images" -> "Save images from ALL tabs".
Close the Save prompt if everything looks right.
All the images should now be in your designated folder.
All that's left is to rename the files based on the names (the files are numbered in order, which coincides with the order of the names if you kept the URLs in the same order), but that should be straightforward.
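The renaming step could be scripted roughly like this. It is only a sketch: it assumes every saved file name contains a number reflecting the save order, and that the folder holds nothing but the downloaded images. rename_in_order is a made-up name.

```python
import os
import re


def rename_in_order(folder, names):
    """Rename the numbered files in folder to the given names, in numeric order."""

    def number(file_name):
        match = re.search(r"\d+", file_name)
        return int(match.group()) if match else -1

    files = sorted(
        (f for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f))),
        key=number,  # numeric sort, so 10 comes after 2
    )
    renamed = []
    for old, name in zip(files, names):
        new = name + os.path.splitext(old)[1]  # keep the original extension
        os.rename(os.path.join(folder, old), os.path.join(folder, new))
        renamed.append(new)
    return renamed
```

The names list must be in the same order as the URLs that were pasted into the textbox. Try it on a copy of the folder first, since os.rename can overwrite a file that already has the target name.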
