I'm trying to make an automated program that downloads a certain file from a link.
Problem is, I don't know what this file will be called. It's always a .zip, so for example: filename_4213432.zip. The link does not include the filename; it looks something like this: https://link.com/api/download/433265902. Therefore it's impossible to get the filename through the link alone. Is there a way to fetch this name and download the file?
import requests

print("link:")
url = input("> ")
request = requests.get(url, allow_redirects=True)
I'm stuck at this point because I don't know what to put in my open() now.
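One common way to recover the real filename when it isn't in the URL is the Content-Disposition header that many download endpoints send along with the file. This is a minimal sketch, assuming the server actually sets that header (it falls back to a fixed name otherwise):

```python
import re
import requests

def filename_from_disposition(disposition, fallback="download.zip"):
    """Pull the filename out of a Content-Disposition header value,
    e.g. 'attachment; filename="filename_4213432.zip"'."""
    match = re.search(r'filename="?([^";]+)"?', disposition or "")
    return match.group(1) if match else fallback

def download(url):
    # Follow redirects, then name the local file after whatever
    # the server announces in its headers.
    response = requests.get(url, allow_redirects=True)
    response.raise_for_status()
    name = filename_from_disposition(response.headers.get("Content-Disposition"))
    with open(name, "wb") as f:
        f.write(response.content)
    return name
```

That way the open() call uses the name the server announces rather than anything parsed out of the URL.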
Currently, I'm using Flask with Python 3.
For sample purposes, here is a dropbox link
In order to fetch the file and save it, I'm doing the following.
urllib.request.urlretrieve("https://www.dropbox.com/s/w1h6vw2st3wvtfb/Sample_List.xlsx?dl=0", "Sample_List.xlsx")
The file is saved successfully to my project's root directory; however, there is a problem. When I try to open the file, I get this error.
What am I doing wrong over here?
Also, is there a way to get the file name and extension from the URL itself? Example, filename = Sample_List and extension = xlsx...something like this.
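A note on both points: a Dropbox link ending in dl=0 returns an HTML preview page rather than the file itself, which is a common reason the saved file won't open in Excel; changing dl=0 to dl=1 requests the file directly. As for pulling the name and extension out of the URL, the standard library can do it; a small sketch:

```python
import os
from urllib.parse import urlparse

def name_and_extension(url):
    """Split a URL's path into (filename, extension), ignoring the query string."""
    path = urlparse(url).path          # '/s/w1h6vw2st3wvtfb/Sample_List.xlsx'
    filename = os.path.basename(path)  # 'Sample_List.xlsx'
    name, ext = os.path.splitext(filename)
    return name, ext.lstrip(".")

name_and_extension("https://www.dropbox.com/s/w1h6vw2st3wvtfb/Sample_List.xlsx?dl=0")
# → ('Sample_List', 'xlsx')
```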
This is my first post here and I hope I get my answers.
I want to open various links from my FTP server and do some stuff in them. My links are http://mypage/photos0001/, /photos002/, /photos003/, etc.
How can I write a script to open all of them and do the same job in each?
I tried:
Link = 'http://mypage/photos0001/' + 1
to do something like a loop, but this of course doesn't work.
Any help?
Without being able to see your actual FTP directory tree, this may be a little difficult, but hopefully the following can get you started.
Consider reading up on ftplib for more information (see the Python ftplib docs).
import ftplib

ftp = ftplib.FTP('mypage')
ftp.login()
for dir in ftp.nlst():
    if 'photos' in dir:
        ftp.cwd('/mypage/{}'.format(dir))
        for file in ftp.nlst():
            if file.endswith('.jpg'):
                try:
                    print('Attempting to download {}...'.format(file), end=' ')
                    with open(file, 'wb') as f:
                        # retrbinary streams the remote file in binary mode
                        ftp.retrbinary('RETR ' + file, f.write, 8*1024)
                    print('[SUCCESS]')
                except Exception as e:
                    print('[FAILED]')
                    print(e)
ftp.close()
So let's run through what is going on here:
1. Log in to your FTP server mypage.
2. List all the directories found in the root directory of your server.
3. If a folder name contains 'photos', change the working directory into that folder.
4. List all the files in this photos sub-folder.
5. If a file ends in .jpg, it's probably a picture we want.
6. Create a file on your system with the same name, and download the picture into it.
7. Repeat.
Now, expect to run into problems when your directory tree turns out to be slightly different than you've described to us here; however, you should be able to modify the example to fit your server. I do know this code works, as I have been able to use it to recursively download .html files from ftp.debian.org.
I'm downloading Excel files from a website using beautifulsoup4.
I only need to download the files. I don't need to rename them, just download them to a folder, relative to where the code is.
The function takes in a BeautifulSoup object, searches for <a> tags, then makes a request to each link.
def save_excel_files(sfile):
    print("starting")
    for link in sfile.find_all("a"):
        candidate_link = link.get("href")
        if (candidate_link is not None
                and "Flat.File" in candidate_link):
            xfile = requests.get(candidate_link)
            if xfile:
                ### I just don't know what to do...
I've tried using os.path, with open("xtest", "wb") as f:, and many other variations. Been at this for two evenings and totally stuck.
The first issue is that I can't even get the files to download and save anywhere. xfile resolves to a 200 response, so that part is working; I'm just having a hard time coding the actual download and save.
Something like this should've worked:
xfile = requests.get(candidate_link)
file_name = candidate_link.split('/')[-1]
if xfile:
    with open(file_name, "wb") as f:
        f.write(xfile.content)
Tested with the following link I found randomly on Google:
candidate_link = "http://berkeleycollege.edu/browser_check/samples/excel.xls"
I am trying to download zip files from measuredhs.com using the following code:
url ='https://dhsprogram.com/customcf/legacy/data/download_dataset.cfm?Filename=BFBR62DT.ZIP&Tp=1&Ctry_Code=BF'
request = urllib2.urlopen(url)
output = open("install.zip", "w")
output.write(request.read())
output.close()
However the downloaded file does not open. I get a message saying the compressed zip folder is invalid.
To access the download link, one needs to log in, which I have done. If I click on the link, it automatically downloads the file, or even if I paste it in a browser.
Thanks
Try writing to local file in binary mode.
with open('install.zip', 'wb') as output:
    output.write(request.read())
Also, comparing the md5/sha1 hash of the downloaded file will let you know if the downloaded file has been corrupted.
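To do that comparison, you can hash the local file and check it against a known-good digest; the chunked read below keeps memory use flat even for large downloads:

```python
import hashlib

def sha1_of_file(path, chunk_size=8192):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```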
I have a set of URLs and names in a file as follows:
www.test.yom/something/somethingelse/Profile.aspx?id=1
John Doe
www.test.yom/something/somethingelse/Profile.aspx?id=24
John Benjamin
www.test.yom/something/somethingelse/Profile.aspx?id=307
Benjamin Franklin
....
Each URL page contains normal HTML and any amount of text, tables, etc., but will always have exactly one image in an <img> tag.
My goal is to download this image somehow to my drive, renaming it with the second line name (i.e. "John Doe.jpg" and "John Benjamin.jpg").
Is there an easy way to accomplish this? I parsed out the URL-Name file from raw HTML on a different page using UNIX commands (grep, tr, sed), but I'm guessing this will require something a bit more intricate. Right now I'm thinking Python script, but I'm not exactly sure which libraries to look at or where to start in general (although I am familiar with Python language itself). I would also be down to use Java or any other language if it would make the process easier. Any advice?
Edit: So... ran into a problem where the urls require authentication to access. This is fine but the problem is that it is two-step authentication, and the second step is a passcode sent to mobile. :-( But thanks for the help!
You can put the links in a list or a file and use requests to get the html, then use BeautifulSoup to find the image you want, extract the src attribute and use requests again to download the file. Both libraries are quite simple to use, you won't have a problem doing that simple script :).
Some code to help you start:
import requests
from bs4 import BeautifulSoup

url_list = ['url1', 'url2']
for url in url_list:
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    img_element = soup.find('img')
    image_url = img_element['src']
    # Download the image and save it under its original file name
    img_data = requests.get(image_url).content
    with open(image_url.split('/')[-1], 'wb') as f:
        f.write(img_data)
You can use the extraction module together with the requests module:
pip install requests
pip install extraction
Then:
import extraction
import requests
url = "http://google.com/"
html = requests.get(url).text
extracted = extraction.Extractor().extract(html, source_url=url)
print(extracted.image) # If you know that there is only one image in your page
print(extracted.images) # List of images on page
http://google.com/images/srpr/logo9w.png
['http://google.com/images/srpr/logo9w.png']
Note that source_url is optional in extract, but is recommended as it makes it possible to rewrite relative urls and image urls into absolute paths.
And extracted.image is the first item of extracted.images if it exists, or None otherwise.
This is what I ended up doing to bypass the two-step authentication. Note that for the URLs I had, if I log into one of them and check the "Remember Me" option on login, this avoids the login page for the following method.
1. Download the "Save images" extension in Firefox. Restart Firefox.
2. In Tools -> "Save images" -> Options, go to the "Save" tab. In "Folder Options", pick a folder to save files. In "File Names", pick "Use file name:" and enter an appropriate file name.
3. Go to "http://tejji.com/ip/open-multiple-urls.aspx" in Firefox (not Chrome).
4. Copy and paste only the URLs into the textbox. Click "Submit". After all tabs load, close the tejji.com tab.
5. On the first profile page, right click -> "Save images" -> "Save images from ALL tabs".
6. Close the Save prompt if everything looks right.
7. All the images should now be in your designated folder.
All that's left is to rename the files based on the names (the files are numbered in an order that coincides with the order of the names, if you kept the URLs in the same order), but that should be rudimentary.
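As a sketch of that last renaming step: assuming the extension saved the images as numbered files (001.jpg, 002.jpg, ...) and you have the names in a list in the same order, something like this could do the renaming. The folder layout and numbering scheme here are assumptions; adjust to whatever the extension actually produced:

```python
import os

def rename_numbered_images(folder, names, ext=".jpg"):
    """Rename files saved in numbered order (e.g. 001.jpg, 002.jpg, ...)
    to 'Name.jpg', pairing the i-th file (sorted) with names[i]."""
    files = sorted(f for f in os.listdir(folder) if f.endswith(ext))
    for old, name in zip(files, names):
        os.rename(os.path.join(folder, old),
                  os.path.join(folder, name + ext))
```

For example, rename_numbered_images("photos", ["John Doe", "John Benjamin"]) would turn 001.jpg and 002.jpg into "John Doe.jpg" and "John Benjamin.jpg".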