I'm trying to download images from a list of URL's. Each URL contains a txt file with jpeg information. The URL's are uniform except for an incremental change in the folder number. Below are example URL's
Min: https://marco.ccr.buffalo.edu/data/train/train-00001-of-00407
Max: https://marco.ccr.buffalo.edu/data/train/train-00407-of-00407
I want to read each of these URL's and store the their output to another folder. I was looking into the requests python library to do this but Im wondering how to iterate over the URL's and essentially write my loop to increment over that number in the URL. Apologize in advance if I misuse the terminology. Thanks!
# This may be terrible starting code
# imported the requests library
import requests
url = "https://marco.ccr.buffalo.edu/data/train/train-00001-of-00407"
# URL of the image to be downloaded is defined as image_url
r = requests.get(url) # create HTTP response object
# send a HTTP request to the server and save
# the HTTP response in a response object called r
with open("data.txt",'wb') as f:
# Saving received content as a png file in
# binary format
# write the contents of the response (r.content)
# to a new file in binary mode.
f.write(r.content)
You can generate urls like this and perform get for each
for i in range(1,408):
url = "https://marco.ccr.buffalo.edu/data/train/train-" + str(i).zfill(5) + "-of-00407"
print (url)
Also use a variable in the filename to keep a different copy of each. For eg, use this
with open("data" + str(i) + ".txt",'wb') as f:
Overall code may look something like this (not exactly this)
import requests
for i in range(1,408):
url = "https://marco.ccr.buffalo.edu/data/train/train-" + str(i).zfill(5) + "-of-00407"
r = requests.get(url)
# you might have to change the extension
with open("data" + str(i).zfill(5) + ".txt",'wb') as f:
f.write(r.content)
Related
I want to download multiple .hdr files from a website called hdrihaven.com.
My knowledge of python is not that great but here is what I have tried so far:
import requests
url = 'https://hdrihaven.com/files/hdris/'
resolution = '4k'
file = 'pump_station' #would need to be every file
url_2k = url + file + '_' + resolution + '.hdr'
print(url_2k)
r = requests.get(url_2k, allow_redirects=True)
open(file + resolution + '.hdr', 'wb').write(r.content)
Idealy file would just loop over every file in the directory.
Thanks for your answers in advance!
EDIT
I found a script on github that does what I need: https://github.com/Alzy/hdrihaven_dl. I edited it to fit my needs here: https://github.com/ktkk/hdrihaven-downloader. It uses the technique of looping through a list of all available files as proposed in the comments.
I have found that the requests module as well as urllib are extremly slow compared to native downloading from eg. Chrome. If anyone has an idea as to how I can speed these up pls let me know.
There are 2 ways you can do this:
You can use an URLto fetch all the files and iterate through a loop to download them individually. This of course only works if there exists such a URL.
You can pass in individual URL to a function that can download them in parallel/bulk.
For example:
import os
import requests
from time import time
from multiprocessing.pool import ThreadPool
def url_response(url):
path, url = url
r = requests.get(url, stream = True)
with open(path, 'wb') as f:
for ch in r:
f.write(ch)
urls = [("Event1", "https://www.python.org/events/python-events/805/"),("Event2", "https://www.python.org/events/python-events/801/"),
("Event3", "https://www.python.org/events/python-user-group/816/")]
start = time()
for x in urls:
url_response (x)
print(f"Time to download: {time() - start}")
This code snippet is taken from here Download multiple files (Parallel/bulk download). Read on there for more information on how you can do this.
Please correct me if I am wrong as I am a beginner in python.
I have a web services URL which contains an XML file:
http://abc.tch.xyz.edu:000/patientlabtests/id/1345
I have a list of values and I want to append each value in that list to the URL and download file for each value and the name of the downloaded file should be the same to the value appended from the list.
It is possible to download one file at a time but I have 1000's of values in the list and I was trying to write a function with a for loop and I am stuck.
x = [ 1345, 7890, 4729]
for i in x :
url = http://abc.tch.xyz.edu:000/patientlabresults/id/{}.format(i)
response = requests.get(url2)
****** Missing part of the code ********
with open('.xml', 'wb') as file:
file.write(response.content)
file.close()
The files downloaded from URL should be like
"1345patientlabresults.xml"
"7890patientlabresults.xml"
"4729patientlabresults.xml"
I know there is a part of the code which is missing and I am unable to fill in that missing part. I would really appreciate if anyone can help me with this.
Accessing your web service url seem not to be working. Check this.
import requests
x = [ 1345, 7890, 4729]
for i in x :
url2 = "http://abc.tch.xyz.edu:000/patientlabresults/id/"
response = requests.get(url2+str(i)) # i must be converted to a string
Note: When you use 'with' to open a file, you do not have close the file since it will closed automatically.
with open(filename, mode) as file:
file.write(data)
Since the Url you provide is not working, I am going to use a different url. And I hope you get the idea and how to write to a file using the custom name
import requests
categories = ['fruit', 'car', 'dog']
for category in categories :
url = "https://icanhazdadjoke.com/search?term="
response = requests.get(url + category)
file_name = category + "_JOKES_2018" #Files will be saved as fruit_JOKES_2018
r = requests.get(url + category)
data = r.status_code #Storing the status code in 'data' variable
with open(file_name+".txt", 'w+') as f:
f.write(str(data)) # Writing the status code of each url in the file
After running this code, the status codes will be written in each of the files. And the file will also be named as follows:
car_JOKES_2018.txt
dog_JOKES_2018.txt
fruit_JOKES_2018.txt
I hope this gives you an understanding of how to name the files and write into the files.
I think you just want to create a path using str.format as you (almost) are for the URL. maybe something like the following
import os.path
x = [ 1345, 7890, 4729]
for i in x:
path = '1345patientlabresults.xml'.format(i)
# ignore this file if we've already got it
if os.path.exists(path):
continue
# try and get the file, throwing an exception on failure
url = 'http://abc.tch.xyz.edu:000/patientlabresults/id/{}'.format(i)
res = requests.get(url)
res.raise_for_status()
# write the successful file out
with open(path, 'w') as fd:
fd.write(res.content)
I've added some error handling and better behaviour on retry
This code is supposed to download a list of pdfs into a directory
for pdf in preTag:
pdfUrl = "https://the-eye.eu/public/Books/Programming/" +
pdf.get("href")
print("Downloading...%s"% pdfUrl)
#downloading pdf from url
page = requests.get(pdfUrl)
page.raise_for_status()
#saving pdf to new directory
pdfFile = open(os.path.join(filePath, os.path.basename(pdfUrl)), "wb")
for chunk in page.iter_content(1000000):
pdfFile.write(chunk)
pdfFile.close()
I used os.path.basename() just to make sure the files would actually download. However, I want to know how to change the file name from 3D%20Printing%20Blueprints%20%5BeBook%5D.pdf to something like "3D Printing Blueprints.pdf"
You can use the urllib2 unquote function:
import urllib2
print urllib2.unquote("3D%20Printing%20Blueprints%20%5BeBook%5D.pdf") #3D Printing Blueprints.pdf
use this:
os.rename("3D%20Printing%20Blueprints%20%5BeBook%5D.pdf", "3D Printing Blueprints.pdf")
you can find more info here
I am executing the python script from commandline with this
python myscript.py
This is my script
if item['image_urls']:
for image_url in item['image_urls']:
subprocess.call(['wget','-nH', image_url, '-P images/'])
Now when i run that the on the screen i see output like this
HTTP request sent, awaiting response... 200 OK
Length: 4159 (4.1K) [image/png]
now what i want is that there should be no ouput on the terminal.
i want to grab the ouput and find the image extension from there i.e from [image/png] grab the png and renaqme the file to something.png
Is this possible
If all you want is to download something using wget, why not try urllib.urlretrieve in standard python library?
import os
import urllib
image_url = "https://www.google.com/images/srpr/logo3w.png"
image_filename = os.path.basename(image_url)
urllib.urlretrieve(image_url, image_filename)
EDIT: If the images are dynamically redirected by a script, you may try requests package to handle the redirection.
import requests
r = requests.get(image_url)
# here r.url will return the redirected true image url
image_filename = os.path.basename(r.url)
f = open(image_filename, 'wb')
f.write(r.content)
f.close()
I haven't test the code since I do not find a suitable test case. One big advantage for requests is it can also handle authorization.
EDIT2: If the image is dynamically served by a script, like gravatar image, you can usually find the filename in the response header's content-disposition field.
import urllib2
url = "http://www.gravatar.com/avatar/92fb4563ddc5ceeaa8b19b60a7a172f4"
req = urllib2.Request(url)
r = urllib2.urlopen(req)
# you can check the returned header and find where the filename is loacated
print r.headers.dict
s = r.headers.getheader('content-disposition')
# just parse the filename
filename = s[s.index('"')+1:s.rindex('"')]
f = open(filename, 'wb')
f.write(r.read())
f.close()
EDIT3: As #Alex suggested in the comment, you may need to sanitize the encoded filename in the returned header, I think just get the basename is ok.
import os
# this will remove the dir path in the filename
# so that `../../../etc/passwd` will become `passwd`
filename = os.path.basename(filename)
Is there a way I can download all/some the image files (e.g. JPG/PNG) from a Google Images search result?
I can use the following code to download one image that I already know its url:
import urllib.request
file = "Facts.jpg" # file to be written to
url = "http://www.compassion.com/Images/Hunger-Facts.jpg"
response = urllib.request.urlopen (url)
fh = open(file, "wb") #open the file for writing
fh.write(response.read()) # read from request while writing to file
To download multiple images, it has been suggested that I define a function and use that function to repeat the task for each image url that I would like to write to disk:
def image_request(url, file):
response = urllib.request.urlopen(url)
fh = open(file, "wb") #open the file for writing
fh.write(response.read())
And then loop over a list of urls with:
for i, url in enumerate(urllist):
image_request(url, str(i) + ".jpg")
However, what I really want to do is download all/some image files (e.g. JPG/PNG) from my own search result from Google Images without necessarily having a list of the image urls beforehand.
P.S.
Please I am a complete beginner and would favour an answer that breaks down the broad steps to achieve this over one that is bogs down on specific codes. Thanks.
You can use the Google API like this, where BLUE and DOG are your search parameters:
https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=BLUE%20DOG
There is a developer guide about this here:
https://developers.google.com/image-search/v1/jsondevguide
You need to parse this JSON format before you can use the links directly.
Here's a start to your JSON parsing:
import json
j = json.loads('{"one" : "1", "two" : "2", "three" : "3"}')
print(j['two'])