Images VERY LARGE when downloading - python

I'm trying to download some pictures from a website. When I download them in the browser they are much smaller than the ones downloaded with my code. They have the same resolution, but the difference in file size is very large!
def download(url, pathname):
    """
    Downloads a file given an URL and puts it in the folder `pathname`
    """
    # if path doesn't exist, make that path dir
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    # download the body of response by chunk, not immediately
    response = requests.get(url, stream=True)
    # get the total file size
    file_size = int(response.headers.get("Content-Length", 0))
    # get the file name
    filename = os.path.join(pathname, url.split("/")[-1])
    # progress bar, changing the unit to bytes instead of iteration (default by tqdm)
    progress = tqdm(response.iter_content(1024), f"Downloading {filename}", total=file_size, unit="B", unit_scale=True,
                    unit_divisor=1024, disable=True)
    with open(filename, "wb") as f:
        for data in progress.iterable:
            # write data read to the file
            f.write(requests.get(url).content)
            # update the progress bar manually
            progress.update(len(data))
Example: https://wallpaper-mania.com/wp-content/uploads/2018/09/High_resolution_wallpaper_background_ID_77700030695.jpg
Browser: About 222 KB
Code: 48.4 MB
How does this difference come about? How can I improve the code so that the downloaded images are smaller?

f.write(requests.get(url).content)
It looks like you're re-downloading the entire file for every 1024 byte chunk, so you're getting 222 copies of the image. Make that:
f.write(data)
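For reference, here is a minimal corrected sketch of the whole loop, assuming the same imports (os, requests, tqdm) as in the question; it streams the body once and writes each chunk as it arrives:

import os
import requests
from tqdm import tqdm

def download(url, pathname):
    """Download the file at `url` into the folder `pathname`, streaming chunk by chunk."""
    os.makedirs(pathname, exist_ok=True)
    response = requests.get(url, stream=True)
    file_size = int(response.headers.get("Content-Length", 0))
    filename = os.path.join(pathname, url.split("/")[-1])
    with open(filename, "wb") as f, tqdm(total=file_size, unit="B", unit_scale=True,
                                         unit_divisor=1024,
                                         desc=f"Downloading {filename}") as progress:
        for data in response.iter_content(1024):
            # write the chunk that was just streamed, not a fresh full download
            f.write(data)
            progress.update(len(data))

With this version the file on disk should match the Content-Length reported by the server (about 222 KB for the example image).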

Related

Different behavior using tqdm

I was making an image-downloading project for a website, but I encountered some strange behavior using tqdm. In the code below I included two options for creating the tqdm progress bar. In option one I did not pass the iterable content from the response into tqdm directly, while in option two I did. Although the code looks similar, the results are strangely different.
This is what the progress bar's result looks like using Option 1
This is what the progress bar's result looks like using Option 2
Option one gives the result I want, but I just couldn't find an explanation for the behavior of Option 2. Can anyone help me explain this behavior?
import requests
from tqdm import tqdm
import os

# Folder to store in
default_path = "D:\\Downloads"

def download_image(url):
    """
    This function will download the given url's image with proper filename labeling
    If a path is not provided the image will be downloaded to the Downloads folder
    """
    # Establish a Session with cookies
    s = requests.Session()
    # Fix for pixiv's request you have to add referer in order to download images
    response = s.get(url, headers={'User-Agent': 'Mozilla/5.0',
                                   'referer': 'https://www.pixiv.net/'}, stream=True)
    file_name = url.split("/")[-1]  # Retrieve the file name of the link
    together = os.path.join(default_path, file_name)  # Join together path with the file_name. Where to store the file
    file_size = int(response.headers["Content-Length"])  # Get the total byte size of the file
    chunk_size = 1024  # Consuming in 1024 byte per chunk

    # Option 1
    progress = tqdm(total=file_size, unit='B', unit_scale=True, desc="Downloading {file}".format(file=file_name))
    # Open the file destination and write in binary mode
    with open(together, "wb") as f:
        # Loop through each of the chunks in response in chunk_size and update the progress by calling update using
        # len(chunk) not chunk_size
        for chunk in response.iter_content(chunk_size):
            f.write(chunk)
            progress.update(len(chunk))

    # Option 2
    """progress = tqdm(response.iter_content(chunk_size), total=file_size, unit='B', unit_scale=True, desc="Downloading {file}".format(file=file_name))
    with open(together, "wb") as f:
        for chunk in progress:
            progress.update(len(chunk))
            f.write(chunk)
    # Close the tqdm object and file object as good practice
    """
    progress.close()
    f.close()

if __name__ == "__main__":
    download_image("Image Link")
Looks like an existing bug with tqdm. https://github.com/tqdm/tqdm/issues/766
Option 1:
Provides tqdm the total size
On each iteration, update progress. Expect the progress bar to keep moving.
Works fine.
Option 2:
Provides tqdm the total size along with the generator that yields the chunks.
On each iteration, tqdm should automatically pick up the progress from the generator and advance the bar.
However, you also call progress.update manually, which should not be necessary.
Instead, let the generator do the job.
But this doesn't work either, and the issue is already reported.
Suggestion on Option 1:
To avoid closing streams manually, you can enclose them in a with statement. The same applies to tqdm as well.
# Open the file destination and write in binary mode
with tqdm(total=file_size,
          unit='B',
          unit_scale=True,
          desc="Downloading {file}".format(file=file_name)
          ) as progress, open(together, "wb") as f:
    # Loop through each of the chunks in response in chunk_size and update the progress by calling update using
    # len(chunk), not chunk_size
    for chunk in response.iter_content(chunk_size):
        progress.update(len(chunk))
        f.write(chunk)

Requests download with retries creates large corrupt zip

Relative beginner here. I'm trying to complete a basic task with Requests, downloading zip files. It works fine on most downloads, but intermittently writes over-sized, corrupt zip files when working with large downloads (>5 GB or so). For example, there is a zip file I know to be ~11 GB that shows up anywhere between 16 and 20 GB, corrupted.
When unzipping in Windows Explorer, I get "The compressed (zipped) folder is invalid". 7-Zip will extract the archive, but says:
Headers Error --- Unconfirmed start of archive --- Warnings: There are some data after the end of the payload data
Interestingly, the 7-Zip dialog shows the correct file size as 11479 MB.
Here's my code:
save_dir = Path(f"{dirName}/{item_type}/{item_title}.zip")
file_to_resume = save_dir

try:
    with requests.get(url, stream=True, timeout=30) as g:
        with open(save_dir, 'wb') as sav:
            for chunk in g.iter_content(chunk_size=1024*1024):
                sav.write(chunk)
except:
    attempts = 0
    while attempts < 10:
        try:
            resume_header = {'Range': f'bytes = {Path(file_to_resume).stat().st_size}-'}
            with requests.get(url, stream=True, headers=resume_header, timeout=30) as f:
                with open(file_to_resume, 'ab') as sav:
                    for chunk in f.iter_content(chunk_size=1024*1024):
                        sav.write(chunk)
            break
        except:
            attempts += 1
It appears the server did not support the Range header. Thanks, @RonieMartinez.
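If you want to guard against that case, a small check before attempting the resume helps. The sketch below is illustrative (the helper name is made up) and probes whether the server honours byte ranges; note also that the canonical header value has no spaces, i.e. 'bytes=12345-' rather than 'bytes = 12345-'.

import requests

def supports_resume(url):
    # Many servers advertise support via the Accept-Ranges response header.
    head = requests.head(url, allow_redirects=True, timeout=30)
    if head.headers.get("Accept-Ranges", "").lower() == "bytes":
        return True
    # Fall back to a tiny ranged GET: 206 Partial Content means ranges are honoured.
    probe = requests.get(url, headers={"Range": "bytes=0-0"}, stream=True, timeout=30)
    return probe.status_code == 206

If the server ignores Range, it returns 200 with the full body, and appending that second full download to the partial file is exactly what produces the over-sized, corrupt archive.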

How to download invoice images from Coupa using Python

I have a csv file with the URLs for a list of invoices in Coupa. I need to use Python to go to Coupa and download the image scan PDF file to a specific folder. I have the following code. It runs, but when I open the PDF file it comes back corrupted. Any help would be appreciated.
import requests

file_url = "https://mercuryinsurance.coupahost.com/invoice/15836/image_scan"
r = requests.get(file_url, stream=False)
with open("python.pdf", "wb") as pdf:
    for chunk in r.iter_content(chunk_size=1024):
        # writing one chunk at a time to pdf file
        if chunk:
            pdf.write(chunk)

How can I open some unbounded # of files so that I can send them in a request?

I need to take the contents of a user's directory and send them in a request. Unfortunately I cannot modify the service I'm making the request to, so I CANNOT zip all the files and send that; I must send all the files individually.
There's a limit on the total size of the files, but not on the # of files. Unfortunately, once I try to open too many, Python errors out with: [Errno 24] Too many open files. Here's my current code:
files_to_send = []
files_to_close = []
for file_path in all_files:
    file_obj = open(file_path, "rb")
    files_to_send.append(("files", (file_path, file_obj)))
    files_to_close.append(file_obj)
requests.post(url, files=files_to_send)
for file_to_close in files_to_close:
    file_to_close.close()
Is there any way to get around this open file limit given my circumstances?
Can you send the files one at a time? Or maybe send them in batches? You can increase the number of allowed open files here:
IOError: [Errno 24] Too many open files:
But in general that's finicky and not recommended.
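For completeness, if you do want to raise the limit from inside Python on a Unix-like system, the standard library's resource module can do it; treat this as a workaround rather than a fix:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Raise the soft limit up to the hard limit; only privileged processes may raise the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))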
You can send your requests in batches, ala:
BATCH_SIZE = 10 # Just an example batch size
while len(all_files) > 0:
files_to_send = []
while len(files_to_send) < BATCH_SIZE:
files_to_send.append(all_files.pop())
file_obj = open(file_path, "rb")
requests.post(url, files=files_to_send)
for f in files_to_send:
f.close()
This will send files in batches of 10 at a time. If one at a time works:
for file in all_files:
    with open(file, "rb") as outfile:
        requests.post(url, files={"files": outfile})
It's generally bad practice to open too many files at once.

Download csv file using python 3

I am new to Python. Here is my environment setup:
I have Anaconda 3 (Python 3). I would like to be able to download a CSV file from this website:
https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD
I would like to use the requests library. I would appreciate any help in figuring out how I can use the requests library to download the CSV file to a local directory on my machine.
It is recommended to download the data as a stream and flush it into the target (or an intermediate) local file.
import requests

def download_file(url, output_file, compressed=True):
    """
    compressed: enable response compression support
    """
    headers = {}
    if compressed:
        headers["Accept-Encoding"] = "gzip"
    # NOTE the stream=True parameter: it enables buffered, chunk-by-chunk loading of the body.
    r = requests.get(url, headers=headers, stream=True)
    with open(output_file, 'wb') as f:  # open for block (binary) writing
        for chunk in r.iter_content(chunk_size=4096):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
        f.flush()  # finally, force the data to be flushed into the output file (optional)
    return output_file
Considering the original post:
remote_csv = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
local_output_file = "test.csv"
download_file(remote_csv, local_output_file)
#Check file content, just for test purposes:
print(open(local_output_file).read())
Base code was extracted from this post: https://stackoverflow.com/a/16696317/176765
Here you can find more detailed information about body stream usage with the requests lib:
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
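One point from that page worth keeping in mind: with stream=True the connection is only returned to the pool once the body has been fully read or the response has been closed, so wrapping the request in a with block is a safe pattern. A small variant of the call above, assuming the same remote_csv and local_output_file names:

import requests

with requests.get(remote_csv, stream=True) as r:
    r.raise_for_status()
    with open(local_output_file, "wb") as f:
        for chunk in r.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)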
