I have created a program that downloads mp4 videos from a link. I've added the code below
import requests

chunk_size = 256
URL = 'https://gogodownload.net/download.php?url=aHR0cHM6LyAdrefsdsdfwerFrefdsfrersfdsrfer363435349URASDGHUSRFSJGYfdsffsderFStewthsfSFtrftesdfseWVpYnU0bmM3LmdvY2RuYW5pLmNvbS91c2VyMTM0Mi9lYzBiNzk3NmM1M2Q4YmY5MDU2YTYwNjdmMGY3ZTA3Ny9FUC4xLnYwLjM2MHAubXA0P3Rva2VuPW8wVnNiR3J6ZXNWaVA0UkljRXBvc2cmZXhwaXJlcz0xNjcxOTkzNzg4JmlkPTE5MzU1Nw=='
x = requests.head(URL)
y = requests.head(x.headers['Location'])
file_size = int(y.headers['content-length'])
response = requests.get(URL, stream=True)
with open('video.mp4', 'wb') as f:
    for chunk in response.iter_content(chunk_size=chunk_size):
        f.write(chunk)
This code works properly and downloads the video, but I want to add a live progress bar. I tried using alive-progress (code added below), but it didn't work properly.
from alive_progress import alive_bar

def compute():
    response = requests.get(URL, stream=True)
    with open('video.mp4', 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            yield 256

with alive_bar(file_size) as bar:
    for i in compute():
        bar()
This is the output I got; the file itself downloaded properly, however.
Any help?
Here is the code you should try. I updated your chunk_size to 1024 and converted file_size into KB (kilobytes) to match, so each chunk corresponds to one step of the bar. It now shows the values properly alongside the percentage as well.
As for the weird boxes showing up in your terminal: your terminal does not support the characters used by the library. Try editing your terminal settings to use UTF-8 encoding.
file_size = int(int(y.headers['content-length']) / 1024)
chunk_size = 1024

def compute():
    response = requests.get(URL, stream=True)
    with open('video.mp4', 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            yield 1024

with alive_bar(file_size) as bar:
    for i in compute():
        bar()
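If you'd rather not tie chunk_size to the bar total at all, here is a hedged alternative sketch: alive-progress also has a manual mode, where you pass the completed fraction to bar() directly. URL is the link from the question; this assumes the final response carries a Content-Length header.

import requests
from alive_progress import alive_bar

response = requests.get(URL, stream=True)
total_bytes = int(response.headers['content-length'])  # assumes the header is present
done = 0
with open('video.mp4', 'wb') as f, alive_bar(manual=True) as bar:
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)
        done += len(chunk)
        bar(done / total_bytes)  # in manual mode, bar() takes a fraction from 0 to 1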
Most probably this is happening because your terminal does not support the characters being printed. Edit the settings of your terminal app to use UTF-8 encoding.
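If you cannot change the terminal settings, a small sketch of a workaround from the Python side (3.7+). This only helps when the output stream encoding is the problem; if the terminal font simply lacks the glyphs, it won't.

import sys

# Force UTF-8 on the standard output stream from inside the script (Python 3.7+)
sys.stdout.reconfigure(encoding='utf-8')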
Also give tqdm a try. It is a more renowned library, so you know it has been tried and tested.
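For example, a minimal sketch of the same download with tqdm, assuming URL and the byte-count file_size from the question above:

import requests
from tqdm import tqdm

response = requests.get(URL, stream=True)
with open('video.mp4', 'wb') as f, tqdm(total=file_size, unit='B', unit_scale=True) as bar:
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)
        bar.update(len(chunk))  # advance by the real size of each chunk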
I was making an image downloading project for a website, but I encountered some strange behavior using tqdm. In the code below I included two options for creating the tqdm progress bar. In Option 1 I did not pass the iterable content from the response into tqdm directly, while in Option 2 I did. Although the two options look similar, the results are strangely different.
This is what the progress bar's result looks like using Option 1
This is what the progress bar's result looks like using Option 2
Option 1 gives the result I want, but I just couldn't find an explanation for the behavior of Option 2. Can anyone help me explain this behavior?
import requests
from tqdm import tqdm
import os

# Folder to store in
default_path = "D:\\Downloads"

def download_image(url):
    """
    This function will download the given url's image with proper filename labeling
    If a path is not provided the image will be downloaded to the Downloads folder
    """
    # Establish a Session with cookies
    s = requests.Session()
    # Fix for pixiv's requests: you have to add a referer in order to download images
    response = s.get(url, headers={'User-Agent': 'Mozilla/5.0',
                                   'referer': 'https://www.pixiv.net/'}, stream=True)
    file_name = url.split("/")[-1]  # Retrieve the file name from the link
    together = os.path.join(default_path, file_name)  # Join the path with the file name: where to store the file
    file_size = int(response.headers["Content-Length"])  # Get the total byte size of the file
    chunk_size = 1024  # Consume 1024 bytes per chunk

    # Option 1
    progress = tqdm(total=file_size, unit='B', unit_scale=True, desc="Downloading {file}".format(file=file_name))
    # Open the file destination and write in binary mode
    with open(together, "wb") as f:
        # Loop through the chunks of the response and update the progress by calling update with
        # len(chunk), not chunk_size
        for chunk in response.iter_content(chunk_size):
            f.write(chunk)
            progress.update(len(chunk))

    # Option 2
    """progress = tqdm(response.iter_content(chunk_size), total=file_size, unit='B', unit_scale=True, desc="Downloading {file}".format(file=file_name))
    with open(together, "wb") as f:
        for chunk in progress:
            progress.update(len(chunk))
            f.write(chunk)
    """
    # Close the tqdm object and file object as good practice
    progress.close()
    f.close()

if __name__ == "__main__":
    download_image("Image Link")
Looks like an existing bug with tqdm. https://github.com/tqdm/tqdm/issues/766
Option 1:
Provides tqdm the total size.
On each iteration, you update the progress manually, so the bar keeps moving.
Works fine.
Option 2:
Provides tqdm the total size along with a generator that it iterates to track progress.
On each iteration, tqdm should automatically pick up the update from the generator and advance the bar.
However, you also call progress.update manually, which should not be the case.
Instead, let the generator do the job.
But this doesn't work either, and the issue has already been reported.
Suggestion on Option 1:
To avoid closing streams manually, you can enclose them in a with statement. The same applies to tqdm as well.
# Open the file destination and write in binary mode
with tqdm(total=file_size,
          unit='B',
          unit_scale=True,
          desc="Downloading {file}".format(file=file_name)
          ) as progress, open(together, "wb") as f:
    # Loop through the chunks of the response and update the progress by calling update with
    # len(chunk), not chunk_size
    for chunk in response.iter_content(chunk_size):
        progress.update(len(chunk))
        f.write(chunk)
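If you still want tqdm to do the counting itself rather than updating it manually, one option is tqdm.wrapattr, which wraps the file's write method so that each write advances the bar by the number of bytes written. A sketch using the variable names from the question's function:

with tqdm.wrapattr(open(together, "wb"), "write",
                   total=file_size,
                   desc="Downloading {file}".format(file=file_name)) as f:
    for chunk in response.iter_content(chunk_size):
        f.write(chunk)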
I am new to Python. Here is my environment setup:
I have Anaconda 3 (Python 3). I would like to be able to download a CSV file from this website:
https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD
I would like to use the requests library. I would appreciate any help in figuring out how I can use the requests library to download the CSV file to a local directory on my machine.
It is recommended to download the data as a stream and flush it into the target (or an intermediate) local file:
import requests

def download_file(url, output_file, compressed=True):
    """
    compressed: enable response compression support
    """
    # NOTE the stream=True parameter. It enables more optimized, buffered data loading.
    headers = {}
    if compressed:
        headers["Accept-Encoding"] = "gzip"

    r = requests.get(url, headers=headers, stream=True)
    with open(output_file, 'wb') as f:  # open for block writing
        for chunk in r.iter_content(chunk_size=4096):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
        f.flush()  # afterwards, force the data flush into the output file (optional)
    return output_file
Considering the original post:
remote_csv = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
local_output_file = "test.csv"

download_file(remote_csv, local_output_file)

# Check the file content, just for test purposes:
print(open(local_output_file).read())
Base code was extracted from this post: https://stackoverflow.com/a/16696317/176765
Here you can find more detailed information about body stream usage with the requests lib:
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
So I've been using
urllib.request.urlretrieve(URL, FILENAME)
to download images off the internet. It works great, but fails on some images. The ones it fails on seem to be the larger ones, e.g. http://i.imgur.com/DEKdmba.jpg. It downloads them fine, but when I try to open these files, Photo Viewer gives me the error "Windows Photo Viewer can't open this picture because the file appears to be damaged, corrupted, or too large".
What might be the reason it can't download these, and how can I fix this?
EDIT: after looking further, I don't think the problem is large images; it manages to download larger ones. It just seems to be some random ones that it can never download, no matter how many times I run the script. Now I'm even more confused.
In the past, I have used this code for copying from the internet. I have had no trouble with large files.
import urllib2

def download(url):
    file_name = raw_input("Name: ")
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_size = 8192
    while True:
        buffer = u.read(block_size)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
    f.close()
Here's the sample code for Python 3 (tested in Windows 7):
import urllib.request

def download_very_big_image():
    url = 'http://i.imgur.com/DEKdmba.jpg'
    filename = 'C://big_image.jpg'
    conn = urllib.request.urlopen(url)
    output = open(filename, 'wb')  # binary flag needed for Windows
    output.write(conn.read())
    output.close()
For completeness sake, here's the equivalent code in Python 2:
import urllib2

def download_very_big_image():
    url = 'http://i.imgur.com/DEKdmba.jpg'
    filename = 'C://big_image.jpg'
    conn = urllib2.urlopen(url)
    output = open(filename, 'wb')  # binary flag needed for Windows
    output.write(conn.read())
    output.close()
This should work: use the requests module:
import requests

img_url = 'http://i.imgur.com/DEKdmba.jpg'
img_name = img_url.split('/')[-1]
img_data = requests.get(img_url).content
with open(img_name, 'wb') as handler:
    handler.write(img_data)
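One caveat worth noting: requests.get(...).content will happily save an HTML error page to disk as a .jpg if the server returns an error, which then looks like a "corrupted" image. A small variant of the snippet above that fails loudly instead:

import requests

img_url = 'http://i.imgur.com/DEKdmba.jpg'
img_name = img_url.split('/')[-1]
resp = requests.get(img_url)
resp.raise_for_status()  # raise on HTTP errors instead of silently writing an error page
with open(img_name, 'wb') as handler:
    handler.write(resp.content)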
Python 2.7
Ubuntu 12.04
I'm trying to put together an image scraper. I've done this before with no problems, but right now I'm stuck at a certain point.
I get a list of image links from wherever, either a web page or a user; let's assume that they are valid links.
The site I am scraping is imgur. Some of the links won't work because I haven't added support for them (single files), but I have the code for getting the links for each image from an album page down; that works and returns links like:
http://i.imgur.com/5329B8H.jpg #(intentionally broken link)
The image_download function as I have it in my actual program:
def image_download(self, links):
    for each in links:
        url = each
        name = each.split('/')[-1]
        r = requests.get(url, stream=True)
        with open(name, 'wb') as f:
            for chunk in r.iter_content(1024):
                if not chunk:
                    break
            f.write(chunk)
The image_download function as I have it for testing, to be run on its own:
def down():
    links = ['link-1', 'link-2']
    for each in links:
        name = each.split('/')[-1]
        r = requests.get(each, stream=True)
        with open(name, 'wb') as f:
            for chunk in r.iter_content(1024):
                if not chunk:
                    break
                f.write(chunk)
Now here's the thing: the second one works.
They both take the same input, and they both do the same thing.
The first one does return a file with the correct name and extension, but the file size is different: say 960 B, as opposed to the second one, which returns a file of about 200 kB.
When I print the requests, both return a response of 200. I've tried printing the output at different points, and as far as I can see they operate in exactly the same way with exactly the same data; they just don't give back the same information.
What is going on here?
You need to indent f.write(chunk) one more time. You are only writing the last chunk to the file right now.
The corrected function will look like this:
def image_download(self, links):
    for each in links:
        url = each
        name = each.split('/')[-1]
        r = requests.get(url, stream=True)
        with open(name, 'wb') as f:
            for chunk in r.iter_content(1024):
                if not chunk:
                    break
                f.write(chunk)  # This has been indented to be in the for loop.
Requests is a really nice library. I'd like to use it for downloading big files (>1 GB).
The problem is that it's not possible to keep the whole file in memory; I need to read it in chunks. And this is a problem with the following code:
import requests

def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    f = open(local_filename, 'wb')
    for chunk in r.iter_content(chunk_size=512 * 1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
    f.close()
    return
For some reason it doesn't work this way; it still loads the response into memory before it is saved to a file.
With the following streaming code, the Python memory usage is restricted regardless of the size of the downloaded file:
def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                # If you have a chunk-encoded response, uncomment the if
                # and set the chunk_size parameter to None.
                # if chunk:
                f.write(chunk)
    return local_filename
Note that the number of bytes returned by iter_content is not exactly the chunk_size; it can vary (for example, when the content is being decoded) and may differ on every iteration.
See body-content-workflow and Response.iter_content for further reference.
It's much easier if you use Response.raw and shutil.copyfileobj():
import requests
import shutil

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename
This streams the file to disk without using excessive memory, and the code is simple.
Note: According to the documentation, Response.raw will not decode gzip and deflate transfer-encodings, so you will need to do this manually.
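If you do need the decoded bytes, a hedged sketch of one way to handle it: urllib3 can be asked to decode the transfer-encoding for you by setting decode_content=True on the raw stream before copying.

import requests
import shutil

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        r.raw.decode_content = True  # let urllib3 decode gzip/deflate while streaming
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename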
Not exactly what the OP was asking, but... it's ridiculously easy to do this with urllib:
from urllib.request import urlretrieve
url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
dst = 'ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, dst)
Or this way, if you want to save it to a temporary file:
from urllib.request import urlopen
from shutil import copyfileobj
from tempfile import NamedTemporaryFile

url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
with urlopen(url) as fsrc, NamedTemporaryFile(delete=False) as fdst:
    copyfileobj(fsrc, fdst)
I watched the process:
watch 'ps -p 18647 -o pid,ppid,pmem,rsz,vsz,comm,args; ls -al *.iso'
And I saw the file growing, but memory usage stayed at 17 MB. Am I missing something?
Your chunk size could be too large; have you tried dropping that, maybe to 1024 bytes at a time? (Also, you could use with to tidy up the syntax.)
def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return
Incidentally, how are you deducing that the response has been loaded into memory?
It sounds as if Python isn't flushing the data to the file. Based on other SO questions, you could try f.flush() and os.fsync() to force the file write and free memory:
import os

with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
Use the wget module of Python instead. Here is a snippet:
import wget
wget.download(url)
Based on Roman's most upvoted answer above, here is my implementation, including a "download as" and a retries mechanism:
import logging
import os
import time
from urllib.parse import urlparse

import requests

logger = logging.getLogger(__name__)

def download(url: str, file_path='', attempts=2):
    """Downloads a URL content into a file (with large file support by streaming)

    :param url: URL to download
    :param file_path: Local file name to contain the data downloaded
    :param attempts: Number of attempts
    :return: New file path. Empty string if the download failed
    """
    if not file_path:
        file_path = os.path.realpath(os.path.basename(url))
    logger.info(f'Downloading {url} content to {file_path}')
    url_sections = urlparse(url)
    if not url_sections.scheme:
        logger.debug('The given url is missing a scheme. Adding http scheme')
        url = f'http://{url}'
        logger.debug(f'New url: {url}')
    for attempt in range(1, attempts + 1):
        try:
            if attempt > 1:
                time.sleep(10)  # 10 seconds wait time between download attempts
            with requests.get(url, stream=True) as response:
                response.raise_for_status()
                with open(file_path, 'wb') as out_file:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):  # 1MB chunks
                        out_file.write(chunk)
                logger.info('Download finished successfully')
                return file_path
        except Exception as ex:
            logger.error(f'Attempt #{attempt} failed with error: {ex}')
    return ''
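A hypothetical usage example (the URL is a placeholder; logging must be configured for the log lines to show):

import logging

logging.basicConfig(level=logging.INFO)
saved_path = download('http://example.com/big/file.bin', file_path='file.bin', attempts=3)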
Here is an additional approach for the use case of an async chunked download, without reading all the file content into memory.
It means that both the read from the URL and the write to the file are implemented with asyncio libraries (aiohttp to read from the URL and aiofiles to write the file).
The following code should work on Python 3.7 and later.
Just edit SRC_URL and DEST_FILE variables before copying and pasting.
import aiofiles
import aiohttp
import asyncio

async def async_http_download(src_url, dest_file, chunk_size=65536):
    async with aiofiles.open(dest_file, 'wb') as fd:
        async with aiohttp.ClientSession() as session:
            async with session.get(src_url) as resp:
                async for chunk in resp.content.iter_chunked(chunk_size):
                    await fd.write(chunk)

SRC_URL = "/path/to/url"
DEST_FILE = "/path/to/file/on/local/machine"

asyncio.run(async_http_download(SRC_URL, DEST_FILE))
requests is good, but how about a raw socket solution?
def stream_(host):
    import socket
    import ssl
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # create_default_context() defaults to Purpose.SERVER_AUTH, which is what a client needs
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as wrapped_socket:
            wrapped_socket.connect((socket.gethostbyname(host), 443))
            wrapped_socket.send(
                "GET / HTTP/1.1\r\nHost: thiscatdoesnotexist.com\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\n\r\n".encode())
            # Read the response headers one byte at a time until the blank line that ends them
            resp = b""
            while resp[-4:-1] != b"\r\n\r":
                resp += wrapped_socket.recv(1)
            else:
                resp = resp.decode()
                content_length = int("".join([tag.split(" ")[1] for tag in resp.split("\r\n") if "content-length" in tag.lower()]))
            # Read exactly content_length bytes of body
            image = b""
            while content_length > 0:
                data = wrapped_socket.recv(2048)
                if not data:
                    print("EOF")
                    break
                image += data
                content_length -= len(data)
            with open("image.jpeg", "wb") as file:
                file.write(image)
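Hypothetical usage, matching the Host header hard-coded in the request above:

stream_("thiscatdoesnotexist.com")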