Python's ftplib with tqdm

I have a console script which uses ftplib as a backend to get a number of files from an ftp server. I would like to use tqdm to give the user some feedback provided they have a "verbose" switch on. This must be optional as some users might use the script without tty access.
ftplib's retrbinary method takes a callback, so it should be possible to hook tqdm in there somehow. However, I have no idea what this callback would look like.

From FTP.retrbinary:
The callback function is called for each block of data received, with a single string argument giving the data block.
So the callback could be something like:
with open(filename, 'wb') as fd:
    total = ftpclient.size(filename)

    with tqdm(total=total) as pbar:
        def callback_(data):
            l = len(data)
            pbar.update(l)
            fd.write(data)

    ftpclient.retrbinary('RETR {}'.format(filename), callback_)
Beware: This code is untested and probably has to be adapted.

That code shouldn't work as pbar will be "closed" when the with block terminates, which occurs just before ftpclient.retrbinary(...). You need a very minor indentation mod:
with open(filename, 'wb') as fd:
    total = ftpclient.size(filename)

    with tqdm(total=total,
              unit='B', unit_scale=True, unit_divisor=1024,
              disable=not verbose) as pbar:

        def cb(data):
            pbar.update(len(data))
            fd.write(data)

        ftpclient.retrbinary('RETR {}'.format(filename), cb)
EDIT: added disable flag and bytes scaling.
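For completeness, a minimal end-to-end sketch (Python 3) combining the pieces above; the host, credentials, and the verbose flag are placeholders, and error handling is left out:

from ftplib import FTP
from tqdm import tqdm

def fetch(host, user, password, filename, verbose=True):
    with FTP(host) as ftpclient:
        ftpclient.login(user, password)
        ftpclient.voidcmd('TYPE I')  # some servers only answer SIZE in binary mode
        total = ftpclient.size(filename)
        with open(filename, 'wb') as fd, \
             tqdm(total=total, unit='B', unit_scale=True, unit_divisor=1024,
                  disable=not verbose) as pbar:
            def cb(data):
                pbar.update(len(data))
                fd.write(data)
            ftpclient.retrbinary('RETR {}'.format(filename), cb)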


Related

Python doesn't release file after it is closed

What I need to do is write some messages to a .txt file, close it, and send it to a server. This happens in an infinite loop, so the code should look more or less like this:
import time

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

num = 0
while True:
    num += 1
    filename = f"example{num}.txt"
    with open(filename, "w") as f:
        f.write("Hello")
        f.close()

    mp_encoder = MultipartEncoder(
        fields={
            'file': ("file", open(filename, 'rb'), 'text/plain')
        }
    )
    r = requests.post("my_url/save_file", data=mp_encoder, headers=my_headers)
    time.sleep(10)
The post works if the file is created manually inside my working directory, but if I try to create it and write to it through code, I receive this response message:
500 - Internal Server Error
System.IO.IOException: Unexpected end of Stream, the content may have already been read by another component.
I don't see the file appearing in the project window of PyCharm. I even used time.sleep(10) because at first I thought it could be a timing problem, but that didn't solve it. In fact, the file appears in my working directory only when I stop the program, so it seems the file is held by the program even after I explicitly called f.close(). I know the with statement should take care of closing files, but it didn't look like that, so I added a close() to check whether that was the problem (spoiler: it was not).
I solved the problem by using another file
with open(filename, "r") as firstfile, open("new.txt", "a+") as secondfile:
    secondfile.write(firstfile.read())

with open(filename, 'w'):
    pass

r = requests.post("my_url/save_file", data=mp_encoder, headers=my_headers)
if r.status_code == requests.codes.ok:
    os.remove("new.txt")
else:
    print("File not saved")
I make a copy of the file, empty the original file to save space, and send the copy to the server (and then delete the copy). It looks like the problem was that the original file was held open by the Python logging module.
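If the logging module really is the culprit, a hedged illustration (assuming a FileHandler was attached to the same filename) of releasing the handle before the upload:

import logging

logger = logging.getLogger("example")
handler = logging.FileHandler(filename)   # same file the upload later reads
logger.addHandler(handler)
logger.info("Hello")

handler.close()                # flush and release the file handle
logger.removeHandler(handler)
# only now reopen the file and post it to the server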
Firstly, can you change open(f, 'rb') to open("example.txt", 'rb')? You should pass open a file name, not a closed file object.
Also, you can use os.path.abspath to see where the file is being written:
import os
os.path.abspath('.')
Third point: when you use a with context manager to open a file, you don't close the file yourself. The context manager is supposed to do it.
with open("example.txt", "w") as f:
f.write("Hello")

Different behavior using tqdm

I was making an image downloading project for a website, but I encountered some strange behavior using tqdm. In the code below I included two options for making the tqdm progress bar. In Option 1 I did not pass the iterable content from the response into tqdm directly, while in Option 2 I did. Although the code looks similar, the results are strangely different.
(Screenshot: the progress bar's output using Option 1)
(Screenshot: the progress bar's output using Option 2)
Option 1 gives the result I want, but I just can't find an explanation for the behavior of Option 2. Can anyone help me explain this behavior?
import requests
from tqdm import tqdm
import os

# Folder to store in
default_path = "D:\\Downloads"


def download_image(url):
    """
    This function will download the given url's image with proper filename labeling
    If a path is not provided the image will be downloaded to the Downloads folder
    """
    # Establish a Session with cookies
    s = requests.Session()
    # Fix for pixiv: you have to add a referer header in order to download images
    response = s.get(url, headers={'User-Agent': 'Mozilla/5.0',
                                   'referer': 'https://www.pixiv.net/'}, stream=True)
    file_name = url.split("/")[-1]  # Retrieve the file name of the link
    together = os.path.join(default_path, file_name)  # Join the path with the file name. Where to store the file
    file_size = int(response.headers["Content-Length"])  # Get the total byte size of the file
    chunk_size = 1024  # Consuming 1024 bytes per chunk

    # Option 1
    progress = tqdm(total=file_size, unit='B', unit_scale=True, desc="Downloading {file}".format(file=file_name))
    # Open the file destination and write in binary mode
    with open(together, "wb") as f:
        # Loop through each of the chunks in response in chunk_size and update the progress by calling update using
        # len(chunk), not chunk_size
        for chunk in response.iter_content(chunk_size):
            f.write(chunk)
            progress.update(len(chunk))

    # Option 2
    """progress = tqdm(response.iter_content(chunk_size), total=file_size, unit='B', unit_scale=True, desc="Downloading {file}".format(file=file_name))
    with open(together, "wb") as f:
        for chunk in progress:
            progress.update(len(chunk))
            f.write(chunk)
    """
    # Close the tqdm object and file object as good practice
    progress.close()
    f.close()


if __name__ == "__main__":
    download_image("Image Link")
Looks like an existing bug with tqdm. https://github.com/tqdm/tqdm/issues/766
Option 1:
Provides tqdm the total size.
On each iteration, you update the progress manually. Expect the progress bar to keep moving.
Works fine.
Option 2:
Provides tqdm the total size along with a generator that tracks the progress.
On each iteration, tqdm should automatically get the update from the generator and push the progress bar forward.
However, you also call progress.update manually, which should not be the case.
Instead, let the generator do the job (see the sketch after this list).
But this doesn't work either, and the issue is already reported.
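For reference, "letting the generator do the job" without any manual update() call would look like the sketch below (my assumption, using the variable names from the question, not code from the answer). Note that when tqdm wraps an iterable it advances by one unit per yielded item, so total has to be a chunk count rather than a byte count here:

import math

with open(together, "wb") as f:
    for chunk in tqdm(response.iter_content(chunk_size),
                      total=math.ceil(file_size / chunk_size),
                      unit='chunk',
                      desc="Downloading {file}".format(file=file_name)):
        # tqdm advances the bar by one per chunk; no progress.update() needed
        f.write(chunk)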
Suggestion on Option 1:
To avoid closing streams manually, you can enclose them inside a with statement. The same applies to tqdm as well.
# Open the file destination and write in binary mode
with tqdm(total=file_size,
          unit='B',
          unit_scale=True,
          desc="Downloading {file}".format(file=file_name)
          ) as progress, open(file_name, "wb") as f:
    # Loop through each of the chunks in response in chunk_size and update the progress by calling update using
    # len(chunk), not chunk_size
    for chunk in response.iter_content(chunk_size):
        progress.update(len(chunk))
        f.write(chunk)

Pyomo - how to write model.pprint() to file?

I want to "debug" my pyomo model. The output of the model.pprint() method looks helpful but it is too long so the console only displays and stores the last lines. How can I see the first lines. And how can I store this output in a file
(I tried pickle, json, normal f.write but since the output of .pprint() is of type NONE I wasn't sucessfull until now. (I am also new to python and learning python and pyomo in parallel).
None of this works :
with open('some_file2.txt', 'w') as f:
    serializer.dump(x, f)

import pickle
object = Object()
filehandler = open('some_file', 'wb')
pickle.dump(x, filehandler)

x = str(instance)
x = str(instance.pprint())
f = open('file6.txt', 'w')
f.write(x)
f.write(instance.pprint())
f.close()
Use the filename keyword argument to the pprint method:
instance.pprint(filename='foo.txt')
instance.pprint() prints to the console (stdout, the standard output), but does not return the content (the return value is None, as you said). To have it print to a file, you can redirect the standard output to that file.
Try:
import sys
f = open('file6.txt', 'w')
sys.stdout = f
instance.pprint()
f.close()
It looks like there is a cleaner solution from Bethany =)
For me the accepted answer does not work; pprint has a different signature.
help(instance.pprint)
pprint(ostream=None, verbose=False, prefix='') method of pyomo.core.base.PyomoModel.ConcreteModel instance
# working for me:
with open(path, 'w') as output_file:
    instance.pprint(output_file)
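If you prefer the stdout-redirection idea from the earlier answer, a small sketch with contextlib.redirect_stdout (Python 3.4+, assuming instance is your model) restores stdout automatically when the block ends:

import contextlib

with open('file6.txt', 'w') as f, contextlib.redirect_stdout(f):
    instance.pprint()   # everything written to stdout lands in file6.txt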

Write to csv with Python Multiprocessing apply_async causes missing of data

I have a csv file from which I read URLs line by line to make a request to each endpoint. Each request is parsed and the data is written to output.csv. This process is parallelized.
The issue is with the written data. Some portions of the data are partially missing, or missing entirely (blank lines). I suppose this happens because of collisions or conflicts between the async processes. Can you please advise how to fix that?
import csv
import re
import requests
from multiprocessing import Pool


def parse_data(url, line_num):
    print line_num, url
    r = requests.get(url)
    htmltext = r.text.encode("utf-8")
    pois = re.findall(re.compile('<pois>(.+?)</pois>'), htmltext)
    for poi in pois:
        write_data(poi)


def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        writer.writerow([poi])
    resfile.close()


def main():
    pool = Pool(processes=4)
    with open("input.csv", "rb") as f:
        reader = csv.reader(f)
        for line_num, line in enumerate(reader):
            url = line[0]
            pool.apply_async(parse_data, args=(url, line_num))
    pool.close()
    pool.join()
Try to add file locking:
import fcntl

def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        fcntl.flock(resfile, fcntl.LOCK_EX)
        writer.writerow([poi])
        fcntl.flock(resfile, fcntl.LOCK_UN)
    # Note that you don't have to close the file. The 'with' will take care of it
Concurrent writes to the same file are indeed a known cause of data loss / file corruption. The safe solution here is the "map/reduce" pattern: each process writes to its own result file (map), then you concatenate those files together (reduce), as sketched below.
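A minimal sketch of that map/reduce idea (Python 3), with a hypothetical parse_one() standing in for the request/regex logic from the question:

import csv
import glob
from multiprocessing import Pool

def parse_one(task):
    line_num, url = task
    rows = [[url, "parsed-value"]]            # placeholder for the real parsing
    # map step: every task writes its own part file, so nothing is shared
    with open("part-{}.csv".format(line_num), "w", newline="") as part:
        csv.writer(part).writerows(rows)

def main():
    with open("input.csv", newline="") as f:
        tasks = [(i, line[0]) for i, line in enumerate(csv.reader(f))]
    with Pool(processes=4) as pool:
        pool.map(parse_one, tasks)
    # reduce step: only the parent process writes output.csv
    with open("output.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for part in sorted(glob.glob("part-*.csv")):
            with open(part, newline="") as pf:
                writer.writerows(csv.reader(pf))

if __name__ == "__main__":
    main()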

Stream a file to the HTTP response in Pylons

I have a Pylons controller action that needs to return a file to the client. (The file is outside the web root, so I can't just link directly to it.) The simplest way is, of course, this:
with open(filepath, 'rb') as f:
    response.write(f.read())
That works, but it's obviously inefficient for large files. What's the best way to do this? I haven't been able to find any convenient methods in Pylons to stream the contents of the file. Do I really have to write the code to read a chunk at a time myself from scratch?
The correct tool to use is shutil.copyfileobj, which copies from one file object to another a chunk at a time.
Example usage:
import shutil

with open(filepath, 'rb') as f:
    shutil.copyfileobj(f, response)
This will not result in very large memory usage, and does not require implementing the code yourself.
The usual care with exceptions should be taken - if you handle signals (such as SIGCHLD) you have to handle EINTR because the writes to response could be interrupted, and IOError/OSError can occur for various reasons when doing I/O.
I finally got it to work using the FileApp class, thanks to Chris AtLee and THC4k (from this answer). This method also allowed me to set the Content-Length header, something Pylons has a lot of trouble with, which enables the browser to show an estimate of the time remaining.
Here's the complete code:
def _send_file_response(self, filepath):
    user_filename = '_'.join(filepath.split('/')[-2:])
    file_size = os.path.getsize(filepath)
    headers = [('Content-Disposition', 'attachment; filename="' + user_filename + '"'),
               ('Content-Type', 'text/plain'),
               ('Content-Length', str(file_size))]

    from paste.fileapp import FileApp
    fapp = FileApp(filepath, headers=headers)
    return fapp(request.environ, self.start_response)
The key here is that WSGI, and Pylons by extension, work with iterable responses. So you should be able to write some code like this (warning: untested code below!):
def file_streamer():
    with open(filepath, 'rb') as f:
        while True:
            block = f.read(4096)
            if not block:
                break
            yield block

response.app_iter = file_streamer()
Also, paste.fileapp.FileApp is designed to be able to return file data for you, so you can also try:
return FileApp(filepath)
in your controller method.
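As a further hedged option, many WSGI servers expose the optional wsgi.file_wrapper extension, which does the chunked iteration for you; a sketch that falls back to the file_streamer() generator above when the extension is missing:

wrapper = request.environ.get('wsgi.file_wrapper')
if wrapper is not None:
    # let the server iterate over the file in its own optimized way
    response.app_iter = wrapper(open(filepath, 'rb'), 4096)
else:
    response.app_iter = file_streamer()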
