Python file operations sometimes take an unacceptably long time - python

We are running a Python-based system consisting of several processes in an Ubuntu 20.04.4 LTS environment.
One process copies files (video chunks) from memory to a persistent storage device using the following algorithm:
In case a new video has started -->
    Create the destination directory /base-path/x/y (path.mkdir)
    Create and initialize the destination playlist file (open(fn, 'w'))
    Copy the first video chunks (shutil.copyfileobj)
In case a new video chunk has been detected -->
    Update the destination playlist file (open(fn, 'a'))
    Copy the video chunk (shutil.copyfileobj)
Normally this algorithm has worked fine; each operation takes a few milliseconds. In one of our current installations, however, a call to mkdir can sometimes take up to 25 s, a file copy up to 2 s, and even a playlist update more than 1 s.
When the number of new videos decreases, the system may return to normal behaviour.
The CPU load is < 40%, memory usage is approx. 2 of 16 GB.
We are using inotify to detect new video chunks.
We are currently unable to explain this strange behaviour and would appreciate any help.
Btw.: Is there any best practice for copying binary video chunks? We are using the operations described above because they have worked in the past, but I am not sure whether this is the most performant approach.
In case it helps to better understand our problem let me add some code:
New directory:
target: Path = Path(Settings.APP_RECORDING_OUTPUT_DIR) / task.get('path')
timestamp_before_create = time_now()
try:
    target.mkdir(parents=True, exist_ok=True)
except OSError as err:
    return logger.error(f'Unable to create directory: {target.as_posix()}: {err}')
# Set ownership
try:
    chown(target.as_posix(), user=Settings.APP_RECORDING_FILE_OWNER, group=Settings.APP_RECORDING_FILE_GROUP)
    chmod(target.as_posix(), stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR | stat.S_IRGRP | stat.S_IXGRP)
except (OSError, LookupError) as err:
    return logger.warning(f'Unable to set folder permissions: {err}')
timestamp_after_create = time_now()
Update playlist:
timestamp_before_update = time_now()
lines = []
# Get lines for the manifest file
for line in task.get('new_entries'):
    lines.append(line)
try:
    lines = "".join(lines)
    f = open(playlist_path.as_posix(), "a")
    f.write(lines)
    f.close()
except OSError:
    return logger.error(f'Failed appending to playlist file: {playlist_path}')
timestamp_after_update = time_now()
Copy video chunks:
source: Path = Path(Settings.APP_RECORDING_SOURCE_DIR) / task.get('src_path')
target: Path = Path(Settings.APP_RECORDING_OUTPUT_DIR) / task.get('dst_path')
timestamp_before_copy = time_now()
try:
    # Create file handles, copy binary contents
    with open(source.as_posix(), 'rb') as fsrc:
        with open(target.as_posix(), 'wb') as fdst:
            # Execute file copying
            copyfileobj(fsrc, fdst)
    # Set ownership
    chown(target.as_posix(), user=Settings.APP_RECORDING_FILE_OWNER, group=Settings.APP_RECORDING_FILE_GROUP)
    chmod(target.as_posix(), stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)
    timestamp_after_copy = time_now()
except Exception as err:
    return logger.error(f'Copying chunk for {camera} failed: {err}')
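Regarding the best-practice question: for plain file-to-file copies there are two commonly suggested variants, sketched below under the assumption that source and target are the same Path objects as above. shutil.copyfile can use os.sendfile on Linux (Python 3.8+), and copyfileobj accepts an explicit buffer size larger than its default (16-64 KiB depending on the Python version). This is a sketch, not the system's current implementation:
from shutil import copyfile, copyfileobj

# Variant 1: whole-file copy; on Linux with Python 3.8+ this uses
# os.sendfile internally and avoids user-space buffers.
copyfile(source.as_posix(), target.as_posix())

# Variant 2: keep copyfileobj, but pass a larger buffer explicitly.
with open(source, 'rb') as fsrc, open(target, 'wb') as fdst:
    copyfileobj(fsrc, fdst, length=1024 * 1024)  # 1 MiB chunks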

Related

Python: Stream gzip files from s3

I have files in S3 as gzip chunks, so I have to read the data continuously and can't read random ones; I always have to start with the first file.
For example, let's say I have 3 gzip files in S3: f1.gz, f2.gz, f3.gz. If I download them all locally, I can do cat * | gzip -d. If I do cat f2.gz | gzip -d, it fails with gzip: stdin: not in gzip format.
How can I stream this data from S3 using Python? I saw smart_open and it has the ability to decompress gz files with:
from smart_open import smart_open, open

with open(path, compression='.gz') as f:
    for line in f:
        print(line.strip())
Where path is the path to f1.gz. This works until it hits the end of the file, where it aborts. The same thing happens locally: if I do cat f1.gz | gzip -d, it errors with gzip: stdin: unexpected end of file when it hits the end.
Is there a way to make it stream the files continuously using Python?
The following will not abort and can iterate through f1.gz, f2.gz and f3.gz:
with open(path, 'rb', compression='disable') as f:
    for line in f:
        print(line.strip(), end="")
but the output is just bytes. I was thinking it would work by doing python test.py | gzip -d with the above code, but I get the error gzip: stdin: not in gzip format. Is there a way to have Python print, using smart_open, output that gzip can read?
For example lets say I have 3 gzip file in s3, f1.gz, f2.gz, f3.gz. If I download all locally, I can do cat * | gzip -d.
One idea would be to make a file object that implements this. The file object reads from one filehandle, exhausts it, reads from the next one, exhausts it, etc. This is similar to how cat works internally.
The handy thing about this is that it does the same thing as concatenating all of your files, without the memory cost of reading them all in at the same time.
Once you have the combined file object wrapper, you can pass it to Python's gzip module to decompress the file.
Examples:
import gzip

class ConcatFileWrapper:
    def __init__(self, files):
        self.files = iter(files)
        self.current_file = next(self.files)
    def read(self, *args):
        ret = self.current_file.read(*args)
        if len(ret) == 0:
            # EOF
            # Optional: close self.current_file here
            # self.current_file.close()
            # Advance to next file and try again
            try:
                self.current_file = next(self.files)
            except StopIteration:
                # Out of files
                # Return an empty string
                return ret
            # Recurse and try again
            return self.read(*args)
        return ret
    def write(self):
        raise NotImplementedError()

filenames = ["xaa", "xab", "xac", "xad"]
filehandles = [open(f, "rb") for f in filenames]
wrapper = ConcatFileWrapper(filehandles)
with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)
# Close all files
[f.close() for f in filehandles]
Here's how I tested this. I created a test file through the following commands.
Create a file with the contents 1 through 1000.
$ seq 1 1000 > foo
Compress it.
$ gzip foo
Split the file. This produces four files named xaa-xad.
$ split -b 500 foo.gz
Run the above Python file on it, and it should print out 1 - 1000.
Edit: extra remark about lazy-opening the files
If you have a huge number of files, you might want to open only one file at a time. Here's an example:
def open_files(filenames):
    for filename in filenames:
        # Note: this will leak file handles unless you uncomment the code above that closes the file handles again.
        yield open(filename, "rb")
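For completeness, a short usage sketch (not part of the original answer) that combines the lazy generator with ConcatFileWrapper; it assumes the optional close of self.current_file in read() is uncommented so exhausted files are released as you go:
wrapper = ConcatFileWrapper(open_files(["xaa", "xab", "xac", "xad"]))
with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)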

Errno 13 Permission denied when trying to copy contents of a tempfile into a normal file

I have looked over the similar questions, and not found a solution to my problem.
I am trying to insert a line of text at the second line of a text file; however, I don't want to repeatedly open and close the file, as I am running this code over a large number of files and don't want to slow it down. I found this answer to a similar question and have been working through it: I use tempfile to create a temporary file, which is working perfectly, as reproduced below:
from pathlib import Path
from shutil import copyfile
from tempfile import NamedTemporaryFile

sourcefile = Path(r"Path\to\source").resolve()
insert_lineno = 2
insert_data = "# Mean magnitude = " + str(mean_mag) + "\n"
with sourcefile.open(mode="r") as source:
    destination = NamedTemporaryFile(mode="w", dir=str(sourcefile.parent))
    lineno = 1
    while lineno < insert_lineno:
        destination.file.write(source.readline())
        lineno += 1
    destination.file.write(insert_data)
    while True:
        data = source.read(1024)
        if not data:
            break
        destination.file.write(data)
    # Finish writing data.
    destination.flush()
    # Overwrite the original file's contents with that of the temporary file.
    # This uses a memory-optimised copy operation starting from Python 3.8.
    copyfile(destination.name, str(sourcefile))
    # Delete the temporary file.
    destination.close()
The second-to-last command, copyfile(destination.name, str(sourcefile)), is not working, and I am getting:
[Errno 13] Permission Denied: 'D:\My_folder\Subfolder\tmp_s570_5w'
I think the problem might be that I can't copy from the temporary file while it is open, but if I close the temporary file it gets deleted, so I can't close it before copying either.
EDIT: When I run my code on Linux I don't get any errors, so it must be some kind of Windows permission problem?
Try this:
sourcefile = Path(r"Path\to\source").resolve()
insert_lineno = 2
insert_data = "# Mean magnitude = " + str(mean_mag) + "\n"
with sourcefile.open(mode="r") as source:
    with NamedTemporaryFile(mode="w", dir=str(sourcefile.parent)) as destination:
        lineno = 1
        while lineno < insert_lineno:
            destination.file.write(source.readline())
            lineno += 1
        destination.file.write(insert_data)
        while True:
            data = source.read(1024)
            if not data:
                break
            destination.file.write(data)
        # Finish writing data.
        destination.flush()
        # Overwrite the original file's contents with that of the temporary file.
        # This uses a memory-optimised copy operation starting from Python 3.8.
        copyfile(destination.name, str(sourcefile))
    # The temporary file is closed (and deleted) when the with block exits.
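If the Errno 13 persists on Windows even with the nested with blocks, a commonly used variant (a sketch based on the assumption that Windows refuses to reopen the still-open temporary file by name) is to create the temporary file with delete=False, close it before copying, and remove it yourself:
with sourcefile.open(mode="r") as source:
    destination = NamedTemporaryFile(mode="w", dir=str(sourcefile.parent), delete=False)
    try:
        lineno = 1
        while lineno < insert_lineno:
            destination.write(source.readline())
            lineno += 1
        destination.write(insert_data)
        while True:
            data = source.read(1024)
            if not data:
                break
            destination.write(data)
    finally:
        # Close the handle so Windows allows the file to be opened by name.
        destination.close()
# Overwrite the original file, then remove the temporary file manually
# (delete=False means it is not removed automatically).
copyfile(destination.name, str(sourcefile))
Path(destination.name).unlink()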

Python remove entry from zipfile

I'm currently writing an open-source library for a container format, which involves modifying zip archives. For that I use Python's built-in zipfile module. Due to some limitations I decided to modify the module and ship it with my library. These modifications include a patch for removing entries from the zip file, taken from the Python issue tracker: https://bugs.python.org/issue6818
To be more specific, I included the zipfile.remove.2.patch from ubershmekel.
After some modifications for Python 2.7, the patch works just fine according to the shipped unit tests.
Nevertheless, I'm running into problems when removing, adding, and removing + adding files without closing the zip file in between.
Error
Traceback (most recent call last):
  File "/home/martin/git/pyCombineArchive/tests/test_zipfile.py", line 1590, in test_delete_add_no_close
    self.assertEqual(zf.read(fname), data)
  File "/home/martin/git/pyCombineArchive/combinearchive/custom_zip.py", line 948, in read
    with self.open(name, "r", pwd) as fp:
  File "/home/martin/git/pyCombineArchive/combinearchive/custom_zip.py", line 1003, in open
    % (zinfo.orig_filename, fname))
BadZipFile: File name in directory 'foo.txt' and header 'bar.txt' differ.
Meaning the zip file is OK, but somehow the central directory/entry header gets messed up.
The following unit test reproduces the error:
def test_delete_add_no_close(self):
    fname_list = ["foo.txt", "bar.txt", "blu.bla", "sup.bro", "rollah"]
    data_list = [''.join([chr(randint(0, 255)) for i in range(100)]) for i in range(len(fname_list))]
    # add some files to the zip
    with zipfile.ZipFile(TESTFN, "w") as zf:
        for fname, data in zip(fname_list, data_list):
            zf.writestr(fname, data)
    for no in range(0, 2):
        with zipfile.ZipFile(TESTFN, "a") as zf:
            zf.remove(fname_list[no])
            zf.writestr(fname_list[no], data_list[no])
            zf.remove(fname_list[no+1])
            zf.writestr(fname_list[no+1], data_list[no+1])
            # try to access prior deleted/added file and prior last file (which got moved, while delete)
            for fname, data in zip(fname_list, data_list):
                self.assertEqual(zf.read(fname), data)
My modified zipfile module and the complete unittest file can be found in this gist: https://gist.github.com/FreakyBytes/30a6f9866154d82f1c3863f2e4969cc4
After some intensive debugging, I'm quite sure something goes wrong when the remaining chunks (the ones stored after the removed file) are moved. So I went ahead and rewrote that part of the code so that it copies these files/chunks one at a time. It also rewrites the file header for each of them (to make sure it is valid) and the central directory at the end of the zip file.
My remove function now looks like this:
def remove(self, member):
    """Remove a file from the archive. Only works if the ZipFile was opened
    with mode 'a'."""
    if "a" not in self.mode:
        raise RuntimeError('remove() requires mode "a"')
    if not self.fp:
        raise RuntimeError(
            "Attempt to modify ZIP archive that was already closed")
    fp = self.fp
    # Make sure we have an info object
    if isinstance(member, ZipInfo):
        # 'member' is already an info object
        zinfo = member
    else:
        # Get info object for member
        zinfo = self.getinfo(member)
    # start at the pos of the first member (smallest offset)
    position = min([info.header_offset for info in self.filelist])  # start at the beginning of first file
    for info in self.filelist:
        fileheader = info.FileHeader()
        # is member after delete one?
        if info.header_offset > zinfo.header_offset and info != zinfo:
            # rewrite FileHeader and copy compressed data
            # Skip the file header:
            fp.seek(info.header_offset)
            fheader = fp.read(sizeFileHeader)
            if fheader[0:4] != stringFileHeader:
                raise BadZipFile("Bad magic number for file header")
            fheader = struct.unpack(structFileHeader, fheader)
            fname = fp.read(fheader[_FH_FILENAME_LENGTH])
            if fheader[_FH_EXTRA_FIELD_LENGTH]:
                fp.read(fheader[_FH_EXTRA_FIELD_LENGTH])
            if zinfo.flag_bits & 0x800:
                # UTF-8 filename
                fname_str = fname.decode("utf-8")
            else:
                fname_str = fname.decode("cp437")
            if fname_str != info.orig_filename:
                if not self._filePassed:
                    fp.close()
                raise BadZipFile(
                    'File name in directory %r and header %r differ.'
                    % (zinfo.orig_filename, fname))
            # read the actual data
            data = fp.read(fheader[_FH_COMPRESSED_SIZE])
            # modify info obj
            info.header_offset = position
            # jump to new position
            fp.seek(info.header_offset, 0)
            # write fileheader and data
            fp.write(fileheader)
            fp.write(data)
            if zinfo.flag_bits & _FHF_HAS_DATA_DESCRIPTOR:
                # Write CRC and file sizes after the file data
                fp.write(struct.pack("<LLL", info.CRC, info.compress_size,
                                     info.file_size))
            # update position
            fp.flush()
            position = fp.tell()
        elif info != zinfo:
            # move to next position
            position = position + info.compress_size + len(fileheader) + self._get_data_descriptor_size(info)
    # Fix class members with state
    self.start_dir = position
    self._didModify = True
    self.filelist.remove(zinfo)
    del self.NameToInfo[zinfo.filename]
    # write new central directory (includes truncate)
    fp.seek(position, 0)
    self._write_central_dir()
    fp.seek(self.start_dir, 0)  # jump to the beginning of the central directory, so it gets overridden at close()
You can find the complete code in the latest revision of the gist: https://gist.github.com/FreakyBytes/30a6f9866154d82f1c3863f2e4969cc4
or in the repo of the library I'm writing: https://github.com/FreakyBytes/pyCombineArchive
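As a point of comparison (not part of the patch-based approach above), the simpler but slower way to remove an entry is to rewrite the whole archive without it; a minimal sketch using the standard zipfile module, with an illustrative helper name drop_member:
import shutil
import zipfile

def drop_member(archive_path, member_name):
    tmp_path = archive_path + ".tmp"
    with zipfile.ZipFile(archive_path, "r") as src, \
         zipfile.ZipFile(tmp_path, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename != member_name:
                # Copy entry data and metadata into the new archive
                dst.writestr(item, src.read(item.filename))
    shutil.move(tmp_path, archive_path)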

IOError [Errno 13] when using numpy.loadtxt?

I have a function that polls a folder for new files, then loads them with numpy.loadtxt when a file shows up. The function is called from a while loop that runs for 30 seconds. It works properly most of the time, but for some files, seemingly at random, I get the error IOError: [Errno 13] Permission denied: 'myfilename1.txt'. Here is the content of my function:
before = dict([(f, None) for f in os.listdir(mydir)])
while 1:
    after = dict([(f, None) for f in os.listdir(mydir)])
    added = [f for f in after if f not in before]
    # New File
    if added:
        raw = numpy.loadtxt(mydir + added[0])
        return raw
Any idea why this is happening? It properly polls and reads most incoming text files, but sometimes it throws this error and I can't find a systematic reason why.
UPDATE:
It has something to do with using the full path with loadtxt. When I change the working directory to the directory where the files are, I no longer get the permission error.
Have you tried opening the file as read-only? There may be a conflict if the file is accessed by another application (or is still being created):
# New File
if added:
    with open(mydir + added[0], 'r') as f:
        raw = numpy.loadtxt(f)
You could also try some form of IOError handling that waits a little while and then tries again:
import time

before = dict([(f, None) for f in os.listdir(mydir)])
added = False
while 1:
    # New File
    if added:
        try:
            raw = numpy.loadtxt(mydir + added[0])
            return raw
        except IOError:
            time.sleep(5)
    else:
        after = dict([(f, None) for f in os.listdir(mydir)])
        added = [f for f in after if f not in before]
I got the same error when I attempted the following:
Y = np.loadtxt("C:/Users/erios/images_3_color_15k_labeled/", dtype='int')
i.e., I passed the folder where the text file was located. Instead, the following command executed with no error:
Y = np.loadtxt("C:/Users/erios/images_3_color_15k_labeled/labels_for_locations.txt", dtype='int')
In sum, specify the full name of the text file, not just the folder.
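On a related note, since the snippets above build paths by string concatenation (mydir + added[0]), a small aside (not from either answer, and assuming the same mydir, added and numpy variables) is to use os.path.join so a missing trailing separator in mydir cannot produce a wrong path:
import os

# Build the path explicitly instead of concatenating strings.
raw = numpy.loadtxt(os.path.join(mydir, added[0]))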

Python + Django: Weird memory leaks with Database Queries

I wrote a routine that browses through a bunch of files and adds database entries (if not already present) according to the data found in each line of those files:
for root, dirs, files in walk_dir(path):
    for file_name in files:
        file_path = join_path(root, file_name)
        with transaction.commit_on_success():
            with open(file_path, "r") as f:
                for i, line in enumerate(f):
                    handle_line(line)

def handle_line(line):
    phrase, translated_phrase, context = get_params(line)
    try:
        ph = Phrase.objects.get(name=phrase)
    except Phrase.DoesNotExist:
        ph = Phrase(name=phrase)
        ph.save()
    try:
        tr = Translation.objects.get(phrase=ph, name=translated_phrase)
    except Translation.DoesNotExist:
        tr = Translation(phrase=ph, name=translated_phrase)
        tr.save()
    try:
        tm = TMTranslation.objects.get(translation=tr, context=context)
    except TMTranslation.DoesNotExist:
        tm = TMTranslation(translation=tr, context=context)
        tm.save()
There might be a lot of data to be processed (maybe 1000 files with several thousand lines each). Still, I don't see why I keep running into memory problems. Shouldn't memory be freed at least after each file (i.e., after each transaction)?
What I am experiencing instead is that this process slowly eats up all my memory and starts using swap space as well. So what am I doing wrong?
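One frequently mentioned cause of this pattern (an assumption here, since the settings are not shown in the question) is running with DEBUG = True: Django then appends every executed SQL query to django.db.connection.queries, which grows without bound in a long batch job. A minimal sketch, using a hypothetical per-file helper, of clearing that log periodically:
from django import db

def handle_file(file_path):
    with open(file_path, "r") as f:
        for i, line in enumerate(f):
            handle_line(line)
    # With DEBUG = True, Django keeps every executed query in memory;
    # drop the accumulated query log after each file.
    db.reset_queries()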
