High memory usage with Python's native tarfile lib - python

I'm working in a memory-constrained environment and use a Python script with the tarfile library (http://docs.python.org/2/library/tarfile.html) to continuously make backups of log files.
As the number of log files has grown (~74,000), I've noticed that the system now effectively kills this backup process when it runs. It consumes an awful lot of memory (~192 MB before it gets killed by the OS).
I can make a gzipped tar archive ($ tar -czf) of the same log files without any problem or high memory usage.
Code:
import tarfile
t = tarfile.open('asdf.tar.gz', 'w:gz')
t.add('asdf')
t.close()
The dir "asdf" consists of 74407 files with filenames of length 73.
Is it not recommended to use Python's tarfile when you have a huge number of files?
I'm running Ubuntu 12.04.3 LTS and Python 2.7.3 (tarfile version seems to be "$Revision: 85213 $").

I did some digging in the source code and it seems that tarfile keeps every added file in a list of TarInfo objects (http://docs.python.org/2/library/tarfile.html#tarfile.TarFile.getmembers), causing an ever-increasing memory footprint with many, long file names.
The caching of these TarInfo objects was optimized significantly in a commit from 2008, http://bugs.python.org/issue2058, but from what I can see it was only merged into the py3k branch, i.e. Python 3.
One could reset the members list again and again, as in http://blogs.it.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/ (see the sketch below), but I'm not sure what internal tarfile functionality that loses, so I went with a system-level call instead: os.system('tar -czf asdf.tar asdf/').
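For reference, a minimal sketch of that members-reset workaround (assuming Python 2.7, following the linked blog post; note it gives up getmembers()/getnames() on the open archive):
import os
import tarfile

def make_archive(archive_name, source_dir):
    # Add files one at a time and drop the cached TarInfo objects,
    # so memory stays flat no matter how many files are archived.
    with tarfile.open(archive_name, 'w:gz') as tar:
        for root, dirs, files in os.walk(source_dir):
            for name in files:
                tar.add(os.path.join(root, name))
                tar.members = []  # reset the internal members list

make_archive('asdf.tar.gz', 'asdf')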

Two ways to solve this:
If your VM does not have swap, add some and try again. I had 13 GB of files to tar into one big bundle and the process was consistently being killed by the OS; adding 4 GB of swap helped. If you are using a Kubernetes pod or a Docker container, one quick workaround is to add swap on the host; with the SYS_ADMIN capability or privileged mode, the container will use the host's swap.
If you need tarfile with streaming to avoid the memory usage, check out https://gist.github.com/leth/6adb9d30f2fdcb8802532a87dfbeff77 (a rough stream-mode sketch follows below).
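For completeness, a rough sketch of tarfile's stream mode (this is not the linked gist itself): mode 'w|gz' writes the archive sequentially to any file-like object, e.g. stdout, a pipe, or a socket, so no seeking is needed:
import sys
import tarfile

# Stream the gzipped tar straight to stdout; redirect it to a file or
# pipe it over the network. The internal members list still grows
# unless it is reset as shown in the sketch above.
tar = tarfile.open(fileobj=sys.stdout, mode='w|gz')
tar.add('asdf')
tar.close()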

Related

Tried to delete a package from /nix/store, now the system has errors. How to fix?

error: opening file '/nix/store/4h464mkqfipf04jgz4jp3bx56sdn6av0-python3.7-somepackage-1.0.0.drv': No such file or directory
I manually deleted some files in an attempt to remove the package. However, nix-shell no longer works and gives me the above message. How do I fix this in Nix? I want to completely remove the package and reinstall it.
Additionally when I run the command below:
~/sources/integration_test >>> nix-env -u python3.7-somepackage-1.0.0
error: selector 'python3.7-somepackages-1.0.0' matches no derivations
Try running
nix-store --verify --check-contents --repair
From the manpages:
OPERATION --VERIFY
Synopsis
nix-store --verify [--check-contents] [--repair]
Description
The operation --verify verifies the internal consistency of the Nix database, and the
consistency between the Nix database and the Nix store. Any inconsistencies
encountered are automatically repaired. Inconsistencies are generally the result of
the Nix store or database being modified by non-Nix tools, or of bugs in Nix itself.
This operation has the following options:
--check-contents
Checks that the contents of every valid store path has not been altered by
computing a SHA-256 hash of the contents and comparing it with the hash stored in
the Nix database at build time. Paths that have been modified are printed out.
For large stores, --check-contents is obviously quite slow.
--repair
If any valid path is missing from the store, or (if --check-contents is given)
the contents of a valid path has been modified, then try to repair the path by
redownloading it. See nix-store --repair-path for details.
NB. I recommend reading the manpages yourself with man nix-store to ensure this is what you want before running this.
NB.2 Due to the nature of the operations, a lot has to be checked―this operation will take a while. For my 11 GiB /nix/store, this ran for 4m13s.
Addendum. In future, when you want to delete a package from the nix store manually, use
nix-store --delete /nix/store/[what you want to delete]
instead.

Dump Python sklearn model in Windows and read it in Linux

I am trying to save a scikit-learn model on a Windows server using sklearn.joblib.dump and then joblib.load the same file on a Linux server (CentOS 7.1). I get the error below:
ValueError: non-string names in Numpy dtype unpickling
This is what I have tried:
Tried both Python 2.7 and Python 3.5
Tried the built-in open() with 'wb' and 'rb' arguments
I really don't care how the file is moved, I just need to be able to move and load it in a reasonable amount of time.
Python pickles should be portable between Windows and Linux (a minimal round-trip sketch follows this list). There may be incompatibilities if:
the Python versions on the two hosts are different (if so, try installing the same version of Python on both hosts); and/or
one machine is 32-bit and the other is 64-bit (I don't know of any fix for this so far).
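A minimal round-trip sketch under those assumptions (same Python and scikit-learn versions on both hosts; sklearn.externals.joblib is where older scikit-learn releases bundle joblib):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib  # bundled joblib in older scikit-learn

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

joblib.dump(model, 'model.pkl')   # run this part on the Windows host
model = joblib.load('model.pkl')  # run this part on the Linux host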

How to limit memory usage within a python process

I run Python 2.7 on a Linux machine with 16 GB of RAM and a 64-bit OS. A Python script I wrote can load too much data into memory, which slows the machine down to the point where I cannot even kill the process any more.
While I can limit memory by calling:
ulimit -v 12000000
in my shell before running the script, I'd like to include a limiting option in the script itself. Everywhere I looked, the resource module is cited as having the same power as ulimit. But calling:
import resource
_, hard = resource.getrlimit(resource.RLIMIT_DATA)
resource.setrlimit(resource.RLIMIT_DATA, (12000, hard))
at the beginning of my script does absolutely nothing. Even setting the value as low as 12000 never crashed the process. I tried the same with RLIMIT_STACK, with the same result. Curiously, calling:
import subprocess
subprocess.call('ulimit -v 12000', shell=True)
does nothing as well.
What am I doing wrong? I couldn't find any actual usage examples online.
Edit: for anyone who is curious, using subprocess.call doesn't work because it creates a (surprise, surprise!) new process, which is independent of the one the current Python program runs in.
resource.RLIMIT_VMEM is the resource corresponding to ulimit -v.
RLIMIT_DATA only affects brk/sbrk system calls while newer memory managers tend to use mmap instead.
The second thing to note is that ulimit/setrlimit only affects the current process and its future children.
Regarding the AttributeError: 'module' object has no attribute 'RLIMIT_VMEM' message: the resource module docs mention this possibility:
This module does not attempt to mask platform differences — symbols not defined for a platform will not be available from this module on that platform.
According to the bash ulimit source, it uses RLIMIT_AS if RLIMIT_VMEM is not defined.
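Putting that together, a sketch of setting the limit from inside the script (note that setrlimit takes bytes, while ulimit -v takes kilobytes):
import resource

# Use RLIMIT_VMEM where it exists; fall back to RLIMIT_AS, as bash's
# ulimit does, on platforms such as Linux that do not define it.
limit = getattr(resource, 'RLIMIT_VMEM', resource.RLIMIT_AS)
soft, hard = resource.getrlimit(limit)
resource.setrlimit(limit, (12000000 * 1024, hard))  # ~12 GB, in bytes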

Running python from a mounted filesystem

I have a Django-based application that I'm running from a VirtualBox shared folder. When I start it with 'runserver', I get an error indicating that a module could not be found. After copying the exact same code to a directory on the local filesystem, it starts and runs as expected.
Has anyone seen anything like this when working with VirtualBox and Python?
It appears that module resolution works differently when Python runs from the mounted shared folder rather than a local folder, but I can't find a smoking gun indicating whether it's caused by how that folder is mounted or by Python itself.
Thanks!
Try to avoid putting your projects (a large number of files/directories) on vboxsf (the default synced folder).
vboxsf lacks support for symbolic/hard links, which can cause problems (e.g. when using git for version control); see VirtualBox ticket #818, which is still NOT fixed.
There is a chance you're running into a problem with file system case sensitivity. It took me a couple of hours to figure this out myself. The shared folder is case insensitive, but the local folders are case sensitive since they are on a different file system (ext3/4). So you'll run into problems where Python files in your current directory shadow an import of the same name.
A simple example with pycrypto showing this (pip install pycrypto if you don't have it):
vagrant@virtualos:/mnt/shared_folder$ python -c 'from Crypto.PublicKey import RSA'
vagrant@virtualos:/mnt/shared_folder$ touch crypto.py
vagrant@virtualos:/mnt/shared_folder$ python -c 'from Crypto.PublicKey import RSA'
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: No module named PublicKey
If I do the same thing in a local (ext4) directory it works fine. The import machinery simply behaves differently depending on the case sensitivity of the filesystem it runs on.
Unfortunately I have not found a good solution to this other than to manually copy files onto my VM instead of using shared folders.
One solution I found was to mount my shared folder into the VM with CIFS instead. That seems to work flawlessly. I did not find a solution for vboxsf.

ImportError: Bad magic number, since OSX Lion

I'm getting this error every time I run any python file in Eclipse using PyDev:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/site.py", line 73, in <module>
__boot()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/site.py", line 2, in __boot
import sys, imp, os, os.path
ImportError: Bad magic number in /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/os.pyc
I'm using Python 2.6. This problem does not occur when I run Python from the terminal (2.7 or 2.6). The only substantial thing I've changed since everything last worked is updating from Snow Leopard to OS X Lion.
Similar discussions seem to suggest deleting the .pyc file because of some mismatch between it and whatever originally generated it (I'm not entirely sure what a magic number is...). But I was a bit cautious about deleting os.pyc from the Frameworks directory when the only other file there is an os.pyo (not sure what the difference is), rather than an os.py.
I've installed all OSX Lion updates, Eclipse updates and PyDev updates.
This problem occurs even with code such as:
if __name__ == '__main__':
pass
Any help resolving this would be appreciated!
Upgrading Python to 2.7.1, running the "Update Shell Profile" command file located in the Python directory, and pointing the Python settings in NetBeans at the new installation worked for me.
Yeah, you'll need to regenerate all your *.pyc and *.pyo files from the *.py files.
How you do this depends on how they were generated in the first place. Some packagings of Python (and its add-ons), such as in some Linux distros, get a little too clever for their own good: they keep the original *.py files somewhere else and use their own build system to generate and place the *.pyc and/or *.pyo files. In a case like that, you have to use that build system to regenerate them from the original *.py files.
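For a stock CPython install (no distro build system in the way), a hedged sketch using the standard compileall module; run it with the same interpreter that will later import the modules (writing under /System will likely need root):
import compileall

# force=True recompiles even when timestamps look up to date, replacing
# any .pyc files left over from an older interpreter.
compileall.compile_dir(
    '/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6',
    force=True)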
FYI, *.pyo files are the optimized versions of compiled Python modules.
On OS X Lion, you should have an os.py file, and its absence is likely the root cause of your error: the os.pyc file was generated by a different version of Python than the one you are running now. Normally, I imagine the interpreter would just regenerate os.pyc from os.py, but for whatever reason your system does not have that file.
I suspect that this is a small data point in a larger set of issues and would, in general, recommend a reinstallation of your operating system.
For comparison, I'm running 10.7.1, and I have the following:
[2:23pm][wlynch@orange workout] ls /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/os.*
/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/os.py
/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/os.pyc
/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/os.pyo
As an aside, the *.pyo file is an optimized version of the python bytecode.
