Python ZipFile module extracts password-protected zips slowly

I am trying to write a Python script which should extract a zip file:
Board: Beagle-Bone black ~ 1GHz Arm-Cortex-a8, debian wheezy
Zipfile: /home/milo/my.zip, ~ 8 MB
>>> from zipfile import ZipFile
>>> zip = ZipFile("/home/milo/my.zip")
>>> zip.extractall(pwd="tst")
Other approaches, such as opening and reading the zipfile then writing the data out, or extracting just a particular file, have the same effect: extraction takes about 3-4 minutes.
Extracting the same file with just using unzip-tool takes less than 2 seconds.
Does anyone know what is wrong with my code, or even with the Python zipfile lib?
Thanks
Ajava

This seems to be a documented issue with the ZipFile module in Python 2.7. If you look at the documentation for ZipFile, it clearly mentions:
Decryption is extremely slow as it is implemented in native Python
rather than C.
If you need faster performance, you can either invoke an external program (like unzip or 7zip) from your code, or make sure the zip files you are working with are not password protected.
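If you go the external-program route, a minimal sketch using the system unzip binary could look like this (assumes unzip is installed; the password and paths are placeholders taken from the question, and the output directory is hypothetical):
import subprocess

# -P passes the password, -o overwrites existing files, -d sets the output directory
subprocess.check_call(
    ["unzip", "-P", "tst", "-o", "/home/milo/my.zip", "-d", "/home/milo/out"]
)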

Copied from my answer https://stackoverflow.com/a/72513075/10860732
It's unfortunate that Python doesn't implement zip decryption in C.
So I implemented it in Cython, which is 17 times faster.
Just get the dezip.pyx and setup.py from this gist.
https://gist.github.com/zylo117/cb2794c84b459eba301df7b82ddbc1ec
Then install Cython and build the extension in place:
pip3 install cython
python3 setup.py build_ext --inplace
Then run the original script with two more lines.
import zipfile
# add these two lines
from dezip import _ZipDecrypter_C
setattr(zipfile, '_ZipDecrypter', _ZipDecrypter_C)
z = zipfile.ZipFile('./test.zip', 'r')
z.extractall('/tmp/123', None, b'password')  # args: path, members, pwd

Related

How to install parts of specific python package?

I am looking at a code that uses acicobra package. Acicobra library is 670MB. The python code uses only a few functions from acicobra library. Is there a way to install only the required modules from this acicobra library and not the entire library? If I install the entire library, my docker image size gets inflated because of this gigantic library.
root@1f5edb150a78:/usr/local/lib/python2.7/site-packages# ls -l | grep -v dist-info | du -sh * | sort -hr
659M cobra
These are the only references to cobra in the python code
from cobra.mit.access import MoDirectory
from cobra.mit.session import LoginSession, LoginError
from cobra.mit.request import ClassQuery, DnQuery, QueryError
As you can see, the code is referencing only 3 modules out of the entire library.
I am looking for ways to avoid installing the entire library, to limit the size of the docker image.
Not unless it relies on other smaller sublibraries that you can fetch.
You could check out the code and see if you can create your own smaller python package.
But 90% of the time you'll go down a dependency chain and need the full library anyway.
Also check the license if this is for commercial use.
This should be the correct github for the source:
https://github.com/datacenter/cobra
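If you do want to experiment with trimming it down, a rough first step is to see which files those three imports actually pull in; the standard-library modulefinder can do that. A sketch, where my_script.py is a placeholder for a file containing just the three cobra imports:
from modulefinder import ModuleFinder

finder = ModuleFinder()
finder.run_script("my_script.py")  # script containing only the three cobra imports

# list every cobra submodule those imports drag in
for name in sorted(n for n in finder.modules if n.startswith("cobra")):
    print(name)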

How to use pydicom in jython

When I try to import dicom from the pydicom package I get an error.
I performed the following steps:
Downloaded the pydicom-0.9.9.tar file, extracted it and ran 'jython setup.py install' in cmd. But it's not working.
Is this due to compatibility of Jython with Python?
How can I make pydicom work in Jython?
There is a bug in Jython related to bytecode size: Jython can't compile a module if its bytecode is too large, and unfortunately PyDicom has 2 such files. So the workaround is to split those files into chunks and try installing again.
This is a temporary workaround; the issue has been resolved in Jython 2.7.1. For now, try the following (a sketch of a helper that generates the split files comes after these steps):
Split the "pydicom-0.9.8\dicom\_dicom_dict.py" file into multiple (4) files with about 700 entries each.
Split the "pydicom-0.9.8\dicom\_private_dict.py" file into multiple files with 700 entries in each.
Search and change the usage of _dicom_dict.py contents in the pydicom package
Example: go to datadict.py and edit the following:
from dicom._dicom_dict_1 import DicomDictionaryOne
from dicom._dicom_dict_2 import DicomDictionaryTwo
from dicom._dicom_dict_3 import DicomDictionaryThree
from dicom._dicom_dict_4 import DicomDictionaryFour
DicomDictionary.update(DicomDictionaryOne)
DicomDictionary.update(DicomDictionaryTwo)
DicomDictionary.update(DicomDictionaryThree)
DicomDictionary.update(DicomDictionaryFour)
Search and change the usage of _private_dict.py contents in the pydicom package
Install the package using setup.py
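As a rough sketch of the splitting step, a hypothetical one-off helper (run under CPython, where the original dicom/_dicom_dict.py still imports) could write the four split files with the variable names used in the datadict.py edit above:
from dicom._dicom_dict import DicomDictionary

names = ["DicomDictionaryOne", "DicomDictionaryTwo",
         "DicomDictionaryThree", "DicomDictionaryFour"]
items = sorted(DicomDictionary.items())
per_file = (len(items) + len(names) - 1) // len(names)  # roughly equal chunks

for i, name in enumerate(names):
    chunk = items[i * per_file:(i + 1) * per_file]
    with open("dicom/_dicom_dict_%d.py" % (i + 1), "w") as out:
        out.write("%s = {\n" % name)
        for tag, entry in chunk:
            out.write("    %r: %r,\n" % (tag, entry))
        out.write("}\n")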

Distributing a Python script to unpack .tar.xz

Is there a way to distribute a Python script that can unpack a .tar.xz file?
Specifically:
This needs to run on other people's machines, not mine, so I can't require any extra modules to have been installed.
I can get away with assuming the presence of Python 2.7, but not 3.x.
So that seems to amount to asking whether out-of-the-box Python 2.7 has such a feature, and as far as I can tell the answer is no, but is there anything I'm missing?
First decompress the xz file into tar data and then extract the tar data:
import lzma
import tarfile
with lzma.open("file.tar.xz") as fd:
    with tarfile.open(fileobj=fd) as tar:
        tar.extractall('/path/to/extract/to')
For Python 2.7 the lzma module is not in the standard library, so you would need to install a third-party backport such as pylzma or backports.lzma.
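If you really must stick to stock Python 2.7 with no extra modules, a hedged alternative is to shell out to the system tar, which handles .tar.xz on most modern Linux systems (this assumes a GNU tar with xz support on the target machines; the paths are placeholders):
import subprocess

# -x extract, -J filter through xz, -f archive file, -C target directory (must exist)
subprocess.check_call(["tar", "-xJf", "file.tar.xz", "-C", "/path/to/extract/to"])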

Compiling IronPython into an exe that uses standard library packages

In my IronPython script, I'm using standard libary modules like ConfigParser, logging and JSON.
Then I use pyc.py to create an executable. At first I ran into problems, namely '...ImportException: no module named ...'
since they weren't being included in the exe and accompanying dlls.
So I ran a solution from here: IronPython: EXE compiled using pyc.py cannot import module "os" and it mostly worked.
For example, importing 'ConfigParser' works since it exists in the IronPython 'Lib' folder as a single module, 'ConfigParser.py'. However, I'm still having trouble with json and logging since they live inside folders of their own name (packages?).
I'm feeling that I'm just missing something simple, and probably need to read up more on python modules and how they really work, but I'm not sure what I should be looking for.
Any help would be greatly appreciated.
Thanks!
EDIT:
I can't answer my own question yet, so I'll leave this here.
Somehow got it to work in a really 'hacky' way. There must be another much cleaner solution to this that I'm missing (some option in pyc.py?)
Here's what I did:
1) Made the StdLib.dll file generated from the link above (IronPython: EXE compiled using pyc.py cannot import module "os"). This would be missing the std lib packages.
2) Used SharpDevelop to compile the standard lib packages that weren't included in the above dll following the method here: http://community.sharpdevelop.net/blogs/mattward/archive/2010/03/16/CompilingPythonPackagesWithIronPython.aspx
3) Used SharpDevelop to build my program and tie together all the references.
- Reference to the dlls made in step 2
- Reference to the StdLib.dll made in step 1
Again, there must be a better solution to this.
I've found two ways to compile standard library python packages:
1st way: Individually compile each package into a dll
Using pyc.py, run something like (this example compiles logging package):
ipy pyc.py ".\Lib\logging\__init__.py" ".\Lib\logging\config.py" ".\Lib\logging\handlers.py" /target:dll /out:logging
This creates a logging.dll file, which you can then use like this:
import clr
clr.AddReference('StdLib') #from the compilation of non-package std libraries
clr.AddReference('logging')
import logging
Note: This is assuming you've run the solution from IronPython: EXE compiled using pyc.py cannot import module "os" to create StdLib.dll.
2nd way: Modify the compilation script that generated StdLib.dll
I changed this line:
#Build StdLib.DLL
gb = glob.glob(r".\Lib\*.py")
gb.append("/out:StdLib")
To this:
#Build StdLib.DLL
gb1 = glob.glob(r".\Lib\*.py")
gb2 = glob.glob(r".\Lib\*\*.py")
gb3 = glob.glob(r".\Lib\*\*\*.py")
gb = list(set(gb1 + gb2 + gb3))
gb.append("/out:StdLib")
This includes the subfolders in the Lib directory, which get missed by the original glob pattern (only top-level modules get included). Now packages like xml, json, logging, etc. get included in StdLib.dll.

Is there a faster method to load a yaml file than the standard .load method? Django/Python

I am loading a big yaml file and it is taking forever. I am wondering if there is a faster method than the yaml.load() method.
I have read that there is a CLoader option but haven't been able to get it running.
The website that suggested this CLoader method asks me to do this:
Download the source package PyYAML-3.08.tar.gz and unpack it.
Go to the directory PyYAML-3.08 and run:
$ python setup.py install
If you want to use LibYAML bindings, which are much faster than the pure Python version, you need to download and install LibYAML.
Then you may build and install the bindings by executing
$ python setup.py --with-libyaml install
In order to use LibYAML based parser and emitter, use the classes CParser and CEmitter:
from yaml import load, dump
try:
from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
from yaml import Loader, Dumper
This looks like it will work, but I don't have a setup.py anywhere in my Django project and therefore can't install/import any of these things.
Can anyone help me figure out how to do this, or let me know about another faster loading method?
Thanks for the help!
I have no idea what's faster - bspymaster's ideas might be the most useful.
When you download PyYAML-3.08.tar.gz, there will be a setup.py inside the archive which you can run.
Note to use LibYAML, download this: http://pyyaml.org/download/libyaml/yaml-0.1.4.tar.gz
And run using the instructions from http://pyyaml.org/wiki/LibYAML
You will need a set of build tools; these should already be installed on Linux/Unix, on OS X make sure Xcode is installed, and I'm not sure about Windows.
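Once the C bindings are built and installed, the try/except import quoted in the question picks up CLoader automatically; a minimal usage sketch (the file name is a placeholder):
from yaml import load
try:
    from yaml import CLoader as Loader  # fast LibYAML-based parser
except ImportError:
    from yaml import Loader  # pure-Python fallback

with open("data.yaml") as f:
    data = load(f, Loader=Loader)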
