Python zipfile, How to set the compression level?

Python zipfile, How to set the compression level? - python

Python supports zipping files when zlib is available, ZIP_DEFLATE
see:
https://docs.python.org/3.4/library/zipfile.html
The zip command-line program on Linux supports -1 fastest, -9 best.
Is there a way to set the compression level of a zip file created in Python's zipfile module?

Starting from python 3.7, the zipfile module added the compresslevel parameter.
https://docs.python.org/3/library/zipfile.html
I know this question is dated, but for people like me, that fall in this question, it may be a better option than the accepted one.

The zipfile module does not provide this. During compression it uses constant from zlib - Z_DEFAULT_COMPRESSION. By default it equals -1. So you can try to change this constant manually, as possible solution.

Python 3.7+ answer: If you look at the zipfile.ZipFile constructor you'll see this:
def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=True,
compresslevel=None):
"""Open the ZIP file with mode read 'r', write 'w', exclusive create 'x',
or append 'a'.
...
compression: ZIP_STORED (no compression), ZIP_DEFLATED (requires zlib),
ZIP_BZIP2 (requires bz2) or ZIP_LZMA (requires lzma).
compresslevel: None (default for the given compression type) or an integer
specifying the level to pass to the compressor.
When using ZIP_STORED or ZIP_LZMA this keyword has no effect.
When using ZIP_DEFLATED integers 0 through 9 are accepted.
When using ZIP_BZIP2 integers 1 through 9 are accepted.
"""
which means you can pass the desired compression in the constructor:
myzip = zipfile.ZipFile(file_handle, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9)
See also https://docs.python.org/3/library/zipfile.html

Related

How to read "-" (dash) as standard input with Python without writing extra code?

Using Python 3.5.x, not any greater version than that.
https://stackoverflow.com/a/30254551/257924 is the right answer, but doesn't provide a solution that is built into Python, but requires writing code from scratch:
I need to have a string that has a value of "-" to represent stdin, or its value is a path to a text file I want to read from. I want to use the with operator to open up either type of those files, without using conditional logic to check for "-" in my scripts. I have something that works, but it seems like it should be something that is built into Python core and not requiring me to roll my own context-manager, like this:
from contextlib import contextmanager
#contextmanager
def read_text_file_or_stdin(path):
"""Return a file object from stdin if path is '-', else read from path as a text file."""
if path == '-':
with open(0) as f:
yield f
else:
with open(path, 'r') as f:
yield f
# path = '-' # Means read from stdin
path = '/tmp/paths' # Means read from a text file given by this value
with read_text_file_or_stdin(path) as g:
paths = [path for path in g.read().split('\n') if path]
print("paths", paths)
I plan to pass in the argument to a script via something like -p - to mean "read from standard-input" or -p some_text_file meaning "read from some_text_file".
Does this require me to do the above, or is there something built into Python 3.5.x that provides this already? This seems like such a common need for writing CLI utilities, that it could have already been handled by something in the Python core or standard libraries.
I don't want to install any module/package from repositories outside of the Python standard library in 3.5.x, just for this.

The argparse module provides a FileType factory which knows about the - convention.
import argparse
p = argparse.ArgumentParser()
p.add_argument("-p", type=argparse.FileType("r"))
args = p.parse_args()
Note that args.p is an open file handle, so there's no need to open it "again". While you can still use it with a with statement:
with args.p:
for line in args.p:
...
this only ensures the file is closed in the event of an error in the with statement itself. Also, you may not want to use with, as this will close the file, even if you meant to use it again later.
You should probably use the atexit module to make sure the file gets closed by the end of the program, since it was already opened for you at the beginning.
import atexit
...
args = p.parse_args()
atexit.register(args.p.close)

Check https://docs.python.org/3/library/argparse.html#filetype-objects.
Where you can do like this
>>> parser = argparse.ArgumentParser()
>>> parser.add_argument('infile', nargs='?', type=argparse.FileType('r'),
... default=sys.stdin)
>>> parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'),
... default=sys.stdout)
So infile & outfile will support read & write streams for stdin & stdout by default.
Or using my favorite library click
Check more details at the library docs website
https://click.palletsprojects.com/en/7.x/api/#click.File
https://click.palletsprojects.com/en/7.x/arguments/#file-arguments
File Arguments
Since all the examples have already worked with filenames, it makes sense to explain how to deal with files properly. Command line tools are more fun if they work with files the Unix way, which is to accept - as a special file that refers to stdin/stdout.
Click supports this through the click.File type which intelligently handles files for you. It also deals with Unicode and bytes correctly for all versions of Python so your script stays very portable.
The library is the most "BASH/DASH" features friendly I believe.

Pass file as input to a program and store its output using the sh library in python

I'm confused on how exactly we should use the python sh library, specifically the sh.Command(). Basically, I wish to pass input_file_a to program_b.py and store its output in a different directory as output_file_b, how should I achieve this using the sh library in python?

If you mean input and output redirection, then see here (in) and here (out) respectively. In particular, looks like to "redirect" stdin you need to pass as argument the actual bytes (e.g. read them beforehand), in particular, the following should work according to their documentation (untested, as I don't have/work with sh - please let know if this works for you / fix whatever is missing):
import sh
python3 = sh.Command("python3")
with open(input_file_a, 'r') as ifile:
python3("program_b.py", _in=ifile.read(), _out=output_file_b)
Note that may need to specify argument search_paths for sh.Command for it to find python. Also, may need to specify full path to program_b.py file or os.chdir() accordingly.

Python 2.7 equivalent of importlib.machinery.EXTENSION_SUFFIXES

I want to use importlib.machinery.EXTENSION_SUFFIXES from Python 3 but am unfortunately using Python 2.7.
EXTENSION_SUFFIXES evaluates to ['.cpython-34m-x86_64-linux-gnu.so', '.cpython-34m.so', '.abi3.so', '.so'], but this is specific to my machine and possibly python version, so I cannot simply hardcode the list.
Here's where EXTENSION_SUFFIXES is built in Python 3's source: https://github.com/python/cpython/blob/3.6/Lib/importlib/_bootstrap_external.py#L1431. However it seems to go down into the C implementation (link), so it's unclear to me how I can get this info.
How can I obtain this list in Python 2.7?

Use imp.get_suffixes() instead:
Return a list of 3-element tuples, each describing a particular type of module. Each triple has the form (suffix, mode, type), where suffix is a string to be appended to the module name to form the filename to search for, mode is the mode string to pass to the built-in open() function to open the file (this can be 'r' for text files or 'rb' for binary files), and type is the file type, which has one of the values PY_SOURCE, PY_COMPILED, or C_EXTENSION, described below.
Thus, to filter this into a list of suffixes for C extension modules:
import imp
extension_suffixes = [suffix for (suffix, mode, type) in imp.get_suffixes()
if type == imp.C_EXTENSION]
This also works in Python 3, although imp is deprecated in Python 3.

zipfile module in python [duplicate]

I am creating an ZIP file with ZipFile in Python 2.5, it works OK so far:
import zipfile, os
locfile = "test.txt"
loczip = os.path.splitext (locfile)[0] + ".zip"
zip = zipfile.ZipFile (loczip, "w")
zip.write (locfile)
zip.close()
But I couldn't find how to encrypt the files in the ZIP file.
I could use system and call PKZIP -s, but I suppose there must be a more "Pythonic" way. I'm looking for an open source solution.

I created a simple library to create a password encrypted zip file in python. - here
import pyminizip
compression_level = 5 # 1-9
pyminizip.compress("src.txt", "dst.zip", "password", compression_level)
The library requires zlib.
I have checked that the file can be extracted in WINDOWS/MAC.

This thread is a little bit old, but for people looking for an answer to this question in 2020/2021.
Look at pyzipper
A 100% API compatible replacement for Python’s zipfile that can read and write AES encrypted zip files.
7-zip is also a good choice, but if you do not want to use subprocess, go with pyzipper...

The duplicate question: Code to create a password encrypted zip file? has an answer that recommends using 7z instead of zip. My experience bears this out.
Copy/pasting the answer by #jfs here too, for completeness:
To create encrypted zip archive (named 'myarchive.zip') using open-source 7-Zip utility:
rc = subprocess.call(['7z', 'a', '-mem=AES256', '-pP4$$W0rd', '-y', 'myarchive.zip'] +
['first_file.txt', 'second.file'])
To install 7-Zip, type:
$ sudo apt-get install p7zip-full
To unzip by hand (to demonstrate compatibility with zip utility), type:
$ unzip myarchive.zip
And enter P4$$W0rd at the prompt.
Or the same in Python 2.6+:
>>> zipfile.ZipFile('myarchive.zip').extractall(pwd='P4$$W0rd')

pyminizip works great in creating a password protected zip file. For unziping ,it fails at some situations. Tested on python 3.7.3
Here, i used pyminizip for encrypting the file.
import pyminizip
compression_level = 5 # 1-9
pyminizip.compress("src.txt",'src', "dst.zip", "password", compression_level)
For unzip, I used zip file module:
from zipfile import ZipFile
with ZipFile('/home/paulsteven/dst.zip') as zf:
zf.extractall(pwd=b'password')

You can use pyzipper for this task and it will work great when you want to encrypt a zip file or generate a protected zip file.
pip install pyzipper
import pyzipper
def encrypt_():
secret_password = b'your password'
with pyzipper.AESZipFile('new_test.zip',
'w',
compression=pyzipper.ZIP_LZMA,
encryption=pyzipper.WZ_AES) as zf:
zf.setpassword(secret_password)
zf.writestr('test.txt', "What ever you do, don't tell anyone!")
with pyzipper.AESZipFile('new_test.zip') as zf:
zf.setpassword(secret_password)
my_secrets = zf.read('test.txt')
The strength of the AES encryption can be configure to be 128, 192 or 256 bits. By default it is 256 bits. Use the setencryption() method to specify the encryption kwargs:
def encrypt_():
secret_password = b'your password'
with pyzipper.AESZipFile('new_test.zip',
'w',
compression=pyzipper.ZIP_LZMA) as zf:
zf.setpassword(secret_password)
zf.setencryption(pyzipper.WZ_AES, nbits=128)
zf.writestr('test.txt', "What ever you do, don't tell anyone!")
with pyzipper.AESZipFile('new_test.zip') as zf:
zf.setpassword(secret_password)
my_secrets = zf.read('test.txt')
Official Python ZipFile documentation is available here: https://docs.python.org/3/library/zipfile.html

#tripleee's answer helped me, see my test below.
This code works for me on python 3.5.2 on Windows 8.1 ( 7z path added to system).
rc = subprocess.call(['7z', 'a', output_filename + '.zip', '-mx9', '-pSecret^)'] + [src_folder + '/'])
With two parameters:
-mx9 means max compression
-pSecret^) means password is Secret^). ^ is escape for ) for Windows OS, but when you unzip, it will need type in the ^.
Without ^ Windows OS will not apply the password when 7z.exe creating the zip file.
Also, if you want to use -mhe switch, you'll need the file format to be in 7z instead of zip.
I hope that may help.

2022 answer:
I believe this is an utterly mundane task and therefore should be oneliner. I abstracted away all the frevolous details in a library that is as powerfull as a bash terminal.
from crocodile.toolbox import Path
file = Path(r'my_string_path')
result_file = file.zip(pwd="lol", use_7z=True)
when the 7z flag is raised, it gets called behind the scenes.
You don't need to learn 7z command line syntax.
You don't need to worry about installing 7z, does that automatically if it's not installed. (tested on windows so far)

You can use the Chilkat library. It's commercial, but has a free evaluation and seems pretty nice.
Here's an example I got from here:
import chilkat
# Demonstrates how to create a WinZip-compatible 128-bit AES strong encrypted zip
zip = chilkat.CkZip()
zip.UnlockComponent("anything for 30-day trial")
zip.NewZip("strongEncrypted.zip")
# Set the Encryption property = 4, which indicates WinZip compatible AES encryption.
zip.put_Encryption(4)
# The key length can be 128, 192, or 256.
zip.put_EncryptKeyLength(128)
zip.SetPassword("secret")
zip.AppendFiles("exampleData/*",True)
zip.WriteZip()

MD5 and SHA-2 collisions in Python

I'm writing a simple MP3 cataloguer to keep track of which MP3's are on my various devices. I was planning on using MD5 or SHA2 keys to identify matching files even if they have been renamed/moved, etc. I'm not trying to match MP3's that are logically equivalent (i.e.: same song but encoded differently). I have about 8000 MP3's. Only about 6700 of them generated unique keys.
My problem is that I'm running into collisions regardless of the hashing algorithm I choose. In one case, I have two files that happen to be tracks #1 and #2 on the same album, they are different file sizes yet produce identical hash keys whether I use MD5, SHA2-256, SHA2-512, etc...
This is the first time I'm really using hash keys on files and this is an unexpected result. I feel something fishy is going on here from the little I know about these hashing algorithms. Could this be an issue related to MP3's or Python's implementation?
Here's the snippet of code that I'm using:
data = open(path, 'r').read()
m = hashlib.md5(data)
m.update(data)
md5String = m.hexdigest()
Any answers or insights to why this is happening would be much appreciated. Thanks in advance.
--UPDATE--:
I tried executing this code in linux (with Python 2.6) and it did not produce a collision. As demonstrated by the stat call, the files are not the same. I also downloaded WinMD5 and this did not produce a collision(8d327ef3937437e0e5abbf6485c24bb3 and 9b2c66781cbe8c1be7d6a1447994430c). Is this a bug with Python hashlib on Windows? I tried the same under Python 2.7.1 and 2.6.6 and both provide the same result.
import hashlib
import os
def createMD5( path):
fh = open(path, 'r')
data = fh.read()
m = hashlib.md5(data)
md5String = m.hexdigest()
fh.close()
return md5String
print os.stat(path1)
print os.stat(path2)
print createMD5(path1)
print createMD5(path2)
>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=6617216L, st_atime=1303808346L, st_mtime=1167098073L, st_ctime=1290222341L)
>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=4921346L, st_atime=1303808348L, st_mtime=1167098076L, st_ctime=1290222341L)
>>> a7a10146b241cddff031eb03bd572d96
>>> a7a10146b241cddff031eb03bd572d96

I sort of have the feeling that you are reading a chunk of data which is smaller than the expected, and this chunk happens to be the same for both files. I don't know why, but try to open the file in binary with 'rb'. read() should read up to end of file, but windows behaves differently. From the docs
On Windows, 'b' appended to the mode
opens the file in binary mode, so
there are also modes like 'rb', 'wb',
and 'r+b'. Python on Windows makes a
distinction between text and binary
files; the end-of-line characters in
text files are automatically altered
slightly when data is read or written.
This behind-the-scenes modification to
file data is fine for ASCII text
files, but it’ll corrupt binary data
like that in JPEG or EXE files. Be
very careful to use binary mode when
reading and writing such files. On
Unix, it doesn’t hurt to append a 'b'
to the mode, so you can use it
platform-independently for all binary
files.

The files you're having a problem with are almost certainly identical if several different hashing algorithms all return the same hash results on them, or there's a bug in your implementation.
As a sanity test write your own "hash" that just returns the file's contents wholly, and see if this one generates the same "hashes".

As others have stated, a single hash collision is unlikely, and multiple nigh on impossible, unless the files are identical. I would recommend generating the sums with an external utility as something of a sanity check. For example, in Ubuntu (and most/all other Linux distributions):
blair#blair-eeepc:~$ md5sum Bandwagon.mp3
b87cbc2c17cd46789cb3a3c51a350557 Bandwagon.mp3
blair#blair-eeepc:~$ sha256sum Bandwagon.mp3
b909b027271b4c3a918ec19fc85602233a4c5f418e8456648c426403526e7bc0 Bandwagon.mp3
A quick Google search shows there are similar utilities available for Windows machines. If you see the collisions with the external utilities, then the files are identical. If there are no collisions, you are doing something wrong. I doubt the Python implementation is wrong, as I get the same results when doing the hash in Python:
>>> import hashlib
>>> hashlib.md5(open('Bandwagon.mp3', 'r').read()).hexdigest()
'b87cbc2c17cd46789cb3a3c51a350557'
>>> hashlib.sha256(open('Bandwagon.mp3', 'r').read()).hexdigest()
'b909b027271b4c3a918ec19fc85602233a4c5f418e8456648c426403526e7bc0'

Like #Delan Azabani said, there is something fishy here; collisions are bound to happen, but not that often. Check if the songs are the same, and update your post please.
Also, if you feel that you don't have enough keys, you can use two (or even more) hashing algorithms at the same time: by using MD5 for example, you have 2**128, or 340282366920938463463374607431768211456 keys. By using SHA-1, you have 2**160 or 1461501637330902918203684832716283019655932542976 keys. By combining them, you have 2**128 * 2**160, or 497323236409786642155382248146820840100456150797347717440463976893159497012533375533056.
(But if you ask me, MD5 is more than enough for your needs.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python zipfile, How to set the compression level? - python

Python supports zipping files when zlib is available, ZIP_DEFLATE see: https://docs.python.org/3.4/library/zipfile.html The zip command-line program on Linux supports -1 fastest, -9 best. Is there a way to set the compression level of a zip file created in Python's zipfile module?

Starting from python 3.7, the zipfile module added the compresslevel parameter. https://docs.python.org/3/library/zipfile.html I know this question is dated, but for people like me, that fall in this question, it may be a better option than the accepted one.

The zipfile module does not provide this. During compression it uses constant from zlib - Z_DEFAULT_COMPRESSION. By default it equals -1. So you can try to change this constant manually, as possible solution.

Related

How to read "-" (dash) as standard input with Python without writing extra code?

Pass file as input to a program and store its output using the sh library in python

Python 2.7 equivalent of importlib.machinery.EXTENSION_SUFFIXES

zipfile module in python [duplicate]

MD5 and SHA-2 collisions in Python

Categories

Resources