Python/Linux - unzip file while reading it

I have hundreds of zipped CSV files. This is great because they take very little space, but when it is time to use them, I have to free up space on my HD and unzip them before I can process them. I was wondering whether it is possible, with Python (or the Linux command line), to unzip a file while reading it. In other words, I would like to open a zip file, start decompressing it and, as we go, process the data.
So there would be no need for extra space on my drive. Any ideas or suggestions?

Python, since version 1.6, provides the zipfile module to handle this kind of situation. An example usage:
import csv
import zipfile

with zipfile.ZipFile('myarchive.zip') as archive:
    with archive.open('the_zipped_file.csv') as fin:
        reader = csv.reader(fin, ...)
        for record in reader:
            pass  # process record.
Note that in Python 3 things get a bit more complicated, because the file-like object returned by archive.open yields bytes while csv.reader wants strings. You can write a simple class that converts bytes to strings using a given encoding:
class EncodingConverter:
    def __init__(self, fobj, encoding):
        self._iter_fobj = iter(fobj)
        self._encoding = encoding

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._iter_fobj).decode(self._encoding)
and use it like:
import csv
import zipfile

with zipfile.ZipFile('myarchive.zip') as archive:
    with archive.open('the_zipped_file.csv') as fin:
        reader = csv.reader(EncodingConverter(fin, 'utf-8'), ...)
        for record in reader:
            pass  # process record.
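Alternatively (this is not in the original answer), the standard library's io.TextIOWrapper performs the same bytes-to-text conversion without a custom class. A minimal sketch, reusing the file names from above:
import csv
import io
import zipfile

with zipfile.ZipFile('myarchive.zip') as archive:
    with archive.open('the_zipped_file.csv') as fin:
        # TextIOWrapper decodes the binary stream on the fly
        reader = csv.reader(io.TextIOWrapper(fin, encoding='utf-8'))
        for record in reader:
            pass  # process record.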

While it is entirely possible to open ZIP files in Python, it is also possible to handle this transparently with a filesystem extension. Whether this is preferable depends on various factors, including system access and solution portability.
See Fuse-Zip:
With fuse-zip you really can work with ZIP archives as real directories. Unlike KIO or Gnome VFS, it can be used in any application without modifications.
Or AVFS: A Virtual File System:
AVFS is a system, which enables all programs to look inside gzip, tar, zip, etc. files or view remote (ftp, http, dav, etc.) files, without recompiling the programs.
Note that these solutions are system-specific and rely on FUSE. There may be similar transparent solutions for Windows, but those would require separate investigation for that specific system.

Related

Concatenating a list of files with subprocess and wildcards in python

I'm trying to concatenate multiple files in a directory into a single file. So far I've been trying to use cat with subprocess, with poor results.
My original code was:
source = ['folder01/*', 'folder02/*']
target = ['/output/File1', '/output/File2']
for f1, f2 in zip(source, target):
    subprocess.call(['cat', f1, '>>', f2])
I've tried handing it shell=True:
..., f2], shell=True)
And in conjunction with subprocess.Popen instead of call in a number of permutations, but with no joy.
As I've understood from other similar questions, with shell=True the command will need to be provided as a string. How can I go about calling cat on all items in my list whilst executing as a string?
You don't need subprocess here, and you should always avoid subprocess when you can (that means: 99.99% of the time).
As Joel pointed out in the comments, maybe I should take a few minutes and bullet points to explain why:
- Using subprocess (or similar) assumes your code will always run in the exact same environment: same OS, version, shell, installed tools, etc. This is really not suited to production-grade code.
- These kinds of libraries prevent you from writing "Pythonic" Python code; you end up handling errors by parsing strings instead of using try / except, etc.
- Tim Peters wrote the Zen of Python, and I encourage you to follow it; at least three of its points are relevant here: "Beautiful is better than ugly.", "Readability counts." and "Simple is better than complex.".
In other words: subprocess will only make your code less robust, force you to handle non-Python issues, and force you to perform tricky computing where you could just write clean and powerful Python code.
There are many more good reasons not to use subprocess, but I think you get the point.
Just open the files with open; here is a basic example you will need to adapt:
import os

for filename in os.listdir('./'):
    with open(filename, 'r') as fileh:
        with open('output.txt', 'a') as outputh:
            outputh.write(fileh.read())
Implementation example for your specific needs:
import os

sources = ['/tmp/folder01/', '/tmp/folder02/']
targets = ['/tmp/output/File1', '/tmp/output/File2']

# Loop over `sources`
for index, directory in enumerate(sources):
    # Match each `sources` directory with the expected `targets` file
    output_file = targets[index]
    # Loop over the files within `directory`
    for filename in os.listdir(directory):
        # Compute the absolute path of the file
        filepath = os.path.join(directory, filename)
        # Open the source file in read mode
        with open(filepath, 'r') as fileh:
            # Open the output file in append mode
            with open(output_file, 'a') as outputh:
                # Write the content into the output
                outputh.write(fileh.read())
Be careful: I changed your source and target values (note the /tmp/ prefixes).
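If the files are large, fileh.read() pulls each one fully into memory. A possible variant (my sketch, not the answerer's code) streams in chunks with shutil.copyfileobj and opens each target only once; sorted() is an assumption, added for a deterministic concatenation order:
import os
import shutil

sources = ['/tmp/folder01/', '/tmp/folder02/']
targets = ['/tmp/output/File1', '/tmp/output/File2']

for directory, output_file in zip(sources, targets):
    # Open the target once, then stream every source file into it
    with open(output_file, 'wb') as outputh:
        for filename in sorted(os.listdir(directory)):
            with open(os.path.join(directory, filename), 'rb') as fileh:
                shutil.copyfileobj(fileh, outputh)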

How to work with CSV files inside a zipped folder?

I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!
You are looking to avoid extracting to disk; for that, the zipfile docs offer ZipFile.open(), which gives you a file-like object. That is an object that mostly behaves like a regular file on disk, except that it reads from the archive without extracting anything. It yields bytes when read, at least in Python 3.
Something like this...
from zipfile import ZipFile
import csv
import io

with ZipFile('abc.zip') as myzip:
    print(myzip.filelist)
    for mf in myzip.filelist:
        with myzip.open(mf.filename) as myfile:
            mc = myfile.read()
            c = csv.reader(io.StringIO(mc.decode()))
            for row in c:
                print(row)
The Python documentation is actually quite good once one has learned how to find things, along with some of the basic programming terms and descriptions it uses.
Note that the in-memory buffer classes live in the io module, not in csv: the bytes read from the archive are decoded and wrapped in io.StringIO so that csv.reader gets the text it expects.

How to elegantly compare zip folder contents to unzipped folder contents

This is the scenario. I want to be able to back up the contents of a folder using a Python script. However, I want my backups to be stored in a zipped format, possibly bz2.
The problem comes from the fact that I don’t want to bother backing up the folder if the contents in the “current” folder are exactly the same as what is in my most recent backup.
My process will be like this:
Initiate backup
Check contents of “current” folder against what is stored in the most recent zipped backup
If same – then “complete”
If different, then run backup, then “complete”
Can anyone recommend the most reliable and simple way of completing step 2? Do I have to unzip the contents of the backup and store them in a temp directory to do the comparison, or is there a more elegant way? Possibly something to do with the modified date?
Zip files contain CRC32 checksums, and you can read them with the Python zipfile module: http://docs.python.org/2/library/zipfile.html. You can get a list of ZipInfo objects, each with a CRC member, from ZipFile.infolist(). The ZipInfo objects also carry modification dates.
You can compare the stored checksums with checksums calculated for the unpacked files. You need to read the unpacked files, but you avoid having to decompress everything.
CRC32 is not a cryptographic checksum but it should be enough if all you need is to check for changes.
This holds for zip files. Other archive formats (like tar.bz2) might not contain such easily-accessible metadata.
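A rough sketch of that idea (Python 3; folder_matches_backup is a hypothetical helper, it assumes the archive's member paths are relative to the folder, and it only detects changed or removed files, not newly added ones):
import os
import zipfile
import zlib

def folder_matches_backup(zip_path, folder):
    # Compare each member's stored CRC32 with a checksum computed
    # from the corresponding unpacked file on disk.
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = os.path.join(folder, info.filename)
            if not os.path.isfile(path):
                return False  # file was removed or renamed
            with open(path, 'rb') as f:
                if zlib.crc32(f.read()) != info.CRC:
                    return False  # contents changed
    return True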
I use this script to create a compressed backup of a directory, but only when the directory contents have changed since the last backup. I use an external .md5 file to store the digest of the backup file, and I check it to detect directory changes.
import bz2
import hashlib
import io
import os
import tarfile

def backup_dir(dirname, backup_path):
    # Build the tar archive in memory
    fobj = io.BytesIO()
    t = tarfile.open(mode='w', fileobj=fobj)
    t.add(dirname)
    t.close()
    buf = fobj.getvalue()
    new_md5 = hashlib.md5(buf).digest()
    if os.path.isfile(backup_path + '.md5'):
        old_md5 = open(backup_path + '.md5', 'rb').read()
    else:
        old_md5 = b''
    if new_md5 != old_md5:
        # Contents changed: write the compressed backup and its digest
        open(backup_path, 'wb').write(bz2.compress(buf))
        open(backup_path + '.md5', 'wb').write(new_md5)
        print('backup done!')
    else:
        print('nothing to do')
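Example call (hypothetical paths):
backup_dir('/home/me/data', '/backups/data.tar.bz2')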
Rsync will automatically detect and copy only modified files, but since you want to bzip the results, you still need to detect whether anything has changed.
How about outputting the directory listing (including timestamps) to a text file alongside your archive? The next time, diff the current directory structure against this stored listing, grep out the differences, and pipe that file list to rsync to include just the changed files.
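A possible sketch of that listing step in Python (the snapshot name and the tab-separated format are just illustrative choices):
import os

def snapshot(root, listing_path):
    # Record path, size and mtime for every file under root;
    # diffing two such listings reveals what changed.
    with open(listing_path, 'w') as out:
        for dirpath, _, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                out.write(f'{path}\t{st.st_size}\t{st.st_mtime}\n')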
You could also try the following process:
1) Initiate backup
2) Run backup
3) Compare both compressed files:
import filecmp
filecmp.cmp(Compressed_new_file, Compressed_old_file, shallow=True)
4) If same – delete new backup file then "complete"
5) Else “complete”
NOTE: In case you need to check just the time between the modifications, you can have a look at this documentation
Rather than decompressing the folder and comparing individual files, I think it might be easier to compare the compressed files.
Overall I feel (OK, it's just an intuition :D) this will work out better when there is a high probability that the contents of the folder change between runs of the script.

Issue Replacing Already Existing Strings with ConfigParser

I am using ConfigParser to save simple settings to a .ini file, and one of these settings is a directory. Whenever I replace a directory string such as D:/Documents/Data with a shorter directory string such as D:/, the leftover characters are placed two lines under the option. So the .ini file now looks like this:
[Settings]
directory = D:/
Documents/Data
What am I doing wrong? Here is my code:
import ConfigParser

class Settings():
    def __init__(self):
        self.config = ConfigParser.ConfigParser()

    def SetDirectory(self, dir):  # dir is the directory string
        self.config.readfp(open('settings.ini'))
        self.config.set('Settings', 'directory', dir)
        with open('settings.ini', 'r+') as configfile:
            self.config.write(configfile)
The r+ option (in the open in the with) is telling Python to keep the file's previous contents, just overwriting the specific bytes that will be written to it but leaving all others alone. Use w to open a file for complete overwriting, which seems to be what you should be doing here. Overwriting just selected bytes inside an existing file is very rarely what you want to do, particularly for text files, which you're more likely to want to see as sequence of lines of text, rather than bunches of bytes! (It can be useful in very specialized cases, mostly involving large binary files, where the by-byte view may make some sense).
The "by-line organization" with which we like to view text files is not reflected in the underlying filesystem (on any OS that is currently popular, at least -- back in the dark past some file organizations were meant to mimic packs of punched cards, for example, so each line had to be exactly 80 bytes, no more, no less... but that's a far-off ancient memory, at most, for the vast majority of computer programmers and users today;-).
So, "overwriting part of a file in-place" (where the file contains text lines of different lengths) becomes quite a problem. Should you ever need to do that, btw, consider the fileinput module of the standard Python library, which mimics this often-desired but-running-against-the-filesystem's-grain operation quite competently. But, it wouldn't help you much in this case, where simple total overwriting seems to be exactly right;-).
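Applied to the code from the question, the fix is just the file mode; everything else is unchanged:
def SetDirectory(self, dir):  # dir is the directory string
    self.config.readfp(open('settings.ini'))
    self.config.set('Settings', 'directory', dir)
    # 'w' truncates settings.ini before writing, so no leftover
    # characters from a longer previous value can survive
    with open('settings.ini', 'w') as configfile:
        self.config.write(configfile)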

Delete file from zipfile with the ZipFile Module

The only way I have come up with for deleting a file from a zipfile is to create a temporary zipfile without the file to be deleted, and then rename it to the original filename.
In Python 2.4 the ZipInfo class had an attribute file_offset, so it was possible to create a second zip file and copy the data over without decompressing/recompressing.
This file_offset is missing in Python 2.6, so is there another option than creating another zipfile by uncompressing every file and then recompressing it again?
Is there maybe a direct way of deleting a file in the zipfile? I searched and didn't find anything.
The following snippet worked for me (it deletes all *.exe files from a Zip archive):
import zipfile

zin = zipfile.ZipFile('archive.zip', 'r')
zout = zipfile.ZipFile('archive_new.zip', 'w')
for item in zin.infolist():
    buffer = zin.read(item.filename)
    if item.filename[-4:] != '.exe':
        zout.writestr(item, buffer)
zout.close()
zin.close()
If you read everything into memory, you can eliminate the need for a second file. However, this snippet recompresses everything.
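A sketch of that in-memory variant (the remove_from_zip name and the predicate parameter are my own, illustrative choices):
import io
import zipfile

def remove_from_zip(zip_path, predicate):
    # Rebuild the archive in a BytesIO buffer, skipping members for
    # which predicate(filename) is true, then overwrite the original.
    buf = io.BytesIO()
    with zipfile.ZipFile(zip_path, 'r') as zin, \
         zipfile.ZipFile(buf, 'w') as zout:
        for item in zin.infolist():
            if not predicate(item.filename):
                zout.writestr(item, zin.read(item.filename))
    with open(zip_path, 'wb') as f:
        f.write(buf.getvalue())

# e.g. remove_from_zip('archive.zip', lambda name: name.endswith('.exe'))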
After closer inspection, ZipInfo.header_offset is the offset from the start of the file. The name is misleading, because the main Zip header (the central directory) is actually stored at the end of the file. My hex editor confirms this.
So the problem you'll run into is the following: you need to delete the directory entry in the central directory as well, or it will point to a file that no longer exists. Leaving the central directory intact might work if you also keep the local header of the file you're deleting, but I'm not sure about that. How did you do it with the old module?
Without modifying the central directory, I get a "missing X bytes in zipfile" error when I open the archive. This might help you figure out how to modify it.
Not very elegant, but this is how I did it:
import subprocess
import zipfile

z = zipfile.ZipFile(zip_filename)
files_to_del = [f for f in z.namelist() if f.endswith('exe')]
cmd = ['zip', '-d', zip_filename] + files_to_del
subprocess.check_call(cmd)
# reload the modified archive
z = zipfile.ZipFile(zip_filename)
The routine delete_from_zip_file from ruamel.std.zipfile¹ allows you to delete a file based on its full path within the ZIP, or based on (re) patterns. E.g. you can delete all of the .exe files from test.zip using
from ruamel.std.zipfile import delete_from_zip_file
delete_from_zip_file('test.zip', pattern='.*.exe')
(please note the dot before the *).
This works similarly to mdm's solution (including the need for recompression), but recreates the ZIP file in memory (using the class InMemZipFile()), overwriting the old file after it has been fully read.
¹ Disclaimer: I am the author of that package.
Based on Elias Zamaria's comment on the question.
Having read through Python issue #51067, I want to give an update on it.
As of today a solution already exists, though it has not been accepted into Python due to a missing Contributor Agreement from the author.
Nevertheless, you can take the code from https://github.com/python/cpython/blob/659eb048cc9cac73c46349eb29845bc5cd630f09/Lib/zipfile.py and create a separate file from it. After that, just reference it from your project instead of the built-in Python library: import myproject.zipfile as zipfile.
Usage:
with zipfile.ZipFile("archive.zip", "a") as z:
    z.remove("firstfile.txt")
I believe it will be included in future Python versions. For me it works like a charm for this use case.
