How to elegantly compare zip folder contents to unzipped folder contents - python

This is the scenario. I want to be able to backup the contents of a folder using a python script. However, I want my backups to be stored in a zipped format, possibly bz2.
The problem comes from the fact that I don’t want to bother backing up the folder if the contents in the “current” folder are exactly the same as what is in my most recent backup.
My process will be like this:
Initiate backup
Check contents of “current” folder against what is stored in the most recent zipped backup
If same – then “complete”
If different, then run backup, then “complete”
Can anyone recommend the most reliable and simple way of completing step 2? Do I have to unzip the contents of the backup and store them in a temp directory to do a comparison, or is there a more elegant way of doing this? Possibly to do with the modified date?

Zip files contain CRC32 checksums and you can read them with the python zipfile module: http://docs.python.org/2/library/zipfile.html. You can get a list of ZipInfo objects with CRC members from ZipFile.infolist(). There are also modification dates in the ZipInfo object.
You can compare the zip checksum with calculated checksums for the unpacked files. You need to read the unpacked files but you avoid having to decompress everything.
CRC32 is not a cryptographic checksum but it should be enough if all you need is to check for changes.
This holds for zip files. Other archive formats (like tar.bz2) might not contain such easily-accessible metadata.
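A minimal sketch of that CRC comparison (assuming the archive stores paths relative to the backed-up folder; note it only notices changes to files that are already in the archive, not files newly added on disk):

import os
import zlib
import zipfile

def dir_matches_backup(dirname, zip_path):
    # Compare the CRC32 stored for every archive member with the CRC32
    # of the corresponding file on disk, without decompressing anything.
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.filename.endswith('/'):      # skip directory entries
                continue
            disk_path = os.path.join(dirname, info.filename)
            if not os.path.isfile(disk_path):    # file was deleted or renamed
                return False
            crc = 0
            with open(disk_path, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    crc = zlib.crc32(chunk, crc)
            if (crc & 0xFFFFFFFF) != info.CRC:
                return False
    return True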

I use this script to create a compressed backup of a directory only when the directory contents have changed since the last backup. I use an external md5 file to store the digest of the backup data and check it to detect directory changes.
import hashlib
import tarfile
import bz2
import cStringIO
import os

def backup_dir(dirname, backup_path):
    fobj = cStringIO.StringIO()
    t = tarfile.open(mode='w', fileobj=fobj)
    t.add(dirname)
    t.close()
    buf = fobj.getvalue()
    new_md5 = hashlib.md5(buf).digest()
    if os.path.isfile(backup_path + '.md5'):
        old_md5 = open(backup_path + '.md5').read()
    else:
        old_md5 = ''
    if new_md5 != old_md5:
        open(backup_path, 'wb').write(bz2.compress(buf))
        open(backup_path + '.md5', 'wb').write(new_md5)
        print 'backup done!'
    else:
        print 'nothing to do'

Rsync will automatically detect and only copy modified files, but seeing as you want to bzip the results, you still need to detect if anything has changed.
How about outputting the directory listing (including time stamps) to a text file alongside your archive? The next time, diff the current directory structure against this stored listing. You can grep the differences out and pipe that file list to rsync to include only the changed files.
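A rough sketch of the listing idea (the listing file name is hypothetical, and this only covers the change-detection part, not the rsync step):

import os

def make_listing(root):
    # One line per file: relative path, size, integer mtime
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            lines.append('%s\t%d\t%d' % (os.path.relpath(full, root), st.st_size, int(st.st_mtime)))
    return '\n'.join(sorted(lines))

def has_changed(root, listing_file='backup_listing.txt'):
    current = make_listing(root)
    try:
        with open(listing_file) as f:
            previous = f.read()
    except IOError:
        previous = ''
    if current == previous:
        return False
    with open(listing_file, 'w') as f:   # remember the new state for next time
        f.write(current)
    return True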

You could also try the following process:
1) Initiate backup
2) Run backup
3) Compare both compressed files:
import filecmp
filecmp.cmp(Compressed_new_file, Compressed_old_file, shallow=False)  # shallow=False compares actual contents, not just os.stat() info
4) If same – delete new backup file then "complete"
5) Else “complete”
NOTE: In case you need to check just the time between the modifications, you can have a look at this documentation
Rather than decompressing the folder and comparing individual files, I think it might be easier to compare the compressed files.
Overall I feel (ok, it's just an intuition :D) this will be better in case there is a high probability that the contents of the folder change between the times you run the script.

Related

Using os.system() in a specific directory only

I have a directory containing multiple files with similar names, and subdirectories named after them, so that files with like names are located in the corresponding subdirectory. I'm trying to concatenate all the .sdf files in a given subdirectory to a single .sdf file.
import os
from os import system

for ele in os.listdir(Path):
    if ele.endswith('.sdf'):
        os.chdir(Path + '/' + ele[0:5])
        system('cat' + ' ' + '*.sdf' + '>' + ele[0:5] + '.sdf')
However when I run this, the concatenated file includes every .sdf file from the original directory rather than just the .sdf files from the desired one. How do I alter my script to concatenate the files in the subdirectory only?
This is a very clumsy way of doing it. Using chdir is not recommended, and neither is system (subprocess is preferred, and it is overkill here just to call cat).
Let me propose a pure Python implementation using glob.glob to filter the .sdf files, reading each file one by one and writing it to the big file opened before the loop:
import glob, os

big_sdf_file = "all_data.sdf"  # I'll let you compute the name/directory you want
with open(big_sdf_file, "wb") as fw:
    for sdf_file in glob.glob(os.path.join(Path, "*.sdf")):
        with open(sdf_file, "rb") as fr:
            fw.write(fr.read())
I left big_sdf_file not computed; I would not recommend putting it in the same directory as the other files, since running the script twice would then take the output as input as well.
Note that the drawback of this approach is that if the files are big, they're read fully into memory, which can cause problems. In that case, replace
fw.write(fr.read())
by:
shutil.copyfileobj(fr,fw)
(importing shutil is necessary in that case). That copies the data in chunks instead of reading the whole file into memory.
I'll add that it's probably not the full solution you're expecting, since there seems to be something about scanning the sub-directories of Path to create one big .sdf file per sub-directory, but since the provided code doesn't use any system command or chdir, it should be easier to adapt to your needs.
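For instance, a sketch of the per-sub-directory variant (the Path variable and the output naming scheme are assumptions; adjust them to your layout):

import glob
import os
import shutil

Path = "/path/to/data"   # assumed root directory from the question

# One big .sdf per immediate sub-directory of Path, written next to the
# sub-directory (not inside it) so it is never picked up as input.
for sub in sorted(os.listdir(Path)):
    subdir = os.path.join(Path, sub)
    if not os.path.isdir(subdir):
        continue
    big_sdf_file = os.path.join(Path, sub + "_all.sdf")   # hypothetical naming scheme
    with open(big_sdf_file, "wb") as fw:
        for sdf_file in sorted(glob.glob(os.path.join(subdir, "*.sdf"))):
            with open(sdf_file, "rb") as fr:
                shutil.copyfileobj(fr, fw)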

Why is my original folder not kept after compression? Why is my compression so slow? - python 3.4

The purpose of this program is to zip a directory or a folder as simply as possible and write the generated .tar.gz to one of my USB flash drives (or any other location). Plans are to add a function that will also use 'GnuPG' to encrypt the folder, and another that will allow the user to input a time in order to perform this task daily, weekly, monthly, etc. I also want the user to be able to choose the destination of the zipped folder. Just wanted to post this up now to see if it works on a first attempt and to get a bit of feedback.
My main question is why I lose the main folder upon extraction of the compressed files. For example, if I compress "Documents", which contains the two folders "Videos" and "Pictures" and the file "manual.txt", then when I extract the archive it does not dump "Documents" into the extraction point; it dumps "Videos", "Pictures", and "manual.txt". Which is fine and all, no data loss and everything is still intact, but it creates a bit of clutter and I would like to keep the original directory.
I'm also wondering why in the world this program takes so long to compress, and why in some cases the resulting .tar.gz file is just as large as the original folder. This happens with video files; it does seem to compress text files well, and much more quickly.
Are video files just hard to compress, or what? It takes like 5 minutes to process 2 GB of video files and then they are the same as the original size. Kinda pointless.
Also, would it make sense to use regex to validate user input in this case? I could just use a couple of if statements instead, no? The preferred input in this program is 'root', not '/root'; couldn't I just have it cut the '/' off if the input starts with a '/'?
I mainly want to see if this is the right, most efficient way of doing things. I'd rather not be given the answer in the usual Stack Overflow copy/paste way; let's get a discussion going.
So why is this program so slow when processing larger amounts of data? I expect a reduction in speed, but not by that much.
#!/usr/bin/env python3
'''
author: ryan st***
date: 12/5/2015
time: 18:55 Eastern time (GMT -5)
language: python 3.4
'''
# Import, import, import.
import os, subprocess, sys, zipfile, re
import shutil
import time

# Backup (zip) files
def zipDir():
    try:
        # Get file to be zipped and zip file destination from user
        Dir = "~"
        str1 = input('Input directory to be zipped (eg. Documents, Downloads, Desktop/programs): ')
        # an input example that works "bin/mans"
        str2 = input('Zipped output directory (eg. root, myBackups): ')
        # an output example that works "bin2/test"
        zipName = input("What would you like to name your zipped folder? ")
        path1 = Dir, str1, "/"
        path2 = Dir, str2, "/"
        # Zip it up
        # print (zipFile, ".tar.gz will be created from the folder ", path1[0]+path1[1]+path1[2])
        # "and placed into the folder ", path2[0]+path2[1]+path2[2])
        zipDirTo = os.path.expanduser(os.path.join(path2[0], path2[1] + path2[2], zipName))
        zipDir = os.path.expanduser(os.path.join(path1[0], path1[1]))
        print('Directory "', zipDir, '" will be zipped and saved to the location: "', zipDirTo, '.tar.gz"')
        shutil.make_archive(zipDirTo, 'gztar', zipDir)
        print("file zipped")
    # In Case of mistake
    except:
        print("Something went wrong in compression.\n",
              "Ending Task, Please try again")
        quit()

# Execute the program
def main():
    print("It will be a fucking miracle if this succeeds.")
    zipDir()
    print("Success!!!!!!")
    time.sleep(2)
    quit()

# Wrap it all up
if __name__ == '__main__':
    main()
Video files are normally already compressed themselves, so recompressing them doesn't help. For image and video files, use plain tar only.
My main question is why I lose the main folder upon extraction of the compressed files
Because you're not storing that folder's name in the zip file. The paths you're using don't include Documents, they start with the name of the items inside Documents.
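For instance, with shutil.make_archive (which your script already uses) you can keep the top-level folder by pointing root_dir at the parent and base_dir at the folder itself; the paths below are just examples:

import os
import shutil

src = os.path.expanduser("~/Documents")               # the folder you want to keep on extraction
target = os.path.expanduser("~/backups/Documents")    # archive name without extension

# root_dir is the parent, base_dir is the folder itself, so the archive stores
# "Documents/..." paths and extraction recreates the Documents folder.
shutil.make_archive(target, 'gztar',
                    root_dir=os.path.dirname(src),
                    base_dir=os.path.basename(src))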
Are video files just hard to compress?
Any file that is already compressed, such as most video and audio formats, will be hard to compress further, and it will take quite a bit of time to find that out if the size is large. You might consider detecting compressed files and storing them in the zip file without further compression using the ZIP_STORED constant.
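A sketch of that idea with the zipfile module (the extension set is just an assumption about which formats are already compressed):

import os
import zipfile

# Extensions assumed to be already compressed; adjust the set to taste.
ALREADY_COMPRESSED = {'.mp4', '.mkv', '.avi', '.jpg', '.png', '.zip', '.gz', '.bz2'}

def zip_dir(dirname, zip_path):
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for dirpath, dirnames, filenames in os.walk(dirname):
            for name in filenames:
                full = os.path.join(dirpath, name)
                arcname = os.path.relpath(full, os.path.dirname(dirname))
                ext = os.path.splitext(name)[1].lower()
                # Store already-compressed formats as-is, deflate everything else
                ctype = zipfile.ZIP_STORED if ext in ALREADY_COMPRESSED else zipfile.ZIP_DEFLATED
                zf.write(full, arcname, compress_type=ctype)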
let's get a discussion going.
Stack Overflow's format is not really suited to discussions.

Reading gzipped data in Python

I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?
The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
import os
import tarfile

subset_path = r'c:\data\grant\files'
tar = tarfile.open("subset_full.tar.gz")
tar.extractall(subset_path)
subset_data_path = os.path.join(subset_path, 'subset_full')

Get big TAR(gz)-file contents by dir levels

I use python tarfile module.
I have a system backup in tar.gz file.
I need to get the first-level dirs and files list without listing ALL the files in the archive, because that takes TOO LONG.
For example: I need to get ['bin/', 'etc/', ... 'var/'] and that's all.
How can I do it? Maybe not even with a tar file? Then how?
You can't scan the contents of a tar without scanning the entire file; it has no central index. You need something like a ZIP.
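To illustrate the difference, a sketch of both approaches (the zip version reads only the central directory; the tar version still has to walk every member header, which is what makes it slow):

import tarfile
import zipfile

def zip_first_level(zip_path):
    # The zip central directory is read in one go, so this is fast even for huge archives.
    top = set()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            top.add(name.split('/', 1)[0])
    return sorted(top)

def tar_first_level(tar_path):
    # No index in a tar: this still walks every member header, just without extracting data.
    top = set()
    with tarfile.open(tar_path, 'r:*') as tf:
        for member in tf:
            top.add(member.name.split('/', 1)[0])
    return sorted(top)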

Delete file from zipfile with the ZipFile Module

The only way I came up for deleting a file from a zipfile was to create a temporary zipfile without the file to be deleted and then rename it to the original filename.
In python 2.4 the ZipInfo class had an attribute file_offset, so it was possible to create a second zip file and copy the data to other file without decompress/recompressing.
This file_offset is missing in Python 2.6, so is there any option other than creating another zipfile by uncompressing every file and then recompressing it again?
Is there maybe a direct way of deleting a file in the zipfile, I searched and didn't find anything.
The following snippet worked for me (deletes all *.exe files from a Zip archive):
import zipfile

zin = zipfile.ZipFile('archive.zip', 'r')
zout = zipfile.ZipFile('archive_new.zip', 'w')
for item in zin.infolist():
    buffer = zin.read(item.filename)
    if item.filename[-4:] != '.exe':
        zout.writestr(item, buffer)
zout.close()
zin.close()
If you read everything into memory, you can eliminate the need for a second file. However, this snippet recompresses everything.
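A sketch of that in-memory variant (the should_delete callback is made up for illustration; everything is still recompressed and the whole archive has to fit in RAM):

import io
import zipfile

def remove_from_zip(zip_path, should_delete):
    # Rebuild the archive in a BytesIO buffer, then overwrite the original in place.
    buf = io.BytesIO()
    with zipfile.ZipFile(zip_path, 'r') as zin, \
         zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            if not should_delete(item.filename):
                zout.writestr(item, zin.read(item.filename))
    with open(zip_path, 'wb') as f:
        f.write(buf.getvalue())

# e.g. drop all .exe files:
# remove_from_zip('archive.zip', lambda name: name.endswith('.exe'))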
After closer inspection, the ZipInfo.header_offset is the offset from the file start. The name is misleading, but the main ZIP header (the central directory) is actually stored at the end of the file. My hex editor confirms this.
So the problem you'll run into is the following: you need to delete the directory entry in the main header as well, or it will point to a file that doesn't exist anymore. Leaving the main header intact might work if you keep the local header of the file you're deleting as well, but I'm not sure about that. How did you do it with the old module?
Without modifying the main header I get an error "missing X bytes in zipfile" when I open it. This might help you to find out how to modify the main header.
Not very elegant but this is how I did it:
import subprocess
import zipfile

z = zipfile.ZipFile(zip_filename)
files_to_del = [f for f in z.namelist() if f.endswith('exe')]
cmd = ['zip', '-d', zip_filename] + files_to_del
subprocess.check_call(cmd)
# reload the modified archive
z = zipfile.ZipFile(zip_filename)
The routine delete_from_zip_file from ruamel.std.zipfile¹ allows you to delete a file based on its full path within the ZIP, or based on (re) patterns. E.g. you can delete all of the .exe files from test.zip using
from ruamel.std.zipfile import delete_from_zip_file
delete_from_zip_file('test.zip', pattern='.*.exe')
(please note the dot before the *).
This works similar to mdm's solution (including the need for recompression), but recreates the ZIP file in memory (using the class InMemZipFile()), overwriting the old file after it is fully read.
¹ Disclaimer: I am the author of that package.
Based on Elias Zamaria's comment on the question.
Having read through Python issue #51067, I want to give an update regarding it.
As of today, a solution already exists, though it is not approved by Python due to a missing Contributor Agreement from the author.
Nevertheless, you can take the code from https://github.com/python/cpython/blob/659eb048cc9cac73c46349eb29845bc5cd630f09/Lib/zipfile.py and create a separate file from it. After that just reference it from your project instead of built-in python library: import myproject.zipfile as zipfile.
Usage:
with zipfile.ZipFile("archive.zip", "a") as z:
    z.remove("firstfile.txt")
I believe it will be included in future Python versions. For me it works like a charm for the given use case.
