I am trying to extract a bz2-compressed folder to a specific location.
I can see the data inside with:
handler = bz2.BZ2File(path, 'r')
print handler.read()
But I wish to extract all the files in this compressed folder into a location specified by the user, maintaining the internal directory structure of the folder.
I am fairly new to this language. Please help.
Like gzip, bz2 is only a compressor for single files; it cannot archive a directory structure. What I suspect you have is an archive that was first created by software such as tar and then compressed with bz2. In order to recover the full directory structure, first decompress your bz2 file, then un-tar (or equivalent) the result.
Fortunately, the Python tarfile module supports bz2, so you can do this whole process in one shot.
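For example, a minimal sketch of that one-shot extraction, assuming the file really is a bz2-compressed tar archive; dest_dir is a placeholder for the user-specified target directory and path is the same variable used in the question:

import tarfile

dest_dir = '/path/chosen/by/user'   # placeholder for the user-specified location

tar = tarfile.open(path, 'r:bz2')   # opens the tar and decompresses the bz2 layer in one step
tar.extractall(dest_dir)            # recreates the archive's internal directory structure
tar.close()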
bzip2 is a data compression system which compresses one entire file. It does not bundle files and compress them like PKZip does. Therefore handler in your example has one and only one file in it and there is no "internal directory structure".
If, on the other hand, your file is actually a compressed tar-file, you should look at the tarfile module of Python which will handle decompression for you.
You need to use the tarfile module to decompress a .tar.bz2 file. From the docs, here is how you can do it:
import tarfile

tar = tarfile.open(path, "r:bz2")
for tarinfo in tar:
    print tarinfo.name, "is", tarinfo.size, "bytes in size and is",
    if tarinfo.isreg():
        print "a regular file."
        # read the file
        f = tar.extractfile(tarinfo)
        print f.read()
    elif tarinfo.isdir():
        print "a directory."
    else:
        print "something else."
tar.close()
I need to compress multiple files into one bz2 file in python.
I'm trying to find a way but I can't find an answer.
Is it possible?
This is what tarballs are for. The tar format packs the files together, then you compress the result. Python makes it easy to do both at once with the tarfile module, where passing a "mode" of 'w:bz2' opens a new tar file for writing with seamless bz2 compression. Super-simple example:
import tarfile

with tarfile.open('mytar.tar.bz2', 'w:bz2') as tar:
    for file in mylistoffiles:
        tar.add(file)
If you don't need much control over the operation, shutil.make_archive might be a possible alternative, which would simplify the code for compressing a whole directory tree to:
shutil.make_archive('mytar', 'bztar', directory_to_compress)
Take a look at Python's bz2 library. Make sure to Google it and read the docs first!
https://docs.python.org/2/library/bz2.html#bz2.BZ2Compressor
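Note that bz2 on its own compresses a single stream of data and will not bundle several files by itself; a minimal sketch (the file names are placeholders):

import bz2

data = open('report.txt', 'rb').read()                  # one input file (placeholder name)
open('report.txt.bz2', 'wb').write(bz2.compress(data))  # write the compressed stream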
You have to import the packages for this:

import os
import tarfile

Then you can compress multiple files into a bz2-compressed tar archive:

tar = tarfile.open("archive.tar.bz2", "w:bz2")   # save into the target directory
for f in ["gti.png", "gti.txt", "file.taz"]:
    tar.add(f, arcname=os.path.basename(f))
tar.close()

Using os.path.basename(src_file) as the arcname stores only the file name of each entry in the archive, without the directory part of its path.
Python's standard-library zipfile module handles multiple files in one archive, and since Python 3.3 it supports bzip2 compression via zipfile.ZIP_BZIP2.
import zipfile

sourcefiles = ['a.txt', 'b.txt']
with zipfile.ZipFile('out.zip', 'w') as outputfile:
    for sourcefile in sourcefiles:
        outputfile.write(sourcefile, compress_type=zipfile.ZIP_BZIP2)
I have a folder full of jar, html, css, and exe type files. How can I check the type of each file?
I already ran the "file" command on *NIX and used python-magic, but the result is all like this:
test : Zip archive data, at least v1.0 to extract
How can I get information specifically like test : jar using only the magic number?
How do I do that?
While not required, most JAR files have a META-INF/MANIFEST.MF file contained within them. You could check for the existence of this file, after checking if it's a zip file:
import zipfile

def zipFileContains(zipFileName, pathName):
    f = zipfile.ZipFile(zipFileName, "r")
    result = any(x.startswith(pathName.rstrip("/")) for x in f.namelist())
    f.close()
    return result

print zipFileContains("test.jar", "META-INF/MANIFEST.MF")
However, it might be better to just check if it's a zip file that ends in .jar.
Magic alone won't do it for you, since a JAR is literally just a zip file. Read more about the format here.
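A tiny sketch of that simpler check, using only the standard library (the file name is a placeholder):

import zipfile

name = "test.jar"
is_jar = zipfile.is_zipfile(name) and name.lower().endswith(".jar")
print is_jar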
I have a legacy script which fetches the Boost libraries via a Python script, extracts them, and then builds them.
On Windows, the extract step fails because the path is too long for some of the files in the Boost archive, e.g.:
IOError: [Errno 2] No such file or directory: 'C:\\<my_path>\\boost_1_57_0\\libs\\geometry\\doc\\html\\geometry\\reference\\spatial_indexes\\boost__geometry__index__rtree\\rtree_parameters_type_const____indexable_getter_const____value_equal_const____allocator_type_const___.html'
Is there any way to simply make the tarfile lib's extractall ignore all files with an .html extension?
Alternatively, is there a way to allow paths which exceed the Windows limit of 260 characters?
You can loop through all the files in the tar and extract only those that don't end with ".html"
import os
import tarfile

def custom_files(members):
    for tarinfo in members:
        if os.path.splitext(tarinfo.name)[1] != ".html":
            yield tarinfo

tar = tarfile.open("sample.tar.gz")
tar.extractall(members=custom_files(tar))
tar.close()
The example code and information about the modules were found here.
As for overcoming the limit on the length of file paths, please refer to the Microsoft documentation: https://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx
I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?
The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
import os
import tarfile

subset_path = r'c:\data\grant\files'
tar = tarfile.open("subset_full.tar.gz")
tar.extractall(subset_path)
subset_data_path = os.path.join(subset_path, 'subset_full')
This is the scenario. I want to be able to backup the contents of a folder using a python script. However, I want my backups to be stored in a zipped format, possibly bz2.
The problem comes from the fact that I don’t want to bother backing up the folder if the contents in the “current” folder are exactly the same as what is in my most recent backup.
My process will be like this:
Initiate backup
Check contents of “current” folder against what is stored in the most recent zipped backup
If same – then “complete”
If different, then run backup, then “complete”
Can anyone recommend the most reliable and simple way of completing step 2? Do I have to unzip the contents of the backup and store them in a temp directory to do a comparison, or is there a more elegant way of doing this? Possibly to do with the modified date?
Zip files contain CRC32 checksums and you can read them with the python zipfile module: http://docs.python.org/2/library/zipfile.html. You can get a list of ZipInfo objects with CRC members from ZipFile.infolist(). There are also modification dates in the ZipInfo object.
You can compare the zip checksum with calculated checksums for the unpacked files. You need to read the unpacked files but you avoid having to decompress everything.
CRC32 is not a cryptographic checksum but it should be enough if all you need is to check for changes.
This holds for zip files. Other archive formats (like tar.bz2) might not contain such easily-accessible metadata.
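A rough sketch of that comparison for a ZIP backup, assuming the archive is backup.zip and the live files sit under current_dir (both names are placeholders):

import os
import zipfile
import zlib

def folder_matches_backup(zip_path, current_dir):
    # compare the CRC32 stored for each archive member against the file on disk
    zf = zipfile.ZipFile(zip_path, 'r')
    try:
        for info in zf.infolist():
            if info.filename.endswith('/'):
                continue                      # skip directory entries
            disk_path = os.path.join(current_dir, info.filename)
            if not os.path.isfile(disk_path):
                return False                  # file was removed or renamed
            crc = 0
            f = open(disk_path, 'rb')
            for chunk in iter(lambda: f.read(65536), b''):
                crc = zlib.crc32(chunk, crc)
            f.close()
            if (crc & 0xffffffff) != info.CRC:
                return False                  # contents changed
    finally:
        zf.close()
    return True

# note: this does not notice files that were added since the last backup
print folder_matches_backup('backup.zip', 'current_dir')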
I use this script to create a compressed backup of a directory, but only when the directory contents have changed since the last backup.
I use an external md5 file to store the digest of the backup file, and I check it to detect directory changes.
import hashlib
import tarfile
import bz2
import cStringIO
import os

def backup_dir(dirname, backup_path):
    # build the tar archive in memory
    fobj = cStringIO.StringIO()
    t = tarfile.open(mode='w', fileobj=fobj)
    t.add(dirname)
    t.close()
    buf = fobj.getvalue()
    # compare the digest of the new archive with the stored one
    new_md5 = hashlib.md5(buf).digest()
    if os.path.isfile(backup_path + '.md5'):
        old_md5 = open(backup_path + '.md5', 'rb').read()
    else:
        old_md5 = ''
    if new_md5 != old_md5:
        open(backup_path, 'wb').write(bz2.compress(buf))
        open(backup_path + '.md5', 'wb').write(new_md5)
        print 'backup done!'
    else:
        print 'nothing to do'
Rsync will automatically detect and only copy modified files, but since you want to bzip the results, you still need to detect whether anything has changed.
How about outputting the directory listing (including time stamps) to a text file alongside your archive? The next time, diff the current directory structure against this stored listing, grep the differences out, and pipe that file list to rsync to include only the changed files.
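A small sketch of that listing idea in Python (the listing file name and folder are placeholders); comparing the stored text tells you whether anything needs to be re-archived:

import os

def listing(root):
    # sorted text listing of (relative path, size, mtime) for every file under root
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            lines.append('%s\t%d\t%d' % (os.path.relpath(full, root), st.st_size, int(st.st_mtime)))
    return '\n'.join(sorted(lines)) + '\n'

current = listing('folder_to_backup')
old = open('backup.filelist').read() if os.path.isfile('backup.filelist') else ''
if current != old:
    # something changed: run the backup here, then refresh the stored listing
    open('backup.filelist', 'w').write(current)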
You could also try the following process:
1) Initiate backup
2) Run backup
3) Compare both compressed files:
import filecmp
filecmp.cmp(Compressed_new_file, Compressed_old_file, shallow=True)
4) If same – delete new backup file then "complete"
5) Else “complete”
NOTE: In case you only need to check the modification times, you can have a look at this documentation.
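If checking modification times is enough, a minimal sketch could look like this (the folder and backup file names are placeholders):

import os

def newest_mtime(root):
    # most recent modification time of any file under root
    latest = os.path.getmtime(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            latest = max(latest, os.path.getmtime(os.path.join(dirpath, name)))
    return latest

# back up only if something in the folder is newer than the previous backup file
if newest_mtime('folder_to_backup') > os.path.getmtime('backup.tar.bz2'):
    print 'run backup'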
Rather than decompressing the folder and comparing individual files, I think it might be easier to compare the compressed files.
Overall I feel (OK, it's just an intuition :D) this will be better in case there is a high probability that the contents of the folder change between the times you run the script.