How to read gz compressed files from tar

Let's say we have a tar file which in turn contains multiple gzip-compressed files. I want to be able to read the contents of those gzip files without unpacking either the tar file or the individual gzip files to disk. I'm trying to use the tarfile module in Python.

This might work; I haven't tested it, but it has the main ideas and the related tools. It iterates over the files in the tar and, if they are gzipped, reads them into the file_contents variable:
import gzip
import tarfile
with tarfile.open("your.gz.tar") as tar:
    for member in tar.getmembers():
        if not (member.isfile() and member.name.endswith(".gz")):
            continue  # skip directories and members that are not gzipped
        fo = tar.extractfile(member)  # called on the open tar, not on the module
        file_contents = gzip.GzipFile(fileobj=fo).read()
Note: if a file is too large for memory, consider a streamed reader (chunk by chunk), as in the questions linked below.
If you have additional logic that depends on what the member (TarInfo) object looks like, you can use the attributes documented here:
https://docs.python.org/2/library/tarfile.html#tarinfo-objects
see:
How can I decompress a gzip stream with zlib?
Python decompressing gzip chunk-by-chunk
reading tar file contents without untarring it, in python script
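The chunk-by-chunk idea mentioned in the note could look roughly like the sketch below. It is written for Python 3 and is only an illustration: the function name iter_gzip_member_chunks, the 64 KiB default chunk size, and the rule that gzip members are recognised by a ".gz" suffix are all assumptions, not something from the original answer.

```python
import gzip
import tarfile


def iter_gzip_member_chunks(tar_path, chunk_size=64 * 1024):
    """Yield (member_name, chunk) pairs for every gzip member of a tar file.

    Nothing is extracted to disk, and each yielded chunk holds at most
    chunk_size decompressed bytes.
    """
    with tarfile.open(tar_path) as tar:
        for member in tar:  # iterating a TarFile walks its members in order
            # Assumption: gzip members are recognised by their ".gz" suffix.
            if not (member.isfile() and member.name.endswith(".gz")):
                continue
            raw = tar.extractfile(member)  # file-like view into the tar
            with gzip.GzipFile(fileobj=raw) as gz:
                while True:
                    chunk = gz.read(chunk_size)
                    if not chunk:
                        break
                    yield member.name, chunk
```

A caller would loop over the generator, e.g. `for name, chunk in iter_gzip_member_chunks("your.gz.tar"): ...`, and process each chunk as it arrives instead of joining them.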

Related

python: extracting a .bz2 compressed file from a torrent file

I have a .torrent file that contains a .bz2 file. I am sure that such a file is actually in the .torrent because I extracted the .bz2 with utorrent.
How can I do the same thing in python instead of using utorrent?
I have seen a lot of libraries for dealing with .torrent files in python but apparently none does what I need. Among my unsuccessful attempts I can mention:
import torrent_parser as tp
file_cont = tp.parse_torrent_file('RC_2015-01.bz2.torrent')
file_cont is now a dictionary and file_cont['info']['name'] is 'RC_2015-01.bz2', but if I try to open the file, i.e.
from bz2 import BZ2File
with BZ2File(file_cont['info']['name']) as f:
    what_I_want = f.read()
then the value from the dictionary is (obviously, I'd say) interpreted as a path, and I get
No such file or directory: 'RC_2015-01.bz2'
Other attempts have been even more ruinous.

A .torrent file is just a metadata file: it says where to get the data and what the files are called. You can't get the file contents from that file.
Only once you have successfully downloaded the file the torrent describes to disk (using torrent software) can you use BZ2File to open it (if it is in .bz2 format).
If you want to perform the actual download with Python, the only option I found was torrent-dl, which hasn't been updated for 2 years.
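Once the .bz2 file really is on disk, reading it is straightforward. The sketch below (Python 3) peeks at the first few lines without decompressing the whole file; the helper name head_bz2 and the assumption that the payload is UTF-8 text are mine, not part of the question.

```python
import bz2
from itertools import islice


def head_bz2(path, n=5, encoding="utf-8"):
    """Return the first n text lines of a .bz2 file that is already on disk."""
    # bz2.open decompresses on the fly; "rt" yields decoded text lines.
    with bz2.open(path, "rt", encoding=encoding) as f:
        return list(islice(f, n))
```

For a large dump this avoids the single f.read() call from the question, which would load the entire decompressed content into memory.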

compress multiple files into a bz2 file in python

I need to compress multiple files into one bz2 file in python.
I'm trying to find a way, but I can't find an answer.
Is it possible?

This is what tarballs are for. The tar format packs the files together, then you compress the result. Python makes it easy to do both at once with the tarfile module, where passing a "mode" of 'w:bz2' opens a new tar file for write with seamless bz2 compression. Super-simple example:
import tarfile
with tarfile.open('mytar.tar.bz2', 'w:bz2') as tar:
    for file in mylistoffiles:
        tar.add(file)
If you don't need much control over the operation, shutil.make_archive might be a possible alternative, which would simplify the code for compressing a whole directory tree to:
import shutil
shutil.make_archive('mytar', 'bztar', directory_to_compress)

Take a look at python's bz2 library. Make sure to google and read the docs first!
https://docs.python.org/2/library/bz2.html#bz2.BZ2Compressor
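For reference, BZ2Compressor on its own produces a single compressed stream with no file names or boundaries inside it, which is why several files normally go through a container such as tar first. A minimal sketch of incremental compression of one file (Python 3; the function name compress_file_bz2 and the chunk size are illustrative assumptions):

```python
import bz2


def compress_file_bz2(src_path, dst_path, chunk_size=64 * 1024):
    """Compress one file into one .bz2 stream, chunk by chunk."""
    compressor = bz2.BZ2Compressor()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # compress() may return b"" while it buffers input internally.
            dst.write(compressor.compress(chunk))
        # flush() emits whatever is still buffered and ends the stream.
        dst.write(compressor.flush())
```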

You have to import these modules:
import os
import tarfile
and then compress multiple files in bz2 format:
tar = tarfile.open("save the directory.tar.bz", "w:bz2")
for f in ["gti.png","gti.txt","file.taz"]:
    tar.add(os.path.basename(f))
tar.close()
Here the files are opened from the current directory; use
os.path.basename(src_file)
to refer to a file by its name only.

Python's standard-library zipfile module handles multiple files and has supported bz2 compression (zipfile.ZIP_BZIP2) since Python 3.3.
import zipfile
sourcefiles = ['a.txt', 'b.txt']
with zipfile.ZipFile('out.zip', 'w') as outputfile:
    for sourcefile in sourcefiles:
        outputfile.write(sourcefile, compress_type=zipfile.ZIP_BZIP2)

Why can't the zipfile module's is_zipfile function detect a gzip file?

I am aware of this question Why "is_zipfile" function of module "zipfile" always returns "false"?. I want to seek some more clarification and confirmation.
I have created a zip file in python using the gzip module.
If I check the zip file using the file command in OSX I get this
> file data.txt
data.txt: gzip compressed data, was "Slide1.html", last modified: Tue Oct 13 10:10:13 2015, max compression
I want to write a generic function to tell if the file is gzip'ed or not.
import gzip
import os
f = '/path/to/data.txt'
print(os.path.exists(f))  # True
with gzip.GzipFile(f) as zf:
    print(zf.read())  # Prints the content as expected
import zipfile
print(zipfile.is_zipfile(f))  # Gives me False. Not expected
I want to use the zipfile module, but it always reports False.
I just want confirmation that the zipfile module is not compatible with gzip. If so, why is that the case? Are zip and gzip considered different formats?

I have created a zip file in python using the gzip module.
No you haven't. gzip doesn't create zip files.
I just want to have a confirmation that zipfile module is not compatible with gzip.
Confirmed.
If so, why is that the case?
A gzip file is a single file compressed with DEFLATE (the algorithm zlib implements), wrapped in a very small header and trailer. A zip file is an archive of multiple files, each optionally compressed with DEFLATE, with per-file headers and a central directory.
Are zip and gzip considered different formats?
Yes.
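As for the generic "is it gzip'ed?" check the question asks for, one common approach is to look at the first two bytes, since every gzip stream starts with the magic number 0x1f 0x8b. A minimal sketch (the name is_gzip_file is an assumption; a match only shows the file claims to be gzip, not that the rest of it is intact):

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_file(path):
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

zipfile.is_zipfile does the analogous check for the zip format, which is why it returns False for a gzip file.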

Reading gzipped data in Python

I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?

The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
import os
import tarfile
# Raw string, so that '\f' in the path is not read as a form-feed escape
subset_path = r'c:\data\grant\files'
with tarfile.open("subset_full.tar.gz") as tar:
    tar.extractall(subset_path)
subset_data_path = os.path.join(subset_path, 'subset_full')
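The last line assumes the archive unpacks into a folder called subset_full, which depends on how the tarball was built. One way to check before extracting is to list the top-level names stored in the archive. A sketch under that assumption (Python 3; top_level_names is a made-up helper name):

```python
import tarfile


def top_level_names(archive_path):
    """Return the sorted first path components stored in a .tar.gz archive."""
    names = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for name in tar.getnames():
            # Member names use "/" separators; ignore a leading "./".
            parts = [p for p in name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)
```

If this returns a single folder name, that is the name to join onto subset_path; if it returns many entries, the files land directly in subset_path.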

Python: Unzipping and decompressing .Z files inside .zip

I am trying to unzip an Alpha.zip archive which contains a Beta directory, which contains a Gamma folder, which contains a.Z, b.Z, c.Z, d.Z files. Using zip and 7-zip I was able to extract all the a.D, b.D, c.D, d.D files stored within the .Z files.
I tried this in Python using import gzip and import zlib.
import sys
import os
import getopt
import gzip
f = open('a.d.Z','r')
file_content = f.read()
f.close()
I keep getting all sorts of errors including: this is not a zip file, return codecs.charmap_encode(input self.errors encoding_map) 0. Any suggestions as to how to code this?

You need to actually make use of a zip library of some kind. Right now you're importing gzip, but you're not doing anything with it. Try taking a look at the gzip documentation and opening the file using that library.
gzip_file = gzip.open('a.d.Z') # use gzip.open instead of builtin open function
file_content = gzip_file.read()
Edit based on your comment: you can't just open all kinds of compressed files with any compression library. A .Z extension conventionally marks output of the Unix compress tool (LZW), which neither the gzip nor the zlib module in Python's standard library can decode, but since extensions are just conventions, only you know for sure what compression format your file is in. If the data is actually a zlib stream, do something like this instead:
# Note: untested code ahead!
import zlib
with open('a.d.Z', 'rb') as f:  # Notice that I open this in binary mode
    file_content = f.read()  # Read the compressed binary data
    decompressed_content = zlib.decompress(file_content)  # Decompress
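Since the extension alone doesn't settle the format, it can help to look at the first bytes of the file before picking a library. The sketch below (Python 3) checks a few well-known signatures; the function name sniff_compression and the returned labels are assumptions for illustration, and a match is only a strong hint, not proof.

```python
def sniff_compression(path):
    """Guess a file's compression format from its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head[:2] == b"\x1f\x9d":
        return "compress"  # Unix compress (LZW), the usual meaning of .Z
    if head[:2] == b"\x1f\x8b":
        return "gzip"
    if head[:4] == b"PK\x03\x04":
        return "zip"
    if head[:3] == b"BZh":
        return "bzip2"
    # zlib header (RFC 1950): low nibble of byte 0 is 8 (DEFLATE) and the
    # first two bytes, read as a big-endian number, are a multiple of 31.
    if len(head) >= 2 and head[0] & 0x0F == 8 and ((head[0] << 8) | head[1]) % 31 == 0:
        return "zlib"
    return "unknown"
```

If this reports "compress", the Python standard library has no decoder for it, so an external tool (such as the 7-zip the question already used) or a third-party package would be needed.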
