Reading gzipped data in Python

I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?

The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
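import os
import tarfile
# Raw string, so that backslash sequences such as \f are not treated as escapes
subset_path = r'c:\data\grant\files'
# tarfile handles both the gzip decompression and the unpacking of the archive
with tarfile.open("subset_full.tar.gz") as tar:
    tar.extractall(subset_path)
subset_data_path = os.path.join(subset_path, 'subset_full')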
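If only the text files are needed, the archive can also be read without unpacking it to disk. Here is a minimal sketch, assuming the text members end in .txt and are UTF-8 encoded; the function name read_text_members is made up for illustration:

```python
import tarfile

def read_text_members(archive_path, suffix=".txt"):
    """Return {member name: decoded text} for matching files in a .tar.gz."""
    contents = {}
    # "r:gz" makes tarfile decompress gzip and parse the tar in one pass
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(suffix):
                # extractfile() returns a binary file-like object
                fileobj = tar.extractfile(member)
                # Assumption: the text files are UTF-8 encoded
                contents[member.name] = fileobj.read().decode("utf-8")
    return contents
```

The h5 files would still need to be extracted (or read into a buffer) before an HDF5 library can open them.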

Related

zipfile.ZipFile extracts the wrong file

I am working on a project that manipulates a document's XML file. My approach is as follows: first convert the DOCX document into a ZIP archive, then extract the contents of that archive to get access to the document.xml file, and finally convert the XML to a txt file in order to work with it.
I did all of the above on one document and everything worked perfectly. But when I switched to a different document, the zipfile library did not extract the contents of the new ZIP archive. Instead it somehow extracted the contents of the old document that I processed before, and converted document.xml into document.txt without me even running the block of code that converts the XML into txt.
The worst part is that the old document is not even in the directory anymore, so I have no idea how zipfile is extracting the contents of that particular document when it is not even in the path.
This is the code I am using in Jupyter notebook.
import os
import shutil
import zipfile
# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')
# Extract the ZIP
with zipfile.ZipFile('zip/document.zip', 'r') as zip_ref:
    zip_ref.extractall('data/extracted/')
# Convert "document.xml" to txt
os.rename('extracted/word/document.xml', 'extracted/word/document.txt')
# Read the txt file
with open('extracted/word/document.txt') as intxt:
    data = intxt.read()
This is the directory tree for the extracted zip archive for the first document.
data -
  1-docx
  2-zip
  3-extracted/-
    1-customXml/
    2-docProps/
    3-_rels
    4-[Content_Types].xml
    5-word/-document.txt
The 2nd document's directory tree should be as following
data -
  1-docx
  2-zip
  3-extracted/-
    1-customXml/
    2-docProps/
    3-_rels
    4-[Content_Types].xml
    5-word/-document.xml
But zipfile is extracting the contents of the first document even when that DOCX file is not in the directory. I am also using Ubuntu 20.04, so I am not sure whether it has to do with my OS.
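I suspect that you are having issues with relative paths: the code copies the archive to 'data/zip/document.zip' but opens 'zip/document.zip', and extracts into 'data/extracted/' but then renames a file under 'extracted/', so depending on the working directory it can pick up files left over from the earlier run. Unzipping any Word document creates the same file/directory structure, so stale files are easy to mistake for new ones. I'd suggest using absolute paths to avoid this. You may also want to delete the extracted files and directories once you are done processing them. That way you won't encounter any issues with lingering files.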
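One way to apply both suggestions is sketched below. This is only an illustration, not the asker's code: the function name read_document_xml is hypothetical, and it assumes the document part is UTF-8 encoded. It resolves the input to an absolute path and extracts into a fresh temporary directory that is deleted afterwards, so nothing from a previous run can be picked up.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def read_document_xml(docx_path):
    """Return the text of word/document.xml from a .docx file."""
    docx_path = Path(docx_path).resolve()  # absolute path, independent of the cwd
    # A fresh temporary directory guarantees no files from an earlier run remain
    work_dir = Path(tempfile.mkdtemp())
    try:
        # A .docx is already a ZIP archive, so it can be opened directly
        with zipfile.ZipFile(docx_path) as archive:
            archive.extractall(work_dir)
        return (work_dir / "word" / "document.xml").read_text(encoding="utf-8")
    finally:
        shutil.rmtree(work_dir)  # delete the extracted files when done
```

Since the XML is read as text directly, the copy to .zip and the rename to .txt are not needed in this variant.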

python: extracting a .bz2 compressed file from a torrent file

I have a .torrent file that contains a .bz2 file. I am sure that such a file is actually in the .torrent because I extracted the .bz2 with utorrent.
How can I do the same thing in python instead of using utorrent?
I have seen a lot of libraries for dealing with .torrent files in python but apparently none does what I need. Among my unsuccessful attempts I can mention:
import torrent_parser as tp
file_cont = tp.parse_torrent_file('RC_2015-01.bz2.torrent')
file_cont is now a dictionary and file_cont['info']['name']='RC_2015-01.bz2' but if I try to open the file, i.e.
from bz2 import BZ2File
with BZ2File(file_cont['info']['name']) as f:
    what_I_want = f.read()
then the content of the dictionary is (obviously, I'd say) interpreted as a path, and I get
No such file or directory: 'RC_2015-01.bz2'
Other attempts have been even more ruinous.
A .torrent file is just a metadata file, indicating where to get the data and the filename of the file. You can't get the file contents from that file.
Only once you have successfully downloaded the file described by the torrent to disk (using torrent software) can you use BZ2File to open it (if it is in .bz2 format).
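Once the download has finished, reading the file is straightforward. Here is a minimal sketch for Python 3, assuming the .bz2 file is already on disk and contains UTF-8 text with one record per line; the helper name iter_bz2_lines is made up for illustration:

```python
import bz2

def iter_bz2_lines(path, encoding="utf-8"):
    """Yield decoded lines from a .bz2 file that already exists on disk."""
    # Mode "rt" decompresses and decodes on the fly, one line at a time,
    # so a large file does not have to fit in memory
    with bz2.open(path, "rt", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")
```

For example, `for line in iter_bz2_lines('RC_2015-01.bz2'): ...` would walk through the downloaded file.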
If you want to perform the actual download with Python, the only option I found was torrent-dl which hasn't been updated for 2 years.

How to work with CSV files inside a zipped folder?

I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!
You are looking to avoid extracting to disk. In the Python zipfile docs there is ZipFile.open(), which gives you a file-like object, that is, an object that mostly behaves like a regular file on disk but is read straight from the archive. It returns bytes when read, at least in Python 3.
Something like this...
from zipfile import ZipFile
import csv
import io
with ZipFile('abc.zip') as myzip:
    print(myzip.namelist())
    for name in myzip.namelist():
        with myzip.open(name) as myfile:
            # ZipFile.open() yields bytes; wrap it so csv.reader receives text
            # (UTF-8 is assumed here, adjust the encoding to match your files)
            text = io.TextIOWrapper(myfile, encoding='utf-8', newline='')
            for row in csv.reader(text):
                print(row)
The documentation of Python is actually quite good once one has learned how to find things as well as some of the basic programming terms/descriptions used in the documentation.
ZipFile.open() returns a binary stream, while the csv module expects text, hence the extra step via io.TextIOWrapper to decode it.

How to get information of .jar file in python-magic

I have a folder full of files of type jar, html, css and exe. How can I check the file type?
I already ran the "file" command on *NIX and used python-magic, but the results all look like this:
test : Zip archive data, at least v1.0 to extract
How can I get specific information, like test : jar, using only the magic number?
How do I do this?
While not required, most JAR files have a META-INF/MANIFEST.MF file contained within them. You could check for the existence of this file, after checking if it's a zip file:
import zipfile
def zipFileContains(zipFileName, pathName):
    # True if any entry in the archive starts with the given path
    f = zipfile.ZipFile(zipFileName, "r")
    result = any(x.startswith(pathName.rstrip("/")) for x in f.namelist())
    f.close()
    return result
print(zipFileContains("test.jar", "META-INF/MANIFEST.MF"))
However, it might be better to just check if it's a zip file that ends in .jar.
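Both checks can be combined into one heuristic. The sketch below is an assumption about what "good enough" means here rather than a definitive test; the function name looks_like_jar is made up for illustration:

```python
import zipfile

def looks_like_jar(path):
    """Heuristic: a ZIP archive that is named *.jar or carries a JAR manifest."""
    # is_zipfile() inspects the file contents, not the file name
    if not zipfile.is_zipfile(path):
        return False
    if str(path).lower().endswith(".jar"):
        return True
    # Fall back to the manifest check for archives without a .jar extension
    with zipfile.ZipFile(path) as archive:
        return "META-INF/MANIFEST.MF" in archive.namelist()
```

Other ZIP-based formats can also contain a META-INF directory, so a positive result is a hint, not proof.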
Magic alone won't do it for you, since a JAR is literally just a zip file. Read more about the format here.

Delete file from zipfile with the ZipFile Module

The only way I came up with for deleting a file from a zipfile was to create a temporary zipfile without the file to be deleted and then rename it to the original filename.
In Python 2.4 the ZipInfo class had an attribute file_offset, so it was possible to create a second zip file and copy the data to the other file without decompressing/recompressing.
This file_offset is missing in Python 2.6, so is there another option than creating another zipfile by uncompressing every file and then recompressing it again?
Is there maybe a direct way of deleting a file in the zipfile? I searched and didn't find anything.
The following snippet worked for me (deletes all *.exe files from a Zip archive):
import zipfile
zin = zipfile.ZipFile('archive.zip', 'r')
zout = zipfile.ZipFile('archive_new.zip', 'w')
for item in zin.infolist():
    buffer = zin.read(item.filename)
    # Copy every entry except those ending in .exe
    if item.filename[-4:] != '.exe':
        zout.writestr(item, buffer)
zout.close()
zin.close()
If you read everything into memory, you can eliminate the need for a second file. However, this snippet recompresses everything.
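The in-memory variant could look roughly like this. It is a sketch under the assumption that the whole archive fits in memory; the function name remove_from_zip and its should_remove callback are made up for illustration:

```python
import io
import zipfile

def remove_from_zip(zip_path, should_remove):
    """Rewrite zip_path without the entries for which should_remove(name) is true."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(zip_path, "r") as zin, \
            zipfile.ZipFile(buffer, "w") as zout:
        for item in zin.infolist():
            if not should_remove(item.filename):
                # Passing the ZipInfo keeps name, date and compression type
                zout.writestr(item, zin.read(item.filename))
    # Only overwrite the original once the new archive is complete
    with open(zip_path, "wb") as f:
        f.write(buffer.getvalue())
```

For example, `remove_from_zip('archive.zip', lambda name: name.endswith('.exe'))` drops all .exe entries. Like the snippet above, it still decompresses and recompresses every remaining entry.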
After closer inspection the ZipInfo.header_offset is the offset from the file start. The name is misleading, but the main Zip header is actually stored at the end of the file. My hex editor confirms this.
So the problem you'll run into is the following: You need to delete the directory entry in the main header as well or it will point to a file that doesn't exist anymore. Leaving the main header intact might work if you keep the local header of the file you're deleting as well, but I'm not sure about that. How did you do it with the old module?
Without modifying the main header I get an error "missing X bytes in zipfile" when I open it. This might help you to find out how to modify the main header.
Not very elegant but this is how I did it:
import subprocess
import zipfile
# zip_filename is the path of the archive to modify
z = zipfile.ZipFile(zip_filename)
files_to_del = [f for f in z.namelist() if f.endswith('exe')]
z.close()
# "zip -d" deletes the named entries from the archive in place
cmd = ['zip', '-d', zip_filename] + files_to_del
subprocess.check_call(cmd)
# reload the modified archive
z = zipfile.ZipFile(zip_filename)
The routine delete_from_zip_file from ruamel.std.zipfile¹ allows you to delete a file based on its full path within the ZIP, or based on (re) patterns. E.g. you can delete all of the .exe files from test.zip using
from ruamel.std.zipfile import delete_from_zip_file
delete_from_zip_file('test.zip', pattern='.*.exe')
(please note the dot before the *).
This works similarly to mdm's solution (including the need for recompression), but recreates the ZIP file in memory (using the class InMemZipFile()), overwriting the old file after it is fully read.
¹ Disclaimer: I am the author of that package.
Based on Elias Zamaria's comment on the question.
Having read through Python issue #51067, I want to give an update on it.
As of today a solution already exists, though it has not been accepted into Python because the author has not signed the Contributor Agreement.
Nevertheless, you can take the code from https://github.com/python/cpython/blob/659eb048cc9cac73c46349eb29845bc5cd630f09/Lib/zipfile.py and create a separate file from it. After that just reference it from your project instead of built-in python library: import myproject.zipfile as zipfile.
Usage:
with zipfile.ZipFile("archive.zip", "a") as z:
    z.remove("firstfile.txt")
I believe it will be included in future Python versions. For me it works like a charm for the given use case.
