I have several tar archives that I need to extract/read in memory. The problem is each tar contains many ZIP archives and each contain unique XML documents.
So the structure of each tar is as follows: tar -> directories-> ZIPs->XML.
Obviously I can manually extract a single TAR but I have about 1000 TAR archives that are about 3 GB each and contains about 6000 ZIP archives each. I'm looking for a way to handle the .tar archives in memory and extract the XML data of each ZIP. Is there a way to do this?
This should be doable, since all of the relevant methods have non-disk-related options.
Lots of loops here, so let's dig in.
For each tar archive:
tarfile.open would open the tar archive. (Docs)
Call .getmembers on the resulting TarFile instance to get a list of the zips (or other files) contained in the archive. (Docs)
For each zip within the tar archive:
Once you know what member file (i.e., one of your zips) you want to look through, call .extractfile on your TarFile instance to get a file object for that zip. (Docs)
Instantiate a new zipfile.ZipFile with your file object in order to open the zip so you can work with it. (Docs)
Call .infolist on your ZipFile instance to get a list of the files it contains (including your XML files). (Docs)
For each XML file within the zip:
Call .open on your ZipFile instance in order to get a file object of one of your XML files. (Docs)
You now have a file object corresponding to one of your XML files. Do whatever you want with it: .read it, copy it to disk somewhere, stick it in an ElementTree (docs), etc.
Related
I am working on a project that manipulates with a document's xml file. My approach is like the following. First convert the DOCX document into a zip archive, then extract the contents of that archive in order to have access to the document.xml file, and finally convert the XML to a txt in order to work with it.
So i did all the above on a document, and everything worked perfectly, but when i decided to use a different document, the Zipfile library doesnt extract the content of the new ZIP archive, however it somehow extracts the contents of the old document that i processed before, and converts the document.xml file into document.txt without even me even running that block of code that converts the XML into txt.
The worst part is the old document is not even in the directory anymore, so i have no idea how Zipfile is extracting the content of that particular document when its not even in the path.
This is the code I am using in Jupyter notebook.
import shutil
import zipfile
# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')
# Extract the ZIP
with zipfile.ZipFile('zip/document.zip', 'r') as zip_ref:
zip_ref.extractall('data/extracted/')
# Convert "document.xml" to txt
os.rename('extracted/word/document.xml', 'extracted/word/document.txt')
# Read the txt file
with open('extracted/word/document.txt') as intxt:
data = intxt.read()
This is the directory tree for the extracted zip archive for the first document.
data -
1-docx
2-zip
3-extracted/-
1-customXml/
2-docProps/
3-_rels
4-[Content_Types].xml
5-word/-document.txt
The 2nd document's directory tree should be as following
data -
1-docx
2-zip
3-extracted/-
1-customXml/
2-docProps/
3-_rels
4-[Content_Types].xml
5-word/-document.xml
But Zipfile is extracting the contents of the first document even when the DOCX file is not in the directory.I am also using Ubuntu 20.04 so i am not sure if it has to do with my OS.
I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. I'd suggest using absolute paths to avoid this. What you may also want to do is, after you are done manipulating and processing the extracted files and directories, delete them. That way you won't encounter any issues with lingering files.
One easy way is to create a directory and populate it with files. Then archive and compress that directory into a zip file called, say, file.zip. But this approach is needless since my files are in memory already, and needing to save them to disk is excessive.
Is it possible that I create the directory structure right in memory, without saving the unzipped files/directories? So that I end up saving only the final file.zip (without the intermediate stage of saving files/directories on file system)?
You can use zipfile:
from zipfile import ZipFile
with ZipFile("file.zip", "w") as zip_file:
zip_file.writestr("root/file.json", json.dumps(data))
zip_file.writestr("README.txt", "hello world")
I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
`subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?
The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
import tarfile
subset_path='c:\data\grant\files'
tar = tarfile.open("subset_full.tar.gz")
tar.extractall(subset_path)
subset_data_path=os.path.join(subset_path,'subset_full')
This is the scenario. I want to be able to backup the contents of a folder using a python script. However, I want my backups to be stored in a zipped format, possibly bz2.
The problem comes from the fact that I don’t want to bother backing up the folder if the contents in the “current” folder are exactly the same as what is in my most recent backup.
My process will be like this:
Initiate backup
Check contents of “current” folder against what is stored in the most recent zipped backup
If same – then “complete”
If different, then run backup, then “complete”
Can anyone recomment the most reliable and simple way of completing step2? Do I have to unzip the contents of the backup and store in a temp directory to do a comparison or is there a more elegant way of doing this? Possibly to do with modified date?
Zip files contain CRC32 checksums and you can read them with the python zipfile module: http://docs.python.org/2/library/zipfile.html. You can get a list of ZipInfo objects with CRC members from ZipFile.infolist(). There are also modification dates in the ZipInfo object.
You can compare the zip checksum with calculated checksums for the unpacked files. You need to read the unpacked files but you avoid having to decompress everything.
CRC32 is not a cryptographic checksum but it should be enough if all you need is to check for changes.
This holds for zip files. Other archive formats (like tar.bz2) might not contain such easily-accessible metadata.
I use this script to create compress backup of a directory
only when the directory contents has changed after last backup.
I use external md5 file to store the digest of the backup file and I check
it to detect directory changes.
import hashlib
import tarfile
import bz2
import cStringIO
import os
def backup_dir(dirname, backup_path):
fobj = cStringIO.StringIO()
t = tarfile.open(mode='w',fileobj=fobj)
t.add(dirname)
t.close()
buf = fobj.getvalue()
new_md5 = hashlib.md5(buf).digest()
if os.path.isfile(backup_path + '.md5'):
old_md5 = open(backup_path + '.md5').read()
else:
old_md5 = ''
if new_md5 <> old_md5:
open(backup_path, 'wb').write(bz2.compress(buf))
open(backup_path + '.md5', 'wb').write(new_md5)
print 'backup done!'
else:
print 'nothing to do'
Rsync will automatically detect and only copy modified files, but seeing as you want to bzip the results, you still need to detect if anything has changed.
How about you output the directory listing (including time stamps) to a text file alongside your archive. The next time you diff the current directory structure against this stored text. You can grep differences out and pipe this file list to rsync to include those changed files.
You could also try the following process:
1) Initiate backup
2) Run backup
3) Compare both compressed files:
import filecmp
filecmp.cmp(Compressed_new_file, Compressed_old_file, shallow=True)
4) If same – delete new backup file then "complete"
5) Else “complete”
NOTE: In case you need to check just the time between the modifications, you can have a look at this documentation
Rather than decompressing the folder and comparing individual files, I think it might be easier to compare the compressed files.
Overall I feel (ok, its just an intuition :D) this will be better in case there is a high probability that the contents of the folder changes in between the times you run the script
I use python tarfile module.
I have a system backup in tar.gz file.
I need to get first level dirs and files list without getting ALL the list of files in the archive because it's TOO LONG.
For example: I need to get ['bin/', 'etc/', ... 'var/'] and that's all.
How can I do it? May be not even with a tar-file? Then how?
You can't scan the contents of a tar without scanning the entire file; it has no central index. You need something like a ZIP.