Get big TAR(gz)-file contents by dir levels - python

I use the Python tarfile module.
I have a system backup in tar.gz file.
I need to get the list of first-level dirs and files without listing ALL the files in the archive, because that list is TOO LONG.
For example: I need to get ['bin/', 'etc/', ... 'var/'] and that's all.
How can I do it? Maybe not even with a tar file? Then how?

You can't list just the top level of a tar without scanning the entire file: tar is a sequential stream of member headers and data, with no central index. If you need fast partial listings, you want a format that has one, such as ZIP.
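If you do stick with tar, a streaming scan at least avoids building the full member list in memory; it still reads the whole archive, just without storing every entry. A rough sketch, assuming a file named backup.tar.gz:

import tarfile

top_level = set()
with tarfile.open("backup.tar.gz", "r|gz") as tar:   # "r|gz" = pure streaming read, no seeking
    for member in tar:                                # yields TarInfo objects one at a time
        top_level.add(member.name.split("/", 1)[0])

print(sorted(top_level))   # e.g. ['bin', 'etc', ..., 'var']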

Related

os.walk isn't showing all the files in the given path

I'm trying to make my own backup program. To do so I need to be able to give a directory and get every file, even those deep down in subdirectories, so that I can copy them. I tried making a script, but it doesn't give me all the files in that directory. I used documents as a test; my list has 3600 items, but the number of files should be 17000. Why isn't os.walk showing everything?
import os
data = []
for mdir, dirs, files in os.walk('C:/Users/Name/Documents'):
    data.append(files)
print(data)
print(len(data))
Use data.extend(files) instead of data.append(files).
files is a list of files in a directory. It looks like ["a.txt", "b.html"] and so on. If you use append, you end up with data looking like
[..., ["a.txt", "b.html"]]
whereas I suspect you're after
[..., "a.txt", "b.html"]
Using extend will provide the second behaviour.
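For example, the loop from the question with that single change:

import os

data = []
for mdir, dirs, files in os.walk('C:/Users/Name/Documents'):
    data.extend(files)      # flatten each directory's file list into one flat list

print(len(data))            # now counts files rather than directories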

How to save os.listdir output as list

I am trying to save all entries of os.listdir("./oldcsv") separately in a list but I don't know how to manipulate the output before it is processed.
What I am trying to do is generate a list containing the absolute pathnames of all *.csv files in a folder, which can later be used to easily manipulate those files' contents. I don't want to put lots of hardcoded pathnames in the script, as it is annoying and hard to read.
import os
for file in os.listdir("./oldcsv"):
    if file.endswith(".csv"):
        print(os.path.join("/oldcsv", file))
Normally I would use a loop with .append but in this case I cannot do so, since os.listdir just seems to create a "blob" of content. Probably there is an easy solution out there, but my brain won't think of it.
There's a glob module in the standard library that can solve your problem with a single function call:
import glob
csv_files = glob.glob("./*.csv") # get all .csv files from the working dir
assert isinstance(csv_files, list)
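Since the question asks for absolute pathnames, note that glob returns paths relative to the pattern you give it; you can wrap each result in os.path.abspath. A small sketch, assuming the same ./oldcsv folder:

import glob
import os

csv_files = [os.path.abspath(p) for p in glob.glob("./oldcsv/*.csv")]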

How to work with CSV files inside a zipped folder?

I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!
If you are looking to avoid extracting to disk, the zipfile docs describe ZipFile.open(), which gives you a file-like object: an object that mostly behaves like a regular file, but reads straight from the archive without writing anything to disk. Reading it gives bytes, at least in Python 3.
Something like this...
from zipfile import ZipFile
import csv
import io

with ZipFile('abc.zip') as myzip:
    print(myzip.filelist)
    for mf in myzip.filelist:
        with myzip.open(mf.filename) as myfile:
            mc = myfile.read()                          # bytes
            c = csv.reader(io.StringIO(mc.decode()))    # csv needs text, so decode and wrap
            for row in c:
                print(row)
Python's documentation is actually quite good once you have learned how to find things, along with some of the basic programming terms and descriptions it uses.
The csv module has no bytes-based reader, hence the extra step of decoding the bytes and wrapping them in io.StringIO.
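As an aside (an alternative, not part of the answer above): io.TextIOWrapper can wrap the zip member directly, so csv.reader streams the text without first reading the whole member into memory. A sketch, assuming UTF-8 encoded CSVs:

import csv
import io
from zipfile import ZipFile

with ZipFile('abc.zip') as myzip:
    for mf in myzip.filelist:
        with myzip.open(mf.filename) as myfile:
            # wrap the binary member in a text layer for the csv module
            for row in csv.reader(io.TextIOWrapper(myfile, encoding="utf-8")):
                print(row)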

Go through tar archive in memory to extract metadata?

I have several tar archives that I need to extract/read in memory. The problem is that each tar contains many ZIP archives, and each of those contains unique XML documents.
So the structure of each tar is as follows: tar -> directories -> ZIPs -> XML.
Obviously I can manually extract a single TAR, but I have about 1000 TAR archives, each about 3 GB and each containing about 6000 ZIP archives. I'm looking for a way to handle the .tar archives in memory and extract the XML data from each ZIP. Is there a way to do this?
This should be doable, since all of the relevant methods have non-disk-related options.
Lots of loops here, so let's dig in.
For each tar archive:
tarfile.open would open the tar archive. (Docs)
Call .getmembers on the resulting TarFile instance to get a list of the zips (or other files) contained in the archive. (Docs)
For each zip within the tar archive:
Once you know what member file (i.e., one of your zips) you want to look through, call .extractfile on your TarFile instance to get a file object for that zip. (Docs)
Instantiate a new zipfile.ZipFile with your file object in order to open the zip so you can work with it. (Docs)
Call .infolist on your ZipFile instance to get a list of the files it contains (including your XML files). (Docs)
For each XML file within the zip:
Call .open on your ZipFile instance in order to get a file object of one of your XML files. (Docs)
You now have a file object corresponding to one of your XML files. Do whatever you want with it: .read it, copy it to disk somewhere, stick it in an ElementTree (docs), etc.
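A minimal sketch tying those steps together (the filename archive.tar is hypothetical, and it assumes every zip member is a well-formed XML document):

import tarfile
import zipfile
import xml.etree.ElementTree as ET

with tarfile.open("archive.tar") as tar:
    for member in tar.getmembers():
        if not member.name.endswith(".zip"):
            continue
        zip_fileobj = tar.extractfile(member)         # file object; nothing hits the disk
        with zipfile.ZipFile(zip_fileobj) as zf:
            for info in zf.infolist():
                if not info.filename.endswith(".xml"):
                    continue
                with zf.open(info) as xml_file:
                    tree = ET.parse(xml_file)          # or .read() it, copy it to disk, etc.
                    print(member.name, info.filename, tree.getroot().tag)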

How to elegantly compare zip folder contents to unzipped folder contents

This is the scenario. I want to be able to back up the contents of a folder using a Python script. However, I want my backups to be stored in a zipped format, possibly bz2.
The problem comes from the fact that I don’t want to bother backing up the folder if the contents in the “current” folder are exactly the same as what is in my most recent backup.
My process will be like this:
Initiate backup
Check contents of “current” folder against what is stored in the most recent zipped backup
If same – then “complete”
If different, then run backup, then “complete”
Can anyone recommend the most reliable and simple way of completing step 2? Do I have to unzip the contents of the backup and store them in a temp directory to do a comparison, or is there a more elegant way of doing this? Possibly something to do with modified dates?
Zip files contain CRC32 checksums and you can read them with the python zipfile module: http://docs.python.org/2/library/zipfile.html. You can get a list of ZipInfo objects with CRC members from ZipFile.infolist(). There are also modification dates in the ZipInfo object.
You can compare the zip checksum with calculated checksums for the unpacked files. You need to read the unpacked files but you avoid having to decompress everything.
CRC32 is not a cryptographic checksum but it should be enough if all you need is to check for changes.
This holds for zip files. Other archive formats (like tar.bz2) might not contain such easily-accessible metadata.
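A sketch of that CRC idea, with hypothetical names (the folder being checked is current/ and the latest backup is backup.zip), assuming the zip stores paths relative to that folder; note it does not catch files added since the backup was made:

import os
import zipfile
import zlib

def folder_matches_backup(folder, backup_zip):
    with zipfile.ZipFile(backup_zip) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = os.path.join(folder, info.filename)
            if not os.path.isfile(path):
                return False                              # file removed or renamed
            with open(path, 'rb') as f:
                crc = zlib.crc32(f.read()) & 0xFFFFFFFF
            if crc != info.CRC:
                return False                              # contents changed
    return True

print(folder_matches_backup('current', 'backup.zip'))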
I use this script to create a compressed backup of a directory only when its contents have changed since the last backup.
I keep the digest of the backup file in an external .md5 file and check it to detect directory changes.
import hashlib
import tarfile
import bz2
import io
import os

def backup_dir(dirname, backup_path):
    fobj = io.BytesIO()
    t = tarfile.open(mode='w', fileobj=fobj)
    t.add(dirname)
    t.close()
    buf = fobj.getvalue()
    new_md5 = hashlib.md5(buf).hexdigest()
    if os.path.isfile(backup_path + '.md5'):
        old_md5 = open(backup_path + '.md5').read()
    else:
        old_md5 = ''
    if new_md5 != old_md5:
        open(backup_path, 'wb').write(bz2.compress(buf))
        open(backup_path + '.md5', 'w').write(new_md5)
        print('backup done!')
    else:
        print('nothing to do')
Rsync will automatically detect and copy only modified files, but since you want to bzip the result, you still need to detect whether anything has changed.
How about writing the directory listing (including timestamps) to a text file alongside your archive? The next time, diff the current directory listing against the stored one; you can grep out the differences and pass that file list to rsync so it only includes the changed files.
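A Python take on that listing idea (just a sketch; the names current/ and backup.listing are made up): record path and mtime for every file, and only bother backing up when the snapshot differs from the previous run.

import os

def snapshot(root):
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            lines.append('%s\t%s' % (path, os.path.getmtime(path)))
    return '\n'.join(sorted(lines))

current = snapshot('current')
previous = open('backup.listing').read() if os.path.exists('backup.listing') else ''
if current != previous:
    # ...create the bzipped backup here, then refresh the stored listing...
    open('backup.listing', 'w').write(current)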
You could also try the following process:
1) Initiate backup
2) Run backup
3) Compare both compressed files:
import filecmp
filecmp.cmp(Compressed_new_file, Compressed_old_file, shallow=False)  # shallow=False compares file contents, not just os.stat() signatures
4) If same – delete new backup file then "complete"
5) Else “complete”
NOTE: In case you need to check just the time between the modifications, you can have a look at this documentation
Rather than decompressing the folder and comparing individual files, I think it might be easier to compare the compressed files.
Overall I feel (OK, it's just an intuition :D) this will be better when there is a high probability that the contents of the folder change between the times you run the script.
