Is it possible to extract a single file from a tar bundle in Python?

I need to fetch a couple of files from a huge svn repo. The whole repo takes almost an hour to fetch. The files I am looking for are part of a tar bundle.
Is it possible to fetch only those two files from the tar bundle, using Python code, without extracting the whole bundle?
If so, can anybody let me know how I should go about it?

It sounds like you have two parts to your question:
Fetching a single tar bundle from the SVN repo, without the rest of the repo's files.
Using Python to extract two files from the retrieved bundle.
For the first part, I'll simply refer to this post on svn export and sparse checkouts.
For the second part, here is a solution for extracting the two files from the retrieved tarball:
import tarfile
files_i_want = ['path/to/file1','path/to/file2']
tar = tarfile.open("bundle.tar")
tar.extractall(members=[x for x in tar.getmembers() if x.name in files_i_want])
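The same idea can be wrapped in a small function and tried on a throwaway archive. This is only a sketch: extract_selected and build_sample_tar are hypothetical helper names, and the sample file names are made up for the demonstration.

```python
import os
import tarfile
import tempfile


def extract_selected(tar_path, wanted, dest):
    """Extract only the members whose names are in `wanted` into `dest`."""
    wanted = set(wanted)
    with tarfile.open(tar_path) as tar:
        members = [m for m in tar.getmembers() if m.name in wanted]
        tar.extractall(path=dest, members=members)
    return [m.name for m in members]


def build_sample_tar(directory):
    """Create a small throwaway archive with three files (demo data only)."""
    tar_path = os.path.join(directory, "bundle.tar")
    with tarfile.open(tar_path, "w") as tar:
        for name in ("a.txt", "sub/b.txt", "sub/c.txt"):
            src = os.path.join(directory, os.path.basename(name))
            with open(src, "w") as f:
                f.write("contents of " + name)
            tar.add(src, arcname=name)
    return tar_path


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    sample = build_sample_tar(workdir)
    out_dir = os.path.join(workdir, "out")
    # Only the two requested members are written below out_dir.
    print(extract_selected(sample, ["a.txt", "sub/c.txt"], out_dir))
```

Note that getmembers() still has to scan the archive's headers, so the whole tar file is read once even though only the selected files are written to disk.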

Here is one way to get a tar file from svn and extract one file from it:
import tarfile
from subprocess import check_output
# Capture the tar file from subversion
tmp = '/home/me/tempfile.tar'
with open(tmp, 'wb') as f:
    f.write(check_output(["svn", "cat", "svn://url/some.tar"]))
# Extract the file we want, saving it under dir2
with tarfile.open(tmp) as tar:
    tar.extract('dir1/fname.ext', path='dir2')
where 'dir1/fname.ext' is the full path to the file that you want within the tar archive. It will be saved in 'dir2/dir1/fname.ext'. If you omit the path argument, it will be saved in 'dir1/fname.ext' under the current directory.
The above can be understood as follows. On a normal shell command line, svn cat url tells subversion to send the file defined by url to stdout (see svn help cat for more info). url can be any type of url that svn understands such as svn://..., svn+ssh://..., or file://.... We run this command under python control using the subprocess module. To do this the svn cat url command is broken up into a list: ["svn", "cat", "url"]. The output from this svn command is saved to a local file defined by the tmp variable. We then use the tarfile module to extract the file you want.
Alternatively, you could use the extractfile method to capture the file data to a python variable:
t = tarfile.open(tmp)
handle = t.extractfile('dir1/fname.ext')
print(handle.read())  # show file contents
t.close()
According to the documentation, tarfile should accept a subprocess's stdout as a file handle. This would simplify the code and eliminate the need to save the tar file locally. However, due to a bug, Issue 10436, that will not work.
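If the tar file fits in memory, one way to avoid the temporary file is to wrap the captured bytes in an io.BytesIO buffer, which is seekable like a regular file. The following is a minimal sketch under that assumption; read_member_from_bytes is a hypothetical helper name, not part of tarfile.

```python
import io
import tarfile


def read_member_from_bytes(tar_bytes, member_name):
    """Return the contents of one member of a tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        handle = tar.extractfile(member_name)
        if handle is None:  # directories and special files carry no data
            raise ValueError(member_name + " is not a regular file")
        return handle.read()

# Usage, assuming tar_bytes came from something like
# subprocess.check_output(["svn", "cat", "svn://url/some.tar"]):
#     data = read_member_from_bytes(tar_bytes, "dir1/fname.ext")
```

The trade-off is memory: the whole archive is held in RAM, so for a very large bundle the temporary file approach may still be preferable.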

Perhaps you want something like this?
#!/usr/local/cpython-3.3/bin/python
import tarfile as tarfile_mod
def main():
    tarfile = tarfile_mod.TarFile('tar-archive.tar', 'r')
    if False:
        # read the member into memory
        file_ = tarfile.extractfile('etc/protocols')
        print(file_.read())
    else:
        # write the member to disk under the current directory
        tarfile.extract('etc/protocols')
    tarfile.close()
main()

Related

How to check that a script has not been modified - tried with git attribute ident $Id$

I am maintaining a collection of python scripts that are distributed on several computers. Users might have the fancy idea to modify the scripts so I am looking for an automatic solution to check the script integrity.
I wanted to use git attribute ident so that the file contains its own sha1 and then use git hash-object to compare.
It looks like this (.gitattributes contains *.py ident):
import subprocess
gitId= '$Id: 98a648abdf1cd8d563c72886a601857c20670013 $' #this sha will be updated automatically at each commit on the file.
gitId=gitId[5:-2]
shaCheck=subprocess.check_output(['git', 'hash-object', __file__]).strip().decode('UTF-8')
if shaCheck != gitId:
    print('file has been corrupted \n {} <> {}'.format(shaCheck, gitId))
# below the actual purpose of the script
This works fine when my script lies inside the git repository, but git hash-object returns a different sha when the script is outside of my git repository. I guess there is some git filter issue, but I do not know how to get around it.
Any other painless way to check my file's integrity is also welcome.
You could check the file's hash with the Python module hashlib:
import hashlib
filename_1 = "./folder1/test_script.py"
with open(filename_1, "rb") as f:
    data = f.read()  # read entire file as bytes
    readable_hash = hashlib.sha256(data).hexdigest()
    print(readable_hash)
filename_2 = "./folder2/test_script.py"
with open(filename_2, "rb") as f:
    data = f.read()  # read entire file as bytes
    readable_hash = hashlib.sha256(data).hexdigest()
    print(readable_hash)
Output:
a0c22dc5d16db10ca0e3d99859ffccb2b4d536b21d6788cfbe2d2cfac60e8117
a0c22dc5d16db10ca0e3d99859ffccb2b4d536b21d6788cfbe2d2cfac60e8117
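To turn this into an integrity check, the digest of the distributed script can be compared against a known-good value recorded at release time. The sketch below assumes that value is stored somewhere the user cannot edit along with the script; sha256_of and is_unmodified are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path, expected_hex):
    """Compare the file's digest with a known-good hex digest."""
    return sha256_of(path) == expected_hex.lower()
```

A script cannot contain its own hash, since writing the hash into the file changes the hash. The expected value therefore has to live outside the file being checked.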

How to get information of .jar file in python-magic

I have a folder full of jar, html, css, and exe files. How can I check the type of each file?
I already ran the "file" command on *NIX and tried python-magic, but for a jar the result always looks like this:
test : Zip archive data, at least v1.0 to extract
How can I get a more specific result, such as test : jar, using only the magic number?
While not required, most JAR files have a META-INF/MANIFEST.MF file contained within them. You could check for the existence of this file, after checking if it's a zip file:
import zipfile
def zipFileContains(zipFileName, pathName):
    f = zipfile.ZipFile(zipFileName, "r")
    # True if any entry in the archive starts with the given path
    result = any(x.startswith(pathName.rstrip("/")) for x in f.namelist())
    f.close()
    return result
print(zipFileContains("test.jar", "META-INF/MANIFEST.MF"))
However, it might be better to just check if it's a zip file that ends in .jar.
Magic alone won't do it for you, since a JAR is literally just a zip file. Read more about the format here.
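The two checks mentioned above (manifest present, or a zip file whose name ends in .jar) can be combined into one heuristic. This is a sketch only; classify_archive is a hypothetical name, and the rule it applies is a guess about what counts as a JAR, not a definitive test.

```python
import zipfile


def classify_archive(path):
    """Return 'jar', 'zip', or None, using zip contents plus the extension."""
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as zf:
        has_manifest = "META-INF/MANIFEST.MF" in zf.namelist()
    # Heuristic: a manifest or a .jar extension is taken to mean "jar".
    if has_manifest or path.lower().endswith(".jar"):
        return "jar"
    return "zip"
```

Other zip-based formats can also carry a META-INF/MANIFEST.MF entry, so a 'jar' result here means "probably a JAR" rather than a guarantee.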

How to loop through the list of .tar.gz files using linux command in python

Using python 2.7
I have a list of *.tar.gz files on a linux box. Using python, I want to loop through the files and extract them to a different location, each under its respective folder.
For example: if my file name is ~/TargetData/zip/1440198002317590001.tar.gz
then I want to untar and ungzip this file in a different location under its
respective folder name i.e. ~/TargetData/unzip/1440198002317590001.
I have written some code but I am not able to loop through the files. On the command line I am able to untar using the $ tar -czf 1440198002317590001.tar.gz 1440198002317590001 command. But I want to be able to loop through the .tar.gz files. The code is shown below. Here, I’m not able to loop over just the files or print only the files. Can you please help?
import os
inF = []
inF = str(os.system('ls ~/TargetData/zip/*.tar.gz'))
#print(inF)
if inF is not None:
    for files in inF[:-1]:
        print files
        """
        os.system('tar -czf files /unzip/files[:-7]')
        # This is what i am expecting here files = "1440198002317590001.tar.gz" and files[:-7]= "1440198002317590001"
        """
Have you ever worked on this type of use case? Your help is greatly appreciated!! Thank you!
I think you misunderstood os.system(). It does run the command, but its return value is not what you expected: it returns the exit status (0 on success), not the command's output, so you cannot assign the output to a variable that way. Consider the subprocess module instead (see its documentation). However, I do NOT recommend listing files that way either, because it gives you a single string instead of a list (see the documentation for the details).
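The difference between an exit status and captured output can be seen in a short sketch. To stay portable, the command below is a stand-in for ls: it starts a second Python interpreter that prints two made-up file names.

```python
import subprocess
import sys

# Stand-in for a command such as `ls`; it prints two file names.
command = [sys.executable, "-c", "print('a.tar.gz'); print('b.tar.gz')"]

# Like os.system, call() returns only the exit status;
# the command's output goes straight to the terminal.
exit_status = subprocess.call(command)

# check_output() returns what the command printed, as bytes.
output = subprocess.check_output(command)

# The output is one string; split it to get a list of names.
file_names = output.decode().splitlines()
print(exit_status, file_names)
```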
The best way, I think, is the glob module (see its documentation). glob.glob(pattern) returns a list of all files that match the pattern, which you can then loop over easily.
Of course, if you are familiar with the os module, you can also use os.listdir(), os.path.join(), or even os.path.expanduser() to do this. (Unlike glob, os.listdir() returns only file names without the full path, so you need to reconstruct each file path.)
By the way, for your purpose here there is no need to declare an empty list first (i.e. inF = []).
For the extraction part, you can use os.system, but I recommend the subprocess module instead; you will find the reasons in the subprocess documentation.
Do NOT read the following code until you have really tried to solve this yourself.
import glob
import os
import subprocess
# glob does not expand '~', so expand it explicitly
pattern = os.path.expanduser('~/TargetData/zip/*.tar.gz')
for archive in glob.glob(pattern):
    # .../zip/1440198002317590001.tar.gz -> 1440198002317590001
    name = os.path.basename(archive)[:-len('.tar.gz')]
    # extract into .../unzip/<name>, next to the zip directory
    target = os.path.join(os.path.dirname(os.path.dirname(archive)), 'unzip', name)
    # tar -C needs an existing directory, so create it first
    if not os.path.isdir(target):
        os.makedirs(target)
    # each argument must be its own list element; -x extracts, -c would create
    subprocess.call(['tar', '-xzf', archive, '-C', target])
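The same loop can also be written without calling the tar command at all, using the tarfile module. This is a sketch, not the answer's method: extract_all_archives is a hypothetical name, and it relies on os.makedirs(..., exist_ok=True), which needs Python 3 (the question uses 2.7).

```python
import glob
import os
import tarfile


def extract_all_archives(zip_dir, unzip_dir):
    """Extract every *.tar.gz in zip_dir into unzip_dir/<archive name>."""
    extracted = []
    for archive in sorted(glob.glob(os.path.join(zip_dir, "*.tar.gz"))):
        # 1440198002317590001.tar.gz -> 1440198002317590001
        name = os.path.basename(archive)[:-len(".tar.gz")]
        target = os.path.join(unzip_dir, name)
        os.makedirs(target, exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=target)
        extracted.append(target)
    return extracted
```

For the layout in the question, it could be called as extract_all_archives(os.path.expanduser('~/TargetData/zip'), os.path.expanduser('~/TargetData/unzip')). As with any use of extractall, archives from untrusted sources deserve extra care.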

Reading gzipped data in Python

I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?
The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
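import os
import tarfile
subset_path = r'c:\data\grant\files'  # raw string, so '\f' is not read as a form feed
tar = tarfile.open("subset_full.tar.gz")
tar.extractall(subset_path)
tar.close()
# assumes the archive unpacks into a top-level 'subset_full' directory
subset_data_path = os.path.join(subset_path, 'subset_full')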
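Before extracting, it can help to look inside the archive to see which top-level directory (if any) it contains, and to read the text files directly. A minimal sketch, with hypothetical helper names list_regular_files and read_text_member:

```python
import tarfile


def list_regular_files(archive_path):
    """Return the names of the regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]


def read_text_member(archive_path, member_name, encoding="utf-8"):
    """Read one text file from the archive without unpacking it to disk."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.extractfile(member_name).read().decode(encoding)
```

The names returned by list_regular_files show the paths that extractall would create, which tells you what to pass to os.path.join afterwards.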

Tar function in Python

I am trying to tar a folder using the following code.
make_tarfile('logs_' + str(datetime.datetime.now()),logFolder)
def make_tarfile(output_filename, source_dir):
    with closing(tarfile.open(output_filename, "w:gz")) as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
I don't see any tar file created, though. Could anyone please correct my code?
Thanks in advance.
Your code looks usable; you should find the archive created.
Be aware that the archive file name will be exactly the name you pass to tarfile.open. You have to include the extension .tar.gz yourself if you want it to appear in the archive's name.
The with closing wrapper is not necessary in Python 2.7+. You may use with tarfile.open(output_filename, "w:gz") as tar: directly, since an open tarfile is a proper context manager.
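Putting those points together, a variant of make_tarfile might look like the sketch below. The output_dir parameter and the strftime pattern are assumptions added for illustration. The pattern is used because str(datetime.datetime.now()) contains a space and colons, which are awkward in file names.

```python
import datetime
import os
import tarfile


def make_tarfile(source_dir, output_dir="."):
    """Archive source_dir as logs_<timestamp>.tar.gz and return its path."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    output_filename = os.path.join(output_dir, "logs_" + stamp + ".tar.gz")
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    return os.path.abspath(output_filename)
```

Returning the absolute path makes it easier to see where the archive ended up, since a relative name is created in whatever the current working directory happens to be.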
