How to get information of .jar file in python-magic - python

I have a folder full of jar, html, css, and exe files. How can I identify the type of each file?
I have already run the "file" command on *NIX and tried python-magic, but every result looks like this:
test : Zip archive data, at least v1.0 to extract
How can I get a specific result such as "test : jar" using only the magic number?

While not required, most JAR files have a META-INF/MANIFEST.MF file contained within them. You could check for the existence of this file, after checking if it's a zip file:
import zipfile
def zipFileContains(zipFileName, pathName):
    f = zipfile.ZipFile(zipFileName, "r")
    # True if any archive member starts with the given path
    result = any(x.startswith(pathName.rstrip("/")) for x in f.namelist())
    f.close()
    return result
print(zipFileContains("test.jar", "META-INF/MANIFEST.MF"))
However, it might be better to just check if it's a zip file that ends in .jar.
Magic alone won't do it for you, since a JAR is literally just a zip file. Read more about the format here.
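The checks described above can be combined into one helper. This is a minimal sketch, not a definitive test: the name looks_like_jar is hypothetical, and any zip archive renamed to .jar would still pass the extension check.

```python
import zipfile


def looks_like_jar(path):
    """Heuristically decide whether path is a JAR (hypothetical helper)."""
    # A JAR must be a valid zip archive first.
    if not zipfile.is_zipfile(path):
        return False
    # Cheap check: the conventional file extension.
    if path.lower().endswith(".jar"):
        return True
    # Fallback: most (not all) JARs carry a manifest.
    with zipfile.ZipFile(path) as archive:
        return "META-INF/MANIFEST.MF" in archive.namelist()
```

The manifest lookup only runs when the extension does not match, so ordinary zip files without a manifest are reported as non-JAR.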

Related

Is it possible to download just part of a ZIP file using python zipfile library

I was wondering whether there is any way to download only part of a .rar or .zip file without downloading the whole file. Say a zip file contains files A, B, C and D, and I only need A. Can I use the zipfile module to download just that one file?
I am trying the code below:
import zipfile
from io import BytesIO
r = c.get(file)
z = zipfile.ZipFile(BytesIO(r.content))
for file1 in z.namelist():
    if 'time' not in file1:
        print("hi")
        z.extract(file1, download_path + filename)
This code downloads the whole zip file and only then extracts the specific file. Can I somehow download only the file I need?
There is a similar question here, but it only shows a command-line approach on Linux. It doesn't address how this can be done using Python libraries.
The question @Juggernaut mentioned in a comment is actually very helpful, as it points you toward the solution.
You need to create a replacement for BytesIO that returns the necessary information to ZipFile. You will need to get the length of the file, and then fetch whatever sections ZipFile asks for.
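One way to picture such a replacement is a small seekable file object that fetches bytes only when ZipFile reads them. This is a sketch under stated assumptions, not a finished implementation: RangeReader and fetch_range are hypothetical names, and fetch_range(start, end) stands for whatever returns bytes [start, end) of the remote file, for example an HTTP request with a Range header on a server that supports it.

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Seekable read-only file object backed by a range-fetching callable.

    fetch_range(start, end) is a hypothetical callable that returns the
    bytes [start, end) of the remote file.
    """

    def __init__(self, size, fetch_range):
        self._size = size          # total length of the remote file
        self._fetch = fetch_range
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError("invalid whence")
        if new_pos < 0:
            # Mimic a real file: seeking before the start is an error.
            raise OSError("negative seek position")
        self._pos = new_pos
        return self._pos

    def read(self, n=-1):
        # n < 0 means "read to the end of the file".
        if n is None or n < 0:
            end = self._size
        else:
            end = min(self._pos + n, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        return data
```

Passing RangeReader(size, fetch_range) to zipfile.ZipFile lets the module read the central directory at the end of the archive and then only the byte ranges of the members you ask for. The remotezip package mentioned below packages the same idea.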
How large are those files? Is it really worth the trouble?
Use remotezip: https://github.com/gtsystem/python-remotezip. You can install it using pip:
pip install remotezip
Usage example:
from remotezip import RemoteZip
with RemoteZip("https://path/to/zip/file.zip") as zip_file:
    for file in zip_file.namelist():
        if 'time' not in file:
            print("hi")
            zip_file.extract(file, path="/path/to/extract")
Note that to use this approach, the web server from which you receive the file needs to support the Range header.

How to loop through the list of .tar.gz files using linux command in python

Using python 2.7
I have a list of *.tar.gz files on a Linux box. Using Python, I want to loop through the files and extract each one to a different location, under its own folder.
For example, if my file name is ~/TargetData/zip/1440198002317590001.tar.gz, then I want to untar and ungzip this file in a different location under its respective folder name, i.e. ~/TargetData/unzip/1440198002317590001.
I have written some code but I am not able to loop through the files. On the command line I am able to untar using the $ tar -czf 1440198002317590001.tar.gz 1440198002317590001 command. But I want to be able to loop through the .tar.gz files. The code is below. With it, I am not able to loop over just the files or print only the files. Can you please help?
import os
inF = []
inF = str(os.system('ls ~/TargetData/zip/*.tar.gz'))
#print(inF)
if inF is not None:
    for files in inF[:-1]:
        print files
        """
        os.system('tar -czf files /unzip/files[:-7]')
        # This is what i am expecting here files = "1440198002317590001.tar.gz" and files[:-7]= "1440198002317590001"
        """
Have you ever worked on this type of use case? Your help is greatly appreciated!! Thank you!
I think you misunderstood what os.system() does. It runs the command, but its return value is not what you expected: it returns the exit status (0 on success), not the command's output, so you cannot assign the output to a variable this way. You may consider the subprocess module instead. However, I do not recommend listing files that way (the output comes back as a string rather than a list; see the docs for the details).
The best way, I think, is the glob module. glob.glob(pattern) returns a list of all files that match the pattern, which you can then loop over easily.
Of course, if you are familiar with the os module, you can also use os.listdir(), os.path.join(), or even os.path.expanduser() to do this. (Unlike glob, os.listdir() returns only filenames without the full path, so you need to reconstruct the file path.)
By the way, for your purpose here there is no need to declare an empty list first (i.e. inF = []).
For the unzip part, you can do it with os.system, but I recommend the subprocess module instead; you will find the reasons in the subprocess docs.
Do not read the following code until you have really tried to solve this yourself.
import glob
import os
import subprocess
# glob does not expand "~", so expand it before matching
inF = glob.glob(os.path.expanduser('~/TargetData/zip/*.tar.gz'))
for files in inF:
    # .../zip/NAME.tar.gz -> .../unzip/NAME
    unzip_name = files.replace('zip', 'unzip')[:-7]
    # make sure the target directory exists, otherwise create it
    if not os.path.exists(unzip_name):
        os.makedirs(unzip_name)
    # pass each argument as its own list element
    subprocess.call(['tar', '-xzf', files, '-C', unzip_name])
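The same loop can also be written without shelling out to tar, using the standard tarfile module. This is a minimal sketch, assuming the zip/unzip directory layout from the question; the function name extract_all is hypothetical.

```python
import glob
import os
import tarfile


def extract_all(zip_dir, unzip_dir):
    """Extract every *.tar.gz in zip_dir into unzip_dir/<archive name>."""
    extracted = []
    for archive in sorted(glob.glob(os.path.join(zip_dir, "*.tar.gz"))):
        # "1440198002317590001.tar.gz" -> "1440198002317590001"
        name = os.path.basename(archive)[:-len(".tar.gz")]
        target = os.path.join(unzip_dir, name)
        if not os.path.isdir(target):
            os.makedirs(target)
        # "r:gz" reads a gzip-compressed tar archive.
        # Only extract archives you trust: members can carry unsafe paths.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        extracted.append(target)
    return extracted
```

It could be called as extract_all(os.path.expanduser('~/TargetData/zip'), os.path.expanduser('~/TargetData/unzip')), since "~" is not expanded automatically.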

Reading gzipped data in Python

I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?
The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
import os
import tarfile
# raw string, so "\f" is not read as a form-feed escape
subset_path = r'c:\data\grant\files'
tar = tarfile.open("subset_full.tar.gz")
tar.extractall(subset_path)
tar.close()
subset_data_path = os.path.join(subset_path, 'subset_full')
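If the goal is only to read the contents, the members can also be read straight from the archive without unpacking anything to disk. This is a minimal sketch: read_text_members is a hypothetical name, and the ".txt" filter is an assumption for illustration (the h5 files would need their own reader).

```python
import tarfile


def read_text_members(archive_path, suffix=".txt"):
    """Return {member name: bytes} for regular files ending in suffix."""
    contents = {}
    # Mode "r:gz" untars and decompresses in one step.
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(suffix):
                # extractfile returns a file-like object; nothing is
                # written to the filesystem.
                handle = tar.extractfile(member)
                contents[member.name] = handle.read()
    return contents
```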

Is it possible to extract single file from tar bundle in python

I need to fetch a couple of files from a huge svn repo. The whole repo takes almost an hour to fetch. The files I am looking for are part of a tar bundle.
Is it possible to fetch only those two files from tar bundle without extracting the whole bundle through Python Code?
If so, can anybody let me know how should I go about it?
It sounds like you have two parts to your question:
1. Fetching a single tar bundle from the SVN repo, without the rest of the repo's files.
2. Using Python to extract two files from the retrieved bundle.
For the first part, I'll simply refer to this post on svn export and sparse checkouts.
For the second part, here is a solution for extracting the two files from the retrieved tarball:
import tarfile
files_i_want = ['path/to/file1','path/to/file2']
tar = tarfile.open("bundle.tar")
tar.extractall(members=[x for x in tar.getmembers() if x.name in files_i_want])
Here is one way to get a tar file from svn and extract one file from it:
import tarfile
from subprocess import check_output
# Capture the tar file from subversion
tmp = '/home/me/tempfile.tar'
with open(tmp, 'wb') as out:
    out.write(check_output(["svn", "cat", "svn://url/some.tar"]))
# Extract the file we want, saving it under dir2
tarfile.open(tmp).extract('dir1/fname.ext', path='dir2')
where 'dir1/fname.ext' is the full path to the file that you want within the tar archive. It will be saved in 'dir2/dir1/fname.ext'. If you omit the path argument, it will be saved in 'dir1/fname.ext' under the current directory.
The above can be understood as follows. On a normal shell command line, svn cat url tells subversion to send the file defined by url to stdout (see svn help cat for more info). url can be any type of url that svn understands such as svn://..., svn+ssh://..., or file://.... We run this command under python control using the subprocess module. To do this the svn cat url command is broken up into a list: ["svn", "cat", "url"]. The output from this svn command is saved to a local file defined by the tmp variable. We then use the tarfile module to extract the file you want.
Alternatively, you could use the extractfile method to capture the file data to a python variable:
t = tarfile.open(tmp)
handle = t.extractfile('dir1/fname.ext')
print(handle.readlines())  # show file contents
According to the documentation, tarfile should accept a subprocess's stdout as a file handle. This would simplify the code and eliminate the need to save the tar file locally. However, due to a bug, Issue 10436, that will not work.
Perhaps you want something like this?
#!/usr/local/cpython-3.3/bin/python
import tarfile as tarfile_mod
def main():
    tarfile = tarfile_mod.TarFile('tar-archive.tar', 'r')
    if False:
        file_ = tarfile.extractfile('etc/protocols')
        print(file_.read())
    else:
        tarfile.extract('etc/protocols')
    tarfile.close()
main()

Opening .out files in Python

Am I right in thinking Python cannot open and read from .out files?
My application currently spits out a bunch of .out files that would otherwise be read manually for logging purposes. I'm building a Python script to automate this.
When the script gets to the following:
for file in os.listdir(DIR_NAME):
    if (file.endswith('.out')):
        open(file)
The script blows up with the following error: "IOError : No such file or directory: 'Filename.out' "
I have a similar function using the same code that works fine, except that it reads .err files. Printing DIR_NAME before the code above also shows that the correct directory is being pointed to.
os.listdir() returns only filenames, not full paths. Use os.path.join() to create a full path:
for file in os.listdir(DIR_NAME):
    if (file.endswith('.out')):
        open(os.path.join(DIR_NAME, file))
As an alternative that I find a bit easier and more flexible to use:
import glob, os
for outfile in glob.glob(os.path.join(DIR_NAME, '*.out')):
    open(outfile)
Glob will also accept things like '*/*.out' or '*something*.out'. I also read files of certain types and have found this to be very handy.
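To make the difference concrete, here is a small sketch that reads every .out file in a directory. The name read_out_files is hypothetical. Because the glob pattern is built with the directory in it, the returned paths already include that directory and can be opened as they are.

```python
import glob
import os


def read_out_files(dir_name):
    """Return {path: text} for every *.out file directly in dir_name."""
    results = {}
    for path in sorted(glob.glob(os.path.join(dir_name, "*.out"))):
        # path already contains dir_name, so open() finds the file
        with open(path) as handle:
            results[path] = handle.read()
    return results
```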
