I have a 25 GB text file, so I compressed it to tar.gz and it became 450 MB. Now I want to read that file from Python and process the text data. For this I referred to a question, but in my case the code doesn't work. The code is as follows:
import tarfile
import numpy as np

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    content = f.read()
    Data = np.loadtxt(content)
The error is as follows:
Traceback (most recent call last):
File "dataExtPlot.py", line 21, in <module>
content = f.read()
AttributeError: 'NoneType' object has no attribute 'read'
Also, is there any other method to do this task?
The docs tell us that None is returned by extractfile() if the member is not a regular file or link.
One possible solution is to skip over the None results:
tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f is not None:
        content = f.read()
tarfile.extractfile() can return None if the member is neither a file nor a link. For example, your tar archive might contain directories or device files. To fix this:
import tarfile
import numpy as np

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f:
        content = f.read()
        Data = np.loadtxt(content)
You may try this one
t = tarfile.open("filename.gz", "r")
for filename in t.getnames():
    try:
        f = t.extractfile(filename)
        Data = f.read()
        print filename, ':', Data
    except:
        print 'ERROR: Did not find %s in tar archive' % filename
My needs:
Python3.
My tar.gz file consists of multiple UTF-8 text files and directories.
Need to read text lines from all files.
Problems:
The file object returned by tar.extractfile(member) may be None (e.g. for a directory).
The content extractfile(fname) returns is a bytes string (e.g. b'Hello\t\xe4\xbd\xa0\xe5\xa5\xbd'), so Unicode characters don't display correctly.
Solutions:
Check the type of each member first. I referenced the example in the tarfile library docs (search "How to read a gzip compressed tar archive and display some member information").
Decode the bytes string into a normal str (ref: the most voted answer).
Code:
import logging
import tarfile

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

with tarfile.open("sample.tar.gz", "r:gz") as tar:
    for tarinfo in tar:
        logger.info(f"{tarinfo.name} is {tarinfo.size} bytes in size and is: ")
        if tarinfo.isreg():
            logger.info(f"Is regular file: {tarinfo.name}")
            f = tar.extractfile(tarinfo.name)
            # To get str instead of bytes,
            # decode with the proper encoding, e.g. utf-8
            content = f.read().decode('utf-8', errors='ignore')
            # Split the long str into lines
            # using your line separator, e.g. \n
            lines = content.split('\n')
            for i, line in enumerate(lines):
                print(f"[{i}]: {line}\n")
        elif tarinfo.isdir():
            logger.info(f"Is dir: {tarinfo.name}")
        else:
            logger.info(f"Is something else: {tarinfo.name}.")
You cannot "read" the content of some special files such as links yet tar supports them and tarfile will extract them alright. When tarfile extracts them, it does not return a file-like object but None. And you get an error because your tarball contains such a special file.
One approach is to determine the type of an entry in a tarball you are processing ahead of extracting it: with this information at hand you can decide whether or not you can "read" the file. You can achieve this by calling tarfile.getmembers() returns tarfile.TarInfos that contain detailed information about the type of file contained in the tarball.
The tarfile.TarInfo class has all the attributes and methods you need to determine the type of tar member such as isfile() or isdir() or tinfo.islnk() or tinfo.issym() and then accordingly decide what do to with each member (extract or not, etc).
For instance I use these to test the type of file in this patched tarfile to skip extracting special files and process links in a special way:
for tinfo in tar.getmembers():
    is_special = not (tinfo.isfile() or tinfo.isdir()
                      or tinfo.islnk() or tinfo.issym())
    ...
In a Jupyter notebook you can do it like below:
!wget -c http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -O - | tar -xz
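If you would rather do the same thing in plain Python, here is a minimal sketch of the idea: stream the archive over HTTP and extract it on the fly, without saving the .tar.gz first. The URL is the one from the line above; the output directory name is just an example.
import tarfile
import urllib.request

# Stream the gzip'd tar over HTTP; "r|gz" reads it sequentially,
# so the response object only needs to support read().
url = "http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz"
with urllib.request.urlopen(url) as resp:
    with tarfile.open(fileobj=resp, mode="r|gz") as tar:
        tar.extractall("20news-bydate")  # example output directory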
Related
I added sections and their values to an ini file, but configparser won't print the sections I have. What I've done:
import configparser
import os

# creating path
current_path = os.getcwd()
path = 'ini'
try:
    os.mkdir(path)
except OSError:
    print("Creation of the directory %s failed" % path)

# add section and its values
config = configparser.ConfigParser()
config['section-1'] = {'somekey': 'somevalue'}
file = open(f'ini/inifile.ini', 'a')
with file as f:
    config.write(f)
file.close()

# get sections
config = configparser.ConfigParser()
file = open(f'ini/inifile.ini')
with file as f:
    config.read(f)
print(config.sections())
file.close()
returns
[]
Similar code was in the documentation, but it doesn't work. What am I doing wrong and how can I solve this problem?
From the docs, config.read() takes in a filename (or list of them), not a file descriptor object:
read(filenames, encoding=None)
Attempt to read and parse an iterable of filenames, returning a list of filenames which were successfully parsed.
If filenames is a string, a bytes object or a path-like object, it is treated as a single filename. ...
If none of the named files exist, the ConfigParser instance will contain an empty dataset. ...
A file object is an iterable of strings, so basically the config parser is trying to read each string in the file as a filename, which is sort of interesting and silly, because if you passed it a file that contained the filename of your actual config... it would work.
Anyways, you should pass the filename directly to config.read(), i.e.
config.read("ini/inifile.ini")
Or, if you want to use a file descriptor object instead, simply use config.read_file(f). Read the docs for read_file() for more information.
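For example, a small sketch of the read_file() route, reusing the ini/inifile.ini path from the question:
import configparser

config = configparser.ConfigParser()
with open('ini/inifile.ini') as f:
    config.read_file(f)   # read_file() takes a file object, read() takes filenames
print(config.sections())  # should now print ['section-1']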
As an aside, you are duplicating some of the work the context manager is doing for no gain. You can use the with block without creating the object explicitly first or closing it after (it will get closed automatically). Keep it simple:
with open("path/to/file.txt") as f:
do_stuff_with_file(f)
I would like to read some files from a tarball and save them to a new tarball.
This is the code I wrote.
archive = 'dum/2164/archive.tar'
# Read input data.
input_tar = tarfile.open(archive, 'r|')
tarinfo = input_tar.next()
input_tar.close()
# Write output file.
output_tar = tarfile.open('foo.tar', 'w|')
output_tar.addfile(tarinfo)
output_tar.close()
Unfortunately, the output tarball is no good:
$ tar tf foo.tar
./1QZP_A--2JED_A--not_reformatted.dat.bz2
tar: Truncated input file (needed 1548288 bytes, only 1545728 available)
tar: Error exit delayed from previous errors.
Any clue how to read and write tarballs on the fly with Python?
OK, so this is how I managed to do it: addfile() needs the member's file object as well as its TarInfo, otherwise only the header is written, which is why the output archive was truncated.
archive = 'dum/2164/archive.tar'
# Read input data.
input_tar = tarfile.open(archive, 'r|')
tarinfo = input_tar.next()
fileobj = input_tar.extractfile(tarinfo)
# Write output file.
output_tar = tarfile.open('foo.tar', 'w|')
output_tar.addfile(tarinfo, fileobj)
input_tar.close()
output_tar.close()
When I use zipfile.ZipFile.writestr, the file contains the correct number of characters afterwards, but all of them are null bytes.
Minimal example:
import zipfile
z=zipfile.ZipFile("test.zip", "w")
z.writestr("foo", "test")
z.close()
The resulting test.zip has a file "foo" inside, which contains 4 null bytes.
Got the same problem, and it seems ZipInfo is the obvious workaround.
import zipfile, os

name = 'foo.txt'
data = b'This is a test text.'
open(name, 'wb').write(data)
zipfile.ZipFile('write.zip', 'w').write(name)              # OK for Ark
zipfile.ZipFile('writestr.zip', 'w').writestr(name, data)  # nulls by Ark

wrt_attr = zipfile.ZipFile('write.zip').getinfo(name)
wrts_attr = zipfile.ZipFile('writestr.zip').getinfo(name)

os.remove(name)
os.remove('write.zip')
os.remove('writestr.zip')

for attr in wrt_attr.__slots__:
    if getattr(wrt_attr, attr) != getattr(wrts_attr, attr):
        print(attr, getattr(wrt_attr, attr), getattr(wrts_attr, attr))

attr = 'external_attr'
print(oct(getattr(wrt_attr, attr) >> 16), oct(getattr(wrts_attr, attr) >> 16))
The ZIP spec says external_attr should be set to zero if the content came from stdin. However, writestr constructs an invalid external_attr when the first argument is a str.
It could be
0o100xxx (regular file with umasked permission)
or
zero (as the spec)
but not
0oxxx (file type absent)
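As a workaround, you can build the ZipInfo yourself and set external_attr explicitly before calling writestr. This is just a sketch of that idea; the archive name and the permission bits are example choices, not anything mandated by zipfile:
import zipfile

info = zipfile.ZipInfo('foo.txt')      # date_time defaults to 1980-01-01
info.external_attr = 0o100644 << 16    # regular file, rw-r--r-- (example permissions)
with zipfile.ZipFile('writestr_fixed.zip', 'w') as z:
    z.writestr(info, b'This is a test text.')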
This seems not to be a Python problem, as it is only "ark" which cannot open this file. On the other hand, it seems to be encoded in some way that ark cannot read (while other unzip programs can).
Hello. My error is produced when generating a zip file. Can you tell me what I should do?
main.py", line 2289, in get
buf=zipf.read(2048)
NameError: global name 'zipf' is not defined
The complete code is as follows:
def addFile(self, zipstream, url, fname):
    # get the contents
    result = urlfetch.fetch(url)
    # store the contents in a stream
    f = StringIO.StringIO(result.content)
    length = result.headers['Content-Length']
    f.seek(0)
    # write the contents to the zip file
    while True:
        buff = f.read(int(length))
        if buff == "": break
        zipstream.writestr(fname, buff)
    return zipstream

def get(self):
    self.response.headers["Cache-Control"] = "public,max-age=%s" % 86400
    start = datetime.datetime.now() - timedelta(days=20)
    count = int(self.request.get('count')) if not self.request.get('count') == '' else 1000
    from google.appengine.api import memcache
    memcache_key = "ads"
    data = memcache.get(memcache_key)
    if data is None:
        a = Ad.all().filter("modified >", start).filter("url IN", ['www.koolbusiness.com']).filter("published =", True).order("-modified").fetch(count)
        memcache.set("ads", a)
    else:
        a = data
    dispatch = 'templates/kml.html'
    template_values = {'a': a, 'request': self.request, }
    path = os.path.join(os.path.dirname(__file__), dispatch)
    output = template.render(path, template_values)
    self.response.headers['Content-Length'] = len(output)
    zipstream = StringIO.StringIO()
    file = zipfile.ZipFile(zipstream, "w")
    url = 'http://www.koolbusiness.com/list.kml'
    # repeat this for every URL that should be added to the zipfile
    file = self.addFile(file, url, "list.kml")
    # we have finished with the zip so package it up and write the directory
    file.close()
    zipstream.seek(0)
    # create and return the output stream
    self.response.headers['Content-Type'] = 'application/zip'
    self.response.headers['Content-Disposition'] = 'attachment; filename="list.kmz"'
    while True:
        buf = zipf.read(2048)
        if buf == "": break
        self.response.out.write(buf)
That is probably zipstream and not zipf. So replace that with zipstream and it might work.
I don't see where you declare zipf. zipfile? Senthil Kumaran is probably right with zipstream, since you seek(0) on zipstream before the while loop to read chunks of the mystery variable.
edit:
Almost certainly the variable is zipstream.
zipfile docs:
class zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])
Open a ZIP file, where file can be either a path to a file (a string) or a file-like object. The mode parameter should be 'r' to read an existing file, 'w' to truncate and write a new file, or 'a' to append to an existing file. If mode is 'a' and file refers to an existing ZIP file, then additional files are added to it. If file does not refer to a ZIP file, then a new ZIP archive is appended to the file. This is meant for adding a ZIP archive to another file (such as python.exe).
your code:
zipstream = StringIO.StringIO()
creates a file-like object using StringIO, which is essentially a "memory file"; read more in the docs
file = zipfile.ZipFile(zipstream, "w")
opens the zipfile with the zipstream file-like object in 'w' mode
url = 'http://www.koolbusiness.com/list.kml'
# repeat this for every URL that should be added to the zipfile
file =self.addFile(file,url,"list.kml")
# we have finished with the zip so package it up and write the directory
file.close()
uses the addFile method to retrieve and write the retrieved data to the file-like object, and returns it. The variables are slightly confusing, because you pass the zipfile to the addFile method where it is aliased as zipstream (confusing because we are also using zipstream as the StringIO file-like object). Anyway, the zipfile is returned and closed to make sure everything is "written".
It was written to our "memory file", which we now seek to index 0
zipstream.seek(0)
and after doing some header stuff, we finally reach the while loop that will read our "memory-file" in chunks
while True:
    buf = zipstream.read(2048)
    if buf == "": break
    self.response.out.write(buf)
You need to declare:
global zipf
right after your
def get(self):
line. You are modifying a global variable, and this is the only way Python knows what you are doing.
I have a tar file which has a number of files within it.
I need to write a Python script which will read the contents of the files and give the count of total characters, including the total number of letters, spaces, newline characters, everything, without untarring the tar file.
You can use getmembers():
>>> import tarfile
>>> tar = tarfile.open("test.tar")
>>> tar.getmembers()
After that, you can use extractfile() to extract the members as file objects. Just an example:
import tarfile, os
import sys

os.chdir("/tmp/foo")
tar = tarfile.open("test.tar")
for member in tar.getmembers():
    f = tar.extractfile(member)
    content = f.read()
    print "%s has %d newlines" % (member, content.count("\n"))
    print "%s has %d spaces" % (member, content.count(" "))
    print "%s has %d characters" % (member, len(content))
    sys.exit()
tar.close()
With the file object f in the above example, you can use read(), readlines() etc.
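Since the question asks for totals over the whole archive, here is a rough Python 3 variant of the loop above that accumulates the counts across all members instead of exiting after the first one (the archive name is a placeholder):
import tarfile

total_chars = total_spaces = total_newlines = 0
with tarfile.open("test.tar") as tar:
    for member in tar.getmembers():
        f = tar.extractfile(member)
        if f is None:               # skip directories and special members
            continue
        content = f.read().decode("utf-8", errors="ignore")
        total_chars += len(content)
        total_spaces += content.count(" ")
        total_newlines += content.count("\n")
print(total_chars, total_spaces, total_newlines)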
You need to use the tarfile module. Specifically, you use an instance of the class TarFile to access the file, and then access the names with TarFile.getnames():
| getnames(self)
| Return the members of the archive as a list of their names. It has
| the same order as the list returned by getmembers().
If instead you want to read the content, then you use this method
| extractfile(self, member)
| Extract a member from the archive as a file object. `member' may be
| a filename or a TarInfo object. If `member' is a regular file, a
| file-like object is returned. If `member' is a link, a file-like
| object is constructed from the link's target. If `member' is none of
| the above, None is returned.
| The file-like object is read-only and provides the following
| methods: read(), readline(), readlines(), seek() and tell()
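Put together, a short sketch using those two methods might look like this (the archive name is a placeholder):
import tarfile

tar = tarfile.open("test.tar")
for name in tar.getnames():
    f = tar.extractfile(name)
    if f is not None:     # None for members that are neither files nor links
        print(name, len(f.read()))
tar.close()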
Previously, this post showed an example of "dict(zip())"-ing the member names and members lists together. This is silly and causes excessive reads of the archive; to accomplish the same, we can use a dictionary comprehension:
index = {i.name: i for i in my_tarfile.getmembers()}
More info on how to use tarfile
Extract a tarfile member
#!/usr/bin/env python3
import tarfile
my_tarfile = tarfile.open('/path/to/mytarfile.tar')
print(my_tarfile.extractfile('./path/to/file.png').read())
Index a tar file
#!/usr/bin/env python3
import tarfile
import pprint
my_tarfile = tarfile.open('/path/to/mytarfile.tar')
index = my_tarfile.getnames() # a list of strings, each members name
# or
# index = {i.name: i for i in my_tarfile.getmembers()}
pprint.pprint(index)
Index, read, dynamically extract a tar file
#!/usr/bin/env python3
import tarfile
import base64
import textwrap
import random
# note, indexing a tar file requires reading it completely once
# if we want to do anything after indexing it, it must be a file
# that can be seeked (not a stream), so here we open a file we
# can seek
my_tarfile = tarfile.open('/path/to/mytar.tar')
# tarfile.getmembers is similar to os.stat kind of, it will
# give you the member names (i.name) as well as TarInfo attributes:
#
# chksum,devmajor,devminor,gid,gname,linkname,linkpath,
# mode,mtime,name,offset,offset_data,path,pax_headers,
# size,sparse,tarfile,type,uid,uname
#
# here we use a dictionary comprehension to index all TarInfo
# members by the member name
index = {i.name: i for i in my_tarfile.getmembers()}
print(index.keys())
# pick your member
# note: if you can pick your member before indexing the tar file,
# you don't need to index it to read that file, you can directly
# my_tarfile.extractfile(name)
# or my_tarfile.getmember(name)
# pick your filename from the index dynamically
my_file_name = random.choice(list(index.keys()))
my_file_tarinfo = index[my_file_name]
my_file_size = my_file_tarinfo.size
my_file_buf = my_tarfile.extractfile(
    my_file_name
    # or my_file_tarinfo
)
print('file_name: {}'.format(my_file_name))
print('file_size: {}'.format(my_file_size))
print('----- BEGIN FILE BASE64 -----')
print(
    textwrap.fill(
        base64.b64encode(
            my_file_buf.read()
        ).decode(),
        72
    )
)
print('----- END FILE BASE64 -----')
tarfile with duplicate members
In the case that we have a tar that was created strangely, in this example by appending many versions of the same file to the same tar archive, we can work with that carefully. I've annotated which members contain what text; let's say we want the fourth (index 3) member, "capturetheflag\n".
tar -tf mybadtar.tar
mymember.txt # "version 1\n"
mymember.txt # "version 1\n"
mymember.txt # "version 2\n"
mymember.txt # "capturetheflag\n"
mymember.txt # "version 3\n"
#!/usr/bin/env python3
import tarfile
my_tarfile = tarfile.open('mybadtar.tar')
# >>> my_tarfile.getnames()
# ['mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt']
# if we use extractfile on a name, we get the last entry. I'm not sure how python is smart enough to do this, it must read the entire tar file and buffer every valid member and return the last one
# >>> my_tarfile.extractfile('mymember.txt').read()
# b'version 3\n'
# >>> my_tarfile.extractfile(my_tarfile.getmembers()[3]).read()
# b'capturetheflag\n'
Alternatively we can iterate over the tar file
#!/usr/bin/env python3
import tarfile
my_tarfile = tarfile.open('mybadtar.tar')
# note, if we do anything to the tarfile object that will
# cause a full read, the tarfile.next() method will return none,
# so call next in a loop as the first thing you do if you want to
# iterate
while True:
    my_member = my_tarfile.next()
    if not my_member:
        break
    print((my_member.offset, my_tarfile.extractfile(my_member).read(),))
# (0, b'version 1\n')
# (1024, b'version 1\n')
# (2048, b'version 2\n')
# (3072, b'capturetheflag\n')
# (4096, b'version 3\n')
You can use TarFile.list(), for example:
filename = "abc.tar.bz2"
with open( filename , mode='r:bz2') as f1:
print(f1.list())
After getting this data, you can manipulate it or write the output to a file, and do whatever your requirements need.
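If you want that listing in a file rather than on stdout, note that list() only prints; here is a small sketch using getmembers() instead (the filenames are placeholders):
import tarfile

with tarfile.open("abc.tar.bz2", "r:bz2") as tar, open("listing.txt", "w") as out:
    for member in tar.getmembers():
        out.write("%s\t%d\n" % (member.name, member.size))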