I've got some tar data in bytes, and want to read it without writing it to the file system.
Writing it to the file system works:
with open('out.tar', 'wb') as f:
    f.write(data)
then, in the shell: tar -xzvf out.tar
But the following errors:
import tarfile
tarfile.open(data, 'r')
'''
File ".../lib/python3.7/tarfile.py", line 1591, in open
return func(name, filemode, fileobj, **kwargs)
File ".../lib/python3.7/tarfile.py", line 1638, in gzopen
fileobj = gzip.GzipFile(name, mode + "b", compresslevel, fileobj)
File ".../lib/python3.7/gzip.py", line 163, in __init__
fileobj = self.myfileobj = builtins.open(fil
'''
what is the right way to read the tar in memory?
Update
The following works:
from io import BytesIO
tarfile.open(fileobj=BytesIO(data), mode='r')
Why?
tarfile.open is supposed to be able to work with bytes. Converting the bytes to a file-like object myself and then telling tarfile.open to use the file-like object works, but why is the transformation necessary? When does the raw bytes-based API work vs. not work?
You can use the tarfile module and read the data from a byte stream:
import tarfile
from io import BytesIO

with tarfile.open(fileobj=BytesIO(your_bytes)) as tar:
    for tar_file in tar:
        if tar_file.isfile():
            inner_data = tar.extractfile(tar_file).read().decode('utf-8')
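If the in-memory data is gzip-compressed (as the original tar -xzvf suggests), the mode string matters too. A minimal sketch: the archive is built on the fly here so the example is self-contained, and "r:*" lets tarfile autodetect the compression:

```python
import io
import tarfile

# Build a small gzipped tar archive in memory for demonstration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    payload = b'hello, tar'
    info = tarfile.TarInfo(name='hello.txt')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

data = buf.getvalue()  # the raw bytes, as if received over the network

# 'r:*' autodetects plain, gzip, bz2 or xz compression.
with tarfile.open(fileobj=io.BytesIO(data), mode='r:*') as tar:
    names = tar.getnames()
    content = tar.extractfile('hello.txt').read()

print(names)    # ['hello.txt']
print(content)  # b'hello, tar'
```

The first positional argument of tarfile.open is a filename, which is why passing raw bytes there fails; fileobj is the keyword for in-memory data.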
Related
I am trying to send myself PDF files per E-mail with Python. I am able to send myself the binary code of a PDF file, but I am not able to reconstruct the PDF file from this binary code.
Here is how I obtain the binary code of a PDF file:
file = open('code.txt', 'w')
for line in open('somefile.pdf', 'rb').readlines():
    file.write(str(line))
file.close()
Here is how I try to create a PDF file from the binary code:
file = open('new.pdf', 'wb')
for line in open('code.txt', 'r').readlines():
    file.write(bytes(line))
file.close()
I then receive this error:
Traceback (most recent call last):
File "something.py", line 3, in <module>
file.write(bytes(line))
TypeError: string argument without an encoding
What did I do wrong?
In your first block, open the file in binary write mode (wb), since you are writing binary data to it. Also, you don't need to convert each line explicitly to str. It should look like this:
file = open('code.txt', 'wb')
for line in open('somefile.pdf', 'rb').readlines():
    file.write(line)
file.close()
For the second block, open the file in binary read mode (rb). Here, too, there's no need to explicitly convert to bytes. It should look like this:
file = open('new.pdf', 'wb')
for line in open('code.txt', 'rb').readlines():
    file.write(line)
file.close()
This should do the trick. But why do you need to convert it in the first place? Keeping the file intact will save you effort and computation.
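If the goal is simply a byte-for-byte copy, the line loop can be avoided entirely. A sketch using shutil.copyfileobj; the sample file here is fabricated so the example runs on its own:

```python
import shutil

# Create a sample binary file for demonstration.
with open('somefile.pdf', 'wb') as f:
    f.write(b'%PDF-1.4 fake content \x00\x01\x02')

# Copy it byte-for-byte, without interpreting the contents as lines.
with open('somefile.pdf', 'rb') as src, open('new.pdf', 'wb') as dst:
    shutil.copyfileobj(src, dst)

with open('new.pdf', 'rb') as f:
    copied = f.read()
```

Iterating a binary file by lines is fragile for formats like PDF, which may contain b'\n' bytes anywhere; copying fixed-size chunks sidesteps that entirely.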
Just to add: in my case, I was downloading the PDF from an API, and response.content came base64-encoded. I also didn't need to write line by line; I needed to decode the byte string first:
import base64
import requests

response = requests.get(self.download_url,
                        allow_redirects=True,
                        headers=headers,
                        params=query_params)
pdf_bytes = base64.b64decode(response.content)
with open('file.pdf', 'wb') as f:
    f.write(pdf_bytes)
My goal is to extract a file out of a .tar.gz file without also extracting the sub-directories that precede the desired file. I am trying to model my method on this question. I already asked a question of my own, but the answer I thought would work didn't work fully.
In short, shutil.copyfileobj isn't copying the contents of my file.
My code is now:
import os
import shutil
import tarfile
import gzip

with tarfile.open('RTLog_20150425T152948.gz', 'r:*') as tar:
    for member in tar.getmembers():
        filename = os.path.basename(member.name)
        if not filename:
            continue
        source = tar.fileobj
        target = open('out', "wb")
        shutil.copyfileobj(source, target)
Upon running this code, the file out was successfully created; however, it was empty. I know that the file I wanted to extract does, in fact, contain lots of data (approximately 450 kB). A print(member.size) returns 1564197.
My attempts to solve this were unsuccessful. A print(type(tar.fileobj)) told me that tar.fileobj is a <gzip _io.BufferedReader name='RTLog_20150425T152948.gz' 0x3669710>.
Therefore I tried changing source to: source = gzip.open(tar.fileobj) but this raised the following error:
Traceback (most recent call last):
File "C:\Users\dzhao\Desktop\123456\444444\blah.py", line 15, in <module>
shutil.copyfileobj(source, target)
File "C:\Python34\lib\shutil.py", line 67, in copyfileobj
buf = fsrc.read(length)
File "C:\Python34\lib\gzip.py", line 365, in read
if not self._read(readsize):
File "C:\Python34\lib\gzip.py", line 433, in _read
if not self._read_gzip_header():
File "C:\Python34\lib\gzip.py", line 297, in _read_gzip_header
raise OSError('Not a gzipped file')
OSError: Not a gzipped file
Why isn't shutil.copyfileobj actually copying the contents of the file in the .tar.gz?
fileobj isn't a documented property of TarFile. It's probably an internal object used to represent the whole tar file, not something specific to the current member.
Use TarFile.extractfile() to get a file-like object for a specific member:
…
source = tar.extractfile(member)
target = open("out", "wb")
shutil.copyfileobj(source, target)
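Putting it together, here is a minimal sketch that flattens every regular file out of a .tar.gz; the archive is built in memory on the fly so the example is self-contained:

```python
import io
import os
import shutil
import tarfile

# Build a small .tar.gz in memory so the example is self-contained.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    payload = b'log line 1\nlog line 2\n'
    info = tarfile.TarInfo(name='subdir/deeper/RTLog.txt')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

buf.seek(0)
with tarfile.open(fileobj=buf, mode='r:*') as tar:
    for member in tar.getmembers():
        filename = os.path.basename(member.name)
        if not member.isfile() or not filename:
            continue
        # extractfile() returns a file-like object for this member only.
        source = tar.extractfile(member)
        with open(filename, 'wb') as target:
            shutil.copyfileobj(source, target)

with open('RTLog.txt', 'rb') as f:
    extracted = f.read()
```

Taking os.path.basename(member.name) as the output filename is what drops the sub-directories; the with-statement also ensures the target file is closed and flushed.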
I am trying to read a large bz2 file with this code:
import bz2
file= bz2.BZ2File("20150219.csv.bz2","rb")
print file.read()
file.close()
But after 4525 lines, it stops without an error message. The bz2 file is much bigger.
How can I read the whole file line by line?
Your file.read() call tries to read the entire file into memory and then decompress all of it there, too. Try reading it a line at a time instead. (Note also that if the file contains multiple bz2 streams, e.g. one produced by pbzip2, Python 2's BZ2File stops silently after the first stream; Python 3.3+ handles multi-stream files.)
import bz2

with bz2.BZ2File("20150219.csv.bz2", "r") as file:
    for line in file:
        print(line)
Why do you want to print a binary file line by line? Read it into a bytes object instead:
bs = file.read()
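For line-by-line processing with decoded text, Python 3's bz2.open in text mode is convenient. A self-contained sketch; the CSV content is fabricated for the demo:

```python
import bz2

# Create a small .csv.bz2 for demonstration.
rows = "a,b\n1,2\n3,4\n"
with bz2.open('demo.csv.bz2', 'wt', encoding='utf-8') as f:
    f.write(rows)

# 'rt' mode decompresses and decodes on the fly, one line at a time,
# so the whole file is never held in memory at once.
lines = []
with bz2.open('demo.csv.bz2', 'rt', encoding='utf-8') as f:
    for line in f:
        lines.append(line.rstrip('\n'))
```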
I am going to use the wiktionary dump for the purpose of POS tagging. Somehow it gets stuck when downloading. Here is my code:
import nltk
from urllib import urlopen
from collections import Counter
import gzip
url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
fStream = gzip.open(urlopen(url).read(), 'rb')
dictFile = fStream.read()
fStream.close()
text = nltk.Text(word.lower() for word in dictFile())
tokens = nltk.word_tokenize(text)
Here is the error I get:
Traceback (most recent call last):
File "~/dir1/dir1/wikt.py", line 15, in <module>
fStream = gzip.open(urlopen(url).read(), 'rb')
File "/usr/lib/python2.7/gzip.py", line 34, in open
return GzipFile(filename, mode, compresslevel)
File "/usr/lib/python2.7/gzip.py", line 89, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
Process finished with exit code 1
You are passing the downloaded data to gzip.open(), which expects to be passed a filename instead.
The code then tries to open a filename named by the gzipped data, and fails.
Either save the URL data to a file, then use gzip.open() on that, or decompress the gzipped data using the zlib module instead. 'Saving' the data can be as easy as using a StringIO.StringIO() in-memory file object:
from StringIO import StringIO
from urllib import urlopen
import gzip
url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
inmemory = StringIO(urlopen(url).read())
fStream = gzip.GzipFile(fileobj=inmemory, mode='rb')
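On Python 3 the same pattern uses io.BytesIO instead of StringIO.StringIO. A sketch that skips the network by compressing sample data in memory (the stand-in bytes replace urlopen(url).read(); everything else is unchanged apart from the module names):

```python
import gzip
import io

# Stand-in for urlopen(url).read(): gzip-compressed bytes in memory.
downloaded = gzip.compress(b'title_one\ntitle_two\n')

# Wrap the bytes in an in-memory file object and hand it to GzipFile.
inmemory = io.BytesIO(downloaded)
with gzip.GzipFile(fileobj=inmemory, mode='rb') as f:
    text = f.read().decode('utf-8')
```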
I've read this SO post about the problem, to no avail.
I am trying to decompress a .gz file coming from a URL.
url_file_handle=StringIO( gz_data )
gzip_file_handle=gzip.open(url_file_handle,"r")
decompressed_data = gzip_file_handle.read()
gzip_file_handle.close()
... but I get TypeError: coercing to Unicode: need string or buffer, cStringIO.StringI found
What's going on?
Traceback (most recent call last):
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 2974, in _HandleRequest
base_env_dict=env_dict)
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 411, in Dispatch
base_env_dict=base_env_dict)
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 2243, in Dispatch
self._module_dict)
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 2161, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 2057, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/home/jldupont/workspace/jldupont/trunk/site/app/server/tasks/debian/repo_fetcher.py", line 36, in <module>
main()
File "/home/jldupont/workspace/jldupont/trunk/site/app/server/tasks/debian/repo_fetcher.py", line 30, in main
gziph=gzip.open(fh,'r')
File "/usr/lib/python2.5/gzip.py", line 49, in open
return GzipFile(filename, mode, compresslevel)
File "/usr/lib/python2.5/gzip.py", line 95, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: coercing to Unicode: need string or buffer, cStringIO.StringI found
If your data is already in a string, try zlib, which claims to be fully gzip compatible:
import zlib
decompressed_data = zlib.decompress(gz_data, 16+zlib.MAX_WBITS)
Read more: http://docs.python.org/library/zlib.html
gzip.open is shorthand for opening a file by name; what you want is gzip.GzipFile, to which you can pass a fileobj:
open(filename, mode='rb', compresslevel=9)
#Shorthand for GzipFile(filename, mode, compresslevel).
vs
class GzipFile
__init__(self, filename=None, mode=None, compresslevel=9, fileobj=None)
# At least one of fileobj and filename must be given a non-trivial value.
so this should work for you
gzip_file_handle = gzip.GzipFile(fileobj=url_file_handle)
You can use gzip.decompress from the built-in gzip module (available since Python 3.2).
Example on how to decompress bytes:
import gzip
gzip.decompress(gzip_data)
Documentation
https://docs.python.org/3.5/library/gzip.html#gzip.decompress
Consider using gzip.GzipFile if you don't like passing obscure arguments to zlib.decompress.
When you deal with urllib2.urlopen response that can be either gzip-compressed or uncompressed:
import gzip
from StringIO import StringIO

# response = urllib2.urlopen(...
content_raw = response.read()
# getheader() returns None when the header is absent, hence the "or ''".
if 'gzip' in (response.info().getheader('Content-Encoding') or ''):
    content = gzip.GzipFile(fileobj=StringIO(content_raw)).read()
else:
    content = content_raw
When you deal with a file that can store either gzip-compressed or uncompressed data:
import gzip

# some_file = open(...
try:
    content = gzip.GzipFile(fileobj=some_file).read()
except IOError:
    some_file.seek(0)
    content = some_file.read()
The examples above are in Python 2.7
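A rough Python 3 equivalent of the second pattern: try gzip first and fall back to the raw bytes. On non-gzip input, GzipFile.read() raises OSError (gzip.BadGzipFile on 3.8+ is a subclass of it). Demonstrated here on in-memory file objects:

```python
import gzip
import io

def read_maybe_gzipped(fileobj):
    """Return decompressed bytes if fileobj holds gzip data, else the raw bytes."""
    try:
        return gzip.GzipFile(fileobj=fileobj).read()
    except OSError:  # not a gzipped file
        fileobj.seek(0)
        return fileobj.read()

compressed = io.BytesIO(gzip.compress(b'payload'))
plain = io.BytesIO(b'payload')

result_compressed = read_maybe_gzipped(compressed)
result_plain = read_maybe_gzipped(plain)
```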