Python TarFile with bz2 data


I'm trying to download a bz2-compressed tar file and create a tarfile.TarFile object from it.
import MyModule
import StringIO
import tarfile
tardata = StringIO.StringIO()
tardata.write(MyModule.getBz2TarFileData())
tardata.seek(0)
tar = tarfile.open(fileobj = tardata, mode="r:bz2")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/tarfile.py", line 896, in open
return func(name, filemode, fileobj)
File "/usr/lib/python2.4/tarfile.py", line 987, in bz2open
pre, ext = os.path.splitext(name)
File "/usr/lib/python2.4/posixpath.py", line 92, in splitext
i = p.rfind('.')
AttributeError: 'NoneType' object has no attribute 'rfind'
According to the docs (http://docs.python.org/library/tarfile.html#tarfile.open), when you use fileobj= it is used in favor of the file name=. It looks like it's still trying to access a null file name, though?
If fileobj is specified, it is used as an alternative to a file object
opened for name. It is supposed to be at position 0.
If I don't use tarfile.open() and I decompress the bz2 data and create the tarfile.Tarfile object manually it works with StringIO and fileobj:
>>> import MyModule
>>> import tarfile
>>> import StringIO
>>> import bz2
>>> tardata = StringIO.StringIO()
>>> tardata.write(bz2.decompress(MyModule.getBz2TarFileData()))
>>> tardata.seek(0)
>>> tar = tarfile.TarFile(fileobj=tardata, mode='r')
>>> tar.getmembers()
[<TarInfo 'FileNumber1' at -0x48e150f4>, <TarInfo 'FileNumber2' at -0x48e150d4>, <TarInfo 'FileNumber3' at -0x48e11fb4>]
>>>
I was trying to streamline since tarfile is supposed to support bz2 compression.

I just had a look at tarfile.py on my system. The line numbers are quite different (I have 2.6), so I suppose the module has been reworked heavily since 2.4.
Maybe the module had a bug in 2.4 that has since been fixed, or the interface has changed so that the docs no longer match your module version.
This is just a guess, however.
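
For readers on a current Python 3, here is a minimal sketch of the streamlined call the question is after: tarfile.open() with only fileobj= and mode="r:bz2". It assumes the archive bytes are already in memory. The helper make_bz2_tar_bytes is hypothetical and only stands in for MyModule.getBz2TarFileData(), which is not shown in the question.

```python
import io
import tarfile


def make_bz2_tar_bytes(files):
    """Hypothetical stand-in for MyModule.getBz2TarFileData()."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


data = make_bz2_tar_bytes({"FileNumber1": b"one", "FileNumber2": b"two"})

# Pass only the in-memory file object and the mode; no file name is involved.
with tarfile.open(fileobj=io.BytesIO(data), mode="r:bz2") as tar:
    names = tar.getnames()
    first = tar.extractfile("FileNumber1").read()

print(names)
```

BytesIO is used rather than StringIO because the compressed archive is binary data.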

Related

python uploading a remote file to GCS , without saving it in the machine [duplicate]

I have managed to get my first python script to work which downloads a list of .ZIP files from a URL and then proceeds to extract the ZIP files and writes them to disk.
I am now at a loss to achieve the next step.
My primary goal is to download and extract the zip file and pass the contents (CSV data) via a TCP stream. I would prefer not to actually write any of the zip or extracted files to disk if I could get away with it.
Here is my current script which works but unfortunately has to write the files to disk.
import urllib, urllister
import zipfile
import urllib2
import os
import time
import pickle
# check for extraction directories existence
if not os.path.isdir('downloaded'):
    os.makedirs('downloaded')
if not os.path.isdir('extracted'):
    os.makedirs('extracted')
# open logfile for downloaded data and save to local variable
if os.path.isfile('downloaded.pickle'):
    downloadedLog = pickle.load(open('downloaded.pickle'))
else:
    downloadedLog = {'key':'value'}
# remove entries older than 5 days (to maintain speed)
# path of zip files
zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"
# retrieve list of URLs from the webservers
usock = urllib.urlopen(zipFileURL)
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
# only parse urls
for url in parser.urls:
    if "PUBLIC_P5MIN" in url:
        # download the file
        downloadURL = zipFileURL + url
        outputFilename = "downloaded/" + url
        # check if file already exists on disk
        if url in downloadedLog or os.path.isfile(outputFilename):
            print "Skipping " + downloadURL
            continue
        print "Downloading ",downloadURL
        response = urllib2.urlopen(downloadURL)
        zippedData = response.read()
        # save data to disk
        print "Saving to ",outputFilename
        output = open(outputFilename,'wb')
        output.write(zippedData)
        output.close()
        # extract the data
        zfobj = zipfile.ZipFile(outputFilename)
        for name in zfobj.namelist():
            uncompressed = zfobj.read(name)
            # save uncompressed data to disk
            outputFilename = "extracted/" + name
            print "Saving extracted file to ",outputFilename
            output = open(outputFilename,'wb')
            output.write(uncompressed)
            output.close()
            # send data via tcp stream
        # file successfully downloaded and extracted store into local log and filesystem log
        downloadedLog[url] = time.time();
        pickle.dump(downloadedLog, open('downloaded.pickle', "wb" ))

Below is a code snippet I used to fetch zipped csv file, please have a look:
Python 2:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(StringIO(resp.read()))
for line in myzip.open(file).readlines():
    print line
Python 3:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
# or: requests.get(url).content
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(BytesIO(resp.read()))
for line in myzip.open(file).readlines():
    print(line.decode('utf-8'))
Here file is a string. To get the actual string that you want to pass, you can use zipfile.namelist(). For instance,
resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
myzip = ZipFile(BytesIO(resp.read()))
myzip.namelist()
# ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']

My suggestion would be to use a StringIO object. They emulate files, but reside in memory. So you could do something like this:
# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'
import zipfile
from StringIO import StringIO
zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()
# output: "hey, foo"
Or more simply (apologies to Vishal):
myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
    [ ... ]
In Python 3 use BytesIO instead of StringIO:
import zipfile
from io import BytesIO
filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    [ ... ]
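
To make the Python 3 variant above concrete, here is a self-contained sketch that never touches the disk. The get_zip_data function is a hypothetical stand-in for the download step: it builds a small archive in memory instead of fetching one, so the member name and contents are assumptions chosen to match the 'foo.txt' example.

```python
import io
import zipfile


def get_zip_data():
    """Hypothetical stand-in for the download: returns zip bytes built in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("foo.txt", "hey, foo")
    return buf.getvalue()


contents = {}
with zipfile.ZipFile(io.BytesIO(get_zip_data())) as myzipfile:
    for name in myzipfile.namelist():
        # read() returns bytes; decode only when the member is known to be text
        contents[name] = myzipfile.read(name).decode("utf-8")

print(contents)
```

Each value in contents could then be sent over a socket or processed further without an intermediate file.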

I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations / changes, which may have been already mentioned.
from io import BytesIO
from zipfile import ZipFile
import urllib.request
url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")
with ZipFile(BytesIO(url.read())) as my_zip_file:
    for contained_file in my_zip_file.namelist():
        # with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
        for line in my_zip_file.open(contained_file).readlines():
            print(line)
            # output.write(line)
Necessary changes:
There's no StringIO module in Python 3 (it's been moved to io.StringIO). Instead, I use io.BytesIO, because we will be handling a bytestream (see the Docs, and also this thread).
urlopen:
"The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen.", Docs and this thread.
Note:
In Python 3, the printed output lines will look like so: b'some text'. This is expected, as they aren't strings - remember, we're reading a bytestream. Have a look at Dan04's excellent answer.
A few minor changes I made:
I use with ... as instead of zipfile = ... according to the Docs.
The script now uses .namelist() to cycle through all the files in the zip and print their contents.
I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.
Need to be careful here -- because we have a byte string, we use binary mode, so "wb"; I have a feeling that writing binary opens a can of worms anyway...
I am using an example file, the UN/LOCODE text archive:
What I didn't do:
NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it -- downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.
Here's a way:
import urllib.request
import shutil
with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

I'd like to add my Python3 answer for completeness:
from io import BytesIO
from zipfile import ZipFile
import requests
def get_zip(file_url):
    url = requests.get(file_url)
    zipfile = ZipFile(BytesIO(url.content))
    files = [zipfile.open(file_name) for file_name in zipfile.namelist()]
    return files.pop() if len(files) == 1 else files

write to a temporary file which resides in RAM
it turns out the tempfile module ( http://docs.python.org/library/tempfile.html ) has just the thing:
tempfile.SpooledTemporaryFile([max_size=0[, mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]])
This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
The resulting file has one additional method, rollover(), which causes the file to roll over to an on-disk file regardless of its size.
The returned object is a file-like object whose _file attribute is either a StringIO object or a true file object, depending on whether rollover() has been called. This file-like object can be used in a with statement, just like a normal file.
New in version 2.6.
or if you're lazy and you have a tmpfs-mounted /tmp on Linux, you can just make a file there, but you have to delete it yourself and deal with naming
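
A minimal sketch of the SpooledTemporaryFile approach in Python 3, assuming the data fits below the chosen threshold. The payload and the 1 MiB max_size are illustrative values, not taken from the question.

```python
import tempfile

payload = b"col_a,col_b\n1,2\n"  # stand-in for downloaded or extracted data

# max_size is an illustrative threshold: data stays in memory below it
# and is moved to a real temporary file on disk once it grows larger.
with tempfile.SpooledTemporaryFile(max_size=1024 * 1024, mode="w+b") as spool:
    spool.write(payload)
    spool.seek(0)  # rewind before handing the object to a reader
    round_trip = spool.read()

print(round_trip)
```

The file is cleaned up automatically when the with block exits, which avoids the manual deletion and naming that a hand-made file under /tmp would need.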

Adding on to the other answers using requests:
# download from web
import requests
url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
content = requests.get(url)
# unzip the content
from io import BytesIO
from zipfile import ZipFile
f = ZipFile(BytesIO(content.content))
print(f.namelist())
# outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
Use help(f) for details on the other methods, e.g. extractall(), which extracts the contents of the zip file to disk so that they can later be read with open().

All of these answers appear too bulky and long. Use requests to shorten the code, e.g.:
import requests, zipfile, io
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("/path/to/directory")

Vishal's example, however great, is confusing when it comes to the file name, and I do not see the merit of redefining 'zipfile'.
Here is my example that downloads a zip that contains some files, one of which is a csv file that I subsequently read into a pandas DataFrame:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
import pandas
url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
zf = ZipFile(StringIO(url.read()))
for item in zf.namelist():
    print("File in zip: "+ item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall be ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
(Note, I use Python 2.7.13)
This is the exact solution that worked for me. I just tweaked it a little bit for Python 3 version by removing StringIO and adding IO library
Python 3 Version
from io import BytesIO
from zipfile import ZipFile
import pandas
import requests
url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
content = requests.get(url)
zf = ZipFile(BytesIO(content.content))
for item in zf.namelist():
    print("File in zip: "+ item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall be ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])

It wasn't obvious in Vishal's answer what the file name was supposed to be in cases where there is no file on disk. I've modified his answer to work without modification for most needs.
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
def unzip_string(zipped_string):
    unzipped_string = ''
    zipfile = ZipFile(StringIO(zipped_string))
    for name in zipfile.namelist():
        unzipped_string += zipfile.open(name).read()
    return unzipped_string

Use the zipfile module. To extract a file from a URL, you'll need to wrap the result of a urlopen call in a BytesIO object. This is because the result of a web request returned by urlopen doesn't support seeking:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
zip_url = 'http://example.com/my_file.zip'
with urlopen(zip_url) as f:
    with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
        foofile = myzipfile.open('foo.txt')
        print(foofile.read())
If you already have the file downloaded locally, you don't need BytesIO, just open it in binary mode and pass to ZipFile directly:
from zipfile import ZipFile
zip_filename = 'my_file.zip'
with open(zip_filename, 'rb') as f:
    with ZipFile(f) as myzipfile:
        foofile = myzipfile.open('foo.txt')
        print(foofile.read().decode('utf-8'))
Again, note that you have to open the file in binary ('rb') mode, not as text or you'll get a zipfile.BadZipFile: File is not a zip file error.
It's good practice to use all these things as context managers with the with statement, so that they'll be closed properly.

How to get a Python os.PathLike object from a NamedTemporaryFile using the tempfile module?

I'm trying to call a Python (3.8) function using the pdfkit library that requires a list of file objects. I only have strings, so I need to write each string to a temporary file that I will delete after the function returns. I can roll my own solution for this, but it would be much more convenient (and portable) if I could use the tempfile module for this. Unfortunately, when I wrap each string with a NamedTemporaryFile, my function complains with a TypeError that I must supply something that is os.PathLike, not _TemporaryFileWrapper.
Is it possible to get something from the tempfile module that is suitable for this purpose?
from tempfile import NamedTemporaryFile
import pdfkit
the_files = []
for html in my_html_strings:
    tmp_file = NamedTemporaryFile(mode='w', suffix='.html')
    tmp_file.write(html)
    the_files.append(tmp_file)
pdf_data = pdfkit.from_file(the_files, False) # raises
And the stack trace:
File "...", line 68, in my_code
pdf_data = pdfkit.from_file(section_files, False, options=opts)
File ".../lib/python3.8/site-packages/pdfkit/api.py", line 46, in from_file
r = PDFKit(input, 'file', options=options, toc=toc, cover=cover, css=css,
File ".../lib/python3.8/site-packages/pdfkit/pdfkit.py", line 41, in __init__
self.source = Source(url_or_file, type_)
File ".../lib/python3.8/site-packages/pdfkit/source.py", line 12, in __init__
self.checkFiles()
File ".../lib/python3.8/site-packages/pdfkit/source.py", line 28, in checkFiles
if not os.path.exists(path):
File "/usr/lib/python3.8/genericpath.py", line 19, in exists
os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not _TemporaryFileWrapper

You don't need a PathLike. You just need a string representing the file path. For a NamedTemporaryFile, that string is its name attribute:
namedtempfile.name
Almost everything that takes an os.PathLike is supposed to take a plain string too.
If for some weird reason you find yourself in a situation where you really need a PathLike and a string actually doesn't work, you can wrap the string in a pathlib.Path:
import pathlib
pathlike = pathlib.Path(namedtempfile.name)
This is not such a situation.
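
Applied to the loop from the question, the idea can be sketched as follows. This is an illustration under assumptions: my_html_strings is made-up sample input, and delete=False with an explicit close() is a choice of this sketch (so the data is flushed and the path can be reopened on every platform), not something the answer prescribes. The pdfkit call is left commented out so the sketch runs without that library.

```python
import os
from tempfile import NamedTemporaryFile

my_html_strings = ["<p>one</p>", "<p>two</p>"]  # hypothetical sample input

tmp_files = []
try:
    for html in my_html_strings:
        # delete=False keeps the file on disk after close() so it can be
        # reopened by path; cleanup is then our responsibility.
        tmp = NamedTemporaryFile(mode="w", suffix=".html", delete=False)
        tmp.write(html)
        tmp.close()  # flushes the contents to disk
        tmp_files.append(tmp)

    # Pass plain path strings, not the wrapper objects.
    paths = [tmp.name for tmp in tmp_files]
    all_exist = all(os.path.exists(p) for p in paths)
    # pdf_data = pdfkit.from_file(paths, False)  # as in the question; not run here
finally:
    for tmp in tmp_files:
        os.remove(tmp.name)
```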

Import from <class 'bytes'> instead of file

In Python I have .pyd shared library that is encrypted to .epyd and that I read and decrypt with
with open('src_nuitka/src.epyd', 'rb') as f:
    my_pyd_module = decrypt(f.read())
Now I would like to import the module using the <class 'bytes'> object my_pyd_module directly without writing to disk first. How can I do this? Since it is not a Python code string, I cannot use exec. Is there an import hook available for this task? All examples of writing import hooks did this using exec or by instantiating the classes directly as in https://dev.to/dangerontheranger/dependency-injection-with-import-hooks-in-python-3-5hap.
So here is my first try using the ideas of @a_guest and https://dev.to/dangerontheranger/dependency-injection-with-import-hooks-in-python-3-5hap (and no en-/decrypting yet):
import importlib.abc
import importlib.machinery
import sys
class DependencyInjectorFinder(importlib.abc.MetaPathFinder):
    def __init__(self, loader):
        self._loader: DependencyInjectorLoader = loader
    def find_spec(self, fullname, path, target=None):
        if fullname == 'src2':
            return importlib.machinery.ModuleSpec(fullname, self._loader)
class DependencyInjectorLoader(importlib.machinery.ExtensionFileLoader):
    def get_data(self, path):
        with open('src_packaged/src_dist/src.pyd', 'rb') as f:
            module = f.read()
        return module
sys.meta_path.append(DependencyInjectorFinder(DependencyInjectorLoader('src2', 'src2')))
import src2
which results in
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: bad argument type for built-in operation
for the last line.

Unable to Zip a File in Buffer

I want to zip a CSV file in a buffer using zipfile in Python.
Below is the code I have tried, with the error log attached.
I don't want to use the compression option of df.to_csv because of a version issue.
import pandas as pd
import numpy as np
import io
import zipfile
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
s_buf = io.StringIO()
df.to_csv(s_buf,index=False)
s_buf.seek(0)
s_buf.name = 'my_filename.csv'
localfile= io.BytesIO()
localzip = io.BytesIO()
zf = zipfile.ZipFile(localzip, mode="w",compression=zipfile.ZIP_DEFLATED)
zf.writestr(localfile, s_buf.read())
zf.close()
with open("D:/my_zip.zip", "wb") as f:
    f.write(zf.getvalue())
Error I am Getting
Traceback (most recent call last):
File "C:/Users/Window/PycharmProjects/dfZip/dfZiptest.py", line 25, in <module>
zf.writestr(localfile, s_buf.read())
File "C:\Python\Python37\lib\zipfile.py", line 1758, in writestr
date_time=time.localtime(time.time())[:6])
File "C:\Python\Python37\lib\zipfile.py", line 345, in __init__
null_byte = filename.find(chr(0))
AttributeError: '_io.BytesIO' object has no attribute 'find'

zf = zipfile.ZipFile("localzip.zip", mode="w", compression=zipfile.ZIP_DEFLATED)
zf.writestr(filename + '.csv', s_buf.read())
zf.close()
What you are doing here is:
1 - You initialize the ZipFile with the path of the archive.
2 - You pass the name the entry should have in the archive, and then the data to be written to it. In your case you were passing an io.BytesIO() object as the name, which made no sense to Python, hence the error.
I would strongly advise you to resolve any version issues first, because while a 'clever' solution may seem like a quick way out, such solutions tend to rack up terrible technical debt later, which can and will be a nightmare to deal with.

You are passing a io.BytesIO() object as the first argument to ZipFile.writestr() where it expects either an archive name or a ZipInfo object.
zf.writestr(localfile, s_buf.read())
zinfo_or_arcname is either the file name it will be given in the
archive, or a ZipInfo instance.
source: Docs
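
A minimal sketch of the corrected buffer-to-zip flow, under a few assumptions: the CSV text is built with the standard csv module as a stand-in for df.to_csv(s_buf, index=False), and the member name 'my_filename.csv' is reused from the question. Note also that getvalue() belongs to the BytesIO buffer, not to the ZipFile object.

```python
import csv
import io
import zipfile

# Build CSV text in memory (stand-in for df.to_csv(s_buf, index=False)).
s_buf = io.StringIO()
writer = csv.writer(s_buf)
writer.writerow(["A", "B"])
writer.writerow([1, 2])

localzip = io.BytesIO()
with zipfile.ZipFile(localzip, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    # The first argument is the name of the member inside the archive (a str).
    zf.writestr("my_filename.csv", s_buf.getvalue())

# The archive is complete once the with block has closed the ZipFile.
zip_bytes = localzip.getvalue()

# To persist it, write the bytes out, e.g.:
# with open("my_zip.zip", "wb") as f:
#     f.write(zip_bytes)
```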

Python - How do I convert "an OS-level handle to an open file" to a file object?

tempfile.mkstemp() returns:
a tuple containing an OS-level handle to an open file (as would be returned by os.open()) and the absolute pathname of that file, in that order.
How do I convert that OS-level handle to a file object?
The documentation for os.open() states:
To wrap a file descriptor in a "file
object", use fdopen().
So I tried:
>>> import tempfile
>>> tup = tempfile.mkstemp()
>>> import os
>>> f = os.fdopen(tup[0])
>>> f.write('foo\n')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IOError: [Errno 9] Bad file descriptor

You can use
os.write(tup[0], "foo\n")
to write to the handle.
If you want to open the handle for writing you need to add the "w" mode
f = os.fdopen(tup[0], "w")
f.write("foo")

Here's how to do it using a with statement:
from __future__ import with_statement
import os
import tempfile
from contextlib import closing
fd, filepath = tempfile.mkstemp()
with closing(os.fdopen(fd, 'w')) as tf:
    tf.write('foo\n')

You forgot to specify the open mode ('w') in fdopen(). The default is 'r', causing the write() call to fail.
mkstemp() opens the underlying descriptor for both reading and writing; the mode passed to fdopen() only determines what the wrapping file object allows, so pass 'w' (or 'r+') to get a writable file object.
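
Putting this together for Python 3, here is a minimal sketch. The '.txt' suffix and the read-back check are illustrative additions, not part of the question.

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    # os.fdopen wraps the descriptor; closing the file object closes fd too.
    with os.fdopen(fd, "w") as tf:
        tf.write("foo\n")

    # Read the file back by path to confirm the write went through.
    with open(path) as check:
        written = check.read()
finally:
    os.remove(path)  # mkstemp leaves deletion to the caller
```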

temp = tempfile.NamedTemporaryFile(delete=False)
temp.file.write('foo\n')
temp.close()

What's your goal, here? Is tempfile.TemporaryFile inappropriate for your purposes?

I can't comment on the answers, so I will post my comment here:
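tempfile.mkstemp already opens the temporary file for both reading and writing, so you can write to the returned handle with os.write. The fourth parameter is the text flag rather than an open mode; any true value, such as "w" here, selects text mode:
f = tempfile.mkstemp("", "", "", "w") # params are 'suffix', 'prefix', 'dir', 'text'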
os.write(f[0], "write something")
