Using Pylzma with streaming and 7Zip compatibility - python

I have been using pylzma for a while, but I need to be able to create files compatible with the 7-Zip Windows application. The caveat is that some of my files are really large (3 to 4 GB, created by third-party software in a proprietary binary format).
I went over and over the instructions here: https://github.com/fancycode/pylzma/blob/master/doc/USAGE.md
I am able to create compatible files with the following code:
def Compacts(folder, f):
    os.chdir(folder)
    fsize = os.stat(f).st_size
    t = time.clock()
    i = open(f, 'rb')
    o = open(f + '.7z', 'wb')
    i.seek(0)
    s = pylzma.compressfile(i)
    result = s.read(5)
    result += struct.pack('<Q', fsize)
    s = result + s.read()
    o.write(s)
    o.flush()
    o.close()
    i.close()
    os.remove(f)
The smaller files (up to 2 GB) compress well with this code and are compatible with 7-Zip, but the larger files just crash Python after some time.
According to the user guide, to compress large files one should use streaming, but then the resulting file is not compatible with 7-Zip, as in the snippet below.
def Compacts(folder, f):
    os.chdir(folder)
    fsize = os.stat(f).st_size
    t = time.clock()
    i = open(f, 'rb')
    o = open(f + '.7z', 'wb')
    i.seek(0)
    s = pylzma.compressfile(i)
    while True:
        tmp = s.read(1)
        if not tmp: break
        o.write(tmp)
    o.flush()
    o.close()
    i.close()
    os.remove(f)
Any ideas on how I can incorporate the streaming technique present in pylzma while keeping the 7-Zip compatibility?

You still need to correctly write the header (.read(5)) and size, e.g. like so:
import os
import struct

import pylzma

def sevenzip(infile, outfile):
    size = os.stat(infile).st_size
    with open(infile, "rb") as ip, open(outfile, "wb") as op:
        s = pylzma.compressfile(ip)
        op.write(s.read(5))
        op.write(struct.pack('<Q', size))
        while True:
            # Read 128K chunks.
            # Not sure if this has to be 1 instead to trigger streaming in pylzma...
            tmp = s.read(1 << 17)
            if not tmp:
                break
            op.write(tmp)

if __name__ == "__main__":
    import sys
    try:
        _, infile, outfile = sys.argv
    except ValueError:
        infile, outfile = __file__, __file__ + u".7z"
    sevenzip(infile, outfile)
    print("compressed {} to {}".format(infile, outfile))

Related

file.tell() doesn't update file size info for samba files

The file size stops updating when using Python's tell() method on files shared over samba.
I've created a sample to reproduce the problem:
tell() always shows the same size, while os.stat keeps updating the value.
import time
import os

fname = "SAMBA_FILE_PATH"
with open(fname, 'r') as file_handler:
    while 1:
        file_handler.seek(0, 2)
        file_size = file_handler.tell()
        print file_size
        print os.stat(fname).st_size
        time.sleep(2)
I have used readline() here instead of seek().
import time
import os

fname = "SAMBA_FILE_PATH"
with open(fname, 'r') as file_handler:
    while 1:
        file_handler.readline()
        file_size = file_handler.tell()
        print file_size
        print os.stat(fname).st_size
        time.sleep(2)
Fixed by modifying the samba config, adding:
oplocks = False
(taken from here)
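For context, here is roughly where that setting lives in smb.conf (a sketch; the share name and path are placeholders, not from the original post):
[share]
    path = /srv/share
    # disable opportunistic locks so client-side caching doesn't mask
    # server-side file growth
    oplocks = False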

race-condition: reading/writing file (windows)

I have the following situation:
- different users (all on Windows) run a Python script that can either read or write to a pickle file located in a shared folder.
- the "system" is designed so that only one user at a time will be writing to the file (therefore no race condition of multiple processes trying to WRITE to the file at the same time)
- the basic code to write would be this:
with open(path + r'\final_db.p', 'wb') as f:
    pickle.dump((x, y), f)
- while the code to read would be:
with open(path + r'\final_db.p', 'rb') as f:
    x, y = pickle.load(f)
- x is a list of 5K+ elements, where each element is a class instance containing many attributes and functions; y is a date
QUESTION:
am I correct in assuming that there is a race condition when a reading and a writing process overlap, and that the reading one can end up with a corrupt file?
PROPOSED SOLUTIONS:
1. a possible solution I thought of is using filelock:
code to write:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'wb') as f:
        pickle.dump((x, y), f)
code to read:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'rb') as f:
        x, y = pickle.load(f)
this solution should work (??), but if a process crashes, the file remains blocked until "file_path + '.lock'" is deleted (see the first sketch after this question)
2. another solution could be to use portalocker
code to write:
with open(path + r'\final_db.p', 'wb') as f:
    portalocker.lock(f, portalocker.LOCK_EX)
    pickle.dump((x, y), f)
code to read:
segnale = True
while segnale:
    try:
        with open(path + r'\final_db.p', 'rb') as f:
            x, y = pickle.load(f)
        segnale = False
    except:
        pass
the reading process, if another process started writing before it, will keep looping until the file is unlocked (the except catches the PermissionError).
if the writing process started after the reading process, the reading should loop as long as the file is corrupt.
what I am not sure about is whether the reading process could end up reading a partially written file (see the second sketch below).
Any advice? better solutions?
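Two minimal sketches addressing the two worries above (editorial additions, not from the original post; they reuse file_path from the snippets above and assume the filelock package's Timeout exception and portalocker's shared-lock flag LOCK_SH, both of which the libraries document):
import pickle
import filelock
import portalocker

# Sketch 1: a finite timeout instead of timeout=-1, so a reader is not
# blocked forever by a stale .lock left behind by a crashed process.
lock = filelock.FileLock(file_path + '.lock', timeout=10)
try:
    with lock:
        with open(file_path, 'rb') as f:
            x, y = pickle.load(f)
except filelock.Timeout:
    pass  # handle the busy/stale-lock case explicitly

# Sketch 2: take a shared lock while reading; the call blocks while a
# writer holds LOCK_EX, so the reader cannot see a half-written file.
with open(file_path, 'rb') as f:
    portalocker.lock(f, portalocker.LOCK_SH)
    x, y = pickle.load(f)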

Unified opening of .csv and .csv.gz files

In Python 2.7, I would like to open a file and do some manipulations with it. The problem is that I do not know beforehand if it has a .csv or a .csv.gz extension. If I knew it was .csv, I would do
with open(filename, "r") as f_in:
    do something
If I knew it was .csv.gz, I could say
import gzip
with gzip.open(filename, "r") as f_in:
    do something
I am curious if there is a way to avoid repetition after figuring out the file extension:
def find_ext(filename):
    return filename.split(".")[-1]

ext = find_ext(filename)
if ext == "csv":
    with open(filename, "r") as f_in:
        do something
elif ext == "gz":
    import gzip
    with gzip.open(filename, "r") as f_in:
        do something
I wouldn't bother looking at the file extension: files get renamed or use non-standard variations.
Instead, open the raw file and examine the header. If it begins with the two little-endian 32-bit words 0x00088b1f and 0, it is a gzip file (the first word covers the gzip magic bytes 1f 8b, the deflate method byte 08, and an empty flags byte; the second is the timestamp field, which this check assumes is zero).
import struct

with open(filename, 'rb') as f:
    v = f.read(8)
v1, v2 = struct.unpack('<II', v)  # two little-endian 32-bit words
if v1 == 0x00088b1f and v2 == 0:
    pass  # it is gzip
import gzip
import mimetypes

# mimetypes.guess_type() returns (type, encoding); the encoding is "gzip" for .gz files
smart_open = lambda fn: gzip.open(fn) if mimetypes.guess_type(fn)[1] == "gzip" else open(fn)

# usage:
f = smart_open("test.csv.gz")
f = smart_open("test.csv")
Since different libraries are used for the different cases, a check of the extension is required. But one can go with a construction like:
try:
    ...
except AnError:
    ...
else:
    ...
finally:
    ...
Either way, one construction or the other (if/else or try/finally) is necessary, since the file-opening step differs between the cases.
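For instance, a hedged EAFP variant of that try-based construction (a sketch for Python 2.7; gzip raises IOError on the first read of a non-gzip file, and GzipFile.rewind() returns to the start of the uncompressed stream):
import gzip

try:
    f_in = gzip.open(filename, "rb")
    f_in.read(1)   # force the gzip header/format check
    f_in.rewind()  # back to the start of the uncompressed data
except IOError:
    f_in = open(filename, "rb")
# do something with f_in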
To avoid repetition, use an approach like this (pseudocode):
def readcsv(file):
    ...

def readgzip(file):
    ...

if csv:
    readcsv(file)
elif gzip:
    readgzip(file)
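A runnable variant of the same idea (a sketch; the helper name open_csv is mine, chosen so the extension check happens exactly once):
import gzip

def open_csv(filename):
    # pick the opener once, based on the extension
    opener = gzip.open if filename.endswith(".gz") else open
    return opener(filename, "rb")

# usage: the "do something" part is written only once
with open_csv(filename) as f_in:
    for line in f_in:
        pass  # do something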

How to get python to successfully download large images from the internet

So I've been using
urllib.request.urlretrieve(URL, FILENAME)
to download images off the internet. It works great, but fails on some images. The ones it fails on seem to be the larger ones, e.g. http://i.imgur.com/DEKdmba.jpg. It downloads them fine, but when I try to open these files Photo Viewer gives me the error "Windows Photo Viewer can't open this picture because the file appears to be damaged, corrupted or too large".
What might be the reason it can't download these, and how can I fix this?
EDIT: after looking further, I don't think the problem is large images - it manages to download larger ones. It just seems to be some random ones that it can never download whenever I run the script again. Now I'm even more confused.
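One hedged way to at least detect the bad downloads before opening them (an editorial sketch, not from the answers below; it compares the bytes on disk against the Content-Length header that urlretrieve returns):
import os
import urllib.request

path, headers = urllib.request.urlretrieve(URL, FILENAME)
expected = int(headers.get('Content-Length', -1))
if expected >= 0 and os.stat(path).st_size != expected:
    raise IOError('truncated download: got %d of %d bytes'
                  % (os.stat(path).st_size, expected))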
In the past, I have used this code for copying from the internet. I have had no trouble with large files.
import urllib2

def download(url):
    file_name = raw_input("Name: ")
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    file_size_dl = 0
    block_size = 8192
    while True:
        buffer = u.read(block_size)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)  # write each block to disk as it arrives
    f.close()
Here's the sample code for Python 3 (tested on Windows 7):
import urllib.request

def download_very_big_image():
    url = 'http://i.imgur.com/DEKdmba.jpg'
    filename = 'C://big_image.jpg'
    conn = urllib.request.urlopen(url)
    output = open(filename, 'wb')  # binary flag needed on Windows
    output.write(conn.read())
    output.close()
For completeness' sake, here's the equivalent code in Python 2:
import urllib2

def download_very_big_image():
    url = 'http://i.imgur.com/DEKdmba.jpg'
    filename = 'C://big_image.jpg'
    conn = urllib2.urlopen(url)
    output = open(filename, 'wb')  # binary flag needed on Windows
    output.write(conn.read())
    output.close()
This should work: use the requests module:
import requests

img_url = 'http://i.imgur.com/DEKdmba.jpg'
img_name = img_url.split('/')[-1]
img_data = requests.get(img_url).content
with open(img_name, 'wb') as handler:
    handler.write(img_data)

How to read from a text file compressed with 7z?

I would like to read (in Python 2.7), line by line, from a csv (text) file, which is 7z compressed. I don't want to decompress the entire (large) file, but to stream the lines.
I tried pylzma.decompressobj() unsuccessfully. I get a data error. Note that this code doesn't yet read line by line:
import pylzma

input_filename = r"testing.csv.7z"
with open(input_filename, 'rb') as infile:
    obj = pylzma.decompressobj()
    o = open('decompressed.raw', 'wb')
    while True:
        tmp = infile.read(1)
        if not tmp: break
        o.write(obj.decompress(tmp))
    o.close()
Output:
    o.write(obj.decompress(tmp))
ValueError: data error during decompression
This will allow you to iterate the lines. It's partially derived from some code I found in an answer to another question.
At this point in time (pylzma-0.5.0) the py7zlib module doesn't implement an API that would allow archive members to be read as a stream of bytes or characters — its ArchiveFile class only provides a read() function that decompresses and returns the uncompressed data in a member all at once. Given that, about the best that can be done is return bytes or lines iteratively via a Python generator using that as a buffer.
The following does the latter, but may not help if the problem is the archive member file itself is huge.
The code below should work in Python 3.x as well as 2.7.
import io
import os

import py7zlib

class SevenZFileError(py7zlib.ArchiveError):
    pass

class SevenZFile(object):
    @classmethod
    def is_7zfile(cls, filepath):
        """ Determine if filepath points to a valid 7z archive. """
        is7z = False
        fp = None
        try:
            fp = open(filepath, 'rb')
            archive = py7zlib.Archive7z(fp)
            _ = len(archive.getnames())
            is7z = True
        finally:
            if fp: fp.close()
        return is7z

    def __init__(self, filepath):
        fp = open(filepath, 'rb')
        self.filepath = filepath
        self.archive = py7zlib.Archive7z(fp)

    def __contains__(self, name):
        return name in self.archive.getnames()

    def readlines(self, name, newline=''):
        r""" Iterator of lines from named archive member.

        `newline` controls how line endings are handled.
        It can be None, '', '\n', '\r', and '\r\n' and works the same way as it does
        in StringIO. Note however that the default value is different and is to enable
        universal newlines mode, but line endings are returned untranslated.
        """
        archivefile = self.archive.getmember(name)
        if not archivefile:
            raise SevenZFileError('archive member %r not found in %r' %
                                  (name, self.filepath))
        # Decompress entire member and return its contents iteratively.
        data = archivefile.read().decode()
        for line in io.StringIO(data, newline=newline):
            yield line
if __name__ == '__main__':
    import csv

    if SevenZFile.is_7zfile('testing.csv.7z'):
        sevenZfile = SevenZFile('testing.csv.7z')
        if 'testing.csv' not in sevenZfile:
            print('testing.csv is not a member of testing.csv.7z')
        else:
            reader = csv.reader(sevenZfile.readlines('testing.csv'))
            for row in reader:
                print(', '.join(row))
If you were using Python 3.3+, you might be able to do this using the lzma module, which was added to the standard library in that version.
See: lzma Examples
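For illustration, a minimal sketch of that route (with the caveat that the lzma module reads raw .xz/.lzma streams, not .7z archives, so the data would have to be recompressed in one of those formats):
import lzma

# stream-decompress and iterate lines without loading the whole file
with lzma.open('testing.csv.xz', 'rt') as f:
    for line in f:
        pass  # process each csv line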
If you can use Python 3, there is a useful library, py7zr, which supports partial 7zip decompression, as below:
import py7zr
import re

filter_pattern = re.compile(r'<your/target/file_and_directories/regex/expression>')
with py7zr.SevenZipFile('archive.7z', 'r') as archive:
    allfiles = archive.getnames()
    selective_files = [f for f in allfiles if filter_pattern.match(f)]
    archive.extract(targets=selective_files)
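And, for the original line-by-line goal, a hedged follow-up sketch using py7zr's read() API (which returns a {name: BytesIO} mapping for the selected members):
import io

import py7zr

with py7zr.SevenZipFile('testing.csv.7z', 'r') as archive:
    member = archive.read(targets=['testing.csv'])['testing.csv']
for line in io.TextIOWrapper(member, encoding='utf-8'):
    pass  # process each line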
