Write file from cStringIO - python

I am trying to write a cStringIO buffer to disk. The buffer may represent a PDF, image, or HTML file.
The approach I took seems a bit wonky, so I am open to alternative approaches as a solution as well.
def copyfile(self, destfilepath):
    if self.datastream.tell() == 0:
        raise Exception("Exception: Attempt to copy empty buffer.")
    with open(destfilepath, 'wb') as fp:
        shutil.copyfileobj(self.datastream, fp)
    self.__datastream__.close()

@property
def datastream(self):
    return self.__datastream__

# ... inside the func that sets __datastream__
while True:
    buffer = response.read(block_sz)
    self.__datastream__.write(buffer)
    if not buffer:
        break
# ... etc ...

test = Downloader()
ok = test.getfile(test_url)
if ok:
    test.copyfile(save_path)
I took this approach because I don't want to start writing data to disk until I know I have successfully read the entire file and that it's a type I am interested in.
After calling copyfile() the file on disk is always zero bytes.

Whoops!
I forgot to reset the stream position before trying to read from it; so it was reading from the end, hence the zero bytes. Moving the cursor to the beginning resolves the issue.
def copyfile(self, destfilepath):
    if self.datastream.tell() == 0:
        raise Exception("Exception: Attempt to copy empty buffer.")
    self.__datastream__.seek(0)  # <-- RESET POSITION TO BEGINNING
    with open(destfilepath, 'wb') as fp:
        shutil.copyfileobj(self.datastream, fp)
    self.__datastream__.close()
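For anyone hitting the same symptom, here is a minimal sketch (using io.BytesIO, the modern counterpart of cStringIO; the output file name is just illustrative) of why the seek matters: shutil.copyfileobj copies from the source's current position, which sits at the end of the buffer right after writing.
import io
import shutil

buf = io.BytesIO()
buf.write(b"some downloaded bytes")  # position is now at the end of the buffer

with open("out.bin", "wb") as fp:
    buf.seek(0)                      # rewind first, otherwise zero bytes are copied
    shutil.copyfileobj(buf, fp)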

Related

pickle.dump dumps nothing when appending to file

The user may give a bunch of URLs as command-line arguments. All URLs given in the past are serialized with pickle. The script checks all given URLs; if they are unique, they are serialized and appended to a file. At least, that's what should be happening. Nothing is being appended, though. However, when I open the file in write mode, the new, unique URL is written. So what gives? Code is:
def get_new_urls():
    if(len(urls.URLs) != 0): # check if empty
        with open(urlFile, 'rb') as f:
            try:
                cereal = pickle.load(f)
                print(cereal)
                toDump = []
                for arg in urls.URLs:
                    if (arg in cereal):
                        print("Duplicate URL {0} given, ignoring it.".format(arg))
                    else:
                        toDump.append(arg)
            except Exception as e:
                print("Holy bleep something went wrong: {0}".format(e))
    return(toDump)

urlsToDump = get_new_urls()
print(urlsToDump)

# TODO: append new URLs
if(urlsToDump):
    with open(urlFile, 'ab') as f:
        pickle.dump(urlsToDump, f)

# TODO check HTML of each page against the serialized copy
with open(urlFile, 'rb') as f:
    try:
        cereal = pickle.load(f)
        print(cereal)
    except EOFError: # your URL file is empty, bruh
        pass
Pickle writes out the data you give it in a self-contained format: each call to pickle.dump produces a complete stream with its own header and payload. Appending a second dump to the file does not merge it with the first one; a single pickle.load only reads the first object, so anything appended after it is never seen, which is why nothing new appears to be there. To achieve a concatenation of your data, you'd need to first read whatever is in the file into your urlsToDump, then update urlsToDump with any new data, and then finally dump it out again (overwriting the whole file, not appending).
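A hedged sketch of that read/update/overwrite cycle (the function name and variables are made up for illustration):
import os
import pickle

def merge_and_save(new_urls, url_file):
    """Load previously saved URLs, merge in the new ones, rewrite the whole file."""
    known = []
    if os.path.exists(url_file):
        with open(url_file, 'rb') as f:
            try:
                known = pickle.load(f)
            except EOFError:
                pass  # file exists but is empty
    merged = known + [u for u in new_urls if u not in known]
    with open(url_file, 'wb') as f:  # 'wb', not 'ab': overwrite with a single pickle
        pickle.dump(merged, f)
    return merged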
After
with open(urlFile, 'rb') as f:
you need a while loop that repeatedly unpickles (repeatedly calls pickle.load on) the file until it hits EOF, since each appended dump has to be loaded separately.
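Something along these lines (a sketch; it assumes each earlier pickle.dump wrote one list of URLs):
all_urls = []
with open(urlFile, 'rb') as f:
    while True:
        try:
            all_urls.extend(pickle.load(f))  # one load per dump that was appended
        except EOFError:
            break  # reached the end of the file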

downloading a large file in chunks with gzip encoding (Python 3.4)

If I make a request for a file and specify encoding of gzip, how do I handle that?
Normally when I have a large file I do the following:
while True:
    chunk = resp.read(CHUNK)
    if not chunk: break
    writer.write(chunk)
    writer.flush()
where CHUNK is some size in bytes, writer is a file object returned by open(), and resp is the response generated from a urllib request.
So it's pretty simple most of the time. When the response header contains 'gzip' as the returned encoding, I would do the following:
decomp = zlib.decompressobj(16+zlib.MAX_WBITS)
data = decomp.decompress(resp.read())
writer.write(data)
writer.flush()
or this:
f = gzip.GzipFile(fileobj=buf)
writer.write(f.read())
where the buf is a BytesIO().
If I try to decompress the gzip response though, I am getting issues:
while True:
    chunk = resp.read(CHUNK)
    if not chunk: break
    decomp = zlib.decompressobj(16+zlib.MAX_WBITS)
    data = decomp.decompress(chunk)
    writer.write(data)
    writer.flush()
Is there a way I can decompress the gzip data as it comes down in little chunks, or do I need to write the whole file to disk, decompress it, and then move it to the final file name? Part of the issue, since I am using 32-bit Python, is that I can get out-of-memory errors.
Thank you
I think I found a solution that I wish to share.
import zlib

def _chunk(response, size=4096):
    """ downloads a web response in pieces """
    method = response.headers.get("content-encoding")
    if method == "gzip":
        d = zlib.decompressobj(16+zlib.MAX_WBITS)
        b = response.read(size)
        while b:
            data = d.decompress(b)
            yield data
            b = response.read(size)
        del data
    else:
        while True:
            chunk = response.read(size)
            if not chunk: break
            yield chunk
If anyone has a better solution, please add to it. Basically my error was where I created the zlib.decompressobj(): I was creating a new one for every chunk instead of one per response.
This seems to work in both Python 2 and 3 as well, so that is a plus.
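For reference, the generator slots into the original writing loop roughly like this (out_path is just a placeholder for wherever the file is being saved; resp is the response object as before):
with open(out_path, "wb") as writer:
    for data in _chunk(resp, size=4096):
        writer.write(data)
    writer.flush()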

Maya Python - Embed zip file into maya file?

This was a suggestion from another Stack Overflow thread that I'm finally getting back to. It was part of a discussion about how to embed tools into a Maya file.
You can write the whole thing as a Python package, zip it, then stuff the binary contents of the zip file into a fileInfo. When you need the code, look for it in the user's $MAYA_APP_DIR; if there's no zip there, write the contents of the fileInfo to disk as a zip and then insert the zip into sys.path.
Source discussions were:
python copy scripts into script
and Maya Python Create and Use Zipped Package?
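A hedged sketch of that bootstrap idea (the fileInfo key, file name and base64 encoding below are illustrative choices on my part, not anything Maya mandates):
import base64
import os
import sys
import maya.cmds as cmds

def bootstrap_tools(zip_name="myTools.zip", info_key="myToolsZip"):
    """If the tools zip is missing on disk, restore it from fileInfo, then add it to sys.path."""
    zip_path = os.path.join(os.environ["MAYA_APP_DIR"], zip_name)
    if not os.path.isfile(zip_path):
        stored = cmds.fileInfo(info_key, q=True)
        if not stored:
            return None  # nothing embedded in this scene
        with open(zip_path, "wb") as f:
            f.write(base64.b64decode(stored[0]))
    if zip_path not in sys.path:
        sys.path.append(zip_path)  # zipimport allows importing packages straight from the zip
    return zip_path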
So far the programming is going okay, but I think I hit a snag. When I attempt this:
with open("..directory/myZip.zip","rb") as file:
cmds.fileInfo("myZip", file.read())
..and then I..
print cmds.fileInfo("myZip",q=1)
I get
[u'PK\003\004\024']
which is a bad translation of the first line of gibberish when reading the zip file as a text document.
How can I embed my zip file into my maya file as binary?
====================
Update:
Maya doesn't like writing to the file as a direct read of the utf-8 encoded zip file. I found various methods of making it into an acceptable string that could be written, but the decoding back to the file didn't appear to work. I see now that Theodox's suggestion was to write it to binary and put that in the fileInfo node.
How can one encode, store and then decode to write to file later?
If I were to convert to binary using for instance:
' '.join(format(ord(x), 'b') for x in line)
What code would I need to turn that back into the original utf-8 zip information?
You can find related code here:
http://tech-artists.org/forum/showthread.php?4161-Maya-API-Singleton-Nodes&highlight=mayapersist
the relevant bit is
import base64
encoded = base64.b64encode(value)
decoded = base64.b64decode(encoded)
Basically it's the same idea, except using the base64 module instead of binascii. Any method of converting an arbitrary character stream to an ASCII-safe representation will work fine, as long as you use a reversible method: the potential problem you need to watch out for is a character in the data block that looks to Maya like a close quote - an open quote in the fileInfo will be messy in an MA file.
This example uses YAML to do arbitrary key-value pairs, but that part is irrelevant to storing the binary stuff. I've used this technique for fairly large data (up to 640k if I recall), but I don't know if it has an upper limit in terms of what you can stash in Maya.
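Put together, the base64 roundtrip might look like this sketch (the fileInfo key and file paths are placeholders; it assumes Python 2 era Maya, where b64encode returns a plain str that fileInfo will accept):
import base64
import maya.cmds as cmds

# store: read the zip as bytes and stash an ASCII-safe copy in the scene
with open("myZip.zip", "rb") as f:
    cmds.fileInfo("myZip", base64.b64encode(f.read()))

# restore: decode the stored string back to bytes and write a fresh zip
stored = cmds.fileInfo("myZip", q=True)[0]
with open("restored.zip", "wb") as f:
    f.write(base64.b64decode(stored))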
Found the answer, with the help of a great script from Stack Overflow. I had to encode to 'string_escape', which is something I found while trying to figure out the whole character situation. But anyway: you open the zip, encode to 'string_escape', write it to fileInfo, and then, before fetching it to write it back out as a zip, you decode it back.
Convert binary to ASCII and vice versa
import maya.cmds as cmds
import binascii

def text_to_bits(text, encoding='utf-8', errors='surrogatepass'):
    bits = bin(int(binascii.hexlify(text.encode(encoding, errors)), 16))[2:]
    return bits.zfill(8 * ((len(bits) + 7) // 8))

def text_from_bits(bits, encoding='utf-8', errors='surrogatepass'):
    n = int(bits, 2)
    return int2bytes(n).decode(encoding, errors)

def int2bytes(i):
    hex_string = '%x' % i
    n = len(hex_string)
    return binascii.unhexlify(hex_string.zfill(n + (n & 1)))
And then you can
with open("..\maya\scripts/test.zip","rb") as thing:
texty = text_to_bits(thing.read().encode('string_escape'))
cmds.fileInfo("binaryZip",texty)
...later
with open("..\maya\scripts/test_2.zip","wb") as thing:
texty = cmds.fileInfo("binaryZip",q=1)
thing.write( text_from_bits( texty ).decode('string_escape') )
and this appears to work.. so far..
Figured I'd post the final product for anyone wanting to undertake this approach. I tried to do some corruption checking so that a bad zip doesn't get passed between machines; that's what all the hash checking is for.
def writeTimeFull(tl):
    import TimeFull
    #reload(TimeFull)
    with open(TimeFull.__file__.replace(".pyc", ".py"), "r") as file:
        cmds.scriptNode(tl.scriptConnection[1][0], e=1, bs=file.read())
    cmds.expression("spark_timeliner_activator",
                    e=1, s='if (Spark_Timeliner.ShowTimeliner == 1)\n'
                           '{\n'
                           '\tsetAttr Spark_Timeliner.ShowTimeliner 0;\n'
                           '\tpython \"Timeliner.InitTimeliner()\";\n'
                           '}',
                    o="Spark_Timeliner", ae=1, uc=all)
def checkHash(zipPath, hash1, hash2, hash3):
    check = False
    hashes = [hash1, hash2, hash3]
    for ii, hash in enumerate(hashes):
        if hash == hashes[(ii+1)%3]:
            hashes[(ii+2)%3] = hashes[ii]
            check = True
    if check:
        if md5(zipPath) == hashes[0]:
            return [zipPath, hashes[0], hashes[1], hashes[2]]
        else:
            cmds.warning("Hash checks and/or zip are corrupted. Attain toolbox_fix.zip, put it in scripts folder and restart.")
    return []
#this writes the zip file to the local user's space
def saveOutZip(filename):
    if os.path.isfile(filename):
        if not os.path.isfile(filename.replace('_pkg', '_'+__version__)):
            os.rename(filename, filename.replace('_pkg', '_'+__version__))
    with open(filename, "w") as zipFile:
        zipInfo = cmds.fileInfo("zipInfo1", q=1)[0]
        zipHash_1 = cmds.fileInfo("zipHash1", q=1)[0]
        zipHash_2 = cmds.fileInfo("zipHash2", q=1)[0]
        zipHash_3 = cmds.fileInfo("zipHash3", q=1)[0]
        zipFile.write(base64.b64decode(zipInfo))
    if checkHash(filename, zipHash_1, zipHash_2, zipHash_3):
        cmds.fileInfo("zipInfo2", zipInfo)
        return filename
    with open(filename, "w") as zipFile:
        zipInfo = cmds.fileInfo("zipInfo2", q=1)[0]
        zipHash_1 = cmds.fileInfo("zipHash1", q=1)[0]
        zipHash_2 = cmds.fileInfo("zipHash2", q=1)[0]
        zipHash_3 = cmds.fileInfo("zipHash3", q=1)[0]
        zipFile.write(base64.b64decode(zipInfo))
    if checkHash(filename, zipHash_1, zipHash_2, zipHash_3):
        cmds.fileInfo("zipInfo1", zipInfo)
        return filename
    return False
#this writes the local zip to this file
def loadInZip(filename):
    zipResults = []
    for ii in range(0, 10):
        with open(filename, "r") as theRead:
            zipResults.append([base64.b64encode(theRead.read())] + checkHash(filename, md5(filename), md5(filename), md5(filename)))
        if ii > 0 and zipResults[ii] == zipResults[ii-1]:
            cmds.fileInfo("zipInfo1", zipResults[ii][0])
            cmds.fileInfo("zipInfo2", zipResults[ii-1][0])
            cmds.fileInfo("zipHash1", zipResults[ii][2])
            cmds.fileInfo("zipHash2", zipResults[ii][3])
            cmds.fileInfo("zipHash3", zipResults[ii][4])
            return True
#file check
#http://stackoverflow.com/questions/3431825/generating-a-md5-checksum-of-a-file
def md5(fname):
    import hashlib
    hash = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash.update(chunk)
    return hash.hexdigest()
filename = path+'/toolbox_pkg.zip'
zipPaths = [path+'/toolbox_update.zip',
            path+'/toolbox_fix.zip',
            path+'/toolbox_'+__version__+'.zip',
            filename]
zipPaths_exist = [os.path.isfile(zipPath) for zipPath in zipPaths]

if any(zipPaths_exist[:2]):
    if zipPaths_exist[0]:
        cmds.warning('Timeliner update present. Forcing file to update version')
        if zipPaths_exist[2]:
            os.remove(zipPaths[3])
        elif os.path.isfile(zipPaths[3]):
            os.rename(zipPaths[3], zipPaths[2])
        os.rename(zipPaths[0], zipPaths[3])
        if zipPaths_exist[1]:
            os.remove(zipPaths[1])
    else:
        cmds.warning('Timeliner fix present. Replacing file to the fix version.')
        if os.path.isfile(zipPaths[3]):
            os.remove(zipPaths[3])
        os.rename(zipPaths[1], zipPaths[3])
    loadInZip(filename)

if not cmds.fileInfo("zipInfo1", q=1) and not cmds.fileInfo("zipInfo2", q=1):
    loadInZip(filename)

if not os.path.isfile(filename):
    saveOutZip(filename)

sys.path.append(filename)
import Timeliner
Timeliner.InitTimeliner(theVers=__version__)

if not any(zipPaths[:2]):
    if __version__ > Timeliner.__version__:
        cmds.warning('Saving out newer version of timeliner to local machine. Restart Maya to access latest version.')
        saveOutZip(filename)
    elif __version__ < Timeliner.__version__:
        cmds.warning('Timeliner on machine is newer than file version. Saving machine version over timeliner in file.')
        loadInZip(filename)
        __version__ = Timeliner.__version__

if __name__ != "__main__":
    tl = getTimeliner()
    writeTimeFull(tl)

python: read file continuously, even after it has been logrotated

I have a simple Python script where I read a logfile continuously (same as tail -f):
while True:
    line = f.readline()
    if line:
        print line,
    else:
        time.sleep(0.1)
How can I make sure that I can still read the logfile after it has been rotated by logrotate?
i.e. I need to do the same as what tail -F would do.
I am using Python 2.7.
As long as you only plan to do this on Unix, the most robust way is probably to check that the open file still refers to the same inode as the name, and reopen it when that is no longer the case. You can get the inode number of the file from os.stat and os.fstat, in the st_ino field.
It could look like this:
import os, sys, time

name = "logfile"
current = open(name, "r")
curino = os.fstat(current.fileno()).st_ino
while True:
    while True:
        buf = current.read(1024)
        if buf == "":
            break
        sys.stdout.write(buf)
    try:
        if os.stat(name).st_ino != curino:
            new = open(name, "r")
            current.close()
            current = new
            curino = os.fstat(current.fileno()).st_ino
            continue
    except IOError:
        pass
    time.sleep(1)
I doubt this works on Windows, but since you're speaking in terms of tail, I'm guessing that's not a problem. :)
You can do it by keeping track of where you are in the file and reopening it when you want to read. When the log file rotates, you notice that the file is smaller and, since you reopen it each time, you handle any unlinking too.
import time

cur = 0
while True:
    try:
        with open('myfile') as f:
            f.seek(0, 2)
            if f.tell() < cur:
                f.seek(0, 0)
            else:
                f.seek(cur, 0)
            for line in f:
                print line.strip()
            cur = f.tell()
    except IOError, e:
        pass
    time.sleep(1)
This example hides errors like file not found because I'm not sure of logrotate details such as small periods of time where the file is not available.
NOTE: In Python 3, things are different. A regular open translates bytes to str, and the interim buffer used for that conversion means that seek and tell don't operate properly (except when seeking to 0 or to the end of file). Instead, open in binary mode ("rb") and do the decode manually, line by line. You'll have to know the file encoding and what that encoding's newline looks like. For utf-8, it's b"\n" (one of the reasons utf-8 is superior to utf-16, btw).
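A sketch of that Python 3 variant of the loop above, under the assumption that the log is utf-8 with \n line endings:
import time

cur = 0
while True:
    try:
        with open('myfile', 'rb') as f:   # binary mode: tell()/seek() behave predictably
            f.seek(0, 2)
            if f.tell() < cur:            # file shrank, so it was rotated/truncated
                f.seek(0, 0)
            else:
                f.seek(cur, 0)
            while True:
                raw = f.readline()
                if not raw:
                    break
                print(raw.decode('utf-8').rstrip('\n'))
            cur = f.tell()
    except OSError:
        pass
    time.sleep(1)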
Thanks to @tdelaney's and @Dolda2000's answers, I ended up with what follows. It should work on both Linux and Windows, and also handles logrotate's copytruncate and create options (respectively: copy then truncate to size 0, and move then recreate the file).
file_name = 'my_log_file'
seek_end = True
while True:  # handle moved/truncated files by allowing to reopen
    with open(file_name) as f:
        if seek_end:  # reopened files must not seek end
            f.seek(0, 2)
        while True:  # line reading loop
            line = f.readline()
            if not line:
                try:
                    if f.tell() > os.path.getsize(file_name):
                        # rotation occurred (copytruncate/create)
                        f.close()
                        seek_end = False
                        break
                except FileNotFoundError:
                    # rotation occurred but new file still not created
                    pass  # wait 1 second and retry
                time.sleep(1)
            do_stuff_with(line)
A limitation when using the copytruncate option is that if lines are appended to the file while we are sleeping, and rotation occurs before we wake up, the last lines will be "lost" (they will still be in the now-"old" log file, but I cannot see a decent way to "follow" that file to finish reading it). This limitation is not relevant with the "move and create" create option, because the f descriptor will still point to the renamed file, and therefore the last lines will be read before the descriptor is closed and opened again.
Using tail -F
From man tail:
  -F                          same as --follow=name --retry
  -f, --follow[={name|descriptor}]
                              output appended data as the file grows
  --retry                     keep trying to open a file if it is inaccessible
The -F option will follow the name of the file, not the descriptor. So when logrotate happens, it will follow the new file.
import subprocess
from typing import Generator

def tail(filename: str) -> Generator[str, None, None]:
    proc = subprocess.Popen(["tail", "-F", filename], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    while True:
        line = proc.stdout.readline()
        if line:
            yield line.decode("utf-8")
        else:
            break
for line in tail("/config/logs/openssh/current"):
    print(line.strip())
I made a variation of the awesome answer above by @pawamoy, turning it into a generator function for my log monitoring and following needs.
def tail_file(file):
    """generator function that yields new lines in a file

    :param file: file path as a string
    :type file: str
    :rtype: collections.Iterable
    """
    seek_end = True
    while True:  # handle moved/truncated files by allowing to reopen
        with open(file) as f:
            if seek_end:  # reopened files must not seek end
                f.seek(0, 2)
            while True:  # line reading loop
                line = f.readline()
                if not line:
                    try:
                        if f.tell() > os.path.getsize(file):
                            # rotation occurred (copytruncate/create)
                            f.close()
                            seek_end = False
                            break
                    except FileNotFoundError:
                        # rotation occurred but new file still not created
                        pass  # wait 1 second and retry
                    time.sleep(1)
                yield line
It can then be used easily, as below:
import os, time

access_logfile = '/var/log/syslog'
loglines = tail_file(access_logfile)
for line in loglines:
    print(line)

Open a file in memory

(I'm working on a Python 3.4 project.)
There's a way to open a (sqlite3) database in memory:
with sqlite3.connect(":memory:") as database:
Does such a trick exist for the open() function? Something like:
with open(":file_in_memory:") as myfile:
The idea is to speed up some test functions that open/read/write some short files on disk; is there a way to be sure that these operations occur in memory?
How about StringIO:
import StringIO
output = StringIO.StringIO()
output.write('First line.\n')
print >>output, 'Second line.'
# Retrieve file contents -- this will be
# 'First line.\nSecond line.\n'
contents = output.getvalue()
# Close object and discard memory buffer --
# .getvalue() will now raise an exception.
output.close()
python3: io.StringIO
There is something similar for file-like input/output to or from a string in io.StringIO.
There is no clean way to add this kind of special-name processing to the normal file open, but since Python is dynamic you can monkey-patch the standard open procedure to handle this case.
For example:
from io import StringIO

old_open = open
in_memory_files = {}

def open(name, mode="r", *args, **kwargs):
    if name[:1] == ":" and name[-1:] == ":":
        # in-memory file
        if "w" in mode:
            in_memory_files[name] = ""
        f = StringIO(in_memory_files[name])
        oldclose = f.close
        def newclose():
            in_memory_files[name] = f.getvalue()
            oldclose()
        f.close = newclose
        return f
    else:
        return old_open(name, mode, *args, **kwargs)
After that, you can write:
f = open(":test:", "w")
f.write("This is a test\n")
f.close()
f = open(":test:")
print(f.read())
Note that this example is very minimal and doesn't handle all real file modes (e.g. append mode, or raising the proper exception when opening a non-existent in-memory file in read mode), but it may work for simple cases.
Note also that all in-memory files will remain in memory forever (unless you also patch unlink).
PS: I'm not saying that monkey-patching standard open or StringIO instances is a good idea, just that you can :-D
PS2: This kind of problem is solved better at the OS level by creating an in-RAM disk. With that you can even call external programs, redirecting their output or input to those files, and you also get full support including concurrent access, directory listings and so on.
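As a hedged illustration of that OS-level route: on most Linux systems /dev/shm is a tmpfs (RAM-backed) mount, so "ordinary" files created there live in memory and are visible to external programs too (the path check and fallback below are just illustrative):
import os
import tempfile

# /dev/shm is typically a RAM-backed tmpfs on Linux; fall back to the normal temp dir elsewhere
ram_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()

path = os.path.join(ram_dir, "scratch.txt")
with open(path, "w") as f:
    f.write("First line.\n")
with open(path) as f:
    print(f.read())
os.remove(path)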
io.StringIO provides an in-memory file implementation you can use to simulate a real file. Example from the documentation:
import io
output = io.StringIO()
output.write('First line.\n')
print('Second line.', file=output)
# Retrieve file contents -- this will be
# 'First line.\nSecond line.\n'
contents = output.getvalue()
# Close object and discard memory buffer --
# .getvalue() will now raise an exception.
output.close()
In Python 2, this class is available instead as StringIO.StringIO.
