Reading a binary file in Python (pickle) [duplicate]

I created some data and stored it several times like this:
with open('filename', 'a') as f:
    pickle.dump(data, f)
Every time, the size of the file increased, but when I open the file with
with open('filename', 'rb') as f:
    x = pickle.load(f)
I can see only the data from the last time. How can I read the file correctly?

Pickle serializes a single object at a time and reads back a single object; the pickled data is recorded sequentially in the file.
If you simply call pickle.load once, you get the first object serialized into the file (not the last one, as you've written). After deserializing the first object, the file pointer is at the beginning of the next object; if you call pickle.load again, it will read that next object. Do that until the end of the file. Note that the file should be opened in binary mode ('ab' for writing, 'rb' for reading):
import pickle

objects = []
with open("myfile", "rb") as openfile:
    while True:
        try:
            objects.append(pickle.load(openfile))
        except EOFError:
            break

There is a read_pickle function in pandas 0.22+; note that, like a single pickle.load call, it reads one pickled object from the file:
import pandas as pd
obj = pd.read_pickle(r'filepath')

The following is an example of how you might write and read a pickle file. Note that if you keep appending pickle data to the file, you will need to keep reading from it until you find what you want or an EOFError is raised at the end of the file. That is what the last function does.
import os
import pickle

PICKLE_FILE = 'pickle.dat'

def main():
    # append data to the pickle file
    add_to_pickle(PICKLE_FILE, 123)
    add_to_pickle(PICKLE_FILE, 'Hello')
    add_to_pickle(PICKLE_FILE, None)
    add_to_pickle(PICKLE_FILE, b'World')
    add_to_pickle(PICKLE_FILE, 456.789)
    # load & show all stored objects
    for item in read_from_pickle(PICKLE_FILE):
        print(repr(item))
    os.remove(PICKLE_FILE)

def add_to_pickle(path, item):
    with open(path, 'ab') as file:
        pickle.dump(item, file, pickle.HIGHEST_PROTOCOL)

def read_from_pickle(path):
    with open(path, 'rb') as file:
        try:
            while True:
                yield pickle.load(file)
        except EOFError:
            pass

if __name__ == '__main__':
    main()
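Run under Python 3, this prints each stored object in the order it was appended:
123
'Hello'
None
b'World'
456.789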

I developed a software tool that opens (most) Pickle files directly in your browser (nothing is transferred so it's 100% private):
https://pickleviewer.com/ (formerly)
Now it's hosted here: https://fire-6dcaa-273213.web.app/
Edit: Available here if you want to host it somewhere: https://github.com/ch-hristov/Pickle-viewer
Feel free to host this somewhere.

Related

random/empty characters while re-editing a json file

I apologize for the vague description of my problem in the title, but I really can't figure out what sort of problem I'm dealing with. So, here goes.
I have python file:
edit-json.py
import os, json

def add_rooms(data):
    if not os.path.exists('rooms.json'):
        with open('rooms.json', 'w'): pass
    with open('rooms.json', 'r+') as f:
        d = f.read()       # take existing data from the file
        f.truncate(0)      # empty the json file
        if d == '':        # the file was just created, no data yet
            rooms = []
        else:
            rooms = json.loads(d)['rooms']
        rooms.append({'name': data['roomname'], 'active': 1})
        f.write(json.dumps({"rooms": rooms}))  # write the new rooms list to the json file

add_rooms({'roomname': 'friends'})
This Python script basically creates a file rooms.json (if it doesn't exist), grabs the data (an array) from the JSON file, empties the file, then finally writes the new data into it. All this is done in the function add_rooms(), which is then called at the end of the script; pretty simple stuff.
So, here's the problem: I run the script once and nothing weird happens, i.e. the file is created and the data inside it is:
{"rooms": [{"name": "friends"}]}
But the weird stuff happens when I run the script again.
What I should see:
{"rooms": [{"name": "friends"}, {"name": "friends"}]}
What I see instead: a run of strange characters before the JSON. I apologize, I had to post an image because for some reason I couldn't copy the text I got.
And I obviously can't run the script again (for the third time), because the JSON parser gives an error due to those characters.
I obtained this result in an online compiler. On my local Windows system, I get extra whitespace instead of those extra symbols.
I can't figure out what causes it. Maybe I'm doing the file handling incorrectly? Or is it due to the json module? Or am I the only one getting this result?
When you truncate the file, the file pointer is still at the (old) end of the file, so your write happens at that offset and the gap before it is filled with null bytes; those are the strange characters (or extra whitespace) you're seeing. Use f.seek(0) to move back to the start of the file before writing:
import os, json

def add_rooms(data):
    if not os.path.exists('rooms.json'):
        with open('rooms.json', 'w'): pass
    with open('rooms.json', 'r+') as f:
        d = f.read()       # take existing data from the file
        f.truncate(0)      # empty the json file
        f.seek(0)          # <<<<<<<<< add this line
        if d == '':        # the file was just created, no data yet
            rooms = []
        else:
            rooms = json.loads(d)['rooms']
        rooms.append({'name': data['roomname'], 'active': 1})
        f.write(json.dumps({"rooms": rooms}))  # write the new rooms list to the json file

add_rooms({'roomname': 'friends'})
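A side note: you can avoid the truncate/seek subtlety entirely by reading and rewriting in two separate opens, since 'w' mode truncates the file and positions the pointer at offset 0. A minimal sketch of that variant:
import os, json

def add_rooms(data):
    rooms = []
    if os.path.exists('rooms.json'):
        with open('rooms.json') as f:
            d = f.read()
            if d:
                rooms = json.loads(d)['rooms']
    rooms.append({'name': data['roomname'], 'active': 1})
    with open('rooms.json', 'w') as f:  # 'w' truncates and starts at offset 0
        f.write(json.dumps({"rooms": rooms}))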

pickle.dump dumps nothing when appending to file

The user may give a bunch of URLs as command-line arguments. All URLs given in the past are serialized with pickle. The script checks all given URLs; if they are unique, then they are serialized and appended to a file. At least that's what should be happening. Nothing is being appended. However, when I open the file in write mode, the new, unique URL is written. So what gives? Code is:
def get_new_urls():
    if len(urls.URLs) != 0:  # check if empty
        with open(urlFile, 'rb') as f:
            try:
                cereal = pickle.load(f)
                print(cereal)
                toDump = []
                for arg in urls.URLs:
                    if arg in cereal:
                        print("Duplicate URL {0} given, ignoring it.".format(arg))
                    else:
                        toDump.append(arg)
            except Exception as e:
                print("Holy bleep something went wrong: {0}".format(e))
    return(toDump)

urlsToDump = get_new_urls()
print(urlsToDump)

# TODO: append new URLs
if urlsToDump:
    with open(urlFile, 'ab') as f:
        pickle.dump(urlsToDump, f)

# TODO check HTML of each page against the serialized copy
with open(urlFile, 'rb') as f:
    try:
        cereal = pickle.load(f)
        print(cereal)
    except EOFError:  # your URL file is empty, bruh
        pass
Pickle writes out the data you give it in a special format, e.g. it writes some header/metadata and an end marker for each object dumped to the file.
It is not intended to work the way you're using it; appending a second pickle onto the file doesn't merge it into the first one. To achieve a concatenation of your data, you'd need to first read whatever is in the file into your urlsToDump, then update urlsToDump with any new data, and then finally dump it out again (overwriting the whole file, not appending).
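A minimal sketch of that read/update/overwrite approach, assuming (as in your code) that a single list of URLs lives in urlFile:
import pickle

def save_urls(urlFile, new_urls):
    try:
        with open(urlFile, 'rb') as f:
            known = pickle.load(f)  # read the existing list, if any
    except (IOError, EOFError):     # missing or empty file
        known = []
    known.extend(u for u in new_urls if u not in known)
    with open(urlFile, 'wb') as f:  # 'wb' overwrites the whole file
        pickle.dump(known, f)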
Alternatively, after
with open(urlFile, 'rb') as f:
you need a while loop, to repeatedly unpickle (repeatedly read) from the file until hitting EOF.
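A minimal sketch of that loop, assuming each pickle.dump wrote a list of URLs as in the code above:
cereal = []
with open(urlFile, 'rb') as f:
    while True:
        try:
            cereal.extend(pickle.load(f))  # each load returns one dumped list
        except EOFError:
            break  # no more pickled objects in the file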

Open a file in memory

(I'm working on a Python 3.4 project.)
There's a way to open a (sqlite3) database in memory:
with sqlite3.connect(":memory:") as database:
Does such a trick exist for the open() function? Something like:
with open(":file_in_memory:") as myfile:
The idea is to speed up some test functions opening/reading/writing some short files on disk; is there a way to be sure that these operations occur in memory ?
How about StringIO:
import StringIO
output = StringIO.StringIO()
output.write('First line.\n')
print >>output, 'Second line.'
# Retrieve file contents -- this will be
# 'First line.\nSecond line.\n'
contents = output.getvalue()
# Close object and discard memory buffer --
# .getvalue() will now raise an exception.
output.close()
Python 3: io.StringIO
There is something similar for file-like input/output to or from a string in io.StringIO.
There is no clean way to add URL-based processing to the normal file open, but since Python is dynamic you can monkey-patch the standard open procedure to handle this case.
For example:
from io import StringIO

old_open = open
in_memory_files = {}

def open(name, mode="r", *args, **kwargs):
    if name[:1] == ":" and name[-1:] == ":":
        # in-memory file
        if "w" in mode:
            in_memory_files[name] = ""
        f = StringIO(in_memory_files[name])
        oldclose = f.close
        def newclose():
            # save the buffer contents back before discarding the object
            in_memory_files[name] = f.getvalue()
            oldclose()
        f.close = newclose
        return f
    else:
        return old_open(name, mode, *args, **kwargs)
After that you can write:
f = open(":test:", "w")
f.write("This is a test\n")
f.close()
f = open(":test:")
print(f.read())
Note that this example is very minimal and doesn't handle all real file modes (e.g. append mode, or raising the proper exception when an in-memory file that doesn't exist is opened in read mode), but it may work for simple cases.
Note also that all in-memory files will remain in memory forever (unless you also patch unlink).
PS: I'm not saying that monkey-patching standard open or StringIO instances is a good idea, just that you can :-D
PS2: This kind of problem is better solved at the OS level by creating a RAM disk. With that you can even call external programs, redirecting their output or input from those files, and you get full support for concurrent access, directory listings and so on.
io.StringIO provides an in-memory file implementation you can use to simulate a real file. Example from the documentation:
import io
output = io.StringIO()
output.write('First line.\n')
print('Second line.', file=output)
# Retrieve file contents -- this will be
# 'First line.\nSecond line.\n'
contents = output.getvalue()
# Close object and discard memory buffer --
# .getvalue() will now raise an exception.
output.close()
In Python 2, this class is available instead as StringIO.StringIO.
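Note that io.StringIO handles text only. If the test files are binary (opened with 'b'), io.BytesIO plays the same role for bytes:
import io

output = io.BytesIO()
output.write(b'some binary data')
contents = output.getvalue()  # b'some binary data'
output.close()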

Why does tempfile.NamedTemporaryFile() truncate my data?

Here is a test I created to recreate a problem I was having when I used
tempfile.NamedTemporaryFile(). The problem is that when I use tempfile the
data in my CSV is truncated off the end of the file.
When you run this test script, temp2.csv will get truncated and temp1.csv
will be the same size as the original CSV.
I'm using Python 2.7.1.
You can download the sample CSV from http://explore.data.gov/Energy-and-Utilities/Residential-Energy-Consumption-Survey-RECS-Files-A/eypy-jxs2
#!/usr/bin/env python
import tempfile
import shutil

def main():
    f = open('RECS05alldata.csv')
    data = f.read()
    f.close()

    f = open('temp1.csv', 'w+b')
    f.write(data)
    f.close()

    temp = tempfile.NamedTemporaryFile()
    temp.write(data)
    shutil.copy(temp.name, 'temp2.csv')
    temp.close()

if __name__ == '__main__':
    main()
Add temp.flush() after temp.write(data).
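With the flush in place, the copy sees everything written so far:
temp = tempfile.NamedTemporaryFile()
temp.write(data)
temp.flush()  # push Python's buffered data out to the file on disk
shutil.copy(temp.name, 'temp2.csv')
temp.close()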
You copy the file before you close it. Files are buffered, which means some of the data stays in the buffer while waiting to be written out to the file. The close writes out all remaining data from the buffer to the file as part of closing it.
This has nothing to do with NamedTemporaryFile.
I think your problem is that Python has not flushed the entire file to disk when you call shutil.copy. Closing the file first forces everything out, but note that NamedTemporaryFile deletes the file on close by default, so you must pass delete=False (and remove the file yourself afterwards). Change
temp = tempfile.NamedTemporaryFile()
temp.write(data)
shutil.copy(temp.name, 'temp2.csv')
temp.close()
to
temp = tempfile.NamedTemporaryFile(delete=False)
temp.write(data)
temp.close()  # closing flushes all buffered data to disk
shutil.copy(temp.name, 'temp2.csv')

function for loading both strings and files on disk?

I have a design question. I have a function loadImage() for loading an image file. Currently it accepts a string which is a file path. But I also want to be able to load files which are not on physical disk, e.g. generated procedurally. I could have it accept a string, but then how could it know whether the string is a file path or file data? I could add an extra boolean argument to specify that, but that doesn't sound very clean. Any ideas?
It's something like this now:
def loadImage(filepath):
    file = open(filepath, 'rb')
    data = file.read()
    # do stuff with data
The other version would be
def loadImage(data):
    # do stuff with data
How can this function accept either a 'filepath' or 'data' and figure out which one it got?
You can change your loadImage function to expect an opened file-like object, such as:
def load_image(f):
    data = f.read()
    # do stuff with data
...and then have that called from two functions, one of which expects a path and the other a string that contains the data:
from StringIO import StringIO

def load_image_from_path(path):
    with open(path, 'rb') as f:
        load_image(f)

def load_image_from_string(s):
    sio = StringIO(s)
    try:
        load_image(sio)
    finally:
        sio.close()
How about just creating two functions, loadImageFromString and loadImageFromFile?
This being Python, you can easily distinguish between a filename and a data string. I would do something like this:
import os.path as P
from StringIO import StringIO

def load_image(im):
    fin = None
    if P.isfile(im):
        fin = open(im, 'rb')
    else:
        fin = StringIO(im)
    # Read from fin like you would from any open file object
Other ways to do it would be a try block instead of using os.path, but the essence of the approach remains the same.
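For illustration, a minimal sketch of that try-block variant; it assumes any string that cannot be opened as a file should be treated as raw data (the exact exceptions raised for a non-path string vary with Python version and platform):
from StringIO import StringIO

def load_image(im):
    try:
        fin = open(im, 'rb')   # treat the argument as a path first
    except (IOError, ValueError, TypeError):
        fin = StringIO(im)     # not a usable path; treat it as image data
    # Read from fin like you would from any open file object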
