I have a pickle database which I am reading using the following code
import pickle
import pprint
import sys

def main(datafile):
    with open(datafile, 'rb') as fin:
        data = pickle.load(fin)
    pprint.pprint(data)

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Pickle database file must be given as an argument.")
        sys.exit()
    main(sys.argv[1])
I recognised that it contains a dictionary. I want to delete/edit some values in this dictionary and write the result out as a new pickle database.
I am storing the output of this program in a file (so that I can read the elements in the dictionary and choose which ones to delete). How do I read this file (pprinted data structures) back in and create a pickle database from it?
As stated in the Python docs, pprint is guaranteed to produce output that is valid Python syntax as long as the objects are representable as Python literals. So what you are doing is fine as long as you stick to dicts, lists, numbers, strings, and so on. In particular, if some value deep down in the dict is not representable as a literal (e.g. a custom object), this will fail.
Now, reading the output file back in is quite straightforward:
import ast

with open('output.txt') as fo:
    data = fo.read()
obj = ast.literal_eval(data)
This assumes that you keep exactly one object per file and nothing more.
Note that you could use the built-in eval instead of ast.literal_eval, but that is quite unsafe, since eval can run arbitrary Python code.
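Putting it together, a minimal sketch of the full round trip; the filenames and the key being deleted are hypothetical:
import ast
import pickle

# read the pprinted dump back into a real dict
with open('output.txt') as fo:
    data = ast.literal_eval(fo.read())

# edit the dict however you like
data.pop('some_key', None)  # hypothetical key to delete

# write the edited dict out as a new pickle database
with open('new_database.pkl', 'wb') as fout:
    pickle.dump(data, fout)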
Related
I usually use json for lists, but it doesn't work for sets. Is there a similar function to write a set into an output file f? Something like this, but for sets:
f = open('kos.txt', 'w')
json.dump(list, f)
f.close()
json is not a Python-specific format. It knows about lists and dictionaries, but not sets or tuples.
But if you want to persist a pure-Python dataset, you could use string conversion.
with open('kos.txt','w') as f:
    f.write(str({1, 3, (3, 5)}))  # a set containing numbers and a tuple
Then read it back in again using ast.literal_eval:
import ast

with open('kos.txt','r') as f:
    my_set = ast.literal_eval(f.read())
This also works for lists of sets, nested lists with sets inside, and so on, as long as the data can be evaluated literally and no set is empty (a known limitation of literal_eval). So almost any basic Python object structure serialized with str can be parsed back this way.
For the empty-set case there is a kludge to apply, since set() cannot be parsed back:
import ast

with open('kos.txt','r') as f:
    ser = f.read()
my_set = set() if ser == str(set()) else ast.literal_eval(ser)
You could also have used the pickle module, but it creates binary data, which is more "opaque", and there is also a way to use json: How to JSON serialize sets?. But for your needs I would stick to str/ast.literal_eval.
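For reference, a minimal sketch of the json route from the linked question, assuming you are happy to convert the set to a list on the way out and back to a set on the way in:
import json

data = {1, 3, 5}

# sets are not JSON-serializable, so dump a list instead
with open('kos.json', 'w') as f:
    json.dump(list(data), f)

# ...and rebuild the set after loading
with open('kos.json') as f:
    my_set = set(json.load(f))
Note this only round-trips cleanly when the elements are themselves JSON-serializable and hashable; a tuple element, for instance, would come back as a list and break the set conversion.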
Using ast.literal_eval(f.read()) will raise ValueError: malformed node or string if an empty set was written to the file, so I think pickle would be the better choice here.
If the set is empty, this raises no error:
import pickle

s = set()

# To save to a file
with open('kos.txt','wb') as f:
    pickle.dump(s, f)

# To read it back from the file
with open('kos.txt','rb') as f:
    my_set = pickle.load(f)
I have some JSON files of 500 MB each.
If I use the "trivial" json.load() to load the content all at once, it consumes a lot of memory.
Is there a way to read the file partially? If it were a line-delimited text file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like the start or end of a map/array.
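To make the event stream concrete, here is roughly what ijson.parse() emits for a tiny document (a sketch; exact numeric types can vary between ijson backends):
import io
import ijson

for event_tuple in ijson.parse(io.BytesIO(b'{"a": [1]}')):
    print(event_tuple)
# expected output, approximately:
# ('', 'start_map', None)
# ('', 'map_key', 'a')
# ('a', 'start_array', None)
# ('a.item', 'number', 1)
# ('a', 'end_array', None)
# ('', 'end_map', None)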
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() so that it doesn't rely on or change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen (see the sketch below). This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
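A minimal sketch of the separate-process approach; parse_one.py is a hypothetical script that parses a single JSON file:
import subprocess
import sys

list_of_files = ['a.json', 'b.json']  # hypothetical input files

for json_file in list_of_files:
    # one parser process per file: all of its memory is returned to the OS
    # when the process exits
    proc = subprocess.Popen([sys.executable, 'parse_one.py', json_file])
    proc.wait()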
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks; you can get it here and check out the README for examples. It's fast because it uses the C yajl library.
It can be done by using ijson. The workings of ijson were explained very well by Jim Pivarski in the answer above. The code below will read a file and print each JSON object in the list. For example, suppose the file content is:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below:
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        # 'item' selects each element of the top-level array
        jsonobj = ijson.items(input_file, 'item')
        for j in jsonobj:
            print(j)
Note: 'item' is the prefix ijson gives to the elements of a top-level array.
If you want to access only specific JSON objects based on a condition, you can do it in the following way:
import ijson

def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        # 'item.drug' matches the "drug" sub-object of each array element
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those drug objects whose type is tablet.
On your mention of running out of memory: I must ask whether you're actually managing memory. Are you using the del keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array, as sketched below.
You would have to do some string-content parsing to get the chunking of the json file right.
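A rough sketch of that idea, assuming the file holds a single top-level JSON array of objects; it does exactly the kind of string-content parsing described above, tracking brace depth and string state by hand:
import json

def iter_array_items(path, chunk_size=65536):
    """Yield one parsed object at a time from a file containing a single
    top-level JSON array of objects, without loading the whole file."""
    depth = 0          # current {...} nesting depth
    in_string = False  # are we inside a JSON string?
    escaped = False    # was the previous character a backslash?
    buf = []           # characters of the object being accumulated
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for ch in chunk:
                if in_string:
                    buf.append(ch)
                    if escaped:
                        escaped = False
                    elif ch == '\\':
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '{':
                    depth += 1
                    buf.append(ch)
                elif ch == '}':
                    depth -= 1
                    buf.append(ch)
                    if depth == 0:
                        yield json.loads(''.join(buf))
                        buf = []
                elif depth > 0:
                    buf.append(ch)
                    if ch == '"':
                        in_string = True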
I don't know what generates your json content. If possible, I would consider generating a number of managable files, instead of one huge file.
Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well, although you might run into the same problem loading the JSON; avoid that by loading the files one at a time.
If this path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
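A rough sketch of that path with pymongo, assuming the file is a top-level array of objects and a local mongod is running; the database and collection names are hypothetical:
import ijson
from pymongo import MongoClient  # pip install pymongo

client = MongoClient('localhost', 27017)
coll = client.mydb.records  # hypothetical database/collection

# stream one array element at a time into the collection, so the whole
# blob never sits in memory; use_float=True (ijson >= 3.1) avoids Decimal
# values, which BSON cannot encode
with open('big.json', 'rb') as f:
    for obj in ijson.items(f, 'item', use_float=True):
        coll.insert_one(obj)

# later: query pieces without loading everything
for doc in coll.find({'drug.type': 'tablet'}):
    print(doc)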
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions, as in the sketch below.
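A minimal sketch of that refactor; summarize() and list_of_files are hypothetical stand-ins:
import json

def process_file(path):
    # data is local to the function, so it becomes collectable as soon as
    # process_file() returns; a module-level name would live forever
    with open(path) as f:
        data = json.load(f)
    return summarize(data)  # summarize() is a hypothetical reduction step

for path in list_of_files:  # list_of_files is hypothetical
    result = process_file(path)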
In addition to @codeape:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
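Short of writing a fully custom parser, a rough sketch of the same exploration using ijson's event stream, printing only the key names at each level:
import ijson

# print every key name, prefixed by its path, to map out the structure
with open('big.json', 'rb') as f:  # 'big.json' is a hypothetical filename
    for prefix, event, value in ijson.parse(f):
        if event == 'map_key':
            print(prefix or '<root>', '->', value)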
You can convert the JSON file to a CSV file and then process it line by line:
import csv
import ijson

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    with open(file_path, 'r') as json_file, \
         open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in ijson.parse(json_file):
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() all at once will take a lot of time and memory. Instead, you can load the JSON data line by line (one object per line), collect each object's key/value pairs into a dictionary, append that to an overall dictionary, and convert the result to a pandas DataFrame, which will help with further analysis:
import json
import pandas as pd

def get_data(path):
    # stream the file one line at a time
    with open(path, 'r') as f:
        for line in f:
            yield line

data_dict = {}
for i, line in enumerate(get_data('your_json_file_name')):
    each = {}
    # k and v are the key and value pair of each record
    for k, v in json.loads(line).items():
        each[k] = v
    data_dict[i] = each

# The dict-of-dicts loads with one record per column, so transpose to get
# one record per row
data = pd.DataFrame(data_dict).T
I saved a defaultdict to a file using Python (so it's a string now) because I thought it'd be more convenient, but now it's stuck as a string, and ast.literal_eval(my_string) does not work unless I slice the "defaultdict" wording out. How do I retrieve my dictionary from this file in a more elegant way than slicing the defaultdict notation out and using ast.literal_eval? Thank you!
As Kevin Guan notes in the comments, the repr of a defaultdict is not the same as the code you would use to initialize a fresh one from scratch, because it prints the repr of the default factory: a defaultdict(list) with no entries stringifies as defaultdict(<class 'list'>, {}), and <class 'list'> isn't legal Python syntax.
There is no nicer way to handle this than string manipulation; there is no generalized way to un-repr something when the repr can't be eval-ed. If there is only one defaultdict per file, you could read back the bad data and write out a good pickle with something like:
import ast, collections, pickle

for filename in allfiles:
    # Slurp the file
    with open(filename) as f:
        data = f.read()
    # Slice from the first { to the last } and convert to a dict (assumes
    # all keys/values are Python literals too)
    data = ast.literal_eval(data[data.index('{'):data.rindex('}')+1])
    # Convert to defaultdict. Assumes they're all defaultdicts of list and
    # that you need them to remain defaultdicts; change list to whatever
    # you actually want
    newdefdict = collections.defaultdict(list, data)
    # Rewrite the input file as a pickle containing the recovered data.
    # Use open(filename + ".pickle", ...
    # if you want to avoid rewriting before verifying that you got the right data
    with open(filename, 'wb') as f:
        pickle.dump(newdefdict, f, protocol=pickle.HIGHEST_PROTOCOL)
The pickle.HIGHEST_PROTOCOL bit uses the most up-to-date protocol (if you need to interoperate with older versions of Python, choose the highest protocol that works on all of them; for Py2 and Py3 compatibility, that's protocol 2).
I have a question. It may be an easy one, but I could not come up with a good approach. I have two Python programs. The first of them produces two outputs: one is a huge list (containing thousands of other lists), and the other is a simple CSV file for Weka. I need to store this list (the first output) somehow, to be able to use it later as input to the other program. I cannot just send it straight to the second program, because when the first program is done, Weka should also produce new output for the second program. Hence, the second program has to wait for the outputs of both the first program and Weka.
The problem is that the output list consists of lots of lists holding numerical values. A simple example could be:
list1 = [[1,5,7],[14,3,27], [19,12,0], [23,8,17], [12,7]]
If I write this to a txt file, then when I try to read it back, all the values come in as strings. Is there any easy and fast way (since the data is big) to read all the values back as integers, or to write them as integers in the first place?
I think this is a good case for the pickle module.
To save data:
import pickle

lst = [[1,5,7],[14,3,27], [19,12,0], [23,8,17], [12,7]]
with open('data.pkl', 'wb') as f:
    pickle.dump(lst, f)
To read the data back from the saved file:
import pickle

with open('data.pkl', 'rb') as f:
    lst = pickle.load(f)
From the documentation:
The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” [1] or “flattening”, however, to avoid confusion, the terms used here are “pickling” and “unpickling”.
There is also the faster cPickle module (Python 2 only; Python 3's pickle uses the C implementation automatically).
To save data:
from cPickle import Pickler

p = Pickler(open('data2.pkl', 'wb'))
p.dump(lst)
To read data back from the saved file:
from cPickle import Unpickler

up = Unpickler(open('data2.pkl', 'rb'))
lst = up.load()
How about pickling the list output rather than writing it out as a plain-text representation? Have a look at the documentation for your version: it's basically a way to write Python objects to a file, which you can then read back into Python at any point to get identical objects.
Once you have the file open that you want to output to, the outputting difference will be quite minor, e.g.
import pickle
my_list = [[1, 2], [134, 76], [798, 5, 2]]
with open('outputfile.pkl', 'wb') as output:
    pickle.dump(my_list, output, -1)  # -1 selects the highest pickle protocol
And then just use the following way to read it in from your second program:
import pickle

with open('outputfile.pkl', 'rb') as f:
    my_list = pickle.load(f)
I'm relatively new to encoding and decoding; in fact, I don't have any experience with it at all.
I was wondering: how would I encode a dictionary in Python 3 into an unreadable format that would prevent someone from modifying it outside the program?
Likewise, how would I then read from that file and decode the dictionary back?
My test code right now only writes to and reads from a plain text file.
import ast
myDict = {}
#Writer
fileModifier = open('file.txt', 'w')
fileModifier.write(str(myDict))
fileModifier.close()
#Reader
fileModifier = open('file.txt', 'r')
myDict = ast.literal_eval(fileModifier.read())
fileModifier.close()
Depending on what your dictionary is holding, you can use an encoding library like json or pickle (pickle is useful for storing more complex Python data structures).
Here is an example using json; to use pickle, just replace all instances of json with pickle and you should be good to go.
import json
myDict = {}
#Writer
fileModifier = open('file.txt', 'w')
json.dump(myDict, fileModifier)
fileModifier.close()
#Reader
fileModifier = open('file.txt', 'r')
myDict = json.load(fileModifier)
fileModifier.close()
The typical thing to use here is either the json or pickle module (both in the standard library). The process is called "serialization". pickle can serialize almost arbitrary Python objects, whereas json can only serialize basic types/objects (integers, floats, strings, lists, dictionaries). json is human-readable, whereas pickle files aren't.
An alternative to encoding/decoding is simply to use a file as a dict; Python's shelve module does exactly this. It uses a file as a database and provides a dict-like interface to it.
It has some limitations: for example, keys must be strings, and it is obviously slower than a normal dict since it performs I/O operations.
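A minimal sketch of shelve in action; the filename is hypothetical:
import shelve

# writing: the shelf behaves like a persistent dict
with shelve.open('mydata') as db:  # creates e.g. mydata.db on disk
    db['scores'] = [1, 2, 3]       # keys must be strings
    db['config'] = {'debug': True}

# reading back, possibly in a later run
with shelve.open('mydata') as db:
    print(db['scores'])  # -> [1, 2, 3]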