Is there a way to deserialize a json array directly to a set?
data.json (yes this is just a json array.)
["a","b","c"]
Notice that the json array contains unique elements.
Currently my workflow is the following.
open_file = open(path, 'r')
json_load = json.load(open_file) # this returns a list
return set(json_load) # which I am then converting to a set.
Is there a way to do something like this?
open_file = open(path, 'r')
return json.load(open_file, **arguments) # this returns a set.
Also, is there any other way to go about doing it, perhaps without the json module? Surely I am not the first one to need a set decoder.
No. You would have to subclass the json module's JSONDecoder class and override the method that creates the object in order to do it yourself.
And it is also not worth the trouble. JSON arrays map naturally to Python lists: they are ordered and can contain duplicates, so a set can't correctly represent a JSON array. Therefore it is not the job of a JSON decoder to provide a set.
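If you really wanted to, here is a minimal, hedged sketch of that subclassing route; the json module has no dedicated hook for arrays, so this simply converts a decoded top-level list into a set:
import json

class SetDecoder(json.JSONDecoder):
    # only the top-level array becomes a set; nested arrays stay lists
    def decode(self, s, **kwargs):
        result = super().decode(s, **kwargs)
        return set(result) if isinstance(result, list) else result

# usage: json.load(open_file, cls=SetDecoder)  # -> {'a', 'b', 'c'} for data.json above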
Converting is the best you can do. You could create a small helper function and call it when you need it:
def json_load_set(f):
    return set(json.load(f))
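For example, assuming data.json contains the array shown above:
import json

with open('data.json') as open_file:
    unique_items = json_load_set(open_file)  # {'a', 'b', 'c'}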
Which is the best way to store a dictionary of strings in a file (as they are big) and load it partially in Python? By a dictionary of strings I mean the keys are strings and each value is a list of strings.
The dictionary is stored by appending, so I can check whether a key is already present: if it is, don't update it, otherwise add it. The keys are then used for post-processing.
Usually a dictionary is stored in JSON.
I'll leave a link here:
Convert Python dictionary to JSON array
You could simply write the dictionary to a text file, and then create a new dictionary that only pulls certain keys and values from that text file.
But you're probably best off exploring the json module.
Here's a straightforward way to write a dict called "sample" to a file with the json module:
import json
with open('result.json', 'w') as fp:
    json.dump(sample, fp)
On the loading side, we'd need to know more about how you want to choose which keys to load from the JSON file.
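Pending that, a rough, hedged sketch (the key names are placeholders, and note that this still reads the whole file into memory before filtering):
import json

wanted_keys = {'keyword1', 'keyword2'}  # hypothetical keys you care about

with open('result.json') as fp:
    full_dict = json.load(fp)

partial_dict = {k: v for k, v in full_dict.items() if k in wanted_keys}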
The above answers are great, but I hate using JSON, and I have had issues with pickle corrupting my data before, so what I do is use NumPy's save and load.
To save: np.save(filename, my_dict)
To load: my_dict = np.load(filename, allow_pickle=True).item()  (newer NumPy versions require allow_pickle=True for pickled objects)
Really simple and it works well. As far as partial loading goes, you could always split the dictionary into multiple smaller dictionaries and save them as individual files; maybe not a very concrete solution, but it could work.
To split the dictionary you could do something like this:
import numpy as np

temp_dict = {}
for i, k in enumerate(my_dict.keys()):
    if i % 1000 == 0 and i > 0:
        np.save("records-" + str(i - 1000) + "-" + str(i) + ".npy", temp_dict)
        temp_dict = {}
    temp_dict[k] = my_dict[k]
if temp_dict:
    np.save("records-last.npy", temp_dict)  # save the final partial chunk
Then for loading just do something like:
import glob
import numpy as np

my_dict = {}
all_files = glob.glob("*.npy")
for f in all_files:
    part = np.load(f, allow_pickle=True).item()
    my_dict.update(part)
If this is for some sort of database-type use, then save yourself the headache and use TinyDB. It uses JSON format when saving to disk and will provide the "partial" loading that you're looking for.
I only recommend TinyDB because it seems to be the closest to what you're looking to achieve; if it isn't to your fancy, try googling for other databases, there are TONS of them out there!
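A minimal, hedged sketch of what that could look like (the field names are just examples; requires pip install tinydb):
from tinydb import TinyDB, Query

db = TinyDB('db.json')  # stored on disk as JSON
db.insert({'keyword': 'foo', 'values': ['a', 'b', 'c']})

Item = Query()
print(db.search(Item.keyword == 'foo'))  # fetch only the matching records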
I usually use json for lists, but it doesn't work for sets. Is there a similar function to write a set into an output file, f? Something like this, but for sets:
f=open('kos.txt','w')
json.dump(list, f)
f.close()
JSON is not a Python-specific format. It knows about lists and dictionaries, but not sets or tuples.
But if you want to persist a pure Python data structure, you could use string conversion.
with open('kos.txt','w') as f:
    f.write(str({1,3,(3,5)})) # set of numbers & a tuple
Then read it back again using ast.literal_eval:
import ast
with open('kos.txt','r') as f:
    my_set = ast.literal_eval(f.read())
This also works for lists of sets, nested lists with sets inside, and so on, as long as the data can be evaluated literally and no sets are empty (a known limitation of literal_eval, since the empty set has no literal). So basically (almost) any basic Python object structure serialized with str can be parsed back with it.
For the empty-set case there's a kludge to apply, since set() cannot be parsed back:
import ast
with open('kos.txt','r') as f:
    ser = f.read()
    my_set = set() if ser == str(set()) else ast.literal_eval(ser)
You could also have used the pickle module, but it creates binary data, which is more "opaque". There's also a way to use JSON: How to JSON serialize sets?. But for your needs, I would stick to str/ast.literal_eval.
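For reference, a hedged sketch of that JSON route: convert the set to a list when dumping and back to a set when loading.
import json

s = {1, 3, 5}
with open('kos.txt', 'w') as f:
    json.dump(list(s), f)  # lists are JSON-serializable, sets are not

with open('kos.txt') as f:
    my_set = set(json.load(f))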
Using ast.literal_eval(f.read()) will give the error ValueError: malformed node or string if we write an empty set to the file. I think pickle would be better to use; if the set is empty, this gives no error.
import pickle
s = set()
##To save in file
with open('kos.txt','wb') as f:
    pickle.dump(s, f)
##To read it again from file
with open('kos.txt','rb') as f:
    my_set = pickle.load(f)
I have a pickle database which I am reading using the following code
import pickle, pprint
import sys

def main(datafile):
    with open(datafile, 'rb') as fin:
        data = pickle.load(fin)
        pprint.pprint(data)

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Pickle database file must be given as an argument.")
        sys.exit()
    main(sys.argv[1])
I recognised that it contained a dictionary. I want to delete/edit some values from this dictionary and make a new pickle database.
I am storing the output of this program in a file (so that I can read the elements of the dictionary and choose which ones to delete). How do I read this file (pprinted data structures) back and create a pickle database from it?
As stated in the Python docs, pprint is guaranteed to produce output that is valid Python syntax as long as the objects are representable as Python constants. So what you are doing is fine as long as you do it for dicts, lists, numbers, strings, etc. In particular, if some value deep down in the dict is not representable as a constant (e.g. a custom object), this will fail.
Now reading the output file should be quite straightforward:
import ast
with open('output.txt') as fo:
    data = fo.read()
    obj = ast.literal_eval(data)
This is assuming that you keep one object per file and nothing more.
Note that you may use built-in eval instead of ast.literal_eval but that is quite unsafe since eval can run arbitrary Python code.
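Putting the pieces together, a hedged sketch of the whole round trip (the file names are just examples):
import ast
import pickle
import pprint

# 1. dump the pickled dict as pprinted text
with open('data.pkl', 'rb') as fin:
    data = pickle.load(fin)
with open('output.txt', 'w') as fo:
    pprint.pprint(data, stream=fo)

# 2. edit output.txt by hand, then parse it back
with open('output.txt') as fo:
    edited = ast.literal_eval(fo.read())

# 3. write the edited dict to a new pickle database
with open('new_data.pkl', 'wb') as fout:
    pickle.dump(edited, fout)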
I have a very big list of lists. One of my programs does this:
power_time_array = [[1,2,3],[1,2,3]] # In a short form
with open (file_name,'w') as out:
out.write (str(power_time_array))
Now another independent script need to read this list of lists back.
How do I do this?
What I have tried:
with open(file_name, 'r') as app_trc_file:
    power_trace_of_application.append(app_trc_file.read())
Note: power_trace_of_application is a list of list of lists.
This stores it as a list with one element as a huge string.
How does one efficiently store and retrieve big lists or list of lists from files in python?
You can serialize your list to JSON and deserialize it back. This really doesn't change anything in the representation; your list is already valid JSON:
import json
power_time_array = [[1,2,3],[1,2,3]] # In a short form
with open(file_name, 'w') as out:
    json.dump(power_time_array, out)
and then just read it back:
with open(file_name, 'r') as app_trc_file:
    power_trace_of_application = json.load(app_trc_file)
For speed, you can use a json library with C backend (like ujson). And this works with custom objects too.
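For instance, a hedged sketch of the ujson variant (its dump/load mirror the standard json functions; requires pip install ujson):
import ujson

with open(file_name, 'w') as out:
    ujson.dump(power_time_array, out)

with open(file_name, 'r') as app_trc_file:
    power_trace_of_application = ujson.load(app_trc_file)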
Use the json library to efficiently read and write structured information (in the form of JSON) to a text file.
To write data to the file, use json.dump(), and
to retrieve JSON data from the file, use json.load().
It will be faster:
from ast import literal_eval

power_time_array = [[1,2,3],[1,2,3]]
with open(file_name, 'w') as out:
    out.write(repr(power_time_array))

with open(file_name, 'r') as app_trc_file:
    power_trace_of_application.append(literal_eval(app_trc_file.read()))
I have some 500MB JSON files.
If I use the "trivial" json.load() to load their content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
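A hedged sketch of that second approach (parse_one.py is a hypothetical script that parses a single file and exits):
import subprocess
import sys

for json_file in list_of_files:
    # each file is handled in its own process, so its memory is released on exit
    subprocess.Popen([sys.executable, 'parse_one.py', json_file]).wait()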
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks. You can get it here; check out the README for examples. It's fast because it uses the C yajl library.
It can be done by using ijson. How ijson works has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below:
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the prefix ijson uses for the elements of a top-level array.
If you want to access only specific JSON objects based on a condition, you can do it in the following way:
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those entries whose drug type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files, instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If this path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to #codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what #codeape suggests: break the file up into smaller chunks, etc.
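As a rough sketch of the "print out the key names only" idea, you could lean on ijson (mentioned above) rather than writing a parser from scratch (the file name is just an example):
import ijson

with open('huge.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        if event == 'map_key':
            print(prefix, value)  # where in the tree the key occurs, plus the key name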
You can convert the JSON file to a CSV file and then process it line by line:
import ijson
import csv
def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() on the whole file will take a lot of time and memory. Instead (assuming the file has one JSON object per line), you can load the data line by line as key/value pairs into a dictionary, append each of those dictionaries to a final dictionary, and convert that to a pandas DataFrame, which will help you in further analysis:
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair of one JSON object (one line)
    for k, v in json.loads(line).items():
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

# data_dict as a DataFrame (table format) comes out transposed,
# so finally transpose the DataFrame back
Data = pd.DataFrame(data_dict)
Data_1 = Data.T