Does json.dump() in python rewrite or append a JSON file - python

When working with json.dump() I noticed that it appears to be rewriting the entire document. Is this correct, and is there another way to append to the dictionary like .append() does with lists?
When I write the function like this and change the key value (name), it would appear that the item is being appended.
import json

filename = "infohere.json"
name = "Bob"
numbers = 20

# Write to JSON: load the existing dict, add one key, write it all back
def writejson(name=name, numbers=numbers):
    with open(filename, "r") as info:
        xdict = json.load(info)
    xdict[name] = numbers
    with open(filename, "w") as info:
        json.dump(xdict, info)
When you write it out like this however, you can see that the code clearly writes over the entire dictionary/json file.
import json

filehere = "infohere.json"
data = {"Bob": 23, "Mark": 50}
data2 = {"Ricky": 40}

# Write to JSON: each call opens the file with "w" and dumps one dict
def writejson2(d):
    with open(filehere, "w") as info:
        json.dump(d, info)

writejson2(data)
writejson2(data2)
In the second example the file only ever shows the last data input, leading me to believe that this is rewriting the entire document. If the whole document is written during each json.dump, does this cause issues with larger JSON files? If so, is there another method like .append() for dealing with JSON?
Thanks in advance.

Neither.
json.dump doesn't decide whether to delete prior content when it writes to a file. That decision happens when you run open(filehere, "w"); that is what deletes old content.
But: Normal JSON isn't amenable to appends.
A single JSON document is one value, typically one object or array. There are variants on the format that allow multiple documents in one file, the most common of which is JSONL (which has one JSON document per line). Unless you're using such a format, trying to append JSON to a non-empty file usually won't result in something that can be successfully parsed.
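A minimal sketch of the difference, using only the standard library. The file names and sample records are made up for illustration; the point is that the mode passed to open() decides what happens to old content, and that appending only yields a parseable file with a line-delimited layout.

```python
import json
import os
import tempfile

workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "demo.json")    # hypothetical file names
jsonl_path = os.path.join(workdir, "demo.jsonl")

# Mode "w" truncates the file when it is opened, so only the last dump survives.
for record in ({"Bob": 23}, {"Ricky": 40}):
    with open(json_path, "w") as f:
        json.dump(record, f)
with open(json_path) as f:
    after_w = json.load(f)

# Mode "a" keeps the old bytes, but two documents back to back are not valid JSON.
with open(json_path, "a") as f:
    json.dump({"Mark": 50}, f)
try:
    with open(json_path) as f:
        json.load(f)
    append_parses = True
except json.JSONDecodeError:
    append_parses = False

# JSONL: one complete document per line, so appending is safe.
for record in ({"Bob": 23}, {"Ricky": 40}):
    with open(jsonl_path, "a") as f:
        f.write(json.dumps(record) + "\n")
with open(jsonl_path) as f:
    jsonl_records = [json.loads(line) for line in f]
```

Here after_w holds only the last record, the appended file fails to parse, and the JSONL file gives back both records in order.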

Related

Seeking and deleting elements in lists of a parsed file and saving result to another file

I have a large .txt file that is the result of parsing a C file. It contains various blocks of data, but about 90% of them are useless to me. I'm trying to get rid of them and then save the result to another file, but I'm having a hard time doing so. At first I tried to delete all the useless information in the unparsed file, but then it won't parse. My .txt file is built like this:
Update: The files I'm trying to work on come from the pycparser module, which I found on GitHub.
File before being parsed looks like this:
And after using pycparser
file_to_parse = pycparser.parse_file(current_directory + r"\D_Out_Clean\file.d_prec")
I want to delete all blocks that start with the word Typedef. The module stores these in one big list that I can access via its ext attribute.
Currently my code looks like this:
len_of_ext_list = len(file_to_parse.ext)
i = 0
while i < len_of_ext_list:
    if 'TypeDecl' not in file_to_parse.ext[i]:
        print("NOT A TYPEDECL")
        print(file_to_parse.ext[i], type(file_to_parse.ext[i]))
        parsed_file_2 = open(current_directory + r"\Zadanie\D_Out_Clean_Parsed\clean_file.d_prec", "w+")
        parsed_file_2.write("%s%s\n" %("", file_to_parse.ext[i]))
        parsed_file_2.close
        #file_to_parse_2 = file_to_parse.ext[i]
    i+=1
But the above code only saves the last FuncDef from the unparsed file, and I don't know how to change it.
So now I'm trying to get rid of all typedefs in the parsed file, as they don't hold any valuable information for me. I want to know which function definitions and declarations are in the file, and what types of global variables are stored in the parsed file. I hope this is clearer now.
I suggest reading the entire input file into a string, and then doing a regex replacement:
import re

with open(current_directory + r"\D_Out\file.txt", "r+") as file:
    with open(current_directory + r"\D_Out_Clean\clean_file.txt", "w+") as output:
        data = file.read()
        # Remove every block or statement that starts with "type"
        data = re.sub(r'type(?:\n\{.*?\}|[^;]*?;)\n?', '', data, flags=re.S)
        output.write(data)
Here is a regex demo showing that the replacement logic is working.

Constantly append element to list in json file [duplicate]

In Python, I know that I can dump a list of dictionaries into a .json file for storage with json.dump() from the json module. However, after dumping a list, is it possible to append more dictionaries to that list in the .json file without explicitly loading the full list, appending, and then dumping the list again?
e.g.
In .json I have
[{"a": 1}]
Is it possible to add {"b": 2} to the list in the .json file such that the file becomes
[{"a": 1}, {"b": 2}]
The actual list is much longer (on the order of ten million), so I'm wondering if there're more direct ways of doing that without reading the entire list from the file to save memory.
Edit:
PS: I'm also open to other file formats, as long as they can efficiently store a large list of dictionaries and support the operation above.
It sounds like this can be a simple file manipulation problem under the right circumstances. If you are sure that the root data structure of the dump is indeed a json array, you can delete the last "]" in the file and then append a new dump to the file.
You can append with the dumps function.
from json import dumps, dump
import os

# This represents your current dump call
with open('out.json', 'w') as f:
    dump([{'version':1}], f)

# This removes the final ']'
with open('out.json', 'rb+') as f:
    f.seek(-1, os.SEEK_END)
    f.truncate()

# This appends the new dictionary
with open('out.json', 'a') as f:
    f.write(',')
    f.write(dumps({'n':1}))
    f.write(']')
It seems to also work if you dump with indent because the dump function doesn't end with a newline character in either case.
Handling an empty array
If the list was empty the first time you dumped it, the file contains an empty JSON array, "[]". Appending a comma as in my example then produces something like "[,...]", which is not valid JSON.
The way I've seen this handled in the wild by protocols like i3bar (which uses an unending JSON array to send information) is to always start with a header element. In their case they use { "version": 1 }.
So ensure that you have that at the start of your list when you do the first dump, unless you're sure you'll always have something in the list.
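As an alternative to a header element, the append step can check whether the array is still empty before writing the comma. This is only a sketch: append_to_json_array is a hypothetical helper name, and it assumes the whole file is a single JSON array, optionally followed by whitespace.

```python
import json
import os
import tempfile

def append_to_json_array(path, item):
    """Append item to a file whose entire content is one JSON array."""
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        last = b""
        # Walk back over trailing whitespace to the closing bracket.
        while pos > 0:
            pos -= 1
            f.seek(pos)
            last = f.read(1)
            if not last.isspace():
                break
        if last != b"]":
            raise ValueError("file does not end with a JSON array")
        # Look at the byte before the bracket: "[" means the array is empty.
        prev_pos = pos
        prev = b""
        while prev_pos > 0:
            prev_pos -= 1
            f.seek(prev_pos)
            prev = f.read(1)
            if not prev.isspace():
                break
        separator = b"" if prev == b"[" else b","
        # Cut off the old "]" and write the new element plus a fresh "]".
        f.seek(pos)
        f.truncate()
        f.write(separator + json.dumps(item).encode("utf-8") + b"]")

# Usage with a throwaway file that starts as an empty array.
path = os.path.join(tempfile.mkdtemp(), "out.json")
with open(path, "w") as f:
    json.dump([], f)
append_to_json_array(path, {"n": 1})
append_to_json_array(path, {"n": 2})
with open(path) as f:
    result = json.load(f)
```

The file is never parsed, so the cost of an append does not depend on how long the list already is.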
Other notes
Even though this sort of manual JSON hack is used by projects like i3bar, I wouldn't personally recommend doing this in a production environment.
JSON requires that the list be closed with a "]", so it's not natively appendable. You could try something tricky with opening the end of the file, removing the "]" and also fiddling with the new JSON you are trying to write. But that's messy.
An interesting thing about JSON is that the encoding doesn't need newlines. You can pretty-print JSON, but if you don't, json.dumps() writes an entire JSON record on a single line (newlines inside strings are escaped as \n). So, instead of a JSON list, just have a bunch of lines, each of which is a JSON encoding of your dict.
import json

def append_dict(filename, d):
    # Append one dict as a single line of JSON
    with open(filename, 'a', encoding='utf-8') as fp:
        fp.write(json.dumps(d))
        fp.write("\n")

def read_list(filename):
    # Decode every line back into a dict
    with open(filename, encoding='utf-8') as fp:
        return [json.loads(line) for line in fp]
Since this file is now a bunch of JSON objects, not a single JSON list, any program expecting a single list in this file will fail.
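Because the question is about keeping memory use low with millions of entries, reading can also be done lazily instead of building the full list. A small sketch, assuming the one-object-per-line layout described above; iter_dicts and the demo file name are made up for illustration.

```python
import json
import os
import tempfile

def iter_dicts(filename):
    """Yield one decoded object per non-empty line, without building a list."""
    with open(filename, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo: append three records, then aggregate over them one at a time.
demo_path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
with open(demo_path, "a", encoding="utf-8") as fp:
    for d in ({"a": 1}, {"b": 2}, {"a": 3}):
        fp.write(json.dumps(d) + "\n")

total_a = sum(d.get("a", 0) for d in iter_dicts(demo_path))
```

Only one decoded dict is alive at a time, so memory use stays roughly constant regardless of the file size.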

Extracting N JSON objects contained in a single line from a text file in Python 2.7?

I have a huge text file that contains several JSON objects that I want to parse into a CSV file. Because I'm dealing with someone else's data, I cannot really change the format it's being delivered in.
Since I don't know how many JSON objects there are, I can't just create a set of dictionaries, wrap them in a list and then json.loads() the list.
Also, since all the objects are on a single line of text, I can't use a regular expression to separate each individual JSON object and then put them in a list. (It's super complicated and sometimes triple-nested JSON at some points.)
Here's my current code:
def json_to_csv(text_file_name,desired_csv_name):
    #Cleans up a bit of the text file
    file = fileinput.FileInput(text_file_name, inplace=True)
    ile = fileinput.FileInput(text_file_name, inplace=True)
    for line in file:
        sys.stdout.write(line.replace(u'\'', u'"'))
    for line in ile:
        sys.stdout.write(re.sub(r'("[\s\w]*)"([\s\w]*")', r"\1\2", line))
    #try to load the text file to content var
    with open(text_file_name, "rb") as fin:
        content = json.load(fin)
    #Rest of the logic using the json data in content
    #that uses it for the desired csv format
This code gives a ValueError: Extra data: line 1 column 159816 because there is more than one object there.
I've seen similar questions on Google and Stack Overflow, but none of those solutions work here, because the file is just one really long line of text and I don't know how many objects it contains.
If you are trying to split apart the highest level braces you could do something like
string = '{"NextToken": {"value": "...'
objects = eval("[" + string + "]")
and then parse each item in the list.
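Note that eval() executes whatever the file contains and fails on JSON literals such as true, false and null. An alternative that avoids eval is json.JSONDecoder.raw_decode, which decodes one document from a string and reports where it stopped. The sketch below is written for Python 3 and assumes the text is valid JSON (double-quoted strings); iter_concatenated_json and the sample line are made up for illustration.

```python
import json

def iter_concatenated_json(text):
    """Yield each JSON document found back to back in a single string."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # Skip whitespace and commas between documents.
        while pos < end and (text[pos].isspace() or text[pos] == ","):
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the decoded value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Three objects on one line, with and without a separator between them.
line = '{"NextToken": {"value": "a}b"}}{"x": [1, 2]}, {"ok": true}'
objects = list(iter_concatenated_json(line))
```

Since the decoder does the real parsing, braces inside strings and nested objects do not confuse the splitting, and the number of objects does not need to be known in advance.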

How to read a JSON file in python? [duplicate]

I have some JSON files of about 500 MB each.
If I use the "trivial" json.load() to load the content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a line-delimited text file, I would be able to iterate over the lines. I am looking for an analogue of that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

with open(json_file_name, 'rb') as f:
    for prefix, the_type, value in ijson.parse(f):
        print(prefix, the_type, value)
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
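The first suggestion might be sketched like this. What process_file() should return is not specified in the answer, so the summary dict here is purely an assumption; the point is only that the parsed data stays local to the function.

```python
import json
import os
import tempfile

def process_file(path):
    """Parse one file and return only a small summary.

    The parsed data is a local variable, so nothing keeps it alive after
    the function returns and its memory can be reclaimed.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {"name": os.path.basename(path), "top_level_items": len(data)}

def process_all(list_of_files):
    # Only the small summaries accumulate, never the parsed files themselves.
    return [process_file(path) for path in list_of_files]

# Demo with two throwaway files.
workdir = tempfile.mkdtemp()
list_of_files = []
for name, content in (("a.json", [1, 2, 3]), ("b.json", {"x": 1})):
    path = os.path.join(workdir, name)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(content, f)
    list_of_files.append(path)

summaries = process_all(list_of_files)
```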
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks. You can get it here and check out the README for examples. It's fast because it uses the C yajl library.
It can be done by using ijson. The working of ijson has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each json from the list. For example, file content is as below
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        # 'item' selects each element of the top-level array
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the default prefix given by ijson.
If you want to access only specific objects based on a condition, you can do it in the following way.
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        # 'item.drug' selects the "drug" object inside each array element
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those json whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
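A rough sketch of such a generator using only the standard library. It assumes the file is a single top-level JSON array opened in text mode; iter_array_items is a made-up name, and the parsing between elements is deliberately lenient rather than a validator. Each element is decoded with json.JSONDecoder.raw_decode, and more text is read whenever an element is cut off at the end of the buffer.

```python
import io
import json

def iter_array_items(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array, reading fp in chunks."""
    decoder = json.JSONDecoder()
    buf, pos, started = "", 0, False

    def skip_ws(i):
        while i < len(buf) and buf[i] in " \t\r\n":
            i += 1
        return i

    while True:
        pos = skip_ws(pos)
        if pos < len(buf) and not started:
            if buf[pos] != "[":
                raise ValueError("expected a top-level JSON array")
            started = True
            pos = skip_ws(pos + 1)
        if started and buf[pos:pos + 1] == ",":
            pos = skip_ws(pos + 1)
        if started and buf[pos:pos + 1] == "]":
            return
        if started and pos < len(buf):
            try:
                item, end = decoder.raw_decode(buf, pos)
                end = skip_ws(end)
                # Trust the element only if a delimiter follows it; otherwise
                # it may be cut off (for example "1" out of "1.5").
                complete = buf[end:end + 1] in (",", "]")
            except json.JSONDecodeError:
                complete = False
            if complete:
                yield item
                pos = end
                continue
        # Buffer is empty or ends inside an element: read more text.
        chunk = fp.read(chunk_size)
        if not chunk:
            raise ValueError("unexpected end of file inside the array")
        buf = buf[pos:] + chunk
        pos = 0

# Demo with a tiny chunk size so elements are split across reads.
data = [{"id": 0, "v": "x,]"}, 1.5, "s", [1, 2], None, {"id": 5}]
items = list(iter_array_items(io.StringIO(json.dumps(data)), chunk_size=5))
```

Memory use is bounded by the chunk size plus the largest single element, except for malformed input, where the buffer grows until the end of the file before the error is raised.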
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well, although you might run into the same problem loading the JSON; avoid that by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to @codeape:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
You can convert the JSON file to a CSV file, parsing it event by event (this assumes flat objects with string values):
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    with open(file_path, 'rb') as json_file, \
            open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in ijson.parse(json_file):
            if event == 'end_map':
                # One object finished: write the header once, then the row
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, if the file holds one JSON object per line, you can load the data line by line, put each line's key/value pairs into a dictionary, add that dictionary to a final dictionary, and convert the result to a pandas DataFrame for further analysis:
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        each[f'{k}'] = f'{v}'
    data_dict[i] = each
Data = pd.DataFrame(data_dict)
# Data holds one record per column (transposed form), so transpose it
# to get one row per record:
Data_1 = Data.T

Storing data of a dict in Json

I have a simple program that takes input from the user and puts it into a dict. After that I want to store the data in a JSON file (I searched and JSON was the only useful option I found).
for example
import json

mydict = {}
while True:
    user = input("Enter key: ")
    if user in mydict.keys(): #if the key already exist only print
        print ("{} ---> {}".format(user,mydict[user]))
        continue
    mn = input("Enter value: ")
    app = input ("Apply?: ")
    if app == "y":
        mydict[user] = mn
        with open ("mydict.json","a+") as f:
            json.dump(mydict,f)
        with open ("mydict.json") as t:
            mydict = json.load(t)
Every time the user enters a key and value, I want to add them to the dict and then store that dict in the JSON file. And every time I want to read the JSON file back so I can refresh the dict in the program.
The code above raised "ValueError: Extra data". I understand the error occurs because I'm adding the dict to the JSON file every time, so the file holds more than one dict. But how can I add the whole dict at once? I didn't want to use w mode because I don't want to overwrite the file, and I'm new to JSON.
The program must run forever and I have to refresh the dict every time; that's why I couldn't find any solution.
If you want to use JSON, then you will have to use the 'w' option when opening the file for writing. The 'a+' option will append your full dict to the file right after its previously saved version.
Why not use CSV instead? With the 'a+' option, any newly entered user info will be appended to the end of the file, and transforming its content into a dict at reading time is quite easy and should look something like:
import csv
with open('your_dict.json', 'r', newline='') as fp:
    yourDict = {key: value for key, value in csv.reader(fp, delimiter='\t')}
while the saving counterpart would look like:
yourDictWriter = csv.writer(open('your_dict.json', 'a+', newline=''), delimiter='\t')
#...
yourDictWriter.writerow([key, value])
Another approach would be to use MongoDB, a database designed for storing json documents. Then you won't have to worry about overwriting files, encoding json, and so on, since the database and driver will manage this for you. (Also note that it makes your code more concise.) Assuming you have MongoDB installed and running, you could use it like this:
from pymongo import MongoClient

client = MongoClient()
db = client.test_database.test_collection
while True:
    user = input("Enter key: ")
    found = db.find_one({'user': user})
    if found:  # if the key already exists, only print it
        print("{} ---> {}".format(user, found['value']))
        continue
    mn = input("Enter value: ")
    app = input("Apply?: ")
    if app == "y":
        db.insert_one({'user': user, 'value': mn})
With your code as it is right now, you have no reason to append to the file. You're converting the entire dict to JSON and writing it all to file anyway, so it doesn't matter if you lose the previous data. a is no more efficient than w here. In fact it's worse because the file will take much more space on disk.
Using the CSV module as Schmouk said is a good approach, as long as your data has a simple structure. In this case you just have a table with two columns and many rows, so a CSV is appropriate, and also more efficient.
If each row has a more complex structure, such as nested dicts and/or lists, or different fields for each row, then JSON will be more useful. You can still append one row at a time. Just write each row on a single line, and have each row be an entire JSON object on its own. Reading files one line at a time is normal and easy in Python.
By the way, there is no need for the last two lines. You only need to read in the JSON when the program starts (as Moses has shown you) so you have access to data from previous runs. After that you can just use the variable mydict and it will remember all the keys you've added to it.
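Putting these points together, a minimal sketch of the load-once, rewrite-on-change pattern. load_dict and save_dict are hypothetical helper names, the file name comes from the question, and the interactive input() loop is replaced by two fixed updates so the sketch runs on its own.

```python
import json
import os
import tempfile

def load_dict(path):
    """Return the saved dict, or an empty one if the file does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def save_dict(path, d):
    """Replace the file content with the current dict (mode "w" truncates)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(d, f)

path = os.path.join(tempfile.mkdtemp(), "mydict.json")

# Read once at startup, then keep working with the in-memory dict.
mydict = load_dict(path)
for key, value in (("k1", "v1"), ("k2", "v2")):   # stands in for user input
    mydict[key] = value
    save_dict(path, mydict)

# A later run of the program would see everything saved so far.
reloaded = load_dict(path)
```

Nothing is lost by using "w" here, because the dict being written already contains every earlier entry.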
