When working with json.dump() I noticed that it appears to be rewriting the entire document. Is this correct, and is there another way to append to the dictionary like .append() deos with lists?
When I write the function like this and change the key value (name), it would appear that the item is being appended.
filename = "infohere.json"
name = "Bob"
numbers = 20
#Write to JSON
def writejson(name = name, numbers = numbers):
with open(filename, "r") as info:
xdict = json.load(info)
xdict[name] = numbers
with open(filename, "w") as info:
json.dump(xdict, info)
When you write it out like this however, you can see that the code clearly writes over the entire dictionary/json file.
filename = infohere.json
dict = {"Bob":23, "Mark":50}
dict2 = {Ricky":40}
#Write to JSON
def writejson2(dict):
with open(filehere, "w") as info:
json.dump(dict, info)
writejson(dict)
writejson(dict2)
In the second example it only ever shows up the last date input leading me to believe that this is rewriting the entire document. If the case is that it writes the whole document during each json.dump, does this cause issues with larger json file, if so is there another method like .append() but for dealing with json.
Thanks in advance.
Neither.
json.dump doesn't decide whether to delete prior content when it writes to a file. That decision happens when you run open(filehere, "w"); that is what deletes old content.
But: Normal JSON isn't amenable to appends.
A single JSON document is one object. There are variants on the format that allow multiple documents in one file, the most common of which is JSONL (which has one JSON document per line). Unless you're using such a format, trying to append JSON to a non-empty file usually won't result in something that can be successfully parsed.
Is there a way, in the code below, to access the variable utterances_dict outside of the with-block? The code below obviously returns the error: ValueError: I/O operation on closed file.
from csv import DictReader
utterances_dict = {}
utterance_file = 'toy_utterances.csv'
with open(utterance_file, 'r') as utt_f:
utterances_dict = DictReader(utt_f)
for line in utterances_dict:
print(line)
I am not an expert on DictReader implementation, however their documentation leaves the implementation open to the reader itself parsing the file after construction. Meaning it may be possible that the underlying file has to remain open until you are done using it. In this case, it would be problematic to attempt to use the utterances_dict outside of the with block because the underlying file will be closed by then.
Even if the current implementation of DictReader does in fact parse the whole csv on construction, it doesn't mean their implementation won't change in the future.
DictReader returns a view of the csv file.
Convert the result to a list of dictionaries.
from csv import DictReader
utterances = []
utterance_file = 'toy_utterances.csv'
with open(utterance_file, 'r') as utt_f:
utterances = [dict(row) for row in DictReader(utt_f) ]
for line in utterances:
print(line)
I'm been learning python and playing around with dictionaries and .csv files and the csv module. It seems like the csv.DictReader() function can help turn .csv files into dictionary objects, but there's a bit of a quirk with the Reader objects that I'm confused about.
I've read a little bit into the documentation (and then tried to find answers looking up at the csv.Reader() function), but I'm still a little unsure.
Why does this code run as expected:
with open("cool_csv.csv") as cool_csv_file:
cool_csv_text = cool_csv_file.read()
print(cool_csv_text)
and yet the following code returns a ValueError: I/O operation on closed file.
with open("cool_csv.csv") as cool_csv_file:
cool_csv_dict = csv.DictReader(cool_csv_file)
for row in cool_csv_dict:
print(row["Cool Fact"])
Since we saved the DictReader object to a python variable, shouldn’t we be able to call the variable after we close the file, like if I were assigned cool cool_csv.read()?
I know the proper way to code this would be:
with open("cool_csv.csv") as cool_csv_file:
cool_csv_dict = csv.DictReader(cool_csv_file)
for row in cool_csv_dict:
print(row["Cool Fact"])
But why does the for row in cool_csv_dict: section have to be nested in the open() section?
My only guess would be that because the csv.DictReader() object is not quite an actual dictionary (or something like that), there’s some shenanigans because it still needs to point somewhere (because maybe thats the "reader" part?).
Can anyone shed any light?
csv.DictReader doesn't read the entire file into memory when you create the cool_csv_dict object. Each time you call it to get the next record from the CSV, it reads the next line from cool_csv_file. Therefore, it needs this to be kept open so it can read from it as needed.
The argument to csv.DictReader can be any iterator that returns lines. So if you don't want to keep the file open, you can call readlines() to get all the lines into a list, then pass that to csv.DictReader
with open("cool_csv.csv") as cool_csv_file:
cool_csv_lines = cool_csv_file.readlines()
cool_csv_dict = csv.DictReader(cool_csv_lines)
for row in cool_csv_dict:
print(row("Cool Fact")
I'm not sure how to word my question exactly, and I have seen some similar questions asked but not exactly what I'm trying to do. If there already is a solution please direct me to it.
Here is what I'm trying to do:
At my work, we have a few pkgs we've built to handle various data types. One I am working with is reading in a csv file into a std_io object (std_io is our all-purpose object class that reads in any type of data file).
I am trying to connect this to another pkg I am writing, so I can make an object in the new pkg, and covert it to a std_io object.
The problem is, the std_io object is meant to read an actual file, not take in an object. To get around this, I can basically write my data to temp.csv file then read it into a std_io object.
I am wondering if there is a way to eliminate this step of writing the temp.csv file.
Here is my code:
x #my object
df = x.to_df() #object class method to convert to a pandas dataframe
df.to_csv('temp.csv') #write data to a csv file
std_io_obj = std_read('temp.csv') #read csv file into a std_io object
Is there a way to basically pass what the output of writing the csv file would be directly into std_read? Does this make sense?
The only reason I want to do this is to avoid having to code additional functionality into either of the pkgs to directly accept an object as input.
Hope this was clear, and thanks to anyone who contributes.
For those interested, or who may have this same kind of issue/objective, here's what I did to solve this problem.
I basically just created a temporary named file, linked a .csv filename to this temp file, then passed it into my std_read function which requires a csv filename as an input.
This basically tricks the function into thinking it's taking the name of a real file as an input, and it just opens it as usual and uses csvreader to parse it up.
This is the code:
import tempfile
import os
x #my object I want to convert to a std_io object
text = x.to_df().to_csv() #object class method to convert to a pandas dataframe then generate the 'text' of a csv file
filename = 'temp.csv'
with tempfile.NamedTemporaryFile(dir = os.path.dirname('.')) as f:
f.write(text.encode())
os.link(f.name, filename)
stdio_obj = std_read(filename)
os.unlink(filename)
del f
FYI - the std_read function essentially just opens the file the usual way, and passes it into csvreader:
with open(filename, 'r') as f:
rdr = csv.reader(f)
I have some json files with 500MB.
If I use the "trivial" json.load() to load its content all at once, it will consume a lot of memory.
Is there a way to read partially the file? If it was a text, line delimited file, I would be able to iterate over the lines. I am looking for analogy to it.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
print prefix, the_type, value
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), theType describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't
change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a
program that parses just one, and pass each one in from a shell script, or from another python
process that calls your script via subprocess.Popen. This is a little less elegant, but if
nothing else works, it will ensure that you're not holding on to stale data from one file to the
next.
Hope this helps.
Yes.
You can use jsonstreamer SAX-like push parser that I have written which will allow you to parse arbitrary sized chunks, you can get it here and checkout the README for examples. Its fast because it uses the 'C' yajl library.
It can be done by using ijson. The working of ijson has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each json from the list. For example, file content is as below
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
def extract_json(filename):
with open(filename, 'rb') as input_file:
jsonobj = ijson.items(input_file, 'item')
jsons = (o for o in jsonobj)
for j in jsons:
print(j)
Note: 'item' is the default prefix given by ijson.
if you want to access only specific json's based on a condition you can do it in following way.
def extract_tabtype(filename):
with open(filename, 'rb') as input_file:
objects = ijson.items(input_file, 'item.drugs')
tabtype = (o for o in objects if o['type'] == 'tablet')
for prop in tabtype:
print(prop)
This will print only those json whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your json content. If possible, I would consider generating a number of managable files, instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
in addition to #codeape
I would try writing a custom json parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what #codeape suggests - break the file up into smaller chunks, etc
You can parse the JSON file to CSV file and you can parse it line by line:
import ijson
import csv
def convert_json(self, file_path):
did_write_headers = False
headers = []
row = []
iterable_json = ijson.parse(open(file_path, 'r'))
with open(file_path + '.csv', 'w') as csv_file:
csv_writer = csv.writer(csv_file, ',', '"', csv.QUOTE_MINIMAL)
for prefix, event, value in iterable_json:
if event == 'end_map':
if not did_write_headers:
csv_writer.writerow(headers)
did_write_headers = True
csv_writer.writerow(row)
row = []
if event == 'map_key' and not did_write_headers:
headers.append(value)
if event == 'string':
row.append(value)
So simply using json.load() will take a lot of time. Instead, you can load the json data line by line using key and value pair into a dictionary and append that dictionary to the final dictionary and convert it to pandas DataFrame which will help you in further analysis
def get_data():
with open('Your_json_file_name', 'r') as f:
for line in f:
yield line
data = get_data()
data_dict = {}
each = {}
for line in data:
each = {}
# k and v are the key and value pair
for k, v in json.loads(line).items():
#print(f'{k}: {v}')
each[f'{k}'] = f'{v}'
data_dict[i] = each
Data = pd.DataFrame(data_dict)
#Data will give you the dictionary data in dataFrame (table format) but it will
#be in transposed form , so will then finally transpose the dataframe as ->
Data_1 = Data.T