Python error trying to append JSON file

I'm relatively new to Python, so my issue may be a simple one to fix, but after days of trying and searching the internet I can't find anything.
So I built a script to stream data from Twitter and store the collected data in a JSON file so that I can later access it and do whatever. This script pulls the user credentials like the consumer key, token, and access info from a separate file to authenticate (I'm sure there is a better and more secure way to do that; this is just a proof of concept at the moment) using this code:
with open('Twitter_Credentials.json', mode='a+') as tc:
    data = json.load(tc)
    if user not in data['names']:
        user_dict = dict()
        user_dict[user] = {'key': '', 'secret': '', 'token': '', 'token_secret': ''}
        user_dict[user]['key'] = input('Twitter Consumer Key: ')
        user_dict[user]['secret'] = input('Twitter Consumer Secret: ')
        user_dict[user]['token'] = input('Twitter Access Token: ')
        user_dict[user]['token_secret'] = input('Twitter Access Secret: ')
        data['names'].append(user_dict)
        json.dump(data, tc, indent=2, ensure_ascii=False)
tc.close()
The issue I am having is that if I want to append another user and their credentials to this file I keep receiving this error:
File "(filepath)", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
THINGS I HAVE ALREADY TRIED:
Modifying the mode using 'r', 'r+', 'w', 'w+'
Changing load() and dump() to loads() and dumps()
Changing the encoding
Using 'r+' and 'w+' did not give me an error, but it did duplicate the original user so that they appeared multiple times. I want to eliminate that so that when it appends it does not duplicate. Any insight would be greatly appreciated. Thanks in advance.

A JSON file is a file containing a single JSON document. If you append another JSON string to it, it's no longer a JSON file.
As the docs say:
Note: Unlike pickle and marshal, JSON is not a framed protocol, so trying to serialize multiple objects with repeated calls to dump() using the same fp will result in an invalid JSON file.
If you aren't actually trying to store multiple documents in one file, the fix is easy: what you want to do is open the file, load it, modify the data, then open the file again and overwrite it. Like this:
with open('Twitter_Credentials.json', mode='r') as tc:
    data = json.load(tc)
if user not in data['names']:
    # blah blah
with open('Twitter_Credentials.json', mode='w') as tc:
    json.dump(data, tc, indent=2, ensure_ascii=False)
Notice that I'm using w mode, not a, because we want to overwrite the old file with the new one, not tack stuff onto the end of it.
If you are trying to store multiple documents, then you can't do that with JSON. Fortunately, there are very simple framed protocols based on JSON—JSONlines, NDJ, etc.—that are commonly used. There are three or four different such formats with minor differences, but the key to all of them is that each JSON document goes on a line by itself, with newlines between the documents.
But using ensure_ascii=False means you're not escaping newlines in strings, and indent=2 means you're adding more newlines between fields within the document, and then you aren't doing anything to write a newline after each document. So, your output isn't valid JSONlines any more than it's valid JSON.
Also, even if you fix all that, you're doing a single json.load, which would read only the first document out of a JSONlines file, and then doing json.dump to the same file, which would write a second document after that one, overwriting whatever's there. You could easily end up overwriting, say, half of the previous second document, leaving the other half behind as garbage to read later. So, you need to rethink your logic. At minimum, you want to do the same thing as above, opening the file twice:
with open('Twitter_Credentials.json', mode='r') as tc:
    data = json.load(tc)
if user not in data['names']:
    # blah blah
with open('Twitter_Credentials.json', mode='a') as tc:
    json.dump(data, tc)
    tc.write('\n')
This time I'm using a mode, because this time we do want to tack a new line onto the end of an existing file.
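For completeness, here is a minimal sketch of reading such a file back later, parsing each line separately instead of calling json.load() once (assuming every record was written on its own line as above):

import json

records = []
with open('Twitter_Credentials.json', mode='r') as tc:
    for line in tc:
        line = line.strip()
        if line:                          # skip any blank lines
            records.append(json.loads(line))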

Related

How to remove empty space from front of JSON object?

I am trying to process a large JSON file using the following code:
import zstandard as zst
import pandas as pd

dctx = zst.ZstdDecompressor(max_window_size=2147483648)

with open(filename + ".zst", 'rb') as infile, open(outpath, 'wb') as outfile:
    dctx.copy_stream(infile, outfile)

with pd.read_json(filename + ".json", lines=True, chunksize=5000) as reader:
    # Making a list of column headers
    df_titles = []
    for chunk in reader:
        chunk_titles = list(chunk.keys())
        df_titles.extend(chunk_titles)
    df_titles = list(set(df_titles))
However, when I attempt to run the code, I get an error message: ValueError: Expected object or value. The file is formatted with one JSON object per line, and looking at the JSON file itself, it seems the issue is that one of the JSON objects has a bunch of empty space in front of it.
If I manually delete the 'nul' line, the file processes with no issues. However, for the sake of reproducibility, I would like to be able to address the issue from within my code itself. I'm pretty new to working in Python, and I have tried googling the issue, but solutions seem to focus on removing white space from the beginning of JSON values, rather than the start of a line in this kind of file. Is there any easy way to deal with this issue either when decompressing the initial file, or reading the decompressed file in?
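One possible workaround (a sketch, assuming every other line of the decompressed file is valid JSON and only blank or whitespace-only lines are the problem) is to filter those lines out before handing the data to pandas, for example via an in-memory buffer:

import io

import pandas as pd

# Drop blank/whitespace-only lines before parsing; filename matches the
# decompressed output above.
clean = io.StringIO()
with open(filename + ".json", "r") as infile:
    for line in infile:
        if line.strip():          # keep only lines that contain data
            clean.write(line)
clean.seek(0)

with pd.read_json(clean, lines=True, chunksize=5000) as reader:
    df_titles = set()
    for chunk in reader:
        df_titles.update(chunk.columns)
df_titles = list(df_titles)

For very large files you could write the filtered lines to a temporary file instead of a StringIO buffer.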

Does json.dump() in python rewrite or append a JSON file

When working with json.dump() I noticed that it appears to be rewriting the entire document. Is this correct, and is there another way to append to the dictionary, like .append() does with lists?
When I write the function like this and change the key value (name), it would appear that the item is being appended.
filename = "infohere.json"
name = "Bob"
numbers = 20
#Write to JSON
def writejson(name = name, numbers = numbers):
with open(filename, "r") as info:
xdict = json.load(info)
xdict[name] = numbers
with open(filename, "w") as info:
json.dump(xdict, info)
When you write it out like this however, you can see that the code clearly writes over the entire dictionary/json file.
filename = "infohere.json"
dict1 = {"Bob": 23, "Mark": 50}
dict2 = {"Ricky": 40}

# Write to JSON
def writejson2(somedict):
    with open(filename, "w") as info:
        json.dump(somedict, info)

writejson2(dict1)
writejson2(dict2)
In the second example only the last dictionary written ever shows up, leading me to believe that this is rewriting the entire document. If it does write the whole document during each json.dump, does this cause issues with larger JSON files, and if so, is there another method like .append() for dealing with JSON?
Thanks in advance.
Neither.
json.dump doesn't decide whether to delete prior content when it writes to a file. That decision happens when you open the file with open(filename, "w"); the "w" mode is what deletes the old content.
But: Normal JSON isn't amenable to appends.
A single JSON document is one object. There are variants on the format that allow multiple documents in one file, the most common of which is JSONL (which has one JSON document per line). Unless you're using such a format, trying to append JSON to a non-empty file usually won't result in something that can be successfully parsed.
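If appending really is what you want, a minimal JSONL-style sketch (the file name and record shape here are just illustrative) could look like this:

import json

def append_record(filename, name, number):
    # Each call adds one self-contained JSON document on its own line.
    with open(filename, "a") as info:
        json.dump({name: number}, info)
        info.write("\n")

def read_records(filename):
    # Rebuild a single dict by replaying every line in order.
    combined = {}
    with open(filename, "r") as info:
        for line in info:
            if line.strip():
                combined.update(json.loads(line))
    return combined

append_record("infohere.jsonl", "Bob", 23)
append_record("infohere.jsonl", "Ricky", 40)
print(read_records("infohere.jsonl"))   # {'Bob': 23, 'Ricky': 40}

Each call to append_record only adds one line, so nothing already in the file is rewritten; reading simply replays the lines in order.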

When extracting my .json.gz file, some characters are added to it - and the file cannot be stored as a json file

I am trying to unzip some .json.gz files, but gzip adds some characters to it, and hence makes it unreadable for JSON.
What do you think is the problem, and how can I solve it?
If I use unzipping software such as 7zip to unzip the file, this problem disappears.
This is my code:
import gzip
import json

with gzip.open('filename', 'rb') as f:
    json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used this code:
with gzip.open('filename', mode='rb') as f:
    print(f.read())
and realized that the file starts with b' (as shown below):
b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"
I think b' is what makes the file unworkable for the next stage. Do you have any solution to remove the b'? There are millions of this zipped file, and I cannot manually do that.
I uploaded a sample of these files in the following link
just a few json.gz files
The problem isn't with that b prefix you're seeing with print(f.read()), which just means the data is a bytes sequence (i.e. integer ASCII values) not a sequence of UTF-8 characters (i.e. a regular Python string) — json.loads() will accept either. The JSONDecodeError is because the data in the gzipped file isn't in valid JSON format, which is required. The format looks like something known as JSON Lines — which the Python standard library json module doesn't (directly) support.
Dunes' answer to the question that Charles Duffy at one point marked this as a duplicate of wouldn't have worked as presented, because of this formatting issue. However, from the sample file you added a link to in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line by line.
Here's what I mean:
import json
import gzip

filename = '00_activities.json.gz'  # Sample file.

json_content = []
with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)

print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.
Note that the output it prints shows what valid JSON might have looked like.
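As a side note, gzip.open() also accepts text mode, so an equivalent sketch that avoids dealing with bytes at all would be:

import gzip
import json

json_content = []
with gzip.open(filename, 'rt', encoding='utf-8') as gzip_file:
    for line in gzip_file:
        line = line.strip()
        if line:
            json_content.append(json.loads(line))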

My program sometimes writes an extra ] or } at the end of data in json file?

I have written a note-taking tool for myself as my first program. It's actually working really well for the most part; however, sometimes the program will write an extra ] or } at the end of the list or dict stored inside the JSON file.
It doesn't happen often, and I think it only happens when I am writing new lines of code or changing existing lines that read/write to those files. I am not 100% sure, but that is what it looks like.
For example, I have a single list stored in a file, and I use the indent="" flag to make sure that as the program writes the files they are a little more readable for me if I ever have to edit them. Sometimes, when running my program after changing or adding some code, I get an error stating a file has "extra data" in it.
The error looks something like this:
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 6 column 2 (char 5791)
and the cause of the error would be something like this:
[
"Help",
"DataTypes",
"test",
"Variables",
]] # the error would be caused by this extra ] at the end of the list
What I don't understand is why the program sometimes adds an extra ] or } at the end of the data in my JSON files.
Is there something I am doing wrong when I open the file or dump to the file?
Here are some sections of code I have that are used to open files and dump to files:
path = "./NotesKeys/"
notebook = dict()
currentWorkingLib = ""
currentWorkingKeys = ""
#~~~~~~~~~~~~~~~~~~~< USE TO open all files in Directory >~~~~~~~~~~~~~~~~~~~
with open("%s%s"%(path,"list_of_all_filenames"), "r") as listall:
list_of_all_filenames = json.load(listall)
def openAllFiles(event=None):
global path
for filename in os.listdir(path):
with open(path+filename, "r+") as f:
notebook[filename] = json.load(f)
openAllFiles()
And here is how I am updating the data in the file. Just ignore e1Current, e1allcase, and e2Current; they are used to keep the user's input for filenames (dict keys) lower case in the dictionaries where the notes are stored, and to maintain the case the user entered for a display list. This should not be related to the file read/write issue:
Edit: removed unrelated code per commenter's request.
#~~~~~~~~~~~~~~~~~~~< UPDATE selected_notes! >~~~~~~~~~~~~~~~~~~~
dict_to_be_updated = notebook[currentWorkingLib]
dict_to_be_updated[e1Current] = e2Current
with open("%s%s" % (path, currentWorkingLib), "r+") as working_temp_var:
    json.dump(dict_to_be_updated, working_temp_var, indent="")
I am aware of how to open a file and use the data and how to dump data to said file and update the content loaded in the variables of the program based off the newly dumped data.
Am I missing something important during this process? Should I be doing something to ensure data integrity in the json files?
You are opening files in read-write mode, r+:
with open("%s%s"%(path,currentWorkingLib),"r+") as working_temp_var:
This means you'll be writing to a file that already has data in it, and sometimes the existing data is longer than what you are now writing to the file. That means you'll end up with some trailing data at the end.
You can see this by writing a shorter demo string to a file, then using r+ to write less data to the same file, then reading again:
>>> with open('/tmp/demo', 'w') as init:
...     init.write('The quick brown fox jumps over the lazy dog\n')
...
44
>>> with open('/tmp/demo', 'r+') as readwrite:
...     readwrite.write("Monty Python's flying circus\n")
...
29
>>> with open('/tmp/demo', 'r') as result:
...     print(result.read())
...
Monty Python's flying circus
r the lazy dog
Don't do this. Use w write mode so the file is truncated first:
with open("%s%s"%(path,currentWorkingLib), "w") as working_temp_var:
This ensures your file is cut back to size 0 before you write a new JSON document.
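If you do need to keep the file open in r+ (for example, to read and then rewrite it in the same with block), a sketch of a safer pattern is to rewind and truncate before dumping, so no stale bytes survive. This uses the variable names from the question (path, currentWorkingLib, e1Current, e2Current):

import json

with open("%s%s" % (path, currentWorkingLib), "r+") as working_temp_var:
    data = json.load(working_temp_var)    # read the current contents
    data[e1Current] = e2Current           # apply the update
    working_temp_var.seek(0)              # go back to the start of the file
    json.dump(data, working_temp_var, indent="")
    working_temp_var.truncate()           # drop any leftover old data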

How to read a JSON file in python? [duplicate]

I have some JSON files of 500MB.
If I use the "trivial" json.load() to load their content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a plain text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen (see the sketch after this answer). This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
Hope this helps.
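Here is that sketch of the separate-process approach; parse_one.py is a hypothetical script that loads and processes exactly one file, and subprocess.run is the modern wrapper around subprocess.Popen:

import subprocess
import sys

# parse_one.py (hypothetical) loads exactly one JSON file, does its work,
# and exits, so all of its memory is returned to the OS after each run.
for json_file in list_of_files:
    subprocess.run([sys.executable, "parse_one.py", json_file], check=True)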
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrary-sized chunks; you can get it here and check out the README for examples. It's fast because it uses the C yajl library.
It can be done by using ijson. How ijson works has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, suppose the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the prefix ijson uses for each element of a top-level JSON array.
If you want to access only specific JSON objects based on a condition, you can do it in the following way.
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those json whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your json content. If possible, I would consider generating a number of managable files, instead of one huge file.
Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well, although you might run into the same problem loading the JSON in the first place; avoid that by loading the files one at a time.
If this path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
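A minimal sketch of what that refactor looks like (the per-file work shown is just a stand-in):

import json
import sys

def process_file(path):
    # Everything loaded here is local to the function, so it becomes
    # garbage as soon as the call returns.
    with open(path) as f:
        data = json.load(f)
    return len(data)    # stand-in for the real per-file work

def main(paths):
    for path in paths:
        print(path, process_file(path))

if __name__ == "__main__":
    main(sys.argv[1:])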
In addition to @codeape:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
You can convert the JSON file to a CSV file and then process it line by line:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time and memory. Instead, you can load the JSON data line by line into a dictionary of key/value pairs, add each dictionary to a final dictionary, and convert that to a pandas DataFrame, which will help with further analysis.
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair of each JSON object
    for k, v in json.loads(line).items():
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

# Data holds the dictionary data as a DataFrame (table format), but it is
# transposed, so finally transpose the DataFrame:
Data = pd.DataFrame(data_dict)
Data_1 = Data.T
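If the file is in JSON Lines form (one document per line), a shorter sketch is to let pandas do the line-by-line parsing itself, in chunks, via lines=True:

import pandas as pd

# lines=True parses one JSON document per line; chunksize keeps memory
# bounded by yielding DataFrames of 10,000 rows at a time.
reader = pd.read_json('Your_json_file_name', lines=True, chunksize=10000)
Data_1 = pd.concat(chunk for chunk in reader)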
