How to remove empty space from front of JSON object? - python

I am trying to process a large JSON file using the following code:

    import zstandard as zst
    import pandas as pd

    dctx = zst.ZstdDecompressor(max_window_size=2147483648)
    with open(filename + ".zst", 'rb') as infile, open(outpath, 'wb') as outfile:
        dctx.copy_stream(infile, outfile)

    with pd.read_json(filename + ".json", lines=True, chunksize=5000) as reader:
        # Making a list of column headers
        df_titles = []
        for chunk in reader:
            chunk_titles = list(chunk.keys())
            df_titles.extend(chunk_titles)
        df_titles = list(set(df_titles))
However, when I attempt to run the code, I get an error message: ValueError: Expected object or value. The file is formatted with one JSON object per line, and looking at the JSON file itself, it seems the issue is that one of the JSON objects has a bunch of empty space in front of it.
If I manually delete the 'nul' line, the file processes with no issues. However, for the sake of reproducibility, I would like to be able to address the issue from within my code itself. I'm pretty new to working in Python, and I have tried googling the issue, but solutions seem to focus on removing whitespace from the beginning of JSON values, rather than from the start of a line in this kind of file. Is there any easy way to deal with this issue, either when decompressing the initial file or when reading the decompressed file in?
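Not part of the original post, but one way to make this reproducible: assuming the decompressed data is JSON Lines with an occasional blank or whitespace-only line, those lines can be dropped during decompression using zstandard's stream_reader (a minimal sketch; filename and outpath as in the question):

    import io
    import zstandard as zst

    # Sketch: decompress line by line, writing only non-blank lines, so the
    # output file is clean JSON Lines that pandas can read directly.
    dctx = zst.ZstdDecompressor(max_window_size=2147483648)
    with open(filename + ".zst", 'rb') as infile, open(outpath, 'wb') as outfile:
        with dctx.stream_reader(infile) as reader:
            for line in io.TextIOWrapper(reader, encoding='utf-8'):
                if line.strip():  # Skip whitespace-only lines.
                    outfile.write(line.encode('utf-8'))

Alternatively, skipping blank lines while reading the already-decompressed file line by line with json.loads would avoid a second pass over the compressed data.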

Related

Python: Cannot print JSON records from large (7gb+) JSON file

I am using Python 3.10.5 and Spyder. So I have been trying to convert a large (7 GB) JSON file to a CSV format. I eventually want this to work on 1 TB+ JSON files (I know... can't be helped), but I have had issues manipulating the file. To conserve memory, I had the idea to stream the JSON records and append them to a CSV file piece by piece. I tried writing from the stream first, and when that would not work, I tried simply printing individual JSON records to the console, but nothing happens at all, not even an error. This is my code:
For the CSV conversion:
    import ijson
    import pandas as pd

    with open(lrg_file, "rb") as f:
        for record in ijson.items(f, "item"):
            row_x = pd.Series(record, index=['id', 'type', 'actor', 'repo',
                                             'payload', 'public', 'created_at'])
            row_x.to_csv(dump_out, mode='a', index=False, header=False)
For printing to console:
    import json
    import ijson

    with open(lrg_file, "rb") as f:
        for record in ijson.items(f, "item"):
            obj = json.dumps(record, indent=4)
            print(obj)
My code works on smaller files (2-25 MB) but the same code fails on the larger file; it simply runs silently without printing or throwing an error or doing anything at all. I did find that I can use the ijson.parse function and it will print from the larger file, using the following code:
    with open(lrg_file, "rb") as f:
        parser = ijson.parse(f)
        for record in parser:
            obj = json.dumps(record, indent=4)
            print(obj)
That allows me to see the prefix/event/value trio for each event in the file, but it does not make for an efficient means of converting to .CSV. Any explanation for why my larger file just won't work with my CSV conversion code?
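One likely explanation, though it's an assumption the post doesn't confirm: ijson.items(f, "item") only yields elements of a single top-level JSON array. If the 7 GB file is instead a stream of concatenated or newline-delimited JSON documents, that generator matches nothing, the loop body never runs, and the script exits silently, which is exactly the behavior described. ijson can consume multiple top-level documents with multiple_values=True; a minimal sketch:

    import ijson
    import pandas as pd

    # Sketch: treat the file as concatenated top-level JSON documents
    # rather than as items of one big array.
    with open(lrg_file, "rb") as f:
        for record in ijson.items(f, "", multiple_values=True):
            row_x = pd.Series(record, index=['id', 'type', 'actor', 'repo',
                                             'payload', 'public', 'created_at'])
            row_x.to_csv(dump_out, mode='a', index=False, header=False)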

When extracting my .json.gz file, some characters are added to it - and the file cannot be stored as a json file

I am trying to unzip some .json.gz files, but gzip adds some characters to them, which makes them unreadable for JSON.
What do you think is the problem, and how can I solve it?
If I use unzipping software such as 7zip to unzip the file, this problem disappears.
This is my code:
    import gzip
    import json

    with gzip.open('filename', 'rb') as f:
        json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used this code:
    with gzip.open('filename', mode='rb') as f:
        print(f.read())
and realized that the file starts with b' (as shown below):

    b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"

I think the b' is what makes the file unworkable for the next stage. Do you have any solution to remove the b'? There are millions of these zipped files, and I cannot do that manually.
I uploaded a sample of these files in the following link
just a few json.gz files
The problem isn't with that b prefix you're seeing with print(f.read()), which just means the data is a bytes sequence (i.e. integer ASCII values) not a sequence of UTF-8 characters (i.e. a regular Python string) — json.loads() will accept either. The JSONDecodeError is because the data in the gzipped file isn't in valid JSON format, which is required. The format looks like something known as JSON Lines — which the Python standard library json module doesn't (directly) support.
Dunes' answer to the question that Charles Duffy, at one point, marked this as a duplicate of wouldn't have worked as presented because of this formatting issue. However, from the sample file you linked to in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line by line.
Here's what I mean:
    import json
    import gzip

    filename = '00_activities.json.gz'  # Sample file.
    json_content = []
    with gzip.open(filename, 'rb') as gzip_file:
        for line in gzip_file:  # Read one line.
            line = line.rstrip()
            if line:  # Any JSON data on it?
                obj = json.loads(line)
                json_content.append(obj)

    print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.
Note that the output it prints shows what valid JSON might have looked like.

Python error trying to append JSON file

I'm relatively new to Python, so my issue may be a simple one to fix, but after days of trying and searching the internet I can't find anything.
So I built a script to stream data from Twitter and store the collected data in a JSON file so that I can later access it and do whatever. This script pulls the user credentials, like the consumer key, token, and access info, from a separate file to authenticate (I'm sure there is a better and more secure way to do that; this is just a proof of concept at the moment) using this code:
    with open('Twitter_Credentials.json', mode='a+') as tc:
        data = json.load(tc)
        if user not in data['names']:
            user_dict = dict()
            user_dict[user] = {'key': '', 'secret': '', 'token': '', 'token_secret': ''}
            user_dict[user]['key'] = input('Twitter Consumer Key: ')
            user_dict[user]['secret'] = input('Twitter Consumer Secret: ')
            user_dict[user]['token'] = input('Twitter Access Token: ')
            user_dict[user]['token_secret'] = input('Twitter Access Secret: ')
            data['names'].append(user_dict)
            json.dump(data, tc, indent=2, ensure_ascii=False)
    tc.close()
The issue I am having is that if I want to append another user and their credentials to this file I keep receiving this error:
File "(filepath)", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
THINGS I HAVE ALREADY TRIED:
Modifying the mode using 'r', 'r+', 'w', 'w+'
Changing load() and dump() to loads() and dumps()
Changing the encoding
Using 'r+' and 'w+' did not give me an error, but it did duplicate the original user so that they appeared multiple times. I want to eliminate that so that when it appends it does not duplicate. Any insight would be greatly appreciated. Thanks in advance.
A JSON file is a file containing a single JSON document. If you append another JSON string to it, it's no longer a JSON file.
As the docs say:
Note: Unlike pickle and marshal, JSON is not a framed protocol, so trying to serialize multiple objects with repeated calls to dump() using the same fp will result in an invalid JSON file.
If you aren't actually trying to store multiple documents in one file, the fix is easy: what you want to do is open the file, load it, modify the data, then open the file again and overwrite it. Like this:
    with open('Twitter_Credentials.json', mode='r') as tc:
        data = json.load(tc)
    if user not in data['names']:
        # blah blah
    with open('Twitter_Credentials.json', mode='w') as tc:
        json.dump(data, tc, indent=2, ensure_ascii=False)
Notice that I'm using w mode, not a, because we want to overwrite the old file with the new one, not tack stuff onto the end of it.
If you are trying to store multiple documents, then you can't do that with JSON. Fortunately, there are very simple framed protocols based on JSON—JSONlines, NDJ, etc.—that are commonly used. There are three or four different such formats with minor differences, but the key to all of them is that each JSON document goes on a line by itself, with newlines between the documents.
And indent=2 means you're adding newlines between fields within each document, while nothing in your code writes a newline after each document. So, your output isn't valid JSONlines any more than it's valid JSON.
Also, even if you fix all that, you're doing a single json.load, which would read only the first document out of a JSONlines file, and then doing json.dump to the same file, which would write a second document after that one, overwriting whatever's there. You could easily end up overwriting, say, half of the previous second document, leaving the other half behind as garbage to read later. So, you need to rethink your logic. At minimum, you want to do the same thing as above, opening the file twice:
    with open('Twitter_Credentials.json', mode='r') as tc:
        data = json.load(tc)
    if user not in data['names']:
        # blah blah
    with open('Twitter_Credentials.json', mode='a') as tc:
        json.dump(data, tc)
        tc.write('\n')
This time I'm using a mode, because this time we do want to tack a new line onto the end of an existing file.
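For completeness, reading every document back out of such a JSON Lines file might look like this (a sketch, assuming one JSON document per line as written above):

    import json

    documents = []
    with open('Twitter_Credentials.json', mode='r') as tc:
        for line in tc:
            line = line.strip()
            if line:  # Skip blank lines.
                documents.append(json.loads(line))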

Replace string in specific line of nonstandard text file

Similar to this posting: Replace string in a specific line using python; however, results were not forthcoming in my slightly different instance.
I am working with Python 3 on Windows 7. I am attempting to batch edit some files in a directory. They are basically text files with a .LIC extension. I'm not sure if that is relevant to my issue here. I am able to read the file into Python without issue.
My aim is to replace a specific string on a specific line in this file.
    import os
    import re

    groupname = 'Oldtext'
    aliasname = 'Newtext'
    with open('filename') as f:
        data = f.readlines()
        data[1] = re.sub(groupname, aliasname, data[1])
        f.writelines(data[1])
    print(data[1])
    print('done')
When running the above code I get an UnsupportedOperation: not writable error, so I am having some issue writing the changes back to the file. Based on the suggestions of other posts, I added the w option to the open('filename', "w") call. This causes all text in the file to be deleted.
Based on another suggestion, the r+ option was tried. This leads to successful editing of the file; however, instead of editing the correct line, the edited line is appended to the end of the file, leaving the original intact.
Writing a changed line into the middle of a text file is not going to work unless it's exactly the same length as the original - which is the case in your example, but you've got some obvious placeholder text there so I have no idea if the same is true of your actual application code. Here's an approach that doesn't make any such assumption:
    with open('filename', 'r') as f:
        data = f.readlines()
    data[1] = re.sub(groupname, aliasname, data[1])
    with open('filename', 'w') as f:
        f.writelines(data)
EDIT: If you really wanted to write only the single line back into the file, you'd need to use f.tell() BEFORE reading the line, to remember its position within the file, and then f.seek() to go back to that position before writing.
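A minimal sketch of that tell/seek approach (assuming, as noted above, that the replacement is exactly the same length as the original):

    import re

    groupname = 'Oldtext'
    aliasname = 'Newtext'
    with open('filename', 'r+') as f:
        f.readline()                    # Skip line 0.
        pos = f.tell()                  # Remember where line 1 starts.
        line = f.readline()             # Read line 1.
        new_line = re.sub(groupname, aliasname, line)
        if len(new_line) == len(line):  # Only safe for same-length text.
            f.seek(pos)                 # Jump back to the start of line 1.
            f.write(new_line)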

How to convert a large Json file into a csv using python

(Python 3.5)
I am trying to parse a large user review.json file (1.3 GB) into Python and convert it to a .csv file. I have tried looking for a simple converter tool online, most of which accept a maximum file size of 1 MB or are super expensive.
As I am fairly new to Python, I guess I'll ask 2 questions.
Is it even possible/efficient to do so, or should I be looking for another method?
I tried the following code; it only reads and writes the top 342 lines in my .json doc before returning an error.
File "C:\Anaconda3\lib\json__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Anaconda3\lib\json\decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
JSONDecodeError: Extra data
This is the code I'm using:
    import csv
    import json

    infile = open("myfile.json", "r")
    outfile = open("myfile.csv", "w")
    writer = csv.writer(outfile)
    for row in json.loads(infile.read()):
        writer.writerow(row)
my .json example:
Link To small part of Json
My thought is that it's some type of error related to my for loop with json.loads, but I do not know enough about it. Is it possible to create a dictionary {} and convert just the values "user_id", "stars", and "text"? Or am I dreaming.
Any suggestions or criticism are appreciated.
This is not a JSON file; this is a file containing individual lines of JSON. You should parse each line individually.
    for row in infile:
        data = json.loads(row)
        writer.writerow(data)
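Building on that, pulling out just the fields the question mentions ("user_id", "stars", "text") could look like the sketch below; it assumes one JSON object per line with those keys, and the file names are taken from the question:

    import csv
    import json

    fields = ["user_id", "stars", "text"]
    with open("myfile.json", "r") as infile, open("myfile.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(fields)  # Header row.
        for line in infile:
            line = line.strip()
            if line:
                data = json.loads(line)
                writer.writerow(data.get(f) for f in fields)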
Sometimes it's not as easy as having one JSON definition per line of input. A JSON definition can spread out over multiple lines, and it's not necessarily easy to determine which are the start and end braces reading line by line (for example, if there are strings containing braces, or nested structures).
The answer is to use the raw_decode method of json.JSONDecoder to fetch the JSON definitions from the file one at a time. This will work for any set of concatenated valid JSON definitions. It's further described in my answer here: Importing wrongly concatenated JSONs in python
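A minimal sketch of that raw_decode approach (the helper name iter_json_objects is made up for illustration):

    import json

    def iter_json_objects(text):
        # Yield successive JSON documents from a string of concatenated JSON.
        decoder = json.JSONDecoder()
        pos = 0
        while pos < len(text):
            while pos < len(text) and text[pos].isspace():
                pos += 1  # Skip whitespace between documents.
            if pos >= len(text):
                break
            obj, pos = decoder.raw_decode(text, pos)
            yield obj

    # Example: three documents, one of them spread over two lines.
    sample = '{"a": 1}\n{"b":\n2} {"c": 3}'
    for doc in iter_json_objects(sample):
        print(doc)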
