I'm trying to extract a specific value from log files in a directory.
The log files contain JSON data, and I want to extract the value of the id field.
The JSON data looks something like this:
{
id: "123",
name: "foo"
description: "bar baz"
}
My code looks like this:
def test_load_json_directly(self):
    with open('source_data/testing123.json') as log_file:
        data = json.load(log_file)
    print data

def test_load_json_from_iteration(self, dir_path, file_ext):
    path_name = os.path.join(dir_path, '*.' + file_ext)
    files = glob.glob(path_name)
    for filename in files:
        with open(filename) as log_file:
            data = json.load(log_file)
        print data
When I call the function test_load_json_directly, the JSON string gets loaded correctly. No problem there. This is just to check the correct behavior of the json.load function.
The issue is that when I call the function test_load_json_from_iteration, the JSON string is not recognized and an error is raised:
ValueError: No JSON object could be decoded
What am I doing wrong here?
Your JSON is invalid. The property names and the values must be wrapped in quotes (except if they are numbers). You're also missing the commas between members.
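For comparison, here's what a corrected version of that sample would look like, loaded with the standard json module (the values are just the ones from your snippet):

```python
import json

# Same sample data, but valid JSON: keys and string values quoted,
# commas between the members.
fixed = '{"id": "123", "name": "foo", "description": "bar baz"}'

data = json.loads(fixed)
print(data["id"])  # -> 123
```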
The most probable reason for this error is an error in one of your JSON files. Since the Python 2 json module doesn't show detailed errors, you can use the simplejson module to see what's actually happening.
Change your code to:
import simplejson
.
.
.
data = simplejson.load(log_file)
And look at the error message. It will show you the line and the column where it fails.
Ex:
simplejson.errors.JSONDecodeError: Expecting value: line 5 column 17 (char 84)
Hope it helps :) Feel free to ask if you have any doubts.
Related
I am trying to document the reports, visuals, and measures used in a PBIX file. I have a PBIX file (containing some visuals and pointing to a Tabular Model in Live Mode). I exported it as a PBIT and renamed it to .zip. Inside this zip file there is a folder called Report, and within that a file called Layout. The Layout file looks like a JSON file, but when I try to read it via Python,
import json

# Opening JSON file
f = open("C://Layout")

# returns JSON object as a dictionary
#f1 = str.replace("\'", "\"")
data = json.load(f)
I get the issue below:
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Renaming it to Layout.json doesn't help either and gives the same issue. Is there an easy way, or a parser, to specifically parse this Layout file and get the information below out of it?
Report Name | Visual Name | Column or Measure Name
Not sure if you have come across an answer to your question yet, but I have been looking into something similar.
Here is what I had to do in order to get the file to parse correctly.
The big things to note here are the encoding and all the control-character replacements. After this, data will contain the parsed object.
with open('path/to/Layout', 'r', encoding="cp1252") as json_file:
    data_str = json_file.read().replace(chr(0), "").replace(chr(28), "").replace(chr(29), "").replace(chr(25), "")
    data = json.loads(data_str)
This script may help: https://github.com/grenzi/powerbi-model-utilization
a portion of the script is:
def get_layout_from_pbix(pbixpath):
    """
    get_layout_from_pbix loads a pbix file, grabs the layout from it, and returns json
    :parameter pbixpath: file to read
    :return: json goodness
    """
    archive = zipfile.ZipFile(pbixpath, 'r')
    bytes_read = archive.read('Report/Layout')
    s = bytes_read.decode('utf-16-le')
    json_obj = json.loads(s, object_hook=parse_pbix_embedded_json)
    return json_obj
I had a similar issue.
My workaround was to save it as Layout.txt with UTF-8 encoding, then continue as you have.
I am trying to unzip some .json.gz files, but gzip seems to add some characters to them, which makes them unreadable for JSON.
What do you think the problem is, and how can I solve it?
If I use unzipping software such as 7-Zip to unzip the file, this problem disappears.
This is my code:
with gzip.open('filename', 'rb') as f:
    json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used this code:
with gzip.open('filename', mode='rb') as f:
    print(f.read())
and realized that the file starts with b' (as shown below):
b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"
I think the b' is what makes the file unworkable for the next stage. Do you have any solution to remove the b'? There are millions of these zipped files, and I cannot do that manually.
I uploaded a sample of these files in the following link
just a few json.gz files
The problem isn't with that b prefix you're seeing with print(f.read()), which just means the data is a bytes sequence (i.e. integer ASCII values) not a sequence of UTF-8 characters (i.e. a regular Python string) — json.loads() will accept either. The JSONDecodeError is because the data in the gzipped file isn't in valid JSON format, which is required. The format looks like something known as JSON Lines — which the Python standard library json module doesn't (directly) support.
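To illustrate the point with a minimal sketch (the data here is made up): json.loads accepts bytes just fine, but raises as soon as a second object follows the first in the same payload:

```python
import json

# bytes input is fine; the b'' prefix is not the problem
obj = json.loads(b'{"id": "123"}')

# but two objects separated by a newline (JSON Lines) is not one valid document
try:
    json.loads(b'{"id": "1"}\n{"id": "2"}\n')
except json.JSONDecodeError as e:
    print(e)  # Extra data: line 2 column 1 ...
```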
Dunes' answer to the question that Charles Duffy at one point marked this as a duplicate of wouldn't have worked as presented, because of this formatting issue. However, from the sample file you linked in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line-by-line.
Here's what I mean:
import json
import gzip

filename = '00_activities.json.gz'  # Sample file.
json_content = []

with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)

print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.
Note that the output it prints shows what valid JSON might have looked like.
I'm facing this error in Python 3.6.
My JSON file looks like this:
{
"id":"776",
"text":"Scientists have just discovered a bizarre pattern in global weather. Extreme heat waves like the one that hit the Eastern US in 2012, leaving at least 82 dead, don't just come out of nowhere."
}
It's encoded in UTF-8, and I checked it online; it is a valid JSON file. I tried to load it this way:
p = 'doc1.json'
json.loads(p)
I tried this as well:
p = "doc1.json"
with open(p, "r") as f:
    doc = json.load(f)
The error is the same:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Anyone can help? Thank you!
p = 'doc1.json'
json.loads(p)
You're asking the json module to load the string 'doc1.json', which obviously isn't valid JSON; it's a filename.
You want to open the file, read the contents, then load the contents using json.loads():
p = 'doc1.json'
with open(p, 'r') as f:
    doc = json.loads(f.read())
As suggested in the comments, this could be further simplified to:
p = 'doc1.json'
with open(p, 'r') as f:
    doc = json.load(f)
where json.load() takes a file handle and reads it for you.
Aside
First, your path is not really a path. My response won't be about that, but your path should be something like './path/to/the/doc1.json' (this example is a relative path).
TL;DR
json.loads is for loading str objects directly; json.load wants an fp, or file pointer object, which represents a file.
Solution
It appears you are mixing up json.loads and json.load (notice the s in one and not the other). The s stands for string, i.e. the Python type str. There is a very important distinction here: your path is represented by a string, but you actually care about the file as an object.
So of course this breaks, because json.loads tries to parse a str object that is not valid JSON:
path = 'a/path/like/this/is/a/string.json'
json.loads(path)
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
...
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Using this properly would look something like this:
json_str = '{"hello": "world!"}'
json.loads(json_str)
# The expected output.
{'hello': 'world!'}
Since json.loads does not meet our needs (it could work, but it is extra and unnecessary code), we can use its friend json.load. json.load wants its first parameter to be an fp, but what is that? It stands for file pointer, which is a fancy way of saying "an object that represents a file." This has to do with opening a file to do something to or with it. In our case, we want to read the file into json.load.
We will use open() as a context manager, since that is a good thing to do. Note, I do not know what the contents of your doc1.json are, so I replaced the output with my own.
path = 'path/to/the/doc1.json'
with open(path, 'r') as fp:
    print(json.load(fp))

# The expected output.
{'hello': 'world!'}
Generally, I use json.load a lot more than json.loads (with the s), since I read directly from JSON files. If you load some JSON into your code using a third-party package, you may find yourself reading that in your code and then passing it as a str to json.loads.
Resources
Python's json — https://docs.python.org/3/library/json.html#module-json
try:
    studfile = open("students.csv", "r")
except IOError:
    studfile = open("students.csv", "w")

# later in the code
studfile.write(students)
The purpose of this try/except block was to route around the IOError, but I ended up getting another error message: "expected a character buffer object". Any help on how to fix it?
Assuming students is some form of data you wish to save as a CSV file, it's probably best to use Python's built-in csv module. For example:
import csv

with open("students.csv", "wb") as studfile:  # using with is good practice
    ...
    csv_writer = csv.writer(studfile)
    csv_writer.writerow(students)  # assuming students is a list of data
This is the TypeError you are getting: the students value should be a string when writing to a file. Using str(students) should solve your problem.
EDIT:
str can convert any object to a string. Considering the comments below: you didn't mention the type of students, but assuming it is a list of strings, you can't write it like this: studfile.write(students).
You should do something like this:
for entry in students:
    studfile.write(entry)  # decide whether to add a newline character or not
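For instance, a minimal sketch that writes each entry on its own line (the students list here is made-up sample data, and the newline choice is one of the options mentioned above):

```python
students = ["alice,90", "bob,85"]  # made-up sample records

# Write each entry on its own line; the "\n" is the newline decision
# from the comment above made explicit.
with open("students.csv", "w") as studfile:
    for entry in students:
        studfile.write(entry + "\n")
```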
I have around 2000 JSON files which I'm trying to run through a Python program. A problem occurs when a JSON file is not in the correct format (Error: ValueError: No JSON object could be decoded); in turn, I can't read it into my program.
I am currently doing something like the below:
for files in folder:
    with open(files) as f:
        data = json.load(f)  # It causes an error at this part
I know there are offline methods to validate and format JSON files, but is there a programmatic way to check and format them? If not, is there a free/cheap alternative to fixing all of these files offline, i.e. one where I just run a program on the folder containing all the JSON files and it formats them as required?
SOLVED using @reece's comment:
import os
import simplejson

invalid_json_files = []
read_json_files = []

def parse():
    for files in os.listdir(os.getcwd()):
        with open(files) as json_file:
            try:
                simplejson.load(json_file)
                read_json_files.append(files)
            except ValueError, e:
                print ("JSON object issue: %s") % e
                invalid_json_files.append(files)
    print invalid_json_files, len(read_json_files)
Turns out that I was saving a file which is not in JSON format in my working directory which was the same place I was reading data from. Thanks for the helpful suggestions.
The built-in JSON module can be used as a validator:
import json

def parse(text):
    try:
        return json.loads(text)
    except ValueError as e:
        print('invalid json: %s' % e)
        return None  # or: raise
You can make it work with files by using:
with open(filename) as f:
    return json.load(f)
instead of json.loads, and you can include the filename in the error message as well.
On Python 3.3.5, for {test: "foo"}, I get:
invalid json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
and on 2.7.6:
invalid json: Expecting property name: line 1 column 2 (char 1)
This is because the correct json is {"test": "foo"}.
When handling the invalid files, it is best to not process them any further. You can build a skipped.txt file listing the files with the error, so they can be checked and fixed by hand.
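A minimal sketch of that skipped.txt idea (the current-directory scan and the .json suffix are assumptions; adapt them to your layout):

```python
import json
import os

# Collect the files that fail to parse, with the parser's error message.
skipped = []
for name in os.listdir('.'):  # directory is an assumption; point at your folder
    if not name.endswith('.json'):
        continue
    try:
        with open(name) as f:
            json.load(f)
    except ValueError as e:  # JSONDecodeError is a subclass of ValueError
        skipped.append('%s: %s' % (name, e))

# Write the list out so the files can be checked and fixed by hand.
with open('skipped.txt', 'w') as out:
    out.write('\n'.join(skipped))
```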
If possible, you should check the site/program that generated the invalid json files, fix that and then re-generate the json file. Otherwise, you are going to keep having new files that are invalid JSON.
Failing that, you will need to write a custom JSON parser that fixes common errors. If you do, you should put the originals under source control (or archive them), so you can see and check the differences the automated tool introduces (as a sanity check). Ambiguous cases should be fixed by hand.
Yes, there are ways to validate that a JSON file is valid. One way is to use a JSON parsing library that will throw exceptions if the input you provide is not well-formatted.
try:
    load_json_file(filename)
except InvalidDataException:  # or something
    pass  # oops, guess it's not valid
Of course, if you want to fix it, you naturally cannot use a JSON loader since, well, it's not valid JSON in the first place. Unless the library you're using will automatically fix things for you, in which case you probably wouldn't even have this question.
One way is to load the file manually, tokenize it, and attempt to detect and fix errors as you go, but I'm sure there are cases where the error is just not possible to fix automatically and it would be better to throw an error and ask the user to fix their files.
I have not written a JSON fixer myself so I can't provide any details on how you might go about actually fixing errors.
However, I am not sure whether it would be a good idea to fix all errors, since then you'd have to assume your fixes are what the user actually wants. If it's a missing comma or an extra trailing comma, then that might be OK, but there may be cases where it is ambiguous what the user wants.
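As a sketch of the one unambiguous case mentioned above, a naive trailing-comma fixer could be a regex pass before parsing. Note this is an illustration, not a robust parser: the regex would also rewrite that pattern if it appeared inside a string value.

```python
import json
import re

def fix_trailing_commas(text):
    # Remove a comma that directly precedes a closing brace or bracket.
    # Naive: it will also match inside string values, so treat this as
    # a sketch under that assumption, not a production fixer.
    return re.sub(r',\s*([}\]])', r'\1', text)

broken = '{"items": [1, 2, 3,], "name": "foo",}'
print(json.loads(fix_trailing_commas(broken)))
# -> {'items': [1, 2, 3], 'name': 'foo'}
```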
Here is a full Python 3 example for the next novice Python programmer who stumbles upon this answer. I was exporting 16000 records as JSON files. I had to restart the process several times, so I needed to verify that all of the JSON files were indeed valid before I started importing into a new system.
I am no Python programmer, so when I tried the answers above as written, nothing happened; it seems a few lines of code were missing. The example below handles files in the current folder or a specific folder.
verify.py
import json
import os
import sys
from os.path import isfile, join

# check if a folder name was specified
if len(sys.argv) > 1:
    folder = sys.argv[1]
else:
    folder = os.getcwd()

# arrays to hold invalid and valid files
invalid_json_files = []
read_json_files = []

def parse():
    # loop through the folder
    for files in os.listdir(folder):
        # check if the combined path and filename is a file
        if isfile(join(folder, files)):
            # open the file
            with open(join(folder, files)) as json_file:
                # try reading the json file using the json interpreter
                try:
                    json.load(json_file)
                    read_json_files.append(files)
                except ValueError as e:
                    # if the file is not valid, print the error
                    # and add the file to the list of invalid files
                    print("JSON object issue: %s" % e)
                    invalid_json_files.append(files)
    print(invalid_json_files)
    print(len(read_json_files))

parse()
Example:
python3 verify.py
or
python3 verify.py somefolder
tested with python 3.7.3
It was not clear to me how to provide the path to the file folder, so I'd like to provide an answer with this option.
import glob

import pandas as pd

path = r'C:\Users\altz7\Desktop\your_folder_name'  # use your path
all_files = glob.glob(path + "/*.json")

data_list = []
invalid_json_files = []

for filename in all_files:
    try:
        df = pd.read_json(filename)
        data_list.append(df)
    except ValueError:
        invalid_json_files.append(filename)

print("Files in correct format: {}".format(len(data_list)))
print("Not readable files: {}".format(len(invalid_json_files)))
#df = pd.concat(data_list, axis=0, ignore_index=True)  # will create a pandas dataframe from the readable files, if you like