I have around 2000 JSON files which I'm trying to run through a Python program. A problem occurs when a JSON file is not in the correct format (Error: ValueError: No JSON object could be decoded); when that happens, I can't read it into my program.
I am currently doing something like the below:
for files in folder:
    with open(files) as f:
        data = json.load(f)  # It causes an error at this part
I know there are offline methods for validating and formatting JSON files, but is there a programmatic way to check and format these files? If not, is there a free/cheap offline alternative for fixing all of these files, i.e. I just run a program on the folder containing all the JSON files and it formats them as required?
SOLVED using @reece's comment:
import os
import simplejson

invalid_json_files = []
read_json_files = []

def parse():
    for files in os.listdir(os.getcwd()):
        with open(files) as json_file:
            try:
                simplejson.load(json_file)
                read_json_files.append(files)
            except ValueError, e:
                print "JSON object issue: %s" % e
                invalid_json_files.append(files)
    print invalid_json_files, len(read_json_files)
It turns out that I was saving a file which was not in JSON format in my working directory, which was the same place I was reading data from. Thanks for the helpful suggestions.
The built-in JSON module can be used as a validator:
import json

def parse(text):
    try:
        return json.loads(text)
    except ValueError as e:
        print('invalid json: %s' % e)
        return None  # or: raise
You can make it work with files by using:
with open(filename) as f:
    return json.load(f)
instead of json.loads and you can include the filename as well in the error message.
On Python 3.3.5, for {test: "foo"}, I get:
invalid json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
and on 2.7.6:
invalid json: Expecting property name: line 1 column 2 (char 1)
This is because the correct json is {"test": "foo"}.
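To spell out the file-based variant mentioned above, a minimal sketch (parse_file is just an illustrative name):

import json

def parse_file(filename):
    try:
        with open(filename) as f:
            return json.load(f)
    except ValueError as e:
        print('invalid json in %s: %s' % (filename, e))
        return None  # or: raise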
When handling the invalid files, it is best not to process them any further. You can build a skipped.txt file listing the files with errors, so they can be checked and fixed by hand.
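A tiny sketch of that bookkeeping (the skipped.txt name is from the suggestion above; the rest is illustrative):

import json
import os

skipped = []
for name in os.listdir('.'):
    if not name.endswith('.json'):
        continue
    try:
        with open(name) as f:
            json.load(f)
    except ValueError:
        skipped.append(name)

# record the files that need to be checked and fixed by hand
with open('skipped.txt', 'w') as out:
    out.write('\n'.join(skipped))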
If possible, you should check the site/program that generated the invalid json files, fix that and then re-generate the json file. Otherwise, you are going to keep having new files that are invalid JSON.
Failing that, you will need to write a custom JSON parser that fixes common errors. If you do that, you should put the originals under source control (or archive them), so you can see and check the changes that the automated tool makes (as a sanity check). Ambiguous cases should be fixed by hand.
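For the archiving part, a small sketch of the idea (illustrative names; a real setup would more likely use source control):

import os
import shutil

def archive_then_write(filename, fixed_text, archive_dir='originals'):
    # keep an untouched copy of the original so the automated fixes can be diffed later
    os.makedirs(archive_dir, exist_ok=True)
    shutil.copy2(filename, os.path.join(archive_dir, os.path.basename(filename)))
    with open(filename, 'w') as f:
        f.write(fixed_text)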
Yes, there are ways to validate that a JSON file is valid. One way is to use a JSON parsing library that will throw exceptions if the input you provide is not well-formatted.
try:
    load_json_file(filename)
except InvalidDataException:  # or something
    pass  # oops guess it's not valid
Of course, if you want to fix it, you naturally cannot use a JSON loader since, well, it's not valid JSON in the first place. Unless the library you're using will automatically fix things for you, in which case you probably wouldn't even have this question.
One way is to load and tokenize the file manually, detecting errors and trying to fix them as you go, but I'm sure there are cases where the error simply cannot be fixed automatically, and you would be better off raising an error and asking the user to fix their files.
I have not written a JSON fixer myself so I can't provide any details on how you might go about actually fixing errors.
However, I am not sure whether it would be a good idea to fix all errors, since you would then have to assume your fixes are what the user actually wants. If it's a missing comma or an extra trailing comma, that might be OK, but there may be cases where what the user wants is ambiguous.
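For the trailing-comma case specifically, a rough sketch of what such an automated fix might look like (the regex is naive and can mangle commas inside string values, so the result should still be checked by hand):

import json
import re

def fix_trailing_commas(text):
    # drop a comma that sits directly before a closing } or ]
    return re.sub(r',\s*([}\]])', r'\1', text)

broken = '{"test": "foo", "items": [1, 2, 3,],}'
print(json.loads(fix_trailing_commas(broken)))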
Here is a full Python 3 example for the next novice Python programmer who stumbles upon this answer. I was exporting 16000 records as JSON files. I had to restart the process several times, so I needed to verify that all of the JSON files were indeed valid before I started importing into a new system.
I am no Python programmer, so when I tried the answers above as written, nothing happened. It seems like a few lines of code were missing. The example below handles files in the current folder or a specific folder.
verify.py
import json
import os
import sys
from os.path import isfile, join

# check if a folder name was specified
if len(sys.argv) > 1:
    folder = sys.argv[1]
else:
    folder = os.getcwd()

# array to hold invalid and valid files
invalid_json_files = []
read_json_files = []

def parse():
    # loop through the folder
    for files in os.listdir(folder):
        # check if the combined path and filename is a file
        if isfile(join(folder, files)):
            # open the file
            with open(join(folder, files)) as json_file:
                # try reading the json file using the json interpreter
                try:
                    json.load(json_file)
                    read_json_files.append(files)
                except ValueError as e:
                    # if the file is not valid, print the error
                    # and add the file to the list of invalid files
                    print("JSON object issue: %s" % e)
                    invalid_json_files.append(files)
    print(invalid_json_files)
    print(len(read_json_files))

parse()
Example:
python3 verify.py
or
python3 verify.py somefolder
Tested with Python 3.7.3.
It was not clear to me how to provide the path to the folder of files, so I'd like to provide an answer with that option.
import glob
import pandas as pd

path = r'C:\Users\altz7\Desktop\your_folder_name'  # use your path
all_files = glob.glob(path + "/*.json")

data_list = []
invalid_json_files = []

for filename in all_files:
    try:
        df = pd.read_json(filename)
        data_list.append(df)
    except ValueError:
        invalid_json_files.append(filename)

print("Files in correct format: {}".format(len(data_list)))
print("Not readable files: {}".format(len(invalid_json_files)))

# df = pd.concat(data_list, axis=0, ignore_index=True)
# will create a pandas DataFrame from the readable files, if you like
I am currently developing a little backend project for myself with the Python framework FastAPI. I made an endpoint where the user should be able to upload 2 files: the first one is a zip file (which contains X .xml files) and the second a normal .xml file.
The code is as follows:
@router.post("/sendxmlinzip/")
def create_upload_files_with_zip(files: List[UploadFile] = File(...)):
    if not len(files) == 2:
        raise Httpex.EXPECTEDTWOFILES
    my_file = files[0].file
    zfile = zipfile.ZipFile(my_file, 'r')
    filelist = []
    for finfo in zfile.infolist():
        print(finfo)
        ifile = zfile.open(finfo)
        line_list = ifile.readlines()
        print(line_list)
This should print the content of the files that are in the .zip file, but it raises the exception
AttributeError: 'SpooledTemporaryFile' object has no attribute 'seekable'
in the line ifile = zfile.open(finfo).
After approximately 3 days of research, with a lot of trial and error trying different functions such as .read() or .extract(), I gave up, because the Python docs literally state that this should be possible in this way...
For those of you who do not know FastAPI: it's a backend framework for RESTful web services, and it uses the starlette data structure for UploadFile. Please forgive me if I have overlooked something VERY obvious, but I literally tried to check every corner that might have been the cause of the error, such as:
Checking whether another implementation is possible
Checking that the .zip file is correct
Checking that I attached the correct file (lol)
Debugging to see whether the data that reaches the backend is indeed the .zip file
This is a known Python bug:
SpooledTemporaryFile does not fully satisfy the abstract for IOBase.
Namely, seekable, readable, and writable are missing.
This was discovered when seeking a SpooledTemporaryFile-backed lzma
file.
As @larsks suggested in his comment, I would try writing the contents of the spooled file to a new TemporaryFile, and then operating on that. As long as your files aren't too large, that should work just as well.
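Roughly, that could look like this (a sketch, assuming upload is the starlette UploadFile from the endpoint):

import shutil
import tempfile
import zipfile

def open_zip_from_upload(upload):
    # copy the SpooledTemporaryFile contents into a regular temporary file,
    # which does implement seekable()/readable()/writable()
    tmp = tempfile.TemporaryFile()
    upload.file.seek(0)
    shutil.copyfileobj(upload.file, tmp)
    tmp.seek(0)
    return zipfile.ZipFile(tmp, 'r')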
This is my workaround:
import io
import zipfile

# note: this reads the whole upload into memory, which is fine for small archives
with zipfile.ZipFile(io.BytesIO(file.read()), 'r') as zip:
    ...  # work with the archive here
I have a Django project that creates PDFs using Java as a background task. Sometimes the process can take a while, so the client uses polling like this:
The first request starts the build process and returns None.
Each subsequent request checks to see if the PDF has been built.
If it has been, it returns the PDF.
If it hasn't, it returns None again and the client schedules another request to check again in n seconds.
The problem I have is that I don't know how to check if the PDF is finished building. The Java process creates the file in stages. If I just check if the PDF exists, then the PDF that gets returned is often invalid, because it is still being built. So, what I need is an is_pdf(path_to_file) function that returns True if the file is a valid PDF and False otherwise.
I'd like to do this without a library if possible, but will use a library if necessary.
I'm on Linux.
Here is a solution that works using pdfminer, but it seems like overkill to me.
from pdfminer.high_level import extract_text

def is_pdf(path_to_file):
    """Return True if path_to_file is a readable PDF"""
    try:
        extract_text(path_to_file, maxpages=1)
        return True
    except:
        return False
I'm hoping for a solution that doesn't involve installing a large library just to check if a file is a valid PDF.
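One library-free option is a simple header/trailer heuristic (a rough sketch, assuming the build process writes the %%EOF trailer last, which may not hold for every PDF generator):

import os

def is_pdf(path_to_file):
    """Heuristic: the header starts with %PDF- and an %%EOF marker exists near the end."""
    try:
        size = os.path.getsize(path_to_file)
        with open(path_to_file, 'rb') as f:
            if f.read(5) != b'%PDF-':
                return False
            f.seek(max(size - 1024, 0))  # only look at the last 1 KB
            return b'%%EOF' in f.read()
    except OSError:
        return False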
I've found pypi.org/project/pdfminer.six and produced a simple example. See if it is useful to you. a.pdf is an empty file. I don't know what it will do when trying to read a PDF file which is still being processed by another program.
from pdfminer.high_level import extract_text

try:
    text = extract_text("D:\\a.pdf")
    print(text)
except:
    print("invalid PDF file")
else:
    pass
--- update --
Alternatively, I have seen an example of PDFDocument on pdfminer github,
https://github.com/pdfminer/pdfminer.six/blob/develop/tools/pdfstats.py
on line 53.
I produced a similar example code:
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser

try:
    pdf_file = open("D:\\a.pdf", 'rb')
    parser = PDFParser(pdf_file)
    password = ''
    document = PDFDocument(parser, password)
    print(document.info)
    print(document.xrefs)
except:
    print("invalid PDF file")
else:
    pass
In my example, since a.pdf is empty, an exception is thrown. In your case, I'm guessing it will be able to open the file, but PDFParser or PDFDocument may throw an exception. If no exception is thrown, the PDFDocument.info attribute might be useful.
-- update 2 --
I've realized that the document object has an xrefs attribute. There is an explanation in the PDFParser class: "It also reads XRefs at the end of every PDF file." Checking the value of document.xrefs might be useful.
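A small sketch of that idea (hedged: whether an in-progress build reliably lacks xrefs depends on how the generating program writes the file):

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser

def has_xrefs(path_to_file):
    # consider the PDF usable only if parsing succeeds and at least
    # one cross-reference table was found
    try:
        with open(path_to_file, 'rb') as pdf_file:
            document = PDFDocument(PDFParser(pdf_file), password='')
            return bool(document.xrefs)
    except Exception:
        return False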
I suspect you could just write a script to email yourself or a team distribution list and simply list all the files in the directory. However, if you're only asking how to natively search a directory without installing modules, I would import os and re.
# ***** Search File *****
import os

files = os.listdir(r"C:\Users\PATH")
print(files)
I'm trying to extract a specific value from log files in a directory.
Now the log files contain JSON data and I want to extract the value of the id field.
The JSON data looks something like this:
{
    id: "123",
    name: "foo"
    description: "bar baz"
}
The code looks like this:
def test_load_json_directly(self):
    with open('source_data/testing123.json') as log_file:
        data = json.load(log_file)
        print data

def test_load_json_from_iteration(self, dir_path, file_ext):
    path_name = os.path.join(dir_path, '*.' + file_ext)
    files = glob.glob(path_name)
    for filename in files:
        with open(filename) as log_file:
            data = json.load(log_file)
            print data
Now when I call the function test_load_json_directly, the JSON string gets loaded correctly. No problem there. This is just to check the correct behavior of the json.load function.
The issue is that when I call the function test_load_json_from_iteration, the JSON string is not recognized and an error is returned.
ValueError: No JSON object could be decoded
What am I doing wrong here?
Your json is invalid. The property names and the values must be wrapped with quotes (except if they are numbers). You're also missing the commas.
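For example, this corrected version of the data from the question parses fine:

import json

valid_text = '''
{
    "id": "123",
    "name": "foo",
    "description": "bar baz"
}
'''
print(json.loads(valid_text)["id"])  # prints: 123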
The most probable reason for this error is a problem in one of the JSON files. Since the json module doesn't show detailed errors, you can use the simplejson module to see what's actually happening.
Change your code to:
import simplejson
.
.
.
data = simplejson.load(log_file)
And look at the error message. It will show you the line and the column where it fails.
Ex:
simplejson.errors.JSONDecodeError: Expecting value: line 5 column 17 (char 84)
Hope it helps :) Feel free to ask if you have any doubts.
I am trying to unzip some .json.gz files, but gzip adds some characters to them, which makes them unreadable as JSON.
What do you think is the problem, and how can I solve it?
If I use unzipping software such as 7zip to unzip the file, this problem disappears.
This is my code:
with gzip.open('filename', 'rb') as f:
    json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used this code:
with gzip.open('filename', mode='rb') as f:
    print(f.read())
and realized that the file starts with b' (as shown below):
b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"
I think b' is what makes the file unworkable for the next stage. Do you have any solution to remove the b'? There are millions of these zipped files, and I cannot do that manually.
I uploaded a sample of these files in the following link
just a few json.gz files
The problem isn't with that b prefix you're seeing with print(f.read()), which just means the data is a bytes sequence (i.e. integer ASCII values) not a sequence of UTF-8 characters (i.e. a regular Python string) — json.loads() will accept either. The JSONDecodeError is because the data in the gzipped file isn't in valid JSON format, which is required. The format looks like something known as JSON Lines — which the Python standard library json module doesn't (directly) support.
Dunes' answer to the question that @Charles Duffy (at one point) marked this as a duplicate of wouldn't have worked as presented, because of this formatting issue. However, from the sample file you added a link to in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line by line.
Here's what I mean:
import json
import gzip

filename = '00_activities.json.gz'  # Sample file.
json_content = []

with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)

print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.
Note that the output it prints shows what valid JSON might have looked like.
I have a tool in which I save a file in 2 formats (one is JSON and the other is text (without an extension)) and have 2 buttons for opening them.
In the upgraded version of the tool, I have removed saving in the text format. And now I don't want 2 buttons for loading 2 different files; I want both files to be loaded with the same button.
How can this be done, given that one file has a ".json" extension and the other file doesn't have any extension?
One method I know of is to check the file extension (is this the standard way?)
Any other ways?
What is the Pythonic way of doing this?
Yes, you can just check the extension. Use endswith:
if filename.endswith('.json'):
    pass  # it's json
else:
    pass  # it's not
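For example, a single button handler could dispatch on the extension (a sketch; the load_file name and the plain-text fallback are just illustrative):

import json

def load_file(filename):
    # one handler for the single "open" button
    if filename.endswith('.json'):
        with open(filename) as f:
            return json.load(f)
    # no extension: treat it as the legacy text format
    with open(filename) as f:
        return f.read()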
Or you could check the file contents itself.
s = open(filename).read()
try:
    json.loads(s)
    # it's json
except ValueError:
    pass  # it's not
Two approaches you can take:
use os.path.splitext to determine if the extension is '.json':
if os.path.splitext(path)[1] == '.json':
    ...
Or try to parse as json, parse the other way if that fails:
try:
    data = json.loads(contents)
except ValueError:
    data = parse_text()  # your custom function i guess?
import os

if os.path.splitext('file.json')[1] == '.json':
    pass  # it's a json file
else:
    pass  # it's not a json file