Parsing deeply nested JSON data present in a .dms file - python

I am trying to parse deeply nested JSON data that is saved in a .dms file. I saved some transactions from the file as a .json file, but when I use the json.load() function to read the .json file, I get the error:
JSONDecodeError: Extra data: line 2 column 1 (char 4392)
Opening the .dms file in a text editor, I copied 3 transactions from it and saved them as a .json file. The transactions in the file are not separated by commas; they are separated by new lines. When I saved a single transaction as a .json file and used json.load(), it was read successfully. But when I try the .json file with 3 transactions, it shows the error.
import json
d = json.load(open('t3.json'))

or:

with open('t3.json') as f:
    data = json.load(f)
print(data)
An example transaction is:
{
  "header":{
    "msgType":"SOURCE_EVENT"
  },
  "content":{
    "txntype":"ums",
    "ISSUE":{
      "REQUEST":{
        "messageTime":"2019-06-06 21:54:11.492",
        "Code":"655400"
      },
      "RESPONSE":{
        "Time":"2019-06-06 21:54:11.579"
      }
    },
    "DATA":{
      "UserId":"021"
    }
  }
}
{"header":{.....}}
{"header":{......}}
This is how my JSON data from the API looks; I wrote it out in a readable way, but in the file it is all written continuously, and a transaction starts on a new line whenever a header begins. The .dms file has 3500 transactions. Two transactions are never separated by commas, only by new lines. But within a transaction there can be extra spaces inside a value, e.g. "company": "Target Chips 123 CA".
The output I need:
I need to make a CSV by extracting the values of the keys messageType, messageTime, and userid from the data for each transaction.
Please help me clear the error and suggest ways to extract the data I need from every transaction and put it in a .csv file, so I can do further analysis and machine learning modeling.

If each object is contained within a single line, then read one line at a time and decode each line separately:
with open(fileName, 'r') as file_to_read:
    for line in file_to_read:
        json_line = json.loads(line)
If objects are spread over multiple lines, then ideally try to fix the source of the data; otherwise use my library jsonfinder. Here is an example answer that may help.
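For the CSV you ask about, here is a minimal sketch of the extraction step, assuming each transaction sits on its own line and the keys live at the paths shown in your sample transaction (header.msgType, content.ISSUE.REQUEST.messageTime, content.DATA.UserId); the file names and column headers are placeholders:

import csv
import json

with open('transactions.dms') as source, open('transactions.csv', 'w', newline='') as target:
    writer = csv.writer(target)
    writer.writerow(['msgType', 'messageTime', 'UserId'])
    for line in source:
        if not line.strip():
            continue  # skip blank lines between transactions
        txn = json.loads(line)
        # .get() with a default avoids a KeyError for transactions
        # that are missing one of the nested sections
        writer.writerow([
            txn.get('header', {}).get('msgType'),
            txn.get('content', {}).get('ISSUE', {}).get('REQUEST', {}).get('messageTime'),
            txn.get('content', {}).get('DATA', {}).get('UserId'),
        ])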

Related

Processing a big file in python (>60gb)

I have a text file (>= 60 GB) and the records in it look like this:
{"index": {"_type": "_doc", "_id": "bLcy4m8BAObvGO9GALME"}}
{"message":"{\"_\":\"user\",\"pFlags\":{\"contact\":true},\"flags\":2135,\"id\":816704468,\"access_hash\":\"788468819702098896\",\"first_name\":\"a\",\"last_name\":\"b\",\"phone\":\"123\",\"status\":{\"_\":\"userStatusOffline\",\"was_online\":132}}","phone":"12","#version":"1","typ":"telegram_contacts","access_hash":"123","id":816704468,"#timestamp":"2020-01-26T13:53:29.467Z","path":"/home/user/mirror_01/users_5d6ca02e7e736a7fc700df8c.log","type":"redis","flags":2135,"host":"ubuntu","imported_from":"telegram_contacts"}
{"index": {"_type": "_doc", "_id": "Z7cy4m8BAObvGO9GALME"}}
{"message":"{\"_\":\"user\",\"pFlags\":{\"contact\":true},\"flags\":2143,\"id\":323586643,\"access_hash\":\"8315858910992970114\",\"first_name\":\"bv\",\"last_name\":\"nj\",\"username\":\"kj\",\"phone\":\"123\",\"status\":{\"_\":\"userStatusRecently\"}}","phone":"123","#version":"1","typ":"telegram_contacts","access_hash":"8315858910992970114","id":323586643,"#timestamp":"2020-01-26T13:53:29.469Z","path":"/home/user/mirror_01/users_5d6ca02e7e736a7fc700df8c.log","username":"mbnab","type":"redis","flags":2143,"host":"ubuntu","imported_from":"telegram_contacts"}
I have a few questions regarding this:
Is this a valid JSON file?
Can Python process a file of this size? Or should I convert it somehow to an Access or Excel file?
These are some SO posts I found useful:
Is there a memory efficient and fast way to load big json files in python?
Reading rather large json files in Python
But I still need help.
You can work through the file line by line and extract the information you need.
with open('largefile.txt', 'r') as f:
    for line in f:
        # Extract what you need from that line of text here
        print(line)
For example, to interpret each line as JSON and unpack the nested "message" field:
import json

with open('largefile.txt', 'r') as f:
    for line in f:
        if line.strip():  # check there is something on the line
            # interpret the string as json, and read it in as a dictionary
            data = json.loads(line)
            # in your case, the value for "message" is itself a json
            # string, so decode it as well
            if 'message' in data:
                data['message'] = json.loads(data['message'])
            # extract the information you need here
I expect there's a lot more work to extract the information you need, but I hope this gets you started. Good luck!
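As a concrete illustration of that extraction step, here is a sketch that pulls a few fields out of each decoded record and writes them to a CSV; the chosen fields and file names are only assumptions based on the sample lines above:

import csv
import json

with open('largefile.txt') as source, open('contacts.csv', 'w', newline='') as target:
    writer = csv.writer(target)
    writer.writerow(['id', 'first_name', 'phone'])
    for line in source:
        if not line.strip():
            continue
        data = json.loads(line)
        if 'message' not in data:
            continue  # skip the {"index": ...} bookkeeping lines
        message = json.loads(data['message'])
        writer.writerow([
            message.get('id'),
            message.get('first_name'),
            message.get('phone'),
        ])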

Producing separate text files from a large Json file

I'm trying to use the code below to produce a set of .txt files from a large .json file where there is one JSON object per line, each with a date and a string of text. I want the date to be the filename.
When I open up the .json file (in sublime text editor), it shows me 2272 lines, so I assume the code should produce this number of text files. However, it is only producing half as many. Can anybody tell me why, and what I should do to correct this?
import json
#with open('results.json') as json_file:
data = [json.loads(line) for line in open('results.json', 'r')]
for p in data:
    date = p["date"]
    filename = date.replace(" ", "_").replace(":", "_")
    print(filename)
    text = p["text"]
    with open('Articles2/' + filename + '.txt', 'w') as f:
        f.write(text + '\n')
Thanks for any help!
You have duplicate dates in your sample data, so each iteration of your for loop creates a new file and then overwrites it whenever the date is exactly the same.
For example: 2018-11-17 17:11:48 appears in 3 entries about 13 lines down in your data. These will only create 1 file, because the file-naming criterion in your script is based only on the date.
You need to add some other unique value to the date to make the filenames unique, so open() doesn't overwrite a file that already exists: e.g. add milliseconds to the date, concatenate values from the "text" property of your JSON object, add a counter, etc.
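Here is a minimal sketch of the counter approach, building on the code from the question (the counts dictionary is my own addition):

import json

counts = {}
data = [json.loads(line) for line in open('results.json', 'r')]
for p in data:
    filename = p["date"].replace(" ", "_").replace(":", "_")
    # count how many entries share this date and append the counter,
    # so duplicate dates produce distinct filenames
    counts[filename] = counts.get(filename, 0) + 1
    with open('Articles2/{}_{}.txt'.format(filename, counts[filename]), 'w') as f:
        f.write(p["text"] + '\n')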

Extracting N JSON objects contained in a single line from a text file in Python 2.7?

I have a huge text file that contains several JSON objects that I want to parse into a csv file. Because I'm dealing with someone else's data, I cannot really change the format it's delivered in.
Since I don't know how many JSON objects there are, I can't just create a set of dictionaries, wrap them in a list, and json.loads() the result.
Also, since all the objects are in a single text line, I can't use a regex to separate each individual JSON object and put them in a list. (It's a super complicated and sometimes triple-nested JSON at some points.)
Here's my current code:
def json_to_csv(text_file_name, desired_csv_name):
    # Cleans up a bit of the text file
    file = fileinput.FileInput(text_file_name, inplace=True)
    ile = fileinput.FileInput(text_file_name, inplace=True)
    for line in file:
        sys.stdout.write(line.replace(u'\'', u'"'))
    for line in ile:
        sys.stdout.write(re.sub(r'("[\s\w]*)"([\s\w]*")', r"\1\2", line))
    # try to load the text file into the content var
    with open(text_file_name, "rb") as fin:
        content = json.load(fin)
    # Rest of the logic uses the json data in content
    # to produce the desired csv format
This code gives a ValueError: Extra data: line 1 column 159816 because there is more than one object there.
I have seen similar questions on Google and Stack Overflow, but none of those solutions work here, because it's just one really long line in a text file and I don't know how many objects there are in the file.
If you are trying to split apart the highest-level braces, you could do something like:
string = '{"NextToken": {"value": "...'
objects = eval("[" + string + "]")
and then parse each item in the list.
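Note that eval on data you don't control is risky. A safer sketch of the same idea uses json.JSONDecoder.raw_decode, which parses one JSON value from a string and reports where it ended, so you can walk along the line object by object:

import json

def split_json_objects(text):
    """Yield each top-level JSON value found in a single string."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # skip whitespace (and stray commas) between objects
        while idx < len(text) and text[idx] in ' \t\r\n,':
            idx += 1
        if idx >= len(text):
            break
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end

Calling list(split_json_objects(long_line)) should give one dictionary per object, however many there are.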

Reading json twitter streams with python

I am pretty new to working with Python and coding in general, so I suspect the answer involves something I do not understand about how Python works.
I have been using Tweepy to collect streams of data from Python to measure sentiment about different things. That part worked fine. When I ran the program, I had the data written to a txt file, and then I tried to use the data within that file to look at things such as common words or locations. But I run into problems when reading the data back in. I have searched online and found a number of different ways people have read such data, but as I am unfamiliar with JSON files in general, I don't understand why these methods would or wouldn't work.
The main error I seem to be running into is something similar to this:
JSONDecodeError: Expecting value: line 1 column 1 (char 0).
From my understanding, this means the data isn't being read in correctly and can't be parsed as a JSON file. But I have also seen the error read like this:
JSONDecodeError: Expecting value: line 4 column 1 (char 0).
I don't understand why it changes. I have tried reading the file in as the original txt file and then saving it again as a json file. I received the first error when trying it as a json file, with the second coming from the txt file.
I have read a number of different threads discussing similar problems but they keep giving me these types of errors. Just as an example, here is what my code looked like for the most recent error:
import json
source = open("../twitterdata24.json")
json_data = json.load(source)
One of my other attempts:
import json
tweets = []
for line in open("fileinfo"):
    tweets.append(json.load(line))
One other point of interest: the data I am working with contains many individual tweets, and from what I have read, I think there is a problem with each individual tweet being a new dictionary. So I tried to make the whole data file a list using [], but that just moved the error down a line.
So if there is anything anyone could tell me or point me to that would help me understand what I am supposed to do to read this data, I would really appreciate it.
Thanks
Edit:
Here is a small sample of the data. The whole data file is a little large, so here are the first two tweets in the file.
https://drive.google.com/file/d/1l6uiCzBTYf-SqUpCThQ3WDXmslMcUnPA/view?usp=sharing
Looking at your sample data, I suspect that the problem is that it isn't a valid json document. You effectively have data like:
{"a": "b"}
{"c": "d"}
{"a": "b"} is valid json, and {"c": "d"} is valid json, but {"a": "b"}\n{"c": "d"} is not valid json. This explains why json.load(source) fails.
You're on the right track with your second attempt: by reading through the file line-by-line, you can extract the valid json data objects individually. But your implementation has two problems:
line is a string and you can't call json.load on a string. That's what json.loads is for.
you can't convert an empty line to a json object.
So if you check for empty lines and use loads, you should be able to fill your tweets list without any problems.
import json
tweets = []
with open("sampledata.txt") as source:
    for line in source:
        if line.strip():
            tweets.append(json.loads(line))
print("Successfully loaded {} tweets.".format(len(tweets)))
Result:
Successfully loaded 2 tweets.

Python: Json.load large json file MemoryError

I'm trying to load a large JSON file (300 MB) to parse into Excel. I just started running into a MemoryError when I do json.load(file). Questions similar to this have been posted, but they have not been able to answer my specific question. I want to be able to return all the data from the JSON file in one block, as I did in the code. What is the best way to do that? The code and JSON structure are below.
The code looks like this.
def parse_from_file(filename):
    """Load the given and verified json file, and return the data
    that was in it so it can actually be read.

    Args:
        filename (string): full branch location, used to grab the json file plus '_metrics.json'

    Returns:
        data: whatever data is being loaded from the json file
    """
    print("STARTING PARSE FROM FILE")
    with open(filename) as json_file:
        d = json.load(json_file)
        json_file.close()
        return d
The structure looks like this.
[
  {
    "analysis_type": "test_one",
    "date": 1505900472.25,
    "_id": "my_id_1.1.1",
    "content": {
      ...
    }
  },
  {
    "analysis_type": "test_two",
    "date": 1605939478.91,
    "_id": "my_id_1.1.2",
    "content": {
      ...
    }
  },
  ...
]
Inside "content" the information is not consistent, but it follows 3 distinct possible templates that can be predicted based on analysis_type.
I did it this way; I hope it helps you. You may also need to skip the first line, "[", and remove the "," at the end of a line where "}," occurs.
import ujson

with open(file) as f:
    for line in f:
        while True:
            try:
                jfile = ujson.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)
        # do something with jfile
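To also handle the enclosing "[" / "]" lines and the commas that separate array elements, here is a minimal sketch using the standard json module, assuming the pretty-printed layout shown in the question:

import json

with open('large.json') as f:
    buffer = ''
    for line in f:
        if line.strip() in ('[', ']'):
            continue  # skip the enclosing array brackets
        buffer += line
        try:
            # strip the trailing comma that separates array elements
            obj = json.loads(buffer.rstrip().rstrip(','))
        except ValueError:
            continue  # not yet a complete JSON value
        # do something with obj here, e.g. dispatch on obj['analysis_type']
        buffer = ''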
If all the libraries you have tested give you memory problems, my approach would be to split the file into one file per object inside the array.
If the file has the newlines and padding you describe in the OP, I would read it line by line, discarding any line that is [ or ], and writing the lines to a new file every time you find a },, where you also need to remove the commas. Then try to load every file and print a message when you finish reading each one, to see where it fails, if it does.
If the file has no newlines or is not properly padded, you would need to read it character by character, keeping two counters: increase them when you find [ or { and decrease them when you find ] or }, respectively. Also take into account that you may need to ignore any curly or square bracket that appears inside a string, though that may not be needed. A sketch of this character-by-character approach follows.
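Here is a minimal sketch of that character-by-character idea; it tracks only brace depth, skips everything outside the top-level objects (including the array brackets and commas), and, per the caveat above, assumes no braces occur inside string values:

import json

def iter_objects(stream, chunk_size=65536):
    """Yield each top-level {...} object in the stream as a dict."""
    depth = 0
    buf = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            if ch == '{':
                depth += 1
            if depth > 0:
                buf.append(ch)  # the closing '}' is appended before the decrement below
            if ch == '}':
                depth -= 1
                if depth == 0:
                    yield json.loads(''.join(buf))
                    buf = []

with open('large.json') as f:
    for obj in iter_objects(f):
        print(obj['analysis_type'])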
