Hi, I am writing Databricks Python code that takes a huge JSON file and divides it into two parts: the keys from index 0 ("reporting_entity_name") through index 3 ("version") go into one file, and everything from index 4 onward goes into the other file. It divides the file successfully when I start from index 1 of the JSON, but when I provide index 0 it fails and says:
Datasource does not support writing empty or nested empty schemas. Please make sure the data schema has at least one or more column(s).
Here is the sample data of the large JSON file.
{
"reporting_entity_name": "launcher",
"reporting_entity_type": "launcher",
"last_updated_on": "2020-08-27",
"version": "1.0.0",
"in_network": [
{
"negotiation_arrangement": "ffs",
"name": "Boosters",
"billing_code_type": "CPT",
"billing_code_type_version": "2020",
"billing_code": "27447",
"description": "Boosters On Demand",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
0
],
"tin": {
"type": "ein",
"value": "11-1111111"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 123.45,
"expiration_date": "2022-01-01",
"billing_class": "organizational"
}
]
}
]
}
]
}
Here is the Python code.
from pyspark.sql.functions import explode, col
import itertools
# Read the JSON file from Databricks storage
df_json = spark.read.option("multiline","true").json("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile.json")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
# Convert the dataframe to a dictionary
data = df_json.toPandas().to_dict()
# Split the data into two parts
d1 = dict(itertools.islice(data.items(), 1))
d2 = dict(itertools.islice(data.items(), 1, len(data.items())))
# Convert the first part of the data back to a dataframe
df1 = spark.createDataFrame([d1])
# Write the first part of the data to a JSON file in Databricks storage
df1.write.format("json").save("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile_detail.json")
# Convert the second part of the data back to a dataframe
df2 = spark.createDataFrame([d2])
# Write the second part of the data to a JSON file in Databricks storage
df2.write.format("json").save("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile_header.json")
Here is the output of the two files. The detail file should contain only the "in_network" data, but it also has the index 0 data ("reporting_entity_name"), which shouldn't be in the detail file; it should be in the header file.
{
"in_network": [
{
"negotiation_arrangement": "ffs",
"name": "Boosters",
"billing_code_type": "CPT",
"billing_code_type_version": "2020",
"billing_code": "27447",
"description": "Boosters On Demand",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
0
],
"tin": {
"type": "ein",
"value": "11-1111111"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 123.45,
"expiration_date": "2022-01-01",
"billing_class": "organizational"
}
]
}
]
}
]
},"negotiation_arrangement":"ffs"}]}}
Here is the output of the header file, which starts from index 1:
{"reporting_entity_type": "launcher",
"last_updated_on": "2020-08-27",
"version": "1.0.0"}
Please help me with this error.
Any guidance on the code would be helpful.
Here is the screenshot of the large JSON file, which is an exact copy (in structure) of the file attached above. I increased the cluster from 2 GB to 8 GB, but the error is the same. The dict inside "in_network" occurs 714 times in the big file. Why is it failing on the big file if it is exactly the same structure?
I also changed the code line from the answer to this:
df_network=df_json.select(df_json.columns[714:])
Here is the traceback:
AnalysisException Traceback (most recent call last)
<command-863551447189973> in <cell line: 13>()
11 df_version=df_json.select(df_json.columns[:1])
12
---> 13 df_network.write.format("json").save("/mnt/BigData_JSONFiles/2022-10_040_05C0_in-network-rates_2_of_2_detail.json")
14 df_version.write.format("json").save("/mnt/BigData_JSONFiles/2022-10_040_05C0_in-network-rates_2_of_2_header.json")
15 display(df_network)
/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
46 start = time.perf_counter()
47 try:
---> 48 res = func(*args, **kwargs)
49 logger.log_success(
50 module_name, class_name, function_name, time.perf_counter() - start, signature
/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
966 self._jwrite.save()
967 else:
--> 968 self._jwrite.save(path)
969
970 #since(1.4)
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
1319
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1323
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
200 # Hide where the exception came from that shows a non-Pythonic
201 # JVM exception message.
--> 202 raise converted from None
203 else:
204 raise
AnalysisException:
Datasource does not support writing empty or nested empty schemas.
Please make sure the data schema has at least one or more column(s).
I have reproduced your code and got the below results for one file:
{"last_updated_on":{"0":"2020-08-27"},"reporting_entity_name":{"0":"launcher"},"reporting_entity_type":{"0":"launcher"},"version":{"0":"1.0.0"}}
The inner 0 key is likely due to the use of a dictionary and pandas.
As your JSON has the same structure, you can try the workaround below, which divides the JSON using select rather than converting it into a dictionary.
This is the original dataframe from the JSON file. So, use select to generate the required JSON files:
df_network=df_json.select(df_json.columns[:1])
df_version=df_json.select(df_json.columns[1:])
display(df_network)
display(df_version)
Dataframes and the results after writing to JSON files are shown in the screenshots.
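For reference, here is a minimal sketch of the full split-and-write flow, selecting by column name instead of by position (a hedged variant of the code above, reusing the paths from the question). Note that the top-level dataframe only has a handful of columns, so a slice like df_json.columns[714:] selects nothing, which is what triggers the "empty or nested empty schemas" error.
# a sketch, not the exact code from the question or answer
df_json = spark.read.option("multiline", "true").json("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile.json")
df_network = df_json.select("in_network")                                       # detail part
df_version = df_json.select([c for c in df_json.columns if c != "in_network"])  # header part
df_network.write.mode("overwrite").format("json").save("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile_detail.json")
df_version.write.mode("overwrite").format("json").save("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile_header.json")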
THE ERROR
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
THE CODE
from pyspark.sql.functions import explode, col
import itertools
# Read the JSON file from Databricks storage
df_json = spark.read.json("/mnt/BigData_JSONFiles/new_test.json")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
# Convert the dataframe to a dictionary
data = df_json.toPandas().to_dict()
# Split the data into two parts
d1 = dict(itertools.islice(data.items(), 8))
d2 = dict(itertools.islice(data.items(), 8, len(data.items())))
# Convert the first part of the data back to a dataframe
df1 = spark.createDataFrame([d1])
# Write the first part of the data to a JSON file in Databricks storage
df1.write.format("json").save("/mnt/BigData_JSONFiles/new_test_header.json")
# Convert the second part of the data back to a dataframe
df2 = spark.createDataFrame([d2])
# Write the second part of the data to a JSON file in Databricks storage
df2.write.format("json").save("/mnt/BigData_JSONFiles/new_test_detail.json")
THE SAMPLE JSON FILE OF LARGE JSON FILE
{
"reporting_entity_name": "launcher",
"reporting_entity_type": "launcher",
"plan_name": "launched",
"plan_id_type": "hios",
"plan_id": "1111111111",
"plan_market_type": "individual",
"last_updated_on": "2020-08-27",
"version": "1.0.0",
"in_network": [
{
"negotiation_arrangement": "ffs",
"name": "Boosters",
"billing_code_type": "CPT",
"billing_code_type_version": "2020",
"billing_code": "27447",
"description": "Boosters On Demand",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
0
],
"tin": {
"type": "ein",
"value": "11-1111111"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 123.45,
"expiration_date": "2022-01-01",
"billing_class": "organizational"
}
]
}
]
}
]
}
Hi, I am trying to divide a big JSON file into two parts, which is what the above code does. But it is failing, and the error says to cache; I added
.cache() at the end of the file-loading line but still get this error. Please let me know how I can solve it.
I was able to resolve this error by changing this
df_json = spark.read.json("/mnt/BigData_JSONFiles/new_test.json")
to this
df_json = spark.read.option("multiline","true").json("/mnt/BigData_JSONFiles/new_test.json")
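For context, a rough explanation of why the multiline option matters here: without it, Spark reads the pretty-printed file line by line, each line fails to parse on its own, and the only column left is _corrupt_record, which is exactly the situation the error above describes. A quick sketch to check (the printSchema output is what I would expect, not verified against your file):
df_bad = spark.read.json("/mnt/BigData_JSONFiles/new_test.json")
df_bad.printSchema()   # likely shows only _corrupt_record for a pretty-printed file
df_json = spark.read.option("multiline", "true").json("/mnt/BigData_JSONFiles/new_test.json")
df_json.printSchema()  # shows in_network, last_updated_on, plan_name, ...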
Here is the type of JSON file that I am working with:
{
"header": {
"gtfsRealtimeVersion": "1.0",
"incrementality": "FULL_DATASET",
"timestamp": "1656447045"
},
"entity": [
{
"id": "RTVP:T:16763243",
"isDeleted": false,
"vehicle": {
"trip": {
"tripId": "16763243",
"scheduleRelationship": "SCHEDULED"
},
"position": {
"latitude": 33.497833,
"longitude": -112.07365,
"bearing": 0.0,
"odometer": 16512.0,
"speed": 1.78816
},
"currentStopSequence": 16,
"currentStatus": "INCOMING_AT",
"timestamp": "1656447033",
"stopId": "2792",
"vehicle": {
"id": "5074"
}
}
},
{
"id": "RTVP:T:16763242",
"isDeleted": false,
"vehicle": {
"trip": {
"tripId": "16763242",
"scheduleRelationship": "SCHEDULED"
},
"position": {
"latitude": 33.562374,
"longitude": -112.07392,
"bearing": 359.0,
"odometer": 40367.0,
"speed": 15.6464
},
"currentStopSequence": 36,
"currentStatus": "INCOMING_AT",
"timestamp": "1656447024",
"stopId": "2794",
"vehicle": {
"id": "5251"
}
}
}
]
}
In my code, I am taking in the JSON as a string. But when I try to normalize the JSON string to put it into a dataframe:
import pandas as pd
import json
import requests
base_URL = requests.get('https://app.mecatran.com/utw/ws/gtfsfeed/vehicles/valleymetro?apiKey=4f22263f69671d7f49726c3011333e527368211f&asJson=true')
packages_json = base_URL.json()
packages_str = json.dumps(packages_json, indent=1)
df = pd.json_normalize(packages_str)
I get this error. I am definitely making some rookie mistake, but how exactly am I using this wrong? Are there additional arguments that I may need?
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-33-aa23f9157eac> in <module>()
8 packages_str = json.dumps(packages_json, indent=1)
9
---> 10 df = pd.json_normalize(packages_str)
/usr/local/lib/python3.7/dist-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
421 data = list(data)
422 else:
--> 423 raise NotImplementedError
424
425 # check to see if a simple recursive function is possible to
NotImplementedError:
When I had the JSON within my code, without the header portion, referenced as an object, pd.json_normalize did work. Why would that be, and what additional things would I need to do?
The issue is that pandas.json_normalize expects either a dictionary or a list of dictionaries, but json.dumps returns a string.
It should work if you skip json.dumps and pass the JSON directly to the normalizer, like this:
import pandas as pd
import json
import requests
base_URL = requests.get('https://app.mecatran.com/utw/ws/gtfsfeed/vehicles/valleymetro?apiKey=4f22263f69671d7f49726c3011333e527368211f&asJson=true')
packages_json = base_URL.json()
df = pd.json_normalize(packages_json)
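If you also want one row per vehicle rather than one row for the whole feed, json_normalize can descend into the entity array. This is only a sketch, assuming the feed keeps the header/entity layout shown in the question:
# one row per element of the "entity" array, with the feed timestamp carried along
df_vehicles = pd.json_normalize(
    packages_json,
    record_path='entity',
    meta=[['header', 'timestamp']],
)
# columns come out dot-separated, e.g. id, vehicle.trip.tripId, vehicle.position.latitude
print(df_vehicles.head())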
If you take a look at the corresponding source-code of pandas you can see for yourself:
if isinstance(data, list) and not data:
    return DataFrame()
elif isinstance(data, dict):
    # A bit of a hackjob
    data = [data]
elif isinstance(data, abc.Iterable) and not isinstance(data, str):
    # GH35923 Fix pd.json_normalize to not skip the first element of a
    # generator input
    data = list(data)
else:
    raise NotImplementedError
You should find this code at the path that is shown in the stacktrace, with the error raised on line 423:
/usr/local/lib/python3.7/dist-packages/pandas/io/json/_normalize.py
I would advise you to use a code-linter or an IDE that has one included (like PyCharm for example) as this is the type of error that doesn't happen if you have one.
I'm not sure where the problem is, but if you are desperate, you can always write a text function that will data-mine that JSON.
Yes, it will be quite tiring, but with the roughly 10 variables you need to mine for each row, you will be done in about 60 minutes, no problem.
Something like this:
def MineJson(text, target):  # target is for example "id"
    # find the target key, then skip ahead to the opening quote of its value
    findword = text.find('"' + target + '"')
    colon = text.find(':', findword)
    quote = text.find('"', colon)
    new_text = text[quote + 1:]  # output should start with RTVP:T...
    return new_text
def WhatsAfter(text):  # should return the remaining text and e.g. RTVP:T:16763243
    toFind = '"'
    findEnd = text.find(toFind)  # closing quote of the value
    value = text[:findEnd]
    new_text = text[findEnd + 1:]  # drop the closing quote as well
    return new_text, value
I wrote it without testing, so maybe there will be some mistakes.
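A rough usage sketch under the same assumptions (the file name vehicles.txt is hypothetical and should contain the raw JSON text from the question):
raw_text = open("vehicles.txt").read()         # hypothetical input file
remaining = MineJson(raw_text, "id")           # jump past the first "id" key
remaining, vehicle_id = WhatsAfter(remaining)  # pull out its value
print(vehicle_id)                              # e.g. RTVP:T:16763243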
I have a config file:
Position,ColumnName
1,TXS_ID
4,TXX_NAME
8,AGE
As per the positions above I have 1, 4, 8, so only 3 columns are available. Between 1 and 4, positions 2 and 3 are missing, and I want to fill them with null values.
As per the above config file, I am trying to parse the data from a JSON file using Python, but I have a scenario where I need to define the columns based on position as mentioned above. When the Python script runs, if "TXS_ID" is available it should pick the data from the JSON file, and as I don't have fields 2 and 3, I want to keep them as null.
Sample output file:
TSX_ID,,,TXX_NAME,,,,AGE
10000,,,AAAAAAAAA,,,,40
As per the config file I specify, data should be extracted from the JSON file, and if a position is missing (as in the example above), it should be filled with nulls. Please help me if there is any way I can achieve this.
Below is the sample JSON file.
{
"entities": [
{
"id": "XXXXXXXXXXXXXXX",
"data": {
"attributes": {
"TSX_ID": {
"values": [
{
"value": 10000
}
]
},
"TXX_NAME": {
"values": [
{
"value": "AAAAAAAAA"
}
]
},
"AGE": {
"values": [
{
"value": "40"
}
]
}
}
}
}
]
}
Assuming that the config file line 1,TXS_ID has a typo and is actually 1,TSX_ID, this program works with your sample data (see explanations in comments):
import pandas
# read the "config file" into a Series of the "ColumnName"s:
config = pandas.read_csv('config', index_col='Position', squeeze=True)
maxdex = config.index[-1] # get the maximum Position
# fill the Positions missing in the "config file" with empty "ColumnName"s:
config = config.reindex(range(1, maxdex+1), fill_value='')
import json
sample = json.load(open('sample.json'))
# create an empty DataFrame with the desired columns:
output = pandas.DataFrame(columns=config.values)
# now insert the nested JSON data values into the given columns:
for a in config.values:
    if a:  # only if not an empty column name, of course
        output[a] = [av['value'] for e in sample['entities']
                     for av in e['data']['attributes'][a]['values']]
output.to_csv('output.csv', index=False)
I am trying to load multiple JSON files from a directory in my Google Drive into one pandas dataframe.
I have tried quite a few solutions but nothing seems to be yielding a positive result.
This is what I have tried so far:
import os
import json
import pandas as pd

path_to_json = '/path/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
jsons_data = pd.DataFrame(columns=['participants','messages','active','threadtype','thread path'])
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        participants = json_text['participants']
        messages = json_text['messages']
        active = json_text['is_still_participant']
        threadtype = json_text['thread_type']
        threadpath = json_text['thread_path']
        jsons_data.loc[index] = [participants, messages, active, threadtype, threadpath]
jsons_data
And this is the full traceback of the error message I am receiving:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-30-8385abf6a3a7> in <module>()
1 for index, js in enumerate(json_files):
2 with open(os.path.join(path_to_json, js)) as json_file:
----> 3 json_text = json.load(json_file)
4 participants = json_text['participants']
5 messages = json_text['messages']
/usr/lib/python3.6/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
297 cls=cls, object_hook=object_hook,
298 parse_float=parse_float, parse_int=parse_int,
--> 299 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
300
301
/usr/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder
/usr/lib/python3.6/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I have added a sample of the JSON files I am trying to read from.
Link to Jsons
Example of jsons:
{
participants: [
{
name: "Test 1"
},
{
name: "Person"
}
],
messages: [
{
sender_name: "Person",
timestamp_ms: 1485467319139,
content: "Hie",
type: "Generic"
}
],
title: "Test 1",
is_still_participant: true,
thread_type: "Regular",
thread_path: "inbox/xyz"
}
#second example
{
participants: [
{
name: "Clearance"
},
{
name: "Person"
}
],
messages: [
{
sender_name: "Emmanuel Sibanda",
timestamp_ms: 1212242073308,
content: "Dear",
share: {
link: "http://www.example.com/"
},
type: "Share"
}
],
title: "Clearance",
is_still_participant: true,
thread_type: "Regular",
thread_path: "inbox/Clearance"
}
There were a few challenges working with the JSON files you provided, and some more converting them to dataframes and merging them. Firstly, the keys of the JSONs were not quoted strings; secondly, the arrays in the resulting "valid" JSONs were of different lengths and could not be converted to dataframes directly; and thirdly, you did not specify the shape of the dataframe.
Nevertheless, this is an important problem, as malformed JSONs are more common than valid ones, and despite several SO answers on fixing such JSON strings, every malformed JSON problem is unique.
I've broken down the problem into the following parts:
convert malformed JSON's in the files to valid JSON's
flatten the dict's in the valid JSON files to prepare for dataframe conversion
create dataframes from the files and merge into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.
Part 1: convert malformed JSON's in the files to valid JSON's:
The two example JSON strings that you provided were not valid JSON at all, because the keys were not quoted strings. So, even if you load the JSON file and parse the contents, an error appears.
with open('./Test/test.json') as f:
    data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was:
to convert all the JSON files to txt files as this would convert the contents to string
perform regex on the JSON string in text files and add quotes(" ") around the keys
save the file as JSON again
The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.
import json
import os
import re
import pandas as pd
#rename to txt files and return list of filenames
def rename_to_text_files():
    all_new_filenames = []
    for filename in os.listdir('./Test'):
        if filename.endswith("json"):
            new_filename = filename.split('.')[0] + '.txt'
            os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
            all_new_filenames.append(new_filename)
        else:
            all_new_filenames.append(filename)
    return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
    json_validated_files = []
    for index, filename in enumerate(files):
        filepath = os.path.join('./Test', filename)
        with open(filepath, 'r+') as f:
            data = f.read()
            dict_converted = re.sub("(\w+):(.+)", r'"\1":\2', data)
            f.seek(0)
            f.write(dict_converted)
            f.truncate()
        #rename
        new_filename = filename[:-4] + '.json'
        os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
        json_validated_files.append(new_filename)
    print("All files converted to valid JSON!")
    return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}
If I read the json into a dataframe, I still get an error because the array lengths are different for each of the keys. You can check this: the "messages" key value is an array of length 1 while "participants" has a value of array length 2:
df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSON's will not change across all JSON files:
#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
    values = []
    d['participants_name'] = d.pop('participants')
    for i in d['participants_name']:
        values.append(i['name'])
    for i in d['messages']:
        d['messages_sender_name'] = i['sender_name']
        d['messages_timestamp_ms'] = str(i['timestamp_ms'])
        d['messages_content'] = i['content']
        d['messages_type'] = i['type']
        if "share" in i:
            d['messages_share_link'] = i["share"]["link"]
    d["is_still_participant"] = str(d["is_still_participant"])
    d.pop('messages')
    d.update(participants_name=values)
    return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can easily be fed into the DataFrame constructor (discussed later):
with open('./Test/test1.json') as f:
    data = json.load(f)
print(flatten(data))
#Output:
{'title': 'Clearance',
'is_still_participant': 'True',
'thread_type': 'Regular',
'thread_path': 'inbox/Clearance',
'participants_name': ['Clearance', 'Person'],
'messages_sender_name': 'Emmanuel Sibanda',
'messages_timestamp_ms': '1212242073308',
'messages_content': 'Dear',
'messages_type': 'Share',
'messages_share_link': 'http://www.example.com/'}
Part 3: Create Dataframes and merge them into one:
So now that we have a function that can flatten the dict, we can call this function inside our final function where we will:
open the JSON files one by one, load each JSON as a dict in memory using json.load().
call the flatten function on each dict
convert the flattened dicts to dataframes
append all dataframes to an empty list.
merge all dataframes with pd.concat() passing the list of dataframes as an argument.
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.
def create_merge_dataframes(list_of_valid_json_files):
    df_list = []
    for index, js in enumerate(list_of_valid_json_files):
        with open(os.path.join('./Test', js)) as json_file:
            data = json.load(json_file)
            flattened_json_data = flatten(data)
            df = pd.DataFrame(flattened_json_data)
            df_list.append(df)
    merged_df = pd.concat(df_list, sort=False, ignore_index=True)
    return merged_df
Let's give the whole code a test run. We begin with the functions in Part 1 and end with Part 3, to get a merged dataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
title is_still_participant thread_type thread_path \
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms \
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of columns as you like.
Note:
The code does not have exception handling and assumes the keys will be the same for the dicts as shown in your examples.
The shape and columns of the dataframes have also been assumed.
You may add all the functions into one Python script, and wherever "./Test" is used for the JSON folder path, you should enter your own path. The folder should only contain malformed JSON files to begin with.
The whole script can be further modularized by putting the functions into a class.
It can also be further optimized with the use of hashable data types like tuples and sped up with threading and asyncio libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long.
It could be possible some errors may crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files is not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
I've checked your JSON files and found the same problem in document1.json, document2.json and document3.json: the property names are not enclosed in double quotes.
For example, document1.json should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use the following line to add double quotes to the keys of a JSON file:
re.sub("([^\s^\"]+):(.+)", '"\\1":\\2', s)
I'm currently trying to process JSON as a pandas dataframe. What happens here is that I get a continuous stream of JSON structures. They are simply appended, all on one line. I extracted a .txt file from it and now want to analyse it via pandas.
Example snippet:
{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}...
As you can see in this small snippet, every JSON object starts with {"positionFlightMessage": and ends with "messageSubtype":"ADSB"}}.
After one JSON object ends, the next one is appended directly after it.
What I need is a table out of it, like this:
95b3b6ca-5dd2-44b4-918a-baa51022d143 1.0-RC1 1533134514 DLH1601 4.414.525 -131.849 340 24.0 ADSB AFR1601-1532928365-airline-0002 AFR AFR89GA 442.0 34000.0 ADSB
884708c1-2fff-4ebf-b72c-bbc6ed2c3623 1.0-RC1 1533134515 DLH012 3.734.542 14.379.951 320 54.0 ADSB EVA12-1532928367-airline-0096 DLH EVA012 462.0 32000.0 ADSB
I tried to use pandas read_json but I get an error.
import pandas as pd
df = pd.read_json("tD.txt",orient='columns')
df.head()
ValueError: Trailing data
tD.txt contains the snippet given above, without the trailing (...) dots.
I think the problem is that every JSON object is just appended. I could add a new line after every
messageSubtype":"ADSB"}}
and then read it, but maybe you have a solution where I can just convert the big txt file directly and easily into a dataframe.
Try to get the stream of JSON to be formatted like the following.
Notice the starting '[' and the ending ']'.
Also notice the ',' between each JSON object.
data = [{
"positionFlightMessage": {
"messageUuid": "95b3b6ca-5dd2-44b4-918a-baa51022d143",
"schemaVersion": "1.0-RC1",
"timestamp": 1533134514,
"flightNumber": "DLH1601",
"position": {
"waypoint": {
"latitude": 44.14525,
"longitude": -1.31849
},
"flightLevel": 340,
"heading": 24.0
},
"messageSource": "ADSB",
"flightUniqueId": "AFR1601-1532928365-airline-0002",
"airlineIcaoCode": "AFR",
"atcCallsign": "AFR89GA",
"fuel": {},
"speed": {
"groundSpeed": 442.0
},
"altitude": {
"altitude": 34000.0
},
"nextPosition": {
"waypoint": {}
},
"messageSubtype": "ADSB"
}
}, {
"positionFlightMessage": {
"messageUuid": "884708c1-2fff-4ebf-b72c-bbc6ed2c3623",
"schemaVersion": "1.0-RC1",
"timestamp": 1533134515,
"flightNumber": "DLH012",
"position": {
"waypoint": {
"latitude": 37.34542,
"longitude": 143.79951
},
"flightLevel": 320,
"heading": 54.0
},
"messageSource": "ADSB",
"flightUniqueId": "EVA12-1532928367-airline-0096",
"airlineIcaoCode": "DLH",
"atcCallsign": "EVA012",
"fuel": {},
"speed": {
"groundSpeed": 462.0
},
"altitude": {
"altitude": 32000.0
},
"nextPosition": {
"waypoint": {}
},
"messageSubtype": "ADSB"
}
}]
Now you should be able to loop over each 'list' element in the json and append it to the pandas df.
print(len(data))
for i in range(0, len(data)):
    #here we just show messageSource only. Up to you to find out the rest..
    print(data[i]['positionFlightMessage']['messageSource'])
    #instead of printing here you should append it to a pandas df.
Hope this helps you out a bit.
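If reshaping the stream by hand is not practical, one way to build that list programmatically is to walk the raw text with json.JSONDecoder.raw_decode, which parses one object at a time and reports where it ended. This is only a sketch, assuming the file really is back-to-back objects with nothing between them:
import json

def split_concatenated_json(text):
    # decode one object, then continue from the position where it ended
    decoder = json.JSONDecoder()
    pos, objects = 0, []
    while pos < len(text):
        obj, end = decoder.raw_decode(text, pos)
        objects.append(obj)
        pos = end
    return objects

with open("tD.txt") as f:
    data = split_concatenated_json(f.read().strip())
print(len(data))  # number of positionFlightMessage objects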
Now here's a solution for your JSON as-is, using regex.
s = '{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}'
import re
import json
import pandas as pd

# insert a comma before every object, drop the leading comma, and wrap the whole
# stream in [ ] so it becomes one valid JSON array
replaced = json.loads('[' + re.sub(r'{\"positionFlightMessage*', ',{\"positionFlightMessage', s)[1:] + ']')
dfTemp = pd.DataFrame(data=replaced)  # a single column of nested dicts
df = pd.DataFrame()
counter = 0
def newDf(row):
    # turn each nested dict into a one-row frame and append it to the result
    global df, counter
    counter += 1
    temp = pd.DataFrame([row])
    df = df.append(temp)
dfTemp['positionFlightMessage'] = dfTemp['positionFlightMessage'].apply(newDf)
print(df)
First we replace all occurrences of {"positionFlightMessage with ,{"positionFlightMessage and discard the first separator.
We create a dataframe out of this but we have only one column here. Use the apply function on the column and create a new dataframe out of it.
From this dataframe, you can perform some more cleaning.
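One possible further cleaning step, sketched here rather than taken from the answer above, is to let pandas flatten the nested dicts directly instead of appending row by row:
import pandas as pd

# expands the nested positionFlightMessage dicts into dot-separated columns such as
# positionFlightMessage.messageUuid, positionFlightMessage.position.waypoint.latitude, ...
flat = pd.json_normalize(replaced)
print(flat.head())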