JSON extract to pandas dataframe - python

I'm trying to load JSON into a pandas DataFrame. I receive a continuous stream of JSON objects that are simply appended to one another, so the whole file is a single line. I saved the stream to a .txt file and now want to analyse it with pandas.
Example snippet:
{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}...
As you can see in this short snippet, every JSON object starts with {"positionFlightMessage": and ends with "messageSubtype":"ADSB"}}
When one object ends, the next one follows immediately, with no separator.
What I need is a table built from it, like this:
95b3b6ca-5dd2-44b4-918a-baa51022d143 1.0-RC1 1533134514 DLH1601 44.14525 -1.31849 340 24.0 ADSB AFR1601-1532928365-airline-0002 AFR AFR89GA 442.0 34000.0 ADSB
884708c1-2fff-4ebf-b72c-bbc6ed2c3623 1.0-RC1 1533134515 DLH012 37.34542 143.79951 320 54.0 ADSB EVA12-1532928367-airline-0096 DLH EVA012 462.0 32000.0 ADSB
I tried to use pandas read_json, but I get an error.
import pandas as pd
df = pd.read_json("tD.txt",orient='columns')
df.head()
ValueError: Trailing data
tD.txt contains the snippet above, without the trailing dots (...).
I think the problem is that the JSON objects are just appended to each other. I could add a newline after every
messageSubtype":"ADSB"}}
and then read the file, but maybe you have a solution that converts the big .txt file directly into a df.

Try to get the JSON stream into the following form:
Notice the opening '[' and the closing ']'.
Also notice the ',' between consecutive JSON objects.
data = [{
"positionFlightMessage": {
"messageUuid": "95b3b6ca-5dd2-44b4-918a-baa51022d143",
"schemaVersion": "1.0-RC1",
"timestamp": 1533134514,
"flightNumber": "DLH1601",
"position": {
"waypoint": {
"latitude": 44.14525,
"longitude": -1.31849
},
"flightLevel": 340,
"heading": 24.0
},
"messageSource": "ADSB",
"flightUniqueId": "AFR1601-1532928365-airline-0002",
"airlineIcaoCode": "AFR",
"atcCallsign": "AFR89GA",
"fuel": {},
"speed": {
"groundSpeed": 442.0
},
"altitude": {
"altitude": 34000.0
},
"nextPosition": {
"waypoint": {}
},
"messageSubtype": "ADSB"
}
}, {
"positionFlightMessage": {
"messageUuid": "884708c1-2fff-4ebf-b72c-bbc6ed2c3623",
"schemaVersion": "1.0-RC1",
"timestamp": 1533134515,
"flightNumber": "DLH012",
"position": {
"waypoint": {
"latitude": 37.34542,
"longitude": 143.79951
},
"flightLevel": 320,
"heading": 54.0
},
"messageSource": "ADSB",
"flightUniqueId": "EVA12-1532928367-airline-0096",
"airlineIcaoCode": "DLH",
"atcCallsign": "EVA012",
"fuel": {},
"speed": {
"groundSpeed": 462.0
},
"altitude": {
"altitude": 32000.0
},
"nextPosition": {
"waypoint": {}
},
"messageSubtype": "ADSB"
}
}]
Now you should be able to loop over each element of the list and append it to the pandas df.
print(len(data))
for i in range(0,len(data)):
    # only messageSource is shown here; the other fields work the same way
    print(data[i]['positionFlightMessage']['messageSource'])
    # instead of printing here, you should append it to the pandas df
Hope this helps you out a bit.
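As an alternative to appending row by row, pandas can flatten a list like this in one call. The sketch below is an illustration, not part of the answer above: it assumes the messages are already in a Python list shaped like data (abbreviated here to a few fields) and uses pd.json_normalize, which turns nested keys into dotted column names.

```python
import pandas as pd

# Abbreviated stand-in for the `data` list above (only a few fields kept).
data = [
    {"positionFlightMessage": {
        "flightNumber": "DLH1601",
        "position": {"waypoint": {"latitude": 44.14525, "longitude": -1.31849},
                     "flightLevel": 340},
        "messageSource": "ADSB"}},
    {"positionFlightMessage": {
        "flightNumber": "DLH012",
        "position": {"waypoint": {"latitude": 37.34542, "longitude": 143.79951},
                     "flightLevel": 320},
        "messageSource": "ADSB"}},
]

# Drop the outer wrapper key, then flatten the nested dicts into columns
# such as "position.waypoint.latitude".
messages = [item["positionFlightMessage"] for item in data]
flights = pd.json_normalize(messages)
print(flights)
```

Each message becomes one row, so no explicit loop over the DataFrame is needed.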

Here is a solution that works on your JSON as-is, using a regex.
s = '{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}'
import re
import json
import pandas as pd
# put a comma before every object start, drop the leading comma, and wrap in [] to get a valid JSON array
replaced = json.loads('[' + re.sub(r'\{"positionFlightMessage', ',{"positionFlightMessage', s)[1:] + ']')
dfTemp = pd.DataFrame(data=replaced)
# each cell of the single column holds one message dict; expand the dicts into columns
# (DataFrame.append was removed in pandas 2.0, so the rows are built in one call)
df = pd.DataFrame(dfTemp['positionFlightMessage'].tolist())
print(df)
First we replace every occurrence of {"positionFlightMessage with ,{"positionFlightMessage, discard the first separator, and wrap the result in square brackets.
We create a dataframe out of this, but it has only one column. Each cell of that column is a dictionary, so we build a new dataframe from the list of those dictionaries.
From this dataframe, you can perform some more cleaning.
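If you would rather not depend on the exact key name in a regex, the standard library can walk through back-to-back JSON documents directly. This is only a sketch: the helper name iter_json_documents is made up for illustration, and the sample string is an abbreviated version of the one in the question. It relies on json.JSONDecoder.raw_decode, which parses one value and returns the index where it stopped.

```python
import json

def iter_json_documents(text):
    """Yield every top-level JSON value found in a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Abbreviated sample: two objects appended with no separator.
s = ('{"positionFlightMessage":{"flightNumber":"DLH1601","timestamp":1533134514}}'
     '{"positionFlightMessage":{"flightNumber":"DLH012","timestamp":1533134515}}')

messages = [doc["positionFlightMessage"] for doc in iter_json_documents(s)]
print(messages)
```

The resulting list of dictionaries can then be handed to pd.DataFrame or pd.json_normalize.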

Related

Is there a way to normalize a json pulled straight from an api

Here is the type of json file that I am working with
{
"header": {
"gtfsRealtimeVersion": "1.0",
"incrementality": "FULL_DATASET",
"timestamp": "1656447045"
},
"entity": [
{
"id": "RTVP:T:16763243",
"isDeleted": false,
"vehicle": {
"trip": {
"tripId": "16763243",
"scheduleRelationship": "SCHEDULED"
},
"position": {
"latitude": 33.497833,
"longitude": -112.07365,
"bearing": 0.0,
"odometer": 16512.0,
"speed": 1.78816
},
"currentStopSequence": 16,
"currentStatus": "INCOMING_AT",
"timestamp": "1656447033",
"stopId": "2792",
"vehicle": {
"id": "5074"
}
}
},
{
"id": "RTVP:T:16763242",
"isDeleted": false,
"vehicle": {
"trip": {
"tripId": "16763242",
"scheduleRelationship": "SCHEDULED"
},
"position": {
"latitude": 33.562374,
"longitude": -112.07392,
"bearing": 359.0,
"odometer": 40367.0,
"speed": 15.6464
},
"currentStopSequence": 36,
"currentStatus": "INCOMING_AT",
"timestamp": "1656447024",
"stopId": "2794",
"vehicle": {
"id": "5251"
}
}
}
]
}
In my code, I am taking in the JSON as a string. But when I try to normalize the JSON string to put it into a data frame
import pandas as pd
import json
import requests
base_URL = requests.get('https://app.mecatran.com/utw/ws/gtfsfeed/vehicles/valleymetro?apiKey=4f22263f69671d7f49726c3011333e527368211f&asJson=true')
packages_json = base_URL.json()
packages_str = json.dumps(packages_json, indent=1)
df = pd.json_normalize(packages_str)
I get this error. I am definitely making some rookie mistake, but how exactly am I using this wrong? Are there additional arguments that I need to pass?
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-33-aa23f9157eac> in <module>()
8 packages_str = json.dumps(packages_json, indent=1)
9
---> 10 df = pd.json_normalize(packages_str)
/usr/local/lib/python3.7/dist-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
421 data = list(data)
422 else:
--> 423 raise NotImplementedError
424
425 # check to see if a simple recursive function is possible to
NotImplementedError:
When I put the JSON directly in my code as an object, without the header portion, pd.json_normalize(packages_str) does work. Why is that, and what else do I need to do?
The issue is that pandas.json_normalize expects either a dictionary or a list of dictionaries, but json.dumps returns a string.
It should work if you skip json.dumps and pass the parsed JSON directly to the normalizer, like this:
import pandas as pd
import json
import requests
base_URL = requests.get('https://app.mecatran.com/utw/ws/gtfsfeed/vehicles/valleymetro?apiKey=4f22263f69671d7f49726c3011333e527368211f&asJson=true')
packages_json = base_URL.json()
df = pd.json_normalize(packages_json)
If you take a look at the corresponding source-code of pandas you can see for yourself:
if isinstance(data, list) and not data:
    return DataFrame()
elif isinstance(data, dict):
    # A bit of a hackjob
    data = [data]
elif isinstance(data, abc.Iterable) and not isinstance(data, str):
    # GH35923 Fix pd.json_normalize to not skip the first element of a
    # generator input
    data = list(data)
else:
    raise NotImplementedError
You should find this code at the path that is shown in the stacktrace, with the error raised on line 423:
/usr/local/lib/python3.7/dist-packages/pandas/io/json/_normalize.py
I would advise you to use a code-linter or an IDE that has one included (like PyCharm for example) as this is the type of error that doesn't happen if you have one.
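Once the parsed dictionary is passed in, you will probably also want one row per vehicle instead of a single row for the whole feed. A minimal sketch, assuming the response has the shape shown in the question (trimmed here to a few fields and hard-coded so it runs without the API): record_path points json_normalize at the list to expand, and meta copies a header field onto every row.

```python
import pandas as pd

# Trimmed, hard-coded stand-in for base_URL.json()
packages_json = {
    "header": {"gtfsRealtimeVersion": "1.0", "timestamp": "1656447045"},
    "entity": [
        {"id": "RTVP:T:16763243", "isDeleted": False,
         "vehicle": {"trip": {"tripId": "16763243"},
                     "position": {"latitude": 33.497833, "longitude": -112.07365},
                     "vehicle": {"id": "5074"}}},
        {"id": "RTVP:T:16763242", "isDeleted": False,
         "vehicle": {"trip": {"tripId": "16763242"},
                     "position": {"latitude": 33.562374, "longitude": -112.07392},
                     "vehicle": {"id": "5251"}}},
    ],
}

# One row per element of "entity"; nested keys become dotted column names,
# and the header timestamp is repeated on every row as "header.timestamp".
vehicles = pd.json_normalize(
    packages_json,
    record_path="entity",
    meta=[["header", "timestamp"]],
)
print(vehicles.columns.tolist())
```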
I'm not sure where the problem is, but if you are desperate, you can always write a text function that mines the values out of that JSON string.
Yes, it will be quite tedious, but with about 10 variables to mine for each row, you should be done in about 60 minutes.
Something like this:
def MineJson(text, target): # target is for example "id"
    #print(text)
    findword = text.find(target)
    g = findword + len(target) + 4 # skip the 4 characters '": "' so the opening quote is not included
    new_text = text[g:] # output should start with RTVP:T...
    return new_text
def WhatsAfter(text): # should return the new text and RTVP:T:16763243
    #print(text)
    toFind = '"'
    findEnd = text.find(toFind)
    g = findEnd
    value = text[:g]
    new_text = text[g:]
    return new_text, value
I wrote it without testing, so there may be some mistakes.

convert json with nested dicts into data frame with python

Can someone explain how I convert the following JSON into a simple data frame with the headings listed below?
----- sample----
{
"last_scanned_block": 14968718,
"blocks": {
"13965799": {
"0x9603846aff5c425277e483de16179a68dbc739debcc5449ea99e45c9d0924430": {
"165": {
"from": "0x0000000000000000000000000000000000000000",
"to": "0x01f87c337be5636Cd9B3D48F1159768A7e7837A5",
"value": 100000000000000000000000000,
"timestamp": "2022-01-08T16:19:02"
}
}
},
"13965820": {
"0xd4a4122734a522c40504c8b0ab43b9aa40ac821cd9913179b3ae64e5b166fc57": {
"226": {
"from": "0x01f87c337be5636Cd9B3D48F1159768A7e7837A5",
"to": "0xEa3Fa123Eb40CEEaeED390D8d6dE6AF95f044AF7",
"value": 610000000000000000000000,
"timestamp": "2022-01-08T16:25:12"
}
}
},
--- end----
I'd like the df to have the following 8 column headings and values for each row
(value examples for first row)
Last_scanned_block: 14968718
block: 13965799
hex: 0x9603846aff5c425277e483de16179a68dbc739debcc5449ea99e45c9d0924430
number: 165
from: 0x0000000000000000000000000000000000000000
to: 0x01f87c337be5636Cd9B3D48F1159768A7e7837A5
value: 100000000000000000000000000
timestamp: 2022-01-08T16:19:02
Thanks
I would build a new dictionary from the JSON that is passed in. Instead of the nested dictionaries you have above, you want one flat dictionary keyed by your headings, in the form:
*heading name* : *list of values*
For the first row, the resulting dictionary would be:
{"Last_scanned_block" : [14968718], "block" : [13965799], "hex" : ["0x9603846aff5c425277e483de16179a68dbc739debcc5449ea99e45c9d0924430"], "number" : [165], "from" : ["0x0000000000000000000000000000000000000000"], "to" : ["0x01f87c337be5636Cd9B3D48F1159768A7e7837A5"], "value": [100000000000000000000000000], "timestamp" : ["2022-01-08T16:19:02"]}
Then every time you read more data you just append it to each respective list in your dictionary.
Once you have your complete dictionary, you would use pandas. So something along the lines of:
import pandas
# d is the dictionary of lists built above
frame = pandas.DataFrame(data = d)
print(frame)
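To make the idea concrete, here is one possible way to walk the three nesting levels and produce one row per innermost entry. It is a sketch under assumptions: the payload is hard-coded from the sample in the question, and the column names follow the headings the question asks for. It builds a list of row dictionaries rather than a dictionary of lists, which pandas accepts just as well.

```python
import pandas as pd

# Hard-coded from the sample in the question
payload = {
    "last_scanned_block": 14968718,
    "blocks": {
        "13965799": {
            "0x9603846aff5c425277e483de16179a68dbc739debcc5449ea99e45c9d0924430": {
                "165": {
                    "from": "0x0000000000000000000000000000000000000000",
                    "to": "0x01f87c337be5636Cd9B3D48F1159768A7e7837A5",
                    "value": 100000000000000000000000000,
                    "timestamp": "2022-01-08T16:19:02",
                }
            }
        },
        "13965820": {
            "0xd4a4122734a522c40504c8b0ab43b9aa40ac821cd9913179b3ae64e5b166fc57": {
                "226": {
                    "from": "0x01f87c337be5636Cd9B3D48F1159768A7e7837A5",
                    "to": "0xEa3Fa123Eb40CEEaeED390D8d6dE6AF95f044AF7",
                    "value": 610000000000000000000000,
                    "timestamp": "2022-01-08T16:25:12",
                }
            }
        },
    },
}

# block -> hex -> number -> fields; the dictionary keys become column values
rows = [
    {"Last_scanned_block": payload["last_scanned_block"],
     "block": int(block), "hex": tx_hash, "number": int(number), **fields}
    for block, hashes in payload["blocks"].items()
    for tx_hash, entries in hashes.items()
    for number, fields in entries.items()
]
frame = pd.DataFrame(rows)
print(frame)
```

The value column holds integers too large for a 64-bit integer, so pandas keeps them as Python objects.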

flattening JSON file using json_normalise and choosing specific elements to convert to an excel sheet (Sample Attached)

{
"currency": {
"Wpn": {
"units": "KB_per_sec",
"type": "scalar",
"value": 528922.0,
"direction": "up"
}
},
"catalyst": {
"Wpn": {
"units": "ns",
"type": "scalar",
"value": 70144.0,
"direction": "down"
}
},
"common": {
"Wpn": {
"units": "ns",
"type": "scalar",
"value": 90624.0,
"direction": "down"
}
}
}
So I basically have to convert nested JSON to Excel. My approach was to flatten the JSON file using json_normalize, but as I am new to all this, I always seem to end up with a KeyError.
Here's my code so far, assuming that the file is named json.json:
import requests
from pandas import json_normalize
with open('json.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(sum([i[['Wpn'], ['value']] for i in data], []))
df.to_excel('Ai.xlsx')
I'm trying to get an Excel sheet containing currency and common along with their respective values.
I know there are a lot of similar questions, but trust me, I have tried most of them and still haven't got the desired output. Please help me with this.
Try:
import json
import pandas as pd
with open('json.json', 'r') as f: data = json.load(f)
data = [{'key': k, 'wpn_value': v['Wpn']['value']} for k, v in data.items()]
print(data)
# here, the variable data looks like
# [{'key': 'currency', 'wpn_value': 528922.0}, {'key': 'catalyst', 'wpn_value': 70144.0}, {'key': 'common', 'wpn_value': 90624.0}]
df = pd.DataFrame(data).set_index('key') # set_index() optional
df.to_excel('Ai.xlsx')
The result looks like
key        wpn_value
currency   528922
catalyst   70144
common     90624
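Since the question asks about json_normalize specifically, here is a sketch of how the same values could be pulled out with it. The data is hard-coded from the sample instead of being read from json.json, and the final selection of currency and common is an assumption based on the output the question describes.

```python
import pandas as pd

# Hard-coded from the sample in the question
data = {
    "currency": {"Wpn": {"units": "KB_per_sec", "type": "scalar", "value": 528922.0, "direction": "up"}},
    "catalyst": {"Wpn": {"units": "ns", "type": "scalar", "value": 70144.0, "direction": "down"}},
    "common": {"Wpn": {"units": "ns", "type": "scalar", "value": 90624.0, "direction": "down"}},
}

# json_normalize gives a single row with columns like "currency.Wpn.value"
flat = pd.json_normalize(data)

# keep only the "<key>.Wpn.value" columns and take that single row as a Series
values = flat.filter(regex=r"\.Wpn\.value$").iloc[0]
values.index = [col.split(".")[0] for col in values.index]

# pick the keys of interest; wanted.to_frame("wpn_value").to_excel(...) would write them out
wanted = values.loc[["currency", "common"]]
print(wanted)
```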

loading json files using json normalize + pd concat

Could someone help me optimize my solution for loading data from JSON files using json_normalize and pd.concat?
I have 5k JSON files like this:
[
{
"id": {
"number": "2121",
"exp" : "1",
"state": "California"
},
"state": [
{
"city": "San Francisco",
"pm": "17",
"spot": "2"
},
{
"city": "San Diego",
"pm": "14",
"spot": "1"
}
]
},
{
"id": {
"number": "2122",
"exp" : "1",
"state": "California"
},
"state": [
{
"city": "San Jose",
"pm": "15",
"spot": "1"
}
]
}
]
I have to load the data from 'state', and each city row must carry the date (taken from the JSON file name). My solution is:
json_paths = 'my files_directory'
jsfiles = glob.glob(os.path.join(json_paths, "*.json"))
main_df = pd.DataFrame()
for file in jsfiles:
    df = pd.read_json(file)
    for i in df['state']:
        df2 = pd.concat([pd.DataFrame(json_normalize(i))], ignore_index=False, sort = False)
        df2['date'] = file
        main_df = pd.concat([main_df, df2])
Loading 1000 jsons takes a long time, let alone 5000. Is there any way to optimize my solution?
Many of the functions you are using seem convoluted for this purpose because they somewhat are. json_normalize() is for flattening a dictionary (removing nesting) which you don't need to do since your JSON state object is already flat. Using pd.read_json is fine if your JSON file is already in a convenient format for reading, but yours isn't.
With those things in mind, the easiest thing to do is to parse each JSON file first in Python so that you get the data that you want to correspond to a single row into a dictionary, and keep a list of all of those.
Also I used pathlib.Path objects to clean up globbing and filename extraction.
Something like this is what you want to do:
import pandas as pd
from pathlib import Path
import json
# each dict in states corresponds to a row
states = []
# you can glob directly on pathlib.Path objects
for file in Path("my files_directory").glob("*.json"):
    # load json data
    with open(file) as jsonf:
        data = json.load(jsonf)
    # add the date from the filename stem to each dict, and append to list
    for result in data:
        for state in result["state"]:
            state["date"] = file.stem
            states.append(state)
# create a df where each row corresponds to each dict in states
df = pd.DataFrame(states)
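If you later need fields from the id object on each city row as well, json_normalize does become useful, because its record_path and meta arguments can expand the state list and carry the parent fields along. A small sketch on in-memory data shaped like the sample in the question; the file_stem value is a made-up stand-in for file.stem.

```python
import pandas as pd

# In-memory stand-in for the contents of one JSON file
data = [
    {"id": {"number": "2121", "exp": "1", "state": "California"},
     "state": [{"city": "San Francisco", "pm": "17", "spot": "2"},
               {"city": "San Diego", "pm": "14", "spot": "1"}]},
    {"id": {"number": "2122", "exp": "1", "state": "California"},
     "state": [{"city": "San Jose", "pm": "15", "spot": "1"}]},
]
file_stem = "2022-06-28"  # hypothetical file name stem used as the date

# one row per city; "id.number" and "id.state" are copied from the parent record
cities = pd.json_normalize(
    data,
    record_path="state",
    meta=[["id", "number"], ["id", "state"]],
)
cities["date"] = file_stem
print(cities)
```

For 5k files it is still worth collecting the per-file results in a list and calling pd.concat once at the end, instead of concatenating inside the loop.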

Add missing fields with null values as per position mentioned in the config file in Python while parsing the JSON file data

I have a config file:
Position,ColumnName
1,TXS_ID
4,TXX_NAME
8,AGE
The config file lists only positions 1, 4 and 8, so only 3 columns are available. Positions 2 and 3 (between 1 and 4) are missing, and I want to fill them with null values.
Using this config file, I am trying to parse data from a JSON file with Python, and the output columns must be placed according to the positions above. When the script runs, if "TXS_ID" is available it should pick the value from the JSON file, and since fields 2 and 3 are not defined I want to keep them null.
Sample output file:
TSX_ID,,,TXX_NAME,,,,AGE
10000,,,AAAAAAAAA,,,,40
In short, the data should be extracted from the JSON file according to the config file, and any position missing from the config (as in the example above) should be filled with nulls. Please let me know if there is a way to achieve this.
Below is the sample JSON file:
{
"entities": [
{
"id": "XXXXXXXXXXXXXXX",
"data": {
"attributes": {
"TSX_ID": {
"values": [
{
"value": 10000
}
]
},
"TXX_NAME": {
"values": [
{
"value": "AAAAAAAAA"
}
]
},
"AGE": {
"values": [
{
"value": "40"
}
]
}
}
}
}
]
}
Assuming that the config file line 1,TXS_ID has a typo and is actually 1,TSX_ID, this program works with your sample data (see explanations in comments):
import pandas
import json
# read the "config file" into a Series of the "ColumnName"s
# (read_csv's squeeze=True was removed in pandas 2.0, so select the column instead):
config = pandas.read_csv('config', index_col='Position')['ColumnName']
maxdex = config.index.max() # get the maximum Position
# fill the Positions missing in the "config file" with empty "ColumnName"s:
config = config.reindex(range(1, maxdex+1), fill_value='')
with open('sample.json') as f:
    sample = json.load(f)
# create an empty DataFrame with the desired columns:
output = pandas.DataFrame(columns=config.values)
# now insert the nested JSON data values into the given columns:
for a in config.values:
    if a: # only if not an empty column name, of course
        output[a] = [av['value'] for e in sample['entities']
                     for av in e['data']['attributes'][a]['values']]
output.to_csv('output.csv', index=False)
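The same position-based layout can also be sketched with only the standard library, which makes the padding step easy to see. The assumptions here: the config and the JSON are inlined as strings and dictionaries instead of being read from files, the config uses the corrected name TSX_ID, and only the first entry of each attribute's values list is used, giving one output row per entity. The helper first_value is made up for the example.

```python
import csv
import io

config_text = "Position,ColumnName\n1,TSX_ID\n4,TXX_NAME\n8,AGE\n"
sample = {
    "entities": [
        {"id": "XXXXXXXXXXXXXXX",
         "data": {"attributes": {
             "TSX_ID": {"values": [{"value": 10000}]},
             "TXX_NAME": {"values": [{"value": "AAAAAAAAA"}]},
             "AGE": {"values": [{"value": "40"}]},
         }}}
    ]
}

# map each configured position to its column name
positions = {int(row["Position"]): row["ColumnName"]
             for row in csv.DictReader(io.StringIO(config_text))}

# one header slot per position from 1 to the maximum; unconfigured slots stay empty
header = [positions.get(p, "") for p in range(1, max(positions) + 1)]

def first_value(entity, name):
    """Return the first value of an attribute, or '' if the attribute is absent."""
    values = entity["data"]["attributes"].get(name, {}).get("values", [])
    return values[0]["value"] if values else ""

rows = [[first_value(entity, name) if name else "" for name in header]
        for entity in sample["entities"]]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```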
