Could someone help me optimize my solution for loading data from JSON files using json_normalize and pd.concat?
I have about 5k JSON files that look like this:
[
{
"id": {
"number": "2121",
"exp" : "1",
"state": "California"
},
"state": [
{
"city": "San Francisco",
"pm": "17",
"spot": "2"
},
{
"city": "San Diego",
"pm": "14",
"spot": "1"
}
]
},
{
"id": {
"number": "2122",
"exp" : "1",
"state": "California"
},
"state": [
{
"city": "San Jose",
"pm": "15",
"spot": "1"
}
]
}
]
I have to load the data from 'state', and each city must carry the date (taken from the JSON file name). My solution is:
json_paths = 'my files_directory'
jsfiles = glob.glob(os.path.join(json_paths, "*.json"))
main_df = pd.DataFrame()
for file in jsfiles:
    df = pd.read_json(file)
    for i in df['state']:
        df2 = pd.concat([pd.DataFrame(json_normalize(i))], ignore_index=False, sort = False)
        df2['date'] = file
        main_df = pd.concat([main_df, df2])
Loading 1000 jsons takes a long time, let alone 5000. Is there any way to optimize my solution?
Many of the functions you are using are more convoluted than this task needs. json_normalize() is for flattening a nested dictionary, which you don't need here because each object in your "state" list is already flat. pd.read_json is fine when the JSON file is already in a convenient tabular layout, but yours isn't.
With that in mind, the easiest approach is to parse each JSON file in plain Python first, build one dictionary per row you want, and keep a list of all of those dictionaries. Building the DataFrame once from that list also avoids calling pd.concat inside the loop, which copies the growing frame on every iteration.
I also used pathlib.Path objects to clean up the globbing and the filename extraction.
Something like this is what you want to do:
import pandas as pd
from pathlib import Path
import json

# each dict in states corresponds to a row
states = []
# you can glob directly on pathlib.Path objects
for file in Path("my files_directory").glob("*.json"):
    # load json data
    with open(file) as jsonf:
        data = json.load(jsonf)
    # add the date from the filename stem to each dict, and append to list
    for result in data:
        for state in result["state"]:
            state["date"] = file.stem
            states.append(state)
# create a df where each row corresponds to each dict in states
df = pd.DataFrame(states)
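If fields from the parent "id" object are also wanted on each city row, one possible variant is to let pd.json_normalize expand the "state" list via record_path and copy the parent value via meta, then concatenate once at the end. This is only a sketch: the helper name load_states, the temporary directory, and the file name 2020-01-01.json are made up for the demo, and it assumes every record has both "state" and "id.number".

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_states(directory):
    """Return one row per city, with the file stem as 'date' and the parent id number."""
    frames = []
    for file in sorted(Path(directory).glob("*.json")):
        with open(file) as jsonf:
            data = json.load(jsonf)
        # record_path expands every entry of each "state" list into a row;
        # meta copies id.number from the parent record onto those rows
        frame = pd.json_normalize(data, record_path="state", meta=[["id", "number"]])
        frame["date"] = file.stem
        frames.append(frame)
    # a single concat at the end avoids re-copying the growing frame per file
    return pd.concat(frames, ignore_index=True)


# tiny demo; the file name is a hypothetical stand-in for a date-named file
sample = [
    {"id": {"number": "2121", "exp": "1", "state": "California"},
     "state": [{"city": "San Francisco", "pm": "17", "spot": "2"},
               {"city": "San Diego", "pm": "14", "spot": "1"}]},
    {"id": {"number": "2122", "exp": "1", "state": "California"},
     "state": [{"city": "San Jose", "pm": "15", "spot": "1"}]},
]
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "2020-01-01.json").write_text(json.dumps(sample))
    demo_df = load_states(tmp)
print(demo_df)
```

Note that pd.concat raises a ValueError when the list of frames is empty, so a directory without any matching files would need its own check.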
Suppose I have a large JSON file with 30 million entries like this:
{"id":3,"price":"231","type":"Y","location":"NY"}
{"id":4,"price":"321","type":"N","city":"BR"}
{"id":5,"price":"354","type":"Y","city":"XE","location":"CP"}
--snip--
{"id":30373779,"price":"121","type":"N","city":"SR","location":"IU"}
{"id":30373780,"price":"432","type":"Y","location":"TB"}
{"id":30373780,"price":"562","type":"N","city":"CQ"}
How can I extract only the location and the city and write them into a single JSON like this in Python?
{
"orders":{
3:{
"location":"NY"
},
4:{
"city":"BR"
},
5:{
"city":"XE",
"location":"CP"
},
30373779:{
"city":"SR",
"location":"IU"
},
30373780:{
"location":"TB"
},
30373780:{
"city":"CQ"
}
}
}
P.S.: Beautifying the syntax is not necessary.
Assuming your input file is actually in jsonlines format, then you can read each line, extract the city and location keys from the dict and then append those to a new dict:
import json
from collections import defaultdict

orders = {'orders': defaultdict(dict)}
with open('orders.txt', 'r') as f:
    for line in f:
        o = json.loads(line)
        order_id = o['id']
        # copy only the keys of interest; repeated ids merge into one dict
        if 'location' in o:
            orders['orders'][order_id]['location'] = o['location']
        if 'city' in o:
            orders['orders'][order_id]['city'] = o['city']
print(json.dumps(orders, indent=4))
Output for your sample data (note it has two 30373780 id values, so the values get merged into one dict):
{
"orders": {
"3": {
"location": "NY"
},
"4": {
"city": "BR"
},
"5": {
"location": "CP",
"city": "XE"
},
"30373779": {
"location": "IU",
"city": "SR"
},
"30373780": {
"location": "TB",
"city": "CQ"
}
}
}
Since you said your file is pretty big, you probably don't want to keep all entries in memory. Here is a way to consume the source file line by line and write the output immediately:
import json

with open(r"in.jsonp") as i_f, open(r"out.json", "w") as o_f:
    o_f.write('{"orders":{')
    for i in i_f:
        i_obj = json.loads(i)
        o_f.write(f'{i_obj["id"]}:')
        o_obj = {}
        if location := i_obj.get("location"):
            o_obj["location"] = location
        if city := i_obj.get("city"):
            o_obj["city"] = city
        json.dump(o_obj, o_f)
        o_f.write(",")
    o_f.write('}}')
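It will generate a semi-valid JSON object in the same format you provided in your question: the keys are unquoted integers, ids that repeat appear as duplicate keys, and there is a trailing comma after the last entry, so a strict JSON parser will reject the output.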
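If strictly valid JSON is needed instead, the same line-by-line streaming idea can be sketched as follows. The function name stream_orders and the demo file names are hypothetical, and the input is assumed to be JSON Lines (one object per line). Keys are written as quoted strings and the comma goes before every entry except the first, so nothing trails.

```python
import json
import os
import tempfile


def stream_orders(src_path, dst_path):
    """Stream JSON Lines from src_path into one valid JSON object at dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write('{"orders":{')
        first = True
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            picked = {k: record[k] for k in ("location", "city") if k in record}
            if not first:
                dst.write(",")  # separator goes before every entry but the first
            first = False
            # json.dumps on the id string yields a properly quoted key
            dst.write(f"{json.dumps(str(record['id']))}:{json.dumps(picked)}")
        dst.write("}}")


# demo on three lines taken from the question's sample
lines = [
    '{"id":3,"price":"231","type":"Y","location":"NY"}',
    '{"id":4,"price":"321","type":"N","city":"BR"}',
    '{"id":5,"price":"354","type":"Y","city":"XE","location":"CP"}',
]
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.jsonl")
    dst = os.path.join(tmp, "out.json")
    with open(src, "w") as f:
        f.write("\n".join(lines) + "\n")
    stream_orders(src, dst)
    with open(dst) as f:
        result = json.load(f)
print(result)
```

Repeated ids would still be written as duplicate keys; Python's json module keeps the last one when loading, so merging duplicates requires holding them in memory as in the defaultdict version.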
{
"currency": {
"Wpn": {
"units": "KB_per_sec",
"type": "scalar",
"value": 528922.0,
"direction": "up"
}
},
"catalyst": {
"Wpn": {
"units": "ns",
"type": "scalar",
"value": 70144.0,
"direction": "down"
}
},
"common": {
"Wpn": {
"units": "ns",
"type": "scalar",
"value": 90624.0,
"direction": "down"
}
}
}
So I basically have to convert nested JSON into Excel. My approach was to flatten the JSON file using json_normalize, but as I am new to all of this, I always seem to end up with a KeyError.
Here's my code so far, assuming that the file is named json.json:
import requests
from pandas import json_normalize
with open('json.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(sum([i[['Wpn'], ['value']] for i in data], []))
df.to_excel('Ai.xlsx')
I'm trying to get an Excel sheet that lists currency and common along with their respective values.
I know there are a lot of similar questions, but I have tried most of them and still didn't get the desired output. Please help me with this.
Try:
import json
import pandas as pd
with open('json.json', 'r') as f:
    data = json.load(f)
data = [{'key': k, 'wpn_value': v['Wpn']['value']} for k, v in data.items()]
print(data)
# here, the variable data looks like
# [{'key': 'currency', 'wpn_value': 528922.0}, {'key': 'catalyst', 'wpn_value': 70144.0}, {'key': 'common', 'wpn_value': 90624.0}]
df = pd.DataFrame(data).set_index('key') # set_index() optional
df.to_excel('Ai.xlsx')
The result looks like
key       wpn_value
currency  528922
catalyst  70144
common    90624
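If the other "Wpn" fields are wanted as columns too, or only some of the top-level keys should be kept, a possible variant is to build the frame with DataFrame.from_dict and then select rows and columns. This is a sketch on the sample data from the question; the variable names wpn_df and subset are made up here.

```python
import pandas as pd

data = {
    "currency": {"Wpn": {"units": "KB_per_sec", "type": "scalar", "value": 528922.0, "direction": "up"}},
    "catalyst": {"Wpn": {"units": "ns", "type": "scalar", "value": 70144.0, "direction": "down"}},
    "common": {"Wpn": {"units": "ns", "type": "scalar", "value": 90624.0, "direction": "down"}},
}

# one row per top-level key, one column per field of its "Wpn" object
wpn_df = pd.DataFrame.from_dict({k: v["Wpn"] for k, v in data.items()}, orient="index")
wpn_df.index.name = "key"

# keep only the rows and columns of interest
subset = wpn_df.loc[["currency", "common"], ["value", "units"]]
print(subset)
```

Calling subset.to_excel('Ai.xlsx') would then write just those rows, assuming an Excel writer engine such as openpyxl is installed.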
Currently I'm trying to load a JSON file from a web scrape into Python in order to search it, reorder some of the columns, remove some text such as the (\n), etc. I'm having some issues with the JSON file: pd.read_json() works (kinda), but it returns a dataframe with a single column titled 'Default'. My current code is below and runs without errors.
I tried the native JSON interpreter, but due to some stylized characters I receive an error.
def main():
    file_path = filedialog.askopenfilename()
    df = pd.read_json(file_path)
    print(df)
The JSON file is valid and formatted like so:
{
"Default": [{
"ItemID": "11111",
"Title": "A super captivating title",
"Date": "July 22, 2019",
"URL": "www.someurl.com",
"BodyText": "some text."
}, {
"ItemID": "22222",
"Title": "Even more captivating title",
"Date": "July 12, 2019",
"URL": "www.differenturl.com",
"BodyText": "different text"
}]
}
Now I understand that the "Default" is being interpreted as the JSON object and why it's using it as the column. I experimented with several different orients of the read_json() but received more or less the same result.
I'm hoping to have ItemID, Title, Date, URL, and BodyText be the columns and their values being appropriately designated into rows. Any help is appreciated, I couldn't find a similar question but if it has been answered before please point me in the right direction.
There is no read_json orient that will do it. What you need is to pass the "Default" content to the DataFrame constructor:
import json
import pandas as pd

with open('temp.txt') as fh:
    df = pd.DataFrame(json.load(fh)['Default'])
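The question also mentions an error from the json module on stylized characters and a wish to reorder columns and remove "\n". The error itself isn't shown, so the following is only a guess at one common cause: an encoding mismatch, addressed by opening the file with an explicit encoding="utf-8". The sample text with curly quotes and a trailing newline, the file name scrape.json, and the chosen column order are all made up for the demo.

```python
import json
import os
import tempfile

import pandas as pd

# made-up sample: curly quotes and a trailing newline stand in for the "stylized" text
payload = {"Default": [
    {"ItemID": "11111", "Title": "A super captivating title", "Date": "July 22, 2019",
     "URL": "www.someurl.com", "BodyText": "some \u201cstyled\u201d text.\n"},
    {"ItemID": "22222", "Title": "Even more captivating title", "Date": "July 12, 2019",
     "URL": "www.differenturl.com", "BodyText": "different text"},
]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scrape.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False)
    # an explicit encoding avoids depending on the platform default
    with open(path, encoding="utf-8") as fh:
        df = pd.DataFrame(json.load(fh)["Default"])

# replace embedded newlines and trim surrounding whitespace
df["BodyText"] = df["BodyText"].str.replace("\n", " ", regex=False).str.strip()
# reorder columns by selecting them in the desired order
df = df[["ItemID", "Date", "Title", "URL", "BodyText"]]
print(df)
```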
I'm currently trying to process JSON as a pandas dataframe. I receive a continuous stream of JSON structures that are simply appended to one another, so everything is on a single line. I extracted a .txt from it and now want to analyse it with pandas.
Example snippet:
{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}...
As you can see in this snippet, every JSON object starts with {"positionFlightMessage": and ends with "messageSubtype":"ADSB"}}
After one JSON object ends, the next one is appended directly after it.
What I need is a table built from it, like this:
95b3b6ca-5dd2-44b4-918a-baa51022d143 1.0-RC1 1533134514 DLH1601 44.14525 -1.31849 340 24.0 ADSB AFR1601-1532928365-airline-0002 AFR AFR89GA 442.0 34000.0 ADSB
884708c1-2fff-4ebf-b72c-bbc6ed2c3623 1.0-RC1 1533134515 DLH012 37.34542 143.79951 320 54.0 ADSB EVA12-1532928367-airline-0096 DLH EVA012 462.0 32000.0 ADSB
I tried to use pandas read_json, but I get an error.
import pandas as pd
df = pd.read_json("tD.txt",orient='columns')
df.head()
ValueError: Trailing data
tD.txt contains the snippet given above without the trailing dots (...).
I think the problem is that every JSON object is just appended. I could add a newline after every
messageSubtype":"ADSB"}}
and then read it, but maybe you have a solution that converts the big txt file directly into a df.
Try to get the stream of json to output like the following:
Notice the starting '[' and the ending ']'.
Also notice the ',' between each json input.
data = [{
"positionFlightMessage": {
"messageUuid": "95b3b6ca-5dd2-44b4-918a-baa51022d143",
"schemaVersion": "1.0-RC1",
"timestamp": 1533134514,
"flightNumber": "DLH1601",
"position": {
"waypoint": {
"latitude": 44.14525,
"longitude": -1.31849
},
"flightLevel": 340,
"heading": 24.0
},
"messageSource": "ADSB",
"flightUniqueId": "AFR1601-1532928365-airline-0002",
"airlineIcaoCode": "AFR",
"atcCallsign": "AFR89GA",
"fuel": {},
"speed": {
"groundSpeed": 442.0
},
"altitude": {
"altitude": 34000.0
},
"nextPosition": {
"waypoint": {}
},
"messageSubtype": "ADSB"
}
}, {
"positionFlightMessage": {
"messageUuid": "884708c1-2fff-4ebf-b72c-bbc6ed2c3623",
"schemaVersion": "1.0-RC1",
"timestamp": 1533134515,
"flightNumber": "DLH012",
"position": {
"waypoint": {
"latitude": 37.34542,
"longitude": 143.79951
},
"flightLevel": 320,
"heading": 54.0
},
"messageSource": "ADSB",
"flightUniqueId": "EVA12-1532928367-airline-0096",
"airlineIcaoCode": "DLH",
"atcCallsign": "EVA012",
"fuel": {},
"speed": {
"groundSpeed": 462.0
},
"altitude": {
"altitude": 32000.0
},
"nextPosition": {
"waypoint": {}
},
"messageSubtype": "ADSB"
}
}]
Now you should be able to loop over each element of the list and append it to the pandas df.
print(len(data))
for i in range(0, len(data)):
    # this shows only messageSource; the other fields work the same way
    print(data[i]['positionFlightMessage']['messageSource'])
    # instead of printing here, you should append it to the pandas df
Hope this helps you out a bit.
Now here's a solution for your JSON as is using regex.
s = '{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}'
import re
import json
import pandas as pd

# put a comma before every object start, drop the leading comma, and wrap everything in [ ]
replaced = json.loads('[' + re.sub(r'\{"positionFlightMessage', ',{"positionFlightMessage', s)[1:] + ']')
dfTemp = pd.DataFrame(data=replaced)
# expand each dict in the single column into its own set of columns
df = dfTemp['positionFlightMessage'].apply(pd.Series)
print(df)
First we replace all occurrences of {"positionFlightMessage with ,{"positionFlightMessage and discard the first separator.
We create a dataframe out of this but we have only one column here. Use the apply function on the column and create a new dataframe out of it.
From this dataframe, you can perform some more cleaning.
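An alternative that avoids rewriting the string at all is the standard library's json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and returns the index where it stopped. A minimal sketch follows; the helper name iter_concatenated_json is made up, and the two messages are shortened versions of the ones in the question.

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# shortened stand-in for the appended stream in tD.txt
stream = (
    '{"positionFlightMessage":{"flightNumber":"DLH1601","position":{"flightLevel":340}}}'
    '{"positionFlightMessage":{"flightNumber":"DLH012","position":{"flightLevel":320}}}'
)
messages = [obj["positionFlightMessage"] for obj in iter_concatenated_json(stream)]
print([m["flightNumber"] for m in messages])
```

Passing the resulting list to pd.json_normalize(messages) would flatten the nested objects into dotted column names such as position.flightLevel.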
I use JSON for one of my projects. For example, I have this JSON structure:
{
"address":{
"streetAddress": {
"aptnumber" : "21",
"building_number" : "2nd",
"street" : "Wall Street"
},
"city":"New York"
},
"phoneNumber":
[
{
"type":"home",
"number":"212 555-1234"
}
]
}
Now I have a bunch of modules using this structure, and each one expects to see certain fields in the JSON it receives. For the example above, I have two files: address_manager and phone_number_manager. Each will be passed the relevant information, so address_manager will expect a dict that has the keys 'streetAddress' and 'city'.
My question is: is it possible to set up a constant structure so that every time I change the name of a field in my JSON structure (e.g. I want to change 'streetAddress' to 'address'), I don't have to make changes in several places?
My naive approach is to have a bunch of constants (e.g.
ADDRESS = "address"
ADDRESS_STREET_ADDRESS = "streetAddress"
..etc..
) and so if I want to change the name of one of my fields in the JSON structure, I only have to make the change in one place. However, this seems very unwieldy, because my constant names would become terribly long once I reach the third or fourth layer of the JSON structure (e.g. ADDRESS_STREETADDRESS_APTNUMBER, ADDRESS_STREETADDRESS_BUILDINGNUMBER).
I am doing this in python, but any generic answer would be OK.
Thanks.
Like Cameron Sparr suggested in a comment, don't have your constant names include all levels of your JSON structure. If you have the same data in multiple places, it will actually be better if you reuse the same constant. For example, suppose your JSON has a phone number included in the address:
{
"address": {
"streetAddress": {
"aptnumber" : "21",
"building_number" : "2nd",
"street" : "Wall Street"
},
"city":"New York",
"phoneNumber":
[
{
"type":"home",
"number":"212 555-1234"
}
]
},
"phoneNumber":
[
{
"type":"home",
"number":"212 555-1234"
}
]
}
Why not have a single constant PHONES = 'phoneNumber' that you use in both places? Your constants will have shorter names, and it is more logically coherent. You would end up using it like this (assuming JSON is stored in person):
person[ADDRESS][PHONES][x] # Phone numbers associated with that address
person[PHONES][x] # Phone numbers associated with the person
Instead of
person[ADDRESS][ADDRESS_PHONES][x]
person[PHONE_NUMBERS][x]
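As a rough sketch of how the shared constant pays off, the key names can live in one place and a small helper can work on any dict that carries phone numbers, whether it is the person or the address. The helper name phone_numbers and the NUMBER constant are made up for this illustration.

```python
# key names live in one place; a rename in the JSON means one edit here
ADDRESS = "address"
CITY = "city"
PHONES = "phoneNumber"
NUMBER = "number"


def phone_numbers(record):
    """Return the phone number strings attached to a person or an address dict."""
    return [entry[NUMBER] for entry in record.get(PHONES, [])]


person = {
    ADDRESS: {
        CITY: "New York",
        PHONES: [{"type": "home", "number": "212 555-1234"}],
    },
    PHONES: [{"type": "home", "number": "212 555-1234"}],
}

print(phone_numbers(person))           # numbers associated with the person
print(phone_numbers(person[ADDRESS]))  # numbers associated with that address
```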
You can write a script that, when you change the constant, applies the same change to the structure in all JSON files.
Example:
import json

# (old key, new key)
CHANGE = ('street', 'streetAddress')

with open('file.json') as jfile:
    json_data = json.load(jfile)
# move the value from the old key to the new one
json_data[CHANGE[1]] = json_data.pop(CHANGE[0])
with open('file.json', 'w') as jfile:
    json.dump(json_data, jfile)
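Because the fields in the question are nested (for instance streetAddress sits inside address), a rename at the top level alone would not reach them. One way to cover every level is a small recursive helper. This is a sketch: the function name rename_key and the new key street_address are made up for the example.

```python
import json


def rename_key(node, old, new):
    """Return a copy of a parsed JSON value with every key `old` renamed to `new`."""
    if isinstance(node, dict):
        return {(new if k == old else k): rename_key(v, old, new) for k, v in node.items()}
    if isinstance(node, list):
        return [rename_key(item, old, new) for item in node]
    return node  # strings, numbers, booleans and None are returned unchanged


doc = json.loads('{"address": {"streetAddress": {"street": "Wall Street"}, "city": "New York"}}')
renamed = rename_key(doc, "streetAddress", "street_address")
print(json.dumps(renamed))
```

The helper builds a new structure rather than mutating the input, so the original document stays available for comparison before anything is written back to disk.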