Normalizing nested json data with pandas

I am trying to work with nested JSON and cannot get the result I want.
I have JSON data like this:
{'from_cache': True,
 'results': [{'data': [{'date': '2019/06/01', 'value': 0},
                       {'date': '2019/06/02', 'value': 0},
                       {'date': '2019/08/09', 'value': 7087},
                       {'date': '2019/08/10', 'value': 0},
                       {'date': '2019/08/11', 'value': 15},
                       {'date': '2019/08/12', 'value': 14177},
                       {'date': '2019/08/13', 'value': 0}],
              'name': 'Clicks'},
             {'data': [{'date': '2019/06/01', 'value': 0.0},
                       {'date': '2019/06/02', 'value': 0.0},
                       {'date': '2019/06/03', 'value': 1.0590561064390611},
                       {'date': '2019/08/11', 'value': 1.8610421836228286},
                       {'date': '2019/08/12', 'value': 6.191613785151832},
                       {'date': '2019/08/13', 'value': 0.0}],
              'name': 'Rate'}]}
The expected result is a dataframe like this:
date Clicks Rate
2019/06/01 0 0.0
2019/06/02 0 0.0
2019/08/09 7087 1.0590561064390611
As you can see, I want each 'name' to become a DataFrame column holding its respective 'value' entries.
I am using pd.io.json.json_normalize, but have not managed to get this result. The best I have reached is a DataFrame with the columns date, value, name.
Can someone help me with this?

If I understand correctly, use pd.concat along axis=1 (here d is the dictionary shown in the question):
df = pd.concat([pd.DataFrame(k['data']).rename(columns={'value': k['name']})
                  .set_index('date')
                for k in d['results']],
               sort=False,
               axis=1)
Clicks Rate
2019/06/01 0.0 0.000000
2019/06/02 0.0 0.000000
2019/08/09 7087.0 NaN
2019/08/10 0.0 NaN
2019/08/11 15.0 1.861042
2019/08/12 14177.0 6.191614
2019/08/13 0.0 0.000000
2019/06/03 NaN 1.059056
Another way, with pivot_table:
df = (pd.concat([pd.DataFrame(x['data']).assign(column=x['name']) for x in d['results']])
        .pivot_table(columns='column', index='date', values='value'))
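The two ideas above (flatten, then reshape) can also be sketched with pd.json_normalize followed by pivot. This is a minimal sketch under assumptions: it needs pandas 1.0 or later, the sample dictionary is a shortened copy of the question's data, and the names long_df and wide_df are made up for illustration.

```python
import pandas as pd

# Shortened copy of the question's data (assumption: same structure)
d = {'from_cache': True,
     'results': [{'data': [{'date': '2019/06/01', 'value': 0},
                           {'date': '2019/08/09', 'value': 7087}],
                  'name': 'Clicks'},
                 {'data': [{'date': '2019/06/01', 'value': 0.0},
                           {'date': '2019/06/03', 'value': 1.059}],
                  'name': 'Rate'}]}

# One row per (date, value) record, with the parent 'name' carried along
long_df = pd.json_normalize(d['results'], record_path='data', meta=['name'])

# Spread the 'name' values into columns; dates missing in a series become NaN
wide_df = long_df.pivot(index='date', columns='name', values='value')
print(wide_df)
```

pivot raises an error if the same date appears twice under one name, whereas pivot_table aggregates such duplicates (mean by default), so which one fits depends on the data.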

Without loops:
import pandas as pd
import matplotlib.pyplot as plt
# pd.json_normalize supersedes pandas.io.json.json_normalize, deprecated since pandas 1.0
df = pd.json_normalize(data['results'], record_path=['data'], meta=['name'])
df['date'] = pd.to_datetime(df['date'])
df_clicks = df[df['name'] == 'Clicks'].drop('name', axis=1).rename(columns={'value': 'Clicks'})
df_rate = df[df['name'] == 'Rate'].drop('name', axis=1).rename(columns={'value': 'Rate'})
# the outer merge on the shared 'date' column keeps dates present in only one series
df_final = df_clicks.merge(df_rate, how='outer', sort=True)
df_final.set_index('date', drop=True, inplace=True)
Unexpected data:
2019-06-03: a rate with no clicks
2019-08-09: clicks, but no rate
Plot it:
df_final.plot(kind='bar', logy=True)
plt.show()
Suggested new json format:
data = {'from_cache': True,
'results': [{'date': '2019/06/01', 'Clicks': 0, 'Rate': 0},
{'date': '2019/06/02', 'Clicks': 0, 'Rate': 0},
{'date': '2019/06/03', 'Clicks': 0, 'Rate': 1.0590561064390611},
{'date': '2019/08/09', 'Clicks': 7087, 'Rate': 0},
{'date': '2019/08/10', 'Clicks': 0, 'Rate': 0},
{'date': '2019/08/11', 'Clicks': 15, 'Rate': 1.8610421836228286},
{'date': '2019/08/12', 'Clicks': 14177, 'Rate': 6.191613785151832},
{'date': '2019/08/13', 'Clicks': 0, 'Rate': 0}]}
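With a flat format like the one suggested, each record already is one row, so no normalizing step should be needed. A minimal sketch, using a shortened copy of the suggested data:

```python
import pandas as pd

# Shortened copy of the suggested flat format
data = {'from_cache': True,
        'results': [{'date': '2019/06/01', 'Clicks': 0, 'Rate': 0},
                    {'date': '2019/06/03', 'Clicks': 0, 'Rate': 1.0590561064390611},
                    {'date': '2019/08/09', 'Clicks': 7087, 'Rate': 0}]}

# A list of flat dictionaries maps directly onto DataFrame rows
df = pd.DataFrame(data['results']).set_index('date')
print(df)
```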

Related

Trying to make a pandas dataframe from a dictionary in a list of dictionaries

I have JSON data from a website, where I am trying to create a pandas dataframe from the data. It seems like I have a list of dictionaries that is nested in a dictionary and I am not sure what to do. My goal was to create key,value pairs and then make them into a dataframe.
import requests
import pandas as pd
search_url = 'https://data.europa.eu/api/hub/statistics/data/num-datasets'
response = requests.get(search_url)
root=response.json()
print(root)
I was able to get the data into my notebook, but I am not sure the best way to get data out of the dictionaries into lists to create a dataframe.
I tried to use pd.json_normalize(), but it didn't work.
The output looks like this:
{'name': 'count',
'stats': [{'date': '2019-08-01', 'count': 877625.0},
{'date': '2019-09-01', 'count': 895697.0},
{'date': '2020-10-01', 'count': 1161894.0},
{'date': '2020-11-01', 'count': 1205046.0},
{'date': '2020-12-01', 'count': 1184899.0},
{'date': '2023-01-01', 'count': 1503404.0}]}
My goal is to have two columns in a pd.DataFrame:
Date
Count

d={'name': 'count',
'stats': [{'date': '2019-08-01', 'count': 877625.0},
{'date': '2019-09-01', 'count': 895697.0},
{'date': '2020-10-01', 'count': 1161894.0},
{'date': '2020-11-01', 'count': 1205046.0},
{'date': '2020-12-01', 'count': 1184899.0},
{'date': '2023-01-01', 'count': 1503404.0}]}
pd.DataFrame(d['stats'])
Out[274]:
date count
0 2019-08-01 877625.0
1 2019-09-01 895697.0
2 2020-10-01 1161894.0
3 2020-11-01 1205046.0
4 2020-12-01 1184899.0
5 2023-01-01 1503404.0
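pd.json_normalize can also handle this input once it is told where the records live. Called without record_path, it returns a single row whose 'stats' cell holds the whole list, which may be why it appeared not to work. A minimal sketch with a shortened copy of the data; the column renaming to Date and Count follows the goal stated in the question:

```python
import pandas as pd

# Shortened copy of the API response from the question
root = {'name': 'count',
        'stats': [{'date': '2019-08-01', 'count': 877625.0},
                  {'date': '2019-09-01', 'count': 895697.0}]}

# record_path points at the list of records inside the outer dictionary
df = pd.json_normalize(root, record_path='stats')
df = df.rename(columns={'date': 'Date', 'count': 'Count'})
df['Date'] = pd.to_datetime(df['Date'])
print(df)
```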

save data from api to dataframe

I have the following output when getting data from an API:
{'Textbook': [{'Type': 'Chapters', 'Case': 'Ch09', 'Rates':
[{'Date': '2021- 04-23T00:00:00', 'Rate': 10.0}, {'Date': '2021-04-26T00:00:00', 'Rate': 10.0},
{'Date': '2021-04-27T00:00:00', 'Rate': 10.5}, {'Date': '2021-04-28T00:00:00', 'Rate': 10.5},
{'Date': '2021-04-29T00:00:00', 'Rate': 10.5}, {'Date': '2021-04-30T00:00:00', 'Rate': 10.0}]}]}
I am trying to get the following output in a dataframe:
Date Rate
2021- 04-23T00:00:00 10.0
2021-04-26T00:00:00 10.0
2021-04-27T00:00:00 10.5
etc
I tried the following code:
l = parsed  # this is the output from the API
df = pd.DataFrame()
for i in l:
    d1 = {}
    reportDate = []
    price = []
    for j in i['Chapters']:
        reportDate.append(j['Date'])
        price.append(j['Rate'])
    d1['Date'] = reportDate
    d1['Rate'] = price
    df = df.append(pd.DataFrame(d1))
df['Date'] = pd.to_datetime(df['Date'])
However, I get the error "string indices must be integers" on the line for j in i['Chapters']:

The fix below to your code will solve the issue, although the answer by Andreas is the more Pythonic way.
import ast
import pandas as pd
# Data setup
raw_data="""
{'Textbook': [{'Type': 'Chapters', 'Case': 'Ch09', 'Rates':
[{'Date': '2021- 04-23T00:00:00', 'Rate': 10.0}, {'Date': '2021-04-26T00:00:00', 'Rate': 10.0},
{'Date': '2021-04-27T00:00:00', 'Rate': 10.5}, {'Date': '2021-04-28T00:00:00', 'Rate': 10.5},
{'Date': '2021-04-29T00:00:00', 'Rate': 10.5}, {'Date': '2021-04-30T00:00:00', 'Rate': 10.0}]}]}
"""
val=ast.literal_eval(raw_data) # eval to dictionary
The fix would be (please review the comments in the code):
l = val  # this is the output from the API; val is used in this example
reportDate = []  # moved out of the loop to collect the data
price = []  # moved out of the loop to collect the data
# df = pd.DataFrame()  # build the DataFrame once all the data is ready
for i in l:  # l is a dictionary, so i is a key such as 'Textbook'
    # d1 = {}  # not needed
    for j in l[i][0]['Rates']:
        reportDate.append(j['Date'])
        price.append(j['Rate'])
    # d1['Date'] = reportDate
    # d1['Rate'] = price
    # df = df.append(pd.DataFrame(d1))
# df['Date'] = pd.to_datetime(df['Date'])
df = pd.DataFrame({'Date': reportDate, 'Rate': price})
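The original error comes from iterating over a dictionary: the loop variable is each key (a string), so indexing it with another string fails. The sketch below reproduces that on a shortened copy of the data; the variable names are illustrative, and the exact wording of the error message varies slightly between Python versions.

```python
# Shortened copy of the API output from the question
parsed = {'Textbook': [{'Type': 'Chapters', 'Case': 'Ch09',
                        'Rates': [{'Date': '2021-04-26T00:00:00', 'Rate': 10.0}]}]}

keys_seen = []
message = None
for key in parsed:           # iterating a dict yields its keys, here 'Textbook'
    keys_seen.append(key)
    try:
        key['Chapters']      # indexing a string with a string raises TypeError
    except TypeError as exc:
        message = str(exc)

print(keys_seen, message)

# The records live under the key, inside the first list element
rates = parsed['Textbook'][0]['Rates']
print(rates)
```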

You can try this:
d = {'Textbook': [{'Type': 'Chapters', 'Case': 'Ch09', 'Rates':
[{'Date': '2021- 04-23T00:00:00', 'Rate': 10.0}, {'Date': '2021-04-26T00:00:00', 'Rate': 10.0},
{'Date': '2021-04-27T00:00:00', 'Rate': 10.5}, {'Date': '2021-04-28T00:00:00', 'Rate': 10.5},
{'Date': '2021-04-29T00:00:00', 'Rate': 10.5}, {'Date': '2021-04-30T00:00:00', 'Rate': 10.0}]}]}
pd.DataFrame(d.get('Textbook')[0].get('Rates'))
# Date Rate
# 0 2021- 04-23T00:00:00 10.0
# 1 2021-04-26T00:00:00 10.0
# 2 2021-04-27T00:00:00 10.5
# 3 2021-04-28T00:00:00 10.5
# 4 2021-04-29T00:00:00 10.5
# 5 2021-04-30T00:00:00 10.0
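If the 'Type' and 'Case' fields should be kept next to each rate, or if 'Textbook' may hold more than one entry, pd.json_normalize with record_path and meta is one option. A minimal sketch, assuming pandas 1.0 or later and a shortened copy of the data:

```python
import pandas as pd

# Shortened copy of the API output from the question
d = {'Textbook': [{'Type': 'Chapters', 'Case': 'Ch09',
                   'Rates': [{'Date': '2021-04-26T00:00:00', 'Rate': 10.0},
                             {'Date': '2021-04-27T00:00:00', 'Rate': 10.5}]}]}

# One row per entry in 'Rates'; 'Type' and 'Case' are repeated on every row
df = pd.json_normalize(d['Textbook'], record_path='Rates', meta=['Type', 'Case'])
print(df)
```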

Possible solutions to your question
You could read the documentation here: https://www.activestate.com/resources/quick-reads/how-to-save-a-dataframe/
or you could try this code:
d = {'Textbook': [{'Type': 'Chapters', 'Case': 'Ch09', 'Rates':
[{'Date': '2021- 04-23T00:00:00', 'Rate': 10.0}, {'Date': '2021-04-26T00:00:00', 'Rate': 10.0},
{'Date': '2021-04-27T00:00:00', 'Rate': 10.5}, {'Date': '2021-04-28T00:00:00', 'Rate': 10.5},
{'Date': '2021-04-29T00:00:00', 'Rate': 10.5}, {'Date': '2021-04-30T00:00:00', 'Rate': 10.0}]}]}
pd.DataFrame(d.get('Textbook')[0].get('Rates'))
If the code doesn't work, please comment below. I'll be glad to help with any other questions.

Converting JSON to pandas DataFrame - Python (JSON from yahoo_financials)

Can anyone help me with this JSON format? (updated dataframe)
JSON:
{'PSG.MC': [{'date': 1547452800,'formatted_date': '2019-01-14', 'amount': 0.032025}, {'date': 1554361200, 'formatted_date': '2019-04-04', 'amount': 0.032025}, {'date': 1562310000, 'formatted_date': '2019-07-05', 'amount': 0.032025}, {'date': 1570690800, 'formatted_date': '2019-10-10', 'amount': 0.032025}, {'date': 1578902400, 'formatted_date': '2020-01-13', 'amount': 0.033}, {'date': 1588057200, 'formatted_date': '2020-04-28', 'amount': 0.033}, {'date': 1595228400, 'formatted_date': '2020-07-20', 'amount': 0.033}, {'date': 1601362800, 'formatted_date': '2020-09-29', 'amount': 0.033}, {'date': 1603436400, 'formatted_date': '2020-10-23', 'amount': 0.033}], 'ACX.MC': [{'date': 1559545200, 'formatted_date': '2019-06-03', 'amount': 0.3}, {'date': 1562137200, 'formatted_date': '2019-07-03', 'amount': 0.2}, {'date': 1591254000, 'formatted_date': '2020-06-04', 'amount': 0.4}, {'date': 1594018800, 'formatted_date': '2020-07-06', 'amount': 0.1}, {'date': 1606809600, 'formatted_date': '2020-12-01', 'amount': 0.1}]}
I got it from
yahoo_financials.get_daily_dividend_data('2019-1-1', '2020-12-1')
as an example.
I tried to convert it to a DataFrame with:
data2 = {"data": {'VIG.VI': [{'date'......................................
s=pd.DataFrame(data2)
pd.concat([s.drop('data',1),pd.DataFrame(s.data.tolist(),index=s.index)],1)
In this case I get a result like: 0 [{'date': 1433314500, 'formatted_date': '2015-... [{'date': 1430290500, 'formatted_date': '2015-...
Everything works if we use only one date and delete the [].
I also tried the code posted under this topic. It works fine if the format is the same for every variable in [], but with data like the example above I get the error "arrays must all be same length".
Does anyone have an idea how I could convert this type of JSON to a DataFrame?

You can convert each list of dicts into a dict of lists, then convert the final dict into a DataFrame with MultiIndex columns:
import pandas as pd
from collections import defaultdict
data2 = {"data": {'PSG.MC': [{'date': 1547452800,'formatted_date': '2019-01-14', 'amount': 0.032025}, {'date': 1554361200, 'formatted_date': '2019-04-04', 'amount': 0.032025}, {'date': 1562310000, 'formatted_date': '2019-07-05', 'amount': 0.032025}, {'date': 1570690800, 'formatted_date': '2019-10-10', 'amount': 0.032025}, {'date': 1578902400, 'formatted_date': '2020-01-13', 'amount': 0.033}, {'date': 1588057200, 'formatted_date': '2020-04-28', 'amount': 0.033}, {'date': 1595228400, 'formatted_date': '2020-07-20', 'amount': 0.033}, {'date': 1601362800, 'formatted_date': '2020-09-29', 'amount': 0.033}, {'date': 1603436400, 'formatted_date': '2020-10-23', 'amount': 0.033}], 'ACX.MC': [{'date': 1559545200, 'formatted_date': '2019-06-03', 'amount': 0.3}, {'date': 1562137200, 'formatted_date': '2019-07-03', 'amount': 0.2}, {'date': 1591254000, 'formatted_date': '2020-06-04', 'amount': 0.4}, {'date': 1594018800, 'formatted_date': '2020-07-06', 'amount': 0.1}, {'date': 1606809600, 'formatted_date': '2020-12-01', 'amount': 0.1}]}}
data = {}
for key, values in data2['data'].items():
    # turn the list of records into a dict of lists, one list per field
    res = defaultdict(list)
    for sub in values:
        for k in sub:
            res[k].append(sub[k])
    data[key] = dict(res)

def reform_dict(data):
    # flatten {outer: {inner: values}} into {(outer, inner): values}
    reformed_dict = {}
    for outerKey, innerDict in data.items():
        for innerKey, values in innerDict.items():
            reformed_dict[(outerKey, innerKey)] = values
    return reformed_dict

df = pd.concat([pd.DataFrame(reform_dict({key: value})) for key, value in data.items()], axis=1)
print(df)
PSG.MC ACX.MC
date formatted_date amount date formatted_date amount
0 1547452800 2019-01-14 0.032025 1.559545e+09 2019-06-03 0.3
1 1554361200 2019-04-04 0.032025 1.562137e+09 2019-07-03 0.2
2 1562310000 2019-07-05 0.032025 1.591254e+09 2020-06-04 0.4
3 1570690800 2019-10-10 0.032025 1.594019e+09 2020-07-06 0.1
4 1578902400 2020-01-13 0.033000 1.606810e+09 2020-12-01 0.1
5 1588057200 2020-04-28 0.033000 NaN NaN NaN
6 1595228400 2020-07-20 0.033000 NaN NaN NaN
7 1601362800 2020-09-29 0.033000 NaN NaN NaN
8 1603436400 2020-10-23 0.033000 NaN NaN NaN
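A shorter route to the same kind of MultiIndex-column layout is to pass a dictionary of DataFrames to pd.concat, which uses the dictionary keys as the outer column level. This is a sketch on a shortened copy of the data, not the answerer's method; the names payload and wide are illustrative.

```python
import pandas as pd

# Shortened copy of the dividend data (tickers with different numbers of records)
payload = {'PSG.MC': [{'date': 1547452800, 'formatted_date': '2019-01-14', 'amount': 0.032025},
                      {'date': 1554361200, 'formatted_date': '2019-04-04', 'amount': 0.032025}],
           'ACX.MC': [{'date': 1559545200, 'formatted_date': '2019-06-03', 'amount': 0.3}]}

# Each ticker becomes its own DataFrame; concat along axis=1 aligns them by row
# position and fills the shorter ticker with NaN
wide = pd.concat({ticker: pd.DataFrame(rows) for ticker, rows in payload.items()}, axis=1)
print(wide)
```

Because the rows are aligned only by position, values on the same row do not refer to the same date across tickers.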

Thank you for your code and help.
Here is my code. It works well and outputs a tidy table with the needed data; maybe it will be helpful for someone:
import pandas as pd
from yahoofinancials import YahooFinancials  # assumes the yahoofinancials package is installed

# getDividends: fetch the raw dividend data for the given tickers
def getDividends(tickers, start_date, end_date):
    yahoo_financials = YahooFinancials(tickers)
    dividends = yahoo_financials.get_daily_dividend_data(start_date, end_date)
    return dividends

# getDividendDataFrame: flatten the dividend data into one table, one row per dividend
def getDividendDataFrame(tickerList):
    dividendList = getDividends(tickerList, '2015-1-1', '2020-12-1')
    dataFrame = pd.DataFrame()
    for ticker in dividendList:
        for dividend in dividendList[ticker]:
            series = pd.Series([ticker, dividend['formatted_date'], dividend['amount']])
            dfItem = pd.DataFrame([series])
            dataFrame = pd.concat([dataFrame, dfItem], ignore_index=True)
        print('\n')
    dataFrame.columns = ['Ticker', 'formatted_date', 'amount']
    return dataFrame
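Growing a DataFrame with pd.concat inside a loop copies the data on every iteration. The same long table can be sketched by collecting plain dictionaries first and building the DataFrame once. The sample below uses a small hard-coded dictionary in place of the API call, so the name dividends and its contents are assumptions.

```python
import pandas as pd

# Stand-in for the dictionary returned by the dividend API call
dividends = {'PSG.MC': [{'date': 1547452800, 'formatted_date': '2019-01-14', 'amount': 0.032025},
                        {'date': 1554361200, 'formatted_date': '2019-04-04', 'amount': 0.032025}],
             'ACX.MC': [{'date': 1559545200, 'formatted_date': '2019-06-03', 'amount': 0.3}]}

# One flat dictionary per dividend, tagged with its ticker
rows = [{'Ticker': ticker, **record}
        for ticker, records in dividends.items()
        for record in records]

# Build the table in a single step, keeping only the columns of interest
df = pd.DataFrame(rows, columns=['Ticker', 'formatted_date', 'amount'])
print(df)
```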

Trying to prevent For Loop values with none from being added to Dictionary Python 3

I am new to Python and am trying to write a for loop that iterates over a large text file line by line, extracts specific regex matches, and adds them to a new CSV file. I am following code I found for a similar problem. My issue is that None values are being added to the dictionary despite the "if value is not None" check. The output CSV contains multiple blank rows because all of the None values are included in the list. Any help would be appreciated. Code below:
import re
import pandas as pd
list = []
fh = open(r"test_data.txt", "r").read()
contents = fh.split()
for item in contents:
    list_dict = {}
    date_field = re.search(r"(\d{1})[/.-](\d{1})[/.-](\d{4})$", item)
    if date_field is not None:
        date = date_field.group()
    else:
        date = None
    list_dict["date"] = date
    list.append(list_dict)
print(list)
df = pd.DataFrame(list)
df.to_csv("test_export_with_testdata.csv", index=False)
Output
[{'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': '2/5/2021'}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': '2/6/2021'}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': '2/7/2021'}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': '2/8/2021'}, {'date': None}, {'date': None}, {'date': None}, {'date': None}, {'date': None}]
Process finished with exit code 0

You are still running your append() if the value is None.
If you do not want to include the lines where the regex found no result, simply move all of the code into the if statement.
for item in contents:
    list_dict = {}
    date_field = re.search(r"(\d{1})[/.-](\d{1})[/.-](\d{4})$", item)
    if date_field:
        date = date_field.group()
        list_dict["date"] = date
        list.append(list_dict)
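The same filtering can be sketched as a list comprehension that keeps only the tokens where the regex matched. To stay self-contained, this sketch reads from an in-memory string instead of test_data.txt; the sample text and the names text, date_pattern and rows are made up for illustration.

```python
import re
import pandas as pd

# Stand-in for the contents of the text file
text = "start 2/5/2021 10:41:45 foo 2/6/2021 bar"

# Same pattern as in the question, compiled once
date_pattern = re.compile(r"(\d{1})[/.-](\d{1})[/.-](\d{4})$")

# search() returns None for non-matching tokens, and the 'if match' drops those
rows = [{"date": match.group()}
        for match in map(date_pattern.search, text.split())
        if match]

df = pd.DataFrame(rows)
print(df)
```

It also avoids naming a variable list, which shadows the built-in type.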

import re
import pandas as pd
list1 = []
fh = open(r"test_data.txt", "r").read()
contents = fh.split()
for item in contents:
    list_dict = {}
    date_field = re.search(r"(\d{1})[/.-](\d{1})[/.-](\d{4})$", item)
    if date_field is not None:
        date = date_field.group()
        list_dict["date"] = date
        list1.append(list_dict)
    else:
        date = None
    time_field = re.search(r"(\d{1,2})[:](\d{2})[:](\d{2})$", item)
    if time_field is not None:
        time = time_field.group()
        list_dict["time"] = time
        list1.append(list_dict)
print(list1)
df = pd.DataFrame(list1)
df.to_csv("test_export_with_testdata.csv", index=False)
Output:
date time
0 2/5/2021 NaN
1 NaN 10:41:45
2 2/6/2021 NaN
3 NaN 10:42:45
4 2/7/2021 NaN
5 NaN 10:43:45
6 2/8/2021 NaN
7 NaN 10:44:45

Find combined max value

I have the following DataFrame:
{'date': '2019-10-21', 'hour': 3, 'id': '1'},
{'date': '2019-10-21', 'hour': 4, 'id': '1'},
{'date': '2019-10-20', 'hour': 0, 'id': '1'},
{'date': '2019-10-20', 'hour': 1, 'id': '1'},
{'date': '2019-10-21', 'hour': 0, 'id': '1'},
{'date': '2019-10-20', 'hour': 0, 'id': '1'},
{'date': '2019-10-19', 'hour': 5, 'id': '1'},
{'date': '2019-10-20', 'hour': 0, 'id': '2'},
{'date': '2019-10-20', 'hour': 0, 'id': '3'}
I need to find the latest date and hour for each id. For instance, for id=1 I want 2019-10-21 and 4; I am getting the correct date, but hour=5.

Use DataFrame.sort_values on all three columns, then keep the first row per id with DataFrame.drop_duplicates:
import pandas as pd
L = [{'date': '2019-10-21', 'hour': 3, 'id': '1'},
     {'date': '2019-10-21', 'hour': 4, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '1'},
     {'date': '2019-10-20', 'hour': 1, 'id': '1'},
     {'date': '2019-10-21', 'hour': 0, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '1'},
     {'date': '2019-10-19', 'hour': 5, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '2'},
     {'date': '2019-10-20', 'hour': 0, 'id': '3'}]
df = pd.DataFrame(L)
df['date'] = pd.to_datetime(df['date'])
# newest date and hour first within each id, then keep only the first row per id
df = df.sort_values(['id', 'date', 'hour'], ascending=[True, False, False]).drop_duplicates('id')
print(df)
date hour id
1 2019-10-21 4 1
7 2019-10-20 0 2
8 2019-10-20 0 3
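An hour of 5 for id=1 is what taking the maximum of each column independently would give, since 5 is the largest hour on any date; that is a guess, as the question does not show the code. An alternative sketch that avoids the problem combines date and hour into one timestamp and picks the row where it is largest per id. The helper column ts is introduced here only for illustration.

```python
import pandas as pd

L = [{'date': '2019-10-21', 'hour': 3, 'id': '1'},
     {'date': '2019-10-21', 'hour': 4, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '1'},
     {'date': '2019-10-20', 'hour': 1, 'id': '1'},
     {'date': '2019-10-21', 'hour': 0, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '1'},
     {'date': '2019-10-19', 'hour': 5, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '2'},
     {'date': '2019-10-20', 'hour': 0, 'id': '3'}]
df = pd.DataFrame(L)

# Combine date and hour so that "latest" is a single comparable value
df['ts'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')

# idxmax gives the row label of the largest timestamp within each id
latest = df.loc[df.groupby('id')['ts'].idxmax(), ['id', 'date', 'hour']]
print(latest)
```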
