Pandas, create dictionary from df, with one column replacing another - python

I have an unknown number of DataFrames.
Two, for example:
date week_score daily_score site_name
0 2014-07-04 100 90 demo 2
1 2014-07-05 80 55 demo 2
2 2015-07-06 70 60 demo 2
date week_score daily_score site_name
0 2014-07-04 85 100 demo 1
1 2014-07-05 50 80 demo 1
2 2015-07-06 45 30 demo 1
I know the DataFrames all have the same shape and column names.
I want to combine them into a list of dictionaries (df.to_dict(orient='records')), but with the site_name as the key, and to do this for every score column.
The desired output is a bit tricky:
{'week_score': [{'date': '2014-07-04', 'demo 2': 100, 'demo 1': 85},
                {'date': '2014-07-05', 'demo 2': 80, 'demo 1': 50},
                {'date': '2014-07-06', 'demo 2': 70, 'demo 1': 45}],
 'daily_score': [{'date': '2014-07-04', 'demo 2': 90, 'demo 1': 100},
                 {'date': '2014-07-05', 'demo 2': 55, 'demo 1': 80},
                 {'date': '2014-07-06', 'demo 2': 60, 'demo 1': 30}],
}

you can try this code (iterating over dfs[0].columns, since df is not defined until the inner loop):
d = dict()
for col in dfs[0].columns[1:-1]:
    new_df = pd.DataFrame({'date': dfs[0]['date']})
    for df in dfs:
        site_name = df['site_name'].unique()[0]
        new_df[site_name] = df[col]
    d[col] = new_df.to_dict('records')
>>> d
output:
{'week_score': [{'date': '2014-07-04', 'demo1': 85, 'demo2': 100},
{'date': '2014-07-05', 'demo1': 50, 'demo2': 80},
{'date': '2015-07-06', 'demo1': 45, 'demo2': 70}],
'daily_score': [{'date': '2014-07-04', 'demo1': 100, 'demo2': 90},
{'date': '2014-07-05', 'demo1': 80, 'demo2': 55},
{'date': '2015-07-06', 'demo1': 30, 'demo2': 60}]}
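For reference, a self-contained sketch of the same approach; the DataFrame construction below is an assumption reconstructed from the sample data in the question:

```python
import pandas as pd

# Rebuild the two sample frames shown in the question (assumed layout).
df_a = pd.DataFrame({'date': ['2014-07-04', '2014-07-05', '2015-07-06'],
                     'week_score': [100, 80, 70],
                     'daily_score': [90, 55, 60],
                     'site_name': ['demo 2'] * 3})
df_b = pd.DataFrame({'date': ['2014-07-04', '2014-07-05', '2015-07-06'],
                     'week_score': [85, 50, 45],
                     'daily_score': [100, 80, 30],
                     'site_name': ['demo 1'] * 3})
dfs = [df_a, df_b]

d = {}
for col in dfs[0].columns[1:-1]:          # the score columns
    new_df = pd.DataFrame({'date': dfs[0]['date']})
    for df in dfs:
        # each frame has a single site_name; use it as the column name
        new_df[df['site_name'].iloc[0]] = df[col]
    d[col] = new_df.to_dict('records')
```

This relies on every frame sharing the same dates in the same order; if that is not guaranteed, merging on 'date' would be safer.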


Converting JSON to pandas DataFrame - Python (JSON from yahoo_financials)

Can anyone help me with this JSON format? (updated dataframe)
JSON:
{'PSG.MC': [{'date': 1547452800,'formatted_date': '2019-01-14', 'amount': 0.032025}, {'date': 1554361200, 'formatted_date': '2019-04-04', 'amount': 0.032025}, {'date': 1562310000, 'formatted_date': '2019-07-05', 'amount': 0.032025}, {'date': 1570690800, 'formatted_date': '2019-10-10', 'amount': 0.032025}, {'date': 1578902400, 'formatted_date': '2020-01-13', 'amount': 0.033}, {'date': 1588057200, 'formatted_date': '2020-04-28', 'amount': 0.033}, {'date': 1595228400, 'formatted_date': '2020-07-20', 'amount': 0.033}, {'date': 1601362800, 'formatted_date': '2020-09-29', 'amount': 0.033}, {'date': 1603436400, 'formatted_date': '2020-10-23', 'amount': 0.033}], 'ACX.MC': [{'date': 1559545200, 'formatted_date': '2019-06-03', 'amount': 0.3}, {'date': 1562137200, 'formatted_date': '2019-07-03', 'amount': 0.2}, {'date': 1591254000, 'formatted_date': '2020-06-04', 'amount': 0.4}, {'date': 1594018800, 'formatted_date': '2020-07-06', 'amount': 0.1}, {'date': 1606809600, 'formatted_date': '2020-12-01', 'amount': 0.1}]}
I got it, as an example, from
yahoo_financials.get_daily_dividend_data('2019-1-1', '2020-12-1')
I tried to convert it to a DataFrame with:
data2 = {"data": {'VIG.VI': [{'date'......................................
s=pd.DataFrame(data2)
pd.concat([s.drop('data',1),pd.DataFrame(s.data.tolist(),index=s.index)],1)
In this case I get a result like: 0 [{'date': 1433314500, 'formatted_date': '2015-... [{'date': 1430290500, 'formatted_date': '2015-...
Everything is perfect if we use only one date and delete the [].
I also tried the code under this topic: it works fine if the format is the same for every variable in the [], but if it is as in the example above, I get the error "arrays must all be same length".
Does anyone have any idea how could I convert this type of JSON to DataFrame?
You can convert each list of dicts to a dict of lists, then convert the final dict to a DataFrame with MultiIndex columns:
import pandas as pd
from collections import defaultdict
data2 = {"data": {'PSG.MC': [{'date': 1547452800,'formatted_date': '2019-01-14', 'amount': 0.032025}, {'date': 1554361200, 'formatted_date': '2019-04-04', 'amount': 0.032025}, {'date': 1562310000, 'formatted_date': '2019-07-05', 'amount': 0.032025}, {'date': 1570690800, 'formatted_date': '2019-10-10', 'amount': 0.032025}, {'date': 1578902400, 'formatted_date': '2020-01-13', 'amount': 0.033}, {'date': 1588057200, 'formatted_date': '2020-04-28', 'amount': 0.033}, {'date': 1595228400, 'formatted_date': '2020-07-20', 'amount': 0.033}, {'date': 1601362800, 'formatted_date': '2020-09-29', 'amount': 0.033}, {'date': 1603436400, 'formatted_date': '2020-10-23', 'amount': 0.033}], 'ACX.MC': [{'date': 1559545200, 'formatted_date': '2019-06-03', 'amount': 0.3}, {'date': 1562137200, 'formatted_date': '2019-07-03', 'amount': 0.2}, {'date': 1591254000, 'formatted_date': '2020-06-04', 'amount': 0.4}, {'date': 1594018800, 'formatted_date': '2020-07-06', 'amount': 0.1}, {'date': 1606809600, 'formatted_date': '2020-12-01', 'amount': 0.1}]}}
data = {}
for key, values in data2['data'].items():
    res = defaultdict(list)
    for sub in values:
        for k in sub:
            res[k].append(sub[k])
    data[key] = dict(res)

def reform_dict(data):
    # flatten {outer: {inner: values}} into {(outer, inner): values} for MultiIndex columns
    reformed_dict = {}
    for outerKey, innerDict in data.items():
        for innerKey, values in innerDict.items():
            reformed_dict[(outerKey, innerKey)] = values
    return reformed_dict

df = pd.concat([pd.DataFrame(reform_dict({key: value})) for key, value in data.items()], axis=1)
print(df)
PSG.MC ACX.MC
date formatted_date amount date formatted_date amount
0 1547452800 2019-01-14 0.032025 1.559545e+09 2019-06-03 0.3
1 1554361200 2019-04-04 0.032025 1.562137e+09 2019-07-03 0.2
2 1562310000 2019-07-05 0.032025 1.591254e+09 2020-06-04 0.4
3 1570690800 2019-10-10 0.032025 1.594019e+09 2020-07-06 0.1
4 1578902400 2020-01-13 0.033000 1.606810e+09 2020-12-01 0.1
5 1588057200 2020-04-28 0.033000 NaN NaN NaN
6 1595228400 2020-07-20 0.033000 NaN NaN NaN
7 1601362800 2020-09-29 0.033000 NaN NaN NaN
8 1603436400 2020-10-23 0.033000 NaN NaN NaN
Thank you for your code and help.
Sharing my code here; it works nicely and the output is a nice table with the needed data. Maybe it will be helpful for someone:
def getDividends(tickers, start_date, end_date):
    yahoo_financials = YahooFinancials(tickers)
    dividends = yahoo_financials.get_daily_dividend_data(start_date, end_date)
    return dividends

def getDividendDataFrame(tickerList):
    dividendList = getDividends(tickerList, '2015-1-1', '2020-12-1')
    dataFrame = pd.DataFrame()
    for ticker in dividendList:
        for dividend in dividendList[ticker]:
            series = pd.Series([ticker, dividend['formatted_date'], dividend['amount']])
            dfItem = pd.DataFrame([series])
            dataFrame = pd.concat([dataFrame, dfItem], ignore_index=True)
    print('\n')
    dataFrame.columns = ['Ticker', 'formatted_date', 'amount']
    return dataFrame
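The per-row pd.concat in the loop above grows quadratically with the number of rows. An equivalent sketch that collects all rows first and builds the frame once (dividendList here is a trimmed stand-in for the dividend JSON shown earlier):

```python
import pandas as pd

# Stand-in for yahoo_financials.get_daily_dividend_data(...) output (assumed shape).
dividendList = {'ACX.MC': [{'date': 1559545200, 'formatted_date': '2019-06-03', 'amount': 0.3},
                           {'date': 1562137200, 'formatted_date': '2019-07-03', 'amount': 0.2}]}

# Flatten to (ticker, date, amount) tuples, then build the DataFrame in one call.
rows = [(ticker, div['formatted_date'], div['amount'])
        for ticker, divs in dividendList.items() for div in divs]
dataFrame = pd.DataFrame(rows, columns=['Ticker', 'formatted_date', 'amount'])
```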

DataFrame pop function removing wanted values in nested dictionary

I have a DataFrame that has a nested dict within a column. I am removing the nested values and creating a column for each associated key. When using the pop function on pricings, it removes values that I want to keep: the '1 color', '2 color', '3 color', '4 color', '5 color', '6 color' keys.
The nested dict, which lives in the column named variations, looks like this:
{'name': 'printing on a DARK shirt',
'pricings': {'1 color': [{'max': 47, 'min': 1, 'price': 100.0},
{'max': 71, 'min': 48, 'price': 40.25},
{'max': 143, 'min': 72, 'price': 2.8},
{'max': 287, 'min': 144, 'price': 2.5}],
'2 color': [{'max': 47, 'min': 1, 'price': 200.0},
{'max': 71, 'min': 48, 'price': 4.25},
{'max': 143, 'min': 72, 'price': 3.8},
{'max': 287, 'min': 144, 'price': 3.5}],
'3 color': [{'max': 47, 'min': 1, 'price': 300.0},
{'max': 71, 'min': 48, 'price': 5.25},
{'max': 143, 'min': 72, 'price': 4.8},
{'max': 287, 'min': 144, 'price': 4.5}],
'4 color': [{'max': 47, 'min': 1, 'price': 400.0},
{'max': 71, 'min': 48, 'price': 6.25},
{'max': 143, 'min': 72, 'price': 5.8},
{'max': 287, 'min': 144, 'price': 5.5}],
'5 color': [{'max': 47, 'min': 1, 'price': 500.0},
{'max': 71, 'min': 48, 'price': 7.5},
{'max': 143, 'min': 72, 'price': 7.0},
{'max': 287, 'min': 144, 'price': 6.6}],
'6 color': [{'max': 47, 'min': 1, 'price': 600.0},
{'max': 71, 'min': 48, 'price': 8.5},
{'max': 143, 'min': 72, 'price': 8.0},
{'max': 287, 'min': 144, 'price': 7.6}]}}
The code I'm using looks like this:
df2 = (pd.concat({i: pd.DataFrame(x) for i, x in df1.pop('variations').items()})
         .reset_index(level=1, drop=True)
         .join(df1, how='left', lsuffix='_left', rsuffix='_right')
         .reset_index(drop=True))
The output is as follows, with a new column named pricing added:
[{'max': 47, 'min': 1, 'price': 20.0},
{'max': 71, 'min': 48, 'price': 4.25},
{'max': 143, 'min': 72, 'price': 3.8},
{'max': 287, 'min': 144, 'price': 3.5}]
In case it's not clear from the DataFrame: the actual list of color ranges ('1 color', '2 color', '3 color', '4 color', '5 color', '6 color') has fallen off. This is important and the portion I want most; to be clear, the colors have not been given their own columns.
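Not from the original thread, but one sketch of keeping the color keys: pd.json_normalize leaves each nested key as its own (dotted) column instead of popping it away. The trimmed variation dict below is an assumption based on the sample shown above:

```python
import pandas as pd

# Trimmed stand-in for one entry of the 'variations' column (assumed shape).
variation = {'name': 'printing on a DARK shirt',
             'pricings': {'1 color': [{'max': 47, 'min': 1, 'price': 100.0}],
                          '2 color': [{'max': 47, 'min': 1, 'price': 200.0}]}}

# Each color key becomes its own column ('pricings.1 color', ...);
# the price-break lists stay intact inside the cells.
df = pd.json_normalize(variation)
```

From there each 'pricings.N color' column could be exploded further if per-tier rows are needed.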

Normalizing nested json data with pandas

I am trying to work with nested JSON and I am not reaching the result that I want.
I have a JSON data like this:
{'from_cache': True,
'results': [{'data': [{'date': '2019/06/01', 'value': 0},
{'date': '2019/06/02', 'value': 0},
{'date': '2019/08/09', 'value': 7087},
{'date': '2019/08/10', 'value': 0},
{'date': '2019/08/11', 'value': 15},
{'date': '2019/08/12', 'value': 14177},
{'date': '2019/08/13', 'value': 0}],
'name': 'Clicks'},
{'data': [{'date': '2019/06/01', 'value': 0.0},
{'date': '2019/06/02', 'value': 0.0},
{'date': '2019/06/03', 'value': 1.0590561064390611},
{'date': '2019/08/11', 'value': 1.8610421836228286},
{'date': '2019/08/12', 'value': 6.191613785151832},
{'date': '2019/08/13', 'value': 0.0}],
'name': 'Rate'}]}
The expected result is a dataframe like this:
date Clicks Rate
2019/06/01 0 0.0
2019/06/02 0 0.0
2019/08/09 7087 1.0590561064390611
As you can see, I want each 'name' as a DataFrame column with the respective values.
I am working with pd.io.json_normalize, but with no success getting this result. The best result I've reached is a DataFrame with the columns: date, value, name.
Can someone help me with this?
IIUC, use pd.concat with axis=1:
df = pd.concat([pd.DataFrame(k['data']).rename(columns={'value': k['name']})
                  .set_index('date')
                for k in d['results']],
               sort=False,
               axis=1)
Clicks Rate
2019/06/01 0.0 0.000000
2019/06/02 0.0 0.000000
2019/08/09 7087.0 NaN
2019/08/10 0.0 NaN
2019/08/11 15.0 1.861042
2019/08/12 14177.0 6.191614
2019/08/13 0.0 0.000000
2019/06/03 NaN 1.059056
Another way, with pivot_table:
df = (pd.concat([pd.DataFrame(x['data']).assign(column=x['name']) for x in d['results']])
        .pivot_table(columns='column', index='date', values='value'))
Without loops:
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
df = json_normalize(data['results'], record_path=['data'], meta=['name'])
df.date = pd.to_datetime(df.date)
df_clicks = df[df.name == 'Clicks'].drop('name', axis=1).rename(columns={'value': 'Clicks'})
df_rate = df[df.name == 'Rate'].drop('name', axis=1).rename(columns={'value': 'Rate'})
df_final = df_clicks.merge(df_rate, how='outer', sort=True)
df_final.set_index('date', drop=True, inplace=True)
unexpected data:
2019-06-03: a rate with no clicks
2019-08-09: clicks, but no rate
Plot it:
df_final.plot(kind='bar', logy=True)
plt.show()
Suggested new json format:
data = {'from_cache': True,
'results': [{'date': '2019/06/01', 'Clicks': 0, 'Rate': 0},
{'date': '2019/06/02', 'Clicks': 0, 'Rate': 0},
{'date': '2019/06/03', 'Clicks': 0, 'Rate': 1.0590561064390611},
{'date': '2019/08/09', 'Clicks': 7087, 'Rate': 0},
{'date': '2019/08/10', 'Clicks': 0, 'Rate': 0},
{'date': '2019/08/11', 'Clicks': 15, 'Rate': 1.8610421836228286},
{'date': '2019/08/12', 'Clicks': 14177, 'Rate': 6.191613785151832},
{'date': '2019/08/13', 'Clicks': 0, 'Rate': 0}]}
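With the suggested flat format, loading becomes a one-liner; a minimal sketch:

```python
import pandas as pd

# Trimmed copy of the suggested flat format.
data = {'from_cache': True,
        'results': [{'date': '2019/06/01', 'Clicks': 0, 'Rate': 0},
                    {'date': '2019/08/11', 'Clicks': 15, 'Rate': 1.8610421836228286},
                    {'date': '2019/08/12', 'Clicks': 14177, 'Rate': 6.191613785151832}]}

# Each dict becomes a row; 'date' becomes the index directly.
df = pd.DataFrame(data['results']).set_index('date')
```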

From dict to pandas with nested dictionaries

I have a dictionary like this:
{12: {'Soccer': {'value': 31, 'year': 2013}},
23: {'Volley': {'value': 24, 'year': 2012},'Yoga': {'value': 3, 'year': 2014}},
39: {'Baseball': {'value': 2, 'year': 2014},'basket': {'value': 4, 'year': 2012}}}
and I would like to have a dataframe like this:
index column
12 {'Soccer': {'value': 31, 'year': 2013}}
23 {'Volley': {'value': 24, 'year': 2012},'Yoga': {'value': 3, 'year': 2014}}
39 {'Baseball': {'value': 2, 'year': 2014},'basket': {'value': 4, 'year': 2012}}
with each nested dictionary placed in a single column, and the row given by the key of the outer dictionary. When I use from_dict with the orient parameter set to index, it treats the keys of the nested dictionaries as column labels and makes a wide DataFrame instead of a single column...
Thanks a lot
Use:
df = pd.DataFrame({'column':d})
Or:
df = pd.Series(d).to_frame('column')
print (df)
column
12 {'Soccer': {'year': 2013, 'value': 31}}
23 {'Volley': {'year': 2012, 'value': 24}, 'Yoga'...
39 {'Baseball': {'year': 2014, 'value': 2}, 'bask...
In [65]: pd.DataFrame({'column': list(d.values())}, index=list(d.keys()))
Out[65]:
                                               column
12            {'Soccer': {'value': 31, 'year': 2013}}
23  {'Volley': {'value': 24, 'year': 2012}, 'Yoga'...
39  {'Baseball': {'value': 2, 'year': 2014}, 'bask...
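A quick check that the single-column form keeps each nested dict intact (a sketch using a trimmed copy of the dictionary above):

```python
import pandas as pd

d = {12: {'Soccer': {'value': 31, 'year': 2013}},
     23: {'Volley': {'value': 24, 'year': 2012}, 'Yoga': {'value': 3, 'year': 2014}}}

# One column; each cell holds the whole nested dict, untouched.
df = pd.Series(d).to_frame('column')
```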

How to aggregate a particular property value grouped by another property in a list

I have a list which represents the item purchases of a customer:
purchases = [
{
'id': 1, 'product': 'Item 1', 'price': 12.4, 'qty' : 4
},
{
'id': 1, 'product': 'Item 1', 'price': 12.4, 'qty' : 8
},
{
'id': 2, 'product': 'Item 2', 'price': 7.5, 'qty': 10
},
{
'id': 3, 'product': 'Item 3', 'price': 18, 'qty': 7
}
]
Now I want output that returns the distinct products with an aggregated qty:
result = [
{
'id': 1, 'product': 'Item 1', 'price': 12.4, 'qty' : 12 # 8 + 4
},
{
'id': 2, 'product': 'Item 2', 'price': 7.5, 'qty': 10
},
{
'id': 3, 'product': 'Item 3', 'price': 18, 'qty': 7
}
]
The answers here never made sense to me:
How to sum dict elements
In pandas it is simple: groupby with aggregate, then to_dict:
import pandas as pd
df = pd.DataFrame(purchases)
print (df)
id price product qty
0 1 12.4 Item 1 4
1 1 12.4 Item 1 8
2 2 7.5 Item 2 10
3 3 18.0 Item 3 7
print (df.groupby('product', as_index=False)
.agg({'id':'first','price':'first','qty':'sum'})
.to_dict(orient='records'))
[{'qty': 12, 'product': 'Item 1', 'price': 12.4, 'id': 1},
{'qty': 10, 'product': 'Item 2', 'price': 7.5, 'id': 2},
{'qty': 7, 'product': 'Item 3', 'price': 18.0, 'id': 3}]
It is also possible to group by all 3 elements:
print (df.groupby(['id','product', 'price'], as_index=False)['qty'].sum()
.to_dict(orient='records'))
[{'qty': 12, 'product': 'Item 1', 'id': 1, 'price': 12.4},
{'qty': 10, 'product': 'Item 2', 'id': 2, 'price': 7.5},
{'qty': 7, 'product': 'Item 3', 'id': 3, 'price': 18.0}]
from itertools import groupby
from operator import itemgetter

grouper = itemgetter("id", "product", "price")
result = []
for key, grp in groupby(sorted(purchases, key=grouper), grouper):
    temp_dict = dict(zip(["id", "product", "price"], key))
    temp_dict["qty"] = sum(item["qty"] for item in grp)
    result.append(temp_dict)
print(result)
[{'qty': 12, 'product': 'Item 1', 'id': 1, 'price': 12.4},
{'qty': 10, 'product': 'Item 2', 'id': 2, 'price': 7.5},
{'qty': 7, 'product': 'Item 3', 'id': 3, 'price': 18}]
EDIT by comment:
purchases = [
{
'id': 1, 'product': { 'id': 1, 'name': 'item 1' }, 'price': 12.4, 'qty' : 4
},
{
'id': 1, 'product': { 'id': 1, 'name': 'item 2' }, 'price': 12.4, 'qty' : 8
},
{
'id': 2, 'product':{ 'id': 2, 'name': 'item 3' }, 'price': 7.5, 'qty': 10
},
{
'id': 3, 'product': { 'id': 3, 'name': 'item 4' }, 'price': 18, 'qty': 7
}
]
from pandas.io.json import json_normalize
df = json_normalize(purchases)
print (df)
id price product.id product.name qty
0 1 12.4 1 item 1 4
1 1 12.4 1 item 2 8
2 2 7.5 2 item 3 10
3 3 18.0 3 item 4 7
print (df.groupby(['id','product.id', 'price'], as_index=False)['qty'].sum()
.to_dict(orient='records'))
[{'qty': 12.0, 'price': 12.4, 'id': 1.0, 'product.id': 1.0},
{'qty': 10.0, 'price': 7.5, 'id': 2.0, 'product.id': 2.0},
{'qty': 7.0, 'price': 18.0, 'id': 3.0, 'product.id': 3.0}]
Another solution, not the most elegant, but easier to understand:
from collections import Counter

c = Counter()
some = [((x['id'], x['product'], x['price']), x['qty']) for x in purchases]
for x in some:
    c[x[0]] += x[1]
[{'id': k[0], 'product': k[1], 'price': k[2], 'qty': v} for k, v in c.items()]
I measured this solution against #jezrael's groupby solution: 100000 loops, best of 3: 9.03 µs per loop, versus #jezrael's 100000 loops, best of 3: 12.2 µs per loop.
