I'm trying to transform a df like this into a dictionary with multiple nested keys.
import pandas as pd
import datetime
columns = ['country', 'city', 'from_date', 'to_date', 'sales']
data = [['UK', 'London', datetime.date(2021, 8, 26), datetime.date(2099, 5, 5), 2500],
        ['Mexico', 'Mexico City', datetime.date(2011, 3, 3), datetime.date(2012, 4, 5), 5670],
        ['Mexico', 'Mexico City', datetime.date(2014, 3, 3), datetime.date(2017, 4, 5), 5680]]
df = pd.DataFrame(data, columns=columns)
df
country city from_date to_date sales
0 UK London 2021-08-26 2099-05-05 2500
1 Mexico Mexico City 2011-03-03 2012-04-05 5670
2 Mexico Mexico City 2014-03-03 2017-04-05 5680
Result #1 I'm looking for:
{'Mexico':
{'Mexico City':
[
{'from_date': '2011-03-03', 'to_date': '2012-04-05', 'sales': 5670},
{'from_date': '2014-03-03', 'to_date': '2017-04-05', 'sales': 5680}
]},
'UK':
{'London':
[
{'from_date': '2021-08-26', 'to_date': '2099-05-05', 'sales': 2500}
]},
}
Or Result #2:
{'Mexico':
{'Mexico City':
{2011-03-03: 5670, # from_date: sales
2014-03-03: 5680} # from_date: sales
},
'UK':
{'London':
{2021-08-26: 2500} # from_date: sales
},
}
I don't know how to get result #1; as for result #2, I've tried this:
df.groupby(['country', 'city', 'from_date'])['sales'].apply(float).to_dict()
{('Mexico', 'Mexico City', Timestamp('2011-03-03 00:00:00')): 5670.0,
 ('Mexico', 'Mexico City', Timestamp('2014-03-03 00:00:00')): 5680.0,
 ('UK', 'London', Timestamp('2021-08-26 00:00:00')): 2500.0}
BUT I need to be able to get from_date as a separate key because I will be using it to compare to another date.
Ideally, I'd like to learn how to get both results but any help is appreciated!
You can create a MultiIndex Series with a lambda function in GroupBy.apply, then convert it with DataFrame.to_dict:
# convert the dates to plain 'YYYY-MM-DD' strings
df['from_date'] = pd.to_datetime(df['from_date']).dt.strftime('%Y-%m-%d')
df['to_date'] = pd.to_datetime(df['to_date']).dt.strftime('%Y-%m-%d')
# build one list of row-dicts per (country, city) group
f = lambda x: x.to_dict('records')
s = df.groupby(['country', 'city'])[['from_date','to_date','sales']].apply(f)
# turn the outer index level into the top-level dict keys
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
{
'Mexico': {
'Mexico City': [{
'from_date': '2011-03-03',
'to_date': '2012-04-05',
'sales': 5670
},
{
'from_date': '2014-03-03',
'to_date': '2017-04-05',
'sales': 5680
}
]
},
'UK': {
'London': [{
'from_date': '2021-08-26',
'to_date': '2099-05-05',
'sales': 2500
}]
}
}
For the second result, only the lambda function changes:
# map from_date -> sales within each (country, city) group
f = lambda x: x.set_index('from_date')['sales'].to_dict()
s2 = df.groupby(['country', 'city']).apply(f)
print(s2)
country city
Mexico Mexico City {'2011-03-03': 5670, '2014-03-03': 5680}
UK London {'2021-08-26': 2500}
dtype: object
d2 = {level: s2.xs(level).to_dict() for level in s2.index.levels[0]}
print(d2)
{'Mexico': {'Mexico City': {'2011-03-03': 5670, '2014-03-03': 5680}},
'UK': {'London': {'2021-08-26': 2500}}}
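Since the whole point is comparing from_date against another date, note that after the strftime above the keys are plain 'YYYY-MM-DD' strings. A small usage sketch, assuming the d2 built above and Python 3.7+ for date.fromisoformat:
import datetime
cutoff = datetime.date(2013, 1, 1)
# keys of d2['Mexico']['Mexico City'] are 'YYYY-MM-DD' strings
for from_date, sales in d2['Mexico']['Mexico City'].items():
    if datetime.date.fromisoformat(from_date) < cutoff:
        print(from_date, sales)  # 2011-03-03 5670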
Related
I have the following dictionary
test={'data': [{'name': 'john',
'insights': {'data': [{'id': '123',
'person_id': '456',
'date_start': '2022-12-31',
'date_stop': '2023-01-29',
'impressions': '4070',
'spend': '36.14'}],
'paging': {'cursors': {'before': 'MAZDZD', 'after': 'MAZDZD'}}},
'id': '978'}]}
I want to create a pandas dataframe where the columns are the name, date_start, date_stop, impressions, and spend.
I tried doing this,
data = pd.DataFrame()
data = data.append(test['data'])
But insights now becomes a column, like so:
name insights id
john {'data': [{'id': '123', 'person_id': '456', 'd... 978
How do I get the impressions and the spend from the insights column? When I tried
test['data']['insights']
I got an error
list indices must be integers or slices, not str
Use pd.json_normalize:
>>> pd.json_normalize(test["data"], ['insights', 'data'])
id person_id date_start date_stop impressions spend
0 123 456 2022-12-31 2023-01-29 4070 36.14
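If you also want name next to the nested records, json_normalize accepts a meta argument for fields from the outer level; a sketch, assuming the same test dict:
df = pd.json_normalize(test["data"], record_path=["insights", "data"], meta=["name"])
# df now has the record columns plus a trailing 'name' column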
One option is to use pandas.json_normalize with pandas.Series.explode:
df = (
pd.json_normalize(test["data"])
['insights.data']
.explode()
.pipe(lambda s: pd.DataFrame(s.tolist()))
)
Output:
print(df)
id person_id date_start date_stop impressions spend
0 123 456 2022-12-31 2023-01-29 4070 36.14
Try:
import pandas as pd
test = {
"data": [
{
"name": "john",
"insights": {
"data": [
{
"id": "123",
"person_id": "456",
"date_start": "2022-12-31",
"date_stop": "2023-01-29",
"impressions": "4070",
"spend": "36.14",
}
],
"paging": {"cursors": {"before": "MAZDZD", "after": "MAZDZD"}},
},
"id": "978",
}
]
}
df = pd.DataFrame(
[
{
"name": d["name"],
"date_start": dd["date_start"],
"date_stop": dd["date_stop"],
"impressions": dd["impressions"],
}
for d in test["data"]
for dd in d["insights"]["data"]
]
)
print(df)
Prints:
name date_start date_stop impressions
0 john 2022-12-31 2023-01-29 4070
Here is an alternative approach:
df = pd.DataFrame(test["data"][0]["insights"]["data"],
columns=["name", "date_start", "date_stop", "impressions", "spend"])
df["name"] = test["data"][0]["name"]
print(df)
name date_start date_stop impressions spend
0 john 2022-12-31 2023-01-29 4070 36.14
That's a seriously messy dataframe; after the append, you can dig the values out of the insights column like this:
data['insights'][0]['data'][0]['impressions']
data['insights'][0]['data'][0]['spend']
I have multiple locations under the location column, and I want to add group numbers within each location. But the number of groups differs between locations.
e.g. df1
Location
Chicago
Minneapolis
Dallas
.
.
.
and df2
Location times
Chicago 2
Minneapolis 5
Dallas 1
. .
. .
. .
What I want to get is:
Location Group
Chicago 1
Chicago 2
Minneapolis 1
Minneapolis 2
Minneapolis 3
Minneapolis 4
Minneapolis 5
Dallas 1
.
.
.
What I have now repeats the same number of groups for every location: 17 groups within each location. But I just realized there will be a different number of groups per location, and I don't know what to do next.
filled_results['location'] = results['location'].unique()
filled_results['times'] = 17
filled_results = filled_results.loc[filled_results.index.repeat(filled_results.times)]
v = pd.Series(range(1, 18))
filled_results['group'] = np.tile(v, len(filled_results) // len(v) + 1)[:len(filled_results)]
filled_results = filled_results.drop(columns=['times'])
I was thinking about a for loop, but I don't know how to achieve that: for each unique location within df1, assign group numbers 1 to x based on the number of groups in df2.
I think I found a solution myself. It's very easy if you treat this as adding an index within each group. Here's the solution:
df = pd.DataFrame()
df['location'] = df1['location'].unique()
# attach the number of groups for each location
df = pd.merge(df, df2, on='location', how='left')
# repeat each row 'times' times, then number the repeats within each location
df = df.loc[df.index.repeat(df.times)]
df["Group"] = df.groupby("location")["times"].rank(method="first", ascending=True)
df["Group"] = df["Group"].astype(int)
df = df.drop(columns=['times'])
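A shorter equivalent is to skip the merge-and-rank entirely and use Index.repeat with groupby.cumcount; a minimal sketch, assuming df2 holds the location/times pairs:
import pandas as pd
df2 = pd.DataFrame({'location': ['Chicago', 'Minneapolis', 'Dallas'],
                    'times': [2, 5, 1]})
# repeat each location row by its 'times' count...
out = df2.loc[df2.index.repeat(df2['times']), ['location']].reset_index(drop=True)
# ...then number the repeats within each location, starting at 1
out['Group'] = out.groupby('location').cumcount() + 1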
You can check out this code.
data = [
{ 'name': 'Chicago', 'c': 2 },
{ 'name': 'Minneapolis', 'c': 5 },
{ 'name': 'Dallas', 'c': 1 }
]
result = []
for location in data:
for i in range(0, location['c']):
result.append({ 'name': location['name'], 'group': i+1 })
result will be:
[{'group': 1, 'name': 'Chicago'}, {'group': 2, 'name': 'Chicago'}, {'group': 1, 'name': 'Minneapolis'}, {'group': 2, 'name': 'Minneapolis'}, {'group': 3, 'name': 'Minneapolis'}, {'group': 4, 'name': 'Minneapolis'}, {'group': 5, 'name': 'Minneapolis'}, {'group': 1, 'name': 'Dallas'}]
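If you need that as a DataFrame rather than a list of dicts, it drops straight into the constructor:
import pandas as pd
df_result = pd.DataFrame(result)  # columns: name, group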
I have data similar to below:
Id Car Code ShowTime
1 Honda A 10/18/2017 14:45
2 Honda A 10/18/2017 17:10
3 Honda C 10/18/2017 19:35
4 Toyota B 10/18/2017 12:20
5 Toyota B 10/18/2017 14:45
My code below returns multiple rows if I include Id, which is unique:
all_car_schedules = db.session.query(Schedules.id, Schedules.code,
Car.carname, Schedules.showtime) \
.filter(Schedules.id == Car.id)
df = pd.read_sql(all_car_schedules.statement, db.session.bind)
df[['show_date', 'start_times', 'median']] = df.showtime.str.split(' ', expand=True)
df['start_times'] = df['start_times'] + df['median']
df.drop('screening', axis=1, inplace=True)
df.drop('median', axis=1, inplace=True)
df_grp = df.groupby(['id', 'code', 'carname'])
df_grp_time_stacked = df_grp['start_times'].apply(list).reset_index()
df_grp_time_stacked['start_times'] = df_grp_time_stacked['start_times'].apply(lambda x: x[0] if (len(x) == 1) else x)
return_to_dict = df_grp_time_stacked.to_dict(orient='records')
Code above returns multiple rows when the expected output should be:
"data":{
'id': '1',
'schedule': {
'car': 'Honda',
'show_date': '10/18/2017',
'time_available': [
'14:45',
'17:10',
],
'code': 'A'
}
},{
'id': '3',
'schedule': {
'car': 'Honda',
'show_date': '10/18/2017',
'time_available': [
'19:35'
],
'code': 'C'
}
},{
'id': '4',
'schedule': {
'car': 'Toyota',
'show_date': '10/18/2017',
'time_available': [
'12:20',
'14:45'
],
'code': 'B'
}
}
I am also using sqlite3 as the db. I am not sure if there should be a change in the query. Please let me know your thoughts and help me on this. Thank you so much.
You could use the groupby() function combined with the list option:
df = pd.DataFrame({'Id' : [1,2,3,4,5], 'Car': ['Honda', 'Honda', 'Honda', 'Toyota', 'Toyota'],
                   'Code': ['A', 'A', 'C', 'B', 'B'], 'show date': ['10/18/2017', '10/18/2017',
                   '10/18/2017', '10/18/2017', '10/18/2017'],
                   'start_times' : ['14:45', '17:10', '19:35', '12:20', '14:45']})
df.groupby(['Car', 'Code', 'show date'])['start_times'].apply(list)
Output:
                           start_times
Car    Code show date
Honda  A    10/18/2017  [14:45, 17:10]
       C    10/18/2017         [19:35]
Toyota B    10/18/2017  [12:20, 14:45]
If you want to keep the first Id, add a 'first' aggregation for the Id column, like so:
df.groupby(['Car', 'Code', 'show date']).agg({'start_times' : list, 'Id' : 'first'})
# Output
                           start_times  Id
Car    Code show date
Honda  A    10/18/2017  [14:45, 17:10]   1
       C    10/18/2017         [19:35]   3
Toyota B    10/18/2017  [12:20, 14:45]   4
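To get from that grouped frame to the JSON-like structure in the question, one possible sketch (assuming the column names used above; the dict layout mirrors the expected output):
grouped = (df.groupby(['Car', 'Code', 'show date'])
             .agg(start_times=('start_times', list), Id=('Id', 'first'))
             .reset_index())
data = [
    {'id': str(r['Id']),
     'schedule': {'car': r['Car'],
                  'show_date': r['show date'],
                  'time_available': r['start_times'],
                  'code': r['Code']}}
    for r in grouped.to_dict('records')
]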
I have a table where one column is the county name and the other columns are various attributes.
I want to convert this column of county names to fips codes.
I have an intermediary table that shows the fips code for each county.
Here is an example of the data I have (initial and intermediate) and the data I want (final).
initial_df = {
'county': ['REAGAN', 'UPTON', 'HARDEMAN', 'UPTON'],
'values': [508, 364, 26, 870]
}
intermediate_df = {
'county': ['REAGAN', 'HARDEMAN', 'UPTON'],
'fips': [48383, 47069, 48461]
}
final_df = {
'county': ['REAGAN', 'UPTON', 'HARDEMAN', 'UPTON'],
'fips': [48383, 48461, 47069, 48461],
'values': [508, 364, 26, 870]
}
You can use 'merge'.
import pandas as pd
initial_df = {'county': ['REAGAN', 'UPTON', 'HARDEMAN', 'UPTON'],
              'values': [508, 364, 26, 870]}
intermediate_df = {'county': ['REAGAN', 'HARDEMAN', 'UPTON'],
                   'fips': [48383, 47069, 48461]}
final_df = {'county': ['REAGAN', 'UPTON', 'HARDEMAN', 'UPTON'],
            'fips': [48383, 48461, 47069, 48461],
            'values': [508, 364, 26, 870]}
df1=pd.DataFrame(initial_df)
df2=pd.DataFrame(intermediate_df)
df3=df1.merge(df2)
print(df3)
and the output is your final_df.
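Note that merge defaults to an inner join, so any county missing from the intermediate table would silently drop its rows; if you would rather keep them with a NaN fips, a sketch:
df3 = df1.merge(df2, on='county', how='left')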
Here is one way:
initial_df = pd.DataFrame(initial_df)
final_df = initial_df.assign(fips = initial_df['county'].map(dict(zip(*intermediate_df.values()))))
Or:
initial_df = pd.DataFrame(initial_df)
final_df = initial_df.assign(fips = initial_df['county'].map(pd.DataFrame(intermediate_df).set_index('county')['fips']))
Both result in:
>>> final_df
county values fips
0 REAGAN 508 48383
1 UPTON 364 48461
2 HARDEMAN 26 47069
3 UPTON 870 48461
You can take the dictionary from intermediate_df and convert it into a dictionary keyed on the county name with fips as the values. Then use this to map the county field in the initial_df.
mapping = {k: v for k, v in zip(*intermediate_df.values())}
df_final = pd.DataFrame(initial_df)
df_final['fips'] = df_final['county'].map(mapping)
>>> df_final
county values fips
0 REAGAN 508 48383
1 UPTON 364 48461
2 HARDEMAN 26 47069
3 UPTON 870 48461
df:
no fruit price city
1 apple 10 Pune
2 apple 20 Mumbai
3 orange 5 Nagpur
4 orange 7 Delhi
5 Mango 20 Bangalore
6 Mango 15 Chennai
Now I want to get the city name where fruit == 'orange' and price == 5.
df.loc[(df['fruit'] == 'orange') & (df['price'] == 5) , 'city'].iloc[0]
is not working, and gives this error:
IndexError: single positional indexer is out-of-bounds
Versions used: Python 3.5
You could create the masks step-wise and see what they look like:
import pandas as pd
df = pd.DataFrame([{'city': 'Pune', 'fruit': 'apple', 'no': 1, 'price': 10},
                   {'city': 'Mumbai', 'fruit': 'apple', 'no': 2, 'price': 20},
                   {'city': 'Nagpur', 'fruit': 'orange', 'no': 3, 'price': 5},
                   {'city': 'Delhi', 'fruit': 'orange', 'no': 4, 'price': 7},
                   {'city': 'Bangalore', 'fruit': 'Mango', 'no': 5, 'price': 20},
                   {'city': 'Chennai', 'fruit': 'Mango', 'no': 6, 'price': 15}])
m1 = df['fruit'] == 'orange'
m2 = df['price'] == 5
df[m1&m2]['city'].values[0] # 'Nagpur'
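The IndexError in the question means the mask matched no rows, so .iloc[0] / .values[0] had nothing to return; a defensive sketch:
matches = df.loc[(df['fruit'] == 'orange') & (df['price'] == 5), 'city']
city = matches.iloc[0] if not matches.empty else None  # 'Nagpur', or None if no match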
A scalable and programmable solution that utilizes a MultiIndex (advanced indexing with a hierarchical index).
Variables
search_columns=['fruit','price']
search_values=['orange','5']
target_column='city'
Make search columns indexes of the df
df_temp=df.set_index(search_columns)
Use the 'loc' method to get the value
value=df_temp.loc[tuple(search_values),target_column]
The result is a scalar for <=2 search columns, or a pd.Series for >2 search columns.
Snippet:
import pandas as pd
columns = "fruit price city".split()
data = zip(
'apple apple orange orange Mango Mango'.split(),
'10 20 5 7 20 15'.split(),
'Pune Mumbai Nagpur Delhi Bangalore Chennai'.split()
)
df = pd.DataFrame(data=data, columns=columns)
search_columns = ['fruit', 'price']
search_values = ['orange', '5']
target_column = 'city'
df_temp = df.set_index(search_columns)
value = df_temp.loc[tuple(search_values), target_column]
print(value)
result: Nagpur