I have data similar to the below:
Id Car Code ShowTime
1 Honda A 10/18/2017 14:45
2 Honda A 10/18/2017 17:10
3 Honda C 10/18/2017 19:35
4 Toyota B 10/18/2017 12:20
5 Toyota B 10/18/2017 14:45
My code below returns multiple rows in the output when I include Id, which is unique:
all_car_schedules = db.session.query(Schedules.id, Schedules.code,
Car.carname, Schedules.showtime) \
.filter(Schedules.id == Car.id)
df = pd.read_sql(all_car_schedules.statement, db.session.bind)
df[['show_date', 'start_times', 'median']] = df.showtime.str.split(' ', expand=True)
df['start_times'] = df['start_times'] + df['median']
df.drop('screening', axis=1, inplace=True)
df.drop('median', axis=1, inplace=True)
df_grp = df.groupby(['id', 'code', 'carname'])
df_grp_time_stacked = df_grp['start_times'].apply(list).reset_index()
df_grp_time_stacked['start_times'] = df_grp_time_stacked['start_times'].apply(lambda x: x[0] if (len(x) == 1) else x)
return_to_dict = df_grp_time_stacked.to_dict(orient='records')
The code above returns multiple rows, when the expected output should be:
"data":{
'id': '1',
'schedule': {
'car': 'Honda',
'show_date': '10/18/2017',
'time_available': [
'14:45',
'17:10',
],
'code': 'A'
}
},{
'id': '3',
'schedule': {
'car': 'Honda',
'show_date': '10/18/2017',
'time_available': [
'19:35'
],
'code': 'C'
}
},{
'id': '4',
'schedule': {
'car': 'Toyota',
'show_date': '10/18/2017',
'time_available': [
'12:20',
'14:45'
],
'code': 'B'
}
}
I am also using sqlite3 as the database. I am not sure if the query needs to change. Please let me know your thoughts and help me with this. Thank you so much.
You could use the groupby() function combined with the list option:
df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'Car': ['Honda', 'Honda', 'Honda', 'Toyota', 'Toyota'],
                   'Code': ['A', 'A', 'C', 'B', 'B'],
                   'show date': ['10/18/2017', '10/18/2017', '10/18/2017',
                                 '10/18/2017', '10/18/2017'],
                   'start_times': ['14:45', '17:10', '19:35', '12:20', '14:45']})
df.groupby(['Car', 'Code', 'show date'])['start_times'].apply(list)
Output:
                          start_times
Car    Code  show date
Honda  A     10/18/2017  [14:45, 17:10]
       C     10/18/2017         [19:35]
Toyota B     10/18/2017  [12:20, 14:45]
If you want to keep the first Id, add a 'first' aggregation for the Id column, like so:
df.groupby(['Car', 'Code', 'show date']).agg({'start_times' : list, 'Id' : 'first'})
# Output
                          start_times  Id
Car    Code  show date
Honda  A     10/18/2017  [14:45, 17:10]   1
       C     10/18/2017         [19:35]   3
Toyota B     10/18/2017  [12:20, 14:45]   4
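If you then need records like the expected output, a minimal follow-up sketch (using the example frame above; building the nested 'schedule' sub-dicts is left out):
grouped = (df.groupby(['Car', 'Code', 'show date'])
             .agg(time_available=('start_times', list), Id=('Id', 'first'))
             .reset_index())
records = grouped.to_dict(orient='records')
# [{'Car': 'Honda', 'Code': 'A', 'show date': '10/18/2017',
#   'time_available': ['14:45', '17:10'], 'Id': 1}, ...]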
Related
If I have a dataframe with id and date columns and would like to filter based on id and date, how can I do it if I have many dates and ids to filter on?
df = pd.DataFrame([
{'id': 'thing 1', 'date': '2016-01-01', 'quantity': 1 },
{'id': 'thing 1', 'date': '2016-02-01', 'quantity': 1 },
{'id': 'thing 1', 'date': '2016-09-01', 'quantity': 1 },
{'id': 'thing 1', 'date': '2016-10-01', 'quantity': 1 },
{'id': 'thing 2', 'date': '2017-01-01', 'quantity': 2 },
{'id': 'thing 2', 'date': '2017-02-01', 'quantity': 2 },
{'id': 'thing 2', 'date': '2017-02-11', 'quantity': 2 },
{'id': 'thing 2', 'date': '2017-09-01', 'quantity': 2 },
{'id': 'thing 2', 'date': '2017-10-01', 'quantity': 2 },
])
df.date = pd.to_datetime(df.date, format="%Y-%m-%d")
date_dict = {'thing1':'2016-02-01',
'thing2': '2017-09-01'}
If I had just two I could hardcode it like this:
df.loc[((df['id']=='thing 1') & (df['date']<='2016-02-01')) | ((df['id']=='thing 2') & (df['date']<='2017-09-01'))]
However, if I have thousands of different ids and thousands of dates, how can I do it efficiently?
Thank you,
Sam
You can create a Series from the dictionary, merge it onto df, and query the rows where the date is less than or equal to the date in your dictionary.
res = (
    df.merge(pd.Series(date_dict, name='dt_max'),
             left_on='id', right_index=True, how='left')
      .query('date <= dt_max')[df.columns]
)
print(res)
# id date quantity
# 0 thing 1 2016-01-01 1
# 1 thing 1 2016-02-01 1
# 4 thing 2 2017-01-01 2
# 5 thing 2 2017-02-01 2
# 6 thing 2 2017-02-11 2
# 7 thing 2 2017-09-01 2
Note: make sure your dictionary keys are the same as the ids (you currently have a typo in them).
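An equivalent sketch without the merge (assuming the dictionary keys are fixed to match the ids, i.e. 'thing 1' and 'thing 2') maps the cutoff dates onto the frame and filters with a boolean mask:
cutoffs = pd.to_datetime(df['id'].map({'thing 1': '2016-02-01',
                                       'thing 2': '2017-09-01'}))
res = df[df['date'] <= cutoffs]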
I'm trying to transform a df like this into a dictionary with multiple nested keys.
import pandas as pd
import datetime
columns = ['country', 'city', 'from_date', 'to_date', 'sales']
data = [['UK', 'London', datetime.date(2021, 8, 26), datetime.date(2099, 5, 5), 2500],
        ['Mexico', 'Mexico City', datetime.date(2011, 3, 3), datetime.date(2012, 4, 5), 5670],
        ['Mexico', 'Mexico City', datetime.date(2014, 3, 3), datetime.date(2017, 4, 5), 5680]]
df = pd.DataFrame(data, columns=columns)
df
country city from_date to_date sales
0 UK London 2021-08-26 2099-05-05 2500
1 Mexico Mexico City 2011-03-03 2012-04-05 5670
2 Mexico Mexico City 2014-03-03 2017-04-05 5680
Result # 1 I'm looking for:
{'Mexico':
{'Mexico City':
[
{'from_date': 2011-03-03, 'to_date': 2012-04-05, 'sales': 5670},
{'from_date': 2014-03-03, 'to_date': 2017-04-05, 'sales': 5680}
]},
'UK':
{'London':
[
{'from_date': 2021-08-26, 'to_date': 2099-05-05, 'sales': 2500}
]},
}
Or Result #2:
{'Mexico':
{'Mexico City':
{2011-03-03: 5670, # from_date: sales
2014-03-03: 5680} # from_date: sales
},
'UK':
{'London':
{2021-08-26: 2500} # from_date: sales
},
}
I don't know how to get result #1; as for result #2, I've tried this:
df.groupby(['country', 'city', 'from_date'])['sales'].apply(float).to_dict()
{('Mexico', 'Mexico City', Timestamp('2011-03-03 00:00:00')): 5670.0,
 ('Mexico', 'Mexico City', Timestamp('2014-03-03 00:00:00')): 5680.0,
 ('UK', 'London', Timestamp('2021-08-26 00:00:00')): 2500.0}
BUT I need to be able to get from_date as a separate key because I will be using it to compare to another date.
Ideally, I'd like to learn how to get both results but any help is appreciated!
You can create a MultiIndex Series with a lambda function in GroupBy.apply that calls DataFrame.to_dict:
df['from_date'] = pd.to_datetime(df['from_date']).dt.strftime('%Y-%m-%d')
df['to_date'] = pd.to_datetime(df['to_date']).dt.strftime('%Y-%m-%d')
f = lambda x: x.to_dict('records')
s = df.groupby(['country', 'city'])[['from_date','to_date','sales']].apply(f)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print (d)
{
'Mexico': {
'Mexico City': [{
'from_date': '2011-03-03',
'to_date': '2012-04-05',
'sales': 5670
},
{
'from_date': '2014-03-03',
'to_date': '2017-04-05',
'sales': 5680
}
]
},
'UK': {
'London': [{
'from_date': '2021-08-26',
'to_date': '2099-05-05',
'sales': 2500
}]
}
}
For the second result only the lambda function is changed:
f = lambda x: x.set_index('from_date')['sales'].to_dict()
s2 = df.groupby(['country', 'city']).apply(f)
print (s2)
country city
Mexico Mexico City {'2011-03-03': 5670, '2014-03-03': 5680}
UK London {'2021-08-26': 2500}
dtype: object
d2 = {level: s2.xs(level).to_dict() for level in s2.index.levels[0]}
print (d2)
{'Mexico': {'Mexico City': {'2011-03-03': 5670, '2014-03-03': 5680}},
'UK': {'London': {'2021-08-26': 2500}}}
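As a usage sketch for the comparison mentioned in the question (assuming the d2 structure above, which keeps the dates as strings), you can parse the keys back to timestamps when comparing:
cutoff = pd.Timestamp('2013-01-01')   # hypothetical comparison date
mexico_city = d2['Mexico']['Mexico City']
before_cutoff = {k: v for k, v in mexico_city.items() if pd.Timestamp(k) <= cutoff}
print(before_cutoff)
# {'2011-03-03': 5670}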
For example, I have multiple locations under the Location column, and I want to add group numbers within each location. But the number of groups differs between locations.
e.g. df1
Location
Chicago
Minneapolis
Dallas
.
.
.
and df2
Location times
Chicago 2
Minneapolis 5
Dallas 1
. .
. .
. .
What I want to get is:
Location Group
Chicago 1
Chicago 2
Minneapolis 1
Minneapolis 2
Minneapolis 3
Minneapolis 4
Minneapolis 5
Dallas 1
.
.
.
What I have now repeats the same number of groups for every location: 17 groups within each location. But I just realized the number of groups differs between locations, and I don't know what to do next.
filled_results['location'] = results['location'].unique()
filled_results['times'] = 17
filled_results = filled_results.loc[filled_results.index.repeat(filled_results.times)]
v = pd.Series(range(1, 18))
filled_results['group'] = np.tile(v, len(filled_results) // len(v) + 1)[:len(filled_results)]
filled_results = filled_results.drop(columns=['times'])
I was thinking about a for loop, but I don't know how to achieve that: for each unique location in df1, assign group numbers from 1 up to the number of groups given in df2.
I think I found a solution myself. It's straightforward if you think of it as adding an index within each group. Here's the solution:
df = pd.DataFrame()
df['location'] = df1['location'].unique()
df = pd.merge(df,
df2,
on = 'location',
how = 'left' )
df = df.loc[df.index.repeat(df.times)]
df["Group"] = df.groupby("location")["times"].rank(method="first", ascending=True)
df["Group"] = df["Group"].astype(int)
df = df.drop(columns=['times'])
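A shorter equivalent for the numbering step (a sketch, assuming the same merged and repeated frame as above) uses GroupBy.cumcount instead of rank:
df['Group'] = df.groupby('location').cumcount() + 1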
You can check out this code.
data = [
{ 'name': 'Chicago', 'c': 2 },
{ 'name': 'Minneapolis', 'c': 5 },
{ 'name': 'Dallas', 'c': 1 }
]
result = []
for location in data:
    for i in range(0, location['c']):
        result.append({'name': location['name'], 'group': i + 1})
result will be:
[{'group': 1, 'name': 'Chicago'}, {'group': 2, 'name': 'Chicago'}, {'group': 1, 'name': 'Minneapolis'}, {'group': 2, 'name': 'Minneapolis'}, {'group': 3, 'name': 'Minneapolis'}, {'group': 4, 'name': 'Minneapolis'}, {'group': 5, 'name': 'Minneapolis'}, {'group': 1, 'name': 'Dallas'}]
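If you need this back in a DataFrame rather than a list of dicts (assuming pandas is imported as pd), you can pass the list straight to the constructor:
groups_df = pd.DataFrame(result)   # columns: name, group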
I have a column "data" which has JSON objects as values. I would like to split them up.
source = {'_id': ['SE-DATA-BB3A', 'SE-DATA-BB3E', 'SE-DATA-BB3F'],
          'pi': ['BB3A_CAP_BMLS', 'BB3E_CAP_BMLS', 'BB3F_CAP_PAS'],
          'Datetime': ['190725-122500', '190725-122500', '190725-122500'],
          'data': [
              {'bb3a_bmls': [{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3, 'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'}, {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae'}]}]},
              {'bb3b_bmls': [{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4, 'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'}, {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae'}]}]},
              {'bb3c_bmls': [{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6, 'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'}, {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae'}]}]}
          ]}
input_df = pd.DataFrame(source)
My input_df is as below:
I'm expecting the output_df as below:
I managed to get the columns volume_id, volume_name, volume_state, name, id, state, nodes using the method below.
input_df = input_df['data'].apply(pd.Series)
which expands the data column into one column per server key (bb3a_bmls, bb3b_bmls, bb3c_bmls).
Test_df=pd.concat([json_normalize(input_df['bb3a_bmls'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in input_df.index if isinstance(input_df['bb3a_bmls'][key],list)]).reset_index(drop=True)
which gives the result for one "SERVER" only (bb3a_bmls).
Now I don't know how to get the parent columns "_id", "pi", "Datetime" back.
The idea is to loop over the nested lists and dicts and build a list of dictionaries to pass to the DataFrame constructor:
out = []
zipped = zip(source['_id'], source['pi'], source['Datetime'], source['data'])
for a, b, c, d in zipped:
    for k1, v1 in d.items():
        for e in v1:
            # get all values of the dict, excluding 'volumes'
            di = {k2: v2 for k2, v2 in e.items() if k2 != 'volumes'}
            # for each dict in volumes, prefix the keys with volume_
            for f in e['volumes']:
                di1 = {f'volume_{k3}': v3 for k3, v3 in f.items()}
                # dict with the parent values
                di2 = {'_id': a, 'pi': b, 'Datetime': c, 'SERVER': k1}
                # append the merged dictionaries to the list
                out.append({**di2, **di1, **di})
df = pd.DataFrame(out)
print (df)
_id pi Datetime SERVER volume_state \
0 SE-DATA-BB3A BB3A_CAP_BMLS 190725-122500 bb3a_bmls available
1 SE-DATA-BB3A BB3A_CAP_BMLS 190725-122500 bb3a_bmls available
2 SE-DATA-BB3E BB3E_CAP_BMLS 190725-122500 bb3b_bmls unavailable
3 SE-DATA-BB3E BB3E_CAP_BMLS 190725-122500 bb3b_bmls unavailable
4 SE-DATA-BB3F BB3F_CAP_PAS 190725-122500 bb3c_bmls unavailable
5 SE-DATA-BB3F BB3F_CAP_PAS 190725-122500 bb3c_bmls unavailable
volume_id volume_name name id state nodes
0 330172 q_-4144d4e WAG 01 105F available 3
1 275192 p_3089d821ae WAG 01 105F available 3
2 830172 w_-4144d4e FEC 01 382E available 4
3 223192 g_3089d821ae FEC 01 382E available 4
4 930172 e_-4144d4e ASD 01 303F available 6
5 245192 h_3089d821ae ASD 01 303F available 6
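For comparison, a hedged alternative sketch that uses pandas reshaping (DataFrame.explode plus pd.json_normalize, so it assumes pandas 1.0+ and the corrected 'volumes' key) instead of the plain loop:
tmp = input_df.copy()
tmp['SERVER'] = tmp['data'].map(lambda d: next(iter(d)))              # e.g. 'bb3a_bmls'
tmp['server'] = tmp['data'].map(lambda d: next(iter(d.values()))[0])  # the single server dict
servers = pd.json_normalize(tmp['server'].tolist())                   # name, id, state, nodes, volumes
tmp = pd.concat([tmp[['_id', 'pi', 'Datetime', 'SERVER']], servers], axis=1)
tmp = tmp.explode('volumes').reset_index(drop=True)
vols = pd.json_normalize(tmp['volumes'].tolist()).add_prefix('volume_')
alt_df = pd.concat([tmp.drop(columns='volumes'), vols], axis=1)
# same rows as df above, columns in a slightly different order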
I have a dataframe with lists (of dicts) as column values. My intention is to normalize the entire column (all rows). I found a way to normalize a single row; however, I'm unable to apply the same function to the entire dataframe or column.
data = {'COLUMN': [
    [{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3, 'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'}, {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae'}]}],
    [{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4, 'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'}, {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae'}]}],
    [{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6, 'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'}, {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae'}]}]
]}
source_df = pd.DataFrame(data)
source_df looks like below:
As per https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html, I managed to get the output for one row.
Code to apply for one row:
Target_df = json_normalize(source_df['COLUMN'][0], 'volumes', ['name','id','state','nodes'], record_prefix='volume_')
Output for the above code:
I would like to know how we can achieve the desired output for the entire column.
Expected output:
EDIT:
@lostCode, below is the input with NaN values and empty lists.
You can do:
Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index]).reset_index(drop=True)
Output:
volume_state volume_id volume_name name id state nodes
0 available 330172 q_-4144d4e WAG 01 105F available 3
1 available 275192 p_3089d821ae WAG 01 105F available 3
2 unavailable 830172 w_-4144d4e FEC 01 382E available 4
3 unavailable 223192 g_3089d821ae FEC 01 382E available 4
4 unavailable 930172 e_-4144d4e ASD 01 303F available 6
5 unavailable 245192 h_3089d821ae ASD 01 303F available 6
concat is used to concatenate a list of dataframes; here, the list generated by running json_normalize on every row of source_df is concatenated back into one frame.
You can check the type of each value in source_df['COLUMN'] to skip entries that are not lists (for example NaN):
Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index if isinstance(source_df['COLUMN'][key],list)]).reset_index(drop=True)
Target_df=source_df.apply(json_normalize)
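If NaN values or empty lists can appear in the column (as in the EDIT), a hedged alternative sketch is to explode the column first and normalize the remaining dicts in one call (this assumes a pandas version with Series.explode and the same json_normalize import used above):
s = source_df['COLUMN'].explode().dropna()   # one server dict per row; NaN and empty lists fall out
Target_df = json_normalize(s.tolist(), 'volumes',
                           ['name', 'id', 'state', 'nodes'],
                           record_prefix='volume_')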