New dataframe from grouping together two columns - python
I have a dataset that looks like the following.
Region_Name Date Average
London 1990Q1 105
London 1990Q1 118
... ... ...
London 2018Q1 157
I converted the date into quarters and wish to create a new dataframe with the matching quarters and region names grouped together, with the mean average.
What is the best way to accomplish such a task?
I have been looking at the groupby function but keep getting a traceback. For example:

new_df = df.groupby(['Resion_Name','Date']).mean()
dict3 = {'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
         'Date': ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
         'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3 = pd.DataFrame(dict3)

**Now my df3 is as follows:**
Region_Name Date Average
0 London 1990Q1 34
1 Newyork 1990Q1 56
2 London 1990Q2 45
3 Newyork 1990Q2 67
4 London 1991Q1 23
5 London 1991Q1 89
6 Newyork 1991Q2 12
7 Newyork 1992Q2 45
8 Newyork 1993Q1 67
9 Newyork 1993Q1 34
10 London 1994Q1 67
The code looks as follows:

new_df = df3.groupby(['Region_Name','Date'])
new1 = new_df['Average'].transform('mean')
Result of dataframe new1:
print(new1)
0 34.0
1 56.0
2 45.0
3 67.0
4 56.0
5 56.0
6 12.0
7 45.0
8 50.5
9 50.5
10 67.0
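Note that transform returns a column aligned to the original rows. If instead the goal is a new, smaller dataframe with one row per (Region_Name, Date) pair, as the question asks, a plain groupby followed by mean should do it. A minimal sketch using the df3 defined above:

```python
import pandas as pd

dict3 = {'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
         'Date': ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
         'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3 = pd.DataFrame(dict3)

# One row per (Region_Name, Date) group, with the mean of Average.
# as_index=False keeps the group keys as ordinary columns.
new_df = df3.groupby(['Region_Name', 'Date'], as_index=False)['Average'].mean()
print(new_df)
```

The duplicated (London, 1991Q1) rows collapse into a single row with Average 56.0, and (Newyork, 1993Q1) into one with 50.5, matching the transform values above.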
Related
Efficient mean and total aggregation over multiple Pandas DataFrame columns
Suppose I have a DataFrame that looks something like this:

    id country grade category  amount
0    7      fr     a    mango      52
1    5      fr     b   banana      68
2    7      fr     a   banana      73
3    4      it     c    mango      70
4    5      fr     b   banana      99
5    9      uk     a    apple      29
6    3      uk     a    mango      83
7    0      uk     b   banana      59
8    2      it     c    mango      11
9    9      uk     a   banana      91
10   0      uk     b    mango      95
11   8      uk     a    mango      30
12   3      uk     a    mango      82
13   1      it     b   banana      78
14   3      uk     a    apple      76
15   6      it     c    apple      76
16   2      it     c    mango      10
17   1      it     b    mango      30
18   9      uk     a   banana      17
19   2      it     c    mango      58

Where each id belongs to a grade and lives in a country, and spends a certain amount on various fruits (category). Let's say the data covers a whole year. (DataFrame reproducible using the code below.)

import pandas as pd

df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})

I would like to add two columns to this DF. First, I'd like a column giving the mean annual (i.e. total) spend on each category by each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:

id: 4, total: 70
id: 2, total: 11 + 10 + 58 = 79

So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations. The second column I want to add is the same but for the mean annual count for each combination.
Desired output and the best I could come up with: I've managed to populate these two desired columns using the following code:

import math

combos = [[i, j, k] for i in set(df["country"]) for j in set(df["grade"]) for k in set(df["category"])]

for c in combos:
    x = df.loc[(df["country"]==c[0]) & (df["grade"]==c[1]) & (df["category"]==c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0
    c.append(m)
    c.append(k)

temp_grouping = pd.DataFrame(combos, columns=["country","grade","category","mean_totals","mean_counts"])
df = df.merge(temp_grouping, on=["country","grade","category"], how="left")

Which gives the desired output:

    id country grade category  amount  mean_totals  mean_counts
0    7      fr     a    mango      52         52            1
1    5      fr     b   banana      68        167            2
2    7      fr     a   banana      73         73            1
3    4      it     c    mango      70         74.5          2
4    5      fr     b   banana      99        167            2
5    9      uk     a    apple      29         52.5          1
6    3      uk     a    mango      83         97.5          1.5
7    0      uk     b   banana      59         59            1
8    2      it     c    mango      11         74.5          2
9    9      uk     a   banana      91        108            2
10   0      uk     b    mango      95         95            1
11   8      uk     a    mango      30         97.5          1.5
12   3      uk     a    mango      82         97.5          1.5
13   1      it     b   banana      78         78            1
14   3      uk     a    apple      76         52.5          1
15   6      it     c    apple      76         76            1
16   2      it     c    mango      10         74.5          2
17   1      it     b    mango      30         30            1
18   9      uk     a   banana      17        108            2
19   2      it     c    mango      58         74.5          2

The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.
You can create the mean_totals column as follows:

mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum() / x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)

which gives

0    7  fr  a   mango   52   52.0
1    5  fr  b   banana  68  167.0
2    7  fr  a   banana  73   73.0
3    4  it  c   mango   70   74.5
4    5  fr  b   banana  99  167.0
5    9  uk  a   apple   29   52.5
6    3  uk  a   mango   83   97.5
7    0  uk  b   banana  59   59.0
8    2  it  c   mango   11   74.5
9    9  uk  a   banana  91  108.0
10   0  uk  b   mango   95   95.0
11   8  uk  a   mango   30   97.5
12   3  uk  a   mango   82   97.5
13   1  it  b   banana  78   78.0
14   3  uk  a   apple   76   52.5
15   6  it  c   apple   76   76.0
16   2  it  c   mango   10   74.5
17   1  it  b   mango   30   30.0
18   9  uk  a   banana  17  108.0
19   2  it  c   mango   58   74.5
It looks like you need a double groupby. Once for the sum, once for the mean:

out = (df
   .groupby(['country', 'grade', 'category', 'id']).sum()
   .groupby(['country', 'grade', 'category']).mean()
)

output:

                        amount
country grade category
fr      a     banana      73.0
              mango       52.0
        b     banana     167.0
it      b     banana      78.0
              mango       30.0
        c     apple       76.0
              mango       74.5
uk      a     apple       52.5
              banana     108.0
              mango       97.5
        b     banana      59.0
              mango       95.0
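The same double groupby can also produce both of the question's desired columns (mean_totals and mean_counts) and be merged back onto the original rows. A sketch, with the per-id sum and count aggregated in one pass; the intermediate names per_id and means are illustrative, not from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})

# Per-id totals and counts within each country/grade/category group ...
per_id = (df.groupby(['country', 'grade', 'category', 'id'])['amount']
            .agg(mean_totals='sum', mean_counts='count'))

# ... then the mean over the ids of each group.
means = per_id.groupby(['country', 'grade', 'category']).mean().reset_index()

out = df.merge(means, on=['country', 'grade', 'category'], how='left')
print(out)
```

This stays entirely in vectorized groupby/merge operations, so it avoids the per-combination Python loop of the original approach.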
I hope this will work fast. First group and compute the required details, then merge with the existing df:

import pandas as pd

df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})

intermediate_df = (df.groupby(by=['country','grade','category','id'], as_index=False)
                     .agg(int_totals=pd.NamedAgg(column='amount', aggfunc='sum'),
                          int_counts=pd.NamedAgg(column='id', aggfunc='count'))
                     .groupby(by=['country','grade','category'], as_index=False)
                     .agg(mean_totals=pd.NamedAgg(column='int_totals', aggfunc='mean'),
                          mean_counts=pd.NamedAgg(column='int_counts', aggfunc='mean')))

output_df = pd.merge(df, intermediate_df, on=['country','grade','category'], how='left')
print(output_df)
An unwanted level and wrong calculation in a pandas plot
I have the following dataset:

df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12],
                   'city': ['Pau','Pau','Pau','Pau','Pau','Pau','Lyon','Dax','Dax','Lyon','Lyon','Lyon'],
                   'type': ['A','A','A','A','B','B','B','A','B','A','B','B'],
                   'val': [100,90,95,95,90,75,100,70,75,90,95,85]})

    id  city type  val
0    1   Pau    A  100
1    2   Pau    A   90
2    3   Pau    A   95
3    4   Pau    A   95
4    5   Pau    B   90
5    6   Pau    B   75
6    7  Lyon    B  100
7    8   Dax    A   70
8    9   Dax    B   75
9   10  Lyon    A   90
10  11  Lyon    B   95
11  12  Lyon    B   85

And I want to create a plot grouped by the variable city, and get the frequency percentage per type. I have tried this:

df.groupby(['city','type']).agg({'type':'count'}).transform(lambda x: x/x.sum()).unstack().plot()

But I get wrong values per group and an unwanted 'None'. The expected values should be:

type     A    B
city
Dax    .50  .50
Lyon   .33  .66
Pau    .66  .33
Looking at your requirement, you may want crosstab with normalize:

pd.crosstab(df['city'], df['type'], normalize='index').plot()

Where:

print(pd.crosstab(df['city'], df['type'], normalize='index'))

type         A         B
city
Dax   0.500000  0.500000
Lyon  0.250000  0.750000
Pau   0.666667  0.333333
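An equivalent without crosstab, offered as a sketch: value_counts with normalize=True inside a groupby gives the same per-city proportions, and unstack produces the same wide layout:

```python
import pandas as pd

df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12],
                   'city': ['Pau','Pau','Pau','Pau','Pau','Pau','Lyon','Dax','Dax','Lyon','Lyon','Lyon'],
                   'type': ['A','A','A','A','B','B','B','A','B','A','B','B'],
                   'val': [100,90,95,95,90,75,100,70,75,90,95,85]})

# Fraction of each type within every city; unstack turns the
# (city, type) MultiIndex into one row per city, one column per type.
freq = (df.groupby('city')['type']
          .value_counts(normalize=True)
          .unstack(fill_value=0))
print(freq)
# freq.plot() draws the same lines as the crosstab version.
```

Note this confirms the crosstab numbers: Lyon is 0.25/0.75 (one A out of four rows), not the .33/.66 guessed in the question.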
Conditional filling of column based on string
I have a dataset which I have to fill conditionally, or else drop the conditional rows, but I am still unsuccessful.

Idx  Fruits  Days   Name
0    60      20
1    15      85.5
2    10      62     Peter
3    40      90     Maria
4    5       10.2
5    92      66
6    65      87     John
7    50      1      Eric
8    50      0      Maria
9    80      87     John

Now, I have some empty cells. I can fill with fillna or regex, or can drop the empty cells. I want only the first starting cells, up until the strings start, either dropped or filled with ".", like below:

Idx  Fruits  Days   Name
0    60      20     .
1    15      85.5   .
2    10      62     Peter
3    40      90     Maria
4    5       10.2
5    92      66
6    65      87     John
7    50      1      Eric
8    50      0      Maria
9    80      87     John

and

Idx  Fruits  Days   Name
2    10      62     Peter
3    40      90     Maria
4    5       10.2
5    92      66
6    65      87     John
7    50      1      Eric
8    50      0      Maria
9    80      87     John

Is there any possibility using pandas? Or any looping?
You can try this:

df['Name'] = df['Name'].replace('', np.nan)
df['Name'] = df['Name'].where(df['Name'].ffill().notna(), '.')
print(df)

   Idx  Fruits  Days   Name
0    0      60  20.0      .
1    1      15  85.5      .
2    2      10  62.0  Peter
3    3      40  90.0  Maria
4    4       5  10.2
5    5      92  66.0
6    6      65  87.0   John
7    7      50   1.0   Eric
8    8      50   0.0  Maria
9    9      80  87.0   John
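For the question's second desired output (dropping the leading rows instead of filling them with "."), the same ffill trick can serve directly as a boolean mask. A sketch, assuming the empty Name cells are empty strings as in the question (the frame is reconstructed here from the question's table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Fruits': [60, 15, 10, 40, 5, 92, 65, 50, 50, 80],
                   'Days': [20, 85.5, 62, 90, 10.2, 66, 87, 1, 0, 87],
                   'Name': ['', '', 'Peter', 'Maria', '', '', 'John', 'Eric', 'Maria', 'John']})

name = df['Name'].replace('', np.nan)
# ffill().notna() is False only before the first non-empty Name,
# so boolean indexing drops exactly the leading rows (0 and 1)
# while keeping the empty-Name rows that come later (4 and 5).
dropped = df[name.ffill().notna()]
print(dropped)
```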
sum values in column grouped by another column pandas
My df looks like this:

country  id   x    y
AT       11   50   100
AT       12   NaN  90
AT       13   NaN  104
AT       22   40   50
AT       23   30   23
AT       61   40   88
AT       62   NaN  78
UK       11   40   34
UK       12   NaN  22
UK       13   NaN  70

What I need is the sum of the y column in the first row that is not NaN in x, grouped by the first number on the left of the column id. This separately for each country. At the end I just need to drop the NaN. The result should be something like this:

country  id   x    y
AT       11   50   294
AT       22   40   50
AT       23   30   23
AT       61   40   166
UK       11   40   126
You can aggregate with GroupBy.agg using the first and sum functions, together with a helper Series built by testing for non-missing values with Series.notna and taking the cumulative sum with Series.cumsum:

df1 = (df.groupby(['country', df['x'].notna().cumsum()])
         .agg({'id':'first', 'x':'first', 'y':'sum'})
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)

  country  id     x    y
0      AT  11  50.0  294
1      AT  22  40.0   50
2      AT  23  30.0   23
3      AT  61  40.0  166
4      UK  11  40.0  126

If the first value(s) of x can be missing, add DataFrame.dropna:

print (df)
   country  id     x    y
0       AT  11   NaN  100
1       AT  11  50.0  100
2       AT  12   NaN   90
3       AT  13   NaN  104
4       AT  22  40.0   50
5       AT  23  30.0   23
6       AT  61  40.0   88
7       AT  62   NaN   78
8       UK  11  40.0   34
9       UK  12   NaN   22
10      UK  13   NaN   70

df1 = (df.groupby(['country', df['x'].notna().cumsum()])
         .agg({'id':'first', 'x':'first', 'y':'sum'})
         .reset_index(level=1, drop=True)
         .reset_index()
         .dropna(subset=['x']))
print (df1)

  country  id     x    y
1      AT  11  50.0  294
2      AT  22  40.0   50
3      AT  23  30.0   23
4      AT  61  40.0  166
5      UK  11  40.0  126
Use groupby, transform and dropna:

print (df.assign(y=df.groupby(df["x"].notnull().cumsum())["y"].transform('sum'))
         .dropna(subset=["x"]))

  country  id     x    y
0      AT  11  50.0  294
3      AT  22  40.0   50
4      AT  23  30.0   23
5      AT  61  40.0  166
7      UK  11  40.0  126
Subtract/Add existing values if contents of one dataframe is present in another using pandas
Here are 2 dataframes.

df1:

Index  Number  Name   Amount
0      123     John   31
1      124     Alle   33
2      312     Amy    33
3      314     Holly  35

df2:

Index  Number  Name   Amount
0      312     Amy    13
1      124     Alle   35
2      317     Jack   53

The resulting dataframe should look like this:

result_df:

Index  Number  Name   Amount  Curr_amount
0      123     John   31      31
1      124     Alle   33      68
2      312     Amy    33      46
3      314     Holly  35      35
4      317     Jack           53

I have tried using pandas isin, but it only tells me, as a boolean, whether a value of the Number column was present or not. Is there any way to do this efficiently?
Use merge with an outer join and then Series.add (or Series.sub if necessary):

df = df1.merge(df2, on=['Number','Name'], how='outer', suffixes=('','_curr'))
df['Amount_curr'] = df['Amount_curr'].add(df['Amount'], fill_value=0)
print (df)

   Number   Name  Amount  Amount_curr
0     123   John    31.0         31.0
1     124   Alle    33.0         68.0
2     312    Amy    33.0         46.0
3     314  Holly    35.0         35.0
4     317   Jack     NaN         53.0
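An alternative sketch using concat plus groupby: stack both frames, sum Amount per (Number, Name) pair to get the combined total, then merge back so the original Amount column survives (df1/df2 are reconstructed here from the question's tables):

```python
import pandas as pd

df1 = pd.DataFrame({'Number': [123, 124, 312, 314],
                    'Name': ['John', 'Alle', 'Amy', 'Holly'],
                    'Amount': [31, 33, 33, 35]})
df2 = pd.DataFrame({'Number': [312, 124, 317],
                    'Name': ['Amy', 'Alle', 'Jack'],
                    'Amount': [13, 35, 53]})

# Combined total per (Number, Name) pair across both frames.
curr = (pd.concat([df1, df2])
          .groupby(['Number', 'Name'], as_index=False)['Amount'].sum()
          .rename(columns={'Amount': 'Curr_amount'}))

# Outer merge keeps rows that exist only in df2 (e.g. 317 / Jack).
result = df1.merge(curr, on=['Number', 'Name'], how='outer')
print(result)
```

This scales well because both concat and the groupby sum are vectorized; row 317 / Jack ends up with a missing Amount and Curr_amount 53, matching the desired result_df.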