Imagine I have the following dataframe:
import numpy as np
import pandas as pd

np.random.seed(42)
t = pd.DataFrame({'year': 4*['2018'] + 3*['2019'] + 4*['2016'],
                  'pop': np.random.randint(10, 100, size=11),
                  'production': np.random.randint(2000, 40000, size=11)})
print(t)
year  pop  production
2018   61        3685
2018   24        2769
2018   81        4433
2018   70        7311
2019   30       39819
2019   92       19568
2019   96       21769
2016   84       30693
2016   84        8396
2016   97       29480
2016   33       27658
I want to find the sum of production divided by the sum of pop for each year; my final dataframe would be something like:
tmp = t.groupby('year').sum()
tmp['production']/tmp['pop']
year
2016    322.909396
2018     77.110169
2019    372.275229
I was thinking if it could be done using groupby year and then using agg based on two columns, something like:
# doesn't work
t.groupby('year').agg(prod_per_pop=(['pop', 'production'],
                                    lambda x: x['production'].sum() / x['pop'].sum()))
My question is basically if it is possible to use any pandas groupby method to achieve that in an easy way rather than having to create another dataframe and then having to divide.
You could use a lambda function with axis=1 to solve it in a single line:
t.groupby('year')[['pop', 'production']].agg('sum').apply(lambda x: x['production'] / x['pop'], axis=1)
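Alternatively, to stay entirely inside the groupby with no intermediate frame, a sketch of the apply-based route (using the same values as the printed frame above):

```python
import pandas as pd

# Same data as the printed frame above
t = pd.DataFrame({'year': ['2018']*4 + ['2019']*3 + ['2016']*4,
                  'pop': [61, 24, 81, 70, 30, 92, 96, 84, 84, 97, 33],
                  'production': [3685, 2769, 4433, 7311, 39819, 19568,
                                 21769, 30693, 8396, 29480, 27658]})

# Each group is that year's sub-frame, so both columns are visible at once
ratio = t.groupby('year').apply(lambda g: g['production'].sum() / g['pop'].sum())
print(ratio)
# 2016    322.909396
# 2018     77.110169
# 2019    372.275229
```

Note that apply runs the lambda once per group in Python, so on large data the vectorized sum-then-divide is faster.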
Currently I'm using pandas and numpy to play around with a data set on rain measurements in India, however I'm stumped on trying to create a particular column. Currently my data set looks like this:
SUBDIVISION                JAN  FEB  MAR  APR  MAY
Andaman & Nicobar Islands   50   70   90  250  430
Arunachal Pradesh           46   90  151  265  356
Assam & Meghalaya           16   31   79  505  340
Bihar                       13   14  100   16   53
What I want is to replace all the columns that have the months with a single column "Months", and I want this column to contain the name of the month that has the most amount of rain, so for example it would look like this:
SUBDIVISION                Months
Andaman & Nicobar Islands  MAY
Arunachal Pradesh          MAY
Assam & Meghalaya          APR
Bihar                      MAR
My data set is much larger than this, so manually inputting all of the data would not be worth it. I'm hoping there's a way to do this in Python.
Use:
# get the column name of the max value among the month columns
df.set_index('SUBDIVISION').idxmax(axis=1).reset_index(name='Months')
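Put together as a runnable sketch, using the sample values from the question:

```python
import pandas as pd

# Rain table from the question
df = pd.DataFrame({
    'SUBDIVISION': ['Andaman & Nicobar Islands', 'Arunachal Pradesh',
                    'Assam & Meghalaya', 'Bihar'],
    'JAN': [50, 46, 16, 13],
    'FEB': [70, 90, 31, 14],
    'MAR': [90, 151, 79, 100],
    'APR': [250, 265, 505, 16],
    'MAY': [430, 356, 340, 53],
})

# idxmax(axis=1) walks each row and returns the column label of its maximum
out = df.set_index('SUBDIVISION').idxmax(axis=1).reset_index(name='Months')
print(out)
# Months column: MAY, MAY, APR, MAR
```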
You can use pd.melt to transform your data first.
import pandas as pd

df = pd.DataFrame({
    'subdivision': ['a', 'b'],
    'jun': [1, 2],
    'july': [2, 1],
})
df = pd.melt(df, id_vars=['subdivision'], var_name='month', value_name='rain')
df:

  subdivision month  rain
0           a   jun     1
1           b   jun     2
2           a  july     2
3           b  july     1
Then sort by the rain value and drop duplicates on subdivision, keeping only the row with the max rain value in each subdivision:
df = df.sort_values('rain', ascending=False).drop_duplicates(['subdivision'])
Output:
  subdivision month  rain
1           b   jun     2
2           a  july     2
New to the Python world, and I'm working through a problem where I need to pull the value of the largest index for each year. I'll provide a table example and explain further.
Year  Index  D_Value
2010     13       85
2010     14       92
2010     15       76
2011      9       68
2011     10       73
2012    100       94
2012    101       89
So, the desired output would look like this:
Year  Index  D_Value
2010     15       76
2011     10       73
2012    101       89
I've tried researching how to apply max() and .loc[], but I'm not sure what the optimal approach is for this scenario. Any help would be greatly appreciated. I've also included the code below to generate the test table.
import pandas as pd
data = {'Year': [2010, 2010, 2010, 2011, 2011, 2012, 2012],
        'Index': [13, 14, 15, 9, 10, 100, 101],
        'D_Value': [85, 92, 76, 68, 73, 94, 89]}
df = pd.DataFrame(data)
print(df)
You can use groupby + rank:
df['Rank'] = df.groupby(by='Year')['Index'].rank(ascending=False)
print(df[df['Rank'] == 1])
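An equivalent sketch with idxmax, which skips the helper column: groupby('Year')['Index'].idxmax() returns the row label of each year's largest Index, and loc then pulls those full rows out.

```python
import pandas as pd

data = {'Year': [2010, 2010, 2010, 2011, 2011, 2012, 2012],
        'Index': [13, 14, 15, 9, 10, 100, 101],
        'D_Value': [85, 92, 76, 68, 73, 94, 89]}
df = pd.DataFrame(data)

# One row label per year (the position of the max 'Index'), then fetch those rows
out = df.loc[df.groupby('Year')['Index'].idxmax()]
print(out)
#    Year  Index  D_Value
# 2  2010     15       76
# 4  2011     10       73
# 6  2012    101       89
```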
Based on the code below, I'm assigning some columns to my DataFrame, which has been grouped by month of the date, and it works well:
import numpy as np

all_together = (df_clean.groupby(df_clean['ContractDate'].dt.strftime('%B'))
                .agg({'Amount': [np.sum, np.mean, np.min, np.max]})
                .rename(columns={'sum': 'sum_amount', 'mean': 'avg_amount',
                                 'amin': 'min_amount', 'amax': 'max_amount'}))
But for some reason, when I try to plot the result (as any kind of plot), it's not able to recognize my "ContractDate" as a column, nor any of those renamed names such as 'sum_amount'.
Do you have any idea that what's the issue and what am I missing as a rule for plotting the data?
I have tried the code below for plotting and it asks me what is "ContractDate" and what is "sum_amount"!
all_together.groupby(df_clean['ContractDate'].dt.strftime('%B'))['sum_amount'].nunique().plot(kind='bar')
#or
all_together.plot(kind='bar',x='ContractDate',y='sum_amount')
I really appreciate your time
Cheers,
z.A
When you apply a groupby on a DataFrame, it makes the groupby column the index (ContractDate in your case). So you need to reset the index first to turn it back into a column.
import pandas as pd

df = pd.DataFrame({'month': ['jan', 'feb', 'jan', 'feb'], 'v2': [23, 56, 12, 59]})
t = df.groupby('month').agg('sum')
Output:
       v2
month
feb   115
jan    35
So as you see, you're getting months as index. Then when you reset the index:
t.reset_index()
Output:
month v2
0 feb 115
1 jan 35
Next, when you apply multiple agg functions to a single column in the groupby, it creates a multi-indexed dataframe, so you need to flatten it back to a single-level index:
t = (df.groupby('month')
       .agg({'v2': [np.sum, np.mean, np.min, np.max]})
       .rename(columns={'sum': 'sum_amount', 'mean': 'avg_amount',
                        'amin': 'min_amount', 'amax': 'max_amount'}))

           v2
   sum_amount avg_amount min_amount max_amount
month
feb       115       57.5         56         59
jan        35       17.5         12         23
It created a MultiIndex. If you check t.columns, you get:
MultiIndex(levels=[['v2'], ['avg_amount', 'max_amount', 'min_amount', 'sum_amount']],
labels=[[0, 0, 0, 0], [3, 0, 2, 1]])
Now use this:
t.columns = t.columns.get_level_values(1)
t.reset_index(inplace=True)
You will get a clean dataframe:
month sum_amount avg_amount min_amount max_amount
0 feb 115 57.5 56 59
1 jan 35 17.5 12 23
Hope this helps with your plotting.
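As an aside, on pandas 0.25 or newer the MultiIndex can be avoided entirely with named aggregation; a sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'month': ['jan', 'feb', 'jan', 'feb'],
                   'v2': [23, 56, 12, 59]})

# Named aggregation: each keyword becomes a flat output column,
# and as_index=False keeps 'month' as a regular column for plotting
t = (df.groupby('month', as_index=False)
       .agg(sum_amount=('v2', 'sum'),
            avg_amount=('v2', 'mean'),
            min_amount=('v2', 'min'),
            max_amount=('v2', 'max')))
print(t)
#   month  sum_amount  avg_amount  min_amount  max_amount
# 0   feb         115        57.5          56          59
# 1   jan          35        17.5          12          23
```

With that, t.plot(kind='bar', x='month', y='sum_amount') works directly, with no index cleanup.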
I just came across a strange phenomenon with pandas DataFrames: when you set a new index with DataFrame.set_index('some_index'), the old column that was the index is deleted! Here is an example:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})
df_mn = df.set_index('month')
>>> df_mn
sale year
month
1 55 2012
4 40 2014
7 84 2013
10 31 2014
Now I change the index to year:
df_mn.set_index('year')
sale
year
2012 55
2014 40
2013 84
2014 31
.. and the month column was removed along with the index. This is very irritating because I just wanted to swap the DataFrame index.
Is there a way to keep the previous index column from being deleted? Maybe through something like DataFrame.set_index('new_index', delete_previous_index=False)?
Thanks for any advice
You can do the following
>>> df_mn.reset_index().set_index('year')
month sale
year
2012 1 55
2014 4 40
2013 7 84
2014 10 31
The solution I found to retain the previous column is to set drop=False:
dataframe.set_index('some_column', drop=False)
This is not the perfect answer, but it works!
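A quick sketch of the drop=False route on the example data from the question:

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# drop=False keeps 'month' both as the index and as an ordinary column,
# so swapping the index to 'year' later does not lose it
df_mn = df.set_index('month', drop=False)
df_yr = df_mn.set_index('year')
print(df_yr)  # still has a 'month' column alongside 'sale'
```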
No, in such cases you have to save your previous column, as shown below:
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})
df_mn = df.set_index('month')
df_mn['month'] = df_mn.index  # save it as another column, then set_index with the year column
df_mn.set_index('year')
Besides, you are working on a separate dataframe df_mn, so the original dataframe df remains unchanged and you can use it again. Also, if you don't set the inplace argument of set_index to True, df_mn won't have changed even after you call set_index() on it. And, like the other answer says, you can always use reset_index().
Suppose I have a Python pandas dataframe with 10 rows and 16 columns, where each row stands for one product. The first column is the product ID; the other 15 columns are selling prices for
2010/01, 2010/02, 2010/03, 2010/05, 2010/06, 2010/07, 2010/08, 2010/10, 2010/11, 2010/12, 2011/01, 2011/02, 2011/03, 2011/04, 2011/05.
(The column names are strings, not dates.) Now I want to calculate the mean selling price for each quarter (1Q2010, 2Q2010, ..., 2Q2011), and I don't know how to deal with it. (Note that the months 2010/04, 2010/09 and 2011/06 are missing.)
The description above is just an example, small enough to loop over manually. However, the real data set I work on is 10730×202, so I can't manually check which months are missing or map quarters by hand. I wonder what efficient approach I can apply here.
Thanks for the help!
This should help.
import pandas as pd
import numpy as np
rng = pd.DataFrame({'date': pd.date_range('1/1/2011', periods=72, freq='M'),
                    'value': np.arange(72)})
df = rng.groupby([rng.date.dt.quarter, rng.date.dt.year])[['value']].mean()
df.index.names = ['quarter', 'year']
df.columns = ['mean']
print(df)
mean
quarter year
1 2011 1
2012 13
2013 25
2014 37
2015 49
2016 61
2 2011 4
2012 16
2013 28
2014 40
2015 52
2016 64
3 2011 7
2012 19
2013 31
2014 43
2015 55
2016 67
4 2011 10
2012 22
2013 34
2014 46
2015 58
2016 70
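For the actual question, where the months are wide string columns like '2010/01', one way (a sketch; the column and variable names here are illustrative, not from the original data) is to melt to long form, parse the strings, and bucket by quarterly period, so missing months simply contribute nothing to their quarter's mean:

```python
import pandas as pd

# Toy stand-in for the wide table: one ID column plus 'YYYY/MM' price columns
df = pd.DataFrame({'product_id': [1, 2],
                   '2010/01': [10.0, 20.0],
                   '2010/02': [12.0, 22.0],
                   '2010/03': [14.0, 24.0],
                   '2010/05': [16.0, 26.0]})   # note: 2010/04 is missing

long = df.melt(id_vars='product_id', var_name='month', value_name='price')
# Parse the 'YYYY/MM' strings and bucket each month into its quarter
long['quarter'] = pd.to_datetime(long['month'], format='%Y/%m').dt.to_period('Q')
out = long.groupby(['product_id', 'quarter'])['price'].mean().unstack('quarter')
print(out)
# quarter     2010Q1  2010Q2
# product_id
# 1             12.0    16.0
# 2             22.0    26.0
```

Because the month labels are parsed rather than positioned, this scales to the 10730×202 frame without knowing in advance which months are absent.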