Dataframe to extract 2 most recent rows of each group - python

I have a simple dataframe and I want to pick the 2 most recent rows (sorted by "Year") of each group, with all columns.
import pandas as pd
data = {'People' : ["John","John","John","Kate","Kate","David","David","David","David"],
'Year': ["2018","2019","2006","2017","2012","2006","2019","2018","2017"],
'Sales' : [120,100,60,150,135,140,90,110,160]}
df = pd.DataFrame(data)
I tried the below, but it doesn't produce what I want:
df = df.groupby('People')
df_1 = pd.concat([df.head(2)]).drop_duplicates().sort_values('Year').reset_index(drop=True)
What's the right way to write it? Thank you.

IIUC, use pandas.DataFrame.nlargest (Year is stored as strings, so cast it to int first):
df['Year'] = df['Year'].astype(int)
df.groupby('People', as_index=False).apply(lambda x: x.nlargest(2, "Year"))
Output:
     People  Year  Sales
0 6   David  2019     90
  7   David  2018    110
1 1    John  2019    100
  0    John  2018    120
2 3    Kate  2017    150
  4    Kate  2012    135
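As a side note (my own variant, not part of the answer above), a common alternative sketch is to sort by Year and keep the last two rows of each group with groupby().tail(2), which avoids apply entirely:

```python
import pandas as pd

data = {'People': ["John", "John", "John", "Kate", "Kate", "David", "David", "David", "David"],
        'Year': ["2018", "2019", "2006", "2017", "2012", "2006", "2019", "2018", "2017"],
        'Sales': [120, 100, 60, 150, 135, 140, 90, 110, 160]}
df = pd.DataFrame(data)
df['Year'] = df['Year'].astype(int)

# Sort by Year ascending, then keep the last (most recent) 2 rows per group.
out = df.sort_values('Year').groupby('People').tail(2)
print(out)
```

This keeps the original columns untouched and produces a plain (non-hierarchical) index.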

sort a selection of a pandas dataframe only (keep some columns fixed)

import pandas as pd
data = {'Brand': ['HH','TT','FF','AA'],
'Price': [22000,25000,27000,35000],
'Year': [2015,2013,2018,2018],
'Misc1': ['Description: ', '', '', ''],
'Misc2': ['Car Prices 2022', '', '', '']
}
df = pd.DataFrame(data, columns=['Brand','Price','Year', 'Misc1', 'Misc2'])
print (df, '\n')
df.sort_values(by=['Brand'], inplace=True)
print(df, '\n')
I would like to keep the Misc1 and Misc2 columns fixed.
This does not work (sort_values with inplace=True returns None, so df ends up as None):
df = df.loc[:, ~df.columns.isin(['Misc1', 'Misc2'])].sort_values(by=['Brand'], inplace=True)
print(df, '\n')
Does anybody here know a good way to do this?
Here is one way to do it: break up your DF into two DFs, resetting the index on the first one:
df2=df[['Brand','Price','Year']].sort_values(by=['Brand'] ).reset_index()
df3=df[['Misc1','Misc2']]
Then join the two DFs:
df2.join(df3).drop(columns='index')
Brand Price Year Misc1 Misc2
0 AA 35000 2018 Description: Car Prices 2022
1 FF 27000 2018
2 HH 22000 2015
3 TT 25000 2013
Here is the original DF, before the sort:
Brand Price Year Misc1 Misc2
0 HH 22000 2015 Description: Car Prices 2022
1 TT 25000 2013
2 FF 27000 2018
3 AA 35000 2018
IIUC, you can assign the Misc columns back after sorting:
out = df.sort_values(by=['Brand'])
out[['Misc1', 'Misc2']] = df[['Misc1', 'Misc2']].values
print(out)
Brand Price Year Misc1 Misc2
3 AA 35000 2018 Description: Car Prices 2022
2 FF 27000 2018
0 HH 22000 2015
1 TT 25000 2013
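One more sketch along the same lines (my own variant, not from the answers above): sort only the chosen subset of columns and write the sorted values back positionally, so the Misc columns never move at all:

```python
import pandas as pd

data = {'Brand': ['HH', 'TT', 'FF', 'AA'],
        'Price': [22000, 25000, 27000, 35000],
        'Year': [2015, 2013, 2018, 2018],
        'Misc1': ['Description: ', '', '', ''],
        'Misc2': ['Car Prices 2022', '', '', '']}
df = pd.DataFrame(data)

cols = ['Brand', 'Price', 'Year']
# Sort just these columns; to_numpy() drops the index, so the assignment
# is positional and Misc1/Misc2 stay in their original rows.
df[cols] = df[cols].sort_values('Brand').to_numpy()
print(df)
```

The to_numpy() call is what prevents pandas from re-aligning the sorted rows back to their old index.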

How can I add the values of one column based on a groupby in Python?

Suppose I have the following dataframe:
year count
2001 14
2004 16
2001 2
2005 21
2001 22
2004 14
2001 8
I want to group by the year column and sum the count column for each year. I would like my result to be:
year count
2001 46
2004 30
2005 21
I am struggling a bit to find a way to do this; can anyone help?
import pandas as pd
df = pd.read_csv("test.csv")
df['count'] = pd.to_numeric(df['count'])
#df['count'] = df.groupby(['year'])['count'].sum()
total = df.groupby(['year'])['count'].sum()
print(total)
Yields:
year
2001 46
2004 30
2005 21
Hope this helps! Let's assume your pandas dataframe is named df; then the groupby code runs like below:
df.groupby('year')[['count']].sum()
It will return the dataframe you want.
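If you want the result as a regular dataframe with year as an ordinary column rather than the index, as_index=False is a handy variant (a small addition, not from the answers above):

```python
import pandas as pd

# Sample data matching the question (instead of reading test.csv).
df = pd.DataFrame({'year': [2001, 2004, 2001, 2005, 2001, 2004, 2001],
                   'count': [14, 16, 2, 21, 22, 14, 8]})

# as_index=False keeps 'year' as a column in the result.
total = df.groupby('year', as_index=False)['count'].sum()
print(total)
```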

groupby().mean() doesn't work in a for loop

I have a dictionary named c whose values are dataframes, each with 3 columns: 'year', 'month' and 'Tmed'. I want to calculate the monthly mean values of Tmed for each year, so I used
for i in range(22) : c[i].groupby(['year','month']).mean().reset_index()
This returns
year month Tmed
0 2018 12 14.8
2 2018 12 12.0
3 2018 11 16.1
5 2018 11 9.8
6 2018 11 9.8
9 2018 11 9.3
4425 rows × 3 columns
The index is not as it should be, and for the 11th month of 2018, for example, there should be only one row, but as you can see the dataframe has more than one.
I tried the code on a single dataframe and it gave the wanted result :
c[3].groupby(['year','month']).mean().reset_index()
year month Tmed
0 1999 9 23.950000
1 1999 10 19.800000
2 1999 11 12.676000
3 1999 12 11.012000
4 2000 1 9.114286
5 2000 2 12.442308
6 2000 3 13.403704
7 2000 4 13.803846
8 2000 5 17.820000
.
.
.
218 2018 6 21.093103
219 2018 7 24.977419
220 2018 8 26.393103
221 2018 9 24.263333
222 2018 10 19.069565
223 2018 11 13.444444
224 2018 12 13.400000
225 rows × 3 columns
I need a for loop because I have many dataframes. I can't figure out the issue; any help would be appreciated.
I don't see a reason why your code should fail, as long as you assign the groupby result back to c[i]. I tried the below and got the required results:
import numpy as np
import pandas as pd

def getRandomDataframe():
    rand_year = pd.DataFrame(np.random.randint(2010, 2011, size=(50, 1)), columns=list('y'))
    rand_month = pd.DataFrame(np.random.randint(1, 13, size=(50, 1)), columns=list('m'))
    rand_value = pd.DataFrame(np.random.randint(0, 100, size=(50, 1)), columns=list('v'))
    df = pd.DataFrame(columns=['year', 'month', 'value'])
    df['year'] = rand_year
    df['month'] = rand_month
    df['value'] = rand_value
    return df

def createDataFrameDictionary():
    _dict = {}
    length = 3
    for i in range(length):
        _dict[i] = getRandomDataframe()
    return _dict

c = createDataFrameDictionary()
for i in range(3):
    c[i] = c[i].groupby(['year', 'month'])['value'].mean().reset_index()

# Check results
print(c[0])
Please check whether a year/month combo repeats in different dataframes, which could be the reason for the repeats.
In your scenario, it may be a good idea to collect the groupby mean results for each dataframe into another dataframe and do a groupby mean again on the new dataframe.
Can you try the following:
main_df = pd.DataFrame()
for i in range(22):
    main_df = pd.concat([main_df, c[i].groupby(['year', 'month']).mean().reset_index()])
print(main_df.groupby(['year', 'month']).mean())
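To make the original loop work, the key point is assigning the result back into the dictionary; a dict-comprehension sketch (using made-up sample data standing in for the asker's 22 dataframes) would be:

```python
import pandas as pd

# Hypothetical stand-in for the asker's dictionary of dataframes.
c = {0: pd.DataFrame({'year': [2018, 2018, 2018],
                      'month': [11, 11, 12],
                      'Tmed': [16.1, 9.8, 14.8]}),
     1: pd.DataFrame({'year': [1999, 1999],
                      'month': [9, 9],
                      'Tmed': [23.9, 24.0]})}

# groupby().mean() returns a NEW object; it must be assigned back,
# otherwise the loop computes the means and throws them away.
c = {k: v.groupby(['year', 'month'], as_index=False)['Tmed'].mean()
     for k, v in c.items()}
print(c[0])
```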

Filtering Dataframe in Python

I have a dataframe with 2 columns as below:
Index Year Country
0 2015 US
1 2015 US
2 2015 UK
3 2015 Indonesia
4 2015 US
5 2016 India
6 2016 India
7 2016 UK
I want to create a new dataframe containing the maximum count of country in every year.
The new dataframe will contain 3 columns as below:
Index Year Country Count
0 2015 US 3
1 2016 India 2
Is there any function in pandas where this can be done quickly?
One way is to use groupby together with size to count each (Year, Country) pair, then sort the values and slice by the number of distinct years. You can try the following:
num_year = df['Year'].nunique()
new_df = df.groupby(['Year', 'Country']).size().rename('Count').sort_values(ascending=False).reset_index()[:num_year]
Result:
Year Country Count
0 2015 US 3
1 2016 India 2
Use:
1.
First get the count of each (Year, Country) pair by groupby and size.
Then get the index of the per-year max by idxmax and select the rows by loc:
df = df.groupby(['Year','Country']).size()
df = df.loc[df.groupby(level=0).idxmax()].reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
2.
Use custom function with value_counts and head:
df = (df.groupby('Year')['Country']
        .apply(lambda x: x.value_counts().head(1))
        .rename_axis(('Year', 'Country'))
        .reset_index(name='Count'))
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
Just providing a method without groupby (here df2 is the original dataframe):
Count = (pd.Series(list(zip(df2.Year, df2.Country))).value_counts()
           .head(2).reset_index(name='Count'))
Count[['Year', 'Country']] = Count['index'].apply(pd.Series)
Count.drop('index', axis=1)
Out[266]:
Count Year Country
0 3 2015 US
1 2 2016 India
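On newer pandas (1.1+), DataFrame.value_counts can count the pairs directly; one more sketch along the same lines (my own addition, not from the answers above):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016],
                   'Country': ['US', 'US', 'UK', 'Indonesia', 'US', 'India', 'India', 'UK']})

# value_counts sorts descending, so the first row seen for each Year
# is that year's most frequent country.
counts = df.value_counts(['Year', 'Country']).reset_index(name='Count')
out = (counts.groupby('Year', as_index=False).head(1)
             .sort_values('Year').reset_index(drop=True))
print(out)
```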

How can I select data from a table with specific attribute with pandas?

I have a dataset like this:
Country , Gdp , Year, Capital , Labor.
It has data from 1950 to 2014, but I want extract data from all rows with year = 2000.
I'm using python , pandas and numpy.
IIUC you can use boolean indexing:
print(df)
Country Gdp Year Capital Labor
0 a 40 2001 s 40
1 b 30 2000 u 70
2 c 70 2008 t 50
print(df[df['Year'] == 2000])
Country Gdp Year Capital Labor
1 b 30 2000 u 70
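Equivalently, df.query gives a more readable form of the same boolean filter (a small addition to the answer above, using the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['a', 'b', 'c'],
                   'Gdp': [40, 30, 70],
                   'Year': [2001, 2000, 2008],
                   'Capital': ['s', 'u', 't'],
                   'Labor': [40, 70, 50]})

# query evaluates the expression against the dataframe's columns.
year_2000 = df.query("Year == 2000")
print(year_2000)
```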
