Selecting a particular row from a groupby object in Python

id marks year
1 18 2013
1 25 2012
3 16 2014
2 16 2013
1 19 2013
3 25 2013
2 18 2014
Suppose I now group the above on id with this Python command:
grouped = file.groupby(file.id)
I would like to get a new file containing only the row from each group with the most recent year, that is, the highest year in the group.
Please let me know the command. I am trying with apply, but it only gives a boolean expression. I want the entire row with the latest year.

I cobbled this together using this: Python : Getting the Row which has the max value in groups using groupby
So basically we can group by the 'id' column, then call transform on the 'year' column and create a boolean index where the year matches the max year value for each 'id'. Rows that tie for the max year within an 'id' are all kept, which is why 'id' 1 appears twice in the output:
In [103]:
df[df.groupby(['id'])['year'].transform('max') == df['year']]
Out[103]:
   id  marks  year
0   1     18  2013
2   3     16  2014
4   1     19  2013
6   2     18  2014
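If exactly one row per id is wanted, a possible alternative is idxmax. This is a minimal sketch rather than part of the original answer; the DataFrame is rebuilt from the question's sample data, and the name latest is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":    [1, 1, 3, 2, 1, 3, 2],
    "marks": [18, 25, 16, 16, 19, 25, 18],
    "year":  [2013, 2012, 2014, 2013, 2013, 2013, 2014],
})

# idxmax gives the index label of the first maximum year in each group,
# so ties (id 1 has two 2013 rows) collapse to a single row per id.
latest = df.loc[df.groupby("id")["year"].idxmax()]
print(latest)
```

Unlike the transform version, this keeps only the first of the tied rows, so which one fits depends on whether ties should survive.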

Related

I have multiIndexes for my dataframe, how do I calculate the sum for one level?

Hi everyone, I want to calculate the sum of the Violent_type counts by year, for example the total count for 2013, which is 18728+121662+1035. But I don't know how to select the data when the index is a MultiIndex. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
                            count
district year Violent_type
001      2013 Dangerous         0
              Non-Violent       1
              Violent           2
         2014 Dangerous         3
              Non-Violent       4
              Violent           5
SST      2013 Dangerous         6
              Non-Violent       7
              Violent           8
         2014 Dangerous         9
              Non-Violent      10
              Violent          11
'''
print(df.groupby(level=[0, 1]).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
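The question asks for a total per year across all districts, which means grouping on the year level alone. A short sketch using the same toy index as above (the name per_year is just for illustration):

```python
import pandas as pd

# Same toy MultiIndex as in the answer above.
iterables = [["001", "SST"], [2013, 2014], ["Dangerous", "Non-Violent", "Violent"]]
index = pd.MultiIndex.from_product(iterables, names=["district", "year", "Violent_type"])
df = pd.DataFrame(list(range(len(index))), index=index, columns=["count"])

# Group on the 'year' level only, summing over districts and violence types.
per_year = df.groupby(level="year").sum()
print(per_year)
```

Here 2013 sums the rows 0, 1, 2, 6, 7 and 8, and 2014 sums the rest.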

How to remove leading '0' from my column? Python

I am trying to remove the leading '0' from my data
My dataframe looks like this
Id Year Month Day
1 2019 01 15
2 2019 03 30
3 2019 10 20
4 2019 11 18
Note: the 'Year', 'Month' and 'Day' columns have dtype object.
I got the 'Year', 'Month' and 'Day' columns by extracting them from a date.
I want to remove the '0' at the beginning of each month.
Desired output:
Id Year Month Day
1 2019 1 15
2 2019 3 30
3 2019 10 20
4 2019 11 18
What I tried to do so far:
df['Month'].str.lstrip('0')
But it did not work.
Any solution? Thank you!
You could use the re package and apply a regex to the column:
import re
import pandas as pd
# Create sample data
d = pd.DataFrame(data={"Month":["01","02","03","10","11"]})
# Strip one or more leading zeros from each string
d["Month"] = d["Month"].apply(lambda x: re.sub(r"^0+", "", x))
Result:
0     1
1     2
2     3
3    10
4    11
Name: Month, dtype: object
If you are 100% sure that the Month column will always contain numbers, then you could simply do:
d["Month"] = d["Month"].astype(int)
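The str.lstrip attempt from the question may have failed only because its result was never assigned back to the column. That is an assumption, since the question does not say what "did not work" looked like. A minimal sketch with made-up sample months:

```python
import pandas as pd

# Illustrative sample; the real column comes from the asker's date extraction.
d = pd.DataFrame({"Month": ["01", "03", "10", "11"]})

# str.lstrip returns a new Series, so the result has to be assigned back.
d["Month"] = d["Month"].str.lstrip("0")
print(d["Month"].tolist())
```

The column stays as strings here. A value made only of zeros would become an empty string, which is one reason to prefer astype(int) when the data is purely numeric.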

loop to filter rows based on multiple column conditions pandas python

df
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
FEB 2010 13 1 2
MAR 2010 12 2 1
....
DEC 2019 2 3 4
I have code to extract DataFrames for a given month name (Jan, Feb, etc.) across all years. For example:
[IN]filterJan=df[df['month']=='JAN']
filterJan
[OUT]
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
JAN 2011 13 1 2
....
JAN 2019 2 3 4
I am trying to make a loop for this process.
[IN]for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    filter[month]=df[df['month']==month]
[OUT]
----> 3 filter[month]=batch1_clean_Sales_database[batch1_clean_Sales_database['month']==month]
TypeError: 'type' object does not support item assignment
If I print the DataFrames it works, but I want to store them and reuse them later.
[IN]for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    print(df[df['month']==month])
I think you can create a dictionary of DataFrames:
d = dict(tuple(df.groupby('month')))
Your solution fails because filter is a Python built-in, so filter[month] = ... raises the TypeError. Assign into a dictionary instead:
d = {}
for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    d[month] = df[df['month']==month]
Then you can select each month with d['Jan'], which works like any other DataFrame.
To loop over the dictionary of DataFrames:
for k, v in d.items():
    print(k)
    print(v)
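One thing to watch: the month names used as keys must match the column values exactly, and the sample data shows upper-case 'JAN' while the loop uses 'Jan'. A hedged sketch of one way to reconcile them, using a small made-up table (the values and the name frames are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the sales table; the numbers are made up.
df = pd.DataFrame({
    "month": ["JAN", "FEB", "JAN", "FEB"],
    "year":  [2010, 2010, 2011, 2011],
    "Jelly": [12, 13, 13, 2],
})

# Normalise the labels so 'JAN' in the data becomes 'Jan'.
df["month"] = df["month"].str.title()

# One DataFrame per month, keyed by the month label.
frames = {month: group for month, group in df.groupby("month")}
print(frames["Jan"])
```

Without the normalisation step, the equality filter would silently return empty DataFrames for every month.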

How to calculate average and most frequent values per group?

I have the following df:
df =
year intensity category
2015 22 1
2015 21 1
2015 23 2
2016 25 2
2017 20 1
2017 21 1
2017 20 3
I need to group by year and calculate the average intensity and the most frequent category (per year).
I know that it's possible to calculate most frequent category as follows:
df.groupby('year')['category'].agg(lambda x: x.value_counts().index[0])
I also know how to calculate average intensity:
df = df.groupby(["year"]).agg({'intensity':'mean'}).reset_index()
But I don't know how to put everything together without a join operation.
Use agg with a dictionary to define how to aggregate each column.
df.groupby('year', as_index=False)[['category', 'intensity']]\
    .agg({'category': lambda x: pd.Series.mode(x)[0], 'intensity':'mean'})
Output:
   year  category  intensity
0  2015         1  22.000000
1  2016         2  25.000000
2  2017         1  20.333333
Or you can still use your value_counts lambda function:
df.groupby('year', as_index=False)[['category','intensity']]\
    .agg({'category': lambda x: x.value_counts().index[0],'intensity':'mean'})
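The same result can also be written with pandas named aggregation, which spells out each output column as a (source column, function) pair. This is a sketch of an alternative, not the answer's own code; the DataFrame is rebuilt from the question and result is an illustrative name.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "year":      [2015, 2015, 2015, 2016, 2017, 2017, 2017],
    "intensity": [22, 21, 23, 25, 20, 21, 20],
    "category":  [1, 1, 2, 2, 1, 1, 3],
})

# Each keyword names an output column: (input column, aggregation).
result = (
    df.groupby("year")
      .agg(
          category=("category", lambda s: s.mode().iat[0]),
          intensity=("intensity", "mean"),
      )
      .reset_index()
)
print(result)
```

Series.mode returns its values sorted, so when two categories tie within a year this picks the smallest one.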

Get values on a big DataFrame Python

I have a big dataframe with a structure like this:
ID Year Consumption
1 2012 24
2 2012 20
3 2012 21
1 2013 22
2 2013 23
3 2013 24
4 2013 25
I want another DataFrame that contains the first year of appearance and the all-time max consumption per ID, like this:
ID First_Year Max_Consumption
1 2012 24
2 2012 23
3 2012 24
4 2013 25
Is there a way to extract this data without using loops? I have tried this:
year = list(set(df.Year))
ids = list(set(df.ID))
antiq = list()
max_con = list()
for i in ids:
    df_id = df[df['ID'] == i]
    antiq.append(min(df_id['Year']))
    max_con.append(max(df_id['Consumption']))
But it's too slow. Thank you!
Use GroupBy + agg:
res = df.groupby('ID', as_index=False).agg({'Year': 'min', 'Consumption': 'max'})
print(res)
   ID  Year  Consumption
0   1  2012           24
1   2  2012           23
2   3  2012           24
3   4  2013           25
Another alternative to groupby is pivot_table:
pd.pivot_table(df, index="ID", aggfunc={"Year": "min", "Consumption": "max"})
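Neither variant produces the First_Year and Max_Consumption column names requested in the question. One possible way to get them is to rename after aggregating. A minimal sketch, rebuilding the question's sample data:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":          [1, 2, 3, 1, 2, 3, 4],
    "Year":        [2012, 2012, 2012, 2013, 2013, 2013, 2013],
    "Consumption": [24, 20, 21, 22, 23, 24, 25],
})

# Aggregate per ID, then rename the columns to the requested names.
res = (
    df.groupby("ID", as_index=False)
      .agg({"Year": "min", "Consumption": "max"})
      .rename(columns={"Year": "First_Year", "Consumption": "Max_Consumption"})
)
print(res)
```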
