Get values on a big DataFrame Python [duplicate] - python

I have a big dataframe with a structure like this:
ID Year Consumption
1 2012 24
2 2012 20
3 2012 21
1 2013 22
2 2013 23
3 2013 24
4 2013 25
I want another DataFrame that contains, for each ID, the first year of appearance and the all-time maximum consumption, like this:
ID First_Year Max_Consumption
1 2012 24
2 2012 23
3 2012 24
4 2013 25
Is there a way to extract this data without using loops? I have tried this:
year = list(set(df.Year))
ids = list(set(df.ID))
antiq = list()
max_con = list()
for i in ids:
    df_id = df[df['ID'] == i]
    antiq.append(min(df_id['Year']))
    max_con.append(max(df_id['Consumption']))
But it's too slow. Thank you!

Use GroupBy + agg:
res = df.groupby('ID', as_index=False).agg({'Year': 'min', 'Consumption': 'max'})
print(res)
ID Year Consumption
0 1 2012 24
1 2 2012 23
2 3 2012 24
3 4 2013 25
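If the result should carry the column names from the question (First_Year, Max_Consumption), named aggregation is one way to get them in the same call. This is a minimal sketch, assuming pandas 0.25 or later; the frame below is rebuilt from the sample data in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 3, 4],
    "Year": [2012, 2012, 2012, 2013, 2013, 2013, 2013],
    "Consumption": [24, 20, 21, 22, 23, 24, 25],
})

# Named aggregation: output_column=(input_column, aggregation).
res = df.groupby("ID", as_index=False).agg(
    First_Year=("Year", "min"),
    Max_Consumption=("Consumption", "max"),
)
print(res)
```

The keyword names become the output column names, so no separate rename step is needed.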

Another alternative to groupby is pivot_table:
pd.pivot_table(df, index="ID", aggfunc={"Year":min, "Consumption":max})
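For comparison, here is a self-contained sketch of the pivot_table call on the question's sample data. It passes the aggregations as strings rather than the built-in functions and calls reset_index() so that ID comes back as a regular column; both choices are assumptions about what is wanted, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 3, 4],
    "Year": [2012, 2012, 2012, 2013, 2013, 2013, 2013],
    "Consumption": [24, 20, 21, 22, 23, 24, 25],
})

# One aggregation per column; ID becomes the index of the pivot table,
# so reset_index() turns it back into an ordinary column.
res = pd.pivot_table(
    df, index="ID", aggfunc={"Year": "min", "Consumption": "max"}
).reset_index()
print(res)
```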

Related

I have multiIndexes for my dataframe, how do I calculate the sum for one level?

Hi everyone, I want to calculate the sum of Violent_type count according to year. For example, calculating total count of violent_type for year 2013, which is 18728+121662+1035. But I don't know how to select the data when there are multiIndexes. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect:
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
print(df)
'''
                            count
district year Violent_type
001      2013 Dangerous         0
              Non-Violent       1
              Violent           2
         2014 Dangerous         3
              Non-Violent       4
              Violent           5
SST      2013 Dangerous         6
              Non-Violent       7
              Violent           8
         2014 Dangerous         9
              Non-Violent      10
              Violent          11
'''
# Group by position of the index levels
print(df.groupby(level=[0, 1]).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
# Group by name of the index levels (same result)
print(df.groupby(level=['district', 'year']).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
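If the total is wanted per year across all districts, as the question's wording suggests, grouping on the year level alone may be closer to the goal. A minimal sketch on the same toy data as the answer (the counts are the answer's placeholder values, not the asker's real figures):

```python
import pandas as pd

iterables = [["001", "SST"], [2013, 2014], ["Dangerous", "Non-Violent", "Violent"]]
index = pd.MultiIndex.from_product(
    iterables, names=["district", "year", "Violent_type"]
)
df = pd.DataFrame({"count": list(range(len(index)))}, index=index)

# Sum over every district and violence type within each year.
per_year = df.groupby(level="year").sum()
print(per_year)

# Pick out the total for a single year.
total_2013 = per_year.loc[2013, "count"]
print(total_2013)
```

Here 2013 adds up the rows 0, 1, 2, 6, 7 and 8 of the toy frame.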

How to remove leading '0' from my column? Python

I am trying to remove the '0' leading my data
My dataframe looks like this
Id Year Month Day
1 2019 01 15
2 2019 03 30
3 2019 10 20
4 2019 11 18
Note: 'Year','Month','Day' columns data types are object
I get the 'Year','Month','Day' columns by extracting it from a date.
I want to remove the '0' at the beginning of each month.
Desired output:
Id Year Month Day
1 2019 1 15
2 2019 3 30
3 2019 10 20
4 2019 11 18
What I tried to do so far:
df['Month'].str.lstrip('0')
But it did not work.
Any solution? Thank you!
You could use the re package and apply a regex to the column:
import re
import pandas as pd
# Create sample data
d = pd.DataFrame(data={"Month":["01","02","03","10","11"]})
# Strip one or more leading zeros from each value
d["Month"] = d["Month"].apply(lambda x: re.sub(r"^0+", "", x))
Result:
0 1
1 2
2 3
3 10
4 11
Name: Month, dtype: object
If you are 100% sure that the Month column will always contain numbers, then you could simply do:
d["Month"] = d["Month"].astype(int)
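The attempt in the question, df['Month'].str.lstrip('0'), returns a new Series and leaves the DataFrame untouched. Assuming that is why it "did not work" (the question does not say), assigning the result back may be all that is needed. A minimal sketch with the question's sample data:

```python
import pandas as pd

# Sample data from the question; all date parts are strings (dtype object).
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Year": ["2019", "2019", "2019", "2019"],
    "Month": ["01", "03", "10", "11"],
    "Day": ["15", "30", "20", "18"],
})

# str.lstrip returns a new Series, so the result has to be assigned back.
df["Month"] = df["Month"].str.lstrip("0")
print(df)
```

This keeps the column as strings, whereas astype(int) converts it to integers.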

loop to filter rows based on multiple column conditions pandas python

df
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
FEB 2010 13 1 2
MAR 2010 12 2 1
....
DEC 2019 2 3 4
Code to extract dataframes where month names are Jan, Feb etc for all years. For eg.
[IN]filterJan=df[df['month']=='JAN']
filterJan
[OUT]
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
JAN 2011 13 1 2
....
JAN 2019 2 3 4
I am trying to make a loop for this process.
[IN]for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    filter[month]=df[df['month']==month]
[OUT]
----> 3 filter[month]=batch1_clean_Sales_database[batch1_clean_Sales_database['month']==month]
TypeError: 'type' object does not support item assignment
If I print the dataframes it is working, but i want to store them and reuse them later
[IN]for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    print(df[df['month']==month])
I think you can create a dictionary of DataFrames:
d = dict(tuple(df.groupby('month')))
Your loop fails because filter is a Python built-in, not a dictionary. Create a dictionary first and assign into that:
d = {}
for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    d[month] = df[df['month']==month]
Then it is possible to select each month like d['Jan'], which works like any other DataFrame.
If you want to loop over the dictionary of DataFrames:
for k, v in d.items():
    print(k)
    print(v)
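One thing to watch for: the sample data in the question shows month values in upper case (JAN, FEB, ...), while the loop uses 'Jan', 'Feb', and so on. Comparisons are case-sensitive, so the keys must match the stored values exactly. A minimal sketch, assuming the column really holds upper-case names and using only a few rows taken from the question:

```python
import pandas as pd

# A few rows in the shape shown in the question (upper-case month names).
df = pd.DataFrame({
    "month": ["JAN", "FEB", "MAR", "JAN"],
    "year": [2010, 2010, 2010, 2011],
    "Jelly": [12, 13, 12, 13],
})

# Build the dictionary from the values actually present in the column,
# so no hand-written list of month names can get out of sync with the data.
frames = {month: group for month, group in df.groupby("month")}

jan = frames["JAN"]
print(jan)
```

Building the keys from groupby avoids empty DataFrames caused by a spelling or case mismatch.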

Cumulative sum (pandas)

Apologies if this has been asked already.
I am trying to create a yearly cumulative sum for all order-points within a certain customer account, and am struggling.
Essentially, I want to create 'YearlyTotal' below:
Customer Year Date Order PointsPerOrder YearlyTotal
123456 2016 11/2/16 A939 1 20
123456 2016 3/13/16 A102 19 19
789089 2016 7/15/16 A123 7 7
I've tried:
df['YEARLYTOTAL'] = df.groupby(by=['Customer','Year'])['PointsPerOrder'].cumsum()
But this produces YearlyTotal in the wrong order (i.e., YearlyTotal of A939 is 1 instead of 20).
Not sure if this matters, but Customer is a string (the database has leading zeroes -- don't get me started). sort_values(by=['Customer','Year','Date'],ascending=True) at the front also produces an error.
Help?
Use [::-1] to reverse the DataFrame, so that the cumulative sum runs from the last row up:
df['YEARLYTOTAL'] = df[::-1].groupby(by=['Customer','Year'])['PointsPerOrder'].cumsum()
print (df)
Customer Year Date Order PointsPerOrder YearlyTotal YEARLYTOTAL
0 123456 2016 11/2/16 A939 1 20 20
1 123456 2016 3/13/16 A102 19 19 19
2 789089 2016 7/15/16 A123 7 7 7
First make sure Date is a datetime column:
In [35]: df.Date = pd.to_datetime(df.Date)
Now we can do:
In [36]: df['YearlyTotal'] = df.sort_values('Date').groupby(['Customer','Year'])['PointsPerOrder'].cumsum()
In [37]: df
Out[37]:
Customer Year Date Order PointsPerOrder YearlyTotal
0 123456 2016 2016-11-02 A939 1 20
1 123456 2016 2016-03-13 A102 19 19
2 789089 2016 2016-07-15 A123 7 7
PS: this solution does NOT depend on the order of the records.
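The sort-then-cumsum approach works because the result of cumsum() keeps the original row labels, so assigning it back aligns each running total with its own row. A self-contained sketch on the question's three rows; the explicit date format string is an assumption based on how the dates are written in the question.

```python
import pandas as pd

# Sample rows from the question; Customer is kept as a string.
df = pd.DataFrame({
    "Customer": ["123456", "123456", "789089"],
    "Year": [2016, 2016, 2016],
    "Date": ["11/2/16", "3/13/16", "7/15/16"],
    "Order": ["A939", "A102", "A123"],
    "PointsPerOrder": [1, 19, 7],
})

# Parse the dates so that sorting is chronological, not alphabetical.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")

# Sort by date, accumulate within each customer and year, then let
# index alignment put every running total back on its original row.
df["YearlyTotal"] = (
    df.sort_values("Date")
      .groupby(["Customer", "Year"])["PointsPerOrder"]
      .cumsum()
)
print(df)
```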

selecting a particular row from groupby object in python

id marks year
1 18 2013
1 25 2012
3 16 2014
2 16 2013
1 19 2013
3 25 2013
2 18 2014
Suppose I group the above on id with this command:
grouped = file.groupby(file.id)
I would like to get a new file containing, for each group, only the row with the most recent year, i.e. the highest year in that group.
Please let me know the command. I am trying with apply, but it only gives me a boolean result. I want the entire row with the latest year.
I cobbled this together using this: Python : Getting the Row which has the max value in groups using groupby
So basically we can groupby the 'id' column, then call transform on the 'year' column and create a boolean index where the year matches the max year value for each 'id':
In [103]:
df[df.groupby(['id'])['year'].transform(max) == df['year']]
Out[103]:
id marks year
0 1 18 2013
2 3 16 2014
4 1 19 2013
6 2 18 2014
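Note that the transform approach keeps every row that ties for the latest year, which is why id 1 appears twice above. If exactly one row per id is wanted, idxmax is one alternative; it picks the first row that reaches the maximum. A minimal sketch on the question's data (whether ties should be kept or dropped is an assumption the question does not settle):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 1, 3, 2, 1, 3, 2],
    "marks": [18, 25, 16, 16, 19, 25, 18],
    "year": [2013, 2012, 2014, 2013, 2013, 2013, 2014],
})

# idxmax returns, per group, the index label of the first row holding
# the maximum year; .loc then selects those complete rows.
latest = df.loc[df.groupby("id")["year"].idxmax()]
print(latest)
```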
