Python pandas create additional dataframe columns by grouping on existing column

Trying to create new dataframe columns from the contents of an existing column. Easier to explain with an example. I would like to convert this:
. Yr Month Class Cost
1 2015 1 L 19.2361
2 2015 1 M 29.4723
3 2015 1 S 48.5980
4 2015 1 T 169.7630
5 2015 2 L 19.1506
6 2015 2 M 30.0886
7 2015 2 S 49.3765
8 2015 2 T 167.0000
9 2015 3 L 19.3465
10 2015 3 M 29.1991
11 2015 3 S 46.2580
12 2015 3 T 157.7916
13 2015 4 L 18.3165
14 2015 4 M 28.2314
15 2015 4 S 44.5844
16 2015 4 T 162.3241
17 2015 5 L 17.4556
18 2015 5 M 27.0434
19 2015 5 S 42.8841
20 2015 5 T 159.3457
21 2015 6 L 16.5343
22 2015 6 M 24.9853
23 2015 6 S 40.5612
24 2015 6 T 153.4902
...into the following so that I can plot 4 separate lines [L, M, S, T]:
. Yr Month L M S T
1 2015 1 19.2361 29.4723 48.5980 169.7630
2 2015 2 19.1506 30.0886 49.3765 167.0000
3 2015 3 19.3465 29.1991 46.2580 157.7916
4 2015 4 18.3165 28.2314 44.5844 162.3241
5 2015 5 17.4556 27.0434 42.8841 159.3457
6 2015 6 16.5343 24.9853 40.5612 153.4902
I was able to work through it in what feels like a very clumsy way, by filtering the dataframe on the 'class' column... and then 3 separate merges.
list_class = ['L', 'M', 'S', 'T']
year = 'Yr'
month = 'Month'
df_class = pd.DataFrame()
df_class1 = pd.DataFrame()
df_class2 = pd.DataFrame()
df_class1 = pd.merge(df[[month, year, 'Class','Cost']][df['Class']==list_class[0]], df[[month, year, 'Class','Cost']][df['Class']==list_class[1]], \
    left_on=[month, year], right_on=[month, year])
df_class2 = pd.merge(df[[month, year, 'Class','Cost']][df['Class']==list_class[2]], df[[month, year, 'Class','Cost']][df['Class']==list_class[3]], \
    left_on=[month, year], right_on=[month, year])
df_class = pd.merge(df_class1, df_class2, left_on=[month, year], right_on=[month, year]).groupby([year, month]).mean().plot(figsize=(15,8))
There must be a more efficient way. Feels like it should be done with groupby, but I couldn't nail it down.

You can first move Yr, Month and Class into a multi-level index and then unstack the Class level, which gives you the layout you want. Suppose df is the original dataframe shown at the very beginning of your post.
df.set_index(['Yr', 'Month', 'Class'])['Cost'].unstack('Class')
Out[29]:
Class             L        M        S         T
Yr   Month
2015 1      19.2361  29.4723  48.5980  169.7630
     2      19.1506  30.0886  49.3765  167.0000
     3      19.3465  29.1991  46.2580  157.7916
     4      18.3165  28.2314  44.5844  162.3241
     5      17.4556  27.0434  42.8841  159.3457
     6      16.5343  24.9853  40.5612  153.4902
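For readers who want to try the reshaping, here is a minimal sketch on the first two months of the question's data. It shows the set_index/unstack approach from the answer next to pivot and pivot_table, which are alternative ways to get the same wide layout. The variable names (wide_unstack, wide_pivot, wide_table) are made up for illustration, and passing a list as the pivot index assumes pandas 1.1 or later.

```python
import pandas as pd

# Small sample in the same long format as the question (first two months only).
df = pd.DataFrame({
    "Yr": [2015] * 8,
    "Month": [1, 1, 1, 1, 2, 2, 2, 2],
    "Class": ["L", "M", "S", "T"] * 2,
    "Cost": [19.2361, 29.4723, 48.5980, 169.7630,
             19.1506, 30.0886, 49.3765, 167.0000],
})

# Approach from the answer: move the keys into the index, then unstack Class.
wide_unstack = df.set_index(["Yr", "Month", "Class"])["Cost"].unstack("Class")

# Same reshaping with pivot; it raises if a (Yr, Month, Class) combination repeats.
wide_pivot = df.pivot(index=["Yr", "Month"], columns="Class", values="Cost")

# pivot_table aggregates repeated combinations (here with the mean) instead of raising.
wide_table = df.pivot_table(index=["Yr", "Month"], columns="Class",
                            values="Cost", aggfunc="mean")

print(wide_unstack)
```

Calling wide_unstack.plot(figsize=(15, 8)) on the result should then draw one line per class, provided matplotlib is installed.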

Related

Move data from row 1 to row 0

I have this function written in Python. I want it to show the difference between consecutive rows of the production column.
Here's the code:
def print_df():
    mycursor.execute("SELECT * FROM productions")
    myresult = mycursor.fetchall()
    myresult.sort(key=lambda x: x[0])
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Dif'] = abs(df['Production (Ton)'].diff())
    print(abs(df))
And of course the output is this
Year Production (Ton) Dif
0 2010 339491 NaN
1 2011 366999 27508.0
2 2012 361986 5013.0
3 2013 329461 32525.0
4 2014 355464 26003.0
5 2015 344998 10466.0
6 2016 274317 70681.0
7 2017 200916 73401.0
8 2018 217246 16330.0
9 2019 119830 97416.0
10 2020 66640 53190.0
But I want the output like this
Year Production (Ton) Dif
0 2010 339491 27508.0
1 2011 366999 5013.0
2 2012 361986 32525.0
3 2013 329461 26003.0
4 2014 355464 10466.0
5 2015 344998 70681.0
6 2016 274317 73401.0
7 2017 200916 16330.0
8 2018 217246 97416.0
9 2019 119830 53190.0
10 2020 66640 66640.0
What should I change or add to my code?

You can use a negative period input to diff to get the differences the way you want, and then fillna to fill the last value with the value from the Production column:
df['Dif'] = df['Production (Ton)'].diff(-1).fillna(df['Production (Ton)']).abs()
Output:
Year Production (Ton) Dif
0 2010 339491 27508.0
1 2011 366999 5013.0
2 2012 361986 32525.0
3 2013 329461 26003.0
4 2014 355464 10466.0
5 2015 344998 70681.0
6 2016 274317 73401.0
7 2017 200916 16330.0
8 2018 217246 97416.0
9 2019 119830 53190.0
10 2020 66640 66640.0

Use shift(-1) to shift all rows one position up.
df['Dif'] = (df['Production (Ton)'] - df['Production (Ton)'].shift(-1).fillna(0)).abs()
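Notice that fillna(0) treats the missing value after the last row as 0, so there are no NaNs and the last row ends up with its own production value (66640).
You can also use diff, although in this variant the last row becomes 0 rather than 66640: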
df['Dif'] = df['Production (Ton)'].diff().shift(-1).fillna(0).abs()
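The variants in the two answers above differ only in how they treat the last row. A small sketch on the last three years of the question's data makes this visible; the column names Dif_fill_prod and Dif_fill_zero are invented here for the comparison.

```python
import pandas as pd

# Last three rows of the production data from the question.
df = pd.DataFrame({
    "Year": [2018, 2019, 2020],
    "Production (Ton)": [217246, 119830, 66640],
})
prod = df["Production (Ton)"]

# diff(-1) computes row[i] - row[i+1]; the last row has no successor and is NaN,
# which is then filled with the production value of that row.
df["Dif_fill_prod"] = prod.diff(-1).fillna(prod).abs()

# diff().shift(-1) moves the ordinary differences up by one row;
# filling the resulting NaN with 0 leaves 0 in the last row.
df["Dif_fill_zero"] = prod.diff().shift(-1).fillna(0).abs()

print(df)
```

Which fill value is appropriate depends on what the last row is supposed to mean; the desired output in the question corresponds to the first variant.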

How to create a column whose values are based on the values of another column?

I have a df like this:
Year 2016 2017
Month
1 0.979000 1.109000
2 0.974500 1.085667
3 1.004000 1.075667
4 1.027333 1.184000
5 1.049000 1.089000
6 1.013250 1.085500
7 0.999000 1.059000
8 0.996667 1.104000
9 1.024000 1.121333
10 1.019000 1.126333
11 0.949000 1.183000
12 1.074000 1.203000
How can I add a 'Season' column that is populated with "Spring", "Summer", etc. based on the numerical value of the month? E.g. months 12, 1 and 2 = Winter.

You could use np.select with pd.Series.between. This assumes Month is a regular column, so call df.reset_index() first if it is still the index:
import numpy as np
df["Season"] = np.select([df["Month"].between(3, 5),
                          df["Month"].between(6, 8),
                          df["Month"].between(9, 11)],
                         ["Spring", "Summer", "Fall"],
                         "Winter")
Month 2016 2017 Season
0 1 0.979000 1.109000 Winter
1 2 0.974500 1.085667 Winter
2 3 1.004000 1.075667 Spring
3 4 1.027333 1.184000 Spring
4 5 1.049000 1.089000 Spring
5 6 1.013250 1.085500 Summer
6 7 0.999000 1.059000 Summer
7 8 0.996667 1.104000 Summer
8 9 1.024000 1.121333 Fall
9 10 1.019000 1.126333 Fall
10 11 0.949000 1.183000 Fall
11 12 1.074000 1.203000 Winter
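Here is a self-contained sketch of the np.select approach on five rows taken from the question, followed by an alternative that looks each month up in a dictionary. The names conditions, months_by_season, season_of and the Season_map column are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Five months from the question, with Month as the index as in the original frame.
df = pd.DataFrame(
    {2016: [0.979, 1.027333, 0.999, 1.019, 1.074],
     2017: [1.109, 1.184, 1.059, 1.126333, 1.203]},
    index=pd.Index([1, 4, 7, 10, 12], name="Month"),
).reset_index()  # turn Month into a regular column

# np.select picks the first matching choice per row; unmatched rows get the default.
conditions = [
    df["Month"].between(3, 5),
    df["Month"].between(6, 8),
    df["Month"].between(9, 11),
]
df["Season"] = np.select(conditions, ["Spring", "Summer", "Fall"], default="Winter")

# Alternative: build an explicit month -> season lookup and map it.
months_by_season = {
    "Winter": (12, 1, 2),
    "Spring": (3, 4, 5),
    "Summer": (6, 7, 8),
    "Fall": (9, 10, 11),
}
season_of = {m: season for season, months in months_by_season.items() for m in months}
df["Season_map"] = df["Month"].map(season_of)

print(df)
```

The dictionary version spells out every month, which can be easier to adjust if the season boundaries differ from the ones assumed here.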

You could iterate through the Month column, appending each season to a new list which you then add as a column.
seasons = []
for m in df['Month']:
    # note: `m == 12 or 1 or 2` is always true, so test membership instead
    if m in (12, 1, 2):
        seasons.append("Winter")
    else:
        seasons.append(None)  # replace with elif branches for the other seasons
Then add your other conditions with elif and else statements and assign the list to your main df afterwards, e.g. df['Season'] = seasons. Let me know if this helps.

loop to filter rows based on multiple column conditions pandas python

df
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
FEB 2010 13 1 2
MAR 2010 12 2 1
....
DEC 2019 2 3 4
I have code to extract the rows for a given month (Jan, Feb, etc.) across all years. For example:
[IN]filterJan=df[df['month']=='JAN']
filterJan
[OUT]
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
JAN 2011 13 1 2
....
JAN 2019 2 3 4
I am trying to write a loop for this process.
[IN]for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    filter[month]=df[df['month']==month]
[OUT]
----> 3 filter[month]=batch1_clean_Sales_database[batch1_clean_Sales_database['month']==month]
TypeError: 'type' object does not support item assignment
If I print the dataframes it works, but I want to store them and reuse them later.
[IN]for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    print(df[df['month']==month])

I think you can create a dictionary of DataFrames:
d = dict(tuple(df.groupby('month')))
Your loop fails because filter is Python's built-in filter type, which does not support item assignment. Assign into a dictionary instead:
d = {}
for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    d[month] = df[df['month']==month]
Then you can select each month with, for example, d['Jan'], which works like any other DataFrame.
If you want to loop over the dictionary of DataFrames:
for k, v in d.items():
    print (k)
    print (v)
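A small runnable sketch of the dictionary idea follows, using made-up rows shaped like the question's table. The question's data shows upper-case labels such as JAN while the loop compares against 'Jan', and the comparison is case-sensitive, so the second part normalises the case. The names by_month, wanted and by_title are hypothetical.

```python
import pandas as pd

# Made-up sample in the shape of the question's table.
df = pd.DataFrame({
    "month": ["JAN", "FEB", "JAN", "FEB"],
    "year": [2010, 2010, 2011, 2011],
    "Jelly": [12, 13, 13, 2],
})

# One DataFrame per distinct month label, keyed by that label.
by_month = {month: group for month, group in df.groupby("month")}
jan = by_month["JAN"]

# String comparison is case-sensitive: 'Jan' does not match 'JAN'.
# Normalising the column first lets title-case keys work as well.
wanted = ["Jan", "Feb"]
by_title = {m: df[df["month"].str.title() == m] for m in wanted}

print(jan)
print(by_title["Feb"])
```

Iterating over groupby yields (label, sub-frame) pairs, which is why both the comprehension and dict(tuple(df.groupby('month'))) produce the same mapping.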

Re-format Dataframe column such that any numeric month substring is replaced with month string

I am looking to reformat a string column because it is causing errors in Django. My df:
import pandas as pd
data = {'Date_Str': ['2018_11','2018_12','2019_01','2019_02','2019_03','2019_04','2019_05','2019_06','2019_07','2019_08','2019_09','2019_10',],}
df = pd.DataFrame(dict(data))
print(df)
Date_Str
0 2018_11
1 2018_12
2 2019_01
3 2019_02
4 2019_03
5 2019_04
6 2019_05
7 2019_06
8 2019_07
9 2019_08
10 2019_09
11 2019_10
My solution:
df['Date_Month'] = df.Date_Str.str[-2:]
mapper = {'01':'Jan', '02':'Feb', '03':'Mar','04':'Apr','05':'May','06':'Jun','07':'Jul','08':'Aug','09':'Sep','10':'Oct','11':'Nov','12':'Dec'}
df['Date_Month_Str'] = df.Date_Str.str[0:4] + '_' + df.Date_Month.map(mapper)
print(df)
Desired output is column Date_Month_Str or simply update Date_Str with yyyy_mmm
Date_Str Date_Month Date_Month_Str
0 2018_11 11 2018_Nov
1 2018_12 12 2018_Dec
2 2019_01 01 2019_Jan
3 2019_02 02 2019_Feb
4 2019_03 03 2019_Mar
5 2019_04 04 2019_Apr
6 2019_05 05 2019_May
7 2019_06 06 2019_Jun
8 2019_07 07 2019_Jul
9 2019_08 08 2019_Aug
10 2019_09 09 2019_Sep
11 2019_10 10 2019_Oct
Can the three lines be reduced to one? Or simply update Date_Str with a one liner?

Convert column to datetimes and then use Series.dt.strftime:
df['Date_Month_Str'] = pd.to_datetime(df.Date_Str, format='%Y_%m').dt.strftime('%Y_%b')
print(df)
Date_Str Date_Month_Str
0 2018_11 2018_Nov
1 2018_12 2018_Dec
2 2019_01 2019_Jan
3 2019_02 2019_Feb
4 2019_03 2019_Mar
5 2019_04 2019_Apr
6 2019_05 2019_May
7 2019_06 2019_Jun
8 2019_07 2019_Jul
9 2019_08 2019_Aug
10 2019_09 2019_Sep
11 2019_10 2019_Oct
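One caveat with strftime('%b') is that the month abbreviation follows the process locale. As a locale-independent alternative, here is a sketch that reuses the mapper dictionary from the question inside a single Series.str.replace call with a callable replacement. The regular expression is an assumption about the format: it expects the value to end in an underscore followed by two digits.

```python
import pandas as pd

df = pd.DataFrame({"Date_Str": ["2018_11", "2018_12", "2019_01", "2019_02"]})

# Same lookup table as in the question.
mapper = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr',
          '05': 'May', '06': 'Jun', '07': 'Jul', '08': 'Aug',
          '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}

# Replace the trailing "_MM" with "_Mon"; the callable receives each regex match.
df["Date_Month_Str"] = df["Date_Str"].str.replace(
    r"_(\d{2})$", lambda m: "_" + mapper[m.group(1)], regex=True
)

print(df)
```

A month number that is missing from mapper would raise a KeyError here, whereas values that do not match the pattern are left unchanged.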

groupby().mean() don't work under for loop

I have a dictionary named c whose values are dataframes. Each dataframe has 3 columns: 'year', 'month' and 'Tmed'. I want to calculate the monthly mean values of Tmed for each year, so I used
for i in range(22) : c[i].groupby(['year','month']).mean().reset_index()
This returns
year month Tmed
0 2018 12 14.8
2 2018 12 12.0
3 2018 11 16.1
5 2018 11 9.8
6 2018 11 9.8
9 2018 11 9.3
4425 rows × 3 columns
The index is not as it should be, and for the 11th month of 2018 for example, there should be only one row but as you see the dataframe has more than one.
I tried the code on a single dataframe and it gave the wanted result :
c[3].groupby(['year','month']).mean().reset_index()
year month Tmed
0 1999 9 23.950000
1 1999 10 19.800000
2 1999 11 12.676000
3 1999 12 11.012000
4 2000 1 9.114286
5 2000 2 12.442308
6 2000 3 13.403704
7 2000 4 13.803846
8 2000 5 17.820000
.
.
.
218 2018 6 21.093103
219 2018 7 24.977419
220 2018 8 26.393103
221 2018 9 24.263333
222 2018 10 19.069565
223 2018 11 13.444444
224 2018 12 13.400000
225 rows × 3 columns
I need to use a for loop because I have many dataframes. I can't figure out the issue; any help would be appreciated.

I don't see a reason why your code should fail. I tried the code below and got the required results. Note that it assigns the result back to c[i]; groupby does not modify the dataframe in place.
import numpy as np
import pandas as pd
def getRandomDataframe():
    rand_year = pd.DataFrame(np.random.randint(2010, 2011,size=(50, 1)), columns=list('y'))
    rand_month = pd.DataFrame(np.random.randint(1, 13,size=(50, 1)), columns=list('m'))
    rand_value = pd.DataFrame(np.random.randint(0, 100,size=(50, 1)), columns=list('v'))
    df = pd.DataFrame(columns=['year', 'month', 'value'])
    df['year'] = rand_year
    df['month'] = rand_month
    df['value'] = rand_value
    return df
def createDataFrameDictionary():
    _dict = {}
    length = 3
    for i in range(length):
        _dict[i] = getRandomDataframe()
    return _dict
c = createDataFrameDictionary()
for i in range(3):
    c[i] = c[i].groupby(['year','month'])['value'].mean().reset_index()
# Check results
print(c[0])

Please check whether the same (year, month) combination occurs in more than one dataframe, which could explain the repeated rows.
In your scenario, it may be a good idea to collect the groupby().mean() results of each dataframe in one combined dataframe and then do a groupby mean again on that combined dataframe.

Can you try the following:
main_df = pd.DataFrame()
for i in range(22):
    main_df = pd.concat([main_df, c[i].groupby(['year','month']).mean().reset_index()])
print(main_df.groupby(['year','month']).mean())
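Since the loop in the question discards each groupby result, one way to keep them is to build a second dictionary. The sketch below uses two tiny made-up frames in place of the 22 real ones; monthly is a hypothetical name, and as_index=False is used here instead of reset_index().

```python
import pandas as pd

# Two tiny made-up frames standing in for the dataframes stored in c.
c = {
    0: pd.DataFrame({"year": [2018, 2018, 2018],
                     "month": [11, 11, 12],
                     "Tmed": [16.0, 10.0, 14.0]}),
    1: pd.DataFrame({"year": [1999, 1999],
                     "month": [9, 9],
                     "Tmed": [23.0, 25.0]}),
}

# groupby returns a new object, so the result must be stored somewhere.
# as_index=False keeps year and month as ordinary columns.
monthly = {
    key: frame.groupby(["year", "month"], as_index=False)["Tmed"].mean()
    for key, frame in c.items()
}

print(monthly[0])
```

The original frames in c are left untouched, so the raw data stays available next to the monthly means.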
