Plot a DataFrame based on a grouped-by column in Python

Based on the code below, I'm trying to assign some columns to my DataFrame, which has been grouped by the month of the date, and it works well:
all_together = (df_clean.groupby(df_clean['ContractDate'].dt.strftime('%B'))
                        .agg({'Amount': [np.sum, np.mean, np.min, np.max]})
                        .rename(columns={'sum': 'sum_amount', 'mean': 'avg_amount',
                                         'amin': 'min_amount', 'amax': 'max_amount'}))
But for some reason, when I try to plot the result (as any kind of plot), it doesn't recognize "ContractDate" as a column, nor any of the renamed columns such as 'sum_amount'.
Do you have any idea what the issue is and what I'm missing as a rule for plotting the data?
I have tried the code below for plotting, and it asks me what "ContractDate" and "sum_amount" are!
all_together.groupby(df_clean['ContractDate'].dt.strftime('%B'))['sum_amount'].nunique().plot(kind='bar')
# or
all_together.plot(kind='bar', x='ContractDate', y='sum_amount')
I really appreciate your time
Cheers,
z.A

When you apply a groupby on a DataFrame, the column you group by becomes the index (ContractDate in your case). So you need to reset the index first to turn it back into a column.
import numpy as np
import pandas as pd

df = pd.DataFrame({'month': ['jan', 'feb', 'jan', 'feb'], 'v2': [23, 56, 12, 59]})
t = df.groupby('month').agg('sum')
Output:
        v2
month
feb    115
jan     35
So as you can see, you're getting the months as the index. Then when you reset the index:
t.reset_index()
Output:
  month   v2
0   feb  115
1   jan   35
Next, when you apply multiple agg functions to a single column in the groupby, it creates a DataFrame with MultiIndex columns. So you need to flatten them into a single-level index:
t = (df.groupby('month')
       .agg({'v2': [np.sum, np.mean, np.min, np.max]})
       .rename(columns={'sum': 'sum_amount', 'mean': 'avg_amount',
                        'amin': 'min_amount', 'amax': 'max_amount'}))
              v2
      sum_amount  avg_amount  min_amount  max_amount
month
feb          115        57.5          56          59
jan           35        17.5          12          23
It created a MultiIndex. If you check t.columns, you get:
MultiIndex(levels=[['v2'], ['avg_amount', 'max_amount', 'min_amount', 'sum_amount']],
labels=[[0, 0, 0, 0], [3, 0, 2, 1]])
Now use this:
t.columns = t.columns.get_level_values(1)
t.reset_index(inplace=True)
You will get a clean dataframe:
  month  sum_amount  avg_amount  min_amount  max_amount
0   feb         115        57.5          56          59
1   jan          35        17.5          12          23
Hope this helps with your plotting.
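Applied back to the original example, a minimal sketch (assuming df_clean has a datetime 'ContractDate' column and a numeric 'Amount' column, as in the question) could look like this:
import numpy as np
import pandas as pd

# group by month name, aggregate, then flatten the MultiIndex columns
all_together = (df_clean.groupby(df_clean['ContractDate'].dt.strftime('%B'))
                        .agg({'Amount': [np.sum, np.mean, np.min, np.max]})
                        .rename(columns={'sum': 'sum_amount', 'mean': 'avg_amount',
                                         'amin': 'min_amount', 'amax': 'max_amount'}))
all_together.columns = all_together.columns.get_level_values(1)

# turn the month index back into a regular column named 'ContractDate'
all_together = all_together.reset_index()

# now both names exist as columns, so plotting works
all_together.plot(kind='bar', x='ContractDate', y='sum_amount')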

Related

A simple way of selecting the previous row in a column and performing an operation?

I'm trying to create a forecast which takes the previous day's 'Forecast' total and adds it to the current day's 'Appt'. Something that is straightforward in Excel, but I'm struggling with in pandas. At the moment, all I can get in pandas using .loc is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02', '2022-12-03', '2022-12-04', '2022-12-05'],
              'Appt': [12, 10, 5, 4, 13],
              'Forecast': [37, 0, 0, 0, 0]})
What I'm looking for it to do is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02', '2022-12-03', '2022-12-04', '2022-12-05'],
              'Appt': [12, 10, 5, 4, 13],
              'Forecast': [37, 47, 52, 56, 69]})
E.g. the 'Forecast' total on the 1st of December is 37. On the 2nd of December the value in the 'Appt' column is 10. I want it to take 37, add 10, and put the result in the 'Forecast' column for the 2nd of December, then iterate over the rest of the column.
I've tried using .loc with the index, and experimented with .shift(), but neither seems to work for what I'd like. I also looked into .rolling(), but I think that's not appropriate either.
I'm sure there must be a simple way to do this?
Apologies, the original df has 'Date' as a datetime column.
You can use mask and cumsum:
df['Forecast'] = df['Forecast'].mask(df['Forecast'].eq(0), df['Appt']).cumsum()
# or
df['Forecast'] = np.where(df['Forecast'].eq(0), df['Appt'], df['Forecast']).cumsum()
Output:
        Date  Appt  Forecast
0 2022-12-01    12        37
1 2022-12-02    10        47
2 2022-12-03     5        52
3 2022-12-04     4        56
4 2022-12-05    13        69
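For reference, a minimal self-contained sketch using the sample data and column names from the question:
import pandas as pd

df = pd.DataFrame({'Date': ['2022-12-01', '2022-12-02', '2022-12-03', '2022-12-04', '2022-12-05'],
                   'Appt': [12, 10, 5, 4, 13],
                   'Forecast': [37, 0, 0, 0, 0]})

# replace the 0 placeholders with that day's 'Appt', then take a running total
df['Forecast'] = df['Forecast'].mask(df['Forecast'].eq(0), df['Appt']).cumsum()
print(df)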
You have to make sure that your column has a datetime/date type; then you can filter the df like this:
# previous code & imports
from datetime import datetime, timedelta

yesterday = datetime.now().date() - timedelta(days=1)
df[df["date"] == yesterday]["your_column"].sum()

weighted average aggregation on multiple columns of df

I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group  Year  Month  Weight(kg)  Nitrogen  Calcium
A      2020  01     10000       10        70
A      2020  01     15000       4         78
A      2021  05     12000       5         66
A      2021  05     10000       8         54
B      2021  08     14000       10        90
C      2021  08     50000       20        92
C      2021  08     40000       10        95
My desired result would look something like this:
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "year", "month"]).apply(wavg, "Calcium", "Weight(kg)").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column, whilst I have dozens of columns. I therefore tried a for loop:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to each other. They also lack a useful column name:
How could I adapt my code to return the desired df?
Change the function so it works on multiple columns, and to avoid losing the grouping columns, convert them to a MultiIndex:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return (d.mul(w, axis=0)).div(w.sum())
    except ZeroDivisionError:
        return d.mean()
# columns used for groupby
groups = ["Group", "Year", "Month"]
# process all the other columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
# create the index and process all columns in variable cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print(df1)
  Group  Year  Month    Calcium   Nitrogen
0     A  2020      1  28.000000   4.000000
1     A  2020      1  46.800000   2.400000
2     A  2021      5  36.000000   2.727273
3     A  2021      5  24.545455   3.636364
4     B  2021      8  90.000000  10.000000
5     C  2021      8  51.111111  11.111111
6     C  2021      8  42.222222   4.444444
Try via concat() and reset_index():
df = pd.concat(column_list, axis=1).reset_index()
OR
you can make changes here:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").reset_index())
# Finally:
df = pd.concat(column_list, axis=1)
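As a further option (not from either answer above), a minimal sketch that returns one weighted-average row per group, assuming the column names from the sample table:
import pandas as pd

def weighted_means(g, cols, weight):
    # weighted average of each column in cols, using 'weight' as the weights
    w = g[weight]
    return pd.Series({c: (g[c] * w).sum() / w.sum() for c in cols})

groups = ["Group", "Year", "Month"]
value_cols = ["Nitrogen", "Calcium"]

result = (df.groupby(groups)
            .apply(weighted_means, value_cols, "Weight(kg)")
            .reset_index())
print(result)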

Pandas: how to get the rows that has the maximum value_count on a column grouping by another column as a dataframe

I have three columns in a pandas DataFrame: Date, Hour and Content. I want to get the hour in each day that has the most content for that day. I am using messages.groupby(["Date", "Hour"]).Content.count().groupby(level=0).tail(1). I don't know what groupby(level=0) is doing here. It outputs as follows -
Date Hour
2018-04-12 23 4
2018-04-13 21 43
2018-04-14 9 1
2018-04-15 23 29
2018-04-16 17 1
..
2020-04-23 20 1
2020-04-24 22 1
2020-04-25 20 1
2020-04-26 23 32
2020-04-27 23 3
This is a pandas Series object, and my desired Date and Hour columns form a MultiIndex here. If I try to convert the MultiIndex object to a dataframe using pd.DataFrame(most_active.index), most_active being the output of the previous code, it creates a dataframe of tuples as below -
0
0 (2018-04-12, 23)
1 (2018-04-13, 21)
2 (2018-04-14, 9)
3 (2018-04-15, 23)
4 (2018-04-16, 17)
.. ...
701 (2020-04-23, 20)
702 (2020-04-24, 22)
703 (2020-04-25, 20)
704 (2020-04-26, 23)
705 (2020-04-27, 23)
But I need two separate columns of Date and Hour. What is the best way for this?
Edit because I misunderstood your question
First, you have to count the total content by date-hour, just like you did:
df = messages.groupby(["Date", "Hour"], as_index=False).Content.count()
Here, I left the groups in their original columns by passing the parameter as_index=False.
Then, you can run the code below, provided in the original answer:
Supposing you have unique index IDs (if not, just do df.reset_index(inplace=True)), you can use the idxmax method of groupby. It returns the index of the biggest value per group, and you can then use those indices to slice the dataframe.
For example:
df.loc[df.groupby('Date')['Content'].idxmax()]
As an alternative (without using groupby), you can first sort the values in descending order, then remove the Date duplicates so only the top Hour per Date remains:
df.sort_values('Content', ascending=False).drop_duplicates(subset=['Date'])
Finally, you get a MultiIndex with the set_index() method:
df.set_index(['Date','Hour'])
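Alternatively, if you just want to split the MultiIndex of the Series produced by your original code into separate Date and Hour columns, a minimal sketch (using most_active as in your post):
most_active = messages.groupby(["Date", "Hour"]).Content.count().groupby(level=0).tail(1)

# reset_index turns the Date/Hour index levels into regular columns
result = most_active.reset_index()
print(result)   # columns: Date, Hour, Content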

Pandas groupby use aggregate based on two columns

Imagine I have the following dataframe:
np.random.seed(42)
t = pd.DataFrame({'year': 4*['2018'] + 3*['2019'] + 4*['2016'],
                  'pop': np.random.randint(10, 100, size=(11)),
                  'production': np.random.randint(2000, 40000, size=(11))})
print(t)
    year  pop  production
0   2018   61        3685
1   2018   24        2769
2   2018   81        4433
3   2018   70        7311
4   2019   30       39819
5   2019   92       19568
6   2019   96       21769
7   2016   84       30693
8   2016   84        8396
9   2016   97       29480
10  2016   33       27658
I want to find the sum of production divided by the sum of pop for each year; my final dataframe would be something like:
tmp = t.groupby('year').sum()
tmp['production']/tmp['pop']
year
2016 322.909396
2018 77.110169
2019 372.275229
I was thinking if it could be done using groupby year and then using agg based on two columns, something like:
# doesn't work
t.groupby('year').agg(prod_per_pop=(['pop', 'production'],
                                    lambda x: x['production'].sum() / x['pop'].sum()))
My question is basically if it is possible to use any pandas groupby method to achieve that in an easy way rather than having to create another dataframe and then having to divide.
You could use a lambda function with axis=1 to solve it in a single line:
t.groupby('year')[['pop', 'production']].agg('sum').apply(lambda x: x['production'] / x['pop'], axis=1)
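As an alternative sketch (not from the original answer), the same result can be computed in one groupby pass with apply, assuming the DataFrame t from the question:
# one ratio per year: sum(production) / sum(pop)
t.groupby('year').apply(lambda g: g['production'].sum() / g['pop'].sum())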

Pandas dataframe.set_index() deletes previous index and column

I just came across a strange phenomenon with pandas DataFrames: when setting a new index using DataFrame.set_index('some_index'), the old column that was previously the index is deleted! Here is an example:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
>>> df_mn
sale year
month
1 55 2012
4 40 2014
7 84 2013
10 31 2014
Now I change the index to year:
df_mn.set_index('year')
sale
year
2012 55
2014 40
2013 84
2014 31
.. and the month column was removed with the index. This is very irritating, because I just wanted to swap the DataFrame index.
Is there a way to not have the previous column that was an index from being deleted? Maybe through something like: DataFrame.set_index('new_index',delete_previous_index=False)
Thanks for any advice
You can do the following
>>> df_mn.reset_index().set_index('year')
month sale
year
2012 1 55
2014 4 40
2013 7 84
2014 10 31
The solution I found to retain a column is to set drop=False:
dataframe.set_index('some_column', drop=False)
Note that drop=False keeps the column you set as the new index as a regular column; it does not bring back the old index. This is not the perfect answer but it works!
No, in such cases you have to save your previous column, like shown
below:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
df_mn['month'] = df_mn.index #Save it as another column, and then run set_index with year column as value.
df_mn.set_index('year')
Besides, you are working on a copy of the dataframe (df_mn), so the original dataframe df remains unchanged and you can use it again.
Also, if you don't set the inplace argument of set_index to True, df_mn itself won't change even after you call set_index() on it.
Also, like the other answer you can always use reset_index().
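Another option, not mentioned above, is to keep the existing index and append the new one, which gives you a MultiIndex of month and year; a minimal sketch based on the df_mn from the question:
# keep 'month' in the index and add 'year' as a second level
df_my = df_mn.set_index('year', append=True)
print(df_my)
Which would give something like:
            sale
month year
1     2012    55
4     2014    40
7     2013    84
10    2014    31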
