groupby().mean() doesn't work under a for loop - python

I have a dictionary named c whose values are dataframes; each dataframe has 3 columns: 'year', 'month' & 'Tmed'. I want to calculate the monthly mean values of Tmed for each year, so I used
for i in range(22): c[i].groupby(['year','month']).mean().reset_index()
This returns
year month Tmed
0 2018 12 14.8
2 2018 12 12.0
3 2018 11 16.1
5 2018 11 9.8
6 2018 11 9.8
9 2018 11 9.3
4425 rows × 3 columns
The index is not as it should be, and for the 11th month of 2018, for example, there should be only one row, but as you can see the dataframe has more than one.
I tried the code on a single dataframe and it gave the expected result:
c[3].groupby(['year','month']).mean().reset_index()
year month Tmed
0 1999 9 23.950000
1 1999 10 19.800000
2 1999 11 12.676000
3 1999 12 11.012000
4 2000 1 9.114286
5 2000 2 12.442308
6 2000 3 13.403704
7 2000 4 13.803846
8 2000 5 17.820000
.
.
.
218 2018 6 21.093103
219 2018 7 24.977419
220 2018 8 26.393103
221 2018 9 24.263333
222 2018 10 19.069565
223 2018 11 13.444444
224 2018 12 13.400000
225 rows × 3 columns
I need to use a for loop because I have many dataframes. I can't figure out the issue; any help would be appreciated.

I don't see a reason why your code should fail. I tried the code below and got the required results:
import numpy as np
import pandas as pd
def getRandomDataframe():
    rand_year = pd.DataFrame(np.random.randint(2010, 2011, size=(50, 1)), columns=list('y'))
    rand_month = pd.DataFrame(np.random.randint(1, 13, size=(50, 1)), columns=list('m'))
    rand_value = pd.DataFrame(np.random.randint(0, 100, size=(50, 1)), columns=list('v'))
    df = pd.DataFrame(columns=['year', 'month', 'value'])
    df['year'] = rand_year
    df['month'] = rand_month
    df['value'] = rand_value
    return df

def createDataFrameDictionary():
    _dict = {}
    length = 3
    for i in range(length):
        _dict[i] = getRandomDataframe()
    return _dict

c = createDataFrameDictionary()

for i in range(3):
    c[i] = c[i].groupby(['year', 'month'])['value'].mean().reset_index()

# Check results
print(c[0])

Please check whether a year, month combination repeats across different dataframes, which could be the reason for the repeated rows.
In your scenario, it may be a good idea to collect the groupby.mean results for each dataframe in another dataframe and do a groupby mean again on the new dataframe.
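A quick way to run that check (a sketch, assuming c is the dictionary of 22 dataframes from the question):
from collections import Counter

# distinct (year, month) pairs in each dataframe
pairs_per_df = [set(map(tuple, c[i][['year', 'month']].drop_duplicates().values)) for i in range(22)]
# count in how many dataframes each pair occurs
counts = Counter(pair for pairs in pairs_per_df for pair in pairs)
repeated = [pair for pair, n in counts.items() if n > 1]
print(repeated[:10])  # any output means the same year/month shows up in several dataframes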

Can you try the following:
main_df = pd.DataFrame()
for i in range(22):
    main_df = pd.concat([main_df, c[i].groupby(['year','month']).mean().reset_index()])
print(main_df.groupby(['year','month']).mean())
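Keep in mind that the second groupby here averages the per-dataframe monthly means, so every dataframe that contains a given year/month contributes equally regardless of how many rows it has. If the raw rows are still available, an alternative (a sketch, assuming c still holds the original dataframes) is to concatenate them first and aggregate once:
all_rows = pd.concat(list(c.values()), ignore_index=True)
monthly_means = all_rows.groupby(['year', 'month'])['Tmed'].mean().reset_index()
print(monthly_means)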

Related

Pandas Groupby and remove rows with numbers that are close to each other

I have a data frame df
df =
Code Bikes Year
12 356 2020
4 378 2020
2 389 2020
35 378 2021
40 370 2021
32 350 2021
I would like to group the data frame by Year using df.groupby('Year'), check for close values in the df['Code'] column, i.e. values that are within 3 of each other, and retain only the row with the maximum value in the df['Bikes'] column out of them.
For instance, in the 2020 group, the values 4 and 2 are within 3 of each other since 4 - 2 = 2 ≤ 3, and since 389 (df['Bikes']), corresponding to df['Code'] = 2, is the higher of the two, that row is retained and the row where df['Code'] = 4 is dropped.
The expected output for the given example:
Code Bikes Year
12 356 2020
2 389 2020
35 378 2021
40 370 2021
# sort so that Codes within each Year are adjacent
new_df = df.sort_values(['Year', 'Code'])
# gap to the previous Code within the same Year
new_df['diff'] = new_df['Code'] - new_df.groupby('Year')['Code'].shift()
# start a new group whenever the gap exceeds 3 (or at the first row of a Year)
new_df['cumsum'] = ((new_df['diff'] > 3) | (new_df['diff'].isna())).cumsum()
# keep the row with the most Bikes in each group, then restore the original order
new_df = new_df.sort_values('Bikes', ascending=False).drop_duplicates(['cumsum']).sort_index()
new_df.drop(columns=['diff', 'cumsum'], inplace=True)
You can first sort the values by both columns with DataFrame.sort_values, then create groups by comparing the differences against a threshold of 3 and taking a cumulative sum, and finally use DataFrameGroupBy.idxmax to get the indices of the maximal Bikes per Year and the helper Series:
df1 = df.sort_values(['Year','Code'])
g = df1.groupby('Year')['Code'].diff().gt(3).cumsum()
df2 = df.loc[df1.groupby(['Year', g])['Bikes'].idxmax()].sort_index()
print (df2)
Code Bikes Year
0 12 356 2020
2 2 389 2020
3 35 378 2021
4 40 370 2021

Compare two dataframes column values. Find which values are in one df and not the other

I have the following dataset
df=pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/sales.csv')
df["OrderYear"] = pd.DatetimeIndex(df['Order Date']).year
I want to compare the customers in 2017 and 2018 and see if the store has lost customers.
I did two subsets corresponding to 2017 and 2018 :
Customer_2018 = df.loc[(df.OrderYear == 2018)]
Customer_2017 = df.loc[(df.OrderYear == 2017)]
I then tried this to compare the two:
Churn = Customer_2017['Customer ID'].isin(Customer_2018['Customer ID']).value_counts()
Churn
And I get the following output:
True 2206
False 324
Name: Customer ID, dtype: int64
The problem is some customers may appear several times in the dataset since they made several orders.
I would like to get only unique customers (Customer ID is the only unique attribute) and then compare the two dataframes to see how many customers the store lost between 2017 and 2018.
To go further in the analysis, you can use pd.crosstab:
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
At this point your dataframe looks like:
>>> out
OrderYear 2015 2016 2017 2018
Customer ID
AA-10315 4 1 4 2
AA-10375 2 4 4 5
AA-10480 1 0 10 1
AA-10645 6 3 8 1
AB-10015 4 0 2 0 # <- lost customer
... ... ... ... ...
XP-21865 10 3 9 6
YC-21895 3 1 3 1
YS-21880 0 5 0 7
ZC-21910 5 9 9 8
ZD-21925 3 0 5 1
Values are the number of orders per customer and year.
Now it's easy to get "lost customers":
>>> sum((out[2017] != 0) & (out[2018] == 0))
83
If only one comparison is required, I would use python sets:
c2017 = set(Customer_2017['Customer ID'])
c2018 = set(Customer_2018['Customer ID'])
print(f'lost customers between 2017 and 2018: {len(c2017 - c2018)}')
print(f'customers from 2017 remaining in 2018: {len(c2017 & c2018)}')
print(f'new customers in 2018: {len(c2018 - c2017)}')
output:
lost customers between 2017 and 2018: 83
customers from 2017 remaining in 2018: 552
new customers in 2018: 138
Building on the crosstab suggestion from #Corralien:
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
(out.gt(0).astype(int).diff(axis=1)
    .replace({0: 'remained', 1: 'new', -1: 'lost'})
    .apply(pd.Series.value_counts)
)
output:
OrderYear 2015 2016 2017 2018
lost NaN 163 123 83
new NaN 141 191 138
remained NaN 489 479 572
You could just use normal sets to get unique customer ids for each year and then subtract them appropriately:
set_lost_cust = set(Customer_2017["Customer ID"]) - set(Customer_2018["Customer ID"])
len(set_lost_cust)
Out: 83
For your original approach to work, you would need to drop the duplicates from the DataFrames to make sure each customer appears only a single time:
Customer_2018 = df.loc[(df.OrderYear == 2018), "Customer ID"].drop_duplicates()
Customer_2017 = df.loc[(df.OrderYear == 2017), "Customer ID"].drop_duplicates()
Churn = Customer_2017.isin(Customer_2018)
Churn.value_counts()
#Out:
True 552
False 83
Name: Customer ID, dtype: int64

weighted average aggregation on multiple columns of df

I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group  Year  Month  Weight(kg)  Nitrogen  Calcium
A      2020  01     10000       10        70
A      2020  01     15000       4         78
A      2021  05     12000       5         66
A      2021  05     10000       8         54
B      2021  08     14000       10        90
C      2021  08     50000       20        92
C      2021  08     40000       10        95
My desired result would look something like this:
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "year", "month"]).apply(wavg, "Calcium", "Weight(kg)").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column, whilst I have dozens of columns. I therefore tried a for loop:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to each other. They also lack a useful column name:
How could I adapt my code to return the desired df?
Change the function so it works with multiple columns, and to avoid removing the grouping columns convert them to a MultiIndex:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return (d.mul(w, axis=0)).div(w.sum())
    except ZeroDivisionError:
        return d.mean()

# columns used for groupby
groups = ["Group", "Year", "Month"]
# process all other columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
# create the index and process all columns in cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print(df1)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
Try via concat() and reset_index():
df=pd.concat(column_list,axis=1).reset_index()
OR
you can make changes here:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").reset_index())
# Finally:
df = pd.concat(column_list, axis=1)

Pandas DataFrame: Calculate percentage difference between rows?

I have a year-wise dataframe in which each year has three parameters: year, type and value. I'm trying to calculate the percentage of taken vs empty. For example, year 2014 has a total of 50 empty and 50 taken, so 50% empty and 50% taken, as shown in final_df.
df
year type value
0 2014 Total 100
1 2014 Empty 50
2 2014 Taken 50
3 2013 Total 2000
4 2013 Empty 100
5 2013 Taken 1900
6 2012 Total 50
7 2012 Empty 45
8 2012 Taken 5
Final df
year Empty Taken
0 2014 50 50
0 2013 ... ...
0 2012 ... ...
Should I shift cells up and do the percentage calculation, or is there another method?
You can use pivot_table:
new = df[df['type'] != 'Total']
res = (new.pivot_table(index='year',columns='type',values='value').sort_values(by='year',ascending=False).reset_index())
which gets you:
res
year Empty Taken
0 2014 50 50
1 2013 100 1900
2 2012 45 5
And then you can get the percentages for each column:
total = (res['Empty'] + res['Taken'])
for col in ['Empty', 'Taken']:
    res[col + '_perc'] = res[col] / total
year Empty Taken Empty_perc Taken_perc
2014 50 50 0.50 0.50
2013 100 1900 0.05 0.95
2012 45 5 0.90 0.10
As #sophods pointed out, you can use pivot_table to rearrange your dataframe. However, to add to that answer: I think you're after the percentage, so I suggest you keep the 'Total' record and then apply your calculation:
#pivot your data
res = (df.pivot_table(index='year',columns='type',values='value')).reset_index()
#calculate percentages of empty and taken
res['Empty'] = res['Empty']/res['Total']
res['Taken'] = res['Taken']/res['Total']
#final dataframe
res = res[['year', 'Empty', 'Taken']]
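With the sample df above, this should line up with the *_perc columns from the previous answer; a quick check (assuming the same df):
print(res)
# expected values (roughly formatted):
#    year  Empty  Taken
#    2012   0.90   0.10
#    2013   0.05   0.95
#    2014   0.50   0.50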
You can filter out the records having Empty and Taken in type and then group by year and apply func. In func, you can set type as the index, get the required values and calculate the percentage. x in func is a dataframe having the type and value columns, with the data of one group.
def func(x):
    x = x.set_index('type')
    total = x['value'].sum()
    return [(x.loc['Empty', 'value'] / total) * 100, (x.loc['Taken', 'value'] / total) * 100]

temp = (df[df['type'].isin({'Empty', 'Taken'})]
        .groupby('year')[['type', 'value']]
        .apply(lambda x: func(x)))
temp
year
2012 [90.0, 10.0]
2013 [5.0, 95.0]
2014 [50.0, 50.0]
dtype: object
Convert the result into the required dataframe
pd.DataFrame(temp.values.tolist(), index=temp.index, columns=['Empty', 'Taken'])
Empty Taken
year
2012 90.0 10.0
2013 5.0 95.0
2014 50.0 50.0

Pandas groupby operations between groups

I have a DataFrame with 4 fields: Location, Year, Week and Sales. I would like to know the difference in Sales between two years while preserving the granularity of the dataset. That is, I would like to know, for each Location, Year and Week, the difference relative to the same week of another Year.
The following will generate a Dataframe with a similar structure:
from numpy import random
import pandas as pd

raw_data = {'Location': ['A']*30 + ['B']*30 + ['C']*30,
            'Year': 3*([2018]*10 + [2019]*10 + [2020]*10),
            'Week': 3*(3*list(range(1, 11))),
            'Sales': random.randint(100, size=(90))
            }
df = pd.DataFrame(raw_data)
Location Year Week Sales
A 2018 1 67
A 2018 2 93
A 2018 … 67
A 2019 1 49
A 2019 2 38
A 2019 … 40
B 2018 1 18
… … … …
Could you please show me what would be the best approach?
Thank you very much
You can do it using groupby and shift:
df["Next_Years_Sales"] = df.groupby(["Location", "Week"])["Sales"].shift(-1)
df["YoY_Sales_Difference"] = df["Next_Years_Sales"] - df["Sales"]
Spot checking it:
df[(df["Location"] == "A") & (df["Week"] == 1)]
Out[37]:
Location Year Week Sales Next_Years_Sales YoY_Sales_Difference
0 A 2018 1 99 10.0 -89.0
10 A 2019 1 10 3.0 -7.0
20 A 2020 1 3 NaN NaN
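One thing to keep in mind: shift(-1) simply takes the next row within each (Location, Week) group, so this assumes the rows are ordered by Year inside each group (as they are in the generated sample). If that ordering isn't guaranteed, sorting first keeps the comparison year-on-year (a small sketch):
df = df.sort_values(["Location", "Week", "Year"])
df["Next_Years_Sales"] = df.groupby(["Location", "Week"])["Sales"].shift(-1)
df["YoY_Sales_Difference"] = df["Next_Years_Sales"] - df["Sales"]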
