I would like to cluster the dataframe below on column X3, then for each cluster find the mean of X3, and assign 3 to the cluster with the highest mean, 2 to the middle one, and 1 to the lowest. Here is the dataframe:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({'Month': [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
                   'X1': [10, 15, 24, 32, 8, 6, 10, 23, 24, 56, 45, 10, 56],
                   'X2': [12, 90, 20, 40, 10, 15, 30, 40, 60, 42, 2, 4, 10],
                   'X3': [34, 65, 34, 87, 100, 65, 78, 67, 34, 98, 96, 46, 76]})
I clustered on column X3 as follows:
def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_
cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
Now I want to find the mean of X3 for each cluster and month, rank the means, and assign 3 to the highest mean, 2 to the middle, and 1 to the lowest. Below is what I tried, but it is not working. How can I fix this? Thank you.
mapping = {1: 'weak', 2: 'average', 3: 'good'}
cols = df.columns[3]
df['product_rank'] = (df.groupby(['Month', 'X3_cluster_id'])[cols]
                      .transform('mean').rank(method='dense').astype(int))
df['product_category'] = df['product_rank'].map(mapping)
While assigning ranks, make sure to group by month.
Complete code:
df = pd.DataFrame({'Month': [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
                   'X1': [10, 15, 24, 32, 8, 6, 10, 23, 24, 56, 45, 10, 56],
                   'X2': [12, 90, 20, 40, 10, 15, 30, 40, 60, 42, 2, 4, 10],
                   'X3': [34, 65, 34, 87, 100, 65, 78, 67, 34, 98, 96, 46, 76]})
def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_
cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
mapping = {1: 'weak', 2: 'average', 3: 'good'}
df['mean_X3'] = df.groupby(["Month","X3_cluster_id"])["X3"].transform("mean")
df["product_category"] = df.groupby("Month")['mean_X3'].rank(method='dense').astype(int).map(mapping)
print(df)
    Month  X1  X2   X3  X3_cluster_id    mean_X3 product_category
0       1  10  12   34              1  34.000000             weak
1       1  15  90   65              2  65.000000          average
2       1  24  20   34              1  34.000000             weak
3       1  32  40   87              0  93.500000             good
4       1   8  10  100              0  93.500000             good
5       1   6  15   65              2  65.000000          average
6       3  10  30   78              1  73.666667          average
7       3  23  40   67              1  73.666667          average
8       3  24  60   34              0  40.000000             weak
9       3  56  42   98              2  97.000000             good
10      3  45   2   96              2  97.000000             good
11      3  10   4   46              0  40.000000             weak
12      3  56  10   76              1  73.666667          average
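Note that KMeans label assignment is nondeterministic, so the cluster ids (and the exact output above) can change between runs. A minimal sketch that pins the seed for reproducibility; the random_state and n_init values are my own choices, not part of the original answer:
def cluster(X, n_clusters):
    # fixing random_state makes the fit reproducible; n_init=10 spells out
    # the long-standing sklearn default explicitly
    k_means = KMeans(n_clusters=n_clusters, n_init=10,
                     random_state=0).fit(X.values.reshape(-1, 1))
    return k_means.labels_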
When you apply KMeans, the means are already calculated, so I would suggest doing one fit per group and returning the labels, means, and ranks from within the groupby:
def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X)
    # argsort alone gives the index of the i-th smallest centre, not the rank
    # of centre i, so apply argsort twice to turn centres into ranks
    ranks = np.argsort(np.argsort(k_means.cluster_centers_.ravel())) + 1
    res = pd.DataFrame({'cluster': range(k_means.n_clusters),
                        'means': k_means.cluster_centers_.ravel(),
                        'ranks': ranks}).loc[k_means.labels_, :]
    res.index = X.index
    return res
Then simply apply the function above to obtain the labels, means, and ranks in one shot:
mapping = {1: 'weak', 2: 'average', 3: 'good'}
res = df.groupby("Month")[['X3']].apply(cluster, n_clusters=3)
    cluster      means  ranks
0         1  34.000000      1
1         0  65.000000      2
2         1  34.000000      1
3         2  93.500000      3
4         2  93.500000      3
5         0  65.000000      2
6         0  73.666667      2
7         0  73.666667      2
8         1  40.000000      1
9         2  97.000000      3
10        2  97.000000      3
11        1  40.000000      1
12        0  73.666667      2
You can then apply the mapping, and build the complete dataframe with an index-based join:
res['product_category'] = res['ranks'].map(mapping)
df.merge(res,left_index=True,right_index=True)
Month X1 X2 X3 cluster means ranks product_category
0 1 10 12 34 1 34.000000 1 weak
1 1 15 90 65 0 65.000000 2 average
2 1 24 20 34 1 34.000000 1 weak
3 1 32 40 87 2 93.500000 3 good
4 1 8 10 100 2 93.500000 3 good
5 1 6 15 65 0 65.000000 2 average
6 3 10 30 78 0 73.666667 2 average
7 3 23 40 67 0 73.666667 2 average
8 3 24 60 34 1 40.000000 1 weak
9 3 56 42 98 2 97.000000 3 good
10 3 45 2 96 2 97.000000 3 good
11 3 10 4 46 1 40.000000 1 weak
12 3 56 10 76 0 73.666667 2 average
I am having trouble applying some logic across my entire dataset. I can apply the logic to a single group, but not to all of the groups (the groups are defined by primaryFilter and secondaryFilter). Would you mind pointing me in the right direction?
Entire Data
import pandas as pd
import numpy as np
myInput = {
'primaryFilter': [100,100,100,100,100,100,100,100,100,100,200,200,200,200,200,200,200,200,200,200],
'secondaryFilter': [1,1,1,1,2,2,2,3,3,3,1,1,2,2,2,2,3,3,3,3],
'constantValuePerGroup': [15,15,15,15,20,20,20,17,17,17,10,10,30,30,30,30,22,22,22,22],
'someValue':[3,1,4,7,9,9,2,7,3,7,6,4,7,10,10,3,4,6,7,5]
}
df_input = pd.DataFrame(data=myInput)
df_input
Test Data (First Group)
df_test = df_input[df_input.primaryFilter.isin([100])]
df_test = df_test[df_test.secondaryFilter == 1.0]
df_test['newColumn'] = np.nan
for index, row in df_test.iterrows():
    if index == 0:
        print("start")
        df_test.loc[0, 'newColumn'] = 0
    elif index == df_test.shape[0] - 1:
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
        print("end")
    else:
        print("inter")
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
df_test["delta"] = df_test["constantValuePerGroup"] - df_test['newColumn']
df_test.head()
Here is the output of the test
I would now like to apply the above logic to the remaining groups: (100, 2), (100, 3), (200, 1), and so forth.
There is no need to use iterrows here. Group the dataframe on the primaryFilter and secondaryFilter columns, then for each group take the cumulative sum of someValue and shift it one position downwards to obtain newColumn. Finally, subtract newColumn from constantValuePerGroup to get the delta.
df_input['newColumn'] = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue'].apply(lambda s: s.cumsum().shift(fill_value=0))
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']
>>> df_input
primaryFilter secondaryFilter constantValuePerGroup someValue newColumn delta
0 100 1 15 3 0 15
1 100 1 15 1 3 12
2 100 1 15 4 4 11
3 100 1 15 7 8 7
4 100 2 20 9 0 20
5 100 2 20 9 9 11
6 100 2 20 2 18 2
7 100 3 17 7 0 17
8 100 3 17 3 7 10
9 100 3 17 7 10 7
10 200 1 10 6 0 10
11 200 1 10 4 6 4
12 200 2 30 7 0 30
13 200 2 30 10 7 23
14 200 2 30 10 17 13
15 200 2 30 3 27 3
16 200 3 22 4 0 22
17 200 3 22 6 4 18
18 200 3 22 7 10 12
19 200 3 22 5 17 5
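As an aside, the same shifted cumulative sum can be written without apply, since shifting a cumsum down by one is the same as subtracting the current value; a minimal equivalent sketch (my reformulation, not part of the original answer):
g = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue']
# cumsum() includes the current row; subtracting it leaves the sum of all
# previous rows in the group, i.e. the shifted cumsum with 0 at each group start
df_input['newColumn'] = g.cumsum() - df_input['someValue']
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']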
I have a dataset as shown below, and I am looking to assign a new student if the score ratio is <= 0.05.
import pandas as pd
df = pd.DataFrame({'Teacher': ['P','P','N','N','N','N','P','N','N','P','P','N'],
'Class': ['A','A','A','A','B','B','B','C','C','C','C','C'],
'Student': [1,2,3,4,1,2,3,1,2,3,4,5],
'Total Score': [75,10,10,5,75,20,5,60,20,10,6,4],
'Percent': [43,32,30,36,35,28,34,33,31,36,37,29]})
I built a score ratio column as below:
df_2 = df.groupby(['Teacher','Class']).agg({'Total Score': 'sum'}).reset_index()
final_data=pd.merge(df,df_2, on=['Teacher','Class'], how='inner')
final_data['score ratio']=final_data['Total Score_x']/final_data['Total Score_y']
If a student's score ratio is <= 0.05, then I need to assign a new student for the same teacher (e.g. N) within the same class (e.g. C) whose percent is the next best (in the example below, student 2 has the next best percent of 31).
Expected output, with the new column 'new_assigned_student':
Here is a solution with nested iterrows which works but is not efficient. I would be interested to see if someone provides a more efficient vectorized solution:
import math

for idx, row in final_data.iterrows():
    if row['score ratio'] < 0.05:
        min_distance = math.inf
        target_index = -1
        for idx2, row2 in final_data.iterrows():
            if row2['Teacher'] == row['Teacher'] and \
               row2['Class'] == row['Class'] and \
               row2['Percent'] > row['Percent'] and \
               row2['Percent'] - row['Percent'] < min_distance:
                min_distance = row2['Percent'] - row['Percent']
                target_index = idx2
        final_data.loc[idx, 'new_assigned-student'] = final_data.loc[target_index, 'Student'].astype(str)
#output:
Teacher Class Student ... Total Score_y score ratio new_assigned-student
0 P A 1 ... 85 0.882353 NaN
1 P A 2 ... 85 0.117647 NaN
2 N A 3 ... 15 0.666667 NaN
3 N A 4 ... 15 0.333333 NaN
4 N B 1 ... 95 0.789474 NaN
5 N B 2 ... 95 0.210526 NaN
6 P B 3 ... 5 1.000000 NaN
7 N C 1 ... 84 0.714286 NaN
8 N C 2 ... 84 0.238095 NaN
9 N C 5 ... 84 0.047619 2
10 P C 3 ... 16 0.625000 NaN
11 P C 4 ... 16 0.375000 NaN
This should do it; just use a shift. It assumes your scores are sorted per Teacher/Class, as they are in your example:
import numpy as np

final_data['new_assigned_student'] = final_data.groupby(['Teacher', 'Class'])['Student'].shift()
final_data.loc[final_data['score ratio'] > 0.05, 'new_assigned_student'] = np.nan
The result:
Teacher Class Student Total Score_x Percent Total Score_y score ratio new_assigned_student
-- --------- ------- --------- --------------- --------- --------------- ------------- ----------------------
0 P A 1 75 43 85 0.882353 nan
1 P A 2 10 32 85 0.117647 nan
2 N A 3 10 30 15 0.666667 nan
3 N A 4 5 36 15 0.333333 nan
4 N B 1 75 35 95 0.789474 nan
5 N B 2 20 28 95 0.210526 nan
6 P B 3 5 34 5 1 nan
7 N C 1 60 33 84 0.714286 nan
8 N C 2 20 31 84 0.238095 nan
9 N C 5 4 29 84 0.047619 2
10 P C 3 10 36 16 0.625 nan
11 P C 4 6 37 16 0.375 nan
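If the rows were not already ordered by Percent within each Teacher/Class group, a sketch that sorts first so the shift really picks the next-best percent (my addition, under that assumption):
final_data = final_data.sort_values(['Teacher', 'Class', 'Percent'],
                                    ascending=[True, True, False])
# after sorting descending by Percent, the previous row in each group is
# the student with the next-higher percent
final_data['new_assigned_student'] = final_data.groupby(['Teacher', 'Class'])['Student'].shift()
final_data.loc[final_data['score ratio'] > 0.05, 'new_assigned_student'] = np.nan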
Solution 2
Here is a more robust, if somewhat more involved, solution
df3 = final_data
df_min_pct = (df3.groupby(['Teacher', 'Class'],
                          as_index=False,
                          sort=False)
                 # idxmin returns the label of the eligible row with the lowest
                 # Percent, which is safe even when group labels are not 0-based
                 .apply(lambda g: g.loc[g.loc[g['score ratio'] > 0.05, 'Percent'].idxmin()])
             )
Here df_min_pct shows, for each Teacher/Class group, the details of the student in that group with the lowest Percent among those whose score ratio is above 0.05:
Teacher Class Student Total Score_x Percent Total Score_y score ratio
-- --------- ------- --------- --------------- --------- --------------- -------------
0 P A 2 10 32 85 0.117647
1 N A 3 10 30 15 0.666667
2 N B 2 20 28 95 0.210526
3 P B 3 5 34 5 1
4 N C 2 20 31 84 0.238095
5 P C 3 10 36 16 0.625
Now we merge with the original df and blank out the assignment on rows where it is not relevant:
df4 = df3.merge(df_min_pct[['Teacher', 'Class','Student']], on = ['Teacher', 'Class'], sort = False).rename(columns = {'Student_y':'new_assigned_student'})
df4.loc[df4['score ratio']>0.05,'new_assigned_student'] = np.nan
This produces the desired result:
Teacher Class Student_x Total Score_x Percent Total Score_y score ratio new_assigned_student
-- --------- ------- ----------- --------------- --------- --------------- ------------- ----------------------
0 P A 1 75 43 85 0.882353 nan
1 P A 2 10 32 85 0.117647 nan
2 N A 3 10 30 15 0.666667 nan
3 N A 4 5 36 15 0.333333 nan
4 N B 1 75 35 95 0.789474 nan
5 N B 2 20 28 95 0.210526 nan
6 P B 3 5 34 5 1 nan
7 N C 1 60 33 84 0.714286 nan
8 N C 2 20 31 84 0.238095 nan
9 N C 5 4 29 84 0.047619 2
10 P C 3 10 36 16 0.625 nan
11 P C 4 6 37 16 0.375 nan
I'm currently working with weekly data for different subjects, but it may contain long streaks without data, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got close by marking with a 1 when week == week.shift() + 1. The problem is that this approach does not mark the first occurrence in a streak, and I also cannot filter the longest one:
df.loc[(df['id'] == df['id'].shift()) & (df['week'] == df['week'].shift() + 1), 'streak'] = 1
Applied to my example, this yields:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()]).transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
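To see what the one-liner's grouper is doing, here is an equivalent sketch with named intermediates (my decomposition of the same logic):
# True at the first row of every streak: either the week gap is not 1
# or the id changes
breaks = df['week'].diff().ne(1) | df['id'].ne(df['id'].shift())
streak_id = breaks.cumsum()  # one label per consecutive run
df['consec'] = df.groupby(streak_id)['week'].transform('count')
# keep only the rows belonging to the longest streak of each id
df[df.groupby('id')['consec'].transform('max') == df['consec']]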
Not as concise as @ScottBoston's, but I like this approach:
def max_streak(s):
    a = s.values  # work with the underlying array
    # I need to know where the differences are not 1. Because I plan to
    # use diff again, I wrap the boolean array with True on both ends
    # to make things cleaner.
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # locations of the breaks between streaks
    c = np.flatnonzero(b)
    # diff again gives the length of each streak
    d = np.diff(c)
    # argmax gives the location of the longest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or numpy way to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
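If you want only the mutual rows (rather than keeping every row of df2 with NaN elsewhere), an inner join drops the non-matching names; a minimal sketch:
df_inner = df2.merge(df1, on='Name', how='inner')
# only Sara survives, since she is the only name present in both frames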
This can also be done with more than one matching argument (in this example, Patrik from df1 does not exist in df2 because they have different ages and therefore will not merge):
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[4,83]})
df1
     Name Special ability  Age
0    Sara   Walk on water    4
1  Patrik       FireBalls   83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on=['Name','Age'], right_on=['Name','Age'], how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G").drop(columns="G")
The merge itself keeps both key columns (A, B, G, H), so G is dropped to leave three columns; the third one is still named H.
df.columns = ["A", "B", "C"]
will then give you the column names you want.
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
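One thing to keep in mind with map (my note, not part of the original answer): any df1.A value missing from df2.G ends up as NaN in C. A quick sketch to verify key coverage first:
missing = set(df1['A']) - set(df2['G'])
assert not missing, f"unmapped keys: {missing}"
df1['C'] = df1['A'].map(df2.set_index('G')['H'])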
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H': 'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to use the underlying array with .values for performance, as done earlier.
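One caveat worth hedging: searchsorted only yields correct positions when df2.G is sorted and every df1.A value actually occurs in df2.G. A small sketch (my addition) that asserts both preconditions:
import numpy as np

assert (np.diff(df2.G.values) >= 0).all(), "df2.G must be sorted"
idx = np.searchsorted(df2.G.values, df1.A.values)
assert idx.max() < len(df2), "some df1.A values exceed df2.G's range"
assert (df2.G.values[idx] == df1.A.values).all(), "every df1.A must appear in df2.G"
df1['C'] = df2.H.values[idx]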
I use pandas to analyze my data, and execute:
df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
print(df)
userid recency frequency monetary
0 47918 9 53 788778
1 48302 85 10 232323
2 8873 3 79 2323
3 63158 23 23 2323232
4 364 14 43 232323
5 45191 1 75 224455
6 21061 9 64 23367
7 41356 22 55 2346777
8 42455 14 30 23478
9 65460 3 16 2345
I need to transform the recency, frequency, and monetary values into values in the range 1-5, so the output is:
userid recency frequency monetary
0 47918 1 2 3
1 48302 2 1 2
2 8873 3 4 5
3 63158 2 2 2
4 364 5 4 2
5 45191 1 5 4
6 21061 4 4 3
7 41356 3 5 4
8 42455 5 3 5
9 65460 3 1 2
How can I do that in Python? Thanks.
IIUC you need qcut with codes; at the end add 1, because the minimal value should be 1 and the maximal 5:
df['recency1'] = pd.qcut(df['recency'].values, 5)
df['frequency1'] = pd.qcut(df['frequency'].values, 5)
df['monetary1'] = pd.qcut(df['monetary'].values, 5)
print(df)
userid recency frequency monetary recency1 frequency1 \
0 47918 9 53 788778 (3, 9] (37.8, 53.8]
1 48302 85 10 232323 (22.2, 85] [10, 21.6]
2 8873 3 79 2323 [1, 3] (66.2, 79]
3 63158 23 23 2323232 (22.2, 85] (21.6, 37.8]
4 364 14 43 232323 (9, 14] (37.8, 53.8]
5 45191 1 75 224455 [1, 3] (66.2, 79]
6 21061 9 64 23367 (3, 9] (53.8, 66.2]
7 41356 22 55 2346777 (14, 22.2] (53.8, 66.2]
8 42455 14 30 23478 (9, 14] (21.6, 37.8]
9 65460 3 16 2345 [1, 3] [10, 21.6]
monetary1
0 (232323, 1095668.8]
1 (144064.2, 232323]
2 [2323, 19162.6]
3 (1095668.8, 2346777]
4 (144064.2, 232323]
5 (144064.2, 232323]
6 (19162.6, 144064.2]
7 (1095668.8, 2346777]
8 (19162.6, 144064.2]
9 [2323, 19162.6]
df['recency'] = pd.qcut(df['recency'].values, 5).codes + 1
df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
df['monetary'] = pd.qcut(df['monetary'].values, 5).codes + 1
print(df)
userid recency frequency monetary
0 47918 2 3 4
1 48302 5 1 3
2 8873 1 5 1
3 63158 5 2 5
4 364 3 3 3
5 45191 1 5 3
6 21061 2 4 2
7 41356 4 4 5
8 42455 3 2 2
9 65460 1 1 1
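An equivalent sketch that skips the intermediate Categorical by asking qcut for integer codes directly; labels=False returns the 0-based bin number, so we still add 1:
for col in ['recency', 'frequency', 'monetary']:
    df[col] = pd.qcut(df[col], 5, labels=False) + 1
print(df)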