I would like to cluster the dataframe below on column X3, then for each cluster find the mean of X3, and assign 3 to the cluster with the highest mean, 2 to the middle one, and 1 to the lowest. Here is the dataframe:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({'Month': [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
                   'X1': [10, 15, 24, 32, 8, 6, 10, 23, 24, 56, 45, 10, 56],
                   'X2': [12, 90, 20, 40, 10, 15, 30, 40, 60, 42, 2, 4, 10],
                   'X3': [34, 65, 34, 87, 100, 65, 78, 67, 34, 98, 96, 46, 76]})
I clustered on column X3 as follows:
def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_
cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
Now I want to find the mean of X3 for each cluster and month, rank the means, and assign 3 to the highest mean, 2 to the middle, and 1 to the lowest. Below is what I tried, but it is not working. How can I fix this? Thank you.
mapping = {1: 'weak', 2: 'average', 3: 'good'}
cols = df.columns[3]
df['product_rank'] = (df.groupby(['Month', 'X3_cluster_id'])[cols]
                      .transform('mean').rank(method='dense').astype(int))
df['product_category'] = df['product_rank'].map(mapping)
While assigning ranks, make sure to group by month.
Complete code:
df = pd.DataFrame({'Month': [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
                   'X1': [10, 15, 24, 32, 8, 6, 10, 23, 24, 56, 45, 10, 56],
                   'X2': [12, 90, 20, 40, 10, 15, 30, 40, 60, 42, 2, 4, 10],
                   'X3': [34, 65, 34, 87, 100, 65, 78, 67, 34, 98, 96, 46, 76]})
def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_
cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
mapping = {1: 'weak', 2: 'average', 3: 'good'}
df['mean_X3'] = df.groupby(["Month","X3_cluster_id"])["X3"].transform("mean")
df["product_category"] = df.groupby("Month")['mean_X3'].rank(method='dense').astype(int).map(mapping)
print(df)
    Month  X1  X2   X3  X3_cluster_id    mean_X3 product_category
0       1  10  12   34              1  34.000000             weak
1       1  15  90   65              2  65.000000          average
2       1  24  20   34              1  34.000000             weak
3       1  32  40   87              0  93.500000             good
4       1   8  10  100              0  93.500000             good
5       1   6  15   65              2  65.000000          average
6       3  10  30   78              1  73.666667          average
7       3  23  40   67              1  73.666667          average
8       3  24  60   34              0  40.000000             weak
9       3  56  42   98              2  97.000000             good
10      3  45   2   96              2  97.000000             good
11      3  10   4   46              0  40.000000             weak
12      3  56  10   76              1  73.666667          average
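Note that KMeans label assignment is nondeterministic, so the cluster ids (and the exact output above) can change between runs. A minimal sketch that pins the seed for reproducibility; the random_state and n_init values are my own choices, not part of the original answer:
def cluster(X, n_clusters):
    # fixing random_state makes the fit reproducible; n_init=10 spells out
    # the long-standing sklearn default explicitly
    k_means = KMeans(n_clusters=n_clusters, n_init=10,
                     random_state=0).fit(X.values.reshape(-1, 1))
    return k_means.labels_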
When you apply KMeans, the means are already calculated, so I would suggest doing one fit per group and returning the labels, means, and ranks from within the groupby:
def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X)
    # argsort alone gives the index of the i-th smallest centre, not the rank
    # of centre i, so apply argsort twice to turn centres into ranks
    ranks = np.argsort(np.argsort(k_means.cluster_centers_.ravel())) + 1
    res = pd.DataFrame({'cluster': range(k_means.n_clusters),
                        'means': k_means.cluster_centers_.ravel(),
                        'ranks': ranks}).loc[k_means.labels_, :]
    res.index = X.index
    return res
Then simply apply the function above to obtain the labels, means, and ranks in one shot:
mapping = {1: 'weak', 2: 'average', 3: 'good'}
res = df.groupby("Month")[['X3']].apply(cluster, n_clusters=3)
    cluster      means  ranks
0         1  34.000000      1
1         0  65.000000      2
2         1  34.000000      1
3         2  93.500000      3
4         2  93.500000      3
5         0  65.000000      2
6         0  73.666667      2
7         0  73.666667      2
8         1  40.000000      1
9         2  97.000000      3
10        2  97.000000      3
11        1  40.000000      1
12        0  73.666667      2
You can then apply the mapping, and build the complete dataframe with an index-based join:
res['product_category'] = res['ranks'].map(mapping)
df.merge(res,left_index=True,right_index=True)
Month X1 X2 X3 cluster means ranks product_category
0 1 10 12 34 1 34.000000 1 weak
1 1 15 90 65 0 65.000000 2 average
2 1 24 20 34 1 34.000000 1 weak
3 1 32 40 87 2 93.500000 3 good
4 1 8 10 100 2 93.500000 3 good
5 1 6 15 65 0 65.000000 2 average
6 3 10 30 78 0 73.666667 2 average
7 3 23 40 67 0 73.666667 2 average
8 3 24 60 34 1 40.000000 1 weak
9 3 56 42 98 2 97.000000 3 good
10 3 45 2 96 2 97.000000 3 good
11 3 10 4 46 1 40.000000 1 weak
12 3 56 10 76 0 73.666667 2 average
I am having trouble applying some logic across my entire dataset. I can apply the logic to a single group, but not to all of the groups (the groups are defined by primaryFilter and secondaryFilter). Would you mind pointing me in the right direction?
Entire Data
import pandas as pd
import numpy as np
myInput = {
'primaryFilter': [100,100,100,100,100,100,100,100,100,100,200,200,200,200,200,200,200,200,200,200],
'secondaryFilter': [1,1,1,1,2,2,2,3,3,3,1,1,2,2,2,2,3,3,3,3],
'constantValuePerGroup': [15,15,15,15,20,20,20,17,17,17,10,10,30,30,30,30,22,22,22,22],
'someValue':[3,1,4,7,9,9,2,7,3,7,6,4,7,10,10,3,4,6,7,5]
}
df_input = pd.DataFrame(data=myInput)
df_input
Test Data (First Group)
df_test = df_input[df_input.primaryFilter.isin([100])]
df_test = df_test[df_test.secondaryFilter == 1.0]
df_test['newColumn'] = np.nan
for index, row in df_test.iterrows():
    if index == 0:
        print("start")
        df_test.loc[0, 'newColumn'] = 0
    elif index == df_test.shape[0] - 1:
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
        print("end")
    else:
        print("inter")
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
df_test["delta"] = df_test["constantValuePerGroup"] - df_test['newColumn']
df_test.head()
Here is the output of the test
I would now like to apply the above logic to the remaining groups: (100, 2), (100, 3), (200, 1), and so forth.
There is no need to use iterrows here. Group the dataframe on the primaryFilter and secondaryFilter columns, then for each group take the cumulative sum of someValue and shift it one position downwards to obtain newColumn. Finally, subtract newColumn from constantValuePerGroup to get the delta.
df_input['newColumn'] = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue'].apply(lambda s: s.cumsum().shift(fill_value=0))
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']
>>> df_input
primaryFilter secondaryFilter constantValuePerGroup someValue newColumn delta
0 100 1 15 3 0 15
1 100 1 15 1 3 12
2 100 1 15 4 4 11
3 100 1 15 7 8 7
4 100 2 20 9 0 20
5 100 2 20 9 9 11
6 100 2 20 2 18 2
7 100 3 17 7 0 17
8 100 3 17 3 7 10
9 100 3 17 7 10 7
10 200 1 10 6 0 10
11 200 1 10 4 6 4
12 200 2 30 7 0 30
13 200 2 30 10 7 23
14 200 2 30 10 17 13
15 200 2 30 3 27 3
16 200 3 22 4 0 22
17 200 3 22 6 4 18
18 200 3 22 7 10 12
19 200 3 22 5 17 5
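As an aside, the same shifted cumulative sum can be written without apply, since shifting a cumsum down by one is the same as subtracting the current value; a minimal equivalent sketch (my reformulation, not part of the original answer):
g = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue']
# cumsum() includes the current row; subtracting it leaves the sum of all
# previous rows in the group, i.e. the shifted cumsum with 0 at each group start
df_input['newColumn'] = g.cumsum() - df_input['someValue']
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']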
I have a dataset as shown below, and I am looking to assign a new student if the score ratio is <= 0.05.
import pandas as pd
df = pd.DataFrame({'Teacher': ['P','P','N','N','N','N','P','N','N','P','P','N'],
'Class': ['A','A','A','A','B','B','B','C','C','C','C','C'],
'Student': [1,2,3,4,1,2,3,1,2,3,4,5],
'Total Score': [75,10,10,5,75,20,5,60,20,10,6,4],
'Percent': [43,32,30,36,35,28,34,33,31,36,37,29]})
I built a score ratio column as below:
df_2 = df.groupby(['Teacher','Class']).agg({'Total Score': 'sum'}).reset_index()
final_data=pd.merge(df,df_2, on=['Teacher','Class'], how='inner')
final_data['score ratio']=final_data['Total Score_x']/final_data['Total Score_y']
If a student's score ratio is <= 0.05, then I need to assign a new student for the same teacher (e.g. N) within the same class (e.g. C) whose percent is the next best (in the example below, student 2 has the next best percent of 31).
Expected output, with the new column 'new_assigned_student':
Here is a solution with nested iterrows which works but is not efficient. I would be interested to see if someone provides a more efficient vectorized solution:
import math

for idx, row in final_data.iterrows():
    if row['score ratio'] < 0.05:
        min_distance = math.inf
        target_index = -1
        for idx2, row2 in final_data.iterrows():
            if row2['Teacher'] == row['Teacher'] and \
               row2['Class'] == row['Class'] and \
               row2['Percent'] > row['Percent'] and \
               row2['Percent'] - row['Percent'] < min_distance:
                min_distance = row2['Percent'] - row['Percent']
                target_index = idx2
        final_data.loc[idx, 'new_assigned-student'] = final_data.loc[target_index, 'Student'].astype(str)
#output:
Teacher Class Student ... Total Score_y score ratio new_assigned-student
0 P A 1 ... 85 0.882353 NaN
1 P A 2 ... 85 0.117647 NaN
2 N A 3 ... 15 0.666667 NaN
3 N A 4 ... 15 0.333333 NaN
4 N B 1 ... 95 0.789474 NaN
5 N B 2 ... 95 0.210526 NaN
6 P B 3 ... 5 1.000000 NaN
7 N C 1 ... 84 0.714286 NaN
8 N C 2 ... 84 0.238095 NaN
9 N C 5 ... 84 0.047619 2
10 P C 3 ... 16 0.625000 NaN
11 P C 4 ... 16 0.375000 NaN
This should do it; just use a shift. It assumes your scores are sorted per Teacher/Class, as they are in your example:
import numpy as np

final_data['new_assigned_student'] = final_data.groupby(['Teacher', 'Class'])['Student'].shift()
final_data.loc[final_data['score ratio'] > 0.05, 'new_assigned_student'] = np.nan
The result:
Teacher Class Student Total Score_x Percent Total Score_y score ratio new_assigned_student
-- --------- ------- --------- --------------- --------- --------------- ------------- ----------------------
0 P A 1 75 43 85 0.882353 nan
1 P A 2 10 32 85 0.117647 nan
2 N A 3 10 30 15 0.666667 nan
3 N A 4 5 36 15 0.333333 nan
4 N B 1 75 35 95 0.789474 nan
5 N B 2 20 28 95 0.210526 nan
6 P B 3 5 34 5 1 nan
7 N C 1 60 33 84 0.714286 nan
8 N C 2 20 31 84 0.238095 nan
9 N C 5 4 29 84 0.047619 2
10 P C 3 10 36 16 0.625 nan
11 P C 4 6 37 16 0.375 nan
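If the rows were not already ordered by Percent within each Teacher/Class group, a sketch that sorts first so the shift really picks the next-best percent (my addition, under that assumption):
final_data = final_data.sort_values(['Teacher', 'Class', 'Percent'],
                                    ascending=[True, True, False])
# after sorting descending by Percent, the previous row in each group is
# the student with the next-higher percent
final_data['new_assigned_student'] = final_data.groupby(['Teacher', 'Class'])['Student'].shift()
final_data.loc[final_data['score ratio'] > 0.05, 'new_assigned_student'] = np.nan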
Solution 2
Here is a more robust, if somewhat more involved, solution
df3 = final_data
df_min_pct = (df3.groupby(['Teacher', 'Class'],
                          as_index=False,
                          sort=False)
                 # idxmin returns the label of the eligible row with the lowest
                 # Percent, which is safe even when group labels are not 0-based
                 .apply(lambda g: g.loc[g.loc[g['score ratio'] > 0.05, 'Percent'].idxmin()])
             )
Here df_min_pct shows, for each Teacher/Class group, the details of the student in that group with the lowest Percent among those whose score ratio is above 0.05:
Teacher Class Student Total Score_x Percent Total Score_y score ratio
-- --------- ------- --------- --------------- --------- --------------- -------------
0 P A 2 10 32 85 0.117647
1 N A 3 10 30 15 0.666667
2 N B 2 20 28 95 0.210526
3 P B 3 5 34 5 1
4 N C 2 20 31 84 0.238095
5 P C 3 10 36 16 0.625
Now we merge with the original df and blank out the assignment on rows where it is not relevant:
df4 = df3.merge(df_min_pct[['Teacher', 'Class','Student']], on = ['Teacher', 'Class'], sort = False).rename(columns = {'Student_y':'new_assigned_student'})
df4.loc[df4['score ratio']>0.05,'new_assigned_student'] = np.nan
This produces the desired result:
Teacher Class Student_x Total Score_x Percent Total Score_y score ratio new_assigned_student
-- --------- ------- ----------- --------------- --------- --------------- ------------- ----------------------
0 P A 1 75 43 85 0.882353 nan
1 P A 2 10 32 85 0.117647 nan
2 N A 3 10 30 15 0.666667 nan
3 N A 4 5 36 15 0.333333 nan
4 N B 1 75 35 95 0.789474 nan
5 N B 2 20 28 95 0.210526 nan
6 P B 3 5 34 5 1 nan
7 N C 1 60 33 84 0.714286 nan
8 N C 2 20 31 84 0.238095 nan
9 N C 5 4 29 84 0.047619 2
10 P C 3 10 36 16 0.625 nan
11 P C 4 6 37 16 0.375 nan
I'm currently working with weekly data for different subjects, but it may contain long streaks without data, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got close by marking with a 1 when week == week.shift() + 1. The problem is that this approach does not mark the first occurrence in a streak, and I also cannot filter the longest one:
df.loc[(df['id'] == df['id'].shift()) & (df['week'] == df['week'].shift() + 1), 'streak'] = 1
Applied to my example, this yields:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()]).transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
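To see what the one-liner's grouper is doing, here is an equivalent sketch with named intermediates (my decomposition of the same logic):
# True at the first row of every streak: either the week gap is not 1
# or the id changes
breaks = df['week'].diff().ne(1) | df['id'].ne(df['id'].shift())
streak_id = breaks.cumsum()  # one label per consecutive run
df['consec'] = df.groupby(streak_id)['week'].transform('count')
# keep only the rows belonging to the longest streak of each id
df[df.groupby('id')['consec'].transform('max') == df['consec']]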
Not as concise as @ScottBoston's, but I like this approach:
def max_streak(s):
    a = s.values  # work with the underlying array
    # I need to know where the differences are not 1. Because I plan to
    # use diff again, I wrap the boolean array with True on both ends
    # to make things cleaner.
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # locations of the breaks between streaks
    c = np.flatnonzero(b)
    # diff again gives the length of each streak
    d = np.diff(c)
    # argmax gives the location of the longest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or numpy way to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
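If you want only the mutual rows (rather than keeping every row of df2 with NaN elsewhere), an inner join drops the non-matching names; a minimal sketch:
df_inner = df2.merge(df1, on='Name', how='inner')
# only Sara survives, since she is the only name present in both frames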
This can also be done with more than one matching argument (in this example, Patrik from df1 does not exist in df2 because they have different ages and therefore will not merge):
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[4,83]})
df1
     Name Special ability  Age
0    Sara   Walk on water    4
1  Patrik       FireBalls   83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on=['Name','Age'], right_on=['Name','Age'], how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G").drop(columns="G")
The merge itself keeps both key columns (A, B, G, H), so G is dropped to leave three columns; the third one is still named H.
df.columns = ["A", "B", "C"]
will then give you the column names you want.
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
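One thing to keep in mind with map (my note, not part of the original answer): any df1.A value missing from df2.G ends up as NaN in C. A quick sketch to verify key coverage first:
missing = set(df1['A']) - set(df2['G'])
assert not missing, f"unmapped keys: {missing}"
df1['C'] = df1['A'].map(df2.set_index('G')['H'])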
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H': 'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to use the underlying array with .values for performance, as done earlier.
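One caveat worth hedging: searchsorted only yields correct positions when df2.G is sorted and every df1.A value actually occurs in df2.G. A small sketch (my addition) that asserts both preconditions:
import numpy as np

assert (np.diff(df2.G.values) >= 0).all(), "df2.G must be sorted"
idx = np.searchsorted(df2.G.values, df1.A.values)
assert idx.max() < len(df2), "some df1.A values exceed df2.G's range"
assert (df2.G.values[idx] == df1.A.values).all(), "every df1.A must appear in df2.G"
df1['C'] = df2.H.values[idx]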
I use pandas to analyze my data, and execute:
df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
print(df)
userid recency frequency monetary
0 47918 9 53 788778
1 48302 85 10 232323
2 8873 3 79 2323
3 63158 23 23 2323232
4 364 14 43 232323
5 45191 1 75 224455
6 21061 9 64 23367
7 41356 22 55 2346777
8 42455 14 30 23478
9 65460 3 16 2345
I need to transform the recency, frequency, and monetary values into values in the range 1-5, so the output is:
userid recency frequency monetary
0 47918 1 2 3
1 48302 2 1 2
2 8873 3 4 5
3 63158 2 2 2
4 364 5 4 2
5 45191 1 5 4
6 21061 4 4 3
7 41356 3 5 4
8 42455 5 3 5
9 65460 3 1 2
How can I do that in Python? Thanks.
IIUC you need qcut with codes; at the end add 1, because the minimal value should be 1 and the maximal 5:
df['recency1'] = pd.qcut(df['recency'].values, 5)
df['frequency1'] = pd.qcut(df['frequency'].values, 5)
df['monetary1'] = pd.qcut(df['monetary'].values, 5)
print(df)
userid recency frequency monetary recency1 frequency1 \
0 47918 9 53 788778 (3, 9] (37.8, 53.8]
1 48302 85 10 232323 (22.2, 85] [10, 21.6]
2 8873 3 79 2323 [1, 3] (66.2, 79]
3 63158 23 23 2323232 (22.2, 85] (21.6, 37.8]
4 364 14 43 232323 (9, 14] (37.8, 53.8]
5 45191 1 75 224455 [1, 3] (66.2, 79]
6 21061 9 64 23367 (3, 9] (53.8, 66.2]
7 41356 22 55 2346777 (14, 22.2] (53.8, 66.2]
8 42455 14 30 23478 (9, 14] (21.6, 37.8]
9 65460 3 16 2345 [1, 3] [10, 21.6]
monetary1
0 (232323, 1095668.8]
1 (144064.2, 232323]
2 [2323, 19162.6]
3 (1095668.8, 2346777]
4 (144064.2, 232323]
5 (144064.2, 232323]
6 (19162.6, 144064.2]
7 (1095668.8, 2346777]
8 (19162.6, 144064.2]
9 [2323, 19162.6]
df['recency'] = pd.qcut(df['recency'].values, 5).codes + 1
df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
df['monetary'] = pd.qcut(df['monetary'].values, 5).codes + 1
print(df)
userid recency frequency monetary
0 47918 2 3 4
1 48302 5 1 3
2 8873 1 5 1
3 63158 5 2 5
4 364 3 3 3
5 45191 1 5 3
6 21061 2 4 2
7 41356 4 4 5
8 42455 3 2 2
9 65460 1 1 1
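An equivalent sketch that skips the intermediate Categorical by asking qcut for integer codes directly; labels=False returns the 0-based bin number, so we still add 1:
for col in ['recency', 'frequency', 'monetary']:
    df[col] = pd.qcut(df[col], 5, labels=False) + 1
print(df)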