I have a dataframe that has segments of consecutive values appearing in column a (the value in column b does not matter):
import pandas as pd
import numpy as np
np.random.seed(150)
df = pd.DataFrame(data={'a':[1,2,3,4,5,15,16,17,18,203,204,205],'b':np.random.randint(50000,size=(12))})
>>> df
a b
0 1 27066
1 2 28155
2 3 49177
3 4 496
4 5 2354
5 15 23292
6 16 9358
7 17 19036
8 18 29946
9 203 39785
10 204 15843
11 205 21917
I would like to add a column c that counts sequentially within each run of consecutive values in column a, as shown below:
a b c
1 27066 1
2 28155 2
3 49177 3
4 496 4
5 2354 5
15 23292 1
16 9358 2
17 19036 3
18 29946 4
203 39785 1
204 15843 2
205 21917 3
How can I do this?
One solution:
df["c"] = (s := df["a"] - np.arange(len(df))).groupby(s).cumcount() + 1
print(df)
Output
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
The original idea, grouping consecutive runs by subtracting a running index, comes from an old recipe in the Python docs.
The walrus operator (:=, an assignment expression) requires Python 3.8+; on older versions you can do the same in two steps:
s = df["a"] - np.arange(len(df))
df["c"] = s.groupby(s).cumcount() + 1
print(df)
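To see why this works: subtracting a running index from column a yields a value that is constant within each consecutive run, so every run collapses to a single group key, and cumcount then numbers the rows inside each run. The intermediate values for the sample data:
s = df["a"] - np.arange(len(df))
print(s.tolist())
# [1, 1, 1, 1, 1, 10, 10, 10, 10, 194, 194, 194]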
A simple solution is to flag where a run of consecutive values continues, use cumsum to build a running count, and then subtract the count accumulated by earlier runs so that each run restarts at 1.
# True where the current value continues the previous run (previous + 1 == current)
a = df['a'].add(1).shift(1).eq(df['a'])
# Running count of continuations, re-based at each run start, plus 1 to start counting at 1
df['c'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int) + 1
df
Result:
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
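An equivalent, arguably more direct variant reuses the same boolean flag to build group ids and lets cumcount do the numbering; a sketch using the same mask as above:
a = df['a'].add(1).shift(1).eq(df['a'])
# Each run start (where ~a is True) opens a new group id; count rows within each group
df['c'] = df.groupby((~a).cumsum()).cumcount() + 1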
I would like to cluster the dataframe below on column X3, then for each cluster find the mean of X3 and assign 3 to the cluster with the highest mean, 2 to the middle one, and 1 to the lowest. Here is the dataframe:
df = pd.DataFrame({'Month': [1,1,1,1,1,1,3,3,3,3,3,3,3],
                   'X1': [10,15,24,32,8,6,10,23,24,56,45,10,56],
                   'X2': [12,90,20,40,10,15,30,40,60,42,2,4,10],
                   'X3': [34,65,34,87,100,65,78,67,34,98,96,46,76]})
I clustered on column X3 as shown below:
from sklearn.cluster import KMeans

def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_

cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
Now I need to find the mean of X3 for each cluster and month, rank the means, and assign 3 to the highest, 2 to the middle, and 1 to the lowest. Below is what I did, but it is not working. How can I fix this? Thank you.
mapping = {1: 'weak', 2: 'average', 3: 'good'}
cols = df.columns[3]
df['product_rank'] = df.groupby(['Month', 'X3_cluster_id'])[cols].transform('mean').rank(method='dense').astype(int)
df['product_category'] = df['product_rank'].map(mapping)
While assigning the ranks, make sure to group by month.
Complete code:
from sklearn.cluster import KMeans

df = pd.DataFrame({'Month': [1,1,1,1,1,1,3,3,3,3,3,3,3],
                   'X1': [10,15,24,32,8,6,10,23,24,56,45,10,56],
                   'X2': [12,90,20,40,10,15,30,40,60,42,2,4,10],
                   'X3': [34,65,34,87,100,65,78,67,34,98,96,46,76]})

def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_

cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
mapping = {1: 'weak', 2: 'average', 3: 'good'}
df['mean_X3'] = df.groupby(["Month","X3_cluster_id"])["X3"].transform("mean")
df["product_category"] = df.groupby("Month")['mean_X3'].rank(method='dense').astype(int).map(mapping)
print(df)
    Month  X1  X2   X3  X3_cluster_id    mean_X3 product_category
0       1  10  12   34              1  34.000000             weak
1       1  15  90   65              2  65.000000          average
2       1  24  20   34              1  34.000000             weak
3       1  32  40   87              0  93.500000             good
4       1   8  10  100              0  93.500000             good
5       1   6  15   65              2  65.000000          average
6       3  10  30   78              1  73.666667          average
7       3  23  40   67              1  73.666667          average
8       3  24  60   34              0  40.000000             weak
9       3  56  42   98              2  97.000000             good
10      3  45   2   96              2  97.000000             good
11      3  10   4   46              0  40.000000             weak
12      3  56  10   76              1  73.666667          average
When you fit k-means, the cluster means are already calculated (they are the cluster centers), so I would suggest doing a single fit and returning the labels, means, and ranking within each groupby:
def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X)
    centers = k_means.cluster_centers_.ravel()
    # argsort of argsort turns the sort order of the centers into 1-based ranks
    ranks = np.argsort(np.argsort(centers)) + 1
    res = pd.DataFrame({'cluster': range(k_means.n_clusters),
                        'means': centers,
                        'ranks': ranks}).loc[k_means.labels_, :]
    res.index = X.index
    return res
Then simply apply the function above per month to obtain the ranks and means in one shot:
mapping = {1: 'weak', 2: 'average', 3: 'good'}
res = df.groupby("Month")[['X3']].apply(cluster, n_clusters=3)
    cluster      means  ranks
0         1  34.000000      1
1         0  65.000000      2
2         1  34.000000      1
3         2  93.500000      3
4         2  93.500000      3
5         0  65.000000      2
6         0  73.666667      2
7         0  73.666667      2
8         1  40.000000      1
9         2  97.000000      3
10        2  97.000000      3
11        1  40.000000      1
12        0  73.666667      2
You can then apply the map, and build the complete dataframe with a merge on the index:
res['product_category'] = res['ranks'].map(mapping)
df.merge(res,left_index=True,right_index=True)
Month X1 X2 X3 cluster means ranks product_category
0 1 10 12 34 1 34.000000 1 weak
1 1 15 90 65 0 65.000000 2 average
2 1 24 20 34 1 34.000000 1 weak
3 1 32 40 87 2 93.500000 3 good
4 1 8 10 100 2 93.500000 3 good
5 1 6 15 65 0 65.000000 2 average
6 3 10 30 78 0 73.666667 2 average
7 3 23 40 67 0 73.666667 2 average
8 3 24 60 34 1 40.000000 1 weak
9 3 56 42 98 2 97.000000 3 good
10 3 45 2 96 2 97.000000 3 good
11 3 10 4 46 1 40.000000 1 weak
12 3 56 10 76 0 73.666667 2 average
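As a side note, k-means label numbering is arbitrary and depends on the random initialization, which is why the outputs above assign different cluster ids to the same groups across runs. If reproducibility matters, scikit-learn's standard random_state parameter pins the result, e.g.:
k_means = KMeans(n_clusters=n_clusters, random_state=0).fit(X)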
The dataframe has four columns; column 'Id' is unique, and rows are grouped by column 'idhogar'.
Column 'parentesco1' has status 0 or 1. The 'Target' column can hold different values across rows within the same 'idhogar' group.
INDEX Id parentesco1 idhogar Target
0 ID_fe8c32eba 0 4616164 2
1 ID_ca701e058 1 4616164 2
2 ID_5ad4372cd 0 4983866 3
3 ID_1e320689c 1 4983866 3
4 ID_700e30a8d 0 5905417 2
5 ID_bc99ecfb8 0 5905417 2
6 ID_308a05a16 1 5905417 2
7 ID_00186dde5 1 7.56E+06 4
8 ID_34570a74c 1 20713493 4
9 ID_b13870a19 1 27651991 3
10 ID_74e989389 1 45038655 4
11 ID_726ba7d34 0 60027579 4
12 ID_b75d7c648 0 60027579 4
13 ID_37e7b3aaa 1 60027579 4
14 ID_396da5a70 0 104578907 2
15 ID_4381374bb 1 104578907 2
16 ID_272a9b4d5 0 119024319 4
17 ID_1225f3779 0 119024319 4
18 ID_fc5dfaa2e 0 119024319 4
19 ID_7390a3f99 1 119024319 4
A new column 'Rev_target' needs to hold, for every row in an 'idhogar' group, the 'Target' value of that group's row with 'parentesco1' equal to 1.
I tried the following without success:
for idhogar in df['idhogar'].unique():
    if len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1:
        rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
        df['Rev_target'] = rev_target_val
# NOT WORKING AS REQUIRED ---- gives output as NaN in all rows of newly created column
I tried the below, but it throws an error:
for idhogar in df['idhogar'].unique():
    rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
    df['Rev_target'] = np.where(len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1,
                                rev_target_val, df['Target'])
ValueError: operands could not be broadcast together with shapes () (0,) (9557,)
I tried the below, but it does not work as intended; it gives the same value, 2, in all rows of the new 'Rev_target' column:
for idhogar in df['idhogar'].unique():
    rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
    df['Rev_target'] = df.apply(lambda x: rev_target_val
                                if len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1
                                else df['Target'], axis=1)
I would appreciate a solution; thanks in advance.
I would sort the dataframe on parentesco1 in descending order, to make sure that the parentesco1 == 1 row is the first row of each group. Then a transform can easily access that row:
df['Rev_target'] = (df.sort_values('parentesco1', ascending=False)
                      .groupby('idhogar').transform(lambda x: x.iloc[0])['Target'])
It gives:
INDEX Id parentesco1 idhogar Target Rev_target
0 0 ID_fe8c32eba 0 4616164.0 2 2
1 1 ID_ca701e058 1 4616164.0 2 2
2 2 ID_5ad4372cd 0 4983866.0 3 3
3 3 ID_1e320689c 1 4983866.0 3 3
4 4 ID_700e30a8d 0 5905417.0 2 2
5 5 ID_bc99ecfb8 0 5905417.0 2 2
6 6 ID_308a05a16 1 5905417.0 2 2
7 7 ID_00186dde5 1 7560000.0 4 4
8 8 ID_34570a74c 1 20713493.0 4 4
9 9 ID_b13870a19 1 27651991.0 3 3
10 10 ID_74e989389 1 45038655.0 4 4
11 11 ID_726ba7d34 0 60027579.0 4 4
12 12 ID_b75d7c648 0 60027579.0 4 4
13 13 ID_37e7b3aaa 1 60027579.0 4 4
14 14 ID_396da5a70 0 104578907.0 2 2
15 15 ID_4381374bb 1 104578907.0 2 2
16 16 ID_272a9b4d5 0 119024319.0 4 4
17 17 ID_1225f3779 0 119024319.0 4 4
18 18 ID_fc5dfaa2e 0 119024319.0 4 4
19 19 ID_7390a3f99 1 119024319.0 4 4
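As a side note, an equivalent that avoids sorting is to build a lookup from each idhogar to the Target of its parentesco1 == 1 row and map it back onto the full column; a sketch, assuming every group has exactly one such row:
# Target of each group's parentesco1 == 1 row, indexed by idhogar
head_target = df.loc[df['parentesco1'] == 1].set_index('idhogar')['Target']
df['Rev_target'] = df['idhogar'].map(head_target)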
I have a date in a list:
[datetime.date(2017, 8, 9)]
I want to replace values in the dataframe for rows matching that date with zero.
Dataframe:
Date Amplitude Magnitude Peaks Crests
0 2017-06-21 6.953356 1046.656154 4 3
1 2017-06-27 7.015520 1185.221306 5 4
2 2017-06-28 6.947471 908.115055 2 2
3 2017-06-29 6.921587 938.175153 3 3
4 2017-07-02 6.906078 938.273547 3 2
5 2017-07-03 6.898809 955.718452 6 5
6 2017-07-04 6.876283 846.514852 5 5
7 2017-07-26 6.862897 870.610086 6 5
8 2017-07-27 6.846426 824.403786 7 7
9 2017-07-28 6.831949 813.753420 7 7
10 2017-07-29 6.823125 841.245427 4 3
11 2017-07-30 6.816301 846.603427 5 4
12 2017-07-31 6.810133 842.287006 5 4
13 2017-08-01 6.800645 794.167590 3 3
14 2017-08-02 6.793034 801.505774 4 3
15 2017-08-03 6.790814 860.497395 7 6
16 2017-08-04 6.785664 815.055002 4 4
17 2017-08-05 6.782069 829.607640 5 4
18 2017-08-06 6.778176 819.014799 4 3
19 2017-08-07 6.774587 817.624203 5 5
20 2017-08-08 6.771193 815.101641 4 3
21 2017-08-09 6.765695 772.970000 1 1
22 2017-08-10 6.769422 945.207554 1 1
23 2017-08-11 6.773154 952.422598 4 3
24 2017-08-12 6.770926 826.700122 4 4
25 2017-08-13 6.772816 916.046905 5 5
26 2017-08-14 6.771130 834.881662 5 5
27 2017-08-15 6.769183 826.009391 5 5
28 2017-08-16 6.767313 824.650882 5 4
29 2017-08-17 6.765894 832.752100 5 5
30 2017-08-18 6.766861 894.165751 5 5
31 2017-08-19 6.768392 912.200274 4 3
I have tried this:
for x in range(len(all_details)):
    for y in selected_day:
        m = all_details['Date'] > y
        all_details.loc[m, 'Peaks'] = 0
But I am getting an error:
ValueError: Arrays were different lengths: 32 vs 1
Can anybody suggest the correct way to do it?
Any help would be appreciated.
First, your solution works fine with your sample data.
Another, faster solution is to create each mask in a loop and then combine them with a logical reduce (or, and, whatever the logic needs):
import datetime
import numpy as np

L = [datetime.date(2017, 8, 9)]
m = np.logical_or.reduce([all_details['Date'] > x for x in L])
all_details.loc[m, 'Peaks'] = 0
In your solution it is better to compare against only the minimal date from the list:
all_details.loc[all_details['Date'] > min(L), 'Peaks'] = 0
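If the goal is instead to zero out only the rows whose date exactly matches an entry in the list, rather than everything after it, isin is a direct fit; a sketch, assuming the 'Date' column holds datetime.date values like the list does:
all_details.loc[all_details['Date'].isin(L), 'Peaks'] = 0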
I'm trying to get all records where the mean of the last 3 rows is greater than the overall mean for all rows in a filtered set.
_filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.mean()>0.05]
_last_n_records = _filtered_d.tail(3)
Something like this:
_filtered_growing = _filtered_d.iloc[:, 0:50].loc[:, _last_n_records.mean() > _filtered_d.mean()]
However, the problem here is that the lengths do not match. Any tips?
ValueError: Series lengths must match to compare
Sample Data
This has an index on the year and month, and 2 columns.
Col1 Col2
year month
2005 12 0.533835 0.170679
12 0.494733 0.198347
2006 3 0.440098 0.202240
6 0.410285 0.188421
9 0.502420 0.200188
12 0.522253 0.118680
2007 3 0.378120 0.171192
6 0.431989 0.145158
9 0.612036 0.178097
12 0.519766 0.252196
2008 3 0.547705 0.202163
6 0.560985 0.238591
9 0.617320 0.199537
12 0.343939 0.253855
Why not just boolean index the columns directly on your filtered DataFrame with
df.loc[:, df.tail(3).mean() > df.mean()]
Demo
>>> df
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
4 9 7 8 9 4
>>> df.loc[:, df.tail(3).mean() > df.mean()]
   0  1  2  3
0  4  8  2  4
1  0  0  0  2
2  5  3  0  9
3  7  5  5  1
4  9  7  8  9
Update for the MultiIndex edit
The same column mask works fine for your MultiIndex sample:
>>> df
col1 col2
2005 12 -0.340088 -0.574140
12 -0.814014 0.430580
2006 3 0.464008 0.438494
6 0.019508 -0.635128
9 0.622645 -0.824526
12 -1.674920 -1.027275
2007 3 0.397133 0.659467
6 0.026170 -0.052063
9 0.835561 0.608067
12 0.736873 -0.613877
2008 3 0.344781 -0.566392
6 -0.653290 -0.264992
9 0.080592 -0.548189
12 0.585642 1.149779
>>> df.loc[:,df.tail(3).mean() > df.mean()]
col2
2005 12 -0.574140
12 0.430580
2006 3 0.438494
6 -0.635128
9 -0.824526
12 -1.027275
2007 3 0.659467
6 -0.052063
9 0.608067
12 -0.613877
2008 3 -0.566392
6 -0.264992
9 -0.548189
12 1.149779
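If the lengths ever disagree again, it helps to build the mask explicitly and inspect its index: both means must be computed over the same set of columns for the comparison to align. A minimal sanity-check sketch on the demo frame:
# Boolean Series indexed by column label
mask = df.tail(3).mean() > df.mean()
print(mask)
# Keep only the columns whose recent mean exceeds the overall mean
growing = df.loc[:, mask]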