How to automate the bins of a column in Python?

Background: I have a dataframe 'test1' with a column 'y' that carries the original values. I applied a model to 'y' and obtained predictions in a column 'Yhat'. To interpret 'Yhat', I bucketed both 'y' and 'Yhat', so each 'Yhat' bucket has a corresponding distribution over 'y' buckets.
If I later have a prediction, e.g. a 3-points-ahead 'Yhat', I can then report the corresponding 'y' bucket categories. For an example, see the dataframe 'test2' and the code below.
Main question: I want to automate this whole process instead of creating the bucket values manually, because as the sample space increases the corresponding bucket values will also change.
test1
y Yhat
1 1
2 1
6 5
2 3
3 4
1 2
4 2
3 4
7 6
5 8
def catY(r):
    if (r['y'] >= 1) & (r['y'] < 3):
        return 'Y_cat_1'
    elif (r['y'] >= 3) & (r['y'] < 6):
        return 'Y_cat_2'
    elif r['y'] >= 6:
        return 'Y_cat_3'

test1['Actual_Y'] = test1.apply(catY, axis=1)
def cat(r):
    if (r['Yhat'] >= 1) & (r['Yhat'] < 3):
        return 'Yhat_cat_1'
    elif (r['Yhat'] >= 3) & (r['Yhat'] < 6):
        return 'Yhat_cat_2'
    elif r['Yhat'] >= 6:
        return 'Yhat_cat_3'

test1['yhat_cat'] = test1.apply(cat, axis=1)
test1.groupby('yhat_cat')['Actual_Y'].value_counts(normalize=True)
yhat_cat Actual_Y
Yhat_cat_1 Y_cat_1 0.75
Y_cat_2 0.25
Yhat_cat_2 Y_cat_2 0.50
Y_cat_1 0.25
Y_cat_3 0.25
Yhat_cat_3 Y_cat_2 0.50
Y_cat_3 0.50
test2
 y  Yhat
 1   1
 2   1
 6   5
 2   3
 3   4
 1   2
 4   2
 3   4
 7   6
 5   8
     2
     5
     1
filter_method1 = lambda x: '0.75' if (x >= 1 and x < 3) else '0.25' if (x >= 3 and x < 6) else '0' if x >= 6 else None
test2['Y_cat_1'] = test2['Yhat'].apply(filter_method1)
filter_method2 = lambda x: '0.25' if (x >= 1 and x < 3) else '0.50' if (x >= 3 and x < 6) else '0.50' if x >= 6 else None
test2['Y_cat_2'] = test2['Yhat'].apply(filter_method2)
filter_method3 = lambda x: '0' if (x >= 1 and x < 3) else '0.25' if (x >= 3 and x < 6) else '0.50' if x >= 6 else None
test2['Y_cat_3'] = test2['Yhat'].apply(filter_method3)
print(test2)
y Yhat Y_cat_1 Y_cat_2 Y_cat_3
0 1.00 1 0.75 0.25 0
1 2.00 1 0.75 0.25 0
2 6.00 5 0.25 0.50 0.25
3 2.00 3 0.25 0.50 0.25
4 3.00 4 0.25 0.50 0.25
5 1.00 2 0.75 0.25 0
6 4.00 2 0.75 0.25 0
7 3.00 4 0.25 0.50 0.25
8 7.00 6 0 0.50 0.50
9 5.00 8 0 0.50 0.50
10 nan 2 0.75 0.25 0
11 nan 5 0.25 0.50 0.25
12 nan 1 0.75 0.25 0

You can use pandas.cut:
import numpy as np
import pandas as pd

bins = [1, 3, 6, np.inf]
labels1 = [f'Y_cat_{x}' for x in range(1, len(bins))]
labels2 = [f'Yhat_cat_{x}' for x in range(1, len(bins))]
test1['Actual_Y'] = pd.cut(test1['y'], bins=bins, labels=labels1, right=False)
test1['yhat_cat'] = pd.cut(test1['Yhat'], bins=bins, labels=labels2, right=False)
print (test1)
y Yhat Actual_Y yhat_cat
0 1 1 Y_cat_1 Yhat_cat_1
1 2 1 Y_cat_1 Yhat_cat_1
2 6 5 Y_cat_3 Yhat_cat_2
3 2 3 Y_cat_1 Yhat_cat_2
4 3 4 Y_cat_2 Yhat_cat_2
5 1 2 Y_cat_1 Yhat_cat_1
6 4 2 Y_cat_2 Yhat_cat_1
7 3 4 Y_cat_2 Yhat_cat_2
8 7 6 Y_cat_3 Yhat_cat_3
9 5 8 Y_cat_2 Yhat_cat_3
Then convert the normalized percentages to a DataFrame with Series.unstack:
df = test1.groupby('yhat_cat')['Actual_Y'].value_counts(normalize=True).unstack(fill_value=0)
print (df)
Actual_Y Y_cat_1 Y_cat_2 Y_cat_3
yhat_cat
Yhat_cat_1 0.75 0.25 0.00
Yhat_cat_2 0.25 0.50 0.25
Yhat_cat_3 0.00 0.50 0.50
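The same row-normalized table can also be built in one step with pd.crosstab (a small variant using the same test1 columns):
# Equivalent to groupby + value_counts(normalize=True) + unstack
df = pd.crosstab(test1['yhat_cat'], test1['Actual_Y'], normalize='index')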
Loop over the columns and dynamically create the new columns from test2['Yhat']:
for c in df.columns:
    # https://stackoverflow.com/a/48447871
    test2[c] = df[c].values[pd.cut(test2['Yhat'], bins=bins, labels=False, right=False)]
print (test2)
y Yhat Y_cat_1 Y_cat_2 Y_cat_3
0 1.0 1 0.75 0.25 0.00
1 2.0 1 0.75 0.25 0.00
2 6.0 5 0.25 0.50 0.25
3 2.0 3 0.25 0.50 0.25
4 3.0 4 0.25 0.50 0.25
5 1.0 2 0.75 0.25 0.00
6 4.0 2 0.75 0.25 0.00
7 3.0 4 0.25 0.50 0.25
8 7.0 6 0.00 0.50 0.50
9 5.0 8 0.00 0.50 0.50
10 NaN 2 0.75 0.25 0.00
11 NaN 5 0.25 0.50 0.25
12 NaN 1 0.75 0.25 0.00
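To automate the bin edges themselves instead of hard-coding bins = [1, 3, 6, np.inf], one option is to derive them from quantiles of y. This is only a sketch, assuming equal-frequency buckets are an acceptable stand-in for the hand-picked edges:
# Assumption: three equal-frequency buckets replace the manual edges.
n_buckets = 3
edges = np.quantile(test1['y'], np.linspace(0, 1, n_buckets + 1))
edges[0], edges[-1] = -np.inf, np.inf   # keep the outer bins open-ended
bins = np.unique(edges)                 # drop duplicate edges if y has ties
labels1 = [f'Y_cat_{i}' for i in range(1, len(bins))]
test1['Actual_Y'] = pd.cut(test1['y'], bins=bins, labels=labels1, right=False)
Recomputing the edges like this lets the buckets adapt automatically as the sample grows.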

Related

Find number of datapoints in each range

I have a data frame that looks like this:
data = [['A', 0.20], ['B',0.25], ['C',0.11], ['D',0.30], ['E',0.29]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
col1 is a primary key (each row has a unique value).
The max of col2 is 1 and the min is 0. I want to find the number of datapoints in the ranges 0-0.30 (both 0 and 0.30 included), 0-0.29, 0-0.28, and so on down to 0-0.01. pd.cut alone does not fit well here, because all of the ranges share the same fixed lower limit of 0.
Can someone help?
One option using numpy broadcasting:
step = 0.01
up = np.arange(0, 0.3+step, step)
out = pd.Series((df['col2'].to_numpy()[:,None] <= up).sum(axis=0), index=up)
Output:
0.00 0
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.10 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.20 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.30 5
dtype: int64
With pandas.cut and cumsum:
step = 0.01
up = np.arange(0, 0.3+step, step)
(pd.cut(df['col2'], up, labels=up[1:].round(2))
   .value_counts(sort=False)
   .cumsum()
)
Output:
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.1 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.2 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.3 5
Name: col2, dtype: int64
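Because every range starts at 0, each count is simply the number of values less than or equal to the upper bound, so sorting once and using numpy.searchsorted is another compact option (a sketch reusing df and up from above):
# side='right' counts, for each bound, how many sorted values are <= it
vals = np.sort(df['col2'].to_numpy())
out = pd.Series(np.searchsorted(vals, up, side='right'), index=up.round(2))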

Adding values of a second dataFrame depending on the index values of a first dataFrame

I have a DataFrame that looks like this:
   base_rate  weighting_factor
0        NaN
1   1.792750
2   1.792944
I have a second DataFrame that looks like this:
min_index max_index weighting_factor
0 0 8 0.15
1 9 17 0.20
2 18 26 0.60
3 27 35 0.80
As you can see, the weighting_factor column in the first DataFrame is empty. How can I add the weighting_factor from the second DataFrame depending on the index?
For example, I want the weighting factor 0.15 to be added for the index range 0-8 and the weighting factor 0.20 for the index range 9-17.
Thanks!
EDIT 1:
Instead of
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.15
5 0.871500 0.15
6 0.813326 0.15
7 0.054184 0.15
8 0.795688 0.15
9 0.560442 0.20
10 0.192447 0.20
11 0.712720 0.20
0 0.623351 0.20
1 0.805375 0.20
2 0.484269 0.20
I want:
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.20
5 0.871500 0.20
6 0.813326 0.20
7 0.054184 0.20
8 0.795688 0.60
9 0.560442 0.60
10 0.192447 0.60
11 0.712720 0.60
12 0.623351 0.80
13 0.805375 0.80
14 0.484269 0.80
15 0.360207 0.80
16 0.889750 1
17 0.503820 1
18 0.779739 1
19 0.116079 1
20 0.417814 1
21 0.423896 1
22 0.801999 1
23 0.034853 1
Since the length of df1 increases, the min_index and max_index ranges increase as well.
A possible solution is to expand your second dataframe:
idx = df2.index.repeat(df2['max_index'] - df2['min_index'] + 1)
df1['weighting_factor'] = df2.reindex(idx)['weighting_factor'].values[:len(df1)]
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.15
5 0.871500 0.15
6 0.813326 0.15
7 0.054184 0.15
8 0.795688 0.15
9 0.560442 0.20
10 0.192447 0.20
11 0.712720 0.20
0 0.623351 0.20
1 0.805375 0.20
2 0.484269 0.20
3 0.360207 0.20
4 0.889750 0.20
5 0.503820 0.20
6 0.779739 0.60
7 0.116079 0.60
8 0.417814 0.60
9 0.423896 0.60
10 0.801999 0.60
11 0.034853 0.60
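An alternative sketch, assuming the [min_index, max_index] ranges are non-overlapping and cover every row of df1: build an IntervalIndex from df2 and look up each row position directly:
import pandas as pd

# Map each row position of df1 into df2's closed [min_index, max_index] ranges.
intervals = pd.IntervalIndex.from_arrays(df2['min_index'], df2['max_index'], closed='both')
pos = intervals.get_indexer(pd.RangeIndex(len(df1)))  # -1 would mean "not covered"
df1['weighting_factor'] = df2['weighting_factor'].to_numpy()[pos]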

How to merge columns and duplicate row values to match in pandas

I want to join two dataframes on 'time', but one df uses 0.25-second intervals and the other uses 1-second intervals. I want to join the values from the 1-second df onto the 0.25-second df, repeating each value while the time is within the corresponding second.
Below are small snippets of the 2 dataframes I want to merge:
time speaker
0.25 1
0.25 2
0.50 1
0.50 2
0.75 1
0.75 2
1.00 1
1.00 2
1.25 1
1.25 2
1.50 1
1.50 2
1.75 1
1.75 2
2.00 1
2.00 2
and:
time label
0 10
1 11
and I want:
time speaker label
0.25 1 10
0.25 2 10
0.50 1 10
0.50 2 10
0.75 1 10
0.75 2 10
1.00 1 10
1.00 2 10
1.25 1 11
1.25 2 11
1.50 1 11
1.50 2 11
1.75 1 11
1.75 2 11
2.00 1 11
2.00 2 11
Thanks!
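For reference, the two frames can be reconstructed like this (a sketch matching the snippets above; the names df1 and df2 are assumptions carried into the answers below):
import numpy as np
import pandas as pd

# df1: a 0.25-second grid with two speakers per timestamp
times = np.arange(0.25, 2.25, 0.25)
df1 = pd.DataFrame({'time': np.repeat(times, 2), 'speaker': [1, 2] * len(times)})
# df2: one label per whole second
df2 = pd.DataFrame({'time': [0, 1], 'label': [10, 11]})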
Here is one way, using merge_asof:
pd.merge_asof(df1, df2.astype(float), on='time', allow_exact_matches=False)
Out[14]:
time speaker label
0 0.25 1 10.0
1 0.25 2 10.0
2 0.50 1 10.0
3 0.50 2 10.0
4 0.75 1 10.0
5 0.75 2 10.0
6 1.00 1 10.0
7 1.00 2 10.0
8 1.25 1 11.0
9 1.25 2 11.0
10 1.50 1 11.0
11 1.50 2 11.0
12 1.75 1 11.0
13 1.75 2 11.0
14 2.00 1 11.0
15 2.00 2 11.0
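Note the df2.astype(float) cast: merge_asof raises a merge error when the two 'time' keys have different dtypes (int vs. float here). A variant that casts only the key column:
out = pd.merge_asof(
    df1,
    df2.assign(time=df2['time'].astype(float)),  # align key dtypes only
    on='time',
    allow_exact_matches=False,  # match the closest strictly earlier time
)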
IIUC, this is a case for pd.cut:
df1['label'] = pd.cut(df1['time'],
                      bins=list(df2['time']) + [np.inf],
                      labels=df2['label'])
Output:
time speaker label
0 0.25 1 10
1 0.25 2 10
2 0.50 1 10
3 0.50 2 10
4 0.75 1 10
5 0.75 2 10
6 1.00 1 10
7 1.00 2 10
8 1.25 1 11
9 1.25 2 11
10 1.50 1 11
11 1.50 2 11
12 1.75 1 11
13 1.75 2 11
14 2.00 1 11
15 2.00 2 11

Pandas change value of column if other column values don't meet criteria

I have the following data frame. I want to check the values of each row for the columns of "mental_illness", "feeling", and "flavor". If all the values for those three columns per row are less than 0.5, I want to change the corresponding value of the "unclassified" column to 1.0.
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 0.0 0.19 0.38 0.16
3 3 word_4 0.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
Expected result:
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 1.0 0.19 0.38 0.16
3 3 word_4 1.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
How do I go about doing so?
Use .le and .all over axis=1:
m = df[['mental_illness', 'feeling', 'flavor']].le(0.5).all(axis=1)
df['unclassified'] = m.astype(int)
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0 0.75 0.30 0.28
1 1 word_2 0 0.17 0.72 0.16
2 2 word_3 1 0.19 0.38 0.16
3 3 word_4 1 0.39 0.20 0.14
4 4 word_5 0 0.72 0.30 0.14
Would this work?
mask1 = df["mental_illness"] < 0.5
mask2 = df["feeling"] < 0.5
mask3 = df["flavor"] < 0.5
df.loc[mask1 & mask2 & mask3, 'unclassified'] = 1
Here is my solution:
cols = ['mental_illness', 'feeling', 'flavor']
data.unclassified = data[cols].le(0.5).apply(lambda x: 1 if sum(x) == 3 else 0, axis=1)
output
sent_no pos unclassified mental_illness feeling flavor
0 0 Word_1 0 0.75 0.30 0.28
1 1 Word_2 0 0.17 0.72 0.16
2 2 Word_3 1 0.19 0.38 0.16
3 3 Word_4 1 0.39 0.20 0.14
4 4 Word_5 0 0.72 0.30 0.14
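If you prefer to keep the float dtype and only overwrite the qualifying rows, a np.where variant also works (a sketch; it uses strict < 0.5 as in the question text):
import numpy as np

cols = ['mental_illness', 'feeling', 'flavor']
# 1.0 where all three scores are strictly below 0.5, otherwise keep the value
df['unclassified'] = np.where(df[cols].lt(0.5).all(axis=1), 1.0, df['unclassified'])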

Rearranging columns after groupby in pandas

I created a DataFrame like this:
df_example = pd.DataFrame({'A': [1, 1, 6, 6, 6, 3, 4, 4],
                           'val_A': [3, 4, 1, 1, 2, 1, 1, 1],
                           'val_B': [4, 5, 2, 2, 3, 2, 2, 2],
                           'val_A_frac': [0.25, 0.25, 0.3, 0.7, 0.2, 0.1, 0.4, 0.5],
                           'val_B_frac': [0.75, 0.65, 0, 0.3, np.nan, np.nan, np.nan, np.nan]},
                          columns=['A', 'val_A', 'val_B', 'val_A_frac', 'val_B_frac'])
Then I grouped by A, val_A, and val_B and summed val_A_frac and val_B_frac:
sum_df_ex = df_example.groupby(['A','val_A','val_B']).agg({'val_A_frac':'sum', 'val_B_frac':'sum'})
I got this df:
sum_df_ex
Out[67]:
val_A_frac val_B_frac
A val_A val_B
1 3 4 0.25 0.75
4 5 0.25 0.65
3 1 2 0.10 0.00
4 1 2 0.90 0.00
6 1 2 1.00 0.30
2 3 0.20 0.00
Groupby operations resulted in two columns:
sum_df_ex.columns
Out[68]: Index(['val_A_frac', 'val_B_frac'], dtype='object')
I want the groupby result to keep all the displayed columns as regular columns, i.e. like this:
Out[67]:
A val_A val_B val_A_frac val_B_frac
1 3 4 0.25 0.75
4 5 0.25 0.65
3 1 2 0.10 0.00
4 1 2 0.90 0.00
6 1 2 1.00 0.30
2 3 0.20 0.00
How to do this?
Use reset_index():
sum_df_ex = df_example.groupby(['A','val_A','val_B']).agg({'val_A_frac':'sum', 'val_B_frac':'sum'}).reset_index()
Output:
A val_A val_B val_B_frac val_A_frac
0 1 3 4 0.75 0.25
1 1 4 5 0.65 0.25
2 3 1 2 NaN 0.10
3 4 1 2 NaN 0.90
4 6 1 2 0.30 1.00
5 6 2 3 NaN 0.20
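Equivalently, passing as_index=False to groupby keeps the grouping keys as regular columns from the start:
sum_df_ex = df_example.groupby(['A', 'val_A', 'val_B'], as_index=False).agg(
    {'val_A_frac': 'sum', 'val_B_frac': 'sum'}
)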
