How to automate the bins of a column in Python?

Background: I have a dataframe 'test1' with a column 'y' that carries the original values. I applied a model to 'y' and obtained predictions in a column 'Yhat'. To interpret 'Yhat', I bucketed both 'y' and 'Yhat', so each 'Yhat' bucket has a corresponding distribution over 'y' buckets.
If I later have a prediction, e.g. a 3-points-ahead 'Yhat', I can then report the corresponding 'y' bucket categories. For an example, see the dataframe 'test2' and the code below.
Main question: I want to automate this whole process instead of creating the bucket values manually, because as the sample space increases the corresponding bucket values will also change.
test1
y Yhat
1 1
2 1
6 5
2 3
3 4
1 2
4 2
3 4
7 6
5 8
def catY(r):
    if (r['y'] >= 1) & (r['y'] < 3):
        return 'Y_cat_1'
    elif (r['y'] >= 3) & (r['y'] < 6):
        return 'Y_cat_2'
    elif r['y'] >= 6:
        return 'Y_cat_3'

test1['Actual_Y'] = test1.apply(catY, axis=1)
def cat(r):
    if (r['Yhat'] >= 1) & (r['Yhat'] < 3):
        return 'Yhat_cat_1'
    elif (r['Yhat'] >= 3) & (r['Yhat'] < 6):
        return 'Yhat_cat_2'
    elif r['Yhat'] >= 6:
        return 'Yhat_cat_3'

test1['yhat_cat'] = test1.apply(cat, axis=1)
test1.groupby('yhat_cat')['Actual_Y'].value_counts(normalize=True)
yhat_cat Actual_Y
Yhat_cat_1 Y_cat_1 0.75
Y_cat_2 0.25
Yhat_cat_2 Y_cat_2 0.50
Y_cat_1 0.25
Y_cat_3 0.25
Yhat_cat_3 Y_cat_2 0.50
Y_cat_3 0.50
test2
 y  Yhat
 1   1
 2   1
 6   5
 2   3
 3   4
 1   2
 4   2
 3   4
 7   6
 5   8
     2
     5
     1
filter_method1 = lambda x: '0.75' if (x >= 1 and x < 3) else '0.25' if (x >= 3 and x < 6) else '0' if x >= 6 else None
test2['Y_cat_1'] = test2['Yhat'].apply(filter_method1)
filter_method2 = lambda x: '0.25' if (x >= 1 and x < 3) else '0.50' if (x >= 3 and x < 6) else '0.50' if x >= 6 else None
test2['Y_cat_2'] = test2['Yhat'].apply(filter_method2)
filter_method3 = lambda x: '0' if (x >= 1 and x < 3) else '0.25' if (x >= 3 and x < 6) else '0.50' if x >= 6 else None
test2['Y_cat_3'] = test2['Yhat'].apply(filter_method3)
print(test2)
y Yhat Y_cat_1 Y_cat_2 Y_cat_3
0 1.00 1 0.75 0.25 0
1 2.00 1 0.75 0.25 0
2 6.00 5 0.25 0.50 0.25
3 2.00 3 0.25 0.50 0.25
4 3.00 4 0.25 0.50 0.25
5 1.00 2 0.75 0.25 0
6 4.00 2 0.75 0.25 0
7 3.00 4 0.25 0.50 0.25
8 7.00 6 0 0.50 0.50
9 5.00 8 0 0.50 0.50
10 nan 2 0.75 0.25 0
11 nan 5 0.25 0.50 0.25
12 nan 1 0.75 0.25 0

You can use pandas.cut:
import numpy as np
import pandas as pd

bins = [1, 3, 6, np.inf]
labels1 = [f'Y_cat_{x}' for x in range(1, len(bins))]
labels2 = [f'Yhat_cat_{x}' for x in range(1, len(bins))]
test1['Actual_Y'] = pd.cut(test1['y'], bins=bins, labels=labels1, right=False)
test1['yhat_cat'] = pd.cut(test1['Yhat'], bins=bins, labels=labels2, right=False)
print (test1)
y Yhat Actual_Y yhat_cat
0 1 1 Y_cat_1 Yhat_cat_1
1 2 1 Y_cat_1 Yhat_cat_1
2 6 5 Y_cat_3 Yhat_cat_2
3 2 3 Y_cat_1 Yhat_cat_2
4 3 4 Y_cat_2 Yhat_cat_2
5 1 2 Y_cat_1 Yhat_cat_1
6 4 2 Y_cat_2 Yhat_cat_1
7 3 4 Y_cat_2 Yhat_cat_2
8 7 6 Y_cat_3 Yhat_cat_3
9 5 8 Y_cat_2 Yhat_cat_3
Then convert the normalized percentages to a DataFrame with Series.unstack:
df = test1.groupby('yhat_cat')['Actual_Y'].value_counts(normalize=True).unstack(fill_value=0)
print (df)
Actual_Y Y_cat_1 Y_cat_2 Y_cat_3
yhat_cat
Yhat_cat_1 0.75 0.25 0.00
Yhat_cat_2 0.25 0.50 0.25
Yhat_cat_3 0.00 0.50 0.50
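The same row-normalized table can also be built in one step with pd.crosstab (a small variant using the same test1 columns):
# Equivalent to groupby + value_counts(normalize=True) + unstack
df = pd.crosstab(test1['yhat_cat'], test1['Actual_Y'], normalize='index')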
Loop over the columns and dynamically create the new columns from test2['Yhat']:
for c in df.columns:
    # https://stackoverflow.com/a/48447871
    test2[c] = df[c].values[pd.cut(test2['Yhat'], bins=bins, labels=False, right=False)]
print (test2)
y Yhat Y_cat_1 Y_cat_2 Y_cat_3
0 1.0 1 0.75 0.25 0.00
1 2.0 1 0.75 0.25 0.00
2 6.0 5 0.25 0.50 0.25
3 2.0 3 0.25 0.50 0.25
4 3.0 4 0.25 0.50 0.25
5 1.0 2 0.75 0.25 0.00
6 4.0 2 0.75 0.25 0.00
7 3.0 4 0.25 0.50 0.25
8 7.0 6 0.00 0.50 0.50
9 5.0 8 0.00 0.50 0.50
10 NaN 2 0.75 0.25 0.00
11 NaN 5 0.25 0.50 0.25
12 NaN 1 0.75 0.25 0.00
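To automate the bin edges themselves instead of hard-coding bins = [1, 3, 6, np.inf], one option is to derive them from quantiles of y. This is only a sketch, assuming equal-frequency buckets are an acceptable stand-in for the hand-picked edges:
# Assumption: three equal-frequency buckets replace the manual edges.
n_buckets = 3
edges = np.quantile(test1['y'], np.linspace(0, 1, n_buckets + 1))
edges[0], edges[-1] = -np.inf, np.inf   # keep the outer bins open-ended
bins = np.unique(edges)                 # drop duplicate edges if y has ties
labels1 = [f'Y_cat_{i}' for i in range(1, len(bins))]
test1['Actual_Y'] = pd.cut(test1['y'], bins=bins, labels=labels1, right=False)
Recomputing the edges like this lets the buckets adapt automatically as the sample grows.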

Related

Find number of datapoints in each range

I have a data frame that looks like this:
data = [['A', 0.20], ['B',0.25], ['C',0.11], ['D',0.30], ['E',0.29]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
col1 is a primary key (each row has a unique value).
The max of col2 is 1 and the min is 0. I want to find the number of datapoints in the ranges 0-0.30 (both 0 and 0.30 included), 0-0.29, 0-0.28, and so on down to 0-0.01. pd.cut alone does not fit well here, because all of the ranges share the same fixed lower limit of 0.
Can someone help?
One option using numpy broadcasting:
step = 0.01
up = np.arange(0, 0.3+step, step)
out = pd.Series((df['col2'].to_numpy()[:,None] <= up).sum(axis=0), index=up)
Output:
0.00 0
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.10 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.20 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.30 5
dtype: int64
With pandas.cut and cumsum:
step = 0.01
up = np.arange(0, 0.3+step, step)
(pd.cut(df['col2'], up, labels=up[1:].round(2))
   .value_counts(sort=False)
   .cumsum()
)
Output:
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.1 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.2 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.3 5
Name: col2, dtype: int64
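Because every range starts at 0, each count is simply the number of values less than or equal to the upper bound, so sorting once and using numpy.searchsorted is another compact option (a sketch reusing df and up from above):
# side='right' counts, for each bound, how many sorted values are <= it
vals = np.sort(df['col2'].to_numpy())
out = pd.Series(np.searchsorted(vals, up, side='right'), index=up.round(2))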

Adding values of a second dataFrame depending on the index values of a first dataFrame

I have a DataFrame that looks like this:
   base_rate  weighting_factor
0        NaN
1   1.792750
2   1.792944
I have a second DataFrame that looks like this:
min_index max_index weighting_factor
0 0 8 0.15
1 9 17 0.20
2 18 26 0.60
3 27 35 0.80
As you can see, the weighting_factor column in the first DataFrame is empty. How can I add the weighting_factor from the second DataFrame depending on the index?
For example, I want the weighting factor 0.15 to be added for the index range 0-8 and the weighting factor 0.20 for the index range 9-17.
Thanks!
EDIT 1:
Instead of
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.15
5 0.871500 0.15
6 0.813326 0.15
7 0.054184 0.15
8 0.795688 0.15
9 0.560442 0.20
10 0.192447 0.20
11 0.712720 0.20
0 0.623351 0.20
1 0.805375 0.20
2 0.484269 0.20
I want:
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.20
5 0.871500 0.20
6 0.813326 0.20
7 0.054184 0.20
8 0.795688 0.60
9 0.560442 0.60
10 0.192447 0.60
11 0.712720 0.60
12 0.623351 0.80
13 0.805375 0.80
14 0.484269 0.80
15 0.360207 0.80
16 0.889750 1
17 0.503820 1
18 0.779739 1
19 0.116079 1
20 0.417814 1
21 0.423896 1
22 0.801999 1
23 0.034853 1
Since the length of df1 increases, the min_index and max_index ranges increase as well.
A possible solution is to expand your second dataframe:
idx = df2.index.repeat(df2['max_index'] - df2['min_index'] + 1)
df1['weighting_factor'] = df2.reindex(idx)['weighting_factor'].values[:len(df1)]
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.15
5 0.871500 0.15
6 0.813326 0.15
7 0.054184 0.15
8 0.795688 0.15
9 0.560442 0.20
10 0.192447 0.20
11 0.712720 0.20
0 0.623351 0.20
1 0.805375 0.20
2 0.484269 0.20
3 0.360207 0.20
4 0.889750 0.20
5 0.503820 0.20
6 0.779739 0.60
7 0.116079 0.60
8 0.417814 0.60
9 0.423896 0.60
10 0.801999 0.60
11 0.034853 0.60
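An alternative sketch, assuming the [min_index, max_index] ranges are non-overlapping and cover every row of df1: build an IntervalIndex from df2 and look up each row position directly:
import pandas as pd

# Map each row position of df1 into df2's closed [min_index, max_index] ranges.
intervals = pd.IntervalIndex.from_arrays(df2['min_index'], df2['max_index'], closed='both')
pos = intervals.get_indexer(pd.RangeIndex(len(df1)))  # -1 would mean "not covered"
df1['weighting_factor'] = df2['weighting_factor'].to_numpy()[pos]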

How to merge columns and duplicate row values to match in pandas

I want to join two dataframes on 'time', but one df uses 0.25-second intervals and the other uses 1-second intervals. I want to join the values from the 1-second df onto the 0.25-second df, repeating each value while the time is within the corresponding second.
Below are small snippets of the 2 dataframes I want to merge:
time speaker
0.25 1
0.25 2
0.50 1
0.50 2
0.75 1
0.75 2
1.00 1
1.00 2
1.25 1
1.25 2
1.50 1
1.50 2
1.75 1
1.75 2
2.00 1
2.00 2
and:
time label
0 10
1 11
and I want:
time speaker label
0.25 1 10
0.25 2 10
0.50 1 10
0.50 2 10
0.75 1 10
0.75 2 10
1.00 1 10
1.00 2 10
1.25 1 11
1.25 2 11
1.50 1 11
1.50 2 11
1.75 1 11
1.75 2 11
2.00 1 11
2.00 2 11
Thanks!
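For reference, the two frames can be reconstructed like this (a sketch matching the snippets above; the names df1 and df2 are assumptions carried into the answers below):
import numpy as np
import pandas as pd

# df1: a 0.25-second grid with two speakers per timestamp
times = np.arange(0.25, 2.25, 0.25)
df1 = pd.DataFrame({'time': np.repeat(times, 2), 'speaker': [1, 2] * len(times)})
# df2: one label per whole second
df2 = pd.DataFrame({'time': [0, 1], 'label': [10, 11]})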
Here is one way, using merge_asof:
pd.merge_asof(df1, df2.astype(float), on='time', allow_exact_matches=False)
Out[14]:
time speaker label
0 0.25 1 10.0
1 0.25 2 10.0
2 0.50 1 10.0
3 0.50 2 10.0
4 0.75 1 10.0
5 0.75 2 10.0
6 1.00 1 10.0
7 1.00 2 10.0
8 1.25 1 11.0
9 1.25 2 11.0
10 1.50 1 11.0
11 1.50 2 11.0
12 1.75 1 11.0
13 1.75 2 11.0
14 2.00 1 11.0
15 2.00 2 11.0
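Note the df2.astype(float) cast: merge_asof raises a merge error when the two 'time' keys have different dtypes (int vs. float here). A variant that casts only the key column:
out = pd.merge_asof(
    df1,
    df2.assign(time=df2['time'].astype(float)),  # align key dtypes only
    on='time',
    allow_exact_matches=False,  # match the closest strictly earlier time
)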
IIUC, this is a case for pd.cut:
df1['label'] = pd.cut(df1['time'],
                      bins=list(df2['time']) + [np.inf],
                      labels=df2['label'])
Output:
time speaker label
0 0.25 1 10
1 0.25 2 10
2 0.50 1 10
3 0.50 2 10
4 0.75 1 10
5 0.75 2 10
6 1.00 1 10
7 1.00 2 10
8 1.25 1 11
9 1.25 2 11
10 1.50 1 11
11 1.50 2 11
12 1.75 1 11
13 1.75 2 11
14 2.00 1 11
15 2.00 2 11

Pandas change value of column if other column values don't meet criteria

I have the following data frame. I want to check the values of each row for the columns of "mental_illness", "feeling", and "flavor". If all the values for those three columns per row are less than 0.5, I want to change the corresponding value of the "unclassified" column to 1.0.
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 0.0 0.19 0.38 0.16
3 3 word_4 0.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
Expected result:
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 1.0 0.19 0.38 0.16
3 3 word_4 1.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
How do I go about doing so?
Use .le and .all over axis=1:
m = df[['mental_illness', 'feeling', 'flavor']].le(0.5).all(axis=1)
df['unclassified'] = m.astype(int)
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0 0.75 0.30 0.28
1 1 word_2 0 0.17 0.72 0.16
2 2 word_3 1 0.19 0.38 0.16
3 3 word_4 1 0.39 0.20 0.14
4 4 word_5 0 0.72 0.30 0.14
Would this work?
mask1 = df["mental_illness"] < 0.5
mask2 = df["feeling"] < 0.5
mask3 = df["flavor"] < 0.5
df.loc[mask1 & mask2 & mask3, 'unclassified'] = 1
Here is my solution:
cols = ['mental_illness', 'feeling', 'flavor']
data.unclassified = data[cols].le(0.5).apply(lambda x: 1 if sum(x) == 3 else 0, axis=1)
output
sent_no pos unclassified mental_illness feeling flavor
0 0 Word_1 0 0.75 0.30 0.28
1 1 Word_2 0 0.17 0.72 0.16
2 2 Word_3 1 0.19 0.38 0.16
3 3 Word_4 1 0.39 0.20 0.14
4 4 Word_5 0 0.72 0.30 0.14
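If you prefer to keep the float dtype and only overwrite the qualifying rows, a np.where variant also works (a sketch; it uses strict < 0.5 as in the question text):
import numpy as np

cols = ['mental_illness', 'feeling', 'flavor']
# 1.0 where all three scores are strictly below 0.5, otherwise keep the value
df['unclassified'] = np.where(df[cols].lt(0.5).all(axis=1), 1.0, df['unclassified'])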

Rearranging columns after groupby in pandas

I created a DataFrame like this:
df_example = pd.DataFrame({'A': [1, 1, 6, 6, 6, 3, 4, 4],
                           'val_A': [3, 4, 1, 1, 2, 1, 1, 1],
                           'val_B': [4, 5, 2, 2, 3, 2, 2, 2],
                           'val_A_frac': [0.25, 0.25, 0.3, 0.7, 0.2, 0.1, 0.4, 0.5],
                           'val_B_frac': [0.75, 0.65, 0, 0.3, np.nan, np.nan, np.nan, np.nan]},
                          columns=['A', 'val_A', 'val_B', 'val_A_frac', 'val_B_frac'])
Then I grouped by A, val_A, and val_B and summed val_A_frac and val_B_frac:
sum_df_ex = df_example.groupby(['A','val_A','val_B']).agg({'val_A_frac':'sum', 'val_B_frac':'sum'})
I got this df:
sum_df_ex
Out[67]:
val_A_frac val_B_frac
A val_A val_B
1 3 4 0.25 0.75
4 5 0.25 0.65
3 1 2 0.10 0.00
4 1 2 0.90 0.00
6 1 2 1.00 0.30
2 3 0.20 0.00
Groupby operations resulted in two columns:
sum_df_ex.columns
Out[68]: Index(['val_A_frac', 'val_B_frac'], dtype='object')
I want the groupby result to keep all the displayed columns as regular columns, i.e. like this:
Out[67]:
A val_A val_B val_A_frac val_B_frac
1 3 4 0.25 0.75
4 5 0.25 0.65
3 1 2 0.10 0.00
4 1 2 0.90 0.00
6 1 2 1.00 0.30
2 3 0.20 0.00
How to do this?
Use reset_index():
sum_df_ex = df_example.groupby(['A','val_A','val_B']).agg({'val_A_frac':'sum', 'val_B_frac':'sum'}).reset_index()
Output:
A val_A val_B val_B_frac val_A_frac
0 1 3 4 0.75 0.25
1 1 4 5 0.65 0.25
2 3 1 2 NaN 0.10
3 4 1 2 NaN 0.90
4 6 1 2 0.30 1.00
5 6 2 3 NaN 0.20
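Equivalently, passing as_index=False to groupby keeps the grouping keys as regular columns from the start:
sum_df_ex = df_example.groupby(['A', 'val_A', 'val_B'], as_index=False).agg(
    {'val_A_frac': 'sum', 'val_B_frac': 'sum'}
)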
