Rearranging columns after groupby in pandas - python

I created a DataFrame like this:
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'A': [1, 1, 6, 6, 6, 3, 4, 4],
                           'val_A': [3, 4, 1, 1, 2, 1, 1, 1],
                           'val_B': [4, 5, 2, 2, 3, 2, 2, 2],
                           'val_A_frac': [0.25, 0.25, 0.3, 0.7, 0.2, 0.1, 0.4, 0.5],
                           'val_B_frac': [0.75, 0.65, 0, 0.3, np.nan, np.nan, np.nan, np.nan]},
                          columns=['A', 'val_A', 'val_B', 'val_A_frac', 'val_B_frac'])
Then I ran a groupby operation on A, val_A, and val_B, summing val_A_frac and val_B_frac:
sum_df_ex = df_example.groupby(['A','val_A','val_B']).agg({'val_A_frac':'sum', 'val_B_frac':'sum'})
I got this df:
sum_df_ex
Out[67]:
val_A_frac val_B_frac
A val_A val_B
1 3 4 0.25 0.75
4 5 0.25 0.65
3 1 2 0.10 0.00
4 1 2 0.90 0.00
6 1 2 1.00 0.30
2 3 0.20 0.00
Groupby operations resulted in two columns:
sum_df_ex.columns
Out[68]: Index(['val_A_frac', 'val_B_frac'], dtype='object')
I want the result of the groupby to be a regular DataFrame containing all of the columns displayed above, i.e. like this:
Out[67]:
A val_A val_B val_A_frac val_B_frac
1 3 4 0.25 0.75
4 5 0.25 0.65
3 1 2 0.10 0.00
4 1 2 0.90 0.00
6 1 2 1.00 0.30
2 3 0.20 0.00
How to do this?

Use reset_index():
sum_df_ex = df_example.groupby(['A','val_A','val_B']).agg({'val_A_frac':'sum', 'val_B_frac':'sum'}).reset_index()
Output:
A val_A val_B val_B_frac val_A_frac
0 1 3 4 0.75 0.25
1 1 4 5 0.65 0.25
2 3 1 2 NaN 0.10
3 4 1 2 NaN 0.90
4 6 1 2 0.30 1.00
5 6 2 3 NaN 0.20
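Alternatively, passing as_index=False to groupby keeps the grouping keys as regular columns from the start; a minimal sketch, equivalent to the reset_index approach for this case:

# as_index=False returns the grouping keys as ordinary columns
sum_df_ex = df_example.groupby(['A', 'val_A', 'val_B'], as_index=False).agg(
    {'val_A_frac': 'sum', 'val_B_frac': 'sum'})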

Related

How to handle strings in numeric data columns in a dataset using pandas?

I am working on a dataset where a few values in one of the columns are strings. Because of that, I am getting errors while performing operations on the dataset.
Sample dataset:
1.99 LOHARU 0.3 2 0 2 0.3 5 2 0 2 2
1.99 31 0.76 2 0 2 0.76 5 2 7.48 4 2
1.99 4 0.96 2 0 2 0.96 5 2 9.45 4 2
1.99 14 1.26 4 0 2 1.26 5 2 0 2 2
1.99 NUH 0.55 2 0 2 0.55 5 2 0.67 2 2
1.99 99999 0.29 2 0 2 0.29 5 2 0.06 2 2
The full dataset can be found here: https://www.kaggle.com/sid321axn/audit-data?select=trial.csv
I need to find the missing values and outliers in the dataset. Below is the code I am using to find missing values:
# Replacing zeros and 99999 with NaN
dataset[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]] = dataset[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]].replace(99999, np.nan)
# if 12, 14 and 17 can have zeroes then
dataset[[0,1,2,3,4,5,6,7,8,9,10,11,13,15,16]] = dataset[[0,1,2,3,4,5,6,7,8,9,10,11,13,15,16]].replace(0, np.nan)
print(dataset.isnull().sum())
but this doesn't replace 99999 with NaN
and to find outliers, I am calculating the z-score:
import scipy.stats as stats
array = dataset.values
Z = stats.zscore(array)
but it gives me the error below:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
IIUC, you want to remove the non-numeric values. For this you can use pandas.to_numeric with the errors='coerce' option. This will replace non-numeric values with NaN and let you perform numeric operations:
df = df.apply(pd.to_numeric, errors='coerce')
output:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12
0 1.99 NaN 0.30 2 0 2 0.30 5 2 0.00 2 2
1 1.99 31.0 0.76 2 0 2 0.76 5 2 7.48 4 2
2 1.99 4.0 0.96 2 0 2 0.96 5 2 9.45 4 2
3 1.99 14.0 1.26 4 0 2 1.26 5 2 0.00 2 2
4 1.99 NaN 0.55 2 0 2 0.55 5 2 0.67 2 2
5 1.99 99999.0 0.29 2 0 2 0.29 5 2 0.06 2 2
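Putting the pieces together with the sentinel replacement and the z-score step from the question, a minimal sketch (assuming the CSV has no header row, as the sample suggests):

import numpy as np
import pandas as pd
from scipy import stats

dataset = pd.read_csv('trial.csv', header=None)

# coerce every column to numeric; strings like 'LOHARU' become NaN
dataset = dataset.apply(pd.to_numeric, errors='coerce')

# the replacement now works, because 99999 is numeric rather than the string '99999'
dataset = dataset.replace(99999, np.nan)
print(dataset.isnull().sum())

# z-scores need purely numeric input; drop rows with NaN first
Z = stats.zscore(dataset.dropna())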

Adding values of a second DataFrame depending on the index values of a first DataFrame

I have a DataFrame that looks like this:
base_rate weighting_factor
0 NaN
1 1.792750
2 1.792944
I have a second DataFrame that looks like this:
min_index max_index weighting_factor
0 0 8 0.15
1 9 17 0.20
2 18 26 0.60
3 27 35 0.80
As you can see, the weighting_factor column in the first DataFrame is empty. How can I add the weighting_factor from the second DataFrame depending on the index?
For example, I want the weighting factor 0.15 added over the index range 0-8 and the weighting factor 0.20 over the index range 9-17.
Thanks!
EDIT 1:
Instead of this output (note the index values repeating after row 11):
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.15
5 0.871500 0.15
6 0.813326 0.15
7 0.054184 0.15
8 0.795688 0.15
9 0.560442 0.20
10 0.192447 0.20
11 0.712720 0.20
0 0.623351 0.20
1 0.805375 0.20
2 0.484269 0.20
I want the weighting factors to keep following the index ranges over the full (longer) df1, with a continuous index.
A possible solution is to expand your second dataframe:
idx = df2.index.repeat(df2['max_index'] - df2['min_index'] + 1)
df1['weighting_factor'] = df2.reindex(idx)['weighting_factor'].values[:len(df1)]
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.20
5 0.871500 0.20
6 0.813326 0.20
7 0.054184 0.20
8 0.795688 0.60
9 0.560442 0.60
10 0.192447 0.60
11 0.712720 0.60
12 0.623351 0.80
13 0.805375 0.80
14 0.484269 0.80
15 0.360207 0.80
16 0.889750 1
17 0.503820 1
18 0.779739 1
19 0.116079 1
20 0.417814 1
21 0.423896 1
22 0.801999 1
23 0.034853 1
Since the length of df1 increases, the min_index and max_index ranges increase with it, which is why the factors above differ from the original df2.
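An alternative sketch, assuming min_index/max_index refer to the positional index of df1 and that the ranges cover every row: build an IntervalIndex from df2 and look each row position up directly:

import numpy as np
import pandas as pd

# closed='both' so that min_index and max_index both belong to the range
intervals = pd.IntervalIndex.from_arrays(
    df2['min_index'], df2['max_index'], closed='both')

# position of the containing interval for each row of df1 (-1 if uncovered)
pos = intervals.get_indexer(np.arange(len(df1)))
df1['weighting_factor'] = df2['weighting_factor'].to_numpy()[pos]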

Delete rows from a pandas DataFrame based on a conditional expression, iteration (n) and (n+1)

Given a dataframe as follows:
col1
1 0.6
2 0.88
3 1.2
4 1.2
5 1.2
6 0.55
7 0.55
8 0.65
I want to delete rows where the value in row (n+1) is the same as the value in row (n), so that this would yield:
col1
1 0.6
2 0.88
3 1.2
4 row deleted
5 row deleted
6 0.55
7 row deleted
8 0.65
In [191]: df[df["col1"] != df["col1"].shift()]
Out[191]:
col1
1 0.60
2 0.88
3 1.20
6 0.55
8 0.65
Ok, let's do:
df = df[df['col1'].diff().ne(0)]
Try this:
df = df[~df['col1'].eq(df['col1'].shift(1))]
print(df)
col1
0 0.60
1 0.88
2 1.20
5 0.55
7 0.65
Or:
df = df[df['col1'].ne(df['col1'].shift(1))]
print(df)
col1
0 0.60
1 0.88
2 1.20
5 0.55
7 0.65
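If the comparison should span every column rather than just col1 (an assumption beyond the original question), the same shift trick works row-wise:

# keep a row only if it differs from the previous row in at least one column
df = df[df.ne(df.shift()).any(axis=1)]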

How to automate the bins of a column in python?

Background information: I have a dataframe 'test1' with a column 'y' which carries the original values. I applied a model to 'y' and got predictions in a column named 'Yhat'. I need to modify my 'Yhat', so I have bucketed both 'y' and 'Yhat'. For a particular bucket of 'Yhat' there is a corresponding 'y' bucket.
Now, if in the future I have a prediction 3 points ahead, i.e. 'Yhat', I can provide the corresponding 'y' bucket categories. For an example, see dataframe 'test2' and the code below.
Main query: To avoid manually creating bucket values, I want to automate this whole process. The reason for automating is that as the sample space increases, the corresponding bucket values will also change.
test1
y Yhat
1 1
2 1
6 5
2 3
3 4
1 2
4 2
3 4
7 6
5 8
def catY(r):
    if (r['y'] >= 1) & (r['y'] < 3):
        return 'Y_cat_1'
    elif (r['y'] >= 3) & (r['y'] < 6):
        return 'Y_cat_2'
    elif r['y'] >= 6:
        return 'Y_cat_3'

test1['Actual_Y'] = test1.apply(catY, axis=1)
def cat(r):
    if (r['Yhat'] >= 1) & (r['Yhat'] < 3):
        return 'Yhat_cat_1'
    elif (r['Yhat'] >= 3) & (r['Yhat'] < 6):
        return 'Yhat_cat_2'
    elif r['Yhat'] >= 6:
        return 'Yhat_cat_3'

test1['yhat_cat'] = test1.apply(cat, axis=1)
test1.groupby('yhat_cat')['Actual_Y'].value_counts(normalize=True)
yhat_cat Actual_Y
Yhat_cat_1 Y_cat_1 0.75
Y_cat_2 0.25
Yhat_cat_2 Y_cat_2 0.50
Y_cat_1 0.25
Y_cat_3 0.25
Yhat_cat_3 Y_cat_2 0.50
Y_cat_3 0.50
test2
y Yhat
1 1
2 1
6 5
2 3
3 4
1 2
4 2
3 4
7 6
5 8
2
5
1
filter_method1 = lambda x: '0.75' if ( x >=1 and x <3) else '0.25' if (x >=3 and x <6) else '0' if x >=6 else None
test2['Y_cat_1'] = test2['Yhat'].apply(filter_method1)
filter_method2 = lambda x: '0.25' if ( x >=1 and x <3) else '0.50' if (x >=3 and x <6) else '0.50' if x >=6 else None
test2['Y_cat_2'] = test2['Yhat'].apply(filter_method2)
filter_method3 = lambda x: '0' if ( x >=1 and x <3) else '0.25' if (x >=3 and x <6) else '0.50' if x >=6 else None
test2['Y_cat_3'] = test2['Yhat'].apply(filter_method3)
print(test2)
y Yhat Y_cat_1 Y_cat_2 Y_cat_3
0 1.00 1 0.75 0.25 0
1 2.00 1 0.75 0.25 0
2 6.00 5 0.25 0.50 0.25
3 2.00 3 0.25 0.50 0.25
4 3.00 4 0.25 0.50 0.25
5 1.00 2 0.75 0.25 0
6 4.00 2 0.75 0.25 0
7 3.00 4 0.25 0.50 0.25
8 7.00 6 0 0.50 0.50
9 5.00 8 0 0.50 0.50
10 nan 2 0.75 0.25 0
11 nan 5 0.25 0.50 0.25
12 nan 1 0.75 0.25 0
You can use cut:
bins = [1,3,6,np.inf]
labels1 = [f'Y_cat_{x}' for x in range(1, len(bins))]
labels2 = [f'Yhat_cat_{x}' for x in range(1, len(bins))]
test1['Actual_Y'] = pd.cut(test1['y'], bins=bins, labels=labels1, right=False)
test1['yhat_cat'] = pd.cut(test1['Yhat'], bins=bins, labels=labels2, right=False)
print (test1)
y Yhat Actual_Y yhat_cat
0 1 1 Y_cat_1 Yhat_cat_1
1 2 1 Y_cat_1 Yhat_cat_1
2 6 5 Y_cat_3 Yhat_cat_2
3 2 3 Y_cat_1 Yhat_cat_2
4 3 4 Y_cat_2 Yhat_cat_2
5 1 2 Y_cat_1 Yhat_cat_1
6 4 2 Y_cat_2 Yhat_cat_1
7 3 4 Y_cat_2 Yhat_cat_2
8 7 6 Y_cat_3 Yhat_cat_3
9 5 8 Y_cat_2 Yhat_cat_3
Then convert the normalized percentages to a DataFrame with Series.unstack:
df = test1.groupby('yhat_cat')['Actual_Y'].value_counts(normalize=True).unstack(fill_value=0)
print (df)
Actual_Y Y_cat_1 Y_cat_2 Y_cat_3
yhat_cat
Yhat_cat_1 0.75 0.25 0.00
Yhat_cat_2 0.25 0.50 0.25
Yhat_cat_3 0.00 0.50 0.50
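A hedged equivalent for this normalized table is pd.crosstab with normalize='index':

# each row of the crosstab sums to 1, matching value_counts(normalize=True)
df = pd.crosstab(test1['yhat_cat'], test1['Actual_Y'], normalize='index')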
Loop over the columns and dynamically create the new columns from test2['Yhat']:
for c in df.columns:
    # https://stackoverflow.com/a/48447871
    test2[c] = df[c].values[pd.cut(test2['Yhat'], bins=bins, labels=False, right=False)]
print (test2)
y Yhat Y_cat_1 Y_cat_2 Y_cat_3
0 1.0 1 0.75 0.25 0.00
1 2.0 1 0.75 0.25 0.00
2 6.0 5 0.25 0.50 0.25
3 2.0 3 0.25 0.50 0.25
4 3.0 4 0.25 0.50 0.25
5 1.0 2 0.75 0.25 0.00
6 4.0 2 0.75 0.25 0.00
7 3.0 4 0.25 0.50 0.25
8 7.0 6 0.00 0.50 0.50
9 5.0 8 0.00 0.50 0.50
10 NaN 2 0.75 0.25 0.00
11 NaN 5 0.25 0.50 0.25
12 NaN 1 0.75 0.25 0.00
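Since the question asks to avoid hard-coding the bucket edges, one possible extension (an assumption, not part of the answer above) is to derive the edges from the data itself, e.g. with pd.qcut for equal-frequency buckets:

# retbins=True also returns the computed edges, which can then be reused
# to cut test2['Yhat'] consistently with the same boundaries
test1['Actual_Y'], bins = pd.qcut(test1['y'], q=3,
                                  labels=[f'Y_cat_{i}' for i in range(1, 4)],
                                  retbins=True)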

pandas: why does int64 - float64 column subtraction yield NaNs?

I am confused by the results of pandas subtraction of two columns. When I subtract an int64 column from a float64 column, it yields several NaN entries. Why is this happening? What could be the cause of this strange behavior?
Final Update: As N.Wouda pointed out, my problem was that the index columns did not match.
Y_predd.reset_index(drop=True,inplace=True)
Y_train_2.reset_index(drop=True,inplace=True)
solved my problem
Update 2: It seems like my index columns don't match, which makes sense because they are both sampled from the same data frame. How can I "start fresh" with new index columns?
Update: Y_predd - Y_train_2.astype('float64') also yields NaN values. I am confused why this did not raise an error; they are the same size. Why could this be yielding NaN?
In [48]: Y_predd.size
Out[48]: 182527
In [49]: Y_train_2.astype('float64').size
Out[49]: 182527
Original documentation of error:
In [38]: Y_train_2
Out[38]:
66419 0
2319 0
114195 0
217532 0
131687 0
144024 0
94055 0
143479 0
143124 0
49910 0
109278 0
215905 1
127311 0
150365 0
117866 0
28702 0
168111 0
64625 0
207180 0
14555 0
179268 0
22021 1
120169 0
218769 0
259754 0
188296 1
63503 1
175104 0
218261 0
35453 0
..
112048 0
97294 0
68569 0
60333 0
184119 1
57632 0
153729 1
155353 0
114979 1
180634 0
42842 0
99979 0
243728 0
203679 0
244381 0
55646 0
35557 0
148977 0
164008 0
53227 1
219863 0
4625 0
155759 0
232463 0
167807 0
123638 0
230463 1
198219 0
128459 1
53911 0
Name: objective_for_classifier, dtype: int64
In [39]: Y_predd
Out[39]:
0 0.00
1 0.48
2 0.04
3 0.00
4 0.48
5 0.58
6 0.00
7 0.00
8 0.02
9 0.06
10 0.22
11 0.32
12 0.12
13 0.26
14 0.18
15 0.18
16 0.28
17 0.30
18 0.52
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 0.64
26 0.30
27 0.76
28 0.10
29 0.42
...
182497 0.60
182498 0.00
182499 0.06
182500 0.12
182501 0.00
182502 0.40
182503 0.70
182504 0.42
182505 0.54
182506 0.24
182507 0.56
182508 0.34
182509 0.10
182510 0.18
182511 0.06
182512 0.12
182513 0.00
182514 0.22
182515 0.08
182516 0.22
182517 0.00
182518 0.42
182519 0.02
182520 0.50
182521 0.00
182522 0.08
182523 0.16
182524 0.00
182525 0.32
182526 0.06
Name: prediction_method_used, dtype: float64
In [40]: Y_predd - Y_train_2
Out[40]:
0 NaN
1 NaN
2 0.04
3 NaN
4 0.48
5 NaN
6 0.00
7 0.00
8 NaN
9 NaN
10 NaN
11 0.32
12 -0.88
13 -0.74
14 0.18
15 NaN
16 NaN
17 NaN
18 NaN
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 NaN
26 0.30
27 NaN
28 0.10
29 0.42
...
260705 NaN
260706 NaN
260709 NaN
260710 NaN
260711 NaN
260713 NaN
260715 NaN
260716 NaN
260718 NaN
260721 NaN
260722 NaN
260723 NaN
260724 NaN
260725 NaN
260726 NaN
260727 NaN
260731 NaN
260735 NaN
260737 NaN
260738 NaN
260739 NaN
260740 NaN
260742 NaN
260743 NaN
260745 NaN
260748 NaN
260749 NaN
260750 NaN
260751 NaN
260752 NaN
dtype: float64
Posting here, from the comments, so we can close the question:
Are you sure each dataframe has the same index range?
You can reset the indices on both frames with df.reset_index(drop=True) and then subtract the frames as you were already doing. This should produce the desired output.
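A minimal sketch of the alignment behaviour behind this (made-up data): pandas subtracts by index label, not by position, so labels present in only one Series come out as NaN.

import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([1, 2, 3], index=[2, 3, 4])

# only label 2 exists in both, so every other position is NaN
print(a - b)

# resetting both indices makes the subtraction positional again
print(a.reset_index(drop=True) - b.reset_index(drop=True))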
