How to merge columns and duplicate row values to match in pandas - python

I want to join 2 dataframes on 'time', but one df uses .25 second intervals and the other uses 1 second intervals. I want to join the values from the 1 second interval df to the .25 second interval df, repeating each value for all quarter-second rows that fall within the corresponding second.
Below are small snippets of the 2 dataframes I want to merge:
time speaker
0.25 1
0.25 2
0.50 1
0.50 2
0.75 1
0.75 2
1.00 1
1.00 2
1.25 1
1.25 2
1.50 1
1.50 2
1.75 1
1.75 2
2.00 1
2.00 2
and:
time label
0 10
1 11
and I want:
time speaker label
0.25 1 10
0.25 2 10
0.50 1 10
0.50 2 10
0.75 1 10
0.75 2 10
1.00 1 10
1.00 2 10
1.25 1 11
1.25 2 11
1.50 1 11
1.50 2 11
1.75 1 11
1.75 2 11
2.00 1 11
2.00 2 11
Thanks!

Here is one way, using merge_asof. With allow_exact_matches=False, each row takes the last df2 time strictly before it, so time 1.00 still picks up label 10:
pd.merge_asof(df1, df2.astype(float), on='time', allow_exact_matches=False)
Out[14]:
time speaker label
0 0.25 1 10.0
1 0.25 2 10.0
2 0.50 1 10.0
3 0.50 2 10.0
4 0.75 1 10.0
5 0.75 2 10.0
6 1.00 1 10.0
7 1.00 2 10.0
8 1.25 1 11.0
9 1.25 2 11.0
10 1.50 1 11.0
11 1.50 2 11.0
12 1.75 1 11.0
13 1.75 2 11.0
14 2.00 1 11.0
15 2.00 2 11.0
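A minimal, self-contained reproduction of the merge_asof approach, assuming the two sample frames shown above:

```python
import pandas as pd

# .25-second interval frame
df1 = pd.DataFrame({
    'time': [0.25, 0.25, 0.50, 0.50, 0.75, 0.75, 1.00, 1.00,
             1.25, 1.25, 1.50, 1.50, 1.75, 1.75, 2.00, 2.00],
    'speaker': [1, 2] * 8,
})
# 1-second interval frame
df2 = pd.DataFrame({'time': [0, 1], 'label': [10, 11]})

# merge_asof does a backward "nearest key" join; allow_exact_matches=False
# means a row at time 1.00 matches the df2 row strictly before it (time 0),
# so label 10 still applies at the 1.00 mark, as in the expected output.
out = pd.merge_asof(df1, df2.astype(float), on='time',
                    allow_exact_matches=False)
print(out)
```

Note that merge_asof requires both frames to be sorted on the join key, which they already are here.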

IIUC, this is a case for pd.cut (with numpy imported as np):
df1['label'] = pd.cut(df1['time'],
                      bins=list(df2['time']) + [np.inf],
                      labels=df2['label'])
Output:
time speaker label
0 0.25 1 10
1 0.25 2 10
2 0.50 1 10
3 0.50 2 10
4 0.75 1 10
5 0.75 2 10
6 1.00 1 10
7 1.00 2 10
8 1.25 1 11
9 1.25 2 11
10 1.50 1 11
11 1.50 2 11
12 1.75 1 11
13 1.75 2 11
14 2.00 1 11
15 2.00 2 11
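The pd.cut version can be checked the same way; this sketch uses a single-speaker frame to keep it short:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'time': [0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00],
    'speaker': [1] * 8,
})
df2 = pd.DataFrame({'time': [0, 1], 'label': [10, 11]})

# Each df2 time opens a bin; np.inf closes the last one.  With the default
# right=True, an interval is (lower, upper], so time 1.00 falls in (0, 1]
# and gets label 10, matching the expected output.
df1['label'] = pd.cut(df1['time'],
                      bins=list(df2['time']) + [np.inf],
                      labels=df2['label'])
```

Unlike the merge_asof version, the result here is a categorical column holding the original integer labels rather than floats.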


How to handle strings in numeric data columns in a dataset using pandas?

I am working on a dataset where a few values in one of the columns are strings. Because of that, I am getting an error while performing operations on the dataset.
Sample dataset:
1.99 LOHARU 0.3 2 0 2 0.3 5 2 0 2 2
1.99 31 0.76 2 0 2 0.76 5 2 7.48 4 2
1.99 4 0.96 2 0 2 0.96 5 2 9.45 4 2
1.99 14 1.26 4 0 2 1.26 5 2 0 2 2
1.99 NUH 0.55 2 0 2 0.55 5 2 0.67 2 2
1.99 99999 0.29 2 0 2 0.29 5 2 0.06 2 2
full dataset can be found here:- https://www.kaggle.com/sid321axn/audit-data?select=trial.csv
I need to find the missing values and outliers in the dataset. Below is the code I am using to find the missing values:
# Replacing zeros and 99999 with NaN
dataset[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]] = dataset[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]].replace(99999, np.NaN)
# if columns 12, 14 and 17 can have zeroes, then
dataset[[0,1,2,3,4,5,6,7,8,9,10,11,13,15,16]] = dataset[[0,1,2,3,4,5,6,7,8,9,10,11,13,15,16]].replace(0, np.NaN)
print(dataset.isnull().sum())
but this doesn't replace 99999 with NaN.
To find the outliers, I am calculating the z-score:
import scipy.stats as stats
array = dataset.values
Z = stats.zscore(array)
but it gives me the error below:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
IIUC, you want to remove the non-numeric values. For this you can use pandas.to_numeric with errors='coerce'. This replaces non-numeric values with NaN and lets you perform numeric operations:
df = df.apply(pd.to_numeric, errors='coerce')
output:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12
0 1.99 NaN 0.30 2 0 2 0.30 5 2 0.00 2 2
1 1.99 31.0 0.76 2 0 2 0.76 5 2 7.48 4 2
2 1.99 4.0 0.96 2 0 2 0.96 5 2 9.45 4 2
3 1.99 14.0 1.26 4 0 2 1.26 5 2 0.00 2 2
4 1.99 NaN 0.55 2 0 2 0.55 5 2 0.67 2 2
5 1.99 5.0 0.29 2 0 2 0.29 5 2 0.06 2 2
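A small sketch of the coercion step on a two-column frame (the column names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1.99, 1.99, 1.99],
                   'col2': ['LOHARU', '31', 'NUH']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN,
# while strings that do parse ('31') become proper numerics.  Applying it
# column by column converts the whole frame.
df = df.apply(pd.to_numeric, errors='coerce')
print(df)
```

After this, numeric operations such as stats.zscore no longer hit the str/int TypeError, though you may want to drop or fill the NaNs first.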

How to create a new column from dataframe based on a given condition

I would like to create a new column target based on the values in the source column. I simply want to assign values from the list [6,7,8,9,10,11,12,13] to the rows of the source column, restarting from the beginning of the list for each source group.
value source
0 0.83 0
1 0.99 0
2 0.20 0
3 0.79 0
4 0.19 0
5 0.86 0
6 0.31 1
7 0.19 1
8 0.50 2
9 0.44 2
10 1.00 2
11 0.67 2
12 0.74 3
13 0.43 3
14 0.21 3
15 0.03 4
16 1.00 4
17 0.57 4
18 0.67 5
19 1.00 5
expected output
value source target
0 0.83 0 6
1 0.99 0 7
2 0.20 0 8
3 0.79 0 9
4 0.19 0 10
5 0.86 0 11
6 0.31 1 6
7 0.19 1 7
8 0.50 2 6
9 0.44 2 7
10 1.00 2 8
11 0.67 2 9
12 0.74 3 6
13 0.43 3 7
14 0.21 3 8
15 0.03 4 6
16 1.00 4 7
17 0.57 4 8
18 0.67 5 6
19 1.00 5 7
Use GroupBy.cumcount, mapping the counter through a dictionary created from the list with enumerate:
L = [6,7,8,9,10,11,12,13]
df['target'] = df.groupby('source').cumcount().map(dict(enumerate(L)))
print (df)
value source target
0 0.83 0 6
1 0.99 0 7
2 0.20 0 8
3 0.79 0 9
4 0.19 0 10
5 0.86 0 11
6 0.31 1 6
7 0.19 1 7
8 0.50 2 6
9 0.44 2 7
10 1.00 2 8
11 0.67 2 9
12 0.74 3 6
13 0.43 3 7
14 0.21 3 8
15 0.03 4 6
16 1.00 4 7
17 0.57 4 8
18 0.67 5 6
19 1.00 5 7
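A compact reproduction of the idea on a shortened version of the frame:

```python
import pandas as pd

df = pd.DataFrame({'value': [0.83, 0.99, 0.31, 0.50, 0.44, 0.74],
                   'source': [0, 0, 1, 2, 2, 3]})
L = [6, 7, 8, 9, 10, 11, 12, 13]

# cumcount numbers rows 0, 1, 2, ... within each 'source' group;
# mapping that counter through dict(enumerate(L)) turns position 0 into 6,
# position 1 into 7, and so on, restarting for every group.
df['target'] = df.groupby('source').cumcount().map(dict(enumerate(L)))
```

Since dict(enumerate(L)) is {0: 6, 1: 7, ...}, groups longer than len(L) would map to NaN, which is worth keeping in mind if the group sizes can grow.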

How to automate the bins of a column in python?

Background information: I have a dataframe 'test1' with a column 'y' that carries the original values. I applied a model to 'y' and obtained predictions in a column 'Yhat'. I need to relate 'Yhat' back to 'y', so I have bucketed both 'y' and 'Yhat'. For a particular bucket of 'Yhat' there is a corresponding distribution over 'y' buckets.
In the future, if I have a prediction 3 points ahead (i.e. 'Yhat'), I can then provide the corresponding 'y' bucket categories. For an example, see dataframe 'test2' and the code below.
Main query: I want to automate this whole process instead of creating the bucket values manually. The reason for automating is that as the sample space increases, the corresponding bucket values will also change.
test1
y Yhat
1 1
2 1
6 5
2 3
3 4
1 2
4 2
3 4
7 6
5 8
def catY(r):
    if (r['y'] >= 1) & (r['y'] < 3):
        return 'Y_cat_1'
    elif (r['y'] >= 3) & (r['y'] < 6):
        return 'Y_cat_2'
    elif r['y'] >= 6:
        return 'Y_cat_3'

test1['Actual_Y'] = test1.apply(catY, axis=1)

def cat(r):
    if (r['Yhat'] >= 1) & (r['Yhat'] < 3):
        return 'Yhat_cat_1'
    elif (r['Yhat'] >= 3) & (r['Yhat'] < 6):
        return 'Yhat_cat_2'
    elif r['Yhat'] >= 6:
        return 'Yhat_cat_3'

test1['yhat_cat'] = test1.apply(cat, axis=1)
test1.groupby('yhat_cat')['Actual_Y'].value_counts(normalize=True)
yhat_cat Actual_Y
Yhat_cat_1 Y_cat_1 0.75
Y_cat_2 0.25
Yhat_cat_2 Y_cat_2 0.50
Y_cat_1 0.25
Y_cat_3 0.25
Yhat_cat_3 Y_cat_2 0.50
Y_cat_3 0.50
test2
y Yhat
1 1
2 1
6 5
2 3
3 4
1 2
4 2
3 4
7 6
5 8
2
5
1
filter_method1 = lambda x: '0.75' if (x >= 1 and x < 3) else '0.25' if (x >= 3 and x < 6) else '0' if x >= 6 else None
test2['Y_cat_1'] = test2['Yhat'].apply(filter_method1)
filter_method2 = lambda x: '0.25' if (x >= 1 and x < 3) else '0.50' if (x >= 3 and x < 6) else '0.50' if x >= 6 else None
test2['Y_cat_2'] = test2['Yhat'].apply(filter_method2)
filter_method3 = lambda x: '0' if (x >= 1 and x < 3) else '0.25' if (x >= 3 and x < 6) else '0.50' if x >= 6 else None
test2['Y_cat_3'] = test2['Yhat'].apply(filter_method3)
print(test2)
y Yhat Y_cat_1 Y_cat_2 Y_cat_3
0 1.00 1 0.75 0.25 0
1 2.00 1 0.75 0.25 0
2 6.00 5 0.25 0.50 0.25
3 2.00 3 0.25 0.50 0.25
4 3.00 4 0.25 0.50 0.25
5 1.00 2 0.75 0.25 0
6 4.00 2 0.75 0.25 0
7 3.00 4 0.25 0.50 0.25
8 7.00 6 0 0.50 0.50
9 5.00 8 0 0.50 0.50
10 nan 2 0.75 0.25 0
11 nan 5 0.25 0.50 0.25
12 nan 1 0.75 0.25 0
You can use cut:
bins = [1,3,6,np.inf]
labels1 = [f'Y_cat_{x}' for x in range(1, len(bins))]
labels2 = [f'Yhat_cat_{x}' for x in range(1, len(bins))]
test1['Actual_Y'] = pd.cut(test1['y'], bins=bins, labels=labels1, right=False)
test1['yhat_cat'] = pd.cut(test1['Yhat'], bins=bins, labels=labels2, right=False)
print (test1)
y Yhat Actual_Y yhat_cat
0 1 1 Y_cat_1 Yhat_cat_1
1 2 1 Y_cat_1 Yhat_cat_1
2 6 5 Y_cat_3 Yhat_cat_2
3 2 3 Y_cat_1 Yhat_cat_2
4 3 4 Y_cat_2 Yhat_cat_2
5 1 2 Y_cat_1 Yhat_cat_1
6 4 2 Y_cat_2 Yhat_cat_1
7 3 4 Y_cat_2 Yhat_cat_2
8 7 6 Y_cat_3 Yhat_cat_3
9 5 8 Y_cat_2 Yhat_cat_3
Then convert the normalized percentages to a DataFrame with Series.unstack:
df = test1.groupby('yhat_cat')['Actual_Y'].value_counts(normalize=True).unstack(fill_value=0)
print (df)
Actual_Y Y_cat_1 Y_cat_2 Y_cat_3
yhat_cat
Yhat_cat_1 0.75 0.25 0.00
Yhat_cat_2 0.25 0.50 0.25
Yhat_cat_3 0.00 0.50 0.50
Loop over the columns and dynamically create the new columns in test2 from test2['Yhat']:
for c in df.columns:
    # https://stackoverflow.com/a/48447871
    test2[c] = df[c].values[pd.cut(test2['Yhat'], bins=bins, labels=False, right=False)]
print (test2)
y Yhat Y_cat_1 Y_cat_2 Y_cat_3
0 1.0 1 0.75 0.25 0.00
1 2.0 1 0.75 0.25 0.00
2 6.0 5 0.25 0.50 0.25
3 2.0 3 0.25 0.50 0.25
4 3.0 4 0.25 0.50 0.25
5 1.0 2 0.75 0.25 0.00
6 4.0 2 0.75 0.25 0.00
7 3.0 4 0.25 0.50 0.25
8 7.0 6 0.00 0.50 0.50
9 5.0 8 0.00 0.50 0.50
10 NaN 2 0.75 0.25 0.00
11 NaN 5 0.25 0.50 0.25
12 NaN 1 0.75 0.25 0.00
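The whole pipeline can be exercised end to end on the sample data; this is a sketch where test2's trailing rows use NaN for the missing y values:

```python
import numpy as np
import pandas as pd

test1 = pd.DataFrame({'y':    [1, 2, 6, 2, 3, 1, 4, 3, 7, 5],
                      'Yhat': [1, 1, 5, 3, 4, 2, 2, 4, 6, 8]})
test2 = pd.DataFrame({'y':    [1, 2, 6, 2, 3, 1, 4, 3, 7, 5,
                               np.nan, np.nan, np.nan],
                      'Yhat': [1, 1, 5, 3, 4, 2, 2, 4, 6, 8, 2, 5, 1]})

bins = [1, 3, 6, np.inf]
labels1 = [f'Y_cat_{x}' for x in range(1, len(bins))]
labels2 = [f'Yhat_cat_{x}' for x in range(1, len(bins))]

# right=False makes the intervals [1,3), [3,6), [6,inf), matching the
# original hand-written if/elif chains.
test1['Actual_Y'] = pd.cut(test1['y'], bins=bins, labels=labels1, right=False)
test1['yhat_cat'] = pd.cut(test1['Yhat'], bins=bins, labels=labels2, right=False)

# Normalized cross-tabulation: rows are Yhat buckets, columns are Y buckets.
df = (test1.groupby('yhat_cat')['Actual_Y']
           .value_counts(normalize=True)
           .unstack(fill_value=0))

# labels=False makes cut return integer bin positions, which index
# straight into each column of the normalized table.
for c in df.columns:
    test2[c] = df[c].values[pd.cut(test2['Yhat'], bins=bins,
                                   labels=False, right=False)]
```

Adding a new observation to test1 only changes df, and the loop then propagates the updated proportions to test2, which is what makes the process automatic.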

binning two dimensional data by its index in python

How would I bin some data based on its index, in Python 3?
Let's say I have the following data
1 0.5
3 0.6
5 0.7
6 0.8
8 0.9
10 1
11 1.1
12 1.2
14 1.3
15 1.4
17 1.5
18 1.6
19 1.7
20 1.8
22 1.9
24 2
25 2.1
28 2.2
31 2.3
35 2.4
How would I take this data and bin both columns such that each bin has n values in it, then average the numbers in each bin and output the averages?
For example, if I wanted to bin the values in groups of 4,
I would take the first four data points:
1 0.5
3 0.6
5 0.7
6 0.8
and the averages of these would be: 3.75 0.65
I would continue down the columns, taking the next set of four, and so on,
until I had averaged all of the sets of four to get this:
3.75 0.65
10.25 1.05
16 1.45
21.25 1.85
29.75 2.25
How can I do this using Python?
Based on numpy reshape:
pd.DataFrame([np.mean(x.reshape(len(df)//4, -1), axis=1) for x in df.values.T]).T
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25
You can "bin" the index into groups of 4 and call groupby on the index:
df.groupby(df.index // 4).mean()
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25
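Both answers can be checked against the averages worked out above; this sketch assumes the 20-row sample is loaded into a DataFrame with default integer columns:

```python
import pandas as pd

df = pd.DataFrame({
    0: [1, 3, 5, 6, 8, 10, 11, 12, 14, 15,
        17, 18, 19, 20, 22, 24, 25, 28, 31, 35],
    1: [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4,
        1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4],
})

# Integer-dividing the 0..19 index by 4 yields group keys 0,0,0,0,1,1,1,1,...
# so each group holds exactly four consecutive rows, and .mean() averages
# both columns within each group.
binned = df.groupby(df.index // 4).mean()
print(binned)
```

The groupby version also handles a length that is not a multiple of 4 gracefully (the last group is simply smaller), whereas the reshape version requires the length to divide evenly.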

How to color newly added values in a data frame with Pandas?

I'd like to highlight (or color) new values in a DataFrame that previously were NaNs.
I have 2 data frames with the same index.
One with NaNs
df_nan = pd.DataFrame(np.random.randint(10, size = (10, 10))).replace(8, np.nan)
df_nan
0 1 2 3 4 5 6 7 8 9
0 NaN NaN 7 0 0 0.0 0 6 2.0 4.0
1 6.0 3.0 7 1 0 5.0 3 5 NaN 7.0
2 5.0 6.0 0 1 0 NaN 2 4 4.0 7.0
3 NaN 2.0 6 3 1 4.0 9 0 5.0 3.0
4 9.0 0.0 5 2 2 5.0 6 0 9.0 1.0
5 9.0 4.0 0 2 3 9.0 2 9 3.0 4.0
6 4.0 4.0 9 6 7 1.0 7 9 5.0 NaN
7 0.0 NaN 9 2 0 5.0 7 6 3.0 NaN
8 9.0 9.0 0 0 4 6.0 3 3 1.0 7.0
9 3.0 6.0 3 2 7 1.0 6 5 2.0 9.0
Another one (an "updated" one) where NaNs have been replaced by new values (means of each column)
df_new = df_nan.replace(np.nan, np.mean(df_nan))
df_new
0 1 2 3 4 5 6 7 8 9
0 5.62 4.25 7 0 0 0.0 0 6 2.00 4.00
1 6.00 3.00 7 1 0 5.0 3 5 3.77 7.00
2 5.00 6.00 0 1 0 4.0 2 4 4.00 7.00
3 5.62 2.00 6 3 1 4.0 9 0 5.00 3.00
4 9.00 0.00 5 2 2 5.0 6 0 9.00 1.00
5 9.00 4.00 0 2 3 9.0 2 9 3.00 4.00
6 4.00 4.00 9 6 7 1.0 7 9 5.00 5.25
7 0.00 4.25 9 2 0 5.0 7 6 3.00 5.25
8 9.00 9.00 0 0 4 6.0 3 3 1.00 7.00
9 3.00 6.00 3 2 7 1.0 6 5 2.00 9.00
How can I highlight or color the new values (the means) using the pandas .style and .applymap() methods?
Any help would be much appreciated!
style = 'color: yellow; background: red; border: 3px solid green'
funct = lambda d: df_nan.isnull() * style
df_new.style.apply(funct, axis=None)
