pandas: change a column's value based on a condition over three other columns - python

I have the following pandas dataframe:
df = pd.DataFrame({'pred': [1, 2, 3, 4],
                   'a': [0.4, 0.6, 0.35, 0.5],
                   'b': [0.2, 0.4, 0.32, 0.1],
                   'c': [0.1, 0, 0.2, 0.2],
                   'd': [0.1, 0, 0.3, 0.2]})
I want to change the values in the 'pred' column based on columns a, b, c, and d, as follows:
if the value in column a is larger than the values in columns b, c, and d,
and
if at least one of columns b, c, or d has a value larger than 0.25,
then change the value in 'pred' to 0. So the result should be:
   pred     a     b    c    d
0     1  0.40  0.20  0.1  0.1
1     0  0.60  0.40  0.0  0.0
2     0  0.35  0.32  0.2  0.3
3     4  0.50  0.10  0.2  0.2
How can I do this?

Create a boolean condition/mask, then use loc to set 'pred' to 0 where the condition is True:
cols = ['b', 'c', 'd']
mask = df[cols].lt(df['a'], axis=0).all(axis=1) & df[cols].gt(.25).any(axis=1)
df.loc[mask, 'pred'] = 0
pred a b c d
0 1 0.40 0.20 0.1 0.1
1 0 0.60 0.40 0.0 0.0
2 0 0.35 0.32 0.2 0.3
3 4 0.50 0.10 0.2 0.2
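An equivalent one-liner uses np.where instead of loc (a sketch of the same mask logic, not a separate technique):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'pred': [1, 2, 3, 4],
                   'a': [0.4, 0.6, 0.35, 0.5],
                   'b': [0.2, 0.4, 0.32, 0.1],
                   'c': [0.1, 0, 0.2, 0.2],
                   'd': [0.1, 0, 0.3, 0.2]})

cols = ['b', 'c', 'd']
# a beats every other column, and at least one other column exceeds 0.25
mask = df[cols].lt(df['a'], axis=0).all(axis=1) & df[cols].gt(0.25).any(axis=1)
df['pred'] = np.where(mask, 0, df['pred'])
```

loc mutates in place while np.where builds a new array; for this size either is fine.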

import pandas as pd

def row_cond(row):
    m_val = row[['b', 'c', 'd']].max()
    if row['a'] > m_val and m_val > 0.25:
        row['pred'] = 0
    return row
df = pd.DataFrame({'pred': [1, 2, 3, 4],
                   'a': [0.4, 0.6, 0.35, 0.5],
                   'b': [0.2, 0.4, 0.32, 0.1],
                   'c': [0.1, 0, 0.2, 0.2],
                   'd': [0.1, 0, 0.3, 0.2]})
new_df = df.apply(row_cond, axis=1)
Output:
pred a b c d
0 1.0 0.40 0.20 0.1 0.1
1 0.0 0.60 0.40 0.0 0.0
2 0.0 0.35 0.32 0.2 0.3
3 4.0 0.50 0.10 0.2 0.2


How can I insert values from a list into a pandas DataFrame column?

I have this dataframe:
index  x     y
0      0     3
1      0.07  4
2      0.1   6
3      0.13  5
I want to insert new x values into the x column:
new_x = [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]
so that the dataframe becomes
index  x     y
0      0     3
1      0.03  NaN
2      0.07  4
3      0.1   6
4      0.13  5
5      0.17  NaN
6      0.2   NaN
So basically, for every new_x value that doesn't exist in column x, the y value is NaN.
Is it possible to do this in pandas? Thank you.
You can use NumPy's searchsorted.
First create a new_y array the same length as new_x; searchsorted then tells you where in new_y to place the old y values:
import numpy as np
new_y = np.full(len(new_x), np.nan, np.float64)
new_y[np.searchsorted(new_x, df.x)] = df.y
pd.DataFrame({'x': new_x, 'y': new_y})
x y
0 0.00 3.0
1 0.03 NaN
2 0.07 4.0
3 0.10 6.0
4 0.13 5.0
5 0.17 NaN
6 0.20 NaN
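A pandas-only sketch of the same idea uses set_index plus reindex; this assumes the x values in new_x that should match are the exact same floats as in the column:

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 0.07, 0.1, 0.13], 'y': [3, 4, 6, 5]})
new_x = [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]

# rows whose x is not in df become all-NaN, which is exactly the desired y
result = df.set_index('x').reindex(new_x).reset_index()
```

reindex aligns by label, so it handles the "missing x becomes NaN" requirement in one step.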
This is a straightforward application of pandas' merge function, specifically a left join.
import pandas as pd
x1 = [0, 0.07, 0.1, 0.13]
y1 = [3, 4, 6, 5]
df1 = pd.DataFrame({"x": x1, "y": y1})
print(df1)
x2 = [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]
df2 = pd.DataFrame({"x": x2})
print(df2)
df3 = df2.merge(df1, how="left", on="x")
print(df3)
x y
0 0.00 3
1 0.07 4
2 0.10 6
3 0.13 5
x
0 0.00
1 0.03
2 0.07
3 0.10
4 0.13
5 0.17
6 0.20
x y
0 0.00 3.0
1 0.03 NaN
2 0.07 4.0
3 0.10 6.0
4 0.13 5.0
5 0.17 NaN
6 0.20 NaN
You could try the join method. Here is some sample code you could refer to (once x is set as the index, the on='x' argument is redundant but harmless):
d1 = {'x': [0, 0.07, 0.1, 0.13], 'y': [3,4,6,5]}
d2 = {'x': [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df2.set_index('x').join(df1.set_index('x'), on='x', how='left').reset_index()
Try this: compare the values already present in the column with the new ones, build a DataFrame from the remaining values, and concatenate it to the original df.
import pandas as pd
import numpy as np
df = pd.DataFrame({'x':[0,0.07,0.1,0.13], 'y':[3,4,6,5]})
new_list = [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]
def diff(new_list, col_list):
    if len(new_list) > len(col_list):
        return list(set(new_list) - set(col_list))
    else:
        return list(set(col_list) - set(new_list))
new_df = pd.DataFrame({'x':diff(new_list,df['x'].to_list()),'y':np.nan})
fin_df = pd.concat([df,new_df]).reset_index(drop=True)
fin_df
x y
0 0.00 3.0
1 0.07 4.0
2 0.10 6.0
3 0.13 5.0
4 0.03 NaN
5 0.20 NaN
6 0.17 NaN
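If the concatenated frame should keep x in ascending order, as in the question's expected output, a small self-contained variant of the set-difference idea works (sorting the missing values is my addition, not part of the answer above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [0, 0.07, 0.1, 0.13], 'y': [3, 4, 6, 5]})
new_x = [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]

# values in new_x that the column doesn't have yet, in ascending order
missing = sorted(set(new_x) - set(df['x']))
fin_df = pd.concat([df, pd.DataFrame({'x': missing, 'y': np.nan})])
fin_df = fin_df.sort_values('x').reset_index(drop=True)
```

The final sort_values restores the ordering that plain concat loses.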

Pandas row-wise addition with another column

I have a dataframe df
A B C
0.1 0.3 0.5
0.2 0.4 0.6
0.3 0.5 0.7
0.4 0.6 0.8
0.5 0.7 0.9
For each row I would like to add a value to each element, taken from dataframe df1:
X
0.1
0.2
0.3
0.4
0.5
Such that the final result would be
A B C
0.2 0.4 0.6
0.4 0.6 0.8
0.6 0.8 1.0
0.8 1.0 1.2
1.0 1.2 1.4
I have tried df_new = df.sum(df1, axis=0), but got the following error: TypeError: stat_func() got multiple values for argument 'axis'. I would be open to numpy solutions as well.
You can use np.add:
df = np.add(df, df1.to_numpy())
print(df)
Prints:
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
import pandas as pd
df = pd.DataFrame([[0.1, 0.3, 0.5],
                   [0.2, 0.4, 0.6],
                   [0.3, 0.5, 0.7],
                   [0.4, 0.6, 0.8],
                   [0.5, 0.7, 0.9]],
                  columns=['A', 'B', 'C'])
df1 = [0.1, 0.2, 0.3, 0.4, 0.5]
# In one Pandas instruction
df = df.add(pd.Series(df1), axis=0)
Result:
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
Try concat with .stack() and .sum() (using the question's df and df1, with explicit axis keywords):
df_new = pd.concat([df.stack(), df1.stack()], axis=1).bfill().sum(axis=1).unstack(1).drop('X', axis=1)
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
df = pd.DataFrame([[0.1, 0.3, 0.5],
                   [0.2, 0.4, 0.6],
                   [0.3, 0.5, 0.7],
                   [0.4, 0.6, 0.8],
                   [0.5, 0.7, 0.9]],
                  columns=['A', 'B', 'C'])
df["X"] = [0.1, 0.2, 0.3, 0.4, 0.5]
columns_to_add = df.columns[:-1]
for col in columns_to_add:
    df[col] += df['X']  # this is where addition or any other operation can be performed
df = df.drop('X', axis=1)
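If df1 really is a one-column DataFrame as shown in the question (my assumption about its shape), the shortest variant selects its X column and aligns it along the row axis:

```python
import pandas as pd

df = pd.DataFrame({'A': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'B': [0.3, 0.4, 0.5, 0.6, 0.7],
                   'C': [0.5, 0.6, 0.7, 0.8, 0.9]})
df1 = pd.DataFrame({'X': [0.1, 0.2, 0.3, 0.4, 0.5]})

# add df1's X column to every column of df, aligned by row index
df_new = df.add(df1['X'], axis=0)
```

axis=0 tells add to broadcast the Series down the rows instead of across the columns.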

interpolation between two groups of points

I have a data set with the following form:
A B C D E
0 0.5 0.2 0.25 0.75 1.25
1 0.5 0.3 0.12 0.41 1.40
2 0.5 0.4 0.85 0.15 1.55
3 1.0 0.2 0.11 0.15 1.25
4 1.0 0.3 0.10 0.11 1.40
5 1.0 0.4 0.87 0.14 1.25
6 2.0 0.2 0.23 0.45 1.55
7 2.0 0.3 0.74 0.85 1.25
8 2.0 0.4 0.55 0.55 1.40
Here is code to generate this DataFrame with pandas:
import pandas as pd
data = [[0.5, 0.2, 0.25, 0.75, 1.25],
        [0.5, 0.3, 0.12, 0.41, 1.40],
        [0.5, 0.4, 0.85, 0.15, 1.55],
        [1.0, 0.2, 0.11, 0.15, 1.25],
        [1.0, 0.3, 0.10, 0.11, 1.40],
        [1.0, 0.4, 0.87, 0.14, 1.25],
        [2.0, 0.2, 0.23, 0.45, 1.55],
        [2.0, 0.3, 0.74, 0.85, 1.25],
        [2.0, 0.4, 0.55, 0.55, 1.40]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
This data represents the outcome of an experiment where, for each combination of A, B, and E, there is a unique value of C.
What I want is to perform a linear interpolation so that I get similar data for A = 0.7, for instance, based on the values at A = 0.5 and A = 1.0.
The expected output should be something like:
A B C D E
0 0.5 0.2 0.25 0.75 1.25
1 0.5 0.3 0.12 0.41 1.40
2 0.5 0.4 0.85 0.15 1.55
3 0.7 0.2 xxx xxx 1.25
4 0.7 0.3 xxx xxx 1.40
5 0.7 0.4 xxx xxx 1.55
6 1.0 0.2 0.11 0.15 1.25
7 1.0 0.3 0.10 0.11 1.40
8 1.0 0.4 0.87 0.14 1.25
9 2.0 0.2 0.23 0.45 1.55
10 2.0 0.3 0.74 0.85 1.25
11 2.0 0.4 0.55 0.55 1.40
Is there a straightforward way to do that in Python? I tried pandas' interpolate, but the values I got didn't make sense.
Any suggestions?
Here is an example of how to create an interpolation function mapping values from column A to values from column C (arbitrarily picking 0.5 to 2.0 for values of A). Note that A is not strictly increasing here, so interp1d's output around the repeated A values looks jumpy, as the result below shows:
import pandas as pd
import numpy as np
from scipy import interpolate

# Set up the dataframe
data = [[0.5, 0.2, 0.25, 0.75, 1.25],
        [0.5, 0.3, 0.12, 0.41, 1.40],
        [0.5, 0.4, 0.85, 0.15, 1.55],
        [1.0, 0.2, 0.11, 0.15, 1.25],
        [1.0, 0.3, 0.10, 0.11, 1.40],
        [1.0, 0.4, 0.87, 0.14, 1.25],
        [2.0, 0.2, 0.23, 0.45, 1.55],
        [2.0, 0.3, 0.74, 0.85, 1.25],
        [2.0, 0.4, 0.55, 0.55, 1.40]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

# Create the interpolation function
f = interpolate.interp1d(df['A'], df['C'])

# Evaluate new A (x) values to get new C (y) values via interpolation
xnew = np.linspace(0.5, 2.0, 10)
ynew = f(xnew)

print("%-7s %-7s" % ("A", "C"))
print("-" * 16)
for x, y in zip(xnew, ynew):
    print("%0.4f\t%0.4f" % (x, y))
The result:
A C
----------------
0.5000 0.8500
0.6667 0.6033
0.8333 0.3567
1.0000 0.8700
1.1667 0.7633
1.3333 0.6567
1.5000 0.5500
1.6667 0.4433
1.8333 0.3367
2.0000 0.5500
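The expected output in the question is closer to a group-wise interpolation: for each value of B, interpolate the remaining columns between the A = 0.5 and A = 1.0 rows. A sketch with np.interp (my own approach, not from the answer above; note it interpolates E along with C and D, which is an assumption about what the xxx rows should contain):

```python
import pandas as pd
import numpy as np

data = [[0.5, 0.2, 0.25, 0.75, 1.25],
        [0.5, 0.3, 0.12, 0.41, 1.40],
        [0.5, 0.4, 0.85, 0.15, 1.55],
        [1.0, 0.2, 0.11, 0.15, 1.25],
        [1.0, 0.3, 0.10, 0.11, 1.40],
        [1.0, 0.4, 0.87, 0.14, 1.25],
        [2.0, 0.2, 0.23, 0.45, 1.55],
        [2.0, 0.3, 0.74, 0.85, 1.25],
        [2.0, 0.4, 0.55, 0.55, 1.40]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

A_new = 0.7
rows = []
for b, grp in df[df['A'].isin([0.5, 1.0])].groupby('B'):
    grp = grp.sort_values('A')
    # linear interpolation of each remaining column at A = A_new
    interp = [np.interp(A_new, grp['A'], grp[col]) for col in ['C', 'D', 'E']]
    rows.append([A_new, b] + interp)
new_rows = pd.DataFrame(rows, columns=df.columns)
```

The new rows can then be concatenated with df and sorted by A and B to produce the combined table.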

Removing outliers and surrounding data from dataframe

I have a data set containing some outliers that I'd like to remove.
I want to remove the 0 value in the data frame shown below:
df = pd.DataFrame({'Time': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'data': [1.1, 1.05, 1.01, 1.05, 0, 1.2, 1.1, 1.08, 1.07, 1.1]})
I can do something like this in order to remove values below a certain threshold:
df.loc[df['data'] < 0.5, 'data'] = np.NaN
This yields a dataframe without the '0' value:
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 1.01
3 0.3 1.05
4 0.4 NaN
5 0.5 1.20
6 0.6 1.10
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
However, I am also suspicious of the data surrounding invalid values, and would like to remove values within 0.2 units of Time of the outliers, like the following:
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 NaN
3 0.3 NaN
4 0.4 NaN
5 0.5 NaN
6 0.6 NaN
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
You can get a list of all points in time in which you have bad measurements and filter for all nearby time values:
bad_times = df.Time[df['data'] < 0.5]
for t in bad_times:
    df.loc[(df['Time'] - t).abs() <= 0.2, 'data'] = np.NaN
result:
>>> print(df)
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 NaN
3 0.3 NaN
4 0.4 NaN
5 0.5 NaN
6 0.6 NaN
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
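The loop can also be avoided with NumPy broadcasting (a sketch of the same window logic; the small tolerance is my addition to guard against float round-off at the window edge):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Time': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                   'data': [1.1, 1.05, 1.01, 1.05, 0, 1.2, 1.1, 1.08, 1.07, 1.1]})

bad = df.loc[df['data'] < 0.5, 'Time'].to_numpy()
# distance from every time to every bad time: shape (n_rows, n_bad)
near = (np.abs(df['Time'].to_numpy()[:, None] - bad) <= 0.2 + 1e-9).any(axis=1)
df.loc[near, 'data'] = np.nan
```

This computes all pairwise distances at once, which scales better when there are many outliers.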
You can get a list of Time values to be deleted, and then apply NaN to those rows:
df.loc[df['data'] < 0.5, 'data'] = np.NaN
l = df[df['data'].isna()]['Time'].values
l2 = []
for i in l:
    l2 += [round(i - 0.1, 1), round(i - 0.2, 1), round(i + 0.1, 1), round(i + 0.2, 1)]
df.loc[df['Time'].isin(l2), 'data'] = np.nan

How to convert a series of tuples into a pandas dataframe?

Assume we have the following pandas Series, resulting from an apply function applied to a dataframe after a groupby:
<class 'pandas.core.series.Series'>
0 (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2])
1 (2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1])
2 (1, 0, [0.4, 0.4, 0.4], [0.4, 0.4, 0.4])
3 (1, 0, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
4 (3, 14000, [0.8, 0.8, 0.8], [0.6, 0.6, 0.6])
dtype: object
Can we convert this into a dataframe when sigList = ['sig1', 'sig2', 'sig3'] is given?
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
1 0 0.2 0.2 0.2 0.2 0.2 0.2
2 1000 0.6 0.7 0.5 0.1 0.3 0.1
1 0 0.4 0.4 0.4 0.4 0.4 0.4
1 0 0.5 0.5 0.5 0.5 0.5 0.5
3 14000 0.8 0.8 0.8 0.6 0.6 0.6
Thanks in advance
Do it the old-fashioned (and fast) way, using a list comprehension:
columns = ("Length Distance sig1Max sig2Max "
           "sig3Max sig1Min sig2Min sig3Min").split()
df = pd.DataFrame([[a, b, *c, *d] for a, b, c, d in series.values], columns=columns)
print(df)
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1 0 0.2 0.2 0.2 0.2 0.2 0.2
1 2 1000 0.6 0.7 0.5 0.1 0.3 0.1
2 1 0 0.4 0.4 0.4 0.4 0.4 0.4
3 1 0 0.5 0.5 0.5 0.5 0.5 0.5
4 3 14000 0.8 0.8 0.8 0.6 0.6 0.6
Or, perhaps you meant, do it a little more dynamically
sigList = ['sig1', 'sig2', 'sig3']
columns = ['Length', 'Distance']
columns.extend(f'{s}{lbl}' for lbl in ('Max', 'Min') for s in sigList)
df = pd.DataFrame([[a,b,*c,*d] for a,b,c,d in series.values], columns=columns)
print(df)
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1 0 0.2 0.2 0.2 0.2 0.2 0.2
1 2 1000 0.6 0.7 0.5 0.1 0.3 0.1
2 1 0 0.4 0.4 0.4 0.4 0.4 0.4
3 1 0 0.5 0.5 0.5 0.5 0.5 0.5
4 3 14000 0.8 0.8 0.8 0.6 0.6 0.6
You may check:
newdf = pd.DataFrame(s.tolist())
newdf = pd.concat([newdf[[0, 1]], pd.DataFrame(newdf[2].tolist()), pd.DataFrame(newdf[3].tolist())], axis=1)
newdf.columns = [
"Length", "Distance", "sig1Max", "sig2Max", "sig3Max", "sig1Min", "sig2Min", "sig3Min"
]
newdf
Out[163]:
Length Distance sig1Max ... sig1Min sig2Min sig3Min
0 1 0 0.2 ... 0.2 0.2 0.2
1 2 1000 0.6 ... 0.1 0.3 0.1
2 1 0 0.4 ... 0.4 0.4 0.4
3 1 0 0.5 ... 0.5 0.5 0.5
4 3 14000 0.8 ... 0.6 0.6 0.6
[5 rows x 8 columns]
You can flatten each element and then convert each to a Series itself. Converting each element to a Series turns the main Series (s in the example below) into a DataFrame. Then just set the column names as you wish.
For example:
import pandas as pd
# load in your data
s = pd.Series([
    (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]),
    (2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1]),
    (1, 0, [0.4, 0.4, 0.4], [0.4, 0.4, 0.4]),
    (1, 0, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    (3, 14000, [0.8, 0.8, 0.8], [0.6, 0.6, 0.6]),
])

def flatten(x):
    # note this is not very robust, but works for this case
    return [x[0], x[1], *x[2], *x[3]]
df = s.apply(flatten).apply(pd.Series)
df.columns = [
"Length", "Distance", "sig1Max", "sig2Max", "sig3Max", "sig1Min", "sig2Min", "sig3Min"
]
Then you have df as:
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1.0 0.0 0.2 0.2 0.2 0.2 0.2 0.2
1 2.0 1000.0 0.6 0.7 0.5 0.1 0.3 0.1
2 1.0 0.0 0.4 0.4 0.4 0.4 0.4 0.4
3 1.0 0.0 0.5 0.5 0.5 0.5 0.5 0.5
4 3.0 14000.0 0.8 0.8 0.8 0.6 0.6 0.6
