I have a data set containing some outliers that I'd like to remove.
I want to remove the 0 value in the data frame shown below:
df = pd.DataFrame({'Time': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'data': [1.1, 1.05, 1.01, 1.05, 0, 1.2, 1.1, 1.08, 1.07, 1.1]})
I can do something like this in order to remove values below a certain threshold:
df.loc[df['data'] < 0.5, 'data'] = np.nan
This yields a DataFrame without the 0 value:
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 1.01
3 0.3 1.05
4 0.4 NaN
5 0.5 1.20
6 0.6 1.10
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
However, I am also suspicious of the data surrounding invalid values, and would like to remove all values within 0.2 units of Time of an outlier, like the following:
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 NaN
3 0.3 NaN
4 0.4 NaN
5 0.5 NaN
6 0.6 NaN
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
You can get a list of all points in time at which you have bad measurements and filter for all nearby time values:
bad_times = df.Time[df['data'] < 0.5]
for t in bad_times:
    df.loc[(df['Time'] - t).abs() <= 0.2, 'data'] = np.nan
Result:
>>> print(df)
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 NaN
3 0.3 NaN
4 0.4 NaN
5 0.5 NaN
6 0.6 NaN
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
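If there are many bad measurements, the loop can also be replaced by a single broadcast comparison; a minimal sketch, assuming Time is a plain numeric column:
import numpy as np

# Compare every time against every bad time at once: a row is masked when it
# lies within 0.2 of any bad measurement.
bad_times = df.loc[df['data'] < 0.5, 'Time'].to_numpy()
near_bad = (np.abs(df['Time'].to_numpy()[:, None] - bad_times) <= 0.2).any(axis=1)
df.loc[near_bad, 'data'] = np.nan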
You can get a list of Time values to be deleted and then apply NaN for those rows; this assumes the measurements sit on a fixed 0.1 grid, hence the rounding:
df.loc[df['data'] < 0.5, 'data'] = np.nan
l = df[df['data'].isna()]['Time'].values
l2 = []
for i in l:
    l2 = l2 + [round(i - 0.1, 1), round(i - 0.2, 1), round(i + 0.1, 1), round(i + 0.2, 1)]
df.loc[df['Time'].isin(l2), 'data'] = np.nan
Related
I was looking at this question: How can I find 5 consecutive rows in pandas Dataframe where a value of a certain column is at least 0.5, which is similar to the one I have in mind. I would like to find at least 3 consecutive rows where a value is less than 0.5 (but neither negative nor NaN), considering the entire dataframe rather than just one column as in the linked question. Here is a facsimile dataframe:
import pandas as pd
import numpy as np

idx = pd.date_range("2018-01-01", periods=10, freq="M")
df = pd.DataFrame(
    {
        'A': [0, 0.4, 0.5, 0.3, 0, 0, 0, 0, 0, 0],
        'B': [0, 0.6, 0.8, 0, 0.3, 0.3, 0.9, 0.7, 0, 0],
        'C': [0, 0, 0.5, 0.4, 0.4, 0.2, 0, 0, 0, 0],
        'D': [0.4, 0, 0.6, 0.5, 0.7, 0.2, 0, 0.9, 0.8, 0],
        'E': [0.4, 0.3, 0.2, 0.7, 0.7, 0.8, 0, 0, 0, 0],
        'F': [0, 0, 0.6, 0.7, 0.8, 0.3, 0.4, 0.1, 0, 0]
    },
    index=idx
)
df = df.replace({0: np.nan})
df
Hence, columns B and D, which don't satisfy the criteria, should be removed from the output.
I'd prefer not to use a for loop or the like, since it is a 2000-column df, so I tried the following:
def consecutive_values_in_range(s, min, max):
    return s.between(left=min, right=max)

min, max = 0, 0.5
df.apply(lambda col: consecutive_values_in_range(col, min, max), axis=0)
print(df)
But I didn't obtain what I was looking for, that would be something like this:
A C E F
2018-01-31 NaN NaN 0.4 NaN
2018-02-28 0.4 NaN 0.3 NaN
2018-03-31 0.5 0.5 0.2 0.6
2018-04-30 0.3 0.4 0.7 0.7
2018-05-31 NaN 0.4 0.7 0.8
2018-06-30 NaN 0.2 0.8 0.3
2018-07-31 NaN NaN NaN 0.4
2018-08-31 NaN NaN NaN 0.1
2018-09-30 NaN NaN NaN NaN
2018-10-31 NaN NaN NaN NaN
Any suggestions? Thanks in advance.
lower, upper = 0, 0.5
n = 3
df.loc[:, ((df <= upper) & (df >= lower)).rolling(n).sum().eq(n).any()]
- get an is_between mask over df
- get the rolling sum of these masks per column, with window size 3
- since True == 1 and False == 0, a rolling sum of 3 implies 3 consecutive Trues, i.e., 0 <= val <= 0.5 holds in 3 consecutive rows of that column
- so check equality against 3 and see if there's any match in the column
- lastly, index with the resulting True/False mask per column
to get
A C E F
2018-01-31 NaN NaN 0.4 NaN
2018-02-28 0.4 NaN 0.3 NaN
2018-03-31 0.5 0.5 0.2 0.6
2018-04-30 0.3 0.4 0.7 0.7
2018-05-31 NaN 0.4 0.7 0.8
2018-06-30 NaN 0.2 0.8 0.3
2018-07-31 NaN NaN NaN 0.4
2018-08-31 NaN NaN NaN 0.1
2018-09-30 NaN NaN NaN NaN
2018-10-31 NaN NaN NaN NaN
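To see the intermediate pieces, the same chain can be unpacked step by step (a sketch reusing lower, upper and n from above):
in_range = (df >= lower) & (df <= upper)      # boolean mask; NaN comparisons are False
streak_end = in_range.rolling(n).sum().eq(n)  # True wherever a run of n consecutive Trues ends
keep = streak_end.any()                       # per column: does any such run exist?
result = df.loc[:, keep]                      # keep only the qualifying columns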
I have a dataframe df
A B C
0.1 0.3 0.5
0.2 0.4 0.6
0.3 0.5 0.7
0.4 0.6 0.8
0.5 0.7 0.9
For each row, I would like to add a value to each element, taken from dataframe df1:
X
0.1
0.2
0.3
0.4
0.5
Such that the final result would be
A B C
0.2 0.4 0.6
0.4 0.6 0.8
0.6 0.8 1.0
0.8 1.0 1.2
1.0 1.2 1.4
I have tried df_new = df.sum(df1, axis=0), but got the following error: TypeError: stat_func() got multiple values for argument 'axis'. I would be open to numpy solutions as well.
You can use np.add:
df = np.add(df, df1.to_numpy())
print(df)
Prints:
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
import pandas as pd

df = pd.DataFrame([[0.1, 0.3, 0.5],
                   [0.2, 0.4, 0.6],
                   [0.3, 0.5, 0.7],
                   [0.4, 0.6, 0.8],
                   [0.5, 0.7, 0.9]],
                  columns=['A', 'B', 'C'])
df1 = [0.1, 0.2, 0.3, 0.4, 0.5]

# In one Pandas instruction
df = df.add(pd.Series(df1), axis=0)
Result:
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
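If df1 is a one-column DataFrame as shown in the question (rather than a plain list), selecting its column gives a Series that aligns by row index, so the same call applies (a sketch):
df = df.add(df1['X'], axis=0)  # axis=0 broadcasts the X column down the rows of A, B and C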
Try concat with .stack() and .sum() (here df holds the A/B/C values and df1 holds the X column):
df_new = pd.concat([df.stack(), df1.stack()], axis=1).bfill().sum(axis=1).unstack(1).drop('X', axis=1)
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
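The same chain, unpacked (a sketch; df holds A/B/C and df1 holds X, as above):
stacked = pd.concat([df.stack(), df1.stack()], axis=1)    # long format; each row group sorts as A, B, C, X
filled = stacked.bfill()                                  # each group's X value backfills upward into its A/B/C rows
df_new = filled.sum(axis=1).unstack(1).drop('X', axis=1)  # add the pair, reshape to wide, drop the helper column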
df = pd.DataFrame([[0.1, 0.3, 0.5],
                   [0.2, 0.4, 0.6],
                   [0.3, 0.5, 0.7],
                   [0.4, 0.6, 0.8],
                   [0.5, 0.7, 0.9]],
                  columns=['A', 'B', 'C'])
df["X"] = [0.1, 0.2, 0.3, 0.4, 0.5]
columns_to_add = df.columns[:-1]
for col in columns_to_add:
    df[col] += df['X']  # this is where addition or any other operation can be performed
df = df.drop('X', axis=1)  # axis=1 drops the helper column (axis=0 would try to drop a row)
Assume that we have the following pandas series, resulting from an apply function applied to a dataframe after a groupby.
<class 'pandas.core.series.Series'>
0 (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2])
1 (2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1])
2 (1, 0, [0.4, 0.4, 0.4], [0.4, 0.4, 0.4])
3 (1, 0, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
4 (3, 14000, [0.8, 0.8, 0.8], [0.6, 0.6, 0.6])
dtype: object
Can we convert this into a dataframe, given sigList = ['sig1', 'sig2', 'sig3']?
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
1 0 0.2 0.2 0.2 0.2 0.2 0.2
2 1000 0.6 0.7 0.5 0.1 0.3 0.1
1 0 0.4 0.4 0.4 0.4 0.4 0.4
1 0 0.5 0.5 0.5 0.5 0.5 0.5
3 14000 0.8 0.8 0.8 0.6 0.6 0.6
Thanks in advance
Do it the old-fashioned (and fast) way, using a list comprehension:
columns = ("Length Distance sig1Max sig2Max"
"sig3Max sig1Min sig2Min sig3Min").split()
df = pd.DataFrame([[a, b, *c, *d] for a,b,c,d in series.values], columns=columns)
print(df)
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1 0 0.2 0.2 0.2 0.2 0.2 0.2
1 2 1000 0.6 0.7 0.5 0.1 0.3 0.1
2 1 0 0.4 0.4 0.4 0.4 0.4 0.4
3 1 0 0.5 0.5 0.5 0.5 0.5 0.5
4 3 14000 0.8 0.8 0.8 0.6 0.6 0.6
Or, perhaps you meant to do it a little more dynamically:
sigList = ['sig1', 'sig2', 'sig3']
columns = ['Length', 'Distance']
columns.extend(f'{s}{lbl}' for lbl in ('Max', 'Min') for s in sigList)
df = pd.DataFrame([[a, b, *c, *d] for a, b, c, d in series.values], columns=columns)
print(df)
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1 0 0.2 0.2 0.2 0.2 0.2 0.2
1 2 1000 0.6 0.7 0.5 0.1 0.3 0.1
2 1 0 0.4 0.4 0.4 0.4 0.4 0.4
3 1 0 0.5 0.5 0.5 0.5 0.5 0.5
4 3 14000 0.8 0.8 0.8 0.6 0.6 0.6
You may check:
newdf = pd.DataFrame(s.tolist())
newdf = pd.concat([newdf[[0, 1]], pd.DataFrame(newdf[2].tolist()), pd.DataFrame(newdf[3].tolist())], axis=1)
newdf.columns = [
"Length", "Distance", "sig1Max", "sig2Max", "sig3Max", "sig1Min", "sig2Min", "sig3Min"
]
newdf
Out[163]:
Length Distance sig1Max ... sig1Min sig2Min sig3Min
0 1 0 0.2 ... 0.2 0.2 0.2
1 2 1000 0.6 ... 0.1 0.3 0.1
2 1 0 0.4 ... 0.4 0.4 0.4
3 1 0 0.5 ... 0.5 0.5 0.5
4 3 14000 0.8 ... 0.6 0.6 0.6
[5 rows x 8 columns]
You can flatten each element and then convert each to a Series itself. Converting each element to a Series turns the main Series (s in the example below) into a DataFrame. Then just set the column names as you wish.
For example:
import pandas as pd
# load in your data
s = pd.Series([
    (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]),
    (2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1]),
    (1, 0, [0.4, 0.4, 0.4], [0.4, 0.4, 0.4]),
    (1, 0, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    (3, 14000, [0.8, 0.8, 0.8], [0.6, 0.6, 0.6]),
])
def flatten(x):
    # note this is not very robust, but works for this case
    return [x[0], x[1], *x[2], *x[3]]
df = s.apply(flatten).apply(pd.Series)
df.columns = [
"Length", "Distance", "sig1Max", "sig2Max", "sig3Max", "sig1Min", "sig2Min", "sig3Min"
]
Then you have df as:
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1.0 0.0 0.2 0.2 0.2 0.2 0.2 0.2
1 2.0 1000.0 0.6 0.7 0.5 0.1 0.3 0.1
2 1.0 0.0 0.4 0.4 0.4 0.4 0.4 0.4
3 1.0 0.0 0.5 0.5 0.5 0.5 0.5 0.5
4 3.0 14000.0 0.8 0.8 0.8 0.6 0.6 0.6
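Note that Length and Distance come back as floats because apply(pd.Series) gives the resulting columns a common dtype; a cast restores the integers (a sketch):
df = df.astype({"Length": int, "Distance": int})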
I have a dataframe:
df =
time time b
0 0.0 1.1 21
1 0.1 2.2 22
2 0.2 3.3 23
3 0.3 4.4 24
4 0.4 5.5 24
I also have a series for my units, defined as
su =
time sal
time zulu
b m/s
Now, I want to set df.index equal to the "time (sal)" values. Those values can be in any column and I will need to check.
I can do this as:
df.index = df.values[:,(df.columns == 'time') & (su.values == 'sal')]
But, my index looks like:
array([[0.0],
[0.1],
[0.2],
[0.3],
[0.4]])
However, this is an array of single-element arrays, and on bigger datasets plotting seems to take longer. If I hardcode the column position, I get just a flat array:
df.index = df.values[:, 0]
array([0.0, 0.1, 0.2, 0.3, 0.4])
I can also do the following:
inx = ((df.columns == 'time') & (s.values == 'sal')).tolist().index(True)
This sets "inx" to 0 and then gets a single array
df.index=df.values[0,inx]
However, I shouldn't have to do this. Am I using pandas and boolean indexing incorrectly?
I want:
df =
time time b
0.0 0.0 1.1 21
0.1 0.1 2.2 22
0.2 0.2 3.3 23
0.3 0.3 4.4 24
0.4 0.4 5.5 24
If I understood correctly, this is what you expected. However, I renamed the time columns to time1 and time2, since a dictionary cannot hold two identical keys.
df = {'time1': [0.0, 0.1, 0.2, 0.3, 0.4], 'time2': [1.1, 2.2, 3.3, 4.4, 5.5], 'b': [21, 22, 23, 24, 24]}
su = {'time1': 'sal', 'time2': 'zulu', 'b': 'm/s'}
indexes = df[list(su.keys())[list(su.values()).index('sal')]]
df = pd.DataFrame(df, index=indexes, columns=['time1', 'time2', 'b'])
print(df)
Your original DataFrame has duplicate column names, which adds complexity. Try modifying the column names.
Sample code:
unit = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])
>>> df
time time b
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
new_col = ['{}({})'.format(df.columns[i], unit.iloc[i]) for i in range(len(df.columns))]
>>> new_col
['time(sal)', 'time(zulu)', 'b(m/s)']
>>> df.columns = new_col
>>> df
time(sal) time(zulu) b(m/s)
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
>>> df.index = df['time(sal)'].values
>>> df
time(sal) time(zulu) b(m/s)
0.0 0.0 1.1 21.0
0.1 0.1 2.2 22.0
0.2 0.2 3.3 23.0
0.3 0.3 4.4 24.0
0.4 0.4 5.5 25.0
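As for the boolean-indexing question itself: the selection is fine, but slicing .values with a boolean column mask keeps the result 2-D with shape (n, 1); flattening it gives the 1-D index you wanted (a sketch using df and su from the question):
mask = (df.columns == 'time') & (su.values == 'sal')
df.index = df.values[:, mask].ravel()  # ravel() flattens the (n, 1) slice to a plain 1-D array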
I have a dataframe of the following structure which is simplified for this question.
A B C D E
0 2014/01/01 nan nan 0.2 nan
1 2014/01/01 0.1 nan nan nan
2 2014/01/01 nan 0.3 nan 0.7
3 2014/01/02 nan 0.4 nan nan
4 2014/01/02 0.5 nan 0.6 0.8
What I have here is a series of readings across several timestamps on single days. The columns B,C,D and E represent different locations. The data I am reading in is set up such that at a specified timestamp it takes data from certain locations and fills in nan values for the other locations.
What I wish to do is group the data by timestamp, which I can easily do with a .groupby() command. From there, I wish to have the NaN values in the grouped data overwritten with the valid values taken in later rows, so that the following result is obtained.
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
How do I go about achieving this?
Try df.groupby with DataFrameGroupBy.agg:
In [528]: df.groupby('A', as_index=False, sort=False).agg(np.nansum)
Out[528]:
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
A shorter version with DataFrameGroupBy.sum (thanks MaxU!):
In [537]: df.groupby('A', as_index=False, sort=False).sum()
Out[537]:
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
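One caveat: if some location had no valid reading at all within a group, both np.nansum and sum() would fill it in with 0.0 rather than NaN; sum accepts min_count=1 to preserve the NaN instead (a sketch):
df.groupby('A', as_index=False, sort=False).sum(min_count=1)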
You can try this using pandas' GroupBy.first:
df.groupby('A', as_index=False).first()
A B C D E
0 1/1/2014 0.1 0.3 0.2 0.7
1 1/2/2014 0.5 0.4 0.6 0.8