Delete pandas group based on condition - python

I have a pandas dataframe with several groups, and I would like to exclude groups where certain conditions (in a specific column) are not met. E.g. delete group B here because it has a non-number value in column "crit1".
I could delete specific columns based on a condition like df.loc[:, (df > 0).any(axis=0)], but then it doesn't delete the whole group.
And somehow I can't make the next step and apply this to the whole group.
name  crit1  crit2
A       0.3      4
A       0.7      6
B       inf      4
B       0.4      3
So the result after this filtering (allow only floats) should be:
A  0.3  4
A  0.7  6

You can use groupby and filter; for the example you give, you can check whether np.inf exists in a group and filter on that condition:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B'],
                   'crit1': [0.3, 0.7, np.inf, 0.4],
                   'crit2': [4, 6, 4, 3]})

df.groupby('name').filter(lambda g: (g != np.inf).all().all())
#   name  crit1  crit2
# 0    A    0.3      4
# 1    A    0.7      6
If the predicate only applies to one column, you can access that column on g via attribute access, for example:
df.groupby('name').filter(lambda g: (g.crit1 != np.inf).all())
#   name  crit1  crit2
# 0    A    0.3      4
# 1    A    0.7      6
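If you also need to reject NaN (the question wants only proper float values), np.isfinite covers inf, -inf and NaN in one check; a minimal sketch using the same df:
# keep only the groups whose crit1 values are all finite
# (np.isfinite rejects inf, -inf and NaN, not just np.inf)
df.groupby('name').filter(lambda g: np.isfinite(g.crit1).all())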

Related

Manipulate string values in pandas

I have a pandas dataframe with mixed formats in one column, like this:
Name    Values
First   5-9
Second  7
Third   -
Fourth  12-16
I need to iterate over the Values column and, where the format is like the first row (5-9) or the fourth row (12-16), replace the string with the mean of the two numbers in it:
for the first row replace 5-9 with 7, and for the fourth row replace 12-16 with 14.
And if the format is like the third row (-), replace it with 0.
I have tried
if df["Value"].str.len() > 1:
df["Value"] = df["Value"].str.split('-')
df["Value"] = (df["Value"][0] + df["Value"][1]) / 2
elif df["Value"].str.len() == 1:
df["Value"] = df["Value"].str.replace('-', 0)
Expected output:
Name    Values
First   7
Second  7
Third   0
Fourth  14
Let us split and expand the column, then cast the values to float and take the mean along the column axis:
s = df['Values'].str.split('-', expand=True)
df['Values'] = s[s != ''].astype(float).mean(1).fillna(0)
     Name  Values
0   First     7.0
1  Second     7.0
2   Third     0.0
3  Fourth    14.0
You can use str.replace with a customized replacement function:
mint = lambda s: int(s or 0)  # treat a missing group ('' from \d*) as 0
repl = lambda m: str(sum(map(mint, map(m.group, [1, 2]))) / 2)  # mean of the two captured numbers
df['Values'] = df['Values'].str.replace(r'(\d*)-(\d*)', repl, regex=True)
print(df)
     Name Values
0   First    7.0
1  Second      7
2   Third    0.0
3  Fourth   14.0
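Note that str.replace returns strings, so the column stays object dtype (hence the mixed 7.0 / 7 above); if you need numbers downstream, one option is a follow-up conversion:
# the column is still object (string) dtype after str.replace;
# convert to real floats if you need numeric values downstream
df['Values'] = pd.to_numeric(df['Values'])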

Conditionally replace values in pandas.DataFrame with previous value

I need to filter outliers in a dataset. Replacing the outlier with the previous value in the column makes the most sense in my application.
I was having considerable difficulty doing this with the pandas tools available (mostly to do with copies on slices, or type conversions occurring when setting to NaN).
Is there a fast and/or memory efficient way to do this? (Please see my answer below for the solution I am currently using, which also has limitations.)
A simple example:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,1000,6,7,8],'B':list('abcdefgh')})
>>> df
A B
0 1 a
1 2 b
2 3 c
3 4 d
4 1000 e # '1000 e' --> '4 e'
5 6 f
6 7 g
7 8 h
You can simply mask values over your threshold and use ffill:
df.assign(A=df.A.mask(df.A.gt(10)).ffill())
A B
0 1.0 a
1 2.0 b
2 3.0 c
3 4.0 d
4 4.0 e
5 6.0 f
6 7.0 g
7 8.0 h
Using mask is necessary rather than something like shift, because it guarantees non-outlier output in the case that the previous value is also above the threshold.
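A minimal sketch (with hypothetical data containing two adjacent outliers) illustrating this: a plain shift would copy 1000 into row 5, while mask + ffill propagates the last valid value into both rows.
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3, 4, 1000, 2000, 7, 8]})
# rows 4 and 5 both become 4.0, the last non-outlier value
print(df2.assign(A=df2.A.mask(df2.A.gt(10)).ffill()))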
I circumvented some of the issues with pandas copies and slices by converting to a numpy array first, doing the operations there, and then re-inserting the column. I'm not certain, but as far as I can tell, the datatype is the same once it is put back into the pandas.DataFrame.
import numpy as np

def df_replace_with_previous(df, col, maskfunc, inplace=False):
    arr = np.array(df[col])
    mask = maskfunc(arr)
    # shift the mask back by one so each True points at the previous entry
    arr[mask] = arr[list(mask)[1:] + [False]]
    if inplace:
        df[col] = arr
        return
    else:
        df2 = df.copy()
        df2[col] = arr
        return df2
This creates a mask, shifts it back by one so that the True values point at the previous entry, and updates the array. Of course, this will need to run repeatedly if there are multiple adjacent outliers (N times for N consecutive outliers), which is not ideal.
Usage for the case given in the OP:
df_replace_with_previous(df, 'A', lambda x: x > 10, False)
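A hedged wrapper for the consecutive-outlier case (a hypothetical helper, assuming the first value is itself valid): re-run the single-step replacement until no masked values remain.
def df_replace_with_previous_all(df, col, maskfunc):
    # each pass repairs one layer of every run of consecutive outliers
    out = df.copy()
    while maskfunc(np.array(out[col])).any():
        out = df_replace_with_previous(out, col, maskfunc)
    return out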

How to filter Pandas rows based on last/next row?

I have two data sets from different pulse oximeters, and plot them with pyplot as displayed below. As you can see, the green data set has a lot of outliers (vertical drops). In my work I've defined these outliers as non-valid for my statistical analysis; they are most certainly not real measurements. Therefore I argue that I can simply remove them.
The characteristic of these rogue values is that they are single (or at most two) value outliers (see df below). The "real" sample values are either the same as the previous value, or +-1. In e.g. Java I would do something like this (pseudocode):
for (i; i < df.length; i++)
    if (abs(df[i+1].spo2 - df[i].spo2) > 1 || abs(df[i-1].spo2 - df[i].spo2) > 1)
        df[i].drop
What would be the pandas (numpy?) equivalent of what I'm trying to do: remove values that are more than 1 away from the previous/next value?
df:
time, spo2
1900-01-01 18:18:41.194 98.0
1900-01-01 18:18:41.376 98.0
1900-01-01 18:18:41.559 78.0
1900-01-01 18:18:41.741 98.0
1900-01-01 18:18:41.923 98.0
1900-01-01 18:18:42.105 90.0
1900-01-01 18:18:42.288 97.0
1900-01-01 18:18:42.470 97.0
1900-01-01 18:18:42.652 98.0
Have a look at pandas.DataFrame.shift. It is a column-wise operation that shifts all values in a column up or down by a given number of rows, which lets you line each row up with its neighbours:
# original df
x1
0 0
1 1
2 2
3 3
4 4
# shift down
df['x2'] = df['x1'].shift(1)
x1 x2
0 0 NaN # Beware
1 1 0
2 2 1
3 3 2
4 4 3
# Shift up
df['x2'] = df['x1'].shift(-1)
x1 x2
0 0 1
1 1 2
2 2 3
3 3 4
4 4 NaN # Beware
You can use this to put the spo2 of timestamp n+1 next to the spo2 of timestamp n in the same row, and then filter based on conditions applied to that row.
df['spo2_Next'] = df['spo2'].shift(-1)
# replace the trailing NaN so the float comparison below is well defined
# (filling with the row's own value means the last row is never flagged)
df['spo2_Next'] = df['spo2_Next'].fillna(df['spo2'])
# apply your row-wise condition to create a filter column
df.loc[((df.spo2_Next - df.spo2) > 1) | ((df.spo2_Next - df.spo2) < -1), 'Outlier'] = True
# filter, then drop the helper columns
df_clean = df[df.Outlier != True].drop(columns=['spo2_Next', 'Outlier'])
When you filter a pandas dataframe like:
df[(df.column1 == 2) & (df.column2 < 3)], you are:
comparing a numeric series to a scalar value and generating a boolean series
obtaining two boolean series and doing a logical and
then using a boolean series to filter the data frame (rows where it is False are left out of the new data frame)
So you just need to create an iterative algorithm over the data frame to produce such a boolean array, and use it to filter the dataframe, as in:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
df[[True, False, True]]
You can also create a closure to filter the data frame (using df.apply), keeping previous observations inside the closure to detect abrupt changes, but that would be way too complicated. I would go for the straightforward imperative solution.
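For reference, a vectorized sketch of the neighbour comparison described in the question (hypothetical single-column data; a row is dropped only when it disagrees with both its previous and next value, matching the single-value-outlier description):
import pandas as pd

df = pd.DataFrame({'spo2': [98.0, 98.0, 78.0, 98.0, 98.0]})

prev_ok = (df['spo2'] - df['spo2'].shift(1)).abs() <= 1
next_ok = (df['spo2'] - df['spo2'].shift(-1)).abs() <= 1
# the shifts produce NaN at the edges, which compares as False,
# so the first and last rows rely on their single existing neighbour
df_clean = df[prev_ok | next_ok]
print(df_clean)
#    spo2
# 0  98.0
# 1  98.0
# 3  98.0
# 4  98.0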

Adding calculated constant value into Python data frame

I'm new to Python, and I believe this is a very basic question (sorry for that), but I tried to look for a solution here: Better way to add constant column to pandas data frame and here: add column with constant value to pandas dataframe and in many other places...
I have a data frame like this "toy" sample:
A B
10 5
20 12
50 200
and I want to add a new column (C) which will be the division of the last data cells of A and B (50/200); so in my example, I'd like to get:
A B C
10 5 0.25
20 12 0.25
50 200 0.25
I tried to use this code:
groupedAC['pNr'] = groupedAC['cIndCM'][-1:] / groupedAC['nTileCM'][-1:]
but I'm getting the result only in the last cell (I believe it's because my code acts as a "pointer" and not as a number; as I said, I tried to "convert" the result into a constant, even using temp variables, but with no success).
Your help will be appreciated!
You need to index it with .iloc[-1] instead of .iloc[-1:], because the latter returns a Series, and when assigning a Series back to the data frame its index needs to be matched:
df.B.iloc[-1:] # returns a Series
#2    200
#Name: B, dtype: int64
df['C'] = df.A.iloc[-1:]/df.B.iloc[-1:] # the index has to be matched in this case, so only
                                        # the row with index = 2 gets updated
df
# A B C
#0 10 5 NaN
#1 20 12 NaN
#2 50 200 0.25
df.B.iloc[-1] # returns a scalar
# 200
df['C'] = df.A.iloc[-1]/df.B.iloc[-1] # there's nothing to match when assigning a
                                      # scalar to a new column, so the value gets broadcast
df
# A B C
#0 10 5 0.25
#1 20 12 0.25
#2 50 200 0.25
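Putting it together as a runnable sketch with the toy data from the question:
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 50], 'B': [5, 12, 200]})
# scalar division of the last cells of A and B, broadcast to every row
df['C'] = df['A'].iloc[-1] / df['B'].iloc[-1]
print(df)
#     A    B     C
# 0  10    5  0.25
# 1  20   12  0.25
# 2  50  200  0.25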

Pandas divide one row by another and output to another row in the same dataframe

For a DataFrame such as:
dt
          COL000  COL001
STK_ID
Rowname1       2       2
Rowname2       1       4
Rowname3       1       1
What's the easiest way to append to the same data frame the result of dividing Rowname1 by Rowname2? I.e. the desired outcome is:
          COL000  COL001
STK_ID
Rowname1       2       2
Rowname2       1       4
Rowname3       1       1
Newrow         2     0.5
Sorry if this is a simple question, I'm slowly getting to grips with pandas from an R background.
Thanks in advance!!!
The code below will create a new row with index d, formed by dividing row a by row b.
import pandas as pd
df = pd.DataFrame(data={'x':[1,2,3], 'y':[4,5,6]}, index=['a', 'b', 'c'])
df.loc['d'] = df.loc['a'] / df.loc['b']
print(df)
# x y
# a 1.0 4.0
# b 2.0 5.0
# c 3.0 6.0
# d 0.5 0.8
In order to access the first two rows without caring about the index, you can use:
df.loc['newrow'] = df.iloc[0] / df.iloc[1]
then just follow @Ffisegydd's solution...
In addition, if you want to append multiple rows, use pd.concat (pd.DataFrame.append was deprecated and has been removed in pandas 2.0).
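For instance, a small sketch with pd.concat, reusing the df defined in the answer above:
# divide row 'a' by row 'b' and append the result as new row 'd';
# to_frame('d').T turns the resulting Series into a one-row DataFrame
new_row = (df.loc['a'] / df.loc['b']).to_frame('d').T
df = pd.concat([df, new_row])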
pandas does all the work row by row. By including another element, it also interprets that you want a new column:
data['new_row_with_division'] = data['row_name1_values'] / data['row_name2_values']
