How to delete row based on row above? Python Pandas

I have a dataset which looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 3, 3, 4], 'b': [1, np.nan, np.nan, 2, 3, np.nan, 4]})
I'm essentially looking to delete all rows which have np.nan anywhere in the row that follows them. I can't figure out how to do this because I don't know how to delete rows based on other rows.

You want to find all the rows that have a np.nan in the next row. Use shift(-1) for that; it pulls each following row up one position:
df.shift(-1).isnull()
a b
0 False True
1 False True
2 False False
3 False False
4 False True
5 False False
6 True True
Then you want to know whether anything in that row was NaN, so reduce it to a single boolean mask with any:
df.shift(-1).isnull().any(axis=1)
0 True
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
Then just drop those rows by inverting the mask (shifting leaves NaN in the last row, so it is flagged and dropped too):
df[~df.shift(-1).isnull().any(axis=1)]
a b
2 1 NaN
3 2 2
5 3 NaN

Yes, you can create a mask that removes the unwanted rows by combining df.notnull and Series.shift:
notnull = df.notnull().all(axis=1)
df = df[notnull.shift(-1, fill_value=True)]
The fill_value=True keeps the last row, which has no following row to disqualify it; without it the shifted mask contains NaN and cannot be used for indexing.
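For reference, a self-contained run of this approach on the question's data (result shown in the comments):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 3, 3, 4],
                   'b': [1, np.nan, np.nan, 2, 3, np.nan, 4]})
notnull = df.notnull().all(axis=1)
print(df[notnull.shift(-1, fill_value=True)])
#    a    b
# 2  1  NaN
# 3  2  2.0
# 5  3  NaN
# 6  4  4.0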

First test which entries are non-null with notnull:
In [11]: df.notnull()
Out[11]:
a b
0 True True
1 True False
2 True False
3 True True
4 True True
5 True False
6 True True
In [12]: df.notnull().all(1)
Out[12]:
0 True
1 False
2 False
3 True
4 True
5 False
6 True
dtype: bool
In [13]: df[df.notnull().all(1)]
Out[13]:
a b
0 1 1
3 2 2
4 3 3
6 4 4
You can shift down to get whether the above row was NaN:
In [14]: df.notnull().all(1).shift().astype(bool)
Out[14]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
In [15]: df[df.notnull().all(1).shift().astype(bool)]
Out[15]:
a b
0 1 1
1 1 NaN
4 3 3
5 3 NaN
Note: to key off the following row instead, shift upwards with shift(-1).
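For example, continuing the session above (the NaN that shift(-1) leaves in the last position becomes True under astype(bool), so the final row is kept):
In [16]: df[df.notnull().all(1).shift(-1).astype(bool)]
Out[16]:
a b
2 1 NaN
3 2 2
5 3 NaN
6 4 4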

Related

Find the index of the last true occurrence in a column by row

I have the following table format:
id bool
1 true
2 true
3 false
4 false
5 false
6 true
I'd like another column holding, for each row, the id of the last row at or before it where the bool column is true. If the current row is true, it should return its own id. It doesn't sound too hard with a for loop, but I want it in a clean pandas style. I.e., in this example I would get:
column = [1,2,2,2,2,6]
IIUC, you can mask and ffill:
df['new'] = df['id'].where(df['bool']).ffill(downcast='infer')
output:
id bool new
0 1 True 1
1 2 True 2
2 3 False 2
3 4 False 2
4 5 False 2
5 6 True 6
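If your pandas version deprecates the downcast argument (the case in recent releases; an assumption about your environment), cast explicitly instead:
df['new'] = df['id'].where(df['bool']).ffill().astype(int)
# astype(int) is safe here because the first row is True, so no NaN remains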
Alternatively, multiply id by the boolean column (False rows become 0) and take the running maximum; note this relies on id increasing down the frame:
df['new'] = df['id'].mul(df['bool']).cummax()
Out[344]:
0 1
1 2
2 2
3 2
4 2
5 6
dtype: int64
Or with np.where, here on a frame df1 whose flag column is named bool2:
df1.assign(col1=np.where(df1.bool2, df1.id, pd.NA)).ffill()
id bool2 col1
0 1 True 1
1 2 True 2
2 3 False 2
3 4 False 2
4 5 False 2
5 6 True 6

How to check if every range of a dataframe series between 2 values has another value

In a pandas DataFrame I have one series, A:
Index A
0 2
1 1
2 6
3 3
4 2
5 7
6 1
7 3
8 2
9 1
10 3
I would like to check whether there is a 2 between every pair of 1s in column A, and flag those 2s in column B, like this:
Index A B
0 2 FALSE
1 1 FALSE
2 6 FALSE
3 3 FALSE
4 2 TRUE
5 7 FALSE
6 1 FALSE
7 3 FALSE
8 2 TRUE
9 1 FALSE
10 3 FALSE
I thought of using rolling(), but can rolling() work with a variable-sized window of values (ranges)?
mask = (df["A"] == 1) | (df["A"] == 2)
df = df.assign(B=df.loc[mask, "A"].sub(df.loc[mask, "A"].shift()).eq(1)).fillna(False)
>>> df
A B
0 2 False
1 1 False
2 6 False
3 3 False
4 2 True
5 7 False
6 1 False
7 3 False
8 2 True
9 1 False
10 3 False
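To see why this works, inspect the subsequence the mask selects; a 2 whose predecessor within it is a 1 gives a difference of exactly 1:
>>> df.loc[mask, "A"]
0    2
1    1
4    2
6    1
8    2
9    1
Name: A, dtype: int64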
IIUC, use np.where to mark the 2s in 'A' that appear after the first 1 (note this would also flag a 2 coming after the last 1, which this sample does not contain):
df['B'] = np.where(df['A'].eq(1).cumsum().ge(1) & df['A'].eq(2), True, False)
df:
A B
0 2 False
1 1 False
2 6 False
3 3 False
4 2 True
5 7 False
6 1 False
7 3 False
8 2 True
9 1 False
10 3 False
Imports and DataFrame used:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [2, 1, 6, 3, 2, 7, 1, 3, 2, 1, 3]})
Here is a very basic and straightforward way to do it:
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 6, 3, 2, 7, 1, 3, 2, 1, 3]})
for i, cell in enumerate(df["A"]):
    # a 2 counts only if a 1 occurs somewhere before it and somewhere after it
    if (1 in list(df.loc[:i, "A"])) and (1 in list(df.loc[i:, "A"])) and cell == 2:
        df.at[i, "B"] = True
    else:
        df.at[i, "B"] = False

Create two dataframes in pandas based on values in columns

Given the following data
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7], "b": [4, 5, 9, 5, 6, 4, 0]})
df["split_by"] = df["b"].eq(9)
which looks as
a b split_by
0 1 4 False
1 2 5 False
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
I would like to create two dataframes as follows:
a b split_by
0 1 4 False
1 2 5 False
and
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
Clearly this is based on the value in column split_by, but I'm not sure how to subset using this.
My approach is:
split_1 = df.index < df[df["split_by"].eq(True)].index.to_list()[0]
split_2 = ~df.index.isin(split_1)
df1 = df[split_1]
df2 = df[split_2]
Use argmax to get the position of the first True:
true_index = df['split_by'].argmax()
df1 = df.loc[:true_index-1, :]
df2 = df.loc[true_index:, :]
print(df1)
a b split_by
0 1 4 False
1 2 5 False
print(df2)
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
Another approach:
i = df[df['split_by']==True].index.values[0]
df1 = df.iloc[:i]
df2 = df.iloc[i:]
This is assuming you have only one "True". If you have more than one "True", this code will split df into only two dataframes regardless, considering only the first "True".
Use groupby with cumsum. Note that if you have more than one True, this will split the dataframe into n+1 dataframes (for n Trues):
d = {x: y for x, y in df.groupby(df.split_by.cumsum())}
d[0]
a b split_by
0 1 4 False
1 2 5 False
d[1]
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
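With exactly one True, as here, the dictionary holds two frames, so you can unpack it directly (df1/df2 are just illustrative names):
df1, df2 = d[0], d[1]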

Select a specific value in a group using python pandas

I have a dataset with the data below.
id status div
1 True 0
2 False 2
2 True 1
3 False 4
3 False 5
1 False 5
4 True 3
4 True 10
5 False 3
5 False 3
5 True 2
I want my output as
id status div
1 True 0
2 True 1
3 False 4
4 True 3
5 True 2
If True is present in a group I want its first True row; if only False is present I want the first of those rows.
I have tried using pandas groupby but I am unable to select on this condition.
Use DataFrameGroupBy.any for the status, and map the div values in from a helper Series that keeps one row per id, preferring rows where status is True:
s = (df.sort_values(['status', 'id'], ascending=False)
       .drop_duplicates('id')
       .set_index('id')['div'])
print (s)
id
5 2
4 3
2 1
1 0
3 4
Name: div, dtype: int64
df1 = df.groupby('id')['status'].any().reset_index()
df1['div'] = df1['id'].map(s)
print (df1)
id status div
0 1 True 0
1 2 True 1
2 3 False 4
3 4 True 3
4 5 True 2
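A shorter alternative sketch (not from the original answer; it assumes a pandas version where idxmax works on boolean groups): idxmax returns the index of the first True per group, or the group's first row when everything is False, so those rows can be selected directly:
df.loc[df.groupby('id')['status'].idxmax()]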

How does pandas.shift really work?

I want to delete duplicate adjacent rows in a dataframe. I was trying to do this with df[df.shift() != df].dropna().reset_index(drop=True) but shift() is not behaving in the way I meant.
Look at the following example
In [11]: df
Out[11]:
x y
0 a 1
1 b 2
2 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
df.x[3] equals df.x[4] but their y values differ, so row 4 is not a full duplicate. Yet the output is the following:
In [13]: df[df.shift() != df]
Out[13]:
x y
0 a 1
1 b 2
2 NaN NaN
3 e 4
4 NaN 5
5 f 6
6 g 7
7 h 8
I want to delete a row only if it is a genuine duplicate of the row before it, not if it merely shares some values with it. Any idea?
Well, look at df.shift() != df:
>>> df.shift() != df
x y
0 True True
1 True True
2 False False
3 True True
4 False True
5 True True
6 True True
7 True True
This is a 2D object, not 1D, so when you use it as a filter on a frame you keep the cells where it is True and get NaN where it is False. It sounds like you want to keep the rows where either column is True, i.e. where any are True, which is a 1D mask:
>>> (df.shift() != df).any(axis=1)
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
>>> df[(df.shift() != df).any(axis=1)]
x y
0 a 1
1 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
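To finish the job as phrased in the question, drop the adjacent duplicates and renumber the index:
df[(df.shift() != df).any(axis=1)].reset_index(drop=True)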
