Create two dataframes in pandas based on values in columns - python

Given the following data
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7], "b": [4, 5, 9, 5, 6, 4, 0]})
df["split_by"] = df["b"].eq(9)
which looks like this:
a b split_by
0 1 4 False
1 2 5 False
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
I would like to create two dataframes as follows:
a b split_by
0 1 4 False
1 2 5 False
and
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
Clearly this is based on the value in column split_by, but I'm not sure how to subset using this.
My approach is:
split_1 = df.index < df[df["split_by"].eq(True)].index.to_list()[0]
split_2 = ~df.index.isin(split_1)
df1 = df[split_1]
df2 = df[split_2]
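(Note: split_1 is already a boolean mask, so the complement can be taken directly; isin compares the index labels against booleans and only works here by coincidence:)
split_2 = ~split_1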

Use argmax as:
true_index = df['split_by'].argmax()
df1 = df.loc[:true_index-1, :]
df2 = df.loc[true_index:, :]
print(df1)
a b split_by
0 1 4 False
1 2 5 False
print(df2)
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
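Note that Series.argmax returns a position, so the .loc slicing above only works because the index is the default RangeIndex. A positional sketch that avoids that assumption (and yields an empty df1 if the first row is True):
pos = df['split_by'].to_numpy().argmax()  # position of the first True
df1 = df.iloc[:pos]
df2 = df.iloc[pos:]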

Another approach:
i = df[df['split_by']==True].index.values[0]
df1 = df.iloc[:i]
df2 = df.iloc[i:]
This assumes you have only one "True". If you have more than one "True", this code will still split df into only two dataframes, considering only the first "True".

Use groupby with cumsum. Note that if you have more than one True, this will split the dataframe into n+1 dfs (for n Trues):
d = {x: y for x, y in df.groupby(df.split_by.cumsum())}
d[0]
a b split_by
0 1 4 False
1 2 5 False
d[1]
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
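To illustrate the n+1 behaviour, a small made-up sketch with two True rows (df_multi is not from the original question):
df_multi = pd.DataFrame({"b": [4, 9, 5, 9, 0]})
df_multi["split_by"] = df_multi["b"].eq(9)
parts = {k: g for k, g in df_multi.groupby(df_multi["split_by"].cumsum())}
print(len(parts))  # 3 dataframes: groups 0, 1 and 2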

Related

How to replace values along the row until some condition is met?

Imagine I have a dataframe like this:
df = pd.DataFrame({"ID":["A","B","C","C","D"],
"DAY 1":[0, 0, 4, 0, 8],
"DAY 2":[3, 0, 4, 1, 2],
"DAY 3":[0, 2, 9, 9, 6],
"DAY 4":[9, 2, 4, 5, 7]})
df
Out[7]:
ID DAY 1 DAY 2 DAY 3 DAY 4
0 A 0 3 0 9
1 B 0 0 2 2
2 C 4 4 9 4
3 C 0 1 9 5
4 D 8 2 6 7
I would like to iterate over every row and replace all 0 values at the beginning of the row, up to the first non-zero value.
The ID column should be excluded from this condition, only the other columns. And I would like to replace these values with NaN. So the output should be like this:
ID DAY 1 DAY 2 DAY 3 DAY 4
0 A nan 3 0 9
1 B nan nan 2 2
2 C 4 4 9 4
3 C nan 1 9 5
4 D 8 2 6 7
And notice that the 0 value in df.loc[0, "DAY 3"] is still there because it didn't meet the condition, as this condition happens only before df.loc[0, "DAY 2"].
Could anyone help me?
You can use a boolean cummin on a subset of the DataFrame to generate a mask for boolean indexing:
mask = (df.filter(like='DAY').eq(0).cummin(axis=1)
          .reindex(columns=df.columns, fill_value=False))
df[mask] = float('nan')
print(df)
Output:
ID DAY 1 DAY 2 DAY 3 DAY 4
0 A NaN 3.0 0 9
1 B NaN NaN 2 2
2 C 4.0 4.0 9 4
3 C NaN 1.0 9 5
4 D 8.0 2.0 6 7
Intermediate mask:
ID DAY 1 DAY 2 DAY 3 DAY 4
0 False True False False False
1 False True True False False
2 False False False False False
3 False True False False False
4 False False False False False
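Why this works: on booleans, cummin(axis=1) stays True only until the first False in the row, so the mask covers exactly the leading run of zeros. A sketch of the intermediates, computed on the original df before the NaN assignment:
zeros = df.filter(like='DAY').eq(0)  # True wherever a cell is 0
leading = zeros.cummin(axis=1)       # True only for the leading run of zeros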

How to check if every range of a dataframe series between 2 values has another value

In a pandas DataFrame I have one series, A:
Index A
0 2
1 1
2 6
3 3
4 2
5 7
6 1
7 3
8 2
9 1
10 3
I would like to check whether between every pair of 1s in column A there is a number 2, and write the results in column B like this:
Index A B
0 2 FALSE
1 1 FALSE
2 6 FALSE
3 3 FALSE
4 2 TRUE
5 7 FALSE
6 1 FALSE
7 3 FALSE
8 2 TRUE
9 1 FALSE
10 3 FALSE
I thought of using rolling(), but can rolling() work with a variable-sized window of values (ranges)?
mask = (df["A"] == 1) | (df["A"] == 2)
df = df.assign(B=df.loc[mask, "A"].sub(df.loc[mask, "A"].shift()).eq(1)).fillna(False)
>>> df
A B
0 2 False
1 1 False
2 6 False
3 3 False
4 2 True
5 7 False
6 1 False
7 3 False
8 2 True
9 1 False
10 3 False
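To see the intermediates (a sketch, using the same df and mask): restricting the series to the 1s and 2s, a 2 whose predecessor within that subset is a 1 gives a difference of exactly 1:
sub = df.loc[mask, "A"]            # the 1s and 2s only: 2, 1, 2, 1, 2, 1
print(sub.sub(sub.shift()).eq(1))  # True exactly at the qualifying 2s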
IIUC use np.where to assign 2s in 'A' to True after the first 1 is found:
df['B'] = np.where(df['A'].eq(1).cumsum().ge(1) & df['A'].eq(2), True, False)
df:
A B
0 2 False
1 1 False
2 6 False
3 3 False
4 2 True
5 7 False
6 1 False
7 3 False
8 2 True
9 1 False
10 3 False
Imports and DataFrame used:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [2, 1, 6, 3, 2, 7, 1, 3, 2, 1, 3]})
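Since the condition is already boolean, the np.where wrapper is optional; assigning the mask directly gives the same column:
df['B'] = df['A'].eq(1).cumsum().ge(1) & df['A'].eq(2)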
Here is a very basic and straightforward way to do it:
import pandas as pd
df = pd.DataFrame({"A": [2, 1, 6, 3, 2, 7, 1, 3, 2, 1, 3]})
for i, cell in enumerate(df["A"]):
    if (1 in list(df.loc[:i, "A"])) and (1 in list(df.loc[i:, "A"])) and cell == 2:
        df.at[i, "B"] = True
    else:
        df.at[i, "B"] = False

Match Value and Get its Column Header in python

Sample (the original sample was posted as an image; a reconstruction appears after the question)
I have a 1000-by-6 dataframe where A, B, C, D were rated by people on a scale of 1-10.
In the SELECT column I have a value which, in every row, equals the value in one of A/B/C/D.
I want to change the value in SELECT to the name of the column it matches. For example, for ID 1, SELECT = 1 and D = 1, so the value of SELECT should change to D.
import pandas as pd
df = pd.read_excel("u.xlsx",sheet_name = "Sheet2",header = 0)
But I am lost how to proceed.
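For a reproducible setup, a three-row frame matching the outputs in the answers below can be built like this (a reconstruction from the printed values, since the original sample was an image):
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 3],
                   'A': [4, 5, 7],
                   'B': [9, 7, 4],
                   'C': [7, 2, 8],
                   'D': [1, 8, 6],
                   'SELECT': [1, 2, 8]})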
Similar to Gwenersl's solution - compare all columns except ID and SELECT (obtained with Index.difference) using DataFrame.eq (==), take the first True per row with idxmax, and use numpy.where to fall back to 'no match' where a row has no matching value:
import numpy as np

cols = df.columns.difference(['ID', 'SELECT'])
mask = df[cols].eq(df['SELECT'], axis=0)
df['SELECT'] = np.where(mask.any(axis=1), mask.idxmax(axis=1), 'no match')
print (df)
ID A B C D SELECT
0 1 4 9 7 1 D
1 2 5 7 2 8 C
2 3 7 4 8 6 C
Detail:
print (mask)
A B C D
0 False False False True
1 False False True False
2 False False True False
Assuming the values in A, B, C, D are unique in each row with respect to SELECT, I'd do it like this:
>>> df
ID A B C D SELECT
0 1 4 9 7 1 1
1 2 5 7 2 8 2
2 3 7 4 8 6 8
>>>
>>> df_abcd = df.loc[:, 'A':'D']
>>> df['SELECT'] = df_abcd.apply(lambda row: row.isin(df['SELECT']).idxmax(), axis=1)
>>> df
ID A B C D SELECT
0 1 4 9 7 1 D
1 2 5 7 2 8 C
2 3 7 4 8 6 C
Use -
df['SELECT2'] = df.columns[
    pd.DataFrame([df['SELECT'] == df['A'], df['SELECT'] == df['B'],
                  df['SELECT'] == df['C'], df['SELECT'] == df['D']])
    .transpose().idxmax(1) + 1
]
Output
ID A B C D SELECT SELECT2
0 1 4 9 7 1 1 D
1 2 5 7 2 8 2 C
2 3 7 4 8 6 8 C

How does pandas.shift really work?

I want to delete duplicate adjacent rows in a dataframe. I was trying to do this with df[df.shift() != df].dropna().reset_index(drop=True) but shift() is not behaving the way I expected.
Look at the following example
In [11]: df
Out[11]:
x y
0 a 1
1 b 2
2 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
df.x[3] equals df.x[4] but the y values are different. Yet the output is the following:
In [13]: df[df.shift() != df]
Out[13]:
x y
0 a 1
1 b 2
2 NaN NaN
3 e 4
4 NaN 5
5 f 6
6 g 7
7 h 8
I want to delete a row only if it is entirely a duplicate of the row above, not if it merely contains some duplicate values. Any idea?
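(For reference, the frame above can be rebuilt from the printed output:)
df = pd.DataFrame({'x': list('abbeefgh'),
                   'y': [1, 2, 2, 4, 5, 6, 7, 8]})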
Well, look at df.shift() != df:
>>> df.shift() != df
x y
0 True True
1 True True
2 False False
3 True True
4 False True
5 True True
6 True True
7 True True
This is a 2D object, not 1D, so when you use it as a filter on a frame you keep the cells where you have True and get NaN where you have False. It sounds like you want to keep the rows where any value is True, which is a 1D object:
>>> (df.shift() != df).any(axis=1)
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
>>> df[(df.shift() != df).any(axis=1)]
x y
0 a 1
1 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
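Putting it together with the reset_index(drop=True) the question started from:
deduped = df[(df.shift() != df).any(axis=1)].reset_index(drop=True)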

How to delete row based on row above? Python Pandas

I have a dataset which looks like this:
df = pd.DataFrame({'a': [1,1,1, 2, 3, 3, 4], 'b': [1,np.nan, np.nan, 2, 3, np.nan, 4]})
I'm looking to delete all rows which have np.nan in the row above, essentially. I can't figure out how to do this because I don't know how to delete rows based on other rows.
You want to find all the rows that have np.nan in the row above. Use shift for that, since df.shift() aligns each row with the values of the row above it:
df.shift().isnull()
a b
0 True True
1 False False
2 False True
3 False True
4 False False
5 False False
6 False True
Then you want to figure out if anything in that row was nan, so you want to reduce this to a single boolean mask.
df.shift().isnull().any(axis=1)
0 True
1 False
2 True
3 True
4 False
5 False
6 True
dtype: bool
Then drop those rows by boolean indexing with the inverted mask (note that df.drop expects labels, not a boolean mask):
df[~df.shift().isnull().any(axis=1)]
a b
1 1 NaN
4 3 3.0
5 3 NaN
Yes you can create a mask which will remove unwanted rows by combining df.notnull and df.shift:
notnull = df.notnull().all(axis=1)
df = df[notnull.shift(-1).fillna(True)]  # the last row has no row below it, so keep it
Test whether the rows are null with notnull:
In [11]: df.notnull()
Out[11]:
a b
0 True True
1 True False
2 True False
3 True True
4 True True
5 True False
6 True True
In [12]: df.notnull().all(1)
Out[12]:
0 True
1 False
2 False
3 True
4 True
5 False
6 True
dtype: bool
In [13]: df[df.notnull().all(1)]
Out[13]:
a b
0 1 1
3 2 2
4 3 3
6 4 4
You can shift down to test whether the row above was fully non-null (astype(bool) turns the leading NaN for the first row, which has nothing above it, into True):
In [14]: df.notnull().all(1).shift().astype(bool)
Out[14]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
In [15]: df[df.notnull().all(1).shift().astype(bool)]
Out[15]:
a b
0 1 1
1 1 NaN
4 3 3
5 3 NaN
Note: You can shift upwards with shift(-1).
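A sketch of that variant - shift(-1) looks at the row below, and the last row (which has no row below) is kept by filling with True:
keep = df.notnull().all(axis=1).shift(-1).fillna(True)
print(df[keep])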
