Perform logical operations on every column of a pandas dataframe? - python

I'm trying to create a new DataFrame column based on a condition evaluated across all the other columns, row by row.
import pandas as pd

df = pd.DataFrame([[1, 5, 2, 8, 2], [2, 4, 4, 20, 5], [3, 3, 1, 20, 2],
                   [4, 2, 2, 1, 0], [5, 1, 4, -5, -4]],
                  columns=['a', 'b', 'c', 'd', 'e'],
                  index=[1, 2, 3, 4, 5])
I tried:
df['f'] = ""
df.loc[(df.any() >= 10), 'f'] = df['e'] + 10
However I get:
IndexingError: Unalignable boolean Series key provided
This is the desired output:
a b c d e f
1 1 5 2 8 2
2 2 4 4 20 5 15
3 3 3 1 20 2 12
4 4 2 2 1 0
5 5 1 4 -5 -4

Use
In [984]: df.loc[(df >= 10).any(axis=1), 'f'] = df['e'] + 10
In [985]: df
Out[985]:
a b c d e f
1 1 5 2 8 2 NaN
2 2 4 4 20 5 15.0
3 3 3 1 20 2 12.0
4 4 2 2 1 0 NaN
5 5 1 4 -5 -4 NaN

Note why your attempt fails. df.any() reduces each column to a single boolean, so it returns a Series indexed by the column labels, not by the rows:
df.any()
a True
b True
c True
d True
e True
f True
dtype: bool
and comparing those booleans to 10 is False everywhere:
df.any() >= 10
a False
b False
c False
d False
e False
f False
dtype: bool
Because this mask is aligned to the columns rather than the rows, .loc cannot align it with the row index, which is exactly what the IndexingError reports.
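By contrast, (df >= 10).any(axis=1) reduces across the columns of each row, so the mask is aligned with the row index. A quick check, reusing the df from the question:
(df >= 10).any(axis=1)
1 False
2 True
3 True
4 False
5 False
dtype: bool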
I assume you want to check whether any value in a row is >= 10. That is done with (df >= 10).any(axis=1).
You should be able to do this in one step, using np.where:
import numpy as np
df['f'] = np.where((df >= 10).any(axis=1), df.e + 10, '')
df
a b c d e f
1 1 5 2 8 2
2 2 4 4 20 5 15
3 3 3 1 20 2 12
4 4 2 2 1 0
5 5 1 4 -5 -4
If you'd prefer NaNs instead of blanks (which also keeps column f numeric rather than string), use:
df['f'] = np.where((df >= 10).any(axis=1), df.e + 10, np.nan)
df
a b c d e f
1 1 5 2 8 2 NaN
2 2 4 4 20 5 15.0
3 3 3 1 20 2 12.0
4 4 2 2 1 0 NaN
5 5 1 4 -5 -4 NaN

By using max:
df['f'] = ""
df.loc[df.max(axis=1, numeric_only=True) >= 10, 'f'] = df.e + 10
Out[330]:
a b c d e f
1 1 5 2 8 2
2 2 4 4 20 5 15
3 3 3 1 20 2 12
4 4 2 2 1 0
5 5 1 4 -5 -4
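The two masks are equivalent: a row contains a value >= 10 exactly when its row maximum is >= 10. A quick sanity check (the explicit column subset is mine, to skip the string column f):
num = df[['a', 'b', 'c', 'd', 'e']]
assert (num >= 10).any(axis=1).equals(num.max(axis=1) >= 10)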

Related

Pandas add new column with CumSum of two columns, restart with new value in other column

I have the following df:
A B C
1 10 2
1 15 0
2 5 2
2 5 0
I add column D through:
df["D"] = (df.B - df.C).cumsum()
A B C D
1 10 2 8
1 15 0 23
2 5 2 26
2 5 0 31
I want the cumsum to restart in row 3 where the value in column A is different from the value in row 2.
Desired output:
A B C D
1 10 2 8
1 15 0 23
2 5 2 3
2 5 0 8
Try with
df['new'] = (df.B-df.C).groupby(df.A).cumsum()
Out[343]:
0 8
1 23
2 3
3 8
dtype: int64
Use groupby and cumsum
df['D'] = df.assign(D=df['B']-df['C']).groupby('A')['D'].cumsum()
A B C D
0 1 10 2 8
1 1 15 0 23
2 2 5 2 3
3 2 5 0 8
import pandas as pd
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 15, 5, 5], "C": [2, 0, 2, 0]})
df['D'] = df['B'] - df['C']
df = df.groupby('A').cumsum()
print(df)
output (note that groupby('A').cumsum() cumulates every remaining column, not just D, and drops the grouping column A):
B C D
0 10 2 8
1 25 2 23
2 5 2 3
3 10 2 8
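One caveat on all of the above: groupby(df.A) groups by the value of A, not by consecutive runs. If the same A value could reappear later and the cumsum should still restart, group on run labels instead; a sketch along the same lines as the answers above:
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 15, 5, 5], "C": [2, 0, 2, 0]})
# Label each consecutive run of equal A values: here 1, 1, 2, 2.
runs = df["A"].ne(df["A"].shift()).cumsum()
# Cumulate only the B - C difference within each run; A, B, C stay untouched.
df["D"] = (df["B"] - df["C"]).groupby(runs).cumsum()
print(df)   # D becomes 8, 23, 3, 8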

Cumulative count of values with grouping using Pandas

I have the following DataFrame:
>>> df = pd.DataFrame(data={
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'value': [0, 2, 3, 4, 0, 3, 2, 3, 0]})
>>> df
type value
0 A 0
1 A 2
2 A 3
3 B 4
4 B 0
5 B 3
6 C 2
7 C 3
8 C 0
What I need to accomplish is the following: for each type, trace the cumulative count of non-zero values but starting from zero each time a 0-value is encountered.
type value cumcount
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN
The idea is to create consecutive groups, keep only the non-zero rows, and finally assign the cumulative count to a new column through the same filter mask:
m = df['value'].eq(0)
g = m.ne(m.shift()).cumsum()[~m]
df.loc[~m, 'new'] = df.groupby(['type',g]).cumcount().add(1)
print (df)
type value new
0 A 0 NaN
1 A 2 1.0
2 A 3 2.0
3 B 4 1.0
4 B 0 NaN
5 B 3 1.0
6 C 2 1.0
7 C 3 2.0
8 C 0 NaN
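To see the intermediate pieces of the trick, a quick inspection sketch (m and g as in the answer):
m = df['value'].eq(0)
print(m.ne(m.shift()).cumsum().tolist())   # [1, 2, 2, 2, 3, 4, 4, 4, 5]
Each flip between zero and non-zero starts a new group id; the answer then keeps only the non-zero rows with [~m] and numbers the rows inside each ('type', group) pair with cumcount.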
For pandas 0.24+ it is possible to use the nullable integer data type:
df['new'] = df['new'].astype('Int64')
print (df)
type value new
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN

use meshgrid for rows with common values in column

my dataframes:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]),columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]),columns=['a', 'b', 'c'])
df1,df2:
a b c
0 1 2 3
1 4 2 3
2 7 8 8
a b c
0 1 2 3
1 4 2 3
2 5 8 8
I want to combine the values of column a from both df's in all combinations, but only where the values in columns b and c are equal.
Right now I only have a solution for all combinations in general, with this code:
x = np.array(np.meshgrid(df1.a.values, df2.a.values)).T.reshape(-1, 2)
df = pd.DataFrame(x)
print(df)
0 1
0 1 1
1 1 4
2 1 5
3 4 1
4 4 4
5 4 5
6 7 1
7 7 4
8 7 5
expected output for df1.a and df2.a only for rows where df1.b==df2.b and df1.c==df2.c:
0 1
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
So basically I need to match the rows on the selected columns b and c.
You should try DataFrame.merge using an inner merge (the default):
df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
a_x a_y
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
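Why this works: an inner merge on ['b', 'c'] pairs every matching row of df1 with every matching row of df2, which is exactly the per-group Cartesian product that meshgrid produced globally. The a_x/a_y names are pandas' default suffixes for the overlapping column a; a small sketch to mimic the meshgrid-style output:
out = df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
out.columns = [0, 1]   # match the column names of the meshgrid result
print(out)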

How to reassign the value of a column that has repeated values if it exists for some value?

I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'codes': [1, 2, 3, 4, 1, 2, 1, 2, 1, 2], 'results': ['a', 'b', 'c', 'd', None, None, None, None, None, None]})
I need to produce the following:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
It is guaranteed that if results is not None for some value of codes, it is unique: the same code will never appear with two different results.
You can do it with merge:
df[['codes']].reset_index().merge(df.dropna()).set_index('index').sort_index()
Out[571]:
codes results
index
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or map
df['results']=df.codes.map(df.set_index('codes').dropna()['results'])
df
Out[574]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or groupby + ffill
df['results']=df.groupby('codes').results.ffill()
df
Out[577]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or reindex | .loc
df.set_index('codes').dropna().reindex(df.codes).reset_index()
Out[589]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
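All of these rely on the same codes-to-results lookup table; materializing it explicitly may make the logic clearer (a quick sketch):
lookup = df.dropna().set_index('codes')['results']
print(lookup.to_dict())   # {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
df['results'] = df['codes'].map(lookup)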

Pandas filter columns of a DataFrame with bool

For a DataFrame (df) with multiple columns and rows
A B C D
0 1 4 2 6
1 2 5 7 4
2 3 6 5 6
and another DataFrame (dfBool) containing dtype: bool
0 True
1 False
2 False
3 True
What is the easiest way to split this DataFrame column-wise into two different DataFrames using dfBool, so you get the desired output
A D
0 1 6
1 2 4
2 3 6
B C
0 4 2
1 5 7
2 6 5
I cannot understand, in my limited experience, why dfTrue = df[dfBool.transpose() == True] does not work.
I would like to expand on EdChum's comment, because if dfBool is a DataFrame, you have to select a column first:
import pandas as pd
df = pd.DataFrame({'D': {0: 6, 1: 4, 2: 6},
'A': {0: 1, 1: 2, 2: 3},
'C': {0: 2, 1: 7, 2: 5},
'B': {0: 4, 1: 5, 2: 6}})
print (df)
A B C D
0 1 4 2 6
1 2 5 7 4
2 3 6 5 6
dfBool = pd.DataFrame({'a':[True, False, False, True]})
print (dfBool)
a
0 True
1 False
2 False
3 True
#select first column in dfBool
df2 = (dfBool.iloc[:,0])
#or select column a in dfBool
#df2 = (dfBool.a)
print (df2)
0 True
1 False
2 False
3 True
Name: a, dtype: bool
print (df[df.columns[df2]])
A D
0 1 6
1 2 4
2 3 6
print (df[df.columns[~df2]])
B C
0 4 2
1 5 7
2 6 5
Another very nice solution from ayhan, thank you:
print (df.loc[:, dfBool.a.values])
A D
0 1 6
1 2 4
2 3 6
print (df.loc[:, ~dfBool.a.values])
B C
0 4 2
1 5 7
2 6 5
But if dfBool is a Series, the solution works directly:
dfBool = pd.Series([True, False, False, True])
print (dfBool)
0 True
1 False
2 False
3 True
dtype: bool
print (df[df.columns[dfBool]])
A D
0 1 6
1 2 4
2 3 6
print (df[df.columns[~dfBool]])
B C
0 4 2
1 5 7
2 6 5
And for Series:
print (df.loc[:, dfBool.values])
A D
0 1 6
1 2 4
2 3 6
print (df.loc[:, ~dfBool.values])
B C
0 4 2
1 5 7
2 6 5
Timings:
In [277]: %timeit (df[df.columns[dfBool.a]])
1000 loops, best of 3: 769 µs per loop
In [278]: %timeit (df.loc[:, dfBool1.a.values])
The slowest run took 9.08 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 380 µs per loop
In [279]: %timeit (df.transpose()[dfBool1.a.values].transpose())
The slowest run took 5.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 550 µs per loop
Code for timings:
import pandas as pd
df = pd.DataFrame({'D': {0: 6, 1: 4, 2: 6},
'A': {0: 1, 1: 2, 2: 3},
'C': {0: 2, 1: 7, 2: 5},
'B': {0: 4, 1: 5, 2: 6}})
print (df)
df = pd.concat([df]*1000, axis=1).reset_index(drop=True)
dfBool = pd.DataFrame({'a': [True, False, False, True]})
dfBool1 = pd.concat([dfBool]*1000).reset_index(drop=True)
The output is a little different, since the timing setup concatenates 1000 copies of the columns:
print (df[df.columns[dfBool.a]])
A A A A A A A A A A ... D D D D D D D D D D
0 1 1 1 1 1 1 1 1 1 1 ... 6 6 6 6 6 6 6 6 6 6
1 2 2 2 2 2 2 2 2 2 2 ... 4 4 4 4 4 4 4 4 4 4
2 3 3 3 3 3 3 3 3 3 3 ... 6 6 6 6 6 6 6 6 6 6
[3 rows x 2000 columns]
print (df.loc[:, dfBool1.a.values])
A D A D A D A D A D ... A D A D A D A D A D
0 1 6 1 6 1 6 1 6 1 6 ... 1 6 1 6 1 6 1 6 1 6
1 2 4 2 4 2 4 2 4 2 4 ... 2 4 2 4 2 4 2 4 2 4
2 3 6 3 6 3 6 3 6 3 6 ... 3 6 3 6 3 6 3 6 3 6
[3 rows x 2000 columns]
print (df.transpose()[dfBool1.a.values].transpose())
A D A D A D A D A D ... A D A D A D A D A D
0 1 6 1 6 1 6 1 6 1 6 ... 1 6 1 6 1 6 1 6 1 6
1 2 4 2 4 2 4 2 4 2 4 ... 2 4 2 4 2 4 2 4 2 4
2 3 6 3 6 3 6 3 6 3 6 ... 3 6 3 6 3 6 3 6 3 6
[3 rows x 2000 columns]
Maybe something like the following?
import pandas as pd
totalDF = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [2, 7, 5], 'D': [6, 4, 8]})
dfBool = pd.DataFrame(data=[True, False, False, True])
totalDF.transpose()[dfBool.values.ravel()].transpose()
A D
0 1 6
1 2 4
2 3 8
totalDF.transpose()[~dfBool.values.ravel()].transpose()
B C
0 4 2
1 5 7
2 6 5
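One more variant, a minimal sketch: give the boolean Series the column labels as its index, and .loc will align it by label instead of position (attaching the labeled index is my addition):
mask = pd.Series([True, False, False, True], index=totalDF.columns)
print(totalDF.loc[:, mask])    # columns A, D
print(totalDF.loc[:, ~mask])   # columns B, C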
