Perform logical operations on every column of a pandas dataframe?

I'm trying to create a new df column based on a condition that is checked against all the other columns in each row.
df = pd.DataFrame([[1, 5, 2, 8, 2], [2, 4, 4, 20, 5], [3, 3, 1, 20, 2], [4, 2, 2, 1, 0],
                   [5, 1, 4, -5, -4]],
                  columns=['a', 'b', 'c', 'd', 'e'],
                  index=[1, 2, 3, 4, 5])
I tried:
df['f'] = ""
df.loc[(df.any() >= 10), 'f'] = df['e'] + 10
However I get:
IndexingError: Unalignable boolean Series key provided
This is the desired output:
   a  b  c   d   e   f
1  1  5  2   8   2
2  2  4  4  20   5  15
3  3  3  1  20   2  12
4  4  2  2   1   0
5  5  1  4  -5  -4

Use
In [984]: df.loc[(df >= 10).any(axis=1), 'f'] = df['e'] + 10
In [985]: df
Out[985]:
a b c d e f
1 1 5 2 8 2 NaN
2 2 4 4 20 5 15.0
3 3 3 1 20 2 12.0
4 4 2 2 1 0 NaN
5 5 1 4 -5 -4 NaN

Note that:
df.any()
a True
b True
c True
d True
e True
f True
dtype: bool
df.any() >= 10
a False
b False
c False
d False
e False
f False
dtype: bool
I assume you want to check whether any value in a row is >= 10. That would be done with (df >= 10).any(axis=1).
You should be able to do this in one step, using np.where:
df['f'] = np.where((df >= 10).any(axis=1), df.e + 10, '')
df
a b c d e f
1 1 5 2 8 2
2 2 4 4 20 5 15
3 3 3 1 20 2 12
4 4 2 2 1 0
5 5 1 4 -5 -4
If you'd prefer NaNs instead of blanks, which also keeps the column numeric, use:
df['f'] = np.where((df >= 10).any(axis=1), df.e + 10, np.nan)
df
a b c d e f
1 1 5 2 8 2 NaN
2 2 4 4 20 5 15.0
3 3 3 1 20 2 12.0
4 4 2 2 1 0 NaN
5 5 1 4 -5 -4 NaN
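
To see why the original attempt fails, it may help to compare the two masks side by side. This is only a minimal sketch that reuses the question's frame; the names col_mask and row_mask are introduced here for illustration and do not appear in the answers above.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 5, 2, 8, 2], [2, 4, 4, 20, 5], [3, 3, 1, 20, 2], [4, 2, 2, 1, 0],
     [5, 1, 4, -5, -4]],
    columns=['a', 'b', 'c', 'd', 'e'],
    index=[1, 2, 3, 4, 5],
)

# df.any() reduces each column to a single bool, so the result is indexed
# by column label and cannot be aligned with the row index inside .loc.
col_mask = df.any() >= 10

# Comparing first and then reducing across columns gives one bool per row.
row_mask = (df >= 10).any(axis=1)

# Series.where keeps values where the mask is True and puts NaN elsewhere.
df['f'] = (df['e'] + 10).where(row_mask)
print(df)
```

Using Series.where here is a design choice of this sketch, not something the answers prescribe; it avoids creating the column first and filling it afterwards.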

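By using max (numeric_only=True keeps the string column f out of the row-wise maximum)
df['f'] = ""
df.loc[df.max(axis=1, numeric_only=True) >= 10, 'f'] = df.e + 10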
Out[330]:
a b c d e f
1 1 5 2 8 2
2 2 4 4 20 5 15
3 3 3 1 20 2 12
4 4 2 2 1 0
5 5 1 4 -5 -4

Related

Pandas add new column with CumSum of two columns, restart with new value in other column

I have the following df:
A B C
1 10 2
1 15 0
2 5 2
2 5 0
I add column D through:
df["D"] = (df.B - df.C).cumsum()
A B C D
1 10 2 8
1 15 0 23
2 5 2 26
2 5 0 31
I want the cumsum to restart in row 3 where the value in column A is different from the value in row 2.
Desired output:
A B C D
1 10 2 8
1 15 0 23
2 5 2 3
2 5 0 8

Try with
df['new'] = (df.B-df.C).groupby(df.A).cumsum()
Out[343]:
0 8
1 23
2 3
3 8
dtype: int64

Use groupby and cumsum
df['D'] = df.assign(D=df['B']-df['C']).groupby('A')['D'].cumsum()
A B C D
0 1 10 2 8
1 1 15 0 23
2 2 5 2 3
3 2 5 0 8

import pandas as pd
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 15, 5, 5], "C": [2, 0, 2, 0]})
df['D'] = df['B'] - df['C']
# cumulate only D within each A, leaving A, B and C untouched
df['D'] = df.groupby('A')['D'].cumsum()
print(df)
output:
   A   B  C   D
0  1  10  2   8
1  1  15  0  23
2  2   5  2   3
3  2   5  0   8
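
One point these answers leave implicit: grouping by A accumulates all rows that share a value of A, whether or not they are adjacent. If the sum should instead restart every time A changes from one row to the next, a block id can be used as the grouping key. The sketch below is only an illustration; the fifth row and the column names by_value and by_block are hypothetical additions, not part of the question.

```python
import pandas as pd

# The last row (A == 1 again) is a hypothetical addition to show the difference.
df = pd.DataFrame({"A": [1, 1, 2, 2, 1],
                   "B": [10, 15, 5, 5, 4],
                   "C": [2, 0, 2, 0, 1]})
diff = df["B"] - df["C"]

# Grouping by the value of A: rows 0, 1 and 4 end up in the same running sum.
df["by_value"] = diff.groupby(df["A"]).cumsum()

# A new block id starts whenever A differs from the previous row.
block = df["A"].ne(df["A"].shift()).cumsum()
df["by_block"] = diff.groupby(block).cumsum()
print(df)
```

For the data in the question both variants give the same result, because each value of A occupies a single contiguous block.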

Cumulative count of values with grouping using Pandas

I have the following DataFrame:
>>> df = pd.DataFrame(data={
...     'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
...     'value': [0, 2, 3, 4, 0, 3, 2, 3, 0]})
>>> df
type value
0 A 0
1 A 2
2 A 3
3 B 4
4 B 0
5 B 3
6 C 2
7 C 3
8 C 0
What I need to accomplish is the following: for each type, trace the cumulative count of non-zero values but starting from zero each time a 0-value is encountered.
type value cumcount
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN

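The idea is to label consecutive runs of zero and non-zero values, keep only the labels of the non-zero rows, and then assign the cumulative count within each (type, run) group to a new column for those rows: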
m = df['value'].eq(0)
g = m.ne(m.shift()).cumsum()[~m]
df.loc[~m, 'new'] = df.groupby(['type',g]).cumcount().add(1)
print (df)
type value new
0 A 0 NaN
1 A 2 1.0
2 A 3 2.0
3 B 4 1.0
4 B 0 NaN
5 B 3 1.0
6 C 2 1.0
7 C 3 2.0
8 C 0 NaN
With pandas 0.24+ you can use the nullable integer data type:
df['new'] = df['new'].astype('Int64')
print (df)
type value new
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN
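
The intermediate values make the approach easier to follow. The sketch below recomputes the same result step by step; the names run_id and counts are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [0, 2, 3, 4, 0, 3, 2, 3, 0]})

# True on the rows that should reset the count.
m = df['value'].eq(0)

# The id increases every time m flips, so each run of zeros or of
# non-zeros gets its own label.
run_id = m.ne(m.shift()).cumsum()

# Count within each (type, run) pair, using the non-zero rows only.
nonzero = df[~m]
counts = nonzero.groupby(['type', run_id[~m]]).cumcount() + 1

# Reindex back to the full frame; rows with value 0 become missing.
df['cumcount'] = counts.reindex(df.index).astype('Int64')
print(df)
```

Grouping by type as well as by the run id is what restarts the count at row 3, where the type changes from A to B without a zero in between.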

use meshgrid for rows with common values in column

my dataframes:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]),columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]),columns=['a', 'b', 'c'])
df1,df2:
a b c
0 1 2 3
1 4 2 3
2 7 8 8
a b c
0 1 2 3
1 4 2 3
2 5 8 8
I want to pair up the values of column a from both dataframes in every combination, but only for rows where the values in columns b and c are equal.
Right now I only have a solution that builds all combinations, with this code:
x = np.array(np.meshgrid(df1.a.values,
df2.a.values)).T.reshape(-1,2)
df = pd.DataFrame(x)
print(df)
0 1
0 1 1
1 1 4
2 1 5
3 4 1
4 4 4
5 4 5
6 7 1
7 7 4
8 7 5
expected output for df1.a and df2.a only for rows where df1.b==df2.b and df1.c==df2.c:
0 1
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
So basically I need to group by common rows in the selected columns b and c.

You should try DataFrame.merge using inner merge:
df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
a_x a_y
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
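
An inner merge on b and c pairs every row of df1 with every row of df2 that has the same (b, c) key, which is the per-key cross product the question asks for. A minimal runnable sketch follows; the suffixes '_df1' and '_df2' and the name pairs are choices made here for readability, not part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 2, 3], [7, 8, 8]], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([[1, 2, 3], [4, 2, 3], [5, 8, 8]], columns=['a', 'b', 'c'])

# Rows are matched on (b, c); the overlapping column 'a' gets the suffixes.
pairs = df1.merge(df2, on=['b', 'c'], how='inner', suffixes=('_df1', '_df2'))
pairs = pairs[['a_df1', 'a_df2']]
print(pairs)
```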

How to reassign the value of a column that has repeated values if it exist for some value?

I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'codes': [1, 2, 3, 4, 1, 2, 1, 2, 1, 2], 'results': ['a', 'b', 'c', 'd', None, None, None, None, None, None]})
I need to produce the following:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
It is guaranteed that if the value of results is not None for a value in codes, it will be unique. I mean there won't be two rows with the same code but different results.

You can do it with merge
df[['codes']].reset_index().merge(df.dropna()).set_index('index').sort_index()
Out[571]:
codes results
index
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or map
df['results']=df.codes.map(df.set_index('codes').dropna()['results'])
df
Out[574]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or groupby + ffill
df['results']=df.groupby('codes').results.ffill()
df
Out[577]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or reindex | .loc
df.set_index('codes').dropna().reindex(df.codes).reset_index()
Out[589]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
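
The map variant relies on the lookup Series having a unique index. Should a code ever appear more than once with a non-null result, dropping duplicate codes first keeps the lookup valid. This is a sketch under that assumption; the name lookup is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'codes': [1, 2, 3, 4, 1, 2, 1, 2, 1, 2],
                   'results': ['a', 'b', 'c', 'd', None, None, None, None, None, None]})

# One known result per code: drop missing results, then repeated codes.
lookup = (df.dropna(subset=['results'])
            .drop_duplicates('codes')
            .set_index('codes')['results'])

# Fill only the missing entries; existing results are left as they are.
df['results'] = df['results'].fillna(df['codes'].map(lookup))
print(df)
```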

Pandas filter columns of a DataFrame with bool

For a DataFrame (df) with multiple columns and rows
A B C D
0 1 4 2 6
1 2 5 7 4
2 3 6 5 6
and another DataFrame (dfBool) containing dtype: bool
0 True
1 False
2 False
3 True
What is the easiest way to split this DataFrame by columns into two different DataFrames, by transposing dfBool, so that you get the desired output?
A D
0 1 6
1 2 4
2 3 6
B C
0 4 2
1 5 7
2 6 5
With my limited experience, I cannot understand why dfTrue = df[dfBool.transpose() == True] does not work.

I would like to modify EdChum's comment, because if dfBool is a DataFrame, you first have to select a column:
import pandas as pd
df = pd.DataFrame({'D': {0: 6, 1: 4, 2: 6},
'A': {0: 1, 1: 2, 2: 3},
'C': {0: 2, 1: 7, 2: 5},
'B': {0: 4, 1: 5, 2: 6}})
print (df)
A B C D
0 1 4 2 6
1 2 5 7 4
2 3 6 5 6
dfBool = pd.DataFrame({'a':[True, False, False, True]})
print (dfBool)
a
0 True
1 False
2 False
3 True
#select first column in dfBool
df2 = (dfBool.iloc[:,0])
#or select column a in dfBool
#df2 = (dfBool.a)
print (df2)
0 True
1 False
2 False
3 True
Name: a, dtype: bool
print (df[df.columns[df2]])
A D
0 1 6
1 2 4
2 3 6
print (df[df.columns[~df2]])
B C
0 4 2
1 5 7
2 6 5
Another very nice solution from ayhan, thank you:
print (df.loc[:, dfBool.a.values])
A D
0 1 6
1 2 4
2 3 6
print (df.loc[:, ~dfBool.a.values])
B C
0 4 2
1 5 7
2 6 5
But if dfBool is a Series, the solution works as is:
dfBool = pd.Series([True, False, False, True])
print (dfBool)
0 True
1 False
2 False
3 True
dtype: bool
print (df[df.columns[dfBool]])
A D
0 1 6
1 2 4
2 3 6
print (df[df.columns[~dfBool]])
B C
0 4 2
1 5 7
2 6 5
And for Series:
print (df.loc[:, dfBool.values])
A D
0 1 6
1 2 4
2 3 6
print (df.loc[:, ~dfBool.values])
B C
0 4 2
1 5 7
2 6 5
Timings:
In [277]: %timeit (df[df.columns[dfBool.a]])
1000 loops, best of 3: 769 µs per loop
In [278]: %timeit (df.loc[:, dfBool1.a.values])
The slowest run took 9.08 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 380 µs per loop
In [279]: %timeit (df.transpose()[dfBool1.a.values].transpose())
The slowest run took 5.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 550 µs per loop
Code for timings:
import pandas as pd
df = pd.DataFrame({'D': {0: 6, 1: 4, 2: 6},
'A': {0: 1, 1: 2, 2: 3},
'C': {0: 2, 1: 7, 2: 5},
'B': {0: 4, 1: 5, 2: 6}})
print (df)
df = pd.concat([df]*1000, axis=1).reset_index(drop=True)
dfBool = pd.DataFrame({'a': [True, False, False, True]})
dfBool1 = pd.concat([dfBool]*1000).reset_index(drop=True)
The output is a little different:
print (df[df.columns[dfBool.a]])
A A A A A A A A A A ... D D D D D D D D D D
0 1 1 1 1 1 1 1 1 1 1 ... 6 6 6 6 6 6 6 6 6 6
1 2 2 2 2 2 2 2 2 2 2 ... 4 4 4 4 4 4 4 4 4 4
2 3 3 3 3 3 3 3 3 3 3 ... 6 6 6 6 6 6 6 6 6 6
[3 rows x 2000 columns]
print (df.loc[:, dfBool1.a.values])
A D A D A D A D A D ... A D A D A D A D A D
0 1 6 1 6 1 6 1 6 1 6 ... 1 6 1 6 1 6 1 6 1 6
1 2 4 2 4 2 4 2 4 2 4 ... 2 4 2 4 2 4 2 4 2 4
2 3 6 3 6 3 6 3 6 3 6 ... 3 6 3 6 3 6 3 6 3 6
[3 rows x 2000 columns]
print (df.transpose()[dfBool1.a.values].transpose())
A D A D A D A D A D ... A D A D A D A D A D
0 1 6 1 6 1 6 1 6 1 6 ... 1 6 1 6 1 6 1 6 1 6
1 2 4 2 4 2 4 2 4 2 4 ... 2 4 2 4 2 4 2 4 2 4
2 3 6 3 6 3 6 3 6 3 6 ... 3 6 3 6 3 6 3 6 3 6
[3 rows x 2000 columns]
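
As for why the attempt in the question does not work: indexing a DataFrame with a boolean DataFrame is treated as an element-wise mask aligned on row and column labels, not as a column selector. Passing a plain one-dimensional boolean array to .loc avoids label alignment altogether. A minimal sketch, where mask, df_true and df_false are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [2, 7, 5], 'D': [6, 4, 6]})
dfBool = pd.DataFrame({'a': [True, False, False, True]})

# One bool per column of df, taken positionally from the first column of dfBool.
mask = np.asarray(dfBool.iloc[:, 0], dtype=bool)
if len(mask) != df.shape[1]:
    raise ValueError("mask needs exactly one entry per column of df")

df_true = df.loc[:, mask]    # columns where the mask is True
df_false = df.loc[:, ~mask]  # the remaining columns
print(df_true)
print(df_false)
```

The explicit length check is a precaution added in this sketch, so that a mask of the wrong size fails with a clear message.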

Maybe something like the following ?
import pandas as pd
totalDF = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [2, 7, 5], 'D': [6, 4, 8]})
dfBool = pd.DataFrame(data=[True, False, False, True])
totalDF.transpose()[dfBool.values].transpose()
A D
0 1 6
1 2 4
2 3 8
totalDF.transpose()[~dfBool.values].transpose()
B C
0 4 2
1 5 7
2 6 5
