Dynamic comparison with pandas AND - python

I have a dictionary whose keys are the columns of a dataframe, like:
dict = {"colA":1,"colB":1,"colC":1}
with colA, colB, colC being the columns of my dataframe.
I would like to do something like:
df.loc[(df["colA"] <= dict["colA"]) & (df["colB"] <= dict["colB"]) & (df["colC"] <= dict["colC"])]
but dynamically (I don't know the length of the dict / the number of columns).
Is there a way to apply & to a dynamic number of arguments?

You can use:
import numpy as np
import pandas as pd
from functools import reduce

df = pd.DataFrame({'colA':[1,2,0],
                   'colB':[0,5,6],
                   'colC':[1,8,9]})
print (df)
   colA  colB  colC
0     1     0     1
1     2     5     8
2     0     6     9
d = {"colA":1,"colB":1,"colC":1}
a = df[(df["colA"] <= d["colA"]) & (df["colB"] <= d["colB"]) & (df["colC"] <= d["colC"])]
print (a)
   colA  colB  colC
0     1     0     1
Another solution: create a Series from the dict, compare with le, check that every value per row is True with all, and finally use boolean indexing:
d = {"colA":1,"colB":1,"colC":1}
s = pd.Series(d)
print (s)
colA    1
colB    1
colC    1
dtype: int64
print (df.le(s).all(axis=1))
0     True
1    False
2    False
dtype: bool
print (df[df.le(s).all(axis=1)])
   colA  colB  colC
0     1     0     1
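Note that df.le(s) aligns the Series on all columns of df, so this assumes the dict has a key for every column; if it covered only a subset, the missing comparisons would evaluate to False and drop every row. A small variant that guards against that by selecting the keyed columns first:
d = {"colA":1,"colB":1}
print (df[df[list(d)].le(pd.Series(d)).all(axis=1)])
   colA  colB  colC
0     1     0     1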
Another solution uses numpy.logical_and with reduce to build the mask, and a list comprehension to apply each condition:
print ([df[x] <= d[x] for x in df.columns])
[0     True
1    False
2     True
Name: colA, dtype: bool, 0     True
1    False
2    False
Name: colB, dtype: bool, 0     True
1    False
2    False
Name: colC, dtype: bool]
mask = reduce(np.logical_and, [df[x] <= d[x] for x in df.columns])
print (mask)
0     True
1    False
2    False
Name: colA, dtype: bool
print (df[mask])
   colA  colB  colC
0     1     0     1
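The same mask can also be built with numpy.logical_and.reduce, iterating over the dict itself rather than df.columns (a sketch; it assumes every key of d is a column of df):
mask = np.logical_and.reduce([df[c] <= v for c, v in d.items()])
print (df[mask])
   colA  colB  colC
0     1     0     1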

Here is one SQL-like solution, which uses the .query() method:
Data:
In [23]: df
Out[23]:
   colA  colB  colC
0     2     2     5
1     3     0     8
2     5     9     2
3     3     0     2
4     9     1     3
5     7     5     6
6     7     8     0
7     0     4     1
8     8     2     6
9     9     6     7
Solution:
In [20]: dct = {"colA":4,"colB":4,"colC":4}
In [21]: qry = ' and '.join(('{0[0]} <= {0[1]}'.format(tup) for tup in dct.items()))
In [22]: qry
Out[22]: 'colB <= 4 and colA <= 4 and colC <= 4'
In [24]: df.query(qry)
Out[24]:
   colA  colB  colC
3     3     0     2
7     0     4     1
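.query() expects the column names to work inside the expression, so names containing spaces need backtick quoting (supported since pandas 0.25). A slightly more defensive way to build the same string, sketched with f-strings:
qry = ' and '.join(f'`{col}` <= {val}' for col, val in dct.items())
Values are interpolated as literals here, so this assumes they are plain numbers; for arbitrary values, @-references inside .query() are safer.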

Related

Change cell value according to values within another column [pandas]

I have a dataframe such as:
Names  Value  COLA  COLB  COLC
A      100    0     4     1
B      NaN    0     2     1
C      20     3     0     0
D      1      0     1     0
E      300    3     0     0
And I would like to change all the COLA, COLB and COLC values (except the 0s):
to 1 if the Value col is > 30
to 2 if the Value col is <= 30 or NaN.
I should then get:
Names  Value  COLA  COLB  COLC
A      100    0     1     1
B      NaN    0     2     2
C      20     2     0     0
D      1      0     2     0
E      300    1     0     0
Does someone have a suggestion?
Use numpy.where with the condition reshaped for broadcasting, assign the result to multiple columns at once, and multiply by a boolean mask so the original 0 values stay 0:
cols = ['COLA','COLB','COLC']
df[cols] = np.where(df['Value'].gt(30).to_numpy()[:, None], 1, 2) * df[cols].ne(0)
print (df)
  Names  Value  COLA  COLB  COLC
0     A  100.0     0     1     1
1     B    NaN     0     2     2
2     C   20.0     2     0     0
3     D    1.0     0     2     0
4     E  300.0     1     0     0
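For reference, a self-contained version of the snippet above, with the sample frame rebuilt from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Names': list('ABCDE'),
                   'Value': [100, np.nan, 20, 1, 300],
                   'COLA': [0, 0, 3, 0, 3],
                   'COLB': [4, 2, 0, 1, 0],
                   'COLC': [1, 1, 0, 0, 0]})
cols = ['COLA','COLB','COLC']
# Value > 30 maps to 1, anything else (including NaN, which compares False) to 2;
# multiplying by the ne(0) mask puts the original zeros back
df[cols] = np.where(df['Value'].gt(30).to_numpy()[:, None], 1, 2) * df[cols].ne(0)
print (df)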

Select rows where two or more columns are bigger than 0 in pandas

I am working with a dataframe in pandas. My dataframe has 55 columns and 70,000 rows.
How can I select the rows where two or more values are bigger than 0?
It now looks like this:
   A  B  C  D  E
a  0  2  0  8  0
b  3  0  0  0  0
c  6  2  5  0  0
And I would like to make this:
   A  B  C  D  E      F
a  0  2  0  8  0   true
b  3  0  0  0  0  false
c  6  2  5  0  0   true
I have tried converting it to just 0s and 1s and summing that, like so:
df[df > 0] = 1
df[(df > 0).sum(axis=1) >= 2]
But then I lose all the other info in the dataframe, and I still want to be able to see the original values.
Try assigning to a column like this:
>>> df['F'] = df.gt(0).sum(axis=1).ge(2)
>>> df
   A  B  C  D  E      F
a  0  2  0  8  0   True
b  3  0  0  0  0  False
c  6  2  5  0  0   True
Or try with astype(bool):
>>> df['F'] = df.astype(bool).sum(axis=1).ge(2)
>>> df
   A  B  C  D  E      F
a  0  2  0  8  0   True
b  3  0  0  0  0  False
c  6  2  5  0  0   True
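One caveat: the two variants differ if negative values can occur, since gt(0) counts only positives while astype(bool) counts anything nonzero. With an illustrative one-row frame:
>>> row = pd.DataFrame({'x': [-1], 'y': [0], 'z': [0]})
>>> row.gt(0).sum(axis=1)
0    0
dtype: int64
>>> row.astype(bool).sum(axis=1)
0    1
dtype: int64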
You are close; just assign the mask to a new column:
df['F'] = (df > 0).sum(axis=1) >= 2
Or:
df['F'] = np.count_nonzero(df, axis=1) >= 2
print (df)
   A  B  C  D  E      F
a  0  2  0  8  0   True
b  3  0  0  0  0  False
c  6  2  5  0  0   True
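Because the flag lives in its own column, the original values stay visible; to keep only the qualifying rows (the concern raised in the question), filter on that column:
print (df[df['F']])
   A  B  C  D  E     F
a  0  2  0  8  0  True
c  6  2  5  0  0  True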

Join two Pandas DataFrames on specific column with matching values

I want to join two pandas dataframes on "ColA", but the values in "ColA" are not in the same order in the two dataframes, and the dataframes are not the same length. I want to join them so that the values in "ColA" are matched up and missing values are replaced with 0.
df1 = pd.DataFrame({"ColA":["num 1", "num 2", "num 3"],
"ColB":[5,6,7]})
print(df1)
df2 = pd.DataFrame({"ColA":["num 2", "num 3","num 1", "num 4"],
"ColC":[3,2,1,5]})
print(df2)
ColA ColB
0 num 1 5
1 num 2 6
2 num 3 7
ColA ColC
0 num 2 3
1 num 3 2
2 num 1 1
3 num 4 5
Result should look like this:
# num1 is matched with appropriate values and num4 has the value 0 for "ColB"
    ColA  ColB  ColC
0  num 1     5     1
1  num 2     6     3
2  num 3     7     2
3  num 4     0     5
Use DataFrame.merge with an outer join, replace the resulting NaNs with 0, and finally, if necessary, convert the dtypes back to the originals via a dictionary:
d = pd.concat([df1.dtypes, df2.dtypes]).to_dict()   # Series.append was removed in pandas 2.0
df = df1.merge(df2, how='outer', on='ColA').fillna(0).astype(d)
print (df)
    ColA  ColB  ColC
0  num 1     5     1
1  num 2     6     3
2  num 3     7     2
3  num 4     0     5
Or use concat and then convert all columns to integers (if possible):
df = (pd.concat([df1.set_index('ColA'),
                 df2.set_index('ColA')], axis=1, sort=True)
        .fillna(0)
        .astype(int)
        .rename_axis('ColA')
        .reset_index())
print (df)
    ColA  ColB  ColC
0  num 1     5     1
1  num 2     6     3
2  num 3     7     2
3  num 4     0     5
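On pandas 1.0+ there is also a variant without the manual dtype bookkeeping: convert_dtypes should move the merged columns to nullable integers before the fill (a sketch; note ColA comes out as pandas' string dtype rather than object):
df = (df1.merge(df2, how='outer', on='ColA')
         .convert_dtypes()   # float ColB with NaN -> nullable Int64
         .fillna(0))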

How can I make a dataframe where the count is higher than a specific value?

I have a df that looks like this:
df
   a  b  c  d
0  1  0  0  1
1  1  1  0  1
2  0  1  1  1
3  1  0  0  1
I am trying to get a df keeping only the columns whose count is higher than 2, but can't find a solution for this. It should look like this:
   a  d
0  1  1
1  1  1
2  0  1
3  1  1
If there are only 1 and 0 values, use DataFrame.loc with boolean indexing; the first : selects all rows:
df = df.loc[:, df.sum() > 2]
print (df)
   a  d
0  1  1
1  1  1
2  0  1
3  1  1
Detail:
print (df.sum())
a    3
b    2
c    1
d    4
dtype: int64
print (df.sum() > 2)
a     True
b    False
c    False
d     True
dtype: bool
If other values are possible and you need to count only the 1s:
df = df.loc[:, df.eq(1).sum() > 2]
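For completeness, a self-contained, runnable version of the first solution, with the data taken from the question:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 0, 1],
                   'b': [0, 1, 1, 0],
                   'c': [0, 0, 1, 0],
                   'd': [1, 1, 1, 1]})
# keep only the columns whose column-wise sum is higher than 2
print (df.loc[:, df.sum() > 2])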

How to exclude values from pandas dataframe?

I have two dataframes:
1) customer_id,gender
2) customer_id,...[other fields]
The first dataset is an answer dataset (gender is the answer). So I want to split off from the second dataset those customer_ids that are also in the first dataset (whose gender we know) and call that part 'train'. The rest of the records should become a 'test' dataset.
I think you need boolean indexing with an isin condition; the boolean Series is inverted with ~:
df1 = pd.DataFrame({'customer_id':[1,2,3],
                    'gender':['m','f','m']})
print (df1)
   customer_id gender
0            1      m
1            2      f
2            3      m
df2 = pd.DataFrame({'customer_id':[1,7,5],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,3,5],
                    'E':[5,3,6],
                    'F':[7,4,3]})
print (df2)
   B  C  D  E  F  customer_id
0  4  7  1  5  7            1
1  5  8  3  3  4            7
2  6  9  5  6  3            5
mask = df2.customer_id.isin(df1.customer_id)
print (mask)
0     True
1    False
2    False
Name: customer_id, dtype: bool
print (~mask)
0    False
1     True
2     True
Name: customer_id, dtype: bool
train = df2[mask]
print (train)
   B  C  D  E  F  customer_id
0  4  7  1  5  7            1
test = df2[~mask]
print (test)
   B  C  D  E  F  customer_id
1  5  8  3  3  4            7
2  6  9  5  6  3            5
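Put together, the whole split is two lines (the same logic as above, just condensed):
mask = df2['customer_id'].isin(df1['customer_id'])
train, test = df2[mask], df2[~mask]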
