pandas: multiple conditions while indexing data frame - unexpected behavior - python

I am filtering rows in a dataframe by values in two columns.
For some reason the OR operator behaves as I would expect the AND operator to behave, and vice versa.
My test code:
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5)})
# let's insert some -1 values
df['a'][1] = -1
df['b'][1] = -1
df['a'][3] = -1
df['b'][4] = -1
df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a != -1) | (df.b != -1)]
print(pd.concat([df, df1, df2], axis=1,
                keys=['original df', 'using AND (&)', 'using OR (|)']))
And the result:
  original df     using AND (&)     using OR (|)
            a   b             a   b            a   b
0           0   0             0   0            0   0
1          -1  -1           NaN NaN          NaN NaN
2           2   2             2   2            2   2
3          -1   3           NaN NaN           -1   3
4           4  -1           NaN NaN            4  -1

[5 rows x 6 columns]
As you can see, the AND operator drops every row in which at least one value equals -1. On the other hand, the OR operator requires both values to be equal to -1 to drop them. I would expect exactly the opposite result. Could anyone explain this behavior?
I am using pandas 0.13.1.

As you can see, the AND operator drops every row in which at least one
value equals -1. On the other hand, the OR operator requires both
values to be equal to -1 to drop them.
That's right. Remember that you're writing the condition in terms of what you want to keep, not in terms of what you want to drop. For df1:
df1 = df[(df.a != -1) & (df.b != -1)]
You're saying "keep the rows in which df.a isn't -1 and df.b isn't -1", which is the same as dropping every row in which at least one value is -1.
For df2:
df2 = df[(df.a != -1) | (df.b != -1)]
You're saying "keep the rows in which either df.a or df.b is not -1", which is the same as dropping rows where both values are -1.
PS: chained access like df['a'][1] = -1 can get you into trouble. It's better to get into the habit of using .loc and .iloc.
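For example, a minimal sketch of the .loc equivalent of the chained assignments above (same df, same values):
# label-based, single-step assignments instead of df['a'][1] = -1
df.loc[1, ['a', 'b']] = -1
df.loc[3, 'a'] = -1
df.loc[4, 'b'] = -1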

Late answer, but you can also use query(), i.e. :
df_filtered = df.query('a == 4 & b != 2')
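Applied to the question's filter, the equivalent keep-conditions would be (a sketch, assuming the same df as above):
df1 = df.query('a != -1 & b != -1')  # AND: keep rows where neither value is -1
df2 = df.query('a != -1 | b != -1')  # OR: keep rows where at least one value is not -1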

A little mathematical logic theory here:
"NOT a AND NOT b" is the same as "NOT (a OR b)", so:
"a NOT -1 AND b NOT -1" is equivalent of "NOT (a is -1 OR b is -1)", which is opposite (Complement) of "(a is -1 OR b is -1)".
So if you want exact opposite result, df1 and df2 should be as below:
df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a == -1) | (df.b == -1)]
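As a quick check (a sketch, reusing the df from the question), these two filters now partition the original frame:
pd.concat([df1, df2]).sort_index().equals(df)  # True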

You can try the following:
df1 = df[(df['a'] != -1) & (df['b'] != -1)]

By De Morgan's laws, (i) the negation of a union is the intersection of the negations, and (ii) the negation of an intersection is the union of the negations, i.e.,
NOT (A AND B) <=> NOT A OR NOT B
NOT (A OR B) <=> NOT A AND NOT B
If the aim is to
drop every row in which at least one value equals -1
you can either use AND operator to identify the rows to keep or use OR operator to identify the rows to drop.
# select rows where both a and b values are not equal to -1
df2_0 = df[df['a'].ne(-1) & df['b'].ne(-1)]
# index of rows where at least one of a or b equals -1
idx = df.index[df.eval('a == -1 or b == -1')]
# drop `idx` rows
df2_1 = df.drop(idx)
df2_0.equals(df2_1) # True
On the other hand, if the aim is to
drop every row in which both values equal -1
you do the exact opposite; either use OR operator to identify the rows to keep or use AND operator to identify the rows to drop.
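A sketch of that opposite case, mirroring the code above:
# keep rows where at least one of a, b is not equal to -1
df3_0 = df[df['a'].ne(-1) | df['b'].ne(-1)]
# index of rows where both a and b equal -1
idx = df.index[df.eval('a == -1 and b == -1')]
# drop those rows
df3_1 = df.drop(idx)
df3_0.equals(df3_1)  # True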

Related

Change value based on condition on slice of dataframe

I have a dataframe like this:
df = pd.DataFrame(columns=['Dog', 'Small', 'Adult'])
df.Dog = ['Poodle', 'Shepard', 'Bird dog','St.Bernard']
df.Small = [1,1,0,0]
df.Adult = 0
That will look like this:
          Dog  Small  Adult
0      Poodle      1      0
1     Shepard      1      0
2    Bird dog      0      0
3  St.Bernard      0      0
Then I would like to change one column based on another. I can do that:
df.loc[df.Small == 0, 'Adult'] = 1
However, I just want to do so for the first three rows.
I can select the first three rows:
df.iloc[0:3]
But if I try to change values on the first three rows:
df.iloc[0:3, df.Small == 0, 'Adult'] = 1
I get an error.
I also get an error if I combine the two:
df.iloc[0:3].loc[df.Small == 0, 'Adult'] = 1
It tells me that I am trying to set a value on a copy of a slice.
How should I do this correctly?
You could include the range as another condition in your .loc selection (for the general case, I'll explicitly include the 0):
df.loc[(df.Small == 0) & (0 <= df.index) & (df.index <= 2), 'Adult'] = 1
Another option is to transform the index into a series to use pd.Series.between:
df.loc[(df.Small == 0) & (df.index.to_series().between(0, 2)), 'Adult'] = 1
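Running either line on the example frame flags only the Bird dog row, the one Small == 0 dog among the first three rows:
          Dog  Small  Adult
0      Poodle      1      0
1     Shepard      1      0
2    Bird dog      0      1
3  St.Bernard      0      0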
Adding conditions based on the index works only if the index is already sorted. Alternatively, you can collect the index labels of the matching rows directly (here, the first two matches):
ind = df[df.Small == 0].index[:2]
df.loc[ind, 'Adult'] = 1
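If you'd rather not depend on the index at all, one more variant (my own sketch, not from the answers above) is to slice positionally first and then filter within that slice:
# restrict to the first three rows by position, then filter by condition
sub = df.iloc[0:3]
df.loc[sub.index[sub.Small == 0], 'Adult'] = 1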

Switch boolean output to string in pandas python

When comparing two dataframes and putting the result back into a dataframe:
dfA = pd.DataFrame({'Column':[1,2,3,4]})
or in human readable form:
   Column
0       1
1       2
2       3
3       4
dfB = pd.DataFrame({'Column':[1,2,4,3]})
or in human readable form:
   Column
0       1
1       2
2       4
3       3
pd.DataFrame(dfA > dfB)
pandas outputs a dataframe with true or false values.
   Column
0   False
1   False
2   False
3    True
Is it possible to change the values from True or False to 'lower' or 'higher'?
I want to know whether the outcome is higher, lower or equal; that is why I ask.
If the output is neither higher nor lower, then it must be equal.
You may use map:
In [10]: pd.DataFrame(dfA > dfB)['Column'].map({True: 'higher', False: 'lower'})
Out[10]:
0     lower
1     lower
2     lower
3    higher
Name: Column, dtype: object
I'd recommend np.where for performance/simplicity (this assumes numpy has been imported as np):
pd.DataFrame(np.where(dfA > dfB, 'higher', 'lower'), columns=['col'])
      col
0   lower
1   lower
2   lower
3  higher
You can also nest conditions if needed with np.where:
m1 = dfA > dfB
m2 = dfA < dfB
pd.DataFrame(
    np.where(m1, 'higher', np.where(m2, 'lower', 'equal')),
    columns=['col']
)
Or, follow a slightly different approach with np.select:
pd.DataFrame(
    np.select([m1, m2], ['higher', 'lower'], default='equal'),
    columns=['col']
)
      col
0   equal
1   equal
2   lower
3  higher

Duplicating Pandas Dataframe rows based on string split, without iteration

I have a dataframe with a multi-index, where one of the columns represents multiple values, separated by a "|", like this:
            value
left right
x    a|b        2
y    b|c|d     -1
I want to duplicate the rows based on the "right" column, to get something like this:
            value
left right
x    a          2
x    b          2
y    b         -1
y    c         -1
y    d         -1
The solution I have for this feels wrong and runs slowly, because it's based on iteration:
df2 = df.iloc[:0]
for index, row in df.iterrows():
    stgs = index[1].split("|")
    for s in stgs:
        row.name = (index[0], s)
        df2 = df2.append(row)
Is there a more vectorized way to do this?
Pandas Series have a dedicated method, str.split, to perform this operation.
split works only on a Series, so isolate the column you want:
SO = df['right']
Now 3 steps at once: str.split returns a Series of lists; apply(pd.Series) converts each list into columns; stack stacks the columns into a single column.
S1 = SO.str.split('|').apply(pd.Series).stack()
The only issue is that you now have a multi-index, so just drop the level you don't need:
S1.index = S1.index.droplevel(-1)
Full example:
SO = pd.Series(data=["a|b", "b|c|d"])
S1 = SO.str.split('|').apply(pd.Series).stack()
S1
Out[4]:
0  0    a
   1    b
1  0    b
   1    c
   2    d
S1.index = S1.index.droplevel(-1)
S1
Out[5]:
0    a
0    b
1    b
1    c
1    d
Building upon #xNoK's answer, I am adding here the additional step needed to put the result back into the original DataFrame.
We have this data:
arrays = [['x', 'y'], ['a|b', 'b|c|d']]
midx = pd.MultiIndex.from_arrays(arrays, names=['left', 'right'])
df = pd.DataFrame(index=midx, data=[2, -1], columns=['value'])
df
Out[17]:
            value
left right
x    a|b        2
y    b|c|d     -1
First, let's generate the values for the right index as #xNoK suggested: take the index level we want to work on with index.levels[1], convert it to a Series so that we can apply the str.split() function, and finally stack() it to get the result we want.
new_multi_idx_val = df.index.levels[1].to_series().str.split('|').apply(pd.Series).stack()
new_multi_idx_val
Out[18]:
right
a|b    0    a
       1    b
b|c|d  0    b
       1    c
       2    d
dtype: object
Now we want to put these values into the original DataFrame df. To do that, let's change its shape so that the result generated in the previous step can be copied over.
We can repeat each row (including its index) by the number of |-separated parts in the right level of the multi-index. df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)) gives the number of times each row should be repeated. We pass this to index.repeat() and fetch the values at those indexes to create a new DataFrame df_repeted.
df_repeted = df.loc[df.index.repeat(df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)))]
df_repeted
Out[19]:
            value
left right
x    a|b        2
     a|b        2
y    b|c|d     -1
     b|c|d     -1
     b|c|d     -1
Now the df_repeted DataFrame has a shape where we can change the index to get the answer we want.
Replace the index of df_repeted with the desired values as follows:
df_repeted.index = [df_repeted.index.droplevel(1), new_multi_idx_val]
df_repeted.index.rename(names=['left', 'right'], inplace=True)
df_repeted
Out[20]:
            value
left right
x    a          2
     b          2
y    b         -1
     c         -1
     d         -1
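For reference, on newer pandas versions (0.25+), DataFrame.explode makes this much shorter; a minimal sketch, assuming the same df as above:
out = (
    df.reset_index()
      .assign(right=lambda d: d['right'].str.split('|'))  # turn "a|b" into ['a', 'b']
      .explode('right')                                   # one row per list element
      .set_index(['left', 'right'])
)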

Find the count of -1 in each column

I have a pandas data frame. Some entries are equal to -1. How do I find the number of times -1 appears in each column of the data frame? Based on that count, I am planning to drop the column.
Since you say you want the result for each column separately, you can use a condition like df[column] == -1, and then take .sum() on the result to get the count of -1 values in that column. Example -
(df[column] == -1).sum()
Demo -
In [22]: df
Out[22]:
   A  B  C
0 -1  2 -1
1  3  4  5
2  3  1  4
3 -1  2  1
In [23]: for col in df.columns:
   ....:     print(col, (df[col] == -1).sum())
   ....:
A 2
B 0
C 1
This works because when taking sum(), True is equivalent to 1 and False to 0. The condition df[column] == -1 returns a Series of True/False values: True where the condition is met, False where it is not.
I think you could have tried a few things before asking here, but I might as well post the answer anyway:
(df == -1).sum()
Ironically you can't use the count() method of a DataFrame because that counts all values except for None or nan, and there's no way to change the criterion. It's easier to just use sum than to figure out a way to convert the -1s to Nones.
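Since the stated goal is to drop columns based on that count, a minimal sketch (the threshold of one occurrence is my own, hypothetical choice):
counts = (df == -1).sum()
df_kept = df.drop(columns=counts[counts > 1].index)  # drop columns with more than one -1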

Drop rows if value in a specific column is not an integer in pandas dataframe

If I have a dataframe and want to drop any rows where the value in one column is not an integer, how would I do this?
The alternative is to drop rows if the value is not within the range 0-2, but since I am not sure how to do either of them I was hoping someone else might be.
Here is what I tried, but it didn't work and I'm not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] != 1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype': [0, 1, np.NaN, 'asdas', 2]})
df
Out[212]:
  entrytype
0         0
1         1
2       NaN
3     asdas
4         2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
  entrytype
0         0
1         1
4         2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
  entrytype
0         0
1         1
4         2
str("-1").isdigit() is False
str("-1").lstrip("-").isdigit() works but is not nice.
df.loc[df['Feature'].str.match('^[+-]?\d+$')]
for your question the reverse set
df.loc[ ~(df['Feature'].str.match('^[+-]?\d+$')) ]
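Another common route (my own sketch, not taken from the answers above) is pd.to_numeric with errors='coerce', keeping only the rows that parse to a whole number:
# values that fail to parse become NaN and are dropped;
# the % 1 check also rejects non-integer floats
num = pd.to_numeric(df['entrytype'], errors='coerce')
df_int = df[num.notna() & (num % 1 == 0)]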
We have multiple ways to do the same thing, but I found this method easy and efficient.
Quick examples:
# Using drop() to delete rows based on a column value
df.drop(df[df['Fee'] >= 24000].index, inplace=True)
# Keep only rows where Fee >= 24000 (removes the rest)
df2 = df[df.Fee >= 24000]
# If you have a space in the column name,
# specify the column name in single quotes
df2 = df[df['column name'] >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000]
# Select rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
# Drop rows with None/NaN in Discount
df2 = df[df.Discount.notnull()]
