Apply function then Filter DataFrame - python

I am currently trying to create a new column to then filter on:
df['filterSalaryLoc'] = df[True if df['distance'] <= 25 & df['compensation_right'] else False]
This is how the DF Looks:
distance compensation_right
1 20.299433 True
2 1014.258732 True
3 1027.524228 True
4 5556.81612 True
5 926.003129 True
6 19.832819 True
7 1.489066 True
8 434.355273 True
9 23.647016 True
Rows where the new column is False should then be filtered out. However, it is not working and raises an error at this line: df['filterSalaryLoc'] = df[True if df['distance'] <= 25 & df['compensation_right'] else False]. Does anyone know what's going wrong?

I think perhaps you could do the assignment this way:
In [10]: df['filterSalaryLoc'] = (df['distance']<=25) & (df['compensation_right'])
In [11]: df
Out[11]:
distance compensation_right filterSalaryLoc
0 20.299433 True True
1 1014.258732 True False
2 1027.524228 True False
3 5556.816120 True False
4 926.003129 True False
5 19.832819 True True
6 1.489066 True True
7 434.355273 True False
8 23.647016 True True
The parentheses are necessary on the right-hand side, since without them df['distance']<=25 & df['compensation_right'] is parsed like
In [18]: df['distance']<=(25 & df['compensation_right'])
Out[18]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
dtype: bool
(Note this is all False.)
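To see the corrected assignment end to end, here is a minimal sketch that rebuilds the question's frame by hand and then filters on the new column. The literal values are copied from the post; the variable name kept is only for illustration.

```python
import pandas as pd

# Data copied from the question; index 1..9 as shown there.
df = pd.DataFrame(
    {
        "distance": [20.299433, 1014.258732, 1027.524228, 5556.81612,
                     926.003129, 19.832819, 1.489066, 434.355273, 23.647016],
        "compensation_right": [True] * 9,
    },
    index=range(1, 10),
)

# Parenthesize the comparison: & binds more tightly than <=.
df["filterSalaryLoc"] = (df["distance"] <= 25) & df["compensation_right"]

# Keep only the rows where the new column is True.
kept = df[df["filterSalaryLoc"]]
print(kept)
```

With this data, kept holds the four rows whose distance is at most 25.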

You could try this:
Find where your condition is true with df[(df.distance <= 25) & (df.compensation_right)] (you don't need the [True if ... else False]). Then take those rows (the .index) and make a new column that's True at those indices and NaN everywhere else.
In [7]: df.loc[df[(df.distance <= 25) & (df.compensation_right)].index, 'filterSalaryLoc'] = True
In [8]: df
Out[8]:
distance compensation_right filterSalaryLoc
1 20.299433 True True
2 1014.258732 True NaN
3 1027.524228 True NaN
4 5556.816120 True NaN
5 926.003129 True NaN
6 19.832819 True True
7 1.489066 True True
8 434.355273 True NaN
9 23.647016 True True
[9 rows x 3 columns]
Fill the NaNs with False:
In [9]: df['filterSalaryLoc'] = df['filterSalaryLoc'].fillna(False)
In [10]: df
Out[10]:
distance compensation_right filterSalaryLoc
1 20.299433 True True
2 1014.258732 True False
3 1027.524228 True False
4 5556.816120 True False
5 926.003129 True False
6 19.832819 True True
7 1.489066 True True
8 434.355273 True False
9 23.647016 True True
[9 rows x 3 columns]
If you have pandas 0.13 or later installed, the first line can be replaced by:
In [13]: df.loc[df.query('distance <= 25 and compensation_right').index, 'filterSalaryLoc'] = True
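If the end goal is only the filtered rows, the helper column may not be needed at all. The following is a small sketch, assuming a shortened, hypothetical version of the frame (one compensation_right value is set to False so the second condition has a visible effect); the name nearby_and_paid is made up for the example.

```python
import pandas as pd

# Hypothetical, shortened data in the same shape as the question's frame.
df = pd.DataFrame(
    {
        "distance": [20.299433, 1014.258732, 19.832819, 434.355273],
        "compensation_right": [True, True, False, True],
    }
)

# query returns the matching rows directly, without an intermediate column.
nearby_and_paid = df.query("distance <= 25 and compensation_right")
print(nearby_and_paid)
```

Only the first row survives here: the third row is close enough but has compensation_right set to False.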

Related

Fill rows in df.column between rows if number of rows between is less than something

I have a df where I want to fill the rows in column values with True if the number of rows between True values in column values is less than two.
counter values
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 False
9 True
10 False
11 False
The result I want is like the df below:
counter values
1 True
2 False
3 False
4 True
5 True
6 True
7 True
8 True
9 True
10 False
11 False
You can make groups that each start at a True value. If a group has 2 items (or fewer), replace its values with True. Then compute the boolean OR with the original column:
N = 2
fill = df['values'].groupby(df['values'].cumsum()).transform(lambda g: len(g)<=N)
df['values'] = df['values'] | fill  # or: df['values'] |= fill
Output (here as a new column values2 for clarity):
counter values values2
0 1 True True
1 2 False False
2 3 False False
3 4 True True
4 5 False True
5 6 True True
6 7 True True
7 8 False True
8 9 True True
9 10 False False
10 11 False False
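Another option, which works only in the particular case of N=2, is to check whether both the row before and the row after are True (fill_value=False keeps the shifted columns boolean at the edges):
df['values'] = df['values'] | (df['values'].shift(fill_value=False) & df['values'].shift(-1, fill_value=False))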
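To make the grouping step easier to follow, here is a sketch that exposes the intermediate columns. It swaps the lambda for transform("size"), which should give the same group lengths; group_id, group_size and values2 are names chosen for the illustration, not part of the original answer.

```python
import pandas as pd

N = 2
df = pd.DataFrame({
    "counter": range(1, 12),
    "values": [True, False, False, True, False, True,
               True, False, True, False, False],
})

# Each True starts a new group; the False rows after it share its id.
group_id = df["values"].cumsum()

# Length of the group that each row belongs to.
group_size = df["values"].groupby(group_id).transform("size")

# Rows in short groups are filled; OR keeps the original True values.
fill = group_size <= N
df["values2"] = df["values"] | fill

print(df.assign(group_id=group_id, group_size=group_size))
```

One thing to watch: under this grouping a short group at the very end of the column, or a short run of False before the first True, is filled as well, so those edges may need separate handling depending on the data.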

Filtering Boolean in Dataframe

I have a df with ±100k rows and 10 columns.
I would like to find/filter the rows that contain between 2 and 4 True values.
For simplicity's sake, let's say I have this df:
A B C D E F
1 True True False False True
2 False True True True False
3 False False False False False
4 True False False False True
5 True False False False False
Expected output:
A B C D E F
1 True True False False True
2 False True True True False
4 True False False False True
I have tried using
df[(df['B']==True) | (df['C']==True) | (df['D']==True)| (df['E']==True)| (df['F']==True)]
But this only eliminates False rows and doesn't work if I want to find instances of at least 2/3 True.
Can anyone please help? Appreciate it.
Use DataFrame.select_dtypes to keep only the boolean columns, count the True values per row with sum, and then filter with Series.between in boolean indexing:
df = df[df.select_dtypes(bool).sum(axis=1).between(2,4)]
print (df)
A B C D E F
0 1 True True False False True
1 2 False True True True False
3 4 True False False False True
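For a self-contained version, the sketch below rebuilds the sample frame from the question and keeps the per-row count in its own variable so it can be inspected. The names true_count and result are only illustrative.

```python
import pandas as pd

# Sample data from the question; column A is an id, B..F are booleans.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [True, False, False, True, True],
    "C": [True, True, False, False, False],
    "D": [False, True, False, False, False],
    "E": [False, True, False, False, False],
    "F": [True, False, False, True, False],
})

# True counts as 1 when summed, so this is the number of True values per row.
true_count = df.select_dtypes(include="bool").sum(axis=1)

# between is inclusive on both ends by default.
result = df[true_count.between(2, 4)]
print(result)
```

Because select_dtypes drops the integer column A, the id values do not leak into the count.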

Quickly apply multiple filters to pandas dataframe

I am trying to filter this dataframe:
ID fallowDeer woodland fox rabbits
1 0.0 4.056649 2.210927 31.150451
2 0.0 2.267544 1.380185 38.631221
3 0.0 5.602904 1.201781 4.124286
4 0.0 7.377308 7.834358 25.911328
5 0.0 2.129115 1.564075 3.931565
6 0.0 5.988451 1.699852 32.915730
7 0.0 1.427553 3.586585 7.444735
8 0.0 9.857577 8.709137 34.004470
9 0.0 7.468365 1.317117 38.440278
10 0.0 3.902134 4.112038 22.427969
To keep only those rows where each species is between the minimum and maximum values shown below (these are series):
Minimum values
fallowDeer 0
woodland 1
fox 3
rabbits 10
Maximum values
fallowDeer 0
woodland 4
fox 6
rabbits 20
This code works:
accepted_simulations = df[(df['fallowDeer'] <= max_values['fallowDeer']) & (df['fallowDeer'] >= min_values['fallowDeer']) & (df['woodland'] <= max_values['woodland']) & (df['woodland'] >= min_values['woodland']) & (df['fox'] <= max_values['fox']) & (df['fox'] >= min_values['fox']) & (df['rabbits'] <= max_values['rabbits']) & (df['rabbits'] >= min_values['rabbits'])]
However, I am going to be adding many more species/columns in the future, and would like to avoid having to manually check each species against the min/max as I've done here. Is there a way to quickly compare each species to the min/max and filter the dataframe, without having to manually check each one?
If min_values and max_values are Series, you can compare all columns at once. Use Index.intersection to get the column names shared by the DataFrame and the index of the minimum (or maximum) values, then compare with DataFrame.ge and DataFrame.le:
c1 = df.columns.intersection(min_values.index)
m1 = df[c1].ge(min_values.loc[c1])
print (m1)
fallowDeer woodland fox rabbits
0 True True False True
1 True False False True
2 True True False False
3 True True True True
4 True False False False
5 True True False True
6 True False False False
7 True True True True
8 True True False True
9 True False False True
c2 = df.columns.intersection(max_values.index)
m2 = df[c2].le(max_values.loc[c2])
print (m2)
fallowDeer woodland fox rabbits
0 True False True False
1 True False True False
2 True False True True
3 True False False False
4 True False True True
5 True False True False
6 True False False True
7 True False False False
8 True False True False
9 True False False False
Then it is possible to combine both masks:
m = m1 & m2
print (m)
fallowDeer woodland fox rabbits
0 True False False False
1 True False False False
2 True False False False
3 True False False False
4 True False False False
5 True False False False
6 True False False False
7 True False False False
8 True False False False
9 True False False False
Finally, keep only the rows where the mask is True in every column, using DataFrame.all:
accepted_simulations = df[m.all(axis=1)]
print (accepted_simulations)
Empty DataFrame
Columns: [ID, fallowDeer, woodland, fox, rabbits]
Index: []
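Since the question's data yields an empty result, a run on data where some rows pass may be more convincing. The sketch below assumes a small hypothetical frame (only the first row is taken from the question) with the same min/max Series, and condenses the steps above; cols, in_range and accepted are illustrative names.

```python
import pandas as pd

# Hypothetical data: row 1 is from the question, rows 2-4 are made up.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "fallowDeer": [0.0, 0.0, 0.0, 0.0],
    "woodland": [4.056649, 2.267544, 3.5, 1.2],
    "fox": [2.210927, 4.5, 3.2, 7.0],
    "rabbits": [31.150451, 15.0, 12.0, 11.0],
})
min_values = pd.Series({"fallowDeer": 0, "woodland": 1, "fox": 3, "rabbits": 10})
max_values = pd.Series({"fallowDeer": 0, "woodland": 4, "fox": 6, "rabbits": 20})

# Only check columns that have both a minimum and a maximum defined.
cols = min_values.index.intersection(max_values.index).intersection(df.columns)

# ge/le align the Series index with the column names.
in_range = df[cols].ge(min_values[cols]) & df[cols].le(max_values[cols])

# A row is accepted only if every checked column is in range.
accepted = df[in_range.all(axis=1)]
print(accepted)
```

Adding a species then only means adding a column to df and an entry to each of the two Series; the filtering code stays the same.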

Pandas: How to mask first row after consecutive rows?

Given a Pandas data frame like
ID VALUE
1 false
2 true
3 false
4 false
5 false
6 true
7 true
8 true
9 false
the result should be true for the next row following a group of true values
ID RESULT
1 false
2 false
3 true
4 false
5 false
6 false
7 false
8 false
9 true
How to achieve this in Pandas?
You can check if the diff() result of the VALUE column is equal to -1:
df.VALUE.astype(int).diff() == -1
#0 False
#1 False
#2 True
#3 False
#4 False
#5 False
#6 False
#7 False
#8 True
#Name: VALUE, dtype: bool
You can compare the values against a shifted copy of the column to find each False that comes directly after a True:
>>> df['VALUE'] = df['VALUE'].astype('bool')
>>> (~df['VALUE'] & df['VALUE'].shift())
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
Name: VALUE, dtype: bool
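As a runnable version of the shift comparison, here is a sketch built on the question's data. It assumes VALUE is already boolean and passes fill_value=False to shift so that the first row compares against False rather than NaN; prev and RESULT are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": range(1, 10),
    "VALUE": [False, True, False, False, False, True, True, True, False],
})

# Value of the previous row; the first row has no predecessor, so use False.
prev = df["VALUE"].shift(fill_value=False)

# True only where the current row is False and the previous row was True.
df["RESULT"] = ~df["VALUE"] & prev
print(df)
```

Keeping the shifted column boolean avoids mixing NaN into the logical operation.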
import pandas as pd
values = ['false','true','false','false','false','true','true','true','false']
df = pd.DataFrame(values, columns=['values'])
print("Before changes: ")
print(df)
# Positions that currently hold 'true'
to_become_false = df[df['values'] == 'true'].index.tolist()
# The row right after each run of 'true' values (ignoring positions past the end)
to_become_true = [idx+1 for idx in to_become_false if idx+1 not in to_become_false and idx+1 in df.index]
# Use .loc to avoid chained assignment
df.loc[to_become_false, 'values'] = 'false'
df.loc[to_become_true, 'values'] = 'true'
print("\n\nAfter changes: ")
print(df)
result :
Before changes:
values
0 false
1 true
2 false
3 false
4 false
5 true
6 true
7 true
8 false
After changes:
values
0 false
1 false
2 true
3 false
4 false
5 false
6 false
7 false
8 true

How to select column names in python pandas by dataframe values?

I have the following dataframe:
pandas.DataFrame(numpy.random.randn(10, 5) > 1, index=range(1, 11), columns=list('ABCDE'))
A B C D E
1 False False False False False
2 False False False False False
3 True True False True False
4 False False True True False
5 False False False False False
6 False False False False False
7 False False False False False
8 False False False False False
9 False False False False False
10 False True False True False
For each row I would like to get the column name that is the last one in that row containing True.
If there isn't any, return any reasonable value.
How can I do that?
A one liner:
>>> value = np.nan
>>> (df[df.columns[::-1]]   # reverse the column order
...  .idxmax(axis=1)        # find the last (now first) True value
...  .reset_index()         # turn the row labels into a column for the next step
...  .apply(lambda x: value if (x[0] == df.columns[-1] and not df.loc[x['index'], x[0]])
...         else x[0], axis=1))   # value if col == "E" and the cell is False
Out [1]:
0 NaN
1 NaN
2 D
3 D
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 D
Explanation:
idxmax returns the index of the max value in a row; if there is more than one max, it returns the first one. We want the last one, so we reverse the column order of the dataframe.
Finally, we must replace the result with value wherever the returned column is "E" but the cell there is actually False (that is, the row contains no True at all). You can't apply a condition on the index of a Series, that's why you need the reset_index first.
This last step could be done more elegantly with df.replace({'E': {False: value}}), which replaces False in column 'E' with value, but somehow it doesn't work for me.
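A shorter variant of the same idea is sketched below. It is not the one-liner above: instead of reset_index and apply, it masks the rows that contain no True with DataFrame.any. The frame is rebuilt by hand from the question's printed output, and rows_with_true and last_true are names made up for the example.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame: only rows 3, 4 and 10 contain True values.
rows_with_true = {3: "ABD", 4: "CD", 10: "BD"}
df = pd.DataFrame(False, index=range(1, 11), columns=list("ABCDE"))
for row, cols in rows_with_true.items():
    df.loc[row, list(cols)] = True

# With the columns reversed, the first True found is the last one in the row.
reversed_cols = df.iloc[:, ::-1]

# idxmax reports 'E' for all-False rows, so blank those rows out with NaN.
last_true = reversed_cols.idxmax(axis=1).where(df.any(axis=1), np.nan)
print(last_true)
```

Unlike the one-liner, this keeps the original row labels (1 to 10) on the result.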
Set up the example data first:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10, 5) > 1, index=range(1, 11), columns=['A','B','C','D','E'])
df
looks like:
A B C D E
1 True False False False False
2 False True False False False
3 True False False False True
4 False False False False False
5 False True False False False
6 False False False False False
7 False False False False False
8 False False False True False
9 False False False True False
10 False False True False False
It sounds like what you want is the index label of each True value, and then the largest of those labels. On a single column that might look like the following:
df['A'][df['A']].index.max()
which returns 3. To do this for all the columns, the easiest way is to iterate over each column and append the result to a list:
mylist = []
for col in df.columns:
    myval = df[col][df[col]].index.max()
    mylist.append(myval)
mylist
That returns:
[3, 5, 10, 9, 3]
The loop above yields nan for any column that has no True value.
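The same per-column lookup can also be written without an explicit loop. The sketch below uses DataFrame.apply and builds the frame from the table printed above instead of relying on the random seed; true_rows and last_true_row are illustrative names.

```python
import pandas as pd

# Row labels that hold True in each column, read off the table above.
true_rows = {"A": [1, 3], "B": [2, 5], "C": [10], "D": [8, 9], "E": [3]}
df = pd.DataFrame(False, index=range(1, 11), columns=list("ABCDE"))
for col, rows in true_rows.items():
    df.loc[rows, col] = True

# For each column, keep the True rows and take the largest index label.
last_true_row = df.apply(lambda col: col[col].index.max())
print(last_true_row.tolist())
```

The result is a Series indexed by column name, so a label such as last_true_row['D'] can be looked up directly.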
