Given a Pandas data frame like
ID VALUE
1 false
2 true
3 false
4 false
5 false
6 true
7 true
8 true
9 false
the result should be true for the row immediately following a group of true values:
ID RESULT
1 false
2 false
3 true
4 false
5 false
6 false
7 false
8 false
9 true
How can I achieve this in Pandas?
You can check if the diff() result of the VALUE column is equal to -1:
df.VALUE.astype(int).diff() == -1
#0 False
#1 False
#2 True
#3 False
#4 False
#5 False
#6 False
#7 False
#8 True
#Name: VALUE, dtype: bool
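Putting it together as a minimal, runnable sketch (assuming VALUE is already a boolean column; if it holds the strings 'true'/'false' you would first need to map them to booleans):
import pandas as pd

# Rebuild the example frame with a boolean VALUE column.
df = pd.DataFrame({
    'ID': range(1, 10),
    'VALUE': [False, True, False, False, False, True, True, True, False],
})

# A True row followed by a False row shows up as -1 in the integer diff.
df['RESULT'] = df['VALUE'].astype(int).diff() == -1
print(df)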
You can compare the values against a shifted version to find where a new False appears after Trues:
>>> df['VALUE'] = df['VALUE'].astype('bool')
>>> (~df['VALUE'] & df['VALUE'].shift())
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
Name: VALUE, dtype: bool
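Note that shifting a boolean Series puts a NaN in the first position and upcasts it to object dtype, which can make the & fragile on recent pandas versions. A small sketch that keeps everything boolean, assuming pandas 0.24+ where shift() accepts fill_value:
# The first shifted element becomes False instead of NaN, so the result stays bool.
result = ~df['VALUE'] & df['VALUE'].shift(fill_value=False)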
import pandas as pd

values = ['false', 'true', 'false', 'false', 'false', 'true', 'true', 'true', 'false']
df = pd.DataFrame(values, columns=['values'])
print("Before changes:")
print(df)

# Rows that are currently 'true' all become 'false' ...
to_become_false = df[df['values'] == 'true'].index.tolist()
# ... and the row right after each run of 'true' becomes 'true'.
to_become_true = [idx + 1 for idx in to_become_false if idx + 1 not in to_become_false]

df.loc[to_become_false, 'values'] = 'false'
df.loc[to_become_true, 'values'] = 'true'
print("\n\nAfter changes:")
print(df)
Result:
Before changes:
values
0 false
1 true
2 false
3 false
4 false
5 true
6 true
7 true
8 false
After changes:
values
0 false
1 false
2 true
3 false
4 false
5 false
6 false
7 false
8 true
Related
I have a df where I want to fill the rows in column values with True if the number of rows between True values in column values is less than two.
counter  values
1        True
2        False
3        False
4        True
5        False
6        True
7        True
8        False
9        True
10       False
11       False
The result I want is like the df below:
counter  values
1        True
2        False
3        False
4        True
5        True
6        True
7        True
8        True
9        True
10       False
11       False
You can make groups starting at each True; if a group has 2 items or fewer, replace it with True. Then compute the boolean OR with the original column:
N = 2
fill = df['values'].groupby(df['values'].cumsum()).transform(lambda g: len(g)<=N)
df['values'] = df['values']|fill ## or df['values'] |= fill
output (here as a new column values2 for clarity):
counter values values2
0 1 True True
1 2 False False
2 3 False False
3 4 True True
4 5 False True
5 6 True True
6 7 True True
7 8 False True
8 9 True True
9 10 False False
10 11 False False
Another option that only works in the particular case of N=2: check whether both the row before and the row after are True:
df['values'] = df['values']|(df['values'].shift()&df['values'].shift(-1))
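A minimal runnable sketch of that N=2 variant on the question's data (fill_value keeps the shifted columns boolean; this assumes pandas 0.24+):
import pandas as pd

df = pd.DataFrame({
    'counter': range(1, 12),
    'values': [True, False, False, True, False, True, True, False, True, False, False],
})

# True wherever both the previous and the next row are True, OR'd with the original column.
df['values'] = df['values'] | (df['values'].shift(fill_value=False)
                               & df['values'].shift(-1, fill_value=False))
print(df)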
I have this kind of dataframe:
d = {1 : [False,False,False,False,True],2:[False,True,True,True,False],3 :[True,False,False,False,True]}
df = pd.DataFrame(d)
df:
1 2 3
0 False False True
1 False True False
2 False True False
3 False True False
4 True False True
My goal is to keep rows n and n+1 wherever rows n and n+1 are different. For the example df, the result would be:
df_result:
1 2 3
0 False False True
1 False True False
3 False True False
4 True False True
I have already tried this line, df_result = df.ne(df.shift()), and kept only the rows where there is at least one True, but it doesn't keep row 3.
Any idea how I can get the expected result?
Thanks!
I believe you need to compare not-equal with DataFrame.ne, shifting by 1 and by -1, get at least one per-row match with DataFrame.any, and chain both masks with | for bitwise OR:
df_result = df[df.ne(df.shift()).any(axis=1) | df.ne(df.shift(-1)).any(axis=1)]
print (df_result)
1 2 3
0 False False True
1 False True False
3 False True False
4 True False True
Another similar idea:
df_result = df[(df.ne(df.shift()) | df.ne(df.shift(-1))).any(axis=1)]
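To see why row 3 is kept, it can help to look at the two masks separately (a sketch on the question's data: ne against the forward shift flags rows that differ from the row above, ne against the backward shift flags rows that differ from the row below):
import pandas as pd

d = {1: [False, False, False, False, True],
     2: [False, True, True, True, False],
     3: [True, False, False, False, True]}
df = pd.DataFrame(d)

changed_from_prev = df.ne(df.shift()).any(axis=1)   # differs from the row above
changed_to_next = df.ne(df.shift(-1)).any(axis=1)   # differs from the row below
print(df[changed_from_prev | changed_to_next])      # row 3 is caught by the second mask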
With the following data:
A
false
false
false
false
true
false
false
true
true
true
I would like to generate the following output:
A B
false 1
false 2
false 3
false 4
true 1
false 1
false 2
true 1
true 2
true 3
So, at each change, I restart the counter and then keep incrementing as long as the content doesn't change.
I can do it with a loop (pseudocode):
count = 0
current = df['A'][0]
for i in df['A'].index:
    if df['A'][i] != current:
        current = df['A'][i]
        count = 0
    count += 1
    df['B'][i] = count
but is there a pandas-ish way to achieve this, since the loop will be very slow?
You can try this:
df['B'] = df.groupby(df.A.ne(df.A.shift()).cumsum()).cumcount()+1
A B
0 False 1
1 False 2
2 False 3
3 False 4
4 True 1
5 False 1
6 False 2
7 True 1
8 True 2
9 True 3
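In case the one-liner is hard to parse, here is a sketch of the intermediate step: df.A.ne(df.A.shift()).cumsum() gives every run of identical values its own group id, and cumcount() then counts within each run:
# Each change of value starts a new run; cumsum turns the change markers into run ids.
run_id = df.A.ne(df.A.shift()).cumsum()
print(run_id.tolist())              # [1, 1, 1, 1, 2, 3, 3, 4, 4, 4] for the sample data

# Counting within each run (0-based) and adding 1 gives the restarting counter.
df['B'] = df.groupby(run_id).cumcount() + 1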
I have the following dataframe:
pandas.DataFrame(numpy.random.randn(10, 5) > 1, index=range(1, 11), columns=list('ABCDE'))
A B C D E
1 False False False False False
2 False False False False False
3 True True False True False
4 False False True True False
5 False False False False False
6 False False False False False
7 False False False False False
8 False False False False False
9 False False False False False
10 False True False True False
For each row I would like to get the name of the last column in that row that contains True.
If there isn't any, return any reasonable value.
How can I do that?
A one-liner (wrapped in parentheses so the inline comments don't break the line continuation):
>>> value = np.nan
>>> (df.reindex(columns=df.columns[::-1])  # reverse the column order
...    .idxmax(axis=1)                     # find the last (now first) True value
...    .reset_index()                      # keep the row index for the next step
...    .apply(lambda x: value if (x[0] == df.columns[-1]
...                               and not df.loc[x['index'], x[0]])
...           else x[0], axis=1))          # = value if col == "E" and the value is False
Out [1]:
0 NaN
1 NaN
2 D
3 D
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 D
Explanation:
idxmax returns the index of the max value in a row; if there is more than one max, it returns the first one. We want the last one, so we reverse the column order first.
Finally we must replace the obtained Series with value where col == "E" and the value is False. You can't apply a condition on the index of a Series; that's why you need the reset_index first.
This last step could be done more elegantly with df.replace({'E': {False: value}}), which replaces False in column 'E' with value, but somehow it doesn't work for me.
set up the example data first:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randn(10, 5) > 1, index=range(1, 11), columns=['A', 'B', 'C', 'D', 'E'])
df
looks like:
A B C D E
1 True False False False False
2 False True False False False
3 True False False False True
4 False False False False False
5 False True False False False
6 False False False False False
7 False False False False False
8 False False False True False
9 False False False True False
10 False False True False False
It sounds like what you want to do is get the index # for each True value and then select the max index #. On a single column that might look like the following:
df['A'][df['A']].index.max()
which returns 3. To do this for all the columns, the easiest approach is to iterate through each column and collect the results in a list:
mylist = []
for col in df.columns:
    myval = df[col][df[col]].index.max()
    mylist.append(myval)
mylist
that returns:
[3, 5, 10, 9, 3]
The loop logic above returns NaN if there is no True value in the column.
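If instead you want the result per row, as in the question (the name of the last column holding True in each row), a vectorized sketch: where(df) turns the False cells into NaN, so last_valid_index() picks the last True column and returns None for all-False rows:
# For every row: last column label where the value is True, None if the row is all False.
last_true_col = df.where(df).apply(lambda row: row.last_valid_index(), axis=1)
print(last_true_col)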
I am currently trying to create a new column to then filter on:
df['filterSalaryLoc'] = df[True if df['distance'] <= 25 & df['compensation_right'] else False]
This is how the DF looks:
distance compensation_right
1 20.299433 True
2 1014.258732 True
3 1027.524228 True
4 5556.81612 True
5 926.003129 True
6 19.832819 True
7 1.489066 True
8 434.355273 True
9 23.647016 True
The idea is that rows where the column entry is False get filtered out. However it is not working and raises an error on this line: df['filterSalaryLoc'] = df[True if df['distance'] <= 25 & df['compensation_right'] else False]. Anyone know what's going wrong?
I think perhaps you could do the assignment this way:
In [10]: df['filterSalaryLoc'] = (df['distance']<=25) & (df['compensation_right'])
In [11]: df
Out[11]:
distance compensation_right filterSalaryLoc
0 20.299433 True True
1 1014.258732 True False
2 1027.524228 True False
3 5556.816120 True False
4 926.003129 True False
5 19.832819 True True
6 1.489066 True True
7 434.355273 True False
8 23.647016 True True
The parentheses are necessary on the right-hand side, since without them df['distance']<=25 & df['compensation_right'] is parsed like
In [18]: df['distance']<=(25 & df['compensation_right'])
Out[18]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
dtype: bool
(Note this is all False.)
You could try this:
Find where your condition is true with df[(df.distance <= 25) & (df.compensation_right)] (you don't need the [True if ... else False]). Then take those rows (the .index) and make a new column that's True at those indices and NaN everywhere else.
In [7]: df.loc[df[(df.distance <= 25) & (df.compensation_right)].index, 'filterSalaryLoc'] = True
In [8]: df
Out[8]:
distance compensation_right filterSalaryLoc
1 20.299433 True True
2 1014.258732 True NaN
3 1027.524228 True NaN
4 5556.816120 True NaN
5 926.003129 True NaN
6 19.832819 True True
7 1.489066 True True
8 434.355273 True NaN
9 23.647016 True True
[9 rows x 3 columns]
Fill the NaNs with False:
In [9]: df.filterSalaryLoc.fillna(False, inplace=True)
In [10]: df
Out[10]:
distance compensation_right filterSalaryLoc
1 20.299433 True True
2 1014.258732 True False
3 1027.524228 True False
4 5556.816120 True False
5 926.003129 True False
6 19.832819 True True
7 1.489066 True True
8 434.355273 True False
9 23.647016 True True
[9 rows x 3 columns]
If you have pandas 0.13 or later installed, the first line can be replaced by:
In [13]: df.loc[df.query('distance <= 25 and compensation_right').index, 'filterSalaryLoc'] = True