I have a DataFrame with a Matrix column like this:
11034-A
11034-B
1120-A
1121-A
112570-A
113-A
113.558
113.787-A
113.787-B
114-A
11691-A
11691-B
117-A RRS
12 X R
12-476-AT-A
12-476-AT-B
I'd like to keep only the Matrix values that end with A or B when they appear consecutively, so in the example above: 11034-A and 11034-B, 113.787-A and 113.787-B, 11691-A and 11691-B, 12-476-AT-A and 12-476-AT-B.
I wrote a function that compares two such strings and returns True or False; the problem is that I fail to see how to apply / applymap it to consecutive rows:
def isAB(stringA, stringB):
    return stringA.endswith('A') and stringB.endswith('B') and stringA[:-1] == stringB[:-1]
I tried df['result'] = isAB(df['Matrix'].str, df['Matrix'].shift().str) to no avail.
I seem to be missing something in the way I designed this.
Edit:
I think this works; it looks like I overcomplicated it at first:
df['t'] = (df['Matrix'].str.endswith('A') & df['Matrix'].shift(-1).str.endswith('B')) | (df['Matrix'].str.endswith('B') & df['Matrix'].shift(1).str.endswith('A'))
df['p'] = (df['Matrix'].str[:-1] == df['Matrix'].shift(-1).str[:-1]) | (df['Matrix'].str[:-1] == df['Matrix'].shift(1).str[:-1])
df['e'] = df['p'] & df['t']
final = df[df['e']]
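For reference, a runnable sketch of that edit on the sample data from the question (the na=False arguments are an addition here, just to keep the masks boolean at the frame's edges where shift introduces NaN):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'Matrix': [
    '11034-A', '11034-B', '1120-A', '1121-A', '112570-A', '113-A',
    '113.558', '113.787-A', '113.787-B', '114-A', '11691-A', '11691-B',
    '117-A RRS', '12 X R', '12-476-AT-A', '12-476-AT-B']})

s = df['Matrix']
# an 'A' row followed by a 'B' row, or a 'B' row preceded by an 'A' row
t = (s.str.endswith('A') & s.shift(-1).str.endswith('B', na=False)) | \
    (s.str.endswith('B') & s.shift(1).str.endswith('A', na=False))
# everything before the last character must match the neighbouring row
p = (s.str[:-1] == s.shift(-1).str[:-1]) | (s.str[:-1] == s.shift(1).str[:-1])

final = df[t & p]
print(final['Matrix'].tolist())
```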
Here is how I would do it.
df['ShiftUp'] = df['matrix'].shift(-1)
df['ShiftDown'] = df['matrix'].shift()

def check_matrix(x):
    if not pd.isnull(x.ShiftUp) and x.matrix[:-1] == x.ShiftUp[:-1]:
        return True
    elif not pd.isnull(x.ShiftDown) and x.matrix[:-1] == x.ShiftDown[:-1]:
        return True
    else:
        return False

df['new'] = df.apply(check_matrix, axis=1)
df = df.drop(['ShiftUp', 'ShiftDown'], axis=1)
print(df)
prints
matrix new
0 11034-A True
1 11034-B True
2 1120-A False
3 1121-A False
4 112570-A False
5 113-A False
6 113.558 False
7 113.787-A True
8 113.787-B True
9 114-A False
10 11691-A True
11 11691-B True
12 117-A RRS False
13 12 X R False
14 12-476-AT-A True
15 12-476-AT-B True
Here's my solution; it requires a bit of work.
The strategy is the following: obtain a new column that has the same values as the current column, but shifted one position.
Then it's just a matter of checking whether one column ends in A or B and the other in B or A.
Say your matrix column is called "column_name".
Then:
myl = ['11034-A',
'11034-B',
'1120-A',
'1121-A',
'112570-A',
'113-A',
'113.558',
'113.787-A',
'113.787-B',
'114-A',
'11691-A',
'11691-B',
'117-A RRS',
'12 X R',
'12-476-AT-A',
'12-476-AT-B']
#toy data frame
mydf = pd.DataFrame.from_dict({'column_name':myl})
#get a new series which is the same one as the original
#but the first entry contains "nothing"
new_series = pd.Series(['nothing'] + mydf['column_name'][:-1].values.tolist())
#add it to the original dataframe
mydf['new_col'] = new_series
You then define a simple function:
def do_i_want_this_row(x, y):
    left_char = x[-1]
    right_char = y[-1]
    return (left_char == 'A' and right_char == 'B') or (left_char == 'B' and right_char == 'A')
and voila:
print(mydf[mydf.apply(lambda x: do_i_want_this_row(x.column_name, x.new_col), axis=1)])
column_name new_col
1 11034-B 11034-A
2 1120-A 11034-B
8 113.787-B 113.787-A
9 114-A 113.787-B
11 11691-B 11691-A
15 12-476-AT-B 12-476-AT-A
There is still the question of the last element, but I'm sure you can think of what to do with it if you decide to follow this strategy ;)
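For completeness, a hedged sketch of how the same strategy could also drop the mismatched rows visible in the output above (1120-A, 114-A): compare the prefixes as well, and pull in the partner row of each match. The helper here is a variant of the one above, not the original.

```python
import pandas as pd

myl = ['11034-A', '11034-B', '1120-A', '1121-A', '112570-A', '113-A',
       '113.558', '113.787-A', '113.787-B', '114-A', '11691-A', '11691-B',
       '117-A RRS', '12 X R', '12-476-AT-A', '12-476-AT-B']
mydf = pd.DataFrame({'column_name': myl})
# previous row's value; NaN for the first row
mydf['new_col'] = mydf['column_name'].shift()

def do_i_want_this_row(x, y):
    # require an A/B ending *and* identical prefixes
    if pd.isnull(y):
        return False
    return x[:-1] == y[:-1] and (x[-1], y[-1]) in {('A', 'B'), ('B', 'A')}

mask = mydf.apply(lambda r: do_i_want_this_row(r.column_name, r.new_col), axis=1)
# a matching pair flags only its second row; also include the row just before each match
print(mydf[mask | mask.shift(-1, fill_value=False)])
```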
You can delete rows from a DataFrame using DataFrame.drop(labels, axis). To get a list of labels to delete, I would first get a list of pairs that match your criterion. With the labels from above in a list labels and your isAB function,
pairs = zip(labels[:-1], labels[1:])
# isAB takes two arguments, so unpack each pair
delete_pairs = filter(lambda p: isAB(*p), pairs)
delete_labels = []
for a, b in delete_pairs:
    delete_labels.append(a)
    delete_labels.append(b)
Examine delete_labels to make sure you've put it together correctly:
print(delete_labels)
And finally, delete the rows. With the DataFrame in question as x,
x.drop(delete_labels) # or x.drop(delete_labels, axis) if appropriate
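A quick end-to-end sketch of the steps above, with a toy labels list standing in for the real index labels (note that filter hands each pair over as a single tuple, hence the unpacking lambda):

```python
# toy stand-in for the DataFrame's index labels
labels = ['11034-A', '11034-B', '1120-A', '113.787-A', '113.787-B', '114-A']

def isAB(stringA, stringB):
    return stringA.endswith('A') and stringB.endswith('B') and stringA[:-1] == stringB[:-1]

pairs = zip(labels[:-1], labels[1:])
# unpack each (a, b) tuple into isAB's two arguments
delete_pairs = filter(lambda p: isAB(*p), pairs)

delete_labels = []
for a, b in delete_pairs:
    delete_labels.append(a)
    delete_labels.append(b)

print(delete_labels)
```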
Related
Say I have a data frame, df, as below, that has a 'Value' column which I'd like to apply some boolean analysis to.
date Value
10/11 0.798
11/11 1.235
12/11 0.890
13/11 0.756
14/11 0.501
...
Essentially, I'd like to create a new column that switches to TRUE when the value is greater than 1, and remains true unless the value drops below 0.75. For example, it would look like the below using df:
column
FALSE
TRUE
TRUE
TRUE
FALSE
I am struggling to find an appropriate way to reference the previous value of a column I am defining without running into an error. The logic I want to use is as below:
df['column'] = (df['Value'] >= 1) | ((df['column'].shift(1) == True) & (df['Value'] >= 0.75))
Is there a way that I can achieve this without over-complicating things?
A possible solution:
val1, val2 = 1, 0.75
out = (df.assign(
new=df.Value.where(df.Value.gt(val1) | df.Value.lt(val2))
.ffill().gt(val1)))
print(out)
Output:
date Value new
0 10/11 0.798 False
1 11/11 1.235 True
2 12/11 0.890 True
3 13/11 0.756 True
4 14/11 0.501 False
Actually, calling a function with apply might help, with some "remembering" logic.
res = True

def CheckRow(row):
    global res
    if res:
        if row['value'] > 1.0:
            res = False  # next time check for < 0.75
            return True
        else:
            return False
    else:  # res == False
        if row['value'] < 0.75:
            res = True  # next time check for above 1.0
            return False
        else:
            return True

df['column'] = df.apply(lambda x: CheckRow(x), axis=1)
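Not part of the answer above, but the global can be avoided by closing over the latch state; a minimal sketch of the same logic, assuming the column is named value:

```python
import pandas as pd

def make_checker(upper=1.0, lower=0.75):
    state = {'above': False}  # latched on/off state
    def check(value):
        if not state['above']:
            if value > upper:
                state['above'] = True   # latch on above the upper threshold
        else:
            if value < lower:
                state['above'] = False  # latch off below the lower threshold
        return state['above']
    return check

df = pd.DataFrame({'value': [0.798, 1.235, 0.890, 0.756, 0.501]})
df['column'] = df['value'].apply(make_checker())
print(df)
```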
I need to update the value of the column matching the selection criteria and repeat it for some consecutive rows.
eg:
INPUT:
df = pd.DataFrame({"a" : [True,False,False,False,False,True,False,False]})
roll value to next 3 indexes
(input format)
OUTPUT:
output = pd.DataFrame({"a": [True, True, True, False, False, True, True, True]})
(output format)
I looked at pandas.Series.repeat, but that adds new values; I need to make sure that the size remains the same.
Use .rolling(...) to get rolling window:
df.rolling(window=3, min_periods=1).agg(lambda x: any(x)).astype("bool")
Output:
a
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
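An alternative sketch (not from the answer above): keep the True rows, forward-fill each one over a limited number of steps, and restore False everywhere else.

```python
import pandas as pd

df = pd.DataFrame({"a": [True, False, False, False, False, True, False, False]})

# turn False into NaN, then propagate each True forward over at most
# the next 2 rows; the remaining NaNs become False again
out = df["a"].where(df["a"]).ffill(limit=2).fillna(False).astype(bool)
print(out.tolist())
```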
You could use a while and for loop like so:
i = 0  # index
content = 'true'  # set content to replace
while i < 3:  # run 3 times
    for x in dataframe:
        dataframe['column'] = dataframe['column'].replace(content, str(i))  # replace content at index i
    i = i + 1  # add 1 to index
This way you replace values at a certain index while only looping through the first 3 indexes.
I have a question regarding some code in Python. I'm trying to extract the index of the first row where the condition TRUE is satisfied in 3 different columns. This is the data I'm using:
0 1 2 3 4
0 TRUE TRUE TRUE 0.41871395 0.492517879
1 TRUE TRUE TRUE 0.409863582 0.519425031
2 TRUE TRUE TRUE 0.390077415 0.593127232
3 FALSE FALSE FALSE 0.372020631 0.704367199
4 FALSE FALSE FALSE 0.373546556 0.810876797
5 FALSE FALSE FALSE 0.398876919 0.86855678
6 FALSE FALSE FALSE 0.432142094 0.875576037
7 FALSE FALSE FALSE 0.454115421 0.863063448
8 FALSE TRUE FALSE 0.460676901 0.855739006
9 FALSE TRUE FALSE 0.458693197 0.855128636
10 FALSE FALSE FALSE 0.459201839 0.856451104
11 FALSE FALSE FALSE 0.458693197 0.855739006
12 FALSE FALSE FALSE 0.458082827 0.856349376
13 FALSE FALSE FALSE 0.456556902 0.856959746
14 TRUE TRUE TRUE 0.455946532 0.858180486
15 TRUE TRUE TRUE 0.455030976 0.858790857
16 TRUE TRUE TRUE 0.454725791 0.858485672
17 FALSE FALSE FALSE 0.454420606 0.857875301
18 FALSE FALSE FALSE 0.454725791 0.858383943
19 FALSE TRUE FALSE 0.453199866 0.856654561
20 FALSE FALSE FALSE 0.451979125 0.856349376
21 FALSE FALSE FALSE 0.45167394 0.856959746
22 FALSE FALSE FALSE 0.451775669 0.857570116
23 FALSE FALSE FALSE 0.45106357 0.857264931
24 TRUE TRUE TRUE 0.450758385 0.856654561
25 TRUE TRUE TRUE 0.4504532 0.856044191
26 TRUE TRUE TRUE 0.449232459 0.856349376
27 TRUE TRUE TRUE 0.448316904 0.855535549
and I need to get the index number only of the first row of each block where all three columns are TRUE:
0
14
24
Thank you!
I guess everyone missed the "extract the index of the first row" part. One way would be to remove consecutive duplicates first and then obtain the indices where all three are True, so that you only get the first row of each streak:
df = df[['0', '1', '2']]
df = df[df.shift() != df].dropna().all(axis=1)
print(df[df].index.tolist())
OUTPUT:
[0, 14, 24]
I tried this on a demo dataframe and it seems to work for me.
df = pd.DataFrame(data={'A':[True,True,True,True,True,False,True,True],'B':[True,True,False,True,True,False,True,True],'C':[True,False,True,True,True,False,True,True]})
i = df[(df['A']==True) & (df['B']==True) & (df['C']==True)].index.to_list()
i = [x for x in i if x-1 not in i]
EDIT 2: I have a new answer in response to some clarifications.
You're looking for each row that has TRUE in columns 0, 1, and 2, BUT you'd like to ignore such rows that are not the first in a streak of them. The first part of my answer is still the same: I think you should create a mask that selects your TRUE triplet rows:
condition = df[[0, 1, 2]].all(axis='columns')
But now I present a possible way to filter out the rows you want to ignore. To be not-first in a streak of TRUE triplet rows means that the previous row also satisfies condition.
idx = df[condition].index
ignore = idx.isin(idx + 1)
result = idx[~ignore]
In other words, ignore rows where the index value is the successor of an index value satisfying condition.
Hope this helps!
Keeping my original answer for record keeping:
I think you'll end up with the most readable solution by breaking this out into two steps:
First, find out which rows have the value True for all of the columns you're interested in:
condition = df[[0, 1, 2]].all(axis='columns')
Then, the index values you're interested in are simply df[condition].index.
EDIT: if, as Benoit points out may be the case, TRUE and FALSE are strings, that's fine, you just need a minor tweak to the first step:
condition = (df[[0, 1, 2]] == 'TRUE').all(axis='columns')
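A quick end-to-end sketch of EDIT 2 combined with that string tweak, on a small made-up frame (not the question's full data):

```python
import pandas as pd

# small frame mimicking the question's layout, with TRUE/FALSE as strings
df = pd.DataFrame({
    0: ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE'],
    1: ['TRUE', 'TRUE', 'FALSE', 'TRUE',  'TRUE', 'TRUE'],
    2: ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE'],
})

condition = (df[[0, 1, 2]] == 'TRUE').all(axis='columns')

idx = df[condition].index
ignore = idx.isin(idx + 1)   # drop rows whose predecessor also qualifies
result = idx[~ignore]
print(result.tolist())
```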
If the TRUE and FALSE in your DataFrame are actually the boolean values True and False then,
# This will look at the first 3 columns and return True if all are True, else False:
step1 = [all(q) for q in df[[0, 1, 2]].values]

id = []
cnt = 0
temp_cnt = 0
# This loop finds where the value is True and checks if the next 2 are also True.
# It then appends count - 2 to a list named id; the -2 compensates for the index.
for q in step1:
    if q:
        cnt += 1
        if cnt == 3:
            id.append(temp_cnt - 2)
    else:
        cnt = 0
    temp_cnt += 1

# Printing "id" then returns the first index where at least 3 True values occur in sequence.
id
Out[108]: [0, 14, 24]
I think this could do the trick. As general advice, though, it always helps to name the columns in pandas.
Say that your pandas data frame is named data:
data[(data[0] == True) & (data[1] == True) & (data[2] == True)].index.values
or
list(data[(data[0] == True) & (data[1] == True) & (data[2] == True)].index.values)
Based on the answer here, something like this will provide a list of indices for the rows that meet all conditions:
df[(df[0]==True) & (df[1]==True) & (df[2]==True)].index.tolist()
The following will work regardless of the position of the 3 columns you wish to check for True values, and gives you back a list indicating which rows have 3 True values present:
Edit:
Now updated to better align with the OP's original request:
#df.iloc[:, :3] = (df.iloc[:, :3] == "TRUE") # If necessary
s = (df == True).apply(sum, axis=1) == 3
s = s[s.shift() != s]
s.index[s].tolist()
I just want to switch correct to false and false to correct in my pandas DataFrame. Doing what I have written below changes everything to correct. How do I fix this?
a.loc[a["outcome"] == 'correct', 'outcome'] = 'false'
a.loc[a["outcome"] == 'false', 'outcome'] = 'correct'
Use map with a dictionary; if there can be other values outside the dict, add fillna:
a = pd.DataFrame({'outcome':['correct','correct','false', 'val']})
print (a)
outcome
0 correct
1 correct
2 false
3 val
d = {'correct':'false', 'false':'correct'}
a['outcome'] = a['outcome'].map(d).fillna(a['outcome'])
print (a)
outcome
0 false
1 false
2 correct
3 val
I have a Pandas Dataframe of indices and values between 0 and 1, something like this:
6 0.047033
7 0.047650
8 0.054067
9 0.064767
10 0.073183
11 0.077950
I would like to retrieve tuples of the start and end points of regions of more than 5 consecutive values that are all over a certain threshold (e.g. 0.5). So that I would have something like this:
[(150, 185), (632, 680), (1500,1870)]
Where the first tuple is for a region that starts at index 150, has 35 values that are all above 0.5 in a row, and ends at index 185, non-inclusive.
I started by filtering for only the values above 0.5, like so:
df = df[df['values'] >= 0.5]
And now I have values like this:
632 0.545700
633 0.574983
634 0.572083
635 0.595500
636 0.632033
637 0.657617
638 0.643300
639 0.646283
I can't show my actual dataset, but the following should be a good representation:
import numpy as np
from pandas import *
np.random.seed(seed=901212)
df = DataFrame(range(1,501), columns=['indices'])
df['values'] = np.random.rand(500)*.5 + .35
yielding:
1 0.491233
2 0.538596
3 0.516740
4 0.381134
5 0.670157
6 0.846366
7 0.495554
8 0.436044
9 0.695597
10 0.826591
...
Here the region (2,4) has two values above 0.5; however, this would be too short. On the other hand, the region (25,44), with 19 values above 0.5 in a row, would be added to the list.
You can find the first and last element of each consecutive region by looking at the series and 1-row shifted values, and then filter the pairs which are adequately apart from each other:
# tag rows based on the threshold
df['tag'] = df['values'] > .5
# first row is a True preceded by a False
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
# last row is a True followed by a False
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
# filter those which are adequately apart
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]
so for example the first region would be:
>>> i, j = pr[0]
>>> df.loc[i:j]
indices values tag
15 16 0.639992 True
16 17 0.593427 True
17 18 0.810888 True
18 19 0.596243 True
19 20 0.812684 True
20 21 0.617945 True
I think this prints what you want. It is based heavily on Joe Kington's answer here, so I guess it is appropriate to up-vote that.
import numpy as np
# from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
# with minor edits
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition, n=1, axis=0)
    idx, _ = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right. -JK
    # LB: this copy-to-increment is horrible, but I get
    # "ValueError: output array is read-only" without it
    mutable_idx = np.array(idx)
    mutable_idx += 1
    idx = mutable_idx
    if condition[0]:
        # If the start of condition is True, prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]  # Edit
    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx
def main():
    import pandas as pd
    RUN_LENGTH_THRESHOLD = 5
    VALUE_THRESHOLD = 0.5
    np.random.seed(seed=901212)
    data = np.random.rand(500)*.5 + .35
    df = pd.DataFrame(data=data, columns=['values'])
    match_bools = df.values > VALUE_THRESHOLD
    print('with boolean array')
    for start, stop in contiguous_regions(match_bools):
        if stop - start > RUN_LENGTH_THRESHOLD:
            print(start, stop)

if __name__ == '__main__':
    main()
I would be surprised if there were not more elegant ways.
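For the record, one more compact pandas-only variant along those lines, sketched on the same toy data: give every run of equal above/below-threshold tags its own id via a cumulative sum of tag changes, then keep only the above-threshold runs of more than 5 rows.

```python
import numpy as np
import pandas as pd

np.random.seed(seed=901212)
df = pd.DataFrame({'values': np.random.rand(500) * .5 + .35})

tag = df['values'] > 0.5
# each consecutive run of equal tag values gets its own group id
groups = (tag != tag.shift()).cumsum()
runs = df.index.to_series().groupby(groups).agg(['first', 'last'])

is_true_run = tag.groupby(groups).first()
long_enough = runs['last'] - runs['first'] + 1 > 5
# report (start, end) with the end index exclusive, as in the question
pairs = [(f, l + 1) for f, l in runs[is_true_run & long_enough].itertuples(index=False)]
print(pairs[:3])
```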