Change the consecutive value of selected rows in pandas - python

I need to update the value of the column matching the selection criteria and repeat it some consecutive times.
eg:
INPUT:
df = pd.DataFrame({"a" : [True,False,False,False,False,True,False,False]})
Roll each True value forward so that it covers 3 consecutive indexes.
(input format)
OUTPUT:
output = pd.DataFrame({"a" : [True,True,True, False, False, True, True, True]})
(output format)
I looked at pandas.Series.repeat, but that adds new values. I need to make sure that the size remains the same.

Use .rolling(...) to get rolling window:
df.rolling(window=3, min_periods=1).agg(lambda x: any(x)).astype("bool")
Output:
a
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
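The same result can also be sketched without a rolling window by OR-ing the column with shifted copies of itself. This is only a minimal sketch: the helper name extend_true and its parameter n are hypothetical, and it assumes the goal is that each True covers n consecutive rows starting at its own position.

```python
import pandas as pd


def extend_true(s: pd.Series, n: int = 3) -> pd.Series:
    """Return a copy of s where each True also covers the next n - 1 rows."""
    out = s.copy()
    for k in range(1, n):
        # shift(k) moves values k rows down; rows shifted in at the top become False
        out = out | s.shift(k, fill_value=False)
    return out


df = pd.DataFrame({"a": [True, False, False, False, False, True, False, False]})
result = extend_true(df["a"])
print(result.tolist())
```

Because shift never adds rows, the result always has the same length as the input column.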

You could also use a plain loop that copies each True forward:
values = df['a'].tolist()  # original flags
result = values.copy()  # same length as the input
for i, flag in enumerate(values):
    if flag:
        # mark this row and the next two rows (3 consecutive rows in total)
        for j in range(i, min(i + 3, len(values))):
            result[j] = True
df['a'] = result
This way each True is repeated over 3 consecutive rows while the size of the column stays the same.

Related

pandas groupby apply with condition on the first occurrence of a column value

I have a data frame, shown below, with pid and event_date as the indices after applying groupby. I want to apply groupby again, this time only on pid, and check two conditions:
A person (pid=person) has two or more True labels;
The first True instance of this person occurred when he/she was under 45 years old;
If both conditions are satisfied, assign True to this person/pid in the grouped dataframe.
                           age  label
pid      event_date
00000001 2000-08-28  76.334247  False
         2000-10-17  76.471233  False
         2000-10-31  76.509589   True
         2000-11-02  76.512329   True
...      ...               ...    ...
00000005 2014-08-15  42.769863  False
         2015-04-04  43.476712  False
         2015-11-06  44.057534   True
         2017-03-06  45.386301   True
I have come only so far to implement the first condition:
df = (df.groupby(['pid']).apply(lambda x: sum(x['label'])>1).to_frame('label'))
The second one is tricky for me. How do I condition on the first occurrence of some column value? Any advice is very much welcomed! Many thanks!
UPDATE with an example dataframe:
a = pd.DataFrame(columns=['pid', 'event_date', 'age', 'label'])
a['pid'] = [1, 1, 1, 1, 5, 5, 5, 5]
a['event_date'] = ['2000-08-28', '2000-08-28', '2000-08-28', '2000-08-28',\
'2000-08-28', '2000-08-28', '2000-08-28', '2000-08-28']
a['event_date'] = pd.to_datetime(a.event_date)
a['age'] = [76.334247, 76.471233, 76.509589, 76.512329, 42.769863, 43.476712, 44.057534, 45.386301]
a['label'] = [False, False, True, True, False, False, True, True]
a = (a.groupby(['pid', 'event_date', 'age']).apply(lambda x: x['label'].any()).to_frame('label'))
a.reset_index(level=['age'], inplace=True)
Now if I apply (a.groupby(['pid']).apply(lambda x: sum(x['label'])>1).to_frame('label')) I would get
label
pid
1 True
5 True
Which only satisfies the first condition (well because I skipped the second one). Adding the second condition should only label pid=5 True since only this person/pid was under 45 when the first label=True occurred.
After half a (fun) hour, I came up with this:
condition = a.reset_index().groupby('pid')['label'].sum().ge(2) & a.reset_index().groupby('pid').apply(lambda x: x['age'][x['label'].idxmax()] < 45)
Output:
>>> condition
pid
1 False
5 True
dtype: bool
It could be shortened a little (by removing the two .reset_index() calls) if the index were a normal one, not a MultiIndex of pid + event_date. If you can't avoid that from the start and you don't mind changing a:
a = a.reset_index()
condition = a.groupby('pid')['label'].sum().ge(2) & a.groupby('pid').apply(lambda x: x['age'][x['label'].idxmax()] < 45)
Expanded:
condition = (
    a.groupby('pid')  # Group by pid
    ['label']  # Get the label column for each group
    .sum()  # Compute the sum of the True values
    .ge(2)  # Are there two or more?
    # Boolean mask. The previous and the next bits of code are the two conditions,
    # and they return a series, where the index is each unique pid, and the value
    # is whether the condition is met for all the rows in that pid
    &
    a.groupby('pid')  # Group by pid
    .apply(  # Call a function for each group, passing the group (a dataframe) to the function as its first parameter
        lambda x:  # Function start
            x['age'][  # Get item from the age column at the specified index
                x['label'].idxmax()  # Get the index of the highest value of the label column (since they're only boolean values, the highest will be the first True value)
            ] < 45  # Check if it's less than 45
    )
)
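For comparison, here is a minimal sketch of the same two conditions without apply. It is not the method above: it assumes the rows are already sorted by event_date within each pid, so that the first True row of a group is its earliest one. The names n_true and first_true_age are illustrative only.

```python
import pandas as pd

a = pd.DataFrame({
    "pid": [1, 1, 1, 1, 5, 5, 5, 5],
    "age": [76.334247, 76.471233, 76.509589, 76.512329,
            42.769863, 43.476712, 44.057534, 45.386301],
    "label": [False, False, True, True, False, False, True, True],
})

# Condition 1: number of True labels per pid
n_true = a.groupby("pid")["label"].sum()

# Condition 2: age at the first True row per pid (NaN if a pid has no True row)
first_true_age = a[a["label"]].groupby("pid")["age"].first().reindex(n_true.index)

# NaN < 45 evaluates to False, so pids without any True row fail automatically
condition = n_true.ge(2) & first_true_age.lt(45)
print(condition)
```

With the example data only pid 5 satisfies both conditions.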

Add a value in a column as a function of the timestamp and another column

The title may not be very clear, but with an example I hope it would make some sense.
I would like to create an output column (called "outputTics"), and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you can see, no row falls exactly 0.21 seconds after another one, so I put the 1 in the outputTics column two rows later: for example, at index 3 there is a 1 at 11.4 seconds, so I put a 1 in the output column at 11.6 seconds.
If there is another 1 in the "inputTics" column within those 0.21 seconds, do not put a 1 in the output column: an example is index 1 in the input column.
Here is an example of the red column I would like to create.
Here is the code to create the dataframe :
A = pd.DataFrame({"Timestamp":[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
"inputTics":[0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
"outputTics":[0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
If you want to avoid floating-point rounding, you can use pd.Timedelta, provided the timestamps are datetime-like.
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic in the following manner
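def set_output_tic(row):
    if row['inputTics'] == 0:
        return 0
    # Timestamp holds plain floats here, so the offset is added as a float
    # (switch to pd.Timedelta only if the column is datetime-like)
    t = row['Timestamp'] + 0.11
    # rows strictly after the current one, up to t
    indices = df[(df.Timestamp > row['Timestamp']) & (df.Timestamp <= t)].index
    # check for a 1 in input within 0.11 seconds; if there is one, do nothing
    if len(indices) == 0 or df.loc[indices, 'inputTics'].any():
        return 0
    target = indices[-1] + 1
    # only write inside the frame, and never on a row that has an input tic
    if target in df.index and df.loc[target, 'inputTics'] == 0:
        df.loc[target, 'outputTics'] = 1
    return 0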
then call the above function using df.apply
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics']==1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
    # Compare indices to full dataframe's timestamps
    # and return index of nearest timestamp
    oi = np.argmax((A.index - ii)>=0)
    output_indices.append(oi)
# Create column of output ticks with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to dataframe
A['outputTics'] = output_tics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values
A.loc[A['outputTics'] < 0, 'outputTics'] = 0
# The first row becomes 1 because of indexing; change to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0
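A loop-free variant can be sketched with np.searchsorted. This is only a sketch under stated assumptions: the delay of 0.11 is taken from the answers above (the question text says 0.21), and the rule "skip the output if another input tic occurs between the tic and its target row" is one reading of the expected column, not something the question states explicitly. The names delay, tic_rows, targets and quiet are illustrative.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "Timestamp": [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0,
                  12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13.0],
    "inputTics": [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
})

delay = 0.11  # assumption: value used in the answers above
ts = A["Timestamp"].to_numpy()
inp = A["inputTics"].to_numpy()

# Row positions of the input tics
tic_rows = np.flatnonzero(inp == 1)
# For each tic, position of the first row with Timestamp >= tic time + delay
targets = np.searchsorted(ts, ts[tic_rows] + delay, side="left")

# Drop tics whose target would fall past the end of the frame
valid = targets < len(ts)
tic_rows, targets = tic_rows[valid], targets[valid]

# Count input tics in the rows after each tic, up to and including its target
csum = np.cumsum(inp)
quiet = (csum[targets] - csum[tic_rows]) == 0

out = np.zeros(len(ts), dtype=int)
out[targets[quiet]] = 1
A["outputTics"] = out
print(A)
```

Dropping out-of-range targets up front avoids the special handling of the first row that the argmax approach needs.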

How do I label data based on the values of the previous row?

I want to label the data "1" if the current value is higher than that of the previous row and "0" otherwise.
Let's say I have this DataFrame:
df = pd.DataFrame({'date': [1,2,3,4,5], 'price': [50.125, 45.25, 65.857, 100.956, 77.4152]})
and I want the output as if the DataFrame is constructed like this:
df = pd.DataFrame({'date': [1,2,3,4,5], 'price': [50.125, 45.25, 65.857, 100.956, 77.4152], 'label':[0, 0, 1, 1, 0]})
*I don't know how to post a DataFrame
This code is my attempt:
df['label'] = 0
i = 0
for price in df['price']:
    i = i+1
    if price in i > price: #---> right now I am stuck here. It says argument of type 'int' is not iterable
        df.append['label', 1]
    elif price in i <= price:
        df.append['label', 0]
I think there are also other logical mistakes in my code. What am I doing wrong?
Create a boolean mask with Series.ge (>=) and Series.shift, then convert True/False to 1/0 with Series.view (note that Series.view is deprecated in recent pandas versions):
df['label'] = df['price'].ge(df['price'].shift()).view('i1')
Or by Series.astype:
df['label'] = df['price'].ge(df['price'].shift()).astype(int)
IIUC, you can use np.where with a shifted comparison to check whether each row's price is greater than or equal to the price in the row above.
df['label'] = np.where(df['price'].ge(df['price'].shift()),1,0)
print(df)
date price label
0 1 50.1250 0
1 2 45.2500 0
2 3 65.8570 1
3 4 100.9560 1
4 5 77.4152 0
Explanation:
print(df['price'].ge(df['price'].shift()))
returns a boolean Series of True and False values that we can use as the condition in np.where
0 False
1 False
2 True
3 True
4 False
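Note that ge also labels a row 1 when the price is equal to the previous one, while the question asks for "higher than". A minimal sketch of a strictly-greater variant, using Series.diff instead of shift (a swapped-in technique, not the one used in the answers above):

```python
import pandas as pd

df = pd.DataFrame({'date': [1, 2, 3, 4, 5],
                   'price': [50.125, 45.25, 65.857, 100.956, 77.4152]})

# diff() is the current price minus the previous price; the first row is NaN,
# and NaN > 0 is False, so the first row is labelled 0
df['label'] = df['price'].diff().gt(0).astype(int)
print(df)
```

For the sample data there are no ties, so both variants give the same labels.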
To explain what is happening in your code:
df['label'] = 0 fills the whole column with 0. For a loop it is easier to collect the labels in a plain Python list, e.g. labels = [], and assign that list to df['label'] at the end. (Assigning an empty list directly to the column fails because its length does not match the DataFrame.)
i is just a counter (1, 2, 3...) and not the value of the price at a specific index (50.125, 45.25, 65.857...), so it is not what you want to compare.
price in is used to check if the value of the price variable exists in a list that follows. The in statement isn't followed by a list, so you will get an error. You want to instead get the price value at a specific index and compare if it is greater or less than the value at the previous index.
The append method uses () and not [].
If you want to continue along your method of using a loop, the following can work:
labels = []
for i in range(len(df['price'])):
    # the first row has no previous row, so it is labelled 0
    if i > 0 and df['price'][i] > df['price'][i - 1]:
        labels.append(1)
    else:
        labels.append(0)
df['label'] = labels
What this does is loop through the range of the length of the price column. It then compares the values of the price at position i and position i - 1.
You can also further simplify the if/else statement using a ternary operator to:
labels.append(1 if i > 0 and df['price'][i] > df['price'][i - 1] else 0)
Working fiddle: https://repl.it/repls/HoarseImpressiveScientists

Pandas change column value if greater than len

I am trying to use an if statement to change values within a column if their length is greater than x.
My csv data ..
ID Test_Case TC_NUM
14581,dialog_testcase_4000.0134_mvp_not_understood-inprogress.xml,4000.0134
14582,dialog_testcase_4000.0135_mvp_not_understood-inprogress.xml,4000.0135
14583,dialog_testcase_4000.0136_mvp_not_understood-inprogress.xml,4000.0136
14584,dialog_testcase_4000.0137_mvp_not_understood_6.2.0-inprogress.xml,4000.01376.2.0
14585,dialog_testcase_4000.0138_mvp_not_understood_6.2.0-inprogress.xml,4000.01386.2.0
What I want:
ID Test_Case TC_NUM
14581,dialog_testcase_4000.0134_mvp_not_understood-inprogress.xml,4000.0134
14582,dialog_testcase_4000.0135_mvp_not_understood-inprogress.xml,4000.0135
14583,dialog_testcase_4000.0136_mvp_not_understood-inprogress.xml,4000.0136
14584,dialog_testcase_4000.0137_mvp_not_understood_6.2.0-inprogress.xml,4000.0137
14585,dialog_testcase_4000.0138_mvp_not_understood_6.2.0-inprogress.xml,4000.0138
My current code is able to extract some of the right columns, but messes up if additional numbers are in there.
df1['TC_NUM'] = df1['TC_NUM'].str.replace(r'[^0-9.]+', '')
df1['TC_NUM'] = df1['TC_NUM'].str[:-1]
My thought/attempt was to use an if statement to correct this.
if dfidtcnum(len['TC_NUM'] > 12):
    print "True"
IIUC you can use mask:
print (df.TC_NUM.str.len() > 9)
0 False
1 False
2 False
3 True
4 True
Name: TC_NUM, dtype: bool
df['TC_NUM'] = df.TC_NUM.mask(df.TC_NUM.str.len() > 9, df.TC_NUM.str[:-5])
print (df)
ID Test_Case TC_NUM
0 14581 dialog_testcase_4000.0134_mvp_not_understood-i... 4000.0134
1 14582 dialog_testcase_4000.0135_mvp_not_understood-i... 4000.0135
2 14583 dialog_testcase_4000.0136_mvp_not_understood-i... 4000.0136
3 14584 dialog_testcase_4000.0137_mvp_not_understood_6... 4000.0137
4 14585 dialog_testcase_4000.0138_mvp_not_understood_6... 4000.0138
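As an alternative to trimming by length, the number could be pulled straight out of the Test_Case name. This is a minimal sketch that assumes every file name contains the pattern testcase_<digits>.<digits>_, which holds for the sample rows but is not confirmed for the full data set.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [14581, 14582, 14583, 14584, 14585],
    "Test_Case": [
        "dialog_testcase_4000.0134_mvp_not_understood-inprogress.xml",
        "dialog_testcase_4000.0135_mvp_not_understood-inprogress.xml",
        "dialog_testcase_4000.0136_mvp_not_understood-inprogress.xml",
        "dialog_testcase_4000.0137_mvp_not_understood_6.2.0-inprogress.xml",
        "dialog_testcase_4000.0138_mvp_not_understood_6.2.0-inprogress.xml",
    ],
})

# Capture the number between "testcase_" and the following underscore;
# expand=False returns a Series because there is a single capture group
df["TC_NUM"] = df["Test_Case"].str.extract(r"testcase_(\d+\.\d+)_", expand=False)
print(df[["ID", "TC_NUM"]])
```

This avoids depending on a fixed string length, at the cost of depending on the file-name pattern.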

pandas filtering consecutive rows

I have a DataFrame with a Matrix column like this:
11034-A
11034-B
1120-A
1121-A
112570-A
113-A
113.558
113.787-A
113.787-B
114-A
11691-A
11691-B
117-A RRS
12 X R
12-476-AT-A
12-476-AT-B
I'd like to keep only the Matrix values that end with A or B when they are consecutive, so in the example above 11034-A and 11034-B, 113.787-A and 113.787-B, 11691-A and 11691-B, 12-476-AT-A and 12-476-AT-B.
I wrote a function that compares two such strings and returns True or False. The problem is that I fail to see how to use apply / applymap on consecutive rows:
def isAB(stringA, stringB):
    if stringA.endswith('A') and stringB.endswith('B') and stringA[:-1] == stringB[:-1]:
        return True
    else:
        return False
I tried df['result'] = isAB(df['Matrix'].str, df['Matrix'].shift().str) to no avail.
I seem to be missing something in the way I designed this.
edit :
I think this works, looks like I overcomplicated at 1st :
df['t'] = (df['Matrix'].str.endswith('A') & df['Matrix'].shift(-1).str.endswith('B')) | (df['Matrix'].str.endswith('B') & df['Matrix'].shift(1).str.endswith('A'))
df['p'] = (df['Matrix'].str[:-1] == df['Matrix'].shift(-1).str[:-1]) | (df['Matrix'].str[:-1] == df['Matrix'].shift(1).str[:-1])
df['e'] = df['p'] & df['t']
final = df[df['e']]
Here is how I would do it.
df['ShiftUp'] = df['matrix'].shift(-1)
df['ShiftDown'] = df['matrix'].shift()
def check_matrix(x):
    # compare everything except the last character with the neighbouring rows
    if pd.notnull(x.ShiftUp) and x.matrix[:-1] == x.ShiftUp[:-1]:
        return True
    elif pd.notnull(x.ShiftDown) and x.matrix[:-1] == x.ShiftDown[:-1]:
        return True
    else:
        return False
df['new'] = df.apply(check_matrix, axis=1)
df = df.drop(['ShiftUp', 'ShiftDown'], axis=1)
print(df)
prints
matrix new
0 11034-A True
1 11034-B True
2 1120-A False
3 1121-A False
4 112570-A False
5 113-A False
6 113.558 False
7 113.787-A True
8 113.787-B True
9 114-A False
10 11691-A True
11 11691-B True
12 117-A RRS False
13 12 X R False
14 12-476-AT-A True
15 12-476-AT-B True
Here's my solution, it requires a bit of work.
The strategy is the following: obtain a new column that has the same values as the current column but shifted one position.
Then, it's just a matter to check whether one column is A or B and the other one B or A.
Say your matrix column is called "column_name".
Then:
myl = ['11034-A',
'11034-B',
'1120-A',
'1121-A',
'112570-A',
'113-A',
'113.558',
'113.787-A',
'113.787-B',
'114-A',
'11691-A',
'11691-B',
'117-A RRS',
'12 X R',
'12-476-AT-A',
'12-476-AT-B']
#toy data frame
mydf = pd.DataFrame.from_dict({'column_name':myl})
#get a new series which is the same one as the original
#but the first entry contains "nothing"
new_series = pd.Series( ['nothing'] +
mydf['column_name'][:-1].values.tolist() )
#add it to the original dataframe
mydf['new_col'] = new_series
You then define a simple function:
def do_i_want_this_row(x,y):
    # only the last character of each value is compared
    left_char = x[-1]
    right_char = y[-1]
    return ((left_char == 'A') and (right_char == 'B')) or ((left_char == 'B') and (right_char=='A'))
and voila:
print(mydf[mydf.apply(lambda x: do_i_want_this_row( x.column_name, x.new_col), axis=1)])
column_name new_col
1 11034-B 11034-A
2 1120-A 11034-B
8 113.787-B 113.787-A
9 114-A 113.787-B
11 11691-B 11691-A
15 12-476-AT-B 12-476-AT-A
There is still the question of the last element, but I'm sure you can think of what to do with it if you decide to follow this strategy ;)
You can delete rows from a DataFrame using DataFrame.drop(labels, axis). To get a list of labels to delete, I would first get a list of pairs that match your criterion. With the labels from above in a list labels and your isAB function,
pairs = zip(labels[:-1], labels[1:])
# isAB takes two arguments, so unpack each pair when calling it
delete_pairs = [pair for pair in pairs if isAB(*pair)]
delete_labels = []
for a,b in delete_pairs:
    delete_labels.append(a)
    delete_labels.append(b)
Examine delete_labels to make sure you've put it together correctly,
print(delete_labels)
And finally, delete the rows. With the DataFrame in question as x,
x.drop(delete_labels) # or x.drop(delete_labels, axis) if appropriate
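Since the question asks to keep the consecutive A/B pairs rather than drop them, the same pairwise check can also build a boolean mask. This is a minimal sketch: the column name Matrix comes from the question, while the list keep and the frame final are illustrative names.

```python
import pandas as pd


def isAB(stringA, stringB):
    # True when the two strings differ only in a trailing 'A' versus 'B'
    return (stringA.endswith('A') and stringB.endswith('B')
            and stringA[:-1] == stringB[:-1])


labels = ['11034-A', '11034-B', '1120-A', '1121-A', '112570-A', '113-A',
          '113.558', '113.787-A', '113.787-B', '114-A', '11691-A', '11691-B',
          '117-A RRS', '12 X R', '12-476-AT-A', '12-476-AT-B']
df = pd.DataFrame({'Matrix': labels})

# Mark both rows of every consecutive pair that passes isAB
keep = [False] * len(labels)
for i, (a, b) in enumerate(zip(labels[:-1], labels[1:])):
    if isAB(a, b):
        keep[i] = True
        keep[i + 1] = True

final = df[keep]
print(final)
```

Using positions instead of label values means the mask still works if the same Matrix value appears more than once.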
