I want to replace the first 3 values following a 0 by iterating through the dataframe df: if the current row value df.iloc[i,0] is 0, the next three values become 0. After replacing the values, the dataframe iteration should skip the newly added values and continue from the next index - in the following example, from index 7.
If the last two values in the dataframe are 1, they should be replaced by 0 as well. Replacing two values only happens if these values are the last values; in the example, this is the case for the values with index 9 and 10.
original DataFrame:
index column 1
0 1
1 1
2 1
3 0
4 1
5 1
6 1
7 1
8 0
9 1
10 1
the new DataFrame what I want to have should look as follows:
index column 1
0 1
1 1
2 1
3 0
4 **0** --> new value
5 **0** --> new value
6 **0** --> new value
7 1
8 0
9 **0** --> new value
10 **0** --> new value
I tried this code, but it does not work:
for i in range(len(df)):
    print(df.iloc[i,0])
    if df.iloc[i,0] == 0:
        j = i + 1
        while j <= i + 3:
            df.iloc[j,1] = 0
            j = j + 1
        i = i + 4  # this is used to skip the new values and start at the next first index
    if (len(df) - i < 2) and (df.iloc[i,0] == 0):  # replacing the two last values by 0 if the previous value is 0
        j = i + 1
        while j <= len(df):
            df.iloc[j,1] = 0
There are several issues in your code worth fixing and improving.
First, it is usually not a good idea to use a for i in range(len(df)): loop; that is not the Pandas way. In Python you loop like
for i, colmn_value in enumerate(df[colmn_name]):
if you definitely need the index (in most cases, including this one in your question, you don't), or with
for colmn_value in df[colmn_name]:
(Note that len(df) counts rows, while df.size counts rows times columns; the two coincide only for a single-column DataFrame like this one.)
At the bottom I have provided your corrected code, which now works. The issues I fixed to make it run are explained in comments in the code, so check them out. They are the usual 'traps' a beginner runs into while learning to code; the main idea was the right one.
You seem to already have programming experience in another language like C or C++, but don't expect a for i in range(N): loop in Python to behave like a C loop that increments an index variable on each iteration, which you could then change inside the loop to skip indices. A Python for loop takes its values from range(), enumerate() or another iterable, and assigning to the loop variable inside the loop has no effect on the next iteration. If you want to change the index within the loop, use a Python while loop.
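A minimal sketch of that skip pattern with a while loop (plain Python on a list, not Pandas-specific, just to show that index manipulation works here):

```python
values = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
i = 0
while i < len(values):
    if values[i] == 0:
        # zero out the next three entries (or fewer at the end of the list)
        for j in range(i + 1, min(i + 4, len(values))):
            values[j] = 0
        i += 4          # skipping really works here, unlike in a for loop
    else:
        i += 1
print(values)  # [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```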
The code below solves the same task in two versions (a longer one, not the Pandas way, and another doing the same the Pandas way). Both use the 'trick' of counting down the replacements from 3 to 0 once a zero value is detected in the column, replacing values only while the countdown is nonzero.
Change VERBOSE to False to switch off the printed lines that show how the code works under the hood. The code is otherwise mostly self-explanatory.
VERBOSE = True
if VERBOSE: new_colmn_value = "**0**"
else:       new_colmn_value = 0
new_colmn = []
countdown = 0
for df_colmn_val in df.iloc[:, 0]:   # i.e. "column 1"
    new_colmn.append(new_colmn_value if countdown else df_colmn_val)
    if VERBOSE:
        print(f'{df_colmn_val=}, {countdown=}, new_colmn={new_colmn_value if countdown else df_colmn_val}')
    if df_colmn_val == 0 and not countdown:
        countdown = 4
    if countdown: countdown -= 1
df.iloc[:, [0]] = new_colmn   # same as df['column 1'] = new_colmn
print(df)
gives:
df_colmn_val=1, countdown=0, new_colmn=1
df_colmn_val=1, countdown=0, new_colmn=1
df_colmn_val=1, countdown=0, new_colmn=1
df_colmn_val=0, countdown=0, new_colmn=0
df_colmn_val=1, countdown=3, new_colmn=**0**
df_colmn_val=1, countdown=2, new_colmn=**0**
df_colmn_val=1, countdown=1, new_colmn=**0**
df_colmn_val=1, countdown=0, new_colmn=1
df_colmn_val=0, countdown=0, new_colmn=0
df_colmn_val=1, countdown=3, new_colmn=**0**
df_colmn_val=1, countdown=2, new_colmn=**0**
column 1
index
0 1
1 1
2 1
3 0
4 **0**
5 **0**
6 **0**
7 1
8 0
9 **0**
10 **0**
And now the Pandas way of doing the same:
ct = 0; nv = '*0*'
def ctF(row):
    global ct                            # the countdown counter
    r0 = row.iloc[0]                     # column 0 value in the row of the dataframe
    row.iloc[0] = nv if ct else r0       # assign new or old value depending on counter
    if ct: ct -= 1                       # decrease the counter if not yet zero
    else:  ct = 3 if r0 == 0 else 0      # start the countdown if the row holds a zero
df.apply(ctF, axis=1)                    # axis=1: work on rows (and not on columns)
print(df)
print(df)
The code above uses the Pandas .apply() method, which passes a row of the DataFrame to the ctF function; the function then works on the row, assigning new values to its elements if necessary. The looping over the rows is thus done outside the Python level, which is usually faster for large DataFrames. A global variable in ctF makes sure that each call knows the countdown value set in the previous call. .apply() also returns a column of values (not used in the code above) which could, for example, be added as a new column to the DataFrame df. One caveat: mutating the passed row inside .apply() relies on the row being a view into the DataFrame, which is not guaranteed across pandas versions; assigning the returned result back is more robust.
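A fully vectorized alternative for the same task (a sketch; it assumes, as in the example data, that two original zeros are never closer than four rows apart - otherwise the countdown logic and this mask can differ):

```python
import pandas as pd

df = pd.DataFrame({'column 1': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})
zero = df['column 1'].eq(0)
# mark the three positions following each original zero
mask = (zero.shift(1, fill_value=False)
        | zero.shift(2, fill_value=False)
        | zero.shift(3, fill_value=False))
df.loc[mask, 'column 1'] = 0
print(df['column 1'].tolist())  # [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```

Because the mask is computed from the original column before any assignment, the freshly written zeros cannot re-trigger the replacement, which mirrors the skip/countdown behaviour.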
Below is your own code, fixed so that it now runs and does what it was written for:
for i in range(len(df)):
    print(df.iloc[i, 0])
    if df.iloc[i, 0] == 0:
        j = i + 1
        while (j <= i + 3) and j < df.size:   # handles table end !!!
            print(f'{i=} {j=}')
            df.iloc[j, 0] = '**0**'           # first column has index 0 !!!
            j = j + 1
        # i = i + 4  # this is used to skip the new values and start at the next first index
        # !!! changing i in the loop will NOT do what you expect it to do !!!
        # the next i will be just i+1, getting its value from range(), and NOT i+4
this_is_not_necessary_as_it_is_handled_already_above = """
if (len(df) - i < 2) and (df.iloc[i, 0] == 0):  # replacing the two last values by 0 if the previous value is 0.
    j = i + 1
    while j <= len(df):
        df.iloc[j, 1] = 0
"""
printing:
1
1
1
0
i=3 j=4
i=3 j=5
i=3 j=6
**0**
**0**
**0**
1
0
i=8 j=9
i=8 j=10
**0**
**0**
column 1
index
0 1
1 1
2 1
3 0
4 **0**
5 **0**
6 **0**
7 1
8 0
9 **0**
10 **0**
Related
I have 2 columns that represent the on-switch and off-switch indicators. I want to create a column called 'last switch' that keeps a record of the last direction of the switch (whether it is on or off). An additional condition is that if both the on and off values are 1 for a particular row, then the 'last switch' output should be the opposite sign of the previous last switch. Currently I have managed to find a solution that is almost correct, except for the situation where both on and off are 1, which makes my code wrong.
I also attached a screenshot with the desired output. Any help is appreciated.
df=pd.DataFrame([[1,0],[1,0],[0,1],[0,1],[0,0],[0,0],[1,0],[1,1],[0,1],[1,0],[1,1],[1,1],[0,1]], columns=['on','off'])
df['last_switch']=(df['on']-df['off']).replace(0,method='ffill')
Add the following lines to your existing code:
for i in range(df.shape[0]):
    df['prev'] = df['last_switch'].shift()
    df.loc[(df['on'] == 1) & (df['off'] == 1), 'last_switch'] = df['prev'] * (-1)
    df.drop('prev', axis=1, inplace=True)
df['last_switch'] = df['last_switch'].astype(int)
Output:
on off last_switch
0 1 0 1
1 1 0 1
2 0 1 -1
3 0 1 -1
4 0 0 -1
5 0 0 -1
6 1 0 1
7 1 1 -1
8 0 1 -1
9 1 0 1
10 1 1 -1
11 1 1 1
12 0 1 -1
Let me know if you need an explanation of the code.
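The loop above repeats the shift-and-fix step df.shape[0] times so that corrections can propagate through consecutive on=off=1 rows. A sketch of the same logic that instead stops as soon as the column no longer changes (the zero-forward-fill step is rewritten with mask/ffill, which is equivalent to the replace call above):

```python
import pandas as pd

df = pd.DataFrame([[1,0],[1,0],[0,1],[0,1],[0,0],[0,0],[1,0],
                   [1,1],[0,1],[1,0],[1,1],[1,1],[0,1]], columns=['on', 'off'])
s = df['on'] - df['off']
df['last_switch'] = s.mask(s == 0).ffill()   # forward-fill the 0 (no-switch) rows

both_on = (df['on'] == 1) & (df['off'] == 1)
while True:
    # flip the sign of the previous value wherever both switches are 1
    fixed = df['last_switch'].where(~both_on, -df['last_switch'].shift())
    if fixed.equals(df['last_switch']):      # converged: nothing changed any more
        break
    df['last_switch'] = fixed
df['last_switch'] = df['last_switch'].astype(int)
```

For this data the loop converges after three passes and reproduces the output shown above.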
df=pd.DataFrame([[1,0],[1,0],[0,1],[0,1],[0,0],[0,0],[1,0],[1,1],[0,1],[1,0],[1,1],[1,1],[0,1]], columns=['on','off'])
df['last_switch']=(df['on']-df['off']).replace(0,method='ffill')
prev_row = None
def apply_logic(row):
    global prev_row
    if prev_row is not None:
        if (row["on"] == 1) and (row["off"] == 1):
            row["last_switch"] = -prev_row["last_switch"]
    prev_row = row.copy()
    return row
df = df.apply(apply_logic, axis=1)  # assign the result back; apply does not modify df in place
Personally, I am not a big fan of looping over a dataframe. shift won't work in this case, because the "last_switch" column is dynamic and subject to change based on the on & off status.
Using your intermediate result with apply, while carrying over the value from the previous row, should do the trick. Hope it makes sense.
Assuming a dataframe like this
In [5]: data = pd.DataFrame([[9,4],[5,4],[1,3],[26,7]])
In [6]: data
Out[6]:
0 1
0 9 4
1 5 4
2 1 3
3 26 7
I want to count how many times the values in a rolling window/slice of 2 on column 0 are greater than or equal to the value in column 1 (4).
For the first 4 in column 1, a slice of 2 on column 0 yields 5 and 1, so the output would be 1, since 5 is greater than 4 but 1 is not; for the second 4, the next slice values on column 0 are 1 and 26, so the output would be 1 again, because only 26 is greater than 4 and 1 is not. I can't use a rolling window, since iterating through rolling-window values is not implemented.
I need something like a slice of the next n rows that I can iterate over, to compare and count how many of the values in that slice are above the current row's value.
I have done this using lists instead of working in the data frame directly. Check the code below (note that the columns of the example frame are the integers 0 and 1, so they are accessed as data[0] and data[1], not data['0']):
list1, list2 = data[0].values.tolist(), data[1].values.tolist()
outList = []
for ix in range(len(list1)):
    if ix < len(list1) - 2:
        if list2[ix] < list1[ix + 1] and list2[ix] < list1[ix + 2]:
            outList.append(2)
        elif list2[ix] < list1[ix + 1] or list2[ix] < list1[ix + 2]:
            outList.append(1)
        else:
            outList.append(0)
    else:
        outList.append(0)
data['2_rows_forward_moving_tag'] = pd.Series(outList)
Output:
0 1 2_rows_forward_moving_tag
0 9 4 1
1 5 4 1
2 1 3 0
3 26 7 0
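The same forward-looking count can be sketched without intermediate lists, using shifted comparisons (this mirrors the strict < used in the loop above, and zeroes the last two rows to match its behaviour of only counting complete 2-row windows):

```python
import pandas as pd

data = pd.DataFrame([[9, 4], [5, 4], [1, 3], [26, 7]])
# compare each row's column-1 value against the next two column-0 values
cnt = ((data[0].shift(-1) > data[1]).astype(int)
       + (data[0].shift(-2) > data[1]).astype(int))
cnt.iloc[-2:] = 0   # the loop version only counts complete 2-row windows
data['2_rows_forward_moving_tag'] = cnt
print(data['2_rows_forward_moving_tag'].tolist())  # [1, 1, 0, 0]
```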
I have a df named value, of size 567, and it has a column index as follows:
index
96.875
96.6796875
96.58203125
96.38671875
95.80078125
94.7265625
94.62890625
94.3359375
58.88671875
58.7890625
58.69140625
58.59375
58.49609375
58.3984375
58.30078125
58.203125
I also have 2 additional variables:
mu = 56.80877955613938
sigma= 17.78935620293665
What I want is to check the values in the index column. If a value is greater than, say, mu + 3*sigma, a new column named alarm must be added to the value df, with a value of 4 entered for that row.
I tried:
for i in value['index']:
    if (i >= mu+3*sigma):
        value['alarm'] = 4
    elif ((i < mu+3*sigma) and (i >= mu+2*sigma)):
        value['alarm'] = 3
    elif ((i < mu+2*sigma) and (i >= mu+sigma)):
        value['alarm'] = 2
    elif ((i < mu+sigma) and (i >= mu)):
        value['alarm'] = 1
But it creates an alarm column and fills it completely with 1.
What is the mistake I am making here?
Expected output:
index alarm
96.875 3
96.6796875 3
96.58203125 3
96.38671875 3
95.80078125 3
94.7265625 3
94.62890625 3
94.3359375 3
58.88671875 1
58.7890625 1
58.69140625 1
58.59375 1
58.49609375 1
58.3984375 1
58.30078125 1
58.203125 1
The mistake in your loop is that value['alarm'] = 4 assigns that value to the entire column on every iteration, so the column ends up holding whatever the branch for the last row produced. With multiple conditions like these, you don't want to loop through your dataframe with if/elif/else at all. A better solution is np.select, where we define conditions and, based on those conditions, choices:
conditions = [
    value['index'] >= mu+3*sigma,
    (value['index'] < mu+3*sigma) & (value['index'] >= mu+2*sigma),
    (value['index'] < mu+2*sigma) & (value['index'] >= mu+sigma),
]
choices = [4, 3, 2]
value['alarm'] = np.select(conditions, choices, default=1)
value
value
alarm
index
96.875000 3
96.679688 3
96.582031 3
96.386719 3
95.800781 3
94.726562 3
94.628906 3
94.335938 3
58.886719 1
58.789062 1
58.691406 1
58.593750 1
58.496094 1
58.398438 1
58.300781 1
58.203125 1
If you have 10 minutes, here's a good post by cs95 explaining why looping over a dataframe is bad practice.
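For ordered thresholds like these, pd.cut can express the same binning in a single call (a sketch on a two-row sample; the bin edges are chosen to reproduce np.select's default=1 for everything below mu + sigma):

```python
import numpy as np
import pandas as pd

mu, sigma = 56.80877955613938, 17.78935620293665
value = pd.DataFrame({'index': [96.875, 58.88671875]})
# half-open bins (lo, hi] mapped to alarm levels 1..4
value['alarm'] = pd.cut(value['index'],
                        bins=[-np.inf, mu + sigma, mu + 2*sigma, mu + 3*sigma, np.inf],
                        labels=[1, 2, 3, 4]).astype(int)
print(value['alarm'].tolist())  # [3, 1]
```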
I have a very big dataframe (20,000,000+ rows) that contains a column called 'sequence', amongst others.
The 'sequence' column is calculated from a time series by applying a few conditional statements. The value 2 flags the start of a sequence, the value 3 flags the end of a sequence, the value 1 flags a datapoint within the sequence, and the value 4 flags datapoints that need to be ignored. (Note: the flag values don't necessarily have to be 1, 2, 3, 4.)
What I want to achieve is a continuous ID value (written to a separate column - see 'desired_Id_Output' in the example below) that labels the slices of sequences from 2 to 3 in a unique fashion (the length of a sequence is variable, ranging from 2 [start + end only] to 5000+ datapoints), so that I can do further groupby calculations on the individual sequences.
index sequence desired_Id_Output
0 2 1
1 1 1
2 1 1
3 1 1
4 1 1
5 3 1
6 2 2
7 1 2
8 1 2
9 3 2
10 4 NaN
11 4 NaN
12 2 3
13 3 3
Thanks in advance and BR!
I can't think of anything better than the "dumb" solution of looping through the entire thing, something like this:
import numpy as np
counter = 0
tmp = np.empty_like(df['sequence'].values, dtype=float)  # np.float is deprecated; use plain float
for i in range(len(tmp)):
    if df['sequence'][i] == 4:
        tmp[i] = np.nan
    else:
        if df['sequence'][i] == 2:
            counter += 1
        tmp[i] = counter
df['desired_Id_output'] = tmp
Of course this is going to be pretty slow with a 20M-sized DataFrame. One way to improve this is via just-in-time compilation with numba:
import numba

@numba.njit
def foo(sequence):
    # put in an appropriate modification of the above code block
    return tmp
and call this with argument df['sequence'].values.
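Filling in that stub, a complete sketch might look like this (make_ids is a name chosen here, not from the post; numba is treated as optional so the function also runs as plain Python where it is not installed):

```python
import numpy as np
try:
    import numba
    jit = numba.njit
except ImportError:          # numba is optional; fall back to plain Python
    jit = lambda f: f

@jit
def make_ids(sequence):
    # same logic as the loop above: count sequence starts, NaN out the 4s
    counter = 0
    tmp = np.empty(sequence.shape[0], dtype=np.float64)
    for i in range(sequence.shape[0]):
        if sequence[i] == 4:
            tmp[i] = np.nan              # ignored datapoints
        else:
            if sequence[i] == 2:         # a sequence start bumps the id
                counter += 1
            tmp[i] = counter
    return tmp

# usage: df['desired_Id_output'] = make_ids(df['sequence'].values)
```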
Does it work to count the sequence starts, and then just set the ignore values (flag 4) afterwards? Like this:
import numpy
sequence_starts = df.sequence == 2
sequence_ignore = df.sequence == 4
sequence_id = sequence_starts.cumsum()
sequence_id[sequence_ignore] = numpy.nan
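Applied to the example data from the question, this cumsum approach reproduces the desired column (a quick check; astype(float) is added so that assigning NaN does not fight the integer dtype of cumsum):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sequence': [2, 1, 1, 1, 1, 3, 2, 1, 1, 3, 4, 4, 2, 3]})
sequence_starts = df.sequence == 2
sequence_ignore = df.sequence == 4
sequence_id = sequence_starts.cumsum().astype(float)  # id grows by 1 at every start
sequence_id[sequence_ignore] = np.nan                 # blank out the ignored rows
df['desired_Id_Output'] = sequence_id
```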
I have a dataframe. For the column incr, the numbers are 0 1 1 1 -1 0 1 1 .... I would like to parse the list and get, for each datapoint, how many times the data list has increased (or, more exactly, not decreased); a decrease at a point resets the output to zero for that datapoint. For example, for the list (named output['inc_adj'] in the code)
0 1 1 1 -1 0 1 -1
I should get (named output['cont_inc'] in the code)
1 2 3 4 0 1 2 0
I wrote the following code, but it is very inefficient. Any suggestions for improving the efficiency significantly? It feels as if I keep reloading the CPU cache between the two loops (if my feeling is correct), but I could not find a better solution at the current stage.
output['cont_inc'] = 0
for i in xrange(1, output['inc_adj'].count()):
    j = i
    while output['inc_adj'][j] != -1:
        # for both increase or unchanged
        output['cont_inc'][i] += 1
        j -= 1
Thanks in advance!
If memory allows, I would suggest building a list with all the adjacent value pairs for comparison to start with (in my sample using zip), appending each result to a new list, and re-assigning the whole result list back to the DataFrame after completion.
Although it sounds odd, in reality it improves the performance a little by eliminating the overhead of constant DataFrame index/value lookups:
import pandas as pd
import random
# random DataFrame with values from -1 to 2
df = pd.DataFrame([random.randint(-1, 2) for _ in xrange(999)], columns=['inc_adj'])
df['cont_inc'] = 0
def calc_inc(df):
    inc = [1]
    # I use zip to PREPARE the adjacent values
    for i, n in enumerate(zip(df['inc_adj'][1:], df['inc_adj'][:-1]), 0):
        if n[0] >= n[1]:
            inc.append(inc[i] + 1)
            continue
        inc.append(0)
    df['cont_inc'] = inc
calc_inc(df)
df.head()
inc_adj cont_inc
0 0 1
1 0 2
2 1 3
3 -1 0
4 0 1
%timeit calc_inc(df)
1000 loops, best of 3: 696 µs per loop
As a comparison, here is the same logic coded with indexing/lookup and in-place assignment:
def calc_inc_using_ix(df):
    for idx, row in df.iterrows():
        try:
            if row['inc_adj'] >= df['inc_adj'][idx-1]:
                row['cont_inc'] = df['cont_inc'][idx-1] + 1
                continue
            row['cont_inc'] = 0
        except KeyError:
            row['cont_inc'] = 1
calc_inc_using_ix(df)
df.head()
inc_adj cont_inc
0 0 1
1 1 2
2 1 3
3 0 0
4 2 1
%timeit calc_inc_using_ix(df)
10 loops, best of 3: 58.5 ms per loop
That said, I'm also interested in any other solutions that would further improve the performance; always willing to learn.
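Since further improvements are welcome here: the reset-on-decrease counter can also be written fully vectorized with a cumsum/groupby trick (a sketch for Python 3 and current pandas; the + (grp == 0) term reproduces the convention above that the very first run starts counting at 1, while runs after a reset start at 0):

```python
import pandas as pd

s = pd.Series([0, 1, 1, 1, -1, 0, 1, -1], name='inc_adj')
reset = s < s.shift()                 # True exactly where the value decreased
grp = reset.cumsum()                  # run id, bumped at every decrease
cont = s.groupby(grp).cumcount() + (grp == 0).astype(int)
print(cont.tolist())                  # [1, 2, 3, 4, 0, 1, 2, 0]
```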