Dataframe python where conditions considering previous rows condition - python

:)
I have a dataframe like this (an extract of the full dataframe):
a b
1 1
6 3
7 5
1 7
12 5
12 5
2 5
95 2
44 3
I want to create a new column using NumPy in Python, based on multiple where conditions that also depend on earlier rows. Let me explain with an example:
I want to create column 'c' with value 1 when:
(a > b) and (the previous row's a < b) and (the most recent non-empty value of 'c' is 2)
The other condition is 'c' = 2 when:
(a < b) and (the most recent non-empty value of 'c' is 1)
Thank you!

You can use np.select, which returns an array drawn from the elements of choicelist depending on the conditions.
Use:
import numpy as np
df['c'] = '' # --> assign initial value
conditions = [
(df['a'].gt(df['b']) & df['a'].shift().lt(df['b'])) & (df['c'].shift().eq('') | df['c'].shift().eq(2)),
df['a'].lt(df['b']) & (df['c'].shift().eq(1) | df['c'].shift().eq(''))
]
choices = [1, 2]
df['c'] = np.select(conditions, choices, default='')
print(df)
This prints:
a b c
0 1 1
1 6 3 1
2 7 5
3 1 7 2
4 12 5 1
5 12 5
6 2 5 2
7 95 2
8 44 3
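Note that column c is still empty when the np.select conditions are evaluated, so the checks on df['c'].shift() cannot see values assigned by the same call. If the dependency on the previous value of c matters, one option is an explicit loop that carries that state. The following is a minimal sketch under assumptions not stated in the question: the helper name label_rows is hypothetical, and either rule is allowed to fire while no earlier value of c exists.

```python
import pandas as pd

def label_rows(frame):
    """Assign 1 or 2 row by row, remembering the last non-empty label (hypothetical helper)."""
    labels = []
    last_label = None  # assumption: with no earlier label, either rule may fire
    prev_a = None      # value of column a in the previous row
    for a, b in zip(frame["a"], frame["b"]):
        label = ""
        if prev_a is not None and a > b and prev_a < b and last_label in (None, 2):
            label = 1
        elif a < b and last_label in (None, 1):
            label = 2
        if label != "":
            last_label = label
        labels.append(label)
        prev_a = a
    return labels

df = pd.DataFrame({"a": [1, 6, 7, 1, 12, 12, 2, 95, 44],
                   "b": [1, 3, 5, 7, 5, 5, 5, 2, 3]})
df["c"] = label_rows(df)
print(df)
```

On the sample data this gives the same labels as the np.select output above, but the rule for the previous value of c is now actually enforced.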

Related

How to create a new column based on a condition in another column

In pandas, How can I create a new column B based on a column A in df, such that:
B=1 if A_(i+1)-A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1)-A_(i) <= 5
However, the first B_i value is always one
Example:
A B
5 1 (the first B_i)
12 1
14 0
22 1
20 0
33 1
Use diff, compare the result with your value using le, invert it, and convert the boolean to int:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB. diff gives NaN for the first row and NaN.le(5) is False, so inverting the le(5) comparison yields 1 for the first value
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].le(10)).astype(int)
output: same as above with the provided data
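As a sketch of why the inversion matters: diff leaves NaN in the first row, and any comparison against NaN is False. The example below rebuilds the sample data and checks the first row both ways; the names step, first_inverted and first_direct are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 12, 14, 22, 20, 33]})
step = df["A"].diff()                    # NaN in the first row

first_inverted = (~step.le(5)).iloc[0]   # NaN <= 5 is False, inverted to True
first_direct = step.gt(5).iloc[0]        # NaN > 5 is False, so the first row would be 0

# Inverted comparison combined with the second condition from the question
df["B"] = (~step.le(5) | df["A"].le(10)).astype(int)
print(df)
```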
I was a little confused by your row numbering: if B_i is computed from A_(i+1)-A_(i), the missing value should fall on the last row rather than the first (the first row has both A_(i) and A_(i+1), while the last row has no A_(i+1)).
Anyway, based on your example I assumed that we calculate B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"],data=[5,12,14,22,20,33])
df['shifted_A'] = df['A'].shift(1) #This line can be removed - it was added only to show how shift works in the final dataframe
df['B']=''
df.loc[((df['A']-df['A'].shift(1))>5) + (df['A'].shift(1)<=10), 'B']=1 #Update rows that fulfill one of conditions with 1
df.loc[(df['A']-df['A'].shift(1))<=5, 'B']=0 #Update rows that fulfill condition with 0
df.loc[df.index==0, 'B']=1 #Update first row in B column
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure it is the fastest way, but I think it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname]=newvalue updates the value in the given column wherever the condition (mask) is fulfilled.
((df['A']-df['A'].shift(1))>5) + (df['A'].shift(1)<=10)
Each condition here returns True or False. Adding them gives True if either one is True (which is simply OR). If we need AND, we can multiply the conditions.
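A small check of the claim above, assuming two boolean Series built from the same sample data: + and * on boolean masks give the same result as the more conventional | and & operators. The names big_step and small_prev are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 12, 14, 22, 20, 33]})
big_step = (df["A"] - df["A"].shift(1)) > 5   # difference to the previous row exceeds 5
small_prev = df["A"].shift(1) <= 10           # previous row is at most 10

either_added = big_step + small_prev          # addition of booleans acts as OR
either_or = big_step | small_prev
both_multiplied = big_step * small_prev       # multiplication of booleans acts as AND
both_and = big_step & small_prev

print((either_added == either_or).all())
print((both_multiplied == both_and).all())
```

The | and & spellings are the usual way to write element-wise OR and AND on masks, and they tend to make the intent clearer to readers.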
Use Series.diff, replace the first missing value with N so that the greater-or-equal comparison by Series.ge yields 1 for it:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1

pandas, add a constant to a column based on a condition on another column

I have a data frame :
A B
1 2
4 3
5 9
6 7
9 7
I want to check whether the values in column A are divisible by 2 (odd/even check); if they are, I want to add 18 to the value in column B.
So far I have been able to check if value in column A is divisible by 2 and extract it.
df = df[df['A'] % 2 == 0]
Thanks
df['A']%2==0 returns a boolean Series that is True where A is divisible by 2; the corresponding values of B are then updated:
df.loc[df['A']%2==0, 'B'] = df['B'] + 18
df
A B
0 1 2
1 4 21
2 5 9
3 6 25
4 9 7
Let's try:
df['B']=np.where(df['A'] % 2 == 0,df.B.add(18),df.B)
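For reference, a self-contained sketch of both approaches on the sample data. The variable names is_even, with_loc and with_where are illustrative assumptions, and each approach works on its own copy so the results can be compared.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 5, 6, 9], "B": [2, 3, 9, 7, 7]})
is_even = df["A"] % 2 == 0   # True where A is divisible by 2

# Approach 1: update only the selected rows in place
with_loc = df.copy()
with_loc.loc[is_even, "B"] += 18

# Approach 2: build the whole column with np.where
with_where = df.copy()
with_where["B"] = np.where(is_even, with_where["B"] + 18, with_where["B"])

print(with_loc)
print(with_where)
```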

Iterate over a groupby dataframe to operate in each row

I have a DataFrame like this:
subject trial attended
0 1 1 1
1 1 3 0
2 1 4 1
3 1 7 0
4 1 8 1
5 2 1 1
6 2 2 1
7 2 6 1
8 2 8 0
9 2 9 1
10 2 11 1
11 2 12 1
12 2 13 1
13 2 14 1
14 2 15 1
I would like to GroupBy subject.
Then iterate in each row of the GroupBy dataframe.
If for a row 'attended' == 1, then to increase a variable sum_reactive by 1.
If the sum_reactive variable reaches == 4, then to add in a dictionary the 'subject' and 'trial' in which the variable sum_reactive reached a value of 4.
I was trying to define a function for this, but it doesn't work:
def count_attended():
    sum_reactive = 0
    dict_attended = {}
    for i, g in reactive.groupby(['subject']):
        for row in g:
            if g['attended'][row] == 1:
                sum_reactive += 1
                if sum_reactive == 4:
                    dict_attended.update({g['subject'] : g['trial'][row]})
                    return dict_attended
    return dict_attended
I think it is not clear to me how to iterate inside each GroupBy dataframe. I'm quite new to pandas.
IIUC, try:
df = df.query('attended == 1')
df.loc[df.groupby('subject')['attended'].cumsum() == 4, ['subject', 'trial']].to_dict(orient='records')
Output:
[{'subject': 2, 'trial': 9}]
Using groupby with cumsum counts the attended trials per subject; comparing that count with 4 creates a boolean Series. You can use this boolean Series for boolean indexing to filter the dataframe down to the matching rows. Lastly, loc with a column list selects subject and trial.
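The same steps as a self-contained sketch, rebuilding the sample data from the question. The frame name reactive comes from the question's own attempt; attended, count and result are illustrative names.

```python
import pandas as pd

reactive = pd.DataFrame({
    "subject":  [1] * 5 + [2] * 10,
    "trial":    [1, 3, 4, 7, 8, 1, 2, 6, 8, 9, 11, 12, 13, 14, 15],
    "attended": [1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
})

# Keep only attended trials, then count them per subject
attended = reactive[reactive["attended"] == 1]
count = attended.groupby("subject")["attended"].cumsum()

# Rows where the running count hits exactly 4
result = attended.loc[count == 4, ["subject", "trial"]].to_dict(orient="records")
print(result)
```

Subject 1 has only three attended trials, so it never reaches a count of 4 and does not appear in the result.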

Get rows from Pandas DataFrame from index until condition

Let's say I have a Pandas DataFrame:
x = pd.DataFrame(data=[5,4,3,2,1,0,1,2,3,4,5],columns=['value'])
x
Out[9]:
value
0 5
1 4
2 3
3 2
4 1
5 0
6 1
7 2
8 3
9 4
10 5
Now, I want to, given an index, find rows in x until a condition is met.
For example, if index = 2:
x.loc[2]
Out[14]:
value 3
Name: 2, dtype: int64
Now I want to, from that index, find the next n rows where the value is greater than some threshold. For example, if the threshold is 0, the results should be:
x
Out[9]:
value
2 3
3 2
4 1
5 0
How can I do this?
I have tried:
x.loc[2:x['value']>0,:]
But of course this will not work because x['value']>0 returns a boolean array of:
Out[20]:
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 True
10 True
Name: value, dtype: bool
Using idxmin and slicing
x.loc[2:x['value'].gt(0).idxmin(),:]
2 3
3 2
4 1
5 0
Name: value
Edit:
For a general formula, use
index = 7
threshold = 2
x.loc[index:x.loc[index:,'value'].gt(threshold).idxmin(),:]
From your description in the comments, it seems you want to begin from index+1 rather than index. If that is the case, just use
x.loc[index+1:x.loc[index+1:,'value'].gt(threshold).idxmin(),:]
You want to filter for index greater than or equal to your index=2, and for x['value']>=threshold, and then select the first n of these rows, which can be accomplished with .head(n).
Say:
idx = 2
threshold = 0
n = 4
x[(x.index>=idx) & (x['value']>=threshold)].head(n)
Out:
# value
# 2 3
# 3 2
# 4 1
# 5 0
Edit: changed to >=, and updated example to match OP's example.
Edit 2 due to clarification from OP: since n is unknown:
idx = 2
threshold = 0
x.loc[idx:(x['value']<=threshold).loc[x.index>=idx].idxmax()]
This is selecting from the starting idx, in this case idx=2, up to and including the first row where the condition is not met (in this case index 5).
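One caveat with idxmin and idxmax on a boolean Series: if the condition never fails after the starting index, they return the first label of the slice, so only one row comes back. A minimal sketch that guards against this, assuming a hypothetical helper name rows_until and that the whole remainder should be returned in that case:

```python
import pandas as pd

def rows_until(frame, start, threshold, column="value"):
    """Rows from `start` up to and including the first row with value <= threshold (hypothetical helper)."""
    tail = frame.loc[start:, column]
    stops = tail.le(threshold)
    if not stops.any():
        # Assumption: with no stopping row, return everything from `start` onward
        return frame.loc[start:]
    return frame.loc[start:stops.idxmax()]

x = pd.DataFrame(data=[5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5], columns=["value"])
print(rows_until(x, 2, 0))
print(rows_until(x, 7, 0))
```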

Pandas - Delete only contiguous rows that equal zero

I have a large time series df (2.5 million rows) that contains 0 values in some rows, some of which are legitimate. However, if there are repeated consecutive occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] from the middle and leave the remaining 0s, to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of the run of zeros that triggers deletion is a parameter that has to be set - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if it is 0 and either the previous or the next row in the same column is 0. You can use shift to look at the previous and next values and compare them with the current value as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive zeros
Following the example in the link, add a new column that tracks the length of each consecutive run and then check it when filtering:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive>2) & (df.ColA==0))]
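The run-length idea can be sketched as a small function with the minimum run length as a parameter. The helper name drop_zero_runs and the parameter min_run are assumptions for illustration; min_run=3 corresponds to "more than 2" in the question.

```python
import pandas as pd

def drop_zero_runs(series, min_run=3):
    """Drop zeros that belong to a run of at least `min_run` consecutive zeros (hypothetical helper)."""
    is_zero = series.eq(0)
    # A new run starts whenever the zero/non-zero flag differs from the previous row
    run_id = is_zero.ne(is_zero.shift()).cumsum()
    run_len = is_zero.groupby(run_id).transform("size")
    return series[~(is_zero & run_len.ge(min_run))]

col_a = pd.Series([1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9])
cleaned = drop_zero_runs(col_a)
print(cleaned.tolist())
```

Isolated zeros keep their original index labels, so the result can still be aligned with the other columns of the dataframe.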
We need to build a new helper column here, then use drop_duplicates:
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0) flags the values equal to 0
# diff().ne(0).cumsum() starts a new group number each time that flag changes, so consecutive rows of the same kind share a group
