I have a dataframe that contains 3 columns: Id, Stage, Status. I would like to change the Status value based on this condition: if, for the same ID, the Stage changed from the previous occurrence, then set Status to 1. If another occurrence of the same ID has the same Stage, set Status back to 0.
Thanks !!
To calculate the Period column, you can calculate the result with two (nested) groupbys:
df["Period"] = (df.groupby("ID", group_keys=False)
                  # use the common diff/cumsum pattern to calculate the inner group variable
                  .apply(lambda g: g.groupby((g.Stage.diff() != 0).cumsum())
                                    .cumcount() * 30))
df
The status column can be obtained this way:
df.groupby('ID').diff().Stage.fillna(0).ne(0)
Out[86]:
4 False
10 True
0 False
2 True
3 True
5 True
7 False
8 False
9 True
1 False
6 False
Name: Stage, dtype: bool
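For reference, a minimal self-contained version of this idea, using hypothetical sample data laid out like the question's frame and casting the boolean to int to get the 0/1 status:

```python
import pandas as pd

# hypothetical sample data: rows grouped by ID for readability
df = pd.DataFrame({
    "ID":    [45, 45, 50, 50, 50, 50, 50, 50, 50, 55, 55],
    "Stage": [ 2,  3,  4,  5,  6,  4,  4,  4,  5,  3,  3],
})

# Status is 1 where Stage changed from the previous row of the same ID
df["Status"] = df.groupby("ID").Stage.diff().fillna(0).ne(0).astype(int)
print(df)
```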
You need to sort on column ID and then use np.where() and df.shift() to find the right status.
df = df.sort_values('ID')
df['Status'] = np.where((df.ID.shift() == df.ID) & (df.Stage.shift() != df.Stage), 1, 0)
output
ID Stage Status
4 45 2 0
10 45 3 1
0 50 4 0
2 50 5 1
3 50 6 1
5 50 4 1
7 50 4 0
8 50 4 0
9 50 5 1
1 55 3 0
6 55 3 0
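A runnable sketch of this shift-based approach, with hypothetical data reconstructed from the output table above (a stable sort is used so that ties within an ID keep their original order):

```python
import pandas as pd
import numpy as np

# hypothetical sample data matching the output table
df = pd.DataFrame({
    "ID":    [50, 55, 50, 50, 45, 50, 55, 50, 50, 50, 45],
    "Stage": [ 4,  3,  5,  6,  2,  4,  3,  4,  4,  5,  3],
})

# mergesort is stable, preserving the original order within equal IDs
df = df.sort_values("ID", kind="mergesort")
# 1 where the same ID appears again with a different Stage, else 0
df["Status"] = np.where((df.ID.shift() == df.ID) & (df.Stage.shift() != df.Stage), 1, 0)
print(df)
```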
Related
I want to find runs of consecutive numbers in ColA and then select the middle number of each consecutive sequence, printing the value corresponding to it in column Freq1.
This code is not printing any value:
for col in df.ColA:
    if col == col + 1 and col + 1 == col + 2:
        print(col)
Can anyone suggest an idea?
ColA Freq1
4 0
5 100
6 200
18 5
19 600
20 700
This will return the resulting rows you are looking for:
Basically, it keeps the rows where ColA equals the previous row's value + 1 and the next row's value - 1.
This of course operates under the assumption that there are always only 3 consecutive numbers.
df.loc[(df['ColA'].eq(df['ColA'].shift(1).add(1))) & (df['ColA'].eq(df['ColA'].shift(-1).sub(1)))]
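A runnable version with the sample data from the question:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({"ColA": [4, 5, 6, 18, 19, 20],
                   "Freq1": [0, 100, 200, 5, 600, 700]})

# middle rows: ColA equals previous + 1 and next - 1
mid = df.loc[df["ColA"].eq(df["ColA"].shift(1).add(1))
             & df["ColA"].eq(df["ColA"].shift(-1).sub(1))]
print(mid)
```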
You can use a custom groupby to identify the stretches of contiguous values, then use the size of each group to find its middle value:
# find consecutive values
group = df['ColA'].diff().ne(1).cumsum()
# make grouper
g = df.groupby(group)['ColA']
# get index of row in group
# compare to group size to find mid-point
m = g.cumcount().eq(g.transform('size').floordiv(2))
# perform boolean indexing
out = df.loc[m]
output:
ColA Freq1
1 5 100
4 19 600
intermediates:
ColA Freq1 diff group cumcount size size//2 m
0 4 0 NaN 1 0 3 1 False
1 5 100 1.0 1 1 3 1 True
2 6 200 1.0 1 2 3 1 False
3 18 5 12.0 2 0 3 1 False
4 19 600 1.0 2 1 3 1 True
5 20 700 1.0 2 2 3 1 False
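Putting the pieces above together into one runnable script with the question's data:

```python
import pandas as pd

df = pd.DataFrame({"ColA": [4, 5, 6, 18, 19, 20],
                   "Freq1": [0, 100, 200, 5, 600, 700]})

# label each run of consecutive values, then keep the middle row of each run
group = df["ColA"].diff().ne(1).cumsum()
g = df.groupby(group)["ColA"]
m = g.cumcount().eq(g.transform("size").floordiv(2))
out = df.loc[m]
print(out)
```

Unlike the shift-based answer, this also works for runs longer (or shorter) than 3.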
Under the if-then section of the pandas documentation cookbook, we can assign values in one column, based on a condition being met for a separate column using loc[].
df = pd.DataFrame({'AAA' : [4,5,6,7],
                   'BBB' : [10,20,30,40],
                   'CCC' : [100,50,-30,-50]})
# AAA BBB CCC
# 0 4 10 100
# 1 5 20 50
# 2 6 30 -30
# 3 7 40 -50
df.loc[df.AAA >= 5,'BBB'] = -1
# AAA BBB CCC
# 0 4 10 100
# 1 5 -1 50
# 2 6 -1 -30
# 3 7 -1 -50
But what if I want to write a condition that involves the previous or subsequent row using .loc[]? For example, say I want to assign df.BBB=5 wherever the difference between the df.CCC of the current row and the df.CCC of the next row is greater than or equal to 50. Then I would like to create a condition that gives me the following data frame:
# AAA BBB CCC
# 0 4 5 100 <-| 100 - 50 = 50, assign df.BBB = 5
# 1 5 5 50 <-| 50 -(-30)= 80, assign df.BBB = 5
# 2 6 -1 -30 <-| (-30)-(-50)= 20, don't assign df.BBB = 5
# 3 7 -1 -50 <-| (-50) -0 =-50, don't assign df.BBB = 5
How can I get this result?
Edit
The answer I'm hoping to find is something like
mask = df['CCC'].current - df['CCC'].next >= 50
df.loc[mask, 'BBB'] = 5
because I'm interested in the general problem of how I can access values above or below the current row being considered in a dataframe (not necessarily solving this one toy example).
diff() will work on the example I first described, but what of other cases, say, where we want to compare two elements instead of subtracting them?
What if I take the previous data frame and I want to find all rows where the current entry in df.BBB matches the next one, and then assign df.CCC based on those comparisons?
if df.BBB.current == df.BBB.next:
    df.CCC = 1
# AAA BBB CCC
# 0 4 5 1 <-| 5 == 5, assign df.CCC = 1
# 1 5 5 50 <-| 5 != -1, do nothing
# 2 6 -1 1 <-| -1 == -1, assign df.CCC = 1
# 3 7 -1 -50 <-| -1 != 0, do nothing
Is there a way to do this with pandas using .loc[]?
Given
>>> df
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
you can compute a boolean mask first via
>>> mask = df['CCC'].diff(-1) >= 50
>>> mask
0 True
1 True
2 False
3 False
Name: CCC, dtype: bool
and then issue
>>> df.loc[mask, 'BBB'] = 5
>>>
>>> df
AAA BBB CCC
0 4 5 100
1 5 5 50
2 6 30 -30
3 7 40 -50
More generally, you can compute a shift
>>> df['CCC_next'] = df['CCC'].shift(-1) # or df['CCC'].shift(-1).fillna(0)
>>> df
AAA BBB CCC CCC_next
0 4 5 100 50.0
1 5 5 50 -30.0
2 6 30 -30 -50.0
3 7 40 -50 NaN
... and then do whatever you want, such as:
>>> df['CCC'].sub(df['CCC_next'], fill_value=0)
0 50.0
1 80.0
2 20.0
3 -50.0
dtype: float64
>>> mask = df['CCC'].sub(df['CCC_next'], fill_value=0) >= 50
>>> mask
0 True
1 True
2 False
3 False
dtype: bool
although for the specific problem in your question the diff approach is sufficient.
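The follow-up comparison in the question can be handled the same way with shift. A sketch, assuming (based on the row comments in the question) that the intended comparison is between the current row's BBB and the next row's BBB, with the last row compared against a fill of 0:

```python
import pandas as pd

# the frame as it looks after the first assignment in the question
df = pd.DataFrame({"AAA": [4, 5, 6, 7],
                   "BBB": [5, 5, -1, -1],
                   "CCC": [100, 50, -30, -50]})

# compare each row's BBB with the next row's BBB (last row against 0)
mask = df["BBB"].eq(df["BBB"].shift(-1).fillna(0))
df.loc[mask, "CCC"] = 1
print(df)
```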
You can use the enumerate function to access a row and its index simultaneously. Thus you can obtain the previous and next rows based on the index of the current row. I provide an example script below for your reference:
import pandas as pd
df = pd.DataFrame({'AAA' : [4,5,6,7],
                   'BBB' : [10,20,30,40],
                   'CCC' : [100,50,-30,-50]}, index=['a','b','c','d'])
print('row_pre','row_pre_AAA','row','row_AA','row_next','row_next_AA')
for irow, row in enumerate(df.index):
    if irow == 0:
        row_next = df.index[irow+1]
        print('row_pre', "df.loc[row_pre,'AAA']", row, df.loc[row,'AAA'], row_next, df.loc[row_next,'AAA'])
    elif irow > 0 and irow < df.index.size-1:
        row_pre = df.index[irow-1]
        row_next = df.index[irow+1]
        print(row_pre, df.loc[row_pre,'AAA'], row, df.loc[row,'AAA'], row_next, df.loc[row_next,'AAA'])
    else:
        row_pre = df.index[irow-1]
        print(row_pre, df.loc[row_pre,'AAA'], row, df.loc[row,'AAA'], 'row_next', "df.loc[row_next,'AAA']")
Output as below:
row_pre row_pre_AAA row row_AA row_next row_next_AA
row_pre df.loc[row_pre,'AAA'] a 4 b 5
a 4 b 5 c 6
b 5 c 6 d 7
c 6 d 7 row_next df.loc[row_next,'AAA']
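For completeness, the same previous/next lookups can usually be done without an explicit loop by using shift, which aligns shifted values on the original index:

```python
import pandas as pd

df = pd.DataFrame({"AAA": [4, 5, 6, 7]}, index=["a", "b", "c", "d"])

# previous / next values of AAA, aligned on the original index
prev_aaa = df["AAA"].shift(1)
next_aaa = df["AAA"].shift(-1)
print(pd.DataFrame({"prev": prev_aaa, "cur": df["AAA"], "next": next_aaa}))
```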
Let's say I have a Pandas DataFrame:
x = pd.DataFrame(data=[5,4,3,2,1,0,1,2,3,4,5],columns=['value'])
x
Out[9]:
value
0 5
1 4
2 3
3 2
4 1
5 0
6 1
7 2
8 3
9 4
10 5
Now, I want to, given an index, find rows in x until a condition is met.
For example, if index = 2:
x.loc[2]
Out[14]:
value 3
Name: 2, dtype: int64
Now I want to, from that index, find the next n rows where the value is greater than some threshold. For example, if the threshold is 0, the results should be:
x
Out[9]:
value
2 3
3 2
4 1
5 0
How can I do this?
I have tried:
x.loc[2:x['value']>0,:]
But of course this will not work because x['value']>0 returns a boolean array of:
Out[20]:
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 True
10 True
Name: value, dtype: bool
Using idxmin and slicing
x.loc[2:x['value'].gt(0).idxmin(),:]
2 3
3 2
4 1
5 0
Name: value
Edit:
For a general formula, use
index = 7
threshold = 2
x.loc[index:x.loc[index:,'value'].gt(threshold).idxmin(),:]
From your description in the comments, it seemed like you want to begin from index+1 and not index. If that is the case, just use
x.loc[index+1:x.loc[index+1:,'value'].gt(threshold).idxmin(),:]
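A runnable version of the general formula with the question's data:

```python
import pandas as pd

x = pd.DataFrame(data=[5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5], columns=["value"])

index, threshold = 2, 0
# first label at/after `index` where value <= threshold closes the slice
stop = x.loc[index:, "value"].gt(threshold).idxmin()
out = x.loc[index:stop, :]
print(out)
```

One caveat: idxmin returns the first label overall when the condition is never False, so if every value beats the threshold the slice would stop at the first row; a quick check is worthwhile in that case.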
You want to filter for index greater than or equal to your idx=2, and for x['value']>=threshold, and then select the first n of these rows, which can be accomplished with .head(n).
Say:
idx = 2
threshold = 0
n = 4
x[(x.index>=idx) & (x['value']>=threshold)].head(n)
Out:
# value
# 2 3
# 3 2
# 4 1
# 5 0
Edit: changed to >=, and updated example to match OP's example.
Edit 2 due to clarification from OP: since n is unknown:
idx = 2
threshold = 0
x.loc[idx:(x['value']<=threshold).loc[x.index>=idx].idxmax()]
This is selecting from the starting idx, in this case idx=2, up to and including the first row where the condition is not met (in this case index 5).
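A self-contained sketch of this idxmax variant with the question's data:

```python
import pandas as pd

x = pd.DataFrame(data=[5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5], columns=["value"])

idx, threshold = 2, 0
# first row at/after idx where the condition fails ends the slice (inclusive)
stop = (x["value"] <= threshold).loc[x.index >= idx].idxmax()
out = x.loc[idx:stop]
print(out)
```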
I have a large time-series df (2.5 mil rows) that contains 0 values in a given column, some of which are legitimate. However, if there are repeated continuous occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] runs from the middle and leave the remaining lone 0s, to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The run length of zeros that triggers deletion should be a parameter that can be set - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if it is 0 and either the previous or the next row in the same column is also 0. You can use shift to look at the previous and next values and compare them with the current value as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for runs of more than 2 consecutive zeros
Following the example in the link, add a new column to track consecutive occurrences and then filter on it:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]
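A runnable sketch of this run-length idea, parameterized by the maximum run of zeros to keep:

```python
import pandas as pd

df = pd.DataFrame({"ColA": [1, 2, 3, 0, 4, 5, 0, 0, 0, 1,
                            2, 3, 0, 8, 8, 0, 0, 0, 0, 9]})

n = 2  # drop runs of zeros longer than n
# size of the run of identical consecutive values each row belongs to
run_size = df.groupby((df.ColA != df.ColA.shift()).cumsum()).ColA.transform("size")
out = df[~((df.ColA == 0) & (run_size > n))]
print(out["ColA"].tolist())
```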
We need to build a new grouping parameter here, then use drop_duplicates:
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0) flags the values equal to 0
# diff().ne(0).cumsum() starts a new group number whenever that flag changes, so each run of equal values is counted as one group
I want to adapt my former SAS code to Python using the dataframe framework.
In SAS I often use this type of code (assume the columns are sorted by group_id where group_id takes values 1 to 10 where there are multiple observations for each group_id):
data want;set have;
by group_id;
if first.group_id then c=1; else c=0;
run;
So what goes on here is that I select the first observation for each id and create a new variable c that takes value 1 for it and 0 for the others. The dataset looks like this:
group_id c
1 1
1 0
1 0
2 1
2 0
2 0
3 1
3 0
3 0
How can I do this in Python using a dataframe? Assume that I start with the group_id vector only.
If you're using pandas 0.13+ you can use the cumcount groupby method:
In [11]: df
Out[11]:
group_id
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
In [12]: df.groupby('group_id').cumcount() == 0
Out[12]:
0 True
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 False
dtype: bool
You can force the dtype to be int rather than bool:
In [13]: df['c'] = (df.groupby('group_id').cumcount() == 0).astype(int)
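As a self-contained sketch, here is the cumcount approach alongside an equivalent spelling: the first row of each group is exactly the non-duplicated occurrence of its group_id, so ~duplicated() gives the same flag:

```python
import pandas as pd

df = pd.DataFrame({"group_id": [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# 1 for the first row of each group_id, 0 otherwise
df["c"] = (df.groupby("group_id").cumcount() == 0).astype(int)

# equivalent: the first occurrence of each group_id is the non-duplicated one
df["c2"] = (~df["group_id"].duplicated()).astype(int)
print(df)
```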