Pandas: drop all rows after the first row where all columns == True - python

I have a df like this:
Low isLower isPrevHigher isNextHigher
0 22470.0 True False True
1 22480.0 NaN False True
2 22576.6 NaN False True
3 22600.4 NaN False False
4 22583.5 NaN True True
5 22652.2 NaN False True
6 22656.8 NaN False False
7 22646.5 NaN True False
8 22600.0 NaN True False
9 22555.0 NaN True True
10 22580.1 NaN False True
11 22620.0 NaN False True
12 22682.2 NaN False False
13 22681.0 NaN True True
14 22710.8 NaN False False
15 22657.2 NaN True False
16 22623.0 NaN True True
17 22634.0 NaN False True
18 22660.0 NaN False True
19 22673.6 NaN False True
20 22721.2 NaN False False
21 22580.0 NaN True False
22 22552.6 NaN True False
23 22382.6 True True False
24 22353.0 True True False
25 22341.7 True True False
26 22312.4 True True False
**27 22256.4 True True True**
28 22310.6 True False False
29 22286.0 True True True
30 22306.8 True False True
31 22386.3 True False False
I want to drop all rows after the first row where isLower == True & isPrevHigher == True & isNextHigher == True.
So everything after row 27.

drop_row = df[df[['isLower', 'isPrevHigher', 'isNextHigher']].eq(True).all(axis=1)].index[0]
df = df[df.index <= drop_row]
print(df)
Output:
Low isLower isPrevHigher isNextHigher
0 22470.0 True False True
1 22480.0 NaN False True
2 22576.6 NaN False True
3 22600.4 NaN False False
4 22583.5 NaN True True
5 22652.2 NaN False True
6 22656.8 NaN False False
7 22646.5 NaN True False
8 22600.0 NaN True False
9 22555.0 NaN True True
10 22580.1 NaN False True
11 22620.0 NaN False True
12 22682.2 NaN False False
13 22681.0 NaN True True
14 22710.8 NaN False False
15 22657.2 NaN True False
16 22623.0 NaN True True
17 22634.0 NaN False True
18 22660.0 NaN False True
19 22673.6 NaN False True
20 22721.2 NaN False False
21 22580.0 NaN True False
22 22552.6 NaN True False
23 22382.6 True True False
24 22353.0 True True False
25 22341.7 True True False
26 22312.4 True True False
27 22256.4 True True True

You may want to drop rows on/after the first row with ALL empty values:
import pandas as pd

# create another data frame
df = pd.DataFrame(
    {'direction': ['north', 'east', 'south', None, 'up', 'down'],
     'amount': [10, 20, 30, None, 100, 200]})
# does the whole row consist of `None`
df['row_is_none'] = df.isna().all(axis=1)
# calculate the cumulative sum of the new column
df['row_is_non_accum'] = df['row_is_none'].cumsum()
# create boolean mask and perform drop (not shown to save space)
print(df)
direction amount row_is_none row_is_non_accum
0 north 10.0 False 0
1 east 20.0 False 0
2 south 30.0 False 0
3 None NaN True 1
4 up 100.0 False 1
5 down 200.0 False 1
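To finish the example, a minimal sketch of the step the answer leaves out (my addition, not part of the original answer): keep only the rows where the cumulative sum is still zero, i.e. everything before the first all-empty row.
mask = df['row_is_non_accum'].eq(0)
df_trimmed = df.loc[mask, ['direction', 'amount']]
print(df_trimmed)
direction amount
0 north 10.0
1 east 20.0
2 south 30.0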

This finds where all the specified columns are True, takes the lowest matching index as a number, and then uses iloc to keep every row up to and including that one:
import numpy as np

first_all_true = df.iloc[np.where((df['isLower'] == True) & (df['isPrevHigher'] == True) & (df['isNextHigher'] == True))].index[0]
df.iloc[0:first_all_true + 1]

Using boolean indexing with the help of all, the boolean NOT (~), and cummin:
df[(~df[['isLower', 'isPrevHigher', 'isNextHigher']].eq(True).all(axis=1)).cummin()]
NB. untested answer
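A quick check against the data above (my note, not part of the original answer) suggests this one-liner also drops row 27 itself, because the mask turns False on the first all-True row. If that row should be kept, as in the accepted output, one option is to shift the all-True test down a row before propagating it:
all_true = df[['isLower', 'isPrevHigher', 'isNextHigher']].eq(True).all(axis=1)
df[~all_true.shift(fill_value=False).cummax()]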

Related

How to delete rows in Pandas after a certain value?

I replicated a Pandas series with the following code:
data = np.array([1, 2, 3, 4, 5, np.nan, np.nan, np.nan, 9,10,11,12,13,14])
ser = pd.Series(data)
print(ser)
I would like to select only the values before the NaNs so that I only get 1, 2, 3, 4, 5. How should I do that?
Test missing values with Series.isna, use Series.cummax to repeat True after the first match, then invert the mask with ~ and filter by boolean indexing:
a = ser[~ser.isna().cummax()]
print(a)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: float64
Alternative solution with cumulative sum:
a = ser[ser.isna().cumsum().eq(0)]
print(a)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: float64
Details:
print(ser.to_frame().assign(testna = ser.isna(),
                            cumsum = ser.isna().cumsum(),
                            invert = ser.isna().cumsum().eq(0)))
0 testna cumsum invert
0 1.0 False 0 True
1 2.0 False 0 True
2 3.0 False 0 True
3 4.0 False 0 True
4 5.0 False 0 True
5 NaN True 1 False
6 NaN True 2 False
7 NaN True 3 False
8 9.0 False 3 False
9 10.0 False 3 False
10 11.0 False 3 False
11 12.0 False 3 False
12 13.0 False 3 False
13 14.0 False 3 False
print(ser.to_frame().assign(testna = ser.isna(),
                            cummax = ser.isna().cummax(),
                            test0 = ~ser.isna().cummax()))
0 testna cummax test0
0 1.0 False False True
1 2.0 False False True
2 3.0 False False True
3 4.0 False False True
4 5.0 False False True
5 NaN True True False
6 NaN True True False
7 NaN True True False
8 9.0 False True False
9 10.0 False True False
10 11.0 False True False
11 12.0 False True False
12 13.0 False True False
13 14.0 False True False
Use a boolean mask to slice the series.
You have two options. The first is to check whether the values are not NA with notna and propagate the False values after the first NaN with Series.cummin:
ser[ser.notna().cummin()]
output:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: float64
Or, test if the values are NA with isna, then extend the True values after the first True with Series.cummax, then invert the mask with ~:
ser[~ser.isna().cummax()]
visual representation of how it works:
data notna notna+cummin isna isna+cummax ~(isna+cummax)
0 1.0 True True False False True
1 2.0 True True False False True
2 3.0 True True False False True
3 4.0 True True False False True
4 5.0 True True False False True
5 NaN False False True True False
6 NaN False False True True False
7 NaN False False True True False
8 9.0 True False False True False
9 10.0 True False False True False
10 11.0 True False False True False
11 12.0 True False False True False
12 13.0 True False False True False
13 14.0 True False False True False

What is the fastest way to populate one pandas dataframe based on values from another pandas dataframe?

I have a pandas dataframe positions
row column
1 3 Brazil
2 6 USA
3 3 USA
4 7 Canada
and another x
Brazil Canada USA
1 False False False
2 False False False
3 False False False
4 False False False
5 False False False
6 False False False
7 False False False
I want to populate the second one based on the values from the first one, so the result is:
Brazil Canada USA
1 False False False
2 False False False
3 True False True
4 False False False
5 False False False
6 False False True
7 False True False
I'm doing that using iterrows()
for i, r in positions.iterrows():
    x.at[r['row'], r['column']] = True
Is there a faster way to do that?
I would do crosstab with update:
x.update(pd.crosstab(df.row,df.column).eq(1))
x
Out[44]:
Brazil Canada USA
1 False False False
2 False False False
3 True False True
4 False False False
5 False False False
6 False False True
7 False True False
You can pivot the positions table:
s = (df.assign(dummy=True).set_index(['row','column'])
       ['dummy'].unstack(fill_value=False)
    )
x |= s
Output:
Brazil Canada USA
1 False False False
2 False False False
3 True False True
4 False False False
5 False False False
6 False False True
7 False True False
searchsorted and positional assignment via NumPy
This assumes that the index and columns in x are sorted.
Use searchsorted to convert the labels to positions, then assign True pointwise with NumPy fancy indexing and rebuild the frame (iloc with two position lists would select a cross product of rows and columns, not coordinate pairs):
i = x.index.searchsorted(df.row)
j = x.columns.searchsorted(df.column)
a = x.to_numpy()
a[i, j] = True
x = pd.DataFrame(a, index=x.index, columns=x.columns)
x
Brazil Canada USA
1 False False False
2 False False False
3 True False True
4 False False False
5 False False False
6 False False True
7 False True False

How to split a dataframe and create sub dataframes by grouping rows using a specific value?

I have a dataframe like the one given below:
date,value
2/10/19,34
2/11/19,34
2/12/19,34
2/13/19,34
2/14/19,34
2/15/19,34
2/16/19,34
2/17/19,0
2/18/19,0
2/19/19,0
2/20/19,22
2/21/19,22
2/22/19,22
2/23/19,22
2/24/19,0
2/25/19,0
2/26/19,0
2/27/19,0
2/28/19,1
3/1/19,2
3/2/19,2
3/3/19,1
3/4/19,0
3/5/19,0
3/6/19,0
3/7/19,3
3/8/19,3
3/9/19,3
3/10/19,0
After every interval the dataframe has zero values. I want to group the rows in such a way that if zero appears more than two times in a row, it creates a sub dataframe and saves it to a file.
Output:
df1
2/17/19,0
2/18/19,0
2/19/19,0
df2
2/24/19,0
2/25/19,0
2/26/19,0
2/27/19,0
df3
3/4/19,0
3/5/19,0
3/6/19,0
I tried many ways to do it but they all fail.
Thank you.
You can try using rolling:
def merge_intervals(intervals):
    sorted_intervals = sorted(intervals, key=lambda x: x[0])
    interval_index = 0
    #print(sorted_intervals)
    for i in sorted_intervals:
        if i[0] > sorted_intervals[interval_index][1]:
            interval_index += 1
            sorted_intervals[interval_index] = i
        else:
            sorted_intervals[interval_index] = [sorted_intervals[interval_index][0], i[1]]
    #print(sorted_intervals)
    return sorted_intervals[:interval_index+1]

end_ids = df[df['value'].rolling(3).apply(lambda x: (x==0).all())==1].index
start_ids = end_ids-3
intervals = merge_intervals([*zip(start_ids, end_ids)])
for i,interval in enumerate(intervals):
    df[interval[0]+1:interval[1]+1].to_csv('df_' + str(i) + '.csv')
Not the prettiest code, but it works; the merge function was found here: Merging Overlapping Intervals in Python
Find where values are equal to zero and take a rolling sum of length 3. Find where the rolling sums are equal to 3. The result will be lagged by 2 spaces so we take the logical or of the result with the -1 shifted and -2 shifted versions of the result.
mask = df['value'].eq(0).rolling(3).sum().eq(3)
mask |= mask.shift(-2) | mask.shift(-1)
In order to get groups, I take the cumulative sum of the logical negation. That will increment for each non-zero value and stagnate at the zeros. However, each group of zeros will be different. At the point I use groupby, that won't matter because I will have used the initial mask to see only the rows that satisfied the condition in the first place.
However, the resulting groups will be a non contiguous set of integers. Because I don't like that, I used factorize to give these groups unique integer values starting from zero.
grp_masked = (~mask).cumsum()[mask].factorize()[0]
g = df[mask].groupby(grp_masked)
Save files
for grp, d in g:
    d.to_csv(f'df_{grp}.csv', index=False)
Create a dictionary
df_dict = {grp: d for grp, d in g}
Details
This shows the original dataframe along with additional columns that show some of what we calculated.
group_series = pd.Series(
    grp_masked, df.index[mask], pd.Int64Dtype()
)
df_ = df.assign(
    EqZero=df['value'].eq(0),
    Roll2=df['value'].eq(0).rolling(3).sum(),
    Is3=df['value'].eq(0).rolling(3).sum().eq(3),
    Shift=lambda d: d.Is3.shift(-2) | d.Is3.shift(-1),
    Mask=mask,
    PreGrp=(~mask).cumsum(),
    Grp=group_series
)
df_
date value EqZero Roll2 Is3 Shift Mask PreGrp Grp
0 2/10/19 34 False NaN False False False 1 <NA>
1 2/11/19 0 True NaN False False False 2 <NA>
2 2/12/19 0 True 2.0 False False False 3 <NA>
3 2/13/19 34 False 2.0 False False False 4 <NA>
4 2/14/19 34 False 1.0 False False False 5 <NA>
5 2/15/19 34 False 0.0 False False False 6 <NA>
6 2/16/19 34 False 0.0 False False False 7 <NA>
7 2/17/19 0 True 1.0 False True True 7 0
8 2/18/19 0 True 2.0 False True True 7 0
9 2/19/19 0 True 3.0 True False True 7 0
10 2/20/19 22 False 2.0 False False False 8 <NA>
11 2/21/19 22 False 1.0 False False False 9 <NA>
12 2/22/19 22 False 0.0 False False False 10 <NA>
13 2/23/19 22 False 0.0 False False False 11 <NA>
14 2/24/19 0 True 1.0 False True True 11 1
15 2/25/19 0 True 2.0 False True True 11 1
16 2/26/19 0 True 3.0 True True True 11 1
17 2/27/19 0 True 3.0 True False True 11 1
18 2/28/19 1 False 2.0 False False False 12 <NA>
19 3/1/19 2 False 1.0 False False False 13 <NA>
20 3/2/19 2 False 0.0 False False False 14 <NA>
21 3/3/19 1 False 0.0 False False False 15 <NA>
22 3/4/19 0 True 1.0 False True True 15 2
23 3/5/19 0 True 2.0 False True True 15 2
24 3/6/19 0 True 3.0 True False True 15 2
25 3/7/19 3 False 2.0 False False False 16 <NA>
26 3/8/19 3 False 1.0 False False False 17 <NA>
27 3/9/19 3 False 0.0 False False False 18 <NA>
28 3/10/19 0 True 1.0 False False False 19 <NA>

Vectorized method to create and write to csv an arbitrary number of DataFrames

So currently, I have DataFrames that look like this:
id temp1 temp2
9 10.0 True False
10 10.0 True False
11 10.0 False True
12 10.0 False True
17 15.0 True False
18 15.0 True False
19 15.0 True False
20 15.0 True False
21 15.0 False False
33 27.0 True False
34 27.0 True False
35 27.0 False True
36 27.0 False False
40 31.0 True False
41 31.0 False True
.
.
.
and I sort through the table with these commands:
u = coinc.groupby('id')
m = u.temp1.any() & u.temp2.any()
res = df.loc[coinc.id.isin(m[m].index), ['id']]
which checks whether, within any group sharing the same id, there is a True in both the temp1 column and the temp2 column. If there is, those rows go into the new "res" DataFrame.
Now, I have a DataFrame that looks like this:
id temp1_0 temp2_0 temp1_1 temp2_1
9 10.0 False False True False
10 10.0 False False True False
11 10.0 False True False False
12 10.0 False True False False
17 15.0 True False False False
18 15.0 True False False False
19 15.0 False False True False
20 15.0 False False True False
21 15.0 False False False False
33 27.0 False False True False
34 27.0 False False True False
35 27.0 False True False False
36 27.0 False False False False
40 31.0 False False True False
41 31.0 False True False False
except, in reality, I will have an arbitrary number of temp columns, but always in groups of two like above.
I was wondering if there was a vectorized way of doing that same operation above, but for each group of two temp columns (i.e. for the example above it would make two separate dataframes) and then output them to csv according to their name (..._0.csv, ..._1.csv, etc)
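No answer is recorded here, but a minimal sketch of one way to approach it (my addition, not from the thread): loop over each (temp1_k, temp2_k) pair and reuse the groupby/any check from above for each pair, writing each result to its own file. The toy frame and the result_k.csv naming below are illustrative only.
import pandas as pd

# toy stand-in for the real frame described above
coinc = pd.DataFrame({
    'id':      [10.0, 10.0, 15.0, 15.0],
    'temp1_0': [False, False, True, False],
    'temp2_0': [False, True, False, False],
    'temp1_1': [True, False, False, True],
    'temp2_1': [False, False, True, False],
})

n_pairs = sum(c.startswith('temp1_') for c in coinc.columns)
for k in range(n_pairs):
    t1, t2 = f'temp1_{k}', f'temp2_{k}'
    u = coinc.groupby('id')
    m = u[t1].any() & u[t2].any()                     # ids with a True in both columns of the pair
    res = coinc.loc[coinc['id'].isin(m[m].index), ['id']]
    res.to_csv(f'result_{k}.csv', index=False)        # hypothetical naming scheme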

How to Apply a logic to a subset of rows in a dataframe?

In this problem, I am trying to understand when an alert should be generated based on 'value'.
If the previous 5 values are above 10, then an alert is created. The alert stays active until the value goes below 7.5. Once the alert is no longer active and it again reaches a stage where the previous 5 values are above 10, an alert is created again.
Here is the logic I am using to do this:
NUM_PREV_ROWS = 5
PREV_5_THRESHOLD = 10.0
PREV_THRESHOLD = 7.5
d = {'device': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
                'a','a','a','a','a','b','b','b','b','b',
                'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b'],
     'value': [11,11,11,11,11,11,11,8,9,11,11,11,11,11,8,9,6,11,11,11,11,11,11,11,11,11,11,11,11,8,9,11,11,11,11,11,8,9,6,11,11,11,11,11]}
df = pd.DataFrame(data=d)
df['prev>10'] = df['value']>PREV_5_THRESHOLD
df['prev5>10'] = df['prev>10'].rolling(NUM_PREV_ROWS).sum()
df['prev>7.5'] = df['value']>PREV_THRESHOLD
alert = False
alert_series = []
for row in df.iterrows():
    if row[1]['prev5>10']==NUM_PREV_ROWS:
        alert = True
    if row[1]['prev>7.5']==False:
        alert = False
    alert_series.append(alert)
df['alert'] = alert_series
The problem is that the loop should restart when a new device is encountered (in this case, it should first run for A and then run over B once it comes across that device). How can I do this?
This is the output with current logic:
print(df)
device value prev>10 prev5>10 prev>7.5 alert
a 11 True NaN True False
a 11 True NaN True False
a 11 True NaN True False
a 11 True NaN True False
a 11 True 5.0 True True
a 11 True 5.0 True True
a 11 True 5.0 True True
a 8 False 4.0 True True
a 9 False 3.0 True True
a 11 True 3.0 True True
a 11 True 3.0 True True
a 11 True 3.0 True True
a 11 True 4.0 True True
a 11 True 5.0 True True
a 8 False 4.0 True True
a 9 False 3.0 True True
a 6 False 2.0 False False
a 11 True 2.0 True False
a 11 True 2.0 True False
a 11 True 3.0 True False
a 11 True 4.0 True False
a 11 True 5.0 True True
b 11 True 5.0 True True
b 11 True 5.0 True True
b 11 True 5.0 True True
b 11 True 5.0 True True
b 11 True 5.0 True True
b 11 True 5.0 True True
b 11 True 5.0 True True
b 8 False 4.0 True True
b 9 False 3.0 True True
b 11 True 3.0 True True
b 11 True 3.0 True True
b 11 True 3.0 True True
b 11 True 4.0 True True
b 11 True 5.0 True True
b 8 False 4.0 True True
b 9 False 3.0 True True
b 6 False 2.0 False False
b 11 True 2.0 True False
b 11 True 2.0 True False
b 11 True 3.0 True False
b 11 True 4.0 True False
b 11 True 5.0 True True
Appreciate all the help!
I'm not sure if this is the best way, but what about using groupby to reset the loop?
def f(df):
    alert = False
    alert_series = []
    for row in df.iterrows():
        if row[1]['prev5>10']==NUM_PREV_ROWS:
            alert = True
        if row[1]['prev>7.5']==False:
            alert = False
        alert_series.append(alert)
    return pd.DataFrame({'alert': alert_series})
df['alert'] = df.groupby("device").apply(f).reset_index(drop=True)
First you need a method that does the parsing for a block. I tried a different, vectorized method:
def larger_than_threshold(
    data,
    previous_5_threshold=PREV_5_THRESHOLD,
    amount=NUM_PREV_ROWS,
    previous_threshold=PREV_THRESHOLD,
):
    prev5_over_limit = (
        ((data > previous_5_threshold).rolling(amount).sum() == amount)
        .astype(int)
        .diff()
        == 1
    ).replace({False: None})
    prev_under_threshold = (data < previous_threshold)
    prev5_over_limit[prev_under_threshold] = False
    return prev5_over_limit.fillna(method="ffill").fillna(False)
This accepts the value Series of your data.
You can also use your iterative method here instead.
Then you can use groupby.transform to apply this to each separate device, selecting the value column so the function receives the Series it expects:
df["alert"] = df.groupby("device")["value"].transform(larger_than_threshold)
