Keep upper n rows of a pandas dataframe based on condition - python

How would I delete all rows from a dataframe that come after a certain condition is fulfilled? As an example, I have the following dataframe:
import pandas as pd
xEnd=1
yEnd=2
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
How would I get a dataframe that deletes the last 4 rows and keeps the upper 2, given that in row 2 the condition x=xEnd and y=yEnd is fulfilled?
EDIT: I should have mentioned that the dataframe is not necessarily ascending. It could also be descending, and I would still like to get the upper rows.

To slice your dataframe up to the first time a condition across 2 series is satisfied, first calculate the required index and then slice via iloc.
You can calculate the index via set_index, Index.isin and np.ndarray.argmax:
idx = df.set_index(['x', 'y']).index.isin([(xEnd, yEnd)]).argmax()
res = df.iloc[:idx+1]
print(res)
   x  y  id
0  1  1   0
1  1  2   1
If you need better performance, see Efficiently return the index of the first value satisfying condition in array.
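For this particular example, a simpler sketch that avoids set_index altogether: build the Boolean row mask directly and take the position of its first True with argmax (note that argmax returns 0 when the mask contains no True at all, so guard for that case if it can occur):
mask = (df['x'] == xEnd) & (df['y'] == yEnd)
res = df.iloc[:mask.values.argmax() + 1]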

Not 100% sure I understand correctly, but you can filter your dataframe like this:
df[(df.x <= xEnd) & (df.y <= yEnd)]
this yields the dataframe:
   id  x  y
0   0  1  1
1   1  1  2
If x and y are not strictly increasing and you want what's above the line that satisfies the condition:
df[df.index <= df[(df.x == xEnd) & (df.y == yEnd)].index[0]]

df = df.iloc[0:yEnd, :]
This selects just the first two rows, keeps all columns, and puts the result in a new dataframe. Or you can reuse the same variable name, as here.

Related

Comparing and dropping columns based on greater or smaller timestamp

I have this df:
id  started                  completed
1   2022-02-20 15:00:10.157  2022-02-20 15:05:10.044
and I have this other one data:
timestamp                x   y
2022-02-20 14:59:47.329  16  0.0
2022-02-20 15:01:10.347  16  0.2
2022-02-20 15:06:35.362  16  0.3
What I want to do is filter the rows in data where timestamp > started and timestamp < completed (which will leave me with only the middle row).
I tried to do it like this:
res = data[(data['timestamp'] > '2022-02-20 15:00:10.157')]
res = res[(res['timestamp'] < '2022-02-20 15:05:10.044')]
and it works.
But when I wanted to combine the two like this:
res = data[(data['timestamp'] > df['started']) and (data['timestamp'] < df['completed'])]
I get ValueError: Can only compare identically-labeled Series objects
Can anyone please explain why, and where I am making the mistake? Do I have to convert df['started'] to a string or something?
You have two issues here.
The first is the use of and. If you want to combine multiple masks (Boolean arrays) with element-wise "and" logic, you want to use & instead of and.
Then, the use of df['started'] and df['completed'] for the comparison. If you use a debugger, you can see that df['started'] is a Series with its own index, and the same goes for data['timestamp']. The rules for comparing two Series are described in the pandas documentation. Essentially, you can only compare two Series with identical indexes. But here df has only one row while data has several. Try extracting the element from df as a scalar rather than a Series, using loc for instance.
For instance:
Using masks
from string import ascii_lowercase

import numpy as np
import pandas as pd

n = 10
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
df2 = pd.DataFrame(
    {
        "max_val": [0],
        "min_val": [-0.5]
    }
)
df[(df.y < df2.loc[0, 'max_val']) & (df.y > df2.loc[0, 'min_val'])]
Out[95]:
   x         y
2  v -0.055035
3  a -0.107310
5  d -0.097696
7  j -0.453056
8  t -0.470771
Using query
df2 = pd.DataFrame(
    {
        "max_val": np.repeat(0, n),
        "min_val": np.repeat(-0.5, n)
    }
)
df.query("y < @df2.max_val and y > @df2.min_val")
Out[124]:
   x         y
2  v -0.055035
3  a -0.107310
5  d -0.097696
7  j -0.453056
8  t -0.470771
To make the comparison, pandas needs to have the same row count in both dataframes, because the comparison is made between the first row of the data['timestamp'] series and the first row of the df['started'] series, and so on.
The error is due to the second row of the data['timestamp'] series not having anything to compare against.
To make the code work, you can add, for every row of data, a row in df to match against. That way, pandas will return a Boolean result for every row, and you can combine the masks with element-wise "and" logic to keep the rows where both are True.
Pandas doesn't accept Python's and operator here, so you need to use the & operator. Your code will then look like this:
data[(data['timestamp'] > df['started']) & (data['timestamp'] < df['completed'])]
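Alternatively, a minimal sketch assuming df really is the single-row frame shown above (index 0): pull the bounds out of df as scalars first, then compare each timestamp against them, which sidesteps the index-alignment problem entirely:
start = df.loc[0, 'started']
end = df.loc[0, 'completed']
res = data[(data['timestamp'] > start) & (data['timestamp'] < end)]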

drop row if 3 out of 5 columns are lower than a specific value

I'm trying to code a line that drops a row in a dataframe if the pvalue is lower than 1.3 in 3 out of the 5 pvalue columns. If the pvalue is greater than 1.3 in 3 out of 5 columns, I keep the row. The code looks like this:
for i in np.arange(pvalue.shape[0]):
    if (pvalue.iloc[i, 1:] < 1.3).count() > 2:
        pvalue.drop(index=pvalue.index[i], axis=0, inplace=True)
    else:
        None
the pvalue dataframe has 6 columns: the first column is a string and the next 5 are pvalues of an experiment. I get this error:
IndexError: single positional indexer is out-of-bounds
and I don't know how to fix it. I appreciate every bit of help. BTW, I'm a complete Python beginner, so be patient with me! :) Thanks, and looking forward to your solutions!
I am not very knowledgeable with pandas, so there probably is a better way to go about it, but this should work. (As an aside, your original loop raises the IndexError because dropping rows while iterating shrinks the frame, so pvalue.iloc[i] eventually points past its end; also, count() counts non-NA values rather than True values, so your condition was always met.)
By using iterrows(), you can iterate over each row of a DataFrame:
for idx, row in pvalue.iterrows():
In the loop you will have access to the idx variable, which is the index of the row you're currently iterating on, and the row values themselves in the row variable.
Then for every row, you can iterate through each column value with a simple for loop.
for val in row[1:]:
while making sure you start with the 2nd value (or in other words, by ignoring the index 0 and starting with index 1).
The rest is pretty straightforward.
threshold = 1.3
for idx, row in pvalue.iterrows():
    count = 0
    for val in row[1:]:
        if val < threshold:
            count += 1
    if count > 2:
        pvalue.drop(idx, inplace=True)
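For completeness, a vectorized sketch of the same rule, assuming the five pvalue columns are everything after the first: count the values below the threshold per row and keep only the rows where that count is at most 2.
mask = (pvalue.iloc[:, 1:] < 1.3).sum(axis=1) <= 2
pvalue = pvalue[mask]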

Setting value of row in certain column based on a slice of pandas dataframe - using both loc and iloc

I am trying to slice my dataframe based on a certain condition, select the first row of that slice, and set the value of a column in that first row.
index  COL_A    COL_B    COL_C
0      cond_A1  cond_B1  0
1      cond_A1  cond_B1  0
2      cond_A1  cond_B1  0
3      cond_A2  cond_B2  0
4      cond_A2  cond_B2  0
Neither of the following lines of code I have attempted updates the dataframe:
df.loc[((df['COL_A'] == cond_A1) & (df['COL_B'] == cond_b1)), 'COL_C'].iloc[0] = 1
df[((df['COL_A'] == cond_A1) & (df['COL_B'] == cond_b1))].iloc[0]['COL_C'] = 1
I need to be able to loop through the conditions so that I could apply the same code to the next set of conditions, and update the COL_C row with index 3 based on these new conditions.
You can update only the first row of your slice with the following code:
df.loc[df.loc[(df['COL_A'] == cond_A1) & (df['COL_B'] == cond_b1), 'COL_C'].index[0], 'COL_C'] = 1
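For context, your original attempts fail because chained indexing (slicing first, then assigning via iloc) writes into a temporary copy rather than the original frame. Since you want to loop through condition sets, here is a sketch of the same approach in a loop, using the question's condition variables; the emptiness guard is an extra assumption for safety:
for a, b in [(cond_A1, cond_b1), (cond_A2, cond_b2)]:
    match = df.index[(df['COL_A'] == a) & (df['COL_B'] == b)]
    if len(match):                      # skip condition pairs with no matching rows
        df.loc[match[0], 'COL_C'] = 1   # set only the first matching row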

Loop through Pandas dataframe and copying to a new dataframe based on a condition

I have a dataframe df with 6000+ rows of data with a datetime index in the form YYYY-MM-DD and with columns ID, water_level and change.
I want to:
Loop through each value in the column change and identify turning points
When I find a turning point, copy that entire row of data including the index into a new dataframe e.g. turningpoints_df
For each new turning point identified in the loop, add that row of data to my new dataframe turningpoints_df so that I end up with something like this:
             ID  water_level  change
date
2000-10-01    2         5.5   -0.01
2000-12-13   40        10.0    0.02
2001-02-10  150         1.1   -0.005
2001-07-29  201        12.4    0.01
...         ...         ...     ...
I was thinking of taking a positional approach so something like (purely illustrative):
turningpoints_df = pd.DataFrame(columns = ['ID', 'water_level', 'change'])
for i in range(len(df['change'])):
if [i-1] < 0 and [i+1] > 0:
#this is a min point and take this row and copy to turningpoints_df
elif [i-1] > 0 and [i+1] < 0:
#this is a max point and take this row and copy to turningpoints_df
else:
pass
My issue is, is that I'm not sure how to examine each value in my change column against the value before and after and then how to pull out that row of data into a new df when the conditions are met.
It sounds like you want to make use of the shift method of the DataFrame.
# shift values down by 1 (previous row's change):
df['change_down'] = df['change'].shift(1)
# shift values up by 1 (next row's change):
df['change_up'] = df['change'].shift(-1)
You should then be able to compare the values on each row and proceed with whatever you're trying to achieve:
for idx, row in df.iterrows():
    # check conditions here
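Building on those shifted columns, a sketch of the comparison done column-wise instead of row-by-row (a turning point is a sign change between the previous and next value; the first and last rows come out False automatically because shift() leaves NaN there):
df['turningpoint'] = ((df['change_down'] < 0) & (df['change_up'] > 0)) | \
                     ((df['change_down'] > 0) & (df['change_up'] < 0))
turningpoints_df = df[df['turningpoint']].copy()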
This uses some NumPy features that allow you to roll() a series forwards or backwards. You then have prev and next on the same row, so you can use a simple function with apply() for your logic, as everything is on the same row.
import datetime as dt
import random
from decimal import Decimal

import numpy as np
import pandas as pd

d = list(pd.date_range(dt.datetime(2000, 1, 1), dt.datetime(2010, 12, 31)))
df = pd.DataFrame({"date": d,
                   "ID": [random.randint(1, 200) for x in d],
                   "water_level": [round(Decimal(random.uniform(1, 13)), 2) for x in d],
                   "change": [round(Decimal(random.uniform(-0.05, 0.05)), 3) for x in d]})
# have ref to prev and next, just apply logic
def turningpoint(r):
    r["turningpoint"] = (r["prev_change"] < 0 and r["next_change"] > 0) or \
                        (r["prev_change"] > 0 and r["next_change"] < 0)
    return r

# use numpy to shift "change" so prev and next are on the same row as new columns;
# initially default the turningpoint boolean to False
df = df.assign(prev_change=np.roll(df["change"], 1),
               next_change=np.roll(df["change"], -1),
               turningpoint=False).apply(turningpoint, axis=1).drop(["prev_change", "next_change"], axis=1)

# first and last rows cannot be turning points (np.roll wraps around)
df.loc[0:0, "turningpoint"] = False
df.loc[df.index[-1], "turningpoint"] = False

# take a copy of all rows that are turning points into a new df, keeping the index
df_turningpoint = df[df["turningpoint"]].copy()
df_turningpoint

Select column with only one negative value

I'd like to select the columns that only have one negative value or none. How can I construct this looking at this example? I've been searching for something similar, didn't succeed though. Thanks for any help.
N = 5
np.random.seed(0)
df1 = pd.DataFrame(
    {'X': np.random.uniform(-3, 3, N),
     'Y': np.random.uniform(-3, 3, N),
     'Z': np.random.uniform(-3, 3, N),
     })
          X         Y         Z
0  0.292881  0.875365  1.750350
1  1.291136 -0.374477  0.173370
2  0.616580  2.350638  0.408267
3  0.269299  2.781977  2.553580
4 -0.458071 -0.699351 -2.573784
So in this example I'd like to return columns X and Z.
Use np.sign to get the signs. Look for negative signs. Get the count of negative numbers for each column. Compare against the threshold of 1 to get a mask. Select the column names from the mask.
Hence, the implementation -
df1.columns[(np.sign(df1)<0).sum(0)<=1].tolist()
Or directly compare against 0 to replace the use of np.sign -
df1.columns[(df1<0).sum(0)<=1].tolist()
This gives us the column names. As for selecting the entire columns, I think the other solutions have that covered.
You can use iloc to do that, i.e.
df1.iloc[:,((df1<0).sum(0) <= 1).values]
or (Thanks Jon)
df1.loc[:,df1.lt(0).sum() <= 1]
Output:
          X         Z
0  0.292881  1.750350
1  1.291136  0.173370
2  0.616580  0.408267
3  0.269299  2.553580
4 -0.458071 -2.573784
Or you can try:
df1.columns[(df1 < 0).sum(0).le(1)]
Out[338]: Index(['X', 'Z'], dtype='object')
