How to find rows from a groupby condition in the original dataframe? - python

I have the following condition which eliminates the last row of a group if that last row does not equal 'no' for the column outcome:
m1 = df.groupby(['id'])['outcome'].tail(1) != 'no'
I then use this condition to drop these rows from the dataframe:
df = df.drop(m1[m1].index)
However, I do not know how to do the opposite: instead of dropping these rows from the original df, extract the entire rows that satisfy the m1 condition. Any suggestions?

From the comments:
df.loc[m1[m1].index, :]
will work.
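A self-contained sketch of the whole pattern, using made-up data for illustration:

import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'outcome': ['no', 'yes', 'no', 'no', 'yes'],
})

# True for the last row of each id group whose outcome is not 'no'
m1 = df.groupby(['id'])['outcome'].tail(1) != 'no'

# Instead of dropping, select the rows that satisfy m1
matches = df.loc[m1[m1].index, :]
print(matches)  # the rows at index 1 and 4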

Related

Pandas groupby and transform with row filter

I am working with the following multi-indexed dataframe:
I would like to get the average of column 'EY' for all the rows grouped by ['date','SECTOR'] but only if EST_UNIV == 1.
I could do the following:
This gets me most of what I need, but you'll notice the number of rows dropped from 6553 to 1313.
I would like to pull in the values for all the rows in the original dataframe, even if EST_UNIV == 0, but I would like the average calculation to only apply for rows where EST_UNIV == 1.
Thanks very much for the help!
Use Series.where to build a helper column that is NaN wherever the condition is not met; the group mean then skips those NaN values, while transform still broadcasts the result to every row:
df['new'] = (df.assign(new=df['EY'].where(df['EST_UNIV'].eq(1)))
               .groupby(['date', 'SECTOR'])['new']
               .transform('mean'))
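A runnable sketch with invented data (column names follow the question):

import pandas as pd

df = pd.DataFrame({
    'date': ['2020-01', '2020-01', '2020-01', '2020-02'],
    'SECTOR': ['TECH', 'TECH', 'TECH', 'TECH'],
    'EST_UNIV': [1, 1, 0, 1],
    'EY': [10.0, 20.0, 90.0, 30.0],
})

# EY is masked to NaN where EST_UNIV != 1, so mean() ignores those rows,
# but every row of the group still receives the transformed result
df['new'] = (df.assign(new=df['EY'].where(df['EST_UNIV'].eq(1)))
               .groupby(['date', 'SECTOR'])['new']
               .transform('mean'))

print(df['new'].tolist())  # [15.0, 15.0, 15.0, 30.0]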

Python: How to filter out rows based on a condition from 2 columns

I have a data frame DF in Python and I want to filter its rows based on 2 columns.
In particular, I want to remove the rows where orderdate is earlier than the startdate
How can I reverse the condition inside the following code to achieve what I want?
DF = DF.loc[DF['orderdate']<DF['startdate']]
I could flip the comparison as below, but it would also drop the rows that have NaT, and I want to keep them:
DF = DF.loc[DF['orderdate']>=DF['startdate']]
Putting a ~ in front of the parenthesized condition negates it, keeping every row that does not satisfy the original condition. NaT comparisons evaluate to False, so those rows survive the negated filter:
DF = DF.loc[~(DF['orderdate']<DF['startdate'])]
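A quick demonstration with illustrative dates, showing that the negated condition keeps NaT rows:

import pandas as pd

DF = pd.DataFrame({
    'orderdate': pd.to_datetime(['2021-01-05', '2021-01-01', None]),
    'startdate': pd.to_datetime(['2021-01-01', '2021-01-05', '2021-01-01']),
})

# NaT < anything evaluates to False, so ~False keeps the NaT row;
# the flipped >= comparison would have silently dropped it
kept = DF.loc[~(DF['orderdate'] < DF['startdate'])]
print(kept.index.tolist())  # [0, 2]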
1 - loc compares the rows of the 'orderdate' column with the rows of the 'startdate' column. Where the condition is true, it returns the row indexes, which are stored in the ids array.
2 - The drop method deletes those rows from the dataframe. Its parameters are the array of row indexes and inplace=True, which ensures the operation is performed on the dataframe itself; if it is False, the operation returns a modified copy of the dataframe.
# Get the indexes of rows for which orderdate is earlier than startdate
ids = DF.loc[DF['orderdate'] < DF['startdate']].index
# Delete these rows from the dataframe; NaT comparisons are False, so those rows are kept
DF.drop(ids, inplace=True)

Delete rows from a pandas DataFrame based on a conditional expression in another dataframe

I have two pandas dataframes, df1 and df2, both with an equal number of rows. df2 has 11 rows that contain NaN values. I know how to drop these NaN rows in df2 by applying:
df2.dropna(subset=['HIGH'], inplace=True)
But now I want to delete these same rows from df1 (the rows with the same row numbers that have been deleted from df2). I tried the following but this does not seem to work.
df1.drop(df2[df2['HIGH'] == 'NaN'].index, inplace=False)
Any other suggestions?
You can get all the rows with NaN values in them with:
is_NaN = df2.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = df2[row_has_NaN]
After that you can delete the NaN rows from df2 with dropna, as you did in the question.
Then take every index from rows_with_NaN and drop it from df1 (the two frames share the same index, as you said).
I hope this is correct! (No test done)
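A minimal sketch of the whole operation with made-up data, assuming df1 and df2 share the same index:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': range(5)})
df2 = pd.DataFrame({'HIGH': [1.0, np.nan, 3.0, np.nan, 5.0]})

# Indexes of the df2 rows whose 'HIGH' value is NaN
nan_idx = df2[df2['HIGH'].isnull()].index

# Drop the same rows from both frames
df1 = df1.drop(nan_idx)
df2 = df2.dropna(subset=['HIGH'])

print(df1.index.equals(df2.index))  # True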

How to find the last row without '-' in it

I have a pandas dataframe df. In every column the values eventually turn to '-' and stay that way until the end of the dataframe. I would like to find the final row that contains no '-' value. How can I do that?
df.isin(['-'])
gives me a dataframe full of Trues and Falses. So I want the last row that only has False in it.
You can use df.tail(1) to pick the last row of a selection. To keep only the rows where every value of the mask is False, reduce the element-wise mask with any(axis=1) and negate it with the ~ (not) operator:
df[~df.isin(['-']).any(axis=1)].tail(1)
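For example, with invented data:

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, '-', '-'],
    'b': [5, 6, 7, '-'],
})

# Keep only rows without any '-', then take the last one
last_clean = df[~df.isin(['-']).any(axis=1)].tail(1)
print(last_clean)  # the row at index 1 (values 2 and 6)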

Pandas DataFrame Slice Column Based on Condition

I am looking to slice rows in a dataframe column based on conditions- I understand I can assign specific values to rows in my df column based on given conditions using .loc, however I need the condition just to determine how much to slice.
For example, if the row starts with 'A', I would like the first 6 chars ([:6]) whereas if it starts with 'B' I would like it to have the first 8 chars ([:8]).
I am doing this in order to get the data into the correct format before I perform an inner join with another dataframe using pd.merge()
I can use df.loc[df['column'].str[:1] == 'A'], but it doesn't give me the index of the rows that satisfy the condition. The best solution I can think of is creating a list of all of the indexes that satisfy each condition and then manipulating the rows one by one. Is there a better way to do this?
You can do this with np.select:
import numpy as np

m1 = df.col.str[0] == 'A'
m2 = df.col.str[0] == 'B'
df['NewCol'] = np.select([m1, m2], [df.col.str[:6], df.col.str[:8]], default=df.col)
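A quick illustration with sample strings (the column name col is taken from the snippet above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['A12345678', 'B12345678', 'C12345678']})

m1 = df.col.str[0] == 'A'
m2 = df.col.str[0] == 'B'
# First 6 chars for 'A' rows, first 8 for 'B' rows, unchanged otherwise
df['NewCol'] = np.select([m1, m2], [df.col.str[:6], df.col.str[:8]], default=df.col)

print(df['NewCol'].tolist())  # ['A12345', 'B1234567', 'C12345678']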
