Delete all rows below a certain condition in pandas - python

I have a dataframe with multiple columns. One of the columns (denoted as B in the example) works as a trigger: I have to drop all rows after the first value greater than 0.5, but that first value itself has to be kept.
An example is given below. All rows after 0.59 (which is the first value that satisfies the condition of being greater than 0.5) are deleted.
initial_df = pd.DataFrame([[1,0.4], [5,0.43], [4,0.59], [11,0.41], [9,0.61]], columns = ['A', 'B'])
The final goal is to obtain the following dataframe (everything after the trigger row is dropped, the trigger row itself is kept):
   A     B
0  1  0.40
1  5  0.43
2  4  0.59
Is it possible to do this in pandas in an efficient way (not using a for loop)?

You can use np.where with boolean indexing to extract the positional index of the first value matching a condition, then feed this to iloc:
import numpy as np

idx = np.where(df['B'].gt(0.5))[0][0]   # position of the first B value above 0.5
res = df.iloc[:idx + 1]                 # keep everything up to and including it
print(res)
   A     B
0  1  0.40
1  5  0.43
2  4  0.59
For very large dataframes where the condition is likely to be met early on, it is more efficient to calculate idx with next and a generator expression, which stops at the first match:
idx = next((idx for idx, val in enumerate(df['B']) if val > 0.5), len(df.index))
For better performance, see Efficiently return the index of the first value satisfying condition in array.
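Putting the two pieces together, here is a minimal self-contained sketch against the question's initial_df (the column name B and the 0.5 threshold come from the question; falling back to len(initial_df.index) simply means "keep everything" when no value exceeds the threshold):
import numpy as np
import pandas as pd

initial_df = pd.DataFrame([[1, 0.4], [5, 0.43], [4, 0.59], [11, 0.41], [9, 0.61]],
                          columns=['A', 'B'])

# Stop at the first value above the threshold; default to the full length if none qualifies.
idx = next((i for i, val in enumerate(initial_df['B']) if val > 0.5), len(initial_df.index))

# Slice up to and including the triggering row.
res = initial_df.iloc[:idx + 1]
print(res)
#    A     B
# 0  1  0.40
# 1  5  0.43
# 2  4  0.59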

This works if your index labels coincide with the positional (iloc) order:
first_occurence = initial_df[initial_df.B>0.5].index[0]
initial_df.iloc[:first_occurence+1]
EDIT: this is a more general solution that also works when the index is not the default RangeIndex:
first_occurence = initial_df.index.get_loc(initial_df[initial_df.B>0.5].iloc[0].name)
final_df = initial_df.iloc[:first_occurence+1]
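To illustrate why get_loc is needed, here is a hedged sketch with a non-default index (the string row labels are invented for the example; everything else is from the question):
import pandas as pd

initial_df = pd.DataFrame([[1, 0.4], [5, 0.43], [4, 0.59], [11, 0.41], [9, 0.61]],
                          columns=['A', 'B'],
                          index=['r1', 'r2', 'r3', 'r4', 'r5'])   # labels != positions

# Label of the first row with B > 0.5, translated to a position via get_loc
first_occurence = initial_df.index.get_loc(initial_df[initial_df.B > 0.5].iloc[0].name)
final_df = initial_df.iloc[:first_occurence + 1]
print(final_df)
#      A     B
# r1   1  0.40
# r2   5  0.43
# r3   4  0.59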

I found a solution similar to the one shown by jpp:
indices = initial_df.index
trigger = initial_df[initial_df.B > 0.5].index[0]
initial_df[initial_df.index.isin(indices[indices<=trigger])]
Since the real dataframe has multiple indices, this is the only solution that I found.

I am assuming you want to delete all rows where the "B" column value is less than 0.5.
Try this:
initial_df = pd.DataFrame([[1, 0.4], [5, 0.43], [4, 0.59], [11, 0.41], [9, 0.61]], columns=['A', 'B'])
final_df = initial_df[initial_df['B'] >= 0.5]
The resulting dataframe, final_df, is:
   A     B
2  4  0.59
4  9  0.61

Related

Selecting rows based on Boolean values in a non-dangerous way

This is an easy question since it is so fundamental. See - in R, when you want to slice rows from a dataframe based on some condition, you just write the condition and it selects the corresponding rows. For example, if you have a condition such that only the third row in the dataframe meets the condition it returns the third row. Easy Peasy.
In Python, you have to use loc. If the index matches the row numbers then everything is great. If you have been removing rows or re-ordering them for any reason, you have to remember that loc is based on the INDEX, NOT the ROW POSITION. So if the third row of your current dataframe matches your boolean conditional in the loc statement, it is retrieved by its index label, say 3, and that label could just as well belong to the 50th row by position rather than your current third row. This seems like an incredibly dangerous way to select rows, so I know I am doing something wrong.
So what is the best practice method of ensuring you select the nth row based on a boolean conditional? Is it just to use loc and "always remember to use reset_index, because if you miss it even once your entire dataframe is wrecked"? This can't be it.
Use iloc instead of loc for integer-based (positional) indexing:
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])   # note: index labels start at 1, not 0
df
Dataset:
   A  B  C
1  1  4  7
2  2  5  8
3  3  6  9
Label-based indexing:
df.loc[1]
Results:
A    1
B    4
C    7
Integer-based (positional) indexing:
df.iloc[1]
Results:
A    2
B    5
C    8
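If the requirement is genuinely positional selection driven by a boolean condition, one option (a sketch, not the only way) is to turn the mask into positions with np.flatnonzero and hand those to iloc, which side-steps the index entirely; plain boolean indexing with df[mask] is also safe regardless of the labels, because the mask is aligned on the index rather than on row position:
import numpy as np
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])

mask = df['B'] > 4                  # boolean condition
positions = np.flatnonzero(mask)    # positions of the True rows, ignoring labels
print(df.iloc[positions])
#    A  B  C
# 2  2  5  8
# 3  3  6  9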

Quick sum of all rows that fulfill a condition in DataFrame

I have a pandas dataframe that looks something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 0], [5, 1, 4], [7, 8, 9]]), columns=['a', 'b', 'c'])

   a  b  c
0  1  1  0
1  5  1  4
2  7  8  9
I want to find the first column in which the majority of elements in that column are equal to 1.0.
I currently have the following code, which works. In practice, however, my dataframes usually have thousands of columns, and this code sits in a performance-critical part of my application, so I wanted to know if there is a way to do this faster.
# (this loop runs inside a function, hence the return)
for col in df.columns:
    amount_votes = len(df[df[col] == 1.0])
    if amount_votes > len(df) / 2:
        return col
In this case, the code should return 'b', since that is the first column in which the majority of elements are equal to 1.0
Try:
print((df.eq(1).sum() > len(df) // 2).idxmax())
Prints:
b
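Breaking that one-liner down on the example dataframe may help; the intermediate names below are just for illustration:
import pandas as pd

df = pd.DataFrame([[1, 1, 0], [5, 1, 4], [7, 8, 9]], columns=['a', 'b', 'c'])

counts = df.eq(1).sum()             # number of 1s per column: a -> 1, b -> 2, c -> 0
majority = counts > len(df) // 2    # True where more than half of the rows are 1
print(majority)
# a    False
# b     True
# c    False
# dtype: bool

# idxmax returns the label of the first True (caveat: if no column qualified,
# it would still return the first label).
print(majority.idxmax())            # 'b'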
Find the columns where more than half of the values are equal to 1.0:
cols = df.eq(1.0).sum().gt(len(df)/2)
Get the first one:
cols[cols].head(1)
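If you want the bare column label rather than a one-element Series, indexing into the filtered Series should also work (assuming at least one column qualifies; cols here is the boolean Series computed just above):
first_col = cols[cols].index[0]
print(first_col)   # 'b'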

Question about drop=True in pd.dataframe.reset_index()

In a Pandas dataframe, it's possible to reset the index using the reset_index() method. One optional argument is drop=True which according to the documentation:
drop : bool, default False
Do not try to insert index into dataframe columns.
This resets the index to the default integer index.
My question is, what does the first sentence mean? Will it try to convert an integer index to a new column in my df if I leave it False?
Also, will my row order be preserved or should I also sort to ensure proper ordering?
As you can see below, df.reset_index() will move the index into the dataframe as a column. If the index was just a generic numerical index, you probably don't care about it and can just discard it. Below is a simple dataframe, but I dropped the first row just to have differing values in the index.
import pandas as pd

df = pd.DataFrame([['a', 10], ['b', 20], ['c', 30], ['d', 40]], columns=['letter', 'number'])
df = df[df.number > 10]
print(df)
#   letter  number
# 1      b      20
# 2      c      30
# 3      d      40
Default behavior now shows a column named index which was the previous index. You can see that df['index'] matches the index from above, but the index has been renumbered starting from 0.
print(df.reset_index())
#    index letter  number
# 0      1      b      20
# 1      2      c      30
# 2      3      d      40
With drop=True, pandas does not pretend the old index was important: it is simply discarded and you get a fresh default index.
print(df.reset_index(drop=True))
#   letter  number
# 0      b      20
# 1      c      30
# 2      d      40
Regarding row order, I suspect it would be maintained, but in general you should not rely on the order in which things happen to be stored. If you are performing an aggregate function, you probably want to make sure the data is ordered properly for the aggregation.
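If the order matters, it is safer to sort explicitly rather than rely on the existing order. A small sketch with the same dataframe, sorting on the number column (chosen here just for illustration):
import pandas as pd

df = pd.DataFrame([['a', 10], ['b', 20], ['c', 30], ['d', 40]], columns=['letter', 'number'])
df = df[df.number > 10]

# Sort by the column that defines the order you need, then renumber the index.
df_ordered = df.sort_values('number').reset_index(drop=True)
print(df_ordered)
#   letter  number
# 0      b      20
# 1      c      30
# 2      d      40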

How to get values on one dataframe based on the position of a value in other dataframe

I have two dataframes with the same size.
df1
1 5 3
6 5 1
2 4 9
df2
a b c
d e f
g h i
I want to get the corresponding value in df2 that sits in the same position as the maximum value of each row in df1. For example, in row 0 the maximum is at position [0, 1], so I'd like to get the element at [0, 1] from df2 in return.
Desired result would be:
df3
b
d
i
Thank you so much!
Don't use for loops; numpy can be handy here:
vals = df2.values[np.arange(len(df2)), df1.values.argmax(1)]
Of course, you can then wrap this back into a dataframe, e.g. df3 = pd.DataFrame(vals, columns=['col']):
  col
0   b
1   d
2   i
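For reference, a self-contained version of that approach; df1 and df2 are rebuilt from the values shown in the question (with default integer row and column labels, which is an assumption, since the question does not show any):
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 5, 3], [6, 5, 1], [2, 4, 9]])
df2 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])

rows = np.arange(len(df2))          # 0, 1, 2
cols = df1.values.argmax(axis=1)    # column position of each row's maximum in df1
df3 = pd.DataFrame(df2.values[rows, cols], columns=['col'])
print(df3)
#   col
# 0   b
# 1   d
# 2   i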
# Fixed-up version of the loop-based idea: take the argmax of each row of df1,
# then pick the value at that column position in the same row of df2.
S = df1.values.argmax(axis=1)
vals = []
for p in range(len(df1)):
    vals.append(df2.iloc[p, S[p]])
df3 = pd.DataFrame(vals, columns=['col'])
Try the code:
>>> for i, j in enumerate(df1.idxmax()):
... print(df2.iloc[i, j])
...
b
d
i
idxmax gives the index label of the maximum value, either column-wise (the default, used above) or row-wise with axis=1. Note that the column-wise default happens to produce the right positions for this particular example; for the general row-wise problem you would want df1.idxmax(axis=1).
Your problem has two parts:
1- Finding the maximum value of each row
2- Choosing the maximum column of each row with values found in step one
You can easily use the lookup function: the row labels go in as the first argument, and the column labels found in step one (via idxmax) go in as the second argument, which performs the selection of step two:
df2.lookup(range(len(df1)), df1.idxmax()) #output => array(['b', 'd', 'i'], dtype=object)
If an array does not work for you, you can also create a dataframe from these values by simply passing them to pd.DataFrame:
pd.DataFrame(df2.lookup(range(len(df1)), df1.idxmax()))
One good feature of this solution is that it avoids loops, which makes it efficient.
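One caveat: DataFrame.lookup was deprecated and, as far as I know, removed in pandas 2.0, so on a recent version the same result can be reproduced with plain numpy indexing (note the explicit axis=1, which asks for each row's maximum):
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 5, 3], [6, 5, 1], [2, 4, 9]])
df2 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])

# Equivalent of df2.lookup(...) on pandas versions where lookup no longer exists
col_positions = df2.columns.get_indexer(df1.idxmax(axis=1))
vals = df2.to_numpy()[np.arange(len(df2)), col_positions]
print(vals)   # ['b' 'd' 'i']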

Python df groupby with agg for string and sum

With this df as a base I want the following output (the desired result is the wide Key1/Key2 layout shown at the end of the first answer below):
So everything should be aggregated by column 0: all strings from column 1 should be collected, and the numbers from column 2 should be summed when the strings from column 1 have the same name.
With the following code I could aggregate the strings, but without summing the numbers:
df2= df1.groupby([0]).agg(lambda x: ','.join(set(x))).reset_index()
df2
Avoid an arbitrary number of columns
Your desired output suggests you have an arbitrary number of columns dependent on the number of values in 1 for each group 0. This is anti-Pandas, which is strongly geared towards an arbitrary number of rows. Hence series-wise operations are preferred.
So you can just use groupby + sum to store all the information you require.
df = pd.DataFrame({0: ['2008-04_E.pdf']*3,
                   1: ['Mat1', 'Mat2', 'Mat2'],
                   2: [3, 1, 1]})
df_sum = df.groupby([0, 1]).sum().reset_index()
print(df_sum)
               0     1  2
0  2008-04_E.pdf  Mat1  3
1  2008-04_E.pdf  Mat2  2
But if you insist...
If you insist on your unusual requirement, you can achieve it as follows via df_sum calculated as above.
key = df_sum.groupby(0)[1].cumcount().add(1).map('Key{}'.format)
res = df_sum.set_index([0, key]).unstack().reset_index()
res.columns = res.columns.droplevel(0)
print(res)
                  Key1  Key2 Key1 Key2
0  2008-04_E.pdf  Mat1  Mat2    3    2
This seems like a 2-step process. It also requires that each group in column 0 has the same number of unique elements in column 1. First, group by the columns you want grouped:
df_grouped = df.groupby([0,1]).sum().reset_index()
Then reshape to the form you want:
def group_to_row(group):
    group = group.sort_values(1)
    output = []
    for i, row in group[[1, 2]].iterrows():
        output += row.tolist()
    return pd.DataFrame(data=[output])

df_output = df_grouped.groupby(0).apply(group_to_row).reset_index()
This is untested, but the target layout is also quite non-standard, so unfortunately I don't think there is a single standard Pandas function for it.
