Pandas delete first n rows until condition on columns is fulfilled - python

I am trying to delete some rows from my dataframe. In fact I want to delete the first n rows, where n is the row number at which a certain condition is first fulfilled. I want the dataframe to start with the row that contains the x-y values xEnd, yEnd; all earlier rows shall be dropped from the dataframe. Somehow I do not get the solution. This is what I have so far.
Example:
import pandas as pd
xEnd=2
yEnd=3
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
n=df["id"].iloc[df["x"]==xEnd and df["y"]==yEnd]
df = df.iloc[n:]
I want my code to reduce the dataframe from
{'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]}
to
{'x':[2,2,2], 'y':[3,4,3], 'id':[3,4,5]}

Use & instead of and
Use loc instead of iloc. You can use iloc but it could break depending on the index
Use idxmax to find the first position where the condition is True
# idxmax finds the index label of the first True value
df.loc[((df['x'] == xEnd) & (df['y'] == yEnd)).idxmax():]
# finding the index label goes with using loc
   id  x  y
3   3  2  3
4   4  2  4
5   5  2  3
Here is an iloc variation
# values.argmax finds the integer position of the first True value
df.iloc[((df['x'] == xEnd) & (df['y'] == yEnd)).values.argmax():]
# finding the position goes with using iloc
   id  x  y
3   3  2  3
4   4  2  4
5   5  2  3

Using cummax
df[((df['x'] == xEnd) & (df['y'] == yEnd)).cummax()]
Out[147]:
   id  x  y
3   3  2  3
4   4  2  4
5   5  2  3
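For intuition, here is what the boolean mask and its cummax look like on the sample frame (my annotation, not part of the original answers):
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2],
                   'y': [1, 2, 3, 3, 4, 3],
                   'id': [0, 1, 2, 3, 4, 5]})

hit = (df['x'] == 2) & (df['y'] == 3)
print(hit.tolist())           # [False, False, False, True, False, True]
print(hit.cummax().tolist())  # [False, False, False, True, True, True]
# cummax turns every position from the first match onward into True,
# which is exactly the "drop all earlier rows" mask used above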

Related

Filter based on pairs within a group - if values are present at BOTH ends

Group Code
1 2
1 2
1 4
1 1
2 4
2 1
2 2
2 3
2 1
2 1
2 3
Within each group there are pairs. In Group 1, for example, the pairs are (2,2), (2,4), (4,1).
I want to filter these pairs based on code numbers 2 AND 4 being present at BOTH ends (not just either end).
In Group 1, for example, only (2,4) will be kept, while (2,2) and (4,1) will be filtered out.
Expected Output:
Group Code
1 2
1 4
You can approach this by making 2 boolean masks, one for the current row's code and one for the next row's code being in (2, 4). Then form the required combination condition of being present at BOTH ends (not either), as follows:
Since you require both 2 AND 4 to be present in the pair, we can make another boolean mask asserting that these 2 consecutive codes are not equal:
m_curr = df['Code'].isin([2,4]) # current row code is 2 or 4
m_next = df.groupby("Group")['Code'].shift(-1).isin([2,4]) # next row code in same group is 2 or 4
m_diff = df['Code'].ne(df.groupby("Group")['Code'].shift(-1)) # different row codes in current and next row in the same group
# current row AND next row code in 2 or 4 AND (2 and 4 both present, i.e. the 2 values in the pair are different)
mask = m_curr & m_next & m_diff
df[mask | mask.shift()]
Result:
   Group  Code
1      1     2
2      1     4
Another way to do it, may be a little bit simpler for this special case:
m1 = df['Code'].eq(2) & df.groupby("Group")['Code'].shift(-1).eq(4) # current row is 2 and next row in same group is 4
m2 = df['Code'].eq(4) & df.groupby("Group")['Code'].shift(-1).eq(2) # current row is 4 and next row in same group is 2
mask = m1 | m2 # either pair of (2, 4) or (4, 2)
df[mask | mask.shift()]
Same result:
   Group  Code
1      1     2
2      1     4
You can try with shift
s = df.groupby('Group')['Code'].apply(lambda x : (x==4) & (x.shift()==2))
out = df[s | s.shift(-1)]
Out[97]:
   Group  Code
1      1     2
2      1     4
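For completeness, a runnable end-to-end version of the (2, 4)-pair approach, assuming the sample data from the question; fill_value=False just avoids the NaN that a plain shift would otherwise introduce into the mask:
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'Code':  [2, 2, 4, 1, 4, 1, 2, 3, 1, 1, 3]})

m1 = df['Code'].eq(2) & df.groupby('Group')['Code'].shift(-1).eq(4)  # pair (2, 4)
m2 = df['Code'].eq(4) & df.groupby('Group')['Code'].shift(-1).eq(2)  # pair (4, 2)
mask = m1 | m2
print(df[mask | mask.shift(fill_value=False)])  # keeps both rows of each qualifying pair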

How to remove rows with multiple occurrences in a row with pandas

I have this data:
A
1 1
2 1
3 1
4 2
5 2
6 1
I expect to get:
A
1 1
- - -> (drop)
3 1
4 2
5 2
6 1
I want to drop all the rows in col ['A'] where the same value appears several times in a row,
except for the first and the last one of each run.
Until now I used:
df = df.loc[df[col].shift() != df[col]]
but it also removes the last appearance.
Sorry for my bad English, thanks in advance.
Looks like you have the same problem as this question: Pandas drop_duplicates. Keep first AND last. Is it possible?.
The suggested solution is:
pd.concat([
    df['A'].drop_duplicates(keep='first'),
    df['A'].drop_duplicates(keep='last'),
])
Update after clarification:
First get the boolean masks for your described criteria:
is_last = df['A'] != df['A'].shift(-1)
is_duplicate = df['A'] == df['A'].shift()
And drop the rows based on these:
df.drop(df.index[~is_last & is_duplicate]) # note the ~ to negate is_last
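Putting the update together, a minimal runnable sketch assuming the single-column frame from the question:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 1]}, index=range(1, 7))
is_last = df['A'] != df['A'].shift(-1)     # last row of each run
is_duplicate = df['A'] == df['A'].shift()  # value repeats the previous row
print(df.drop(df.index[~is_last & is_duplicate]))  # drops only row 2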
Basically you need to group consecutive numbers, which can be achieved by diff and cumsum:
print (df.groupby(df["A"].diff().ne(0).cumsum(), as_index=False).nth([0, -1]))
A
1 1
3 1
4 2
5 2
6 1
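For intuition, here is the run labelling that diff/cumsum produces on the sample column (my annotation, not from the answer); nth([0, -1]) then keeps the first and last row of each labelled run:
import pandas as pd

s = pd.Series([1, 1, 1, 2, 2, 1], name="A")
labels = s.diff().ne(0).cumsum()  # a new label every time the value changes
print(labels.tolist())            # [1, 1, 1, 2, 2, 3]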

Pandas multi index Dataframe - Select and remove

I need some help with cleaning a Dataframe that has a multi-index.
It looks something like this:
                         cost
location        season
Thorp park      autumn    £12
                spring    £13
                summer    £22
Sea life centre summer    £34
                spring    £43
Alton towers    and so on.............
location and season are index columns. I want to go through the data and remove any locations that don't have "season" values for all three seasons. So "Sea life centre" should be removed.
Can anyone help me with this?
Another question: my dataframe was created from a groupby command and doesn't have a column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df
col
a 1 0
2 1
b 1 3
2 4
3 5
c 2 7
3 8
v = df.groupby(level=0)['col'].transform('count').values
df = df[v == 3]
df
col
b 1 3
2 4
3 5
Option 2
groupby + filter. This is Paul H's idea, will remove if he wants to post.
df.groupby(level=0).filter(lambda g: g['col'].count() == 3)
col
b 1 3
2 4
3 5
Option 1
Thinking outside the box...
df.drop(df.count(level=0).col[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
Same thing with a little more robustness because I'm not depending on values in a column.
df.drop(df.index.to_series().count(level=0).loc[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
Option 2
Robustify for general case with undetermined number of seasons.
This uses Pandas version 0.21's groupby.pipe method
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))
col
b 1 3
2 4
3 5
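As a hedged alternative for the same general case (my addition, not from the original answers), transform('size') broadcasts each group's size back onto the rows, so you can keep every group whose size equals the largest one:
import pandas as pd

# the sample frame used in the options above
df = pd.DataFrame({'col': [0, 1, 3, 4, 5, 7, 8]},
                  index=pd.MultiIndex.from_tuples(
                      [('a', 1), ('a', 2), ('b', 1), ('b', 2), ('b', 3), ('c', 2), ('c', 3)]))

sizes = df.groupby(level=0)['col'].transform('size')
print(df[sizes == sizes.max()])  # keeps only the 'b' rows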

Comparing 2 columns on 2 pandas dataframes for equality

I have two dataframes pd and pd2:
pd
Name A B
t1 3 4
t5 2 2
fry 4 5
net 3 3
pd2
Name A B
t1 3 4
t5 2 2
fry 4 5
net 3 3
I want to make sure that the columns 'Name' between the two dataframes match not only the names (t1,t5,etc..) but also they need to be in the same order. I've tried chekS = (df.index == df2.index).all(axis=1).astype(str) with no luck.
Assuming that Name is your index, you either change your axis to 0, or use chekS = sum(df.index != df2.index). If it's not the index, then chekS = sum(df.Name != df2.Name) will work.
If Name is the column not the index as your sample dataframe suggests, you can compare the two columns
(df1['Name'] == df2['Name']).all()
It returns True in this case.
Let's say your df2 is
  Name  A  B
0   t1  3  4
1   t5  2  2
2  net  3  3
3  fry  4  5
I just flipped the rows at index 2 and 3, keeping the values the same. Now
(df1['Name'] == df2['Name']).all()
will return False
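As a side note (my addition, not part of the original answers), Series.equals covers two edge cases the == comparison does not: it returns False instead of raising when the lengths or labels differ, and it treats NaN as equal to NaN in the same position:
df1['Name'].equals(df2['Name'])  # True only if both columns match element-wise, in order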

Attempting to delete multiple rows from Pandas Dataframe but more rows than intended are being deleted

I have a list, to_delete, of row indexes that I want to delete from both of my two Pandas Dataframes, df1 & df2. They both have 500 rows. to_delete has 50 entries.
I run this:
df1.drop(df1.index[to_delete], inplace=True)
df2.drop(df2.index[to_delete], inplace=True)
But this results in df1 and df2 having 250 rows each. It deletes 250 rows from each, and not the 50 specific rows that I want it to...
to_delete is ordered in descending order.
The full method:
def method(results):
    # results is a 500 x 1 matrix of 1s and -1s
    global df1, df2
    deletions = []
    for i in xrange(len(results)-1, -1, -1):
        if results[i] == -1:
            deletions.append(i)
    df1.drop(df1.index[deletions], inplace=True)
    df2.drop(df2.index[deletions], inplace=True)
Any suggestions as to what I'm doing wrong?
(I've also tried using .iloc instead of .index, and deleting in the if statement instead of appending to a list first.)
Your index values are not unique, and when you use drop it removes all rows with those index values. to_delete may have been of length 50, but there were 250 rows that had those particular index values.
Consider the example
df = pd.DataFrame(dict(A=range(10)), [0, 1, 2, 3, 4] * 2)
df
A
0 0
1 1
2 2
3 3
4 4
0 5
1 6
2 7
3 8
4 9
Let's say you want to remove the first, third, and fourth rows.
to_del = [0, 2, 3]
Using your method
df.drop(df.index[to_del])
A
1 1
4 4
1 6
4 9
Which is a problem
Option 1
use np.in1d to find complement of to_del
This is more self-explanatory than the others. I'm looking at an array of positions from 0 to n - 1 and checking whether each one is in to_del. The result will be a boolean array the same length as df. I use ~ to get the negation and use that to slice the dataframe.
df[~np.in1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 2
use np.bincount to find complement of to_del
This accomplishes the same thing as option 1 by counting the positions defined in to_del. I end up with an array of 0s and 1s, with a 1 in each position defined in to_del and 0 elsewhere. I want to keep the 0s, so I make a boolean array by finding where it is equal to 0. I then use this to slice the dataframe.
df[np.bincount(to_del, minlength=len(df)) == 0]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 3
use np.setdiff1d to find positions
This uses set logic to find the difference between a full array of positions and just the ones I want to delete. I then use iloc to select.
df.iloc[np.setdiff1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
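One more hedged option (my addition, not from the original answer): if you do not need to preserve the original labels, making the index unique first lets the plain drop call behave positionally, since the new labels coincide with the positions.
df = df.reset_index(drop=True)  # unique labels 0..n-1, matching positions
df.drop(to_del)                 # now drops exactly the three intended rows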
