Pandas - Select rows based on other rows - python

Let's say I have a dataframe:
df = pd.DataFrame({'A':[1,2,3,3,6,8,1,4]})
df
A
0 1
1 2
2 3
3 3
4 6
5 8
6 1
7 4
And I want to select rows which are preceded by a row that contains a one:
selection
A
1 2
7 4
Right now, I can solve this by selecting the rows that contain ones, taking their indices, adding one to those indices, and then using iloc:
df.iloc[df[df.A == 1].index + 1]
But I'm wondering if there is a "more pandas" way to do this. Further, what if the search were more complicated, like: select all rows preceded by a 1 and followed by a 3? Or what if the index weren't just integers, but timestamps? How do I express inter-row dependencies cleanly?

Solution
df[df.shift().A == 1]
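The shift trick extends to the more complicated searches mentioned in the question; a minimal sketch (the combined condition below is one reading of "preceded by a 1 and followed by a 3"):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 6, 8, 1, 4]})

# shift() looks one row back, shift(-1) one row forward
preceded_and_followed = df[(df.shift().A == 1) & (df.shift(-1).A == 3)]
Because shift works positionally, the same expression works unchanged when the index holds timestamps; for time-aware shifting, shift(freq=...) moves the index instead of the values.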

Related

How to remove rows with multiple occurrences in a row with pandas

I have this data:
A
1 1
2 1
3 1
4 2
5 2
6 1
I expect to get:
A
1 1
- - -> (drop)
3 1
4 2
5 2
6 1
I want to drop all the rows where column 'A' repeats the same value consecutively, but keep the first and the last row of each run.
Until now I used:
df = df.loc[df[col].shift() != df[col]]
but it also removes the last appearance.
Sorry for my bad English, thanks in advance.
Looks like you have the same problem as this question: Pandas drop_duplicates. Keep first AND last. Is it possible?.
The suggested solution is:
pd.concat([
    df['A'].drop_duplicates(keep='first'),
    df['A'].drop_duplicates(keep='last'),
])
Update after clarification:
First get the boolean masks for your described criteria:
is_last = df['A'] != df['A'].shift(-1)
is_duplicate = df['A'] == df['A'].shift()
And drop the rows based on these:
df.drop(df.index[~is_last & is_duplicate]) # note the ~ to negate is_last
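To see why this works, here is a minimal sketch of what the two masks evaluate to on the sample data:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 1]}, index=range(1, 7))

is_last = df['A'] != df['A'].shift(-1)     # True at rows 3, 5, 6 (last of each run)
is_duplicate = df['A'] == df['A'].shift()  # True at rows 2, 3, 5 (repeats previous value)

# ~is_last & is_duplicate is True only at row 2: a repeat that is not a run's end
print(df.drop(df.index[~is_last & is_duplicate]))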
Basically you need to group consecutive numbers, which can be achieved by diff and cumsum:
print(df.groupby(df["A"].diff().ne(0).cumsum(), as_index=False).nth([0, -1]))
A
1 1
3 1
4 2
5 2
6 1
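For intuition, a minimal sketch of the intermediate grouping key this builds:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 1]}, index=range(1, 7))

# diff() is nonzero wherever the value changes; cumsum() turns that into run ids
groups = df['A'].diff().ne(0).cumsum()  # 1, 1, 1, 2, 2, 3

# nth([0, -1]) then keeps the first and last row of each consecutive run
print(df.groupby(groups, as_index=False).nth([0, -1]))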

How to restrict DataFrame number of rows to the Xth unique value in certain column?

Say for example we have the following DataFrame:
A B
1 2
1 2
2 3
3 4
4 5
4 2
And suppose we want to keep only the rows covering the first x (say 3) unique values in column A.
Then the desired output would be:
A B
1 2
1 2
2 3
3 4
I thought about looping through the column in question, counting the number of unique values as I go, and taking the subset of the DataFrame with the right index. I am still a newbie to Python and I believe there must be a more efficient way to do this; please share your solutions. Appreciated!
You can try Series.factorize, which indexes the unique values starting at 0, and then select the rows whose codes are <= n-1 (because the index starts at 0); this preserves order too:
n=3
df[df['A'].factorize()[0]<=n-1]
A B
0 1 2
1 1 2
2 2 3
3 3 4
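For intuition, factorize codes the values in order of first appearance; a minimal sketch on the sample frame:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 4, 4], 'B': [2, 2, 3, 4, 5, 2]})

codes, uniques = df['A'].factorize()
# codes   -> [0, 0, 1, 2, 3, 3]  (one code per row, in order of first appearance)
# uniques -> [1, 2, 3, 4]

n = 3
print(df[codes <= n - 1])  # rows belonging to the first 3 unique values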
You can use np.random.choice to select unique ids at random, then isin to select the rows with those ids:
selected_ids = np.random.choice(df['A'].unique(), replace=False, size=3)
df[df['A'].isin(selected_ids)]
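Note that np.random.choice picks three unique values at random rather than the first three; if the first x unique values are wanted, a minimal sketch of a deterministic variant:
# unique() preserves order of first appearance, so [:3] gives the first 3 unique ids
first_ids = df['A'].unique()[:3]
df[df['A'].isin(first_ids)]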

Making a Multiindexed Pandas Dataframe Non-Symmetric

I have a multi-indexed dataframe which looks roughly like this:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
>>> Output
A B
a b a b
0 1 5 5 1
1 2 4 2 4
2 3 3 3 3
3 4 2 4 2
4 5 1 1 5
In this dataframe, the zeroth row and the fifth row are symmetric in the sense that if the entire A and B columns of the zeroth row are flipped, it becomes identical to the fifth one. Similarly, the second row is symmetric with itself.
I am planning to remove these rows from my original dataframe, thus making it 'non-symmetric'. The specific plans are as follows:
If a row with higher index is symmetric with a row with lower index, keep the lower one and remove the higher one. For example, from the above dataframe, keep the zero-th row and remove the fifth row.
If a row is symmetric with itself, remove that row. For example, from the above dataframe, remove the second row.
My attempt was to first zip the four lists into a tuple list, remove the symmetric tuples by a simple if-statement, unzip them, and merge them back into a dataframe. However, this turned out to be inefficient, making it unscalable for large dataframes.
How can I achieve this in an efficient manner? I guess utilizing several built-in pandas methods is necessary, but it seems quite complicated.
Namudon'tdie,
Try this solution:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
test['idx'] = test.index * 2 # adding auxiliary column 'idx' (all even)
test2 = test.iloc[:, [2,3,0,1,4]] # creating flipped DF
test2.columns = test.columns # fixing column names
test2['idx'] = test2.index * 2 + 1 # for flipped DF column 'idx' is all odd
df = pd.concat([test, test2])
df = df.sort_values(by='idx')
df = df.set_index('idx')
print(df)
A B
a b a b
idx
0 1 5 5 1
1 5 1 1 5
2 2 4 2 4
3 2 4 2 4
4 3 3 3 3
5 3 3 3 3
6 4 2 4 2
7 4 2 4 2
8 5 1 1 5
9 1 5 5 1
df = df.drop_duplicates() # remove rows with duplicates
df = df[df.index%2 == 0] # remove rows with odd idx (flipped)
df = df.reset_index()[['A', 'B']]
print(df)
A B
a b a b
0 1 5 5 1
1 2 4 2 4
2 3 3 3 3
3 4 2 4 2
The idea is to create flipped rows with odd indexes, so that they will be placed under their original rows after reindexing. Then delete duplicates, keeping rows with lower indices. For cleanup simply delete remaining rows with odd indices.
Note that the row [3,3,3,3] stayed. There should be a separate filter to take care of self-symmetric rows. Since your definition of self-symmetric is unclear (other rows have a certain degree of symmetry too), I leave this part to you; it should be straightforward.
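As one possible reading, a minimal sketch of such a filter, assuming a row counts as self-symmetric when its A sub-columns equal its B sub-columns (so flipping leaves it unchanged):
# compare the A block to the B block row by row
self_symmetric = (df['A'].values == df['B'].values).all(axis=1)
df = df[~self_symmetric]  # drops the remaining [3,3,3,3] row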

Delete rows in Dataframe based on condition in another Dataframe

I am familiar with how to remove rows within a Dataframe based on a condition, as in:
df1 = df1.drop(df1[<some boolean condition>].index)
Let df1 and df2 be equally sized DataFrames. The problem is to remove from df2 the rows at the same indices that satisfy the aforementioned condition for df1. I am looking for an elegant solution instead of keeping the indexes and then iterating over them again for df2.
Example:
df1
index value
1 4
2 5
3 6
4 3
1 1
2 5
1 3
2 3
3 2
4 2
5 1
6 7
7 12
df2
index value
1 4
2 5
3 7
4 3
1 1
2 109
1 44
2 3
3 2
4 2
5 1
6 7
7 12
The indexing is not consecutive, so a simple df.drop won't work. It's based on groups created before.
First you should fix the indexing in your dataframes. What you want to do will not work unless the indexes are consecutive, since you will remove multiple rows by deleting by index. You should try to avoid many-to-many relationships in data analytics; they simply cause more problems than they solve.
Try something like this:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

to_drop = []
for idx, row in df1.iterrows():
    if row['columnname'] == 2:  # imaginary value; place your Boolean condition here
        to_drop.append(idx)

df1 = df1.drop(to_drop)
df2 = df2.drop(to_drop)
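A vectorized alternative, assuming the two frames are row-aligned after the reset (columnname and the value 2 are placeholders as above):
# build the condition once on df1
mask = (df1['columnname'] == 2).values  # .values keeps the mask positional

# keep the rows where the condition does NOT hold, in both frames
df1 = df1[~mask]
df2 = df2[~mask]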

Attempting to delete multiple rows from Pandas Dataframe but more rows than intended are being deleted

I have a list, to_delete, of row indexes that I want to delete from both of my two Pandas Dataframes, df1 & df2. They both have 500 rows. to_delete has 50 entries.
I run this:
df1.drop(df1.index[to_delete], inplace=True)
df2.drop(df2.index[to_delete], inplace=True)
But this results in df1 and df2 having 250 rows each. It deletes 250 rows from each, and not the 50 specific rows that I want it to...
to_delete is ordered in descending order.
The full method:
def method(results):
    # results is a 500 x 1 matrix of 1s and -1s
    global df1, df2
    deletions = []
    for i in xrange(len(results) - 1, -1, -1):
        if results[i] == -1:
            deletions.append(i)
    df1.drop(df1.index[deletions], inplace=True)
    df2.drop(df2.index[deletions], inplace=True)
Any suggestions as to what I'm doing wrong?
(I've also tried using .iloc instead of .index, and deleting in the if statement instead of appending to a list first.)
Your index values are not unique, and when you use drop it removes all rows with those index values. to_delete may have had length 50, but there were 250 rows that had those particular index values.
Consider the example
df = pd.DataFrame(dict(A=range(10)), [0, 1, 2, 3, 4] * 2)
df
A
0 0
1 1
2 2
3 3
4 4
0 5
1 6
2 7
3 8
4 9
Let's say you want to remove the first, third, and fourth rows.
to_del = [0, 2, 3]
Using your method
df.drop(df.index[to_del])
A
1 1
4 4
1 6
4 9
Which is a problem
Option 1
Use np.in1d to find the complement of to_del
This is more self-explanatory than the others. I'm checking, for an array of positions from 0 to n, whether each position is in to_del. The result is a boolean array the same length as df. I use ~ to get the negation and use that to slice the dataframe.
df[~np.in1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 2
Use np.bincount to find the complement of to_del
This accomplishes the same thing as Option 1 by counting the positions defined in to_del. I end up with an array of 0s and 1s, with a 1 in each position defined in to_del and 0 elsewhere. I want to keep the 0s, so I make a boolean array by finding where the counts equal 0. I then use this to slice the dataframe.
df[np.bincount(to_del, minlength=len(df)) == 0]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 3
Use np.setdiff1d to find the positions to keep
This uses set logic to find the difference between the full array of positions and just the ones I want to delete. I then use iloc to select.
df.iloc[np.setdiff1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
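As an aside, newer NumPy versions expose the same membership test as np.isin, which reads a little more naturally than np.in1d; a minimal sketch of Option 1 with it:
import numpy as np

df[~np.isin(np.arange(len(df)), to_del)]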
