Delete rows in Dataframe based on condition in another Dataframe


I am familiar with how to remove rows from a DataFrame based on a condition, as in:
df1 = df1.drop(df1[<some boolean condition>].index)
Let df1 and df2 be equally sized DataFrames. The problem is to remove from df2 the rows at the same positions as the rows of df1 that satisfy the aforementioned condition. I am looking for an elegant solution instead of storing the indexes and then iterating over them again for df2.
Example:
df1
index value
1 4
2 5
3 6
4 3
1 1
2 5
1 3
2 3
3 2
4 2
5 1
6 7
7 12
df2
index value
1 4
2 5
3 7
4 3
1 1
2 109
1 44
2 3
3 2
4 2
5 1
6 7
7 12
The index is not consecutive, so a simple df.drop won't work. It's based on groups created earlier.

First, you should fix the indexing in your dataframes. What you want to do will not work unless the index labels are unique, because dropping by label removes every row that shares that label. You should try to avoid many-to-many relationships in data analytics; they simply cause more problems than they solve.
Try something like this:
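# reset_index returns a new frame, so assign the result; the old labels are kept as a column
df1 = df1.reset_index()
df2 = df2.reset_index()
# imaginary value, place Boolean condition here
to_drop = df1.index[df1['columnname'] == 2]
# the labels are now unique positions, so the same labels can be dropped from both frames
df1 = df1.drop(to_drop)
df2 = df2.drop(to_drop)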
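
If the group-based index needs to be kept, resetting it is not strictly required. The following is a minimal sketch, assuming both frames have the same number of rows in the same order. The condition `value > 5` is a hypothetical stand-in for the real Boolean condition, and the names `drop_mask`, `df1_kept` and `df2_kept` are illustrative only.

```python
import pandas as pd

# Sample data from the question; the index labels repeat on purpose.
idx = [1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5, 6, 7]
df1 = pd.DataFrame({"value": [4, 5, 6, 3, 1, 5, 3, 3, 2, 2, 1, 7, 12]}, index=idx)
df2 = pd.DataFrame({"value": [4, 5, 7, 3, 1, 109, 44, 3, 2, 2, 1, 7, 12]}, index=idx)

# Hypothetical condition (not from the question): rows where df1's value exceeds 5.
# Converting the mask to a NumPy array makes it purely positional, so the
# repeated index labels play no part in the selection.
drop_mask = (df1["value"] > 5).to_numpy()

# Keep the rows that do NOT satisfy the condition, in both frames.
df1_kept = df1[~drop_mask]
df2_kept = df2[~drop_mask]

print(df1_kept)
print(df2_kept)
```

Because the mask is applied by position, exactly the same three positions are removed from both frames, and the original labels of the remaining rows are left untouched.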

Related

Pandas merge multiple value columns into a value and type column

I have a pandas dataframe where there are multiple integer value columns denoting a count. I want to transform this dataframe such that the value columns are merged into one column but another column is created denoting the column the value was taken from.
Input
a b c
0 2 5 8
1 3 6 9
2 4 7 10
Output
count type
0 2 a
1 3 a
2 4 a
3 5 b
4 6 b
5 7 b
6 8 c
7 9 c
8 10 c
I'm sure this is possible by looping over the entries and creating as many rows as needed for each original row, but I'm sure there is a pandas way to achieve this, and I would like to know what it is called.

You could do that with the following
pd.melt(df, value_vars=['a','b','c'], value_name='count', var_name='type')
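
As a self-contained sketch of the call above, using the input from the question: pd.melt places the variable column before the value column, so the final column selection (an addition here, not part of the original answer) reorders them to match the requested output. The name `long_df` is illustrative.

```python
import pandas as pd

# Input frame from the question.
df = pd.DataFrame({"a": [2, 3, 4], "b": [5, 6, 7], "c": [8, 9, 10]})

# Unpivot the three value columns into one 'count' column,
# recording the source column name in 'type'.
long_df = pd.melt(df, value_vars=["a", "b", "c"], value_name="count", var_name="type")

# melt returns the columns as ['type', 'count']; reorder to match the question.
long_df = long_df[["count", "type"]]

print(long_df)
```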

How to restrict DataFrame number of rows to the Xth unique value in certain column?

Say for example we have the following DataFrame:
A B
1 2
1 2
2 3
3 4
4 5
4 2
Suppose we want to keep only the rows up to the x-th (say, 3rd) unique value in column A.
Then the desired output would be:
A B
1 2
1 2
2 3
3 4
I thought about looping through the column in question, counting the number of unique values by tracking and taking the subset of the DataFrame with the right index. I am still a newbie to Python and I believe there would be a more efficient way to do this, please share your solutions. Appreciated!

You can try Series.factorize, which assigns each unique value an integer code starting at 0 in order of first appearance. Then select the rows whose code is <= n-1 (because the codes start at 0); this preserves the row order too:
n=3
df[df['A'].factorize()[0]<=n-1]
A B
0 1 2
1 1 2
2 2 3
3 3 4
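
A minimal runnable sketch of the factorize approach, built on the sample data from the question. The names `codes`, `uniques` and `first_n` are illustrative.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({"A": [1, 1, 2, 3, 4, 4], "B": [2, 2, 3, 4, 5, 2]})
n = 3

# factorize returns one integer code per row (numbered by first appearance)
# together with the array of unique values.
codes, uniques = pd.factorize(df["A"])

# Codes 0 .. n-1 correspond to the first n unique values.
first_n = df[codes < n]

print(codes)
print(first_n)
```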

You can use np.random.choice to select unique ids, then isin to select the rows with those ids. Note that this picks the ids at random rather than taking the first three in order:
selected_ids = np.random.choice(df['A'].unique(), replace=False, size=3)
df[df['A'].isin(selected_ids)]

Pandas Dataframe: How to check if a column contains consecutive integers, and if not, how to add and fill 0

Assume we have a dataframe like this:
Order Value
1 10
2 3
3 5
5 34
7 23
Is there a way to test whether a column (in this case 'Order') contains consecutive integers? And if it doesn't, is there a way to fill in the missing rows with the corresponding integers and set the other column (in this case 'Value') to 0, or to another specific value?
In the case above, two rows should be added: (4, 0) and (6, 0).
Thank you!

You can use reindex after setting Order as index. With the fill_value argument you can choose which value will be used for filling in new rows.
df.set_index('Order').reindex(pd.RangeIndex(df.Order.min(), df.Order.max()+1), fill_value=0).reset_index()
Out:
Order Value
0 1 10
1 2 3
2 3 5
3 4 0
4 5 34
5 6 0
6 7 23
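
The one-liner above can be split into steps, together with a check for gaps that the question also asks about. This is a sketch under two assumptions that are not stated in the original: the Order values are unique and sorted in ascending order. The names `is_consecutive`, `full_range` and `filled` are illustrative.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({"Order": [1, 2, 3, 5, 7], "Value": [10, 3, 5, 34, 23]})

# Consecutive integers differ by exactly 1 from one row to the next
# (assumes Order is sorted ascending).
is_consecutive = bool((df["Order"].diff().dropna() == 1).all())

# Build the complete range and reindex onto it; rows that did not exist
# are created with Value set to fill_value.
full_range = pd.RangeIndex(df["Order"].min(), df["Order"].max() + 1, name="Order")
filled = df.set_index("Order").reindex(full_range, fill_value=0).reset_index()

print(is_consecutive)
print(filled)
```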

Making a Multiindexed Pandas Dataframe Non-Symmetric

I have a multi-indexed dataframe which looks roughly like this:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
>>> Output
   A     B
   a  b  a  b
0  1  5  5  1
1  2  4  2  4
2  3  3  3  3
3  4  2  4  2
4  5  1  1  5
In this dataframe, row 0 and row 4 (the first and fifth rows) are symmetric in the sense that if the entire A and B columns of row 0 are swapped, it becomes identical to row 4. Similarly, row 2 is symmetric with itself.
I am planning to remove these rows from my original dataframe, thus making it 'non-symmetric'. The rules are as follows:
If a row with a higher index is symmetric with a row with a lower index, keep the lower one and remove the higher one. For example, in the dataframe above, keep row 0 and remove row 4.
If a row is symmetric with itself, remove that row. For example, in the dataframe above, remove row 2.
My attempt was to first zip the four lists into a tuple list, remove the symmetric tuples by a simple if-statement, unzip them, and merge them back into a dataframe. However, this turned out to be inefficient, making it unscalable for large dataframes.
How can I achieve this in an efficient manner? I guess utilizing several built-in pandas methods is necessary, but it seems quite complicated.

Namudon'tdie,
Try this solution:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
test['idx'] = test.index * 2 # adding auxiliary column 'idx' (all even)
test2 = test.iloc[:, [2,3,0,1,4]].copy() # creating flipped DF (copy avoids SettingWithCopyWarning below)
test2.columns = test.columns # fixing column names
test2['idx'] = test2.index * 2 + 1 # for flipped DF column 'idx' is all odd
df = pd.concat([test, test2])
df = df.sort_values(by='idx')
df = df.set_index('idx')
print(df)
     A     B
     a  b  a  b
idx
0    1  5  5  1
1    5  1  1  5
2    2  4  2  4
3    2  4  2  4
4    3  3  3  3
5    3  3  3  3
6    4  2  4  2
7    4  2  4  2
8    5  1  1  5
9    1  5  5  1
df = df.drop_duplicates() # remove rows with duplicates
df = df[df.index%2 == 0] # remove rows with odd idx (flipped)
df = df.reset_index()[['A', 'B']]
print(df)
   A     B
   a  b  a  b
0  1  5  5  1
1  2  4  2  4
2  3  3  3  3
3  4  2  4  2
The idea is to create flipped rows with odd indexes, so that they will be placed under their original rows after reindexing. Then delete duplicates, keeping rows with lower indices. For cleanup simply delete remaining rows with odd indices.
Note that row [3,3,3,3] stayed. A separate filter is needed to take care of self-symmetric rows. Since your definition of self-symmetric is unclear (other rows have a certain degree of symmetry too), I leave this part to you. It should be straightforward.
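
One possible filter for the self-symmetric case is sketched below. It rests on an assumption that the question does not confirm: a row counts as self-symmetric when its A sub-columns equal its B sub-columns, so that swapping A and B leaves the row unchanged. Under that reading rows 1, 2 and 3 all qualify, not only row 2. The names `a_vals`, `b_vals`, `self_symmetric` and `without_self_symmetric` are illustrative.

```python
import pandas as pd

test = pd.DataFrame({
    ("A", "a"): [1, 2, 3, 4, 5],
    ("A", "b"): [5, 4, 3, 2, 1],
    ("B", "a"): [5, 2, 3, 4, 1],
    ("B", "b"): [1, 4, 3, 2, 5],
})

# Selecting a top-level key of the column MultiIndex returns the sub-frame (a, b).
a_vals = test["A"].to_numpy()
b_vals = test["B"].to_numpy()

# ASSUMPTION: "self-symmetric" means the A block equals the B block in that row.
self_symmetric = (a_vals == b_vals).all(axis=1)

without_self_symmetric = test[~self_symmetric]

print(self_symmetric)
print(without_self_symmetric)
```

If a narrower definition is intended (for example, only rows where all four values are equal), the comparison that builds `self_symmetric` would need to change accordingly.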

Attempting to delete multiple rows from Pandas Dataframe but more rows than intended are being deleted

I have a list, to_delete, of row indexes that I want to delete from both of my two Pandas Dataframes, df1 & df2. They both have 500 rows. to_delete has 50 entries.
I run this:
df1.drop(df1.index[to_delete], inplace=True)
df2.drop(df2.index[to_delete], inplace=True)
But this results in df1 and df2 having 250 rows each. It deletes 250 rows from each, and not the 50 specific rows that I want it to...
to_delete is ordered in descending order.
The full method:
def method(results):
    # results is a 500 x 1 matrix of 1s and -1s
    global df1, df2
    deletions = []
    for i in range(len(results) - 1, -1, -1):
        if results[i] == -1:
            deletions.append(i)
    df1.drop(df1.index[deletions], inplace=True)
    df2.drop(df2.index[deletions], inplace=True)
Any suggestions as to what I'm doing wrong?
(I've also tried using .iloc instead of .index, and deleting inside the if statement instead of appending to a list first.)

Your index values are not unique and when you use drop it is removing all rows with those index values. to_delete may have been of length 50 but there were 250 rows that had those particular index values.
Consider the example
df = pd.DataFrame(dict(A=range(10)), [0, 1, 2, 3, 4] * 2)
df
A
0 0
1 1
2 2
3 3
4 4
0 5
1 6
2 7
3 8
4 9
Let's say you want to remove the first, third, and fourth rows.
to_del = [0, 2, 3]
Using your method
df.drop(df.index[to_del])
A
1 1
4 4
1 6
4 9
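This is a problem: only four rows are left instead of seven, because every row sharing a label with one of the dropped positions was removed as well.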
Option 1
use np.in1d to find complement of to_del
This is more self explanatory than the others. I'm looking in an array from 0 to n and seeing if it is in to_del. The result will be a boolean array the same length as df. I use ~ to get the negation and use that to slice the dataframe.
df[~np.in1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 2
use np.bincount to find complement of to_del
This accomplishes the same thing as option 1 by counting the positions defined in to_del. I end up with an array of 0 and 1 with a 1 in each position defined in to_del and 0 else where. I want to keep the 0s so I make a boolean array by finding where it is equal to 0. I then use this to slice the dataframe.
df[np.bincount(to_del, minlength=len(df)) == 0]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 3
use np.setdiff1d to find positions
This uses set logic to find the difference between a full array of positions and just the ones I want to delete. I then use iloc to select.
df.iloc[np.setdiff1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
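
For reference, a self-contained sketch of the positional-mask idea from Option 1, written with np.isin. Recent NumPy releases recommend np.isin in place of np.in1d; the swap is made here for that reason and is not part of the original answer. The names `positions`, `keep` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

# Frame with a repeating (non-unique) index, as in the example above.
df = pd.DataFrame({"A": range(10)}, index=[0, 1, 2, 3, 4] * 2)
to_del = [0, 2, 3]  # row POSITIONS to remove, not index labels

# Boolean mask over positions 0 .. len(df)-1: True where the row is kept.
positions = np.arange(len(df))
keep = ~np.isin(positions, to_del)

# Indexing with a NumPy boolean array selects by position, so the
# duplicate labels are not a problem.
result = df[keep]

print(result)
```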
