Find duplicates between 2 columns (order independent), count and drop - python

I'm trying to find the duplicates between 2 columns, where order is independent, but I need to keep the count of duplicates after dropping them:
df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])
This is my expected result
source target count
0 A B 2
1 D B 1
3 B C 2
I've already tried several approaches, but I can't come close to a solution.
It does not matter which combination is maintained. In the result example I kept the first.

The following approach creates a new column containing a set of the values in the columns specified. The advantage is that all other columns are preserved in the final result. Furthermore, the indices are preserved the same way as in the expected output you posted:
df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])
# Create a column with the set of both columns (frozenset is hashable, so it can be grouped on)
df['tmp'] = df.apply(lambda x: frozenset([x['source'], x['target']]), axis=1)
# Create the count column based on the new tmp column
df['count'] = df.groupby(['tmp'])['target'].transform('size')
# Drop duplicate rows based on the new tmp column
df = df[~df.duplicated(subset='tmp', keep='first')]
# Remove the tmp column
df = df.drop(columns='tmp')
df
Output:
source target count
0 A B 2
1 D B 1
3 B C 2
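A variation on the same idea (a sketch I'm adding, not part of the original answer): the order-independent key can also be built without a row-wise apply, by sorting the two values in each row with numpy.
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 'B'], ['D', 'B'], ['B', 'A'], ['B', 'C'], ['C', 'B']],
                  columns=['source', 'target'])

# Sort each row's pair so that ('B', 'A') and ('A', 'B') produce the same key
key = pd.Series([tuple(row) for row in np.sort(df[['source', 'target']].to_numpy(), axis=1)],
                index=df.index)

df['count'] = key.groupby(key).transform('size')
df = df[~key.duplicated(keep='first')]
df
#   source target  count
# 0      A      B      2
# 1      D      B      1
# 3      B      C      2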

You can use df.duplicated() to see which rows are duplicated: the output is True if the row is a duplicate and False if it isn't. For more info and a practical example, check out the documentation.
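For illustration, here is a minimal sketch (my own, building on the question's frame) of what duplicated returns when applied to an order-independent key:
import pandas as pd

df = pd.DataFrame([['A', 'B'], ['D', 'B'], ['B', 'A'], ['B', 'C'], ['C', 'B']],
                  columns=['source', 'target'])

# Build an order-independent key, then ask which rows repeat an earlier key
key = df.apply(lambda x: frozenset([x['source'], x['target']]), axis=1)
key.duplicated(keep='first')
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool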

Create a summary based on applying a frozenset to your desired columns. Here we're using all columns.
summary = df.apply(frozenset, axis=1).value_counts()
This'll give you a Series of:
(A, B) 2
(C, B) 2
(B, D) 1
dtype: int64
You can then reconstruct a DataFrame by iterating over that Series, e.g.:
df2 = pd.DataFrame(((*idx, val) for idx, val in summary.items()), columns=[*df.columns, 'count'])
Which results in:
source target count
0 A B 2
1 C B 2
2 B D 1
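Note that unpacking a frozenset does not preserve the original source/target orientation, which is why some pairs in the reconstructed frame (for example B D) appear flipped relative to the input; if the original orientation matters, the first approach above, which keeps the existing rows, is the safer choice.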

Related

Lambda function with groupby and condition issue

I want to count the unique values of column B for each unique value in column A, where the corresponding column C > 0.
df:
A B C
1 10 0
1 12 3
2 3 1
I tried this, but it's missing the where clause to filter for C > 0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's start by creating the dataframe that the OP mentions in the question:
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
Now, in order to achieve what the OP wants, there are various options. One way is to select the part of df where column C is greater than 0, then use pandas.DataFrame.groupby to group by column A, and finally use nunique to count the unique values of column B per group. In one line it looks like the following:
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1 1
2 1
If one wants to sum the counts of unique items that satisfy the condition from the Series count above, one can simply do
count = count.sum()
[Out]:
2
Assuming one wants to do everything in one line, one can use pandas.Series.sum as follows:
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2
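As an aside (not from the original answer), the same filter-then-count pattern can be written with named aggregation if a labelled result column is preferred; the column name unique_B is purely illustrative:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [10, 12, 3], 'C': [0, 3, 1]})

# Filter rows where C > 0, then count distinct B values per value of A
result = df[df['C'] > 0].groupby('A', as_index=False).agg(unique_B=('B', 'nunique'))
result
#    A  unique_B
# 0  1         1
# 1  2         1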

Drop a column which is a subset of any other column in a dataframe

I have a pandas dataframe as below. How can I drop any column which is a subset of any of the remaining columns? I would like to do this without using fillna.
df = pd.DataFrame([[1, 1, 3, 3], [np.nan, 2, np.nan, 4]], columns=['A', 'B', 'C', 'D'])
df
A B C D
0 1.0 1 3.0 3
1 NaN 2 NaN 4
I can identify here that column A is subset of B and column C is a subset of D with something like this:
if all(df['A'][df['A'].notnull()].isin(df['B'])):
I could run a loop over all columns and drop the subset columns. But is there a more efficient way to accomplish this, so that I have the following result:
df
B D
0 1 3
1 2 4
Thanks.
It still requires iteration, but you can use this list comprehension (with an if statement similar to the one you provided) to get columns to keep:
keep_cols = [x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))]
# ['B', 'D']
And then use the result with filter:
df.filter(items=keep_cols)
# B D
# 0 1 3
# 1 2 4
This should be fast enough, since it still uses apply at its core, and seems to be safer/more efficient than dropping columns within a loop.
If you're keen on a one-line solution, of course assigning the list to a variable is an optional step:
df.filter(items=[x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))])
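For comparison, here is a more explicit sketch of the same isin-based subset test, spelled out as a loop over column pairs (an alternative I'm adding, not the answer's method):
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 3], [np.nan, 2, np.nan, 4]], columns=['A', 'B', 'C', 'D'])

to_drop = set()
for a, b in combinations(df.columns, 2):
    # a is a subset of b if every non-null value of a also appears in b
    if df[a].dropna().isin(df[b]).all():
        to_drop.add(a)
    elif df[b].dropna().isin(df[a]).all():
        to_drop.add(b)

df.drop(columns=list(to_drop))
#    B  D
# 0  1  3
# 1  2  4
One difference worth noting: if two columns are exact duplicates of each other, this sketch keeps one of them rather than dropping both.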

Keep Columns When Aggregating an Empty DataFrame

I'm working in pandas 0.18.0 on python 2.7.9.
Take a sample DataFrame and group by a few columns, then sum over a different column for the result, like this:
>>> df = pandas.DataFrame([[1,2,3],[4,5,6],[1,2,9]], columns=['a','b','c'])
>>> print df
a b c
0 1 2 3
1 4 5 6
2 1 2 9
>>> df.groupby(['a','b'], as_index=False)['c'].sum()
a b c
0 1 2 12
1 4 5 6
That all looks great, but when the same operation is performed on an empty DataFrame, the columns are dropped from the result:
>>> empty = pandas.DataFrame(columns=['a','b','c'])
>>> print empty
Empty DataFrame
Columns: [a, b, c]
Index: []
>>> empty.groupby(['a','b'], as_index=False)['c'].sum()
Empty DataFrame
Columns: []
Index: []
Were someone to reference valid columns from the result later in the code, a KeyError would result. Is there a way to keep the columns?
I believe this is a standard result of groupby.sum() (see here http://pandas.pydata.org/pandas-docs/stable/missing_data.html).
The only way I can think of would be to write an if statement checking whether the dataframe is empty, e.g.:
if empty.empty:
    print "empty dataframe"
else:
    result = empty.groupby(['a','b'], as_index=False)['c'].sum()
This should keep your empty dataframe with column headers.
Hope this helps.
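Another option (a sketch, not from the original answer) is to skip the emptiness check and simply reindex the aggregated result onto the original columns; note that newer pandas versions may already preserve the columns in this case:
import pandas as pd

empty = pd.DataFrame(columns=['a', 'b', 'c'])

# Aggregate as usual, then restore any columns dropped by the empty groupby
result = empty.groupby(['a', 'b'], as_index=False)['c'].sum()
result = result.reindex(columns=empty.columns)
result
# Empty DataFrame
# Columns: [a, b, c]
# Index: []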

set multiple Pandas DataFrame columns to values in a single column or multiple scalar values at the same time

I'm trying to set multiple new columns to one existing column and, separately, multiple new columns to multiple scalar values. I can't do either. Is there any way to do it other than setting each one individually?
df = pd.DataFrame(columns=['A','B'], data=np.arange(6).reshape(3,2))
# Neither of these works:
df.loc[:, ['C','D']] = df['A']
df.loc[:, ['C','D']] = [0, 1]
# Setting each column individually does work:
for c in ['C', 'D']:
    df[c] = df['A']
df['C'] = 0
df['D'] = 1
Maybe this is what you are looking for:
df=pd.DataFrame(columns=['A','B'],data=np.arange(6).reshape(3,2))
df['C'], df['D'] = df['A'], df['A']
df['E'], df['F'] = 0, 1
# Result
A B C D E F
0 0 1 0 0 0 1
1 2 3 2 2 0 1
2 4 5 4 4 0 1
The assign method will create multiple new columns in one step. You can pass a dict of columns and values to return a new DataFrame with the new columns appended to the end.
Using your examples:
df = df.assign(**{'C': df['A'], 'D': df['A']})
and
df = df.assign(**{'C': 0, 'D':1})
See this answer for additional detail: https://stackoverflow.com/a/46587717/4843561
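Both cases can also be combined into a single assign call, since the keyword arguments accept Series and scalars alike; a small sketch building on the frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['A', 'B'], data=np.arange(6).reshape(3, 2))

# Series-valued and scalar-valued new columns in one step
df = df.assign(C=df['A'], D=df['A'], E=0, F=1)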

Comparing rows of pandas dataframe (rows have some overlapping values)

I have a pandas dataframe with 21 columns. I am focusing on a subset of rows that have exactly the same column values except for 6 that are unique to each row. I don't know a priori which column headings these 6 values correspond to.
I tried converting each row to an Index object and performing a set operation on two rows, e.g.
row1 = pd.Index(sample_data[0])
row2 = pd.Index(sample_data[1])
row1 - row2
which returns an Index object containing values unique to row1. Then I can manually deduce which columns have unique values.
How can I programmatically grab the column headings that these values correspond to in the initial dataframe? Or, is there a way to compare two or multiple dataframe rows and extract the 6 different column values of each row, as well as the corresponding headings? Ideally, it would be nice to generate a new dataframe with the unique columns.
In particular, is there a way to do this using set operations?
Thank you.
Here's a quick solution to return only the columns in which the first two rows differ.
In [13]: df = pd.DataFrame(list(zip(*[range(5), list('abcde'), list('aaaaa'), list('bbbbb')])),
    ...:                    columns=list('ABCD'))
In [14]: df
Out[14]:
A B C D
0 0 a a b
1 1 b a b
2 2 c a b
3 3 d a b
4 4 e a b
In [15]: df[df.columns[df.iloc[0] != df.iloc[1]]]
Out[15]:
A B
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
And a solution to find all columns with more than one unique value throughout the entire frame.
In [33]: df[df.columns[df.apply(lambda s: len(s.unique()) > 1)]]
Out[33]:
A B
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
You don't really need the Index objects; you can just compare two rows directly and use the result to filter the columns with a list comprehension.
df = pd.DataFrame({"col1": np.ones(10), "col2": np.ones(10), "col3": range(2, 12)})
row1 = df.iloc[0]
row2 = df.iloc[1]
unique_columns = row1 != row2
cols = [colname for colname, is_unique in zip(df.columns, unique_columns) if is_unique]
print(cols)  # ['col3']
If you know the standard value for each column, you can convert all the rows to a list of booleans, i.e.:
standard_row = np.ones(3)
columns = df.columns
unique_columns = df.apply(lambda x: x != standard_row, axis=1)
unique_columns.apply(lambda x: [col for col, unique_column in zip(columns, x) if unique_column], axis=1)
Further to Jeff Tratner's answer:
produce a truth table of which cells differ between two rows (selected in this case by their index positions):
uq = di2.iloc[0] != di2.iloc[1]
Get the list of columns where the two rows differ:
uq[uq == True].index.to_list()
Or get the list of columns where the cells are identical:
uq[uq != True].index.to_list()
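To then build the new frame the question asks for, one can keep only the columns whose values vary across the rows being compared; a short sketch, where the row selection is purely illustrative:
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': list('abcde'), 'C': list('aaaaa'), 'D': list('bbbbb')})

# Pick the rows under comparison (here the first two, purely for illustration)
subset = df.iloc[[0, 1]]

# Keep only the columns that are not constant within that subset
unique_df = subset.loc[:, subset.nunique() > 1]
unique_df
#    A  B
# 0  0  a
# 1  1  b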
