Finding the Location of the Duplicate for Duplicated Columns in Pandas - python

I know I can find duplicate columns using:
df.T.duplicated()
What I'd like to know is the index of the column that a duplicate column is a duplicate of. For example, both C and D are duplicates of A below:
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
A B C D
0 1 0 1 1
1 2 0 2 2
I'd like something like:
duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])

I don't know if duplicated has an option to report the first occurrence with the same data. My idea is to use groupby and transform, such as:
arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
                 .transform(lambda x: x.iloc[0]).values)
With your example, arr_first is then equal to array(['A', 'B', 'A', 'A'], dtype=object), and because it has the same order as df.columns, you can get the expected output with np.where:
duplicate_index = pd.Series(np.where(arr_first != df.columns, arr_first, None), df.columns)
and the result for duplicate_index is
A None
B None
C A
D A
dtype: object
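Putting the pieces together, here is a minimal end-to-end sketch of that approach (it needs import numpy as np, since pd.np is removed in recent pandas):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])

# Group the transposed frame by its row values; identical columns of df fall into the same group.
arr_first = (df.T.reset_index()
               .groupby([col for col in df.T.columns])['index']
               .transform(lambda x: x.iloc[0])
               .values)

# Replace each column's own name with None so only true duplicates point elsewhere.
duplicate_index = pd.Series(np.where(arr_first != df.columns, arr_first, None), df.columns)
print(duplicate_index)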

Another, more direct way to test whether two numeric columns are duplicates of each other is to look at the correlation matrix, which tests all pairs of columns. Here is the code:
import pandas as pd
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
# compute the correlation matrix
cm = df.corr()
cm
This shows a matrix of the correlation of every column with every other column (including itself). If a column is a 1:1 match of another column, the value is 1.0 (note that 1.0 only indicates a perfect linear relationship, not necessarily identical values).
To find all columns that are duplicates of A:
cm['A']
A 1.0
B NaN
C 1.0
D 1.0
If you have categorical columns (string objects) rather than numeric ones, you could build a cross-correlation table instead.
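One hedged way to approximate that for string columns (an illustration, not necessarily the exact method the answer had in mind) is to factorize each column into integer codes and then correlate the codes; identical columns always show 1.0, though the converse is not guaranteed:

import pandas as pd

# Hypothetical frame with string columns; X and Z are identical.
sdf = pd.DataFrame({'X': ['u', 'v', 'u'], 'Y': ['p', 'q', 'q'], 'Z': ['u', 'v', 'u']})

codes = sdf.apply(lambda col: pd.factorize(col)[0])  # integer codes per column
print(codes.corr())  # X and Z correlate at 1.0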
Hope this helps!

Related

How to drop rows based on column value if column is not set as index in pandas?

I have a list and a dataframe which look like this:
list = ['a', 'b']
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]})
I would like to do something like this:
df1 = df.drop([x for x in list])
I am getting the following error message:
"KeyError: "['a' 'b'] not found in axis""
I know I can do the following:
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]}).set_index('A')
df1 = df.drop([x for x in list])
How can I drop the list values without having to set column 'A' as index? My dataframe has multiple columns.
Input:
A B
0 a 9
1 b 9
2 c 8
3 d 4
Code:
for i in list:
    ind = df[df['A'] == i].index.tolist()
    df = df.drop(ind)
df
Output:
A B
2 c 8
3 d 4
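As an aside (a sketch that is not in the original answers), boolean indexing with Series.isin avoids the explicit loop:

import pandas as pd

values_to_drop = ['a', 'b']  # renamed from `list` to avoid shadowing the built-in
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [9, 9, 8, 4]})

# Keep only the rows whose 'A' value is not in the list.
df1 = df[~df['A'].isin(values_to_drop)]
print(df1)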
You need to specify the correct axis, namely axis=1.
According to the docs:
axis{0 or ‘index’, 1 or ‘columns’}, default 0. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
You also need to make sure you are using the correct column names (i.e. uppercase rather than lowercase). And you should avoid shadowing Python built-in names, so don't use list as a variable name.
This should work:
myList = ['A', 'B']
df1 = df.drop([x for x in myList], axis=1)

Apply a specific function for two columns in Pandas DataFrame

I have a Pandas DataFrame with two columns; each row contains a list of elements. I'm trying to find the set difference between the two columns for each row using the pandas apply method.
My df for example
A B
0 ['a','b','c'] ['a']
1 ['e', 'f', 'g'] ['f', 'g']
So it should look like this:
df.apply(set_diff_func, axis=1)
What I'm trying to achieve:
0 ['b','c']
1 ['e']
I can do it using iterrows, but I've read that it's better to use apply when possible.
How about
df.apply(lambda row: list(set(row['A']) - set(row['B'])), axis = 1)
or
(df['A'].apply(set) - df['B'].apply(set)).apply(list)
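If performance matters, a plain list comprehension over the two columns is often faster than apply; a small self-contained sketch (the column names follow the example above):

import pandas as pd

df = pd.DataFrame({'A': [['a', 'b', 'c'], ['e', 'f', 'g']],
                   'B': [['a'], ['f', 'g']]})

# Zip the two list-columns and take the set difference row by row.
diff = [list(set(a) - set(b)) for a, b in zip(df['A'], df['B'])]
print(diff)  # [['b', 'c'], ['e']] up to ordering, since sets are unordered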
Here's the function you need; you can change the names of the columns with the col1 and col2 arguments by passing them to the args option of apply:
def set_diff_func(row, col1, col2):
    return list(set(row[col1]).difference(set(row[col2])))
This should return the required result:
>>> dataset = pd.DataFrame(
...     [{'A': ['a', 'b', 'c'], 'B': ['a']},
...      {'A': ['e', 'f', 'g'], 'B': ['f', 'g']}])
>>> dataset.apply(set_diff_func, axis=1, args=['A','B'])
0 [c, b]
1 [e]

fillna for category column with Series input does not work as expected

I have a category column which I want to fill with a Series.
I tried this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b'], 'value': ['c', np.nan]})
df['value'] = df['value'].astype("category")
df['value'] = df['value'].cat.add_categories(df['key'].unique())
print(df['value'].cat.categories)
df['value'] = df['value'].fillna(df['key'])
print(df)
Expected output:
Index(['c', 'a', 'b'], dtype='object')
key value
0 a c
1 b b
Actual output:
Index(['c', 'a', 'b'], dtype='object')
key value
0 a a
1 b b
This appears to be a bug, but thankfully the workaround is quite simple. You will have to treat "value" as a string column when filling.
df['value'] = pd.Categorical(
    df.value.astype(object).fillna(df.key), categories=df.stack().unique())
df
key value
0 a c
1 b b
From the docs, fillna on categorical data accepts a scalar, not a Series, so you may need to fill on the object dtype and then convert back to category:
df.value.astype('object').fillna(df.key) # then convert to category again
Out[248]:
0 c
1 b
Name: value, dtype: object
value : scalar Value to use to fill holes (e.g. 0)
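Putting the workaround together into a self-contained sketch (delaying the category conversion until after the fill, with the categories from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b'], 'value': ['c', np.nan]})

# Fill while 'value' is still object dtype, then rebuild the categorical.
filled = df['value'].fillna(df['key'])
df['value'] = pd.Categorical(filled, categories=['c', 'a', 'b'])
print(df)
print(df['value'].cat.categories)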

Return opposite values from a pandas Series that has two unique values

Given a Series with two unique values, what is the most efficient way to get a Series with the element-wise opposite values? Here is an example:
ser = pd.Series(['a', 'b', 'a'])
I am looking for a function to be applied to ser, returning:
0 b
1 a
2 b
EDIT: Also, how would the solution be amended if there are null values? That is, if
ser = pd.Series(['a', 'b', np.nan , 'a'])
and we would like to get:
0 b
1 a
2 np.nan
3 b
Use numpy.unique to get a handy inverse array.
v = ser.values
u, i = np.unique(v, return_inverse=True)
If there are truly only 2 unique values, then you can do this.
pd.Series(u[1 - i], ser.index)
0 b
1 a
2 b
dtype: object
How It Works
The inverse array is intended to allow you to recreate the passed array, v in our case, by slicing the unique values u with our inverse i. Since u has only 2 values, the inverse indices are 0 and 1. So when we slice u[i], we get array(['a', 'b', 'a'], dtype=object), but u[1 - i] yields the opposite, array(['b', 'a', 'b'], dtype=object).
You can do an element-by-element transformation on the series using apply:
Code:
ser = pd.Series(['a', 'b', 'a'])
print(ser.apply(lambda x: 'a' if x == 'b' else 'b'))
Results:
0 b
1 a
2 b
dtype: object
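Neither answer above handles the null values from the edit; a minimal sketch using Series.map (values missing from the mapping, including NaN, come back as NaN):

import numpy as np
import pandas as pd

ser = pd.Series(['a', 'b', np.nan, 'a'])
u = ser.dropna().unique()                    # the two distinct non-null values
swapped = ser.map({u[0]: u[1], u[1]: u[0]})  # 'a' <-> 'b', NaN stays NaN
print(swapped)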

How can I multiply two dataframes with different column labels in pandas?

I'm trying to multiply (add/divide/etc.) two dataframes that have different column labels.
I'm sure this is possible, but what's the best way to do it? I've tried using rename to change the columns on one df first, but (1) I'd rather not do that and (2) my real data has a multiindex on the columns (where only one layer of the multiindex is differently labeled), and rename seems tricky for that case...
So to try and generalize my question, how can I get df1 * df2 using map to define the columns to multiply together?
df1 = pd.DataFrame([1,2,3], index=['1', '2', '3'], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([4,5,6], index=['1', '2', '3'], columns=['d', 'e', 'f'])
map = {'a': 'e', 'b': 'd', 'c': 'f'}
df1 * df2 = ?
I was also troubled by this problem.
It seems that pandas requires both dataframes to have the same column names for element-wise multiplication.
I searched a lot, and the only related example I found was setting with enlargement, which adds one column to the dataframe.
For your question,
rs = pd.np.multiply(ds2, ds1)
rs will have the same column names as ds2.
Suppose we want to multiply several columns by several other columns in the same dataframe and append the results to the original dataframe.
For example ds1,ds2 are in the same dataframe ds. We can
ds[['r1', 'r2', 'r3']] = pd.np.multiply(ds[['a', 'b', 'c']], ds[['d', 'e', 'f']])
I hope this helps.
Updated solution now that pd.np is deprecated: df1.multiply(np.array(df2))
It will keep the column names of df1 and multiply them by the columns of df2 in order.
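A minimal runnable sketch of that one-liner (using small frames with matching shapes rather than the exact frames from the question):

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]] * 3, index=['1', '2', '3'], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([[4, 5, 6]] * 3, index=['1', '2', '3'], columns=['d', 'e', 'f'])

# Converting df2 to a bare array drops its labels, so the product is purely positional
# and keeps df1's column names.
print(df1.multiply(np.array(df2)))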
I just stumbled onto the same problem. It seems like pandas wants both the column and row index to be aligned to do the element-wise multiplication, so you can just rename with your mapping during the multiplication:
>>> df1 = pd.DataFrame([[1, 2, 3]] * 3, index=['1', '2', '3'], columns=['a', 'b', 'c'])
>>> df2 = pd.DataFrame([[4, 5, 6]] * 3, index=['1', '2', '3'], columns=['d', 'e', 'f'])
>>> df1
a b c
1 1 2 3
2 1 2 3
3 1 2 3
>>> df2
d e f
1 4 5 6
2 4 5 6
3 4 5 6
>>> mapping = {'a' : 'e', 'b' : 'd', 'c' : 'f'}
>>> df1.rename(columns=mapping) * df2
d e f
1 8 5 18
2 8 5 18
3 8 5 18
If you want the 'natural' order of columns, you can create a mapping on the fly like:
>>> df1 * df2.rename(columns=dict(zip(df2.columns, df1.columns)))
For example, to compute the "Frobenius inner product" of the two matrices, you could do:
>>> (df1 * df2.rename(columns=dict(zip(df2.columns, df1.columns)))).sum().sum()
96
This is a pretty old question, and as nnsk said, pd.np is being deprecated.
A nice looking solution is df1 * df2.values. This will produce the element-wise product of the two dataframes, and keep the column names of df1.
Assuming the index is already aligned, you probably just want to align the columns of both DataFrames in the right order and divide the .values of both DataFrames.
Suppose mapping = {'a': 'e', 'b': 'd', 'c': 'f'}:
v1 = df1.reindex(columns=['a', 'b', 'c']).values
v2 = df2.reindex(columns=['e', 'd', 'f']).values
rs = pd.DataFrame(v1 / v2, index=df1.index, columns=['a', 'b', 'c'])
Another solution, assuming indexes and columns are well positioned:
df_mul = pd.DataFrame(df1.values * df2.values, columns=df1.columns, index=df1.index)
