How to drop duplicate data with different column names in pandas?

I have a DataFrame whose columns contain duplicate data under different names:
In [1]: df
Out[1]:
    X1   X2   Y1   Y2
0  0.0  0.0  6.0  6.0
1  3.0  3.0  7.1  7.1
2  7.6  7.6  1.2  1.2
I know .drop(columns=...) exists, but is there a more efficient way to drop these without having to list the column names? If not, please let me know, since I can just use .drop().

Unfortunately, there's no pandas built-in function to drop duplicate columns: df.drop_duplicates only removes duplicate rows ("Return DataFrame with duplicate rows removed", per its docstring). We can, however, use np.unique over axis 1.
We can create a function around np.unique to drop duplicate columns.
import numpy as np

def drop_duplicate_cols(df):
    uniq, idxs = np.unique(df, return_index=True, axis=1)
    return pd.DataFrame(uniq, index=df.index, columns=df.columns[idxs])

drop_duplicate_cols(df)
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
NB: per the np.unique docs, it "Returns the sorted unique elements of an array", so the column order can change.
Workaround: to retain the original order, sort the idxs, as in the sketch below.
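A minimal sketch of that workaround (the function name is just for illustration; it assumes, as here, that all columns share a comparable numeric dtype). Selecting through iloc also preserves the original dtypes:
def drop_duplicate_cols_keep_order(df):
    # np.unique sorts its result, so sort the returned column positions
    # to restore the original left-to-right order
    _, idxs = np.unique(df, return_index=True, axis=1)
    return df.iloc[:, np.sort(idxs)]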
Using .T on a DataFrame with multiple dtypes is going to mess with your actual dtypes.
df = pd.DataFrame({'A': [0, 1], 'B': ['a', 'b'], 'C': [0, 1], 'D': [2.1, 3.1]})

df.dtypes
A      int64
B     object
C      int64
D    float64
dtype: object

df.T.T.dtypes
A    object
B    object
C    object
D    object
dtype: object

# To get back the original dtypes we can use .astype
df.T.T.astype(df.dtypes).dtypes
A      int64
B     object
C      int64
D    float64
dtype: object

You could transpose with .T, apply drop_duplicates, then transpose back:
>>> df.T.drop_duplicates().T
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
Or with loc and duplicated:
>>> df.loc[:, df.T.duplicated(keep='last')]
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
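Note that duplicated(keep='last') flags the earlier copies (X1 and Y1 here) as True, which is why selecting the True columns happens to keep exactly one of each pair. A more explicit equivalent keeps the first occurrence of every column by negating the default mask:
df.loc[:, ~df.T.duplicated()]   # keep the first occurrence of each duplicated column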

Related

Looping over a dataframe and extracting different columns at each iteration

I am trying to loop over a dataframe df and I would like to extract different columns at each iteration.
Say I have columns ['A', 'B', 'C', 'D', 'E', 'F'] in my df:
column_names = [['A', 'B'], ['A', 'C', 'D']]
for index, row in df:  # let's assume index starts with 0
    row[column_names[index]]  # however, you cannot apply this syntax to a row like you could to a df to get a sub-dataframe
What are my options? I have tried itertuples and iterrows, but you cannot select different columns by passing a list of column names.
Thanks
The easiest way to loop over the columns and retrieve a DataFrame would be to invert your loops:
for col in column_names:
    for ix in df.index:
        print(df.loc[ix, col])
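If the goal is a sub-DataFrame per list of columns (all rows at once) rather than one cell at a time, the inner loop can be dropped:
for cols in column_names:
    print(df[cols])   # all rows, only the columns in this list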
With iterrows() you get a tuple with the index at position 0 and the row at position 1. You might want to use iterrows() as:
column_names = [['A', 'B'], ['A', 'C', 'D']]
for row in df.iterrows():
    print(row[1][column_names[row[0]]].to_frame())
For a df of ones i.e.:
A B C D E F
0 1.0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0
You get:
A 1.0
B 1.0
Name: 0, dtype: float64
A 1.0
C 1.0
D 1.0
Name: 1, dtype: float64
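A small variation on the same idea, assuming column_names has exactly one list per row: pair each row with its own column list via zip:
for cols, (idx, row) in zip(column_names, df.iterrows()):
    print(row[cols].to_frame().T)   # one-row DataFrame with just the columns for this row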

Find the mean of columns with matching column names

I have a dataframe similar to the following but with thousands of rows and columns:
x y ghb_00hr_rep1 ghb_00hr_rep2 ghb_00hr_rep3 ghl_06hr_rep1 ghl_06hr_rep2
x y 2 3 2 1 3
x y 5 7 6 2 1
I would like my output to look like this:
ghb_00hr ghl_06hr
2.3 2
6 1.5
My goal is to find the average of the matching columns. I have come up with this: temp = df.groupby(name, axis=1).agg('mean'), but I am not sure how to define name as the matching columns.
My previous strategy was the following:
name = pd.Series(['_'.join(i.split('_')[:-1]) for i in df.columns[3:]],
                 index=df.columns[3:])
temp = df.groupby(name, axis=1).agg('mean')
avg = pd.concat([df.iloc[:, :3], temp], axis=1)
However, the number of 'replicates' ranges from 1 to 4, so grouping by index location isn't an option.
Not sure if there is a better way to do this or if I am on the right track.
An option is to groupby level=0:
(df.set_index(['name', 'x', 'y'])
   .groupby(level=0, axis=1)
   .mean()
   .reset_index())
Output:
name x y ghb_00hr ghl_06hr
0 gene1 x y 2.333333 2.0
1 gene2 x y 6.000000 1.5
Update: for the modified question:
d = df.filter(like='gh')
# or d = df.iloc[:, 2:]
# depending on your columns of interest
names = d.columns.str.rsplit('_', n=1).str[0]
d.groupby(names, axis=1).mean()
Output:
ghb_00hr ghl_06hr
0 2.333333 2.0
1 6.000000 1.5
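Put together, a self-contained sketch of the rsplit/groupby approach (the sample values below just mirror the question's example; on very recent pandas, axis=1 grouping is deprecated, in which case you can transpose and group on level 0 instead):
import pandas as pd

df = pd.DataFrame({'x': ['x', 'x'], 'y': ['y', 'y'],
                   'ghb_00hr_rep1': [2, 5], 'ghb_00hr_rep2': [3, 7], 'ghb_00hr_rep3': [2, 6],
                   'ghl_06hr_rep1': [1, 2], 'ghl_06hr_rep2': [3, 1]})

d = df.filter(like='gh')                       # keep only the replicate columns
names = d.columns.str.rsplit('_', n=1).str[0]  # strip the trailing _repN
print(d.groupby(names, axis=1).mean())         # ghb_00hr: 2.333 / 6.0, ghl_06hr: 2.0 / 1.5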
You can convert df.columns to a set and then iterate:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['a', 'a', 'a', 'b', 'b', 'b'])
for column in set(df.columns):
    print(column, df[column].mean(axis=1))
which outputs:
a 0 2.0
dtype: float64
b 0 5.0
dtype: float64
Use sorted if the order matters:
for column in sorted(set(df.columns)):
From here you can get the output in pretty much any format you want.
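For example, a small sketch (using the toy df above) that collects the per-name means back into a DataFrame:
# collect the mean of each distinct column name into a new DataFrame
means = {name: df[name].mean(axis=1) for name in sorted(set(df.columns))}
print(pd.DataFrame(means))
#      a    b
# 0  2.0  5.0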

Replace column in Pandas dataframe with the mean of that column

I have a dataframe:
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
A B
0 1 2
1 1 3
2 4 6
I want to return a dataframe of the same size containing the mean of each column:
A B
0 2 3.666
1 2 3.666
2 2 3.666
Is there a simple way of doing this?
You only need to provide a single row of means at DataFrame creation time:
pd.DataFrame(data = [df.mean()], index = df.index)
It gives:
A B
0 2.0 3.666667
1 2.0 3.666667
2 2.0 3.666667
Here's one with assign:
df.assign(**df.mean())
A B
0 2.0 3.666667
1 2.0 3.666667
2 2.0 3.666667
Details
The mean is easily obtained with DataFrame.mean:
df.mean()
A    2.000000
B    3.666667
dtype: float64
From the above Series, we can use dictionary unpacking to replace the existing columns with the resulting values. Note that we can unpack the Series into a dictionary using **:
{**df.mean()}
# {'A': 2.0, 'B': 3.6666666666666665}
Since assign adds new columns as df.assign(a_given_column=a_value, another_column=some_other_value), the unpacking turns the dictionary keys into keyword arguments. And because the original DataFrame's index is respected, df.assign(**df.mean()) replaces the DataFrame's values with the means.
Recreate the DataFrame. Send the mean Series to a dict, then the index defines the number of rows.
pd.DataFrame(df.mean().to_dict(), index=df.index)
# A B
#0 2.0 3.666667
#1 2.0 3.666667
#2 2.0 3.666667
Same concept, but creating the full array first saves a decent amount of time.
pd.DataFrame(np.broadcast_to(df.mean(), df.shape),
             index=df.index,
             columns=df.columns)
Here are some timings. Of course this will depend slightly on the number of columns, but you can see there are pretty large differences when you provide the entire array to begin with.
import perfplot
import pandas as pd
import numpy as np

perfplot.show(
    setup=lambda N: pd.DataFrame(np.random.randint(1, 100, (N, 5)),
                                 columns=[str(x) for x in range(5)]),
    kernels=[
        lambda df: pd.DataFrame(np.broadcast_to(df.mean(), df.shape), index=df.index, columns=df.columns),
        lambda df: df.assign(**df.mean()),
        lambda df: pd.DataFrame(df.mean().to_dict(), index=df.index)
    ],
    labels=['numpy broadcast', 'assign', 'dict'],
    n_range=[2 ** k for k in range(1, 22)],
    equality_check=np.allclose,
    xlabel='Len(df)'
)

Pairwise Euclidean distance with pandas ignoring NaNs

I start with a dictionary, which is the way my data was already formatted:
import pandas as pd
dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},
'C':{'b':1.0,'c':2.0, 'd':4.0}}
I then convert it to a pandas dataframe:
df = pd.DataFrame(dict2)
print(df)
A B C
a 1.0 2.0 NaN
b 2.0 NaN 1.0
c NaN 2.0 2.0
d 4.0 5.0 4.0
Of course, I can get the difference one at a time by doing this:
df['A'] - df['B']
Out[643]:
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
I figured out how to loop through and calculate A-A, A-B, A-C:
for column in df:
    print(df['A'] - df[column])
a 0.0
b 0.0
c NaN
d 0.0
Name: A, dtype: float64
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
a NaN
b 1.0
c NaN
d 0.0
dtype: float64
What I would like to do is iterate through the columns so as to calculate |A-B|, |A-C|, and |B-C| and store the results in another dictionary.
I want to do this so as to calculate the Euclidean distance between all combinations of columns later on. If there is an easier way to do this I would like to see it as well. Thank you.
You can use numpy broadcasting to compute vectorised Euclidean distance (L2-norm), ignoring NaNs using np.nansum.
i = df.values.T
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5
If you want a DataFrame representing a distance matrix, here's what that would look like:
df = (lambda v, c: pd.DataFrame(v, c, c))(j, df.columns)
df
A B C
A 0.000000 1.414214 1.0
B 1.414214 0.000000 1.0
C 1.000000 1.000000 0.0
The entry at row i, column j of this distance matrix is the distance between the ith and jth columns of the original DataFrame.
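Note that np.nansum treats the NaN terms as 0, so a missing value in either column simply drops that row from the sum. For A and B, only rows a and d contribute:
((1.0 - 2.0) ** 2 + (4.0 - 5.0) ** 2) ** .5   # 1.4142..., matching the A-B entry above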
The code below iterates through columns to calculate the difference.
# Import libraries
import pandas as pd
import numpy as np

# Create dataframe
df = pd.DataFrame({'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
                   'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
                   'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}})
df2 = pd.DataFrame()

# Calculate the absolute difference for each pair of columns
clist = df.columns
for i in range(len(clist) - 1):
    for j in range(i + 1, len(clist)):  # start at i + 1 so each pair is computed once
        var = clist[i] + '-' + clist[j]
        df[var] = abs(df[clist[i]] - df[clist[j]])   # optional: store in the same dataframe
        df2[var] = abs(df[clist[i]] - df[clist[j]])  # optional: store in a new dataframe
Output in same dataframe
df.head()
Output in a new dataframe
df2.head()
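A slightly more compact variant of the same idea, using itertools.combinations to enumerate each unordered pair once (starting again from the original three columns):
from itertools import combinations

diffs = {c1 + '-' + c2: (df[c1] - df[c2]).abs()
         for c1, c2 in combinations(['A', 'B', 'C'], 2)}
df2 = pd.DataFrame(diffs)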

How to create a matrix that is the sum of multiple matrices using pandas dataframe?

I have multiple data frames that I saved in a concatenated list like below. Each df represents a matrix.
my_df = pd.concat([df1, df2, df3, .....])
How do I sum all these dfs (matrices) into one df (matrix)?
I found a discussion here, but it only answers how to add two data frames, by using code like below.
df_x.add(df_y, fill_value=0)
Should I use the code above in a loop, or is there a more concise way?
I tried print(my_df.sum()), but got a very confusing result (it suddenly turned into one row instead of a two-dimensional matrix).
Thank you.
I believe you need functools.reduce, if each DataFrame in the list has the same index and column values:
np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1,np.nan,2], size=(3,3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1,np.nan,3], size=(3,3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1,np.nan,4], size=(3,3)), columns=list('abc'))
print (df1)
a b c
0 2.0 2.0 2.0
1 NaN NaN 1.0
2 1.0 2.0 NaN
print (df2)
a b c
0 NaN NaN 1.0
1 3.0 3.0 3.0
2 NaN 1.0 3.0
print (df3)
a b c
0 4.0 NaN NaN
1 4.0 1.0 1.0
2 4.0 NaN 1.0
from functools import reduce
my_df = [df1,df2, df3]
df = reduce(lambda x, y: x.add(y, fill_value=0), my_df)
print (df)
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
I believe the idiomatic solution to this is to preserve the information about different DataFrames with the help of the keys parameter and then use sum on the innermost level:
dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i+1) for i in range(len(dfs))])
my_df.sum(level=1)
which yields
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
with jezrael's sample DataFrames.
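On newer pandas versions, where DataFrame.sum no longer accepts a level argument, the equivalent is an explicit groupby on that index level:
my_df.groupby(level=1).sum()   # same result as my_df.sum(level=1)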
One method is to use sum with a list of arrays. The output here will be an array rather than a dataframe.
This assumes you need to replace np.nan with 0:
res = sum([x.fillna(0).values for x in [df1, df2, df3]])
Alternatively, you can use numpy directly in a couple of different ways:
res_np1 = np.add.reduce([x.fillna(0).values for x in [df1, df2, df3]])
res_np2 = np.nansum([x.values for x in [df1, df2, df3]], axis=0)
numpy.nansum assumes np.nan equals zero for summing purposes.
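If you want that last result back as a DataFrame rather than an array, it can be wrapped again (assuming, as above, that every frame shares the same index and columns):
res_df = pd.DataFrame(np.nansum([x.values for x in [df1, df2, df3]], axis=0),
                      index=df1.index, columns=df1.columns)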
