How to find duplicate values (not rows) in an entire pandas dataframe? - python

Consider this dataframe.
df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})

  one two three
0   a   e     a
1   b   f     j
2   c   g     h
3   d   h     a
How can I output all duplicate values and their respective index? The output can look something like this.
   id value
0   2     h
1   3     h
2   0     a
3   0     a
4   3     a

Try .melt + .duplicated:
x = df.reset_index().melt("index")
print(
    x.loc[x.duplicated(["value"], keep=False), ["index", "value"]]
    .reset_index(drop=True)
    .rename(columns={"index": "id"})
)
Prints:
   id value
0   0     a
1   3     h
2   0     a
3   2     h
4   3     a
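Putting the melt approach together as a self-contained sketch (same frame as above):

```python
import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})

# Melt to long form so every cell becomes one row, then keep every
# value that appears more than once anywhere in the frame.
x = df.reset_index().melt("index")
out = (x.loc[x.duplicated(["value"], keep=False), ["index", "value"]]
         .reset_index(drop=True)
         .rename(columns={"index": "id"}))
```

Row order depends on the order melt emits the columns, which is why the answers here print the same (id, value) pairs in different orders.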

We can stack the DataFrame, use Series.loc to keep only the values flagged by Series.duplicated, then Series.reset_index to convert back to a DataFrame:
new_df = (
    df.stack()                                  # Convert to long form
      .droplevel(-1).rename_axis('id')          # Handle MultiIndex
      .loc[lambda x: x.duplicated(keep=False)]  # Filter values
      .reset_index(name='value')                # Make Series a DataFrame
)
new_df:
   id value
0   0     a
1   0     a
2   2     h
3   3     h
4   3     a

Here I used melt to reshape and duplicated(keep=False) to select the duplicates:
(df.rename_axis('id')
   .reset_index()
   .melt(id_vars='id')
   .loc[lambda d: d['value'].duplicated(keep=False), ['id', 'value']]
   .sort_values(by='id')
   .reset_index(drop=True)
)
Output:
   id value
0   0     a
1   0     a
2   2     h
3   3     h
4   3     a

Related

pd.get_dummies() with separator and counts

I have a data that looks like:
index  stringColumn
0      A_B_B_B_C_C_D
1      A_B_C_D
2      B_C_D_E_F
3      A_E_F_F_F
I need to vectorize this stringColumn with counts, ending up with:
index  A  B  C  D  E  F
0      1  3  2  1  0  0
1      1  1  1  1  0  0
2      0  1  1  1  1  1
3      1  0  0  0  1  3
Therefore I need to do both: counting and splitting. The pandas str.get_dummies() function allows me to split the string using the sep='_' argument, but it does not count multiple values. pd.get_dummies() does the counting but does not allow a separator.
My data is huge so I am looking for vectorized solutions, rather than for loops.
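To see the limitation described here, a quick sketch: str.get_dummies splits on the separator but only records presence, not counts.

```python
import pandas as pd

df = pd.DataFrame({'stringColumn': ['A_B_B_B_C_C_D', 'A_B_C_D',
                                    'B_C_D_E_F', 'A_E_F_F_F']})

# str.get_dummies returns 0/1 indicators, so the three B's in
# row 0 collapse to a single 1 rather than a count of 3.
presence = df['stringColumn'].str.get_dummies(sep='_')
```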
You can use Series.str.split with get_dummies and sum:
df1 = (pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
                      prefix='', prefix_sep='')
         .sum(level=0, axis=1))
Or count values per row with value_counts, replace missing values with DataFrame.fillna, and convert to integers:
df1 = (df['stringColumn'].str.split('_', expand=True)
         .apply(pd.value_counts, axis=1)
         .fillna(0)
         .astype(int))
Or use collections.Counter, performance should be very good:
from collections import Counter

df1 = (pd.DataFrame([Counter(x.split('_')) for x in df['stringColumn']])
         .fillna(0)
         .astype(int))
Or reshape by DataFrame.stack and count by SeriesGroupBy.value_counts:
df1 = (df['stringColumn'].str.split('_', expand=True)
         .stack()
         .groupby(level=0)
         .value_counts()
         .unstack(fill_value=0))
print(df1)

   A  B  C  D  E  F
0  1  3  2  1  0  0
1  1  1  1  1  0  0
2  0  1  1  1  1  1
3  1  0  0  0  1  3
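Note that the level argument to sum used in the first snippet was deprecated and later removed in recent pandas versions. As a sketch of the same idea that runs on current pandas, the duplicate letter columns produced by get_dummies can be aggregated with a transpose and groupby:

```python
import pandas as pd

df = pd.DataFrame({'stringColumn': ['A_B_B_B_C_C_D', 'A_B_C_D',
                                    'B_C_D_E_F', 'A_E_F_F_F']})

# get_dummies on the split columns yields one indicator column per
# (position, letter), with repeated letter names. Grouping the
# transposed frame by those names and summing recovers the counts,
# replacing the removed sum(level=0, axis=1).
dummies = pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
                         prefix='', prefix_sep='')
df1 = dummies.T.groupby(level=0).sum().T
```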

Swapping values in columns depending on value type in one of the columns

Suppose I have the following pandas dataframe:
df = pd.DataFrame([['A','B'],[8,'s'],[5,'w'],['e',1],['n',3]])
print(df)
   0  1
0  A  B
1  8  s
2  5  w
3  e  1
4  n  3
If there is an integer in column 1, then I want to swap the value with the value from column 0, so in other words I want to produce this dataframe:
   0  1
0  A  B
1  8  s
2  5  w
3  1  e
4  3  n
Build a mask of the numeric values in the second column using to_numeric with errors='coerce' and Series.notna:
m = pd.to_numeric(df[1], errors='coerce').notna()
Another solution converts to strings with Series.astype and tests with Series.str.isnumeric - but this works only for integers:
m = df[1].astype(str).str.isnumeric()
Then swap with DataFrame.loc, using DataFrame.values to get a numpy array and avoid column alignment:
df.loc[m, [0, 1]] = df.loc[m, [1, 0]].values
print(df)
   0  1
0  A  B
1  8  s
2  5  w
3  1  e
4  3  n
Finally, if possible, it is better to promote the first row to column names:
df.columns = df.iloc[0]
df = df.iloc[1:].rename_axis(None, axis=1)
print(df)
   A  B
1  8  s
2  5  w
3  1  e
4  3  n
or, if the data comes from read_csv, simply remove the header=None argument.
Use sorted with a key that tests for int:
df.loc[:] = [
    sorted(t, key=lambda x: not isinstance(x, int))
    for t in zip(*map(df.get, df))
]
df
   0  1
0  A  B
1  8  s
2  5  w
3  1  e
4  3  n
You can be explicit with the columns if you'd like:
df[[0, 1]] = [
    sorted(t, key=lambda x: not isinstance(x, int))
    for t in zip(df[0], df[1])
]
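As a runnable sketch of this sorted-key swap on the question's data:

```python
import pandas as pd

df = pd.DataFrame([['A', 'B'], [8, 's'], [5, 'w'], ['e', 1], ['n', 3]])

# sorted with this key puts int values first (False sorts before True),
# so any row holding an int in column 1 gets it moved to column 0.
# The sort is stable, so rows with no int (or an int already in
# column 0) are left unchanged.
df.loc[:] = [
    sorted(t, key=lambda x: not isinstance(x, int))
    for t in zip(df[0], df[1])
]
```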

How to pivot a dataframe into a square dataframe with number of intersections in other column as values

How can I pivot a dataframe into a square dataframe whose values are the number of intersections in the value column?
My input dataframe is:
field  value
    a      1
    a      2
    b      3
    b      1
    c      2
    c      5
Output should be
   a  b  c
a  2  1  1
b  1  2  0
c  1  0  2
The values in the output data frame should be the number of intersection of values in the value column.
Use a self-join on value with crosstab:
df = df.merge(df, on='value')
df = pd.crosstab(df['field_x'], df['field_y'])
print(df)

field_y  a  b  c
field_x
a        2  1  1
b        1  2  0
c        1  0  2
Then remove the index and columns names with rename_axis:
# pandas 0.24+
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(index=None, columns=None)
print(df)

   a  b  c
a  2  1  1
b  1  2  0
c  1  0  2
# pandas below 0.24
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(None).rename_axis(None, axis=1)
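The whole answer as a self-contained sketch, built from the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'field': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1, 2, 3, 1, 2, 5]})

# The self-merge on value pairs every field with every field that
# shares a value; crosstab then counts how many shared values each
# pair of fields has.
m = df.merge(df, on='value')
out = (pd.crosstab(m['field_x'], m['field_y'])
         .rename_axis(index=None, columns=None))
```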

map DataFrame index and forward fill nan values

I have a DataFrame with integer indexes that are missing some values (i.e. not equally spaced), and I want to create a new DataFrame with equally spaced index values and forward-filled column values. Below is a simple example:
have
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
   0
0  A
2  B
4  C
want to use the above to create:
   0
0  A
1  A
2  B
3  B
4  C
Use reindex with method='ffill':
import numpy as np

df = df.reindex(np.arange(0, df.index.max() + 1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print(df)

   0
0  A
1  A
2  B
3  B
4  C
Using reindex and ffill:
df = df.reindex(range(df.index[0], df.index[-1] + 1)).ffill()
print(df)
   0
0  A
1  A
2  B
3  B
4  C
You can do this:
df.reindex(list(range(df.index.min(), df.index.max() + 1))).ffill()

   0
0  A
1  A
2  B
3  B
4  C
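All three answers boil down to the same move; as one runnable sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])

# Build the full integer range, then forward-fill each gap
# from the last observed row.
out = df.reindex(np.arange(df.index.min(), df.index.max() + 1),
                 method='ffill')
```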

Python: how to drop duplicates with duplicates?

I have a dataframe like the following
df
  Name  Y
0    A  1
1    A  0
2    B  0
3    B  0
5    C  1
I want to drop the duplicates of Name and keep the ones that have Y=1 such as:
df
  Name  Y
0    A  1
1    B  0
2    C  1
Use the drop_duplicates method:
df.sort_values('Y', ascending=False).drop_duplicates(subset=['Name'])
groupby + max
Assuming your Y series consists only of 0 and 1 values:
res = df.groupby('Name', as_index=False)['Y'].max()
print(res)
  Name  Y
0    A  1
1    B  0
2    C  1
Does the 'Y' column contain only 0 and 1? In that case, you can try the following:
df = df.sort_values(['Y'], ascending=False)
df = df.drop_duplicates(['Name'])
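The sort-then-dedupe trick as a runnable sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'C'],
                   'Y': [1, 0, 0, 0, 1]})

# Sorting Y descending puts each name's Y=1 row (if any) first, so
# drop_duplicates (keep='first' by default) retains it.
res = (df.sort_values('Y', ascending=False)
         .drop_duplicates(subset=['Name'])
         .sort_values('Name')
         .reset_index(drop=True))
```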
