How can I replace specific row-wise duplicate cells in selected columns without dropping rows (preferably without looping through the rows)?
Basically, I want to keep the first occurrence and replace the remaining duplicates in each row with NaN.
For example:
import pandas as pd

df_example = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['a', 'f', 'c'], 'C': [1, 2, 3]})
df_example.head()
Original:
A B C
0 a a 1
1 b f 2
2 c c 3
Expected output:
A B C
0 a nan 1
1 b f 2
2 c nan 3
A bit more complicated example is as follows:
Original:
A B C D
0 a 1 a 1
1 b 2 f 5
2 c 3 c 3
Expected output:
A B C D
0 a 1 nan nan
1 b 2 f 5
2 c 3 nan nan
Use DataFrame.mask with Series.duplicated applied per row via DataFrame.apply:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C
0 a NaN 1
1 b f 2
2 c NaN 3
With new data:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C D
0 a 1 NaN NaN
1 b 2 f 5.0
2 c 3 NaN NaN
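If apply with axis=1 is too slow on a wide frame, a vectorized alternative is possible with a pairwise NumPy comparison. This is not from the original answer, just a sketch; it assumes the values compare cleanly with == and it allocates O(rows * cols**2) booleans, so it only pays off for modest column counts:

import numpy as np
import pandas as pd

df_example = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 2, 3],
                           'C': ['a', 'f', 'c'], 'D': [1, 5, 3]})

arr = df_example.to_numpy(dtype=object)

# eq[r, i, j] is True when row r holds equal values in columns i and j
eq = arr[:, :, None] == arr[:, None, :]

# a cell is a duplicate when any earlier column in the same row matches it
dup = np.triu(eq, k=1).any(axis=1)

print(df_example.mask(pd.DataFrame(dup, index=df_example.index,
                                   columns=df_example.columns)))
#    A  B    C    D
# 0  a  1  NaN  NaN
# 1  b  2    f  5.0
# 2  c  3  NaN  NaN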
Related
I would like to fill missing values in a pandas dataframe with the average of the cells directly before and after the missing value, while respecting the different IDs (a gap should only be filled from values within the same ID).
maskedid test value
1 A 4
1 B NaN
1 C 5
2 A 5
2 B NaN
2 B 2
expected DF
maskedid test value
1 A 4
1 B 4.5
1 C 5
2 A 5
2 B 3.5
2 B 2
Try to interpolate:
df['value'] = df['value'].interpolate()
And by group:
df['value'] = df.groupby('maskedid')['value'].transform(pd.Series.interpolate)
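A minimal runnable version on the question's data; note that linear interpolation of a single gap is exactly the average of the cells before and after:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'maskedid': [1, 1, 1, 2, 2, 2],
    'test': ['A', 'B', 'C', 'A', 'B', 'B'],
    'value': [4, np.nan, 5, 5, np.nan, 2],
})

# interpolate inside each id so values never bleed across group boundaries
df['value'] = df.groupby('maskedid')['value'].transform(pd.Series.interpolate)
print(df)
#    maskedid test  value
# 0         1    A    4.0
# 1         1    B    4.5
# 2         1    C    5.0
# 3         2    A    5.0
# 4         2    B    3.5
# 5         2    B    2.0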
I'm working with a pandas dataframe where I want to find the farthest-out non-null value in each row, reverse the order of those values, and output a dataframe with the reversed row values, without leaving null values in the first column. Essentially: reverse the column order and shift the non-null values to the left.
IN:
   1    2    3    4    5
1  a    b    c    d    e
2  a    b    c  NaN  NaN
3  a    b    c    d  NaN
4  a    b  NaN    c  NaN
OUT:
   1    2    3    4    5
1  e    d    c    b    a
2  c    b    a  NaN  NaN
3  d    c    b    a  NaN
4  c    b    a  NaN  NaN
For each row, create a new Series with the same indexes but with the values reversed:
def reverse(s):
    # Strip the NaN on both ends, but not in the middle
    idx1 = s.first_valid_index()
    idx2 = s.last_valid_index()
    idx = s.loc[idx1:idx2].index
    return pd.Series(s.loc[idx[::-1]].values, index=idx)

df.apply(reverse, axis=1)
Result:
   1    2    3    4    5
1  e    d    c    b    a
2  c    b    a  NaN  NaN
3  d    c    b    a  NaN
4  c  NaN    b    a  NaN
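Note that reverse keeps a NaN sitting between valid values, which is why row 4 above still shows one. If you want the question's exact expected output, with every non-null value packed to the left, a hypothetical variant (not part of the original answer) drops the NaN before reversing:

import numpy as np
import pandas as pd

def reverse_compact(s):
    # keep only the non-null values, reverse them, then pad the tail
    vals = list(s.dropna())[::-1]
    vals += [np.nan] * (len(s) - len(vals))
    return pd.Series(vals, index=s.index)

df.apply(reverse_compact, axis=1)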
My original dataframe looks like this (only the first rows):
categories id products
0 A 1 a
1 B 1 a
2 C 1 a
3 A 1 b
4 B 1 b
5 A 2 c
6 B 2 c
I aggregated it with the following code:
df2 = df.groupby('id').products.nunique().reset_index().merge(
    pd.crosstab(df.id, df.categories).reset_index())
The dataframe then looks like this; I also added an outlier row from my real DF:
id products A B C
0 1 2 2 2 1
1 2 1 1 1 0
2 3 50 1 1 30
Now I am trying to remove the outliers in my new DF:
# remove outliers
del df2['id']
df2 = df2.loc[df2['products']<=20,[str(i) for i in df2.columns]]
What I then get is:
products A B C
0 2 NaN NaN NaN
1 1 NaN NaN NaN
It removes the outliers, but why do I now get only NaNs in the category columns?
The str(i) conversion builds labels that no longer match the real column labels, and .loc with missing labels used to return brand-new all-NaN columns (newer pandas raises a KeyError instead). You only need to filter the rows, so drop the column list:
df2 = df2.loc[df2['products'] <= 20]
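Putting the pieces together on the sample data (a minimal sketch; the outlier row with id 3 came from the real data and is left out here):

import pandas as pd

df = pd.DataFrame({'categories': ['A', 'B', 'C', 'A', 'B', 'A', 'B'],
                   'id': [1, 1, 1, 1, 1, 2, 2],
                   'products': ['a', 'a', 'a', 'b', 'b', 'c', 'c']})

df2 = (df.groupby('id').products.nunique().reset_index()
         .merge(pd.crosstab(df.id, df.categories).reset_index()))

# filter the rows only and leave the column labels alone
print(df2.loc[df2['products'] <= 20])
#    id  products  A  B  C
# 0   1         2  2  2  1
# 1   2         1  1  1  0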
I have a bunch of partially overlapping (in rows and columns) pandas DataFrames, exemplified like so:
df1 = pandas.DataFrame({'a':['1','2','3'], 'b':['a','b','c']})
df2 = pandas.DataFrame({'c':['q','w','e','r','t','y'], 'b':['a','b','c','d','e','f']})
df3 = pandas.DataFrame({'a':['4','5','6'], 'c':['r','t','y']})
...etc.
I want to merge them all together with as few NaN holes as possible.
Consecutive blind outer merges invariably give some (unfortunately useless to me) hole-and-duplicate-filled variant of:
a b c
0 1 a q
1 2 b w
2 3 c e
3 NaN d r
4 NaN e t
5 NaN f y
6 4 NaN r
7 5 NaN t
8 6 NaN y
My desired output given a, b, and c above would be this (column order doesn't matter):
a b c
0 1 a q
1 2 b w
2 3 c e
3 4 d r
4 5 e t
5 6 f y
I want the NaNs to be treated as places to insert data from the next dataframe, not obstruct it.
I'm at a loss here. Is there any way to achieve this in a general way?
I can't guarantee the speed, but sorting each column with a key that pushes the NaN to the end seems to work for your sample data:
df.apply(lambda x: sorted(x, key=pd.isnull)).dropna()
Out[47]:
a b c
0 1.0 a q
1 2.0 b w
2 3.0 c e
3 4.0 d r
4 5.0 e t
5 6.0 f y
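For completeness, a self-contained sketch of the whole pipeline; the reduce-based chain of outer merges is an assumption about how the hole-filled frame above was produced:

import functools
import pandas as pd

df1 = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['a', 'b', 'c']})
df2 = pd.DataFrame({'c': ['q', 'w', 'e', 'r', 't', 'y'],
                    'b': ['a', 'b', 'c', 'd', 'e', 'f']})
df3 = pd.DataFrame({'a': ['4', '5', '6'], 'c': ['r', 't', 'y']})

# consecutive blind outer merges reproduce the hole-filled frame above
merged = functools.reduce(lambda l, r: l.merge(r, how='outer'), [df1, df2, df3])

# push the NaN in each column to the bottom, then drop the NaN tail rows
print(merged.apply(lambda x: sorted(x, key=pd.isnull)).dropna())

Column a keeps the original strings here, so it prints 1 rather than the 1.0 shown above. Also note the trick relies on every column having the same number of non-null values; with ragged overlaps the sorted columns would no longer belong to the same records.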
I have a large Python script which builds two dataframes, A and B. At the end, I want to fill dataframe A with the values of dataframe B while keeping the columns of dataframe A, but it is not going well.
Dataframe A is like this
   A   B    C    D
0  1  ab  NaN  NaN
1  2  bc  NaN  NaN
2  3  cd  NaN  NaN
Dataframe B:
A BB CC
1 C 10
2 C 11
3 D 12
My output must be this new dataframe:
   A   B     C     D
0  1  ab  10.0   NaN
1  2  bc  11.0   NaN
2  3  cd   NaN  12.0
But my output is:
   A   B    C    D
0  1  ab  NaN  NaN
1  2  bc  NaN  NaN
2  3  cd  NaN  NaN
Why is it not filling in the values of dataframe B?
My command is
dfnew = dfB.pivot_table(index='A', columns='BB', values='CC').reindex(index=dfA.index, columns=dfA.columns).fillna(dfA)
I think you need set_index with the key column A so the data aligns, then fillna or combine_first, and finally reset_index. Your command returned no values because reindex(index=dfA.index) aligns the pivot on dfA's default RangeIndex (0, 1, 2) instead of on the key column A (1, 2, 3), so the pivoted values land on the wrong rows or are dropped:
import numpy as np
import pandas as pd

dfA = pd.DataFrame({'A':[1,2,3], 'B':['ab','bc','cd'], 'C':[np.nan] * 3, 'D':[np.nan] * 3})
print (dfA)
A B C D
0 1 ab NaN NaN
1 2 bc NaN NaN
2 3 cd NaN NaN
dfB = pd.DataFrame({'A':[1,2,3], 'BB':['C','C','D'], 'CC':[10,11,12]})
print (dfB)
A BB CC
0 1 C 10
1 2 C 11
2 3 D 12
df = dfB.pivot_table(index='A', columns='BB', values='CC')
print (df)
BB C D
A
1 10.0 NaN
2 11.0 NaN
3 NaN 12.0
dfA = dfA.set_index('A').fillna(df).reset_index()
#dfA = dfA.set_index('A').combine_first(df).reset_index()
print (dfA)
A B C D
0 1 ab 10.0 NaN
1 2 bc 11.0 NaN
2 3 cd NaN 12.0
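For reference, here is why the reindex-based one-liner misbehaved, reusing df, the pivot built above (a quick check, not part of the original answer):

# dfA's default index is 0, 1, 2 while the pivot is keyed by A = 1, 2, 3,
# so reindex drops the A == 3 row and shifts the others onto the wrong rows
print(df.reindex(index=[0, 1, 2], columns=['A', 'B', 'C', 'D']))
#      A   B     C   D
# 0  NaN NaN   NaN NaN
# 1  NaN NaN  10.0 NaN
# 2  NaN NaN  11.0 NaN

fillna then had almost nothing sensible to insert, which is why the output looked unchanged.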