Pandas: How to drop a specific number of duplicate rows? - python

I hope you're doing well.
I want to drop a specific number of duplicate rows. Let me explain with an example:
     A    B    C
0  foo    2    3
1  foo  nan    9
2  foo    1    4
3  bar    8  nan
4  xxx    9   10
5  xxx    4    4
6  xxx    9    6
We have duplicated rows based on column A. For 'foo' I want to drop two duplicate rows, for example, and for 'xxx' I want to drop just one row.
The drop_duplicates method can only keep zero or one row per duplicate group, so it doesn't help here.
Thanks in advance.

Probably not the optimal solution, but this one works:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['foo','foo','foo','bar','xxx','xxx','xxx'],
    'B': [2,np.nan,1,8,9,4,9],
    'C': [3,9,4,np.nan,10,4,6]
})
nb_drops = {'foo': 2, 'xxx': 1}
df2 = pd.DataFrame()
for k, v in nb_drops.items():
    # collect the first v rows of each key (DataFrame.append was removed in
    # pandas 2.0, so pd.concat is used instead)
    df2 = pd.concat([df2, df[df['A'] == k].head(v)])
df = df.drop_duplicates(subset=['A'])
df = df.merge(df2, how='outer')
df
Gives:
     A    B     C
0  foo  2.0   3.0
1  bar  8.0   NaN
2  xxx  9.0  10.0
3  foo  NaN   9.0

I wrote this code and it works:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['foo','foo','foo','bar','xxx','xxx','xxx'],
    'B': [2,np.nan,1,8,9,4,9],
    'C': [3,9,4,np.nan,10,4,6]
})
nb_drops = {'foo': 2, 'xxx': 1}
rows_to_delete = []
for item in nb_drops:
    indices_item = list(df[df['A'] == item].index)
    # mark the last nb_drops[item] rows of each group for deletion
    rows_to_delete += range(indices_item[-1] - nb_drops[item] + 1, indices_item[-1] + 1)
df.drop(rows_to_delete, inplace=True)
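A more vectorized alternative, as a sketch: assuming, like the code above, that the last N rows of each group are the ones to drop, groupby().cumcount() can build a boolean mask in one pass:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['foo','foo','foo','bar','xxx','xxx','xxx'],
    'B': [2,np.nan,1,8,9,4,9],
    'C': [3,9,4,np.nan,10,4,6]
})
nb_drops = {'foo': 2, 'xxx': 1}

# position of each row counted from the end of its 'A' group (0 = last row)
pos_from_end = df.groupby('A').cumcount(ascending=False)
# keep a row unless it is among the last nb_drops[group] rows of its group
keep = pos_from_end >= df['A'].map(nb_drops).fillna(0)
print(df[keep])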

Related

python pandas dataframe multiply columns matching index or row name

I have two dataframes,
df1:
hash  a  b  c
ABC   1  2  3
def   5  3  4
Xyz   3  2 -1
df2:
hash  v
Xyz   3
def   5
I want to make
df:
hash   a   b   c
ABC    1   2   3   (unchanged, since there is no matching 'ABC' in df2)
def   25  15  20   (= 5*5, 3*5, 4*5)
Xyz    9   6  -3   (= 3*3, 2*3, -1*3)
As shown above, I want to build a dataframe whose values come from multiplying df1 by df2 wherever their indices (or first-column values) match.
Since df2 has only one column (v), all of df1's columns except the first one (the index) should be affected.
Is there a neat, Pythonic pandas way to achieve this?
df1.set_index(['hash']).mul(df2.set_index(['hash'])) and similar attempts don't seem to work.
One approach:
df1 = df1.set_index("hash")
df2 = df2.set_index("hash")["v"]
res = df1.mul(df2, axis=0).combine_first(df1)
print(res)
Output:
          a     b     c
hash
ABC     1.0   2.0   3.0
Xyz     9.0   6.0  -3.0
def    25.0  15.0  20.0
One method:
# We'll make this list for convenience
cols = ['a', 'b', 'c']
# Merge the DataFrames, keeping everything from df1
df = df1.merge(df2, 'left').fillna(1)
# Make the v column integers again, since fillna cast it to float
df.v = df.v.astype(int)
# Broadcast the multiplication across axis 0
df[cols] = df[cols].mul(df.v, axis=0)
# Drop the no-longer-needed column
df = df.drop('v', axis=1)
print(df)
Output:
  hash   a   b   c
0  ABC   1   2   3
1  def  25  15  20
2  Xyz   9   6  -3
Alternative method:
# Set indices
df1 = df1.set_index('hash')
df2 = df2.set_index('hash')
# Apply the multiplication and fill missing values
df = (df1.mul(df2.v, axis=0)
         .fillna(df1)
         .astype(int)
         .reset_index())
Output:
  hash   a   b   c
0  ABC   1   2   3
1  Xyz   9   6  -3
2  def  25  15  20
The function you are looking for is actually multiply.
Here's how I have done it:
>>> df
  hash  a  b
0  ABC  1  2
1  DEF  5  3
2  XYZ  3 -1
>>> df2
  hash  v
0  XYZ  4
1  ABC  8
>>> df = df.merge(df2, on='hash', how='left').fillna(1)
>>> df
  hash  a  b    v
0  ABC  1  2  8.0
1  DEF  5  3  1.0
2  XYZ  3 -1  4.0
>>> df[['a','b']] = df[['a','b']].multiply(df['v'], axis='index')
>>> df
  hash     a     b    v
0  ABC   8.0  16.0  8.0
1  DEF   5.0   3.0  1.0
2  XYZ  12.0  -4.0  4.0
You can actually drop v at the end if you don't need it.
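For instance, that last step could look like this (a minimal illustration):
>>> df = df.drop(columns='v')
>>> df
  hash     a     b
0  ABC   8.0  16.0
1  DEF   5.0   3.0
2  XYZ  12.0  -4.0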

In pandas, how to re-arrange the dataframe to simultaneously combine groups of columns?

I hope someone could help me solve my issue.
Given a pandas dataframe with several sets of similarly named columns (e.g. a1, a2, b1, b2, c, as in the answers below),
I would like to re-arrange it into a new dataframe, combining each set of columns (the sets all have the same size) into a single column, as in the desired results shown in the answers.
Thank you in advance for any tips.
For a general solution, you can try one of these two options:
You could use OrderedDict to get the alphabetic (non-numeric) column names ordered by first appearance, pd.DataFrame.filter to select the columns with similar names, and pd.DataFrame.stack to concatenate the values:
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame([[0,1,2,3,4],[5,6,7,8,9]], columns=['a1','a2','b1','b2','c'])
newdf = pd.DataFrame()
for col in list(OrderedDict.fromkeys(''.join(df.columns)).keys()):
    if col.isalpha():
        newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reset_index(drop=True)
Output:
df
   a1  a2  b1  b2  c
0   0   1   2   3  4
1   5   6   7   8  9
newdf
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
Another way to get the column names is with re and a set, sorting the columns alphabetically afterwards:
import re

newdf = pd.DataFrame()
for col in set(re.findall(r'[^\W\d_]', ''.join(df.columns))):
    newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reindex(sorted(newdf.columns), axis=1).reset_index(drop=True)
Output:
newdf
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
You can do this with pd.wide_to_long and rename the 'c' column:
df_out = pd.wide_to_long(df.reset_index().rename(columns={'c':'c1'}),
                         ['a','b','c'], 'index', 'no')
df_out = df_out.reset_index(drop=True).ffill().astype(int)
df_out
Output:
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
This gives the same dataframe; only the sorting differs:
pd.wide_to_long(df, ['a','b'], 'c', 'no').reset_index().drop('no', axis=1)
Output:
   c  a  b
0  4  0  2
1  9  5  7
2  4  1  3
3  9  6  8
The fact that letter c had only one column, while the other letters had two, made this a bit tricky. I first stacked the dataframe and stripped the numbers from the column names. Then, for a and b, I pivoted the dataframe and removed all NaNs. For c, I repeated each value twice so its length matched a and b, and then merged it with a and b.
Input:
import pandas as pd
df = pd.DataFrame({'a1': {0: 0, 1: 5},
                   'a2': {0: 1, 1: 6},
                   'b1': {0: 2, 1: 7},
                   'b2': {0: 3, 1: 8},
                   'c':  {0: 4, 1: 9}})
df
Code:
import numpy as np

df1 = df.copy().stack().reset_index().replace('[0-9]+', '', regex=True)
dfab = (df1[df1['level_1'].isin(['a','b'])]
        .pivot(index=0, columns='level_1', values=0)
        .apply(lambda x: pd.Series(x.dropna().values))
        .astype(int))
dfc = pd.DataFrame(np.repeat(df['c'].values, 2, axis=0)).rename({0: 'c'}, axis=1)
df2 = pd.merge(dfab, dfc, how='left', left_index=True, right_index=True)
df2
Output:
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
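Another possible approach, as a sketch (assuming every column name is a letter part plus an optional numeric suffix): split the names into a two-level column MultiIndex, stack the suffix level, and forward-fill the lone 'c' value within each original row:
import pandas as pd

df = pd.DataFrame([[0,1,2,3,4],[5,6,7,8,9]], columns=['a1','a2','b1','b2','c'])

# split each column name into a letter part and a numeric suffix ('' for 'c')
parts = df.columns.str.extract(r'([a-z]+)(\d*)')
df2 = df.copy()
df2.columns = pd.MultiIndex.from_arrays([parts[0], parts[1].replace('', '1')])
# stack the suffix level so a1/a2 become two rows of a single column 'a',
# then forward-fill 'c' within each original row
newdf = (df2.stack(level=1)
            .groupby(level=0).ffill()
            .astype(int)
            .reset_index(drop=True))
print(newdf)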

Python: How to replace missing values column wise by median

I have a dataframe as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.45, 2.33, np.nan], 'C': [4, 5, 6], 'D': [4.55, 7.36, np.nan]})
I want to replace the missing values (i.e. np.nan) in a generic way. For this I have created the following function:
def treat_mis_value_nu(df):
    df_nu = df.select_dtypes(include=['number'])
    lst_null_col = df_nu.columns[df_nu.isnull().any()].tolist()
    if len(lst_null_col) > 0:
        for i in lst_null_col:
            if df_nu[i].isnull().sum() / len(df_nu[i]) > 0.10:
                df_final_nu = df_nu.drop([i], axis=1)
            else:
                df_final_nu = df_nu[i].fillna(df_nu[i].median(), inplace=True)
    return df_final_nu
When I apply this function as follows
df_final = treat_mis_value_nu(df)
I am getting a dataframe as follows:
   A    B  C
0  1  1.0  4
1  2  2.0  5
2  3  NaN  6
So it has correctly removed column D, but failed to remove column B.
I know there have been discussions on this topic in the past (here). Am I still missing something?
Use:
df = pd.DataFrame({'A': [1, 2, 3, 5, 7], 'B': [1.45, 2.33, np.nan, np.nan, np.nan],
                   'C': [4, 5, 6, 8, 7], 'D': [4.55, 7.36, np.nan, 9, 10],
                   'E': list('abcde')})
print (df)
   A     B  C      D  E
0  1  1.45  4   4.55  a
1  2  2.33  5   7.36  b
2  3   NaN  6    NaN  c
3  5   NaN  8   9.00  d
4  7   NaN  7  10.00  e
def treat_mis_value_nu(df):
    # get only numeric columns
    df_nu = df.select_dtypes(include=['number'])
    # get only columns with NaNs
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    # get columns to remove, using mean instead of sum/len (it is the same thing)
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() > 0.30]
    # replace missing values in the remaining columns and drop those above the threshold
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)
print (treat_mis_value_nu(df))
   A  C      D  E
0  1  4   4.55  a
1  2  5   7.36  b
2  3  6   8.18  c
3  5  8   9.00  d
4  7  7  10.00  e
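As a side note on why the original function misbehaved: fillna(..., inplace=True) returns None, so the assignment in the else branch stores None, and each loop iteration overwrites df_final_nu rather than accumulating changes. A minimal illustration:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])
result = s.fillna(s.median(), inplace=True)
print(result)  # None: with inplace=True, fillna mutates s and returns nothing
print(s)       # the series itself is now [1.0, 1.0]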
I would recommend looking at the sklearn Imputer transformer. I don't think it can drop columns, but it can definitely fill them in a 'generic way': for example, filling in missing values with the median of the relevant column.
You could use it as such:
from sklearn.preprocessing import Imputer  # in newer scikit-learn, use sklearn.impute.SimpleImputer

imputer = Imputer(strategy='median')
num_df = df.values
names = df.columns.values
# fit computes the medians; calling transform alone would raise NotFittedError
df_final = pd.DataFrame(imputer.fit_transform(num_df), columns=names)
If you have additional transformations you would like to make you could consider making a transformation Pipeline or could even make your own transformers to do bespoke tasks.
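A minimal sketch of such a pipeline, assuming a modern scikit-learn (hence SimpleImputer rather than the old Imputer); the scaling step is just an illustrative second transformation:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill NaNs with column medians
    ('scale', StandardScaler()),                   # then standardize each column
])
num_df = df.select_dtypes(include=['number'])
df_final = pd.DataFrame(pipeline.fit_transform(num_df), columns=num_df.columns)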

Using Pandas to skip the next three rows from a specific column

I have been working on this code forever. I would like to accomplish the following:
if Pout > 3, then drop/delete the next 3 rows.
df = pd.read_csv(file, sep=',', usecols=['Iin', 'Iout', 'Pout'])
print(df['Pout'])
for i in df['Pout']:
    if i > 3:
        df.drop(df[3:])  # drop/delete the next 3 rows regardless of the value
print(df)
Any help will be greatly appreciated.
Thanks
I came up with this code based on your first version, but the updated version that you have just posted is more efficient. I am now dropping the next five rows after the condition has been met.
import pandas as pd
df = pd.DataFrame({'a': [1,5.0,1,2.3,2.1,2,1,3,4,7],
                   'b': [1,4,0.2,4.5,8.2,1,2,3,4,7],
                   'c': [1,4.5,5.4,6,2,4,2,3,4,7]})
for index in range(len(df['c'])):
    if df['c'][index] > 3:
        df.at[index+1, 'c'] = None
        df.at[index+2, 'c'] = None
        df.at[index+3, 'c'] = None
        df.at[index+4, 'c'] = None
        df.at[index+5, 'c'] = None
        print(df['c'])
        break
try this:
import pandas as pd
df = pd.DataFrame({'a': [1,5,1,2,2,2,1], 'b': [1,4,2,4,8,1,2], 'c': [1,2,6,6,2,1,2]})
for i in df['c']:
    if i > 3:
        try:
            idx = df['c'].tolist().index(i)  # index of the first value matching the condition
            print(idx)
        except:
            pass
# set three rows to NaN, starting at the matched index
for i in range(idx, idx+3):
    df.at[i, 'c'] = None
print(df)
Output:
   a  b    c
0  1  1  1.0
1  5  4  2.0
2  1  2  NaN
3  2  4  NaN
4  2  8  NaN
5  2  1  1.0
6  1  2  2.0
My solution uses a dummy dataframe. I get the index of the item; if the item is bigger than 3, I iterate through the range from that item's index to its index plus 3 and use the .at function to set each value to NaN.
In my edit I just added try/except, and now it works.
For 5 rows:
I think this code is what you want; I also think it is more efficient:
import pandas as pd
df = pd.DataFrame({'a': [1,5.0,1,2.3,2.1,2,1,3,4,7],
                   'b': [1,4,0.2,4.5,8.2,1,2,3,4,7],
                   'c': [1,4.5,5.4,6,2,4,2,3,4,7]})
for index in range(len(df['c'])):
    if df['c'][index] > 3:
        for i in range(index+1, index+6):
            df.at[i, 'c'] = None
        print(df['c'])
        break
Output:
0 1.0
1 4.5
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 3.0
8 4.0
9 7.0
Name: c, dtype: float64
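If the goal is to actually delete the rows rather than set them to NaN, a sketch (an assumption about the intent, since the answers above blank values instead) could collect positions first and drop in one go:
import pandas as pd

df = pd.DataFrame({'a': [1,5,1,2,2,2,1], 'b': [1,4,2,4,8,1,2], 'c': [1,2,6,6,2,1,2]})

# positional indices of rows where the condition holds
hits = (df['c'] > 3).to_numpy().nonzero()[0]
# positions of the next three rows after each hit, clipped to the frame length
to_drop = sorted({p for h in hits for p in range(h + 1, h + 4) if p < len(df)})
df = df.drop(df.index[to_drop])
print(df)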

pandas DataFrame add a new column and fillna

I am trying to add a column to a pandas dataframe, like so:
df = pd.DataFrame()
df['one'] = pd.Series({'1':4, '2':6})
print (df)
df['two'] = pd.Series({'0':4, '2':6})
print (df)
This yields:
   one  two
1    4  NaN
2    6    6
However, I would like the result to be:
   one  two
0  NaN    4
1    4  NaN
2    6    6
How do you do that?
One possibility is to use pd.concat:
ser1 = pd.Series({'1':4, '2':6})
ser2 = pd.Series({'0':4, '2':6})
df = pd.concat((ser1, ser2), axis=1)
to get
     0    1
0  NaN    4
1    4  NaN
2    6    6
You can use join, telling pandas exactly how you want to do it:
df = pd.DataFrame()
df['one'] = pd.Series({'1':4, '2':6})
df.join(pd.Series({'0':4, '2':6}, name='two'), how='outer')
This results in:
   one  two
0  NaN    4
1    4  NaN
2    6    6
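Another option, as a sketch (assuming the union of both indices is what's wanted): expand the frame's index before assigning, so no labels are lost:
import pandas as pd

df = pd.DataFrame()
df['one'] = pd.Series({'1': 4, '2': 6})
two = pd.Series({'0': 4, '2': 6})
# reindex to the union of both indices; missing spots become NaN
df = df.reindex(df.index.union(two.index))
df['two'] = two
print(df)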
