Replace specific values in dataframe - python

How do I replace particular values in a DataFrame? For example, in the DataFrame below I want to replace the values in the rows whose A starts with one of [AA, CB, EZ], and the value I want to put in is ''.
import pandas
df = pandas.DataFrame({'A': ['AA','BB','CB','DD','EZ'],'B':[6,7,8,9,10],'C':[11,12,13,14,15]})
$ df
A B C
0 AA 6 11
1 BB 7 12
2 CB 8 13
3 DD 9 14
4 EZ 10 15
$ Expected output df
A B C
0 AA
1 BB 7 12
2 CB
3 DD 9 14
4 EZ

You can replace the values selected by a boolean mask with empty strings, but then you get mixed types (strings alongside numbers) and some functions will fail:
mask = df['A'].str.startswith(('AA','CB','EZ'))
df.loc[mask, ['B', 'C']] = ''
print (df)
A B C
0 AA
1 BB 7 12
2 CB
3 DD 9 14
4 EZ
Better is to replace the values with NaNs:
import numpy as np

df.loc[mask, ['B', 'C']] = np.nan
print (df)
A B C
0 AA NaN NaN
1 BB 7.0 12.0
2 CB NaN NaN
3 DD 9.0 14.0
4 EZ NaN NaN
Another solution:
df[['B', 'C']] = df[['B', 'C']].mask(mask)
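If you would rather keep B and C integer-typed while holding missing values, a minimal sketch reusing the mask from above (an assumption: pandas 0.24+ with the nullable Int64 extension dtype):
# masked cells become <NA> instead of forcing a float upcast
df[['B', 'C']] = df[['B', 'C']].astype('Int64').mask(mask)
print(df.dtypes)  # B and C stay Int64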

Drop duplicate rows, but only if column equals NaN

I only want to drop rows where two columns (ID, Code) are duplicated but the third column (Descrip) is NaN. My dataframe df (shown below) reflects my initial dataframe, and df2 is what I want instead.
df:
ID Descrip Code
1 NaN CC
1 3 SS
2 4 CC
2 7 SS
3 NaN CC
3 1 CC
3 NaN SS
4 20 CC
4 22 SS
5 15 CC
5 10 SS
6 100 CC
6 NaN CC
6 4 SS
6 NaN SS
df2:
ID Descrip Code
1 NaN CC
1 3 SS
2 4 CC
2 7 SS
3 1 CC
3 NaN SS
4 20 CC
4 22 SS
5 15 CC
5 10 SS
6 100 CC
6 4 SS
I know using df.drop_duplicates(subset=['ID', 'Code'], keep='first') would remove the duplicate rows, but I only want this where 'Descrip' is NaN.
You can use groupby and take the max value (max ignores NaN, so any real number wins over NaN):
df2 = df.groupby(["ID", "Code"])["Descrip"].max().reset_index()
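Why this works, in one line (a minimal check):
import numpy as np
import pandas as pd

print(pd.Series([np.nan, 100.0]).max())  # 100.0 - max() skips NaN
Note that the grouped result comes back sorted by (ID, Code) rather than in the original row order.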
I think you could use:
df = df[~(df.duplicated(['ID','Code'], False) & df['Descrip'].isna())]
Where (explaining to the best of my understanding):
df.duplicated(['ID','Code'], False) - returns a boolean Series marking duplicates over the subset ID and Code; keep=False flags every member of a duplicate group, not just the later occurrences.
df['Descrip'].isna() - checks whether or not Descrip holds NaN.
df[~(... first point above ... & ... second point above ...)] - the tilde is the not operator, inverting the boolean mask, and the ampersand chains the two expressions with a bitwise and; together they drop exactly the rows that are both duplicated and missing Descrip.
Result:
ID Descrip Code
0 1 NaN CC
1 1 3 SS
2 2 4 CC
3 2 7 SS
5 3 1 CC
6 3 NaN SS
7 4 20 CC
8 4 22 SS
9 5 15 CC
10 5 10 SS
11 6 100 CC
13 6 4 SS
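For completeness, a self-contained version of this answer (a sketch; the frame is reconstructed by reading the values off the question's tables above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':      [1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6],
    'Descrip': [np.nan, 3, 4, 7, np.nan, 1, np.nan, 20, 22, 15, 10, 100, np.nan, 4, np.nan],
    'Code':    ['CC','SS','CC','SS','CC','CC','SS','CC','SS','CC','SS','CC','CC','SS','SS'],
})

# keep a row unless it is both an (ID, Code) duplicate and missing Descrip
df2 = df[~(df.duplicated(['ID', 'Code'], keep=False) & df['Descrip'].isna())]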

Why is my output always NaN? I am expecting my Series data as output

I am trying the code below and getting NaN for all the columns/rows in the output:
import numpy as np
import pandas as pd
data1 = np.array([1,2,4,5,6])
data2 = np.array([11,12,14,15,16])
ser1 = pd.Series(data1)
ser2 = pd.Series(data2)
ser4 = pd.Series(data1)
dataframe = pd.DataFrame([ser1,ser2,ser2],['a','b','c'])
Output is :
0 1 2 3 4
a 1 2 4 5 6
b 11 12 14 15 16
c 11 12 14 15 16
But for below code , i am getting NaN for all the data in output
dataframe = pd.DataFrame([ser1,ser2,ser2,ser4],['a','b','c','d'],['AA','BB','CC','DD','EE'])
AA BB CC DD EE
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
I was expecting the output to be the Series data with the column names 'AA','BB','CC','DD','EE' respectively.
I tried to find similar questions on the forum but was unable to find any.
The problem is index alignment: when a DataFrame is built from a list of Series, the column names are taken from each Series' index, here the default 0 to 4. If you pass a different list of column names, nothing matches, so pandas returns NaN for all the data.
A possible solution is to create each Series with your new column names as its index:
data1 = np.array([1,2,4,5,6])
data2 = np.array([11,12,14,15,16])
i = ['AA','BB','CC','DD','EE']
ser1 = pd.Series(data1, index=i)
ser2 = pd.Series(data2, index=i)
ser4 = pd.Series(data1, index=i)
dataframe = pd.DataFrame([ser1,ser2,ser2],['a','b','c'])
print (dataframe)
AA BB CC DD EE
a 1 2 4 5 6
b 11 12 14 15 16
c 11 12 14 15 16
You can also specify the row labels by giving each Series a name:
ser1 = pd.Series(data1, index=i, name='a')
ser2 = pd.Series(data2, index=i, name='b')
ser4 = pd.Series(data1, index=i, name='c')
dataframe = pd.DataFrame([ser1,ser2,ser2])
print (dataframe)
AA BB CC DD EE
a 1 2 4 5 6
b 11 12 14 15 16
b 11 12 14 15 16
You can ignore the index of the Series by stacking them into an array with np.vstack; this lets you set your own index and columns:
pd.DataFrame(np.vstack([ser1,ser2,ser2,ser4]),['a','b','c','d'],['AA','BB','CC','DD','EE'])
AA BB CC DD EE
a 1 2 4 5 6
b 11 12 14 15 16
c 11 12 14 15 16
d 1 2 4 5 6
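Another way to sidestep alignment (a sketch, equivalent in effect to the np.vstack approach): build the frame from the original Series with their default 0..4 index, then overwrite both axes:
ser1, ser2, ser4 = pd.Series(data1), pd.Series(data2), pd.Series(data1)
dataframe = pd.DataFrame([ser1, ser2, ser2, ser4])
dataframe.index = ['a', 'b', 'c', 'd']               # row labels
dataframe.columns = ['AA', 'BB', 'CC', 'DD', 'EE']   # replaces the 0..4 column labels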

opposite of df.diff() in pandas

I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of each row with the previous row - the opposite of the .diff() function, which takes the difference.
This is how I'm currently solving the problem:
df = pd.DataFrame({'c':['dd','ee','ff', 'gg', 'hh'], 'd':[1,2,3,4,5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
df.cumsum() is the opposite of df.diff() in the running-total sense.
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
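As a quick sanity check (a sketch using the df defined above), the two operations really do invert each other, up to the first row that .diff() cannot compute:
roundtrip = df.cumsum().diff()
roundtrip.iloc[0] = df.iloc[0]             # restore the first row lost to diff
print(roundtrip.equals(df.astype(float)))  # True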
If you cannot use rolling, due to a MultiIndex or similar, you can try using .cumsum() and then .diff(2) to subtract the cumulative sum from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, .cumsum().diff(n) is equivalent to a rolling sum over n consecutive values (the same as rolling(n).sum()), so n=2 gives the pairwise sum that is the opposite of .diff(). The catch is that the first n results come out as NaN, one more than rolling(n).sum() produces.
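A quick check of that rule (a sketch; n=3 chosen arbitrarily):
import pandas as pd

s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8])
n = 3
a = s.rolling(n).sum()    # first n-1 values are NaN
b = s.cumsum().diff(n)    # first n values are NaN
print(a.tail(len(s) - n).equals(b.tail(len(s) - n)))  # True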

Python pandas; fill in data frame with pivot_table

I have a large python script which makes two dataframes A and B, and at the end I want to fill in dataframe A with the values of dataframe B while keeping the columns of dataframe A, but it is not going well.
Dataframe A is like this
A B C D
1 ab
2 bc
3 cd
Dataframe B:
A BB CC
1 C 10
2 C 11
3 D 12
My output must be:
new dataframe
A B C D
1 ab 10
2 bc 11
3 cd 12
But my output is
A B C D
1 ab
2 bc
3 cd
Why is it not filling in the values of dataframe B?
My command is
dfnew = dfB.pivot_table(index='A', columns='BB', values='CC').reindex(index=dfA.index, columns=dfA.columns).fillna(dfA)
Your reindex uses dfA's default RangeIndex (0, 1, 2), while the pivoted frame is indexed by the values of A (1, 2, 3), so nothing aligns. I think you need set_index on column A to align the data, then fillna or combine_first, and last reset_index:
dfA = pd.DataFrame({'A':[1,2,3], 'B':['ab','bc','cd'], 'C':[np.nan] * 3,'D':[np.nan] * 3})
print (dfA)
A B C D
0 1 ab NaN NaN
1 2 bc NaN NaN
2 3 cd NaN NaN
dfB = pd.DataFrame({'A':[1,2,3], 'BB':['C','C','D'], 'CC':[10,11,12]})
print (dfB)
A BB CC
0 1 C 10
1 2 C 11
2 3 D 12
df = dfB.pivot_table(index='A', columns='BB', values='CC')
print (df)
BB C D
A
1 10.0 NaN
2 11.0 NaN
3 NaN 12.0
dfA = dfA.set_index('A').fillna(df).reset_index()
#dfA = dfA.set_index('A').combine_first(df).reset_index()
print (dfA)
A B C D
0 1 ab 10.0 NaN
1 2 bc 11.0 NaN
2 3 cd NaN 12.0
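If instead you want dfB's values to take precedence even where dfA already has data, DataFrame.update is an in-place alternative (a sketch, starting again from the original dfA; unlike fillna, update overwrites every cell where df has a non-NaN value):
dfA = dfA.set_index('A')
dfA.update(df)          # overwrite with non-NaN values from the pivoted df
dfA = dfA.reset_index()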

Getting the content of a pandas row based on some conditions of other rows

I have a pandas DataFrame df1 with the following content:
Serial N year current
B 10 14
B 10 16
B 11 10
B 11
B 11 15
C 12 11
C 9
C 12 13
C 12
D 3 4
I would like to count the number of occurrences of each unique serial. If the count for a serial is less than 2, I would like to replace year and current for that row with nan. I would like to have something like this:
Serial N year current
B 10 14
B 10 16
B 11 10
B 11
B 11 15
C 12 11
C 9
C 12 13
C 12
D nan nan
You can combine value_counts, lt and reindex to get a boolean array of where to change values to nan, and then use loc to make the changes.
serial_filter = df1['Serial N'].value_counts().lt(2).reindex(df1['Serial N'])
df1.loc[serial_filter.values, ['year', 'current']] = np.nan
The resulting output:
Serial N year current
0 B 10.0 14.0
1 B 10.0 16.0
2 B 11.0 10.0
3 B 11.0 NaN
4 B 11.0 15.0
5 C 12.0 11.0
6 C NaN 9.0
7 C 12.0 13.0
8 C 12.0 NaN
9 D NaN NaN
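An equivalent mask (a sketch) can be built with groupby and transform, which broadcasts each group's size back onto the original rows:
mask = df1.groupby('Serial N')['Serial N'].transform('size') < 2
df1.loc[mask, ['year', 'current']] = np.nan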
Setup
import pandas as pd
from io import StringIO
text = """Serial_N year current
B 10 14
B 10 16
B 11 10
B 11 nan
B 11 15
C 12 11
C nan 9
C 12 13
C 12 nan
D 3 4"""
df1 = pd.read_csv(StringIO(text), sep=r'\s+')
df1.columns = ['Serial N', 'year', 'current']
Now I have the same df1 you showed above.
Solution
serial_filter = df1.groupby('Serial N').apply(lambda x: len(x))
serial_filter = serial_filter[serial_filter > 1]
mask = df1.apply(lambda x: x['Serial N'] in serial_filter, axis=1)
df1 = df1[mask]
Demonstration and Explanation
serial_filter = df1.groupby('Serial N').apply(lambda x: len(x))
print(serial_filter)
Serial N
B 5
C 4
D 1
dtype: int64
produce a count of each unique Serial N
serial_filter = serial_filter[serial_filter > 1]
print(serial_filter)
Serial N
B 5
C 4
dtype: int64
Redefine it so that it only includes those Serial N values with a count greater than 1
mask = df1.apply(lambda x: x['Serial N'] in serial_filter, axis=1)
print(mask)
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
dtype: bool
Create a filter mask to use on df1
df1 = df1[mask]
print(df1)
Serial N year current
0 B 10.0 14.0
1 B 10.0 16.0
2 B 11.0 10.0
3 B 11.0 NaN
4 B 11.0 15.0
5 C 12.0 11.0
6 C NaN 9.0
7 C 12.0 13.0
8 C 12.0 NaN
Update df1
