Is it possible to map only the first occurrence of each key in a DataFrame?
Example:
testDict = {'A': 1, 'B': 2}
df
Name Num
A
A
B
B
Expected output
Name Num
A 1
A
B 2
B
Use duplicated to find the first occurrence of each key, then map only those rows:
df['Num'] = df.Name[~df.Name.duplicated()].map(testDict)
print(df)
Output
Name Num
0 A 1.0
1 A NaN
2 B 2.0
3 B NaN
To remove the NaN values, if you wish, do:
df = df.fillna('')
Map the result of drop_duplicates, assuming the index is unique so that the values align. (It is probably best to keep the NaN values so that the column remains numeric.)
df['Num'] = df['Name'].drop_duplicates().map(testDict)
Name Num
0 A 1.0
1 A NaN
2 B 2.0
3 B NaN
You can use duplicated and map:
df['Num'] = np.where(~df['Name'].duplicated(), df['Name'].map(testDict), '')
Output:
Name Num
0 A 1
1 A
2 B 2
3 B
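As a self-contained check of the first-occurrence idea, here is a minimal sketch. It assumes a key may appear more than twice (the extra "A" row and the names test_dict and first are made up for illustration) and keeps NaN so the column stays numeric:

```python
import pandas as pd

test_dict = {"A": 1, "B": 2}
# Hypothetical data: "A" appears three times to exercise the general case.
df = pd.DataFrame({"Name": ["A", "A", "A", "B", "B"]})

# duplicated() flags every repeat of a value, so its negation
# is True only on the first occurrence of each key.
first = ~df["Name"].duplicated()

# Map every row, then keep the mapped value only on first occurrences;
# the other rows become NaN and the column stays float.
df["Num"] = df["Name"].map(test_dict).where(first)
print(df)
```

If blank cells are preferred for display, fillna('') can be applied afterwards, at the cost of turning the column into object dtype.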
Consider the following base DataFrame:
df = pd.DataFrame({'a': [1,2,1,2,1],
                   'b': [1,1,3,3,1]
                   })
Then I pick the a column and replace a few values based on the values in the b column:
df.loc[df['b']== 3]['a'].replace(2,1)
How can I put the a column back into my original df, changing only those specific filtered values?
Desired result:
df = pd.DataFrame({'a': [1,2,1,1,1],
                   'b': [1,1,3,3,1]
                   })
You can do this with update:
df.update(df.loc[df['b']== 3,['a']].replace(2,1))
df
Out[354]:
a b
0 1.0 1
1 2.0 1
2 1.0 3
3 1.0 3
4 1.0 1
You can try df.mask
df['a'] = df['a'].mask(df['a'].eq(2) & df['b'].eq(3), 1)
print(df)
a b
0 1 1
1 2 1
2 1 3
3 1 3
4 1 1
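Another common way to write the same conditional replacement is boolean indexing with loc. This is only a sketch on the question's data; the variable name cond is made up here:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2, 1], "b": [1, 1, 3, 3, 1]})

# Rows where a is 2 and b is 3 at the same time.
cond = df["a"].eq(2) & df["b"].eq(3)

# Assign through a single loc call so the original frame is modified,
# unlike the chained df.loc[...]['a'] form, which works on a copy.
df.loc[cond, "a"] = 1
print(df)
```

Because the assigned value is an integer, the a column keeps its integer dtype.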
I have two dataframes,
df1:
hash a b c
ABC 1 2 3
def 5 3 4
Xyz 3 2 -1
df2:
hash v
Xyz 3
def 5
I want to make
df:
hash a b c
ABC 1 2 3 (= as is, because no matching 'ABC' in df2)
def 25 15 20 (= 5*5 3*5 4*5)
Xyz 9 6 -3 (= 3*3 2*3 -1*3)
As shown above, I want to build a DataFrame by multiplying the values of df1 by those of df2 wherever their index (the first column, hash) matches.
Since df2 has only one column (v), every column of df1 except the first one (the index) should be affected.
Is there a neat, Pythonic pandas way to achieve this?
df1.set_index(['hash']).mul(df2.set_index(['hash'])) and similar attempts do not seem to work.
One approach:
df1 = df1.set_index("hash")
df2 = df2.set_index("hash")["v"]
res = df1.mul(df2, axis=0).combine_first(df1)
print(res)
Output
a b c
hash
ABC 1.0 2.0 3.0
Xyz 9.0 6.0 -3.0
def 25.0 15.0 20.0
One Method:
# We'll make this for convenience
cols = ['a', 'b', 'c']
# Merge the DataFrames, keeping everything from df1
df = df1.merge(df2, on='hash', how='left').fillna(1)
# We'll make the v column integers again since it's been filled.
df.v = df.v.astype(int)
# Broadcast the multiplication across axis 0
df[cols] = df[cols].mul(df.v, axis=0)
# Drop the no-longer needed column:
df = df.drop('v', axis=1)
print(df)
Output:
hash a b c
0 ABC 1 2 3
1 def 25 15 20
2 Xyz 9 6 -3
Alternative Method:
# Set indices
df1 = df1.set_index('hash')
df2 = df2.set_index('hash')
# Apply multiplication and fill values
df = (df1.mul(df2.v, axis=0)
.fillna(df1)
.astype(int)
.reset_index())
# Output:
hash a b c
0 ABC 1 2 3
1 Xyz 9 6 -3
2 def 25 15 20
The function you are looking for is actually multiply.
Here's how I have done it:
>>> df
hash a b
0 ABC 1 2
1 DEF 5 3
2 XYZ 3 -1
>>> df2
hash v
0 XYZ 4
1 ABC 8
df = df.merge(df2, on='hash', how='left').fillna(1)
>>> df
hash a b v
0 ABC 1 2 8.0
1 DEF 5 3 1.0
2 XYZ 3 -1 4.0
df[['a','b']] = df[['a','b']].multiply(df['v'], axis='index')
>>>df
hash a b v
0 ABC 8.0 16.0 8.0
1 DEF 5.0 3.0 1.0
2 XYZ 12.0 -4.0 4.0
You can actually drop v at the end if you don't need it.
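For comparison, here is a sketch that avoids both the merge and the NaN round trip by aligning the multipliers to df1 first. It assumes the hash values in df2 are unique (reindex needs a unique index); the names left and factors are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"hash": ["ABC", "def", "Xyz"],
                    "a": [1, 5, 3], "b": [2, 3, 2], "c": [3, 4, -1]})
df2 = pd.DataFrame({"hash": ["Xyz", "def"], "v": [3, 5]})

left = df1.set_index("hash")

# Align v to df1's index; hashes missing from df2 get a neutral factor of 1.
factors = df2.set_index("hash")["v"].reindex(left.index, fill_value=1)

# Multiply every column of df1 row-wise by the aligned factor.
res = left.mul(factors, axis=0).reset_index()
print(res)
```

Since no NaN is ever introduced, the result keeps df1's row order and integer dtype without a final astype(int).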
I have the following DataFrame:
A B
Temp1 1
Temp2 2
NaN NaN
NaN 4
Since A and B are correlated, I am able to create a new column C in which I have calculated the missing values of A and B and stored them as a tuple:
A B C
Temp1 1 (1,Temp1)
Temp2 2 (2, Temp2)
NaN NaN (3, Temp3)
NaN 4 (4, Temp4)
Now I have to drop column C and fill the NaN values in the corresponding columns.
Use Series.fillna, selecting the values from the tuples by indexing with str, and finally remove the C column:
#if the values are strings rather than tuples
#df.C = df.C.str.strip('()').str.split(',').apply(tuple)
df.A = df.A.fillna(df.C.str[1])
df.B = df.B.fillna(df.C.str[0])
df = df.drop('C', axis=1)
print (df)
A B
0 Temp1 1
1 Temp2 2
2 Temp3 3
3 Temp4 4
Or create a DataFrame from C, using DataFrame.pop to extract and remove the column in one step, set the new column names, and pass it to DataFrame.fillna:
#if the values are strings rather than tuples
#df.C = df.C.str.strip('()').str.split(',').apply(tuple)
df[['A','B']] = df[['A','B']].fillna(pd.DataFrame(df.pop('C').tolist(), columns=['B','A']))
print (df)
A B
0 Temp1 1
1 Temp2 2
2 Temp3 3
3 Temp4 4
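A runnable sketch of the same idea on the question's data, assuming C really holds (B, A) tuples. The helper name filler is made up, and the final astype(int) is an extra step added here so that B prints as integers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Temp1", "Temp2", np.nan, np.nan],
    "B": [1, 2, np.nan, 4],
    "C": [(1, "Temp1"), (2, "Temp2"), (3, "Temp3"), (4, "Temp4")],
})

# Unpack the tuples into a helper frame that shares df's index,
# so fillna can align the replacement values row by row.
filler = pd.DataFrame(df["C"].tolist(), columns=["B", "A"], index=df.index)

df["A"] = df["A"].fillna(filler["A"])
df["B"] = df["B"].fillna(filler["B"]).astype(int)
df = df.drop(columns="C")
print(df)
```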
Why are pandas NaN values sometimes typed as numpy.float64 and sometimes as float?
This is confusing when I want to use a function to change values in a DataFrame depending on other columns.
Example:
A B C
0 1 NaN d
1 2 a s
2 2 b s
3 3 c NaN
I have a function to change the value of column C:
def change_val(df):
    if df.A==1 and df.B==np.nan:
        return df.C
    else:
        return df.B
Then I apply this function to column C:
df['C']=df.apply(lambda x: change_val(x),axis=1)
Things go wrong at df.B==np.nan. How do I express this correctly?
Desired result:
A B C
0 1 NaN d
1 2 a a
2 2 b b
3 3 c c
Use numpy.where or loc; to check for missing values, use the dedicated method Series.isna:
mask = (df.A==1) & (df.B.isna())
#older pandas versions
#mask = (df.A==1) & (df.B.isnull())
df['C'] = np.where(mask, df.C, df.B)
Or:
df.loc[~mask, 'C'] = df.B
print (df)
A B C
0 1 NaN d
1 2 a a
2 2 b b
3 3 c c
For more information about working with missing data check docs.
def change_val(df):
    if df.A==1 and pd.isnull(df.B):
        return df.C
    else:
        return df.B
NaN is not a value and does not compare equal to anything, not even NaN itself, so use isnull()/isna() to test for it.
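A quick sketch that makes the difference visible on the question's data (the names eq_mask and na_mask are made up for illustration):

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself.
print(np.nan == np.nan)   # False
print(pd.isna(np.nan))    # True

df = pd.DataFrame({"A": [1, 2, 2, 3],
                   "B": [np.nan, "a", "b", "c"],
                   "C": ["d", "s", "s", np.nan]})

# The equality test finds no missing values at all ...
eq_mask = df["B"] == np.nan
# ... while isna() flags the first row as expected.
na_mask = df["B"].isna()
print(eq_mask.tolist())
print(na_mask.tolist())
```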
I have a dataframe that looks something like this:
I want to replace all 1's in the range A:D with the name of the column, so that the final result should resemble:
How can I do that?
You can recreate my dataframe with this:
dfz = pd.DataFrame({'A' : [1,0,0,1,0,0],
'B' : [1,0,0,1,0,1],
'C' : [1,0,0,1,3,1],
'D' : [1,0,0,1,0,0],
'E' : [22.0,15.0,None,10.,None,557.0]})
One way could be to use replace and pass in a Series mapping column labels to values (those same labels in this case):
>>> dfz.loc[:, 'A':'D'].replace(1, pd.Series(dfz.columns, dfz.columns))
A B C D
0 A B C D
1 0 0 0 0
2 0 0 0 0
3 A B C D
4 0 0 3 0
5 0 B C 0
To make the change permanent, you'd assign the returned DataFrame back to dfz.loc[:, 'A':'D'].
Solutions aside, it's useful to keep in mind that you may lose a lot of performance benefits when you mix numeric and string types in columns, as pandas is forced to use the generic 'object' dtype to hold the values.
A solution using where:
>>> dfz.where(dfz != 1, dfz.columns.to_series(), axis=1)
A B C D E
0 A B C D 22.0
1 0 0 0 0 15.0
2 0 0 0 0 NaN
3 A B C D 10.0
4 0 0 3 0 NaN
5 0 B C 0 557.0
Maybe it's not so elegant, but you can just loop through the columns and replace:
for i in ['A','B','C','D']:
    dfz[i] = dfz[i].replace(1, i)
I do prefer the very elegant solution from @ajcr.
In case you have column names that you can't easily use for slicing, here is my solution:
cols = dfz.filter(regex=r'(A|B|C|D)').columns.tolist()
dfz[cols] = (
    dfz[dfz!=1].loc[:, cols]
    .apply(lambda x: x.fillna(x.name))
)
Output:
In [207]: dfz
Out[207]:
A B C D E
0 A B C D 22.0
1 0 0 0 0 15.0
2 0 0 0 0 NaN
3 A B C D 10.0
4 0 0 3 0 NaN
5 0 B C 0 557.0
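For reference, a self-contained sketch of the per-column approach using Series.mask. The explicit astype(object) is an assumption made here so that mixing labels and integers in one column is deliberate rather than an implicit upcast; the name cols is made up:

```python
import pandas as pd

dfz = pd.DataFrame({'A': [1, 0, 0, 1, 0, 0],
                    'B': [1, 0, 0, 1, 0, 1],
                    'C': [1, 0, 0, 1, 3, 1],
                    'D': [1, 0, 0, 1, 0, 0],
                    'E': [22.0, 15.0, None, 10.0, None, 557.0]})

cols = ['A', 'B', 'C', 'D']
for col in cols:
    # Where the value equals 1, substitute the column's own label;
    # all other values are left untouched.
    dfz[col] = dfz[col].astype(object).mask(dfz[col].eq(1), col)

print(dfz)
```

Column E is never touched, so it stays float64 while A through D become object columns.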