I am trying to use the apply function to create two new columns. When the dataframe has a custom index, it doesn't work and the new columns are filled with NaN. If the dataframe has no custom index, it works. Could you please help? Thanks
def calc_test(row):
    a = row['col1'] + row['col2']
    b = row['col1'] / row['col2']
    return (a, b)
df_test_dict = {'col1': [1, 2, 3, 4, 5], 'col2': [10, 20, 30, 40, 50]}
df_test = pd.DataFrame(df_test_dict)
df_test.index = ['a1', 'b1', 'c1', 'd1', 'e1']
df_test
col1 col2
a1 1 10
b1 2 20
c1 3 30
d1 4 40
e1 5 50
Now when I use the apply function, the newly created columns are filled with NaN. Thanks for your help.
df_test[['a','b']] = pd.DataFrame(df_test.apply(lambda row:calc_test(row),axis=1).tolist())
df_test
col1 col2 a b
a1 1 10 NaN NaN
b1 2 20 NaN NaN
c1 3 30 NaN NaN
d1 4 40 NaN NaN
e1 5 50 NaN NaN
When using apply, you can pass the result_type='expand' argument to expand the output of your function into columns of a pandas DataFrame:
df_test[['a','b']] = df_test.apply(lambda row: calc_test(row), axis=1, result_type='expand')
This returns:
col1 col2 a b
a1 1 10 11.0 0.1
b1 2 20 22.0 0.1
c1 3 30 33.0 0.1
d1 4 40 44.0 0.1
e1 5 50 55.0 0.1
You are wrapping the return value of apply in a DataFrame, which gets a default RangeIndex of [0, 1, 2, 3, 4]; those labels don't exist in your original DataFrame's index, so the assignment aligns to NaN. You can see this by looking at the output of pd.DataFrame(df_test.apply(lambda row: calc_test(row), axis=1).tolist()).
Simply remove the pd.DataFrame() to fix this problem.
df_test[['a', 'b']] = df_test.apply(lambda row:calc_test(row),axis=1).tolist()
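To see the alignment problem concretely, here's a self-contained sketch of both the failure and the fix, using the names from the question:

```python
import pandas as pd

def calc_test(row):
    return (row['col1'] + row['col2'], row['col1'] / row['col2'])

df_test = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 20, 30, 40, 50]},
                       index=['a1', 'b1', 'c1', 'd1', 'e1'])

# The wrapped DataFrame gets a fresh RangeIndex, so assignment by label
# finds no matching rows and fills NaN:
wrapped = pd.DataFrame(df_test.apply(calc_test, axis=1).tolist())
print(wrapped.index.tolist())  # [0, 1, 2, 3, 4]

# Fix: assign the raw list of tuples, so no index alignment happens
df_test[['a', 'b']] = df_test.apply(calc_test, axis=1).tolist()

# Equivalent fix: keep the wrapper but reuse the original index
# df_test[['a', 'b']] = pd.DataFrame(wrapped.values, index=df_test.index)
```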
Related
I need to index the dataframe by positional index, but a previous operation produced NaN values that I want to preserve. How can I achieve this?
df1
NaN
1
NaN
NaN
NaN
6
df2
0 10
1 15
2 13
3 15
4 16
5 17
6 17
7 18
8 10
df3
0 15
1 17
The output I want
NaN
15
NaN
NaN
NaN
17
df2.iloc(df1)
IndexError: indices are out-of-bounds
The .iloc method leads to an out-of-bounds error in this case, so I think .iloc isn't usable here. df3 is another output generated by .loc, but I don't know how to add the NaN rows back between its values. An answer that produces the output from df1 and df3 is also fine.
If df1 and df2 share the same index values, use DataFrame.mask with DataFrame.isna to take the values from df2 wherever df1 is not missing:
df1 = df2.mask(df1.isna())
print (df1)
col
0 NaN
1 15.0
2 NaN
3 NaN
4 NaN
5 17.0
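A runnable sketch with the sample data from the question (the column name col is an assumption; df2 is longer than df1 here, so it is first trimmed to df1's index before masking):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'col': [np.nan, 1, np.nan, np.nan, np.nan, 6]})
df2 = pd.DataFrame({'col': [10, 15, 13, 15, 16, 17, 17, 18, 10]})

# Align df2 to df1's index first, since the frames differ in length;
# mask() then blanks out every position where df1 is missing
out = df2.loc[df1.index].mask(df1.isna())
print(out)
```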
Shown below is a dataframe where column col2 contains many NaNs. I want to fill only those NaN values by using col1 as the key into the dictionary dict_map and mapping the corresponding values into col2.
Reproducible code:
import pandas as pd
import numpy as np
dict_map = {'a':45,'b':23,'c':97,'z': -1}
df = pd.DataFrame()
df['tag'] = [1,2,3,4,5,6,7,8,9,10,11]
df['col1'] = ['a','b','c','b','a','a','z','c','b','c','b']
df['col2'] = [np.nan,909,34,56,np.nan,45,np.nan,11,61,np.nan,np.nan]
df['_'] = df['col1'].map(dict_map)
Expected Output
One method is:
df['col3'] = np.where(df['col2'].isna(),df['_'],df['col2'])
df
I just wanted to know whether there is another method, using a function and map, that optimizes this.
You can map col1 with your dict_map and then use the result as input to fillna, as follows:
df['col3'] = df['col2'].fillna(df['col1'].map(dict_map))
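Put together with the reproducible data from the question, the one-liner looks like this:

```python
import pandas as pd
import numpy as np

dict_map = {'a': 45, 'b': 23, 'c': 97, 'z': -1}
df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'b', 'a', 'a', 'z', 'c', 'b', 'c', 'b'],
    'col2': [np.nan, 909, 34, 56, np.nan, 45, np.nan, 11, 61, np.nan, np.nan],
})

# map() builds a Series of fallback values aligned on the index;
# fillna() only consults it at the positions where col2 is NaN
df['col3'] = df['col2'].fillna(df['col1'].map(dict_map))
print(df['col3'].tolist())
```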
You can achieve the very same result using a list comprehension; it is a pythonic solution, and I believe it performs well.
We are just reading col2 and copying the value to col3 if it's not NaN. If it is, we look into col1, use its value as the key, and take the corresponding value from dict_map instead.
df['col3'] = [df['col2'][idx] if not np.isnan(df['col2'][idx]) else dict_map[df['col1'][idx]] for idx in df.index.tolist()]
Output:
df
tag col1 col2 col3
0 1 a NaN 45.0
1 2 b 909.0 909.0
2 3 c 34.0 34.0
3 4 b 56.0 56.0
4 5 a NaN 45.0
5 6 a 45.0 45.0
6 7 z NaN -1.0
7 8 c 11.0 11.0
8 9 b 61.0 61.0
9 10 c NaN 97.0
10 11 b NaN 23.0
I have following dataframe:
A1 A2 B1 B2
0 10 20 20 NA
1 20 40 30 No
2 50 No 50 10
3 40 NA 50 20
I want to change value in column A1 to NaN whenever corresponding value in column A2 is No or NA. Same for B1.
Note: NA here is a string object, not NaN. The expected output:
A1 A2 B1 B2
0 10 20 NaN NA
1 20 40 NaN No
2 NaN No 50 10
3 NaN NA 50 20
If NA and No are strings, use Series.isin with DataFrame.loc:
df.loc[df.A2.isin(['NA','No']), 'A1'] = np.nan
Or Series.mask:
df['A1'] = df['A1'].mask(df.A2.isin(['NA','No']))
If NA is missing value test it by Series.isna:
df.loc[df.A2.isna() | df.A2.eq('No'), 'A1'] = np.nan
Or:
df['A1'] = df['A1'].mask(df.A2.isna() | df.A2.eq('No'))
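A minimal sketch covering both variants on the sample frame, applying the .loc form to the A columns and the mask form to the B columns (either form works for either pair):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A1': [10, 20, 50, 40],
                   'A2': ['20', '40', 'No', 'NA'],
                   'B1': [20, 30, 50, 50],
                   'B2': ['NA', 'No', '10', '20']})

# isin() flags the sentinel strings; .loc writes NaN into A1 on those rows
df.loc[df.A2.isin(['NA', 'No']), 'A1'] = np.nan

# the equivalent mask() form, applied to the B columns
df['B1'] = df['B1'].mask(df.B2.isin(['NA', 'No']))
print(df)
```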
I want to calculate the mean of all values in a row once that row has accumulated, e.g., 5 entries; this leads to different "start" points for the mean calculation. As soon as there are 5 valid values in a row, the mean of the values so far should be calculated.
Note: There might be some NaNs in the rows which should not count in the 5 entries, I want to use valid values only.
Example if I wanted to calculate after e.g. 5 entries:
Index D1 D2 D3 D4 D5 D6 D7
1 NaN NaN 2 3 4 5 6
2 1 1 2 3 4 5 6
3 2 1 NaN 3 4 5 6
4 NaN NaN NaN 3 4 5 6
My desired output looks like this:
Index D1 D2 D3 D4 D5 D6 D7
1 NaN NaN NaN NaN NaN NaN 4
2 NaN NaN NaN NaN 2.2 2.67 3.14
3 NaN NaN NaN NaN NaN 3 3.5
4 NaN NaN NaN NaN NaN NaN NaN
I was trying to use the .count method, but I got NaNs in all cells using my code below:
B = A.copy()
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        if A.iloc[i, 0:j].count() > 5:
            B.iloc[i, j] = B.iloc[i, 0:j].sum() / B.iloc[i, 0:j].count()
        else:
            B.iloc[i, j] = np.nan
Edit:
It looks like I found a solution: changing this inside the for loop:
# Old version
B.iloc[i,j] = B.iloc[i,0:j].sum()/B.iloc[i,0:j].count()
# New version
B.iloc[i,j] = A.iloc[i,0:j].sum()/A.iloc[i,0:j].count()
If someone has a faster/prettier solution let me know anyways, I don't really like this one.
What you want is the expanding mean:
df.loc[:, 'D1':].expanding(5, axis=1).mean()
I'm not sure whether Index is a column or the index of your dataframe. If it's the index, you can remove the .loc[...] call.
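In recent pandas versions the axis argument to expanding is deprecated; the same result can be obtained by transposing instead. A sketch using the question's data (assuming Index is the frame's index):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'D1': [np.nan, 1, 2, np.nan],
                   'D2': [np.nan, 1, 1, np.nan],
                   'D3': [2, 2, np.nan, np.nan],
                   'D4': [3, 3, 3, 3],
                   'D5': [4, 4, 4, 4],
                   'D6': [5, 5, 5, 5],
                   'D7': [6, 6, 6, 6]}, index=[1, 2, 3, 4])

# expanding(5) only emits a mean once 5 valid (non-NaN) observations
# have accumulated; transposing makes the window run along each row
out = df.T.expanding(5).mean().T
print(out)
```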
I have a pandas dataframe df as shown.
col1 col2
0 NaN a
1 2 b
2 NaN c
3 NaN d
4 5 e
5 6 f
I want to find the first NaN value in col1 and assign a new value to it. I've tried both of the following methods, but neither works.
df.loc[df['col1'].isna(), 'col1'][0] = 1
df.loc[df['col1'].isna(), 'col1'].iloc[0] = 1
Neither of them shows any error or warning, but when I check the original dataframe, the value hasn't changed.
What is the correct way to do this?
You can use .fillna() with the limit=1 parameter:
df['col1'].fillna(1, limit=1, inplace=True)
print(df)
Prints:
col1 col2
0 1.0 a
1 2.0 b
2 NaN c
3 NaN d
4 5.0 e
5 6.0 f
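A runnable sketch of the limit=1 approach, plus a label-based alternative; the idxmax trick is an addition here, not part of the original answer:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [np.nan, 2, np.nan, np.nan, 5, 6],
                   'col2': list('abcdef')})

# fillna(limit=1) replaces only the first NaN it meets
df['col1'] = df['col1'].fillna(1, limit=1)

# Alternative: isna().idxmax() returns the label of the first remaining
# NaN, and df.loc writes into the original frame, avoiding the
# copy-of-a-slice problem the question ran into.
# (Caution: idxmax returns the first label overall if no NaN is left.)
first_nan = df['col1'].isna().idxmax()
df.loc[first_nan, 'col1'] = 99
print(df['col1'].tolist())
```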