I have a dataframe like the one below:
Re_MC,Fi_MC,Fin_id,Res_id
1,2,3,4
,7,6,11
11,,31,32
,,35,38
df1 = pd.read_clipboard(sep=',')
I would like to fill the NAs in two steps:
a) First, compare only Re_MC and Fi_MC. If a value is missing in either of these columns, copy it from the other column.
b) If, after step a, there is still an NA in either Re_MC or Fi_MC, copy the value from Res_id for Re_MC and from Fin_id for Fi_MC.
So, I tried the two approaches below.
Approach 1 - This works but is not efficient/elegant
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Fi_MC'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Re_MC'])
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Res_id'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Fin_id'])
Approach 2 - This doesn't work and produces incorrect output
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Fi_MC']).fillna(df1['Res_id'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Re_MC']).fillna(df1['Fin_id'])
Is there a more efficient way to fillna sequentially? Meaning, we do step a first and then, based on the result of step a, we do step b.
I expect my output to look as shown below
updated code
df_new = (df_new
.fillna({'Re MC': df_new['Re Cust'],'Re MC': df_new['Re Cust_System']})
.fillna({'Fi MC' : df_new['Fi.Fi Customer'],'Final MC':df_new['Re.Fi Customer']})
.fillna({'Fi MC' : df_new['Re MC']})
.fillna({'Class Fi MC':df_new['Re MC']})
)
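You can use dictionaries in fillna. All the Series inside the dictionaries are read from the original df1 before any filling happens, so step a is finished for both columns before step b starts: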
(df1
.fillna({'Re_MC': df1['Fi_MC'], 'Fi_MC': df1['Re_MC']})
.fillna({'Re_MC': df1['Res_id'], 'Fi_MC': df1['Fin_id']})
)
output:
Re_MC Fi_MC Fin_id Res_id
0 1.0 2.0 3 4
1 7.0 7.0 6 11
2 11.0 11.0 31 32
3 38.0 35.0 35 38
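For anyone who wants to reproduce this without the clipboard, here is a minimal self-contained sketch. It assumes the sample data from the question (loaded with io.StringIO, header without the trailing comma) and also reruns Approach 2 to show where it goes wrong. The names csv_text, filled, re_chain and fi_chain are only for illustration.

```python
import io
import pandas as pd

# Sample data from the question (header written without the trailing comma)
csv_text = """Re_MC,Fi_MC,Fin_id,Res_id
1,2,3,4
,7,6,11
11,,31,32
,,35,38
"""
df1 = pd.read_csv(io.StringIO(csv_text))

# Two-step dictionary fill: every right-hand Series is read from the
# untouched df1, so step a never sees values copied from the id columns.
filled = (df1
          .fillna({'Re_MC': df1['Fi_MC'], 'Fi_MC': df1['Re_MC']})
          .fillna({'Re_MC': df1['Res_id'], 'Fi_MC': df1['Fin_id']}))

# Approach 2 for comparison: Re_MC is completely filled (including from
# Res_id) before Fi_MC looks at it, so the last row of Fi_MC picks up 38
# from Re_MC instead of 35 from Fin_id.
re_chain = df1['Re_MC'].fillna(df1['Fi_MC']).fillna(df1['Res_id'])
fi_chain = df1['Fi_MC'].fillna(re_chain).fillna(df1['Fin_id'])

print(filled)
print(fi_chain.tolist())
```

The only difference between the two versions is which copy of Re_MC the Fi_MC fill reads from: the original one (dictionary version) or the already completed one (Approach 2).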
I have a dataframe that has the following columns:
Acct Num, Correspondence Date, Open Date
For each opened account, I am being asked to look back at all the correspondences that happened within
30 days of the open date of that account, then assign points to the correspondences as follows:
Forty-twenty-forty: attribute 40% (0.4 points) to the first touch,
40% to the last touch, and divide the remaining 20% between all touches in between.
I know the apply and groupby functions, but this is beyond my pay grade.
I have to group by account, with a condition based on comparing two columns against each other,
to get the total number of correspondences. I guess they also have to be sorted, as the following step of assigning points to correspondences depends on the order in which they occurred.
I would like to do this efficiently, as I have a ton of rows. I know apply() can be fast, but I am pretty bad at using it when the row-level operation I am trying to do gets even a little complex.
I appreciate any help, as I am not good at pandas.
EDIT
as per request
Acct, ContactDate, OpenDate, Points (what I need to calculate)
123, 1/1/2018, 1/1/2021, 0 (because correspondence not within 30 days of open)
123, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
123, 12/11/2020, 1/1/2021, 0.2 (other 'touches' get 0.2/(num of touches-2) 'points')
123, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
456, 1/1/2018, 1/1/2021, 0 (again, because correspondence not within 30 days of open)
456, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
The function below returns a reduced dataframe that excludes contacts made more than 30 days before the open date; the result is then merged back into the original df to get all the data in one df. This assumes the dates are already sorted within each account; otherwise, sort them before applying the function.
from datetime import timedelta

df['Points'] = 0.0  # add a float column to the dataframe before analysis
# df.columns
# Index(['Acct', 'ContactDate', 'OpenDate', 'Points'], dtype='object')

def points(x):
    # drop contacts made more than 30 days before the open date;
    # .copy() avoids assigning into a view of the group
    newx = x.loc[(x['OpenDate'] - x['ContactDate']) <= timedelta(days=30)].copy()
    n = len(newx)
    col = newx.columns.get_loc('Points')
    if n > 2:  # more than two dates exist
        newx.iloc[0, col] = .4  # first row
        newx.iloc[-1, col] = .4  # last row
        newx.iloc[1:-1, col] = .2 / (n - 2)  # middle rows share 0.2 equally
    elif n == 2:  # placeholder for later
        pass  # edge-case logic for two occurrences goes here
    elif n == 1:  # placeholder for later
        pass  # edge-case logic for one occurrence goes here
    return newx
# groupby Acct then clean up the indices so it can be merged back into original df
dft = df.groupby('Acct', as_index=False).apply(points).reset_index().set_index('level_1').drop('level_0', axis=1)
# merge on index
df_points = df[['Acct', 'ContactDate', 'OpenDate']].merge(dft['Points'], how='left', left_index=True, right_index=True).fillna(0)
Output:
Acct ContactDate OpenDate Points
0 123 2018-01-01 2021-01-01 0.0
1 123 2020-12-10 2021-01-01 0.4
2 123 2020-12-11 2021-01-01 0.2
3 123 2020-12-12 2021-01-01 0.4
4 456 2018-01-01 2021-01-01 0.0
5 456 2020-12-10 2021-01-01 0.4
6 456 2020-12-11 2021-01-01 0.1
7 456 2020-12-11 2021-01-01 0.1
8 456 2020-12-12 2021-01-01 0.4
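Since the question asks for something that scales to many rows, here is a hedged sketch of the same 40-20-40 rule without apply, using cumcount and transform('size') per account. It rebuilds the sample data from the question. Two things are my own assumptions, not part of the original: contacts after the open date are excluded, and accounts with only one or two touches inside the window simply get 0.4 per touch.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Acct": [123, 123, 123, 123, 456, 456, 456, 456, 456],
    "ContactDate": pd.to_datetime([
        "2018-01-01", "2020-12-10", "2020-12-11", "2020-12-12",
        "2018-01-01", "2020-12-10", "2020-12-11", "2020-12-11", "2020-12-12",
    ]),
    "OpenDate": pd.to_datetime(["2021-01-01"] * 9),
})

# Order matters for first/last touch, so sort within each account
df = df.sort_values(["Acct", "ContactDate"])

# Assumption: only contacts 0 to 30 days before the open date count
lag = df["OpenDate"] - df["ContactDate"]
in_window = (lag >= pd.Timedelta(0)) & (lag <= pd.Timedelta(days=30))
win = df[in_window]

grp = win.groupby("Acct")
pos = grp.cumcount()                      # 0-based position within the account
n = grp["ContactDate"].transform("size")  # touches in the window per account

# Middle touches share 0.2; clip avoids dividing by zero for n <= 2
middle_share = 0.2 / (n - 2).clip(lower=1)
is_edge = (pos == 0) | (pos == n - 1)     # first or last touch

df["Points"] = 0.0
df.loc[win.index, "Points"] = np.where(is_edge, 0.4, middle_share)
print(df)
```

Everything here is a column-wise operation, so no Python function is called per group. If the one-touch or two-touch cases need different weights, only the np.where line has to change.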
Hi, I am using the date difference as a machine learning feature, analyzing how a patient's weight changed over time.
I successfully tested a method to do that, as shown below, but the question is how to extend this to a dataframe where I need the date difference for each patient, as shown in the figure above. The encircled column is what I'm aiming to get. Basically, the baseline date from which the difference is calculated changes for each new patient name, so that we can track the weight progress over time for that patient. Thanks!
s='17/6/2016'
s1='22/6/16'
a=pd.to_datetime(s,infer_datetime_format=True)
b=pd.to_datetime(s1,infer_datetime_format=True)
e=b.date()-a.date()
str(e)
str(e)[0:2]
I think it would be something like this (but I'm not sure how to do this exactly):
def f(row):
    # some logic here
    return val
df['Datediff'] = df.apply(f, axis=1)
You can use transform with first:
df['Datediff'] = df['Date'] - df.groupby('Name')['Date'].transform('first')
Another solution uses cumsum:
df['Datediff'] = df.groupby('Name', group_keys=False)['Date'].apply(lambda x: x.diff().cumsum().fillna(pd.Timedelta(0)))
# days since the previous row of the same patient (not since the first visit)
df["Datediff"] = (df.groupby("Name")["Date"].diff() / np.timedelta64(1, 'D')).fillna(0)
df["Datediff"]
0 0.0
1 12.0
2 14.0
3 66.0
4 23.0
5 0.0
6 10.0
7 15.0
8 14.0
9 0.0
10 14.0
Name: Datediff, dtype: float64
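As a small runnable illustration of the transform('first') idea: the original dataframe was only shown as a figure, so the patient names, dates and weights below are made up. The sketch assumes the rows are sorted by date within each patient, so that 'first' is the baseline visit.

```python
import pandas as pd

# Hypothetical data; the original dataframe is not available
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime([
        "2016-06-17", "2016-06-22", "2016-07-01",
        "2017-01-03", "2017-01-13",
    ]),
    "Weight": [80.0, 79.5, 78.9, 65.2, 66.0],
})

# Make sure the first row of every patient is the earliest visit
df = df.sort_values(["Name", "Date"])

# Broadcast each patient's first date to all of that patient's rows
baseline = df.groupby("Name")["Date"].transform("first")

# Days elapsed since that patient's baseline visit
df["Datediff"] = (df["Date"] - baseline).dt.days
print(df)
```

The baseline resets for every new Name, so each patient starts again at 0.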
I have these 2 dataframes
dffouten:
index opmerking
DateTime
2018-11-01 08:05:41 20 photocells
2018-11-01 11:40:55 42 trap/roodnoodstop
2018-11-02 07:24:02 62 trap/roodnoodstop
and
dffm:
Counter
traploopext 4
What I want to do is calculate a percentage from the number of times trap/roodnoodstop occurs and the traploopext counter value of 4,
so what I did is:
dffouten = dffouten.groupby('opmerking').count()
which gives me
index
opmerking
photocells 1
trap/roodnoodstop 2
and then
percentage = (dffm.loc['traploopext'] / dffouten.loc['trap/roodnoodstop']) * 100
but this doesn't work. The strange thing to me is that if I use:
percentage = (dffm.loc['traploopext'] / 2) * 100
it gives me the answer.
It seems you need to specify the column in loc to get a scalar. To count the values in a column, use value_counts; then select a scalar from each object with loc and divide the two scalars:
dffouten = dffouten['opmerking'].value_counts()
print (dffouten)
trap/roodnoodstop 2
photocells 1
Name: opmerking, dtype: int64
#for Series select by index
print (dffouten.loc['trap/roodnoodstop'])
2
#for DataFrame select by index and column
print (dffm.loc['traploopext', 'Counter'])
4
percentage = (dffm.loc['traploopext', 'Counter'] / dffouten.loc['trap/roodnoodstop']) * 100
print (percentage)
200.0
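To see why the original attempt returns NaN, here is a short sketch that rebuilds both dataframes from the question. Selecting one row with loc gives a Series, and dividing two Series aligns them on their labels ('Counter' versus 'index'), which do not match. The names grouped, misaligned and counts are only for illustration.

```python
import pandas as pd

# Rebuild the two dataframes from the question
dffouten = pd.DataFrame(
    {"index": [20, 42, 62],
     "opmerking": ["photocells", "trap/roodnoodstop", "trap/roodnoodstop"]},
    index=pd.DatetimeIndex(
        ["2018-11-01 08:05:41", "2018-11-01 11:40:55", "2018-11-02 07:24:02"],
        name="DateTime"),
)
dffm = pd.DataFrame({"Counter": [4]}, index=["traploopext"])

# Original attempt: both operands are one-element Series with different
# labels ('Counter' vs 'index'), so the aligned result contains only NaN
grouped = dffouten.groupby("opmerking").count()
misaligned = dffm.loc["traploopext"] / grouped.loc["trap/roodnoodstop"]
print(misaligned)

# Scalar version: pick one value from each object, then divide
counts = dffouten["opmerking"].value_counts()
percentage = dffm.loc["traploopext", "Counter"] / counts.loc["trap/roodnoodstop"] * 100
print(percentage)
```

Dividing by the literal 2 worked in the question because a scalar is applied to every element, so no label alignment is involved.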
I would like to perform the following task. Given two columns (good and bad), I would like to replace rows of the two columns with a running total. Here is an example of the current dataframe along with the desired dataframe.
EDIT: I should have added what my intentions are. I am trying to create an equally binned variable (20 bins in this case) using a continuous variable as the input. I know the pandas cut and qcut functions are available; however, the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or the denominator will break the calculations.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, any time I encounter a zero in either column, I need to keep a running total of the other (non-zero) column up to the next row that has a non-zero value in the column that contained the zero.
Here is the desired output:
dd={'AAA':range(0,16),
'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)
The basic idea of my solution is to create a grouping key from a cumsum over the non-zero values, so that each run of zeros lands in the same group as the next non-zero value. Then you can use groupby + sum to get the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats your edge case of trailing zeros differently from your desired output: it simply cuts it off. You'd have to add some extra code to catch that case with different logic.
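To make the grouping key easier to follow, here is a small sketch on just the first seven rows of the sample data. It is only meant to show how cumsum + shift puts each zero in bad into the same group as the next non-zero bad; it does not handle the trailing-zero case.

```python
import pandas as pd

# First seven rows of the sample data from the question
df = pd.DataFrame({
    "good": [3, 3, 13, 20, 28, 32, 59],
    "bad":  [0, 0, 1, 1, 1, 0, 6],
})

# The counter goes up at every non-zero 'bad'; shifting it down one row
# makes that non-zero row close the group that the zeros before it opened
key = (df["bad"] != 0).cumsum().shift(1).fillna(0)
print(key.tolist())

# Summing 'good' per group gives the running totals over each zero run
good_totals = df.groupby(key)["good"].sum()
print(good_totals.tolist())
```

Rows 0 to 2 end up in one group (two zeros plus the first non-zero bad), and rows 5 and 6 in another, which is where the 19 and 91 in the output above come from.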
P.Tillmann, I appreciate your assistance with this. I assume more advanced readers will find this code appalling, as I do. I would be more than happy to take any recommendation that makes it more streamlined.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
rows=[]  # collected output rows (DataFrame.append was removed in pandas 2.0)
for index,row in df.iterrows():
    if row['good']==0 or row['bad']==0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind='1'
        row_out='NO'
    elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
        row_bad=row['bad']
        row_good=row['good']
        output_ind='2'
        row_out='NO'
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='3'
    else:
        row_bad=row['bad']
        row_good=row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='4'
    if ((row['good']==0 or row['bad']==0)
            and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
            and row_good != 0 and row_bad != 0):
        row_out='YES'
    if row_out=='YES':
        temp_dict={'AAA':row['AAA'],
                   'good':row_good,
                   'bad':row_bad}
        rows.append(temp_dict)
    # debug trace for every input row
    print(str(row['AAA']),'-',
          str(row['good']),'-',
          str(row['bad']),'-',
          str(row_good),'-',
          str(row_bad),'-',
          str(row_good_zero_count),'-',
          str(row_bad_zero_count),'-',
          row_out,'-',
          output_ind)
crappy_fix=pd.DataFrame(rows)
print(crappy_fix)