applying different functions over a certain column in pandas [duplicate]

This question already has answers here:
How do I assign values based on multiple conditions for existing columns?
(7 answers)
Closed 5 months ago.
I would like to use apply with a lambda in Python to change the prices in a column named Price. If the price is below 20, it should stay the same. If it is between 20 and 30, I would like to add 1. If it is between 30 and 40, I would like to add 1.50, and so on. I am trying to find a way to apply these rules to the column and then write the result back to Excel in order to update the pricing, but I am confused about how to do so. I have tried putting the operation in a function with if clauses, but it does not produce the results I need (k is the name of the dataframe):
def addition():
    if k[k['Price'] < 20]:
        pass
    if k[(k['Price'] > 20) & (k['Price'] < 30)]:
        return k + 1
    if k[(k['Price'] > 30.01) & (k['Price'] < 40)]:
        return k + 1.50
and so on. However, when I then try to write out what I thought was the newly updated k['Price'] column to an .xlsx file, it does not even show up. I have also tried making the xlsx variable global, but still no luck. I think a lambda would be simpler, but I am having trouble working out how to split and update the prices in that column based on the conditions. Any help would be much appreciated.
This is the dataframe that I am trying to perform the different functions on:
0 23.198824
1 21.080706
2 15.810118
3 21.787059
4 18.821882
...
33525 20.347059
33526 25.665882
33527 33.077647
33528 21.803529
33529 23.043529
Name: Price, Length: 33530, dtype: float64

If k is the dataframe, then k + 1 won't do what you want: it operates on the whole frame rather than on the selected prices, and it fails if any column is non-numeric. Instead, write a function that updates a single price and apply it to the column:
def update_price(price):
    if 20 < price < 30:
        price += 1
    elif 30 < price < 40:
        price += 1.5
    return price

df['Updated_Price'] = df['Price'].apply(update_price)
In [39]: df
Out[39]:
Name Price
0 a 15
1 b 23
2 c 37
In [43]: df
Out[43]:
Name Price Updated_Price
0 a 15 15.0
1 b 23 24.0
2 c 37 38.5

You can use the apply method with a lambda that contains nested conditional expressions. Note that this version adds 1.5 to every price of 30 or more:
import pandas as pd
df = pd.DataFrame({
'Price': [10.0, 23.0, 50.0, 32.0, 12.0, 50.0]
})
df = df['Price'].apply(lambda x: x if x < 20.0 else (x + 1.0 if 30.0 > x > 20.0 else x + 1.5))
print(df)
Output:
0 10.0
1 24.0
2 51.5
3 33.5
4 12.0
5 51.5
Name: Price, dtype: float64
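Both answers above call a Python function once per row. For a column of the size shown in the question (33,530 rows), a vectorized version is a common alternative. The following is a minimal sketch using numpy.select; the sample prices are made up, and the boundary handling (20 and 30 fall into the upper tier, prices of 40 and above are left unchanged) is an assumption, since the question does not specify it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real frame in the question is called k.
df = pd.DataFrame({"Price": [15.0, 23.0, 37.0, 20.0, 45.0]})

# One boolean mask per price tier (boundaries are an assumption).
conditions = [
    df["Price"] < 20,
    (df["Price"] >= 20) & (df["Price"] < 30),
    (df["Price"] >= 30) & (df["Price"] < 40),
]
# Amount to add in each tier, in the same order as the masks.
increments = [0.0, 1.0, 1.5]

# Rows that match no mask (here: 40 and above) get the default increment of 0.
df["Updated_Price"] = df["Price"] + np.select(conditions, increments, default=0.0)
print(df)
```

Further tiers can be added by appending a mask and an increment to the two lists. The updated frame can then be written out with df.to_excel, which needs an Excel writer engine such as openpyxl to be installed.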

Related

pandas fillna sequentially step by step

I have a dataframe like the one below
Re_MC,Fi_MC,Fin_id,Res_id
1,2,3,4
,7,6,11
11,,31,32
,,35,38
df1 = pd.read_clipboard(sep=',')
I would like to fillna based on two steps
a) First, compare only Re_MC and Fi_MC. If a value is missing in either of these columns, copy it from the other column.
b) Despite doing step a, if there is still NA for either Re_MC or Fi_MC, copy values from Fin_id for Fi_MC and Res_id for Re_MC.
So, I tried the below two approaches
Approach 1 - This works but is not efficient or elegant
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Fi_MC'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Re_MC'])
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Res_id'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Fin_id'])
Approach 2 - This doesn't work and produces incorrect output
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Fi_MC']).fillna(df1['Res_id'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Re_MC']).fillna(df1['Fin_id'])
Is there a more efficient way to fillna sequentially? That is, do step a first and then, based on the result of step a, do step b.
I expect my output to be as shown below
updated code
df_new = (df_new
    .fillna({'Re MC': df_new['Re Cust'], 'Re MC': df_new['Re Cust_System']})
    .fillna({'Fi MC': df_new['Fi.Fi Customer'], 'Final MC': df_new['Re.Fi Customer']})
    .fillna({'Fi MC': df_new['Re MC']})
    .fillna({'Class Fi MC': df_new['Re MC']})
)

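You can pass a dictionary to fillna that maps each column to the Series that should fill it. Within one call, both columns are filled from the original values, so step a is complete before step b starts:
(df1
    .fillna({'Re_MC': df1['Fi_MC'], 'Fi_MC': df1['Re_MC']})
    .fillna({'Re_MC': df1['Res_id'], 'Fi_MC': df1['Fin_id']})
)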
output:
Re_MC Fi_MC Fin_id Res_id
0 1.0 2.0 3 4
1 7.0 7.0 6 11
2 11.0 11.0 31 32
3 38.0 35.0 35 38
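To make the two steps explicit, here is a small self-contained sketch that rebuilds the sample frame from the question and keeps the results of step a in separate variables before step b. The names re_a, fi_a, out, re2 and fi2 are illustrative only. It also replays Approach 2 to show where it diverges: Re_MC has already been filled from Res_id by the time Fi_MC reads it, so the last row's Fi_MC becomes 38 instead of 35.

```python
import io

import pandas as pd

csv_text = """Re_MC,Fi_MC,Fin_id,Res_id
1,2,3,4
,7,6,11
11,,31,32
,,35,38
"""
df1 = pd.read_csv(io.StringIO(csv_text))

# Step a: fill each column from the other, both based on the original values.
re_a = df1["Re_MC"].fillna(df1["Fi_MC"])
fi_a = df1["Fi_MC"].fillna(df1["Re_MC"])

# Step b: whatever is still missing comes from the id columns.
out = df1.assign(
    Re_MC=re_a.fillna(df1["Res_id"]),
    Fi_MC=fi_a.fillna(df1["Fin_id"]),
)
print(out)

# Approach 2 from the question, for comparison.
re2 = df1["Re_MC"].fillna(df1["Fi_MC"]).fillna(df1["Res_id"])
fi2 = df1["Fi_MC"].fillna(re2).fillna(df1["Fin_id"])
print(fi2.tolist())
```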

Apply multiple condition groupby + sort + sum to pandas dataframe rows

I have a dataframe with the following columns:
Acct Num, Correspondence Date, Open Date
For each opened account, I have been asked to look back at all the correspondence that happened within 30 days of that account's open date and to assign points to each correspondence as follows:
Forty-twenty-forty: attribute 40% (0.4 points) to the first touch, 40% to the last touch, and divide the remaining 20% evenly among all touches in between.
I know the apply and groupby functions, but this is beyond my pay grade. I have to group by account, filter on a comparison of two columns against each other, and count the correspondences per account. I assume they also have to be sorted, because assigning the points depends on the order in which they occurred.
I would like to do this efficiently, as I have a ton of rows. I know apply() can be fast, but I am pretty bad at using it when the row-level operation gets even a little complex.
I appreciate any help, as I am not good at pandas.
EDIT
as per request
Acct, ContactDate, OpenDate, Points (what I need to calculate)
123, 1/1/2018, 1/1/2021, 0 (because correspondence not within 30 days of open)
123, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
123, 12/11/2020, 1/1/2021, 0.2 (other 'touches' get 0.2/(num of touches-2) 'points')
123, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
456, 1/1/2018, 1/1/2021, 0 (again, because correspondence not within 30 days of open)
456, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)

This returns a reduced dataframe that excludes contacts more than 30 days before the open date, then merges it back into the original df so that all the data ends up in one dataframe. It assumes the rows are already sorted by date within each account and that ContactDate and OpenDate are datetime columns; otherwise, sort and convert them before applying the function below.
from datetime import timedelta

df['Points'] = 0.0  # add the column before the analysis (float, so 0.4 etc. fit)
# df.columns
# Index(['Acct', 'ContactDate', 'OpenDate', 'Points'], dtype='object')

def points(x):
    # keep only contacts no more than 30 days before the open date;
    # .copy() avoids assigning into a view of the group
    newx = x.loc[(x['OpenDate'] - x['ContactDate']) <= timedelta(days=30)].copy()
    n = len(newx)
    if n > 2:  # more than two contacts exist
        col = newx.columns.get_loc('Points')
        newx.iloc[0, col] = 0.4              # first row
        newx.iloc[-1, col] = 0.4             # last row
        newx.iloc[1:-1, col] = 0.2 / (n - 2) # middle rows share the remaining 0.2
    # n == 2 and n == 1: placeholders, edge-case logic still to be added (Points stay 0)
    return newx

# group by Acct; group_keys=False keeps the original row index so the result can be merged back
dft = df.groupby('Acct', group_keys=False).apply(points)
# merge on index
df_points = df[['Acct', 'ContactDate', 'OpenDate']].merge(dft['Points'], how='left', left_index=True, right_index=True).fillna(0)
Output:
Acct ContactDate OpenDate Points
0 123 2018-01-01 2021-01-01 0.0
1 123 2020-12-10 2021-01-01 0.4
2 123 2020-12-11 2021-01-01 0.2
3 123 2020-12-12 2021-01-01 0.4
4 456 2018-01-01 2021-01-01 0.0
5 456 2020-12-10 2021-01-01 0.4
6 456 2020-12-11 2021-01-01 0.1
7 456 2020-12-11 2021-01-01 0.1
8 456 2020-12-12 2021-01-01 0.4
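Since the question mentions a large number of rows, the same forty-twenty-forty split can also be sketched without groupby.apply, using per-group positions and counts. This is only an illustration under stated assumptions: "within 30 days" is taken to mean 0 to 30 days before the open date, accounts with one or two in-window touches are left at 0 (mirroring the placeholders in the answer above), and the helper names lead, in_window, win, pos and n are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Acct": [123, 123, 123, 123, 456, 456, 456, 456, 456],
    "ContactDate": pd.to_datetime([
        "2018-01-01", "2020-12-10", "2020-12-11", "2020-12-12",
        "2018-01-01", "2020-12-10", "2020-12-11", "2020-12-11", "2020-12-12",
    ]),
})
df["OpenDate"] = pd.Timestamp("2021-01-01")

# Order the touches chronologically within each account.
df = df.sort_values(["Acct", "ContactDate"], kind="stable")

# Assumption: only contacts 0 to 30 days before the open date count.
lead = df["OpenDate"] - df["ContactDate"]
in_window = (lead >= pd.Timedelta(0)) & (lead <= pd.Timedelta(days=30))
win = df[in_window]

pos = win.groupby("Acct").cumcount()             # 0-based position within the account
n = win["Acct"].map(win["Acct"].value_counts())  # in-window touches per account

is_edge = (pos == 0) | (pos == n - 1)            # first or last touch
middle_share = 0.2 / (n - 2).clip(lower=1)       # clip avoids dividing by zero
points = np.where(n > 2, np.where(is_edge, 0.4, middle_share), 0.0)

df["Points"] = 0.0
df.loc[win.index, "Points"] = points
print(df)
```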

Calculating date difference for pandas dataframe rows with changing baseline dates

Hi, I am using the date difference as a machine learning feature to analyze how the weight of a patient changes over time.
I successfully tested a method for a single pair of dates, shown below. The question is how to extend it to a dataframe where I need the date difference for each patient, as shown in the figure above. The encircled column is what I am aiming to get. The baseline date from which the difference is calculated changes with every new patient name, so that the weight progress can be tracked over time for that patient. Thanks!
s='17/6/2016'
s1='22/6/16'
a=pd.to_datetime(s,infer_datetime_format=True)
b=pd.to_datetime(s1,infer_datetime_format=True)
e=b.date()-a.date()
str(e)
str(e)[0:2]
I think it would be something like this (but I'm not sure how to do it exactly):
def f(row):
    # some logic here
    return val
df['Datediff'] = df.apply(f, axis=1)

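You can use transform with first to broadcast each patient's first date to all of that patient's rows and then subtract it (both solutions assume the rows are sorted by date within each Name):
df['Datediff'] = df['Date'] - df.groupby('Name')['Date'].transform('first')
Another solution is to take the cumulative sum of the differences between consecutive dates:
df['Datediff'] = df.groupby('Name', group_keys=False)['Date'].apply(lambda x: x.diff().cumsum().fillna(pd.Timedelta(0)))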
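As a quick check of the transform approach, here is a minimal sketch on made-up data (the names and dates below are hypothetical, not the asker's). It converts the result to whole days with .dt.days, which is usually more convenient as a model feature than a timedelta.

```python
import pandas as pd

# Hypothetical sample: two patients with a few visits each.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime([
        "2016-06-17", "2016-06-22", "2016-07-01",
        "2016-03-01", "2016-03-15",
    ]),
})

# Make sure the first row of each patient really is the earliest date.
df = df.sort_values(["Name", "Date"])

# Broadcast each patient's baseline date to all of that patient's rows.
baseline = df.groupby("Name")["Date"].transform("first")

# Days elapsed since the baseline, as integers.
df["Datediff"] = (df["Date"] - baseline).dt.days
print(df)
```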

# days since the previous row of the same Name (not since the first date)
df["Datediff"] = (df.groupby("Name")["Date"].diff() / np.timedelta64(1, 'D')).fillna(0)
df["Datediff"]
0 0.0
1 12.0
2 14.0
3 66.0
4 23.0
5 0.0
6 10.0
7 15.0
8 14.0
9 0.0
10 14.0
Name: Datediff, dtype: float64

Using Cell Value from 2 different dataframes to do calculations (Pandas)

I have these 2 dataframes
dffouten:
index opmerking
DateTime
2018-11-01 08:05:41 20 photocells
2018-11-01 11:40:55 42 trap/roodnoodstop
2018-11-02 07:24:02 62 trap/roodnoodstop
and
dffm:
Counter
traploopext 4
What I want to do is divide the number of times trap/roodnoodstop occurs by the traploopext Counter value of 4,
so what I did is:
dffouten = dffouten.groupby('opmerking').count()
which gives me
index
opmerking
photocells 1
trap/roodnoodstop 2
and then
percentage = (dffm.loc['rectloopext'] / dffouten.loc['trap/roodnoodstop']) * 100
but this doesn't work. The strange thing to me is that if I use:
percentage = (dffm.loc['rectloopext'] / 2) * 100
it gives me the answer.

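To get a scalar from loc, you need to specify the column as well as the row label; with only a row label, loc returns a Series, and dividing two Series aligns them on their labels instead of dividing the values. To count the values in a column, use value_counts, select the scalars with loc, and then divide the two scalars: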
dffouten = dffouten['opmerking'].value_counts()
print (dffouten)
trap/roodnoodstop 2
photocells 1
Name: opmerking, dtype: int64
#for Series select by index
print (dffouten.loc['trap/roodnoodstop'])
2
#for DataFrame select by index and column
print (dffm.loc['traploopext', 'Counter'])
4
percentage = (dffm.loc['traploopext', 'Counter'] / dffouten.loc['trap/roodnoodstop']) * 100
print (percentage)
200.0
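The following self-contained sketch rebuilds the two frames from the question to show why the original division fails and how selecting scalars avoids it. The variable names counts, misaligned, n_stops and counter are illustrative.

```python
import pandas as pd

dffouten = pd.DataFrame(
    {
        "index": [20, 42, 62],
        "opmerking": ["photocells", "trap/roodnoodstop", "trap/roodnoodstop"],
    },
    index=pd.to_datetime(
        ["2018-11-01 08:05:41", "2018-11-01 11:40:55", "2018-11-02 07:24:02"]
    ),
)
dffm = pd.DataFrame({"Counter": [4]}, index=["traploopext"])

counts = dffouten.groupby("opmerking").count()

# Row-only loc returns a Series on each side: one labelled 'Counter', the
# other labelled 'index'. The labels do not match, so the result is all NaN.
misaligned = dffm.loc["traploopext"] / counts.loc["trap/roodnoodstop"]
print(misaligned)

# Selecting row and column yields plain scalars, which divide as expected.
n_stops = counts.loc["trap/roodnoodstop", "index"]
counter = dffm.loc["traploopext", "Counter"]
percentage = counter / n_stops * 100
print(percentage)
```

Dividing by the literal 2 worked in the question because a Series divided by a scalar needs no label alignment.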

Python Pandas Running Totals with Resets

I would like to perform the following task. Given two columns (good and bad), I would like to replace rows of the two columns with a running total. Here is an example of the current dataframe along with the desired dataframe.
EDIT: I should have added what my intentions are. I am trying to create an equally binned variable (in this case 20 bins) using a continuous variable as the input. I know the pandas cut and qcut functions are available, but the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or the denominator will not allow the mathematical calculations to work.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, any time I encounter a zero in either column, I need to keep a running total of the other (non-zero) column up to the next row that has a non-zero value in the column that contained the zero.
Here is the desired output:
dd={'AAA':range(0,16),
'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)

The basic idea of my solution is to build a grouping column from a cumsum over the non-zero values, so that each run of zeros lands in the same group as the next non-zero value. Then you can use groupby + sum to get the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
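This code treats your edge case of trailing zeros differently from your desired output: it simply cuts it off. It also merges the zero in row 5 forward into the next row (good = 91), whereas your desired output merges it backward (good = 60). You'd have to add some extra code to catch those cases with a different logic.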
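For comparison, here is a different sketch that does reproduce the desired output from the question. It is not the method above: it assumes a rule inferred from that desired output, namely that a new bin starts at every row where both good and bad are non-zero, that rows containing a zero are merged into the preceding bin, and that leading zero rows are merged into the first bin. The names complete, group_id and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "AAA": range(20),
    "good": [3, 3, 13, 20, 28, 32, 59, 72, 64, 52, 38, 24, 17, 19, 12, 5, 7, 6, 2, 0],
    "bad": [0, 0, 1, 1, 1, 0, 6, 8, 10, 6, 6, 10, 5, 8, 2, 2, 1, 3, 1, 1],
})

# A row is "complete" when neither count is zero; each complete row opens a bin.
complete = (df["good"] != 0) & (df["bad"] != 0)

# Rows with a zero inherit the id of the previous complete row.
# Leading zero rows would get id 0, so clip them into the first bin.
group_id = complete.cumsum().clip(lower=1)

result = df.groupby(group_id)[["good", "bad"]].sum().reset_index(drop=True)
print(result)
```

Because every bin contains exactly one complete row, neither sum can be zero as long as the data has at least one row where both counts are non-zero.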

P.Tillmann, I appreciate your assistance with this. I assume more advanced readers will find the code below appalling, as I do. I would be more than happy to take any recommendation that makes it more streamlined.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
rows=[]  # collected output rows (DataFrame.append was removed in pandas 2.0)
for index,row in df.iterrows():
    if row['good']==0 or row['bad']==0:
        # zero in either column: keep accumulating
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind='1'
        row_out='NO'
    elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
        # next row contains a zero: start a new running total here
        row_bad=row['bad']
        row_good=row['good']
        output_ind='2'
        row_out='NO'
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
        # first complete row after several zero rows: close the running total
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='3'
    else:
        # ordinary row: output as is
        row_bad=row['bad']
        row_good=row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='4'
    if ((row['good']==0 or row['bad']==0)
            and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
            and row_good != 0 and row_bad != 0):
        row_out='YES'
    if row_out=='YES':
        temp_dict={'AAA':row['AAA'],
                   'good':row_good,
                   'bad':row_bad}
        rows.append(temp_dict)
    # debug trace for every input row
    print(str(row['AAA']),'-',
          str(row['good']),'-',
          str(row['bad']),'-',
          str(row_good),'-',
          str(row_bad),'-',
          str(row_good_zero_count),'-',
          str(row_bad_zero_count),'-',
          row_out,'-',
          output_ind)
crappy_fix=pd.DataFrame(rows)
print(crappy_fix)
