Python Pandas: use row number in apply

I am only beginning with Pandas and I am stuck with the following problem:
I want to use the row number in df.apply() so that it calculates (1+0.05)^(row_number), e.g.
(1+0.05)^0 in the first row, (1+0.05)^1 in the second, (1+0.05)^2 in the third, and so on.
I tried the following but get AttributeError: 'int' object has no attribute 'name'
import pandas as pd
considered_period_years = 60
start_year = 2019
TDE = 0.02
year = list(range(start_year,start_year+considered_period_years))
df = pd.DataFrame(year, columns = ['Year'])
df.insert(0, 'Year Number', range(0,60), allow_duplicates = False)
df.insert(2, 'Investition', 0, allow_duplicates = False)
df['Investition2'] = df['Investition'].apply(lambda x: x*(1+TDE)**x.name)
Any ideas ?
Regards Johann

Welcome to pandas. Familiarize yourself with vectorized functions. The basic idea behind vectorized functions is that you apply the operation to every element in an array without an explicit loop. For example:
x + 1
means "add 1 to element in x".
Similarly:
x * y
means "multiply every element in x by every element in y, pair-wise".
Deep down, vectorized functions are implemented using highly-optimized C loops so they are both fast and convenient.
In your case:
df['Investition2'] = (1+TDE)**df.index
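For reference, here is a minimal runnable sketch of how that one line slots into the setup from the question (variable and column names are taken from the question; the exact frame construction is only illustrative):
import pandas as pd

TDE = 0.02
considered_period_years = 60
start_year = 2019

df = pd.DataFrame({'Year': range(start_year, start_year + considered_period_years)})
df.insert(0, 'Year Number', range(considered_period_years), allow_duplicates=False)
df['Investition'] = 0

# vectorized: raise (1 + TDE) to the power of every row's position, all at once
df['Investition2'] = (1 + TDE) ** df.index

print(df.head(3))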

You can create a custom function (foo in our case) and apply it row-wise with axis=1, so that row.name gives the row's index label. In your code, df['Investition'].apply(...) hands apply a Series, and Series.apply iterates over its plain values (ints here), which is why x.name raises the AttributeError.
import pandas as pd
considered_period_years = 60
start_year = 2019
TDE = 0.02
year = list(range(start_year,start_year+considered_period_years))
df = pd.DataFrame(year, columns = ['Year'])
df.insert(0, 'Year Number', range(0,60), allow_duplicates = False)
df.insert(2, 'Investition', 0, allow_duplicates = False)
def foo(row):
    return row['Investition'] * (1 + TDE) ** row.name

df['Investition2'] = df.apply(foo, axis=1)
Another alternative is to use itertuples or iterrows.
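For completeness, a quick sketch of the itertuples route mentioned above (still a Python-level loop, so slower than the vectorized answer, but sometimes easier to read; it assumes the df built in the question):
values = []
for row in df.itertuples():
    # row.Index is the row's index label, row.Investition the column value
    values.append(row.Investition * (1 + TDE) ** row.Index)
df['Investition2'] = values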

Related

I want to speed up a for loop in Python and use less memory

Hi, below is the function I am using to calculate the quantiles (25/50/75) and mean for each column.
def newsummary(final_per, grp_lvl, col):
    new_col1 = '_'.join([j] + grp_lvl + ['25%'])
    new_col2 = '_'.join([j] + grp_lvl + ['50%'])
    new_col3 = '_'.join([j] + grp_lvl + ['75%'])
    new_col4 = '_'.join([j] + grp_lvl + ['mean'])
    final_per1 = pd.DataFrame()
    final_per1 = final_per.groupby(grp_lvl)[j].quantile(0.25).reset_index()
    final_per1.rename(columns={j: new_col1}, inplace=True)
    final_per2[new_col1] = final_per1[new_col1].copy()
    final_per1 = final_per.groupby(grp_lvl)[j].quantile(0.5).reset_index()
    final_per1.rename(columns={j: new_col2}, inplace=True)
    final_per2[new_col2] = final_per1[new_col2].copy()
    final_per1 = final_per.groupby(grp_lvl)[j].quantile(0.75).reset_index()
    final_per1.rename(columns={j: new_col3}, inplace=True)
    final_per2[new_col3] = final_per1[new_col3].copy()
    final_per1 = final_per.groupby(grp_lvl)[j].mean().reset_index()
    final_per1.rename(columns={j: new_col4}, inplace=True)
    final_per2[new_col4] = final_per1[new_col4].copy()
    return final_per2
Calling the function:
grp_lvl = ['ZIP_CODE', 'year']
for j in list_col:  # approximately 1400 columns to iterate
    per = newsummary(final_per, grp_lvl, j)
I want to find the quantiles (25/50/75) and mean for each column and retain those columns in a new dataframe. I have to do this for around 1400 columns.
The DataFrames are pandas DataFrames.
While executing this loop the .copy() command is causing performance issues. Are there any alternative ways to reduce the performance issues and not run out of memory?
Your help and suggestions are appreciated.
Note: I am using an Azure Databricks cluster to execute this.
Since you are creating 4 new columns for each column, i.e., the 0.25, 0.5 and 0.75 quantiles and the mean of the grouped data, using a pandas dataframe as you are doing might be the better choice.
PySpark grouped data requires an aggregate function, and there is no built-in aggregate function to calculate quantiles.
There is no need to use copy or to return any value from the function. So, modify your code to the below code:
import pandas as pd
final_per2 = pd.DataFrame()
def newsummary(final_per, grp_lvl, col):
    new_col1 = '_'.join([j] + grp_lvl + ['25%'])
    new_col2 = '_'.join([j] + grp_lvl + ['50%'])
    new_col3 = '_'.join([j] + grp_lvl + ['75%'])
    new_col4 = '_'.join([j] + grp_lvl + ['mean'])
    final_per1 = pd.DataFrame()
    final_per1 = final_per.groupby(grp_lvl)[j].quantile(0.25).reset_index()
    final_per1.rename(columns={j: new_col1}, inplace=True)
    final_per2[new_col1] = final_per1[new_col1]
    final_per1 = final_per.groupby(grp_lvl)[j].quantile(0.5).reset_index()
    final_per1.rename(columns={j: new_col2}, inplace=True)
    final_per2[new_col2] = final_per1[new_col2]
    final_per1 = final_per.groupby(grp_lvl)[j].quantile(0.75).reset_index()
    final_per1.rename(columns={j: new_col3}, inplace=True)
    final_per2[new_col3] = final_per1[new_col3]
    final_per1 = final_per.groupby(grp_lvl)[j].mean().reset_index()
    final_per1.rename(columns={j: new_col4}, inplace=True)
    final_per2[new_col4] = final_per1[new_col4]

for j in list_col:  # approximately 1400 columns to iterate
    newsummary(final_per, grp_lvl, j)
final_per2
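If memory stays tight, another option (my own sketch, not tested on the real data) is to compute each column's four statistics in a single groupby pass and concatenate everything once at the end, avoiding the repeated column-by-column copies. Names (final_per, grp_lvl, list_col) follow the question:
import pandas as pd

def summarize(final_per, grp_lvl, col):
    g = final_per.groupby(grp_lvl)[col]
    out = g.quantile([0.25, 0.5, 0.75]).unstack()   # one column per quantile
    out['mean'] = g.mean()
    out.columns = ['_'.join([col] + grp_lvl + [suffix])
                   for suffix in ['25%', '50%', '75%', 'mean']]
    return out

# one summary frame per column, concatenated side by side in a single step
final_per2 = pd.concat([summarize(final_per, grp_lvl, c) for c in list_col], axis=1)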

How to improve performance on average calculations in python dataframe

I am trying to improve the performance of a current piece of code, whereby I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate = p.RefDate) & (r.Item = p.Item) & (p.StartDate >= r.StartDate) & (p.EndDate <= r.EndDate)
Dummy data for this can be generated as per the below;
import pandas as pd
import numpy as np
from datetime import datetime
######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01','2019-12-31')
p = pd.DataFrame(columns=['RefDate','Item','StartDate','EndDate','Val'])
for item in ['A', 'B', 'C', 'D']:
    for date in daily_range:
        daily_p = pd.DataFrame({'RefDate': rng,
                                'Item': item,
                                'StartDate': date,
                                'EndDate': date,
                                'Val': np.random.randint(0, 100, len(rng))})
        p = p.append(daily_p)
r = pd.DataFrame(columns=['RefDate','Item','PeriodStartDate','PeriodEndDate','AvgVal'])
for item in ['A', 'B', 'C', 'D']:
    r1 = pd.DataFrame({'RefDate': rng,
                       'Item': item,
                       'PeriodStartDate': '2019-10-25',
                       'PeriodEndDate': '2019-10-31',  # datetime(2019,10,31)
                       'AvgVal': 0})
    r = r.append(r1)
r.reset_index(drop=True,inplace=True)
######### END CREATION OF DUMMY DATA ##########
The piece of code I currently have, and would like to improve the performance of, is as follows:
for i in r.index:
    avg_price = p['Val'].loc[((p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                              (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                              (p['RefDate'] == r.loc[i]['RefDate']) &
                              (p['Item'] == r.loc[i]['Item']))].mean()
    r['AvgVal'].loc[i] = avg_price
The first change is that when generating the r DataFrame, both PeriodStartDate and
PeriodEndDate are created as datetime; see the following fragment of your
initialization code, changed by me:
r1 = pd.DataFrame({'RefDate': rng, 'Item':item,
'PeriodStartDate': pd.to_datetime('2019-10-25'),
'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I set the index in both DataFrames to RefDate and Item
(both columns compared on equality) and sorted by the index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, the access by index is significantly quicker.
Then I defined the following function computing the mean for rows
from p "related to" the current row from r:
def myMean(row):
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
              pp.EndDate.le(row.PeriodEndDate)].Val.mean()
And the only thing to do is to apply this function (to each row in r) and
save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine
and got a result almost 10 times shorter.
Check it on your own.
By using iterrows I managed to improve the performance, although there may still be quicker ways.
for index, row in r.iterrows():
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) &
                              (p['EndDate'] <= row.PeriodEndDate) &
                              (p['RefDate'] == row.RefDate) &
                              (p['Item'] == row.Item))].mean()
    r.loc[index, 'AvgVal'] = avg_price
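A further option (my own sketch, not from either answer above, and not benchmarked) is to avoid per-row lookups entirely: merge r and p on the equality keys, filter on the date window, and take one grouped mean. This trades memory for speed, since the merged frame holds one row per matching (r, p) pair; it assumes RefDate and Item are still ordinary columns and the date columns are real datetimes:
# cartesian join within each (RefDate, Item) group
merged = r.merge(p, on=['RefDate', 'Item'], how='left')

# keep only the p rows whose [StartDate, EndDate] falls inside r's period
in_window = ((merged['StartDate'] >= merged['PeriodStartDate']) &
             (merged['EndDate'] <= merged['PeriodEndDate']))

avg = (merged[in_window]
       .groupby(['RefDate', 'Item'])['Val']
       .mean()
       .rename('AvgVal'))

# attach the averages back onto r, replacing the placeholder column
r = r.drop(columns='AvgVal').merge(avg.reset_index(), on=['RefDate', 'Item'], how='left')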

numpy under a groupby not working

I have the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/test.csv')
df.drop(['SecurityID'],1,inplace=True)
Time = 1
trade_filter_size = 9
groupbytime = (str(Time) + "min")
df['dateTime_s'] = df['dateTime'].astype('datetime64[s]')
df['dateTime'] = pd.to_datetime(df['dateTime'])
df[str(Time)+"min"] = df['dateTime'].dt.floor(str(Time)+"min")
df['tradeBid'] = np.where(((df['tradePrice'] <= df['bid1']) & (df['isTrade']==1)), df['tradeVolume'], 0)
groups = df[df['isTrade'] == 1].groupby(groupbytime)
print("groups",groups.dtypes)
#THIS IS WORKING
df_grouped = (groups.agg({
    'tradeBid': [('sum', np.sum),
                 ('downticks_number', lambda x: (x > 0).sum())],
}))
# creating a new data frame which is filttered
df2 = pd.DataFrame( df.loc[(df['isTrade'] == 1) & (df['tradeVolume']>=trade_filter_size)])
#recalculating all the bid/ask volume to be bsaed on the filter size
df2['tradeBid'] = np.where(((df2['tradePrice'] <= df2['bid1']) & (df2['isTrade']==1)), df2['tradeVolume'], 0)
df2grouped = (df2.agg({
    # here is the problem!!! NOT WORKING
    'tradeBid': [('sum', np.sum), lambda x: (x > 0).sum()],
}))
The same aggregation, 'tradeBid': [('sum', np.sum), ('downticks_number', lambda x: (x > 0).sum())], is used in both places. The first time it works fine, but when it runs on the filtered data in the new df it raises an error:
ValueError: downticks_number is an unknown string function
When I use this code instead to solve the above:
'tradeBid': [('sum', np.sum), lambda x: (x > 0).sum()],
I get this error:
ValueError: cannot combine transform and aggregation operations
Any idea why I get different results for the same usage of code?
Since there were 2 conditions to match for the 2nd groupby, I solved this by moving the filter into the df: I created a new column that combines both conditions and used it as the single filter.
After that there was no problem with the groupby.
The order of operations was the problem.
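For reference, a small sketch of what that fix could look like (my own reconstruction of the answer, reusing the names from the question):
# one boolean column that encodes both conditions, used as the single filter
df['keep_row'] = (df['isTrade'] == 1) & (df['tradeVolume'] >= trade_filter_size)

df2_grouped = (df[df['keep_row']]
               .groupby(groupbytime)
               .agg({'tradeBid': [('sum', np.sum),
                                  ('downticks_number', lambda x: (x > 0).sum())]}))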

Pandas Columns Operations with List

I have a pandas dataframe with two columns, the first with just a single date ('action_date') and the second with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill new columns in the df with the number of dates in verification_date that differ by over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df
make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation accordingly; you can do this by replacing the above lines with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
What it does is set a value at row i and column over_360 or under_360.
You can learn more about it in the pandas documentation.
If you don't like using set_value you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
You can check DataFrame.ix in the pandas documentation.
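Note that in current pandas versions both set_value and .ix have been removed; the same per-cell assignment can be written with .at (or .loc):
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)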
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.

Split and Join Series in Pandas

I have two series in the dataframe below. The first is a string which will appear in the second, which will be a url string. What I want to do is change the first series by concatenating on extra characters, and have that change applied onto the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
    splitter = dest_col.str.split(str(source_col))
    print(type(splitter))
    print(splitter)
    res = 'angry_' + str(source_col).join(splitter)
    return res
df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from source_col, then split on that string in dest_col, and then apply that change to the string in dest_col. Here I have it as a new series called Final, but I would rather do it in place. I think the main issues are the splitter variable, which isn't working, and the application of the function.
Here's how the result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html
apply isn't really designed to apply to multiple columns in the same row. What you can do is to change your function so that it takes in a series instead and then assigns source_col, dest_col to the appropriate value in the series. One way of doing it is as below:
def trial(x):
    source_col = x['OrigWord']
    dest_col = x['WordinUrl']
    splitter = str(dest_col).split(str(source_col))
    res = splitter[0] + 'angry_' + source_col + splitter[1]
    return res

df['Final'] = df.apply(trial, axis=1)
here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend the angry_ to the corresponding source:
def trial(row):
    row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
    row.OrigWord = "angry_" + row.OrigWord
    return row

df.apply(trial, axis=1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html
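If you would rather avoid apply here altogether, a small column-wise sketch of the same idea (assuming each OrigWord appears in its WordinUrl):
# rebuild WordinUrl with a plain list comprehension, then prefix OrigWord
df['WordinUrl'] = [url.replace(word, 'angry_' + word)
                   for word, url in zip(df['OrigWord'], df['WordinUrl'])]
df['OrigWord'] = 'angry_' + df['OrigWord']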
