Pandas apply function to column - python

I am having some issues applying several functions to my dataframe.
I have created sample code to illustrate what I am trying to do. There might be a better way to implement this specific function than the way I am doing it, but since I am using several functions, I am after a general solution to my problem, not just the most efficient way to do this one thing.
Basically, I have one sample dataframe that looks like this (df1):
Ticker Date High Volume
0 AAPL 20200501 1.5 150
1 AAPL 20200501 1.2 100
2 AAPL 20200501 1.3 150
3 AAPL 20200502 1.4 130
4 AAPL 20200502 1.2 170
5 AAPL 20200502 1.1 160
6 TSLA 20200501 2.5 250
7 TSLA 20200501 2.2 200
8 TSLA 20200501 2.3 250
9 TSLA 20200502 2.4 230
10 TSLA 20200502 2.2 270
11 TSLA 20200502 2.1 260
and one sample dataframe that looks like this (df2):
Ticker Date Price SumVol
0 AAPL 20200508 1.2 0
1 TSLA 20200508 2.2 0
the values in the 'SumVol' column in df2 should be filled with the sum of the values in the 'Volume' column from df1, up until the first time the value in df2's 'Price' column is seen in df1's 'High' column, where the date in df1 matches the date from df2
desired output:
Ticker Date Price SumVol
0 AAPL 20200508 1.2 300
1 TSLA 20200508 2.2 500
for some reason I am unable to get this output, because I am probably doing something wrong in the line of code where I am trying to apply the function to the dataframe. I hope that someone here can help me out.
Full sample code including sample dataframes:
import pandas as pd
df1 = pd.DataFrame({'Ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'TSLA', 'TSLA', 'TSLA', 'TSLA', 'TSLA', 'TSLA'],
                    'Date': [20200501, 20200501, 20200501, 20200502, 20200502, 20200502, 20200501, 20200501, 20200501, 20200502, 20200502, 20200502],
                    'High': [1.5, 1.2, 1.3, 1.4, 1.2, 1.1, 2.5, 2.2, 2.3, 2.4, 2.2, 2.1],
                    'Volume': [150, 100, 150, 130, 170, 160, 250, 200, 250, 230, 270, 260]})
print(df1)
df2 = pd.DataFrame({'Ticker': ['AAPL', 'TSLA'],
                    'Date': [20200501, 20200502],
                    'Price': [1.4, 2.2],
                    'SumVol': [0, 0]})
print(df2)
def VolSum(ticker, date, price):
    df11 = pd.DataFrame(df1)
    df11 = df11[df11['Ticker'] == ticker]
    df11 = df11[df11['Date'] == date]
    df11 = df11[df11['High'] < price]
    df11 = pd.DataFrame(df11)
    return df11.Volume.sum
df2['SumVol'].apply(VolSum(df2['Ticker'], df2['Date'], df2['Price']), inplace=True).reset_index(drop=True, inplace=True)
print(df2)

The first reason for your failure is that your function ends with
return df11.Volume.sum (without parentheses),
so you return the sum function itself, not the result of executing it.
Another reason is that you can apply a function to e.g. each row of a DataFrame,
but then you must pass the axis=1 parameter. In that case:
the function to be applied should take one parameter - the current row;
its result can be assigned to the desired column.
And the third reason for failure is that df2 contains e.g. dates not present
in df1, so you are not likely to find any matching rows.
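A quick illustration of the first point (values made up for the example):
import pandas as pd
s = pd.Series([150, 100, 150])
print(s.sum)    # <bound method Series.sum of ...> - the method object itself
print(s.sum())  # 400 - the actual result of executing it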
How to get the expected result - Method 1
First, df2 must contain values that are likely to be matched with df1.
I defined df2 as:
Ticker Date Price SumVol
0 AAPL 20200501 1.4 0
1 TSLA 20200502 2.3 0
Then I changed your function to:
def VolSum(row):
    df11 = pd.DataFrame(df1)
    df11 = df11[df11['Ticker'] == row.Ticker]
    df11 = df11[df11['Date'] == row.Date]
    df11 = df11[df11['High'] < row.Price]
    return df11.Volume.sum()
And finally I generated the result as:
df2['SumVol'] = df2.apply(VolSum, axis=1)
The result is:
Ticker Date Price SumVol
0 AAPL 20200501 1.4 250
1 TSLA 20200502 2.3 530
How to get the expected result - Method 2
But a more concise and elegant method is to define the summing function as:
def VolSum2(row):
    return df1.query('Ticker == @row.Ticker and '
                     'Date == @row.Date and High < @row.Price').Volume.sum()
And apply it just the same way:
df2['SumVol'] = df2.apply(VolSum2, axis=1)
The result is of course the same.
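If df2 grows large, the per-row filtering inside VolSum can get slow. As an alternative (a sketch, not part of the original answer), the same numbers can be computed with one merge and one groupby; the column names are those of the question, and df2 is the corrected version from Method 1:
import pandas as pd

# Attach every df1 row to its matching (Ticker, Date) row in df2,
# keep only rows whose High is below the target Price, then sum Volume.
sums = (df2.merge(df1, on=['Ticker', 'Date'], how='left')
           .query('High < Price')
           .groupby(['Ticker', 'Date'], as_index=False)['Volume'].sum()
           .rename(columns={'Volume': 'SumVol'}))
out = df2.drop(columns='SumVol').merge(sums, on=['Ticker', 'Date'], how='left')
out['SumVol'] = out['SumVol'].fillna(0)
print(out)  # SumVol: 250 and 530, matching the apply-based methods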

Related

Subtract columns from two DFs based on matching condition

Suppose I have the following two DFs:
DF A: First column is a date, and then there are columns that start with a year (2021, 2022...)
Date 2021.Water 2021.Gas 2022.Electricity
may-04 500 470 473
may-05 520 490 493
may-06 540 510 513
DF B: First column is a date, and then there are columns that start with a year (2021, 2022...)
Date 2021.Amount 2022.Amount
may-04 100 95
may-05 110 105
may-06 120 115
The expected result is a DF with the columns from DF A, but with the rows divided by the values for the matching year in DF B. Such as:
Date 2021.Water 2021.Gas 2022.Electricity
may-04 5.0 4.7 5.0
may-05 4.7 4.5 4.7
may-06 4.5 4.3 4.5
I am really struggling with this problem. Let me know if any clarifications are needed and I will be glad to help.
Try this:
dfai = dfa.set_index('Date')
dfai.columns = dfai.columns.str.split('.', expand=True)
dfbi = dfb.set_index('Date').rename(columns = lambda x: x.split('.')[0])
df_out = dfai.div(dfbi, level=0).round(1)
df_out.columns = df_out.columns.map('.'.join)
df_out.reset_index()
Output:
Date 2021.Water 2021.Gas 2022.Electricity
0 may-04 5.0 4.7 5.0
1 may-05 4.7 4.5 4.7
2 may-06 4.5 4.2 4.5
Details
First, move 'Date' into the index of both dataframes, then use string split to get years into a level in each dataframe.
Use pd.DataFrame.div with level=0 to align operations on the top level index of each dataframe.
Flatten multiindex column header back to a single level and reset_index.
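For reference, here is a self-contained version of the answer using the question's sample data (values transcribed from the post):
import pandas as pd

dfa = pd.DataFrame({'Date': ['may-04', 'may-05', 'may-06'],
                    '2021.Water': [500, 520, 540],
                    '2021.Gas': [470, 490, 510],
                    '2022.Electricity': [473, 493, 513]})
dfb = pd.DataFrame({'Date': ['may-04', 'may-05', 'may-06'],
                    '2021.Amount': [100, 110, 120],
                    '2022.Amount': [95, 105, 115]})

dfai = dfa.set_index('Date')
dfai.columns = dfai.columns.str.split('.', expand=True)  # ('2021', 'Water'), ...
dfbi = dfb.set_index('Date').rename(columns=lambda x: x.split('.')[0])
df_out = dfai.div(dfbi, level=0).round(1)  # divide, aligning on the year level
df_out.columns = df_out.columns.map('.'.join)
print(df_out.reset_index())  # reproduces the output shown above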

Python - Stock trading backtesting simulation without .iterrows()

Here is the simplified sample dataset:
Price Signal
0 1.5
1 2.0 Buy
2 2.1
3 2.2
4 1.7 Sell
Here is the code to generate the above sample dataset for ease of reference:
price = [1.5, 2, 2.1, 2.2, 1.7]
signal = ['', 'Buy', '', '', 'Sell']
df = pd.DataFrame(zip(price,signal), columns = ['Price', 'Signal'])
Here is the task:
Assuming initial cash = 100 and stock position = 0, simulate cash and stock position at each step, based on the following code using .iterrows():
cash = 100
num_of_shares = 0
for index, row in df.iterrows():
    if row['Signal'] == 'Sell':
        if num_of_shares > 0:
            cash = num_of_shares * row['Price']
            num_of_shares = 0
    elif row['Signal'] == 'Buy':
        if cash > 0:
            num_of_shares = cash / row['Price']
            cash = 0
    df.loc[index, 'Position'] = num_of_shares
    df.loc[index, 'Cash'] = cash
Here is the result:
Price Signal Position Cash
0 1.5 0.0 100.0
1 2.0 Buy 50.0 0.0
2 2.1 50.0 0.0
3 2.2 50.0 0.0
4 1.7 Sell 0.0 85.0
Here is the question: is there any way to achieve the result faster than using .iterrows()?
You can use the .to_dict() function in pandas to convert the dataframe to a dictionary. Iterating over a dictionary is very fast compared to the iterrows() function; in the comparison below it is roughly 19x faster on a 3,400-row dataframe.
For example, iterating the dataframe with iterrows():
for key, value in df.iterrows():
    # key is your index
    # value is your row
Iterating the dataframe after converting it to a dictionary:
dict1 = df.to_dict(orient='index')
for key in dict1:
    value = dict1[key]
    # key is your index
    # value is your row
To compare: with a dataframe of 3,400 rows, iterating the full dataframe with pandas iterrows() took about 2.45 s, while iterating the dictionary version of the same data took about 0.13 s. So converting a pandas dataframe to a dictionary is good practice here and an easy way to optimize this kind of code.
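Applied to the question's backtest, the dictionary-based version might look like the sketch below. Collecting the results in plain lists and assigning them once at the end also avoids the repeated df.loc writes of the original loop:
import pandas as pd

price = [1.5, 2, 2.1, 2.2, 1.7]
signal = ['', 'Buy', '', '', 'Sell']
df = pd.DataFrame(zip(price, signal), columns=['Price', 'Signal'])

cash, num_of_shares = 100, 0
positions, cashes = [], []
rows = df.to_dict(orient='index')  # {index: {'Price': ..., 'Signal': ...}}
for key in rows:
    row = rows[key]
    if row['Signal'] == 'Sell' and num_of_shares > 0:
        cash = num_of_shares * row['Price']
        num_of_shares = 0
    elif row['Signal'] == 'Buy' and cash > 0:
        num_of_shares = cash / row['Price']
        cash = 0
    positions.append(num_of_shares)
    cashes.append(cash)
df['Position'], df['Cash'] = positions, cashes
print(df)  # same Position/Cash columns as the iterrows() version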

Python issues with pandas.apply, looking for suggestions for alternatives or help to fix issue

I have been trying to fix a small error in my data, but I am having trouble doing this.
I have a sample df that looks like this:
Date Ticker Type Quantity
0 20200501 AAPL SS 200
1 20200502 AAPL B 150
2 20200502 APPL B 100
3 20200502 APPL B 50
4 20200502 AAPL S 100
In this scenario I shorted 200 shares of Apple on May 1st, then covered them on May 2nd, and bought & sold 100 more shares the same day. The transactions shown, however, indicate that after I covered 150 of my 200 shares, I first bought 100 before covering the other 50. This is not possible, so these rows are flipped. Instead of flipping the right rows, however, I would like to change the values in the 'Type' column of my buy-to-cover orders from 'B' to 'BC'. So the desired output is:
Date Ticker Type Quantity
0 20200501 AAPL SS 200
1 20200502 AAPL BC 150
2 20200502 APPL B 100
3 20200502 APPL BC 50
4 20200502 AAPL S 100
While trying to accomplish this, I was creating a function with pd.apply that calculates the current position size to check whether the quantity column exceeds the position size, and if so it should change the 'Type' column. Apparently though, .apply does not update the dataframe it is working with (at least to my understanding), so when I tried to calculate position sizes I stumbled upon a problem.
(example of problem):
Date Ticker Type Quantity PositionSize
0 20200501 AAPL SS 200 -200.0
1 20200502 AAPL B 150 NaN
2 20200502 APPL B 100 NaN
Current position Size -200.0
Quantity to update 50
I was hoping someone on here could help me fix my issue or suggest a better alternative to .apply in this case.
Sample of the code I'm using for this example:
import pandas as pd
df1 = pd.DataFrame({'Date': [20200501, 20200502, 20200502, 20200502, 20200502],
                    'Ticker': ['AAPL', 'AAPL', 'APPL', 'APPL', 'AAPL'],
                    'Type': ['SS', 'B', 'B', 'B', 'S'],
                    'Quantity': [200, 150, 200, 50, 100]})
print(df1)
if df1.loc[0, 'Type'] == 'B':
    df1.loc[0, 'PositionSize'] = df1.loc[0, 'Quantity']
elif df1.loc[0, 'Type'] == 'BC':
    df1.loc[0, 'PositionSize'] = df1.loc[0, 'Quantity']
else:
    df1.loc[0, 'PositionSize'] = -df1.loc[0, 'Quantity']
def Check_Type(row):
    if row.name != 0:
        f_df1 = pd.DataFrame(df1)
        f_df1 = f_df1[f_df1.index < row.name]
        s = f_df1['PositionSize'].sum()
        if row.Type == 'B':
            q = row.Quantity
        elif row.Type == 'BC':
            q = row.Quantity
        else:
            q = -row.Quantity
        p = s + q
        print(f_df1)
        print('Current position Size ' + str(s))
        print('Quantity to update ' + str(q))
        t = row.Type
        return t, p
    else:
        t = row.Type
        p = row.PositionSize
        return t, p
df1[['Type', 'PositionSize']] = df1.apply(Check_Type, axis=1, result_type='expand')
print(df1)
Instead of using .apply() you could add the value needed in your calculation into a new column.
# Reverse the sign of Quantity for non B rows.
# Add new column with the starting Position.
# Sum them together.
df.loc[ df.Type != 'B', 'Quantity' ] *= -1
df['Pos'] = df.loc[0, 'Quantity']
df.Pos = df.Pos + df.Quantity
>>> df
Date Ticker Type Quantity Pos
0 20200501 AAPL SS -200 -400
1 20200502 AAPL B 150 -50
2 20200502 APPL B 200 0
3 20200502 APPL B 50 -150
4 20200502 AAPL S -100 -300
(First Pos is "wrong" but you can replace it if needed)
# Change the type on `B` rows with a negative Pos
df.loc[ (df.Type == 'B') & (df.Pos < 0), 'Type' ] = 'BC'
>>> df
Date Ticker Type Quantity Pos
0 20200501 AAPL SS -200 -400
1 20200502 AAPL BC 150 -50
2 20200502 APPL B 200 0
3 20200502 APPL BC 50 -150
4 20200502 AAPL S -100 -300
Put the quantities back to positive and delete the Pos column.
df.Quantity = df.Quantity.abs()
del df['Pos']
>>> df
Date Ticker Type Quantity
0 20200501 AAPL SS 200
1 20200502 AAPL BC 150
2 20200502 APPL B 200
3 20200502 APPL BC 50
4 20200502 AAPL S 100
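The same heuristic can also be written without mutating Quantity in place; this is a sketch of the answer's logic, not new behavior:
import pandas as pd

df = pd.DataFrame({'Date': [20200501, 20200502, 20200502, 20200502, 20200502],
                   'Ticker': ['AAPL', 'AAPL', 'APPL', 'APPL', 'AAPL'],
                   'Type': ['SS', 'B', 'B', 'B', 'S'],
                   'Quantity': [200, 150, 200, 50, 100]})

# Signed quantities: buys positive, sells/shorts negative.
signed = df['Quantity'].where(df['Type'] == 'B', -df['Quantity'])
start = signed.iloc[0]  # opening position (negative: a short sale)
# A buy that does not take the opening short position above zero is a cover.
df.loc[(df['Type'] == 'B') & (start + signed < 0), 'Type'] = 'BC'
print(df)  # Type becomes SS, BC, B, BC, S, as in the output above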

Flattening embedded keys with Pandas

I am trying (unsuccessfully) to create separate columns for embedded dictionary keys. Dict data looks like this:
{'averagePrice': 32.95,
'currentDayProfitLoss': 67.2,
'currentDayProfitLossPercentage': 0.02,
'instrument': {'assetType': 'EQUITY', 'cusip': '902104108', 'symbol': 'IIVI'},
'longQuantity': 120.0,
'marketValue': 4021.2,
'settledLongQuantity': 120.0,
'settledShortQuantity': 0.0,
'shortQuantity': 0.0}]
The 'instrument' key is what I am trying to flatten into columns (i.e. assetType, cusip, symbol). Here is the code I last tried, and still there are no individual columns:
data = accounts_data_single
my_dict = data
headers = list(my_dict['securitiesAccount']['positions'])
dict1 = my_dict['securitiesAccount']['positions']
mypositions = pd.DataFrame(dict1)
pd.concat([mypositions.drop(['instrument'], axis=1), mypositions['instrument'].apply(pd.Series)], axis=1)
mypositions.to_csv('Amer_temp.csv')
Any suggestions are greatly appreciated
I am trying to get the nested keys/fieldnames all in columns and then all the stock positions in the rows. The above code works great, except that the nested 'instrument' keys all end up in one column:
averagePrice currentDayProfitLoss ... assetType cusip symbol
22.5 500 ... Equity 013245 IIVI
450 250 ... Equity 321354 AAPL
etc
Here's a way to do this. Let's say d is your dict.
Step 1: Convert the dict to dataframe
d1 = pd.DataFrame.from_dict(d, orient='index').T.reset_index(drop=True)
Step 2: Convert the instrument column into dataframe
d2 = d1['instrument'].apply(pd.Series)
Step 3: Join the outputs of step 1 and step 2
df = pd.concat([d1.drop('instrument', axis=1), d2], axis=1)
Are you trying to do this:
pd.DataFrame(d).assign(**pd.DataFrame([x['instrument'] for x in d])).drop(columns='instrument')
output:
averagePrice currentDayProfitLoss currentDayProfitLossPercentage longQuantity marketValue settledLongQuantity settledShortQuantity shortQuantity assetType cusip symbol
0 32.95 67.2 0.02 120.0 4021.2 120.0 0.0 0.0 EQUITY 902104108 IIVI
1 31.95 63.2 0.01 100.0 3021.2 100.0 0.0 0.0 EQUITY 802104108 AAPL
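As a further note (not from either answer): recent pandas versions also ship pd.json_normalize, which flattens nested dicts in one call; a minimal sketch assuming d is the list of position dicts:
import pandas as pd

d = [{'averagePrice': 32.95,
      'currentDayProfitLoss': 67.2,
      'instrument': {'assetType': 'EQUITY', 'cusip': '902104108', 'symbol': 'IIVI'},
      'longQuantity': 120.0}]

flat = pd.json_normalize(d)  # nested keys become 'instrument.assetType', ...
flat.columns = [c.split('.')[-1] for c in flat.columns]  # drop the prefix
print(flat)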

Python Pandas - Dynamic matching of different date indices

I have two dataframes with different timeseries data (see example below). Whereas Dataframe1 contains multiple daily observations per month, Dataframe2 only contains one observation per month.
What I want to do now is to align the data in Dataframe2 with the last day every month in Dataframe1. The last day per month in Dataframe1 does not necessarily have to be the last day of that respective calendar month.
I'm grateful for every hint on how to tackle this problem in an efficient manner (as the dataframes can be quite large).
Dataframe1
----------------------------------
date A B
1980-12-31 152.799 209.132
1981-01-01 152.799 209.132
1981-01-02 152.234 209.517
1981-01-05 152.895 211.790
1981-01-06 155.131 214.023
1981-01-07 152.596 213.044
1981-01-08 151.232 211.810
1981-01-09 150.518 210.887
1981-01-12 149.899 210.340
1981-01-13 147.588 207.621
1981-01-14 148.231 208.076
1981-01-15 148.521 208.676
1981-01-16 148.931 209.278
1981-01-19 149.824 210.372
1981-01-20 149.849 210.454
1981-01-21 150.353 211.644
1981-01-22 149.398 210.042
1981-01-23 148.748 208.654
1981-01-26 148.879 208.355
1981-01-27 148.671 208.431
1981-01-28 147.612 207.525
1981-01-29 147.153 206.595
1981-01-30 146.330 205.558
1981-02-02 145.779 206.635
Dataframe2
---------------------------------
date C D
1981-01-13 53.4 56.5
1981-02-15 52.2 60.0
1981-03-15 51.8 58.0
1981-04-14 51.8 59.5
1981-05-16 50.7 58.0
1981-06-15 50.3 59.5
1981-07-15 50.6 53.5
1981-08-17 50.1 44.5
1981-09-12 50.6 38.5
To provide a readable example, I prepared test data as follows:
df1 - A couple of observations from January and February:
date A B
0 1981-01-02 152.234 209.517
1 1981-01-07 152.596 213.044
2 1981-01-13 147.588 207.621
3 1981-01-20 151.232 211.810
4 1981-01-27 150.518 210.887
5 1981-02-05 149.899 210.340
6 1981-02-14 152.895 211.790
7 1981-02-16 155.131 214.023
8 1981-02-21 180.000 200.239
df2 - Your data, also from January and February:
date C D
0 1981-01-13 53.4 56.5
1 1981-02-15 52.2 60.0
Both dataframes have date column of datetime type.
Start from getting the last observation in each month from df1:
res1 = df1.groupby(df1.date.dt.to_period('M')).tail(1)
The result, for my data, is:
date A B
4 1981-01-27 150.518 210.887
8 1981-02-21 180.000 200.239
Then, to join observations, the join must be performed on the
whole month period, not the exact date. To do this, run:
res = pd.merge(res1.assign(month=res1['date'].dt.to_period('M')),
               df2.assign(month=df2['date'].dt.to_period('M')),
               how='left', on='month', suffixes=('_1', '_2'))
The result is:
date_1 A B month date_2 C D
0 1981-01-27 150.518 210.887 1981-01 1981-01-13 53.4 56.5
1 1981-02-21 180.000 200.239 1981-02 1981-02-15 52.2 60.0
If you want the merge to include data only for months where there
is at least one observation in both df1 and df2, drop the how parameter.
Its default value is 'inner', which is the correct mode in this case.
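Putting both steps together into a runnable script with the test data above (a consolidation of this answer, nothing new):
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['1981-01-02', '1981-01-07', '1981-01-13',
                                            '1981-01-20', '1981-01-27', '1981-02-05',
                                            '1981-02-14', '1981-02-16', '1981-02-21']),
                    'A': [152.234, 152.596, 147.588, 151.232, 150.518,
                          149.899, 152.895, 155.131, 180.000],
                    'B': [209.517, 213.044, 207.621, 211.810, 210.887,
                          210.340, 211.790, 214.023, 200.239]})
df2 = pd.DataFrame({'date': pd.to_datetime(['1981-01-13', '1981-02-15']),
                    'C': [53.4, 52.2],
                    'D': [56.5, 60.0]})

res1 = df1.groupby(df1.date.dt.to_period('M')).tail(1)  # last row per month
res = pd.merge(res1.assign(month=res1['date'].dt.to_period('M')),
               df2.assign(month=df2['date'].dt.to_period('M')),
               how='left', on='month', suffixes=('_1', '_2'))
print(res)  # reproduces the result table above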
When you have a sample dataframe, you can provide code that reproduces it. Simply print each column as a list (steps 1 and 2) and use those lists to build the dataframe in code (steps 3 and 4).
import pandas as pd
# Step 1: create your dataframe, and print each column as a list, copy-paste into code example below.
df_1 = pd.read_csv('dataset1.csv')
print(list(df_1['date']))
print(list(df_1['A']))
print(list(df_1['B']))
# Step 2: create your dataframe, and print each column as a list, copy-paste into code example below.
df_2 = pd.read_csv('dataset2.csv')
print(list(df_2['date']))
print(list(df_2['C']))
print(list(df_2['D']))
# Step 3: create sample dataframe ... good if you can provide this in your future questions
df_1 = pd.DataFrame({
    'date': ['12/31/1980', '1/1/1981', '1/2/1981', '1/5/1981', '1/6/1981',
             '1/7/1981', '1/8/1981', '1/9/1981', '1/12/1981', '1/13/1981',
             '1/14/1981', '1/15/1981', '1/16/1981', '1/19/1981', '1/20/1981',
             '1/21/1981', '1/22/1981', '1/23/1981', '1/26/1981', '1/27/1981',
             '1/28/1981', '1/29/1981', '1/30/1981', '2/2/1981'],
    'A': [152.799, 152.799, 152.234, 152.895, 155.131,
          152.596, 151.232, 150.518, 149.899, 147.588,
          148.231, 148.521, 148.931, 149.824, 149.849,
          150.353, 149.398, 148.748, 148.879, 148.671,
          147.612, 147.153, 146.33, 145.779],
    'B': [209.132, 209.132, 209.517, 211.79, 214.023,
          213.044, 211.81, 210.887, 210.34, 207.621,
          208.076, 208.676, 209.278, 210.372, 210.454,
          211.644, 210.042, 208.654, 208.355, 208.431,
          207.525, 206.595, 205.558, 206.635]
})
# Step 4: create sample dataframe ... good if you can provide this in your future questions
df_2 = pd.DataFrame({
    'date': ['1/13/1981', '2/15/1981', '3/15/1981', '4/14/1981', '5/16/1981',
             '6/15/1981', '7/15/1981', '8/17/1981', '9/12/1981'],
    'C': [53.4, 52.2, 51.8, 51.8, 50.7, 50.3, 50.6, 50.1, 50.6],
    'D': [56.5, 60.0, 58.0, 59.5, 58.0, 59.5, 53.5, 44.5, 38.5]
})
# Step 5: make sure the date field is actually a date, not a string
df_1['date'] = pd.to_datetime(df_1['date']).dt.date
# Step 6: create new column with year and month
df_1['date_year_month'] = pd.to_datetime(df_1['date']).dt.to_period('M')
# Step 7: create boolean mask that grabs the max date for each year-month
mask_last_day_month = df_1.groupby('date_year_month')['date'].transform(max) == df_1['date']
# Step 8: create new dataframe with only last day of month
df_1_max = df_1.loc[mask_last_day_month]
print('here is dataframe 1 with only last day in the month')
print(df_1_max)
print()
# Step 9: make sure the date field is actually a date, not a string
df_2['date'] = pd.to_datetime(df_2['date']).dt.date
# Step 10: create new column with year and month
df_2['date_year_month'] = pd.to_datetime(df_2['date']).dt.to_period('M')
print('here is the original dataframe 2')
print(df_2)
print()
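The answer stops before the actual alignment; a possible final step (a sketch, not part of the original answer) merges the two prepared frames on the year-month column:
# Step 11 (sketch): align each df_2 observation with the last day
# of the same month in df_1; 'inner' keeps only months present in both.
result = df_1_max.merge(df_2, on='date_year_month', how='inner',
                        suffixes=('_daily', '_monthly'))
print(result)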
