Fast numpy operation on part of dataframe - python

I have a pandas dataframe with several columns. Two of them are date and time, and the rest are numerical.
I need to perform a fast in-place calculation on the numerical part of the dataframe. Currently I drop the first two columns, convert the numerical columns to a numpy array, and use that array further down in the code.
However, I want to keep the processed numerical values in the dataframe without touching the date and time columns.
Now:
# tanh norm
def tanh_ret():
    data = df.to_numpy()
    mu = np.mean(data)
    std = np.std(data)
    return 0.5 * (np.tanh(0.01 * ((data - mu) / std)) + 1)

del df['Date']
del df['Time']
nums = tanh_ret()
del df
What I want: normalize 3 of the 5 dataframe columns in place.
Mind that the dataset is large, so I would prefer as little data copying as possible while staying reasonably fast.

Create a random pandas dataframe
I use 5 columns of random values; you can put whatever you want there. The Time and Date columns are set to a constant value.
import datetime as dt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random((100,5)))
now = dt.datetime.now()
df['Time'] = now.strftime('%H:%M:%S')
df['Date'] = now.strftime('%m/%d/%Y')
Inplace numerical processing
def tanh_ret(data):
    mu = data.mean()
    std = data.std()
    return 0.5 * (np.tanh(0.01 * ((data - mu) / std)) + 1)

num_cols = df.columns[df.dtypes != 'object']
df[num_cols] = df[num_cols].transform(tanh_ret)
Alternatively:
tan_map = {col: tanh_ret for col in num_cols}
df[num_cols] = df.transform(tan_map)
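If copying is the main concern on a very large frame, here is a hedged sketch of a slightly different route: pull the numeric columns out as one numpy array, normalize them there, and assign the block back. Note that it uses a single mean/std over the whole block, like the question's original tanh_ret, rather than per column as transform does; select_dtypes is just one way to pick the numeric columns.
# A minimal sketch, assuming df is the frame built above (numeric columns plus Time/Date).
num_cols = df.select_dtypes(include='number').columns
vals = df[num_cols].to_numpy()        # one copy out of the frame
mu, std = vals.mean(), vals.std()
df[num_cols] = 0.5 * (np.tanh(0.01 * ((vals - mu) / std)) + 1)  # one write-back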

Related

how to calculate the Sharpe ratio in different time intervals?

import pandas as pd
import numpy as np

bt_dict = {
    'position_strategy1': df1_1hour,
    'position_strategy2': df2_6hour
}

def backtest(bt_dict):
    # ohlc_df is one hour timeframe
    ohlc_df['date'] = pd.to_datetime(ohlc_df['date'])
    ohlc_df.set_index('date', inplace=True)
    all_df = pd.DataFrame(index=ohlc_df.index)
    all_df['close'] = ohlc_df['close']
    for strategy_name, strategy_df in bt_dict.items():
        bt_dict[strategy_name] = strategy_df[['date', 'position']].rename(
            columns={"position": f"position_{strategy_name}"}).dropna()
        bt_dict[strategy_name]['date'] = pd.to_datetime(bt_dict[strategy_name]['date'])
        bt_dict[strategy_name].set_index('date', inplace=True)
        all_df[f'position_{strategy_name}'] = bt_dict[strategy_name]
    all_df = all_df.fillna(method='ffill')
    all_df['position'] = all_df['position_strategy1'] * 0.6 + \
                         all_df['position_strategy2'] * 0.4
    all_df = all_df.dropna()
    all_df['pnl'] = all_df['position'].shift(1) * (all_df['close'] / all_df['close'].shift(1) - 1)
    sharpe_ratio = all_df['pnl'].mean() / all_df['pnl'].std() * np.sqrt(365 * 24)
    return sharpe_ratio
For example, I have two strategies, one with a 1-hour data frame and one with a 6-hour data frame, and I want to combine them and calculate the Sharpe ratio.
I tried to calculate it across multiple timeframes, but the result was wrong.
I hope to find the right way to calculate the Sharpe ratio across different timeframes.
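The question does not show a resampling step, so here is only a hedged sketch of one way to align the two timeframes before computing the ratio: put both position series on the 1-hour price index with a forward fill, weight them, then compute returns and annualize by the number of 1-hour bars in a year. The frame names and the 0.6/0.4 weights come from the question; everything else is an assumption.
import numpy as np
import pandas as pd

# Sketch only: assumes each strategy frame has 'date' and 'position' columns,
# and ohlc_df has 'date' and 'close' at 1-hour frequency.
def combined_sharpe(ohlc_df, df1_1hour, df2_6hour):
    px = ohlc_df.assign(date=pd.to_datetime(ohlc_df['date'])).set_index('date')['close']

    def pos_on_hourly_index(df):
        s = df.assign(date=pd.to_datetime(df['date'])).set_index('date')['position']
        # Align to the 1-hour price index; a 6-hour position is held until the next signal.
        return s.reindex(px.index).ffill()

    position = 0.6 * pos_on_hourly_index(df1_1hour) + 0.4 * pos_on_hourly_index(df2_6hour)
    pnl = position.shift(1) * px.pct_change()
    # Annualize hourly PnL: 365 days * 24 bars per day, as in the question.
    return pnl.mean() / pnl.std() * np.sqrt(365 * 24)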

function for counting number of oscillations

I'm trying to build a counter that detects the number of oscillations in a given set of data.
I'm following a method where the slope at each point is calculated, and an oscillation is counted when the direction of the slope changes between negative and positive.
Is there a pre-existing function for this?
I'm using the following code, and I'm unable to leave out the cells with zero values after taking the difference between each cell.
import pandas as pd
import xlsxwriter
from asammdf import MDF
import numpy as np
dat = MDF("file_name.dat")
app = dat.get('variabe_name')
df = pd.DataFrame(app)
print(df)
data = df.loc[0, 0:]
#time step = T
T = 0.01
# Number of sample points
N = len(data)
# sample spacing
x = np.linspace(0.0, N*T, N, endpoint=False)
x1 = data.diff()
print(x1)
df1_1 = pd.DataFrame([x1])
df1_1 = df1_1.replace(0, np.nan)
df1_1 = df1_1.dropna(how='all', axis=0)
df1_1 = df1_1.dropna()
df1 = pd.DataFrame.transpose(df1_1)
df1.to_csv("output.csv")
my data looks like this
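There is no single built-in pandas function that counts oscillations directly, but here is a hedged sketch of the slope-sign-change idea described above; the function name and usage are assumptions, not taken from the .dat file.
import numpy as np
import pandas as pd

def count_oscillations(series):
    # Slope between consecutive samples.
    slope = series.diff().dropna()
    # Drop flat segments so zeros do not break the sign comparison.
    slope = slope[slope != 0]
    # A direction change is counted whenever the sign of the slope flips.
    return int((np.sign(slope).diff().abs() == 2).sum())

# Usage on the question's 'data' row, assuming it is a numeric pandas Series:
# n_osc = count_oscillations(data)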

Is there any way to speed up dependent iterations in a loop (numpy or pandas) like vectorize?

I have a model that takes pandas dataframes or numpy arrays as inputs. The model iterates over the rows in a loop, with the current row's calculation dependent on the prior step. Is there any way to incorporate numpy vectorization here, or any other way to make the code run faster? I have very large inputs, so any improvement in speed saves a lot of time. Sample code is below for reference. NumPy inputs already improve speed over the pandas dataframe inputs. Any suggestion is highly appreciated.
import pandas as pd
import numpy as np

inp1 = pd.DataFrame(np.random.rand(5, 5))
inp2 = pd.DataFrame(np.random.rand(5, 5))
outp1 = pd.DataFrame(np.zeros(inp1.shape, dtype=float),
                     index=inp1.index, columns=inp1.columns)

def sample_code_pandas(params):
    aa, bb, lmt = params
    outp1[inp2 < lmt] = inp1[inp2 < lmt]
    out_pr = outp1.iloc[0, :]
    for i in range(1, len(outp1)):
        rates = (inp1.iloc[i, :] - aa) / (aa - bb)
        outp1.iloc[i, :] = inp1.iloc[i, :] * rates - out_pr
        out_pr = outp1.iloc[i, :]
    return outp1

%timeit sample_code_pandas((-0.2, -0.5, 0))

#******************************************************
inp1 = np.random.rand(5, 5)
inp2 = np.random.rand(5, 5)
outp1 = np.zeros(inp1.shape, dtype=float)

def sample_code_numpy(params):
    aa, bb, lmt = params
    outp1[inp2 < lmt] = inp1[inp2 < lmt]
    out_pr = outp1[0]
    for i in range(1, len(outp1)):
        rates = (inp1[i] - aa) / (aa - bb)
        outp1[i] = inp1[i] * rates - out_pr
        out_pr = outp1[i]
    return outp1

%timeit sample_code_numpy((-0.2, -0.5, 0))
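Because each row depends on the previous one, the loop itself cannot be vectorized away, but compiling it is a common workaround. Below is a hedged sketch using numba's @njit on the numpy version; numba is an assumption (it is not part of the question), and the inputs must be plain numpy arrays.
import numpy as np
from numba import njit  # assumption: numba is installed

@njit
def sample_code_numba(inp1, inp2, aa, bb, lmt):
    outp1 = np.zeros_like(inp1)
    # Same masked initialisation as the question; only row 0 survives the loop below.
    for i in range(inp1.shape[0]):
        for j in range(inp1.shape[1]):
            if inp2[i, j] < lmt:
                outp1[i, j] = inp1[i, j]
    # Same recurrence as the question: row i depends on row i-1.
    for i in range(1, inp1.shape[0]):
        for j in range(inp1.shape[1]):
            rates = (inp1[i, j] - aa) / (aa - bb)
            outp1[i, j] = inp1[i, j] * rates - outp1[i - 1, j]
    return outp1

# out = sample_code_numba(inp1, inp2, -0.2, -0.5, 0.0)  # inp1/inp2 as numpy arrays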

Efficient way to convert Latitude/Longitude to XY

I have a working script that converts Latitude and Longitude coordinates to Cartesian coordinates. However, I have to perform this for specific points at each point in time (row by row).
I want to do something similar on a larger df, and I'm not sure that a loop which iterates over each row is the most efficient way to do it. Below is the script that converts a single XY point.
import math
import numpy as np
import pandas as pd

point1 = [-37.83028766, 144.9539561]
r = 6371000  # radius of the Earth in meters
phi_0 = point1[1]
cos_phi_0 = math.cos(np.radians(phi_0))

def to_xy(point, r, cos_phi_0):
    lam = point[0]
    phi = point[1]
    return (r * np.radians(lam) * cos_phi_0, r * np.radians(phi))

point1_xy = to_xy(point1, r, cos_phi_0)
This works fine if I want to convert single points. The issue arises when I have a large data frame or list (>100,000 rows) of coordinates. Would a loop that iterates through each row be inefficient? Is there a better way to perform the same function?
Below is an example of a fractionally bigger df.
d = ({
    'Time': [0, 1, 2, 3, 4, 5, 6, 7, 8],
    'Lat': [37.8300, 37.8200, 37.8200, 37.8100, 37.8000, 37.8000, 37.7900, 37.7900, 37.7800],
    'Long': [144.8500, 144.8400, 144.8600, 144.8700, 144.8800, 144.8900, 144.8800, 144.8700, 144.8500],
})
df = pd.DataFrame(data=d)
I would do this if I were you. (Btw, the tuple casting part can be optimized.)
import numpy as np
import pandas as pd

point1 = [-37.83028766, 144.9539561]

def to_xy(point):
    r = 6371000  # radius of the Earth in meters
    lam, phi = point
    cos_phi_0 = np.cos(np.radians(phi))
    return (r * np.radians(lam) * cos_phi_0,
            r * np.radians(phi))

point1_xy = to_xy(point1)
print(point1_xy)

d = ({
    'Lat': [37.8300, 37.8200, 37.8200, 37.8100, 37.8000, 37.8000, 37.7900, 37.7900, 37.7800],
    'Long': [144.8500, 144.8400, 144.8600, 144.8700, 144.8800, 144.8900, 144.8800, 144.8700, 144.8500],
})
df = pd.DataFrame(d)

df['to_xy'] = df.apply(lambda x: tuple(x.values), axis=1).map(to_xy)
print(df)
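For >100,000 rows, the row-wise apply above is usually the bottleneck. Here is a hedged, fully vectorized sketch that applies the same formula to whole columns at once; it keeps the snippet's assignment of Lat to lam and Long to phi, and to_xy_vectorized is a name made up for this example.
import numpy as np
import pandas as pd

R = 6371000  # radius of the Earth in meters

def to_xy_vectorized(df):
    # Same formula as to_xy above, applied to whole columns at once.
    lam = np.radians(df['Lat'].to_numpy())
    phi = np.radians(df['Long'].to_numpy())
    x = R * lam * np.cos(phi)
    y = R * phi
    return x, y

df['x'], df['y'] = to_xy_vectorized(df)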

could a column length be a decimal

My problem is: could len(df) be a decimal?
len(DataFrame)
The code:
import quandl
import pandas as pd
from sklearn import preprocessing, model_selection, svm
from sklearn.linear_model import LinearRegression

df = quandl.get('WIKI/GOOGL', authtoken='t_nBiw5yVx3CXs3Zsuco')
df = df[['Open', 'High', 'Low', 'Close', 'Volume']]
df['HL_pct'] = (df['High'] - df['Low']) / df['Low'] * 100
df['PCT_change'] = (df['Close'] - df['Open']) / df['Open'] * 100
df = df[['Volume', 'HL_pct', 'PCT_change', 'Close']]
print(len(df))
forcast_col = df['Close']
If you are referring to the Python function len, it returns the number of items in an object. It cannot be a decimal, since an object cannot contain a fractional number of items (i.e. half an item).
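A small hedged illustration: len(df) is always an int, so if you need a fraction of the rows (e.g. a forecast window), you round it yourself. The 0.01 factor and the toy frame below are made up for the example.
import math
import pandas as pd

df = pd.DataFrame({'Close': range(250)})

n_rows = len(df)                               # always an int
forecast_out = int(math.ceil(0.01 * n_rows))   # a rounded fraction of the rows, here 3
print(n_rows, forecast_out)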
