Could a column length be a decimal? - python

My question is whether len(df), i.e. len(DataFrame), could be a decimal.
The code:
import pandas as pd
import quandl
from sklearn import preprocessing, model_selection, svm
from sklearn.linear_model import LinearRegression
df = quandl.get('WIKI/GOOGL',authtoken =
't_nBiw5yVx3CXs3Zsuco')
df = df[['Open','High','Low','Close','Volume']]
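# store the derived values as columns so they survive the column selection below
df['HL_pct'] = (df['High'] - df['Low']) / df['Low'] * 100
df['PCT_change'] = (df['Close'] - df['Open']) / df['Open'] * 100
# select columns with double brackets; a plain list would replace the DataFrame
df = df[['Volume','HL_pct','PCT_change','Close']]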
print(len(df))
forcast_col = df['Close']

If you are referring to the Python function len, it returns the number of items in an object. The result cannot be a decimal, because it counts whole items and is always an integer.
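A minimal sketch of this point, using a small made-up DataFrame (the values and the 10% factor are assumptions for illustration, not taken from the question). It shows that len gives an int for both a DataFrame and a column, and that a fractional value only appears once the length is scaled by a float:

```python
import math
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({"Close": [10.0, 10.5, 11.0, 10.8, 11.2, 11.5, 11.1]})

n_rows = len(df)             # number of rows in the DataFrame, always an int
n_col = len(df["Close"])     # number of entries in one column, also an int

fraction = 0.1 * n_rows      # scaling by a float produces a float
n_forecast = math.ceil(fraction)  # round up to get a usable whole count

print(n_rows, n_col, fraction, n_forecast)
```

If a whole number of rows is needed after scaling, rounding with math.ceil or int is one way to get back to an integer.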

Related

How to calculate the Sharpe ratio in different time intervals?

import pandas as pd
import numpy as np
# keys are the strategy names; the columns become position_<strategy name>
bt_dict = {
    'strategy1': df1_1hour,
    'strategy2': df2_6hour
}
def backtest(bt_dict):
    # ohlc_df (defined elsewhere) is the one hour timeframe
    ohlc_df['date'] = pd.to_datetime(ohlc_df['date'])
    ohlc_df.set_index('date',inplace=True)
    all_df = pd.DataFrame(index=ohlc_df.index)
    all_df['close'] = ohlc_df['close']
    for strategy_name,strategy_df in bt_dict.items():
        bt_dict[strategy_name] = strategy_df[['date','position']].rename(columns={"position":f"position_{strategy_name}"}).dropna()
        bt_dict[strategy_name]['date'] = pd.to_datetime(bt_dict[strategy_name]['date'])
        bt_dict[strategy_name].set_index('date', inplace=True)
        all_df[f'position_{strategy_name}'] = bt_dict[strategy_name]
    # carry the last known position forward onto the hourly index
    all_df = all_df.ffill()
    all_df['position'] = all_df['position_strategy1']*0.6 +\
        all_df['position_strategy2']*0.4
    all_df = all_df.dropna()
    # hourly pnl: previous position times the hourly return
    all_df['pnl'] = all_df['position'].shift(1) * (all_df['close'] / all_df['close'].shift(1) - 1)
    # annualize with the number of hourly periods per year
    sharpe_ratio = all_df['pnl'].mean() / all_df['pnl'].std() * np.sqrt(365 * 24)
    return sharpe_ratio
For example, I have two strategies, one on a 1-hour and one on a 6-hour data frame, and I want to combine them and calculate the Sharpe ratio.
I tried to calculate it across multiple timeframes, but the result was wrong.
I hope to find the right way to calculate the Sharpe ratio across different timeframes.
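One possible way to combine the two timeframes, sketched under stated assumptions: each position is held until the next signal of its own strategy, so the 6-hour positions are forward-filled onto the hourly index, and the annualization factor matches the frequency of the pnl series (hourly here), not the signal frequency. The function name combined_sharpe, the synthetic prices, and the random positions are all hypothetical.

```python
import numpy as np
import pandas as pd


def combined_sharpe(close, positions, weights, periods_per_year=365 * 24):
    """Annualized Sharpe ratio of a weighted mix of positions on the index of `close`."""
    combined = pd.Series(0.0, index=close.index)
    for name, pos in positions.items():
        # Forward-fill coarser positions (e.g. 6-hour) onto the hourly index
        aligned = pos.reindex(close.index, method="ffill")
        combined = combined + aligned * weights[name]
    returns = close.pct_change()
    # The position decided at bar t-1 earns the return realized at bar t
    pnl = (combined.shift(1) * returns).dropna()
    return pnl.mean() / pnl.std() * np.sqrt(periods_per_year)


# Synthetic example data (assumption, for illustration only)
rng = np.random.default_rng(0)
idx_1h = pd.date_range("2024-01-01", periods=48, freq=pd.Timedelta(hours=1))
idx_6h = pd.date_range("2024-01-01", periods=8, freq=pd.Timedelta(hours=6))
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 48))), index=idx_1h)
positions = {
    "strategy1": pd.Series(rng.choice([-1.0, 0.0, 1.0], 48), index=idx_1h),
    "strategy2": pd.Series(rng.choice([-1.0, 0.0, 1.0], 8), index=idx_6h),
}
weights = {"strategy1": 0.6, "strategy2": 0.4}

print(combined_sharpe(close, positions, weights))
```

Keeping the pnl on a single common frequency avoids mixing hourly and 6-hour returns under one annualization factor.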

Fast numpy operation on part of dataframe

I have a pandas dataframe with several columns. Two of them are date and time, and the others are numerical.
I need to perform a fast in-place calculation on the numerical part of the dataframe. Currently I ignore the first two columns, convert the numerical columns to a NumPy array, and use that array further down the code.
However, I want to keep the processed numerical values in the dataframe without touching date and time.
Now:
# tanh norm
def tanh_ret():
    data = df.to_numpy()
    mu = np.mean(data)
    std = np.std(data)
    return 0.5 * (np.tanh(0.01 * ((data - mu) / std)) + 1)

del df['Date']
del df['Time']
nums = tanh_ret()
del df
What I want: normalize 3 of the 5 df columns in place.
Mind that the dataset is large, so I would prefer as little data copying as possible while still being reasonably fast.
Create a random pandas dataframe
I use 5 columns of random values; you can substitute your own data. The Time and Date columns are set to a constant value.
import datetime as dt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random((100,5)))
now = dt.datetime.now()
df['Time'] = now.strftime('%H:%M:%S')
df['Date'] = now.strftime('%m/%d/%Y')
In-place numerical processing
def tanh_ret(data):
    mu = data.mean()
    std = data.std()
    return 0.5 * (np.tanh(0.01 * ((data - mu) / std)) + 1)

num_cols = df.columns[df.dtypes != 'object']
df[num_cols] = df[num_cols].transform(tanh_ret)
Alternatively:
tan_map = {col: tanh_ret for col in num_cols}
df[num_cols] = df.transform(tan_map)
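Note that transform applies tanh_ret to each column separately, so every column is normalized with its own mean and standard deviation (and pandas' std defaults to ddof=1), whereas the function in the question uses one mean and one np.std (ddof=0) over all numbers. If the global statistics are what is wanted, a sketch along these lines may be closer; the column names a, b, c and the constant Time/Date values are hypothetical, and the assignment still allocates one temporary array:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three numeric columns plus constant Time/Date strings
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 3)), columns=["a", "b", "c"])
df["Time"] = "12:00:00"
df["Date"] = "01/01/2024"

# Pick the numeric columns by dtype, leaving Time and Date alone
num_cols = df.select_dtypes(include="number").columns

values = df[num_cols].to_numpy()
mu = values.mean()   # one mean over all numeric values
std = values.std()   # one std over all numeric values (ddof=0, like np.std)

# Write the normalized values back into the same columns
df[num_cols] = 0.5 * (np.tanh(0.01 * (values - mu) / std) + 1)

print(df.head())
```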

Function for counting number of oscillations

I'm trying to build a counter that detects the number of oscillations in a given data set.
I'm following a method where the slope at each point is calculated and each change between negative and positive direction is counted.
Is there a preexisting function for this?
I'm using the following code, but I'm unable to leave out the cells with zero values after taking the difference between consecutive cells.
import pandas as pd
import xlsxwriter
from asammdf import MDF
import numpy as np
dat = MDF("file_name.dat")
app = dat.get('variabe_name')
df = pd.DataFrame(app)
print(df)
data = df.loc[0, 0:]
#time step = T
T = 0.01
# Number of sample points
N = len(data)
# sample spacing
x = np.linspace(0.0, N*T, N, endpoint=False)
x1 = data.diff()
print(x1)
df1_1 = pd.DataFrame([x1])
df1_1 = df1_1.replace(0, np.nan)
df1_1 = df1_1.dropna(how='all', axis=0)
df1_1 = df1_1.dropna()
df1 = pd.DataFrame.transpose(df1_1)
df1.to_csv("output.csv")
My data looks like this: [image not available]
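The approach described in the question (take differences, drop the zeros, count sign changes) can be sketched with NumPy alone. This is only an illustration under assumptions: the function name count_direction_changes and the sample signals are hypothetical, and noisy data would likely need smoothing first. Each direction change is one turning point (a peak or a trough), so a full oscillation corresponds to two of them. For peak detection specifically, scipy.signal.find_peaks is an existing alternative.

```python
import numpy as np


def count_direction_changes(values):
    """Count slope sign changes in a 1-D signal, ignoring flat segments."""
    diffs = np.diff(np.asarray(values, dtype=float))
    diffs = diffs[diffs != 0]      # leave out zero differences (flat cells)
    signs = np.sign(diffs)         # +1 for rising, -1 for falling
    # A direction change is a pair of neighbouring slopes with different signs
    return int(np.count_nonzero(signs[1:] != signs[:-1]))


# Hypothetical signals for illustration
simple = [0, 1, 2, 1, 0, 1, 2]
with_flats = [0, 1, 1, 2, 2, 1, 0, 0, 1]

print(count_direction_changes(simple))
print(count_direction_changes(with_flats))
```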

Pandas/sklearn: Vectorize large number of LinearRegression calculations

I have a Pandas DataFrame where I need to calculate a large number of regression coefficients. Each regression has only two variables. The independent variable is the ['Base'] column, which is the same for all cases. The dependent variables are organized as columns in my DataFrame.
This is easy to accomplish with a for loop, but my real-life DataFrame has thousands of columns on which to run the regression, so it takes forever. Is there a vectorized way to accomplish this?
Below is an MRE:
import pandas as pd
import numpy as np
from sklearn import linear_model
import time
df_data = {
'Base':np.random.randint(1, 100, 1000),
'Adder':np.random.randint(-3, 3, 1000)}
df = pd.DataFrame(data=df_data)
result_df = pd.DataFrame()
df['Thing1'] = df['Base'] * 3 + df['Adder']
df['Thing2'] = df['Base'] * 6 + df['Adder']
df['Thing3'] = df['Base'] * 12 + df['Adder']
df['Thing4'] = df['Base'] * 4 + df['Adder']
df['Thing5'] = df['Base'] * 2.67 + df['Adder']
things = ['Thing1', 'Thing2', 'Thing3', 'Thing4', 'Thing5']
for t in things:
    reg = linear_model.LinearRegression()
    X, y = df['Base'].values.reshape(-1,1), df[t].values.reshape(-1,1)
    reg.fit(X, y)
    b = reg.coef_[0][0]
    result_df.loc[t, 'Beta'] = b
print(result_df.to_string())
You can use np.polyfit for linear regression:
pd.DataFrame(np.polyfit(df['Base'], df.filter(like='Thing'), deg=1)).T
Output:
           0         1
0   3.002379 -0.714256
1   6.002379 -0.714256
2  12.002379 -0.714256
3   4.002379 -0.714256
4   2.672379 -0.714256
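@Quang-Hoang's idea of using df.filter solves the problem. If you really want to use sklearn, this also works:
from sklearn.metrics import r2_score
reg = linear_model.LinearRegression()
X = df['Base'].values.reshape(-1,1)
y = df.filter(items=things).values
reg.fit(X, y)
result_df['Betas'] = reg.coef_.ravel()  # coef_ has shape (n_targets, 1)
y_predict = reg.predict(X)
# multioutput='raw_values' gives one R^2 per column instead of a single average
result_df['Rsq'] = r2_score(y, y_predict, multioutput='raw_values')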
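For a single shared regressor, the least-squares slope also has a closed form, cov(x, y) / var(x), which can be evaluated for all columns at once with plain NumPy. The sketch below is an alternative illustration and not part of the answers above; it rebuilds a smaller version of the MRE with a seeded generator, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Smaller version of the MRE, seeded for reproducibility
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Base": rng.integers(1, 100, 1000),
    "Adder": rng.integers(-3, 3, 1000),
})
for name, factor in {"Thing1": 3, "Thing2": 6, "Thing3": 12}.items():
    df[name] = df["Base"] * factor + df["Adder"]

targets = df.filter(like="Thing")
x = df["Base"].to_numpy(dtype=float)
Y = targets.to_numpy(dtype=float)

# Centre x and every target column, then apply slope = cov(x, y) / var(x)
xc = x - x.mean()
Yc = Y - Y.mean(axis=0)
betas = xc @ Yc / (xc @ xc)
intercepts = Y.mean(axis=0) - betas * x.mean()

result = pd.DataFrame({"Beta": betas, "Intercept": intercepts}, index=targets.columns)
print(result)
```

The matrix product handles every target column in one pass, so the cost grows with the size of the data rather than with the number of Python-level loop iterations.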

Python function for MA and MACD has "ValueError: negative dimensions are not allowed"

I am trying to analyze historical data in a CSV file using pandas. I found on Quantopian that, without talib (which failed to install), we can use the function code below for the analysis. However, when I analyze using the MA and MACD functions, I encounter two problems:
1. MA is not calculated correctly
2. The MACD part raises "ValueError: negative dimensions are not allowed"
Which part should I correct?
My code is as follows:
import numpy
import pandas as pd
#Moving Average
def MA(df, n):
    MA = pd.Series(pd.rolling_mean(df['Close'], n), name = 'MA_' + str(n))
    df = df.join(MA)
    return df

#MACD, MACD Signal and MACD difference
def MACD(df, n_fast, n_slow):
    EMAfast = pd.Series(pd.ewma(df['Close'], span = n_fast, min_periods = n_slow - 1))
    EMAslow = pd.Series(pd.ewma(df['Close'], span = n_slow, min_periods = n_slow - 1))
    MACD = pd.Series(EMAfast - EMAslow, name = 'MACD_' + str(n_fast) + '_' + str(n_slow))
    MACDsign = pd.Series(pd.ewma(MACD, span = 9, min_periods = 8), name = 'MACDsign_' + str(n_fast) + '_' + str(n_slow))
    MACDdiff = pd.Series(MACD - MACDsign, name = 'MACDdiff_' + str(n_fast) + '_' + str(n_slow))
    df = df.join(MACD)
    df = df.join(MACDsign)
    df = df.join(MACDdiff)
    return df

data = pd.read_csv("NAIM.csv", index_col='Stock', usecols =[0,6])
print data.head(3)
vol = data['Close']
print vol
print MA(data,5)
print MACD(data,12,26)
The CSV file looks like this:
Stock,Date,Time,Open,High,Low,Close,Volume
NAIM,2015-01-02,00:00:00,2.9,3.0,2.9,3.0,46900
NAIM,2015-01-05,00:00:00,2.95,3.05,2.92,3.05,225900
NAIM,2015-01-06,00:00:00,2.95,2.96,2.9,2.9,682000
NAIM,2015-01-07,00:00:00,2.88,2.95,2.88,2.9,160900
.
.
.
NAIM,2016-01-06,00:00:00,2.48,2.61,2.47,2.6,3260900
NAIM,2016-01-07,00:00:00,2.64,2.74,2.6,2.65,3906100
NAIM,2016-01-08,00:00:00,2.65,2.71,2.62,2.64,1875000
NAIM,2016-01-11,00:00:00,2.65,2.7,2.65,2.68,1089400
NAIM,2016-01-12,00:00:00,2.68,2.71,2.65,2.69,965200
NAIM,2016-01-13,00:00:00,2.69,2.74,2.69,2.73,2091500
NAIM,2016-01-14,00:00:00,2.71,2.71,2.66,2.66,1206000
NAIM,2016-01-15,00:00:00,2.66,2.67,2.62,2.62,738600
My Python shell showed this output:
[Image: output from Python shell after running the script]
EMAslow = pd.Series(pd.ewma(df['Close'], span = n_slow, min_periods = n_slow - 1))
EMAfast = pd.Series(pd.ewma(df['Close'], span = n_fast, min_periods = n_slow - 1))
I think you need to change EMAfast to use:
min_periods = n_fast - 1
I think the lack of full periods on your fast EMA is causing a negative Convergence value and is causing your error.
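As a side note, pd.rolling_mean and pd.ewma were removed from later pandas releases, where the same calculations go through Series.rolling and Series.ewm. The sketch below shows how the two functions might look with that API, with the fast EMA using min_periods = n_fast - 1 as suggested above. The snake_case function names, the n_signal parameter, and the sample prices are assumptions for illustration.

```python
import pandas as pd


def moving_average(df, n):
    """Append an n-period simple moving average of Close."""
    ma = df["Close"].rolling(window=n).mean().rename(f"MA_{n}")
    return pd.concat([df, ma], axis=1)


def macd(df, n_fast, n_slow, n_signal=9):
    """Append MACD, its signal line and their difference."""
    ema_fast = df["Close"].ewm(span=n_fast, min_periods=n_fast - 1).mean()
    ema_slow = df["Close"].ewm(span=n_slow, min_periods=n_slow - 1).mean()
    macd_line = (ema_fast - ema_slow).rename(f"MACD_{n_fast}_{n_slow}")
    signal = (
        macd_line.ewm(span=n_signal, min_periods=n_signal - 1)
        .mean()
        .rename(f"MACDsign_{n_fast}_{n_slow}")
    )
    diff = (macd_line - signal).rename(f"MACDdiff_{n_fast}_{n_slow}")
    return pd.concat([df, macd_line, signal, diff], axis=1)


# Hypothetical prices, only for illustration
data = pd.DataFrame({"Close": [float(i) for i in range(1, 41)]})
print(moving_average(data, 3).head())
print(macd(data, 12, 26).tail())
```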
