How can I get the last value of STOCHRSI with TA-Lib? - python

I implemented it, but it prints all of the values.
print(ta.STOCHRSI(df["close"], 14, 5, 3, 0)[-1])
2022-04-20 17:00:00 NaN
2022-04-20 18:00:00 NaN
2022-04-20 19:00:00 NaN
2022-04-20 20:00:00 NaN
2022-04-20 21:00:00 NaN
...
2022-04-28 20:00:00 79.700101
2022-04-28 21:00:00 0.000000
2022-04-28 22:00:00 0.000000
2022-04-28 23:00:00 44.877738
2022-04-29 00:00:00 65.792554
Length: 200, dtype: float64
I just want the most recent value of STOCHRSI as a single float. How can I get it?
Or, if I want the average of the most recent 3 values, how can I implement that?

If you really mean the library TA-Lib, then as far as I know its syntax is different from yours.
Streaming API: "An experimental Streaming API was added that allows users to compute the latest value of an indicator. This can be faster than using the Function API, for example in an application that receives streaming data and wants to know just the most recent updated indicator value."
This works with SMA, but the assert fails with STOCHRSI if I use a tolerance smaller than 5.
And to calculate the indicator you need a quote history. You probably noticed that the first values are empty, since there is not yet enough data for the indicator's period.
You can try the following: determine how much data is needed for a correct calculation of the indicator, and then feed only that much of the array.
If resources allow, you can also calculate all the values, keep them in a variable, and take only the last element, fastk[-1].
import talib
from talib import stream
sma = talib.SMA(df["close"], timeperiod=14)
latest = stream.SMA(df["close"], timeperiod=14)
assert abs(sma[-1] - latest) < 0.00001
print(sma[-1], latest)  # 1.6180066666666686 1.6180066666666668
fastk, fastd = talib.STOCHRSI(df["close"], timeperiod=14, fastk_period=5, fastd_period=3, fastd_matype=0)
f, fd = stream.STOCHRSI(df["close"], timeperiod=14, fastk_period=5, fastd_period=3, fastd_matype=0)
print(fastk[-1], f)  # 64.32089013974793 59.52628987038199
assert abs(fastk[-1] - f) < 5
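To answer the second part of the question (the average of the most recent 3 values) directly, here is a minimal sketch, assuming fastk comes from the Function API call above; it works whether fastk is a numpy array or a pandas Series:
import numpy as np
k = np.asarray(fastk)  # most recent values are at the end
last_k = float(k[-1])  # latest STOCHRSI %K as a single float
avg_last3_k = float(np.nanmean(k[-3:]))  # average of the last three values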
Use a condition for the signal line crossing the main line from bottom to top:
if fastd[100] < fastk[100] and fastd[101] > fastk[101]:
    pass  # replace pass with your code
I also drew an indicator under the main chart to show what it looks like.
import matplotlib.pyplot as plt
import pandas as pd
import talib
date = df.iloc[:, 0].index.date
x = len(df)
fig, ax = plt.subplots(2)
ax[0].plot(date[x-100:], df.iloc[x-100:, 3])  # price (column 3) in the top panel
ax[1].plot(date[x-100:], fastk[x-100:])  # STOCHRSI %K
ax[1].plot(date[x-100:], fastd[x-100:])  # STOCHRSI %D
fig.autofmt_xdate()
plt.show()
I wrote some code to determine the minimum data length required for a correct calculation of the indicator.
x = len(df["C"])
fastk, fastd = talib.STOCHRSI(df["C"].values, timeperiod=14, fastk_period=5, fastd_period=3, fastd_matype=0)
fk = np.round(fastk[x - 3:], 5)
fd = np.round(fastd[x - 3:], 5)
print('fk', fk, 'fd', fd)
Output
fk [100. 32.52114 0. ] fd [43.27353 54.11391 44.17371]
Next, we find the desired length of the array.
for depth in range(10, 250, 5):
    fastk, fastd = talib.STOCHRSI(df["C"].values[x - depth:], timeperiod=14, fastk_period=5, fastd_period=3,
                                  fastd_matype=0)
    if (fk == np.round(fastk[depth - 3:], 5)).all() and (fd == np.round(fastd[depth - 3:], 5)).all():
        print('fastk[depth-3:]', fastk[depth - 3:], 'fastd[depth-3:]', fastd[depth - 3:])
        print('stop iteration required length', depth)
        break
Output
fastk[depth-3:] [100. 32.52113882 0. ] fastd[depth-3:] [43.27353345 54.11391306 44.17371294]
stop iteration required length 190
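Once the required history length is known (190 bars here), a sketch of feeding only that much data and taking just the latest value, under the same parameters as above, could look like this:
required = 190  # found by the search above; it depends on the data and the parameters
fastk, fastd = talib.STOCHRSI(df["C"].values[-required:], timeperiod=14, fastk_period=5, fastd_period=3, fastd_matype=0)
latest_k = fastk[-1]  # a single float, matching the full-history value to ~5 decimal places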

Related

How to find the average of data samples at random intervals in python?

I have temperature data stored in a CSV file which, when plotted, looks like the image below. How do I find the average during each interval when the temperature goes above 12? The result should be T1, T2, T3, i.e. the average temperature during each interval in which its value is above 12.
Could you please suggest how to achieve this in Python?
(The highlighted areas in the image are approximately the intervals over which I need to calculate the average.)
Please find below sample data:
R3,R4
1,11
2,11
3,11
4,11
5,11
6,15.05938512
7,15.12975992
8,15.05850141
18,15.1677708
19,15.00921862
20,15.00686921
21,15.01168888
22,11
23,11
24,11
25,11
26,11
27,15.05938512
28,15.12975992
29,15.05850141
30,15.00328706
31,15.12622611
32,15.01479819
33,15.17778891
34,15.01411488
35,9
36,9
37,9
38,9
39,16.16042435
40,16.00091253
41,16.00419677
42,16.15381827
43,16.0471766
44,16.03725301
45,16.13925003
46,16.00072279
47,11
48,1
In pandas, one idea would be to group the data based on the condition T > 12 and use mean as the aggregation function. Example:
import pandas as pd
# a dummy df:
df = pd.DataFrame({'T': [11, 13, 13, 10, 14]})
# set the condition
m = df['T'] > 12
# define groups
grouper = (~m).cumsum().where(m)
# ...looks like
# 0 NaN
# 1 1.0
# 2 1.0
# 3 NaN
# 4 2.0
# Name: T, dtype: float64
# now we can easily calculate the mean for each group:
grp_mean = df.groupby(grouper)['T'].mean()
# T
# 1.0 13
# 2.0 14
# Name: T, dtype: int64
Note: if you have noisy data (T jumps up and down), it might be clever to apply a filter first (savgol, median etc. - whatever is appropriate) so you don't end up with groups caused by the noise.
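For example, a minimal sketch of smoothing with a rolling median before building the mask (the window size of 3 is an arbitrary assumption):
smooth = df['T'].rolling(3, center=True, min_periods=1).median()  # smooth first
m = smooth > 12  # then build the mask on the smoothed values
grouper = (~m).cumsum().where(m)
grp_mean = df.groupby(grouper)['T'].mean()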
I couldn't find a good pattern for this - here's a clunky bit of code that does what you want, though.
In general, use .shift() to find transition points, and use groupby with transform to get your means.
import numpy as np
import pandas as pd
#if you had a csv with Dates and Temps, do this:
#tempsDF = pd.read_csv("temps.csv", names=["Date", "Temp"])
#tempsDF.set_index("Date", inplace=True)
#Using fake data since I don't have your csv
tempsDF = pd.DataFrame({'Temp': [0, 13, 14, 13, 8, 7, 5, 0, 14, 16, 16, 0, 0, 0]})
#This is a bit clunky - I bet there's a more elegant way to do it
tempsDF["CumulativeFlag"] = 0
tempsDF.loc[tempsDF["Temp"] > 12, "CumulativeFlag"] = 1
#rows where the flag switches from 0 to 1 start a new high-temperature group
starts = tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift()
tempsDF.loc[starts, "HighTempGroup"] = list(range(1, starts.sum() + 1))
tempsDF["HighTempGroup"].fillna(method='ffill', inplace=True)
tempsDF.loc[tempsDF["Temp"] <= 12, "HighTempGroup"] = None
tempsDF["HighTempMean"] = tempsDF.groupby("HighTempGroup").transform(np.mean)["Temp"]

ValueError: Array conditional must be same shape as self

I am a super noob in pandas and I am following a tutorial that is obviously outdated.
I have this simple script, and when I run it I get this error:
ValueError: Array conditional must be same shape as self
# loading the class data from the package pandas_datareader
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
# Adj Close:
# The closing price of the stock that adjusts the price of the stock for corporate actions.
# This price takes into account the stock splits and dividends.
# The adjusted close is the price we will use for this example.
# Indeed, since it takes into account splits and dividends, we will not need to adjust the price manually.
# First day
start_date = '2014-01-01'
# Last day
end_date = '2018-01-01'
# Call the function DataReader from the class data
goog_data = data.DataReader('GOOG', 'yahoo', start_date, end_date)
goog_data_signal = pd.DataFrame(index=goog_data.index)
goog_data_signal['price'] = goog_data['Adj Close']
goog_data_signal['daily_difference'] = goog_data_signal['price'].diff()
goog_data_signal['signal'] = 0.0
# this line produces the error
goog_data_signal['signal'] = pd.DataFrame.where(goog_data_signal['daily_difference'] > 0, 1.0, 0.0)
goog_data_signal['positions'] = goog_data_signal['signal'].diff()
print(goog_data_signal.head())
I am trying to understand the theory, the libraries and the methodology through practicing so bear with me if it is too obvious... :]
Here the where method is being called on a DataFrame, but you only need to check the condition for a Series, so I found two ways to solve this problem:
The newer where method doesn't support setting a value for the rows where the condition is true (1.0 in your case), but it still supports setting a value for the false rows (called the other parameter in the docs). So you can set the 1.0's manually later, as follows:
goog_data_signal['signal'] = goog_data_signal.where(goog_data_signal['daily_difference'] > 0, other=0.0)
# the true rows will retain their values and you can set them to 1.0 as needed.
Or you can check the condition directly as follows:
goog_data_signal['signal'] = (goog_data_signal['daily_difference'] > 0).astype(int)
The second method produces the output for me:
price daily_difference signal positions
Date
2014-01-02 554.481689 NaN 0 NaN
2014-01-03 550.436829 -4.044861 0 0.0
2014-01-06 556.573853 6.137024 1 1.0
2014-01-07 567.303589 10.729736 1 0.0
2014-01-08 568.484192 1.180603 1 0.0
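For completeness, numpy's where does accept both branch values, so if the outdated tutorial intended something along those lines (an assumption on my part), an equivalent one-liner would be:
import numpy as np
goog_data_signal['signal'] = np.where(goog_data_signal['daily_difference'] > 0, 1.0, 0.0)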

RuntimeWarning: invalid value encountered in log """Entry point for launching an IPython kernel

I got this warning: RuntimeWarning: invalid value encountered in log """Entry point for launching an IPython kernel.
while trying this:
IN1:
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
IN2:
tickers = ['BP', 'F', 'XOM', 'LNC', 'AAPL']
sec_data = pd.DataFrame()
for t in tickers:
    sec_data[t] = wb.DataReader(t, data_source='yahoo', start='2000-1-1')['Adj Close']
IN3:
sec_returns = np.log(sec_data / sec_data.shift(1))
sec_returns
OUT3:
BP F XOM LNC AAPL
Date
2000-01-03 NaN NaN NaN NaN NaN
2000-01-04 -0.005328 -0.033984 -0.019340 -0.029223 -0.088078
2000-01-05 0.033616 0.003697 0.053082 -0.035209 0.014528
2000-01-06 0.002064 0.001230 0.050405 0.018136 -0.090514
2000-01-07 -0.018731 0.071119 -0.002939 0.025022 0.046281
... ... ... ... ... ...
2020-01-21 -0.011675 0.005444 -0.014397 -0.025472 -0.006800
2020-01-22 -0.011549 -0.005444 -0.005788 0.003241 0.003563
2020-01-23 0.008412 -0.002186 -0.006271 -0.006664 0.004804
2020-01-24 -0.001834 -0.015436 -0.006762 -0.030991 -0.002886
2020-01-27 -0.018262 -0.012297 -0.024112 -0.034176 -0.029846
5048 rows × 5 columns
C:\Program Files\Anaconda\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in log """Entry point for launching an IPython kernel.
Is there any chance to avoid this RuntimeWarning?
Maybe it's because of negative values? But I need them.
P.S.- doing that on windows 10, jupyter-notebook.
Log isn't defined for negative values, only for positive ones. You simply can't take the log of a negative value. That's not a Python problem, it's a math problem.
Why does it work without any RuntimeWarning in this case?
IN1:
import numpy as np
from pandas_datareader import data as wb
IN2:
MSFT = wb.DataReader('MSFT', data_source='yahoo', start='1995-1-1')
MSFT
IN3:
MSFT['log_return'] = np.log(MSFT['Adj Close'] / MSFT['Adj Close'].shift(1))
MSFT['log_return']
OUT3:
Date
1995-01-03 NaN
1995-01-04 0.007243
1995-01-05 -0.016632
1995-01-06 0.016632
1995-01-09 -0.006205
...
2020-01-22 -0.004816
2020-01-23 0.006137
2020-01-24 -0.010128
2020-01-27 -0.016865
2020-01-28 0.019769
Name: log_return, Length: 6312, dtype: float64
Almost certainly the problem is in the data returned by Yahoo. Having had the exact same issue as you, I've tried the same code using (a) different tickers (which is effectively what you've done by indexing just the MSFT column) and (b) different date ranges and in both cases avoided the problem. I've been unable so far to identify an example of the data problem but when I do I will post.
PS the course does mention early on that the returned data may not always be clean but so far they haven't talked about mitigating techniques!
EDIT: I take that back. Over the date range 2007 to today, the log computation fails with ANY ticker list with more than two elements (as far as I can find). Alternatively a longer ticker list with a shorter date range succeeds. Suggests hitting some kind of limit but surely numpy and pandas are designed to work with bigger arrays than this?
EDIT 2: Having experimented with different ticker counts and date ranges, it seemed that the log() operation would issue the warning if the dataframe contained more than 8000-and-something cells. To eliminate the specifics of the yahoo data source and the pandas_datareader library, I wrote this:
import numpy as np
import pandas as pd
eles = 8192
cols = 2
arr = pd.DataFrame(np.arange(1, eles+1).reshape((int(eles/cols), cols)))
print(arr.head())
logarr = np.log(arr / arr.shift(1))
#logarr = arr / arr.shift(1)
#logarr = np.log(arr)
#logarr = np.log(arr / arr.add(3))
print(logarr.head())
Irrespective of the shape of the array, the warning is issued if the number of elements is greater than 8192. The commented variants do not show this problem: it only affects (as far as I have found) the combination of numpy.log() and pandas.DataFrame.shift().
8192, of course, is a power of 2 (8192 = 2^13), so this suggests (to me) a bug or limit affecting the interaction between numpy and pandas. Or am I missing something?
Of course it IS "just" a warning. The returned DataFrame seems to be complete and usable. You can suppress it with
import warnings
warnings.simplefilter(action='ignore', category=RuntimeWarning)
import pandas as pd
although suppressing runtime warnings across the board would make me feel rather uncomfortable.
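If you only want to silence the warning for the specific log call rather than globally, numpy's errstate context manager is a narrower option (a sketch, reusing sec_data from the question):
with np.errstate(invalid='ignore'):
    sec_returns = np.log(sec_data / sec_data.shift(1))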
EDIT 3: After all that, it turns out that the answer is to upgrade numpy and pandas to the latest version (pandas: 1.0.3 and numpy: 1.18.2 at 2020-04-04). Doh. There's an important lesson!

How to apply a function to multiple columns of a Dask Data Frame in parallel?

I have a Dask DataFrame for which I would like to compute the skewness of a list of columns; if the skewness exceeds a certain threshold, I correct it using a log transformation. I am wondering whether there is a more efficient way to make the correct_skewness() function work on multiple columns in parallel, by removing the for loop in the correct_skewness() function below:
import dask
import dask.array as da
from scipy import stats
# Create a dataframe
df = dask.datasets.timeseries()
df.head()
id name x y
timestamp
2000-01-01 00:00:00 1032 Oliver 0.018604 0.089191
2000-01-01 00:00:01 1032 Norbert 0.666689 -0.979374
2000-01-01 00:00:02 991 Victor 0.027691 -0.474660
2000-01-01 00:00:03 979 Kevin 0.320067 0.656949
2000-01-01 00:00:04 1087 Zelda -0.462076 0.513409
def correct_skewness(columns=None, max_skewness=2):
    if columns is None:
        raise ValueError(
            f"columns argument is None. Please set columns argument to a list of columns"
        )
    for col in columns:
        skewness = stats.skew(df[col])
        max_val = df[col].max().compute()
        min_val = df[col].min().compute()
        if abs(skewness) > max_skewness and (max_val > 1 or min_val < 0):
            delta = 1.0
            if min_val < 0:
                delta = max(1, -min_val + 1)
            df[col] = da.log(delta + df[col])
    return df
df = correct_skewness(columns=['x', 'y'])
There are a couple things you can do to improve parallelism in this example:
You can use dask.array.stats.skew rather than scipy.stats.skew. You will have to import dask.array.stats explicitly.
You can compute the min/max of all columns in one computation
mins = [df[col].min() for col in cols]
maxes = [df[col].max() for col in cols]
skews = [da.stats.skew(df[col]) for col in cols]
mins, maxes, skews = dask.compute(mins, maxes, skews)
Then you could do your if-logic and apply da.log as appropriate. This still requires two passes over your data, but that should be a nice improvement over what you have now.
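Putting those pieces together, a rough sketch of the rewritten function under those assumptions (dask.array.stats imported, df defined as in the question; the function name and the to_dask_array conversion are my own additions) might look like this:
import dask
import dask.array as da
import dask.array.stats  # makes da.stats.skew available
def correct_skewness_parallel(df, columns, max_skewness=2):
    # build lazy task graphs for every statistic, then evaluate them in a single pass
    mins = [df[col].min() for col in columns]
    maxes = [df[col].max() for col in columns]
    skews = [da.stats.skew(df[col].to_dask_array(lengths=True)) for col in columns]
    mins, maxes, skews = dask.compute(mins, maxes, skews)
    for col, mn, mx, sk in zip(columns, mins, maxes, skews):
        if abs(sk) > max_skewness and (mx > 1 or mn < 0):
            delta = max(1, -mn + 1) if mn < 0 else 1.0
            df[col] = da.log(delta + df[col])  # same transformation as the original
    return df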

Timestamp from date, and formatting in panda or csv

I have a function that outputs a dataframe generated from a RINEX (GPS) file. At present I get the dataframe output into separate satellite (1-32) files. I'd like to access the first column (either while it's still a dataframe or in these new files) in order to convert the date to a timestamp in seconds, like below:
Epochs Epochs
2014-04-27 00:00:00 -> 00000
2014-04-27 00:00:30 -> 00030
2014-04-27 00:01:00 -> 00060
This requires stripping the date away, then converting hh:mm:ss to seconds. I've hit a wall trying to figure out how best to access this first column (Epochs) and then apply the conversion to the entire column. The code I have been working on is:
def read_data(self, RINEXfile):
    obs_data_chunks = []
    while True:
        obss, _, _, epochs, _ = self.read_data_chunk(RINEXfile)
        if obss.shape[0] == 0:
            break
        obs_data_chunks.append(pd.Panel(
            np.rollaxis(obss, 1, 0),
            items=['G%02d' % d for d in range(1, 33)],
            major_axis=epochs,
            minor_axis=self.obs_types
        ).dropna(axis=0, how='all').dropna(axis=2, how='all'))
    obs_data_chunks_dataframe = obs_data_chunks[0]
    for sv in range(32):
        sat = obs_data_chunks_dataframe[sv, :]
        print "sat_columns: {0}".format(sat.columns[0])  # list header of first column: L1
        sat.to_csv(('SV_{0}').format(sv+1), index_label="Epochs", sep='\t')
Do I perform this conversion within the dataframe, i.e. on "sat", or on the files after using to_csv? I'm a bit lost here. Same question for formatting the columns. See the not-so-nicely formatted columns below:
Epochs L1 L2 P1 P2 C1 S1 S2
2014-04-27 00:00:00 669486.833 530073.33 24568752.516 24568762.572 24568751.442 43.0 38.0
2014-04-27 00:00:30 786184.519 621006.551 24590960.634 24590970.218 24590958.374 43.0 38.0
2014-04-27 00:01:00 902916.181 711966.252 24613174.234 24613180.219 24613173.065 42.0 38.0
2014-04-27 00:01:30 1019689.006 802958.016 24635396.428 24635402.41 24635395.627 42.0 37.0
2014-04-27 00:02:00 1136478.43 893962.705 24657620.079 24657627.11 24657621.828 42.0 37.0
UPDATE:
When I said I've hit a wall trying to figure out how best to access this first column (Epochs): the "sat" dataframe originally had no "Epochs" in its header. It simply had the signals:
L1 L2 P1 P2 C1 S1 S2
The index (date & time) was missing from the header. To overcome this in my CSV output files, I "forced" the name with:
sat.to_csv(('SV_{0}').format(sv+1), index_label="Epochs", sep='\t')
I would expect that, before generating the CSV files, I should (but don't know how) be able to access this index (date & time) column and convert all the dates/times in one sweep, so that timestamps are output.
UPDATE:
The epochs are generated in the dataframe in another function as so:
epochs = np.zeros(CHUNK_SIZE, dtype='datetime64[us]')
UPDATE:
def read_data_chunk(self, RINEXfile, CHUNK_SIZE=10000):
    obss = np.empty((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.float64) * np.NaN
    llis = np.zeros((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
    signal_strengths = np.zeros((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
    epochs = np.zeros(CHUNK_SIZE, dtype='datetime64[us]')
    flags = np.zeros(CHUNK_SIZE, dtype=np.uint8)
    i = 0
    while True:
        hdr = self.read_epoch_header(RINEXfile)
        #print hdr
        if hdr is None:
            break
        epoch, flags[i], sats = hdr
        epochs[i] = np.datetime64(epoch)
        sat_map = np.ones(len(sats)) * -1
        for n, sat in enumerate(sats):
            if sat[0] == 'G':
                sat_map[n] = int(sat[1:]) - 1
        obss[i], llis[i], signal_strengths[i] = self.read_obs(RINEXfile, len(sats), sat_map)
        i += 1
        if i >= CHUNK_SIZE:
            break
    return obss[:i], llis[:i], signal_strengths[:i], epochs[:i], flags[:i]
UPDATE:
My apologies if my description was somewhat vague. I'm actually modifying code that was already developed, and I'm not a software developer, so it's a steep learning curve for me too. Let me explain further: the "Epochs" are read in another function:
def read_epoch_header(self, RINEXfile):
    epoch_hdr = RINEXfile.readline()
    if epoch_hdr == '':
        return None
    year = int(epoch_hdr[1:3])
    if year >= 80:
        year += 1900
    else:
        year += 2000
    month = int(epoch_hdr[4:6])
    day = int(epoch_hdr[7:9])
    hour = int(epoch_hdr[10:12])
    minute = int(epoch_hdr[13:15])
    second = int(epoch_hdr[15:18])
    microsecond = int(epoch_hdr[19:25])  # Discard the least significant digits (use microseconds only).
    epoch = datetime.datetime(year, month, day, hour, minute, second, microsecond)
    flag = int(epoch_hdr[28])
    if flag != 0:
        raise ValueError("Don't know how to handle epoch flag %d in epoch header:\n%s", (flag, epoch_hdr))
    n_sats = int(epoch_hdr[29:32])
    sats = []
    for i in range(0, n_sats):
        if ((i % 12) == 0) and (i > 0):
            epoch_hdr = RINEXfile.readline()
        sats.append(epoch_hdr[(32+(i%12)*3):(35+(i%12)*3)])
    return epoch, flag, sats
In the read_data function above, these are appended into a dataframe. I basically want this dataframe separated along its satellite axis, so that each satellite file has the epochs in the first column, followed by the 7 signals. The last bit of code in read_data (below) shows this:
for sv in range(32):
    sat = obs_data_chunks_dataframe[sv, :]
    print "sat_columns: {0}".format(sat.columns[0])  # list header of first column: L1
    sat.to_csv(('SV_{0}').format(sv+1), index_label="Epochs", sep='\t')
The problem here is that (1) I want the first column to be timestamps (so, strip the date and convert so that midnight = 00000 s and 23:59:59 = 86399 s), not as they are now, and (2) I need the columns to be aligned, so I can eventually manipulate them further with a different class to perform other calculations, e.g. L1 minus L2 plotted against time.
It will be much quicker to do this while it's still a DataFrame. If the dtype is datetime64, just convert to int64 and then divide by the number of nanoseconds in a second:
In [241]:
df['Epochs'].astype(np.int64) // 10**9
Out[241]:
0 1398556800
1 1398556830
2 1398556860
3 1398556890
4 1398556920
Name: Epochs, dtype: int64
If it's a string then convert using to_datetime and then perform the above:
df['Epochs'] = pd.to_datetime(df['Epochs']).astype(np.int64) // 10**9
see related
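Since the question asks for seconds since midnight rather than Unix epoch seconds, a small variation (a sketch; the zero-padding width of 5 comes from the example at the top) would be:
epochs = pd.to_datetime(df['Epochs'])
secs = (epochs - epochs.dt.normalize()).dt.total_seconds().astype(int)  # seconds since each day's midnight
df['Epochs'] = secs.map('{:05d}'.format)  # 00000, 00030, 00060, ...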
I resolved part of this myself in the end: in the read_epoch_header function, I simply added a variable that converts just hh:mm:ss to seconds, and used this as the epoch. It doesn't look that elegant, but it works. I just need to format the header so that it aligns with the columns (and the columns are aligned with each other too). Cheers, pymat
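(For reference, a minimal sketch of that conversion using the hour, minute and second fields already parsed in read_epoch_header; the variable name is my own:)
seconds_since_midnight = hour * 3600 + minute * 60 + second  # 0 at midnight, 86399 at 23:59:59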
