How to prevent zero values from messing up a pandas boxplot? - python

I have a pandas df and, after pivoting, it prints as follows:
country CHINA USA
0 119.02 0.0
1 121.20 0.0
3 112.49 0.0
4 113.94 0.0
5 114.67 0.0
6 111.77 0.0
7 117.57 0.0
......................
......................
6648 0.00 420.0
6649 0.00 420.0
6650 0.00 420.0
6651 0.00 420.0
6652 0.00 420.0
6653 0.00 420.0
6654 0.00 500.0
6655 0.00 500.0
6656 0.00 390.0
6657 0.00 450.0
6658 0.00 420.0
6659 0.00 420.0
6660 0.00 450.0
The method is here:
import pandas as pd
import matplotlib.pyplot as plt

def visualize_box_plot(df):
    df = df[df.outlier != 1]
    df = pd.pivot_table(df,
                        index=df.index,
                        columns=df['country'],
                        values='value',
                        fill_value=0)
    df.CHINA = df.CHINA.round(2)
    df.USA = df.USA.round(2)
    # this prints the table provided earlier
    print(df)
    df_usa = df[(df['USA'] != 0)]
    df_china = df[(df['CHINA'] != 0)]
    # note: as_matrix() is deprecated in newer pandas; .to_numpy() is the replacement
    usa = df_usa.as_matrix()[:, -1]
    china = df_china.as_matrix()[:, 0]
    print("USA:", len(usa), " ", "CHINA:", len(china))
    # unequal length
    # USA: 1673  CHINA: 4384
    x = [china, usa]
    plt.boxplot(x)
    plt.show()
The zero values come from the NaNs filled in during pivoting, and I would like to omit them when making the box plot. So I use the code:
df_usa = df[(df['USA'] != 0)]
df_china = df[(df['CHINA'] != 0)]
That code creates separate DataFrames, which are converted to NumPy matrices; lastly, I visualize them together with matplotlib. One point to consider: the NumPy matrices have unequal lengths, so I can't just call the boxplot function directly on df.
Here is my visualization, where 1 and 2 need to be replaced with CHINA and USA respectively.
The visualization is not good, and I get the feeling there might be a better way to get the job done. Any suggestions? Some sample code would help a lot. You may keep the rounding to 2 digits after the decimal. The main goal is to make the code more elegant and to improve the visualization.

I think the code can be simpler: just replace 0 with NaN and then call DataFrame.boxplot:
print (df.mask(df == 0))
#alternative solution
#print (df.replace(0,np.nan))
CHINA USA
country
0 119.02 NaN
1 121.20 NaN
3 112.49 NaN
4 113.94 NaN
5 114.67 NaN
6 111.77 NaN
7 117.57 NaN
6648 NaN 420.0
6649 NaN 420.0
6650 NaN 420.0
6651 NaN 420.0
6652 NaN 420.0
6653 NaN 420.0
6654 NaN 500.0
6655 NaN 500.0
6656 NaN 390.0
6657 NaN 450.0
6658 NaN 420.0
6659 NaN 420.0
6660 NaN 450.0
df.mask(df == 0).boxplot()
Another possible solution is use DataFrame.plot.box:
df.mask(df == 0).plot.box()
Box Plots in docs
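For completeness, a minimal end-to-end sketch of this approach (the small frame below is invented to stand in for the pivoted data):
import pandas as pd
import matplotlib.pyplot as plt

# stand-in for the pivoted frame; values invented for illustration
df = pd.DataFrame({'CHINA': [119.02, 121.20, 112.49, 0.0, 0.0],
                   'USA': [0.0, 0.0, 0.0, 420.0, 450.0]})

# zeros become NaN, which boxplot skips; the column names become the
# tick labels, so CHINA and USA appear instead of 1 and 2
df.mask(df == 0).boxplot()
plt.show()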

Apart from the numpy nan that jezrael mentioned, there's also the nan you can use from math.
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import math

data = {'c1': [1, 2, 3], 'c2': [5, 3, 0]}
for k in data:  # search and replace zeroes with math.nan
    data[k] = [x if x != 0 else math.nan for x in data[k]]
df = pd.DataFrame(data, columns=list(data.keys()))
df.plot.box(grid=False)  # grid expects a boolean; the string 'False' is truthy
plt.show()

Related

How to extract or split some days from date when date is index or string?

I have 6000 rows and 8 columns, where 'Date' is the index (or I can reset the index so it becomes the first column, with string type). I need to extract the 'Lake_Level' values of records whose date is the second or seventh day of a month (and provide the top 3 and bottom 3 values of the 'Lake_Level' feature). Please show me how to do it. Thank you in advance.
Date Loc_1 Loc_2 Loc_3 Loc_4 Loc_5 Temp Lake_Level Flow_Rate
03/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
04/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
05/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
06/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
07/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
26/06/2021 0.0 0.0 0.0 0.0 0.0 22.50 250.85 0.60
27/06/2021 0.0 0.0 0.0 0.0 0.0 23.40 250.84 0.60
28/06/2021 0.0 0.0 0.0 0.0 0.0 21.50 250.83 0.60
29/06/2021 0.0 0.0 0.0 0.0 0.0 23.20 250.82 0.60
30/06/2021 0.0 0.0 0.0 0.0 0.0 22.75 250.80 0.60
Why not just filter the rows with your desired condition?
You can run queries on your dataset with a pandas DataFrame like below:
If datetimes are in column
df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2,7])]
If datetimes are as indexes
df[pd.to_datetime(df.index, dayfirst=True).day.isin([2,7])]
Here is an example:
In [1]: import pandas as pd
In [2]: import random
In [3]: # random_date() is a helper (not shown in the answer) assumed to
   ...: # return a date string in 'dd/mm/yyyy' form
   ...: df = pd.DataFrame({
   ...:     'Date': [random_date() for _ in range(100)],
   ...:     'Lake_Level': [random.randint(240, 260) for _ in range(100)]
   ...: })
In [4]: df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2,7])]
Out[4]:
Date Lake_Level
2 07/08/2004 245
27 02/12/2017 249
30 02/06/2012 252
51 07/10/2013 257
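For the top 3 / bottom 3 part of the question, a small follow-up sketch (reusing the filtered frame from above):
filtered = df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2, 7])]
print(filtered.nlargest(3, 'Lake_Level'))   # three highest Lake_Level rows
print(filtered.nsmallest(3, 'Lake_Level'))  # three lowest Lake_Level rows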

count values of each month, fill NaN if under certain limit

I am working with a dataframe where every column represents a company. The index is a datetime index with daily frequency. My problem is the following: for each company, I would like to fill a month with NaN if there are fewer than 20 values in that month. In the example below, this would mean that Company_1's entry 0.91 on 2012-08-31 would be changed to NaN, while Company_2 and Company_3 would be unchanged.
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
... ... ... ...
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 0.91 0.51 -0.33
Total Values: 1 22 21
I am struggling to find an efficient way to count the number of values for each month of each stock. I could theoretically write a function that creates a new dataframe reporting the number of values for each month (and each stock), and then use that dataframe to mask the original company data, but I am sure there has to be an easier way. Any help is highly appreciated. Thanks in advance.
groupby the dataframe with monthly frequency and transform using count, then create a boolean mask with Series.lt and use this mask to fill NaN values in the dataframe:
df1 = df.mask(df.groupby(pd.Grouper(freq='M')).transform('count').lt(20))
print(df1)
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
....
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
IIUC (note this masks whole columns that have fewer than 20 values, which matches the single-month sample above):
df.loc[:, df.apply(lambda d: d.notnull().sum() < 20)] = np.NaN
print (df)
Company 1 Company 2 Company 3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
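For reference, a self-contained sketch of the groupby/mask approach on toy data (the dates and values below are invented for illustration):
import pandas as pd
import numpy as np

idx = pd.date_range('2012-08-01', '2012-08-31', freq='D')
rng = np.random.default_rng(0)
df = pd.DataFrame({'Company_1': np.nan,
                   'Company_2': rng.normal(size=len(idx)),
                   'Company_3': rng.normal(size=len(idx))}, index=idx)
df.loc['2012-08-31', 'Company_1'] = 0.91   # a single value in the month

# months with fewer than 20 non-NaN values (per column) are masked to NaN
# note: newer pandas versions prefer freq='ME' over 'M'
df1 = df.mask(df.groupby(pd.Grouper(freq='M')).transform('count').lt(20))
print(df1['Company_1'].count())   # 0 -> the lone value was masked
print(df1['Company_2'].count())   # 31 -> a full month is kept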

Pandas combine two columns into one and exclude NaN values

I have a 5k x 2 column dataframe called "both".
I want to create a new 5k x 1 DataFrame or column (doesn't matter) by replacing any NaN value in one column with the value of the adjacent column.
ex:
Gains Loss
0 NaN NaN
1 NaN -0.17
2 NaN -0.13
3 NaN -0.75
4 NaN -0.17
5 NaN -0.99
6 1.06 NaN
7 NaN -1.29
8 NaN -0.42
9 0.14 NaN
so, for example, I need to swap the NaNs in rows 1 through 5 of the first column with the values in the same rows of the second column, to get a new df of the following form:
Change
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
How do I tell Python to do this?
You may fill the NaN values with zeroes and then simply add your columns:
both["Change"] = both["Gains"].fillna(0) + both["Loss"].fillna(0)
Then, if you need it, you may return the resulting zeroes back to NaNs:
both["Change"].replace(0, np.nan, inplace=True)
The result:
Gains Loss Change
0 NaN NaN NaN
1 NaN -0.17 -0.17
2 NaN -0.13 -0.13
3 NaN -0.75 -0.75
4 NaN -0.17 -0.17
5 NaN -0.99 -0.99
6 1.06 NaN 1.06
7 NaN -1.29 -1.29
8 NaN -0.42 -0.42
9 0.14 NaN 0.14
Finally, if you want to get rid of your original columns, you may drop them:
both.drop(columns=["Gains", "Loss"], inplace=True)
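As a one-line alternative to the fillna/add round-trip (using the original Gains/Loss columns), combine_first takes each value from Gains and falls back to Loss where Gains is NaN, so no zero placeholders are needed:
both["Change"] = both["Gains"].combine_first(both["Loss"])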
There are many ways to achieve this. One is using the loc property:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Price1': [np.nan, np.nan, np.nan, np.nan,
                              np.nan, np.nan, 1.06, np.nan, np.nan],
                   'Price2': [np.nan, -0.17, -0.13, -0.75, -0.17,
                              -0.99, np.nan, -1.29, -0.42]})
df.loc[df['Price1'].isnull(), 'Price1'] = df['Price2']
df = df.loc[:6, 'Price1']
print(df)
Output:
Price1
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
You can see more complex recipes in the Cookbook
IIUC, we can filter for null values and just sum the columns to make your new dataframe.
cols = ['Gains','Loss']
s = df.isnull().cumsum(axis=1).eq(len(df.columns)).any(axis=1)
# add df[cols].isnull() if you only want to measure the price columns for nulls.
df['prices'] = df[cols].loc[~s].sum(axis=1)
df = df.drop(cols,axis=1)
print(df)
prices
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
7 -1.29
8 -0.42
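A possibly shorter variant of the same idea (assuming pandas >= 0.22): sum with min_count=1 already returns NaN for rows where both columns are NaN, so no extra mask is needed:
# min_count=1 keeps the result NaN when both columns are NaN,
# instead of summing them to 0
df['prices'] = df[['Gains', 'Loss']].sum(axis=1, min_count=1)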

Calculate RSI indicator from pandas DataFrame?

My problem
I tried many libraries on GitHub, but none of them produced results matching TradingView, so I followed the formula at this link to calculate the RSI indicator. I calculated it with Excel and checked the results against TradingView. I know it's absolutely correct, but I didn't find a way to calculate it with Pandas.
Formula
RSI = 100 - 100 / (1 + RS)
RS = Average Gain / Average Loss
The very first calculations for average gain and average loss are simple
14-period averages:
First Average Gain = Sum of Gains over the past 14 periods / 14.
First Average Loss = Sum of Losses over the past 14 periods / 14
The second, and subsequent, calculations are based on the prior averages and the current gain or loss:
Average Gain = [(previous Average Gain) x 13 + current Gain] / 14.
Average Loss = [(previous Average Loss) x 13 + current Loss] / 14.
Expected Results
close change gain loss avg_gain avg_loss rs \
0 4724.89 NaN NaN NaN NaN NaN NaN
1 4378.51 -346.38 0.00 346.38 NaN NaN NaN
2 6463.00 2084.49 2084.49 0.00 NaN NaN NaN
3 9838.96 3375.96 3375.96 0.00 NaN NaN NaN
4 13716.36 3877.40 3877.40 0.00 NaN NaN NaN
5 10285.10 -3431.26 0.00 3431.26 NaN NaN NaN
6 10326.76 41.66 41.66 0.00 NaN NaN NaN
7 6923.91 -3402.85 0.00 3402.85 NaN NaN NaN
8 9246.01 2322.10 2322.10 0.00 NaN NaN NaN
9 7485.01 -1761.00 0.00 1761.00 NaN NaN NaN
10 6390.07 -1094.94 0.00 1094.94 NaN NaN NaN
11 7730.93 1340.86 1340.86 0.00 NaN NaN NaN
12 7011.21 -719.72 0.00 719.72 NaN NaN NaN
13 6626.57 -384.64 0.00 384.64 NaN NaN NaN
14 6371.93 -254.64 0.00 254.64 931.605000 813.959286 1.144535
15 4041.32 -2330.61 0.00 2330.61 865.061786 922.291480 0.937948
16 3702.90 -338.42 0.00 338.42 803.271658 880.586374 0.912201
17 3434.10 -268.80 0.00 268.80 745.895111 836.887347 0.891273
18 3813.69 379.59 379.59 0.00 719.730460 777.109680 0.926163
19 4103.95 290.26 290.26 0.00 689.053999 721.601845 0.954895
20 5320.81 1216.86 1216.86 0.00 726.754428 670.058856 1.084613
21 8555.00 3234.19 3234.19 0.00 905.856968 622.197509 1.455899
22 10854.10 2299.10 2299.10 0.00 1005.374328 577.754830 1.740140
rsi_14
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 53.369848
15 48.399038
16 47.704239
17 47.125561
18 48.083322
19 48.846358
20 52.029461
21 59.281719
22 63.505515
My Code
Import
import pandas as pd
import numpy as np
Load data
df = pd.read_csv("rsi_14_test_data.csv")
close = df['close']
print(close)
0 4724.89
1 4378.51
2 6463.00
3 9838.96
4 13716.36
5 10285.10
6 10326.76
7 6923.91
8 9246.01
9 7485.01
10 6390.07
11 7730.93
12 7011.21
13 6626.57
14 6371.93
15 4041.32
16 3702.90
17 3434.10
18 3813.69
19 4103.95
20 5320.81
21 8555.00
22 10854.10
Name: close, dtype: float64
Change
Calculate change every row
change = close.diff(1)
print(change)
0 NaN
1 -346.38
2 2084.49
3 3375.96
4 3877.40
5 -3431.26
6 41.66
7 -3402.85
8 2322.10
9 -1761.00
10 -1094.94
11 1340.86
12 -719.72
13 -384.64
14 -254.64
15 -2330.61
16 -338.42
17 -268.80
18 379.59
19 290.26
20 1216.86
21 3234.19
22 2299.10
Name: close, dtype: float64
Gain and loss
get gain and loss from change
is_gain, is_loss = change > 0, change < 0
gain, loss = change.copy(), -change   # copy so zeroing gain doesn't mutate change
gain[is_loss] = 0
loss[is_gain] = 0

gain.name = 'gain'
loss.name = 'loss'
print(loss)
0 NaN
1 346.38
2 0.00
3 0.00
4 0.00
5 3431.26
6 0.00
7 3402.85
8 0.00
9 1761.00
10 1094.94
11 0.00
12 719.72
13 384.64
14 254.64
15 2330.61
16 338.42
17 268.80
18 0.00
19 0.00
20 0.00
21 0.00
22 0.00
Name: loss, dtype: float64
Calculate first avg gain and loss
Mean of n prior rows
n = 14
avg_gain = change * np.nan
avg_loss = change * np.nan

avg_gain[n] = gain[:n+1].mean()
avg_loss[n] = loss[:n+1].mean()

avg_gain.name = 'avg_gain'
avg_loss.name = 'avg_loss'

avg_df = pd.concat([gain, loss, avg_gain, avg_loss], axis=1)
print(avg_df)
gain loss avg_gain avg_loss
0 NaN NaN NaN NaN
1 0.00 346.38 NaN NaN
2 2084.49 0.00 NaN NaN
3 3375.96 0.00 NaN NaN
4 3877.40 0.00 NaN NaN
5 0.00 3431.26 NaN NaN
6 41.66 0.00 NaN NaN
7 0.00 3402.85 NaN NaN
8 2322.10 0.00 NaN NaN
9 0.00 1761.00 NaN NaN
10 0.00 1094.94 NaN NaN
11 1340.86 0.00 NaN NaN
12 0.00 719.72 NaN NaN
13 0.00 384.64 NaN NaN
14 0.00 254.64 931.605 813.959286
15 0.00 2330.61 NaN NaN
16 0.00 338.42 NaN NaN
17 0.00 268.80 NaN NaN
18 379.59 0.00 NaN NaN
19 290.26 0.00 NaN NaN
20 1216.86 0.00 NaN NaN
21 3234.19 0.00 NaN NaN
22 2299.10 0.00 NaN NaN
The very first calculations for the average gain and average loss are OK, but I don't know how to apply pandas.core.window.Rolling.apply for the second and subsequent ones, because each value depends on the previous row's average and a different column.
It should be something like this:
avg_gain[n] = (avg_gain[n-1]*13 + gain[n]) / 14
My Wish - My Question
What is the best way to calculate and work with technical indicators?
How can I complete the above code in a "Pandas style"?
Does the traditional way of coding with loops reduce performance compared to Pandas?
The average gain and loss are calculated by a recursive formula, which can't be vectorized with numpy. We can, however, try to find an analytical (i.e. non-recursive) solution for calculating the individual elements. Such a solution can then be implemented using numpy; see the Old Answer below. I kept it just for illustrative purposes: it works well with the sample data in the OP but may suffer from numerical underflow for bigger datasets (more than ~1000 rows; thanks to @WarrenNiles for pointing this problem out in the comment below).
A straightforward solution is to loop over numpy arrays (instead of looping over pandas dataframes). This can easily be accelerated with numba: just uncomment the two numba-related lines below:
# import numba

df['change'] = df['close'].diff()
df['gain'] = df.change.mask(df.change < 0, 0.0)
df['loss'] = -df.change.mask(df.change > 0, -0.0)

# @numba.jit
def rma(x, n):
    """Running moving average"""
    a = np.full_like(x, np.nan)
    a[n] = x[1:n+1].mean()
    for i in range(n+1, len(x)):
        a[i] = (a[i-1] * (n - 1) + x[i]) / n
    return a

df['avg_gain'] = rma(df.gain.to_numpy(), 14)
df['avg_loss'] = rma(df.loss.to_numpy(), 14)
df['rs'] = df.avg_gain / df.avg_loss
df['rsi'] = 100 - (100 / (1 + df.rs))
For the 3173-row TSLA dataset linked in the comment below, it takes on my machine:
2 s for the pandas loop solution
23 ms for this array loop solution without numba
4 ms for this array loop solution with numba
Old Answer
Denoting the average gain as y and the current gain as x, we get y[i] = a*y[i-1] + b*x[i], where a = 13/14 and b = 1/14 for n = 14. Unwrapping the recursion leads to the non-recursive form:
y[i] = a^i * y[0] + b * (a^(i-1)*x[1] + a^(i-2)*x[2] + ... + a*x[i-1] + x[i])
This can be efficiently calculated in numpy using cumsum (rma = running moving average):
import pandas as pd
import numpy as np

df = pd.DataFrame({'close': [4724.89, 4378.51, 6463.00, 9838.96, 13716.36, 10285.10,
                             10326.76, 6923.91, 9246.01, 7485.01, 6390.07, 7730.93,
                             7011.21, 6626.57, 6371.93, 4041.32, 3702.90, 3434.10,
                             3813.69, 4103.95, 5320.81, 8555.00, 10854.10]})
n = 14

def rma(x, n, y0):
    a = (n-1) / n
    ak = a**np.arange(len(x)-1, -1, -1)
    return np.r_[np.full(n, np.nan), y0,
                 np.cumsum(ak * x) / ak / n + y0 * a**np.arange(1, len(x)+1)]
df['change'] = df['close'].diff()
df['gain'] = df.change.mask(df.change < 0, 0.0)
df['loss'] = -df.change.mask(df.change > 0, -0.0)
df['avg_gain'] = rma(df.gain[n+1:].to_numpy(), n, np.nansum(df.gain.to_numpy()[:n+1])/n)
df['avg_loss'] = rma(df.loss[n+1:].to_numpy(), n, np.nansum(df.loss.to_numpy()[:n+1])/n)
df['rs'] = df.avg_gain / df.avg_loss
df['rsi_14'] = 100 - (100 / (1 + df.rs))
Output of df.round(2):
close change gain loss avg_gain avg_loss rs rsi rsi_14
0 4724.89 NaN NaN NaN NaN NaN NaN NaN NaN
1 4378.51 -346.38 0.00 346.38 NaN NaN NaN NaN NaN
2 6463.00 2084.49 2084.49 0.00 NaN NaN NaN NaN NaN
3 9838.96 3375.96 3375.96 0.00 NaN NaN NaN NaN NaN
4 13716.36 3877.40 3877.40 0.00 NaN NaN NaN NaN NaN
5 10285.10 -3431.26 0.00 3431.26 NaN NaN NaN NaN NaN
6 10326.76 41.66 41.66 0.00 NaN NaN NaN NaN NaN
7 6923.91 -3402.85 0.00 3402.85 NaN NaN NaN NaN NaN
8 9246.01 2322.10 2322.10 0.00 NaN NaN NaN NaN NaN
9 7485.01 -1761.00 0.00 1761.00 NaN NaN NaN NaN NaN
10 6390.07 -1094.94 0.00 1094.94 NaN NaN NaN NaN NaN
11 7730.93 1340.86 1340.86 0.00 NaN NaN NaN NaN NaN
12 7011.21 -719.72 0.00 719.72 NaN NaN NaN NaN NaN
13 6626.57 -384.64 0.00 384.64 NaN NaN NaN NaN NaN
14 6371.93 -254.64 0.00 254.64 931.61 813.96 1.14 53.37 53.37
15 4041.32 -2330.61 0.00 2330.61 865.06 922.29 0.94 48.40 48.40
16 3702.90 -338.42 0.00 338.42 803.27 880.59 0.91 47.70 47.70
17 3434.10 -268.80 0.00 268.80 745.90 836.89 0.89 47.13 47.13
18 3813.69 379.59 379.59 0.00 719.73 777.11 0.93 48.08 48.08
19 4103.95 290.26 290.26 0.00 689.05 721.60 0.95 48.85 48.85
20 5320.81 1216.86 1216.86 0.00 726.75 670.06 1.08 52.03 52.03
21 8555.00 3234.19 3234.19 0.00 905.86 622.20 1.46 59.28 59.28
22 10854.10 2299.10 2299.10 0.00 1005.37 577.75 1.74 63.51 63.51
Concerning your last question about performance: explicit loops in Python/pandas are terrible, avoid them whenever you can. If you can't, try cython or numba.
There is an easier way: the talib package.
import talib
close = df['close']
rsi = talib.RSI(close, timeperiod=14)
If you'd like Bollinger Bands to go with your RSI that is easy too.
upperBB, middleBB, lowerBB = talib.BBANDS(close, timeperiod=20, nbdevup=2, nbdevdn=2, matype=0)
You can use Bollinger Bands on RSI instead of the fixed reference levels of 70 and 30.
upperBBrsi, MiddleBBrsi, lowerBBrsi = talib.BBANDS(rsi, timeperiod=50, nbdevup=2, nbdevdn=2, matype=0)
Finally, you can normalize RSI using the %b calculation.
normrsi = (rsi - lowerBBrsi) / (upperBBrsi - lowerBBrsi)
info on talib
https://mrjbq7.github.io/ta-lib/
info on Bollinger Bands
https://www.BollingerBands.com
Here is an option. I will only address your second bullet point:
# libraries required
import pandas as pd
import numpy as np
# create dataframe
df = pd.DataFrame({'close': [4724.89, 4378.51, 6463.00, 9838.96, 13716.36, 10285.10,
                             10326.76, 6923.91, 9246.01, 7485.01, 6390.07, 7730.93,
                             7011.21, 6626.57, 6371.93, 4041.32, 3702.90, 3434.10,
                             3813.69, 4103.95, 5320.81, 8555.00, 10854.10]})
df['change'] = df['close'].diff(1)  # Calculate change
# calculate gain / loss from every change
df['gain'] = np.select([df['change'] > 0, df['change'].isna()],
                       [df['change'], np.nan],
                       default=0)
df['loss'] = np.select([df['change'] < 0, df['change'].isna()],
                       [-df['change'], np.nan],
                       default=0)
# create avg_gain / avg_loss columns with all nan
df['avg_gain'] = np.nan
df['avg_loss'] = np.nan
n = 14 # what is the window
# keep first occurrence of rolling mean
df['avg_gain'][n] = df['gain'].rolling(window=n).mean().dropna().iloc[0]
df['avg_loss'][n] = df['loss'].rolling(window=n).mean().dropna().iloc[0]
# Alternatively
df['avg_gain'][n] = df.loc[:n, 'gain'].mean()
df['avg_loss'][n] = df.loc[:n, 'loss'].mean()
# This is not a pandas way (it loops through the series), but it does what you need
for i in range(n+1, df.shape[0]):
    df['avg_gain'].iloc[i] = (df['avg_gain'].iloc[i-1] * (n - 1) + df['gain'].iloc[i]) / n
    df['avg_loss'].iloc[i] = (df['avg_loss'].iloc[i-1] * (n - 1) + df['loss'].iloc[i]) / n
# calculate rs and rsi
df['rs'] = df['avg_gain'] / df['avg_loss']
df['rsi'] = 100 - (100 / (1 + df['rs'] ))
If you want to calculate the RSI of a time series using native pandas calls, you can use the following one-line code:
n=14
df['rsi14'] = 100 - (100 / (1 + df['Close'].diff(1).mask(df['Close'].diff(1) < 0, 0).ewm(alpha=1/n, adjust=False).mean() / df['Close'].diff(1).mask(df['Close'].diff(1) > 0, -0.0).abs().ewm(alpha=1/n, adjust=False).mean()))
And it's even faster than the numpy results (ms / loop):
rows np loop native
23 1.0 1.3 0.8
230 1.1 1.4 0.9
2300 1.1 1.3 0.9
23000 3.4 1.8 1.2
This is the RSI code; replace everything that contains "aa" with your own variable names:
import pandas as pd

rsi_period = 14
df = pd.Series(coinaalist)
chg = df.diff(1)
gain = chg.mask(chg < 0, 0)
loss = chg.mask(chg > 0, 0)
avg_gain = gain.ewm(com=rsi_period - 1, min_periods=rsi_period).mean()
avg_loss = loss.ewm(com=rsi_period - 1, min_periods=rsi_period).mean()
rs = abs(avg_gain / avg_loss)
crplaa = 100 - (100 / (1 + rs))
coinaarsi = crplaa.iloc[-1]
I gave +1 to lepi, however his formula can be made even more pandorable:
n = 14
df['rsi14'] = df['Close'].diff(1).mask(df['Close'].diff(1) < 0, 0).ewm(alpha=1/n, adjust=False).mean().div(df['Close'].diff(1).mask(df['Close'].diff(1) > 0, -0.0).abs().ewm(alpha=1/n, adjust=False).mean()).add(1).rdiv(100).rsub(100)
so div() was used instead of / and add(1).rdiv(100).rsub(100) instead of + - / in other places.
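For readers who find the chained one-liner hard to follow, here is the same ewm-based computation written out step by step. It relies on the observation that Wilder's update avg[i] = (avg[i-1]*(n-1) + x[i]) / n is an exponential moving average with alpha = 1/n; only the seeding of the first value differs slightly from the strict 14-period-SMA start, so treat this as a sketch rather than an exact TradingView match:
n = 14
delta = df['Close'].diff()
# clip(lower=0) keeps gains and zeroes out losses; the negated
# clip(upper=0) does the opposite for losses
gain = delta.clip(lower=0).ewm(alpha=1/n, adjust=False).mean()
loss = (-delta.clip(upper=0)).ewm(alpha=1/n, adjust=False).mean()
df['rsi14'] = 100 - 100 / (1 + gain / loss)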

calculate mean only when the number of values in each rows is higher then certain number in python pandas

I have a daily time series dataframe with nine columns. Each column represents the measurement from a different method. I want to calculate the daily mean only when there are more than two measurements; otherwise I want to assign NaN. How can I do that with a pandas dataframe?
suppose my df looks like:
0 1 2 3 4 5 6 7 8
2000-02-25 NaN 0.22 0.54 NaN NaN NaN NaN NaN NaN
2000-02-26 0.57 NaN 0.91 0.21 NaN 0.22 NaN 0.51 NaN
2000-02-27 0.10 0.14 0.09 NaN 0.17 NaN 0.05 NaN NaN
2000-02-28 NaN NaN NaN NaN NaN NaN NaN NaN 0.14
2000-02-29 0.82 NaN 0.75 NaN NaN NaN 0.14 NaN NaN
and I'm expecting mean values like:
0
2000-02-25 NaN
2000-02-26 0.48
2000-02-27 0.11
2000-02-28 NaN
2000-02-29 0.57
Use DataFrame.where with a condition built from DataFrame.count (which counts per-row values, excluding NaNs) compared via Series.gt (>):
s = df.where(df.count(axis=1).gt(2)).mean(axis=1)
#alternative soluton with changed order
#s = df.mean(axis=1).where(df.count(axis=1).gt(2))
print (s)
2000-02-25 NaN
2000-02-26 0.484
2000-02-27 0.110
2000-02-28 NaN
2000-02-29 0.570
dtype: float64
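A runnable check of this answer against the sample data (the frame below is rebuilt from the question):
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [[np.nan, 0.22, 0.54, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
     [0.57, np.nan, 0.91, 0.21, np.nan, 0.22, np.nan, 0.51, np.nan],
     [0.10, 0.14, 0.09, np.nan, 0.17, np.nan, 0.05, np.nan, np.nan],
     [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 0.14],
     [0.82, np.nan, 0.75, np.nan, np.nan, np.nan, 0.14, np.nan, np.nan]],
    index=pd.to_datetime(['2000-02-25', '2000-02-26', '2000-02-27',
                          '2000-02-28', '2000-02-29']))

# mean per row, kept only where the row has more than two non-NaN values
s = df.mean(axis=1).where(df.count(axis=1).gt(2))
print(s.round(2))   # NaN, 0.48, 0.11, NaN, 0.57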
