Calculate RSI indicator from pandas DataFrame? - python

My problem
I tried many libraries on GitHub, but none of them produced results that matched TradingView, so I followed the formula on this link to calculate the RSI indicator. I calculated it in Excel and compared the results with TradingView; I know the formula is correct, but I haven't found a way to calculate it with Pandas.
Formula
RSI = 100 - 100 / (1 + RS)
RS = Average Gain / Average Loss
The very first calculations for average gain and average loss are simple
14-period averages:
First Average Gain = Sum of Gains over the past 14 periods / 14.
First Average Loss = Sum of Losses over the past 14 periods / 14.
The second, and subsequent, calculations are based on the prior averages and the current gain/loss:
Average Gain = [(previous Average Gain) x 13 + current Gain] / 14.
Average Loss = [(previous Average Loss) x 13 + current Loss] / 14.
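To make the recursion concrete, here it is in plain Python (just a sketch of the formula above with made-up names; it is not the pandas solution I am looking for):
def wilder_rsi(closes, n=14):
    """RSI of the last bar, following the formula above (assumes len(closes) > n)."""
    changes = [closes[i] - closes[i - 1] for i in range(1, len(closes))]
    gains = [max(c, 0.0) for c in changes]
    losses = [max(-c, 0.0) for c in changes]

    # first averages: simple 14-period means
    avg_gain = sum(gains[:n]) / n
    avg_loss = sum(losses[:n]) / n

    # subsequent averages: (previous average * 13 + current value) / 14
    for g, l in zip(gains[n:], losses[n:]):
        avg_gain = (avg_gain * (n - 1) + g) / n
        avg_loss = (avg_loss * (n - 1) + l) / n

    rs = avg_gain / avg_loss   # assumes avg_loss != 0
    return 100 - 100 / (1 + rs)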
Expected Results
close change gain loss avg_gain avg_loss rs \
0 4724.89 NaN NaN NaN NaN NaN NaN
1 4378.51 -346.38 0.00 346.38 NaN NaN NaN
2 6463.00 2084.49 2084.49 0.00 NaN NaN NaN
3 9838.96 3375.96 3375.96 0.00 NaN NaN NaN
4 13716.36 3877.40 3877.40 0.00 NaN NaN NaN
5 10285.10 -3431.26 0.00 3431.26 NaN NaN NaN
6 10326.76 41.66 41.66 0.00 NaN NaN NaN
7 6923.91 -3402.85 0.00 3402.85 NaN NaN NaN
8 9246.01 2322.10 2322.10 0.00 NaN NaN NaN
9 7485.01 -1761.00 0.00 1761.00 NaN NaN NaN
10 6390.07 -1094.94 0.00 1094.94 NaN NaN NaN
11 7730.93 1340.86 1340.86 0.00 NaN NaN NaN
12 7011.21 -719.72 0.00 719.72 NaN NaN NaN
13 6626.57 -384.64 0.00 384.64 NaN NaN NaN
14 6371.93 -254.64 0.00 254.64 931.605000 813.959286 1.144535
15 4041.32 -2330.61 0.00 2330.61 865.061786 922.291480 0.937948
16 3702.90 -338.42 0.00 338.42 803.271658 880.586374 0.912201
17 3434.10 -268.80 0.00 268.80 745.895111 836.887347 0.891273
18 3813.69 379.59 379.59 0.00 719.730460 777.109680 0.926163
19 4103.95 290.26 290.26 0.00 689.053999 721.601845 0.954895
20 5320.81 1216.86 1216.86 0.00 726.754428 670.058856 1.084613
21 8555.00 3234.19 3234.19 0.00 905.856968 622.197509 1.455899
22 10854.10 2299.10 2299.10 0.00 1005.374328 577.754830 1.740140
rsi_14
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 53.369848
15 48.399038
16 47.704239
17 47.125561
18 48.083322
19 48.846358
20 52.029461
21 59.281719
22 63.505515
My Code
Import
import pandas as pd
import numpy as np
Load data
df = pd.read_csv("rsi_14_test_data.csv")
close = df['close']
print(close)
0 4724.89
1 4378.51
2 6463.00
3 9838.96
4 13716.36
5 10285.10
6 10326.76
7 6923.91
8 9246.01
9 7485.01
10 6390.07
11 7730.93
12 7011.21
13 6626.57
14 6371.93
15 4041.32
16 3702.90
17 3434.10
18 3813.69
19 4103.95
20 5320.81
21 8555.00
22 10854.10
Name: close, dtype: float64
Change
Calculate the change for every row
change = close.diff(1)
print(change)
0 NaN
1 -346.38
2 2084.49
3 3375.96
4 3877.40
5 -3431.26
6 41.66
7 -3402.85
8 2322.10
9 -1761.00
10 -1094.94
11 1340.86
12 -719.72
13 -384.64
14 -254.64
15 -2330.61
16 -338.42
17 -268.80
18 379.59
19 290.26
20 1216.86
21 3234.19
22 2299.10
Name: close, dtype: float64
Gain and loss
get gain and loss from change
is_gain, is_loss = change > 0, change < 0
gain, loss = change, -change
gain[is_loss] = 0
loss[is_gain] = 0

gain.name = 'gain'
loss.name = 'loss'
print(loss)
0 NaN
1 346.38
2 0.00
3 0.00
4 0.00
5 3431.26
6 0.00
7 3402.85
8 0.00
9 1761.00
10 1094.94
11 0.00
12 719.72
13 384.64
14 254.64
15 2330.61
16 338.42
17 268.80
18 0.00
19 0.00
20 0.00
21 0.00
22 0.00
Name: loss, dtype: float64
Calculate first avg gain and loss
Mean of n prior rows
n = 14
avg_gain = change * np.nan
avg_loss = change * np.nan

avg_gain[n] = gain[:n+1].mean()
avg_loss[n] = loss[:n+1].mean()

avg_gain.name = 'avg_gain'
avg_loss.name = 'avg_loss'
avg_df = pd.concat([gain, loss, avg_gain, avg_loss], axis=1)
print(avg_df)
gain loss avg_gain avg_loss
0 NaN NaN NaN NaN
1 0.00 346.38 NaN NaN
2 2084.49 0.00 NaN NaN
3 3375.96 0.00 NaN NaN
4 3877.40 0.00 NaN NaN
5 0.00 3431.26 NaN NaN
6 41.66 0.00 NaN NaN
7 0.00 3402.85 NaN NaN
8 2322.10 0.00 NaN NaN
9 0.00 1761.00 NaN NaN
10 0.00 1094.94 NaN NaN
11 1340.86 0.00 NaN NaN
12 0.00 719.72 NaN NaN
13 0.00 384.64 NaN NaN
14 0.00 254.64 931.605 813.959286
15 0.00 2330.61 NaN NaN
16 0.00 338.42 NaN NaN
17 0.00 268.80 NaN NaN
18 379.59 0.00 NaN NaN
19 290.26 0.00 NaN NaN
20 1216.86 0.00 NaN NaN
21 3234.19 0.00 NaN NaN
22 2299.10 0.00 NaN NaN
The very first calculations for average gain and average loss are OK, but I don't know how to apply pandas.core.window.Rolling.apply for the second and subsequent values, because they depend on values spread over many rows and different columns.
It may be something like this:
avg_gain[n] = (avg_gain[n-1]*13 + gain[n]) / 14
My Wish - My Question
What is the best way to calculate and work with technical indicators?
How can I complete the above code in "Pandas style"?
Does the traditional way of coding with loops reduce performance compared to Pandas?

The average gain and loss are calculated by a recursive formula, which can't be vectorized with numpy. We can, however, try to find an analytical (i.e. non-recursive) solution for calculating the individual elements. Such a solution can then be implemented using numpy. See the Old Answer below. I kept it just for illustrative purposes: it works well with the sample data in the OP but may suffer from numerical underflow for bigger datasets (> ~1000 rows, thanks to @WarrenNiles for pointing this problem out in the comment below).
A straightforward solution is to loop over numpy arrays (instead of looping over pandas dataframes). This can easily be accelerated using numba by uncommenting the two numba-related lines below:
#import numba

df['change'] = df['close'].diff()
df['gain'] = df.change.mask(df.change < 0, 0.0)
df['loss'] = -df.change.mask(df.change > 0, -0.0)

#@numba.jit
def rma(x, n):
    """Running moving average"""
    a = np.full_like(x, np.nan)
    a[n] = x[1:n+1].mean()
    for i in range(n+1, len(x)):
        a[i] = (a[i-1] * (n - 1) + x[i]) / n
    return a

df['avg_gain'] = rma(df.gain.to_numpy(), 14)
df['avg_loss'] = rma(df.loss.to_numpy(), 14)
df['rs'] = df.avg_gain / df.avg_loss
df['rsi'] = 100 - (100 / (1 + df.rs))
For the 3173-row TSLA dataset linked in the comment below, it takes on my machine:
2 s for the pandas loop solution
23 ms for this array loop solution without numba
4 ms for this array loop solution with numba
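For reference, the numba-enabled variant could look like this (a sketch assuming numba is installed; the nopython=True flag is my addition and not part of the timings above):
import numba
import numpy as np

@numba.jit(nopython=True)
def rma_numba(x, n):
    """Running moving average, JIT-compiled."""
    a = np.full_like(x, np.nan)
    a[n] = x[1:n+1].mean()
    for i in range(n+1, len(x)):
        a[i] = (a[i-1] * (n - 1) + x[i]) / n
    return a

df['avg_gain'] = rma_numba(df.gain.to_numpy(), 14)
df['avg_loss'] = rma_numba(df.loss.to_numpy(), 14)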
Old Answer
Denoting the average gain as y and the current gain as x, we get y[i] = a*y[i-1] + b*x[i], where a = 13/14 and b = 1/14 for n = 14. Unwrapping the recursion leads to:
y[i] = a^i * y[0] + b * (x[i] + a*x[i-1] + a^2*x[i-2] + ... + a^(i-1)*x[1])
This can be efficiently calculated in numpy using cumsum (rma = running moving average):
import pandas as pd
import numpy as np
df = pd.DataFrame({'close':[4724.89, 4378.51,6463.00,9838.96,13716.36,10285.10,
                            10326.76,6923.91,9246.01,7485.01,6390.07,7730.93,
                            7011.21,6626.57,6371.93,4041.32,3702.90,3434.10,
                            3813.69,4103.95,5320.81,8555.00,10854.10]})
n = 14

def rma(x, n, y0):
    a = (n-1) / n
    ak = a**np.arange(len(x)-1, -1, -1)
    return np.r_[np.full(n, np.nan), y0, np.cumsum(ak * x) / ak / n + y0 * a**np.arange(1, len(x)+1)]
df['change'] = df['close'].diff()
df['gain'] = df.change.mask(df.change < 0, 0.0)
df['loss'] = -df.change.mask(df.change > 0, -0.0)
df['avg_gain'] = rma(df.gain[n+1:].to_numpy(), n, np.nansum(df.gain.to_numpy()[:n+1])/n)
df['avg_loss'] = rma(df.loss[n+1:].to_numpy(), n, np.nansum(df.loss.to_numpy()[:n+1])/n)
df['rs'] = df.avg_gain / df.avg_loss
df['rsi_14'] = 100 - (100 / (1 + df.rs))
Output of df.round(2):
close change gain loss avg_gain avg_loss rs rsi rsi_14
0 4724.89 NaN NaN NaN NaN NaN NaN NaN NaN
1 4378.51 -346.38 0.00 346.38 NaN NaN NaN NaN NaN
2 6463.00 2084.49 2084.49 0.00 NaN NaN NaN NaN NaN
3 9838.96 3375.96 3375.96 0.00 NaN NaN NaN NaN NaN
4 13716.36 3877.40 3877.40 0.00 NaN NaN NaN NaN NaN
5 10285.10 -3431.26 0.00 3431.26 NaN NaN NaN NaN NaN
6 10326.76 41.66 41.66 0.00 NaN NaN NaN NaN NaN
7 6923.91 -3402.85 0.00 3402.85 NaN NaN NaN NaN NaN
8 9246.01 2322.10 2322.10 0.00 NaN NaN NaN NaN NaN
9 7485.01 -1761.00 0.00 1761.00 NaN NaN NaN NaN NaN
10 6390.07 -1094.94 0.00 1094.94 NaN NaN NaN NaN NaN
11 7730.93 1340.86 1340.86 0.00 NaN NaN NaN NaN NaN
12 7011.21 -719.72 0.00 719.72 NaN NaN NaN NaN NaN
13 6626.57 -384.64 0.00 384.64 NaN NaN NaN NaN NaN
14 6371.93 -254.64 0.00 254.64 931.61 813.96 1.14 53.37 53.37
15 4041.32 -2330.61 0.00 2330.61 865.06 922.29 0.94 48.40 48.40
16 3702.90 -338.42 0.00 338.42 803.27 880.59 0.91 47.70 47.70
17 3434.10 -268.80 0.00 268.80 745.90 836.89 0.89 47.13 47.13
18 3813.69 379.59 379.59 0.00 719.73 777.11 0.93 48.08 48.08
19 4103.95 290.26 290.26 0.00 689.05 721.60 0.95 48.85 48.85
20 5320.81 1216.86 1216.86 0.00 726.75 670.06 1.08 52.03 52.03
21 8555.00 3234.19 3234.19 0.00 905.86 622.20 1.46 59.28 59.28
22 10854.10 2299.10 2299.10 0.00 1005.37 577.75 1.74 63.51 63.51
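As a sanity check, the closed-form rma can be compared against the loop version from the first part of this answer (copied here under a different name to avoid the clash):
def rma_loop(x, n):
    a = np.full_like(x, np.nan)
    a[n] = x[1:n+1].mean()
    for i in range(n+1, len(x)):
        a[i] = (a[i-1] * (n - 1) + x[i]) / n
    return a

# expect True for this small sample; very long series may diverge due to underflow
print(np.allclose(rma_loop(df.gain.to_numpy(), n), df.avg_gain.to_numpy(), equal_nan=True))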
Concerning your last question about performance: explicit loops in Python / pandas are terrible, avoid them whenever you can. If you can't, try cython or numba.

There is an easier way: the talib package.
import talib
close = df['close']
rsi = talib.RSI(close, timeperiod=14)
If you'd like Bollinger Bands to go with your RSI, that is easy too.
upperBB, middleBB, lowerBB = talib.BBANDS(close, timeperiod=20, nbdevup=2, nbdevdn=2, matype=0)
You can use Bollinger Bands on RSI instead of the fixed reference levels of 70 and 30.
upperBBrsi, MiddleBBrsi, lowerBBrsi = talib.BBANDS(rsi, timeperiod=50, nbdevup=2, nbdevdn=2, matype=0)
Finally, you can normalize RSI using the %b calculation.
normrsi = (rsi - lowerBBrsi) / (upperBBrsi - lowerBBrsi)
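For completeness, a minimal self-contained example (assuming the TA-Lib C library and its Python wrapper are installed; the random walk below is just placeholder data):
import numpy as np
import talib

close = 100 + np.cumsum(np.random.normal(0, 1, 300))   # placeholder price series
rsi = talib.RSI(close, timeperiod=14)
upperBB, middleBB, lowerBB = talib.BBANDS(close, timeperiod=20, nbdevup=2, nbdevdn=2, matype=0)
print(rsi[-3:])
print(upperBB[-3:])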
info on talib
https://mrjbq7.github.io/ta-lib/
info on Bollinger Bands
https://www.BollingerBands.com

Here is an option.
I will only address your second bullet.
# libraries required
import pandas as pd
import numpy as np
# create dataframe
df = pd.DataFrame({'close':[4724.89, 4378.51,6463.00,9838.96,13716.36,10285.10,
                            10326.76,6923.91,9246.01,7485.01,6390.07,7730.93,
                            7011.21,6626.57,6371.93,4041.32,3702.90,3434.10,
                            3813.69,4103.95,5320.81,8555.00,10854.10]})
df['change'] = df['close'].diff(1)  # Calculate change
# calculate gain / loss from every change
df['gain'] = np.select([df['change'] > 0, df['change'].isna()],
                       [df['change'], np.nan],
                       default=0)
df['loss'] = np.select([df['change'] < 0, df['change'].isna()],
                       [-df['change'], np.nan],
                       default=0)
# create avg_gain / avg_loss columns with all nan
df['avg_gain'] = np.nan
df['avg_loss'] = np.nan
n = 14  # what is the window
# keep first occurrence of rolling mean (use .loc to avoid chained assignment)
df.loc[n, 'avg_gain'] = df['gain'].rolling(window=n).mean().dropna().iloc[0]
df.loc[n, 'avg_loss'] = df['loss'].rolling(window=n).mean().dropna().iloc[0]
# Alternatively
df.loc[n, 'avg_gain'] = df.loc[:n, 'gain'].mean()
df.loc[n, 'avg_loss'] = df.loc[:n, 'loss'].mean()
# This is not a pandas way, looping through the pandas series, but it does what you need
for i in range(n + 1, df.shape[0]):
    df.loc[i, 'avg_gain'] = (df.loc[i - 1, 'avg_gain'] * (n - 1) + df.loc[i, 'gain']) / n
    df.loc[i, 'avg_loss'] = (df.loc[i - 1, 'avg_loss'] * (n - 1) + df.loc[i, 'loss']) / n
# calculate rs and rsi
df['rs'] = df['avg_gain'] / df['avg_loss']
df['rsi'] = 100 - (100 / (1 + df['rs']))

If you want to calculate the RSI of a time series using native pandas calls, you can use the following one-line code:
n=14
df['rsi14'] = 100 - (100 / (1 + df['Close'].diff(1).mask(df['Close'].diff(1) < 0, 0).ewm(alpha=1/n, adjust=False).mean() / df['Close'].diff(1).mask(df['Close'].diff(1) > 0, -0.0).abs().ewm(alpha=1/n, adjust=False).mean()))
And it's even faster than the numpy results (ms / loop):
rows np loop native
23 1.0 1.3 0.8
230 1.1 1.4 0.9
2300 1.1 1.3 0.9
23000 3.4 1.8 1.2
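For readability, the same one-liner can be unpacked into intermediate steps (a sketch assuming the price column is named 'Close'):
n = 14
delta = df['Close'].diff(1)
gain = delta.mask(delta < 0, 0)              # keep positive changes, zero out the rest
loss = delta.mask(delta > 0, -0.0).abs()     # keep negative changes as positive numbers
avg_gain = gain.ewm(alpha=1/n, adjust=False).mean()
avg_loss = loss.ewm(alpha=1/n, adjust=False).mean()
df['rsi14'] = 100 - (100 / (1 + avg_gain / avg_loss))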

This is the RSI code; replace everything that has "aa" in it:
import pandas as pd
rsi_period = 14
df = pd.Series(coinaalist)
chg = df.diff(1)
gain = chg.mask(chg<0,0)
loss = chg.mask(chg>0,0)
avg_gain = gain.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
avg_loss = loss.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
rs = abs(avg_gain / avg_loss)
crplaa = 100 - (100/(1+rs))
coinaarsi = crplaa.iloc[-1]

I gave +1 to lepi; however, his formula can be made even more pandorable:
n = 14
df['rsi14'] = df['Close'].diff(1).mask(df['Close'].diff(1) < 0, 0).ewm(alpha=1/n, adjust=False).mean().div(df['Close'].diff(1).mask(df['Close'].diff(1) > 0, -0.0).abs().ewm(alpha=1/n, adjust=False).mean()).add(1).rdiv(100).rsub(100)
So div() was used instead of /, and add(1).rdiv(100).rsub(100) replaced the + - / arithmetic in other places.
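To see what the chained methods do, here is a tiny check (x stands for the gain/loss ratio; the numbers are illustrative):
import pandas as pd

x = pd.Series([0.5, 1.0, 2.0])
chained = x.add(1).rdiv(100).rsub(100)   # 100 - 100 / (1 + x)
manual = 100 - 100 / (1 + x)
print(chained.equals(manual))            # expect True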

Related

Python Pandas-retrieving values in one column while they are less than the value of a second column

Suppose I have a df that looks like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 31.0 0.90
2 30 0.03 41.0 0.70
3 40 0.72 51.0 0.08
4 50 0.09 81.0 0.78
5 60 0.09 NaN NaN
6 70 0.01 NaN NaN
7 80 0.09 NaN NaN
8 90 0.08 NaN NaN
9 100 0.02 NaN NaN
In the posR column, we see that it jumps from 11 to 31, and there is not a value in the "20's". I want to insert a value to fill that space, which would essentially just be the posF value, and NA, so my resulting df would look like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 20 NaN
2 30 0.03 31.0 0.90
3 40 0.72 41.0 0.70
4 50 0.09 50 NaN
5 60 0.09 60 NaN
6 70 0.01 70 NaN
7 80 0.09 80 NaN
8 90 0.08 81.0 0.78
9 100 0.02 100 NaN
So I want to fill the NaN values in the position with the values from posF that are in between the values in posR.
What I have tried to do is just make a dummy list and add values to the list based on whether they were less than a (I see the flaw here but I don't know how to fix it).
insert_rows = []
for x in df['posF']:
    for a, b in zip(df['posR'], df['rfreq']):
        if x < a:
            insert_rows.append([x, 'NA'])

print(len(insert_rows))  # 21, should be 5
I realize that it is appending x several times until it reaches the condition of being >a.
After this I will just create a new df and add these values to the original 2 columns so they are the same length.
If you can think of a better title, feel free to edit.
My first thought was to retrieve the new indices for the entries in posR by interpolating with posF and then put the values to their new positions - but as you want to have 81 one row later than here, I'm afraid this is not exactly what you're searching for and I still don't really get the logic behind your task.
However, perhaps this is a starting point, let's see...
This approach would work like the following:
Retrieve the new index positions of the values in posR according to their order in posF:
import numpy as np
idx = np.interp(df.posR, df.posF, df.index).round()
Get rid of nan entries and cast to int:
idx = idx[np.isfinite(idx)].astype(int)
Create a new column by copying posF in the first step, and set newrfreq to nan respectively:
df['newposR'] = df.posF
df['newrfreq'] = np.nan
Then overwrite with the values from posR and rfreq, but now at the updated positions:
df.loc[idx, 'newposR'] = df.posR[:len(idx)].values
df.loc[idx, 'newrfreq'] = df.rfreq[:len(idx)].values
Result:
posF ffreq posR rfreq newposR newrfreq
0 10 0.50 11.0 0.08 11.0 0.08
1 20 0.20 31.0 0.90 20.0 NaN
2 30 0.03 41.0 0.70 31.0 0.90
3 40 0.72 51.0 0.08 41.0 0.70
4 50 0.09 81.0 0.78 51.0 0.08
5 60 0.09 NaN NaN 60.0 NaN
6 70 0.01 NaN NaN 70.0 NaN
7 80 0.09 NaN NaN 81.0 0.78
8 90 0.08 NaN NaN 90.0 NaN
9 100 0.02 NaN NaN 100.0 NaN
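Putting the steps together into one runnable block (a sketch that rebuilds the sample frame from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'posF': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'ffreq': [0.50, 0.20, 0.03, 0.72, 0.09, 0.09, 0.01, 0.09, 0.08, 0.02],
    'posR': [11.0, 31.0, 41.0, 51.0, 81.0] + [np.nan] * 5,
    'rfreq': [0.08, 0.90, 0.70, 0.08, 0.78] + [np.nan] * 5,
})

# interpolate the target row index of each posR value against posF
idx = np.interp(df.posR, df.posF, df.index).round()
idx = idx[np.isfinite(idx)].astype(int)

df['newposR'] = df.posF
df['newrfreq'] = np.nan
df.loc[idx, 'newposR'] = df.posR[:len(idx)].values
df.loc[idx, 'newrfreq'] = df.rfreq[:len(idx)].values
print(df)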

splitting a dataframe into chunks and naming each new chunk into a dataframe

Is there a good way to split a dataframe into chunks and automatically name each chunk as its own dataframe?
For example, dfmaster has 1000 records; split by 200 and create df1, df2, ..., df5.
Any guidance would be much appreciated.
I've looked on other boards and there is no guidance for a function that can automatically create new dataframes.
Use numpy for splitting:
See example below:
In [2095]: df
Out[2095]:
0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.00 0.0 0.00 0.0 0.94 0.00 0.00 0.63 0.00
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.00 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN
In [2096]: np.split(df, 2)
Out[2096]:
[ 0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.0 0.0 0.0 0.0 0.94 0.0 0.0 0.63 0.0
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN,
0 1 2 3 4 5 6 7 8 9 10
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.0 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN]
df gets split into 2 dataframes having 2 rows each.
For your example (1000 rows into chunks of 200 rows), you can do np.split(dfmaster, 5).
I find these ideas helpful:
solution via list:
https://stackoverflow.com/a/49563326/10396469
solution using numpy.split:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html
just use df = df.values first to convert from dataframe to numpy.array.
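If the goal is to end up with named chunks, a dictionary is usually easier to manage than separate df1, df2, ... variables; a small sketch (the 1000-row frame is a placeholder):
import numpy as np
import pandas as pd

dfmaster = pd.DataFrame({'value': range(1000)})    # placeholder 1000-row frame

chunks = np.array_split(dfmaster, 5)               # five chunks of 200 rows each
named = {f'df{i + 1}': chunk for i, chunk in enumerate(chunks)}

print(named['df1'].shape)   # (200, 1)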

How to prevent zero values from messing up a pandas boxplot?

I have a pandas df and after pivoting, it prints as follows:
country CHINA USA
0 119.02 0.0
1 121.20 0.0
3 112.49 0.0
4 113.94 0.0
5 114.67 0.0
6 111.77 0.0
7 117.57 0.0
......................
......................
6648 0.00 420.0
6649 0.00 420.0
6650 0.00 420.0
6651 0.00 420.0
6652 0.00 420.0
6653 0.00 420.0
6654 0.00 500.0
6655 0.00 500.0
6656 0.00 390.0
6657 0.00 450.0
6658 0.00 420.0
6659 0.00 420.0
6660 0.00 450.0
The method is here,
def visualize_box_plot(df):
    df = df[df.outlier != 1]
    df = pd.pivot_table(df,
                        index=df.index,
                        columns=df['country'],
                        values='value',
                        fill_value=0)
    df.CHINA = df.CHINA.round(2)
    df.USA = df.USA.round(2)
    # this is the prints
    # provided earlier
    print df
    df_usa = df[(df['USA'] != 0)]
    df_china = df[(df['CHINA'] != 0)]
    usa = df_usa.as_matrix()[:, -1]
    china = df_china.as_matrix()[:, 0]
    print "USA:", len(usa), " ", "CHINA: ", len(china)
    # unequal length
    # USA: 1673 CHINA: 4384
    x = [china, usa]
    plt.boxplot(x)
    plt.show()
Zero values come from the NaN values introduced during pivoting, and I would like to omit them while making the box plot. So I use the code,
df_usa = df[(df['USA'] != 0)]
df_china = df[(df['CHINA'] != 0)]
That code creates separate dataframes, converts them to NumPy matrices, and finally I visualize them together with matplotlib. Note that the NumPy matrices do not have the same length, and hence I can't just call the boxplot function directly with df.
Here is my visualization, where 1 and 2 need to be replaced with CHINA and USA respectively.
The visualization is not good and I get the feeling there might be a better way to get the job done. Any suggestions? Some sample code would help a lot. You may keep the df rounded to 2 digits after the decimal. The main issue is to make the code elegant and to improve the visualization.
I think the code can be simpler - simply replace 0 with NaN and then call DataFrame.boxplot:
print (df.mask(df == 0))
#alternative solution
#print (df.replace(0,np.nan))
CHINA USA
country
0 119.02 NaN
1 121.20 NaN
3 112.49 NaN
4 113.94 NaN
5 114.67 NaN
6 111.77 NaN
7 117.57 NaN
6648 NaN 420.0
6649 NaN 420.0
6650 NaN 420.0
6651 NaN 420.0
6652 NaN 420.0
6653 NaN 420.0
6654 NaN 500.0
6655 NaN 500.0
6656 NaN 390.0
6657 NaN 450.0
6658 NaN 420.0
6659 NaN 420.0
6660 NaN 450.0
df.mask(df == 0).boxplot()
Another possible solution is use DataFrame.plot.box:
df.mask(df == 0).plot.box()
Box Plots in docs
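A small self-contained version of the idea (toy numbers, not the data from the question):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'CHINA': [119.02, 121.20, 112.49, 0.0, 0.0],
                   'USA': [0.0, 0.0, 0.0, 420.0, 450.0]})

df.mask(df == 0).boxplot()   # zeros become NaN and are ignored by the boxplot
plt.show()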
Apart from the numpy nan that jezrael mentioned, there's also math.nan that you can use:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import math
data = {'c1': [1, 2, 3], 'c2': [5, 3, 0]}

for k in data:  # search and replace zeroes with math.nan
    data[k] = [x if x != 0 else math.nan for x in data[k]]

df = pd.DataFrame(data, columns=list(data.keys()))
df.plot.box(grid=False)
plt.show()

How to insert value in column if condition is true using Pandas (Python)

I have the following dataset and I am trying to create a condition where, if the value in the Percent cell is positive, the Match cell should show the subsequent Percent value, i.e. the one at (i+1). However, I wanted to ask how I would be able to perform this operation without using a loop. For example, in row 0 the Match column would display the value -0.34.
User Percent Match
0 A 0.87 NaN
1 A -0.34 NaN
2 A 0.71 NaN
3 A -0.58 NaN
4 B -1.67 NaN
5 B -0.44 NaN
6 B -0.72 NaN
7 C 0.19 NaN
8 C 0.39 NaN
9 C -0.28 NaN
10 C 0.53 NaN
Additionally, how would I be able to get a summation of the two values preceding a positive number in the Percent cell? I have the following code, but I am making an error in indexing the row location.
df1.ix[df1.Percent >=0, ['Match']] = df1.iloc[:1]['Match']; df1
For the first part you can use loc with a boolean condition and shift:
In [5]:
df.loc[df['Percent']>0,'Match'] = df['Percent'].shift(-1)
df
Out[5]:
User Percent Match
0 A 0.87 -0.34
1 A -0.34 NaN
2 A 0.71 -0.58
3 A -0.58 NaN
4 B -1.67 NaN
5 B -0.44 NaN
6 B -0.72 NaN
7 C 0.19 0.39
8 C 0.39 -0.28
9 C -0.28 NaN
10 C 0.53 NaN
For the summation you can do the following:
In [15]:
def func(x):
    return df['Percent'].iloc[x.name-2:x.name].sum()
df['sum'] = df[df['Percent']>0][['Percent']].apply(lambda x: func(x), axis=1)
df
Out[15]:
User Percent Match sum
0 A 0.87 -0.34 0.00
1 A -0.34 NaN NaN
2 A 0.71 -0.58 0.53
3 A -0.58 NaN NaN
4 B -1.67 NaN NaN
5 B -0.44 NaN NaN
6 B -0.72 NaN NaN
7 C 0.19 0.39 -1.16
8 C 0.39 -0.28 -0.53
9 C -0.28 NaN NaN
10 C 0.53 NaN 0.11
This uses a slight trick to mask the df and return the column of interest, but forces the result to a df (by using double square brackets [[]]) so we can call apply and use axis=1 to iterate row-wise. This allows us to access the row index via the .name attribute. We can then use this to slice the df and return the sum.
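A tiny demonstration of the double-bracket / .name behaviour described above (toy frame, names are illustrative):
import pandas as pd

df = pd.DataFrame({'Percent': [0.87, -0.34, 0.71]})

print(type(df['Percent']))     # Series (single brackets)
print(type(df[['Percent']]))   # DataFrame (double brackets)

# with axis=1, apply passes each row as a Series whose .name is the row label
names = df[['Percent']].apply(lambda row: row.name, axis=1)
print(list(names))             # [0, 1, 2]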

pandas why does int64 - float64 column subtraction yield NaN's

I am confused by the results of pandas subtraction of two columns. When I subtract two float64 and int64 columns it yields several NaN entries. Why is this happening? What could be the cause of this strange behavior?
Final Update: As N.Wouda pointed out, my problem was that the index columns did not match.
Y_predd.reset_index(drop=True,inplace=True)
Y_train_2.reset_index(drop=True,inplace=True)
solved my problem
Update 2: It seems like my index columns don't match, which makes sense because they are both sampled from the same data frame. How can I "start fresh" with new index columns?
Update: Y_predd - Y_train_2.astype('float64') also yields NaN values. I am confused why this did not raise an error. They are the same size. Why could this be yielding NaN?
In [48]: Y_predd.size
Out[48]: 182527
In [49]: Y_train_2.astype('float64').size
Out[49]: 182527
Original documentation of error:
In [38]: Y_train_2
Out[38]:
66419 0
2319 0
114195 0
217532 0
131687 0
144024 0
94055 0
143479 0
143124 0
49910 0
109278 0
215905 1
127311 0
150365 0
117866 0
28702 0
168111 0
64625 0
207180 0
14555 0
179268 0
22021 1
120169 0
218769 0
259754 0
188296 1
63503 1
175104 0
218261 0
35453 0
..
112048 0
97294 0
68569 0
60333 0
184119 1
57632 0
153729 1
155353 0
114979 1
180634 0
42842 0
99979 0
243728 0
203679 0
244381 0
55646 0
35557 0
148977 0
164008 0
53227 1
219863 0
4625 0
155759 0
232463 0
167807 0
123638 0
230463 1
198219 0
128459 1
53911 0
Name: objective_for_classifier, dtype: int64
In [39]: Y_predd
Out[39]:
0 0.00
1 0.48
2 0.04
3 0.00
4 0.48
5 0.58
6 0.00
7 0.00
8 0.02
9 0.06
10 0.22
11 0.32
12 0.12
13 0.26
14 0.18
15 0.18
16 0.28
17 0.30
18 0.52
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 0.64
26 0.30
27 0.76
28 0.10
29 0.42
...
182497 0.60
182498 0.00
182499 0.06
182500 0.12
182501 0.00
182502 0.40
182503 0.70
182504 0.42
182505 0.54
182506 0.24
182507 0.56
182508 0.34
182509 0.10
182510 0.18
182511 0.06
182512 0.12
182513 0.00
182514 0.22
182515 0.08
182516 0.22
182517 0.00
182518 0.42
182519 0.02
182520 0.50
182521 0.00
182522 0.08
182523 0.16
182524 0.00
182525 0.32
182526 0.06
Name: prediction_method_used, dtype: float64
In [40]: Y_predd - Y_train_2
Out[41]:
0 NaN
1 NaN
2 0.04
3 NaN
4 0.48
5 NaN
6 0.00
7 0.00
8 NaN
9 NaN
10 NaN
11 0.32
12 -0.88
13 -0.74
14 0.18
15 NaN
16 NaN
17 NaN
18 NaN
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 NaN
26 0.30
27 NaN
28 0.10
29 0.42
...
260705 NaN
260706 NaN
260709 NaN
260710 NaN
260711 NaN
260713 NaN
260715 NaN
260716 NaN
260718 NaN
260721 NaN
260722 NaN
260723 NaN
260724 NaN
260725 NaN
260726 NaN
260727 NaN
260731 NaN
260735 NaN
260737 NaN
260738 NaN
260739 NaN
260740 NaN
260742 NaN
260743 NaN
260745 NaN
260748 NaN
260749 NaN
260750 NaN
260751 NaN
260752 NaN
dtype: float64
Posting here so we can close the question, from the comments:
Are you sure each dataframe has the same index range?
You can reset the indices on both frames by df.reset_index(drop=True) and then subtract the frames as you were already doing. This process should result in the desired output.
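A minimal sketch of why misaligned indices produce NaN on subtraction, and how reset_index fixes it (the numbers are made up):
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([1, 1, 1], index=[10, 11, 12])

print(a - b)                          # all NaN: no overlapping index labels
print(a - b.reset_index(drop=True))   # aligned on 0, 1, 2 -> 0.0, 1.0, 2.0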
