I am doing a time-series analysis where I need to calculate the change in several attributes over time. Pandas makes a simple version of this easy: .diff(periods=n) calculates the difference between a row and the row n positions before it. However, that is not quite what I need...
df = pd.DataFrame({'day_num': [134, 135, 136, 137],
'swe': [38.8, 38.9, 37.6, 36.8],
'prcp': [0., 0.1, 0., 0.15],
'flow': [2930, 3350, 3900, 4090]})
diff_3 = df.diff(periods=3)
The original DataFrame:
day_num swe prcp flow
0 134 38.8 0.00 2930
1 135 38.9 0.00 3350
2 136 37.6 0.00 3900
3 137 36.8 0.15 4090
And diff_3 returns:
day_num swe prcp flow
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 3.0 -2.0 0.15 1160.0
In the swe column (snow water equivalent in inches, literally the liquid-water equivalent of a particular snowpack), the 3-day difference for index 3 is -2.0, which equals 36.8 - 38.8. However, notice that there was both an increase and a decrease within the 3-day period. This means that a total of 2.1 inches of water (my desired output) melted and contributed to streamflow (the flow column) over that 3-day window. Min - max would not work either, because if I were to increase n to 7 or 14, there could easily be meltings that are not accounted for in the output. What is needed is the sum of the single-day differences over a span of n days. I eventually want to merge the diff_n columns back into the original dataset.
Any ideas would be appreciated. Also, this is my first post, so let me know how I can improve my format / content...
"sum of the single day differences over a span of n days"
First, diff consecutive rows, then take a rolling sum. Since the diffed series already contains the difference between each pair of consecutive rows, the rolling sum only needs a window of periods - 1 (in our case 3 - 1 = 2).
periods = 3
df['swe'] = df['swe'].diff().rolling(periods-1).sum()
Output:
day_num swe prcp flow
0 134 NaN 0.00 2930
1 135 NaN 0.10 3350
2 136 -1.2 0.00 3900
3 137 -2.1 0.15 4090
This is also equivalent to doing the following, since the contributions of the intermediate days cancel out (the sum of consecutive differences telescopes):
df['swe'] = df['swe'].diff(periods-1)
Output:
day_num swe prcp flow swe_using_diff swe_using_rolling_sum
0 134 38.8 0.00 2930 NaN NaN
1 135 38.9 0.10 3350 NaN NaN
2 136 37.6 0.00 3900 -1.2 -1.2
3 137 36.8 0.15 4090 -2.1 -2.1
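The question also asks about merging the diff_n columns back into the original dataset. Here is a minimal sketch of that last step, using df as defined in the question and the rolling-sum approach from above; the _diff3 suffix is just an illustrative name, not from the original posts:
periods = 3
cols = ['swe', 'prcp', 'flow']
diffs = df[cols].diff().rolling(periods - 1).sum()
# join on the index; the original columns stay, swe_diff3 etc. are added
df = df.join(diffs.add_suffix('_diff3'))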
I'd like to learn the most idiomatic way to set all values of a data frame to NaN except the values corresponding to the last business day of the month. I've worked out the following solution but it feels clunky.
If you're wondering what my original use case is: I get mixed daily and monthly data in one big data frame. I extract the monthly data, which basically repeats the same value within each month, and I'd like to replace the dull repeated values with an interpolated estimate, e.g. using loess. To that end I need all the in-between dates present as NA values on the x-axis so they can be filled in.
# get the values corresponding to the last business day of each month
df_eofm = df.resample('BM').last()
# fill the original data frame with NaN's
df[:] = np.nan
# now try to set the last business days to the values we saved
df.update(df_eofm)
print(df)
print(df.dropna())
This produces the expected result:
Col1 Col2 Col3
Date
1963-12-31 57.5 -28 0.89
1964-01-01 NaN NaN NaN
1964-01-02 NaN NaN NaN
1964-01-03 NaN NaN NaN
1964-01-04 NaN NaN NaN
... ... ... ...
2020-03-11 NaN NaN NaN
2020-03-12 NaN NaN NaN
2020-03-13 NaN NaN NaN
2020-03-14 NaN NaN NaN
2020-03-15 NaN NaN NaN
[20530 rows x 3 columns]
Col1 Col2 Col3
Date
1963-12-31 57.5 -28 0.89
1964-01-31 54 106 0.65
1964-02-28 57.1 126 0.68
1964-03-31 57.9 266 0.73
1964-04-30 60.2 144 0.72
... ... ... ...
2019-10-31 47.8 136 0.11
2019-11-29 48.3 128 0.22
2019-12-31 48.1 266 0.37
2020-01-31 47.2 145 -0.08
2020-02-28 50.9 225 -0.45
[675 rows x 3 columns]
You could use is_month_end and index the dataframe with the resultant boolean series:
df[~df.index.is_month_end] = np.nan
For the last business day, using this answer we could do something like the following (note that we group by year and month together, and mask everything that is not the last business day):
def is_business_day(date):
    return bool(len(pd.bdate_range(date, date)))

# keep each date only where it is a business day, then broadcast the last
# surviving date within every (year, month) group
dates = df.index.to_series()
last_bus = (dates.where(dates.map(is_business_day))
                 .groupby([df.index.year, df.index.month])
                 .transform('last'))
df[df.index != last_bus] = np.nan
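As a quick sanity check of the is_month_end variant, here is a sketch on a made-up daily index (the values are illustrative only):
import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', '2020-03-31', freq='D')
df = pd.DataFrame({'Col1': np.arange(len(idx), dtype=float)}, index=idx)
df[~df.index.is_month_end] = np.nan
print(df.dropna())  # keeps 2020-01-31, 2020-02-29 and 2020-03-31 only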
I have a 5k x 2 column dataframe called "both".
I want to create a new 5k x 1 DataFrame or column (doesn't matter) by replacing any NaN value in one column with the value of the adjacent column.
ex:
Gains Loss
0 NaN NaN
1 NaN -0.17
2 NaN -0.13
3 NaN -0.75
4 NaN -0.17
5 NaN -0.99
6 1.06 NaN
7 NaN -1.29
8 NaN -0.42
9 0.14 NaN
So, for example, I need to replace the NaNs in the first column in rows 1 through 5 with the values from the second column in those same rows, to get a new df of the following form:
Change
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
How do I tell Python to do this?
You may fill the NaN values with zeroes and then simply add your columns:
both["Change"] = both["Gains"].fillna(0) + both["Loss"].fillna(0)
Then, if you need it, you may turn the resulting zeroes back into NaNs:
both["Change"] = both["Change"].replace(0, np.nan)
The result:
Gains Loss Change
0 NaN NaN NaN
1 NaN -0.17 -0.17
2 NaN -0.13 -0.13
3 NaN -0.75 -0.75
4 NaN -0.17 -0.17
5 NaN -0.99 -0.99
6 1.06 NaN 1.06
7 NaN -1.29 -1.29
8 NaN -0.42 -0.42
9 0.14 NaN 0.14
Finally, if you want to get rid of your original columns, you may drop them:
both.drop(columns=["Gains", "Loss"], inplace=True)
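If at most one of the two columns is ever non-null in a row, a more direct alternative (not part of the answer above) is to let one column fill the other's gaps, which also avoids turning a legitimate value of 0 into NaN:
both["Change"] = both["Gains"].combine_first(both["Loss"])
both["Gains"].fillna(both["Loss"]) would do the same here.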
There are many ways to achieve this. One is using the loc property:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Price1': [np.nan,np.nan,np.nan,np.nan,
np.nan,np.nan,1.06,np.nan,np.nan],
'Price2': [np.nan,-0.17,-0.13,-0.75,-0.17,
-0.99,np.nan,-1.29,-0.42]})
df.loc[df['Price1'].isnull(), 'Price1'] = df['Price2']
df = df.loc[:6,'Price1']
print(df)
Output:
Price1
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
You can see more complex recipes in the Cookbook.
IIUC, we can flag the rows where every column is null, and just sum the columns for the remaining rows to make your new dataframe.
cols = ['Gains','Loss']
s = df.isnull().cumsum(axis=1).eq(len(df.columns)).any(axis=1)
# add df[cols].isnull() if you only want to measure the price columns for nulls.
df['prices'] = df[cols].loc[~s].sum(axis=1)
df = df.drop(cols,axis=1)
print(df)
prices
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
7 -1.29
8 -0.42
My problem
I tried many libraries on GitHub, but none of them produced results matching TradingView, so I followed the formula in this link to calculate the RSI indicator. I calculated it with Excel and compared the results with TradingView. I know it's absolutely correct, but I didn't find a way to calculate it with pandas.
Formula
RSI = 100 - 100 / (1 + RS)
RS = Average Gain / Average Loss
The very first calculations for average gain and average loss are simple
14-period averages:
First Average Gain = Sum of Gains over the past 14 periods / 14.
First Average Loss = Sum of Losses over the past 14 periods / 14.
The second, and subsequent, calculations are based on the prior averages
and the current gain or loss:
Average Gain = [(previous Average Gain) x 13 + current Gain] / 14.
Average Loss = [(previous Average Loss) x 13 + current Loss] / 14.
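To make the two-stage formula concrete, here is a tiny sketch in plain Python, seeded with the gain values from the expected-results table below:
n = 14
# gains for periods 1..14, taken from the gain column below
gains = [0.00, 2084.49, 3375.96, 3877.40, 0.00, 41.66, 0.00,
         2322.10, 0.00, 0.00, 1340.86, 0.00, 0.00, 0.00]
avg_gain = sum(gains) / n                    # first average: 931.605
avg_gain = (avg_gain * (n - 1) + 0.00) / n   # period 15 (gain 0.00): 865.0618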
Expected Results
close change gain loss avg_gain avg_loss rs \
0 4724.89 NaN NaN NaN NaN NaN NaN
1 4378.51 -346.38 0.00 346.38 NaN NaN NaN
2 6463.00 2084.49 2084.49 0.00 NaN NaN NaN
3 9838.96 3375.96 3375.96 0.00 NaN NaN NaN
4 13716.36 3877.40 3877.40 0.00 NaN NaN NaN
5 10285.10 -3431.26 0.00 3431.26 NaN NaN NaN
6 10326.76 41.66 41.66 0.00 NaN NaN NaN
7 6923.91 -3402.85 0.00 3402.85 NaN NaN NaN
8 9246.01 2322.10 2322.10 0.00 NaN NaN NaN
9 7485.01 -1761.00 0.00 1761.00 NaN NaN NaN
10 6390.07 -1094.94 0.00 1094.94 NaN NaN NaN
11 7730.93 1340.86 1340.86 0.00 NaN NaN NaN
12 7011.21 -719.72 0.00 719.72 NaN NaN NaN
13 6626.57 -384.64 0.00 384.64 NaN NaN NaN
14 6371.93 -254.64 0.00 254.64 931.605000 813.959286 1.144535
15 4041.32 -2330.61 0.00 2330.61 865.061786 922.291480 0.937948
16 3702.90 -338.42 0.00 338.42 803.271658 880.586374 0.912201
17 3434.10 -268.80 0.00 268.80 745.895111 836.887347 0.891273
18 3813.69 379.59 379.59 0.00 719.730460 777.109680 0.926163
19 4103.95 290.26 290.26 0.00 689.053999 721.601845 0.954895
20 5320.81 1216.86 1216.86 0.00 726.754428 670.058856 1.084613
21 8555.00 3234.19 3234.19 0.00 905.856968 622.197509 1.455899
22 10854.10 2299.10 2299.10 0.00 1005.374328 577.754830 1.740140
rsi_14
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 53.369848
15 48.399038
16 47.704239
17 47.125561
18 48.083322
19 48.846358
20 52.029461
21 59.281719
22 63.505515
My Code
Import
import pandas as pd
import numpy as np
Load data
df = pd.read_csv("rsi_14_test_data.csv")
close = df['close']
print(close)
0 4724.89
1 4378.51
2 6463.00
3 9838.96
4 13716.36
5 10285.10
6 10326.76
7 6923.91
8 9246.01
9 7485.01
10 6390.07
11 7730.93
12 7011.21
13 6626.57
14 6371.93
15 4041.32
16 3702.90
17 3434.10
18 3813.69
19 4103.95
20 5320.81
21 8555.00
22 10854.10
Name: close, dtype: float64
Change
Calculate the change for every row:
change = close.diff(1)
print(change)
0 NaN
1 -346.38
2 2084.49
3 3375.96
4 3877.40
5 -3431.26
6 41.66
7 -3402.85
8 2322.10
9 -1761.00
10 -1094.94
11 1340.86
12 -719.72
13 -384.64
14 -254.64
15 -2330.61
16 -338.42
17 -268.80
18 379.59
19 290.26
20 1216.86
21 3234.19
22 2299.10
Name: close, dtype: float64
Gain and loss
Get gain and loss from the change:
is_gain, is_loss = change > 0, change < 0
gain, loss = change, -change
gain[is_loss] = 0
loss[is_gain] = 0
gain.name = 'gain'
loss.name = 'loss'
print(loss)
0 NaN
1 346.38
2 0.00
3 0.00
4 0.00
5 3431.26
6 0.00
7 3402.85
8 0.00
9 1761.00
10 1094.94
11 0.00
12 719.72
13 384.64
14 254.64
15 2330.61
16 338.42
17 268.80
18 0.00
19 0.00
20 0.00
21 0.00
22 0.00
Name: loss, dtype: float64
Calculate first avg gain and loss
Mean of the n prior rows:
n = 14
avg_gain = change * np.nan
avg_loss = change * np.nan
avg_gain[n] = gain[:n+1].mean()
avg_loss[n] = loss[:n+1].mean()
avg_gain.name = 'avg_gain'
avg_loss.name = 'avg_loss'
avg_df = pd.concat([gain, loss, avg_gain, avg_loss], axis=1)
print(avg_df)
gain loss avg_gain avg_loss
0 NaN NaN NaN NaN
1 0.00 346.38 NaN NaN
2 2084.49 0.00 NaN NaN
3 3375.96 0.00 NaN NaN
4 3877.40 0.00 NaN NaN
5 0.00 3431.26 NaN NaN
6 41.66 0.00 NaN NaN
7 0.00 3402.85 NaN NaN
8 2322.10 0.00 NaN NaN
9 0.00 1761.00 NaN NaN
10 0.00 1094.94 NaN NaN
11 1340.86 0.00 NaN NaN
12 0.00 719.72 NaN NaN
13 0.00 384.64 NaN NaN
14 0.00 254.64 931.605 813.959286
15 0.00 2330.61 NaN NaN
16 0.00 338.42 NaN NaN
17 0.00 268.80 NaN NaN
18 379.59 0.00 NaN NaN
19 290.26 0.00 NaN NaN
20 1216.86 0.00 NaN NaN
21 3234.19 0.00 NaN NaN
22 2299.10 0.00 NaN NaN
The very first calculation for the average gain and average loss is OK, but I don't know how to apply pandas.core.window.Rolling.apply for the second and subsequent ones, because they involve many rows and different columns.
It may be something like this:
avg_gain[n] = (avg_gain[n-1]*13 + gain[n]) / 14
My Wish - My Question
What is the best way to calculate and work with technical indicators?
Complete the above code in "Pandas Style".
Does the traditional way of coding with loops reduce performance compared to Pandas?
The average gain and loss are calculated by a recursive formula, which can't be vectorized with numpy. We can, however, try to find an analytical (i.e. non-recursive) solution for calculating the individual elements. Such a solution can then be implemented using numpy. See the Old Answer below. I kept it just for illustrative purposes: it works well with the sample data in the OP but may suffer from numerical underflow for bigger datasets (> ~1000 rows; thanks to @WarrenNiles for pointing this problem out in the comment below).
A straightforward solution is to loop over numpy arrays (instead of looping over pandas dataframes). This can easily be accelerated using numba by uncommenting the two numba-related lines below:
#import numba
df['change'] = df['close'].diff()
df['gain'] = df.change.mask(df.change < 0, 0.0)
df['loss'] = -df.change.mask(df.change > 0, -0.0)
#@numba.jit
def rma(x, n):
"""Running moving average"""
a = np.full_like(x, np.nan)
a[n] = x[1:n+1].mean()
for i in range(n+1, len(x)):
a[i] = (a[i-1] * (n - 1) + x[i]) / n
return a
df['avg_gain'] = rma(df.gain.to_numpy(), 14)
df['avg_loss'] = rma(df.loss.to_numpy(), 14)
df['rs'] = df.avg_gain / df.avg_loss
df['rsi'] = 100 - (100 / (1 + df.rs))
For the 3173-row TSLA dataset linked in the comment below, it takes on my machine:
2 s for the pandas loop solution
23 ms for this array loop solution without numba
4 ms for this array loop solution with numba
Old Answer
Denoting the average gain as y and the current gain as x, we get y[i] = a*y[i-1] + b*x[i], where a = 13/14 and b = 1/14 for n = 14. Unwrapping the recursion leads to:
y[i] = a^i * y[0] + b * (a^(i-1)*x[1] + a^(i-2)*x[2] + ... + a*x[i-1] + x[i])
This can be efficiently calculated in numpy using cumsum (rma = running moving average):
import pandas as pd
import numpy as np
df = pd.DataFrame({'close':[4724.89, 4378.51,6463.00,9838.96,13716.36,10285.10,
10326.76,6923.91,9246.01,7485.01,6390.07,7730.93,
7011.21,6626.57,6371.93,4041.32,3702.90,3434.10,
3813.69,4103.95,5320.81,8555.00,10854.10]})
n = 14
def rma(x, n, y0):
a = (n-1) / n
ak = a**np.arange(len(x)-1, -1, -1)
return np.r_[np.full(n, np.nan), y0, np.cumsum(ak * x) / ak / n + y0 * a**np.arange(1, len(x)+1)]
df['change'] = df['close'].diff()
df['gain'] = df.change.mask(df.change < 0, 0.0)
df['loss'] = -df.change.mask(df.change > 0, -0.0)
df['avg_gain'] = rma(df.gain[n+1:].to_numpy(), n, np.nansum(df.gain.to_numpy()[:n+1])/n)
df['avg_loss'] = rma(df.loss[n+1:].to_numpy(), n, np.nansum(df.loss.to_numpy()[:n+1])/n)
df['rs'] = df.avg_gain / df.avg_loss
df['rsi_14'] = 100 - (100 / (1 + df.rs))
Output of df.round(2):
close change gain loss avg_gain avg_loss rs rsi rsi_14
0 4724.89 NaN NaN NaN NaN NaN NaN NaN NaN
1 4378.51 -346.38 0.00 346.38 NaN NaN NaN NaN NaN
2 6463.00 2084.49 2084.49 0.00 NaN NaN NaN NaN NaN
3 9838.96 3375.96 3375.96 0.00 NaN NaN NaN NaN NaN
4 13716.36 3877.40 3877.40 0.00 NaN NaN NaN NaN NaN
5 10285.10 -3431.26 0.00 3431.26 NaN NaN NaN NaN NaN
6 10326.76 41.66 41.66 0.00 NaN NaN NaN NaN NaN
7 6923.91 -3402.85 0.00 3402.85 NaN NaN NaN NaN NaN
8 9246.01 2322.10 2322.10 0.00 NaN NaN NaN NaN NaN
9 7485.01 -1761.00 0.00 1761.00 NaN NaN NaN NaN NaN
10 6390.07 -1094.94 0.00 1094.94 NaN NaN NaN NaN NaN
11 7730.93 1340.86 1340.86 0.00 NaN NaN NaN NaN NaN
12 7011.21 -719.72 0.00 719.72 NaN NaN NaN NaN NaN
13 6626.57 -384.64 0.00 384.64 NaN NaN NaN NaN NaN
14 6371.93 -254.64 0.00 254.64 931.61 813.96 1.14 53.37 53.37
15 4041.32 -2330.61 0.00 2330.61 865.06 922.29 0.94 48.40 48.40
16 3702.90 -338.42 0.00 338.42 803.27 880.59 0.91 47.70 47.70
17 3434.10 -268.80 0.00 268.80 745.90 836.89 0.89 47.13 47.13
18 3813.69 379.59 379.59 0.00 719.73 777.11 0.93 48.08 48.08
19 4103.95 290.26 290.26 0.00 689.05 721.60 0.95 48.85 48.85
20 5320.81 1216.86 1216.86 0.00 726.75 670.06 1.08 52.03 52.03
21 8555.00 3234.19 3234.19 0.00 905.86 622.20 1.46 59.28 59.28
22 10854.10 2299.10 2299.10 0.00 1005.37 577.75 1.74 63.51 63.51
Concerning your last question about performance: explicit loops in Python / pandas are terrible, avoid them whenever you can. If you can't, try Cython or numba.
There is an easier way: the talib package.
import talib
close = df['close']
rsi = talib.RSI(close, timeperiod=14)
If you'd like Bollinger Bands to go with your RSI, that is easy too.
upperBB, middleBB, lowerBB = talib.BBANDS(close, timeperiod=20, nbdevup=2, nbdevdn=2, matype=0)
You can use Bollinger Bands on RSI instead of the fixed reference levels of 70 and 30.
upperBBrsi, MiddleBBrsi, lowerBBrsi = talib.BBANDS(rsi, timeperiod=50, nbdevup=2, nbdevdn=2, matype=0)
Finally, you can normalize RSI using the %b calculation.
normrsi = (rsi - lowerBBrsi) / (upperBBrsi - lowerBBrsi)
Info on talib: https://mrjbq7.github.io/ta-lib/
Info on Bollinger Bands: https://www.BollingerBands.com
Here is an option; I will be touching only on your second bullet point.
# libraries required
import pandas as pd
import numpy as np
# create dataframe
df = pd.DataFrame({'close':[4724.89, 4378.51,6463.00,9838.96,13716.36,10285.10,
10326.76,6923.91,9246.01,7485.01,6390.07,7730.93,
7011.21,6626.57,6371.93,4041.32,3702.90,3434.10,
3813.69,4103.95,5320.81,8555.00,10854.10]})
df['change'] = df['close'].diff(1) # Calculate change
# calculate gain / loss from every change
df['gain'] = np.select([df['change']>0, df['change'].isna()],
[df['change'], np.nan],
default=0)
df['loss'] = np.select([df['change']<0, df['change'].isna()],
[-df['change'], np.nan],
default=0)
# create avg_gain / avg_loss columns with all nan
df['avg_gain'] = np.nan
df['avg_loss'] = np.nan
n = 14  # the smoothing window
# keep the first occurrence of the rolling mean as the seed value
df.loc[n, 'avg_gain'] = df['gain'].rolling(window=n).mean().dropna().iloc[0]
df.loc[n, 'avg_loss'] = df['loss'].rolling(window=n).mean().dropna().iloc[0]
# Alternatively
df.loc[n, 'avg_gain'] = df.loc[:n, 'gain'].mean()
df.loc[n, 'avg_loss'] = df.loc[:n, 'loss'].mean()
# This is not a pandas way (it loops through the rows), but it does what you need
for i in range(n+1, df.shape[0]):
    df.loc[i, 'avg_gain'] = (df.loc[i-1, 'avg_gain'] * (n - 1) + df.loc[i, 'gain']) / n
    df.loc[i, 'avg_loss'] = (df.loc[i-1, 'avg_loss'] * (n - 1) + df.loc[i, 'loss']) / n
# calculate rs and rsi
df['rs'] = df['avg_gain'] / df['avg_loss']
df['rsi'] = 100 - (100 / (1 + df['rs']))
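If you want to verify this against the expected results in the question, a quick spot-check:
print(df.loc[14:16, ['avg_gain', 'avg_loss', 'rs', 'rsi']].round(6))
# row 14 should show avg_gain 931.605000, avg_loss 813.959286, rsi 53.369848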
If you want to calculate the RSI of a time series using native pandas calls, you can use the following one-line code:
n=14
df['rsi14'] = 100 - (100 / (1 + df['Close'].diff(1).mask(df['Close'].diff(1) < 0, 0).ewm(alpha=1/n, adjust=False).mean() / df['Close'].diff(1).mask(df['Close'].diff(1) > 0, -0.0).abs().ewm(alpha=1/n, adjust=False).mean()))
And it's even faster than the numpy results (ms / loop):
rows np loop native
23 1.0 1.3 0.8
230 1.1 1.4 0.9
2300 1.1 1.3 0.9
23000 3.4 1.8 1.2
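For readability, the one-liner can also be unpacked into named steps; this is the same computation with the same ewm parameters, just spread over several lines:
n = 14
delta = df['Close'].diff(1)
gain = delta.mask(delta < 0, 0.0)         # keep positive changes, zero out the rest
loss = delta.mask(delta > 0, -0.0).abs()  # keep negative changes, as positive numbers
avg_gain = gain.ewm(alpha=1/n, adjust=False).mean()
avg_loss = loss.ewm(alpha=1/n, adjust=False).mean()
df['rsi14'] = 100 - (100 / (1 + avg_gain / avg_loss))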
This is the RSI code; replace everything that has "aa" in it:
import pandas as pd
rsi_period = 14
df = pd.Series(coinaalist)
chg = df.diff(1)
gain = chg.mask(chg < 0, 0)
loss = chg.mask(chg > 0, 0)
avg_gain = gain.ewm(com=rsi_period - 1, min_periods=rsi_period).mean()
avg_loss = loss.ewm(com=rsi_period - 1, min_periods=rsi_period).mean()
rs = abs(avg_gain / avg_loss)
crplaa = 100 - (100 / (1 + rs))
coinaarsi = crplaa.iloc[-1]
I gave +1 to lepi; however, his formula can be made even more pandorable:
n = 14
df['rsi14'] = df['Close'].diff(1).mask(df['Close'].diff(1) < 0, 0).ewm(alpha=1/n, adjust=False).mean().div(df['Close'].diff(1).mask(df['Close'].diff(1) > 0, -0.0).abs().ewm(alpha=1/n, adjust=False).mean()).add(1).rdiv(100).rsub(100)
So div() was used instead of /, and add(1).rdiv(100).rsub(100) instead of the remaining +, -, and / operations.
Suppose I have a df that looks like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 31.0 0.90
2 30 0.03 41.0 0.70
3 40 0.72 51.0 0.08
4 50 0.09 81.0 0.78
5 60 0.09 NaN NaN
6 70 0.01 NaN NaN
7 80 0.09 NaN NaN
8 90 0.08 NaN NaN
9 100 0.02 NaN NaN
In the posR column, we see that it jumps from 11 to 31, and there is no value in the 20s. I want to insert a value to fill that space, which would essentially just be the posF value for posR and NaN for rfreq, so my resulting df would look like this:
   posF  ffreq   posR  rfreq
0    10   0.50   11.0   0.08
1    20   0.20   20.0    NaN
2    30   0.03   31.0   0.90
3    40   0.72   41.0   0.70
4    50   0.09   50.0    NaN
5    60   0.09   60.0    NaN
6    70   0.01   70.0    NaN
7    80   0.09   80.0    NaN
8    90   0.08   81.0   0.78
9   100   0.02  100.0    NaN
So I want to fill the NaN values in the position with the values from posF that are in between the values in posR.
What I have tried to do is just make a dummy list and add values to the list based on whether they were less than a (I see the flaw here but I don't know how to fix it).
insert_rows = []
for x in df['posF']:
for a,b in zip(df['posR'], df['rfreq']):
if x<a:
insert_rows.append([x, 'NA'])
print(len(insert_rows))#21, should be 5
I realize that it appends x once for every value of a that is greater than x.
After this I will just create a new df and add these values to the original 2 columns so they are the same length.
If you can think of a better title, feel free to edit.
My first thought was to retrieve the new indices for the entries in posR by interpolating with posF and then put the values at their new positions. But as you want to have 81 one row later than here, I'm afraid this is not exactly what you're searching for, and I still don't really get the logic behind your task.
However, perhaps this is a starting point, let's see...
This approach would work like the following:
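(To run the steps below self-contained, the question's frame can be rebuilt first; this block is reconstructed from the table above, not part of the original answer.)
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'posF':  [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'ffreq': [0.50, 0.20, 0.03, 0.72, 0.09, 0.09, 0.01, 0.09, 0.08, 0.02],
    'posR':  [11.0, 31.0, 41.0, 51.0, 81.0] + [np.nan] * 5,
    'rfreq': [0.08, 0.90, 0.70, 0.08, 0.78] + [np.nan] * 5,
})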
Retrieve the new index positions of the values in posR according to their order in posF:
import numpy as np
idx = np.interp(df.posR, df.posF, df.index).round()
Get rid of nan entries and cast to int:
idx = idx[np.isfinite(idx)].astype(int)
Create the new columns by copying posF in the first step, and initialize newrfreq with NaN:
df['newposR'] = df.posF
df['newrfreq'] = np.nan
Then overwrite with the values from posR and rfreq, but now at the updated positions:
df.loc[idx, 'newposR'] = df.posR[:len(idx)].values
df.loc[idx, 'newrfreq'] = df.rfreq[:len(idx)].values
Result:
posF ffreq posR rfreq newposR newrfreq
0 10 0.50 11.0 0.08 11.0 0.08
1 20 0.20 31.0 0.90 20.0 NaN
2 30 0.03 41.0 0.70 31.0 0.90
3 40 0.72 51.0 0.08 41.0 0.70
4 50 0.09 81.0 0.78 51.0 0.08
5 60 0.09 NaN NaN 60.0 NaN
6 70 0.01 NaN NaN 70.0 NaN
7 80 0.09 NaN NaN 81.0 0.78
8 90 0.08 NaN NaN 90.0 NaN
9 100 0.02 NaN NaN 100.0 NaN
This is my first time trying pandas. I think I have a reasonable use case, but I am stumbling. I want to load a comma-delimited file into a pandas DataFrame, then group it by Symbol and plot it with the x-axis indexed by the TimeStamp column. Here is a subset of the data:
Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
Note two things about the TimeStamp column:
it has duplicate values and
the intervals are irregular.
I thought I could do something like this...
from pandas import *
import pylab as plt
df = read_csv('data.txt',index_col=5)
df = df.sort_index(ascending=False)
df.plot()
plt.show()
But the read_csv method raises an exception "Tried columns 1-X as index but found duplicates". Is there an option that will allow me to specify an index column with duplicate values?
I would also be interested in aligning my irregular timestamp intervals to one-second resolution. I would still wish to plot multiple events for a given second, but maybe I could introduce a unique index and then align my prices to it?
I created several issues just now to address some features / conveniences that I think would be nice to have: GH-856, GH-857, GH-858
We're currently working on a revamp of the time series capabilities, and alignment to one-second resolution is possible now (though not with duplicates, so you would need to write some functions for that). I also want to support duplicate timestamps in a better way. However, this is really panel (3D) data, so one way that you might alter things is the following:
In [29]: df.pivot('Symbol', 'TimeStamp').stack()
Out[29]:
M1 M2 Price Volume
Symbol TimeStamp
FUEL 9:58:40 AM 6 1.05 3.8544 100116
9:58:47 AM 7 1.07 3.8599 102116
9:59:09 AM 8 1.11 3.9099 105265
9:59:11 AM 9 1.15 3.9490 109674
GBR 9:57:52 AM 2 0.34 3.7500 47521
9:58:20 AM 3 0.45 3.8000 63211
9:58:24 AM 4 0.46 3.8300 64251
MPET 9:57:52 AM 3 0.26 1.4200 44600
ORBC 9:59:02 AM 2 0.22 3.4000 10509
SUNH 9:59:09 AM 6 0.09 4.3700 24394
TBET 9:59:05 AM 2 8.03 2.1800 1121629
9:59:14 AM 3 8.05 2.1900 1124179
XRA 9:58:08 AM 3 0.12 3.6167 42310
note that this created a MultiIndex. Another way I could have gotten this:
In [32]: df.set_index(['Symbol', 'TimeStamp'])
Out[32]:
Price M1 M2 Volume
Symbol TimeStamp
TBET 9:59:14 AM 2.1900 3 8.05 1124179
FUEL 9:59:11 AM 3.9490 9 1.15 109674
SUNH 9:59:09 AM 4.3700 6 0.09 24394
FUEL 9:59:09 AM 3.9099 8 1.11 105265
TBET 9:59:05 AM 2.1800 2 8.03 1121629
ORBC 9:59:02 AM 3.4000 2 0.22 10509
FUEL 9:58:47 AM 3.8599 7 1.07 102116
9:58:40 AM 3.8544 6 1.05 100116
GBR 9:58:24 AM 3.8300 4 0.46 64251
9:58:20 AM 3.8000 3 0.45 63211
XRA 9:58:08 AM 3.6167 3 0.12 42310
GBR 9:57:52 AM 3.7500 2 0.34 47521
MPET 9:57:52 AM 1.4200 3 0.26 44600
In [33]: df.set_index(['Symbol', 'TimeStamp']).sortlevel(0)
Out[33]:
Price M1 M2 Volume
Symbol TimeStamp
FUEL 9:58:40 AM 3.8544 6 1.05 100116
9:58:47 AM 3.8599 7 1.07 102116
9:59:09 AM 3.9099 8 1.11 105265
9:59:11 AM 3.9490 9 1.15 109674
GBR 9:57:52 AM 3.7500 2 0.34 47521
9:58:20 AM 3.8000 3 0.45 63211
9:58:24 AM 3.8300 4 0.46 64251
MPET 9:57:52 AM 1.4200 3 0.26 44600
ORBC 9:59:02 AM 3.4000 2 0.22 10509
SUNH 9:59:09 AM 4.3700 6 0.09 24394
TBET 9:59:05 AM 2.1800 2 8.03 1121629
9:59:14 AM 2.1900 3 8.05 1124179
XRA 9:58:08 AM 3.6167 3 0.12 42310
you can get this data in a true panel format like so:
In [35]: df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
Out[35]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 11 (major) x 7 (minor)
Items: Price to Volume
Major axis: 9:57:52 AM to 9:59:14 AM
Minor axis: FUEL to XRA
In [36]: panel = df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
In [37]: panel['Price']
Out[37]:
Symbol FUEL GBR MPET ORBC SUNH TBET XRA
TimeStamp
9:57:52 AM NaN 3.75 1.42 NaN NaN NaN NaN
9:58:08 AM NaN NaN NaN NaN NaN NaN 3.6167
9:58:20 AM NaN 3.80 NaN NaN NaN NaN NaN
9:58:24 AM NaN 3.83 NaN NaN NaN NaN NaN
9:58:40 AM 3.8544 NaN NaN NaN NaN NaN NaN
9:58:47 AM 3.8599 NaN NaN NaN NaN NaN NaN
9:59:02 AM NaN NaN NaN 3.4 NaN NaN NaN
9:59:05 AM NaN NaN NaN NaN NaN 2.18 NaN
9:59:09 AM 3.9099 NaN NaN NaN 4.37 NaN NaN
9:59:11 AM 3.9490 NaN NaN NaN NaN NaN NaN
9:59:14 AM NaN NaN NaN NaN NaN 2.19 NaN
you can then generate some plots from that data.
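For example, a minimal matplotlib sketch over the price panel (assuming the panel object built above; each symbol becomes one series, and markers make the sparse points visible):
import matplotlib.pyplot as plt

panel['Price'].plot(marker='o')
plt.show()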
Note here that the timestamps are still strings; I guess they could be converted to Python datetime.time objects and things might be a bit easier to work with. I don't have many plans to provide a lot of support for raw times vs. timestamps (date + time), but if enough people need it I suppose I can be convinced :)
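A sketch of that conversion (the format string is my assumption, based on the sample data):
# parse '9:59:14 AM'-style strings; %I is the 12-hour clock, %p the AM/PM marker
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'], format='%I:%M:%S %p').dt.time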
If you have multiple observations in a second for a single symbol, then some of the above methods will not work. But I want to build in better support for that in upcoming releases of pandas, so knowing your use cases will be helpful to me; consider joining the mailing list (pystatsmodels).