Compute pct_change up to NaN - python

I have a df with index extending past the last data point:
df
2022-01-31 96.210 21649.6
2022-02-28 96.390 21708.4
2022-03-31 97.410 21739.7
2022-04-30 98.630 21644.3
2022-05-31 103.744 21649.2
2022-06-30 102.498 21607.4
2022-07-31 105.138 21636.1
2022-08-31 105.450 21631.8
2022-09-30 109.691 21503.1
2022-10-31 111.745 21414.8
2022-11-30 111.481 21351.6
2022-12-31 104.728 NaN
2023-01-31 103.522 NaN
2023-02-28 NaN NaN
2023-03-31 NaN NaN
2023-04-30 NaN NaN
2023-05-31 NaN NaN
2023-06-30 NaN NaN
2023-07-31 NaN NaN
2023-08-31 NaN NaN
and when I compute pct_change, pandas treats the NaNs as values and extends the pct_change calculation past the last actual data point. So instead of stopping at 2023-01-31 for the first column, pandas continues computing values for 2023-02-28 and so on:
df.pct_change(12)
2022-01-31 0.069713 0.117543
2022-02-28 0.059464 0.106713
2022-03-31 0.069969 0.094989
2022-04-30 0.061336 0.076258
2022-05-31 0.140671 0.060263
2022-06-30 0.141022 0.056075
2022-07-31 0.135400 0.049517
2022-08-31 0.145573 0.038228
2022-09-30 0.186490 0.025632
2022-10-31 0.188271 0.012812
2022-11-30 0.187484 0.000112
2022-12-31 0.090576 -0.006440
2023-01-31 0.076000 -0.013765
2023-02-28 0.073991 -0.016436
2023-03-31 0.062745 -0.017852
2023-04-30 0.049600 -0.013523
2023-05-31 -0.002140 -0.013746
2023-06-30 0.009990 -0.011839
2023-07-31 -0.015370 -0.013149
2023-08-31 -0.018284 -0.012953
How can I tell pandas to compute pct_change only up to the NaN values, so that the output is:
2022-01-31 0.069713 0.117543
2022-02-28 0.059464 0.106713
2022-03-31 0.069969 0.094989
2022-04-30 0.061336 0.076258
2022-05-31 0.140671 0.060263
2022-06-30 0.141022 0.056075
2022-07-31 0.135400 0.049517
2022-08-31 0.145573 0.038228
2022-09-30 0.186490 0.025632
2022-10-31 0.188271 0.012812
2022-11-30 0.187484 0.000112
2022-12-31 0.090576 NaN
2023-01-31 0.076000 NaN
2023-02-28 NaN NaN
2023-03-31 NaN NaN
2023-04-30 NaN NaN
2023-05-31 NaN NaN
2023-06-30 NaN NaN
2023-07-31 NaN NaN
2023-08-31 NaN NaN
The following options won't work for me:
dropna
specifying a range, e.g. df = df[:'2023-01-31'].pct_change(12)
I need to keep the index past the last data point, and I am doing a lot of different pct_change calls, so specifying a range every time is too time-consuming; I am looking for a nicer solution.
Also, specifying a range won't work because when the df gets new data points I will have to manually adjust all of the ranges, which is no bueno.

If you look at the documentation for df.pct_change, you'll find that it has a fill_method parameter, which defaults to pad for handling NaN values before the changes are computed. pad (or ffill) means that the function propagates the last valid observation forward. E.g. in your first column, with periods=12, when it reaches the index value 2023-02-28, instead of using the NaN value the function picks the last valid value (103.522 at 2023-01-31) to do the calculation:
103.522/96.390-1
# 0.0739910779126467 (96.390 being the value at `2022-02-28`, one year earlier)
To avoid this, set fill_method to 'bfill' (cf. the documentation for df.fillna for these methods). Because your NaN values are trailing, there is nothing to backfill them with, so they stay NaN and the computation stops at the last real observation. The result is shown below with periods=3 (I trust you are only showing the last rows of a longer df, because otherwise all the values before 2023-01-31 would be NaN):
df.pct_change(periods=3, fill_method='bfill')
col1 col2
2022-01-31 NaN NaN
2022-02-28 NaN NaN
2022-03-31 NaN NaN
2022-04-30 0.025153 -0.000245
2022-05-31 0.076294 -0.002727
2022-06-30 0.052233 -0.006086
2022-07-31 0.065984 -0.000379
2022-08-31 0.016444 -0.000804
2022-09-30 0.070177 -0.004827
2022-10-31 0.062841 -0.010228
2022-11-30 0.057193 -0.012953
2022-12-31 -0.045245 NaN
2023-01-31 -0.073587 NaN
2023-02-28 NaN NaN
2023-03-31 NaN NaN
2023-04-30 NaN NaN
2023-05-31 NaN NaN
2023-06-30 NaN NaN
2023-07-31 NaN NaN
2023-08-31 NaN NaN
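Since the NaN values in your frame are trailing, another option that should give exactly the output you asked for, in pandas versions that accept fill_method=None, is to skip the filling step entirely; a minimal sketch:
# No filling at all: wherever the current or the 12-months-earlier value
# is NaN, the result is NaN, so each series stops at its last real observation.
df.pct_change(12, fill_method=None)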

You can also pass limit=1 to pct_change, which caps the forward-fill at a single consecutive NaN, so the comparison stops after one filled value.
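A small sketch of that behaviour on made-up numbers (assuming a pandas version where pct_change still accepts the fill_method/limit keywords; they were deprecated in pandas 2.1):
import numpy as np
import pandas as pd

s = pd.Series(
    [100.0, 102.0, 105.0, 103.0, np.nan, np.nan],
    index=pd.date_range("2022-01-31", periods=6, freq="M"),
)

# Default: trailing NaNs are forward-filled without limit,
# so changes keep being computed for every trailing row
print(s.pct_change(3))

# limit=1: at most one consecutive NaN is forward-filled,
# so only one period past the last real observation gets a value
print(s.pct_change(3, limit=1))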

Related

yfinance shows 2 rows for the same day with NaN values

I'm using the yfinance library with 2 tickers (^BVSP and BRL=X), but when I display the dataframe it shows 2 rows per day, where each row contains the information of only one ticker. The information for the other ticker is NaN. I want to put all the information in one row.
How can I solve this?
I tried this:
import datetime
import yfinance as yf

dados_bolsa = ["^BVSP", "BRL=X"]
today = datetime.datetime.now()
one_year = today - datetime.timedelta(days=365)
print(one_year)
dados_mercado = yf.download(dados_bolsa, one_year, today)
display(dados_mercado)
and I get:
2022-02-06 13:27:29.158181
[*********************100%***********************] 2 of 2 completed
Adj Close Close High Low Open Volume
BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP
Date
2022-02-07 00:00:00+00:00 5.3269 NaN 5.3269 NaN 5.3430 NaN 5.276800 NaN 5.326200 NaN 0.0 NaN
2022-02-07 03:00:00+00:00 NaN 111996.00000 NaN 111996.00000 NaN 112517.000000 NaN 111490.00000 NaN 112247.000000 NaN 10672800.0
2022-02-08 00:00:00+00:00 5.2626 NaN 5.2626 NaN 5.2849 NaN 5.251000 NaN 5.262800 NaN 0.0 NaN
2022-02-08 03:00:00+00:00 NaN 112234.00000 NaN 112234.00000 NaN 112251.000000 NaN 110943.00000 NaN 111995.000000 NaN 10157500.0
2022-02-09 00:00:00+00:00 5.2584 NaN 5.2584 NaN 5.2880 NaN 5.232774 NaN 5.256489 NaN 0.0 NaN
Note that we have 2 rows for the same day, each with NaN values. I want just one row with all the information.
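One approach that might work, under the assumption that the duplicate rows come only from the two tickers reporting different intraday timestamps on the same calendar day, is to group the downloaded frame by date and keep the first non-null value per column. A rough sketch building on the dados_mercado frame above:
# Collapse the per-ticker rows of each day into a single row:
# GroupBy.first() returns the first non-null entry per column, so the
# BRL=X row and the ^BVSP row for the same date get merged.
dados_mercado = dados_mercado.groupby(dados_mercado.index.date).first()
display(dados_mercado)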

Pandas trying to make values within a column into new columns after groupby on column

My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, the time has been bucketed into 5-minute intervals, each distinct value of the variables column becomes its own column, and each bucket is labelled with the first timestamp that falls into it.
In order to solve this, I have tried a couple of different solutions, but I can't seem to get anything working without constant errors.
Try unstacking the variables column from rows to columns with .unstack(1). The argument is 1 because we want the second index level (0 would be the first).
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
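For reference, the same reshape can also be written with pivot_table, which folds the set_index/unstack step and the mean aggregation into one call. A sketch starting again from the original long-format df, under the assumption that duplicate timestamp/variable pairs should be averaged:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

# Wide frame: one column per value of 'variables', duplicates averaged
wide = df.pivot_table(index='timestamp', columns='variables', values='value', aggfunc='mean')

# Bucket the wide frame into 5-minute bins
out = wide.groupby(pd.Grouper(freq='5min')).mean().reset_index()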

Python Pandas v0.18+: is there a way to resample a dataframe without filling NAs?

I wonder if there is a way to upsample a DataFrame without having to decide immediately how NAs should be filled.
I tried the following but got the Future Warning:
FutureWarning: .resample() is now a deferred operation use .resample(...).mean() instead of .resample(...)
Code:
import pandas as pd
dates = pd.date_range('2015-01-01', '2016-01-01', freq='BM')
dummy = [i for i in range(len(dates))]
df = pd.DataFrame({'A': dummy})
df.index = dates
df.resample('B')
Is there a better way to do this, that doesn't show warnings?
Thanks.
Use Resampler.asfreq:
print (df.resample('B').asfreq())
A
2015-01-30 0.0
2015-02-02 NaN
2015-02-03 NaN
2015-02-04 NaN
2015-02-05 NaN
2015-02-06 NaN
2015-02-09 NaN
2015-02-10 NaN
2015-02-11 NaN
2015-02-12 NaN
2015-02-13 NaN
2015-02-16 NaN
2015-02-17 NaN
2015-02-18 NaN
2015-02-19 NaN
2015-02-20 NaN
2015-02-23 NaN
2015-02-24 NaN
2015-02-25 NaN
2015-02-26 NaN
2015-02-27 1.0
2015-03-02 NaN
2015-03-03 NaN
2015-03-04 NaN
...
...
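The point of asfreq is that the new rows are left as NaN, so the decision about how to fill can be deferred to a separate, later step, for example:
up = df.resample('B').asfreq()   # upsample; gaps stay as NaN
filled = up.ffill()              # or up.interpolate(), up.fillna(0), ...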

Python Pandas resample, odd behaviour

I have 2 datasets (cex2.txt and cex3.txt) which I would like to resample in pandas. With one dataset I get the expected output, with the other I do not.
The datasets are tick data and are formatted exactly the same; they are simply from two different days.
import pandas as pd
import datetime as dt
pd.set_option ('display.mpl_style', 'default')
time_converter = lambda x: dt.datetime.fromtimestamp(float(x))
data_frame = pd.read_csv('cex2.txt', sep=';', converters={'time': time_converter})
data_frame.drop('Unnamed: 7', axis=1, inplace=True)
data_frame.drop('low', axis=1, inplace=True)
data_frame.drop('high', axis=1, inplace=True)
data_frame.drop('last', axis=1, inplace=True)
data_frame = data_frame.reindex_axis(['time', 'ask', 'bid', 'vol'], axis=1)
data_frame.set_index(pd.DatetimeIndex(data_frame['time']), inplace=True)
ask = data_frame['ask'].resample('15Min', how='ohlc')
bid = data_frame['bid'].resample('15Min', how='ohlc')
vol = data_frame['vol'].resample('15Min', how='sum')
print ask
from the cex2.txt dataset I get this wrong output:
open high low close
1970-01-01 01:00:00 NaN NaN NaN NaN
1970-01-01 01:15:00 NaN NaN NaN NaN
1970-01-01 01:30:00 NaN NaN NaN NaN
1970-01-01 01:45:00 NaN NaN NaN NaN
1970-01-01 02:00:00 NaN NaN NaN NaN
1970-01-01 02:15:00 NaN NaN NaN NaN
1970-01-01 02:30:00 NaN NaN NaN NaN
1970-01-01 02:45:00 NaN NaN NaN NaN
1970-01-01 03:00:00 NaN NaN NaN NaN
1970-01-01 03:15:00 NaN NaN NaN NaN
from the cex3.txt dataset I get correct values:
open high low close
2014-08-10 13:30:00 0.003483 0.003500 0.003483 0.003485
2014-08-10 13:45:00 0.003485 0.003570 0.003467 0.003471
2014-08-10 14:00:00 0.003471 0.003500 0.003470 0.003494
2014-08-10 14:15:00 0.003494 0.003500 0.003493 0.003498
2014-08-10 14:30:00 0.003498 0.003549 0.003498 0.003500
2014-08-10 14:45:00 0.003500 0.003533 0.003487 0.003533
2014-08-10 15:00:00 0.003533 0.003600 0.003520 0.003587
I'm really at my wits' end. Does anyone have an idea why this happens?
Edit:
Here are the data sources:
https://dl.dropboxusercontent.com/u/14055520/cex2.txt
https://dl.dropboxusercontent.com/u/14055520/cex3.txt
Thanks!
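One thing worth checking, as a guess without access to the files: an index that starts at 1970-01-01 usually means the converted 'time' values were close to zero, i.e. time_converter did not receive real Unix timestamps for that file (for example because of a stray header row or a different unit such as milliseconds). A quick sanity check on the parsed data:
# If cex2.txt's 'time' column is not plain Unix seconds,
# fromtimestamp() will map the values close to the 1970 epoch.
print(data_frame['time'].head())
print(data_frame.index.min(), data_frame.index.max())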

Extract Pandas index value as a single Timestamp variable, not as an index

So I have a DataFrame:
Units fcast currerr curpercent fcastcum unitscum cumerrpercent
2013-09-01 3561 NaN NaN NaN NaN NaN NaN
2013-10-01 3480 NaN NaN NaN NaN NaN NaN
2013-11-01 3071 NaN NaN NaN NaN NaN NaN
2013-12-01 3234 NaN NaN NaN NaN NaN NaN
2014-01-01 2610 2706 -96 -3.678161 2706 2610 -3.678161
2014-02-01 NaN 3117 NaN NaN 5823 NaN NaN
2014-03-01 NaN 3943 NaN NaN 9766 NaN NaN
And I want to load a value, the index of the current month (found by getting the last item that has Units filled in), into a variable curr_month, which will have a number of uses (including text display and use as a slicing operator).
This is way ugly but almost works:
curr_month=mergederrs['Units'].dropna()
curr_month=curr_month[-1:].index
curr_month
But curr_month is
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01]
Length: 1, Freq: None, Timezone: None
which is unhashable, so this fails:
mergederrs[curr_month:]
The docs are great for creating the DF but a bit sparse on getting individual items out!
I'd probably write
>>> df.Units.last_valid_index()
Timestamp('2014-01-01 00:00:00')
but a slight tweak on your approach should work too:
>>> df.Units.dropna().index[-1]
Timestamp('2014-01-01 00:00:00')
It's the difference between somelist[-1:] and somelist[-1].
[Note that I'm assuming that all of the nan values come at the end. If there are valids and then NaNs and then valids, and you want the last valid in the first group, that would be slightly different.]
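For completeness, the scalar Timestamp works directly in the places where the one-element DatetimeIndex did not; a quick sketch:
curr_month = df.Units.last_valid_index()   # a single Timestamp, not an Index

print(curr_month.strftime('%Y-%m'))        # text display
df.loc[curr_month:]                        # label-based slicing now works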
