Get lagged data in pandas - python

I want to get the lagged data from a dataset. The dataset is monthly and looks like this:
Final Profits
JCCreateDate
2016-04-30 31163371.59
2016-05-31 27512300.34
...
2019-02-28 16800693.82
2019-03-31 5384227.13
Out of the above dataset, I've selected a window of data (the last 12 months) from which I want to look back 3, 6, 9 and 12 months.
I've created the window dataset like this:
df_all = pd.read_csv('dataset.csv')
df = pd.read_csv('window_dataset.csv')
data_start, data_end = pd.to_datetime(df.first_valid_index()), pd.to_datetime(df.last_valid_index())
dr = pd.date_range(data_start, data_end, freq='M')
Now, for the date range dr, I want to subtract months. Suppose I subtract 3 months from dr and try to retrieve the data from df_all:
df_all.loc[dr - pd.DateOffset(months=3)]
which gives me the following output:
Final Profits
2018-01-30 NaN
2018-02-28 9240766.46
2018-03-30 NaN
2018-04-30 13250515.05
2018-05-31 12539224.15
2018-06-30 17778326.04
2018-07-31 19345671.02
2018-08-30 NaN
2018-09-30 14815607.14
2018-10-31 28979099.74
2018-11-28 NaN
2018-12-31 12395273.24
As you can see, I get some NaN values because months like January and March have 31 days, so the subtraction looks up the wrong day of the month. How can I deal with this?

I'm not 100% sure what you are looking for, but I suspect you want shift.
import numpy as np
import pandas as pd
# set up dataframe
index = pd.date_range(start='2016-04-30', end='2019-03-31', freq='M')
df = pd.DataFrame(np.random.randint(5000000, 50000000, 36), index=index, columns=['Final Profits'])
# create three columns by shifting and subtracting from 'Final Profits'
df['3mos'] = df['Final Profits'] - df['Final Profits'].shift(3)
df['6mos'] = df['Final Profits'] - df['Final Profits'].shift(6)
df['9mos'] = df['Final Profits'] - df['Final Profits'].shift(9)
print(df.head(12))
Final Profits 3mos 6mos 9mos
2016-04-30 45197972 NaN NaN NaN
2016-05-31 5029292 NaN NaN NaN
2016-06-30 20310120 NaN NaN NaN
2016-07-31 10514197 -34683775.0 NaN NaN
2016-08-31 31219405 26190113.0 NaN NaN
2016-09-30 21504727 1194607.0 NaN NaN
2016-10-31 19234437 8720240.0 -25963535.0 NaN
2016-11-30 18881711 -12337694.0 13852419.0 NaN
2016-12-31 27237712 5732985.0 6927592.0 NaN
2017-01-31 21692788 2458351.0 11178591.0 -23505184.0
2017-02-28 7869701 -11012010.0 -23349704.0 2840409.0
2017-03-31 20943248 -6294464.0 -561479.0 633128.0
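If the goal is to keep the date-based lookup from the question rather than switch to shift, one option is to step back by month-ends instead of calendar months. The following is a minimal sketch under the assumption that every index value is a month-end date; the frame and its values are made up for illustration, and only the names df_all and dr come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_all: 36 month-end rows with made-up profits.
idx = pd.date_range("2016-04-30", periods=36, freq=pd.offsets.MonthEnd())
df_all = pd.DataFrame({"Final Profits": np.arange(36, dtype=float)}, index=idx)

# The 12-month window: the last 12 month-ends.
dr = df_all.index[-12:]

# DateOffset(months=3) keeps the day of month (30 April -> 30 January),
# whereas MonthEnd(3) steps back three month-ends (30 April -> 31 January).
lag3_index = dr - pd.offsets.MonthEnd(3)
lag3 = df_all.loc[lag3_index]
print(lag3)
```

Because every shifted date lands on a real month-end, the lookup finds a row for each of the 12 dates. The same pattern should work for 6, 9 and 12 months as long as df_all reaches back far enough.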

Related

Concatenate pandas DataFrames on columns, similar to outer merge

I have 3 dataframes with dates in the first column of each. I would like to concatenate them so that rows are aligned on the date value: where the values match, they go on the same row; otherwise I expect a NaN.
import numpy as np
import pandas as pd
# Create the pandas DataFrame
df1 = pd.DataFrame(['2018-12-31','2019-09-30','2022-01-31'], columns = ['Date1'])
df2 = pd.DataFrame(['2019-09-30','2022-02-28'], columns = ['Date2'])
df3 = pd.DataFrame(['2019-09-30','2021-06-30','2021-11-30','2022-03-31'], columns = ['Date3'])
display(df1)
display(df2)
display(df3)
data = {'Date1': ['2018-12-31','2019-09-30',np.nan,np.nan,'2022-01-31',np.nan,np.nan],
'Date2': [np.nan,'2019-09-30',np.nan,np.nan,np.nan,'2022-02-28',np.nan],
'Date3': [np.nan,'2019-09-30','2021-06-30','2021-11-30',np.nan,np.nan,'2022-03-31']}
desired_df = pd.DataFrame(data)
desired_df
This is what I am trying to achieve.
Date1 Date2 Date3
0 2018-12-31 NaN NaN
1 2019-09-30 2019-09-30 2019-09-30
2 NaN NaN 2021-06-30
3 NaN NaN 2021-11-30
4 2022-01-31 NaN NaN
5 NaN 2022-02-28 NaN
6 NaN NaN 2022-03-31
My original idea was to use something like:
pd.concat([df1,df2,df3], axis=1, join="outer")
However, the above produces something like:
Date1 Date2 Date3
2018-12-31 2019-09-30 2019-09-30
2019-09-30 2022-02-28 2021-06-30
2022-01-31 NaN 2021-11-30
NaN NaN 2022-03-31
We could set_index with the Dates (by setting the drop parameter to False, we don't lose the column), then concat horizontally:
out = (pd.concat([df.set_index(f'Date{i+1}', drop=False)
                  for i, df in enumerate([df1, df2, df3])], axis=1)
       .sort_index().reset_index(drop=True))
Output:
Date1 Date2 Date3
0 2018-12-31 NaN NaN
1 2019-09-30 2019-09-30 2019-09-30
2 NaN NaN 2021-06-30
3 NaN NaN 2021-11-30
4 2022-01-31 NaN NaN
5 NaN 2022-02-28 NaN
6 NaN NaN 2022-03-31
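For comparison, here is a sketch of a different route to the same table: build the sorted union of all dates, then mask each column with isin. This is not the set_index approach above, just an alternative that may be easier to read. It assumes the dates are ISO-formatted strings (so plain string sorting is chronological) and that no date repeats within a frame.

```python
import pandas as pd

df1 = pd.DataFrame(['2018-12-31', '2019-09-30', '2022-01-31'], columns=['Date1'])
df2 = pd.DataFrame(['2019-09-30', '2022-02-28'], columns=['Date2'])
df3 = pd.DataFrame(['2019-09-30', '2021-06-30', '2021-11-30', '2022-03-31'], columns=['Date3'])

# Sorted union of every date that appears in any of the three frames.
all_dates = pd.Series(sorted(set(df1['Date1']) | set(df2['Date2']) | set(df3['Date3'])))

# For each column, keep the date where that frame contains it, otherwise NaN.
out = pd.DataFrame({
    col: all_dates.where(all_dates.isin(frame[col]))
    for frame, col in [(df1, 'Date1'), (df2, 'Date2'), (df3, 'Date3')]
})
print(out)
```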

Filling NaN rows in big pandas datetime indexed dataframe using other not NaN rows values

I have a big weather CSV loaded into a dataframe with several hundred thousand rows and many columns. The rows form a time series sampled every 10 minutes over many years. The datetime index consists of year, month, day, hour, minute and second. Unfortunately, several thousand rows are missing and show up as all-NaN rows once the data is put on a regular 10-minute grid. The goal is to fill these using the values of rows recorded at the same date and time in other years, provided those are not NaN.
I wrote a Python for loop to do this, but it is very slow. I need help finding a faster, more efficient solution.
The raw dataframe is as follows:
print(df)
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:10:00 996.52 -8.02 265.40 -8.90 93.30
2004-01-01 00:20:00 996.57 -8.41 265.01 -9.28 93.40
2004-01-01 00:40:00 996.51 -8.31 265.12 -9.07 94.20
2004-01-01 00:50:00 996.51 -8.27 265.15 -9.04 94.10
2004-01-01 01:00:00 996.53 -8.51 264.91 -9.31 93.90
... ... ... ... ... ...
2020-12-31 23:20:00 1000.07 -4.05 269.10 -8.13 73.10
2020-12-31 23:30:00 999.93 -3.35 269.81 -8.06 69.71
2020-12-31 23:40:00 999.82 -3.16 270.01 -8.21 67.91
2020-12-31 23:50:00 999.81 -4.23 268.94 -8.53 71.80
2021-01-01 00:00:00 999.82 -4.82 268.36 -8.42 75.70
[820551 rows x 5 columns]
For some reason, there are missing rows in the df dataframe. To identify them, the function below can be applied:
findnanrows(df.groupby(pd.Grouper(freq='10T')).mean())
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 NaN NaN NaN NaN NaN
2009-10-08 09:50:00 NaN NaN NaN NaN NaN
2009-10-08 10:00:00 NaN NaN NaN NaN NaN
2013-05-16 09:00:00 NaN NaN NaN NaN NaN
2014-07-30 08:10:00 NaN NaN NaN NaN NaN
... ... ... ... ... ...
2016-10-28 12:00:00 NaN NaN NaN NaN NaN
2016-10-28 12:10:00 NaN NaN NaN NaN NaN
2016-10-28 12:20:00 NaN NaN NaN NaN NaN
2016-10-28 12:30:00 NaN NaN NaN NaN NaN
2016-10-28 12:40:00 NaN NaN NaN NaN NaN
[5440 rows x 5 columns]
The aim is to fill all these NaN rows. For example, the first NaN row, at datetime 2004-01-01 00:30:00, should be filled with the non-NaN values of a row recorded at the same date and time in another year (xxxx-01-01 00:30:00), such as 2005-01-01 00:30:00 or 2006-01-01 00:30:00, or even 2003-01-01 00:30:00 or 2002-01-01 00:30:00 if they exist. Averaging over all these other years is also acceptable.
Seeing the values of the row with the datetime index 2005-01-01 00:30:00:
print(df.loc["2005-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2005-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
After filling the row corresponding to the index datetime 2004-01-01 00:30:00 using the values of the row having the index datetime 2005-01-01 00:30:00, the df dataframe will have the following row:
print(df.loc["2004-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
The two functions that I created are the following. The first is to identify the NaN rows. The second is to fill them.
def findnanrows(df):
    is_NaN = df.isnull()
    row_has_NaN = is_NaN.any(axis=1)
    rows_with_NaN = df[row_has_NaN]
    return rows_with_NaN
def filldata(weatherdata):
    fillweatherdata = weatherdata.copy()
    allyears = fillweatherdata.index.year.unique().tolist()
    dfnan = findnanrows(fillweatherdata.groupby(pd.Grouper(freq='10T')).mean())
    for i in range(dfnan.shape[0]):
        dnan = dfnan.index[i]
        if dnan.year == min(allyears):
            y = 0
            dnew = dnan.replace(year=dnan.year+y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year+y)
                y += 1
        else:
            y = 0
            dnew = dnan.replace(year=dnan.year-y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year-y)
                y += 1
        new_row = pd.DataFrame(np.array([fillweatherdata.loc[dnew, :]]).tolist(), columns=fillweatherdata.columns.tolist(), index=[dnan])
        fillweatherdata = pd.concat([fillweatherdata, pd.DataFrame(new_row)], ignore_index=False)
    #fillweatherdata = fillweatherdata.drop_duplicates()
    fillweatherdata = fillweatherdata.sort_index()
    return fillweatherdata
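One way to avoid the row-by-row loop is to put the data on a regular grid and fill every gap with the mean of the same calendar slot across all years. The sketch below illustrates that idea and is not a drop-in replacement for filldata: the function name fill_from_other_years is made up, the toy data is daily rather than 10-minute to keep it small, and it averages over all available years instead of picking a single neighbouring year.

```python
import numpy as np
import pandas as pd

def fill_from_other_years(df, freq="10min"):
    """Fill missing time slots with the mean of the same slot in other years."""
    # Regular grid: slots absent from the input become all-NaN rows.
    full = df.resample(freq).mean()
    # One key per calendar slot, ignoring the year (e.g. "01-01 00:30").
    slot = full.index.strftime("%m-%d %H:%M")
    # Per-slot mean over all years; NaN values are skipped by mean.
    slot_mean = full.groupby(slot).transform("mean")
    # Only NaN cells are replaced; observed values are left untouched.
    return full.fillna(slot_mean)

# Toy data (hypothetical): three years of daily values with one row removed.
idx = pd.date_range("2004-01-01", "2006-12-31", freq="D")
toy = pd.DataFrame({"T (degC)": np.arange(len(idx), dtype=float)}, index=idx)
toy = toy.drop(pd.Timestamp("2004-03-01"))  # simulate a missing row

filled = fill_from_other_years(toy, freq="D")
print(filled.loc[pd.Timestamp("2004-03-01")])
```

A slot that is missing in every year (for example, a gap on 29 February when only one leap year is covered) would stay NaN and need separate handling.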

Inserting missing quarterly earnings dates within index

I have this df:
revenue pct_yoy pct_qoq
2020-06-30 99.721 0.479013 0.092833
2020-03-31 91.250 0.478283 0.087216
2019-12-31 83.930 0.676253 0.135094
2019-09-30 73.941 NaN 0.096657
2019-06-30 67.424 NaN 0.092293
2019-03-31 61.727 NaN 0.232814
2018-09-30 50.070 NaN NaN
However, looking at the 2018 entries, 2018-12-31 is missing from what should be a sequential quarterly time series: the index jumps straight from 2019-03-31 to 2018-09-30.
How can I ensure that any missing quarterly dates are inserted, with NaN values in their respective columns?
I'm not quite sure how to approach this problem.
You'll need to generate a list of your own quarterly dates that includes the missing dates. Then you can use .reindex to re-align your dataframe to this new list of dates.
# Get the oldest and newest dates which will be the bounds
# for our new Index
first_date = df.index.min()
last_date = df.index.max()
# Generate dates for every 3 months (3M) from first_date up to last_date
quarterly = pd.date_range(first_date, last_date, freq="3M")
# realign our dataframe using our new quarterly date index
# this will fill NaN for dates that did not exist in the
# original index
out = df.reindex(quarterly)
# if you want to order this from most recent date to least recent date
# do: out.sort_index(ascending=False)
print(out)
revenue pct_yoy pct_qoq
2018-09-30 50.070 NaN NaN
2018-12-31 NaN NaN NaN
2019-03-31 61.727 NaN 0.232814
2019-06-30 67.424 NaN 0.092293
2019-09-30 73.941 NaN 0.096657
2019-12-31 83.930 0.676253 0.135094
2020-03-31 91.250 0.478283 0.087216
2020-06-30 99.721 0.479013 0.092833
If your data contains only quarter-end dates, as in the sample, you can use resample and asfreq to fill in the missing quarter-ends:
df_final = df.resample('Q').asfreq()[::-1]
Output:
revenue pct_yoy pct_qoq
2020-06-30 99.721 0.479013 0.092833
2020-03-31 91.250 0.478283 0.087216
2019-12-31 83.930 0.676253 0.135094
2019-09-30 73.941 NaN 0.096657
2019-06-30 67.424 NaN 0.092293
2019-03-31 61.727 NaN 0.232814
2018-12-31 NaN NaN NaN
2018-09-30 50.070 NaN NaN
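As a self-contained illustration of the reindex approach, here is a small sketch using only the revenue column from the sample. It passes an explicit pd.offsets.QuarterEnd() object instead of a frequency string, on the assumption that this sidesteps the renaming of the string aliases in newer pandas releases.

```python
import numpy as np
import pandas as pd

# Revenue column from the question, indexed by quarter-end (2018-12-31 is missing).
df = pd.DataFrame(
    {"revenue": [99.721, 91.250, 83.930, 73.941, 67.424, 61.727, 50.070]},
    index=pd.to_datetime(["2020-06-30", "2020-03-31", "2019-12-31", "2019-09-30",
                          "2019-06-30", "2019-03-31", "2018-09-30"]),
)

# Every quarter-end between the oldest and newest date in the index.
quarter_ends = pd.date_range(df.index.min(), df.index.max(), freq=pd.offsets.QuarterEnd())

# Rows for dates that were not in the original index come back as NaN.
out = df.reindex(quarter_ends)
print(out)
```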

pandas: given a start and end date, add a column for each day in between, then add values?

This is my data:
df = pd.DataFrame([
    {'start_date': '2019/12/01', 'end_date': '2019/12/05', 'spend': 10000, 'campaign_id': 1},
    {'start_date': '2019/12/05', 'end_date': '2019/12/09', 'spend': 50000, 'campaign_id': 2},
    {'start_date': '2019/12/01', 'end_date': '', 'spend': 10000, 'campaign_id': 3},
    {'start_date': '2019/12/01', 'end_date': '2019/12/01', 'spend': 50, 'campaign_id': 4},
])
I need to add a column to each row for each day since 2019/12/01, and calculate the spend on that campaign that day, which I'll get by dividing the spend on the campaign by the total number of days it was active.
So here I'd add a column for each day between 1 December and today (10 December). For row 1, the five columns for 1 Dec to 5 Dec would each contain 2000, and the five columns from 6 Dec to 10 Dec would contain zero.
I know pandas is well-designed for this kind of problem, but I have no idea where to start!
This doesn't seem like a straightforward task to me. But first, convert your date columns if you haven't already:
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
Then create a helper function for resampling:
def resampler(data, daterange):
    temp = (data.set_index('start_date').groupby('campaign_id')
            .apply(daterange)
            .drop("campaign_id",axis=1)
            .reset_index().rename(columns={"level_1":"start_date"}))
    return temp
Now it's a three-step process. First, resample your data according to the end_date of each group:
df1 = resampler(df, lambda d: d.reindex(pd.date_range(min(d.index),max(d["end_date"]),freq="D")) if d["end_date"].notnull().all() else d)
df1["spend"] = df1.groupby("campaign_id")["spend"].transform(lambda x: x.mean()/len(x))
With the average values calculated, resample again to current date:
dates = pd.date_range(min(df["start_date"]),pd.Timestamp.today(),freq="D")
df1 = resampler(df1,lambda d: d.reindex(dates))
Finally transpose your dataframe:
df1 = pd.concat([df1.drop("end_date",axis=1).set_index(["campaign_id","start_date"]).unstack(),
df1.groupby("campaign_id")["end_date"].min()], axis=1)
df1.columns = [*dates,"end_date"]
print (df1)
#
2019-12-01 00:00:00 2019-12-02 00:00:00 2019-12-03 00:00:00 2019-12-04 00:00:00 2019-12-05 00:00:00 2019-12-06 00:00:00 2019-12-07 00:00:00 2019-12-08 00:00:00 2019-12-09 00:00:00 2019-12-10 00:00:00 end_date
campaign_id
1 2000.0 2000.0 2000.0 2000.0 2000.0 NaN NaN NaN NaN NaN 2019-12-05
2 NaN NaN NaN NaN 10000.0 10000.0 10000.0 10000.0 10000.0 NaN 2019-12-09
3 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
4 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2019-12-01
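For comparison, a more direct sketch of the same computation: work out each campaign's daily spend, then write it into a campaign-by-day table. It differs from the output above in two labelled assumptions: days outside a campaign are 0 rather than NaN (as the question asked), and a campaign with no end_date is treated as running through the report date. The name report_end and the fixed date 2019-12-10 ("today" in the question) are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'start_date': '2019/12/01', 'end_date': '2019/12/05', 'spend': 10000, 'campaign_id': 1},
    {'start_date': '2019/12/05', 'end_date': '2019/12/09', 'spend': 50000, 'campaign_id': 2},
    {'start_date': '2019/12/01', 'end_date': None, 'spend': 10000, 'campaign_id': 3},  # open-ended
    {'start_date': '2019/12/01', 'end_date': '2019/12/01', 'spend': 50, 'campaign_id': 4},
])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

report_end = pd.Timestamp('2019-12-10')   # assumption: "today" in the question
end = df['end_date'].fillna(report_end)   # assumption: open-ended campaigns are still running

# Number of active days (inclusive) and the spend per active day.
n_days = (end - df['start_date']).dt.days + 1
daily = df['spend'] / n_days

# One row per campaign, one column per day, zero outside the active range.
days = pd.date_range(df['start_date'].min(), report_end, freq='D')
wide = pd.DataFrame(0.0, index=pd.Index(df['campaign_id'], name='campaign_id'), columns=days)
for cid, start, stop, amount in zip(df['campaign_id'], df['start_date'], end, daily):
    wide.loc[cid, start:stop] = amount    # label slice on the columns, inclusive of both ends

print(wide)
```

The loop runs once per campaign rather than once per day, so it should stay cheap unless there are very many campaigns.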

Python Pandas resample, odd behaviour

I have 2 datasets (cex2.txt and cex3.txt) which I would like to resample in pandas. With one dataset I get the expected output; with the other I do not.
The datasets are tick data and are formatted identically. They simply come from two different days.
import pandas as pd
import datetime as dt
pd.set_option ('display.mpl_style', 'default')
time_converter = lambda x: dt.datetime.fromtimestamp(float(x))
data_frame = pd.read_csv('cex2.txt', sep=';', converters={'time': time_converter})
data_frame.drop('Unnamed: 7', axis=1, inplace=True)
data_frame.drop('low', axis=1, inplace=True)
data_frame.drop('high', axis=1, inplace=True)
data_frame.drop('last', axis=1, inplace=True)
data_frame = data_frame.reindex_axis(['time', 'ask', 'bid', 'vol'], axis=1)
data_frame.set_index(pd.DatetimeIndex(data_frame['time']), inplace=True)
ask = data_frame['ask'].resample('15Min', how='ohlc')
bid = data_frame['bid'].resample('15Min', how='ohlc')
vol = data_frame['vol'].resample('15Min', how='sum')
print ask
From the cex2.txt dataset I get this wrong output:
open high low close
1970-01-01 01:00:00 NaN NaN NaN NaN
1970-01-01 01:15:00 NaN NaN NaN NaN
1970-01-01 01:30:00 NaN NaN NaN NaN
1970-01-01 01:45:00 NaN NaN NaN NaN
1970-01-01 02:00:00 NaN NaN NaN NaN
1970-01-01 02:15:00 NaN NaN NaN NaN
1970-01-01 02:30:00 NaN NaN NaN NaN
1970-01-01 02:45:00 NaN NaN NaN NaN
1970-01-01 03:00:00 NaN NaN NaN NaN
1970-01-01 03:15:00 NaN NaN NaN NaN
From the cex3.txt dataset I get correct values:
open high low close
2014-08-10 13:30:00 0.003483 0.003500 0.003483 0.003485
2014-08-10 13:45:00 0.003485 0.003570 0.003467 0.003471
2014-08-10 14:00:00 0.003471 0.003500 0.003470 0.003494
2014-08-10 14:15:00 0.003494 0.003500 0.003493 0.003498
2014-08-10 14:30:00 0.003498 0.003549 0.003498 0.003500
2014-08-10 14:45:00 0.003500 0.003533 0.003487 0.003533
2014-08-10 15:00:00 0.003533 0.003600 0.003520 0.003587
I'm really at my wits' end. Does anyone have an idea why this happens?
Edit:
Here are the data sources:
https://dl.dropboxusercontent.com/u/14055520/cex2.txt
https://dl.dropboxusercontent.com/u/14055520/cex3.txt
Thanks!
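Bins starting at 1970-01-01 would be consistent with at least one row whose time value parses to (nearly) zero, which stretches the resampled range from 1970 to the real data. That is only a guess, since the file contents are not shown here. The sketch below uses made-up ticks, not the contents of cex2.txt, to show how such a row could be filtered out. It also uses the current resample API, where the old how= argument has been replaced by method calls such as .ohlc() and .sum(). The cut-off date is arbitrary.

```python
import pandas as pd

# Made-up ticks: epoch seconds, with one bogus timestamp of 0 (1970-01-01).
ticks = pd.DataFrame({
    "time": [0, 1407677400, 1407677460, 1407678300, 1407678360],
    "ask": [0.0, 0.003483, 0.003500, 0.003485, 0.003471],
})
ticks.index = pd.to_datetime(ticks["time"], unit="s")

# Inspect the earliest timestamp: a 1970 value here points to a bad row.
print(ticks.index.min())

# Drop implausibly old timestamps so the bins do not span 1970 to 2014.
clean = ticks[ticks.index >= pd.Timestamp("2000-01-01")]

# Current API: call the aggregation on the resampler instead of passing how=.
ask = clean["ask"].resample("15min").ohlc()
print(ask)
```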
