Groupby time bins in multilevel index - python

I have a sparsely filled data frame that looks like this:
entity_id 59e75f2b9e182f68cf25721d 59e75f2bc0bd722a5f395ee9 59e75f2c05e40310ebe1f433 ...
organisation_id group_id datetime ...
59e7515edb84e482acce8339 59e75177575fc94638c1f8e7 2018-04-01 02:01:00 NaN NaN NaN ...
2018-04-01 02:02:00 NaN 2.15 NaN ...
2018-04-01 02:03:00 NaN NaN 3.689 ...
2018-04-01 02:04:00 NaN NaN NaN ...
2018-04-01 02:05:00 NaN NaN NaN ...
... ... ... ... ...
5cb590649f18c69541d34f7a 2019-04-01 01:55:00 NaN NaN NaN ...
2019-04-01 01:56:00 NaN NaN NaN ...
2019-04-01 01:57:00 NaN NaN NaN ...
2019-04-01 01:58:00 NaN NaN NaN ...
2019-04-01 01:59:00 NaN NaN NaN ...
I would like to group this frame by group_id and by 10-minute bins applied to the datetime index (for each group, I want values that fall inside the same 10-minute window to be grouped so I can take the mean over the columns, essentially disregarding the minute portion of the datetime index).
I have tried using pd.Grouper(freq='10T') but that doesn't work in conjunction with multilevel indices it would seem.
group_mean = frame.groupby(
    pd.Grouper(freq='10T'), level='datetime').mean(axis=1)
This gives me the error message
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
For reference, my wanted output should look something like this:
group_mean
organisation_id group_id datetime
59e7515edb84e482acce8339 59e75177575fc94638c1f8e7 2018-04-01 02:10:00 mean(axis=1)
2018-04-01 02:20:00 mean(axis=1)
...
5cb590649f18c69541d34f7a 2019-04-01 01:50:00 mean(axis=1)
2019-04-01 02:00:00 mean(axis=1)
...
where mean(axis=1) is the mean of all columns that are not NaN for that specific group and time bin.

pd.Grouper with a frequency needs a datetime level to work on, so first convert the other index levels to columns with reset_index and pass them to groupby in a list together with the Grouper:
Notice: this mean is computed per group, not per column.
group_mean = (frame.reset_index(['organisation_id', 'group_id'])
                   .groupby(['organisation_id',
                             'group_id',
                             pd.Grouper(freq='10T', level='datetime')])
                   .mean())
If you need the mean across the columns (one value per row) instead:
df = frame.mean(axis=1)
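To get the per-group, per-bin mean over columns that the question asks for, the two steps can be combined; the following is my own sketch (not part of the original answer), assuming frame has the three-level index shown above:
# Sketch: take the row-wise mean across columns first, then group that Series
# by the index levels and by 10-minute bins on the datetime level.
row_mean = frame.mean(axis=1)
group_mean = row_mean.groupby(
    [pd.Grouper(level='organisation_id'),
     pd.Grouper(level='group_id'),
     pd.Grouper(freq='10T', level='datetime')]
).mean()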

Related

Filling NaN rows in big pandas datetime indexed dataframe using other not NaN rows values

I have a big weather CSV dataframe containing several hundred thousand rows as well as many columns. The rows are a time series sampled every 10 minutes over many years. The datetime index consists of year, month, day, hour, minute and second. Unfortunately, there are several thousand missing rows containing only NaNs. The goal is to fill these using the values of other rows collected at the same time of day and day of year, but in other years, if those values are not NaN.
I wrote a Python for-loop solution, but it seems very time consuming. I need your help for a more efficient and faster solution.
The raw dataframe is as follows:
print(df)
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:10:00 996.52 -8.02 265.40 -8.90 93.30
2004-01-01 00:20:00 996.57 -8.41 265.01 -9.28 93.40
2004-01-01 00:40:00 996.51 -8.31 265.12 -9.07 94.20
2004-01-01 00:50:00 996.51 -8.27 265.15 -9.04 94.10
2004-01-01 01:00:00 996.53 -8.51 264.91 -9.31 93.90
... ... ... ... ... ...
2020-12-31 23:20:00 1000.07 -4.05 269.10 -8.13 73.10
2020-12-31 23:30:00 999.93 -3.35 269.81 -8.06 69.71
2020-12-31 23:40:00 999.82 -3.16 270.01 -8.21 67.91
2020-12-31 23:50:00 999.81 -4.23 268.94 -8.53 71.80
2021-01-01 00:00:00 999.82 -4.82 268.36 -8.42 75.70
[820551 rows x 5 columns]
For some reason, rows are missing from the df dataframe. To identify them, it is possible to apply the function below:
findnanrows(df.groupby(pd.Grouper(freq='10T')).mean())
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 NaN NaN NaN NaN NaN
2009-10-08 09:50:00 NaN NaN NaN NaN NaN
2009-10-08 10:00:00 NaN NaN NaN NaN NaN
2013-05-16 09:00:00 NaN NaN NaN NaN NaN
2014-07-30 08:10:00 NaN NaN NaN NaN NaN
... ... ... ... ... ...
2016-10-28 12:00:00 NaN NaN NaN NaN NaN
2016-10-28 12:10:00 NaN NaN NaN NaN NaN
2016-10-28 12:20:00 NaN NaN NaN NaN NaN
2016-10-28 12:30:00 NaN NaN NaN NaN NaN
2016-10-28 12:40:00 NaN NaN NaN NaN NaN
[5440 rows x 5 columns]
The aim is to fill all these NaN rows. As an example, the first NaN row, which corresponds to the datetime 2004-01-01 00:30:00, should be filled with the non-NaN values of another row collected at the same date and time xxxx-01-01 00:30:00 of another year, like 2005-01-01 00:30:00 or 2006-01-01 00:30:00 and so on, or even 2003-01-01 00:30:00 or 2002-01-01 00:30:00 if they exist. It is also possible to average over all these other years.
Seeing the values of the row with the datetime index 2005-01-01 00:30:00:
print(df.loc["2005-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2005-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
After filling the row corresponding to the index datetime 2004-01-01 00:30:00 using the values of the row having the index datetime 2005-01-01 00:30:00, the df dataframe will have the following row:
print(df.loc["2004-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
The two functions that I created are the following. The first is to identify the NaN rows. The second is to fill them.
def findnanrows(df):
    is_NaN = df.isnull()
    row_has_NaN = is_NaN.any(axis=1)
    rows_with_NaN = df[row_has_NaN]
    return rows_with_NaN

def filldata(weatherdata):
    fillweatherdata = weatherdata.copy()
    allyears = fillweatherdata.index.year.unique().tolist()
    dfnan = findnanrows(fillweatherdata.groupby(pd.Grouper(freq='10T')).mean())
    for i in range(dfnan.shape[0]):
        dnan = dfnan.index[i]
        if dnan.year == min(allyears):
            y = 0
            dnew = dnan.replace(year=dnan.year+y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year+y)
                y += 1
        else:
            y = 0
            dnew = dnan.replace(year=dnan.year-y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year-y)
                y += 1
        new_row = pd.DataFrame(np.array([fillweatherdata.loc[dnew, :]]).tolist(),
                               columns=fillweatherdata.columns.tolist(), index=[dnan])
        fillweatherdata = pd.concat([fillweatherdata, pd.DataFrame(new_row)], ignore_index=False)
        # fillweatherdata = fillweatherdata.drop_duplicates()
    fillweatherdata = fillweatherdata.sort_index()
    return fillweatherdata
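As a possible vectorized alternative (my own sketch, not from the original post, and assuming the goal is the cross-year average at each month/day/time-of-day), the full 10-minute grid can be rebuilt and the missing values filled from a group mean:
import pandas as pd

def filldata_vectorized(df):
    # Complete the 10-minute grid so the missing rows show up as all-NaN rows.
    full_index = pd.date_range(df.index.min(), df.index.max(), freq='10T')
    out = df.reindex(full_index)
    # Key each row by (month, day, time-of-day) and average over the years.
    key = [out.index.month, out.index.day, out.index.time]
    cross_year_mean = out.groupby(key).transform('mean')
    # Fill only the missing values with the cross-year average.
    return out.fillna(cross_year_mean)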

Pandas trying to make values within a column into new columns after groupby on column

My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, the time has been bucketed into 5-minute intervals, and every distinct value in variables should become a column populated across all buckets. Each bucket takes as its label the very first timestamp that was bucketed into it.
In order to solve this I have tried a couple of different solutions, but I can't seem to find anything that works without constant errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1 because we want the second index column (0 would be the first).
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
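An equivalent sketch (my own, not part of the original answer) using pivot_table, which handles the reshaping and the duplicate-timestamp aggregation in one call; it assumes df is the original long dataframe with timestamp, variables and value columns:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
# Wide frame: one column per variable, mean for duplicate timestamps.
wide = df.pivot_table(index='timestamp', columns='variables', values='value')
# Then bucket into 5-minute bins on the DatetimeIndex.
out = wide.resample('5min').mean().reset_index()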

Get lagged data in pandas

I want to get the lagged data from a dataset. The dataset is monthly and looks like this:
Final Profits
JCCreateDate
2016-04-30 31163371.59
2016-05-31 27512300.34
...
2019-02-28 16800693.82
2019-03-31 5384227.13
Now, out of the above dataset, I've selected a window of data (the last 12 months) from which I want to subtract 3, 6, 9 and 12 months.
I've created the window dataset like this:
df_all = pd.read_csv('dataset.csv')
df = pd.read_csv('window_dataset.csv')
data_start, data_end = pd.to_datetime(df.first_valid_index()), pd.to_datetime(df.last_valid_index())
dr = pd.date_range(data_start, data_end, freq='M')
Now, for the date range dr, I want to subtract the months. Let's suppose I subtract 3 months from dr and try to retrieve the data from df_all:
df_all.loc[dr - pd.DateOffset(months=3)]
which gives me the following output:
Final Profits
2018-01-30 NaN
2018-02-28 9240766.46
2018-03-30 NaN
2018-04-30 13250515.05
2018-05-31 12539224.15
2018-06-30 17778326.04
2018-07-31 19345671.02
2018-08-30 NaN
2018-09-30 14815607.14
2018-10-31 28979099.74
2018-11-28 NaN
2018-12-31 12395273.24
As one can see, I've got some NaN values because months like January and March have 31 days, so the subtraction lands on the wrong day of the month. How do I deal with it?
I'm not 100% sure what you are looking for, but I suspect you want shift.
# set up dataframe
import numpy as np
import pandas as pd

index = pd.date_range(start='2016-04-30', end='2019-03-31', freq='M')
df = pd.DataFrame(np.random.randint(5000000, 50000000, 36), index=index, columns=['Final Profits'])
# create three columns, shifting and subtracting from 'Final Profits'
df['3mos'] = df['Final Profits'] - df['Final Profits'].shift(3)
df['6mos'] = df['Final Profits'] - df['Final Profits'].shift(6)
df['9mos'] = df['Final Profits'] - df['Final Profits'].shift(9)
print(df.head(12))
Final Profits 3mos 6mos 9mos
2016-04-30 45197972 NaN NaN NaN
2016-05-31 5029292 NaN NaN NaN
2016-06-30 20310120 NaN NaN NaN
2016-07-31 10514197 -34683775.0 NaN NaN
2016-08-31 31219405 26190113.0 NaN NaN
2016-09-30 21504727 1194607.0 NaN NaN
2016-10-31 19234437 8720240.0 -25963535.0 NaN
2016-11-30 18881711 -12337694.0 13852419.0 NaN
2016-12-31 27237712 5732985.0 6927592.0 NaN
2017-01-31 21692788 2458351.0 11178591.0 -23505184.0
2017-02-28 7869701 -11012010.0 -23349704.0 2840409.0
2017-03-31 20943248 -6294464.0 -561479.0 633128.0
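If you do want to look up the lagged dates in df_all rather than use shift, one possible fix (an assumption on my part, not from the original answer, and assuming df_all is indexed by month-end dates as in the sample) is to subtract month-end offsets so the result stays anchored to the end of the month:
import pandas as pd

# 2018-04-30 - MonthEnd(3) resolves to 2018-01-31 rather than 2018-01-30.
lagged_dates = dr - pd.offsets.MonthEnd(3)
lagged = df_all.loc[lagged_dates]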

Pandas Dataframe multiindex

I'm new to Python, Pandas, Dash, etc. I'm trying to structure a dataframe so I can create some dash components for graphing that will allow the user to see and filter data.
At the top are aggregation characteristics; the first 3 are required and the rest are sparsely populated, depending on whether or not the data was aggregated by that characteristic. After the first ellipsis there are some summary characteristics for the day, and after the second ellipsis is the time series data for the aggregation. There are about 3800 pre-calculated aggregate groupings in this example.
Should I try to make the aggregate characteristics into a MultiIndex?
The RUNID is the identifier of the analysis run that created the output (the same number for all 3818 columns), while the UID field should be unique for each column of a single run; multiple runs will have the same UID with different RUNIDs. The UID is the unique combination of CHAR1 through CHAR20 for that RUNID and AGGLEVEL. The AGGLEVEL is the analysis grouping, which may have one or more columns of output. CHAR3_CHAR6_UNADJ is the unique combination of CHAR3 and CHAR6, so those two rows are populated while the remaining CHAR rows are null (well, NaN). My current example is just one run, but there are tens of thousands of runs, although I usually focus on one at a time and probably won't deal with more than 10-20 at a time, and only a subset of the data from each. CHAR1 through CHAR20 are only populated if that column has data aggregated by that characteristic.
My dataframe example:
print(dft)
0 ... 3818
UID 32 ... 19980
RUNID 1234 ... 1234
AGGLEVEL CHAR12_ADJ ... CHAR3_CHAR6_UNADJ
CHAR1 NaN ... NaN
CHAR2 NaN ... NaN
CHAR3 NaN ... 1234
CHAR4 NaN ... NaN
CHAR5 NaN ... NaN
CHAR6 NaN ... ABCD
CHAR7 NaN ... NaN
CHAR8 NaN ... NaN
CHAR9 NaN ... NaN
CHAR10 NaN ... NaN
CHAR11 NaN ... NaN
CHAR12 IJKL ... NaN
CHAR13 NaN ... NaN
CHAR14 NaN ... NaN
CHAR15 NaN ... NaN
CHAR16 NaN ... NaN
CHAR17 NaN ... NaN
CHAR18 NaN ... NaN
CHAR19 NaN ... NaN
CHAR20 NaN ... NaN
...
STARTTIME 2018-08-22 00:00:00 ... 2018-08-22 00:00:00
MAXIMUM 2.676 ... 0.654993
MINIMUM 0.8868 ... 0.258181
...
00:00 1.2288 ... 0.335217
01:00 1.2828 ... 0.337848
02:00 1.2876 ... 0.324639
03:00 1.194 ... 0.314569
04:00 1.2876 ... 0.258181
05:00 1.1256 ... 0.284699
06:00 1.4016 ... 0.364655
07:00 1.122 ... 0.388968
08:00 1.0188 ... 0.452711
09:00 1.008 ... 0.507032
10:00 1.0272 ... 0.546807
11:00 0.972 ... 0.605359
12:00 1.062 ... 0.641152
13:00 0.8868 ... 0.625082
14:00 1.1076 ... 0.623865
15:00 0.9528 ... 0.654993
16:00 1.014 ... 0.645511
17:00 2.676 ... 0.62638
18:00 0.9888 ... 0.551629
19:00 1.038 ... 0.518322
20:00 1.2528 ... 0.50793
21:00 1.08 ... 0.456993
22:00 1.1724 ... 0.387063
23:00 1.1736 ... 0.345045
[62 rows x 3819 columns]
You should try to transpose it with dft.T. You will then have the sample number from 0 to 3818 as the index, and it will be easier to select columns, with dft['STARTTIME'] for instance.
For the NaN values, you should do dft = dft.replace('NaN', np.nan) so that pandas understands they are real missing values and not strings (don't forget to write import numpy as np first). You will then be able to use pd.isna(dft) to check for NaN in your dataframe, or dft.dropna() to keep only the fully populated rows.
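Putting that together, here is a short sketch (my own, under the assumption that the printed row labels UID, RUNID and AGGLEVEL become ordinary columns after the transpose):
import numpy as np
import pandas as pd

dft = dft.T                         # samples become rows, characteristics become columns
dft = dft.replace('NaN', np.nan)    # turn the string 'NaN' into real missing values
# One possible way to get a MultiIndex on the key aggregation columns:
dft = dft.set_index(['RUNID', 'AGGLEVEL', 'UID'])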

Python Pandas resample, odd behaviour

I have 2 datasets (cex2.txt and cex3.txt) which I would like to resample in pandas. With one dataset I get the expected output, with the other I don't.
The datasets are tick data and are formatted exactly the same. In fact, the 2 datasets are just from two different days.
import pandas as pd
import datetime as dt
pd.set_option ('display.mpl_style', 'default')
time_converter = lambda x: dt.datetime.fromtimestamp(float(x))
data_frame = pd.read_csv('cex2.txt', sep=';', converters={'time': time_converter})
data_frame.drop('Unnamed: 7', axis=1, inplace=True)
data_frame.drop('low', axis=1, inplace=True)
data_frame.drop('high', axis=1, inplace=True)
data_frame.drop('last', axis=1, inplace=True)
data_frame = data_frame.reindex_axis(['time', 'ask', 'bid', 'vol'], axis=1)
data_frame.set_index(pd.DatetimeIndex(data_frame['time']), inplace=True)
ask = data_frame['ask'].resample('15Min', how='ohlc')
bid = data_frame['bid'].resample('15Min', how='ohlc')
vol = data_frame['vol'].resample('15Min', how='sum')
print ask
From the cex2.txt dataset I get this wrong output:
open high low close
1970-01-01 01:00:00 NaN NaN NaN NaN
1970-01-01 01:15:00 NaN NaN NaN NaN
1970-01-01 01:30:00 NaN NaN NaN NaN
1970-01-01 01:45:00 NaN NaN NaN NaN
1970-01-01 02:00:00 NaN NaN NaN NaN
1970-01-01 02:15:00 NaN NaN NaN NaN
1970-01-01 02:30:00 NaN NaN NaN NaN
1970-01-01 02:45:00 NaN NaN NaN NaN
1970-01-01 03:00:00 NaN NaN NaN NaN
1970-01-01 03:15:00 NaN NaN NaN NaN
From the cex3.txt dataset I get correct values:
open high low close
2014-08-10 13:30:00 0.003483 0.003500 0.003483 0.003485
2014-08-10 13:45:00 0.003485 0.003570 0.003467 0.003471
2014-08-10 14:00:00 0.003471 0.003500 0.003470 0.003494
2014-08-10 14:15:00 0.003494 0.003500 0.003493 0.003498
2014-08-10 14:30:00 0.003498 0.003549 0.003498 0.003500
2014-08-10 14:45:00 0.003500 0.003533 0.003487 0.003533
2014-08-10 15:00:00 0.003533 0.003600 0.003520 0.003587
I'm really at my wits' end. Does anyone have an idea why this happens?
Edit:
Here are the data sources:
https://dl.dropboxusercontent.com/u/14055520/cex2.txt
https://dl.dropboxusercontent.com/u/14055520/cex3.txt
Thanks!
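No accepted fix is shown here, but as a debugging hint (my own suggestion, not from the original post): timestamps collapsing to 1970-01-01 usually mean the 'time' column was parsed as values near Unix epoch 0, so it is worth inspecting what the converter produced for cex2.txt before resampling (Python 2 print, matching the code above):
print data_frame['time'].head()
print data_frame.index.min(), data_frame.index.max()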
