I'm fairly new to Python, especially the data libraries, so please excuse any idiocy.
I'm trying to practise with a made-up data set of monthly observations over 12 months; the data looks like this:
print(data)
2017-04-17 156
2017-05-09 216
2017-06-11 300
2017-07-29 184
2017-08-31 162
2017-09-24 91
2017-10-15 225
2017-11-03 245
2017-12-26 492
2018-01-26 485
2018-02-18 401
2018-03-09 215
2018-04-30 258
These monthly observations are irregular (there is exactly one in each month, but nowhere near the same day of the month).
Now, I want to use linear interpolation to get the values at the start of each month.
I've tried a bunch of methods and was able to do it 'manually', but I'm trying to get to grips with pandas and numpy, and I know it can be done with them. Here's what I have so far: I make a Series holding the data, and then I do:
resampled1 = data.resample('MS')
interp1 = resampled1.interpolate()
print(interp1)
This prints:
2017-04-01 NaN
2017-05-01 NaN
2017-06-01 NaN
2017-07-01 NaN
2017-08-01 NaN
2017-09-01 NaN
2017-10-01 NaN
2017-11-01 NaN
2017-12-01 NaN
2018-01-01 NaN
2018-02-01 NaN
2018-03-01 NaN
2018-04-01 NaN
Now, I know that the first one (2017-04-01) should be NaN, as linear interpolation (which I believe is the default) interpolates between the two points before and after, which is not possible since I don't have a data point before April 1st (my first observation is 2017-04-17). As for the others, I'm not certain what I'm doing wrong... probably just because I'm struggling to wrap my head around exactly what resample is doing?
resample('MS').interpolate() effectively takes the value at each month start first (like asfreq('MS')), and since none of your observations fall exactly on a month start, every point is already NaN before interpolation has anything to work with. You probably want to resample('D') to interpolate, e.g.:
In []:
data.resample('D').interpolate().asfreq('MS')
Out[]:
2017-05-01 194.181818
2017-06-01 274.545455
2017-07-01 251.666667
2017-08-01 182.000000
2017-09-01 159.041667
2017-10-01 135.666667
2017-11-01 242.894737
2017-12-01 375.490566
2018-01-01 490.645161
2018-02-01 463.086957
2018-03-01 293.315789
2018-04-01 234.019231
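An equivalent that skips the daily upsampling, in case it helps to see the mechanics: reindex onto the union of the original timestamps and the month starts, interpolate with method='time', then keep only the month starts. A minimal sketch, assuming data is the Series from the question (only the first three observations shown):
import pandas as pd

# the first three observations from the question, standing in for the full Series
data = pd.Series([156, 216, 300],
                 index=pd.to_datetime(['2017-04-17', '2017-05-09', '2017-06-11']))

month_starts = pd.date_range(data.index.min(), data.index.max(), freq='MS')
result = (data.reindex(data.index.union(month_starts))
              .interpolate(method='time')   # time-weighted linear interpolation
              .reindex(month_starts))
print(result)   # 2017-05-01 ~ 194.18, 2017-06-01 ~ 274.55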
Try using RedBlackPy.
from datetime import datetime
import redblackpy as rb
index = [datetime(2017, 4, 17), datetime(2017, 5, 9), datetime(2017, 6, 11)]
values = [156, 216, 300]
series = rb.Series(index=index, values=values, interpolate='linear')
# Now you can access by any key with no insertion, using interpolation.
print(series[datetime(2017, 5, 1)]) # prints 194.18182373046875
I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times, using a different method for each column. For user and total, I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'].ffill(inplace=True)
resampled['total'].ffill(inplace=True)
resampled['value'].fillna(0, inplace=True)
This gives the correct, expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do by column, so I tried the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this clearer and simpler, but it doesn't give the expected output: the sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and is there a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, yet resampled = df.resample('5T').ffill() would work for every column (which is undesired here, as it would also fill the value column). The closest I have come is to resample each column individually and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't seem to explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
Example data:
import pandas as pd
dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred'] * len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)
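For comparison with the per-column loop above, the expected frame can also be built in two short steps: let agg handle the sum, then fill the label-like columns with resample().ffill() (which, as noted above, fills across bins). A sketch that works around rather than answers the agg/ffill question, assuming the df constructed just above:
# assumes the `df` built just above
out = df.resample('5T').agg({'value': 'sum'})                           # empty bins sum to 0
out[['user', 'total']] = df[['user', 'total']].resample('5T').ffill()   # forward fill across bins
print(out[['user', 'value', 'total']])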
Given this sample data (from an Excel file exported to csv using Pandas), I have tried every form of pd.to_datetime to convert these seemingly uniform string dates to datetime format. Using errors='coerce', I lose a bunch of dates. No good. Using errors='ignore', I get some columns with dtype datetime while others remain object. No good. The goal is to grab the years from all these dates and then bin them in five-year bins from 1980-2000. At this point I am thinking the pandas datetime parser is like the Kardashian of parsers, famous for nothing.
Date_1 Date_2 Date_3 Date_4 Date_5 Date_6
1000 9/1/2019 NaN NaN NaN NaN
1001 NaN NaN NaN NaN NaN
1002 NaN 1/1/2000 NaN NaN NaN
1003 NaN NaN NaN NaN NaN
1004 NaN 4/1/2016 NaN NaN NaN
1005 NaN NaN NaN NaN 1/1/2013
What have I tried? pd.to_datetime with various flags and without.
This is the most common error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\arrays\datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
2053 try:
-> 2054 values, tz_parsed = conversion.datetime_to_datetime64(data)
2055 # If tzaware, these values represent unix timestamps, so we
pandas\_libs\tslibs\conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()
TypeError: Unrecognized value type: <class 'str'>
I even tried converting all the dates to plain strings and using regex to grab just the year. I only need the year from each of these dates, to then use pd.cut or groupby and get the following result in bins:
1980 - 1985 347
1986 - 1990 450
1995 - 2000 47
and so on.
However, having done what I thought was a good set of operations, I keep ending up with dramatically fewer dates than are in the actual data set; roughly 50% of the dates just disappear no matter what datetime conversion is attempted. Out of frustration I have linked half the dataset in csv format here so you can see what I am actually dealing with.
I've got a solution which is a bit convoluted but should work: read the values, convert to string, remove the NaNs, gather the dates into a list per row, filter the lists so only actual values remain, explode the lists, and you have a Series with the dates. You can do it in one go, but I'll break down the steps.
import pandas as pd

df = pd.read_csv("tst_file.csv")
df = df.drop("Unnamed: 0", axis=1)              # drop the exported index column

# convert everything to strings and blank out the NaNs
df_str = df.astype(str).replace("nan", "")

# gather each row's cells into a list, then keep only the non-empty entries
df_lists = df_str.apply(list, axis=1)
df_filtered = df_lists.apply(lambda x: [y for y in x if len(y) >= 1])

# one date per row; rows with no dates at all become NaN, then NaT
clean = df_filtered.explode()
dates = pd.to_datetime(clean, format="%m/%d/%Y")

ts = pd.DataFrame(dates, columns=["time"])
ts["year"] = ts["time"].dt.year
Now you can bin the years as you wish (a pd.cut sketch is at the end of this answer). The result looks like this:
time year
0 2019-09-01 2019.0
1 2011-01-01 2011.0
2 2000-01-01 2000.0
3 NaT NaN
4 2016-04-01 2016.0
.. ... ...
995 2020-01-01 2020.0
996 2015-02-01 2015.0
997 2016-04-01 2016.0
998 2001-12-01 2001.0
999 2016-01-01 2016.0
[1162 rows x 2 columns]
You can keep the original index while creating a new one; it is also useful to set the dates as the index and use loc to grab the relevant slices.
ts = ts.reset_index()
ts = ts.set_index("time").sort_index()
For example, there are 285 rows in this interval
ts.loc["2010":"2015"]
index year
time
2010-01-01 418 2010.0
2010-01-01 318 2010.0
2010-01-01 416 2010.0
2010-01-01 676 2010.0
2010-01-01 729 2010.0
... ... ...
2015-12-01 827 2015.0
2015-12-01 409 2015.0
2015-12-01 194 2015.0
2015-12-02 804 2015.0
2015-12-15 513 2015.0
[285 rows x 2 columns]
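For the five-year bins the question asks about, pd.cut on the year column is one option. A minimal sketch, assuming the ts frame built above; the exact edges and label format are assumptions to adjust:
import pandas as pd

# assumes `ts` from above with a numeric "year" column; NaN years are dropped by value_counts
edges = list(range(1980, 2026, 5))                                    # 1980, 1985, ..., 2025
labels = [f"{lo} - {hi - 1}" for lo, hi in zip(edges[:-1], edges[1:])]
binned = pd.cut(ts["year"], bins=edges, labels=labels, right=False)   # [1980, 1985), [1985, 1990), ...
print(binned.value_counts().sort_index())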
I'm new to pandas and I'm having some problems when I try to obtain daily averages from a data file.
So, my data is structured as follows:
DATA ESTACION
DATETIME
2020-01-15 00:00:00 175 47
2020-01-15 01:00:00 152 47
2020-01-15 02:00:00 180 47
2020-01-15 03:00:00 132 47
2020-01-15 04:00:00 115 47
... ... ...
2020-03-13 19:00:00 38 16
2020-03-13 20:00:00 53 16
2020-03-13 21:00:00 73 16
2020-03-13 22:00:00 28 16
2020-03-13 23:00:00 22 16
These are air pollution results gathered by 24 stations. Each station receives hourly information as you can see.
I'm trying to get daily average data by station. So this is what I do:
I group all info by station
grouped = data.groupby(['ESTACION'])
Then I get the daily average by resampling the grouped data
resampled = grouped.resample('D').mean()
And this is what I've obtained:
DATA ESTACION
ESTACION DATETIME
4 2020-01-02 18.250000 4.0
2020-01-03 NaN NaN
2020-01-04 NaN NaN
2020-01-05 NaN NaN
2020-01-06 NaN NaN
... ... ...
60 2020-11-29 NaN NaN
2020-11-30 NaN NaN
2020-12-01 NaN NaN
2020-12-02 118.666667 60.0
2020-12-03 80.833333 60.0
I don't really know what's going on, because I've only got data for 2020-01-15 to 2020-03-13, yet it shows me other timestamps and NaN results.
If you need anything else to reproduce this case let me know.
Thanks and best regards
The output is expected, because resample always creates a consecutive DatetimeIndex.
So it is possible to remove the missing rows with DataFrame.dropna:
resampled = grouped.resample('D').mean().dropna()
Another solution is to use Series.dt.date:
data.groupby(['ESTACION', data['DATETIME'].dt.date]).mean()
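If DATETIME is the index rather than a column (as the sample printout suggests), group on the index dates instead. A minimal sketch with made-up values:
import pandas as pd

# tiny stand-in for the real data, with a DatetimeIndex named DATETIME
idx = pd.DatetimeIndex(pd.to_datetime(['2020-01-15 00:00', '2020-01-15 01:00',
                                       '2020-01-16 00:00']), name='DATETIME')
data = pd.DataFrame({'DATA': [175, 152, 180], 'ESTACION': [47, 47, 47]}, index=idx)

daily = data.groupby(['ESTACION', data.index.date])['DATA'].mean()
print(daily)   # one row per station and calendar day, with no NaN gaps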
I want to get the lagged data from a dataset. The dataset is monthly and looks like this:
Final Profits
JCCreateDate
2016-04-30 31163371.59
2016-05-31 27512300.34
...
2019-02-28 16800693.82
2019-03-31 5384227.13
Now, out of the above dataset, I've selected a window of data (the last 12 months), from whose dates I want to subtract 3, 6, 9 and 12 months.
I've created the window dataset like this:
df_all = pd.read_csv('dataset.csv')
df = pd.read_csv('window_dataset.csv')
data_start, data_end = pd.to_datetime(df.first_valid_index()), pd.to_datetime(df.last_valid_index())
dr = pd.date_range(data_start, data_end, freq='M')
Now, for the date range dr, I want to subtract the months. Let's suppose I subtract 3 months from dr and try to retrieve the data from df_all:
df_all.loc[dr - pd.DateOffset(months=3)]
which gives me the following output:
Final Profits
2018-01-30 NaN
2018-02-28 9240766.46
2018-03-30 NaN
2018-04-30 13250515.05
2018-05-31 12539224.15
2018-06-30 17778326.04
2018-07-31 19345671.02
2018-08-30 NaN
2018-09-30 14815607.14
2018-10-31 28979099.74
2018-11-28 NaN
2018-12-31 12395273.24
As one can see, I've got some NaNs because months like January and March have 31 days, so the subtraction lands on the wrong day of the month. How do I deal with this?
I'm not 100% sure what you are looking for, but I suspect you want shift.
import numpy as np
import pandas as pd

# set up dataframe
index = pd.date_range(start='2016-04-30', end='2019-03-31', freq='M')
df = pd.DataFrame(np.random.randint(5000000, 50000000, 36), index=index, columns=['Final Profits'])

# create three columns, shifting and subtracting from 'Final Profits'
df['3mos'] = df['Final Profits'] - df['Final Profits'].shift(3)
df['6mos'] = df['Final Profits'] - df['Final Profits'].shift(6)
df['9mos'] = df['Final Profits'] - df['Final Profits'].shift(9)

print(df.head(12))
Final Profits 3mos 6mos 9mos
2016-04-30 45197972 NaN NaN NaN
2016-05-31 5029292 NaN NaN NaN
2016-06-30 20310120 NaN NaN NaN
2016-07-31 10514197 -34683775.0 NaN NaN
2016-08-31 31219405 26190113.0 NaN NaN
2016-09-30 21504727 1194607.0 NaN NaN
2016-10-31 19234437 8720240.0 -25963535.0 NaN
2016-11-30 18881711 -12337694.0 13852419.0 NaN
2016-12-31 27237712 5732985.0 6927592.0 NaN
2017-01-31 21692788 2458351.0 11178591.0 -23505184.0
2017-02-28 7869701 -11012010.0 -23349704.0 2840409.0
2017-03-31 20943248 -6294464.0 -561479.0 633128.0
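If you want to keep the original date-based lookup instead of shift, subtracting a MonthEnd offset keeps the lagged dates anchored to month ends, which avoids the NaN lookups from the question. A sketch, assuming dr is the month-end date range built in the question:
import pandas as pd

dr = pd.date_range('2018-04-30', '2019-03-31', freq='M')   # month-end dates, as in the question
lagged = dr - pd.offsets.MonthEnd(3)                        # 2018-01-31, 2018-02-28, ... (stays on month ends)
print(lagged)
# df_all.loc[lagged] would then hit existing month-end rows instead of returning NaN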
This is my data:
time id w
0 2018-03-01 00:00:00 39.0 1176.000000
1 2018-03-01 00:15:00 39.0 NaN
2 2018-03-01 00:30:00 39.0 NaN
3 2018-03-01 00:45:00 39.0 NaN
4 2018-03-01 01:00:00 39.0 NaN
5 2018-03-01 01:15:00 39.0 NaN
6 2018-03-01 01:30:00 39.0 NaN
7 2018-03-01 01:45:00 39.0 1033.461538
8 2018-03-01 02:00:00 39.0 1081.066667
9 2018-03-01 02:15:00 39.0 1067.909091
10 2018-03-01 02:30:00 39.0 NaN
11 2018-03-01 02:45:00 39.0 1051.866667
12 2018-03-01 03:00:00 39.0 1127.000000
13 2018-03-01 03:15:00 39.0 1047.466667
14 2018-03-01 03:30:00 39.0 1037.533333
I want to get index: 10
Because I need to know which times are not continuous, so I can fill in the missing values.
For each NaN value I want to know whether the rows in front of it and behind it are also NaN. If they are not, I need to know its index, so I can add a value for it.
My data is very large, so I need a fast way to do this.
I really need your help. Many thanks.
I'm not sure if I understood you correctly. If you want the indices of the time column where the step from the previous row is not exactly 15 minutes, you will get more indices than just the one you asked for, and you can do it like this:
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
# gap between each row and the previous one
df['Delta'] = df['time'] - df['time'].shift(1)
# indices where the gap is not exactly 15 minutes
print(df.index[df['Delta'] != pd.Timedelta(minutes=15)].tolist())
And the output is:
[4561, 4723, 5154, 5220, 5293, 5437, 5484]
Edit
Again, if I understood you right, just use this:
df.index[(pd.isnull(df['w'])) & (pd.notnull(df['w'].shift(1))) & (pd.notnull(df['w'].shift(-1)))].tolist()
Output:
[10]
This should work pretty fast:
import numpy as np
index = np.array([4561,4723,4724,4725,4726,5154,5220,5221,5222,5223,5224,5293,5437,5484,5485,5486,5487])
continuous = np.diff(index) == 1
not_continuous = np.where(~continuous[1:] & ~continuous[:-1])[0] + 1 # check on both 'sides', +1 because you 'loose' one index in the diff operation
index[not_continuous]
array([5154, 5293, 5437])
It doesn't handle the first value well, but that case is ambiguous since there is no preceding value to check against. It is up to you to add that extra check if it matters to you (see the sketch below); the same goes for the last value, potentially.
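A possible version of that extra boundary check, as a sketch: treat the first and last entries as non-continuous on the side where they have no neighbour.
import numpy as np

index = np.array([4561, 4723, 4724, 4725, 4726, 5154, 5220, 5221, 5222, 5223,
                  5224, 5293, 5437, 5484, 5485, 5486, 5487])

adjacent = np.diff(index) == 1
cont_with_prev = np.concatenate(([False], adjacent))   # first entry has no previous neighbour
cont_with_next = np.concatenate((adjacent, [False]))   # last entry has no next neighbour
isolated = index[~cont_with_prev & ~cont_with_next]
print(isolated)   # array([4561, 5154, 5293, 5437]) -- now the first value is included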