I have several monthly, datetime-indexed cumulative pandas Series that I would like to de-cumulate, so that I get the value for each individual month.
So, within each year, Jan is just Jan, Feb is Jan + Feb, Mar is Jan + Feb + Mar, and so on, until the next year restarts at Jan.
To make things awkward, some of these series start at Feb instead of Jan.
Here's an example series:
2016-02-29 112.3
2016-03-31 243.0
2016-04-30 360.1
2016-05-31 479.5
2016-06-30 643.0
2016-07-31 757.6
2016-08-31 874.5
2016-09-30 1051.8
2016-10-31 1203.4
2016-11-30 1358.3
2016-12-31 1573.5
2017-01-31 75.0
2017-02-28 140.5
2017-03-31 290.4
2017-04-30 416.6
2017-05-31 548.2
2017-06-30 746.6
2017-07-31 863.5
2017-08-31 985.4
2017-09-30 1160.1
2017-10-31 1302.5
2017-11-30 1465.7
2017-12-31 1694.1
2018-01-31 74.0
2018-02-28 146.3
2018-03-31 300.9
2018-04-30 421.9
2018-05-31 564.1
2018-06-30 771.4
I thought one way to do this would be to use df.diff() to get the differences for everything except January, replace the incorrect January values with NaN, then use df.update(original df) to fill in the NaNs with the correct values.
I'm having trouble replacing the January data with NaNs. Could anyone help with this, or suggest another solution?
I would solve this with groupby + diff + fillna:
df.asfreq('M').groupby(pd.Grouper(freq='Y')).diff().fillna(df)
Value
2016-02-29 112.3
2016-03-31 130.7
2016-04-30 117.1
2016-05-31 119.4
2016-06-30 163.5
2016-07-31 114.6
2016-08-31 116.9
2016-09-30 177.3
2016-10-31 151.6
2016-11-30 154.9
2016-12-31 215.2
2017-01-31 75.0
2017-02-28 65.5
2017-03-31 149.9
2017-04-30 126.2
2017-05-31 131.6
2017-06-30 198.4
2017-07-31 116.9
2017-08-31 121.9
2017-09-30 174.7
2017-10-31 142.4
2017-11-30 163.2
2017-12-31 228.4
2018-01-31 74.0
2018-02-28 72.3
2018-03-31 154.6
2018-04-30 121.0
2018-05-31 142.2
2018-06-30 207.3
This assumes the dates are in the index and "Value" is a float column.
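A self-contained sketch of the same approach on a few months of the sample data, grouping on the index year (which is equivalent to the pd.Grouper(freq='Y') used above):

```python
import pandas as pd

# A few months of the cumulative sample data (assumed month-end index).
idx = pd.to_datetime(['2016-11-30', '2016-12-31', '2017-01-31', '2017-02-28'])
s = pd.Series([1358.3, 1573.5, 75.0, 140.5], index=idx)

# diff() within each year leaves the first month of the year as NaN;
# fillna(s) restores the original cumulative value there, which is
# already the correct monthly figure for that month.
out = s.groupby(s.index.year).diff().fillna(s)
print(out)
```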
Related
I have a DataFrame of the following form:
As you can see, it has a MultiIndex. For each muni index value I want to resample the popDate level with .resample('A').mean(), so that Python fills in the missing years. NaN values should then be replaced by linear interpolation. How do I do that?
Update: Some mock input DataFrame:
interData=pd.DataFrame({'muni':['Q1','Q1','Q1','Q2','Q2','Q2'],'popDate':['2015','2021','2022','2015','2017','2022'],'population':[5,11,22,15,17,22]})
interData['popDate']=pd.to_datetime(interData['popDate'])
interData=interData.set_index(['muni','popDate'])
It looks like you want a groupby.resample:
interData.groupby(level='muni').resample('A', level='popDate').mean()
Output:
population
muni popDate
Q1 2015-12-31 5.0
2016-12-31 NaN
2017-12-31 NaN
2018-12-31 NaN
2019-12-31 NaN
2020-12-31 NaN
2021-12-31 11.0
2022-12-31 22.0
Q2 2015-12-31 15.0
2016-12-31 NaN
2017-12-31 17.0
2018-12-31 NaN
2019-12-31 NaN
2020-12-31 NaN
2021-12-31 NaN
2022-12-31 22.0
If you also need interpolation, combine with interpolate:
out = (interData.groupby(level='muni')
                .apply(lambda g: g.resample('A', level='popDate').mean()
                                  .interpolate(method='time')))
Output:
population
muni popDate
Q1 2015-12-31 5.000000
2016-12-31 6.001825
2017-12-31 7.000912
2018-12-31 8.000000
2019-12-31 8.999088
2020-12-31 10.000912
2021-12-31 11.000000
2022-12-31 22.000000
Q2 2015-12-31 15.000000 # 366 days between 2015-12-31 and 2016-12-31
2016-12-31 16.001368 # 365 days between 2016-12-31 and 2017-12-31
2017-12-31 17.000000
2018-12-31 17.999452
2019-12-31 18.998905
2020-12-31 20.001095
2021-12-31 21.000548
2022-12-31 22.000000
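For completeness, here is the whole thing as one runnable snippet, using the mock data from the question (note that newer pandas versions spell the annual alias 'YE' rather than 'A'):

```python
import pandas as pd

# Mock input from the question, reproduced so the snippet runs standalone.
interData = pd.DataFrame({'muni': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
                          'popDate': ['2015', '2021', '2022', '2015', '2017', '2022'],
                          'population': [5, 11, 22, 15, 17, 22]})
interData['popDate'] = pd.to_datetime(interData['popDate'])
interData = interData.set_index(['muni', 'popDate'])

# Annual resample per municipality, then time-based interpolation of the gaps.
out = (interData.groupby(level='muni')
                .apply(lambda g: g.resample('A', level='popDate').mean()
                                  .interpolate(method='time')))
print(out)
```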
So I'm taking a pandas and NumPy course and ran into a problem: the course instructor performed the solution and it worked, but I followed every step and it didn't work for me.
Pardon the length; I've included the actual datasets for clarity.
I assigned the following list of items to the variable dates, as instructed, see below
dates = [
"2016-12-22",
"2017-05-03",
"2017-01-06",
"2017-03-05",
"2017-02-12",
"2017-03-21",
"2017-04-14",
"2017-04-15",
]
Then I have a Series I'm working against named oil_series with the following data.
"date" is the index name:
date
2016-12-20
2016-12-21
2016-12-22
2016-12-23
2016-12-27
2016-12-28
2016-12-29
2016-12-30
2017-01-03
2017-01-04
2017-01-05
2017-01-06
2017-01-09
2017-01-10
2017-01-11
2017-01-12
2017-01-13
2017-01-17
2017-01-18
2017-01-19
2017-01-20
2017-01-23
2017-01-24
2017-01-25
2017-01-26
2017-01-27
2017-01-30
2017-01-31
2017-02-01
2017-02-02
2017-02-03
2017-02-06
2017-02-07
2017-02-08
2017-02-09
2017-02-10
2017-02-13
2017-02-14
2017-02-15
2017-02-16
2017-02-17
2017-02-21
2017-02-22
2017-02-23
2017-02-24
2017-02-27
2017-02-28
2017-03-01
2017-03-02
2017-03-03
2017-03-06
2017-03-07
2017-03-08
2017-03-09
2017-03-10
2017-03-13
2017-03-14
2017-03-15
2017-03-16
2017-03-17
2017-03-20
2017-03-21
2017-03-22
2017-03-23
2017-03-24
2017-03-27
2017-03-28
2017-03-29
2017-03-30
2017-03-31
2017-04-03
2017-04-04
2017-04-05
2017-04-06
2017-04-07
2017-04-10
2017-04-11
2017-04-12
2017-04-13
2017-04-17
2017-04-18
2017-04-19
2017-04-20
2017-04-21
2017-04-24
2017-04-25
2017-04-26
2017-04-27
2017-04-28
2017-05-01
2017-05-02
2017-05-03
2017-05-04
2017-05-05
2017-05-08
2017-05-09
2017-05-10
2017-05-11
2017-05-12
2017-05-15
Values
52.22
51.44
51.98
52.01
52.82
54.01
53.8
53.75
52.36
53.26
53.77
53.98
51.95
50.82
52.19
53.01
52.36
52.45
51.12
51.39
52.33
52.77
52.38
52.14
53.24
53.18
52.63
52.75
53.9
53.55
53.81
53.01
52.19
52.37
52.99
53.84
52.96
53.21
53.11
53.41
53.41
54.02
53.61
54.48
53.99
54.04
54
53.82
52.63
53.33
53.19
52.68
49.83
48.75
48.05
47.95
47.24
48.34
48.3
48.34
47.79
47.02
47.29
47
47.3
47.02
48.36
49.47
50.3
50.54
50.25
50.99
51.14
51.69
52.25
53.06
53.38
53.12
53.19
52.62
52.46
50.49
50.26
49.64
48.9
49.22
49.22
48.96
49.31
48.83
47.65
47.79
45.55
46.23
46.46
45.84
47.28
47.81
47.83
48.86
So when I write the following code to filter oil_series (whose index holds dates) against the dates list I created, see code below
mask = (oil_series.index.isin(dates)) & (oil_series <= 50)
oil_series.loc[mask]
the following error occurs (screenshot omitted).
Please help me understand the problem.
According to your comment, you have a MultiIndex with only one level. Convert it to a flat Index and your code should work:
oil_series.index = oil_series.index.get_level_values('date')
# Your code here.
mask = (oil_series.index.isin(dates)) & (oil_series <= 50)
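A minimal sketch of the fix; the dates are taken from the listings above, but the values and the one-level MultiIndex are reconstructed for illustration, not the actual course data:

```python
import pandas as pd

# A Series whose index is a one-level MultiIndex named 'date',
# mimicking the situation described (values are made up).
idx = pd.MultiIndex.from_arrays([['2016-12-22', '2017-01-06', '2017-03-21']],
                                names=['date'])
oil_series = pd.Series([51.98, 53.98, 47.02], index=idx)
dates = ['2016-12-22', '2017-03-21']

# Flatten the MultiIndex to a plain Index, so that isin() compares
# strings against strings rather than against one-element tuples.
oil_series.index = oil_series.index.get_level_values('date')

mask = oil_series.index.isin(dates) & (oil_series <= 50)
print(oil_series.loc[mask])
```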
So I have a series of dates and I want to split it into chunks based on continuity. The Series looks like the following:
2019-01-01 36.581647
2019-01-02 35.988585
2019-01-03 35.781111
2019-01-04 35.126273
2019-01-05 34.401451
2019-01-06 34.351714
2019-01-07 34.175517
2019-01-08 33.622116
2019-01-09 32.861861
2019-01-10 32.915251
2019-01-11 32.866832
2019-01-12 32.214259
2019-01-13 31.707626
2019-01-14 32.556175
2019-01-15 32.674965
2019-01-16 32.391766
2019-01-17 32.463836
2019-01-18 32.151290
2019-01-19 31.952946
2019-01-20 31.739855
2019-01-21 31.355354
2019-01-22 31.271243
2019-01-23 31.273255
2019-01-24 31.442803
2019-01-25 32.034161
2019-01-26 31.455956
2019-01-27 31.408881
2019-01-28 31.066477
2019-01-29 30.489070
2019-01-30 30.356210
2019-01-31 30.470496
2019-02-01 29.949312
2019-02-02 29.916971
2019-02-03 29.865447
2019-02-04 29.512595
2019-02-05 29.297967
2019-02-06 28.743329
2019-02-07 28.509800
2019-02-08 27.681294
2019-02-10 26.441899
2019-02-11 26.787360
2019-02-12 27.368621
2019-02-13 27.085167
2019-02-14 26.856398
2019-02-15 26.793370
2019-02-16 26.334788
2019-02-17 25.906381
2019-02-18 25.367705
2019-02-19 24.939880
2019-02-20 25.021575
2019-02-21 25.006527
2019-02-22 24.984512
2019-02-23 24.372664
2019-02-24 24.183728
2019-10-10 23.970567
2019-10-11 24.755944
2019-10-12 25.155136
2019-10-13 25.273033
2019-10-14 25.490775
2019-10-15 25.864637
2019-10-16 26.168158
2019-10-17 26.600422
2019-10-18 26.959990
2019-10-19 26.965104
2019-10-20 27.128877
2019-10-21 26.908657
2019-10-22 26.979930
2019-10-23 26.816817
2019-10-24 27.058753
2019-10-25 27.453882
2019-10-26 27.358057
2019-10-27 27.374445
2019-10-28 27.418648
2019-10-29 27.458521
2019-10-30 27.859687
2019-10-31 28.093942
2019-11-01 28.494706
2019-11-02 28.517255
2019-11-03 28.492476
2019-11-04 28.723757
2019-11-05 28.835151
2019-11-06 29.367227
2019-11-07 29.920598
2019-11-08 29.746370
2019-11-09 29.498023
2019-11-10 29.745044
2019-11-11 30.935084
2019-11-12 31.710737
2019-11-13 32.890792
2019-11-14 33.011911
2019-11-15 33.121803
2019-11-16 32.805403
2019-11-17 32.887447
2019-11-18 33.350492
2019-11-19 33.525344
2019-11-20 33.791458
2019-11-21 33.674697
2019-11-22 33.642584
2019-11-23 33.704386
2019-11-24 33.472346
2019-11-25 33.317035
2019-11-26 32.934307
2019-11-27 33.573193
2019-11-28 32.840514
2019-11-29 33.085686
2019-11-30 33.138131
2019-12-01 33.344264
2019-12-02 33.524948
2019-12-03 33.694687
2019-12-04 33.836534
2019-12-05 34.343416
2019-12-06 34.321793
2019-12-07 34.156796
2019-12-08 34.399591
2019-12-09 34.931185
2019-12-10 35.294034
2019-12-11 35.021331
2019-12-12 34.292483
2019-12-13 34.330898
2019-12-14 34.354278
2019-12-15 34.436500
2019-12-16 34.869841
2019-12-17 34.932567
2019-12-18 34.855816
2019-12-19 35.226241
2019-12-20 35.184222
2019-12-21 35.456716
2019-12-22 35.730350
2019-12-23 35.739911
2019-12-24 35.800030
2019-12-25 35.896615
2019-12-26 35.871280
2019-12-27 35.509646
2019-12-28 35.235416
2019-12-29 34.848605
2019-12-30 34.926700
2019-12-31 34.787211
And I want to split it like:
chunk,start,end,value
0,2019-01-01,2019-02-24,35.235416
1,2019-10-10,2019-12-31,34.787211
The values are random and can come from any aggregation function; I don't care about those. The important thing is the chunks I get, but I still cannot find a way to do it.
I assume that your DataFrame:
has columns named Date and Amount,
Date column is of datetime type (not string).
To generate your result, define the following function, to be applied
to each group of rows:
def grpRes(grp):
    return pd.Series([grp.Date.min(), grp.Date.max(), grp.Amount.mean()],
                     index=['start', 'end', 'value'])
Then apply it to each group and rename the index:
res = df.groupby(df.Date.diff().dt.days.fillna(1, downcast='infer')
                   .gt(1).cumsum()).apply(grpRes)
res.index.name = 'chunk'
I noticed that your data sample has no row for 2019-02-09, but you don't treat such a single missing day as a violation of the "continuity rule".
If you really want that behaviour, change gt(1) to e.g. gt(2).
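A compact variant of the same grouping trick, using named aggregation instead of a custom function; the DataFrame below is an artificial stand-in with the assumed Date/Amount columns and a deliberate gap:

```python
import pandas as pd

# Tiny stand-in DataFrame with a gap between 2019-01-03 and 2019-10-10.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03',
                            '2019-10-10', '2019-10-11']),
    'Amount': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# A new chunk starts wherever the day-to-day gap exceeds 1 day.
chunk_id = df['Date'].diff().dt.days.fillna(1).gt(1).cumsum()

res = df.groupby(chunk_id).agg(start=('Date', 'min'),
                               end=('Date', 'max'),
                               value=('Amount', 'mean'))
res.index.name = 'chunk'
print(res)
```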
One way is boolean indexing, which assumes your data is already sorted. I also assumed your columns are named ['Date', 'Val'].
#reset index so you have a dataframe
data = s.reset_index()
# boolean indexing where the date below is greater than 1 day
end = data[((data['Date'] - data['Date'].shift(-1)).dt.days.abs() != 1)].reset_index(drop=True).rename(columns={'Date':'End', 'Val': 'End_val'})
# boolean indexing where the date above is greater than one day
start = data[(data['Date'] - data['Date'].shift()).dt.days != 1].reset_index(drop=True).rename(columns={'Date':'Start', 'Val':'Start_val'})
# concat your data
pd.concat([start,end], axis=1)
Start Start_val End End_val
0 2019-01-01 36.581647 2019-02-08 27.681294
1 2019-02-10 26.441899 2019-02-24 24.183728
2 2019-10-10 23.970567 2019-12-31 34.787211
ZILLOW/C25499_MLPFAH - Value
Date
2013-04-30 178.571429
2013-05-31 178.571429
2013-06-30 185.380865
2013-07-31 176.747442
2013-08-31 166.666667
2013-09-30 167.599502
2013-10-31 169.025157
2013-11-30 160.929092
2013-12-31 165.282392
2014-01-31 167.153775
2014-02-28 166.666667
2014-03-31 172.686604
2014-04-30 172.207447
2014-05-31 161.466408
2014-06-30 156.976744
2014-07-31 142.410714
2014-08-31 144.152523
2014-09-30 145.656780
2014-10-31 150.291745
2014-11-30 152.343542
2014-12-31 152.343542
2015-01-31 150.387968
2015-02-28 154.441006
2015-03-31 157.130952
2015-04-30 154.761905
2015-05-31 149.999583
2015-06-30 148.054146
2015-07-31 152.357673
2015-08-31 148.054146
2015-09-30 154.715762
2015-10-31 165.719697
2015-11-30 165.719697
2015-12-31 158.990168
2016-01-31 158.990168
2016-02-29 146.204168
2016-03-31 148.255814
2016-04-30 145.340150
2016-05-31 144.152523
2016-06-30 144.152523
2016-07-31 153.556496
2016-08-31 157.471093
2016-09-30 166.272727
2016-10-31 171.289349
2016-11-30 166.272727
2016-12-31 164.085821
2017-01-31 155.586081
2017-02-28 149.224486
2017-03-31 149.107143
2017-04-30 151.785714
2017-05-31 149.107143
2017-06-30 151.903057
2017-07-31 151.903057
2017-08-31 152.020400
2017-09-30 151.477833
2017-10-31 145.813048
2017-11-30 150.843468
2017-12-31 146.829969
2018-01-31 147.846890
2018-02-28 150.843468
2018-03-31 146.920361
data = '''2013-04-30 178.571429
2013-05-31 178.571429
2013-06-30 185.380865
2013-07-31 176.747442
2013-08-31 166.666667
2013-09-30 167.599502
2013-10-31 169.025157
2013-11-30 160.929092
2013-12-31 165.282392
2014-01-31 167.153775
2014-02-28 166.666667
2014-03-31 172.686604
2014-04-30 172.207447
2014-05-31 161.466408
2014-06-30 156.976744
2014-07-31 142.410714
2014-08-31 144.152523
2014-09-30 145.656780
2014-10-31 150.291745
2014-11-30 152.343542
2014-12-31 152.343542
2015-01-31 150.387968
2015-02-28 154.441006
2015-03-31 157.130952
2015-04-30 154.761905
2015-05-31 149.999583
2015-06-30 148.054146
2015-07-31 152.357673
2015-08-31 148.054146
2015-09-30 154.715762
2015-10-31 165.719697
2015-11-30 165.719697
2015-12-31 158.990168
2016-01-31 158.990168
2016-02-29 146.204168
2016-03-31 148.255814
2016-04-30 145.340150
2016-05-31 144.152523
2016-06-30 144.152523
2016-07-31 153.556496
2016-08-31 157.471093
2016-09-30 166.272727
2016-10-31 171.289349
2016-11-30 166.272727
2016-12-31 164.085821
2017-01-31 155.586081
2017-02-28 149.224486
2017-03-31 149.107143
2017-04-30 151.785714
2017-05-31 149.107143
2017-06-30 151.903057
2017-07-31 151.903057
2017-08-31 152.020400
2017-09-30 151.477833
2017-10-31 145.813048
2017-11-30 150.843468
2017-12-31 146.829969
2018-01-31 147.846890
2018-02-28 150.843468
2018-03-31 146.920361'''
prices = [float(r.split()[1]) for r in data.split('\n')]
print(len([p for p in prices if 160 <= p <= 170]))
This outputs 13.
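If the values are already in a pandas Series, Series.between gives the same range count more directly; a sketch on a handful of made-up numbers rather than the full ZILLOW series:

```python
import pandas as pd

# between() is inclusive on both ends by default, matching 160 <= p <= 170.
s = pd.Series([178.57, 166.67, 160.93, 185.38, 169.03])
count = s.between(160, 170).sum()
print(count)
```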
Fairly new to Python and pandas here.
I make a query that gives me back a timeseries. I'm never sure how many data points I'll receive from the query (run for a single day), but I do know that I need to resample them to contain 24 points (one for each hour in the day).
Printing m3hstream gives
[(1479218009000L, 109), (1479287368000L, 84)]
Then I try to make a dataframe df with
df = pd.DataFrame(data = list(m3hstream), columns=['Timestamp', 'Value'])
and this gives me an output of
Timestamp Value
0 1479218009000 109
1 1479287368000 84
Following I do this
daily_summary = pd.DataFrame()
daily_summary['value'] = df['Value'].resample('H').mean()
daily_summary = daily_summary.truncate(before=start, after=end)
print "Now daily summary"
print daily_summary
But this is giving me a TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could anyone please let me know how to resample it so I have 1 point for each hour in the 24 hour period that I'm querying for?
Thanks.
The first thing you need to do is convert that 'Timestamp' column to actual datetimes. It looks like those values are epoch milliseconds.
Then resample with the on parameter set to 'Timestamp'
df = df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index()
Timestamp Value
0 2016-11-15 13:00:00 109.0
1 2016-11-15 14:00:00 NaN
2 2016-11-15 15:00:00 NaN
3 2016-11-15 16:00:00 NaN
4 2016-11-15 17:00:00 NaN
5 2016-11-15 18:00:00 NaN
6 2016-11-15 19:00:00 NaN
7 2016-11-15 20:00:00 NaN
8 2016-11-15 21:00:00 NaN
9 2016-11-15 22:00:00 NaN
10 2016-11-15 23:00:00 NaN
11 2016-11-16 00:00:00 NaN
12 2016-11-16 01:00:00 NaN
13 2016-11-16 02:00:00 NaN
14 2016-11-16 03:00:00 NaN
15 2016-11-16 04:00:00 NaN
16 2016-11-16 05:00:00 NaN
17 2016-11-16 06:00:00 NaN
18 2016-11-16 07:00:00 NaN
19 2016-11-16 08:00:00 NaN
20 2016-11-16 09:00:00 84.0
If you want to fill those NaN values, use ffill, bfill, or interpolate
df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index().interpolate()
Timestamp Value
0 2016-11-15 13:00:00 109.00
1 2016-11-15 14:00:00 107.75
2 2016-11-15 15:00:00 106.50
3 2016-11-15 16:00:00 105.25
4 2016-11-15 17:00:00 104.00
5 2016-11-15 18:00:00 102.75
6 2016-11-15 19:00:00 101.50
7 2016-11-15 20:00:00 100.25
8 2016-11-15 21:00:00 99.00
9 2016-11-15 22:00:00 97.75
10 2016-11-15 23:00:00 96.50
11 2016-11-16 00:00:00 95.25
12 2016-11-16 01:00:00 94.00
13 2016-11-16 02:00:00 92.75
14 2016-11-16 03:00:00 91.50
15 2016-11-16 04:00:00 90.25
16 2016-11-16 05:00:00 89.00
17 2016-11-16 06:00:00 87.75
18 2016-11-16 07:00:00 86.50
19 2016-11-16 08:00:00 85.25
20 2016-11-16 09:00:00 84.00
Let's try:
daily_summary = daily_summary.set_index('Timestamp')
daily_summary.index = pd.to_datetime(daily_summary.index, unit='ms')
For once an hour:
daily_summary.resample('H').mean()
or for once a day:
daily_summary.resample('D').mean()
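One caveat with either approach: resample only creates bins between the first and last observation, so a day queried part-way through comes back with fewer than 24 rows. If exactly 24 hourly points are required, reindexing against an explicit hourly range for the queried day guarantees them; a sketch where the day and the second timestamp are assumed for illustration:

```python
import pandas as pd

# Two readings during 2016-11-15 in epoch milliseconds (the first is from
# the question; the second is invented to land in the 14:00 bucket).
df = pd.DataFrame({'Timestamp': [1479218009000, 1479221009000],
                   'Value': [109, 84]})
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='ms')

hourly = df.resample('H', on='Timestamp').mean()

# Force exactly 24 rows for the day, whatever the query returned;
# hours with no data stay NaN (fill with interpolate/ffill as needed).
day = pd.date_range('2016-11-15', periods=24, freq='H')
hourly = hourly.reindex(day)
print(len(hourly))
```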