I am doing 1-D interpolation using scipy on a time series. My x-axis data is in datetime format and the y-axis is float, like:
3/15/2012 16:00:00 32.94
3/16/2012 16:00:00 32.95
3/19/2012 16:00:00 32.61
Now, during the slope calculation slope = (y_hi - y_lo) / (x_hi - x_lo), I am getting the error TypeError: unsupported operand type(s) for /: 'float' and 'datetime.timedelta', which is an obvious error. Can someone point me in the right direction? How do I handle this?
Your issue is that you are trying to divide a float by a datetime.timedelta object, which, as you said, obviously throws a TypeError.
You can convert datetime.timedelta objects to a float representing the total number of seconds within that timedelta using the datetime.timedelta.total_seconds() instance method.
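For example, with the one-day gap between your first two samples:
from datetime import datetime
delta = datetime(2012, 3, 16, 16) - datetime(2012, 3, 15, 16)
print(delta.total_seconds())  # 86400.0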
In that case you would modify your code to something like:
slope_numerator = y_hi - y_lo
slope_denominator = (x_hi - x_lo).total_seconds()
slope = slope_numerator / slope_denominator
Note that this will give you a slope in terms of seconds. You could scale the denominator to get it in terms of hours, days, etc., to suit your purposes.
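Putting it together for your use case, here is a minimal sketch; the names dates and values are hypothetical stand-ins for your parsed data:
from datetime import datetime
from scipy.interpolate import interp1d
# hypothetical parsed input matching the sample data in the question
dates = [datetime(2012, 3, 15, 16), datetime(2012, 3, 16, 16), datetime(2012, 3, 19, 16)]
values = [32.94, 32.95, 32.61]
# convert each datetime to float seconds elapsed since the first sample
x = [(d - dates[0]).total_seconds() for d in dates]
f = interp1d(x, values)
# query at an arbitrary datetime by converting it the same way
query = datetime(2012, 3, 17, 16)
print(f((query - dates[0]).total_seconds()))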
If you are working with time-series data, the Pandas package is an excellent option. Here's an example of upsampling daily data to hourly data via interpolation:
import numpy as np
import pandas as pd
rng = pd.date_range('1/1/2011', periods=12, freq='D')
ts = pd.Series(np.arange(len(rng)), index=rng)
resampled = ts.resample('H')
interp = resampled.interpolate()
In [5]: ts
Out[5]:
2011-01-01 0
2011-01-02 1
2011-01-03 2
2011-01-04 3
2011-01-05 4
2011-01-06 5
2011-01-07 6
2011-01-08 7
2011-01-09 8
2011-01-10 9
2011-01-11 10
2011-01-12 11
In [12]: interp.head()
Out[12]:
2011-01-01 00:00:00 0.000000
2011-01-01 01:00:00 0.041667
2011-01-01 02:00:00 0.083333
2011-01-01 03:00:00 0.125000
2011-01-01 04:00:00 0.166667
Freq: H, dtype: float64
In [13]: interp.tail()
Out[13]:
2011-01-11 20:00:00 10.833333
2011-01-11 21:00:00 10.875000
2011-01-11 22:00:00 10.916667
2011-01-11 23:00:00 10.958333
2011-01-12 00:00:00 11.000000
Freq: H, dtype: float64
I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to have data every half hour, and impute each new position with the mean of the previous and following values (or any similar interpolation), that is, for example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you
You can use the resample() function in Pandas. With it you can set the frequency to down/upsample to and then choose what to do with it (mean, sum, etc.). In your case you can also interpolate between the values.
For this to work, your datetime column has to have a datetime dtype; then set it as the index.
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T') and interpolate.
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
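Putting it all together, a minimal self-contained sketch (using a few made-up hourly readings in the same shape as your data):
import pandas as pd
df = pd.DataFrame({
    'datetime': ['2015-01-01 00:00:00', '2015-01-01 01:00:00',
                 '2015-01-01 02:00:00', '2015-01-01 03:00:00'],
    'temp': [11.22, 11.32, 11.30, 11.25],
})
# parse the strings to datetime64 and move the column into the index
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
# upsample to 30-minute bins and linearly interpolate the gaps
print(df.resample('30T').interpolate())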
Read more about the frequency strings and resampling in the Pandas docs.
I have the following data frame:
dteday
0 2011-01-01
1 2011-01-02
2 2011-01-03
3 2011-01-04
4 2011-01-05
5 2011-01-06
6 2011-01-07
7 2011-01-08
8 2011-01-09
9 2011-01-10
10 2011-01-11
11 2011-01-12
12 2011-01-13
13 2011-01-14
14 2011-01-15
15 2011-01-16
16 2011-01-17
I want to transform this column into a column of Unix timestamps for each date.
I tried this, but ran into the following error:
df['tmstamp'] = df.dteday.astype(np.int64)
ValueError: invalid literal for int() with base 10: '2011-01-01'
I can't find a similar question anywhere.
What's the problem? Thanks.
Looks like your current code is trying to convert the string 2011-01-01 directly to an integer, i.e. np.int64. The parsing/conversion fails, which is why you're seeing the error.
You can use the pd.to_datetime() method to convert the string values in the column to datetime objects first (docs). Then you can convert the type to np.int64.
Given the following dataframe:
dates
0 2011-01-01
1 2011-01-02
2 2011-01-03
3 2011-01-04
4 2011-01-05
Try this:
df['timestamp'] = pd.to_datetime(df['dates']).astype(np.int64)
Outputs:
dates timestamp
0 2011-01-01 1293840000000000000
1 2011-01-02 1293926400000000000
2 2011-01-03 1294012800000000000
3 2011-01-04 1294099200000000000
4 2011-01-05 1294185600000000000
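Note that astype(np.int64) gives nanoseconds since the Unix epoch, as the magnitudes above show. If you want conventional Unix timestamps in seconds, you can integer-divide by 10**9, for example:
# nanoseconds since the epoch -> seconds since the epoch
df['timestamp'] = pd.to_datetime(df['dates']).astype(np.int64) // 10**9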
What I have:
A pandas dataframe with a column containing dates
Python 3.6
What I want:
Compute a new column, where the new value for every row depends only on a part of the date in the existing column for the same row (for example, an operation that depends only on the hour of the date)
Do so in an efficient manner (thinking, vectorized), as opposed to row-by-row computations.
Example dataframe (a small dataframe is convenient for printing, but I also have an actual use case with a larger dataframe, which I can't share but can use for timing different solutions):
import numpy as np
import pandas as pd
from datetime import datetime
from datetime import timedelta
df = pd.DataFrame({'Date': np.arange(datetime(2000,1,1),
                                     datetime(2000,1,2),
                                     timedelta(hours=3)).astype(datetime)})
print(df)
Which gives:
Date
0 2000-01-01 00:00:00
1 2000-01-01 03:00:00
2 2000-01-01 06:00:00
3 2000-01-01 09:00:00
4 2000-01-01 12:00:00
5 2000-01-01 15:00:00
6 2000-01-01 18:00:00
7 2000-01-01 21:00:00
Existing solution (too slow):
df['SinHour'] = df.apply(
    lambda row: np.sin((row.Date.hour + float(row.Date.minute) / 60.0) * np.pi / 12.0),
    axis=1)
print(df)
Which gives:
Date SinHour
0 2000-01-01 00:00:00 0.000000e+00
1 2000-01-01 03:00:00 7.071068e-01
2 2000-01-01 06:00:00 1.000000e+00
3 2000-01-01 09:00:00 7.071068e-01
4 2000-01-01 12:00:00 1.224647e-16
5 2000-01-01 15:00:00 -7.071068e-01
6 2000-01-01 18:00:00 -1.000000e+00
7 2000-01-01 21:00:00 -7.071068e-01
I say this solution is too slow, because it computes every value in the column row-by-row. Of course, if this really is the only possibility, I'll have to settle for this. However, in the case of simpler functions, I've gotten huge speedups by using vectorized numpy functions, which I'm hoping will be possible in some way here too.
Direction for desired solution (does not work):
I was hoping to be able to do something like this:
df = df.assign(
    SinHour=lambda data: np.sin((data.Date.hour + float(data.Date.minute) / 60.0)
                                * np.pi / 12.0))
This is the direction I was hoping to go in, because it's no longer a row-by-row apply. However, it obviously doesn't work, because it can't access the hour and minute properties of the entire Date column at once in a "vectorized" manner.
You were really close; you just need the .dt accessor to work with a datetime Series, and astype for the cast:
df = df.assign(SinHour=np.sin((df.Date.dt.hour +
                               df.Date.dt.minute.astype(float) / 60.0) * np.pi / 12.0))
print(df)
Date SinHour
0 2000-01-01 00:00:00 0.000000e+00
1 2000-01-01 03:00:00 7.071068e-01
2 2000-01-01 06:00:00 1.000000e+00
3 2000-01-01 09:00:00 7.071068e-01
4 2000-01-01 12:00:00 1.224647e-16
5 2000-01-01 15:00:00 -7.071068e-01
6 2000-01-01 18:00:00 -1.000000e+00
7 2000-01-01 21:00:00 -7.071068e-01
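Equivalently, without assign (true division already promotes the minutes to float, so the explicit astype(float) cast is optional):
# same vectorized computation written as a direct column assignment
df['SinHour'] = np.sin((df.Date.dt.hour + df.Date.dt.minute / 60.0) * np.pi / 12.0)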
I'd like to find faster code to achieve the same goal: for each row, compute the median of all data in the past 30 days. But if there are fewer than 5 data points, return np.nan.
import pandas as pd
import numpy as np
import datetime
def findPastVar(df, var='var', window=30, method='median'):
    # window = number of past days to look back over
    def findPastVar_apply(row):
        # values strictly before this row's timestamp, within the past `window` days
        mask = ((df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0)) &
                (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window)))
        pastVar = df[var].loc[mask]
        if len(pastVar) < 5:
            return np.nan
        if method == 'median':
            return np.median(pastVar.values)
    df['past{}d_{}_median'.format(window, var)] = df.apply(findPastVar_apply, axis=1)
    return df
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
The data looks like this. In my real data there are gaps in time, and there may be more than one data point in a day.
In [47]: df.head()
Out[47]:
timestamp var
0 2011-01-01 00:00:00 -0.670695
1 2011-01-02 00:00:00 0.315148
2 2011-01-03 00:00:00 -0.717432
3 2011-01-04 00:00:00 2.904063
4 2011-01-05 00:00:00 -1.092813
Desired output:
In [55]: df.head(10)
Out[55]:
timestamp var past30d_var_median
0 2011-01-01 00:00:00 -0.670695 NaN
1 2011-01-02 00:00:00 0.315148 NaN
2 2011-01-03 00:00:00 -0.717432 NaN
3 2011-01-04 00:00:00 2.904063 NaN
4 2011-01-05 00:00:00 -1.092813 NaN
5 2011-01-06 00:00:00 -2.676784 -0.670695
6 2011-01-07 00:00:00 -0.353425 -0.694063
7 2011-01-08 00:00:00 -0.223442 -0.670695
8 2011-01-09 00:00:00 0.162126 -0.512060
9 2011-01-10 00:00:00 0.633801 -0.353425
However, my current code running speed:
In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop
I need to run this on a large dataframe from time to time, so I want to optimize the code.
Any suggestions or comments are welcome.
New in pandas 0.19 is time-aware rolling. It can deal with missing data.
Code:
print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
Test Code:
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
# duplicate one sample
df.timestamp.loc[50] = df.timestamp.loc[51]
# drop some data
df = df.drop(range(15, 50))
df['median'] = df.rolling(
    '30d', on='timestamp', min_periods=5)['var'].median()
Results:
timestamp var median
0 2011-01-01 00:00:00 -0.639901 NaN
1 2011-01-02 00:00:00 -1.212541 NaN
2 2011-01-03 00:00:00 1.015730 NaN
3 2011-01-04 00:00:00 -0.203701 NaN
4 2011-01-05 00:00:00 0.319618 -0.203701
5 2011-01-06 00:00:00 1.272088 0.057958
6 2011-01-07 00:00:00 0.688965 0.319618
7 2011-01-08 00:00:00 -1.028438 0.057958
8 2011-01-09 00:00:00 1.418207 0.319618
9 2011-01-10 00:00:00 0.303839 0.311728
10 2011-01-11 00:00:00 -1.939277 0.303839
11 2011-01-12 00:00:00 1.052173 0.311728
12 2011-01-13 00:00:00 0.710270 0.319618
13 2011-01-14 00:00:00 1.080713 0.504291
14 2011-01-15 00:00:00 1.192859 0.688965
50 2011-02-21 00:00:00 -1.126879 NaN
51 2011-02-21 00:00:00 0.213635 NaN
52 2011-02-22 00:00:00 -1.357243 NaN
53 2011-02-23 00:00:00 -1.993216 NaN
54 2011-02-24 00:00:00 1.082374 -1.126879
55 2011-02-25 00:00:00 0.124840 -0.501019
56 2011-02-26 00:00:00 -0.136822 -0.136822
57 2011-02-27 00:00:00 -0.744386 -0.440604
58 2011-02-28 00:00:00 -1.960251 -0.744386
59 2011-03-01 00:00:00 0.041767 -0.440604
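One subtlety: this rolling window includes the current row, whereas your apply version looks only at strictly earlier timestamps. In pandas 0.20 and later you can pass closed='left' to exclude the current observation, for example:
# closed='left' drops the current row from each window, matching the
# strict "past only" condition in the original apply-based version
df['median'] = df.rolling('30d', on='timestamp', min_periods=5,
                          closed='left')['var'].median()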
You can try rolling_median, an O(N log(window)) implementation using a skip list:
pd.rolling_median(df, window=30, min_periods=5)
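Note that pd.rolling_median was deprecated and later removed from pandas; a sketch of the modern spelling of the same row-based window:
# modern equivalent of the removed top-level pd.rolling_median
df['var'].rolling(window=30, min_periods=5).median()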
I have a dataframe with only numeric values, and I want to calculate the mean for every column and create a new dataframe.
The original dataframe is indexed by a datetime field. The new dataframe should be indexed by the same field as the original dataframe, with a value equal to the last row index of the original dataframe.
Code so far:
mean_series = df.mean()
df_mean = pd.DataFrame(mean_series)
df_mean.rename(columns=lambda x: 'std_dev_' + x, inplace=True)
but this gives an error
df_mean.rename(columns=lambda x: 'std_dev_' + x, inplace=True)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S21') dtype('S21') dtype('S21')
Your question implies that you want a new DataFrame with a single row.
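For reference, the session below assumes a dataframe of random values on an hourly DatetimeIndex, something like this sketch:
import numpy as np
import pandas as pd
# hypothetical data comparable to the output shown below:
# 72 hourly rows (2011-01-01 00:00 through 2011-01-03 23:00), 4 columns
idx = pd.date_range('2011-01-01', periods=72, freq='H')
df = pd.DataFrame(np.random.rand(72, 4), index=idx)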
In [10]: df.head(10)
Out[10]:
0 1 2 3
2011-01-01 00:00:00 0.182481 0.523784 0.718124 0.063792
2011-01-01 01:00:00 0.321362 0.404686 0.481889 0.524521
2011-01-01 02:00:00 0.514426 0.735809 0.433758 0.392824
2011-01-01 03:00:00 0.616802 0.149099 0.217199 0.155990
2011-01-01 04:00:00 0.525465 0.439633 0.641974 0.270364
2011-01-01 05:00:00 0.749662 0.151958 0.200913 0.219916
2011-01-01 06:00:00 0.665164 0.396595 0.980862 0.560119
2011-01-01 07:00:00 0.797803 0.377273 0.273724 0.220965
2011-01-01 08:00:00 0.651989 0.553929 0.769008 0.545288
2011-01-01 09:00:00 0.692169 0.261194 0.400704 0.118335
In [11]: df.tail()
Out[11]:
0 1 2 3
2011-01-03 19:00:00 0.247211 0.539330 0.734206 0.781125
2011-01-03 20:00:00 0.278550 0.534943 0.804949 0.137291
2011-01-03 21:00:00 0.602246 0.108791 0.987120 0.455887
2011-01-03 22:00:00 0.003097 0.436435 0.987877 0.046066
2011-01-03 23:00:00 0.604916 0.670532 0.513927 0.610775
In [12]: df.mean()
Out[12]:
0 0.495307
1 0.477509
2 0.562590
3 0.447997
dtype: float64
In [13]: new_df = pd.DataFrame(df.mean().to_dict(),index=[df.index.values[-1]])
In [14]: new_df
Out[14]:
0 1 2 3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
In [15]: new_df.rename(columns=lambda c: "mean_"+str(c))
Out[15]:
mean_0 mean_1 mean_2 mean_3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
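If you actually wanted standard deviations (as the std_dev_ prefix in your code suggests), the same pattern works; a sketch:
# single-row dataframe of per-column standard deviations,
# indexed by the last timestamp of the original frame
df_std = pd.DataFrame(df.std().to_dict(), index=[df.index.values[-1]])
df_std = df_std.rename(columns=lambda c: 'std_dev_' + str(c))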