calculate time difference pandas dataframe - python

I have a pandas dataframe where index is as follows :
Index([16/May/2013:23:56:43, 16/May/2013:23:56:42, 16/May/2013:23:56:43, ..., 17/May/2013:23:54:45, 17/May/2013:23:54:45, 17/May/2013:23:54:45], dtype=object)
I calculated the time difference between consecutive occurrences as follows:
df2['tvalue'] = df2.index
df2['tvalue'] = np.datetime64(df2['tvalue'])
df2['delta'] = (df2['tvalue']-df2['tvalue'].shift()).fillna(0)
This gave the following output:
Time tvalue delta
16/May/2013:23:56:43 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:42 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:43 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:43 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:48 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:48 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:48 2013-05-01 13:23:56 00:00:00
16/May/2013:23:57:44 2013-05-01 13:23:57 00:00:01
16/May/2013:23:57:44 2013-05-01 13:23:57 00:00:00
16/May/2013:23:57:44 2013-05-01 13:23:57 00:00:00
But it has calculated the time difference as if the year were hours, and the date is also different. What can be the problem here?

Parsing your dates was non-trivial; I think strptime could probably do it, but it didn't work for me. In your example above, your times are just strings, not datetimes.
In [140]: from dateutil import parser
In [130]: def parse(x):
.....: date, hh, mm, ss = x.split(':')
.....: dd, mo, yyyy = date.split('/')
.....: return parser.parse("%s %s %s %s:%s:%s" % (yyyy,mo,dd,hh,mm,ss))
.....:
In [131]: map(parse,idx)
Out[131]:
[datetime.datetime(2013, 5, 16, 23, 56, 43),
datetime.datetime(2013, 5, 16, 23, 56, 42),
datetime.datetime(2013, 5, 16, 23, 56, 43),
datetime.datetime(2013, 5, 17, 23, 54, 45),
datetime.datetime(2013, 5, 17, 23, 54, 45),
datetime.datetime(2013, 5, 17, 23, 54, 45)]
In [132]: pd.to_datetime(map(parse,idx))
Out[132]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-16 23:56:43, ..., 2013-05-17 23:54:45]
Length: 6, Freq: None, Timezone: None
In [133]: df = DataFrame(dict(time = pd.to_datetime(map(parse,idx))))
In [134]: df
Out[134]:
time
0 2013-05-16 23:56:43
1 2013-05-16 23:56:42
2 2013-05-16 23:56:43
3 2013-05-17 23:54:45
4 2013-05-17 23:54:45
5 2013-05-17 23:54:45
In [138]: df['delta'] = (df['time']-df['time'].shift()).fillna(0)
In [139]: df
Out[139]:
time delta
0 2013-05-16 23:56:43 00:00:00
1 2013-05-16 23:56:42 -00:00:01
2 2013-05-16 23:56:43 00:00:01
3 2013-05-17 23:54:45 23:58:02
4 2013-05-17 23:54:45 00:00:00
5 2013-05-17 23:54:45 00:00:00
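On recent pandas versions, a custom parse helper isn't needed: pd.to_datetime accepts an explicit format string that matches this Apache-log-style timestamp directly. A minimal sketch (the sample strings are taken from the question's index):

```python
import pandas as pd

# Parse the log-style strings directly with an explicit format,
# avoiding the manual split/parser.parse helper.
idx = pd.to_datetime(
    ["16/May/2013:23:56:43", "16/May/2013:23:56:42", "17/May/2013:23:54:45"],
    format="%d/%b/%Y:%H:%M:%S",
)

df = pd.DataFrame({"time": idx})
# diff() computes consecutive differences; the first row is NaT, so fill it.
df["delta"] = df["time"].diff().fillna(pd.Timedelta(0))
print(df)
```

This reproduces the shift/subtract pattern above with the built-in diff() method.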

Related

Convert a datetime index to sequential numbers for the x values of machine learning

This seems like a basic question. I want to use the datetime index of a pandas dataframe as the x values for a machine learning algorithm doing univariate time series comparisons.
I tried to isolate the index and then convert it to a number, but I get an error.
df=data["Close"]
idx=df.index
df.index.get_loc(idx)
Date
2014-03-31 0.9260
2014-04-01 0.9269
2014-04-02 0.9239
2014-04-03 0.9247
2014-04-04 0.9233
This is what I get when I add your code:
2019-04-24 00:00:00 0.7097
2019-04-25 00:00:00 0.7015
2019-04-26 00:00:00 0.7018
2019-04-29 00:00:00 0.7044
x (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
Name: Close, Length: 1325, dtype: object
I need a column running from 1 to the number of values in my dataframe.
First select the column Close with double brackets to get a one-column DataFrame, which makes it possible to add a new column:
df = data[["Close"]]
df["x"] = np.arange(1, len(df) + 1)
print (df)
Close x
Date
2014-03-31 0.9260 1
2014-04-01 0.9269 2
2014-04-02 0.9239 3
2014-04-03 0.9247 4
2014-04-04 0.9233 5
You can add a column with the values range(1, len(data) + 1) like so:
df = pd.DataFrame({"y": [5, 4, 3, 2, 1]}, index=pd.date_range(start="2019-08-01", periods=5))
In [3]: df
Out[3]:
y
2019-08-01 5
2019-08-02 4
2019-08-03 3
2019-08-04 2
2019-08-05 1
df["x"] = range(1, len(df) + 1)
In [7]: df
Out[7]:
y x
2019-08-01 5 1
2019-08-02 4 2
2019-08-03 3 3
2019-08-04 2 4
2019-08-05 1 5
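For the original goal of feeding the dates to a machine learning model, a plain 1..n counter discards the spacing between dates. A sketch of both options on a small hypothetical series standing in for data["Close"] (not from the original answers):

```python
import numpy as np
import pandas as pd

# Hypothetical price series indexed by date, standing in for data["Close"]
s = pd.Series([0.9260, 0.9269, 0.9239],
              index=pd.date_range("2014-03-31", periods=3, freq="D"),
              name="Close")

df = s.to_frame()
df["x"] = np.arange(1, len(df) + 1)              # sequential 1..n feature
# Ordinal day numbers preserve the gaps between dates (weekends, holidays):
df["ordinal"] = df.index.map(pd.Timestamp.toordinal)
print(df)
```

Which column to use depends on whether the model should see calendar gaps or just the observation order.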

set_codes in multiIndexed pandas series

I want to multi-index an array of data.
Initially, I was indexing my data with datetimes, but for some later applications I had to add another numeric index (running from 0 to len(array)-1).
I wrote these lines:
O = [0.701733664614, 0.699495411782, 0.572129320819, 0.613315597684, 0.58079660603, 0.596638918579, 0.48453382119]
Ab = [datetime.datetime(2018, 12, 11, 14, 0), datetime.datetime(2018, 12, 21, 10, 0), datetime.datetime(2018, 12, 21, 14, 0), datetime.datetime(2019, 1, 1, 10, 0), datetime.datetime(2019, 1, 1, 14, 0), datetime.datetime(2019, 1, 11, 10, 0), datetime.datetime(2019, 1, 11, 14, 0)]
tst = pd.Series(O,index=Ab)
ld = len(tst)
index = pd.MultiIndex.from_product([(x for x in range(0,ld)),Ab], names=['id','dtime'])
print (index)
data = pd.Series(O,index=index)
But when printing the index, I get some bizarre ''codes'':
The levels & names are perfect, but each code goes from 0 to 763... 764 times (instead of once)!
I tried to add the set_codes command:
index.set_codes([x for x in range(0,ld)], level=0)
print (index)
In vain; I get the following error:
ValueError: Unequal code lengths: [764, 583696]
The initial pandas series:
print (tst)
2005-01-01 14:00:00 0.544177
2005-01-01 14:00:00 0.544177
2005-01-21 14:00:00 0.602239
...
2019-05-21 10:00:00 0.446813
2019-05-21 14:00:00 0.466573
Length: 764, dtype: float64
The expected new one:
id dtime
0 2005-01-01 14:00:00 0.544177
1 2005-01-01 14:00:00 0.544177
2 2005-01-21 14:00:00 0.602239
...
762 2019-05-21 10:00:00 0.446813
763 2019-05-21 14:00:00 0.466573
Thanks in advance
You can create new index by MultiIndex.from_arrays and reassign to Series:
s.index = pd.MultiIndex.from_arrays([np.arange(len(s)), s.index], names=['id','dtime'])
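A self-contained sketch using the question's own data (shortened to three rows here). Note the cause of the error above: from_product builds all n*n combinations, and 764 * 764 = 583696, exactly the second code length in the ValueError; from_arrays instead pairs the i-th id with the i-th timestamp:

```python
import datetime
import numpy as np
import pandas as pd

# Rebuild a shortened version of the question's series
O = [0.701733664614, 0.699495411782, 0.572129320819]
Ab = [datetime.datetime(2018, 12, 11, 14, 0),
      datetime.datetime(2018, 12, 21, 10, 0),
      datetime.datetime(2018, 12, 21, 14, 0)]
s = pd.Series(O, index=Ab)

# from_arrays zips the arrays element-by-element (length n),
# whereas from_product would build the full n*n cartesian product.
s.index = pd.MultiIndex.from_arrays([np.arange(len(s)), s.index],
                                    names=["id", "dtime"])
print(s)
```

The series keeps its values; only the index is replaced.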

Filter Dataframe with a list of time ranges

Below is a simplified version of my setup:
import pandas as pd
import datetime as dt
df_data = pd.DataFrame({'DateTime' : [dt.datetime(2017, 9, 1, 0, 0, 0),dt.datetime(2017, 9, 1, 1, 0, 0),dt.datetime(2017, 9, 1, 2, 0, 0),dt.datetime(2017, 9, 1, 3, 0, 0)], 'Data' : [1,2,3,5]})
df_timeRanges = pd.DataFrame({'startTime':[dt.datetime(2017, 8, 30, 0, 0, 0), dt.datetime(2017, 9, 1, 1, 30, 0)], 'endTime':[dt.datetime(2017, 9, 1, 0, 30, 0), dt.datetime(2017, 9, 1, 2, 30, 0)]})
print df_data
print df_timeRanges
This gives:
Data DateTime
0 1 2017-09-01 00:00:00
1 2 2017-09-01 01:00:00
2 3 2017-09-01 02:00:00
3 5 2017-09-01 03:00:00
endTime startTime
0 2017-09-01 00:30:00 2017-08-30 00:00:00
1 2017-09-01 02:30:00 2017-09-01 01:30:00
I would like to filter df_data with df_timeRanges, keeping the remaining rows in a single dataframe, something like:
df_data_filt = df_data[(df_data['DateTime'] >= df_timeRanges['startTime']) & (df_data['DateTime'] <= df_timeRanges['endTime'])]
I did not expect the above line to work, and it returned this error:
ValueError: Can only compare identically-labeled Series objects
Would anyone be able to provide some tips on this? The df_data and df_timeRanges in my real task are much bigger.
Thanks in advance
IIUIC, Use
In [794]: mask = np.logical_or.reduce([
(df_data.DateTime >= x.startTime) & (df_data.DateTime <= x.endTime)
for i, x in df_timeRanges.iterrows()])
In [795]: df_data[mask]
Out[795]:
Data DateTime
0 1 2017-09-01 00:00:00
2 3 2017-09-01 02:00:00
Or, also
In [807]: func = lambda x: (df_data.DateTime >= x.startTime) & (df_data.DateTime <= x.endTime)
In [808]: df_data[df_timeRanges.apply(func, axis=1).any()]
Out[808]:
Data DateTime
0 1 2017-09-01 00:00:00
2 3 2017-09-01 02:00:00
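An alternative sketch using pd.IntervalIndex, which avoids the explicit loop over ranges (assuming a reasonably recent pandas; closed="both" matches the >= / <= comparisons above):

```python
import datetime as dt
import pandas as pd

df_data = pd.DataFrame({
    "DateTime": [dt.datetime(2017, 9, 1, h, 0, 0) for h in range(4)],
    "Data": [1, 2, 3, 5],
})
df_timeRanges = pd.DataFrame({
    "startTime": [dt.datetime(2017, 8, 30, 0, 0, 0),
                  dt.datetime(2017, 9, 1, 1, 30, 0)],
    "endTime":   [dt.datetime(2017, 9, 1, 0, 30, 0),
                  dt.datetime(2017, 9, 1, 2, 30, 0)],
})

# One interval per (startTime, endTime) row; contains(t) tests t against
# every interval at once, and any() keeps the row if any range matches.
intervals = pd.IntervalIndex.from_arrays(df_timeRanges["startTime"],
                                         df_timeRanges["endTime"],
                                         closed="both")
mask = df_data["DateTime"].apply(lambda t: intervals.contains(t).any())
df_filt = df_data[mask]
print(df_filt)
```

For very large df_timeRanges this still scans all intervals per row, so the reduce-based mask above may be comparable; sorting plus pd.merge_asof is the usual next step when both frames are big.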

Get MM-DD-YYYY from pandas Timestamp

Dates seem to be a tricky thing in Python, and I am having a lot of trouble simply stripping the date out of a pandas Timestamp. I would like to go from 2013-09-29 02:34:44 to simply 09-29-2013.
I have a dataframe with a column Created_date:
Name: Created_Date, Length: 1162549, dtype: datetime64[ns]
I have tried applying the .date() method on this Series, eg: df.Created_Date.date(), but I get the error AttributeError: 'Series' object has no attribute 'date'
Can someone help me out?
map over the elements:
In [239]: from operator import methodcaller
In [240]: s = Series(date_range(Timestamp('now'), periods=2))
In [241]: s
Out[241]:
0 2013-10-01 00:24:16
1 2013-10-02 00:24:16
dtype: datetime64[ns]
In [238]: s.map(lambda x: x.strftime('%d-%m-%Y'))
Out[238]:
0 01-10-2013
1 02-10-2013
dtype: object
In [242]: s.map(methodcaller('strftime', '%d-%m-%Y'))
Out[242]:
0 01-10-2013
1 02-10-2013
dtype: object
You can get the raw datetime.date objects by calling the date() method of the Timestamp elements that make up the Series:
In [249]: s.map(methodcaller('date'))
Out[249]:
0 2013-10-01
1 2013-10-02
dtype: object
In [250]: s.map(methodcaller('date')).values
Out[250]:
array([datetime.date(2013, 10, 1), datetime.date(2013, 10, 2)], dtype=object)
Yet another way you can do this is by calling the unbound Timestamp.date method:
In [273]: s.map(Timestamp.date)
Out[273]:
0 2013-10-01
1 2013-10-02
dtype: object
This method is the fastest, and IMHO the most readable. Timestamp is accessible in the top-level pandas module, like so: pandas.Timestamp. I've imported it directly for expository purposes.
The date attribute of DatetimeIndex objects does something similar, but returns a numpy object array instead:
In [243]: index = DatetimeIndex(s)
In [244]: index
Out[244]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-10-01 00:24:16, 2013-10-02 00:24:16]
Length: 2, Freq: None, Timezone: None
In [246]: index.date
Out[246]:
array([datetime.date(2013, 10, 1), datetime.date(2013, 10, 2)], dtype=object)
For larger datetime64[ns] Series objects, calling Timestamp.date is faster than operator.methodcaller which is slightly faster than a lambda:
In [263]: f = methodcaller('date')
In [264]: flam = lambda x: x.date()
In [265]: fmeth = Timestamp.date
In [266]: s2 = Series(date_range('20010101', periods=1000000, freq='T'))
In [267]: s2
Out[267]:
0 2001-01-01 00:00:00
1 2001-01-01 00:01:00
2 2001-01-01 00:02:00
3 2001-01-01 00:03:00
4 2001-01-01 00:04:00
5 2001-01-01 00:05:00
6 2001-01-01 00:06:00
7 2001-01-01 00:07:00
8 2001-01-01 00:08:00
9 2001-01-01 00:09:00
10 2001-01-01 00:10:00
11 2001-01-01 00:11:00
12 2001-01-01 00:12:00
13 2001-01-01 00:13:00
14 2001-01-01 00:14:00
...
999985 2002-11-26 10:25:00
999986 2002-11-26 10:26:00
999987 2002-11-26 10:27:00
999988 2002-11-26 10:28:00
999989 2002-11-26 10:29:00
999990 2002-11-26 10:30:00
999991 2002-11-26 10:31:00
999992 2002-11-26 10:32:00
999993 2002-11-26 10:33:00
999994 2002-11-26 10:34:00
999995 2002-11-26 10:35:00
999996 2002-11-26 10:36:00
999997 2002-11-26 10:37:00
999998 2002-11-26 10:38:00
999999 2002-11-26 10:39:00
Length: 1000000, dtype: datetime64[ns]
In [269]: timeit s2.map(f)
1 loops, best of 3: 1.04 s per loop
In [270]: timeit s2.map(flam)
1 loops, best of 3: 1.1 s per loop
In [271]: timeit s2.map(fmeth)
1 loops, best of 3: 968 ms per loop
Keep in mind that one of the goals of pandas is to provide a layer on top of numpy so that (most of the time) you don't have to deal with the low level details of the ndarray. So getting the raw datetime.date objects in an array is of limited use since they don't correspond to any numpy.dtype that is supported by pandas (pandas only supports datetime64[ns] [that's nanoseconds] dtypes). That said, sometimes you need to do this.
Maybe this only came in recently, but there are built-in methods for this. Try:
In [27]: s = pd.Series(pd.date_range(pd.Timestamp('now'), periods=2))
In [28]: s
Out[28]:
0 2016-02-11 19:11:43.386016
1 2016-02-12 19:11:43.386016
dtype: datetime64[ns]
In [29]: s.dt.to_pydatetime()
Out[29]:
array([datetime.datetime(2016, 2, 11, 19, 11, 43, 386016),
datetime.datetime(2016, 2, 12, 19, 11, 43, 386016)], dtype=object)
You can try using .dt.date on datetime64[ns] of the dataframe.
For e.g. df['Created_date'] = df['Created_date'].dt.date
Input dataframe named test_df:
print(test_df)
Result:
Created_date
0 2015-03-04 15:39:16
1 2015-03-22 17:36:49
2 2015-03-25 22:08:45
3 2015-03-16 13:45:20
4 2015-03-19 18:53:50
Checking dtypes:
print(test_df.dtypes)
Result:
Created_date datetime64[ns]
dtype: object
Extracting date and updating Created_date column:
test_df['Created_date'] = test_df['Created_date'].dt.date
print(test_df)
Result:
Created_date
0 2015-03-04
1 2015-03-22
2 2015-03-25
3 2015-03-16
4 2015-03-19
Well, I would do it this way:
pdTime = pd.date_range(timeStamp, periods=len(years), freq="D")
pdTime[i].strftime('%m-%d-%Y')
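On current pandas, the whole Series can be formatted at once with the .dt accessor; %m-%d-%Y gives the MM-DD-YYYY layout the question asks for (note that several answers above use %d-%m-%Y instead):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2013-09-29 02:34:44", "2013-10-01 12:00:00"]))

# .dt.strftime formats every element; the result is a Series of strings.
formatted = s.dt.strftime("%m-%d-%Y")
print(formatted)
```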

Python iterate through month, use month in between Query

I have the following model:
class Deal(models.Model):
    start_date = models.DateTimeField()
    end_date = models.DateTimeField()
I want to iterate through a given year
year = '2010'
For each month in year I want to execute a query to see if the month is between start_date and end_date.
How can I iterate through a given year? Use the month to do a query?
SELECT * FROM deals WHERE month BETWEEN start_date AND end_date
The outcome will tell me if I had a deal in January 2010 and/or in February 2010, etc.
How can I iterate through a given year?
You could use python-dateutil's rrule. Install with command pip install python-dateutil.
Example usage:
In [1]: from datetime import datetime
In [2]: from dateutil import rrule
In [3]: list(rrule.rrule(rrule.MONTHLY, dtstart=datetime(2010, 1, 1, 0, 1), count=12))
Out[3]:
[datetime.datetime(2010, 1, 1, 0, 1),
datetime.datetime(2010, 2, 1, 0, 1),
datetime.datetime(2010, 3, 1, 0, 1),
datetime.datetime(2010, 4, 1, 0, 1),
datetime.datetime(2010, 5, 1, 0, 1),
datetime.datetime(2010, 6, 1, 0, 1),
datetime.datetime(2010, 7, 1, 0, 1),
datetime.datetime(2010, 8, 1, 0, 1),
datetime.datetime(2010, 9, 1, 0, 1),
datetime.datetime(2010, 10, 1, 0, 1),
datetime.datetime(2010, 11, 1, 0, 1),
datetime.datetime(2010, 12, 1, 0, 1)]
Use the month to do a query?
You could iterate over months like this:
In [1]: from dateutil import rrule
In [2]: from datetime import datetime
In [3]: months = list(rrule.rrule(rrule.MONTHLY, dtstart=datetime(2010, 1, 1, 0, 1), count=13))
In [4]: i = 0
In [5]: while i < len(months) - 1:
...: print "start_date", months[i], "end_date", months[i+1]
...: i += 1
...:
start_date 2010-01-01 00:01:00 end_date 2010-02-01 00:01:00
start_date 2010-02-01 00:01:00 end_date 2010-03-01 00:01:00
start_date 2010-03-01 00:01:00 end_date 2010-04-01 00:01:00
start_date 2010-04-01 00:01:00 end_date 2010-05-01 00:01:00
start_date 2010-05-01 00:01:00 end_date 2010-06-01 00:01:00
start_date 2010-06-01 00:01:00 end_date 2010-07-01 00:01:00
start_date 2010-07-01 00:01:00 end_date 2010-08-01 00:01:00
start_date 2010-08-01 00:01:00 end_date 2010-09-01 00:01:00
start_date 2010-09-01 00:01:00 end_date 2010-10-01 00:01:00
start_date 2010-10-01 00:01:00 end_date 2010-11-01 00:01:00
start_date 2010-11-01 00:01:00 end_date 2010-12-01 00:01:00
start_date 2010-12-01 00:01:00 end_date 2011-01-01 00:01:00
Replace the "print" statement with a query. Feel free to adapt it to your needs.
There is probably a better way but that could do the job.
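The same month boundaries can be generated without dateutil, using pandas. A sketch (the Django filter in the comment is hypothetical, inferred from the question's Deal model):

```python
import pandas as pd

year = 2010
# First day of each month in the year, plus the first day of the next
# year, so consecutive pairs give (start, end) month boundaries.
bounds = pd.date_range(start=f"{year}-01-01", periods=13, freq="MS")

for start, end in zip(bounds[:-1], bounds[1:]):
    # In Django this pair could feed a query such as (hypothetical):
    # Deal.objects.filter(start_date__lt=end, end_date__gte=start)
    print("start_date", start, "end_date", end)
```

The overlap condition start_date < end AND end_date >= start matches any deal active at some point during the month, which is what the BETWEEN-style query in the question is after.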
