How to convert TimeSeries object in pandas into integer? - python

I've been working with Pandas to calculate the age of a sportsman on a particular fixture, although it's returned as a TimeSeries type.
I'd now like to be able to plot age (in days) against the fixture dates, but can't work out how to turn the TimeSeries object to an integer. What can I try next?
This is the shape of the data.
squad_date['mean_age']
2008-08-16 11753 days, 0:00:00
2008-08-23 11760 days, 0:00:00
2008-08-30 11767 days, 0:00:00
2008-09-14 11782 days, 0:00:00
2008-09-20 11788 days, 0:00:00
This is what I would like:
2008-08-16 11753
2008-08-23 11760
2008-08-30 11767
2008-09-14 11782
2008-09-20 11788

For people who find this post via Google: if you have numpy >= 1.7 and pandas 0.11, these solutions will not work. What does work:
squad_date['mean_age'].apply(lambda x: x / np.timedelta64(1,'D'))
The official pandas documentation can be confusing here. It suggests calling x.item(), where x is already a timedelta object. x.item() retrieves the underlying count from the timedelta as a plain int, in whatever resolution the value is stored; if that is 'ns', you would get the number of nanoseconds, for example. Dividing that int by a timedelta then raises an error; dividing the timedeltas directly by each other does work (and the 'D' in the second operand converts the result to days).
I hope this will help someone in the future!
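For reference, here is a minimal self-contained sketch of both conversions (the data is made up, standing in for squad_date['mean_age']):
import numpy as np
import pandas as pd

# A small timedelta series standing in for squad_date['mean_age']
mean_age = pd.Series(pd.to_timedelta([11753, 11760, 11767], unit='D'),
                     index=pd.to_datetime(['2008-08-16', '2008-08-23', '2008-08-30']))

# Dividing by a one-day timedelta64 yields a float number of days
days_float = mean_age / np.timedelta64(1, 'D')

# On modern pandas, the .dt accessor exposes the integer day component directly
days_int = mean_age.dt.days
print(days_int)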

You need to be on master for this (0.11-dev):
In [40]: x = pd.date_range('20130101',periods=5)
In [41]: td = pd.Series(x,index=x)-pd.Timestamp('20130101')
In [43]: td
Out[43]:
2013-01-01 00:00:00
2013-01-02 1 days, 00:00:00
2013-01-03 2 days, 00:00:00
2013-01-04 3 days, 00:00:00
2013-01-05 4 days, 00:00:00
Freq: D, dtype: timedelta64[ns]
In [44]: td.apply(lambda x: x.item().days)
Out[44]:
2013-01-01 0
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
Freq: D, dtype: int64

The way I did it:
def conv_delta_to_int(dt):
    return int(str(dt).split(" ")[0].replace(",", ""))

squad_date['mean_age'] = list(map(conv_delta_to_int, squad_date['mean_age']))
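On current pandas versions, the string round-trip is unnecessary; a hedged one-line alternative via the .dt accessor (assuming mean_age holds timedelta64[ns] values):
# Integer days straight from the timedelta dtype, no string parsing
squad_date['mean_age'] = squad_date['mean_age'].dt.days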

remove date in pandas [duplicate]

I have a dataframe df and its first column is timedelta64
df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686 entries, 0 to 685
Data columns (total 6 columns):
0 686 non-null timedelta64[ns]
1 686 non-null object
2 686 non-null object
3 686 non-null object
4 686 non-null object
5 686 non-null object
If I print(df[0][2]), for example, it will give me 0 days 05:01:11. However, I don't want the 0 days field; I only want 05:01:11 to be printed. Could someone teach me how to do this? Thanks so much!
It is possible by slicing the string representation:
df['duration1'] = df['duration'].astype(str).str[-18:-10]
But the solution is not general: if the input is 3 days 05:01:11, it removes the 3 days as well, so it only works correctly for timedeltas of less than one day.
A more general solution is to create a custom format:
import numpy as np
import pandas as pd

N = 10
np.random.seed(11230)
rng = pd.date_range('2017-04-03 15:30:00', periods=N, freq='13.5H')
df = pd.DataFrame({'duration': np.abs(np.random.choice(rng, size=N) -
                                      np.random.choice(rng, size=N))})

df['duration1'] = df['duration'].astype(str).str[-18:-10]

def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

df['duration2'] = df['duration'].apply(f)
print(df)
duration duration1 duration2
0 2 days 06:00:00 06:00:00 54:00:00
1 2 days 19:30:00 19:30:00 67:30:00
2 1 days 03:00:00 03:00:00 27:00:00
3 0 days 00:00:00 00:00:00 0:00:00
4 4 days 12:00:00 12:00:00 108:00:00
5 1 days 03:00:00 03:00:00 27:00:00
6 0 days 13:30:00 13:30:00 13:30:00
7 1 days 16:30:00 16:30:00 40:30:00
8 0 days 00:00:00 00:00:00 0:00:00
9 1 days 16:30:00 16:30:00 40:30:00
Here's a short and robust version using apply():
df['timediff_string'] = df['timediff'].apply(
    lambda x: f'{x.components.hours:02d}:{x.components.minutes:02d}:{x.components.seconds:02d}'
    if not pd.isnull(x) else ''
)
This leverages the components attribute of pandas Timedelta objects and also handles empty values (NaT).
If the timediff column does not contain pandas Timedelta objects, you can convert it:
df['timediff'] = pd.to_timedelta(df['timediff'])
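For illustration, a minimal usage sketch on made-up data (the column names are placeholders):
import pandas as pd

df = pd.DataFrame({'timediff': pd.to_timedelta(['0 days 05:01:11',
                                                '2 days 06:00:00',
                                                None])})
df['timediff_string'] = df['timediff'].apply(
    lambda x: f'{x.components.hours:02d}:{x.components.minutes:02d}:{x.components.seconds:02d}'
    if not pd.isnull(x) else ''
)
print(df)  # the NaT row becomes ''; .components splits days out separately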
datetime.timedelta already formats the way you'd like. The crux of this issue is that pandas internally converts to numpy.timedelta64.
import pandas as pd
from datetime import timedelta
time_1 = timedelta(days=3, seconds=3400)
time_2 = timedelta(days=0, seconds=3400)
print(time_1)
print(time_2)
times = pd.Series([time_1, time_2])
# Times are converted to Numpy timedeltas.
print(times)
# Convert to string after converting to datetime.timedelta.
times = times.apply(
    lambda numpy_td: str(timedelta(seconds=numpy_td.total_seconds())))
print(times)
So, convert to a datetime.timedelta and then str (to prevent conversion back to numpy.timedelta) before printing.
3 days, 0:56:40
0:56:40
0 3 days 00:56:40
1 0 days 00:56:40
dtype: timedelta64[ns]
0 3 days, 0:56:40
1 0:56:40
dtype: object
I came here looking for answers to the same question, so I felt I should add further clarification. : )
You can convert it into a Python timedelta, then to str and finally back to a Series:
pd.Series(df["duration"].dt.to_pytimedelta().astype(str), name="start_time")
Given OP is ok with an object column (a little verbose):
def splitter(td):
    td = str(td).split(' ')[-1:][0]
    return td

df['split'] = df['timediff'].apply(splitter)
Basically we're taking the timedelta column, transforming the contents to a string, then splitting the string (creates a list) and taking the last item of that list, which would be the hh:mm:ss component.
Note that specifying ' ' for what to split by is redundant here.
Alternative one liner:
df['split2'] = df['timediff'].astype('str').str.split().str[-1]
which is very similar, but not very pretty IMHO. Also, the output includes milliseconds, which is not the case in the first solution. I'm not sure what the reason for that is (please comment if you do). If your data is big it might be worthwhile to time these different approaches.
If you want to drop all zero components (not only days), you can do it like this:
def pd_td_fmt(td):
    import pandas as pd
    abbr = {'days': 'd', 'hours': 'h', 'minutes': 'min', 'seconds': 's',
            'milliseconds': 'ms', 'microseconds': 'us', 'nanoseconds': 'ns'}
    fmt = lambda td: "".join(f"{v}{abbr[k]}" for k, v in td.components._asdict().items() if v != 0)
    if isinstance(td, pd.Timedelta):
        return fmt(td)
    elif isinstance(td, pd.TimedeltaIndex):
        return td.map(fmt)
    else:
        raise ValueError
If you can be sure that your timedelta is less than a day, this might work. To do it in as few lines as possible, I convert the timedelta to a datetime by adding the Unix epoch (timestamp 0), and then use the datetime .dt accessor to format it:
df['duration1'] = (df['duration'] + pd.to_datetime(0)).dt.strftime('%M:%S')
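A quick sketch of that trick on made-up data (assuming, as stated, every value is under one day; '%H:%M:%S' keeps the hours, while the '%M:%S' above only suits sub-hour values):
import pandas as pd

df = pd.DataFrame({'duration': pd.to_timedelta(['00:03:20', '05:01:11'])})
# Adding the Unix epoch turns each timedelta into a datetime, unlocking .dt.strftime
df['duration1'] = (df['duration'] + pd.to_datetime(0)).dt.strftime('%H:%M:%S')
print(df)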

Cumulative elapsed minutes from Pandas datetime Series

I have a column of datetime stamps. I need a column of total minutes elapsed from the first to the last value.
I have:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
>>> df
timestamp
0 2001-01-01 06:00:00
1 2001-01-01 06:01:00
2 2001-01-01 06:15:00
I need to add a column that gives the running total:
timestamp minutes
1-1-2001 6:00 0
1-1-2001 6:01 1
1-1-2001 6:15 15
1-1-2001 7:00 60
1-1-2001 7:35 95
I'm having a hard time manipulating the datetime Series in a way that lets me total up the elapsed time.
I've looked at a lot of posts and can't find anything that does what I'm trying to do. Would appreciate any ideas!
You can chain a few methods together:
>>> df['minutes'] = df['timestamp'].diff().fillna(0).dt.total_seconds()\
... .cumsum().div(60).astype(int)
>>> df
timestamp minutes
0 2001-01-01 06:00:00 0
1 2001-01-01 06:01:00 1
2 2001-01-01 06:15:00 15
Creation:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
Walkthrough
The easiest way to break this down is to separate each intermediate method call.
df['timestamp'].diff() gives you a Series of the pandas equivalent of Python's datetime.timedelta: the difference in time from each value to the next.
>>> df['timestamp'].diff()
0 NaT
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
This contains an N/A value (NaT/not a time) because there's nothing to subtract from the first value. You can simply fill it with the zero-value for timedeltas:
>>> df['timestamp'].diff().fillna(0)
0 00:00:00
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
Now you need to get an actual integer (minutes) from these objects. In .dt.total_seconds(), .dt is an "accessor" that is a way of accessing a bunch of methods that let you work with datetime-like data:
>>> df['timestamp'].diff().fillna(0).dt.total_seconds()
0 0.0
1 60.0
2 840.0
Name: timestamp, dtype: float64
The result is the incremental second-change as a float. You need this on a cumulative basis, in minutes, and as an integer. That's what the final 3 operations do:
>>> df['timestamp'].diff().fillna(0).dt.total_seconds().cumsum().div(60).astype(int)
0 0
1 1
2 15
Name: timestamp, dtype: int64
Note that astype(int) truncates toward zero rather than rounding if you have seconds that aren't fully divisible by 60.
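If you'd rather avoid the float round-trip, a hedged alternative (same df as above) is to subtract the first timestamp and floor-divide:
# Cumulative differences from the first timestamp; // floors to whole minutes
df['minutes'] = ((df['timestamp'] - df['timestamp'].iloc[0])
                 .dt.total_seconds() // 60).astype(int)
This gives the same running totals, since the cumulative sum of consecutive diffs is just the difference from the first value.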

Fast conversion of time string (Hour:Min:Sec.Millsecs) to float

I use pandas to import a csv file (about a million rows, 5 columns) that contains one column of timestamps (increasing row-by-row) in the format Hour:Min:Sec.Millsecs, e.g.
11:52:55.162
and some other columns with floats. I need to transform the timestamp column into floats (say in seconds). So far I'm using
pandas.read_csv
to get a dataframe df and then transform it into a numpy array
df=np.array(df)
All the above works great and is quite fast. However, then I use datetime.strptime (the 0th columns are the timestamps)
df[:,0]=[(datetime.strptime(str(d),'%H:%M:%S.%f')).total_seconds() for d in df[:,0]]
to transform the timestamps into seconds, and unfortunately this turns out to be very slow. It's not the iteration over all the rows that is slow;
datetime.strptime
is the bottleneck. Is there a better way to do it?
Here, using timedeltas
Create a sample series
In [21]: s = pd.to_timedelta(np.arange(100000),unit='s')
In [22]: s
Out[22]:
0 00:00:00
1 00:00:01
2 00:00:02
3 00:00:03
4 00:00:04
5 00:00:05
6 00:00:06
7 00:00:07
8 00:00:08
9 00:00:09
10 00:00:10
11 00:00:11
12 00:00:12
13 00:00:13
14 00:00:14
...
99985 1 days, 03:46:25
99986 1 days, 03:46:26
99987 1 days, 03:46:27
99988 1 days, 03:46:28
99989 1 days, 03:46:29
99990 1 days, 03:46:30
99991 1 days, 03:46:31
99992 1 days, 03:46:32
99993 1 days, 03:46:33
99994 1 days, 03:46:34
99995 1 days, 03:46:35
99996 1 days, 03:46:36
99997 1 days, 03:46:37
99998 1 days, 03:46:38
99999 1 days, 03:46:39
Length: 100000, dtype: timedelta64[ns]
Convert to string for testing purposes
In [23]: t = s.apply(pd.tslib.repr_timedelta64)
These are strings
In [24]: t.iloc[-1]
Out[24]: '1 days, 03:46:39'
Dividing by a timedelta64 converts this to seconds
In [25]: pd.to_timedelta(t.iloc[-1])/np.timedelta64(1,'s')
Out[25]: 99999.0
This is currently matching using a reg-ex, so not very fast from a string directly.
In [27]: %timeit pd.to_timedelta(t)/np.timedelta64(1,'s')
1 loops, best of 3: 1.84 s per loop
This is a datetime-based solution.
Since datetimes are already stored internally as int64, this is very easy and fast.
Create a sample series
In [7]: s = Series(date_range('20130101',periods=1000,freq='ms'))
In [8]: s
Out[8]:
0 2013-01-01 00:00:00
1 2013-01-01 00:00:00.001000
2 2013-01-01 00:00:00.002000
3 2013-01-01 00:00:00.003000
4 2013-01-01 00:00:00.004000
5 2013-01-01 00:00:00.005000
6 2013-01-01 00:00:00.006000
7 2013-01-01 00:00:00.007000
8 2013-01-01 00:00:00.008000
9 2013-01-01 00:00:00.009000
10 2013-01-01 00:00:00.010000
11 2013-01-01 00:00:00.011000
12 2013-01-01 00:00:00.012000
13 2013-01-01 00:00:00.013000
14 2013-01-01 00:00:00.014000
...
985 2013-01-01 00:00:00.985000
986 2013-01-01 00:00:00.986000
987 2013-01-01 00:00:00.987000
988 2013-01-01 00:00:00.988000
989 2013-01-01 00:00:00.989000
990 2013-01-01 00:00:00.990000
991 2013-01-01 00:00:00.991000
992 2013-01-01 00:00:00.992000
993 2013-01-01 00:00:00.993000
994 2013-01-01 00:00:00.994000
995 2013-01-01 00:00:00.995000
996 2013-01-01 00:00:00.996000
997 2013-01-01 00:00:00.997000
998 2013-01-01 00:00:00.998000
999 2013-01-01 00:00:00.999000
Length: 1000, dtype: datetime64[ns]
Convert to ns since epoch / divide to get ms since epoch (if you want seconds,
divide by 10**9)
In [9]: pd.DatetimeIndex(s).asi8/10**6
Out[9]:
array([1356998400000, 1356998400001, 1356998400002, 1356998400003,
1356998400004, 1356998400005, 1356998400006, 1356998400007,
1356998400008, 1356998400009, 1356998400010, 1356998400011,
...
1356998400992, 1356998400993, 1356998400994, 1356998400995,
1356998400996, 1356998400997, 1356998400998, 1356998400999])
Pretty fast
In [12]: s = Series(date_range('20130101',periods=1000000,freq='ms'))
In [13]: %timeit pd.DatetimeIndex(s).asi8/10**6
100 loops, best of 3: 11 ms per loop
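On recent pandas versions, a hedged equivalent that avoids the asi8 accessor (same datetime64[ns] Series s):
# datetime64[ns] values are nanoseconds since the epoch under the hood,
# so an int64 cast divided by 10**6 yields ms since the epoch
ms_since_epoch = s.astype('int64') // 10**6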
I'm guessing that the datetime object has a lot of overhead - it may be easier to do it by hand:
def to_seconds(s):
    hr, min, sec = [float(x) for x in s.split(':')]
    return hr*3600 + min*60 + sec
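A hedged vectorized variant of the same idea that works on the whole column at once (ts_col is a placeholder name for the timestamp Series):
import pandas as pd

ts_col = pd.Series(['11:52:55.162', '00:00:01.500'])
parts = ts_col.str.split(':', expand=True).astype(float)
seconds = parts[0] * 3600 + parts[1] * 60 + parts[2]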
Using sum() and enumerate() (a Python 2 session; in Python 3, wrap the map in list() to materialize it):
>>> ts = '11:52:55.162'
>>> ts1 = map(float, ts.split(':'))
>>> ts1
[11.0, 52.0, 55.162]
>>> ts2 = [60**(2-i)*n for i, n in enumerate(ts1)]
>>> ts2
[39600.0, 3120.0, 55.162]
>>> ts3 = sum(ts2)
>>> ts3
42775.162
>>> seconds = sum(60**(2-i)*n for i, n in enumerate(map(float, ts.split(':'))))
>>> seconds
42775.162
>>>

Handling monthly-binned data in pandas

I have a dataset I'm analyzing in pandas where all data is binned monthly. The data originates from a MySQL database where all dates are in the format 'YYYY-MM-01', such that, for example, all rows for October 2013 would have "2013-10-01" in the month column.
I'm currently reading the data into pandas (via a .tsv dump of the MySQL table) with
data = pd.read_table(filename,header=None,names=('uid','iid','artist','tag','date'),index_col=indexes, parse_dates='date')
This is all fine, except that any subsequent analysis in which I do monthly resampling represents dates using the end-of-month convention (i.e. data from October becomes '2013-10-31' instead of '2013-10-01'). That leads to inconsistencies where the original data has months labeled as 'YYYY-MM-01' while any resampled data has them labeled as 'YYYY-MM-31' (or '-30' or '-28', as appropriate).
My question is this: what is the easiest and/or fastest way to convert all the dates in my dataframe to the end-of-month format from the outset? Keep in mind that the date is one of several indexes in a multi-index, not a column. I think my best bet is to use a modified date_parser in my pd.read_table call that always converts the month to the end-of-month convention, but I'm not sure how to approach it.
Read your dates in exactly like you are doing.
Create some test data. I am setting the dates to the start of month, but it doesn't matter.
In [39]: df = DataFrame(np.random.randn(10,2), columns=list('AB'),
   ....:                 index=date_range('20130101', periods=10, freq='MS'))
In [40]: df
Out[40]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
Force convert them to the end-of-month in time space regardless of the day
In [41]: df.index = df.index.to_period().to_timestamp('M')
In [42]: df
Out[42]:
A B
2013-01-31 -0.553482 0.049128
2013-02-28 0.337975 -0.035897
2013-03-31 -0.394849 -1.755323
2013-04-30 -0.555638 1.903388
2013-05-31 -0.087752 1.551916
2013-06-30 1.000943 -0.361248
2013-07-31 -1.855171 -2.215276
2013-08-31 -0.582643 1.661696
2013-09-30 0.501061 -1.455171
2013-10-31 1.343630 -2.008060
Back to the start
In [43]: df.index = df.index.to_period().to_timestamp('MS')
In [44]: df
Out[44]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
You can also work with (and resample) as periods
In [45]: df.index = df.index.to_period()
In [46]: df
Out[46]:
A B
2013-01 -0.553482 0.049128
2013-02 0.337975 -0.035897
2013-03 -0.394849 -1.755323
2013-04 -0.555638 1.903388
2013-05 -0.087752 1.551916
2013-06 1.000943 -0.361248
2013-07 -1.855171 -2.215276
2013-08 -0.582643 1.661696
2013-09 0.501061 -1.455171
2013-10 1.343630 -2.008060
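For example, resampling then stays consistent in period space; a hedged sketch using the period-indexed df above:
# Quarterly means computed directly on the PeriodIndex; labels stay periods
quarterly = df.resample('Q').mean()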
Use replace() to change the day value; you can get the last day of the month using calendar.monthrange:
from datetime import date
import calendar
d = date(2000,1,1)
d = d.replace(day=calendar.monthrange(d.year, d.month)[1])
UPDATE
Here is an example for pandas.
sample file date.csv
2013-01-01, 1
2013-02-01, 2
ipython shell log.
In [27]: import pandas as pd
In [28]: from datetime import datetime, date
In [29]: import calendar
In [30]: def parse(dt):
   ....:     dt = datetime.strptime(dt, '%Y-%m-%d')
   ....:     dt = dt.replace(day=calendar.monthrange(dt.year, dt.month)[1])
   ....:     return dt.date()
   ....:
In [31]: parse('2013-01-01')
Out[31]: datetime.date(2013, 1, 31)
In [32]: r = pd.read_csv('date.csv', header=None, names=('date', 'value'), parse_dates=['date'], date_parser=parse)
In [33]: r
Out[33]:
date value
0 2013-01-31 1
1 2013-02-28 2
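For what it's worth, a hedged alternative that skips the custom parser entirely: parse normally, then snap the dates to month end with a MonthEnd offset (same date.csv as above):
import pandas as pd

r = pd.read_csv('date.csv', header=None, names=('date', 'value'),
                parse_dates=['date'])
# MonthEnd(0) rolls each date forward to the end of its month
# (dates already at month end are left alone)
r['date'] = r['date'] + pd.offsets.MonthEnd(0)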

What representation should I use in Pandas for data valid throughout an interval?

I have a series of hourly prices. Each price is valid throughout the whole 1-hour period. What is the best way to represent these prices in Pandas that would enable me to index them in arbitrary higher frequencies (such as minutes or seconds) and do arithmetic with them?
Data specifics
Sample prices might be:
>>> prices = Series(randn(5), pd.date_range('2013-01-01 12:00', periods = 5, freq='H'))
>>> prices
2013-01-01 12:00:00 -1.001692
2013-01-01 13:00:00 -1.408082
2013-01-01 14:00:00 -0.329637
2013-01-01 15:00:00 1.005882
2013-01-01 16:00:00 1.202557
Freq: H
Now, what representation should I use if I want the value at 13:37:42 (I expect it to be the same as at 13:00)?
>>> prices['2013-01-01 13:37:42']
...
KeyError: <Timestamp: 2013-01-01 13:37:42>
Resampling
I know I could resample the prices and fill in the details (ffill, right?), but that doesn't seem like a nice solution, because I have to assume the frequency I'm going to be indexing at, and it reduces readability with too many unnecessary data points.
Time spans
At first glance a PeriodIndex seems to work
>>> price_periods = prices.to_period()
>>> price_periods['2013-01-01 13:37:42']
-1.408082
But a time-spanned series doesn't offer some of the other functionality I expect from a Series. Say I have another series, amounts, that says how many items I bought at certain moments. If I wanted to calculate the total prices, I would want to multiply the two series:
>>> amounts = Series([1,2,2], pd.DatetimeIndex(['2013-01-01 13:37', '2013-01-01 13:57', '2013-01-01 14:05']))
>>> amounts*price_periods
but that yields an exception and sometimes even freezes my IPy Notebook. Indexing doesn't help either.
>>> ts_periods[amounts.index]
Are PeriodIndex structures still a work in progress, or are these features not going to be added? Is there maybe some other structure I should have used (or should use for now, before PeriodIndex matures)? I'm using pandas version 0.9.0.dev-1e68fd9.
Check asof
prices.asof('2013-01-01 13:37:42')
returns the value for the previous available datetime:
prices['2013-01-01 13:00:00']
To make calculations, you can use:
prices.asof(amounts.index) * amounts
which returns a Series with amounts' index and the respective values:
>>> prices
2013-01-01 12:00:00 0.943607
2013-01-01 13:00:00 -1.019452
2013-01-01 14:00:00 -0.279136
2013-01-01 15:00:00 1.013548
2013-01-01 16:00:00 0.929920
>>> prices.asof(amounts.index)
2013-01-01 13:37:00 -1.019452
2013-01-01 13:57:00 -1.019452
2013-01-01 14:05:00 -0.279136
>>> prices.asof(amounts.index) * amounts
2013-01-01 13:37:00 -1.019452
2013-01-01 13:57:00 -2.038904
2013-01-01 14:05:00 -0.558272
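A hedged alternative that produces the same alignment by reindexing with a forward-fill (same prices and amounts as above):
# Forward-fill the hourly prices onto the moments where amounts exist
cost = prices.reindex(amounts.index, method='ffill') * amounts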
