I use pandas to import a csv file (about a million rows, 5 columns) that contains one column of timestamps (increasing row-by-row) in the format Hour:Min:Sec.Millisecs, e.g.
11:52:55.162
and some other columns with floats. I need to transform the timestamp column into floats (say in seconds). So far I'm using
pandas.read_csv
to get a dataframe df and then transform it into a numpy array
df=np.array(df)
All the above works great and is quite fast. However, I then use datetime.strptime (the 0th column holds the timestamps):
df[:,0] = [(datetime.strptime(str(d), '%H:%M:%S.%f') - datetime(1900, 1, 1)).total_seconds() for d in df[:,0]]
# strptime defaults the date to 1900-01-01; subtracting it leaves a timedelta, which has total_seconds()
to transform the timestamps into seconds, and unfortunately this turns out to be very slow. It's not the iteration over all the rows that is slow;
datetime.strptime
is the bottleneck. Is there a better way to do it?
Here's an approach using timedeltas.
Create a sample series
In [21]: s = pd.to_timedelta(np.arange(100000),unit='s')
In [22]: s
Out[22]:
0 00:00:00
1 00:00:01
2 00:00:02
3 00:00:03
4 00:00:04
5 00:00:05
6 00:00:06
7 00:00:07
8 00:00:08
9 00:00:09
10 00:00:10
11 00:00:11
12 00:00:12
13 00:00:13
14 00:00:14
...
99985 1 days, 03:46:25
99986 1 days, 03:46:26
99987 1 days, 03:46:27
99988 1 days, 03:46:28
99989 1 days, 03:46:29
99990 1 days, 03:46:30
99991 1 days, 03:46:31
99992 1 days, 03:46:32
99993 1 days, 03:46:33
99994 1 days, 03:46:34
99995 1 days, 03:46:35
99996 1 days, 03:46:36
99997 1 days, 03:46:37
99998 1 days, 03:46:38
99999 1 days, 03:46:39
Length: 100000, dtype: timedelta64[ns]
Convert to string for testing purposes
In [23]: t = s.apply(pd.tslib.repr_timedelta64)
These are strings
In [24]: t.iloc[-1]
Out[24]: '1 days, 03:46:39'
Dividing by a timedelta64 converts this to seconds
In [25]: pd.to_timedelta(t.iloc[-1])/np.timedelta64(1,'s')
Out[25]: 99999.0
Parsing from strings currently goes through a regex, so it's not very fast when starting from strings directly.
In [27]: %timeit pd.to_timedelta(t)/np.timedelta64(1,'s')
1 loops, best of 3: 1.84 s per loop
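On a reasonably recent pandas, the whole conversion from the question's strings is a one-liner via the .dt accessor (a sketch; the frame and column name ts are hypothetical):
import pandas as pd
df = pd.DataFrame({'ts': ['11:52:55.162', '11:52:56.162']})
# to_timedelta parses 'H:M:S.f' strings; total_seconds() returns floats
df['seconds'] = pd.to_timedelta(df['ts']).dt.total_seconds()
print(df['seconds'].tolist())  # [42775.162, 42776.162]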
Here is a datetime-stamp based solution.
Since datetimes are already stored as int64s, this is very easy and fast.
Create a sample series
In [7]: s = Series(date_range('20130101',periods=1000,freq='ms'))
In [8]: s
Out[8]:
0 2013-01-01 00:00:00
1 2013-01-01 00:00:00.001000
2 2013-01-01 00:00:00.002000
3 2013-01-01 00:00:00.003000
4 2013-01-01 00:00:00.004000
5 2013-01-01 00:00:00.005000
6 2013-01-01 00:00:00.006000
7 2013-01-01 00:00:00.007000
8 2013-01-01 00:00:00.008000
9 2013-01-01 00:00:00.009000
10 2013-01-01 00:00:00.010000
11 2013-01-01 00:00:00.011000
12 2013-01-01 00:00:00.012000
13 2013-01-01 00:00:00.013000
14 2013-01-01 00:00:00.014000
...
985 2013-01-01 00:00:00.985000
986 2013-01-01 00:00:00.986000
987 2013-01-01 00:00:00.987000
988 2013-01-01 00:00:00.988000
989 2013-01-01 00:00:00.989000
990 2013-01-01 00:00:00.990000
991 2013-01-01 00:00:00.991000
992 2013-01-01 00:00:00.992000
993 2013-01-01 00:00:00.993000
994 2013-01-01 00:00:00.994000
995 2013-01-01 00:00:00.995000
996 2013-01-01 00:00:00.996000
997 2013-01-01 00:00:00.997000
998 2013-01-01 00:00:00.998000
999 2013-01-01 00:00:00.999000
Length: 1000, dtype: datetime64[ns]
Convert to ns since epoch / divide to get ms since epoch (if you want seconds,
divide by 10**9)
In [9]: pd.DatetimeIndex(s).asi8/10**6
Out[9]:
array([1356998400000, 1356998400001, 1356998400002, 1356998400003,
1356998400004, 1356998400005, 1356998400006, 1356998400007,
1356998400008, 1356998400009, 1356998400010, 1356998400011,
...
1356998400992, 1356998400993, 1356998400994, 1356998400995,
1356998400996, 1356998400997, 1356998400998, 1356998400999])
Pretty fast
In [12]: s = Series(date_range('20130101',periods=1000000,freq='ms'))
In [13]: %timeit pd.DatetimeIndex(s).asi8/10**6
100 loops, best of 3: 11 ms per loop
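The same integers are reachable without building a DatetimeIndex: datetime64[ns] values are nanoseconds since the epoch, so an int64 view plus integer division rescales them (a sketch):
import pandas as pd
s = pd.Series(pd.date_range('20130101', periods=5, freq='ms'))
ms_since_epoch = s.astype('int64') // 10**6  # ns since epoch -> ms
print(ms_since_epoch.tolist())  # [1356998400000, 1356998400001, 1356998400002, 1356998400003, 1356998400004]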
I'm guessing that the datetime object has a lot of overhead - it may be easier to do it by hand:
def to_seconds(s):
    # 'H:M:S.f' -> weighted sum of the three fields (names chosen to avoid shadowing the built-in min)
    hours, minutes, seconds = [float(x) for x in s.split(':')]
    return hours * 3600 + minutes * 60 + seconds
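A usage sketch (the frame df and the column name ts are hypothetical):
df['seconds'] = df['ts'].apply(to_seconds)
# or, on the numpy array from the question: [to_seconds(d) for d in arr[:, 0]]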
Using sum() and enumerate():
>>> ts = '11:52:55.162'
>>> ts1 = map(float, ts.split(':'))
>>> ts1
[11.0, 52.0, 55.162]
>>> ts2 = [60**(2-i)*n for i, n in enumerate(ts1)]
>>> ts2
[39600.0, 3120.0, 55.162]
>>> ts3 = sum(ts2)
>>> ts3
42775.162
>>> seconds = sum(60**(2-i)*n for i, n in enumerate(map(float, ts.split(':'))))
>>> seconds
42775.162
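The same arithmetic can also be vectorized with pandas string methods, avoiding the per-row Python loop entirely (a sketch; the frame and column name ts are hypothetical):
import pandas as pd
df = pd.DataFrame({'ts': ['11:52:55.162', '11:52:56.162']})
parts = df['ts'].str.split(':', expand=True).astype(float)  # three float columns: h, m, s
df['seconds'] = parts[0] * 3600 + parts[1] * 60 + parts[2]
print(df['seconds'].tolist())  # [42775.162, 42776.162]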
I've got a pandas dataframe organized by date that I'm trying to split up by year (using a column called 'year'). I want to return one dataframe per year, with a name something like "df19XX".
I was hoping to write a "for" loop that can handle this... something like...
for d in [1980, 1981, 1982]:
    df(d) = df[df['year']==d]
... which would return three data frames called df1980, df1981 and df1982.
thanks!
Something like this? Also using @Andy's df:
variables = locals()
for i in [2012, 2013]:
    variables["df{0}".format(i)] = df.loc[df.date.dt.year == i]
df2012
Out[118]:
A date
0 0.881468 2012-12-28
1 0.237672 2012-12-29
2 0.992287 2012-12-30
3 0.194288 2012-12-31
df2013
Out[119]:
A date
4 0.151854 2013-01-01
5 0.855312 2013-01-02
6 0.534075 2013-01-03
You can iterate through the groupby:
In [11]: df = pd.DataFrame({"date": pd.date_range("2012-12-28", "2013-01-03"), "A": np.random.rand(7)})
In [12]: df
Out[12]:
A date
0 0.434715 2012-12-28
1 0.208877 2012-12-29
2 0.912897 2012-12-30
3 0.226368 2012-12-31
4 0.100489 2013-01-01
5 0.474088 2013-01-02
6 0.348368 2013-01-03
In [13]: g = df.groupby(df.date.dt.year)
In [14]: for k, v in g:
...: print(k)
...: print(v)
...: print()
...:
2012
A date
0 0.434715 2012-12-28
1 0.208877 2012-12-29
2 0.912897 2012-12-30
3 0.226368 2012-12-31
2013
A date
4 0.100489 2013-01-01
5 0.474088 2013-01-02
6 0.348368 2013-01-03
I would strongly argue that it is preferable to simply have a dict rather than creating variables by messing around with the locals() dictionary (I'd claim that using locals() like that is not "pythonic"):
In [14]: {k: grp for k, grp in g}
Out[14]:
{2012: A date
0 0.434715 2012-12-28
1 0.208877 2012-12-29
2 0.912897 2012-12-30
3 0.226368 2012-12-31, 2013: A date
4 0.100489 2013-01-01
5 0.474088 2013-01-02
6 0.348368 2013-01-03}
Though you might consider calculating this on the fly (rather than storing in a dict or indeed a variable). You can use get_group:
In [15]: g.get_group(2012)
Out[15]:
A date
0 0.865239 2012-12-28
1 0.019071 2012-12-29
2 0.362088 2012-12-30
3 0.031861 2012-12-31
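And if the end goal is one artifact per year (e.g. a file each) rather than named variables, iterating the groupby sidesteps the naming problem entirely (a sketch; the filename pattern is hypothetical):
for year, grp in df.groupby(df.date.dt.year):
    grp.to_csv('df{}.csv'.format(year), index=False)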
I am writing a process which takes a semi-large file as input (~4 million rows, 5 columns)
and performs a few operations on it.
Columns:
- CARD_NO
- ID
- CREATED_DATE
- STATUS
- FLAG2
I need to create a file which contains 1 copy of each CARD_NO where STATUS = '1' and CREATED_DATE is the maximum of all CREATED_DATEs for that CARD_NO.
I succeeded but my solution is very slow (3h and counting as of right now.)
Here is my code:
file = 'input.csv'
input = pd.read_csv(file)
input = input.drop_duplicates()
card_groups = input.groupby('CARD_NO', as_index=False, sort=False).filter(lambda x: (x['STATUS'] == 1).any())  # filter expects a scalar per group, not a Series
def important(x):
    latest_date = x['CREATED_DATE'].values[x['CREATED_DATE'].values.argmax()]
    return x[x.CREATED_DATE == latest_date]
#where the major slowdown occurs
group_2 = card_groups.groupby('CARD_NO', as_index=False, sort=False).apply(important)
path = 'result.csv'
group_2.to_csv(path, sep=',', index=False)
# ~4 minutes for the 154k rows file
# 3+ hours for ~4m rows
I was wondering if you had any advice on how to improve the running time of this little process.
Thank you and have a good day.
Setup (FYI: make sure that you use parse_dates=True when reading your csv)
In [6]: n_groups = 10000
In [7]: N = 4000000
In [8]: dates = date_range('20130101',periods=100)
In [9]: df = DataFrame(dict(id = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))
In [10]: pd.set_option('max_rows',10)
In [13]: df = DataFrame(dict(card_no = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))
In [14]: df
Out[14]:
card_no date status
0 5790 2013-02-11 6
1 6572 2013-03-17 6
2 7764 2013-02-06 3
3 4905 2013-04-01 3
4 3871 2013-04-08 1
... ... ... ...
3999995 1891 2013-02-16 5
3999996 9048 2013-01-11 9
3999997 1443 2013-02-23 1
3999998 2845 2013-01-28 0
3999999 5645 2013-02-05 8
[4000000 rows x 3 columns]
In [15]: df.dtypes
Out[15]:
card_no int64
date datetime64[ns]
status int64
dtype: object
Take only status == 1, group by card_no, then return the max date for each group
In [18]: df[df.status==1].groupby('card_no')['date'].max()
Out[18]:
card_no
0 2013-04-06
1 2013-03-30
2 2013-04-09
...
9997 2013-04-07
9998 2013-04-07
9999 2013-04-09
Name: date, Length: 10000, dtype: datetime64[ns]
In [19]: %timeit df[df.status==1].groupby('card_no')['date'].max()
1 loops, best of 3: 934 ms per loop
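If, as in the original question, you need the full rows at those max dates (one per card_no) rather than just the dates, a vectorized alternative to the groupby-apply is idxmax (a sketch using the sample frame above; ties keep the first row):
sub = df[df.status == 1]
latest = sub.loc[sub.groupby('card_no')['date'].idxmax()]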
If you need a transform of this (i.e. the group's max repeated for each row in the group), note that with pandas < 0.14.1 (releasing this week) this will be pretty slow:
In [20]: df[df.status==1].groupby('card_no')['date'].transform('max')
Out[20]:
4 2013-04-10
13 2013-04-10
25 2013-04-10
...
3999973 2013-04-10
3999979 2013-04-10
3999997 2013-04-09
Name: date, Length: 399724, dtype: datetime64[ns]
In [21]: %timeit df[df.status==1].groupby('card_no')['date'].transform('max')
1 loops, best of 3: 1.8 s per loop
I suspect you probably want to merge the final transform back into the original frame (res here being the transform result from In [20]):
In [24]: df.join(res.to_frame('max_date'))
Out[24]:
card_no date status max_date
0 5790 2013-02-11 6 NaT
1 6572 2013-03-17 6 NaT
2 7764 2013-02-06 3 NaT
3 4905 2013-04-01 3 NaT
4 3871 2013-04-08 1 2013-04-10
... ... ... ... ...
3999995 1891 2013-02-16 5 NaT
3999996 9048 2013-01-11 9 NaT
3999997 1443 2013-02-23 1 2013-04-09
3999998 2845 2013-01-28 0 NaT
3999999 5645 2013-02-05 8 NaT
[4000000 rows x 4 columns]
In [25]: %timeit df.join(res.to_frame('max_date'))
10 loops, best of 3: 58.8 ms per loop
The csv writing will actually take a fair amount of time relative to this. I use HDF5 for things like this; it is MUCH faster.
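A sketch of the HDF5 round trip (assumes the optional PyTables dependency is installed; the path and key are hypothetical):
group_2.to_hdf('results.h5', 'results', mode='w')  # write
result = pd.read_hdf('results.h5', 'results')      # read back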
Dates seem to be a tricky thing in Python, and I am having a lot of trouble simply stripping the date out of a pandas Timestamp. I would like to get from 2013-09-29 02:34:44 to simply 09-29-2013.
I have a dataframe with a column Created_date:
Name: Created_Date, Length: 1162549, dtype: datetime64[ns]
I have tried applying the .date() method on this Series, e.g. df.Created_Date.date(), but I get the error AttributeError: 'Series' object has no attribute 'date'
Can someone help me out?
map over the elements:
In [239]: from operator import methodcaller
In [240]: s = Series(date_range(Timestamp('now'), periods=2))
In [241]: s
Out[241]:
0 2013-10-01 00:24:16
1 2013-10-02 00:24:16
dtype: datetime64[ns]
In [238]: s.map(lambda x: x.strftime('%d-%m-%Y'))
Out[238]:
0 01-10-2013
1 02-10-2013
dtype: object
In [242]: s.map(methodcaller('strftime', '%d-%m-%Y'))
Out[242]:
0 01-10-2013
1 02-10-2013
dtype: object
You can get the raw datetime.date objects by calling the date() method of the Timestamp elements that make up the Series:
In [249]: s.map(methodcaller('date'))
Out[249]:
0 2013-10-01
1 2013-10-02
dtype: object
In [250]: s.map(methodcaller('date')).values
Out[250]:
array([datetime.date(2013, 10, 1), datetime.date(2013, 10, 2)], dtype=object)
Yet another way you can do this is by calling the unbound Timestamp.date method:
In [273]: s.map(Timestamp.date)
Out[273]:
0 2013-10-01
1 2013-10-02
dtype: object
This method is the fastest, and IMHO the most readable. Timestamp is accessible in the top-level pandas module, like so: pandas.Timestamp. I've imported it directly for expository purposes.
The date attribute of DatetimeIndex objects does something similar, but returns a numpy object array instead:
In [243]: index = DatetimeIndex(s)
In [244]: index
Out[244]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-10-01 00:24:16, 2013-10-02 00:24:16]
Length: 2, Freq: None, Timezone: None
In [246]: index.date
Out[246]:
array([datetime.date(2013, 10, 1), datetime.date(2013, 10, 2)], dtype=object)
For larger datetime64[ns] Series objects, calling Timestamp.date is faster than operator.methodcaller which is slightly faster than a lambda:
In [263]: f = methodcaller('date')
In [264]: flam = lambda x: x.date()
In [265]: fmeth = Timestamp.date
In [266]: s2 = Series(date_range('20010101', periods=1000000, freq='T'))
In [267]: s2
Out[267]:
0 2001-01-01 00:00:00
1 2001-01-01 00:01:00
2 2001-01-01 00:02:00
3 2001-01-01 00:03:00
4 2001-01-01 00:04:00
5 2001-01-01 00:05:00
6 2001-01-01 00:06:00
7 2001-01-01 00:07:00
8 2001-01-01 00:08:00
9 2001-01-01 00:09:00
10 2001-01-01 00:10:00
11 2001-01-01 00:11:00
12 2001-01-01 00:12:00
13 2001-01-01 00:13:00
14 2001-01-01 00:14:00
...
999985 2002-11-26 10:25:00
999986 2002-11-26 10:26:00
999987 2002-11-26 10:27:00
999988 2002-11-26 10:28:00
999989 2002-11-26 10:29:00
999990 2002-11-26 10:30:00
999991 2002-11-26 10:31:00
999992 2002-11-26 10:32:00
999993 2002-11-26 10:33:00
999994 2002-11-26 10:34:00
999995 2002-11-26 10:35:00
999996 2002-11-26 10:36:00
999997 2002-11-26 10:37:00
999998 2002-11-26 10:38:00
999999 2002-11-26 10:39:00
Length: 1000000, dtype: datetime64[ns]
In [269]: timeit s2.map(f)
1 loops, best of 3: 1.04 s per loop
In [270]: timeit s2.map(flam)
1 loops, best of 3: 1.1 s per loop
In [271]: timeit s2.map(fmeth)
1 loops, best of 3: 968 ms per loop
Keep in mind that one of the goals of pandas is to provide a layer on top of numpy so that (most of the time) you don't have to deal with the low level details of the ndarray. So getting the raw datetime.date objects in an array is of limited use since they don't correspond to any numpy.dtype that is supported by pandas (pandas only supports datetime64[ns] [that's nanoseconds] dtypes). That said, sometimes you need to do this.
Maybe this only came in recently, but there are built-in methods for this. Try:
In [27]: s = pd.Series(pd.date_range(pd.Timestamp('now'), periods=2))
In [28]: s
Out[28]:
0 2016-02-11 19:11:43.386016
1 2016-02-12 19:11:43.386016
dtype: datetime64[ns]
In [29]: s.dt.to_pydatetime()
Out[29]:
array([datetime.datetime(2016, 2, 11, 19, 11, 43, 386016),
datetime.datetime(2016, 2, 12, 19, 11, 43, 386016)], dtype=object)
You can try using .dt.date on the datetime64[ns] column of the dataframe.
For example: df['Created_date'] = df['Created_date'].dt.date
Input dataframe named as test_df:
print(test_df)
Result:
Created_date
0 2015-03-04 15:39:16
1 2015-03-22 17:36:49
2 2015-03-25 22:08:45
3 2015-03-16 13:45:20
4 2015-03-19 18:53:50
Checking dtypes:
print(test_df.dtypes)
Result:
Created_date datetime64[ns]
dtype: object
Extracting date and updating Created_date column:
test_df['Created_date'] = test_df['Created_date'].dt.date
print(test_df)
Result:
Created_date
0 2015-03-04
1 2015-03-22
2 2015-03-25
3 2015-03-16
4 2015-03-19
Well, I would do it this way:
pdTime = pd.date_range(timeStamp, periods=len(years), freq="D")
pdTime[i].strftime('%m-%d-%Y')
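Since timeStamp, years, and i are not defined in the answer, here is a self-contained sketch with hypothetical placeholder values:
import pandas as pd
timeStamp = pd.Timestamp('2013-09-29')  # hypothetical start date
years = range(5)                        # hypothetical length source
pdTime = pd.date_range(timeStamp, periods=len(years), freq='D')
print(pdTime[0].strftime('%m-%d-%Y'))   # 09-29-2013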
I am using Pandas Timegrouper to group datapoints in a pandas dataframe in python:
grouped = data.groupby(pd.TimeGrouper('30S'))
I would like to know if there's a way to achieve window overlap, as suggested in this question: Window overlap in Pandas, while keeping the pandas dataframe as the data structure.
Update: tested timing of the three solutions proposed below and the rolling mean seems faster:
%timeit df.groupby(pd.TimeGrouper('30s',closed='right')).mean()
%timeit df.resample('30s',how='mean',closed='right')
%timeit pd.rolling_mean(df,window=30).iloc[29::30]
yields:
1000 loops, best of 3: 336 µs per loop
1000 loops, best of 3: 349 µs per loop
1000 loops, best of 3: 199 µs per loop
Create some data exactly 3 x 30s long
In [51]: df = DataFrame(randn(90,2),columns=list('AB'),index=date_range('20130101 9:01:01',freq='s',periods=90))
Using a TimeGrouper in this way is equivalent to resample (and that's what resample actually does).
Note that I used closed='right' to make sure that exactly 30 observations are included in each bin.
In [57]: df.groupby(pd.TimeGrouper('30s',closed='right')).mean()
Out[57]:
A B
2013-01-01 09:01:00 -0.214968 -0.162200
2013-01-01 09:01:30 -0.090708 -0.021484
2013-01-01 09:02:00 -0.160335 -0.135074
In [52]: df.resample('30s',how='mean',closed='right')
Out[52]:
A B
2013-01-01 09:01:00 -0.214968 -0.162200
2013-01-01 09:01:30 -0.090708 -0.021484
2013-01-01 09:02:00 -0.160335 -0.135074
This is also equivalent if you then pick out the 30s intervals
In [55]: pd.rolling_mean(df,window=30).iloc[28:40]
Out[55]:
A B
2013-01-01 09:01:29 NaN NaN
2013-01-01 09:01:30 -0.214968 -0.162200
2013-01-01 09:01:31 -0.150401 -0.180492
2013-01-01 09:01:32 -0.160755 -0.142534
2013-01-01 09:01:33 -0.114918 -0.181424
2013-01-01 09:01:34 -0.098945 -0.221110
2013-01-01 09:01:35 -0.052450 -0.169884
2013-01-01 09:01:36 -0.011172 -0.185132
2013-01-01 09:01:37 0.100843 -0.178179
2013-01-01 09:01:38 0.062554 -0.097637
2013-01-01 09:01:39 0.048834 -0.065808
2013-01-01 09:01:40 0.003585 -0.059181
So depending on what you want to achieve, it's easy to do an overlap by using rolling_mean and then picking out whatever 'frequency' you want. E.g. here is a 5s sampling of a 30s rolling window.
In [61]: pd.rolling_mean(df,window=30)[9::5]
Out[61]:
A B
2013-01-01 09:01:10 NaN NaN
2013-01-01 09:01:15 NaN NaN
2013-01-01 09:01:20 NaN NaN
2013-01-01 09:01:25 NaN NaN
2013-01-01 09:01:30 -0.214968 -0.162200
2013-01-01 09:01:35 -0.052450 -0.169884
2013-01-01 09:01:40 0.003585 -0.059181
2013-01-01 09:01:45 -0.055886 -0.111228
2013-01-01 09:01:50 -0.110191 -0.045032
2013-01-01 09:01:55 0.093662 -0.036177
2013-01-01 09:02:00 -0.090708 -0.021484
2013-01-01 09:02:05 -0.286759 0.020365
2013-01-01 09:02:10 -0.273221 -0.073886
2013-01-01 09:02:15 -0.222720 -0.038865
2013-01-01 09:02:20 -0.175630 0.001389
2013-01-01 09:02:25 -0.301671 -0.025603
2013-01-01 09:02:30 -0.160335 -0.135074
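On modern pandas (0.18+), pd.rolling_mean and resample's how= argument are gone; the same results come from the method-based APIs, as a sketch:
import pandas as pd
from numpy.random import randn
df = pd.DataFrame(randn(90, 2), columns=list('AB'),
                  index=pd.date_range('20130101 9:01:01', freq='s', periods=90))
thirty = df.resample('30s', closed='right').mean()                    # non-overlapping 30s means
grouped = df.groupby(pd.Grouper(freq='30s', closed='right')).mean()   # TimeGrouper's replacement
overlap = df.rolling(30).mean().iloc[9::5]                            # 30-row window sampled every 5 rows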
I've been working with pandas to calculate the age of a sportsman on a particular fixture date, although it's returned as a TimeSeries type.
I'd now like to be able to plot age (in days) against the fixture dates, but can't work out how to turn the TimeSeries object into integers. What can I try next?
This is the shape of the data.
squad_date['mean_age']
2008-08-16 11753 days, 0:00:00
2008-08-23 11760 days, 0:00:00
2008-08-30 11767 days, 0:00:00
2008-09-14 11782 days, 0:00:00
2008-09-20 11788 days, 0:00:00
This is what I would like:
2008-08-16 11753
2008-08-23 11760
2008-08-30 11767
2008-09-14 11782
2008-09-20 11788
For people who find this post via Google: if you have numpy >= 1.7 and pandas 0.11, these solutions will not work. What does work:
squad_date['mean_age'].apply(lambda x: x / np.timedelta64(1,'D'))
The official pandas documentation can be confusing here. It suggests doing x.item(), where x is already a timedelta object.
x.item() retrieves the underlying value from the timedelta object as an int; if the unit is 'ns', for example, you get the number of nanoseconds. Dividing that int by a timedelta then gives an "integer divide by a timedelta" error, whereas dividing the timedeltas directly by each other does work (and the 'D' in the second operand converts the result to days).
I hope this will help someone in the future!
you need to be on master for this (0.11-dev)
In [40]: x = pd.date_range('20130101',periods=5)
In [41]: td = pd.Series(x,index=x)-pd.Timestamp('20130101')
In [43]: td
Out[43]:
2013-01-01 00:00:00
2013-01-02 1 days, 00:00:00
2013-01-03 2 days, 00:00:00
2013-01-04 3 days, 00:00:00
2013-01-05 4 days, 00:00:00
Freq: D, Dtype: timedelta64[ns]
In [44]: td.apply(lambda x: x.item().days)
Out[44]:
2013-01-01 0
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
Freq: D, Dtype: int64
The way I did it:
def conv_delta_to_int(dt):
    # take the leading day count from the string repr, e.g. '11753 days, 0:00:00' -> 11753
    return int(str(dt).split(" ")[0].replace(",", ""))

squad_date['mean_age'] = map(conv_delta_to_int, squad_date['mean_age'])  # on Python 3, wrap the map in list(...)
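On modern pandas, the .dt accessor on a timedelta64[ns] Series does this directly, with no string parsing or element-wise apply (a sketch; .dt on timedeltas arrived around 0.15, .dt.total_seconds() slightly later):
import pandas as pd
td = pd.Series(pd.to_timedelta([11753, 11760, 11767], unit='D'))
days = td.dt.days             # int64: 11753, 11760, 11767
secs = td.dt.total_seconds()  # float64, if seconds are wanted instead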