Convert Pandas object to multiple columns - python

I have imported the following data within a CSV file:
01/01/2014 00:00:00, 50.031
01/01/2014 00:00:01, 50.026
01/01/2014 00:00:02, 50.019
01/01/2014 00:00:03, 50.008
etc
I have successfully converted the "object" in the first column to a datetime using:
df= pd.read_csv("myfile.csv",names=['DateTime','Freq'])
df['DateTime'] = pd.to_datetime(df['DateTime'], coerce=True)
The problem is, it's a very big CSV file (35 million rows) and it's dog slow. Is there a more efficient way of converting the first column to datetime?
I would also like to split the date and the time into separate columns.

Yes, you can do it in the read_csv() call itself: use the parse_dates argument and pass it the list of columns to parse as dates. Example -
df= pd.read_csv("myfile.csv",names=['DateTime','Freq'],parse_dates=['DateTime'])
Demo -
In [41]: import io
In [42]: s = """Date, SomeNum
....: 01/01/2014 00:00:00, 50.031
....: 01/01/2014 00:00:01, 50.026
....: 01/01/2014 00:00:02, 50.019
....: 01/01/2014 00:00:03, 50.008"""
In [43]: df = pd.read_csv(io.StringIO(s),parse_dates=['Date'])
In [44]: df
Out[44]:
Date SomeNum
0 2014-01-01 00:00:00 50.031
1 2014-01-01 00:00:01 50.026
2 2014-01-01 00:00:02 50.019
3 2014-01-01 00:00:03 50.008
In [45]: df['Date']
Out[45]:
0 2014-01-01 00:00:00
1 2014-01-01 00:00:01
2 2014-01-01 00:00:02
3 2014-01-01 00:00:03
Name: Date, dtype: datetime64[ns]
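One caveat (my addition, not part of the original answer): the question's timestamps are day-first (func1 below uses the format '%d/%m/%Y %H:%M:%S'), and parse_dates alone may read a string like 01/02/2014 as January 2nd. Passing dayfirst=True makes the day-first format explicit:
df = pd.read_csv("myfile.csv", names=['DateTime', 'Freq'],
                 parse_dates=['DateTime'], dayfirst=True)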
Timing results of the different methods for a CSV with 1 million records -
In [92]: def func1():
   ....:     df = pd.read_csv('a.csv', names=['DateTime','Freq'])
   ....:     df['DateTime'] = pd.to_datetime(df['DateTime'], coerce=True, format='%d/%m/%Y %H:%M:%S')
   ....:     return df
   ....:
In [96]: def func2():
   ....:     return pd.read_csv('a.csv', names=['DateTime','Freq'], parse_dates=['DateTime'])
   ....:
In [97]: %timeit func1()
1 loops, best of 3: 6.5 s per loop
In [98]: %timeit func2()
1 loops, best of 3: 652 ms per loop
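For the second part of the question (splitting the date and the time into separate columns), a minimal sketch using the .dt accessor once DateTime is parsed; the new column names here are my own:
df['Date'] = df['DateTime'].dt.date   # datetime.date objects
df['Time'] = df['DateTime'].dt.time   # datetime.time objects
If plain string labels are enough, df['DateTime'].dt.strftime('%Y-%m-%d') and .dt.strftime('%H:%M:%S') avoid the Python objects.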

Related

Extract day and month from a datetime object

I have a column with dates in string format '2017-01-01'. Is there a way to extract day and month from it using pandas?
I have converted the column to datetime dtype but haven't figured out the latter part:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df.dtypes:
Date datetime64[ns]
print(df)
Date
0 2017-05-11
1 2017-05-12
2 2017-05-13
With dt.day and dt.month --- see Series.dt:
df = pd.DataFrame({'date':pd.date_range(start='2017-01-01',periods=5)})
df.date.dt.month
Out[164]:
0 1
1 1
2 1
3 1
4 1
Name: date, dtype: int64
df.date.dt.day
Out[165]:
0 1
1 2
2 3
3 4
4 5
Name: date, dtype: int64
You can also do it with dt.strftime:
df.date.dt.strftime('%m')
Out[166]:
0 01
1 01
2 01
3 01
4 01
Name: date, dtype: object
A simple form:
df['MM-DD'] = df['date'].dt.strftime('%m-%d')
Use dt to get the datetime attributes of the column.
In [60]: df = pd.DataFrame({'date': [datetime.datetime(2018,1,1),datetime.datetime(2018,1,2),datetime.datetime(2018,1,3),]})
In [61]: df
Out[61]:
date
0 2018-01-01
1 2018-01-02
2 2018-01-03
In [63]: df['day'] = df.date.dt.day
In [64]: df['month'] = df.date.dt.month
In [65]: df
Out[65]:
date day month
0 2018-01-01 1 1
1 2018-01-02 2 1
2 2018-01-03 3 1
Timing the methods provided:
Using apply:
In [217]: %timeit(df['date'].apply(lambda d: d.day))
The slowest run took 33.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 210 µs per loop
Using dt.date:
In [218]: %timeit(df.date.dt.day)
10000 loops, best of 3: 127 µs per loop
Using dt.strftime:
In [219]: %timeit(df.date.dt.strftime('%d'))
The slowest run took 40.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 284 µs per loop
We can see that dt.day is the fastest.
This should do it:
df['day'] = df['Date'].apply(lambda r: r.day)
df['month'] = df['Date'].apply(lambda r: r.month)

Modify hour in datetimeindex in pandas dataframe

I have a dataframe that looks like this:
master.head(5)
Out[73]:
hour price
day
2014-01-01 0 1066.24
2014-01-01 1 1032.11
2014-01-01 2 1028.53
2014-01-01 3 963.57
2014-01-01 4 890.65
In [74]: master.index.dtype
Out[74]: dtype('<M8[ns]')
What I need to do is update the hour in the index with the hour in the column, but the following approaches don't work:
In [82]: master.index.hour = master.index.hour(master['hour'])
TypeError: 'numpy.ndarray' object is not callable
In [83]: master.index.hour = [master.index.hour(master.iloc[i,0]) for i in len(master.index.hour)]
TypeError: 'int' object is not iterable
How to proceed?
IIUC I think you want to construct a TimedeltaIndex:
In [89]:
df.index += pd.TimedeltaIndex(df['hour'], unit='h')
df
Out[89]:
hour price
2014-01-01 00:00:00 0 1066.24
2014-01-01 01:00:00 1 1032.11
2014-01-01 02:00:00 2 1028.53
2014-01-01 03:00:00 3 963.57
2014-01-01 04:00:00 4 890.65
Just to compare against using apply:
In [87]:
%timeit df.index + pd.TimedeltaIndex(df['hour'], unit='h')
%timeit df.index + df['hour'].apply(lambda x: pd.Timedelta(x, 'h'))
1000 loops, best of 3: 291 µs per loop
1000 loops, best of 3: 1.18 ms per loop
You can see that using a TimedeltaIndex is significantly faster.
master.index = pd.to_datetime(master.index.map(lambda x: x.strftime('%Y-%m-%d')) + '-' + master.hour.map(str), format='%Y-%m-%d-%H.0')
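Another option (my own hedged sketch, assuming the index holds midnight timestamps as in the question): floor the index to the day and add the hour column as a TimedeltaIndex, which avoids the string round-trip:
master.index = master.index.normalize() + pd.to_timedelta(master['hour'].values, unit='h')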

Merge multiple dataframes with non-unique indices

I have a bunch of pandas time series. Here is an example for illustration (real data has ~ 1 million entries in each series):
>>> for s in series:
...     print s.head()
...     print
2014-01-01 01:00:00 -0.546404
2014-01-01 01:00:00 -0.791217
2014-01-01 01:00:01 0.117944
2014-01-01 01:00:01 -1.033161
2014-01-01 01:00:02 0.013415
2014-01-01 01:00:02 0.368853
2014-01-01 01:00:02 0.380515
2014-01-01 01:00:02 0.976505
2014-01-01 01:00:02 0.881654
dtype: float64
2014-01-01 01:00:00 -0.111314
2014-01-01 01:00:01 0.792093
2014-01-01 01:00:01 -1.367650
2014-01-01 01:00:02 -0.469194
2014-01-01 01:00:02 0.569606
2014-01-01 01:00:02 -1.777805
dtype: float64
2014-01-01 01:00:00 -0.108123
2014-01-01 01:00:00 -1.518526
2014-01-01 01:00:00 -1.395465
2014-01-01 01:00:01 0.045677
2014-01-01 01:00:01 1.614789
2014-01-01 01:00:01 1.141460
2014-01-01 01:00:02 1.365290
dtype: float64
The times in each series are not unique. For example, the last series has 3 values at 2014-01-01 01:00:00. The second series has only one value at that time. Also, not all the times need to be present in all the series.
My goal is to create a merged DataFrame with times that are a union of all the times in the individual time series. Each timestamp should be repeated as many times as needed. So, if a timestamp occurs (2, 0, 3, 4) times in the series above, the timestamp should be repeated 4 times (the maximum of the frequencies) in the resulting DataFrame. The values of each column should be "filled forward".
As an example, the result of merging the above should be:
c0 c1 c2
2014-01-01 01:00:00 -0.546404 -0.111314 -0.108123
2014-01-01 01:00:00 -0.791217 -0.111314 -1.518526
2014-01-01 01:00:00 -0.791217 -0.111314 -1.395465
2014-01-01 01:00:01 0.117944 0.792093 0.045677
2014-01-01 01:00:01 -1.033161 -1.367650 1.614789
2014-01-01 01:00:01 -1.033161 -1.367650 1.141460
2014-01-01 01:00:02 0.013415 -0.469194 1.365290
2014-01-01 01:00:02 0.368853 0.569606 1.365290
2014-01-01 01:00:02 0.380515 -1.777805 1.365290
2014-01-01 01:00:02 0.976505 -1.777805 1.365290
2014-01-01 01:00:02 0.881654 -1.777805 1.365290
To give an idea of size and "uniqueness" in my real data:
>>> [len(s.index.unique()) for s in series]
[48617, 48635, 48720, 48620]
>>> len(times)
51043
>>> [len(s) for s in series]
[1143409, 1143758, 1233646, 1242864]
Here is what I have tried:
I can create a union of all the unique times:
uniques = [s.index.unique() for s in series]
times = uniques[0].union_many(uniques[1:])
I can now index each series using times:
series[0].loc[times]
But that seems to repeat the values for each item in times, which is not what I want.
I can't reindex() the series using times because the index for each series is not unique.
I can do it by a slow Python loop or do it in Cython, but is there a "pandas-only" way to do what I want to do?
I created my example series using the following code:
def make_series(n=3, rep=(0, 5)):
    times = pandas.date_range('2014/01/01 01:00:00', periods=n, freq='S')
    reps = [random.randint(*rep) for _ in xrange(n)]
    dates = []
    values = numpy.random.randn(numpy.sum(reps))
    for date, rep in zip(times, reps):
        dates.extend([date] * rep)
    return pandas.Series(data=values, index=dates)
series = [make_series() for _ in xrange(3)]
This is very nearly a concat:
In [11]: s0 = pd.Series([1, 2, 3], name='s0')
In [12]: s1 = pd.Series([1, 4, 5], name='s1')
In [13]: pd.concat([s0, s1], axis=1)
Out[13]:
s0 s1
0 1 1
1 2 4
2 3 5
However, concat cannot deal with duplicate indices (it's ambiguous how they should merge, and in your case you don't want to merge them in the "ordinary" way - as combinations)...
I think you are going to need to use a groupby:
In [21]: s0 = pd.Series([1, 2, 3], [0, 0, 1], name='s0')
In [22]: s1 = pd.Series([1, 4, 5], [0, 1, 1], name='s1')
Note: I've appended a faster method which works for int-like dtypes (like datetime64).
We want to add a MultiIndex level of the cumcounts for each item, that way we trick the Index into becoming unique:
In [23]: s0.groupby(level=0).cumcount()
Out[23]:
0 0
0 1
1 0
dtype: int64
Note: I can't seem to append a column to the index without it being a DataFrame...
In [24]: df0 = pd.DataFrame(s0).set_index(s0.groupby(level=0).cumcount(), append=True)
In [25]: df1 = pd.DataFrame(s1).set_index(s1.groupby(level=0).cumcount(), append=True)
In [26]: df0
Out[26]:
s0
0 0 1
1 2
1 0 3
Now we can go ahead and concat these:
In [27]: res = pd.concat([df0, df1], axis=1)
In [28]: res
Out[28]:
s0 s1
0 0 1 1
1 2 NaN
1 0 3 4
1 NaN 5
If you want to drop the cumcount level:
In [29]: res.index = res.index.droplevel(1)
In [30]: res
Out[30]:
s0 s1
0 1 1
0 2 NaN
1 3 4
1 NaN 5
Now you can ffill to get the desired result... (if you were concerned about forward filling of different datetimes you could groupby the index and ffill).
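Concretely, a minimal sketch of that last step (the plain ffill reproduces the expected output above; the groupby variant restricts the fill to rows sharing a timestamp):
res = res.ffill()
# or, to fill only within each timestamp:
# res = res.groupby(level=0).ffill()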
If the upper bound on repetitions in each group is reasonable (I'm picking 1000, but much higher is still "reasonable"), you could use a Float64Index as follows (and it certainly seems more elegant):
s0.index = s0.index + (s0.groupby(level=0)._cumcount_array() / 1000.)
s1.index = s1.index + (s1.groupby(level=0)._cumcount_array() / 1000.)
res = pd.concat([s0, s1], axis=1)
res.index = res.index.values.astype('int64')
Note: I'm cheekily using a private method here which returns the cumcount as a numpy array...
Note 2: This is pandas 0.14; in 0.13 you have to pass a numpy array to _cumcount_array, e.g. np.arange(len(s0)); pre-0.13 you're out of luck - there's no cumcount.
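On recent pandas the public cumcount() can replace the private helper. A hedged sketch putting the whole recipe together (the function name and the c0, c1, ... column labels are my own):
import pandas as pd

def merge_series(series_list):
    # Make each non-unique index unique by appending the within-timestamp
    # occurrence number (cumcount) as a second index level, then concat,
    # forward-fill, and drop the helper level again.
    frames = []
    for i, s in enumerate(series_list):
        occ = s.groupby(level=0).cumcount()
        frames.append(s.to_frame('c%d' % i).set_index(occ, append=True))
    res = pd.concat(frames, axis=1).sort_index().ffill()
    res.index = res.index.droplevel(1)
    return res

merged = merge_series(series)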
How about this - convert to DataFrames with labeled columns first, then merge().
s1 = pd.Series(index=['4/4/14', '4/4/14', '4/5/14'],
               data=[12.2, 0.0, 12.2])
s2 = pd.Series(index=['4/5/14', '4/8/14'],
               data=[14.2, 3.0])
d1 = pd.DataFrame({'a': s1})
d2 = pd.DataFrame({'b': s2})
final_df = pd.merge(d1, d2, left_index=True, right_index=True, how='outer')
This gives me
a b
4/4/14 12.2 NaN
4/4/14 0.0 NaN
4/5/14 12.2 14.2
4/8/14 NaN 3.0

Fast conversion of time string (Hour:Min:Sec.Milliseconds) to float

I use pandas to import a CSV file (about a million rows, 5 columns) that contains one column of timestamps (increasing row-by-row) in the format Hour:Min:Sec.Milliseconds, e.g.
11:52:55.162
and some other columns with floats. I need to transform the timestamp column into floats (say in seconds). So far I'm using
pandas.read_csv
to get a dataframe df and then transform it into a numpy array
df=np.array(df)
All the above works great and is quite fast. However, I then use datetime.strptime (the 0th column holds the timestamps)
df[:,0]=[(datetime.strptime(str(d),'%H:%M:%S.%f')).total_seconds() for d in df[:,0]]
to transform the timestamps into seconds, and unfortunately this turns out to be very slow. It's not the iteration over all the rows that is so slow but
datetime.strptime
is the bottleneck. Is there a better way to do it?
Here's one way, using timedeltas.
Create a sample series
In [21]: s = pd.to_timedelta(np.arange(100000),unit='s')
In [22]: s
Out[22]:
0 00:00:00
1 00:00:01
2 00:00:02
3 00:00:03
4 00:00:04
5 00:00:05
6 00:00:06
7 00:00:07
8 00:00:08
9 00:00:09
10 00:00:10
11 00:00:11
12 00:00:12
13 00:00:13
14 00:00:14
...
99985 1 days, 03:46:25
99986 1 days, 03:46:26
99987 1 days, 03:46:27
99988 1 days, 03:46:28
99989 1 days, 03:46:29
99990 1 days, 03:46:30
99991 1 days, 03:46:31
99992 1 days, 03:46:32
99993 1 days, 03:46:33
99994 1 days, 03:46:34
99995 1 days, 03:46:35
99996 1 days, 03:46:36
99997 1 days, 03:46:37
99998 1 days, 03:46:38
99999 1 days, 03:46:39
Length: 100000, dtype: timedelta64[ns]
Convert to string for testing purposes
In [23]: t = s.apply(pd.tslib.repr_timedelta64)
These are strings
In [24]: t.iloc[-1]
Out[24]: '1 days, 03:46:39'
Dividing by a timedelta64 converts this to seconds
In [25]: pd.to_timedelta(t.iloc[-1])/np.timedelta64(1,'s')
Out[25]: 99999.0
This currently matches using a regex, so it's not very fast from a string directly.
In [27]: %timeit pd.to_timedelta(t)/np.timedelta64(1,'s')
1 loops, best of 3: 1.84 s per loop
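Applied to the question's data, a hedged one-liner (assuming the raw 'HH:MM:SS.fff' strings sit in a column named 'ts' and that numpy is imported as np):
df['seconds'] = pd.to_timedelta(df['ts']) / np.timedelta64(1, 's')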
This is a datetime/Timestamp-based solution.
Since datetimes are already stored as int64s, this is very easy and fast.
Create a sample series
In [7]: s = Series(date_range('20130101',periods=1000,freq='ms'))
In [8]: s
Out[8]:
0 2013-01-01 00:00:00
1 2013-01-01 00:00:00.001000
2 2013-01-01 00:00:00.002000
3 2013-01-01 00:00:00.003000
4 2013-01-01 00:00:00.004000
5 2013-01-01 00:00:00.005000
6 2013-01-01 00:00:00.006000
7 2013-01-01 00:00:00.007000
8 2013-01-01 00:00:00.008000
9 2013-01-01 00:00:00.009000
10 2013-01-01 00:00:00.010000
11 2013-01-01 00:00:00.011000
12 2013-01-01 00:00:00.012000
13 2013-01-01 00:00:00.013000
14 2013-01-01 00:00:00.014000
...
985 2013-01-01 00:00:00.985000
986 2013-01-01 00:00:00.986000
987 2013-01-01 00:00:00.987000
988 2013-01-01 00:00:00.988000
989 2013-01-01 00:00:00.989000
990 2013-01-01 00:00:00.990000
991 2013-01-01 00:00:00.991000
992 2013-01-01 00:00:00.992000
993 2013-01-01 00:00:00.993000
994 2013-01-01 00:00:00.994000
995 2013-01-01 00:00:00.995000
996 2013-01-01 00:00:00.996000
997 2013-01-01 00:00:00.997000
998 2013-01-01 00:00:00.998000
999 2013-01-01 00:00:00.999000
Length: 1000, dtype: datetime64[ns]
Convert to ns since epoch / divide to get ms since epoch (if you want seconds,
divide by 10**9)
In [9]: pd.DatetimeIndex(s).asi8/10**6
Out[9]:
array([1356998400000, 1356998400001, 1356998400002, 1356998400003,
1356998400004, 1356998400005, 1356998400006, 1356998400007,
1356998400008, 1356998400009, 1356998400010, 1356998400011,
...
1356998400992, 1356998400993, 1356998400994, 1356998400995,
1356998400996, 1356998400997, 1356998400998, 1356998400999])
Pretty fast
In [12]: s = Series(date_range('20130101',periods=1000000,freq='ms'))
In [13]: %timeit pd.DatetimeIndex(s).asi8/10**6
100 loops, best of 3: 11 ms per loop
I'm guessing that the datetime object has a lot of overhead - it may be easier to do it by hand:
def to_seconds(s):
    hr, min, sec = [float(x) for x in s.split(':')]
    return hr*3600 + min*60 + sec
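Applying it column-wise could look like this minimal sketch (assuming the timestamps are still the first DataFrame column, i.e. before the np.array conversion):
df.iloc[:, 0] = df.iloc[:, 0].map(to_seconds)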
Using sum() and enumerate() -
>>> ts = '11:52:55.162'
>>> ts1 = map(float, ts.split(':'))
>>> ts1
[11.0, 52.0, 55.162]
>>> ts2 = [60**(2-i)*n for i, n in enumerate(ts1)]
>>> ts2
[39600.0, 3120.0, 55.162]
>>> ts3 = sum(ts2)
>>> ts3
42775.162
>>> seconds = sum(60**(2-i)*n for i, n in enumerate(map(float, ts.split(':'))))
>>> seconds
42775.162

Handling monthly-binned data in pandas

I have a dataset I'm analyzing in pandas where all data is binned monthly. The data originates from a MySQL database where all dates are in the format 'YYYY-MM-01', such that, for example, all rows for October 2013 would have "2013-10-01" in the month column.
I'm currently reading the data into pandas (via a .tsv dump of the MySQL table) with
data = pd.read_table(filename,header=None,names=('uid','iid','artist','tag','date'),index_col=indexes, parse_dates='date')
This is all fine, except for the fact that any subsequent analyses I run in which I do monthly resampling always represent dates using the end-of-month convention (i.e. data from October becomes '2013-10-31' instead of '2013-10-01'). This can lead to inconsistencies where the original data has months labeled as 'YYYY-MM-01', while any resampled data will have the months labeled as 'YYYY-MM-31' (or '-30' or '-28', as appropriate).
My question is this: what is the easiest and/or fastest way I can convert all the dates in my dataframe to the end-of-month format from the outset? Keep in mind that the date is one of several indexes in a multi-index, not a column. I think my best bet is to use a modified date_parser in my pd.read_table call that always converts the month to the end-of-month convention, but I'm not sure how to approach it.
Read your dates in exactly like you are doing.
Create some test data. I am setting the dates to the start of the month, but it doesn't matter.
In [39]: df = DataFrame(np.random.randn(10, 2), columns=list('AB'),
   ....:                index=date_range('20130101', periods=10, freq='MS'))
In [40]: df
Out[40]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
Force-convert them to the end-of-month in time space, regardless of the day:
In [41]: df.index = df.index.to_period().to_timestamp('M')
In [42]: df
Out[42]:
A B
2013-01-31 -0.553482 0.049128
2013-02-28 0.337975 -0.035897
2013-03-31 -0.394849 -1.755323
2013-04-30 -0.555638 1.903388
2013-05-31 -0.087752 1.551916
2013-06-30 1.000943 -0.361248
2013-07-31 -1.855171 -2.215276
2013-08-31 -0.582643 1.661696
2013-09-30 0.501061 -1.455171
2013-10-31 1.343630 -2.008060
Back to the start
In [43]: df.index = df.index.to_period().to_timestamp('MS')
In [44]: df
Out[44]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
You can also work with (and resample) them as periods:
In [45]: df.index = df.index.to_period()
In [46]: df
Out[46]:
A B
2013-01 -0.553482 0.049128
2013-02 0.337975 -0.035897
2013-03 -0.394849 -1.755323
2013-04 -0.555638 1.903388
2013-05 -0.087752 1.551916
2013-06 1.000943 -0.361248
2013-07 -1.855171 -2.215276
2013-08 -0.582643 1.661696
2013-09 0.501061 -1.455171
2013-10 1.343630 -2.008060
Use replace() to change the day value. You can get the last day of the month using calendar.monthrange:
from datetime import date
import calendar
d = date(2000,1,1)
d = d.replace(day=calendar.monthrange(d.year, d.month)[1])
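A pandas-native alternative (my own addition, not part of the original answer) uses the MonthEnd offset, which rolls a date forward to its month end:
import pandas as pd
d = pd.Timestamp(2000, 1, 1) + pd.offsets.MonthEnd(0)   # Timestamp('2000-01-31 00:00:00')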
UPDATE
I've added an example for pandas.
Sample file date.csv:
2013-01-01, 1
2013-02-01, 2
IPython shell log:
In [27]: import pandas as pd
In [28]: from datetime import datetime, date
In [29]: import calendar
In [30]: def parse(dt):
   ....:     dt = datetime.strptime(dt, '%Y-%m-%d')
   ....:     dt = dt.replace(day=calendar.monthrange(dt.year, dt.month)[1])
   ....:     return dt.date()
   ....:
In [31]: parse('2013-01-01')
Out[31]: datetime.date(2013, 1, 31)
In [32]: r = pd.read_csv('date.csv', header=None, names=('date', 'value'), parse_dates=['date'], date_parser=parse)
In [33]: r
Out[33]:
date value
0 2013-01-31 1
1 2013-02-28 2
