I have a pandas DataFrame that I have written to an HDF5 file. The data is indexed by Timestamps and looks like this:
In [5]: df
Out[5]:
                           Codes   Price  Size
Time
2015-04-27 01:31:08-04:00      T  111.75    23
2015-04-27 01:31:39-04:00      T  111.80    23
2015-04-27 01:31:39-04:00      T  113.00    35
2015-04-27 01:34:14-04:00      T  113.00    85
2015-04-27 01:55:15-04:00      T  113.50   203
...                          ...     ...   ...
2015-05-26 11:35:00-04:00     CA  110.55   196
2015-05-26 11:35:00-04:00     CA  110.55    98
2015-05-26 11:35:00-04:00     CA  110.55   738
2015-05-26 11:35:00-04:00     CA  110.55    19
2015-05-26 11:37:01-04:00         110.55    12
What I would like is a function to which I can pass a pandas DatetimeIndex and which returns a DataFrame containing, for each Timestamp in that index, the row at or immediately before it.
The problem I'm running into is that concatenated read_hdf queries won't work if I am looking for more than 30 rows -- see the question "pandas read_hdf with 'where' condition limitation?".
What I am doing now is this, but there has to be a better solution:
from pandas import read_hdf, DatetimeIndex
from datetime import timedelta
import pytz

def getRows(file, dataset, index):
    # read everything from the first requested date through the day after the last one
    if len(index) == 1:
        start = index.date[0]
        end = (index.date + timedelta(days=1))[0]
    else:
        start = index.date.min()
        end = index.date.max() + timedelta(days=1)
    where = '(index >= "' + str(start) + '") & (index < "' + str(end) + '")'
    df = read_hdf(file, dataset, where=where)
    # keep the last row per timestamp, then pad onto the requested timestamps
    df = df.groupby(level=0).last().reindex(index, method='pad')
    return df
This is an example of using a where mask
In [22]: pd.set_option('max_rows',10)
In [23]: df = DataFrame({'A' : np.random.randn(100), 'B' : pd.date_range('20130101',periods=100)}).set_index('B')
In [24]: df
Out[24]:
A
B
2013-01-01 0.493144
2013-01-02 0.421045
2013-01-03 -0.717824
2013-01-04 0.159865
2013-01-05 -0.485890
... ...
2013-04-06 -0.805954
2013-04-07 -1.014333
2013-04-08 0.846877
2013-04-09 -1.646908
2013-04-10 -0.160927
[100 rows x 1 columns]
Store the test frame
In [25]: store = pd.HDFStore('test.h5',mode='w')
In [26]: store.append('df',df)
Create a random selection of dates.
In [27]: dates = df.index.take(np.random.randint(0,100,10))
In [28]: dates
Out[28]: DatetimeIndex(['2013-03-29', '2013-02-16', '2013-01-15', '2013-02-06', '2013-01-12', '2013-02-24', '2013-02-18', '2013-01-06', '2013-03-17', '2013-03-21'], dtype='datetime64[ns]', name=u'B', freq=None, tz=None)
Select the index column (in its entirety)
In [29]: c = store.select_column('df','index')
In [30]: c
Out[30]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
...
95 2013-04-06
96 2013-04-07
97 2013-04-08
98 2013-04-09
99 2013-04-10
Name: B, dtype: datetime64[ns]
Select the indexers that you want. This could actually be somewhat complicated, e.g. you might want a .reindex(method='nearest')
In [34]: c[c.isin(dates)]
Out[34]:
5 2013-01-06
11 2013-01-12
14 2013-01-15
36 2013-02-06
46 2013-02-16
48 2013-02-18
54 2013-02-24
75 2013-03-17
79 2013-03-21
87 2013-03-29
Name: B, dtype: datetime64[ns]
Select the rows that you want
In [32]: store.select('df',where=c[c.isin(dates)].index)
Out[32]:
A
B
2013-01-06 0.680930
2013-01-12 0.165923
2013-01-15 -0.517692
2013-02-06 -0.351020
2013-02-16 1.348973
2013-02-18 0.448890
2013-02-24 -1.078522
2013-03-17 -0.358597
2013-03-21 -0.482301
2013-03-29 0.343381
In [33]: store.close()
Related
I have a column with hh:mm:ss and a separate column with the decimal seconds.
I have quite a horrible text file to process, and the decimal part of my time is separated into another column. Now I'd like to concatenate them back together.
For example:
df = {'Time':['01:00:00','01:00:00 AM','01:00:01 AM','01:00:01 AM'],
      'DecimalSecond':['14','178','158','75']}
I tried the following but it didn't work. It gives me "01:00:00 AM.14" LOL
df = df['Time2'] = df['Time'].map(str) + '.' + df['DecimalSecond'].map(str)
The goal is to come up with one column named "Time2" which has 01:00:00.14 AM in the first row, 01:00:00.178 AM in the second row, etc.
Thank you for the help.
You can convert the output to datetimes and then call Series.dt.time:
# the Time column is split on whitespace and only the value before the first space is kept
s = df['Time'].astype(str).str.split().str[0] + '.' + df['DecimalSecond'].astype(str)
df['Time2'] = pd.to_datetime(s).dt.time
print (df)
Time DecimalSecond Time2
0 01:00:00 14 01:00:00.140000
1 01:00:00 AM 178 01:00:00.178000
2 01:00:01 AM 158 01:00:01.158000
3 01:00:01 AM 75 01:00:01.750000
Please see the Python code below:
In [1]:
import pandas as pd
In [2]:
df = pd.DataFrame({'Time':['01:00:00','01:00:00','01:00:01','01:00:01'],
                   'DecimalSecond':['14','178','158','75']})
In [3]:
df['Time2'] = df[['Time','DecimalSecond']].apply(lambda x: ' '.join(x), axis = 1)
print(df)
Time DecimalSecond Time2
0 01:00:00 14 01:00:00 14
1 01:00:00 178 01:00:00 178
2 01:00:01 158 01:00:01 158
3 01:00:01 75 01:00:01 75
In [4]:
df.iloc[:,2]
Out[4]:
0 01:00:00 14
1 01:00:00 178
2 01:00:01 158
3 01:00:01 75
Name: Time2, dtype: object
I have a DataFrame with a MultiIndex which I want to convert to a date() index.
Here is an example emulation of the type of dataframes I have:
i = pd.date_range('01-01-2016', '01-01-2020')
x = pd.DataFrame(index = i, data=np.random.randint(0, 10, len(i)))
x = x.groupby(by = [x.index.year, x.index.month]).sum()
print(x)
I tried to convert it to date index by this:
def to_date(ind):
    return pd.to_datetime(str(ind[0]) + '/' + str(ind[1]), format="%Y/%m").date()
# flattening the multiindex to tuples to later reset the index
x.set_axis(x.index.to_flat_index(), axis=0, inplace = True)
x = x.rename(index = to_date)
x.set_axis(pd.DatetimeIndex(x.index), axis=0, inplace=True)
But it is very slow. I think the problem is in the pd.to_datetime(str(ind[0]) + '/' + str(ind[1]), format="%Y/%m").date() line. Would greatly appreciate any ideas to make this faster.
You can just use:
x.index=pd.to_datetime([f"{a}-{b}" for a,b in x.index],format='%Y-%m')
print(x)
0
2016-01-01 162
2016-02-01 119
2016-03-01 148
2016-04-01 125
2016-05-01 132
2016-06-01 144
2016-07-01 157
2016-08-01 141
2016-09-01 138
2016-10-01 168
2016-11-01 140
2016-12-01 137
2017-01-01 113
2017-02-01 113
2017-03-01 155
...
I am writing a process which takes a semi-large file as input (~4 million rows, 5 columns)
and performs a few operations on it.
Columns:
- CARD_NO
- ID
- CREATED_DATE
- STATUS
- FLAG2
I need to create a file which contains 1 copy of each CARD_NO where STATUS = '1' and CREATED_DATE is the maximum of all CREATED_DATEs for that CARD_NO.
I succeeded, but my solution is very slow (3h and counting as of right now).
Here is my code:
file = 'input.csv'
input = pd.read_csv(file)
input = input.drop_duplicates()
card_groups = input.groupby('CARD_NO', as_index=False, sort=False).filter(lambda x: x['STATUS'] == 1)

def important(x):
    latest_date = x['CREATED_DATE'].values[x['CREATED_DATE'].values.argmax()]
    return x[x.CREATED_DATE == latest_date]

# where the major slowdown occurs
group_2 = card_groups.groupby('CARD_NO', as_index=False, sort=False).apply(important)

path = 'result.csv'
group_2.to_csv(path, sep=',', index=False)
# ~4 minutes for the 154k-row file
# 3+ hours for ~4m rows
I was wondering if you had any advice on how to improve the running time of this little process.
Thank you and have a good day.
Setup (FYI make sure that you use parse_dates=True when reading your csv)
In [6]: n_groups = 10000
In [7]: N = 4000000
In [8]: dates = date_range('20130101',periods=100)
In [9]: df = DataFrame(dict(id = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))
In [10]: pd.set_option('max_rows',10)
In [13]: df = DataFrame(dict(card_no = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))
In [14]: df
Out[14]:
card_no date status
0 5790 2013-02-11 6
1 6572 2013-03-17 6
2 7764 2013-02-06 3
3 4905 2013-04-01 3
4 3871 2013-04-08 1
... ... ... ...
3999995 1891 2013-02-16 5
3999996 9048 2013-01-11 9
3999997 1443 2013-02-23 1
3999998 2845 2013-01-28 0
3999999 5645 2013-02-05 8
[4000000 rows x 3 columns]
In [15]: df.dtypes
Out[15]:
card_no int64
date datetime64[ns]
status int64
dtype: object
Only status == 1, groupby card_no, then return the max date for that group
In [18]: df[df.status==1].groupby('card_no')['date'].max()
Out[18]:
card_no
0 2013-04-06
1 2013-03-30
2 2013-04-09
...
9997 2013-04-07
9998 2013-04-07
9999 2013-04-09
Name: date, Length: 10000, dtype: datetime64[ns]
In [19]: %timeit df[df.status==1].groupby('card_no')['date'].max()
1 loops, best of 3: 934 ms per loop
If you need a transform of this (i.e. the same value broadcast to every row of its group), note that with pandas < 0.14.1 (releasing this week) you will need the workaround referenced in the original answer, otherwise this will be pretty slow.
In [20]: df[df.status==1].groupby('card_no')['date'].transform('max')
Out[20]:
4 2013-04-10
13 2013-04-10
25 2013-04-10
...
3999973 2013-04-10
3999979 2013-04-10
3999997 2013-04-09
Name: date, Length: 399724, dtype: datetime64[ns]
In [21]: %timeit df[df.status==1].groupby('card_no')['date'].transform('max')
1 loops, best of 3: 1.8 s per loop
I suspect you probably want to merge the final transform back into the original frame (res below is the result of the transform shown in In [20]):
In [24]: df.join(res.to_frame('max_date'))
Out[24]:
card_no date status max_date
0 5790 2013-02-11 6 NaT
1 6572 2013-03-17 6 NaT
2 7764 2013-02-06 3 NaT
3 4905 2013-04-01 3 NaT
4 3871 2013-04-08 1 2013-04-10
... ... ... ... ...
3999995 1891 2013-02-16 5 NaT
3999996 9048 2013-01-11 9 NaT
3999997 1443 2013-02-23 1 2013-04-09
3999998 2845 2013-01-28 0 NaT
3999999 5645 2013-02-05 8 NaT
[4000000 rows x 4 columns]
In [25]: %timeit df.join(res.to_frame('max_date'))
10 loops, best of 3: 58.8 ms per loop
The csv writing will actually take a fair amount of time relative to this. I use HDF5 for things like this; it's MUCH faster.
I use pandas to import a csv file (about a million rows, 5 columns) that contains one column of timestamps (increasing row-by-row) in the format Hour:Min:Sec.Millisecs, e.g.
11:52:55.162
and some other columns with floats. I need to transform the timestamp column into floats (say in seconds). So far I'm using
pandas.read_csv
to get a dataframe df and then transform it into a numpy array
df=np.array(df)
All the above works great and is quite fast. However, I then use datetime.strptime (the 0th column holds the timestamps)
df[:,0]=[(datetime.strptime(str(d),'%H:%M:%S.%f')).total_seconds() for d in df[:,0]]
to transform the timestamps into seconds, and unfortunately this turns out to be very slow. It's not the iteration over all the rows that is so slow;
datetime.strptime
is the bottleneck. Is there a better way to do it?
Here's an approach using timedeltas.
Create a sample series
In [21]: s = pd.to_timedelta(np.arange(100000),unit='s')
In [22]: s
Out[22]:
0 00:00:00
1 00:00:01
2 00:00:02
3 00:00:03
4 00:00:04
5 00:00:05
6 00:00:06
7 00:00:07
8 00:00:08
9 00:00:09
10 00:00:10
11 00:00:11
12 00:00:12
13 00:00:13
14 00:00:14
...
99985 1 days, 03:46:25
99986 1 days, 03:46:26
99987 1 days, 03:46:27
99988 1 days, 03:46:28
99989 1 days, 03:46:29
99990 1 days, 03:46:30
99991 1 days, 03:46:31
99992 1 days, 03:46:32
99993 1 days, 03:46:33
99994 1 days, 03:46:34
99995 1 days, 03:46:35
99996 1 days, 03:46:36
99997 1 days, 03:46:37
99998 1 days, 03:46:38
99999 1 days, 03:46:39
Length: 100000, dtype: timedelta64[ns]
Convert to string for testing purposes
In [23]: t = s.apply(pd.tslib.repr_timedelta64)
These are strings
In [24]: t.iloc[-1]
Out[24]: '1 days, 03:46:39'
Dividing by a timedelta64 converts this to seconds
In [25]: pd.to_timedelta(t.iloc[-1])/np.timedelta64(1,'s')
Out[25]: 99999.0
This is currently matching using a reg-ex, so not very fast from a string directly.
In [27]: %timeit pd.to_timedelta(t)/np.timedelta64(1,'s')
1 loops, best of 3: 1.84 s per loop
This is a datetime-based solution.
Since datetimes are already stored as int64s, this is very easy and fast.
Create a sample series
In [7]: s = Series(date_range('20130101',periods=1000,freq='ms'))
In [8]: s
Out[8]:
0 2013-01-01 00:00:00
1 2013-01-01 00:00:00.001000
2 2013-01-01 00:00:00.002000
3 2013-01-01 00:00:00.003000
4 2013-01-01 00:00:00.004000
5 2013-01-01 00:00:00.005000
6 2013-01-01 00:00:00.006000
7 2013-01-01 00:00:00.007000
8 2013-01-01 00:00:00.008000
9 2013-01-01 00:00:00.009000
10 2013-01-01 00:00:00.010000
11 2013-01-01 00:00:00.011000
12 2013-01-01 00:00:00.012000
13 2013-01-01 00:00:00.013000
14 2013-01-01 00:00:00.014000
...
985 2013-01-01 00:00:00.985000
986 2013-01-01 00:00:00.986000
987 2013-01-01 00:00:00.987000
988 2013-01-01 00:00:00.988000
989 2013-01-01 00:00:00.989000
990 2013-01-01 00:00:00.990000
991 2013-01-01 00:00:00.991000
992 2013-01-01 00:00:00.992000
993 2013-01-01 00:00:00.993000
994 2013-01-01 00:00:00.994000
995 2013-01-01 00:00:00.995000
996 2013-01-01 00:00:00.996000
997 2013-01-01 00:00:00.997000
998 2013-01-01 00:00:00.998000
999 2013-01-01 00:00:00.999000
Length: 1000, dtype: datetime64[ns]
Convert to ns since epoch / divide to get ms since epoch (if you want seconds,
divide by 10**9)
In [9]: pd.DatetimeIndex(s).asi8/10**6
Out[9]:
array([1356998400000, 1356998400001, 1356998400002, 1356998400003,
1356998400004, 1356998400005, 1356998400006, 1356998400007,
1356998400008, 1356998400009, 1356998400010, 1356998400011,
...
1356998400992, 1356998400993, 1356998400994, 1356998400995,
1356998400996, 1356998400997, 1356998400998, 1356998400999])
Pretty fast
In [12]: s = Series(date_range('20130101',periods=1000000,freq='ms'))
In [13]: %timeit pd.DatetimeIndex(s).asi8/10**6
100 loops, best of 3: 11 ms per loop
I'm guessing that the datetime object has a lot of overhead - it may be easier to do it by hand:
def to_seconds(s):
    hr, min, sec = [float(x) for x in s.split(':')]
    return hr*3600 + min*60 + sec
Using sum() and enumerate():
>>> ts = '11:52:55.162'
>>> ts1 = map(float, ts.split(':'))
>>> ts1
[11.0, 52.0, 55.162]
>>> ts2 = [60**(2-i)*n for i, n in enumerate(ts1)]
>>> ts2
[39600.0, 3120.0, 55.162]
>>> ts3 = sum(ts2)
>>> ts3
42775.162
>>> seconds = sum(60**(2-i)*n for i, n in enumerate(map(float, ts.split(':'))))
>>> seconds
42775.162
I'm constructing a dictionary using a dictionary comprehension which has read_csv embedded within it. This constructs the dictionary fine, but when I then push it into a DataFrame all of my data goes to null and the dates get very wacky as well. Here's sample code and output:
In [129]: a= {x.split(".")[0] : read_csv(x, parse_dates=True, index_col=[0])["Settle"] for x in t[:2]}
In [130]: a
Out[130]:
{'SPH2010': Date
2010-03-19 1172.95
2010-03-18 1166.10
2010-03-17 1165.70
2010-03-16 1159.50
2010-03-15 1150.30
2010-03-12 1151.30
2010-03-11 1150.60
2010-03-10 1145.70
2010-03-09 1140.50
2010-03-08 1137.10
2010-03-05 1136.50
2010-03-04 1122.30
2010-03-03 1118.60
2010-03-02 1117.40
2010-03-01 1114.60
...
2008-04-10 1370.4
2008-04-09 1367.7
2008-04-08 1378.7
2008-04-07 1378.4
2008-04-04 1377.8
2008-04-03 1379.9
2008-04-02 1377.7
2008-04-01 1376.6
2008-03-31 1329.1
2008-03-28 1324.0
2008-03-27 1334.7
2008-03-26 1340.7
2008-03-25 1357.0
2008-03-24 1357.3
2008-03-20 1329.8
Name: Settle, Length: 495,
'SPM2011': Date
2011-06-17 1279.4
2011-06-16 1269.0
2011-06-15 1265.4
2011-06-14 1289.9
2011-06-13 1271.6
2011-06-10 1269.2
2011-06-09 1287.4
2011-06-08 1277.0
2011-06-07 1284.8
2011-06-06 1285.0
2011-06-03 1296.3
2011-06-02 1312.4
2011-06-01 1312.1
2011-05-31 1343.9
2011-05-27 1329.9
...
2009-07-10 856.6
2009-07-09 861.2
2009-07-08 856.0
2009-07-07 861.7
2009-07-06 877.9
2009-07-02 875.8
2009-07-01 902.6
2009-06-30 900.3
2009-06-29 908.0
2009-06-26 901.1
2009-06-25 903.8
2009-06-24 885.2
2009-06-23 877.6
2009-06-22 876.0
2009-06-19 903.4
Name: Settle, Length: 497}
In [131]: DataFrame(a)
Out[131]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 806 entries, 2189-09-10 03:33:28.879144 to 1924-01-20 06:06:06.621835
Data columns:
SPH2010 0 non-null values
SPM2011 0 non-null values
dtypes: float64(2)
Thanks!
EDIT:
I've also tried doing this with concat and I get the same results.
You should be able to use concat and unstack. Here's an example:
s1 = pd.Series([1, 2], name='a')
s2 = pd.Series([3, 4], index=[1, 2], name='b')
d = {'A': s1, 'B': s2} # a dict of Series
In [4]: pd.concat(d)
Out[4]:
A 0 1
1 2
B 1 3
2 4
In [5]: pd.concat(d).unstack().T
Out[5]:
A B
0 1 NaN
1 2 3
2 NaN 4