Python Pandas remove date from timestamp

I have a large data set like this:
user category
time
2014-01-01 00:00:00 21155349 2
2014-01-01 00:00:00 56347479 6
2014-01-01 00:00:00 68429517 13
2014-01-01 00:00:00 39055685 4
2014-01-01 00:00:00 521325 13
I want to turn it into this:
user category
time
00:00:00 21155349 2
00:00:00 56347479 6
00:00:00 68429517 13
00:00:00 39055685 4
00:00:00 521325 13
How do you do this using pandas?

If you want to mutate a series (column) in pandas, the pattern is to apply a function to it (one that updates one element in the series at a time), and then assign that series back into the dataframe:
import pandas
from io import StringIO

# load data
data = '''date,user,category
2014-01-01 00:00:00, 21155349, 2
2014-01-01 00:00:00, 56347479, 6
2014-01-01 00:00:00, 68429517, 13
2014-01-01 00:00:00, 39055685, 4
2014-01-01 00:00:00, 521325, 13'''
df = pandas.read_csv(StringIO(data))
df['date'] = pandas.to_datetime(df['date'])
# make the required change: keep only the time-of-day component
without_date = df['date'].apply(lambda d: d.time())
df['date'] = without_date
# display results
print(df)
If the problem is because the date is the index, you've got a few more hoops to jump through:
df = pandas.read_csv(StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
# set_index returns a new frame, so assign the result back
df = df.set_index(ser.apply(lambda d: d.time()))
As suggested by @DSM, if you have pandas 0.15.2 or later, you can use the .dt accessor on the series to do fast updates.
df = pandas.read_csv(StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
df = df.set_index(ser.dt.time)
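A DatetimeIndex also exposes the time components directly, so on a recent pandas the to_series() round-trip can be skipped entirely; a minimal sketch, assuming the same CSV layout as above:
import pandas
from io import StringIO

data = '''date,user,category
2014-01-01 00:00:00, 21155349, 2
2014-01-01 00:00:00, 56347479, 6'''

df = pandas.read_csv(StringIO(data), index_col='date')
# DatetimeIndex.time returns an array of datetime.time objects
df.index = pandas.to_datetime(df.index).time
print(df)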

Related

Pandas change time values based on condition

I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
# shift out-of-window rows by whole hours; the timestamps viewed as
# int64 are nanoseconds, so one hour is 3600 * 1000000000 ns
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [t.time() for t in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
Rather than just changing the hour for out-of-window times (which keeps the minutes, e.g. 08:45 becomes 09:45), to clamp to exactly 9:00 and 17:00 as min and max times (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [17]*len(df.index)}))
df['time'] = [t.time() for t in pd.to_datetime(df['time'])]
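A variant of the same clamp using Series.where and Timedelta arithmetic; a minimal sketch with the question's data (same semantics as UPDATE #2: hours below 9 become 09:00:00, hours above 17 become 17:00:00):
import pandas as pd

df = pd.DataFrame({'time': ['08:45:00', '09:30:00', '18:00:00', '15:00:00']})
t = pd.to_datetime(df['time'], format='%H:%M:%S')

# floor each timestamp to midnight, then swap in the boundary hour
# for rows that fall outside the 09:00-17:00 window
midnight = t.dt.normalize()
t = t.where(t.dt.hour >= 9, midnight + pd.Timedelta(hours=9))
t = t.where(t.dt.hour <= 17, midnight + pd.Timedelta(hours=17))
df['time'] = t.dt.time
print(df)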
Since your 'time' column contains strings, they can be kept as strings and new string values assigned where appropriate. To filter for your criteria it is convenient to: create a datetime Series from the 'time' column; create boolean Series by comparing the datetime Series against your criteria; use the boolean Series to filter the rows which need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
pattern_lt_nine = '^(00|01|02|03|04|05|06|07|08)'
pattern_gt_seventeen = '^(17|18|19|20|21|22|23)'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
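Since zero-padded 'HH:MM:SS' strings sort lexicographically in time order, plain string comparisons would work here too; a small sketch as an alternative to the regex:
lt_nine = dg['time'] < '09:00:00'
gt_seventeen = dg['time'] >= '17:00:00'
dg.loc[lt_nine, 'time'] = '09:00:00'
dg.loc[gt_seventeen, 'time'] = '17:00:00'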
Time series / date functionality
Working with text data

Split 1 column into 2 columns pandas dataframe AttributeError .str [duplicate]

I have a pandas dataframe with over 1000 timestamps (below) that I would like to loop through:
2016-02-22 14:59:44.561776
I'm having a hard time splitting this timestamp into 2 columns, 'date' and 'time'. The date format can stay the same, but the time needs to be converted to CST (including milliseconds).
Thanks for the help
Had the same problem and this worked for me.
Suppose the date column in your dataset is called "date"
import pandas as pd
df = pd.read_csv(file_path)
df['Dates'] = pd.to_datetime(df['date']).dt.date
df['Time'] = pd.to_datetime(df['date']).dt.time
This will give you two columns, "Dates" and "Time", with the split dates.
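A minor variant of the same approach: calling pd.to_datetime twice parses the column twice, so parsing once and reusing the result is a touch cleaner:
ts = pd.to_datetime(df['date'])
df['Dates'] = ts.dt.date
df['Time'] = ts.dt.time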
I'm not sure why you would want to do this in the first place, but if you really must...
df = pd.DataFrame({'my_timestamp': pd.date_range('2016-1-1 15:00', periods=5)})
>>> df
my_timestamp
0 2016-01-01 15:00:00
1 2016-01-02 15:00:00
2 2016-01-03 15:00:00
3 2016-01-04 15:00:00
4 2016-01-05 15:00:00
df['new_date'] = [d.date() for d in df['my_timestamp']]
df['new_time'] = [d.time() for d in df['my_timestamp']]
>>> df
my_timestamp new_date new_time
0 2016-01-01 15:00:00 2016-01-01 15:00:00
1 2016-01-02 15:00:00 2016-01-02 15:00:00
2 2016-01-03 15:00:00 2016-01-03 15:00:00
3 2016-01-04 15:00:00 2016-01-04 15:00:00
4 2016-01-05 15:00:00 2016-01-05 15:00:00
The conversion to CST is more tricky. I assume that the current timestamps are 'unaware', i.e. they do not have a timezone attached? If not, how would you expect to convert them?
For more details:
https://docs.python.org/2/library/datetime.html
How to make an unaware datetime timezone aware in python
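If the naive timestamps are in fact UTC (an assumption; the question doesn't say), a minimal localize-and-convert sketch:
ts = pd.to_datetime(df['my_timestamp'])
# attach UTC, then convert to US Central time (CST/CDT)
df['central'] = ts.dt.tz_localize('UTC').dt.tz_convert('America/Chicago')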
EDIT
An alternative method that only loops once across the timestamps instead of twice:
new_dates, new_times = zip(*[(d.date(), d.time()) for d in df['my_timestamp']])
df = df.assign(new_date=new_dates, new_time=new_times)
The easiest way is to use the pandas.Series dt accessor, which works on columns with a datetime dtype (see pd.to_datetime). For this case, pd.date_range creates an example column with a datetime dtype, therefore use .dt.date and .dt.time:
df = pd.DataFrame({'full_date': pd.date_range('2016-1-1 10:00:00.123', periods=10, freq='5H')})
df['date'] = df['full_date'].dt.date
df['time'] = df['full_date'].dt.time
In [166]: df
Out[166]:
full_date date time
0 2016-01-01 10:00:00.123 2016-01-01 10:00:00.123000
1 2016-01-01 15:00:00.123 2016-01-01 15:00:00.123000
2 2016-01-01 20:00:00.123 2016-01-01 20:00:00.123000
3 2016-01-02 01:00:00.123 2016-01-02 01:00:00.123000
4 2016-01-02 06:00:00.123 2016-01-02 06:00:00.123000
5 2016-01-02 11:00:00.123 2016-01-02 11:00:00.123000
6 2016-01-02 16:00:00.123 2016-01-02 16:00:00.123000
7 2016-01-02 21:00:00.123 2016-01-02 21:00:00.123000
8 2016-01-03 02:00:00.123 2016-01-03 02:00:00.123000
9 2016-01-03 07:00:00.123 2016-01-03 07:00:00.123000
If your timestamps are already in pandas format (not string), then:
df["date"] = df["timestamp"].date
dt["time"] = dt["timestamp"].time
If your timestamp is a string, you can parse it using the datetime module:
from datetime import datetime
data1["timestamp"] = df["timestamp"].apply(lambda x: \
datetime.strptime(x,"%Y-%m-%d %H:%M:%S.%f"))
Source:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
If your timestamp is a string, you can convert it to a datetime object:
from datetime import datetime
timestamp = '2016-02-22 14:59:44.561776'
dt = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f')
From then on you can bring it to whatever format you like.
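For instance, to pull the pieces back out of the parsed object (a small usage sketch):
date_part = dt.date()    # datetime.date(2016, 2, 22)
time_part = dt.time()    # datetime.time(14, 59, 44, 561776)
as_text = dt.strftime('%H:%M:%S.%f')  # '14:59:44.561776'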
Try
s = '2016-02-22 14:59:44.561776'
date,time = s.split()
then convert time as needed.
If you want to further split the time,
hour, minute, second = time.split(':')
try this:
def time_date(datetime_obj):
    date_time = datetime_obj.split(' ')
    time = date_time[1].split('.')
    return date_time[0], time[0]
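For example (note it drops the sub-second part):
>>> time_date('2016-02-22 14:59:44.561776')
('2016-02-22', '14:59:44')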
In addition to @Alexander, if you want a single liner:
df['new_date'],df['new_time'] = zip(*[(d.date(), d.time()) for d in df['my_timestamp']])
If your timestamp is a string, you can convert it to a pandas timestamp before splitting it:
#convert to pandas timestamp
data["old_date"] = pd.to_datetime(data.old_date)
#split columns
data["new_date"] = data["old_date"].dt.date
data["new_time"] = data["old_date"].dt.time

pandas: selecting rows in a specific time window

I have a dataset of samples covering multiple days, all with a timestamp.
I want to select rows within a specific time window. E.g. all rows that were generated between 1pm and 3 pm every day.
This is a sample of my data in a pandas dataframe:
22 22 2018-04-12T20:14:23Z 2018-04-12T21:14:23Z 0 6370.1
23 23 2018-04-12T21:14:23Z 2018-04-12T21:14:23Z 0 6368.8
24 24 2018-04-12T22:14:22Z 2018-04-13T01:14:23Z 0 6367.4
25 25 2018-04-12T23:14:22Z 2018-04-13T01:14:23Z 0 6365.8
26 26 2018-04-13T00:14:22Z 2018-04-13T01:14:23Z 0 6364.4
27 27 2018-04-13T01:14:22Z 2018-04-13T01:14:23Z 0 6362.7
28 28 2018-04-13T02:14:22Z 2018-04-13T05:14:22Z 0 6361.0
29 29 2018-04-13T03:14:22Z 2018-04-13T05:14:22Z 0 6359.3
.. ... ... ... ... ...
562 562 2018-05-05T08:13:21Z 2018-05-05T09:13:21Z 0 6300.9
563 563 2018-05-05T09:13:21Z 2018-05-05T09:13:21Z 0 6300.7
564 564 2018-05-05T10:13:14Z 2018-05-05T13:13:14Z 0 6300.2
565 565 2018-05-05T11:13:14Z 2018-05-05T13:13:14Z 0 6299.9
566 566 2018-05-05T12:13:14Z 2018-05-05T13:13:14Z 0 6299.6
How do I achieve that? I need to ignore the date and just evaluate the time component. I could traverse the dataframe in a loop and evaluate the datetime that way, but there must be a simpler way to do it.
I converted messageDate, which was read as a string, to a datetime by
df["messageDate"]=pd.to_datetime(df["messageDate"])
But after that I got stuck on how to filter on time only.
Any input appreciated.
datetime columns have a DatetimeProperties object (the .dt accessor), from which you can extract datetime.time and other components and filter on them:
import datetime
df = pd.DataFrame(
    [
        '2018-04-12T12:00:00Z', '2018-04-12T14:00:00Z', '2018-04-12T20:00:00Z',
        '2018-04-13T12:00:00Z', '2018-04-13T14:00:00Z', '2018-04-13T20:00:00Z'
    ],
    columns=['messageDate']
)
df
messageDate
# 0 2018-04-12 12:00:00
# 1 2018-04-12 14:00:00
# 2 2018-04-12 20:00:00
# 3 2018-04-13 12:00:00
# 4 2018-04-13 14:00:00
# 5 2018-04-13 20:00:00
df["messageDate"] = pd.to_datetime(df["messageDate"])
time_mask = (df['messageDate'].dt.hour >= 13) & \
            (df['messageDate'].dt.hour <= 15)
df[time_mask]
# messageDate
# 1 2018-04-12 14:00:00
# 4 2018-04-13 14:00:00
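If the datetimes are in the index, pandas also provides DataFrame.between_time, which selects by time-of-day directly; a minimal sketch using the df from above:
# between_time keeps rows whose time-of-day falls in [13:00, 15:00]
selected = df.set_index('messageDate').between_time('13:00', '15:00')
print(selected)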
I hope the code is self-explanatory. You can always ask questions.
import pandas as pd
# Prepping data for example
dates = pd.date_range('1/1/2018', periods=7, freq='H')
data = {'A' : range(7)}
df = pd.DataFrame(index = dates, data = data)
print(df)
# A
# 2018-01-01 00:00:00 0
# 2018-01-01 01:00:00 1
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
# 2018-01-01 05:00:00 5
# 2018-01-01 06:00:00 6
# Creating a mask to filter the value we with to have or not.
# Here, we use df.index because the index is our datetime.
# If the datetime is a column, you can always say df['column_name']
mask = (df.index > '2018-1-1 01:00:00') & (df.index < '2018-1-1 05:00:00')
print(mask)
# [False False True True True False False]
df_with_good_dates = df.loc[mask]
print(df_with_good_dates)
# A
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
df=df[(df["messageDate"].apply(lambda x : x.hour)>13) & (df["messageDate"].apply(lambda x : x.hour)<15)]
You can use x.minute, x.second similarly.
Try this after ensuring messageDate is indeed in datetime format, as you have done:
df.set_index('messageDate',inplace=True)
choseInd = [ind for ind in df.index if (ind.hour>=13)&(ind.hour<=15)]
df_select = df.loc[choseInd]
You can do the same even without making the datetime column the index, as the apply/lambda answer above shows; making the datetime your index rather than a numerical one just makes your dataframe 'better looking'.

Merge multiple dataframes with non-unique indices

I have a bunch of pandas time series. Here is an example for illustration (real data has ~ 1 million entries in each series):
>>> for s in series:
...     print(s.head())
...     print()
2014-01-01 01:00:00 -0.546404
2014-01-01 01:00:00 -0.791217
2014-01-01 01:00:01 0.117944
2014-01-01 01:00:01 -1.033161
2014-01-01 01:00:02 0.013415
2014-01-01 01:00:02 0.368853
2014-01-01 01:00:02 0.380515
2014-01-01 01:00:02 0.976505
2014-01-01 01:00:02 0.881654
dtype: float64
2014-01-01 01:00:00 -0.111314
2014-01-01 01:00:01 0.792093
2014-01-01 01:00:01 -1.367650
2014-01-01 01:00:02 -0.469194
2014-01-01 01:00:02 0.569606
2014-01-01 01:00:02 -1.777805
dtype: float64
2014-01-01 01:00:00 -0.108123
2014-01-01 01:00:00 -1.518526
2014-01-01 01:00:00 -1.395465
2014-01-01 01:00:01 0.045677
2014-01-01 01:00:01 1.614789
2014-01-01 01:00:01 1.141460
2014-01-01 01:00:02 1.365290
dtype: float64
The times in each series are not unique. For example, the last series has 3 values at 2014-01-01 01:00:00. The second series has only one value at that time. Also, not all the times need to be present in all the series.
My goal is to create a merged DataFrame with times that are a union of all the times in the individual time series. Each timestamp should be repeated as many times as needed. So, if a timestamp occurs (2, 0, 3, 4) times in the series above, the timestamp should be repeated 4 times (the maximum of the frequencies) in the resulting DataFrame. The values of each column should be "filled forward".
As an example, the result of merging the above should be:
c0 c1 c2
2014-01-01 01:00:00 -0.546404 -0.111314 -0.108123
2014-01-01 01:00:00 -0.791217 -0.111314 -1.518526
2014-01-01 01:00:00 -0.791217 -0.111314 -1.395465
2014-01-01 01:00:01 0.117944 0.792093 0.045677
2014-01-01 01:00:01 -1.033161 -1.367650 1.614789
2014-01-01 01:00:01 -1.033161 -1.367650 1.141460
2014-01-01 01:00:02 0.013415 -0.469194 1.365290
2014-01-01 01:00:02 0.368853 0.569606 1.365290
2014-01-01 01:00:02 0.380515 -1.777805 1.365290
2014-01-01 01:00:02 0.976505 -1.777805 1.365290
2014-01-01 01:00:02 0.881654 -1.777805 1.365290
To give an idea of size and "uniqueness" in my real data:
>>> [len(s.index.unique()) for s in series]
[48617, 48635, 48720, 48620]
>>> len(times)
51043
>>> [len(s) for s in series]
[1143409, 1143758, 1233646, 1242864]
Here is what I have tried:
I can create a union of all the unique times:
uniques = [s.index.unique() for s in series]
times = uniques[0].union_many(uniques[1:])
I can now index each series using times:
series[0].loc[times]
But that seems to repeat the values for each item in times, which is not what I want.
I can't reindex() the series using times because the index for each series is not unique.
I can do it by a slow Python loop or do it in Cython, but is there a "pandas-only" way to do what I want to do?
I created my example series using the following code:
def make_series(n=3, rep=(0, 5)):
    times = pandas.date_range('2014/01/01 01:00:00', periods=n, freq='S')
    reps = [random.randint(*rep) for _ in range(n)]
    dates = []
    values = numpy.random.randn(numpy.sum(reps))
    for date, rep in zip(times, reps):
        dates.extend([date] * rep)
    return pandas.Series(data=values, index=dates)

series = [make_series() for _ in range(3)]
This is very nearly a concat:
In [11]: s0 = pd.Series([1, 2, 3], name='s0')
In [12]: s1 = pd.Series([1, 4, 5], name='s1')
In [13]: pd.concat([s0, s1], axis=1)
Out[13]:
s0 s1
0 1 1
1 2 4
2 3 5
However, concat cannot deal with duplicate indices (it's ambiguous how they should merge, and in your case you don't want to merge them in the "ordinary" way - as combinations)...
I think you are going to have to use a groupby:
In [21]: s0 = pd.Series([1, 2, 3], [0, 0, 1], name='s0')
In [22]: s1 = pd.Series([1, 4, 5], [0, 1, 1], name='s1')
Note: I've appended a faster method which works for int-like dtypes (like datetime64).
We want to add a MultiIndex level of the cumcounts for each item, that way we trick the Index into becoming unique:
In [23]: s0.groupby(level=0).cumcount()
Out[23]:
0 0
0 1
1 0
dtype: int64
Note: I can't seem to append a column to the index without it being a DataFrame...
In [24]: df0 = pd.DataFrame(s0).set_index(s0.groupby(level=0).cumcount(), append=True)
In [25]: df1 = pd.DataFrame(s1).set_index(s1.groupby(level=0).cumcount(), append=True)
In [26]: df0
Out[26]:
s0
0 0 1
1 2
1 0 3
Now we can go ahead and concat these:
In [27]: res = pd.concat([df0, df1], axis=1)
In [28]: res
Out[28]:
s0 s1
0 0 1 1
1 2 NaN
1 0 3 4
1 NaN 5
If you want to drop the cumcount level:
In [29]: res.index = res.index.droplevel(1)
In [30]: res
Out[30]:
s0 s1
0 1 1
0 2 NaN
1 3 4
1 NaN 5
Now you can ffill to get the desired result... (if you were concerned about forward filling of different datetimes you could groupby the index and ffill).
If the upper bound on repetitions in each group was reasonable (I'm picking 1000, but much higher is still "reasonable"!), you could use a Float64Index as follows (and certainly it seems more elegant):
s0.index = s0.index + (s0.groupby(level=0)._cumcount_array() / 1000.)
s1.index = s1.index + (s1.groupby(level=0)._cumcount_array() / 1000.)
res = pd.concat([s0, s1], axis=1)
res.index = res.index.values.astype('int64')
Note: I'm cheekily using a private method here which returns the cumcount as a numpy array...
Note 2: This is pandas 0.14; in 0.13 you have to pass a numpy array to _cumcount_array, e.g. np.arange(len(s0)); pre-0.13 you're out of luck - there's no cumcount.
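Putting the cumcount trick together for the original time-series case, a sketch (the with_cumcount helper and the c0/c1/c2 column names are illustrative, not from the answer above):
import pandas as pd

def with_cumcount(s, name):
    # append a per-timestamp cumcount level so duplicate index
    # entries become unique (timestamp, occurrence) pairs
    df = s.to_frame(name)
    return df.set_index(s.groupby(level=0).cumcount(), append=True)

frames = [with_cumcount(s, 'c%d' % i) for i, s in enumerate(series)]
res = pd.concat(frames, axis=1).sort_index().ffill()
res.index = res.index.droplevel(1)  # drop the cumcount level again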
How about this: convert to dataframes with labeled columns first, then merge() on the index.
s1 = pd.Series(index=['4/4/14', '4/4/14', '4/5/14'],
               data=[12.2, 0.0, 12.2])
s2 = pd.Series(index=['4/5/14', '4/8/14'],
               data=[14.2, 3.0])
d1 = pd.DataFrame(s1, columns=['a'])
d2 = pd.DataFrame(s2, columns=['b'])
final_df = pd.merge(d1, d2, left_index=True, right_index=True, how='outer')
This gives me
a b
4/4/14 12.2 NaN
4/4/14 0.0 NaN
4/5/14 12.2 14.2
4/8/14 NaN 3.0

Pandas and csv import into dataframe. How best to combine date and hour fields into one

I have a csv file that I am trying to import into pandas.
There are two columns of interest, date and hour, which are the first two cols.
E.g.
date,hour,...
10-1-2013,0,
10-1-2013,0,
10-1-2013,0,
10-1-2013,1,
10-1-2013,1,
How do I import using pandas so that hour and date are combined, or is that best done after the initial import?
df = DataFrame.from_csv('bingads.csv', sep=',')
If I do the initial import, how do I combine the two into a single date and then delete the hour?
Thanks
Define your own date_parser:
In [291]: from dateutil.parser import parse
In [292]: import datetime as dt
In [293]: def date_parser(x):
   .....:     date, hour = x.split(' ')
   .....:     return parse(date) + dt.timedelta(0, 3600*int(hour))
In [298]: pd.read_csv('test.csv', parse_dates=[[0,1]], date_parser=date_parser)
Out[298]:
date_hour a b c
0 2013-10-01 00:00:00 1 1 1
1 2013-10-01 00:00:00 2 2 2
2 2013-10-01 00:00:00 3 3 3
3 2013-10-01 01:00:00 4 4 4
4 2013-10-01 01:00:00 5 5 5
Apply read_csv instead of read_clipboard to handle your actual data:
>>> df = pd.read_clipboard(sep=',')
>>> df['date'] = pd.to_datetime(df.date) + pd.to_timedelta(df.hour, unit='D')/24
>>> del df['hour']
>>> df
date ...
0 2013-10-01 00:00:00 NaN
1 2013-10-01 00:00:00 NaN
2 2013-10-01 00:00:00 NaN
3 2013-10-01 01:00:00 NaN
4 2013-10-01 01:00:00 NaN
[5 rows x 2 columns]
Take a look at the parse_dates argument which pandas.read_csv accepts.
You can do something like:
df = pandas.read_csv('some.csv', parse_dates=True)
# in which case pandas will parse all columns where it finds dates
df = pandas.read_csv('some.csv', parse_dates=[i,j,k])
# in which case pandas will parse the i, j and kth columns for dates
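Note that combining columns with a nested list like parse_dates=[[0, 1]] is deprecated in recent pandas releases; a sketch of the equivalent post-read combine, assuming the date/hour column names from the question:
import pandas as pd

df = pd.read_csv('bingads.csv')
# parse the date, then add the hour column on as a timedelta
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')
df = df.drop(columns='hour')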
Since you are only using the two columns from the csv file and combining those into one, I would squeeze it into a series of datetime objects like so:
import pandas as pd
from io import StringIO
import datetime as dt

txt = '''\
date,hour,A,B
10-1-2013,0,1,6
10-1-2013,0,2,7
10-1-2013,0,3,8
10-1-2013,1,4,9
10-1-2013,1,5,10'''

def date_parser(date, hour):
    dates = []
    for ed, eh in zip(date, hour):
        month, day, year = list(map(int, ed.split('-')))
        hour = int(eh)
        dates.append(dt.datetime(year, month, day, hour))
    return dates

p = pd.read_csv(StringIO(txt), usecols=[0, 1],
                parse_dates=[[0, 1]], date_parser=date_parser, squeeze=True)
print(p)
Prints:
0 2013-10-01 00:00:00
1 2013-10-01 00:00:00
2 2013-10-01 00:00:00
3 2013-10-01 01:00:00
4 2013-10-01 01:00:00
Name: date_hour, dtype: datetime64[ns]
