I exported a dataframe to CSV, made some changes, and then tried to read it back in. For some reason the date column is all mixed up.
Can someone please help and tell me why this is happening?
Before saving to CSV, my df looked like this:
aapl = web.DataReader("AAPL", "yahoo", start, end)
bbry = web.DataReader("BBRY", "yahoo", start, end)
lulu = web.DataReader("LULU", "yahoo", start, end)
amzn = web.DataReader("AMZN", "yahoo", start, end)
# Below I create a DataFrame consisting of the adjusted closing price of these stocks, first by making a list of these objects and using the join method
stocks = pd.DataFrame({"AAPL": aapl["Adj Close"],
                       "BBRY": bbry["Adj Close"],
                       "LULU": lulu["Adj Close"],
                       "AMZN": amzn["Adj Close"]},
                      pd.date_range(start, end, freq='BM'))
stocks.head()
Out[60]:
AAPL AMZN BBRY LULU
2011-11-30 49.987684 192.289993 17.860001 49.700001
2011-12-30 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
2012-02-29 70.945373 179.690002 14.170000 67.019997
2012-03-30 78.414750 202.509995 14.700000 74.730003
In [74]:
stocks.to_csv('A5.csv', encoding='utf-8')
After reading the CSV back in, it now looks like this:
In [81]:
stocks1.head()
Out[81]:
Unnamed: 0 AAPL AMZN BBRY LULU
0 2011-11-30 00:00:00 49.987684 192.289993 17.860001 49.700001
1 2011-12-30 00:00:00 52.969683 173.100006 14.500000 46.660000
2 2012-01-31 00:00:00 59.702715 194.440002 16.629999 63.130001
3 2012-02-29 00:00:00 70.945373 179.690002 14.170000 67.019997
4 2012-03-30 00:00:00 78.414750 202.509995 14.700000 74.730003
Why is it not recognizing the date column as dates?
Thanks for your help
I would suggest using an HDF store instead of CSV - it's much faster, it preserves your dtypes, it lets you conditionally select subsets of your data, it supports fast compression, etc.
import pandas as pd
import pandas_datareader.data as web
stocklist = ['AAPL','BBRY','LULU','AMZN']
p = web.DataReader(stocklist, 'yahoo', '2011-11-01', '2012-04-01')
df = p['Adj Close'].resample('M').last()
print(df)
# saving DF to HDF file
store = pd.HDFStore(r'd:/temp/stocks.h5')
store.append('stocks', df, data_columns=True, complib='blosc', complevel=5)
store.close()
Output:
AAPL AMZN BBRY LULU
Date
2011-11-30 49.987684 192.289993 17.860001 49.700001
2011-12-31 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
2012-02-29 70.945373 179.690002 14.170000 67.019997
2012-03-31 78.414750 202.509995 14.700000 74.730003
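Side note: HDFStore can also be used as a context manager, so the file gets closed even if something raises along the way (the same append call as above, just wrapped in with):
# same write as above, but the store is closed automatically
with pd.HDFStore(r'd:/temp/stocks.h5') as store:
    store.append('stocks', df, data_columns=True, complib='blosc', complevel=5)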
let's read our data back from the HDF file:
In [9]: store = pd.HDFStore(r'd:/temp/stocks.h5')
In [10]: x = store.select('stocks')
In [11]: x
Out[11]:
AAPL AMZN BBRY LULU
Date
2011-11-30 49.987684 192.289993 17.860001 49.700001
2011-12-31 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
2012-02-29 70.945373 179.690002 14.170000 67.019997
2012-03-31 78.414750 202.509995 14.700000 74.730003
you can select your data conditionally:
In [12]: x = store.select('stocks', where="AAPL >= 50 and AAPL <= 70")
In [13]: x
Out[13]:
AAPL AMZN BBRY LULU
Date
2011-12-31 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
check index dtype:
In [14]: x.index.dtype
Out[14]: dtype('<M8[ns]')
In [15]: x.index.dtype_str
Out[15]: 'datetime64[ns]'
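If you want to stick with CSV, the dates can still be recovered when reading the file back in: tell read_csv that the first column is the index and that it should be parsed as dates (a minimal sketch, assuming the A5.csv written in the question):
stocks1 = pd.read_csv('A5.csv', index_col=0, parse_dates=True)
stocks1.index.dtype   # datetime64[ns] again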
Related
I tried to use the join function to combine the close prices of all 500 stocks over a 5-year period (2013-02-08 to 2018-02-07), where each column represents a stock and the index of the dataframe is the dates.
But the join function in pandas seems to automatically change the date format of the index, so all the entries in the combined dataframe end up as NaN.
The code to import and preview the file:
import pandas as pd
df= pd.read_csv('all_stocks_5yr.csv')
df.head()
(https://i.stack.imgur.com/29Wq4.png)
# df.info()
df['Name'].unique().shape #There are 505 stock names in total
dates = pd.date_range(df['date'].min(), df['date'].max())  # the date range, used as the index below
Single out the close prices:
close_prices = pd.DataFrame(index=dates) #Make the index column to be the dates
# close_prices.head()
symbols = df['Name'].unique()  # the stock names as an array
So I tried to test the result for each stock using the first 3 stocks:
i = 1
for symbol in symbols:
    df_sym = df[df['Name']==symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    print(df_tmp)  # print the temporary dataframes
    i += 1
    if i > 3: break
And the results were as expected: a dataframe indexed by date with only one stock:
AAL
date
2013-02-08 14.75
2013-02-11 14.46
2013-02-12 14.27
2013-02-13 14.66
2013-02-14 13.99
... ...
2018-02-01 53.88
2018-02-02 52.10
2018-02-05 49.76
2018-02-06 51.18
2018-02-07 51.40
[1259 rows x 1 columns]
AAPL
date
2013-02-08 67.8542
2013-02-11 68.5614
2013-02-12 66.8428
2013-02-13 66.7156
2013-02-14 66.6556
... ...
2018-02-01 167.7800
2018-02-02 160.5000
2018-02-05 156.4900
2018-02-06 163.0300
2018-02-07 159.5400
[1259 rows x 1 columns]
...
Now here's the part I find very confusing: I checked what happens when combining the first 3 stock dataframes using the join function, with index 'date':
i = 1
for symbol in symbols:
    df_sym = df[df['Name']==symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    close_prices = close_prices.join(df_tmp)
    i += 1
    if i > 3: break
close_prices.head()
(https://i.stack.imgur.com/MqVDo.png)
Somehow the index changed from date to datetime format, and so naturally the join function recognizes that none of the entries matches the new index and automatically puts NaN in every single entry.
What caused the dates to change to datetimes?
You can use pivot:
df = pd.read_csv('all_stocks_5yr.csv')
out = df.pivot(index='date', columns='Name', values='close')
Output:
>>> out.iloc[:, :8]
Name A AAL AAP AAPL ABBV ABC ABT ACN
date
2013-02-08 45.08 14.75 78.90 67.8542 36.25 46.89 34.41 73.31
2013-02-11 44.60 14.46 78.39 68.5614 35.85 46.76 34.26 73.07
2013-02-12 44.62 14.27 78.60 66.8428 35.42 46.96 34.30 73.37
2013-02-13 44.75 14.66 78.97 66.7156 35.27 46.64 34.46 73.56
2013-02-14 44.58 13.99 78.84 66.6556 36.57 46.77 34.70 73.13
... ... ... ... ... ... ... ... ...
2018-02-01 72.83 53.88 117.29 167.7800 116.34 99.29 62.18 160.46
2018-02-02 71.25 52.10 113.93 160.5000 115.17 96.02 61.69 156.90
2018-02-05 68.22 49.76 109.86 156.4900 109.51 91.90 58.73 151.83
2018-02-06 68.45 51.18 112.20 163.0300 111.20 91.54 58.86 154.69
2018-02-07 68.06 51.40 109.93 159.5400 113.62 94.22 58.67 155.15
[1259 rows x 8 columns]
Source 'all_stocks_5yr.csv': Kaggle
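As for the confusion in the question itself: close_prices was built on a DatetimeIndex (from pd.date_range), while each df_tmp is indexed by the raw date strings from the CSV, so join finds no matching labels and fills everything with NaN. If you want to keep the join loop, converting the dates first should fix it (a sketch based on the code in the question):
close_prices = pd.DataFrame(index=pd.date_range(df['date'].min(), df['date'].max()))
for symbol in df['Name'].unique():
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(),
                          index=pd.to_datetime(df_sym['date']),  # real datetimes, matching close_prices
                          columns=[symbol])
    close_prices = close_prices.join(df_tmp)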
I have a CSV file with time series data; the first column is the date in the format %Y:%m:%d and the second column is the intraday time in the format %H:%M:%S. I would like to import this CSV file into a MultiIndex DataFrame or Panel object.
With this code, it already works:
_file_data = pd.read_csv(_file,
                         sep=",",
                         header=0,
                         index_col=['Date', 'Time'],
                         thousands="'",
                         parse_dates=True,
                         skipinitialspace=True
                         )
It returns the data in the following format:
Date Time Volume
2016-01-04 2018-04-25 09:01:29 53645
2018-04-25 10:01:29 123
2018-04-25 10:01:29 1345
....
2016-01-05 2018-04-25 10:01:29 123
2018-04-25 12:01:29 213
2018-04-25 10:01:29 123
1st question:
I would like to show the second index as a pure time object, not a datetime. To do that, I have to declare two different date parsers in the read_csv function, but I can't figure out how. What is the "best" way to do that?
2nd question:
After I created the Dataframe, I converted it to a panel-object. Would you recommend doing that? Is the panel-object the better choice for such a data structure? What are the benefits (drawbacks) of a panel-object?
1st question:
You can create multiple converters and pass them as a dictionary:
import pandas as pd
temp=u"""Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:04,09:20:00,0
2016:01:04,09:40:00,0
2016:01:04,10:00:00,1468
2016:01:05,10:00:00,246
2016:01:05,10:20:00,0
2016:01:05,10:40:00,0
2016:01:05,11:00:00,0
2016:01:05,11:20:00,0
2016:01:05,11:40:00,0
2016:01:05,12:00:00,213"""
def converter1(x):
    #convert to datetime and then to times
    return pd.to_datetime(x).time()

def converter2(x):
    #define format of datetime
    return pd.to_datetime(x, format='%Y:%m:%d')
#after testing, replace pd.compat.StringIO(temp) with 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp),
                 index_col=['Date','Time'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter1, 'Date': converter2})
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
Sometimes it's possible to use the built-in parser, e.g. if the date format is YYYY-MM-DD:
import pandas as pd
temp=u"""Date,Time,Volume
2016-01-04,09:00:00,53645
2016-01-04,09:20:00,0
2016-01-04,09:40:00,0
2016-01-04,10:00:00,1468
2016-01-05,10:00:00,246
2016-01-05,10:20:00,0
2016-01-05,10:40:00,0
2016-01-05,11:00:00,0
2016-01-05,11:20:00,0
2016-01-05,11:40:00,0
2016-01-05,12:00:00,213"""
def converter(x):
    #convert the datetime to a time object
    return pd.to_datetime(x).time()
#after testing, replace pd.compat.StringIO(temp) with 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp),
                 index_col=['Date','Time'],
                 parse_dates=['Date'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter})
print (df.index.get_level_values(0))
DatetimeIndex(['2016-01-04', '2016-01-04', '2016-01-04', '2016-01-04',
'2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
'2016-01-05', '2016-01-05', '2016-01-05'],
dtype='datetime64[ns]', name='Date', freq=None)
The last possible solution is to convert the datetimes to times in the MultiIndex with set_levels, after processing (pass the unique level values, df.index.levels[1], rather than get_level_values, otherwise the codes get misaligned):
df.index = df.index.set_levels(df.index.levels[1].time, level=1)
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
2nd question:
Panel is deprecated as of pandas 0.20 and will be removed in a future version (it was eventually dropped in pandas 1.0), so I would not convert; a MultiIndex DataFrame like the one above is the recommended replacement.
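A MultiIndex DataFrame already covers the typical Panel use cases; for example, pulling all rows for one day out of the df built above is a single cross-section (a small sketch):
# select every row for one date via the outer level of the MultiIndex
day = df.xs(pd.Timestamp('2016-01-04'), level='Date')
print(day)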
To keep just the time-of-day part, you can use pd.to_timedelta.
Ex:
import pandas as pd
df = pd.DataFrame({"Time": ["2018-04-25 09:01:29", "2018-04-25 10:01:29", "2018-04-25 10:01:29"]})
df["Time"] = pd.to_timedelta(pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S'))
print(df["Time"])
Output:
0 09:01:29
1 10:01:29
2 10:01:29
Name: Time, dtype: timedelta64[ns]
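If you want actual datetime.time objects rather than timedeltas, the .dt.time accessor gives those directly (a small sketch on the same column):
# time-of-day objects instead of timedeltas
df["Time"] = pd.to_datetime(df["Time"]).dt.time
print(df["Time"])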
So I start out with a pd.Series called jpm, and I would like to group it into weeks and take the last value from each week. The code below works and does get the last value, but it changes the corresponding index to the Sunday of each week, and I would like the index to stay unchanged.
import pandas_datareader.data as web
import pandas as pd
start = pd.datetime(2015, 11, 1)
end = pd.datetime(2015, 11, 17)
raw_jpm = web.DataReader("JPM", 'yahoo', start, end)["Adj Close"]
jpm = raw_jpm.ix[raw_jpm.index[::2]]
jpm is now
Date
2015-11-02 64.125610
2015-11-04 64.428918
2015-11-06 66.982593
2015-11-10 66.219427
2015-11-12 64.575682
2015-11-16 65.074678
Name: Adj Close, dtype: float64
I want to do some operations to it, such as
weekly = jpm.groupby(pd.TimeGrouper('W')).last()
weekly is now
Date
2015-11-08 66.982593
2015-11-15 64.575682
2015-11-22 65.074678
Freq: W-SUN, Name: Adj Close, dtype: float64
which is great, except all my dates got changed. The output I want is:
Date
2015-11-06 66.982593
2015-11-12 64.575682
2015-11-16 65.074678
you can do it this way:
In [15]: jpm
Out[15]:
Date
2015-11-02 64.125610
2015-11-04 64.428918
2015-11-06 66.982593
2015-11-10 66.219427
2015-11-12 64.575682
2015-11-16 65.074678
Name: Adj Close, dtype: float64
In [16]: jpm.groupby(jpm.index.week).transform('last').drop_duplicates(keep='last')
Out[16]:
Date
2015-11-06 66.982593
2015-11-12 64.575682
2015-11-16 65.074678
dtype: float64
Explanation:
In [17]: jpm.groupby(jpm.index.week).transform('last')
Out[17]:
Date
2015-11-02 66.982593
2015-11-04 66.982593
2015-11-06 66.982593
2015-11-10 64.575682
2015-11-12 64.575682
2015-11-16 65.074678
dtype: float64
You could provide a DateOffset by using the Week class with the weekly frequency W-FRI, i.e. setting the weekday argument to 4 (Monday: 0 → Sunday: 6):
jpm.groupby(pd.TimeGrouper(freq=pd.offsets.Week(weekday=4))).last().tail(5)
Date
2016-08-19 65.860001
2016-08-26 66.220001
2016-09-02 67.489998
2016-09-09 66.650002
2016-09-16 65.820000
Freq: W-FRI, Name: Adj Close, dtype: float64
If you want the starting date to be the next Monday after the start date and the ending date to be the previous Sunday before the end date, you could do it this way:
from datetime import datetime, timedelta
start = datetime(2015, 11, 1)
monday = start + timedelta(days=(7 - start.weekday()))
end = datetime(2016, 9, 30)
sunday = end - timedelta(days=end.weekday() + 1)
print (monday)
2015-11-02 00:00:00
print (sunday)
2016-09-25 00:00:00
Then, use it as:
jpm = web.DataReader('JPM', 'yahoo', monday, sunday)["Adj Close"]
jpm.groupby(pd.TimeGrouper(freq='7D')).last()
To have every bucket land on a Sunday (since you specified the range Monday → Sunday, with Sunday being the last day considered), you could use a small hack:
monday_new = monday - timedelta(days=3)
jpm = web.DataReader('JPM', 'yahoo', monday_new, sunday)["Adj Close"]
jpm.groupby(pd.TimeGrouper(freq='W')).last().head()
Date
2015-11-01 62.863448
2015-11-08 66.982593
2015-11-15 64.145175
2015-11-22 66.082449
2015-11-29 65.720431
Freq: W-SUN, Name: Adj Close, dtype: float64
Now that you've posted the desired output, you can arrive at the result using the transform method instead of the aggregated last, so that it returns an object indexed the same as the one being grouped.
import numpy as np

df = jpm.groupby(pd.TimeGrouper(freq='W')).transform('last').reset_index(name='Last')
df
df['counter'] = (df['Last'].shift() != df['Last']).astype(int).cumsum()
df.groupby(['Last', 'counter'])['Date'].apply(lambda x: np.array(x)[-1]) \
  .reset_index().set_index('Date').sort_index()['Last']
Date
2015-11-06 66.982593
2015-11-12 64.575682
2015-11-16 65.074678
Name: Last, dtype: float64
Note: this can handle repeated values that occur on two separate dates thanks to the counter column, which bins them separately into two buckets.
It seems a little tricky to do this in pure pandas, so I used numpy
import numpy as np
weekly = jpm.groupby(pd.TimeGrouper('W-SUN')).last()
weekly.index = jpm.index[np.searchsorted(jpm.index, weekly.index, side="right")-1]
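A note for newer pandas versions: pd.TimeGrouper has since been removed, and pd.Grouper is its replacement, so the weekly aggregation above would look roughly like this (the searchsorted trick to restore the original dates stays the same):
weekly = jpm.groupby(pd.Grouper(freq='W')).last()
weekly.index = jpm.index[np.searchsorted(jpm.index, weekly.index, side="right") - 1]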
I'm trying to make a time series plot with seaborn from a dataframe that has multiple series.
From this post:
seaborn time series from pandas dataframe
I gather that tsplot isn't going to work as it is meant to plot uncertainty.
So is there another Seaborn method that is meant for line charts with multiple series?
My dataframe looks like this:
print(df.info())
print(df.describe())
print(df.values)
print(df.index)
output:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 253 entries, 2013-01-03 to 2014-01-03
Data columns (total 5 columns):
Equity(24 [AAPL]) 253 non-null float64
Equity(3766 [IBM]) 253 non-null float64
Equity(5061 [MSFT]) 253 non-null float64
Equity(6683 [SBUX]) 253 non-null float64
Equity(8554 [SPY]) 253 non-null float64
dtypes: float64(5)
memory usage: 11.9 KB
None
Equity(24 [AAPL]) Equity(3766 [IBM]) Equity(5061 [MSFT]) \
count 253.000000 253.000000 253.000000
mean 67.560593 194.075383 32.547436
std 6.435356 11.175226 3.457613
min 55.811000 172.820000 26.480000
25% 62.538000 184.690000 28.680000
50% 65.877000 193.880000 33.030000
75% 72.299000 203.490000 34.990000
max 81.463000 215.780000 38.970000
Equity(6683 [SBUX]) Equity(8554 [SPY])
count 253.000000 253.000000
mean 33.773277 164.690180
std 4.597291 10.038221
min 26.610000 145.540000
25% 29.085000 156.130000
50% 33.650000 165.310000
75% 38.280000 170.310000
max 40.995000 184.560000
[[ 77.484 195.24 27.28 27.685 145.77 ]
[ 75.289 193.989 26.76 27.85 146.38 ]
[ 74.854 193.2 26.71 27.875 145.965]
...,
[ 80.167 187.51 37.43 39.195 184.56 ]
[ 79.034 185.52 37.145 38.595 182.95 ]
[ 77.284 186.66 36.92 38.475 182.8 ]]
DatetimeIndex(['2013-01-03', '2013-01-04', '2013-01-07', '2013-01-08',
'2013-01-09', '2013-01-10', '2013-01-11', '2013-01-14',
'2013-01-15', '2013-01-16',
...
'2013-12-19', '2013-12-20', '2013-12-23', '2013-12-24',
'2013-12-26', '2013-12-27', '2013-12-30', '2013-12-31',
'2014-01-02', '2014-01-03'],
dtype='datetime64[ns]', length=253, freq=None, tz='UTC')
This works (but I want to get my hands dirty with Seaborn):
df.plot()
Output:
Thank you for your time!
Update1:
df.to_dict() returned:
https://gist.github.com/anonymous/2bdc1ce0f9d0b6ccd6675ab4f7313a5f
Update2:
Using @knagaev's sample code, I've narrowed it down to this difference:
current dataframe (output of print(current_df)):
Equity(24 [AAPL]) Equity(3766 [IBM]) \
2013-01-03 00:00:00+00:00 77.484 195.2400
2013-01-04 00:00:00+00:00 75.289 193.9890
2013-01-07 00:00:00+00:00 74.854 193.2000
2013-01-08 00:00:00+00:00 75.029 192.8200
2013-01-09 00:00:00+00:00 73.873 192.3800
desired dataframe (output of print(desired_df)):
Date Company Kind Price
0 2014-01-02 IBM Open 187.210007
1 2014-01-02 IBM High 187.399994
2 2014-01-02 IBM Low 185.199997
3 2014-01-02 IBM Close 185.529999
4 2014-01-02 IBM Volume 4546500.000000
5 2014-01-02 IBM Adj Close 171.971090
6 2014-01-02 MSFT Open 37.349998
7 2014-01-02 MSFT High 37.400002
8 2014-01-02 MSFT Low 37.099998
9 2014-01-02 MSFT Close 37.160000
10 2014-01-02 MSFT Volume 30632200.000000
11 2014-01-02 MSFT Adj Close 34.960000
12 2014-01-02 ORCL Open 37.779999
13 2014-01-02 ORCL High 38.029999
14 2014-01-02 ORCL Low 37.549999
15 2014-01-02 ORCL Close 37.840000
16 2014-01-02 ORCL Volume 18162100.000000
What's the best way to reorganize the current_df to desired_df?
Update 3:
I finally got it working with the help of @knagaev:
I had to add a dummy column as well as finesse the index:
df['Datetime'] = df.index
melted_df = pd.melt(df, id_vars='Datetime', var_name='Security', value_name='Price')
melted_df['Dummy'] = 0
sns.tsplot(melted_df, time='Datetime', unit='Dummy', condition='Security', value='Price', ax=ax)
to produce:
You can try to get your hands dirty with tsplot.
You will draw your line charts with standard errors ("statistical additions").
I tried to simulate your dataset, so here are the results:
import pandas as pd
import pandas.io.data as web
from datetime import datetime
import seaborn as sns
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
start = datetime(2014,1,1)
end = datetime(2014,3,28)
f = web.DataReader(stocks, 'yahoo',start,end)
df = pd.DataFrame(f.to_frame().stack()).reset_index()
df.columns = ['Date', 'Company', 'Kind', 'Price']
sns.tsplot(df, time='Date', unit='Kind', condition='Company', value='Price')
By the way, this sample is only illustrative. The parameter "unit" is "Field in the data DataFrame identifying the sampling unit (e.g. subject, neuron, etc.). The error representation will collapse over units at each time/condition observation." (from the documentation). So I used the 'Kind' field for illustrative purposes.
OK, I made an example for your dataframe.
It has a dummy field for "noise cleaning" :)
import pandas as pd
import pandas.io.data as web
from datetime import datetime
import seaborn as sns
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
start = datetime(2010,1,1)
end = datetime(2015,12,31)
f = web.DataReader(stocks, 'yahoo',start,end)
df = pd.DataFrame(f.to_frame().stack()).reset_index()
df.columns = ['Date', 'Company', 'Kind', 'Price']
df_open = df[df['Kind'] == 'Open'].copy()
df_open['Dummy'] = 0
sns.tsplot(df_open, time='Date', unit='Dummy', condition='Company', value='Price')
P.S. Thanks to @VanPeer - now you can use seaborn.lineplot for this problem.
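With a recent seaborn, the wide-format dataframe from the question can be plotted directly, no melting or dummy column needed (a small sketch, assuming df is the wide price frame with a DatetimeIndex):
import seaborn as sns
import matplotlib.pyplot as plt

# wide-format data: one line per column, the index goes on the x-axis
ax = sns.lineplot(data=df)
ax.set_ylabel('Price')
plt.show()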
I am having trouble with some dates from zipped xlsx files. These files are loaded into a sqlite database and then exported as .csv. Each file is about 40,000 rows per day. The issue I run into is that pd.to_datetime does not seem to work on these objects (the Excel date format is causing the issue, I think; pure .csv files work fine with this command). This is fine actually - I do not need them to be in datetime format.
What I am trying to achieve is creating a column called ShortDate in the format %m/%d/%Y. How can I do this on a datetime object (the format is mm/dd/yyyy hh:mm:ss from Excel)? I will then create a new column called RosterID which combines the EmployeeID field and the ShortDate field into a unique ID.
I am very new to pandas and I am currently only using it to process .csv files (rename and select certain columns, create unique IDs to use in filters in Tableau, etc).
rep = pd.read_csv(r'C:\Users\Desktop\test.csv.gz', dtype = 'str', compression = 'gzip', usecols = ['etc','etc2'])
print('Read successfully.')
rep['Total']=1
rep['UniqueID']= rep['EmployeeID'] + rep['InteractionID']
rep['ShortDate'] = ??? #what do I do here to get what I am looking for?
rep['RosterID']= rep['EmployeeID'] + rep['ShortDate'] # this is my goal
print('Modified successfully.')
Here is some of the raw data from the .csv. Column names would be
InteractionID, Created Date, EmployeeID, Repeat Date
07927,04/01/2014 14:05:10,912a,04/01/2014 14:50:03
02158,04/01/2014 13:44:05,172r,04/04/2014 17:47:29
44279,04/01/2014 17:28:36,217y,04/07/2014 22:06:19
You can apply a post-processing step that first converts the string to a datetime and then applies a lambda to keep just the date portion:
In [29]:
df['Created Date'] = pd.to_datetime(df['Created Date']).apply(lambda x: x.date())
df['Repeat Date'] = pd.to_datetime(df['Repeat Date']).apply(lambda x: x.date())
df
Out[29]:
InteractionID Created Date EmployeeID Repeat Date
0 7927 2014-04-01 912a 2014-04-01
1 2158 2014-04-01 172r 2014-04-04
2 44279 2014-04-01 217y 2014-04-07
EDIT
After looking at this again, you can access just the date component using dt.date if your version of pandas is greater than 0.15.0:
In [18]:
df['just_date'] = df['Repeat Date'].dt.date
df
Out[18]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 2014-04-01 14:05:10 912a 2014-04-01 14:50:03
1 2158 2014-04-01 13:44:05 172r 2014-04-04 17:47:29
2 44279 2014-04-01 17:28:36 217y 2014-04-07 22:06:19
just_date
0 2014-04-01
1 2014-04-04
2 2014-04-07
Additionally you can also do dt.strftime now rather than use apply to achieve the result you want:
In [28]:
df['short_date'] = df['Repeat Date'].dt.strftime('%m%d%Y')
df
Out[28]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 2014-04-01 14:05:10 912a 2014-04-01 14:50:03
1 2158 2014-04-01 13:44:05 172r 2014-04-04 17:47:29
2 44279 2014-04-01 17:28:36 217y 2014-04-07 22:06:19
just_date short_date
0 2014-04-01 04012014
1 2014-04-04 04042014
2 2014-04-07 04072014
So generating the Roster IDs is now a trivial exercise of concatenating the two columns:
In [30]:
df['Roster ID'] = df['EmployeeID'] + df['short_date']
df
Out[30]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 2014-04-01 14:05:10 912a 2014-04-01 14:50:03
1 2158 2014-04-01 13:44:05 172r 2014-04-04 17:47:29
2 44279 2014-04-01 17:28:36 217y 2014-04-07 22:06:19
just_date short_date Roster ID
0 2014-04-01 04012014 912a04012014
1 2014-04-04 04042014 172r04042014
2 2014-04-07 04072014 217y04072014
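If you want ShortDate in exactly the %m/%d/%Y format from the question, dt.strftime takes that format string directly (a sketch using the column names from the original snippet):
rep['ShortDate'] = pd.to_datetime(rep['Created Date']).dt.strftime('%m/%d/%Y')
rep['RosterID'] = rep['EmployeeID'] + rep['ShortDate']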
Create a new column, then just apply simple datetime functions using lambda and apply.
In [14]: df['Short Date']= pd.to_datetime(df['Created Date'])
In [15]: df
Out[15]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 4/1/2014 14:05 912a 4/1/2014 14:50
1 2158 4/1/2014 13:44 172r 4/4/2014 17:47
2 44279 4/1/2014 17:28 217y 4/7/2014 22:06
Short Date
0 2014-04-01 14:05:00
1 2014-04-01 13:44:00
2 2014-04-01 17:28:00
In [16]: df['Short Date'] = df['Short Date'].apply(lambda x:x.date().strftime('%m%d%y'))
In [17]: df
Out[17]:
InteractionID Created Date EmployeeID Repeat Date Short Date
0 7927 4/1/2014 14:05 912a 4/1/2014 14:50 040114
1 2158 4/1/2014 13:44 172r 4/4/2014 17:47 040114
2 44279 4/1/2014 17:28 217y 4/7/2014 22:06 040114
Then just concatenate the two columns. Convert the Short Date column to strings to avoid errors on concatenation of strings and integers.
In [32]: df['Roster ID'] = df['EmployeeID'] + df['Short Date'].map(str)
In [33]: df
Out[33]:
InteractionID Created Date EmployeeID Repeat Date Short Date \
0 7927 4/1/2014 14:05 912a 4/1/2014 14:50 040114
1 2158 4/1/2014 13:44 172r 4/4/2014 17:47 040114
2 44279 4/1/2014 17:28 217y 4/7/2014 22:06 040114
Roster ID
0 912a040114
1 172r040114
2 217y040114
You can also do it using only the standard libraries (in any format you want '%m/%d/%Y', '%m-%d-%Y' or other orders/formats):
In [118]:
import time
df['Created Date'] = df['Created Date'].apply(lambda x: time.strftime('%m/%d/%Y', time.strptime(x, '%m/%d/%Y %H:%M:%S')))
In [120]:
print(df)
InteractionID Created Date EmployeeID Repeat Date
0 7927 04/01/2014 912a 04/01/2014 14:50:03
1 2158 04/01/2014 172r 04/04/2014 17:47:29
2 44279 04/01/2014 217y 04/07/2014 22:06:19