I have created a dataframe with one column as a series of calender dates using
start = datetime.date(2008, 8, 1)
end = datetime.date(2009, 1, 19)
range = pd.date_range(start, end, freq = 'D')
df = pd.DataFrame({'date': pd.Series(range)})
This gives the date column the dtype datetime64[ns], even though I used datetime.date to create the dates. I have looked through a few related questions but didn't really find them helpful.
How can I convert the type of the date column of this dataframe to a date object?
date_range indeed returns datetime64, regardless of how you specify start and end (e.g. these can also be strings).
If you want to convert datetime64 values to datetime.date objects, you can use the .date accessor of DatetimeIndex (date_range returns such an index):
In [22]: s = pd.Series(range.date)
In [23]: s
Out[23]:
0 2008-08-01
1 2008-08-02
2 2008-08-03
3 2008-08-04
4 2008-08-05
...
167 2009-01-15
168 2009-01-16
169 2009-01-17
170 2009-01-18
171 2009-01-19
Length: 172, dtype: object
In [24]: s[0]
Out[24]: datetime.date(2008, 8, 1)
See here for docs on these datetime components: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components. To convert it to datetime.datetime objects, you can use range.to_pydatetime().
But as U2EF1 noted, depending on the application, it's quite possible you want to keep the datetime64 values, as operations with them will be much more performant.
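If the dates live in a DataFrame column rather than in an index, a minimal sketch of the same conversion via the .dt accessor (reusing the setup from the question) could look like:
import datetime
import pandas as pd
rng = pd.date_range(datetime.date(2008, 8, 1), datetime.date(2009, 1, 19), freq='D')
df = pd.DataFrame({'date': pd.Series(rng)})
# .dt.date turns each datetime64 value into a datetime.date object (the column dtype becomes object)
df['date_as_object'] = df['date'].dt.date
print(type(df['date_as_object'][0]))  # <class 'datetime.date'>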
Related
Just as the title says, I am trying to convert my DataFrame labels to type datetime. In the following attempted solution I pulled the labels from the DataFrame into dates_index and tried converting them to datetime using DatetimeIndex.to_datetime; however, the interpreter says that DatetimeIndex has no attribute to_datetime.
dates_index = df.index[0::]
dates = DatetimeIndex.to_datetime(dates_index)
I've also tried using the pandas.to_datetime function.
dates = pandas.to_datetime(dates_index, errors='coerce')
This returns the datetimes wrapped in a DatetimeIndex instead of plain datetimes.
My DatetimeIndex labels contain both date and time, and my goal is to push that data into two separate columns of the DataFrame.
If your DatetimeIndex is named myindex, then df.reset_index() will create a myindex column, which you can do what you want with; if you want to make it an index again later, you can revert with df.set_index('myindex').
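A minimal sketch of that round trip, assuming a DataFrame whose DatetimeIndex is named myindex:
import pandas as pd
# hypothetical frame with a DatetimeIndex named 'myindex'
df = pd.DataFrame({'value': [1, 2, 3]},
                  index=pd.date_range('2020-01-01', periods=3, freq='D', name='myindex'))
df = df.reset_index()          # 'myindex' is now an ordinary datetime64[ns] column
df = df.set_index('myindex')   # revert to the original index later if needed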
You can set the index after converting the datatype of the column.
To convert the datatype to datetime, use to_datetime.
And to set the column as the index, use set_index.
Hope this helps!
import pandas as pd
df = pd.DataFrame({
'mydatecol': ['06/11/2020', '06/12/2020', '06/13/2020', '06/14/2020'],
'othcol1': [10, 20, 30, 40],
'othcol2': [1, 2, 3, 4]
})
print(df)
print(f'Index type is now {df.index.dtype}')
df['mydatecol'] = pd.to_datetime(df['mydatecol'])
df.set_index('mydatecol', inplace=True)
print(df)
print(f'Index type is now {df.index.dtype}')
Output is
mydatecol othcol1 othcol2
0 06/11/2020 10 1
1 06/12/2020 20 2
2 06/13/2020 30 3
3 06/14/2020 40 4
Index type is now int64
othcol1 othcol2
mydatecol
2020-06-11 10 1
2020-06-12 20 2
2020-06-13 30 3
2020-06-14 40 4
Index type is now datetime64[ns]
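As a follow-up sketch, once the index has dtype datetime64[ns] you can also select rows with date strings (pandas partial string indexing), continuing from the frame above:
print(df.loc['2020-06-12'])               # a single day
print(df.loc['2020-06-12':'2020-06-13'])  # an inclusive date range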
I found a quick solution to my problem. You can create a new pandas column based on the index and then use strftime to reformat the date.
df['date'] = df.index # Creates new column called 'date' of type Timestamp
df['date'] = df['date'].dt.strftime('%m/%d/%Y %I:%M%p') # Date formatting
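To actually split the index into two separate columns, one for the date and one for the time (the stated goal above), a small sketch using the DatetimeIndex attributes could be:
df['date'] = df.index.date   # datetime.date objects
df['time'] = df.index.time   # datetime.time objects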
I have sliced the pandas dataframe.
end_date = df[-1:]['end']
type(end_date)
Out[4]: pandas.core.series.Series
end_date
Out[3]:
48173 2017-09-20 04:47:59
Name: end, dtype: datetime64[ns]
How to get rid of end_date's index value 48173 and get only 2017-09-20 04:47:59 string? I have to call REST API with 2017-09-20 04:47:59 as a parameter, so I have to get string from pandas datetime64 series.
How to get rid of end_date's index value 48173 and get only datetime object [something like datetime.datetime.strptime('2017-09-20 04:47:59', '%Y-%m-%d %H:%M:%S')]. I need it because, later I will have to check if '2017-09-20 04:47:59' < datetime.datetime(2017,1,9)
I need to convert just a single cell value, not a whole column.
How to do these conversions?
It seems you need:
import datetime
import pandas as pd
data = ['2017-09-20 04:47:59','2017-10-20 04:47:59','2017-09-30 04:47:59']
df = pd.DataFrame(data,columns=['end'])
df['end'] = pd.to_datetime(df['end'])
df
df will be:
end
0 2017-09-20 04:47:59
1 2017-10-20 04:47:59
2 2017-09-30 04:47:59
After that you can use the code below to get rid of the index and work with the value as a 'Timestamp' object:
end_date = df['end'].iloc[-1] #get last row of column end
print(type(end_date)) # pandas.tslib.Timestamp
end_date_str = end_date.strftime('%Y-%m-%d %H:%M:%S') #convert to str
print(end_date_str) # '2017-09-30 04:47:59'
print(end_date < datetime.datetime(2017,1,9)) #False
Simply cast the result to a string, and recover it using .values[0]:
In [38]: end_date
Out[38]:
48173 2017-09-20 04:47:59
Name: end, dtype: datetime64[ns]
In [39]: end_date.astype(str).values[0]
Out[39]: '2017-09-20 04:47:59'
If you want a datetime object, you have to convert the value to a Unix timestamp (in nanoseconds here) and then back to a datetime object:
In [42]: end_date.values[0].item()
Out[42]: 1505882879000000000
In [43]: datetime.fromtimestamp(end_date.values[0].item()/10**9)
Out[43]: datetime.datetime(2017, 9, 20, 6, 47, 59)
Otherwise, you can strptime the string recovered in step 1:
In [48]: datetime.datetime.strptime(end_date.astype(str).values[0], '%Y-%m-%d %H:%M:%S')
Out[48]: datetime.datetime(2017, 9, 20, 4, 47, 59)
You may wonder why there is a 2 hour difference between the results. This is because datetime.datetime.fromtimestamp takes my local timezone into account (currently CEST, which is UTC+2).
Parsing the string with strptime, on the other hand, yields a naive datetime with no timezone information, which leads to the 2 hour discrepancy.
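To sidestep the timezone pitfall entirely, one alternative (a sketch, assuming end_date is the one-row Series from above) is to go through pd.Timestamp, which converts the numpy datetime64 value to a naive datetime.datetime without applying the local timezone:
import pandas as pd
end_dt = pd.Timestamp(end_date.values[0]).to_pydatetime()
# equivalently: end_date.iloc[0].to_pydatetime()
print(end_dt)   # 2017-09-20 04:47:59 (naive, no 2 hour shift)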
I create a series from some random dates
import pandas as pd
from datetime import datetime
pd.Series([datetime(2012, 8, 1), datetime(2013, 4, 1), datetime(2013, 8, 1)])
Out[49]:
0 2012-08-01
1 2013-04-01
2 2013-08-01
dtype: datetime64[ns]
However, if I create a series with a datetime.max, the dtype of the series is all of a sudden an object
pd.Series([datetime(2012, 8, 1), datetime(2013, 4, 1), datetime.max])
Out[50]:
0 2012-08-01 00:00:00
1 2013-04-01 00:00:00
2 9999-12-31 23:59:59.999999
dtype: object
Also the way the dates are shown changes. I guess this latter point is related to the fact that the series is now an object.
datetime.max is of the same type as the other dates
type(datetime.max)
Out[53]: datetime.datetime
type(datetime(2014, 1,1))
Out[54]: datetime.datetime
What is going on here? How can I create a series containing the 'max' datetime value? Like this:
0 2012-08-01
1 2013-04-01
2 9999-12-31
dtype: datetime64[ns]
The datetime64[ns] dtype can represent dates between 1678 AD and 2262 AD. Since datetime.max lies outside this range, the dtype of the Series was changed to object and all the values converted to datetime.datetimes so that the Series could hold the required range of datetimes.
Currently the nanosecond-frequency datetime64[ns] dtype (as opposed to say, datetime64[s], or datetime64[Y]) is the only NumPy datetime dtype that Pandas supports. The recommended workaround is to use pd.Period or pd.PeriodIndex objects to represent dates outside the range representable by datetime64[ns]:
import datetime as DT
import pandas as pd
s = pd.Series([DT.datetime(2012, 8, 1), DT.datetime(2013, 4, 1), DT.datetime.max])
p = s.apply(lambda x: pd.Period(x, freq='D'))
print(p)
yields
0 2012-08-01
1 2013-04-01
2 9999-12-31
dtype: object
Notice that the freq parameter must be set to something coarser than nanoseconds ('ns') to expand the allowable range of dates (at the expense of granularity).
Here is a table of common aliases you can use for the freq parameter.
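For example, a quick sketch of how the chosen freq alias trades range for granularity, reusing datetime.max from the question:
import datetime as DT
import pandas as pd
print(pd.Period(DT.datetime.max, freq='D'))   # 9999-12-31 (daily precision)
print(pd.Period(DT.datetime.max, freq='M'))   # 9999-12    (monthly precision)
print(pd.Period(DT.datetime.max, freq='Y'))   # 9999       (yearly precision)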
When I compute the difference between two pandas datetime64 dates I get np.timedelta64. Is there any easy way to convert these deltas into representations like hours, days, weeks, etc.?
I could not find any methods in np.timedelta64 that facilitate conversions between different units, but it looks like Pandas knows how to convert these units to days when printing timedeltas (e.g. I get 29 days, 23:20:00 in the string representation of dataframes). Is there any way to access this functionality?
Update:
Strangely, none of the following work:
> df['column_with_times'].days
> df['column_with_times'].apply(lambda x: x.days)
but this one does:
df['column_with_times'][0].days
pandas stores timedelta data in the numpy timedelta64[ns] type, but it also provides the Timedelta type to wrap this for more convenience (e.g. to provide accessors for the days, hours, and other components).
In [41]: timedelta_col = pd.Series(pd.timedelta_range('1 days', periods=5, freq='2 h'))
In [42]: timedelta_col
Out[42]:
0 1 days 00:00:00
1 1 days 02:00:00
2 1 days 04:00:00
3 1 days 06:00:00
4 1 days 08:00:00
dtype: timedelta64[ns]
To access the different components of a full column (series), you have to use the .dt accessor. For example:
In [43]: timedelta_col.dt.hours
Out[43]:
0 0
1 2
2 4
3 6
4 8
dtype: int64
With timedelta_col.dt.components you get a frame with all the different components (days to nanoseconds) as different columns.
When you access a single value of the column above, you get back a Timedelta, and on a Timedelta you don't need the dt accessor; you can access the components directly:
In [45]: timedelta_col[0]
Out[45]: Timedelta('1 days 00:00:00')
In [46]: timedelta_col[0].days
Out[46]: 1L
So the .dt accessor provides access to the attributes of the Timedelta scalar, but on the full column. That is the reason df['column_with_times'][0].days works but df['column_with_times'].days does not.
The reason that df['column_with_times'].apply(lambda x: x.days) does not work is that apply is given the timedelta64 values (and not the Timedelta pandas type), and these don't have such attributes.
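If what you actually need is the total length of each timedelta expressed in a single unit (rather than the broken-down components), a short sketch continuing with timedelta_col from above:
total_hours = timedelta_col.dt.total_seconds() / 3600   # 24.0, 26.0, 28.0, 30.0, 32.0
total_days = timedelta_col / pd.Timedelta(days=1)       # 1.0, 1.0833..., 1.1667..., ...
hour_component = timedelta_col.dt.components.hours      # 0, 2, 4, 6, 8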
I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
The df['date_column'] has to be in datetime format.
df['month_year'] = df['date_column'].dt.to_period('M')
You can also use D for day, 2M for two months, etc. for different sampling intervals; if you have time series data with timestamps, you can use more granular intervals such as 45Min for 45 minutes or 15Min for 15 minutes.
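A small sketch of those aliases in action, assuming a datetime column named date_column:
import pandas as pd
df = pd.DataFrame({'date_column': pd.to_datetime(['2020-01-01 10:20', '2020-02-15 11:50'])})
df['month_year'] = df['date_column'].dt.to_period('M')   # e.g. 2020-01
df['day'] = df['date_column'].dt.to_period('D')          # e.g. 2020-01-01
# the same pattern works with the other aliases mentioned above, such as '2M' or '15Min'
print(df)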
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.tslib.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
lambda x: datetime.datetime(
x.year,
x.month,
max(calendar.monthcalendar(x.year, x.month)[-1][:5])
)
)
If you are just looking to format the datetime column into some string representation, you can make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want a unique month-year pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to convert the column to datetime first; I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])
SINGLE LINE: adding a column with 'year-month' pairs ('pd.to_datetime' first changes the column dtype to datetime before the operation):
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the year, say from ['2018-03-04']:
df['Year'] = pd.DatetimeIndex(df['date']).year
df['Year'] creates a new column. If you want to extract the month instead, just use .month.
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
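For context, casting to 'datetime64[M]' truncates each value to the first day of its month; a minimal sketch under that assumption:
import pandas as pd
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(['2012-12-31', '2012-12-29'])})
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
print(df['Month'])   # both rows become 2012-12-01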
@KieranPC's solution is the correct approach for Pandas, but it is not easily extensible to arbitrary attributes. For this, you can use getattr within a generator expression and combine the results using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There are two steps to extract the year for the whole dataframe without using apply.
Step 1: convert the column to datetime:
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step 2: extract the year or the month using the DatetimeIndex() method:
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me. I didn't think pandas would interpret the resulting string as a date, but when I did the plot it knew my agenda very well and the year_month strings were ordered properly... gotta love pandas!
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think the proper input here should be a string:
df['ArrivalDate'].astype(str).apply(lambda x: x[:-2])
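Equivalently, the .str accessor avoids the lambda; a small sketch, assuming ArrivalDate holds dates like 2012-12-31:
df['year_month'] = df['ArrivalDate'].astype(str).str[:7]   # e.g. '2012-12'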