From pd.date_range('2016-01', '2016-05', freq='M').strftime('%Y-%m'), the last month is 2016-04, but I was expecting it to be 2016-05. It seems to me this function behaves like the built-in range, where the end parameter is not included in the returned array.
Is there a way to get the end month included in the returned array, without processing the string for the end month?
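For reference, a minimal reproduction of the behavior (assuming a pandas version where freq='M' still means month end):
import pandas as pd
# the last month end inside [2016-01-01, 2016-05-01] is 2016-04-30
pd.date_range('2016-01', '2016-05', freq='M').strftime('%Y-%m')
# roughly: ['2016-01' '2016-02' '2016-03' '2016-04']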
Here's a way to do it without having to figure out month ends yourself:
pd.date_range(*(pd.to_datetime(['2016-01', '2016-05']) + pd.offsets.MonthEnd()), freq='M')
DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
'2016-05-31'],
dtype='datetime64[ns]', freq='M')
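Here pd.offsets.MonthEnd() rolls each of the two dates forward to its month end (2016-01-31 and 2016-05-31), and the * unpacks that pair into the start and end arguments of date_range, so the May month end now falls inside the range.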
You can use .union to add the next logical value after initializing the date_range. It should work as written for any frequency:
d = pd.date_range('2016-01', '2016-05', freq='M')
d = d.union([d[-1] + d.freq]).strftime('%Y-%m')  # d.freq is the index's own offset (MonthEnd here), so this appends 2016-05-31
Alternatively, you can use period_range instead of date_range. Depending on what you intend to do, this might not be the right thing to use, but it satisfies your question:
pd.period_range('2016-01', '2016-05', freq='M').strftime('%Y-%m')
In either case, the resulting output is as expected:
['2016-01' '2016-02' '2016-03' '2016-04' '2016-05']
For those arriving later: you can also use the month-start frequency 'MS'.
>>> pd.date_range('2016-01', '2016-05', freq='MS')
DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01',
'2016-05-01'],
dtype='datetime64[ns]', freq='MS')
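Note that date_range itself has no format parameter; if you want the '%Y-%m' strings from the question, chain strftime onto the result:
pd.date_range('2016-01', '2016-05', freq='MS').strftime('%Y-%m')
# ['2016-01' '2016-02' '2016-03' '2016-04' '2016-05']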
Include the day when specifying the dates in the date_range call:
pd.date_range('2016-01-31', '2016-05-31', freq='M').strftime('%Y-%m')
array(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05'],
dtype='|S7')
I had a similar problem when using datetime objects in a dataframe. I would set the boundaries with .min() and .max() and then fill in missing dates using pd.date_range. Unfortunately, the returned list/df was missing the maximum value.
I found two workarounds for this:
1) Add the closed=None parameter to the pd.date_range call. This worked in the example below; however, it didn't work for me when working only with dataframes (no idea why).
2) If option #1 doesn't work, add one extra unit (in this case a day) using datetime.timedelta(). In the example below this over-indexes by a day, but it can work for you if the date_range function isn't giving you the full range.
import pandas as pd
import datetime as dt

# List of dates as strings (note the gaps: Jan 2 and Jan 4 are missing)
time_series = ['2020-01-01', '2020-01-03', '2020-01-05', '2020-01-06', '2020-01-07']

# Create a dataframe whose time column is converted to datetime objects
raw_data_df = pd.DataFrame(pd.to_datetime(time_series), columns=['Raw_Time_Series'])

# Create an indexed_time range that includes the missing dates and the full time span
# Option #1: use the closed=None parameter (the default, which keeps both endpoints)
indexed_time = pd.date_range(start=raw_data_df.Raw_Time_Series.min(),
                             end=raw_data_df.Raw_Time_Series.max(),
                             freq='D', closed=None)
print('indexed_time option #1 = ', indexed_time)

# Option #2: extend the end by one unit (here a day) with datetime.timedelta
indexed_time = pd.date_range(start=raw_data_df.Raw_Time_Series.min(),
                             end=raw_data_df.Raw_Time_Series.max() + dt.timedelta(days=1),
                             freq='D')
print('indexed_time option #2 = ', indexed_time)
# This over-indexes by an extra day because date_range already covers the full range here,
# but it's a good workaround when closed=None doesn't extend through the full range
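As a side note, in recent pandas (1.4 and later) the closed parameter of date_range is deprecated in favor of inclusive; if closed warns or errors for you, the equivalent of option #1 should be roughly:
# inclusive='both' is the default and keeps both endpoints
indexed_time = pd.date_range(start=raw_data_df.Raw_Time_Series.min(),
                             end=raw_data_df.Raw_Time_Series.max(),
                             freq='D', inclusive='both')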
I don't think so. You need to add the (n+1)th boundary:
pd.date_range('2016-01', '2016-06', freq='M').strftime('%Y-%m')
The start and end dates are strictly inclusive. So it will not
generate any dates outside of those dates if specified.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
Either way, you have to add some information manually. I believe adding just one more month is not a lot of work.
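If hard-coding the (n+1)th month feels brittle, you can compute it instead, e.g. with pd.DateOffset (a sketch, assuming the end point is a '%Y-%m' string):
end = pd.to_datetime('2016-05') + pd.DateOffset(months=1)  # Timestamp('2016-06-01')
pd.date_range('2016-01', end, freq='M').strftime('%Y-%m')
# ['2016-01' '2016-02' '2016-03' '2016-04' '2016-05']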
The explanation for this issue is that pd.to_datetime() converts a '%Y-%m' date string, by default, to a datetime on the first of the month, i.e. '%Y-%m-01':
>>> pd.to_datetime('2016-05')
Timestamp('2016-05-01 00:00:00')
>>> pd.date_range('2016-01', '2016-02')
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
'2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
'2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
'2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
'2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
'2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
'2016-01-29', '2016-01-30', '2016-01-31', '2016-02-01'],
dtype='datetime64[ns]', freq='D')
Then everything follows from that. Specifying freq='M' includes the month ends that fall between 2016-01-01 and 2016-05-01, which is the list you receive, and excludes 2016-05-31. But specifying month starts with 'MS', as the second answer does, includes 2016-05-01, since it falls within the range. pd.date_range()'s default behavior isn't like the built-in range, since both ends are included. From the docs:
closed controls whether to include start and end that are on the boundary. The default includes boundary points on either end.
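You can verify the boundary reasoning directly:
>>> pd.Timestamp('2016-05-31') > pd.to_datetime('2016-05')
True
The May month end lies after the parsed end point 2016-05-01, so freq='M' cannot include it.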
Related
I am trying to use this, but I end up with the same year-month-day format, where the year is changed to the default 1900. I want to get only month-day pairs, if that's possible.
df['date'] = pd.to_datetime(df['date'], format="%m-%d")
If you transform anything to a datetime, you'll always have a year in it; i.e., to_datetime will always yield a datetime with a year.
Without a year, you will need to store it as a string, e.g. by running the inverse of your example:
df['date'] = df['date'].dt.strftime('%m-%d')
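For example, a small sketch with an assumed 'date' column:
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['2017-06-15', '2017-12-25'])})
df['date'].dt.strftime('%m-%d')
# 0    06-15
# 1    12-25
# dtype: object  (strings now, not datetimes)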
I need to print a string for the first multi-index level in a date format.
Essentially, I need to delete all data on the first date. But finding out the cause of this error is also very important to me. Thank you very much in advance!
As commented, dt.date returns datetime.date objects, which are different from Pandas' datetime objects. Use dt.floor('D') or dt.normalize() instead. For example, this would work:
df['Date'] = df.session_started.dt.normalize()
df['Time'] = df.session_started.dt.hour
df_hour = df.groupby(['Date','Time']).checkbooking.count()
df_hour.loc['2019-01-13']
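The dtype difference is easy to see on a small example (made-up data):
import pandas as pd
s = pd.Series(pd.to_datetime(['2019-01-13 09:30', '2019-01-13 17:05']))
s.dt.date.dtype         # object: plain datetime.date values
s.dt.normalize().dtype  # datetime64[ns]: times set to midnight, still datetimes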
When I use Pandas to convert my datetime string, it sets the day to the first of the month if the day is missing.
For example:
pd.to_datetime('2017-06')
OUT[]: Timestamp('2017-06-01 00:00:00')
Is there a way to have it use the 15th (middle) day of the month?
EDIT:
I only want it to use day 15 if the day is missing; otherwise it should use the actual date, so offsetting all values by 15 won't work.
While this isn't possible in the to_datetime call itself, you can use a regex match to figure out whether the string contains a day component and proceed accordingly. Note: this code only works for '-' delimited dates:
import re
import pandas as pd

date_str = '2017-06'
if not re.match(r'.+-.+-.+', date_str):
    # no day component: default the day to the 15th
    ts = pd.to_datetime(date_str).replace(day=15)
else:
    ts = pd.to_datetime(date_str)
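Wrapped up as a helper and applied to a couple of inputs (a sketch; to_day15 is a hypothetical name):
def to_day15(date_str):
    # parse a date string, defaulting the day to 15 when it is missing
    ts = pd.to_datetime(date_str)
    return ts if re.match(r'.+-.+-.+', date_str) else ts.replace(day=15)

to_day15('2017-06')     # Timestamp('2017-06-15 00:00:00')
to_day15('2017-06-02')  # Timestamp('2017-06-02 00:00:00')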
So my timestamp column has a mixture of epoch seconds (s) and epoch milliseconds (ms). Running pd.to_datetime(df['timestamp'], unit='s', errors='ignore') first converts the seconds-style timestamps as I thought it would, and it ignored the 'ms' style timestamps.
But when I then run df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms', errors='ignore'), my head becomes NaT but my tail converts correctly. Why is the function not ignoring the already-converted timestamps?
Is there a way to convert both unit types with a built-in? My current solution iterates over each row and checks whether the length of the timestamp is greater than 10; if it is, it cuts it down to 10. After that I use to_datetime.
My data set is too large for the current solution, as it takes way too long.
EDIT
The timestamp column looks something like this:
1541760294
1541746328
1541723516
1543826478000
1543804455000
1541741097
This is a quick hack, but you could use a type check to build a mask of the rows that were not converted in the first pass, and convert only those:
import datetime

# boolean mask of rows that are still unconverted (not datetime objects yet)
idx = df['timestamp'].apply(lambda x: not isinstance(x, datetime.datetime))
df.loc[idx, 'timestamp'] = pd.to_datetime(df.loc[idx, 'timestamp'], unit='ms', errors='ignore')
I am not entirely sure what the datatype would be but I am guessing it would be a datetime.datetime...
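If the column is still all integers, as in the sample above, a vectorized alternative is to split by magnitude; this is a sketch assuming 10-digit epochs are seconds and 13-digit epochs are milliseconds:
import pandas as pd

s = df['timestamp'].astype('int64')
is_ms = s > 10**11                      # 13-digit values are milliseconds
df['timestamp'] = pd.to_datetime(s.where(~is_ms, s // 1000), unit='s')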
I have a dataframe full of dates, and I would like to select all dates where month==12 and day==25 and replace the zero in the xmas column with a 1.
Any way to do this? The second line of my code errors out.
import numpy as np
from pandas import DataFrame
from datetime import datetime, timedelta

df = DataFrame({'date': [datetime(2013,1,1).date() + timedelta(days=i) for i in range(0, 365*2)],
                'xmas': np.zeros(365*2)})
df[df['date'].month==12 and df['date'].day==25] = 1
Pandas Series with datetime now behaves differently. See .dt accessor.
This is how it should be done now:
df.loc[(df['date'].dt.day==25) & (df['date'].dt.month==12), 'xmas'] = 1
Basically, what you tried won't work because you need to use & to compare arrays; additionally, you need parentheses due to operator precedence. On top of this, you should use loc to perform the indexing:
df.loc[(df['date'].month==12) & (df['date'].day==25), 'xmas'] = 1
An update is needed in reply to this question. As of today, there's a slight difference in how you extract months from datetime objects in a pd.Series.
So, from the very start, in case you have a raw date column, first convert it to datetime objects using a simple function:
import datetime as dt
def read_as_datetime(str_date):
    # replace %Y-%m-%d with your own date format
    return dt.datetime.strptime(str_date, '%Y-%m-%d')
then apply this function to your dates column and save results in a new column namely datetime:
df['datetime'] = df.dates.apply(read_as_datetime)
Finally, in order to extract dates by day and month, use the same piece of code that @Shayan RC explained, with one slight change: use the .dt accessor on the datetime column:
df.loc[(df['datetime'].dt.month==12) & (df['datetime'].dt.day==25), 'xmas'] = 1