My variable dates_city stores this:
Index(['2020-11-17T00:00:00', '2020-11-18T00:00:00', '2020-11-19T00:00:00',
'2020-11-20T00:00:00', '2020-11-21T00:00:00', '2020-11-22T00:00:00',
'2020-11-23T00:00:00', '2020-11-24T00:00:00', '2020-11-25T00:00:00',
'2020-11-26T00:00:00', '2020-11-27T00:00:00', '2020-11-28T00:00:00'])
I want it to be stored as:
Index(['2020-11-17', '2020-11-18', '2020-11-19',
'2020-11-20', '2020-11-21', '2020-11-22',
'2020-11-23', '2020-11-24', '2020-11-25',
'2020-11-26', '2020-11-27', '2020-11-28'])
So, basically with just the date in yyyy-mm-dd format. I was trying to use datetime but I can't seem to get it to work, possibly because this variable is an index, not an array. How do I reformat this?
You could change the index of your dataframe using pandas' reset_index() method. Note that this moves the date index into a column named 'index', so you may want to rename it using the rename() method.
Then you can reformat the dates with strftime(). After reformatting, if you still want to use the date column as the index, assign it back to the index attribute of the dataframe (see code below):
df.index = df['Date']
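Putting those steps together, a minimal runnable sketch (the dataframe contents and the 'Date' column name are assumptions):

```python
import pandas as pd

# Hypothetical dataframe whose index holds ISO timestamps as strings
df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.Index(["2020-11-17T00:00:00", "2020-11-18T00:00:00", "2020-11-19T00:00:00"]),
)

# Move the index into a column and rename it
df = df.reset_index().rename(columns={"index": "Date"})

# Reformat the timestamps to yyyy-mm-dd strings
df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m-%d")

# Use the reformatted dates as the index again
df.index = df["Date"]
```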
pandas.to_datetime worked for me:
pd.to_datetime(dates_city)
#DatetimeIndex(['2020-11-17', '2020-11-18', '2020-11-19', '2020-11-20',
# '2020-11-21', '2020-11-22', '2020-11-23', '2020-11-24',
# '2020-11-25', '2020-11-26', '2020-11-27', '2020-11-28'],
# dtype='datetime64[ns]', freq=None)
If you want to keep it as a pandas.Index of strings, you can chain the pandas.DatetimeIndex.strftime method:
pd.to_datetime(dates_city).strftime("%Y-%m-%d")
#Index(['2020-11-17', '2020-11-18', '2020-11-19', '2020-11-20', '2020-11-21',
# '2020-11-22', '2020-11-23', '2020-11-24', '2020-11-25', '2020-11-26',
# '2020-11-27', '2020-11-28'],
# dtype='object')
You can find the datetime format codes here.
I am working on code that takes hourly data for a month and groups it into 24-hour sums. My problem is that I would like the index to show the date and year, but I am just getting an index of 1-30.
The code I am using is
df = df.iloc[:,16:27].groupby([lambda x: x.day]).sum()
example of output I am getting
DateTime data
1 1772.031568
2 19884.42243
3 28696.72159
4 24906.20355
5 9059.120325
example of output I would like
DateTime data
1/1/2017 1772.031568
1/2/2017 19884.42243
1/3/2017 28696.72159
1/4/2017 24906.20355
1/5/2017 9059.120325
This is an old question, but I don't think the accepted solution is the best in this particular case. What you want to accomplish is to downsample time series data, and pandas has built-in functionality for this called resample(). For your example you would do:
df = df.iloc[:,16:27].resample('D').sum()
or if the datetime column is not the index
df = df.iloc[:,16:27].resample('D', on='datetime_column_name').sum()
There are (at least) two benefits to doing it this way rather than the accepted answer:
resample() can both upsample and downsample; groupby() can only downsample.
No lambdas, list comprehensions or date-formatting functions are required.
For more information and examples, see documentation here: resample()
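To illustrate, a small sketch with made-up hourly data (the column name and dates are assumptions):

```python
import numpy as np
import pandas as pd

# Three days of hourly readings, all equal to 1 for easy checking
rng = pd.date_range('2017-01-01', periods=72, freq='h')
df = pd.DataFrame({'data': np.ones(72)}, index=rng)

# Downsample to daily sums; the index keeps the full dates
daily = df.resample('D').sum()
# daily.index -> 2017-01-01, 2017-01-02, 2017-01-03
# daily['data'] -> 24.0 for each day
```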
If your index is a datetime, you can build a combined groupby clause:
df = df.iloc[:,16:27].groupby([lambda x: "{}/{}/{}".format(x.day, x.month, x.year)]).sum()
or even better:
df = df.iloc[:,16:27].groupby([lambda x: x.strftime("%d%m%Y")]).sum()
If your index is not a datetime object, you can instead replace it with a generated date range:
import pandas as pd
df = pd.DataFrame({'data': [1772.031568, 19884.42243, 28696.72159, 24906.20355, 9059.120325]}, index=[1, 2, 3, 4, 5])
print(df.head())
rng = pd.date_range('1/1/2017', periods=len(df.index), freq='D')
df.set_index(rng, inplace=True)
print(df.head())
will result in
data
1 1772.031568
2 19884.422430
3 28696.721590
4 24906.203550
5 9059.120325
data
2017-01-01 1772.031568
2017-01-02 19884.422430
2017-01-03 28696.721590
2017-01-04 24906.203550
2017-01-05 9059.120325
First you need to create an index on your datetime column to expose functions that break the datetime into smaller pieces efficiently (like the year and month of the datetime).
Next, you need to group by the year, month and day of the index if you want to apply an aggregate method (like sum()) to each day of the year, and retain separate aggregations for each day.
The reset_index() and rename() functions allow us to rename our group_by categories to their original names. This "flattens" out our data, making the category an actual column on the resulting dataframe.
import pandas as pd
date_index = pd.DatetimeIndex(df.created_at)
# 'df.created_at' is the datetime column in your dataframe
counted = df.groupby([date_index.year, date_index.month, date_index.day])\
            .agg({'column_to_sum': 'sum'})\
            .reset_index()\
            .rename(columns={'level_0': 'year',
                             'level_1': 'month',
                             'level_2': 'day'})
# Resulting dataframe has columns "column_to_sum", "year", "month", "day" available
You can exploit panda's DatetimeIndex:
working_df=df.iloc[:, 16:27]
result = working_df.groupby(pd.DatetimeIndex(working_df.DateTime).date).sum()
This works if your DateTime column actually holds datetimes (and be careful with time zones).
In this way you will have valid datetime in the index, so that you can easily do other manipulations.
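As a self-contained illustration of this approach (the column names and values are made up):

```python
import pandas as pd

# Hypothetical frame with timestamped readings
df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2017-01-01 00:00', '2017-01-01 12:00', '2017-01-02 06:00']),
    'data': [1.0, 2.0, 3.0],
})

# Group by the calendar date of each timestamp and sum per day
result = df.groupby(pd.DatetimeIndex(df['DateTime']).date)['data'].sum()
# Two rows: 2017-01-01 -> 3.0, 2017-01-02 -> 3.0
```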
Given a date, I'm using pandas date_range to generate additional 30 dates:
import pandas as pd
from datetime import timedelta
pd.date_range(startdate, startdate + timedelta(days=30))
Out of these 30 dates, how can I randomly select 10 dates in order starting from date in first position and ending with date in last position?
Use np.random.choice to choose a specified number of items from a given set of choices.
In order to guarantee that the first and last dates are preserved, I pull them out explicitly and choose 8 more dates at random.
I then pass them back to pd.to_datetime and sort_values to ensure they stay in order.
import numpy as np
import pandas as pd

dates = pd.date_range('2011-04-01', periods=30, freq='D')
random_dates = pd.to_datetime(
    np.concatenate([
        np.random.choice(dates[1:-1], size=8, replace=False),
        dates[[0, -1]]
    ])
).sort_values()
random_dates
DatetimeIndex(['2011-04-01', '2011-04-02', '2011-04-03', '2011-04-13',
'2011-04-14', '2011-04-21', '2011-04-22', '2011-04-26',
'2011-04-27', '2011-04-30'],
dtype='datetime64[ns]', freq=None)
You can use numpy.random.choice with replace=False if it is not necessary to explicitly keep the first and last values (if it is, use the other answer):
import numpy as np
import pandas as pd

a = pd.date_range('2011-04-01', periods=30, freq='D')
print(pd.to_datetime(np.sort(np.random.choice(a, size=10, replace=False))))
DatetimeIndex(['2011-04-01', '2011-04-03', '2011-04-05', '2011-04-09',
'2011-04-12', '2011-04-17', '2011-04-22', '2011-04-24',
'2011-04-29', '2011-04-30'],
dtype='datetime64[ns]', freq=None)
From pd.date_range('2016-01', '2016-05', freq='M').strftime('%Y-%m'), the last month is 2016-04, but I was expecting it to be 2016-05. It seems to me this function behaves like the range method, where the end parameter is not included in the returned array.
Is there a way to get the end month included in the returning array, without processing the string for the end month?
A way to do it without figuring out month ends yourself:
pd.date_range(*(pd.to_datetime(['2016-01', '2016-05']) + pd.offsets.MonthEnd()), freq='M')
DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
'2016-05-31'],
dtype='datetime64[ns]', freq='M')
You can use .union to add the next logical value after initializing the date_range. It should work as written for any frequency:
d = pd.date_range('2016-01', '2016-05', freq='M')
d = d.union([d[-1] + d.freq]).strftime('%Y-%m')
Alternatively, you can use period_range instead of date_range. Depending on what you intend to do, this might not be the right thing to use, but it satisfies your question:
pd.period_range('2016-01', '2016-05', freq='M').strftime('%Y-%m')
In either case, the resulting output is as expected:
['2016-01' '2016-02' '2016-03' '2016-04' '2016-05']
For the later crowd: you can also try the month-start frequency. (Note that date_range has no format parameter; use strftime afterwards if you need strings.)
>>> pd.date_range('2016-01', '2016-05', freq='MS')
DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01',
'2016-05-01'],
dtype='datetime64[ns]', freq='MS')
Include the day when specifying the dates in date_range call
pd.date_range('2016-01-31', '2016-05-31', freq='M', ).strftime('%Y-%m')
array(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05'],
dtype='|S7')
I had a similar problem when using datetime objects in a dataframe. I would set the boundaries through the .min() and .max() functions and then fill in missing dates using pd.date_range. Unfortunately the returned list/df was missing the maximum value.
I found two workarounds for this:
1) Add the closed=None parameter to the pd.date_range function. This worked in the example below; however, it didn't work for me when working only with dataframes (no idea why).
2) If option #1 doesn't work, you can add one extra unit (in this case a day) using datetime.timedelta(). In the case below it over-indexes by a day, but it can work for you if date_range isn't giving you the full range.
import pandas as pd
import datetime as dt
#List of dates as strings
time_series = ['2020-01-01', '2020-01-03', '2020-01-05', '2020-01-06', '2020-01-07']
#Creates dataframe with time data that is converted to datetime object
raw_data_df = pd.DataFrame(pd.to_datetime(time_series), columns = ['Raw_Time_Series'])
#Creates an indexed_time list that includes missing dates and the full time range
#Option No. 1 is to use the closed=None parameter choice.
#(In pandas >= 2.0 this parameter is called inclusive; use inclusive='both'.)
indexed_time = pd.date_range(start = raw_data_df.Raw_Time_Series.min(), end = raw_data_df.Raw_Time_Series.max(), freq='D', closed=None)
print('indexed_time option #1 = ', indexed_time)
#Option No. 2 if the function allows you to extend the time by one unit (in this case day)
#by using the datetime.timedelta function to get what you need.
indexed_time = pd.date_range(start = raw_data_df.Raw_Time_Series.min(),end = raw_data_df.Raw_Time_Series.max()+dt.timedelta(days=1),freq='D')
print('indexed_time option #2 = ', indexed_time)
#In this case you over index by an extra day because the date_range function works properly
#However, if the closed=None parameter doesn't extend through the full range then this is a good workaround
I don't think so. You need to add the (n+1)th boundary:
pd.date_range('2016-01', '2016-06', freq='M').strftime('%Y-%m')
The start and end dates are strictly inclusive. So it will not
generate any dates outside of those dates if specified.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
Either way, you have to manually add some information. I believe adding just one more month is not a lot of work.
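For instance, one way to add that extra month is to shift the end boundary forward with a DateOffset before calling date_range (a sketch; the start and end strings come from the question):

```python
import pandas as pd

start, end = '2016-01', '2016-05'

# Push the end one month forward so the final month's month-end is included
rng = pd.date_range(start, pd.Timestamp(end) + pd.DateOffset(months=1), freq='M')
months = rng.strftime('%Y-%m')
# months -> ['2016-01', '2016-02', '2016-03', '2016-04', '2016-05']
```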
The explanation for this issue is that the function pd.to_datetime() converts a '%Y-%m' date string by default to the first of the month datetime, or '%Y-%m-01':
>>> pd.to_datetime('2016-05')
Timestamp('2016-05-01 00:00:00')
>>> pd.date_range('2016-01', '2016-02')
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
'2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
'2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
'2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
'2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
'2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
'2016-01-29', '2016-01-30', '2016-01-31', '2016-02-01'],
dtype='datetime64[ns]', freq='D')
Then everything follows from that. Specifying freq='M' includes the month ends between 2016-01-01 and 2016-05-01, which is the list you receive, and excludes 2016-05-31. But specifying month starts with 'MS', as another answer suggests, includes 2016-05-01, since it falls within the range. pd.date_range()'s default behavior isn't like the range method, since both ends are included. From the docs:
closed controls whether to include start and end that are on the boundary. The default includes boundary points on either end.
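To see the two behaviors side by side, a small sketch using the dates from the question:

```python
import pandas as pd

# '2016-05' parses as 2016-05-01, so that timestamp is the range's end boundary
month_ends = pd.date_range('2016-01', '2016-05', freq='M')     # month ends up to 2016-05-01
month_starts = pd.date_range('2016-01', '2016-05', freq='MS')  # month starts up to 2016-05-01

# Only four month-ends fall on or before 2016-05-01, but five month-starts do
```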
I have a data frame indexed with a date (Python datetime object). How could I find the frequency as the number of months of data in the data frame?
I tried the attribute data_frame.index.freq, but it returns None. I also tried the asfreq function with data_frame.asfreq('M', how=...) using both 'start' and 'end', but it does not return the expected results. Please advise how I can get the expected results.
You want to convert your index of datetimes to a DatetimeIndex; the easiest way is to use to_datetime:
df.index = pd.to_datetime(df.index)
Now you can do timeseries/frame operations, like resample or TimeGrouper.
If your data has a consistent frequency, then this will be df.index.freq, if it doesn't (e.g. if some days are missing) then df.index.freq will be None.
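When freq is None but the spacing is in fact regular, pd.infer_freq can often recover it. A small sketch with made-up month-end data:

```python
import pandas as pd

# Hypothetical frame indexed by date strings
df = pd.DataFrame({'x': range(4)},
                  index=['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'])
df.index = pd.to_datetime(df.index)

print(df.index.freq)            # None: to_datetime does not attach a frequency
print(pd.infer_freq(df.index))  # month-end frequency ('M', or 'ME' in newer pandas)
```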
You probably want to use a pandas Timestamp for your index instead of datetime in order to use freq. See the example below:
import pandas as pd
dates = pd.date_range('2012-1-1','2012-2-1')
df = pd.DataFrame(index=dates)
print(df.index.freq)
yields,
<Day>
You can easily convert your dataframe like so,
df.index = [pd.Timestamp(d) for d in df.index]