Resampling pandas columns datetime - python

I have a dataset whose columns (I think) represent datetime intervals.
The columns were converted to datetime with:
for col in df.columns:
    df.rename(columns={col: pd.to_datetime(col, infer_datetime_format=True)}, inplace=True)
Then I need to resample the columns (year and month, e.g. '2001-01') into quarters using the mean.
I tried
df = df.resample('1q', how='mean', axis=1)
The DataFrame also has a MultiIndex set on ['RegionName', 'County'].
But I get the error:
Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Is the problem in the to_datetime function or in the wrong sampling?

I think you are renaming each column label individually rather than making the entire columns object a DatetimeIndex.
Try this instead:
df.columns = pd.to_datetime(df.columns)
Then run your resample
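As a hedged sketch of running that resample (my names, not the original poster's; it assumes the ['RegionName', 'County'] MultiIndex stays on the rows), with the deprecated how='mean' keyword replaced by chaining .mean():
import pandas as pd

df.columns = pd.to_datetime(df.columns)       # plain Index of strings -> DatetimeIndex
quarterly = df.resample('Q', axis=1).mean()   # one column per quarter, labelled by quarter end
In very recent pandas versions the axis keyword is itself being deprecated, in which case df.T.resample('Q').mean().T achieves the same thing.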
Note:
I'd do it with Period after converting to a DatetimeIndex. That way, you get the period in your column header rather than the end date of the quarter.
df.groupby(df.columns.to_period('Q'), axis=1).mean()
Demo:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, -1),
                  columns=['2011-01-31', '2011-02-28', '2011-03-31',
                           '2011-04-30', '2011-05-31', '2011-06-30'])
df.columns = pd.to_datetime(df.columns)
print(df.groupby(df.columns.to_period('Q'), axis=1).mean())
   2011Q1  2011Q2
0       1       4
1       7      10


How to split a csv into 2 dataframes based on a condition

My dataframe has two date formats in its 'date' column (dtype object): YYYY-MM-DD and DD/MM/YYYY.
I need to split it into two dataframes: every row whose date is in YYYY-MM-DD format goes into df1, and every row whose date is in DD/MM/YYYY format goes into df2.
My idea was to separate the two string formats and then convert both dataframes into the same datetime format. I tried:
data['date'] = pd.to_datetime(data['date'])
data['date'] = data['date'].dt.strftime('%Y-%m-%d')
but the output is wrong for some rows: 13/02/2020 becomes 2020-02-13, which is what I want, but 12/02/2020 becomes 2020-12-02 (day and month are swapped).
Does anyone know how to code this?
If you don't need to convert to datetimes, use Series.str.contains with boolean indexing:
mask = df['date'].str.contains('-')
df1 = df[mask].copy()
df2 = df[~mask].copy()
If you need datetimes, you can pass errors='coerce' to to_datetime, which produces missing values (NaT) for rows that don't match the format, and then drop those rows:
df1 = (df.assign(date=pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce'))
         .dropna(subset=['date']))
df2 = (df.assign(date=pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce'))
         .dropna(subset=['date']))
EDIT: If you need a single output column filled with correctly parsed datetimes, you can fill the missing values from one parse with the other using Series.fillna:
date1 = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
date2 = pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce')
df['date'] = date1.fillna(date2)
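A minimal sketch of that combined parse on made-up data (the sample values below are assumptions, not from the question); each row ends up with whichever parse succeeded:
import pandas as pd

df = pd.DataFrame({'date': ['2020-02-13', '12/02/2020', '2020-03-01']})

date1 = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')  # only ISO rows parse
date2 = pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce')  # only DD/MM/YYYY rows parse
df['date'] = date1.fillna(date2)
print(df)
#         date
# 0 2020-02-13
# 1 2020-02-12
# 2 2020-03-01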
You can use the fact that the separator is different to find the two date formats.
If your dataframe is in this format:
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'Date': ['30/8/2020', '30/8/2021', '30/8/2022',
                            '2019-10-24', '2019-10-25', '2020-10-24']})
with either '-' or '/' separating the date parts, you can write a function that checks for the separator and apply it to the Date column:
def find(string):
    if string.find('/') == 2:
        return True
    else:
        return False

df[df['Date'].apply(find)]
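To get both dataframes from that check (a small assumed extension of the snippet above):
mask = df['Date'].apply(find)   # True where the separator is '/'
df2 = df[mask].copy()           # DD/MM/YYYY rows
df1 = df[~mask].copy()          # YYYY-MM-DD rows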

Select Pandas dataframe rows based on 'hour' datetime

I have a pandas dataframe 'df' with a column 'DateTimes' of type datetime.time.
The entries of that column are hours of a single day:
00:00:00
...
23:59:00
Seconds are skipped; it counts by minutes.
How can I choose rows by hour, for example the rows between 00:00:00 and 00:01:00?
If I try this:
df.between_time('00:00:00', '00:00:10')
I get an error that index must be a DateTimeIndex.
I set the index as such with:
df=df.set_index(keys='DateTime')
but I get the same error.
I can't seem to get 'loc' to work either. Any suggestions?
Here is a working example of what you are trying to do:
import numpy as np
import pandas as pd

times = pd.date_range('3/6/2012 00:00', periods=100, freq='S', tz='UTC')
df = pd.DataFrame(np.random.randint(10, size=(100, 1)), index=times)
df.between_time('00:00:00', '00:00:30')
Note the index has to be of type DatetimeIndex.
I understand you have a column with your dates/times. The problem is probably that your column is not of this type, so you have to convert it first, before setting it as the index:
# Method A
df = df.set_index(pd.to_datetime(df['column_name']))
# Method B
df.index = pd.to_datetime(df['column_name'])
df = df.drop('column_name', axis=1)
(The drop is only necessary if you want to remove the original column after setting it as index)
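Applied to the question, a hedged sketch (it assumes the column is called 'DateTime' and holds datetime.time values, which pd.to_datetime cannot parse directly, hence the cast to string):
import pandas as pd

# cast the time objects to strings; to_datetime then attaches today's date,
# which between_time ignores because it only looks at the time of day
df.index = pd.to_datetime(df['DateTime'].astype(str))

selected = df.between_time('00:00:00', '00:01:00')   # rows in the first minute of the day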
Check out these links:
convert column to date type: Convert DataFrame column type from string to datetime
filter dataframe on dates: Filtering Pandas DataFrames on dates
Hope this helps

Pandas group hourly data into daily sums with date index

I am working on code that takes hourly data for a month and groups it into 24-hour sums. My problem is that I would like the index to show the date/year, but I am just getting an index of 1-30.
The code I am using is
df = df.iloc[:,16:27].groupby([lambda x: x.day]).sum()
example of output I am getting
DateTime data
1 1772.031568
2 19884.42243
3 28696.72159
4 24906.20355
5 9059.120325
example of output I would like
DateTime data
1/1/2017 1772.031568
1/2/2017 19884.42243
1/3/2017 28696.72159
1/4/2017 24906.20355
1/5/2017 9059.120325
This is an old question, but I don't think the accepted solution is the best in this particular case. What you want to accomplish is to downsample time-series data, and pandas has built-in functionality for this called resample(). For your example you would do:
df = df.iloc[:,16:27].resample('D').sum()
or if the datetime column is not the index
df = df.iloc[:,16:27].resample('D', on='datetime_column_name').sum()
There are (at least) two benefits of doing it this way as opposed to the accepted answer:
resample() can both upsample and downsample, while groupby() can only downsample
No lambdas, list comprehensions or date-formatting functions required.
For more information and examples, see documentation here: resample()
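A small self-contained sketch of that downsampling (the hourly values and the 'data' column name are made up for illustration):
import numpy as np
import pandas as pd

hours = pd.date_range('2017-01-01', periods=72, freq='H')      # three days of hourly data
df = pd.DataFrame({'data': np.random.rand(72)}, index=hours)

daily = df.resample('D').sum()   # index becomes 2017-01-01, 2017-01-02, 2017-01-03
print(daily)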
If your index is a datetime, you can build a combined groupby clause:
df = df.iloc[:, 16:27].groupby([lambda x: "{}/{}/{}".format(x.day, x.month, x.year)]).sum()
or even better:
df = df.iloc[:, 16:27].groupby([lambda x: x.strftime("%d%m%Y")]).sum()
If your index is not a datetime object, you can build a date range and set it as the index:
import pandas as pd

df = pd.DataFrame({'data': [1772.031568, 19884.42243, 28696.72159, 24906.20355, 9059.120325]},
                  index=[1, 2, 3, 4, 5])
print(df.head())
rng = pd.date_range('1/1/2017', periods=len(df.index), freq='D')
df.set_index(rng, inplace=True)
print(df.head())
will result in
data
1 1772.031568
2 19884.422430
3 28696.721590
4 24906.203550
5 9059.120325
data
2017-01-01 1772.031568
2017-01-02 19884.422430
2017-01-03 28696.721590
2017-01-04 24906.203550
2017-01-05 9059.120325
First you need to create an index on your datetime column to expose functions that break the datetime into smaller pieces efficiently (like the year and month of the datetime).
Next, you need to group by the year, month and day of the index if you want to apply an aggregate method (like sum()) to each day of the year, and retain separate aggregations for each day.
The reset_index() and rename() functions allow us to rename our group_by categories to their original names. This "flattens" out our data, making the category an actual column on the resulting dataframe.
import pandas as pd

# 'df.created_at' is the datetime column in your dataframe
date_index = pd.DatetimeIndex(df.created_at)

counted = df.groupby([date_index.year, date_index.month, date_index.day])\
            .agg({'column_to_sum': 'sum'})\
            .reset_index()\
            .rename(columns={'level_1': 'year',
                             'level_2': 'month',
                             'level_3': 'day'})
# Resulting dataframe has columns "column_to_sum", "year", "month", "day" available
You can exploit pandas' DatetimeIndex:
working_df = df.iloc[:, 16:27]
result = working_df.groupby(pd.DatetimeIndex(working_df.DateTime).date).sum()
This works if your DateTime column actually contains datetimes (and be careful of the timezone).
In this way you will have valid dates in the index, so you can easily do other manipulations.
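A tiny illustration of that grouping (column names and values are assumed, not from the question):
import pandas as pd

working_df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2017-01-01 01:00', '2017-01-01 13:00', '2017-01-02 06:00']),
    'data': [1.0, 2.0, 3.0],
})

result = working_df.groupby(pd.DatetimeIndex(working_df.DateTime).date)['data'].sum()
print(result)
# 2017-01-01    3.0
# 2017-01-02    3.0
# Name: data, dtype: float64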

Filling in missing date/times in my pd.date_range

I have a column of data which looks like the following:
I am trying to set up a range covering the entire month:
rng = pd.date_range('2016-09-01 00:00:00', '2016-09-30 23:59:58', freq='S')
But my column of data (above) is missing a few hours, and I am unsure where (since my data is 2 million rows long).
I tried to use the reindex command, but it instead seemed to have filled everything with zeroes.
The code that I was using is as follows:
df = pd.DataFrame(df_csv)
rng = pd.date_range('2016-09-01 00:00:00', '2016-09-30 23:59:58', freq='S')
df = df.reindex(rng,fill_value=0.0)
How do I properly fill in the missing date/times without filling everything with a 0?
I think you need set_index on the date column first; then it is possible to use reindex:
# cast the date column if its dtype is not already datetime
df.date = pd.to_datetime(df.date)
df = df.set_index('date').reindex(rng, fill_value=0.0)
You got zeros everywhere because you were reindexing an integer index with datetime values, which produces all NaN values; with fill_value=0.0, every NaN is then replaced by 0.0.
Also, if the date column is sorted, you can use a more general solution that selects the first and last values of the date column:
start_date = df.date.iat[0]
end_date = df.date.iat[-1]
rng = pd.date_range(start_date, end_date, freq='S')
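A minimal end-to-end sketch of that reindex on made-up data (the 'value' column and the timestamps are assumptions); leaving out fill_value keeps the gaps as NaN instead of 0:
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2016-09-01 00:00:00',
                                           '2016-09-01 00:00:02',
                                           '2016-09-01 00:00:03']),
                   'value': [1.0, 2.0, 3.0]})

rng = pd.date_range(df.date.iat[0], df.date.iat[-1], freq='S')
df = df.set_index('date').reindex(rng)   # the missing second shows up as NaN
print(df)
#                      value
# 2016-09-01 00:00:00    1.0
# 2016-09-01 00:00:01    NaN
# 2016-09-01 00:00:02    2.0
# 2016-09-01 00:00:03    3.0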

Python/Pandas - Convert type from pandas period to string

I have a DataFrame:
         Seasonal
Date
2014-12 -1.089744
2015-01 -0.283654
2015-02  0.158974
2015-03  0.461538
I used pd.to_period on the DataFrame, so its index has turned into a pandas Period type (pandas._period.Period).
Now, I want to turn that index to strings. I'm trying to apply the following:
df.index=df.index.astype(str)
However that doesn't work...
ValueError: Cannot cast PeriodIndex to dtype |S0
My code has been frozen since then.
S.O.S.
You can use to_series and then convert to string:
print(df)
# Seasonal
#Date
#2014-12 -1.089744
#2015-01 -0.283654
#2015-02 0.158974
#2015-03 0.461538
print(df.index)
#PeriodIndex(['2014-12', '2015-01', '2015-02', '2015-03'],
# dtype='int64', name=u'Date', freq='M')
df.index=df.index.to_series().astype(str)
print(df)
# Seasonal
#Date
#2014-12 -1.089744
#2015-01 -0.283654
#2015-02 0.158974
#2015-03 0.461538
print(df.index)
#Index([u'2014-12', u'2015-01', u'2015-02', u'2015-03'], dtype='object', name=u'Date')
The line below should convert your PeriodIndex to string format:
df.index = df.index.strftime('%Y-%m')
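For instance, on a small assumed PeriodIndex:
import pandas as pd

pi = pd.period_range('2014-12', '2015-03', freq='M')
print(list(pi.strftime('%Y-%m')))
# ['2014-12', '2015-01', '2015-02', '2015-03']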
You can convert the items to strings by casting to basestring (Python 2 only):
df.index = df.index.astype(basestring)
or, if that doesn't work:
df.index = df.index.map(str)
Referring to the comments on this answer, it might have to do with your pandas/Python version.
If your index is a PeriodIndex, you can convert it to a str list as the following example shows:
import pandas as pd
pi = pd.period_range("2000-01", "2000-12", freq = "M")
print(list(pi.astype(str)))
