I have hourly data of variable X for 3 types in a Category column, with ds set as the index.
> df
ds Category X
2010-01-01 01:00:00 A 32
2010-01-01 01:00:00 B 13
2010-01-01 01:00:00 C 09
2010-01-01 02:00:00 A 12
2010-01-01 02:00:00 B 62
2010-01-01 02:00:00 C 12
I want to resample it to weeks. But if I use df2 = df.resample('W').mean(), it simply drops the 'Category' column.
If you need to resample per week within each Category, add a groupby, i.e. use DataFrameGroupBy.resample:
Notice:
A DatetimeIndex is required for this to work correctly.
df2 = df.groupby('Category').resample('W').mean()
print (df2)
X
Category ds
A 2010-01-03 22.0
B 2010-01-03 37.5
C 2010-01-03 10.5
To complete the answer by jezrael, I found it useful to flatten the grouped result back into a plain timestamp-indexed DataFrame, as explained here. So the answer becomes:
df2 = df.groupby('Category').resample('W').mean()
# the inverse of groupby, reset_index
df2 = df2.reset_index()
# set again the timestamp as index
df2 = df2.set_index("ds")
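As a variant, the same result can be produced in one pass with `pd.Grouper`, which avoids the intermediate grouped object entirely (a self-contained sketch of the data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "ds": pd.to_datetime(["2010-01-01 01:00", "2010-01-01 01:00", "2010-01-01 01:00",
                          "2010-01-01 02:00", "2010-01-01 02:00", "2010-01-01 02:00"]),
    "Category": ["A", "B", "C", "A", "B", "C"],
    "X": [32, 13, 9, 12, 62, 12],
}).set_index("ds")

# pd.Grouper(freq='W') resamples on the DatetimeIndex inside the groupby,
# and reset_index flattens the result straight away
df2 = df.groupby(["Category", pd.Grouper(freq="W")]).mean().reset_index()
print(df2)  # weekly means per category: A 22.0, B 37.5, C 10.5
```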
Related
I have a pandas dataframe which is structured as follows:
timestamp y
0 2020-01-01 00:00:00 336.0
1 2020-01-01 00:15:00 544.0
2 2020-01-01 00:30:00 736.0
3 2020-01-01 00:45:00 924.0
4 2020-01-01 01:00:00 1260.0
...
The timestamp column is a datetime data type
and I have another dataframe with the following structure:
y
timestamp
00:00:00 625.076923
00:15:00 628.461538
00:30:00 557.692308
00:45:00 501.692308
01:00:00 494.615385
...
In this case, the time is the pandas datetime index.
Now what I want to do is replace the values in the first dataframe wherever the time of day matches the second dataframe.
IIUC, your first dataframe df1's timestamp column is of datetime type, and your second dataframe (df2) has an index holding only the time of day, not the date.
Then you can do:
df1['y'] = df1['timestamp'].dt.time.map(df2['y'])
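A minimal runnable sketch of how `Series.map` lines up the `datetime.time` keys, with two made-up rows standing in for the real data:

```python
import pandas as pd

df1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-01-01 00:00:00", "2020-01-01 00:15:00"]),
    "y": [336.0, 544.0],
})
# df2's index holds datetime.time objects, mirroring the question
df2 = pd.DataFrame(
    {"y": [625.076923, 628.461538]},
    index=pd.to_datetime(["00:00:00", "00:15:00"]).time,
)

# .map(Series) looks each time of day up in df2's index
df1["y"] = df1["timestamp"].dt.time.map(df2["y"])
```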
I wouldn't be surprised if there is a better way, but you can accomplish this by reshaping the tables so that they can merge on the time. Assuming your dataframes are df and df2.
df['time'] = df['timestamp'].dt.time
df2 = df2.reset_index()
df2['timestamp'] = pd.to_datetime(df2['timestamp']).dt.time
df_combined = pd.merge(df,df2,left_on='time',right_on='timestamp')
df_combined
timestamp_x y_x time timestamp_y y_y
0 2020-01-01 00:00:00 336.0 00:00:00 00:00:00 625.076923
1 2020-01-01 00:15:00 544.0 00:15:00 00:15:00 628.461538
2 2020-01-01 00:30:00 736.0 00:30:00 00:30:00 557.692308
3 2020-01-01 00:45:00 924.0 00:45:00 00:45:00 501.692308
4 2020-01-01 01:00:00 1260.0 01:00:00 01:00:00 494.615385
# This clearly has more than you need, so just keep what you want and rename things back.
df_combined = df_combined[['timestamp_x','y_y']]
df_combined = df_combined.rename(columns={'timestamp_x':'timestamp','y_y':'y'})
New answer I like way better: actually using .map()
Still need to get df2 to have the time column to match on.
df2 = df2.reset_index()
df2['timestamp'] = pd.to_datetime(df2['timestamp']).dt.time
df['y'] = df['timestamp'].dt.time.map(dict(zip(df2['timestamp'], df2['y'])))
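Strung together on sample data (values made up for illustration), the `.map()` version might look like:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-01-01 00:00:00", "2020-01-01 00:15:00"]),
    "y": [336.0, 544.0],
})
df2 = pd.DataFrame(
    {"y": [625.076923, 628.461538]},
    index=pd.Index(["00:00:00", "00:15:00"], name="timestamp"),
)

df2 = df2.reset_index()
# parse the time strings so they compare equal to .dt.time values
df2["timestamp"] = pd.to_datetime(df2["timestamp"]).dt.time
df["y"] = df["timestamp"].dt.time.map(dict(zip(df2["timestamp"], df2["y"])))
```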
I have a dataframe with hourly values for several years. My dataframe is already in datetime format, and the column containing the values is called say "value column".
date = ['2015-02-03 23:00:00','2015-02-03 23:30:00','2015-02-04 00:00:00','2015-02-04 00:30:00']
value_column = [33.24 , 31.71 , 34.39 , 34.49 ]
df = pd.DataFrame({'index': date, 'value column': value_column})
df.index = pd.to_datetime(df['index'], format='%Y-%m-%d %H:%M:%S')
df.drop(['index'], axis=1, inplace=True)
print(df.head())
value column
index
2015-02-03 23:00:00 33.24
2015-02-03 23:30:00 31.71
2015-02-04 00:00:00 34.39
2015-02-04 00:30:00 34.49
I know how to get the mean of the "value column" for each year efficiently with for instance the following command:
df = df.groupby(df.index.year).mean()
Now, I would like to divide all hourly values of the column "value column" by the mean of its values for its corresponding year (for instance dividing all the 2015 hourly values by the mean of 2015 values, and same for the other years).
Is there an efficient way to do that in pandas?
Expected result:
value column Value column/mean of year
index
2015-02-03 23:00:00 33.24 0.993499
2015-02-03 23:30:00 31.71 0.94777
2015-02-04 00:00:00 34.39 1.027871
2015-02-04 00:30:00 34.49 1.03086
Many thanks,
Try the following:
df.groupby(df.index.year).transform(lambda x: x/x.mean())
Refer: Group By: split-apply-combine
Transformation is recommended as it is meant to perform some group-specific computations and return a like-indexed object.
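Applied to the four rows from the question, the transform approach looks like this (a sketch; the 'ratio' column name is made up):

```python
import pandas as pd

date = ['2015-02-03 23:00:00', '2015-02-03 23:30:00',
        '2015-02-04 00:00:00', '2015-02-04 00:30:00']
df = pd.DataFrame({'value column': [33.24, 31.71, 34.39, 34.49]},
                  index=pd.to_datetime(date))

# transform returns a like-indexed Series, so the per-year ratio
# lines up with the original hourly rows
df['ratio'] = df.groupby(df.index.year)['value column'].transform(lambda x: x / x.mean())
print(df)  # first ratio ~0.993499, matching the expected result
```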
I just found another way, which I'm not sure I fully understand, but it works!
df['result'] = df['value column'].groupby(df.index.year).apply(lambda x: x/x.mean())
I thought that in apply functions, x referred to single values of the array, but it seems that it refers to the group itself.
You should be able to do:
df = (df.set_index(df.index.year)/df.groupby(df.index.year).mean()).set_index(df.index)
So you set the index to be the year in order to divide by the groupby object, and then reset the index to keep the original timestamps.
Full example:
import pandas as pd
import numpy as np
np.random.seed(1)
dr = pd.date_range('1-1-2010','1-1-2020', freq='H')
df = pd.DataFrame({'value column':np.random.rand(len(dr))}, index=dr)
print(df, '\n')
print(df.groupby(df.index.year).mean(), '\n')
df = (df.set_index(df.index.year)/df.groupby(df.index.year).mean()).set_index(df.index)
print(df)
Output:
#original data
value column
2010-01-01 00:00:00 0.417022
2010-01-01 01:00:00 0.720324
2010-01-01 02:00:00 0.000114
2010-01-01 03:00:00 0.302333
2010-01-01 04:00:00 0.146756
...
2019-12-31 20:00:00 0.530828
2019-12-31 21:00:00 0.224505
2019-12-31 22:00:00 0.459977
2019-12-31 23:00:00 0.931504
2020-01-01 00:00:00 0.581869
[87649 rows x 1 columns]
#grouped by year
value column
2010 0.497135
2011 0.503547
2012 0.501023
2013 0.497848
2014 0.497065
2015 0.501417
2016 0.498303
2017 0.499266
2018 0.499533
2019 0.492220
2020 0.581869
#final output
value column
2010-01-01 00:00:00 0.838851
2010-01-01 01:00:00 1.448952
2010-01-01 02:00:00 0.000230
2010-01-01 03:00:00 0.608150
2010-01-01 04:00:00 0.295203
...
2019-12-31 20:00:00 1.078436
2019-12-31 21:00:00 0.456107
2019-12-31 22:00:00 0.934494
2019-12-31 23:00:00 1.892455
2020-01-01 00:00:00 1.000000
[87649 rows x 1 columns]
I have a dataframe with a datetime64[ns] column holding hourly data:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also in a datetime64[ns] column, but containing only the day, with no time component:
Dates
2020-02-28
2020-02-29
....
How can I delete all the days in the first dataframe df that occur in the second dataframe's Dates? So that df becomes:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to 0, so you can filter by Series.isin with an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
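Both steps together as a runnable sketch, using the sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Datum": pd.to_datetime(["2020-01-01 00:00:00", "2020-01-01 01:00:00",
                             "2020-02-28 00:00:00", "2020-03-01 00:00:00"]),
    "Values": [1, 10, 5, 4],
})
df1 = pd.DataFrame({"Dates": pd.to_datetime(["2020-02-28", "2020-02-29"])})

# floor('d') zeroes the time so hourly stamps compare equal to plain dates
mask = df["Datum"].dt.floor("d").isin(df1["Dates"])
df["flag"] = mask.astype("int")  # 1 marks a closing day
out = df[~mask]                  # closing-day rows removed
```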
Taking your added comment into consideration.
Build a string of the Dates in df1:
c="|".join(df1.Dates.values)
c
Coerce Datum to datetime
df['Datum']=pd.to_datetime(df['Datum'])
df.dtypes
Extract Datum as Dates, dtype string:
df.set_index(df['Datum'],inplace=True)
df['Dates']=df.index.date.astype(str)
Boolean-select the dates that are in both:
m=df.Dates.str.contains(c)
m
Mark inclusive dates as 0 and exclusive as 1
df['drop']=np.where(m,0,1)
df
Drop the unwanted rows:
df = df[df['drop'] == 1].reset_index(drop=True).drop(columns=['Dates', 'drop'])
Outcome: df without the closing-day rows.
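The steps above, collected into one runnable sketch (sample frames assumed, and with the final row filter made explicit):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Datum": ["2020-01-01 00:00:00", "2020-02-28 00:00:00", "2020-03-01 00:00:00"],
    "Values": [1, 5, 4],
})
df1 = pd.DataFrame({"Dates": ["2020-02-28", "2020-02-29"]})

c = "|".join(df1.Dates.values)            # regex alternation of closing days
df["Datum"] = pd.to_datetime(df["Datum"])
df.set_index(df["Datum"], inplace=True)
df["Dates"] = df.index.date.astype(str)
m = df.Dates.str.contains(c)              # True where the date is a closing day
df["drop"] = np.where(m, 0, 1)
out = df[df["drop"] == 1].reset_index(drop=True).drop(columns=["Dates", "drop"])
```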
There is a pandas dataframe like this:
index
2018-06-01 02:50:00 R 45.48 -2.8
2018-06-01 07:13:00 R 45.85 -2.0
...
2018-06-01 08:37:00 R 45.87 -2.7
I would like to round the index to the hour like this:
index
2018-06-01 02:00:00 R 45.48 -2.8
2018-06-01 07:00:00 R 45.85 -2.0
...
2018-06-01 08:00:00 R 45.87 -2.7
I am trying the following code:
df = df.date_time.apply(lambda x: x.round('H'))
but it returns a Series instead of a dataframe with the modified index column.
Try using floor:
df.index.floor('H')
Setup:
df = pd.DataFrame(np.arange(25),index=pd.date_range('2018-01-01 01:12:50','2018-01-02 01:12:50',freq='H'),columns=['Value'])
df.head()
Value
2018-01-01 01:12:50 0
2018-01-01 02:12:50 1
2018-01-01 03:12:50 2
2018-01-01 04:12:50 3
2018-01-01 05:12:50 4
df.index = df.index.floor('H')
df.head()
Value
2018-01-01 01:00:00 0
2018-01-01 02:00:00 1
2018-01-01 03:00:00 2
2018-01-01 04:00:00 3
2018-01-01 05:00:00 4
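One thing to note: the question says "round", but the expected output (02:50 → 02:00) is really a floor; round('H') goes to the nearest hour. A quick sketch of the difference:

```python
import pandas as pd

idx = pd.DatetimeIndex(["2018-06-01 02:50:00", "2018-06-01 07:13:00"])
print(idx.round("H"))  # 02:50 rounds up to 03:00; 07:13 rounds down to 07:00
print(idx.floor("H"))  # 02:00 and 07:00 - matches the expected output
```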
Try my method:
Add a new column holding the index rounded to the hour:
df['E'] = df.index.round('H')
Set it as index:
df1 = df.set_index('E')
Delete the name you set('E' here):
df1.index.name = None
And now, df1 is a new DataFrame with index hour rounded from df.
Try this (the minute term in the original, 60 * (dt.minute // 60), is always 0, so the minute can simply be omitted):
import datetime
df['index'].apply(lambda dt: datetime.datetime(dt.year, dt.month, dt.day, dt.hour))
So I have a 'Date' column in my data frame where the dates have the format like this
0 1998-08-26 04:00:00
If I only want the Year month and day how do I drop the trivial hour?
The quickest way is to use DatetimeIndex's normalize (you first need to make the column a DatetimeIndex):
In [11]: df = pd.DataFrame({"t": pd.date_range('2014-01-01', periods=5, freq='H')})
In [12]: df
Out[12]:
t
0 2014-01-01 00:00:00
1 2014-01-01 01:00:00
2 2014-01-01 02:00:00
3 2014-01-01 03:00:00
4 2014-01-01 04:00:00
In [13]: pd.DatetimeIndex(df.t).normalize()
Out[13]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01, ..., 2014-01-01]
Length: 5, Freq: None, Timezone: None
In [14]: df['date'] = pd.DatetimeIndex(df.t).normalize()
In [15]: df
Out[15]:
t date
0 2014-01-01 00:00:00 2014-01-01
1 2014-01-01 01:00:00 2014-01-01
2 2014-01-01 02:00:00 2014-01-01
3 2014-01-01 03:00:00 2014-01-01
4 2014-01-01 04:00:00 2014-01-01
DatetimeIndex also has some other useful attributes, e.g. .year, .month, .day.
From 0.15 there will be a dt accessor, so you can access this (and other methods) with:
df.t.dt.normalize()
# equivalent to
pd.DatetimeIndex(df.t).normalize()
Another option
df['my_date_column'].dt.date
Would give
0 2019-06-15
1 2019-06-15
2 2019-06-15
3 2019-06-15
4 2019-06-15
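One difference between the two options worth knowing: `.dt.normalize()` keeps the column as datetime64[ns] (with the time zeroed), whereas `.dt.date` produces Python date objects in an object-dtype column. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"t": pd.to_datetime(["1998-08-26 04:00:00", "1998-08-27 15:30:00"])})

norm = df["t"].dt.normalize()  # datetime64[ns], times set to midnight
dates = df["t"].dt.date        # object column of datetime.date
print(norm.dtype, dates.dtype)
```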
Another possibility is using str.split:
df['Date'] = df['Date'].str.split(' ',expand=True)[0]
This splits the 'Date' column into two columns, labeled 0 and 1, using the whitespace between the date and time as the separator.
Column 0 of the returned dataframe then includes the date, and column 1 includes the time.
Then it sets the 'Date' column of your original dataframe to column [0] which should be just the date.
At read_csv time, using date_parser:
to_date = lambda times : [t[0:10] for t in times]
df = pd.read_csv('input.csv',
parse_dates={'date': ['time']},
date_parser=to_date,
index_col='date')
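Note that date_parser is deprecated in recent pandas versions; a sketch of an equivalent without it (the CSV content here is invented) is to parse normally and normalize afterwards:

```python
import pandas as pd
from io import StringIO

csv = StringIO("time,value\n1998-08-26 04:00:00,1\n1998-08-27 15:30:00,2\n")
df = pd.read_csv(csv, parse_dates=["time"])
df["date"] = df["time"].dt.normalize()  # drop the time-of-day part
df = df.set_index("date")
```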