So I have a 'Date' column in my data frame where the dates are formatted like this:
0 1998-08-26 04:00:00
If I only want the year, month, and day, how do I drop the time-of-day component?
The quickest way is to use DatetimeIndex's normalize (you first need to make the column a DatetimeIndex):
In [11]: df = pd.DataFrame({"t": pd.date_range('2014-01-01', periods=5, freq='H')})
In [12]: df
Out[12]:
t
0 2014-01-01 00:00:00
1 2014-01-01 01:00:00
2 2014-01-01 02:00:00
3 2014-01-01 03:00:00
4 2014-01-01 04:00:00
In [13]: pd.DatetimeIndex(df.t).normalize()
Out[13]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01, ..., 2014-01-01]
Length: 5, Freq: None, Timezone: None
In [14]: df['date'] = pd.DatetimeIndex(df.t).normalize()
In [15]: df
Out[15]:
t date
0 2014-01-01 00:00:00 2014-01-01
1 2014-01-01 01:00:00 2014-01-01
2 2014-01-01 02:00:00 2014-01-01
3 2014-01-01 03:00:00 2014-01-01
4 2014-01-01 04:00:00 2014-01-01
DatetimeIndex also has some other useful attributes, e.g. .year, .month, .day.
From 0.15 there will be a dt accessor, so you can access this (and other methods) with:
df.t.dt.normalize()
# equivalent to
pd.DatetimeIndex(df.t).normalize()
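For example, a quick self-contained sketch pulling out those components (the added column names are just illustrative):
import pandas as pd

df = pd.DataFrame({"t": pd.date_range('2014-01-01', periods=5, freq='H')})

# .year/.month/.day work on a DatetimeIndex and, from 0.15, via the .dt accessor
df['year'] = df.t.dt.year    # 2014 for every row here
df['month'] = df.t.dt.month  # 1
df['day'] = df.t.dt.day      # 1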
Another option:
df['my_date_column'].dt.date
would give:
0 2019-06-15
1 2019-06-15
2 2019-06-15
3 2019-06-15
4 2019-06-15
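Note that .dt.date returns plain Python datetime.date objects, so the resulting column has object dtype; if you want to stay in datetime64[ns], .dt.normalize() zeroes the time while keeping the dtype. A minimal sketch of the difference:
import pandas as pd

s = pd.Series(pd.date_range('2019-06-15', periods=3, freq='H'))
print(s.dt.date.dtype)         # object (datetime.date values)
print(s.dt.normalize().dtype)  # datetime64[ns] (times set to midnight)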
Another possibility is using str.split:
df['Date'] = df['Date'].str.split(' ', expand=True)[0]
With expand=True, this splits the 'Date' column into a DataFrame with two columns labelled 0 and 1, using the whitespace between the date and the time as the separator. Column 0 of the returned DataFrame holds the date, and column 1 holds the time.
Indexing with [0] then assigns just the date back to the 'Date' column of your original dataframe.
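A minimal end-to-end sketch of this approach (the sample values are illustrative):
import pandas as pd

df = pd.DataFrame({'Date': ['1998-08-26 04:00:00', '1998-08-27 05:30:00']})
parts = df['Date'].str.split(' ', expand=True)  # column 0 = date, column 1 = time
df['Date'] = parts[0]
print(df)
#          Date
# 0  1998-08-26
# 1  1998-08-27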
At read_csv time, with date_parser:
to_date = lambda times: [t[0:10] for t in times]
df = pd.read_csv('input.csv',
                 parse_dates={'date': ['time']},
                 date_parser=to_date,
                 index_col='date')
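A self-contained sketch of the same idea, using io.StringIO in place of a real input.csv and parsing the truncated strings explicitly (note that date_parser is deprecated in recent pandas):
import io
import pandas as pd

csv_data = "time,value\n1998-08-26 04:00:00,1\n1998-08-27 05:30:00,2\n"

# Keep only the first 10 characters ('YYYY-MM-DD') and parse them as dates.
to_date = lambda times: pd.to_datetime([t[:10] for t in times])

df = pd.read_csv(io.StringIO(csv_data),
                 parse_dates={'date': ['time']},
                 date_parser=to_date,
                 index_col='date')
print(df.index)  # DatetimeIndex(['1998-08-26', '1998-08-27'], ...)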
Related
I have the following time series and I want to convert it to datetime in a DataFrame using pd.to_datetime. I am getting the following error: "hour must be in 0..23: 2017/ 01/01 24:00:00". How can I get around this error?
DateTime
0 2017/ 01/01 01:00:00
1 2017/ 01/01 02:00:00
2 2017/ 01/01 03:00:00
3 2017/ 01/01 04:00:00
...
22 2017/ 01/01 23:00:00
23 2017/ 01/01 24:00:00
Given:
DateTime
0 2017/01/01 01:00:00
1 2017/01/01 02:00:00
2 2017/01/01 03:00:00
3 2017/01/01 04:00:00
4 2017/01/01 23:00:00
5 2017/01/01 24:00:00
As the error says, 24:00:00 isn't a valid time. Depending on what it actually means, we can salvage it like this:
# Split up your Date and Time Values into separate Columns:
df[['Date', 'Time']] = df.DateTime.str.split(expand=True)
# Convert them separately, one as datetime, the other as timedelta.
df.Date = pd.to_datetime(df.Date)
df.Time = pd.to_timedelta(df.Time)
# Fix your DateTime Column, Drop the helper Columns:
df.DateTime = df.Date + df.Time
df = df.drop(['Date', 'Time'], axis=1)
print(df)
print(df.dtypes)
Output:
DateTime
0 2017-01-01 01:00:00
1 2017-01-01 02:00:00
2 2017-01-01 03:00:00
3 2017-01-01 04:00:00
4 2017-01-01 23:00:00
5 2017-01-02 00:00:00
DateTime datetime64[ns]
dtype: object
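This works because pd.to_timedelta('24:00:00') is exactly one day, so adding it to the midnight date rolls forward to the next day's 00:00:00, which is usually what a 24:00:00 timestamp is meant to express.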
Alternatively, parse with an explicit format that matches the data and coerce invalid rows:
df['DateTime'] = pd.to_datetime(df['DateTime'], format='%Y/ %m/%d %H:%M:%S', errors='coerce')
Try this out! Note that with errors='coerce' the unparseable 24:00:00 row becomes NaT instead of rolling over to the next day.
I am trying to take a dataframe, which has timestamps and various other fields, group by rounded timestamps (to the nearest minute), and take the average of another field. I'm getting the error "No numeric types to aggregate".
I am rounding the timestamp column like such:
df['time'] = df['time'].dt.round('1min')
The aggregation column is of the form: 0 days 00:00:00.054000
df3 = (
    df
    .groupby([df['time']])['total_diff'].mean()
    .unstack(fill_value=0)
    .reset_index()
)
I realize the total_diff column is a time delta field, but I would have thought this would still be considered numerical?
My ideal output would have the following columns: Rounded timestamp, number of records that are grouped in that timestamp, average total_diff by rounded timestamp. How can I achieve this?
EDIT
Example row:
[index, id, time, total_diff]
[400, 5fdfe9242c2fb0da04928d55, 2020-12-21 00:16:00, 0 days 00:00:00.055000]
[401, 5fdfe9242c2fb0da04928d56, 2020-12-21 00:16:00, 0 days 00:00:00.01000]
[402, 5fdfe9242c2fb0da04928d57, 2020-12-21 00:15:00, 0 days 00:00:00.05000]
The time column is not unique. I want to group by the time column, count the number of rows that are grouped into each time bucket, and produce an average of the total_diff for each time bucket.
Desired outcome:
[time, count, avg_total_diff]
[2020-12-21 00:16:00, 2, .0325]
By default DataFrame.groupby.mean has numeric_only=True, and "numeric" only considers int, bool and float. To also work with timedelta64[ns] you must set this to False.
Sample Data
import pandas as pd
df = pd.DataFrame(pd.date_range('2010-01-01', freq='2T', periods=6))
df[1] = df[0].diff().bfill()
# 0 1
#0 2010-01-01 00:00:00 0 days 00:02:00
#1 2010-01-01 00:02:00 0 days 00:02:00
#2 2010-01-01 00:04:00 0 days 00:02:00
#3 2010-01-01 00:06:00 0 days 00:02:00
#4 2010-01-01 00:08:00 0 days 00:02:00
#5 2010-01-01 00:10:00 0 days 00:02:00
df.dtypes
#0 datetime64[ns]
#1 timedelta64[ns]
#dtype: object
Code
df.groupby(df[0].round('5T'))[1].mean()
#DataError: No numeric types to aggregate
df.groupby(df[0].round('5T'))[1].mean(numeric_only=False)
#0
#2010-01-01 00:00:00 0 days 00:02:00
#2010-01-01 00:05:00 0 days 00:02:00
#2010-01-01 00:10:00 0 days 00:02:00
#Name: 1, dtype: timedelta64[ns]
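To also get the row count per bucket, as asked for in the desired outcome, a sketch continuing from the sample data above, building the count and the average separately with the same numeric_only workaround:
g = df.groupby(df[0].round('5T'))[1]
out = pd.DataFrame({'count': g.count(),
                    'avg_total_diff': g.mean(numeric_only=False)}).reset_index()
print(out)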
I have a dataframe with a datetime64[ns] column in the following format, so I have data on an hourly basis:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also in a datetime64[ns] column, where I only have a day format:
Dates
2020-02-28
2020-02-29
....
How can I delete all days in the first dataframe df which occur in the second dataframe Dates? So that df becomes:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to 0, making it possible to filter with Series.isin and an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
Taking your added comment into consideration:
Build a regex alternation string of the Dates in df1 (astype(str) makes this safe even if the column is already datetime):
c = "|".join(df1.Dates.astype(str))
Coerce Datum to datetime:
df['Datum'] = pd.to_datetime(df['Datum'])
Extract Datum as a Dates column of dtype string:
df.set_index(df['Datum'], inplace=True)
df['Dates'] = df.index.date.astype(str)
Boolean-select the dates that appear in both:
m = df.Dates.str.contains(c)
Mark inclusive dates as 0 and exclusive as 1 (np is numpy, imported as np):
import numpy as np
df['drop'] = np.where(m, 0, 1)
Drop the helper column:
df.reset_index(drop=True).drop(columns=['Dates'])
I have a DataFrame with a datetime index.
df1 = pd.DataFrame(index=pd.date_range('20100201', periods=24, freq='8h3min'),
                   data=np.random.rand(24), columns=['Rubbish'])
df1.index = df1.index.to_datetime()
I want to resample this DataFrame, as in :
df1=df1.resample('7D').agg(np.median)
Then I have another DataFrame, with an index of a different frequency and starting at a different offset hour:
df2 = pd.DataFrame(index=pd.date_range('20100205', periods=24, freq='6h3min'),
                   data=np.random.rand(24), columns=['Rubbish'])
df2.index = df2.index.to_datetime()
df2=df2.resample('7D').agg(np.median)
The operations work well independently, but when I try to merge the results using
print(pd.merge(df1,df2,right_index=True,left_index=True,how='outer'))
I get:
Rubbish_x Rubbish_y
2010-02-01 0.585986 NaN
2010-02-05 NaN 0.423316
2010-02-08 0.767499 NaN
While I would like to resample both with the same offset and get the following result after a merge:
Rubbish_x Rubbish_y
2010-02-01 AVALUE AVALUE
2010-02-08 AVALUE AVALUE
I have tried the following, but it only generates NaNs:
df2.reindex(df1.index)
print(pd.merge(df1,df2,right_index=True,left_index=True,how='outer'))
I have to stick to pandas 0.20.1.
I have tried merge_asof:
df1.index
Out[48]: Index([2015-03-24, 2015-03-31, 2015-04-07, 2015-04-14, 2015-04-21, 2015-04-28], dtype='object')
df2.index
Out[49]: Index([2015-03-24, 2015-03-31, 2015-04-07, 2015-04-14, 2015-04-21, 2015-04-28], dtype='object')
output=pd.merge_asof(df1,df2,left_index=True,right_index=True)
but it crashes with the following traceback:
Traceback (most recent call last):
TypeError: 'NoneType' object is not callable
I believe you need merge_asof:
print(pd.merge_asof(df1,df2,right_index=True,left_index=True))
Rubbish_x Rubbish_y
2010-02-01 0.446505 NaN
2010-02-08 0.474330 0.606826
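Note that merge_asof requires the join keys to be sorted, genuine datetime64 (or numeric) values; the object-dtype Index shown in your traceback above is likely what made your merge_asof attempt crash.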
Or pass method='nearest' to reindex:
df2 = df2.reindex(df1.index, method='nearest')
print (df2)
Rubbish
2010-02-01 0.415248
2010-02-08 0.415248
print(pd.merge(df1,df2,right_index=True,left_index=True,how='outer'))
Rubbish_x Rubbish_y
2010-02-01 0.431966 0.415248
2010-02-08 0.279121 0.415248
I think the following code would achieve your task.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Freq: T, dtype: int64
>>> series.resample('3T').sum()
2000-01-01 00:00:00 3
2000-01-01 00:03:00 12
2000-01-01 00:06:00 21
Freq: 3T, dtype: int64
https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.resample.html
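The remaining problem in the question is that the two frames get different 7-day bin edges. A sketch pinning both resamples to a common origin so the bins line up (origin= requires pandas 1.1+, so this won't run on 0.20.1):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(24), columns=['Rubbish'],
                   index=pd.date_range('20100201', periods=24, freq='8h3min'))
df2 = pd.DataFrame(np.random.rand(24), columns=['Rubbish'],
                   index=pd.date_range('20100205', periods=24, freq='6h3min'))

# Resample both frames against the same bin origin so the edges coincide.
common_start = min(df1.index.min(), df2.index.min()).normalize()
df1r = df1.resample('7D', origin=common_start).median()
df2r = df2.resample('7D', origin=common_start).median()

print(pd.merge(df1r, df2r, left_index=True, right_index=True, how='outer'))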
I have a dataframe with only numeric values, and I want to calculate the mean of every column and create a new dataframe.
The original dataframe is indexed by a datetime field. The new dataframe should be indexed by the same field as the original dataframe, with a value equal to the last row index of the original dataframe.
Code so far:
mean_series=df.mean()
df_mean= pd.DataFrame(stddev_series)
df_mean.rename(columns=lambda x: 'std_dev_'+ x, inplace=True)
but this gives an error:
df_mean.rename(columns=lambda x: 'std_mean_'+ x, inplace=True)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S21') dtype('S21') dtype('S21')
Your question implies that you want a new DataFrame with a single row.
In [10]: df.head(10)
Out[10]:
0 1 2 3
2011-01-01 00:00:00 0.182481 0.523784 0.718124 0.063792
2011-01-01 01:00:00 0.321362 0.404686 0.481889 0.524521
2011-01-01 02:00:00 0.514426 0.735809 0.433758 0.392824
2011-01-01 03:00:00 0.616802 0.149099 0.217199 0.155990
2011-01-01 04:00:00 0.525465 0.439633 0.641974 0.270364
2011-01-01 05:00:00 0.749662 0.151958 0.200913 0.219916
2011-01-01 06:00:00 0.665164 0.396595 0.980862 0.560119
2011-01-01 07:00:00 0.797803 0.377273 0.273724 0.220965
2011-01-01 08:00:00 0.651989 0.553929 0.769008 0.545288
2011-01-01 09:00:00 0.692169 0.261194 0.400704 0.118335
In [11]: df.tail()
Out[11]:
0 1 2 3
2011-01-03 19:00:00 0.247211 0.539330 0.734206 0.781125
2011-01-03 20:00:00 0.278550 0.534943 0.804949 0.137291
2011-01-03 21:00:00 0.602246 0.108791 0.987120 0.455887
2011-01-03 22:00:00 0.003097 0.436435 0.987877 0.046066
2011-01-03 23:00:00 0.604916 0.670532 0.513927 0.610775
In [12]: df.mean()
Out[12]:
0 0.495307
1 0.477509
2 0.562590
3 0.447997
dtype: float64
In [13]: new_df = pd.DataFrame(df.mean().to_dict(),index=[df.index.values[-1]])
In [14]: new_df
Out[14]:
0 1 2 3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
In [15]: new_df.rename(columns=lambda c: "mean_"+str(c))
Out[15]:
mean_0 mean_1 mean_2 mean_3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
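A compact alternative sketch producing the same single-row frame, with add_prefix playing the role of the rename (the random data stands in for your numeric columns):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(72, 4),
                  index=pd.date_range('2011-01-01', periods=72, freq='H'))

new_df = df.mean().to_frame().T.add_prefix('mean_')  # one row of column means
new_df.index = [df.index[-1]]                        # indexed by the last timestamp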