If I have some data (24 hours time series) read into Pandas:
import pandas as pd
import numpy as np
#read CSV file
df = pd.read_csv('https://raw.githubusercontent.com/bbartling/Building-Demand-Electrical-Load-Profiles/master/july15.csv',
index_col='Date', parse_dates=True)
How can I average the df column kW between these time stamps into a new separate pandas df?
bkps_timestamps_kW = [
'2013-06-19 00:15:00',
'2013-06-19 05:15:00',
'2013-06-19 16:30:00',
'2014-06-18 00:00:00']
The new pandas df could have column names something like avg_kw1, avg_kw2, avg_kw3 that would represent the averaging of the data between the time stamps in bkps_timestamps_kW
thanks for any help/tips
I think you need cut for binning by your list converted to datetimes, and then aggregate the mean:
d = [
'2013-06-19 00:00:00',
'2013-06-19 00:15:00',
'2013-06-19 01:15:00',
'2013-06-19 05:15:00',
'2013-06-19 07:15:00',
'2013-06-19 16:30:00',
'2013-06-20 16:30:00',
'2014-06-18 00:00:00',
'2015-06-18 00:00:00']
df = pd.DataFrame({'Date':range(len(d))}, index=pd.to_datetime(d))
print (df)
Date
2013-06-19 00:00:00 0
2013-06-19 00:15:00 1
2013-06-19 01:15:00 2
2013-06-19 05:15:00 3
2013-06-19 07:15:00 4
2013-06-19 16:30:00 5
2013-06-20 16:30:00 6
2014-06-18 00:00:00 7
2015-06-18 00:00:00 8
bkps_timestamps_kW = [
'2013-06-19 00:15:00',
'2013-06-19 05:15:00',
'2013-06-19 16:30:00',
'2014-06-18 00:00:00']
b = pd.to_datetime(bkps_timestamps_kW)
labels = [f'{i}-{j}' for i, j in zip(bkps_timestamps_kW[:-1], bkps_timestamps_kW[1:])]
df = df.groupby(pd.cut(df.index, bins=b, labels=labels)).mean()
print (df)
Date
2013-06-19 00:15:00-2013-06-19 05:15:00 2.5
2013-06-19 05:15:00-2013-06-19 16:30:00 4.5
2013-06-19 16:30:00-2014-06-18 00:00:00 6.5
If you need left-closed intervals in cut:
df = df.groupby(pd.cut(df.index, bins=b, labels=labels, right=False)).mean()
print (df)
Date
2013-06-19 00:15:00-2013-06-19 05:15:00 1.5
2013-06-19 05:15:00-2013-06-19 16:30:00 3.5
2013-06-19 16:30:00-2014-06-18 00:00:00 5.5
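Applied to the original kW data, a minimal sketch (assuming the CSV really has a 'kW' column, as the question states) could bin the index the same way and lay the means out as one row:
b = pd.to_datetime(bkps_timestamps_kW)
avg = df.groupby(pd.cut(df.index, bins=b))['kW'].mean()
# one row; one column per interval: avg_kw1, avg_kw2, avg_kw3
out = pd.DataFrame([avg.to_numpy()],
                   columns=[f'avg_kw{i+1}' for i in range(len(avg))])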
Related
I have to find the difference between the data provided at 00:00:00 and at 23:59:59 for each day, over seven days.
How can I compute, for each day in the data frame, the difference between the value at the start of the day and the value at the end of the day?
Sample Data
Date Data
2018-12-01 00:00:00 2
2018-12-01 12:00:00 5
2018-12-01 23:59:59 10
2018-12-02 00:00:00 12
2018-12-02 12:00:00 15
2018-12-02 23:59:59 22
Expected Output
Date Data
2018-12-01 8
2018-12-02 10
Example
data = {
'Date': ['2018-12-01 00:00:00', '2018-12-01 12:00:00', '2018-12-01 23:59:59',
'2018-12-02 00:00:00', '2018-12-02 12:00:00', '2018-12-02 23:59:59'],
'Data': [2, 5, 10, 12, 15, 22]
}
df = pd.DataFrame(data)
Code
df['Date'] = pd.to_datetime(df['Date'])
out = (df.resample('D', on='Date')['Data']
.agg(lambda x: x.iloc[-1] - x.iloc[0]).reset_index())
out
Date Data
0 2018-12-01 8
1 2018-12-02 10
Update
A more efficient way: you can get the same result with the following code:
g = df.resample('D', on='Date')['Data']
out = g.last().sub(g.first()).reset_index()
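first and last dispatch to fast built-in aggregations instead of calling a Python lambda once per group, which is where the speedup comes from. An equivalent spelling on the same sample data, for comparison:
agg = df.resample('D', on='Date')['Data'].agg(['first', 'last'])
out = (agg['last'] - agg['first']).rename('Data').reset_index()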
You can use groupby and take the max-min range per day. Note this matches the last-minus-first difference above only when the value never decreases within a day, which is true for the sample data.
import pandas as pd
df = pd.DataFrame({
'Date': ['2018-12-01 00:00:00', '2018-12-01 12:00:00', '2018-12-01 23:59:59',
'2018-12-02 00:00:00', '2018-12-02 12:00:00', '2018-12-02 23:59:59'],
'Data': [2, 5, 10, 12, 15, 22]
})
df['Date'] = pd.to_datetime(df['Date'])
df['Date_Only'] = df['Date'].dt.date
result = df.groupby('Date_Only').apply(lambda x: x['Data'].max() - x['Data'].min())
print(result)
Below is a sample of my df
date value
0006-03-01 00:00:00 1
0006-03-15 00:00:00 2
0006-05-15 00:00:00 1
0006-07-01 00:00:00 3
0006-11-01 00:00:00 1
2009-05-20 00:00:00 2
2009-05-25 00:00:00 8
2020-06-24 00:00:00 1
2020-06-30 00:00:00 2
2020-07-01 00:00:00 13
2020-07-15 00:00:00 2
2020-08-01 00:00:00 4
2020-10-01 00:00:00 2
2020-11-01 00:00:00 4
2023-04-01 00:00:00 1
2218-11-12 10:00:27 1
4000-01-01 00:00:00 6
5492-04-15 00:00:00 1
5496-03-15 00:00:00 1
5589-12-01 00:00:00 1
7199-05-15 00:00:00 1
9186-12-30 00:00:00 1
As you can see, the data contains some misspelled dates.
Questions:
How can we convert this column to the format dd.mm.yyyy?
How can we replace rows where the year is greater than 2022 with 01.01.2100?
How can we remove all rows where the year is less than 2005?
The final output should look like this.
date value
20.05.2009 2
25.05.2009 8
24.06.2020 1
30.06.2020 2
01.07.2020 13
15.07.2020 2
01.08.2020 4
01.10.2020 2
01.11.2020 4
01.01.2100 1
01.01.2100 1
01.01.2100 6
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
I tried to convert the column using to_datetime but it failed.
df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
Out of bounds nanosecond timestamp: 5-03-01 00:00:00
Thanks to anyone helping!
You could check the first element of your datetime strings after a split on '-' and clean up / replace based on its integer value. For the small values like '0006', calling pd.to_datetime with errors='coerce' will do the trick: it leaves NaT for the invalid dates, which you can then drop with dropna(). Example:
import pandas as pd
df = pd.DataFrame({'date': ['0006-03-01 00:00:00',
'0006-03-15 00:00:00',
'0006-05-15 00:00:00',
'0006-07-01 00:00:00',
'0006-11-01 00:00:00',
'nan',
'2009-05-25 00:00:00',
'2020-06-24 00:00:00',
'2020-06-30 00:00:00',
'2020-07-01 00:00:00',
'2020-07-15 00:00:00',
'2020-08-01 00:00:00',
'2020-10-01 00:00:00',
'2020-11-01 00:00:00',
'2023-04-01 00:00:00',
'2218-11-12 10:00:27',
'4000-01-01 00:00:00',
'NaN',
'5496-03-15 00:00:00',
'5589-12-01 00:00:00',
'7199-05-15 00:00:00',
'9186-12-30 00:00:00']})
# first, drop rows where 'date' contains 'nan' (case-insensitive):
df = df.loc[~df['date'].str.contains('nan', case=False)]
# now replace strings where the year is above a threshold:
df.loc[df['date'].str.split('-').str[0].astype(int) > 2022, 'date'] = '2100-01-01 00:00:00'
# convert to datetime, if year is too low, will result in NaT:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['date']
# 0 NaT
# 1 NaT
# 2 NaT
# 3 NaT
# 4 NaT
# 6 2009-05-25
# ...
df = df.dropna()
# df
# date
# 6 2009-05-25
# 7 2020-06-24
# 8 2020-06-30
# 9 2020-07-01
# 10 2020-07-15
# 11 2020-08-01
# 12 2020-10-01
# 13 2020-11-01
# 14 2100-01-01
# 15 2100-01-01
# ...
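To get the dd.mm.yyyy presentation the question asks for, one final step (a sketch; note this turns the column back into strings):
# datetime -> 'dd.mm.yyyy' strings, e.g. 2009-05-25 -> '25.05.2009'
df['date'] = df['date'].dt.strftime('%d.%m.%Y')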
Due to the limitations of pandas, the out of bounds error is thrown (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). This code will remove values that would cause this error before creating the dataframe.
import datetime as dt
import pandas as pd
data = [[dt.datetime(year=2022, month=3, day=1), 1],
[dt.datetime(year=2009, month=5, day=20), 2],
[dt.datetime(year=2001, month=5, day=20), 2],
[dt.datetime(year=2023, month=12, day=30), 3],
[dt.datetime(year=6, month=12, day=30), 3]]
dataCleaned = [elements for elements in data if pd.Timestamp.max > elements[0] > pd.Timestamp.min]
df = pd.DataFrame(dataCleaned, columns=['date', 'Value'])
print(df)
# OUTPUT
date Value
0 2022-03-01 1
1 2009-05-20 2
2 2001-05-20 2
3 2023-12-30 3
df.loc[df.date.dt.year > 2022, 'date'] = dt.datetime(year=2100, month=1, day=1)
df.drop(df.loc[df.date.dt.year < 2005, 'date'].index, inplace=True)
print(df)
#OUTPUT
date Value
0 2022-03-01 1
1 2009-05-20 2
3 2100-01-01 3
If you still want to include the dates that throw the out of bounds error, check out How to work around Python Pandas DataFrame's "Out of bounds nanosecond timestamp" error?
I suggest the following:
import numpy as np
df = pd.DataFrame.from_dict({'date': ['0003-03-01 00:00:00',
'7199-05-15 00:00:00',
'2020-10-21 00:00:00'],
'value': [1, 2, 3]})
df['date'] = [d[8:10] + '.' + d[5:7] + '.' + d[:4] if '2004' < d[:4] < '2023'
              else '01.01.2100' if d[:4] > '2022' else np.nan for d in df['date']]
df.dropna(inplace = True)
This yields the desired output (the string comparisons work because the years are zero-padded four-digit strings, so lexicographic order matches numeric order):
date value
01.01.2100 2
21.10.2020 3
MRE:
idx = pd.date_range('2015-07-03 08:00:00', periods=30, freq='H')
data = np.random.randint(1, 100, size=len(idx))
df = pd.DataFrame({'index':idx, 'col':data})
df.set_index("index", inplace=True)
which looks like:
col
index
2015-07-03 08:00:00 96
2015-07-03 09:00:00 79
2015-07-03 10:00:00 15
2015-07-03 11:00:00 2
2015-07-03 12:00:00 84
2015-07-03 13:00:00 86
2015-07-03 14:00:00 5
.
.
.
Note that the dataframe contains multiple days. Since the frequency is hourly, starting from 07/03 08:00:00 it will contain hourly data.
I want to get all data from 05:00:00 on day 07/03 onwards, even if those rows contain the value 0 in the "col" column.
I want to extend the dataframe backwards so it starts from 05:00:00.
No, I can't just start from 05:00:00, since I already have a dataframe that starts from 08:00:00. I am trying to keep everything the same but add 3 rows at the beginning to include 05:00:00, 06:00:00, and 07:00:00.
The reindex method is handy for changing the index values:
idx = pd.date_range('2015-07-03 08:00:00', periods=30, freq='H')
data = np.random.randint(1, 100, size=len(idx))
# use the index param to set index or you might lose the freq
df = pd.DataFrame({'col':data}, index=idx)
# reindex with a new index
start = df.index[0] - pd.Timedelta(hours=3)  # df.tshift(-3) also worked, but tshift was removed in pandas 2.0
end = df.index[-1]
new_index = pd.date_range(start, end, freq='H')
new_df = df.reindex(new_index)
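The question mentions that the prepended rows may hold the value 0; if you want 0 instead of NaN there, reindex accepts a fill_value (a sketch under that assumption):
new_df = df.reindex(new_index, fill_value=0)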
resample is also very useful for date indices
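For example, resample can fill gaps inside the existing range (though it cannot extend past the first or last timestamp, so it does not replace reindex here); a sketch:
filled = df.resample('H').asfreq()  # inserts any missing hours as NaN rows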
Just change the time from 08:00:00 to 05:00:00 in your code: create 3 more rows and prepend that dataframe to the existing one.
idx1 = pd.date_range('2015-07-03 05:00:00', periods=3,freq='H')
df1 = pd.DataFrame({'index': idx1 ,'col':np.random.randint(1,100,size = 3)})
df1.set_index('index',inplace=True)
df = pd.concat([df1, df])  # DataFrame.append was removed in pandas 2.0
print(df)
Add this snippet to your code...
I have a dataframe with data for each minute; it also contains a date column which keeps track of the date in timestamp format.
Here I'm trying to aggregate the data by hour instead of by minute.
I tried the following code, which works, but it needs to index on the date column, which I don't want because then I cannot loop through the dataframe using df.loc.
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='T')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
df = df.set_index('date')  # assign back; set_index is not in-place by default
df = df.resample('H').sum()
df.head(15)
I also tried groupby, but it's not working; the following is the code:
df.groupby([df.date.dt.hour]).data.sum()
print(df.head(15))
How can I group by date without setting it as the index?
Thanks.
Try pd.Grouper and specify the freq parameter:
df.groupby([pd.Grouper(key='date', freq='1H')]).sum()
Full code:
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='T')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
print(df.groupby([pd.Grouper(key='date', freq='1H')]).sum())
# data
# date
# 2018-01-01 00:00:00 2958
# 2018-01-01 01:00:00 3084
# 2018-01-01 02:00:00 2991
# 2018-01-01 03:00:00 3021
# 2018-01-01 04:00:00 2894
# ... ...
# 2018-01-07 20:00:00 2863
# 2018-01-07 21:00:00 2850
# 2018-01-07 22:00:00 2823
# 2018-01-07 23:00:00 2805
# 2018-01-08 00:00:00 25
# [169 rows x 1 columns]
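As an aside, an equivalent spelling without pd.Grouper, under the same assumptions, is to group on the hour-floored timestamps:
out = df.groupby(df['date'].dt.floor('H'))['data'].sum()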
Hope that helps!
I have a csv-file with time series data, the first column is the date in the format %Y:%m:%d and the second column is the intraday time in the format '%H:%M:%S'. I would like to import this csv-file into a multiindex dataframe or panel object.
With this code, it already works:
_file_data = pd.read_csv(_file,
sep=",",
header=0,
index_col=['Date', 'Time'],
thousands="'",
parse_dates=True,
skipinitialspace=True
)
It returns the data in the following format:
Date Time Volume
2016-01-04 2018-04-25 09:01:29 53645
2018-04-25 10:01:29 123
2018-04-25 10:01:29 1345
....
2016-01-05 2018-04-25 10:01:29 123
2018-04-25 12:01:29 213
2018-04-25 10:01:29 123
1st question:
I would like to show the second index as a pure time object, not datetime. To do that, I have to declare two different date parsers in the read_csv function, but I can't figure out how. What is the "best" way to do that?
2nd question:
After I created the Dataframe, I converted it to a panel-object. Would you recommend doing that? Is the panel-object the better choice for such a data structure? What are the benefits (drawbacks) of a panel-object?
1st question:
You can create multiple converters and pass the parsers in a dictionary:
import pandas as pd
from io import StringIO
temp=u"""Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:04,09:20:00,0
2016:01:04,09:40:00,0
2016:01:04,10:00:00,1468
2016:01:05,10:00:00,246
2016:01:05,10:20:00,0
2016:01:05,10:40:00,0
2016:01:05,11:00:00,0
2016:01:05,11:20:00,0
2016:01:05,11:40:00,0
2016:01:05,12:00:00,213"""
def converter1(x):
    # convert to datetime, then keep only the time component
    return pd.to_datetime(x).time()

def converter2(x):
    # parse the date with its explicit %Y:%m:%d format
    return pd.to_datetime(x, format='%Y:%m:%d')
# after testing, replace StringIO(temp) with 'filename.csv'
# (pd.compat.StringIO was removed from pandas; use io.StringIO instead)
df = pd.read_csv(StringIO(temp),
index_col=['Date','Time'],
thousands="'",
skipinitialspace=True,
converters={'Time': converter1, 'Date': converter2})
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
Sometimes it is possible to use the built-in parser, e.g. if the format of the dates is YYYY-MM-DD:
import pandas as pd
from io import StringIO
temp=u"""Date,Time,Volume
2016-01-04,09:00:00,53645
2016-01-04,09:20:00,0
2016-01-04,09:40:00,0
2016-01-04,10:00:00,1468
2016-01-05,10:00:00,246
2016-01-05,10:20:00,0
2016-01-05,10:40:00,0
2016-01-05,11:00:00,0
2016-01-05,11:20:00,0
2016-01-05,11:40:00,0
2016-01-05,12:00:00,213"""
def converter(x):
    # convert to datetime, then keep only the time component
    return pd.to_datetime(x).time()
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp),
index_col=['Date','Time'],
parse_dates=['Date'],
thousands="'",
skipinitialspace=True,
converters={'Time': converter})
print (df.index.get_level_values(0))
DatetimeIndex(['2016-01-04', '2016-01-04', '2016-01-04', '2016-01-04',
'2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
'2016-01-05', '2016-01-05', '2016-01-05'],
dtype='datetime64[ns]', name='Date', freq=None)
The last possible solution is to convert the datetimes to times in the MultiIndex with set_levels, after processing:
df.index = df.index.set_levels(df.index.get_level_values(1).time, level=1)
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
2nd question:
Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so a MultiIndex DataFrame is the better choice here.
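If you need a Panel-like 2-D view anyway, a sketch from the MultiIndex DataFrame built above (one row per Date, one column per Time):
wide = df['Volume'].unstack('Time')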
To convert to a time series use pd.to_timedelta.
Ex:
import pandas as pd
df = pd.DataFrame({"Time": ["2018-04-25 09:01:29", "2018-04-25 10:01:29", "2018-04-25 10:01:29"]})
df["Time"] = pd.to_timedelta(pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S'))
print df["Time"]
Output:
0 09:01:29
1 10:01:29
2 10:01:29
Name: Time, dtype: timedelta64[ns]
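If the strings always end in an HH:MM:SS part, a shorter route (a sketch, under that assumption about the input format) is to split off the time portion and feed it straight to pd.to_timedelta:
df["Time"] = pd.to_timedelta(df["Time"].str.split().str[1])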