I have a pandas dataframe of energy demand vs. time:
0 1
0 20201231T23-07 39815
1 20201231T22-07 41387
2 20201231T21-07 42798
3 20201231T20-07 44407
4 20201231T19-07 45612
5 20201231T18-07 44920
6 20201231T17-07 42617
7 20201231T16-07 41454
8 20201231T15-07 41371
9 20201231T14-07 41793
10 20201231T13-07 42298
11 20201231T12-07 42740
12 20201231T11-07 43185
13 20201231T10-07 42999
14 20201231T09-07 42373
15 20201231T08-07 41273
16 20201231T07-07 38909
17 20201231T06-07 37099
18 20201231T05-07 36022
19 20201231T04-07 35880
20 20201231T03-07 36305
21 20201231T02-07 36988
22 20201231T01-07 38166
23 20201231T00-07 40167
24 20201230T23-07 42624
25 20201230T22-07 44777
26 20201230T21-07 46205
27 20201230T20-07 47324
28 20201230T19-07 48011
29 20201230T18-07 46995
30 20201230T17-07 44902
31 20201230T16-07 44134
32 20201230T15-07 44228
33 20201230T14-07 44813
34 20201230T13-07 45187
35 20201230T12-07 45622
36 20201230T11-07 45831
37 20201230T10-07 45832
38 20201230T09-07 45476
39 20201230T08-07 44145
40 20201230T07-07 41650
I need to convert the time column into hourly data. I know that Python has some tools that can convert dates directly; is there one I could use here, or will I need to do it manually?
Well, just to obtain a time string you could use str.replace (note that the trailing -07 looks like a UTC offset rather than minutes):
df["time"] = df["0"].str.replace(r'^\d{8}T(\d{2})-(\d{2})$', r'\1:\2', regex=True)
Assuming the columns have been renamed to time and demand, and that the time column is currently a string, you could convert it to a datetime using pd.to_datetime and then extract the hour (the format below treats the trailing -07 as minutes, which is harmless here since only the hour is kept).
If you want to calculate, say, the average demand for each hour, you can then use groupby.
df['time'] = pd.to_datetime(df['time'], format="%Y%m%dT%H-%M").dt.hour
df_demand_by_hour = df.groupby('time').mean()
print(df_demand_by_hour)
demand
time
0 40167.0
1 38166.0
2 36988.0
3 36305.0
4 35880.0
5 36022.0
6 37099.0
7 40279.5
8 42709.0
9 43924.5
10 44415.5
11 44508.0
12 44181.0
13 43742.5
14 43303.0
15 42799.5
16 42794.0
17 43759.5
18 45957.5
19 46811.5
20 45865.5
21 44501.5
22 43082.0
23 41219.5
I don't know exactly what the -07 means (it is probably a UTC offset), but you can turn the string into a datetime by doing:
import pandas as pd
import datetime as dt
# drop the trailing -07 and parse just the date and hour
df['0'] = pd.to_datetime(df['0'].str[:11], format='%Y%m%dT%H').dt.strftime('%H:%M:%S')
df
0 1
0 23:00:00 39815
1 22:00:00 41387
2 21:00:00 42798
3 20:00:00 44407
4 19:00:00 45612
...
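For what it's worth, the -07 is almost certainly a UTC offset (the data looks like it comes from a UTC-7 timezone). If you would rather keep that information than strip it, a minimal sketch (the series name here is just for illustration) is to pad the offset into a form that %z understands and parse it directly:
import pandas as pd
raw = pd.Series(["20201231T23-07", "20201231T22-07"])  # hypothetical raw strings
# "-07" alone is not a valid %z offset, so pad it to "-0700" before parsing
parsed = pd.to_datetime(raw + "00", format="%Y%m%dT%H%z")
print(parsed.dt.hour)  # local hours: 23, 22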
Related
I want to convert all rows of my DataFrame that contain hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hour; maybe pd.eval can't accept 08h30 or 8h0. How do I solve this problem?
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 is minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe :
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total number of minutes (or seconds, or hours), cast the Timedelta series to the corresponding resolution; for minutes:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
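Note that depending on your pandas version, astype('timedelta64[m]') may no longer be accepted (it was deprecated around pandas 1.5 and later removed). A version-independent sketch that produces the same float minutes from the tm column:
d1['total_min'] = d1.tm.dt.total_seconds() / 60  # total seconds divided by 60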
Bringing all the operations together :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
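To tie this back to the original df, a rough sketch (assuming df['time'] holds the raw strings such as '8h30', possibly with stray whitespace):
tm = pd.to_timedelta(df['time'].str.strip() + 'm')  # strip whitespace and append the missing minutes unit
df['minutes'] = tm.dt.components.hours * 60 + tm.dt.components.minutes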
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
Just to give an alternative approach with roughly the same elements as above, you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First, split your column on the "h":
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, zero-padding the hours and minutes to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column to a datetime column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to get timedeltas) and dividing by a one-minute Timedelta:
zerotime = pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0
I have an 8-year timeseries with daily values that I would like to resample biweekly. However, I only need biweekly values from week 18 to week 30 of each year (i.e. W18, W20, W22, ..., W30). The method below would sometimes give me the 'odd' biweekly values (i.e. W19, W21, W23, ..., W29). How might I ensure that I always get the 'even' biweekly values?
df = df.resample("2W").mean()
df["Week"] = df.index.map(lambda dt: dt.week)
df = df.loc[df.Week.isin(range(18,31))]
An example of the daily data from 2010-01-01 to 2018-12-31: (short version)
Date value_1 value_2
... ... ...
2010-05-03 10 1
2010-05-04 79 66
2010-05-05 40 16
2010-05-06 13 76
2010-05-07 2 36
2010-05-08 31 98
2010-05-09 96 3
2010-05-10 66 18
2010-05-11 99 9
... ... ...
Expected biweekly data between week 18 and week 30:
Date value_1 value_2 Week
2010-05-03 14 1 18
2010-05-17 33 89 20
2010-05-31 21 31 22
2010-06-14 33 56 24
2010-06-28 12 43 26
2010-07-12 21 72 28
2010-07-26 76 13 30
2011-05-02 60 28 18
2011-05-16 82 2 20
2011-05-30 30 15 22
... ... ... ...
I think that the best way is to create the range separately with list comprehension. The code below will give a range between 18 and 30 with only even values:
weeks_to_include = [i for i in range(18, 31) if i % 2 == 0]
With this range you can filter as you have above. I tested the code below and it worked for me:
#create a dummy dataframe
dr = pd.date_range(start='2013-01-01', end='2021-12-31', freq='D')
df = pd.DataFrame(index=dr)
df['col1'] = range(0, len(df))
#create a list of even weeks in a range
weeks_to_include = [i for i in range(18, 31) if i % 2 == 0]
#create a column with the week of the year
df['weekofyear'] = df.index.isocalendar().week
#filter for only weeks_to_include
df = df.loc[df['weekofyear'].isin(weeks_to_include)]
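If you also need the biweekly means themselves rather than the filtered daily rows, here is one possible sketch, under the assumption that 'biweekly' means averaging each even ISO week together with the odd week that follows it; it collapses each week onto the even week that starts its block and groups on that:
# starting again from the daily frame
iso = df.index.isocalendar()        # year / week / day for each date
block = iso.week - iso.week % 2     # weeks 18,19 -> 18; 20,21 -> 20; ...
mask = block.between(18, 30)
biweekly = df[mask].groupby([iso.year[mask], block[mask]]).mean()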
I have two dataframes, like df1:
time kw
0 13:00 30
1 13:02 28
2 13:04 29
and df2
time kw
1 13:01 30
2 13:03 28
3 13:05 29
I want to combine the rows of the two dataframes so that the end result looks like:
time kw
1 13:00 30
2 13:01 30
3 13:02 28
4 13:03 28
5 13:04 29
6 13:05 29
Please help.
I concatenated both dataframes with result_df = pd.concat([df1, df2]), but it just put them side by side. I also tried appending one dataframe to the other, but that is still not what I am looking for.
Thanks in advance.
Use df.append with df.sort_values (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on recent versions use pd.concat as shown below):
In [2362]: df1.append(df2).sort_values('time')
Out[2362]:
time kw
0 13:00 30
1 13:01 30
1 13:02 28
2 13:03 28
2 13:04 29
3 13:05 29
import pandas as pd
df1 = pd.DataFrame([("13:00", 30), ("13:02", 28), ("13:04", 29)], columns=["time", "kw"])
df2 = pd.DataFrame([("13:01", 30), ("13:03", 28), ("13:05", 29)], columns=["time", "kw"])
df = pd.concat([df1, df2]).sort_values("time")
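One caveat: sorting on the raw strings only works here because every time is zero-padded to the same width. If the data could contain unpadded values such as 9:05, it is safer to sort on a parsed key instead (a sketch, assuming pandas >= 1.1 for the key argument):
df = (pd.concat([df1, df2], ignore_index=True)
        .sort_values("time", key=lambda s: pd.to_datetime(s, format="%H:%M"))
        .reset_index(drop=True))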
I have a file, df, and I wish to take the delta over every 7-day period and reflect the timestamp for that particular period.
df:
Date Value
10/15/2020 75
10/14/2020 70
10/13/2020 65
10/12/2020 60
10/11/2020 55
10/10/2020 50
10/9/2020 45
10/8/2020 40
10/7/2020 35
10/6/2020 30
10/5/2020 25
10/4/2020 20
10/3/2020 15
10/2/2020 10
10/1/2020 5
Desired Output:
10/15/2020 to 10/9/2020 is a 7-day period, with the delta being 75 - 45 = 30.
The 10/9/2020 timestamp would therefore get the value 30, and so on.
Date Value
10/9/2020 30
10/2/2020 30
This is what I am doing:
df = df.assign(Delta=df.iloc[:, 6].sub(df.iloc[:, 0]),
               Date=pd.Series(pd.date_range(pd.Timestamp('2020-10-15'),
                                            periods=7, freq='7d')))[['Delta', 'Date']]
I am also thinking I may be able to do this:
Edit: I updated callDate to Date.
for row in df.itertuples():
    Date = datetime.strptime(row.Date, "%m/%d/%y %I:%M %p")
    previousRecord = df['Date'].shift(-6).strptime(row.Date, "%m/%d/%y %I:%M %p")
    Delta = Date - previousRecord
Any suggestion is appreciated
Don't iterate through the dataframe. You can use a merge:
# 'Date' must be a datetime column (e.g. via pd.to_datetime) for the 6-day shift to work
(df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('6D')),
          on='Date')
   .assign(Value=lambda x: x['Value_y'] - x['Value_x'])
   [['Date', 'Value']]
)
Output:
Date Value
0 2020-10-09 30
1 2020-10-08 30
2 2020-10-07 30
3 2020-10-06 30
4 2020-10-05 30
5 2020-10-04 30
6 2020-10-03 30
7 2020-10-02 30
8 2020-10-01 30
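If you only want one delta per non-overlapping 7-day block, as in the desired output (10/9 and 10/2), you could then keep every 7th row of that result; for example, assuming it was stored in a variable called result and the dates are consecutive daily values in descending order:
result.iloc[::7]  # rows 0 and 7 -> 2020-10-09 and 2020-10-02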
The last block of code you wrote is the way I would do it. The only problem is in Delta = Date - previousRecord: there was nothing called Date there originally; you should instead be accessing the value associated with callDate (which you have since renamed to Date).
I'm new to Python and I'm facing the following problem. I have a dataframe composed of 2 columns, one of which is a date (datetime64[ns]). I want to keep all records within the last 12 months. My code is the following:
today=start_time.date()
last_year = today + relativedelta(months = -12)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
when I run it I get the following message:
TypeError: type object 2017-06-05
Any ideas?
last_year seems to bring me the date that I want in the following format: 2017-06-05
Create a time delta object in pandas to increment the date (12 months). Call pandas.Timestamp('now') to get the current date, and then create a date_range. Here is an example of getting monthly data for 12 months.
import pandas as pd
import datetime
list_1 = [i for i in range(0, 12)]
list_2 = [i for i in range(13, 25)]
list_3 = [i for i in range(26, 38)]
data_frame = pd.DataFrame({'A': list_1, 'B': list_2, 'C': list_3},
                          index=pd.date_range(pd.Timestamp('now'),
                                              pd.Timestamp('now') + pd.Timedelta(weeks=53),
                                              freq='M'))
We create a timestamp for the current date and enter that as our start date. Then we create a timedelta to increment that date by 53 weeks (or 52 if you'd like) which gets us 12 months of data. Below is the output:
A B C
2018-06-30 05:05:21.335625 0 13 26
2018-07-31 05:05:21.335625 1 14 27
2018-08-31 05:05:21.335625 2 15 28
2018-09-30 05:05:21.335625 3 16 29
2018-10-31 05:05:21.335625 4 17 30
2018-11-30 05:05:21.335625 5 18 31
2018-12-31 05:05:21.335625 6 19 32
2019-01-31 05:05:21.335625 7 20 33
2019-02-28 05:05:21.335625 8 21 34
2019-03-31 05:05:21.335625 9 22 35
2019-04-30 05:05:21.335625 10 23 36
2019-05-31 05:05:21.335625 11 24 37
Try
today = datetime.datetime.now()
You can use pandas functionality with datetime objects. The syntax is often more intuitive and obviates the need for additional imports.
last_year = pd.to_datetime('today') + pd.DateOffset(years=-1)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
That said, we would need to see all your code to be sure of the reason behind your error; for example, how is start_time defined?
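For reference, here is a minimal self-contained sketch of that last approach on a toy frame (the column names are made up for illustration):
import pandas as pd
# hypothetical data: two years of daily records ending today
df = pd.DataFrame({"mydate": pd.date_range(end=pd.Timestamp("today"), periods=730, freq="D"),
                   "value": range(730)})
last_year = pd.to_datetime("today") + pd.DateOffset(years=-1)
new_df = df[df["mydate"] >= last_year]
print(len(new_df))  # roughly the last 365 rows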