Converting timestamps in large dataset to multiple timezones - python

I have a large dataset with ~9 million rows and 4 columns, one of which is a UTC timestamp. The data was recorded at 507 sites across Australia, and there is a site ID column. I have another dataset with the timezone for each site ID in the format 'Australia/Brisbane'. I've written a function to add a new column to the main dataset containing the UTC timestamp converted to local time. However, the wrong local time is being matched up with the UTC timestamp, for example 2019-01-05 12:10:00+00:00 paired with 2019-01-13 18:55:00+11:00 (wrong timezone). I believe the sites are not mixed up in the data, but I've tried sorting the data in case that was the problem. Below is my code and images of the first row of each dataset; any help is much appreciated!
import pytz
from dateutil import tz

def update_timezone(df):
    newtimes = []
    df = df.sort_values('site_id')
    sites = df['site_id'].unique().tolist()
    for site in sites:
        timezone = solarbom.loc[solarbom['site_id'] == site].iloc[0, 1]
        dfsub = df[df['site_id'] == site].copy()
        dfsub['utc_timestamp'] = dfsub['utc_timestamp'].dt.tz_convert(timezone)
        newtimes.extend(dfsub['utc_timestamp'].tolist())
    df['newtimes'] = newtimes
Main large dataset
Site info dataset

IIUC, you're looking to group your data by ID, then convert the timestamp specific to each ID. You could achieve this by using groupby, then applying a converter function to each group. Ex:
import pandas as pd

# dummy data:
df = pd.DataFrame({'utc_timestamp': [pd.Timestamp("2022-01-01 00:00 Z"),
                                     pd.Timestamp("2022-01-01 01:00 Z"),
                                     pd.Timestamp("2022-01-05 00:00 Z"),
                                     pd.Timestamp("2022-01-03 00:00 Z"),
                                     pd.Timestamp("2022-01-03 01:00 Z"),
                                     pd.Timestamp("2022-01-03 02:00 Z")],
                   'site_id': [1, 1, 5, 3, 3, 3],
                   'values': [11, 11, 55, 33, 33, 33]})

# time zone info for each ID:
timezdf = pd.DataFrame({'site_id': [1, 3, 5],
                        'timezone_id_x': ["Australia/Adelaide", "Australia/Perth", "Australia/Darwin"]})

### what we want:
# for row, data in timezdf.iterrows():
#     print(f"ID: {data['site_id']}, tz: {data['timezone_id_x']}")
#     print(pd.Timestamp("2022-01-01 00:00 Z"), "to", pd.Timestamp("2022-01-01 00:00 Z").tz_convert(data['timezone_id_x']))
# ID: 1, tz: Australia/Adelaide
# 2022-01-01 00:00:00+00:00 to 2022-01-01 10:30:00+10:30
# ID: 3, tz: Australia/Perth
# 2022-01-01 00:00:00+00:00 to 2022-01-01 08:00:00+08:00
# ID: 5, tz: Australia/Darwin
# 2022-01-01 00:00:00+00:00 to 2022-01-01 09:30:00+09:30
###

def converter(group, timezdf):
    # get the time zone by looking for the current group ID in timezdf
    z = timezdf.loc[timezdf["site_id"] == group["site_id"].iloc[0], 'timezone_id_x'].iloc[0]
    group["localtime"] = group["localtime"].dt.tz_convert(z)
    return group

df["localtime"] = df["utc_timestamp"]
df = df.groupby("site_id").apply(lambda g: converter(g, timezdf))
Now df looks like:
df
Out[71]:
utc_timestamp site_id values localtime
0 2022-01-01 00:00:00+00:00 1 11 2022-01-01 10:30:00+10:30
1 2022-01-01 01:00:00+00:00 1 11 2022-01-01 11:30:00+10:30
2 2022-01-05 00:00:00+00:00 5 55 2022-01-05 09:30:00+09:30
3 2022-01-03 00:00:00+00:00 3 33 2022-01-03 08:00:00+08:00
4 2022-01-03 01:00:00+00:00 3 33 2022-01-03 09:00:00+08:00
5 2022-01-03 02:00:00+00:00 3 33 2022-01-03 10:00:00+08:00
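If performance matters at 9 million rows, a variant of the same idea may be worth trying: merge the timezone table into the main frame once, then group by the timezone string instead of by each of the 507 site IDs, so tz_convert runs once per distinct zone. A minimal sketch on dummy data (the column names mirror the example above; not the asker's original code):

```python
import pandas as pd

# dummy data standing in for the real frames
df = pd.DataFrame({'utc_timestamp': pd.to_datetime(['2022-01-01 00:00', '2022-01-05 00:00'], utc=True),
                   'site_id': [1, 5]})
timezdf = pd.DataFrame({'site_id': [1, 5],
                        'timezone_id_x': ['Australia/Adelaide', 'Australia/Darwin']})

# one merge attaches each row's zone; then one tz_convert per distinct zone
merged = df.merge(timezdf, on='site_id', how='left')

def to_local(group):
    # group.name is the time zone string shared by this group
    group['localtime'] = group['utc_timestamp'].dt.tz_convert(group.name)
    return group

merged = merged.groupby('timezone_id_x', group_keys=False).apply(to_local)
```

As in the answer above, the resulting column holds mixed UTC offsets, so pandas stores it as object dtype rather than a single datetime64 column.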

Related

Pandas change time values based on condition

I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
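A more compact route to the same 09:00/17:00 floor and ceiling (a sketch, not the code above): Series.clip accepts per-row bounds, so you can build lower and upper bounds on each row's own calendar date and clip in one call.

```python
import pandas as pd

df = pd.DataFrame({'time': pd.to_datetime(['08:45:00', '09:30:00', '18:00:00', '15:00:00'],
                                          format='%H:%M:%S')})

# per-row bounds on the same calendar date, then clip into the 09:00-17:00 window
base = df['time'].dt.normalize()
df['time'] = df['time'].clip(lower=base + pd.Timedelta(hours=9),
                             upper=base + pd.Timedelta(hours=17))
df['time'] = df['time'].dt.time   # back to plain time-of-day values
```

Unlike the hour-shifting approach, this also caps 08:45 at exactly 09:00 rather than 09:45.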
Since your 'time' column contains strings, they can be kept as strings and new string values assigned where appropriate. To filter for your criteria it is convenient to: create a datetime Series from the 'time' column, create boolean Series by comparing the datetime Series against your criteria, then use the boolean Series to select the rows which need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime and make boolean Series with your criteria:
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns (grouped so the anchor clearly applies to every alternative)
pattern_lt_nine = '^(00|01|02|03|04|05|06|07|08)'
pattern_gt_seventeen = '^(17|18|19|20|21|22|23)'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00

Get several previous rows of a dataframe while using iterrows

While using iterrows(), I would like to create a "temporary" dataframe which includes several previous rows (not consecutive) from my initial dataframe, identified using the index.
For each step of iterrows(), I will create the "temporary" dataframe containing 4 previous prices from the initial df, all prices separated by 4 hours. I will then calculate the average of these prices. The objective is to be able to easily change the number of prices and the gap between them.
I tried several ways to get the previous rows but without success. I understand that as my index is a timestamp I need to use timedelta, but it doesn't work.
My initial dataframe "df":
Price
timestamp
2022-04-01 00:00:00 124.39
2022-04-01 01:00:00 121.46
2022-04-01 02:00:00 118.75
2022-04-01 03:00:00 121.95
2022-04-01 04:00:00 121.15
... ...
2022-04-09 13:00:00 111.46
2022-04-09 14:00:00 110.90
2022-04-09 15:00:00 109.59
2022-04-09 16:00:00 110.25
2022-04-09 17:00:00 110.88
My code:
from datetime import timedelta

df_test = None
dt_test = pd.DataFrame(columns=['Price', 'Price_Avg'])
dt_Avg = None
dt_Avg = pd.DataFrame(columns=['PreviousPrices'])

for index, row in df.iterrows():
    Price = row['Price']
    # Creation of a temporary df to stock 4 previous prices, each price separated by 4 hours:
    for i in range(0, 4):
        delta = 4*(i+1)
        PrevPrice = df.loc[(index - timedelta(hours=delta)), 'Price']
        myrow_dt_Avg = {'PreviousPrices': PrevPrice}
        dt_Avg = dt_Avg.append(myrow_dt_Avg, ignore_index=True)
    # Calculation of the Avg of the 4 previous prices:
    Price_Avg = dt_Avg['PreviousPrices'].sum()/4
    # Clear dt_Avg:
    dt_Avg = dt_Avg[0:0]
    myrow_df_test = {'Price': Price, 'Price_Avg': Price_Avg}
    df_test = df_test.append(myrow_df_test, ignore_index=True)
dt_test
The line PrevPrice = df.loc[(index - timedelta(hours=delta)), 'Price'] is the one that fails; do you have any idea why?
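Rather than fixing the .loc lookup, the whole loop can be avoided: on an hourly index, the 4 previous prices spaced 4 hours apart are just shifted copies of the Price column. A vectorized sketch on dummy data (n_prices and gap are the two knobs the question wants to be able to change):

```python
import pandas as pd

# dummy hourly price series standing in for the asker's df
idx = pd.date_range('2022-04-01', periods=24, freq='h')
df = pd.DataFrame({'Price': range(24)}, index=idx)

n_prices, gap = 4, 4  # number of previous prices, and the gap between them in hours
df['Price_Avg'] = sum(df['Price'].shift(i * gap) for i in range(1, n_prices + 1)) / n_prices
```

Rows that do not yet have n_prices * gap hours of history come out as NaN, which mirrors what the loop version would hit as a missing-label error.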

Finding the timedelta through a pandas dataframe, I keep returning NaT

So I am reading in a csv file of a 30-minute timeseries going from "2015-01-01 00:00" up to and including "2020-12-31 23:30". There are five of these timeseries, each for a certain location, with 105215 rows, one per 30 minutes. My job is to go through and find the timedelta between each row, for each column. It should be 30 minutes for each one, except sometimes it isn't, and I have to find that.
So far I'm reading in the data fine via
ca_time = np.array(ca.iloc[0:, 1], dtype= "datetime64")
ny_time = np.array(ny.iloc[0:, 1], dtype = "datetime64")
tx_time = np.array(tx.iloc[0:, 1], dtype = "datetime64")
#I'm then passing these to a pandas dataframe for more convenient manipulation
frame_ca = pd.DataFrame(data = ca_time, dtype = "datetime64[s]")
frame_ny = pd.DataFrame(data = ny_time, dtype = "datetime64[s]")
frame_tx = pd.DataFrame(data = tx_time, dtype = "datetime64[s]")
#Then concatenating them into an array with 100k+ rows, and the five columns represent each location
full_array = pd.concat([frame_ca, frame_ny, frame_tx], axis = 1)
I now want to find the timedelta between each cell for each respective location.
Currently I'm trying this as a simple test:
first_row = full_array2.loc[1:1, :1]
second_row = full_array2.loc[2:2, :1]
delta = first_row - second_row
I'm getting back
0 0 0
1 NaT NaT NaT
2 NaT NaT NaT
This seems simple enough, but I don't know why I'm getting Not a Time here.
For reference, below are both those rows I'm trying to subtract
ca ny tx fl az
1 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00
2 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00
Any help appreciated!
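The NaT values are most likely index alignment: full_array2.loc[1:1, :] and full_array2.loc[2:2, :] carry different index labels (1 vs 2), so the subtraction pairs nothing up and every cell becomes NaT. DataFrame.diff() sidesteps this by differencing consecutive rows per column. A sketch on dummy 30-minute data (the column names are illustrative, not the asker's actual frame):

```python
import pandas as pd

# dummy 30-minute timestamps with one deliberate gap in 'ny'
df = pd.DataFrame({
    'ca': pd.to_datetime(['2015-01-01 00:00', '2015-01-01 00:30', '2015-01-01 01:00']),
    'ny': pd.to_datetime(['2015-01-01 00:00', '2015-01-01 00:30', '2015-01-01 01:30']),
})

deltas = df.diff()                          # per-column difference between consecutive rows
bad = deltas != pd.Timedelta(minutes=30)    # True wherever the step is not 30 minutes
bad.iloc[0] = False                         # the first row has no predecessor; ignore it
irregular = df[bad.any(axis=1)]             # rows where any location deviates
```

On the real data this flags every row whose gap from the previous row differs from 30 minutes, across all five locations at once.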

Convert Pandas dataframe rows containing a datetime range to a new dataframe with a row for each date, along with the hours included on that date

So, I have StartDateTime and EndDateTime columns in my dataframe, and I want to produce a new dataframe with a row for each date in the datetime range, but I also want the number of hours of that date that are included in the date range.
In [11]: sessions = pd.DataFrame({'Start': ['2018-01-01 13:00:00', '2018-03-01 16:30:00'],
                                  'End': ['2018-01-03 07:00:00', '2018-03-02 06:00:00'],
                                  'User': ['Dan', 'Fred']})
In [12]: sessions
Out[12]:
Start End User
0 2018-01-01 13:00:00 2018-01-03 07:00:00 Dan
1 2018-03-01 16:30:00 2018-03-02 06:00:00 Fred
Desired dataframe:
Date Hours User
2018-01-01 11 Dan
2018-01-02 24 Dan
2018-01-03 7 Dan
2018-03-01 7.5 Fred
2018-03-02 6 Fred
I've seen a lot of examples that just produced a dataframe for each date in the date range (e.g. Expanding pandas data frame with date range in columns)
but nothing with the additional field of hours per date included in the range.
I don't know if it's the cleanest solution, but it seems to work.
In [13]: sessions = pd.DataFrame({'Start': ['2018-01-01 13:00:00', '2018-03-01 16:30:00'],
                                  'End': ['2018-01-03 07:00:00', '2018-03-02 06:00:00'],
                                  'User': ['Dan', 'Fred']})
Convert Start and End to datetime:
In [14]: sessions['Start'] = pd.to_datetime(sessions['Start'])
         sessions['End'] = pd.to_datetime(sessions['End'])
Create a row for each date in the range:
In [15]: dailyUsage = pd.concat([pd.DataFrame({'Date': pd.date_range(pd.to_datetime(row.Start).date(), row.End.date(), freq='D'),
                                               'Start': row.Start,
                                               'User': row.User,
                                               'End': row.End}, columns=['Date', 'Start', 'User', 'End'])
                                 for i, row in sessions.iterrows()], ignore_index=True)
A function to calculate the hours on a date, based on the start datetime, end datetime, and the specific date:
In [16]: def calcDuration(x):
             date = x['Date']
             startDate = x['Start']
             endDate = x['End']
             # starts and stops on the same day
             if endDate.date() == startDate.date():
                 return (endDate - startDate).seconds/3600
             # this is on the start date
             if (date.to_pydatetime().date() - startDate.date()).days == 0:
                 return 24 - startDate.hour
             # this is on the end date
             if (date.to_pydatetime().date() - endDate.date()).days == 0:
                 return endDate.hour
             # this is on an interior date
             else:
                 return 24
Calculate hours for each date:
In [17]: dailyUsage['hours'] = dailyUsage.apply(calcDuration, axis=1)
In [18]: dailyUsage.drop(['Start', 'End'], axis=1).head()
Out[18]:
        Date  User  hours
0 2018-01-01   Dan     11
1 2018-01-02   Dan     24
2 2018-01-03   Dan      7
3 2018-03-01  Fred      8
4 2018-03-02  Fred      6
Something like this would work as well, if you don't mind integers only:
df['date'] = df['Date'].dt.date
gb = df.groupby(['date', 'User'])['Date'].size()
print(gb)
date User
2018-01-01 Dan 11
2018-01-02 Dan 24
2018-01-03 Dan 8
2018-03-01 Fred 8
2018-03-02 Fred 6
Name: Date, dtype: int64
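The size() count above presupposes a frame already expanded to one row per hour, which neither answer shows. A sketch of that expansion under the same hour-granularity assumption (boundary days are rounded to whole hours, so the 16:30 start counts from 16:00; date_range's inclusive parameter needs pandas 1.4+):

```python
import pandas as pd

sessions = pd.DataFrame({'Start': pd.to_datetime(['2018-01-01 13:00:00', '2018-03-01 16:30:00']),
                         'End': pd.to_datetime(['2018-01-03 07:00:00', '2018-03-02 06:00:00']),
                         'User': ['Dan', 'Fred']})

# one (date, user) record per whole hour inside each session
rows = [(ts.date(), r.User)
        for r in sessions.itertuples()
        for ts in pd.date_range(r.Start.floor('h'), r.End, freq='h', inclusive='left')]
hourly = pd.DataFrame(rows, columns=['date', 'User'])
gb = hourly.groupby(['date', 'User']).size()
```

With inclusive='left' the end hour itself is excluded, so Dan's final day counts 7 hours (00:00 through 06:00), matching the desired output rather than the 8 shown in the count above.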

Extract multi-year three month series (winter) from pandas dataframe

I've got a pandas DataFrame containing 70 years with hourly data, looking like this:
pressure
2015-06-01 18:00:00 945.6
2015-06-01 19:00:00 945.6
2015-06-01 20:00:00 945.4
2015-06-01 21:00:00 945.4
2015-06-01 22:00:00 945.3
I want to extract the winter months (D-J-F) from every year and generate a new DataFrame with a series of winters.
I found a lot of complicated approaches (e.g. extracting df.index.month as a new column and then addressing that afterwards), but is there a straightforward way to get the winter months?
You can use map():
import datetime
import pandas as pd

df = pd.DataFrame({'date': [datetime.date(2015, 11, 1), datetime.date(2015, 12, 1),
                            datetime.date(2015, 1, 1), datetime.date(2015, 2, 1)],
                   'pressure': [1, 2, 3, 4]})
winter_months = [12, 1, 2]
print(df)
#          date  pressure
# 0  2015-11-01         1
# 1  2015-12-01         2
# 2  2015-01-01         3
# 3  2015-02-01         4
df = df[df["date"].map(lambda t: t.month in winter_months)]
print(df)
#          date  pressure
# 1  2015-12-01         2
# 2  2015-01-01         3
# 3  2015-02-01         4
EDIT: I noticed that in your example the dates are the dataframe's index. This still works:
df = df[df.index.map(lambda t: t.month in winter_months)]
I just found that
df[(df.index.month==12) | (df.index.month==1) | (df.index.month==2)]
works fine.
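The same filter reads a bit more compactly with Index.isin, which also makes the month list easy to change (a sketch on a small dummy index):

```python
import pandas as pd

df = pd.DataFrame({'pressure': [945.6, 945.4, 990.0, 985.0]},
                  index=pd.to_datetime(['2015-01-15', '2015-06-01', '2015-12-20', '2016-02-10']))

# keep only rows whose index month is December, January, or February
winter = df[df.index.month.isin([12, 1, 2])]
```

Note that this selects D-J-F rows from all years; it does not by itself group December with the January/February that follow it, which would need an extra season-year key.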
