Problems with timezone in numpy.datetime64

I'm a bit confused about how numpy handles timezones. If I create a datetime object from just a date, it seems to use the Zulu (UTC) timezone. If I also give a time component, it uses my current timezone. If I then manipulate these objects, e.g. add a timedelta, the results differ:
import numpy as np
a = np.datetime64('2015-04-22')
b = np.datetime64('2015-04-22T00:00')
delta = np.timedelta64(1,'h')
print(a+delta,b+delta)
I must ensure that all values are in the same timezone. So my question is: how can I ensure that a user who initializes these dates doesn't mix plain dates with dates that carry a time?

If you specify Zulu (a trailing Z) on the datetime that has a time component, you'll get uniform data.
In [30]: b = np.datetime64('2015-04-22T00:00Z')
In [31]: b + delta
Out[31]: numpy.datetime64('2015-04-22T03:00+0200')
In [32]: a + delta
Out[32]: numpy.datetime64('2015-04-22T03:00+0200','h')
http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#basic-datetimes
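On NumPy 1.11 and later, datetime64 is timezone-naive and no longer applies the local offset when parsing, so the mismatch from the question should not appear there. Independently of the version, one way to stop users from mixing plain dates with date-times is to coerce every input to the same unit. A minimal sketch, assuming seconds resolution is enough; `to_seconds` is a hypothetical helper name, not a NumPy function:

```python
import numpy as np

def to_seconds(value):
    """Hypothetical helper: coerce a date or date-time string to datetime64[s]."""
    return np.datetime64(value, "s")

a = to_seconds("2015-04-22")        # plain date
b = to_seconds("2015-04-22T00:00")  # date with a time component
delta = np.timedelta64(1, "h")

# Both values now share one unit, so arithmetic gives the same result.
print(a + delta, b + delta)
```

With both inputs forced to one unit, comparing or adding timedeltas behaves the same regardless of how the user wrote the date.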

Related

Convert datetime pandas column to unix format [duplicate]

From the official documentation of pandas.to_datetime we can say,
unit : string, default ‘ns’
unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or
float number. This will be based off the origin. Example, with
unit=’ms’ and origin=’unix’ (the default), this would calculate the
number of milliseconds to the unix epoch start.
So when I try like this way,
import pandas as pd
df = pd.DataFrame({'time': [pd.to_datetime('2019-01-15 13:25:43')]})
df_unix_sec = pd.to_datetime(df['time'], unit='ms', origin='unix')
print(df)
print(df_unix_sec)
time
0 2019-01-15 13:25:43
0 2019-01-15 13:25:43
Name: time, dtype: datetime64[ns]
The output does not change for the second one. It always shows the datetime value, not the number of milliseconds since the Unix epoch start. Why is that? Am I missing something?

I think you misunderstood what the argument is for. The purpose of origin='unix' is to convert an integer timestamp to datetime, not the other way.
pd.to_datetime(1.547559e+09, unit='s', origin='unix')
# Timestamp('2019-01-15 13:30:00')
Here are some options:
Option 1: integer division
Conversely, you can get the timestamp by converting to integer (to get nanoseconds) and dividing by 10**9.
pd.to_datetime(['2019-01-15 13:30:00']).astype(int) / 10**9
# Float64Index([1547559000.0], dtype='float64')
Pros:
super fast
Cons:
makes assumptions about how pandas internally stores dates
Option 2: recommended by pandas
Pandas docs recommend using the following method:
# create test data
dates = pd.to_datetime(['2019-01-15 13:30:00'])
# calculate unix datetime
(dates - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
[out]:
Int64Index([1547559000], dtype='int64')
Pros:
"idiomatic", recommended by the library
Cons:
unwieldy
not as performant as integer division
Option 3: pd.Timestamp
If you have a single date string, you can use pd.Timestamp as shown in the other answer:
pd.Timestamp('2019-01-15 13:30:00').timestamp()
# 1547559000.0
If you have to coerce multiple datetimes (where pd.to_datetime is your only option), you can initialize and map:
pd.to_datetime(['2019-01-15 13:30:00']).map(pd.Timestamp.timestamp)
# Float64Index([1547559000.0], dtype='float64')
Pros:
best method for a single datetime string
easy to remember
Cons:
not as performant as integer division
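The subtraction approach from Option 2 can be wrapped in a small function that works for lists and for a DataFrame column alike. This is only a sketch: `to_unix_seconds` is a hypothetical name, and naive inputs are assumed to represent UTC.

```python
import pandas as pd

def to_unix_seconds(values):
    """Hypothetical helper: whole seconds since 1970-01-01 for naive datetimes."""
    dates = pd.to_datetime(values)
    epoch = pd.Timestamp("1970-01-01")
    # Floor-dividing a timedelta by one second yields plain integers.
    return (dates - epoch) // pd.Timedelta("1s")

secs = to_unix_seconds(["2019-01-15 13:30:00", "2019-01-15 13:25:43"])
print(list(secs))

# Going the other way is what unit= and origin= are meant for.
back = pd.to_datetime(1547559000, unit="s", origin="unix")
print(back)
```

Because it divides a timedelta rather than reinterpreting the stored integers, this version does not depend on the resolution pandas uses internally.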

You can use timestamp() method which returns POSIX timestamp as float:
pd.Timestamp('2021-04-01').timestamp()
[Out]:
1617235200.0
pd.Timestamp('2021-04-01 00:02:35.234').timestamp()
[Out]:
1617235355.234

The value attribute of a pandas Timestamp holds the Unix epoch time in nanoseconds. You can convert it to µs or ms by dividing by 1e3 or 1e6, respectively. Check the code below.
import pandas as pd
date_1 = pd.to_datetime('2020-07-18 18:50:00')
print(date_1.value)
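To make the unit conversion concrete, here is a short sketch that derives microseconds, milliseconds and seconds from the nanosecond value. Integer floor division is used instead of dividing by a float so the results stay exact integers; the variable names are only illustrative.

```python
import pandas as pd

date_1 = pd.to_datetime("2020-07-18 18:50:00")

ns = date_1.value          # nanoseconds since the Unix epoch
us = ns // 10**3           # microseconds
ms = ns // 10**6           # milliseconds
seconds = ns // 10**9      # seconds

print(ns, us, ms, seconds)
```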

When you calculate the difference between two datetimes, the dtype of the difference is timedelta64[ns] by default (ns in brackets). By changing [ns] into [ms], [s], [m] etc as you cast the output to a new timedelta64 object, you can convert the difference into milliseconds, seconds, minutes etc.
For example, to find the number of seconds passed since Unix epoch, subtract datetimes and change dtype.
df_unix_sec = (df['time'] - pd.Timestamp('1970-01-01')).astype('timedelta64[s]')
N.B. Oftentimes, the differences are very large numbers, so if you want them as integers, use astype('int64') (NOT astype(int)).
df_unix_sec = (df['time'] - pd.Timestamp('1970-01-01')).astype('timedelta64[s]').astype('int64')
For OP's example, this would yield,
0 1547558743
Name: time, dtype: int64

If you access a single datetime64 value from the DataFrame, pandas will most likely return a Timestamp object, which is how pandas represents individual datetime64 values.
You can use the to_datetime64() method of the pd.Timestamp object to convert it to a numpy.datetime64 object with ns precision.

What does .value return when applied to pandas TimeStamp? [duplicate]

I want to convert a datetime (2014-12-23 00:00:00) into Unix time. I tried it with the datetime function, but it didn't work. I have the datetime stamps in an array.
Zeit =np.array(Jahresgang1.ix[ :,'Zeitstempel'])
t = pd.to_datetime(Zeit, unit='s')
unixtime = pd.DataFrame(t)
print unixtime
Thanks a lot

I think you can subtract the date 1970-1-1 to create a timedelta and then access the attribute total_seconds:
In [130]:
s = pd.Series(pd.Timestamp(2012, 1, 1))
s
Out[130]:
0 2012-01-01
dtype: datetime64[ns]
In [158]:
(s - pd.Timestamp(1970, 1, 1)).dt.total_seconds()
Out[158]:
0 1325376000
dtype: float64

To emphasize EdChum's first comment, you can get Unix time directly like this:
import pandas as pd
s = pd.to_datetime(["2014-12-23 00:00:00"])
unix = s.astype("int64")
print(unix)
# Int64Index([1419292800000000000], dtype='int64')
or for a pd.Timestamp:
print(pd.to_datetime("2014-12-23 00:00:00").value)
# 1419292800000000000
Notes
the output precision is nanoseconds - if you want another, divide appropriately, e.g. by 10⁹ to get seconds, 10⁶ to get milliseconds etc.
this assumes the input date/time to be UTC, unless a time zone / UTC offset is specified
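The second note can be checked with a short sketch: the same wall-clock string gives a different epoch value once a time zone is attached. The zone "Europe/Berlin" is only an assumed example and is not taken from the question.

```python
import pandas as pd

naive = pd.Timestamp("2014-12-23 00:00:00")                      # treated as UTC
aware = pd.Timestamp("2014-12-23 00:00:00", tz="Europe/Berlin")  # UTC+1 in December

naive_s = naive.value // 10**9
aware_s = aware.value // 10**9

# Midnight in Berlin happens one hour before midnight UTC.
print(naive_s, aware_s, naive_s - aware_s)
```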

Convert pandas timezone-aware DateTimeIndex to naive timestamp, but in certain timezone

You can use the function tz_localize to make a Timestamp or DateTimeIndex timezone aware, but how can you do the opposite: how can you convert a timezone aware Timestamp to a naive one, while preserving its timezone?
An example:
In [82]: t = pd.date_range(start="2013-05-18 12:00:00", periods=10, freq='s', tz="Europe/Brussels")
In [83]: t
Out[83]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-18 12:00:00, ..., 2013-05-18 12:00:09]
Length: 10, Freq: S, Timezone: Europe/Brussels
I could remove the timezone by setting it to None, but then the result is converted to UTC (12 o'clock became 10):
In [86]: t.tz = None
In [87]: t
Out[87]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-18 10:00:00, ..., 2013-05-18 10:00:09]
Length: 10, Freq: S, Timezone: None
Is there another way I can convert a DateTimeIndex to timezone naive, but while preserving the timezone it was set in?
Some context on the reason I am asking this: I want to work with timezone naive timeseries (to avoid the extra hassle with timezones, and I do not need them for the case I am working on).
But for some reason, I have to deal with a timezone-aware timeseries in my local timezone (Europe/Brussels). As all my other data are timezone naive (but represented in my local timezone), I want to convert this timeseries to naive to further work with it, but it also has to be represented in my local timezone (so just remove the timezone info, without converting the user-visible time to UTC).
I know the time is actually stored internally as UTC and only converted to another timezone when you represent it, so there has to be some kind of conversion when I want to "delocalize" it. For example, with the python datetime module you can "remove" the timezone like this:
In [119]: d = pd.Timestamp("2013-05-18 12:00:00", tz="Europe/Brussels")
In [120]: d
Out[120]: <Timestamp: 2013-05-18 12:00:00+0200 CEST, tz=Europe/Brussels>
In [121]: d.replace(tzinfo=None)
Out[121]: <Timestamp: 2013-05-18 12:00:00>
So, based on this, I could do the following, but I suppose this will not be very efficient when working with a larger timeseries:
In [124]: t
Out[124]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-18 12:00:00, ..., 2013-05-18 12:00:09]
Length: 10, Freq: S, Timezone: Europe/Brussels
In [125]: pd.DatetimeIndex([i.replace(tzinfo=None) for i in t])
Out[125]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-18 12:00:00, ..., 2013-05-18 12:00:09]
Length: 10, Freq: None, Timezone: None

To answer my own question, this functionality has been added to pandas in the meantime. Starting from pandas 0.15.0, you can use tz_localize(None) to remove the timezone resulting in local time.
See the whatsnew entry: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#timezone-handling-improvements
So with my example from above:
In [4]: t = pd.date_range(start="2013-05-18 12:00:00", periods=2, freq='H',
tz= "Europe/Brussels")
In [5]: t
Out[5]: DatetimeIndex(['2013-05-18 12:00:00+02:00', '2013-05-18 13:00:00+02:00'],
dtype='datetime64[ns, Europe/Brussels]', freq='H')
using tz_localize(None) removes the timezone information resulting in naive local time:
In [6]: t.tz_localize(None)
Out[6]: DatetimeIndex(['2013-05-18 12:00:00', '2013-05-18 13:00:00'],
dtype='datetime64[ns]', freq='H')
Further, you can also use tz_convert(None) to remove the timezone information but converting to UTC, so yielding naive UTC time:
In [7]: t.tz_convert(None)
Out[7]: DatetimeIndex(['2013-05-18 10:00:00', '2013-05-18 11:00:00'],
dtype='datetime64[ns]', freq='H')
This is much more performant than the datetime.replace solution:
In [31]: t = pd.date_range(start="2013-05-18 12:00:00", periods=10000, freq='H',
tz="Europe/Brussels")
In [32]: %timeit t.tz_localize(None)
1000 loops, best of 3: 233 µs per loop
In [33]: %timeit pd.DatetimeIndex([i.replace(tzinfo=None) for i in t])
10 loops, best of 3: 99.7 ms per loop

Because I always struggle to remember, a quick summary of what each of these do:
>>> pd.Timestamp.now() # naive local time
Timestamp('2019-10-07 10:30:19.428748')
>>> pd.Timestamp.utcnow() # tz aware UTC
Timestamp('2019-10-07 08:30:19.428748+0000', tz='UTC')
>>> pd.Timestamp.now(tz='Europe/Brussels') # tz aware local time
Timestamp('2019-10-07 10:30:19.428748+0200', tz='Europe/Brussels')
>>> pd.Timestamp.now(tz='Europe/Brussels').tz_localize(None) # naive local time
Timestamp('2019-10-07 10:30:19.428748')
>>> pd.Timestamp.now(tz='Europe/Brussels').tz_convert(None) # naive UTC
Timestamp('2019-10-07 08:30:19.428748')
>>> pd.Timestamp.utcnow().tz_localize(None) # naive UTC
Timestamp('2019-10-07 08:30:19.428748')
>>> pd.Timestamp.utcnow().tz_convert(None) # naive UTC
Timestamp('2019-10-07 08:30:19.428748')

I think you can't achieve what you want in a more efficient manner than you proposed.
The underlying problem is that the timestamps (as you seem aware) are made up of two parts: the data that represents the UTC time, and the timezone, tz_info. The timezone information is used only for display purposes when printing the timestamp to the screen. At display time, the data is offset appropriately and +01:00 (or similar) is added to the string. Stripping off the tz_info value (using tz_convert(tz=None)) doesn't actually change the data that represents the naive part of the timestamp.
So, the only way to do what you want is to modify the underlying data (pandas doesn't allow this... DatetimeIndex are immutable -- see the help on DatetimeIndex), or to create a new set of timestamp objects and wrap them in a new DatetimeIndex. Your solution does the latter:
pd.DatetimeIndex([i.replace(tzinfo=None) for i in t])
For reference, here is the replace method of Timestamp (see tslib.pyx):
def replace(self, **kwds):
    return Timestamp(datetime.replace(self, **kwds),
                     offset=self.offset)
You can refer to the docs on datetime.datetime to see that datetime.datetime.replace also creates a new object.
If you can, your best bet for efficiency is to modify the source of the data so that it (incorrectly) reports the timestamps without their timezone. You mentioned:
I want to work with timezone naive timeseries (to avoid the extra hassle with timezones, and I do not need them for the case I am working on)
I'd be curious what extra hassle you are referring to. I recommend as a general rule for all software development, keep your timestamp 'naive values' in UTC. There is little worse than looking at two different int64 values wondering which timezone they belong to. If you always, always, always use UTC for the internal storage, then you will avoid countless headaches. My mantra is Timezones are for human I/O only.

The accepted solution does not work when there are multiple different timezones in a Series. It throws ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
The solution is to use the apply method.
Please see the examples below:
# Let's have a series `a` with different multiple timezones.
> a
0 2019-10-04 16:30:00+02:00
1 2019-10-07 16:00:00-04:00
2 2019-09-24 08:30:00-07:00
Name: localized, dtype: object
> a.iloc[0]
Timestamp('2019-10-04 16:30:00+0200', tz='Europe/Amsterdam')
# trying the accepted solution
> a.dt.tz_localize(None)
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
# Make it tz-naive. This is the solution:
> a.apply(lambda x:x.tz_localize(None))
0 2019-10-04 16:30:00
1 2019-10-07 16:00:00
2 2019-09-24 08:30:00
Name: localized, dtype: datetime64[ns]
# a.tz_convert() also does not work with multiple timezones, but this works:
> a.apply(lambda x:x.tz_convert('America/Los_Angeles'))
0 2019-10-04 07:30:00-07:00
1 2019-10-07 13:00:00-07:00
2 2019-09-24 08:30:00-07:00
Name: localized, dtype: datetime64[ns, America/Los_Angeles]

Setting the tz attribute of the index explicitly seems to work:
ts_utc = ts.tz_convert("UTC")
ts_utc.index.tz = None

Late contribution but just came across something similar in Python datetime and pandas give different timestamps for the same date.
If you have timezone-aware datetime in pandas, technically, tz_localize(None) changes the POSIX timestamp (that is used internally) as if the local time from the timestamp was UTC. Local in this context means local in the specified timezone. Ex:
import pandas as pd
t = pd.date_range(start="2013-05-18 12:00:00", periods=2, freq='H', tz="US/Central")
# DatetimeIndex(['2013-05-18 12:00:00-05:00', '2013-05-18 13:00:00-05:00'], dtype='datetime64[ns, US/Central]', freq='H')
t_loc = t.tz_localize(None)
# DatetimeIndex(['2013-05-18 12:00:00', '2013-05-18 13:00:00'], dtype='datetime64[ns]', freq='H')
# offset in seconds according to timezone:
(t_loc.values-t.values)//1e9
# array([-18000, -18000], dtype='timedelta64[ns]')
Note that this will leave you with strange things during DST transitions, e.g.
t = pd.date_range(start="2020-03-08 01:00:00", periods=2, freq='H', tz="US/Central")
(t.values[1]-t.values[0])//1e9
# numpy.timedelta64(3600,'ns')
t_loc = t.tz_localize(None)
(t_loc.values[1]-t_loc.values[0])//1e9
# numpy.timedelta64(7200,'ns')
In contrast, tz_convert(None) does not modify the internal timestamp, it just removes the tzinfo.
t_utc = t.tz_convert(None)
(t_utc.values-t.values)//1e9
# array([0, 0], dtype='timedelta64[ns]')
My bottom line would be: stick with timezone-aware datetime if you can or only use t.tz_convert(None) which doesn't modify the underlying POSIX timestamp. Just keep in mind that you're practically working with UTC then.
(Python 3.8.2 x64 on Windows 10, pandas v1.0.5.)

Building on D.A.'s suggestion that "the only way to do what you want is to modify the underlying data" and using numpy to modify the underlying data...
This works for me, and is pretty fast:
import numpy as np
import pandas as pd
def tz_to_naive(datetime_index):
    """Converts a tz-aware DatetimeIndex into a tz-naive DatetimeIndex,
    effectively baking the timezone into the internal representation.
    Parameters
    ----------
    datetime_index : pandas.DatetimeIndex, tz-aware
    Returns
    -------
    pandas.DatetimeIndex, tz-naive
    """
    # Calculate timezone offset relative to UTC (taken from the first
    # element only, so one fixed offset is applied to the whole index)
    timestamp = datetime_index[0]
    tz_offset = (timestamp.replace(tzinfo=None) -
                 timestamp.tz_convert('UTC').replace(tzinfo=None))
    tz_offset_td64 = np.timedelta64(tz_offset)
    # Now convert to naive DatetimeIndex
    return pd.DatetimeIndex(datetime_index.values + tz_offset_td64)
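One caveat with a single fixed offset is that it appears to go wrong when the index crosses a DST transition. The sketch below repeats the same fixed-offset idea inline and compares it with tz_localize(None); the dates and the US/Central zone are assumptions chosen to hit the March 2020 transition.

```python
import pandas as pd

# Three hourly steps across the US/Central spring-forward transition:
# 01:00 CST, 03:00 CDT, 04:00 CDT.
idx = pd.date_range("2020-03-08 01:00", periods=3,
                    freq=pd.Timedelta(hours=1), tz="US/Central")

# Fixed offset computed from the first element only (here -6 hours).
first = idx[0]
offset = first.replace(tzinfo=None) - first.tz_convert("UTC").replace(tzinfo=None)
fixed = idx.tz_convert(None) + offset

# Built-in conversion, which uses the correct offset for each element.
builtin = idx.tz_localize(None)

print(list(fixed))
print(list(builtin))
```

The fixed-offset result contains 02:00, a wall-clock time that did not exist in that zone on that day, while tz_localize(None) skips from 01:00 to 03:00.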

The most important thing is to add tzinfo when you define a datetime object.
from datetime import datetime, timezone
from tzinfo_examples import HOUR, Eastern
u0 = datetime(2016, 3, 13, 5, tzinfo=timezone.utc)
for i in range(4):
    u = u0 + i*HOUR
    t = u.astimezone(Eastern)
    print(u.time(), 'UTC =', t.time(), t.tzname())

How I handled this problem with a 15-min frequency datetimeindex in europe.
If you are in the situation where you have a timezone aware (Europe/Amsterdam in my case) index and want to convert it into a timezone naive index by transforming everything into local time, you will have dst problems, namely
there will be 1 hour missing on the last sunday of march (when europe switches to summer time)
there will be 1 hour duplicated on the last sunday of october (when europe switches back to winter time)
Here is how you can handle it:
# make index tz naive
df.index = df.index.tz_localize(None)
# handle dst
if df.index[0].month == 3:
    # last sunday of march, one hour is lost
    df = df.resample("15min").ffill()
if df.index[0].month == 10:
    # in october, one hour is added
    df = df[~df.index.duplicated(keep='last')]
Note: in my case, I run the above code on a df that contains only a single month, hence I do df.index[0].month to find out the month. If yours contains more months, you should probably index it differently to know when to do DST.
For march, it consists of resampling and forward-filling from the last valid value, to avoid losing the 1 hour (all my data is in 15 min intervals, hence I resample like that; resample with whatever your interval is). For october, I drop the duplicates.
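The gap and the duplicate can be reproduced on a small hourly series. This is a sketch under assumptions: the 2021 transition dates for Europe/Amsterdam, hourly rather than 15-minute data, and made-up values.

```python
import pandas as pd

hour = pd.Timedelta(hours=1)

# October: 02:00 occurs twice once the zone is dropped.
oct_idx = pd.date_range("2021-10-31 00:00", periods=5, freq=hour,
                        tz="Europe/Amsterdam")
oct_s = pd.Series(range(5), index=oct_idx.tz_localize(None))
oct_clean = oct_s[~oct_s.index.duplicated(keep="last")]

# March: 02:00 is missing once the zone is dropped.
mar_idx = pd.date_range("2021-03-28 00:00", periods=4, freq=hour,
                        tz="Europe/Amsterdam")
mar_s = pd.Series(range(4), index=mar_idx.tz_localize(None))
mar_filled = mar_s.resample(hour).ffill()

print(oct_clean)
print(mar_filled)
```

Keeping the last duplicate retains the winter-time reading for the repeated hour, and the forward fill copies the 01:00 value into the missing 02:00 slot.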
