Convert UTC timestamp to local timezone issue in pandas - python

I'm trying to convert a Unix UTC timestamp to a local date format in Pandas. I've been looking through a few solutions but I can't quite get my head around how to do this properly.
I have a dataframe with multiple UTC timestamp columns which all need to be converted to a local timezone. Let's say Europe/Berlin.
I first convert all the timestamp columns into valid datetime columns with the following adjustments:
df['date'] = pd.to_datetime(df['date'], unit='s')
This works and gives me output such as 2019-01-18 15:58:25. If I now try to adjust the timezone for this datetime, I have tried both:
df['date'].tz_localize('UTC').tz_convert('Europe/Berlin')
and
df['date'].tz_convert('Europe/Berlin')
In both cases the error is: TypeError: index is not a valid DatetimeIndex or PeriodIndex and I don't understand why.
The problem must be that the DateTime column is not on the index. But even when I use df.set_index('date') and then try the above options, it doesn't work and I get the same error.
Also, even if that worked, it seems this method would only adjust the indexed DateTime. How would I then adjust the other columns that need adjustment?
Looking to find some information on how to best approach these issues once and for all! Thanks

When the column is not the index, you need to go through the .dt accessor:
df['date'] = df['date'].dt.tz_localize('UTC').dt.tz_convert('Europe/Berlin')
If the datetime column is the index instead, you can call tz_localize and tz_convert on the index directly, without .dt.
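If several columns need the same conversion, the same pattern extends to a loop (the column names here are illustrative):
# assuming these columns were already parsed with pd.to_datetime(..., unit='s')
for col in ['date', 'created_at', 'updated_at']:
    df[col] = df[col].dt.tz_localize('UTC').dt.tz_convert('Europe/Berlin')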

Related

How do I select all rows of a specific date in a dataframe which contains a datetime column with multiple times on that date? [duplicate]

I use pandas.to_datetime to parse the dates in my data. Pandas by default represents the dates with datetime64[ns] even though the dates are all daily only.
I wonder whether there is an elegant/clever way to convert the dates to datetime.date or datetime64[D] so that, when I write the data to CSV, the dates are not appended with 00:00:00. I know I can convert the type manually element-by-element:
[dt.to_datetime().date() for dt in df.dates]
But this is really slow since I have many rows and it sort of defeats the purpose of using pandas.to_datetime. Is there a way to convert the dtype of the entire column at once? Or alternatively, does pandas.to_datetime support a precision specification so that I can get rid of the time part while working with daily data?
Since version 0.15.0 this can now be easily done using .dt to access just the date component:
df['just_date'] = df['dates'].dt.date
The above returns a series of datetime.date objects (object dtype); if you want a datetime64 dtype instead, you can normalize the time component to midnight, which sets all the values to 00:00:00:
df['normalised_date'] = df['dates'].dt.normalize()
This keeps the dtype as datetime64, but the display shows just the date value.
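For illustration, a minimal self-contained sketch (the data is made up):
import pandas as pd

df = pd.DataFrame({'dates': pd.to_datetime(['2019-01-18 15:58:25', '2019-01-19 08:30:00'])})
df['just_date'] = df['dates'].dt.date               # object dtype holding datetime.date values
df['normalised_date'] = df['dates'].dt.normalize()  # datetime64[ns], time set to 00:00:00
print(df.dtypes)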
pandas: .dt accessor
pandas.Series.dt
Simple Solution:
df['date_only'] = df['date_time_column'].dt.date
While I upvoted EdChum's answer, which is the most direct answer to the question the OP posed, it does not really solve the performance problem (it still relies on python datetime objects, and hence any operation on them will not be vectorized - that is, it will be slow).
A better performing alternative is to use df['dates'].dt.floor('d'). Strictly speaking, it does not "keep only the date part", since it just sets the time to 00:00:00. But it does work as desired by the OP when, for instance:
printing to screen
saving to csv
using the column to groupby
... and it is much more efficient, since the operation is vectorized.
EDIT: in fact, the answer the OP would have preferred is probably "recent versions of pandas do not write the time to csv if it is 00:00:00 for all observations".
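A minimal sketch of the floor approach (the data is made up), showing that the dtype stays datetime64 while the time is zeroed:
import pandas as pd

s = pd.Series(pd.to_datetime(['2019-01-18 15:58:25', '2019-01-19 08:30:00']))
print(s.dt.floor('d'))  # datetime64[ns] with times set to 00:00:00, stays vectorized
print(s.dt.date)        # object dtype of datetime.date, slow in later operations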
Pandas v0.13+: Use to_csv with date_format parameter
Avoid, where possible, converting your datetime64[ns] series to an object dtype series of datetime.date objects. The latter, often constructed using pd.Series.dt.date, is stored as an array of pointers and is inefficient relative to a pure NumPy-based series.
Since your concern is format when writing to CSV, just use the date_format parameter of to_csv. For example:
df.to_csv(filename, date_format='%Y-%m-%d')
See Python's strftime directives for formatting conventions.
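For example, a self-contained sketch (the filename and data are illustrative):
import pandas as pd

df = pd.DataFrame({'dates': pd.to_datetime(['2019-01-18 15:58:25', '2019-01-19 08:30:00'])})
df.to_csv('out.csv', date_format='%Y-%m-%d')  # written as 2019-01-18, no time component
The column keeps its datetime64[ns] dtype in memory; only the text written to the file changes.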
This is a simple way to extract the date:
import pandas as pd
d='2015-01-08 22:44:09'
date=pd.to_datetime(d).date()
print(date)
Pandas DatetimeIndex and Series have a method called normalize that does exactly what you want.
You can read more about it in this answer.
It can be used as ser.dt.normalize()
Just giving a more up-to-date answer in case someone sees this old post.
Dropping the timezone component with tz_localize(None) keeps just the naive datetime in a datetime64[ns] column:
df['Date'] = pd.to_datetime(df['Date']).dt.tz_localize(None)
You will be able to save it in excel without getting the error "ValueError: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel."
df['Column'] = df['Column'].dt.strftime('%m/%d/%Y')
This will give you just the dates, with NO time, in your desired format. You can change the format according to your needs ('%m/%d/%Y' here); note that it changes the data type of the column to 'object' (strings).
If you want just the dates and DO NOT want time in YYYY-MM-DD format use :
df['Column'] = pd.to_datetime(df['Column']).dt.date
The datatype will be 'object'.
For 'datetime64' datatype, use:
df['Column'] = pd.to_datetime(df['Column']).dt.normalize()
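To make the dtype differences concrete, a small sketch (the data is made up):
import pandas as pd

s = pd.to_datetime(pd.Series(['2019-01-18 15:58:25']))
print(s.dt.strftime('%m/%d/%Y'))  # object dtype: strings such as '01/18/2019'
print(s.dt.date)                  # object dtype: datetime.date values
print(s.dt.normalize())           # datetime64[ns]: 2019-01-18 00:00:00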
Converting to datetime64[D]:
df.dates.values.astype('M8[D]')
Though re-assigning that to a DataFrame col will revert it back to [ns].
If you wanted actual datetime.date:
import datetime
import numpy as np

dt = pd.DatetimeIndex(df.dates)
dates = np.array([datetime.date(*date_tuple) for date_tuple in zip(dt.year, dt.month, dt.day)])
I wanted to be able to change the type for a set of columns in a data frame and then remove the time, keeping only the day. round(), floor() and ceil() all work:
df[date_columns] = df[date_columns].apply(pd.to_datetime)
df[date_columns] = df[date_columns].apply(lambda t: t.dt.floor('d'))
On tables of >1000000 rows I've found that these are both fast, with floor just slightly faster:
df['mydate'] = df.index.floor('d')
or
df['mydate'] = df.index.normalize()
If your index has timezones and you don't want those in the result, do:
df['mydate'] = df.index.tz_localize(None).floor('d')
df.index.date is many times slower; to_datetime() is even worse. Both have the further disadvantage that the results cannot be saved to an hdf store as it does not support type datetime.date.
Note that I've used the index as the date source here; if your source is another column, you would need to add .dt, e.g. df.mycol.dt.floor('d')
This worked for me on UTC timestamps (e.g. 2020-08-19T09:12:57.945888):
for di, i in enumerate(df['YourColumnName']):
    df.at[di, 'YourColumnName'] = pd.Timestamp(i)  # assumes a default integer index
Note that the vectorized df['YourColumnName'] = pd.to_datetime(df['YourColumnName']) is much faster and avoids element-wise assignment.
If the column is not already in datetime format:
df['DTformat'] = pd.to_datetime(df['col'])
Once it's in datetime format you can convert the entire column to date only like this:
df['DateOnly'] = df['DTformat'].apply(lambda x: x.date())

How can I subset based on smaller than time XYZ?

I am trying to create a new data frame that excludes all timestamps later than 12:00:00. I tried multiple approaches but I keep running into an issue. The time was previously part of a datetime column that I split into two columns, a date and a time column (format datetime.time).
Code:
Error raised:
Do you have any suggestions to be able to do this properly?
Alright, the solution:
Change the dtype from datetime.time back to datetime
Use the between method, setting the datetime column as the index and specifying the range
The result is a subset of the dataframe restricted to the needed time frame, as sketched below.
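A minimal sketch of those steps, assuming the split columns are named 'Date' and 'Time' (the names are illustrative); pandas' between_time does the filtering once the datetimes are on the index:
import pandas as pd

# step 1: rebuild full datetimes from the split date and time columns
df['Datetime'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Time'].astype(str))
df = df.set_index('Datetime')                     # step 2: put the datetimes on the index
subset = df.between_time('00:00:00', '12:00:00')  # step 3: keep rows at or before noon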

Using .loc on DatetimeIndex to retrieve a value on a specific date (KeyError

I am trying to retrieve a specific value using .loc on a dataframe. This used to work, but I've upgraded to the most recent version of pandas and it is no longer working. See the sample data and code below; does anyone know what's going on here?
Sample Data
Open High Low Close Volume
Date
2020-09-24 2906.500000 2962.000000 2871.000000 2960.469971 6117900
2020-09-25 3033.840088 3133.989990 3000.199951 3128.989990 6948800
Code
import pandas as pd
from datetime import date, timedelta
import datetime
yesterday = date.today() - timedelta(3)
symbol_data.loc[yesterday]['Close']
In the past it would retrieve the value "3128.989990", which is the Close value on 2020-09-25.
Now I get "KeyError: datetime.date(2020, 9, 25)".
When I look at the index, it shows DatetimeIndex(['2020-09-24', '2020-09-25'], dtype='datetime64[ns]', name='Date', freq=None)
If I pass the string value, it works. But I need to use my variable to calculate a date.
symbol_data.loc['2020-09-25']['Close'] ##this works, but I don't want to use a hard coded date
Recent pandas versions don't allow .loc and .at indexing with a python datetime.date object. I got hit by this myself, so I knew. You need to convert it to a pandas Timestamp, or use a string as you already discovered. To wrap it as a pandas Timestamp, just pass the variable to pd.Timestamp:
In [44]: print(df.loc[pd.Timestamp(date.today() - timedelta(3)), 'Close'])
Output:
3128.98999
Similar to what Trenton suggested, but chained with normalize to get the exact date. Also, try to avoid chained indexing when possible:
yesterday = pd.Timestamp.now().normalize() - pd.Timedelta(days=3)
df.loc[yesterday, 'Close']
# out
# 3128.98999

How to fix weird date values to datetime type in pandas

I'm new to Python and dataframes.
I have a date value that is not formatted as date. Since this value has a 'weird' format, pandas' function to_datetime() doesn't work properly. The values are formatted like:
['20190630', '20190103']
This is the 'yyyymmdd' format.
I have tried to slice the values and make separate columns extracting the year, month and day. But this didn't work, since the slicing wasn't behaving as expected. This is the code I have now, but it isn't doing anything:
df.Date = pd.to_datetime(df.date)
I would like to have the dd-mm-yyyy format and datetime type. What can I do?
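One hedged sketch (assuming the raw column is named 'Date'): to_datetime accepts an explicit format string, which handles the 'yyyymmdd' layout directly:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%Y%m%d')  # parse 'yyyymmdd' strings
df['Date_display'] = df['Date'].dt.strftime('%d-%m-%Y')   # 'dd-mm-yyyy' text, object dtype
Note that the dd-mm-yyyy rendering is necessarily a string column; the underlying datetime64 values have no display format of their own.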

Python Pandas: Overwriting an Index with a list of datetime objects

I have an input CSV with timestamps in the header like this (the number of timestamps forming columns is several thousand):
header1;header2;header3;header4;header5;2013-12-30CET00:00:00;2013-12-30CET00:01:00;...;2014-00-01CET00:00:00
In Pandas 0.12 I was able to do this to convert string timestamps into datetime objects. The following code strips out the 'CEST' in the timestamp string (translate()), reads it in as a datetime (strptime()) and then localizes it to the correct timezone (localize()). [The reason for this approach was that, with the versions I had at least, CEST wasn't being recognised as a timezone.]
DF = pd.read_csv('some_csv.csv',sep=';')
transtable = string.maketrans(string.uppercase,' '*len(string.uppercase))
tz = pytz.country_timezones('nl')[0]
timestamps = DF.columns[5:]
timestamps = map(lambda x:x.translate(transtable), timestamps)
timestamps = map(lambda x:datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'), timestamps)
timestamps = map(lambda x: pytz.timezone(tz).localize(x), timestamps)
DF.columns[5:] = timestamps
However, my downstream code required that I run off of pandas 0.16
While running on 0.16, I get this error with the above code at the last line of the above snippet:
*** TypeError: Indexes does not support mutable operations
I'm looking for a way to overwrite my index with the datetime object. Using the method to_datetime() doesn't work for me, returning:
*** ValueError: Unknown string format
I have some subsequent code that copies, then drops, the first few columns of data in this dataframe (all the 'header1; header2; header3' columns), leaving just the timestamps. The purpose is then to transpose and index by the timestamp.
So, my question:
Either:
how can I overwrite a series of column names with a datetime, such that I can pass in a pre-arranged set of timestamps that pandas will be able to recognise as a timestamp in subsequent code (in pandas v0.16)
Or:
Any other suggestions that achieve the same effect.
I've explored set_index(), replace(), to_datetime() and reindex(), and possibly some others, but none seem to be able to achieve this overwrite. Hopefully this is simple to do, and I'm just missing something.
TIA
I ended up solving this by the following:
The issue was that I had several thousand column headers with timestamps, that I couldn't directly parse into datetime objects.
So, in order to get these timestamp objects incorporated, I added a new column called 'Time', included the datetime objects there, and then set the index to the new column (I'm omitting the code where I purged the rows of other header data through drop() methods):
DF = DF.transpose()
DF['Time'] = timestamps
DF = DF.set_index('Time')
Summary: If you have a CSV with a set of timestamps in your headers that you cannot parse; a way around this is to parse them separately, include in a new column of Time with the correct datetime objects, then set_index() based on the new column.
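For reference, on a current pandas/Python 3 stack the header parsing itself can be sketched roughly like this (the letter-stripping mirrors the original translate() trick and is an assumption, as is the 'Europe/Amsterdam' zone taken from pytz.country_timezones('nl')):
import pandas as pd

stripped = DF.columns[5:].str.replace(r'[A-Z]+', ' ', regex=True)  # drop the CET/CEST token
# DST-ambiguous stamps may need tz_localize(..., ambiguous='infer')
timestamps = pd.to_datetime(stripped, format='%Y-%m-%d %H:%M:%S').tz_localize('Europe/Amsterdam')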
