I'm working with a large dataset in pandas that needs to do a lot of time calculations.
I was previously formatting and keeping time in epoch format, but I discovered the to_datetime functionality of pandas and want to switch to it. The issue is that it isn't handling my timezone correctly.
Sample datetime:
2015-03-01T15:41:53.825992-06:00
After parsing with pandas to_datetime:
3/1/2015 21:41:53.825992
It isn't keeping the times in US/Central; instead it converts them to GMT.
Another issue is that some rows contain an array of times, like so:
Index singleTime arrayTime
0 3/1/2015 21:41:53.825992 [3/1/2015 21:41:53.825992, 3/1/2015 21:44:53.825992,...,3/1/2015 21:49:53.825992]
1 3/1/2015 22:43:53.825992 [3/1/2015 22:43:53.825992, 3/1/2015 22:44:53.825992,...,3/1/2015 22:49:53.825992]
Currently I parse the times using:
pd.to_datetime(timeString)
Then I iterate through the data frame column:
newTimearray = []
for each in df.singleTime:
    newTimearray.append(each.tz_localize('UTC').tz_convert('US/Central'))
df.singleTime = newTimearray
I suspect this isn't very efficient, and it doesn't work for the array of times. A lot of the solutions I have seen hinge on the time being an index. I can index off one time, but either way I will have multiple time columns, and items I need to convert that aren't an index.
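For reference, the closest vectorized version I could piece together looks like the sketch below (it reuses the column names from the sample, treats the already-parsed naive values as UTC, and still handles the list column per cell), but I'm not sure it's the right approach:

import pandas as pd

# Vectorized conversion of the scalar column: utc=True treats the naive,
# already-parsed values as UTC, then tz_convert shifts them to US/Central.
df['singleTime'] = pd.to_datetime(df['singleTime'], utc=True).dt.tz_convert('US/Central')

# The list column still has to be handled per cell, e.g. with apply:
df['arrayTime'] = df['arrayTime'].apply(
    lambda times: [t.tz_localize('UTC').tz_convert('US/Central') for t in times]
)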
So how do I effectively convert all time items formatted like this?
Related
My objective is to create the following pandas dataframe (with the 'date_time' column in '%Y-%m-%d %H:%M:%S%z' format):
batt_no date_time
3 4 2019-09-19 20:59:06+00:00
4 5 2019-09-19 23:44:07+00:00
5 6 2019-09-20 00:44:06+00:00
6 7 2019-09-20 01:14:06+00:00
But the constraint is that I don't want to first create a dataframe as follows and then convert the 'date_time' column into the above format.
batt_no date_time
3 4 1568926746
4 5 1568936647
5 6 1568940246
6 7 1568942046
I need to directly create it by converting two lists of values into the desired dataframe.
The following is what I've tried, but I get an error
(please note: the 'date_time' values are in epoch format, which I need to specify, and they should be converted into the '%Y-%m-%d %H:%M:%S%z' format):
pd.DataFrame({'batt_volt': [4, 5, 6, 7],
              'date_time': [1568926746, 1568936647, 1568940246, 1568942046].dt.strftime('%Y-%m-%d %H:%M:%S%z')},
             index=[3, 4, 5, 6])
Can anyone please help?
Edit Note: My question is different from the one asked here.
The question there deals with the conversion of a single pandas datetime value to a unix timestamp. Mine is different because:
My timestamp values are slightly different from any of the types mentioned there.
I don't need to convert any single timestamp value; rather, I need to create a full-fledged dataframe containing values of the desired timestamp format, built in a particular manner from lists, as I've clearly described in my question.
I've clearly stated the way I've attempted the process; it just requires some modifications to run without error, which is in no way similar to the question asked in the aforementioned link.
Hence, my question is definitely different. I'd request that it kindly be reopened.
As suggested, I'm posting the solution from the comments as an answer here.
pd.DataFrame({'batt_volt': [4, 5, 6, 7],
              'date_time': pd.to_datetime([1568926746, 1568936647, 1568940246, 1568942046],
                                          unit='s', utc=True).strftime('%Y-%m-%d %H:%M:%S%z')},
             index=[3, 4, 5, 6])
pd.to_datetime works with a date or a list of dates, and the input dates can be in many formats, including epoch ints. The unit keyword ensures that those ints are interpreted as a number of seconds since 1970-01-01, not milliseconds, microseconds or nanoseconds.
So it is easy to use it to create the list of dates directly while building the DataFrame.
Since a list of strings in a specific format was wanted (by the way, outside of any specific context, I maintain that it is probably preferable to store datetimes and convert to strings only for I/O operations, but I don't know the specific context here), we can call .strftime on the result. When to_datetime is called with a list it returns a DatetimeIndex, and .strftime also works on those: it is applied to every datetime in the index, so we get a list of strings in the wanted format.
The last remaining problem was the timezone, and here there is no perfect solution, because a plain int like the ones we started with does not carry a timezone. By default, to_datetime creates datetimes without a timezone (just as those ints are), so they are relative to whatever timezone the user decides they are.
to_datetime can also create timezone-aware datetimes, but only in UTC. That is done with the keyword argument utc=True.
With that we get timezone-aware datetimes, assuming that the ints we provided were counts of seconds since 1970-01-01 00:00:00+00:00.
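To illustrate the point about storing datetimes and formatting only at output time, here is a minimal sketch (it reuses the values from the question; the final strftime is only for display):

import pandas as pd

# Keep timezone-aware datetimes in the frame...
df = pd.DataFrame(
    {'batt_volt': [4, 5, 6, 7],
     'date_time': pd.to_datetime([1568926746, 1568936647, 1568940246, 1568942046],
                                 unit='s', utc=True)},
    index=[3, 4, 5, 6],
)

# ...and format to strings only when writing out or displaying.
print(df['date_time'].dt.strftime('%Y-%m-%d %H:%M:%S%z'))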
I am trying to create a new data frame that excludes all timestamps later than 12:00:00. I have tried multiple approaches, but I keep running into an issue. The time was previously part of a datetime column that I split into two columns, a date column and a time column (of type datetime.time).
Code:
Issue thrown out:
Do you have any suggestions on how to do this properly?
Alright, the solution:
Change the dtype from datetime.time back to datetime.
Use the between method by setting the datetime column as the index and setting the range.
The result is a subset of the dataframe restricted to the needed time frame.
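A minimal sketch of those steps, assuming a hypothetical 'date_time' column that still holds (or can be re-parsed into) full datetimes, and reading the "between" step as DataFrame.between_time, which selects by time of day on a DatetimeIndex:

import pandas as pd

# 1. make sure the column is real datetimes again (hypothetical column name)
df['date_time'] = pd.to_datetime(df['date_time'])

# 2. set it as the index and keep only rows whose time of day is at or before noon
subset = df.set_index('date_time').between_time('00:00:00', '12:00:00')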
I have a few different data frames with a similar structure, but their contents are slightly different. Specifically, the datetime format of the datetime fields differs: in some cases the timestamps are timezone-aware, in others they are not. I need to find the minimum range of timestamps that overlaps all three dataframes, so that the final dataframes cover exactly the same time periods.
The approach I wanted to take was to take the minimum start time from the starttime timestamps in each dataframe and then take the max of those, and then repeat (but invert) the process for the endtimes. However, when I do this I get an error indicating that I cannot compare timestamps with different timezone awareness. I've taken a few different approaches, such as using tz_convert on the timestamp series, as below:
model_output_dataframes['workRequestSplitEndTime'] = pd.to_datetime(
    model_output_dataframes['workRequestSplitEndTime'], infer_datetime_format=True
).tz_convert(None)
This generates the error:
TypeError: index is not a valid DatetimeIndex or PeriodIndex
So I tried converting it into a DatetimeIndex first and then converting the timezone:
model_output_dataframes['workRequestSplitEndTime'] = pd.DatetimeIndex(
    pd.to_datetime(model_output_dataframes['workRequestSplitEndTime'], infer_datetime_format=True)
).tz_convert(None)
and this generates a separate error:
ValueError: cannot reindex from a duplicate axis
So at this point I'm somewhat stuck; I feel like after all my conversions I'm back where I started.
I would appreciate any help you can give me.
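For completeness, the per-column normalisation I'm aiming for looks roughly like the sketch below, written with the Series .dt accessor (it assumes that treating the naive values as UTC is acceptable; the column name is taken from the attempts above):

import pandas as pd

# utc=True converts aware timestamps to UTC and treats naive ones as UTC,
# then tz_localize(None) drops the timezone so everything is comparable.
col = pd.to_datetime(model_output_dataframes['workRequestSplitEndTime'], utc=True)
model_output_dataframes['workRequestSplitEndTime'] = col.dt.tz_localize(None)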
I am trying to convert a set of datetime data in Python.
The pandas data frame looks like the following:
dates dates2 dates3
2011-05-09 20110509 05.2011.09
2011-09-23 20110923 09.2011.23
2012-04-30 20120430 04.2012.30
2014-09-12 20140912 09.2014.12
Is there an elegant way to automatically convert this data from object dtype into a datetime data type, without having to loop through each element of each column?
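For reference, per-column parsing with explicit formats would be one loop-free option, sketched below (the format strings are inferred from the sample above, and the columns are assumed to hold strings):

import pandas as pd

# each column gets its own explicit format; no element-wise Python loop
df['dates']  = pd.to_datetime(df['dates'],  format='%Y-%m-%d')
df['dates2'] = pd.to_datetime(df['dates2'], format='%Y%m%d')
df['dates3'] = pd.to_datetime(df['dates3'], format='%m.%Y.%d')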
I have a dataset with around 1 million rows and I'd like to convert 12 columns to datetime. Currently they are "object" type. I previously read that I could do this with:
data.iloc[:,7:19] = data.iloc[:,7:19].apply(pd.to_datetime, errors='coerce')
This does work, but the performance is extremely poor. Someone else mentioned that it could be sped up with:
def lookup(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date) for date in s.unique()}
    return s.apply(lambda v: dates[v])
However, I'm not sure how to apply this code to my data (I'm a beginner). Does anyone know how to speed up changing many columns to datetime using this code or any other method? Thanks!
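For what it's worth, one way to apply that function to the same column slice might be a per-column apply, sketched below (this assumes the lookup function above is defined and that the 12 columns are data.iloc[:, 7:19], as in the earlier snippet):

# each of the 12 columns is passed to lookup as a Series and replaced by its
# parsed version; repeated date strings are only parsed once per column
data.iloc[:, 7:19] = data.iloc[:, 7:19].apply(lookup)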
If all your dates have the same format, you can define a date-parsing function and pass it as an argument when you import the data. First import datetime, then use datetime.strptime with your format defined.
Once that function is defined, set the parse_dates option in pandas when reading the file, and pass your function with the date_parser option (date_parser=yourfunction).
I would look up the pandas API docs for the exact syntax.
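A rough sketch of what that looks like with pd.read_csv (the file name, column name and format string are placeholders, and note that newer pandas versions deprecate date_parser in favour of other options):

import pandas as pd
from datetime import datetime

# placeholder format and column name; adjust to the actual data
dateparse = lambda s: datetime.strptime(s, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv('data.csv',
                 parse_dates=['date_time'],
                 date_parser=dateparse)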