Convert the following time info to something that pyplot can recognise - python

I have a DataFrame with two columns of time information. The first is the epoch time in seconds, and the second is the corresponding formatted str time like "2015-06-01T09:00:00+08:00" where "+08:00" denotes the timezone.
I'm aware that time formats are a bit of a mess in Python, and that matplotlib.pyplot seems to only recognise the datetime format. I tried several ways to convert the str time to datetime, but none of them worked: pd.to_datetime converts to datetime64, pd.Timestamp converts to Timestamp, and even combinations of the two always produced either datetime64 or Timestamp, never datetime. I also tried the method suggested in this answer; it didn't work either. It's kind of driving me up the wall now.
Could anybody suggest a quick way to do this? Thanks!
I post a minimal example below:
import matplotlib.pyplot as plt
import time
import pandas as pd
df = pd.DataFrame([[1433120400, "2015-06-01T09:00:00+08:00"]], columns=["epoch", "strtime"])
# didn't work
df["usable_time"] = pd.to_datetime(df["strtime"])
# didn't work either
df["usable_time"] = pd.to_datetime(df["strtime"].apply(lambda s: pd.Timestamp(s)))
# produced a strange type called "struct_time". Don't think it'd be compatible with pyplot
df["usable_time"] = df["epoch"].apply(lambda x: time.localtime(x))
# attempted to plot with pyplot
df["usable_time"] = pd.to_datetime(df["strtime"])
plt.plot(x=df["usable_time"], y=[0.123])
plt.show()

UPDATE (per comments)
It seems like the confusion here stems from the fact that plt.plot() takes positional x/y arguments rather than keyword arguments. In other words, the appropriate signature is:
plt.plot(x, y)
Or, alternately:
plt.plot('x_label', 'y_label', data=obj)
But not:
plt.plot(x=x, y=y)
There's a separate discussion here of why this quirk of pyplot exists; also see ImportanceOfBeingErnest's comments below.
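Applied to the question's minimal example, the fix is just to pass x and y positionally (a small sketch; the 0.123 y-value is kept from the question):
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame([[1433120400, "2015-06-01T09:00:00+08:00"]], columns=["epoch", "strtime"])
df["usable_time"] = pd.to_datetime(df["strtime"])
plt.plot(df["usable_time"], [0.123], marker="o")  # positional, not x=/y=
plt.show()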
Original
This isn't really an answer, more of a demonstration that Pyplot doesn't have an issue with Pandas datetime data. I've added an extra row to df to make the plot clearer:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[1433120400, "2015-06-01T09:00:00+08:00"],
                   [1433130400, "2015-07-01T09:00:00+08:00"]],
                  columns=["epoch", "strtime"])
df["usable_time"] = pd.to_datetime(df["strtime"])
df.dtypes
epoch                   int64
strtime                object
usable_time    datetime64[ns]
dtype: object
plt.plot(df.usable_time, df.epoch)
import matplotlib
pd.__version__          # '0.23.3'
matplotlib.__version__  # '2.2.2'

You can use to_pydatetime (from the dt accessor or Timestamp) to get back native datetime objects if you really want to, e.g.:
pd.to_datetime(df["strtime"]).dt.to_pydatetime()
This will return an array of native datetime objects:
array([datetime.datetime(2015, 6, 1, 1, 0)], dtype=object)
However, pyplot seems to be able to work with pandas datetime series.

Related

How to convert string from csv to hour(s):minute(s)?

This link shows my csv file and graph.
I want to represent the AVG number (which are seconds actually) as hour(s):minute(s) on y axis.
I'm starting to think it cannot be solved; I've spent 3 days with this problem. To be more precise: apart from a lot of conversions with datetime, timedelta and timestamp, nothing worked. Either the data could not be shown on the y axis because it was not a number-like variable to plot, or I got no proper representation of the data. I tried something like converting the seconds with divmod and then putting them on top of the bars with annotate. Later I used Timple. I do not understand how I should create an acceptable datatype for this.
I made something similar using pandas DataFrame.plot:
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> df = pd.DataFrame()
>>> df["activity"] = ['run', 'swim', 'drive']
>>> df["avg"] = [86400,43200,21600]
>>> df
  activity    avg
0      run  86400
1     swim  43200
2    drive  21600
>>> df.plot.bar(x="activity")
<AxesSubplot: xlabel='activity'>
>>> plt.show()
To represent the time elapsed for a certain number of seconds you could use fromtimestamp with strftime formatting, but the result might not be compatible with matplotlib. Timple does something related by teaching matplotlib to plot timedeltas directly, although the graph may still need further work, e.g. exploring the data or applying a certain statistical procedure.
>>> import datetime
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import timple
>>> tmpl = timple.Timple()
>>> tmpl.enable()
>>> timedeltas = np.array([datetime.timedelta(seconds=(s)) for s in df["avg"]])
>>> timedeltas
array([datetime.timedelta(days=1), datetime.timedelta(seconds=43200),
       datetime.timedelta(seconds=21600)], dtype=object)
>>> plt.plot(timedeltas, df["activity"])
[<matplotlib.lines.Line2D object at 0x0000026FAC3F5B40>]
>>> plt.show()
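If you would rather stay with plain matplotlib and skip Timple, one alternative (my sketch, not part of the answer above) is to keep the seconds numeric and only format the y-axis tick labels as hour(s):minute(s) with matplotlib's FuncFormatter:
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd
df = pd.DataFrame({"activity": ["run", "swim", "drive"],
                   "avg": [86400, 43200, 21600]})
fig, ax = plt.subplots()
ax.bar(df["activity"], df["avg"])
# Render raw seconds as H:MM tick labels.
def seconds_to_hm(x, pos):
    hours, remainder = divmod(int(x), 3600)
    return f"{hours}:{remainder // 60:02d}"
ax.yaxis.set_major_formatter(mticker.FuncFormatter(seconds_to_hm))
plt.show()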

Plotting with datetime64[ns] objects in Seaborn

I have a large (> 1 mil rows) dataset that has datetime timestamps inside of it. I want to look at trends that may occur throughout the day. So to start if I do: print(df['timestamp']) it will show my data as:
0   2014-01-01 13:11:50.3
1   2011-02-13 04:12:45.0
Name: timestamp, Length: 1000000, dtype: datetime64[ns]
However, I do not want the date there, as I only want to plot trends throughout the day, without caring what day it is. So I do this line of code:
df['timestamp'] = pd.Series([val.time() for val in df['timestamp']]). This gives me the desired time-only data, but the dtype becomes the generic 'object', which I cannot plot. For example, when I try Seaborn: sns.lineplot(df['timestamp'], df['Task_Length']), I get the error "TypeError: Invalid object type at position 0".
BUT if I run the same exact sns.lineplot(df['timestamp'], df['Task_Length']) without the intermediary step of cutting off the date, leaving the column as datetime64[ns] rather than the generic 'object' dtype, it plots fine. However, this results in a plot spanning multiple years, whereas I only want to see time-of-day trends.
For clarity: this is a pandas DataFrame where each row has a task that occurs (generically, a 'TaskName' column), each associated with a 'timestamp' as explained above, and I want to use Seaborn to analyse daily trends, such as different task types happening at different times of the day, without caring about the day of the year. Thanks for any help.
Edit: one more thing I tried. Using the original datetime64[ns] column that does plot, I called sns.lineplot(df['timestamp'].dt.time, df['Task_Length']), which gave the same error as cutting off the date beforehand. Can't figure out why Seaborn doesn't like just the time component.
This works for me. The difference is in converting the "timestamp" column from datetime to a formatted time string:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame([['2014-01-01 13:11:50.3',10],['2011-02-13 04:12:45.0',15]], columns=['timestamp','Task_Length'])
df['timestamp'] = pd.to_datetime(df['timestamp']).dt.strftime('%H:%M:%S')
sns.lineplot(df['timestamp'], df['Task_Length'])
plt.show()
Refer to this question for further details:
Plot datetime.time in seaborn
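One caveat (my addition, not part of the answer): strftime turns the column into strings, which seaborn treats as categorical, so with many distinct times the axis can get crowded. A sketch of an alternative that keeps the axis numeric, using seconds since midnight plus a tick formatter:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
df = pd.DataFrame([['2014-01-01 13:11:50.3', 10], ['2011-02-13 04:12:45.0', 15]],
                  columns=['timestamp', 'Task_Length'])
ts = pd.to_datetime(df['timestamp'])
# Seconds since midnight: numeric and day-agnostic.
df['secs'] = ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second
ax = sns.lineplot(x='secs', y='Task_Length', data=df)
ax.xaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, pos: f"{int(x) // 3600:02d}:{(int(x) % 3600) // 60:02d}"))
plt.show()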

Python: DateTime-Objects can be plotted in matplotlib, but only sometimes?

So, I made a DataFrame that looks like this:
The DF is chronologically ordered by the DateTime objects. These DateTimes are generated by transforming the column "attributes.timestamp", which contains timestamps as strings:
df["DateTime"] = df["attributes.timestamp"].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ'))
The corresponding y-values is a counter that counts objects within the corresponding minute.
When I try to plot this DF in matplotlib, it actually works. It takes the datetime-objects as x-values and plots the counts for that minute as follows:
Now of course it looks dumb to have the full DateTime object shown on the x-axis. It shows month, day and hour in that order (in the example it's the 2nd of March from 2 pm to 8 pm). I want it to show JUST the hours (or at least just the time, not the entire date that comes with it). So I tried to add a new column (called "Time") to the DF, which would extract the time from the DateTime column using the following code:
df["Time"] = df["DateTime"].time()
However, that doesn't work; it gives me the attribute error "'Series' object has no attribute 'time'". So I tried something else: I repeated the code I used earlier to generate the DateTime objects and appended ".time()" to it.
df["Time"] = df["attributes.timestamp"].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').time())
I have no idea why, but now it works just fine and I was able to extract the time from my DateTime objects:
My next idea was to use the "Time" column on the x-axis instead of the whole DateTime for plotting, with the y-values from the counter staying the same. But for some reason that doesn't work; it gives me the following error: TypeError: float() argument must be a string or a number, not 'datetime.time'
Strangely enough, that was no problem when plotting with the whole DateTime object. I don't know why the extracted time would be a problem, since it is a chronologically ordered value as well.
My question is: why on earth does my approach not work? And is there any way around this?
Matplotlib supports plotting pandas DatetimeIndex, as well as numpy datetime64 objects, but not sequences of datetime.time. In addition, df["Time"] = df["DateTime"].time() does not work because you are applying the .time() method to the Series itself, instead of to the elements of the Series within, which are pandas.Timestamp objects that do have the .time() method defined.
To answer your main question: you just want the x-axis to not show redundant info, yes? Instead of creating a new column only for the time, the proper way to do this is to format the matplotlib x-axis with matplotlib.dates.DateFormatter.
Here's a minimal example:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
# Example DatetimeIndex and data
x = pd.date_range(start='2020-05-10', end='2020-05-11', freq='1h')
y = list(range(len(x)))
fig, ax = plt.subplots()
ax.plot(x, y)
# The following specifies the format for dates, e.g. "02:00PM"
date_fmt = mdates.DateFormatter('%I:%M%p')
ax.xaxis.set_major_formatter(date_fmt)
# autofmt_xdate auto-rotates the dates so they do not overlap
fig.autofmt_xdate()
plt.show()
As for how to know what string to pass to DateFormatter, refer to https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior for strftime formats.
Matplotlib has a page dedicated to fixing common date annoyances that you might find useful: https://matplotlib.org/3.1.1/gallery/recipes/common_date_problems.html
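Since you wanted JUST the hours, a one-line variant of the formatter above (my tweak, same mechanism) would be:
date_fmt = mdates.DateFormatter('%H')  # 24-hour hour only, e.g. "14"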

numpy datetime and pandas datetime

I'm confused by the interoperation between numpy and pandas date objects (or maybe just by numpy's datetime64 in general).
I was trying to count business days using numpy's built-in functionality like so:
np.busday_count("2016-03-01", "2016-03-31", holidays=[np.datetime64("28/03/2016")])
However, numpy apparently can't deal with the inverted date format:
ValueError: Error parsing datetime string "28/03/2016" at position 2
To get around this, I thought I'd just use pandas to_datetime, which can. However:
np.busday_count("2016-03-01", "2016-03-31", holidays=[np.datetime64(pd.to_datetime("28/03/2016"))])
ValueError: Cannot safely convert provided holidays input into an array of dates
Searching around for a bit, it seemed that this was caused by the fact that the chaining of to_datetime and np.datetime64 results in a datetime64[us] object, which apparently the busday_count function cannot accept (is this intended behaviour or a bug?). Thus, my next attempt was:
np.busday_count("2016-03-01", "2016-03-31", holidays=[np.datetime64(pd.Timestamp("28"), "D")])
But:
TypeError: Cannot cast datetime.datetime object from metadata [us] to [D] according to the rule 'same_kind'
And that's me out - why are there so many incompatibilities between all these datetime formats? And how can I get around them?
I've been having a similar issue, using np.is_busday()
The type of datetime64 is vital to get right. Checking the numpy datetime docs, you can specify the numpy datetime type to be D.
This works:
my_holidays=np.array([datetime.datetime.strptime(x,'%m/%d/%y') for x in holidays.Date.values], dtype='datetime64[D]')
day_flags['business_day'] = np.is_busday(days,holidays=my_holidays)
Whereas this throws the same error you got:
my_holidays=np.array([datetime.datetime.strptime(x,'%m/%d/%y') for x in holidays.Date.values], dtype='datetime64')
The only difference is specifying the type of datetime64.
dtype='datetime64[D]'
vs
dtype='datetime64'
Docs are here:
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html
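Applying the same idea to the original example (my sketch, not from the answer): parse the day-first string with pandas, then pass np.busday_count a plain date, which numpy converts cleanly to datetime64[D]:
import numpy as np
import pandas as pd
# dayfirst=True makes the inverted date format explicit.
holiday = pd.to_datetime("28/03/2016", dayfirst=True).date()
print(np.busday_count("2016-03-01", "2016-03-31", holidays=[holiday]))  # 21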
I had the same issue while using np.busday_count. I later figured out that the problem was the hours, minutes, seconds, and milliseconds getting added while converting to a datetime object or a numpy datetime object.
I simply converted to a datetime object with only the date, without the hours, minutes, seconds, and milliseconds.
The following was my code:
holidays_list.json file:
{
    "holidays_2019": [
        "04-Mar-2019",
        "21-Mar-2019",
        "17-Apr-2019",
        "19-Apr-2019",
        "29-Apr-2019",
        "01-May-2019",
        "05-Jun-2019",
        "12-Aug-2019",
        "15-Aug-2019",
        "02-Sep-2019",
        "10-Sep-2019",
        "02-Oct-2019",
        "08-Oct-2019",
        "28-Oct-2019",
        "12-Nov-2019",
        "25-Dec-2019"
    ],
    "format": "%d-%b-%Y"
}
code file:
import json
import datetime
import numpy as np
with open('holidays_list.json', 'r') as infile:
data = json.loads(infile.read())
# the following is where I convert the datetime object to date
holidays = list(map(lambda x: datetime.datetime.strptime(
x, data['format']).date(), data['holidays_2019']))
start_date = datetime.datetime.today().date()
end_date = start_date + datetime.timedelta(days=30)
holidays = [start_date + datetime.timedelta(days=1)]
print(np.busday_count(start_date, end_date, holidays=holidays))

Why does a pandas Series return the elements of my numpy datetime64 array as Timestamps?

I have a pandas Series which can be constructed like the following:
from datetime import datetime
import numpy as np
import pandas as pd
import psycopg2
given_time = datetime(2013, 10, 8, 0, 0, 33, 945109,
                      tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))
given_times = np.array([given_time] * 3, dtype='datetime64[ns]')
column = pd.Series(given_times)
The dtype of my Series is datetime64[ns]
However, when I access it: column[1], somehow it becomes of type pandas.tslib.Timestamp, while column.values[1] stays np.datetime64. Does Pandas auto cast my datetime into Timestamp when accessing the item? Is it slow?
Do I need to worry about the difference in types? As far as I can see, a Timestamp seems not to have a timezone (numpy.datetime64('2013-10-08T00:00:33.945109000+0100') -> Timestamp('2013-10-07 23:00:33.945109', tz=None)).
In practice, I will be doing datetime arithmetic like taking differences and comparing to a timedelta. Does the possible type inconsistency around my operators affect my use case at all?
Besides, am I encouraged to use pd.to_datetime instead of astype(dtype='datetime64') when converting datetime objects?
Pandas time types are built on top of numpy's datetime64.
In order to continue using the pandas operators, you should keep using pd.to_datetime rather than astype(dtype='datetime64'). This is especially true since you'll be taking datetime deltas, which pandas handles admirably, for example with resampling and period definitions.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#period
Though I haven't measured it, since the pandas times are wrapping numpy times I suspect the conversion is quite fast. Alternatively, you can just use pandas' built-in time series definitions and avoid the conversion altogether.
As a rule of thumb, it's good to use the type from the package whose functions you'll be using: if you're really only going to use numpy to manipulate the arrays, stick with numpy datetimes; pandas methods => pandas datetimes.
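For illustration (my sketch, not from the answer), this is the kind of resampling pandas handles once the dtype is datetime64:
import pandas as pd
idx = pd.date_range("2013-10-08", periods=6, freq="h")  # "H" on older pandas
s = pd.Series(range(6), index=idx)
# Downsample the hourly series into 2-hour bins.
print(s.resample("2h").sum())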
I had read in the documentation somewhere (apologies, can't find link) that scalar values will be converted to timestamps while arrays will keep their data type. For example:
from datetime import date
import pandas as pd
time_series = pd.Series([date(2010 + x, 1, 1) for x in range(5)])
time_series = time_series.apply(pd.to_datetime)
so that:
In[1]:time_series
Out[1]:
0 2010-01-01
1 2011-01-01
2 2012-01-01
3 2013-01-01
4 2014-01-01
dtype: datetime64[ns]
and yet:
In[2]:time_series.iloc[0]
Out[2]:Timestamp('2010-01-01 00:00:00')
while:
In[3]:time_series.values[0]
Out[3]:numpy.datetime64('2009-12-31T19:00:00.000000000-0500')
because iloc requests a scalar from pandas (type conversion to Timestamp) while values requests the full numpy array (no type conversion).
There is similar behavior for a series of length one. Additionally, slicing more than one element (i.e. iloc[1:10]) returns a Series, which always keeps its datatype.
I'm unsure as to why pandas behaves this way.
In[4]: pd.__version__
Out[4]: '0.15.2'
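As a practical check of the arithmetic question above (my sketch, not from either answer), differences work on both sides of the divide:
import pandas as pd
s = pd.Series(pd.to_datetime(["2013-10-08 00:00:33", "2013-10-09 12:00:33"]))
# Scalar access gives a Timestamp, so subtraction yields a pandas Timedelta...
delta = s[1] - s[0]
print(delta, delta > pd.Timedelta(days=1))  # 1 days 12:00:00 True
# ...while .values gives datetime64, so subtraction yields numpy timedelta64.
print(s.values[1] - s.values[0])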
