Pandas Dataframe Multicolor Line plot - python

I have a Pandas DataFrame with a DateTime index and two columns representing wind speed and ambient temperature. Here is the data for half a day:
temp winds
2014-06-01 00:00:00 8.754545 0.263636
2014-06-01 01:00:00 8.025000 0.291667
2014-06-01 02:00:00 7.375000 0.391667
2014-06-01 03:00:00 6.850000 0.308333
2014-06-01 04:00:00 7.150000 0.258333
2014-06-01 05:00:00 7.708333 0.375000
2014-06-01 06:00:00 9.008333 0.391667
2014-06-01 07:00:00 10.858333 0.300000
2014-06-01 08:00:00 12.616667 0.341667
2014-06-01 09:00:00 15.008333 0.308333
2014-06-01 10:00:00 17.991667 0.491667
2014-06-01 11:00:00 21.108333 0.491667
2014-06-01 12:00:00 21.866667 0.395238
I would like to plot this data as one line where the color changes according to temperature, for example from light red to dark red as the temperature rises.
I found this example of multicolored lines with matplotlib, but I have no idea how to use it with a pandas DataFrame. Does anyone have an idea what I could do?
If this is possible, would it also be possible, as an additional feature, to change the width of the line according to wind speed? So the faster the wind, the wider the line.
Thanks for any help!

The built-in plot method in pandas probably won't be able to do it. You need to extract the data and plot it using matplotlib.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mpd
from matplotlib.collections import LineCollection

# matplotlib needs numeric x values, so convert the DatetimeIndex to date numbers
x = mpd.date2num(df.index.to_pydatetime())
y = df.winds.values
c = df['temp'].values  # the values that drive the color
# build one line segment per pair of consecutive points
points = np.array([x, y]).T.reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)
lc = LineCollection(segments, cmap=plt.get_cmap('copper'), norm=plt.Normalize(0, 10))
lc.set_array(c)
lc.set_linewidth(3)
ax = plt.gca()
ax.add_collection(lc)
plt.xlim(min(x), max(x))
# show the numeric x values as date-time tick labels again
ax.xaxis.set_major_locator(mpd.HourLocator())
ax.xaxis.set_major_formatter(mpd.DateFormatter('%Y-%m-%d:%H:%M:%S'))
_ = plt.setp(ax.xaxis.get_majorticklabels(), rotation=70)
plt.savefig('temp.png')
There are two issues worth mentioning:
1. The range of the color gradient is controlled by norm=plt.Normalize(0, 10). Temperatures outside that range all get the color at the end of the colormap, so adjust the limits to match your data.
2. pandas and matplotlib plot time series differently, which requires df.index to be converted to floats before plotting. Setting the major locator and formatter then turns the x-axis tick labels back into date-time format.
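To illustrate the first point, here is a minimal sketch of how the norm limits affect the color mapping. The temperatures are a handful of values taken from the question's data. Deriving the limits from the minimum and maximum of the data is one possible choice, not something the answer prescribes.

```python
import numpy as np
from matplotlib.colors import Normalize

# A few temperatures from the question's data
temp = np.array([6.85, 8.754545, 12.616667, 21.866667])

# Fixed limits as in the answer: everything above 10 maps beyond 1.0,
# so those points all receive the last color of the colormap
fixed = Normalize(0, 10)

# Limits taken from the data (an assumption about what is wanted):
# the coldest point maps to 0.0 and the warmest to 1.0
fitted = Normalize(vmin=temp.min(), vmax=temp.max())

fixed_scaled = fixed(temp)
fitted_scaled = fitted(temp)
print(fixed_scaled)
print(fitted_scaled)
```

plt.Normalize is the same class as matplotlib.colors.Normalize, so either object can be passed as norm= to LineCollection.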
The second issue may cause problems when we want to plot more than one line (the data will be plotted in two separate x ranges):
# continuing from what is already plotted:
df['another'] = np.random.random(13)
print(ax.get_xticks())
df.another.plot(ax=ax, secondary_y=True)
print(ax.get_xticks(minor=True))
[ 735385. 735385.04166667 735385.08333333 735385.125
735385.16666667 735385.20833333 735385.25 735385.29166667
735385.33333333 735385.375 735385.41666667 735385.45833333
735385.5 ]
[389328 389330 389332 389334 389336 389338 389340]
Therefore we need to plot it without the pandas .plot() method:
ax.twinx().plot(x, df.another)
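The question also asked about varying the line width with wind speed, which the answer above does not cover. A possible sketch passes one width per segment to LineCollection. It assumes it is acceptable to give each segment the average of its two end points, and the scale factor WIDTH_SCALE is a hand-picked, hypothetical value.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import matplotlib.dates as mpd
from matplotlib.collections import LineCollection

# First few rows of the question's data
df = pd.DataFrame(
    {"temp": [8.754545, 8.025, 7.375, 6.85, 7.15, 7.708333],
     "winds": [0.263636, 0.291667, 0.391667, 0.308333, 0.258333, 0.375]},
    index=pd.date_range("2014-06-01", periods=6, freq="h"),
)

x = mpd.date2num(df.index.to_pydatetime())
y = df["winds"].to_numpy()
temp = df["temp"].to_numpy()
points = np.array([x, y]).T.reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)

# One value per segment: the mean of the two end points
seg_temp = (temp[:-1] + temp[1:]) / 2
seg_wind = (y[:-1] + y[1:]) / 2

WIDTH_SCALE = 15  # hypothetical factor: line width in points per unit of wind speed
lc = LineCollection(
    segments,
    cmap="Reds",
    norm=plt.Normalize(seg_temp.min(), seg_temp.max()),
    linewidths=seg_wind * WIDTH_SCALE,
)
lc.set_array(seg_temp)

fig, ax = plt.subplots()
ax.add_collection(lc)
ax.set_xlim(x.min(), x.max())
ax.set_ylim(y.min() - 0.05, y.max() + 0.05)
ax.xaxis.set_major_formatter(mpd.DateFormatter("%H:%M"))
fig.savefig("temp_width.png")
```

Because the width changes in steps from segment to segment, the joints can look jagged when neighboring wind speeds differ a lot. Interpolating to more points before building the segments would smooth that out.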

Related

matplotlib xticks outputs wrong array

I am trying to plot a time series, which looks like this
ts
2020-01-01 00:00:00 1300.0
2020-01-01 01:00:00 1300.0
2020-01-01 02:00:00 1300.0
2020-01-01 03:00:00 1300.0
2020-01-01 04:00:00 1300.0
...
2020-12-31 19:00:00 1300.0
2020-12-31 20:00:00 1300.0
2020-12-31 21:00:00 1300.0
2020-12-31 22:00:00 1300.0
2020-12-31 23:00:00 1300.0
Freq: H, Name: 1, Length: 8784, dtype: float64
And I plot it via: ts.plot(label=label, linestyle='--', color='k', alpha=0.75, zorder=2)
If the time series ts runs from 2020-01-01 to 2020-12-31, I get the following when I call plt.xticks()[0]:
array([438288, 439032, 439728, 440472, 441192, 441936, 442656, 443400,
444144, 444864, 445608, 446328, 447071], dtype=int64)
which is fine, since the first element of that array shows the right position of the first xtick. However, when I expand the time series to run from 2019-01-01 to 2020-12-31, so over 2 years, and call plt.xticks()[0], I get the following:
array([429528, 431688, 433872, 436080, 438288, 440472, 442656, 444864,
447071], dtype=int64)
I don't understand why I am now getting fewer values as xticks. For 12 months I get 13 xtick locations, so for 24 months I was expecting 25. Instead I got only 9. How would I get all 25 locations?
This is the whole script:
fig, ax = plt.subplots(figsize=(8,4))
ts.plot(label=label, linestyle='--', color='k', alpha=0.75, zorder=2)
locs, labels = plt.xticks()
Matplotlib automatically selects an appropriate number of ticks and tick labels so that the x-axis does not become unreadable. You can override the default behavior by using tick locators and formatters from the matplotlib.dates module.
But note that you are plotting the time series with the pandas plot method which is a wrapper around plt.plot. Pandas uses custom tick formatters for time series plots that produce nicely-formatted tick labels. By doing so, it uses x-axis units for dates that are different from the matplotlib date units, which explains why you get what looks like a random number of ticks when you try using the MonthLocator.
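The different units can be seen in the numbers from the question. As far as I can tell, for an hourly series pandas positions the points by the number of hours since 1970-01-01, whereas matplotlib's date locators count in days. A small check of that reading, using only the first tick value quoted in the question:

```python
import pandas as pd

epoch = pd.Timestamp("1970-01-01")
first_tick = 438288  # first value of plt.xticks()[0] in the question

# Interpreted as hours since the epoch, the tick is the start of the series
as_hours = epoch + pd.Timedelta(hours=first_tick)
print(as_hours)  # 2020-01-01 00:00:00

# A day-based locator would read the same number as a count of days,
# which is roughly 1200 years' worth and nowhere near the data
print(first_tick / 365.25)
```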
To make the pandas plot compatible with matplotlib.dates tick locators, you need to add the undocumented x_compat=True argument. Unfortunately, this also removes the pandas custom tick label formatters. So here is an example of how to use a matplotlib date tick locator with a pandas plot and get a similar tick format (minor ticks not included):
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import matplotlib.dates as mdates
# Create sample time series stored in a dataframe
ts = pd.DataFrame(data=dict(constant=1),
                  index=pd.date_range('2019-01-01', '2020-12-31', freq='H'))
# Create pandas plot
ax = ts.plot(figsize=(10,4), x_compat=True)
ax.set_xlim(min(ts.index), max(ts.index))
# Select and format x ticks
ax.xaxis.set_major_locator(mdates.MonthLocator())
ticks = pd.to_datetime(ax.get_xticks(), unit='d') # timestamps of x ticks
labels = [timestamp.strftime('%b\n%Y') if timestamp.year != ticks[idx-1].year
          else timestamp.strftime('%b') for idx, timestamp in enumerate(ticks)]
plt.xticks(ticks, labels, rotation=0, ha='center');

Modifying x ticks labels in seaborn

I am trying to modify the format of the x-tick label to date format (%m-%d).
My data consists of hourly values over a certain period of dates. I am trying to plot the data for 14 days. However, when I run the code, the x labels come out fully jumbled up.
Is there any way to show only the dates on the x-axis and skip the labels for the hourly values? I am using seaborn.
Following a suggestion from a comment, I edited my code to plot as below:
fig, ax = plt.pyplot.subplots()
g = sns.barplot(data=data_n,x='datetime',y='hourly_return')
g.xaxis.set_major_formatter(plt.dates.DateFormatter("%d-%b"))
But I got the following error:
ValueError: DateFormatter found a value of x=0, which is an illegal
date; this usually occurs because you have not informed the axis that
it is plotting dates, e.g., with ax.xaxis_date()
Upon checking the datetime column, I get the following output, including the data type of the column:
0 2020-01-01 00:00:00
1 2020-01-01 01:00:00
2 2020-01-01 02:00:00
3 2020-01-01 03:00:00
4 2020-01-01 04:00:00
...
307 2020-01-13 19:00:00
308 2020-01-13 20:00:00
309 2020-01-13 21:00:00
310 2020-01-13 22:00:00
311 2020-01-13 23:00:00
Name: datetime, Length: 312, dtype: datetime64[ns]
I suspected the x ticks, so I ran g.get_xticks() [which gets the ticks on the x-axis] and got ordinal numbers as output. Can anyone tell me why this is happening?
1. Approach for drawing a line plot with a datetime x-axis
Can you try changing the x-axis format as below?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import dates
# create dummy dataframe
datelist = pd.date_range(start='2020-01-01 00:00:00', periods=312, freq='1H').tolist()
df = pd.DataFrame(datelist, columns=["datetime"])
df["val"] = list(range(1, 312 + 1))
df.head()
Draw the plot:
fig, ax = plt.subplots()
chart = sns.lineplot(data=df, ax=ax, x="datetime",y="val")
ax.xaxis.set_major_formatter(dates.DateFormatter("%d-%b"))
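2. Approach for drawing a bar plot with seaborn and a datetime x-axis
The above approach fails for a barplot, because seaborn treats the x values of a barplot as categories and places the bars at integer positions 0, 1, 2, ... rather than at dates. So use the code below instead: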
fig, ax = plt.subplots()
# barplot: seaborn places the bars at positions 0, 1, 2, ...
chart = sns.barplot(data=df, ax=ax, x="datetime", y="val")
# The data is hourly, so each day has 24 data points.
# Label only every 24th bar, i.e. one label per day.
freq = 24
# keep only every freq-th tick; do this first so that the number
# of ticks matches the number of labels set below
xtix = ax.get_xticks()
ax.set_xticks(xtix[::freq])
# label the remaining ticks with the date only
ax.set_xticklabels(df.iloc[::freq]["datetime"].dt.strftime("%d-%b-%y"))
# nicer label format for dates
fig.autofmt_xdate()
plt.show()
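If seaborn is not a requirement, a different route (plain matplotlib instead of the seaborn call above) is to place the bars at real date positions, so that date locators and formatters apply directly. This is only a sketch on the same kind of dummy data. The bar width of 0.8 hours is an arbitrary choice.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from matplotlib import dates

# dummy data: 312 hourly values
df = pd.DataFrame({"datetime": pd.date_range("2020-01-01", periods=312, freq="h")})
df["val"] = range(1, 312 + 1)

# convert the timestamps to matplotlib date numbers (unit: days)
x = dates.date2num(df["datetime"].to_numpy())

fig, ax = plt.subplots()
bars = ax.bar(x, df["val"], width=0.8 / 24)  # 0.8 of an hour, expressed in days
ax.xaxis_date()  # tell the axis that x holds dates
ax.xaxis.set_major_locator(dates.DayLocator())  # one tick per day
ax.xaxis.set_major_formatter(dates.DateFormatter("%d-%b"))
fig.autofmt_xdate()
fig.savefig("bars_by_date.png")
```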

how to plot time series where x-axis is datetime.time object in matplotlib?

I'm trying to graph timeseries data beginning from 9pm to 6pm the next day. Here is my failed attempt.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import time
a=np.array([35,25,24,25,27,28,30,35])
df=pd.DataFrame(index=pd.date_range("00:00", "23:00", freq="3H").time,data={'column1':a})
column1
00:00:00 35
03:00:00 25
06:00:00 24
09:00:00 25
12:00:00 27
15:00:00 28
18:00:00 30
21:00:00 35
Reindexing the data to go from 21:00 to 18:00. Perhaps there is a better way of achieving this part too, but this works.
df=df.reindex(np.concatenate([df.loc[time(21,00):].index,df.loc[:time(21,00)].index[:-1]]))
column1
21:00:00 35
00:00:00 35
03:00:00 25
06:00:00 24
09:00:00 25
12:00:00 27
15:00:00 28
18:00:00 30
plt.plot(df.index,df['column1'])
The x-axis does not seem to match df.index. Also, the axis begins at 00:00, not 21:00. Does anyone know a solution that doesn't involve using string labels for the x-axis?
A simple way to do that is to plot the data without passing explicit x values and then to change the tick labels. The problem is that this only works if the time between the data points is constant. I know you said you didn't want to use string labels, so I don't know if this solution will be what you want.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import time
a=np.array([35,25,24,25,27,28,30,35])
df=pd.DataFrame(index=pd.date_range("00:00", "23:00",freq="3H").time,data={'column1':a})
df=df.reindex(np.concatenate([df.loc[time(21,00):].index,df.loc[:time(21,00)].index[:-1]]))
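# Now we create the figure and the axes
fig, axes = plt.subplots()
# plot the bare values so that the x axis shows positions 0, 1, 2, 3, ...
# (passing the Series itself would make matplotlib use its time index as x)
axes.plot(df['column1'].to_numpy())
# put one tick at every position, then replace the tick labels with the times
axes.set_xticks(range(len(df)))
axes.set_xticklabels(df.index)
plt.show()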
This should do the trick and will plot what you want.
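If string labels are to be avoided, another option (a sketch, not part of the answer above) is to attach the times to two consecutive dummy dates, so that 21:00 falls on the first day and everything else on the next, and then let a date formatter show only the time of day. The date 2000-01-01 is an arbitrary placeholder.

```python
from datetime import date, datetime, time, timedelta

import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# the reindexed data from the question
times = [time(h) for h in (21, 0, 3, 6, 9, 12, 15, 18)]
values = np.array([35, 35, 25, 24, 25, 27, 28, 30])

first_day = date(2000, 1, 1)  # arbitrary placeholder date
start = time(21)              # the axis should begin at 21:00

# times from 21:00 onwards stay on the first day, earlier ones move to the next
x = [datetime.combine(first_day, t) + timedelta(days=0 if t >= start else 1)
     for t in times]

fig, ax = plt.subplots()
ax.plot(x, values)
ax.xaxis.set_major_locator(mdates.HourLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
fig.savefig("overnight.png")
```

Since the x values are real datetimes, this also works when the spacing between the data points is not constant.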

Python: Matplotlib avoid plotting gaps

I am currently generating a plot with this code:
fig, ax = plt.subplots()
ax.plot(intra.to_pydatetime(), data)
plt.title('Intraday Net Spillover')
fig.autofmt_xdate()
where intra.to_pydatetime() is a:
<bound method DatetimeIndex.to_pydatetime of <class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-03 09:35:00, ..., 2011-01-07 16:00:00]
Length: 390, Freq: None, Timezone: None>
So the dates go from 2011-01-03 09:35:00, increments by 5 minutes until 16:00:00, and then jumps to the next day, 2011-01-04 09:35:00 until 2011-01-04 16:00:00, and so on.
How can I avoid plotting the gaps between 16:00:00 and 9:30:00 on the following day? I don't want to see these straight lines.
UPDATE:
I will try this to see if it works.
Simply set the two values defining the line you don't want to see as NaN (Not a Number). Matplotlib will hide the line between the two values automatically.
Check out this example:
http://matplotlib.org/examples/pylab_examples/nan_test.html
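As a small illustration of the NaN approach (the numbers are made up and the position of the gap is chosen by hand), a single NaN between two runs of values is enough to split the line:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

x = np.arange(8, dtype=float)
y = np.array([1.0, 2.0, 1.5, np.nan, 3.0, 2.5, 3.5, 3.0])  # NaN marks the break

fig, ax = plt.subplots()
(line,) = ax.plot(x, y)  # drawn as two separate pieces: x = 0..2 and x = 4..7
fig.savefig("nan_gap.png")
```

Note that this only removes the connecting line. The empty stretch of the x-axis is still there.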
Try to resample your dataframe.
For example:
df.plot()
gives me this result:
[figure: plot before resampling]
and now with resample:
df = df.resample('H').first().fillna(value=np.nan)
[figure: plot after resampling]
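Both suggestions keep the overnight stretch on the x-axis and only stop drawing a line across it. If the empty stretch itself should disappear, one common workaround (a different technique from the answers above) is to plot against the row position and translate positions back to timestamps in the tick formatter. Here is a sketch with made-up data shaped like the question's. The helper name format_pos is hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Two trading days of 5-minute stamps, 09:35 to 16:00 (made-up data)
days = ["2011-01-03", "2011-01-04"]
intra = pd.DatetimeIndex(
    np.concatenate([pd.date_range(d + " 09:35", d + " 16:00", freq="5min").values
                    for d in days])
)
data = np.random.default_rng(0).standard_normal(len(intra)).cumsum()

def format_pos(pos, _tick_index=None):
    """Hypothetical helper: turn a row position into a timestamp label."""
    i = int(round(pos))
    if 0 <= i < len(intra):
        return intra[i].strftime("%m-%d %H:%M")
    return ""  # positions outside the data get no label

fig, ax = plt.subplots()
ax.plot(np.arange(len(intra)), data)  # evenly spaced, so no overnight gap
ax.xaxis.set_major_formatter(FuncFormatter(format_pos))
fig.autofmt_xdate()
fig.savefig("no_gaps.png")
```

The trade-off is that the x-axis is no longer a true time axis: the jump from 16:00 to 09:35 the next day takes up the same space as a 5-minute step.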

Python pandas, how to only plot a DataFrame that actually have the datapoint and leave the gap out

I have a DataFrame with intraday data indexed with DatetimeIndex
df1 =pd.DataFrame(np.random.randn(6,4),index=pd.date_range('1/1/2000',periods=6, freq='1h'))
df2 =pd.DataFrame(np.random.randn(6,4),index=pd.date_range('1/2/2000',periods=6, freq='1h'))
df3 = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
As can be seen, there is a big gap between the two days in df3.
df3.plot()
will plot every single hour from 2000-01-01 00:00:00 to 2000-01-02 05:00:00, while from 2000-01-01 06:00:00 to 2000-01-02 00:00:00 there are actually no data points.
How can I leave those hours out of the plot, so that the span from 2000-01-01 06:00:00 to 2000-01-02 00:00:00 is not plotted?
This seems to have been in discussion for some time at Google Groups.
Pandas Intraday Time Series plots
One way to do this is to resample (hourly) before you plot:
df3.resample('H').mean().plot()
Note: This ensures you have NaN values between real values which are not plotted (rather than connected). This means you are storing more data here, which may be an issue.
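To see what the resampling step does to the data, here is a quick check on a frame built like df3 in the question. Using pd.concat and .asfreq() is my choice for this sketch. Any aggregation that leaves empty bins as NaN behaves the same way.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 4),
                   index=pd.date_range("2000-01-01", periods=6, freq="h"))
df2 = pd.DataFrame(np.random.randn(6, 4),
                   index=pd.date_range("2000-01-02", periods=6, freq="h"))
df3 = pd.concat([df1, df2])

# put the data on a regular hourly grid; hours without data become NaN rows
hourly = df3.resample("h").asfreq()

print(len(df3), len(hourly))        # 12 rows before, 30 rows after
print(int(hourly[0].isna().sum()))  # 18 of the hourly rows are empty
```

The 18 extra rows are the additional data mentioned in the note above.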
