Format datetime in seaborn faceted scatter plot - python

I am learning python pandas + matplotlib + seaborn plotting and data visualization from a "R Lattice" perspective. I am still getting my legs. Here is a basic question that I could not get to work just right. Here's the example:
# envir (this is running in an iPython notebook)
%pylab inline
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# generate some data
nRows = 500
df = pd.DataFrame({'c1' : np.random.choice(['A','B','C','D'], size=nRows),
'c2' : np.random.choice(['P','Q','R'], size=nRows),
'i1' : np.random.randint(20,50, nRows),
'i2' : np.random.randint(0,10, nRows),
'x1' : 3 * np.random.randn(nRows) + 90,
'x2' : 2 * np.random.randn(nRows) + 89,
't1' : pd.date_range('10/3/2014', periods=nRows)})
# plot a lattice like plot
# 'hue=' is like 'groups=' in R
# 'col=' is like "|" in lattice formula interface
g = sns.FacetGrid(df, col='c1', hue='c2', size=4, col_wrap=2, aspect=2)
g.map(scatter, 't1', 'x1', s=20)
g.add_legend()
I would like the x axis to plot in an appropriate datetime format, not as an integer. I am OK with specifying the format (YYYY-MM-DD, for example) as a start.
However, it would be better if the time range was inspected and the appropriate scale was produced. In R Lattice (and other plotting systems), if the x variable is a datetime, a "pretty" function would determine if the range was large and implied YYYY only (say, for plotting a 20-year time trend), YYYY-MM (for plotting something that spans a few years), or YYYY-MM-DD HH:MM:SS format for high-frequency time series data (i.e. something sampled every 100 ms). That was done automatically. Is there anything like that available for this case?
One other really basic question on this example (I am almost embarrassed to ask). How can I get a title on this plot?
Thanks!
Randall

It looks like seaborn does not yet support datetime values on the axes in lmplot, although it does support them in a few of its other plots. In the meantime, I would suggest adding your use case to the issue in the link above, since there currently does not seem to be enough perceived need for the developers to address it.
As for a title, FacetGrid itself has no set_title() method, but you can put a title on the underlying figure with suptitle(). That would look something like this:
.
.
.
g = sns.FacetGrid(df, col='c1', hue='c2', size=4, col_wrap=2, aspect=2)
g.map(scatter, 't1', 'x1', s=20)
g.add_legend()
Then simply add:
g.fig.suptitle('Check out that beautiful facet plot!')
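On the "pretty" automatic date scale asked about in the question: matplotlib ships an AutoDateLocator that inspects the plotted range, and a ConciseDateFormatter that shortens the labels to match. The following is a minimal sketch on a single plain matplotlib axes, not a seaborn-specific solution; the data and the title text are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd

# Made-up data shaped like the 't1' and 'x1' columns in the question.
t = pd.date_range("2014-10-03", periods=500)
x = 3 * np.random.randn(500) + 90

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(t.to_pydatetime(), x, s=20)

# The locator chooses tick spacing (years, months, days, ...) from the data range;
# the formatter then picks a matching compact label format.
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))

fig.suptitle("Scatter with automatic date ticks")
fig.canvas.draw()  # force tick computation
tick_labels = [label.get_text() for label in ax.get_xticklabels()]
```

With a FacetGrid, the same two calls could presumably be applied to each axes in g.axes, though that is untested here.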

Related

How do you get dates on the start on the specified month? (matplotlib)

# FEB
# configuring the figure and plot space
fig, lx = plt.subplots(figsize=(30,10))
# converting the Series to float so the data can be plotted
wd = df2['Unnamed: 1']
wd = wd.astype(float)
# adding the x and y axes' values
lx.plot(list(df2.index.values), wd)
# defining what the labels will be
lx.set(xlabel='Day', ylabel='Weight', title='Daily Weight February 2022')
# defining the date format
date_format = DateFormatter('%m-%d')
lx.xaxis.set_major_formatter(date_format)
lx.xaxis.set_minor_locator(mdates.WeekdayLocator(interval=1))
Values I would like the x-axis to have:
['2/4', '2/5', '2/6', '2/7', '2/8', '2/9', '2/10', '2/11', '2/12', '2/13', '2/14', '2/15', '2/16', '2/17', '2/18', '2/19', '2/20', '2/21', '2/22', '2/23', '2/24', '2/25', '2/26', '2/27']
Values on the x-axis:
It is giving me the right number of values, just not the right labels. I have tried to specify the start and end with xlim=['2/4', '2/27'], however that did not seem to work.
It would be great to see what your df2 actually looks like, but from your code snippet, it looks like it has the weights recorded but not the corresponding dates.
How about preparing a data frame that has the dates in it?
(Also, since this question is tagged with seaborn too, I'm going to use Seaborn, but the same idea should work.)
import pandas as pd
import seaborn as sns
import seaborn.objects as so
from matplotlib.dates import DateFormatter
sns.set_theme()
Create an index with the dates starting from 4 Feb with the number of days we have weight recorded.
index = pd.date_range(start="2/4/2022", periods=df2.count().Weight, name="Date")
Then with Seaborn's object interface (v0.12+), we can do:
(
so.Plot(df2.set_index(index), x="Date", y="Weight")
.add(so.Line())
.scale(x=so.Temporal().label(formatter=DateFormatter('%m-%d')))
.label(title="Daily Weight February 2022")
)
I have solved this problem. Very simple. I just passed mdates.WeekdayLocator() to set_major_locator. I overlooked this when I was going through the matplotlib docs. But happy to have found this solution.
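For reference, here is a minimal matplotlib-only sketch of pairing a date locator with a date formatter so that every day from 4 Feb to 27 Feb gets a label. The weights are made-up numbers, and DayLocator is swapped in here (as an assumption) instead of the WeekdayLocator mentioned above, because one tick per day is what the question asked for.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd

# Real dates on the x-axis; the weights are hypothetical placeholder values.
dates = pd.date_range("2022-02-04", "2022-02-27", freq="D")
weights = np.linspace(180.0, 178.0, len(dates))

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(dates.to_pydatetime(), weights)
ax.set(xlabel="Day", ylabel="Weight", title="Daily Weight February 2022")

# The locator decides where ticks go; the formatter decides how they are written.
ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))

# Axis limits are given as datetimes, not as strings such as '2/4'.
ax.set_xlim(dates[0].to_pydatetime(), dates[-1].to_pydatetime())

fig.canvas.draw()  # force tick computation
labels = [tick.get_text() for tick in ax.get_xticklabels()]
```

Note that '%m-%d' produces zero-padded labels such as 02-04 rather than 2/4.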

Plotly: How to add vertical lines at specified points?

I have a data frame plot of a time series along with a list of numeric values at which I'd like to draw vertical lines. The plot is an interactive one created using the cufflinks package. Here is an example of three time series with 1000 time values each; I'd like to draw vertical lines at 500 and 800. My attempt using "axvline" is based upon suggestions I've seen in similar posts:
import numpy as np
import pandas as pd
import cufflinks
np.random.seed(123)
X = np.random.randn(1000,3)
df=pd.DataFrame(X, columns=['a','b','c'])
fig=df.iplot(asFigure=True,xTitle='time',yTitle='values',title='Time Series Plot')
fig.axvline([500,800], linewidth=5,color="black", linestyle="--")
fig.show()
The error message states 'Figure' object has no attribute 'axvline'.
I'm not sure whether this message is due to my lack of understanding about basic plots or stems from a limitation of using igraph.
The answer:
To add a line to an existing plotly figure, just use:
fig.add_shape(type='line',...)
The details:
I gather this is the post you've seen, since you're mixing in matplotlib. And as has been stated in the comments, axvline has nothing to do with plotly. It was only used as an example of how you could have done it using matplotlib. Using plotly, I'd go for fig.add_shape(type='line', ...). But before you try it out for yourself, please be aware that cufflinks has been deprecated. I really liked cufflinks, but now there are better options for building both quick and detailed graphs. If you'd like to stick to one-liners similar to iplot, I'd suggest using plotly.express. The only hurdle in your case is changing your dataset from a wide to a long format, which is preferred by plotly.express. The snippet below does just that:
Code:
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.offline import iplot
#
np.random.seed(123)
X = np.random.randn(1000,3)
df=pd.DataFrame(X, columns=['a','b','c'])
df['id'] = df.index
df = pd.melt(df, id_vars='id', value_vars=df.columns[:-1])
# plotly line figure
fig = px.line(df, x='id', y='value', color='variable')
# lines to add, as (label, x-position) pairs
# (a dict would silently keep only the last of the two 'a' entries)
lines = [('a', 500), ('c', 700), ('a', 900), ('b', 950)]
# add lines using absolute references
for label, x in lines:
    fig.add_shape(type='line',
                  yref="y",
                  xref="x",
                  x0=x,
                  y0=df['value'].min()*1.2,
                  x1=x,
                  y1=df['value'].max()*1.2,
                  line=dict(color='black', width=3))
    fig.add_annotation(
        x=x,
        y=1.06,
        yref='paper',
        showarrow=False,
        text=label)
fig.show()
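As a variation on the add_shape approach above, the shapes can also be described as plain dicts with yref='paper', so each line spans the full plot height without computing the data minimum and maximum. This is only a sketch: vline_shape is a hypothetical helper name, and the block builds the dicts without importing plotly.

```python
def vline_shape(x, color="black", width=3):
    """Return a plotly-style shape dict for a full-height vertical line at x.

    vline_shape is a hypothetical helper for illustration, not part of plotly.
    """
    return dict(
        type="line",
        xref="x",       # x is given in data coordinates
        yref="paper",   # y runs from 0 (bottom) to 1 (top) of the plot area
        x0=x, x1=x,
        y0=0, y1=1,
        line=dict(color=color, width=width),
    )

# The two positions asked about in the question.
shapes = [vline_shape(x) for x in (500, 800)]
print(shapes[0])
```

The resulting list can then be handed to an existing figure with fig.update_layout(shapes=shapes).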
Not sure if this is what you want, but adding two scatter traces seems to work:
np.random.seed(123)
X = np.random.randn(1000,3)
df=pd.DataFrame(X, columns=['a','b','c'])
fig = df.iplot(asFigure=True,xTitle='time',yTitle='values',title='Time Series Plot')
fig.add_scatter(x=[500]*100, y=np.linspace(-4,4,100), name='lower')
fig.add_scatter(x=[800]*100, y=np.linspace(-4,4,100), name='upper')
fig.show()

Integrating over range of dates, and labeling the xaxis

I am trying to integrate 2 curves as they change through time using pandas. I am loading data from a CSV file like such:
Where the Dates are the X-axis and both the Oil & Water points are the Y-axis. I have learned to use the cross-section option to isolate the "NAME" values, but am having trouble finding a good way to integrate with dates as the X-axis. I eventually would like to be able to take the integrals of both curves and stack them against each other. I am also having trouble with the plot defaulting the x-ticks to arbitrary values, instead of the dates.
I can change the labels/ticks manually, but have a large CSV to process and would like to automate the process. Any help would be greatly appreciated.
NAME,DATE,O,W
A,1/20/2000,12,50
B,1/20/2000,25,28
C,1/20/2000,14,15
A,1/21/2000,34,50
B,1/21/2000,8,3
C,1/21/2000,10,19
A,1/22/2000,47,35
B,1/22/2000,4,27
C,1/22/2000,46,1
A,1/23/2000,19,31
B,1/23/2000,18,10
C,1/23/2000,19,41
Contents of CSV in text form above.
Further to my comment above, here is some sample code (using logic from the example mentioned) to label your xaxis with formatted dates. Hope this helps.
Data Collection / Imports:
Just re-creating your dataset for the example.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
header = ['NAME', 'DATE', 'O', 'W']
data = [['A','1/20/2000',12,50],
['B','1/20/2000',25,28],
['C','1/20/2000',14,15],
['A','1/21/2000',34,50],
['B','1/21/2000',8,3],
['C','1/21/2000',10,19],
['A','1/22/2000',47,35],
['B','1/22/2000',4,27],
['C','1/22/2000',46,1],
['A','1/23/2000',19,31],
['B','1/23/2000',18,10],
['C','1/23/2000',19,41]]
df = pd.DataFrame(data, columns=header)
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
# Subset to just the 'A' labels.
df_a = df[df['NAME'] == 'A']
Plotting:
# Define the number of ticks you need.
nticks = 4
# Define the date format.
mask = '%m-%d-%Y'
# Create the set of custom date labels.
step = int(df_a.shape[0] / nticks)
xdata = np.arange(df_a.shape[0])
xlabels = df_a['DATE'].dt.strftime(mask).tolist()[::step]
# Create the plot.
fig, ax = plt.subplots(1, 1)
ax.plot(xdata, df_a['O'], label='Oil')
ax.plot(xdata, df_a['W'], label='Water')
ax.set_xticks(np.arange(df_a.shape[0], step=step))
ax.set_xticklabels(xlabels, rotation=45, horizontalalignment='right')
ax.set_title('Test in Naming Labels for the X-Axis')
ax.legend()
I'd recommend converting the X-axis into some form of integers or floats (seconds, minutes, hours, or days since a certain time, depending on the precision that you need). You can then use the usual methods to integrate, and the x-axis would no longer default to arbitrary values.
See How to convert datetime to integer in python
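A minimal sketch of that idea, assuming the 'A' rows from the CSV in the question: the dates are converted to days elapsed since the first sample and each curve is integrated with the trapezoidal rule. The trapezoid helper is written out by hand here for illustration; the resulting units are "value times days".

```python
import numpy as np
import pandas as pd

# The 'A' rows from the question's CSV.
df_a = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["1/20/2000", "1/21/2000", "1/22/2000", "1/23/2000"], format="%m/%d/%Y"
    ),
    "O": [12, 34, 47, 19],
    "W": [50, 50, 35, 31],
})

# Elapsed time in days since the first sample, as floats.
days = ((df_a["DATE"] - df_a["DATE"].iloc[0]) / pd.Timedelta(days=1)).to_numpy()

def trapezoid(y, x):
    """Trapezoidal rule: average height of each interval times its width."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

oil_total = trapezoid(df_a["O"], days)
water_total = trapezoid(df_a["W"], days)
print(oil_total, water_total)  # 96.5 125.5
```

Because the elapsed time is computed from the real dates, unevenly spaced samples are weighted by their actual spacing.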

Plotly - Histogram bins size to weeks

I'm trying to plot a histogram with date data using plotly. I would like to plot it with bin sizes corresponding to weeks, and that doesn't seem to work. I searched for documentation about it but didn't find anything.
Here is the code I have. I tried (line 5): 'D7' and 'W1'. That doesn't work (plotly does not seem to recognize the argument, and falls back to one bin per day). What's strange is that 'M1', 'M3', etc. seem to work.
fig = go.Figure(data=[go.Histogram(x=df.col,
xbins=dict(
start='2018-01-01',
end='2018-12-31',
size='D7'),
autobinx=False)])
fig.update_layout(
title=go.layout.Title(
text="title",
xref="paper",
x=0.5
),
xaxis_title_text='xaxis title',
yaxis_title_text='yaxis title'
)
fig.show()
Would someone have any information about this problem?
Thanks
xbins.size is specified in milliseconds by default. To get weekly bins, set xbins.size to 604800000 (7 days with 86,400,000 milliseconds each).
Plotly provides the format xM to get monthly bins because this use case requires more complicated calculations in the background, as monthly bins do not have a uniform size.
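The millisecond value does not have to be typed by hand. As a small sketch using only the standard library, it can be derived from a timedelta and dropped into the same xbins dict as in the question (the start and end dates are copied from there):

```python
from datetime import timedelta

# One week expressed in milliseconds, the unit used for date-axis bin sizes.
week_ms = int(timedelta(weeks=1).total_seconds() * 1000)

# Same shape as the xbins dict in the question, with a numeric weekly size.
xbins = dict(start="2018-01-01", end="2018-12-31", size=week_ms)
print(xbins["size"])  # 604800000
```

The dict would then be passed as go.Histogram(x=df.col, xbins=xbins), in place of the size='D7' attempt.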
It seems that a resampled data source and a bar plot is what you're really looking for:
Plot:
Here, the source data, based on daily observations DatetimeIndex(['2020-01-01', '2020-01-02', ... , '2020-07-18']), have been resampled to show the mean per week for a certain stock price.
Code:
# Imports
import pandas as pd
#import matplotlib.pyplot as plt
import numpy as np
import plotly.graph_objects as go
#from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# data, random sample to illustrate stocks
np.random.seed(12345)
rows = 200
x = pd.Series(np.random.randn(rows),index=pd.date_range('1/1/2020', periods=rows)).cumsum()
y = pd.Series(x-np.random.randn(rows)*5,index=pd.date_range('1/1/2020', periods=rows))
df = pd.concat([y,x], axis = 1)
df.columns = ['StockA', 'StockB']
# resample daily data to weekly means
df2=df.reset_index()
df3=df2.resample('W-Mon', on='index').mean()
# build and show plotly plot
fig = go.Figure([go.Bar(x=df3.index, y=df3['StockA'])])
fig.show()
Let me know how this works for you.
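If the goal is a histogram of dates (a count of events per week) rather than an aggregate of values, the same resampling idea can be sketched as follows. The event dates are made up: one event per day for two weeks starting on Monday, 1 January 2018.

```python
import pandas as pd

# Hypothetical event dates: one per day for 14 days, starting on a Monday.
events = pd.date_range("2018-01-01", periods=14, freq="D")

# Give every event a weight of 1 and sum per calendar week.
# "W" bins end on Sunday, so these two weeks are Jan 1-7 and Jan 8-14.
weekly = pd.Series(1, index=events).resample("W").sum()
print(weekly)
```

The weekly Series can then be fed to a bar trace in the same way as df3 in the code above, with weekly.index on x and weekly.values on y.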

Renaming X-Axis Labels when using Matplotlib and Pandas

I'm running my code on iPython Notebooks, on a Macbook Pro Yosemite 10.10.4
I have a CSV file that I am trying to read using Python, and looking to come up with charts. The problem I am facing is renaming the X-Axis labels.
Essentially, the chart is trying to plot a count of different types of Audit Violations, but has really long descriptions of the said violations. For example:
Not approved by regional committee.......another 300 words - 17
No contract with vendor.......another 300 words - 14
Vendor Registration not on record.......another 300 words - 9
Instead of having these verbose reasons though, I would like to rename the X-Axis labels to just numbers or alphabets so that the graph reads somewhat like this:
A - 17
B - 14
C - 9
This is the code I have used, and except for the label names, I am happy with the result.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
plt.rcParams['figure.figsize'] = (15, 5)
fixed_data = pd.read_csv('audit-rep.csv',sep=',',encoding='latin1',index_col='Index', parse_dates=['Report Date'],dayfirst=False)
viol_counts = fixed_data['Audit Flag'].value_counts()
viol_counts[:10]
viol_counts[:10].plot(kind='bar')
I have tried to rename the x-axis labels using the code below.
viol_counts.set_ylabel('No. of Violations')
viol_counts.set_title('Audit Results')
viol_counts.set_xticks(ind+width)
viol_counts.set_xticklabels( ('A', 'B','C') )
This is the error I get when using the above code.
AttributeError: 'Series' object has no attribute 'set_ylabel'
I have come across a few other posts related to this issue, but not seen one that specifically addresses the renaming of individual labels.
This isn't utterly important though, and I'm just trying to learn using python, and the actual work has been done in excel.
If you assign the plot's Axes object to a variable name (here I've called it viol_plot), then you can perform actions on that Axes object (you are currently trying to set the labels and ticks on the Series, not the plot):
top = viol_counts[:10]
viol_plot = top.plot(kind='bar')
viol_plot.set_ylabel('No. of Violations')
viol_plot.set_title('Audit Results')
# pandas already puts one tick under each bar, so only the labels need replacing
viol_plot.set_xticklabels([chr(ord('A') + i) for i in range(len(top))])
Methods such as set_ylabel and set_title belong to the Axes (plot) object, not to the Series.
One option would be to use subplots:
figure, ax = pl.subplots(1,1)
#note the ax option.
top = viol_counts[:10]
top.plot(kind='bar', ax=ax)
ax.set_ylabel('No. of Violations')
ax.set_title('Audit Results')
#one tick per bar already exists, so only the labels are replaced
ax.set_xticklabels([chr(ord('A') + i) for i in range(len(top))])
Or, you can just plot and use the module-level functions instead. Here xticks takes the tick positions followed by the labels:
top = viol_counts[:10]
top.plot(kind='bar')
#note names of functions are slightly different
pl.ylabel('No. of Violations')
pl.title('Audit Results')
pl.xticks(range(len(top)), [chr(ord('A') + i) for i in range(len(top))])
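Putting the pieces together, here is a self-contained sketch of the relabelling. The three counts and shortened descriptions stand in for the real viol_counts (the CSV is not available here), and legend_map is a hypothetical name for a lookup from each short label back to its full description.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for viol_counts, using the three example rows from the question.
viol_counts = pd.Series(
    [17, 14, 9],
    index=[
        "Not approved by regional committee ...",
        "No contract with vendor ...",
        "Vendor Registration not on record ...",
    ],
)

fig, ax = plt.subplots()
viol_counts.plot(kind="bar", ax=ax)

# One short letter per bar: A, B, C, ...
short = [chr(ord("A") + i) for i in range(len(viol_counts))]
ax.set_xticklabels(short, rotation=0)
ax.set_ylabel("No. of Violations")
ax.set_title("Audit Results")

# Keep a lookup so the letters can still be traced back to the descriptions.
legend_map = dict(zip(short, viol_counts.index))
shown = [tick.get_text() for tick in ax.get_xticklabels()]
```

Generating the letters from the length of the Series keeps the number of labels equal to the number of bars.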
