Python categorical plot with error bands

I need to make a plot of the following data, with the year_week on x-axis, the test_duration on the y-axis, and each operator as a different series. There may be multiple data points for the same operator in one week. I need to show standard deviation bands around each series.
data = pd.DataFrame({'year_week':[1601,1602,1603,1604,1604,1604],
'operator':['jones','jack','john','jones','jones','jack'],
'test_duration':[10,12,43,7,23,9]})
prints as:
I have looked at seaborn, matplotlib, and pandas, but I cannot find a solution.

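You may be looking for seaborn's pointplot. It draws one line per hue level and, by default, error bars showing a bootstrapped confidence interval. To show the standard deviation instead, pass errorbar="sd" (seaborn 0.12 and later) or ci="sd" (earlier versions).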
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.DataFrame({'year_week':[1601,1602,1603,1604,1604,1604],
'operator':['jones','jack','john','jones','jones','jack'],
'test_duration':[10,12,43,7,23,9]})
sns.pointplot(x="year_week", y="test_duration", hue="operator", data=data)
plt.show()
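If shaded bands rather than error bars are needed, one option is to compute the per-week mean and standard deviation with pandas and draw the bands with matplotlib's fill_between. This is only a minimal sketch: the names stats, weeks and position are my own, and treating the undefined standard deviation of a single-observation group as 0 is an assumption you may want to change.

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'year_week': [1601, 1602, 1603, 1604, 1604, 1604],
    'operator': ['jones', 'jack', 'john', 'jones', 'jones', 'jack'],
    'test_duration': [10, 12, 43, 7, 23, 9],
})

# Mean and sample standard deviation per operator and week.
# A group with one observation has an undefined std (NaN); assume 0 here.
stats = (
    data.groupby(['operator', 'year_week'])['test_duration']
        .agg(['mean', 'std'])
        .fillna({'std': 0})
        .reset_index()
)

# Treat the weeks as categories so they are evenly spaced on the x-axis.
weeks = sorted(data['year_week'].unique())
position = {week: i for i, week in enumerate(weeks)}

fig, ax = plt.subplots()
for operator, grp in stats.groupby('operator'):
    x = grp['year_week'].map(position)
    ax.plot(x, grp['mean'], marker='o', label=operator)
    # Band of one standard deviation on either side of the mean.
    ax.fill_between(x, grp['mean'] - grp['std'], grp['mean'] + grp['std'], alpha=0.2)

ax.set_xticks(range(len(weeks)))
ax.set_xticklabels([str(week) for week in weeks])
ax.set_xlabel('year_week')
ax.set_ylabel('test_duration')
ax.legend(title='operator')
```

Call plt.show() afterwards to display the figure. With so few points per week most bands collapse to the line itself; only jones in week 1604 has more than one observation.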

Related

can seaborn normalise data such that y-axis is clear

I plot time series of data where the y values of the data are orders of magnitude different.
I am using seaborn.lmplot and was expecting to find a normalise keyword, but have been unable to.
I tried to use a log scale, but this failed (see diagram).
This is my best attempt so far:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
gbp_stats = pd.read_csv('price_data.csv')
sns.lmplot(data=gbp_stats, x='numeric_time', y='last trade price', col='symbol')
plt.yscale('log')
plt.show()
Which gave me this:
As you can see, the result needs to scale or normalize the y-axis for each plot. I could do a normalization in pandas, but wanted to avoid such if possible.
So my question is this: does seaborn have a normalize feature so that the y-axes can be compared better than what I have achieved?
This answer is derived directly from mwaskom's comment suggesting sharey=False, with one small tweak: passing sharey directly to lmplot is deprecated in seaborn, so sharey=False now goes into a dict.
To implement it, add the facet_kws keyword, which takes a dict like this: facet_kws={'sharey': False}
So the answer becomes this:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
gbp_stats = pd.read_csv('price_data.csv')
sns.lmplot(data=gbp_stats, x='numeric_time', y='last trade price',
col='symbol', hue='symbol', facet_kws={'sharey':False})
plt.yscale('log') # this is optional now.
plt.show()
And the result is this:
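One caveat about the optional log scale: plt.yscale acts only on the current axes, so with independent y-axes it would, as far as I can tell, change just one facet. A sketch of applying the scale to every facet through the FacetGrid that lmplot returns is below. The DataFrame is a made-up stand-in for price_data.csv (the symbols AAA and BBB and their prices are invented for illustration).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for price_data.csv: two symbols on very different scales.
rng = np.random.default_rng(0)
t = np.arange(50)
gbp_stats = pd.DataFrame({
    'numeric_time': np.tile(t, 2),
    'symbol': np.repeat(['AAA', 'BBB'], len(t)),
    'last trade price': np.concatenate([
        1.0 + 0.01 * t + rng.normal(0, 0.02, len(t)),
        1000 + 5 * t + rng.normal(0, 10, len(t)),
    ]),
})

g = sns.lmplot(data=gbp_stats, x='numeric_time', y='last trade price',
               col='symbol', hue='symbol', facet_kws={'sharey': False})

# FacetGrid.set forwards the keyword to every Axes in the grid.
g.set(yscale='log')
```

Because each facet keeps its own y-limits, the log scale is a matter of taste here rather than a requirement.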

How to create a loop for plotting dataframe features in Python using seaborn

I would like to plot multiple histogram subplots to observe the distribution of each individual feature in a data frame.
I have tried the code below, but it says the DataFrame object has no attribute i. I know the code is wrong somewhere, but I have searched for a solution and could not find a way to write a loop for this.
for i in enumerate(feature):
    plt.subplot(3,3, i[0]+1)
    sns.histplot(df.i[i], kde=True)
Is this what you're looking for?
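import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
df = pd.read_csv('data.csv')
# one histogram subplot per numeric column
df.plot.hist(subplots=True, legend=False)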
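Seaborn alternative (FacetGrid expects long-form data, so the columns are melted first):
import seaborn as sns
long_df = df.melt(var_name='column', value_name='value')
g = sns.FacetGrid(long_df, row='column')
g.map(plt.hist, 'value')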

How to create seaborn violinplot with mean,median and mode displayed?

Is there a way to add a mean and a mode to a violinplot ? I have categorical data in one of my columns and the corresponding values in the next column. I tried looking into matplotlib violin plot as it technically offers the functionality I am looking for but it does not allow me to specify a categorical variable on the x axis, and this is crucial as I am looking at the distribution of the data per category. I have added a small table illustrating the shape of the data.
plt.figure(figsize=(10, 15))
ax=sns.violinplot(x='category',y='value',data=df)
First we calculate the modes and means:
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({'Category':[1,2,5,1,2,4,3,4,2],
'Value':[1.5,1.2,2.2,2.6,2.3,2.7,5,3,0]})
Means = df.groupby('Category')['Value'].mean()
Modes = df.groupby('Category')['Value'].agg(lambda x: pd.Series.mode(x)[0])
You can use seaborn to make the basic plot. Below I remove the inner boxplot by passing inner=None, so that the modes and means are visible:
fig, ax = plt.subplots()
sns.violinplot(x='Category',y='Value',data=df,inner=None)
plt.setp(ax.collections, alpha=.3)
plt.scatter(x=range(len(Means)),y=Means,c="k")
plt.scatter(x=range(len(Modes)),y=Modes)
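The same idea extends to the median asked about in the title. Here is a sketch that gathers the three statistics in one table (the name summary and the marker choices are my own) and labels them in a legend. Note that for continuous values the mode is often not unique; this sketch takes the smallest of the tied values, which is an arbitrary choice.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

df = pd.DataFrame({'Category': [1, 2, 5, 1, 2, 4, 3, 4, 2],
                   'Value': [1.5, 1.2, 2.2, 2.6, 2.3, 2.7, 5, 3, 0]})

# One row per category, sorted by category, matching the violin order.
grouped = df.groupby('Category')['Value']
summary = pd.DataFrame({
    'mean': grouped.mean(),
    'median': grouped.median(),
    'mode': grouped.agg(lambda s: s.mode().iloc[0]),  # smallest value on ties
})

fig, ax = plt.subplots()
sns.violinplot(x='Category', y='Value', data=df, inner=None, ax=ax)
plt.setp(ax.collections, alpha=.3)

# Violins sit at integer positions 0..n-1 along the categorical axis.
positions = range(len(summary))
for name, marker in [('mean', 'o'), ('median', 's'), ('mode', '^')]:
    ax.scatter(positions, summary[name], marker=marker, label=name, zorder=3)
ax.legend()
```

Call plt.show() to display the figure.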

Seaborn plot two data sets on the same scatter plot

I have 2 data sets in Pandas Dataframe and I want to visualize them on the same scatter plot so I tried:
import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(x_vars=['Std'], y_vars=['ATR'], data=set1, hue='Asset Subclass')
sns.pairplot(x_vars=['Std'], y_vars=['ATR'], data=set2, hue='Asset Subclass')
plt.show()
But I always get 2 separate charts instead of a single one.
How can I visualize both data sets on the same plot? Also, can I have the same legend for both data sets but different colors for the second data set?
The following should work with seaborn 0.9.0 or later, the release that introduced scatterplot.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
First we concatenate the two datasets into one and assign a dataset column which will allow us to preserve the information as to which row is from which dataset.
concatenated = pd.concat([set1.assign(dataset='set1'), set2.assign(dataset='set2')])
Then we use the sns.scatterplot function (available since seaborn 0.9.0) and set the style keyword argument so that the marker shapes are based on the dataset column:
sns.scatterplot(x='Std', y='ATR', data=concatenated,
hue='Asset Subclass', style='dataset')
plt.show()
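Since set1 and set2 are not shown in the question, here is a self-contained sketch of the same approach with made-up data (the values and the subclass names Equity and Bond are invented). Passing ignore_index=True to pd.concat is my own addition; it simply keeps the row index unique after stacking the two frames.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-ins for the two data sets.
set1 = pd.DataFrame({'Std': [0.10, 0.20, 0.30],
                     'ATR': [1.0, 1.5, 2.0],
                     'Asset Subclass': ['Equity', 'Bond', 'Equity']})
set2 = pd.DataFrame({'Std': [0.15, 0.25, 0.35],
                     'ATR': [1.2, 1.1, 2.4],
                     'Asset Subclass': ['Bond', 'Equity', 'Bond']})

# Tag every row with its origin, then stack the frames.
concatenated = pd.concat(
    [set1.assign(dataset='set1'), set2.assign(dataset='set2')],
    ignore_index=True,
)

fig, ax = plt.subplots()
# Color encodes the subclass, marker shape encodes the source data set.
sns.scatterplot(x='Std', y='ATR', data=concatenated,
                hue='Asset Subclass', style='dataset', ax=ax)
```

Both data sets then share one axes and one legend; call plt.show() to display it.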

How to change the step size matplotlib uses when plotting timestamp objects?

I'm currently attempting to graph a fairly small dataset using the matplotlib and pandas libraries. The format of the dataset is a CSV file. Here is the dataset:
DATE,UNRATE
1948-01-01,3.4
1948-02-01,3.8
1948-03-01,4.0
1948-04-01,3.9
1948-05-01,3.5
1948-06-01,3.6
1948-07-01,3.6
1948-08-01,3.9
1948-09-01,3.8
1948-10-01,3.7
1948-11-01,3.8
1948-12-01,4.0
I loaded the dataset using pandas (as can be seen, the file that holds that dataset is named 'dataset.csv'):
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('dataset.csv')
dataset['DATE'] = pd.to_datetime(dataset['DATE'])
I then attempted to plot the dataset loaded, using matplotlib:
plt.plot(dataset['DATE'], dataset['UNRATE'])
plt.show()
The code above mostly worked fine, and displayed the following graph:
The problem, however, is that the data I wanted displayed on the x axis, seems to have only been plotted in intervals of two:
I found the question Changing the “tick frequency” on x or y axis in matplotlib?, which relates to my problem, but from my testing the approach only seems to work with integral values.
I also found the question, controlling the number of x ticks in pyplot, which seemed to provide a solution to my problem. The method the answer said to use, to_pydatetime, was a method of DatetimeIndex. Since my understanding is that pandas.to_datetime would return a DatetimeIndex by default, I could use to_pydatetime on dataset['DATE']:
plt.xticks(dataset['DATE'].to_pydatetime())
However, I instead received the error:
AttributeError: 'Series' object has no attribute 'to_pydatetime'
Since this appears to just be default behavior, is there a way to force matplotlib to graph each point along the x axis, rather than simply graphing every other point?
To get rid of the error you may convert the dates as follows and also set the labels accordingly:
plt.xticks(dataset['DATE'].tolist(),dataset['DATE'].tolist())
or, as has been mentioned in the comments,
plt.xticks(dataset['DATE'].dt.to_pydatetime(),dataset['DATE'].dt.to_pydatetime())
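The reason for the error is that pd.to_datetime returns the same kind of container it is given: a Series in, a Series out. Only list-like input yields a DatetimeIndex, which is where to_pydatetime lives. A small sketch illustrating this with the first rows of the data from the question (the variable names as_series, as_index and ticks are my own):

```python
import io
import pandas as pd

csv_text = """DATE,UNRATE
1948-01-01,3.4
1948-02-01,3.8
1948-03-01,4.0
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# A Series goes in, so a Series of datetime64 values comes out.
as_series = pd.to_datetime(dataset['DATE'])

# A plain list goes in, so a DatetimeIndex comes out.
as_index = pd.to_datetime(dataset['DATE'].tolist())

# Wrapping the Series in a DatetimeIndex gives access to to_pydatetime.
ticks = pd.DatetimeIndex(as_series).to_pydatetime()

print(type(as_series).__name__, type(as_index).__name__)
```

The resulting ticks array can be passed to plt.xticks in the same way as the variants above.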
But let's look at some more useful options.
Plotting strings
First of all it is possible to plot the data as it is, i.e. as strings.
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('dateunrate.txt')
plt.plot(dataset['DATE'], dataset['UNRATE'])
plt.setp(plt.gca().get_xticklabels(), rotation=45, ha="right")
plt.show()
This is just like plotting plt.plot(["apple", "banana", "cherry"], [1,2,3]). The successive dates are simply placed one after another on the axis, regardless of whether they are a minute, a day or a year apart. E.g. if your dates were 2018-01-01, 2018-01-03, 2018-01-27 they would still appear equally spaced on the axes.
Plot dates with pandas (automatically)
Pandas can nicely plot dates out of the box if the dates are in the index of the dataframe. To this end you may read the dataframe in a way that the first csv column is parsed as the index.
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('dateunrate.txt', parse_dates=[0], index_col=0)
dataset.plot()
plt.show()
This is equivalent to
dataset = pd.read_csv('dateunrate.txt', parse_dates=[0])
dataset = dataset.set_index("DATE")
dataset.plot()
or
dataset = pd.read_csv('dateunrate.txt')
dataset["DATE"] = pd.to_datetime(dataset["DATE"])
dataset = dataset.set_index("DATE")
dataset.plot()
or even
dataset = pd.read_csv('dateunrate.txt')
dataset["DATE"] = pd.to_datetime(dataset["DATE"])
dataset.plot(x="DATE",y="UNRATE")
This works nicely here because you happen to have one date per month, so pandas decides to show all 12 months as tick labels.
For other cases this may result in different tick locations.
Plot dates with matplotlib or pandas (manually)
In the general case, you may use matplotlib.dates formatters and locators to tweak the tick(label)s in the way you want. Here, we might use a MonthLocator and set the ticklabel format to "%b %Y". This works well with matplotlib plot or pandas plot(x_compat=True).
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as mdates
dataset = pd.read_csv('dateunrate.txt', parse_dates=[0], index_col=0)
plt.plot(dataset.index, dataset['UNRATE'])
# or, with pandas:
# dataset.plot(x_compat=True)  # note the x_compat argument
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))
plt.setp(plt.gca().get_xticklabels(), rotation=45, ha="right")
plt.show()
