Creating a seaborn factor plot using two data sets? - python

I have two data sets:
https://storage.googleapis.com/hewwo/NCHS_-_Leading_Causes_of_Death__United_States.csv
https://storage.googleapis.com/hewwo/BRFSS_Prevalence_and_Trends_Data__Tobacco_Use_-_Four_Level_Smoking_Data_for_2011.csv
From these two data sets, I have created two factor plots
sns.factorplot(x='state', y='deaths', data=death, aspect = 4)
plt.xticks(rotation=90)
Result:
smoking['total_smoker'] = smoking['smoke_everyday'] + smoking['smoke_some_days']
sns.factorplot(x='state', y='total_smoker', data=smoking.sort_values("state"), aspect = 3)
plt.xticks(rotation = 90)
Result:
I am looking for a way to visually compare the two lines. Is there a way to create one factor plot using data from both sets in order to better compare data per state? Is there maybe a better way to show this visualization than what I am thinking? Apologies if my question is unclear; my experience with these tools is still lacking.

Just use two lineplot calls on the same axes:
fig, ax = plt.subplots()
sns.lineplot(..., ax=ax) # first dataset
sns.lineplot(..., ax=ax) # second dataset

Related

Plotting two pandas series together one appears flat

I am practicing with Python Pandas plotting functions and I am trying to plot the content of two series extracted from the same dataframe into one plot.
When I plot the two series individually the result is correct. However, when I plot them together, the one that I plot as second appears flat in the picture.
Here is my code:
# dailyFlow and smooth are created in the same way from the same dataframe
dailyFlow = pd.Series(dataFrame...
smooth = pd.Series(dataFrame...
# lower the noise in the signal with standard deviation = 6
smooth = smooth.resample('D').sum().rolling(31, center=True, win_type='gaussian').sum(std=6)
dailyFlow.plot(style ='-b')
plt.legend(loc = 'upper right')
plt.show()
smooth.plot(style ='-r')
plt.legend(loc = 'upper right')
plt.show()
plt.figure(figsize=(12,5))
smooth.plot(style ='-r')
dailyFlow.plot(style ='-b')
plt.legend(loc = 'upper right')
plt.show()
Here is the output of my function:
I already tried using the parameter secondary_y=True in the second plot, but then I lose the information on the second line in the legend and the scaling between the two plots is wrong.
Many sources on the Internet seem to suggest that plotting the two series like I am doing should be correct, but then why is the third plot incorrect?
Thank you very much for your help.
For the data you have, the 3rd plot is correct. Look at the scale of the y axis on your two plots: one goes up to 70,000 and the other to 60,000,000.
I suspect what you actually want is a .rolling(...).mean() which should have a range comparable to your original data.
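To illustrate the scale difference, here is a small sketch on invented data. It uses a plain (unweighted) rolling window rather than the Gaussian window from the question, so it only demonstrates the sum-versus-mean effect, not the exact smoothing used there.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with values around 1000 (invented data)
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=365, freq="D")
daily = pd.Series(1000 + rng.normal(0, 50, size=len(idx)), index=idx)

window = 31
rolled_sum = daily.rolling(window, center=True).sum()    # roughly 31x the data scale
rolled_mean = daily.rolling(window, center=True).mean()  # same scale as the data

print(daily.max(), rolled_sum.max(), rolled_mean.max())
```

With an unweighted window, the rolling sum is exactly `window` times the rolling mean, which is why the summed series dwarfs the original when both share one y-axis.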
If you would like to make both plots bigger, you could try something like this:
fig, ax1 = plt.subplots()
ax1.set_ylim([0, 75000])
# plot first graph
ax2 = ax1.twinx() # second axes that shares the same x-axis
ax2.set_ylim([0, 60000000])
#plot the second graph
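A fuller sketch of the twin-axes approach, with both lines collected into a single legend (the question mentions losing the second legend entry). The two arrays are invented placeholders for dailyFlow and smooth; only their orders of magnitude mimic the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(100)
small = 50_000 + 10_000 * np.sin(x / 10)  # invented, tens of thousands
large = 40e6 + 10e6 * np.cos(x / 10)      # invented, tens of millions

fig, ax1 = plt.subplots(figsize=(12, 5))
line1, = ax1.plot(x, small, "-b", label="dailyFlow")
ax1.set_ylim(0, 75_000)

ax2 = ax1.twinx()  # second axes sharing the same x-axis
line2, = ax2.plot(x, large, "-r", label="smooth")
ax2.set_ylim(0, 60_000_000)

# Pass both line handles explicitly so one legend covers both axes
ax2.legend(handles=[line1, line2], loc="upper right")
```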

How to print multiple plots together in python?

I am trying to print about 42 plots in 7 rows, 6 columns, but the printed output in jupyter notebook, shows all the plots one under the other. I want them in (7,6) format for comparison. I am using matplotlib.subplot2grid() function.
Note: I do not get any error, and my code works, however the plots are one under the other, vs being in a grid/ matrix form.
Here is my code:
def draw_umap(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean', title=''):
    fit = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        n_components=n_components,
        metric=metric
    )
    u = fit.fit_transform(df);
    plots = []
    plt.figure(0)
    fig = plt.figure()
    fig.set_figheight(10)
    fig.set_figwidth(10)
    for i in range(7):
        for j in range(6):
            plt.subplot2grid((7,6), (i,j), rowspan=7, colspan=6)
            plt.scatter(u[:,0], u[:,1], c= df.iloc[:,0])
    plt.title(title, fontsize=8)
n=range(7)
d=range(6)
for n in n_neighbors:
    for d in dist:
        draw_umap(n_neighbors=n, min_dist=d, title="n_neighbors={}".format(n) + " min_dist={}".format(d))
I did refer to this post to get the plots in a grid and followed the code.
I also referred to this post, and modified my code for size of the fig.
Is there a better way to do this using Seaborn?
What am I missing here? Please help!
Both questions that you have linked contain solutions that seem more complicated than necessary. Note that subplot2grid is useful only if you want to create subplots of varying sizes, which I understand is not your case. Also note that, according to the docs, using GridSpec (as demonstrated in the GridSpec demo) is generally preferred, and I would likewise recommend it only if you want to create subplots of varying sizes.
The simple way to create a grid of equal-sized subplots is to use plt.subplots which returns an array of Axes through which you can loop to plot your data as shown in this answer. That solution should work fine in your case seeing as you are plotting 42 plots in a grid of 7 by 6. But the problem is that in many cases you may find yourself not needing all the Axes of the grid, so you will end up with some empty frames in your figure.
Therefore, I suggest using a more general solution that works in any situation by first creating an empty figure and then adding each Axes with fig.add_subplot as shown in the following example:
import numpy as np # v 1.19.2
import matplotlib.pyplot as plt # v 3.3.4
# Create sample dataset
rng = np.random.default_rng(seed=1) # random number generator
nvars = 8
nobs = 50
xs = rng.uniform(size=(nvars, nobs))
ys = rng.normal(size=(nvars, nobs))
# Create figure with appropriate space between subplots
fig = plt.figure(figsize=(10, 8))
fig.subplots_adjust(hspace=0.4, wspace=0.3)
# Plot data by looping through arrays of variables and list of colors
colors = plt.get_cmap('tab10').colors
for idx, x, y, color in zip(range(len(xs)), xs, ys, colors):
    ax = fig.add_subplot(3, 3, idx+1)
    ax.scatter(x, y, color=color)
This could be done in seaborn as well, but I would need to see what your dataset looks like to provide a solution relevant to your case.
You can find a more elaborate example of this approach in the second solution in this answer.
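For the full 7 by 6 case from the question, the plt.subplots route mentioned above could look roughly like this. The random points are a hypothetical stand-in for each UMAP embedding, since the original dataset is not shown.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
nrows, ncols = 7, 6

# One figure, one 7x6 array of equal-sized Axes
fig, axes = plt.subplots(nrows, ncols, figsize=(12, 14), sharex=True, sharey=True)
for idx, ax in enumerate(axes.flat):
    points = rng.normal(size=(30, 2))  # placeholder for one 2-D embedding
    ax.scatter(points[:, 0], points[:, 1], s=5)
    ax.set_title(f"panel {idx}", fontsize=8)
fig.tight_layout()
```

The key difference from the code in the question is that the figure is created once, outside the loop, and each parameter combination draws into its own Axes.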

How to improve this seaborn countplot?

I used the following code to generate the countplot in python using seaborn:
sns.countplot( x='Genres', data=gn_s)
But I got the following output:
I can't see the items on x-axis clearly as they are overlapping. How can I correct that?
Also I would like all the items to be arranged in a decreasing order of count. How can I achieve that?
You can rotate the x-axis tick labels so that they are vertical, for example:
g = sns.countplot( x='Genres', data=gn_s)
g.set_xticklabels(g.get_xticklabels(),rotation=90)
Or, you can also do:
plt.xticks(rotation=90)
Bring in matplotlib to set up an axis ahead of time, so that you can modify the axis tick labels by rotating them 90 degrees and/or changing font size. To arrange your samples in order, you need to modify the source. I assume you're starting with a pandas dataframe, so something like:
data = data.sort_values(by='Genres', ascending=False)
labels = # list of labels in the correct order, probably your data.index
fig, ax1 = plt.subplots(1,1)
sns.countplot( x='Genres', data=gn_s, ax=ax1)
ax1.set_xticklabels(labels, rotation=90)
would probably help.
Edit: taking andrewnagyeb's suggestion from the comments to order the plot:
sns.countplot( x='Genres', data=gn_s, order = gn_s['Genres'].value_counts().index)
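Putting the ordering and the rotated labels together, a self-contained sketch might look like this. The small gn_s frame is invented for illustration; only the column name Genres comes from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: Drama x3, Action x2, Comedy x1
gn_s = pd.DataFrame({"Genres": ["Drama", "Action", "Drama",
                                "Comedy", "Drama", "Action"]})

# value_counts() sorts by count, descending, so its index is the bar order
order = gn_s["Genres"].value_counts().index

fig, ax = plt.subplots()
sns.countplot(x="Genres", data=gn_s, order=order, ax=ax)
ax.tick_params(axis="x", labelrotation=90)  # vertical tick labels

heights = [patch.get_height() for patch in ax.patches]  # bar heights as drawn
```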

How to use different axis scales in pandas' DataFrame.plot.hist?

I find DataFrame.plot.hist to be amazingly convenient, but I cannot find a solution in this case.
I want to plot the distribution of many columns in the dataset. The problem is that pandas retains the same scale on all x axes, rendering most of the plots useless. Here is the code I'm using:
X.plot.hist(subplots=True, layout=(13, 6), figsize=(20, 45), bins=50, sharey=False, sharex=False)
plt.show()
And here's a section of the result:
It appears that the issue is that pandas uses the same bins on all the columns, irrespective of their values. Is there a convenient solution in pandas or am I forced to do it by hand?
I centered the data (zero mean and unit variance) and the result improved a little, but it's still not acceptable.
There are a couple of options, here is the code and output:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Dummy data - value ranges differ a lot between columns
X = pd.DataFrame()
for i in range(18):
    X['COL0{0}'.format(i+38)]=(2**i)*np.random.random(1000)
# Method 1 - just using the hist function to generate each plot
X.hist(layout=(3, 6), figsize=(20, 10), sharey=False, sharex=False, bins=50)
plt.title('Method 1')
plt.show()
# Method 2 - generate each plot separately
# (plt.cm.spectral was removed from matplotlib; nipy_spectral is its replacement)
cols = plt.cm.nipy_spectral(np.arange(1,255,13))
fig, axes = plt.subplots(3,6,figsize=(20,10))
for index, column in enumerate(X.columns):
    ax = axes.flatten()[index]
    ax.hist(X[column],bins=50, label=column, fc=cols[index])
    ax.legend(loc='upper right')
    ax.set_ylim((0,1.2*ax.get_ylim()[1]))
fig.suptitle('Method 2')
plt.show()
The first plot:
The second plot:
I would definitely recommend the second method as you have much more control over the individual plots, for example you can change the axes scales, labels, grid parameters, and almost anything else.
I couldn't find anything that would allow you to modify the original plot.hist bins to accept individually calculated bins.
I hope this helps!

How to plot non-numeric data in Matplotlib

I wish to plot the time variation of my y-axis variable using Matplotlib. This is no problem for continuous numeric data, but how should it be tackled for non-continuous data?
For example, if I wanted to visualise the times at which my car was stationary on the way to work, the x-axis would be time and the y-axis would consist of the values 'stationary' and 'moving' (a pretty useless example, I know).
The non-continuous data would need to be indexed somehow, but I don't know how to proceed. Any ideas?
Is this the type of thing you want? (If not, you might want to check out the matplotlib gallery page to give yourself some ideas, or maybe just draw a picture and post it.)
import matplotlib.pyplot as plt
data = [0]*5 + [1]*10 + [0]*3 +[1]*2
print(data)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(data)
ax.set_yticks((0, 1.))
ax.set_yticklabels(('stopped', 'moving'))
ax.set_ybound((-.2, 1.2))
ax.set_xlabel("time (minutes)")
plt.show()
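If the data starts out as strings rather than 0/1, one possible variant is to map each category to an integer code first and then label the ticks. The states list below is invented to match the example above, and drawing with a step line is a choice of this sketch, not something the answer requires.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical per-minute observations as strings
states = ["stopped"] * 5 + ["moving"] * 10 + ["stopped"] * 3 + ["moving"] * 2

# Fix the category order, then index each observation by its position
categories = ["stopped", "moving"]
codes = [categories.index(s) for s in states]

fig, ax = plt.subplots()
ax.step(range(len(codes)), codes, where="post")  # hold each state until the next sample
ax.set_yticks(range(len(categories)))
ax.set_yticklabels(categories)
ax.set_ylim(-0.2, 1.2)
ax.set_xlabel("time (minutes)")
```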
