I have a dataframe that consists of a bunch of x,y data that I'd like to see in scatter form along with a line. The dataframe consists of data with its form repeated over multiple categories. The end result I'd like to see is some kind of grid of the plots, but I'm not totally sure how matplotlib handles multiple subplots of overplotted data.
Here's an example of the kind of data I'm working with:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
category = np.arange(1,10)
frames = []
for i in category:
    x = np.arange(0,100)
    y = 2*x + 10
    data = np.random.normal(0,1,100) * y
    frames.append(pd.DataFrame({'x':x, 'y':y, 'data':data, 'category':i}))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
total_data = pd.concat(frames)
Here x is the independent variable, y is a linear model, and data is a generated noisy dataset that the model is meant to describe.
I had been able to generate individual plots based on subsetting the master dataset, but I'd like to see them all side-by-side in a 3x3 grid in this case. However, calling the plots within the loop just overplots them all onto one single image.
Is there a good way to take the following code block and make a grid out of the category subsets? Am I overcomplicating it by doing the subset within the plot call?
plt.scatter(total_data['x'][total_data['category']==1], total_data['data'][total_data['category']==1])
plt.plot(total_data['x'][total_data['category']==1], total_data['y'][total_data['category']==1], linewidth=4, color='black')
If there's a simpler way to generate the by-category scatter plus line, I'm all for it. I don't know if seaborn has a similar or more intuitive method to use than pyplot.
You can use either sns.FacetGrid or plain matplotlib. For example:
import seaborn as sns
g = sns.FacetGrid(data=total_data, col='category', col_wrap=3)
g = g.map(plt.scatter, 'x', 'data')
g = g.map(plt.plot, 'x', 'y', color='k')
Or manual plt with groupby:
fig, axes = plt.subplots(3,3)
for (cat, data), ax in zip(total_data.groupby('category'), axes.ravel()):
    ax.scatter(data['x'], data['data'])
    ax.plot(data['x'], data['y'], color='k')
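A slightly fuller sketch of the groupby approach, in case it helps: it rebuilds data shaped like the question's, shares the axes across panels, and titles each panel. The seeded generator, the figure size, the marker size and the Agg backend are my own assumptions, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild data shaped like the question's: nine categories, 100 points each
rng = np.random.default_rng(0)
frames = []
for cat in range(1, 10):
    x = np.arange(100)
    y = 2 * x + 10
    frames.append(pd.DataFrame({"x": x, "y": y,
                                "data": rng.normal(0, 1, 100) * y,
                                "category": cat}))
total_data = pd.concat(frames, ignore_index=True)

# One Axes per category; sharex/sharey keep the panels comparable
fig, axes = plt.subplots(3, 3, sharex=True, sharey=True, figsize=(9, 9))
for (cat, group), ax in zip(total_data.groupby("category"), axes.ravel()):
    ax.scatter(group["x"], group["data"], s=8)
    ax.plot(group["x"], group["y"], color="k", linewidth=2)
    ax.set_title(f"category {cat}")
fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards, depending on whether you are working interactively.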
When exploring a dataset I often use Pandas' DataFrame.hist() method to quickly display a grid of histograms for every numeric column in the dataframe, for example:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.hist(bins=50, figsize=(10,7))
plt.show()
Which produces a figure with separate plots for each column:
I've tried the following:
import pandas as pd
import seaborn as sns
from sklearn import datasets
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
for col_id in df.columns:
    sns.distplot(df[col_id])
But this produces a figure with a single plot and all columns overlaid:
Is there a way to produce a grid of histograms showing the data from a DataFrame's columns with Seaborn?
You can take advantage of seaborn's FacetGrid if you reorganize your dataframe using melt. Seaborn typically expects data organized this way (long format).
g = sns.FacetGrid(df.melt(), col='variable', col_wrap=2)
g.map(plt.hist, 'value')
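To make the "long format" concrete, here is a small pandas-only sketch of what melt does. The frame and its column names are hypothetical stand-ins for the iris features.

```python
import pandas as pd

# Hypothetical wide-format frame standing in for the iris features
wide = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7],
                     "sepal_width": [3.5, 3.0, 3.2]})

# melt() stacks every column into (variable, value) pairs: the long format
long = wide.melt()
print(long)
```

The resulting 'variable' column holds the original column names, which is why FacetGrid can facet on col='variable' and plot 'value'.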
There is no direct equivalent, because seaborn's distplot only accepts a single 1-D array or list. You can generate the subplots yourself instead:
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
for i in range(ax.shape[0]):
    for j in range(ax.shape[1]):
        sns.distplot(df[df.columns[i*2+j]], ax=ax[i][j])
Here is an example of how to show 4 graphs using subplots with seaborn:
https://seaborn.pydata.org/examples/distplot_options.html
Another useful seaborn method for quickly displaying a grid of histograms for every numeric column in the dataframe is the quick, clean and handy sns.pairplot(). Try:
sns.pairplot(df)
It has a lot of useful parameters you can explore, such as hue.
pairplot example for iris dataset
If you don't want the scatter plots, you can create a customised grid very quickly using sns.PairGrid(df). This creates an empty grid of axes onto which you can map whatever you want:
g = sns.PairGrid(df)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
map_diag takes a univariate function and map_offdiag a bivariate one, so swap in whichever plotting functions you need.
I ended up adapting jcaliz's answer to make it work more generally, i.e. not just when the DataFrame has four columns. I also added code to remove any unused axes and to ensure the axes appear in alphabetical order (as with df.hist()).
import math
size = int(math.ceil(len(df.columns)**0.5))
fig, ax = plt.subplots(size, size, figsize=(10, 10))
for i in range(ax.shape[0]):
    for j in range(ax.shape[1]):
        data_index = i*ax.shape[1]+j
        if data_index < len(df.columns):
            sns.distplot(df[df.columns.sort_values()[data_index]], ax=ax[i][j])
# Remove the axes that have no column to show
for i in range(len(df.columns), size ** 2):
    fig.delaxes(ax[i // size][i % size])
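The same idea can be sketched with a flattened axes array, which avoids the row/column index arithmetic. Note that this version swaps in matplotlib's own ax.hist rather than sns.distplot (which recent seaborn releases deprecate), and the five-column frame is hypothetical test data.

```python
import math
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical five-column frame, so the square grid is only partly used
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("edcba"))

cols = sorted(df.columns)                      # alphabetical, like df.hist()
size = math.ceil(math.sqrt(len(cols)))         # smallest square grid that fits
fig, axes = plt.subplots(size, size, figsize=(10, 10), squeeze=False)
flat = axes.ravel()

# Fill the first len(cols) axes, one histogram per column
for ax, col in zip(flat, cols):
    ax.hist(df[col], bins=30)
    ax.set_title(col)

# Drop the leftover axes that have no column to show
for ax in flat[len(cols):]:
    fig.delaxes(ax)
```

squeeze=False keeps axes two-dimensional even for a single column, so ravel() always works.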
I am trying to make a scatterplot over two different types of categorical variables, each with three different levels. Right now I am using the seaborn library in python:
sns.pairplot(x_vars = ['UTM_x'], y_vars = ['UTM_y'], data = df, hue = "Mobility_Provider", height = 5)
sns.pairplot(x_vars = ['UTM_x'], y_vars = ['UTM_y'], data = df, hue = "zone_number", height = 5)
which gives me two separate scatter plots, one grouped by Mobility_Provider and one grouped by zone_number. However, I was wondering if it's possible to combine these two graphs, e.g. with different levels of Mobility_Provider shown in different colours and different levels of zone_number shown with different marker shapes.
Thanks a lot!
Each row of the df has x and y values, and two categorical variables ("Mobility_Provider" and "zone_number")
This can be done easily with seaborn's scatterplot; just pass
hue = "Mobility_Provider",style="zone_number"
Something like this:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4],'y':[1,2,3,4],'Mobility_Provider':[0,0,1,1],\
'zone_number':[0,1,0,1]})
sns.scatterplot(x="x", y="y",s=100,hue='Mobility_Provider',style='zone_number', data=df)
plt.show()
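If seaborn isn't available, roughly the same encoding can be sketched in plain matplotlib by grouping on both categorical columns. This is an alternative to the answer above, not what sns.scatterplot does internally; the colour and marker lookup tables are hypothetical choices of mine.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 2, 3, 4],
                   "Mobility_Provider": [0, 0, 1, 1],
                   "zone_number": [0, 1, 0, 1]})

# Hypothetical lookup tables: one colour per provider, one marker per zone
colours = {0: "tab:blue", 1: "tab:orange"}
markers = {0: "o", 1: "s"}

fig, ax = plt.subplots()
# One scatter call per (provider, zone) combination present in the data
for (provider, zone), group in df.groupby(["Mobility_Provider", "zone_number"]):
    ax.scatter(group["x"], group["y"], s=100,
               color=colours[provider], marker=markers[zone],
               label=f"provider {provider}, zone {zone}")
ax.legend()
```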
I'm trying to plot the relationship of two independent variables x and y with a dependent variable score as a heatmap: x and y are integer values from 0 to infinity and score is a real value between 0 and 1.
Desired appearance
There are a large number of seen values for x and y, so I would like to have it look more like a typical density plot like the example below, since the exact values for each individual (x, y) are not of great importance:
(example taken from Seaborn's documentation)
Current approach
Currently, I'm trying to use Seaborn's heatmap(..) function to plot the data, but the resulting plot is almost unreadable, with a large amount of space between each discrete data point rather than a "continuous" gradient. The logic for plotting used is as follows:
import pandas as pd
from matplotlib.pyplot import cm
import seaborn as sns
sns.set_style("whitegrid")
df = read_df_using_pandas(...)
table = df.pivot_table(
    values="score",
    index="y",
    columns="x", aggfunc='mean')
ax = sns.heatmap(table, cmap=cm.magma_r)
ax.invert_yaxis()
# get the figure from the Axes returned by heatmap (sns_plot was undefined)
fig = ax.get_figure()
fig.savefig("some_outfile.png", format="png")
The result plot looks like the following, which is wrong, as it does not match the desired appearance described in the section above:
I do not know why there is a large amount of space between each discrete data point rather than a "continuous" gradient. How can I plot the relationship between my data composed of two discrete values (x and y) which is represented as a third, scalar value (score), in a way which mimics the style of a gradient density plot? The solution need not use either Seaborn or even matplotlib.
Use imshow.
Here is an example that works for me, where 'toplot' is a matrix containing the values you want the heatmap for:
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(6,6))
plt.clf()
ax = fig.add_subplot(111)
toplot = INSERT MATRIX HERE
res = ax.imshow(toplot, cmap=plt.cm.viridis, vmin = 0)
cb = fig.colorbar(res,fraction=0.046, pad=0.04)
plt.title('Heatmap')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
row = np.where(toplot == toplot.max())[0][0]
column= np.where(toplot == toplot.max())[1][0]
plt.plot(column,row,'*')
plt.savefig('plots/heatmap.png', format='png')
I also added a star, indicating the highest point in the plot, which I needed.
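For the question's setup specifically, one way to get from the (x, y, score) table to a matrix for imshow is sketched below. The random data, the grid size of 15 by 20, and the bilinear interpolation are all assumptions for illustration, so adjust them to your own data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the question's data: integer x, y and a score in [0, 1)
df = pd.DataFrame({"x": rng.integers(0, 20, 2000),
                   "y": rng.integers(0, 15, 2000),
                   "score": rng.random(2000)})

# Mean score per (y, x) cell; reindex so any unseen integer becomes a NaN cell
table = df.pivot_table(values="score", index="y", columns="x", aggfunc="mean")
table = table.reindex(index=range(15), columns=range(20))

fig, ax = plt.subplots(figsize=(6, 6))
# origin="lower" puts y=0 at the bottom; interpolation smooths between cells
im = ax.imshow(table.to_numpy(), cmap="magma_r", vmin=0, vmax=1,
               origin="lower", interpolation="bilinear", aspect="auto")
fig.colorbar(im, ax=ax, label="mean score")
ax.set_xlabel("x")
ax.set_ylabel("y")
```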
I am trying to plot multiple lines in a 3D plot using matplotlib. I have 6 data sets with x and y values. What I've tried so far is to give each point in a data set a z-value, so all points in data set 1 have z=1, all points in data set 2 have z=2, and so on.
Then I exported them into three files. "X.txt" containing all x-values, "Y.txt" containing all y-values, same for "Z.txt".
Here's the code so far:
#!/usr/bin/python
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
import numpy as np
import pylab
xdata = '/X.txt'
ydata = '/Y.txt'
zdata = '/Z.txt'
X = np.loadtxt(xdata)
Y = np.loadtxt(ydata)
Z = np.loadtxt(zdata)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X,Y,Z)
plt.show()
What I get looks pretty close to what I need. But when using wireframe, the first point and the last point of each dataset are connected. How can I change the colour of the line for each data set and how can I remove the connecting lines between the datasets?
Is there a better plotting style than wireframe?
Load the data sets individually, and then plot each one individually.
I don't know what formats you have, but you want something like this
from mpl_toolkits.mplot3d.axes3d import Axes3D
import matplotlib.pyplot as plt
fig, ax = plt.subplots(subplot_kw={'projection': '3d'})
datasets = [{"x":[1,2,3], "y":[1,4,9], "z":[0,0,0], "colour": "red"} for _ in range(6)]
for dataset in datasets:
    ax.plot(dataset["x"], dataset["y"], dataset["z"], color=dataset["colour"])
plt.show()
Each time you call plot (or plot_wireframe, though I don't know why you would need that) on an axes object, it adds the data as a new series. If you leave out the color argument, matplotlib will choose colours for you, but it cycles through a fixed list, so once you add enough series it starts reusing the same colours.
N.B. I haven't tested this, but I'm pretty sure color is the correct argument name.
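A variant closer to the question's layout, where data set k sits at z = k, might look like the sketch below. The sine curves are made-up y-values, and leaving out color relies on matplotlib's default colour cycle to give each line its own colour.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers '3d' on older matplotlib

fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
x = np.linspace(0, 10, 50)
for k in range(1, 7):
    y = np.sin(x + k)        # hypothetical y-values for data set k
    z = np.full_like(x, k)   # every point of data set k sits at z = k
    # A separate plot call per data set, so no line joins one set to the next
    ax.plot(x, y, z, label=f"data set {k}")
ax.legend()
```

Because each data set is its own plot call, the last point of one set is never connected to the first point of the next.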
I haven't really attempted any way to do this, but I am wondering if there is a way to merge two plots that already exist into one graph. Any input would be greatly appreciated!
Here is a complete minimal working example that goes through all the steps you need to extract and combine the data from multiple plots.
import numpy as np
import pylab as plt
# Create some test data
secret_data_X1 = np.linspace(0,1,100)
secret_data_Y1 = secret_data_X1**2
secret_data_X2 = np.linspace(1,2,100)
secret_data_Y2 = secret_data_X2**2
# Show the secret data
plt.subplot(2,1,1)
plt.plot(secret_data_X1,secret_data_Y1,'r')
plt.plot(secret_data_X2,secret_data_Y2,'b')
# Loop through the plots created and find the x,y values
X,Y = [], []
for lines in plt.gca().get_lines():
    for x,y in lines.get_xydata():
        X.append(x)
        Y.append(y)
# If you are doing a line plot, we don't know if the x values are
# sequential, we sort based off the x-values
idx = np.argsort(X)
X = np.array(X)[idx]
Y = np.array(Y)[idx]
plt.subplot(2,1,2)
plt.plot(X,Y,'g')
plt.show()
Assuming you are using Matplotlib, you can get the data for line n on the current axes as an N×2 numpy array like so:
plt.gca().get_lines()[n].get_xydata()
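A minimal sketch of that call on a freshly drawn line, assuming a single plot of 100 points (the parabola is made-up test data):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 1, 100)
ax.plot(x, x**2, "r")

# get_lines() lists the Line2D objects on the Axes, in plotting order
xy = ax.get_lines()[0].get_xydata()
print(xy.shape)  # (100, 2): column 0 is x, column 1 is y
```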