Matplotlib scatter(): Polynomial regression line [duplicate] - python

This question already has answers here:
Simplest way to make a polynomial regression with sklearn?
(2 answers)
polynomial regression using python
(3 answers)
Closed 4 years ago.
Is it possible to do a polynomial regression line on a scatter() in matplotlib?
This is my graph:
https://imgur.com/a/Xh1BO
alg_n = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4...]
orig_hc_runtime = [0.01, 0.02, 0.03, 0.04, 0.04, 0.04, 0.05, 0.09...]
plt.scatter(alg_n, orig_hc_runtime, label="Orig HC", color="b", s=4)
plt.scatter(alg_n, mod_hc_runtime, label="Mod HC", color="c", s=4)
...
x_values = [x for x in range(5, n_init+2, 2)]
y_values = [y for y in range(0, 10, 2)]
plt.xlabel("Number of Queens")
plt.ylabel("Time (sec)")
plt.title("Algorithm Performance: Time")
plt.xticks(x_values)
plt.yticks(y_values)
plt.grid(linewidth="1", color="white")
plt.legend()
plt.show()
Is it possible to have regression lines for each data set? If so, can you please explain how I can do it?

I'm not sure whether it can be done with matplotlib alone, but you can always compute the regression separately and plot the result. Here is an example that uses scikit-learn to compute the regression line.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
x = [1, 2, 3, 4, 5, 8, 10]
y = [1.1, 3.8, 8.5, 16, 24, 65, 99.2]
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(np.array(x).reshape(-1, 1), y)
x_reg = np.arange(11)
y_reg = model.predict(x_reg.reshape(-1, 1))
plt.scatter(x, y)
plt.plot(x_reg, y_reg)
plt.show()
Output: a scatter plot of the points with the fitted quadratic curve drawn through them.
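
If you would rather not pull in scikit-learn, the same idea can be sketched with numpy.polyfit, fitting one polynomial per data set and drawing it over the matching scatter. This is only a sketch: the degree of 2 is an assumption, the "Orig HC" values are the ones quoted in this thread, and the "Mod HC" values are made up because the question does not show them.

```python
import numpy as np
import matplotlib.pyplot as plt

alg_n = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
runtimes = {
    # Values quoted in the thread
    "Orig HC": np.array([0.01, 0.02, 0.03, 0.04, 0.04, 0.04,
                         0.05, 0.09, 0.12, 0.14, 0.15, 0.16]),
    # Made-up placeholder values, replace with your own mod_hc_runtime
    "Mod HC": np.array([0.01, 0.01, 0.02, 0.03, 0.03, 0.04,
                        0.04, 0.06, 0.07, 0.09, 0.10, 0.11]),
}

degree = 2  # assumed polynomial degree
x_fit = np.linspace(alg_n.min(), alg_n.max(), 100)

fig, ax = plt.subplots()
fits = {}
for (label, y), color in zip(runtimes.items(), ["b", "c"]):
    # Least-squares polynomial coefficients, highest power first
    coeffs = np.polyfit(alg_n, y, degree)
    fits[label] = coeffs
    ax.scatter(alg_n, y, label=label, color=color, s=4)
    # Evaluate the fitted polynomial on a dense grid so the curve looks smooth
    ax.plot(x_fit, np.polyval(coeffs, x_fit), color=color)

ax.set_xlabel("Number of Queens")
ax.set_ylabel("Time (sec)")
ax.legend()
```

Call plt.show() afterwards to display the figure. A higher degree follows the points more closely but can overfit when there are only a few distinct x values.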

I would advise you to use the Seaborn library. It is built on top of matplotlib and has many statistical plotting routines. Have a look at the examples for regplot and lmplot: http://seaborn.pydata.org/tutorial/regression.html#functions-to-draw-linear-regression-models
In your case, you could do something like:
import pandas as pd
import seaborn as sns
df = pd.DataFrame.from_dict({
    "Number of Queens": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "Time (sec)": [0.01, 0.02, 0.03, 0.04, 0.04, 0.04,
                   0.05, 0.09, 0.12, 0.14, 0.15, 0.16],
})
# Recent seaborn releases require x, y and data as keyword arguments.
# order=1 fits a straight line; a higher order fits a polynomial.
sns.lmplot(x='Number of Queens', y='Time (sec)', data=df, order=1)
If you want regression lines for different groups, add a column with the group labels and pass its name to the hue parameter of lmplot.
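
A minimal sketch of the hue approach, assuming a long-format DataFrame with one row per measurement. The "Algorithm" column name is my own choice, the "Mod HC" timings are made-up placeholders, and order=2 is an assumed polynomial degree.

```python
import pandas as pd
import seaborn as sns

queens = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
orig_hc = [0.01, 0.02, 0.03, 0.04, 0.04, 0.04,
           0.05, 0.09, 0.12, 0.14, 0.15, 0.16]
# Made-up placeholder values for the second data set
mod_hc = [0.01, 0.01, 0.02, 0.03, 0.03, 0.04,
          0.04, 0.06, 0.07, 0.09, 0.10, 0.11]

# One row per measurement, with a label column that tells the groups apart
df = pd.DataFrame({
    "Number of Queens": queens * 2,
    "Time (sec)": orig_hc + mod_hc,
    "Algorithm": ["Orig HC"] * len(queens) + ["Mod HC"] * len(queens),
})

# hue draws one scatter and one fitted curve per label
g = sns.lmplot(data=df, x="Number of Queens", y="Time (sec)",
               hue="Algorithm", order=2)
```

lmplot creates its own figure, so it is used instead of, not on top of, the existing plt.scatter calls.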


Matplotlib Plot curve logistic regression

I am trying to plot the trained curve in matplotlib. However I am getting this thing:
The scatter works fine:
How can I create the curve using plot?
The values in X_train are probably not in sorted order, so try sorting them first. For instance, if X_train is just a list of numbers, you could write:
X_train.sort()
You can plot a smooth curve by first computing an interpolating spline with scipy.interpolate.make_interp_spline():
import numpy as np
from scipy.interpolate import make_interp_spline
import matplotlib.pyplot as plt
# Dataset
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([20, 30, 5, 12, 39, 48, 50, 3])
X_Y_Spline = make_interp_spline(x, y)
# Returns evenly spaced numbers
# over a specified interval.
X_ = np.linspace(x.min(), x.max(), 500)
Y_ = X_Y_Spline(X_)
# Plotting the Graph
plt.plot(X_, Y_)
plt.title("Plot Smooth Curve Using scipy.interpolate.make_interp_spline()")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
Result:
It seems that you have unsorted values in X_train. For instance, if
In [1]: X_train
Out [1]: array([30, 20, 50, 40])
then
In [2]: model.predict_proba(X_train)
Out [2]: array([0.2, 0.1, 0.8, 0.5])
Here, plt.plot will try to plot lines from point [30, 0.2] to point [20, 0.1], then from [20, 0.1] to [50, 0.8], then from [50, 0.8] to [40, 0.5].
Thus, the solution to your problem is to sort X_train before plotting =)
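import numpy as np
import matplotlib.pyplot as plt
# Sort the inputs so the line is drawn from left to right
X_train_sorted = np.sort(np.ravel(X_train))
# Assuming a scikit-learn style binary classifier: it expects a 2-D input and
# predict_proba returns one column per class, so take the positive-class column
y_train_sorted = model.predict_proba(X_train_sorted.reshape(-1, 1))[:, 1]
plt.scatter(X_train_sorted, y_train_sorted)
plt.plot(X_train_sorted, y_train_sorted)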
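
Another way to avoid the zigzag is to not plot at the training points at all, and instead evaluate the model on an evenly spaced grid, which is sorted by construction. The sketch below assumes a scikit-learn LogisticRegression with a single feature; the training data is randomly generated just for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Made-up, deliberately unsorted training data: one feature, binary label
X_train = rng.uniform(0, 100, size=60).reshape(-1, 1)
y_train = (X_train.ravel() + rng.normal(0, 15, size=60) > 50).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Evenly spaced, already sorted x values covering the training range
x_grid = np.linspace(X_train.min(), X_train.max(), 200).reshape(-1, 1)
# Column 1 holds the predicted probability of class 1
p_grid = model.predict_proba(x_grid)[:, 1]

fig, ax = plt.subplots()
ax.scatter(X_train.ravel(), y_train, s=10, label="training data")
ax.plot(x_grid.ravel(), p_grid, color="r", label="P(y = 1)")
ax.legend()
```

Because the grid is dense, the curve also looks smooth even when there are only a few training points.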

Use rows in a dataframe as inputs for function and visualization

Please help me iterate over each row of the following .CSV file, loaded as a Pandas dataframe called df_data in my subsequent code:
aud,wllt_cnt,bcr
Group A,64700,0.15116928389684975
Group B,7654,0.02786525362332031
Group C,11183,0.01278621197465396
Group D,8025,0.00881753794562903
Currently, I created the following function by providing references to specific locations (please refer to variables total_group_size and crc):
import pandas as pd
import itertools
import numpy as np
import statsmodels.stats.api as sms
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
# Group A
# making a dataframe that captures the combinations of holdout / lift
def expand_grid(data_dict):
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())
# making a function for testing
def test(holdout,lift):
    cl = 0.9
    alpha = 1-cl
    total_group_size = df_data.iloc[0,1]
    wllt_cnt_reach = total_group_size*0.6
    crc = df_data.iloc[0,2]
    conversion_rate_control = crc
    conversion_rate_test = conversion_rate_control*(1+lift)
    es = sms.proportion_effectsize(conversion_rate_test, conversion_rate_control)
    n1 = wllt_cnt_reach*(1-holdout)
    n2 = wllt_cnt_reach*holdout
    return sms.NormalIndPower().solve_power(es, nobs1=n1, alpha=alpha, ratio=n2/n1, alternative='two-sided')
holdout=np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5])
lift=np.array([0.01, 0.02, 0.025, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.125, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4])
df = expand_grid({'holdout':holdout,'lift':lift})
Then applying the function row-wise:
df['power'] = df.apply(lambda x: test(x['holdout'], x['lift']), axis=1)
And plotting a heatmap (here I also used a reference to the df_data dataframe for the plot title; please refer to plt_title):
plt_title = df_data.iloc[0,0]
x_axis_labels = [1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
y_axis_labels = [1, 2, 2.5, 3, 4, 5, 6, 7, 8, 9, 10, 12.5, 15, 20, 25, 30, 35, 40]
fig, ax = plt.subplots(figsize = (13,6))
ax = sns.heatmap(data=df.pivot(index='lift',columns='holdout',values='power'),
annot=True, fmt='.0%', cmap=sns.color_palette("RdYlGn", 200), xticklabels=x_axis_labels, yticklabels=y_axis_labels,
cbar_kws={'label': 'Power'})
fig.axes[0].invert_yaxis()
ax.set_title(plt_title)
plt.xlabel("Holdout")
plt.ylabel("% Lift")
b, t = plt.ylim()
b -= 0.5
t += 0.5
plt.ylim(b, t)
plt.show()
I need to optimize this code, so that I don't have to copy-paste the same code to apply it to each row of the df_data dataframe. My understanding is that I need to change the function test from above to take more inputs, then wrap the plotting into another function and show a heatmap for each of the groups (rows in my .CSV file), but I'm stuck at the first step in this action plan.
Please help me get this off the ground, and thank you in advance!
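
One possible first step, sketched under assumptions: pass the group size and baseline conversion rate into the function as arguments instead of reading them from fixed iloc positions, then loop over the rows with itertuples. The names power_for and power_grid are hypothetical, and df_data is rebuilt inline from the CSV rows above so the snippet runs on its own.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.stats.api as sms

# Stand-in for the CSV shown in the question
df_data = pd.DataFrame({
    "aud": ["Group A", "Group B", "Group C", "Group D"],
    "wllt_cnt": [64700, 7654, 11183, 8025],
    "bcr": [0.15116928389684975, 0.02786525362332031,
            0.01278621197465396, 0.00881753794562903],
})

holdout = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.15,
                    0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5])
lift = np.array([0.01, 0.02, 0.025, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08,
                 0.09, 0.1, 0.125, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4])


def power_for(holdout, lift, group_size, crc, cl=0.9, reach=0.6):
    """Same calculation as test() in the question, with the group passed in."""
    alpha = 1 - cl
    reached = group_size * reach
    es = sms.proportion_effectsize(crc * (1 + lift), crc)
    n1 = reached * (1 - holdout)
    n2 = reached * holdout
    return sms.NormalIndPower().solve_power(
        es, nobs1=n1, alpha=alpha, ratio=n2 / n1, alternative="two-sided")


def power_grid(group_size, crc):
    """Power for every holdout/lift combination, pivoted for a heatmap."""
    rows = itertools.product(holdout, lift)
    grid = pd.DataFrame.from_records(rows, columns=["holdout", "lift"])
    grid["power"] = [power_for(h, l, group_size, crc)
                     for h, l in zip(grid["holdout"], grid["lift"])]
    return grid.pivot(index="lift", columns="holdout", values="power")


# One pivot table per row of df_data, keyed by group name
grids = {row.aud: power_grid(row.wllt_cnt, row.bcr)
         for row in df_data.itertuples(index=False)}
```

Each pivot table in grids can then be handed to the sns.heatmap call from the question inside a loop or a small plotting function, using the dictionary key as the plot title.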

Plot normal distribution in python, matplotlib, multi-arm bandit

How can I plot multiple normal distribution-like images in matplotlib, like the one above?
This kind of plot is called a "violin" plot:
import numpy as np
from matplotlib import pyplot as plt
# Generate 5 actions with different reward distributions
data = np.random.randn(1000, 5) * [1, 0.5, 1.5, 1, 0.5] + [0, 1, -1, 2, -0.5]
plt.violinplot(data)
plt.xlabel('Action')
plt.ylabel('Reward distribution')

Arguments for LogLocator in MatPlotLib

In MatPlotLib, I want to plot a graph with a linear x-axis and a logarithmic y-axis. For the x-axis, there should be labels at multiples of 4, and minor ticks at multiples of 1. I have been able to do this using the MultipleLocator class.
However, I am having difficulty doing a similar thing for the logarithmic y-axis. I want there to be labels at 0.1, 0.2, 0.3 etc., and minor ticks at 0.11, 0.12, 0.13 etc. I have tried doing this with the LogLocator class, but I'm not sure what the right parameters are.
Here is what I have tried so far:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, LogLocator
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [0.32, 0.30, 0.28, 0.26, 0.24, 0.22, 0.20, 0.18, 0.16, 0.14, 0.12, 0.10]
fig = plt.figure()
ax1 = fig.add_subplot(111)
x_major = MultipleLocator(4)
x_minor = MultipleLocator(1)
ax1.xaxis.set_major_locator(x_major)
ax1.xaxis.set_minor_locator(x_minor)
ax1.set_yscale("log")
y_major = LogLocator(base=10)
y_minor = LogLocator(base=10)
ax1.yaxis.set_major_locator(y_major)
ax1.yaxis.set_minor_locator(y_minor)
ax1.plot(x, y)
plt.show()
This shows the following plot:
The x-axis is as I want it, but not the y-axis. There is a label on the y-axis at 0.1, but no labels at 0.2 and 0.3. Also, there are no ticks at 0.11, 0.12, 0.13 etc.
I have tried some different values for the LogLocator constructor, such as subs, numdecs, and numticks, but I cannot get the right plot. The documentation at https://matplotlib.org/api/ticker_api.html#matplotlib.ticker.LogLocator doesn't really explain these parameters very well.
What parameter values should I be using?
I think you still want MultipleLocator rather than LogLocator because your desired tick location is still "on every integer that is multiple of base in the view interval" rather than "subs[j] * base**i". For example:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [0.32, 0.30, 0.28, 0.26, 0.24, 0.22, 0.20, 0.18, 0.16, 0.14, 0.12, 0.10]
fig = plt.figure(figsize=(8, 12))
ax1 = fig.add_subplot(111)
x_major = MultipleLocator(4)
x_minor = MultipleLocator(1)
ax1.xaxis.set_major_locator(x_major)
ax1.xaxis.set_minor_locator(x_minor)
ax1.set_yscale("log")
# You would need to erase default major ticklabels
ax1.set_yticklabels(['']*len(ax1.get_yticklabels()))
y_major = MultipleLocator(0.1)
y_minor = MultipleLocator(0.01)
ax1.yaxis.set_major_locator(y_major)
ax1.yaxis.set_minor_locator(y_minor)
ax1.plot(x, y)
plt.show()
By default, LogLocator puts major ticks at every base**i, so on its own it does not give the major tick labels you want. You can use the subs parameter for the minor ticks like this:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, LogLocator
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [0.32, 0.30, 0.28, 0.26, 0.24, 0.22, 0.20, 0.18, 0.16, 0.14, 0.12, 0.10]
fig = plt.figure()
ax1 = fig.add_subplot(111)
x_major = MultipleLocator(4)
x_minor = MultipleLocator(1)
ax1.xaxis.set_major_locator(x_major)
ax1.xaxis.set_minor_locator(x_minor)
ax1.set_yscale("log")
y_major = LogLocator(base=10)
y_minor = LogLocator(base=10, subs=[1.1, 1.2, 1.3])
ax1.yaxis.set_major_locator(y_major)
ax1.yaxis.set_minor_locator(y_minor)
ax1.plot(x, y)
plt.show()
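
For what it's worth, subs can also be passed to the major locator, since LogLocator places ticks at subs[j] * base**i. The sketch below puts major ticks at 1, 2, ..., 9 times each power of ten and minor ticks at every 0.1 step in between. The explicit FormatStrFormatter and NullFormatter are my own additions to keep the labelling predictable; they are not part of the answers above.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import (FormatStrFormatter, LogLocator,
                               MultipleLocator, NullFormatter)

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [0.32, 0.30, 0.28, 0.26, 0.24, 0.22, 0.20, 0.18, 0.16, 0.14, 0.12, 0.10]

fig, ax = plt.subplots(figsize=(6, 8))
ax.plot(x, y)
ax.set_yscale("log")

ax.xaxis.set_major_locator(MultipleLocator(4))
ax.xaxis.set_minor_locator(MultipleLocator(1))

# Multipliers applied to each power of ten
major_subs = np.arange(1, 10)           # 1, 2, ..., 9  -> 0.1, 0.2, 0.3, ...
minor_subs = np.arange(1.0, 10.0, 0.1)  # 1.0, 1.1, ... -> 0.10, 0.11, 0.12, ...

ax.yaxis.set_major_locator(LogLocator(base=10, subs=major_subs))
ax.yaxis.set_minor_locator(LogLocator(base=10, subs=minor_subs))

# Label major ticks as plain decimals and leave minor ticks unlabelled
ax.yaxis.set_major_formatter(FormatStrFormatter("%.1f"))
ax.yaxis.set_minor_formatter(NullFormatter())
```

This only stays readable while the data spans about one decade; over several decades the ticks become very dense near the top of each decade.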

Suggestions to plot overlapping lines in matplotlib?

Does anybody have a suggestion on what's the best way to present overlapping lines on a plot? I have a lot of them, and I had the idea of having full lines of different colors where they don't overlap, and having dashed lines where they do overlap so that all colors are visible and overlapping colors are seen.
But how do I do that?
I have the same issue on a plot with a high degree of discretization.
Here is the starting situation:
import matplotlib.pyplot as plt
grid=[x for x in range(10)]
graphs=[
[1,1,1,4,4,4,3,5,6,0],
[1,1,1,5,5,5,3,5,6,0],
[1,1,1,0,0,3,3,2,4,0],
[1,2,4,4,3,2,3,2,4,0],
[1,2,3,3,4,4,3,2,6,0],
[1,1,3,3,0,3,3,5,4,3],
]
for gg,graph in enumerate(graphs):
    plt.plot(grid,graph,label='g'+str(gg))
plt.legend(loc=3,bbox_to_anchor=(1,0))
plt.show()
No one can tell exactly where the green and blue lines run.
Here is my "solution":
import matplotlib.pyplot as plt
grid=[x for x in range(10)]
graphs=[
[1,1,1,4,4,4,3,5,6,0],
[1,1,1,5,5,5,3,5,6,0],
[1,1,1,0,0,3,3,2,4,0],
[1,2,4,4,3,2,3,2,4,0],
[1,2,3,3,4,4,3,2,6,0],
[1,1,3,3,0,3,3,5,4,3],
]
for gg,graph in enumerate(graphs):
    lw=10-8*gg/len(graphs)
    ls=['-','--','-.',':'][gg%4]
    plt.plot(grid,graph,label='g'+str(gg), linestyle=ls, linewidth=lw)
plt.legend(loc=3,bbox_to_anchor=(1,0))
plt.show()
I am grateful for suggestions on improvement!
Just decrease the opacity of the lines so that they are see-through. You can achieve that using the alpha parameter. Example:
plt.plot(x, y, alpha=0.7)
Here alpha ranges from 0 to 1, with 0 being fully transparent (invisible) and 1 fully opaque.
Imagine your pandas DataFrame is called respone_times; then you can use alpha to set a different opacity for your graphs. Check the picture before and after using alpha.
plt.figure(figsize=(15, 7))
plt.plot(respone_times,alpha=0.5)
plt.title('a sample title')
plt.grid(True)
plt.show()
Depending on your data and use case, it might be OK to add a bit of random jitter to artificially separate the lines.
from numpy.random import default_rng
import pandas as pd
rng = default_rng()
def jitter_df(df: pd.DataFrame, std_ratio: float) -> pd.DataFrame:
    """
    Add jitter to a DataFrame.

    Adds normal distributed jitter with mean 0 to each of the
    DataFrame's columns. The jitter's std is the column's std times
    `std_ratio`.

    Returns the jittered DataFrame.
    """
    std = df.std().values * std_ratio
    jitter = pd.DataFrame(
        std * rng.standard_normal(df.shape),
        index=df.index,
        columns=df.columns,
    )
    return df + jitter
Here's a plot of the original data from Markus Dutschke's example:
And here's the jittered version, with std_ratio set to 0.1:
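
Those two plots can be reproduced roughly as follows. This is a sketch that repeats jitter_df without the docstring so it runs on its own; the seed of 42 and the g0 to g5 column names are arbitrary choices, not part of the original answer.

```python
import matplotlib.pyplot as plt
import pandas as pd
from numpy.random import default_rng

rng = default_rng(42)  # arbitrary seed so the jitter is reproducible


def jitter_df(df, std_ratio):
    # Per-column noise scale: the column's std times std_ratio
    std = df.std().values * std_ratio
    jitter = pd.DataFrame(std * rng.standard_normal(df.shape),
                          index=df.index, columns=df.columns)
    return df + jitter


graphs = [[1, 1, 1, 4, 4, 4, 3, 5, 6, 0],
          [1, 1, 1, 5, 5, 5, 3, 5, 6, 0],
          [1, 1, 1, 0, 0, 3, 3, 2, 4, 0],
          [1, 2, 4, 4, 3, 2, 3, 2, 4, 0],
          [1, 2, 3, 3, 4, 4, 3, 2, 6, 0],
          [1, 1, 3, 3, 0, 3, 3, 5, 4, 3]]

# One column per line, one row per grid point
df = pd.DataFrame({f"g{i}": g for i, g in enumerate(graphs)})
jittered = jitter_df(df, std_ratio=0.1)

fig, (ax_orig, ax_jit) = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
df.plot(ax=ax_orig, title="original")
jittered.plot(ax=ax_jit, title="jittered (std_ratio=0.1)")
```

Keep in mind that the jittered values are no longer the true data, so this is best kept for exploratory plots.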
Replacing solid lines by dots or dashes works too
g = sns.FacetGrid(data, col='config', row='outputs', sharex=False)
g.map_dataframe(sns.lineplot, x='lag',y='correlation',hue='card', linestyle='dotted')
Instead of random jitter, the lines can be offset just a little bit, creating a layered appearance:
import matplotlib.pyplot as plt
from matplotlib.transforms import offset_copy
grid = list(range(10))
graphs = [[1, 1, 1, 4, 4, 4, 3, 5, 6, 0],
[1, 1, 1, 5, 5, 5, 3, 5, 6, 0],
[1, 1, 1, 0, 0, 3, 3, 2, 4, 0],
[1, 2, 4, 4, 3, 2, 3, 2, 4, 0],
[1, 2, 3, 3, 4, 4, 3, 2, 6, 0],
[1, 1, 3, 3, 0, 3, 3, 5, 4, 3]]
fig, ax = plt.subplots()
lw = 1
for gg, graph in enumerate(graphs):
    trans_offset = offset_copy(ax.transData, fig=fig, x=lw * gg, y=lw * gg, units='dots')
    ax.plot(grid, graph, lw=lw, transform=trans_offset, label='g' + str(gg))
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1.01))
# manually set the axes limits, because the transform doesn't set them automatically
ax.set_xlim(grid[0] - .5, grid[-1] + .5)
ax.set_ylim(min([min(g) for g in graphs]) - .5, max([max(g) for g in graphs]) + .5)
plt.tight_layout()
plt.show()
