Use rows in a dataframe as inputs for function and visualization - python

Please help me iterate over each row of the following .CSV file, loaded as a Pandas dataframe called df_data in my subsequent code:
aud,wllt_cnt,bcr
Group A,64700,0.15116928389684975
Group B,7654,0.02786525362332031
Group C,11183,0.01278621197465396
Group D,8025,0.00881753794562903
So far, I have written the following function, which hard-codes references to specific cells of df_data (see the variables total_group_size and crc):
import pandas as pd
import itertools
import numpy as np
import statsmodels.stats.api as sms
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
# Group A
# making a dataframe that captures the combinations of holdout / lift
def expand_grid(data_dict):
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())
# making a function for testing
def test(holdout,lift):
    cl = 0.9
    alpha = 1-cl
    total_group_size = df_data.iloc[0,1]
    wllt_cnt_reach = total_group_size*0.6
    crc = df_data.iloc[0,2]
    conversion_rate_control = crc
    conversion_rate_test = conversion_rate_control*(1+lift)
    es = sms.proportion_effectsize(conversion_rate_test, conversion_rate_control)
    n1 = wllt_cnt_reach*(1-holdout)
    n2 = wllt_cnt_reach*holdout
    return sms.NormalIndPower().solve_power(es, nobs1=n1, alpha=alpha, ratio=n2/n1, alternative='two-sided')
holdout=np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5])
lift=np.array([0.01, 0.02, 0.025, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.125, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4])
df = expand_grid({'holdout':holdout,'lift':lift})
Then applying the function row-wise:
df['power'] = df.apply(lambda x: test(x['holdout'], x['lift']), axis=1)
And plotting a heatmap (here I also refer to the df_data dataframe for the plot title; see plt_title):
plt_title = df_data.iloc[0,0]
x_axis_labels = [1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
y_axis_labels = [1, 2, 2.5, 3, 4, 5, 6, 7, 8, 9, 10, 12.5, 15, 20, 25, 30, 35, 40]
fig, ax = plt.subplots(figsize = (13,6))
ax = sns.heatmap(data=df.pivot(index='lift',columns='holdout',values='power'),
annot=True, fmt='.0%', cmap=sns.color_palette("RdYlGn", 200), xticklabels=x_axis_labels, yticklabels=y_axis_labels,
cbar_kws={'label': 'Power'})
fig.axes[0].invert_yaxis()
ax.set_title(plt_title)
plt.xlabel("Holdout")
plt.ylabel("% Lift")
b, t = plt.ylim()
b -= 0.5
t += 0.5
plt.ylim(b, t)
plt.show()
I need to optimize this code, so that I don't have to copy-paste the same code to apply it to each row of the df_data dataframe. My understanding is that I need to change the function test from above to take more inputs, then wrap the plotting into another function and show a heatmap for each of the groups (rows in my .CSV file), but I'm stuck at the first step in this action plan.
Please help me get this off the ground, and thank you in advance!
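One possible way to approach the first step, offered as a sketch under assumptions rather than a definitive answer: pass the group size and the baseline conversion rate into the function as arguments instead of reading them from fixed cells, then loop over the rows of df_data with itertuples. The names power_for and power_grid are hypothetical, the CSV is inlined so the sketch runs on its own, and the holdout and lift grids are shortened for brevity.

```python
import io
import itertools

import pandas as pd
import statsmodels.stats.api as sms

# The CSV from the question, inlined so the sketch is self-contained.
CSV_TEXT = """aud,wllt_cnt,bcr
Group A,64700,0.15116928389684975
Group B,7654,0.02786525362332031
Group C,11183,0.01278621197465396
Group D,8025,0.00881753794562903
"""
df_data = pd.read_csv(io.StringIO(CSV_TEXT))


def power_for(group_size, base_rate, holdout, lift, reach=0.6, cl=0.9):
    """Power for one group at one (holdout, lift) combination.

    Same calculation as test() in the question, but the group-specific
    values arrive as arguments (hypothetical signature).
    """
    reached = group_size * reach
    effect_size = sms.proportion_effectsize(base_rate * (1 + lift), base_rate)
    n1 = reached * (1 - holdout)
    n2 = reached * holdout
    return sms.NormalIndPower().solve_power(
        effect_size, nobs1=n1, alpha=1 - cl, ratio=n2 / n1,
        alternative="two-sided",
    )


def power_grid(group_size, base_rate, holdouts, lifts):
    """Dataframe of power values for every (holdout, lift) pair."""
    rows = [
        (h, l, power_for(group_size, base_rate, h, l))
        for h, l in itertools.product(holdouts, lifts)
    ]
    return pd.DataFrame(rows, columns=["holdout", "lift", "power"])


# Shortened grids, for illustration only.
holdouts = [0.05, 0.1, 0.5]
lifts = [0.05, 0.1, 0.2]

# One power grid per row of df_data, keyed by the group name.
grids = {
    row.aud: power_grid(row.wllt_cnt, row.bcr, holdouts, lifts)
    for row in df_data.itertuples(index=False)
}
```

Each value in grids can be pivoted and passed to the heatmap code from the question, with the dictionary key as the plot title, so the plotting can be wrapped in a second function that takes a title and a grid.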

Related

Number of columns does not match number of bins

I am trying to plot some 336 data points and am encountering an issue with my use of Python's plt.hist() function. I would like to use more than eight bins for my data, but when I do, a lot of whitespace is introduced. For example, here is a plot with bins = 8
and with bins = 24
Does anyone know why this is and how I can best represent my data with more bins? Many thanks, ~S.
Sample code:
tumbles = np.array(df['Tumbles'])
fig, axs = plt.subplots(1, 1,
tight_layout = True)
N, bins, patches = axs.hist(tumbles, bins = 24, edgecolor= "black")
axs.grid(True, color ='grey',
linestyle ='-.', linewidth = 0.5,
alpha = 0.6)
plt.xlabel("Time (s)", size = 14)
plt.ylabel("Frequency", size = 14)
plt.title('Histogram of Times', size = 18)
plt.show()
I suspect your data is distributed in such a way that the empty spaces between bars are simply bars with height 0 (bins with no samples). In that case, you just don't need more bins.
Please include your code
With this setup I get the same problem:
import matplotlib.pyplot as plt
plt.hist([1, 2, 2, 3, 4, 5, 5, 5, 5, 5, 6, 7, 9], bins=20)
plt.show()
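The gaps in this reproduction can be checked numerically. The sketch below counts the samples per bin with np.histogram, on the assumption that it bins the data the same way plt.hist does; the variable names are illustrative.

```python
import numpy as np

# Same data as the reproduction above.
data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 5, 6, 7, 9]

# Count the samples per bin for a coarse and a fine binning.
counts_8, edges_8 = np.histogram(data, bins=8)
counts_20, edges_20 = np.histogram(data, bins=20)

print("counts with 8 bins: ", counts_8)
print("counts with 20 bins:", counts_20)
print("empty bins with 8 bins: ", int((counts_8 == 0).sum()))
print("empty bins with 20 bins:", int((counts_20 == 0).sum()))
```

With integer-valued data like this, bins narrower than the spacing between values necessarily stay empty, and those empty bins show up as whitespace in the plot.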
It would be a bit more effort, but if you want more control over the number of bins and the range of each bin, you can pass the bins parameter of your histogram call as a list of bin edges. This was alluded to above, but here is a snippet of code illustrating that.
import matplotlib.pyplot as plt
data = [0.02, 0.02, 0.02, 0.27, 0.27, 0.03, 0.03, 0.04, 0.044, 0.044, 0.05, 0.05, 0.06, 0.07, 0.08, 0.08, 0.08, 0.10, 0.10, 0.11, 0.12, 0.13, 0.13, 0.14, 0.15, 0.17, 0.18, 0.19, 0.20, 0.20, 0.22, 0.23, 0.23, 0.23, 0.23, 0.24, 0.26, 0.26, 0.28, 0.29, 0.30, 0.32]
fig, ax = plt.subplots()
N, bins, values = ax.hist(data, [0.000,0.015,0.030,0.045,0.060,0.075,0.090,0.105,0.120,0.135,0.150,0.165,0.180,0.195,0.210,0.225,0.240,0.255,0.270,0.285,0.300,0.315,0.330,0.345], linewidth=1)
plt.bar_label(values)
plt.xlabel("Time (s)", size = 14)
plt.ylabel("Frequency", size = 14)
plt.title('Histogram of Times', size = 18)
plt.show()
The data is just a small subset to establish some data points to produce a histogram. Following was the histogram created in this fashion.
You might give that a try adjusting for the range each bin should have.
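Typing the bin edges by hand is error-prone, so it may be easier to generate them. A minimal sketch, assuming evenly spaced edges are wanted; the data here is only a handful of the values from the snippet above.

```python
import numpy as np

# 24 edges from 0.000 to 0.345 in steps of 0.015, i.e. 23 bins,
# matching the hand-typed list in the snippet above.
edges = np.linspace(0.0, 0.345, 24)

# A few of the sample values, for illustration only.
data = [0.02, 0.02, 0.044, 0.05, 0.11, 0.13, 0.17, 0.23, 0.26, 0.32]

# Count the samples that fall between each pair of edges.
counts, _ = np.histogram(data, bins=edges)
print(counts)
```

The edges array can be passed as the bins argument of ax.hist in place of the literal list.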

Defining an array of specific slider values in matplotlib instead of using valmin and valmax

Looking at the slider_demo.py, the author has the following code:
axcolor = 'lightgoldenrodyellow'
axfreq = plt.axes([0.25, 0.1, 0.65, 0.03], facecolor=axcolor)
axamp = plt.axes([0.25, 0.15, 0.65, 0.03], facecolor=axcolor)
sfreq = Slider(axfreq, 'Freq', 0.1, 30.0, valinit=f0, valstep=delta_f)
samp = Slider(axamp, 'Amp', 0.1, 10.0, valinit=a0)
where, in sfreq, for example, the values 0.1 and 30.0 are valmin and valmax, respectively, and the slider values are incremented by delta_f.
I would like to, instead, define an array of values
valarray = [0.1, 1, 3, 5, 15, 20, 27, 30]
where these are the only values that can be selected, displayed, or reached as one moves the slider. This does not appear to be an explicit option for this widget.
It would also be helpful, but not necessary, that I am able to update this list dynamically.
Thanks in advance for your help and assistance.
You can set the major ticks of the Slider with this line:
axfreq.set_xticks(np.array([0.1, 1, 3, 5, 15, 20, 27, 30]), minor = False)
Optionally, you can also set minor ticks by passing minor = True.
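Setting ticks only marks positions on the slider axes; it does not restrict the values the slider can take. As far as I know, Matplotlib 3.4 and later also accept an array for valstep, in which case the slider snaps to the values in that array. The sketch below assumes such a version; the callback name on_change is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

# The only values the slider should be able to take.
valarray = np.array([0.1, 1, 3, 5, 15, 20, 27, 30])

fig, ax = plt.subplots()
fig.subplots_adjust(bottom=0.25)
axfreq = fig.add_axes([0.25, 0.1, 0.65, 0.03])

# Assumption: Matplotlib >= 3.4, where valstep may be an array of
# allowed values instead of a single step size.
sfreq = Slider(
    axfreq, "Freq",
    valmin=valarray.min(), valmax=valarray.max(),
    valinit=5, valstep=valarray,
)


def on_change(val):
    # Called with the snapped value whenever the slider moves.
    print(f"selected frequency: {val}")


sfreq.on_changed(on_change)
```

Call plt.show() to interact with the slider. This sketch does not cover updating the list of allowed values dynamically.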

Arguments for LogLocator in MatPlotLib

In MatPlotLib, I want to plot a graph with a linear x-axis and a logarithmic y-axis. For the x-axis, there should be labels at multiples of 4, and minor ticks at multiples of 1. I have been able to do this using the MultipleLocator class.
However, I am having difficulty doing a similar thing for the logarithmic y-axis. I want there to be labels at 0.1, 0.2, 0.3 etc., and minor ticks at 0.11, 0.12, 0.13 etc. I have tried doing this with the LogLocator class, but I'm not sure what the right parameters are.
Here is what I have tried so far:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, LogLocator
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [0.32, 0.30, 0.28, 0.26, 0.24, 0.22, 0.20, 0.18, 0.16, 0.14, 0.12, 0.10]
fig = plt.figure()
ax1 = fig.add_subplot(111)
x_major = MultipleLocator(4)
x_minor = MultipleLocator(1)
ax1.xaxis.set_major_locator(x_major)
ax1.xaxis.set_minor_locator(x_minor)
ax1.set_yscale("log")
y_major = LogLocator(base=10)
y_minor = LogLocator(base=10)
ax1.yaxis.set_major_locator(y_major)
ax1.yaxis.set_minor_locator(y_minor)
ax1.plot(x, y)
plt.show()
This shows the following plot:
The x-axis is as I want it, but not the y-axis. There is a label on the y-axis at 0.1, but no labels at 0.2 and 0.3. Also, there are no ticks at 0.11, 0.12, 0.13 etc.
I have tried some different values for the LogLocator constructor, such as subs, numdecs, and numticks, but I cannot get the right plot. The documentation at https://matplotlib.org/api/ticker_api.html#matplotlib.ticker.LogLocator doesn't really explain these parameters very well.
What parameter values should I be using?
I think you still want MultipleLocator rather than LogLocator because your desired tick location is still "on every integer that is multiple of base in the view interval" rather than "subs[j] * base**i". For example:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [0.32, 0.30, 0.28, 0.26, 0.24, 0.22, 0.20, 0.18, 0.16, 0.14, 0.12, 0.10]
fig = plt.figure(figsize=(8, 12))
ax1 = fig.add_subplot(111)
x_major = MultipleLocator(4)
x_minor = MultipleLocator(1)
ax1.xaxis.set_major_locator(x_major)
ax1.xaxis.set_minor_locator(x_minor)
ax1.set_yscale("log")
# You would need to erase default major ticklabels
ax1.set_yticklabels(['']*len(ax1.get_yticklabels()))
y_major = MultipleLocator(0.1)
y_minor = MultipleLocator(0.01)
ax1.yaxis.set_major_locator(y_major)
ax1.yaxis.set_minor_locator(y_minor)
ax1.plot(x, y)
plt.show()
LogLocator always puts major tick labels at "every base**i". Therefore, it is impossible to use it for your desired major tick labels. You can use the subs parameter for your minor tick labels like this:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, LogLocator
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [0.32, 0.30, 0.28, 0.26, 0.24, 0.22, 0.20, 0.18, 0.16, 0.14, 0.12, 0.10]
fig = plt.figure()
ax1 = fig.add_subplot(111)
x_major = MultipleLocator(4)
x_minor = MultipleLocator(1)
ax1.xaxis.set_major_locator(x_major)
ax1.xaxis.set_minor_locator(x_minor)
ax1.set_yscale("log")
y_major = LogLocator(base=10)
y_minor = LogLocator(base=10, subs=[1.1, 1.2, 1.3])
ax1.yaxis.set_major_locator(y_major)
ax1.yaxis.set_minor_locator(y_minor)
ax1.plot(x, y)
plt.show()
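To see what the subs parameter does without drawing anything, the locator can be asked directly for its tick positions. A small sketch, assuming the locator is used on its own (not attached to an axis) and queried over the data range from the question:

```python
import numpy as np
from matplotlib.ticker import LogLocator

# Same minor locator as in the snippet above.
minor = LogLocator(base=10, subs=[1.1, 1.2, 1.3])

# Tick positions follow subs[j] * base**i, so the decade starting at 0.1
# contributes 0.11, 0.12 and 0.13; neighbouring decades are included too.
ticks = minor.tick_values(0.1, 0.32)
print(ticks)
```

The returned array extends beyond the requested range; the axis only draws the ticks that fall inside its view limits.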

Matplotlib scatter(): Polynomial regression line [duplicate]

This question already has answers here:
Simplest way to make a polynomial regression with sklearn?
(2 answers)
polynomial regression using python
(3 answers)
Closed 4 years ago.
Is it possible to do a polynomial regression line on a scatter() in matplotlib?
This is my graph:
https://imgur.com/a/Xh1BO
alg_n = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4...]
orig_hc_runtime = [0.01, 0.02, 0.03, 0.04, 0.04, 0.04, 0.05, 0.09...]
plt.scatter(alg_n, orig_hc_runtime, label="Orig HC", color="b", s=4)
plt.scatter(alg_n, mod_hc_runtime, label="Mod HC", color="c", s=4)
...
x_values = [x for x in range(5, n_init+2, 2)]
y_values = [y for y in range(0, 10, 2)]
plt.xlabel("Number of Queens")
plt.ylabel("Time (sec)")
plt.title("Algorithm Performance: Time")
plt.xticks(x_values)
plt.yticks(y_values)
plt.grid(linewidth="1", color="white")
plt.legend()
plt.show()
Is it possible to have regression lines for each data set? If so, can you please explain how I can do it?
I am not sure whether it can be done using matplotlib alone, but you can always compute the regression separately and plot it. Here is an example that uses scikit-learn to compute the regression line.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
x = [1, 2, 3, 4, 5, 8, 10]
y = [1.1, 3.8, 8.5, 16, 24, 65, 99.2]
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(np.array(x).reshape(-1, 1), y)
x_reg = np.arange(11)
y_reg = model.predict(x_reg.reshape(-1, 1))
plt.scatter(x, y)
plt.plot(x_reg, y_reg)
plt.show()
Output :
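If scikit-learn is not available, the same kind of fit can be sketched with NumPy alone. This uses np.polyfit in place of the pipeline above (a swapped-in technique, not the answer's method), on the same example data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 8, 10])
y = np.array([1.1, 3.8, 8.5, 16, 24, 65, 99.2])

# Least-squares fit of a degree-2 polynomial; coefficients are
# returned with the highest power first.
coeffs = np.polyfit(x, y, deg=2)
poly = np.poly1d(coeffs)

# Evaluate the fitted curve on a dense grid for plotting.
x_reg = np.linspace(x.min(), x.max(), 100)
y_reg = poly(x_reg)
print(coeffs)
```

The curve can then be drawn over the scatter plot with plt.scatter(x, y) followed by plt.plot(x_reg, y_reg).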
I would advise you to use the Seaborn library. It is built on top of matplotlib and has many statistical plotting routines. Have a look at the examples for regplot and lmplot: http://seaborn.pydata.org/tutorial/regression.html#functions-to-draw-linear-regression-models
In your case, you could do something like:
import pandas as pd
import seaborn as sns
df = pd.DataFrame.from_dict({"Number of Queens": [1, 1, 1, 2, 2, 2, 3,
3, 3, 4, 4, 4],
"Time (sec)": [0.01, 0.02, 0.03, 0.04, 0.04, 0.04,
0.05, 0.09, 0.12, 0.14, 0.15, 0.16]})
sns.lmplot(x='Number of Queens', y='Time (sec)', data=df, order=1)
If you want regression lines for different groups, add a column with the group labels and pass it to the hue parameter of lmplot.

Three variables as heatmap

I want to plot my data as a heatmap which has the following structure:
X = [1,1,1,1,1,1,1,1,1,1]
Y = [1,2,3,4,5,6,7,8,9,10]
Z = [0.2, 0.33, 0.1, 0.25, 0.0, 0.9, 0.75, 0.88, 0.44, 0.95]
The x and y-axis shall be represented by X and Y, while the 'heat' is represented by the values of Z.
E.g. at coordinate (x,y) = (1,2) the intensity shall be 0.33
How can this be achieved by using matplotlib?
Looking at posts which relate to the keyword heatmap or even to those related to the term contour map, I could not transfer it to this problem yet.
Thank you in advance for any hints
Dan
I hope your data is just an example because it will look funny (it's more a sequence of strips; the x-dimension is constant).
I would recommend the usage of pandas (general data-analysis) and seaborn (matplotlib-extensions) which makes it a bit nicer.
Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
X = [1,1,1,1,1,1,1,1,1,1]
Y = [1,2,3,4,5,6,7,8,9,10]
Z = [0.2, 0.33, 0.1, 0.25, 0.0, 0.9, 0.75, 0.88, 0.44, 0.95]
data = pd.DataFrame({'X': X, 'Y': Y, 'Z': Z})
data_pivoted = data.pivot(index="X", columns="Y", values="Z")
ax = sns.heatmap(data_pivoted)
plt.show()
Output
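Since the question asks specifically about matplotlib, here is a sketch that avoids pandas and seaborn. It assumes each (x, y) pair occurs at most once and places the Z values on a grid by hand; the variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

X = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Z = [0.2, 0.33, 0.1, 0.25, 0.0, 0.9, 0.75, 0.88, 0.44, 0.95]

xs = sorted(set(X))
ys = sorted(set(Y))

# Rows follow Y and columns follow X; cells without data stay NaN.
grid = np.full((len(ys), len(xs)), np.nan)
for x, y, z in zip(X, Y, Z):
    grid[ys.index(y), xs.index(x)] = z

fig, ax = plt.subplots()
image = ax.imshow(grid, origin="lower", aspect="auto")

# Label the cells with the original coordinate values.
ax.set_xticks(list(range(len(xs))))
ax.set_xticklabels(xs)
ax.set_yticks(list(range(len(ys))))
ax.set_yticklabels(ys)
ax.set_xlabel("X")
ax.set_ylabel("Y")
fig.colorbar(image, ax=ax, label="Z")
```

Call plt.show() to display the figure. With the sample data the x-dimension is constant, so the result is a single column of coloured cells.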
