Unable to plot distribution of a column containing binary values using Python

I'm trying to plot the original data, before handling the imbalance, in a way that shows the class distribution and class imbalance (the class is failure = 0/1). I might need to apply some transformation to the data to be able to visualize it.
Here's what the column looks like:
| failure |
|---------|
| 1 |
| 0 |
| 0 |
| 1 |
| 0 |
Here's what I have tried so far:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def distribution_scatter(x, symmetric=True, cmap=None, size=None):
    # jitter each point vertically, scaled by the estimated density at x
    pdf = gaussian_kde(x)
    w = np.random.rand(len(x))
    if symmetric:
        w = w * 2 - 1
    pseudo_y = pdf(x) * w
    if cmap:
        plt.scatter(x, pseudo_y, c=x, cmap=cmap, s=size)
    else:
        plt.scatter(x, pseudo_y, s=size)
    return pseudo_y
Results:
The problem with the results: I want to plot the distribution of the 0's and 1's, for which I believe I need to transform the data in some way.
Desired output:

If you want a KDE plot, you can check kdeplot from seaborn:
import numpy as np
import seaborn as sns

x = np.random.binomial(1, 0.2, 100)
sns.kdeplot(x)
Output:
Update: Or a swarmplot if you want a scatter:
x = np.random.binomial(1, 0.2, 25)
sns.swarmplot(x=x)
Output:
Update 2: In fact, your function seems to also produce a reasonable visualization:
distribution_scatter(np.random.binomial(1, 0.2, 100))
Output:
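Alternatively, since the column is binary, a plain count of each class may show the imbalance most directly. A minimal sketch (countplot is my suggestion here, assuming seaborn and matplotlib are available):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = np.random.binomial(1, 0.2, 100)  # stand-in for the 'failure' column
sns.countplot(x=x)  # one bar per class, making the imbalance directly visible
plt.xlabel('failure')
plt.show()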


Calculate gap between two datasets (pandas, matplotlib, fill_between already used)

I'd like to ask for suggestions on how to calculate the length of the gap between two datasets plotted in matplotlib from a pandas dataframe. Ideally, I would like to have these gap values written on the plot and, if possible, also included in the dataframe.
Here is my simplified example of dataframe:
import pandas as pd

d = {'Mean-1': [0.195842, 0.295069, 0.321345, 0.773725],
     'SEM-1':  [0.001216, 0.002687, 0.005267, 0.029974],
     'Mean-2': [0.143103, 0.250505, 0.305767, 0.960804],
     'SEM-2':  [0.000959, 0.001368, 0.003722, 0.150025],
     'Atom Number': [1, 3, 5, 7]}
df = pd.DataFrame(d)
df
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number
0 0.195842 0.001216 0.143103 0.000959 1
1 0.295069 0.002687 0.250505 0.001368 3
2 0.321345 0.005267 0.305767 0.003722 5
3 0.773725 0.029974 0.960804 0.150025 7
Then I made a plot with two lines representing Mean-1 and Mean-2, and a shaded area around each line representing the standard error of the mean, for the selected atom numbers.
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'])
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-2']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
plt.xticks(x)
What I would like to do next is to calculate the gap for each residue. The gap is the white space only, i.e. the space where neither the lines nor the shaded SEM areas overlap.
I would also like to know whether I can print the gap values on the plot and save them into a column. Thank you for any suggestions.
It's not a compact solution, but you could try something like this (check the ordering of the curves). First calculate all the positions (the upper and lower limits of each band):
import numpy as np
df['y1_upper'] = y_1+error_1
df['y1_lower'] = y_1-error_1
df['y2_upper'] = y_2+error_2
df['y2_lower'] = y_2-error_2
which gives
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number y1_upper y1_lower \
0 0.195842 0.001216 0.143103 0.000959 1 0.197058 0.194626
1 0.295069 0.002687 0.250505 0.001368 3 0.297756 0.292382
2 0.321345 0.005267 0.305767 0.003722 5 0.326612 0.316078
3 0.773725 0.029974 0.960804 0.150025 7 0.803699 0.743751
y2_upper y2_lower
0 0.144062 0.142144
1 0.251873 0.249137
2 0.309489 0.302045
3 1.110829 0.810779
The distance (gap) is calculated differently depending on whether y_1 lies above y_2 or vice versa, so use conditions on the upper and lower limits and take the element-wise difference between the facing band edges:
conditions = [
    (df['y1_lower'] >= df['y2_upper']),
    (df['y1_lower'] < df['y2_upper'])]
# per-row distance between the facing edges of the two bands
choices = [df['y1_lower'] - df['y2_upper'], df['y2_lower'] - df['y1_upper']]
df['dist'] = np.select(conditions, choices)
This gives
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number y1_upper y1_lower \
0 0.195842 0.001216 0.143103 0.000959 1 0.197058 0.194626
1 0.295069 0.002687 0.250505 0.001368 3 0.297756 0.292382
2 0.321345 0.005267 0.305767 0.003722 5 0.326612 0.316078
3 0.773725 0.029974 0.960804 0.150025 7 0.803699 0.743751
y2_upper y2_lower dist
0 0.144062 0.142144 0.050564
1 0.251873 0.249137 0.040509
2 0.309489 0.302045 0.006589
3 1.110829 0.810779 0.007080
As I said, check the order, but this is a possible solution.
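One caveat: when the shaded bands overlap, the second choice above goes negative. If you want overlapping regions to count as zero gap (an assumption about the intended behaviour), you could clip the result:
df['dist'] = df['dist'].clip(lower=0)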
IIUC, do you want something like this:
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'], figsize=(15,8))
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-2']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
ax.fill_between(df['Atom Number'], y_1+error_1, y_2-error_2, alpha=.2, edgecolor='k', facecolor='blue')
for i in range(len(x)):
    gap = y_1[i] + error_1[i] - y_2[i] - error_2[i]
    ylabel = min(y_1[i], y_2[i]) + abs(gap) / 2
    _ = ax.annotate(f'{gap:0.4f}', xy=(x[i], ylabel),
                    xytext=(x[i] - .14, y_1[i] + gap/abs(gap)*.2),
                    arrowprops=dict(arrowstyle="-"))
plt.xticks(x);
Output:

PySpark: how do I apply a function to get the slope in a dataframe?

I have a Spark dataframe like this (x and y columns, each with 6 data points). I want to extract the slope by fitting a simple regression line (basically to see the trend of y: increasing or decreasing).
+------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|x |y |
+------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|[1, 2, 3, 4, 5, 6]|[3.42102562348192E-6, 4.2159323917750995E-6, 3.924587540944015E-6, 4.167182871752131E-6, 4.109192066532302E-6, 4.297804458327455E-6] |
|[1, 2, 3, 4, 5, 6]|[1.384402399630826E-5, 9.913141993957704E-6, 1.1145077060247102E-5, 1.1005472165326649E-5, 1.1004462921073546E-5, 1.1004462921073546E-5]|
+------------------+----------------------------------------------------------------------------------------------------------------------------------------+
I want to use the function below, but I am getting errors when applying it to the dataframe. What is the correct way to apply this function, which takes in two arrays?
import numpy as np

def trendline(x, y, order=1):
    # fit a degree-1 polynomial and return its slope coefficient
    coeffs = np.polyfit(x, y, order)
    slope = coeffs[-2]
    return float(slope)
#example to run
x_example=[1,2,3,4,5,6]
y_example=[3.42102562348192,4.2159323917750995,3.924587540944015,4.167182871752131,4.109192066532302,4.297804458327455 ]
slope=trendline(x_example,y_example)
print(slope)
This function works on its own, as in the example above, but I have trouble applying it to the dataframe. I want to create a new column in the dataframe with the returned slope.
Thank you!!
I tried this, and it didn't work:
def get_slope_func(x, y, order=1):
    coeffs = np.polyfit(x, y, order)
    slope = coeffs[-2]
    return float(slope)

get_slope = pandas_udf(get_slope_func, returnType=LongType())
df.select(get_slope(col("x"), col("y"))).show()
You are returning a float from get_slope_func, but when registering your UDF you set the return type to LongType(), which is a BigInt in SQL. Set your returnType to DoubleType():
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
import numpy as np

def get_slope_func(x, y, order=1):
    coeffs = np.polyfit(x, y, order)
    slope = coeffs[-2]
    return float(slope)

get_slope = F.udf(get_slope_func, returnType=DoubleType())
df.select(get_slope(F.col("x"), F.col("y")).alias("slope")).show(truncate=False)
#+----------------------+
#|slope |
#+----------------------+
#|1.2303624369449727E-7 |
#|-3.1609849970704353E-7|
#+----------------------+
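If you'd rather stick with pandas_udf (as in your attempt) and add the result as a new column, here is a sketch that should work on Spark 3.x with the type-hinted pandas UDF API (each element of the x and y series is a whole array, so the fit runs row by row):
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def get_slope(x: pd.Series, y: pd.Series) -> pd.Series:
    # each row carries two arrays; fit one degree-1 polynomial per row
    return pd.Series([float(np.polyfit(xi, yi, 1)[-2]) for xi, yi in zip(x, y)])

df = df.withColumn("slope", get_slope(F.col("x"), F.col("y")))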

How to isolate data that deviate by 2 and 3 sigma from the mean, and then mark them in a plot in Python?

I am reading from a dataset which looks like the following when plotted in matplotlib, with a best-fit curve obtained using linear regression.
A sample of the data looks like the following:
# ID X Y px py pz M R
1.04826492772e-05 1.04828050287e-05 1.048233088e-05 0.000107002791008 0.000106552433081 0.000108704469007 387.02 4.81947797625e+13
1.87380963036e-05 1.87370588085e-05 1.87372620448e-05 0.000121616280029 0.000151924707761 0.00012371156585 428.77 6.54636174067e+13
3.95579877816e-05 3.95603773653e-05 3.95610756809e-05 0.000163470663023 0.000265203868883 0.000228031803626 470.74 8.66961875758e+13
My code looks like the following:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Regression Function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p
# plotting z
xz, yz = M, Y_z # data, non-transformed
y_pred, _ = regress(xz, np.log(yz)) # change here # transformed input
plt.semilogy(xz, yz, marker='o',color ='b', markersize=4,linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label = 'best fit') # transformed output
However, I can see a lot of upward scatter in the data, and the best-fit curve is affected by those points. So first I want to isolate the data points which are 2 and 3 sigma away from my mean data, and mark them with a circle around them. Then I want to take the best-fit curve considering only the points which fall within 1 sigma of my mean data.
Is there a good function in Python which can do that for me?
In addition, can I also isolate such rows from my actual dataset? For example, if the third row in the sample input deviates by 2 sigma, can I get that row as output too, to save and investigate further?
Your help is most appreciated.
Here's some code that goes through the data in a given number of windows, calculates statistics in each window, and separates the data into well-behaved and misbehaved lists.
Hope this helps.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

num_data = 10000
fake_data_x = np.sort(12.8 + np.random.random(num_data))
fake_data_y = np.exp(fake_data_x) + np.random.normal(0, scale=50000, size=num_data)

# Regression function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p

# plotting z
xz, yz = fake_data_x, fake_data_y  # data, non-transformed
y_pred, _ = regress(xz, np.log(yz))  # transformed input
plt.figure()
plt.semilogy(xz, yz, marker='o', color='b', markersize=4, linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label='best fit')  # transformed output
plt.show()
num_bin_intervals = 10  # approx number of averaging windows
window_boundaries = np.linspace(min(fake_data_x), max(fake_data_x), int(len(fake_data_x)/num_bin_intervals))  # window boundaries
y_good = []  # list to collect the "well-behaved" y-axis data
x_good = []  # list to collect the "well-behaved" x-axis data
y_outlier = []
x_outlier = []
for i in range(len(window_boundaries)-1):
    # create a boolean mask to select the data within the averaging window
    window_indices = (fake_data_x <= window_boundaries[i+1]) & (fake_data_x > window_boundaries[i])
    # separate the pieces of data in the window
    fake_data_x_slice = fake_data_x[window_indices]
    fake_data_y_slice = fake_data_y[window_indices]
    # calculate the mean and standard deviation of y in the window
    y_mean = np.mean(fake_data_y_slice)
    y_std = np.std(fake_data_y_slice)
    # choose and select the outliers
    y_outliers = fake_data_y_slice[np.abs(fake_data_y_slice - y_mean) >= 2*y_std]
    x_outliers = fake_data_x_slice[np.abs(fake_data_y_slice - y_mean) >= 2*y_std]
    # choose and select the good ones
    y_goodies = fake_data_y_slice[np.abs(fake_data_y_slice - y_mean) < 2*y_std]
    x_goodies = fake_data_x_slice[np.abs(fake_data_y_slice - y_mean) < 2*y_std]
    # extend the lists with all the good and the bad
    y_good.extend(list(y_goodies))
    y_outlier.extend(list(y_outliers))
    x_good.extend(list(x_goodies))
    x_outlier.extend(list(x_outliers))

plt.figure()
plt.semilogy(x_good, y_good, 'o')
plt.semilogy(x_outlier, y_outlier, 'r*')
plt.show()
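For the second part of the question (refitting using only the points that pass the cut), you can feed the filtered lists back through the regress() helper above. A possible follow-up, with the caveat that the loop above uses a 2-sigma cut (swap 2*y_std for 1*y_std there if you want the stricter 1-sigma selection):
x_good = np.array(x_good)
y_good = np.array(y_good)
y_pred_good, _ = regress(x_good, np.log(y_good))  # refit on the well-behaved points only
plt.figure()
plt.semilogy(x_good, y_good, 'o', label='within cut')
plt.semilogy(x_outlier, y_outlier, 'r*', label='outliers')
plt.semilogy(x_good, np.exp(y_pred_good), 'k', label='refit')
plt.legend()
plt.show()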

Python: Predict the y value using Statsmodels - Linear Regression

I am using the statsmodels library in Python to predict the future balance using linear regression. The CSV file is displayed below:
Year | Balance
3 | 30
8 | 57
9 | 64
13 | 72
3 | 36
6 | 43
11 | 59
21 | 90
1 | 20
16 | 83
It contains 'Year' as the independent 'x' variable, while 'Balance' is the dependent 'y' variable.
Here's the code for Linear Regression for this data:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np
from matplotlib import pyplot as plt
import os
os.chdir(r'C:\Users\Admin\Desktop\csv')  # raw string so the backslashes are not treated as escapes
cw = pd.read_csv('data-table.csv')
y=cw.Balance
X=cw.Year
X = sm.add_constant(X) # Adds a constant term to the predictor
est = sm.OLS(y, X)
est = est.fit()
print(est.summary())
est.params
X_prime = np.linspace(X.Year.min(), X.Year.max(), 100)[:, np.newaxis]
X_prime = sm.add_constant(X_prime) # add constant as we did before
y_hat = est.predict(X_prime)
plt.scatter(X.Year, y, alpha=0.3) # Plot the raw data
plt.xlabel("Year")
plt.ylabel("Total Balance")
plt.plot(X_prime[:, 1], y_hat, 'r', alpha=0.9) # Add the regression line, colored in red
plt.show()
The question is: how do I predict the 'Balance' value using statsmodels when 'Year' = 10?
You can use the predict method of the result object est, but in order to use it successfully you have to fit the model with the formula interface:
from statsmodels.formula.api import ols

est = ols("y ~ x", data=data).fit()
est.predict(exog=new_values)
where new_values is a dictionary.
Check out this link.
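Applied to the question's dataframe cw, that might look like this (a sketch using the column names from the question; the exog dictionary maps predictor names to the values to predict at):
from statsmodels.formula.api import ols

est = ols("Balance ~ Year", data=cw).fit()
pred = est.predict(exog={"Year": [10]})  # predicted Balance at Year = 10
print(pred)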

Grouped linear regression in Spark

I'm working in PySpark, and I'd like to find a way to perform linear regressions on groups of data. Specifically given this dataframe
import pandas as pd
pdf = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                    'x': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
                    'y': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})
df = sqlContext.createDataFrame(pdf)
df.show()
# +--------+-+---+
# |group_id|x| y|
# +--------+-+---+
# | 1|0|2.0|
# | 1|1|1.0|
# | 1|2|0.0|
# | 2|0|0.0|
# | 2|1|0.5|
# | 2|5|2.5|
# | 3|2|3.0|
# | 3|3|4.0|
# | 3|4|5.0|
# | 3|5|6.0|
# +--------+-+---+
I'd now like to be able to fit a separate y ~ ax + b model for each group_id and output a new dataframe with columns a and b and a row for each group.
For instance for group 1 I could do:
from sklearn import linear_model
# Regression on group_id = 1
data = df.where(df.group_id == 1).toPandas()
regr = linear_model.LinearRegression()
regr.fit(data.x.values.reshape(len(data), 1), data.y.values.reshape(len(data), 1))
a = regr.coef_[0][0]
b = regr.intercept_[0]
print('For group 1, y = {0}*x + {1}'.format(a, b))
# Repeat for group_id=2, group_id=3
But to do this for each group involves bringing the data back to the driver one by one, which doesn't take advantage of any Spark parallelism.
Here's a solution I found. Instead of performing separate regressions on each group of data, create one sparse matrix with separate columns for each group:
from pyspark.mllib.regression import LabeledPoint, SparseVector
# Label points for regression
def groupid_to_feature(group_id, x, num_groups):
    intercept_id = num_groups + group_id - 1
    # Need a vector containing x and a '1' for the intercept term
    return SparseVector(num_groups*2, {group_id-1: x, intercept_id: 1.0})

labelled = df.rdd.map(lambda line: LabeledPoint(line[2],
                      groupid_to_feature(line[0], line[1], 3)))  # .rdd needed on Spark 2+
labelled.take(5)
# [LabeledPoint(2.0, (6,[0,3],[0.0,1.0])),
# LabeledPoint(1.0, (6,[0,3],[1.0,1.0])),
# LabeledPoint(0.0, (6,[0,3],[2.0,1.0])),
# LabeledPoint(0.0, (6,[1,4],[0.0,1.0])),
# LabeledPoint(0.5, (6,[1,4],[1.0,1.0]))]
Then use Spark's LinearRegressionWithSGD to run the regression:
from pyspark.mllib.regression import LinearRegressionModel, LinearRegressionWithSGD
lrm = LinearRegressionWithSGD.train(labelled, iterations=5000, intercept=False)
The weights from this regression contain the coefficient and intercept for each group_id, i.e.
lrm.weights
# DenseVector([-1.0, 0.5, 1.0014, 2.0, 0.0, 0.9946])
or reshaped into a DataFrame to give a and b for each group:
pd.DataFrame(lrm.weights.toArray().reshape(2, 3).transpose(), columns=['a', 'b'], index=[1, 2, 3])
# a b
# 1 -0.999990 1.999986e+00
# 2 0.500000 5.270592e-11
# 3 1.001398 9.946426e-01
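On more recent Spark versions, a grouped-map pandas UDF runs one fit per group in parallel and avoids the sparse-vector encoding entirely. A sketch, assuming Spark 3.x (applyInPandas) and the same df as above; the output schema string is my choice, not part of the original solution:
import numpy as np
import pandas as pd

def fit_group(pdf):
    # one least-squares line y = a*x + b per group
    a, b = np.polyfit(pdf['x'], pdf['y'], 1)
    return pd.DataFrame({'group_id': [pdf['group_id'].iloc[0]], 'a': [a], 'b': [b]})

result = df.groupBy('group_id').applyInPandas(fit_group, schema='group_id long, a double, b double')
result.show()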
