I'm learning deep learning and I would like to plot this histogram with matplotlib:
from this code, which prints the data:
lr = LogisticRegression()
lr.fit(X, y)
print(lr.coef_)
which prints:
[[-0.150896 0.23357229 0.00669907 0.3730938 0.100852 -0.85258357]]
edit:
I tried the basic hist, but I don't understand the output:
plt.hist(lr.coef_)
plt.show()
but I got:
As mentioned in the documentation (".. Compute and draw the histogram of .."), pl.hist both calculates and plots a histogram from the raw data. For example:
import matplotlib.pylab as pl
import numpy as np
# Dummy data
data = np.random.normal(size=1000)
pl.figure()
pl.subplot(121)
pl.hist(data)
What you want is the pl.bar function:
# Your data
data = np.array([-0.150896, 0.23357229, 0.00669907, 0.3730938, 0.100852, -0.85258357])
labels = ['as','df','as','df','as','df']
ax=pl.subplot(122)
pl.bar(np.arange(data.size), data)
ax.set_xticks(np.arange(data.size))
ax.set_xticklabels(labels)
Combined this produces:
You need to understand the difference between hist and bar. hist computes the frequency distribution of the data and plots it, while bar simply draws one bar per value, using each value as the bar height.
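To make the difference concrete, here is a small sketch using the coefficients printed in the question. The feature names are made up for the tick labels. np.histogram is the counting step that hist performs before drawing, whereas bar just draws the values you hand it. Note also that lr.coef_ is 2-D with shape (1, 6), and hist treats each column of a 2-D array as a separate dataset, which is why the original plot looked odd.

```python
import numpy as np
import matplotlib.pyplot as plt

# Coefficients printed in the question; lr.coef_ has shape (1, 6), so take row 0
coef = np.array([[-0.150896, 0.23357229, 0.00669907,
                  0.3730938, 0.100852, -0.85258357]])[0]
# Hypothetical feature names, only used as tick labels
names = ["f0", "f1", "f2", "f3", "f4", "f5"]

# What hist does before drawing: bin the values and count them
counts, edges = np.histogram(coef, bins=3)

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(coef, bins=3)   # bar heights are counts per bin
ax_hist.set_title("hist: counts per bin")
ax_bar.bar(names, coef)      # bar heights are the coefficients themselves
ax_bar.set_title("bar: one bar per coefficient")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure.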
I have never been great with Python plotting concepts, and now I'm still apparently missing something new.
Here is my code.
import pandas as pd
import matplotlib.pyplot as plt
import sys
from numpy import genfromtxt
from sklearn.cluster import DBSCAN
data = pd.read_csv('C:\\Users\\path_here\\wine.csv')
data
# Reading in 2D Feature Space
model = DBSCAN(eps=0.9, min_samples=10).fit(data)
array_flavanoids = data.iloc[:, 2]
# Slicing array
array_colorintensity = data.iloc[:, 3]
# Scatter plot function
colors = model.labels_
plt.scatter(array_flavanoids, array_colorintensity, c=colors, marker='o')
plt.xlabel('Concentration of flavanoids', fontsize=16)
plt.ylabel('Color intensity', fontsize=16)
plt.title('Concentration of flavanoids vs Color intensity', fontsize=20)
plt.show()
Here is my result.
I am expecting the outliers to be in a different color than the non-outliers. So, something like this.
Maybe one color for outliers and another for non-outliers. I am just trying to learn the concept in this exercise. I am trying to follow the example from this link.
https://towardsdatascience.com/outlier-detection-python-cd22e6a12098
I am using this data source.
https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
I am testing different data sets.
I got this to work.
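import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def dbscan(X, eps, min_samples):
    # Standardize the features so eps means the same distance on every axis
    ss = StandardScaler()
    X = ss.fit_transform(X)
    db = DBSCAN(eps=eps, min_samples=min_samples)
    # fit_predict fits the model and returns the cluster labels (-1 marks noise)
    y_pred = db.fit_predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='Paired')
    plt.title("DBSCAN")

dbscan(data, eps=.5, min_samples=5)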
I found this to be a great resource.
https://medium.com/#plog397/functions-to-plot-kmeans-hierarchical-and-dbscan-clustering-c4146ed69744
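If the goal is simply "one color for outliers, another for everything else", a sketch along these lines might help. It relies on DBSCAN labelling noise points as -1. The data here is synthetic (two tight blobs plus two far-away points), not the wine data, and the eps and min_samples values are only chosen to suit that toy data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the real data: two tight clusters plus two distant points
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
far_points = np.array([[10.0, -10.0], [-10.0, 10.0]])
X = np.vstack([cluster_a, cluster_b, far_points])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN labels noise points -1; every other label is a cluster id
is_outlier = labels == -1

fig, ax = plt.subplots()
ax.scatter(X[~is_outlier, 0], X[~is_outlier, 1], c="tab:blue", label="clustered")
ax.scatter(X[is_outlier, 0], X[is_outlier, 1], c="tab:red", marker="x",
           label="outlier (label -1)")
ax.legend()
```

With real data you would probably standardize the columns first, as in the function above, and tune eps until the -1 group looks sensible.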
--- SAMPLE ---
I have a data set (sample) that contains 1,000 damage values (the values are very small, < 1e-6) in a 1-dimensional array (see the attached .json file). The sample seems to follow a lognormal distribution:
--- PROBLEM & WHAT I ALREADY TRIED ---
I tried the suggestions in the post "Fitting empirical distribution to theoretical ones with Scipy (Python)?" and in the post "Scipy: lognormal fitting" to fit my data with a lognormal distribution. Neither of them works. :(
I always get very large values on the Y-axis, as shown below:
Here is the code that I used in Python (and the data.json file can be downloaded from here):
from matplotlib import pyplot as plt
from scipy import stats as scistats
import json
with open("data.json", "r") as f:
    sample = json.load(f) # load data: a 1000 * 1 array with many small values( < 1e-6)
fig, axis = plt.subplots() # initiate a figure
N, nbins, patches = axis.hist(sample, bins = 40) # plot sample by histogram
axis.ticklabel_format(style = 'sci', scilimits = (-3, 4), axis = 'x') # use scientific notation on the X-axis
axis.set_xlabel("Value")
axis.set_ylabel("Count")
plt.show()
fig, axis = plt.subplots()
param = scistats.lognorm.fit(sample) # fit data by Lognormal distribution
pdf_fitted = scistats.lognorm.pdf(nbins, * param[: -2], loc = param[-2], scale = param[-1]) # prepare data for plotting the fitted distribution
axis.plot(nbins, pdf_fitted) # draw fitted distribution on the same figure
plt.show()
I tried other kinds of distribution, but when I try to plot the result, the Y-axis values are always too large and I can't plot the curve together with my histogram. Where did I go wrong?
I have also tried the suggestion from my other question, "Use scipy lognormal distribution to fit data with small values, then show in matplotlib", but the value of the variable pdf_fitted is always too big.
--- EXPECTING RESULT ---
Basically, what I want is like this:
And here is the Matlab code that I used in the above screenshot:
fname = 'data.json';
sample = jsondecode(fileread(fname));
% fitting distribution
pd = fitdist(sample, 'lognormal')
% A combined command for plotting histogram and distribution
figure();
histfit(sample,40,"lognormal")
So if you know of an equivalent of fitdist and histfit in Python/Scipy/Numpy/Matplotlib, please post it!
Thanks a lot!
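One likely reason for the "too large" Y values: the histogram shows counts (tens of samples per bin), while a pdf is a density whose area must be 1. For values around 1e-7 that forces the density up to around 1e7. The two only line up if the histogram is normalized (density=True) or the pdf is scaled by n times the bin width. Below is a rough histfit-like sketch under two assumptions: the location is fixed at 0 (a two-parameter lognormal, which as far as I know is what MATLAB fits), and the data is synthetic because I don't have data.json. The function name histfit_lognormal is my own.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def histfit_lognormal(sample, bins=40, ax=None):
    """Histogram plus fitted lognormal pdf on the same (density) scale."""
    sample = np.asarray(sample, dtype=float)
    if ax is None:
        _, ax = plt.subplots()
    # With the location fixed at 0, the maximum-likelihood estimates are
    # the mean and standard deviation of log(sample)
    log_sample = np.log(sample)
    mu, sigma = log_sample.mean(), log_sample.std()
    # density=True rescales the bars so their total area is 1, like a pdf
    ax.hist(sample, bins=bins, density=True, alpha=0.5, label="data")
    x = np.linspace(sample.min(), sample.max(), 300)
    # scipy's parametrization: shape s = sigma, scale = exp(mu)
    ax.plot(x, stats.lognorm.pdf(x, sigma, loc=0, scale=np.exp(mu)),
            label="lognormal fit")
    ax.set_xlabel("Value")
    ax.set_ylabel("Density")
    ax.legend()
    return mu, sigma

# Stand-in for data.json: 1000 synthetic lognormal values of similar magnitude
rng = np.random.default_rng(0)
demo = rng.lognormal(mean=-16.5, sigma=0.6, size=1000)
mu, sigma = histfit_lognormal(demo)
```

With the real data you would pass the list loaded from data.json instead of demo and call plt.show() at the end. To keep counts on the Y-axis as histfit does, multiply the pdf by len(sample) times the bin width instead of using density=True.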
Try the distfit (or fitdist) library.
https://erdogant.github.io/distfit
pip install distfit
import numpy as np
# Example data
X = np.random.normal(10, 3, 2000)
y = [3,4,5,6,10,11,12,18,20]
# From the distfit library import the class distfit
from distfit import distfit
# Initialize
dist = distfit()
# Search for best theoretical fit on your empirical data
dist.fit_transform(X)
# Plot
dist.plot()
# summary plot
dist.plot_summary()
So in your case it would be:
dist = distfit(distr='lognorm')
dist.fit_transform(X)
Try seaborn:
import seaborn as sns, numpy as np
sns.set(); np.random.seed(0)
x = np.random.randn(100)
ax = sns.distplot(x)
I tried your dataset using the OpenTURNS library.
x is the list given in your json file.
import openturns as ot
from openturns.viewer import View
import matplotlib.pyplot as plt
# first format your list x as a sample of dimension 1
sample = ot.Sample(x,1)
# use the LogNormalFactory to build a Lognormal distribution according to your sample
distribution = ot.LogNormalFactory().build(sample)
# draw the pdf of the obtained distribution
graph = distribution.drawPDF()
graph.setLegends(["LogNormal"])
View(graph)
plt.show()
If you want the parameters of the distribution
print(distribution)
>>> LogNormal(muLog = -16.5263, sigmaLog = 0.636928, gamma = 3.01106e-08)
You can build the histogram the same way by calling HistogramFactory, then you can add one graph to another:
graph2 = ot.HistogramFactory().build(sample).drawPDF()
graph2.setColors(['blue'])
graph2.setLegends(["Histogram"])
graph2.add(graph)
view = View(graph2)
and set the axis limits if you want to zoom:
axes = view.getAxes()
_ = axes[0].set_xlim(-0.6e-07, 2.8e-07)
plt.show()
The question topic is a little complex because I need a lot of help lol. To explain, I have a csv of data with labels (names) and numerical data...
name,post_count,follower_count,following_count,anonymous_pic,is_private,...
adam,3,997,435,0,0,1,0,0,0,0
bob,2,723,600,0,0,1,0,0,0,0
jill,11,3193,962,0,0,1,0,0,0,0
sara,0,225,298,0,0,1,0,0,0,0
.
.
and so on. This data is loaded into a pandas dataframe from the csv. Now, I wish to pass only the numerical parts of this data into a sklearn.manifold class called TSNE (t-distributed stochastic neighbor embedding), which will output a list the same length as the input data, where each element of the new list is a list of size k (where k is the number of components specified as an argument to the TSNE class). In my case k = 2.
I'm graphing this data on a 2-D scatter plot with matplotlib, and I'd like to be able to inspect the labels on the data. I know matplotlib has an annotate feature with which points can be labeled, but how do I go about separating these labels from the data for TSNE? And if I just separate the labels prior to the transformation, how can I ensure that I'm relabeling the right points?
I'd like to be able to inspect these names, because I need to see if the transformation is useful on my data. This way I can analyze a few really bizarre places and see if something interesting is happening. Here is my code if you find it useful (although I'll admit it's just scratchwork):
# Data structuring
import pandas as pd
import numpy as np
# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
sns.set() # for plot styling
# Load data
df = pd.read_csv('user_data.csv')
print(df.head())
# sklearn
from sklearn.mixture import GMM
from sklearn.manifold import TSNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 0)
lab_proj = tsne.fit_transform(df)
x = [i[0] for i in lab_proj]
y = [i[1] for i in lab_proj]
print(len(lab_proj))
df['PCA1'] = x
df['PCA2'] = y
model = GMM(n_components = 1, covariance_type = 'full')
model.fit(df)
y_gmm = model.predict(df)
df['cluster'] = y_gmm
sns.lmplot('PCA1', 'PCA2', data = df, col='cluster', fit_reg = False)
plt.show()
Thanks!
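For what it's worth, fit_transform returns one output row per input row, in the same order, so it should be enough to set the name column aside before the transform and pair it back up by position afterwards. A minimal sketch, assuming a toy table (the first four rows are from the question, the last two are made up) and a very small perplexity that is only suitable for six rows:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Tiny stand-in for user_data.csv; "tom" and "eve" are invented rows
df = pd.DataFrame({
    "name": ["adam", "bob", "jill", "sara", "tom", "eve"],
    "post_count": [3, 2, 11, 0, 7, 1],
    "follower_count": [997, 723, 3193, 225, 1500, 80],
    "following_count": [435, 600, 962, 298, 410, 120],
})

names = df["name"]                    # labels, kept aside
features = df.drop(columns="name")    # numeric columns only

# perplexity must be smaller than the number of rows; 2 is only for this toy table
tsne = TSNE(n_components=2, init="random", random_state=0, perplexity=2)
embedding = tsne.fit_transform(features.to_numpy(dtype=float))

# Row i of embedding corresponds to row i of features, hence to names.iloc[i]
fig, ax = plt.subplots()
ax.scatter(embedding[:, 0], embedding[:, 1])
for name, (x, y) in zip(names, embedding):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(4, 4))
```

As long as the rows are not shuffled or filtered between the split and the plot, the positional pairing stays valid. Call plt.show() to display the figure.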
I am trying to plot a Linear Regression onto a scatterplot in Python.
In R I would simply do the following:
# Run OLS linear regression
fit_1 <- lm(medv ~ lstat)
plot(medv ~ lstat)
abline(fit_1, col = "red")
I have been looking at different solutions in Python, but I can't seem to be able to actually get it to work.
My script is:
# Plot data
Boston.plot(kind='scatter', x='medv', y='lstat', color = "black")
plt.show()
# Run linear regression
fit_1 = sm.ols(formula='medv ~ lstat', data= Boston).fit()
# Show summary
fit_1.summary()
# Plot regression line
# Insert code here
It can be done quite simply. In the code below, I use sklearn to fit the model and predict the values.
import pandas as pd
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
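# X must be 2-D (n_samples, n_features); here taken from the question's Boston DataFrame
X = Boston[['lstat']]
y = Boston['medv']
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
plt.plot(X, y, 'o')
# overlay the fitted regression line
plt.plot(X, predictions, '-')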
plt.show()
Try this:
plt.plot(Boston.lstat, fit_1.fittedvalues, 'r')
Saw this on Statology that helped me a lot:
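import numpy as np
import matplotlib.pyplot as plt
def abline(slope, intercept):
    # Draw the line y = intercept + slope * x across the current x-axis range
    axes = plt.gca()
    x_vals = np.array(axes.get_xlim())
    y_vals = intercept + slope * x_vals
    plt.plot(x_vals, y_vals, '--')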
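To get the slope and intercept for a helper like this without a full regression library, np.polyfit with degree 1 is probably the shortest route. A small sketch with made-up data that lies exactly on y = 2x + 1, so the result is easy to check. The ax argument is my own addition to the helper.

```python
import numpy as np
import matplotlib.pyplot as plt

def abline(slope, intercept, ax=None):
    """Draw y = intercept + slope * x across the current x-limits."""
    ax = ax or plt.gca()
    x_vals = np.array(ax.get_xlim())
    ax.plot(x_vals, intercept + slope * x_vals, "--", color="red")

# Made-up data lying exactly on y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Degree-1 least-squares fit; polyfit returns the highest power first
slope, intercept = np.polyfit(x, y, 1)

fig, ax = plt.subplots()
ax.scatter(x, y, color="black")
abline(slope, intercept, ax=ax)
```

This is roughly the Python counterpart of R's plot followed by abline(fit). Call plt.show() to display it.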
gh_data = ascii.read('http://dept.astro.lsa.umich.edu/~ericbell/data/GHOSTS/M81/ngc3031- field15.newphoto_radec')
ra = gh_data['col5'][:]
dec = gh_data['col6'][:]
f606 = gh_data['col3'][:]
f814 = gh_data['col4'][:]
plot(f6062-f8142,f8142, 'bo', alpha=0.15)
axis([-1,2.5,27,23])
xlabel('F606W-F814W')
ylabel('F814W')
title('Field 14')
The data set is imported and organized into different columns. I am trying to overlay a line of best fit (a linear regression) on the scatterplot, but I cannot figure out how. Thanks in advance.
As @rayryeng pointed out, your code just plots the data, but doesn't actually compute any regression results to plot. Here's one way of doing it:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame({"y": range(1,11)+np.random.rand(10),
"x": range(1,11)+np.random.rand(10)})
Use statsmodels OLS method to fit a regression line, and params to extract the coefficient on the single regressor:
beta_1 = sm.OLS(data.y, data.x).fit().params
Produce a scatterplot and add a regression line:
fig, ax = plt.subplots()
ax.scatter(data.x, data.y)
ax.plot(range(1,11), [i*beta_1 for i in range(1,11)], label = "best fit")
ax.legend(loc="best")
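One thing to keep in mind: sm.OLS does not add an intercept on its own, so with x as the only regressor the fitted line above is forced through the origin. The numpy-only sketch below shows the difference between that slope and an ordinary fit with an intercept. The data is random in the same style as above, and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 11) + rng.random(10)
y = np.arange(1, 11) + rng.random(10)

# Regression through the origin (x is the only regressor, no constant):
# the least-squares slope is sum(x*y) / sum(x*x)
beta_origin = np.sum(x * y) / np.sum(x * x)

# Regression with an intercept: least squares on the columns [1, x]
design = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)
```

In statsmodels the second variant corresponds to adding a constant column to the regressors (for example with sm.add_constant) before calling sm.OLS.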