I ran Spark's built-in Power Iteration Clustering on my dataset and got the model with:
model = PowerIterationClustering.train(similarities, 2, 10)
When I do
model.assignments.collect()
I get all the values.
Now I want to draw a scatter plot of this model using Matplotlib, but I don't understand how to do it.
I gather that x and y in the code below are the id and cluster from the model:
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
But I don't understand how to use it. What should area and colors be?
You first need to parse the Assignment tuple, then collect. The output will be:
(<id int>, <cluster int>)
Instead of
Assignment(id=..,cluster=...)
You can do this by
model.assignments.map(lambda asm: (asm[0], asm[1])).collect()
You can then extract the x and y from the resulting list of tuples.
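As a rough sketch of that last step, the list of tuples can be split with zip(*...) and passed to scatter. The assignments list below is a hypothetical stand-in for the collected output, and the marker size and colormap are arbitrary choices: s is the marker area in points squared, and c can simply be the cluster labels so each cluster gets its own color.

```python
import matplotlib.pyplot as plt

# Hypothetical stand-in for model.assignments.map(...).collect()
assignments = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0)]

# Split the list of (id, cluster) tuples into two sequences
ids, clusters = zip(*assignments)

fig, ax = plt.subplots()
# s is the marker area in points^2; c=clusters colors each point by its cluster label
ax.scatter(ids, clusters, s=40, c=clusters, cmap='viridis', alpha=0.5)
ax.set_xlabel('id')
ax.set_ylabel('cluster')
```

Call plt.show() (or fig.savefig(...)) afterwards to see the result.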
I'm trying to compare my predicted output with the test data using matplotlib. As I'm new to Python, I can't figure out how to connect each pair of entries with a line, like in this photo.
I was able to write code like this, which plots the Y values against the entry index, but I'm unable to connect each entry of the test data to its predicted output with a line.
X_1 = range(len(Y_test))
plt.figure(figsize=(5,5))
plt.scatter(X_1, output, label='Y_output',alpha=0.3)
plt.scatter(X_1, Y_test, label='Y_test',alpha=0.3)
plt.title("Scatter Plot")
plt.legend()
plt.xlabel("entries")
plt.ylabel("Y value")
plt.show()
graph we are getting
Try something like this in addition to your code
plt.plot(np.stack((X_1,X_1)), np.stack((output,Y_test)), color="black")
In fact, to reproduce the plot you want, you need different x for output and for Y_test (for example, X_1 and X_2 that are different).
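Here is a minimal, self-contained sketch of the stacking trick. Y_test and output are synthetic stand-ins (assumed data), so only the plotting pattern carries over. When plot receives 2D arrays it draws one line per column, and each column here holds the two endpoints of one connecting segment.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for Y_test and output (assumed data, for illustration only)
rng = np.random.default_rng(0)
Y_test = rng.normal(size=20)
output = Y_test + rng.normal(scale=0.3, size=20)
X_1 = np.arange(len(Y_test))

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(X_1, output, label='Y_output', alpha=0.3)
ax.scatter(X_1, Y_test, label='Y_test', alpha=0.3)

# Each column of the stacked 2 x N arrays is one segment
# from (x, output) to (x, Y_test)
segments = ax.plot(np.stack((X_1, X_1)), np.stack((output, Y_test)),
                   color='black', linewidth=0.8)
ax.legend()
```

To offset the two series horizontally, pass two different x arrays to np.stack instead of X_1 twice.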
I am trying to fit a multiple linear regression model:
Y = c + a1*X1 + a2*X2 + a3*X3 + a4*X4 + a5*X5 + a6*X6
If my model had only 3 variables, I would have used a 3D plot.
How can I plot this? I basically want to see what the best-fit line looks like. Or should I draw multiple scatter plots and look at the effect of each individual variable, i.e.
Y = a1*X1 when all others are zero, and see the best-fit line?
What is the best approach for these models? I know it is not possible to visualize higher dimensions, but I want to know what the best approach would be. I am desperate to see the best-fit line.
I found this post helpful and followed it:
https://stats.stackexchange.com/questions/73320/how-to-visualize-a-fitted-multiple-regression-model
Based on the suggestions there,
I am currently just drawing scatter plots of the dependent variable vs. the 1st independent variable, then vs. the 2nd independent variable, and so on. I may not be able to see the best-fit line for the complete model, but I can see how it depends on each individual variable.
from sklearn.linear_model import LinearRegression
train_copy = train[['OverallQual', 'AllSF','GrLivArea','GarageCars']]
train_copy =pd.get_dummies(train_copy)
train_copy=train_copy.fillna(0)
linear_regr_test = LinearRegression()
fig, axes = plt.subplots(1,len(train_copy.columns.values),sharey=True,constrained_layout=True,figsize=(30,15))
for i,e in enumerate(train_copy.columns):
    linear_regr_test.fit(train_copy[e].values[:,np.newaxis], y.values)
    axes[i].set_title("Best fit line")
    axes[i].set_xlabel(str(e))
    axes[i].set_ylabel('SalePrice')
    axes[i].scatter(train_copy[e].values[:,np.newaxis], y,color='g')
    axes[i].plot(train_copy[e].values[:,np.newaxis],
                 linear_regr_test.predict(train_copy[e].values[:,np.newaxis]),color='k')
You can use Seaborn's regplot function, and use the predicted and actual data for comparison. It is not the same as plotting a best fit line, but it shows you how well the model works.
sns.regplot(x=y_test, y=y_predict, ci=None, color="b")
You could try to visualize how well your model is performing by comparing actual and predicted values.
Assuming that our actual values are stored in Y, and the predicted ones in Y_, we could plot and compare both.
import seaborn as sns
# distplot is deprecated since seaborn 0.11; kdeplot draws the same density curves
ax1 = sns.kdeplot(Y, color="r", label="Actual Value")
sns.kdeplot(Y_, color="b", label="Fitted Values", ax=ax1)
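If seaborn is not available, a similar actual-vs-predicted comparison can be sketched with plain matplotlib. The arrays below are synthetic stand-ins (assumed data, not output from any real model). The dashed diagonal marks perfect predictions, so the scatter of points around it gives a quick visual sense of model error.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for actual and predicted values (assumed data)
rng = np.random.default_rng(0)
y_test = rng.uniform(100.0, 500.0, size=100)
y_predict = y_test + rng.normal(scale=30.0, size=100)

fig, ax = plt.subplots()
ax.scatter(y_test, y_predict, alpha=0.5)

# Perfect predictions would fall exactly on this diagonal
lo = min(y_test.min(), y_predict.min())
hi = max(y_test.max(), y_predict.max())
ax.plot([lo, hi], [lo, hi], color='k', linestyle='--')

ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')

# Residuals can be inspected separately, e.g. with a histogram
residuals = y_test - y_predict
```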
I'm facing a silly problem while plotting a graph from a regression function fitted with scikit-learn. After constructing the function, I need to plot a graph that shows X and Y from the original data along with the points calculated by my function. The problem is that my function is not a straight line: although the model is linear, it uses a Fourier series to give my curve the right shape. When I try to plot the line using:
ax.plot(df['GDPercapita'], modelp1.predict(df1), color='k')
I get a graph like this:
Plot
But the true graph is supposed to be a line following those black points:
Dots to be connected
I'm generating the graph using the following code:
fig, ax = plt.subplots()
ax.scatter(df['GDPercapita'], df['LifeExpectancy'], edgecolors=(0, 0, 0))
ax.scatter(df['GDPercapita'], modelp1.predict(df1),color='k') #this line is changed to get the first pic.
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show(block=True)
Does anyone have an idea about what to do?
POST-DISCUSSION EDIT:
OK, so first things first:
The data can be downloaded at: http://www.est.ufmg.br/~marcosop/est171-ML/dados/worldDevelopmentIndicators.csv
I had to generate new data using a Fourier expansion of the normalized GDPercapita values, in order to run an exhaustive optimization over the regression functions used to predict LifeExpectancy, and found the number of parameters p that gives the best regression function: p=22.
Now I have to generate a polynomial function from the prediction points of the regression function with p=22, to show how the best regression function compares to a polynomial function of degree 22.
To generate the predictions I use the following code:
from sklearn import linear_model
modelp22 = linear_model.LinearRegression()
modelp22.fit(xp22,y_train)
df22 = df[p22]
fig, ax = plt.subplots()
ax.scatter(df['GDPercapita'], df['LifeExpectancy'], edgecolors=(0, 0, 0))
ax.scatter(df['GDPercapita'], modelp22.predict(df22),color='k')
ax.set_xlabel('GDPercapita')
ax.set_ylabel('LifeExpectancy')
plt.show(block=True)
Now I need to use the prediction points to create a polynomial function and plot a graph with the original data (first scatter), the prediction points (second scatter), and the polynomial function (a curve) to show their visual relation.
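One likely cause of the tangled line in the first picture is that ax.plot joins points in row order rather than in x order. Sorting both arrays by x before plotting usually gives a single curve through the predicted points. The sketch below uses synthetic stand-ins for df['GDPercapita'] and the model predictions (assumed data), so only the sorting pattern is meant to carry over.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for df['GDPercapita'] and the model predictions (assumed data)
rng = np.random.default_rng(0)
gdp = rng.uniform(0.0, 1.0, size=50)       # x values in arbitrary row order
predicted = np.sin(2 * np.pi * gdp)        # a curved "prediction" for each row

# plot() joins points in the order given, so sort both arrays by x first
order = np.argsort(gdp)
gdp_sorted = gdp[order]
predicted_sorted = predicted[order]

fig, ax = plt.subplots()
ax.scatter(gdp, predicted, color='k')
ax.plot(gdp_sorted, predicted_sorted, color='r')
ax.set_xlabel('GDPercapita')
ax.set_ylabel('LifeExpectancy')
```

With a DataFrame, the same idea would be order = np.argsort(df['GDPercapita'].values), then indexing both the x values and the predictions with order.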
features = ["Ask1", "Bid1", "smooth_midprice", "BidSize1", "AskSize1"]
client = InfluxDBClient(host='127.0.0.1', port=8086, database='data',
username=username, password=password)
series = "DCIX_2016_11_15"
sql = "SELECT * FROM {} where time >= '{}' AND time <= '{}' ".format(series,FROMT,TOT)
df = pd.DataFrame(client.query(sql).get_points())
#Separating out the features
X = df.loc[:, features].values
# Standardizing the features
X = StandardScaler().fit_transform(X)
tsne = TSNE(n_components=3, n_jobs=5).fit_transform(X)
I would like to map my 5 features into a 2D or 3D plot, but I am a bit confused about how to do that. How can I build a plot from that information?
You already have most of the work done. t-SNE is a common visualization for understanding high-dimensional data, and right now the variable tsne is an array where each row represents a set of (x, y, z) coordinates from the obtained embedding. You could use other visualizations if you would like, but t-SNE is probably a good starting place.
As far as actually seeing the results, even though you have the coordinates available you still need to plot them somehow. The matplotlib library is a good option, and that's what we'll use here.
To plot in 2D you have a couple of options. You can either keep most of your code the same and simply perform a 2D t-SNE with
tsne = TSNE(n_components=2, n_jobs=5).fit_transform(X)
Or you can just use the components you have and only look at two of them at a time. The following snippet should handle either case:
import matplotlib.pyplot as plt
plt.scatter(*zip(*tsne[:,:2]))
plt.show()
The zip(*...) transposes your data so that you can pass the x coordinates and the y coordinates individually to scatter(), and the [:,:2] piece selects two coordinates to view. You could ignore it if your data is already 2D, or you could replace it with something like [:,[0,2]] to view, for example, the 0th and 2nd features in higher-dimensional data rather than just the first 2.
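For comparison, plain column slicing selects the same coordinates as the zip(*...) transpose and may be easier to read. The tsne array here is a random stand-in for the real embedding (assumed data with shape (n_samples, 3)).

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-in for the (n_samples, 3) t-SNE embedding (assumed data)
rng = np.random.default_rng(0)
tsne = rng.normal(size=(200, 3))

# Column slicing: pick components 0 and 2
xs, ys = tsne[:, 0], tsne[:, 2]

# The same selection via the zip(*...) transpose
xs_zip, ys_zip = zip(*tsne[:, [0, 2]])

fig, ax = plt.subplots()
ax.scatter(xs, ys, s=10)
```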
To plot in 3D the code looks much the same, at least for a minimal version.
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(*zip(*tsne))
plt.show()
The main differences are a use of 3D plotting libraries and making a 3D subplot.
Adding color: t-SNE visualizations are typically more helpful if they're color-coded somehow. One example might be the smooth midprice you currently have stored in X[:,2]. For exploratory visualizations, I find 2D plots more helpful, so I'll use that as the example:
plt.scatter(*zip(*tsne[:,:2]), c=X[:,2])
You still need the imports and whatnot, but by passing the keyword argument c you can color code the scatter plot. To adjust how that numeric data is displayed, you could use a different color map like so:
plt.scatter(*zip(*tsne[:,:2]), c=X[:,2], cmap='RdBu')
As the name might suggest, this colormap consists of a gradient between red and blue, and the lower values of X[:,2] will correspond to red.
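A colorbar makes the mapping from color to value explicit. The sketch below uses a random 2D embedding and random values as stand-ins (assumed data); the colorbar label is also just an illustrative choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-ins for a 2D embedding and the feature used for coloring (assumed data)
rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 2))
values = embedding[:, 0] + rng.normal(scale=0.1, size=200)

fig, ax = plt.subplots()
points = ax.scatter(embedding[:, 0], embedding[:, 1], c=values, cmap='RdBu')

# The colorbar shows which values map to red and which to blue
cbar = fig.colorbar(points, ax=ax)
cbar.set_label('smooth_midprice (standardized)')
```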
I've got the following code that generates a surface density plot.
x and y are position coordinates, and the z axis represents the density. All the values are precalculated and stored in NumPy arrays.
#set up the grid
xi, yi = np.linspace(x.min(), x.max(), 200), np.linspace(y.min(), y.max(), 200)
xi, yi = np.meshgrid(xi, yi)
#interpolate
rbf = scipy.interpolate.Rbf(x, y, z, function='linear')
zi = rbf(xi, yi)
plt.imshow(zi, vmin=z.min(), vmax=z.max(), origin='lower', extent=[x.min(), x.max(), y.min(), y.max()])
plt.scatter(x, y, c=z,marker='o')
plt.colorbar()
plt.scatter(xo,yo, c='b', marker='*')
plt.xlabel("RA(degrees)")
plt.ylabel("DEC(degrees)")
plt.title('Surface Density Plot 2.0 < z < 2.2')
plt.savefig('2.0-2.2.png', dpi= 300 )
plt.show()
The problem I have is that the x-axis ticks are not user friendly: the values lie between 150 and 152, but I can't seem to change the tick positions using the xticks() function.
Does anyone have a suggestion for how I can format the x axis?
Edit:
These are the x, y, z values used for the plot, as three NumPy arrays: https://www.dropbox.com/s/l03pkzplqmfm1su/xyz.csv
The first row holds the x values, the second the y values, and the third the z values.
When using the pyplot interface, you can set the xticks via (provided you imported matplotlib.pyplot as plt)
plt.xticks(*args, **kwargs)
You can give the tick locations, e.g., as a list or a NumPy array, and the tick labels as a tuple (or list, ...).
However, please include a minimal example of code that we can run, so we can test if it's working and see why not, if that's the case. Also, you seem to have imported matplotlib as plt, but some of your commands (like xlabel) lack the plt. part. Is this just a typo or copy/paste error?
If you want more fine-tuning for your ticks and the tick-format, you should consider using the OO interface of matplotlib. Yes, it's more verbose and you have to type some more letters, but in my opinion the code gets much clearer and you have more possibilities to adapt the graph to your expectations.
Edit: As I understand from your comments, you are not satisfied with the format of the xtick labels. So instead of "0.0" "+1.5e2" you probably want to have "150.0" or so. The function to look into (using the pyplot interface) is:
plt.ticklabel_format(**kwargs)
The kwargs are listed in the matplotlib documentation for ticklabel_format. Try whether style='plain' fits your needs.
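As a small sketch of this, the Axes method ticklabel_format accepts the same keywords as the pyplot function. Passing useOffset=False along with style='plain' asks matplotlib to print the full tick values instead of factoring out an offset. The x and y arrays are made-up stand-ins for the RA/DEC data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-ins for the RA/DEC values (assumed data)
x = np.linspace(150.05, 150.25, 50)
y = np.linspace(2.0, 2.5, 50)

fig, ax = plt.subplots()
ax.scatter(x, y)

# Print plain tick values (150.05, 150.10, ...) instead of small numbers plus an offset
ax.ticklabel_format(axis='x', style='plain', useOffset=False)
ax.set_xlabel("RA(degrees)")
```

With the pyplot interface, the equivalent call would be plt.ticklabel_format(axis='x', style='plain', useOffset=False).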
Again, I want to stress that the OO interface gives you far more flexibility to change the format of the tick labels. The respective methods, called on an Axes instance ax, would be:
ax.yaxis.set_major_formatter(formatter)
ax.xaxis.set_major_formatter(formatter)
You can choose between several formatters or even write your own formatting function. If you want to do that, I can give you further advice.
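One of the ready-made formatters is matplotlib.ticker.FormatStrFormatter, which applies a printf-style format string to every tick value. The sketch below uses made-up coordinates, and the two-decimal format is just one possible choice.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

fig, ax = plt.subplots()
# Made-up points with a large x offset (assumed data)
ax.scatter([150.05, 150.10, 150.15, 150.20, 150.25], [2.0, 2.1, 2.2, 2.3, 2.4])

# Format every x tick with two decimals, e.g. "150.05"
formatter = FormatStrFormatter('%.2f')
ax.xaxis.set_major_formatter(formatter)
```

For fully custom labels, matplotlib.ticker.FuncFormatter wraps any function taking (value, position) and returning a string.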
Firefly, based on the Dropbox image you linked in the comments, I believe the following describes your problem. The magnitude of the x data is much larger than its variation, so matplotlib has a list of tick values like
[150.05,150.10,150.15,150.20,150.25]
which is too wide for the x axis in this figure, so matplotlib does some clever business (factoring out an offset) which you do not like (and I agree).
One fix could be to simply set the xticks vertical, e.g.
plt.xticks(rotation='vertical')
Failing that, you could manually do what matplotlib has attempted: subtract 150 degrees from the x values and change your xlabel so that you have
plt.xlabel("RA+150(degrees)")
If your data were not in degrees, I would suggest rescaling instead (e.g., dividing by 1e2), but with degrees this looks very strange.