Let's say the price of houses (the target variable) can easily be plotted against the area of houses (the predictor variable): we can see the data plotted and draw a best-fit line through it.
However, consider what happens if the predictor variables are size, number of bedrooms, locality, number of floors, etc. How am I going to plot all of these against the target variable and visualize them on a 2-D figure?
The computation shouldn't be an issue (the math works regardless of dimensionality), but the plotting definitely gets tricky. PCA can be hard to interpret and forcing orthogonality might not be appropriate here. I'd check out some of the advice provided here: https://stats.stackexchange.com/questions/73320/how-to-visualize-a-fitted-multiple-regression-model
Fundamentally, it depends on what you are trying to communicate. Goodness of fit? Maybe throw together multiple plots of residuals.
If you truly want a 2D figure, that's certainly not easy. One possible approach would be to reduce the dimensionality of your data to 2 using something like Principal Component Analysis; then you can plot it in two dimensions again. Reducing to 3 dimensions instead of 2 might also still work, since humans can understand 3D plots drawn on a 2D screen fairly well.
You don't normally need to do linear regression by hand though, so you don't need a 2D drawing of your data either. You can just let your computer compute the linear regression, and that works perfectly fine with way more than 2 or 3 dimensions.
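For example, here is a minimal sketch with scikit-learn (the data is invented; in practice X would hold your size, bedrooms, locality, and floors columns, suitably encoded). The regression uses all predictors unchanged, and PCA compresses the predictors to a single axis purely so the fit can be looked at in 2-D:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    # Invented data: 200 houses, 4 predictors (size, bedrooms, locality, floors).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    price = X @ np.array([3.0, 1.5, 0.5, 1.0]) + rng.normal(scale=0.5, size=200)

    # The regression itself does not care how many predictors there are.
    model = LinearRegression().fit(X, price)

    # For visualization only: compress the predictors to one principal
    # component and plot the target against it, giving a 2-D figure.
    x_1d = PCA(n_components=1).fit_transform(X).ravel()
    plt.scatter(x_1d, price)
    plt.xlabel('first principal component of the predictors')
    plt.ylabel('price')
    plt.show()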
I'm not a data scientist, but I am intrigued by data science, machine learning, etc.
In my efforts to understand all of this, I am continuously building a dataset (daily scraping) of Grand Exchange prices from one of my favourite games, Old School RuneScape.
One of my goals is to pick a set of stocks/items that would give me the most profit. Currently I am trying out clustering with k-means to find stocks that are similar to each other, based on some basic features I could think of.
However, I have no clue if what I'm doing is correct.
For example, in y = kmeans.fit_predict(df_items), my item_id column is included in df_items, so is it actually being treated as a feature now?
And how do I even visualise the outcome of this? I mean, what goes on the x axis and what goes on the y axis? I have multiple columns...
https://github.com/extreme4all/OSRS_DataSet/blob/master/NoteBooks/Stock%20Picking.ipynb
To visualize something you have to reduce the dimensionality to 2-3 dimensions; on top of that, you can use color as a fourth dimension or, in your case, to indicate the cluster number.
t-SNE is a common choice for this task; check the sklearn docs for details: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
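A minimal sketch of that workflow (synthetic features stand in for your df_items with item_id dropped, so the ID is not treated as a feature; n_clusters=5 is an arbitrary choice):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.manifold import TSNE
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for df_items.drop(columns=['item_id']).
    rng = np.random.default_rng(0)
    features = rng.normal(size=(500, 6))

    # Scale the features, then cluster in the full feature space.
    X = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

    # t-SNE squeezes the feature space down to 2 axes for plotting;
    # color encodes the cluster each item was assigned to.
    embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)
    plt.show()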
Choose almost any visualization technique for multivariate data.
Scatterplot matrix (see the sketch after this list)
Parallel coordinates
Dimensionality reduction (PCA makes more sense for k-means than t-SNE, but also consider Fisher's LDA, LMNN, etc.)
Box plots
Violin plots
...
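For instance, a scatterplot matrix colored by cluster label is a quick first look (a sketch with invented feature names and synthetic data):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Synthetic stand-in for a few item features; column names are invented.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(300, 4)),
                      columns=['buy_price', 'sell_price', 'volume', 'volatility'])
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(df)

    # One scatter plot per pair of features, with points colored by cluster.
    pd.plotting.scatter_matrix(df, c=labels, figsize=(8, 8))
    plt.show()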
I would like to know if, in Python, and more precisely in the lmfit library, there is an option for fitting data piecewise. I would like to fit data defined over different ranges and then obtain a single overall fit.
Thank you
Without a more concrete example, it is hard to give a concrete answer. But, if I understand your question correctly, you are looking to do a fit to one specific region of your data, then a fit (probably with a different functional form) to another region of your data, and then perhaps combine the multiple regions to get a final fit.
If that is correct, then yes, this can be done with lmfit (and probably with other libraries as well). Let's say you want to fit data that is sort of peak-like on top of an exponentially decaying background. First, isolate a region around the peak (it doesn't have to be perfect) and fit a peak (say, a Gaussian) to it. Then fit an exponential decay to all the data except the peak region. (Aside: numpy.where can be very useful for identifying the regions.) Finally, combine the two and fit the whole curve to peak + background.
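A minimal sketch of that recipe using lmfit's built-in models (the data and the peak/background cut-offs are invented for illustration):

    import numpy as np
    from lmfit.models import ExponentialModel, GaussianModel

    # Invented data: an exponentially decaying background plus a peak near x = 8.
    x = np.linspace(0, 20, 201)
    rng = np.random.default_rng(0)
    y = (5 * np.exp(-x / 10)
         + 2 * np.exp(-(x - 8) ** 2 / (2 * 0.5 ** 2))
         + rng.normal(scale=0.05, size=x.size))

    # Step 1: fit only the (roughly isolated) peak region with a Gaussian.
    peak = (x > 6) & (x < 10)   # numpy.where((x > 6) & (x < 10)) gives the same indices
    gauss = GaussianModel(prefix='g_')
    peak_fit = gauss.fit(y[peak], gauss.guess(y[peak], x=x[peak]), x=x[peak])

    # Step 2: fit an exponential decay to everything except the peak region.
    bkg = ~peak
    expo = ExponentialModel(prefix='e_')
    bkg_fit = expo.fit(y[bkg], expo.guess(y[bkg], x=x[bkg]), x=x[bkg])

    # Step 3: combine both models and refit the whole curve, starting from
    # the parameters found in the piecewise fits.
    params = peak_fit.params.copy()
    params.update(bkg_fit.params)
    result = (gauss + expo).fit(y, params, x=x)
    print(result.fit_report())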
If that is too vague and doesn't point you in the right direction, please make the question more specific.
I'm working in a space which has 8 dimensions (i.e. 8 features). I have plotted the data points in 2D by applying PCA as well as t-SNE. Now I would also like to draw the decision boundaries of the classifiers I use, as shown here. By the way, I'm using different classifiers (SVM, GNB, logistic regression).
This means that I have the different 8-dimensional points which I plot in 2D using PCA or TSNE. On top of this plot I would like to plot the different classification regions as shown in the link above.
Of course the classification boundaries/regions are also 8-dimensional. How can I turn the classification boundaries/regions into 2D matching my 2D data points?
Interesting question; I once wondered about this myself.
It can be answered in several ways, with more or less detail depending on whether you want to fully understand the method or just apply it.
Since you don't give a lot of detail but included a sklearn link, I will first answer from a technical point of view: "How can you do it with sklearn?"
There is a function for this: transform(X, y=None), which applies the PCA projection (yes, PCA is a projection from a high-dimensional space to a lower-dimensional one).
So you basically just need to call transform(your_boundaries) to apply it.
In terms of pseudo-code this would give:

    pca = PCA(n_components=2).fit(data)
    boundaries_2d = pca.transform(boundaries)
Et voilĂ !
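One caveat: transform expects a set of points, so you would need sampled boundary points to project, and t-SNE offers no transform for new points at all. A common concrete workaround (a sketch, not the only way) goes the other direction: lay a grid over the 2-D PCA plane, map each grid point back to 8-D with inverse_transform, classify it with the original model, and contour the predictions. Synthetic data below stands in for the real 8-feature dataset:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    # Synthetic stand-in for the real 8-dimensional dataset.
    X, y = make_classification(n_samples=300, n_features=8,
                               n_informative=4, random_state=0)

    clf = SVC().fit(X, y)        # classifier trained in the full 8-D space
    pca = PCA(n_components=2).fit(X)
    X_2d = pca.transform(X)      # data points projected to 2-D

    # Lay a grid over the 2-D PCA plane, map each grid point back to 8-D,
    # and ask the 8-D classifier for its label.
    xx, yy = np.meshgrid(
        np.linspace(X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1, 200),
        np.linspace(X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1, 200))
    grid_8d = pca.inverse_transform(np.c_[xx.ravel(), yy.ravel()])
    Z = clf.predict(grid_8d).reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.3)   # approximate decision regions in 2-D
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, edgecolor='k')
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.show()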
Do not hesitate to give more details or ask questions; I could expand on specific points if that would help.
Hope this helps,
pltrdy
I have some data that looks like this
What is the typical way to fit a polynomial map of z as a function of x and y? I have used numpy.polyfit in the past to do similar things in two dimensions, so I suppose I could iterate through all the points and then fit those answers with another 1-D polyfit. However, it seems there should be a more straightforward way.
By the way, the picture shows two different sets of data that would be fit with different equations.
It seems to me that what you really want is to fit a surface (linear or spline) as z(x, y), but you have only a single line of data. This is like solving for two unknowns with only one equation: the problem is, basically, how could you decide whether the difference in your red line from A to B was caused by the change in PSI or by the change in V?
My suggestions:
fit a surface to your existing dataset. You will get something (see the sketch after this list).
try to get more data that you can fit a more accurate surface on
do what you first wanted: fit a function separately in each dimension, combine them, and use the best of the three functions (the one fitted for PSI, the one fitted for V, and the combined one).
try to combine your PSI and V factors with some fancy physics-based trick into one significant factor that contains both of them
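For the surface option: numpy.polyfit is 1-D only, but a polynomial surface z(x, y) can be fit directly by building a design matrix of polynomial terms and solving the least-squares problem (a sketch with invented data; x and y stand in for PSI and V):

    import numpy as np

    # Invented measurements; replace with the real PSI, V, and z values.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)   # e.g. PSI
    y = rng.uniform(0, 5, 100)    # e.g. V
    z = 1.0 + 0.5 * x - 0.2 * y + 0.03 * x * y + rng.normal(scale=0.1, size=100)

    # Second-order surface: z = c0 + c1*x + c2*y + c3*x^2 + c4*x*y + c5*y^2.
    A = np.column_stack([np.ones_like(x), x, y, x ** 2, x * y, y ** 2])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)

    def z_fit(x, y):
        # Evaluate the fitted surface at (x, y).
        return (coeffs[0] + coeffs[1] * x + coeffs[2] * y
                + coeffs[3] * x ** 2 + coeffs[4] * x * y + coeffs[5] * y ** 2)

    print(coeffs)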
Can someone please tell me if there is a good (easy) way to visualize high-dimensional data? My data currently has 21 dimensions, but I would like to see whether it is dense or sparse. Are there techniques to achieve this?
Parallel coordinates are a popular method for visualizing high-dimensional data.
What kind of visualization is best for your data in particular will depend on its characteristics: how correlated are the different dimensions?
Principal component analysis could be helpful if the dimensions are correlated.
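A quick way to try parallel coordinates in Python (a sketch assuming pandas; the group column is invented purely to color the lines):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Synthetic stand-in for 21-dimensional data.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 21)),
                      columns=[f'd{i}' for i in range(21)])
    df['group'] = rng.integers(0, 2, size=100)  # invented class column for coloring

    # One vertical axis per dimension; each row becomes a polyline.
    pd.plotting.parallel_coordinates(df, 'group')
    plt.show()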
The buzzword I would search for is multidimensional scaling (MDS). It is a technique for developing a projection from the high-dimensional space to a lower (2- or 3-dimensional) space in such a way that points which are close in the full space remain close in the projection.
It is often used for visualising the output of clustering algorithms (i.e. if your clusters are compact in the MDS projection, there is a good chance they are also compact in the full space).
Edit: This wouldn't necessarily help with determining whether the data is dense or sparse, because you lose the scale in the projection, but it would show whether it is uniform or clumpy (perhaps that's what you mean).
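In Python, scikit-learn ships an implementation; a minimal sketch on synthetic 21-dimensional data:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import MDS

    # Synthetic 21-dimensional data standing in for the real dataset.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 21))

    # MDS seeks a 2-D embedding that preserves pairwise distances,
    # so clumps in the embedding hint at clumps in the full space.
    embedding = MDS(n_components=2, random_state=0).fit_transform(X)
    plt.scatter(embedding[:, 0], embedding[:, 1])
    plt.show()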
Not sure what kind of patterns you would like to see in the data. t-SNE and its faster variant Barnes-Hut SNE do a very good job of visualizing groups of related concepts in high-dimensional data. Both are available through R.
There is a short tutorial on using it against high-dimensional data with about 300 dimensions.
http://www.codeproject.com/Tips/788739/Visualizing-High-Dimensional-Vector-using-T-SNE-wi
I was looking for ways to visualize high-dimensional data and found this t-SNE technique, which has been used effectively. It might help others as well.
Take a look at http://www.ggobi.org; its tours, parallel coordinates, and scatterplot matrices can be used for real-valued variables. Also see http://cranvas.org for a more recent toolkit, and the tourr package in R.
Try using http://hypertools.readthedocs.io/en/latest/.
HyperTools is a library for visualizing and manipulating high-dimensional data in Python.
Star Schema.
http://en.wikipedia.org/wiki/Star_schema
Works well for high-dimensional data.
If the cardinality of your fact table is close to the product of your dimension sizes, you have dense data.
If the cardinality of your fact table is much smaller than the product of your dimension sizes, you have sparse data.
For example, with three dimensions of sizes 10, 20, and 50 there are 10,000 possible combinations; a fact table with around 9,000 rows is dense, while one with 200 rows is sparse.
In the middle, it is a judgement call.
The curios.IT data exploration software is designed for visualizing high-dimensional data: data is shown as a collection of 3D objects (one for each data group), which can show up to 13 variables at the same time. The relationships between data variables and visual features are much easier to remember than with other techniques (like parallel coordinates).