I have a dataset of 10 features. Three of these are categorical; when I apply one-hot encoding to them, they blow up into 96 features. I reduced these 96 features to 20 with PCA.
I plan to use the 20 principal components and the remaining 7 features as my final feature set. Is this a good idea: to combine principal components with actual features?
PCA produces components that are combinations of the actual features, and most of the time this combination involves some information loss. That loss is usually a fair trade-off for the dimensionality reduction. Adding those seven actual features back won't make the dimensionality too large and will get "back" some of the information lost by PCA.
But my advice would still be to try it both ways and choose the one that leads to better results (given your specific setup).
There is no theoretical problem with this approach. From a statistical standpoint, all you've done is to exclude those seven features from the PCA reduction. This implies that you know, a priori, that those seven features are principal components -- that they're significant to the results, without having to analyze them for independence from the other features, and for relevance.
As loeschet already mentioned, you should try it both ways: once the way you're proposing, and once with all 103 features included in your PCA phase. See which gives you better results. Much of data set analysis consists of trying different approaches to see which gives you the best empirical results.
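For what it's worth, here is a minimal sketch (assuming scikit-learn) of how such a mixed pipeline could be wired up: the three categorical columns are one-hot encoded and compressed with PCA, while the remaining seven columns are passed through untouched. All column names here are hypothetical.

```python
# Minimal sketch, assuming scikit-learn; column names are hypothetical.
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["cat_a", "cat_b", "cat_c"]        # hypothetical names
numeric_cols = [f"num_{i}" for i in range(1, 8)]      # hypothetical names

# One-hot encode the three categoricals (~96 dummy columns), then keep 20 PCs.
# Note: `sparse_output` needs scikit-learn >= 1.2; older versions use `sparse=False`.
cat_pipeline = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ("pca", PCA(n_components=20)),
])

# 20 principal components + 7 untouched numeric features = 27 output columns.
preprocessor = ColumnTransformer([
    ("cat", cat_pipeline, categorical_cols),
    ("num", "passthrough", numeric_cols),
])

# X is assumed to be a DataFrame containing the columns listed above:
# X_reduced = preprocessor.fit_transform(X)
```

A setup like this keeps the PCA step from ever seeing the seven features you want to preserve as-is, which is exactly the scenario described in the question.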
I have a dataset with 13 features in total, out of which 5 are categorical. These categorical features have 1700, 25, 65, 275, and 3 different categories respectively. I will convert them to numeric data using available encoding techniques before applying ML algorithms.
Problem that I am working on is a multiclass classification.
My question is: do I need a large amount of data (in the hundreds of thousands) for my model to effectively learn the different combinations of each category available to me?
No, you don't need an especially large amount of data.
This is a common issue concerning high-cardinality categorical features, which you will find a lot of information on if you look it up.
One approach is known as target encoding, where the feature is encoded by taking into consideration the corresponding values of the target (i.e. labels).
See TargetEncoder from scikit-learn for example.
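As a rough illustration, here is a minimal sketch of target encoding with scikit-learn's TargetEncoder (available in scikit-learn >= 1.3; multiclass targets need >= 1.4). The data and column names are made up.

```python
# Minimal sketch of target encoding, assuming scikit-learn >= 1.4 for
# multiclass support. The data below is hypothetical.
import pandas as pd
from sklearn.preprocessing import TargetEncoder

df = pd.DataFrame({
    "city": ["a", "b", "c", "a", "b", "c"] * 10,   # high-cardinality stand-in
    "label": [0, 1, 2, 1, 2, 0] * 10,              # multiclass target
})

# Each category is replaced by statistics of the target; internal
# cross-fitting during fit_transform limits target leakage.
enc = TargetEncoder(target_type="multiclass")
encoded = enc.fit_transform(df[["city"]], df["label"])
print(encoded.shape)  # one encoded column per target class for this feature
```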
When working with multiclass classification problems, it is best to have the same number of samples for each target class; otherwise you have an imbalanced dataset.
To answer your question, the feature count does not play as crucial a role as the target distribution, so you do not need a large amount of data.
I have a few lists of movement tracking data, which look something like this:
I want to create a list of outputs where I mark these large spikes, essentially telling that there is a movement at that point.
I applied a rolling standard deviation on the data with a window size of two and got this result
Now I can see the spikes that mark the points of interest, but I am not sure how to do this in code. Is there a statistical tool to measure these spikes that can be used to flag them?
There are several approaches that you can use for an anomaly detection task.
The choice depends on your data.
If you want to use a statistical approach, you can use some measures like z-score or IQR.
Here you can find a tutorial for these measures.
Here you can find another tutorial for a statistical approach that uses the mean and variance.
Last but not least, I also suggest checking how to use a control chart, because in some cases it is enough.
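As a concrete starting point, here is a minimal sketch (assuming pandas and numpy) that combines the rolling standard deviation from the question with a z-score threshold to flag the spikes; the window size, threshold, and sample signal are hypothetical.

```python
# Minimal sketch: flag spikes via rolling std + z-score threshold.
# Window size, threshold, and the sample signal are hypothetical.
import numpy as np
import pandas as pd

def flag_spikes(values, window=2, z_thresh=3.0):
    """Return a boolean mask that is True where the rolling std spikes."""
    s = pd.Series(values)
    rolling_std = s.rolling(window=window).std()
    # z-score of the rolling std itself: how unusual each local burst is
    z = (rolling_std - rolling_std.mean()) / rolling_std.std()
    return (z.abs() > z_thresh).to_numpy()

# Example: a flat signal with one sudden movement in the middle.
data = np.concatenate([np.zeros(50), [5.0, 5.2, 4.8], np.zeros(50)])
mask = flag_spikes(data, window=2, z_thresh=3.0)
print(np.where(mask)[0])  # indices flagged as movement
```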
Suppose I have a matrix with 5000 rows and 10 columns, most of which are filled with categorical data (strings), and each column has 10-30 different strings. What is the ideal way/algorithm to deal with this in Python? OneHotEncoder would give me a very large matrix.
Scikit-learn's one-hot encoder uses sparse matrices by default, so the exact matrix shape is not a problem (a sparse matrix only stores the nonzero entries).
Some simple sklearn algorithms (linear models, trees, Naive Bayes) are able to handle such sparse data - for a concrete example, see the Computational Performance section of the docs or the "Classification of text documents using sparse features" example.
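To make the sparse point concrete, here is a minimal sketch (assuming scikit-learn and pandas) showing that OneHotEncoder returns a scipy sparse matrix by default, so only the nonzero entries are stored; the toy data is hypothetical.

```python
# Minimal sketch, assuming scikit-learn and pandas; the data is hypothetical.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size": ["S", "M", "M", "L"],
})

enc = OneHotEncoder()             # sparse output by default
X = enc.fit_transform(df)         # scipy.sparse matrix, not a dense array
print(type(X), X.shape)
print(X.nnz, "stored entries out of", X.shape[0] * X.shape[1])
```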
I don't know if it is ideal, but you could use scipy: try one-hot encoding combined with a sparse matrix representation for the resulting matrix.
Why not use a graph database such as https://neo4j.com? My recommendation, though, is JCR: modeshape.jboss.org. You can do deeper leaf indexing and get very flexible queries.
I'm clustering some data using scikit.
I have the easiest possible task: I do know the number of clusters. And, I do know the size of each cluster. Is it possible to specify this information and relay it to the K-means function?
No. You need some type of constrained clustering algorithm to do this, and none are implemented in scikit-learn. (This is not "the easiest possible task", I wouldn't even know of a principled algorithm that does this, aside from some heuristic moving of samples from one cluster to another.)
It won't be k-means anymore.
K-means is variance minimization, and it seems your objective is to produce partitions of a predefined size, not of minimum variance.
However, here is a tutorial that shows how to modify k-means to produce clusters of the same size; it is fairly easy to extend this to produce clusters of the desired sizes instead of the average size. But the results will be even more meaningless than ordinary k-means results on most data sets; k-means is often just as good as random convex partitions.
I can only think of a brute-force approach. If the clusters are well separated, you can try running the clustering several times with different random initializations, providing just the number of clusters as input. After each run, count the size of each cluster, sort the sizes, and compare them to the sorted list of known cluster sizes. If they don't match, rinse and repeat.
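A minimal sketch of that brute-force idea, assuming scikit-learn and numpy (the known sizes and the toy data are hypothetical):

```python
# Minimal sketch of the brute-force idea above, assuming scikit-learn.
# The known cluster sizes and the toy data are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def kmeans_with_known_sizes(X, target_sizes, max_tries=100):
    """Rerun KMeans with different seeds until sorted cluster sizes match."""
    target = sorted(target_sizes)
    for seed in range(max_tries):
        km = KMeans(n_clusters=len(target_sizes), random_state=seed, n_init=10)
        labels = km.fit_predict(X)
        if sorted(np.bincount(labels).tolist()) == target:
            return labels
    raise RuntimeError("No run matched the requested cluster sizes.")

# Example usage on well-separated blobs of sizes 30, 50 and 20.
X, _ = make_blobs(n_samples=[30, 50, 20], random_state=0)
labels = kmeans_with_known_sizes(X, target_sizes=[30, 50, 20])
print(np.bincount(labels))
```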
I'm new to Python. I have a dataframe which contains annual records from 1959 to 2009. Could you please tell me how to use it to predict, say, 2010 to 2012?
I appreciate any help!
First of all, plot your data and have a look at it. You should then have a feeling for what's going on, and also a subjective prediction.
If your data seems to be completely random, without any obvious trends, calculate its average and use it as a first-guess prediction. (For fully random data, that is also what a linear regression would give you.)
You can then use linear regression, either with pandas' ols regression tools or NumPy's polyfit. Make sure you plot your data together with the regression line to actually see how well your prediction is doing.
And don't expect miracles from this method. Complicated processes are much harder to predict than a linear trend, and 50-year-long processes, whatever they are, are usually complicated enough.
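As a minimal sketch of the polyfit route (assuming numpy and pandas, with a made-up "value" column standing in for your annual records):

```python
# Minimal sketch: fit a linear trend with numpy's polyfit and extrapolate
# it to 2010-2012. The "value" column and its contents are hypothetical.
import numpy as np
import pandas as pd

# Stand-in for the annual DataFrame from the question, indexed by year.
years = np.arange(1959, 2010)
df = pd.DataFrame({"value": np.random.randn(len(years)).cumsum()}, index=years)

# Fit a degree-1 polynomial (a straight line) to the historical data.
slope, intercept = np.polyfit(df.index.to_numpy(), df["value"].to_numpy(), deg=1)

# Extrapolate the fitted line to the years we want to predict.
future_years = np.array([2010, 2011, 2012])
predictions = slope * future_years + intercept
print(dict(zip(future_years.tolist(), predictions.round(2))))
```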