I have a problem where I need to predict a list of objects based on previous history of the usage of the objects. It is a recommendation system in short.
I figured I can use clustering on existing data, and then try to find pattern among the clusters.
For this I came across the scikit-learn library in Python, and I think it will work.
But I need to know how to use one of their clustering algorithms (say MeanShift), since the examples they provide mostly work on the datasets bundled with the library itself.
So,
How do I organize my data so that I can use the MeanShift class from the sklearn.cluster package? (See the sketch after these questions.)
My data points are multidimensional, so will I be able to use the sklearn package in the first place? They haven't mentioned any constraints.
If I can cluster multidimensional data points, will I have to do dimensionality reduction? (I don't know how to do this either, but I am aware of the concept.)
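To make the first two questions concrete, here is a minimal sketch (my own illustration, not taken from any scikit-learn example): the clustering estimators expect a 2-D array of shape (n_samples, n_features), so each usage record becomes one row and each attribute one column; multidimensional points are fine as long as they fit that layout. The data below is made up.

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth

    # Hypothetical usage history: 100 data points, 5 numeric features each.
    X = np.random.rand(100, 5)

    # Estimating the bandwidth is optional; MeanShift can also do it internally.
    bandwidth = estimate_bandwidth(X, quantile=0.2)
    model = MeanShift(bandwidth=bandwidth)
    labels = model.fit_predict(X)           # one cluster label per row
    print(model.cluster_centers_.shape)     # (n_clusters, n_features)

Categorical attributes would need to be encoded numerically first.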
I have done some data mining in one of my courses, but these are new waters for me; any help in terms of pointers to resources/tutorials will be highly appreciated.
Thank you.
I'm interested in learning to rank with pairwise comparison. While working on this I found that XGBoost has a model called XGBRanker which works very well.
I want to find out how XGBRanker manages the training data to get such low memory usage and great results (it uses LambdaMART, I believe). I imagine it must be some kind of lookup table for the features, and perhaps it builds the pairs iteratively or does not use all possible permutations with different labels within one group.
I tried looking through the source code but everything keeps referring to some other XGBoost method and I haven't been able to understand it so far.
I would like to create a similar method to train NNs for pairwise comparison but handling the training data has been a huge hurdle so far.
So, more generally, my question would be: how are the pairs created in pairwise ranking algorithms (RankNet, LambdaRank and so on)? Are all pairs used? Only a percentage? Is there some other way of doing this? If you're working with more than 100,000 items, you would easily get into the range of hundreds of millions of pairs.
I hope someone has some information about this or knows who might.
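For what it's worth, here is a sketch of the scheme most descriptions of pairwise ranking use: within each query group, pair only items whose relevance labels differ, and optionally subsample the pairs to bound memory. This is my own illustration under those assumptions, not XGBoost's actual implementation.

    import itertools
    import random

    def make_pairs(labels, max_pairs_per_group=None, seed=0):
        """Return index pairs (i, j) within one query group where labels[i] > labels[j]."""
        pairs = [(i, j)
                 for i, j in itertools.permutations(range(len(labels)), 2)
                 if labels[i] > labels[j]]
        if max_pairs_per_group is not None and len(pairs) > max_pairs_per_group:
            random.Random(seed).shuffle(pairs)
            pairs = pairs[:max_pairs_per_group]   # subsample to keep memory bounded
        return pairs

    # Example: one query group with graded relevance labels.
    print(make_pairs([2, 1, 0, 1], max_pairs_per_group=3))

Per-group subsampling like this is one common way to avoid the combinatorial blow-up described above; whether a specific library does it this way would need to be checked in its source.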
I am working in Python 2.7. I want to create nomograms based on data for various variables in order to predict one variable. I have looked into, and installed, the PyNomo package.
However, from the documentation here and here and the examples, it seems that nomograms can only be made when you have equation(s) relating these variables, and not from the data. For example, the examples here show how to use equations to create nomograms. What I want is to create a nomogram from the data and use it to predict things. How do I do that? In other words, how do I make the nomograph take data as input rather than a function? Is it even possible?
Any input would be helpful. If PyNomo cannot do it, please suggest some other package (in any language). For example, I am trying the nomogram function from the rms package in R, but I am not having luck figuring out how to use it properly. I have asked a separate question for that here.
The term "nomogram" has become somewhat confused of late as it now refers to two entirely different things.
A classic nomogram performs a full calculation: you mark two scales, draw a straight line across the marks and read your answer from a third scale. This is the type of nomogram that pynomo produces, and as you correctly say, you need a formula. Producing a nomogram like this from data is therefore a two-step process: first fit an equation to the data, then feed that equation to pynomo.
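As a rough illustration of that first step (my own sketch, with made-up data and an assumed linear relationship), you would fit an equation to your data and then encode the fitted equation in pynomo's block definitions:

    import numpy as np

    # Hypothetical data: predict z from x and y.
    rng = np.random.RandomState(0)
    x = rng.uniform(0, 10, 200)
    y = rng.uniform(0, 5, 200)
    z = 2.0 * x - 1.5 * y + rng.normal(0, 0.1, 200)

    # Least-squares fit of z = a*x + b*y + c.
    A = np.column_stack([x, y, np.ones_like(x)])
    a, b, c = np.linalg.lstsq(A, z, rcond=None)[0]
    print("fitted equation: z = %.3f*x + %.3f*y + %.3f" % (a, b, c))
    # The fitted equation is what you would then describe in pynomo's
    # block definitions to draw the actual nomogram.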
The other use of the term (very popular recently) is to refer to regression nomograms. These are graphical depictions of regression models (usually logistic regression models). For these, a group of parallel predictor variables is depicted with a common scale on the bottom; for each predictor you read off a 'score' from the scale and add these up. These types of nomograms have become very popular in the last few years, and that's what the rms package will draw. I haven't used it, but my understanding is that it works directly from the data.
Hope this is of some use! :-)
I'm trying to figure out a procedure to perform clustering on a set of data with 52 dimensions. This is purely for my own learning so I have a data set of known fields. The data is from retrosheet.org Gamelogs using the World Series data set. I'm attempting to use only columns 25-77, so only the integers, ignoring the string data.
This is my first attempt at unsupervised learning and while I understand the concepts, I'm struggling to implement a solution in Python. I've been using scipy and numpy. If anyone knows a good place to start or some suggestions on tackling this problem, I'd appreciate it.
Scikit-learn is the way to go for clustering in Python. See http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#example-cluster-plot-kmeans-digits-py for a demo and code for clustering with 64 features. It would be good to start with the tutorial at http://scikit-learn.org/stable/tutorial/basic/tutorial.html, apply what you learn there to your dataset, and then move on to k-means clustering.
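To give a rough idea of what this looks like for your case (a sketch only: the file name "gamelogs.csv", the exact column range and the choice of k=5 are assumptions you would adjust):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Columns 25-77 in 1-based terms; genfromtxt's usecols is 0-based.
    data = np.genfromtxt("gamelogs.csv", delimiter=",", usecols=range(24, 77))

    X = StandardScaler().fit_transform(data)   # scale features before clustering
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
    print(np.bincount(labels))                 # how many games fall in each cluster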
I recently started working on document clustering using the scikit-learn module in Python. However, I am having a hard time understanding the basics of document clustering.
What I know?
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents into a vector space model that is then fed to the algorithm.
There are many algorithms, such as k-means, neural networks and hierarchical clustering, to accomplish this.
My Data:
I am experimenting with LinkedIn data; each document would be a LinkedIn profile summary, and I would like to see if documents for similar jobs get clustered together.
Current Challenges:
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF-IDF. Is there any proper way to handle this high-dimensional data?
K-means and other algorithms require that I specify the number of clusters (centroids); in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms which can determine the number of clusters themselves?
I've never worked with document clustering before; if you are aware of tutorials, textbooks or articles which address this issue, please feel free to suggest them.
I went through the code on the scikit-learn webpage; it contains too many technical terms that I do not understand. If you have any code with good explanations or comments, please share. Thanks in advance.
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF-IDF. Is there any proper way to handle this high-dimensional data?
My first suggestion is that you don't, unless you absolutely have to because of memory or execution-time problems.
If you must handle it, you should use dimensionality reduction (PCA, for example) or feature selection (probably better in your case; see chi2, for example).
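As a small illustration of the dimensionality-reduction route (my own sketch with made-up documents; TruncatedSVD is used because, unlike plain PCA, it works directly on sparse TF-IDF matrices):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["python developer with machine learning experience",
            "java backend engineer",
            "data scientist working on nlp and clustering"]

    tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
    X = tfidf.fit_transform(docs)          # sparse matrix, shape (n_docs, n_terms)

    svd = TruncatedSVD(n_components=2)     # use e.g. 100-300 components on real data
    X_reduced = svd.fit_transform(X)       # dense matrix, shape (n_docs, n_components)
    print(X_reduced.shape)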
K-means and other algorithms require that I specify the number of clusters (centroids); in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms which can determine the number of clusters themselves?
If you look at the clustering algorithms available in scikit-learn, you'll see that not all of them require that you specify the number of clusters.
One that does not is hierarchical clustering, implemented in scipy. Also see this answer.
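A minimal sketch of the scipy route, with a made-up feature matrix: build the linkage tree once, then cut it at a distance threshold instead of fixing the number of clusters up front.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 10)                          # stand-in feature matrix
    Z = linkage(X, method="ward")                       # build the merge tree
    labels = fcluster(Z, t=2.0, criterion="distance")   # cut the tree at distance 2.0
    print(len(set(labels)), "clusters")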
I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.
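One way to do that tweaking in code (my own suggestion, not part of the original answer) is to loop over a few values of k and compare a quality measure such as the silhouette score:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(200, 20)    # stand-in for your TF-IDF (or reduced) matrix
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))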
I've never worked with document clustering before; if you are aware of tutorials, textbooks or articles which address this issue, please feel free to suggest them.
Scikit-learn has a lot of tutorials for working with text data; just use the "text data" search query on their site. One is for KMeans, others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style and syntax point of view, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents into a vector space model that is then fed to the algorithm.
A minor correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data; it does not care what you do with that data (clustering, classification, regression, search engines, etc.) afterwards.
I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It is done using a clustering algorithm; TF-IDF only plays a preprocessing role in document clustering.
For the large matrix after the TF-IDF transformation, consider using a sparse matrix.
You could try different k values. I am not an expert in unsupervised clustering algorithms, but I expect that with such algorithms and different parameters you could also end up with a varying number of clusters.
This link might be useful; it provides a good amount of explanation of k-means clustering with visual output: http://brandonrose.org/clustering
I'm trying to do a PCA analysis on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python?
Thanks.
Imputing data will skew the result in ways that might bias the PCA estimates. A better approach is to use a PPCA algorithm, which gives the same result as PCA, but in some implementations can deal with missing data more robustly.
I have found two libraries:
the PPCA package on PyPI, which is called PCA-magic on GitHub;
the PyPPCA package, which has the same name on PyPI and GitHub.
Since the packages are in low maintenance, you might want to implement it yourself instead. The code in the packages above builds on the theory presented in the widely cited (and well written!) paper by Tipping and Bishop (1999), which is available on Tipping's home page if you want guidance on how to implement PPCA properly.
As an aside, the sklearn implementation of PCA is actually a PPCA implementation based on Tipping and Bishop (1999), but they have not chosen to implement it in a way that handles missing values.
EDIT: both of the libraries above had issues, so I could not use them directly myself. I forked PyPPCA and bug-fixed it; it is available on GitHub.
I think you will probably need to do some preprocessing of the data before doing PCA.
You can use:
sklearn.impute.SimpleImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
With this class you can automatically replace the missing values with the mean, median or most frequent value. Which of these options is best is hard to tell; it depends on many factors, such as what the data looks like.
By the way, you can also use PCA using the same library with:
sklearn.decomposition.PCA
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
And many other statistical functions and machine learning techniques.
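A minimal sketch combining the two steps described above (impute, then run PCA), on made-up data with missing values:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.decomposition import PCA

    X = np.array([[1.0, 2.0, np.nan],
                  [4.0, np.nan, 6.0],
                  [7.0, 8.0, 9.0],
                  [np.nan, 5.0, 3.0]])

    X_filled = SimpleImputer(strategy="mean").fit_transform(X)   # or "median" / "most_frequent"
    components = PCA(n_components=2).fit_transform(X_filled)
    print(components.shape)    # (4, 2)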