Simple machine learning problem (SVM, random forest) - Python

I am trying to solve a machine learning task but have encountered some problems. Any tips would be greatly appreciated. One of my questions is: how do you create a correlation matrix for two dataframes (data for two labels) of different sizes, to see whether you can combine them into one?
Here is the full text of the task:
This dataset is composed of 1100 samples with 30 features each. The first column is the sample id. The second column in the dataset represents the label. There are 4 possible values for the labels. The remaining columns are numeric features.
Notice that the classes are unbalanced: some labels are more frequent than others. You need to decide whether to take this into account, and if so how.
Compare the performance of a Support-Vector Machine (implemented by sklearn.svm.LinearSVC) with that of a RandomForest (implemented by sklearn.ensemble.ExtraTreesClassifier). Try to optimize both algorithms' parameters and determine which one is best for this dataset. At the end of the analysis, you should have chosen an algorithm and its optimal set of parameters.
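For context, a minimal sketch of the kind of comparison the task describes (the parameter grids and the X, y variables are assumptions, not part of the task); class_weight="balanced" is one common way to account for the unbalanced labels:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# X, y: the numeric features and the labels from the dataset.
svm_search = GridSearchCV(
    LinearSVC(class_weight="balanced", max_iter=10000),
    {"C": [0.01, 0.1, 1, 10]},
    scoring="f1_macro",
    cv=5,
)
forest_search = GridSearchCV(
    ExtraTreesClassifier(class_weight="balanced", random_state=0),
    {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    scoring="f1_macro",
    cv=5,
)
for search in (svm_search, forest_search):
    search.fit(X, y)
    print(search.best_params_, search.best_score_)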
I have tried to make a correlation matrix for the rows with the lower-cardinality labels, but I am not convinced it is reliable.
I made two new dataframes from the rows that have labels 1 and 2. There are 100-150 entries for each of those two labels, compared to about 400 each for labels 0 and 3. I wanted to check whether there is high correlation between the data labeled 1 and 2, to see if I could combine them, but I don't know if this is the right approach. I made the dataframes the same size by appending zeros to the smaller one and then computed a correlation matrix for both datasets together. Is this a correct approach?
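For reference, here is a minimal sketch of one way to compare the two label groups without zero-padding (the column names "id" and "label" are assumptions): give each group its own feature-correlation matrix and compare those, rather than correlating the two groups row by row.

# df is assumed to hold the full dataset: id, label, then 28 numeric features.
df1 = df[df["label"] == 1].drop(columns=["id", "label"])
df2 = df[df["label"] == 2].drop(columns=["id", "label"])

# Each group gets its own 28x28 feature-correlation matrix; similar matrices
# (and similar per-feature means) hint that the two classes may behave alike.
corr_diff = (df1.corr() - df2.corr()).abs()
print(corr_diff.max().max())                # worst-case disagreement between the two structures
print((df1.mean() - df2.mean()).abs())      # per-feature mean differences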

Your question and approach are not clear. Can you edit the question to include the problem statement and a few of the data sets you have been given?
If you want to visualize your data set, plot it in 2, 3 or 4 dimensions.
There are many plotting tools, like 3D scatter plots, pair plots, histograms and many more; use them to better understand your data sets.
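For instance, a quick pair-plot sketch (assuming the data sits in a DataFrame df with a "label" column; seaborn colours each class differently):

import matplotlib.pyplot as plt
import seaborn as sns

cols = df.columns[2:6]  # a handful of feature columns; plotting all 28 at once is unreadable
sns.pairplot(df, vars=cols, hue="label")
plt.show()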

Selecting a subset of array

I am trying to create a histogram to be able to identify the "noise" in my dataset. However, the dataset has a large spread, so a lot of the details I am interested in get lumped into the "zero column". I have a dataset that ranges from -0.0001 to 10000, but the bulk of the values lie around 0.1. I want to look specifically at the spread around zero, and I have 600,000 data points. My question is thus: should I create a "new" dataset and remove everything above, let's say, 1 to be able to see the distribution around this point in more detail, and if so, how do I do this in pandas? And if not, is there another way of creating a histogram that has bins in the decimal ranges? Below is what I have so far; "depo" represents my dataset.
import numpy as np

bins = np.arange(20)   # integer-width bins 0..19: far too coarse to resolve the region around zero
depo.hist(bins=bins)
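A minimal sketch of both options (assuming depo is a pandas Series):

import matplotlib.pyplot as plt
import numpy as np

# Option 1: keep only the region of interest via boolean indexing,
# then use fine-grained decimal bins.
near_zero = depo[depo.abs() <= 1]
near_zero.hist(bins=np.arange(-1, 1.01, 0.01))  # 0.01-wide bins around zero
plt.show()

# Option 2: no new dataset; just tell hist which range to bin
# (values outside the range are simply not drawn).
depo.hist(bins=200, range=(-1, 1))
plt.show()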

RandomForest regressor with all independent variables categorical

I am stuck in the process of building a model. Basically, I have 10 parameters, all of which are categorical variables, and the categories have a large number of unique values (one category has 1,335 unique values across 300,000 records). The y value to be predicted is the number of days (numerical). I am using RandomForestRegressor and getting an accuracy of around 55-60%. I am not sure if this is the upper limit or whether I really need to change the algorithm itself. I am open to any kind of solution.
Having up to 1,335 categories for a categorical dimension might cause a random forest regressor (or classifier) some headache, depending on how categorical dimensions are handled internally; things will also depend on the frequency distribution of the categories. What library are you using for the random forest regression?
Have you tried converting the categorical dimensions into unique integer IDs and interpreting this representation as a real-valued dimension? In my experience, this can raise the variable importance of many types of categorical dimensions. (At times the inherent/initial ordering of the categories can provide useful grouping/partitioning information.)
You can even shuffle your dimensions a few times and use these as input dimensions. I'll try to explain with an example:
You have a categorical dimension x1 with categories [c11,c12,...,c1n]
We can easily map these categories to numerical values: x1 takes the value 1 if its category is c11, the value 2 if its category is c12, and in general the value i for category c1i.
Use this new non-categorical dimension as an input dimension for training (you will have to change your input to the regressor accordingly later on).
You can go further than this. Shuffle the order of your categories of x1 randomly, so you get, for example, [c13,c19,c1n,c1i,...,c12]. Do the same thing as above and you have another new non-categorical input dimension. (Note that you will have to remember the shuffling order for the sake of regression later on.)
I'm curious whether adding a few (anywhere between 1 and 100, or whatever number you choose) dimensions like this can improve your performance.
Please check how performance changes for different numbers of such dimensions. (But be aware that more such dimensions will cost you preprocessing time at regression.)
The last line in the code block below sketches combining multiple categorical dimensions at once; consider it only for inspiration.
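A minimal sketch of these ideas (pandas-based; the column names cat_a and cat_b are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def encode_with_shuffles(df, col, n_shuffles=3):
    """Map a categorical column to integer IDs, plus a few randomly
    re-ordered integer encodings as extra input dimensions."""
    cats = df[col].astype("category").cat.categories
    out = pd.DataFrame(index=df.index)
    mappings = {}  # keep each order so the same encoding can be applied at prediction time
    out[f"{col}_id"] = df[col].astype("category").cat.codes
    for k in range(n_shuffles):
        mapping = dict(zip(cats, rng.permutation(len(cats))))
        mappings[k] = mapping
        out[f"{col}_shuffle{k}"] = df[col].map(mapping)
    return out, mappings

# Combining two categorical dimensions at once (an interaction feature):
# interaction = df["cat_a"].astype(str) + "_" + df["cat_b"].astype(str)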
Another idea would be to check whether some form of linear classifier on the one-hot encodings of each individual category, across multiple categorical dimensions, might improve things. (This can help you find useful orderings more quickly than the approach above.)
I am sure you need to do more processing on your data.
Having 1,335 unique values in one variable is bizarre.
Please, if the data is public, share it with me; I want to take a look.

What is the best algorithm to cluster this data?

Can someone help me find a good clustering algorithm that will cluster this into 3 clusters, without my defining the number of clusters?
I have tried many algorithms in their basic form, but nothing seems to work properly.
clustering = AgglomerativeClustering().fit(temp)
In the same way I tried DBSCAN and k-means too, just following the guidelines from sklearn, but I couldn't get the expected results.
My original data set is a 1D list of numbers, but the order of the numbers matters, so I generated a 2D list as below:
from sklearn.cluster import AgglomerativeClustering

# Pair each value with its position so the order information is kept.
temp = []
for i in range(len(avgs)):
    temp.append([avgs[i], i + 1])

clustering = AgglomerativeClustering().fit(temp)
When plotting, I used a similar index range on the x axis:
ax2.scatter(range(len(plots[i])), plots[i], c=[np.random.rand(3)])  # one random colour per series
The order of the data matters, so this needs to be clustered into 3 clusters; and there may be other data sets where the data is so uniform that the result should be just one cluster.
Link to the list if someone wants to try.
I tried using step detection according to your answer and got the following image. But how can I find the values of the peaks? If I take the max value I can get one of them, but how do I get the rest? Taking the second-largest value is not the answer, because the point right next to the max is the second-largest.
Your data is not 2D coordinates, so don't choose an algorithm designed for that!
Instead, your data appears to be sequential, or a time series.
What you want to use is a change point detection algorithm, capable of detecting a change in the mean value of a series.
A simple approach would be to compute the sum of the next 10 points minus the sum of the previous 10 points, then look for extreme values of this curve.
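A minimal sketch of this windowed-difference idea (assuming avgs is the 1D list from the question); scipy's find_peaks with a distance constraint returns all prominent peaks rather than only the single maximum, which also addresses the follow-up question about the value right next to the max:

import numpy as np
from scipy.signal import find_peaks

a = np.asarray(avgs, dtype=float)
w = 10                                         # window size
c = np.concatenate([[0.0], np.cumsum(a)])      # c[j] = sum of the first j values

i = np.arange(w, len(a) - w + 1)
diff = (c[i + w] - c[i]) - (c[i] - c[i - w])   # sum of next w points minus sum of previous w

# Peaks of |diff| are candidate change points; distance=w suppresses the
# near-duplicate values sitting right next to each maximum.
peaks, _ = find_peaks(np.abs(diff), distance=w)
change_points = i[peaks]                       # indices back in avgs
print(change_points)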

How to visualize k-means of multiple columns

I'm not a data scientist; however, I am intrigued by data science, machine learning, etc.
In my efforts to understand all of this, I am continuously building a dataset (daily scraping) of Grand Exchange prices from one of my favourite games, Old School RuneScape.
One of my goals is to pick a set of stocks/items that would give me the most profit. Currently I am trying out clustering with k-means, to find stocks that are similar to each other based on some basic features that I could think of.
However, I have no clue if what I'm doing is correct.
For example:
(y = kmeans.fit_predict(df_items): my item_id is included in this, so is it actually considering item_id as a feature now?)
And how do I even visualise the outcome of this? I mean, what goes on the x axis and what goes on the y axis? I have multiple columns...
https://github.com/extreme4all/OSRS_DataSet/blob/master/NoteBooks/Stock%20Picking.ipynb
To visualize something you have to reduce the dimensionality to 2-3 dimensions; in addition, you can use color as a fourth dimension, or in your case to indicate the cluster number.
t-SNE is a common choice for this task; check the sklearn docs for details: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
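A minimal sketch of this suggestion (assuming df_items holds the numeric feature columns, and item_id is dropped first since it is an identifier rather than a feature; n_clusters=4 is an arbitrary choice here):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Identifiers out, features scaled (k-means is scale-sensitive).
X = StandardScaler().fit_transform(df_items.drop(columns=["item_id"]))
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Project to 2D purely for display; the cluster labels come from the full feature space.
emb = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=10)
plt.title("k-means clusters in a t-SNE projection")
plt.show()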
Choose almost any visualization technique for multivariate data.
Scatterplot matrix
Parallel coordinates
Dimensionality reduction (PCA makes more sense for k-means than t-SNE, but also consider Fisher's LDA, LMNN, etc.)
Box plots
Violin plots
...
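For the dimensionality reduction option, a short sketch with PCA (same assumptions about df_items as in the sketch above); unlike t-SNE, the fitted linear projection can also place the k-means cluster centres on the same plot:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df_items.drop(columns=["item_id"]))
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

pca = PCA(n_components=2)
emb = pca.fit_transform(X)                      # samples in the first two principal components
centres = pca.transform(km.cluster_centers_)    # cluster centres in the same projection

plt.scatter(emb[:, 0], emb[:, 1], c=km.labels_, cmap="tab10", s=10)
plt.scatter(centres[:, 0], centres[:, 1], marker="x", color="black", s=80)
plt.show()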

How can I fine-tune k-means clustering when I'm only getting clusters in lines?

It's my first time doing k-means clustering using Python and scikit-learn, and I don't know what to make of my final cluster plot or how to fine-tune my k-means clustering algorithm.
My end goal is to find a clustering of user categories that delineates some interesting or useful behavior traits.
ATTEMPT 1:
Input: Gender, Age Range, Country (all one-hot encoded because the data is categorical), and Account Age (numerical, in weeks)
Code:
import sklearn.cluster
import pandas as pd
from matplotlib import pyplot

# Convert DataFrame to matrix
mat2 = all_dummy.to_numpy()  # .as_matrix() was deprecated and removed in newer pandas
# Using sklearn
km2 = sklearn.cluster.KMeans(n_clusters=6)
km2.fit(mat2)
# Get cluster assignment labels
labels2 = km2.labels_
# Format results as a DataFrame
results2 = pd.DataFrame([all_dummy.index, labels2]).T
plot_x2 = results2[0].tolist()
plot_y2 = results2[1].tolist()
pyplot.scatter(plot_x2, plot_y2)
pyplot.show()
Plot:
Specific Questions:
What is the X and Y axis of this graph?
What is this graph even telling me?
Why are there only 3 clusters showing up when I put 6 clusters as an input? (answered by first comment and updated code and graph)
How can I fine-tune this graph to tell me more and show me a useful relationship, if I don't know what the relationship I am looking for is?
Read up on the limitations of k-means.
In particular, be aware that
you must remove all identifier columns
k-means is very sensitive to scale. All attributes need to be carefully scaled according to their value range, distribution, and importance; preprocessing is essential (see the sketch after this list)!
k-means assumes continuous variables. Its use on categorical data, even when one-hot encoded, is questionable: it sometimes works "okayish" but hardly ever works well.
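A minimal preprocessing sketch for these points (assuming all_dummy is the one-hot encoded DataFrame and "user_id" is a hypothetical identifier column):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = all_dummy.drop(columns=["user_id"], errors="ignore")  # remove identifier columns
X = StandardScaler().fit_transform(features)                     # put all attributes on a comparable scale

labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)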
According to your code, the X axis corresponds to the indices of your samples (seeing your graph, I suppose you have around 10,000 users), and the Y axis corresponds to the cluster label of each sample.
You might not actually be plotting 6 clusters. When you originally formatted your results as a DataFrame, a labels variable was used, while it is actually labels2 that contains the computed cluster assignments. I don't know where your labels came from, but I suspect this is the reason you obtained those results. Hence, regarding question 2, this graph probably doesn't show anything relevant.
You could first use other visualisations to better understand how your data is being clustered. Sklearn's documentation provides many examples you could use for inspiration (1, 2, 3).
Hope it helped!
