Hi, I am starting to learn scikit-learn, but I am not interested in the Iris dataset or the Orlando real estate prices that all these tutorials use. They do not make any sense to me. I want to use my own data, but I cannot figure out what input format should be used.
This is how my Code looks:
import matplotlib.pyplot as plt
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100)
x,y = [[1,2], [2,4]]
clf.fit(x,y)
I always get the message:
ValueError: Found input variables with inconsistent numbers of
samples: [1, 2]
I tried many other formats, like [[1],[1]] or 1,1.
So my simple question is: in which format do I have to write
"x,y = [[1,2], [2,4]]" for my data?
Also, how can I train a model to make a forecast? For example: I have 10 sports teams in one league.
In my table I have:
Team 1 | Team 2 | Result | Location
So if two teams play against each other, I want to figure out who will win; the location can of course be a factor.
In other words, I want to predict: if team A plays against team B at home, who is more likely to win?
How to enter your data:
The way you entered your data, x contains just a single sample with 2 features, whereas y provides 2 labels.
Use this notation to get 2 samples with one feature each: x,y = [[1],[4]],[2,4]
Or, just to make it more obvious:
x = [[1],[4]]
y = [2,4]
Btw.: given that you are new to scikit-learn, you should definitely try to do the same with NumPy arrays; a minimal sketch follows.
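Here is that same two-sample setup as NumPy arrays (same values as above; the reshape makes the single feature column explicit):
import numpy as np
from sklearn import svm

x = np.array([1, 4]).reshape(-1, 1)  # 2 samples, 1 feature each
y = np.array([2, 4])                 # 2 labels

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(x, y)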
Make a classification:
If you want to make a forecast about who wins, you need to follow a couple of steps (a small end-to-end sketch follows the list):
- Split your data, so that your features ("teamA", "teamB" and "location") are contained in your training data and the results represent the labels, e.g.:
x = [[teamA1,teamB1,Loc1],[teamA2,teamB2,Loc2],[teamA3,teamB3,Loc3],...]
y = [result1,result2,result3,...]
- Fit your model as before
- Make a prediction given your test data, e.g.:
x_test = [[teamX, teamY, locX]]  # data for which you want the forecast (note the 2D shape: a list of samples)
clf.predict(x_test)  # this returns the estimated result
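As a hedged, end-to-end sketch of those steps (every id, location, and result below is made up purely for illustration; teams are encoded as integers, location as 1 = team A at home / 0 = away, result as 1 = team A won / 0 = team B won):
from sklearn import svm

# Hypothetical encoded matches: [teamA_id, teamB_id, teamA_at_home]
x = [[1, 2, 1],
     [3, 4, 0],
     [1, 4, 1],
     [2, 3, 0]]
y = [1, 0, 1, 0]  # made-up results

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(x, y)

print(clf.predict([[1, 3, 1]]))  # forecast: team 1 vs. team 3, team 1 at home
Note that encoding team ids as plain integers is a simplification; one-hot encoding them is usually the better choice, since team 4 is not "twice" team 2.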
I'm working on a machine learning project and I'm stuck with this warning when I try to use cross-validation to determine how many neighbours I need for the best accuracy in kNN. Here's the warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
The dataset I'm using is https://archive.ics.uci.edu/ml/datasets/Student+Performance
In this dataset we have several attributes, but we'll be using only "G1", "G2", "G3", "studytime", "freetime", "health" and "famrel". All the instances in those columns are integers.
Dataset example: https://i.stack.imgur.com/sirSl.png
Next, here's my first chunk of code, where I assign the train and test groups:
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/gdrive')
from sklearn.model_selection import train_test_split
data=pd.read_excel("/gdrive/MyDrive/Colab Notebooks/student-por.xls")
#print(data.head())
data = data[["G1", "G2", "G3", "studytime","freetime","health","famrel"]]
print(data)
predict = "G3"
x = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
print(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
print(len(y))
print(len(x))
That's how I assign x and y. With len, I can see that x and y both have 649 rows, representing 649 students.
Here's the second chunk of code, where I do the cross-validation:
# CROSS-VALIDATION
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

neighbors = list(range(2, 30))
cv_scores = []
# print(y_train)
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=11, scoring='accuracy')
    cv_scores.append(scores.mean())
plt.plot(cv_scores)
plt.show()
The code is pretty self-explanatory, as you can tell.
The warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
happens in every iteration of the for-loop
Although this warning happens every time, plt.show() is still able to plot a graph showing which number of neighbours achieves the best accuracy; I don't know whether the plot, or the readings in cv_scores, are accurate.
My question is: how can a "class in y" have only 1 member, when len(y) clearly says y has 649 instances, more than enough to be split into 59 groups of 11 members each? By "members", does it mean instances in my dataset, or columns/labels in the y group?
I'm not using stratify=y when I do the train/test split; it seems to be the #1 suggested solution to this warning, but it's useless in my case.
I've tried everything I've seen on Google/Stack Overflow and nothing helped; the dataset seems to be the problem, but I can't understand what's wrong.
I think your main mistake is that you are using KNeighborsClassifier, while the feature you want to predict seems to be continuous (G3 - final grade, numeric: from 0 to 20, output target) and not categorical.
In this case, every single value of y is taken as a different possible class or label. The message you obtain is saying that in your dataset (in y) there are values that appear only one time; for example, the value 3 appears only once in your dataset. This is not an error, but it indicates that the model won't work correctly or accurately.
All in all, I strongly recommend that you use sklearn.neighbors.KNeighborsRegressor.
This is the kNN for "continuous" variables rather than classes. Using this model, you shouldn't get this warning anymore; the output value will be the mean of the target values of the nearest neighbours you defined.
With this simple change, your problem will be solved.
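As a minimal sketch, reusing the x_train and y_train variables from the question (cv=11 is kept from the original loop; 'neg_mean_squared_error' replaces 'accuracy', since accuracy is not defined for regression):
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

cv_scores = []
for k in range(2, 30):
    knn = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=11,
                             scoring='neg_mean_squared_error')
    cv_scores.append(scores.mean())  # higher (closer to 0) is better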
I want to import my own data (sentences located in a .txt file) into this example algorithm, which can be found at: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
The problem is that this code uses a make_blobs dataset, and I have a hard time understanding how to replace it with data from a .txt file.
All I can tell is that I need to replace this piece of code right here:
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility
Also, I do not understand the variables X, y. I assume that X is an array of data, but what about y?
Should I just assign everything to X like this, so that the example code would work? And what about those make_blobs parameters like centers, n_features, etc.? Do I need to specify them somehow differently?
# open and read from the txt file
path = "C:/Users/user/Desktop/sentences.txt"
file = open(path, 'r')
# assign it to the X
X = file.readlines()
Any help is appreciated!
Firstly, you need to create a mapping of your words to a number that your k-means algorithm can use.
For example:
I ride a bike and I like it.
1 2 3 4 5 1 6 7 # <- number ids
After that you have a new encoding for your dataset and you can apply k-means. If you want a homogeneous representation of your samples, you should convert them to a one-hot representation: for each word id you create an N-length array, where N is the total number of unique words you have, with a 1 at the position corresponding to that id and 0s everywhere else.
An example of the above for N = 7 would be:
1 -> 1000000
2 -> 0100000
...
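A hedged sketch of this manual encoding for the example sentence (pure Python; word ids are assigned in order of first appearance, matching the numbers above):
sentence = "I ride a bike and I like it".split()
ids = {}
for word in sentence:
    ids.setdefault(word, len(ids) + 1)  # 1-based ids, as in the example
N = len(ids)  # 7 unique words
one_hot = [[1 if ids[word] == pos + 1 else 0 for pos in range(N)]
           for word in sentence]
print([ids[w] for w in sentence])  # [1, 2, 3, 4, 5, 1, 6, 7]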
So now you have an X variable containing your data in a proper format. You don't need y, which would be the corresponding labels for your samples, since k-means is unsupervised. The rest of the example then runs unchanged:
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)
...
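For a runnable sketch of the whole pipeline (using the file path from the question; CountVectorizer(binary=True) is a practical shortcut that produces exactly the one-column-per-unique-word encoding described above, and n_clusters=4 is an arbitrary choice):
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Read one sentence per line from the question's file
path = "C:/Users/user/Desktop/sentences.txt"
with open(path, 'r') as file:
    sentences = file.readlines()

# One column per unique word; binary=True marks presence/absence
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(sentences).toarray()

clusterer = KMeans(n_clusters=4, random_state=10)
cluster_labels = clusterer.fit_predict(X)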
How can I use more than one feature/class as input/output on an LSTM using Sequential from Keras models in Python?
To be more specific, I would like to use as input and output to the network: [FeatureA][FeatureB][FeatureC].
FeatureA is a categorical class with 100 different possible values, indicating the sensor which collected the data;
FeatureB is an on/off indicator, being 0 or 1;
FeatureC is also a categorical class, with 5 unique values.
Data Example:
1. 40,1,5
2. 58,1,2
3. 57,1,5
4. 40,0,1
5. 57,1,4
6. 23,0,3
When using the raw data and loss='categorical_crossentropy' in model.compile, the loss is over 10.0.
When normalising the data to values between 0 and 1 and using mean_squared_error as the loss, it averages 0.27. But when testing the predictions, the results do not make any sense.
Any suggestions here or tutorials I could consult?
Thanks in advance.
You need to convert FeatureC to a binary category. Mean squared error is for regression, and as best I can tell you are trying to predict which class a certain combination of sensors and states belongs to. Since there are 5 possible classes, you can think of it as trying to predict whether the class is Red, Green, Blue, Yellow, or Purple. Right now those are represented by numbers, but with a regression your model is going to predict values like 3.24, which doesn't make sense.
In effect you're converting the values of FeatureC to 5 columns of binary values. Since it seems the classes are exclusive, there should be a single 1 in each row and the rest of its columns should be 0s. So if the first row is 'red', it would be [1, 0, 0, 0, 0].
For best results you should also convert FeatureA to binary categorical features. For the same reason as above, sensor 80 is not 4x sensor 20, but a different entity.
The last layer of your model should be a softmax layer with 5 neurons; in the end, your model is going to predict the probability of each class.
It looks like you are working with a dataframe since there's an index. Therefore I would try:
import keras
import numpy as np
import pandas as pd # assume that this has probably already been done though
feature_a = data.loc[:, "FeatureA"].values # pull out feature A
labels = data.loc[:, "FeatureC"].values # y are the labels that we're trying to predict
feature_b = data.loc[:, "FeatureB"].values # pull out B to make things more clear later
# make sure that feature_b.shape = (rows, 1) otherwise reset the shape
# so hstack works later
feature_b = feature_b.reshape(feature_b.shape[0], 1)
labels -= 1 # subtract 1 from all labels values to zero base (0 - 4)
y = keras.utils.to_categorical(labels)
# y.shape should be (rows, 5)
# convert 1-100 to binary columns
# zero base again
feature_a -= 1
# Before: feature_a.shape=(rows, 1)
feature_a_data = keras.utils.to_categorical(feature_a)
# After: feature_a_data.shape=(rows, 100)
data = np.hstack([feature_a_data, feature_b])
# data.shape should be (rows, 101)
# y.shape should be (rows, 5)
Now you're ready to train/test split and so on.
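To make the softmax point concrete, here is a minimal model sketch consuming that (rows, 101) matrix; the hidden-layer size, optimizer, and epoch count are illustrative assumptions, not part of the original answer:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(101,)))  # 100 one-hot sensors + FeatureB
model.add(Dense(5, activation='softmax'))                    # one probability per FeatureC class
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(data, y, epochs=10, batch_size=32)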
Here's something to look at which has multi-class prediction:
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
I've been messing around with Random Forest models lately, and they are really useful with the feature_importances_ attribute!
It would be useful to know which variables are more predictive of particular targets.
For example, what if the 1st and 2nd attributes were more predictive of distinguishing target 0, but the 3rd and 4th attributes were more predictive of target 1?
Is there a way to get the feature_importances_ array for each target separately? With sklearn, scipy, pandas, or numpy preferably.
# Iris dataset
import pandas as pd
from sklearn.datasets import load_iris

DF_iris = pd.DataFrame(load_iris().data,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns=load_iris().feature_names)
Se_iris = pd.Series(load_iris().target,
                    index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                    name="Species")

# Import modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is long deprecated
# Split Data
X_tr, X_te, y_tr, y_te = train_test_split(DF_iris, Se_iris, test_size=0.3, random_state=0)
# Create model
Mod_rf = RandomForestClassifier(random_state=0)
Mod_rf.fit(X_tr,y_tr)
# Variable Importance
Mod_rf.feature_importances_
# array([ 0.14334485, 0.0264803 , 0.40058315, 0.42959169])
# Target groups
Se_iris.unique()
# array([0, 1, 2])
This is not really how RF works. Since there is no simple "feature voting" (which takes place in linear models), it is really hard to say what "feature X is more predictive for target Y" even means. What the feature_importances_ of an RF captures is "how probable it is, in general, for this feature to be used in the decision process". The problem with addressing your question is that if you ask "how probable it is, in general, for this feature to be used in a decision process leading to label Y", you would have to run pretty much the same procedure, but remove all subtrees which do not contain label Y in a leaf; this way you remove the parts of the decision process which do not address the problem "is it Y or not Y", but rather try to answer which "not Y" it is. However, in practice, due to the very stochastic nature of RF, cutting its depth, etc., this might barely remove anything. The bad news is also that I have never seen it implemented in any standard RF library; you could do it on your own, though, just the way I said:
for i = 1 to K:  # K is the number of distinct labels
    tmp_RF = deepcopy(RF)
    for tree in tmp_RF:
        tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
        for x in X:  # X is your dataset
            features_importance[i] += how_many_times_each_feature_is_used(tree, x) / |X|
    features_importance[i] /= |tmp_RF|
return features_importance
In particular, you could reuse the existing feature-importance code, simply by doing:
for i = 1 to K:  # K is the number of distinct labels
    tmp_RF = deepcopy(RF)
    for tree in tmp_RF:
        tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
    features_importance[i] = run_regular_feature_importance(tmp_RF)
return features_importance
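If the subtree surgery is too invasive, a rough but runnable approximation (a different technique than the pseudocode above, offered only as a substitute) is to retrain one RF per one-vs-rest problem and read each model's feature_importances_; this measures importance for separating target Y from everything else, which is close to, but not the same as, the quantity discussed above. Reusing X_tr and y_tr from the question:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

per_class_importances = {}
for label in np.unique(y_tr):
    # Binary one-vs-rest target: is this sample of class `label` or not?
    y_binary = (y_tr == label).astype(int)
    rf = RandomForestClassifier(random_state=0)
    rf.fit(X_tr, y_binary)
    per_class_importances[label] = rf.feature_importances_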
In the following code I create a random sample set of size 50, with 20 features each. I then generate a random target vector composed of half True and half False values.
All of the values are stored in Pandas objects, since this simulates a real scenario in which the data will be given in that way.
I then perform a manual leave-one-out inside a loop, each time selecting an index, dropping its respective data, fitting the rest of the data using a default SVC, and finally running a prediction on the left-out data.
import random
import numpy as np
import pandas as pd
from sklearn.svm import SVC
n_samp = 50
m_features = 20
X_val = np.random.rand(n_samp, m_features)
X = pd.DataFrame(X_val, index=range(n_samp))
# print X_val
y_val = [True] * (n_samp // 2) + [False] * (n_samp // 2)
random.shuffle(y_val)
y = pd.Series(y_val, index=range(n_samp))
# print y_val
success_count = 0
for idx in y.index:
    clf = SVC()  # Can be inside or outside the loop. Result is the same.

    # Leave-one-out for the fitting phase
    loo_X = X.drop(idx)
    loo_y = y.drop(idx)
    clf.fit(loo_X.values, loo_y.values)

    # Make a prediction on the sample that was left out
    pred_X = X.loc[idx:idx]
    pred_result = clf.predict(pred_X.values)
    print(y.loc[idx], pred_result[0])  # Actual value vs. predicted value - always opposite!
    is_success = y.loc[idx] == pred_result[0]
    success_count += 1 if is_success else 0

print('\nSuccess Count:', success_count)  # Almost always 0!
Now here's the strange part: I expect to get an accuracy of about 50%, since this is random data, but instead I almost always get exactly 0! I say almost always, since roughly every 10 runs of this exact code I get a few correct hits.
What's really crazy to me is that if I choose the answers opposite to those predicted, I will get 100% accuracy. On random data!
What am I missing here?
Ok, I think I just figured it out! It all comes down to our old machine learning foe - the majority class.
In more detail: I chose a target comprising 25 True and 25 False values - perfectly balanced. When performing the leave-one-out, this caused a class imbalance, say 24 True and 25 False. Since the SVC was set to default parameters, and run on random data, it probably couldn't find any way to predict the result other than choosing the majority class, which in this iteration would be False! So in every iteration the imbalance was turned against the currently-left-out sample.
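A quick way to sanity-check this explanation (my own sketch, not part of the original code): swap the SVC for sklearn's DummyClassifier with strategy='most_frequent', which predicts the majority class by construction, and the same near-zero leave-one-out accuracy should appear:
from sklearn.dummy import DummyClassifier

success_count = 0
for idx in y.index:
    clf = DummyClassifier(strategy='most_frequent')
    clf.fit(X.drop(idx).values, y.drop(idx).values)
    # After the drop, the left-out sample's class is always the minority,
    # so the majority-class prediction is always wrong.
    success_count += int(clf.predict(X.loc[idx:idx].values)[0] == y.loc[idx])
print('Success Count:', success_count)  # expected: exactly 0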
All in all - a good lesson in machine learning, and an excellent mathematical riddle to share with your friends :)