Decision tree only predicts one class - python

I am fitting a decision tree on the following dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data
And the following is my code:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

balance_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
                           sep=',', header=None)

# label-encode every column (the same encoder is refitted column by column)
le = preprocessing.LabelEncoder()
balance_data = balance_data.apply(le.fit_transform)

X = balance_data.values[:, 0:5]   # note: this selects columns 0-4 only
Y = balance_data.values[:, 6]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=100)

# using Gini index
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)

# using Information Gain
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)

# Gini prediction
y_pred = clf_gini.predict(X_test)
y_pred

# IG prediction
y_pred_en = clf_entropy.predict(X_test)
y_pred_en
In both cases Gini Index and IG, the output is following:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,])
Is there a problem with the training? Moreover, how can I convert these numeric predictions back to their string values?
Edit 1: I calculated the accuracy and it is about 71%. Is it possible that the only problem is in how the output is displayed?

Your dataset is unbalanced
Given that your data looks like this:
0 1 2 3 4 5 6
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
And that your target variable is column 6 (Y = balance_data.values[:,6]), a quick look at the target variable distribution leads to the conclusion that your dataset is unbalanced.
In fact, when starting a new machine learning project, one of the first tasks is to check whether your dataset is unbalanced. This can be done by counting how many observations fall into each target value.
Since your data is a pandas DataFrame, you get the value distribution as follows:
In [46]: balance_data.iloc[:,6].value_counts()
Out[46]:
unacc 1210
acc 384
good 69
vgood 65
Name: 6, dtype: int64
As you can see, the dataset consists mainly of observations with the target value unacc, 70% to be precise:
In [49]: 1210/1728.
Out[49]: 0.7002314814814815
As you mentioned, the accuracy of your model is around 71%, which corresponds to the share of the target value unacc in the overall dataset. In other words, your trees have most likely learned to predict the majority class unacc for every sample, which is exactly what your prediction output shows.
There are several techniques to overcome this problem; check the following links for detailed tutorials:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/
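One quick thing to try, staying within scikit-learn, is to weight the classes inversely to their frequency and to look at a per-class report instead of plain accuracy. The sketch below is not necessarily the fix, just one of those techniques; it re-reads the raw CSV so that a separate target encoder (target_enc, a name introduced here) can later map the numeric predictions back to strings, which also answers the second part of the question:

```python
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Re-read the raw strings so the target encoder can be inverted later
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
raw = pd.read_csv(url, sep=",", header=None)

feature_enc = OrdinalEncoder()            # encodes the six feature columns
target_enc = LabelEncoder()               # separate encoder for the target column
X = feature_enc.fit_transform(raw.iloc[:, 0:6])
Y = target_enc.fit_transform(raw.iloc[:, 6])

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2,
                                                     random_state=100)

# class_weight='balanced' reweights samples inversely to class frequency
clf = DecisionTreeClassifier(criterion="gini", random_state=100,
                             max_depth=3, min_samples_leaf=5,
                             class_weight="balanced")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class metrics show whether the minority classes are predicted at all
print(classification_report(y_test, y_pred, target_names=target_enc.classes_))

# Map the numeric predictions back to the original string labels
print(target_enc.inverse_transform(y_pred)[:10])
```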

Related

How to separate the Keras MNIST dataset into 5 groups of (0 1), (2 3), (4 5), (6 7), (8 9)

I am new to machine learning. I am trying to train models on the Keras MNIST dataset, but I want to train the models on the 5 groups separately. Can someone please advise how to separate the MNIST dataset into the specified groups?
I have tried Googling for quite some time, but couldn't figure out how to do this.
Many thanks in advance!
How about using a for loop:
from keras.datasets import mnist
import numpy as np

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Pool train and test so each group contains all available samples
images = np.concatenate((train_images, test_images), axis=0)
labels = np.concatenate((train_labels, test_labels), axis=0)

groups = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
images_and_labels_by_group = []
for group in groups:
    # indices of all samples whose label belongs to this group
    indices = np.where(np.isin(labels, group))[0]
    group_images = images[indices]
    group_labels = labels[indices]
    images_and_labels_by_group.append((group_images, group_labels))
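For instance, the first group (digits 0 and 1) could then be pulled out and inspected like this (a small usage sketch, reusing the variable names from the snippet above):

```python
group0_images, group0_labels = images_and_labels_by_group[0]
print(group0_images.shape)        # (N, 28, 28) for the pooled 0/1 samples
print(np.unique(group0_labels))   # -> [0 1]
```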
Here is one of the ways to achieve that:
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

y_train_5cs = np.zeros_like(y_train)
y_test_5cs = np.zeros_like(y_test)

# Integer division by 2 maps 0,1 -> 0; 2,3 -> 1; 4,5 -> 2; 6,7 -> 3; 8,9 -> 4
for sn, lbl in enumerate(y_train):
    y_train_5cs[sn] = lbl // 2
for sn, lbl in enumerate(y_test):
    y_test_5cs[sn] = lbl // 2

print('y_train[:10]: ', y_train[:10])
print('y_train_5cs[:10]:', y_train_5cs[:10])
print('\n')
print('y_test[:10]: ', y_test[:10])
print('y_test_5cs[:10]:', y_test_5cs[:10])
Output:
y_train[:10]: [5 0 4 1 9 2 1 3 1 4]
y_train_5cs[:10]: [2 0 2 0 4 1 0 1 0 2]
y_test[:10]: [7 2 1 0 4 1 4 9 5 9]
y_test_5cs[:10]: [3 1 0 0 2 0 2 4 2 4]
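The same relabelling can also be written without the explicit loops, since NumPy applies the integer division element-wise (this assumes the same "two digits per class" grouping as above):

```python
y_train_5cs = y_train // 2
y_test_5cs = y_test // 2
```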

How does Sklearn Naive Bayes Bernoulli Classifier work when the predictors are not binary?

As we know, the Bernoulli Naive Bayes classifier uses binary predictors (features). What I do not get is how BernoulliNB in scikit-learn gives results even when the predictors are not binary. The following example is taken verbatim from the documentation:
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X, Y)
print(clf.predict(X[2:3]))
Output:
array([3])
Here are the first 10 features of X, and they are obviously not binary:
3 4 0 1 3 0 0 1 4 4 1
1 0 2 4 4 0 4 1 4 1 0
2 4 4 0 3 3 0 3 1 0 2
2 2 3 1 4 0 0 3 2 4 1
0 4 0 3 2 4 3 2 4 2 4
3 3 3 3 0 2 3 1 3 2 3
How does BernoulliNB work here even though the predictors are not binary?
This is due to the binarize argument; from the docs:
binarize : float or None, default=0.0
Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
When called with its default value binarize=0.0, as is the case in your code (since you do not specify it explicitly), every element of X greater than 0 is converted to 1, so the transformed X that is actually fed to the BernoulliNB classifier does indeed consist of binary values.
The binarize argument works exactly the same way with the stand-alone preprocessing function of the same name; here is a simplified example, adapting your own:
from sklearn.preprocessing import binarize
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 1))
X
# result
array([[3],
[4],
[0],
[1],
[3],
[0]])
binarize(X) # here as well, default threshold=0.0
# result (binary values):
array([[1],
[1],
[0],
[1],
[1],
[0]])
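To convince yourself that this is really what happens inside BernoulliNB, you can compare the default behaviour with an explicitly pre-binarized input and binarize=None; under that assumption the two models should give identical predictions. A quick sketch, reusing the arrays from the question:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])

# Default behaviour: BernoulliNB binarizes X internally with threshold 0.0
clf_default = BernoulliNB().fit(X, Y)

# Manual equivalent: binarize X ourselves and tell the classifier not to
X_bin = (X > 0).astype(int)
clf_manual = BernoulliNB(binarize=None).fit(X_bin, Y)

print(np.array_equal(clf_default.predict(X), clf_manual.predict(X_bin)))  # True
```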

Keras LSTM - feed sequence data with Tensorflow dataset API from the generator

I am trying to work out how I can feed data to my LSTM model for training. (I will simplify the problem in my example below.) I have the following data format in csv files in my dataset.
Timestep Feature1 Feature2 Feature3 Feature4 Output
1 1 2 3 4 a
2 5 6 7 8 b
3 9 10 11 12 c
4 13 14 15 16 d
5 17 18 19 20 e
6 21 22 23 24 f
7 25 26 27 28 g
8 29 30 31 32 h
9 33 34 35 36 i
10 37 38 39 40 j
The task is to estimate the Output of any future timestep based on the data from the last 3 timesteps. Some input-output examples are as follows:
Example 1:
Input:
Timestep Feature1 Feature2 Feature3 Feature4
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
Output: c
Example 2:
Input:
Timestep Feature1 Feature2 Feature3 Feature4
2 5 6 7 8
3 9 10 11 12
4 13 14 15 16
Output: d
Example 3:
Input:
Timestep Feature1 Feature2 Feature3 Feature4
3 9 10 11 12
4 13 14 15 16
5 17 18 19 20
Output: e
And when feeding the data to the model, I would like to shuffle it so that I do not feed consecutive sequences during training.
In other words, I would ideally like to feed the data sequences like timesteps 3,4,5 in one step, maybe timesteps 5,6,7 in the next step, and maybe 2,3,4 in the following step, and so on.
And I preferably do not want to feed the data as 1,2,3 first, then 2,3,4, then 3,4,5, and so on...
When training my LSTM network, I am using Keras with Tensorflow backend. I would like to use a generator when feeding my data to the fit_generator(...) function.
My desire is to use Tensorflow's dataset API to fetch the data from csv files. But I could not figure out how to make the generator return what I need.
If I shuffle the data with Tensorflow's dataset API, it will destroy the order of the timesteps. The generator should also return batches that include multiple sequence examples. For instance, if the batch size is 2, then it may need to return 2 sequences like timesteps 2,3,4 and timesteps 6,7,8.
Hoping that I could explain my problem... Is it possible to use Tensorflow's dataset API in a generator function for such a sequence problem so that I can feed batches of sequences as I explained above? (The generator needs to return data with the shape [batch_size, length_of_each_sequence, nr_inputs_in_each_timestep], where length_of_each_sequence=3 and nr_inputs_in_each_timestep=4 in my example.) Or is the best way to do this to write a generator in pure Python, maybe using pandas?
ADDENDUM 1:
I have done the following experiment after seeing the answer from @kvish.
import tensorflow as tf
import numpy as np
from tensorflow.contrib.data.python.ops import sliding
sequence = np.array([ [[1]], [[2]], [[3]], [[4]], [[5]], [[6]], [[7]], [[8]], [[9]] ])
labels = [1,0,1,0,1,0,1,0,1]
# create TensorFlow Dataset object
data = tf.data.Dataset.from_tensor_slices((sequence, labels))
# sliding window batch
window_size = 3
window_shift = 1
data = data.apply(sliding.sliding_window_batch(window_size=window_size, window_shift=window_shift))
data = data.shuffle(1000, reshuffle_each_iteration=False)
data = data.batch(3)
#iter = dataset.make_initializable_iterator()
iter = tf.data.Iterator.from_structure(data.output_types, data.output_shapes)
el = iter.get_next()
# create initialization ops
init_op = iter.make_initializer(data)
NR_EPOCHS = 2
with tf.Session() as sess:
    for e in range(NR_EPOCHS):
        print("\nepoch: ", e, "\n")
        sess.run(init_op)
        print("1 ", sess.run(el))
        print("2 ", sess.run(el))
        print("3 ", sess.run(el))
And here is the output:
epoch: 0
1 (array([[[[6]],[[7]],[[8]]], [[[1]],[[2]],[[3]]], [[[2]],[[3]],[[4]]]]),
array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=int32))
2 (array([[[[7]],[[8]],[[9]]], [[[3]],[[4]],[[5]]], [[[4]],[[5]],[[6]]]]),
array([[1, 0, 1], [1, 0, 1], [0, 1, 0]], dtype=int32))
3 (array([[[[5]],[[6]],[[7]]]]), array([[1, 0, 1]], dtype=int32))
epoch: 1
1 (array([[[[2]],[[3]],[[4]]], [[[7]],[[8]],[[9]]], [[[1]],[[2]],[[3]]]]),
array([[0, 1, 0], [1, 0, 1], [1, 0, 1]], dtype=int32))
2 (array([[[[5]],[[6]],[[7]]], [[[3]],[[4]],[[5]]], [[[4]],[[5]],[[6]]]]),
array([[1, 0, 1], [1, 0, 1], [0, 1, 0]], dtype=int32))
3 (array([[[[6]],[[7]],[[8]]]]),
array([[0, 1, 0]], dtype=int32))
I could not try it with csv file reading yet, but I think this approach should work quite well!
However, as far as I can see, the reshuffle_each_iteration parameter makes no difference. Is it really needed? The results are not necessarily identical when it is set to True or False. What is this reshuffle_each_iteration parameter supposed to do here?
I think this answer might be close to what you are looking for!
You create batches by sliding over windows, and then shuffle the input in your case. The shuffle function of the Dataset API has a reshuffle_each_iteration parameter, which you might want to set to False if you want to experiment with setting a random seed and looking at the order of the shuffled outputs.
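Note that sliding.sliding_window_batch lived in tf.contrib, which was removed in TensorFlow 2.x. Under that assumption, a roughly equivalent pipeline with the current tf.data API could look like the sketch below (same toy sequence and labels as the addendum, with the features flattened to shape (timesteps, 1); a window of 3 shifted by 1, shuffled, then batched):

```python
import numpy as np
import tensorflow as tf

sequence = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9]], dtype=np.float32)
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1])

window_size = 3
window_shift = 1

ds = tf.data.Dataset.from_tensor_slices((sequence, labels))

# Slide a window of 3 timesteps over the data, shifting by 1 each time,
# then flatten each window back into a single (features, labels) pair.
ds = ds.window(window_size, shift=window_shift, drop_remainder=True)
ds = ds.flat_map(lambda x, y: tf.data.Dataset.zip((x.batch(window_size),
                                                   y.batch(window_size))))

# Shuffle the windows (not the timesteps inside them), then batch them.
ds = ds.shuffle(1000).batch(3)

for features, lbls in ds:
    print(features.shape, lbls.numpy())   # features: (batch, 3, 1)
```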

Preparing variable-length data for sklearn

Since this is a complicated problem (at least for me), I will try to keep this as brief as possible.
My data is of the form
import pandas as pd
import numpy as np
# edit: a1 and a2 are linked as they are part of the same object
a1 = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]])
a2 = np.array([[5, 6, 5], [2, 3], [3, 4, 8, 1]])
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a1=a1.tolist(),a2=a2.tolist(), b=b, y=y))
a1 a2 b y
0 [1, 2, 3] [5, 6, 5] 6 0
1 [4, 5] [2, 3] 15 1
2 [7, 8, 9, 10] [3, 4, 8, 1] 24 1
which I would like to use in sklearn for classification, e.g.
from sklearn import tree
X = df[['a1', 'a2', 'b']]
Y = df['y']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([[2., 2.]]))
However, while pandas can handle lists as entries, sklearn, by design, cannot. In this example, clf.fit will result in ValueError: setting an array element with a sequence., for which you can find plenty of answers.
But how do you deal with such data?
I tried to split the data up into multiple columns (i.e. a1[0] ... a1[3] - code for that is a bit lengthy), but a1[3] will be empty (NaN, 0 or whatever invalid value you think of). Imputation does not make sense here, since no value is supposed to be there.
Of course, such a procedure has an impact on the result of the classification as the algorithm might pick up the "zero" value as something meaningful.
I also thought that, if the dataset is large enough, it might be worth splitting it up into subsets of equal a1 length. But this procedure can reduce the power of the classification algorithm, since the length of a1 might help to distinguish between classes.
I also thought of using warm start for algorithms that support it (e.g. Perceptron) and fitting it to data split by the length of a1. But this would surely fail, would it not? The datasets would have different numbers of features, so I assume that something would go wrong.
Solutions to this problem surely must exist and I've simply not found the right place in the documentation.
Let's assume for a second that those numbers are numerical categories.
What you can do is transform column 'a' into a set of binary columns, each of which corresponds to a possible value of 'a'.
Taking your example code, we would:
import pandas as pd
import numpy as np
a = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]])
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a=a.tolist(),b=b,y=y))
from sklearn.preprocessing import MultiLabelBinarizer
MLB = MultiLabelBinarizer()
df_2 = pd.DataFrame(MLB.fit_transform(df['a']), columns=MLB.classes_)
df_2
1 2 3 4 5 7 8 9 10
0 1 1 1 0 0 0 0 0 0
1 0 0 0 1 1 0 0 0 0
2 0 0 0 0 0 1 1 1 1
Then, we can just concat the old and new data:
new_df = pd.concat([df_2, df.drop('a', axis=1)], axis=1)
1 2 3 4 5 7 8 9 10 b y
0 1 1 1 0 0 0 0 0 0 6 0
1 0 0 0 1 1 0 0 0 0 15 1
2 0 0 0 0 0 1 1 1 1 24 1
Please do note that if you have a training and a test set, it would be wise to first concat them, do the transform, and then separate them. That's because one of the data sets can contain values that do not appear in the other.
Hope that helps
Edit:
If you are worried that this might make your df too big, it's perfectly okay to apply PCA to the binarized variables. It will reduce cardinality while maintaining an arbitrary amount of variance/correlation.
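Since the question (after its edit) actually has two list columns, a1 and a2, the same MultiLabelBinarizer idea could be applied per column and the results concatenated with b. A sketch under that assumption, with prefixed column names to keep the two binarized blocks apart:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

a1 = [[1, 2, 3], [4, 5], [7, 8, 9, 10]]
a2 = [[5, 6, 5], [2, 3], [3, 4, 8, 1]]
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a1=a1, a2=a2, b=b, y=y))

# One binarizer per list column, with prefixed column names to avoid clashes
parts = []
for col in ['a1', 'a2']:
    mlb = MultiLabelBinarizer()
    binarized = pd.DataFrame(mlb.fit_transform(df[col]),
                             columns=[f'{col}_{c}' for c in mlb.classes_])
    parts.append(binarized)

X = pd.concat(parts + [df[['b']]], axis=1)
Y = df['y']
print(X)
```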
sklearn expects the data as a 2d array, i.e. of shape (n_samples, n_features).
The simplest solution is to prepare one feature vector by concatenating the arrays using numpy.concatenate, then pass this feature vector to sklearn. Since the length of each column is fixed, this should work.

Python Machine Learning Algorithm to Recognize Known Events

I have two sets of data. These data are logged voltages at two points, A and B, in a circuit. Voltage A is the main component of the circuit, and B is a sub-circuit. Every positive voltage in B is (1) considered a B event and (2) known to be a composite of A. I have included sample data where there is a B voltage event, 4,4,0,0,4,4. A real training data set would have many more data points available.
How can I train a Python machine learning algorithm to recognize B events given only A data?
Example data:
V(A), V(B)
0, 0
2, 0
5, 4
3, 4
1, 0
3, 4
4, 4
1, 0
0, 0
2, 0
5, 0
7, 0
2, 0
5, 4
9, 4
3, 0
5, 0
4, 4
6, 4
3, 0
2, 0
An idea:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

n = 5  # number of past A readings used as predictors

# df is assumed to hold the logged data with its columns renamed to A and B
X = [df.A.iloc[i:i + n] for i in df.index[:-n + 1]]   # sliding windows of n A-values
labels = (df.B > 0)[n - 1:]                           # B event at each window's last step

model = RandomForestClassifier()
model.fit(X, labels)
model.predict(X)
What this does is take the previous n observations of A as predictors for the B value. On this small data set it achieves 0.94 accuracy (which could be overfitting).
EDIT: Corrected a small alignment error.
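For completeness, df in the snippet above is assumed to be the logged sample data with the columns renamed to A and B; it could be built roughly like this ('voltages.csv' is a hypothetical file holding the pasted sample data):

```python
import pandas as pd

# 'voltages.csv' is a hypothetical file containing the sample data from the question
df = pd.read_csv('voltages.csv')
df.columns = ['A', 'B']   # rename 'V(A)', 'V(B)' to the names used above
```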
