I have two sets of data: logged voltages at two points, A and B, in a circuit. Voltage A is the main component of the circuit, and B is a sub-circuit. Every positive voltage in B is (1) considered a B event and (2) known to be a component of A. I have included sample data that contains a B voltage event, 4,4,0,0,4,4. A real training data set would contain far more data.
How can I train a Python machine learning algorithm to recognize B events given only A data?
Example data:
V(A), V(B)
0, 0
2, 0
5, 4
3, 4
1, 0
3, 4
4, 4
1, 0
0, 0
2, 0
5, 0
7, 0
2, 0
5, 4
9, 4
3, 0
5, 0
4, 4
6, 4
3, 0
2, 0
An idea:
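from sklearn.ensemble import RandomForestClassifier
# df is assumed to hold the logged data in columns "A" and "B"
n = 5  # window length
# one row per window of n consecutive A readings
X = [df.A.iloc[i:i+n].to_numpy() for i in range(len(df) - n + 1)]
# label each window by whether B is positive at the window's last reading
labels = (df.B > 0).iloc[n-1:]
model = RandomForestClassifier()
model.fit(X, labels)
model.predict(X)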
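This uses the n most recent observations of A (the current reading and the n-1 before it) as predictors for whether B is positive. On this small data set it achieves 0.94 accuracy, which could be overfitting, since the code above predicts on the same rows it was fitted on.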
EDIT: Corrected a small alignment error.
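To check whether the windowing idea above generalizes rather than memorizes, one option is to fit on the earlier windows and score on the later ones. The following is only a minimal sketch on the sample data from the question; the window length `n = 5`, the cut point `split = 12`, and the forest settings are assumptions chosen for illustration, and with so few rows the score is not meaningful in itself.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.ensemble import RandomForestClassifier

# Sample data copied from the question.
A = np.array([0, 2, 5, 3, 1, 3, 4, 1, 0, 2, 5, 7, 2, 5, 9, 3, 5, 4, 6, 3, 2])
B = np.array([0, 0, 4, 4, 0, 4, 4, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 4, 4, 0, 0])

n = 5  # window length (an assumption; tune it on real data)

# Row i holds A[i], ..., A[i+n-1]; its label is whether B > 0 at the last step.
X = sliding_window_view(A, n)
y = (B > 0)[n - 1:]

# Time-ordered split: fit on the earlier windows, score on the later ones.
split = 12  # hypothetical cut point for this toy data
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[:split], y[:split])
held_out_accuracy = model.score(X[split:], y[split:])
print(X.shape, y.shape, held_out_accuracy)
```

Because neighbouring windows overlap, the windows just before and after the cut share readings, so on real data it may be worth leaving a gap of n-1 windows between the two parts.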
Say I have a DataFrame whose columns are features that I want to feed to a random forest classifier. These features are signals sampled at different rates, and each row holds the values output by each sensor over a 30-second epoch. Within a feature column, every cell contains a list of the same length. Say my table looks like this:
|Epoch (30 sec) | Nasal Airflow 25hz | EEG 200hz | Target (0,1,2) |
| -------- | -------------- |-------------- |----- |
| 1 | [12,3,4,5,6...43] | [6,9,8,5...,69] | 1 |
| 2 | [15,45,8,4,9...89] |[7,9.6,8.5,9...,89] | 2 |
| 3 | [18,5,88,400,2...88] |[8,10.15,9.8,9.5...,45] | 0 |
All lists under the Nasal Airflow column have 750 numbers and all lists under the EEG column have 6000 numbers. The target column here is the value I want to predict.
I tried training a random forest classifier on similar data and it did not work. The error I got was
ValueError: setting an array element with a sequence.
I understand that I could apply some statistical methods, like finding the mean, mode, or median of each array, but I feel like I would be losing a lot of information. Are there classifier models that can handle data like this?
(Turning comments into an answer)
If columns contain lists of equal length:
import pandas as pd
from io import StringIO
data_file = StringIO("""airflow|eeg|target
[12,3,4,5,6,43]|[6,9,8,5,69]|0
[15,4,8,4,9,89]|[7,9,8,9,89]|1
[18,5,5,7,2,88]|[8,8,9,9,45]|0
""")
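df = pd.read_csv(
    data_file,
    delimiter="|",
    converters={
        # parse "[12,3,4]" into a list of ints rather than a list of strings
        "airflow": lambda x: [int(v) for v in x.strip("[]").split(",")],
        "eeg": lambda x: [int(v) for v in x.strip("[]").split(",")],
    },
)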
airflow eeg target
0 [12, 3, 4, 5, 6, 43] [6, 9, 8, 5, 69] 0
1 [15, 4, 8, 4, 9, 89] [7, 9, 8, 9, 89] 1
2 [18, 5, 5, 7, 2, 88] [8, 8, 9, 9, 45] 0
Then the easiest option is to turn each list into columns representing "airflow at t1, t2, t3, etc."
df[[f"af{i}" for i in range(len(df.airflow[0]))]] = df.airflow.apply(pd.Series)
df[[f"eeg{i}" for i in range(len(df.eeg[0]))]] = df.eeg.apply(pd.Series)
df.drop(["airflow", "eeg"], axis=1, inplace=True)
target af0 af1 af2 af3 af4 af5 eeg0 eeg1 eeg2 eeg3 eeg4
0 0 12 3 4 5 6 43 6 9 8 5 69
1 1 15 4 8 4 9 89 7 9 8 9 89
2 0 18 5 5 7 2 88 8 8 9 9 45
Which then can be used for model fitting. If the lists expand into a large number of features (e.g. 19,500), then feature selection approaches might be worth exploring. If there is a time-series component, non-linear models (like trees) can fit models of the form "target is influenced by the airflow and eeg at t12"—but other methods for time-series classification exist.
from sklearn.linear_model import LogisticRegression
X = df.drop(["target"], axis=1)
y = df["target"]
clf = LogisticRegression().fit(X, y)
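As a rough illustration of the feature selection idea mentioned above, the sketch below keeps the k columns with the highest ANOVA F-score using scikit-learn's `SelectKBest`. The data is synthetic (random numbers with one column made informative on purpose), and the sizes and `k = 10` are arbitrary assumptions, not values from the question.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic stand-in for the expanded table: 60 rows, 200 time-step columns.
X = rng.normal(size=(60, 200))
y = rng.integers(0, 3, size=60)  # three classes, like the 0/1/2 target
X[:, 0] += 3 * y                 # make column 0 informative on purpose

# Keep the 10 columns with the highest F-score against the target.
selector = SelectKBest(score_func=f_classif, k=10)
X_small = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)  # positions of the kept columns

print(X_small.shape, kept)
```

A univariate score like this ignores interactions between time steps, so it is a starting point rather than a recommendation.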
I have a query data point with 15 columns and a pandas data frame with the same 15 columns. I want to find the data points in the data frame that are closest to my query data point. Can someone guide me on this?
Example:
query data point
[1, 2, 3, 4]
df
1 3 5 6
2 7 9 1
2 8 1 8
5 4 9 0
2 4 6 7
Here, the rows below are the closest. In the same way, I want to retrieve the first n closest data points to my query point.
1 3 5 6
2 4 6 7
I tried clustering, but it was too complex for me to understand, and KNN expects a target variable, so I need your help. Thank you!
You can use the Euclidean distance (the L2 norm) to calculate the distance between each row of your dataframe and your query point.
import pandas as pd
df = pd.DataFrame([[1, 3, 5, 6],
                   [2, 7, 9, 1],
                   [2, 8, 1, 8],
                   [5, 4, 9, 0],
                   [2, 4, 6, 7]])
vec = [1, 2, 3, 4]
# Euclidean distance from each row to vec
dist = df.sub(vec, axis=1).pow(2).sum(axis=1).pow(.5)
This gives the output,
0 3.000000
1 8.426150
2 7.549834
3 8.485281
4 4.795832
dtype: float64
You can then select the n shortest distances, which gives you the positions of the n closest data points to your query point.
Or you can use np.linalg.norm:
import numpy as np
dist = np.linalg.norm(df.to_numpy() - vec, axis=1)
which gives you the output
array([3. , 8.42614977, 7.54983444, 8.48528137, 4.79583152])
Check out the answers to this question.
You can try:
query_point = [1, 2, 3, 4]
n = 2
n_closest_points = df.loc[(df - query_point).pow(2).sum(axis=1).nsmallest(n).index]
gives
0 1 2 3
0 1 3 5 6
4 2 4 6 7
We compute the squared distance between each row and the query_point by chaining subtraction (which broadcasts across rows), squaring (pow) and summing (sum). The square root is skipped because it does not change the ordering. We then pick the n rows with the smallest distance (nsmallest). This gives a series whose values are the squared distances and whose index identifies the desired rows, so we take its index and look those rows up in the original df (.loc).
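The question mentions that KNN expects a target variable. scikit-learn also has an unsupervised neighbour search, `NearestNeighbors`, which needs no target. The following is a minimal sketch on the example data, as an alternative to the pandas expressions above rather than a replacement for them.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

df = pd.DataFrame([[1, 3, 5, 6],
                   [2, 7, 9, 1],
                   [2, 8, 1, 8],
                   [5, 4, 9, 0],
                   [2, 4, 6, 7]])
query_point = [1, 2, 3, 4]
n = 2

# Index the rows once; no target column is involved.
nn = NearestNeighbors(n_neighbors=n, metric="euclidean").fit(df.to_numpy())

# kneighbors takes a 2D array of queries and returns distances and row
# positions, both sorted from nearest to farthest.
distances, positions = nn.kneighbors([query_point])
closest = df.iloc[positions[0]]
print(closest)
```

For a single query on a small frame the pandas one-liner is simpler; the fitted index mainly pays off when many queries are run against the same data.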
I am currently working on the following:
data - with the correct index
from sklearn.cluster import KMeans
wcss = []  # within-cluster sum of squares for each candidate k
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data_values)
    wcss.append(kmeans.inertia_)
kmeans = KMeans(n_clusters=2).fit(data_values)
y = kmeans.fit_predict(data_values)  # cluster label for each row
df = pd.DataFrame(y, index=data.index)
....
#got here multiple dicts
Example of y:
[1 2 3 4 5 2 2 5 1 0 0 1 0 0 1 0 1 4 4 4 3 1 0 0 1 0 0 ...]
f = pd.DataFrame(y, columns=[buster])
f.to_csv('busters.csv', mode='a')
y = clusters after determination
I don't know how I got stuck on this. I am iterating over 20 dataframes, each consisting of one column with values from 1 to 9. The index is irrelevant. I am trying to append all the frames together side by side, but instead they are written one after the other. If I use ".T" to transpose, I still get rows with irrelevant values as the index, which I can't remove because they are actually headers.
Needed result
If the dicts produced in each iteration look like {'Buster1': [0, 2, 2, 4, 5]}, {'Buster2': [1, 2, 3, 4, 5]} ..., using 5 elements here for illustration purposes, and all the lists, i.e., values in the dicts, have the same number of elements (as it is the case in your example), you could create a single dict and use pd.DataFrame directly. (You may also want to take a look at pandas.DataFrame.from_dict.)
You may have lists with more than 5 elements, more than 3 dicts (and thus columns), and you will be generating the dicts with a loop, but the code below should be sufficient for getting the idea.
>>> import pandas as pd
>>>
>>> d = {}
>>> # update d in every iteration
>>> d.update({'Buster 1': [0, 2, 2, 4, 5]})
>>> d.update({'Buster 2': [1, 2, 3, 4, 5]})
>>> # ...
>>> d.update({'Buster n': [0, 9, 3, 0, 0]})
>>>
>>> pd.DataFrame(d, columns=d.keys())
Buster 1 Buster 2 Buster n
0 0 1 0
1 2 2 9
2 2 3 3
3 4 4 0
4 5 5 0
If you have the keys, e.g., 'Buster 1', and values, e.g., [0, 2, 2, 4, 5], separated, as I believe is the case, you can simplify the above (and make it more efficient) by replacing d.update({'Buster 1': [0, 2, 2, 4, 5]}) with d['Buster 1']=[0, 2, 2, 4, 5].
I included columns=d.keys() because depending on your Python and pandas version the ordering of the columns may not be as you expect it to be. You can specify the ordering of the columns through specifying the order in which you provide the keys. For example:
>>> pd.DataFrame(d, columns=sorted(d.keys(),reverse=True))
Buster n Buster 2 Buster 1
0 0 1 0
1 9 2 2
2 3 3 2
3 0 4 4
4 0 5 5
Although it may not apply to your use case, if you do not want to print the index, you can take a look at How to print pandas DataFrame without index.
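Putting the pieces above together, the loop can collect one list per name in a dict and build the frame once at the end, instead of appending to the CSV on every iteration. This is only a sketch: the `results` dict is a hypothetical stand-in for whatever each iteration produces.

```python
import pandas as pd

# Hypothetical per-iteration results standing in for the cluster labels.
results = {
    "Buster 1": [0, 2, 2, 4, 5],
    "Buster 2": [1, 2, 3, 4, 5],
    "Buster 3": [0, 9, 3, 0, 0],
}

columns = {}
for name, labels in results.items():
    # In the real loop, `labels` would be the array returned by fit_predict.
    columns[name] = labels

# One column per name, built in a single step after the loop.
combined = pd.DataFrame(columns)

# index=False leaves the row index out of the CSV text.
csv_text = combined.to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv` instead, for example `combined.to_csv("busters.csv", index=False)`, writes the same text to disk in one go.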
I am fitting a decision tree on the following dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data
The following is my code:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
balance_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
                           sep=',', header=None)
le = preprocessing.LabelEncoder()
balance_data = balance_data.apply(le.fit_transform)
# note: 0:5 selects columns 0-4 only, so column 5 is not used as a feature
X = balance_data.values[:, 0:5]
Y = balance_data.values[:, 6]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=100)
# using Gini index
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
# using Information Gain
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
# Gini prediction
y_pred = clf_gini.predict(X_test)
y_pred
# IG prediction
y_pred_en = clf_entropy.predict(X_test)
y_pred_en
In both cases, Gini index and IG, the output is the following:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,])
Is there a problem with the training? Also, how can I convert these numeric values back to string values?
Edit1: I calculated the accuracy and it says 71. Is there a possibility that the only problem is in the display of the output?
Your dataset is unbalanced
Given that your data looks like this:
0 1 2 3 4 5 6
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
And that your target variable is column 6, Y = balance_data.values[:,6], a quick look at the distribution of the target variable shows that your dataset is unbalanced.
In fact, when starting a new machine learning project, one of the first things to do is to check whether your dataset is unbalanced. This can be done by counting the observations for each value of the target variable.
Since your data is a pandas dataframe, you can get the value distribution as follows:
In [46]: balance_data.iloc[:,6].value_counts()
Out[46]:
unacc 1210
acc 384
good 69
vgood 65
Name: 6, dtype: int64
As you can see, the dataset contains mainly observations with the target value unacc, about 70% to be precise:
In [49]: 1210/1728.
Out[49]: 0.7002314814814815
As you mentioned, the accuracy of your model is around 71%, which corresponds to the share of the target value unacc in the overall dataset. In other words, a model that always predicts unacc would score about the same.
There are several techniques to overcome this problem; see the following links for detailed tutorials:
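To make that baseline concrete, the sketch below rebuilds a label vector from the counts shown above and scores a model that always predicts the most frequent class. The features are a placeholder, since scikit-learn's `DummyClassifier` ignores them; balanced accuracy is added here as one possible complement to plain accuracy, not as the only suitable metric.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score

# Class counts as reported by value_counts() above.
counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}
y = np.repeat(list(counts.keys()), list(counts.values()))
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

# A model that always predicts the most frequent class ("unacc").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = baseline.score(X, y)                # 1210 / 1728
balanced = balanced_accuracy_score(y, y_pred)  # mean recall over the 4 classes

print(round(accuracy, 4), balanced)
```

Plain accuracy comes out at about 0.70 while balanced accuracy is 0.25, which is why a 71% score alone says little on this dataset.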
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/
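On the second part of the question, turning numeric predictions back into strings, a `LabelEncoder` that was fitted on the target column can reverse its own mapping with `inverse_transform`. The sketch below uses a separate, hypothetically named `target_encoder` and made-up predictions for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A dedicated encoder for the target column (hypothetical name).
target_encoder = LabelEncoder()
target_encoder.fit(["unacc", "acc", "good", "vgood"])

# Classes are stored in sorted order, so "unacc" is encoded as 2.
print(list(target_encoder.classes_))

# Made-up predictions, for illustration only.
y_pred = np.array([2, 2, 0, 3])
labels = target_encoder.inverse_transform(y_pred)
print(labels)
```

With this mapping, the array of 2s in the question corresponds to unacc for every test row.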
Since this is a complicated problem (at least for me), I will try to keep this as brief as possible.
My data is of the form
import pandas as pd
import numpy as np
# edit: a1 and a2 are linked as they are part of the same object
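# dtype=object is needed because the inner lists have different lengths
a1 = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]], dtype=object)
a2 = np.array([[5, 6, 5], [2, 3], [3, 4, 8, 1]], dtype=object)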
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a1=a1.tolist(),a2=a2.tolist(), b=b, y=y))
a1 a2 b y
0 [1, 2, 3] [5, 6, 5] 6 0
1 [4, 5] [2, 3] 15 1
2 [7, 8, 9, 10] [3, 4, 8, 1] 24 1
which I would like to use in sklearn for classification, e.g.
from sklearn import tree
X = df[['a1', 'a2', 'b']]
Y = df['y']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([[2., 2.]]))
However, while pandas can handle lists as entries, sklearn, by design, cannot. In this example the clf.fit will result in ValueError: setting an array element with a sequence. to which you can find plenty of answers.
But how do you deal with such data?
I tried to split the data up into multiple columns (i.e. a1[0] ... a1[3] - code for that is a bit lengthy), but a1[3] will be empty (NaN, 0 or whatever invalid value you think of). Imputation does not make sense here, since no value is supposed to be there.
Of course, such a procedure has an impact on the result of the classification as the algorithm might pick up the "zero" value as something meaningful.
If the dataset is large enough, so I thought, it might be worth splitting it up in equal lengths of a1. But this procedure can reduce the power of the classification algorithm, since the length of a1 might help to distinguish between classes.
I also thought of using warm start for algorithms that support it (e.g. Perceptron) and fitting it to data split by the length of a1. But this would surely fail, would it not? The datasets would have different numbers of features, so I assume that something would go wrong.
Solutions to this problem surely must exist and I've simply not found the right place in the documentation.
Let's assume for a second that those numbers are numerical categories.
What you can do is transform column 'a' into a set of binary columns, of which each corresponds to a possible value of 'a'.
Taking your example code, we would:
import pandas as pd
import numpy as np
a = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]], dtype=object)  # ragged lists need dtype=object
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a=a.tolist(),b=b,y=y))
from sklearn.preprocessing import MultiLabelBinarizer
MLB = MultiLabelBinarizer()
df_2 = pd.DataFrame(MLB.fit_transform(df['a']), columns=MLB.classes_)
df_2
1 2 3 4 5 7 8 9 10
0 1 1 1 0 0 0 0 0 0
1 0 0 0 1 1 0 0 0 0
2 0 0 0 0 0 1 1 1 1
Then, we can just concat the old and new data:
new_df = pd.concat([df_2, df.drop(columns='a')], axis=1)
1 2 3 4 5 7 8 9 10 b y
0 1 1 1 0 0 0 0 0 0 6 0
1 0 0 0 1 1 0 0 0 0 15 1
2 0 0 0 0 0 1 1 1 1 24 1
Please note that if you have a training set and a test set, it would be wise to concatenate them first, apply the transform, and then separate them again. That's because one of the data sets can contain values that do not appear in the other.
Hope that helps
Edit:
If you are worried that this might make your df too big, it's perfectly okay to apply PCA to the binarized variables. It will reduce the number of columns while retaining a chosen amount of the variance.
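As a small sketch of that PCA step, the binarized columns from the example can be passed to scikit-learn's `PCA`. A float `n_components` asks for enough components to reach that share of the variance; the 0.95 used here is an arbitrary assumption, and with only three rows the result is purely illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import MultiLabelBinarizer

a = [[1, 2, 3], [4, 5], [7, 8, 9, 10]]

# One binary column per distinct value, as in the answer above.
binary = MultiLabelBinarizer().fit_transform(a)

# A float n_components keeps enough components to reach that share of variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(binary)

print(binary.shape, reduced.shape, pca.explained_variance_ratio_.sum())
```

On real data the PCA should be fitted on the training rows only and then applied to the test rows with `pca.transform`.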
Sklearn expects the data as a 2D array, i.e. of shape (batch_size, features).
The simplest solution is to prepare one feature vector per row by concatenating the arrays using numpy.concatenate, then pass these feature vectors to sklearn. Since the length of each column's lists is fixed, this should work.
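A minimal sketch of that concatenation, using the shortened airflow/eeg example from the earlier answer (the column names and values come from that example, not from the real 750- and 6000-sample signals):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "airflow": [[12, 3, 4, 5, 6, 43], [15, 4, 8, 4, 9, 89], [18, 5, 5, 7, 2, 88]],
    "eeg": [[6, 9, 8, 5, 69], [7, 9, 8, 9, 89], [8, 8, 9, 9, 45]],
    "target": [0, 1, 0],
})

# One flat feature vector per row: the airflow samples followed by the eeg samples.
X = np.array([np.concatenate([a, e]) for a, e in zip(df["airflow"], df["eeg"])])
y = df["target"].to_numpy()

print(X.shape)  # (rows, len(airflow) + len(eeg))
```

`X` and `y` can then be passed to any scikit-learn estimator's `fit` method.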