gplearn library for generating new lines of data from given dataset - python

I am using the gplearn library (genetic programming) to generate new rules from a given dataset. Currently I have 11 rows of data with 24 columns (features) that I feed into the SymbolicRegressor method to get new rules. However, I am only getting one rule. With crossover, shouldn't I get 11 new rules if I give 11 lines of data as input? If I am doing it wrong, what is the right way of doing it?
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesRegressor
from gplearn.genetic import SymbolicRegressor
data = pd.read_csv("D:/Subjects/Thesis/snort_rules/ransomware_dataset.csv")
x_train = data.iloc[:, 0:23]   # 23 feature columns
y_train = data.iloc[:, -1]     # last column is the target (a single column, not all-but-last)
gp = SymbolicRegressor(population_size=11,
                       generations=2, stopping_criteria=0.01,
                       p_crossover=0.8, p_subtree_mutation=0.1,
                       p_hoist_mutation=0.05, p_point_mutation=0.05,
                       max_samples=0.9, verbose=1,
                       parsimony_coefficient=0.01, random_state=0)
gp.fit(x_train, y_train)
print(gp._program)
The output is :
X7/(X15*(-X16*X20 - X19 + X2))
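Note that SymbolicRegressor does evolve a population of population_size programs each generation, but gp._program holds only the single best program found, which is why one rule is printed. If you want to see every rule in the final population, a sketch along these lines should work (it relies on gplearn internals: _programs is an undocumented attribute holding one list of programs per generation, available when low_memory is left at its default of False):
# Print the whole final generation, not just the winner.
for program in gp._programs[-1]:
    print(program)                # the evolved expression
    print(program.raw_fitness_)   # its raw fitness on the training subsample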

Related

How to make KNN faster?

I have a dataset of shape (700000, 20) and I want to apply KNN to it.
However, prediction takes a very long time at test time. Can someone please help me reduce the KNN prediction time?
Is there something like GPU-KNN?
Below is the code I am using.
import os
os.chdir(os.path.dirname(os.path.realpath(__file__)))
import tensorflow as tf
import pandas as pd
import numpy as np
from joblib import load, dump
from scipy.spatial import distance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from dtaidistance import dtw

window_length = 20
n = 5

X_train = load('X_train.pth').reshape(-1, 20)
y_train = load('y_train.pth').reshape(-1)
X_test = load('X_test.pth').reshape(-1, 20)
y_test = load('y_test.pth').reshape(-1)

# custom metric
def DTW(a, b):
    return dtw.distance(a, b)
clf = KNeighborsClassifier(metric=DTW)
clf.fit(X_train, y_train)
#evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
I can suggest reducing the number of features, which from your dataset shape I take to be 20, meaning you have 20 dimensions; shorter vectors make every DTW call cheaper.
You can reduce the number of features by using PCA (Principal Component Analysis), like the following:
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_train_reduced = pca.fit_transform(X_train)   # fit the projection on training data only
X_test_reduced = pca.transform(X_test)         # apply the same projection to the test data
This will reduce the dimensionality, for example to 10 instead of 20.
The second issue I see in your code is that you are not passing the number of neighbours, k, to the classifier. It should be the following:
clf = KNeighborsClassifier(n_neighbors=n, metric=DTW)
The DTW metric is what takes so much time; simple KNN with the default metric works fine.
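If DTW itself is the bottleneck, one thing worth trying (an assumption on my part, not something from the original thread) is dtaidistance's C-backed dtw.distance_fast with a warping window, plus parallel queries in the classifier:
import numpy as np
from dtaidistance import dtw
from sklearn.neighbors import KNeighborsClassifier

# Sketch: distance_fast calls dtaidistance's compiled C core and expects
# double arrays; the window argument bounds the warping path, trading a
# little flexibility for a large speedup. window=5 is a placeholder to tune.
def DTW_fast(a, b):
    return dtw.distance_fast(np.asarray(a, dtype=np.double),
                             np.asarray(b, dtype=np.double),
                             window=5)

clf = KNeighborsClassifier(n_neighbors=5, metric=DTW_fast, n_jobs=-1)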

ROC curve for multi-class classification without one vs all in python

I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without the one-vs-all technique, as the number of classes is very high and it might be inefficient.
I have tried following the tips from the scikit-learn documentation [1], but there the one-vs-all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. In the one-vs-all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var= factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=250,
    learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
# Reverse factorize (converting y_pred from 0s, 1s, 2s, etc. to their original values)
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried the roc_curve function directly, and also after binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_curve(y_test, y_pred)

multiclass_roc_auc(y_test, y_pred)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?
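For what it's worth, roc_curve itself only accepts binary targets, which is what the ValueError is saying. A common workaround (a sketch, not from the original post; it reuses bdt_clf, X_test, y_test and the imports already in the question) is to loop over the binarized class columns and score them with predict_proba:
# Per-class ROC curves: binarize the true labels and use the class
# probability scores, since hard 0/1 predictions give degenerate curves.
y_score = bdt_clf.predict_proba(X_test)   # shape (n_samples, n_classes)
lb = preprocessing.LabelBinarizer()
y_test_bin = lb.fit_transform(y_test)     # shape (n_samples, n_classes)

for i in range(y_test_bin.shape[1]):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label='class %d (AUC = %.3f)' % (i, auc(fpr, tpr)))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()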

Saving the predicted values of a classifier into an excel spreadsheet, python sklearn

Using sklearn I have predicted the values. I want to save these predicted values to a new Excel file along with their unique IDs.
from sklearn.ensemble import AdaBoostClassifier as ABC
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn import linear_model
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Initialize the models
#adaboostclassifier
clf_A = ABC(random_state = 1)
clf_A.fit(X_train,y_train)
pred_A=clf_A.predict(X_test)
I want to store pred_A and the ['SSL'] column from the X_train file in a new CSV file. Any suggestions on how I can do this?
Thanks
The simplest way is something like this:
clf_A = ABC(random_state = 1)
clf_A.fit(X_train, y_train)
pred_A = clf_A.predict(X_test)
import pandas as pd

resultingDF = pd.DataFrame()          # create a new dataframe
resultingDF['predictions'] = pred_A   # create a column with the predicted values
resultingDF['SSL'] = ...
I suppose you take the SSL values from X_test. So if your X_test is a pandas DataFrame, it would be:
resultingDF['SSL'] = X_test['SSL'].values
If your X_test is already a 2D array, you can either use the column index to pull it out (#1), save the SSL column somewhere earlier and reuse it now (#2), or turn X_test back into a DataFrame and then use the code above (#3).
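To actually write the result out (the question mentions Excel; the file names below are placeholders):
# to_csv has no extra dependency; to_excel additionally requires an
# engine such as openpyxl to be installed.
resultingDF.to_csv('predictions.csv', index=False)
resultingDF.to_excel('predictions.xlsx', index=False)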

Inconsistent neural network results when using sklean and seed

After running a neural network in sklearn, I am receiving inconsistent results, even after implementing the seed function. Each time I run the code I receive different values of MSE and R squared for each tested seed value, and these values can range greatly, with R squared being anything between -0.1 and 0.6. I'm wondering if it is a data issue, as I only have 22 columns and 241 rows. I've also tried setting
mlp=MLPRegressor(hidden_layer_sizes=(22,22,22),max_iter=2000,learning_rate_init=0.001,random_state=0)
as well as changing the value of random_state.
Below is my code. Many thanks.
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
import numpy as np
data=pd.read_csv(r'''D:\PhD\1styear\machinelearning\NNforF2050\DATAnnF2050.csv''')
print(data.shape)
print(data.dtypes)
x=data.drop('EnergyConsumpManuf',axis=1)
y=data['EnergyConsumpManuf']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(x_train)
x_train=scaler.transform(x_train)
x_test=scaler.transform(x_test)
from sklearn.neural_network import MLPRegressor
from sklearn import metrics
from sklearn.metrics import accuracy_score
from math import sqrt
for i in range(15):
    print('np.random.seed(%d)' % i)
    np.random.seed(i)
    mlp = MLPRegressor(hidden_layer_sizes=(22, 22, 22), max_iter=2000, learning_rate_init=0.001)
    mlp.fit(x_train, y_train)
    predictions = mlp.predict(x_test)
    print('MSE test: ', metrics.mean_squared_error(y_test, predictions))
    RMS = sqrt(metrics.mean_squared_error(y_test, predictions))
    print('RMS', RMS)
    RTWO = sklearn.metrics.r2_score(y_test, predictions)
    print('RTWO', RTWO)
    print('MAE', metrics.mean_absolute_error(y_test, predictions))
You need to set the random_state parameter of the train_test_split function as well. Without a fixed random state, the data is split differently on every run, which is why the results change each time you run the code.
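A minimal sketch of that fix in terms of the question's own code (the seed values are arbitrary examples):
# train_test_split runs once, before the loop, and without a seed, so
# every execution of the script trains on a different split; pinning its
# random_state (and the network's) removes that source of variation.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
mlp = MLPRegressor(hidden_layer_sizes=(22, 22, 22), max_iter=2000,
                   learning_rate_init=0.001, random_state=0)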

python classify predictions

I have used sklearn in Python to read an Excel sheet and predict group names based on their descriptions. The next step I want to take is grouping similar groups. I am not sure which method will best satisfy what I'm attempting to do.
from __future__ import print_function
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import numpy as np
path = 'data/mydata.csv'
exp = pd.read_csv(path, names=['group_name', 'description'])
X = exp.description
y = exp.group_name
fixed_X = X[pd.notnull(X)]
fixed_y = y[pd.notnull(y)]
vect = CountVectorizer(token_pattern=u'(?u)\\b\\w\\w+\\b')
nb = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(fixed_X, fixed_y,
                                                    random_state=1)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
print(metrics.classification_report(y_test, y_pred_class))
This prints the accuracy of the predicted groups, which is what I want. How do I group similar 'group_names' based on the data and predictions provided?
Desired output (based on the predictions and the data behind them), if there are 10 group names in total:
group 1: [group_name1, group_name5, group_name10]
group 2: [group_name2, group_name4]
group 3: [group_name3, group_name6, group_name7, group_name9]
group 4: [group_name10]
(The number of groups shouldn't matter; I just want the correct group_names, with similar group_names all in one group.)
or
a visual model which shows a clustering of all group names
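One possible approach (a sketch under assumptions, not a definitive answer): represent each group name by the mean count-vector of its descriptions and cluster those vectors with KMeans. It reuses vect, fixed_X and fixed_y from the code above (assuming they are aligned); n_clusters=4 is an arbitrary example value.
from sklearn.cluster import KMeans
import numpy as np

# One vector per group name: the mean of the count-vectors of all
# descriptions that belong to that group.
X_dtm = vect.fit_transform(fixed_X).toarray()
names = fixed_y.values
unique_names = np.unique(names)
centroids = np.vstack([X_dtm[names == g].mean(axis=0) for g in unique_names])

km = KMeans(n_clusters=4, random_state=1).fit(centroids)
for c in range(km.n_clusters):
    members = unique_names[km.labels_ == c]
    print('group %d:' % (c + 1), list(members))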
