Currently, I have a dataset containing two columns: procedure name and its CPT code. For example: Total Knee Arthroplasty - 27447, Total Hip Arthroplasty - 27130, Open Carpal Tunnel Release - 64721. The dataset has 3000 rows and 5 CPT codes in total (5 classes). I am writing a classification model. When I pass an invalid input, for example "open knee arthroplasty carpal tunnel release", it gives the output 64721, which is wrong. Below is the code I am using. May I know what changes I could make to my code, and whether choosing a neural network for this problem is correct?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neural_network import MLPClassifier
xl = pd.ExcelFile("dataset.xlsx") # reading the data
df = xl.parse('Query 2.2')
# shuffling the data
df=df.sample(frac=1)
X_train, X_test, y_train, y_test = train_test_split(df['procedure'], df['code'], random_state = 0,test_size=0.10)
count_vect = CountVectorizer().fit(X_train)
X_train_counts = count_vect.transform(X_train)
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
model = MLPClassifier(hidden_layer_sizes=(25,), max_iter=500)
classificationModel = model.fit(X_train_tfidf, y_train)
data_to_be_predicted = "open knee arthroplasty carpal tunnel release"
# apply the same count and tf-idf transforms that were used at training time
new_counts = count_vect.transform([data_to_be_predicted])
new_tfidf = tfidf_transformer.transform(new_counts)
result = classificationModel.predict(new_tfidf)
predictionProbabilityMatrix = classificationModel.predict_proba(new_tfidf)
maximumPredictedValue = np.amax(predictionProbabilityMatrix)
if maximumPredictedValue * 100 > 99:
    print(result[0])
else:
    print("00000")
I'd recommend you use Keras for this problem. All the treatment of the data you did with sklearn after splitting into training and test sets could be fed to Keras as numpy arrays, which would be more readable and less confusing to follow. If the rows are all strings, you can split each row with plain Python, e.g.
row = data[i].split(',')
which gives you the columns of that row already split.
If you have 5 known classes, I'd replace their names with numbers in the dataset. I've never used sklearn to implement a neural network, but note that hidden_layer_sizes=(25) means a single hidden layer with 25 units, not 25 layers; one small hidden layer should be plenty for 5 classes.
Sorry I couldn't help more precisely with your problem, but I think you can solve it more easily if you redo it as described... good luck, buddy!
edit: Maybe the problem isn't in the parsed dataset but in the NN implementation; that's why I think Keras is clearer. A rough sketch of the Keras route follows.
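Here is a minimal sketch of that Keras route, reusing the X_train_tfidf and y_train from the question's code; the layer size, epochs and batch size are illustrative only, not tuned values.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)  # map the 5 CPT codes to integers 0..4

model = Sequential([
    Dense(25, activation='relu', input_shape=(X_train_tfidf.shape[1],)),
    Dense(len(le.classes_), activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_tfidf.toarray(), y_train_enc, epochs=20, batch_size=32)
The softmax output gives one probability per class, so the same "reject if the maximum probability is low" check from the question can still be applied to the model's predictions.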
I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Determine the number of unique categories and the number of cases for each category
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
'first_order_channel','days_since_first_order','total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:,np.arange(len(df_train.columns)-1)].values # Assign independent values
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Testing on 75% of the data
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPrediction)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how to include the data that I am using, but I am trying to predict 'size_womenswear'. There are 8 different sizes that I have encoded, and I have moved this column to the end of the dataframe, so y is the dependent variable and x contains the independent variables (all the other columns).
I am using a Gaussian Naive Bayes classifier to try to classify the 8 different sizes, testing on 25% of the data. The results are not very good.
I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to machine learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?
Can't comment, so just throwing out some ideas:
Maybe you need to deal with class imbalance, and try other models that fit the data better? Try the xgboost or lightgbm packages; given good data they usually perform well in general, but it really depends on the data.
Also, with the way you split train and test, do the resulting train and test sets have a similar distribution for your y? That's very important.
Last thing, for classification models the performance measurement can be a bit tricky, so try some other measurements. Use F1 scores, or draw a confusion matrix and see what your predictions vs. y look like; perhaps your model is predicting everything as one
or just a few classes. A quick sketch of these checks is below.
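Here is a quick sketch of those checks, reusing the x and y from the question's code; the functions are standard scikit-learn ones:
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix

print(Counter(y))  # how many rows fall into each of the 8 sizes

# stratify so train and test keep a similar class distribution
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.25, random_state=0, stratify=y)

model = GaussianNB().fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
print(classification_report(yTest, yPredicted))  # per-class precision, recall and F1
print(confusion_matrix(yTest, yPredicted))       # shows whether a few classes dominate the predictions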
So I have this dataset (1000 columns by 1000 rows) which has two classes, zero or one. I applied the code below and it gave me a prediction rate of 58%. I want to tune it, but I am really confused between the different classes and how to select their parameters for this type of data, so I would appreciate some guidance here.
#here I am importing the libraries that I need for this situation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
#reading the data
data = pd.read_csv('train.csv')
x = data.loc[:, 'D_0':'D_1023']
y = data['Class']
test = pd.read_csv('test.csv')
model = svm.SVC(kernel='linear', C=1)
model.fit(x,y)
model.score(x,y)
predictions = model.predict(test)
pd.DataFrame(predictions, columns=['PredictedScore']).to_csv('prediction.csv')
[Sample of the dataset]
The parameters really depend on the data, so there is no general guideline. However, at least trying the "rbf" kernel is worth the effort, I think. I would also start by changing the C parameter, as that usually has the largest effect. But again, it depends a lot on the data. A minimal grid-search sketch is below.
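As a minimal sketch of that tuning, reusing the x and y from the question's code; the grid values are only illustrative:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 0.01, 0.001],  # gamma only affects the rbf kernel
}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(x, y)
print(search.best_params_, search.best_score_)
GridSearchCV cross-validates every parameter combination, so it also gives a more honest estimate than model.score(x, y) computed on the training data.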
I'm trying to use machine learning to predict stock prices. I'm having issues choosing how far out to predict; I want to be able to predict 100-200 days into the future. It seems like my code is cutting off the last 200 days and putting its prediction there, instead of adding an additional 200 days of prediction. For example, if I have 1000 days of data, I want the 1000 days plus the 200 of the prediction; right now it's cutting off the last 200 and making its prediction there instead. To make my question easier to understand: say I have a dataset with the integers 1,2,3,4,5,6,7,8,9,10. How do I tell the predict method that I want the next 2, 10 or 20 integers in the series? I'm just starting out following a tutorial, and doing a bad job of it; it kind of just says put this here and that there, without much insight into what everything is doing. Not sure if LinearRegression is the best way to do it. I would be happy with any insight you can provide, thank you!
import pandas as pd
import math
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import style
df = pd.read_csv('spyqqqq.csv')
df = df[['Open', 'High', 'Low', 'Adj Close', 'Volume']]
df['label'] = df['Adj Close'].shift(-200)  # label is the Adj Close 200 days later
df.dropna(inplace=True)
X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])
X = preprocessing.scale(X)
X_lately = X[:-200]  # note: this takes every row except the last 200
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
clf = LinearRegression()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
forecast = clf.predict(X_lately)
print("accuracy ", accuracy)
print("forecast ", forecast)
plt.plot(forecast, linestyle='solid', ms=0, color='blue')
plt.show()
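For reference, here is a sketch of the usual "shift the label" pattern this kind of tutorial is built on, assuming the same CSV and column names as above. Keeping the most recent 200 unlabeled rows aside is what produces predictions that extend past the known data; whether linear regression on these features is a sensible model for prices is a separate question.
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

forecast_out = 200
df = pd.read_csv('spyqqqq.csv')[['Open', 'High', 'Low', 'Adj Close', 'Volume']]
df['label'] = df['Adj Close'].shift(-forecast_out)  # label = Adj Close 200 days later

X = preprocessing.scale(np.array(df.drop(['label'], axis=1)))
X_lately = X[-forecast_out:]   # the most recent 200 rows have no label yet; these get forecast
X = X[:-forecast_out]          # rows that do have a label, used for fitting
y = np.array(df['label'].dropna())

clf = LinearRegression().fit(X, y)
forecast = clf.predict(X_lately)  # 200 values that extend the series beyond the known data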
:) Very sorry in advance if my code looks like something a total newbie would write. Below is a portion of my code in Python. I am fiddling with sklearn and machine learning techniques.
I trained several Naive Bayes models based on different datasets and stored them in trained_models.
Prior to this step I created an object chi_squared of the SelectPercentile class, using the chi2 function for feature selection. From my understanding, I should write data_feature_reduced = chi_squared.transform(some_data) and then use data_feature_reduced at training time, i.e. nb.fit(data_feature_reduced, data.target).
This is what I did, and I stored the resulting nb objects (and some other information) in the list trained_models.
I am now attempting to apply these models to a different set of data (actually from the same source, if that matters to the question):
for name, model, intra_result, dev, training_data, chi_squarer in trained_models:
    cross_results = []
    new_vect = StemmedVectorizer(ngram_range=(1, 4), stop_words='english', max_df=0.90, min_df=2)
    for data in demframes:
        data_name = data[0]
        X_test_data = new_vect.fit_transform(data[1].values.astype('U'))
        Y_test_data = data[2]
        chi_squared_test_data = chi_squarer.transform(X_test_data)
        final_results.append((name, "applied to", data[0], model.score(X_test_data, Y_test_data)))
I have to admit that I am a bit of a stranger to the feature selection part.
Here is the error that I get:
ValueError: X has a different shape than during fitting.
at the line chi_squared_test_data = chi_squarer.transform(X_test_data)
I am assuming I am doing the feature selection incorrectly. Where did I go wrong?
Thanks to everyone for their help!
I will just paste the comment that helped me solve my problem, from @Vivek-Kumar:
This error is due to this line new_vect.fit_transform(). Like your
trained models, you should use the same StemmedVectorizer which was
used at training time.
The same StemmedVectorizer object will transform X_test_data to the same shape it had during training. Currently, you are using a different object and fitting on it (fit_transform is fit plus transform), hence the shape is different, and hence the error. A minimal sketch of the fix is below.
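A minimal sketch of that fix, assuming each tuple in trained_models also carries the StemmedVectorizer fitted at training time (train_vect below is a hypothetical name) and that each entry of demframes is a (name, texts, labels) triple as in the question:
for name, model, intra_result, dev, training_data, chi_squarer, train_vect in trained_models:
    for data_name, texts, labels in demframes:
        X_test_data = train_vect.transform(texts.values.astype('U'))  # transform only, no re-fitting
        chi_squared_test_data = chi_squarer.transform(X_test_data)    # shapes now match the training fit
        final_results.append((name, "applied to", data_name,
                              model.score(chi_squared_test_data, labels)))
Note that the score is computed on the chi-squared-reduced features here, since that is what the models were fitted on.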
Why not use a pipeline to make it simple? That way you don't have to transform twice and take care of the shapes.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
chi_squarer = SelectKBest(chi2, k=100) # change accordingly
lr = LogisticRegression() # or naive bayes
clf = Pipeline([('chi_sq', chi_squarer), ('model', lr)])
# for training:
clf.fit(training_data, targets)
# for predictions:
clf.predict(test_data)
You can also add the new_vect to the pipeline, as in the sketch below.
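For instance, a sketch with the vectorizer folded in, where StemmedVectorizer is the custom class from the question and train_texts, train_targets and test_texts are placeholder names for the raw text and labels:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vect', StemmedVectorizer(ngram_range=(1, 4), stop_words='english', max_df=0.90, min_df=2)),
    ('chi_sq', SelectKBest(chi2, k=100)),
    ('model', LogisticRegression()),
])
clf.fit(train_texts, train_targets)  # raw text goes in; vectorizer and selector are fitted once here
clf.predict(test_texts)              # the same fitted transforms are reused automatically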
I am trying my hand at machine learning and have been using the Python-based scikit-learn library for it.
I wish to solve a 'classification' problem in which a chunk of text (say 1k-2k words) is classified into one or more categories. For this I have been studying scikit-learn for a while now.
As my data is in the range of 2-3 million rows, I was using SGDClassifier with HashingVectorizer for the purpose, using the partial_fit learning technique, coded as below:
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer
import numpy as np
import joblib
import copy
data = pd.read_csv(
    open('train_shuffled.csv'), error_bad_lines=False)
data_all = copy.deepcopy(data)
target = data['category']
del data['category']
cls = np.unique(target)
model = SGDClassifier(loss='log', verbose=1)
vect = HashingVectorizer(stop_words='english', strip_accents='unicode', analyzer='word')
loop = len(target) // 100  # number of mini-batches of 100 rows
for passes in range(0, 5):
    count, r = 0, 0
    print("Pass " + str(passes + 1))
    for q in range(0, loop):
        # vectorize the raw text of this mini-batch; HashingVectorizer is stateless, so transform is enough
        d = vect.transform(data['content'][r:r + 100])
        t = np.array(target[r:r + 100])
        model.partial_fit(d, t, classes=cls)
        r = r + 100
    # reshuffle the training data between passes
    data = copy.deepcopy(data_all)
    data = data.iloc[np.random.permutation(len(data))]
    data = data.reset_index(drop=True)
    target = data['category']
    del data['category']
print(model)
joblib.dump(model, 'Model.pkl')
joblib.dump(vect, 'Vectorizer.pkl')
While going through the learning process, I read in an answer here on Stack Overflow that manually shuffling the training data on each iteration results in a better model.
Using the classifier and vectorizer with default parameters, I got an accuracy score of ~58.4%. Since then I have tried playing with different parameter settings for both the vectorizer and the classifier, but with no increase in accuracy.
Is anyone able to tell me if I have been doing something wrong, or what should be done to improve the model score?
Any help will be highly appreciated.
Thanks!
1) Consider using GridSearchCV to tune parameters: http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html (a rough sketch is at the end of this answer).
2) Consider feature engineering, to combine existing features into new features, e.g. using the polynomial features, feature selection and feature union tools provided in sklearn.
3) Try different models. Not all models work on all problems. Try using an ensemble of simpler models and some kind of decision function to take the outputs of those models and make a prediction. Some are in the ensemble module, but you can use the voting classifiers to make your own.
But by far the best and most important thing to do: look at the data. Find examples where the classifier performed badly. Why did it perform badly? Can you classify the text yourself just by reading it (i.e. is it reasonable to expect an algorithm to classify that text)? If it can be classified, what does the model miss?
All of this will help guide what to do next.
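As a rough sketch of point 1 applied to this setup, grid-searching a HashingVectorizer + SGDClassifier pipeline; the grid values are only illustrative, and texts/target stand in for the raw documents and their categories from the question:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vect', HashingVectorizer(stop_words='english', strip_accents='unicode', analyzer='word')),
    ('clf', SGDClassifier(loss='log')),
])
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1e-6, 1e-5, 1e-4],
    'clf__penalty': ['l2', 'elasticnet'],
}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1, verbose=1)
search.fit(texts, target)
print(search.best_params_, search.best_score_)
Since the grid search refits the whole pipeline for every combination, with 2-3 million rows it is usually run on a sample of the data first, and the best settings are then reused for the full partial_fit training.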