Classification of test data containing string columns - Python

So I am using machine learning to predict the class of some data, as in the sample given below.
My data relates to a scheduler running on a server, and based on submission time and server type I am labeling the class.
Dataframe df:

sch_name    server_type    submit_time    submit_by    Class
RCALCAPP    X3333          165703         AAAA         1
RCALCAPP    X3333          105703         BBBB         0
PCALCAPP    X3333          165703         AAAA         1
...
TCALCAPP    X3344          095703         CCCC         0
To run the classifier I am label encoding the string column values. I am not sure whether this is the correct approach to encoding, but it works for me:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df = df.apply(le.fit_transform)
Also, I don't need the submit_by column to train the classifier, so I am removing it:
featureNames = [col for col in df.columns if col not in ['submit_by', 'Class']]
To prepare a model I have split the above dataframe into training, CV, and test sets and am using the following:
from sklearn.ensemble import RandomForestClassifier

trainFeatures = training[featureNames].values
trainClasses = training['Class'].values
testFeatures = test[featureNames].values
testClasses = test['Class'].values

clf = RandomForestClassifier()
clf.fit(trainFeatures, trainClasses)
score = clf.score(testFeatures, testClasses)
print(score)  # 0.99823742
Up to here everything is okay and the classifier runs on the data. But now I want to classify a new record. I tried the following:
test_sch = ['TCALCAPP', 'X3344', '075703']
class_code = clf.predict(test_sch) # [1]
It gave an error:
ValueError: could not convert string to float: 'TCALCAPP'
And I know the reason: the record has not been encoded to numbers. Here is my problem: how do I do that exactly? I need to pass encoded values for 'TCALCAPP' and 'X3344', but how would I know the encoded value for new test data? My approach could be wrong, but the requirement is as above. Kindly help.
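One common fix (a minimal sketch, not from the original post) is to fit and keep one LabelEncoder per column instead of calling df.apply(le.fit_transform), which re-fits the same encoder on every column and throws the per-column mappings away. The stored encoders can then translate a new record before predicting:

from sklearn.preprocessing import LabelEncoder

# One encoder per string column, kept so the same mapping can be reused later.
encoders = {}
for col in ['sch_name', 'server_type']:
    encoders[col] = LabelEncoder().fit(df[col])
    df[col] = encoders[col].transform(df[col])
df['submit_time'] = df['submit_time'].astype(int)  # numeric column, no encoder needed

# ... train clf as above, then encode a new record with the stored encoders:
test_sch = ['TCALCAPP', 'X3344', '075703']
encoded = [encoders['sch_name'].transform([test_sch[0]])[0],
           encoders['server_type'].transform([test_sch[1]])[0],
           int(test_sch[2])]
class_code = clf.predict([encoded])  # predict expects a 2D array

Note that LabelEncoder.transform raises a ValueError for a label it never saw during fit, so genuinely new scheduler or server names need separate handling (for example one-hot encoding with handle_unknown='ignore').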

Related

LabelEncoder and OneHotEncoder within the same for loop

I am writing a for loop to encode all of the values in a dataset. I have plenty of categorical values; the for loop initially works with the label encoder, but I am trying to include a OneHotEncoder instead of using get_dummies on a separate line.
sample data:

   STYP_DESC             Gender  RACE_DESC       DEGREE  MAJR_DESC1               FTPT  Target
0  New                   Female  White           BA      Business Administration  FT    1
1  New 1st Time Freshmn  Female  White           BA      Studio Art               FT    1
2  New                   Male    White           MBAX    Business Administration  FT    1
3  New                   Female  Unknown         JD      Juris Doctor             PT    1
4  New                   Female  Asian-American  MBAX    Business Administration  PT    1
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore', drop='first')
le_count = 0
enc_count = 0
for col in X_train.columns[1:]:
    if X_train[col].dtype == 'object':
        if len(list(X_train[col].unique())) <= 2:
            le.fit(X_train[col])
            X_train[col] = le.transform(X_train[col])
            le_count += 1
        else:
            enc.fit(X_train[[col]])
            X_train[[col]] = enc.transform(X_train[[col]])
            enc_count += 1
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))
When I run it, I don't get errors, but the encoding is very strange, with a slew of tuples being inserted into my new dataset.
When I run the code without everything in the else clause, it runs fine, and I can simply use get_dummies to encode the other variables.
The only issue is that when I use get_dummies with drop_first set to True, I lose track of what is supposed to be 0 and what is supposed to be 1 (this is a major issue for tracking Gender and FTPT).
Any suggestions on this? I would use get_dummies, but since I'm doing the preprocessing stage after splitting my data, I'm worried about a category possibly being dropped.
Change the transform line in the else part as below:
X_train[col] = enc.transform(X_train[[col]]).toarray()
Here I'm copying the full code; you may try it directly.
If it still fails, the error may be in some other part of your code, so please check.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

styp = ['New', 'New 1st Time Freshmn', 'New', 'New', 'New']
gend = ['Female', 'Female', 'Male', 'Female', 'Female']
race = ['White', 'White', 'Unknown', 'Unknown', 'Asian-American']
deg = ['BA', 'BA', 'MBAX', 'JD', 'MBAX']
maj = ['Business Administration', 'Studio Art', 'Business Administration',
       'Juris Doctor', 'Business Administration']
ftpt = ['FT', 'FT', 'FT', 'PT', 'PT']
df = pd.DataFrame({'STYP_DESC': styp, 'Gender': gend, 'RACE_DESC': race,
                   'DEGREE': deg, 'MAJR_DESC1': maj, 'FTPT': ftpt})

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore', drop='first')
le_count = 0
enc_count = 0
for col in df.columns[1:]:
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
        else:
            enc.fit(df[[col]])
            df[col] = enc.transform(df[[col]]).toarray()
            enc_count += 1
print(df)
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))
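As a side note (a sketch not taken from the original answer, and assuming a recent scikit-learn: drop='if_binary' needs 0.23+ and get_feature_names_out needs 1.0+), a ColumnTransformer keeps all the encoding in one object, avoids writing transformed arrays back column by column, and its feature names show which dummy maps to which category, which addresses the drop_first bookkeeping concern:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

binary_cols = ['Gender', 'FTPT']  # two-level columns from the sample data
multi_cols = ['STYP_DESC', 'RACE_DESC', 'DEGREE', 'MAJR_DESC1']

ct = ColumnTransformer([
    # drop='if_binary' keeps a single 0/1 column for two-level features
    ('binary', OneHotEncoder(drop='if_binary'), binary_cols),
    ('multi', OneHotEncoder(handle_unknown='ignore'), multi_cols),
])

encoded = ct.fit_transform(df)       # fit on the training split only
print(ct.get_feature_names_out())    # shows which output column encodes which category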

Creating a document term matrix using fit_transform

I have an array that takes in string values from a JSON file. I want to create a document-term matrix to see the repeated words, but when I pass in the array I get an error:
AttributeError: 'NoneType' object has no attribute 'lower'
This is the line that raises the error every time:
sparse_matrix = count_vectorizer.fit_transform(issues_description)
import json
import pandas as pd

issues_description = []
issues_key = []
with open('issues_CLOVER.json') as json_file:
    data = json.load(json_file)
    for record in data:
        issues_key.append(record['key'])
        issues_description.append(record['fields']['description'])

df = pd.DataFrame({'Key': issues_key, 'Description': issues_description})
df.head(10)
This is the data that gets displayed:
Key Description
0 CLOV-1985 h2. Environment Details\r\n\r\nThis bug occurs...
1 CLOV-1984 Clover fails to instrument source code in case...
2 CLOV-1979 If a type argument for a parameterized type ha...
3 CLOV-1978 Bug affects Clover 3.3.0 and higher.\r\n\r\n \...
4 CLOV-1977 Add support to able to:\r\n * instrument sourc...
5 CLOV-1976 Add support to Groovy code in Clover for Eclip...
6 CLOV-1973 See also --CLOV-1956--.\r\n\r\nIn case HUDSON_...
7 CLOV-1970 Steps to reproduce:\r\n\r\nCoverage Explorer >...
8 CLOV-1967 Test Clover against IntelliJ IDEA 2016.3 EAP (...
9 CLOV-1966 *Problem*\r\n\r\nClover Maven Plugin replaces ...
# Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

# Create the document-term matrix
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()
sparse_matrix = count_vectorizer.fit_transform(issues_description)

# OPTIONAL: convert the sparse matrix to a pandas DataFrame to see the word frequencies.
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names(),
                  index=[issues_key[0], issues_key[1], issues_key[2]])
df
What do I change so that issues_description becomes a passable argument, or can someone point me to what I need to know to make it work?
Thanks.
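The error suggests that at least one description in the JSON is null, which json.load turns into None; CountVectorizer calls .lower() on every document, so a None entry fails. A minimal sketch of one fix, replacing missing descriptions with empty strings (field names taken from the question):

issues_description = [record['fields']['description'] or ''
                      for record in data]

sparse_matrix = count_vectorizer.fit_transform(issues_description)

Alternatively, skip the records whose description is None if empty documents should not appear in the matrix.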

Oversampling a class in classification problem

I have nearly 100,000 data points with 15 features, with 'disease' and 'no disease' as the target.
But my data is imbalanced: 97% of it is 'no disease' and 3% is 'disease'.
To overcome this, I manually created disease data by creating 7 copies from the actual data and merging them with the original data,
using this code:
# Selecting data where disease is 1.
# A unique 'patient ID' is also created by adding a dummy letter as a
# suffix to the original ID.
ia = df[df['disease'] == 1]
dup = pd.DataFrame()
for i, j in zip(['a', 'b', 'c', 'd', 'e', 'f'], ['B', 'C', 'E', 'F', 'G', 'H']):
    i = ia.copy()
    i['dum'] = j
    i["patient ID"] = i["Employee Code"] + i['dum']
    dup = pd.concat([dup, i])

# Adding the copies to the original data
df = pd.concat([dup, df])
Please let me know if this is the correct method for oversampling.
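Hand-duplicating rows is essentially random oversampling; the main caveat is that it should be applied only to the training split, otherwise copies of the same row can land in both the training and the test set and inflate the score. A minimal sketch of the same idea with the imbalanced-learn library (suggested here as an assumption, not taken from the original post):

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

X = df.drop(columns=['disease'])
y = df['disease']

# Split first, then oversample only the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

ros = RandomOverSampler(random_state=42)  # duplicates minority-class rows
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

SMOTE (also in imbalanced-learn) generates synthetic minority samples instead of exact copies, which often generalizes better for purely numeric features.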

Pandas - Retrieve Value from df.loc

Using pandas, I have a result (here aresult) from a df.loc lookup that Python is telling me is a 'Timeseries'.
sample of predictions.csv:

prediction    id
1             593960337793155072
0             991960332793155071
...
code to retrieve one prediction
predictionsfile = pandas.read_csv('predictions.csv')
idtest = 593960337793155072
result = (predictionsfile.loc[predictionsfile['id'] == idtest])
aresult = result['prediction']
aresult comes back in a format that cannot be keyed:
In: print aresult
11 1
Name: prediction, dtype: int64
I just need the prediction, which in this case is 1. I've tried aresult['result'], aresult[0], and aresult[1], all to no avail. Before I do something awful like converting it to a string and stripping the value out, I thought I'd ask here.
A Series requires .item() to retrieve its value:
print aresult.item()
1
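.item() assumes the Series holds exactly one element; if the lookup can match more than one row, taking the first match explicitly is another option (a small alternative, not from the original answer):

print aresult.iloc[0]   # first matching prediction; also works when several rows match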

Python Scikit-Learn PCA: Get Component Score

I am trying to perform a principal component analysis for work. While I have been successful in getting the principal components laid out, I don't really know how to assign the resulting component score to each line item. I am looking for an output like this:
Town        PrinComponent 1  PrinComponent 2  PrinComponent 3
Columbia     0.31989         -0.44216         -0.44369
Middletown  -0.37101         -0.24531         -0.47020
Harrisburg  -0.00974         -0.06105          0.32792
Newport     -0.38678          0.40935         -0.62996
The scikit-learn docs are not helpful in this circumstance. Can anybody explain to me how I can reach this output?
The code I have so far is below.
def perform_PCA(df):
    threshold = 0.1
    pca = decomposition.PCA(n_components=3)
    numpyMatrix = df.as_matrix().astype(float)
    scaled_data = preprocessing.scale(numpyMatrix)
    pca.fit(scaled_data)
    pca.transform(scaled_data)
    pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
    #print pca_components_df
    #pca_components_df.to_csv('pca_components_df.csv')
    filtered = pca_components_df[abs(pca_components_df) > threshold]
    trans_filtered = filtered.T
    #print filtered.T  # Transformed dataframe
    trans_filtered.to_csv('trans_filtered.csv')
    print pca.explained_variance_ratio_
I pumped the transformed array into the data portion of the DataFrame function, and then defined the columns and index by putting them into columns= and index= respectively:
pd.DataFrame(data=transformed, columns=["PC1", "PC2"], index=df.index)
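Tied back to the question's code, that means keeping the return value of pca.transform (the component scores, one row per town), which the function above computes but discards. A minimal sketch, assuming the town names are the dataframe's index:

transformed = pca.transform(scaled_data)   # shape: (n_towns, 3)
scores_df = pd.DataFrame(transformed,
                         columns=['PrinComponent 1', 'PrinComponent 2', 'PrinComponent 3'],
                         index=df.index)   # assumes towns are the index
scores_df.to_csv('component_scores.csv')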
