Creating a document term matrix using fit_transform - python

I have a list that takes in string values from a JSON file. I want to create a document-term matrix to see the repeated words, but when I pass the list in I get an error:
AttributeError: 'NoneType' object has no attribute 'lower'
This is the line that raises the error every time:
sparse_matrix = count_vectorizer.fit_transform(issues_description)
issues_description = []
issues_key = []
with open('issues_CLOVER.json') as json_file:
    data = json.load(json_file)
    for record in data:
        issues_key.append(record['key'])
        issues_description.append(record['fields']['description'])

df = pd.DataFrame({'Key': issues_key, 'Description': issues_description})
df.head(10)
This is the data that gets displayed:
Key Description
0 CLOV-1985 h2. Environment Details\r\n\r\nThis bug occurs...
1 CLOV-1984 Clover fails to instrument source code in case...
2 CLOV-1979 If a type argument for a parameterized type ha...
3 CLOV-1978 Bug affects Clover 3.3.0 and higher.\r\n\r\n \...
4 CLOV-1977 Add support to able to:\r\n * instrument sourc...
5 CLOV-1976 Add support to Groovy code in Clover for Eclip...
6 CLOV-1973 See also --CLOV-1956--.\r\n\r\nIn case HUDSON_...
7 CLOV-1970 Steps to reproduce:\r\n\r\nCoverage Explorer >...
8 CLOV-1967 Test Clover against IntelliJ IDEA 2016.3 EAP (...
9 CLOV-1966 *Problem*\r\n\r\nClover Maven Plugin replaces ...
# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer
# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()
sparse_matrix = count_vectorizer.fit_transform(issues_description)
# OPTIONAL: Convert Sparse Matrix to Pandas Dataframe if you want to see the word frequencies.
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names(),
                  index=[issues_key[0], issues_key[1], issues_key[2]])
df
What do I change to make issues_description a passable argument, or can someone point me to what I need to know in order for it to work?
Thanks.
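A minimal sketch of one likely fix (my addition, not from the original post): CountVectorizer lowercases every document, so the traceback suggests that some record['fields']['description'] values are null in the JSON and become None in the list. Replacing them with empty strings before vectorizing avoids calling .lower() on None:

# Assumption: some descriptions are null in the JSON, which json.load
# turns into None; CountVectorizer then fails calling .lower() on them.
issues_description = [desc if desc is not None else ''
                      for desc in issues_description]
sparse_matrix = count_vectorizer.fit_transform(issues_description)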

Related

Python enumerate not giving expected output

I'm having an issue with the output from the enumerate function. It is adding parentheses and commas into the data. I'm trying to use the list for a comparison loop. Can anyone tell me why special characters resembling tuples are being added? I'm going crazy trying to finish this, but this bug is causing issues.
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
#NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
import numpy as np
df=pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv")
df.head(10)
df.isnull().sum()/df.count()*100
df.dtypes
# Apply value_counts() on column LaunchSite
df[['LaunchSite']].value_counts()
# Apply value_counts on Orbit column
df[['Orbit']].value_counts()
#landing_outcomes = values on Outcome column
landing_outcomes = df[['Outcome']].value_counts()
print(landing_outcomes)
#following causes data issue
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)
#the following also causes an issue with the data
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes
# landing_class = 0 if bad_outcome
# landing_class = 1 otherwise
landing_class = []
for value in df['Outcome'].items():
    if value in bad_outcomes:
        landing_class.append(0)
    else:
        landing_class.append(1)
df['Class']=landing_class
df[['Class']].head(8)
df.head(5)
df["Class"].mean()
The issue I'm having is
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)
is changing my data and giving an output of
0 ('True ASDS',)
1 ('None None',)
2 ('True RTLS',)
3 ('False ASDS',)
4 ('True Ocean',)
5 ('False Ocean',)
6 ('None ASDS',)
7 ('False RTLS',)
Additionally, when I run
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes
my output is
{('False ASDS',),
('False Ocean',),
('False RTLS',),
('None ASDS',),
('None None',)}
I do not understand why the returned data is so far from what I expected, or how to correct it.
Try this:
for i, (outcome,) in enumerate(landing_outcomes.keys()):
    print(i, outcome)
Or:
for i, outcome in enumerate(landing_outcomes.keys()):
    print(i, outcome[0])
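A note on the root cause (my addition, beyond the original answer): df[['Outcome']] with double brackets selects a one-column DataFrame, and DataFrame.value_counts() keys its result by tuples, one element per column. Selecting the column as a Series avoids the tuples at the source:

# Single brackets give a Series; Series.value_counts() is keyed by the
# plain outcome strings, so no tuples appear downstream.
landing_outcomes = df['Outcome'].value_counts()

for i, outcome in enumerate(landing_outcomes.keys()):
    print(i, outcome)  # e.g. "0 True ASDS"

bad_outcomes = set(landing_outcomes.keys()[[1, 3, 5, 6, 7]])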

Properly calculate cosine similarities for low memory on large datasets?

I am following this tutorial to learn a bit about content recommenders: https://www.datacamp.com/community/tutorials/recommender-systems-python
However, I ran into a MemoryError when running the "content based" part of the tutorial. From some reading I found that this has to do with how large the dataset is. I couldn't find an exact way to run this specific case with low memory, so instead I modified it a little: I split the original dataframe into 6 pieces, ran the cosine similarity calculation for each split dataframe, merged the results together, then ran the process one last time to get a final result. Here is my code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
# Function that takes in a movie title as input and outputs the most similar movies
def get_recommendations(title, indices, cosine_sim, final=False):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    if not final:
        return metadata.iloc[movie_indices, :]
    else:
        return metadata['title'].iloc[movie_indices]
# Load Movies Metadata
metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')
#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')
split_db = np.array_split(metadata, 6)
source_db = None
search_db = None
db_remove_idx = None
new_db_list = list()
for x, db in enumerate(split_db):
    search = db.loc[db['title'] == 'The Dark Knight Rises']
    if not search.empty:
        source_db = db
        new_db_list.append(source_db)
        search_db = search
        db_remove_idx = x
        break
split_db.pop(db_remove_idx)
for x, db in enumerate(split_db):
    new_db_list.append(db.append(search_db, ignore_index=True))
del(split_db)
refined_db = None
for db in new_db_list:
    small_db = db.reset_index()
    #Construct the required TF-IDF matrix by fitting and transforming the data
    tfidf_matrix = tfidf.fit_transform(small_db['overview'])
    # Compute the cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    #cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    #Construct a reverse map of indices and movie titles
    indices = pd.Series(small_db.index, index=small_db['title']).drop_duplicates()
    result = get_recommendations('The Dark Knight Rises', indices, cosine_sim)
    if type(refined_db) != pd.core.frame.DataFrame:
        refined_db = result.append(search_db, ignore_index=True)
    else:
        refined_db = refined_db.append(result, ignore_index=True)
final_db = refined_db.reset_index()
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(final_db['overview'])
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
#Construct a reverse map of indices and movie titles
indices = pd.Series(final_db.index, index=final_db['title']).drop_duplicates()
final_result = (get_recommendations('The Dark Knight Rises', indices, cosine_sim, final=True))
print(final_result)
I thought this would work, but the results are not even close to what the tutorial gives:
11 Dracula: Dead and Loving It
13 Nixon
12 Balto
15 Casino
20 Get Shorty
18 Ace Ventura: When Nature Calls
14 Cutthroat Island
16 Sense and Sensibility
19 Money Train
17 Four Rooms
Name: title, dtype: object
Could anyone explain what I am doing wrong here? I figured that since the dataset was too large, splitting it up, running this cosine similarity process first as a refinement, and then running the process again on the combined results would give a similar result. So why is the result I am getting so different from what is expected?
This is the data I am using: https://www.kaggle.com/rounakbanik/the-movies-dataset/data
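Two observations, plus a sketch of an alternative (my addition, not from the original thread). First, TF-IDF weights depend on the whole corpus, so fitting the vectorizer separately on each chunk produces six incompatible vector spaces whose similarities cannot be meaningfully merged. Second, get_recommendations computes positions against small_db but then indexes the full metadata with them, which is why the output is simply the first rows of metadata rather than the nearest neighbours. A much lighter approach keeps a single TF-IDF matrix (it is sparse and usually fits in memory) and computes similarities only for the query movie's row, avoiding the N x N matrix entirely:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)
metadata['overview'] = metadata['overview'].fillna('')

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(metadata['overview'])  # sparse, memory-friendly

indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()
idx = indices['The Dark Knight Rises']

# A single 1 x N row of similarities instead of the full N x N matrix.
sim_row = linear_kernel(tfidf_matrix[idx], tfidf_matrix).ravel()

# Top 10 most similar movies, skipping the query movie itself.
top = sim_row.argsort()[::-1][1:11]
print(metadata['title'].iloc[top])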

Reading csv in loop stops at row that does not match

I am trying to read a CSV, then iterate through an SDE to find matching features and their fields, and then print them.
There is a table in the list and I'm not able to skip over it and continue reading the CSV.
I get "IOError: table 1 does not exist" and I only get the features that come before the table.
import arcpy
from arcpy import env
import sys
import os
import csv
with open('C:/Users/user/Desktop/features_to_look_for.csv', 'r') as t1:
    objectsinESRI = [r[0] for r in csv.reader(t1)]

env.workspace = "//conn/features#dev.sde"
fcs = arcpy.ListFeatureClasses('sometext.*')
for fcs in objectsinESRI:
    fieldList = arcpy.ListFields(fcs)
    for field in fieldList:
        print fcs + " " + ("{0}".format(field.name))
Sample CSV rows (I can't seem to post a screenshot of the Excel file):
feature 1
feature 2
feature 3
feature 4
table 1
feature 5
feature 6
feature 7
feature 8
feature 9
Result
feature 1
feature 2
feature 3
feature 4
Desired Result
feature 1
feature 2
feature 3
feature 4
feature 5
feature 6
feature 7
feature 8
feature 9
So, as stated, I have no clue about arcpy, but this seems the way to start. Looking at the docs, your objectsinESRI seems to be the equivalent of the datasets in the example. From there I extrapolate the following code which, depending on what print(fc) is printing, you may need to extend with yet another for.
So try this:
for object in objectsinESRI:
    for fc in fcs:
        print(fc)
Or maybe this:
for object in objectsinESRI:
    for fc in fcs:
        for field in arcpy.ListFields(fc):
            print(object + " " + ("{0}".format(field.name)))
I may be completely wrong, of course, but just write the outermost for first, see what it gives you, and keep building from there :)
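Another option (my addition, assuming arcpy raises IOError for names that are not feature classes, as the reported error suggests): wrap the ListFields call in a try/except so the loop skips bad rows such as "table 1" and keeps reading the CSV:

for name in objectsinESRI:
    try:
        fieldList = arcpy.ListFields(name)
    except IOError:
        continue  # e.g. "table 1 does not exist" -- skip and keep going
    for field in fieldList:
        print(name + " " + field.name)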

Classification of test data containing string columns

So I am using machine learning to predict the class of some data, as in the sample below.
My data relates to a scheduler running on a server; I label the class by submission time and server_type.
Dataframe df:
sch_name  server_type  submit_time  submit_by  Class
RCALCAPP  X3333        165703       AAAA       1
RCALCAPP  X3333        105703       BBBB       0
PCALCAPP  X3333        165703       AAAA       1
...
TCALCAPP  X3344        095703       CCCC       0
To run a classifier I am label-encoding the string column values. I'm not sure whether this is the correct approach to encoding, but it works for me:
le = preprocessing.LabelEncoder()
df = df.apply(le.fit_transform)
Also, I don't need the submit_by column to train the classifier, so I am removing it:
featureNames = [col for col in df.columns if col not in ['submit_by','status']]
To prepare a model, I split the above dataframe into training, CV, and test sets and use the following:
trainFeatures = training[featureNames].values
trainClasses = training['status'].values
testFeatures= test[featureNames].values
testClasses = test['status'].values
clf = RandomForestClassifier()
clf.fit(trainFeatures, trainClasses)
score = clf.score(testFeatures, testClasses)
print(score) #.99823742
Up to here everything is okay; the classifier runs on the data. But now I want to classify a new record. I tried the following:
test_sch = ['TCALCAPP', 'X3344', '075703']
class_code = clf.predict(test_sch) # [1]
This gave an error:
ValueError: could not convert string to float: 'TCALCAPP'
I know the reason: the record has not been encoded to numbers. My problem is how to do that exactly. I need to pass the encoded values for 'TCALCAPP' and 'X3344', but how would I know the encoded value for new test data? My approach could be wrong, but the requirement is as stated above. Kindly help.
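A sketch of one common fix (my addition; the feature column names are taken from the question). df.apply(le.fit_transform) fits a fresh encoder on every column and keeps none of them, so there is nothing left to encode new records with. Keeping one fitted LabelEncoder per column lets you apply the same mapping at prediction time:

from sklearn import preprocessing

feature_cols = ['sch_name', 'server_type', 'submit_time']
encoders = {}
for col in feature_cols:
    le = preprocessing.LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le  # keep the fitted encoder for later

# Encode a new record with the stored encoders before predicting.
test_sch = ['TCALCAPP', 'X3344', '075703']
encoded = [encoders[col].transform([val])[0]
           for col, val in zip(feature_cols, test_sch)]
class_code = clf.predict([encoded])

Note that transform raises a ValueError for values the encoder has never seen, and that for input features scikit-learn's OrdinalEncoder (or one-hot encoding) is generally preferred over LabelEncoder, which is intended for targets.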

Python FastText: How to create a corpus from a Dataframe Column

I need to create a corpus for my email classifier.
Right now I am using fasttext 0.8.3, but it expects a text file as input, whereas I need to pass a dataframe as input.
The following code shows an error:
```
import fasttext

x_val = df['Message']
y_val = df['Categories']
model = fasttext.skipgram(x_val, y_val)
print model.words
```
The traceback:
```
TypeError
<ipython-input-105-58241a9688b5> in <module>()
----> 1 model = fasttext.skipgram(x_val, y_val)
      2 print model.words # list of words in dictionary

fasttext/fasttext.pyx in fasttext.fasttext.skipgram (fasttext/fasttext.cpp:6451)()
fasttext/fasttext.pyx in fasttext.fasttext.train_wrapper (fasttext/fasttext.cpp:5223)()

/root/anaconda2/lib/python2.7/genericpath.pyc in isfile(path)
     35     """Test whether a path is a regular file"""
     36     try:
---> 37         st = os.stat(path)
     38     except os.error:
     39         return False

TypeError: coercing to Unicode: need string or buffer, Series found
```
In the above code, df['Message'] and df['Categories'] are the dataframe columns containing the mails and their categories, respectively.
There are 30123 mails in the dataframe.
I have already gone through the fasttext documentation but did not find anything useful.
FastText tutorial reference
Thanks for the help.
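A sketch of one workaround (my addition; it assumes the old fasttext 0.8.x Python API, in which training functions accept only a file path). Since the library reads training data from disk, write the dataframe out as one `__label__<category> <message>` line per mail and train a supervised classifier on that file. Note that fasttext.skipgram learns unsupervised word vectors and ignores labels; for an email classifier, the supervised mode is the one that uses df['Categories']:
```
import fasttext

# Hypothetical file name; fasttext 0.8.x only reads training data from disk.
train_path = 'email_corpus.txt'
with open(train_path, 'w') as f:
    for message, category in zip(df['Message'], df['Categories']):
        # One labelled example per line; newlines inside a mail become spaces.
        f.write('__label__{0} {1}\n'.format(category, message.replace('\n', ' ')))

classifier = fasttext.supervised(train_path, 'email_model', label_prefix='__label__')
print classifier.labels
```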
