I want to search all the plant names that are in this text file that I have made using the code below. I do not have a list of plant names or a specific plant name. Is there a way to search and display for all/every plant name in the text file?
from nltk.stem.porter import *
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
f = open("5.txt")
text = f.read()
f.close()
new_text = word_tokenize(text)
with open("V Token.txt","w") as f:
for w in new_text:
print(w, file=f)
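One possible approach (not from the original post, just a sketch): since there is no list of plant names to search for, each token can be checked against WordNet, keeping nouns whose hypernym chain reaches the botanical plant.n.02 synset. This assumes the WordNet data is installed (nltk.download('wordnet')).

import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

plant_root = wn.synset('plant.n.02')  # the botanical sense of "plant"

def is_plant_word(word):
    # True if any noun sense of the word has plant.n.02 among its hypernyms
    for synset in wn.synsets(word, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            if plant_root in path:
                return True
    return False

with open("5.txt") as f:
    tokens = word_tokenize(f.read())

plant_names = sorted({w for w in tokens if is_plant_word(w.lower())})
print(plant_names)

This only catches single-word names that WordNet knows about (e.g. "oak", "fern"); multi-word species names would need a gazetteer or a dedicated named-entity model.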
I am making a document classifier and here is my code:
import os
import io
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)
            inBody = False
            lines = []
            f = io.open(path, 'r', encoding='latin1')
            for line in f:
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    inBody = True
            f.close()
            message = '\n'.join(lines)
            yield path, message
def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'resume': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)
data = DataFrame({'resume': [], 'class': []})
data = data.append(dataFrameFromDirectory(r'<path>', 'Yes'))
data = data.append(dataFrameFromDirectory(r'<path>', 'No'))
Then I split the data and used TfidfVectorizer:
tf=TfidfVectorizer(min_df=1, stop_words='english')
data_traintf=tf.fit_transform(data_train)
mnb=MultinomialNB()
mnb.fit(data_traintf,class_train)
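The split itself isn't shown in the question; a minimal sketch of that step, assuming the column names from the DataFrame above and an 80/20 split (both assumptions), would be:

from sklearn.model_selection import train_test_split

# raw resume text goes to data_train, labels go to class_train
data_train, data_test, class_train, class_test = train_test_split(
    data['resume'], data['class'], test_size=0.2, random_state=42)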
After training and testing, I saved my classifier as a pickle file:
import pickle
with open(r'clf.pkl','wb') as f:
pickle.dump(mnb,f)
But when I load it again and try to use the classifier, I get a "TfidfVectorizer - Vocabulary wasn't fitted" error. So I tried using a pipeline and saved my vectorizer as well:
from sklearn.pipeline import Pipeline
classifier=Pipeline([('tfidf',tf),('multiNB',mnb)])
with open(r'clf_1.pkl','wb') as f:
pickle.dump(classifier,f)
But still I get the same error. What might be going wrong?
EDIT: The pickle file was stored successfully and on the other end, I loaded the file:
import pickle
with open(r'clf_1.pkl','rb') as f:
clf=pickle.load(f)
And created a test data frame. When I do test_tf = tf.fit(test['resume']), it works fine, but pred = clf.predict(test_tf) gives the error TypeError: 'TfidfVectorizer' object is not iterable.
Do I need to loop through the data frame that has around 15 objects?
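For reference, here is a minimal sketch (not the poster's exact code; it assumes data_train, class_train and test['resume'] hold raw text and labels as in the question) of how a pipeline is normally fitted, pickled and reused: the pipeline is fitted on raw text, so the fitted vocabulary travels inside the pickle, and predict() is then called on raw text rather than on a vectorizer object.

import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# fit vectorizer and classifier together on the raw training text
classifier = Pipeline([('tfidf', TfidfVectorizer(min_df=1, stop_words='english')),
                       ('multiNB', MultinomialNB())])
classifier.fit(data_train, class_train)

# the fitted vocabulary is stored inside the pipeline, so it survives pickling
with open('clf_1.pkl', 'wb') as f:
    pickle.dump(classifier, f)

with open('clf_1.pkl', 'rb') as f:
    clf = pickle.load(f)

# predict on raw text; do not re-fit and do not pass the vectorizer itself
pred = clf.predict(test['resume'])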
I am making a desktop tool for plagiarism checking between documents. I use stopwords, a TF-IDF vectorizer, etc., and cosine similarity to check the similarity between two documents.
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

userinput1 = input("Enter file name:")
myfile1 = open(userinput1).read()
stop_words = set(stopwords.words("english"))
word1 = nltk.word_tokenize(myfile1)
# filter the tokens once instead of re-tokenizing inside a loop
filtered_sentence1 = [w for w in word1 if w not in stop_words]
print(filtered_sentence1)

userinput2 = input("Enter file name:")
myfile2 = open(userinput2).read()
word2 = nltk.word_tokenize(myfile2)
filtered_sentence2 = [w for w in word2 if w not in stop_words]
print(filtered_sentence2)

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

# remove punctuation, lowercase, stem
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(myfile1, myfile2):
    tfidf = vectorizer.fit_transform([myfile1, myfile2])
    return ((tfidf * tfidf.T).A)[0, 1]

print(cosine_sim(myfile1, myfile2))
But the problem is that I have to check the similarity of the user's input file against a number of files in a folder. I tried my best to access the folder and open the files automatically, but did not succeed. Can anyone tell me how to access a folder containing files, open the files one by one, and compare each with the input file? I am using Python 3.4.4 and Windows 7.
As per my understanding, you need to get all the files present in a directory/folder:
import os

directory = 'path_to_the_directory'
fileList = os.listdir(directory)
for eachFile in fileList:
    # os.listdir returns bare file names, so join them with the directory path
    with open(os.path.join(directory, eachFile), 'rb') as _fp:
        fileData = _fp.read()
        print("FILE DATA (%s):\n\n%s\n\n" % (_fp.name, fileData))
This will iterate through all the files in the directory and print each file's contents.
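To tie this back to the question, here is a minimal sketch (the folder name 'documents' is a made-up example) that reuses the cosine_sim() function defined earlier to compare the user's file against every file in a folder:

import os

userinput1 = input("Enter file name:")
myfile1 = open(userinput1).read()

folder = 'documents'  # hypothetical folder holding the files to compare against
for name in os.listdir(folder):
    # os.listdir gives bare names, so build the full path before opening
    path = os.path.join(folder, name)
    other_text = open(path).read()
    print(name, cosine_sim(myfile1, other_text))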
This question already has an answer here: Creating a custom categorized corpus in NLTK and Python (1 answer). Closed 6 years ago.
I'm looking to use my own custom corpus within Visual Studio Code on macOS; I have read probably a hundred forums and I can't wrap my head around what I'm doing wrong, as I'm pretty new to programming.
This question seems to be the closest thing I can find to what I need to do; however, I am unaware of how to do the following:
"on a Mac it would be in ~/nltk_data/corpora, for instance. And it looks like you also have to append your new corpus to the __init__.py within .../site-packages/nltk/corpus/."
When answering, please be aware that I am using Homebrew and don't want to permanently change the data path in a way that prevents me from also using a stock NLTK corpora data set in the same code.
If needed, I can post my attempt at using PlaintextCorpusReader along with the traceback below, although I would rather not use PlaintextCorpusReader at all; ideally I could just copy and paste .txt files into an appropriate location and use them from there via the append approach quoted above.
Thank you.
Traceback (most recent call last):
File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 42, in <module>
short_pos = open("short_reviews/pos.txt", "r").read
IOError: [Errno 2] No such file or directory: 'short_reviews/pos.txt'
EDIT:
Thank you for your responses.
I have taken your advice and moved the folder out of NLTK's corpora.
I've been doing some experimenting with my folder location and I've gotten different tracebacks.
If you are saying the best way to do it is with PlaintextCorpusReader then so be it; however, maybe for my application I'd want to use CategorizedPlaintextCorpusReader?
sys.argv is definitely not what I meant, so I can read up on that later.
First, here is my code without the attempt to use PlaintextCorpusReader; it produces the above traceback when the folder "short_reviews" containing the pos.txt and neg.txt files is outside of the NLP folder:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk import word_tokenize
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
# def main():
# file = open("short_reviews/pos.txt", "r")
# short_pos = file.readlines()
# file.close
short_pos = open("short_reviews/pos.txt", "r").read
short_neg = open("short_reviews/neg.txt", "r").read
documents = []
for r in short_pos.split('\n'):
    documents.append((r, "pos"))

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word.tokenize(short_pos)
short_neg_words = word.tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
However, when I move the folder "short_reviews" containing the text files into the NLP folder and run the same code as above (still without PlaintextCorpusReader), the following occurs:
Traceback (most recent call last):
File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 47, in <module>
for r in short_pos.split('\n'):
AttributeError: 'builtin_function_or_method' object has no attribute 'split'
When I move the folder "short_reviews" containing the text files into the NLP folder and run the code below, which uses PlaintextCorpusReader, the following traceback occurs:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk import word_tokenize
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'short_reviews'
word_lists = PlaintextCorpusReader(corpus_root, '*')
wordlists.fileids()
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
# def main():
# file = open("short_reviews/pos.txt", "r")
# short_pos = file.readlines()
# file.close
short_pos = open("short_reviews/pos.txt", "r").read
short_neg = open("short_reviews/neg.txt", "r").read
documents = []
for r in short_pos.split('\n'):
    documents.append((r, "pos"))

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word.tokenize(short_pos)
short_neg_words = word.tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
Traceback (most recent call last):
File "/Users/jordanXXX/Documents/NLP/bettertrainingdata2", line 18, in <module>
word_lists = PlaintextCorpusReader(corpus_root, '*')
File "/Library/Python/2.7/site-packages/nltk/corpus/reader/plaintext.py", line 62, in __init__
CorpusReader.__init__(self, root, fileids, encoding)
File "/Library/Python/2.7/site-packages/nltk/corpus/reader/api.py", line 87, in __init__
fileids = find_corpus_fileids(root, fileids)
File "/Library/Python/2.7/site-packages/nltk/corpus/reader/util.py", line 763, in find_corpus_fileids
if re.match(regexp, prefix+fileid)]
File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 141, in match
return _compile(pattern, flags).match(string)
File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
error: nothing to repeat
The answer you refer to contains some very poor (or rather, inapplicable) advice. There is no reason to place your own corpus in nltk_data, or to hack nltk.corpus.__init__.py to load it like a native corpus. In fact, do not do these things.
You should use PlaintextCorpusReader. I don't understand your reluctance to do so, but if your files are plain text, it's the right tool to use. Supposing you have a folder NLP/bettertrainingdata, you can build a reader that will load all .txt files in this folder like this:
myreader = nltk.corpus.reader.PlaintextCorpusReader(r"NLP/bettertrainingdata", r".*\.txt")
If you add new files to the folder, the reader will find and use them. If what you want is to be able to use your script with other folders, then just do so; you don't need a different reader, you need to learn about sys.argv. If you are after a categorized corpus with pos.txt and neg.txt, then you need a CategorizedPlaintextCorpusReader (see the sketch below). If it's something else you want, then please edit your question to explain what you are trying to do.
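A minimal sketch of that categorized variant (assuming pos.txt and neg.txt sit directly inside short_reviews/, and deriving the category from the file name via cat_pattern):

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader(
    r'short_reviews',               # corpus root
    r'(pos|neg)\.txt',              # which files belong to the corpus
    cat_pattern=r'(pos|neg)\.txt')  # category = the captured part of the file name

print(reader.categories())                  # ['neg', 'pos']
pos_words = reader.words(categories='pos')  # tokens from the positive file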
Here is the code:
import nltk
import string
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
path = '/opt/datacourse/data/parts'
token_dict = {}
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        shakes = open(file_path, 'r')
        text = shakes.read()
        lowers = text.lower()
        no_punctuation = lowers.translate(None, string.punctuation)
        token_dict[file] = no_punctuation
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())
After running it, I got:
File "D:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 751, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
According to others' replies, I've checked text.py and confirmed that min_df = 1 in __init__.
Can anyone tell me what the problem is? Much appreciated.
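One quick sanity check (a sketch, not part of the original post): if path doesn't exist, os.walk() yields nothing, token_dict stays empty, and fit_transform() over an empty collection raises exactly this "empty vocabulary" error, so it is worth printing what was actually collected before vectorizing.

docs = list(token_dict.values())
print(len(docs))                   # 0 means no files were read at all
print([d[:80] for d in docs[:3]])  # peek at the first few documents
tfs = tfidf.fit_transform(docs)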