Searching for plant names in a text file using Python

I want to find all the plant names that appear in a text file I created with the code below. I do not have a list of plant names or a specific plant name to search for. Is there a way to find and display every plant name in the text file?
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

f = open("5.txt")
text = f.read()
f.close()  # close() must be called; the bare "f.close" does nothing

new_text = word_tokenize(text)
with open("V Token.txt", "w") as f:
    for w in new_text:
        print(w, file=f)
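The question has no list of plant names to match against, so one possible approach (a minimal sketch, not from the original post) is to check each token's WordNet hypernyms and keep the words whose noun senses descend from the botanical "plant" synset. This needs the NLTK wordnet corpus (nltk.download('wordnet')) and will only catch plant names that WordNet knows about:
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

# plant.n.02 is the botanical sense of "plant" in WordNet
plant_synset = wn.synset('plant.n.02')

def is_plant(word):
    # True if any noun sense of the word has the plant synset among its hypernyms
    for synset in wn.synsets(word, pos=wn.NOUN):
        if plant_synset in synset.closure(lambda s: s.hypernyms()):
            return True
    return False

with open("5.txt") as f:
    tokens = word_tokenize(f.read())

plant_names = sorted({w for w in tokens if is_plant(w.lower())})
print(plant_names)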

Related

How to loop over the files in a folder and delete stop words

I am currently working on the task of deleting stop words. This code runs, but I would like to ask how to change it into a loop, that is, to remove the stop words from every file in a folder instead of a single file. I think the change is around the "file1 = open(...)" statement, but I don't know how to make it. The code is attached below, thanks!
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

file1 = open(
    r"D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\2001\QTR1\20010102_10-K-A_edgar_data_1024302_0001092388-00-500453.txt")
line = file1.read()
file1.close()

words = word_tokenize(line)
words_without_stop_words = ["" if word in stop_words else word for word in words]
new_words = " ".join(words_without_stop_words).strip()

# overwrite the same file with the stop-word-free text
appendFile = open(
    r"D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\2001\QTR1\20010102_10-K-A_edgar_data_1024302_0001092388-00-500453.txt", 'w')
appendFile.write(new_words)
appendFile.close()
A simple fix is to list all the files in the folder and run the same code on each of them. The whole code is below:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import os

path = r"D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\2001\QTR1"
files = os.listdir(path)
stop_words = set(stopwords.words('english'))

for i in files:
    full_path = os.path.join(path, i)
    with open(full_path) as file1:
        line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = ["" if word in stop_words else word for word in words]
    new_words = " ".join(words_without_stop_words).strip()
    # overwrite the original file with the stop-word-free text
    with open(full_path, 'w') as appendFile:
        appendFile.write(new_words)
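If the reports are spread over several year/quarter subfolders (as the directory layout in the question suggests), a recursive glob is a compact alternative. This is a hedged sketch, not part of the original answer; the root folder is taken from the question:
import glob
import os

root = r"D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)"
for full_path in glob.glob(os.path.join(root, "**", "*.txt"), recursive=True):
    print(full_path)  # each file can then be processed exactly as in the loop above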

How to extract a keyword and its page number from a PDF file using NLP?

From the given PDF file, my code has to extract keywords, table names like Table 1 and Table 2, and titles in bold letters like INTRODUCTION and CASE PRESENTATION, from all pages.
I wrote a small program to extract the text from the PDF file:
punctuations = ['(', ')', ';', ':', '[', ']', ',', '^', '=', '-', '!', '.', '{', '}', '/', '#', '&']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
print(keywords)
and the output I got was a flat list of keyword tokens.
From that output, how do I extract keywords like INTRODUCTION, CASE PRESENTATION, and Table 1 along with their page numbers and save them in an output file?
Output Format
INTRODUCTION in Page 1
CASE PRESENTATION in Page 3
Table 1 (Descriptive Statistics) in Page 5
I need help obtaining output in this format.
Code
import PyPDF2
import textract
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def main():
    file_name = open("Test1.pdf", "rb")
    readpdf = PyPDF2.PdfFileReader(file_name)

    # Parse through each page to extract the texts
    pdfPages = readpdf.numPages
    count = 0
    text = ""
    # The while loop will read each page.
    while count < pdfPages:
        pageObj = readpdf.getPage(count)
        count += 1
        text += pageObj.extractText()

    # This if statement checks whether the library above returned words.
    # It's done because PyPDF2 cannot read scanned files.
    if text != "":
        text = text
    # If the above returns as False, we run the OCR library textract to
    # convert scanned/image-based PDF files into text.
    else:
        text = textract.process("Test1.pdf", method='tesseract', language='eng')

    # PRINT THE TEXT EXTRACTED FROM THE GIVEN PDF
    #print(text)

    # Break the text into individual words
    tokens = word_tokenize(text)
    #print('TOKENS')
    #print(tokens)

    # Clean out punctuation that is not required.
    punctuations = ['(', ')', ';', ':', '[', ']', ',', '^', '=', '-', '!', '.', '{', '}', '/', '#', '&']
    stop_words = stopwords.words('english')
    keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
    print(keywords)
If you want to know which page some text is on, then you shouldn't add everything to one string; you should work with every page separately (in a for loop).
It could be something similar to this. It is code without tesseract, which would need a way to split the PDF into separate pages; it works with every page separately:
pdfPages = readpdf.numPages

# create these before the loop
punctuations = ['(', ')', ';', ':', '[', ']', ',', '^', '=', '-', '!', '.', '{', '}', '/', '#', '&']
stop_words = stopwords.words('english')

#all_pages = []

# work with every page separately
for count in range(pdfPages):
    pageObj = readpdf.getPage(count)
    page_text = pageObj.extractText()
    page_tokens = word_tokenize(page_text)
    page_keywords = [word for word in page_tokens if word not in stop_words and word not in punctuations]
    page_uppercase_words = [word for word in page_keywords if word.isupper()]

    #all_pages.append( (count, page_keywords, page_uppercase_words) )

    print('page:', count)
    print('keywords:', page_keywords)
    print('uppercase:', page_uppercase_words)

    # TODO: append/save page to file
Issue partially resolved here: https://github.com/konfuzio-ai/document-ai-python-sdk/issues/6#issue-876036328
Check: https://github.com/konfuzio-ai/document-ai-python-sdk
# pip install konfuzio_sdk
# in working directory
# konfuzio_sdk init
from konfuzio_sdk.api import get_document_annotations
document_first_annotation = get_document_annotations(document_id=1111)[0]
page_index = document_first_annotation['bbox']['page_index']
keyword = document_first_annotation['offset_string']
The Annotation object in the Konfuzio SDK gives direct access to the keyword string but, at the moment, not to the page index. This attribute will be added soon.
An example of accessing the first annotation in the first training document of your project would be:
# pip install konfuzio_sdk
# in working directory
# konfuzio_sdk init
from konfuzio_sdk.data import Project
my_project = Project()
annotations_first_doc = my_project.documents[0].annotations()
first_annotation = annotations_first_doc[0]
keyword = first_annotation.offset_string
A combined version that reads each page separately with PyMuPDF (fitz):
import PyPDF2
import re
import os, sys
import nltk
import fitz  # PyMuPDF
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

punctuations = ['(', ')', ';', ':', '[', ']', ',', '^', '=', '-', '!', '.', '{', '}', '/', '#', '&']
stop_words = stopwords.words('english')

def main():
    file_name = open("File1.pdf", "rb")
    readPDF = PyPDF2.PdfFileReader(file_name)
    call_function(file_name, readPDF)

def call_function(fname, readpdf):
    pdfPages = readpdf.numPages
    # open the same file with PyMuPDF so each page's text can be read separately
    doc_name = fitz.open(fname.name)
    for pageno in range(pdfPages):
        page = word_tokenize(doc_name[pageno].get_text())
        page_texts = [word for word in page if word not in stop_words and word not in punctuations]
        print('Page Number:', pageno)
        print('Page Texts :', page_texts)

main()
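None of the snippets above writes the requested "keyword in Page N" format, so here is a minimal sketch of that last step (assuming readpdf, pdfPages, stop_words and punctuations as defined in the per-page loop earlier; the output file name is hypothetical, and multi-word headings such as CASE PRESENTATION will appear as separate tokens):
with open("keywords_by_page.txt", "w") as out:  # hypothetical output file
    for pageno in range(pdfPages):
        page_tokens = word_tokenize(readpdf.getPage(pageno).extractText())
        for word in page_tokens:
            # keep all-uppercase tokens such as INTRODUCTION
            if word.isupper() and word not in stop_words and word not in punctuations:
                out.write("%s in Page %d\n" % (word, pageno + 1))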

How do I tokenize text data into words and sentences without getting a TypeError

My end goal is to use NER models to identify custom entities. Before doing this, I am tokenizing the text data into words and sentences. I have a folder of text files (.txt) that I opened and read into Jupyter using the os library. After reading the text files, whenever I try to tokenize them, I get a type error. Could you please advise on what I am doing wrong? My code is below, thanks.
import os

outfile = open('result.txt', 'w')
path = "C:/Users/okeke/Documents/Work flow/IT Text analytics Project/Extract/Dubuque_text-nlp"
files = os.listdir(path)
for file in files:
    outfile.write(str(os.stat(path + "/" + file).st_size) + '\n')
outfile.close()
This code runs fine; whenever I evaluate outfile, I get the output below:
outfile
<_io.TextIOWrapper name='result.txt' mode='w' encoding='cp1252'>
Next, tokenization.
from nltk.tokenize import sent_tokenize, word_tokenize
sent_tokens = sent_tokenize(outfile)
print(outfile)
word_tokens = word_tokenize(outfile)
print(outfile)
But then I get an error after running the code above. The error is shown below:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-62f66183895a> in <module>
1 from nltk.tokenize import sent_tokenize, word_tokenize
----> 2 sent_tokens = sent_tokenize(outfile)
3 print(outfile)
4
5 #word_tokens = word_tokenize(text)
~\AppData\Local\Continuum\anaconda3\envs\nlp_course\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
93 """
94 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 95 return tokenizer.tokenize(text)
96
97 # Standard word tokenizer.
TypeError: expected string or bytes-like object
(moving comment to answer)
You are trying to process the file object instead of the text in the file. After you create the text file, re-open it and read the entire file before tokenizing.
Try this code:
import os

outfile = open('result.txt', 'w')
path = "C:/Users/okeke/Documents/Work flow/IT Text analytics Project/Extract/Dubuque_text-nlp"
files = os.listdir(path)
for file in files:
    with open(path + "/" + file) as f:
        outfile.write(f.read() + '\n')
        #outfile.write(str(os.stat(path + "/" + file).st_size) + '\n')
outfile.close()  # done writing

from nltk.tokenize import sent_tokenize, word_tokenize

with open('result.txt') as outfile:  # open for read
    alltext = outfile.read()         # read entire file

print(alltext)
sent_tokens = sent_tokenize(alltext)  # process file text. tokenize sentences
word_tokens = word_tokenize(alltext)  # process file text. tokenize words
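As a quick check (a short sketch, not part of the original answer), the resulting tokens can be inspected before moving on to the NER step:
print(sent_tokens[:3])   # first three sentences
print(word_tokens[:20])  # first twenty word tokens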

How to access and open files in a folder automatically and check their similarity with an input file in Python

I am making a desktop tool for plagiarism checking between documents. I use stop words, a TF-IDF vectorizer, etc., and use cosine similarity to check the similarity between two documents.
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

userinput1 = input("Enter file name:")
myfile1 = open(userinput1).read()
word1 = nltk.word_tokenize(myfile1)
filtered_sentence1 = [w for w in word1 if w not in stop_words]
print(filtered_sentence1)

userinput2 = input("Enter file name:")
myfile2 = open(userinput2).read()
word2 = nltk.word_tokenize(myfile2)
filtered_sentence2 = [w for w in word2 if w not in stop_words]
print(filtered_sentence2)

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

# remove punctuation, lowercase, stem
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(myfile1, myfile2):
    tfidf = vectorizer.fit_transform([myfile1, myfile2])
    return ((tfidf * tfidf.T).A)[0, 1]

print(cosine_sim(myfile1, myfile2))
But the problem is that I have to check the similarity of the user's input file against a number of files in a folder. I tried my best to access the folder and open the files automatically but did not succeed. Can anyone tell me how to access a folder containing files, open the files one by one, and compare each one with the input file? I am using Python 3.4.4 on Windows 7.
As per my understanding, you need to get all the files present in a directory/folder:
import os

path = 'path_to_the_directory'
fileList = os.listdir(path)
for eachFile in fileList:
    with open(os.path.join(path, eachFile), 'rb') as _fp:
        fileData = _fp.read()
        print("FILE DATA (%s):\n\n%s\n\n" % (_fp.name, fileData))
This will iterate through all the files in the directory and print each file's contents; replace the print with whatever processing you need on each file.
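Putting the two pieces together, here is a minimal sketch (assuming the cosine_sim, normalize and vectorizer definitions from the question, and a hypothetical folder path) of comparing one input file against every file in a folder:
import os

userinput = input("Enter file name:")
with open(userinput) as f:
    input_text = f.read()

folder = 'path_to_the_directory'  # hypothetical folder holding the documents to compare against
for name in os.listdir(folder):
    with open(os.path.join(folder, name)) as f:
        other_text = f.read()
    score = cosine_sim(input_text, other_text)  # cosine_sim as defined in the question
    print("%s -> similarity %.3f" % (name, score))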

How to create corpus from multiple docx files in Python

I have a folder that contains 10 docx files. I am trying to create a corpus, which should be a list of length 10, where each element is the text of one docx document.
I have the following function to extract text from docx files:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
import glob
from docx import Document

def getText(filename):
    document = Document(filename)
    newparatextlist = []
    for paragraph in document.paragraphs:
        newparatextlist.append(paragraph.text.strip().encode("utf-8"))
    return newparatextlist

path = 'path_to_folder/*.docx'
files = glob.glob(path)
corpus_list = []
for f in files:
    cur_corpus = getText(f)
    corpus_list.append(cur_corpus)
corpus_list[0]
However, if my Word documents have content like the following:
http://www.actus-usa.com/sampleresume.doc
https://www.myinterfase.com/sjfc/resources/resource_view.aspx?resource_id=53
the above function creates a list of lists. How can I simply create a corpus out of the files?
TIA!
I tried a somewhat different method for my problem, which also involved loading various docx files into a corpus. I made some slight changes to your code:
def getText(filename):
    doc = Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text.strip("\n"))
    return " ".join(fullText)

PATH = "path_to_folder/*.docx"
files = glob.glob(PATH)
corpus_list = []
for f in files:
    cur_corpus = getText(f)
    corpus_list.append(cur_corpus)
Hopefully this solves the problem!
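A quick sanity check on the result (a short sketch; assumes the folder really holds the 10 documents):
print(len(corpus_list))      # expected: 10, one string per document
print(corpus_list[0][:200])  # first 200 characters of the first document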
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus = PlaintextCorpusReader(ROOT_PATH, r'.*\.docx')
This creates a corpus from every file matching the pattern under ROOT_PATH. Note, however, that PlaintextCorpusReader reads files as plain text, so .docx files (which are zip archives) will not yield readable text unless they are converted to plain text first.
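If the PlaintextCorpusReader route is preferred, a hedged workaround (not in the original answers) is to export each .docx to a plain-text file first, using the getText helper from the previous answer (the one returning a single string), and point the reader at those .txt files:
import os, glob
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

TXT_ROOT = "path_to_folder/txt"  # hypothetical output folder for the converted files
os.makedirs(TXT_ROOT, exist_ok=True)
for f in glob.glob("path_to_folder/*.docx"):
    name = os.path.splitext(os.path.basename(f))[0] + ".txt"
    with open(os.path.join(TXT_ROOT, name), "w", encoding="utf-8") as out:
        out.write(getText(f))  # getText from the answer above returns one string per document

corpus = PlaintextCorpusReader(TXT_ROOT, r'.*\.txt')
print(corpus.fileids())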
