Sentence Transformers in Python: "[E1002] Span index out of range" - python

As a programming noob, I am trying to find similar sentences in several hundreds of newspaper articles. I have tried my code with a smaller text sample which has worked brilliantly. Now, with a larger text file (using the same code), I get the error code "[E1002] Span index out of range.".
This is my code so far:
!pip install spacy
import spacy
nlp = spacy.load('en_core_web_sm')
nlp.max_length = 2000000
with open('/content/BSE.txt', 'r', encoding="utf-8", errors="ignore") as f:
sentences_articles = f.read()
about_doc = nlp(sentences_articles)
sentences = list(about_doc.sents)
len(sentences)
sentences[:10]
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
import torch
embedder = SentenceTransformer('all-mpnet-base-v2')
corpus = sentences
corpus_embeddings = embedder.encode(corpus, show_progress_bar=True, batch_size = 128)
The progress bar stops at 94%, with error "[E1002] Span index out of range". I have used the .readlines() function, which worked, yet because of my text data's nature has produced unusable results (but no error!). I limited the number of words in each sentence, but that didn't help either. I tried several text data (different length, different content), but without success.
Any suggestions on how to fix this?

I had a similar problem with the same mistake, and for me it was solved after changing sentences from a list[Span] to list[str] as this is what .encode() requires. Instead of sentences = list(about_doc.sents), write sentences = list(sent.text for sent in about_doc.sents)

Related

How to search for a word in Word2Vec model

We were given an assignment to research codes and methods to solve "Author Name Disambiguation". I was trying to understand the code provided by "joe817" on GitHub, the repository's link is:
https://github.com/joe817/name-disambiguation
I installed all the requirements and was successful to run the first file "data processing.py", but the second file "DRLgru.py" shows me an error at line 43, saying the model (Word2Vec model) is not iterable. I googled the issue to find and helpful documentation but was not able to find any.
This is the error
Could someone please help me clear this error?
This is the code:
num_step = 20 #GRU时序个数
word_input = 100
paperid_title = {}
with open("gene/paper_title.txt",encoding = 'utf-8') as adictfile: #opening a file
for line in adictfile: #for loop on each line
toks = line.strip().split("\t") #First remove spaces with strip then split into tuple around \t
if len(toks) == 2:
paperid_title[toks[0]] = toks[1] # Assign a paper name to the id before it {'id' : 'paper_name'}
save_model_name = "gene/word2vec.model"
model = word2vec.Word2Vec.load(save_model_name) # Loading a pre-defined model
paper_vec={}
paper_len={}
for paperid in paperid_title: # looping on dictinory id's in paperid_title
split_cut = paperid_title[paperid].split() # make a list which contains each word of title
words_vec = []
for j in split_cut:
if (len(words_vec)<num_step) and (j in model):
words_vec.append(model[j])
I solved it (somewhat). The issue was related to using the newer version of the package, whereas the code was for the older version. So I used google collab to choose an older version of the package.

Kernel keeps dying when trying to build corpus in pandas

I ran this code in the past and it worked fine. A couple of months later, it keeps causing the kernel to die.
I reinstalled and updated all conda/python related files. It doesn't seem to matter. It stalls out on the last line, and no error message is printed out.
It worked once, and failed 7 of last 8 times.
corpus = df['reviewText']
import nltk
import re
nltk.download('stopwords')
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
def normalize_document(doc):
# lower case and remove special characters\whitespaces
doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I|re.A)
doc = doc.lower()
doc = doc.strip()
# tokenize document
tokens = wpt.tokenize(doc)
# filter stopwords out of document
filtered_tokens = [token for token in tokens if token not in stop_words]
# re-create document from filtered tokens
doc = ' '.join(filtered_tokens)
return doc
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(corpus)
Happy to hear any suggestions or ideas. If there is some way to display an error, or reason for the kernel dying, please let me know.
This seems to help:
# Get rid of accumulated garbage
import gc
gc.collect()

NLTK CorpusReader for Indian language

Trying to get NLTK to do analysis on a Punjabi corpus downloaded from an Indian government research site, the script is Gurmikhi. My primary goal is to get word frequency distributions on the entire corpus, so the aim here is to get all the words tokenized.
My issue seems to be with how NLTK is reading the text because when I use Python's built in methods:
with open("./Punjabi_Corpora/Panjabi_Monolingual_TextCorpus_Sample.txt", "r") as f:
lines = [line for line in f]
fulltxt = "".join(lines)
print(fulltxt.split)
Result (not perfect, but workable):
['\ufeffਜਤਿੰਦਰ', 'ਸਾਬੀ', 'ਜਲੰਧਰ,', '10', 'ਜਨਵਰੀ-ਦੇਸ਼-ਵਿਦੇਸ਼', 'ਦੇ',...]
However when using NLTK, as such:
from nltk.corpus import PlaintextCorpusReader
corpus_root = "./Punjabi_Corpora"
corpus = PlaintextCorpusReader(corpus_root,"Panjabi Monolingual_TextCorpus_Sample.txt")
corpus.words('Panjabi_Monolingual_TextCorpus_Sample.txt')
I get the following
['ਜਤ', 'ਿੰ', 'ਦਰ', 'ਸ', 'ਾ', 'ਬ', 'ੀ', 'ਜਲ', 'ੰ', 'ਧਰ', ...]
Here, NLTK thinks that each character glyph is a full word, I guess it's Indic script knowledge isn't quite there yet :)
From what I could surmise based on the NLTK docs, the issue has to do with the Unicode encoding, it seems there is some disagreement between the file and NLTK... I've been tinkering and Googling as far as I am able and have hit the wall.
Any ideas would be greatly appreciated!
You are right. According to the doc, PlainTextCorpusReader is a reader set for ascii inputs. So it is not surprising that it does not work properly.
I am not a pro on this subject, but I tried to use the IndianCorpusReader instead with your dataset and it seems working :
from nltk.corpus import IndianCorpusReader
corpus = IndianCorpusReader("./Punjabi_Corpora", "Panjabi Monolingual_TextCorpus_Sample.txt")
print(corpus.words('Panjabi Monolingual_TextCorpus_Sample.txt'))
And the output :
['ਜਤਿੰਦਰ', 'ਸਾਬੀ', 'ਜਲੰਧਰ', '10', 'ਜਨਵਰੀ-ਦੇਸ਼-ਵਿਦੇਸ਼', ...]
Tested on Python 3.

Cluttered, uninterpretable output from PytagCloud in Python

I am trying to create a tag cloud in python using pytagcloud and I am using the following code to generate it:
from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts
with open("fileName.txt") as file:
Data1 = file.read().lower()
Data = Data1.split()
Data = "%s " * len(Data) % tuple(Data)
tags = make_tags(get_tag_counts(Data), maxsize=150)
create_tag_image(tags, 'cloud_large.png', size=(1200, 800))
The code runs without errors (takes a while though) but the output file that it generates is quite cluttered and not easy to read. Here's the output file:
Why am I getting this weird unreadable matrix-like clutter in the center? How can I get rid of it?
The tag cloud doesn't appear to be in the center of the file, how can that be done?
Any help would be greatly appreciated.
P.S. - I am using Python 2.7
if it still relevant,
what i did to solve this was to add value to minsize parameter and filter out all the smallest words (which probably appears once in the text). i guess it happens because of explosion in the number of words.
my code looks like:
tags = make_tags(get_tag_counts(MY_TEXT), maxsize=120, minsize=5)
tags = [a for a in tags if a['size'] > 7]
create_tag_image(tags, 'images/cloud_large.png', size=(900, 600), fontname='Reenie Beanie', background=(0,0,0))
and the result:
i chose the values empirically.

Searching text in a PDF using Python? [duplicate]

This question already has answers here:
How to extract text from a PDF file?
(33 answers)
Closed 2 months ago.
Problem
I'm trying to determine what type a document is (e.g. pleading, correspondence, subpoena, etc) by searching through its text, preferably using python. All PDFs are searchable, but I haven't found a solution to parsing it with python and applying a script to search it (short of converting it to a text file first, but that could be resource-intensive for n documents).
What I've done so far
I've looked into pypdf, pdfminer, adobe pdf documentation, and any questions here I could find (though none seemed to directly solve this issue). PDFminer seems to have the most potential, but after reading through the documentation I'm not even sure where to begin.
Is there a simple, effective method for reading PDF text, either by page, line, or the entire document? Or any other workarounds?
This is called PDF mining, and is very hard because:
PDF is a document format designed to be printed, not to be parsed. Inside a PDF document,
text is in no particular order (unless order is important for printing), most of the time
the original text structure is lost (letters may not be grouped
as words and words may not be grouped in sentences, and the order they are placed in
the paper is often random).
There are tons of software generating PDFs, many are defective.
Tools like PDFminer use heuristics to group letters and words again based on their position in the page. I agree, the interface is pretty low level, but it makes more sense when you know
what problem they are trying to solve (in the end, what matters is choosing how close from the neighbors a letter/word/line has to be in order to be considered part of a paragraph).
An expensive alternative (in terms of time/computer power) is generating images for each page and feeding them to OCR, may be worth a try if you have a very good OCR.
So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files - if your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gambling.
I would really like to be proven wrong.
[update]
The answer has not changed but recently I was involved with two projects: one of them is using computer vision in order to extract data from scanned hospital forms. The other extracts data from court records. What I learned is:
Computer vision is at reach of mere mortals in 2018. If you have a good sample of already classified documents you can use OpenCV or SciKit-Image in order to extract features and train a machine learning classifier to determine what type a document is.
If the PDF you are analyzing is "searchable", you can get very far extracting all the text using a software like pdftotext and a Bayesian filter (same kind of algorithm used to classify SPAM).
So there is no reliable and effective method for extracting text from PDF files but you may not need one in order to solve the problem at hand (document type classification).
I am totally a green hand, but this script works for me:
# import packages
import PyPDF2
import re
# open the pdf file
reader = PyPDF2.PdfReader("test.pdf")
# get number of pages
num_pages = len(reader.pages)
# define key terms
string = "Social"
# extract text and do the search
for page in reader.pages:
rext = page.extract_text()
# print(text)
res_search = re.search(string, text)
print(res_search)
I've written extensive systems for the company I work for to convert PDF's into data for processing (invoices, settlements, scanned tickets, etc.), and #Paulo Scardine is correct--there is no completely reliable and easy way to do this. That said, the fastest, most reliable, and least-intensive way is to use pdftotext, part of the xpdf set of tools. This tool will quickly convert searchable PDF's to a text file, which you can read and parse with Python. Hint: Use the -layout argument. And by the way, not all PDF's are searchable, only those that contain text. Some PDF's contain only images with no text at all.
I recently started using ScraperWiki to do what you described.
Here's an example of using ScraperWiki to extract PDF data.
The scraperwiki.pdftoxml() function returns an XML structure.
You can then use BeautifulSoup to parse that into a navigatable tree.
Here's my code for -
import scraperwiki, urllib2
from bs4 import BeautifulSoup
def send_Request(url):
#Get content, regardless of whether an HTML, XML or PDF file
pageContent = urllib2.urlopen(url)
return pageContent
def process_PDF(fileLocation):
#Use this to get PDF, covert to XML
pdfToProcess = send_Request(fileLocation)
pdfToObject = scraperwiki.pdftoxml(pdfToProcess.read())
return pdfToObject
def parse_HTML_tree(contentToParse):
#returns a navigatibale tree, which you can iterate through
soup = BeautifulSoup(contentToParse)
return soup
pdf = process_PDF('http://greenteapress.com/thinkstats/thinkstats.pdf')
pdfToSoup = parse_HTML_tree(pdf)
soupToArray = pdfToSoup.findAll('text')
for line in soupToArray:
print line
This code is going to print a whole, big ugly pile of <text> tags.
Each page is separated with a </page>, if that's any consolation.
If you want the content inside the <text> tags, which might include headings wrapped in <b> for example, use line.contents
If you only want each line of text, not including tags, use line.getText()
It's messy, and painful, but this will work for searchable PDF docs. So far I've found this to be accurate, but painful.
Here is the solution that I found it comfortable for this issue. In the text variable you get the text from PDF in order to search in it. But I have kept also the idea of spiting the text in keywords as I found on this website: https://medium.com/#rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f from were I took this solution, although making nltk was not very straightforward, it might be useful for further purposes:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
def searchInPDF(filename, key):
occurrences = 0
pdfFileObj = open(filename,'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
pageObj = pdfReader.getPage(count)
count +=1
text += pageObj.extractText()
if text != "":
text = text
else:
text = textract.process(filename, method='tesseract', language='eng')
tokens = word_tokenize(text)
punctuation = ['(',')',';',':','[',']',',']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if not word in stop_words and not word in punctuation]
for k in keywords:
if key == k: occurrences+=1
return occurrences
pdf_filename = '/home/florin/Downloads/python.pdf'
search_for = 'string'
print searchInPDF (pdf_filename,search_for)
I agree with #Paulo PDF data-mining is a huge pain. But you might have success with pdftotext which is part of the Xpdf suite freely available here:
http://www.foolabs.com/xpdf/download.html
This should be sufficient for your purpose if you are just looking for single keywords.
pdftotext is a command line utility, but very straightforward to use. It will give you text files, which you may find easier to work with.
If you are on bash, There is a nice tool called pdfgrep,
Since, This is in apt repository, You can install this with:
sudo apt install pdfgrep
It had served my requirements well.
Trying to pick through PDFs for keywords is not an easy thing to do. I tried to use the pdfminer library with very limited success. It’s basically because PDFs are pandemonium incarnate when it comes to structure. Everything in a PDF can stand on its own or be a part of a horizontal or vertical section, backwards or forwards. Pdfminer was having issues translating one page, not recognizing the font, so I tried another direction — optical character recognition of the document. That worked out almost perfectly.
Wand converts all the separate pages in the PDF into image blobs, then you run OCR over the image blobs. What I have as a BytesIO object is the content of the PDF file from the web request. BytesIO is a streaming object that simulates a file load as if the object was coming off of disk, which wand requires as the file parameter. This allows you to just take the data in memory instead of having to save the file to disk first and then load it.
Here’s a very basic code block that should be able to get you going. I can envision various functions that would loop through different URL / files, different keyword searches for each file, and different actions to take, possibly even per keyword and file.
# http://docs.wand-py.org/en/0.5.9/
# http://www.imagemagick.org/script/formats.php
# brew install freetype imagemagick
# brew install PIL
# brew install tesseract
# pip3 install wand
# pip3 install pyocr
import pyocr.builders
import requests
from io import BytesIO
from PIL import Image as PI
from wand.image import Image
if __name__ == '__main__':
pdf_url = 'https://www.vbgov.com/government/departments/city-clerk/city-council/Documents/CurrentBriefAgenda.pdf'
req = requests.get(pdf_url)
content_type = req.headers['Content-Type']
modified_date = req.headers['Last-Modified']
content_buffer = BytesIO(req.content)
search_text = 'tourism investment program'
if content_type == 'application/pdf':
tool = pyocr.get_available_tools()[0]
lang = 'eng' if tool.get_available_languages().index('eng') >= 0 else None
image_pdf = Image(file=content_buffer, format='pdf', resolution=600)
image_jpeg = image_pdf.convert('jpeg')
for img in image_jpeg.sequence:
img_page = Image(image=img)
txt = tool.image_to_string(
PI.open(BytesIO(img_page.make_blob('jpeg'))),
lang=lang,
builder=pyocr.builders.TextBuilder()
)
if search_text in txt.lower():
print('Alert! {} {} {}'.format(search_text, txt.lower().find(search_text),
modified_date))
req.close()
This answer follows #Emma Yu's:
If you want to print out all the matches of a string pattern on every page.
(Note that Emma's code prints a match per page):
import PyPDF2
import re
pattern = input("Enter string pattern to search: ")
fileName = input("Enter file path and name: ")
object = PyPDF2.PdfFileReader(fileName)
numPages = object.getNumPages()
for i in range(0, numPages):
pageObj = object.getPage(i)
text = pageObj.extractText()
for match in re.finditer(pattern, text):
print(f'Page no: {i} | Match: {match}')
A version using PyMuPDF. I find it to be more robust than PyPDF2.
import fitz
import re
# load document
doc = fitz.open(filename)
# define keyterms
String = "hours"
# get text, search for string and print count on page.
for page in doc:
text = ''
text += page.getText()
print(f'count on page {page.number +1} is: {len(re.findall(String, text))}')
Example with pdfminer.six
from pdfminer import high_level
with open('file.pdf', 'rb') as f:
text = high_level.extract_text(f)
print(text)
Compared to PyPDF2, it can work with cyrillic

Categories