I am segmenting sentences for a text in Python using NLTK's PunktSentenceTokenizer(). However, many long sentences appear in an enumerated form, and in those cases I need to extract the sub-sentences.
Example:
The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.
The required output would be:
"The api allows the user to achieve following goals aXXXXX. ", "The api allows the user to achieve following goals bXXXXX." and "The api allows the user to achieve following goals cXXXXX. "
How can I achieve this goal?
To get the sub-sequences, you could use a regexp tokenizer.
An example of how to use it to split the sentence could look like this:
from nltk.tokenize.regexp import regexp_tokenize

str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
# Split on the "(a)", "(b)", ... markers; gaps=True returns the text between matches.
parts = regexp_tokenize(str1, r'\(\w\)\s*', gaps=True)
start_of_sentence = parts.pop(0)
for part in parts:
    print(" ".join((start_of_sentence, part)))
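If you also want to drop the colon and the stray commas so the result looks more like the required output, you could clean the pieces up first; a rough sketch (my addition, building on the snippet above):

import re
from nltk.tokenize.regexp import regexp_tokenize

str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
parts = regexp_tokenize(str1, r'\(\w\)\s*', gaps=True)
start_of_sentence = parts.pop(0).rstrip(': ')              # drop the trailing colon
parts = [re.sub(r'[\s,]+$', '', part) for part in parts]   # drop trailing commas/whitespace
for part in parts:
    print(" ".join((start_of_sentence, part)))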
I'll just skip over the obvious question (being: "What have you tried so far?"). As you may have found out already, PunktSentenceTokenizer isn't really going to help you here, since it will leave your input sentence in one piece.
The best solution depends heavily on how predictable your input is. The following will work on your example, but as you can see it relies on there being a colon and some commas. If they're not there, it won't help you.
import re
from nltk import PunktSentenceTokenizer

s = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
#sents = PunktSentenceTokenizer().tokenize(s)  # leaves the sentence in one piece

p = s.split(':')                    # p[0] is the shared start of the sentence
for l in p[1:]:
    i = l.split(',')                # one chunk per enumerated item
    for j in i:
        j = re.sub(r'\([a-z]\)', '', j).strip()  # strip the "(a)", "(b)", ... markers
        print("%s: %s" % (p[0], j))
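For the example string, this should print something like:
The api allows the user to achieve following goals: aXXXXXX
The api allows the user to achieve following goals: bXXXX
The api allows the user to achieve following goals: cXXXXX.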
I am trying to split sentences into clauses using spaCy, for classification with MLlib. I have looked into two solutions that I consider the best ways to approach this, but haven't had much luck.
Option 1 would be to use the token attributes in the doc, i.e. token.pos_, match on SCONJ, and split into a new sentence there.
Option 2 would be to create a list using whatever spaCy has as a dictionary of values it identifies as SCONJ.
The issue with 1 is that I only have .text and .i, and no .pos_, as the custom-boundaries component (as far as I am aware) needs to be run before the parser.
The issue with 2 is that I can't seem to find the dictionary. It is also a really hacky approach.
import deplacy
import spacy
from spacy.language import Language

# Uncomment to visualise how the tokens are labelled
# deplacy.render(doc)

custom_EOS = ['.', ',', '!', '!']
custom_conj = ['then', 'so']

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in custom_EOS:
            doc[token.i + 1].is_sent_start = True
        if token.text in custom_conj:
            doc[token.i].is_sent_start = True
    return doc

def set_sentence_breaks(doc):
    # Attempted option 1 -- never registered or added to the pipeline below
    for token in doc:
        if token == "SCONJ":
            doc[token.i].is_sent_start = True

def main():
    text = "In the add user use case, we need to consider speed and reliability " \
           "so use of a relational DB would be better than using SQLite. Though " \
           "it may take extra effort to convert #Bot"
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("set_custom_boundaries", before="parser")
    doc = nlp(text)
    # for token in doc:
    #     print(token.pos_)
    print("Sentences:", [sent.text for sent in doc.sents])

if __name__ == "__main__":
    main()
Current Output
Sentences: ['In the add user use case,',
'we need to consider speed and reliability',
'so the use of a relational DB would be better than using SQLite.',
'Though it may take extra effort to convert #Bot']
I would recommend not trying to do anything clever with is_sent_start - while it is user-accessible, it's really not intended to be used in that way, and there is at least one unresolved issue related to it.
Since you just need these divisions for some other classifier, it's enough for you to just get the string, right? In that case I recommend you run the spaCy pipeline as usual and then split sentences on SCONJ tokens (if just using SCONJ is working for your use case). Something like:
out = []
for sent in doc.sents:
    last = sent[0].i
    for tok in sent:
        if tok.pos_ == "SCONJ":
            out.append(doc[last:tok.i])  # everything since the previous break, excluding the SCONJ
            last = tok.i + 1
    out.append(doc[last:sent[-1].i + 1])  # the remainder of the sentence (+ 1 keeps its final token)
Alternately, if that's not good enough, you can identify subsentences using the dependency parse to find verbs in subsentences (by their relation to SCONJ, for example), saving the subsentences, and then adding another sentence based on the root.
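As a very rough sketch of that dependency-parse idea (my own heuristic, not the answer's exact method, and assuming the model tags the connective as SCONJ):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We need a relational DB because SQLite is not reliable enough.")

for sent in doc.sents:
    for tok in sent:
        if tok.pos_ == "SCONJ":
            # tok.head is usually the verb of the clause the SCONJ introduces;
            # its left/right edges span that verb's subtree, i.e. the subsentence.
            clause = doc[tok.head.left_edge.i : tok.head.right_edge.i + 1]
            print("subsentence:", clause.text)
    print("sentence root:", sent.root.text)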
I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true:
[ "this is some text written in English",
"this is some more text written in English",
"Ce n'est pas en anglais" ]
For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that will let me recognize whether a string is in English or not. Is this something that is not offered as functionality in either NLTK or scikit-learn? EDIT: I've seen questions like this and this, but both are for individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?
I'm using Python, so libraries that are in Python would be preferable, but I can switch languages if needed, just thought that Python would be the best for this.
There is a library called langdetect. It is ported from Google's language-detection available here:
https://pypi.python.org/pypi/langdetect
It supports 55 languages out of the box.
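A minimal usage sketch; detect and detect_langs are langdetect's functions, and seeding DetectorFactory just makes the results repeatable:

from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # langdetect is non-deterministic by default

print(detect("this is some text written in English"))  # expected: 'en'
print(detect("Ce n'est pas en anglais"))               # expected: 'fr'
print(detect_langs("Ce n'est pas en anglais"))         # e.g. [fr:0.99...]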
You might be interested in my paper The WiLI benchmark dataset for written
language identification. I also benchmarked a couple of tools.
TL;DR:
CLD-2 is pretty good and extremely fast
lang-detect is a tiny bit better, but much slower
langid is good, but CLD-2 and lang-detect are much better
NLTK's Textcat is neither efficient nor effective.
You can install lidtk and classify languages:
$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"
fra
A pretrained fastText model worked best for my similar needs
I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help from Rabash's answer part 7 HERE.
After experimenting to find what worked best for my needs, which were making sure 60,000+ text files were in English, I found that fastText was an excellent tool.
With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.
import fasttext

class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        # that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            # an array of lines from a file. The two list comprehensions
            # below just clean up the lines in fla.
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Language-predict each line of the file.
                language_tuple = self.model.predict(line)
                # The next two lines simply get at the top language prediction
                # string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]

                # Each top language prediction for the lines in the file
                # becomes a unique key for the this_D dictionary.
                # Every time that language is found, add the confidence
                # score to the running tally for that language.
                if prediction not in this_D:
                    this_D[prediction] = 0
                this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # Calculate the relative confidence of the max confidence against all
        # confidence scores, then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]

        # Only want to know if this is English or not.
        return max_key == 'en'
Below is the application / instantiation and use of the above class for my needs.
file_list = # some tool to get my specific list of files to check for English
en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
This is what I've used some time ago.
It works for texts of at least 3 words with at most 4 unrecognized words (per the settings below).
Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.
from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]  # words enchant does not recognize
    return len(errors) <= max_error_count and len(quote.split()) >= min_text_length
print(is_in_english('“中文”'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
Use the enchant library
import enchant

dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
dictionary.check("Hello")  # returns True
dictionary.check("Helo")   # returns False
This example is taken directly from their website
If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:
http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/
If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
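A toy sketch of the trigram-profile idea (mine, not the linked recipe); the English reference text below is only a placeholder, and in practice you would build the profile from a large English sample:

from collections import Counter
from math import sqrt

def trigram_profile(text):
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Placeholder reference profile; use a large English sample in practice.
english = trigram_profile("this is some text written in English "
                          "this is some more text written in English")

print(cosine(english, trigram_profile("this is some more text written in English")))
print(cosine(english, trigram_profile("Ce n'est pas en anglais")))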
import enchant

def check(text):
    dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
    # True only if every word in the text is recognized by the dictionary.
    return all(dictionary.check(word) for word in text.split())
I have a huge number of names from different sources.
I need to extract all the groups (parts of the names) that repeat from one name to another.
In the example below, the program should locate: Post, Office, Post Office.
I need a popularity count, so I want to extract a list of phrases sorted by popularity.
Here is an example of names:
Post Office - High Littleton
Post Office Pilton Outreach Services
Town Street Post Office
post office St Thomas
Basically I need to find an algorithm, or better, a library, to get results like these:
Post Office: 16999
Post: 17934
Office: 16999
Tesco: 7300
...
Here is the full example of names.
I wrote code which is fine for single words, but not for multi-word phrases:
from textblob import TextBlob
import operator

title_file = open("names.txt", 'r')
blob = TextBlob(title_file.read())
counts = sorted(blob.word_counts.items(), key=operator.itemgetter(1))
print counts
You are not looking for clustering (and that is probably why "all of them suck" for @andrewmatte).
What you are looking for is word counting (or more precisely, n-gram counting), which is actually a much easier problem. That is why you won't find a dedicated library for it...
Well, actually you have some libraries. In Python, for example, the collections module has the Counter class, which covers much of the reusable code.
Untested, very basic code:
from collections import Counter

counter = Counter()
for s in sentences:
    words = s.split(" ")
    for i in range(len(words)):
        counter[words[i]] += 1                      # count single words
        if i > 0:
            counter[(words[i - 1], words[i])] += 1  # count adjacent word pairs (bigrams)
You can get the most frequent entries with counter.most_common(). If you want words and word pairs separate, feel free to use two counters. If you need longer phrases, add an inner loop. You may also want to clean the sentences (e.g. lowercase them) and use a regexp for splitting.
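A quick, hypothetical run on the names from the question (lowercasing first, as suggested):

from collections import Counter

names = [
    "Post Office - High Littleton",
    "Post Office Pilton Outreach Services",
    "Town Street Post Office",
    "post office St Thomas",
]

counter = Counter()
for s in names:
    words = s.lower().split()
    for i in range(len(words)):
        counter[words[i]] += 1
        if i > 0:
            counter[(words[i - 1], words[i])] += 1

print(counter.most_common(3))
# e.g. [('post', 4), ('office', 4), (('post', 'office'), 4)]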
Are you looking for something like this?
workspace = {}
with open('names.txt', 'r') as f:
    for name in f:
        if len(name):  # makes sure the line isn't empty
            if name in workspace:
                workspace[name] += 1
            else:
                workspace[name] = 1

for name in workspace:
    print "{}: {}".format(name, workspace[name])
I am trying to write a spellchecker and I wanted to use difflib to implement it. Basically I have a list of technical terms that I added to the standard unix dictionary (/usr/share/dict/words) that I'm storing in a file I call dictionaryFile.py.
I have another script just called stringSim.py where I import the dictionary and test sample strings against it. Here is the basic code:
import os, sys
import difflib
import time
from dictionaryFile import wordList

inputString = "dictiunary"
print "Search query: " + inputString

startTime = time.time()
inputStringSplit = inputString.split()
for term in inputStringSplit:
    termL = term.lower()
    print "Search term: " + term
    # get_close_matches returns the best matches (3 by default) above a cutoff
    closeMatches = difflib.get_close_matches(termL, wordList)
    if closeMatches[0] == termL:
        print "Perfect Match"
    else:
        print "Possible Matches"
        print "\n".join(closeMatches)

print time.time() - startTime, "seconds"
It returns the following:
$ python stringSim.py
Search query: dictiunary
Search term: dictiunary
Possible Matches
dictionary
dictionary's
discretionary
0.492614984512 seconds
I'm wondering if there are better strategies I could be using for looking up similar matches (assuming a word is misspelled). This is for a web application so I am trying to optimize this part of the code to be a little snappier. Is there a better way I could structure the wordList variable (right now it is just a list of words)?
Thanks.
I'm not sure difflib is the best solution for this kind of work; typically, spellcheckers use some sort of edit distance, e.g. Levenshtein distance. NLTK includes an implementation of edit distance, so I'd start there instead.
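A minimal sketch of that suggestion; nltk's edit_distance is real, but the candidate word list and the ranking loop below are only illustrative:

from nltk.metrics.distance import edit_distance

wordList = ["dictionary", "dictionary's", "discretionary", "direction"]  # stand-in for your word list
term = "dictiunary"

# Rank dictionary words by Levenshtein distance to the (possibly misspelled) term.
ranked = sorted(wordList, key=lambda w: edit_distance(term, w))
print(ranked[:3])  # closest candidates first, e.g. ['dictionary', ...]

For a large word list you would still want to prune candidates first (for example by length or first letter), since this scores every word.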
I have a python list that contains about 700 terms that I would like to use as metadata for some database entries in Django. I would like to match the terms in the list against the entry descriptions to see if any of the terms match but there are a couple of issues. My first issue is that there are some multiword terms within the list that contain words from other list entries. An example is:
Intrusion
Intrusion Detection
I have not gotten very far with re.findall as it will match both Intrusion and Intrusion Detection in the above example. I would only want to match Intrusion Detection and not Intrusion.
Is there a better way to do this type of matching? I thought about trying NLTK, but it didn't look like it could help with this type of matching.
Edit:
So to add a little more clarity, I have a list of 700 terms such as firewall or intrusion detection. I would like to try to match these words in the list against descriptions that I have stored in a database to see if any match, and I will use those terms in metadata. So if I have the following string:
There are many types of intrusion detection devices in production today.
and if I have a list with the following terms:
Intrusion
Intrusion Detection
I would like to match 'intrusion detection', but not 'intrusion'. Really I would like to also be able to match singular/plural instances too, but I may be getting ahead of myself. The idea behind all of this is to take all of the matches and put them in a list, and then process them.
If you need more flexibility to match entry descriptions, you can combine nltk and re
from nltk.stem import PorterStemmer
import re
Let's say you have different descriptions of the same event, e.g. a rewrite of the system. You can use nltk.stem to capture rewrite, rewriting, rewrites, singular and plural forms, etc.
master_list = [
'There are many types of intrusion detection devices in production today.',
'The CTO approved a rewrite of the system',
'The CTO is about to approve a complete rewrite of the system',
'The CTO approved a rewriting',
'Breaching of Firewalls'
]
terms = [
'Intrusion Detection',
'Approved rewrite',
'Firewall'
]
stemmer = PorterStemmer()
# for each term, split it into words (could be just one word) and stem each word
stemmed_terms = ((stemmer.stem(word) for word in s.split()) for s in terms)
# add 'match anything after it' expression to each of the stemmed words
# join result into a pattern string
regex_patterns = [''.join(stem + '.*' for stem in term) for term in stemmed_terms]
print(regex_patterns)
print('')
for sentence in master_list:
    match_obs = (re.search(pattern, sentence, flags=re.IGNORECASE) for pattern in regex_patterns)
    matches = [m.group(0) for m in match_obs if m]
    print(matches)
Output:
['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']
['intrusion detection devices in production today.']
['approved a rewrite of the system']
['approve a complete rewrite of the system']
['approved a rewriting']
['Firewalls']
EDIT:
To see which of the terms caused the match:
for sentence in master_list:
    # regex_patterns maps directly onto terms (strictly speaking it's one-to-one and onto)
    for term, pattern in zip(terms, regex_patterns):
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            # process term (put it in the db)
            print('TERM: {0} FOUND IN: {1}'.format(term, sentence))
Output:
TERM: Intrusion Detection FOUND IN: There are many types of intrusion detection devices in production today.
TERM: Approved rewrite FOUND IN: The CTO approved a rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO is about to approve a complete rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO approved a rewriting
TERM: Firewall FOUND IN: Breaching of Firewalls
This question is unclear, but from what I understand you have a master list of terms, say one term per line. Next you have a list of test data, where some of the test data will be in the master list and some won't. You want to see if the test data is in the master list and, if it is, perform a task.
Assuming your Master List looks like this
Intrusion Detection
Firewall
FooBar
and your test data looks like this
Intrusion
Intrusion Detection
foo
bar
this simple script should lead you in the right direction
#!/usr/bin/env python
import sys

def main():
    '''usage: tester.py masterList testList'''
    # open files
    masterListFile = open(sys.argv[1], 'r')
    testListFile = open(sys.argv[2], 'r')

    # build master list
    # .strip() off the '\n' newline
    # lowercase everything: 'Intrusion' != 'intrusion', but it should match
    masterList = [line.strip().lower() for line in masterListFile]

    # run test
    for line in testListFile:
        term = line.strip().lower()
        if term in masterList:
            print term, "in master list!"
            # perhaps grab your metadata here, e.g. with a SQL LIKE '%...%'
        else:
            print "OH NO!", term, "not found!"

    # close files
    masterListFile.close()
    testListFile.close()

if __name__ == '__main__':
    main()
SAMPLE OUTPUT
OH NO! intrusion not found!
intrusion detection in master list!
OH NO! foo not found!
OH NO! bar not found!
There are several other ways to do this, but this should point you in the right direction. If your list is large (700 really isn't that large), consider using a dict; I feel they're quicker, especially if you plan to query a database. Perhaps a dictionary structure could look like {term: information about term}.
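A rough sketch of that dict idea; the file names and the metadata stored per term are placeholders:

masterDict = {}
with open('masterList.txt', 'r') as f:
    for line in f:
        term = line.strip().lower()
        # attach whatever metadata you want to keep about each term
        masterDict[term] = {'words': len(term.split())}

with open('testList.txt', 'r') as f:
    for line in f:
        term = line.strip().lower()
        if term in masterDict:  # O(1) membership test
            print(term, '->', masterDict[term])
        else:
            print('OH NO!', term, 'not found!')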