I'm writing a celebrity trivia quiz in Python that takes clues from Wikipedia.
I'm using the following code to split the paragraphs into sentences:
sentences = line.split(". ")
It works for everything except when there's a word that ends in a period in the sentence. For example, "XXX is a U.S. senator." gets incorrectly split into "XXX is a U.S."
I've created a list of exceptions where I remove the period from such words:
line = line.replace("Dr. ", "Dr ").replace("Mr. ", "Mr ").replace("Gen. ", "Gen ").replace("No. ", "No ").replace("U.S. ", "US ")
But for anything not in the list (e.g. "U.K." or "Inc."), the sentence gets stopped at the word ending in a period.
I'm not sure how else I can approach this. How can I preserve these words while still splitting into sentences?
This might work:
paragraphs = full_content.split("\n\n")
Where full_content is the data you want to split into paragraphs.
You can use a list of abbreviations that end in a period.
Store this list in a file; when a period belongs to a word found in the file, do not split there.
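A minimal sketch of the abbreviation-list idea, assuming a small hard-coded set (the `ABBREVIATIONS` contents and the `split_sentences` name are illustrative, not from the original posts; in practice the set could be loaded from a file). A sentence that genuinely ends in an abbreviation will not be split by this approach.

```python
# Hypothetical abbreviation list; in practice it could be read from a file.
ABBREVIATIONS = {"dr.", "mr.", "gen.", "no.", "u.s.", "u.k.", "inc."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after a word ending in '.', unless that word is a known abbreviation."""
    sentences = []
    current = []
    for word in text.split():
        current.append(word)
        if word.endswith(".") and word.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final period
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("XXX is a U.S. senator. He met Dr. Smith."))
```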
I know you can use noun extraction to get nouns out of sentences but how can I use sentence overlays/maps to take out phrases?
For example:
Sentence Overlay:
"First, #action; Second, Foobar"
"First, Dance and Code; Second, Foobar"
I want to return:
action = "Dance and Code"
Normal noun extraction won't work because the phrase won't always consist of nouns.
The way sentences are phrased differs, so it can't be words[x] ... because the positions of the words change.
You can slightly rewrite your string templates to turn them into regexps, and see which one (or which ones) match.
>>> import re
>>> template = "First, (?P<action>.*); Second, Foobar"
>>> mo = re.search(template, "First, Dance and Code; Second, Foobar")
>>> if mo:
...     print(mo.group("action"))
...
Dance and Code
You can even transform your existing strings into this kind of regexp (after escaping regexp metacharacters like .?*()).
>>> template = "First, #action; (Second, Foobar...)"
>>> re_template = re.sub(r"\\#(\w+)", r"(?P<\g<1>>.*)", re.escape(template))
>>> print(re_template)
First\,\ (?P<action>.*)\;\ \(Second\,\ Foobar\.\.\.\)
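The two steps above can be wrapped into one helper. This is only a sketch: `template_to_regex` and `extract` are hypothetical names, and the optional backslash in the substitution pattern is there because `re.escape` does not escape the same set of characters in every Python version.

```python
import re

def template_to_regex(template):
    """Turn '#name' placeholders in a template into named capture groups."""
    escaped = re.escape(template)
    # Accept '#' with or without a preceding backslash added by re.escape.
    return re.sub(r"\\?#(\w+)", r"(?P<\g<1>>.*)", escaped)

def extract(template, sentence):
    """Return a dict of placeholder values, or None if the sentence does not fit."""
    match = re.fullmatch(template_to_regex(template), sentence)
    return match.groupdict() if match else None

print(extract("First, #action; Second, Foobar",
              "First, Dance and Code; Second, Foobar"))
```

Trying several templates in turn and keeping the first one that does not return None gives the "see which one matches" behaviour described above.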
I have extracted a list of sentences from a document and am pre-processing it to make it more sensible. I am faced with the following problem:
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a lookup dictionary to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straightforward task in text preprocessing. I need some pointers on where to look for such approaches. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem using Google Ngram corpus:
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
['the', 'quick', 'brown', 'fox', 'jumped']
Using data, even this can be reordered:
$ python segmentation.py lmaoro fll olwt f pwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
As for your problem, you could just concatenate all the string parts you have into a single string and then run a segmentation algorithm on it.
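To make the idea concrete, here is a rough sketch of frequency-based segmentation using dynamic programming. It is not the implementation linked above: the `COUNTS` table is a tiny invented stand-in for real corpus frequencies, and `word_cost` and `segment` are hypothetical names.

```python
import math

# Toy unigram counts, invented for illustration only.
COUNTS = {"the": 500, "quick": 20, "brown": 15, "fox": 10, "jumped": 8,
          "he": 200, "qui": 1, "own": 30, "jump": 12, "ed": 2}
TOTAL = sum(COUNTS.values())

def word_cost(word):
    """Negative log probability; unknown words get a penalty that grows with length."""
    if word in COUNTS:
        return -math.log(COUNTS[word] / TOTAL)
    return math.log(TOTAL) + len(word) * math.log(10)

def segment(text, max_word_len=20):
    """Drop all whitespace, then find the cheapest split into words."""
    text = "".join(text.split()).lower()
    # best[i] = (total cost, start of last word) for the cheapest split of text[:i]
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        best.append(min((best[j][0] + word_cost(text[j:i]), j)
                        for j in range(max(0, i - max_word_len), i)))
    words = []
    i = len(text)
    while i > 0:  # walk the back-pointers from the end of the string
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]

print(segment("t hequi ckbrownfoxjum ped"))
```

With a realistic frequency table the same routine would handle the broken-sentence case, since the original spacing is discarded before segmenting.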
Your goal is to improve text, not necessarily to make it perfect; so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: Start with the first fragment and stick pieces to it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you'll make a mistake with cases like the me thod, so if you'll be using this a lot, you could look for something more sophisticated. However, it's probably good enough.
Mainly what you require is a large dictionary. If you'll be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out if a fragment is the start of a real word. The nltk provides a Trie implementation.
Since these kinds of spurious word breaks are inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it's broken up.
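A small sketch of the greedy idea, assuming a plain set of prefixes as a stand-in for a real trie. The names `build_prefixes` and `greedy_rejoin` and the tiny word list are made up for illustration, and the example only handles trailing punctuation.

```python
PUNCT = ",.!?;:"

def build_prefixes(words):
    """Every prefix of every dictionary word: a simple stand-in for a trie."""
    return {word[:i] for word in words for i in range(1, len(word) + 1)}

def greedy_rejoin(text, prefixes):
    """Glue fragments together while the result could still grow into a word."""
    result = []
    current = ""
    for fragment in text.split():
        candidate = current + fragment
        if current and candidate.rstrip(PUNCT).lower() in prefixes:
            current = candidate          # still a word, or the start of one
        else:
            if current:
                result.append(current)   # spit out what we have so far
            current = fragment           # start over with this fragment
    if current:
        result.append(current)
    return " ".join(result)

# Tiny illustrative dictionary; a real one would be far larger.
words = {"more", "recently", "the", "development", "which", "is", "a", "potent"}
print(greedy_rejoin("more recen t ly the develop ment, wh ich is a po ten t",
                    build_prefixes(words)))
```

With a large dictionary this will sometimes glue too much (the "the me thod" case mentioned above), which is the price of keeping it simple.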
--Solution 1:
Let's think of the chunks in your sentence as beads on an abacus, each bead consisting of a partial string. The beads can be moved left or right to generate the permutations, and the position of each fragment is fixed between its two adjacent fragments.
In the current case, the beads would be:
This solves 2 subproblems:
a) A bead is a single unit, so we do not care about permutations within the bead, i.e. permutations of "more" are not possible.
b) The order of the beads is constant; only the spacing between them changes, i.e. "more" will always come before "recen" and so on.
Now generate all the permutations of these beads, which will give output like:
morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent
Then score these permutations by how many words from your relevant dictionary they contain, so that the most correct results can easily be filtered out.
"more recently the development, which is a potent" will score higher than "morerecentlythedevelop ment, wh ich is a po ten t".
Code which does the permutation part of the beads:
import re

def gen_abacus_perms(frags):
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]

    # join the first two fragments both without and with a space
    prefix_1 = "{0}{1}".format(frags[0], frags[1])
    prefix_2 = "{0} {1}".format(frags[0], frags[1])
    if len(frags) == 2:
        nres = [prefix_1, prefix_2]
        return nres

    # combine both prefixes with every permutation of the remaining fragments
    rem_perms = gen_abacus_perms(frags[2:])
    res = ["{0}{1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0} {1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0}{1}".format(prefix_2, x) for x in rem_perms] + \
          ["{0} {1}".format(prefix_2, x) for x in rem_perms]
    return res

broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split(r"\s+", broken)
perms = gen_abacus_perms(frags)
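The scoring step described earlier is not shown in the code above. One possible sketch follows, assuming a small illustrative dictionary and a plain count of dictionary words as the score; `candidates`, `score` and the word set are hypothetical, and the candidates are generated here with itertools rather than the recursive function so that the example is self-contained.

```python
from itertools import product

def candidates(frags):
    """Yield every way of joining adjacent fragments with '' or ' '."""
    if not frags:
        return
    for seps in product(("", " "), repeat=len(frags) - 1):
        parts = [frags[0]]
        for sep, frag in zip(seps, frags[1:]):
            parts.append(sep + frag)
        yield "".join(parts)

def score(candidate, dictionary):
    """Count tokens that are dictionary words, ignoring surrounding punctuation."""
    return sum(token.strip(",.!?;:").lower() in dictionary
               for token in candidate.split())

# Tiny illustrative dictionary; a real one would be far larger.
dictionary = {"more", "recently", "the", "development", "which", "is", "a", "potent"}
broken = "more recen t ly the develop ment, wh ich is a po ten t"
best = max(candidates(broken.split()), key=lambda c: score(c, dictionary))
print(best)
```

The number of candidates doubles with every fragment (8192 for this sentence), so this is only practical for short inputs.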
I would suggest an alternative approach that makes use of the text-analysis intelligence already developed by people who work on similar problems over large corpora with dictionaries and grammars, e.g. search engines.
I am not aware of public or paid APIs for this, so my example is based on Google results.
Let's try to use Google:
You can keep putting your invalid terms to Google, for multiple passes, and keep evaluating the results for some score based on your lookup dictionary.
Here are two relevant outputs from 2 passes over your text:
This output is used for a second pass:
Which gives you the conversion "more recently the development, which is a potent".
To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid / not so good results.
One raw technique could be using a comparison of normalized strings using difflib.
>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
I would recommend stripping away the spaces and looking for dictionary words to break the text into. There are a few things you can do to make this more accurate. To get the first word of a text with no spaces, take the entire string and go through the dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), trying the longest candidates first and then taking letters off the end of the string you want to segment. If you want it to work on a long string, you can make it automatically take letters off the back so that the string in which you look for the first word is only as long as the longest dictionary word. This should find the longest words first, making it less likely to do something like split "asynchronous" into "a synchronous". Here is an example that reads the text to correct from user input and uses a dictionary file called dictionary.txt:
with open("dictionary.txt") as f:  # loads a file with a list of words, one per line
    dictionary = set(line.strip() for line in f)
words = input("enter text to correct spaces on: ")
words = "".join(words.split())  # removes all whitespace
spaced = []  # this is the list of newly broken-up words
while words:  # loops until all of the text has been broken into words
    # goes through each possible word length, starting from the biggest (45 letters)
    for length in range(min(45, len(words)), 0, -1):
        word = words[:length]  # each iteration, one more letter is removed from the back
        # a single leftover character is accepted so that the loop always advances
        if word in dictionary or length == 1:
            spaced.append(word)
            words = words[length:]  # takes the word away from the front of the text
            break
print(' '.join(spaced))  # prints the output
If you want it to be even more accurate, you could try a natural language parsing library; several are freely available for Python.
Here's something really basic:
chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print(joined, end=' ')
        del chunks[:]
# deal with left overs
if chunks:
    print(''.join(chunks))
I assume you have a set of valid words somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here's one way to do that:
def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words
You can iterate through a dictionary of words to find the best fit, adding the fragments together when a match is not found.
def iterate(word, dictionary):
    # True when the candidate, ignoring trailing punctuation, is a dictionary word
    return word.rstrip(",.!?;:").lower() in dictionary

sentence = "more recen t ly the develop ment, wh ich is a po ten t "
finished_sentence = []
fragments = sentence.split()
i = 0
while i < len(fragments):
    word = fragments[i]
    i += 1
    # no match yet: keep adding the following fragments until there is one
    while not iterate(word, dictionary) and i < len(fragments):
        word += fragments[i]
        i += 1
    finished_sentence.append(word)
print(" ".join(finished_sentence))
This should work for simple cases, although it stops at the first match, so "recen t" becomes "recent" instead of waiting for "recently". For the variable dictionary, download a text file of English words and load it into a set in your program.
My index.py file looks like this:
from wordsegment import load, segment
My index.php file looks like this:
<title>py script</title>
<h1>Hey There! Python Working Successfully In A PHP Page.</h1>
<?php
$python = `python index.py`;
echo $python;
?>
Hope this works.
There are two sentences in "test_tweet1.txt"
#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
#mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
In "Personal.txt"
The Game (rapper)
The Notorious B.I.G.
The Undertaker
Tom Cruise
Tony Romo
Triple H
My code:
import re
popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()
array = []
count1 = 0
for line in file1:
    count1 = count1 + 1
    print("\n", count1, line)
    ltext1 = line.split(" ")
    for i, text in enumerate(ltext1):
        if text in rpopular_person:
            print(text)
    text2 = ' '.join(ltext1)
The code printed:
1 #francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
2 #mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
I tried to match word from "test_tweet1.txt" with "Personal.txt".
Expected result:
Any suggestions?
Your problem seems to be that rpopular_person is just a single string. Therefore, when you ask things like 'to' in rpopular_person, you get a value of True, because the characters 't', 'o' appear in sequence. I am assuming that the same goes for 'the' elsewhere in Personal.txt.
What you want to do is split up Personal.txt into individual words, the way you're splitting your tweets. You can also make the resulting list of words into a set, since that'll make your lookup much faster. Something like this:
people = set(popular_person.read().split())
It's also worth noticing that I'm calling split(), with no arguments. This splits on all whitespace--newlines, tabs, and so on. This way you get everything separately like you intend. Or, if you don't actually want this (since this will give you results of "The" all the time based on your edited contents of Personal.txt), make it:
people = set(popular_person.read().split('\n'))
This way you split on newlines, so you only look for full name matches.
You're not getting "Romo" because that's not a word in your tweet. The word in your tweet is "Romo." with a period. This is very likely to remain a problem for you, so what I would do is rearrange your logic (assuming speed isn't an issue). Rather than looking at each word in your tweet, look at each name in your Personal.txt file, and see if it's in your full tweet. This way you don't have to deal with punctuation and so on. Here's how I'd rewrite your functionality:
with open("Personal.txt") as p:
    # Get full names rather than partial names, skipping blank lines
    people = [name for name in p.read().split('\n') if name]
with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            if person in tweet:
                print(person)
You need to split rpopular_person so that it matches words instead of substrings:
rpopular_person = open('C:/Users/Personal.txt').read().split()
this gives:
The reason Romo isn't showing up is that your line split produces "Romo." with a period. Maybe you should look for each entry of rpopular_person in the lines, instead of the other way around. Maybe something like this:
popular_person = open('C:/Users/Personal.txt').read().split("\n")
file1 = open("C:/Users/test_tweet1.txt")
for count1, line in enumerate(file1, 1):
    print("\n", count1, line)
    for person in popular_person:
        # skip blank lines, since an empty string is "in" every line
        if person and person in line:
            print(person)
I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
abstract: "...long abstract here..."
highlights: [
    {
        concept: 'a word',
        start: 1,
        end: 10
    },
    {
        concept: 'cancer',
        start: 123,
        end: 135
    }
]
I am looping over each highlight, locating its start index in the abstract (the end doesn't really matter, as I just need a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tokenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option, but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it. Or should I somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?
You are right, the NLTK tokenizer is really what you should be using in this situation, since it is robust enough to delimit almost all sentences, including ones that end with a "quotation." You can do something like this (paragraph from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break  # avoid adding the same sentence twice
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
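highlightIndices = [100, 169]
position = 0  # where to resume searching for the next sentence
for sentence in sent_tokenize(paragraph):
    start = paragraph.find(sentence, position)  # true offset of this sentence
    end = start + len(sentence)
    if any(start <= index < end for index in highlightIndices):
        sentencesWithHighlights.append(sentence)
    position = end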
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
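If the highlight offsets need to be mapped to sentences repeatedly, the sentence start offsets can be computed once and searched with bisect. The sketch below uses a naive regex splitter as a stand-in for the NLTK tokenizer so that it has no dependencies; `sentence_spans` and `sentences_at` are hypothetical names.

```python
import bisect
import re

def sentence_spans(text):
    """(start, end) offsets of sentences, using a naive regex instead of NLTK."""
    return [m.span() for m in re.finditer(r"\S[^.!?]*[.!?]*", text)]

def sentences_at(text, indices):
    """Return the sentences that contain the given character offsets."""
    spans = sentence_spans(text)
    starts = [start for start, _ in spans]
    found = []
    for index in indices:
        pos = bisect.bisect_right(starts, index) - 1  # last sentence starting at or before index
        if pos >= 0 and index < spans[pos][1]:
            sentence = text[spans[pos][0]:spans[pos][1]]
            if sentence not in found:
                found.append(sentence)
    return found

paragraph = ("How does chickens harden over the acceptance? Chickens comprises coffee. "
             "Chickens crushes a popular vet next to the eater. Will chickens sweep "
             "beneath a project? Coffee funds chickens. Chickens abides against an "
             "ineffective drill.")
print(sentences_at(paragraph, [paragraph.index("vet"), paragraph.index("funds")]))
```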
I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).
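A Python rendering of that pattern might look like the sketch below. The helper name is hypothetical, the highlights are escaped with re.escape before being joined, and the returned sentences do not include their closing punctuation because the pattern only looks ahead for it.

```python
import re

def sentences_with_highlights(text, highlights):
    """Return sentences (without final punctuation) containing any highlight."""
    group = "|".join(re.escape(h) for h in highlights)
    pattern = re.compile(
        r"(?:[.!?]|^)\s*([^.!?]*(?:" + group + r")[^.!?]*?)(?=\s*[.!?])",
        re.IGNORECASE)
    return pattern.findall(text)

print(sentences_with_highlights(
    "Cats sleep. Dogs chase the vet daily! Birds sing.", ["vet"]))
```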
Another option (though it's tough to say how reliable it would be with variably defined text), would be to split the text into a list of sentences and test against them:
re.split(r'(?<=[?!.])\s{0,2}(?=[A-Z]|$)', text)