Finding words within a paragraph using Python [closed]

Let's say I have the following words, Test_wrds = ['she', 'her', 'women'], and I would like to see whether any of them are present in the following str paragraph:
text= "What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
The question is: how do I find these Test_wrds in text, bold them in different colours, and count how many times each of them appears in the paragraph? I am expecting output something like this:
text= " What recent discussions **she** has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on **women**.
So far, I have written the following code:
text=" Q: What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
Test_wrds = ['she', 'her','women']
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# word split
Wrd_token=[token.orth_ for token in doc]
I have no idea how to proceed from here. I used spaCy because I found it powerful and easy for my future coding.
Thanks in advance.

First of all, in order to count how many times each word from the Test_wrds list exists in text, you can use ORTH, which is the ID of the verbatim text content (see the spaCy attributes documentation).
import spacy
from spacy.lang.en import English
from spacy.attrs import ORTH
text=" Q: What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
Test_wrds = ['she', 'her','women']
nlp = English()
doc = nlp(text)
# Dictionary mapping each word's ID representation to the number of times that word appears in your text string
count_number = doc.count_by(ORTH)
for wid, number in sorted(count_number.items(), key=lambda x: x[1]):
    # nlp.vocab.strings[wid] gives the word corresponding to the ID
    if nlp.vocab.strings[wid] in Test_wrds:
        print(number, nlp.vocab.strings[wid])
Output:
1 she
1 women
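(As an aside, if you prefer plain Python to the ORTH table, collections.Counter over the token texts gives the same numbers; this is an added sketch reusing the doc from above, not part of the original answer.)
from collections import Counter

# Tally each token's lowercased text, then look up just the words of interest
counts = Counter(token.lower_ for token in doc)
for word in Test_wrds:
    print(counts[word], word)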
Second, in order to wrap each word in bold markers, you can try:
# Insert a space before '.' so the final word isn't glued to the full stop when splitting
text = text.replace('.', ' .')
lista = text.split()
for word in Test_wrds:
    if word in lista:
        indices = [i for i, j in enumerate(lista) if j == word]  # find the list indices of the word
        for index in indices:
            lista[index] = '**' + word + '**'
new_text = ' '.join(lista)
Output:
>>> new_text
'Q: What recent discussions **she** has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on **women** .'
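Both steps can also be done in one pass with a regular expression: \b word boundaries avoid the manual space-before-'.' trick and also prevent partial matches (e.g. 'her' inside 'there'). This is a sketch on the assumption that case-sensitive whole-word matching is what you want:
import re

Test_wrds = ['she', 'her', 'women']
text = "What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."

counts = {}
for word in Test_wrds:
    pattern = r'\b' + re.escape(word) + r'\b'         # whole-word matches only
    counts[word] = len(re.findall(pattern, text))     # count the occurrences
    text = re.sub(pattern, '**' + word + '**', text)  # bold them in place

print(counts)  # {'she': 1, 'her': 0, 'women': 1}
print(text)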

Related

Get python to look for words and output them [closed]

I've been looking into developing a Discord bot that can reply to messages by reading their contents and checking whether they appear in a list.
My problem is that I need a reliable way of getting Python to look for certain words in a text, check whether they appear in a given list, and output the words that are detected.
I've managed to get it working somewhat myself with the following code:
if any(word in text in list):
    print("Word Spotted")
I would really appreciate some help.
Here's some code that does something like what you're describing. But really it sounds like you need to spend a significant amount of time working through some basic Python tutorials before you will be able to implement this.
import re
key_words = set(['foo', 'bar', 'baz'])
typed_str = 'You are such a Foo BAR!'
print(key_words & set(re.findall('[a-z]+', typed_str.lower())))
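Wrapped up as a function (a small sketch along the same lines; the name find_keywords is hypothetical), this becomes easy to call from a bot's message handler:
import re

def find_keywords(message, keywords):
    """Return the set of keywords that occur as words in message, case-insensitively."""
    words = set(re.findall(r'[a-z]+', message.lower()))
    return words & {k.lower() for k in keywords}

print(find_keywords('You are such a Foo BAR!', ['foo', 'bar', 'baz']))  # {'foo', 'bar'}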
I'm not sure exactly what is being asked, but some things to consider (in no particular order) if you are building a bot that takes in raw user input:
capitalization sensitivity
spell check
understanding intent simplistically
If your environment allows access to libraries you might consider checking out TextBlob. The following commands will give you the functionality needed for the example below.
pip install textblob
python -m textblob.download_corpora
core function
from textblob import TextBlob, Word
import copy
def score_intent(rawstring, keywords, weights=None, threshold=0.01, debug=False):
    """
    rawstring: string of text with words that you want to detect
    keywords: list of words that you are looking for
    weights: (optional) dictionary with relative weights of words you want
    threshold: spellcheck confidence threshold
    debug: boolean for extra print statements to help debug
    """
    allwords = TextBlob(rawstring).words
    allwords = [w.upper() for w in allwords]
    keywords = [k.upper() for k in keywords]
    processed_input_as_list = spellcheck_subject_matter_specific(rawstring, keywords, threshold=threshold, debug=debug)
    common_words = intersection(processed_input_as_list, keywords)
    intent_score = len(common_words)
    if weights:
        for special_word in weights.keys():
            if special_word.upper() in common_words:
                # the minus one is so we don't double-count a word
                intent_score = intent_score + weights[special_word] - 1
    if debug:
        print("intent score: %s" % intent_score)
        print("words of interest found in text: {}".format(common_words))
    # you could return common_words and score intent based on the list:
    # return common_words, intent_score
    return common_words
utilities for intersection & spellchecking
def intersection(a, b):
    """
    a and b are lists
    function returns a list that is the intersection of the two
    """
    return list(set(a) & set(b))

def spellcheck_subject_matter_specific(rawinput, subject_matter_vector, threshold=0.01, capitalize=True, debug=False):
    """
    rawinput: all the text that you want to check for spelling
    subject_matter_vector: only the words that are worth spellchecking for (since the function can be sort of sensitive, it might correct words that you don't want to correct)
    threshold: the spell check confidence needed to update the word to the correct spelling
    capitalize: boolean determining if you want the return string to be capitalized
    """
    new_input = copy.copy(rawinput)
    for w in TextBlob(rawinput).words:
        spellchecked_vec = w.spellcheck()
        if debug:
            print("Word: %s" % w)
            print("Spellchecked Guesses & Confidences: %s" % spellchecked_vec)
            print("Only spellchecked confidences greater than {} and in this list {} will be included".format(threshold, subject_matter_vector))
        corrected_words = [z[0].upper() for z in spellchecked_vec if z[1] > threshold]
        important_words = intersection(corrected_words, subject_matter_vector)
        for new_word in important_words:
            new_input = new_input + ' ' + new_word
    inputBlob = TextBlob(new_input)
    processed_input = inputBlob.words
    if capitalize:
        processed_input = [word.upper() for word in processed_input]
    return processed_input
Usage Example
discord_str = "Hi, i want to talk about codee and pYtHon"
words2detect = ["python","code"]
score_intent(rawstring=discord_str,keywords=words2detect,threshold=0.01,debug=True)
output
intent score: 2
words of interest found in text: ['PYTHON', 'CODE']

Creating dictionary from a sentence where first character is key and word is value [closed]

I have a sentence called myString. What I am trying to do is create a dictionary from the sentence, where the first character of each word is the key ('white' → 'w'), and all words starting with that character are the values for that key ('w', ['white', 'with']).
I have already written some Python code. I want to know which snippet is better, or whether there is a better approach to this problem, like a dictionary comprehension.
The output I want to generate:
{'w': ['white', 'with', 'well'], 'h': ['hats', 'hackers', 'hackers', 'hackable', 'hacker', 'hired'], ...}
myString = "White hats are hackers employed with the efforts of
keeping data safe from other hackers by looking for loopholes and
hackable areas This type of hacker typically gets paid quite well and
receives no jail time due to the consent of the company that hired
them"
counterDict = {}
for word in myString.lower().split():
fChar = word[0]
if fChar not in counterDict:
counterDict[fChar] = []
counterDict[fChar].append(word)
print(counterDict)
Using the dict.setdefault method (dict.get(key, default) returns the default but does not store it, so setdefault is the right tool here):
counterDict = {}
for word in myString.lower().split():
    fChar = word[0]
    counterDict.setdefault(fChar, []).append(word)
print(counterDict)
collections.defaultdict()
import collections

counterDict = collections.defaultdict(list)
for word in myString.lower().split():
    fChar = word[0]
    counterDict[fChar].append(word)
print(counterDict)
collections.defaultdict() + list comprehension
import collections

counterDict = collections.defaultdict(list)
[counterDict[word[0]].append(word) for word in myString.lower().split()]
print(counterDict)
You can use a dict comprehension to assign default values to counterDict, and then append:
new_string = myString.split()
counterDict = {i[0].lower(): [] for i in new_string}
for i in new_string:
    counterDict[i[0].lower()].append(i)
This should work for your purpose:
from collections import defaultdict

counter_dict = defaultdict(list)
word_list = [(word[0], word) for word in myString.lower().split()]  # pair each word with its first letter
for letter, word in word_list:
    counter_dict[letter].append(word)
In case you like one-liners and are obsessed with comprehensions (like me), you can combine a dictionary comprehension with a list comprehension:
new_string = myString.lower().split()  # helps readability
counterDict = {i[0]: [z for z in new_string if z[0] == i[0]] for i in new_string}
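If you want to avoid the quadratic rescan of new_string in the one-liner above, the standard library's itertools.groupby can build the same mapping in one comprehension, as long as the words are sorted by first letter first; this sketch is an addition, not one of the original snippets:
from itertools import groupby

words = sorted(myString.lower().split(), key=lambda w: w[0])  # groupby only groups consecutive items
counterDict = {letter: list(group) for letter, group in groupby(words, key=lambda w: w[0])}
print(counterDict)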

How can I use 'remove()' method with two or more while loops without ValueError? [closed]

I have a string of text:
text = u"Hey, there, hope you are doing good?????? or maybe not?"
and a tokenized version made with spaCy. I'm using spaCy because I want to be able to use its other features, like part-of-speech tagging, lemmatization, and so on. The problem I'd love to solve is removing stop words like ['?', ',', 'you'] from the tokens. The tokenized version is saved in toks:
token = nlp(text)
toks = []
for t in token:
    toks.append(t.lower_)
I was thinking of using multiple while loops like this
while "?" in token.text:
toks.remove("?")
while "," in token.text:
toks.remove(",")
while "you" in token.text:
toks.remove("you")
but I keep getting ValueError: list.remove(x): x not in list, which is perfectly understandable: the while condition checks token.text, which never changes, so the loop keeps removing from toks until there is nothing left to remove, which raises the error.
However, I found a way to handle the error using:
while True:
    try:
        if '?' in tokens.text:
            toks.remove('?')
    except:
        try:
            if ',' in tokens.text:
                toks.remove(',')
        except:
            try:
                if 'you' in tokens.text:
                    toks.remove('you')
            except:
                break
I'm not getting the error any more, but I feel like there should be a better way to solve the problem without nested loops. Can you suggest a cleaner way?
Since you seem to want to exclude all tokens from a given set of tokens, it's easier to just ignore them while creating the toks list:
from spacy.lang.en import English

unwanted_tokens = {'?', ',', 'you'}
text = u"Hey, there, hope you are doing good?????? or maybe not?"
nlp = English()
tokens = nlp(text)
toks = []
for t in tokens:
    if t.lower_ not in unwanted_tokens:
        toks.append(t.lower_)
>>> toks
['hey', 'there', 'hope', 'are', 'doing', 'good', 'or', 'maybe', 'not']
The for loop could be replaced by a list comprehension:
toks = [t.lower_ for t in tokens if t.lower_ not in unwanted_tokens]
If, for reasons that you don't show in your question, you must remove the tokens after toks has been created, then you can just use a list comprehension:
toks = [t for t in toks if t not in unwanted_tokens]
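spaCy also flags punctuation and stop words on each token, so you could filter on the built-in attributes instead of maintaining your own set. A sketch, with the caveat that it relies on spaCy's default English stop-word list (which does include 'you') rather than your exact three tokens:
# Keep only tokens that are neither punctuation nor stop words
toks = [t.lower_ for t in tokens if not (t.is_punct or t.is_stop)]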
Use the str.replace method, with the empty string as the new string.
for target in ['?', ',', 'you']:
    text = text.replace(target, '')
This loops through the items to be replaced and replaces every occurrence of each one with the empty string.
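One caveat: str.replace matches substrings, so removing 'you' this way would also mangle words like 'your'. If that matters, a word-boundary regex is safer; a minimal sketch:
import re

text = u"Hey, there, hope you are doing good?????? or maybe not?"
for target in ['?', ',', 'you']:
    # use \b boundaries for word-like targets, a plain escape for punctuation
    pattern = r'\b%s\b' % re.escape(target) if target.isalpha() else re.escape(target)
    text = re.sub(pattern, '', text)
print(text)  # 'Hey there hope  are doing good or maybe not'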

Python callback in a class? [closed]

I found text-summarization code on GitHub, and I want to turn it into a Tkinter program. My problem is getting a value from a class method using a Button widget and showing the result in a Text widget. How do I get the value of the summarize method in this code using a Tkinter button? I usually only use functions, not classes and methods. This code already runs in the interpreter.
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords

class NaiveSummarizer:

    def summarize(self, input, num_sentences):
        punt_list = ['.', ',', '!', '?']
        summ_sentences = []
        sentences = sent_tokenize(input)
        lowercase_sentences = [sentence.lower() for sentence in sentences]
        #print(lowercase_sentences)
        s = list(input)
        ts = ''.join([o for o in s if o not in punt_list]).split()
        lowercase_words = [word.lower() for word in ts]
        words = [word for word in lowercase_words if word not in stopwords.words()]
        word_frequencies = FreqDist(words)
        most_frequent_words = [pair[0] for pair in word_frequencies.most_common(100)]
        # add sentences with the most frequent words
        for word in most_frequent_words:
            for i in range(0, len(lowercase_sentences)):
                if len(summ_sentences) < num_sentences:
                    if (lowercase_sentences[i] not in summ_sentences and word in lowercase_sentences[i]):
                        summ_sentences.append(sentences[i])
                        break
        # reorder the selected sentences by their position in the input
        summ_sentences.sort(key=lambda s1: input.find(s1))
        return " ".join(summ_sentences)

if __name__ == "__main__":
    naivesum = NaiveSummarizer()
    text = '''
To see a world in a grain of sand,
And a heaven in a wild flower,
Hold infinity in the palm of your hand,
And eternity in an hour.
A robin redbreast in a cage
Puts all heaven in a rage.
A dove-house fill'd with doves and pigeons
Shudders hell thro' all its regions.
'''
    text2 = '''
You conclude with the aphorism of Hippocrates, "Qui gravi morbo correpti dolores non sentiunt, iis mens aegrotat" (Those who do not perceive that they are wasted by serious illness are sick in mind), and suggest that I am in need of medicine not only to conquer my malady, but even more, to sharpen my senses for the condition of my inner self. I would fain give you an answer such as you deserve, fain reveal myself to you entirely, but I do not know how to set about it. Hardly do I know whether I am still the same person to whom your precious letter is addressed.
'''
    print(naivesum.summarize(text2, 3))
    print(naivesum.summarize(text, 2))
You cannot directly use the summarize function as a callback for your button; instead, you should wrap it in another function that calls summarize and then displays the result in the Entry widget.
First, you have to add a text variable to your widget so you can read and write the text, like this:
self.outputvar = StringVar()
Entry(self, textvariable=self.outputvar)
Now you can add a callback function to your button, like in your other question:
def callback(self):
    text = "lorem ipsum"  # get actual input text, maybe from another widget
    num = 3  # get a proper value for whatever this is for
    result = self.summarize(text, num)  # call summarize
    self.outputvar.set(result)  # show the result in your widget
Alternatively, you could use a Text widget; here, inserting text is handled differently though.
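Putting the pieces together, here is a minimal sketch of the wiring; the widget layout and names are assumptions for illustration, not from the original question:
import tkinter as tk

class SummarizerApp(tk.Frame):
    def __init__(self, master, summarizer):
        super().__init__(master)
        self.summarizer = summarizer
        self.inputbox = tk.Text(self, height=10)    # paste the source text here
        self.outputvar = tk.StringVar()
        tk.Button(self, text="Summarize", command=self.callback).pack()
        self.inputbox.pack()
        tk.Entry(self, textvariable=self.outputvar, width=80).pack()
        self.pack()

    def callback(self):
        text = self.inputbox.get("1.0", tk.END)     # read the whole Text widget
        self.outputvar.set(self.summarizer.summarize(text, 3))

root = tk.Tk()
SummarizerApp(root, NaiveSummarizer())
root.mainloop()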

Discovering Poetic Form with NLTK and CMU Dict

Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up Python, and I'm working on a project that hopes to automatically analyze poems, including detecting the form of the poem. For example, if it found a 10-syllable line with a 0101010101 stress pattern, it would declare it iambic pentameter; a poem with a 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict

...

def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
        #return max([len([y for y in x if y[-1].isdigit()]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print(a, nsyl(a), end=' ')
            sum1 = sum1 + len(str(nsyl(a)))
    print("\nTotal syllables:", sum1)
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lose the line breaks during tokenization, and I really need them to be able to identify the form. That should not be too hard to deal with, though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllable count of the line will probably decrease.
In addition, the CMU dictionary often says there is stress on a word ('1') when there is not ('0'). That is why the output looks like 1101111101 when it should be the stress of iambic pentameter: 0101010101.
So how would I add some fudge factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01s when the CMU dictionary will not output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to Stack Overflow. I'm not that familiar with Python, but I see you have not received many answers yet, so I'll try to help you with your queries.
First, some advice: you'll find that if you focus your questions, your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in Python:
corpus.split("\n")
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a postgrad for a couple of months to help you as part of some workshop requirement.
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary or change existing ones; look on their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are, as the number of changes needed to turn one string into the other. Pros: easy to implement. Cons: not normalized; a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement, but it gives you normalized values (0.0 for completely different, 1.0 for identical) and is suitable for classifying the stress patterns. A CS postgrad or final-year undergrad should not have too much trouble with it (hint hint).
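If you would rather not implement either algorithm yourself, Python's standard library already gives you a normalized similarity through difflib; a quick sketch of scoring a noisy stress string against known forms (the FORMS table here is a made-up example):
from difflib import SequenceMatcher

FORMS = {'iambic pentameter': '0101010101', 'trochaic pentameter': '1010101010'}

line_stress = '1101111101'  # the noisy output from the question
for name, pattern in FORMS.items():
    # ratio() returns 1.0 for identical strings and 0.0 for completely different ones
    print(name, round(SequenceMatcher(None, line_stress, pattern).ratio(), 2))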
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 when nltk returns 1 (it looks like nltk already returns 0 for some words that would never get stressed, like "the"). So you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick ones that look like a known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be counting the vowels) and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
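A throwaway version of that vowel-counting idea might look like the sketch below; counting runs of vowels as syllables and dropping a trailing silent 'e' are exactly the rough assumptions suggested above:
import re

def naive_syllables(word):
    # count maximal runs of vowels as syllables
    runs = re.findall(r'[aeiouy]+', word.lower())
    count = len(runs)
    # treat a trailing 'e' as silent when the word has more than one vowel run
    if word.lower().endswith('e') and count > 1:
        count -= 1
    return count

print(naive_syllables('Corinna'))  # 3
print(naive_syllables('dare'))     # 1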
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on Stack Overflow, and I'm a Python newbie, so please excuse any deficits in code style. But I too am attempting to extract accurate metre from poems, and the code included in this question helped me, so I am posting what I came up with that builds on that foundation. It is one way to extract the stress as a single string, correct for the cmudict bias toward '1' with a 'fudging factor', and not lose words that are not in cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()

#
# parseStressOfLine(line)
# function that takes a line,
# parses it for stress,
# corrects the cmudict bias toward 1,
# and returns two strings:
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict: '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print(line)
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print(word)
        if word not in prondict:
            # if the word is not in the dictionary,
            # add it to the string, set off by asterisks
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1:
                # search for a zero in the pronunciations returned from prondict;
                # if one exists, use it
                #print(strip_letters(s), word)
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}
# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuation characters
    punctuations = r'!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # separate punctuation characters from the rest of the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}
# CONVERT the cmudict pronunciation into just the stress numbers
def strip_letters(ls):
    #print("strip_letters")
    nm = ''
    for ws in ls:
        #print("ws", ws)
        for ch in list(ws):
            #print("ch", ch)
            if ch.isdigit():
                nm = nm + ch
                #print("add to nm", nm, type(nm))
    return nm
# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print(parseStressOfLine(line))
line = "Apollo play'd the midwife's part;"
print(parseStressOfLine(line))
line = "Into the world Corinna fell,"
print(parseStressOfLine(line))
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}
