Derive words from string based on key words - python

I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy. All signs are included until the next space.)
without (since it comes after coding. The number of spaces surrounding the word doesn't matter)
yes (since it comes after happy)

You can solve it with a regex, for example:
import re
expected_output = re.findall(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) takes your key_words list and builds a non-capturing group with all the words in the list as alternatives.
\s+? matches the whitespace after the keyword. The quantifier is lazy, but because the next token must be a non-space character, it still consumes every space up to the output word, so the number of spaces doesn't matter.
([^\s]+) captures the text right after the keyword, up to the next space.
Note: if you run this many times, for example inside a loop, compile the pattern once with re.compile and reuse it to improve performance.

We will use Python's re module to split your string on whitespace.
Then the idea is to go over each word and check whether it is one of your keywords. If it is, we set take_it to True, so that on the next iteration the word is appended to taken, which stores all the words you're looking for.
import re
def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        if take_it:
            taken.append(word)
        # remember whether the current word is a keyword for the next iteration
        take_it = word in keywords
    return taken
print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']
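The same idea can also be written without a flag variable by pairing each word with its successor. This is only a sketch; the function name find_next_words_pairwise is hypothetical, and it uses str.split() instead of re.split:

```python
def find_next_words_pairwise(text, keywords):
    """Return every word that directly follows one of the keywords."""
    keyword_set = set(keywords)  # set lookup is faster than scanning a list
    words = text.split()         # split() with no argument splits on any whitespace run
    # zip pairs each word with the word after it: (w0, w1), (w1, w2), ...
    return [nxt for cur, nxt in zip(words, words[1:]) if cur in keyword_set]

print(find_next_words_pairwise(
    "happy yes_no!?. why coding without paus happy yes",
    ["happy", "coding"],
))  # ['yes_no!?.', 'without', 'yes']
```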

Related

I need help to automatically decensor a text (lots of text to be processed)

I have a web story that has censored words in it, marked with asterisks.
Right now I'm doing it with a simple and dumb str.replace,
but as you can imagine this is a pain, and I need to search the text to find every instance of the censoring.
Here are the instances of bastard, which can be capitalized, plural, and have asterisks in different places:
toReplace = toReplace.replace("b*stard", "bastard")
toReplace = toReplace.replace("b*stards", "bastards")
toReplace = toReplace.replace("B*stard", "Bastard")
toReplace = toReplace.replace("B*stards", "Bastards")
toReplace = toReplace.replace("b*st*rd", "bastard")
toReplace = toReplace.replace("b*st*rds", "bastards")
toReplace = toReplace.replace("B*st*rd", "Bastard")
toReplace = toReplace.replace("B*st*rds", "Bastards")
Is there a way to compare every word containing "*" (or any other replacement character) against an already compiled dict and replace it with the uncensored version of the word?
Maybe regex, but I don't think so.
Using regex alone will likely not result in a full solution for this. You would likely have an easier time if you have a simple list of the words that you want to restore, and use Levenshtein distance to determine which one is closest to a given word that you have found a * in.
One library that may help with this is fuzzywuzzy.
The two approaches that I can think of quickly:
Split the text so that you have 1 string per word. For each word, if '*' in word, then compare it to the list of replacements to find which is closest.
Use re.sub to identify the words that contain a * character, and write a function that you would use as the repl argument to determine which replacement it is closest to and return that replacement.
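The second approach could be sketched with the standard library alone. Note that this swaps in difflib's similarity ratio for Levenshtein distance or fuzzywuzzy, and that WORDLIST, restore and decensor are hypothetical names. It also assumes each asterisk hides exactly one character, so only words of the same length are considered:

```python
import difflib
import re

WORDLIST = ["bastard", "bastards", "apple", "orange"]  # hypothetical replacement list

def restore(match):
    """repl callback for re.sub: return the closest uncensored word."""
    censored = match.group(0)
    # Assumption: one '*' stands for one letter, so lengths must agree.
    candidates = [w for w in WORDLIST if len(w) == len(censored)]
    closest = difflib.get_close_matches(censored.lower(), candidates, n=1)
    if not closest:
        return censored  # nothing similar enough: leave the word untouched
    word = closest[0]
    # Preserve the capitalization of the censored word.
    return word.capitalize() if censored[0].isupper() else word

def decensor(text):
    # Match any run of word characters that contains at least one '*'.
    return re.sub(r"\w*\*[\w*]*", restore, text)

print(decensor("You b*st*rd! B*stards everywhere."))
# You bastard! Bastards everywhere.
```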
Additional resources:
Python: find closest string (from a list) to another string
Find closest string match from list
How to find closest match of a string from a list of different length strings python?
You can use the re module to find matches between the censored word and the words in your wordlist.
Replace * with . (the dot has a special meaning in regex: it matches any single character except a newline) and then use re.match:
import re
wordlist = ["bastard", "apple", "orange"]
def find_matches(censored_word, wordlist):
    # turn every '*' into the regex wildcard '.'
    pat = re.compile(censored_word.replace("*", "."))
    return [w for w in wordlist if pat.match(w)]
print(find_matches("b*st*rd", wordlist))
Prints:
['bastard']
Note: If you want to match the exact word, add $ at the end of your pattern. That way appl* will not match applejuice in your dictionary, for example.
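A sketch of the exact-match variant described in the note, using re.fullmatch instead of appending $. The name find_exact_matches and the mask parameter are hypothetical; escaping the literal parts and ignoring case are additions not present in the answer above:

```python
import re

def find_exact_matches(censored_word, wordlist, mask="*"):
    """Return words that match the censored word over their whole length."""
    # Escape the literal pieces so characters such as '.' or '+' stay literal,
    # then join them with '.', the single-character wildcard.
    parts = (re.escape(part) for part in censored_word.split(mask))
    pattern = re.compile(".".join(parts), re.IGNORECASE)
    return [w for w in wordlist if pattern.fullmatch(w)]

wordlist = ["bastard", "bastards", "apple", "applejuice"]
print(find_exact_matches("b*st*rd", wordlist))  # ['bastard']
print(find_exact_matches("appl*", wordlist))    # ['apple']
```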

Finding most common occurrence of a character that follows another

I'm currently working on a small piece of code and I seem to have run into a roadblock. I was wondering if it's possible to (because I cannot, for the life of me, figure it out) find the most common occurrence of a character that follows a specific character or string?
For example, say I have the following sentence:
"this is a test sentence that happens to be short"
How could I determine, for example, the most common character that occurs after the letter h?
In this specific example, doing it by hand, I get something like this:
{"i": 1, "a": 2, "o": 1}
I'd then like to be able to get the key of the highest value--in this case, a.
Using Counter from collections, I've been able to find the most common occurrence of a specific word or character, but I'm not sure how to do this specific implementation of doing the most common occurrence after. Any help would be greatly appreciated, thanks!
(The code I wrote to find the most common occurrence of a letter in a file:
Counter(text).most_common(1), which does include white spaces )
EDIT:
How would this be done with words? For example, if I had the sentence: "whales are super neat, but whales don't make good pets. whales are cool."
How would I find the most common character that occurs after the word whales?
In this instance, removing white spaces, the most common character would be a
Just split the sentence on your character and take the first letter of each part that follows:
import collections
sentence = "this is a test sentence that happens to be short"
character = 'h'
# skip empty parts, which appear when the character is doubled or ends the sentence
letters_after_some_character = [part[0] for part in sentence.split(character)[1:] if part and part[0].isalpha()]
print(collections.Counter(letters_after_some_character).most_common())
If you want a solution without using regex:
import collections
sentence = "this is a test sentence that happens to be short"
characters = [sentence[i] for i in range(1,len(sentence)) if sentence[i-1] == 'h']
most_common_char = collections.Counter(characters).most_common(1)
Using the Counter class we can try:
import collections
import re
s = "this is a test sentence that happens to be short"
s = re.sub(r'^.*n|\s*', '', s)
print(collections.Counter(s).most_common(1)[0])
The above counts the characters that occur after the last n in the string (here "s to be short"), with whitespace stripped before counting. In this sentence s, t and o are tied at two occurrences each, so the result is one of those three; on Python 3.7+ it prints ('s', 2), the first of the tied characters encountered.
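For the word-based case from the EDIT, one possible sketch is to capture the first non-space character after each occurrence of the prefix. The function name most_common_after is hypothetical:

```python
import collections
import re

def most_common_after(text, prefix):
    """Return (character, count) for the character that most often follows prefix."""
    # \s* skips any whitespace after the prefix; (\S) captures the next character.
    followers = re.findall(re.escape(prefix) + r"\s*(\S)", text)
    counts = collections.Counter(followers)
    return counts.most_common(1)[0] if counts else None

whales = "whales are super neat, but whales don't make good pets. whales are cool."
print(most_common_after(whales, "whales"))  # ('a', 2)
print(most_common_after("this is a test sentence that happens to be short", "h"))  # ('a', 2)
```

The same function covers single characters and whole words, because the prefix is treated as a literal string in both cases.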

Matching if any keyword from a list is present in a string

I have a list of keywords. A sample is:
['IO', 'IO Combination','CPI Combos']
Now what I am trying to do is see if any of these keywords is present in a string. For example, if my string is: there is a IO competition coming in Summer 2018. So for this example since it contains IO, it should identify that but if the string is there is a competition coming in Summer 2018 then it should not identify any keywords.
I wrote this Python code but it also identifies IO in competition:
if any(word.lower() in string_1.lower() for word in keyword_list):
print('FOUND A KEYWORD IN STRING')
I also want to identify which keyword was identified in the string (if any present). What is the issue in my code and how can I make sure that it matches only complete words?
Regex solution
You'll need to implement word boundaries here:
import re
keywords = ['IO', 'IO Combination','CPI Combos']
words_flat = "|".join(r'\b{}\b'.format(word) for word in keywords)
rx = re.compile(words_flat)
string = "there is a IO competition coming in Summer 2018"
match = rx.search(string)
if match:
    print("Found: {}".format(match.group(0)))
else:
    print("Not found")
Here, each keyword is wrapped in \b (a word boundary) on both sides and the results are joined with |.
Afterwards, you can search with rx.search(), which finds IO in this example and prints "Found: IO".
Even shorter with a direct comprehension:
rx = re.compile("|".join(r'\b{}\b'.format(word) for word in keywords))
Non-regex solution
Please note that you can even use a non-regex solution for single words; you just have to reorder your comprehension and use split(), like
found = any(word in keywords for word in string.split())
if found:
    pass  # do sth. here
Notes
The latter has the drawback that strings like
there is a IO. competition coming in Summer 2018
# ---^---
won't work, while they do count as a "word" in the regex solution (hence the two approaches yield different results). Additionally, because of the split() function, combined phrases like CPI Combos cannot be found. The regex solution also has the advantage of supporting case-insensitive matching (just pass flags=re.IGNORECASE to re.compile).
It really depends on your actual requirements.
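To also report which keywords were found, a sketch along the lines of the regex solution could look like this. The function name find_keywords is hypothetical; escaping the keywords and sorting them longest first (so that IO Combination is tried before IO) are assumptions added here:

```python
import re

def find_keywords(text, keywords):
    """Return every whole-word keyword occurrence found in text, ignoring case."""
    # Longest first: in an alternation the first alternative that matches wins.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:{})\b".format("|".join(re.escape(k) for k in ordered)),
        re.IGNORECASE,
    )
    return pattern.findall(text)

keywords = ['IO', 'IO Combination', 'CPI Combos']
print(find_keywords("there is a IO competition coming in Summer 2018", keywords))  # ['IO']
print(find_keywords("there is a competition coming in Summer 2018", keywords))     # []
```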
def find_index(mylist, mystring):
    for index, key in enumerate(mylist):
        if key.find(mystring) != -1:
            return index
    return None
This loops over your list and checks, for every item, whether your string is contained in the item. find() returns -1 when the string is not found, so any other value means it is contained; in that case you get the index of the item where it was found, with the help of enumerate().

Return first word in sentence? [duplicate]

This question already has answers here:
How to extract the first and final words from a string?
(7 answers)
Closed 5 years ago.
Here's the question I have to answer for school:
For the purposes of this question, we will define a word as ending a sentence if that word is immediately followed by a period. For example, in the text “This is a sentence. The last sentence had four words.”, the ending words are ‘sentence’ and ‘words’. In a similar fashion, we will define the starting word of a sentence as any word that is preceded by the end of a sentence. The starting words from the previous example text would be “The”. You do not need to consider the first word of the text as a starting word. Write a program that has:
An endwords function that takes a single string argument. This function must return a list of all sentence ending words that appear in the given string. There should be no duplicate entries in the returned list and the periods should not be included in the ending words.
The code I have so far is:
def startwords(astring):
    mylist = astring.split()
    if mylist.endswith('.') == True:
        return mylist
but I don't know if I'm using the right approach. I need some help
Several issues with your code. The following would be a simple approach. Create a list of bigrams and pick the second token of each bigram where the first token ends with a period:
def startwords(astring):
    mylist = astring.split()  # a list! Has no 'endswith' method
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]
zip and list comprehensions are two things worth reading up on.
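As a small illustration of what the bigram step produces, the following sketch prints the pairs for a shortened example sentence. The variable names are made up for this example:

```python
words = "This is a sentence. The last".split()

# zip stops at the shorter input, so each word is paired with its successor.
pairs = list(zip(words, words[1:]))
print(pairs)
# [('This', 'is'), ('is', 'a'), ('a', 'sentence.'), ('sentence.', 'The'), ('The', 'last')]

# Keep the second word of every pair whose first word ends a sentence.
starts = [second for first, second in pairs if first.endswith(".")]
print(starts)  # ['The']
```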
mylist = astring.split()
if mylist.endswith('.')
that cannot work, one of the reasons being that mylist is a list, and doesn't have endswith as a method.
Another answer fixed your approach so let me propose a regular expression solution:
import re
print(re.findall(r"\.\s*(\w+)","This is a sentence. The last sentence had four words."))
match all words following a dot and optional spaces
result: ['The']
def endwords(astring):
    mylist = astring.split('.')
    temp_words = [x.rpartition(" ")[-1] for x in mylist if len(x) > 1]
    return list(set(temp_words))
This creates a set so there are no duplicates. It loops over the list of sentences (split by "."), splits each sentence into words, and takes the last word with [-1].
print(set([x.split()[-1] for x in s.split(".") if len(x.split()) > 0]))
The if is needed because splitting on "." leaves an empty string after the final period, and indexing into its empty word list would raise an IndexError.
This works as well:
print(set([x.split()[len(x.split()) - 1] for x in s.split(".") if len(x.split()) > 0]))
This is one way to do it ->
#!/usr/bin/env python
sentence = 'This is a sentence. The last sentence had four words.'
uniq_end_words = set()
for word in sentence.split():
    # check if a period (.) is at the end of the word
    if word.endswith('.'):
        uniq_end_words.add(word.rstrip('.'))
print(list(uniq_end_words))
Output (list of all the end words in a given sentence; the order of a set is arbitrary) ->
['words', 'sentence']
If your input string has a period inside one of its words (let's say the last word), something like this ->
'I like the documentation of numpy.random.rand.'
The output would be - ['numpy.random.rand']
And for input string 'I like the documentation of numpy.random.rand a lot.'
The output would be - ['lot']
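If the order of the returned list matters, a variant that keeps the end words in first-seen order might look like the sketch below. It relies on dict.fromkeys preserving insertion order, which holds on Python 3.7+:

```python
def endwords(text):
    """Return sentence-ending words, without duplicates, in first-seen order."""
    # Keep only whitespace-separated tokens that end with a period.
    stripped = (word.rstrip(".") for word in text.split() if word.endswith("."))
    # dict.fromkeys drops duplicates while keeping the first occurrence of each key.
    return list(dict.fromkeys(word for word in stripped if word))

print(endwords("This is a sentence. The last sentence had four words."))
# ['sentence', 'words']
```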

How can you use Python to count the unique words (without special characters/ cases interfering) in a text document

I am new to Python and need some help with trying to come up with a text content analyzer that will help me find 7 things within a text file:
Total word count
Total count of unique words (without case and special characters interfering)
The number of sentences
Average words in a sentence
Find common used phrases (a phrase of 3 or more words used over 3 times)
A list of words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN, or from a file specified on the command line
So far I have this Python program to print total word count:
with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    words = p.split()
    wordCount = len(words)
    print("The total word count is:", wordCount)
So far I have this Python program to print unique words and their frequency: (it's not in order and sees words such as: dog, dog., "dog, and dog, as different words)
file = open("/Users/name/Desktop/20words.txt", "r+")
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)
Thank you for any help you can give!
Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles, that have a dot followed by an upper case letter. For words, too, you can use a simple regex, instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.
This should help you get started:
import re, collections
text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
# most_common() yields (item, count) pairs, most frequent first
for s, n in sentences.most_common():
    print(s, n)
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for w, n in words.most_common():
    print(w, n)
For "more power", you could use some natural language toolkit, but this might be a bit much for this task.
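Putting these pieces together, a rough sketch of the analyzer could look like the following. Everything here is an assumption for illustration: the function names analyze and read_input, the word pattern, the naive sentence split on ., ! and ?, and the choice to count only phrases of exactly three words that appear more than three times:

```python
import collections
import re
import sys

def analyze(text, phrase_len=3, min_phrase_count=4):
    """Compute simple word, sentence and phrase statistics for a text."""
    # Lowercase first so "Dog" and "dog" count as the same word; the pattern
    # drops punctuation but keeps apostrophes, as in "don't".
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Naive split: abbreviations such as "e.g." or "Mr." inflate the count.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = collections.Counter(words)
    # Count every run of phrase_len consecutive words.
    phrases = collections.Counter(
        " ".join(words[i:i + phrase_len])
        for i in range(len(words) - phrase_len + 1)
    )
    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "sentences": len(sentences),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "common_phrases": [p for p, n in phrases.items() if n >= min_phrase_count],
        "by_frequency": counts.most_common(),  # descending frequency
    }

def read_input(argv):
    """Read the file named on the command line, or STDIN if none is given."""
    if len(argv) > 1:
        with open(argv[1], encoding="utf-8") as f:
            return f.read()
    return sys.stdin.read()

stats = analyze('The dog. The dog, "the DOG" ran!')
print(stats["total_words"], stats["unique_words"], stats["sentences"])  # 7 3 2
```

In a script, analyze(read_input(sys.argv)) would cover both the file and the STDIN case.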
If you know what characters you want to avoid, you can use str.strip to remove these characters from the extremities.
word = word.strip().strip("'").strip('"')...
This will remove the occurrence of these characters on the extremities of the word.
This probably isn't as efficient as using some NLP library, but it can get the job done.
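Since str.strip accepts a string of characters to remove, the chained calls can be collapsed into one. A sketch, assuming all ASCII punctuation should be stripped from the ends of each word; the function name normalize is hypothetical:

```python
import string

def normalize(word):
    """Strip punctuation and whitespace from both ends and lowercase the word."""
    # strip() removes any of the given characters from the ends only,
    # so the apostrophe inside "don't" is left alone.
    return word.strip(string.punctuation + string.whitespace).lower()

print({normalize(w) for w in ['dog', 'dog.', '"dog', 'dog,', 'Dog!']})  # {'dog'}
```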
str.strip Docs
