Hi, I am new to regex and stuck on this question.
Q: Identify all of the words that look like names in the sentence. In other words, those which are capitalized but aren't the first word of a sentence.
sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
Here's what I did, but I'm not getting any output. (I'm trying to exclude the text from the beginning up to any capitalized word that is a name.)
p = re.compile(r'[^A-Z]\w+[A-Z]\w+')
m = p.finditer(sentence)
for m in m:
    print(m)
Assuming there's always only one space after a dot before another sentence begins, you can use a negative lookbehind pattern to exclude names that are preceded by a dot and a space, and another negative lookbehind pattern to exclude the beginning of the string. Also use \b to ensure that a capital letter is matched at a word boundary:
re.findall(r'(?<!\. )(?<!^)\b[A-Z]\w*', sentence)
This returns:
['Harry', 'Susy']
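As a hedged extension of the same idea, the first lookbehind can use a character class so that sentences ending in "!" or "?" are handled too. The function name find_names and the sample text are made up for illustration, and the single-space assumption still applies:

```python
import re

def find_names(text):
    # (?<![.!?] ) rejects a capital that follows sentence-ending punctuation
    # plus one space; (?<!^) rejects a capital at the very start of the string.
    return re.findall(r'(?<![.!?] )(?<!^)\b[A-Z]\w*', text)

print(find_names("Hello there Bob! Are you with Alice? Yes, Carol is too."))
# ['Bob', 'Alice', 'Carol']
```

Both lookbehinds stay fixed-width, which Python's re module requires.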
You can use a positive lookbehind to look for a capitalized word that is not at the beginning of a sentence.
Like so:
>>> sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
>>> re.findall(r'(?<=[a-z,][ ])([A-Z][a-z]*)', sentence)
['Harry', 'Susy']
Imo best done with nltk:
from nltk import sent_tokenize, word_tokenize
sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
for sent in sent_tokenize(sentence):
    words = word_tokenize(sent)
    possible_names = [word for word in words[1:] if word[0].isupper()]
    print(possible_names)
Or - if you're into comprehensions:
names = [word
         for sent in sent_tokenize(sentence)
         for word in word_tokenize(sent)[1:]
         if word[0].isupper()]
Which will yield
['Harry', 'Susy']
You're overwriting your m variable. Try this:
p = re.compile(r'[^A-Z]\w+[A-Z]\w+')
for m in p.finditer(sentence):
    print(m)
Related
I want to extract the full phrase (one or multiple words) that contains a specific substring. The substring can have one or multiple words, and it can start or end in the middle of words in the test_string, but the desired output is the full phrase/word from test_string. For example:
test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'
substring1 = 'he text th'
substring2 = 'amp'
if substring1 in test_string:
    print("substring1 found")
if substring2 in test_string:
    print("substring2 found")
My desired output is:
[the text that]
[example, amplifier, lamp]
FYI
Substring can be at the beginning of the word, middle or end...it does not matter.
If you want something robust, I would do something like this:
re.findall(r"((?:\w+)?" + re.escape(substring2) + r"(?:\w+)?)", test_string)
This way you can have whatever you want in substring.
Explanation of the regex:
'(?:\w+)' Non capturing group
'?' zero or one
I have done this at the beginning and at the end of your substring, as the missing part can be at the start or at the end of the word.
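A minimal sketch of the same idea wrapped in a function, assuming \w* (zero or more word characters) is an acceptable stand-in for the optional group (?:\w+)?. The function name phrases_containing is hypothetical:

```python
import re

def phrases_containing(text, substring):
    # \w* on each side extends the match to the edges of the words in which
    # the substring starts and ends; re.escape treats the substring literally.
    pattern = r"\w*" + re.escape(substring) + r"\w*"
    return re.findall(pattern, text)

test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'
print(phrases_containing(test_string, 'he text th'))  # ['the text that']
print(phrases_containing(test_string, 'amp'))         # ['example', 'amplifier', 'lamp']
```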
To answer the latest comment about how to get the punctuation as well, I would do something like this using string.punctuation:
import string
pattern = r"(?:[" + r"\w" + re.escape(string.punctuation) + r"]+)?"
re.findall("(" + pattern + re.escape(substring2) + pattern + ")",
           test_string)
Doing so will match any punctuation attached to the word at the beginning and the end, like: [I love you.., I love you!!, I love you!?, ?I love you!, ...]
This is a job for regex; you could do:
import re
substring2 = 'amp'
test_string = 'this is an example of the text that I have'
print("matches for substring 1:",re.findall(r"(\w+he text th\w+)", test_string))
print("matches for substring 2:",re.findall(r"(\w+amp\w+)",test_string))
Output:
matches for substring 1: ['the text that']
matches for substring 2: ['example']
I have a list of words like substring = ["one","multiple words"] and I want to check whether a sentence contains any of these words.
sentence1 = 'This Sentence has ONE word'
sentence2 = ' This sentence has Multiple Words'
My code to check using any operator:
any(sentence1.lower() in s for s in substring)
This gives me False even though the word is present in my sentence. I don't want to use regex as it would be an expensive operation on huge data.
Is there any other approach to this?
I think you should reverse your order:
any(s in sentence1.lower() for s in substring)
You want to check whether any of your substrings is part of your sentence, not whether your sentence is part of any of your substrings.
As mentioned in other answers, this is what will get you the correct answer if you want to detect substrings:
any(s in sentence1.lower() for s in substring)
However, if your goal is to find words instead of substrings, this is incorrect. Consider:
sentence = "This is an aircraft"
words = ["air", "hi"]
any(w in sentence.lower() for w in words) # True.
The words "air" and "hi" are not in the sentence, but it returns True anyway. Instead, if you want to check for words, you should use:
any(w in sentence.lower().split(' ') for w in words)
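Splitting into single words cannot find a multi-word entry such as "multiple words". One regex-free way around that, sketched here under the assumption that words are separated only by whitespace (punctuation stuck to a word would still prevent a match), is to pad both sides with spaces. The function name contains_any_phrase is made up for this example:

```python
def contains_any_phrase(sentence, phrases):
    """Return True if any phrase occurs in the sentence as whole words."""
    # Normalise case and whitespace, then pad with spaces so that
    # " air " cannot match inside " aircraft ".
    padded = " " + " ".join(sentence.lower().split()) + " "
    return any(
        " " + " ".join(phrase.lower().split()) + " " in padded
        for phrase in phrases
    )

substring = ["one", "multiple words"]
print(contains_any_phrase('This Sentence has ONE word', substring))         # True
print(contains_any_phrase(' This sentence has Multiple Words', substring))  # True
print(contains_any_phrase("This is an aircraft", ["air", "hi"]))            # False
```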
You can use str.find for this:
a="Hello Moto"
a.find("Hello")
It returns the index of the first occurrence. If the substring is not there, it returns -1.
We have repeated words like Mr and Mrs in a text. We would like to add a space before and after the keywords Mr and Mrs. But the word Mr is also matched inside Mrs. Please help us solve this:
Input:
Hi This is Mr.Sam. Hello, this is MrsPamela.Mr.Sam, what is your call about? Mrs.Pamela, I have a question for you.
import re
s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
words = ("Mr", "Mrs")
def add_spaces(string, words):
    for word in words:
        # pattern to match any non-space char before the word
        patt1 = re.compile('\S{}'.format(word))
        matches = re.findall(patt1, string)
        for match in matches:
            non_space_char = match[0]
            string = string.replace(match, '{} {}'.format(non_space_char, word))
        # pattern to match any non-space char after the word
        patt2 = re.compile('{}\S'.format(word))
        matches = re.findall(patt2, string)
        for match in matches:
            non_space_char = match[-1]
            string = string.replace(match, '{} {}'.format(word, non_space_char))
    return string
print(add_spaces(s, words))
Present Output:
Hi This is Mr .Sam. Hello, this is Mr sPamela. Mr .Sam, what is your call about? Mr s.Pamela, I have a question for you.
Expected Output:
Hi This is Mr .Sam. Hello, this is Mrs Pamela. Mr .Sam, what is your call about? Mrs .Pamela, I have a question for you.
You didn't specify anything after the letter 'r', so your pattern matches any 'Mr', even when it's followed by an 's' as in 'Mrs'. That's why your code adds a space in the middle of 'Mrs'.
A better pattern would be r'\bMr\b'
'\b' matches word boundaries, see the docs for further explanation: https://docs.python.org/3/library/re.html
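A plain \bMr\b would not split a run-together form like "MrsPamela", because there is no word boundary between "Mrs" and "Pamela". The sketch below is one possible variant, not the pattern from this answer: it keeps \b in front of the title and, as an assumption of mine, replaces the trailing \b with a negative lookahead for a lowercase letter. The function name space_out_titles is hypothetical:

```python
import re

def space_out_titles(text, titles=("Mr", "Mrs")):
    # Longest title first, so "Mrs" is tried before "Mr".
    alternatives = "|".join(sorted(map(re.escape, titles), key=len, reverse=True))
    # \b anchors the start of the title; (?![a-z]) stops "Mr" from matching
    # inside "Mrs" while still allowing "MrsPamela".
    pattern = r"\s*\b(" + alternatives + r")(?![a-z])\s*"
    # Replace the title and any surrounding whitespace with " title ".
    return re.sub(pattern, r" \1 ", text).strip()

s = ("Hi This is Mr.Sam. Hello, this is MrsPamela.Mr.Sam, "
     "what is your call about? Mrs.Pamela, I have a question for you.")
print(space_out_titles(s))
```

On the input from the question this prints the expected output shown there. A name that starts with a title followed by a capital or non-letter would be split as well, so the lookahead would need adjusting for other data.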
I do not have very extensive knowledge of the re module, but I came up with a solution that extends to any number of words and strings and works (tested in Python 3), although it is rather long and you may find something more optimized and much more concise.
On the other hand, the procedure is not very difficult to understand:
To begin with, the program sorts the words list in descending order of length.
Then, it finds the matches of the longer words first and records the sections where matches were already made, so that they are not changed again. (Note that this introduces a limitation, but it is necessary because the program cannot know whether you want to allow a word in the variable words to be contained in another one. In any case, it does not affect your case.)
When it has recorded all matches of a word (in non-blocked parts of the string), it adds the corresponding spaces and corrects the blocked indexes (they have moved due to the insertion of the spaces).
Finally, it collapses multiple spaces into one.
Note: I used a list for the variable words instead of a tuple
import re
def add_spaces(string, words):
    # Get the length of the longest word
    max_lenght = 0
    for word in words:
        if len(word)>max_lenght:
            max_lenght = len(word)
    print("max_lenght = ", max_lenght)
    # Order words in descending length
    ordered_words = []
    i = max_lenght
    while i>0:
        for word in words:
            if len(word)==i:
                ordered_words.append(word)
        i -= 1
    print("ordered_words = ", ordered_words)
    # Iterate over words adding spaces with each match and "blocking" the match section so not to modify it again
    blocked_sections=[]
    for word in ordered_words:
        matches = [match.start() for match in re.finditer(word, string)]
        print("matches of ", word, " are: ", matches)
        spaces_position_to_add = []
        for match in matches:
            blocked = False
            for blocked_section in blocked_sections:
                if match>=blocked_section[0] and match<=blocked_section[1]:
                    blocked = True
            if not blocked:
                # Block section and store position to modify after
                blocked_sections.append([match,match+len(word)])
                spaces_position_to_add.append([match,match+len(word)+1])
        # Add the spaces and update the existing blocked_sections
        spaces_added = 0
        for new_space in spaces_position_to_add:
            # Add space before and after the word
            string = string[:new_space[0]+spaces_added]+" "+string[new_space[0]+spaces_added:]
            spaces_added += 1
            string = string[:new_space[1]+spaces_added]+" "+string[new_space[1]+spaces_added:]
            spaces_added += 1
            # Update existing blocked_sections
            for blocked_section in blocked_sections:
                if new_space[0]<blocked_section[0]:
                    blocked_section[0] += 2
                    blocked_section[1] += 2
    # Trim extra spaces
    string = re.sub(' +', ' ', string)
    return string
### MAIN ###
if __name__ == '__main__':
    s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
    words = ["Mr", "Mrs"]
    print(s)
    print(add_spaces(s,words))
How to find word(s) in a sentence that end with a pattern using regex
I have a list of patterns I want to match within a sentence.
For example
my_list = ['one', 'this']
sentence = 'Someone dothis onesome thisis'
Result should return only words that end with items from my_list
['Someone','dothis'] only
since I do not want to match onesome or thisis
You can end your pattern with the word boundary metacharacter \b. It matches the zero-width position between a word character and a non-word character, and also the end of the string after a word character. So, in that specific case, the pattern would be (one|this)\b.
To actually create a regex from your my_list variable, assuming that no reserved characters are present, you can do:
import re
def words_end_with(sentence, my_list):
    return re.findall(r"({})\b".format("|".join(my_list)), sentence)
If you're using Python 3.6+, you can also use an f-string, to do this formatting inside the string itself:
import re
def words_end_with(sentence, my_list):
    return re.findall(fr"({'|'.join(my_list)})\b", sentence)
See https://www.regular-expressions.info/wordboundaries.html
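Note that with a single capturing group re.findall returns only the captured ending ("one", "this"), not the whole word. A hedged sketch that returns the whole words and also drops the "no reserved characters" assumption by escaping each entry; the function name whole_words_ending_with is invented for this example:

```python
import re

def whole_words_ending_with(sentence, endings):
    # re.escape makes each ending literal, so characters like "." or "+"
    # in the list cannot change the meaning of the pattern.
    alternatives = "|".join(re.escape(ending) for ending in endings)
    # \b\w* takes in the start of the word; (?:...) is non-capturing, so
    # findall returns the whole match instead of just the ending.
    return re.findall(r"\b\w*(?:{})\b".format(alternatives), sentence)

print(whole_words_ending_with('Someone dothis onesome thisis', ['one', 'this']))
# ['Someone', 'dothis']
```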
You can use the following pattern:
\b(\w+(one|this))\b
It says match whole words within word boundaries (\b...\b), and within whole words match any word character (\w+) followed by the literal one or this ((one|this))
https://regex101.com/r/UzhnSw/1/
I need to count the number of words in a sentence. I do it with:
word_matrix[i][j] = sentences[i].count([*words_dict][j])
But it also counts when a word is included in other word, for example 'in' is included in 'interactive'. How to avoid it?
You could use collections.Counter for this:
from collections import Counter
s = 'This is a sentence'
Counter(s.lower().split())
# Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})
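To connect this to the question, a small sketch of how one row of counts could be read from the Counter. The sample sentence and the vocabulary list are made up, with vocabulary standing in for the keys of words_dict:

```python
from collections import Counter

sentence = "Interactive maps are in use in many apps"
vocabulary = ["in", "interactive", "missing"]  # hypothetical stand-in for words_dict

# Count whole whitespace-separated tokens, so "in" is not counted
# inside "interactive".
counts = Counter(sentence.lower().split())

# A Counter returns 0 for keys it has never seen.
row = [counts[word] for word in vocabulary]
print(row)  # [2, 1, 0]
```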
You can just do this:
sentence = 'this is a test sentence'
word_count = len(sentence.split(' '))
In this case word_count would be 5.
Use split to tokenise the words of the statement, then, if a word already exists in the dict, increment its value by one; otherwise add the word with a count of one:
paragraph='Nory was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been'
words=paragraph.split()
word_count={}
counter=0
for i in words:
    if i in word_count:
        word_count[i]+=1
    else:
        word_count[i]=1
print(word_count)
Depending on the situation, the most efficient solution would be using collections.Counter, but you will miss all the words with a symbol attached:
i.e. in will be different from interactive (as you want), but will also be different from in:.
An alternative solution that takes this problem into account is to count the matches of a RegEx:
import re
my_count = re.findall(r"(?:\s|^)({0})(?:[\s.,;:]|$)".format([*words_dict][j]), sentences[i])
print(len(my_count))
What is the RegEx doing?
For a given word, you match:
the same word preceded by a space or start of line (\s|^)
and followed by a space, a dot, a comma, a semicolon or a colon ([\s.,;:]), or by the end of the line ($). Inside square brackets $ is a literal dollar sign, so the end-of-line anchor has to sit outside them.
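If any non-word character should count as a delimiter, a shorter variant is to wrap the escaped word in \b on both sides. This is a sketch rather than a drop-in replacement for the pattern above, and the function name count_word is hypothetical:

```python
import re

def count_word(sentence, word):
    # \b on both sides means the word must not be glued to other word
    # characters; re.escape keeps the word literal.
    return len(re.findall(r"\b{}\b".format(re.escape(word)), sentence))

print(count_word("in: an interactive session, in short", "in"))  # 2
```

Because \b is zero-width, adjacent occurrences such as "in in in" are each counted, which a pattern that consumes the surrounding spaces would not do.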