Extract words from a sentence that contain a substring - python

I want to extract the full phrase (one or multiple words) that contains a specific substring. The substring can consist of one or multiple words, and it can 'break'/'split' words in test_string, but the desired output is the full phrase/word from test_string, for example:
test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'
substring1 = 'he text th'
substring2 = 'amp'
if substring1 in test_string:
    print("substring1 found")
if substring2 in test_string:
    print("substring2 found")
My desired output is:
[the text that]
[example, amplifier, lamp]
FYI: the substring can be at the beginning, middle, or end of a word; it does not matter.

If you want something robust, I would do something like this:
import re
re.findall(r"((?:\w+)?" + re.escape(substring2) + r"(?:\w+)?)", test_string)
This way the substring can contain any characters, because re.escape neutralizes regex metacharacters.
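Explanation of the regex:
'(?:\w+)' is a non-capturing group that matches one or more word characters.
'?' makes that group optional (zero or one occurrence).
I put this at the beginning and at the end of your substring, because the substring can sit at the start, middle, or end of the word.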
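As a minimal sketch, the pattern above can be wrapped in a small helper and run against both substrings from the question. The function name find_words_containing is hypothetical, and (?:\w+)? is written as the equivalent \w*.

```python
import re

def find_words_containing(substring, text):
    # Hypothetical helper: extend the escaped substring with any word
    # characters directly attached to it on either side.
    pattern = r"\w*" + re.escape(substring) + r"\w*"
    return re.findall(pattern, text)

test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'
print(find_words_containing('he text th', test_string))  # ['the text that']
print(find_words_containing('amp', test_string))         # ['example', 'amplifier', 'lamp']
```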
To answer the latest comment about how to get the punctuation as well: I would do something like this, using string.punctuation:
import re
import string
pattern = r"(?:[" + r"\w" + re.escape(string.punctuation) + r"]+)?"
re.findall("(" + pattern + re.escape(substring2) + pattern + ")", test_string)
Doing so will also match any punctuation attached to the beginning or end of the word, for example: [I love you.., I love you!!, I love you!?, ?I love you!, ...]
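A small sketch of the punctuation-aware pattern, run on a made-up sample sentence (the sample text and the variable names are assumptions for illustration only):

```python
import re
import string

substring = 'I love you'
sample = 'I love you!! and ?I love you!'

# Optional run of word characters or punctuation on either side of the substring.
edge = r"(?:[\w" + re.escape(string.punctuation) + r"]+)?"
matches = re.findall(edge + re.escape(substring) + edge, sample)
print(matches)  # ['I love you!!', '?I love you!']
```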

This is a job for regex; you could do:
import re
substring2 = 'amp'
test_string = 'this is an example of the text that I have'
print("matches for substring 1:", re.findall(r"(\w+he text th\w+)", test_string))
print("matches for substring 2:", re.findall(r"(\w+amp\w+)", test_string))
Output:
matches for substring 1: ['the text that']
matches for substring 2: ['example']
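Note that \w+ requires at least one word character on each side of the substring, so a word that starts or ends with it is skipped. A quick sketch on the full test_string from the question, comparing \w+ with \w* (zero or more):

```python
import re

test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'

# \w+ needs a character before and after 'amp', so only 'example' qualifies.
strict = re.findall(r"\w+amp\w+", test_string)
# \w* also accepts 'amp' at the very start or end of a word.
relaxed = re.findall(r"\w*amp\w*", test_string)

print(strict)   # ['example']
print(relaxed)  # ['example', 'amplifier', 'lamp']
```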

Related

Regex to find the names in every sentence using python

Hi, I am new to regex and stuck on this question.
Q: Identify all of the words that look like names in the sentence. In other words, those which are capitalized but aren't the first word in the sentence.
sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
Here's what I did, but I am not getting any output (I tried to exclude the text from the beginning up to the first capitalized word that is a name):
p = re.compile(r'[^A-Z]\w+[A-Z]\w+')
m = p.finditer(sentence)
for m in m:
    print(m)
Assuming there's always exactly one space after a dot before another sentence begins, you can use a negative lookbehind pattern to exclude names that are preceded by a dot and a space, and another negative lookbehind pattern to exclude the beginning of the string. Also use \b to ensure that a capital letter is matched at a word boundary:
re.findall(r'(?<!\. )(?<!^)\b[A-Z]\w*', sentence)
This returns:
['Harry', 'Susy']
You can use a positive lookbehind to look for a capitalized word that is not at the beginning of a sentence, like so:
>>> sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
>>> re.findall(r'(?<=[a-z,][ ])([A-Z][a-z]*)', sentence)
['Harry', 'Susy']
IMO this is best done with nltk:
from nltk import sent_tokenize, word_tokenize
sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
for sent in sent_tokenize(sentence):
    words = word_tokenize(sent)
    possible_names = [word for word in words[1:] if word[0].isupper()]
    print(possible_names)
Or, if you're into comprehensions:
names = [word
         for sent in sent_tokenize(sentence)
         for word in word_tokenize(sent)[1:]
         if word[0].isupper()]
Which will yield
['Harry', 'Susy']
You're overwriting your m variable. Try this:
p = re.compile(r'[^A-Z]\w+[A-Z]\w+')
for m in p.finditer(sentence):
    print(m)
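Renaming the loop variable makes the loop easier to follow, but the pattern itself still requires a capital letter that follows at least one word character inside the same word, which never occurs in the sample sentence, so nothing is printed. A sketch that keeps finditer but swaps in the negative-lookbehind pattern from the earlier answer (assuming, as that answer does, a single space after each sentence-ending dot):

```python
import re

sentence = ("This is not a name, but Harry is. So is Susy. "
            "Sam should be missed as it's the first word in the sentence.")

# Capitalized words that are neither at the start of the string
# nor directly after ". " (the start of a new sentence).
name_pattern = re.compile(r'(?<!\. )(?<!^)\b[A-Z]\w*')

names = []
for match in name_pattern.finditer(sentence):
    names.append(match.group())
    print(match.group(), match.span())

print(names)  # ['Harry', 'Susy']
```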

Match a word ending with pattern in a sentence

How can I find the word(s) in a sentence that end with a pattern, using regex?
I have a list of patterns I want to match within a sentence.
For example:
my_list = ['one', 'this']
sentence = 'Someone dothis onesome thisis'
The result should contain only the words that end with items from my_list:
['Someone','dothis'] only
since I do not want to match onesome or thisis.
You can end your pattern with the word boundary metacharacter \b. It is a zero-width assertion that matches the position between a word character and a non-word character, including the end of the string. So, in that specific case, the pattern would be (one|this)\b.
To actually create a regex from your my_list variable, assuming that no reserved characters are present, you can do:
import re
def words_end_with(sentence, my_list):
    return re.findall(r"({})\b".format("|".join(my_list)), sentence)
If you're using Python 3.6+, you can also use an f-string, to do this formatting inside the string itself:
import re
def words_end_with(sentence, my_list):
    return re.findall(fr"({'|'.join(my_list)})\b", sentence)
See https://www.regular-expressions.info/wordboundaries.html
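The functions above return only the matched endings ('one', 'this'). As a sketch of one way to get the whole words, and to drop the assumption about reserved characters, the suffixes can be passed through re.escape and the pattern prefixed with \b\w*. The name whole_words_ending_with is hypothetical, and the sketch assumes each suffix ends in a word character so that the trailing \b behaves as expected:

```python
import re

def whole_words_ending_with(sentence, suffixes):
    # Escape each suffix so regex metacharacters are treated literally.
    alternation = "|".join(re.escape(suffix) for suffix in suffixes)
    # \b\w* consumes the start of the word; the final \b pins the suffix to the word's end.
    return re.findall(r"\b\w*(?:{})\b".format(alternation), sentence)

print(whole_words_ending_with('Someone dothis onesome thisis', ['one', 'this']))
# ['Someone', 'dothis']
```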
You can use the following pattern:
\b(\w+(one|this))\b
It says: match whole words between word boundaries (\b...\b), where each word consists of one or more word characters (\w+) followed by the literal one or this ((one|this)).
https://regex101.com/r/UzhnSw/1/

Python: delete all characters before the first letter in a string

After a thorough search I could find how to delete all characters before a specific letter but not before any letter.
I am trying to turn a string from this:
" This is a sentence. #contains symbol and whitespace
To this:
This is a sentence. #No symbols or whitespace
I have tried the following code, but strings such as the first example still appear.
for ch in ['\"', '[', ']', '*', '_', '-']:
    if ch in sen1:
        sen1 = sen1.replace(ch,"")
Not only does this fail to delete the double quote in the example for some unknown reason but also wouldn't work to delete the leading whitespace as it would delete all of the whitespace.
Thank you in advance.
Instead of just removing whitespace, to remove any character before the first letter, do this:
# s is your string
for i, x in enumerate(s):
    if x.isalpha():  # True if it's a letter
        pos = i  # position of the first letter
        break
new_str = s[pos:]
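One caveat with the loop above: if the string contains no letters, pos is never assigned and the final slice raises a NameError. A minimal sketch that handles that case with next() and a default value (the function name is hypothetical):

```python
def strip_before_first_letter(s):
    # Index of the first alphabetic character, or len(s) if there is none,
    # so a string without letters becomes the empty string.
    pos = next((i for i, ch in enumerate(s) if ch.isalpha()), len(s))
    return s[pos:]

print(strip_before_first_letter('" This is a sentence.'))  # This is a sentence.
print(repr(strip_before_first_letter('1234 !!')))           # ''
```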
import re
s = " sthis is a sentence"
r = re.compile(r'.*?([a-zA-Z].*)')
print(r.findall(s)[0])
Strip all whitespace and punctuation:
>>> text.lstrip(string.punctuation + string.whitespace)
'This is a sentence. #contains symbol and whitespace'
Or, as an alternative, find the first character that is an ASCII letter. For example:
>>> pos = next(i for i, x in enumerate(text) if x in string.ascii_letters)
>>> text[pos:]
'This is a sentence. #contains symbol and whitespace'
This is a very basic version; i.e. it uses syntax that beginners in Python will easily understand.
your_string = "1324 $$ '!' '' # this is a sentence."
while len(your_string) > 0 and your_string[0] not in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz":
    your_string = your_string[1:]
print(your_string)
#prints "this is a sentence."
Pros: Simple, no imports
Cons: The while loop could be avoided if you feel comfortable using list comprehensions.
Also, the string that you're comparing to could be simpler using regex.
Drop everything up to the first alpha character.
import itertools as it
s = " - .] * This is a sentence. #contains symbol and whitespace"
"".join(it.dropwhile(lambda x: not x.isalpha(), s))
# 'This is a sentence. #contains symbol and whitespace'
Alternatively, iterate over the string and test whether each character is in a blacklist. If it is, strip that character from the left; otherwise, short-circuit and return.
def lstrip(s, blacklist=" "):
    for c in s:
        if c in blacklist:
            s = s.lstrip(c)
            continue
        return s
    return s  # every character was blacklisted
lstrip(s, blacklist='\"[]*_-. ')
# 'This is a sentence. #contains symbol and whitespace'
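You can use re.sub, anchoring the pattern to the start of the string so that only the leading non-letters are removed:
import re
text = " This is a sentence. #contains symbol and whitespace"
re.sub(r"^[^a-zA-Z]+", "", text)
# 'This is a sentence. #contains symbol and whitespace'
The arguments are: re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)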

How to remove words containing only numbers in python?

I have some text in Python which is composed of numbers and letters. Something like this:
s = "12 word word2"
From the string s, I want to remove all the words containing only numbers
So I want the result to be
s = "word word2"
This is the regex I have, but it does not do what I want: it also affects the letters, replacing them with a space.
re.sub('[\ 0-9\ ]+', ' ', line)
Can someone help in telling me what is wrong? Also, is there a more time-efficient way to do this than regex?
Thanks!
You can use this regex:
>>> s = "12 word word2"
>>> print(re.sub(r'\b[0-9]+\b\s*', '', s))
word word2
\b matches a word boundary, and \s* removes zero or more whitespace characters after the number word.
Using a regex is probably a bit overkill here, depending on whether you need to preserve whitespace:
s = "12 word word2"
s2 = ' '.join(word for word in s.split() if not word.isdigit())
# 'word word2'
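To illustrate the whitespace point: split() followed by ' '.join() collapses every run of whitespace into a single space, while the regex from the previous answer leaves the spacing between the remaining words untouched. A small sketch with a made-up input that has irregular spacing:

```python
import re

s = "word  12   word2"  # two spaces, then three spaces

# Regex: removes the number and the whitespace after it, keeps the rest as-is.
with_regex = re.sub(r'\b[0-9]+\b\s*', '', s)
# split/join: drops digit-only words and normalizes all spacing to one space.
with_split = ' '.join(word for word in s.split() if not word.isdigit())

print(repr(with_regex))  # 'word  word2'
print(repr(with_split))  # 'word word2'
```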
Without using any external library you could do:
stringToFormat = "12 word word2"
words = ""
for word in stringToFormat.split(" "):
    try:
        int(word)
    except ValueError:
        words += "{} ".format(word)
print(words)

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in them, but not at the beginning or the end of the word.
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want is a regex like this:
\w+-\w+
This means: match one or more word characters (\w+, i.e. letters, digits, or underscores), then a '-', then one or more word characters again. The '+' means "at least once".
str is a built-in name, so it is better not to use it as a variable name:
import re
st = 'word semi-column peace'
# \w+ word before the hyphen, -, \w+ word after the hyphen
print(re.findall(r"\b\w+-\w+\b", st))
['semi-column']
Quoting the question: "a '-' (minus sign) in it but not at the beginning and not at the end of the word"
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match in words with hyphens at the beginning or end. A string like "-not-wanted-" will be matched (as not-wanted) by both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
print(result)  # ['semi-column', 'one-word', 'multi-hyphenated-word']
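A quick sketch comparing the three patterns discussed in this answer on a short made-up string shows why the lookarounds are needed:

```python
import re

text = "-not-wanted- semi-column"

# Both simpler patterns pick up the inner part of "-not-wanted-".
loose = re.findall(r"\w+-\w+", text)
bounded = re.findall(r"\b\w+-\w+\b", text)
print(loose)    # ['not-wanted', 'semi-column']
print(bounded)  # ['not-wanted', 'semi-column']

# The lookarounds reject words with a leading or trailing hyphen.
strict = re.findall(r"(?<![-\w])\w+(?:-\w+)+(?![-\w])", text)
print(strict)   # ['semi-column']
```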
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+\-\S+", st))
['semi-column']
You can try something like this: centering on the hyphen, I match until there is a whitespace character in either direction. I also check whether the word is surrounded by hyphens (e.g. -test-cats-), and if it is, I make sure not to include them. The regular expression should also work with findall.
st = 'word semi-column peace'
m = re.search(r'([^ | ^-]+-[^ | ^-]+)', st)
if m:
    print(m.group(1))
