How to find word(s) in a sentence that end with a pattern using regex
I have a list of patterns I want to match within a sentence.
For example
my_list = ['one', 'this']
sentence = 'Someone dothis onesome thisis'
Result should return only words that end with items from my_list
['Someone','dothis'] only
since I do not want to match onesome or thisis
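You can end your pattern with the word boundary metacharacter \b. It is a zero-width assertion: it matches the position between a word character and a non-word character, which includes the end of the string when the last character is a word character. So, in that specific case, the pattern would be (one|this)\b. Note that this pattern matches only the ending itself; to match the whole word, put \w* in front of the group.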
To actually create a regex from your my_list variable, assuming that no reserved characters are present, you can do:
import re
def words_end_with(sentence, my_list):
    return re.findall(r"({})\b".format("|".join(my_list)), sentence)
If you're using Python 3.6+, you can also use an f-string, to do this formatting inside the string itself:
import re
def words_end_with(sentence, my_list):
    return re.findall(fr"({'|'.join(my_list)})\b", sentence)
See https://www.regular-expressions.info/wordboundaries.html
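Because re.findall returns only the captured group when the pattern contains one, the (one|this)\b pattern yields the endings ('one', 'this') rather than the full words. Here is a minimal sketch of one way to return whole words instead. The helper name words_ending_with is hypothetical, re.escape is added as a guard in case an ending contains reserved characters, and the non-capturing group with a leading \w* is an assumption about what the asker wants (it also accepts a word that consists of the ending alone).

```python
import re

def words_ending_with(sentence, endings):
    # Escape each ending so characters such as '.' or '+' are taken literally.
    alternation = "|".join(re.escape(e) for e in endings)
    # \w* consumes the start of the word; (?:...) keeps findall returning the whole match.
    pattern = rf"\b\w*(?:{alternation})\b"
    return re.findall(pattern, sentence)

result = words_ending_with('Someone dothis onesome thisis', ['one', 'this'])
print(result)
```

With the sample sentence from the question this should print the two words that end in one of the listed patterns.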
You can use the following pattern:
\b(\w+(one|this))\b
It says: match whole words within word boundaries (\b...\b), and within each whole word match one or more word characters (\w+) followed by the literal one or this ((one|this)).
https://regex101.com/r/UzhnSw/1/
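One thing to watch for when using this pattern from Python: it has two capturing groups, so re.findall returns a tuple per match rather than a plain string. A small sketch of how the whole words could be pulled out, using re.finditer and group 1 (the variable names are illustrative only):

```python
import re

sentence = 'Someone dothis onesome thisis'
pattern = re.compile(r"\b(\w+(one|this))\b")

# findall returns one tuple per match: (whole word, matched ending)
pairs = pattern.findall(sentence)

# group(1) of each match object is the whole word
whole_words = [m.group(1) for m in pattern.finditer(sentence)]

print(pairs)
print(whole_words)
```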
Related
Basically, I start with inserting the word "brand" where I replace a single character in the word with an underscore and try and find all words that match the remaining characters. For example:
"b_and" would return: "band", "brand", "bland" .... etc.
I started by using re.sub to substitute the underscore for the character, but I'm really lost on where to go next. I only want words that differ by this underscore, either by dropping the underscore or by replacing it with a letter. For example, if the word "under" was run through the list, I wouldn't want it to return "understood" or "thunder", just words with a single-character difference. Any ideas would be great!
I tried replacing the character with every letter in the alphabet first and then checking whether that word is in the dictionary, but that took such a long time. I really want to know if there's a faster way.
from itertools import chain
dictionary = open("Scrabble.txt").read().split('\n')
import re, string
# after replacing a letter of the word with "_", we find words in the dictionary that match the pattern
new = []
for letter in string.ascii_lowercase:
    underscore = re.sub('_', letter, word)
    if underscore in dictionary:
        new.append(underscore)
if new == []:
    pass
else:
    return new
IIUC this should do it. I'm doing it outside a function so you have a working example, but it's straightforward to do it inside a function.
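import re
string = 'band brand bland cat dand bant bramd branding blandisher'
word = 'brand'
new = []
for n, letter in enumerate(word):
    # the letter at position n may be missing or replaced by any word character;
    # \b on both sides keeps fragments of longer words (e.g. "branding") out
    pattern = r'\b' + word[:n] + r'\w?' + word[n+1:] + r'\b'
    new.extend(re.findall(pattern, string))
new = list(set(new))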
Output:
['bland', 'brand', 'bramd', 'band']
Explanation:
We're using regex to do what you're looking for. In every iteration we take one letter out of "brand" and make the algorithm look for any word that matches. So it'll look for:
_rand, b_and, br_nd, bra_d, bran_
For the case of "b_and" the pattern is b\w?and, which means: find a word with b, then any character may or may not appear, and then 'and'.
Then it adds to the list all words that match.
Finally I remove duplicates with list(set(new))
Edit: forgot to add the string variable.
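On the speed complaint in the question: a likely cause is that `underscore in dictionary` scans a list from start to finish on every lookup. A non-regex sketch that swaps in a different technique, generating every one-character variant and testing it against a set, might look like the following. The function name one_char_variants and the sample word list are made up for illustration, and lowercase ASCII words are assumed.

```python
import string

def one_char_variants(word, dictionary_words):
    vocab = set(dictionary_words)  # set membership tests are constant time on average
    found = set()
    for i in range(len(word)):
        prefix, suffix = word[:i], word[i + 1:]
        # either drop the letter at position i, or replace it with any letter
        candidates = [prefix + suffix]
        candidates += [prefix + c + suffix for c in string.ascii_lowercase]
        found.update(c for c in candidates if c in vocab)
    return sorted(found)

sample = 'band brand bland cat dand bant bramd branding blandisher'.split()
variants = one_char_variants('brand', sample)
print(variants)
```

This tries 27 candidates per letter position regardless of dictionary size, so the cost depends on the length of the word rather than on the number of dictionary entries.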
Here's a version of Juan C's answer that's a bit more Pythonic
import re
dictionary = open("Scrabble.txt").read().split('\n')
pattern = "b_and" # change to what you need
pattern = pattern.replace('_', '.?')
pattern += '\\b'
matching_words = [word for word in dictionary if re.match(pattern, word)]
Edit: fixed the regex according to your comment, quick explanation:
pattern = "b_and"
pattern = pattern.replace('_', '.?') # pattern is now b.?and, .? matches any one character (or none at all)
pattern += '\\b' # \b prevents matching with words like "bandit" or words longer than "b_and"
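As a variation on the snippet above, re.fullmatch (available since Python 3.4) requires the entire word to match, which removes the need for the trailing \b. This is only a sketch: the helper name match_blank and the sample word list are hypothetical, and escaping the literal parts with re.escape is an added precaution.

```python
import re

def match_blank(template, words):
    # '_' marks the blank; every other part is escaped and matched literally.
    # '.?' lets the blank stand for any single character or for nothing at all.
    regex = re.compile(".?".join(re.escape(part) for part in template.split("_")))
    return [w for w in words if regex.fullmatch(w)]

sample = ["band", "brand", "bland", "bandit", "brands", "cat"]
matches = match_blank("b_and", sample)
print(matches)
```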
I am using the following regex to split a phrase passed in as a string, into a list of words.
Because there might be other letters, I'm using the UTF flag. This works great in most cases:
phrase = 'hey look out'
word_list = re.split(r'[\W_]+', unicode(phrase, 'utf-8').lower(), flags=re.U)
word_list [u'hey', u'look', u'out']
But, if the phrase is a sentence that ends with a period like this, it will create a blank value in the list:
phrase = 'hey, my spacebar_is broken.'
word_list [u'hey', u'my', u'spacebar', u'is', u'broken', u'']
My workaround is to use
re.split(r'[\W_]+', unicode(phrase.strip('.'), 'utf-8').lower(), flags=re.U)
but I wanted to know if there was a way to solve it within the regex itself.
\W selects non-word characters. Since . is a non-word character, the string is split on it, and since there is nothing after the period, you get an empty string. If you want to avoid this, you'll need to either strip the separator characters off the ends of the string
phrase = re.sub(r'^[\W_]+|[\W_]+$', '', phrase)
or filter the resulting array to remove empty strings.
word_list = [word for word in word_list if word]
Alternatively, you can get the words by matching them directly rather than splitting:
words = re.findall(r'[^\W_]+', phrase)
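For comparison, here is a short Python 3 sketch of both behaviours on the phrase from the question. It assumes Python 3, where str patterns are Unicode-aware by default, so the unicode(...) call and the re.U flag from the question are not needed.

```python
import re

phrase = 'hey, my spacebar_is broken.'

# Splitting leaves an empty string after the final separator run.
split_words = re.split(r'[\W_]+', phrase.lower())

# Matching the words directly never produces empty strings.
found_words = re.findall(r'[^\W_]+', phrase.lower())

print(split_words)
print(found_words)
```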
Can someone help me with this kind of regular expression matching?
For example, I'm searching through list containing different strings with a letter iterating at the end of the string:
MonsterA
MonsterB
MonsterC
HeroA
HeroB
HeroC
...
What I need this script to return is only the preceding part of the string, in this example Monster and Hero.
If you absolutely need a regex:
re.match(r"(.*)[A-Z]", word).group(1)
But it is not the most efficient if you just want to remove the last character.
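If every entry really does end in exactly one trailing letter, plain slicing is enough. A minimal sketch, assuming that input shape and keeping the first-seen order of the distinct prefixes (the variable names are illustrative):

```python
names = ["MonsterA", "MonsterB", "MonsterC", "HeroA", "HeroB", "HeroC"]

prefixes = []
for name in names:
    stem = name[:-1]          # drop the last character
    if stem not in prefixes:  # keep each prefix once, in order of appearance
        prefixes.append(stem)

print(prefixes)
```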
You could use a positive lookahead assertion (?=...) to check that the word ends in a single uppercase character, and then use word boundaries \b...\b to ensure it does not match patterns that aren't whole words:
>>> text = "This re will match MonsterA and HeroB but not heroC or MonsterCC"
>>> re.findall(r"\b[A-Z][a-z]+(?=[A-Z]\b)", text)
['Monster', 'Hero']
re.findall returns all such matches in a list.
I am using the following regex in Python to keep words that do not contain non alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots but it does not seem to be working.
Thanks !
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
then I open the text file
contents = open("test.txt","r")
and I search for the words line by line
for line in contents:
    if find_words(line.lower()) != []:
        lineWords = find_words(line.lower())
        print "The words in this line are: ", lineWords
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong to the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S) except that it also allows characters that are not spaces, such as punctuation. So this will allow a match if, after the matched word, the line ends or there is a non-word character, in other words anything other than a letter, digit or underscore. Because the underscore counts as a word character, word_ does not produce a match for word; if you do want it to, change \W to [^a-zA-Z0-9].
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is itself composed of two alternatives, matching either (?<=[^\w./-]) or (?<=^). The second one allows a match if the word to be matched starts at the beginning of the line. We cannot use (?<=[^\w./-]|^) because a lookbehind in Python's re module cannot be of variable width ([^\w./-] has a width of 1 and ^ a width of 0).
(?<=[^\w./-]) allows a match if, before the word, there are no word characters, periods, forward slashes or hyphens.
When broken down, the small parts are rather straightforward I think, but if there's anything you want some more elaboration, I can give more details.
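A quick way to try the proposed regex is to run it over a few of the sample strings from the question, one string per line as in the asker's loop. This is only a sketch: the two lists below are a subset of the question's examples, with the long URL-like sample shortened to its leading fragment.

```python
import re

find_words = re.compile(
    r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)'
).findall

should_match = ["On-line", ". These", "maintenance,", "other.", ". Our", "Google"]
should_not_match = ["NY7xtb92dCTfvEjdmkDrUw==", "-Publisher", "$As_Of_12_31"]

# Each string is treated as its own line.
for line in should_match + should_not_match:
    print(repr(line), find_words(line))
```

The strings in should_match are expected to yield one word each, and those in should_not_match an empty list.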
This is not purely a regex task, because a regex alone cannot tell whether a string is a real word. You need a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters and then checking that all of the resulting items exist in your dictionary. For example:
import re
words = re.split(r'[^A-Za-z]+', my_string)
print(all(i in my_dict for i in words if i))
As an alternative, you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet
words = re.split(r'[^A-Za-z]+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    pass  # do stuff
But if you want to use your own word lists, you need to change your regex, because it is incorrect. Instead, use re.split as above:
all_words = wanted1|wanted2|negators
with open("test.txt","r") as f:
    for line in f:
        for word in line.split():
            words = re.split(r'[^A-Za-z]+', word)
            if all(i in all_words for i in words if i):
                print(word)
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by a space, quote or paren, but e.g. not a number (using double negation instead of a positive look-behind so that the first word is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
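One detail worth checking when trying this pattern: the look-ahead requires an actual following character, so a word at the very end of a string with no trailing newline is not matched. Lines read from a file normally keep their newline, so this may not matter in the asker's loop. The sketch below shows the difference; the relaxed variant with |$ is an addition made here for illustration and is not part of the answer.

```python
import re

core = r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*"""
strict = re.compile(core + r"""(?=[\s.,:;!?")])""")     # pattern as given in the answer
relaxed = re.compile(core + r"""(?=[\s.,:;!?")]|$)""")  # assumption: also accept end of string

line = "On-line maintenance, -Publisher NY7xtb92dCTfvEjdmkDrUw==\n"
print(strict.findall(line))
print(strict.findall("Google\n"))
print(strict.findall("Google"))   # no trailing character, so nothing is found
print(relaxed.findall("Google"))
```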
I am trying to construct a regular expression in Python for the following rules:
Accept words containing only letters
Words may contain - (hyphen)
A word cannot end with a special character, e.g. : or ) (please consider these two)
A word cannot start with _ (underscore) but may end with _ (underscore)
For eg.
Accept Words
Hello
Hello-World
Hello_
Hello1
Reject words
_hello_
hello:
hello:)
I have come up with following regular expression,
'(?!_)[\w-]+(?!:)'
It still accepts all words, just skipping the _ at the start and the : at the end.
Can somebody point out what's wrong with my regular expression?
Thanks
You can add a leading and trailing \b.
import re
words = ["Hello", "Hello-World", "Hello_", "Hello1", "_hello_", "hello:",
         "hello:)"]
for word in words:
    print(re.findall(r'\b(?!_)[\w-]+(?!:)\b', word))
Output:
['Hello']
['Hello-World']
['Hello_']
['Hello1']
[]
[]
[]
From http://docs.python.org/2/library/re.html
\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
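The part of that definition that matters here is that the underscore counts as a word character. A small sketch with a deliberately simple pattern, \bhello\b, shows the effect (the test strings are made up for illustration):

```python
import re

plain = re.findall(r'\bhello\b', 'hello')         # boundaries at both ends of the string
colon = re.findall(r'\bhello\b', 'hello:')        # ':' is a non-word character, so \b holds
trailing_us = re.findall(r'\bhello\b', 'hello_')  # '_' is a word character: no boundary after 'o'
leading_us = re.findall(r'\bhello\b', '_hello')   # no boundary between '_' and 'h'

print(plain, colon, trailing_us, leading_us)
```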
There's still quite a bit of ambiguity in what you're asking for, but here's another solution for the sample set you gave, per this fiddle
^[A-Za-z-]+[_\d]?$
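To check this pattern against the sample set from the question, something like the following sketch could be used. It tests each word on its own, which is what the ^ and $ anchors assume.

```python
import re

pattern = re.compile(r'^[A-Za-z-]+[_\d]?$')

samples = ["Hello", "Hello-World", "Hello_", "Hello1", "_hello_", "hello:", "hello:)"]

accepted = [w for w in samples if pattern.match(w)]
rejected = [w for w in samples if not pattern.match(w)]

print(accepted)
print(rejected)
```

Note that, as written, the pattern would also accept a string made only of hyphens, which may or may not be acceptable given the ambiguity mentioned above.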