Capture words that do not have a hyphen in them - Python

For example we have this text:
Hello but I don't want1 this non-object word in it.
Using a regular expression, how can I extract words that must start with a letter and contain only letters or numbers? For this example I only want:
Hello but I want1 this word in it
Any help would be appreciated! Thanks!

You can use lookarounds in your regex:
>>> import re
>>> s = "Hello but I don't want1 this non-object word in it."
>>> print re.findall(r'(?:(?<=\s)|(?<=^))\w+(?=[.\s]|$)', s)
['Hello', 'but', 'I', 'want1', 'this', 'word', 'in', 'it']
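A hedged alternative sketch without lookarounds, assuming Python 3: split on whitespace, strip the trailing period, and keep only tokens that start with a letter and contain nothing but letters or digits.

```python
import re

s = "Hello but I don't want1 this non-object word in it."
# Strip a trailing period from each whitespace-separated token,
# then keep tokens made of a letter followed only by letters/digits.
tokens = [t.rstrip('.') for t in s.split()]
result = [w for w in tokens if re.fullmatch(r'[A-Za-z][A-Za-z0-9]*', w)]
print(result)  # ['Hello', 'but', 'I', 'want1', 'this', 'word', 'in', 'it']
```

Here `re.fullmatch` rejects "don't" (apostrophe) and "non-object" (hyphen) without any lookaround machinery.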

extract words that must start with a letter and that only have letters or
numbers in it
An alternative solution uses the re.sub function from the re module:
import re

s = "Hello but I don't want this non-object word in it."
s = re.sub(r'\s?\b[a-zA-Z]+?[^\w ][\w]+?\b', '', s)
print(s)
The output:
Hello but I want this word in it.

Related

Not getting the expected output for some reason?

Question: please debug the logic so that it produces the expected output.
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(word_list)
Output:
['Hello', 'there', '.', '']
Problem: the list should not contain the trailing empty string.
Expected: ['Hello', 'there', '.']
First of all, note that if you split the whole string at once with re.split(r'(\W+)', text), the output is ['Hello', ' ', 'there', '.', ''], because \W matches anything other than a letter, digit, or underscore (equivalent to [^a-zA-Z0-9_]), so the string is split on the space (\s) and the literal dot (.) character.
So if you want to get the expected output, you need to do some further processing like the below.
With Earlier Code:
import re
s = "Hello there."
l = list(filter(str.strip, re.split(r"(\W+)", s)))
print(l)
With Edited code:
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(list(filter(None, word_list)))
Output:
['Hello', 'there', '.']
Working Code: https://rextester.com/KWJN38243
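As a small illustration (my addition, not part of the original answer) of why filter(None, ...) drops the stray entry: empty strings are falsy in Python, so passing None as the filter function removes them.

```python
word_list = ['Hello', 'there', '.', '']
# filter(None, ...) keeps only truthy items, so the empty string '' is dropped
cleaned = list(filter(None, word_list))
print(cleaned)  # ['Hello', 'there', '.']
```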
Assuming the input string is "Hello there.", the results make sense. See the re.split documentation: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
You have put capturing parenthesis in the pattern, so you are splitting the string on non-word characters, and also return the characters used for splitting.
Here is the string:
Hello there.
Here is how it is split:
Hello|there|
That means you have three values: 'Hello', 'there', and an empty string '' in the last place.
And the values you split on are a space and a period.
So the output should be the three values interleaved with the two separators we split on:
'Hello' - space - 'there' - period - empty string
which is exactly what I get.
import re
s = "Hello there."
t = re.split(r"(\W+)", s)
print(t)
output:
['Hello', ' ', 'there', '.', '']
Further Explanation
From your question it may be that you think that because the string ends with a non-word character there would be nothing "after" it, but this is not how splitting works. Think back to CSV files (which have been around forever), and consider a CSV file like this:
date,product,qty,price
20220821,P1,10,20.00
20220821,P2,10,
The above represents a CSV file with four fields, but in the last data row the final field (which definitely exists) is empty. It would be parsed as an empty string if we split on the comma.
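A minimal sketch of that CSV point: splitting the last data row on the comma yields an empty string for the missing field, just as re.split does at the end of the sentence.

```python
line = "20220821,P2,10,"
fields = line.split(",")
print(fields)  # ['20220821', 'P2', '10', ''] -- the empty last field is preserved
```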

Match words that don't start with a certain letter using regex

I am learning regex but have not been able to find the right regex in Python for selecting words that do not start with a particular letter.
Example below
text='this is a test'
match=re.findall('(?!t)\w*',text)
# match returns
['his', '', 'is', '', 'a', '', 'est', '']
match=re.findall('[^t]\w+',text)
# match
['his', ' is', ' a', ' test']
Expected : ['is','a']
With regex
Use the negative set [^\Wt] to match any alphanumeric character that is not t. To avoid matching subsets of words, add the word boundary metacharacter, \b, at the beginning of your pattern.
Also, do not forget that you should use raw strings for regex patterns.
import re
text = 'this is a test'
match = re.findall(r'\b[^\Wt]\w*', text)
print(match) # prints: ['is', 'a']
See the demo here.
Without regex
Note that this is also achievable without regex.
text = 'this is a test'
match = [word for word in text.split() if not word.startswith('t')]
print(match) # prints: ['is', 'a']
You are almost on the right track. You just forgot the \b (word boundary) token:
\b(?!t)\w+
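Putting that together, a quick sketch of the corrected pattern:

```python
import re

text = 'this is a test'
# \b anchors at a word start; (?!t) rejects words beginning with 't'
matches = re.findall(r'\b(?!t)\w+', text)
print(matches)  # ['is', 'a']
```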

Regex in Python to match words with special characters

I have this code
import re
str1 = "These should be counted as a single-word, b**m !?"
match_pattern = re.findall(r'\w{1,15}', str1)
print(match_pattern)
I want the output to be:
['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
The output should exclude non-words such as the "!?". What other validation should I use to match and achieve the desired output?
I would use word boundaries (\b) enclosing one or more non-space characters:
match_pattern = re.findall(r'\b\S+\b', str1)
result:
['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
!? is skipped thanks to word-boundary magic, since \b does not consider it part of a word at all.
Probably you want something like [^\s.!?] instead of \w, but what exactly you want is not evident from a single example. [^...] matches a single character that is not one of those between the brackets, and \s matches whitespace characters (space, tab, newline, etc.).
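For instance, a sketch using such a negated set (adding the comma to the excluded characters is my assumption, so that "single-word," loses its trailing comma):

```python
import re

s = "These should be counted as a single-word, b**m !?"
# one or more characters that are not whitespace or . ! ? ,
tokens = re.findall(r"[^\s.!?,]+", s)
print(tokens)  # ['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
```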
You can also achieve a similar result not using RegEx:
string = "These should be counted as a single-word, b**m !?"
replacements = ['.', ',', '?', '!']
for replacement in replacements:
    if replacement in string:
        string = string.replace(replacement, "")
print string.split()
>>> ['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']

Extract text after specific character

I need to extract the word after the #
How can I do that? What I am trying:
text="Hello there #bob !"
user=text[text.find("#")+1:]
print user
output:
bob !
But the correct output should be:
bob
A regex solution for fun:
>>> import re
>>> re.findall(r'#(\w+)', '#Hello there #bob #!')
['Hello', 'bob']
>>> re.findall(r'#(\w+)', 'Hello there bob !')
[]
>>> (re.findall(r'#(\w+)', 'Hello there #bob !') or None,)[0]
'bob'
>>> print (re.findall(r'#(\w+)', 'Hello there bob !') or None,)[0]
None
The regex above will pick up patterns of one or more alphanumeric characters following an '#' character until a non-alphanumeric character is found.
Here's a regex solution to match one or more non-whitespace characters if you want to capture a broader range of substrings:
>>> re.findall(r'#(\S+)', '#Hello there #bob #!')
['Hello', 'bob', '!']
Note that when the above regex encounters a string like #xyz#abc it will capture xyz#abc in one result instead of xyz and abc separately. To fix that, you can use the negated \s character class while also negating # characters:
>>> re.findall(r'#([^\s#]+)', '#xyz#abc some other stuff')
['xyz', 'abc']
And here's a regex solution to match one or more alphabet characters only in case you don't want any numbers or anything else:
>>> re.findall(r'#([A-Za-z]+)', '#Hello there #bobv2.0 #!')
['Hello', 'bobv']
So you want the word starting after # up to a whitespace?
user=text[text.find("#")+1:].split()[0]
print(user)
bob
EDIT: as @bgstech noted, in cases where the string does not contain a "#", make a check first:
if "#" in text:
    user = text[text.find("#")+1:].split()[0]
else:
    user = "something_else_appropriate"

Finding whether a word exist in a string in python

Here is the piece of code I want help with:
listword=["os","slow"]
sentence="photos"
if any(word in sentence for word in listword):
    print "yes"
It prints yes, as "os" is present in "photos".
But I want to know whether "os" is present as a word in the string, rather than as part of another word. Is there a way without converting the sentence into a list of words? Basically, I don't want the program to print yes here; it should print yes only if the string contains "os" as a whole word.
Thanks
You'd need to use regular expressions, and add \b word boundary anchors around each word when matching:
import re
if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence) for word in listword):
    print 'yes'
The \b boundary anchor matches at string start and end points, and anywhere there is a transition between word and non-word characters (so between a space and a letter or digit, or between punctuation and a letter or digit).
The re.escape() function ensures that all regular expression metacharacters are escaped and we match on the literal contents of word and not accidentally interpret anything in there as an expression.
Demo:
>>> listword = ['foo', 'bar', 'baz']
>>> sentence = 'The quick fox jumped over the barred door'
>>> if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence) for word in listword):
... print 'yes'
...
>>> sentence = 'The tradition to use fake names like foo, bar or baz originated at MIT'
>>> if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence) for word in listword):
... print 'yes'
...
yes
By using a regular expression, you now can match case-insensitively as well:
if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence, re.I) for word in listword):
    print 'yes'
In this demo both 'the' and 'mit' qualify even though the case in the sentence differs:
>>> listword = ['the', 'mit']
>>> if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence, re.I) for word in listword):
... print 'yes'
...
yes
As people have pointed out, you can use regular expressions to split your string into a list of words. This is known as tokenization.
If regular expressions aren't working well enough for you, then I suggest having a look at NLTK -- a Python natural language processing library. It contains a wide range of tokenizers that will split your string based on whitespace, punctuation, and other features that may be too tricky to capture with a regex.
Example:
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> "buy" in wordpunct_tokenize(s)
True
This is simple, and will not work if the sentence contains punctuation such as commas, or if the word appears at the very start or end of the sentence, but still:
if any(" {0} ".format(a) in sentence for a in listword):
>>> sentence="photos"
>>> listword=["os","slow"]
>>> pat = r'|'.join(r'\b{0}\b'.format(re.escape(x)) for x in listword)
>>> bool(re.search(pat, sentence))
False
>>> listword=["os","slow", "photos"]
>>> pat = r'|'.join(r'\b{0}\b'.format(re.escape(x)) for x in listword)
>>> bool(re.search(pat, sentence))
True
While I especially like the tokenizer and the regular expression solutions, I do believe they are kind of overkill for this kind of situation, which can be solved effectively with plain string operations, by checking membership in the sentence's list of words:
listword = ['os', 'slow']
sentence = 'photos'
for word in listword:
    if word in sentence.split():
        print 'yes'
Although this might not be the most elegant solution (it only handles words separated by whitespace), it still is (in my opinion) the most suitable solution for people who have just started fiddling with the language.
