Finding whether a word exists in a string in Python - python

Here is the piece of code I need help with.
listword=["os","slow"]
sentence="photos"
if any(word in sentence for word in listword):
    print "yes"
It prints yes because os is present in photos.
But I want to know whether os is present in the string as a whole "word", not as part of another word. Is there any way to do this without converting the sentence into a list of words? Basically, I don't want the program to print yes here. It should print yes only if the string contains the word os.
Thanks

You'd need to use regular expressions, and add \b word boundary anchors around each word when matching:
import re
if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence) for word in listword):
    print 'yes'
The \b boundary anchor matches at string start and end points, and anywhere there is a transition between word and non-word characters (so between a space and a letter or digit, or between punctuation and a letter or digit).
The re.escape() function ensures that all regular expression metacharacters are escaped and we match on the literal contents of word and not accidentally interpret anything in there as an expression.
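To make the effect of re.escape() concrete, here is a minimal sketch. The word "3.5" and the sample sentence are made up for illustration and are not from the question.

```python
import re

# Illustrative inputs (assumptions, not from the question).
sentence = "version 3x5 was released"
word = "3.5"

# Unescaped, the dot matches any character, so "3x5" is a false positive.
unescaped = re.search(r'\b{}\b'.format(word), sentence)

# Escaped, the dot only matches a literal ".", so nothing is found.
escaped = re.search(r'\b{}\b'.format(re.escape(word)), sentence)

print(bool(unescaped))  # True
print(bool(escaped))    # False
```

One caveat: \b needs a word character on one side of the position, so a search word that starts or ends with punctuation (for example "c++") may not match the way you expect even after escaping.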
Demo:
>>> listword = ['foo', 'bar', 'baz']
>>> sentence = 'The quick fox jumped over the barred door'
>>> if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence) for word in listword):
...     print 'yes'
...
>>> sentence = 'The tradition to use fake names like foo, bar or baz originated at MIT'
>>> if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence) for word in listword):
...     print 'yes'
...
yes
By using a regular expression, you now can match case-insensitively as well:
if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence, re.I) for word in listword):
    print 'yes'
In this demo both the and mit qualify even though the case in the sentence differs:
>>> listword = ['the', 'mit']
>>> if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence, re.I) for word in listword):
...     print 'yes'
...
yes

As people have pointed out, you can use regular expressions to split your string into a list of words. This is known as tokenization.
If regular expressions aren't working well enough for you, then I suggest having a look at NLTK, a Python natural language processing library. It contains a wide range of tokenizers that will split your string based on whitespace, punctuation, and other features that may be too tricky to capture with a regex.
Example:
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> "buy" in wordpunct_tokenize(s)
True
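If installing NLTK is not an option, a rough standard-library approximation may be enough for simple text. The helper name simple_tokenize and its pattern are assumptions, chosen to mimic the wordpunct_tokenize output shown above. This is not NLTK's own code.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: runs of word characters, or runs of punctuation.
    return re.findall(r'\w+|[^\w\s]+', text)

tokens = simple_tokenize("Good muffins cost $3.88 in New York.")
print(tokens)

# Membership on the token list is a whole-word test.
print("buy" in simple_tokenize("Please buy me two of them."))  # True
print("os" in simple_tokenize("photos"))                       # False
```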

This is simple, and it will not work if the sentence contains commas or if the word sits at the very start or end of the string, but still:
if any(" {0} ".format(a) in sentence for a in listword):
    print "yes"

>>> sentence="photos"
>>> listword=["os","slow"]
>>> pat = r'|'.join(r'\b{0}\b'.format(re.escape(x)) for x in listword)
>>> bool(re.search(pat, sentence))
False
>>> listword=["os","slow", "photos"]
>>> pat = r'|'.join(r'\b{0}\b'.format(re.escape(x)) for x in listword)
>>> bool(re.search(pat, sentence))
True
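The session above joins one \b...\b alternative per word into a single pattern. A possible variation is to wrap the alternatives in one non-capturing group and compile the pattern once, so it can be reused across many sentences. The helper name build_word_pattern is hypothetical.

```python
import re

def build_word_pattern(words, flags=0):
    # Hypothetical helper: one compiled pattern for a whole word list.
    # Assumes `words` is non-empty.
    alternatives = '|'.join(re.escape(w) for w in words)
    return re.compile(r'\b(?:{0})\b'.format(alternatives), flags)

pattern = build_word_pattern(['os', 'slow'], re.I)

print(bool(pattern.search('photos')))         # False
print(bool(pattern.search('My OS is slow')))  # True
```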

While I especially like the tokenizer and the regular expression solutions, I do believe they are kind of overkill for this kind of situation, which can be very effectively solved just by using the str.find() method.
listword = ['os', 'slow']
sentence = 'photos'
for word in listword:
    if sentence.find(word) != -1:
        print 'yes'
Although this might not be the most elegant solution, it still is (in my opinion) the most suitable solution for people that just started out fiddling with the language.
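Note that str.find() on its own is a substring test, so the loop above still prints yes for os in photos. If you want to stay regex-free and still require a whole word, one option is to check the characters on either side of each hit. This is only a sketch. The function name has_whole_word is made up, and unlike \b it treats an underscore as a separator.

```python
def has_whole_word(sentence, word):
    # Hypothetical helper: scan every occurrence found by str.find().
    start = sentence.find(word)
    while start != -1:
        end = start + len(word)
        # A hit counts only if it is not glued to a letter or digit.
        before_ok = start == 0 or not sentence[start - 1].isalnum()
        after_ok = end == len(sentence) or not sentence[end].isalnum()
        if before_ok and after_ok:
            return True
        start = sentence.find(word, start + 1)
    return False

print(has_whole_word('photos', 'os'))         # False
print(has_whole_word('my os is slow', 'os'))  # True
```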

Related

Capture words that do not have hyphen in it - Python

For example we have this text:
Hello but I don't want1 this non-object word in it.
Using a regular expression, how can I extract words that must start with a letter and that only have letters or numbers in them? In this example I only want:
Hello but I want1 this word in it
Any help would be appreciated! Thanks!
You can use lookarounds in your regex:
>>> str = "Hello but I don't want1 this non-object word in it."
>>> print re.findall(r'(?:(?<=\s)|(?<=^))\w+(?=[.\s]|$)', str)
['Hello', 'but', 'I', 'want1', 'this', 'word', 'in', 'it']
"extract words that must start with a letter and that only have letters or numbers in it"
An alternative solution using the re.sub function (from the re module):
import re
s = "Hello but I don't want this non-object word in it."
s = re.sub(r'\s?\b[a-zA-Z]+?[^\w ][\w]+?\b', '', s)
print(s)
The output:
Hello but I want this word in it.
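Another way to read the requirement (start with a letter, then only letters or digits) is to test each whitespace-separated token against an anchored pattern. The sketch below assumes that trailing sentence punctuation such as the final "." should be ignored. The helper name plain_words is made up.

```python
import re

# A letter first, then only letters or digits, up to the end of the token.
WORD = re.compile(r'[A-Za-z][A-Za-z0-9]*\Z')

def plain_words(text):
    words = []
    for token in text.split():
        # Assumption: trailing sentence punctuation is not part of the word.
        token = token.rstrip('.,;:!?')
        if WORD.match(token):
            words.append(token)
    return words

print(plain_words("Hello but I don't want1 this non-object word in it."))
# ['Hello', 'but', 'I', 'want1', 'this', 'word', 'in', 'it']
```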

Regex parsing text and get relevant words / characters

I want to parse a file, that contains some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like (., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other problem is that my regex doesn't match the ending ];, which I also need.
Why use \W+ (one or more) if you want single non-word characters in the output? You can also exclude whitespace by using a negated character class, and it seems you could drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
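Applied to the sample text from the question, the pattern can be tried like this. The variable name source is only illustrative.

```python
import re

source = "the quick brown(fox).jumps(over + the) = lazy[dog];"

# Words stay whole; every other non-whitespace character is its own token.
tokens = re.findall(r"\w+|[^\w\s]", source)
print(tokens)
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```

The whitespace is gone, ). is split into two tokens, and the trailing ]; is picked up.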

Extract text after specific character

I need to extract the word after the #
How can I do that? What I am trying:
text="Hello there #bob !"
user=text[text.find("#")+1:]
print user
output:
bob !
But the correct output should be:
bob
A regex solution for fun:
>>> import re
>>> re.findall(r'#(\w+)', '#Hello there #bob #!')
['Hello', 'bob']
>>> re.findall(r'#(\w+)', 'Hello there bob !')
[]
>>> (re.findall(r'#(\w+)', 'Hello there #bob !') or [None])[0]
'bob'
>>> print (re.findall(r'#(\w+)', 'Hello there bob !') or [None])[0]
None
The regex above picks up runs of one or more word characters (letters, digits, or underscores) following a '#' character, stopping at the first character that is not one.
Here's a regex solution to match one or more non-whitespace characters if you want to capture a broader range of substrings:
>>> re.findall(r'#(\S+)', '#Hello there #bob #!')
['Hello', 'bob', '!']
Note that when the above regex encounters a string like #xyz#abc it will capture xyz#abc in one result instead of xyz and abc separately. To fix that, you can use the negated \s character class while also negating # characters:
>>> re.findall(r'#([^\s#]+)', '#xyz#abc some other stuff')
['xyz', 'abc']
And here's a regex solution that matches one or more alphabetic characters only, in case you don't want any numbers or anything else:
>>> re.findall(r'#([A-Za-z]+)', '#Hello there #bobv2.0 #!')
['Hello', 'bobv']
So you want the word starting after # up to a whitespace?
user=text[text.find("#")+1:].split()[0]
print(user)
bob
EDIT: as #bgstech notes, if the string might not contain a "#", check for it first:
if "#" in text:
    user=text[text.find("#")+1:].split()[0]
else:
    user="something_else_appropriate"
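Both approaches can be folded into one small function that also copes with a missing "#". This sketch uses re.search with the same #(\w+) pattern as the regex answer. The function name first_hashtag and its default parameter are made up for illustration.

```python
import re

def first_hashtag(text, default=None):
    # Hypothetical helper: the word after the first "#", or `default`.
    match = re.search(r'#(\w+)', text)
    return match.group(1) if match else default

print(first_hashtag("Hello there #bob !"))  # bob
print(first_hashtag("Hello there bob !"))   # None
```

A lone "#" with no word character after it also falls back to the default instead of raising an error.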

Create a python regular expression for finding 1 to 4 chars uppercase

I already have the following regular expression:
'([A-Z0-9]{1,4}(?![A-Z0-9]))'
that meets the following requirements.
1-4 Characters in length
All Uppercase
Can be a mix of numbers and letters
Now say I have this string: "This is A test of a TREE, HOUSE"
result = ['T', 'A', 'TREE']
I don't want the first 'T' because it is not on its own; it is part of a word.
How would I go about modifying the re search to account for this?
Thanks
[Edit: Spelling]
You can use word boundaries \b around your pattern.
>>> import re
>>> s = 'This is A test of a TREE, HOUSE'
>>> re.findall(r'\b[A-Z0-9]{1,4}\b', s)
['A', 'TREE']

Stripping punctuation from unique strings in an input file

This question ( Best way to strip punctuation from a string in Python ) deals with stripping punctuation from an individual string. However, I'm hoping to read text from an input file, but only print out ONE COPY of all strings without ending punctuation. I have started something like this:
f = open('#file name ...', 'a+')
for x in set(f.read().split()):
    print x
But the problem is that if the input file has, for instance, this line:
This is not is, clearly is: weird
It treats the three different cases of "is" differently, but I want to ignore any punctuation and have it print "is" only once, rather than three times. How do I remove any kind of ending punctuation and then put the resulting string in the set?
Thanks for any help. (I am really new to Python.)
import re
for x in set(re.findall(r'\b\w+\b', f.read())):
should distinguish words more reliably.
This regular expression finds compact groups of alphanumerical characters (a-z, A-Z, 0-9, _).
If you want to find letters only (no digits and no underscore), then replace the \w with [a-zA-Z].
>>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
You can use translation tables if you don't mind replacing your punctuation characters with white space, e.g.
>>> from string import maketrans
>>> punctuation = ",;.:"
>>> replacement = " " * len(punctuation)
>>> trans_table = maketrans(punctuation, replacement)
>>> 'This is not is, clearly is: weird'.translate(trans_table)
'This is not is  clearly is  weird'
# And for your case of creating a set of unique words.
>>> set('This is not is clearly is weird'.split())
set(['This', 'not', 'is', 'clearly', 'weird'])
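Note that string.maketrans only exists in Python 2. In Python 3 the equivalent is str.maketrans. If only trailing punctuation should go, as the question asks, a version that runs on both may be simpler. This is a sketch, and the function name unique_words is hypothetical.

```python
import string

def unique_words(text):
    # Strip punctuation from the end of each whitespace-separated token.
    stripped = (token.rstrip(string.punctuation) for token in text.split())
    # Drop tokens that were punctuation only, and de-duplicate with a set.
    return {token for token in stripped if token}

words = unique_words("This is not is, clearly is: weird")
print(sorted(words))  # ['This', 'clearly', 'is', 'not', 'weird']
```

To use it on a file, pass the result of f.read() as text.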
