Properly check if word is in string? - python

Say, for example, I want to check if the word test is in a string. Normally, I'd just:
if 'test' in theString:
But I want to make sure it's the actual word, not just a substring. For example, test in "It was detestable" would yield a false positive. I could check to make sure it contains (\s)test(\s) (spaces before and after), but then "...prepare for the test!" would yield a false negative. It seems my only other option is this:
if ' test ' in theString or ' test.' in theString or ' test!' in theString or.....
Is there a way to do this properly, something like if 'test'.asword in theString?

import re
if re.search(r'\btest\b', theString):
    pass
This will look for word boundaries on either end of test. From the docs, \b:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
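If the word to look for comes from a variable, a small helper may be handy. This is only a sketch: the name contains_word is made up for illustration, and it assumes the word starts and ends with a word character (otherwise \b will not behave as intended). re.escape is used so that metacharacters in the word are taken literally.

```python
import re

def contains_word(word, text):
    """Return True if `word` occurs in `text` as a whole word."""
    # re.escape keeps metacharacters in `word` from being read as regex syntax.
    # \b on both sides requires a word boundary before and after the word.
    return re.search(r'\b' + re.escape(word) + r'\b', text) is not None

print(contains_word('test', 'It was detestable'))         # False
print(contains_word('test', '...prepare for the test!'))  # True
```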

Related

Regex to replace characters unless they're inside of a word?

How do I replace a set of characters in a string unless they're part of a word? For example, if I have the text "ur the wurst person ur", I want to replace "ur" with "youre". So the final text would be "youre the wurst person youre". I don't want the "ur" inside of wurst to be changed because it's inside of a word. Is there a generic regex way to do this in python? I don't want to have to worry if "ur" has a space before or after, etc., only if it's part of another word. Thanks!
What I've tried so far is a simple
result = re.sub("ur", "youre", text)
but this also replaces the "ur" inside of "wurst". If I use the word boundaries as in
result = re.sub(r"\bur\b", "youre", text)
it will miss the last occurrence of "ur" in the string.
Without using regular expressions...
You could split the string at each space with string.split() and then, in a list comprehension, replace words 'ur' with 'youre'. This may look something like:
s = "ur the wurst person ur"
result = " ".join(['youre' if w == 'ur' else w for w in s.split()])
Hope this helps!
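One thing to keep in mind with the split approach: it only replaces tokens that are exactly 'ur', so punctuation attached to the word prevents the replacement, and joining on a single space collapses any longer runs of whitespace. A quick comparison on a made-up sentence (the sample string is an assumption, not from the question):

```python
import re

s = "ur the wurst person, ur!"

# Split-based replacement: the token "ur!" is not equal to "ur", so it is left alone.
by_split = " ".join('youre' if w == 'ur' else w for w in s.split())

# Word-boundary regex: "!" is not a word character, so the final "ur" still matches.
by_regex = re.sub(r'\bur\b', 'youre', s)

print(by_split)  # youre the wurst person, ur!
print(by_regex)  # youre the wurst person, youre!
```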
result = re.sub(r'\bur\b', r'youre', "ur the wurst person ur")
From the Python documentation:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
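The quoted examples can be checked directly. The sketch below also shows why the r prefix matters: in a normal Python string literal, "\b" is a backspace character, so a pattern written without r never reaches the regex engine as a word boundary. Whether that was the cause of the missed match in the question is only a guess.

```python
import re

text = "ur the wurst person ur"

# Raw string: \b reaches the regex engine as a word-boundary assertion.
with_raw = re.sub(r"\bur\b", "youre", text)
print(with_raw)  # youre the wurst person youre

# Non-raw string: Python turns "\b" into a backspace character before
# the regex engine sees it, so the pattern matches nothing here.
without_raw = re.sub("\bur\b", "youre", text)
print(without_raw)  # ur the wurst person ur

# The documentation's examples for r'\bfoo\b'.
candidates = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
matches = [bool(re.search(r'\bfoo\b', c)) for c in candidates]
print(matches)  # [True, True, True, True, False, False]
```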

Matching an apostrophe only within a word or string

I'm looking for a Python regex that can match 'didn't' and returns only the character that is immediately preceded by an apostrophe, like 't, but not the 'd or t' at the beginning and end.
I have tried (?=.*\w)^(\w|')+$ but it only matches the apostrophe at the beginning.
Some more examples:
'I'm' should only match 'm and not 'I
'Erick's' should only return 's and not 'E
The text will always start and end with an apostrophe and can include apostrophes within the text.
To match an apostrophe inside a whole string, match it anywhere except at the start/end of the string:
(?!^)'(?!$)
See the regex demo.
Often, the apostrophe is searched for only inside a word (in fact, inside a pair of words where the second one is shortened). In that case you may use
\b'\b
See this regex demo. Here, the ' is preceded and followed by a word boundary, so the ' must have a word character (a letter, digit or _) on each side. Yes, _ and digits are allowed on both sides.
If you need to match a ' only between two letters, use
(?<=[A-Za-z])'(?=[A-Za-z]) # ASCII only
(?<=[^\W\d_])'(?=[^\W\d_]) # Any Unicode letters
See this regex demo.
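To see how the three variants differ, here is a rough comparison that lists the index of each apostrophe they match. The sample string is invented for the purpose (it puts digits and an underscore next to apostrophes), and the dictionary labels are just descriptive names.

```python
import re

# Hypothetical sample, chosen to include digits and an underscore next to apostrophes.
sample = "'didn't' it's 5'6 a_'b"

patterns = {
    "not at string start/end": r"(?!^)'(?!$)",
    "between word characters": r"\b'\b",
    "between letters only": r"(?<=[^\W\d_])'(?=[^\W\d_])",
}

# Collect the index of every apostrophe each pattern matches.
positions = {
    name: [m.start() for m in re.finditer(pattern, sample)]
    for name, pattern in patterns.items()
}

for name, found in positions.items():
    print(name, found)
# not at string start/end [5, 7, 11, 15, 20]
# between word characters [5, 11, 15, 20]
# between letters only [5, 11]
```

Note that the first pattern still matches the closing quote of 'didn't' (index 7), because that quote is not at the end of the whole string.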
As for this current question, here is a bunch of possible solutions:
import re
s = "'didn't'"
print(s.strip("'")[s.strip("'").find("'")+1])
print(re.search(r'\b\'(\w)', s).group(1))
print(re.search(r'\b\'([^\W\d_])', s).group(1))
print(re.search(r'\b\'([a-z])', s, flags=re.I).group(1))
print(re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I))
The s.strip("'")[s.strip("'").find("'")+1] gets the character after the first ' after stripping the leading/trailing apostrophes.
The re.search(r'\b\'(\w)', s).group(1) solution gets the word (i.e. [a-zA-Z0-9_], can be adjusted from here) char after a ' that is preceded with a word char (due to the \b word boundary).
The re.search(r'\b\'([^\W\d_])', s).group(1) is almost identical to the above solution, it only fetches a letter character as [^\W\d_] matches any char other than a non-word, digit and _.
Note that the re.search(r'\b\'([a-z])', s, flags=re.I).group(1) solution is next to identical to the above one, but you cannot make it Unicode aware with re.UNICODE.
The last re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I) just shows how to fetch multiple letter chars from a string input.

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
What this means is: find a word character (alphanumeric or underscore) at least once, as indicated by the '+', then find a '-', followed by another word character at least once, again as indicated by the '+'.
str is a built-in name, so it is better not to use it as a variable name.
import re
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b", st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to rule out words with a hyphen at the beginning or end. A string like "-not-wanted-" will match both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
regex101 demo
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
ideone demo
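For comparison, this is roughly what the word-boundary pattern and the lookaround pattern return on that same text. The variable names naive and strict are only labels for this sketch.

```python
import re

text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"

# Word boundaries alone: hyphens at the edges are ignored and
# a word with two hyphens is cut off after the first pair.
naive = re.findall(r'\b\w+-\w+\b', text)

# Lookarounds plus a repeated "-\w+" group: edge hyphens disqualify the
# word, and any number of inner hyphens is allowed.
strict = re.findall(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])', text)

print(naive)   # ['semi-column', 'not-wanted', 'one-word', 'dont-match', 'multi-hyphenated']
print(strict)  # ['semi-column', 'one-word', 'multi-hyphenated-word']
```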
You can also use the following regex:
>>> import re
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+\-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, I match until there is a space in either direction from the hyphen. I also check whether the words are surrounded by hyphens (e.g. -test-cats-), and if they are, I make sure not to include those hyphens. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
# [^ -] matches any character that is neither a space nor a hyphen
m = re.search(r'([^ -]+-[^ -]+)', st)
if m:
    print(m.group(1))

Trying to parse a string to two separate strings based on case

I'm currently working on a python bot which retrieves information from a meta block on an HTML page. I get the content of the meta block, and now I am stuck on trying to parse it to two different strings.
An example of the content would be:
Lowercase Words WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS
So far I have:
lowercase = ' '.join(w for w in content.split() if (not w.isupper()) and (not w.isdigit()))
uppercase = ' '.join(w for w in content.split() if (w.isupper() or w.isdigit()))
where the uppercase string is meant to contain everything that isn't the words "Lowercase" or "Words"
I have not been able to find much help with this sort of issue, and was wondering if anyone would know of a trick or work around? Thanks
Why not use regular expressions?
import re
s = "Lowercase Words WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS"
match = re.match(r"(([^\s]*[a-z]+[^\s]*\s+)+)([^a-z]+)", s)
if match:
    lowercase = match.group(1)
    uppercase = match.group(3)
This will match a single-line string beginning with one or more words, each of which must contain at least one lower-case letter (a-z). Note that camel case is also recognized as a lower-case string (e.g. "LowerCase"). The second part will then match the rest of the string, which must not contain any lower-case letters.
Let's try to understand the regexp now:
We want to match lower-case words, so we write: [a-z]+
But this will only match words that are made up entirely of lower-case letters. We want to allow other characters as well and treat a word as lower case if it contains at least one lower-case character. [^\s] will match any character that is not whitespace (our word delimiter). We combine both patterns like this: [^\s]*[a-z]+[^\s]*
This matches any number of non-whitespace characters (even zero), followed by lower-case characters, followed by any sequence of non-whitespace characters again. So this basically means we match any sequence that contains no whitespace and at least one lower-case letter.
Now we make a sequence of such words, delimited by whitespace: ([^\s]*[a-z]+[^\s]*\s+)+
Matching the upper case part is pretty straight-forward, because we only need to match everything (including whitespace) that is not a lower-case character: [^a-z]+
To make the matches of both patterns available through groups, we wrap them in parentheses again:
lowercase: (([^\s]*[a-z]+[^\s]*\s+)+)
uppercase: ([^a-z]+)
Perhaps you need to adjust the pattern further, to suit your needs, but I believe this should be a good starting point...
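As one possible adjustment, the pattern can be wrapped in a function. This is a sketch, not part of the original answer: split_by_case is a hypothetical name, \S is used as a shorthand for [^\s], and the inner group is made non-capturing so that the two parts come back as groups 1 and 2. The first group ends with the separating whitespace, so it is stripped before returning.

```python
import re

# Same idea as above: leading words containing a lower-case letter,
# then a remainder without any lower-case letters.
PATTERN = re.compile(r"((?:\S*[a-z]+\S*\s+)+)([^a-z]+)")

def split_by_case(content):
    """Return (lower-case part, remainder), or None if there is no match."""
    match = PATTERN.match(content)
    if match is None:
        return None
    # Group 1 still carries the whitespace that separates the two parts.
    return match.group(1).rstrip(), match.group(2)

s = "Lowercase Words WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS"
print(split_by_case(s))
```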
Something like this?
>>> from string import punctuation as punc
>>> def ispunc(strs):
...     return all(x in punc for x in strs)
...
>>> strs = "Lowercase Words WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS"
>>> ' '.join(w for w in strs.split() if (w.isupper() or w.isdigit() or ispunc(w)))
"WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS"
>>> ' '.join(w for w in strs.split() if (not w.isupper()) and (not w.isdigit() and not ispunc(w)))
'Lowercase Words'

How to remove special characters from the end of every word in a string?

I want it to match only at the end of every word.
example:
"i am test-ing., i am test.ing-, i am_, test_ing,"
output should be:
"i am test-ing i am test.ing i am test_ing"
>>> import re
>>> test = "i am test-ing., i am test.ing-, i am_, test_ing,"
>>> re.sub(r'([^\w\s]|_)+(?=\s|$)', '', test)
'i am test-ing i am test.ing i am test_ing'
This matches one or more characters that are neither alphanumeric nor whitespace, (([^\w\s]|_)+), followed by either whitespace (\s) or the end of the string ($). The (?= ) construct is a lookahead assertion: it makes sure that the following whitespace is not included in the match, so it doesn't get replaced; only the ([^\w\s]|_)+ part gets replaced.
Okay, but why [^\w\s]|_, you ask? The first part, [^\w\s], matches anything that is neither a word character (alphanumeric or underscore, \w) nor whitespace (\s), i.e. punctuation characters. Except we do want to eliminate underscores, so we then include those with |_.
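If this is needed in more than one place, the pattern could be compiled once and wrapped in a function. A minimal sketch, where the names TRAILING_PUNCT and strip_trailing_punctuation are invented here and the group is made non-capturing since its contents are not used:

```python
import re

# Punctuation or underscores, one or more, directly before whitespace or the end.
TRAILING_PUNCT = re.compile(r'(?:[^\w\s]|_)+(?=\s|$)')

def strip_trailing_punctuation(text):
    """Remove punctuation and underscores from the end of every word."""
    return TRAILING_PUNCT.sub('', text)

cleaned = strip_trailing_punctuation("i am test-ing., i am test.ing-, i am_, test_ing,")
print(cleaned)  # i am test-ing i am test.ing i am test_ing
```

A token made up only of punctuation is removed entirely, so "a -- b" becomes "a  b" with two spaces; collapsing the extra whitespace would be a separate step.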
