Matching a string Python - python

So basically I want a program that will only work if the user types something like "I am sick" or "I am too cool" but will not work if they make a typo like "pi am cool".
Here's what I have so far:
import re
text = input("text here: ")
if re.search("i am", text) is not None:
    print("correct")
So basically, my problem is that if someone types "Pi am cool", the program currently treats it as correct. I don't want that: "i am" has to appear as separate words. Since I am creating an AI bot for a school project, the sentence could be "man, I am so cool" and it should still be picked up and print correct, but if it is typed as "Man, TI am so cool" with a typo, I don't want it to print correct.

Use \b word boundary anchors:
if re.search(r"\bi am\b", text) is not None:
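\b matches the positions in the text where a non-word character is followed by a word character, or vice versa: for example, between a space and a letter, or between a letter and a space. It also matches at the start or end of the string when a word character is there.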
Because \b in a regular python string is interpreted as a backspace character, you need to either use a raw string literal (r'...') or escape the backslash ("\\bi am\\b").
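A quick check of that difference, as a small sketch (the variable names and the sample text are only for illustration):

```python
import re

plain = "\bi am\b"       # here "\b" is the backspace character (U+0008)
raw = r"\bi am\b"        # backslash + "b" reaches the regex engine intact
escaped = "\\bi am\\b"   # same pattern, written with escaped backslashes

print(len(plain), len(raw))    # the plain string is two characters shorter
print(raw == escaped)          # both spell the same pattern

text = "I think i am cool"
print(re.search(plain, text))  # None: the text contains no backspace characters
print(re.search(raw, text).group())
```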
You may also want to add the re.IGNORECASE flag (re.I for short) to your search to find both lower and uppercase text:
if re.search(r"\bi am\b", text, re.IGNORECASE) is not None:
Demo:
>>> re.search(r"\bi am\b", 'I am so cool!', re.I).group()
'I am'
>>> re.search(r"\bi am\b", 'WII am so cool!', re.I) is None
True
>>> re.search(r"\bi am\b", 'I ammity so cool!', re.I) is None
True
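Wrapped up for the bot, this could look roughly like the sketch below. The helper name says_i_am is made up for illustration; the pattern and flag are the ones shown above, and the samples come from the question.

```python
import re

# Hypothetical helper; the pattern is the one from the answer above.
I_AM = re.compile(r"\bi am\b", re.IGNORECASE)

def says_i_am(text):
    """Return True if "i am" occurs as whole words anywhere in text."""
    return I_AM.search(text) is not None

for sample in ["I am sick", "man, I am so cool", "pi am cool", "Man, TI am so cool"]:
    print(repr(sample), says_i_am(sample))
```

Compiling the pattern once is optional; it just avoids repeating the pattern and flag at every call site.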

Related

Replace every instance of a word with another word without breaking other words containing that word

Word, word, word... Sorry for the title.
Let's say I want to replace every instance of "yes" to "no" in a string. I can just use string.replace(). But then there's this problem:
string = "yes eyes yesterday yes"
new_str = string.replace("yes", "no")
# new_str -> "no eno noterday no"
How can I preserve "eyes" and "yesterday" as they are, while changing "yes" to "no"?
You can use re here.
import re
re.sub(r'\byes\b', 'no', "yes eyes yesterday yes")
# 'no eyes yesterday no'
From docs:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
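The cases listed in that quote can be checked directly. A small sketch; the extra 'foo_bar' case is my addition, included because the underscore also counts as a word character:

```python
import re

pattern = re.compile(r"\bfoo\b")

candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3", "foo_bar"]
results = {c: bool(pattern.search(c)) for c in candidates}

for candidate, matched in results.items():
    print(f"{candidate!r:15} {matched}")
```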
" ".join(["no" if word=="yes" else word for word in string.split()])
'no eyes yesterday no'
The explanation:
First, break the string into a list of individual words:
string.split()
['yes', 'eyes', 'yesterday', 'yes']
Then iterate over this list of individual words and use the expression
"no" if word=="yes" else word
to replace every "yes" with "no" in a list comprehension
["no" if word=="yes" else word for word in string.split()]
['no', 'eyes', 'yesterday', 'no']
Finally, return this changed list back to a string with the .join() method of the string " " (the delimiter).
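One caveat with the split/join approach, shown as a small sketch: the comparison is made against whitespace-separated chunks, so a "yes" with punctuation attached is left alone, and runs of whitespace collapse to single spaces. The sample string is made up for illustration.

```python
import re

string = "yes, eyes  yesterday yes"   # note the comma and the double space

by_split = " ".join("no" if word == "yes" else word for word in string.split())
by_regex = re.sub(r"\byes\b", "no", string)

print(by_split)   # yes, eyes yesterday no
print(by_regex)   # no, eyes  yesterday no
```

Which behaviour is preferable depends on the input; for plain space-separated words both give the same result.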
Try this:
import re
string = "yes eyes yesterday yes"
new_str = re.sub(r"\byes\b", "no", string)
Output:
no eyes yesterday no
If you use regex, you can specify word boundaries with \b:
import re
sentence = 'yes no yesyes'
sentence = re.sub(r'\byes\b', 'no', sentence)
print(sentence)
Output:
no no yesyes
Notice that 'yesyes' is not changed (to 'no').
You can read more in the documentation for Python's re module.
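If the word to replace comes from a variable, a generalized version might look like this. It is only a sketch: replace_word is a hypothetical helper, and it assumes the word starts and ends with a word character (otherwise \b does not behave as intended).

```python
import re

def replace_word(text, old, new):
    """Replace old with new only where old stands alone as a word."""
    # re.escape guards against regex metacharacters in old.
    pattern = r"\b" + re.escape(old) + r"\b"
    # A function as the replacement keeps any backslashes in new literal.
    return re.sub(pattern, lambda match: new, text)

print(replace_word("yes eyes yesterday yes", "yes", "no"))
print(replace_word("yes no yesyes", "yes", "no"))
```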

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove apostrophes that don't follow a letter, then remove apostrophes that don't precede a letter (thus keeping only the ones that both follow and precede a letter):
import re
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\'  match any quote not preceded by a word character
# |          or
# \'(?!\w)   match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s))  # here is some stuff now there are quotes now there's not
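To illustrate the whitespace claim, here is a small comparison with the split/strip/join approach from the earlier answer. The helper name and the sample string are made up for illustration.

```python
import re

QUOTE_AT_EDGE = re.compile(r"(?<!\w)'|'(?!\w)")

def drop_edge_quotes(text):
    """Remove single quotes that are not between two word characters."""
    return QUOTE_AT_EDGE.sub("", text)

sample = "'tabs\tand  double spaces' aren't touched"

with_lookarounds = drop_edge_quotes(sample)
with_split_join = " ".join(w.strip("'") for w in sample.split())

print(repr(with_lookarounds))  # the tab and the double space survive
print(repr(with_split_join))   # all whitespace becomes single spaces
```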
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched unless it is both preceded and followed by a letter, digit or _.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is matched unless it is both preceded and followed by a letter (ASCII or any).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    ok = 0
    for x in text:
        ok += x.isalnum()
    return ok > 0
def stripit(text, ofwhat):
    # Strip each unwanted character, in order, from both ends of the word
    for x in ofwhat:
        text = text.strip(x)
    return text
def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    text = text.splitlines()
    text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
    return "\n".join(text)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also normalizes whitespace. Runs of whitespace within a line collapse to a single space, and leading or trailing whitespace is removed.

Lowercase letter after certain character?

I like some ways of how string.capwords() behaves, and some ways of how .title() behaves, but not one single one.
I need abbreviations capitalized, which .title() does, but not string.capwords(), and string.capwords() does not capitalize letters after single quotes, so I need a combination of the two. I want to use .title(), and then I need to lowercase the single letter after an apostrophe only if there are no spaces between.
For example, here's a user's input:
string="it's e.t.!"
And I want to convert it to:
>>> "It's E.T.!"
.title() would capitalize the 's', and string.capwords() would not capitalize the "e.t.".
You can use regular expression substitution (See re.sub):
>>> s = "it's e.t.!"
>>> import re
>>> re.sub(r"\b(?<!')[a-z]", lambda m: m.group().upper(), s)
"It's E.T.!"
[a-z] matches a lowercase ASCII letter, but not directly after ' (that is the (?<!') negative look-behind assertion). The letter must also come right after a word boundary, so the t in it's is not matched.
The second argument to re.sub is a function (here a lambda) that receives each match and returns the replacement string, in this case the uppercase version of the letter.
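For comparison, here is the same substitution next to .title() and string.capwords() on the question's input. The function name is hypothetical; the pattern is the one from the answer.

```python
import re
import string

def title_keep_apostrophe(text):
    """Uppercase the first letter of each word, except a letter right after an apostrophe."""
    return re.sub(r"\b(?<!')[a-z]", lambda m: m.group().upper(), text)

s = "it's e.t.!"
print(s.title())                 # It'S E.T.!
print(string.capwords(s))        # It's E.t.!
print(title_keep_apostrophe(s))  # It's E.T.!
```

One limitation to be aware of: a word that starts with a quote, such as 'hello', stays lowercase, because its first letter also follows an apostrophe.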
def cap_first(word):
    # Uppercase only the first character; str.capitalize() would also lowercase the rest
    return word[:1].upper() + word[1:]
a = ".".join(cap_first(word) for word in "it's e.t.!".split("."))
b = " ".join(cap_first(word) for word in a.split(" "))
print(b)  # It's E.T.!
Note that str.capitalize() is not suitable here, because it lowercases the rest of each piece and gives "It's E.t.!". This solution also doesn't work with other whitespace characters. For that I would go with falsetru's solution.
If you don't want to use regex, you can always use this simple for loop:
s = "it's e.t.!"
capital_s = ''
pos_quote = s.index("'")
for pos, alpha in enumerate(s):
    # Uppercase every character except the two next to the apostrophe
    if pos not in [pos_quote-1, pos_quote+1]:
        alpha = alpha.upper()
    capital_s += alpha
print(capital_s)
Hope this helps :)

how to not remove apostrophe only for some words in text file in python

In a sentence, how can I remove apostrophes, double quotes, commas and so on from all words, except in words like it's, what's, etc.? Also, at the end of the sentence there must be a space between the last word and the full stop.
For example
Input Sentence :
"'This has punctuation, and it's hard to remove. ?"
Desired Output Sentence :
This has punctuation and it's hard to remove .
Use a negative look-behind
(?<!\w)["'?]|,(?= )
Remove the matched ', " and ? characters (and any comma followed by a space) through re.sub.
And your code would be,
>>> import re
>>> s = '\"\'This has punctuation, and it\'s hard to remove. ?\" '
>>> m = re.sub(r'(?<!\w)[\"\'\?]|,(?= )', r'', s)
>>> m
"This has punctuation and it's hard to remove.  "
I propose this code:
import re
sentences = [""""'This has punctuation, and it's hard to remove. ?" """,
"Did you see Cress' haircut?.",
"This 'thing' hasn't a really bad habit, you know?.",
"'I bought this for $30 from Best Buy it's. What a waste of money! The ear gels are 'comfortable at first, but what's after an hour."]
for s in sentences:
    # Remove the specified characters
    new_s = re.sub(r"""["?,$!]|'(?!(?<! ')[ts])""", "", s)
    # Deal with the final dot
    new_s = re.sub(r"\.", " .", new_s)
    print(new_s)
Output:
This has punctuation and it's hard to remove .
Did you see Cress haircut .
This thing hasn't a really bad habit you know .
I bought this for 30 from Best Buy it's . What a waste of money The ear gels are comfortable at first but what's after an hour .
The regex:
["?,$!]       # Match " ? , $ or !
|             # OR
'             # a ' that is not followed by...
(?!
  (?<! ')     # (where the ' itself does not come right after a space)
  [ts]        # ...a t or an s
)
Use this:
(?<![tT](?=.[sS]))["'?:;,.]
If you also want to leave the period at the end of a line (as long as it is preceded by a space):
(?<![tT](?=.[sS]))(?<! (?=.$))["'?:;,.]
My take on this is to remove all punctuation at either end of a word: split the sentence into words (separated by whitespace) and strip any leading or trailing punctuation marks from each word.
>>> import re, string
>>> ''.join(e.strip(string.punctuation) for e in re.split(r"(\s)", st))
"This has punctuation and it's hard to remove "
Use the str.strip(chars) method for the outside quotes, like this:
output = chaine.strip("\"")
Be careful: inside a string literal you have to escape some characters with a backslash, such as the quote character that delimits the literal and the backslash itself. Alternatively, enclose one kind of quote in the other: "'" or '"'.
Edit: hmm, I didn't think about the apostrophes. If the apostrophes are the only problem, you can strip the rest first and then parse the string manually with a for loop: note the index of each apostrophe found and, if it is followed by an 's', leave it. In any case, you have to settle the lexical/semantic rules before coding it.
Edit 2:
If the string is only one sentence, always has a dot at the end, and always needs the space, then use this at the end:
chaine[:-2]+" "+chaine[-2:]
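Pulling the ideas in this thread together, one possible end-to-end sketch for the example sentence is shown below. The function name and the exact set of removed characters are assumptions on my part, not something the question fixes; it also assumes the sentence ends with a full stop once the other punctuation is gone.

```python
import re

def clean_sentence(sentence):
    """Drop quotes, commas and question marks, but keep apostrophes inside words."""
    # Remove " , ? everywhere, and ' unless it sits between two word characters.
    out = re.sub(r"""[",?]|(?<!\w)'|'(?!\w)""", "", sentence)
    # Collapse the whitespace left behind and trim the ends.
    out = " ".join(out.split())
    # Put exactly one space before the final full stop.
    return re.sub(r"\s*\.$", " .", out)

print(clean_sentence("\"'This has punctuation, and it's hard to remove. ?\""))
```

Unlike the answers that whitelist a t or an s after the apostrophe, this keeps any apostrophe between word characters, so it would also keep forms like we're or o'clock.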

Properly check if word is in string?

Say, for example, I want to check if the word test is in a string. Normally, I'd just:
if 'test' in theString:
But I want to make sure it's the actual word, not just the string. For example, test in "It was detestable" would yield a false positive. I could check to make sure it contains (\s)test(\s) (spaces before and after), but then "...prepare for the test!" would yield a false negative. It seems my only other option is this:
if ' test ' in theString or ' test.' in theString or ' test!' in theString or.....
Is there a way to do this properly, something like if 'test'.asword in theString?
import re
if re.search(r'\btest\b', theString):
    pass
This will look for word boundaries on either end of test. From the docs, \b:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
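Something close to the 'test'.asword idea from the question can be sketched as a small helper. The name contains_word is hypothetical, and re.escape is added so that the word is always treated literally.

```python
import re

def contains_word(word, text):
    """Return True if word occurs in text as a whole word."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

print(contains_word("test", "It was detestable"))         # False
print(contains_word("test", "...prepare for the test!"))  # True
print(contains_word("test", "run test_case first"))       # False: _ is a word character
```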
