regex word boundaries+quotes - python

I have the following expression, which should match the entire given word case-insensitively. Quotes count as part of the word, so I check whether the word is preceded or followed by a quote. For example, the word "foo" shouldn't match the text "foo's".
word = "foo"
pattern = re.compile(r'(?<![a-z\'])%s(?![a-z\'])' % word,flags=re.IGNORECASE)
The exception is triple quotes: if the word sits directly next to triple quotes, it should match:
pattern.search("'''foo bar baz'''")
"foo" should be found this time, but it isn't, because the word is preceded by a quote.

((?<![a-z\'\"])|(?<=\'{3}))foo((?![a-z\'\"])|(?=\'{3}))

Use regex (?:(?<=''')|(?<!'))\bfoo\b(?:(?=''')|(?!'))
pattern = re.compile(r'(?:(?<=\'\'\')|(?<!\'))\b%s\b(?:(?=\'\'\')|(?!\'))' % word,flags=re.IGNORECASE)
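A quick check of that pattern against a few inputs might look like the sketch below. The helper name `build_pattern`, the sample strings, and the use of `re.escape` are additions for illustration and are not part of the answer above.

```python
import re

def build_pattern(word):
    # re.escape is an added safeguard (assumption): it keeps any regex
    # metacharacters in the word literal.
    return re.compile(
        r"(?:(?<=''')|(?<!'))\b%s\b(?:(?=''')|(?!'))" % re.escape(word),
        flags=re.IGNORECASE,
    )

pattern = build_pattern("foo")

samples = ["foo's", "'''foo bar baz'''", "a FOO b", "'foo'", "foobar"]
# Map each sample to whether the pattern finds the word in it.
results = {text: bool(pattern.search(text)) for text in samples}
for text, found in results.items():
    print(repr(text), found)
```

Only the triple-quoted sample and the bare "a FOO b" should report a match; a single adjacent quote or a longer word should not.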

Without using lookahead:
>>> pat = r'([\'\"]{3}|\b)foo\1'
>>> m = re.search(pat, 'My """foo""" is rich')
>>> re.search(pat, 'My """foo""" is rich').groups()
('"""',)
>>> re.search(pat, "My '''foo''' is rich").groups()
("'''",)
>>> re.search(pat, 'My """foo"" is rich').groups()
('',)
>>> re.search(pat, 'My """foo\'\'\' is rich').groups()
('',)

Related

How do I remove a string that starts with '#' and ends with a blank character by using regular expressions in Python?

So I have this text:
"#Natalija What a wonderful day, isn't it #Kristina123 ?"
I tried to remove the two substrings that start with the character '#' using re.sub, but it didn't work.
How do I remove a substring that starts with this character?
Try this regex :
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
t = re.sub('#.*? ', '', text)
print(t)
OUTPUT :
What a wonderful day, isn't it ?
This should work.
# matches the character #
\w+ matches one or more word characters, as many as possible, so it stops at the blank character
Code:
import re
regex = r"#\w+"
subst = "XXX"
test_str = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
result = re.sub(regex, subst, test_str)
print (result)
output:
XXX What a wonderful day, isn't it XXX ?
It's possible to do it with re.sub(), it would be something like this:
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
output = re.sub(r'#[a-zA-Z0-9]+\s', '', text)
print(output) # Output: What a wonderful day, isn't it ?
# matches the # character
[a-zA-Z0-9] matches alphanumerical (uppercase and lowercase)
"+" means "one or more" (otherwise it would match only one of those characters)
\s matches whitespaces
Alternatively, this can also be done without using the module re. You can first split the sentence into words. Then remove the words containing the # character and finally join the words into a new sentence.
if __name__ == '__main__':
    original_text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
    individual_words = original_text.split(' ')
    words_without_tags = [word for word in individual_words if '#' not in word]
    new_sentence = ' '.join(words_without_tags)
    print(new_sentence)
I think this would work for you. The pattern #\w+?\s matches expressions that start with #, continue with one or more word characters, and end with a single whitespace character (the whitespace is required, not optional).
import re
string = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
pattern = r'#\w+?\s'
replaced = re.sub(pattern, '', string)
print(replaced)
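One thing the patterns above share is that they need a whitespace character after the tag, so a tag at the very end of the string is left alone. A minimal sketch of a variant that makes the trailing whitespace optional is shown here; the function name `strip_hashtags` and the second sample string are made up for illustration.

```python
import re

text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"

def strip_hashtags(s):
    # '#\w+' matches the tag; '\s*' also swallows any whitespace after it,
    # so a tag at the very end of the string is removed too.
    return re.sub(r"#\w+\s*", "", s)

cleaned = strip_hashtags(text)
at_end = strip_hashtags("see you soon #bye")
# For comparison: with a mandatory '\s' the trailing tag survives.
strict = re.sub(r"#\w+\s", "", "see you soon #bye")

print(repr(cleaned))
print(repr(at_end))
print(repr(strict))
```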

Python Regex replace all newline characters directly followed by a char with char

Example String:
str = "test sdf sfwe \n \na dssdf"
I want to replace the:
\na
with
a
Where 'a' could be any character.
I tried:
str = "test \n \na"
res = re.sub('[\n.]','a',str)
But how can I store the character behind the \n and use it as replacement?
You may use this regex with a capture group:
>>> s = "test sdf sfwe \n \na dssdf"
>>> print(re.sub(r'\n(.)', r'\1', s))
test sdf sfwe  a dssdf
Search regex r'\n(.)' will match \n followed by any character and capture following character in group #1
Replacement r'\1' is back-reference to capture group #1 which is placed back in original string.
Better to avoid str as a variable name, since it shadows Python's built-in str type.
If by any character you meant any non-space character then use this regex with use of \S (non-whitespace character) instead of .:
>>> print(re.sub(r'\n(\S)', r'\1', s))
test sdf sfwe
a dssdf
This lookahead-based approach also works and doesn't need a capture group:
>>> print(re.sub(r'\n(?=\S)', '', s))
test sdf sfwe
a dssdf
Note that [\n.] matches a single character that is either \n or a literal dot, not \n followed by any character.
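The difference between the character class and the capture group can be seen side by side in a small sketch. The short sample string "x.\ny" is made up for illustration; the longer one is the string from the question.

```python
import re

s = "test sdf sfwe \n \na dssdf"

# Character class: each newline OR literal dot is replaced on its own.
class_result = re.sub(r"[\n.]", "a", "x.\ny")

# Capture group: the newline is dropped and the character after it is kept.
group_result = re.sub(r"\n(.)", r"\1", s)

print(repr(class_result))
print(repr(group_result))
```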
Find all the matches:
matches = re.findall( r'\n\w', str )
Replace all of them:
for m in matches:
    str = str.replace(m, m[1])
That's all, folks! =)
I think that the best way for you so you don't have more spaces in your text is the following:
string = "test sdf sfwe \n \na dssdf"
import re
' '.join(re.findall(r'\w+', string))
'test sdf sfwe a dssdf'

How to find a string in text and return the string from text?

I need to find a string from which the special characters have already been removed. What I want is to find that string in a sentence and return it as it appears there, with its special characters.
Ex: string = France09
Sentence : i leaved in France'09.
Now I did re.search('France09', sentence); it returns True or False, but I want to get France'09 as the output.
Can anyone help me?
From the docs (https://docs.python.org/2/library/re.html#re.search), search is not returning True or False:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
Have a look at https://regex101.com/r/18NJ2E/1
TL;DR
import re
regex = r"(?P<relevant_info>France'09)"
test_str = "Sentence : i leaved in France'09."
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group('relevant_info'))
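To make the point from the docs concrete, here is a small sketch of what re.search hands back: a match object when the pattern is found and None otherwise. The variable names `hit` and `miss` are only illustrative.

```python
import re

sentence = "i leaved in France'09."

hit = re.search(r"France'09", sentence)   # pattern occurs in the sentence
miss = re.search(r"France09", sentence)   # pattern does not occur

# A match object exposes the matched text; a failed search is None.
print(hit.group(0) if hit else None)
print(miss)
```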
Try This:
Input_str = "i leaved in France'09"
Word_list = Input_str.split(" ")
for val in Word_list:
    if not val.isalnum():
        print(val)
Output:
France'09
You will need to create a regular expression that matches the special characters at any location:
import re
Sentence = "i leaved in France'09"
Match = 'France09'
Match2 = "[']*".join(Match)
m = re.search(Match2, Sentence)
print(m.group(0))
Match2 gets the value "F[']*r[']*a[']*n[']*c[']*e[']*0[']*9". You can add other special characters into the ['] part.
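The same idea can be sketched in a slightly more general form. Treating every non-alphanumeric, non-space character as "special" is an assumption here, and the helper name `find_original` is hypothetical.

```python
import re

def find_original(stripped, sentence):
    # Allow any run of non-alphanumeric, non-space characters between
    # consecutive characters of the stripped string (an assumption about
    # what counts as a "special character").
    gap = r"[^A-Za-z0-9\s]*"
    pattern = gap.join(re.escape(ch) for ch in stripped)
    m = re.search(pattern, sentence)
    return m.group(0) if m else None

found = find_original("France09", "i leaved in France'09.")
missing = find_original("Spain10", "i leaved in France'09.")
print(found)
print(missing)
```

Escaping each character with re.escape keeps the approach safe if the stripped string ever contains regex metacharacters.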

How to replace part of string via regex with saving part of pattern?

For example, I have strings like this:
s = "chapter1 in chapters"
How can I replace it with regex to this:
s = "chapter 1 in chapters"
i.e. I only need to insert whitespace between "chapter" and its number, if it exists. re.sub(r'chapter\d+', r'chapter \d+', s) doesn't work.
You can use lookarounds:
>>> s = "chapter1 in chapters"
>>> print(re.sub(r"(?<=\bchapter)(?=\d)", ' ', s))
chapter 1 in chapters
RegEx Breakup:
(?<=\bchapter) # asserts a position where preceding text is chapter
(?=\d) # asserts a position where next char is a digit
You can use capture groups, something like this:
>>> s = "chapter1 in chapters"
>>> re.sub(r'chapter(\d+)',r'chapter \1',s)
'chapter 1 in chapters'
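As a quick comparison, both approaches can be run on the same input. The sample string with a two-digit number is made up for illustration, and the `\b` in the capture-group variant is an addition that was not in the answer above.

```python
import re

s = "chapter12 in chapters"

# Zero-width match between "chapter" and the first digit; a space is inserted.
with_lookarounds = re.sub(r"(?<=\bchapter)(?=\d)", " ", s)

# Capture the digits and write them back after "chapter ".
with_group = re.sub(r"\bchapter(\d+)", r"chapter \1", s)

print(with_lookarounds)
print(with_group)
```

In both cases "chapters" is left untouched because no digit follows "chapter" there.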

Finding items in quotes, but not escaped quotes, in python using re

Suppose there is a series of strings. Important items are enclosed in quotes, but other items are enclosed in escaped quotes. How can you return only the important items?
Example where both are returned:
import re
testString = 'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = '"([^\\\"]*)"'
print(re.findall(pattern, testString))
Result prints
['one', 'two']
How can I get python's re to only print
['one']
You can use negative lookbehinds to ensure there's no backslash before the quote:
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'(?<!\\)"([^"]*)(?<!\\)"'
#           ^^^^^^^        ^^^^^^^
print(re.findall(pattern, testString))
Even though you use \" to mark the other items, in a normal Python string literal \" is just an escaped quote, so the string contains "two" with no backslashes. Use a Python raw string, where \" keeps its backslash:
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'"(\w*)"'
print(re.findall(pattern, testString))
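The effect of the raw-string prefix can be checked directly. This is a minimal sketch: the shortened test strings and the variable names `plain` and `raw` are illustrative, and the pattern combines a lookbehind with a character class that also excludes backslashes.

```python
import re

plain = 'first "one" then \"two\"'   # the backslashes vanish in a normal literal
raw = r'first "one" then \"two\"'    # the backslashes are kept in a raw literal

# An opening quote must not be preceded by a backslash, and the quoted
# content may contain neither quotes nor backslashes.
pattern = r'(?<!\\)"([^"\\]*)"'

plain_items = re.findall(pattern, plain)
raw_items = re.findall(pattern, raw)

print(len(raw) - len(plain))
print(plain_items)
print(raw_items)
```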
