Finding items in quotes, but not escaped quotes, in Python using re

Suppose there is a series of strings. Important items are enclosed in quotes, but other items are enclosed in escaped quotes. How can you return only the important items?
Example where both are returned:
import re
testString = 'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = '"([^\\\"]*)"'
print(re.findall(pattern, testString))
Result prints
['one', 'two']
How can I get python's re to only print
['one']

You can use negative lookbehinds to ensure there's no backslash before the quote:
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'(?<!\\)"([^"]*)(?<!\\)"'
# (?<!\\) is a negative lookbehind: the quote must not be preceded by a backslash
print(re.findall(pattern, testString))

Even though you wrote \" to mark the other items, Python interprets \" in an ordinary string literal as a plain ", so the string actually contains "two". Use a raw string instead, where \" keeps its backslash:
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'"(\w*)"'
print(re.findall(pattern, testString))
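The difference between the two kinds of literal is easy to check directly. The sketch below compares the same text with and without the r prefix; the variable names and the shortened sample string are illustrative, not from the question.

```python
import re

# Same literal text, once as an ordinary string and once as a raw string.
plain = 'a "one" b \"two\" c'   # \" collapses to a bare "
raw = r'a "one" b \"two\" c'    # the backslashes are kept

print('\\' in plain)  # False: no backslash survives
print('\\' in raw)    # True

pattern = r'"(\w*)"'
print(re.findall(pattern, plain))  # ['one', 'two']
print(re.findall(pattern, raw))    # ['one']
```

In the raw string the backslash sits between two and its closing quote, so \w* cannot reach the quote and the second item is skipped.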

Related

Replace symbol before match using regex in Python

I have strings such as:
text1 = ('SOME STRING,99,1234 FIRST STREET,9998887777,ABC')
text2 = ('SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF')
text3 = ('ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI')
Desired output:
SOME STRING 99,1234 FIRST STREET,9998887777,ABC
SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF
ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI
My idea: Use regex to find occurrences of 1-5 digits, possibly preceded by a symbol, that are between two commas and not followed by a space and letters, then replace by this match without the preceding comma.
Something like:
text.replace(r'(,\d{0,5},)','.........')
If you use the regex module instead of re, then possibly:
import regex
text = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(regex.sub(r'(?<!^.*,.*),(?=#?\d+,\d+)', ' ', text))
You might be able to use re if you are sure that no other substring matches the pattern in the lookahead.
import re
text = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(re.sub(r',(?=#?\d+,\d+)', ' ', text))
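As a quick check, the re version can be run against all three sample strings from the question. This is only a sketch; the list and the compiled pattern's name are illustrative.

```python
import re

# Sample strings from the question.
texts = [
    'SOME STRING,99,1234 FIRST STREET,9998887777,ABC',
    'SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF',
    'ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI',
]

# A comma is replaced only when it is followed by an optional '#',
# digits, another comma, and more digits.
comma_before_number = re.compile(r',(?=#?\d+,\d+)')

results = [comma_before_number.sub(' ', t) for t in texts]
for line in results:
    print(line)
```

The second string is left unchanged because 56789 is followed by a space rather than a comma.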
An easier-to-read alternative, if SOME STRING, SOME OTHER STRING, and ANOTHER STRING never contain commas:
text1.replace(",", " ", 1)
This just replaces the first comma with a space. Note that it does so unconditionally, so text2 would also lose its first comma.
Simple, yet effective:
import re
my_pattern = r"(,)(\W?\d{0,5},)"
p = re.compile(my_pattern)
p.sub(r" \2", text1) # 'SOME STRING 99,1234 FIRST STREET,9998887777,ABC'
p.sub(r" \2", text2) # 'SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF'
p.sub(r" \2", text3) # 'ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI'
Secondary pattern with a non-capturing group and verbose compilation:
my_pattern = r"""
(?:,) # Non-capturing group for a single comma.
(\W?\d{0,5},) # Capture an optional non-word character, zero to five digits, and a comma.
"""
# re.X (verbose mode) ignores whitespace in the pattern and allows inline comments
p = re.compile(my_pattern, flags=re.X)
# This time we use \1, since the only capturing group is now the first one
p.sub(r" \1", text1)
p.sub(r" \1", text2)
p.sub(r" \1", text3)

regex and python

I have a string:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
I want to remove the "T" character from the timestamp, i.e. change it to:
"123ABC,'2009-12-23 23:45:58.544-04:00'"
I am trying this:
newString = re.sub('(?:\-\d{2})T(?:\d{2}\:)', ' ', myString)
BUT, the returned string is:
"123ABC,'2009-12 45:58.544-04:00'"
The "non capturing groups" don't appear to be "non capturing", and it's removing everything. What am I doing wrong?
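Non-capturing groups still consume the text they match; they only skip storing it, so re.sub replaces the whole match. Lookarounds (a positive lookbehind and lookahead) check the surrounding digits without consuming them: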
(?<=\d)T(?=\d)
In Python this would be:
import re
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
rx = r'(?<=\d)T(?=\d)'
# match a T surrounded by digits
new_string = re.sub(rx, ' ', myString)
print(new_string)
# 123ABC,'2009-12-23 23:45:58.544-04:00'
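To make the difference concrete, the sketch below runs the pattern from the question (with the redundant escapes dropped) next to the lookaround version. The variable names are illustrative.

```python
import re

my_string = "123ABC,'2009-12-23T23:45:58.544-04:00'"

# Non-capturing groups: the digits and punctuation are part of the match,
# so they are replaced along with the T.
consumed = re.sub(r'(?:-\d{2})T(?:\d{2}:)', ' ', my_string)

# Lookarounds: only the T itself is matched and replaced.
asserted = re.sub(r'(?<=\d)T(?=\d)', ' ', my_string)

print(consumed)  # 123ABC,'2009-12 45:58.544-04:00'
print(asserted)  # 123ABC,'2009-12-23 23:45:58.544-04:00'
```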
Regex seems a bit of an overkill:
myString.replace("T", " ")
I'd use capturing groups, since unanchored lookbehinds are costly in terms of regex performance:
(\d)T(\d)
And replace with the r'\1 \2' replacement pattern, whose backreferences restore the digits before and after the T.
Python demo:
import re
s = "123ABC,'2009-12-23T23:45:58.544-04:00'"
reg = re.compile(r'(\d)T(\d)')
s = reg.sub(r'\1 \2', s)
print(s)
That T is trapped in between numbers and will always be alone on the right. You could use a rsplit and join:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ABC,'2009-12-23 23:45:58.544-04:00'"
Trying this on a string that also has a T earlier on:
myString = "123ATC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ATC,'2009-12-23 23:45:58.544-04:00'"

Find all strings that are in between two sub strings

I have the following string as an example:
string = "## cat $$ ##dog$^"
I want to extract all the strings that are locked between "##" and "$", so the output will be:
[" cat ","dog"]
I only know how to extract the first occurrence:
import re
r = re.compile('##(.*?)$')
m = r.search(string)
if m:
    result_str = m.group(1)
Thoughts & suggestions on how to catch them all are welcomed.
Use re.findall() to get every occurrence of your substring. $ is a special character in regular expressions (the "end of the string" anchor), so you need to escape it to match a literal $.
>>> import re
>>> s = '## cat $$ ##dog$^'
>>> re.findall(r'##(.*?)\$', s)
[' cat ', 'dog']
To remove the leading and trailing whitespace, you can simply match it outside of the capture group.
>>> re.findall(r'##\s*(.*?)\s*\$', s)
['cat', 'dog']
Also, if the text between the delimiters may span newlines, consider using a negated character class. Because it is greedy, it keeps any whitespace before the $:
>>> re.findall(r'##\s*([^$]*)\s*\$', s)
['cat ', 'dog']
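The newline case can be tried with a small variation of the sample string. The input below is hypothetical (the question's string has no newline), and stripping afterwards is just one possible way to deal with the whitespace.

```python
import re

# Hypothetical input: the first item now contains a newline.
s = '## cat\nnap $$ ##dog$^'

# '.' does not match '\n' by default, so the first item is missed.
print(re.findall(r'##(.*?)\$', s))          # ['dog']

# A negated class matches any character except '$', newlines included.
# Being greedy, it also keeps whitespace before the '$'.
print(re.findall(r'##\s*([^$]*)\s*\$', s))  # ['cat\nnap ', 'dog']

# Stripping each item afterwards removes the surrounding whitespace.
items = [m.strip() for m in re.findall(r'##([^$]*)\$', s)]
print(items)                                # ['cat\nnap', 'dog']
```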

Replace all text between 2 strings python

Let's say I have:
a = r''' Example
This is a very annoying string
that takes up multiple lines
and h#s a// kind{s} of stupid symbols in it
ok String'''
I need a way to replace (or just delete) any text in between "This" and "ok" so that when I call it, a now equals:
a = "Example String"
I can't find any wildcards that seem to work. Any help is much appreciated.
You need a regular expression:
>>> import re
>>> re.sub('\nThis.*?ok','',a, flags=re.DOTALL)
' Example String'
Another method is to use string splits:
def replaceTextBetween(originalText, delimiterA, delimiterB, replacementText):
    leadingText = originalText.split(delimiterA)[0]
    trailingText = originalText.split(delimiterB)[1]
    return leadingText + delimiterA + replacementText + delimiterB + trailingText
Limitations:
Does not check if the delimiters exist
Assumes that there are no duplicate delimiters
Assumes that delimiters are in correct order
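Some of these limitations could be eased with str.partition, which reports whether the separator was found. The helper below is a hypothetical sketch rather than the answer's own code: it works on the first start delimiter and the first end delimiter after it, and, like the function above, it keeps both delimiters in the result.

```python
def replace_text_between(original, start, end, replacement):
    """Replace the text between the first `start` and the next `end`.

    Hypothetical helper; both delimiters are kept in the result.
    """
    head, found_start, rest = original.partition(start)
    if not found_start:
        raise ValueError(f'start delimiter {start!r} not found')
    # Search for the end delimiter only after the start delimiter.
    _, found_end, tail = rest.partition(end)
    if not found_end:
        raise ValueError(f'end delimiter {end!r} not found after start')
    return head + start + replacement + end + tail


print(replace_text_between('a [x] b [y] c', '[', ']', 'z'))  # a [z] b [y] c
```

A missing delimiter, or an end delimiter that only appears before the start, raises ValueError instead of silently returning a wrong result.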
The DOTALL flag is the key. Ordinarily, the '.' character doesn't match newlines, so you don't match across lines in a string. If you set the DOTALL flag, re will match '.*' across as many lines as it needs to.
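A minimal illustration of that behaviour, using a made-up three-line string:

```python
import re

text = 'start\nmiddle\nend'

# Without DOTALL, '.' stops at the first newline, so no match is possible.
print(re.search(r'start.*end', text))  # None

# With DOTALL, '.' also matches '\n' and the pattern spans all three lines.
m = re.search(r'start.*end', text, flags=re.DOTALL)
print(m.group(0) == text)              # True
```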
Use re.sub: it replaces everything from a starting character, symbol, or string through an ending one with the desired character, symbol, or string.
format: re.sub('A(.*?)B', P, Q, flags=re.DOTALL)
where
A : character or symbol or string
B : character or symbol or string
P : character or symbol or string which replaces the text from A through B
Q : input string
re.DOTALL : to match across all lines
import re
re.sub('\nThis(.*?)ok', '', a, flags=re.DOTALL)
output : ' Example String'
Let's see an example with HTML code as input
input_string = '''<body> <h1>Heading</h1> <p>Paragraph</p><b>bold text</b></body>'''
Target : remove <p> tag
re.sub('<p>(.*?)</p>', '', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> <b>bold text</b></body>'
Target : replace <p> tag with word : test
re.sub('<p>(.*?)</p>', 'test', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> test<b>bold text</b></body>'
a=re.sub('This.*ok','',a,flags=re.DOTALL)
If you want first and last words:
re.sub(r'^\s*(\w+).*?(\w+)$', r'\1 \2', a, flags=re.DOTALL)

regex word boundaries+quotes

I have the following expression that should match the entire given word in a case-insensitive way. Quotes are part of the word, so I check whether the word is preceded or followed by any quote. For example, the word "foo" shouldn't match the text "foo's".
word = "foo"
pattern = re.compile(r'(?<![a-z\'])%s(?![a-z\'])' % word,flags=re.IGNORECASE)
The exception is triple quotes: if the word is inside (next to) triple quotes, it should match:
pattern.search("'''foo bar baz'''")
"foo" should be found this time, but it isn't, because the word is preceded by a quote.
((?<![a-z\'\"])|(?<=\'{3}))foo((?![a-z\'\"])|(?=\'{3}))
Use regex (?:(?<=''')|(?<!'))\bfoo\b(?:(?=''')|(?!'))
pattern = re.compile(r'(?:(?<=\'\'\')|(?<!\'))\b%s\b(?:(?=\'\'\')|(?!\'))' % word,flags=re.IGNORECASE)
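The pattern can be wrapped in a small helper and tried on a few inputs. This is a sketch: the helper name and the test strings are illustrative, and the use of re.escape is an addition not present in the answer.

```python
import re


def word_pattern(word):
    # re.escape is an addition here, in case the word contains regex
    # metacharacters; the rest is the lookaround pattern from above.
    return re.compile(
        r"(?:(?<=''')|(?<!'))\b%s\b(?:(?=''')|(?!'))" % re.escape(word),
        flags=re.IGNORECASE,
    )


pattern = word_pattern('foo')

print(bool(pattern.search("'''foo bar baz'''")))  # True: triple quotes allowed
print(bool(pattern.search("bar 'foo' baz")))      # False: single quote before
print(bool(pattern.search("foo's")))              # False: quote after
print(bool(pattern.search("a FOO b")))            # True: case-insensitive
```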
Without using lookahead:
>>> pat = r'([\'\"]{3}|\b)foo\1'
>>> m = re.search(pat, 'My """foo""" is rich')
>>> re.search(pat, 'My """foo""" is rich').groups()
('"""',)
>>> re.search(pat, "My '''foo''' is rich").groups()
("'''",)
>>> re.search(pat, 'My """foo"" is rich').groups()
('',)
>>> re.search(pat, 'My """foo\'\'\' is rich').groups()
('',)
