Regex to remove all punctuation and anything enclosed by brackets - python

I'm trying to remove all punctuation and anything inside brackets or parentheses from a string in Python. The idea is to somewhat normalize song names to get better results when I query the MusicBrainz WebService.
Sample input: T.N.T. (live) [nyc]
Expected output: T N T
I can do it in two regexes, but I would like to see if it can be done in just one. I tried the following, which didn't work...
>>> re.sub(r'\[.*?\]|\(.*?\)|\W+', ' ', 'T.N.T. (live) [nyc]')
'T N T live nyc '
If I split the \W+ into its own regex and run it second, I get the expected result, so it seems that \W+ is eating the brackets and parens before the first two alternatives can deal with them.

You are correct that the \W+ is eating the brackets. Remove the + and you should be set:
>>> re.sub(r'\[.*?\]|\(.*?\)|\W', ' ', 'T.N.T. (live) [nyc]')
'T N T     '
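The substitution above leaves one space for every replaced character or bracketed span, so the result still carries extra spaces at the end. A minimal sketch of tidying that up, assuming a helper name (normalize_title) that does not appear in the original post:

```python
import re

# The pattern is the one from the answer above; the names are illustrative.
BRACKETED_OR_NONWORD = re.compile(r'\[.*?\]|\(.*?\)|\W')

def normalize_title(title):
    """Blank out bracketed spans and punctuation, then tidy the spacing."""
    spaced = BRACKETED_OR_NONWORD.sub(' ', title)
    # str.split() with no arguments splits on runs of whitespace and drops
    # leading/trailing whitespace, so re-joining collapses the extra spaces.
    return ' '.join(spaced.split())

print(normalize_title('T.N.T. (live) [nyc]'))  # T N T
```

This keeps the single regex and handles the whitespace with plain string methods.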

Here's a mini-parser I wrote as an exercise that does the same thing. If your effort to normalize gets much more complex, you may want to start looking at parser-based solutions.
# Remove all non-word chars and anything between parens or brackets
import string

def consume(text):
    chars = iter(text)
    lookbehind = None  # last character yielded

    def killuntil(returnchar):
        # Discard characters up to and including returnchar
        for ch in chars:
            if ch == returnchar:
                return

    for ch in chars:
        if ch in string.ascii_letters:
            yield ch
            lookbehind = ch
        elif not ch.strip() and lookbehind != ' ':
            yield ' '
            lookbehind = ' '
        elif ch == '(':
            killuntil(')')
        elif ch == '[':
            killuntil(']')
        elif lookbehind != ' ':
            lookbehind = ' '
            yield ' '

s = "T.N.T. (live) [nyc]"
c = consume(s)
print(repr(''.join(c)))  # 'T N T '

The \W+ eats the brackets because it "has a run": it starts matching at the dot after the second T and keeps matching up to and including the opening parenthesis: . (. After that, it matches again from the closing parenthesis to the opening bracket: ) [.
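To see the "run" described above, one can list what the pattern actually matches on the sample input. This is only an illustrative sketch using re.finditer; the variable names are not from the original post:

```python
import re

pattern = re.compile(r'\[.*?\]|\(.*?\)|\W+')
sample = 'T.N.T. (live) [nyc]'

# Collect every matched substring, in order.
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)  # ['.', '.', '. (', ') [', ']']
```

Because '. (' has already consumed the opening parenthesis, the \(.*?\) alternative never gets a chance to start there.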

\W
When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_].
So try r'\[.*?\]|\(.*?\)|{.*?}|[^a-zA-Z0-9_()[\]{}]+'.
Andrew's solution is probably better, though.

Related

Replace all a in the middle of string by * using regex

I want to replace every 'A' in the middle of a string with '*' using a regex in Python.
I tried this:
re.sub(r'[B-Z]+([A]+)[B-Z]+', r'*', 'JAYANTA ')
but it outputs '*ANTA '.
I want it to be 'J*Y*NTA'.
Can someone provide the required code? I would also like an explanation of what is wrong with my code, if possible.
Use the non-word-boundary \B to make sure that the A's are surrounded by word characters:
import re
text = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
text = re.sub(r'\BA+\B', r'*', text)
print(text)
Prints:
J*Y*NTA POKED AG*STYA WITH B*MBOO
Alternatively, if you want to be more specific and require that the A's are surrounded by upper-case letters, you can use a lookbehind and a lookahead instead:
text = re.sub(r'(?<=[A-Z])A+(?=[A-Z])', r'*', text)
>>> re.sub(r'(?!^)[Aa](?!$)','*','JAYANTA')
'J*Y*NTA'
My regex searches for an A that is not at the start of the string, (?!^), and not at the end of the string, (?!$).
Lookahead assertion:
>>> re.sub(r'A(?=[A-Z])', r'*', 'JAYANTA ')
'J*Y*NTA '
In case the word starts and ends with 'A':
>>> re.sub(r'(?<=[A-Z])A(?=[A-Z])', r'*', 'AJAYANTA ')
'AJ*Y*NTA '
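The patterns suggested in the answers above agree on a single word but can differ once the input contains several words. A small comparison, using a made-up input string chosen for illustration:

```python
import re

text = 'AJAYANTA AGASTYA'

# Word-aware: only A's with a word character on both sides are replaced.
by_boundary = re.sub(r'\BA+\B', '*', text)
by_lookaround = re.sub(r'(?<=[A-Z])A+(?=[A-Z])', '*', text)

# String-anchored: only the very first and very last characters are protected.
by_anchor = re.sub(r'(?!^)[Aa](?!$)', '*', text)

print(by_boundary)    # AJ*Y*NTA AG*STYA
print(by_lookaround)  # AJ*Y*NTA AG*STYA
print(by_anchor)      # AJ*Y*NT* *G*STYA
```

The anchored version treats the whole string as one unit, so it also replaces an A at the end of the first word and at the start of the second.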

insert char with regular expression

I have a string '(abc)def(abc)' and I would like to turn it into '(a|b|c)def(a|b|c)'. I can do that by:
word = '(abc)def(abc)'
pattern = ''
index = 0
while index < len(word):
    if word[index] == '(':
        pattern += word[index]
        index += 1
        # Add '|' after every character except the last one before ')'
        while word[index+1] != ')':
            pattern += word[index]+'|'
            index += 1
        pattern += word[index]
    else:
        pattern += word[index]
    index += 1
print(pattern)
But I want to use a regular expression to make it shorter. Can you show me how to insert the character '|' only between characters that are inside the parentheses, using a regular expression?
How about
>>> import re
>>> re.sub(r'(?<=[a-zA-Z])(?=[a-zA-Z-][^)(]*\))', '|', '(abc)def(abc)')
'(a|b|c)def(a|b|c)'
(?<=[a-zA-Z]) Positive lookbehind. Ensures that the position to insert at is preceded by a letter.
(?=[a-zA-Z-][^)(]*\)) Positive lookahead. Ensures that the position is followed by a letter (or a hyphen)
[^)(]*\) ensures that the letter is within the ()
[^)(]* matches anything other than ( or )
\) ensures that the run of characters other than ( or ) is followed by )
This part is crucial, as it keeps the regex from matching inside def, since def is not followed by )
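Since the pattern above consists only of lookarounds, every match is empty and just marks an insertion point. A short sketch that lists those points for the sample string (variable names are illustrative):

```python
import re

insert_at = re.compile(r'(?<=[a-zA-Z])(?=[a-zA-Z-][^)(]*\))')
word = '(abc)def(abc)'

# Each match is zero-width; start() is the index where '|' would be inserted.
positions = [m.start() for m in insert_at.finditer(word)]
print(positions)  # [2, 3, 10, 11]
```

The indices fall between a and b and between b and c in each parenthesized group, and none fall inside def.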
I don't have enough reputation to comment, but the regex you are looking for will look like this:
"\(.*?\)"
For each string you find, insert a '|' between each pair of characters.
Let me explain each part of the regex:
\( - represents a literal ( character.
. - A dot in regex represents any possible character.
* - In regex, this sign represents zero to infinite appearances of the previous character.
? - After *, this makes the match stop at the first closing parenthesis instead of the last one.
\) - represents a literal ) character.
This way, you are looking for any appearance of "()" with characters between them.
Hope I helped :)
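The idea in this answer (find each parenthesized group, then process its contents) can be done in one call by passing a function to re.sub. A sketch, assuming the input has no nested parentheses; the helper name pipe_inside_parens is hypothetical:

```python
import re

def pipe_inside_parens(text):
    """Rewrite each (...) group with '|' between its characters."""
    def join_chars(match):
        # '|'.join on a string puts '|' between its characters.
        return '(' + '|'.join(match.group(1)) + ')'
    # [^)]* keeps each match inside a single pair of parentheses.
    return re.sub(r'\(([^)]*)\)', join_chars, text)

print(pipe_inside_parens('(abc)def(abc)'))  # (a|b|c)def(a|b|c)
```

Text outside the parentheses is never passed to the callback, so def is left untouched.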
([^(])(?=[^(]*\))(?!\))
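Try this. Replace with \1|. See demo.
https://regex101.com/r/sH8aR8/13
import re
p = re.compile(r'([^(])(?=[^(]*\))(?!\))')
test_str = "(abc)def(abc)"
# Raw string, so \1 is a back-reference and not the character chr(1)
subst = r"\1|"
result = re.sub(p, subst, test_str)
print(result)  # (a|b|c)def(a|b|c)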
If you have only single characters in your round brackets, then what you could do would be to simply replace the round brackets with square ones. So the initial regex will look like this: (abc)def(abc) and the final regex will look like so: [abc]def[abc]. From a functional perspective, (a|b|c) has the same meaning as [abc].
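A quick way to check the equivalence claimed above is to run both forms against a few candidate strings. This is only a sketch with made-up test inputs, and the conversion at the end assumes no nested parentheses:

```python
import re

alternation = re.compile(r'(a|b|c)def(a|b|c)')
char_class = re.compile(r'[abc]def[abc]')

candidates = ['adefb', 'cdefc', 'ddefa', 'abdefc', 'def']
# Record whether the two patterns agree on each candidate.
agreement = [
    bool(alternation.fullmatch(c)) == bool(char_class.fullmatch(c))
    for c in candidates
]
print(agreement)

# Turning the round brackets of the original string into square ones:
converted = re.sub(r'\(([^)]*)\)', r'[\1]', '(abc)def(abc)')
print(converted)  # [abc]def[abc]
```

One difference to keep in mind: (a|b|c) also creates a capture group, which [abc] does not.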
A simple Python version to achieve the same thing. Regex is a bit hard to read and often hard to debug or change.
word = '(abc)def(abc)'
split_w = word.replace('(', ' ').replace(')', ' ').split()
split_w[0] = '|'.join(split_w[0])
split_w[2] = '|'.join(split_w[2])
print("(%s)%s(%s)" % tuple(split_w))
We split the given string into three parts, pipe-separate the first and the last part and join them back.

Split leading whitespace from rest of string

I'm not sure how to exactly convey what I'm trying to do, but I'm trying to create a function to split off a part of my string (the leading whitespace) so that I can edit it with different parts of my script, then add it again to my string after it has been altered.
So let's say I have the string:
"    That's four spaces"
I want to split it so I end up with:
"    " and "That's four spaces"
You can use re.match:
>>> import re
>>> re.match(r'(\s*)(.*)', "    That's four spaces").groups()
('    ', "That's four spaces")
(\s*) captures zero or more whitespace characters at the start of the string and (.*) gets everything else.
Remember though that strings are immutable in Python. Technically, you cannot edit their contents; you can only create new string objects.
For a non-Regex solution, you could try something like this:
>>> mystr = "    That's four spaces"
>>> n = next(i for i, c in enumerate(mystr) if c != ' ')  # Count spaces at start
>>> (' ' * n, mystr[n:])
('    ', "That's four spaces")
The main tools here are next, enumerate, and a generator expression. This solution is probably faster than the Regex one, but I personally think that the first is more elegant.
Why don't you try matching instead of splitting?
>>> import re
>>> s = "    That's four spaces"
>>> re.findall(r'^\s+|.+', s)
['    ', "That's four spaces"]
Explanation:
^\s+ Matches one or more whitespace characters at the start of the string.
| OR
.+ Matches all the remaining characters.
One solution is to lstrip the string, then figure out how many characters you've removed. You can then 'modify' the string as desired and finish by adding the whitespace back to your string. I don't think this would work properly with tab characters, but for spaces only it seems to get the job done:
my_string = "    That's four spaces"
no_left_whitespace = my_string.lstrip()
modified_string = no_left_whitespace + '!'
index = my_string.index(no_left_whitespace)
final_string = (' ' * index) + modified_string
print(final_string)  #     That's four spaces!
And a simple test to ensure that we've done it right, which passes:
assert final_string == my_string + '!'
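The concern about tab characters can be avoided by slicing the original prefix off instead of rebuilding it from spaces. A minimal sketch, with a hypothetical helper name (split_leading_whitespace) and a made-up input:

```python
def split_leading_whitespace(text):
    """Return (leading_whitespace, rest) without assuming spaces only."""
    rest = text.lstrip()
    # Whatever lstrip() removed is exactly the first `cut` characters.
    cut = len(text) - len(rest)
    return text[:cut], rest

indent, rest = split_leading_whitespace("\t  That's a tab and two spaces")
print(repr(indent))         # '\t  '
print(indent + rest + '!')  # original indentation is preserved as-is
```

Because the prefix is copied rather than regenerated, any mix of tabs and spaces survives the round trip.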
One thing you can do is make a list out of the string, that is:
x = "    That's four spaces"
y = list(x)
z = "".join(y[0:4])  # if the count varies, loop here to detect the spaces at the start
k = "".join(y[4:])
s = []
s.append(z)
s.append(k)
print(s)
This is a non-regex solution which does not require any imports.

regEx: To match two groups of chars

I want a regEx to match some text that contains both alpha and numeric chars. But I do NOT want it to match only alpha or numbers.
E.g. in python:
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
#                ^^^^^^^^ <- I want something that'll only match this part.
import re
rr = re.compile('([0-9a-z]{8})')
print('sub=', rr.sub('########', s))
print('findall=', rr.findall(s))
generates following output:
sub= [########: ########]: STARTED at ########ng job number ########
findall= ['mytaskid', '3fee46d2', 'processi', '10022001']
I want it to be:
sub= [mytaskid: ########]: STARTED at processing job number 10022001
findall= ['3fee46d2']
Any ideas... ??
In this case it's exactly 8 chars always, it would be even more wonderful to have a regEx that doesn't have {8} in it, i.e. it can match even if there are more or less than 8 chars.
-- edit --
Question is more to understand if there is a way to write a regEx such that I can combine 2 patterns (in this case [0-9] and [a-z]) and ensure the matched string matches both patterns, but number of chars matched from each set is variable. E.g. s could also be
s = 'mytaskid 3fee46d2 STARTED processing job number 10022001'
-- answer --
Thanks to all for the answers. All of them give me what I want, so everyone gets a +1 and the first one to answer gets the accepted answer, although jerry explains it the best. :)
If anyone is a stickler for performance, there is little to choose between them; they all perform about the same.
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
# ^^^^^^^^ <- I want something that'll only match this part.
import re
from timeit import timeit

def testIt(regEx):
    s = '[mytaskid: 3333fe46d2]: STARTED at processing job number 10022001'
    # Check the regex under test (not a hard-coded one) before timing it
    assert (re.sub(regEx, '########', s) ==
            '[mytaskid: ########]: STARTED at processing job number 10022001'), '"%s" does not work.' % regEx
    # %r embeds both strings as valid literals, so \b stays a word boundary
    # instead of turning into a backspace character inside the setup code
    setup = 'import re\ns = %r\nrr = re.compile(%r)' % (s, regEx)
    print("sub() with '", regEx, "': ", timeit("rr.sub('########', s)", number=500000, setup=setup))
    print("findall() with '", regEx, "': ", timeit('rr.findall(s)', setup=setup))

testIt(r'\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b')
testIt(r'\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b')
testIt(r'\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b')
testIt(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b')
produced:
sub() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.328042736387
findall() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.350668751542
sub() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.314759661193
findall() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.35618526928
sub() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.322802906619
findall() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.35330467656
sub() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.320779061371
findall() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.347522144274
Try following regex:
\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b
This will match a word containing a digit followed by a letter, or vice versa.
Hence it will cover the complete set of words which contain at least one digit and one letter.
Note: Although it is not the case with python, I have observed that not all varieties of tools support lookahead and lookbehind. So I prefer to avoid them if possible.
You need to use the look ahead (?=...).
This one matches all words with at least one out of [123] and [abc].
>>> re.findall('\\b(?=[abc321]*[321])[abc321]*[abc][abc321]*\\b', ' 123abc 123 abc')
['123abc']
This way you can do AND for constraints to the same string.
>>> help(re)
(?=...) Matches if ... matches next, but doesn't consume the string.
Another way is to spell it out: a word with one of [abc] and one of [123] must contain at least one [123][abc] or [abc][123] pair, resulting in
>>> re.findall('\\b[abc321]*(?:[abc][123]|[123][abc])[abc321]*\\b', ' 123abc 123 abc')
['123abc']
Not the most beautiful regular expression, but it works:
\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b
If the format is the same each time, that is:
[########: ########]: STARTED at ########ng job number ########
You can use:
([^\]\s]+)\]
With re.findall, or re.search and getting .group(1) if you use re.search.
[^\]\s]+ is a negated class and will match any character except space (and family) or closing square bracket.
The regex basically looks for characters (except ] or spaces) up until a closing square bracket.
If you want to match any string containing both alpha and numeric characters, you will need a lookahead:
\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b
Used like so:
result = re.search(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b', text, re.I)
re.I is for ignorecase.
\b is a word boundary and will match only between a 'word' character and a 'non-word' character (or start/end of string).
(?=[0-9]*[a-z]) is a positive lookahead and makes sure there's at least 1 alpha in the part to be matched.
(?=[a-z]*[0-9]) is a similar lookahead but checks for digits.
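The two-lookahead pattern explained above can be tried against both sample strings from the question. This is an illustrative sketch; the variable names are assumptions, not part of the original answer:

```python
import re

mixed_word = re.compile(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b', re.I)

samples = [
    '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001',
    'mytaskid 3fee46d2 STARTED processing job number 10022001',
]

# Only the word that mixes letters and digits should be found.
found = [mixed_word.findall(s) for s in samples]
masked = [mixed_word.sub('########', s) for s in samples]

print(found)      # [['3fee46d2'], ['3fee46d2']]
print(masked[0])  # [mytaskid: ########]: STARTED at processing job number 10022001
```

Words made only of letters fail the digit lookahead, and words made only of digits fail the letter lookahead, so neither is touched.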
You can use a more specific regular expression and skip the findall.
import re
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
mo = re.search(r':\s+(\w+)', s)
print(mo.group(1))

Regex divide with upper-case

I would like to turn strings like 'HDMWhoSomeThing' into 'HDM Who Some Thing' with a regex.
So I would like to extract words which start with an upper-case letter or consist of upper-case letters only. Notice that in the string 'HDMWho' the last upper-case letter is in fact the first letter of the word Who and should not be included in the word HDM.
What is the correct regex to achieve this goal? I have tried many regexes similar to [A-Z][a-z]+ but without success. The [A-Z][a-z]+ gives me 'Who Some Thing', without 'HDM' of course.
Any ideas?
Thanks,
Rukki
#! /usr/bin/env python
import re
from collections import deque

pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'HDMWhoSomeMONKEYThingXYZ'))
result = []
while len(chunks):
    buf = chunks.popleft()
    if len(buf) == 0:
        continue
    # A lone capital is the first letter of a word: glue the rest back on
    if re.match(r'^[A-Z]$', buf) and len(chunks):
        buf += chunks.popleft()
    result.append(buf)
print(' '.join(result))
Output:
HDM Who Some MONKEY Thing XYZ
Judging by lines of code, this task is a much more natural fit with re.findall:
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z][a-z]*)'
print(' '.join(re.findall(pattern, 'HDMWhoSomeMONKEYThingX')))
Output:
HDM Who Some MONKEY Thing X
Try to split with this regular expression:
/(?=[A-Z][a-z])/
And if your regular expression engine does not support splitting empty matches, try this regular expression to put spaces between the words:
/([A-Z])(?![A-Z])/
Replace it with " $1" (space plus match of the first group). Then you can split at the space.
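The two suggestions above are written in a generic regex notation. A rough Python translation, offered as a sketch: splitting on an empty match requires Python 3.7 or later, and Python writes the "$1" back-reference as \1.

```python
import re

name = 'HDMWhoSomeThing'

# Split at every position that is followed by a capital and a lowercase letter.
parts = re.split(r'(?=[A-Z][a-z])', name)
print(parts)  # ['HDM', 'Who', 'Some', 'Thing']

# Or put a space before each capital that is not followed by another capital.
spaced = re.sub(r'([A-Z])(?![A-Z])', r' \1', name)
print(spaced)          # HDM Who Some Thing
print(spaced.split())  # ['HDM', 'Who', 'Some', 'Thing']
```

If the input starts with a capitalized word, the first variant yields an empty first element and the second a leading space, so some cleanup may be needed in that case.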
One-liner:
' '.join(a or b for a,b in re.findall('([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))',s))
using the regexp
([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))
So 'words' in this case are:
Any number of uppercase letters - unless the last uppercase letter is followed by a lowercase letter.
One uppercase letter followed by any number of lowercase letters.
so try:
([A-Z]+(?![a-z])|[A-Z][a-z]*)
The first alternation includes a negative lookahead (?![a-z]), which handles the boundary between an all-caps word and an initial caps word.
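The pattern from this answer can be checked with re.findall on the inputs used elsewhere in this thread. A short sketch; the variable names are illustrative:

```python
import re

# All-caps run not followed by a lowercase letter, or one capitalized word.
words = re.compile(r'([A-Z]+(?![a-z])|[A-Z][a-z]*)')

names = ['HDMWhoSomeThing', 'SomeHDMWhoThing', 'HDMWhoSomeMONKEYThingXYZ']
split_names = [' '.join(words.findall(n)) for n in names]

for line in split_names:
    print(line)
# HDM Who Some Thing
# Some HDM Who Thing
# HDM Who Some MONKEY Thing XYZ
```

When [A-Z]+ first grabs HDMW, the negative lookahead fails on the following h, so the engine backtracks to HDM and leaves W for the second alternative.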
Maybe '[A-Z]*?[A-Z][a-z]+'?
Edit: This seems to work: [A-Z]{2,}(?![a-z])|[A-Z][a-z]+
import re

def find_stuff(text):
    p = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
    m = p.findall(text)
    result = ''
    for x in m:
        result += x + ' '
    print(result)

find_stuff('HDMWhoSomeThing')
find_stuff('SomeHDMWhoThing')
Prints out:
HDM Who Some Thing
Some HDM Who Thing
