How can I replace all occurrences of a substring using regex? - python

I have a string, s = 'sdfjoiweng%#$foo$fsoifjoi', and I would like to replace 'foo' with 'bar'.
I tried re.sub(r'\bfoo\b', 'bar', s) and re.sub(r'[foo]', 'bar', s), but it doesn't do anything. What am I doing wrong?

You can replace it directly:
>>> import re
>>> s = 'sdfjoiweng%#$foo$fsoifjoi'
>>> print(re.sub('foo','bar',s))
sdfjoiweng%#$bar$fsoifjoi
It will also work for more occurrences of foo like below:
>>> s = 'sdfjoiweng%#$foo$fsoifoojoi'
>>> print(re.sub('foo','bar',s))
sdfjoiweng%#$bar$fsoibarjoi
If you want to replace only the first occurrence of foo rather than every occurrence, pass count=1 to re.sub. If you want to replace foo only where it stands on its own and not inside a longer word, alecxe's answer does exactly that.
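A minimal sketch of limiting the number of replacements, assuming the same sample string as above (the variable name first_only is only for illustration):

```python
import re

s = 'sdfjoiweng%#$foo$fsoifoojoi'

# count=1 stops after the first match; the default count=0 replaces every match.
first_only = re.sub('foo', 'bar', s, count=1)
print(first_only)  # sdfjoiweng%#$bar$fsoifoojoi
```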

re.sub(r'\bfoo\b', 'bar', s)
Here, the \b defines the word boundaries - positions between a word character (\w) and a non-word character - exactly what you have matching for foo inside the sdfjoiweng%#$foo$fsoifjoi string. Works for me:
In [1]: import re
In [2]: s = 'sdfjoiweng%#$foo$fsoifjoi'
In [3]: re.sub(r'\bfoo\b', 'bar', s)
Out[3]: 'sdfjoiweng%#$bar$fsoifjoi'
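To make the difference between the two patterns from the question concrete, here is a small sketch. The short test strings are made up for illustration and are not from the question:

```python
import re

# '$' is a non-word character, so \b matches on both sides of foo.
bounded = re.sub(r'\bfoo\b', 'bar', 'a$foo$b')

# Inside a longer word there is no boundary around foo, so nothing changes.
embedded = re.sub(r'\bfoo\b', 'bar', 'xfooy')

# '[foo]' is a character class: it matches a single 'f' or 'o' at a time,
# so every such character is replaced separately.
char_class = re.sub(r'[foo]', 'bar', 'foo')

print(bounded)     # a$bar$b
print(embedded)    # xfooy
print(char_class)  # barbarbar
```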

You can use the str.replace method directly instead of a regex. It returns a new string with every occurrence replaced.
>>> s = 'sdfjoiweng%#$foo$fsoifjoifoo'
>>> s.replace("foo","bar")
'sdfjoiweng%#$bar$fsoifjoibar'

To add to the above, the code below shows how to replace multiple words at once. I've used this to replace 165,000 words in one step.
Note that \b restricts the match to whole words; if you remove it, the pattern also matches substrings. Write the pattern as a raw string (r'...'), otherwise Python reads \b as a backspace character.
import re
s = 'thisis a test'
re.sub(r'\bthis\b|test', '', s)
This gives:
'thisis a '
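For a long word list, the pattern can be built programmatically. This is only a sketch of one way to do it: the helper name remove_words and the sample inputs are hypothetical, and here every word is wrapped in \b, unlike the example above where only this is.

```python
import re

def remove_words(text, words):
    # re.escape makes regex metacharacters in the words literal;
    # the words are then joined into one alternation between word boundaries.
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    return re.sub(pattern, '', text)

result = remove_words('thisis a test of this', ['this', 'test'])
print(repr(result))  # 'thisis a  of '
```

The whitespace around removed words is left in place, so you may want to normalize spaces afterwards.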

Related

Python: Change uppercase letter

I can't figure out how to replace the second uppercase letter in a string in python.
for example:
string = "YannickMorin"
I want it to become yannick-morin
As of now I can make it all lowercase by doing string.lower() but how to put a dash when it finds the second uppercase letter.
You can use a regex:
>>> import re
>>> split_res = re.findall('[A-Z][^A-Z]*', 'YannickMorin')
>>> split_res
['Yannick', 'Morin']
>>> '-'.join(split_res).lower()
'yannick-morin'
This is more a task for regular expressions:
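result = re.sub(r'([a-z])([A-Z])', r'\1-\2', inputstring).lower()
Both letters are captured and written back, so the lowercase letter before the dash is not lost.
Demo:
>>> import re
>>> inputstring = 'YannickMorin'
>>> re.sub(r'([a-z])([A-Z])', r'\1-\2', inputstring).lower()
'yannick-morin'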
Find uppercase letters that are not at the beginning of a word and insert a dash before each one. Then convert everything to lowercase.
>>> import re
>>> re.sub(r'\B([A-Z])', r'-\1', "ThisIsMyText").lower()
'this-is-my-text'
The lower() method does not change the string in place; it returns a new string, which you need to print or assign to a variable. One solution, which only works when the uppercase letters are at the hardcoded positions 0 and 7, is:
string = "YannickMorin"
strAsList = list(string)
strAsList[0] = strAsList[0].lower()
strAsList[7] = strAsList[7].lower()
strAsList.insert(7, '-')
print(''.join(strAsList))
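If the position of the uppercase letters is not known in advance, a zero-width pattern avoids consuming any characters. This is a sketch rather than the only way to do it, and the function name camel_to_dashed is made up for the example:

```python
import re

def camel_to_dashed(name):
    # (?<!^) rules out the start of the string; (?=[A-Z]) is a lookahead,
    # so the dash is inserted before each capital without consuming it.
    return re.sub(r'(?<!^)(?=[A-Z])', '-', name).lower()

print(camel_to_dashed('YannickMorin'))  # yannick-morin
print(camel_to_dashed('ThisIsMyText'))  # this-is-my-text
```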

Extracting alphanumeric substring from a string in python

I have a string in Python:
text = '(b)'
I want to extract the 'b'. I could strip the first and last characters of the string, but I won't do that because the text may be '(a)', '(iii)', 'i)', '(1' or '(2)'. Sometimes it contains no parentheses at all, but it always contains an alphanumeric value, and that value is what I want to retrieve.
This has to be done in one line or a small block of code that returns just the value, because it will be applied iteratively in many situations.
What is the best way to do that in Python?
I don't think a regex is needed here. You can just strip off any parentheses with str.strip:
>>> text = '(b)'
>>> text.strip('()')
'b'
>>> text = '(iii)'
>>> text.strip('()')
'iii'
>>> text = 'i)'
>>> text.strip('()')
'i'
>>> text = '(1'
>>> text.strip('()')
'1'
>>> text = '(2)'
>>> text.strip('()')
'2'
>>> text = 'a'
>>> text.strip('()')
'a'
Regarding @MikeMcKerns' comment, a more robust solution would be to pass string.punctuation to str.strip:
>>> from string import punctuation
>>> punctuation # Just to demonstrate
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>
>>> text = '*(ab2**)'
>>> text.strip(punctuation)
'ab2'
>>>
You could also do this with Python's re module:
>>> import re
>>> text = '(5a)'
>>> match = re.search(r'\(?([0-9A-Za-z]+)\)?', text)
>>> match.group(1)
'5a'
>>> text = '*(ab2**)'
>>> match = re.search(r'\(?([0-9A-Za-z]+)\)?', text)
>>> match.group(1)
'ab2'
Not fancy, but this is pretty generic:
>>> import string
>>> ''.join(i for i in text if i in string.ascii_letters + string.digits)
This works for any combination of parentheses, including in the middle of the string, and also when other non-alphanumeric characters (aside from the parentheses) are present.
re.match(r'\(?([a-zA-Z0-9]+)', text).group(1)
For the inputs given in your example, that would be:
>>> a=['(a)', '(iii)', 'i)', '(1' , '(2)']
>>> [ re.match(r'\(?([a-zA-Z0-9]+)', text).group(1) for text in a ]
['a', 'iii', 'i', '1', '2']
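Note that re.match returns None when the string does not start with the pattern, so calling .group(1) on it raises an AttributeError. A hedged sketch that uses re.search and handles the no-match case (the helper name first_alnum and the extra sample inputs 'b' and '()' are assumptions added for illustration):

```python
import re

def first_alnum(text):
    # re.search scans the whole string, so any leading punctuation is skipped.
    match = re.search(r'[0-9A-Za-z]+', text)
    # Return None instead of raising when there is nothing alphanumeric.
    return match.group(0) if match else None

samples = ['(a)', '(iii)', 'i)', '(1', '(2)', 'b', '()']
results = [first_alnum(t) for t in samples]
print(results)  # ['a', 'iii', 'i', '1', '2', 'b', None]
```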

Removing many types of chars from a Python string

I have some string X and I wish to remove semicolons, periods, commas, colons, etc, all in one go. Is there a way to do this that doesn't require a big chain of .replace(somechar,"") calls?
You can use the translate method with a first argument of None:
string2 = string1.translate(None, ";.,:")
Alternatively, you can use the filter function:
string2 = filter(lambda x: x not in ";,.:", string1)
Note that both of these options only work for non-Unicode strings and only in Python 2.
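In Python 3, str.translate takes a single table argument, and str.maketrans with three arguments builds a table that deletes characters. A minimal sketch, with a sample string made up for illustration:

```python
string1 = "a;b.c,d:e"

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans('', '', ';.,:')
string2 = string1.translate(table)
print(string2)  # abcde
```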
You can use re.sub to match a pattern and replace it. The following replaces only h and i with empty strings:
In [1]: s = 'byehibyehbyei'
In [2]: re.sub('[hi]', '', s)
Out[2]: 'byebyebye'
Don't forget to import re.
>>> import re
>>> foo = "asdf;:,*_-"
>>> re.sub('[;:,*_-]', '', foo)
'asdf'
[;:,*_-] - the set of characters to match
'' - replace each match with nothing
foo - the string to operate on
For more information take a look at the re.sub(pattern, repl, string, count=0, flags=0) documentation.
Don't know about the speed, but here's another example without using re.
commas_and_stuff = ",+;:"
words = "words; and stuff!!!!"
cleaned_words = "".join(c for c in words if c not in commas_and_stuff)
Gives you:
'words and stuff!!!!'

Split string without non-characters

I'm trying to split a string that looks like this for example:
':foo [bar]'
Using str.split() on this of course returns [':foo','[bar]']
But how can I make it return just ['foo','bar'] containing only these characters?
I don't like regular expressions, but do like Python, so I'd probably write this as
>>> s = ':foo [bar]'
>>> ''.join(c for c in s if c.isalnum() or c.isspace())
'foo bar'
>>> ''.join(c for c in s if c.isalnum() or c.isspace()).split()
['foo', 'bar']
The ''.join idiom is a little strange, I admit, but you can almost read the rest in English: "join every character for the characters in s if the character is alphanumeric or the character is whitespace, and then split that".
Alternatively, if you know that the symbols you want to remove will always be on the outside and the word will still be separated by spaces, and you know what they are, you might try something like
>>> s = ':foo [bar]'
>>> s.split()
[':foo', '[bar]']
>>> [word.strip(':[]') for word in s.split()]
['foo', 'bar']
Do str.split() as normal, and then process each element to remove the non-letters. Something like:
>>> my_string = ':foo [bar]'
>>> parts = [''.join(c for c in s if c.isalpha()) for s in my_string.split()]
>>> parts
['foo', 'bar']
You'll have to go through the list [':foo', '[bar]'] and strip out all non-letter characters, for example with regular expressions. Check "Regex replace (in Python) - a simpler way?" for examples and references to documentation.
You can try regular expressions.
Use re.sub() to remove the :, [ and ] characters, and then split the resulting string on whitespace.
>>> st = ':foo [bar]'
>>> import re
>>> new_st = re.sub(r'[\[\]:]','',st)
>>> new_st.split(' ')
['foo', 'bar']
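Another option is to skip the splitting and collect the runs of wanted characters directly with re.findall. This is a sketch that assumes "characters" means ASCII letters and digits; adjust the character class if you need something else:

```python
import re

st = ':foo [bar]'

# Each maximal run of letters and digits becomes one list element.
words = re.findall(r'[A-Za-z0-9]+', st)
print(words)  # ['foo', 'bar']
```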

Determining the unmatched portion of a string using a regex in Python

Suppose I have a string "a foobar" and I use "^a\s*" to match "a ".
Is there a way to easily get "foobar" returned? (What was NOT matched)
I want to use a regex to look for a command word and also use the regex to remove the command word from the string.
I know how to do this using something like:
mystring[:regexobj.start()] + mystring[regexobj.end():]
But this falls apart if I have multiple matches.
Thanks!
Use re.sub:
import re
s = "87 foo 87 bar"
r = re.compile(r"87\s*")
s = r.sub('', s)
print(s)
Result:
foo bar
From http://docs.python.org/library/re.html#re.split:
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
so your example would be
>>> re.split(r'(^a\s*)', "a foobar")
['', 'a ', 'foobar']
at which point you can separate the odd items (your match) from the even items (the rest).
>>> l = re.split(r'(^a\s*)', "a foobar")
>>> l[1::2] # matching strings
['a ']
>>> l[::2] # non-matching strings
['', 'foobar']
This has the advantage over re.sub that you can tell whether anything matched, how many matches were found, and what text each one matched.
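If you also need the positions of the matches, re.finditer and subn give that information without splitting. A small sketch, assuming the "87 foo 87 bar" sample used in the other answers (the variable names are illustrative):

```python
import re

text = "87 foo 87 bar"
pattern = re.compile(r"87\s*")

# finditer yields match objects; span() gives each (start, end) position.
spans = [m.span() for m in pattern.finditer(text)]

# subn works like sub but also returns the number of replacements made.
cleaned, n = pattern.subn('', text)

print(spans)       # [(0, 3), (7, 10)]
print(cleaned, n)  # foo bar 2
```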
>>> import re
>>> re.sub(r"87\s*", "", "87 foo 87 bar")
'foo bar'
Instead of splitting or separating, you can use re.sub and substitute an empty string ("") wherever the pattern matches. For example:
>>> import re
>>> re.sub(r"^a\s*", "", "a foobar")
'foobar'
>>> re.sub(r"a\s*", "", "a foobar a foobar")
'foobr foobr'
>>> re.sub(r"87\s*", "", "87 foo 87 bar")
'foo bar'
