The goal is to prefix and suffix all occurrences of a substring (case-insensitive) in a source string. I basically need to figure out how to get from source_str to target_str.
source_str = 'You ARe probably familiaR with wildcard'
target_str = 'You [b]AR[/b]e probably famili[b]aR[/b] with wildc[b]ar[/b]d'
In this example, I am finding all occurrences of 'ar' (case-insensitive) and replacing each occurrence with itself (i.e. AR, aR and ar respectively), wrapped in a prefix ([b]) and a suffix ([/b]).
>>> import re
>>> source_str = 'You ARe probably familiaR with wildcard'
>>> re.sub(r"(ar)", r"[b]\1[/b]", source_str, flags=re.IGNORECASE)
'You [b]AR[/b]e probably famili[b]aR[/b] with wildc[b]ar[/b]d'
Something like
import re
ar_re = re.compile("(ar)", re.I)
print(ar_re.sub(r"[b]\1[/b]", "You ARe probably familiaR with wildcard"))
perhaps?
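If the substring comes from user input rather than being hard-coded, it may contain regex metacharacters. A minimal sketch, assuming a hypothetical helper named wrap_matches (not part of the answers above), escapes the substring with re.escape and uses a function as the replacement so that backslashes in the prefix or suffix are not read as group references:

```python
import re

def wrap_matches(text, needle, prefix="[b]", suffix="[/b]"):
    """Wrap every case-insensitive occurrence of needle in prefix/suffix."""
    # re.escape makes characters such as '.' or '*' in needle match literally.
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    # A callable replacement re-emits the matched text with its original casing.
    return pattern.sub(lambda m: prefix + m.group(0) + suffix, text)

source_str = 'You ARe probably familiaR with wildcard'
print(wrap_matches(source_str, 'ar'))
# You [b]AR[/b]e probably famili[b]aR[/b] with wildc[b]ar[/b]d
```

For a fixed, known-safe substring the one-line re.sub with \1 shown above is enough.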
Related
I want to find out whether a substring is contained in a string and remove it without touching the rest of the string. The catch is that the pattern I search with does not exactly match what the string contains. In particular, the string may contain Spanish accented vowels while the search substring is uppercase and unaccented, for example:
myString = "I'm júst a tésting stríng"
substring = 'TESTING'
Perform something to obtain:
resultingString = "I'm júst a stríng"
So far I've read that the difflib library can compare two strings and score their similarity, but I'm not sure how to apply it to my case (not to mention that I failed to install the library).
Thanks!
This normalize() method might be a little overkill, and maybe the code from @Harpe at https://stackoverflow.com/a/71591988/218663 works fine.
Here I am going to break the original string into "words" and then join all the non-matching words back into a string:
import unicodedata
def normalize(text):
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()
myString = "I'm júst a tésting stríng"
substring = "TESTING"
newString = " ".join(word for word in myString.split(" ") if normalize(word) != normalize(substring))
print(newString)
giving you:
I'm júst a stríng
If your "substring" could be multi-word I might think about switching strategies to a regex:
import re
import unicodedata
def normalize(text):
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()
myString = "I'm júst á tésting stríng"
substring = "A TESTING"
match = re.search(f"\\s{ normalize(substring) }\\s", normalize(myString))
if match:
    found_at = match.span()
    first_part = myString[:found_at[0]]
    second_part = myString[found_at[1]:]
    print(f"{first_part} {second_part}".strip())
I think that will give you:
I'm júst stríng
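The regex approach above relies on the normalized string having the same length as the original and on the substring containing no regex metacharacters. A hedged sketch that drops both assumptions is shown here; fold and remove_ignoring_accents are hypothetical names, and the whitespace clean-up at the end (collapsing every run of whitespace to one space) is a design choice of this sketch, not something the question requires:

```python
import re
import unicodedata

def fold(text):
    """Lowercase text and drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

def remove_ignoring_accents(text, substring):
    """Remove the first accent- and case-insensitive match of substring."""
    needle = fold(substring)
    if not needle:
        return text
    # Fold one character at a time, remembering which original index
    # each folded character came from.
    folded_chars, index_map = [], []
    for i, ch in enumerate(text):
        for folded_ch in fold(ch):
            folded_chars.append(folded_ch)
            index_map.append(i)
    # re.escape keeps characters such as '.' in the substring literal;
    # \b restricts the match to whole words.
    match = re.search(r"\b" + re.escape(needle) + r"\b", "".join(folded_chars))
    if match is None:
        return text
    start = index_map[match.start()]
    end = index_map[match.end() - 1] + 1
    # Also drop combining marks that trail the last matched character.
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    # Re-join on single spaces to close the gap left by the removal.
    return " ".join((text[:start] + text[end:]).split())

print(remove_ignoring_accents("I'm júst a tésting stríng", "TESTING"))
print(remove_ignoring_accents("I'm júst á tésting stríng", "A TESTING"))
```

The index map is what lets the function cut the original, accented string rather than the folded copy.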
You can use the package unicodedata to normalize accented letters to ascii code letters like so:
import unicodedata
output = unicodedata.normalize('NFD', "I'm júst a tésting stríng").encode('ascii', 'ignore').decode('ascii')
print(output)
which will give
I'm just a testing string
You can then compare this with your input
"TESTING".lower() in output.lower()
which should return True.
Normally we would write the following to replace one match:
import re
namesRegex = re.compile(r'(is)|(life)', re.I)
replaced = namesRegex.sub(r"butter", "There is no life in the void.")
print(replaced)
output:
There butter no butter in the void.
What I want is to replace each group with a specific text, probably using backreferences. Namely, I want to replace the first group (is) with "are" and the second group (life) with "butterflies".
Maybe something like the following, although it is not working code:
namesRegex = re.compile(r'(is)|(life)', re.I)
replaced = namesRegex.sub(r"(are) (butterflies)", r"\1 \2", "There is no life in the void.")
print(replaced)
Is there a way to replace multiple groups in one statement in Python?
You can use a replacement by lambda, mapping the keywords you want to associate:
>>> re.sub(r'(is)|(life)', lambda x: {'is': 'are', 'life': 'butterflies'}[x.group(0)], "There is no life in the void.")
'There are no butterflies in the void.'
You can define a map of keys and replacements first and then use a lambda function in replacement:
>>> repl = {'is': 'are', 'life': 'butterflies'}
>>> print(re.sub(r'is|life', lambda m: repl[m.group()], "There is no life in the void."))
There are no butterflies in the void.
I would also suggest using word boundaries around your keys to safeguard your search patterns:
>>> print(re.sub(r'\b(?:is|life)\b', lambda m: repl[m.group()], "There is no life in the void."))
There are no butterflies in the void.
You may use a dictionary with search-replacement values and use a simple \w+ regex to match words:
import re
dt = {'is' : 'are', 'life' : 'butterflies'}
namesRegex = re.compile(r'\w+')
replaced = namesRegex.sub(lambda m: dt[m.group()] if m.group() in dt else m.group(), "There is no life in the void.")
print(replaced)
With this approach, you do not have to worry about building an overly large regex pattern out of alternations. You may adjust the pattern to include word boundaries, or to match only letters (e.g. [^\W\d_]+), etc., as your requirements dictate. The main point is that the pattern must be able to match every search term that is a key in the dictionary.
The if m.group() in dt else m.group() part checks whether the match is present as a key in the dictionary. If it is, the value from the dictionary is returned; otherwise the match is returned unchanged.
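Note that the question compiles its pattern with re.I, while a plain dictionary lookup on m.group() raises KeyError for a match such as "IS". A minimal sketch that combines the dictionary idea with case-insensitive matching follows; replace_words is a hypothetical helper, and it assumes the keys are whole words and that the replacement text should be emitted as-is regardless of the casing of the match:

```python
import re

def replace_words(text, replacements):
    """Replace whole-word keys of replacements, ignoring case."""
    # Longest keys first, so a longer key wins over a key that is its prefix.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    # Look matches up by their lowercase form so "IS" finds the key "is".
    lowered = {k.lower(): v for k, v in replacements.items()}
    return pattern.sub(lambda m: lowered[m.group().lower()], text)

repl = {'is': 'are', 'life': 'butterflies'}
print(replace_words("There IS no life in this void.", repl))
# There are no butterflies in this void.
```

The word boundaries keep the "is" inside "this" untouched.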
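If you just want to replace specific fixed substrings, look no further than str.replace(). Keep in mind that it replaces every occurrence, including ones inside longer words (the 'is' in 'this' would also be replaced).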
s = "There is no life in the void."
s.replace('is', 'are').replace('life', 'butterflies') # => 'There are no butterflies in the void.'
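To illustrate where chained str.replace() calls and a word-boundary regex differ, here is a small comparison on a made-up sentence (the sentence is an assumption chosen to contain "is" inside another word):

```python
import re

s = "This is life."

# str.replace works on substrings, so the "is" inside "This" changes too.
naive = s.replace('is', 'are')

# \b anchors the match to word boundaries, so only the standalone "is" changes.
whole_words = re.sub(r'\bis\b', 'are', s)

print(naive)        # Thare are life.
print(whole_words)  # This are life.
```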
I have many fill-in-the-blank sentences in strings,
e.g. "6d) We took no [pains] to hide it ."
How can I efficiently parse this string (in Python) to be
"We took no to hide it"?
I also would like to be able to store the word in brackets (e.g. "pains") in a list for use later. I think the regex module could be better than Python string operations like split().
This will give you all the words inside the brackets.
import re
s="6d) We took no [pains] to hide it ."
matches = re.findall(r'\[(.*?)\]', s)
Then you can run this to remove all bracketed words.
re.sub(r'\[(.*?)\]', '', s)
Just for fun (to do the gathering and the substitution in one pass):
matches = []
def subber(m):
    matches.append(m.group(1))
    return ""
new_text = re.sub(r"\[(.*?)\]", subber, s)
print(new_text)
print(matches)
import re
s = 'this is [test] string'
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
Output
test
For your example you could use this regex:
(.*\))(.+)\[(.+)\](.+)
You will get four groups that you can use to build your resulting string, and you can save the third group for later use:
6d)
We took no
pains
to hide it .
I used .+ here because I don't know whether your strings always look like your example. You can change the .+ to an alphanumeric class or something more specific to your case.
import re
s = '6d) We took no [pains] to hide it .'
m = re.search(r"(.*\))(.+)\[(.+)\](.+)", s)
print(m.group(2) + m.group(4)) # "We took no to hide it ."
print(m.group(3)) # pains
import re
m = re.search(r".*\) (.*)\[.*\] (.*)", "6d) We took no [pains] to hide it .")
if m:
    g = m.groups()
    print(g[0] + g[1])
Output :
We took no to hide it .
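The answers above can be combined into one helper that returns both the cleaned sentence and the bracketed words. This is only a sketch: parse_blank is a hypothetical name, and the label pattern (word characters followed by a closing parenthesis at the start, as in "6d)") is an assumption based on the single example in the question.

```python
import re

def parse_blank(sentence):
    """Return (sentence without label and bracketed words, list of those words)."""
    # Collect the text inside every [...] pair.
    words = re.findall(r"\[(.*?)\]", sentence)
    # Remove the bracketed parts themselves.
    without_blanks = re.sub(r"\[.*?\]", "", sentence)
    # Assumed label format: something like "6d)" at the start of the string.
    without_label = re.sub(r"^\s*\w+\)\s*", "", without_blanks)
    # Collapse the double space left behind where the bracket was removed.
    cleaned = " ".join(without_label.split())
    return cleaned, words

cleaned, words = parse_blank("6d) We took no [pains] to hide it .")
print(cleaned)  # We took no to hide it .
print(words)    # ['pains']
```

Trailing punctuation such as the final " ." is left in place; stripping it would need a rule for which punctuation counts.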
I am close but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end with a space. This is what I want. However, this returns a Match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string that is prefixed with a #?
For example,
"Hi there #guy"
After the proper processing, I would get back
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print(p.group(1))
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
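Since re.search returns None when nothing matches, and a string may contain more than one #word, a slightly more defensive sketch could look like this (first_tag and all_tags are hypothetical helper names, and \w+ is assumed to be the right definition of a "word"):

```python
import re

TAG_RE = re.compile(r'#(\w+)')

def first_tag(text):
    """Return the first word prefixed with '#', or None if there is none."""
    m = TAG_RE.search(text)
    return m.group(1) if m else None

def all_tags(text):
    """Return every word prefixed with '#', in order of appearance."""
    # With one capturing group, findall returns the group contents directly.
    return TAG_RE.findall(text)

print(first_tag("Hi there #guy"))          # guy
print(first_tag("no tags here"))           # None
print(all_tags("Hi #guy, meet #girl22!"))  # ['guy', 'girl22']
```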
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print(p.group(1)) # first thing matched inside of ( .. )
But as usual with regex, there are tons of examples that break this. For example, if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
p.group(0) returns the entire matched text; to get just guy you need a capturing group, e.g. the pattern #(\w+), and then p.group(1). If you want to find out what methods an object has, you can use the built-in dir(p) function. This returns a list of the attributes and methods available on that object.
As is evident from the answers so far, a regex is a good fit for your problem. The answers differ slightly in which characters they allow to follow the #:
[^ ] anything but space
\w in Python 2 is equivalent to [A-Za-z0-9_] by default; in Python 3 it also matches Unicode letters and digits for str patterns unless the re.ASCII flag is set
If you have a better idea of which characters might be included in the user name, you can adjust your regex to reflect that; e.g., only lowercase ASCII letters would be:
[a-z]
NB: I skipped quantifiers for simplicity.
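The difference between these character classes shows up as soon as the name contains a non-ASCII letter. The following Python 3 sketch uses a made-up input string (an assumption for illustration) and adds quantifiers to the classes listed above:

```python
import re

text = "#jos\u00e9_92 says hi"  # "#josé_92 says hi"

# Default str pattern in Python 3: \w covers Unicode letters, digits and '_'.
unicode_name = re.search(r"#(\w+)", text).group(1)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the match stops before 'é'.
ascii_name = re.search(r"#(\w+)", text, re.ASCII).group(1)

# [^\W\d_] means "word character that is neither a digit nor '_'", i.e. letters.
letters_only = re.search(r"#([^\W\d_]+)", text).group(1)

# [^ ] takes everything up to the next space.
up_to_space = re.search(r"#([^ ]+)", text).group(1)

print(unicode_name, ascii_name, letters_only, up_to_space)
```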
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: "If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space." But this is incorrect: that pattern is a character class which will match ONE character from the set #, /, ., * and space. Note: the second / in the pattern is redundant.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.
In Perl it is possible to do something like this (I hope the syntax is right...):
$string =~ m/lalala(I want this part)lalala/;
$whatIWant = $1;
I want to do the same in Python and get the text inside the parentheses in a string, like $1.
If you want to get parts by name you can also do this:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
>>> m.groupdict()
{'first_name': 'Malcom', 'last_name': 'Reynolds'}
The example was taken from the re docs
See: Python regex match objects
>>> import re
>>> p = re.compile("lalala(I want this part)lalala")
>>> p.match("lalalaI want this partlalala").group(1)
'I want this part'
import re
astr = 'lalalabeeplalala'
match = re.search('lalala(.*)lalala', astr)
whatIWant = match.group(1) if match else None
print(whatIWant)
A small note: in Perl, when you write
$string =~ m/lalala(.*)lalala/;
the regexp can match anywhere in the string. The equivalent is accomplished with the re.search() function, not the re.match() function, which requires that the pattern match starting at the beginning of the string.
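A short sketch of that difference, using a made-up input in which the pattern does not start at the beginning of the string:

```python
import re

pattern = re.compile(r"lalala(.*)lalala")
text = "xxlalalabeeplalala"

# match() only tries at position 0, so the leading "xx" makes it fail.
anchored = pattern.match(text)

# search() scans forward and finds the pattern wherever it starts.
scanned = pattern.search(text)

print(anchored)          # None
print(scanned.group(1))  # beep
```

There is also re.fullmatch(), which additionally requires the pattern to consume the whole string.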
import re
data = "some input data"
m = re.search("some (input) data", data)
if m: # "if match was successful" / "if matched"
    print(m.group(1))
Check the docs for more.
There's no need for regex. Think simple.
>>> "lalala(I want this part)lalala".split("lalala")
['', '(I want this part)', '']
>>> "lalala(I want this part)lalala".split("lalala")[1]
'(I want this part)'
>>>
import re
match = re.match('lalala(I want this part)lalala', 'lalalaI want this partlalala')
print(match.group(1))
import re
string_to_check = "other_text...lalalaI want this partlalala...other_text"
p = re.compile("lalala(I want this part)lalala") # regex pattern
m = p.search(string_to_check) # use p.match if what you want is always at beginning of string
if m:
    print(m.group(1))
While converting a Perl program that parses function names out of modules to Python, I ran into this problem: I received an error saying that "group" was not defined. I soon realized that the exception was being raised because p.match / p.search return None when there is no match.
None has no group method, so calling it fails. To avoid the exception, check that a match was actually returned before calling group.
import re
filename = './file_to_parse.py'
p = re.compile(r'def (\w*)') # \w* greedily matches word characters (letters, digits, underscore)
with open(filename, 'r') as source_file:
    for each_line in source_file:
        m = p.match(each_line) # tries to match the pattern at the start of the line
        if m:
            print(m.group(1))
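Because p.match only tries the start of each line, the loop above skips indented definitions such as methods. A hedged variant that works on the whole source text at once is sketched below; function_names is a hypothetical helper, and it assumes that every line beginning with optional spaces or tabs followed by "def" is a function definition (so a "def" inside a multi-line string would be a false positive):

```python
import re

# ^ with re.MULTILINE matches at the start of every line;
# [ \t]* allows indentation before "def".
DEF_RE = re.compile(r'^[ \t]*def\s+(\w+)', re.MULTILINE)

def function_names(source):
    """Return the names of all functions and methods defined in source."""
    return DEF_RE.findall(source)

sample = (
    "def top_level():\n"
    "    pass\n"
    "\n"
    "class Parser:\n"
    "    def parse(self):\n"
    "        pass\n"
    "x = 1\n"
)
print(function_names(sample))  # ['top_level', 'parse']
```

findall returns an empty list when nothing matches, so no None check is needed here.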