Let's say I have a string like this:
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
and I want to turn it into
'(xy09 and foobar or (abc123 and something))'
then - in this particular case - I could simply do
s.replace('X_', "")
which gives the desired output.
However, in my actual data there might be not only X_ but also other prefixes, so the above replace statement does not work.
What I would need instead is a replacement of
a capital letter followed by an underscore and an arbitrary sequence of letters and numbers
by
everything after the first underscore.
So, to extract the desired elements I could use:
import re
print(re.findall('[A-Z]{1}_[a-zA-Z0-9]+', s))
which prints
['X_xy09', 'X_foobar', 'X_abc123', 'X_something']
How can I now replace those elements so that I obtain
'(xy09 and foobar or (abc123 and something))'
?
If you need to remove an uppercase ASCII letter and the underscore after it, but only when the letter is not preceded by a word character and the underscore is followed by an alphanumeric character, you can use
import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
print(re.sub(r'\b[A-Z]_([a-zA-Z0-9])', r'\1', s))
See the Python demo and a regex demo.
Pattern details
\b - a leading word boundary
[A-Z]_ - an ASCII uppercase letter and _
([a-zA-Z0-9]) - Group 1 (referenced as \1 in the replacement pattern): one alphanumeric char, which the replacement puts back.
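To see what the leading \b contributes, here is a minimal sketch that runs the pattern with and without it. The input string is made up for illustration (it is not from the question) and mixes several one-letter prefixes with one identifier, myX_var, that contains a capital letter and underscore in the middle of a word.

```python
import re

# Hypothetical input: several one-letter prefixes plus an identifier
# (myX_var) that has "X_" in the middle of a word.
s = '(A_xy09 and B_foobar or (C_abc123 and myX_var))'

# With \b, a prefix is removed only when it starts a word.
with_boundary = re.sub(r'\b[A-Z]_([a-zA-Z0-9])', r'\1', s)

# Without \b, the "X_" inside myX_var is removed as well.
without_boundary = re.sub(r'[A-Z]_([a-zA-Z0-9])', r'\1', s)

print(with_boundary)     # (xy09 and foobar or (abc123 and myX_var))
print(without_boundary)  # (xy09 and foobar or (abc123 and myvar))
```

Whether myX_var should be left alone depends on your data; the word boundary is the part of the pattern that makes that choice.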
If you just need to replace a capital letter followed by an underscore, you can use the regular expression r'[A-Z]_'.
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
re.sub(r'[A-Z]_', '', s)
You may need to extend it if you have other criteria not mentioned (for example, if some of your target values follow a word boundary and some follow parentheses). For input like XY_something, this pattern removes only Y_ and leaves Xsomething, which may or may not be the output you expect.
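A short sketch of the XY_something case, comparing three variants. The third pattern, which accepts a prefix of several capital letters, is an assumption about what might be wanted and is not something the question asks for.

```python
import re

s = 'XY_something'

# Unanchored single-letter pattern: only "Y_" is removed.
unanchored = re.sub(r'[A-Z]_', '', s)

# Word-boundary version: "Y" does not start a word, so nothing is removed.
anchored = re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', s)

# Assumption: multi-letter prefixes should be stripped too, hence [A-Z]+.
multi_letter = re.sub(r'\b[A-Z]+_(?=[a-zA-Z0-9])', '', s)

print(unanchored)    # Xsomething
print(anchored)      # XY_something
print(multi_letter)  # something
```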
Another re.sub() approach:
import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
result = re.sub(r'[A-Z]_(?=[a-zA-Z0-9]+)', '', s)
print(result)
The output:
(xy09 and foobar or (abc123 and something))
[A-Z]_(?=[a-zA-Z0-9]+) - (?=...) is a positive lookahead assertion; it ensures that the substituted [A-Z]_ substring is followed by an alphanumeric sequence [a-zA-Z0-9]+ without consuming that sequence.
You could use re.sub() with a lookahead assertion:
>>> import re
>>> s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
>>> re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', s)
'(xy09 and foobar or (abc123 and something))'
From the docs:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
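The example from the docs can be tried directly. This is a small sketch; the sample sentence is made up. It shows that the lookahead text is checked but is not part of the match, so re.sub() leaves it in place.

```python
import re

text = 'Isaac Asimov and Isaac Newton'  # made-up sample sentence

m = re.search(r'Isaac (?=Asimov)', text)
matched = m.group()  # the match is 'Isaac ' only; 'Asimov' is not consumed
end = m.end()        # the match ends right before 'Asimov'

# Only the 'Isaac ' that is followed by 'Asimov' is removed.
result = re.sub(r'Isaac (?=Asimov)', '', text)

print(repr(matched), end)  # 'Isaac ' 6
print(result)              # Asimov and Isaac Newton
```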
In this problem I'm trying to find the symbols and spaces between two alphanumeric characters. I am using regular expressions, but I cannot get the result I want. Any tips for this code are appreciated (regex solutions only):
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\w(.[#_!#$%^&*()<>?/\|}{~:\s]*)\w'  # needs fixing
re.findall(regex_pattern, s)
Output is:
['h', '$#', '% ', 't']
Expected output is:
['$#', '% ']
You can try this simple pattern:
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'(?<=\w)[^\w]+?(?=\w)'
print(re.findall(regex_pattern, s))
Output:
['$#', '% ']
Basically, the pattern (?<=\w)[^\w]+?(?=\w) searches for runs of one or more non-word characters that sit between two word characters.
Using a regex find all approach:
s = "This$#is% Matrix# %!"
matches = re.findall(r'(?<=\w)[#_!#$%^&*()<>?/\|}{~:\s]+(?=\w)', s)
print(matches) # ['$#', '% ']
This approach is similar to yours, except that it simply searches for a sequence of symbols or whitespace characters which are surrounded on both sides by word characters.
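One practical difference between lookarounds and a pattern that consumes the surrounding word characters shows up when two runs of symbols share a word character. The sketch below uses a made-up input and a simplified \W+ in place of the explicit symbol class.

```python
import re

s = 'a#b#c'  # made-up input: 'b' sits between two symbol runs

# Consuming version: the first match swallows 'a#b', so 'b' is no longer
# available as the left-hand word character for the second '#'.
consuming = re.findall(r'\w(\W+)\w', s)

# Lookaround version: the word characters are only checked, not consumed.
lookaround = re.findall(r'(?<=\w)\W+(?=\w)', s)

print(consuming)   # ['#']
print(lookaround)  # ['#', '#']
```

For the sample string in the question both versions give the same result, because no single word character separates two symbol runs there.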
Your regex uses a . followed by a character class with the quantifier * (0 or more), so the . can match a word character and you get matches with no non-alphanumeric characters in between; drop the . and use + to match one or more non-alphanumeric characters:
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\w([#_!#$%^&*()<>?/\|}{~:\s]+)\w'  # + requires at least one symbol or space
print(re.findall(regex_pattern, s) )
Output:
['$#', '% ']
My 'trick' is to use a tool such as regex101.com to make sure the regex works before moving to code, and to build up the regex one step at a time, so that when it stops matching you know the most recent step caused the problem.
Your shortest solution is
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\b\W+\b'
print( re.findall(regex_pattern, s) ) # => ['$#', '% ']
See the online Python demo.
Why it works
\b - a word boundary; because it is followed by \W, it matches the position right after a word char (i.e. a letter, digit or _)
\W+ - one or more non-word chars, i.e. chars other than letters, digits and underscores
\b - a word boundary; because it comes right after \W, it matches the position immediately before a word char.
See the regex demo.
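A quick check of the \b\W+\b behaviour on a different, made-up string. It illustrates two consequences of the breakdown above: an underscore counts as a word character, and a run of symbols at the end of the string is not matched because no word character follows it.

```python
import re

s = 'foo_bar$baz -- qux!'  # made-up input

matches = re.findall(r'\b\W+\b', s)

# '_' is a word char, so it is never part of a match;
# the trailing '!' has no word char after it, so it is skipped.
print(matches)  # ['$', ' -- ']
```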
Is there a way to add space between the characters of a string such as the following: 'abakə̃tə̃'?
The usual ' '.join('abakə̃tə̃') approach returns 'a b a k ə ̃ t ə ̃', I am looking for 'a b a k ə̃ t ə̃'.
Thanks in advance.
You can use re.findall with a pattern that matches a word character optionally followed by a non-word character (which matches an accent):
import re
s = 'abakə̃tə̃'
print(' '.join(re.findall(r'\w\W?', s)))
For Python 3.7+, where zero-width patterns are allowed in re.split, you can use a lookbehind and a lookahead to split the string at positions that are preceded by any character and followed by a word character:
print(' '.join(re.split(r'(?<=.)(?=\w)', s)))
Both of the above would output:
a b a k ə̃ t ə̃
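As an alternative that does not rely on regex, here is a sketch that groups each base character with the combining marks that follow it, using unicodedata.combining() from the standard library. The helper name space_out is made up for this example. Unlike \w\W?, it keeps several combining marks attached to the same base character, but it is not a full grapheme-cluster segmenter.

```python
import unicodedata

def space_out(text):
    """Join base characters with spaces, keeping combining marks attached."""
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return ' '.join(clusters)

# Same string as in the question, written with escapes:
# U+0259 is the schwa, U+0303 is the combining tilde.
s = 'abak\u0259\u0303t\u0259\u0303'
print(space_out(s))
```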
I am trying to implement a regex which splits the string on math operators, but not on the hyphens that occur inside the text:
dummy_string= "I Dont_Know The-Meaning_2018-You Know_Meaning_2017+You Know_Meaning_2017"
string_list = re.split("[0-9][+/*\-][A-Za-z]", dummy_string)
print(string_list)
>>['I Dont_Know The-Meaning_201', 'ou Know_Meaning_201', 'ou Know_Meaning_2017']
Expected Output:
>>['I Dont_Know The-Meaning_2018', 'You Know_Meaning_2017', 'You Know_Meaning_2017']
I am using regex (re) package for this.
You may use (?<=[0-9]) and (?=[A-Za-z]) lookarounds instead of consuming patterns:
import re
dummy_string= "I Dont_Know The-Meaning_2018-You Know_Meaning_2017+You Know_Meaning_2017"
string_list = re.split("(?<=[0-9])[+/*-](?=[A-Za-z])", dummy_string)
print(string_list)
# => ['I Dont_Know The-Meaning_2018', 'You Know_Meaning_2017', 'You Know_Meaning_2017']
See the Python demo
When you use [0-9][+/*\-][A-Za-z] to split a string, the digit before a non-word delimiter and a letter after it are consumed, i.e. added to the match value, and re.split removes this text from the resulting output. When using lookarounds, the matched texts remain "unconsumed", they are not added to the match value and thus remain in the re.split output.
Note that you do not have to escape - when it is at the end of the character class, [+/*-] = [+/*\-]. If you plan to add more chars into the class, you may keep - escaped to avoid further issues.
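The kind of issue an unescaped - can cause is easy to reproduce. In this sketch (the input string is made up), moving - between two other characters turns it into a range operator, so the class silently matches , and . as well.

```python
import re

s = 'a,b.c+d-e'  # made-up input

# '-' at the end of the class is a literal hyphen.
at_end = re.findall(r'[+/*-]', s)

# '-' between '+' and '/' defines the range '+' .. '/',
# which also covers ',' and '.'.
as_range = re.findall(r'[+-/]', s)

# Escaping keeps it literal wherever it is placed.
escaped = re.findall(r'[+\-/]', s)

print(at_end)    # ['+', '-']
print(as_range)  # [',', '.', '+', '-']
print(escaped)   # ['+', '-']
```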
I want to add space between Persian number and Persian letter like this:
"سعید123" convert to "سعید 123"
Java code of this procedure is like below.
str.replaceAll("(?<=\\p{IsDigit})(?=\\p{IsAlphabetic})", " ").
But I can't find any python solution.
There is a short regex which you can rely on to match the boundary between letters and digits (in any language):
\d(?=[^_\d\W])|[^_\d\W](?=\d)
Live demo
Breakdown:
\d Match a digit
(?=[^_\d\W]) Preceding a letter from a language
| Or
[^_\d\W] Match a letter from a language
(?=\d) Preceding a digit
Python:
re.sub(r'\d(?=[^_\d\W])|[^_\d\W](?=\d)', r'\g<0> ', str, flags = re.UNICODE)
But according to this answer, this is the right way to accomplish this task:
re.sub(r'\d(?=[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی])|[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی](?=\d)', r'\g<0> ', str, flags = re.UNICODE)
I am not sure if this is a correct approach.
import re
k = "سعید123"
m = re.search(r"(\d+)", k)  # first run of digits
if m:
    # Put the digits first, then a space, then the rest of the string
    k = " ".join([m.group(), k.replace(m.group(), "")])
print(k)
Output:
123 سعید
You may use
re.sub(r'([^\W\d_])(\d)', r'\1 \2', s, flags=re.U)
Note that in Python 3.x, re.U flag is redundant as the patterns are Unicode aware by default.
See the online Python demo and a regex demo.
Pattern details
([^\W\d_]) - Capturing group 1: any Unicode letter (literally, any char other than a non-word char, a digit or an underscore)
(\d) - Capturing group 2: any Unicode digit
The replacement pattern is a combination of the Group 1 and 2 placeholders (referring to corresponding captured values) with a space in between them.
You may use a variation of the regex with a lookahead:
re.sub(r'[^\W\d_](?=\d)', r'\g<0> ', s)
See this regex demo.
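The patterns above handle a letter followed by a digit. If the digits may also come first, both directions can be covered with one alternation. This is a sketch for Python 3, where patterns are Unicode-aware by default; the helper name add_space is made up for this example.

```python
import re

def add_space(text):
    """Insert a space wherever a letter meets a digit, in either order."""
    # \d(?=letter)  : a digit that is followed by a letter
    # letter(?=\d)  : a letter that is followed by a digit
    return re.sub(r'\d(?=[^\W\d_])|[^\W\d_](?=\d)', r'\g<0> ', text)

word = 'سعید'
print(add_space(word + '123'))  # letters, a space, then the digits
print(add_space('123' + word))  # digits, a space, then the letters
```

Because \d matches any Unicode decimal digit, the same call should also separate Persian digits such as ۱۲۳ from the letters next to them.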
I wish to find all words that start with "Am" and this is what I tried so far with python
import re
my_string = "America's mom, American"
re.findall(r'\b[Am][a-zA-Z]+\b', my_string)
but this is the output that I get
['America', 'mom', 'American']
Instead of what I want
['America', 'American']
I know that in regex [Am] means match either A or m, but is it possible to match A and m as well?
The [Am], a positive character class, matches either A or m. To match a sequence of chars, you need to use them one after another.
Remove the brackets:
import re
my_string = "America's mom, American"
print(re.findall(r'\bAm[a-zA-Z]+\b', my_string))
# => ['America', 'American']
See the Python demo
Pattern details:
\b - a word boundary
Am - a string of chars matched as a sequence Am
[a-zA-Z]+ - 1 or more ASCII letters
\b - a word boundary.
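If the prefix comes from a variable, the same idea can be wrapped in a small function. This is only a sketch: words_starting_with is a made-up helper, re.escape() is used so that special characters in the prefix are treated literally, and [a-zA-Z]* (rather than +) is an assumption so that a word equal to the prefix itself is also returned.

```python
import re

def words_starting_with(prefix, text):
    """Return the ASCII-letter words in text that begin with prefix."""
    pattern = r'\b' + re.escape(prefix) + r'[a-zA-Z]*\b'
    return re.findall(pattern, text)

my_string = "America's mom, American"
print(words_starting_with('Am', my_string))  # ['America', 'American']
print(words_starting_with('mo', my_string))  # ['mom']
```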
Don't use a character class:
import re
my_string = "America's mom, American"
re.findall(r'\bAm[a-zA-Z]+\b', my_string)
re.findall(r'(Am\w+)', my_string, re.I)