In this problem I'm trying to find the symbols and spaces that sit between two alphanumeric characters. I am using regular expressions, but I cannot get the result I want. Any tips for fixing this code are appreciated (regex solutions only):
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\w(.[#_!#$%^&*()<>?/\|}{~:\s]*)\w'  # needs to be fixed
re.findall(regex_pattern, s)
Output is:
['h', '$#', '% ', 't']
Expected output is:
['$#', '% ']
You can try this simple pattern:
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'(?<=\w)[^\w]+?(?=\w)'
print(re.findall(regex_pattern, s))
Output:
['$#', '% ']
Basically, the pattern (?<=\w)[^\w]+?(?=\w) searches for runs of one or more non-word characters that sit between two word characters (letters, digits or underscore). The lookbehind (?<=\w) and the lookahead (?=\w) only check the neighbors; they are not included in the match.
Using an re.findall approach:
import re
s = "This$#is% Matrix# %!"
matches = re.findall(r'(?<=\w)[#_!#$%^&*()<>?/\|}{~:\s]+(?=\w)', s)
print(matches) # ['$#', '% ']
This approach is similar to yours, except that it simply searches for a sequence of symbols or whitespace characters which are surrounded on both sides by word characters.
Your regex uses the * quantifier (0 or more) to match the series of non-alpha chars, so the leading . can match a letter on its own and you get matches with no non-alpha characters in between; drop the . and use + to match one or more non-alpha chars:
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\w([#_!#$%^&*()<>?/\|}{~:\s]+)\w'
print(re.findall(regex_pattern, s))
Output:
['$#', '% ']
My 'trick' is to use a tool such as regex101.com to make sure the regex works before going to code, and to build the regex up one step at a time, so that when it stops matching you know the most recent step caused the problem.
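One difference between the fixes above shows up on a small input: a pattern that consumes the surrounding \w characters cannot reuse a character for the next match, while a lookaround pattern can. A minimal sketch, assuming a made-up input string "a$b%c" and using \W as shorthand for the symbol class:

```python
import re

s = "a$b%c"  # hypothetical input: two symbol runs share the letter 'b'

# \w on both sides is consumed, so 'b' cannot start the next match.
consuming = re.findall(r'\w(\W+)\w', s)

# Lookarounds only inspect the neighbours, so 'b' serves both matches.
lookaround = re.findall(r'(?<=\w)\W+(?=\w)', s)

print(consuming)   # ['$']
print(lookaround)  # ['$', '%']
```

For the sample string in the question both forms give the same result, because no single word character sits between two symbol runs.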
Your shortest solution is
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\b\W+\b'
print( re.findall(regex_pattern, s) ) # => ['$#', '% ']
See the online Python demo.
Why it works
\b - the word boundary followed with \W pattern matches a location that is right after a word char (i.e. a letter, digit or _)
\W+ - matches one or more non-word chars, the chars other than letters, digits and underscores
\b - right after \W, the word boundary matches a location that is immediately followed with a word char.
See the regex demo.
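One caveat with \b\W+\b is worth a quick check: \w (and therefore \b) treats the underscore as a word character, so an underscore between letters is not reported, whereas a class that lists _ explicitly does report it. A small sketch, where the second input string is made up for illustration:

```python
import re

s = "This$#is% Matrix# %!"
from_question = re.findall(r'\b\W+\b', s)
print(from_question)  # ['$#', '% ']

t = "snake_case-name"  # hypothetical input containing an underscore
boundary_based = re.findall(r'\b\W+\b', t)
class_based = re.findall(r'(?<=\w)[_\W]+(?=\w)', t)
print(boundary_based)  # ['-']
print(class_based)     # ['_', '-']
```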
Related
I have a string Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23.
I want to remove all the numbers that come right after a word and right before a dot (or the end of the string).
The first part of the string, i.e. "Node57Name123", should be ignored.
Digits inside words should not be removed.
I tried re.sub(r"\d+","",string) but it removed all the other digits as well.
The output should look like this: "Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape"
Can you please point me in the right direction?
You can use
re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text)
See the regex demo.
Details:
^([^.]*\.) - zero or more chars other than a dot and then a . char at the start of the string captured into Group 1 (referred to with \1 from the replacement pattern)
| - or
\d+(?![^.]) - one or more digits followed with a dot or end of string (=(?=\.|$)).
See the Python demo:
import re
text = r'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print( re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text) )
## => Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
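Note that the r'\1' replacement relies on Python 3.5+, where a group that did not take part in the match is replaced with an empty string (earlier versions raised an error). A sketch of the same idea with a replacement function, which makes the "put back Group 1, drop everything else" logic explicit; the helper name strip_trailing_digits is hypothetical:

```python
import re

PATTERN = re.compile(r'^([^.]*\.)|\d+(?![^.])')

def strip_trailing_digits(text):
    """Hypothetical helper: drop digits that end a dot-separated part,
    leaving the first part untouched."""
    # Group 1 is set only when the first alternative matched; put it back.
    # Otherwise the digits alternative matched and is replaced by ''.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

text = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print(strip_trailing_digits(text))
# Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
```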
Just to give you a non-regex alternative using rstrip(): we can feed this method a set of characters to remove from the right of the string, e.g. rstrip('0123456789'). Alternatively, we can use the digits constant from the string module:
from string import digits
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = '.'.join([s.split('.')[0]] + [i.rstrip(digits) for i in s.split('.')[1:]])
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
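The same non-regex idea can be written so the string is split only once. This is just a sketch; the helper name strip_digits_after_first is hypothetical, and string.digits covers ASCII digits only:

```python
from string import digits

def strip_digits_after_first(s):
    """Hypothetical helper: keep the first dot-separated part as is and
    strip trailing ASCII digits from every later part."""
    head, sep, tail = s.partition('.')
    if not sep:
        # No dot at all: there is nothing after the first part to clean.
        return s
    return head + sep + '.'.join(part.rstrip(digits) for part in tail.split('.'))

s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print(strip_digits_after_first(s))
# Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
```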
EDIT:
If you must use a regular pattern, it seems that the following covers your sample:
(\.[^.]*?)\d+\b
Replace with the 1st capture group, see the online demo
( - Open capture group:
\.[^.]*? - A literal dot followed by 0+ non-dot characters (lazy).
) - Close capture group.
\d+\b - Match 1+ digits up to a word-boundary.
A sample:
import re
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = re.sub(r'(\.[^.]*?)\d+\b', r'\1', s)
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
I want to add space between Persian number and Persian letter like this:
"سعید123" convert to "سعید 123"
The Java code for this procedure is shown below.
str.replaceAll("(?<=\\p{IsDigit})(?=\\p{IsAlphabetic})", " ").
But I can't find a Python solution.
There is a short regex which you may rely on to match boundary between letters and digits (in any language):
\d(?=[^_\d\W])|[^_\d\W](?=\d)
Live demo
Breakdown:
\d Match a digit
(?=[^_\d\W]) Followed by a letter (in any language)
| Or
[^_\d\W] Match a letter (in any language)
(?=\d) Followed by a digit
Python:
re.sub(r'\d(?=[^_\d\W])|[^_\d\W](?=\d)', r'\g<0> ', str, flags = re.UNICODE)
But according to this answer, this is the right way to accomplish this task:
re.sub(r'\d(?=[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی])|[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی](?=\d)', r'\g<0> ', str, flags = re.UNICODE)
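A quick sketch of the generic letter/digit boundary pattern in use. The helper name space_letters_and_digits is hypothetical. Positive lookaheads are used so that a digit or letter at the very end of the string does not get a trailing space:

```python
import re

# Letter/digit boundary in either direction, using positive lookaheads.
BOUNDARY = re.compile(r'\d(?=[^_\d\W])|[^_\d\W](?=\d)')

def space_letters_and_digits(text):
    """Hypothetical helper: insert a space wherever a letter meets a digit."""
    return BOUNDARY.sub(r'\g<0> ', text)

print(space_letters_and_digits("سعید123"))
print(space_letters_and_digits("abc123def"))  # abc 123 def
```

In Python 3, str patterns are Unicode-aware by default, so no flag is needed here.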
I am not sure if this is a correct approach.
import re
k = "سعید123"
m = re.search(r"(\d+)", k)
if m:
    k = " ".join([m.group(), k.replace(m.group(), "")])
print(k)
Output:
123 سعید
You may use
re.sub(r'([^\W\d_])(\d)', r'\1 \2', s, flags=re.U)
Note that in Python 3.x, re.U flag is redundant as the patterns are Unicode aware by default.
See the online Python demo and a regex demo.
Pattern details
([^\W\d_]) - Capturing group 1: any Unicode letter (literally, any char other than a non-word char, a digit or an underscore)
(\d) - Capturing group 2: any Unicode digit
The replacement pattern is a combination of the Group 1 and 2 placeholders (referring to corresponding captured values) with a space in between them.
You may use a variation of the regex with a lookahead:
re.sub(r'[^\W\d_](?=\d)', r'\g<0> ', s)
See this regex demo.
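To see what the [^\W\d_] class actually selects, here is a small sketch on a made-up mixed string; it assumes Python 3, where \w and \d are Unicode-aware for str patterns:

```python
import re

sample = "abc_123 سعید"  # hypothetical input mixing scripts, digits and '_'

# [^\W\d_] = word characters minus digits minus underscore, i.e. letters.
letters = re.findall(r'[^\W\d_]+', sample)
print(letters)

# The lookahead variant adds a space after a letter that precedes a digit.
spaced = re.sub(r'[^\W\d_](?=\d)', r'\g<0> ', "x9 سعید123")
print(spaced)
```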
I have a text like this format,
s = '[aaa]foo[bbb]bar[ccc]foobar'
Actually the text is a Chinese car review like this:
【最满意】整车都很满意,最满意就是性价比,...【空间】空间真的超乎想象,毫不夸张,...【内饰】内饰还可以吧,没有多少可以说的...
Now I want to split it to these parts
[aaa]foo
[bbb]bar
[ccc]foobar
first I tried
>>> re.findall(r'\[.*?\].*?',s)
['[aaa]', '[bbb]', '[ccc]']
only got first half.
Then I tried
>>> re.findall(r'(\[.*?\].*?)\[?',s)
['[aaa]', '[bbb]', '[ccc]']
still only got first half
In the end I had to get the two parts separately and then zip them:
>>> re.findall(r'\[.*?\]',s)
['[aaa]', '[bbb]', '[ccc]']
>>> re.split(r'\[.*?\]',s)
['', 'foo', 'bar', 'foobar']
>>> for t in zip(re.findall(r'\[.*?\]',s),[e for e in re.split(r'\[.*?\]',s) if e]):
...     print(''.join(t))
...
[aaa]foo
[bbb]bar
[ccc]foobar
So I want to know whether there is a regex that could split it into these parts directly.
One of the approaches:
import re
s = '[aaa]foo[bbb]bar[ccc]foobar'
result = re.findall(r'\[[^]]+\][^\[\]]+', s)
print(result)
The output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
\[ or \] - matches the bracket literally
[^]]+ - matches one or more characters except ]
[^\[\]]+ - matches any character(s) except brackets \[\]
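The same negated-class idea carries over to the real review text, which uses the full-width brackets 【 and 】 and contains punctuation that \w+ would stop at. A sketch, assuming a shortened version of the sample text from the question:

```python
import re

# Shortened, hypothetical version of the review text from the question.
text = '【最满意】整车都很满意,最满意就是性价比【空间】空间真的超乎想象【内饰】内饰还可以吧'

# A tag in full-width brackets, then everything up to the next opening bracket.
parts = re.findall(r'【[^】]+】[^【]+', text)
for part in parts:
    print(part)
```

Because the body is matched with a negated class rather than \w+, commas and other punctuation inside a section stay with that section.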
I think this could work:
r'\[.+?\]\w+'
Here it is:
>>> re.findall(r"(\[\w*\]\w+)",s)
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Explanation:
The parentheses define the group to capture. Which group?
It should start with an opening bracket \[ followed by some word characters \w*,
then the matching closing bracket \] followed by more word characters \w+.
Note that you have to escape the brackets with \.
I think that if the input string format is "strict enough", it's possible to try something without regexps. It may look like a micro-optimisation, but it could be interesting as a challenge.
result = map(lambda x: '[' + x, s[1:].split("["))
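I checked the performance over 1 million iterations; here are my results (in seconds). Note that in Python 3 map returns a lazy iterator, so wrap the call in list() if you need the actual list: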
result = map(lambda x: '[' + x, s[1:].split("[")) # 0.89862203598
result = re.findall(r'\[[^]]+\][^\[\]]+', s) # 1.48306798935
result = re.findall(r'\[.+?\]\w+', s) # 1.47224497795
result = re.findall(r'(\[\w*\]\w+)', s) # 1.47370815277
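A sketch of how such a comparison could be reproduced with timeit. The function names are hypothetical and the iteration count is reduced to keep the run short; absolute numbers will differ by machine and Python version:

```python
import re
import timeit

s = '[aaa]foo[bbb]bar[ccc]foobar'

def with_split():
    # list() forces evaluation; a bare map() is lazy in Python 3.
    return list(map(lambda x: '[' + x, s[1:].split('[')))

def with_findall():
    return re.findall(r'\[[^]]+\][^\[\]]+', s)

for fn in (with_split, with_findall):
    seconds = timeit.timeit(fn, number=100_000)
    print(f'{fn.__name__}: {seconds:.3f}s')
```

Both functions return the same list for this input, so the timing compares like with like.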
\[.*?\][a-zA-Z]*
This regex should capture anything that starts with [something here] followed by any letters from a to z, in either case.
You can experiment on regex101 to try out different patterns; it makes it easy to build your own regex.
All you need is findall, and here is a very simple pattern that does not overcomplicate things:
import re
print(re.findall(r'\[\w+\]\w+','[aaa]foo[bbb]bar[ccc]foobar'))
output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Detailed solution:
import re
string_1='[aaa]foo[bbb]bar[ccc]foobar'
pattern=r'\[\w+\]\w+'
print(re.findall(pattern,string_1))
explanation:
\[\w+\]\w+
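\[ matches the character [ literally
\w+ matches one or more word characters (letters, digits and underscore; in Python 3 this is Unicode-aware for str patterns, so it is broader than [a-zA-Z0-9_] unless re.ASCII is set)
\] matches the character ] literally
+ quantifier: matches between one and unlimited times, as many times as possible, giving back as needed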
I wish to find all words that start with "Am" and this is what I tried so far with python
import re
my_string = "America's mom, American"
re.findall(r'\b[Am][a-zA-Z]+\b', my_string)
but this is the output that I get
['America', 'mom', 'American']
Instead of what I want
['America', 'American']
I know that in regex [Am] means match either A or m, but is it possible to match A followed by m instead?
The [Am], a positive character class, matches either A or m. To match a sequence of chars, you need to use them one after another.
Remove the brackets:
import re
my_string = "America's mom, American"
print(re.findall(r'\bAm[a-zA-Z]+\b', my_string))
# => ['America', 'American']
See the Python demo
This pattern details:
\b - a word boundary
Am - a string of chars matched as a sequence Am
[a-zA-Z]+ - 1 or more ASCII letters
\b - a word boundary.
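The difference between the character class and the literal sequence can be checked side by side. A short sketch; the third input string is made up to show the optional re.IGNORECASE flag:

```python
import re

my_string = "America's mom, American"

# [Am] is ONE character that may be 'A' or 'm', so 'mom' slips through.
with_class = re.findall(r'\b[Am][a-zA-Z]+\b', my_string)
# Am is the two-character sequence 'A' followed by 'm'.
with_sequence = re.findall(r'\bAm[a-zA-Z]+\b', my_string)
# With re.IGNORECASE a lowercase 'am' start is accepted as well.
any_case = re.findall(r'\bAm[a-zA-Z]+\b', "America's amber mom", re.IGNORECASE)

print(with_class)     # ['America', 'mom', 'American']
print(with_sequence)  # ['America', 'American']
print(any_case)       # ['America', 'amber']
```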
Don't use a character class:
import re
my_string = "America's mom, American"
print(re.findall(r'\bAm[a-zA-Z]+\b', my_string))
re.findall(r'(Am\w+)', my_string, re.I)
I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in them, but not at the beginning or at the end of the word.
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
This means: match one or more word characters (the + means "at least once"), then a '-', then one or more word characters again.
str is a built-in name, so it is better not to use it as a variable name:
import re
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match in words with hyphens at the beginning or end. A string like "-not-wanted-" will still produce a match with both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
regex101 demo
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
print(result)  # ['semi-column', 'one-word', 'multi-hyphenated-word']
ideone demo
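To make the effect of the extra lookarounds concrete, here is a sketch comparing the simple pattern with the stricter one on a made-up input:

```python
import re

text = "word semi-column -not-wanted- multi-hyphenated-word"  # hypothetical input

simple = re.findall(r'\w+-\w+', text)
strict = re.findall(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])', text)

print(simple)  # ['semi-column', 'not-wanted', 'multi-hyphenated']
print(strict)  # ['semi-column', 'multi-hyphenated-word']
```

The simple pattern picks up the inside of "-not-wanted-" and truncates the word with two hyphens; the strict one rejects the first and keeps the second whole.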
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, I match in both directions until I hit whitespace or another hyphen, so if a word is surrounded by hyphens (e.g. -test-cats-) the outer hyphens are not included in the match. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
# inside a character class, | is a literal character, so [^\s-] is all that is needed
m = re.search(r'([^\s-]+-[^\s-]+)', st)
if m:
    print(m.group(1))