Given a name string, I want to validate a few basic conditions:
- The characters belong to a recognized script/alphabet (Latin, Chinese, Arabic, etc.) and aren't, say, emojis.
- The string doesn't contain digits and is of length < 40.
I know the latter can be accomplished via regex, but is there a Unicode-aware way to accomplish the first? Are there any text processing libraries I can leverage?
You should be able to check this using the Unicode Character classes in regex.
[\p{P}\s\w]{40,}
The most important part here is the \w character class using Unicode mode:
\p{P} matches any kind of punctuation character
\s matches any kind of invisible character (equal to [\p{Z}\h\v])
\w matches any word character in any script (equal to [\p{L}\p{N}_])
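You may want to add more, such as \p{Sc} to match currency symbols. Note that \w also matches digits and that {40,} matches runs of 40 or more characters, so to enforce the conditions in the question (no digits, fewer than 40 characters) an anchored pattern along the lines of ^[\p{L}\p{M}\p{P}\s]{1,39}$ may be closer to what you need.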
But to be able to take advantage of this, you need to use the regex module (an alternative to the standard re module) that supports Unicode codepoint properties with the \p{} syntax.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import regex as re
regex = r"[\p{P}\s\w]{40,}"
test_str = ("Wow cool song!Wow cool song!Wow cool song!Wow cool song! 🕺🏻 \nWow cool song! 🕺🏻Wow cool song! 🕺🏻Wow cool song! 🕺🏻\n")
matches = re.finditer(regex, test_str, re.UNICODE | re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capturing group (groups are numbered from 1)
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
PS: .NET Regex gives you some more options like \p{IsGreek}.
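If installing the third-party regex module is not an option, the same idea can be approximated with the standard library. The following is only a sketch: is_valid_name and ALLOWED_SEPARATORS are hypothetical names, and the set of permitted separators is an assumption you would adjust to your own data. It accepts characters whose Unicode general category is a letter (L*) or a combining mark (M*), which excludes digits and emoji.

```python
import unicodedata

# Assumption: separators we are willing to accept inside a name.
ALLOWED_SEPARATORS = {" ", "-", "'", "."}


def is_valid_name(name, max_len=40):
    """Return True if name is non-empty, shorter than max_len, and made of
    letters or combining marks (any script) plus a few separators."""
    if not name or len(name) >= max_len:
        return False
    has_letter = False
    for ch in name:
        if ch in ALLOWED_SEPARATORS:
            continue
        # The first character of the category code is the major class:
        # "L" = letter, "M" = mark, "N" = number, "S" = symbol, ...
        major = unicodedata.category(ch)[0]
        if major == "L":
            has_letter = True
        elif major != "M":
            return False
    return has_letter


if __name__ == "__main__":
    for candidate in ["José García", "李小龍", "R2D2", "Wow 🕺🏻"]:
        print(candidate, is_valid_name(candidate))
```

Checking the general category does not verify that all characters come from the same script; it only rules out digits, symbols and emoji.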
Please, I'm trying to grab some parameters from a string. The parameters start with : or $ and are enclosed between brackets.
Ex:
some text [more text :Parameter1] more text [more (:Parameter2)]
My goal is to get two matches as the following:
Full match: [more text :Parameter1]
Group 1: :Parameter1
Full match: [more (:Parameter2)]
Group 1: :Parameter2
The following regex almost works, except when the parameter itself is enclosed in parentheses, like Parameter2.
r"\[.*?([:\$].*?)]"
In these cases I get:
Full match: [more (:Parameter2)]
Group 1: :Parameter2)
Note that group 1 includes the closing parenthesis.
I couldn't find a way to remove it. I'd appreciate any help.
Thanks.
If you want the parameter to be between the opening and the matching closing parenthesis, you might make use of negated character classes [^][()$:] to match any character that is not in the character class.
To match either of the possibilities you could use an alternation which will give you 2 capturing groups:
\[[^][()$:]*(?:\(([:$][^][()$:]+)\)|([:$][^][()$:]+))\]
About the pattern
\[ Match [
[^][()$:]* Match 0+ times any character that is not in the character class
(?: Non capturing group
\( Match (
( Capturing group 1
[:$][^][()$:]+ Match $ or :, then match 1+ chars not in the character class
) Close group 1
\) Match )
| Or
( Capturing group 2
[:$][^][()$:]+ Match $ or :, then match 1+ chars not in the character class
) Close group 2
) Close non capturing group
\] Match ]
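Because the alternation puts the parameter in group 1 or group 2 depending on which branch matched, the calling code has to pick whichever one is set. A minimal Python sketch of that (extract_parameters is a hypothetical helper name, not part of the answer):

```python
import re

# The alternation pattern from above: group 1 is filled when the parameter
# is wrapped in parentheses, group 2 when it is not.
PATTERN = re.compile(r"\[[^][()$:]*(?:\(([:$][^][()$:]+)\)|([:$][^][()$:]+))\]")


def extract_parameters(text):
    """Return a list of (full_match, parameter) tuples."""
    results = []
    for m in PATTERN.finditer(text):
        parameter = m.group(1) or m.group(2)  # exactly one branch matched
        results.append((m.group(0), parameter))
    return results


if __name__ == "__main__":
    s = "some text [more text :Parameter1] more text [more (:Parameter2)]"
    for full, parameter in extract_parameters(s):
        print(full, "->", parameter)
```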
With extended regex pattern:
import re
s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'
res = re.findall(r'(\[[^\[\]:$]+\(?([:$][^:$)]+)\)?\])', s)
print(res)
The output (in format (<full_match>, <group_1>)):
[('[more text :Parameter1]', ':Parameter1'), ('[more (:Parameter2)]', ':Parameter2')]
This regex does what you want:
\[.*?([:\$].*?)\)?]
Output:
[more text :Parameter1]
:Parameter1
[more (:Parameter2)]
:Parameter2
I'd suggest a simple expression,
(\[[^(:]+([^]]+)\])
and then scripting the rest of the problem to avoid look-arounds.
Test
import re
regex = r"(\[[^(:]+([^]]+)\])"
test_str = "some text [more text :Parameter1] more text [more (:Parameter2)]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capturing group (groups are numbered from 1)
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
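With this simple expression, group 2 still contains the surrounding parentheses for the second sample, so the "scripting the rest" step could look something like the sketch below. The helper name extract_with_cleanup is hypothetical, and stripping with str.strip("()") assumes the parameter itself never starts or ends with a parenthesis.

```python
import re

SIMPLE = re.compile(r"(\[[^(:]+([^]]+)\])")


def extract_with_cleanup(text):
    """Match with the simple pattern, then remove any wrapping parentheses
    from the captured parameter in plain Python."""
    return [(full, raw.strip("()")) for full, raw in SIMPLE.findall(text)]


if __name__ == "__main__":
    s = "some text [more text :Parameter1] more text [more (:Parameter2)]"
    print(extract_with_cleanup(s))
```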
You can use the following regex:
(\[[^:]+([:$][^])]+)[])]+)
It will be faster than using lazy quantifiers.
Regex details:
\[ matches [
[^:]+ matches 1 or more times any characters but a :
([:$][^])]+) second group:
[:$] matches either : or $
[^])]+ matches 1 or more times any characters but a ] or )
[])]+ matches ] and/or ) at least one time
Demo
import re
s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'
print(re.findall(r'(\[[^:]+([:$][^])]+)[])]+)', s))
Output:
[('[more text :Parameter1]', ':Parameter1'), ('[more (:Parameter2)]', ':Parameter2')]
I'm using Python 3.7. I want to extract the portion of a url between the "q=...&" part of a query string. I have this code
href = span.a['href']
print("href:" + href)
matchObj = re.match( r'q=(.*?)\&', href, re.M|re.I)
if matchObj:
    criteria = matchObj.group(1)
but despite the fact that my href is this
href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB
the "matchObj" is always NoneType and the subsequent lines aren't evaluated. What else do I need to do to fix my regex?
You can use the urllib module
Ex:
import urllib.parse as urlparse
url = "href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB"
data = urlparse.urlparse(url)
print(urlparse.parse_qs(data.query)['q'][0])
Output:
bet i won t get one share
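Indexing the parse_qs result with ['q'][0] raises a KeyError when the parameter is absent. A slightly more defensive variant might look like this; get_query_param is a hypothetical helper name, and returning None for a missing parameter is an assumption about what the caller wants.

```python
from urllib.parse import parse_qs, urlsplit


def get_query_param(url, name, default=None):
    """Return the first value of query parameter `name`, or `default`."""
    values = parse_qs(urlsplit(url).query).get(name)
    return values[0] if values else default


if __name__ == "__main__":
    href = "/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch"
    print(get_query_param(href, "q"))
    print(get_query_param(href, "missing"))
```

Note that parse_qs decodes the value, so + becomes a space, which is why the result differs from what the regex approaches return.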
You're using the wrong function if you wish to match in the middle of the string.
re.match only matches at the start of the string. From the documentation:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.
Here use re.search instead.
import re
href = 'href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB'
print("href:" + href)
matchObj = re.search( r'q=(.*?)\&', href, re.M|re.I)
if matchObj:
    criteria = matchObj.group(1)
    print(criteria)
Output:
bet+i+won+t+get+one+share
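The difference between the two functions is easiest to see side by side on a shortened URL. This is only an illustrative sketch; the shortened url value is made up for the example.

```python
import re

url = "/search?hl=en-US&q=bet+i+won+t&tbm=isch"  # shortened sample URL
pattern = r"q=(.*?)&"

# re.match is anchored at position 0, and the string starts with "/search".
anchored = re.match(pattern, url)

# re.search scans forward until the pattern matches somewhere.
scanned = re.search(pattern, url)

print(anchored)          # None
print(scanned.group(1))  # bet+i+won+t
```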
Here, we would apply a simple expression with left and right boundaries such as:
&q=(.+?)&
Demo
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"&q=(.+?)&"
test_str = "href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capturing group (groups are numbered from 1)
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
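One limitation of using & as both the left and right boundary is that it misses q when it is the first or the last parameter in the query string. A possible variation, offered as a sketch rather than a replacement (extract_q is a hypothetical name), accepts either ? or & on the left and stops at the next & or # or at the end of the string:

```python
import re

# [?&] on the left: q may be the first parameter or a later one.
# [^&#]* on the right: stop at the next parameter, a fragment, or the end.
Q_PARAM = re.compile(r"[?&]q=([^&#]*)")


def extract_q(url):
    """Return the raw (still URL-encoded) value of q, or None if absent."""
    m = Q_PARAM.search(url)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(extract_q("/search?hl=en-US&q=bet+i+won&tbm=isch"))
    print(extract_q("/search?q=first"))
    print(extract_q("/search?hl=en"))
```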
I'm curious why re.findall does such weird things as returning empty strings and tuples (what is that supposed to mean?). It seems that it does not treat parentheses () the usual way, and it also interprets | wrongly: ab|cd is read as (ab)|(cd), not a(b|c)d as you would normally expect. Because of that, I can't define the regex I need.
But in this example I see clearly wrong behaviour on a simple pattern:
([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}
which describes simple URLs like gskinner.com or www.capitolconnection.org, as you can see in the regex help at https://regexr.com/. With re.findall I get:
hotmail.
living.
item.
2.
4S.
that is, letters followed by just a dot. How can that be?
The full code, where I try to filter junk out of the text, is:
import re
singles = r'[()\.\/$%=0-9,?!=; \t\n\r\f\v\":\[\]><]'
digits_str = singles + r'[()\-\.\/$%=0-9 \t\n\r\f\v\'\":\[\]]*'
#small_word = '[a-zA-Z0-9]{1,3}'
#junk_then_small_word = singles + small_word + '(' + singles + small_word + ')*'
email = singles + '\S+#\S*'
http_str = r'[^\.]+\.+[^\.]+\.+([^\.]+\.+)+?'
http = '(http|https|www)' + http_str
web_address = '([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'
pat = email + '|' + digits_str
d_pat = re.compile(web_address)
text = '''"Lucy Gonzalez" test-defis-wtf <stagecoachmama#hotmail.com> on 11/28/2000 01:02:22 PM
http://www.living.com/shopping/item/item.jhtml?.productId=LC-JJHY-2.00-10.4S.I will send checks
directly to the vendor for any bills pre 4/20. I will fax you copies. I will also try and get the payphone transferred.
www.capitolconnection.org <http://www.capitolconnection.org>.
and/or =3D=3D=3D=3D=3D=3D=3D= O\'rourke'''
print('findall:')
for x in re.findall(d_pat, text):
    print(x)
print('split:')
for x in re.split(d_pat, text):
    print(x)
From the documentation of re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
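Your regex has groups, namely the part in parentheses. If you want to display the entire match, put your regex in one big group (put parentheses around the whole thing) and then do print(x[0]) instead of print(x). Alternatively, make the inner group non-capturing with (?:...), in which case findall returns the full matches directly.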
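To make the documented behaviour concrete, here is a small comparison on a made-up sample sentence. With a capturing group, findall returns only what the group captured on its last repetition; with a non-capturing group it returns the whole match.

```python
import re

text = "see gskinner.com and www.capitolconnection.org today"  # sample input

capturing = r"([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}"
non_capturing = r"(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}"

# One capturing group: findall returns the group's last capture per match.
with_group = re.findall(capturing, text)

# No capturing groups: findall returns the full matches.
without_group = re.findall(non_capturing, text)

print(with_group)     # ['gskinner.', 'capitolconnection.']
print(without_group)  # ['gskinner.com', 'www.capitolconnection.org']
```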
I'm guessing that the expression itself needs to be modified, and that this might be the problem. For instance, to match the desired patterns, we could start with an expression similar to:
([a-zA-Z0-9]+)\.
If we wish to have 1 to 3 chars after the ., we would expand it to:
([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?"
test_str = ("hotmail.\n"
"living.\n"
"item.\n"
"2.\n"
"4S.\n"
"hotmail.com\n"
"living.org\n"
"item.co\n"
"2.321\n"
"4S.123")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capturing group (groups are numbered from 1)
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
lrgstPlace = features[0]
strLrgstPlace = str(lrgstPlace)
longtide = re.match("r(lat=)([\-\d\.]*)",strLrgstPlace)
print (longtide)
This is what my features list looks like:
Feature(place='28km S of Cliza, Bolivia', long=-65.8913, lat=-17.8571, depth=358.34, mag=6.3)
Feature(place='12km SSE of Volcano, Hawaii', long=-155.2005, lat=19.3258333, depth=6.97, mag=5.54)
Why can't the regex match anything? It just gives me "None" as a result.
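I think you meant to put the r outside the quotes. Also note that re.match only matches at the beginning of the string, and str(lrgstPlace) starts with Feature(, so you will need re.search to find lat= in the middle: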
r"(lat=)([\-\d\.]*)"
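A short sketch of the corrected call, using one of the sample lines from the question as a plain string. The variable name feature_repr is made up for the example, and converting the capture with float() is an assumption about how the value will be used.

```python
import re

# str() of one feature, copied from the question.
feature_repr = ("Feature(place='28km S of Cliza, Bolivia', long=-65.8913, "
                "lat=-17.8571, depth=358.34, mag=6.3)")

pattern = r"lat=([-\d.]+)"  # raw-string prefix outside the quotes

# Anchored at the start of the string, so this cannot reach "lat=".
anchored = re.match(pattern, feature_repr)

# re.search scans the whole string for the first occurrence.
found = re.search(pattern, feature_repr)
latitude = float(found.group(1)) if found else None

print(anchored)  # None
print(latitude)  # -17.8571
```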
Your original expression works fine once the r is moved outside the quotes. We might want to modify it slightly if we wish to extract just the lat numbers:
(?:lat=)([0-9\.\-]+)(?:,)
where ([0-9\.\-]+) would capture our desired lat, and we wrap it with two non-capturing groups:
(?:lat=)
(?:,)
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:lat=)([0-9\.\-]+)(?:,)"
test_str = "Feature(place='28km S of Cliza, Bolivia', long=-65.8913, lat=-17.8571, depth=358.34, mag=6.3) Feature(place='12km SSE of Volcano, Hawaii', long=-155.2005, lat=19.3258333, depth=6.97, mag=5.54)"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capturing group (groups are numbered from 1)
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
I have the following shape of string: PW[Yasui Chitetsu]; and would like to get only the name inside the brackets: Yasui Chitetsu. I'm trying something like
[^(PW\[)](.*)[^\]]
as a regular expression, but the last bracket is still in it. How do I unselect it? I don't think I need anything fancy like look behinds, etc, for this case.
The Problems with What You've Tried
There are a few problems with what you've tried:
It will omit the first and last characters of your match from the group, giving you something like asui Chitets.
It will have even more errors on strings that start with P or W. For example, in PW[Paul McCartney], you would match only ul McCartne with the group and aul McCartney with the full match.
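The reason is that [^(PW\[)] is a negated character class, not a negated sequence: it matches any single character other than (, P, W, [ or ). A quick check of the attempted pattern makes both problems visible (the variable names are only for illustration):

```python
import re

attempt = r"[^(PW\[)](.*)[^\]]"  # the pattern from the question

first = re.search(attempt, "PW[Yasui Chitetsu]")
second = re.search(attempt, "PW[Paul McCartney]")

# The leading class consumes "Y" and the trailing class consumes "u".
print(first.group(1))   # asui Chitets

# "P" is excluded by the class, so the match cannot start until "a".
print(second.group(0))  # aul McCartney
print(second.group(1))  # ul McCartne
```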
The Regex
You want something like this:
(?<=\[)([^]]+)(?=\])
Here's a regex101 demo.
Explanation
(?<=\[) means that the match must be preceded by [
([^]]+) matches 1 or more characters that are not ]
(?=\]) means that the match must be followed by ]
Sample Code
Here's some sample code (from the above regex101 link):
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\[)([^]]+)(?=\])"
test_str = "PW[Yasui Chitetsu]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capturing group (groups are numbered from 1)
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
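If you would rather avoid lookarounds altogether, as the question suggests, a plain capturing group can do the same job. This is a sketch of an alternative, not the approach above: extract_name is a hypothetical helper, and it assumes the name is always introduced by the literal prefix PW[.

```python
import re

# Match the literal "PW[", capture everything up to the next "]".
NAME = re.compile(r"PW\[([^\]]+)\]")


def extract_name(text):
    """Return the text inside PW[...], or None if the shape is not found."""
    m = NAME.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(extract_name("PW[Yasui Chitetsu];"))
    print(extract_name("PW[Paul McCartney]"))
```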
Semicolons
In your title, you mentioned finding text between semicolons. The same logic would work for that, giving you this regex:
(?<=;)([^;]+)(?=;)
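Because lookarounds do not consume the delimiters, one semicolon can close one field and open the next. A small check on a made-up input string shows the effect:

```python
import re

between_semicolons = re.compile(r"(?<=;)([^;]+)(?=;)")

sample = "a;b;c;d"  # made-up input
fields = between_semicolons.findall(sample)

# "a" has no semicolon before it and "d" has none after it,
# so only the two inner fields qualify.
print(fields)  # ['b', 'c']
```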