regex matching is unable to select alphanumeric string with spaces in python

I have the following list of strings in Python:
LIST1=["AR BR_18_0138249", "AR R_16_01382649", "BR 16 0138264", "R 16 01382679" ]
Some of the items are alphanumeric sequences joined by underscores, but in others the later sets of sequences are separated by spaces. I expect the following output:
"AR BR_18_0138249"
"AR R_16_01382649"
"BR 16 0138264"
"R 16 01382679"
I have tried the following code
import regex as re
pattern = r"(\bB?R_\w+)(?!.*\1)|(\bB?R \w+)(?!.*\1)|(\bR?^sd \w+)(?!.*\1)"
for i in LIST1:
    rest = re.search(pattern, i)
    if rest:
        print(rest.group(1))
I have obtained the following result
BR_18_0138249
R_16_01382649
None
None
I am unable to get the sequences that contain spaces. Could someone guide me in this regard?

You can use
\b(B?R(?=([\s_]))(?:\2\d+)+)\b(?!.*\b\1\b)
See the regex demo
Details
\b - a word boundary
(B?R(?=([\s_]))(?:\2\d+)+) - Group 1: an optional B, then R, then one or more sequences of a whitespace or underscore followed by one or more digits (if you need to support letters here, replace \d+ with [^\W_]+)
\b - a word boundary
(?!.*\b\1\b) - a negative lookahead that fails the match if there are
.* - any zero or more chars other than line break chars, as many as possible
\b\1\b - the same value as in Group 1 matched as a whole word (not enclosed with letters, digits or underscores).
See a Python re demo (you do not need the PyPi regex module here):
import re
LIST1=["AR BR_18_0138249", "AR R_16_01382649", "BR 16 0138264", "R 16 01382679" ]
pattern = r"\b(B?R(?=([\s_]))(?:\2\d+)+)\b(?!.*\b\1\b)"
for i in LIST1:
    rest = re.search(pattern, i)
    if rest:
        print(rest.group(1))

This does the work:
[A-Z]{1,2}\s([A-Z]{1,2}+(?:_[0-9]+)*|[0-9]+(?:\s[0-9]+)*)
This regex gives the output below:
AR BR_18_0138249
AR R_16_01382649
BR 16 0138264
R 16 01382679
See demo here
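A minimal sketch of how that pattern could be applied to the list from the question. Note that `{1,2}+` is a possessive quantifier: the third-party regex module accepts it, but the built-in re module only does so from Python 3.11. The sketch below drops the extra `+` (on the assumption that it is not essential for these inputs) so that it also runs on older versions. The name `PATTERN` is only illustrative.

```python
import re

LIST1 = ["AR BR_18_0138249", "AR R_16_01382649", "BR 16 0138264", "R 16 01382679"]

# Same pattern as above, minus the possessive "+" after {1,2}
PATTERN = re.compile(r"[A-Z]{1,2}\s([A-Z]{1,2}(?:_[0-9]+)*|[0-9]+(?:\s[0-9]+)*)")

matches = []
for item in LIST1:
    m = PATTERN.search(item)
    if m:
        # group(0) is the whole match, including the leading letters
        matches.append(m.group(0))

for value in matches:
    print(value)
```

Here `group(0)` is used rather than `group(1)`, because the capturing group only covers the part after the first whitespace.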

Related

How to create regex to match a string that contains only hexadecimal numbers and arrows?

I am using a string that uses the following characters:
0-9
a-f
A-F
-
>
The greater-than sign and hyphen may only be combined as:
->
-->
Here is the regex that I have so far:
[0-9a-fA-F\-\>]+
I tried these others using exclusion with ^ but they didn't work:
[^g-zG-Z][0-9a-fA-F\-\>]+
^g-zG-Z[0-9a-fA-F\-\>]+
[0-9a-fA-F\-\>]^g-zG-Z+
[0-9a-fA-F\-\>]+^g-zG-Z
[0-9a-fA-F\-\>]+[^g-zG-Z]
Here are some samples:
"0912adbd->12d1829-->218990d"
"ab2c8d-->82a921->193acd7"

Firstly, you don't need to escape - and >
Here's the regex that worked for me:
^([0-9a-fA-F]*(->)*(-->)*)*$
Here's an alternative regex:
^([0-9a-fA-F]*(-+>)*)*$
What does the regex do?
^ matches the beginning of the string and $ matches the ending.
* matches 0 or more instances of the preceding token
The outer (...) is a capturing group that wraps the whole sequence so it can be repeated.
[0-9a-fA-F] matches any character that is in the range.
(->) and (-->) match only those given instances.
Putting it into code:
import re
regex = re.compile(r"^([0-9a-fA-F]*(->)*(-->)*)*$")  # compile once, reuse below
regex.match("0912adbd->12d1829-->218990d")
regex.match("ab2c8d-->82a921->193acd7")
regex.match("this-failed->so-->bad")
You can also convert the result into a boolean:
print(bool(regex.match("0912adbd->12d1829-->218990d")))
print(bool(regex.match("ab2c8d-->82a921->193acd7")))
print(bool(regex.match("this-failed->so-->bad")))
Output:
True
True
False
I recommend using regexr.com to check your regex.
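One thing to keep in mind: every part of this pattern is optional, so it also accepts inputs that contain no hex digits or no arrow at all. The short check below illustrates this. The list `edge_cases` is made up for the illustration, and whether such inputs should be rejected depends on your requirements.

```python
import re

regex = re.compile(r"^([0-9a-fA-F]*(->)*(-->)*)*$")

# Inputs that may not be intended to count as "hex numbers joined by arrows"
edge_cases = ["", "->", "abc", "ab->"]
results = {s: bool(regex.match(s)) for s in edge_cases}
print(results)
```

All four inputs are accepted, so a stricter pattern is needed if an arrow between two hex numbers is mandatory.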

If there must be an arrow present, and not at the start or end of the string, you can use a case-insensitive pattern:
^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$
Explanation
^ Start of string
[a-f\d]+ Match 1+ chars a-f or digits
(?: Non capture group to repeat as a whole
-{1,2}>[a-f\d]+ Match - or -- and > followed by 1+ chars a-f or digits
)+ Close the non capture group and repeat 1+ times
$ End of string
See a regex demo and a Python demo.
import re
pattern = r"^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$"
s = ("0912adbd->12d1829-->218990d\n"
"ab2c8d-->82a921->193acd7\n"
"test")
print(re.findall(pattern, s, re.I | re.M))
Output
[
'0912adbd->12d1829-->218990d',
'ab2c8d-->82a921->193acd7'
]

You can construct the regex in steps. If I understand your requirements, you want a sequence of hexadecimal numbers (like a01d or 11efeb23), separated by arrows with one or two hyphens (-> or -->).
The hex part's regex is [0-9a-fA-F]+ (assuming it cannot be empty).
The arrow's regex can be -{1,2}> or (->|-->).
The arrow is only needed before each hex number but the first, so you'll build the final regex in two parts: the first number, then the repetition of arrow and number.
So the general structure will be:
NUMBER(ARROW NUMBER)*
Which gives the following regex:
[0-9a-fA-F]+(-{1,2}>[0-9a-fA-F]+)*
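The step-by-step construction above could be sketched in Python roughly as follows. The names `HEX`, `ARROW`, `CHAIN` and `is_hex_chain` are illustrative, and using `fullmatch` assumes the whole string must consist of the chain.

```python
import re

HEX = r"[0-9a-fA-F]+"   # one hexadecimal number (non-empty)
ARROW = r"-{1,2}>"      # -> or -->

# NUMBER(ARROW NUMBER)*
CHAIN = re.compile(f"{HEX}(?:{ARROW}{HEX})*")

def is_hex_chain(text):
    """Return True if the whole string is hex numbers joined by arrows."""
    return CHAIN.fullmatch(text) is not None

print(is_hex_chain("0912adbd->12d1829-->218990d"))
print(is_hex_chain("ab2c8d-->82a921->193acd7"))
print(is_hex_chain("this-failed->so-->bad"))
```

Building the pattern from named pieces keeps each part easy to change, for example if a longer arrow has to be allowed later.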

Python regex match space-separated words that contain two or fewer o characters

I am new to Python and trying to solve some problems (as a way to learn).
I want to match space-separated words that contain two or fewer o characters.
This is what I have tried:
import re
pattern = r'\b(?:[^a\s]*o){1}[^a\s]*\b'
text = "hop hoop hooop hoooop hooooop"
print(re.findall(pattern, text))
When I run my code, it matches all the words in the string.
Any suggestions?

You can use
import re
pattern = r'(?<!\S)(?:[^\so]*o){0,2}[^o\s]*(?!\S)'
text = "hop hoop hooop hoooop hooooop"
print(re.findall(pattern, text))
# Non-regex solution:
print([x for x in text.split() if x.count("o") < 3])
See the Python demo. Both yield ['hop', 'hoop'].
The (?<!\S)(?:[^\so]*o){0,2}[^o\s]*(?!\S) regex matches
(?<!\S) - a left-hand whitespace boundary
(?:[^\so]*o){0,2} - zero, one or two occurrences of any zero or more chars other than whitespace and o char, and then an o char
[^o\s]* - zero or more chars other than o and whitespace
(?!\S) - a right-hand whitespace boundary
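To see why the attempt in the question matched every word, it may help to run both patterns side by side. In the original, the class `[^a\s]` excludes a rather than o, so any extra o characters are simply absorbed by it, and `{1}` only asks for one o. The variable names below are illustrative.

```python
import re

text = "hop hoop hooop hoooop hooooop"

# Original attempt: [^a\s] still allows "o", so the number of o's is not limited
original = r'\b(?:[^a\s]*o){1}[^a\s]*\b'
# Corrected: "o" is excluded from the filler classes and at most two are allowed
fixed = r'(?<!\S)(?:[^\so]*o){0,2}[^o\s]*(?!\S)'

original_hits = re.findall(original, text)
fixed_hits = re.findall(fixed, text)
print(original_hits)
print(fixed_hits)
```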

A way to match a SSHA hash using a regular expression

I'm trying to match four hashes that look like this:
{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=
{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=
{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=
{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=
I've successfully matched the first two with this regular expression: \D{5,}[a-zA-Z0-9]\w+\(?= however I am unable to get a full match on the third or the fourth one. What is a better regular expression to match the given hashes?

Note that \D{5,} matches 5 or more non-digit chars, then [a-zA-Z0-9] matches an ASCII letter or digit, and \w+ matches 1+ letters/digits/_. So if the string contains - or /, or if any of the first 5 chars is a digit, there will be no match.
I suggest the following pattern:
\{[^{}]*}[a-zA-Z0-9][\w/-]+=?
See the regex demo.
It matches:
\{[^{}]*} - a {, then 0+ chars other than { and }, and then } (note you may make it more precise: \{\w+} to match {, 1 or more letters/digits/_, and then }, or even \{(?:SS?HA|MD5)} to match SHA, SSHA or MD5 enclosed with {...})
[a-zA-Z0-9] - an ASCII letter or digit
[\w/-]+ - 1 or more word chars (letters, digits or _), / or -
=? - an optional = symbol (1 or 0 occurrences due to the ? quantifier; being greedy, ? matches the = if it is present).
Python demo:
import re
s = """
TEXT {SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=
{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4= and some more
{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c text here
{MD5}5/DNVWwyafo-oIEzHnhv30rSN7c= maybe."""
rx = r"\{[^{}]*}[a-zA-Z0-9][\w/-]+=?"
print(re.findall(rx, s))
# => ['{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=', '{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=', '{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c', '{MD5}5/DNVWwyafo-oIEzHnhv30rSN7c=']
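As a sketch of the stricter variant mentioned above, the prefix can be limited to SHA, SSHA or MD5 and each candidate checked as a whole string. The names `hashes`, `strict`, `accepted` and `rejected` are illustrative, and using `fullmatch` assumes each hash arrives as a separate string rather than embedded in text.

```python
import re

# The four hashes from the question
hashes = [
    "{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=",
    "{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=",
    "{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=",
    "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=",
]

# Stricter prefix: only {SHA}, {SSHA} or {MD5}
strict = re.compile(r"\{(?:SS?HA|MD5)}[a-zA-Z0-9][\w/-]+=?")

accepted = [h for h in hashes if strict.fullmatch(h)]
rejected = strict.fullmatch("{WORD}stuff=")  # unknown prefix, no match expected
print(accepted)
print(rejected)
```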

I would suggest something along these lines:
\{[SHAMD5]{3,4}\}[^=]+=?
It will match a {, then 3 or 4 characters drawn from the set S, H, A, M, D, 5 (the characters that appear in the prefixes you listed). You can change that to [A-Z0-9] to broaden it, but I like to keep it tighter to start. Then a }, then one or more non-= characters, ending with an optional = character. Here is my Python demo:
import re
textlist = [
    "{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=",
    "{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=",
    "{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=",
    "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=",
    "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c",
    "test for break below",
    "{WORD}stuff=",
    "{MD55/DNVWwyafo-pIEaHNhv39sSN7c=",
    "MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=",
]
for text in textlist:
    # raw string so that \{ and \} reach the regex engine unchanged
    if re.search(r"\{[SHAMD5]{3,4}\}[^=]+=?", text):
        print("match")
    else:
        print("no soup for you")
Note the end of the list has a few tests to make sure the regex doesn't just succeed on anything random.

the use of regular expression

I'm new to regular expressions, but I want to match a pattern in about 2 million strings.
The original strings come in three forms, shown below:
EC-2A-07<EC-1D-10>
EC-2-07
T1-ZJF-4
I want to get the three substrings separated by -, that is, EC, 2A and 07 respectively. For the first string in particular, I only want to split the part before <.
I have tried .+[\d]\W, but it cannot recognize EC-2-07. I then used .split('-') to split the string and indexed into the returned list to get what I want, but that is inefficient.
Can you suggest a highly efficient regular expression that meets my requirements? Thanks a lot!

You need to use
^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})
See the regex demo
Details:
^ - start of string
([A-Z0-9]{2}) - Group 1 capturing 2 uppercase ASCII letters or digits
- - a hyphen
([A-Z0-9]{1,3}) - Group 2 capturing 1 to 3 uppercase ASCII letters or digits
- - a hyphen
([A-Z0-9]{1,2}) - Group 3 capturing 1 to 2 uppercase ASCII letters or digits.
You may adjust the values in the {min,max} quantifiers as required.
Sample Python demo:
import re
regex = r"^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})"
test_str = "EC-2A-07<EC-1D-10>\nEC-2-07\nT1-ZJF-4"
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)
# or line by line
lines = test_str.split('\n')
rx = re.compile(r"([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})")
for line in lines:
    m = rx.match(line)  # match() anchors at the start of each line
    if m:
        print('{0} :: {1} :: {2}'.format(m.group(1), m.group(2), m.group(3)))

You can try this:
^(\w+)-(\w+)-(\w+)(?=\W).*$
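A caveat with the pattern as written: the lookahead (?=\W) requires a non-word character right after the third part, so a string that ends there, such as EC-2-07 on its own, does not match. The sketch below drops the lookahead and the trailing .*$, assuming only the three leading parts are needed. The names `original_rx`, `parts_rx`, `samples` and `parsed` are illustrative.

```python
import re

# Pattern as posted: fails when nothing follows the third part
original_rx = re.compile(r"^(\w+)-(\w+)-(\w+)(?=\W).*$")
print(original_rx.match("EC-2-07"))

# Three hyphen-separated parts at the start; anything after them is ignored
parts_rx = re.compile(r"(\w+)-(\w+)-(\w+)")

samples = ["EC-2A-07<EC-1D-10>", "EC-2-07", "T1-ZJF-4"]
parsed = []
for s in samples:
    m = parts_rx.match(s)  # match() anchors at the start of the string
    if m:
        parsed.append(m.groups())

print(parsed)
```

Note that \w also accepts underscores and lowercase letters, so this is looser than a pattern built on [A-Z0-9].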

repeated pattern in regex

I am trying to catch a repeated pattern in my string. The subpattern starts at the beginning of a word or after ":" and ends with ":" or the end of a word. I tried findall and search in combination with multiple matching ((subpattern)__(subpattern))+ but was not able to work out what is wrong:
cc = "GT__abc23_1231:TF__XYZ451"
import regex
ma = regex.match("(\b|\:)([a-zA-Z]*)__(.*)(:|\b)", cc)
Expected output:
GT, abc23_1231, TF, XYZ451
I saw a bunch of questions like this, but it did not help.

It seems you can use
(?:[^_:]|(?<!_)_(?!_))+
See the regex demo
Pattern details:
(?:[^_:]|(?<!_)_(?!_))+ - 1 or more sequences of:
[^_:] - any character but _ and :
(?<!_)_(?!_) - a single _ not enclosed with other _s
Python demo with re based solution:
import re
p = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
s = "GT__abc23_1231:TF__XYZ451"
print(p.findall(s))
# => ['GT', 'abc23_1231', 'TF', 'XYZ451']
If the first character is never a : or _, you may use an unrolled regex like:
r'[^_:]+(?:_(?!_)[^_:]*)*'
It won't match values that start with a single _ though (so the first, non-unrolled regex is safer).

Use the lowest common denominator of "starts and ends with a : or a word boundary", which is the word boundary (your substrings are composed of word characters):
>>> import re
>>> cc = "GT__abc23_1231:TF__XYZ451"
>>> re.findall(r'\b([A-Za-z]+)__(\w+)', cc)
[('GT', 'abc23_1231'), ('TF', 'XYZ451')]
Testing if there are : around is useless.
(Note: no need to add a \b after \w+, since the quantifier is greedy, the word-boundary becomes implicit.)
[EDIT]
According to your comment: "I want to first split on ":", then split on double underscore.", perhaps you don't need a regex at all:
>>> [x.split('__') for x in cc.split(':')]
[['GT', 'abc23_1231'], ['TF', 'XYZ451']]
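If a flat list like the expected output in the question is wanted instead of nested pairs, one possible sketch is to flatten while splitting, or to split on both delimiters at once with re.split. The names `flat` and `flat_re` are illustrative.

```python
import re

cc = "GT__abc23_1231:TF__XYZ451"

# Split on ":" first, then on the double underscore, flattening as we go
flat = [part for chunk in cc.split(':') for part in chunk.split('__')]

# Same result with a single regex split on either delimiter
flat_re = re.split(r':|__', cc)

print(flat)
print(flat_re)
```

A single underscore, as in abc23_1231, is left alone because only the double underscore is treated as a delimiter.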

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
