I got a string like that:
LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)
I want to look only for OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0), but the OR could be LD as well. _080T_SAF_OUT could be different being always alphanumeric with bottom slash sometimes. COIL(xxSF[4].Flt[120].0), must be always in the format COIL(xxSF["digits"].Flt["digits"]."digits")
I am trying to use the re library of Python 2.7.
m = re.search('\OR|\LD'+'\('+'.+'+'\)'+'+'\COIL+'\('+'\xxSF+'\['+'\d+'+'\].'+ Flt\['+'\d+'+'\]'+'\.'+'\d+', Text)
My Output:
OR(abc_TEST_X)LD(xxSF[16].Flt[0].22
OR
LD(TEST_X_dsfa)OR(WASS_READY)COIL(xxSF[16].Flt[11].10
The first one is the right one which I am getting I want to discard the second one and the third one.
I think that the problem is here:
'\('+'.+'+'\)'
Because of I just want to find something alphanumeric and possibly with symbols between the first pair of paréntesis, and I am not filtering this situation.
You should group alternations like (?:LD|OR), and to match any chars other than ( and ) you may use [^()]* rather than .+ (.+ matches any chars, as many as possible, hence it matches across parentheses).
Here is a Python demo:
import re
Text = 'LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)'
m = re.search(r'(?:OR|LD)\([^()]*\)COIL\(xxSF\[\d+]\.Flt\[\d+]\.\d+', Text)
if m:
print(m.group()) # => OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0
Pattern details
(?:OR|LD) - a non-capturing group matching OR or LD
\( - a ( char
[^()]* - a negated character class matching 0+ chars other than ( and )
\)COIL\(xxSF\[ - )COIL(xxSF[ substring
\d+ - 1+ digits
]\.Flt\[ - ].Flt[ substring
\d+]\.\d+ - 1+ digits, ]. substring and 1+ digits
See the regex demo.
TIP Add a \b before (?:OR|LD) to match them as whole words (not as part of NOR and NLD).
Thanks, I am capturing everything which I want. Just something else to filter. Take a look to some Outputs:
OR(_1B21_A53021_2_En)OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1B21_A53021_2_En)LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
I only want to capture the last one "LD" or "OR" as follow:
OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
Related
I am trying to capture instances in my dataframe where a string has the following format:
/random a/random b/random c/capture this/random again/random/random
Where a string is preceded by four instances of /, and more than two / appear after it, I would like the string captured and returned in a different column. If it is not applicable to that row, return None.
In this instance capture this should be captured and placed into a new column.
This is what I tried:
def extract_special_string(df, column):
df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x).group(0) if re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x) else None)
extract_special_string(df, 'column')
However nothing is being captured. Can anybody help with this regex? Thanks.
You can use
df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)
See the regex demo
Details:
^ - start of string
(?:[^/]*/){4} - four occurrences of any zero or more chars other than / and then a / char
([^/]+) - Capturing group 1:one or more chars other than a / char
(?:/[^/]*){2} - two occurrences of a / char and then any zero or more chars other than /.
An alternative regex approach would be to use non-greedy quantifiers.
import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1)) # 'capture this'
/(?:.*?/){3} - match the part before capture this, matching any but as few as possible characters between each pair of /s (use noncapturing group to ignore the contents)
(.*?) - capture capture this (since this is a capturing group, we can fetch the contents from <match_object>.group(1)
(?:/.*?){2,} - same as the first part, match as few characters as possible in between each pair of /s
I have examined a previous question relating to optional capture groups in Python, but this has not been helpful. Attempting to follow, the code I have is below:
import re
c = re.compile(r'(?P<Prelude>.*?)'
r'(?:Discussion:(?P<Discussion>.+?))?'
r'(?:References:(?P<References>.*?))?',
re.M|re.S)
test_text = r"""Prelude strings
Discussion: this is some
text.
References: My bad, I have none.
"""
test_text2 = r"""Prelude strings
Discussion: this is some
text.
"""
print(c.match(test_text).groups())
print(c.match(test_text2).groups())
Both print ('Prelude strings', None, None) instead of capturing the two groups. I am unable to determine why.
The expected result is ('Prelude strings', ' this is some\ntext.', ' My bad, I have none.') for the first, and the second the same but with None as the third capture group. It should also be possible to delete the Discussion lines and still capture References.
You can use
c = re.compile(r'^(?P<Prelude>.*?)'
r'(?:Discussion:\s*(?P<Discussion>.*?)\s*)?'
r'(?:References:\s*(?P<References>.*?))?$',
re.S)
One-line regex pattern as a string:
(?s)^(?P<Prelude>.*?)(?:Discussion:\s*(?P<Discussion>.*?)\s*)?(?:References:\s*(?P<References>.*?))?$
See the regex demo.
Details:
(?s) - same as re.S, makes . match line break chars
^ - start of the whole string (note that it no longer matches start of any line, since I removed the re.M flag)
(?P<Prelude>.*?) - Group "Prelude": any zero or more chars as few as possible
(?:Discussion:\s*(?P<Discussion>.*?)\s*)? - an optional non-capturing group matching one or zero occurrences of the following sequence:
Discussion: - a fixed string
\s* - zero or more whitespaces
(?P<Discussion>.*?) - Group "Discussion": zero or more chars as few as possible
\s* - zero or more whitespaces
(?:References:\s*(?P<References>.*?))? - an optional non-capturing group matching one or zero occurrences of the following sequence:
References: - a fixed string
\s* - zero or more whitespaces
(?P<References>.*?) - Group "References": any zero or more chars as few as possible
$ - end of the string.
I am using a string that uses the following characters:
0-9
a-f
A-F
-
>
The mixture of the greater than and hyphen must be:
->
-->
Here is the regex that I have so far:
[0-9a-fA-F\-\>]+
I tried these others using exclusion with ^ but they didn't work:
[^g-zG-Z][0-9a-fA-F\-\>]+
^g-zG-Z[0-9a-fA-F\-\>]+
[0-9a-fA-F\-\>]^g-zG-Z+
[0-9a-fA-F\-\>]+^g-zG-Z
[0-9a-fA-F\-\>]+[^g-zG-Z]
Here are some samples:
"0912adbd->12d1829-->218990d"
"ab2c8d-->82a921->193acd7"
Firstly, you don't need to escape - and >
Here's the regex that worked for me:
^([0-9a-fA-F]*(->)*(-->)*)*$
Here's an alternative regex:
^([0-9a-fA-F]*(-+>)*)*$
What does the regex do?
^ matches the beginning of the string and $ matches the ending.
* matches 0 or more instances of the preceding token
Created a big () capturing group to match any token.
[0-9a-fA-F] matches any character that is in the range.
(->) and (-->) match only those given instances.
Putting it into a code:
import re
regex = "^([0-9a-fA-F]*(->)*(-->)*)*$"
re.match(re.compile(regex),"0912adbd->12d1829-->218990d")
re.match(re.compile(regex),"ab2c8d-->82a921->193acd7")
re.match(re.compile(regex),"this-failed->so-->bad")
You can also convert it into a boolean:
print(bool(re.match(re.compile(regex),"0912adbd->12d1829-->218990d")))
print(bool(re.match(re.compile(regex),"ab2c8d-->82a921->193acd7")))
print(bool(re.match(re.compile(regex),"this-failed->so-->bad")))
Output:
True
True
False
I recommend using regexr.com to check your regex.
If there must be an arrow present, and not at the start or end of the string using a case insensitive pattern:
^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$
Explanation
^ Start of string
[a-f\d]+ Match 1+ chars a-f or digits
(?: Non capture group to repeat as a whole
-{1,2}>[a-f\d]+ Match - or -- and > followed by 1+ chars a-f or digits
)+ Close the non capture group and repeat 1+ times
$ End of string
See a regex demo and a Python demo.
import re
pattern = r"^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$"
s = ("0912adbd->12d1829-->218990d\n"
"ab2c8d-->82a921->193acd7\n"
"test")
print(re.findall(pattern, s, re.I | re.M))
Output
[
'0912adbd->12d1829-->218990d',
'ab2c8d-->82a921->193acd7'
]
You can construct the regex by steps. If I understand your requirements, you want a sequence of hexadecimal numbers (like a01d or 11efeb23, separated by arrows with one or two hyphens (-> or -->).
The hex part's regex is [0-9a-fA-F]+ (assuming it cannot be empty).
The arrow's regex can be -{1,2}> or (->|-->).
The arrow is only needed before each hex number but the first, so you'll build the final regex in two parts: the first number, then the repetition of arrow and number.
So the general structure will be:
NUMBER(ARROW NUMBER)*
Which gives the following regex:
[0-9a-fA-F]+(-{1,2}>[0-9a-fA-F]+)*
following my previous question (How do i find multiple occurences of this specific string and split them into a list?), I'm now going to ask something more since the rule has been changed.
Here's the string, and the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out in https://regexr.com/, I found the result instead :
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
The question remains: "Why don't just return the first matched occurrence ?".
Let's consider that if the value between the first "bar section" is empty, then it'll return the value of the next bar section.
Example :
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
And I don't want that. Let it be just return nothing instead (nothing match).
What's the correct regex to acquire such a match?
Thank you :).
This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
doc='text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
re.findall('[^|]+(?=\|\#\|)', doc)
In the re expression:
[^|]+finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (match the text but do not include in result)
About the pattern you tried
This part of the pattern [^|]+ states to match any char other than |
Then (?=\|\#\|.*|$) asserts using a positive lookahead what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts what is on the left is p1_1_ followed by any char except a newline using a quantifier in the lookbehind.
As the pattern is not anchored, you will get all the matches for this logic because the p1_1_ assertion is true as it precedes all the|#| parts
Note that using the quantifier in the lookbehind will require the pypi regex module.
If you want the first match using a quantifier in the positive lookbehind you could for example use an anchor in combination with a negative lookahead to not cross the |#| or match || in case it is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
Python demo
You could use your original pattern using re.search getting the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern and you can omit .* from the positive lookahead
Python demo
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, match and capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#| match any char, or match 2 times ||
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (This will contain your value
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#|
Regex demo
I have a pattern which looks like:
abc*_def(##)
and i want to look if this matches for some strings.
E.x. it matches for:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number which can have one or more digits.
The # stands for a number with one digit
I want the value in the parenthesis as result
What would be the easiest and simplest solution for this problem?
Replace the * and # through regex expression and then look if they match?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is avoiding after def the same digit you have after abc.
This can be done with a negative backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b captures word boundaries; if you enclose your regex between two \b
you will restrict your search to words, i.e. text delimited by space,
punctuations etc
abc captures the string abc
\d+ captures one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; anyway \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the use of parenthesis defines \d+ as something you want to "remember" inside your regex; it's somehow similar to defining a variable; in this case, (\d+) is your first group since you defined no other groups before it (i.e. to its left)
_def captures the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def. \1 represents the first group, while (?!whatever) is a check that results positive is what follows the current position is NOT (the negation is given by !) whatever you want to negate.
Live demo here.
I had the hardest time getting this to work. The trick was the $
#!python2
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
print item, 'is a match'
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def - Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could for example use search to check for a match and use .group(1) to get the digits between parenthesis.
Demo Python
You could also add word boundaries:
\babc\d+_def(\d{2})\b