Regular expression that must not contain specific patterns (Python)

I'm working with regular expressions in Python, and I have the following strings that I need to parse:
XCT_GRUPO_INVESTIGACION_F1.sql
XCT_GRUPO_INVESTIGACION_F2.sql
XCT_GRUPO_INVESTIGACION.sql
XCS_GRUPO_INVESTIGACION.sql
Then I need to match all the strings of the form ??T, but the string must not contain something like F1, F34, constraints and others.
So I have the following pattern:
([a-zA-Z][a-zA-Z][tT]_([a-zA-Z]).*.(sql|SQL)$)
[a-zA-Z][a-zA-Z][tT]_ = the first and second characters can be any letter, but they must be followed by t_ or T_
([a-zA-Z]).* = one letter a-z or A-Z, followed by any characters, any number of times
(sql|SQL)$ = must end with sql or SQL
I get something like
ICT_GRUPO_INVESTIGACION_F1.sql
ICT_GRUPO_INVESTIGACION_F2.sql
ICT_GRUPO_INVESTIGACION.sql
But this still contains F1, F?, constraints and others.
How can I tell the regular expression that the part matched by ([a-zA-Z]).* must not contain f1 | f? | other_expressions_that_I_want?

This regular expression should work:
([a-zA-Z][a-zA-Z][tT]_(?:(?!_F[0-9]).)*?\.(sql|SQL))
You can list any number of unwanted combinations inside the lookahead: (?!_F[0-9]|other_expression|...)
The regular expression has the following parts:
[a-zA-Z] #match any letter
[a-zA-Z] #match any letter
[tT]_ #match 't_' or 'T_'
(?: #start non-capturing group
(?!_F[0-9]) #negative lookahead, asserts that what immediately
#follows the current position in the string is not _F[0-9]
. #match any single character
)*? #end group, repeat it zero or more times but as few as possible
\. #match period character
(sql|SQL) #match 'sql' or 'SQL'
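As a quick check, here is a minimal sketch that applies the pattern above to the file names from the question. Using re.fullmatch (so the whole name has to match) and a non-capturing (?:sql|SQL) are my own choices here, not part of the original answer.

```python
import re

# Pattern from the answer above. The tempered dot (?:(?!_F[0-9]).)*? refuses
# to step onto any position where "_F<digit>" starts.
PATTERN = re.compile(r'[a-zA-Z][a-zA-Z][tT]_(?:(?!_F[0-9]).)*?\.(?:sql|SQL)')

filenames = [
    "XCT_GRUPO_INVESTIGACION_F1.sql",
    "XCT_GRUPO_INVESTIGACION_F2.sql",
    "XCT_GRUPO_INVESTIGACION.sql",
    "XCS_GRUPO_INVESTIGACION.sql",
]

# Keep only the names that match from start to end.
kept = [name for name in filenames if PATTERN.fullmatch(name)]
print(kept)
```

Only XCT_GRUPO_INVESTIGACION.sql should survive: the _F1/_F2 names are rejected by the lookahead, and the XCS_ name has no T in third position.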

Regex - How do i find this specific slice of string inside a bigger whole string

Following my previous question (How do i find multiple occurences of this specific string and split them into a list?), I'm now asking something more, since the rule has changed.
Here's the string; the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out at https://regexr.com/, I got this result instead:
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
The question remains: "Why not just return the first matched occurrence?"
Consider the case where the value in the first "bar section" is empty: the first match would then be the value of the next bar section.
Example:
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
And I don't want that. In that case it should return nothing (no match).
What's the correct regex to get such a match?
Thank you :).
This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
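To make the splitting idea a bit more concrete, here is a minimal sketch that turns the records into a dictionary, so an empty value stays empty instead of picking up the next record. It assumes every record has the shape text|key|value; parse_fields is a hypothetical helper name, and the sample string is a shortened version of the one in the question.

```python
def parse_fields(text):
    """Split 'marker|key|value' records separated by '|#|' into a dict."""
    fields = {}
    for record in text.split('|#|'):
        # maxsplit=2 keeps any further '|' inside the value untouched.
        parts = record.split('|', 2)
        if len(parts) == 3:
            _marker, key, value = parts
            fields[key] = value
    return fields


sample = ('text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|'
          'text|p1_2_1120170AS074192161A0Z20|Huawei|#|'
          'text|p1_8_1120170AS074192161A0Z20||#|'
          'text|site_id|20MJK110')

fields = parse_fields(sample)
print(fields['p1_1_1120170AS074192161A0Z20'])        # value of the first record
print(repr(fields['p1_8_1120170AS074192161A0Z20']))  # empty value stays empty
```

Looking a value up by its key also avoids relying on the position of the record in the string.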
import re
doc = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
re.findall(r'[^|]+(?=\|#\|)', doc)
In the regular expression:
[^|]+ finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (the text must follow, but it is not included in the result)
About the pattern you tried
This part of the pattern, [^|]+, matches one or more chars other than |
Then (?=\|\#\|.*|$) asserts, using a positive lookahead, that what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts that what is on the left is p1_1_ followed by any chars except a newline, using a quantifier in the lookbehind.
As the pattern is not anchored, you will get all the matches for this logic, because the p1_1_ assertion is true everywhere after it: p1_1_ precedes all the |#| parts.
Note that using a quantifier in the lookbehind requires the PyPI regex module; the built-in re module only supports fixed-width lookbehinds.
If you want the first match using a quantifier in the positive lookbehind, you could for example use an anchor in combination with a negative lookahead to not cross the |#|, or match || in case the value is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
You could also keep your original pattern and use search (again with the regex module) to get only the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern, and you can omit .* from the positive lookahead.
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, match and capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline, as few times as possible
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#| match any char, or match 2 times ||
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (this will contain your value)
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#| or assert the end of the string

Check in python if self designed pattern matches

I have a pattern which looks like:
abc*_def(##)
and I want to check whether it matches some strings.
For example, it matches:
abc1_def23
abc10_def99
but does not match:
abc9_def9
So the * stands for a number with one or more digits.
The # stands for a single digit.
I want the value in the parentheses as the result.
What would be the easiest and simplest solution for this problem?
Replace the * and # with regular expressions and then check whether the result matches?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
Based on your requirements, I would go for a regex, for the simple reason that it's already available and tested, so it's the easiest option, as you were asking.
The only "complicated" thing in your requirements is avoiding, after def, the same digits you have after abc.
This can be done with a negative lookahead on a backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b matches word boundaries; if you enclose your regex between two \b you will restrict your search to whole words, i.e. text delimited by spaces, punctuation etc.
abc matches the string abc
\d+ matches one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the parentheses define \d+ as something you want to "remember" inside your regex; it's somewhat similar to defining a variable; in this case, (\d+) is your first group since you defined no other groups before it (i.e. to its left)
_def matches the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def". \1 represents the first group, while (?!whatever) is a check that succeeds if what follows the current position is NOT (the negation is given by !) whatever you want to negate.
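A short sketch of that regex in use, with the strings from the question plus one extra string (abc1_def12) that I added to show a side effect. def_digits is a hypothetical helper name.

```python
import re

# Group 1: digits after abc. Group 2: one or two digits after _def that
# must not start with the same digits as group 1.
PATTERN = re.compile(r'\babc(\d+)_def((?!\1)\d{1,2})\b')


def def_digits(s):
    """Return the digits after _def, or None when the string is rejected."""
    m = PATTERN.search(s)
    return m.group(2) if m else None


for s in ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc1_def12']:
    print(s, def_digits(s))
```

Note that the lookahead only checks a prefix: abc1_def12 is rejected as well, because the digits after _def start with the 1 captured after abc. Whether that is wanted depends on how the rule in the question is read.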
I had the hardest time getting this to work. The trick was the $
#!python2
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
    if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
        print(item + ' is a match')
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could, for example, use search to check for a match and use .group(1) to get the digits between the parentheses.
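If the pattern really has to stay in the abc*_def(##) notation, one possible sketch is to translate it into a regex and then read .group(1). compile_template and extract are hypothetical helper names, and the translation rules (* is one or more digits, # is one digit, parentheses become a capturing group, everything else is literal) are taken from the question.

```python
import re


def compile_template(template):
    """Translate the custom notation into a compiled regular expression."""
    parts = []
    for ch in template:
        if ch == '*':
            parts.append(r'\d+')       # a number with one or more digits
        elif ch == '#':
            parts.append(r'\d')        # exactly one digit
        elif ch in '()':
            parts.append(ch)           # keep as a capturing group
        else:
            parts.append(re.escape(ch))  # everything else is literal text
    return re.compile(''.join(parts))


rx = compile_template('abc*_def(##)')


def extract(s):
    """Return the value inside the parentheses, or None if s does not match."""
    m = rx.fullmatch(s)
    return m.group(1) if m else None


for s in ['abc1_def23', 'abc10_def99', 'abc9_def9']:
    print(s, extract(s))
```

fullmatch is used instead of search so that trailing digits, as in abc9_def9288, do not slip through.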
You could also add word boundaries:
\babc\d+_def(\d{2})\b

Extract Substring values with key pattern

Probably a regex question (forgive my broken English).
I need to identify a substring that starts with a certain value.
For example, take the following string:
"Select 1 from user.table1 inner join user.table2..."
I need to extract all the words that come after "user." and end at a blank space. So, after applying this unknown regex to the string above, it would produce the following result:
table1
table2
I tried to use the re.findall function, but couldn't find a way to specify the start and end patterns.
So, how can I extract the substrings using a starting pattern?
Try a positive lookbehind:
import re
pattern=r'(?<=user\.)(\w+)?\s'
string_1="Select 1 from user.table1 inner join user.table2 ..."
match=re.findall(pattern,string_1)
print(match)
output:
['table1', 'table2']
Regex information:
(?<=user\.)(\w+)?\s
Positive lookbehind (?<=user\.)
Asserts that the regex below matches immediately before the current position
user matches the characters user literally (case sensitive)
\. matches the character . literally
1st capturing group (\w+)?
? Quantifier: matches between zero and one times, as many times as possible, giving back as needed (greedy)
\w+ matches one or more word characters (\w is equal to [a-zA-Z0-9_] for ASCII text)
\s matches a single whitespace character
If that pattern doesn't work, try this: (?<=user\.)\w+
You can try it like this:
re.findall(r'\buser\.(..*?)\b',
"Select 1 from user.table1 inner join user.table2...")
This will return:
['table1', 'table2']
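For comparison, a slightly simpler variant that I would try first: a plain capturing group with \w+, which stops at the first character that is not a letter, digit or underscore. It assumes table names only contain word characters.

```python
import re

sql = "Select 1 from user.table1 inner join user.table2..."

# findall returns the content of the single capturing group for each match.
tables = re.findall(r'\buser\.(\w+)', sql)
print(tables)
```

This also works when the name is followed by something other than a space, such as the dots at the end of the sample string.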

Match only the string that has strings after last underscore

I am trying to match strings with underscores. There are underscores throughout the strings, but I only want to match the strings that have more text after the last underscore of a common prefix. Let me provide an example:
s = "hello_world"
s1 = "hello_world_foo"
s2 = "hello_world_foo_boo"
In my case I only want to capture s1 and s2.
I started with the following, but can't figure out how to match only the strings that have more text after hello_world's underscore.
rgx = re.compile(ur'(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)$', re.I | re.U)
Try this:
reobj = re.compile("^(?P<firstpart>[a-z]+)_(?P<secondpart>[a-z]+)_(?P<lastpart>.*?)$", re.IGNORECASE)
result = reobj.findall(subject)
Regex Explanation
^(?P<firstpart>[a-z]+)_(?P<secondpart>[a-z]+)_(?P<lastpart>.*?)$
Options: case insensitive
Assert position at the beginning of the string «^»
Match the regular expression below and capture its match into backreference with name “firstpart” «(?P<firstpart>[a-z]+)»
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference with name “secondpart” «(?P<secondpart>[a-z]+)»
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference with name “lastpart” «(?P<lastpart>.*?)»
Match any single character that is not a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
If I understand what you are asking for (you want to match strings with more than one underscore and following text), try:
rgx = re.compile(ur'(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)_[^_]+$', re.I | re.U)
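A minimal Python 3 sketch of the same idea, run against the three strings from the question. Note that the ur'' prefix used above only exists in Python 2; the [^_]+ parts and the use of fullmatch are my own simplifications, assuming the first two parts never contain an underscore.

```python
import re

# Two underscore-free parts, then at least one more character after the
# second underscore.
rgx = re.compile(r'(?P<firstpart>[^_]+)_(?P<secondpart>[^_]+)_(?P<lastpart>.+)')

for s in ["hello_world", "hello_world_foo", "hello_world_foo_boo"]:
    m = rgx.fullmatch(s)
    print(s, m.groupdict() if m else None)
```

Only s1 and s2 from the question should match; for hello_world_foo_boo the last part is everything after the second underscore.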

What does "?:" mean in a Python regular expression?

Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)
(?:...) is a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
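A tiny illustration of the difference, using a shortened two-pair string of my own rather than the full MAC address: both patterns match the same text, but only the capturing version exposes the second pair as a group.

```python
import re

text = "00:07"

# Same pattern twice; only the second group differs: (...) versus (?:...).
capturing = re.search(r'([0-9A-Fa-f]{2})([:-][0-9A-Fa-f]{2})', text)
non_capturing = re.search(r'([0-9A-Fa-f]{2})(?:[:-][0-9A-Fa-f]{2})', text)

print(capturing.group(0), capturing.groups())          # two groups available
print(non_capturing.group(0), non_capturing.groups())  # only one group
```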
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outermost pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a pair ({2}) of hexadecimal digits, each one a digit (\d) or a letter A-F or a-f. It is followed by a non-capturing group whose content is a colon or dash (: or -) followed by another pair of hex digits, with the whole group repeated exactly 5 times.
import re
p = re.compile(r'([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})')
m = p.match("00:07:32:12:ac:de:ef")
if m:
    print(m.group(1))
The last line should print the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant and if omitted, m.group(0) would work (it works even with them). If you need to match 7 pairs, then you change the 5 into a 6. If you need to reject strings with extra pairs, then you'd put anchors into the regex:
p = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$')
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
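The effect of the anchors and of the repeat count can be checked with a short sketch like this one. matches is a hypothetical helper that builds the anchored pattern for a given repeat count.

```python
import re

mac = "00:07:32:12:ac:de:ef"  # seven pairs, as in the question


def matches(repeat, s):
    """True if s is exactly one pair plus `repeat` separator-and-pair groups."""
    pattern = r'^[\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){%d}$' % repeat
    return re.match(pattern, s) is not None


# Unanchored: the first six pairs match and the rest is ignored.
unanchored = re.match(r'[\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}', mac)
print(unanchored.group(0))

print(matches(5, mac))  # anchored, six pairs expected
print(matches(6, mac))  # anchored, seven pairs expected
```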
Using ?: as in (?:...) makes the group non-capturing, which matters when you refer to groups, for example during replace. During find it makes no difference to what is matched.
Your RegEx means
r"""
( # Match the regular expression below and capture its match into backreference number 1
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
(?: # Match the regular expression below
[:-] # Match a single character present in the list below
# The character “:”
# The character “-”
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
){5} # Exactly 5 times
)
"""
Hope this helps.
It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
    print(i)
    print(i.group(1)) # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
(?:...) means a non-capturing group. The group will not be captured.
