Extract Substring values with key pattern - python

This is probably a regex question (forgive my broken English).
I need to identify substrings that start with a certain value.
For example, take the following string:
"Select 1 from user.table1 inner join user.table2..."
I need to extract all the words that start with "user." and end at a blank space. So, after applying this unknown regex to the above string, it would produce the following result:
table1
table2
I tried to use the re.findall function, but couldn't find a way to specify the start and end patterns.
So, how can I extract the substrings using a starting pattern?

Try a positive lookbehind:
import re
pattern=r'(?<=user\.)(\w+)?\s'
string_1="Select 1 from user.table1 inner join user.table2 ..."
match=re.findall(pattern,string_1)
print(match)
output:
['table1', 'table2']
Regex information:
(?<=user\.)(\w+)?\s
Positive lookbehind (?<=user\.)
Asserts that the text immediately before the current position matches the pattern below
user matches the characters user literally (case sensitive)
\. matches the character . literally
1st capturing group (\w+)?
? quantifier: matches between zero and one times, as many times as possible, giving back as needed (greedy)
\w+ matches one or more word characters (\w is equal to [a-zA-Z0-9_] for ASCII text)
\s matches a single whitespace character, so a name at the very end of the string is not matched
If that pattern doesn't work, try this one, which does not require trailing whitespace: (?<=user\.)\w+
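To see how the two patterns in this answer differ, here is a minimal sketch. The sample string is an assumption: it is the query from the question with the trailing dots removed, so that the last table name sits at the very end of the string.

```python
import re

# Assumed sample: nothing follows the last table name.
query = "Select 1 from user.table1 inner join user.table2"

# First pattern: needs a whitespace character after the name.
with_space = re.findall(r'(?<=user\.)(\w+)?\s', query)

# Fallback pattern: only needs "user." right before the name.
without_space = re.findall(r'(?<=user\.)\w+', query)

print(with_space)     # ['table1']
print(without_space)  # ['table1', 'table2']
```

If names can appear at the end of the input, the second pattern is probably the safer choice.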

You can try it like this:
re.findall(r'\buser\.(..*?)\b',
"Select 1 from user.table1 inner join user.table2...")
This will return:
['table1', 'table2']
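If the prefix is not always user, the same capturing-group idea can be wrapped in a small function. This is only a sketch: names_after_prefix is a hypothetical helper, and it uses \w+ instead of the lazy ..*? above, which assumes the names consist of word characters only.

```python
import re

def names_after_prefix(text, prefix):
    """Return the word following '<prefix>.' for every occurrence (hypothetical helper)."""
    # re.escape keeps any special characters in the prefix literal.
    pattern = r'\b' + re.escape(prefix) + r'\.(\w+)'
    return re.findall(pattern, text)

query = "Select 1 from user.table1 inner join user.table2..."
tables = names_after_prefix(query, "user")
print(tables)  # ['table1', 'table2']
```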

Related

Using Regex to search for a string unless it finds another string first

Hello I'm trying to use regex to search through a markdown file for a date and only get a match if it finds an instance of a specific string before it finds another date.
This is what I have right now and it definitely doesn't work.
(\d{2}\/\d{2}\/\d{2})(string)?(^(\d{2}\/\d{2}\/\d{2}))
In this instance it should match, since the string comes before the next date:
01/20/20
string
01/21/20
Here it shouldn't match since the string is after the next date:
01/20/20
this isn't the phrase you're looking for
01/21/20
string
Any help on this would be greatly appreciated.
You could match a date-like pattern, then use a tempered greedy token (?:(?!\d{2}\/\d{2}\/\d{2}).)* to reach string without crossing another date first.
Once string has been matched, use a non-greedy dot .*? to match up to the first occurrence of the next date.
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string.*?\d{2}\/\d{2}\/\d{2}
For example (using re.DOTALL to make the dot match a newline):
import re
regex = r"\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}"
# Adjacent string literals inside parentheses are joined into one string.
test_str = ("01/20/20\n\n"
            "string\n\n"
            "01/21/20\n\n"
            "01/20/20\n\n"
            "this isn't the phrase you're looking for\n\n"
            "01/21/20\n\n"
            "string")
print(re.findall(regex, test_str, re.DOTALL))
Output
['01/20/20\n\nstring\n\n01/21/20']
If string must not occur twice between the dates, you might use
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}|string).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}
Note that if you don't want the string and the dates to be part of a larger word, you could add word boundaries \b
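The word-boundary note can be sketched as follows. The names DATE and pattern and both sample texts are assumptions made for illustration; the pattern keeps the tempered-token structure from this answer and only adds \b around the dates and the phrase.

```python
import re

# Date-like token wrapped in word boundaries.
DATE = r"\b\d{2}/\d{2}/\d{2}\b"

# date, anything that is not a date, the whole word "string",
# anything that is not a date, then the next date.
pattern = re.compile(
    DATE + r"(?:(?!" + DATE + r").)*\bstring\b(?:(?!" + DATE + r").)*" + DATE,
    re.DOTALL,
)

whole_word = "01/20/20\nstring\n01/21/20"
inside_word = "01/20/20\nsubstrings\n01/21/20"

print(bool(pattern.search(whole_word)))   # True
print(bool(pattern.search(inside_word)))  # False: "string" is only part of a larger word
```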
One approach here would be to use a tempered dot to ensure that the regex engine does not cross over the ending date while trying to find the string after the starting date. For example:
import re

inp = """01/20/20
string # <-- this is matched
01/21/20
01/20/20
01/21/20
string""" # <-- this is not matched
matches = re.findall(r'01/20/20(?:(?!\b01/21/20\b).)*?(\bstring\b).*?\b01/21/20\b', inp, flags=re.DOTALL)
print(matches)
This prints ['string'], a single match: the first occurrence, which legitimately sits between the starting and ending dates.

Find something between parentheses

I have a string like this:
LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)
I want to match only OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0), but the OR could be LD as well. _080T_SAF_OUT can vary, but it is always alphanumeric, sometimes with underscores. COIL(xxSF[4].Flt[120].0) must always be in the format COIL(xxSF["digits"].Flt["digits"]."digits").
I am trying to use the re library of Python 2.7.
m = re.search('\OR|\LD'+'\('+'.+'+'\)'+'+'\COIL+'\('+'\xxSF+'\['+'\d+'+'\].'+ Flt\['+'\d+'+'\]'+'\.'+'\d+', Text)
My Output:
OR(abc_TEST_X)LD(xxSF[16].Flt[0].22
OR
LD(TEST_X_dsfa)OR(WASS_READY)COIL(xxSF[16].Flt[11].10
The first one is the right one, and I am getting it. I want to discard the second and the third ones.
I think that the problem is here:
'\('+'.+'+'\)'
I just want to find something alphanumeric, possibly with symbols, between the first pair of parentheses, and I am not restricting the match to that.
You should group alternations like (?:LD|OR), and to match any chars other than ( and ) you may use [^()]* rather than .+ (.+ matches any chars, as many as possible, hence it matches across parentheses).
Here is a Python demo:
import re
Text = 'LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)'
m = re.search(r'(?:OR|LD)\([^()]*\)COIL\(xxSF\[\d+]\.Flt\[\d+]\.\d+', Text)
if m:
    print(m.group()) # => OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0
Pattern details
(?:OR|LD) - a non-capturing group matching OR or LD
\( - a ( char
[^()]* - a negated character class matching 0+ chars other than ( and )
\)COIL\(xxSF\[ - )COIL(xxSF[ substring
\d+ - 1+ digits
]\.Flt\[ - ].Flt[ substring
\d+]\.\d+ - 1+ digits, ]. substring and 1+ digits
TIP Add a \b before (?:OR|LD) to match them as whole words (not as part of NOR and NLD).
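A short sketch of what the \b in the tip changes. The input starting with NOR is an assumed example, and loose and strict are illustrative names.

```python
import re

# Shared tail of the pattern from the answer above.
body = r'\([^()]*\)COIL\(xxSF\[\d+]\.Flt\[\d+]\.\d+'
loose = re.compile(r'(?:OR|LD)' + body)      # no word boundary
strict = re.compile(r'\b(?:OR|LD)' + body)   # OR/LD must start a word

sample = 'NOR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)'  # assumed input
loose_match = loose.search(sample)
strict_match = strict.search(sample)

print(loose_match.group())  # OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0
print(strict_match)         # None
```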
Thanks, I am capturing everything I want. There is just one more thing to filter. Take a look at some outputs:
OR(_1B21_A53021_2_En)OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1B21_A53021_2_En)LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
I only want to capture the last "LD" or "OR", as follows:
OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
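For what it's worth, the pattern from the answer seems to behave this way already: [^()]* cannot cross a parenthesis, so an OR(...) or LD(...) that is not directly followed by COIL( cannot start a match. A quick check on the two lines above (the trailing \) in the pattern is an addition made here so that the closing parenthesis is included):

```python
import re

pattern = re.compile(r'\b(?:OR|LD)\([^()]*\)COIL\(xxSF\[\d+]\.Flt\[\d+]\.\d+\)')

lines = [
    'OR(_1B21_A53021_2_En)OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);',
    'LD(_1B21_A53021_2_En)LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);',
]

# search() skips the first OR(...)/LD(...) because COIL( does not follow it.
results = [pattern.search(line).group() for line in lines]
for result in results:
    print(result)
# OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3)
# LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3)
```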

Check in python if self designed pattern matches

I have a pattern which looks like:
abc*_def(##)
and I want to check whether it matches some strings.
E.g. it matches:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number with one or more digits.
The # stands for a single digit.
I want the value in the parentheses as the result.
What would be the easiest and simplest solution for this problem?
Replace the * and # with regex expressions and then check whether they match?
Like this:
pattern = pattern.replace('*', '[0-9]+')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is avoiding after def the same digit you have after abc.
This can be done with a negative lookahead that contains a backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b matches word boundaries; if you enclose your regex between two \b you will restrict your search to whole words, i.e. text delimited by spaces, punctuation, etc.
abc matches the string abc
\d+ matches one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the parentheses define \d+ as something you want to "remember" inside your regex, somewhat like defining a variable; in this case, (\d+) is your first group since no other group is defined before it (i.e. to its left)
_def matches the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def". \1 represents the first group, while (?!whatever) is a check that succeeds if what follows the current position is NOT (the negation is given by !) whatever you want to negate.
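A minimal sketch of that regex against the three strings from the question, under this answer's reading of the requirement (the digits after _def must not begin with the digits captured after abc). The results dictionary is only there for illustration.

```python
import re

pattern = re.compile(r'\babc(\d+)_def((?!\1)\d{1,2})\b')

results = {}
for candidate in ['abc1_def23', 'abc10_def99', 'abc9_def9']:
    m = pattern.search(candidate)
    # Group 2 holds the digits after _def; None means no match.
    results[candidate] = m.group(2) if m else None

print(results)
# {'abc1_def23': '23', 'abc10_def99': '99', 'abc9_def9': None}
```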
I had the hardest time getting this to work. The trick was the $
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
    # $ anchors the two digits to the end of the string
    if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
        print('%s is a match' % item)  # runs on Python 2 and 3
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could, for example, use re.search to check for a match and use .group(1) to get the digits from the capturing group.
You could also add word boundaries:
\babc\d+_def(\d{2})\b
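The question's own idea, translating * and # into regex pieces, could be sketched like this. compile_mini_pattern is a hypothetical helper; it assumes that * means one or more digits, that # means exactly one digit, and that parentheses in the mini pattern mark the value to return.

```python
import re

def compile_mini_pattern(mini):
    """Translate a pattern such as 'abc*_def(##)' into a compiled regex (hypothetical helper)."""
    parts = []
    for ch in mini:
        if ch == '*':
            parts.append(r'\d+')          # one or more digits
        elif ch == '#':
            parts.append(r'\d')           # exactly one digit
        elif ch in '()':
            parts.append(ch)              # keep as a capturing group
        else:
            parts.append(re.escape(ch))   # everything else is literal
    return re.compile('^' + ''.join(parts) + '$')

rx = compile_mini_pattern('abc*_def(##)')
for candidate in ['abc1_def23', 'abc10_def99', 'abc9_def9']:
    m = rx.match(candidate)
    print(candidate, '->', m.group(1) if m else None)
# abc1_def23 -> 23
# abc10_def99 -> 99
# abc9_def9 -> None
```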

Regular expression not contains a specific patterns python

I'm working with regular expressions in Python, and I have the following strings that I need to parse:
XCT_GRUPO_INVESTIGACION_F1.sql
XCT_GRUPO_INVESTIGACION_F2.sql
XCT_GRUPO_INVESTIGACION.sql
XCS_GRUPO_INVESTIGACION.sql
I need to match all the strings that start with ??T, but the string must not contain something like F1, F34, constraints and others.
So I've the following pattern
([a-zA-Z][a-zA-Z][tT]_([a-zA-Z]).*.(sql|SQL)$)
[a-zA-Z][a-zA-Z][tT]_ = the first and second characters can be any letters, but they must be followed by t_ or T_
([a-zA-Z]).* = any value a-z and A-Z, any number of times
(sql|SQL)$ = must end with sql or SQL
I get something like
ICT_GRUPO_INVESTIGACION_F1.sql
ICT_GRUPO_INVESTIGACION_F2.sql
ICT_GRUPO_INVESTIGACION.sql
But this contains F1, F?, constraints and others.
How can I tell the regular expression that the part ([a-zA-Z]).* must not contain f1 | f? | others_expresion_that_Iwanna
This regular expression should work:
([a-zA-Z][a-zA-Z][tT]_(?:(?!_F[0-9]).)*?\.(sql|SQL))
You may put any number of unwanted combinations here (?!_F[0-9]|other_expression|...)
The regular expression has the following parts:
[a-zA-Z]          #match any letter
[a-zA-Z]          #match any letter
[tT]_             #match 't_' or 'T_'
(?:               #start non-capturing group
  (?!_F[0-9])     #negative lookahead, asserts that what immediately
                  #follows the current position in the string is not _F[0-9]
  .               #match any single character
)*?               #end group, repeat it multiple times but as few as possible
\.                #match period character
(sql|SQL)         #match 'sql' or 'SQL'
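As a quick check, here is a sketch that applies this pattern to the four file names from the question. Using re.match (anchored at the start of each name) and the list name kept are choices made for this example.

```python
import re

pattern = re.compile(r'([a-zA-Z][a-zA-Z][tT]_(?:(?!_F[0-9]).)*?\.(sql|SQL))')

names = [
    'XCT_GRUPO_INVESTIGACION_F1.sql',
    'XCT_GRUPO_INVESTIGACION_F2.sql',
    'XCT_GRUPO_INVESTIGACION.sql',
    'XCS_GRUPO_INVESTIGACION.sql',
]

# Keep only names with T/t in third position and no _F<digit> before the extension.
kept = [name for name in names if pattern.match(name)]
print(kept)  # ['XCT_GRUPO_INVESTIGACION.sql']
```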

What does "?:" mean in a Python regular expression?

Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)
(?:...) means a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outermost pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a digit (\d) or one of the hexadecimal digits A-F or a-f, twice in a row ({2}). It is followed by a non-capturing group that matches a colon or dash (: or -) and then another pair of hex digits, with the whole group repeated exactly 5 times.
import re
p = re.compile(r'([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})')
m = p.match("00:07:32:12:ac:de:ef")
if m:
    print(m.group(1))
The last line should print the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant and if omitted, m.group(0) would work (it works even with them). If you need to match 7 pairs, then you change the 5 into a 6. If you need to reject strings that do not consist of exactly that many pairs, then you'd put anchors into the regex:
p = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$')
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
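The three cases described in this answer can be checked with a short sketch; the variable names are illustrative only.

```python
import re

mac = "00:07:32:12:ac:de:ef"  # seven pairs, as in the question
pair = r'[\dA-Fa-f]{2}'

unanchored = re.compile(pair + r'(?:[:-]' + pair + r'){5}')
anchored_5 = re.compile(r'^' + pair + r'(?:[:-]' + pair + r'){5}$')
anchored_6 = re.compile(r'^' + pair + r'(?:[:-]' + pair + r'){6}$')

first_six = unanchored.match(mac).group(0)
print(first_six)                       # 00:07:32:12:ac:de
print(anchored_5.match(mac))           # None: the string has a seventh pair
print(anchored_6.match(mac).group(0))  # 00:07:32:12:ac:de:ef
```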
Using ?: as in (?:...) makes the group non-capturing: it is not numbered, so it cannot be referenced in a replacement string. It does not change what the regex matches during a search.
Your RegEx means
r"""
(                     # Match the regular expression below and capture its match into backreference number 1
    [\dA-Fa-f]        # Match a single character present in the list below
                      #   A single digit 0..9
                      #   A character in the range between “A” and “F”
                      #   A character in the range between “a” and “f”
        {2}           #   Exactly 2 times
    (?:               # Match the regular expression below
        [:-]          # Match a single character present in the list below
                      #   The character “:”
                      #   The character “-”
        [\dA-Fa-f]    # Match a single character present in the list below
                      #   A single digit 0..9
                      #   A character in the range between “A” and “F”
                      #   A character in the range between “a” and “f”
            {2}       #   Exactly 2 times
    ){5}              # Exactly 5 times
)
"""
Hope this helps.
It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
    print(i)
    print(i.group(1)) # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
(?:...) means a non-capturing group. The group will not be captured.
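One place where the difference becomes visible is re.findall, which returns the captured groups when the pattern has any and the whole match otherwise. A minimal sketch, reusing the John Wick example from above with illustrative variable names:

```python
import re

text = 'John Wick'
capturing = re.compile(r'John(\sWick)')
non_capturing = re.compile(r'John(?:\sWick)')

# Number of capturing groups in each pattern.
print(capturing.groups, non_capturing.groups)  # 1 0

with_group = capturing.findall(text)
without_group = non_capturing.findall(text)
print(with_group)     # [' Wick']     (content of group 1)
print(without_group)  # ['John Wick'] (the whole match)
```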
