Python Regex Subgroup Capturing - python

I'm trying to parse the following string:
constructor: function(some, parameters, here) {
With the following regex:
re.search("(\w*):\s*function\((?:(\w*)(?:,\s)*)*\)", line).groups()
And I'm getting:
('constructor', '')
But I was expecting something more like:
('constructor', 'some', 'parameters', 'here')
What am I missing?

If you change your pattern to:
print(re.search(r"(\w*):\s*function\((?:(\w+)(?:,\s)?)*\)", line).groups())
You'll get:
('constructor', 'here')
This is because (from docs):
If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
If you can do this in one step, I don't know how. Your alternative, of course is to do something like:
def parse_line(line):
    cons, args = re.search(r'(\w*):\s*function\((.*)\)', line).groups()
    mats = re.findall(r'(\w+)(?:,\s*)?', args)
    return [cons] + mats
print(parse_line(line))  # ['constructor', 'some', 'parameters', 'here']

One option is to use the more advanced regex module instead of the stock re. Among other nice things, it supports captures, which, unlike groups, save every matching substring:
>>> line = "constructor: function(some, parameters, here) {"
>>> import regex
>>> regex.search("(\w*):\s*function\((?:(\w+)(?:,\s)*)*\)", line).captures(2)
['some', 'parameters', 'here']

The re module doesn't support repeated captures: the group count is fixed. Possible workarounds include:
1) Capture the parameters as a string and then split it:
match = re.search("(\w*):\s*function\(([\w\s,]*)\)", line).groups()
args = [arg.strip() for arg in match[1].split(",")]
2) Capture the parameters as a string and then findall it:
match = re.search("(\w*):\s*function\(([\w\s,]*)\)", line).groups()
args = re.findall("(\w+)(?:,\s)*", match[1])
3) If your input string has already been verified, you can just findall the whole thing:
re.findall("(\w+)[:,)]", string)
Alternatively, you can use the regex module and captures(), as suggested by @georg.
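The capture-then-split workaround can be sketched as a small Python 3 function. The name parse_constructor is hypothetical, and the sketch assumes the parameters are plain identifiers separated by commas; it returns None when the line does not match rather than raising.

```python
import re

# Hypothetical helper; assumes comma-separated identifier parameters.
SIGNATURE = re.compile(r"(\w+):\s*function\(([\w\s,]*)\)")

def parse_constructor(line):
    match = SIGNATURE.search(line)
    if match is None:
        return None
    name, params = match.groups()
    # Split the single captured parameter string; drop empty pieces
    # so that an empty parameter list yields no arguments.
    args = [p.strip() for p in params.split(",") if p.strip()]
    return (name, *args)

print(parse_constructor("constructor: function(some, parameters, here) {"))
# ('constructor', 'some', 'parameters', 'here')
```

The group count stays fixed at two; the variable number of parameters is handled by ordinary string splitting after the match.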

You might need two operations here (search and findall):
[re.search(r'[^:]+', given_string).group()] + re.findall(r'(?<=[ (])\w+?(?=[,)])', given_string)
Output: ['constructor', 'some', 'parameters', 'here']

Related

Extract number from a string using a pattern

I have strings like :
's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'
And from it I would like to obtain a tuple contain the year value and the month value as first and second element of my tuple.
('2019', '5')
For now I did this :
([elem.split('=')[-1:][0] for elem in part[0].split('/')[-2:]][0], [elem.split('=')[-1:][0] for elem in part[0].split('/')[-2:]][1])
It isn't very elegant, how could I do better ?
Use re.findall with the following regex pattern:
import re
matches = re.findall(r'(?i)/year=(\d+)/month=(\d+)', string)
Result:
# print(matches)
[('2019', '5')]
Regular expressions can do this. Capture the strings 'year=2019' and 'month=5', then split each of them on the character '=' and take the item at index [-1]:
import re
search_string = 's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'
string1 = re.findall(r'year=\d+', search_string)
string2 = re.findall(r'month=\d+', search_string)
result = (string1[0].split('=')[-1], string2[0].split('=')[-1])
print(result)
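A single re.search with named groups is another way to get the tuple directly. This is only a sketch: it assumes the path always contains /year=<digits>/month=<digits> in that order, and the function name year_month is made up for illustration.

```python
import re

PARTITION = re.compile(r"/year=(?P<year>\d+)/month=(?P<month>\d+)")

def year_month(path):
    m = PARTITION.search(path)
    # Return None instead of raising when the path has no year/month part.
    return (m.group("year"), m.group("month")) if m else None

path = ("s3://bukcet_name/tables/name=moonlight/land/"
        "timestamp=2020-06-25 01:00:23.180745/year=2019/month=5")
print(year_month(path))  # ('2019', '5')
```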

How to regex search strings in different order in python

I have a script that takes in an argument and tries to find a match using regex. On single values, I don't have any issues, but when I pass multiple words, the order matters. What can I do so that the regex returns no matter what the order of the supplied words are? Here is my example script:
import re
from sys import argv
data = 'some things other stuff extra words'
pattern = re.compile(argv[1])
search = re.search(pattern, data)
print(search)
if search:
    print(search.group(0))
print(data)
So based on my example, if I pass "some things" as an arg, then it matches, but if I pass "things some", it doesn't match, and I would like it to. Optionally, I would like it to also return a match if either "some" or "things" matches.
The argument passed could possibly be a regex
I think you want something like this:
search = filter(None, (re.search(arg, data) for arg in argv[1].split()))
Or
search = re.search('|'.join(argv[1].split()), data)
With the first form you can then check the search results: if len(search) == len(argv[1].split()), all patterns matched, and if search is non-empty, at least one of them matched. (In Python 3, filter returns an iterator, so wrap the call in list() first.)
Ok, I think I got it, you can use a lookahead assertion like this:
>>> re.search('(?=.*thing)(?=.*some)', data)
You can obviously build such a regex programmatically:
re.search(''.join('(?=.*{})'.format(arg) for arg in argv[1].split()), data)
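If the words are meant literally rather than as regex fragments, escaping them first keeps characters such as . or + from being treated as metacharacters. The sketch below makes that assumption (the question notes the argument could itself be a regex, in which case re.escape should be dropped); contains_all and contains_any are hypothetical names.

```python
import re

def contains_all(words, data):
    # One lookahead per word: each must appear somewhere, in any order.
    pattern = "".join("(?=.*{})".format(re.escape(w)) for w in words)
    return re.search(pattern, data) is not None

def contains_any(words, data):
    # Alternation: at least one of the words must appear.
    pattern = "|".join(re.escape(w) for w in words)
    return re.search(pattern, data) is not None

data = "some things other stuff extra words"
print(contains_all("things some".split(), data))     # True
print(contains_all("things missing".split(), data))  # False
print(contains_any("things missing".split(), data))  # True
```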
I think it would be better to just create several regexes and match each of them against the string. If any of them matches, you return True.
If you are just trying to match constant strings, the in operator is enough:
'some' in data or 'things' in data
You could also just split the data text into sublists, and check if the ordering/reverse ordering of search exists in it:
import re
data = 'some, things other stuff extra words blah.'
search = "things, some"
def search_text(data, search):
    data_words = re.compile(r'\w+').findall(data)
    # ['some', 'things', 'other', 'stuff', 'extra', 'words', 'blah']
    search_words = re.compile(r'\w+').findall(search)
    # ['things', 'some']
    len_search = len(search_words)
    # Slide a window of len_search words across data_words, one word at a time.
    candidates = [data_words[i:i+len_search] for i in range(len(data_words) - len_search + 1)]
    # [['some', 'things'], ['things', 'other'], ['other', 'stuff'], ['stuff', 'extra'], ['extra', 'words'], ['words', 'blah']]
    return search_words in candidates or search_words[::-1] in candidates
print(search_text(data, search))
Which Outputs:
True

How to combine multiple regex into single one in python?

I'm learning about regular expression. I don't know how to combine different regular expression to make a single generic regular expression.
I want to write a single regular expression that works for multiple cases. I know this can be done naively with the or operator, "|".
I don't like this approach. Can anybody tell me better approach?
You can join all your patterns into one alternation and compile that. Check this example:
import re
re1 = r'\d+\.\d*[L][-]\d*\s[A-Z]*[/]\d*'
re2 = r'\d*[/]\d*[A-Z]*\d*\s[A-Z]*\d*[A-Z]*'
re3 = r'[A-Z]*\d+[/]\d+[A-Z]\d+'
re4 = r'\d+[/]\d+[A-Z]*\d+\s\d+[A-Z]\s[A-Z]*'
sentences = [string1, string2, string3, string4]  # your input strings
generic_re = re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4))  # compile once, outside the loop
for sentence in sentences:
    matches = generic_re.findall(sentence)
To findall with an arbitrary series of REs all you have to do is concatenate the list of matches which each returns:
re_list = [
    r'\d+\.\d*[L][-]\d*\s[A-Z]*[/]\d*',  # re1 in question
    ...
    r'\d+[/]\d+[A-Z]*\d+\s\d+[A-Z]\s[A-Z]*',  # re4 in question
]
matches = []
for r in re_list:
    matches += re.findall(r, string)
For efficiency it would be better to use a list of compiled REs.
Alternatively you could join the element RE strings using
generic_re = re.compile( '|'.join( re_list) )
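One caveat when joining patterns with |: alternation has the lowest precedence, and any capturing groups inside the parts change what findall returns. Wrapping each part in a non-capturing group avoids the precedence problem. A minimal sketch with made-up example patterns (combine is a hypothetical helper, not part of re):

```python
import re

def combine(patterns):
    # Wrap each pattern in (?:...) so the parts stay independent of each other.
    return re.compile("|".join("(?:{})".format(p) for p in patterns))

generic_re = combine([r"\d+/\d+", r"[A-Z]{2}-\d+"])
print(generic_re.findall("ref 12/34 and AB-567, also 8/9"))
# ['12/34', 'AB-567', '8/9']
```

Because the combined pattern has no capturing groups, findall returns the full text of each match.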
I see lots of people are using pipes, but that seems to only match the first instance. If you want to match all, then try using lookaheads.
Example:
>>> fruit_string = "10a11p"
>>> fruit_regex = r'(?=.*?(?P<pears>\d+)p)(?=.*?(?P<apples>\d+)a)'
>>> re.match(fruit_regex, fruit_string).groupdict()
{'apples': '10', 'pears': '11'}
>>> re.match(fruit_regex, fruit_string).group(0)
''
>>> re.match(fruit_regex, fruit_string).group(1)
'11'
(?= ...) is a look ahead:
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
.*?(?P<pears>\d+)p
finds a number followed by a p anywhere in the string and names the number "pears"
You might not need to compile both regex patterns. Here is a way, let's see if it works for you.
>>> import re
>>> text = 'aaabaaaabbb'
>>> A = 'aaa'
>>> B = 'bbb'
>>> re.findall(A+B, text)
['aaabbb']
>>>
If you need to squash multiple regex patterns together, the result can be annoying to parse unless you use named groups, (?P<name>...), and .groupdict(), but doing that can be pretty verbose and hacky. If you only need a couple of matches, then something like the following could be mostly safe:
bucket_name, blob_path = tuple(item for item in matches.groups() if item is not None)
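When each alternative gets its own named group, the match object's lastgroup attribute reports which alternative matched, so there is no need to filter None values by position. A sketch with hypothetical group names and input:

```python
import re

TOKEN = re.compile(r"(?P<number>\d+)|(?P<word>[A-Za-z]+)")

def classify(text):
    # lastgroup is the name of the last matched capturing group,
    # which here identifies the alternative that matched.
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(classify("abc 123 de"))
# [('word', 'abc'), ('number', '123'), ('word', 'de')]
```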

Python Regex Get String Between Two Substrings

First off, I know this may seem like a duplicate question, however, I could find no working solution to my problem.
I have string that looks like the following:
string = "api('randomkey123xyz987', 'key', 'text')"
I need to extract randomkey123xyz987 which will always be between api(' and ',
I was planning on using Regex for this, however, I seem to be having some trouble.
This is the only progress that I have made:
import re
string = "api('randomkey123xyz987', 'key', 'text')"
match = re.findall("\((.*?)\)", string)[0]
print(match)
This code returns 'randomkey123xyz987', 'key', 'text'
I have tried to use [^'], but my guess is that I am not properly inserting it into the re.findall function.
Everything that I am trying is failing.
Update: My current workaround is using [2:-4], but I would still like to avoid using match[2:-4].
If the string contains only one instance, use re.search() instead:
>>> import re
>>> s = "api('randomkey123xyz987', 'key', 'text')"
>>> match = re.search(r"api\('([^']*)'", s).group(1)
>>> print match
randomkey123xyz987
You want the string between the ( and the first comma, but you are capturing everything between the parens:
match = re.findall(r"api\((.*?),", string)
print(match)
["'randomkey123xyz987'"]
Or match between the quotes:
match = re.findall(r"api\('(.*?)'", string)
print(match)
['randomkey123xyz987']
If that is how your strings actually look you can split:
string = "api('randomkey123xyz987', 'key', 'text')"
print(string.split(",",1)[0][4:])
You should use the following regex:
api\('(.*?)'
This assumes that api( is a fixed prefix.
It matches api(', then captures what follows, up to the next ' character.
>>> re.findall(r"api\('(.*?)'", "api('randomkey123xyz987', 'key', 'text')")
['randomkey123xyz987']
If you are certain that randomkey123xyz987 will always be between "api('" and "',", then the split() method can get it done in one line, without any regex matching. It extracts the text between the starting and ending delimiters, which are "api('" and "',".
>>> string = "api('randomkey123xyz987', 'key', 'text')"
>>> value = (string.split("api('")[1]).split("',")[0]
>>> print value
randomkey123xyz987
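The regex version generalises to any pair of literal delimiters. A minimal sketch, assuming both delimiters are plain strings (so they are escaped) and that the first occurrence is the one wanted; between is a hypothetical helper name.

```python
import re

def between(text, start, end):
    # Escape the delimiters so characters like ( are matched literally.
    # .*? is non-greedy, so the capture stops at the first end delimiter.
    m = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text)
    return m.group(1) if m else None

s = "api('randomkey123xyz987', 'key', 'text')"
print(between(s, "api('", "',"))  # randomkey123xyz987
```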

How can I get part of regex match as a variable in python?

In Perl it is possible to do something like this (I hope the syntax is right...):
$string =~ m/lalala(I want this part)lalala/;
$whatIWant = $1;
I want to do the same in Python and get the text inside the parenthesis in a string like $1.
If you want to get parts by name you can also do this:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
>>> m.groupdict()
{'first_name': 'Malcom', 'last_name': 'Reynolds'}
The example was taken from the re docs
See: Python regex match objects
>>> import re
>>> p = re.compile("lalala(I want this part)lalala")
>>> p.match("lalalaI want this partlalala").group(1)
'I want this part'
import re
astr = 'lalalabeeplalala'
match = re.search('lalala(.*)lalala', astr)
whatIWant = match.group(1) if match else None
print(whatIWant)
A small note: in Perl, when you write
$string =~ m/lalala(.*)lalala/;
the regexp can match anywhere in the string. The equivalent is accomplished with the re.search() function, not the re.match() function, which requires that the pattern match starting at the beginning of the string.
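The difference is easy to see with a string that has text before the pattern. The sample string here is made up:

```python
import re

text = "prefix lalalabeeplalala suffix"
pattern = r"lalala(.*)lalala"

# re.match only tries at the start of the string, so it finds nothing here.
print(re.match(pattern, text))            # None
# re.search scans the whole string, like Perl's m// operator.
print(re.search(pattern, text).group(1))  # beep
```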
import re
data = "some input data"
m = re.search("some (input) data", data)
if m:  # the match was successful
    print(m.group(1))
Check the docs for more.
There's no need for regex. Think simple.
>>> "lalala(I want this part)lalala".split("lalala")
['', '(I want this part)', '']
>>> "lalala(I want this part)lalala".split("lalala")[1]
'(I want this part)'
>>>
import re
match = re.match('lalala(I want this part)lalala', 'lalalaI want this partlalala')
print(match.group(1))
import re
string_to_check = "other_text...lalalaI want this partlalala...other_text"
p = re.compile("lalala(I want this part)lalala") # regex pattern
m = p.search(string_to_check) # use p.match if what you want is always at beginning of string
if m:
    print(m.group(1))
While converting a Perl program that parses function names out of modules to Python, I ran into this problem: I received an error saying "group" was undefined. I soon realized that the exception was being thrown because p.match / p.search returns None if there is no matching string.
None has no group method, so calling it fails. To avoid the exception, check that a match was found before calling group.
import re
filename = './file_to_parse.py'
p = re.compile(r'def (\w*)')  # \w* greedily matches characters from [a-zA-Z0-9_]
for each_line in open(filename, 'r'):
    m = p.match(each_line)  # tries to match the pattern at the start of the line
    if m:
        m = m.group(1)
        print(m)
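For text that is already in memory, re.finditer with re.MULTILINE avoids the explicit None check, since it yields only successful matches. This is a sketch with a made-up source string; unlike the loop above, it also accepts indented defs and requires at least one name character (\w+ instead of \w*).

```python
import re

source = """def first(a):
    return a

class K:
    def method(self):
        pass

def second():
    pass
"""

# Under MULTILINE, ^ matches at the start of every line;
# [ \t]* allows the def to be indented.
DEF = re.compile(r"^[ \t]*def (\w+)", re.MULTILINE)
names = [m.group(1) for m in DEF.finditer(source)]
print(names)  # ['first', 'method', 'second']
```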
