Should I reuse a compiled regex? - python

this is a quick question:
How would I specify a regex which can be used several times with multiple match strings? I might not have worded that right, but I will try to show some code.
I have this regex:
regex = compile(r'(?=(%s))')
In a for loop, I will try and match the string I have to one I specify for the regex so that at each iteration, I can change the string being matched and it will try to match it.
So is this possible, can I do something like
regex.findall(myStringString, myMatchString)
in the code or would I have to recompile the regex in order for it to match a new string?
More clarification:
I want to do this:
re.findall('(?=(%s))' %myMatchString, mySearchString)
but because myMatchString will be changing at each iteration of the loop, I want to do it like this so I can match the new string:
regex = re.compile(r'(?=(%s))')
regex.findall( myMatchString, mySearchString)
Thanks for reading

Well, if I understand what you are asking, all you want to write is:
def match_on_list_of_strings(list_of_strings):
    regex = re.compile(r'(?=(%s))')
    for string in list_of_strings:
        yield regex.findall(string)
That applies the pattern to every string in the list, while the regex is compiled only once.
Aaaah... but you don't need a regex for that:
def match_on_list_of_strings(bigstring, list_of_strings):
    for string in list_of_strings:
        if string in bigstring:
            yield string
or if you really want to use re:
def match_on_list_of_strings(bigstring, list_of_strings):
    for string in list_of_strings:
        # re.escape makes any special characters in string match literally
        if re.search(re.escape(string), bigstring):
            yield string
And to answer your question: no, you can't compile the string being searched into a regex, only the other way round. You compile the pattern and then apply it to strings. Compiling a regex transforms the pattern text into an internal representation of the matching automaton. You might want to read up on NFAs and regular expressions.

The point of re.compile is to explicitly declare that you're going to reuse the same pattern again and again, and so avoid any recompilation that might otherwise be required.
Since you're not actually reusing the same pattern, you're better off letting the re module cache patterns (it caches a limited number of recently used patterns, but I can't remember exactly how many) and just calling re.findall(...), or whichever function you need, with the freshly built pattern each time.
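A minimal sketch of that approach, assuming the goal is to find overlapping occurrences of several plain substrings in one search string. The helper name find_overlapping and the sample data are made up for illustration, and re.escape is added on the assumption that the match strings are literal text rather than patterns.

```python
import re

def find_overlapping(match_strings, search_string):
    """Yield (match_string, matches) pairs, building the pattern per iteration."""
    for match_string in match_strings:
        # The lookahead is zero-width, so overlapping occurrences are found.
        pattern = '(?=(%s))' % re.escape(match_string)
        yield match_string, re.findall(pattern, search_string)

for needle, found in find_overlapping(['aa', 'ab'], 'aaab'):
    print(needle, found)
```

The pattern text changes on every iteration, so there is nothing to gain from a single re.compile call outside the loop.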

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regexes: (?:...), (941){1}, and escaping the parentheses with \ like this: \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if this can be done using regex. Thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
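To make the effect of re.escape concrete, here is a small sketch using the string from the question. The last pattern assumes the wanted word consists of ASCII letters only, which is an assumption for illustration and not something the question states.

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped, the parentheses form a capture group around the digits.
as_group = re.findall(num, s)

# Escaped, the parentheses are matched as literal characters.
as_literal = re.findall(re.escape(num), s)

# Assumption: the wanted word is made of ASCII letters only.
words = re.findall(re.escape(num) + r'\n([A-Za-z]+)\n', s)

print(as_group)    # ['941', '941']
print(as_literal)  # ['(941)', '(941)']
print(words)       # ['Rivet']
```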

parsing string with specific name in python

I have a string like this:
<name:john student male age=23 subject=\computer\sience_{20092973}>
I am confused by the ":" and "=" separators.
I want to parse this string
and split it into a list like this:
name:john
job:student
sex:male
age:23
subject:{20092973}
In other words, I want to parse a string into specific named fields (name, job, sex, etc.) in Python.
I have already searched, but I couldn't find anything, sorry.
How can I do this?
Thank you.
It's generally a good idea to give more than one example of the strings you're trying to parse. But I'll take a guess. It looks like your format is pretty simple, and primarily whitespace-separated. It's simple enough that using regular expressions should work, like this, where line_to_parse is the string you want to parse:
import re
# Raw string, so the backslashes reach the regex engine unchanged
matchval = re.match(r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})", line_to_parse)
matchgroups = matchval.groups()
Now matchgroups will be a tuple of the values you want. It should be trivial for you to take those and get them into the desired format.
If you want to do many of these, it may be worth compiling the regular expression; take a look at the re documentation for more on this.
As for the way the expression works: I won't go into regular expressions in general (that's what the re docs are for) but in this case, we want to get a bunch of strings that don't have any whitespace in them, and have whitespace between them, and we want to do something odd with the subject, ignoring all the text except the part between { and }.
Each "(...)" in the expression saves whatever is inside it as a group. Each "\S+" stands for one or more ("+") characters that aren't whitespace ("\S"), so "(\S+)" will match and save a string of length at least one that has no whitespace in it. Each "\s+" does the opposite: it has no parentheses around it, so it doesn't save what it matches, and it matches one or more ("+") whitespace characters ("\s"). This suffices for most of what we want.
At the end, though, we need to deal with the subject. "[...]" allows us to list multiple types of characters. "[^...]" is special, and matches anything that isn't in there. {, like [, (, and so on, can have a special meaning in a regular expression, so we escape it with \ to make it match literally, and in the end that means "[^\{]*" matches zero or more ("*") characters that aren't "{" ("[^\{]"). Since "*" and "+" are "greedy", and will try to match as much as they can while still letting the whole expression match, we now only need to deal with the last part. From what I've said before, it should be pretty clear what "(\{\S+\})" does: it matches and saves a literal "{", one or more non-whitespace characters, and a literal "}".
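As a rough sketch of the remaining step, the groups can be paired with field labels to produce the list from the question. The function name parse_record and the label tuple are assumptions made for this example, and the pattern is only known to fit the single sample line given.

```python
import re

# Same pattern as above, compiled once for reuse.
LINE_RE = re.compile(
    r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^{]*(\{\S+\})"
)

def parse_record(line):
    """Return ['name:...', 'job:...', ...] or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    labels = ("name", "job", "sex", "age", "subject")
    return ["%s:%s" % pair for pair in zip(labels, match.groups())]

sample = r"<name:john student male age=23 subject=\computer\sience_{20092973}>"
for field in parse_record(sample):
    print(field)
```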

Comments in string and strings in comments

I am trying to count the characters in comments in C code using Python and regex, but without success. I can remove strings first to get rid of comment markers inside strings, but this also removes strings inside comments, so the result is wrong. Is there a way to make a regex skip strings inside comments, or vice versa?
No, not really.
Regex is not the correct tool to parse nested structures like you describe; instead you will need to parse the C syntax (or the "dumb subset" of it you're interested in, anyway), and you might find regex helpful in that. A relatively simple state machine with three states (CODE, STRING, COMMENT) would do it.
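A minimal sketch of such a state machine, under several simplifying assumptions: only double-quoted string literals are handled, character literals and line continuations are ignored, and the count covers the comment body without its delimiters. The function name and the closer variable are made up for this example.

```python
def count_comment_chars(source):
    """Count characters inside C comments, skipping markers inside strings."""
    CODE, STRING, COMMENT = range(3)
    state = CODE
    closer = ""  # "\n" ends a // comment, "*/" ends a /* */ comment
    count = 0
    i = 0
    while i < len(source):
        two = source[i:i + 2]
        if state == CODE:
            if two in ("//", "/*"):
                state = COMMENT
                closer = "\n" if two == "//" else "*/"
                i += 2
                continue
            if source[i] == '"':
                state = STRING
        elif state == STRING:
            if source[i] == "\\":
                i += 2  # skip the backslash and the escaped character
                continue
            if source[i] == '"':
                state = CODE
        else:  # COMMENT
            if source.startswith(closer, i):
                state = CODE
                i += len(closer)
                continue
            count += 1
        i += 1
    return count

sample = 'char *s = "// not a comment"; /* "hi" */ // end\n'
print(count_comment_chars(sample))  # 10
```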
Regular expressions are not always a replacement for a real parser.
You can strip out all strings that aren't in comments by searching for the regular expression:
'[^'\r\n]+'|(//.*|/\*(?s:.*?)\*/)
and replacing with:
$1
Essentially, this searches for the regex string|(comment) which matches a string or a comment, capturing the comment. The replacement is either nothing if a string was matched or the comment if a comment was matched.
Though regular expressions are not a replacement for a real parser you can quickly build a rudimentary parser by creating a giant regex that alternates all of the tokens you're interested in (comments and strings in this case). If you're writing a bit of code to handle comments, but not those in strings, iterate over all the matches of the above regex, and count the characters in the first capturing group if it participated in the match.
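In Python, the same alternation idea might look like the sketch below. It is an adaptation rather than the exact pattern above: it assumes C strings are double-quoted and may contain backslash escapes, and the names TOKEN and count_comment_chars_re are invented for this example. Here the count includes the comment delimiters.

```python
import re

# First alternative: a string literal (matched but not captured).
# Second alternative: a comment, captured as group 1.
TOKEN = re.compile(r'"(?:\\.|[^"\\\n])*"|(//[^\n]*|/\*.*?\*/)', re.DOTALL)

def count_comment_chars_re(source):
    """Sum the lengths of all comments, ignoring comment markers in strings."""
    return sum(
        len(match.group(1))
        for match in TOKEN.finditer(source)
        if match.group(1) is not None  # group 1 took part: it is a comment
    )

sample = 'char *s = "// not a comment"; /* "hi" */ // end\n'
print(count_comment_chars_re(sample))  # 16
```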

String Simple Substitution

What's the easiest way of me converting the simpler regex format that most users are used to into the correct re python regex string?
As an example, I need to convert this:
string = "*abc+de?"
to this:
string = ".*abc.+de.?"
Of course I could loop through the string and build up another string character by character, but that's surely an inefficient way of doing this?
Those don't look like regexps you're trying to translate, they look more like unix shell globs. Python has a module for doing this already. It doesn't know about the "+" syntax you used, but neither does my shell, and I think the syntax is nonstandard.
>>> import fnmatch
>>> fnmatch.fnmatch("fooabcdef", "*abcde?")
True
>>> help(fnmatch.fnmatch)
Help on function fnmatch in module fnmatch:
fnmatch(name, pat)
Test whether FILENAME matches PATTERN.
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
An initial period in FILENAME is not special.
Both FILENAME and PATTERN are first case-normalized
if the operating system requires it.
If you don't want this, use fnmatchcase(FILENAME, PATTERN).
>>>
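If a regex string is still needed, the same module has fnmatch.translate, which converts a shell-style pattern into regex text. A short sketch follows; the exact text it returns differs between Python versions, so it is only printed here, not relied upon.

```python
import fnmatch
import re

glob_pattern = "*abcde?"

# translate() returns regex source text equivalent to the glob pattern.
regex_text = fnmatch.translate(glob_pattern)
print(regex_text)

compiled = re.compile(regex_text)
matches = bool(compiled.match("fooabcdef"))
rejects = bool(compiled.match("fooabcde"))
print(matches, rejects)
```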
Calling .replace() for each of the wildcards is the quick way, but what if the wildcarded string contains other regex special characters? For example, someone searching for 'my.thing*' probably doesn't mean that '.' to match any character. And in the worst case, things like group-creating parentheses are likely to break your final handling of the regex matches.
re.escape can be used to put literal characters into regexes. You'll have to split out the wildcard characters first, though. The usual trick for that is to use re.split with a capturing group, resulting in a list of the form [literal, wildcard, literal, wildcard, literal...].
Example code:
import re

wildcards = re.compile('([?*+])')
escapewild = {'?': '.', '*': '.*', '+': '.+'}

def escapePart(indexed_part):
    parti, part = indexed_part  # Python 3 has no tuple parameters, so unpack here
    if parti % 2 == 0:  # even items are literals
        return re.escape(part)
    else:  # odd items are wildcards
        return escapewild[part]

def convertWildcardedToRegex(s):
    parts = map(escapePart, enumerate(wildcards.split(s)))
    return '^%s$' % (''.join(parts))
You'll probably only be doing this substitution occasionally, such as each time a user enters a new search string, so I wouldn't worry about how efficient the solution is.
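To see what the re.split trick produces, here is a compact, self-contained sketch of the same idea. The names WILDCARDS and wildcard_to_pattern are invented for this example, and the wildcard table maps ? to a single character, which is an assumption about the intended semantics.

```python
import re

WILDCARDS = {'?': '.', '*': '.*', '+': '.+'}

def wildcard_to_pattern(text):
    """Escape the literal parts and translate the wildcard characters."""
    # The capturing group keeps the wildcards in the result list,
    # so literals land on even indexes and wildcards on odd ones.
    parts = re.split(r'([?*+])', text)
    pieces = [
        WILDCARDS[part] if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    ]
    return '^' + ''.join(pieces) + '$'

print(re.split(r'([?*+])', 'my.thing*'))  # ['my.thing', '*', '']
pattern = wildcard_to_pattern('my.thing*')
print(bool(re.match(pattern, 'my.thing.txt')))  # True
print(bool(re.match(pattern, 'myxthing.txt')))  # False
```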
You need to generate a list of the replacements required to convert from the "user format" to a regex. For ease of maintenance I would store these in a dictionary, and like @Konrad Rudolph I would just use the replace method:
def wildcard_to_regex(wildcard):
    replacements = {
        '*': '.*',
        '?': '.?',
        '+': '.+',
    }
    regex = wildcard
    for (wildcard_pattern, regex_pattern) in replacements.items():
        regex = regex.replace(wildcard_pattern, regex_pattern)
    return regex
Note that this only works for simple character replacements, although other complex code can at least be hidden in the wildcard_to_regex function if necessary.
(Also, I'm not sure that ? should translate to .? -- I think normal wildcards have ? as "exactly one character", so its replacement should be a simple . -- but I'm following your example.)
I'd use replace:
def wildcard_to_regex(str):
    # r"\d" is a raw string so the backslash reaches the regex unchanged
    return str.replace("*", ".*").replace("?", ".?").replace("#", r"\d")
This probably isn't the most efficient way but it should be efficient enough for most purposes. Notice that some wildcard formats allow character classes which are more difficult to handle.
Here is a Perl example of doing this. It is simply using a table to replace each wildcard construct with the corresponding regular expression. I've done this myself previously, but in C. It shouldn't be too hard to port to Python.

regex, how to exclude search in match

Might be a bit messy title, but the question is simple.
I got this in Python:
string = "start;some;text;goes;here;end"
The start; and end; words are always at the same position in the string.
I want the second word which is some in this case. This is what I did:
import re
string = "start;some;text;goes;here;end"
word = re.findall("start;.+?;", string)
In this example, there might be a few things to modify to make it more appropriate, but in my actual code, this is the best way.
However, the string I get back is start;some;, where the search characters themselves are included in the output. I could find the index of both ; characters and extract the middle part, but there has to be a way to get only the actual word, without the extra junk.
No need for regex in my opinion, but all you need is a capture group here.
word = re.findall("start;(.+?);", string)
Another improvement I'd suggest is not using the "." wildcard. Rather, be more specific: what you are looking for is simply anything other than ;, the delimiter.
So I'd do this:
word = re.findall("start;([^;]+);", string)
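A quick side-by-side of the variants discussed, using the string from the question. The split-based line at the end is an added illustration of the "no need for regex" remark and assumes the fields are always separated by ;.

```python
import re

string = "start;some;text;goes;here;end"

# Without a group, findall returns the whole match.
whole_match = re.findall(r"start;.+?;", string)

# With a group, findall returns only what the group captured.
group_only = re.findall(r"start;([^;]+);", string)

# Assumption: fields are always ;-separated, so plain splitting also works.
via_split = string.split(";")[1]

print(whole_match)  # ['start;some;']
print(group_only)   # ['some']
print(via_split)    # some
```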
