String Simple Substitution - python

What's the easiest way to convert the simpler wildcard format that most users are used to into a proper Python re regex string?
As an example, I need to convert this:
string = "*abc+de?"
to this:
string = ".*abc.+de.?"
Of course I could loop through the string and build up another string character by character, but that's surely an inefficient way of doing this?

Those don't look like regexps you're trying to translate, they look more like unix shell globs. Python has a module for doing this already. It doesn't know about the "+" syntax you used, but neither does my shell, and I think the syntax is nonstandard.
>>> import fnmatch
>>> fnmatch.fnmatch("fooabcdef", "*abcde?")
True
>>> help(fnmatch.fnmatch)
Help on function fnmatch in module fnmatch:
fnmatch(name, pat)
    Test whether FILENAME matches PATTERN.
    Patterns are Unix shell style:
    *       matches everything
    ?       matches any single character
    [seq]   matches any character in seq
    [!seq]  matches any char not in seq
    An initial period in FILENAME is not special.
    Both FILENAME and PATTERN are first case-normalized
    if the operating system requires it.
    If you don't want this, use fnmatchcase(FILENAME, PATTERN).
>>>
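If a compiled regular expression is needed rather than a direct match test, the same module can also translate a glob into regex source. A minimal sketch, using only the standard-library calls fnmatch.translate and fnmatch.fnmatchcase; the sample names are made up for illustration, and the exact text that translate returns differs between Python versions, so it is not shown here:

```python
import fnmatch
import re

pattern = "*abcde?"

# fnmatch.translate returns regex source that is already anchored at the end,
# so it pairs naturally with re.match (which anchors at the start).
regex = re.compile(fnmatch.translate(pattern))

print(bool(regex.match("fooabcdef")))  # the glob covers the whole name
print(bool(regex.match("fooabcde")))   # "?" needs exactly one more character

# fnmatchcase skips the OS-dependent case normalization that fnmatch applies.
print(fnmatch.fnmatchcase("FOOABCDEF", pattern))
```

Note that "+" still has no special meaning in this syntax, so this only covers the "*" and "?" parts of the question.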

Calling .replace() for each of the wildcards is the quick way, but what if the wildcarded string contains other regex special characters? E.g. someone searching for 'my.thing*' probably doesn't mean that '.' to match any character. And in the worst case things like match-group-creating parentheses are likely to break your final handling of the regex matches.
re.escape can be used to put literal characters into regexes. You'll have to split out the wildcard characters first though. The usual trick for that is to use re.split with a capturing group around the pattern, so the separators are kept, resulting in a list of the form [literal, wildcard, literal, wildcard, literal...].
Example code:
import re
wildcards = re.compile('([?*+])')
escapewild = {'?': '.', '*': '.*', '+': '.+'}
def escapePart(parti, part):
    if parti % 2 == 0:  # even items are literals
        return re.escape(part)
    else:  # odd items are wildcards
        return escapewild[part]
def convertWildcardedToRegex(s):
    # the capturing group in wildcards makes split() keep the wildcard characters
    parts = [escapePart(parti, part) for parti, part in enumerate(wildcards.split(s))]
    return '^%s$' % ''.join(parts)

You'll probably only be doing this substitution occasionally, such as each time a user enters a new search string, so I wouldn't worry about how efficient the solution is.
You need to generate a list of the replacements you need to convert from the "user format" to a regex. For ease of maintenance I would store these in a dictionary, and like @Konrad Rudolph I would just use the replace method:
def wildcard_to_regex(wildcard):
    replacements = {
        '*': '.*',
        '?': '.?',
        '+': '.+',
    }
    regex = wildcard
    for (wildcard_pattern, regex_pattern) in replacements.items():
        regex = regex.replace(wildcard_pattern, regex_pattern)
    return regex
Note that this only works for simple character replacements, although other complex code can at least be hidden in the wildcard_to_regex function if necessary.
(Also, I'm not sure that ? should translate to .? -- I think normal wildcards have ? as "exactly one character", so its replacement should be a simple . -- but I'm following your example.)
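One limitation of chained replace calls is that regex metacharacters in the input, such as ".", pass through unescaped. A possible single-pass variant is sketched below; it assumes that "?" should mean exactly one character, and the function name wildcard_to_regex_escaped and the WILDCARDS table are made up for this illustration:

```python
import re

# Assumed mapping; "?" is treated as exactly one character here.
WILDCARDS = {"*": ".*", "?": ".", "+": ".+"}

def wildcard_to_regex_escaped(wildcard):
    """Convert a wildcard string to regex source in a single pass."""
    def convert(match):
        char = match.group(0)
        # Wildcards are translated; every other character is escaped.
        return WILDCARDS.get(char, re.escape(char))
    # DOTALL so that newlines in the input are visited (and escaped) too.
    return re.sub(r".", convert, wildcard, flags=re.DOTALL)

regex = wildcard_to_regex_escaped("my.thing*")
print(bool(re.fullmatch(regex, "my.thing42")))   # literal dot matches a dot
print(bool(re.fullmatch(regex, "myXthing42")))   # but not an arbitrary character
```

Because every character is visited once, a replacement can never be rewritten by a later step, and the dictionary stays the single place to maintain the mapping.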

I'd use replace:
def wildcard_to_regex(pattern):
    return pattern.replace("*", ".*").replace("?", ".?").replace("#", r"\d")
This probably isn't the most efficient way but it should be efficient enough for most purposes. Notice that some wildcard formats allow character classes which are more difficult to handle.

Here is a Perl example of doing this. It is simply using a table to replace each wildcard construct with the corresponding regular expression. I've done this myself previously, but in C. It shouldn't be too hard to port to Python.

Related

Checking for string format using regex

This is an easy question, but I am still getting stuck. I have a bunch of files and I have to check whether they are following this format.
abc_monthname[0-9]_v[0-9].xlsx
I have tried something very generic like: r'^[A-Za-z]+_[A-Za-z0-9]+_v[0-9]+\.xlsx$'
but this will fail for some cases. I want to give a strict rule. How do I achieve this in python?
You probably want to use + quantifiers with the numeric portion of your regex:
^abc_monthname[0-9]+_v[0-9]+\.xlsx$
Note also that dot is a metacharacter and should be escaped with a backslash. Here is a sample script:
import re
filename = "abc_monthname123_v2.xlsx"
if re.search(r'^abc_monthname[0-9]+_v[0-9]+\.xlsx$', filename):
    print("MATCH")
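To check a whole batch of files, the same pattern can be compiled once and applied with re.fullmatch. This is only a sketch: it assumes "monthname" is a literal part of the name, as the answer above does, and the sample filenames and the is_valid helper are invented for illustration:

```python
import re

# Assumes "monthname" is a literal part of the name, as in the answer above.
FILENAME_RE = re.compile(r"abc_monthname[0-9]+_v[0-9]+\.xlsx")

def is_valid(filename):
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return FILENAME_RE.fullmatch(filename) is not None

candidates = [
    "abc_monthname1_v2.xlsx",    # follows the format
    "abc_monthname_v2.xlsx",     # missing the digits after "monthname"
    "abc_monthname12_v3xxlsx",   # no literal dot before the extension
]
for name in candidates:
    print(name, is_valid(name))
```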

how can I substitute a matched string in python

I have a string ="/One/Two/Three/Four"
I want to convert it to ="Four"
I can do this in one line in perl
string =~ s/.*+\///g
How Can I do this in python?
str_name="/One/Two/Three/Four"
str_name.split('/')[-1]
In general, split is a safe way to convert a string into a list based on a separator (here '/'). Then we can take the last element of that list, which happens to be "Four" in this case.
Hope this helps.
Python's re module can handle regular expressions. For this case, you'd do
import re
my_str = "/One/Two/Three/Four"
new_str = re.sub(".*/", "", my_str)
# 'Four'
re.sub() is the regex replacement function. Like your Perl regex, we simply look for any number of characters followed by a slash, and replace that with the empty string. Because .* is greedy, the match runs up to the last slash, so what's left is what's after it, which is 'Four'.
There are a lot of ways to solve this. One way would be by indexing the string. Other string methods can be found here
string ="/One/Two/Three/Four"
string[string.index('Four'):]
Additionally you could split the string by the slash with .split('/')
print(string.split('/')[-1])
Another option would be regular expressions: see here
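For comparison, here is a short sketch that puts the approaches side by side. str.rpartition is not mentioned above; it is included as one more standard string method that splits only at the last separator:

```python
import re

path = "/One/Two/Three/Four"

last_by_split = path.split("/")[-1]           # split everywhere, keep the last piece
last_by_rpartition = path.rpartition("/")[2]  # split once, at the last slash
last_by_regex = re.sub(r".*/", "", path)      # greedy .* runs up to the last slash

print(last_by_split, last_by_rpartition, last_by_regex)
```

All three give an empty string when the input ends with a slash, so strip trailing slashes first if that case can occur.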

parsing string with specific name in python

I have a string like this:
<name:john student male age=23 subject=\computer\sience_{20092973}>
I am confused by the ":" and "=" separators.
I want to parse this string and split it into a list like this:
name:john
job:student
sex:male
age:23
subject:{20092973}
How can I parse a string into specific named fields (name, job, sex, etc.) in Python?
I have already searched but couldn't find anything, sorry.
How can I do this?
Thank you.
It's generally a good idea to give more than one example of the strings you're trying to parse. But I'll take a guess. It looks like your format is pretty simple, and primarily whitespace-separated. It's simple enough that using regular expressions should work, like this, where line_to_parse is the string you want to parse:
import re
matchval = re.match(r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})", line_to_parse)
matchgroups = matchval.groups()
Now matchgroups will be a tuple of the values you want. It should be trivial for you to take those and get them into the desired format.
If you want to do many of these, it may be worth compiling the regular expression; take a look at the re documentation for more on this.
As for the way the expression works: I won't go into regular expressions in general (that's what the re docs are for) but in this case, we want to get a bunch of strings that don't have any whitespace in them, and have whitespace between them, and we want to do something odd with the subject, ignoring all the text except the part between { and }.
Each "(...)" in the expression saves whatever is inside it as a group. Each "\S+" stands for one or more ("+") characters that aren't whitespace ("\S"), so "(\S+)" will match and save a string of length at least one that has no whitespace in it. Each "\s+" does the opposite: it has no parentheses around it, so it doesn't save what it matches, and it matches one or more ("+") whitespace characters ("\s"). This suffices for most of what we want. At the end, though, we need to deal with the subject. "[...]" allows us to list multiple types of characters. "[^...]" is special, and matches anything that isn't in there. {, like [, (, and so on, is escaped with \ so that it is treated as a literal character, and in the end, that means "[^\{]*" matches zero or more ("*") characters that aren't "{" ("[^\{]"). Since "*" and "+" are "greedy", and will try to match as much as they can and still have the expression match, we now only need to deal with the last part. From what I've talked about before, it should be pretty clear what "(\{\S+\})" does.
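To get from the matched groups to the list asked for in the question, the groups can be paired with field names. A minimal sketch, assuming the five fields always appear in this order; the FIELDS list is taken from the question's desired output, since the input line itself does not label job or sex:

```python
import re

line_to_parse = r"<name:john student male age=23 subject=\computer\sience_{20092973}>"

pattern = re.compile(
    r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})"
)

# Field names assumed from the question's desired output.
FIELDS = ["name", "job", "sex", "age", "subject"]

match = pattern.match(line_to_parse)
if match is None:
    raise ValueError("line does not follow the expected format")

# Pair each field name with the corresponding captured group.
result = ["%s:%s" % (field, value) for field, value in zip(FIELDS, match.groups())]
for item in result:
    print(item)
```

Checking for None before calling groups() matters here, because re.match returns None rather than raising when a line does not fit the format.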

Combine case sensitive regex and case insensitive regex into one

I have multiple filters for files (I'm using python). Some of them are glob filters some of them are regular expressions. I have both case sensitive and case insensitive globs and regexes. I can transform the glob into a regular expression with translate.
I can combine the case sensitive regular expressions into one big regular expression. Let's call it R_sensitive.
I can combine the case insensitive regular expressions into one big regular expression (case insensitive). Let's call it R_insensitive.
Is there a way to combine R_insensitive and R_sensitive into one regular expression? The expression would be (of course) case sensitive?
Thanks,
Iulian
NOTE: The way I combine expressions is the following:
Having R1,R2,R3 regexes I make R = (R1)|(R2)|(R3).
EXAMPLE:
I'm searching for "*.txt" (insensitive glob). But I have another glob that is like this: "*abc*" (case sensitive). How to combine (from programming) the 2 regex resulted from "fnmatch.translate" when one is case insensitive while the other is case sensitive?
The regex feature you describe is either ordinal modifiers or a modifier span. Here is what they look like:
Ordinal Modifiers: (?i)case_insensitive_match(?-i)case_sensitive_match
Modifier Spans: (?i:case_insensitive_match)(?-i:case_sensitive_match)
Python's re does not support ordinal modifiers that switch a flag off partway through a pattern. Modifier spans failed to parse in older versions, but re accepts them from Python 3.6 onward. If you are stuck on an older version, the closest thing you could do (for simple or small matches) would be letter groups:
[Cc][Aa][Ss][Ee]_[Ii][Nn][Ss][Ee][Nn][Ss][Ii][Tt][Ii][Vv][Ee]_[Mm][Aa][Tt][Cc][Hh]case_sensitive_match
Obviously, this approach would be best for something where the insensitive portion is very brief, so I'm afraid it wouldn't be the best choice for you.
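On Python 3.6 and later, re does accept the modifier-span syntax, so the two translated globs can be combined directly. A sketch under that assumption, using the two globs from the question's example; the test names are invented:

```python
import fnmatch
import re

insensitive = fnmatch.translate("*.txt")  # should ignore case
sensitive = fnmatch.translate("*abc*")    # should respect case

# (?i:...) limits the IGNORECASE flag to the first alternative only.
combined = re.compile("(?i:%s)|(?:%s)" % (insensitive, sensitive))

for name in ["NOTES.TXT", "xabcx", "xABCx"]:
    print(name, bool(combined.match(name)))
```

The combined pattern is compiled without any global flags, so the second alternative stays case sensitive.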
What you need is a way to convert a case-insensitive-flagged regexp into a regexp that works equivalently without the flag.
To do this fully generally is going to be a nightmare.
To do this just for fnmatch results is a whole lot easier.
If you need to handle full Unicode case rules, it will still be very hard.
If you only need to handle making sure each character c also matches c.upper() and c.lower(), it's very easy.
I'm only going to explain the easy case, because it's probably what you want, given your examples, and it's easy. :)
Some modules in the Python standard library are meant to serve as sample code as well as working implementations; these modules' docs start with a link directly to their source code. And fnmatch has such a link.
If you understand regexp syntax, and glob syntax, and look at the source to the translate function, it should be pretty easy to write your own translatenocase function.
Basically: In the inner else clause for building character classes, iterate over the characters, and for each character, if c.upper() != c.lower(), append both instead of c. Then, in the outer else clause for non-special characters, if c.upper() != c.lower(), append a two-character character class consisting of those two characters.
So, translatenocase('*.txt') will return something like r'.*\.[tT][xX][tT]' instead of something like r'.*\.txt'. But normal translate('*abc*') will of course return the usual r'.*abc.*'. And you can combine these just by using an alternation, as you apparently already know how to do.
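A rough sketch of the idea follows. It is not a port of the real fnmatch.translate: it handles only "*", "?" and literal characters (no [seq] classes), returns unanchored regex source to match the examples above, and the name translate_nocase is hypothetical:

```python
import re

def translate_nocase(glob):
    """Translate a glob to regex source that matches either case.

    Simplified sketch: handles only "*", "?" and literal characters,
    not [seq] character classes.
    """
    parts = []
    for c in glob:
        if c == "*":
            parts.append(".*")
        elif c == "?":
            parts.append(".")
        elif c.upper() != c.lower():
            # Cased character: accept both of its forms.
            parts.append("[%s%s]" % (c.lower(), c.upper()))
        else:
            parts.append(re.escape(c))
    return "".join(parts)

print(translate_nocase("*.txt"))

# Combine with an ordinary, case-sensitive pattern via alternation.
combined = re.compile("(%s)|(%s)" % (translate_nocase("*.txt"), ".*abc.*"))
for name in ["NOTES.TXT", "xabcx", "xABCx"]:
    print(name, bool(combined.fullmatch(name)))
```

Since the result carries no flags at all, it can be joined with any case-sensitive expression without special handling.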

Should I reuse a compiled regex?

this is a quick question:
How would I specify a regex which can be used several times with multiple match strings? I might not have worded that right, but I will try to show some code.
I have this regex:
regex = compile(r'(?=(%s))')
In a for loop, I will try and match the string I have to one I specify for the regex so that at each iteration, I can change the string being matched and it will try to match it.
So is this possible, can I do something like
regex.findall(myStringString, myMatchString)
in the code or would I have to recompile the regex in order for it to match a new string?
More clarification:
I want to do this:
re.findall('(?=(%s))' %myMatchString, mySearchString)
but because myMatchString will be changing at each iteration of the loop, I want to do it like this so I can match the new string:
regex = re.compile(r'(?=(%s))')
regex.findall( myMatchString, mySearchString)
Thanks for reading
Well, if I understand what you are saying, all you want to write is:
def match_on_list_of_strings(list_of_strings):
    regex = re.compile(r'(?=(%s))')  # compiled once, outside the loop
    for string in list_of_strings:
        yield regex.findall(string)
That will apply your match to each string in the list, while your regex is compiled only once.
Aaaah... but you don't need a regex for that:
def match_on_list_of_strings(bigstring, list_of_strings):
    for string in list_of_strings:
        if string in bigstring:
            yield string
or if you really want to use a re:
def match_on_list_of_strings(bigstring, list_of_strings):
    for string in list_of_strings:
        if re.match('.*'+string+'.*', bigstring):
            yield string
And then, to answer your question: no, you can't compile the target string into a regex, only the pattern. When you compile a regex, you transform the pattern text into an internal representation of the automaton. You might want to read up on NFAs and regexps.
The point of re.compile is to explicitly declare you're going to re-use the same pattern again and again - and hopefully avoid any compilation that may be required.
As you're not necessarily re-using the same pattern, you're better off letting the re module cache patterns (it keeps a limited number of recently used ones) and just calling re.findall(...), or whichever function you need, with a freshly built pattern each time.
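In practice that means building the pattern inside the loop and calling the module-level function directly. A minimal sketch; the helper name find_overlapping and the sample strings are made up, and the use of re.escape is an addition that assumes the match strings should be taken literally:

```python
import re

def find_overlapping(needle, haystack):
    """Return every (possibly overlapping) occurrence of needle in haystack."""
    # re.escape is an addition here: it makes the needle match literally.
    pattern = "(?=(%s))" % re.escape(needle)
    # The lookahead consumes nothing, so overlapping occurrences are all found.
    return re.findall(pattern, haystack)

search_string = "abababa"
for needle in ["aba", "bab"]:
    print(needle, find_overlapping(needle, search_string))
```

Each distinct pattern is compiled the first time it is seen; repeated needles are served from the re module's internal cache.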
