I want to do a string replace to convert all of the special and unsafe characters in a search phrase into something suitable to be inserted into a Google URL.
I could use multiple instances of .replace or re.sub, but that seems inefficient. Is there a faster or more Pythonic way of doing this? I'm thinking I'm moving beyond beginner and into intermediate lately, because of all my attempts at making my code cleaner and more efficient.
Instead of doing the replacement yourself, I would suggest using urllib.quote(), which returns a URL-safe string by converting special characters to %xx escapes.
The benefit here is that you can easily obtain the original string from your URL safe version using urllib.unquote() (and you don't need to write the code yourself!).
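For example (using Python 2's urllib, as in the rest of this thread; in Python 3 these functions live in urllib.parse):
>>> import urllib
>>> urllib.quote('a search phrase & more!')
'a%20search%20phrase%20%26%20more%21'
>>> urllib.unquote('a%20search%20phrase%20%26%20more%21')
'a search phrase & more!'
For query strings specifically, urllib.quote_plus encodes spaces as '+' instead, which is the form Google's own URLs use.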
One alternative is string.translate. For example:
>>> string.translate('ds..ad$ds#a', None, '.,##$')
'dsaddsa'
Another alternative is re.sub; see the re module documentation for more details.
As an example:
# re.sub(pattern, replacement, target_string)
>>> re.sub(r"#|\$", "", 'asdf##$asdf')
'asdfasdf'
Note that you specify the characters you want to remove by replacing them with an empty string, so you can add or remove special characters as needed. It does, however, require some knowledge of regex patterns.
Related
I have multiple filters for files (I'm using python). Some of them are glob filters some of them are regular expressions. I have both case sensitive and case insensitive globs and regexes. I can transform the glob into a regular expression with translate.
I can combine the case sensitive regular expressions into one big regular expression. Let's call it R_sensitive.
I can combine the case insensitive regular expressions into one big regular expression (case insensitive). Let's call it R_insensitive.
Is there a way to combine R_insensitive and R_sensitive into one regular expression, so that each part (of course) keeps its own case sensitivity?
Thanks,
Iulian
NOTE: The way I combine expressions is the following:
Having R1,R2,R3 regexes I make R = (R1)|(R2)|(R3).
EXAMPLE:
I'm searching for "*.txt" (case-insensitive glob). But I have another glob like "*abc*" (case sensitive). How do I combine (programmatically) the two regexes resulting from fnmatch.translate when one is case insensitive and the other is case sensitive?
Unfortunately, the regex feature you describe is either ordinal modifiers or a modifier span. Python does not support either, though here is what they would look like:
Ordinal Modifiers: (?i)case_insensitive_match(?-i)case_sensitive_match
Modifier Spans: (?i:case_insensitive_match)(?-i:case_sensitive_match)
In Python, they both fail to parse in re. The closest thing you could do (for simple or small matches) would be letter groups:
[Cc][Aa][Ss][Ee]_[Ii][Nn][Ss][Ee][Nn][Ss][Ii][Tt][Ii][Vv][Ee]_[Mm][Aa][Tt][Cc][Hh]case_sensitive_match
Obviously, this approach would be best for something where the insensitive portion is very brief, so I'm afraid it wouldn't be the best choice for you.
What you need is a way to convert a case-insensitive-flagged regexp into a regexp that works equivalent without the flag.
To do this fully generally is going to be a nightmare.
To do this just for fnmatch results is a whole lot easier.
If you need to handle full Unicode case rules, it will still be very hard.
If you only need to handle making sure each character c also matches c.upper() and c.lower(), it's very easy.
I'm only going to explain the easy case, because it's probably what you want, given your examples, and it's easy. :)
Some modules in the Python standard library are meant to serve as sample code as well as working implementations; these modules' docs start with a link directly to their source code. And fnmatch has such a link.
If you understand regexp syntax, and glob syntax, and look at the source to the translate function, it should be pretty easy to write your own translatenocase function.
Basically: In the inner else clause for building character classes, iterate over the characters, and for each character, if c.upper() != c.lower(), append both instead of c. Then, in the outer else clause for non-special characters, if c.upper() != c.lower(), append a two-character character class consisting of those two characters.
So, translatenocase('*.txt') will return something like r'.*\.[tT][xX][tT]' instead of something like r'.*\.txt'. But normal translate('*abc*') will of course return the usual r'.*abc.*'. And you can combine these just by using an alternation, as you apparently already know how to do.
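Here's a rough sketch of what that could look like, adapted from the Python 2 fnmatch.translate source (the translatenocase name is mine, the handling of ranges inside character classes is naive, and the trailing anchor differs slightly between stdlib versions):
import re

def translatenocase(pat):
    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i + 1
        if c == '*':
            res = res + '.*'
        elif c == '?':
            res = res + '.'
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j + 1
            if j < n and pat[j] == ']':
                j = j + 1
            while j < n and pat[j] != ']':
                j = j + 1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j].replace('\\', '\\\\')
                i = j + 1
                if stuff[0] == '!':
                    stuff = '^' + stuff[1:]
                elif stuff[0] == '^':
                    stuff = '\\' + stuff
                # the "inner else" clause: append both cases of each
                # character (naive about ranges such as a-z)
                stuff = ''.join(ch.lower() + ch.upper()
                                if ch.upper() != ch.lower() else ch
                                for ch in stuff)
                res = '%s[%s]' % (res, stuff)
        else:
            # the "outer else" clause: a cased literal becomes a
            # two-character class
            if c.upper() != c.lower():
                res = res + '[%s%s]' % (c.lower(), c.upper())
            else:
                res = res + re.escape(c)
    return '(?ms)' + res + r'\Z'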
This is a quick question:
How would I specify a regex which can be used several times with multiple match strings? I might not have worded that right, but I will try to show some code.
I have this regex:
regex = re.compile(r'(?=(%s))')
In a for loop, I will try to match the string I have against one I substitute into the regex, so that at each iteration I can change the string being matched and the regex will try to match it.
So is this possible, can I do something like
regex.findall(myMatchString, mySearchString)
in the code or would I have to recompile the regex in order for it to match a new string?
More clarification:
I want to do this:
re.findall('(?=(%s))' %myMatchString, mySearchString)
but because myMatchString will be changing at each iteration of the loop, I want to do it like this so I can match the new string:
regex = re.compile(r'(?=(%s))')
regex.findall( myMatchString, mySearchString)
Thanks for reading
Well, if I understand what you're saying, all you want to write is:
import re

def match_on_list_of_strings(match_string, list_of_strings):
    # compile the pattern once, then reuse it on every target string
    regex = re.compile(r'(?=(%s))' % re.escape(match_string))
    for string in list_of_strings:
        yield regex.findall(string)
That will apply your match to as many strings as there are in the list, while your regex is compiled only once.
Aaaah... but you don't need a regex for that:
def match_on_list_of_strings(bigstring, list_of_strings):
    for string in list_of_strings:
        if string in bigstring:
            yield string
or if you really want to use a re:
def match_on_list_of_strings(bigstring, list_of_strings):
    for string in list_of_strings:
        # escape the string so its special characters are taken literally
        if re.match('.*' + re.escape(string) + '.*', bigstring):
            yield string
And then to answer your question: no, you can't compile the target string into a regex, only the other way around. When you compile a regex, you transform the pattern itself into an internal representation of the matching automaton. You might want to read up on NFAs and regular expressions.
The point of re.compile is to explicitly declare that you're going to re-use the same pattern again and again, and hopefully avoid any recompilation.
Since you're not necessarily re-using the same pattern, you're better off letting the re module cache compiled patterns for you (it caches a bounded number; I can't remember exactly how many) and just calling re.findall(...) with your regex built afresh each time.
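For instance, a minimal sketch of that approach, reusing the names from the question (my_match_strings is an assumed list of the match strings being iterated over):
import re

for myMatchString in my_match_strings:
    # build the pattern afresh each time; re caches the compiled form, and
    # re.escape keeps any special characters in the match string literal
    print re.findall(r'(?=(%s))' % re.escape(myMatchString), mySearchString)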
I thought this should work, but it doesn't:
import re
if re.match("\Qbla\E", "bla"):
    print "works!"
Why doesn't it work? Can I use the '\Q' and '\E' symbols in Python? How?
Python's regex engine doesn't support those; see §7.2.1 "Regular Expression Syntax" in the Python documentation for a list of what it does support. However, you can get the same effect by writing re.match(re.escape("bla"), "bla"); re.escape is a function that inserts backslashes before all special characters.
By the way, you should generally use "raw" strings, r"..." instead of just "...", since otherwise backslashes will get processed twice (once when the string is parsed, and then again by the regex engine), which means you have to write things like \\b instead of \b. Using r"..." prevents that first processing pass, so you can just write \b.
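To illustrate with a string that actually contains metacharacters (since "bla" has none to escape):
>>> import re
>>> re.escape("bla.bla*")
'bla\\.bla\\*'
>>> re.match(re.escape("bla.bla*"), "bla.bla*") is not None
True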
Unfortunately, Python doesn't support the \Q and \E escape sequences. You just have to escape everything yourself.
Python doesn't support \Q...\E .
Ref: http://www.regular-expressions.info/refflavors.html
But that doesn't mean it doesn't support escaping strings of metacharacters.
Ref: http://docs.python.org/library/re.html#re.escape
I am trying to count the characters in comments in C code using Python and regex, but with no success. I could erase the string literals first, to get rid of comment-like text inside strings, but that would also erase string-like text inside comments, and the result would be wrong, of course. Is there any way to ask regex not to match strings inside comments, or vice versa?
No, not really.
Regex is not the correct tool to parse nested structures like you describe; instead you will need to parse the C syntax (or the "dumb subset" of it you're interested in, anyway), and you might find regex helpful in that. A relatively simple state machine with three states (CODE, STRING, COMMENT) would do it.
Regular expressions are not always a replacement for a real parser.
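For illustration, here's a minimal sketch of such a scanner. It handles only double-quoted strings and /* */ comments (no // comments, character literals, or other C subtleties), and it counts everything between the comment delimiters:
def count_comment_chars(source):
    CODE, STRING, COMMENT = range(3)
    state = CODE
    count = 0
    i, n = 0, len(source)
    while i < n:
        c = source[i]
        if state == CODE:
            if c == '"':
                state = STRING
            elif source[i:i+2] == '/*':
                state = COMMENT
                i += 1          # skip the '*' so '/*/' isn't a whole comment
        elif state == STRING:
            if c == '\\':
                i += 1          # skip the escaped character
            elif c == '"':
                state = CODE
        else:                   # COMMENT
            if source[i:i+2] == '*/':
                state = CODE
                i += 1
            else:
                count += 1      # a character inside a comment
        i += 1
    return count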
You can strip out all strings that aren't in comments by searching for the regular expression:
"(?:\\.|[^"\\\r\n])*"|(//.*|/\*(?s:.*?)\*/)
and replacing with:
$1
Essentially, this searches for the regex string|(comment) which matches a string or a comment, capturing the comment. The replacement is either nothing if a string was matched or the comment if a comment was matched.
Though regular expressions are not a replacement for a real parser, you can quickly build a rudimentary parser by creating a giant regex that alternates all of the tokens you're interested in (comments and strings, in this case). If you're writing a bit of code to handle comments, but not those in strings, iterate over all the matches of the above regex and count the characters in the first capturing group whenever it participated in the match.
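In Python, that might look something like this sketch. Python's re.sub uses \1 rather than $1, older Python versions lack the (?s:...) span (so [\s\S] stands in for "any character, including newlines"), and a replacement function handles the case where the comment group didn't participate:
import re

pattern = r'"(?:\\.|[^"\\\r\n])*"|(//.*|/\*[\s\S]*?\*/)'

def strip_strings(c_code):
    # keep comments (group 1); drop string literals (group 1 is None)
    return re.sub(pattern, lambda m: m.group(1) or '', c_code)

def comment_char_count(c_code):
    # counts include the comment delimiters themselves
    return sum(len(m.group(1)) for m in re.finditer(pattern, c_code)
               if m.group(1))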
What's the easiest way of me converting the simpler regex format that most users are used to into the correct re python regex string?
As an example, I need to convert this:
string = "*abc+de?"
to this:
string = ".*abc.+de.?"
Of course I could loop through the string and build up another string character by character, but that's surely an inefficient way of doing this?
Those don't look like regexps you're trying to translate; they look more like Unix shell globs. Python has a module for doing this already. It doesn't know about the "+" syntax you used, but neither does my shell, and I think that syntax is nonstandard.
>>> import fnmatch
>>> fnmatch.fnmatch("fooabcdef", "*abcde?")
True
>>> help(fnmatch.fnmatch)
Help on function fnmatch in module fnmatch:

fnmatch(name, pat)
    Test whether FILENAME matches PATTERN.

    Patterns are Unix shell style:

    *       matches everything
    ?       matches any single character
    [seq]   matches any character in seq
    [!seq]  matches any char not in seq

    An initial period in FILENAME is not special.
    Both FILENAME and PATTERN are first case-normalized
    if the operating system requires it.
    If you don't want this, use fnmatchcase(FILENAME, PATTERN).

>>>
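If what you actually want is the regex string itself rather than a match test, fnmatch.translate does that conversion (the exact anchoring it appends varies between Python versions):
>>> fnmatch.translate("*abcde?")
'.*abcde.\\Z(?ms)'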
Calling .replace() on each of the wildcards is the quick way, but what if the wildcarded string contains other regex special characters? e.g. someone searching for 'my.thing*' probably doesn't intend that '.' to match any character. And in the worst case, things like match-group-creating parentheses are likely to break your final handling of the regex matches.
re.escape can be used to put literal characters into regexes. You'll have to split out the wildcard characters first, though. The usual trick for that is to use re.split with a capturing group in the pattern, which produces a list of the form [literal, wildcard, literal, wildcard, literal, ...].
Example code:
import re

wildcards = re.compile('([?*+])')
escapewild = {'?': '.', '*': '.*', '+': '.+'}

def escapePart((parti, part)):
    if parti % 2 == 0:           # even items are literals
        return re.escape(part)
    else:                        # odd items are wildcards
        return escapewild[part]

def convertWildcardedToRegex(s):
    parts = map(escapePart, enumerate(wildcards.split(s)))
    return '^%s$' % ''.join(parts)
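For example (note how re.escape protects the '.'):
>>> convertWildcardedToRegex('my.thing*')
'^my\\.thing.*$'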
You'll probably only be doing this substitution occasionally, such as each time a user enters a new search string, so I wouldn't worry about how efficient the solution is.
You need to generate a list of the replacements needed to convert from the "user format" to a regex. For ease of maintenance I would store these in a dictionary, and like @Konrad Rudolph I would just use the replace method:
def wildcard_to_regex(wildcard):
    replacements = {
        '*': '.*',
        '?': '.?',
        '+': '.+',
    }
    regex = wildcard
    for wildcard_pattern, regex_pattern in replacements.items():
        regex = regex.replace(wildcard_pattern, regex_pattern)
    return regex
Note that this only works for simple character replacements, although other complex code can at least be hidden in the wildcard_to_regex function if necessary.
(Also, I'm not sure that ? should translate to .? -- I think normal wildcards have ? as "exactly one character", so its replacement should be a simple . -- but I'm following your example.)
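Using the question's example:
>>> wildcard_to_regex("*abc+de?")
'.*abc.+de.?'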
I'd use replace:
def wildcard_to_regex(pattern):
    return pattern.replace("*", ".*").replace("?", ".?").replace("#", r"\d")
This probably isn't the most efficient way but it should be efficient enough for most purposes. Notice that some wildcard formats allow character classes which are more difficult to handle.
Here is a Perl example of doing this. It is simply using a table to replace each wildcard construct with the corresponding regular expression. I've done this myself previously, but in C. It shouldn't be too hard to port to Python.