I have a list in python with some strings, and I need to know witch item in the list is like "A1_8301". This "_" means that can be any char.
Is there a quick way to do that?
If I was using SQL, i just type something like "where x like "A1_8301"
Thank you!
In Python you'd use a regular expression:
import re
pattern = re.compile(r'^A1.8301$')
matches = [x for x in yourlist if pattern.match(x)]
This produces a list of elements that match your requirements.
The ^ and $ anchors are needed to prevent substring matches; BA1k8301-42 should not match, for example. The re.match() call will only match at the start of the tested string, but using ^ makes this a little more explicit and mirrors the $ for the end-of-string anchor nicely.
The _ in a SQL like is translated to ., meaning match one character.
regular expressions are probably the way to go. IIRC, % should map to .* and _ should map to ..
matcher = re.compile('^A1.8301$')
list_of_string = [s for s in stringlist if matcher.match(s)]
Related
Give an string like '/apps/platform/app/app_name/etc', I can use
p = re.compile('/apps/(?P<p1>.*)/app/(?P<p2>.*)/')
to get two matched groups of platform and app_name, but how can I use re.sub function (or maybe better way) to replace those two groups with other string like windows and facebook? So the final string would like /apps/windows/app/facebook/etc.
Separate group replacement wouldn't be possible through regex. So i suggest you to do like this.
(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/
DEMO
Then replace the matched characters with windows\2facebook/ . And also i suggest you to define your regex as raw string. Lookbehind is used inorder to avoid extra capturing group.
>>> s = '/apps/platform/app/app_name/etc'
>>> re.sub(r'(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/', r'windows\2facebook/', s)
'/apps/windows/app/facebook/etc'
I'm trying to convert a perl regex to python equivalent.
Line in perl:
($Cur) = $Line =~ m/\s*\<stat\>(.+)\<\/stat\>\s*$/i;
What I've attempted, but doesn't seem to work:
m = re.search('<stat>(.*?)</stat>/i', line)
cur = m.group(0)
almost /i means case insensitive
m = re.search(r'<stat>(.*?)</stat>',line,re.IGNORECASE)
also use the r modifier on the string so you dont need to escape stuff like angle brackets.
but my guess is a better solution is to use an html/xml parser like beautifulsoup or other similar packages
Something like the following ...
r is Python’s raw string notation for regex patterns and to avoid escaping, after the prefix comes your regular expression following your string data. re.I is used for case-insensitive matching.
See the re documentation explaining this in more detail.
To find your match, you could use the group() method of MatchObject like the following:
cur = re.search(r'<stat>([^<]*)</stat>', line).group(1)
Using search() matches only the first occurrence, use findall() to match all occurrences.
matches = re.findall(r'<stat>([^<]*)</stat>', line)
So, I have a long sequence of Unicode characters that I want to match using regular expressions:
char_set = '\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
(These are all the uppercase characters comprehended in the Unicode range 0-382. Most of them are accented. PEP8 discourages the use of non-ASCII characters in Python scripts, so I'm using the Unicode codes instead of the string literals.)
If I simply compile that long string directly, it works. For instance, this matches all the words that begin with one of those characters:
regex = re.compile(u'\A[\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D]')
But I want to re-use that same sequence of characters in several other regular expressions. I could simply copy and paste it every time, but that's ugly. So based on previous answers to similar questions I've tried this:
regex = re.compile(u'\A[%s]' % char_set)
No good. Somehow the above expression seems to match ANY character, not just the ones hardcoded under the variable 'char_set'.
I've also tried this:
regex = re.compile(u'\A[' + char_set + ']')
And this:
regex = re.compile(u'\A[' + re.escape(char_set) + ']')
And this too:
regex = re.compile(u'\A[{ }]'.format(char_set))
None of which works as expected.
Any thoughts? What am I doing wrong?
(I'm using Python 2.7 and Mac OS X 10.6)
When you're using a pattern with a set of characters in square brackets, you don't want to put any vertical bar (|) characters in the set. Instead, just string the characters together and it should work. Here's a session where I tried out your characters with no problems after stripping the | chars:
>>> import re
>>> char_set = u'\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
>>> fixed_char_set = char_set.replace("|", "") # remove the unneeded vertical bars
>>> pattern = ur"\A[{}]".format(fixed_char_set) # create a pattern string
>>> regex = re.compile(pattern) # compile the pattern into a regex object
>>> print regex.match("%foo") # "%" is not in the character set, so match returns None
None
edit: Actually, it seems like there must be some other issue going on, since I don't match "%foo" even if I use your original char_set without stripping out anything. Please give examples of text that is matching when it shouldn't!
I have a file saving IP addresses to names in format
<<%#$192.168.8.40$#% %##Name_of_person##% >>
I read This file and now want to extract the list using pythons regular expressions
list=re.findall("<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>",ace)
print list
But the list is always an empty list..
can anyone tell me where is the mistake in the regular expression
edit-ace is the variable saving the contents read from the file
$ is a special character in regular expressions, meaning "end of line" (or "end of string", depending on the flavour). Your regex has other characters following the $, and as such only matches strings that have those characters after the end, which is impossible.
You will need to escape the $, like so: \$
I would suggest the following regular expression (formatted as a raw string since you are using Python):
r"<<%#\$([^$]+)\$#%\s%##([^#]+)##%\s>>"
That is, <<%#$, then one or more non-$ characters, $#%, a whitespace character, %##, one or more non-# characters, ##%, whitespace, >>.
Something like:
text = '<<%#$192.168.8.40$#% %##Name_of_person##% >>'
ip, name = [el[1] for el in re.findall(r'%#(.)(.+?)\1#%', text)]
If you can get any with just splitting on '#' and '$' then...
from itertools import itemgetter
ip, name = itemgetter(1, 3)(re.split(r'[#\$]', text))
You could also just use built-in string functions:
tmp = text.split('$')
ip, name = tmp[1], tmp[2].split('#')[1]
u use a invalid regex pattern.
you may use
r"<\%#\$(\S+)\$#\%\s\%##(\w+\s*\w*)##\%\s>>" replace
"<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>" in fandall method
good luck~!
What's the easiest way of me converting the simpler regex format that most users are used to into the correct re python regex string?
As an example, I need to convert this:
string = "*abc+de?"
to this:
string = ".*abc.+de.?"
Of course I could loop through the string and build up another string character by character, but that's surely an inefficient way of doing this?
Those don't look like regexps you're trying to translate, they look more like unix shell globs. Python has a module for doing this already. It doesn't know about the "+" syntax you used, but neither does my shell, and I think the syntax is nonstandard.
>>> import fnmatch
>>> fnmatch.fnmatch("fooabcdef", "*abcde?")
True
>>> help(fnmatch.fnmatch)
Help on function fnmatch in module fnmatch:
fnmatch(name, pat)
Test whether FILENAME matches PATTERN.
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
An initial period in FILENAME is not special.
Both FILENAME and PATTERN are first case-normalized
if the operating system requires it.
If you don't want this, use fnmatchcase(FILENAME, PATTERN).
>>>
.replacing() each of the wildcards is the quick way, but what if the wildcarded string contains other regex special characters? eg. someone searching for 'my.thing*' probably doesn't mean that '.' to match any character. And in the worst case things like match-group-creating parentheses are likely to break your final handling of the regex matches.
re.escape can be used to put literal characters into regexes. You'll have to split out the wildcard characters first though. The usual trick for that is to use re.split with a matching bracket, resulting in a list in the form [literal, wildcard, literal, wildcard, literal...].
Example code:
wildcards= re.compile('([?*+])')
escapewild= {'?': '.', '*': '.*', '+': '.+'}
def escapePart((parti, part)):
if parti%2==0: # even items are literals
return re.escape(part)
else: # odd items are wildcards
return escapewild[part]
def convertWildcardedToRegex(s):
parts= map(escapePart, enumerate(wildcards.split(s)))
return '^%s$' % (''.join(parts))
You'll probably only be doing this substitution occasionally, such as each time a user enters a new search string, so I wouldn't worry about how efficient the solution is.
You need to generate a list of the replacements you need to convert from the "user format" to a regex. For ease of maintenance I would store these in a dictionary, and like #Konrad Rudolph I would just use the replace method:
def wildcard_to_regex(wildcard):
replacements = {
'*': '.*',
'?': '.?',
'+': '.+',
}
regex = wildcard
for (wildcard_pattern, regex_pattern) in replacements.items():
regex = regex.replace(wildcard_pattern, regex_pattern)
return regex
Note that this only works for simple character replacements, although other complex code can at least be hidden in the wildcard_to_regex function if necessary.
(Also, I'm not sure that ? should translate to .? -- I think normal wildcards have ? as "exactly one character", so its replacement should be a simple . -- but I'm following your example.)
I'd use replace:
def wildcard_to_regex(str):
return str.replace("*", ".*").replace("?", .?").replace("#", "\d")
This probably isn't the most efficient way but it should be efficient enough for most purposes. Notice that some wildcard formats allow character classes which are more difficult to handle.
Here is a Perl example of doing this. It is simply using a table to replace each wildcard construct with the corresponding regular expression. I've done this myself previously, but in C. It shouldn't be too hard to port to Python.