I need to be able to call a function in a pattern matching statement with the matched patterned as argument and use the return result of the function as replacement for the matched pattern. In perl, one can use eval function to do that. Here is the perl example:
test =~ s/(>?,)(\d+)/eval q{&sum($1,$2)}/ge;
How can I do this in Python.
Thanks for your help.
Python's parallel to Perl's eval EXPR is also named eval.
To be more specific,
eval( $code )
is equivalent to
DOLLAR_AT = None # Global var $#
def perl_eval( code ):
try:
DOLLAR_AT = None
return eval( code )
catch BaseException as err:
DOLLAR_AT = err
perl_eval( code )
But you're actually asking about the parallel of /e. This is what allows the replacement expression to be a Perl expression. In Python, similar results can be achieved by passing a function for re.sub's second parameter.
$test =~ s/(\d+)\+(\d+)/ $1 + $2 /eg;
is equivalent to
test = re.sub( r"(\d+)\+(\d+)", lambda _: str( int( _.group(1) ) + int( _.group(2) ) ), test )
Now, let's look at the code you posted. We don't care about $# here, so the Python equivalent of
$test =~ s/(>?,)(\d+)/ eval q{sum( $1, $2 )} /ge;
is
import re
def replacer( match ):
try:
return eval( "sum(" + match.group(1) + ", " + match.group(2) + ")" )
except:
return None
test = re.sub( r"(>?,)(\d+)", replacer, test )
[Missing: The Perl snippet warns when the replacement expression returns None.]
That said, this use of eval is completely wrong.
Because sum(>,123) is not valid Perl, the Perl snippet is equivalent to
$test =~ s/(>)?,(\d+)/ defined( $1 ) ? undef : eval { sum( $2 ) } /ge;
or
import re
def replacer( match ):
if match.group(1) is None:
return None
else
try:
return sum( match.group(2) )
except:
return None
test = re.sub( r"(>)?,(\d+)", replacer, test )
[Missing: The Perl snippet warns when the replacement expression returns None.]
[Python's parallel to eval BLOCK is try.]
This makes absolutely no sense. The following is probably closer to the author's intentions:
$test =~ s/(>?,)(\d+)/ sum( $1, $2 ) /ge;
or
import re
test = re.sub( r"(>?,)(\d+)", lambda _ : sum( _.group(1), _.group(2) ), test )
That might not be it. Passing >, or , to a sub named sum doesn't seem right. But without knowing what sum does and what the snippet is supposed to do, I can't comment further.
I have a file where each line contains a string that represents one or more email addresses.
Multiple addresses can be grouped inside curly braces as follows:
{name.surname, name2.surnam2}#something.edu
Which means both addresses name.surname#something.edu and name2.surname2#something.edu are valid (this format is commonly used in scientific papers).
Moreover, a single line can also contain curly brackets multiple times. Example:
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com
results in:
a.b#uni.somewhere
c.d#uni.somewhere
e.f#uni.somewhere
x.y#edu.com
z.k#edu.com
Any suggestion on how I can parse this format to extract all email addresses? I'm trying with regexes but I'm currently struggling.
Pyparsing is a PEG parser that gives you an embedded DSL to build up parsers that can read through expressions like this, with resulting code that is more readable (and maintainable) than regular expressions, and flexible enough to add afterthoughts (wait, some parts of the email can be in quotes?).
pyparsing uses '+' and '|' operators to build up your parser from smaller bits. It also supports named fields (similar to regex named groups) and parse-time callbacks. See how this all rolls together below:
import pyparsing as pp
LBRACE, RBRACE = map(pp.Suppress, "{}")
email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=',{}#')
# define a compressed email, and assign names to the separate parts
# for easier processing - luckily the default delimitedList delimiter is ','
compressed_email = (LBRACE
+ pp.Group(pp.delimitedList(email_part))('names')
+ RBRACE
+ '#'
+ email_part('trailing'))
# add a parse-time callback to expand the compressed emails into a list
# of constructed emails - note how the names are used
def expand_compressed_email(t):
return ["{}#{}".format(name, t.trailing) for name in t.names]
compressed_email.addParseAction(expand_compressed_email)
# some lists will just contain plain old uncompressed emails too
# Combine will merge the separate tokens into a single string
plain_email = pp.Combine(email_part + '#' + email_part)
# the complete list parser looks for a comma-delimited list of compressed
# or plain emails
email_list_parser = pp.delimitedList(compressed_email | plain_email)
pyparsing parsers come with a runTests method to test your parser against various test strings:
tests = """\
# original test string
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com
# a tricky email containing a quoted string
{x.y, z.k}#edu.com, "{a, b}"#domain.com
# just a plain email
plain_old_bob#uni.elsewhere
# mixed list of plain and compressed emails
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, plain_old_bob#uni.elsewhere
"""
email_list_parser.runTests(tests)
Prints:
# original test string
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com
['a.b#uni.somewhere', 'c.d#uni.somewhere', 'e.f#uni.somewhere', 'x.y#edu.com', 'z.k#edu.com']
# a tricky email containing a quoted string
{x.y, z.k}#edu.com, "{a, b}"#domain.com
['x.y#edu.com', 'z.k#edu.com', '"{a, b}"#domain.com']
# just a plain email
plain_old_bob#uni.elsewhere
['plain_old_bob#uni.elsewhere']
# mixed list of plain and compressed emails
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, plain_old_bob#uni.elsewhere
['a.b#uni.somewhere', 'c.d#uni.somewhere', 'e.f#uni.somewhere', 'x.y#edu.com', 'z.k#edu.com', 'plain_old_bob#uni.elsewhere']
DISCLOSURE: I am the author of pyparsing.
Note
I'm more familiar with JavaScript than Python, and the basic logic is the same regardless (the different is syntax), so I've written my solutions here in JavaScript. Feel free to translate to Python.
The Issue
This question is a bit more involved than a simple one-line script or regular expression, but depending on the specific requirements you may be able to get away with something rudimentary.
For starters, parsing an e-mail is not trivially boiled down to a single regular expression. This website has several examples of regular expressions that will match "many" e-mails, but explains the trade-offs (complexity versus accuracy) and goes on to include the RFC 5322 standard regular expression that should theoretically match any e-mail, followed by a paragraph for why you shouldn't use it. However even that regular expression assumes that a domain name taking the form of an IP address can only consist of a tuple of four integers ranging from 0 to 255 -- it doesn't allow for IPv6
Even something as simple as:
{a, b}#domain.com
Could get tripped up because technically according to the e-mail address specification an e-mail address can contain ANY ASCII characters surrounded by quotes. The following is a valid (single) e-mail address:
"{a, b}"#domain.com
To accurately parse an e-mail would require that you read the characters one letter at a time and build a finite state machine to track whether you are within a double-quote, within a curly brace, before the #, after the #, parsing a domain name, parsing an IP, etc. In this way you could tokenize the address, locate your curly brace token, and parse it independently.
Something Rudimentary
Regular expressions are not the way to go for 100% accuracy and support for all e-mails, *especially* if you want to support more than one e-mail on a single line. But we'll start with them and try to build from there.
You've probably tried a regular expression like:
/\{(([^,]+),?)+\}\#(\w+\.)+[A-Za-z]+/
Match a single curly brace...
Followed by one or more instances of:
One or more non-comma characters...
Followed by zero or one commas
Followed by a single closing curly brace...
Followed by a single #
Followed by one or more instances of:
One or more "word" characters...
Followed by a single .
Followed by one or more alpha characters
This should match something roughly of the form:
{one, two}#domain1.domain2.toplevel
This handles validating, next is the issue of extracting all valid e-mails. Note that we have two sets of parenthesis in the name portion of the e-mail address that are nested: (([^,]+),?). This causes a problem for us. Many regular expression engines don't know how to return matches in this case. Consider what happens when I run this in JavaScript using my Chrome developer console:
var regex = /\{(([^,]+),?)+\}\#(\w+\.)+[A-Za-z]+/
var matches = "{one, two}#domain.com".match(regex)
Array(4) [ "{one, two}#domain.com", " two", " two", "domain." ]
Well that wasn't right. It found two twice, but didn't find one once! To fix this, we need to eliminate the nesting and do this in two steps.
var regexOne = /\{([^}]+)\}\#(\w+\.)+[A-Za-z]+/
"{one, two}#domain.com".match(regexOne)
Array(3) [ "{one, two}#domain.com", "one, two", "domain." ]
Now we can use the match and parse that separately:
// Note: It's important that this be a global regex (the /g modifier) since we expect the pattern to match multiple times
var regexTwo = /([^,]+,?)/g
var nameMatches = matches[1].match(regexTwo)
Array(2) [ "one,", " two" ]
Now we can trim these and get our names:
nameMatches.map(name => name.replace(/, /g, "")
nameMatches
Array(2) [ "one", "two" ]
For constructing the "domain" part of the e-mail, we'll need similar logic for everything after the #, since this has a potential for repeats the same way the name part had a potential for repeats. Our final code (in JavaScript) may look something like this (you'll have to convert to Python yourself):
function getEmails(input)
{
var emailRegex = /([^#]+)\#(.+)/;
var emailParts = input.match(emailRegex);
var name = emailParts[1];
var domain = emailParts[2];
var nameList;
if (/\{.+\}/.test(name))
{
// The name takes the form "{...}"
var nameRegex = /([^,]+,?)/g;
var nameParts = name.match(nameRegex);
nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
}
else
{
// The name is not surrounded by curly braces
nameList = [name];
}
return nameList.map(name => `${name}#${domain}`);
}
Multi-email Lines
This is where things start to get tricky, and we need to accept a little less accuracy if we don't want to build a full on lexer / tokenizer. Because our e-mails contain commas (within the name field) we can't accurately split on commas -- unless those commas aren't within curly braces. With my knowledge of regular expressions, I don't know if this can be easily done. It may be possible with lookahead or lookbehind operators, but someone else will have to fill me in on that.
What can be easily done with regular expressions, however, is finding a block of text containing a post-ampersand comma. Something like: #[^#{]+?,
In the string a#b.com, c#d.com this would match the entire phrase #b.com, - but the important thing is that it gives us a place to split our string. The tricky bit is then finding out how to split your string here. Something along the lines of this will work most of the time:
var emails = "a#b.com, c#d.com"
var matches = emails.match(/#[^#{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(2) [ "a", " c#d.com" ]
split[0] = split[0] + matches[0] // Add back in what we split on
This has a potential bug should you have two e-mails in the list with the same domain:
var emails = "a#b.com, c#b.com, d#e.com"
var matches = emails.match(#[^#{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(3) [ "a", " c", " d#e.com" ]
split[0] = split[0] + matches[0]
console.log(split) // Array(3) [ "a#b.com", " c", " d#e.com" ]
But again, without building a lexer / tokenizer we're accepting that our solution will only work for most cases and not all.
However since the task of splitting one line into multiple e-mails is easier than diving into the e-mail, extracting a name, and parsing the name: we may be able to write a really stupid lexer for just this part:
var inBrackets = false
var emails = "{a, b}#c.com, d#e.com"
var split = []
var lastSplit = 0
for (var i = 0; i < emails.length; i++)
{
if (inBrackets && emails[i] === "}")
inBrackets = false;
if (!inBrackets && emails[i] === "{")
inBrackets = true;
if (!inBrackets && emails[i] === ",")
{
split.push(emails.substring(lastSplit, i))
lastSplit = i + 1 // Skip the comma
}
}
split.push(emails.substring(lastSplit))
console.log(split)
Once again, this won't be a perfect solution because an e-mail address may exist like the following:
","#domain.com
But, for 99% of use cases, this simple lexer will suffice and we can now build a "usually works but not perfect" solution like the following:
function getEmails(input)
{
var emailRegex = /([^#]+)\#(.+)/;
var emailParts = input.match(emailRegex);
var name = emailParts[1];
var domain = emailParts[2];
var nameList;
if (/\{.+\}/.test(name))
{
// The name takes the form "{...}"
var nameRegex = /([^,]+,?)/g;
var nameParts = name.match(nameRegex);
nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
}
else
{
// The name is not surrounded by curly braces
nameList = [name];
}
return nameList.map(name => `${name}#${domain}`);
}
function splitLine(line)
{
var inBrackets = false;
var split = [];
var lastSplit = 0;
for (var i = 0; i < line.length; i++)
{
if (inBrackets && line[i] === "}")
inBrackets = false;
if (!inBrackets && line[i] === "{")
inBrackets = true;
if (!inBrackets && line[i] === ",")
{
split.push(line.substring(lastSplit, i));
lastSplit = i + 1;
}
}
split.push(line.substring(lastSplit));
return split;
}
var line = "{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com";
var emails = splitLine(line);
var finalList = [];
for (var i = 0; i < emails.length; i++)
{
finalList = finalList.concat(getEmails(emails[i]));
}
console.log(finalList);
// Outputs: [ "a.b#uni.somewhere", "c.d#uni.somewhere", "e.f#uni.somewhere", "x.y#edu.com", "z.k#edu.com" ]
If you want to try and implement the full lexer / tokenizer solution, you can look at the simple / dumb lexer I built as a starting point. The general idea is that you have a state machine (in my case I only had two states: inBrackets and !inBrackets) and you read one letter at a time but interpret it differently based on your current state.
a quick solution using re:
test with one text line:
import re
line = '{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, {z.z, z.a}#edu.com'
com = re.findall(r'(#[^,\n]+),?', line) #trap #xx.yyy
adrs = re.findall(r'{([^}]+)}', line) #trap all inside { }
result=[]
for i in range(len(adrs)):
s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
result=result+s.split(',')
for r in result:
print(r)
output in list result:
a.b#uni.somewhere
c.d#uni.somewhere
e.f#uni.somewhere
x.y#edu.com
z.k#edu.com
z.z#edu.com
z.a#edu.com
test with a text file:
import io
data = io.StringIO(u'''\
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, {z.z, z.a}#edu.com
{a.b, c.d, e.f}#uni.anywhere
{x.y, z.k}#adi.com, {z.z, z.a}#du.com
''')
result=[]
import re
for line in data:
com = re.findall(r'(#[^,\n]+),?', line)
adrs = re.findall(r'{([^}]+)}', line)
for i in range(len(adrs)):
s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
result = result + s.split(',')
for r in result:
print(r)
output in list result:
a.b#uni.somewhere
c.d#uni.somewhere
e.f#uni.somewhere
x.y#edu.com
z.k#edu.com
z.z#edu.com
z.a#edu.com
a.b#uni.anywhere
c.d#uni.anywhere
e.f#uni.anywhere
x.y#adi.com
z.k#adi.com
z.z#du.com
z.a#du.com
I am trying to remove the outer most curly bracket while keeping only the inner string. My code almost works 100%, except when
expr = 'namespace P {\\na; b;}'
# I expect '\\na; b;'
# but I get 'namespace P {\na; b;}' instead
Any idea how to fix my regex string?
import doctest
import re
def remove_outer_curly_bracket(expr):
"""
>>> remove_outer_curly_bracket('P {')
'P {'
>>> remove_outer_curly_bracket('P')
'P'
>>> remove_outer_curly_bracket('P { a; b(); { c1(d,e); } }')
' a; b(); { c1(d,e); } '
>>> remove_outer_curly_bracket('a { }')
' '
>>> remove_outer_curly_bracket('')
''
>>> remove_outer_curly_bracket('namespace P {\\na; b;}')
'\\na; b;'
"""
r = re.findall(r'[.]*\{(.*)\}', expr)
return r[0] if r else expr
doctest.testmod()
This suffices:
def remove_outer_curly_bracket(expr):
r = re.search(r'{(.*)}', expr, re.DOTALL)
return r.group(1) if r else expr
The match will start as soon as possible, so the first { will indeed match the leftmost opening brace. Because * is greedy, .* will want to be as large as possible, which will ensure } will match the last closing brace.
Neither of the braces is a special character, and does not need escaping; also, [.]* matches any number of periods in a row, and will not help you at all in this task.
This will not work sensibly if the braces are not balanced; for example, for "{ { x }" will return " { x", but fortunately your examples do not include such.
EDIT: That said, this is just prettifying the original a bit. The functionality is unchanged. As blhsing says in comments, it seems your code is doing what it is supposed to. It even passes your tests.
EDIT2: There is nothing special in 'namespace P {\\na; b;}'. I believe you meant 'namespace P {\na; b;}'? With a line break inside? Indeed, that would not have worked. I changed my code so it does. The issue is that normally . matches every character except newline. We can modify that behaviour by supplying the flag re.DOTALL.
I have a string like this:
a = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
Currently I am using this re statement to get desired result:
filter(None, re.split("\\{(.*?)\\}", a))
But this gives me:
['CGPoint={CGPoint=d{CGPoint=dd', '}}', 'CGSize=dd', 'dd', 'CSize=aa']
which is incorrect for my current situation, I need a list like this:
['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
As #m.buettner points out in the comments, Python's implementation of regular expressions can't match pairs of symbols nested to an arbitrary degree. (Other languages can, notably current versions of Perl.) The Pythonic thing to do when you have text that regexs can't parse is to use a recursive-descent parser.
There's no need to reinvent the wheel by writing your own, however; there are a number of easy-to-use parsing libraries out there. I recommend pyparsing which lets you define a grammar directly in your code and easily attach actions to matched tokens. Your code would look something like this:
import pyparsing
lbrace = Literal('{')
rbrace = Literal('}')
contents = Word(printables)
expr = Forward()
expr << Combine(Suppress(lbrace) + contents + Suppress(rbrace) + expr)
for line in lines:
results = expr.parseString(line)
There's an alternative regex module for Python I really like that supports recursive patterns:
https://pypi.python.org/pypi/regex
pip install regex
Then you can use a recursive pattern in your regex as demonstrated in this script:
import regex
from pprint import pprint
thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
theregex = r'''
(
{
(?<match>
[^{}]*
(?:
(?1)
[^{}]*
)+
|
[^{}]+
)
}
|
(?<match>
[^{}]+
)
)
'''
matches = regex.findall(theregex, thestr, regex.X)
print 'all matches:\n'
pprint(matches)
print '\ndesired matches:\n'
print [match[1] for match in matches]
This outputs:
all matches:
[('{CGPoint={CGPoint=d{CGPoint=dd}}}', 'CGPoint={CGPoint=d{CGPoint=dd}}'),
('{CGSize=dd}', 'CGSize=dd'),
('dd', 'dd'),
('{CSize=aa}', 'CSize=aa')]
desired matches:
['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
pyparsing has a nestedExpr function for matching nested expressions:
import pyparsing as pp
ident = pp.Word(pp.alphanums)
expr = pp.nestedExpr("{", "}") | ident
thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
for result in expr.searchString(thestr):
print(result)
yields
[['CGPoint=', ['CGPoint=d', ['CGPoint=dd']]]]
[['CGSize=dd']]
['dd']
[['CSize=aa']]
Here is some pseudo code. It creates a stack of strings and pops them when a close brace is encountered. Some extra logic to handle the fact that the first braces encountered are not included in the array.
String source = "{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}";
Array results;
Stack stack;
foreach (match in source.match("[{}]|[^{}]+")) {
switch (match) {
case '{':
if (stack.size == 0) stack.push(new String()); // add new empty string
else stack.push('{'); // child, so include matched brace.
case '}':
if (stack.size == 1) results.add(stack.pop()) // clear stack add to array
else stack.last += stack.pop() + '}"; // pop from stack and concatenate to previous
default:
if (stack.size == 0) results.add(match); // loose text, add to results
else stack.last += match; // append to latest member.
}
}
I've got the string:
<u>40 -04-11</u>
How do I remove the spaces and hyphens so it returns 400411?
Currently I've got this:
(<u[^>]*>)(\-\s)(<\/u>)
But I can't figure out why it isn't working. Any insight would be appreciated.
Thanks
(<u[^>]*>)(\-\s)(<\/u>)
Your pattern above doesn't tell your regex where to expect numbers.
(<u[^>]*>)(?:-|\s|(\d+))*(<\/u>)
That should get you started, but not being a python guy, I can't give you the exact replacement syntax. Just be aware that the digits are in a repeating capture group.
Edit: This is an edit in response to your comment. Like I said, not a python guy, but this will probably do what you need if you hold your tongue just right.
def repl(matchobj):
if matchobj.group(1) is None:
return ''
else:
return matchobj.group(1)
source = '<u>40 -04-11</u>40 -04-11<u>40 -04-11</u>40 -04-11'
print re.sub(r'(?:\-|\s|(\d+))(?=[^><]*?<\/u>)', repl, source)
Results in:
>>>'<u>400411</u>40 -04-11<u>400411</u>40 -04-11'
If the above offends the Python deities, I promise to sacrifice the next PHP developer I come across. :)
You don't really need a regex, you could use :
>>> '<u>40 -04-11</u>'.replace('-','').replace(' ','')
'<u>400411</u>'
Using Perl syntax:
s{
(<u[^>]*>) (.*?) (</u>)
}{
my ($start, $body, $end) = ($1, $2, $3);
$body =~ s/[-\s]//g;
$start . $body . $end
}xesg;
Or if Python doesn't have an equivalent to /e,
my $out = '';
while (
$in =~ m{
\G (.*?)
(?: (<u[^>]*>) (.*?) (</u>) | \z )
}sg
) {
my ($pre, $start, $body, $end) = ($1, $2, $3, $4);
$out .= $pre;
if (defined($start)) {
$body =~ s/[-\s]//g;
$out .= $start . $body . $end;
}
}
I'm admittedly not very good at regexes, but the way I would do this is by:
Doing a match on a <u>...</u> pair
doing a re.sub on the bit between the match using group().
That looks like this:
example_str = "<u> 76-6-76s</u> 34243vvfv"
tmp = re.search("(<u[^>]*>)(.*?)(<\/u>)",example_str).group(2)
clean_str = re.sub("(\D)","",tmp)
>>>'76676'
You should expose correctly your problem. I firstly didn't exactly understand it.
Having read your comment (only between the tags <u> and </u> tags) , I can now propose:
import re
ss = '87- 453- kol<u>40 -04-11</u> maa78-55 98 12'
print re.sub('(?<=<u>).+?(?=</u>)',
lambda mat: ''.join(c for c in mat.group() if c not in ' -'),
ss)
result
87- 453- kol<u>400411</u> maa78-55 98 12