How to parse IPv6 address in a string using Python 3.x [duplicate] - python

I'm having trouble writing a regular expression that matches valid IPv6 addresses, including those in their compressed form (with :: or leading zeros omitted from each byte pair).
Can someone suggest a regular expression that would fulfill the requirement?
I'm considering expanding each byte pair and matching the result with a simpler regex.

I was unable to get @Factor Mystic's answer to work with POSIX regular expressions, so I wrote one that works with both POSIX regular expressions and Perl regular expressions.
It should match:
IPv6 addresses
zero-compressed IPv6 addresses (section 2.2 of RFC 5952)
link-local IPv6 addresses with zone index (section 11 of RFC 4007)
IPv4-embedded IPv6 addresses (section 2 of RFC 6052)
IPv4-mapped IPv6 addresses (section 2.1 of RFC 2765)
IPv4-translated addresses (section 2.1 of RFC 2765)
IPv6 Regular Expression:
(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
For ease of reading, the following is the above regular expression split at major OR points into separate lines:
# IPv6 RegEx
(
([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}| # 1:2:3:4:5:6:7:8
([0-9a-fA-F]{1,4}:){1,7}:| # 1:: 1:2:3:4:5:6:7::
([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}| # 1::8 1:2:3:4:5:6::8 1:2:3:4:5:6::8
([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}| # 1::7:8 1:2:3:4:5::7:8 1:2:3:4:5::8
([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}| # 1::6:7:8 1:2:3:4::6:7:8 1:2:3:4::8
([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}| # 1::5:6:7:8 1:2:3::5:6:7:8 1:2:3::8
([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}| # 1::4:5:6:7:8 1:2::4:5:6:7:8 1:2::8
[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})| # 1::3:4:5:6:7:8 1::3:4:5:6:7:8 1::8
:((:[0-9a-fA-F]{1,4}){1,7}|:)| # ::2:3:4:5:6:7:8 ::2:3:4:5:6:7:8 ::8 ::
fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}| # fe80::7:8%eth0 fe80::7:8%1 (link-local IPv6 addresses with zone index)
::(ffff(:0{1,4}){0,1}:){0,1}
((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}
(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])| # ::255.255.255.255 ::ffff:255.255.255.255 ::ffff:0:255.255.255.255 (IPv4-mapped IPv6 addresses and IPv4-translated addresses)
([0-9a-fA-F]{1,4}:){1,4}:
((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}
(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]) # 2001:db8:3:4::192.0.2.33 64:ff9b::192.0.2.33 (IPv4-Embedded IPv6 Address)
)
# IPv4 RegEx
((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])
To make the above easier to understand, the following "pseudo" code replicates the above:
IPV4SEG = (25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])
IPV4ADDR = (IPV4SEG\.){3,3}IPV4SEG
IPV6SEG = [0-9a-fA-F]{1,4}
IPV6ADDR = (
(IPV6SEG:){7,7}IPV6SEG| # 1:2:3:4:5:6:7:8
(IPV6SEG:){1,7}:| # 1:: 1:2:3:4:5:6:7::
(IPV6SEG:){1,6}:IPV6SEG| # 1::8 1:2:3:4:5:6::8 1:2:3:4:5:6::8
(IPV6SEG:){1,5}(:IPV6SEG){1,2}| # 1::7:8 1:2:3:4:5::7:8 1:2:3:4:5::8
(IPV6SEG:){1,4}(:IPV6SEG){1,3}| # 1::6:7:8 1:2:3:4::6:7:8 1:2:3:4::8
(IPV6SEG:){1,3}(:IPV6SEG){1,4}| # 1::5:6:7:8 1:2:3::5:6:7:8 1:2:3::8
(IPV6SEG:){1,2}(:IPV6SEG){1,5}| # 1::4:5:6:7:8 1:2::4:5:6:7:8 1:2::8
IPV6SEG:((:IPV6SEG){1,6})| # 1::3:4:5:6:7:8 1::3:4:5:6:7:8 1::8
:((:IPV6SEG){1,7}|:)| # ::2:3:4:5:6:7:8 ::2:3:4:5:6:7:8 ::8 ::
fe80:(:IPV6SEG){0,4}%[0-9a-zA-Z]{1,}| # fe80::7:8%eth0 fe80::7:8%1 (link-local IPv6 addresses with zone index)
::(ffff(:0{1,4}){0,1}:){0,1}IPV4ADDR| # ::255.255.255.255 ::ffff:255.255.255.255 ::ffff:0:255.255.255.255 (IPv4-mapped IPv6 addresses and IPv4-translated addresses)
(IPV6SEG:){1,4}:IPV4ADDR # 2001:db8:3:4::192.0.2.33 64:ff9b::192.0.2.33 (IPv4-Embedded IPv6 Address)
)
I posted a script on GitHub which tests the regular expression: https://gist.github.com/syzdek/6086792
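If you want to use this from Python directly, here is a minimal sketch of my own (not part of the gist) that assembles the same building blocks as the pseudo-code above and compiles them with the standard re module:
import re

# Building blocks, following the pseudo-code above (non-capturing groups added).
IPV4SEG  = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
IPV4ADDR = r'(?:' + IPV4SEG + r'\.){3}' + IPV4SEG
IPV6SEG  = r'[0-9a-fA-F]{1,4}'
IPV6GROUPS = (
    r'(?:' + IPV6SEG + r':){7}' + IPV6SEG,                   # 1:2:3:4:5:6:7:8
    r'(?:' + IPV6SEG + r':){1,7}:',                          # 1::              1:2:3:4:5:6:7::
    r'(?:' + IPV6SEG + r':){1,6}:' + IPV6SEG,                # 1::8             1:2:3:4:5:6::8
    r'(?:' + IPV6SEG + r':){1,5}(?::' + IPV6SEG + r'){1,2}',
    r'(?:' + IPV6SEG + r':){1,4}(?::' + IPV6SEG + r'){1,3}',
    r'(?:' + IPV6SEG + r':){1,3}(?::' + IPV6SEG + r'){1,4}',
    r'(?:' + IPV6SEG + r':){1,2}(?::' + IPV6SEG + r'){1,5}',
    IPV6SEG + r':(?::' + IPV6SEG + r'){1,6}',
    r':(?:(?::' + IPV6SEG + r'){1,7}|:)',                    # ::2:3:4:5:6:7:8  ::
    r'fe80:(?::' + IPV6SEG + r'){0,4}%[0-9a-zA-Z]+',         # link-local with zone index
    r'::(?:ffff(?::0{1,4})?:)?' + IPV4ADDR,                  # IPv4-mapped / IPv4-translated
    r'(?:' + IPV6SEG + r':){1,4}:' + IPV4ADDR,               # IPv4-embedded
)
IPV6ADDR = '|'.join('(?:' + g + ')' for g in IPV6GROUPS)
ipv6_re = re.compile(IPV6ADDR)

print(bool(ipv6_re.fullmatch('2001:db8::1')))        # True
print(bool(ipv6_re.fullmatch('::ffff:192.0.2.1')))   # True
re.fullmatch requires the pattern to cover the whole string; use search or finditer instead if you want to scan free text.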

The following will validate IPv4, IPv6 (full and compressed), and IPv6v4 (full and compressed) addresses:
'/^(?>(?>([a-f0-9]{1,4})(?>:(?1)){7}|(?!(?:.*[a-f0-9](?>:|$)){8,})((?1)(?>:(?1)){0,6})?::(?2)?)|(?>(?>(?1)(?>:(?1)){5}:|(?!(?:.*[a-f0-9]:){6,})(?3)?::(?>((?1)(?>:(?1)){0,4}):)?)?(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(?>\.(?4)){3}))$/iD'

It sounds like you may be using Python. If so, you can use something like this:
import socket

def check_ipv6(n):
    try:
        socket.inet_pton(socket.AF_INET6, n)
        return True
    except socket.error:
        return False

print(check_ipv6('::1'))    # True
print(check_ipv6('foo'))    # False
print(check_ipv6(5))        # TypeError exception
print(check_ipv6(None))     # TypeError exception
I don't think you have to have IPv6 compiled in to Python to get inet_pton, which can also parse IPv4 addresses if you pass in socket.AF_INET as the first parameter. Note: this may not work on non-Unix systems.
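For completeness, here is a small Python 3 variant of the same idea (my sketch, not the original poster's code) that takes the address family as a parameter, so one helper covers both IPv4 and IPv6:
import socket

def valid_ip(address, family=socket.AF_INET6):
    # inet_pton raises OSError (socket.error is an alias of OSError in Python 3) on invalid input.
    try:
        socket.inet_pton(family, address)
        return True
    except OSError:
        return False

print(valid_ip('::1'))                           # True
print(valid_ip('192.168.0.1', socket.AF_INET))   # True
print(valid_ip('192.168.0.1'))                   # False (valid IPv4, but not IPv6)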

From "IPv6 regex":
(\A([0-9a-f]{1,4}:){1,1}(:[0-9a-f]{1,4}){1,6}\Z)|
(\A([0-9a-f]{1,4}:){1,2}(:[0-9a-f]{1,4}){1,5}\Z)|
(\A([0-9a-f]{1,4}:){1,3}(:[0-9a-f]{1,4}){1,4}\Z)|
(\A([0-9a-f]{1,4}:){1,4}(:[0-9a-f]{1,4}){1,3}\Z)|
(\A([0-9a-f]{1,4}:){1,5}(:[0-9a-f]{1,4}){1,2}\Z)|
(\A([0-9a-f]{1,4}:){1,6}(:[0-9a-f]{1,4}){1,1}\Z)|
(\A(([0-9a-f]{1,4}:){1,7}|:):\Z)|
(\A:(:[0-9a-f]{1,4}){1,7}\Z)|
(\A((([0-9a-f]{1,4}:){6})(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3})\Z)|
(\A(([0-9a-f]{1,4}:){5}[0-9a-f]{1,4}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3})\Z)|
(\A([0-9a-f]{1,4}:){5}:[0-9a-f]{1,4}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A([0-9a-f]{1,4}:){1,1}(:[0-9a-f]{1,4}){1,4}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A([0-9a-f]{1,4}:){1,2}(:[0-9a-f]{1,4}){1,3}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A([0-9a-f]{1,4}:){1,3}(:[0-9a-f]{1,4}){1,2}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A([0-9a-f]{1,4}:){1,4}(:[0-9a-f]{1,4}){1,1}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A(([0-9a-f]{1,4}:){1,5}|:):(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A:(:[0-9a-f]{1,4}){1,5}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)

This catches the loopback (::1) as well as other IPv6 addresses.
I changed {} to + and put : inside the first square bracket:
([a-f0-9:]+:+)+[a-f0-9]+
Tested against ifconfig -a output and on http://regexr.com/.
On a Unix or macOS terminal, egrep's -o option returns only the matching output (the IPv6 addresses, including ::1):
ifconfig -a | egrep -o '([a-f0-9:]+:+)+[a-f0-9]+'
To get all IP addresses (IPv4 or IPv6) and print the matches:
ifconfig -a | egrep -o '([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})|(([a-f0-9:]+:+)+[a-f0-9]+)'

I'd have to strongly second the answer from Frank Krueger.
Whilst you say you need a regular expression to match an IPv6 address, I'm assuming what you really need is to be able to check if a given string is a valid IPv6 address. There is a subtle but important distinction here.
There is more than one way to check if a given string is a valid IPv6 address and regular expression matching is only one solution.
Use an existing library if you can. The library will have fewer bugs and its use will result in less code for you to maintain.
The regular expression suggested by Factor Mystic is long and complex. It most likely works, but you should also consider how you'd cope if it unexpectedly fails. The point I'm trying to make here is that if you can't form a required regular expression yourself you won't be able to easily debug it.
If you have no suitable library it may be better to write your own IPv6 validation routine that doesn't depend on regular expressions. If you write it you understand it and if you understand it you can add comments to explain it so that others can also understand and subsequently maintain it.
Act with caution when using a regular expression whose functionality you can't explain to someone else.
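To make that advice concrete for the original question: Python 3 already ships such a library. A minimal sketch using the standard ipaddress module (my example, not part of the answer above):
import ipaddress

def is_ipv6(text):
    # ip_address() raises ValueError for anything it cannot parse,
    # so the try/except doubles as the validation step.
    try:
        return isinstance(ipaddress.ip_address(text), ipaddress.IPv6Address)
    except ValueError:
        return False

print(is_ipv6('fe80::1'))    # True
print(is_ipv6('10.0.0.1'))   # False
print(is_ipv6('not an ip'))  # False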

I'm not an IPv6 expert, but I think you can get a pretty good result more easily with this one:
^([0-9A-Fa-f]{0,4}:){2,7}([0-9A-Fa-f]{1,4}$|((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4})$
To answer "is this a valid IPv6 address?", it looks OK to me. To break it down into parts... forget it. I've omitted the unspecified address (::) since there is no use for an "unspecified address" in my database.
The beginning:
^([0-9A-Fa-f]{0,4}:){2,7} <-- matches the compressible part; read this as: between 2 and 7 colons which may have hexadecimal numbers between them.
Followed by:
[0-9A-Fa-f]{1,4}$ <-- a hexadecimal number (leading zeros omitted)
OR
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4} <-- an IPv4 address

This regular expression will match valid IPv6 and IPv4 addresses in accordance with GNU C++ implementation of regex with REGULAR EXTENDED mode used:
"^\s*((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}))|:)))(%.+)?\s*$"

A simple regex that will match, but I wouldn't recommend for validation of any sort is this:
([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4}
Note this matches compression anywhere in the address, though it won't match the loopback address ::1. I find this a reasonable compromise in order to keep the regex simple.
I successfully use this in iTerm2 smart selection rules to quad-click IPv6 addresses.
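The same compromise works for quick extraction in Python, for example pulling candidate addresses out of arbitrary text (a sketch with made-up sample text):
import re

candidate = re.compile(r'([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4}')
sample = 'inet6 fe80::7ed1:c3ff:feec:dee1 scope link\ninet6 2001:db8::1/64 scope global'
print([m.group(0) for m in candidate.finditer(sample)])
# ['fe80::7ed1:c3ff:feec:dee1', '2001:db8::1']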

Beware! In Java, the use of InetAddress and related classes (Inet4Address, Inet6Address, URL) may involve network traffic! E.g. DNS resolution (URL.equals, InetAddress from string!). Such a call may take a long time and is blocking!
For IPv6 I have something like this. This of course does not handle the very subtle details of IPv6, such as zone indices being allowed only on some classes of IPv6 addresses. And this regex is not written for group capturing; it is only a "matches" kind of regexp.
S - IPv6 segment = [0-9a-f]{1,4}
I - IPv4 = (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})
Schematic (the first part matches IPv6 addresses with an IPv4 suffix, the second part matches plain IPv6 addresses, the last part the zone index):
(
(
::(S:){0,5}|
S::(S:){0,4}|
(S:){2}:(S:){0,3}|
(S:){3}:(S:){0,2}|
(S:){4}:(S:)?|
(S:){5}:|
(S:){6}
)
I
|
:(:|(:S){1,7})|
S:(:|(:S){1,6})|
(S:){2}(:|(:S){1,5})|
(S:){3}(:|(:S){1,4})|
(S:){4}(:|(:S){1,3})|
(S:){5}(:|(:S){1,2})|
(S:){6}(:|(:S))|
(S:){7}:|
(S:){7}S
)
(?:%[0-9a-z]+)?
And here is the mighty regex (case insensitive, surround it with whatever is needed, like beginning/end of line, etc.):
(?:
(?:
::(?:[0-9a-f]{1,4}:){0,5}|
[0-9a-f]{1,4}::(?:[0-9a-f]{1,4}:){0,4}|
(?:[0-9a-f]{1,4}:){2}:(?:[0-9a-f]{1,4}:){0,3}|
(?:[0-9a-f]{1,4}:){3}:(?:[0-9a-f]{1,4}:){0,2}|
(?:[0-9a-f]{1,4}:){4}:(?:[0-9a-f]{1,4}:)?|
(?:[0-9a-f]{1,4}:){5}:|
(?:[0-9a-f]{1,4}:){6}
)
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})|
:(?::|(?::[0-9a-f]{1,4}){1,7})|
[0-9a-f]{1,4}:(?::|(?::[0-9a-f]{1,4}){1,6})|
(?:[0-9a-f]{1,4}:){2}(?::|(?::[0-9a-f]{1,4}){1,5})|
(?:[0-9a-f]{1,4}:){3}(?::|(?::[0-9a-f]{1,4}){1,4})|
(?:[0-9a-f]{1,4}:){4}(?::|(?::[0-9a-f]{1,4}){1,3})|
(?:[0-9a-f]{1,4}:){5}(?::|(?::[0-9a-f]{1,4}){1,2})|
(?:[0-9a-f]{1,4}:){6}(?::|(?::[0-9a-f]{1,4}))|
(?:[0-9a-f]{1,4}:){7}:|
(?:[0-9a-f]{1,4}:){7}[0-9a-f]{1,4}
)
(?:%[0-9a-z]+)?

If you use Perl try Net::IPv6Addr
use Net::IPv6Addr;
if( defined Net::IPv6Addr::is_ipv6($ip_address) ){
    print "Looks like an ipv6 address\n";
}
NetAddr::IP
use NetAddr::IP;
my $obj = NetAddr::IP->new6($ip_address);
Validate::IP
use Validate::IP qw'is_ipv6';
if( is_ipv6($ip_address) ){
    print "Looks like an ipv6 address\n";
}

The following regex is for IPv6 only. Group 1 matches the IP.
(([0-9a-fA-F]{0,4}:){1,7}[0-9a-fA-F]{0,4})
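For example, in Python (a quick sketch; the pattern is permissive, so validate separately if correctness matters):
import re

ipv6_like = re.compile(r'(([0-9a-fA-F]{0,4}:){1,7}[0-9a-fA-F]{0,4})')
m = ipv6_like.search('connected from 2001:db8::42 port 443')
if m:
    print(m.group(1))   # 2001:db8::42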

Regexes for ipv6 can get really tricky when you consider addresses with embedded ipv4 and addresses that are compressed, as you can see from some of these answers.
The open-source IPAddress Java library will validate all standard representations of IPv6 and IPv4 and also supports prefix-length (and validation of such). Disclaimer: I am the project manager of that library.
Code example:
try {
    IPAddressString str = new IPAddressString("::1");
    IPAddress addr = str.toAddress();
    if(addr.isIPv6() || addr.isIPv6Convertible()) {
        IPv6Address ipv6Addr = addr.toIPv6();
    }
    //use address
} catch(AddressStringException e) {
    //e.getMessage has validation error
}

In Scala, use the well-known Apache Commons validator.
http://mvnrepository.com/artifact/commons-validator/commons-validator/1.4.1
libraryDependencies += "commons-validator" % "commons-validator" % "1.4.1"
import org.apache.commons.validator.routines._
/**
 * Validates if the passed ip is a valid IPv4 or IPv6 address.
 *
 * @param ip The IP address to validate.
 * @return True if the passed IP address is valid, false otherwise.
 */
def ip(ip: String) = InetAddressValidator.getInstance().isValid(ip)
Following are the tests of the method ip(ip: String):
"The `ip` validator" should {
"return false if the IPv4 is invalid" in {
ip("123") must beFalse
ip("255.255.255.256") must beFalse
ip("127.1") must beFalse
ip("30.168.1.255.1") must beFalse
ip("-1.2.3.4") must beFalse
}
"return true if the IPv4 is valid" in {
ip("255.255.255.255") must beTrue
ip("127.0.0.1") must beTrue
ip("0.0.0.0") must beTrue
}
//IPv6
//#see: http://www.ronnutter.com/ipv6-cheatsheet-on-identifying-valid-ipv6-addresses/
"return false if the IPv6 is invalid" in {
ip("1200::AB00:1234::2552:7777:1313") must beFalse
}
"return true if the IPv6 is valid" in {
ip("1200:0000:AB00:1234:0000:2552:7777:1313") must beTrue
ip("21DA:D3:0:2F3B:2AA:FF:FE28:9C5A") must beTrue
}
}

Looking at the patterns included in the other answers, there are a number of good ones that can be improved by referencing groups and utilizing lookaheads. Here is an example of a self-referencing pattern that I would use in PHP if I had to:
^(?<hgroup>(?<hex>[[:xdigit:]]{0,4}) # grab a sequence of up to 4 hex digits
# and name this pattern for usage later
(?<!:::):{1,2}) # match 1 or 2 ':' characters
# as long as we can't match 3
(?&hgroup){1,6} # match our hex group 1 to 6 more times
(?:(?:
# match an ipv4 address or
(?<dgroup>2[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3}(?&dgroup)
# match our hex group one last time
|(?&hex))$
Note: PHP has a built-in filter for this, which would be a better solution than this pattern.
Regex101 Analysis

Depending on your needs, an approximation like:
[0-9a-f:]+
may be enough (as with simple log file grepping, for example.)
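For example, as a rough Python filter over log lines (my sketch; I additionally require at least one colon so that ordinary words built from hex letters do not match):
import re

ipv6ish = re.compile(r'[0-9a-f:]*:[0-9a-f:]+', re.IGNORECASE)
lines = ['accept from 2001:db8::5 port 22', 'local syslog restart']
print([line for line in lines if ipv6ish.search(line)])
# ['accept from 2001:db8::5 port 22']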

I generated the following using Python, and it works with the re module. The look-ahead assertions ensure that the correct number of dots or colons appear in the address. It does not support IPv4 in IPv6 notation.
import re

pattern = r'^(?=\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$)(?:(?:25[0-5]|[12][0-4][0-9]|1[5-9][0-9]|[1-9]?[0-9])\.?){4}$|(?=^(?:[0-9a-f]{0,4}:){2,7}[0-9a-f]{0,4}$)(?![^:]*::.+::[^:]*$)(?:(?=.*::.*)|(?=\w+:\w+:\w+:\w+:\w+:\w+:\w+:\w+))(?:(?:^|:)(?:[0-9a-f]{4}|[1-9a-f][0-9a-f]{0,3})){0,8}(?:::(?:[0-9a-f]{1,4}(?:$|:)){0,6})?$'
result = re.match(pattern, ip)   # 'ip' holds the address string to test
if result:
    print(result.group(0))

In Java, you can use the library class sun.net.util.IPAddressUtil:
IPAddressUtil.isIPv6LiteralAddress(iPaddress);

It is difficult to find a regular expression which works for all IPv6 cases. They are usually hard to maintain, not easily readable, and may cause performance problems. Hence, I want to share an alternative solution which I have developed: Regular Expression (RegEx) for IPv6 Separate from IPv4
Now you may ask, "This method only finds IPv6; how can I find IPv6 in a text or file?" Here are methods for that issue too.
Note: If you do not want to use the IPAddress class in .NET, you can also replace it with my method. It also covers mapped IPv4 and special cases, while IPAddress does not.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Text;

class IPv6
{
    public List<string> FindIPv6InFile(string filePath)
    {
        Char ch;
        StringBuilder sbIPv6 = new StringBuilder();
        List<string> listIPv6 = new List<string>();
        StreamReader reader = new StreamReader(filePath);
        do
        {
            bool hasColon = false;
            int length = 0;
            do
            {
                ch = (char)reader.Read();
                if (IsEscapeChar(ch))
                    break;
                //Check the first 5 chars, if it has colon, then continue appending to stringbuilder
                if (!hasColon && length < 5)
                {
                    if (ch == ':')
                    {
                        hasColon = true;
                    }
                    sbIPv6.Append(ch.ToString());
                }
                else if (hasColon) //if no colon in first 5 chars, then dont append to stringbuilder
                {
                    sbIPv6.Append(ch.ToString());
                }
                length++;
            } while (!reader.EndOfStream);
            if (hasColon && !listIPv6.Contains(sbIPv6.ToString()) && IsIPv6(sbIPv6.ToString()))
            {
                listIPv6.Add(sbIPv6.ToString());
            }
            sbIPv6.Clear();
        } while (!reader.EndOfStream);
        reader.Close();
        reader.Dispose();
        return listIPv6;
    }

    public List<string> FindIPv6InText(string text)
    {
        StringBuilder sbIPv6 = new StringBuilder();
        List<string> listIPv6 = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            bool hasColon = false;
            int length = 0;
            do
            {
                if (IsEscapeChar(text[length + i]))
                    break;
                //Check the first 5 chars, if it has colon, then continue appending to stringbuilder
                if (!hasColon && length < 5)
                {
                    if (text[length + i] == ':')
                    {
                        hasColon = true;
                    }
                    sbIPv6.Append(text[length + i].ToString());
                }
                else if (hasColon) //if no colon in first 5 chars, then dont append to stringbuilder
                {
                    sbIPv6.Append(text[length + i].ToString());
                }
                length++;
            } while (i + length != text.Length);
            if (hasColon && !listIPv6.Contains(sbIPv6.ToString()) && IsIPv6(sbIPv6.ToString()))
            {
                listIPv6.Add(sbIPv6.ToString());
            }
            i += length;
            sbIPv6.Clear();
        }
        return listIPv6;
    }

    bool IsEscapeChar(char ch)
    {
        if (ch != ' ' && ch != '\r' && ch != '\n' && ch != '\t')
        {
            return false;
        }
        return true;
    }

    bool IsIPv6(string maybeIPv6)
    {
        IPAddress ip;
        if (IPAddress.TryParse(maybeIPv6, out ip))
        {
            return ip.AddressFamily == AddressFamily.InterNetworkV6;
        }
        else
        {
            return false;
        }
    }
}
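If you only need this in Python (the language of the original question), the same scan-then-validate idea can be sketched with the standard ipaddress module. This is my rough equivalent, not a translation of the C# above:
import ipaddress

def find_ipv6_in_text(text):
    # Split on whitespace, trim common punctuation, and keep the unique tokens
    # that the standard library accepts as IPv6 addresses.
    found = []
    for token in text.split():
        candidate = token.strip('[],;()')
        try:
            if isinstance(ipaddress.ip_address(candidate), ipaddress.IPv6Address):
                if candidate not in found:
                    found.append(candidate)
        except ValueError:
            continue
    return found

print(find_ipv6_in_text('peers: 2001:db8::1, 10.0.0.7, ::1'))
# ['2001:db8::1', '::1']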

InetAddressUtils has all the patterns defined. I ended up using their pattern directly, and am pasting it here for reference:
private static final String IPV4_BASIC_PATTERN_STRING =
"(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}" + // initial 3 fields, 0-255 followed by .
"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"; // final field, 0-255
private static final Pattern IPV4_PATTERN =
Pattern.compile("^" + IPV4_BASIC_PATTERN_STRING + "$");
private static final Pattern IPV4_MAPPED_IPV6_PATTERN = // TODO does not allow for redundant leading zeros
Pattern.compile("^::[fF]{4}:" + IPV4_BASIC_PATTERN_STRING + "$");
private static final Pattern IPV6_STD_PATTERN =
Pattern.compile(
"^[0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4}){7}$");
private static final Pattern IPV6_HEX_COMPRESSED_PATTERN =
Pattern.compile(
"^(([0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4}){0,5})?)" + // 0-6 hex fields
"::" +
"(([0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4}){0,5})?)$"); // 0-6 hex fields

Using Ruby? Try this:
/^(((?=.*(::))(?!.*\3.+\3))\3?|[\dA-F]{1,4}:)([\dA-F]{1,4}(\3|:\b)|\2){5}(([\dA-F]{1,4}(\3|:\b|$)|\2){2}|(((2[0-4]|1\d|[1-9])?\d|25[0-5])\.?\b){4})\z/i

For PHP 5.2+ users filter_var works great.
I know this doesn't answer the original question (specifically a regex solution), but I post this in the hope it may help someone else in the future.
$is_ip4address = (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== FALSE);
$is_ip6address = (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== FALSE);

Here's what I came up with, using a bit of lookahead and named groups. This is of course just IPv6, but it shouldn't interfere with additional patterns if you want to add IPv4:
(?=([0-9a-f]+(:[0-9a-f])*)?(?P<wild>::)(?!([0-9a-f]+:)*:))(::)?([0-9a-f]{1,4}:{1,2}){0,6}(?(wild)[0-9a-f]{0,4}|[0-9a-f]{1,4}:[0-9a-f]{1,4})

Just matching local ones from an origin, with the square brackets included. I know it's not as comprehensive, but in JavaScript the other ones had difficult-to-trace issues, primarily that of not working, so this seems to get me what I need for now. The extra capitals A-F aren't needed either.
^\[([0-9a-fA-F]{1,4})(\:{1,2})([0-9a-fA-F]{1,4})(\:{1,2})([0-9a-fA-F]{1,4})(\:{1,2})([0-9a-fA-F]{1,4})(\:{1,2})([0-9a-fA-F]{1,4})\]
Jinnko's version is simpler and better, I see.

As stated above, another way to get an IPv6 textual-representation validating parser is to use programming. Here is one that is fully compliant with RFC 4291 and RFC 5952. I've written this code in ANSI C (works with GCC, passed tests on Linux; works with clang, passed tests on FreeBSD). Thus, it relies only on the ANSI C standard library, so it can be compiled everywhere (I've used it for IPv6 parsing inside a kernel module with FreeBSD).
// IPv6 textual representation validating parser fully compliant with RFC-4291 and RFC-5952
// BSD-licensed / Copyright 2015-2017 Alexandre Fenyo

#include <string.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>

typedef enum { false, true } bool;

static const char hexdigits[] = "0123456789abcdef";

static int digit2int(const char digit) {
    return strchr(hexdigits, digit) - hexdigits;
}

// This IPv6 address parser handles any valid textual representation according to RFC-4291 and RFC-5952.
// Other representations will return -1.
//
// note that str input parameter has been modified when the function call returns
//
// parse_ipv6(char *str, struct in6_addr *retaddr)
//   parse textual representation of IPv6 addresses
//   str:     input arg
//   retaddr: output arg
int parse_ipv6(char *str, struct in6_addr *retaddr) {
    bool compressed_field_found = false;
    unsigned char *_retaddr = (unsigned char *) retaddr;
    char *_str = str;
    char *delim;

    bzero((void *) retaddr, sizeof(struct in6_addr));
    if (!strlen(str) || strchr(str, ':') == NULL || (str[0] == ':' && str[1] != ':') ||
        (strlen(str) >= 2 && str[strlen(str) - 1] == ':' && str[strlen(str) - 2] != ':')) return -1;

    // convert transitional to standard textual representation
    if (strchr(str, '.')) {
        int ipv4bytes[4];
        char *curp = strrchr(str, ':');
        if (curp == NULL) return -1;
        char *_curp = ++curp;
        int i;
        for (i = 0; i < 4; i++) {
            char *nextsep = strchr(_curp, '.');
            if (_curp[0] == '0' || (i < 3 && nextsep == NULL) || (i == 3 && nextsep != NULL)) return -1;
            if (nextsep != NULL) *nextsep = 0;
            int j;
            for (j = 0; j < strlen(_curp); j++) if (_curp[j] < '0' || _curp[j] > '9') return -1;
            if (strlen(_curp) > 3) return -1;
            const long val = strtol(_curp, NULL, 10);
            if (val < 0 || val > 255) return -1;
            ipv4bytes[i] = val;
            _curp = nextsep + 1;
        }
        sprintf(curp, "%x%02x:%x%02x", ipv4bytes[0], ipv4bytes[1], ipv4bytes[2], ipv4bytes[3]);
    }

    // parse standard textual representation
    do {
        if ((delim = strchr(_str, ':')) == _str || (delim == NULL && !strlen(_str))) {
            if (delim == str) _str++;
            else if (delim == NULL) return 0;
            else {
                if (compressed_field_found == true) return -1;
                if (delim == str + strlen(str) - 1 && _retaddr != (unsigned char *) (retaddr + 1)) return 0;
                compressed_field_found = true;
                _str++;
                int cnt = 0;
                char *__str;
                for (__str = _str; *__str; ) if (*(__str++) == ':') cnt++;
                unsigned char *__retaddr = - 2 * ++cnt + (unsigned char *) (retaddr + 1);
                if (__retaddr <= _retaddr) return -1;
                _retaddr = __retaddr;
            }
        } else {
            char hexnum[4] = "0000";
            if (delim == NULL) delim = str + strlen(str);
            if (delim - _str > 4) return -1;
            int i;
            for (i = 0; i < delim - _str; i++)
                if (!isxdigit(_str[i])) return -1;
                else hexnum[4 - (delim - _str) + i] = tolower(_str[i]);
            _str = delim + 1;
            *(_retaddr++) = (digit2int(hexnum[0]) << 4) + digit2int(hexnum[1]);
            *(_retaddr++) = (digit2int(hexnum[2]) << 4) + digit2int(hexnum[3]);
        }
    } while (_str < str + strlen(str));
    return 0;
}

The regex allows the use of leading zeros in the IPv4 parts.
Some Unix and Mac distros interpret those segments as octal.
I suggest using 25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d as an IPv4 segment.
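In Python terms the difference looks like this (a small sketch using the suggested segment):
import re

ipv4_seg  = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
ipv4_addr = re.compile(r'^(?:' + ipv4_seg + r'\.){3}' + ipv4_seg + r'$')

print(bool(ipv4_addr.match('192.168.1.1')))    # True
print(bool(ipv4_addr.match('192.168.01.1')))   # False: leading zero rejected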

This will work for IPv4 and IPv6:
^(([0-9a-f]{0,4}:){1,7}[0-9a-f]{1,4}|([0-9]{1,3}\.){3}[0-9]{1,3})$

You can use the ipextract shell tools I made for this purpose. They are based on regexp and grep.
Usage:
$ ifconfig | ipextract6
fe80::1%lo0
::1
fe80::7ed1:c3ff:feec:dee1%en0

Try this small one-liner. It should only match valid uncompressed/compressed IPv6 addresses (no IPv4 hybrids)
/(?!.*::.*::)(?!.*:::.*)(?!:[a-f0-9])((([a-f0-9]{1,4})?[:](?!:)){7}|(?=(.*:[:a-f0-9]{1,4}::|^([:a-f0-9]{1,4})?::))(([a-f0-9]{1,4})?[:]{1,2}){1,6})[a-f0-9]{1,4}/

If you want only normal IPs (no slashes), here:
^(?:[0-9a-f]{1,4}(?:::)?){0,7}::[0-9a-f]+$
I use it for my syntax highlighter in a hosts file editor application. Works like a charm.

Related

Extract email addresses from academic curly braces format

I have a file where each line contains a string that represents one or more email addresses.
Multiple addresses can be grouped inside curly braces as follows:
{name.surname, name2.surnam2}@something.edu
Which means both addresses name.surname@something.edu and name2.surname2@something.edu are valid (this format is commonly used in scientific papers).
Moreover, a single line can also contain curly brackets multiple times. Example:
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
results in:
a.b@uni.somewhere
c.d@uni.somewhere
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com
Any suggestion on how I can parse this format to extract all email addresses? I'm trying with regexes but I'm currently struggling.
Pyparsing is a PEG parser that gives you an embedded DSL to build up parsers that can read through expressions like this, with resulting code that is more readable (and maintainable) than regular expressions, and flexible enough to add afterthoughts (wait, some parts of the email can be in quotes?).
pyparsing uses '+' and '|' operators to build up your parser from smaller bits. It also supports named fields (similar to regex named groups) and parse-time callbacks. See how this all rolls together below:
import pyparsing as pp

LBRACE, RBRACE = map(pp.Suppress, "{}")
email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=',{}@')

# define a compressed email, and assign names to the separate parts
# for easier processing - luckily the default delimitedList delimiter is ','
compressed_email = (LBRACE
                    + pp.Group(pp.delimitedList(email_part))('names')
                    + RBRACE
                    + '@'
                    + email_part('trailing'))

# add a parse-time callback to expand the compressed emails into a list
# of constructed emails - note how the names are used
def expand_compressed_email(t):
    return ["{}@{}".format(name, t.trailing) for name in t.names]

compressed_email.addParseAction(expand_compressed_email)

# some lists will just contain plain old uncompressed emails too
# Combine will merge the separate tokens into a single string
plain_email = pp.Combine(email_part + '@' + email_part)

# the complete list parser looks for a comma-delimited list of compressed
# or plain emails
email_list_parser = pp.delimitedList(compressed_email | plain_email)
pyparsing parsers come with a runTests method to test your parser against various test strings:
tests = """\
# original test string
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com
# a tricky email containing a quoted string
{x.y, z.k}#edu.com, "{a, b}"#domain.com
# just a plain email
plain_old_bob#uni.elsewhere
# mixed list of plain and compressed emails
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, plain_old_bob#uni.elsewhere
"""
email_list_parser.runTests(tests)
Prints:
# original test string
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com
['a.b#uni.somewhere', 'c.d#uni.somewhere', 'e.f#uni.somewhere', 'x.y#edu.com', 'z.k#edu.com']
# a tricky email containing a quoted string
{x.y, z.k}#edu.com, "{a, b}"#domain.com
['x.y#edu.com', 'z.k#edu.com', '"{a, b}"#domain.com']
# just a plain email
plain_old_bob#uni.elsewhere
['plain_old_bob#uni.elsewhere']
# mixed list of plain and compressed emails
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, plain_old_bob#uni.elsewhere
['a.b#uni.somewhere', 'c.d#uni.somewhere', 'e.f#uni.somewhere', 'x.y#edu.com', 'z.k#edu.com', 'plain_old_bob#uni.elsewhere']
DISCLOSURE: I am the author of pyparsing.
Note
I'm more familiar with JavaScript than Python, and the basic logic is the same regardless (the different is syntax), so I've written my solutions here in JavaScript. Feel free to translate to Python.
The Issue
This question is a bit more involved than a simple one-line script or regular expression, but depending on the specific requirements you may be able to get away with something rudimentary.
For starters, parsing an e-mail is not trivially boiled down to a single regular expression. This website has several examples of regular expressions that will match "many" e-mails, but explains the trade-offs (complexity versus accuracy) and goes on to include the RFC 5322 standard regular expression that should theoretically match any e-mail, followed by a paragraph for why you shouldn't use it. However even that regular expression assumes that a domain name taking the form of an IP address can only consist of a tuple of four integers ranging from 0 to 255 -- it doesn't allow for IPv6
Even something as simple as:
{a, b}@domain.com
Could get tripped up because technically according to the e-mail address specification an e-mail address can contain ANY ASCII characters surrounded by quotes. The following is a valid (single) e-mail address:
"{a, b}"@domain.com
To accurately parse an e-mail would require that you read the characters one letter at a time and build a finite state machine to track whether you are within a double-quote, within a curly brace, before the @, after the @, parsing a domain name, parsing an IP, etc. In this way you could tokenize the address, locate your curly brace token, and parse it independently.
Something Rudimentary
Regular expressions are not the way to go for 100% accuracy and support for all e-mails, *especially* if you want to support more than one e-mail on a single line. But we'll start with them and try to build from there.
You've probably tried a regular expression like:
/\{(([^,]+),?)+\}\@(\w+\.)+[A-Za-z]+/
Match a single curly brace...
Followed by one or more instances of:
One or more non-comma characters...
Followed by zero or one commas
Followed by a single closing curly brace...
Followed by a single @
Followed by one or more instances of:
One or more "word" characters...
Followed by a single .
Followed by one or more alpha characters
This should match something roughly of the form:
{one, two}@domain1.domain2.toplevel
This handles validating, next is the issue of extracting all valid e-mails. Note that we have two sets of parenthesis in the name portion of the e-mail address that are nested: (([^,]+),?). This causes a problem for us. Many regular expression engines don't know how to return matches in this case. Consider what happens when I run this in JavaScript using my Chrome developer console:
var regex = /\{(([^,]+),?)+\}\@(\w+\.)+[A-Za-z]+/
var matches = "{one, two}@domain.com".match(regex)
Array(4) [ "{one, two}@domain.com", " two", " two", "domain." ]
Well that wasn't right. It found two twice, but didn't find one once! To fix this, we need to eliminate the nesting and do this in two steps.
var regexOne = /\{([^}]+)\}\@(\w+\.)+[A-Za-z]+/
"{one, two}@domain.com".match(regexOne)
Array(3) [ "{one, two}@domain.com", "one, two", "domain." ]
Now we can use the match and parse that separately:
// Note: It's important that this be a global regex (the /g modifier) since we expect the pattern to match multiple times
var regexTwo = /([^,]+,?)/g
var nameMatches = matches[1].match(regexTwo)
Array(2) [ "one,", " two" ]
Now we can trim these and get our names:
nameMatches.map(name => name.replace(/, /g, "")
nameMatches
Array(2) [ "one", "two" ]
For constructing the "domain" part of the e-mail, we'll need similar logic for everything after the @, since this has a potential for repeats the same way the name part had a potential for repeats. Our final code (in JavaScript) may look something like this (you'll have to convert to Python yourself):
function getEmails(input)
{
var emailRegex = /([^@]+)\@(.+)/;
var emailParts = input.match(emailRegex);
var name = emailParts[1];
var domain = emailParts[2];
var nameList;
if (/\{.+\}/.test(name))
{
// The name takes the form "{...}"
var nameRegex = /([^,]+,?)/g;
var nameParts = name.match(nameRegex);
nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
}
else
{
// The name is not surrounded by curly braces
nameList = [name];
}
return nameList.map(name => `${name}@${domain}`);
}
Multi-email Lines
This is where things start to get tricky, and we need to accept a little less accuracy if we don't want to build a full on lexer / tokenizer. Because our e-mails contain commas (within the name field) we can't accurately split on commas -- unless those commas aren't within curly braces. With my knowledge of regular expressions, I don't know if this can be easily done. It may be possible with lookahead or lookbehind operators, but someone else will have to fill me in on that.
What can be easily done with regular expressions, however, is finding a block of text containing a comma that comes after the @ sign. Something like: @[^@{]+?,
In the string a@b.com, c@d.com this would match the entire phrase @b.com, - but the important thing is that it gives us a place to split our string. The tricky bit is then finding out how to split your string here. Something along the lines of this will work most of the time:
var emails = "a@b.com, c@d.com"
var matches = emails.match(/@[^@{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(2) [ "a", " c@d.com" ]
split[0] = split[0] + matches[0] // Add back in what we split on
This has a potential bug should you have two e-mails in the list with the same domain:
var emails = "a@b.com, c@b.com, d@e.com"
var matches = emails.match(/@[^@{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(3) [ "a", " c", " d@e.com" ]
split[0] = split[0] + matches[0]
console.log(split) // Array(3) [ "a@b.com", " c", " d@e.com" ]
But again, without building a lexer / tokenizer we're accepting that our solution will only work for most cases and not all.
However since the task of splitting one line into multiple e-mails is easier than diving into the e-mail, extracting a name, and parsing the name: we may be able to write a really stupid lexer for just this part:
var inBrackets = false
var emails = "{a, b}#c.com, d#e.com"
var split = []
var lastSplit = 0
for (var i = 0; i < emails.length; i++)
{
if (inBrackets && emails[i] === "}")
inBrackets = false;
if (!inBrackets && emails[i] === "{")
inBrackets = true;
if (!inBrackets && emails[i] === ",")
{
split.push(emails.substring(lastSplit, i))
lastSplit = i + 1 // Skip the comma
}
}
split.push(emails.substring(lastSplit))
console.log(split)
Once again, this won't be a perfect solution because an e-mail address may exist like the following:
","#domain.com
But, for 99% of use cases, this simple lexer will suffice and we can now build a "usually works but not perfect" solution like the following:
function getEmails(input)
{
var emailRegex = /([^@]+)\@(.+)/;
var emailParts = input.match(emailRegex);
var name = emailParts[1];
var domain = emailParts[2];
var nameList;
if (/\{.+\}/.test(name))
{
// The name takes the form "{...}"
var nameRegex = /([^,]+,?)/g;
var nameParts = name.match(nameRegex);
nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
}
else
{
// The name is not surrounded by curly braces
nameList = [name];
}
return nameList.map(name => `${name}@${domain}`);
}
function splitLine(line)
{
var inBrackets = false;
var split = [];
var lastSplit = 0;
for (var i = 0; i < line.length; i++)
{
if (inBrackets && line[i] === "}")
inBrackets = false;
if (!inBrackets && line[i] === "{")
inBrackets = true;
if (!inBrackets && line[i] === ",")
{
split.push(line.substring(lastSplit, i));
lastSplit = i + 1;
}
}
split.push(line.substring(lastSplit));
return split;
}
var line = "{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com";
var emails = splitLine(line);
var finalList = [];
for (var i = 0; i < emails.length; i++)
{
finalList = finalList.concat(getEmails(emails[i]));
}
console.log(finalList);
// Outputs: [ "a.b@uni.somewhere", "c.d@uni.somewhere", "e.f@uni.somewhere", "x.y@edu.com", "z.k@edu.com" ]
If you want to try and implement the full lexer / tokenizer solution, you can look at the simple / dumb lexer I built as a starting point. The general idea is that you have a state machine (in my case I only had two states: inBrackets and !inBrackets) and you read one letter at a time but interpret it differently based on your current state.
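Since the original question is tagged Python, here is a possible Python translation of the bracket-aware splitter plus the simple expansion described above (my sketch; it shares the same limitations, e.g. quoted local parts like ","@domain.com are still not handled):
import re

def split_line(line):
    # Split a line on commas that are not inside curly braces.
    parts, last, in_braces = [], 0, False
    for i, ch in enumerate(line):
        if ch == '}':
            in_braces = False
        elif ch == '{':
            in_braces = True
        elif ch == ',' and not in_braces:
            parts.append(line[last:i])
            last = i + 1
    parts.append(line[last:])
    return [p.strip() for p in parts]

def expand(entry):
    # Expand '{a, b}@domain' into ['a@domain', 'b@domain']; pass plain emails through.
    name, _, domain = entry.partition('@')
    m = re.fullmatch(r'\{(.+)\}', name.strip())
    if not m:
        return [entry]
    return ['{}@{}'.format(n.strip(), domain) for n in m.group(1).split(',')]

line = '{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere'
print([email for part in split_line(line) for email in expand(part)])
# ['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere',
#  'x.y@edu.com', 'z.k@edu.com', 'plain_old_bob@uni.elsewhere']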
a quick solution using re:
test with one text line:
import re
line = '{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, {z.z, z.a}@edu.com'
com = re.findall(r'(@[^,\n]+),?', line)    #trap @xx.yyy
adrs = re.findall(r'{([^}]+)}', line)      #trap all inside { }
result = []
for i in range(len(adrs)):
    s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
    result = result + s.split(',')
for r in result:
    print(r)
output in list result:
a.b@uni.somewhere
c.d@uni.somewhere
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com
z.z@edu.com
z.a@edu.com
test with a text file:
import io
data = io.StringIO(u'''\
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, {z.z, z.a}@edu.com
{a.b, c.d, e.f}@uni.anywhere
{x.y, z.k}@adi.com, {z.z, z.a}@du.com
''')
result = []
import re
for line in data:
    com = re.findall(r'(@[^,\n]+),?', line)
    adrs = re.findall(r'{([^}]+)}', line)
    for i in range(len(adrs)):
        s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
        result = result + s.split(',')
for r in result:
    print(r)
output in list result:
a.b@uni.somewhere
c.d@uni.somewhere
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com
z.z@edu.com
z.a@edu.com
a.b@uni.anywhere
c.d@uni.anywhere
e.f@uni.anywhere
x.y@adi.com
z.k@adi.com
z.z@du.com
z.a@du.com

Iterating through capture fields in a Rust regex

I'm playing around with the frowns parser available from http://frowns.sourceforge.net, a parser that tokenizes SMILES standard chemical formula strings. Specifically I'm trying to port it to Rust.
The original regex for an "atom" token in the parser looks like this (Python):
element_symbols_pattern = \
r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|" \
r"H[eofgas]?|c|n|o|s|p|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|" \
r"G[aed]|I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|" \
r"U|V|W|Xe|Yb?|Z[nr]|\*"
atom_fields = [
"raw_atom",
"open_bracket",
"weight",
"element",
"chiral_count",
"chiral_named",
"chiral_symbols",
"hcount",
"positive_count",
"positive_symbols",
"negative_count",
"negative_symbols",
"error_1",
"error_2",
"close_bracket",
"error_3",
]
atom = re.compile(r"""
(?P<raw_atom>Cl|Br|[cnospBCNOFPSI]) | # "raw" means outside of brackets
(
(?P<open_bracket>\[) # Start bracket
(?P<weight>\d+)? # Atomic weight (optional)
( # valid term or error
( # valid term
(?P<element>""" + element_symbols_pattern + r""") # element or aromatic
( # Chirality can be
(?P<chiral_count>@\d+) | # @1 @2 @3 ...
(?P<chiral_named> # or
@TH[12] | # @TA1 @TA2
@AL[12] | # @AL1 @AL2
@SP[123] | # @SP1 @SP2 @SP3
@TB(1[0-9]?|20?|[3-9]) | # @TB{1-20}
@OH(1[0-9]?|2[0-9]?|30?|[4-9])) | # @OH{1-30}
(?P<chiral_symbols>@+) # or @@@@@@@...
)? # and chirality is optional
(?P<hcount>H\d*)? # Optional hydrogen count
( # Charges can be
(?P<positive_count>\+\d+) | # +<number>
(?P<positive_symbols>\++) | # +++... This includes the single '+'
(?P<negative_count>-\d+) | # -<number>
(?P<negative_symbols>-+) # ---... including a single '-'
)? # and are optional
(?P<error_1>[^\]]+)? # If there's anything left, it's an error
) | ( # End of parsing stuff in []s, except
(?P<error_2>[^\]]*) # If there was an error, we get here
))
((?P<close_bracket>\])| # End bracket
(?P<error_3>$)) # unexpectedly reached end of string
)
""", re.X)
The field list is used to improve the reportability of the regex parser, as well as track parsing errors.
I wrote something that compiles and parses tokens without brackets properly, but something about the inclusion of brackets (such as [S] instead of S) breaks it. So I've narrowed it down with comments:
extern crate regex;
use regex::Regex;
fn main() {
let atom_fields: Vec<&'static str> = vec![
"raw_atom",
"open_bracket",
"weight",
"element",
"chiral_count",
"chiral_named",
"chiral_symbols",
"hcount",
"positive_count",
"positive_symbols",
"negative_count",
"negative_symbols",
"error_1",
"error_2",
"close_bracket",
"error_3"
];
const EL_SYMBOLS: &'static str = r#"(?P<element>S?|\*")"#;
let atom_re_str: &String = &String::from(vec![
// r"(?P<raw_atom>Cl|Br|[cnospBCNOFPSI])|", // "raw" means outside of brackets
r"(",
r"(?P<open_bracket>\[)", // Start bracket
// r"(?P<weight>\d+)?", // Atomic weight (optional)
r"(", // valid term or error
r"(", // valid term
&EL_SYMBOLS, // element or aromatic
// r"(", // Chirality can be
// r"(?P<chiral_count>#\d+)|", // #1 #2 #3 ...
// r"(?P<chiral_named>", // or
// r"#TH[12]|", // #TA1 #TA2
// r"#AL[12]|", // #AL1 #AL2
// r"#SP[123]|", // #SP1 #SP2 #SP3
// r"#TB(1[0-9]?|20?|[3-9])|", // #TB{1-20}
// r"#OH(1[0-9]?|2[0-9]?|30?|[4-9]))|", // #OH{1-30}
// r"(?P<chiral_symbols>#+)", // or ####....,
// r")?", // and chirality is optional
// r"(?P<hcount>H\d*)?", // Optional hydrogen count
// r"(", // Charges can be
// r"(?P<positive_count>\+\d+)|", // +<number>
// r"(?P<positive_symbols>\++)|", // +++...including a single '+'
// r"(?P<negative_count>-\d+)|", // -<number>
// r"(?P<negative_symbols>-+)", // ---... including a single '-'
// r")?", // and are optional
// r"(?P<error_1>[^\]]+)?", // anything left is an error
r")", // End of stuff in []s, except
r"|((?P<error_2>[^\]]*)", // If other error, we get here
r"))",
r"((?P<close_bracket>\])|", // End bracket
r"(?P<error_3>$)))"].join("")); // unexpected end of string
println!("generated regex: {}", &atom_re_str);
let atom_re = Regex::new(&atom_re_str).unwrap();
for cur_char in "[S]".chars() {
let cur_string = cur_char.to_string();
println!("cur string: {}", &cur_string);
let captures = atom_re.captures(&cur_string.as_str()).unwrap();
// if captures.name("atom").is_some() {
// for cur_field in &atom_fields {
// let field_capture = captures.name(cur_field);
// if cur_field.contains("error") {
// if *cur_field == "error_3" {
// // TODO replace me with a real error
// println!("current char: {:?}", &cur_char);
// panic!("Missing a close bracket (]). Looks like: {}.",
// field_capture.unwrap());
// } else {
// panic!("I don't recognize the character. Looks like: {}.",
// field_capture.unwrap());
// }
// } else {
// println!("ok! matched {:?}", &cur_char);
// }
// }
// }
}
}
You can see that the generated Rust regex works in Debuggex:
((?P<open_bracket>\[)(((?P<element>S?|\*"))|((?P<error_2>[^\]]*)))((?P<close_bracket>\])|(?P<error_3>$)))
(http://debuggex.com/r/7j75Y2F1ph1v9jfL)
If you run the example (https://gitlab.com/araster/frowns_regex), you'll see that the open bracket parses correctly, but the .captures().unwrap() dies on the next character 'S'. If I use the complete expression I can parse all kinds of things from the frowns test file, as long as they don't have brackets.
What am I doing wrong?
You are iterating on each character of your input string and trying to match the regex on a string composed of a single character. However, this regex is not designed to match individual characters. Indeed, the regex will match [S] as a whole.
If you want to be able to find multiple matches in a single string, use captures_iter instead of captures to iterate on all matches and their respective captures (each match will be a formula, the regex will skip text that doesn't match a formula).
for captures in atom_re.captures_iter("[S]") {
// check the captures of each match
}
If you only want to find the first match in a string, then use captures on the whole string, rather than on each individual character.

Increase C++ regex replace performance

I'm a beginner C++ programmer working on a small C++ project for which I have to process a number of relatively large XML files and remove the XML tags out of them. I've succeeded doing so using the C++0x regex library. However, I'm running into some performance issues. Just reading in the files and executing the regex_replace function over its contents takes around 6 seconds on my PC. I can bring this down to 2 by adding some compiler optimization flags. Using Python, however, I can get it done it less than 100 milliseconds. Obviously, I'm doing something very inefficient in my C++ code. What can I do to speed this up a bit?
My C++ code:
std::regex xml_tags_regex("<[^>]*>");
for (std::vector<std::string>::iterator it = _files.begin(); it != _files.end(); it++) {
    std::ifstream file(*it);
    file.seekg(0, std::ios::end);
    size_t size = file.tellg();
    std::string buffer(size, ' ');
    file.seekg(0);
    file.read(&buffer[0], size);
    buffer = regex_replace(buffer, xml_tags_regex, "");
    file.close();
}
My Python code:
import re

regex = re.compile('<[^>]*>')
for filename in filenames:
    with open(filename) as f:
        content = f.read()
        content = regex.sub('', content)
P.S. I don't really care about processing the complete file at once. I just found that reading a file line by line, word by word or character by character slowed it down considerably.
C++11 regex replace is indeed rather slow, as of yet, at least. PCRE performs much better in terms of pattern matching speed, however, PCRECPP provides very limited means for regular expression based substitution, citing the man page:
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert text matching corresponding parenthesized group from the
pattern. \0 in "rewrite" refers to the entire matching text.
This is really poor, compared to Perl's 's' command. That is why I wrote my own C++ wrapper around PCRE that handles regular expression based substitution in a fashion that is close to Perl's 's', and also supports 16- and 32-bit character strings: PCRSCPP:
Command string syntax
Command syntax follows Perl s/pattern/substitute/[options]
convention. Any character (except the backslash \) can be used as a
delimiter, not just /, but make sure that delimiter is escaped with
a backslash (\) if used in pattern, substitute or options
substrings, e.g.:
s/\\/\//g to replace all backslashes with forward ones
Remember to double backslashes in C++ code, unless using raw string
literal (see string literal):
pcrscpp::replace rx("s/\\\\/\\//g");
Pattern string syntax
Pattern string is passed directly to pcre*_compile, and thus has to
follow PCRE syntax as described in PCRE documentation.
Substitute string syntax
Substitute string backreferencing syntax is similar to Perl's:
$1 ... $n: nth capturing subpattern matched.
$& and $0: the whole match
${label} : labled subpattern matched. label is up to 32 alphanumerical +
underscore characters ('A'-'Z','a'-'z','0'-'9','_'),
first character must be alphabetical
$` and $' (backtick and tick) refer to the areas of the subject before
and after the match, respectively. As in Perl, the unmodified
subject is used, even if a global substitution previously matched.
Also, following escape sequences get recognized:
\n: newline
\r: carriage return
\t: horizontal tab
\f: form feed
\b: backspace
\a: alarm, bell
\e: escape
\0: binary zero
Any other escape sequence \<char>, is interpreted as <char>,
meaning that you have to escape backslashes too
Options string syntax
In Perl-like manner, options string is a sequence of allowed modifier
letters. PCRSCPP recognizes following modifiers:
Perl-compatible flags
g: global replace, not just the first match
i: case insensitive match
(PCRE_CASELESS)
m: multi-line mode: ^ and $ additionally match positions
after and before newlines, respectively
(PCRE_MULTILINE)
s: let the scope of the . metacharacter include newlines
(treat newlines as ordinary characters)
(PCRE_DOTALL)
x: allow extended regular expression syntax,
enabling whitespace and comments in complex patterns
(PCRE_EXTENDED)
PHP-compatible flags
A: "anchor" pattern: look only for "anchored" matches: ones that
start with zero offset. In single-line mode is identical to
prefixing all pattern alternative branches with ^
(PCRE_ANCHORED)
D: treat dollar $ as subject end assertion only, overriding the default:
end, or immediately before a newline at the end.
Ignored in multi-line mode
(PCRE_DOLLAR_ENDONLY)
U: invert * and + greediness logic: make ungreedy by default,
? switches back to greedy. (?U) and (?-U) in-pattern switches
remain unaffected
(PCRE_UNGREEDY)
u: Unicode mode. Treat pattern and subject as UTF8/UTF16/UTF32 string.
Unlike in PHP, also affects newlines, \R, \d, \w, etc. matching
((PCRE_UTF8/PCRE_UTF16/PCRE_UTF32) | PCRE_NEWLINE_ANY
| PCRE_BSR_UNICODE | PCRE_UCP)
PCRSCPP own flags:
N: skip empty matches
(PCRE_NOTEMPTY)
T: treat substitute as a trivial string, i.e., make no backreference
and escape sequences interpretation
n: discard non-matching portions of the string to replace
Note: PCRSCPP does not automatically add newlines,
the replacement result is plain concatenation of matches,
be specifically aware of this in multiline mode
I wrote a simple speed test code, which stores a 10x copy of file "move.sh" and tests regex performance on resulting string:
#include <pcrscpp.h>
#include <string>
#include <iostream>
#include <fstream>
#include <regex>
#include <chrono>

int main (int argc, char *argv[]) {
    const std::string file_name("move.sh");
    pcrscpp::replace pcrscpp_rx(R"del(s/(?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n)/$1\n$2\n/Dgn)del");
    std::regex std_rx (R"del((?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n))del");

    std::ifstream file (file_name);
    if (!file.is_open ()) {
        std::cerr << "Unable to open file " << file_name << std::endl;
        return 1;
    }
    std::string buffer;
    {
        file.seekg(0, std::ios::end);
        size_t size = file.tellg();
        file.seekg(0);
        if (size > 0) {
            buffer.resize(size);
            file.read(&buffer[0], size);
            buffer.resize(size - 1); // strip '\0'
        }
    }
    file.close();

    std::string bigstring;
    bigstring.reserve(10*buffer.size());
    for (std::string::size_type i = 0; i < 10; i++)
        bigstring.append(buffer);

    int n = 10;
    std::cout << "Running tests " << n << " times: be patient..." << std::endl;

    std::chrono::high_resolution_clock::duration std_regex_duration, pcrscpp_duration;
    std::chrono::high_resolution_clock::time_point t1, t2;

    std::string result1, result2;
    for (int i = 0; i < n; i++) {
        // clear result
        std::string().swap(result1);
        t1 = std::chrono::high_resolution_clock::now();
        result1 = std::regex_replace (bigstring, std_rx, "$1\\n$2", std::regex_constants::format_no_copy);
        t2 = std::chrono::high_resolution_clock::now();
        std_regex_duration = (std_regex_duration*i + (t2 - t1)) / (i + 1);

        // clear result
        std::string().swap(result2);
        t1 = std::chrono::high_resolution_clock::now();
        result2 = pcrscpp_rx.replace_copy (bigstring);
        t2 = std::chrono::high_resolution_clock::now();
        pcrscpp_duration = (pcrscpp_duration*i + (t2 - t1)) / (i + 1);
    }

    std::cout << "Time taken by std::regex_replace: "
              << std_regex_duration.count()
              << " ms" << std::endl
              << "Result size: " << result1.size() << std::endl;
    std::cout << "Time taken by pcrscpp::replace: "
              << pcrscpp_duration.count()
              << " ms" << std::endl
              << "Result size: " << result2.size() << std::endl;
    return 0;
}
(note that std and pcrscpp regular expressions do the same here, the trailing newline in expression for pcrscpp is due to std::regex_replace not stripping newlines despite std::regex_constants::format_no_copy)
and launched it on a large (20.9 MB) shell move script:
Running tests 10 times: be patient...
Time taken by std::regex_replace: 12090771487 ms
Result size: 101087330
Time taken by pcrscpp::replace: 5910315642 ms
Result size: 101087330
As you can see, PCRSCPP is more than 2x faster. And I expect this gap to grow with pattern complexity increase, since PCRE deals with complicated patterns much better. I originally wrote a wrapper for myself, but I think it can be useful for others too.
Regards,
Alex
I don't think you're doing anything "wrong" per-say, the C++ regex library just isn't as fast as the python one (for this use case at this time at least). This isn't too surprising, keeping in mind the python regex code is all C/C++ under the hood as well, and has been tuned over the years to be pretty fast as that's a fairly important feature in python, so naturally it is going to be pretty fast.
But there are other options in C++ for getting things faster if you need. I've used PCRE ( http://pcre.org/ ) in the past with great results, though I'm sure there are other good ones out there these days as well.
For this case in particular however, you can also achieve what you're after without regexes, which in my quick tests yielded a 10x performance improvement. For example, the following code scans your input string copying everything to a new buffer, when it hits a < it starts skipping over characters until it sees the closing >
std::string buffer(size, ' ');
std::string outbuffer(size, ' ');
... read in buffer from your file
size_t outbuffer_len = 0;
for (size_t i = 0; i < buffer.size(); ++i) {
    if (buffer[i] == '<') {
        // skip ahead to the closing '>' (bounds check first, to avoid reading past the end)
        while (i < buffer.size() && buffer[i] != '>') {
            ++i;
        }
    } else {
        outbuffer[outbuffer_len] = buffer[i];
        ++outbuffer_len;
    }
}
outbuffer.resize(outbuffer_len);

Why doesn't Python "grouping" work for regular expressions in C?

Here is my Python program:
import re
print re.findall( "([se]{2,30})ting", "testingtested" )
Its output is:
['es']
Which is what I expect. I expect to get back "es" because I searched for 2-30 characters of "e" or "s" which are followed by "ting".
Here is my C program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

int main(void) {
    regex_t preg;
    regmatch_t pmatch;
    char string[] = "testingtested";

    //Compile the regular expression
    if ( regcomp( &preg, "([se]{2,30})ting", REG_EXTENDED ) ) {
        printf( "ERROR!\n" );
        return -1;
    } else {
        printf( "Compiled\n" );
    }

    //Do the search
    if ( regexec( &preg, string, 1, &pmatch, REG_NOTEOL ) ) {
        printf( "No Match\n" );
    } else {
        //Allocate memory on the stack for this
        char substring[pmatch.rm_eo - pmatch.rm_so + 1];
        //Copy the substring over
        printf( "%d %d\n", pmatch.rm_so, pmatch.rm_eo );
        strncpy( substring, &string[pmatch.rm_so], pmatch.rm_eo - pmatch.rm_so );
        //Make sure there's a null byte
        substring[pmatch.rm_eo - pmatch.rm_so] = 0;
        //Print it out
        printf( "Match\n" );
        printf( "\"%s\"\n", substring );
    }

    //Release the regular expression
    regfree( &preg );
    return EXIT_SUCCESS;
}
Its output is:
Compiled
1 7
Match
"esting"
Why is the C program including the "ting" in the result? And is there a way for me to exclude the "ting" portion?
pmatch is the whole match, not the first parenthesized subexpression.
Try changing pmatch to an array of 2 elements, then passing 2 in place of 1 to regexec and using the [1] element to get the subexpression match.
To others who have cited differences between C and Python and different types of regular expressions, that's all unrelated. This expression is very simple and that's not coming into play.
While regular expressions are "more or less the same everywhere", the exact supported features differ from implementation to implementation.
Unfortunately, you need to consult each regex library's documentation separately when designing your regular expressions.

how to get the function declaration or definitions using regex

I want to get only function prototypes like
int my_func(char, int, float)
void my_func1(void)
my_func2()
from C files using regex and python.
Here is my regex format: ".*\(.*|[\r\n]\)\n"
This is a convenient script I wrote for such tasks, but it won't give the function types. It's only for function names and the argument list.
# Extract routine signatures from a C++ module
import re

def loadtxt(filename):
    "Load text file into a string. I let FILE exceptions pass."
    f = open(filename)
    txt = ''.join(f.readlines())
    f.close()
    return txt

# regex group1, name group2, arguments group3
rproc = r"((?<=[\s:~])(\w+)\s*\(([\w\s,<>\[\].=&':/*]*?)\)\s*(const)?\s*(?={))"

code = loadtxt('your file name here')
cppwords = ['if', 'while', 'do', 'for', 'switch']
procs = [(i.group(2), i.group(3)) for i in re.finditer(rproc, code) \
         if i.group(2) not in cppwords]
for i in procs:
    print(i[0] + '(' + i[1] + ')')
See if your C compiler has an option to output a file of just the prototypes of what it is compiling. For gcc, it's -aux-info FILENAME
I think a regex isn't the best solution in your case. There are many traps like comments, text in strings, etc., but if your function prototypes share a common style:
type fun_name(args);
then \w+ \w+\(.*\); should work in most cases:
mn> egrep "\w+ \w+\(.*\);" *.h
md5.h:extern bool md5_hash(const void *buff, size_t len, char *hexsum);
md5file.h:int check_md5files(const char *filewithsums, const char *filemd5sum);
I think this one should do the work:
r"^\s*[\w_][\w\d_]*\s*.*\s*[\w_][\w\d_]*\s*\(.*\)\s*$"
which will be expanded into:
string begin:
^
any number of whitespaces (including none):
\s*
return type:
- start with letter or _:
[\w_]
- continue with any letter, digit or _:
[\w\d_]*
any number of whitespaces:
\s*
any number of any characters
(for allow pointers, arrays and so on,
could be replaced with more detailed checking):
.*
any number of whitespaces:
\s*
function name:
- start with letter or _:
[\w_]
- continue with any letter, digit or _:
[\w\d_]*
any number of whitespaces:
\s*
open arguments list:
\(
arguments (allow none):
.*
close arguments list:
\)
any number of whitespaces:
\s*
string end:
$
It's not totally correct for matching all possible combinations, but it should work in most cases. If you want it to be more accurate, just let me know.
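As a quick sanity check, the pattern above can be exercised from Python with re.MULTILINE against a small sample (my sketch; it will still be fooled by comments, macros, multi-line declarations and statements like if (...)):
import re

proto_re = re.compile(r'^\s*[\w_][\w\d_]*\s*.*\s*[\w_][\w\d_]*\s*\(.*\)\s*$', re.MULTILINE)

sample = '''\
int my_func(char, int, float)
void my_func1(void)
my_func2()
return foo;
'''
for match in proto_re.findall(sample):
    print(match.strip())
# int my_func(char, int, float)
# void my_func1(void)
# my_func2()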
EDIT:
Disclaimer - I'm quite new to both Python and Regex, so please be indulgent ;)
There are LOTS of pitfalls trying to "parse" C code (or extract some information at least) with just regular expressions; I would definitely go for a C grammar with your favourite parser generator (say Bison, or whatever alternative there is for Python; there are C grammars as examples everywhere) and add the actions in the corresponding rules.
Also, do not forget to run the C preprocessor on the file before parsing.
I built on Nick Dandoulakis's answer for a similar use case. I wanted to find the definition of the socket function in glibc. This finds a bunch of functions with "socket" in the name but socket was not found, highlighting what many others have said: there are probably better ways to extract this information, like tools provided by compilers.
# find_functions.py
#
# Extract routine signatures from a C++ module
import re
import sys

def loadtxt(filename):
    # Load text file into a string. Ignore FILE exceptions.
    f = open(filename)
    txt = ''.join(f.readlines())
    f.close()
    return txt

# regex group1, name group2, arguments group3
rproc = r"((?<=[\s:~])(\w+)\s*\(([\w\s,<>\[\].=&':/*]*?)\)\s*(const)?\s*(?={))"

file = sys.argv[1]
code = loadtxt(file)
cppwords = ['if', 'while', 'do', 'for', 'switch']
procs = [(i.group(1)) for i in re.finditer(rproc, code) \
         if i.group(2) not in cppwords]
for i in procs:
    print(file + ": " + i)
Then
$ cd glibc
$ find . -name "*.c" -print0 | xargs -0 -n 1 python find_functions.py | grep ':.*socket'
./hurd/hurdsock.c: _hurd_socket_server (int domain, int dead)
./manual/examples/mkfsock.c: make_named_socket (const char *filename)
./manual/examples/mkisock.c: make_socket (uint16_t port)
./nscd/connections.c: close_sockets (void)
./nscd/nscd.c: nscd_open_socket (void)
./nscd/nscd_helper.c: wait_on_socket (int sock, long int usectmo)
./nscd/nscd_helper.c: open_socket (request_type type, const char *key, size_t keylen)
./nscd/nscd_helper.c: __nscd_open_socket (const char *key, size_t keylen, request_type type,
./socket/socket.c: __socket (int domain, int type, int protocol)
./socket/socketpair.c: socketpair (int domain, int type, int protocol, int fds[2])
./sunrpc/key_call.c: key_call_socket (u_long proc, xdrproc_t xdr_arg, char *arg,
./sunrpc/pm_getport.c: __get_socket (struct sockaddr_in *saddr)
./sysdeps/mach/hurd/socket.c: __socket (int domain, int type, int protocol)
./sysdeps/mach/hurd/socketpair.c: __socketpair (int domain, int type, int protocol, int fds[2])
./sysdeps/unix/sysv/linux/socket.c: __socket (int fd, int type, int domain)
./sysdeps/unix/sysv/linux/socketpair.c: __socketpair (int domain, int type, int protocol, int sv[2])
In my case, this and this might help me, except it seems like I will need to read assembly code to reuse the strategy described there.
The regular expression below also considers the definition of destructors or const functions:
^\s*\~{0,1}[\w_][\w\d_]*\s*.*\s*[\w_][\w\d_]*\s*\(.*\)\s*(const){0,1}$
