Python regex to pull FQDN from syslog server - python

I'm trying to build a regex to parse our syslogs. I was asked to account for each server that uses the service. I wrote a simple regex to pull out the FQDN, but it seems to be consuming too much of the line...
>>> string = "2010-12-13T00:00:02-05:00 <local3.info> suba1.suba2.example.com named[29959]: client 192.168.11.53#54608: query: subb1.subb2.example.com"
>>> regex = re.compile("\s.*?\.example\.com ")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0x896dae0bbf9e6bf0>
# Run findall
>>> regex.findall(string)
[u' <local3.info> suba1.suba2.example.com ', u' client 192.168.11.53#54608: query: subb1.subb2.example.com ']
As you can see the findall with .* is too generic and the regex ends up consuming to much.

Replacing \s with \b and the .*? with \S will do it.
>>> regex = re.compile(r'\b\S*\.example\.com')
>>> regex.findall(string)
[u'suba1.suba2.example.com', u'subb1.subb2.example.com']

The regex
r"query: ([\w\.]+)"
would grab the end from [...] query on and then you can use a unnamed group look-up to give you just the domain name.
If this isn't the output you need, can you elaborate on the desired output (as a data structure. I took a guess for this).
The python code might look like:
match = re.search(r"query: ([\w.]+)", string, re.IGNORECASE | re.MULTILINE)
if match:
result = match.group(1)
else:
result = ""
result would contain
subb1.subb2.example.com

Try using:
regex = re.compile("\s\S*?\.example\.com ")

#!/usr/bin/env python
import re
s = """2010-12-13T00:00:02-05:00 <local3.info>
suba1.suba2.example.com named[29959]:
client 192.168.11.53#54608: query: subb1.subb2.example.com"""
pattern = re.compile("[\S.]+.example.com")
print pattern.findall(s)
# => ['suba1.suba2.example.com', 'subb1.subb2.example.com']

Related

Using regex with python re.match

I am new to python and have a question about using regex on strings. Currently I have:
def find_ips(ip):
ip_str = '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
p = re.compile(ip_str)
m = p.match(ip)
if m:
print 'match found'
else:
print 'no match'
global find_addr
find_addr = p.match(ip)
return find_addr
find_ips('this is an ip 127.0.0.1 10.0.10.5')
print find_addr
This returns 'no match'. I'm not seeing what i'm missing so far. I am trying to extract the ip addresses out of this string, but first I have to find them. Using a regex editor I can use that same line to discover those IPs. Any help is appreciated.
re.match only finds a match if it is at the beginning of the string. re.search will look in the entire string for a match.
Also, it's usually a good idea to use raw strings when making regex:
ip_str = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
# ^
On a slightly unrelated note:
find_ips('this is an ip 127.0.0.1 10.0.10.5')
print find_addr
is a bit kludgy. Making use of the return value in the caller is much better than doing funky stuff with globals:
print find_ips('...')
re.match() matches from the beginning of the string, I would use re.findall() here if you want to match all. Also it's good practice to use raw string notation with your pattern.
>>> import re
>>> def find_ips(str):
... m = re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', str)
... return ', '.join(m)
...
>>> print find_ips('this is an ip 127.0.0.1 10.0.10.5')
127.0.0.1, 10.0.10.5
from re import findall
# The string to be checked.
string = 'this is a string 126.32.13.1 with ips in 132.31.3.1 it'
# Print the matches of the regex in the string.
print findall('\d+\.\d+\.\d+\.\d+', string)
# Output
# ['126.32.13.1', '132.31.3.1']

Regex to retrieve the last few characters of a string

Regex to retrieve the last portion of a string:
https://play.google.com/store/apps/details?id=com.lima.doodlejump
I'm looking to retrieve the string followed by id=
The following regex didn't seem to work in python
sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
re.search("id=(.*?)", sampleURL).group(1)
The above should give me an output:
com.lima.doodlejump
Is my search group right?
Your regular expression
(.*?)
will not work because, it will match between zero and unlimited times, as few times as possible (becasue of the ?). So, you have the following choices of RegEx
(.*) # Matches the rest of the string
(.*?)$ # Matches till the end of the string
But, you don't need RegEx at all here, simply split the string like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
print data.split("id=", 1)[-1]
Output
com.lima.doodlejump
If you really have to use RegEx, you can do like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
import re
print re.search("id=(.*)", data).group(1)
Output
com.lima.doodlejump
I'm surprised that nobody has mentioned urlparse yet...
>>> s = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> urlparse.urlparse(s)
ParseResult(scheme='https', netloc='play.google.com', path='/store/apps/details', params='', query='id=com.lima.doodlejump', fragment='')
>>> urlparse.parse_qs(urlparse.urlparse(s).query)
{'id': ['com.lima.doodlejump']}
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id']
['com.lima.doodlejump']
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id'][0]
'com.lima.doodlejump'
The HUGE advantage here is that if the url query string gets more components then it could easily break the other solutions which rely on a simple str.split. It won't confuse urlparse however :).
Just split it in the place you want:
id = url.split('id=')[1]
If you print id, you'll get:
com.lima.doodlejump
Regex isn't needed here :)
However, in case there are multiple id=s in your string, and you only wanted the last one:
id = url.split('id=')[-1]
Hope this helps!
This works:
>>> import re
>>> sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> re.search("id=(.+)", sampleURL).group(1)
'com.lima.doodlejump'
>>>
Instead of capturing non-greedily for zero or more characters, this code captures greedily for one or more.

Python regular expression not matching

This is one of those things where I'm sure I'm missing something simple, but... In the sample program below, I'm trying to use Python's RE library to parse the string "line" to get the floating-point number just before the percent sign, i.e. "90.31". But the code always prints "no match".
I've tried a couple other regular expressions as well, all with the same result. What am I missing?
#!/usr/bin/python
import re
line = ' 0 repaired, 90.31% done'
pct_re = re.compile(' (\d+\.\d+)% done$')
#pct_re = re.compile(', (.+)% done$')
#pct_re = re.compile(' (\d+.*)% done$')
match = pct_re.match(line)
if match: print 'got match, pct=' + match.group(1)
else: print 'no match'
match only matches from the beginning of the string. Your code works fine if you do pct_re.search(line) instead.
You should use re.findall instead:
>>> line = ' 0 repaired, 90.31% done'
>>>
>>> pattern = re.compile("\d+[.]\d+(?=%)")
>>> re.findall(pattern, line)
['90.31']
re.match will match at the start of the string. So you would need to build the regex for complete string.
try this if you really want to use match:
re.match(r'.*(\d+\.\d+)% done$', line)
r'...' is a "raw" string ignoring some escape sequences, which is a good practice to use with regexp in python. – kratenko (see comment below)

How do you extract a url from a string using python?

For example:
string = "This is a link http://www.google.com"
How could I extract 'http://www.google.com' ?
(Each link will be of the same format i.e 'http://')
There may be few ways to do this but the cleanest would be to use regex
>>> myString = "This is a link http://www.google.com"
>>> print re.search("(?P<url>https?://[^\s]+)", myString).group("url")
http://www.google.com
If there can be multiple links you can use something similar to below
>>> myString = "These are the links http://www.google.com and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print re.findall(r'(https?://[^\s]+)', myString)
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
>>>
There is another way how to extract URLs from text easily. You can use urlextract to do it for you, just install it via pip:
pip install urlextract
and then you can use it like this:
from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
print(urls) # prints: ['stackoverflow.com']
You can find more info on my github page: https://github.com/lipoja/URLExtract
NOTE: It downloads a list of TLDs from iana.org to keep you up to date. But if the program does not have internet access then it's not for you.
In order to find a web URL in a generic string, you can use a regular expression (regex).
A simple regex for URL matching like the following should fit your case.
regex = r'('
# Scheme (HTTP, HTTPS, FTP and SFTP):
regex += r'(?:(https?|s?ftp):\/\/)?'
# www:
regex += r'(?:www\.)?'
regex += r'('
# Host and domain (including ccSLD):
regex += r'(?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)'
# TLD:
regex += r'([A-Z]{2,6})'
# IP Address:
regex += r'|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
regex += r')'
# Port:
regex += r'(?::(\d{1,5}))?'
# Query path:
regex += r'(?:(\/\S+)*)'
regex += r')'
If you want to be even more precise, in the TLD section, you should ensure that the TLD is a valid TLD (see the entire list of valid TLDs here: https://data.iana.org/TLD/tlds-alpha-by-domain.txt):
# TLD:
regex += r'(com|net|org|eu|...)'
Then, you can simply compile the former regex and use it to find possible matches:
import re
string = "This is a link http://www.google.com"
find_urls_in_string = re.compile(regex, re.IGNORECASE)
url = find_urls_in_string.search(string)
if url is not None and url.group(0) is not None:
print("URL parts: " + str(url.groups()))
print("URL" + url.group(0).strip())
Which, in case of the string "This is a link http://www.google.com" will output:
URL parts: ('http://www.google.com', 'http', 'google.com', 'com', None, None)
URL: http://www.google.com
If you change the input with a more complex URL, for example "This is also a URL https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo but this is not anymore" the output will be:
URL parts: ('https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo', 'https', 'host.domain.com', 'com', '80', '/path/page.php?query=value&a2=v2#foo')
URL: https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo
NOTE: If you are looking for more URLs in a single string, you can still use the same regex, but just use findall() instead of search().
This extracts all urls with parameters, somehow all above examples haven't worked for me
import re
data = 'https://net2333.us3.list-some.com/subscribe/confirm?u=f3cca8a1ffdee924a6a413ae9&id=6c03fa85f8&e=6bbacccc5b'
WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))"""
re.findall(WEB_URL_REGEX, text)
You can extract any URL from a string using the following patterns,
1.
>>> import re
>>> string = "This is a link http://www.google.com"
>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?'
>>> re.search(pattern, string)
http://www.google.com
>>> TWEET = ('New Pybites article: Module of the Week - Requests-cache '
'for Repeated API Calls - http://pybit.es/requests-cache.html '
'#python #APIs')
>>> re.search(pattern, TWEET)
http://pybit.es/requests-cache.html
>>> tweet = ('Pybites My Reading List | 12 Rules for Life - #books '
'that expand the mind! '
'http://pbreadinglist.herokuapp.com/books/'
'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter'
' #psychology #philosophy')
>>> re.findall(pattern, TWEET)
['http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter']
to take the above pattern to the next level, we can also detect hashtags including URL the following ways
2.
>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?|#[.\w]*'
>>> re.findall(pattern, tweet)
['#books', http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']
The above example for taking URL and hashtags can be shortened to
>>> pattern = r'((?:#|http)\S+)'
>>> re.findall(pattern, tweet)
['#books', http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']
The pattern below can matches two alphanumeric separated by "." as URL
>>> pattern = pattern = r'(?:http://)?\w+\.\S*[^.\s]'
>>> tweet = ('PyBites My Reading List | 12 Rules for Life - #books '
'that expand the mind! '
'www.google.com/telephone/wire.... '
'http://pbreadinglist.herokuapp.com/books/'
'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter '
"http://-www.pip.org "
"google.com "
"twitter.com "
"facebook.com"
' #psychology #philosophy')
>>> re.findall(pattern, tweet)
['www.google.com/telephone/wire', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', 'www.pip.org', 'google.com', 'twitter.com', 'facebook.com']
You can try any complicated URL with the number 1 & 2 pattern.
To learn more about re module in python, do check this out
REGEXES IN PYTHON by Real Python.
Cheers!
I've used a slight variation from #Abhijit's accepted answer.
This one uses \S instead of [^\s], which is equivalent but more concise. It also doesn't use a named group, because there is just one and we can ommit the name for simplicity reasons:
import re
my_string = "This is my tweet check it out http://example.com/blah"
print(re.search(r'(https?://\S+)', my_string).group())
Of course, if there are multiple links to extract, just use .findall():
print(re.findall(r'(https?://\S+)', my_string))

How can I get part of regex match as a variable in python?

In Perl it is possible to do something like this (I hope the syntax is right...):
$string =~ m/lalala(I want this part)lalala/;
$whatIWant = $1;
I want to do the same in Python and get the text inside the parenthesis in a string like $1.
If you want to get parts by name you can also do this:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
>>> m.groupdict()
{'first_name': 'Malcom', 'last_name': 'Reynolds'}
The example was taken from the re docs
See: Python regex match objects
>>> import re
>>> p = re.compile("lalala(I want this part)lalala")
>>> p.match("lalalaI want this partlalala").group(1)
'I want this part'
import re
astr = 'lalalabeeplalala'
match = re.search('lalala(.*)lalala', astr)
whatIWant = match.group(1) if match else None
print(whatIWant)
A small note: in Perl, when you write
$string =~ m/lalala(.*)lalala/;
the regexp can match anywhere in the string. The equivalent is accomplished with the re.search() function, not the re.match() function, which requires that the pattern match starting at the beginning of the string.
import re
data = "some input data"
m = re.search("some (input) data", data)
if m: # "if match was successful" / "if matched"
print m.group(1)
Check the docs for more.
there's no need for regex. think simple.
>>> "lalala(I want this part)lalala".split("lalala")
['', '(I want this part)', '']
>>> "lalala(I want this part)lalala".split("lalala")[1]
'(I want this part)'
>>>
import re
match = re.match('lalala(I want this part)lalala', 'lalalaI want this partlalala')
print match.group(1)
import re
string_to_check = "other_text...lalalaI want this partlalala...other_text"
p = re.compile("lalala(I want this part)lalala") # regex pattern
m = p.search(string_to_check) # use p.match if what you want is always at beginning of string
if m:
print m.group(1)
In trying to convert a Perl program to Python that parses function names out of modules, I ran into this problem, I received an error saying "group" was undefined. I soon realized that the exception was being thrown because p.match / p.search returns 0 if there is not a matching string.
Thus, the group operator cannot function on it. So, to avoid an exception, check if a match has been stored and then apply the group operator.
import re
filename = './file_to_parse.py'
p = re.compile('def (\w*)') # \w* greedily matches [a-zA-Z0-9_] character set
for each_line in open(filename,'r'):
m = p.match(each_line) # tries to match regex rule in p
if m:
m = m.group(1)
print m

Categories