Python - Display key values before and after Findall(regex, output)


I am trying to extract MAC addresses for each NIC from Dell's RACADM output such that my output should be like below:
NIC.Slot.2-2-1 --> 24:84:09:3E:2E:1B
I used the following to capture the output:
output = subprocess.check_output("sshpass -p {} ssh {}@{} racadm {}".format(args.password,args.username,args.hostname,args.command),shell=True).decode()
Part of the output:
https://pastebin.com/cz6LbcxU
Each component's details are displayed between ------ lines.
I want to search for Device Type = NIC and then print the Instance ID and Permanent MAC.
regex = r'Device Type = NIC'
match = re.findall(regex, output, flags=re.MULTILINE|re.DOTALL)
match = re.finditer(regex, output, flags=re.S)
I tried both of the functions above to find the match, but how do I print the [InstanceID: NIC.Slot.2-2-1] and the PermanentMACAddress of the record that matched the regex?
Can anyone help?

If I understood correctly,
you can search for the pattern [InstanceID: ...] to get the instance id,
and PermanentMACAddress = ... to get the MAC address.
Here's one way to do it:
import re
match_inst = re.search(r'\[InstanceID: (?P<inst>[^]]*)', output)
match_mac = re.search(r'PermanentMACAddress = (?P<mac>.*)', output)
inst = match_inst.groupdict()['inst']
mac = match_mac.groupdict()['mac']
print('{} --> {}'.format(inst, mac))
# prints: NIC.Slot.2-2-1 --> 24:84:09:3E:2E:1B
If you have multiple records like this and want to map NIC to MAC, you can get a list of each, zip them together to create a dictionary:
inst = re.findall(r'\[InstanceID: (?P<inst>[^]]*)', output)
mac = re.findall(r'PermanentMACAddress = (?P<mac>.*)', output)
mapping = dict(zip(inst, mac))
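If only the records with Device Type = NIC are wanted, one option is to split the output on the dashed separator lines first and inspect each record on its own. The following is a minimal sketch, not a tested RACADM parser: sample_output is invented from the field names quoted in the question (the PSU record and the second MAC are made up), and nic_macs is a hypothetical helper name.

```python
import re

# Hypothetical sample shaped after the fields quoted in the question;
# the real RACADM output may contain more fields per record.
sample_output = """\
-------------------------------------------------------------------
[InstanceID: NIC.Slot.2-2-1]
Device Type = NIC
PermanentMACAddress = 24:84:09:3E:2E:1B
-------------------------------------------------------------------
[InstanceID: PSU.Slot.1]
Device Type = PowerSupply
-------------------------------------------------------------------
[InstanceID: NIC.Slot.2-1-1]
Device Type = NIC
PermanentMACAddress = 24:84:09:3E:2E:1A
-------------------------------------------------------------------
"""

def nic_macs(output):
    """Map InstanceID -> PermanentMACAddress for records whose Device Type is NIC."""
    macs = {}
    # Split on lines that consist only of dashes; each chunk is one record.
    for block in re.split(r"^-{3,}[ \t]*$", output, flags=re.MULTILINE):
        if not re.search(r"^Device Type = NIC\s*$", block, flags=re.MULTILINE):
            continue
        inst = re.search(r"\[InstanceID: ([^\]]*)\]", block)
        mac = re.search(r"^PermanentMACAddress = (\S+)", block, flags=re.MULTILINE)
        if inst and mac:
            macs[inst.group(1)] = mac.group(1)
    return macs

result = nic_macs(sample_output)
for inst, mac in result.items():
    print("{} --> {}".format(inst, mac))
```

Working record by record avoids pairing an InstanceID with a MAC address that belongs to a different component, which can happen with two independent findall calls when some records have no MAC.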

Your output looks like INI file content, so you could try parsing it with configparser.
>>> import configparser
>>> config = configparser.ConfigParser()
>>> config.read_string(output)
>>> for section in config.sections():
...     print(section)
...     print(config[section]['Device Type'])
...
InstanceID: NIC.Slot.2-2-1
NIC
>>>

Related

Python - get TLD

I have a problem with a function that should remove the TLD from a domain. If the domain has a subdomain, it works correctly. For example:
Input: asdf.xyz.example.com
Output: asdf.xyz.example
The problem is that when the domain has no subdomain, there is a dot in front of the domain:
Input: example.com
Output: .example
This is my code:
res = get_tld(domain, as_object=True, fail_silently=True, fix_protocol=True)
domain = '.'.join([res.subdomain, res.domain])
The get_tld function is from the tld library.
Could someone help me solve this problem?

With a very simple string manipulation, is this what you are looking for?
d1 = 'asdf.xyz.example.com'
output = '.'.join(d1.split('.')[:-1])
# output = 'asdf.xyz.example'
d2 = 'example.com'
output = '.'.join(d2.split('.')[:-1])
# output = 'example'

You can use filtering. It looks like get_tld works as intended, but the join is incorrect:
domain = '.'.join(filter(lambda x: len(x), [res.subdomain, res.domain]))
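The same idea can be tried without the tld library installed. This is a small sketch under the assumption that the subdomain is an empty string when it is absent, which is what the ".example" output in the question suggests; join_labels is a hypothetical helper name.

```python
def join_labels(*labels):
    """Join only the non-empty labels with dots."""
    return ".".join(label for label in labels if label)

with_subdomain = join_labels("asdf.xyz", "example")
without_subdomain = join_labels("", "example")
print(with_subdomain)     # asdf.xyz.example
print(without_subdomain)  # example
```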

Another simple version is this:
def remove_tld(url):
    *base, tld = url.split(".")
    return ".".join(base)
url = "asdf.xyz.example.com"
print(remove_tld(url)) # asdf.xyz.example
url = "example.com"
print(remove_tld(url)) # example
*base, tld = url.split(".") puts the TLD in tld and everything else in base. Then you just join that with ".".join(base).

Python regex to find functions and params pairs in js files

I am writing a JavaScript crawler application.
The application needs to open JavaScript files and find some specific code in order to do some stuff with them.
I am using regular expressions to find the code of interest.
Consider the following JavaScript code:
let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], ctx = 'ctx1') : st('Found {0}', [st(this.param)]);
As you can see there is the st function which is called three times in the same line. The first two calls have an extra parameter named ctx but the third one doesn't have it.
What I need to do is to have 3 re matches as below:
Match 1
Group: function = "st('"
Group: string = "string1"
Group: ctx = "ctx1"
Match 2
Group: function = "st('"
Group: string = "string2"
Group: ctx = "ctx2"
Match 3
Group: function = "st('"
Group: string = "Found {0}"
Group: ctx = (None)
I am using the regex101.com to test my patterns and the pattern that gives the closest thing to what I am looking for is the following:
(?P<function>st\([\"'])(?P<string>.+?(?=[\"'](\s*,ctx\s*|\s*,\s*)))
You can see it in action here.
However, I have no idea how to make it return the ctx group the way I want it.
For your reference I am using the following Python code:
matches = []
code = "let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], ctx = 'ctx1') : st('Found {0}', [st(this.param)], ctx = 'ctxparam'"
pattern = "(?P<function>st\([\"'])(?P<string>.+?(?=[\"'](\s*,ctx\s*|\s*,\s*)))"
for m in re.compile(pattern).finditer(code):
    fnc = m.group('function')
    msg = m.group('string')
    ctx = m.group('ctx')
    idx = m.start()
    matches.append([idx, fnc, msg, ctx])
print(matches)
I have the feeling that re alone isn't capable of doing exactly what I am looking for, but any suggestion or solution that gets closer is more than welcome.
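Because the st calls are nested, a single pattern has trouble telling which ctx belongs to which call. One possible approach, sketched here and not taken from the thread, is to let re find the start of each call and then walk forward with a bracket counter so that only the call's own arguments are searched for ctx. The names find_st_calls, CALL_START and CTX_ARG are hypothetical, and the sketch assumes that string literals in the JavaScript contain no parentheses or square brackets.

```python
import re

# Start of a call: st( followed by a quoted first argument.
CALL_START = re.compile(r"\bst\(\s*(['\"])(?P<string>.*?)\1")
# A ctx argument with a quoted value.
CTX_ARG = re.compile(r"\bctx\s*=\s*(['\"])(?P<ctx>.*?)\1")

def find_st_calls(code):
    """Return [start_index, string, ctx_or_None] for each st('...') call."""
    found = []
    for m in CALL_START.finditer(code):
        depth = 0
        own_args = []  # this call's text with nested brackets blanked out
        for ch in code[m.start() + 2:]:  # begin at the "(" right after "st"
            if ch in "([":
                depth += 1
            elif ch in ")]":
                depth -= 1
                if depth == 0:
                    break  # reached the ")" that closes this call
            own_args.append(ch if depth == 1 else " ")
        ctx = CTX_ARG.search("".join(own_args))
        found.append([m.start(), m.group("string"),
                      ctx.group("ctx") if ctx else None])
    return found

code = ("let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], "
        "ctx = 'ctx1') : st('Found {0}', [st(this.param)]);")
calls = find_st_calls(code)
for call in calls:
    print(call)
```

The call st(this.param) is skipped on purpose, since its first argument is not a quoted string. For anything more involved than this, a real JavaScript parser is likely the safer choice.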

Converting part of string into variable name in python

I have a file containing a text like this:
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
Does anyone know how I could extract variables like the ones below?
appList=["application1","application2"]
ServerOfapp1=["127.0.0.1:8082","127.0.0.1:8083","127.0.0.1:8084"]
ServerOfapp2=["127.0.0.1:8092","127.0.0.1:8093","127.0.0.1:8094"]
.
.
.
and so on

If the lines you want always start with upstream and server this should work:
app_dic = {}
with open('file.txt','r') as f:
    for line in f:
        if line.startswith('upstream'):
            app_i = line.split()[1]
            server_of_app_i = []
            for line in f:
                if not line.startswith('server'):
                    break
                server_of_app_i.append(line.split()[1][:-1])
            app_dic[app_i] = server_of_app_i
app_dic should then be a dictionary of lists:
{'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'],
'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']}
EDIT
If the input file does not contain any newline characters, then as long as the file is not too large you could read it into a list of words and iterate over that:
app_dic = {}
with open('file.txt','r') as f:
    txt_iter = iter(f.read().split())  # iterator over the list of words
for word in txt_iter:
    if word == 'upstream':
        app_i = next(txt_iter)
        server_of_app_i = []
        for word in txt_iter:
            if word == 'server':
                server_of_app_i.append(next(txt_iter)[:-1])
            elif word == '}':
                break
        app_dic[app_i] = server_of_app_i
This is uglier, as one has to search for the closing curly bracket to break. If it gets any more complicated, a regex should be used.
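If the lines in the real file are indented, startswith will not match them directly. The following is a sketch of the same line-by-line idea that splits each line into words first, so leading whitespace does not matter. The function name parse_upstreams and the indented sample text are assumptions made for illustration.

```python
def parse_upstreams(text):
    """Collect the server entries of each upstream block into a dict of lists."""
    apps = {}
    current = None  # name of the upstream block being read, if any
    for raw_line in text.splitlines():
        words = raw_line.split()
        if len(words) >= 2 and words[0] == "upstream":
            current = words[1]
            apps[current] = []
        elif len(words) >= 2 and words[0] == "server" and current is not None:
            apps[current].append(words[1].rstrip(";"))
        elif words == ["}"]:
            current = None  # a closing brace ends the current block
    return apps

# Assumed layout: the same content as in the question, with indentation.
sample = """
loadbalancer {
    upstream application1 {
        server 127.0.0.1:8082;
        server 127.0.0.1:8083;
    }
    upstream application2 {
        server 127.0.0.1:8092;
    }
}
"""
app_dic = parse_upstreams(sample)
print(app_dic)
```

A dictionary keyed by application name is usually easier to work with than generated variable names such as ServerOfapp1.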

If you are able to use the newer regex module by Matthew Barnett, you can use the following solution, see an additional demo on regex101.com:
import regex as re
rx = re.compile(r"""
    (?:(?P<application>application\d)\s{\n| # "application" + digit + { + newline
    (?!\A)\G\n)                             # assert that the next match starts here
    server\s                                # match "server"
    (?P<server>[\d.:]+);                    # followed by digits, . and :
    """, re.VERBOSE)
string = """
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
"""
result = {}
for match in rx.finditer(string):
    if match.group('application'):
        current = match.group('application')
        result[current] = list()
    if current:
        result[current].append(match.group('server'))
print(result)
# {'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094'], 'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084']}
This makes use of the \G anchor, named capture groups and some programming logic.

This is the basic method:
import re
# each of your objects here
objText = "xyz xcyz 244.233.233.2:123"
# no /.../g delimiters here: those are JavaScript syntax, not part of a Python pattern
listOfAll = re.findall(r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):[0-9]{1,5}", objText)
for eachMatch in listOfAll:
    print("Here's one! %s" % eachMatch)
Obviously that's a bit rough around the edges, but it will perform a full-scale regex search of whatever string it's given. Probably a better solution would be to pass it the objects themselves, but for now I'm not sure what you would have as raw input. I'll try to improve on the regex, though.

I believe this can be solved with re as well:
>>> import re
>>> from collections import defaultdict
>>>
>>> APP = r'\b(?P<APP>application\d+)\b'
>>> IP = r'server\s+(?P<IP>[\d\.:]+);'
>>>
>>> pat = re.compile('|'.join([APP, IP]))
>>>
>>>
>>> scan = pat.scanner(s)  # s holds the text from the question
>>> d = defaultdict(list)
>>>
>>> for m in iter(scan.search, None):
...     group = m.lastgroup
...     if group == 'APP':
...         keygroup = m.group(group)
...         continue
...     else:
...         d[keygroup].append(m.group(group))
...
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})
Or, similarly, with the re.finditer method and without pat.scanner:
>>> for m in re.finditer(pat, s):
...     group = m.lastgroup
...     if group == 'APP':
...         keygroup = m.group(group)
...         continue
...     else:
...         d[keygroup].append(m.group(group))
...
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})

Python Minidom XML parsing dotted quad/nested children

I've got a gigantic list of varying objects I need to parse, and have multiple questions:
1. The string values within the XML I'm able to parse quite easily (hostname, color, class_name, etc.); however, anything numerical in nature (IP address, subnet mask, etc.) I'm not handling correctly. How do I get it to display the correct dotted quad?
2. What is the correct method (using minidom) to pull information out of deeper children? (See the group object: I need the 'name' under reference.)
3. How can I sanitize (remove) the erroneous [] when a field does not contain a value (netmask, for instance)?
The XML looks like one of these two outputs (sanitized):
a) Host object:
<network_object>
<Name>DB1</Name>
<Class_Name>host_plain</Class_Name>
<color><![CDATA[black]]></color>
<ipaddr><![CDATA[192.168.100.100]]></ipaddr>
b) Group object (contains multiple members):
<network_object>
<Name>DB_Servers</Name>
<Class_Name>network_object_group</Class_Name>
<members>
<reference>
<Name>DB1</Name>
<Table>network_objects</Table>
</reference>
<reference>
<Name>DB2</Name>
<Table>network_objects</Table>
</reference>
</members>
<color><![CDATA[black]]></color>
Current output of my code looks like this for a host object:
DB1 host_plain black [<DOM Element: ipaddr at 0x2d05a50>] []
For a network object:
Net_192.168.100.0 network black [<DOM Element: ipaddr at 0x399add0>] [<DOM Element: netmask at 0x399af10>]
For a group object:
DB_Servers network_object_group black [] []
My code:
from xml.dom import minidom
net_xml = minidom.parse("network_objects.xml")
NetworkObjectsTag = net_xml.getElementsByTagName("network_objects")[0]
# Pull individual network objects
NetworkObjectTag = NetworkObjectsTag.getElementsByTagName("network_object")
for network_object in NetworkObjectTag:
    name = network_object.getElementsByTagName("Name")[0].firstChild.data
    class_name = network_object.getElementsByTagName("Class_Name")[0].firstChild.data
    color = network_object.getElementsByTagName("color")[0].firstChild.data
    ipaddr = network_object.getElementsByTagName("ipaddr")
    netmask = network_object.getElementsByTagName("netmask")
    print(name,class_name,color,ipaddr,netmask)
Edit: I've been able to get some output that resolves #1; however, it seems I'm hitting a limit I'm unaware of.
New code:
ipElement = network_object.getElementsByTagName("ipaddr")[0]
ipaddr = ipElement.firstChild.data
maskElement = network_object.getElementsByTagName("netmask")[0]
netmask = maskElement.firstChild.data
This gives me the output I'm looking for; however, it stops after 6-9 entries with 'builtins.IndexError: list index out of range'.

I've been able to answer all of my questions except how to properly handle the network_object_group. I'll make another post for that specifically.
Here's my new code:
from xml.dom import minidom
net_xml = minidom.parse("network_objects.xml")
NetworkObjectsTag = net_xml.getElementsByTagName("network_objects")[0]
# Pull individual network objects
NetworkObjectTag = NetworkObjectsTag.getElementsByTagName("network_object")
for network_object in NetworkObjectTag:
    name = network_object.getElementsByTagName("Name")[0].firstChild.data
    class_name = network_object.getElementsByTagName("Class_Name")[0].firstChild.data
    color = network_object.getElementsByTagName("color")[0].firstChild.data
    # reset the optional fields so values from the previous object do not carry over
    ipaddr = netmask = ipaddr_first = ipaddr_last = ""
    ipElement = network_object.getElementsByTagName("ipaddr")
    if ipElement:
        ipElement = network_object.getElementsByTagName("ipaddr")[0]
        ipaddr = ipElement.firstChild.data
    maskElement = network_object.getElementsByTagName("netmask")
    if maskElement:
        maskElement = network_object.getElementsByTagName("netmask")[0]
        netmask = maskElement.firstChild.data
    #address_ranges
    ipaddr_firstElement = network_object.getElementsByTagName("ipaddr_first")
    if ipaddr_firstElement:
        ipaddr_firstElement = network_object.getElementsByTagName("ipaddr_first")[0]
        ipaddr_first = ipaddr_firstElement.firstChild.data
    ipaddr_lastElement = network_object.getElementsByTagName("ipaddr_last")
    if ipaddr_lastElement:
        ipaddr_lastElement = network_object.getElementsByTagName("ipaddr_last")[0]
        ipaddr_last = ipaddr_lastElement.firstChild.data
    if ipaddr_firstElement:
        print(name,class_name,ipaddr,netmask,ipaddr_first,ipaddr_last,color)
    else:
        print(name,class_name,ipaddr,netmask,color)
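For the group objects left open above, one possible way to reach the member names is to read the Name element under each reference. The sketch below is an illustration and not the poster's final code: SAMPLE_XML is assembled from the two sanitized snippets in the question with the closing tags added, and child_text and member_names are hypothetical helper names.

```python
from xml.dom import Node, minidom

# Assumed sample: the two snippets from the question, with closing tags added.
SAMPLE_XML = """\
<network_objects>
<network_object>
<Name>DB1</Name>
<Class_Name>host_plain</Class_Name>
<color><![CDATA[black]]></color>
<ipaddr><![CDATA[192.168.100.100]]></ipaddr>
</network_object>
<network_object>
<Name>DB_Servers</Name>
<Class_Name>network_object_group</Class_Name>
<members>
<reference><Name>DB1</Name><Table>network_objects</Table></reference>
<reference><Name>DB2</Name><Table>network_objects</Table></reference>
</members>
<color><![CDATA[black]]></color>
</network_object>
</network_objects>
"""

def child_text(parent, tag, default=""):
    """Text of the first direct child element named tag, or default if absent."""
    for node in parent.childNodes:
        if node.nodeType == Node.ELEMENT_NODE and node.tagName == tag:
            parts = [c.data for c in node.childNodes
                     if c.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)]
            return "".join(parts).strip()
    return default

def member_names(network_object):
    """Names found under <members><reference><Name> of a group object."""
    return [child_text(ref, "Name")
            for ref in network_object.getElementsByTagName("reference")]

doc = minidom.parseString(SAMPLE_XML)
rows = []
for obj in doc.getElementsByTagName("network_object"):
    rows.append((child_text(obj, "Name"), child_text(obj, "Class_Name"),
                 child_text(obj, "ipaddr"), child_text(obj, "netmask"),
                 member_names(obj)))
for row in rows:
    print(row)
```

Looking only at direct children keeps the object's own Name apart from the Name elements nested under reference, and the default value takes the place of the empty [] for missing fields.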

Get subdomain from URL using Python

For example, the address is:
Address = http://lol1.domain.com:8888/some/page
I want to save the subdomain into a variable so I could do something like this:
print SubAddr
>> lol1

Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:
>>> import tldextract
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
Note that tldextract properly handles sub-domains.

urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.
import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]

Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL
You will need the list of effective tlds from here
from __future__ import with_statement
from urlparse import urlparse
# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]
class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]
def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements),0):
        lastIElements = urlElements[i:]
        # i=-3: ["abcde","co","uk"]
        # i=-2: ["co","uk"]
        # i=-1: ["uk"] etc
        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate
        # match tlds:
        if (exceptionCandidate in tlds):
            return ".".join(urlElements[i:])
        if (candidate in tlds or wildcardCandidate in tlds):
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
            # returns ["abcde"]
    raise ValueError("Domain not in global list of TLDs")
domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80",tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld
Gives you:
Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk

A very basic approach, without any sanity checking, could look like:
address = 'http://lol1.domain.com:8888/some/page'
host = address.partition('://')[2]
sub_addr = host.partition('.')[0]
print sub_addr
This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:
http://www.google.com/
Is that what you mean?

What you are looking for is in:
http://docs.python.org/library/urlparse.html
for example:
".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])
Will do the job for you (will return "www.my")

For extracting the hostname, I'd use urlparse from urllib2:
>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'
As for extracting the subdomain, you need to cover the case where the FQDN is longer. How you do this would depend on your purposes. I might suggest stripping off the two rightmost components.
E.g.
>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'
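The same heuristic can be written for Python 3 with urllib.parse. This is a sketch of the "strip the two rightmost components" idea only; naive_subdomain is a hypothetical name, and the last call shows where the heuristic gives a wrong answer for multi-part suffixes such as co.uk.

```python
from urllib.parse import urlparse

def naive_subdomain(url):
    """Everything left of the last two hostname labels (no public suffix list)."""
    hostname = urlparse(url).hostname or ""
    labels = hostname.split(".")
    return ".".join(labels[:-2])

print(naive_subdomain("http://lol1.domain.com:8888/some/page"))      # lol1
print(naive_subdomain("http://sub.lol1.domain.com:8888/some/page"))  # sub.lol1
print(naive_subdomain("http://forums.bbc.co.uk"))  # forums.bbc, although the suffix is co.uk
```

When suffixes like co.uk matter, a library backed by the public suffix list, such as tldextract, is the more reliable choice.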

We can use https://github.com/john-kurkowski/tldextract for this problem...
It's easy.
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')

tldextract separates the TLD from the registered domain and subdomains of a URL.
Installation
pip install tldextract
For the current question:
import tldextract
address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)
The output:
Extracted domain name : domain
In addition, there are a number of examples closely related to the usage of tldextract.extract.

First, import tldextract, which splits a URL into its constituents: subdomain, domain, and suffix.
import tldextract
Then declare a variable (say, ext) to store the result of the query. Pass the URL as a quoted string inside the parentheses, as shown below:
ext = tldextract.extract("http://lol1.domain.com:8888/some/page")
If we simply evaluate the ext variable, the output will be:
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
Then, if you want only the subdomain, domain, or suffix, use the corresponding snippet below.
ext.subdomain
The result will be:
'lol1'
ext.domain
The result will be:
'domain'
ext.suffix
The result will be:
'com'
Also, if you want to store only the subdomain in a variable, use the code below:
Sub_Domain = ext.subdomain
Then print Sub_Domain:
Sub_Domain
The result will be:
'lol1'

Using python 3 (I'm using 3.9 to be specific), you can do the following:
from urllib.parse import urlparse
address = 'http://lol1.domain.com:8888/some/page'
url = urlparse(address)
url.hostname.split('.')[0]

import re
def extract_domain(domain):
    # raw strings keep the regex backslashes from being read as string escapes
    domain = re.sub(r'http(s)?://|(\:|/)(.*)|','', domain)
    matches = re.findall(r"([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
    if matches:
        return matches[0]
    else:
        return domain
def extract_subdomains(domain):
    subdomains = domain = re.sub(r'http(s)?://|(\:|/)(.*)|','', domain)
    domain = extract_domain(subdomains)
    subdomains = re.sub(r'\.?'+domain,'', subdomains)
    return subdomains
Example to fetch subdomains:
print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))
Outputs:
lol1
kota-tangerang
Example to fetch domain
print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))
Outputs:
domain.com
kpu.go.id

Standardize all domains to start with www. unless they already have a subdomain.
from urllib.parse import urlparse
def has_subdomain(domain):
    # naive check: more than two dot-separated labels, e.g. sub.example.com
    return len(domain.split('.')) > 2
url = 'http://example.com/some/page'  # example input
parsed = urlparse(url)
domain = parsed.netloc
if not has_subdomain(domain):
    domain = 'www.' + domain
url = parsed.scheme + '://' + domain
