Imagine that we have a string like:
Routing for Networks:
0.0.0.0/32
5.6.4.3/24
2.3.1.4/32
Routing Information Sources:
Gateway Distance Last Update
192.168.61.100 90 00:33:51
192.168.61.103 90 00:33:43
Irregular IPs:
1.2.3.4/24
5.4.3.3/24
I need to get a list of IPs between "Routing for Networks:" and "Routing Information Sources:" like below:
["0.0.0.0/32","5.6.4.3/24","2.3.1.4/32"]
What I have done till now is:
Routing for Networks:\n(.+(?:\n.+)*)\nRouting
But it is not working as expected.
UPDATE:
My code is as below:
re.findall("Routing for Networks:\n(.+(?:\n.+)*)\nRouting", string)
The value of capture group 1 includes the newlines. You can split that value on a newline to get the separate values.
If you want to use re.findall, you will get a list of group 1 values, and you can split every value in that list on a newline.
An example with a single group 1 match:
import re
pattern = r"Routing for Networks:\n(.+(?:\n.+)*)\nRouting"
s = ("Routing for Networks:\n"
"0.0.0.0/32\n"
"5.6.4.3/24\n"
"2.3.1.4/32\n"
"Routing Information Sources:\n"
"Gateway Distance Last Update\n"
"192.168.61.100 90 00:33:51\n"
"192.168.61.103 90 00:33:43")
m = re.search(pattern, s)
if m:
    print(m.group(1).split("\n"))
Output
['0.0.0.0/32', '5.6.4.3/24', '2.3.1.4/32']
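For the re.findall route mentioned above, a minimal sketch could look like this. The two-block sample text and the names text and networks are made up for illustration; the blocks are assumed to be separated by a blank line.

```python
import re

pattern = r"Routing for Networks:\n(.+(?:\n.+)*)\nRouting"

# Hypothetical input: two blocks, separated by a blank line so the
# greedy (?:\n.+)* cannot run from the first block into the second.
text = (
    "Routing for Networks:\n"
    "1.1.1.1/32\n"
    "2.2.2.2/24\n"
    "Routing Information Sources:\n"
    "\n"
    "Routing for Networks:\n"
    "3.3.3.3/32\n"
    "Routing Information Sources:\n"
)

# findall returns one string per match (the group 1 value); split each one.
networks = [block.split("\n") for block in re.findall(pattern, text)]
print(networks)
```

Each inner list holds the entries of one "Routing for Networks:" block.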
For a somewhat more precise match, and if the same kind of block can occur several times in a row, you can match the IP/prefix format itself and use a lookahead assertion for Routing instead of consuming it:
Routing for Networks:\n((?:(?:\d{1,3}\.){3}\d{1,3}/\d+\n)+)(?=Routing)
Example
pattern = r"Routing for Networks:\n((?:(?:\d{1,3}\.){3}\d{1,3}/\d+\n)+)(?=Routing)"
s = "..."
m = re.search(pattern, s)
if m:
    print([line for line in m.group(1).split("\n") if line])
See a regex demo and a Python demo.
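A self-contained version of the stricter pattern, run against the full string from the question, might look like the sketch below. The variable name ips and the use of splitlines() are my own choices, not part of the original snippet.

```python
import re

pattern = r"Routing for Networks:\n((?:(?:\d{1,3}\.){3}\d{1,3}/\d+\n)+)(?=Routing)"

s = (
    "Routing for Networks:\n"
    "0.0.0.0/32\n"
    "5.6.4.3/24\n"
    "2.3.1.4/32\n"
    "Routing Information Sources:\n"
    "Gateway Distance Last Update\n"
    "192.168.61.100 90 00:33:51\n"
    "192.168.61.103 90 00:33:43\n"
    "Irregular IPs:\n"
    "1.2.3.4/24\n"
    "5.4.3.3/24\n"
)

m = re.search(pattern, s)
# Group 1 ends with a newline, so splitlines() avoids a trailing empty item.
ips = m.group(1).splitlines() if m else []
print(ips)
```

The entries under "Irregular IPs:" are not picked up, because they are not directly preceded by "Routing for Networks:".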
I want to get a list and filter it (In this case it's a list of a record, a domain name and an ip).
I want the list to be something like so:
10.0.0.10 ansible0 ben1.com
ansible1 ben1.com 10.0.0.10
In other words, the IP, the zone and the record can appear in any order and it should still catch them.
Now I have 2 regexes, one that catches the domain (with the dot) and one that catches the IP:
Domain: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}
Simple IP: (?:[0-9]{1,3}\.){3}[0-9]{1,3}
With these I can catch all the domain names and all the IPs in Python and put them into lists.
Now I only need to catch the "subdomain" (in this case ansible1 and ansible0).
It should be able to contain numbers and characters like - _ * and so on, anything but a dot (.).
How can I do it via regex?
You can use this regex with 3 alternations and 3 named groups:
(?P<domain>[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,})|
(?P<ip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})|
(?P<sub>[^\s.]+)
The named groups domain and ip use the regexes you provided. The third group, (?P<sub>[^\s.]+), matches one or more characters that are neither a dot nor whitespace.
Code:
import re
arr = ['10.0.0.10 ansible0 ben1.com', 'ansible1 ben1.com 10.0.0.10']
rx = re.compile(r'(?P<domain>[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,})|(?P<ip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})|(?P<sub>[^\s.]+)')
subs = []
for i in arr:
    for m in rx.finditer(i):
        if m.group('sub'):
            subs.append(m.group('sub'))
print(subs)
Output:
['ansible0', 'ansible1']
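If you want all three fields per line rather than only the subdomains, one possible sketch uses m.lastgroup, which holds the name of the named group that matched. The names rows, fields and parsed are made up for this example.

```python
import re

rx = re.compile(
    r'(?P<domain>[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,})'
    r'|(?P<ip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})'
    r'|(?P<sub>[^\s.]+)'
)

rows = ['10.0.0.10 ansible0 ben1.com', 'ansible1 ben1.com 10.0.0.10']

parsed = []
for row in rows:
    fields = {}
    for m in rx.finditer(row):
        # Exactly one alternative matches per token; lastgroup is its name.
        fields[m.lastgroup] = m.group()
    parsed.append(fields)

print(parsed)
```

This assumes each line contains at most one token of each kind; a second token of the same kind would overwrite the first in the dict.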
I am still new to regular expressions, as in the Python library re.
I want to extract all the proper nouns as a whole word if they are separated by space.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this only gives the pairs, not the single ones. I want both kinds.
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby
def join_token(string_, type_='NNP'):
    res = []
    for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x: x[1]):
        if k == type_:
            res.append(' '.join(i[0] for i in g))
    return res
join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require a modification if you expect three or more consecutive types:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
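To see why this works, here is a small sketch of the intermediate runs that groupby produces: consecutive tokens with the same tag end up in the same group. The names pairs and runs are only for illustration.

```python
from itertools import groupby

tagged = "European/NNP Community/NNP French/JJ European/NNP export/VB"

# Split every token into [word, tag].
pairs = [tok.split('/') for tok in tagged.split()]

# groupby starts a new group each time the key (the tag) changes.
runs = [
    (tag, [word for word, _ in grp])
    for tag, grp in groupby(pairs, key=lambda p: p[1])
]
print(runs)
```

Note that the two NNP runs stay separate because groupby only merges adjacent items, which is exactly what the question asks for.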
Interesting requirement. The code is explained in the comments; this is a fast solution using only regex:
import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
# 1: First clean up the target words, turning word/NNP into word.
# You could use str.replace, but this shows a technique:
# refer back to a captured group with \index_of_group.
# re.sub(r'/NNP', '', text)
# text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)
# this pattern strips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining REGEX:
(?:\s+|^) : skip leading spaces
((?:(?:\s)?\w+(?=\s+|$))+) : a capturing group around a non-capturing subgroup. The subgroup (?:(?:\s)?\w+(?=\s+|$)) matches each word in a sequence of words followed by spaces or the end of the line, and the outer group captures the whole sequence. Without the outer group, the match would return only the first word.
(?:\s+|$) : consume the trailing space of the sequence
I needed to remove /NNP from the target words because you want to keep a sequence of word/NNP tokens in a single group. Something like (word)/NNP (word)/NNP returns two elements in one match, not a single text. Once /NNP is removed, the text is just word word, so a regex like ((?:\w+\s)+) can capture the sequence. It is not quite that simple, because we must only capture words that do not end in /sequence_of_letters. The benefit is that there is no need to loop over the matched groups and concatenate elements to build the text.
NOTE: both solutions work fine if all words are in the format word/sequence_of_letters. If you have words that are not in this format, you need to fix them first:
to keep them, add /NNP at the end of each word; to drop them, add /DUMMY.
Using re.split, which is slower because a list comprehension is needed to clean up the result:
import re
# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"
RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
You'd like to get a pattern but with some parts deleted from it.
You can get it with two successive regexes:
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
[ re.sub(r"/NNP","",s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*",tagged_sent_str) ]
['European Community', 'European']
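The same two-step idea can be wrapped in a function so the tag is a parameter. This is only a sketch: the function name extract_runs and the added \b word boundary are my own additions, not part of the snippet above.

```python
import re

def extract_runs(tagged, tag="NNP"):
    """Return runs of consecutive word/tag tokens with the tag stripped."""
    t = re.escape(tag)
    # \b after the tag keeps e.g. tag "NN" from matching inside "NNP".
    run_rx = re.compile(rf"\w+/{t}\b(?:\s+\w+/{t}\b)*")
    return [re.sub(rf"/{t}\b", "", run) for run in run_rx.findall(tagged)]

str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
print(extract_runs(str2))
```

Because the run pattern uses * on the repeated part, three or more consecutive tokens are handled without any change.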
In a list I need to match specific instances, except for a specific combination of strings:
let's say I have a list of strings like the following:
l = [
'PSSTFRPPLYO',
'BNTETNTT',
'DE52 5055 0020 0005 9287 29',
'210-0601001-41',
'BSABESBBXXX',
'COMMERZBANK'
]
I need to match all the strings that look like a SWIFT/BIC code. This code has the following form:
6 letters followed by
2 letters/digits followed by
3 optional letters / digits
hence I have written the following regex to match such specific pattern
import re
regex = re.compile(r'(?<!\w)[a-zA-Z]{6}[a-zA-Z0-9]{2}([a-zA-Z0-9]{3})?(?!\w)')
for item in l:
    match = regex.search(item)
    if match:
        print('found a match, the matched string {} the match {}'.format(item, item[match.start():match.end()]))
    else:
        print('found no match in {}'.format(item))
I need the following cases to be matched:
result = ['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX' ]
but instead I get
result = ['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX', 'COMMERZBANK' ]
So what I need is to match only the strings that don't contain the word 'bank'.
To do so, I refined my regex to:
regex = re.compile(r'(?<!bank/i)(?<!\w)[a-zA-Z]{6}[a-zA-Z0-9]{2}([a-zA-Z0-9]{3})?(?!\w)(?!bank/i)')
I simply used a negative lookbehind and a negative lookahead.
My regex doesn't do the intended filtering. What did I miss?
You can try this:
import re
final_vals = [i for i in l if re.findall(r'^[a-zA-Z]{6}\w{2}|(^[a-zA-Z]{6}\w{2}\w{3})', i) and not re.findall('BANK', i, re.IGNORECASE)]
Output:
['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX']
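As for why the attempt in the question has no effect: in Python's re, /i inside a pattern is just the two literal characters / and i, so (?<!bank/i) only rules out the literal text "bank/i". Case-insensitivity is requested with the re.IGNORECASE flag instead. A single-regex sketch could look like this; it assumes each list item is one whole candidate code (hence fullmatch), and the names candidates and bic_rx are made up.

```python
import re

candidates = [
    'PSSTFRPPLYO',
    'BNTETNTT',
    'DE52 5055 0020 0005 9287 29',
    '210-0601001-41',
    'BSABESBBXXX',
    'COMMERZBANK',
]

# (?!.*bank) at the start rejects any string that contains "bank";
# re.IGNORECASE makes that check (and the letter classes) case-insensitive.
bic_rx = re.compile(
    r'(?!.*bank)[a-z]{6}[a-z0-9]{2}(?:[a-z0-9]{3})?',
    re.IGNORECASE,
)

# fullmatch requires the whole item to fit the 8- or 11-character form.
result = [c for c in candidates if bic_rx.fullmatch(c)]
print(result)
```

If the codes can be embedded in longer text, fullmatch would need to be replaced by search with suitable boundaries, as in the question's original pattern.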
I have a file with two different types of data I'd like to parse with a regex; however, the data is similar enough that I can't find the correct way to distinguish it.
Some lines in my file are of form:
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Other lines are of form
AED=20180823
AMD=20150914
AMD=20150921
The remaining lines are headers and I'd like to discard them. For example
[HEADER: BUSINESS DATE=20160831]
My solution attempt so far is to match first three capital letters and an equal sign,
r'\b[A-Z]{3}=\b'
but after that I'm not sure how to distinguish between dates (eg 20180823) and days (eg FRI:SAT:SUN).
The results I'd expect from these parsing functions:
Regex weekday_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=FRI>);
Regex date_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=20160816>);
weekdays = [weekday_rx.Match(line) for line in infile.read()]
dates = [date_rx.Match(line) for line in infile.read()]
r'\S*\d$'
Matches a run of non-whitespace characters that ends in a digit.
Will match AED=20180823
r'\S*[a-zA-Z]$'
Matches a run of non-whitespace characters that ends in a letter.
Will match AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Neither will match
[HEADER: BUSINESS DATE=20160831]
This will match both
r'(\S*[a-zA-Z]$|\S*\d$)'
Replacing the * with the number of occurrences you expect will be safer. The (a|b) form means match a or match b.
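A stricter variant that spells out the expected counts could be sketched as follows. It assumes dates are always 8 digits and days are 3-letter codes joined by colons; the names weekday_rx, date_rx and lines are illustrative, and the sample lines come from the question.

```python
import re

# Three-letter key, "=", then one or more colon-separated 3-letter days.
weekday_rx = re.compile(r'([A-Z]{3})=([A-Z]{3}(?::[A-Z]{3})*)')
# Three-letter key, "=", then exactly 8 digits (assumed YYYYMMDD).
date_rx = re.compile(r'([A-Z]{3})=(\d{8})')

lines = [
    "[HEADER: BUSINESS DATE=20160831]",
    "AED=FRI",
    "AFN=FRI:SAT",
    "AED=20180823",
    "AMD=20150914",
]

weekdays, dates = [], []
for line in lines:
    line = line.strip()
    m = weekday_rx.fullmatch(line)
    if m:
        weekdays.append((m.group(1), m.group(2).split(":")))
        continue
    m = date_rx.fullmatch(line)
    if m:
        dates.append((m.group(1), m.group(2)))
    # Anything else (e.g. header lines) is silently discarded.

print(weekdays)
print(dates)
```

Using fullmatch means the header line is dropped without any extra check, since it fits neither form.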
The following is a solution in Python :)
import re
p = re.compile(r'\b([A-Z]{3})=((\d)+|([A-Z])+)')
str_test_01 = "AMD=SUN:SAT"
m = p.search(str_test_01)
print (m.group(1))
print (m.group(2))
str_test_02 = "AMD=20150921"
m = p.search(str_test_02)
print (m.group(1))
print (m.group(2))
"""
<Output>
AMD
SUN
AMD
20150921
"""
Use pipes to express alternatives in regex. The pattern '[A-Z]{3}:[A-Z]{3}|[A-Z]{3}' will match both ABC and ABC:ABC. Then use parentheses to group results:
import re
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC:ABC')
assert match.groups() == ('ABC:ABC', None)
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC')
assert match.groups() == (None, 'ABC')
You can research the concept of named groups to make this even more readable. Also, take a look at the docs for the match object for useful info and methods.
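As a rough illustration of the named-group variant, the sketch below uses the group names pair and single, which are made up for this example. m.lastgroup reports which alternative matched, and m.groupdict() returns all named groups at once.

```python
import re

days_rx = re.compile(r'(?P<pair>[A-Z]{3}:[A-Z]{3})|(?P<single>[A-Z]{3})')

m = days_rx.match('ABC:ABC')
# lastgroup is the name of the alternative that matched.
pair_result = (m.lastgroup, m.group('pair'))

m = days_rx.match('ABC')
# groupdict() maps every named group to its value (None if unused).
single_result = (m.lastgroup, m.groupdict())

print(pair_result)
print(single_result)
```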
What is the regex pattern to match a string starting with abc-def-xyz and ending with anything?
Update
Since you only want to match host names that begin with abc-def you can simply use str.startswith():
hosts = ['abc-def.1.desktop.rul.com',
'abc-def.2.desktop.rul.com',
'abc-def.3.desktop.rul.com',
'abc-def.4.desktop.rul.com',
'abc-def.44.desktop.rul.com',
'abc-def.100.desktop.rul.com',
'qwe-rty.100.desktop.rul.com',
'z.100.desktop.rul.com',
'192.168.1.10',
'abc-def.100abc.desktop.rul.com']
filtered_hosts = [host for host in hosts if host.startswith('abc-def')]
print(filtered_hosts)
Output
['abc-def.1.desktop.rul.com', 'abc-def.2.desktop.rul.com', 'abc-def.3.desktop.rul.com', 'abc-def.4.desktop.rul.com', 'abc-def.44.desktop.rul.com', 'abc-def.100.desktop.rul.com', 'abc-def.100abc.desktop.rul.com']
Original regex solution follows.
Let's say that your data is a list of host names such as these:
hosts = ['abc-def.1.desktop.rul.com',
'abc-def.2.desktop.rul.com',
'abc-def.3.desktop.rul.com',
'abc-def.4.desktop.rul.com',
'abc-def.44.desktop.rul.com',
'abc-def.100.desktop.rul.com',
'qwe-rty.100.desktop.rul.com',
'z.100.desktop.rul.com',
'192.168.1.10',
'abc-def.100abc.desktop.rul.com']
import re
pattern = re.compile(r'abc-def\.\d+\.')
filtered_hosts = [host for host in hosts if pattern.match(host)]
print(filtered_hosts)
Output
['abc-def.1.desktop.rul.com', 'abc-def.2.desktop.rul.com', 'abc-def.3.desktop.rul.com', 'abc-def.4.desktop.rul.com', 'abc-def.44.desktop.rul.com', 'abc-def.100.desktop.rul.com']
The regex pattern matches any host name that starts with abc-def. followed by one or more digits, followed by a dot. Because pattern.match() only matches at the start of the string, no explicit ^ anchor is needed.
If you wanted to match a more generic pattern such as any sequence of 3 lowercase letters followed by a - and then another 3 lowercase letters, you could do this:
pattern = re.compile(r'[a-z]{3}-[a-z]{3}\.\d+\.')
Now the output also includes 'qwe-rty.100.desktop.rul.com'.
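For completeness, a short runnable check of the generic pattern against a shortened host list (a subset of the list above, chosen for brevity):

```python
import re

hosts = [
    'abc-def.1.desktop.rul.com',
    'qwe-rty.100.desktop.rul.com',
    'z.100.desktop.rul.com',
    '192.168.1.10',
    'abc-def.100abc.desktop.rul.com',
]

# Three lowercase letters, "-", three lowercase letters, ".", digits, ".".
pattern = re.compile(r'[a-z]{3}-[a-z]{3}\.\d+\.')

# match() anchors at the start of each host name.
filtered_hosts = [host for host in hosts if pattern.match(host)]
print(filtered_hosts)
```

'abc-def.100abc.desktop.rul.com' is still excluded, because the digits must be followed directly by a dot.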