Regex: (date/time) same item between each part - python

Writing a simple regex to find dates and times within strings.
There's a small issue with identifying time-items when there's specific dates in the sting. Here's the regex:
TIME_REGEX = "([0-1][0-9]|2[0-3])[:\-\_]?([0-5][0-9])[:\-\_]?([0-5][0-9])"
The issue is that I need to accept time-values without anything between the numbers, hence the two "[:-_]?" parts. However, the regex matches even if the two are different from each other. So this will also match the date "2011-07-30" as being the time 20:11:07.
Can I change the regex so both items inbetween the numbers are the same, so it matches "201107" and "20-11-07", but not "2011-07" or "20:11-07"?

You can store the delimiter in a group and reuse it:
TIME_REGEX = "([0-1][0-9]|2[0-3])(?P<sep>[:\-\_]?)([0-5][0-9])(?P=sep)([0-5][0-9])"
Here, (?P<sep>...) stores the content of this group under the name sep, which we ruse with (?P+<sep>). This way, both items always have to be equal.
Example:
for test in ['201107', '20-11-07', '20-11:07']:
match = re.match(TIME_REGEX, test)
if match:
print test, match.group(1, 3, 4), "delimiter: '{}'".format(match.group('sep'))
yields:
201107 ('20', '11', '07') delimiter: ''
20-11-07 ('20', '11', '07') delimiter: '-'

I suggest you to match the first intermediate character into a group, and use the result of this group to match the second character, as follows. You just have to retrieve the correct groups at the end:
import re
times = ['20-11-07', '2011-07', '20-1107', '201107', '20:11-07', '20-10:07', '20:11:07']
TIME_REGEX = r'([0-1][0-9]|2[0-3])([:\-\_]*)([0-5][0-9])(\2)([0-5][0-9])'
for time in times:
m = re.search(TIME_REGEX, time)
if m:
print(time, "matches with following groups:", m.group(1), m.group(3), m.group(5))
else:
print(time, "does not match")
# 20-11-07 matches with following groups: 20 11 07
# 2011-07 does not match
# 20-1107 does not match
# 201107 matches with following groups: 20 11 07
# 20:11-07 does not match
# 20-10:07 does not match
# 20:11:07 matches with following groups: 20 11 07

Related

Select String after string with regex in python

Imagine that we have a string like:
Routing for Networks:
0.0.0.0/32
5.6.4.3/24
2.3.1.4/32
Routing Information Sources:
Gateway Distance Last Update
192.168.61.100 90 00:33:51
192.168.61.103 90 00:33:43
Irregular IPs:
1.2.3.4/24
5.4.3.3/24
I need to get a list of IPs between "Routing for Networks:" and "Routing Information Sources:" like below:
['0.0.0.0/32","5.6.4.3/24","2.3.1.4/32"]
What I have done till now is:
Routing for Networks:\n(.+(?:\n.+)*)\nRouting
But it is not working as expected.
UPDATE:
my code is as bellow:
re.findall("Routing for Networks:\n(.+(?:\n.+)*)\nRouting", string)
The value of capture group 1 included the newlines. You can split the value of capture group 1 on a newline to get the separated values.
If you want to use re.findall, you will a list of group 1 values, and you can split every value in the list on a newline.
An example with a single group 1 match:
import re
pattern = r"Routing for Networks:\n(.+(?:\n.+)*)\nRouting"
s = ("Routing for Networks:\n"
"0.0.0.0/32\n"
"5.6.4.3/24\n"
"2.3.1.4/32\n"
"Routing Information Sources:\n"
"Gateway Distance Last Update\n"
"192.168.61.100 90 00:33:51\n"
"192.168.61.103 90 00:33:43")
m = re.search(pattern, s)
if m:
print(m.group(1).split("\n"))
Output
['0.0.0.0/32', '5.6.4.3/24', '2.3.1.4/32']
For a bit more precise match, and if there can be multiple of the same consecutive parts, you can match the format and use an assertion for Routing instead of a match:
Routing for Networks:\n((?:(?:\d{1,3}\.){3}\d{1,3}/\d+\n)+)(?=Routing)
Example
pattern = r"Routing for Networks:\n((?:(?:\d{1,3}\.){3}\d{1,3}/\d+\n)+)(?=Routing)"
s = "..."
m = re.search(pattern, s)
if m:
print([s for s in m.group(1).split("\n") if s])
See a regex demo and a Python demo.

Match string using regular expression except specific string combinations python

In a list I need to match specific instances, except for a specific combination of strings:
let's say I have a list of strings like the following:
l = [
'PSSTFRPPLYO',
'BNTETNTT',
'DE52 5055 0020 0005 9287 29',
'210-0601001-41',
'BSABESBBXXX',
'COMMERZBANK'
]
I need to match all the words that points to a swift / bic code, this code has the following form:
6 letters followed by
2 letters/digits followed by
3 optional letters / digits
hence I have written the following regex to match such specific pattern
import re
regex = re.compile(r'(?<!\w)[a-zA-Z]{6}[a-zA-Z0-9]{2}([a-zA-Z0-9]{3})?(?!\w)')
for item in l:
match = regex.search(item)
if match:
print('found a match, the matched string {} the match {}'.format( item, item[match.start() : match.end()]
else:
print('found no match in {}'.format(item)
I need the following cases to be macthed:
result = ['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX' ]
rather I get
result = ['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX', 'COMMERZBANK' ]
so what I need is to match only the strings that don't contain the word 'bank'
to do so I have refined my regex to :
regex = re.compile((?<!bank/i)(?<!\w)[a-zA-Z]{6}[a-zA-Z0-9]{2}([a-zA-Z0-9]{3})?(?!\w)(?!bank/i))
simply I have used negative look behind and ahead for more information about theses two concepts refer to link
My regex doesn't do the filtration intended to do, what did I miss?
You can try this:
import re
final_vals = [i for i in l if re.findall('^[a-zA-Z]{6}\w{2}|(^[a-zA-Z]{6}\w{2}\w{3})', i) and not re.findall('BANK', i, re.IGNORECASE)]
Output:
['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX']

Python regex similar expressions

I have a file with two different types of data I'd like to parse with a regex; however, the data is similar enough that I can't find the correct way to distinguish it.
Some lines in my file are of form:
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Other lines are of form
AED=20180823
AMD=20150914
AMD=20150921
The remaining lines are headers and I'd like to discard them. For example
[HEADER: BUSINESS DATE=20160831]
My solution attempt so far is to match first three capital letters and an equal sign,
r'\b[A-Z]{3}=\b'
but after that I'm not sure how to distinguish between dates (eg 20180823) and days (eg FRI:SAT:SUN).
The results I'd expect from these parsing functions:
Regex weekday_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=FRI>);
Regex date_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=20160816>);
weekdays = [weekday_rx.Match(line) for line in infile.read()]
dates = [date_rx.Match(line) for line in infile.read()]
r'\S*\d$'
Will match all non-whitespace characters that end in a digit
Will match AED=20180823
r'\S*[a-zA-Z]$'
Matches all non-whitespace characters that end in a letter.
will match AED=AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Neither will match
[HEADER: BUSINESS DATE=20160831]
This will match both
r'(\S*[a-zA-Z]$|\S*\d$)'
Replacing the * with the number of occurences you expect will be safer, the (a|b) is match a or match b
The following is a solution in Python :)
import re
p = re.compile(r'\b([A-Z]{3})=((\d)+|([A-Z])+)')
str_test_01 = "AMD=SUN:SAT"
m = p.search(str_test_01)
print (m.group(1))
print (m.group(2))
str_test_02 = "AMD=20150921"
m = p.search(str_test_02)
print (m.group(1))
print (m.group(2))
"""
<Output>
AMD
SUN
AMD
20150921
"""
Use pipes to express alternatives in regex. Pattern '[A-Z]{3}:[A-Z]{3}|[A-Z]{3}' will match both ABC and ABC:ABC. Then use parenthesis to group results:
import re
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC:ABC')
assert match.groups() == ('ABC:ABC', None)
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC')
assert match.groups() == (None, 'ABC')
You can research the concept of named groups to make this even more readable. Also, take a look at the docs for the match object for useful info and methods.

Python regex finding sub-string

I'm New to python and regex. Here I'm trying to recover the text between two limits. The starting could be mov/add/rd/sub/and/etc.. and end limit is end of the line.
/********** sample input text file *************/
f0004030: a0 10 20 02 mov %l0, %psr
//some unwanted lines
f0004034: 90 04 20 03 add %l0, 3, %o0
f0004038: 93 48 00 00 rd %psr, %o1
f000403c: a0 10 3f fe sub %o5, %l0, %g1
/*-------- Here is the code -----------/
try:
objdump = open(dest+name,"r")
except IOError:
print "Error: '" + name + "' not found in " + dest
sys.exit()
objdump_file = objdump.readlines()
for objdump_line in objdump_file:
a = ['add', 'mov','sub','rd', 'and']
if any(x in objdump_line for x in a) # To avoid unwanted lines
>>>>>>>>>> Here is the problem >>>>>>>>>>>>>
m = re.findall ('(add|mov|rd|sub|add)(.*?)($|\n)', objdump_line, re.DOTALL)
<<<<<<<<<<< Here is the problem <<<<<<<<<<<<<
print m
/*---------- Result I'm getting --------------*/
[('mov', ' %l0, %psr', '')]
[('add', ' %l0, 3, %o0', '')]
[('rd', ' %psr, %o1', '')]
[('sub', ' %o5, %l0, %g1', '')]
/*----------- Expected result ----------------*/
[' %l0, %psr']
[' %l0, 3, %o0']
[' %psr, %o1']
[' %o5, %l0, %g1']
I have no Idea why that parentheses and unwanted quotes are coming !!. Thanks in advance.
Quoting from python documentation from here about findall
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
The parenthesis represents one group or list that is found and it contains another list which contains all captured groups. There can be multiple groups that can be found. You can access it as
re.findall ('(add|mov|rd|sub|add)(.*?)($|\n)', objdump_line, re.DOTALL)[0][1]
0 represents the first group and 1 represents first element of the list of that group as you do not want any other element
The capturing group tries to capture the expression matched between the parenthesis. But for the last capturing group there is no text. So you are getting an empty ''
As you mentioned in your comment about using this
add(.*?)$
Instead of try this
(add)(.*?)$
The () indicates capturing group and you will get the result as expected
if you use grouping in findall, it's going to return all captured groups, if you want some specific parts use slicing:
m = re.findall ('(add|mov|rd|sub|add)(.*?)($|\n)', objdump_line, re.DOTALL)[0][-2:-1]
Additionally you can solve your problem without regex, you already checking if string has any of those ['add', 'mov','sub','rd', 'and'], so you can split the string and pick two last elemnts:
m = ' '.join(objdump_line.split()[-2:])

Need help extracting data from a file

I'm a newbie at python.
So my file has lines that look like this:
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
I need help coming up with the correct python code to extract every float preceded by a colon and followed by a space (ex: [-0.294118, 0.487437,etc...])
I've tried dataList = re.findall(':(.\*) ', str(line)) and dataList = re.split(':(.\*) ', str(line)) but these come up with the whole line. I've been researching this problem for a while now so any help would be appreciated. Thanks!
try this one:
:(-?\d\.\d+)\s
In your code that will be
p = re.compile(':(-?\d\.\d+)\s')
m = p.match(str(line))
dataList = m.groups()
This is more specific on what you want.
In your case .* will match everything it can
Test on Regexr.com:
In this case last element wasn't captured because it doesnt have space to follow, if this is a problem just remove the \s from the regex
This will do it:
import re
line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
for match in re.finditer(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE):
print match.group(1)
Or:
match = re.search(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE)
if match:
datalist = match.group(1)
else:
datalist = ""
Output:
-0.294118
0.487437
0.180328
-0.292929
0.00149028
-0.53117
-0.0333333
Live Python Example:
http://ideone.com/DpiOBq
Regex Demo:
https://regex101.com/r/nR4wK9/3
Regex Explanation
(-?\d\.\d+)
Match the regex below and capture its match into backreference number 1 «(-?\d\.\d+)»
Match the character “-” literally «-?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single character that is a “digit” (ASCII 0–9 only) «\d»
Match the character “.” literally «\.»
Match a single character that is a “digit” (ASCII 0–9 only) «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Given:
>>> s='-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333.333'
With your particular data example, you can just grab the parts that would be part of a float with a regex:
>>> re.findall(r':([\d.-]+)', s)
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
You can also split and partition, which would be substantially faster:
>>> [e.partition(':')[2] for e in s.split() if ':' in e]
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
Then you can convert those to a float using try/except and map and filter:
>>> def conv(s):
... try:
... return float(s)
... except ValueError:
... return None
...
>>> filter(None, map(conv, [e.partition(':')[2] for e in s.split() if ':' in e]))
[-0.294118, 0.487437, 0.180328, -0.292929, -1.0, 0.00149028, -0.53117, -0.0333333]
A simple oneliner using list comprehension -
str = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
[float(s.split()[0]) for s in str.split(':')]
Note: this is simplest to understand (and pobably fastest) as we are not doing any regex evaluation. But this would only work for the particular case above. (eg. if you've to get the second number - in the above not so correctly formatted string would need more work than a single one-liner above).

Categories