I have to find dates in multiple formats in a text.
I have some regex like this one:
# Detection of:
# 25/02/2014 or 25/02/14 or 25.02.14
regex = r'\b(0?[1-9]|[12]\d|3[01])[-/\._](0?[1-9]|1[012])[-/\._]((?:19|20)\d\d|\d\d)\b'
The problem is that it also matches dates like 25.02/14 which is not good because the splitting character is not the same.
I could of course do multiple regex with a different splitting character for every regex, or do a post-treatment on the matching results, but I would prefer a complete solution using only one good regex. Is there a way to do so?
In addition to my comment (the original word boundary approach lets the pattern match "dates" that are in fact parts of other entities, like IPs, serial numbers, product IDs, etc.), see the improved version of your regex in comparison with yours:
import re
s = '25.02.19.35 6666-20-03-16-67875 25.02/2014 25.02/14 11/12/98 11/12/1998 14/12-2014 14-12-2014 14.12.1998'
found_dates = [m.group() for m in re.finditer(r'\b(?:0?[1-9]|[12]\d|3[01])([./-])(?:0?[1-9]|1[012])\1(?:19|20)?\d\d\b', s)]
print(found_dates) # initial regex
found_dates = [m.group() for m in re.finditer(r'(?<![\d.-])(?:0?[1-9]|[12]\d|3[01])([./-])(?:0?[1-9]|1[012])\1(?:19|20)?\d\d(?!\1\d)', s)]
print(found_dates) # fixed boundaries
# => ['25.02.19', '20-03-16', '11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
# => ['11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
See, your regex extracts '25.02.19' (part of a potential IP) and '20-03-16' (part of a potential serial number/product ID).
Note I also shortened the regex and extraction code a bit.
Pattern details:
(?<![\d.-]) - a negative lookbehind making sure there is no digit, . or - immediately to the left of the current location (/ is deliberately left out of this check, since dates are often found inside URLs)
(?:0?[1-9]|[12]\d|3[01]) - 01 / 1 to 31 (day part)
([./-]) - Group 1 (technical group to hold the separator value) matching either ., or / or -
(?:0?[1-9]|1[012]) - month part: 01 / 1 to 12
\1 - backreference to the Group 1 value to make sure the same separator comes here
(?:19|20)?\d\d - year part: 19 or 20 (optional values) and then any two digits.
(?!\1\d) - negative lookahead making sure there is no separator (captured into Group 1) followed with any digit immediately to the right of the current location.
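The breakdown above can also be kept next to the pattern itself. The following is only a sketch of the same regex written with re.VERBOSE; the names DATE_RE, sample and found are made up for illustration, and the sample string is a shortened version of the test string used earlier.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE so that
# every part carries its own comment. DATE_RE is a hypothetical name.
DATE_RE = re.compile(r"""
    (?<![\d.-])                 # no digit, '.' or '-' immediately before
    (?:0?[1-9]|[12]\d|3[01])    # day: 1 to 31, leading zero optional
    ([./-])                     # group 1: the separator
    (?:0?[1-9]|1[012])          # month: 1 to 12, leading zero optional
    \1                          # the same separator again
    (?:19|20)?\d\d              # year: two digits, optionally prefixed by 19 or 20
    (?!\1\d)                    # not followed by the separator plus a digit
""", re.VERBOSE)

sample = "25.02.19.35 25.02/14 11/12/98 14-12-2014"
# finditer + group() returns the whole match; findall would return only group 1.
found = [m.group() for m in DATE_RE.finditer(sample)]
print(found)
# ['11/12/98', '14-12-2014']
```

Note that re.findall would return only the separator here, because the pattern contains a single capturing group; that is why the sketch collects m.group() instead.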
Based on the comment of Rawing, this did the trick:
regex = r'\b(0?[1-9]|[12]\d|3[01])([./-])(0?[1-9]|1[012])\2((?:19|20)\d\d|\d\d)\b'
So, the complete code is:
import re
s = '25.02/2014 25.02/14 11/12/98 11/12/1998 14/12-2014 14-12-2014 14.12.1998'
found_dates = []
for m in re.finditer(r'\b(0?[1-9]|[12]\d|3[01])([./-])(0?[1-9]|1[012])\2((?:19|20)\d\d|\d\d)\b', s):
    found_dates.append(m.group(0))
print(found_dates)
The output is, as desired:
['11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
I am very new to Python's re.finditer, and I am trying to find a fairly complex pattern: the G-quadruplex motif, described below.
The sequence starts with at least 3 g's, followed by any number of a/t/g/c until the next group of ggg+ shows up. This repeats 3 times, resulting in a pattern like ggg+...ggg+...ggg+...ggg+
Case should be ignored, and matches may overlap in the original sequence: ggg+...ggg+...ggg+...ggg+...ggg+...ggg+ should return 3 such patterns.
I have struggled with this for some time and have only found an approach like:
re.finditer(r"(?=(([Gg]{3,})([AaTtCcGg]+?(?=[Gg]{3,})[Gg]{3,}){3}))", seq)
and then filtering out the matches whose start position does not coincide with a match of
re.finditer(r"([Gg]{3,})", seq)
Is there any better way to extract this type of sequence? And no for loops please since I have millions of rows like this.
Thank you very much!
PS: an example can be like this
ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC
1.ggggggcgggggggACGCTCggctcAAGGGCTCCGGG
2.gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg
3.GGGCTCCGGGCCCCgggggggACgcgcgAAGGG
You could start the match, asserting that what is directly to the left is not a g char to prevent matching on too many positions.
To match both upper and lowercase chars, you can make the pattern case insensitive using re.I
The value is in capture group 1, which will be returned by re.findall.
(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))
(?<!g) Negative lookbehind, assert not g directly to the left
(?= Positive lookahead
( Capture group 1
g{3,} Match 3 or more g chars to start with
(?: Non capture group
[atc](?:g{0,2}[atc])* Match one of a, t or c, then optionally repeat 0, 1 or 2 g chars followed by a, t or c, so that the spacer never contains ggg
g{3,} Match 3 or more g chars to end with
){3} Close non capture group and repeat 3 times
) Close group 1
) Close lookahead
import re
pattern = r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))"
s = ("ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC \n")
print(re.findall(pattern, s, re.I))
Output
[
'ggggggcgggggggACGCTCggctcAAGGGCTCCGGG',
'gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg',
'GGGCTCCGGGCCCCgggggggACgcgcgAAGGG'
]
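If the position of each motif is needed as well, the same lookahead pattern can be used with re.finditer. This is only a sketch: G4 and spans are illustrative names, and the span of group 1 is used because the overall match of a lookahead is empty.

```python
import re

# Same pattern as above, compiled once with the case-insensitive flag.
G4 = re.compile(r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))", re.I)

seq = "ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC"

# The lookahead consumes nothing, so the motif lives in group 1;
# span(1) gives its (start, end) offsets in seq.
spans = [m.span(1) for m in G4.finditer(seq)]
for start, end in spans:
    print(start, end, seq[start:end])
```

Compiling the pattern once and reusing it is usually worthwhile when the same search has to run over many rows.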
import regex as re
seq = 'ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC'
for match in re.finditer(r'(?<=(^|[^g]))g{3,}([atc](g{,2}[atc]+)*g{3,}){3}', seq, overlapped=True, flags=re.I):
    print(match[0])
Overlapped works by restarting the search from the character after the start of the current match rather than from the end of it. On its own, this would give you a bunch of essentially duplicate results, each just dropping one more leading G. To stop that, use a lookbehind to require the start of the string or a non-G character just before the match:
(?<=(^|[^g]))
The middle section needs to be a bit more complicated: it must require an ATC, which prevents a single run of seven G's from being split up and counted as two separate G groups. So require one ATC, then allow any number of runs of fewer than three G's followed by more ATCs, repeating as needed:
[atc](g{,2}[atc]+)*
The rest is just the Gs and the repeating.
I'm trying to implement some kind of markdown like behavior for a Python log formatter.
Let's take this string as example:
**This is a warning**: Virus manager __failed__
A few regexes later, the string has lost the markdown-like syntax and has been turned into ANSI escape sequences for the terminal:
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
But that should be compressed to
\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
I tried these, beside many other non working solutions:
(\\033\[([\d]+)m){2,} => Captures \033[33m\033[1m with g1 '\033[1m' and g2 '1', and \033[0m\033[0m with g1 '\033[0m' and g2 '0'
(\\033\[([\d]+)m)+ many results, not ok
(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok
and others..
My goal is to have as results:
Input
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
Output
Match 1
\033[33m\033[1m
Group1: 33
Group2: 1
Match 2
\033[0m\033[0m
Group1: 0
Group2: 0
In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.
You want to match consecutively repeating \033[\d+m chunks of text and join the numbers after [ with a semicolon.
You may use
re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text)
The (?:\\033\[\d+m){2,} pattern matches two or more consecutive \033[ + one or more digits + m chunks of text. The match is then passed to the lambda expression, whose output is: 1) \033[, 2) all the numbers after [, extracted with re.findall(r"\[(\d+)", m.group()), deduplicated with set and joined with ;, and then 3) m. Note that a set does not preserve order, so the joined codes may come out as 1;33 rather than 33;1.
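If the order of the codes matters (33;1 rather than 1;33), one option is to deduplicate with dict.fromkeys, which keeps first-seen order. The following is a sketch under that assumption; merge_codes is a hypothetical helper name, and the input is the string from the question.

```python
import re

def merge_codes(match):
    """Fuse a run of \\033[Nm chunks into a single \\033[N;M;...m chunk."""
    # dict.fromkeys drops repeated codes but keeps their original order.
    codes = dict.fromkeys(re.findall(r"\[(\d+)m", match.group()))
    return r"\033[" + ";".join(codes) + "m"

text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"
# Only runs of two or more consecutive chunks are rewritten.
merged = re.sub(r"(?:\\033\[\d+m){2,}", merge_codes, text)
print(merged)
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
```

Because the replacement is a function, its return value is inserted literally, so the backslash in the result needs no extra escaping.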
The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex
r'^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)'
to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.
The regex performs the following operations:
^ # match beginning of line
( # begin cap grp 1
\\0 # match '\0'
(\d+) # match 1+ digits in cap grp 2
\[ # match '['
\2 # match contents of cap grp 2
) # end cap grp 1
[a-z] # match a lc letter
\\0 # match '\0'
\2 # match contents of cap grp 2
\[ # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
# end of the line in cap grp 3
As you see, the portion of the string captured in group 1 is
\033[33
I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.
The next part of the string is to be replaced and therefore is not captured:
m\033[
I've assumed that m must be one lower case letter (or should it be a literal m?), that the backslash and zero are required, and that the following digits must again match the content of capture group 2.
The remainder of the string,
1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.
One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
This is pretty straightforward for the cases you describe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured value of each group would be returned as a result.
\\033\[(\d+)m\\033\[(\d+)m
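A quick check of this pattern against the string from the question, as a sketch only (pair, text and fused are illustrative names). It handles exactly two adjacent chunks and does not deduplicate, so the doubled reset becomes 0;0.

```python
import re

text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"
pair = re.compile(r"\\033\[(\d+)m\\033\[(\d+)m")

# One tuple per match: (group 1, group 2).
print(pair.findall(text))
# [('33', '1'), ('0', '0')]

# Fuse each pair into a single chunk; lone chunks are left untouched.
fused = pair.sub(lambda m: r"\033[" + m.group(1) + ";" + m.group(2) + "m", text)
print(fused)
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0;0m
```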
I have two kinds of documents to parse:
1545994641 INFO: ...
and
'{"deliveryDate":"1545994641","error"..."}'
I want to extract the timestamp 1545994641 from each of them.
So, I decided to write a regex to match both cases:
(\d{10}\s|\"\d{10}\")
In the 1st kind of document, it matches the timestamp and groups it, using the first expression in the "or" above (\d{10}\s):
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg="1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '
(So far so good.)
However, in the 2nd kind, using the second expression in the "or" (\"\d{10}\") it matches the timestamp and quotation marks, grouping them. But I just want the timestamp, not the "":
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'
What I tried:
I decided to use a non-capturing group for the quotation marks:
(\d{10}\s|(?:\")\d{10}(?:\"))
but it doesn't work as the outer group catches them.
I also removed the outer group, but the result is the same.
Unwanted ways to solve:
I can work around this by creating a group for each expression in the or,
but I just want it to output a single group (to abstract the code
from the regex).
I could also use a 2nd step of regex to capture the timestamp from
the group that has the quotation marks, but again that would break
the code abstraction.
I could omit the "" in the regex, but that would also match a timestamp in the middle of the message; I only want to capture the timestamp when it is the value of a key, or when it is at the beginning of the document followed by a space.
Is there a way I can match both cases above but, in the case it matches the second case, return only the timestamp? Or is it impossible?
EDIT:
As noticed by @Amit Bhardwaj, the first case also returns a space after the timestamp. That is another problem I haven't figured out, probably with the same solution!
You may use lookarounds if your code can only access the whole match:
^\d{10}(?=\s)|(?<=")\d{10}(?=")
In Python, declare it as
rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'
Pattern details
^\d{10}(?=\s):
^ - string start
\d{10} - ten digits
(?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
| - or
(?<=")\d{10}(?="):
(?<=") - a positive lookbehind that requires a double quotation mark immediately to the left of the current location
\d{10} - ten digits
(?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
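As a quick sanity check, the pattern can be run against both kinds of document from the question. This is a minimal sketch; the names messages and results are only for illustration.

```python
import re

rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')

messages = [
    '1545994641 INFO: ...',
    '{"deliveryDate":"1545994641","error"..."}',
]
# group(0) is the whole match; the lookarounds keep the space and the
# quotation marks out of it.
results = [rx.search(msg).group(0) for msg in messages]
print(results)
# ['1545994641', '1545994641']
```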
You could use lookarounds, but I think this solution is simpler, if you can just get the group:
"?(\d{10})(?:\"|\s)
EDIT:
Considering that if there is an opening " there must also be a closing ", try this:
(^\d{10}\s|(?<=\")\d{10}(?=\"))
EDIT 2:
To also remove the trailing space in the end, use a lookahead too:
(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))
I'm using python regular expressions to extract dimensional information from a database. The entries in that column look like this:
23 cm
43 1/2 cm
20cm
15 cm x 30 cm
What I need from this is only the width of the entry (so for the entries with an 'x', only the first number), but as you can see the values are all over the place.
From what I understood in the documentation, you can access the groups in a match using their position, so I was thinking I could determine the type of the entry based on how many groups are returned and what is found at each index.
The expression I used so far is ^(\d{2})\s?(x\s?(\d{2}))?(\d+/\d+)?$, however it's not perfect and it returns a number of useless groups. Is there something more efficient and appropriate?
Edit: I need the number from every line. When there is only one number, it is implied that only the width was measured (including any fractional components such as line 2). When there are two numbers, the height was also measured, but I only need the width which is the first number (such as in the last line)
Try the regex below; it captures the first run of digits, plus an optional fraction after it, up to the first 'cm'.
import re
regex = re.compile(r'(\d+.*?)\s?cm')  # this works for all your example data
# or
# this asserts that whatever comes after the 1st digit group must be a fractional number only
regex = re.compile(r'(\d+(?:\s+\d+/\d+)?)\s?cm')
>>> regex.match('23 cm').group(1)
'23'
>>> regex.match('43 1/2 cm').group(1)
'43 1/2'
>>> regex.match('20cm').group(1)
'20'
>>> regex.match('15 cm x 30 cm').group(1)
'15'
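If the width is needed as a number rather than as text, the captured group can be converted afterwards. The following is only a sketch that goes one step beyond the answer: width_cm and WIDTH_RE are hypothetical names, and using fractions.Fraction for the "43 1/2" case is an assumption about what is wanted.

```python
import re
from fractions import Fraction

# The stricter of the two patterns above.
WIDTH_RE = re.compile(r'(\d+(?:\s+\d+/\d+)?)\s?cm')

def width_cm(entry):
    """Return the width of a database entry as a Fraction, or None if absent."""
    m = WIDTH_RE.match(entry)
    if m is None:
        return None
    # '43 1/2' -> Fraction(43) + Fraction(1, 2); '23' -> Fraction(23)
    return sum(Fraction(part) for part in m.group(1).split())

for entry in ['23 cm', '43 1/2 cm', '20cm', '15 cm x 30 cm']:
    print(entry, '->', float(width_cm(entry)))
```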
This regex should work for the entries that contain an x:
^(\d+)(?:\s*cm\s+[xX])
Explanation
^(\d+) - capture at least one digit at the beginning of the line
(?: - start non-capturing group
\s* - followed by zero or more whitespace characters
cm - followed by a literal c and m
\s+ - followed by at least one whitespace character
[xX] - followed by a literal x or X
) - end non-capturing group
You shouldn't need to bother matching the rest of the line.
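Run against the four sample entries from the question, this pattern matches only the entry that contains an x, because cm followed by whitespace and an x is required. A small sketch to show that (pattern, entries and widths are illustrative names):

```python
import re

pattern = re.compile(r'^(\d+)(?:\s*cm\s+[xX])')
entries = ['23 cm', '43 1/2 cm', '20cm', '15 cm x 30 cm']

widths = []
for entry in entries:
    m = pattern.match(entry)
    # Entries without "cm x" do not match at all, so record None for them.
    widths.append(m.group(1) if m else None)

print(widths)
# [None, None, None, '15']
```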
Here's a sample of how to do it from a text file.
It works for the provided data.
with open("textfile.txt", "r") as f:
    for line in f:
        if 'x' in line:
            iposition = line.find('x')
            print(line[:iposition])
This is my first post and I am a newbie to Python. I am trying to get this to work.
string 1 = [1/0/1, 1/0/2]
string 2 = [1/1, 1/2]
Trying to check the string if I see two / then I just need to replace the 0 with 1 so it becomes 1/1/1 and 1/1/2.
If I don't have two / then I need to add one in along with a 1 and change it to the format 1/1/1 and 1/1/2 so string 2 becomes [1/1/1,1/1/2]
The ultimate goal is to get all strings to match the pattern x/1/x. Thanks for all the input on this. I tried this and it seems to work:
for a in Port:
    if re.search(r'././', a):
        z.append(a.replace('/0/', '/1/'))
    else:
        t1 = a.split('/')
        if len(t1) > 1:
            t2 = t1[0] + "/1/" + t1[1]
            z.append(t2)
A few of the lines are there to take care of some exceptions, but it seems to do the job.
The regex pattern for identifying a / is just / itself; in Python it needs no escaping, although \/ also works.
This could be solved rather simply using the built in string functions without having to add all of the overhead and additional computational time caused by using the RegEx engine.
For example:
# The string to test:
sTest = '1/0/2'
# Test the string:
if sTest.count('/') == 2:
    # There are two forward slashes in the string
    # If the middle number is a 0, we'll replace it with a one:
    sTest = sTest.replace('/0/', '/1/')
elif sTest.count('/') == 1:
    # One forward slash in string
    # Insert a 1 between first portion and the last portion:
    sTest = sTest.replace('/', '/1/')
else:
    print('Error: Test string is of an unknown format.')
# End If
If you really want to use RegEx, though, you could simply match the string against these two patterns: \d+/0/\d+ and \d+/\d+(?!/). If matching against the first pattern fails, then attempt to match against the second pattern. Then, you can use either grouping, splitting, or simply calling .replace() (like I'm doing above) to format the string as you need.
EDIT: for clarification, I'll explain the two patterns:
Pattern 1: \d+/0/\d+ could essentially be read as "match any number (consisting of one (1) or more digits) followed by a forward slash, a zero (0), another forward slash and then followed by any number (consisting of one (1) or more digits)."
Pattern 2: \d+/\d+(?!/) could be read as "match any number (consisting of one (1) or more digits) followed by a forward slash and any other number (consisting of one (1) or more digits) which is then NOT followed by another forward slash." The last part in this pattern could be a little confusing because it uses the negative lookahead abilities of the RegEx engine.
If you wanted to add stricter rules to these patterns to make sure there are not any leading or trailing non-digit characters, you could add ^ to the start of the patterns and $ to the end, to signify the start of the string and the end of the string respectively. This would also allow you to remove the lookahead expression from the second pattern ((?!/)). As such, you would end up with the following patterns: ^\d+/0/\d+$ and ^\d+/\d+$.
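A minimal sketch of how the two anchored patterns might be applied, assuming re.fullmatch is used in place of explicit ^ and $ anchors. normalize_port is a hypothetical helper name, and leaving unrecognized strings unchanged is an assumption, not something the question specifies.

```python
import re

def normalize_port(port):
    """Rewrite 'x/0/y' or 'x/y' as 'x/1/y'; return anything else unchanged."""
    # Three-part form with a zero in the middle: swap the 0 for a 1.
    if re.fullmatch(r'\d+/0/\d+', port):
        return port.replace('/0/', '/1/')
    # Two-part form: insert a 1 between the two numbers.
    m = re.fullmatch(r'(\d+)/(\d+)', port)
    if m:
        return f'{m.group(1)}/1/{m.group(2)}'
    return port

ports = ['1/0/1', '1/0/2', '1/1', '1/2']
normalized = [normalize_port(p) for p in ports]
print(normalized)
# ['1/1/1', '1/1/2', '1/1/1', '1/1/2']
```

re.fullmatch only succeeds when the whole string matches, which has the same effect here as wrapping each pattern in ^ and $.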
https://regex101.com/r/rE6oN2/1
Click code generator on the left side. You get:
import re
p = re.compile(r'\d/1/\d')
test_str = "1/1/2"
re.search(p, test_str)