How to match a numeric range out of 23:59? - python

I have possible strings in the format of:
x:y
where x & y could be multiple digits. I want to match the opposite of 23:59, meaning that x must be > 23 & y must be > 59, how to write that pattern? My intention is that, if a string x:y is not like a time format, i.e. 08:23, I want to exclude it. Note that the string could be:
8:23 OR
08:23
Both refer to 8:23am. I have to match the opposite of 23:59, since my program's logic works this way. The following pattern seems to match 0<x<=23 & 0<y<=59:
([0-1][1-9]|2[0-4]):[0-5][0-9]
How to match the opposite of this, if the above regex is correct?

One way using dateutil.parser:
import dateutil.parser as dparser
def is_time(str_):
    try:
        dparser.parse(str_, fuzzy=True)
        return True
    except ValueError:
        return False
times = ["8:23", "08:23", "28:23", "23:61"]
for t in times:
    print(t, is_time(t))
Output:
8:23 True
08:23 True
28:23 False # Wrong hour
23:61 False # Wrong min

Lookaheads might come to the rescue here:
\b(?!23:59)([0-1][0-9]|2[0-3]):[0-5][0-9]\b
The negative lookahead at the very start of the pattern (?!23:59) excludes 23:59, and the rest of the pattern allows all other hours:minutes.

(((2[4-9]|[3-9][0-9]):\d\d)|(\d\d:([6-9][0-9])))

I think the correct Regex for what you want is
([01]?[0-9]|2[0-3]):[0-5][0-9]
and for the opposite of the whole thing you can do the following, negating the accepted set.
(?!([01]?[0-9]|2[0-3]):[0-5][0-9])
?! = Negative lookahead.
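An unanchored negative lookahead succeeds at almost any position in a string, so on its own it says little about the string as a whole. A minimal sketch of one alternative, assuming the goal is to flag strings shaped like x:y that are not valid 24-hour times: match the accepted set with a full-string match and negate the result in Python. The helper name `is_invalid_time` and the two compiled patterns are made up for illustration.

```python
import re

# Valid 24-hour time: hour 0-23 (leading zero optional), minute 00-59.
VALID_TIME = re.compile(r"(?:[01]?[0-9]|2[0-3]):[0-5][0-9]")
# Anything shaped like digits:digits.
TIME_SHAPED = re.compile(r"\d+:\d+")

def is_invalid_time(text):
    """Return True if text looks like x:y but is not a valid 24-hour time."""
    if not TIME_SHAPED.fullmatch(text):
        return False  # not of the form x:y at all
    return VALID_TIME.fullmatch(text) is None

for s in ["8:23", "08:23", "28:23", "23:61"]:
    print(s, is_invalid_time(s))
```

Using fullmatch keeps both patterns free of explicit anchors and avoids accidental partial matches.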

My intention is that, if a string x:y is not like a time format, i.e.
08:23, I want to exclude it.
This to me seems like you just want to attempt to match the two formats and if no match, discard it.
Try this:
https://regex101.com/r/pQGNyj/1
Expression:
^([0-1][0-9]|[2][0-3]):([0-5][0-9])|((?<!\d)[1]*[0-9]|[2][0-3]):([0-5][0-9])$
It might be what you're after...

Related

extract hour from a string _ unclear format

This question may be a duplicate, but I didn't find an exact solution for it. I have this type of string that includes a date and time.
"check_in": "10/25/2019 14:30"
I need to extract the hour from it, but the string is not always in a valid format. I tried these patterns so far, but they include the ":" character.
\d+?(:)
(\d+:)
(\d+)*:
Regular expressions aren't always the best way to deal with strings representing dates, especially if you can't rely on the input format to be consistent. Use a specialized parser instead:
>>> from dateutil import parser
>>> parser.parse("10/25/2019 14:30").hour
14
>>> parser.parse("10/25/2019 2:30 PM").hour
14
>>> parser.parse("2019-10-25T143000").hour
14
The module dateutil isn't in the standard library but is well worth the trouble of downloading.
\d+(?=:)
You don't need to match the :, but you do need to check for it, so use a positive lookahead (?=:).
First, this is what is wrong with your regexes:
\d+?(:) - finds number and colon (14:) and puts the colon into a group
(\d+:) - finds number and colon (14:) and puts all of it into a group
(\d+)*: - finds (optionally, because of *) number and colon (14:) and puts the number into a group
So, the last one could work:
>>> import re
>>> match = re.search(r'(\d+)*:', "10/25/2019 14:30")
>>> match.group(0) # whole result
'14:'
>>> match.group(1) # just the number
'14'
But then again, it would give a wrong result (instead of failing) on something like "time: 14:30", making the error difficult to debug later. What you want is a stricter search, e.g. matching the whole string and labelling all groups:
>>> regex = r'(?P<month>\d\d)/(?P<day>\d\d)/(?P<year>\d{4}) (?P<hour>\d\d):(?P<minute>\d\d)'
>>> re.search(regex, "10/25/2019 14:30").group('hour')
'14'
Another, easier and even safer way is to use strptime:
>>> import datetime
>>> datetime.datetime.strptime("10/25/2019 14:30", "%m/%d/%Y %H:%M")
datetime.datetime(2019, 10, 25, 14, 30)
Now you have the complete datetime object and you can extract the .hour if you want.
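Since the question says the input is not always in a valid format, the strptime call can be wrapped so that bad input is reported instead of raising. A small sketch; the helper name `extract_hour` and its default format string are assumptions for illustration, not part of the original answer.

```python
from datetime import datetime

def extract_hour(text, fmt="%m/%d/%Y %H:%M"):
    """Return the hour as an int, or None if text does not match fmt."""
    try:
        return datetime.strptime(text, fmt).hour
    except ValueError:
        # strptime raises ValueError when the string does not fit the format
        return None

print(extract_hour("10/25/2019 14:30"))  # 14
print(extract_hour("time: 14:30"))       # None
```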

How to test a string that only contains alphabets and numbers?

I am trying to test whether a string contains only letters or only numbers. The following statement should return False, but it doesn't. What am I doing wrong?
bool(re.match('[A-Z\d]', '2ae12'))
Just use the string method isalnum(), it does exactly what you want.
While not regex, you can use the very concise str.isalnum():
s = "sdfsdfq34sd"
print(s.isalnum())
Output:
True
However, if you do want a pure regex solution:
import re
if re.findall('^[a-zA-Z0-9]+$', s):
    pass  # string contains only letters and digits
Using a dataframe solution, courtesy of @Wen:
import pandas as pd
df = pd.DataFrame({'col1': ["sdfsdfq34sd", "sdfsdfq###34sd", "sdfsdf!q34sd", "sdfsdfq34s#d"]})
df.col1.apply(lambda x: x.isalnum())
Pandas answer: Consider this df
col
0 2ae12
1 2912
2 da2ae12
3 %2ae12
4 #^%6f
5 &^$*
You can select the rows that contain only alphabets or numbers using
df[~df.col.str.contains('(\W+)')]
You get
col
0 2ae12
1 2912
2 da2ae12
If you just want a boolean column, use
~df.col.str.contains('(\W+)')
0 True
1 True
2 True
3 False
4 False
5 False
If you are looking to return True if the string is either all digits or all letters, you can do:
for case in ('abcdefg','12345','2ae12'):
    print(case, case.isalpha() or case.isdigit())
Prints:
abcdefg True
12345 True
2ae12 False
If you want the same logic with a regex, you would do:
import re
for case in ('abcdefg','12345','2ae12'):
    print(case, bool(re.search(r'^(?:[a-zA-Z]+|\d+)$', case)))
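Your regex only needs to match one character, and re.match only anchors at the start of the string, so it succeeds as soon as the first character ('2') is in the class. The \d is still interpreted as the digit class, because Python leaves unknown escapes such as \d untouched in a normal string, though a raw string is the safer habit.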
If you really want to use a regex here's how I would do it;
def isalphanum(test_str):
    alphanum_re = re.compile(r"[0-9A-Z]+", re.I)
    # fullmatch requires the whole string to match, not just a prefix
    return bool(alphanum_re.fullmatch(test_str))
Let's focus on the alphanum regex. I compiled it from a raw string literal, indicated by the r prefix. A raw string leaves backslashes alone, meaning r"\n" is a backslash followed by the letter n instead of a newline. This is helpful when writing regexes, and some text editors will even change the syntax highlighting of a raw string to highlight features in the regex to help you out. The re.I flag ignores the case of the test string, so [A-Z] will match A through Z in either upper or lower case.
The simpler, Zen of Python solution involves invoking the isalnum method of the string;
test_str = "abc123"
test_str.isalnum()
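To make the difference concrete, here is a short sketch of why the statement from the question returns True and how a full-string match changes that. The helper name `is_ascii_alnum` is made up for illustration; it also shows one way isalnum() and an ASCII-only character class can disagree.

```python
import re

s = "2ae12"

# re.match only needs the pattern to match at the start of the string,
# so a single character from the class is enough.
print(bool(re.match(r"[A-Z\d]", s)))       # True

# fullmatch requires every character to come from the class.
print(bool(re.fullmatch(r"[A-Z\d]+", s)))  # False: 'a' and 'e' are lowercase

def is_ascii_alnum(text):
    """Return True if text is non-empty and only ASCII letters and digits."""
    return re.fullmatch(r"[A-Za-z0-9]+", text) is not None

print(is_ascii_alnum(s))                   # True
# str.isalnum() also accepts non-ASCII letters; the regex above does not.
print("é".isalnum(), is_ascii_alnum("é"))  # True False
```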
You need to check whether the string is made up entirely of letters or entirely of digits:
df['some_column'].str.match(r'^(?:[A-Za-z]+|\d+)$')
As dawg has suggested, you can also use isalpha and isdigit, combining the two boolean Series with | rather than or:
df['some_column'].str.isalpha() | df['some_column'].str.isdigit()

Optional grouping in a simple python regex

All I want to do is search a string for instances of two consecutive digits. If such an instance is found I want to group it, otherwise return None for that particular group. I thought this would be trivial, but I can't understand where I'm going wrong. In the example below, removing the optional (?) character gets me the numbers, but in strings without numbers, r evaluates to None, so r.groups() throws an exception.
p = re.compile(r'(\d{2})?')
r = p.search('wqddsel78ffgr')
print(r.groups())
>>>(None, ) # why not ('78', )?
# --- update/clarification --- #
Thanks for the answers, but the explanations given are leaving me none the wiser. Here's another go at pinpointing exactly what it is I don't understand.
pattern = re.compile(r'z.*(A)?')
_string = "aazaa90aabcdefA"
result = pattern.search(_string)
result.group()
>>> zaa90aabcdefA
result.groups()
>>> (None, )
I understand why result.group() produces the result it does, but why doesn't result.groups() produce ('A', )? I thought it worked like this: once the regex hits the z it then matches right to the end of the line using .*. In spite of .* matching everything, the regex engine is aware that it passed over an optional group, and since ? means it will try to match if it can, it should work backwards to try and match. Replacing ? with + does return ('A', ). This suggests that ? won't try and match if it doesn't have to, but this seems to contrast with much of what I've read on the subject (esp. J. Friedl's excellent book).
This works for me:
p = re.compile(r'\D*(\d{2})?')
r = p.search('wqddsel78ffgr')
print(r.groups()) # ('78',)
r = p.search('wqddselffgr')
print(r.groups()) # (None,)
Use regex pattern
(\d{2}|(?!.*\d{2}))
If you want to be sure there are exactly 2 consecutive digits and not 3 or more, go with
((?<!\d)\d{2}(?!\d)|(?!.*(?<!\d)\d{2}(?!\d)))
The ? makes your regex match the empty string. If you omit it, you could just check the result like this:
p = re.compile(r'(\d{2})')
r = p.search('wqddsel78ffgr')
print(r.groups() if r else ('',))
Remember that you can search for all matches of a RE in a string easily using findall():
re.findall(r'\d{2}', 'wqddsel78ffgr') # => ['78']
If you don't need the positions where the match occurs, this seems like a simpler way to accomplish what you're doing.
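? means 0 or 1 repetitions. search() tries each position from left to right; at position 0 the group cannot match two digits, so the regex falls back to 0 repetitions and matches the empty string right there :) The search stops at that first match and never reaches the 78.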
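The behaviour is easy to observe by printing where each match lands. A minimal sketch using the strings from the question; the helper `first_two_digits` is a hypothetical name showing one way to get the digits or None without an optional group.

```python
import re

text = "wqddsel78ffgr"

# With the optional group, the empty match at position 0 wins.
m = re.search(r"(\d{2})?", text)
print(m.span(), m.groups())   # (0, 0) (None,)

# Without '?', search() has to scan forward until it finds two digits.
m = re.search(r"(\d{2})", text)
print(m.span(), m.groups())   # (7, 9) ('78',)

# Same effect in the second example: the greedy .* consumes the final 'A',
# and the optional group is satisfied with zero repetitions, so nothing
# forces the engine to backtrack.
print(re.search(r"z.*(A)?", "aazaa90aabcdefA").groups())  # (None,)
print(re.search(r"z.*(A)", "aazaa90aabcdefA").groups())   # ('A',)

def first_two_digits(s):
    """Return the first run of two digits in s, or None if there is none."""
    m = re.search(r"\d{2}", s)
    return m.group() if m else None

print(first_two_digits(text), first_two_digits("wqddselffgr"))  # 78 None
```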

Python Regex match or potential match

Question:
How do I use Python's regular expression module (re) to determine if a match has been made, or that a potential match could be made?
Details:
I want a regex pattern which searches for a pattern of words in a correct order regardless of what's between them. I want a function which returns Yes if found, Maybe if a match could still be found or No if no match can be found. We are looking for the pattern One|....|Two|....|Three, here are some examples (Note the names, their count, or their order are not important, all I care about is the three words One, Two and Three, and the acceptable words in between are John, Malkovich, Stamos and Travolta).
Returns YES:
One|John|Malkovich|Two|John|Stamos|Three|John|Travolta
Returns YES:
One|John|Two|John|Three|John
Returns YES:
One|Two|Three
Returns MAYBE:
One|Two
Returns MAYBE:
One
Returns NO:
Three|Two|One
I understand the examples are not airtight, so here is what I have for the regex to get YES:
if re.match(r'One\|(John\||Malkovich\||Stamos\||Travolta\|)*Two\|(John\||Malkovich\||Stamos\||Travolta\|)*Three\|(John\||Malkovich\||Stamos\||Travolta\|)*', 'One|John|Malkovich|Two|John|Stamos|Three|John|Travolta') is not None:
    return 'Yes'
Obviously if the pattern is Three|Two|One the above will fail, and we can return No, but how do I check for the Maybe case? I thought about nesting the parentheses, like so (note, not tested)
if re.match(r'One\|((John\||Malkovich\||Stamos\||Travolta\|)*Two(\|(John\||Malkovich\||Stamos\||Travolta\|)*Three\|(John\||Malkovich\||Stamos\||Travolta\|)*)*)*', 'One|John|Malkovich|Two|John|Stamos|Three|John|Travolta') is not None:
    return 'Yes'
But I don't think that will do what I want it to do.
More Details:
I am not actually looking for Travoltas and Malkovichs (shocking, I know). I am matching against inotify patterns such as IN_MOVE, IN_CREATE, IN_OPEN. I am logging them and getting hundreds of them, and then I go in and look for a particular pattern such as IN_ACCESS...IN_OPEN....IN_MODIFY, but in some cases I don't want an IN_DELETE after the IN_OPEN and in others I do. I'm essentially pattern matching to use inotify to detect when text editors go wild and try to crush programmers' souls by doing a temporary-file-swap-save instead of just modifying the file. I don't want to free up those logs instantly, but I only want to hold on to them for as long as is necessary. Maybe means don't erase the logs. Yes means do something, then erase the log, and No means don't do anything but still erase the logs. As I will have multiple rules for each program (i.e. vim vs. gedit vs. emacs), I wanted to use a regular expression, which would be more human readable and easier to write than creating a massive tree or, as user Joel suggested, just going over the words with a loop.
I wouldn't use a regex for this. But it's definitely possible:
regex = re.compile(
    r"""^               # Start of string
    (?:                 # Match...
      (?:               # one of the following:
        One()           # One (use empty capturing group to indicate match)
      |                 # or
        \1Two()         # Two if One has matched previously
      |                 # or
        \1\2Three()     # Three if One and Two have matched previously
      |                 # or
        John            # any of the other strings
      |                 # etc.
        Malkovich
      |
        Stamos
      |
        Travolta
      )                 # End of alternation
      \|?               # followed by optional separator
    )*                  # any number of repeats
    $                   # until the end of the string.""",
    re.VERBOSE)
Now you can check for YES and MAYBE by checking if you get a match at all:
>>> yes = regex.match("One|John|Malkovich|Two|John|Stamos|Three|John|Travolta")
>>> yes
<_sre.SRE_Match object at 0x0000000001F90620>
>>> maybe = regex.match("One|John|Malkovich|Two|John|Stamos")
>>> maybe
<_sre.SRE_Match object at 0x0000000001F904F0>
And you can differentiate between YES and MAYBE by checking whether all of the groups have participated in the match (i. e. are not None):
>>> yes.groups()
('', '', '')
>>> maybe.groups()
('', '', None)
And if the regex doesn't match at all, that's a NO for you:
>>> no = regex.match("Three|Two|One")
>>> no is None
True
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
Perhaps an algorithm like this would be more appropriate. Here is some pseudocode.
matchlist.current = matchlist.first()
for each word in input
    if word = matchlist.current
        matchlist.current = matchlist.next() // assuming next returns null if at end of list
    else if not allowedlist.contains(word)
        return 'No'
if matchlist.current = null // we hit the end of the list
    return 'Yes'
return 'Maybe'
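The pseudocode above could be sketched in Python roughly as follows. This is one possible reading of it, not the answerer's own code: the function name `classify`, its parameters, and the choice of '|' as separator (taken from the question's examples) are assumptions.

```python
def classify(events, required, allowed):
    """Return 'Yes', 'Maybe' or 'No' for a '|'-separated event string.

    required: words that must appear in this order.
    allowed:  filler words that may appear anywhere.
    """
    remaining = list(required)      # required words not yet seen, in order
    for word in events.split("|"):
        if remaining and word == remaining[0]:
            remaining.pop(0)        # found the next required word
        elif word not in allowed:
            return "No"             # unexpected or out-of-order word
    return "Maybe" if remaining else "Yes"

REQUIRED = ["One", "Two", "Three"]
ALLOWED = {"John", "Malkovich", "Stamos", "Travolta"}

for s in ("One|John|Two|John|Three|John", "One|Two", "Three|Two|One"):
    print(s, classify(s, REQUIRED, ALLOWED))
```

A separate rule per editor would then just be a different pair of required and allowed collections.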

how do I include a boolean AND within a regex?

Is there a way to get single regex to satisfy this condition??
I am looking for a "word" that has three letters from the set MBDPI, in any order,
but MUST contain an I.
ie.
re.match("[MBDPI]{3}", foo) and "I" in foo
So this is the correct result (in python using the re module), but can I get this from a single regex?
>>> for foo in ("MBI", "MIB", "BIM", "BMI", "IBM", "IMB", "MBD"):
...     print(foo, re.match("[MBDPI]{3}", foo) and "I" in foo)
MBI True
MIB True
BIM True
BMI True
IBM True
IMB True
MBD False
with regex I know I can use | as a boolean OR operator, but is there a boolean AND equivalent?
or maybe I need some forward or backward lookup?
You can fake boolean AND by using lookaheads. According to http://www.regular-expressions.info/lookaround2.html, this will work for your case:
r"\b(?=[MBDPI]{3}\b)\w*I\w*"
with regex I know I can use | as a boolean OR operator, but is there a boolean AND equivalent?
A and B = not ( not A or not B) = (?![^A]|[^B])
A and B being expressions that actually may have members in common.
Or is about the only thing you can do:
\b(I[MBDPI]{2}|[MBDPI]I[MBDPI]|[MBDPI]{2}I)\b
The \b character matches a zero-width word boundary. This ensures you match something that is exactly three characters long.
You're otherwise running into the limits to what a regular language can do.
An alternative is to match:
\b[MBDPI]{3}\b
capture that group and then look for an I.
Edit: for the sake of having a complete answer, I'll adapt Jens' answer that uses Testing The Same Part of a String for More Than One Requirement:
\b(?=[MBDPI]{3}\b)\w*I\w*
with the word boundary checks to ensure it's only three characters long.
This is a bit more of an advanced solution and applicable in more situations but I'd generally favour what's easier to read (being the "or" version imho).
You could use lookahead to see if an I is present:
(?=[MBDPI]{0,2}I)[MBDPI]{3}
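As a quick check, the lookahead pattern can be run against the words from the question. A minimal sketch; the helper name `has_three_with_i` is made up, and re.fullmatch is used here so that the word must be exactly three characters long.

```python
import re

# The lookahead asserts that an I occurs within the first three letters;
# the rest of the pattern requires exactly three letters from the set.
PATTERN = re.compile(r"(?=[MBDPI]{0,2}I)[MBDPI]{3}")

def has_three_with_i(word):
    """Return True if word is three letters from MBDPI and contains an I."""
    return PATTERN.fullmatch(word) is not None

for foo in ("MBI", "MIB", "BIM", "BMI", "IBM", "IMB", "MBD"):
    print(foo, has_three_with_i(foo))
```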
