Regular expressions: How to make my code match the '+' character OR digits - python

I've just started on regex.
I'm trying to search through a short list of 'phrases' to find UK mobile numbers (starting with +44 or 07, sometimes with the number broken up by one space). I'm having trouble getting it to return numbers starting +44.
This is what I've written:
for snippet in phrases:
    match = re.search("\\b(\+44|07)\\d+\\s?\\d+\\b", snippet)
    if match:
        numbers.append(match)
        print(match)
which prints
<_sre.SRE_Match object; span=(19, 31), match='07700 900432'>
<_sre.SRE_Match object; span=(20, 31), match='07700930710'>
and misses out the number +44770090999 which is in 'phrases.'
I tried with and without the brackets. Without the brackets it would also print the +44 in sums like '10+44=54.' Is the backslash before the +44 necessary? Any ideas on what I'm missing?
Thanks to all!
EDIT: Some of my input:
phrases = ["You can call me on 07700 900432.",
           "My mobile number is 07700930710",
           "My date of birth is 07.08.92",
           "Why not phone me on 202-555-0136?",
           "There are around 7600000000 people on Earth",
           "If you're from overseas, call +44 7700 900190",
           "Try calling +447700900999 now!",
           "56+44=100."]

In your regex, the word boundary \b does not match between a whitespace character and a plus sign.
What you could do is match either 07 or +44, then one or more digits or spaces with [\d ]+, then a single digit \d so that the match cannot end in a space, and finally a word boundary \b.
(?:07|\+44)[\d ]+\d\b
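As a quick check, the pattern above can be run against the sample input from the question. This is only a minimal sketch: the list below is the question's sample input with each phrase as a separate element, and the names pattern and found are illustrative.

```python
import re

phrases = [
    "You can call me on 07700 900432.",
    "My mobile number is 07700930710",
    "My date of birth is 07.08.92",
    "Why not phone me on 202-555-0136?",
    "There are around 7600000000 people on Earth",
    "If you're from overseas, call +44 7700 900190",
    "Try calling +447700900999 now!",
    "56+44=100.",
]

# Prefix 07 or +44, then digits/spaces, ending on a digit at a word boundary.
pattern = re.compile(r"(?:07|\+44)[\d ]+\d\b")

found = []
for snippet in phrases:
    match = pattern.search(snippet)
    if match:
        found.append(match.group())

print(found)
# ['07700 900432', '07700930710', '+44 7700 900190', '+447700900999']
```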

The problem with your regex is that the first \b matches the word boundary between the + and the 4. The position between a space and a + is not a word boundary, so the regex can't find +44 after the \b: the + is to the left of the boundary and only 44 is to its right.
To fix this, you can use a negative lookbehind to make sure there is no word character immediately before +44. Put it inside the capturing group so that it applies only when the +44 alternative is chosen. You still want a word boundary if the number starts with 07.
((?<!\w)\+44|\b07)\d+\s?\d+\b
You can put the regex in an r"" string. This way you don't have to write as many backslashes:
r"((?<!\w)\+44|\b07)\d+\s?\d+\b"
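The difference between the word boundary and the negative lookbehind can be checked directly. The following is a small sketch that assumes the lookbehind is written (?<!\w); the sample strings come from the question, and the variable names are illustrative.

```python
import re

# The question's pattern: \b must sit between a word and a non-word character.
with_boundary = re.compile(r"\b(\+44|07)\d+\s?\d+\b")
# Lookbehind variant: +44 must not be preceded by a word character.
with_lookbehind = re.compile(r"((?<!\w)\+44|\b07)\d+\s?\d+\b")

text = "Try calling +447700900999 now!"

# There is no word boundary between the space and '+', so nothing is found.
print(with_boundary.search(text))                  # None
print(with_lookbehind.search(text).group())        # +447700900999

# Numbers starting with 07 still match, and the sum is still rejected.
print(with_lookbehind.search("You can call me on 07700 900432.").group())
print(with_lookbehind.search("56+44=100."))        # None
```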

This should help.
import re
phrases = ["Hello +4407700 900432 World", "Hello +44770090999 World"]
for snippet in phrases:
    match = re.search(r"(?P<num>(\+44|07)\d+\s?\d+)", snippet)
    if match:
        print(match.group('num'))
Output:
+4407700 900432
+44770090999

You should be able to cover all cases by removing the expected "noisy characters" from the string and simplifying your regex to just "(07|\D44)\d{9}", where:
(07|\D44) matches a number starting with either 07 or 44 preceded by a non-digit character.
\d{9} matches the remaining 9 digits.
Your code should look like this:
cleansnippet = snippet.replace("-","").replace(" ","").replace("(0)","")...
re.search(r"(07|\D44)\d{9}", cleansnippet)
Applying this to your input retrieves this:
<_sre.SRE_Match object; span=(14, 25), match='07700900432'>
<_sre.SRE_Match object; span=(16, 27), match='07700930710'>
<_sre.SRE_Match object; span=(25, 37), match='+44770090019'>
<_sre.SRE_Match object; span=(10, 22), match='+44770090099'>
Hope that helps.
P.S.:
The \ before the + means that you are specifically looking for a literal + sign instead of "1 or more" of the previous element.
The only reason I propose \D44 instead of \+44 is that it could be safer for you, as people might forget to type the + before their number. :)
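The cleaning step and the simplified pattern can be sketched as follows. The helper name clean is hypothetical, and it strips only the three "noisy" sequences spelled out in the answer; any further replacements the answer leaves out are not guessed at here.

```python
import re

def clean(snippet):
    # Hypothetical helper: remove only hyphens, spaces and "(0)".
    return snippet.replace("-", "").replace(" ", "").replace("(0)", "")

pattern = re.compile(r"(07|\D44)\d{9}")

phrases = [
    "You can call me on 07700 900432.",
    "My mobile number is 07700930710",
    "My date of birth is 07.08.92",
    "Why not phone me on 202-555-0136?",
    "There are around 7600000000 people on Earth",
    "If you're from overseas, call +44 7700 900190",
    "Try calling +447700900999 now!",
    "56+44=100.",
]

matches = []
for snippet in phrases:
    m = pattern.search(clean(snippet))
    if m:
        matches.append(m.group())

print(matches)
# ['07700900432', '07700930710', '+44770090019', '+44770090099']
```

Note that with \d{9} the two +44 matches stop one digit short of the full number, as the output in the answer also shows, so the digit count may need adjusting for the +44 form.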

Related

Apply a look ahead in regex that should be followed by the specified pattern and give a match other wise a no match

Hello, I am new to regex. I need to apply a regex to a string of US zip codes, which we get by concatenating the rows of a pandas column.
For example, with zip being the header of the column:
zip
you have some thing
70456
90876
78905
we get the single literal string zip you have some thing 70456 90876 78905, which should be matched by a regex that expects some characters followed by one or more 5-digit numbers separated by whitespace.
So I wrote a simple regex, '.*zip.*(\d{5}|\s)*' (zip followed by any number of 5-digit groups), but with re.fullmatch it also matches zip 123456, where zip is followed by a 6-digit code.
For that reason I thought of using a lookahead assertion, but I can't work out how to use it properly: it does not give the matches I expect. I also tried a lookbehind with re.search, but that seems to fail as well. Can someone give me a regex that requires the word zip and only 5-digit numbers at the end (or possibly a nan)?
Here is the code I have written:
re.match('(?=zip)(\d{5}|\s)*','zip 123456')
<re.Match object; span=(0, 0), match=''>
re.search('(?<=zip)(\d{5}|\s)*','zip 123456')
<re.Match object; span=(3, 9), match=' 12345'>
Can someone tell me how to write a regex that gives a match if the word zip is followed by groups of exactly 5 digits, and None otherwise? The text may contain any alphanumeric characters around zip, but the numeric codes after it must have 5 digits.
You can use
re.search(r'\bzip\b\D*\d{5}(?:\s+\d{5})*\b', text)
See the regex demo. If you want to also capture the ZIPs, you can use a capturing group:
re.search(r'\bzip\b\D*(\d{5}(?:\s+\d{5})*)\b', text)
See this regex demo.
Details:
\b - a word boundary
zip - a zip string
\b - a word boundary
\D* - zero or more chars other than digits as many as possible
\d{5} - five digits
(?:\s+\d{5})* - zero or more sequences of one or more whitespaces and then five digits
\b - a word boundary
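The pattern with the capturing group can be tried on the two strings from the question. This is a minimal sketch; the variable names good and bad are illustrative.

```python
import re

pattern = re.compile(r'\bzip\b\D*(\d{5}(?:\s+\d{5})*)\b')

good = pattern.search('zip you have some thing 70456 90876 78905')
bad = pattern.search('zip 123456')

print(good.group(1))          # 70456 90876 78905
print(good.group(1).split())  # ['70456', '90876', '78905']
print(bad)                    # None: the trailing \b fails between 5 and 6
```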
I suggest using a word boundary (\b) as follows:
import re
t1 = 'zip 1234' # less than 5, should not match
t2 = 'zip 12345' # should match
t3 = 'zip 123456' # more than 5, should not match
pattern = r'zip\s\d{5}\b'
print(re.search(pattern, t1)) # None
print(re.search(pattern, t2)) # <re.Match object; span=(0, 9), match='zip 12345'>
print(re.search(pattern, t3)) # None
\b is a zero-length assertion, useful for making sure you match a complete word rather than just part of one. See the re docs for details of how \b works.

Regular expression to find a date substring Python 3.7

I'm trying to write a regular expression to find a specific substring within a string.
I'm looking for dates in the following format:
"January 1, 2018"
I have already done some research but have not been able to figure out how to make a regular expression for my specific case.
The current version of my regular expression is
re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string)
I'm fairly inexperienced with regular expressions, but from reading the documentation this is what I could come up with for matching the date format I'm working with.
Here is my thought process behind my regular expression:
\w should match any Unicode word character and * should repeat the previous match, so together these should match something like "January". ? makes * non-greedy, so it won't try to match anything in the form of January 20; that is, it should stop at the first whitespace character.
\s should match white space.
\d\d and \d\d\d\d should match a two digit and four digit number respectively.
Here's a testable sample of my code:
import re
my_string = "January 01, 1990\n By SomeAuthor"
print(re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string))
EDIT:
I have also tried :[A-Za-z]\s\d{1,2}\s\d{2, 4}
Your pattern may be a bit greedy in certain areas like in the month name. Also, you're missing the optional comma. Finally, you can use the ignore case flag to simplify your pattern. Here is an example using re in verbose mode.
import re
text = "New years day was on January 1, 2018, and boy was it a good time!"
pattern = re.compile(r"""
    [a-z]+   # one or more ASCII letters (the ignore-case flag is set)
    \s       # one whitespace character
    \d\d?    # one or two digits
    ,?       # an optional comma
    \s       # one whitespace character
    \d{4}    # four digits (year)
    """, re.IGNORECASE | re.VERBOSE)
result = pattern.search(text).group()
print(result)
output
January 1, 2018
Try
In [992]: my_string = "January 01, 1990\n By SomeAuthor"
...: print(re.search("[A-Z][a-z]+\s+\d{1,2},\s+\d{4}", my_string))
...:
<_sre.SRE_Match object; span=(0, 16), match='January 01, 1990'>
[A-Z] is any uppercase letter
[a-z]+ is 1 or more lowercase letters
\s+ is 1 or more whitespace characters
\d{1,2} is at least 1 and at most 2 digits
here:
re.search("\w+\s+\d\d?\s*,\s*\d{4}",date_string)
import re
my_string = "January 01, 1990\n By SomeAuthor"
regex = re.compile(r'\w+\s+\d+, \d{4}')
result = regex.search(my_string)
result is a match object that holds the matched text and its character span.
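To read the matched text and the span out of the match object, group() and span() can be used. A short sketch using the same sample string:

```python
import re

my_string = "January 01, 1990\n By SomeAuthor"
regex = re.compile(r'\w+\s+\d+, \d{4}')
result = regex.search(my_string)

# search() returns None when nothing matches, so check before using it.
if result:
    print(result.group())  # January 01, 1990
    print(result.span())   # (0, 16)
```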

regex capture numbers after varied lengths of spaces

I try to use a non-capturing group to detect the spaces (before the numbers I needed) and not to bring spaces into my result, so I use
(?: 1+)\d*.?\d*
to process my text:
input: kMPCV/epS4SgFoNdLo3LOuClO/URXS/5 134.686356921 2018-06-14 21:50:35.494
input: pRVh7kPpFbtmuwS1NILiCzwHUVwJ4NcK 839.680408921 2018-06-14 22:13:39.996
input: Ga7MIXmXAsrbaEc1Yj60qYYblcRQpnpz 4859.688276920 2018-06-14 23:02:11.125
input: 4mqdb5njytfDOFpgeG3XS0Iv1OXFPEnb 1400.684675920 2018-06-14 23:33:42.031
and try to get the numbers.
But lines 2 and 3 return None, while lines 1 and 4 return the number with one space before it: " 134.686356921"
Why do I get different results? The code is below:
import re
def calcprice(filename):
    try:
        print ('ok')
        f = open(filename, 'r')
        data = f.read()
        rows = data.split('\n')
        for row in rows:
            print (re.search("[(?: 1+)\d*\.?\d*][1]",row))
    except Exception as e:
        print(e)
if __name__ == "__main__": ## If we are not importing this:
    calcprice('dfk balance.txt')
Result:
<_sre.SRE_Match object; span=(52, 66), match=' 134.686356921'>
None
None
<_sre.SRE_Match object; span=(51, 66), match=' 1400.684675920'>
Your current regex is basically one big character set:
[(?: 1+)\d*\.?\d*]
which doesn't make much sense and suggests a misunderstanding of how character sets work: everything inside [...] is treated as a list of single characters to choose from. If you want to match the numbers, it would probably make more sense to look behind for a space, match digits and periods, and look ahead for another space:
(?<= )[\d.]+(?= )
https://regex101.com/r/NRnXWb/1
for row in rows:
    print (re.search(r"(?<= )[\d.]+(?= )",row))
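To make the contrast concrete, here is a small sketch that runs both a single character set and the lookaround pattern over the four rows from the question. The rows are copied as shown there (including the "input: " prefix), and the names char_set, number and numbers are illustrative.

```python
import re

rows = [
    "input: kMPCV/epS4SgFoNdLo3LOuClO/URXS/5 134.686356921 2018-06-14 21:50:35.494",
    "input: pRVh7kPpFbtmuwS1NILiCzwHUVwJ4NcK 839.680408921 2018-06-14 22:13:39.996",
    "input: Ga7MIXmXAsrbaEc1Yj60qYYblcRQpnpz 4859.688276920 2018-06-14 23:02:11.125",
    "input: 4mqdb5njytfDOFpgeG3XS0Iv1OXFPEnb 1400.684675920 2018-06-14 23:33:42.031",
]

# A character set matches exactly one character out of the listed ones.
char_set = re.compile(r"[(?: 1+)\d*\.?\d*]")
# Digits and periods with a space on each side; the spaces are not consumed.
number = re.compile(r"(?<= )[\d.]+(?= )")

for row in rows:
    print(repr(char_set.search(row).group()), number.search(row).group())

numbers = [number.search(row).group() for row in rows]
```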
Try the regex \b(\d+[\d\.]*)\b
Your regex doesn't do what you're trying to do.
Try this pattern (note the literal space before each +): " +(\d+(\.\d+)?) +".
Explanation: the pattern matches a number preceded and followed by one or more spaces ( +). The decimal part is optional ((\.\d+)?) and becomes the second capturing group of the match (you won't need it anyway).
In every match, the first capturing group \1 will be your number.
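A short sketch of reading the number out of the first capturing group, using the fourth row from the question. It assumes the pattern starts with a literal space, as the explanation describes.

```python
import re

row = "input: 4mqdb5njytfDOFpgeG3XS0Iv1OXFPEnb 1400.684675920 2018-06-14 23:33:42.031"

# One or more spaces, the number (group 1), then one or more spaces.
pattern = re.compile(r" +(\d+(\.\d+)?) +")
match = pattern.search(row)

print(repr(match.group(0)))  # ' 1400.684675920 ' (whole match, spaces included)
print(match.group(1))        # 1400.684675920
print(match.group(2))        # .684675920 (the optional decimal part)
```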
Your regex [(?: 1+)\d*\.?\d*][1] consists of two character classes in a row.
If the number you want to match always contains a dot, you could use a word boundary and a positive lookahead to assert that what follows is a whitespace:
\b\d+\.\d+(?= )
If it could also be without a dot, you could check for a leading and a trailing whitespace using lookarounds and make the part that matches a dot followed by one or more digits optional: (?:\.\d+)?.
(?<= )\d+(?:\.\d+)?(?= )
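The two variants differ only for numbers without a decimal part. A minimal sketch follows; the second row is a hypothetical input invented for illustration, not one of the question's rows.

```python
import re

with_dot = re.compile(r"\b\d+\.\d+(?= )")
optional_dot = re.compile(r"(?<= )\d+(?:\.\d+)?(?= )")

row = "input: pRVh7kPpFbtmuwS1NILiCzwHUVwJ4NcK 839.680408921 2018-06-14 22:13:39.996"
whole = "input: abc 840 2018-06-14 22:13:39.996"  # hypothetical row, no decimal part

print(with_dot.search(row).group())        # 839.680408921
print(optional_dot.search(row).group())    # 839.680408921
print(with_dot.search(whole))              # None: a dot is required
print(optional_dot.search(whole).group())  # 840
```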

Regular Expression for Combined Look-ahead/Look-behind

I'm using python and trying to write a regular expression that matches a hyphen (-) if it is not preceded by a period (.) and not followed by one character and a period.
This one is matching hyphen not preceded by a period and not followed by a character:
r'(?<!\.)(-(?![a-zA-Z]))'
Nothing I've tried seems to get me the right match for the negative look-ahead part (single character and period).
Any help appreciated. Even a totally different regex if I'm barking up the wrong tree altogether.
Edit
Thanks for the answers. I did actually try
r'(?<!\.)(-(?![a-zA-Z]\.))'
But I now realise that my logic was wrong, not my expression.
I've chosen the answer and upvoted the other correct ones :)
Assuming that by "character" you mean [A-Za-z] (I base this assumption on your example and on @SimonO101's comment), I think you are looking for something like this:
>>> r = re.compile(r'(?<!\.)-(?![A-Za-z]\.)')
>>> r.search('k.-kj')
>>> r.search('k-l.')
>>> r.search('k-ll')
<_sre.SRE_Match object at 0x02D46758>
>>> r.search('k-.l')
<_sre.SRE_Match object at 0x02D46720>
>>> r.search('l-..')
<_sre.SRE_Match object at 0x02D46758>
There is no need to try to enclose the hyphen in a group that also captures the negative lookahead assertion. Trying to do this only complicates the matter.
import re
ss = ' a-bc1 d-e.2 .-gh3 .-N.4'
print('The analysed string:\n' + ss)
print('\n(?!\\.-[a-zA-Z]\\.)')
print('NOT (preceded by a dot AND followed by character-and-dot)')
r = re.compile(r'(?!\.-[a-zA-Z]\.).-...')
print(r.findall(ss))
print('\n(?<!\\.)-(?![a-zA-Z]\\.)')
print('NOT (preceded by a dot OR followed by character-and-dot)')
q = re.compile(r'.(?<!\.)-(?![a-zA-Z]\.)...')
print(q.findall(ss))
result
The analysed string:
a-bc1 d-e.2 .-gh3 .-N.4
(?!\.-[a-zA-Z]\.)
NOT (preceded by a dot AND followed by character-and-dot)
['a-bc1', 'd-e.2', '.-gh3']
(?<!\.)-(?![a-zA-Z]\.)
NOT (preceded by a dot OR followed by character-and-dot)
['a-bc1']
Which case do you actually want?
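The two readings can also be expressed as patterns that match only the hyphen itself. The sketch below is one possible formulation, not necessarily what the answer's author had in mind: the "AND" reading is written as an alternation (by De Morgan's law, "not both" is the same as "not preceded by a dot, or not followed by character-and-dot"). The names not_both and neither are illustrative.

```python
import re

ss = ' a-bc1 d-e.2 .-gh3 .-N.4'

# Reject the hyphen only when BOTH conditions hold at once.
not_both = re.compile(r'(?<!\.)-|-(?![A-Za-z]\.)')
# Reject the hyphen when EITHER condition holds.
neither = re.compile(r'(?<!\.)-(?![A-Za-z]\.)')

print([m.start() for m in not_both.finditer(ss)])  # [2, 8, 14]
print([m.start() for m in neither.finditer(ss)])   # [2]
```

The hyphen positions agree with the findall results above: three hyphens pass the first reading and only the one in a-bc1 passes the second.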

find a word in a sentence using regular expression

So, I am trying to find a word (a complete word) in a sentence. Let's say the sentence is
Str1 = "1. how are you doing"
and that I am interested in finding if
Str2 = "1."
is in it. If I do,
re.search(r"%s\b" % Str2, Str1, re.IGNORECASE)
it should say that a match was found, shouldn't it? But re.search fails for this query. Why?
There are two things wrong here:
\b matches a position between a word and a non-word character, so between any letter, digit or underscore, and a character that doesn't match that set.
You are trying to match the boundary between a . and a space; both are non-word characters and the \b anchor would never match there.
You are handing re a 1., which means 'match a 1 and any other character'. You'd need to escape the dot by using re.escape() to match a literal dot.
The following works better:
re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
Now it'll match your input literally, and look for a following space or the end of the string. The (?:...) creates a non-capturing group (always a good idea unless you specifically need to capture sections of the match); inside the group there is a | pipe to give two alternatives: either match \s (whitespace) or match $ (end of the string). You can expand this as needed.
Demo:
>>> import re
>>> Str1 = "1. how are you doing"
>>> Str2 = "1."
>>> re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
<_sre.SRE_Match object at 0x10457eed0>
>>> _.group(0)
'1. '
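The same idea can be wrapped in a small function. This is only a sketch: the name contains_token is hypothetical, and it checks just what the pattern above checks (the escaped token followed by whitespace or the end of the string).

```python
import re

def contains_token(token, text):
    # Hypothetical helper: escape the token so '.' is matched literally,
    # then require whitespace or the end of the string right after it.
    pattern = r"%s(?:\s|$)" % re.escape(token)
    return re.search(pattern, text, re.IGNORECASE) is not None

print(re.escape("1."))                               # 1\.
# \b finds no boundary between '.' and ' ' (both are non-word characters).
print(re.search(r"1\.\b", "1. how are you doing"))   # None
print(contains_token("1.", "1. how are you doing"))  # True
print(contains_token("1.", "step 1."))               # True
print(contains_token("1.", "11.5 km"))               # False
```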
