Spaces in regular expression

I have this code to find :) and :( in a text:
for match in re.finditer(r':\)|:\(', ":) :):( :)  :("):
    print(match.span())
and give me this answer:
(0, 2)
(3, 5)
(5, 7)
(8, 10)
(12, 14)
It works, but I need it to report only the smileys that stand alone (with no other character next to them), so the answer would be:
(0, 2)
(8, 10)
(12, 14)
I tried adding \b but got no matches.
I also need to add (x) to the pattern. This code:
for match in re.finditer(r'(?<![\w()]):(?:\)|\()(?![\w:])', ":) :):( :)  :( (x)"):
    print(match.span())
shows:
(0, 2)
(8, 10)
(12, 14)
and I want:
(0, 2)
(8, 10)
(12, 14)
(16, 19)

If by "no other character" you mean no other visible character, so that only whitespace (spaces, tabs, newlines) may surround the smiley, you could use something like this:
for match in re.finditer(r"(?:(?<=\s)|(?<=^)):[()](?=\s|$)", ":) :):( :)  :("):
    print(match.span())
(?:(?<=\s)|(?<=^)) makes sure there's either a whitespace character or the beginning of the string before the smiley,
:[()] matches : followed by either ( or ),
(?=\s|$) makes sure that there's either a whitespace character or the end of the string after the smiley.
If you additionally want to match the smiley x), you can use this:
r"(?:(?<=\s)|(?<=^))(?::[()]|x\))(?=\s|$)"
If you want to match x( as well, it becomes a little easier:
r"(?:(?<=\s)|(?<=^))[x:][()](?=\s|$)"
[ ... ] is a character class, and most metacharacters (including the parentheses here) need no escaping inside it. Be wary of the placement of - and ^, since those two have special meanings in a character class.
EDIT: It seems I picked the wrong additional smiley with x). To match :), :( and (x), it would be something like this:
r"(?:(?<=\s)|(?<=^))(?::[()]|\(x\))(?=\s|$)"
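reEDIT: Actually, the positive assertions can be replaced with negative ones, which is simpler. (?<!\S) means "not preceded by a non-whitespace character", which also holds at the start of the string, and (?!\S) does the same for the end: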
r"(?<!\S)(?::[()]|\(x\))(?!\S)"
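As a quick check of that last pattern, here is a minimal sketch. The sample text and the find_lone_smileys helper are made up for illustration and are not part of the question.

```python
import re

# Final pattern from the answer: a smiley with no non-whitespace
# character directly before or after it.
LONE_SMILEY = re.compile(r"(?<!\S)(?::[()]|\(x\))(?!\S)")

def find_lone_smileys(text):
    """Return (matched text, span) pairs for smileys surrounded only by whitespace."""
    return [(m.group(), m.span()) for m in LONE_SMILEY.finditer(text)]

# Hypothetical sample: ':):(' and 'a:)' touch other characters, so they are skipped.
sample = ":) :):( (x) a:) :("
for smiley, span in find_lone_smileys(sample):
    print(smiley, span)
```

Only the first :), the (x) and the final :( are reported here; the glued-together pair and the smiley attached to a letter are rejected by the lookarounds.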

:, ( and ) are non word characters, so \b won't work. You'd use the inverse, \B:
r'\B:(?:\)|\()\B'
Where \b matches on the boundary between \w and \W or vice-versa, \B only matches between two \w or two \W points. Since : and the parenthesis characters are both \W characters, this means they must sit next to another non-word character (or the start or end of the line).
However, this still matches smileys that sit directly next to other smileys, such as the two in :):(.
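A small sketch of what \B does and does not rule out, assuming the non-capturing group is written (?:...). The sample string is hypothetical.

```python
import re

# \B version of the pattern: no word boundary allowed on either side of the smiley.
pattern = re.compile(r"\B:(?:\)|\()\B")

# Hypothetical sample: 'a:)' and ':(b' touch a word character; ':):(' touch each other.
sample = ":) :):( a:) :(b"
matches = [m.span() for m in pattern.finditer(sample)]
print(matches)  # [(0, 2), (3, 5), (5, 7)]
```

The smileys glued to a or b are rejected, but the adjacent pair at (3, 5) and (5, 7) still matches, because the position between ) and : lies between two non-word characters.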
To completely exclude other smileys you need to use both a negative look-ahead and a negative look-behind:
r'(?<![\w()]):(?:\)|\()(?![\w:])'
This says:
(?<![\w()]): No word character or parentheses before the smiley (start of string is fine)
(?![\w:]): No word character or colon after the smiley (end of string is fine)
Demo:
>>> for match in re.finditer(r'(?<![\w()]):(?:\)|\()(?![\w:])', ":) :):( :)  :("):
...     print(match.span())
...
(0, 2)
(8, 10)
(12, 14)
For your updated pattern version, you clearly don't mind if ( is in front, so we remove that from the excluded characters preceding the pattern, and update : to [x:] to match either an x or a colon:
r'(?<![\w)])[x:](?:\)|\()(?![\w:])'
Demo:
>>> for match in re.finditer(r'(?<![\w)])[x:](?:\)|\()(?![\w:])', ":) :):( :)  :( (x)"):
...     print(match.span())
...
(0, 2)
(8, 10)
(12, 14)
(16, 18)

Related

Apply a look ahead in regex that should be followed by the specified pattern and give a match other wise a no match

Hello, I am new to regex. I need to apply a regex to a string of US ZIP codes, which we get by concatenating the rows of a pandas column.
For example, with zip being the header of the column:
zip
you have some thing
70456
90876
78905
we get zip you have some thing 70456 90876 78905 as a single string. The regex should match some characters followed by one or more 5-digit codes separated by whitespace.
So I wrote a simple regex, '.*zip.*(\d{5}|\s)*' (zip followed by any number of 5-digit codes), but with re.fullmatch it also matches zip 123456, where zip is followed by a 6-digit code.
For that reason I thought of using a lookahead assertion, but I don't know how to use it properly and it does not give the matches I expect. I also tried a lookbehind with re.search, but that seems to fail too. Can someone give me a regex that requires the word zip and only 5-digit codes at the end (possibly a nan)?
Here is the code I have written:
re.match('(?=zip)(\d{5}|\s)*','zip 123456')
<re.Match object; span=(0, 0), match=''>
re.search('(?<=zip)(\d{5}|\s)*','zip 123456')
<re.Match object; span=(3, 9), match=' 12345'>
Can someone tell me how to write a regex that gives a match if .*zip.* is followed by codes of exactly 5 digits, and None otherwise? The string may contain any alphanumeric characters, must contain zip, and must be followed by a 5-digit numeric code.

You can use
re.search(r'\bzip\b\D*\d{5}(?:\s+\d{5})*\b', text)
If you want to also capture the ZIPs, you can use a capturing group:
re.search(r'\bzip\b\D*(\d{5}(?:\s+\d{5})*)\b', text)
Details:
\b - a word boundary
zip - the literal string zip
\b - a word boundary
\D* - zero or more non-digit characters, as many as possible
\d{5} - five digits
(?:\s+\d{5})* - zero or more sequences of one or more whitespaces and then five digits
\b - a word boundary
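The breakdown above can be tried out with a minimal sketch like the following. The extract_zips helper name is made up for illustration; the pattern is the capturing version from this answer.

```python
import re

# 'zip' as a whole word, any non-digits, then one or more
# whitespace-separated 5-digit codes ending at a word boundary.
ZIP_RUN = re.compile(r"\bzip\b\D*(\d{5}(?:\s+\d{5})*)\b")

def extract_zips(text):
    """Return the list of 5-digit codes after 'zip', or None if the pattern fails."""
    m = ZIP_RUN.search(text)
    return m.group(1).split() if m else None

print(extract_zips("zip you have some thing 70456 90876 78905"))  # ['70456', '90876', '78905']
print(extract_zips("zip 123456"))  # None
```

Because re.search is used, text after the codes is not checked: 'zip 12345 123456' still returns ['12345'], since the match simply stops before the 6-digit run.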

I suggest using a word boundary (\b), as follows:
import re
t1 = 'zip 1234' # less than 5, should not match
t2 = 'zip 12345' # should match
t3 = 'zip 123456' # more than 5, should not match
pattern = r'zip\s\d{5}\b'
print(re.search(pattern, t1)) # None
print(re.search(pattern, t2)) # <re.Match object; span=(0, 9), match='zip 12345'>
print(re.search(pattern, t3)) # None
\b is a zero-length assertion, useful for making sure you match a complete word rather than just part of one. See the re docs for details on how \b works.

Python recognize part of string (position and length)

I have a file (.VAR) that gives me a position and a length within a string on each row; see the example below.
*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE
How do I retrieve the " xxLxxx:" value, which is always preceded by a space and always ends with a colon, but is never at the same location within the string?
Preferably I would like to get the number before L as the position and the number after L as the length, but searching only for "L" would also match other values within the string. I therefore think I have to use the whole space_number_L_number_colon sequence to recognize this part, but I don't know how.
Any thoughts? TIA

You can use a regex here.
Example:
s='''*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE'''
import re
out = re.findall(r'\s(\d+)L(\d+):', s)
output: [('1', '8'), ('29', '4'), ('33', '2')]
As integers:
out = [tuple(map(int, x)) for x in re.findall(r'\s(\d+)L(\d+):', s)]
output: [(1, 8), (29, 4), (33, 2)]
Regex breakdown:
\s # whitespace character
(\d+) # capture one or more digits
L # literal L
(\d+) # capture one or more digits
: # literal :
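If the position and length are needed per row, a variation with named groups may read more clearly. This is only a sketch: the group names pos, length and name, the parse_line helper, and the idea of also capturing the text after the colon are assumptions for illustration, not part of the question.

```python
import re

lines = [
    "*STRING1 1L8:StringONE",
    "*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO",
    "*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE",
]

# whitespace, digits (position), literal L, digits (length), colon, rest of the line
FIELD = re.compile(r"\s(?P<pos>\d+)L(?P<length>\d+):(?P<name>.*)")

def parse_line(line):
    """Return (position, length, name) for one row, or None if the row has no such field."""
    m = FIELD.search(line)
    if m is None:
        return None
    return int(m.group("pos")), int(m.group("length")), m.group("name")

parsed = [parse_line(line) for line in lines]
print(parsed)  # [(1, 8, 'StringONE'), (29, 4, 'StringTWO'), (33, 2, 'StringTHREE')]
```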

Python Regular Expression Why Quantifier (+) is not greedy

Input: asjkd http://www.as.com/as/g/ff askl
Expected output: http://www.as.com/as/g/ff
When I try the code below, I get the expected output:
pattern=re.compile(r'http[\w./:]+')
print(pattern.search("asjkd http://www.as.com/as/g/ff askl"))
Why isn't the + quantifier greedy here? I was expecting it to be greedy. Here actually not being greedy is helping to find the right answer.

It is greedy. It stops matching when it hits the space because [\w./:] doesn't match a space. A space isn't a word character (alphanumeric or underscore), dot, slash, or colon.
Change + to +? and you can see what happens when it's non-greedy.
Greedy
>>> pattern=re.compile(r'http[\w./:]+')
>>> print(pattern.search("asjkd http://www.as.com/as/g/ff askl"))
<re.Match object; span=(6, 31), match='http://www.as.com/as/g/ff'>
Non-greedy
>>> pattern=re.compile(r'http[\w./:]+?')
>>> print(pattern.search("asjkd http://www.as.com/as/g/ff askl"))
<re.Match object; span=(6, 11), match='http:'>
With +?, the class matches just a single character (the :) after http and stops.
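Greediness only makes a visible difference when the rest of the pattern could succeed at more than one point. A minimal sketch with a made-up string, separate from the URL example:

```python
import re

text = "<a><b>"  # hypothetical input with two possible closing '>' characters

# Greedy: .+ grabs as much as possible, then backs off to the LAST '>'.
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? grabs as little as possible, so it stops at the FIRST '>'.
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```

In the URL case nothing follows the quantifier, so the greedy + simply runs until the character class stops matching at the space.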

Regular expressions: How to make my code match the '+' character OR digits

I've just started on regex.
I'm trying to search through a short list of 'phrases' to find UK mobile numbers (starting with +44 or 07, sometimes with the number broken up by one space). I'm having trouble getting it to return numbers starting +44.
This is what I've written:
for snippet in phrases:
    match = re.search("\\b(\+44|07)\\d+\\s?\\d+\\b", snippet)
    if match:
        numbers.append(match)
        print(match)
which prints
<_sre.SRE_Match object; span=(19, 31), match='07700 900432'>
<_sre.SRE_Match object; span=(20, 31), match='07700930710'>
and misses out the number +44770090999 which is in 'phrases.'
I tried with and without the brackets. Without the brackets it would also print the +44 in sums like '10+44=54.' Is the backslash before the +44 necessary? Any ideas on what I'm missing?
Thanks to all!
EDIT: Some of my input:
phrases = ["You can call me on 07700 900432.",
           "My mobile number is 07700930710",
           "My date of birth is 07.08.92",
           "Why not phone me on 202-555-0136?",
           "There are around 7600000000 people on Earth",
           "If you're from overseas, call +44 7700 900190",
           "Try calling +447700900999 now!",
           "56+44=100."]

In your regex the word boundary \b does not match between a whitespace and a plus sign.
What you could do is match either 07 or +44 and then match either a digit or a whitespace one or more times [\d ]+ followed by a digit \d to not match a whitespace at the end and add a word boundary \b at the end.
(?:07|\+44)[\d ]+\d\b
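Run against the phrases from the question, that pattern behaves as sketched below. The list is copied from the question, with a comma added after the 202-555-0136 entry on the assumption that it was a typo.

```python
import re

phrases = [
    "You can call me on 07700 900432.",
    "My mobile number is 07700930710",
    "My date of birth is 07.08.92",
    "Why not phone me on 202-555-0136?",
    "There are around 7600000000 people on Earth",
    "If you're from overseas, call +44 7700 900190",
    "Try calling +447700900999 now!",
    "56+44=100.",
]

# 07 or +44, then digits/spaces, ending on a digit at a word boundary
pattern = re.compile(r"(?:07|\+44)[\d ]+\d\b")

numbers = []
for snippet in phrases:
    match = pattern.search(snippet)
    if match:
        numbers.append(match.group())

print(numbers)
# ['07700 900432', '07700930710', '+44 7700 900190', '+447700900999']
```

The date and the sum are skipped because the character right after 07 or +44 is neither a digit nor a space.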

The problem with your regex is that the first \b matches the word boundary between the + and the 4. The position between a space and a + is not a word boundary. This means the regex can't find +44 after the \b, because the + is on the left of the \b; only 44 is on its right.
To fix this, you can use a negative lookbehind to make sure there is no word character before +44. Remember to put it inside the capturing group, because it should only apply when the +44 option is chosen. You still want to match a word boundary if the number starts with 07.
((?<!\w)\+44|\b07)\d+\s?\d+\b
You can put the regex in an r"" string. This way you don't have to write that many backslashes:
r"((?<!\w)\+44|\b07)\d+\s?\d+\b"
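A short check of the lookbehind approach, written with (?<!\w) as the prose describes. The first_number helper and the last sample string (ref56+447700900999) are made up for illustration.

```python
import re

# +44 must not be preceded by a word character; 07 must start at a word boundary.
pattern = re.compile(r"((?<!\w)\+44|\b07)\d+\s?\d+\b")

def first_number(snippet):
    """Return the first matching number in the snippet, or None."""
    m = pattern.search(snippet)
    return m.group() if m else None

samples = [
    "Try calling +447700900999 now!",
    "You can call me on 07700 900432.",
    "56+44=100.",
    "ref56+447700900999",  # hypothetical: + directly after a digit
]
results = [first_number(s) for s in samples]
print(results)  # ['+447700900999', '07700 900432', None, None]
```

Note that this pattern allows at most one whitespace character, and only between digit runs, so a number written as +44 7700 900190 would not be picked up by it.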

This should help.
import re
phrases = ["Hello +4407700 900432 World", "Hello +44770090999 World"]
for snippet in phrases:
    match = re.search(r"(?P<num>(\+44|07)\d+\s?\d+)", snippet)
    if match:
        print(match.group('num'))
Output:
+4407700 900432
+44770090999

You should be able to cover all cases by removing the expected "noisy characters" from the string and simplifying your regex to just "(07|\D44)\d{9}", where:
(07|\D44) matches a number that starts with 07, or with 44 preceded by a non-digit character.
\d{9} searches for the remaining 9 digits.
Your code should look like this:
cleansnippet = snippet.replace("-","").replace(" ","").replace("(0)","")...
re.search(r"(07|\D44)\d{9}", cleansnippet)
Applying this to your input retrieves this:
<_sre.SRE_Match object; span=(14, 25), match='07700900432'>
<_sre.SRE_Match object; span=(16, 27), match='07700930710'>
<_sre.SRE_Match object; span=(25, 37), match='+44770090019'>
<_sre.SRE_Match object; span=(10, 22), match='+44770090099'>
Hope that helps.
P.S.:
The \ before the + means that you are specifically looking for a literal + sign instead of "1 or more" of the previous element.
The only reason I propose \D44 instead of \+44 is that it could be safer for you, as people might forget to type the + before their number. :)
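One way to sketch the clean-then-match idea is with re.sub instead of chained replace calls. The NOISE pattern below (whitespace, hyphens and a literal "(0)") and the find_number helper are assumptions for illustration; the elided replacements in the answer are not filled in here.

```python
import re

NOISE = re.compile(r"[\s-]|\(0\)")      # whitespace, hyphens, or a literal "(0)"
NUMBER = re.compile(r"(07|\D44)\d{9}")  # pattern proposed in this answer

def find_number(snippet):
    """Strip the noise characters, then return the first match or None."""
    cleaned = NOISE.sub("", snippet)
    m = NUMBER.search(cleaned)
    return m.group() if m else None

print(find_number("You can call me on 07700 900432."))               # 07700900432
print(find_number("If you're from overseas, call +44 7700 900190"))  # +44770090019
print(find_number("56+44=100."))                                     # None
```

As in the output listed above, \d{9} keeps only nine digits after the prefix, so the +44 numbers lose their final digit with this pattern.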

Regex: Matching individual characters without matching characters inbetween

I have a simple regex query.
Here is the input:
DLWLALDYVASQASV
The desired output is the positions of the D, Y and S in the DYVAS part of DLWLALDYVASQASV.
So it would be D:6, Y:7, S:10.
I am using Python, so I know I can use span() or start() to obtain the start position of a match. But if I try something like DY.{2}S, it matches the characters in between as well and only gives me the position of the first (and, with span(), the last) character of the match.
Is there a function or a way to retrieve the position of each specified character, not including the characters in-between?

import re
match = re.search(r'(D)(Y)..(S)', 'DLWLALDYVASQASV')
print([match.group(i) for i in range(4)])
# ['DYVAS', 'D', 'Y', 'S']
print([match.span(i) for i in range(4)])
# [(6, 11), (6, 7), (7, 8), (10, 11)]
print([match.start(i) for i in range(4)])
# [6, 6, 7, 10]
You can wrap subexpressions of the regular expression in parentheses (capturing groups) and then access the corresponding substrings and positions via the match object. Group 0 is the whole match; groups 1 to 3 are D, Y and S. See the documentation of the Match object for more details.
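To get output in the D:6, Y:7, S:10 shape the question asks for, the group positions can be collected into a dictionary. A minimal sketch; the positions variable name is made up.

```python
import re

seq = "DLWLALDYVASQASV"
match = re.search(r"(D)(Y)..(S)", seq)

# Map each captured character to the start index of its group,
# skipping group 0 (the whole match).
positions = {match.group(i): match.start(i) for i in range(1, match.lastindex + 1)}
print(positions)  # {'D': 6, 'Y': 7, 'S': 10}
```

A dictionary works here because the three captured characters are distinct; with repeated characters a list of (character, position) pairs would be the safer choice.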
