Regular expression result - python

I have below code:
import re
line = "78349999234";
searchObj = re.search(r'9*', line)
if searchObj:
print "searchObj.group() : ", searchObj.group()
else:
print "Nothing found!!"
However the output is empty. I thought * means: Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s. Why am I not able to see any result in this case?

I think the regular expression matches left to right. So the first pattern that matches is the empty string before 7.... If it find a 9, it will indeed match it greedy: and try to "eat" (that's the correct terminology) as many characters as possible.
If you query for:
>>> print(re.findall(r'9*',line));
['', '', '', '', '9999', '', '', '', '']
It matches all empty strings between the characters and as you can see, 9999 is matched as well.
The main reason is probably performance: if you search for a pattern in a string of 10M+ characters, you're very happy if the pattern is already in the first 10k characters. You don't want to waste effort on finding the "nicest" match...
EDIT
With 0 or more occurrence one means the group (in this case 9) is repeated zero or more times. In an empty string, the characters is repeated exactly 0 times. If you want to match patterns where the characters is repeated one or more times, you should use
9+
This results in:
>>> print(re.search(r'9+', line));
<_sre.SRE_Match object; span=(4, 8), match='9999'>
re.search for a pattern that accepts the empty string, is probably not that much helpful since it will always match the empty string before the actual start of the string first.

The main reason is , re.search function stops searching for strings once it finds a match. 9* means match the digit 9 zero or more times. Because an empty string exists before each and every character, re.search function stops it searching after finding the first empty string. That's why you got an empty string as output...

Related

python3: regex, find all substrings that starts with and end with certain string

Let's say that I have a string that looks like this:
a = '1253abcd4567efgh8910ijkl'
I want to find all substrings that starts with a digit, and ends with an alphabet.
I tried,
b = re.findall('\d.*\w',a)
but this gives me,
['1253abcd4567efgh8910ijkl']
I want to have something like,
['1234abcd','4567efgh','8910ijkl']
How can I do this? I'm pretty new to regex method, and would really appreciate it if anyone can show how to do this in different method within regex, and explain what's going on.
\w will match any wordcharacter which consists of numbers, alphabets and the underscore sign. You need to use [a-zA-Z] to capture letters only. See this example.
import re
a = '1253abcd4567efgh8910ijkl'
b = re.findall('(\d+[A-Za-z]+)',a)
Output:
['1253abcd', '4567efgh', '8910ijkl']
\d will match digits. \d+ will match one or more consecutive digits. For e.g.
>>> re.findall('(\d+)',a)
['1253', '4567', '8910']
Similarly [a-zA-Z]+ will match one or more alphabets.
>>> re.findall('([a-zA-Z]+)',a)
['abcd', 'efgh', 'ijkl']
Now put them together to match what you exactly want.
From the Python manual on regular expressions, it tells us that \w:
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
So you are actually over capturing what you need. Refine your regular expression a bit:
>>> re.findall(r'(\d+[a-z]+)', a, re.I)
['1253abcd', '4567efgh', '8910ijkl']
The re.I makes your expression case insensitive, so it will match upper and lower case letters as well:
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA')
['12124adbad']
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA', re.I)
['12124adbad', '13434AGDFDF', '434348888AAA']
\w matches string with any alphanumeric character. And you have used \w with *. So your code will provide a string which is starting with a digit and contains alphanumeric characters of any length.
Solution:
>>>b=re.findall('\d*[A-Za-z]*', a)
>>>b
['1253abcd', '4567efgh', '8910ijkl', '']
you will get '' (an empty string) at the end of the list to display no match. You can remove it using
b.pop(-1)

Regex pattern to match substring

Would like to find the following pattern in a string:
word-word-word++ or -word-word-word++
So that it iterates the -word or word- pattern until the end of the substring.
the string is quite large and contains many words with those^ patterns.
The following has been tried:
p = re.compile('(?:\w+\-)*\w+\s+=', re.IGNORECASE)
result = p.match(data)
but it returns NONE. Does anyone know the answer?
Your regex will only match the first pattern, match() will only find one occurrence, and that only if it is immediately followed by some whitespace and an equals sign.
Also, in your example you implied you wanted three or more words, so here's a version that was changed in the following ways:
match both patterns (note the leading -?)
match only if there are at least three words to the pattern ({2,} instead of +)
match even if there's nothing after the pattern (the \b matches a word boundary. It is not really necessary here, since the preceding \w+ guarantees we are at a word boundary anyway)
returns all matches instead of only the first one.
Here's the code:
#!/usr/bin/python
import re
data=r"foo-bar-baz not-this -this-neither nope double-dash--so-nope -yeah-this-even-at-end-of-string"
p = re.compile(r'-?(?:\w+-){2,}\w+\b', re.IGNORECASE)
print p.findall(data)
# prints ['foo-bar-baz', '-yeah-this-even-at-end-of-string']

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But, if i could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, that have numbers in their name separated by commas (i.e. 2,4,5-TP).
Is there an easy pythonic way to do this?
I explain a little bit based on #eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
print re.split(r'(?<=\D),\s*|\s*,(?=\D)',d)
re.split(pattern, string) will split string by the occurrences of regex pattern.
(plz read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two part: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can be appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbebind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace and followed by non-digit characters. Similarly, (?<=\D),\s* matches a , that is preceded by non-digit characters and followed by zero or more whitespace. The whole regex will find , that satisfy either case, which is equivalent to your requirement: ',' with only two non-numeric characters on either side.
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use regex and lookbehind/lookahead assertion
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']

python regular expression match comma

In the following string,how to match the words including the commas
--
process_str = "Marry,had ,a,alittle,lamb"
import re
re.findall(r".*",process_str)
['Marry,had ,a,alittle,lamb', '']
--
process_str="192.168.1.43,Marry,had ,a,alittle,lamb11"
import re
ip_addr = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",l)
re.findall(ip_addr,process_str1)
How to find the words after the ip address excluding the first comma only
i.e, the outout again is expected to be Marry,had ,a,alittle,lamb11
In the second example above how to find if the string is ending with a digit.
In the second example, you just need to capture (using ()) everything that follows the ip:
import re
s = "192.168.1.43,Marry,had ,a,alittle,lamb11"
text = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3},(.*)", s)[0]
// text now holds the string Marry,had ,a,alittle,lamb11
To find out if the string ends with a digit, you can use the following:
re.match(".*\d$", process_str)
That is, you match the entire string (.*), and then backtrack to test if the last character (using $, which matches the end of the string) is a digit.
Find the words including the commas, that's how I understand this sentence:
>>> re.findall("\w+,*", process_str)
['Marry,', 'had', 'a,', 'alittle,', 'lamb']
ending with a didgit:
"[0-9]+$"
Hmm. The examples are not quite clear, but it seems in example #2, you want to only match text , commas, space-chars, and ignore digits? How about this:
re.findall('(?i)([a-z, ]+), process_str)
I didn't quite understand the "if the string is ending with a digit". Does that mean you ONLY want to match 'Mary...' IF it ends with a digit? Then that would look like this:
re.findall('(?i)([a-z, ]+)\d+, process_str)

Check String for / against Characters in Python

I need to be able to tell the difference between a string that can contain letters and numbers, and a string that can contain numbers, colons and hyphens.
>>> def checkString(s):
... pattern = r'[-:0-9]'
... if re.search(pattern,s):
... print "Matches pattern."
... else:
... print "Does not match pattern."
# 3 Numbers seperated by colons. 12, 24 and minus 14
>>> s1 = "12:24:-14"
# String containing letters and string containing letters/numbers.
>>> s2 = "hello"
>>> s3 = "hello2"
When I run the checkString method on each of the above strings:
>>>checkString(s1)
Matches Pattern.
>>>checkString(s2)
Does not match Pattern.
>>>checkString(s3)
Matches Pattern
s3 is the only one that doesn't do what I want. I'd like to be able to create a regex that allows numbers, colons and hyphens, but excludes EVERYTHING else (or just alphabetical characters). Can anyone point me in the right direction?
EDIT:
Therefore, I need a regex that would accept:
229 // number
187:657 //two numbers
187:678:-765 // two pos and 1 neg numbers
and decline:
Car //characters
Car2 //characters and numbers
you need to match the whole string, not a single character as you do at the moment:
>>> re.search('^[-:0-9]+$', "12:24:-14")
<_sre.SRE_Match object at 0x01013758>
>>> re.search('^[-:0-9]+$', "hello")
>>> re.search('^[-:0-9]+$', "hello2")
To explain regex:
within square brackets (character class): match digits 0 to 9, hyphen and colon, only once.
+ is a quantifier, that indicates that preceding expression should be matched as many times as possible but at least once.
^ and $ match start and end of the string. For one-line strings they're equivalent to \A and \Z.
This way you restrict content of the whole string to be at least one-charter long and contain any permutation of characters from the character class. What you were doing before hand was to search for a single character from the character class within subject string. This is why s3 that contains a digit matched.
SilentGhost's answer is pretty good, but take note that it would also match strings like "---::::" with no digits at all.
I think you're looking for something like this:
'^(-?\d+:)*-?\d+$'
^ Matches the beginning of the line.
(-?\d+:)* Possible - sign, at least one digit, a colon. That whole pattern 0 or many times.
-?\d+ Then the pattern again, at least once, without the colon
$ The end of the line
This will better match the strings you describe.
pattern = r'\A([^-:0-9]+|[A-Za-z0-9])\Z'
Your regular expression is almost fine; you just need to make it match the whole string. Also, as a commenter pointed out, you don't really need a raw string (the r prefix on the string) in this case. Voila:
def checkString(s):
if re.match('[-:0-9]+$', s):
print "Matches pattern."
else:
print "Does not match pattern."
The '+' means "match one or more of the previous expression". (This will make checkString return False on an empty string. If you want True on an empty string, change the '+' to a '*'.) The '$' means "match the end of the string".
re.match means "the string must match the regular expression starting at the first character"; re.search means "the regular expression can match a sequence anywhere inside the string".
Also, if you like premature optimization--and who doesn't!--note that 're.match' needs to compile the regular expression each time. This version compiles the regular expression only once:
__checkString_re = re.compile('[-:0-9]+$')
def checkString(s):
global __checkString_re
if __checkString_re.match(s):
print "Matches pattern."
else:
print "Does not match pattern."

Categories