I want to use Python's re module to match this kind of input: 12,13,45,23.
The input consists of four non-negative integers separated by ",".
However, my regex does not match:
print re.match(u'^([1−9]\d*|0),([1−9]\d*|0),([1−9]\d*|0),([1−9]\d*|0)$',u"0,1001,13,2")
#output is None
However, the next regex, which has only three groups, works well.
print re.match(u'^([1−9]\d*|0),([1−9]\d*|0),([1−9]\d*|0)$',u"0,1001,13")
#<_sre.SRE_Match object at 0x024151B0>
I am totally confused.
For this, you don't need to repeat the sub-pattern that matches each integer. You can use the {x} quantifier, where x is the number of times the preceding group should appear, something like:
import re
matcher = re.compile(u"([1-9]\d*|0)(,([1-9]\d*|0)){3}$")
print matcher.match(u"12,45")
# None
print matcher.match(u"0,1001,13,578")
# <_sre.SRE_Match object at 0x7fb0e911ca48>
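A minimal Python 3 sketch of the same repetition idea, assuming exactly four comma-separated integers are wanted. The names FOUR_INTS and is_four_ints are made up for illustration, and re.fullmatch (Python 3.4+) stands in for explicit anchors. The last lines check one detail of the question's pattern as it is pasted above: the character between 1 and 9 appears to be U+2212 (minus sign) rather than an ASCII hyphen. If so, the class holds three literal characters instead of a range, which would explain why "2" was rejected while "13" was accepted.

```python
import re

# Hypothetical names: one integer without leading zeros, then three more after commas.
FOUR_INTS = re.compile(r"(?:[1-9]\d*|0)(?:,(?:[1-9]\d*|0)){3}")

def is_four_ints(text):
    """Return True when text is exactly four comma-separated non-negative integers."""
    return FOUR_INTS.fullmatch(text) is not None

print(is_four_ints("0,1001,13,2"))
print(is_four_ints("12,45"))
print(is_four_ints("0,1001,13,578,9"))

# A class written with U+2212 contains '1', the minus sign and '9'; it is not a range.
minus_class = re.compile("[1\u22129]")
print(minus_class.match("2"))   # no match object
print(minus_class.match("1"))   # a match object
```

Non-capturing groups (?:...) are used because only a yes/no answer is needed here; switch back to capturing groups if the individual numbers are required.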
I have text with values like:
this is a value £28.99 (0.28/ml)
I want to remove everything to return the price only so it returns:
£28.99
There could be any number of digits between the £ and the decimal point.
I think
r"£[0-9]*\.[0-9]{2}"
matches the pattern I want to keep, but I'm unsure how to remove everything else and keep the pattern, instead of replacing the pattern as in the usual re.sub() cases.
Why not extract the information you want instead of removing everything else?
import re
s = "this is a value £28.99 (0.28/ml)"
m = re.search(r"£\d*(\.\d+)?", s)
if m:
    print(m.group(0))
To find several occurrences, use findall or finditer instead of search.
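A short Python 3 sketch of the findall/finditer variant. The sample text and the names price and all_prices are invented for illustration. Note that findall returns the contents of the capturing group when a pattern has one, so this sketch uses a non-capturing group to get the whole price back; it also requires at least one digit after the £, which is an assumption about what counts as a price.

```python
import re

# Hypothetical sample text with several prices.
text = "small £28.99 (0.28/ml), large £105.50, sample £3"

# (?:...) is non-capturing, so findall returns the full match instead of the group.
price = re.compile(r"£\d+(?:\.\d+)?")

all_prices = price.findall(text)
print(all_prices)

# finditer yields match objects, which also carry the position of each match.
for m in price.finditer(text):
    print(m.start(), m.group(0))
```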
You don't care how many digits are before the decimal, so using the zero-or-more matcher was correct. However, you could just rely on the digit class (\d) to provide that more succinctly.
The same is true of after the decimal. You only need two so your limiting the matches to 2 is correct.
The issue then comes in with how you actually capture the value. You can use a capturing group to be sure that you only ever get the value you care about.
Complete regex:
(£\d*\.\d{2})
Sample code:
import re
r = re.compile(r"(£\d*\.\d{2})")
match = r.findall("this is a value £28.99 (0.28/ml)")
if match:  # findall may return an empty list; check for that here
    print(match[0])  # first match found; prints £28.99
If it's a string, you can do something like this:
x = "this is a value £28.99 (0.28/ml)"
x_list = x.split()
for i in x_list:
    if "£" in i:  # or: if i.startswith("£") (credit: Jean-François Fabre)
        value = i
print(value)
# output: £28.99
You can try:
import re
t = "this is a value £28.99 (0.28/ml)"
r = re.sub(".*(£[\d.]+).*", r"\1", t)
print(r)
Output:
£28.99
I'm tackling a python challenge problem to find a block of text in the format xXXXxXXXx (lower vs upper case, not all X's) in a chunk like this:
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
I have tested the following RegEx and found it correctly matches what I am looking for from this site (http://www.regexr.com/):
'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])'
However, when I try to match this expression to the block of text, it just returns the entire string:
In [1]: import re
In [2]: example = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
In [3]: expression = re.compile(r'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])')
In [4]: found = expression.search(example)
In [5]: print found.string
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
Any ideas? Is my expression incorrect? Also, if there is a simpler way to represent that expression, feel free to let me know. I'm fairly new to RegEx.
You need to return the match group instead of the string attribute.
>>> import re
>>> s = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
>>> rgx = re.compile(r'[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]')
>>> found = rgx.search(s).group()
>>> print found
nJDKoJIWh
The string attribute always returns the string passed as input to the match. This is clearly documented:
string
The string passed to match() or search().
The problem has nothing to do with the matching; you're just reading the wrong attribute of the match object. Use match.group(0) (or match.group()).
Based on xXXXxXXXx, if you want runs of exactly three uppercase letters separated by single lowercase letters, this is the pattern:
([a-z])(([A-Z]){3}([a-z]))+
You can also get the matched text from your search call with group():
print expression.search(example).group(0)
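A small Python 3 sketch contrasting the attributes discussed above, using the example string from the question. The variable names are illustrative. It also shows a side effect of the question's original pattern: a quantified capturing group such as ([A-Z]){3} keeps only its last repetition, so groups() does not return all three uppercase letters.

```python
import re

example = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
pattern = re.compile(r'[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]')

found = pattern.search(example)
if found:
    print(found.group(0))            # the text that matched
    start, end = found.span()        # where it matched
    print(example[start:end])        # same text, sliced from the input
    print(found.string == example)   # .string is the whole input, not the match

# The question's pattern with capturing groups: each repeated group
# reports only the last character it matched.
grouped = re.compile(r'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])')
print(grouped.search(example).groups())
```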
I want to check whether given words contain a special character or not.
Below is my Python code.
The literal 'a#bcd' contains '#', so it is matched, which is fine.
But 'a1bcd' has no special character, and it was matched too!
import re
regexp = re.compile('[~`!##$%^&*()-_=+\[\]{}\\|;:\'\",.<>/?]+')
if regexp.search('a#bcd'):
    print 'matched!! nich catch!!'
if regexp.search('a1bcd'):
    print 'something is wrong here!!!'
result :
python ../special_char.py
matched!! nich catch!!
something is wrong here!!!
I have no idea why it behaves like this. Can someone help me?
Thanks.
Move the dash in your regular expression to the start of the [] group, like this:
regexp = re.compile('[-~`!##$%^&*()_=+\[\]{}\\|;:\'\",.<>/?]+')
Where you had the dash, it was read together with the surrounding characters as )-_ and, since it is inside [], it is interpreted as a range from ) to _. In ASCII that range includes the digits and the uppercase letters, which is why '1' matched. If you move the dash to just after the [ it has no special meaning and instead matches itself.
Here's an interactive session showing the specific problem there was in your regular expression:
>>> import re
>>> print re.search('[)-_]', 'abcd')
None
>>> print re.search('[)-_]', 'a1b')
<_sre.SRE_Match object at 0x7f71082247e8>
>>> print re.search('[)-_]', 'a1b').group(0)
1
After fixing it:
>>> print re.search('[-)_]', 'a1b')
None
Unless there's some reason not visible in your question, I'd also say that the final + is not needed.
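To make the accidental range concrete, here is a small Python 3 sketch that lists which letters and digits the class [)-_] accepts. The names accidental, fixed and covered are illustrative only.

```python
import re
import string

accidental = re.compile(r"[)-_]")   # dash in the middle: a range from ')' to '_'
fixed = re.compile(r"[-)_]")        # dash first: three literal characters

# Which letters and digits fall inside the accidental range?
covered = [c for c in string.ascii_letters + string.digits if accidental.match(c)]
print("".join(covered))

# The range is defined by code points, so compare the two endpoints.
print(ord(")"), ord("_"))

print(fixed.search("a1b"))            # no match object: '1' is no longer included
print(fixed.search("a-b").group(0))   # the literal dash still matches
```

Escaping the dash as \- inside the class is another common way to get the same effect without moving it.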
re will be relatively slow for this check.
I'd suggest trying one of the following (the two-argument translate form is Python 2 only):
specialchars = '''-~`!##$%^&*()_=+[]{}\\|;:'",.<>/?'''
len(word) != len(word.translate(None, specialchars))
or
set(word) & set(specialchars)
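A rough Python 3 equivalent of the two suggestions above, since str.translate(None, ...) no longer exists there. The character list is copied as it appears in this answer; the names SPECIAL_CHARS, has_special and strip_special are made up for this sketch, and no timing claim is implied.

```python
# Character list copied from the answer above (duplicates are harmless in a set).
SPECIAL_CHARS = set('''-~`!##$%^&*()_=+[]{}\\|;:'",.<>/?''')

def has_special(word, specials=SPECIAL_CHARS):
    """Return True if any character of word is in the special set."""
    return any(ch in specials for ch in word)

def strip_special(word):
    """Return word with every special character removed (Python 3 translate)."""
    table = str.maketrans("", "", "".join(SPECIAL_CHARS))
    return word.translate(table)

print(has_special("a#bcd"))
print(has_special("a1bcd"))
print(strip_special("a#b-c"))
```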
I need to find the value of "taxid" in a large number of strings similar to one given below. For this particular string, the 'taxid' value is '9606'. I need to discard everything else. The "taxid" may appear anywhere in the text, but will always be followed by a ":" and then number.
score:0.86|taxid:9606(Human)|intact:EBI-999900
How do I write a regular expression for this in Python?
>>> import re
>>> s = 'score:0.86|taxid:9606(Human)|intact:EBI-999900'
>>> re.search(r'taxid:(\d+)', s).group(1)
'9606'
If there are multiple taxids, use re.findall, which returns a list of all matches:
>>> re.findall(r'taxid:(\d+)', s)
['9606']
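For a large number of strings, the same search can be wrapped in a small helper. This is a Python 3 sketch; the names TAXID and extract_taxid and the second and third sample lines are invented for illustration, and returning None for lines without a taxid is an assumption about the desired behaviour.

```python
import re

TAXID = re.compile(r"taxid:(\d+)")

def extract_taxid(line):
    """Return the digits after 'taxid:' as a string, or None if absent."""
    m = TAXID.search(line)
    return m.group(1) if m else None

lines = [
    "score:0.86|taxid:9606(Human)|intact:EBI-999900",
    "taxid:10090(Mouse)|score:0.40",   # hypothetical: taxid as the first field
    "score:0.10|intact:EBI-000000",    # hypothetical: no taxid at all
]
print([extract_taxid(line) for line in lines])
```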
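for line in lines:
    match = re.match(r".*\|taxid:([^|]+)\|.*", line)
    if match:  # None when the line has no |taxid:...| field between pipes
        print match.groups()  # captures everything up to the next "|"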
I'm running the search below in IDLE, in Python 2.7, in a Windows Bus. 64 bit environment.
According to RegexBuddy, the search pattern ('patternalphaonly') should not produce a match against a string of digits.
I looked at "http://docs.python.org/howto/regex.html", but did not see anything there that would explain why the search and match appear to be successful in finding something matching the pattern.
Does anyone know what I'm doing wrong, or misunderstanding?
>>> import re
>>> numberstring = '3534543234543'
>>> patternalphaonly = re.compile('[a-zA-Z]*')
>>> result = patternalphaonly.search(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
>>> result = patternalphaonly.match(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
Thanks
The star operator (*) indicates zero or more repetitions. Your string has zero repetitions of an English alphabet letter because it is entirely numbers, which is perfectly valid when using the star (repeat zero times). Instead use the + operator, which signifies one or more repetitions. Example:
>>> n = "3534543234543"
>>> r1 = re.compile("[a-zA-Z]*")
>>> r1.match(n)
<_sre.SRE_Match object at 0x07D85720>
>>> r2 = re.compile("[a-zA-Z]+") #using the + operator to make sure we have at least one letter
>>> r2.match(n)
Everything eldarerathis says is true. However, with a variable named 'patternalphaonly', I would assume that the author wants to verify that a string is composed of alpha chars only. If this is true, then I would add start-of-string and end-of-string anchors to the regex like so:
patternalphaonly = re.compile('^[a-zA-Z]+$')
result = patternalphaonly.search(numberstring)
Or, better yet, since this will only ever match at the beginning of the string, use the preferred match method:
patternalphaonly = re.compile('[a-zA-Z]+$')
result = patternalphaonly.match(numberstring)
(Which, as John Machin has pointed out, is evidently faster for some as-yet unexplained reason.)
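The difference between the quantifiers and the anchoring can be checked directly. This is a Python 3 sketch; is_alpha_only is a hypothetical helper, and re.fullmatch (available from Python 3.4) is swapped in for the anchored patterns above. One reason to consider it: $ also matches just before a trailing newline, so the anchored pattern accepts "abc\n" while fullmatch does not.

```python
import re

digits = "3534543234543"

# '*' succeeds on a digit string by matching zero letters at position 0.
star = re.match(r"[a-zA-Z]*", digits)
print(repr(star.group(0)))

# '+' needs at least one letter, so there is no match.
plus = re.match(r"[a-zA-Z]+", digits)
print(plus)

def is_alpha_only(text):
    """Return True when text consists solely of ASCII letters."""
    return re.fullmatch(r"[a-zA-Z]+", text) is not None

print(is_alpha_only("abc"), is_alpha_only("abc1"), is_alpha_only("abc\n"))
```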