Python Regexes differ in Pythex vs. console - python

I am using Pythex to test out two regexes, and I get the result I'm hoping for in Pythex, however, when I run these regexes against test strings in the console or while running the program, I don't get the match I'm expecting.
The first regex is supposed to check that the string has a pair of letters which occur at least twice in the string (but this pair does not overlap). So, "xyxy" and "aabcdefgaa" are valid, while "aaa" is not, since the a's overlap. Here is a link to the Pythex regex, where it's working: http://pythex.org/?regex=(.)%7B1%7D.(.)%7B1%7D.%5C1.%5C2.&test_string=qjhvhtzxzqqjkmpb&ignorecase=0&multiline=0&dotall=0&verbose=0. and here is the console output of the same regex & string in the python console (2.7):
>>> import re
>>> pair_of_letters = re.compile('(.){1}.*(.){1}.*\1.*\2.*')
>>> string = "qjhvhtzxzqqjkmpb"
>>> match = pair_of_letters.match(string); print match
None
The second regex is supposed to check that the string has a pair of letters with exactly one character between them, e.g, "xyx", "abcdefeghi", or "aaa". Again, here's a link to Pythex: http://pythex.org/?regex=(.)%7B1%7D.%7B1%7D%5C1&test_string=qjhvhtzxzqqjkmpb&ignorecase=0&multiline=0&dotall=0&verbose=0 and below I've pasted the Python console output:
>>> repeated_letter_with_one_between = re.compile('(.){1}.{1}\1')
>>> string = "qjhvhtzxzqqjkmpb"
>>> match = repeated_letter_with_one_between.match(string); print match
None
Does anyone know what might account for the discrepancy? Thanks in advance.

Use raw strings to define a regex, or \1 will be interpreted as ASCII 01.
pair_of_letters = re.compile(r'(.).*(.).*\1.*\2.*')
repeated_letter_with_one_between = re.compile(r'(.).\1')
To illustrate:
>>> "\1"
'\x01'
>>> r"\1"
'\\1'
>>> print("\1")
�
>>> print(r"\1")
\1

Related

Word boundary detection doesn't seem to work for me in Python re

I tried using:
>>> wbpat='\btest\b'
>>> re.findall(wbpat, 'a test tested in testing')
The result that expected to get was ['test'] but somehow I am getting an empty list. What could be the problem...
\b is an escape code for a backspace (length 1 string). Use r'\btest\b'. The leading r indicates to the Python interpreter that it should interpret each character in the string as a literal single character (a "raw" string) and ignore escape sequences.
Example:
>>> len('\btest\b') # <backspace>test<backspace>
6
>>> len(r'\btest\b') # <backslash>btest<backslash>b
8
>>> import re
>>> re.findall(r'\btest\b','a test tested in testing')
['test']
It's a good habit to use a raw string for regular expressions in Python.

Regex to retrieve the last few characters of a string

Regex to retrieve the last portion of a string:
https://play.google.com/store/apps/details?id=com.lima.doodlejump
I'm looking to retrieve the string followed by id=
The following regex didn't seem to work in python
sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
re.search("id=(.*?)", sampleURL).group(1)
The above should give me an output:
com.lima.doodlejump
Is my search group right?
Your regular expression
(.*?)
will not work because, it will match between zero and unlimited times, as few times as possible (becasue of the ?). So, you have the following choices of RegEx
(.*) # Matches the rest of the string
(.*?)$ # Matches till the end of the string
But, you don't need RegEx at all here, simply split the string like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
print data.split("id=", 1)[-1]
Output
com.lima.doodlejump
If you really have to use RegEx, you can do like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
import re
print re.search("id=(.*)", data).group(1)
Output
com.lima.doodlejump
I'm surprised that nobody has mentioned urlparse yet...
>>> s = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> urlparse.urlparse(s)
ParseResult(scheme='https', netloc='play.google.com', path='/store/apps/details', params='', query='id=com.lima.doodlejump', fragment='')
>>> urlparse.parse_qs(urlparse.urlparse(s).query)
{'id': ['com.lima.doodlejump']}
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id']
['com.lima.doodlejump']
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id'][0]
'com.lima.doodlejump'
The HUGE advantage here is that if the url query string gets more components then it could easily break the other solutions which rely on a simple str.split. It won't confuse urlparse however :).
Just split it in the place you want:
id = url.split('id=')[1]
If you print id, you'll get:
com.lima.doodlejump
Regex isn't needed here :)
However, in case there are multiple id=s in your string, and you only wanted the last one:
id = url.split('id=')[-1]
Hope this helps!
This works:
>>> import re
>>> sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> re.search("id=(.+)", sampleURL).group(1)
'com.lima.doodlejump'
>>>
Instead of capturing non-greedily for zero or more characters, this code captures greedily for one or more.

python regular expression : How can I filter only special characters?

I want to check either given words contain special character or not.
so below is my python code
The literal 'a#bcd' has '#', so it will be matchd and it's ok.
but 'a1bcd' has no special character. but it was filtered too!!
import re
regexp = re.compile('[~`!##$%^&*()-_=+\[\]{}\\|;:\'\",.<>/?]+')
if regexp.search('a#bcd') :
print 'matched!! nich catch!!'
if regexp.search('a1bcd') :
print 'something is wrong here!!!'
result :
python ../special_char.py
matched!! nich catch!!
something is wrong here!!!
I have no idea why it works like above..someone help me..T_T;;;
thanks~
Move the dash in you regular expression to the start of the [] group, like this:
regexp = re.compile('[-~`!##$%^&*()_=+\[\]{}\\|;:\'\",.<>/?]+')
Where you had the dash, it was read with the surrounding characters as )-_ and since it is inside [] it is interpreted as asking to match a range from ) to _. If you move the dash to just after the [ it has no special meaning and instead matches itself.
Here's an interactive session showing the specific problem there was in your regular expression:
>>> import re
>>> print re.search('[)-_]', 'abcd')
None
>>> print re.search('[)-_]', 'a1b')
<_sre.SRE_Match object at 0x7f71082247e8>
>>> print re.search('[)-_]', 'a1b').group(0)
1
After fixing it:
>>> print re.search('[-)_]', 'a1b')
None
Unless there's some reason not visible in your question, I'd also say that the final + is not needed.
re will be relatively slow for this
I'd suggest trying
specialchars = '''-~`!##$%^&*()_=+[]{}\\|;:'",.<>/?'''
len(word) != len(word.translate(None, specialchars))
or
set(word) & set(specialchars)

python regular expression to match strings

I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droidllwp`
string2: 1.1
Because there are multiple uses-permission, I want to get permission as a list, contains:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the python regular expression to get the strings I want?
Thanks.
Assuming the code block you provided is one long string, here stored in a variable called input_string:
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): check ahead of the main string in order to return only strings that are preceded by name='. The \ in front of = and ' serve to escape them so that the regex knows we're talking about the = string and not a regex command. name=' is not also returned when we get the result, we just know that the results we get are all preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character and underscore. \. is an escaped period, so the regex knows we mean . and not the regex command represented by an unescaped period. Putting these in [] means we're okay with anything we've stuck in brackets, so we're saying that we'll accept any alphanumeric character, _, or .. + afterwords means at least one of the previous thing, meaning at least one (but possibly more) of [\w\.]. Finally, the ? means don't be greedy--we're telling the regex to get the smallest possible group that meets these specifications, since + could go on for an unlimited number of repeats of anything matched by [\w\.].
(?=\'): check behind the main string in order to return only strings that are followed by '. The \ is also an escape, since otherwise regex or Python's string execution might misinterpret '. This final ' is not returned with our results, we just know that in the original string, it followed any result we do end up getting.
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
... if s.startswith('package'):
... return [i.split('=')[1] for i in s.split() if "=" in i]
... elif s.startswith('uses-permission'):
... return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
Here is one example code
#!/usr/bin/env python
inputFile = open("test.txt", "r").readlines()
for line in inputFile:
if line.startswith("package"):
words = line.split()
string1 = words[1].split("=")[1].replace("'","")
string2 = words[3].split("=")[1].replace("'","")
test.txt file contains input data you mentioned earlier..

Python regular expression; why do the search & match appear to find alpha chars in a number string?

I'm running search below Idle, in Python 2.7 in a Windows Bus. 64 bit environment.
According to RegexBuddy, the search pattern ('patternalphaonly') should not produce a match against a string of digits.
I looked at "http://docs.python.org/howto/regex.html", but did not see anything there that would explain why the search and match appear to be successful in finding something matching the pattern.
Does anyone know what I'm doing wrong, or misunderstanding?
>>> import re
>>> numberstring = '3534543234543'
>>> patternalphaonly = re.compile('[a-zA-Z]*')
>>> result = patternalphaonly.search(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
>>> result = patternalphaonly.match(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
Thanks
The star operator (*) indicates zero or more repetitions. Your string has zero repetitions of an English alphabet letter because it is entirely numbers, which is perfectly valid when using the star (repeat zero times). Instead use the + operator, which signifies one or more repetitions. Example:
>>> n = "3534543234543"
>>> r1 = re.compile("[a-zA-Z]*")
>>> r1.match(n)
<_sre.SRE_Match object at 0x07D85720>
>>> r2 = re.compile("[a-zA-Z]+") #using the + operator to make sure we have at least one letter
>>> r2.match(n)
Helpful link on repetition operators.
Everything eldarerathis says is true. However, with a variable named: 'patternalphaonly' I would assume that the author wants to verify that a string is composed of alpha chars only. If this is true then I would add additional end-of-string anchors to the regex like so:
patternalphaonly = re.compile('^[a-zA-Z]+$')
result = patternalphaonly.search(numberstring)
Or, better yet, since this will only ever match at the beginning of the string, use the preferred match method:
patternalphaonly = re.compile('[a-zA-Z]+$')
result = patternalphaonly.match(numberstring)
(Which, as John Machin has pointed out, is evidently faster for some as-yet unexplained reason.)

Categories