Regex not working normally in python - python

I have this pattern: "(\?(.+?))\b".
In Python, findall should return ("?var", "var") when I run it on the string "some text ?var etc".
It works normally elsewhere, here's a regexr for proof.
In python, re's findall returns an empty list. Why is that?

You're not using raw string notation, so Python reads "\b" as a backspace character rather than a regex word boundary:
>>> import re
>>> re.findall(r'(\?(.+?))\b', 'some text ?var etc')
[('?var', 'var')]
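A small sketch that makes the difference visible. The variable names are only illustrative, and the non-raw pattern is written with a doubled backslash before the question mark purely to avoid an invalid-escape warning on recent Python versions:

```python
import re

text = "some text ?var etc"

# In a normal string literal "\b" is the backspace character (\x08).
# "\\?" yields the same two characters as r"\?".
plain = "(\\?(.+?))\b"
raw = r"(\?(.+?))\b"

print(repr(plain[-1]))          # '\x08'
print(raw[-2:])                 # \b
print(re.findall(plain, text))  # []
print(re.findall(raw, text))    # [('?var', 'var')]
```

The non-raw pattern only matches if the text really contains a backspace after the variable name, which is why findall comes back empty.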

Related

Regex check if backslash before every symbols using python

I ran into a problem while trying to check whether an input regex is correct.
I'd like to check that there is a backslash before every symbol, but I don't know how to implement this in Python.
For example:
number: 123456789. (return False)
phone\:111111 (return True)
I tried to use (?!) and (?=) in Python, but it didn't work.
Update:
I'd like to match the following strings:
\~, \!, \#, \$, \%, \^, \&, \*, \(, \), \{, \}, \[, \], \:, \;, \", \', \>, \<, \?
Thank you very much.
import re
if re.search(r"\\\W", r"phone\:111111") is not None:
    print("OK")
Does it work?
Reading between the lines a bit, it sounds like you are trying to pass a string to a regex and you want to make sure it has no special characters in it that are unescaped.
Python's re module has an inbuilt re.escape() function for this.
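Example (output from Python versions before 3.7; since 3.7, re.escape() escapes only characters that are special in a regex, so the colon is left unchanged):
>>> import re
>>> print(re.escape("phone:111111"))
phone\:111111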
Check that the entire string is composed of single characters or pairs of backslash+symbol:
import re
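def has_backslash_before_every_symbol(s):
    # Each unit is either backslash+symbol or a single non-symbol character.
    # A triple-quoted raw string lets the pattern contain both ' and ".
    pattern = r'''^(\\[~!#$%^&*(){}\[\]:;"'><?]|[^~!#$%^&*(){}\[\]:;"'><?])*$'''
    return re.match(pattern, s) is not None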
Python regex reference: https://docs.python.org/3/library/re.html
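Since the question mentions lookarounds, here is a sketch of a different approach: a negative lookbehind that searches for any symbol not immediately preceded by a backslash. The names SYMBOLS, UNESCAPED and all_symbols_escaped are made up for this example, and the symbol set is copied from the list in the question. One known limitation is that an escaped backslash followed by a symbol (two backslashes, then a colon) is treated as escaped.

```python
import re

# Symbols taken from the list in the question (hypothetical constant name).
SYMBOLS = r"""~!#$%^&*(){}\[\]:;"'><?"""

# Matches a symbol that is NOT immediately preceded by a backslash.
UNESCAPED = re.compile(r"(?<!\\)[" + SYMBOLS + "]")

def all_symbols_escaped(s):
    """Return True if no listed symbol appears without a leading backslash."""
    return UNESCAPED.search(s) is None

print(all_symbols_escaped("number: 123456789."))  # False
print(all_symbols_escaped(r"phone\:111111"))      # True
```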

Regex matching specific HTML string with Python

The pattern is as follows
page_pattern = 'manual-data-link" href="(.*?)"'
The matching function is as follows, where pattern is one of the predefined patterns like the above page_pattern
def get_pattern(pattern, string, group_num=1):
    escaped_pattern = re.escape(pattern)
    match = re.match(re.compile(escaped_pattern), string)
    if match:
        return match.group(group_num)
    else:
        return None
The problem is that match is always None, even though I made sure it works correctly with http://pythex.org/. I suspect I'm not compiling/escaping the pattern correctly.
Test string
<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>
You have three problems.
1) You shouldn't call re.escape in this case. re.escape prevents special characters (like ., *, or ?) from having their special meanings. You want them to have special meanings here.
2) You should use re.search, not re.match. re.match matches only at the beginning of the string; you want to find a match anywhere inside the string.
3) You shouldn't parse HTML with regular expressions. Use a tool designed for the job, like BeautifulSoup.
re.match tries to match from the beginning of the string. Since the text you're trying to match is in the middle, you need to use re.search instead of re.match:
>>> import re
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> re.search(r'manual-data-link" href="(.*?)"', s).group(1)
'/data/123421'
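Applied to the helper from the question, the first two points could look like the sketch below. It keeps the original function name and signature, drops re.escape, and swaps re.match for re.search; the test string is the one from the question, split across lines only for readability.

```python
import re

page_pattern = r'manual-data-link" href="(.*?)"'

def get_pattern(pattern, string, group_num=1):
    # No re.escape(): the pattern's metacharacters must keep their meaning.
    # re.search() scans the whole string instead of anchoring at index 0.
    match = re.search(pattern, string)
    return match.group(group_num) if match else None

test_string = ('<a class="rarity-5 set-102 manual-data-link" '
               'href="/data/123421" data-id="20886" '
               'data-type-id="295636317" >Data</a>')

print(get_pattern(page_pattern, test_string))    # /data/123421
print(get_pattern(page_pattern, "<p>none</p>"))  # None
```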
Use an HTML parser like BeautifulSoup to parse HTML files:
>>> from bs4 import BeautifulSoup
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> for i in soup.find_all('a', class_=re.compile('.*manual-data-link')):
...     print(i['href'])
...
/data/123421
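If installing BeautifulSoup is not an option, the standard library's html.parser can do the same job with a little more code. This is a swapped-in alternative rather than what the answers above use, and the class name LinkCollector is invented for the sketch:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])

s = ('<a class="rarity-5 set-102 manual-data-link" href="/data/123421" '
     'data-id="20886" data-type-id="295636317" >Data</a>')
collector = LinkCollector("manual-data-link")
collector.feed(s)
print(collector.hrefs)  # ['/data/123421']
```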

Python match regex always returning None

I have a Python regex whose match method always returns None. I tested it on the pythex site and the pattern seems OK.
Pythex example
But when I try with re module, the result is always None:
import re
a = re.match(re.compile("\.aspx\?.*cp="), 'page.aspx?cpm=549&cp=168')
What am I doing wrong?
re.match() only matches at the start of a string. Use re.search() instead:
re.search(r"\.aspx\?.*cp=", 'page.aspx?cpm=549&cp=168')
Demo:
>>> import re
>>> re.search(r"\.aspx\?.*cp=", 'page.aspx?cpm=549&cp=168')
<_sre.SRE_Match object at 0x105d7e440>
>>> re.search(r"\.aspx\?.*cp=", 'page.aspx?cpm=549&cp=168').group(0)
'.aspx?cpm=549&cp='
Note that all re functions that take a pattern also accept a string and will call re.compile() for you (which caches compilation results). You only need to use re.compile() if you want to store the compiled expression for reuse, at which point you can call pattern.search() on it:
pattern = re.compile(r"\.aspx\?.*cp=")
pattern.search('page.aspx?cpm=549&cp=168')
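To see the anchoring behaviour side by side, here is a short sketch that runs match() and search() on the same compiled pattern, using the URL from the question. The last line shows that match() can still succeed if the pattern itself is written to consume the text before the part of interest.

```python
import re

pattern = re.compile(r"\.aspx\?.*cp=")
url = "page.aspx?cpm=549&cp=168"

# match() only tries position 0, where the text is "p", not ".".
print(pattern.match(url))            # None

# search() scans forward until the pattern fits.
found = pattern.search(url)
print(found.group(0))                # .aspx?cpm=549&cp=
print(found.span())                  # (4, 21)

# match() works if the pattern accounts for the leading text.
print(re.match(r".*\.aspx\?.*cp=", url) is not None)  # True
```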

What is the syntax for evaluating string matches on regular expressions?

How do I determine if a string matches a regular expression?
I want to find True if a string matches a regular expression.
Regular expression:
r".*apps\.facebook\.com.*"
I tried:
if string == r".*apps\.facebook\.com.*":
But that doesn't seem to work.
From the Python docs on the re module:
import re
if re.search(r'.*apps\.facebook\.com.*', stringName):
    print('Yay, it matches!')
This works because re.search returns a match object if the pattern is found, or None if it is not.
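If a literal True or False is wanted, as the question asks, the result can be wrapped in bool(). A minimal sketch follows; the helper name and the sample URLs are invented for illustration. The leading and trailing .* are dropped because search() already looks anywhere in the string.

```python
import re

FACEBOOK_APPS = re.compile(r"apps\.facebook\.com")

def mentions_facebook_apps(s):
    # search() returns a match object or None; bool() maps that to True/False.
    return bool(FACEBOOK_APPS.search(s))

print(mentions_facebook_apps("https://apps.facebook.com/some_app"))  # True
print(mentions_facebook_apps("https://appsXfacebook.com/"))          # False
```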
You have to import the re module and test it that way:
import re
if re.match(r'.*apps\.facebook\.com.*', string):
    pass  # it matches!
You can use re.search instead of re.match if you want to search for the pattern anywhere in the string. re.match will only match if the pattern can be located at the beginning of the string.
import re
match = re.search(r'.*apps\.facebook\.com.*', string)
You're looking for re.match():
import re
if (re.match(r'.*apps\.facebook\.com.*', string)):
    do_something()
Or, if you want to match the pattern anywhere in the string, use re.search().
Why don't you also read through the Python documentation for the re module?

Using Regex Plus Function in Python to Encode and Substitute

I'm trying to substitute something in a string in python and am having some trouble. Here's what I'd like to do.
For a given comment in my posting:
"here are some great sites that i will do cool things with! https://stackoverflow.com/it's a pig & http://google.com"
I'd like to use Python to turn it into a string like this:
"here are some great sites that i will do cool things with! http%3A//stackoverflow.com & http%3A//google.com"
Here's what I have so far...
import re
import urllib.parse  # Python 3; in Python 2 this was urllib.quote
def getExpandedURL(url):
    encoded_url = urllib.parse.quote(url)
    return ""+encoded_url+""
text = '<text from above>'
url_pattern = re.compile('(http.+?[^ ]+)', re.I | re.S | re.M)
url_iterator = url_pattern.finditer(text)
for matched_url in url_iterator:
    getExpandedURL(matched_url.groups(1)[0])
But this is where I'm stuck. I've previously seen things on here like this: Regular Expressions but for Writing in the Match, but surely there's got to be a better way than iterating through each match and doing a position replace on them. The difficulty here is that it's not a straight replace; I need to do something specific with each match before replacing it.
I think you want url_pattern.sub(getExpandedURL, text).
re.sub(pattern, repl, string, count=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl. repl can be either a string or a callable; if a callable, it's passed the match object and must return a replacement string to be used.
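One detail worth spelling out: the callable receives a match object, not the matched string, so a function that expects a URL string needs to pull the text out with group() first. A rough sketch, assuming Python 3's urllib.parse.quote and a simplified URL pattern (URL_PATTERN, encode_match and the sample text are placeholders, not the asker's code):

```python
import re
from urllib.parse import quote

# Simplified pattern: "http://" or "https://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+", re.I)

def encode_match(match):
    # re.sub() passes a match object to the callable, not a string.
    return quote(match.group(0))

text = "see http://google.com and http://example.com/a b"
print(URL_PATTERN.sub(encode_match, text))
# see http%3A//google.com and http%3A//example.com/a b
```

quote() leaves "/" untouched by default, which is why only the colon is percent-encoded here.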
