Using Python 2.7.3 on Linux. Here is a shell session verbatim.
>>> f = open("feed.xml")
>>> text = f.read()
>>> import re
>>> regexp1 = re.compile(r'</?item>')
>>> regexp2 = re.compile(r'<item>.*</item>')
>>> regexp1.findall(text)
['<item>', '</item>', '<item>', '</item>', '<item>', '</item>', '<item>', '</item>']
>>> regexp2.findall(text)
[]
Is this a bug, or is there something I'm not understanding about Python regular expressions?
By default, '.' does not match a newline. Try with
regexp2 = re.compile(r'<item>.*</item>', re.DOTALL)
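To see the difference side by side, here is a minimal sketch. The sample string stands in for the contents of feed.xml and is made up for illustration. It also shows the non-greedy .*?, because with re.DOTALL a greedy .* runs from the first <item> all the way to the last </item>.

```python
import re

# Hypothetical stand-in for the contents of feed.xml.
sample = (
    "<item>\n  <title>a</title>\n</item>\n"
    "<item>\n  <title>b</title>\n</item>\n"
)

# Without DOTALL, '.' stops at each newline, so nothing matches.
no_flag = re.findall(r'<item>.*</item>', sample)

# With DOTALL, greedy '.*' spans from the first <item> to the last </item>.
greedy = re.findall(r'<item>.*</item>', sample, re.DOTALL)

# Non-greedy '.*?' stops at the nearest </item>: one match per item.
lazy = re.findall(r'<item>.*?</item>', sample, re.DOTALL)

print("%d %d %d" % (len(no_flag), len(greedy), len(lazy)))  # prints: 0 1 2
```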
Here is the best answer to this question: don't use regular expressions to parse non-regular languages such as XML. It drove one Stack Overflow user insane.
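As a rough sketch of the parser route, the standard library's xml.etree.ElementTree can pull out the items directly. The feed string and the title element are assumptions for illustration; the real feed.xml may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; the real file's structure is not shown in the question.
feed = (
    "<rss><channel>"
    "<item><title>a</title></item>"
    "<item><title>b</title></item>"
    "</channel></rss>"
)

root = ET.fromstring(feed)

# iter() walks the whole tree, so <item> elements are found at any depth.
titles = [item.findtext("title") for item in root.iter("item")]
print(titles)
```

For a file on disk, ET.parse("feed.xml").getroot() would give the same kind of root element.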
Is there a way of toggling the compilation or use of metacharacters when compiling regexes? The current code looks like this:
import re
the_value = '192.168.1.1'
the_regex = re.compile(the_value)
my_collection = ['192a168b1c1', '192.168.1.1']
my_collection.find_matching(the_regex)
result = ['192a168b1c1', '192.168.1.1']
The ideal solution would look like:
import re
the_value = '192.168.1.1'
the_regex = re.compile(the_value, use_metacharacters=False)
my_collection = ['192a168b1c1', '192.168.1.1']
my_collection.find_matching(the_regex)
result = ['192.168.1.1']
The ideal solution would let the re library handle the disabling of metacharacters, to avoid having to get involved in the process as much as possible.
Nope. However:
the_regex = re.compile(re.escape(the_value))
Use the re.escape() function for this.
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('192.168.1.1')
'192\\.168\\.1\\.1'
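A small sketch of the effect on the list from the question. Since find_matching is not a real list method, a list comprehension stands in for it here.

```python
import re

the_value = '192.168.1.1'
my_collection = ['192a168b1c1', '192.168.1.1']

# Unescaped: each '.' matches any character, so both strings match.
unescaped = re.compile(the_value)
# Escaped: each '.' matches only a literal dot.
escaped = re.compile(re.escape(the_value))

loose = [s for s in my_collection if unescaped.search(s)]
strict = [s for s in my_collection if escaped.search(s)]

print(loose)
print(strict)
```

If no other regex features are needed, a plain substring test such as the_value in s does the same job without the re module.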
Please help with my regex problem
Here is my string
source="http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"
source_resource="pf_rd_m=ATVPDKIKX0DER"
The source_resource appears inside source and may be followed by & or by . (for example).
So far,
regex = re.compile("pf_rd_m=ATVPDKIKX0DER+[&.]")
regex.findall(source)
[u'pf_rd_m=ATVPDKIKX0DER&']
I have hard-coded the text here. Rather than hard-coding it, how can I use the source_resource variable, followed by & or ., to find this?
If the goal is to extract the pf_rd_m value (which it apparently is, since you are using regex.findall), then I'm not sure regexes are the easiest solution here:
>>> import urlparse
>>> qs = urlparse.urlparse(source).query
>>> urlparse.parse_qs(qs)
{'pf_rd_m': ['ATVPDKIKX0DER'], 'pf_rd_i': ['3421']}
>>> urlparse.parse_qs(qs)['pf_rd_m']
['ATVPDKIKX0DER']
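In Python 3 the same functions live in urllib.parse. A sketch that runs on either version, using the URL from the question:

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

source = "http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(urlparse(source).query)
value = params["pf_rd_m"][0]
print(value)
```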
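Inside a character class the . is already literal, so it needs no escaping there. What should be escaped is the variable itself, in case it contains metacharacters:
pattern = re.compile(re.escape(source_resource) + r'[&.]')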
You can just build the string for the regular expression like a normal string, utilizing all string-formatting options available in Python:
import re
source_and="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER&"
source_dot="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER."
source_resource="pf_rd_m=ATVPDKIKX0DER"
regex_string = source_resource + "[&\.]"
regex = re.compile(regex_string)
print regex.findall(source_and)
print regex.findall(source_dot)
>>> ['pf_rd_m=ATVPDKIKX0DER&']
['pf_rd_m=ATVPDKIKX0DER.']
I hope this is what you mean.
Just note that I modified your regular expression. I removed the +, which would have applied only to the final R (I assumed the string occurs only once, so no repetition is needed). I also escaped the ., which is harmless but not strictly required, because . is already literal inside a character class.
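A slightly more defensive sketch of the same idea passes the variable through re.escape() first, in case it ever contains metacharacters. The example.com URLs are placeholders, not the asker's data.

```python
import re

source_resource = "pf_rd_m=ATVPDKIKX0DER"

# Escape the variable part, then append the character class as a raw string.
pattern = re.compile(re.escape(source_resource) + r"[&.]")

with_amp = pattern.findall("http://example.com/?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421")
with_dot = pattern.findall("http://example.com/pf_rd_m=ATVPDKIKX0DER.")
no_match = pattern.findall("http://example.com/pf_rd_m=ATVPDKIKX0DERX")

print(with_amp)
print(with_dot)
print(no_match)
```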
I've just made the switch from Perl to Python and am disappointed by the re module. I'm looking for the equivalent of $1 in Python, or any other special variables in regular expressions. In Perl I would use this:
$_ = "<name>Joe</name>";
s/<(.*)>(.*)<[/](.*)>/$2/;
I'm trying to do the same in Python. Thanks!
In Python you can use \2 as a backreference in the replacement string, or read the groups from the match object.
For example:
>>> re.sub(r'(\w+) (\w+)',r'\2 \1','Joe Bob')
'Bob Joe'
Or named substitutions (a Python innovation later ported to Perl):
>>> re.sub(r'(?P<First>\w+) (?P<Second>\w+)',r'\g<Second> \g<First>','Joe Bob')
'Bob Joe'
>>> ma=re.search(r'(?P<First>\w+) (?P<Second>\w+)','George Bush')
>>> ma.group(1)
'George'
>>> ma.group('Second')
'Bush'
Admittedly, though, Python's re module is a little weak compared with recent versions of Perl.
For a first-class regex module, install the newer third-party regex module. It was proposed for inclusion in the standard library around Python 3.4 but remains a separate package on PyPI, and it is very good.
You want the re.MatchObject.group() method.
import re
var = "<name>Joe</name>"
match = re.search(r"<(.*)>(.*)<[/](.*)>", var)
print match.group(2)
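Putting the two approaches together, here is a sketch that assumes the intended pattern captures the tag name and the text with (.*). It reads the groups (the rough equivalent of $1, $2, $3) and also does the substitution, guarding against a failed match since re.search() returns None in that case.

```python
import re

var = "<name>Joe</name>"
pattern = re.compile(r"<(.*)>(.*)</(.*)>")

match = pattern.search(var)
if match:
    # groups() returns all captures at once, like ($1, $2, $3) in Perl.
    opening, text, closing = match.groups()
    print(text)

# Equivalent of the Perl substitution s/.../$2/
replaced = pattern.sub(r"\2", var)
print(replaced)
```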
It looks like you are using regex to parse a tag-based markup language such as XML. See the following link on why you should use a parser such as ElementTree instead: https://stackoverflow.com/a/1732454/1032785
I can't seem to find a good resource on this. I am trying to do a simple regex replace.
I want to replace the part matched by (.*?), but can't figure out the syntax. I know how to do it in PHP, so I've been experimenting based on that (which is why the code has $1, although I know that isn't correct in Python). I'd appreciate it if anyone could show the proper syntax. I'm not asking about any particular string, just how to do a replacement like this, including when the pattern has more than one () group. Thanks.
originalstring = 'fksf var:asfkj;'
pattern = '.*?var:(.*?);'
replacement_string='$1' + 'test'
replaced = re.sub(re.compile(pattern, re.MULTILINE), replacement_string, originalstring)
>>> import re
>>> originalstring = 'fksf var:asfkj;'
>>> pattern = '.*?var:(.*?);'
>>> pattern_obj = re.compile(pattern, re.MULTILINE)
>>> replacement_string="\\1" + 'test'
>>> pattern_obj.sub(replacement_string, originalstring)
'asfkjtest'
Edit: The Python docs can be a pretty useful reference.
>>> import re
>>> regex = re.compile(r".*?var:(.*?);")
>>> regex.sub(r"\1test", "fksf var:asfkj;")
'asfkjtest'
The Python docs are online, and the page for the re module is here: http://docs.python.org/library/re.html
To answer your question though, Python uses \1 rather than $1 to refer to matched groups.
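For the case with more than one group, here is a sketch with a made-up two-group pattern. It uses the \g<n> form, which is needed when a digit follows the reference (r"\20" would mean group 20), and shows that the replacement can also be a function.

```python
import re

# Hypothetical pattern with two groups: the word before " var:" and the value.
regex = re.compile(r"(\w+) var:(.*?);")
text = "fksf var:asfkj;"

# \g<2>0 means "group 2 followed by a literal 0".
swapped = regex.sub(r"\g<2>0 from \1", text)
print(swapped)

# A function receives the match object and returns the replacement text.
shouted = regex.sub(lambda m: m.group(2).upper(), text)
print(shouted)
```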
I have a web address:
http://www.example.com/org/companyA
I want to be able to pass CompanyA to a view using regular expressions.
This is what I have:
(r'^org/?P<company_name>\w+/$',"orgman.views.orgman")
and it doesn't match.
Ideally all URLs that look like example.com/org/X would pass X to the view.
Thanks in advance!
You need to wrap the group name in parentheses. The syntax for named groups is (?P<name>regex), not ?P<name>regex. Also, if you don't want to require a trailing slash, you should make it optional.
It's easy to test regular expression matching with the Python interpreter, for example:
>>> import re
>>> re.match(r'^org/?P<company_name>\w+/$', 'org/companyA')
>>> re.match(r'^org/(?P<company_name>\w+)/?$', 'org/companyA')
<_sre.SRE_Match object at 0x10049c378>
>>> re.match(r'^org/(?P<company_name>\w+)/?$', 'org/companyA').groupdict()
{'company_name': 'companyA'}
Your regex isn't valid. It should probably look like
r'^org/(?P<company_name>\w+)/$'
It should look more like r'^org/(?P<company_name>\w+)'
>>> r = re.compile(r'^org/(?P<company_name>\w+)')
>>> r.match('org/companyA').groups()
('companyA',)
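The three suggested patterns behave differently around the trailing slash, which a quick comparison makes visible. This is only a sketch: the labels and test paths are made up, and it assumes the path is matched without a leading slash, as in the examples above.

```python
import re

candidates = {
    "required slash": r'^org/(?P<company_name>\w+)/$',
    "optional slash": r'^org/(?P<company_name>\w+)/?$',
    "unanchored end": r'^org/(?P<company_name>\w+)',
}
paths = ['org/companyA', 'org/companyA/', 'org/companyA/extra']

# For each pattern, record which of the paths it accepts.
results = {}
for label, pattern in candidates.items():
    results[label] = [bool(re.match(pattern, p)) for p in paths]

for label in sorted(results):
    print("%s: %s" % (label, results[label]))
```

The pattern without a closing $ also accepts longer paths such as org/companyA/extra, which may or may not be what you want.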