Parse URL with a regex in Python

I want to extract the query names and values from a URL and display them.
For example, url='http://host:port_num/file/path/file1.html?query1=value1&query2=value2'
From this, I want to parse the query names and their values and print them.

Don't use a regex! Use the urlparse module (urllib.parse in Python 3).
>>> import urlparse
>>> urlparse.parse_qs(urlparse.urlparse(url).query)
{'query2': ['value2'], 'query1': ['value1']}
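
On Python 3 the same functions live in urllib.parse. A minimal sketch, assuming Python 3; the port 8080 is a made-up stand-in for port_num in the question:

```python
from urllib.parse import urlparse, parse_qs

# Example URL modelled on the question; host name and port are placeholders.
url = 'http://host:8080/file/path/file1.html?query1=value1&query2=value2'

# parse_qs maps each query name to a list of values,
# because a name may appear more than once in a query string.
params = parse_qs(urlparse(url).query)

for name, values in params.items():
    print(name, values)
```

Each value is a list; take values[0] if you expect every name to appear only once.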

I agree that it's better to use urlparse than a regular expression, but here is my regular expression anyway.
Modules like urlparse were developed specifically to handle all URLs and are much more reliable than a regular expression, so use them if you can.
>>> import re
>>> x = 'http://www.example.com:8080/abcd/dir/file1.html?query1=value1&query2=value2'
>>> query_pattern = r'(query\d+)=(\w+)'
>>> # query_pattern = r'(\w+)=(\w+)' is a more general pattern
>>> re.findall(query_pattern, x)
[('query1', 'value1'), ('query2', 'value2')]

Related

In Python, is there a way to capture the following YYYYMMDD-N from a URL

I am looking for a way to capture the following with either a regular expression or a built-in function in Python.
From /url-path/YYYYMMDD-N/url-path-cont I only need YYYYMMDD-N. Sometimes the -N is present and sometimes it is not. I have tried various methods, but so far all my attempts either stop at YYYYMMDD or capture part of /url-path-cont.
I would like to capture only the YYYYMMDD-N with the -N as optional whenever present.

There are probably better ways of doing this, but as long as the date is always preceded by the same number of / characters, you could use the split method:
url_path = "/url-path/YYYYMMDD-N/url-path-cont"
date_only = url_path.split("/")[2]
print(date_only)

Here is a regular expression that will extract the date from a string.
>>> import re
>>> url = "url-path/YYYYMMDD-N/url-path-cont"
>>> re.compile(r"/(\w+-?\w?)/").search(url).group(1)
'YYYYMMDD-N'
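
If the segment really is eight digits with an optional -N suffix, a tighter pattern avoids matching other path segments by accident. A sketch, assuming N is one or more digits; the helper name extract_date and the sample paths are made up for illustration:

```python
import re

# Eight digits, optionally followed by "-" and one or more digits,
# preceded by a slash and followed by a slash or the end of the string.
DATE_SEGMENT = re.compile(r"/(\d{8}(?:-\d+)?)(?=/|$)")

def extract_date(path):
    """Return the YYYYMMDD or YYYYMMDD-N segment, or None if absent."""
    match = DATE_SEGMENT.search(path)
    return match.group(1) if match else None

print(extract_date("/url-path/20120131-2/url-path-cont"))
print(extract_date("/url-path/20120131/url-path-cont"))
```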

Is there a simple way to switch between using and ignoring metacharacters in Python regular expressions?

Is there a way of toggling the compilation or use of metacharacters when compiling regexes? The current code looks like this:
import re
the_value = '192.168.1.1'
the_regex = re.compile(the_value)  # each '.' matches any character
my_collection = ['192a168b1c1', '192.168.1.1']
result = [s for s in my_collection if the_regex.match(s)]
# result == ['192a168b1c1', '192.168.1.1']
The ideal solution would look like:
import re
the_value = '192.168.1.1'
# use_metacharacters is a wished-for keyword; re.compile() does not accept it
the_regex = re.compile(the_value, use_metacharacters=False)
my_collection = ['192a168b1c1', '192.168.1.1']
result = [s for s in my_collection if the_regex.match(s)]
# desired: result == ['192.168.1.1']
The ideal solution would let the re library handle the disabling of metacharacters, so that I have to get involved in the process as little as possible.

No, re.compile() has no such option. However, you can escape the value before compiling it:
the_regex = re.compile(re.escape(the_value))

Use the re.escape() function for this.
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('192.168.1.1')
'192\\.168\\.1\\.1'
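
Applied to the collection from the question, the difference looks roughly like this. A sketch assuming Python 3.4+ (for fullmatch); the variable names loose and literal are made up for illustration:

```python
import re

the_value = '192.168.1.1'
my_collection = ['192a168b1c1', '192.168.1.1']

# Unescaped: each "." matches any character, so both strings match.
loose = re.compile(the_value)
# Escaped: each "." matches only a literal dot.
literal = re.compile(re.escape(the_value))

loose_result = [s for s in my_collection if loose.fullmatch(s)]
literal_result = [s for s in my_collection if literal.fullmatch(s)]

print(loose_result)
print(literal_result)
```

If the value is always a plain literal, a direct comparison such as s == the_value does the same job without a regex.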

Extract string using regex

How can I extract the content (how are you) from the string:
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.
Can I use a regex for this? If so, what is a suitable regex?
Note: I don't want to use the split function to extract the result. Also, can you suggest some links for a beginner learning regex?
I am using Python 2.7.2.

You could use a regular expression for this (as Joey demonstrates).
However, that stops working once your XML document is any bigger than this one-liner, since XML is not a regular language.
Use BeautifulSoup (or another XML parser) instead:
>>> from BeautifulSoup import BeautifulSoup
>>> xml_as_str = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>. '
>>> soup = BeautifulSoup(xml_as_str)
>>> print soup.text
how are you.
Or...
>>> for string_tag in soup.findAll('string'):
...     print string_tag.text
...
how are you
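
The standard library's XML parser can also do this without a third-party package. A minimal sketch, assuming Python 3 and that the input is a single well-formed element (the trailing period from the question is left out):

```python
import xml.etree.ElementTree as ET

xml_as_str = ('<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">'
              'how are you</string>')

# fromstring returns the root element; .text is the text directly inside it.
root = ET.fromstring(xml_as_str)
content = root.text
print(content)
```

Note that ElementTree folds the namespace into the tag name, so root.tag here is '{http://schemas.microsoft.com/2003/10/Serialization/}string' rather than plain 'string'.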

Try the following regex:
/<[^>]*>(.*?)</

(?<=<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">)[^<]+(?=</string>)
would match what you want, as a trivial example.
(?<=>)[^<]+
would, too. It all depends a bit on how your input is formatted exactly.

This will match a specific tag and capture its content (replace "string" with the tag you want to match):
/<string[^<]*>(.*?)<\/string>/i
(the trailing i flag makes the match case-insensitive)
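
The slash-delimited patterns above use the /pattern/flags notation of other languages. In Python the pattern is passed as a plain string and the i flag becomes re.IGNORECASE. A sketch of that translation, assuming Python 3 and the sample string from the question:

```python
import re

text = ('<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">'
        'how are you</string>')

# [^<]* covers the tag's attributes; (.*?) lazily captures the element text.
# No delimiters are needed, so "/" does not have to be escaped.
pattern = re.compile(r'<string[^<]*>(.*?)</string>', re.IGNORECASE)

match = pattern.search(text)
content = match.group(1) if match else None
print(content)
```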

Split from a specific delimiter

How can I strip a URL like http://www.facebook.com/pages/create.php down to just www.facebook.com?
I tried this, but it doesn't work:
line.split('/', 2)[2]
My problem is probably with the two forward slashes (//), and some of the URLs start directly with www.
Thanks for your help, Adia

You might want to look at Python's urlparse module.
>>> from urlparse import urlparse
>>> o = urlparse('http://www.facebook.com/pages/create.php')
>>> o.netloc
'www.facebook.com'
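
The question mentions that some URLs start directly with www. Without a leading //, the parser treats the whole string as a path and netloc comes back empty. One possible workaround, sketched for Python 3 (the helper name host_of is made up for illustration):

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the network location of a URL, with or without a scheme."""
    # Without "//", urlsplit treats the whole string as a path,
    # so add a scheme-relative prefix for inputs like "www.example.com/x".
    if '//' not in url:
        url = '//' + url
    return urlsplit(url).netloc

print(host_of('http://www.facebook.com/pages/create.php'))
print(host_of('www.facebook.com/pages/create.php'))
```

This check is deliberately simple: a scheme-less URL that contains // later in its path would not get the prefix, and netloc still includes any port or user info, so treat it as a starting point rather than a complete solution.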

Probably the best bet would be to extract the server part with a regex, i.e.,
\/[a-z0-9\-\.]*[a-zA-Z0-9\-]+\.[a-z]{2,3}\/
That covers www.facebook.com, facebook.com, some-domain.tv, www.some-domain.net, etc.
NOTE: the leading and trailing slashes are part of the regex, not regex delimiters.

Try:
line.split("//", 1)[-1].split("/", 1)[0]

I would do:
ch[7 if ch[0:7]=='http://' else 0:].partition('/')[0]
I’m not sure it’s valid for all the cases you’ll encounter; for example, it does not strip an https:// prefix.
Also:
ch[(ch[0:7]=='http://')*7:].partition('/')[0]

Using Python Regular Expression in Django

I have a web address:
http://www.example.com/org/companyA
I want to be able to pass companyA to a view using regular expressions.
This is what I have:
(r'^org/?P<company_name>\w+/$',"orgman.views.orgman")
and it doesn't match.
Ideally, all URLs that look like example.com/org/X would pass X to the view.
Thanks in advance!

You need to wrap the group name in parentheses. The syntax for named groups is (?P<name>regex), not ?P<name>regex. Also, if you don't want to require a trailing slash, you should make it optional.
It's easy to test regular expression matching with the Python interpreter, for example:
>>> import re
>>> re.match(r'^org/?P<company_name>\w+/$', 'org/companyA')
>>> re.match(r'^org/(?P<company_name>\w+)/?$', 'org/companyA')
<_sre.SRE_Match object at 0x10049c378>
>>> re.match(r'^org/(?P<company_name>\w+)/?$', 'org/companyA').groupdict()
{'company_name': 'companyA'}

Your regex is missing the parentheses around the named group. It should probably look like
r'^org/(?P<company_name>\w+)/$'

It should look more like r'^org/(?P<company_name>\w+)' (note that without a trailing $ this also matches longer paths such as org/companyA/extra):
>>> r = re.compile(r'^org/(?P<company_name>\w+)')
>>> r.match('org/companyA').groups()
('companyA',)
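
The three suggested patterns differ only in how they treat the end of the path. A quick way to compare them with plain re, outside Django; the labels and sample paths are made up for illustration:

```python
import re

# The three patterns suggested in the answers above.
patterns = {
    'optional slash': r'^org/(?P<company_name>\w+)/?$',
    'required slash': r'^org/(?P<company_name>\w+)/$',
    'unanchored end': r'^org/(?P<company_name>\w+)',
}

paths = ['org/companyA', 'org/companyA/', 'org/companyA/extra']

results = {}
for label, pattern in patterns.items():
    for path in paths:
        match = re.match(pattern, path)
        # Store the captured name, or None when the pattern does not match.
        results[(label, path)] = match.group('company_name') if match else None
        print(label, repr(path), results[(label, path)])
```

Which behaviour is right depends on whether the site's URLs end in a slash and whether deeper paths under org/X should reach the same view.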
