Regex match string beginning with ?code= - python

I'm using Python and Django to match URLs for my site. I need to match a URL that looks like this:
/company/code/?code=34k3593d39k
The part after ?code= is any combination of letters and numbers, and any length.
I've tried this so far:
r'^company/code/(.+)/$'
r'^company/code/(\w+)/$'
r'^company/code/(\D+)/$'
r'^company/code/(.*)/$'
But so far none are catching the expression. Any ideas? Thanks

code=34k3593d39k is a GET parameter, and you don't need to define a pattern for it in the URL pattern. You can access it using request.GET.get('code') in the view. The pattern should be just:
r'^company/code/$'
Usage, accessing GET parameter:
def my_view(request):
    # Query-string values are available on request.GET
    code = request.GET.get('code')
    print(code)
Check the documentation:
The URLconf searches against the requested URL, as a normal Python
string. This does not include GET or POST parameters, or the domain
name.
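To illustrate the separation the documentation describes, here is a minimal standard-library sketch (no Django involved). It uses the URL from the question and assumes, as Django does, that the leading slash is dropped before matching, which is mimicked here with lstrip.

```python
import re
from urllib.parse import urlsplit, parse_qs

url = "/company/code/?code=34k3593d39k"
parts = urlsplit(url)

# Only the path is matched against URL patterns.
# Django also drops the leading slash; lstrip mimics that here.
path = parts.path.lstrip("/")
print(path)                                       # company/code/
print(bool(re.match(r"^company/code/$", path)))   # True

# The query string is parsed separately (Django exposes it as request.GET).
params = parse_qs(parts.query)
print(params["code"][0])                          # 34k3593d39k
```

This is why none of the capturing patterns in the question can ever see ?code=...: that part of the URL never reaches the pattern.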

The first pattern will work if you move the last / to just after the ^:
>>> import re
>>> re.match(r'^/company/code/(.+)$', '/company/code/?code=34k3593d39k')
<_sre.SRE_Match object at 0x0209C4A0>
>>> re.match(r'^/company/code/(.+)$', '/company/code/?code=34k3593d39k').groups()
('?code=34k3593d39k',)
>>>
Note too that the ^ is unnecessary because re.match matches from the start of the string:
>>> re.match(r'/company/code/(.+)$', '/company/code/?code=34k3593d39k').groups()
('?code=34k3593d39k',)
>>>

Related

Extracting [0-9_]+ from a URL

I've put together the following regular expression to extract image ID's from a URL:
''' Parse the post details from the full story page '''
def parsePostFromPermalink(session, permalink):
    r = session.get('https://m.facebook.com{0}'.format(permalink))
    dom = pq(r.content)
    # Parse the images, extract the ID's, and construct large image URL
    images = []
    for img in dom('a img[src*="jpg"]').items():
        if img.attr('src'):
            m = re.match(r'/([0-9_]+)n\.jpg/', img.attr('src'))
            images.append(m)
    return images
URL example:
https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/13645330_275977022775421_8826465145232985957_n.jpg?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224
I want this bit:
13645330_275977022775421_8826465145232985957
I've tested it on regex101 and it works: https://regex101.com/r/eS6eS7/2
img.attr('src') contains the correct URL and is not empty. I tested this. When I try to use m.group(0) I get an exception that group is not a function. m is None.
Am I doing something wrong?
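Two problems:
The enclosing /.../ delimiters are not part of Python regex syntax, so the pattern looks for literal slashes.
You should use re.search instead of re.match, because re.match only matches at the start of the string.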
Working example:
>>> url = "https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/13645330_275977022775421_8826465145232985957_n.jpg?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224"
>>> re.search(r'([0-9_]+)n\.jpg', url).group(0)
'13645330_275977022775421_8826465145232985957_n.jpg'
If you want just the number part, use this (group(1), and note the additional _):
>>> re.search(r'([0-9_]+)_n\.jpg', url).group(1)
'13645330_275977022775421_8826465145232985957'
This is the correct python code from Regex101. (There's a code generator on the left). Notice the lack of slashes on the outside of the regex...
import re
p = re.compile(r'([\d_]+)n\.jpg')
test_str = u"https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/c3.0.103.105/p110x80/13700209_937389626383181_6033441713767984695_n.jpg?efg=eyJpIjoiYiJ9&oh=a0b90ec153211eaf08a6b7c4cc42fb3b&oe=581E2EB8"
re.findall(p, test_str)
I'm not sure how you got m as None, but you might need to compile the pattern and use that to match first. Otherwise, try to fix the expression first
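A minimal sketch that ties the answers above together: it uses re.search, includes the extra underscore so only the ID is captured, and guards against a missing match so that calling .group() on None cannot happen. The function name extract_image_id and the second test URL are hypothetical, chosen only for illustration.

```python
import re

def extract_image_id(src):
    """Return the ID that precedes '_n.jpg' in src, or None if there is none."""
    # re.search scans the whole string; re.match would only try position 0.
    m = re.search(r"([0-9_]+)_n\.jpg", src)
    return m.group(1) if m else None

src = ("https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/"
       "13645330_275977022775421_8826465145232985957_n.jpg"
       "?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224")

print(extract_image_id(src))
# 13645330_275977022775421_8826465145232985957

# Hypothetical URL without an ID: the helper returns None instead of raising.
print(extract_image_id("https://example.com/no-id.png"))
# None
```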

Find string with regular expression in python

I am new to Python and I am trying to extract a piece of a string from another string.
I looked at other similar questions but I could not find my answer.
I have a variable that contains a list of URLs that look like this:
http://92.230.38.21/ios/Channel767/Hotbird.mp3
http://92.230.38.21/ios/Channel9798/Coldbird.mp3
....
I want the mp3 file name (in this example Hotbird, Coldbird etc)
I know I must be able to do it with re.findall() but I have no idea about regular expressions I need to use.
Any idea?
Update:
Here is the part I used:
for final in match2:
    netname=re.findall('\W+\//\W+\/\W+\/\W+\/\W+', final)
    print(final)
    print(netname)
That did not work. Then I tried this, which only extracts the IP address (92.230.38.21) but not the name:
for final in match2:
    netname=re.findall('\d+\.\d+\.\d+\.\d+', final)
    print(final)
You may just use str.split():
>>> urls = ["http://92.230.38.21/ios/Channel767/Hotbird.mp3", "http://92.230.38.21/ios/Channel9798/Coldbird.mp3"]
>>> for url in urls:
...     print(url.split("/")[-1].split(".")[0])
...
Hotbird
Coldbird
And here is an example regex-based approach:
>>> import re
>>>
>>> pattern = re.compile(r"/(\w+)\.mp3$")
>>> for url in urls:
...     print(pattern.search(url).group(1))
...
Hotbird
Coldbird
where we are using a capturing group (\w+) to capture the mp3 filename, consisting of one or more alphanumeric characters or underscores, which is followed by a dot and mp3 at the end of the URL.
How about ?
([^/]*mp3)$
I think that might work
Basically it says...
Match from the end of the line, start with mp3, then match everything back to the first slash.
Think it will perform well.
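For comparison, here is a small sketch of what the ([^/]*mp3)$ pattern suggested above actually captures on the question's URLs. As written, the group includes the extension; the second pattern is an assumed variant (not from the answer) that moves \.mp3 outside the group to return only the name.

```python
import re

urls = [
    "http://92.230.38.21/ios/Channel767/Hotbird.mp3",
    "http://92.230.38.21/ios/Channel9798/Coldbird.mp3",
]

# Pattern from the answer: the captured group still contains ".mp3".
with_ext = [re.search(r"([^/]*mp3)$", u).group(1) for u in urls]
print(with_ext)   # ['Hotbird.mp3', 'Coldbird.mp3']

# Assumed variant: keep the extension outside the capturing group.
names = [re.search(r"([^/]*)\.mp3$", u).group(1) for u in urls]
print(names)      # ['Hotbird', 'Coldbird']
```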

HowTo regex only domains from file

I have such complex file: http://regexr.com/3a8n4
I need to regex every domain out of it, meaning such a line:
http://liqueur.werbeschalter.com/if/?http%3A%2F%2Fwww.vornamenkartei.de
should yield me:
liqueur.werbeschalter.com and www.vornamenkartei.de
I could do this with python.
Any ideas?
Trying this:
https?:\/\/(.+?)\/
Should be ok, but I wanted to get also the other domains after the "http%3A..."
(?:https?:\/\/|www\.)([^\/]+)\/.*$
Relatively simple, gets everything between the scheme and the start of the path, and captures it on group 1.
(?:): non-capturing group
https?|www\.: matches http with an optional s, OR a literal www.
:\/\/: just the start of a URL, no special meaning. The backslashes are for escaping
([^\/]+): creates a matching group (()) that matches any character except \/ one or more times
\/: matches a literal slash
See here: http://regexr.com/3a8n7
But ideally you wouldn't use regexes directly to parse the URL. Instead, use urlparse:
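import re
from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse
with open("yourfile") as f:
    for line in f:
        # Pass the line to re.match and skip lines without a referrer
        referrer = re.match("Referrer: (.*)$", line)
        if referrer:
            url = urlparse(referrer.group(1))
            print(url.netloc) # or whatever you want to do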
To get both the domain names and the URL-encoded domain names, you might want to try the following:
(?:https?(?::\/\/|%3A%2F%2F))([^\/%]*)
The reason for the % in the character class is in case there is a URL-encoded forward slash in the URL.
Please see Regex Demo here.
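A short sketch of how that last pattern could be applied from Python with re.findall, using the sample line from the question. The slashes are left unescaped here because Python patterns have no delimiters; the name DOMAIN_RE is just an illustrative choice.

```python
import re

# Same pattern as above: scheme, then either "://" or its URL-encoded form,
# then everything up to the next "/" or "%".
DOMAIN_RE = re.compile(r"https?(?:://|%3A%2F%2F)([^/%]*)")

line = "http://liqueur.werbeschalter.com/if/?http%3A%2F%2Fwww.vornamenkartei.de"
domains = DOMAIN_RE.findall(line)
print(domains)
# ['liqueur.werbeschalter.com', 'www.vornamenkartei.de']
```

Note that the pattern only recognises the uppercase encoding %3A%2F%2F; passing re.IGNORECASE would be one way to also accept lowercase hex digits.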
How about this?
from urllib.parse import urlparse, unquote, parse_qs
# urls is assumed to be an iterable of URL strings
for url in urls:
    result = urlparse(url)
    print("{}://{}".format(result.scheme, result.netloc))
    unquoted = unquote(result.query)
    parsed_qs = parse_qs(unquoted, keep_blank_values=True)
    extracted_strings = list(parsed_qs.keys())
    for get_arg_values in parsed_qs.values():
        extracted_strings.extend(get_arg_values)
    for possible_url in extracted_strings:
        if possible_url.startswith('http'):
            parsed_url = urlparse(possible_url)
            print("{}://{}".format(parsed_url.scheme, parsed_url.netloc))
Python has built-in tools to parse URLs and extract query parameters. We also need to handle the special case where a GET parameter has no value, so the keys are checked as well as the values.
EDIT: updated code

Python regex with question mark literal

I'm using Django's URLconf, the URL I will receive is /?code=authenticationcode
I want to match the URL using r'^\?code=(?P<code>.*)$' , but it doesn't work.
Then I found out that the problem is the '?'.
Because I tried to match /aaa?aaa using r'aaa\?aaa', r'aaa\\?aaa' and even r'aaa.*aaa', and all of them failed, but it works when the character is "+" or any other character.
How to match the '?', is it special?
>>> s="aaa?aaa"
>>> import re
>>> re.findall(r'aaa\?aaa', s)
['aaa?aaa']
The reason /aaa?aaa won't match inside your URL is because a ? begins a new GET query.
So, the matchable part of the URL is only up to the first 'aaa'. The remaining '?aaa' is a new query string separated by the '?' mark, containing a variable "aaa" being passed as a GET parameter.
What you can do here is encode the variable before it makes its way into the URL. The encoded form of ? is %3F.
You should also not match a GET query such as /?code=authenticationcode using regex at all. Instead, match your URL up to / using r'^$'. Django will pass the variable code as a GET parameter to the request object, which you can obtain in your view using request.GET.get('code').
You are not allowed to use ? in a URL as a variable value. The ? indicates that there are variables coming in.
Like: http://www.example.com?variable=1&another_variable=2
Replace it or escape it. Here's some nice documentation.
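A brief standard-library sketch of both points made above: an unencoded ? splits the URL into a path and a query string, while the percent-encoded form %3F stays inside the path. The strings are the ones from the question; nothing here is Django-specific.

```python
from urllib.parse import quote, unquote, urlsplit

# An unencoded "?" starts the query string, so the path ends at the first "aaa".
parts = urlsplit("/aaa?aaa")
print(parts.path, parts.query)    # /aaa aaa

# Percent-encoding turns "?" into "%3F", which remains part of the path.
encoded = quote("aaa?aaa", safe="")
print(encoded)                    # aaa%3Faaa
encoded_parts = urlsplit("/" + encoded)
print(encoded_parts.path)         # /aaa%3Faaa

# Decoding restores the original text.
print(unquote(encoded))           # aaa?aaa
```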
Django's urls.py does not parse query strings, so there is no way to get this information at the urls.py file.
Instead, parse it in your view:
def foo(request):
    code = request.GET.get('code')
    if code:
        pass  # do stuff
    else:
        pass  # No code!
"How to match the '?', is it special?"
Yes, but you are properly escaping it by using the backslash. I do not see where you have accounted for the leading forward slash, though. That bit just needs to be added in:
r'^/\?code=(?P<code>.*)$'
Suppress the regex metacharacter by putting it inside a character class, []:
>>> s
'/?code=authenticationcode'
>>> r=re.compile(r'^/[?]code=(.+)')
>>> m=r.match(s)
>>> m.groups()
('authenticationcode',)
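Another option when working with plain Python strings, sketched here as an alternative to hand-written backslashes or character classes, is to let re.escape quote the literal part of the pattern. The variable names are illustrative, and this applies to the re module only: it does not change the fact that Django's URLconf never sees the query string.

```python
import re

# re.escape quotes every regex metacharacter in the literal prefix.
prefix = "/?code="
pattern = re.compile(re.escape(prefix) + r"(?P<code>.+)$")

m = pattern.match("/?code=authenticationcode")
print(m.group("code"))   # authenticationcode
```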

Using Python Regular Expression in Django

I have a web address:
http://www.example.com/org/companyA
I want to be able to pass CompanyA to a view using regular expressions.
This is what I have:
(r'^org/?P<company_name>\w+/$',"orgman.views.orgman")
and it doesn't match.
Ideally all URLs that look like example.com/org/X would pass X to the view.
Thanks in advance!
You need to wrap the group name in parentheses. The syntax for named groups is (?P<name>regex), not ?P<name>regex. Also, if you don't want to require a trailing slash, you should make it optional.
It's easy to test regular expression matching with the Python interpreter, for example:
>>> import re
>>> re.match(r'^org/?P<company_name>\w+/$', 'org/companyA')
>>> re.match(r'^org/(?P<company_name>\w+)/?$', 'org/companyA')
<_sre.SRE_Match object at 0x10049c378>
>>> re.match(r'^org/(?P<company_name>\w+)/?$', 'org/companyA').groupdict()
{'company_name': 'companyA'}
Your regex isn't valid. It should probably look like
r'^org/(?P<company_name>\w+)/$'
It should look more like r'^org/(?P<company_name>\w+)'
>>> r = re.compile(r'^org/(?P<company_name>\w+)')
>>> r.match('org/companyA').groups()
('companyA',)
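Django passes named groups to the view as keyword arguments. The following is a simplified imitation of that dispatch using only the re module; the orgman function here is a hypothetical stand-in for the real view, and None stands in for the request object.

```python
import re

def orgman(request, company_name):
    # Hypothetical stand-in for orgman.views.orgman
    return "company: " + company_name

# Named group in parentheses, trailing slash optional.
pattern = re.compile(r"^org/(?P<company_name>\w+)/?$")

match = pattern.match("org/companyA")
if match:
    # groupdict() yields {'company_name': 'companyA'}, passed as keyword arguments.
    result = orgman(None, **match.groupdict())
    print(result)   # company: companyA
```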
