Regex in Python? - python

I have a string:
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
I want to get this result:
[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]
I tried this:
match = re.findall("(^[a-z]+[^://](^[a-z]+\d))", line)
I'm a beginner in Python. If there is somebody who can explain, it would be very nice :D

I suggest to use urlparse lib that has everything you need instead of a regex.
from urllib.parse import urlparse
def getparts(url):
return (url.scheme, url.hostname, url.port)
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg,\file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
urls = [getparts(urlparse(url)) for url in line.split(',')]

You can use the following regex:
([fh]t*ps?|file):[\\/]*(.*?)(?=:|)(\d+|(?=[\\\/]))
Tested on Regex101:
https://regex101.com/r/hCprgS/3
Try this code:
import re
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg,\file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
match = re.findall("([fh]t*ps?|file):[\\/]*(.*?)(?=:|)(\d+|(?=[\\\/]))", line)
print(match)
Results:
[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('http', 'v2-dbwebb.se', '')]

Instead of using regex, try using line.split(',') Then iterate through the list, like
myList=[]
for l in line.split(','):
myList.append(tuple(m.split('/')[0:2]))
It isn't pretty, but it gets around the problem of regex. It doesn't get into the specifics of the URL and FTP, but you can eliminate those systematically.

Python urlparse is the module you need to do all of the work, it has a urlparse constructor function that will parse a URL. The interesting parts of the URL can then be extracted from this object as attribute names. Here is the code:
import urlparse
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg,file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
# you want the port as a string so adjust it here
def port2str(port):
if port: return str(port)
else: return ''
urls = [x.strip() for x in line.split(',')]
result = map(lambda u: (u.scheme, u.hostname, port2str(u.port)), map(lambda url: urlparse.urlparse(url), urls))
print result
The code first breaks your input to an array of strings; note that they need to be clean up (stripped) as some have leading spaces which would break the parser. Then this array is converted to an array of parsed url objects, which is then converted to an array of tuples you want. The reason this is done in two steps here is that unfortunately the python lambda is very restrictive -- it cannot contain statements.
(I assumed the \file was a typo)

To provide yet another druidic and hacky regular expression approach:
import re
rx = re.compile(r"""
(?P<protocol>[^:]+):// # protocol
(?P<domain>[^/:]+) # domain part
(?::(?P<port>\d+))? # port, optional
""", re.VERBOSE)
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
matches = [match.groups()
for part in line.split(" ")
for match in [rx.match(part)]]
print(matches)
# [('https', 'dbwebb.se', None), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', None)]
See a demo on ideone.com. Otherwise, have a look at #DRC's answer for a very good non-regex way to tackle the problem.

Related

Strip A specific part from a url string in python

Im passing through some urls and I'd like to strip a part of it which dynamically changes so I don't know it firsthand.
An example url is:
https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2
And I'd like to strip the gid=lostchapter part without any of the rest.
How do I do that?
You can use urllib to convert the query string into a Python dict and access the desired item:
In [1]: from urllib import parse
In [2]: s = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
In [3]: q = parse.parse_qs(parse.urlsplit(s).query)
In [4]: q
Out[4]:
{'pid': ['2'],
'gid': ['lostchapter'],
'lang': ['en_GB'],
'practice': ['1'],
'channel': ['desktop'],
'demo': ['2']}
In [5]: q["gid"]
Out[5]: ['lostchapter']
Here is the simple way to strip them
urls = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
# Import the `urlparse` and `urlunparse` methods
from urllib.parse import urlparse, urlunparse
# Parse the URL
url = urlparse(urls)
# Convert the `urlparse` object back into a URL string
url = urlunparse(url)
# Strip the string
url = url.split("?")[1]
url = url.split("&")[1]
# Print the new URL
print(url) # Prints "gid=lostchapter"
Method 1: Using UrlParsers
from urllib.parse import urlparse
p = urlparse('https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2')
param: list[str] = [i for i in p.query.split('&') if i.startswith('gid=')]
Output: gid=lostchapter
Method 2: Using Regex
param: str = re.search(r'gid=.*&', 'https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2').group()[:-1]
you can change the regex pattern to appropriate pattern to match the expected outputs. currently it will extract any value.
We can try doing a regex replacement:
url = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
output = re.sub(r'(?<=[?&])gid=lostchapter&?', '', url)
print(output) # https://...?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2
For a more generic replacement, match on the following regex pattern:
(?<=[?&])gid=\w+&?
Using string slicing (I'm assuming there will be an '&' after gid=lostchapter)
url = r'https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2'
start = url.find('gid')
end = start + url[url.find('gid'):].find('&')
url = url[start:] + url[:end-1]
print(url)
output
gid=lostchapter
What I'm trying to do here is:
find index of occurrence of "gid"
find the first "&" after "gid" is found
concatenate the parts of the url after"gid" and before "&"

Regular Expression result don't match with tester

I'm new with Python...
After couple days if googling I'm still don't get it to work.
My script:
import re
pattern = '^Hostname=([a-zA-Z0-9.]+)'
hand = open('python_test_data.conf')
for line in hand:
line = line.rstrip()
if re.search(pattern, line) :
print line
Test file content:
### Option: Hostname
# Unique, case sensitive Proxy name. Make sure the Proxy name is known to the server!
# Value is acquired from HostnameItem if undefined.
#
# Mandatory: no
# Default:
# Hostname=
Hostname=bbg-zbx-proxy
Script results:
ubuntu-workstation:~$ python python_test.py
Hostname=bbg-zbx-proxy
But when I have tested regex in tester the result is: https://regex101.com/r/wYUc4v/1
I need some advice haw cant I get only bbg-zbx-proxy as script output.
You have already written a regular expression capturing one part of the match, so you could as well use it then. Additionally, change your character class to include - and get rid of the line.strip() call, it's not necessary with your expression.
In total this comes down to:
import re
pattern = '^Hostname=([-a-zA-Z0-9.]+)'
hand = open('python_test_data.conf')
for line in hand:
m = re.search(pattern, line)
if m:
print(m.group(1))
# ^^^
The simple solution would be to split on the equals sign. You know it will always contain that and you will be able to ignore the first item in the split.
import re
pattern = '^Hostname=([a-zA-Z0-9.]+)'
hand = open('testdata.txt')
for line in hand:
line = line.rstrip()
if re.search(pattern, line) :
print(line.split("=")[1]) # UPDATED HERE

Extract urls from a string of html data

I already tried to extract this html data with BeautifulSoup but it's only limited with tags. What I need to do is to get the trailing something.html or some/something.html after the prefix www.example.com/products/ while eliminating the parameters like ?search=1. I prefer to use regex with this but I don't know the exact pattern for this.
input:
System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html
prefix = "www.example.com/products/"
# do something
# expected output: ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
I guess you want to use re here - with a trick since I "?" will follow the "html" in a URI:
import re
L = ["//www.example.com/products/ammaxxllx.html", "https://www.example.com/site/kakzja.html", "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1"]
prefix = "www.example.com/products/"
>>> [re.search(prefix+'(.*)html', el).group(1) + 'html' for el in L if prefix in el]
['ammaxxllx.html', 'sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html']
Though the above answer by using re module is just awesome. You could also work around without using the module. Like this:
prefix = 'www.example.com/products/'
L = ['//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1', '//www.example.com/products/site/ammaxxllx.html', 'https://www.example.com/site/kakzja.html']
ans = []
for l in L:
input_ = l.rsplit(prefix, 1)
try:
input_ = input_[1]
ans.append(input_[:input_.index('.html')] + '.html')
except Exception as e:
pass
print ans
['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
Another option is to use urlparse instead of/along with re
It will allow you to split a URL like this:
import urlparse
my_url = "http://www.example.com/products/ammaxxllx.html?spam=eggs#sometag"
url_obj = urlparse.urlsplit(my_url)
url_obj.scheme
>>> 'http'
url_obj.netloc
>>> 'www.example.com'
url_obj.path
>>> '/products/ammaxxllx.html'
url_obj.query
>>> 'spam=eggs'
url_obj.fragment
>>> 'sometag'
# Now you're able to work with every chunk as wanted!
prefix = '/products'
if url_obj.path.startswith(prefix):
# Do whatever you need, replacing the initial characters. You can use re here
print url_obj.path[len(prefix) + 1:]
>>>> ammaxxllx.html

Get YouTube video url or YouTube video ID from a string using RegEx

So I've been stuck on this for about an hour or so now and I just cannot get it to work. So far I have been trying to extract the whole link from the string, but now I feel like it might be easier to just get the video ID.
The RegEx would need to take the ID/URL from the following link styles, no matter where they are in a string:
http://youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
https://youtube.com/iwGFalTRHDA
http://youtu.be/n17B_uFF4cA
youtube.com/iwGFalTRHDA
youtube.com/n17B_uFF4cA
http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4
http://www.youtube.com/watch?v=t-ZRX8984sc
http://youtu.be/t-ZRX8984sc
So far, I have this RegEx:
((http(s)?\:\/\/)?(www\.)?(youtube|youtu)((\.com|\.be)\/)(watch\?v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
This catches the link, however it also breaks down the link in to multiple parts and also adds that to the list too, so if a string contains a single youtube link, the output when I print the list is something like this:
('https://www.youtube.com/watch?v=Idn7ODPMhFY', 'https://', 's', 'www.', 'youtube', '.com/', '.com', 'watch?v=', 'Idn7ODPMhFY', '', '')
I need the list to only contain the link itself, or just the video id (which would be more preferable). I have really tried doing this myself for quite a while now but I just cannot figure it out. I was wondering if someone could sort out the regex for me and tell me where I am going wrong so that I don't run in to this issue again in the future?
Instead of writing a complicated regex that probably work not in all cases, you better use tools to analyze the url, like urllib:
from urllib.parse import urlparse, parse_qs
url = 'http://youtube.com/watch?v=iwGFalTRHDA'
def get_id(url):
u_pars = urlparse(url)
quer_v = parse_qs(u_pars.query).get('v')
if quer_v:
return quer_v[0]
pth = u_pars.path.split('/')
if pth:
return pth[-1]
This function will return None if both attempts fail.
I tested it with the sample urls:
>>> get_id('http://youtube.com/watch?v=iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related')
'iwGFalTRHDA'
>>> get_id('https://youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://youtu.be/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('youtube.com/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4')
'r5nB9u4jjy4'
>>> get_id('http://www.youtube.com/watch?v=t-ZRX8984sc')
't-ZRX8984sc'
>>> get_id('http://youtu.be/t-ZRX8984sc')
't-ZRX8984sc'
Here's the approach I'd use, no regex needed at all.
(This is pretty much equivalent to #Willem Van Onsem's solution, plus an easy to run / update unit test).
from urlparse import parse_qs
from urlparse import urlparse
import re
import unittest
TEST_URLS = [
('iwGFalTRHDA', 'http://youtube.com/watch?v=iwGFalTRHDA'),
('iwGFalTRHDA', 'http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related'),
('iwGFalTRHDA', 'https://youtube.com/iwGFalTRHDA'),
('n17B_uFF4cA', 'http://youtu.be/n17B_uFF4cA'),
('iwGFalTRHDA', 'youtube.com/iwGFalTRHDA'),
('n17B_uFF4cA', 'youtube.com/n17B_uFF4cA'),
('r5nB9u4jjy4', 'http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4'),
('t-ZRX8984sc', 'http://www.youtube.com/watch?v=t-ZRX8984sc'),
('t-ZRX8984sc', 'http://youtu.be/t-ZRX8984sc'),
(None, 'http://www.stackoverflow.com')
]
YOUTUBE_DOMAINS = [
'youtu.be',
'youtube.com',
]
def extract_id(url_string):
# Make sure all URLs start with a valid scheme
if not url_string.lower().startswith('http'):
url_string = 'http://%s' % url_string
url = urlparse(url_string)
# Check host against whitelist of domains
if url.hostname.replace('www.', '') not in YOUTUBE_DOMAINS:
return None
# Video ID is usually to be found in 'v' query string
qs = parse_qs(url.query)
if 'v' in qs:
return qs['v'][0]
# Otherwise fall back to path component
return url.path.lstrip('/')
class TestExtractID(unittest.TestCase):
def test_extract_id(self):
for expected_id, url in TEST_URLS:
result = extract_id(url)
self.assertEqual(
expected_id, result, 'Failed to extract ID from '
'URL %r (got %r, expected %r)' % (url, result, expected_id))
if __name__ == '__main__':
unittest.main()
I really advise on #LukasGraf's comment, however if you really must use regex you can check the following:
(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
Here is a working example in regex101:
https://regex101.com/r/5eRqn2/1
And here the python example:
In [38]: r = re.compile('(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(?:\-|\_)[0-z]{4}|.(?:\-|\_)[0-z]{9}))')
In [39]: r.match('http://youtube.com/watch?v=iwGFalTRHDA').groups()
Out[39]: ('iwGFalTRHDA',)
In [40]: r.match('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related').groups()
Out[40]: ('iwGFalTRHDA',)
In [41]: r.match('https://youtube.com/iwGFalTRHDA').groups()
Out[41]: ('iwGFalTRHDA',)
In order to not catch specific group in regex you should this: (?:...)

Or syntax when parsing urls with regex & python

Struggling with some regex here. I'll be looping through several urls but I cannot get the regex to how to recognize revenue or cost and grab the them both. Essentially the output would look something like this:
import re
url = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13']
values = []
for i in urls:
values.append(re.search(r'(?<=revenue=)(.*?)(?=&|;)',url).group(0))
print values
[[224.00, ''],
'224.00',
[224.00, 13]]
You need to use re.findall since re.search returns only the first match.
>>> for i in url:
values.append(re.findall(r'(?:\brevenue=|\bcost=)(.*?)(?:[&;]|$)', i))
>>> values
[['224.00', ''], ['224.00'], ['224.00', '13']]
Use urlparse.urlparse to parse the URL, and urlparse.parse_qs to parse the query string.
from urlparse import urlparse, parse_qs
reqs = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13']
urls = [re.split(' +', s, 1)[1] for s in reqs]
kv = [parse_qs(urlparse(url).query) for url in urls]
values = [(e.get('revenue'), e.get('cost')) for e in kv]
# values = [{'revenue': e.get('revenue'), 'cost': e.get('cost')} for e in kv]
Sample output (parse_qs provides a list of values for every key, since the query may contain duplicate keys):
[(['224.00'], None), (['224.00'], None), (['224.00'], ['13'])]
The values line is not necessary. You can use the kv dict directly.
If you have to deal with invalid input, the list comprehension with urls and kv has to be rewritten as a loop:
For urls, you need to check and filter out entries without HTTP method
For kv, you need to add try catch for urlparse to catch invalid syntax.

Categories