Or syntax when parsing urls with regex & python - python

Struggling with some regex here. I'll be looping through several URLs, but I cannot get the regex to recognize revenue or cost and grab them both. Here is my attempt, followed by roughly what the output should look like:
import re
url = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13']
values = []
for i in urls:
    values.append(re.search(r'(?<=revenue=)(.*?)(?=&|;)',url).group(0))
print values
[[224.00, ''],
'224.00',
[224.00, 13]]

You need to use re.findall since re.search returns only the first match.
>>> for i in url:
...     values.append(re.findall(r'(?:\brevenue=|\bcost=)(.*?)(?:[&;]|$)', i))
...
>>> values
[['224.00', ''], ['224.00'], ['224.00', '13']]
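For reference, here is the same findall approach as a self-contained Python 3 script. It is only a minimal sketch that reuses the three request lines from the question; the name requests_log is made up for illustration.

```python
import re

# The three request lines from the question.
requests_log = [
    'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
    'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
    'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13',
]

# Capture whatever follows "revenue=" or "cost=" up to the next '&', ';' or end of line.
pattern = re.compile(r'(?:\brevenue=|\bcost=)(.*?)(?:[&;]|$)')

values = [pattern.findall(line) for line in requests_log]
print(values)
```

Note that findall returns the values in order of appearance without their keys, so a line carrying only cost would look the same as a line carrying only revenue.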

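Use urlparse.urlparse to parse the URL, and urlparse.parse_qs to parse the query string (in Python 3, both functions live in urllib.parse).
import re
from urlparse import urlparse, parse_qs  # Python 3: from urllib.parse import urlparse, parse_qs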
reqs = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13']
urls = [re.split(' +', s, 1)[1] for s in reqs]
kv = [parse_qs(urlparse(url).query) for url in urls]
values = [(e.get('revenue'), e.get('cost')) for e in kv]
# values = [{'revenue': e.get('revenue'), 'cost': e.get('cost')} for e in kv]
Sample output (parse_qs provides a list of values for every key, since the query may contain duplicate keys):
[(['224.00'], None), (['224.00'], None), (['224.00'], ['13'])]
The values line is not necessary. You can use the kv dict directly.
If you have to deal with invalid input, the list comprehensions for urls and kv have to be rewritten as a loop:
For urls, you need to check for and filter out entries without an HTTP method.
For kv, you need to wrap the urlparse call in try/except to catch invalid syntax.
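The loop described above could look roughly like this in Python 3. This is a sketch under assumptions, not the answerer's code: the function name extract_revenue_cost is hypothetical, a line is treated as invalid when it has fewer than two whitespace-separated fields, and keep_blank_values=True is used so that an empty cost= shows up as '' rather than being dropped.

```python
from urllib.parse import urlparse, parse_qs

def extract_revenue_cost(request_lines):
    """Return a (revenue, cost) tuple for every well-formed request line."""
    results = []
    for line in request_lines:
        parts = line.split()
        # Skip entries that lack an HTTP method or a request target.
        if len(parts) < 2:
            continue
        try:
            query = urlparse(parts[1]).query
        except ValueError:
            # urlparse raises ValueError for some malformed inputs.
            continue
        params = parse_qs(query, keep_blank_values=True)
        # parse_qs maps each key to a list; take the first value or None.
        revenue = params.get('revenue', [None])[0]
        cost = params.get('cost', [None])[0]
        results.append((revenue, cost))
    return results

reqs = [
    'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
    'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
    'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13',
    'garbage',  # no request target, silently skipped
]
print(extract_revenue_cost(reqs))
```

Whether invalid lines should be skipped, logged, or reported as (None, None) depends on the caller; skipping is just the simplest choice for a sketch.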

Related

Strip A specific part from a url string in python

I'm passing through some URLs and I'd like to strip a part of each one which changes dynamically, so I don't know it beforehand.
An example url is:
https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2
And I'd like to strip the gid=lostchapter part without any of the rest.
How do I do that?
You can use urllib to convert the query string into a Python dict and access the desired item:
In [1]: from urllib import parse
In [2]: s = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
In [3]: q = parse.parse_qs(parse.urlsplit(s).query)
In [4]: q
Out[4]:
{'pid': ['2'],
'gid': ['lostchapter'],
'lang': ['en_GB'],
'practice': ['1'],
'channel': ['desktop'],
'demo': ['2']}
In [5]: q["gid"]
Out[5]: ['lostchapter']
Here is the simple way to strip them
urls = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
# Import the `urlparse` and `urlunparse` methods
from urllib.parse import urlparse, urlunparse
# Parse the URL
url = urlparse(urls)
# Convert the `urlparse` object back into a URL string
url = urlunparse(url)
# Strip the string
url = url.split("?")[1]
url = url.split("&")[1]
# Print the new URL
print(url) # Prints "gid=lostchapter"
Method 1: Using UrlParsers
from urllib.parse import urlparse
p = urlparse('https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2')
param: list[str] = [i for i in p.query.split('&') if i.startswith('gid=')]
Output: ['gid=lostchapter']
Method 2: Using Regex
import re
# [^&]* stops at the next '&'; a greedy .*& would run on to the last '&' in the URL
param: str = re.search(r'gid=[^&]*', 'https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2').group()
You can tighten the regex pattern to match only the values you expect; as written, it extracts any value.
We can try doing a regex replacement:
url = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
output = re.sub(r'(?<=[?&])gid=lostchapter&?', '', url)
print(output) # https://...?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2
For a more generic replacement, match on the following regex pattern:
(?<=[?&])gid=\w+&?
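If the goal is to remove the parameter rather than extract it, the standard library can also do that without a regex. The following is only a sketch: the helper name remove_param and the example.com URL are made up for illustration, and urlencode re-encodes the remaining values, so unusual percent-encoding in the input may come out normalized.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def remove_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    # parse_qsl keeps the original order and (with keep_blank_values) empty values.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

url = 'https://example.com/play?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2'
print(remove_param(url, 'gid'))
```

Unlike the regex replacement, this does not leave a dangling '&' behind when the removed parameter happens to be the last one.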
Using string slicing (I'm assuming there will be an '&' after gid=lostchapter)
url = r'https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2'
start = url.find('gid')
end = start + url[start:].find('&')
url = url[start:end]
print(url)
output
gid=lostchapter
What I'm trying to do here is:
find the index of the first occurrence of "gid"
find the first "&" after "gid"
take the slice of the url from "gid" up to (but not including) that "&"

Get number sequence after a specific string in url text

I'm coding a Python script to check a bunch of URLs and get their ID text. The URLs follow this sequence:
http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
Up to
http://XXXXXXX.XXX/index.php?id=YYYYYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
What I'm trying to do is get only the numbers after the id= and before the &
I've tried to use the regex (\D+)(\d+) but I'm also getting the auth numbers too.
Any suggestion on how to get only the id sequence?
Another way is to use split:
string = 'http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX'
string.split('id=')[1].split('&auth=')[0]
Output:
YY
These are URL addresses, so I would just use url parser in that case.
Look at urllib.parse
Use urlparse to get query parameters, and then parse_qs to get query dict.
import urllib.parse as p
url = "http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"
query = p.urlparse(url).query
params = p.parse_qs(query)
print(params['id'])
You can include the start and stop tokens in the regex:
pattern = r'id=(\d+)(?:&|$)'
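A short sketch of how that pattern might be used. The helper name extract_id and the example.com URLs are placeholders, and the ids are assumed to be purely numeric, as the question describes.

```python
import re

# "id=" followed by digits, ending at '&' or at the end of the string.
ID_PATTERN = re.compile(r'id=(\d+)(?:&|$)')

def extract_id(url):
    """Return the numeric id as a string, or None when the URL has none."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

print(extract_id('http://example.com/index.php?id=1234&auth=abc99def00'))
print(extract_id('http://example.com/index.php?auth=abc99&id=77'))
print(extract_id('http://example.com/index.php?auth=abc99'))
```

Because the digits must be followed by '&' or the end of the string, the numbers inside the auth token are not picked up. The pattern would still match a parameter whose name merely ends in "id" (for example userid=5), so anchoring on [?&] may be worth adding if such parameters can occur.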
You can try this regex
import re
urls = ["http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX", "http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX", "http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"]
for url in urls:
    id_value = re.search(r"id=(.*)(?=&)", url).group(1)
    print(id_value)
That will get you the id value from each URL:
YY
YYY
YYYY
variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX""".splitlines()
for v in variables:
    p1 = v.split("id=")[1]
    p2 = p1.split("&")[0]
    print(p2)
Output:
YY
YYY
YYYY
If you prefer regex
import re
variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""
pattern = "id=(.*)\\&"
x = re.findall(pattern, variables)
print(x)
output:
['YY', 'YYY', 'YYYY']
I don't know whether, by "only the numbers after id= and before &", you mean that the id could contain both letters and digits and you want just the digits. In case you do, I thought of this:
import re
variables = """http://XXXXXXX.XXX/index.php?id=5Y44Y&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=Y2242YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=5YY453YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""
pattern = "id=(.*)\\&"
x = re.findall(pattern, variables)
print(x)
x2 = []
for p in x:
    x2.append(re.sub("\\D", "", p))
print(x2)
Output:
['5Y44Y', 'Y2242YY', '5YY453YY']
['544', '2242', '5453']
Use the regex id=[0-9]+:
pattern = "id=[0-9]+"
id = re.findall(pattern, url)[0].split("id=")[1]
If you do it this way, there is no need for &auth to follow the id, which makes it very versatile. However, the &auth won't make the code stop working. It works for the edge cases, as well as the simple ones.

Get YouTube video url or YouTube video ID from a string using RegEx

So I've been stuck on this for about an hour or so now and I just cannot get it to work. So far I have been trying to extract the whole link from the string, but now I feel like it might be easier to just get the video ID.
The RegEx would need to take the ID/URL from the following link styles, no matter where they are in a string:
http://youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
https://youtube.com/iwGFalTRHDA
http://youtu.be/n17B_uFF4cA
youtube.com/iwGFalTRHDA
youtube.com/n17B_uFF4cA
http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4
http://www.youtube.com/watch?v=t-ZRX8984sc
http://youtu.be/t-ZRX8984sc
So far, I have this RegEx:
((http(s)?\:\/\/)?(www\.)?(youtube|youtu)((\.com|\.be)\/)(watch\?v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
This catches the link, but it also breaks the link down into multiple parts and adds those to the list too, so if a string contains a single YouTube link, the output when I print the list is something like this:
('https://www.youtube.com/watch?v=Idn7ODPMhFY', 'https://', 's', 'www.', 'youtube', '.com/', '.com', 'watch?v=', 'Idn7ODPMhFY', '', '')
I need the list to only contain the link itself, or just the video id (which would be more preferable). I have really tried doing this myself for quite a while now but I just cannot figure it out. I was wondering if someone could sort out the regex for me and tell me where I am going wrong so that I don't run in to this issue again in the future?
Instead of writing a complicated regex that probably won't work in all cases, you are better off using tools that analyze the URL, like urllib:
from urllib.parse import urlparse, parse_qs
url = 'http://youtube.com/watch?v=iwGFalTRHDA'
def get_id(url):
    u_pars = urlparse(url)
    quer_v = parse_qs(u_pars.query).get('v')
    if quer_v:
        return quer_v[0]
    pth = u_pars.path.split('/')
    if pth:
        return pth[-1]
This function will return None if both attempts fail.
I tested it with the sample urls:
>>> get_id('http://youtube.com/watch?v=iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related')
'iwGFalTRHDA'
>>> get_id('https://youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://youtu.be/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('youtube.com/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4')
'r5nB9u4jjy4'
>>> get_id('http://www.youtube.com/watch?v=t-ZRX8984sc')
't-ZRX8984sc'
>>> get_id('http://youtu.be/t-ZRX8984sc')
't-ZRX8984sc'
Here's the approach I'd use, no regex needed at all.
(This is pretty much equivalent to @Willem Van Onsem's solution, plus an easy to run / update unit test).
from urllib.parse import parse_qs  # Python 2: from urlparse import parse_qs
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
import re
import unittest
TEST_URLS = [
    ('iwGFalTRHDA', 'http://youtube.com/watch?v=iwGFalTRHDA'),
    ('iwGFalTRHDA', 'http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related'),
    ('iwGFalTRHDA', 'https://youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'http://youtu.be/n17B_uFF4cA'),
    ('iwGFalTRHDA', 'youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'youtube.com/n17B_uFF4cA'),
    ('r5nB9u4jjy4', 'http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4'),
    ('t-ZRX8984sc', 'http://www.youtube.com/watch?v=t-ZRX8984sc'),
    ('t-ZRX8984sc', 'http://youtu.be/t-ZRX8984sc'),
    (None, 'http://www.stackoverflow.com')
]
YOUTUBE_DOMAINS = [
    'youtu.be',
    'youtube.com',
]
def extract_id(url_string):
    # Make sure all URLs start with a valid scheme
    if not url_string.lower().startswith('http'):
        url_string = 'http://%s' % url_string
    url = urlparse(url_string)
    # Check host against whitelist of domains
    if url.hostname.replace('www.', '') not in YOUTUBE_DOMAINS:
        return None
    # Video ID is usually to be found in 'v' query string
    qs = parse_qs(url.query)
    if 'v' in qs:
        return qs['v'][0]
    # Otherwise fall back to path component
    return url.path.lstrip('/')
class TestExtractID(unittest.TestCase):
    def test_extract_id(self):
        for expected_id, url in TEST_URLS:
            result = extract_id(url)
            self.assertEqual(
                expected_id, result, 'Failed to extract ID from '
                'URL %r (got %r, expected %r)' % (url, result, expected_id))
if __name__ == '__main__':
    unittest.main()
I really advise following @LukasGraf's comment; however, if you really must use regex, you can check the following:
(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
Here is a working example in regex101:
https://regex101.com/r/5eRqn2/1
And here the python example:
In [38]: r = re.compile(r'(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(?:\-|\_)[0-z]{4}|.(?:\-|\_)[0-z]{9}))')
In [39]: r.match('http://youtube.com/watch?v=iwGFalTRHDA').groups()
Out[39]: ('iwGFalTRHDA',)
In [40]: r.match('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related').groups()
Out[40]: ('iwGFalTRHDA',)
In [41]: r.match('https://youtube.com/iwGFalTRHDA').groups()
Out[41]: ('iwGFalTRHDA',)
To keep a specific group from being captured, make it non-capturing with this syntax: (?:...)
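The difference is easiest to see with re.findall, which returns one tuple per match when the pattern has several capturing groups and a plain list of strings when it has exactly one. The pattern below is deliberately simplified for illustration (it only handles youtu.be/ID and youtube.com/watch?v=ID links) and is not the full expression from this answer; the sample text is made up.

```python
import re

text = 'see http://youtu.be/n17B_uFF4cA and https://www.youtube.com/watch?v=iwGFalTRHDA'

# Every (...) is a capturing group, so each match becomes a tuple of all groups.
capturing = re.findall(
    r'(https?://)?(www\.)?youtu(\.be/|be\.com/watch\?v=)([\w-]{11})', text)

# (?:...) groups still match but are not reported; only the video id is captured.
non_capturing = re.findall(
    r'(?:https?://)?(?:www\.)?youtu(?:\.be/|be\.com/watch\?v=)([\w-]{11})', text)

print(capturing)
print(non_capturing)
```

With the capturing version each result carries the scheme, the "www." prefix and the host suffix alongside the id, which is the clutter described in the question; the non-capturing version yields just the ids.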

Regex in Python?

I have a string:
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
I want to get this result:
[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]
I tried this:
match = re.findall("(^[a-z]+[^://](^[a-z]+\d))", line)
I'm a beginner in Python. If there is somebody who can explain, it would be very nice :D
I suggest using urllib.parse, which has everything you need, instead of a regex.
from urllib.parse import urlparse
def getparts(url):
    return (url.scheme, url.hostname, url.port)
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
# strip the space left after each comma before parsing
urls = [getparts(urlparse(url.strip())) for url in line.split(',')]
You can use the following regex:
([fh]t*ps?|file):[\\/]*(.*?)(?=:|)(\d+|(?=[\\\/]))
Tested on Regex101:
https://regex101.com/r/hCprgS/3
Try this code:
import re
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg,\file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
match = re.findall("([fh]t*ps?|file):[\\/]*(.*?)(?=:|)(\d+|(?=[\\\/]))", line)
print(match)
Results:
[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('http', 'v2-dbwebb.se', '')]
Instead of using regex, try using line.split(','). Then iterate through the list, like:
myList=[]
for l in line.split(','):
    myList.append(tuple(l.split('/')[0:2]))
It isn't pretty, but it gets around the problem of regex. It doesn't get into the specifics of the URL and FTP, but you can eliminate those systematically.
urlparse is the module you need for all of this (in Python 3 it is urllib.parse). Its urlparse function parses a URL, and the interesting parts can then be read from the result as attributes. Here is the code (Python 2):
import urlparse
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg,file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
# you want the port as a string so adjust it here
def port2str(port):
    if port: return str(port)
    else: return ''
urls = [x.strip() for x in line.split(',')]
result = map(lambda u: (u.scheme, u.hostname, port2str(u.port)), map(lambda url: urlparse.urlparse(url), urls))
print result
The code first breaks your input into an array of strings; note that they need to be cleaned up (stripped), as some have leading spaces which would break the parser. Then this array is converted to an array of parsed URL objects, which is then converted to the array of tuples you want. The reason this is done in two steps here is that, unfortunately, the Python lambda is very restrictive: it cannot contain statements.
(I assumed the \file was a typo)
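For Python 3, the same idea can be sketched as follows. The helper name url_parts is made up for illustration; the input string is the one from the question, and the port is converted to a string (or '' when absent) to match the requested output.

```python
from urllib.parse import urlparse

line = ("https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, "
        "file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack")

def url_parts(url):
    """Return (scheme, hostname, port) with the port as a string, '' if absent."""
    parsed = urlparse(url)
    port = '' if parsed.port is None else str(parsed.port)
    return (parsed.scheme, parsed.hostname, port)

# Split on commas and strip the surrounding whitespace before parsing.
result = [url_parts(u.strip()) for u in line.split(',')]
print(result)
```

A list comprehension with a small named helper avoids the nested map/lambda calls and returns a real list, whereas map in Python 3 returns a lazy iterator.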
To provide yet another druidic and hacky regular expression approach:
import re
rx = re.compile(r"""
(?P<protocol>[^:]+):// # protocol
(?P<domain>[^/:]+) # domain part
(?::(?P<port>\d+))? # port, optional
""", re.VERBOSE)
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
matches = [match.groups()
for part in line.split(" ")
for match in [rx.match(part)]]
print(matches)
# [('https', 'dbwebb.se', None), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', None)]
See a demo on ideone.com. Otherwise, have a look at @DRC's answer for a very good non-regex way to tackle the problem.

parsing a url in python with changing part in it

I'm parsing a URL in Python; below you can find a sample URL and the code. What I want to do is split the part number (74743) out of the URL and write a for loop that takes it from a parts list instead.
I tried to use urlparse but couldn't get it to work, mostly because of the changing parts in the URL. I just want the easiest and fastest way to do this.
Sample url:
http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is=
(http://example.com/wps/portal) Always fixed
(lYuxDoIwGAYf6f9aqKSjMNQ) Always changing
(74743) Will be taken from a list name Parts
(IntNumberOf=&is=) Also changing depending on the section of the website
Here's the Code:
from lxml import html
import requests
import urlparse
Parts = [74743, 85731, 93021]
url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
parsing = urlparse.urlsplit(url)
print parsing
>>> import urlparse
>>> url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
>>> split_url = urlparse.urlsplit(url)
>>> split_url.path
'/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/'
You can split the path into a list of strings using '/', slice the list, and re-join:
>>> path = split_url.path
>>> path.split('/')
['', 'wps', 'portal', 'lYuxDoIwGAYf6f9aqKSjMNQ', '']
Slice off the last two:
>>> path.split('/')[:-2]
['', 'wps', 'portal']
And re-join:
>>> '/'.join(path.split('/')[:-2])
'/wps/portal'
To parse the query, use parse_qs:
>>> urlparse.parse_qs(split_url.query)
{'PartNo': ['74743']}
To keep the empty parameters use keep_blank_values=True:
>>> query = urlparse.parse_qs(split_url.query, keep_blank_values=True)
>>> query
{'PartNo': ['74743'], 'is': [''], 'IntNumberOf': ['']}
You can then modify the query dictionary:
>>> query['PartNo'] = 85731
And update the original split_url:
>>> import urllib
>>> updated = split_url._replace(path='/'.join(path.split('/')[:-2] +
...                                            ['ASDFZXCVQWER', '']),
...                              query=urllib.urlencode(query, doseq=True))
>>> urlparse.urlunsplit(updated)
'http://example.com/wps/portal/ASDFZXCVQWER/?PartNo=85731&IntNumberOf=&is='
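Putting the pieces together for the loop over the parts list, a Python 3 sketch might look like this. The function name with_part_number is hypothetical, and the example keeps the path unchanged and only swaps the PartNo value; the parts list and URL are the ones from the question.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def with_part_number(url, part_no):
    """Return url with its PartNo query parameter replaced by part_no."""
    split_url = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as "is=".
    query = parse_qs(split_url.query, keep_blank_values=True)
    query['PartNo'] = [str(part_no)]
    # doseq=True turns each list value back into key=value pairs.
    new_query = urlencode(query, doseq=True)
    return urlunsplit(split_url._replace(query=new_query))

parts_list = [74743, 85731, 93021]
template = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='

urls = [with_part_number(template, part) for part in parts_list]
for u in urls:
    print(u)
```

The changing path segment could be swapped in the same _replace call, as the session above shows, if a fresh value for it is available.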
