Split URL in Python

So, I have this URL: https://www.last.fm/music/Limp+Bizkit/Significant+Other
I want to split it, to only keep the Limp+Bizkit and Significant+Other part of the URL. These are variables, and can be different each time. These are needed to create a new URL (which I know how to do).
I want the Limp+Bizkit and Significant+Other to be two different variables. How do I do this?

You can use the str.split method and use the forward slash as the separator.
>>> url = "https://www.last.fm/music/Limp+Bizkit/Significant+Other"
>>> *_, a, b = url.split("/")
>>> a
'Limp+Bizkit'
>>> b
'Significant+Other'
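As a more URL-aware alternative, the standard library's urllib.parse can isolate the path first, so components such as a query string or fragment don't end up in the unpacked variables. A minimal sketch:

```python
from urllib.parse import urlsplit

url = "https://www.last.fm/music/Limp+Bizkit/Significant+Other"
# Take only the path ('/music/Limp+Bizkit/Significant+Other'),
# then split off its last two segments.
artist, album = urlsplit(url).path.rsplit("/", 2)[-2:]
print(artist)  # Limp+Bizkit
print(album)   # Significant+Other
```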

You can remove https://www.last.fm/music/ from the URL with str.replace, leaving Limp+Bizkit/Significant+Other, and then split on the / character to break it into two strings. After the split, url is a list, so you can access the two parts with url[0] and url[1]:
>>> url = "https://www.last.fm/music/Limp+Bizkit/Significant+Other"
>>> url = url.replace("https://www.last.fm/music/",'').split('/')
>>> first_value = url[0]
>>> second_value = url[1]
>>> first_value
'Limp+Bizkit'
>>> second_value
'Significant+Other'

You can use regular expressions to achieve this.
import re

url = "https://www.last.fm/music/Limp+Bizkit/Significant+Other"
# The two capture groups grab the last two path segments.
match = re.match(r"^.*//.*/.*/(.*)/(.*)", url)
print(match.group(1))  # Limp+Bizkit
print(match.group(2))  # Significant+Other

Related

Filter urls by last path with regex

I need to filter URLs with a regex that captures the last path segment of each, except for several segments that should be skipped. For example:
import re
urls_to_exclude = ["example_1", "example_2", "example_3"]
url_1 = "https://site.com/api/user/endpath"
url_2 = "https://site.com/api/user/other_end?limit=10"
url_3 = "https://site.com/api/customer/example_1#tag"
url_4 = "https://site.com/api/blog/example_2"
>>> match = re.findall(r"...magic_regex...", url_1)
'endpath'
>>> match = re.findall(r"...magic_regex...", url_2)
'other_end'
>>> match = re.findall(r"...magic_regex...", url_3)
'example_1'
>>> match = re.findall(r"...magic_regex...", url_4)
'example_2'
It should be a regex string or a compiled object.
Thank you
You can try a regex. It will not give you exactly the last path segment, but you can easily trim the match using result[1:-1].
Regex:
/[\w\d_-]+[?^"#]
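For completeness, here is a sketch of one possible pattern (an assumption on my part, not the asker's "...magic_regex...") that captures the final path segment while tolerating a trailing query string or fragment:

```python
import re

# Hypothetical pattern: capture the last path segment, allowing an
# optional query string (?...) or fragment (#...) after it.
last_path = re.compile(r"/([\w-]+)(?:[?#].*)?$")

urls = [
    "https://site.com/api/user/endpath",
    "https://site.com/api/user/other_end?limit=10",
    "https://site.com/api/customer/example_1#tag",
]
for u in urls:
    print(last_path.search(u).group(1))  # endpath, other_end, example_1
```

Skipping the excluded segments can then be a plain membership test on the captured group, e.g. `seg not in urls_to_exclude`, rather than more regex logic.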

Extract domain name from URL using python's re regex

I want to input a URL and extract the domain name which is the string that comes after http:// or https:// and contains strings, numbers, dots, underscores, or dashes.
I wrote the regex and used the python's re module as follows:
import re
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
m.group(1)
print(m)
My understanding is that m.group(1) will extract the part between () in the re.search.
The output that I expect is: google.co.uk
But I am getting this:
<_sre.SRE_Match object; span=(0, 35), match='https://google.co.uk?link=something'>
Can you point to me how to use re to achieve my requirement?
You need to write
print(m.group(1))
Even better yet - have a condition before:
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
if m:
print(m.group(1))
Jan has already provided a solution for this. But just to note, we can implement the same without using re. All it needs is the punctuation set !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ for validation purposes, which can be obtained from the string package as string.punctuation.
import string

def domain_finder(link):
    dot_splitter = link.split('.')
    separator_first = 0
    if '//' in dot_splitter[0]:
        separator_first = dot_splitter[0].find('//') + 2
    separator_end = ''
    for i in dot_splitter[2]:
        if i in string.punctuation:
            separator_end = i
            break
    if separator_end:
        end_ = dot_splitter[2].split(separator_end)[0]
    else:
        end_ = dot_splitter[2]
    domain = [dot_splitter[0][separator_first:], dot_splitter[1], end_]
    domain = '.'.join(domain)
    return domain

link = 'https://google.co.uk?link=something'
domain = domain_finder(link=link)
print(domain)  # prints ==> 'google.co.uk'
This is just another way of solving the same without re.
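If all you need is the hostname, the standard library can also do this without re or manual splitting; a minimal sketch:

```python
from urllib.parse import urlsplit

link = 'https://google.co.uk?link=something'
# urlsplit separates the network location from the path/query/fragment.
print(urlsplit(link).hostname)  # google.co.uk
```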
There is a library called tldextract which is very reliable in this case.
Here is how it works:
import tldextract

def extractDomain(url):
    if "http" in str(url) or "www" in str(url):
        parsed = tldextract.extract(url)
        return ".".join([i for i in parsed if i])
    else:
        return "NA"

# op = open("out.txt", 'w')
# with open("test.txt") as ptr:
#     for lines in ptr.read().split("\n"):
#         op.write(str(extractDomain(lines)) + "\n")
print(extractDomain("https://test.pythonhosted.org/Flask-Mail/"))
The output is as follows:
test.pythonhosted.org

python take out portion of URL and keep original formatting

Let's say I have the following URL:
https://espn.com/1234/44/222/mlb/standings
And I want to extract /1234/44/222 as is. I understand that split('/')[3:6] would extract those segments, but it would lose the / formatting.
If your urls follow the above format, and you want the text between .com and /mlb, you can use the following regular expression:
.com([\/\d]+)\/mlb
In action:
>>> s = 'https://espn.com/1234/44/222/mlb/standings'
>>> re.findall(r'.com([\/\d]+)\/mlb', s)
['/1234/44/222']
You could also use join with split:
>>> '/' + '/'.join(s.split('/')[3:6])
'/1234/44/222'
You can use re.findall:
import re
s = "https://espn.com/1234/44/222/mlb/standings"
new_s = '/'.join(re.findall(r"\d+", s))
Output:
'1234/44/222'
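If you would rather avoid regex entirely, slicing the parsed path preserves the leading slash; a sketch using the standard library:

```python
from urllib.parse import urlsplit

s = "https://espn.com/1234/44/222/mlb/standings"
path = urlsplit(s).path  # '/1234/44/222/mlb/standings'
# Splitting on '/' leaves an empty first element, so joining the
# first four pieces keeps the leading slash intact.
print("/".join(path.split("/")[:4]))  # /1234/44/222
```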

How to check whether a string matches multiple patterns?

I want to check a URL against a predefined list of patterns.
My pattern list is:
pat = ['/FoodListAdminCP/Login[/]?', '/FoodListAdminCP[/]?']
I used this code to check whether the URL matches one item of this list:
import re
f = re.compile('|'.join(pat))
if f.match(self.request.uri):
    self.login = True
else:
    self.login = False
Now, if I request /FoodListAdminCP/Dashboard as the URL, it matches the pattern, because the start of this URL matches '/FoodListAdminCP[/]?', which is in my list.
I want my request URL to match an entire list item, not just part of it.
How can I do this?
If you want to match the entire URL against your pattern, you can use '^' and '$' to anchor the match to the beginning and the end of the string.
In your example you could use
f = re.compile('|'.join( '(^'+p+'$)' for p in pat ))
to get the regular expression
'(^/FoodListAdminCP/Login[/]?$)|(^/FoodListAdminCP[/]?$)'
from your pat list.
If you'd rather not concatenate patterns, which may not be flexible enough, but match them separately, you can use the list comprehension [re.compile(p).match(uri) for p in pat] to get a list of match results for all the patterns:
>>> import re
>>> pat = ['/FoodListAdminCP/Login[/]?', '/FoodListAdminCP[/]?']
>>> uri = '/FoodListAdminCP/Dashboard'
>>> match_results = [re.compile(p).match(uri) for p in pat]
>>> match_results
[None, <_sre.SRE_Match object at 0x101c05d30>]
Then you can ask if all of the results are matches using all, which is what you want your login to be:
>>> login = all(match_results)
>>> login
False
Or in short:
login = all([re.compile(p).match(uri) for p in pat])
Alternatively, add \Z at the end of your regex:
f = re.compile('(' + '|'.join(pat) + r')\Z')
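On Python 3.4+, re.fullmatch expresses the same intent without editing the patterns themselves; a sketch:

```python
import re

pat = ['/FoodListAdminCP/Login[/]?', '/FoodListAdminCP[/]?']
# fullmatch only succeeds if the whole string matches one alternative.
f = re.compile('|'.join('(?:{})'.format(p) for p in pat))

print(bool(f.fullmatch('/FoodListAdminCP/Dashboard')))  # False
print(bool(f.fullmatch('/FoodListAdminCP/Login')))      # True
```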

How to cut a string in Python?

Suppose that I have the following string:
http://www.domain.com/?s=some&two=20
How can I take off what is after & including the & and have this string:
http://www.domain.com/?s=some
Well, to answer the immediate question:
>>> s = "http://www.domain.com/?s=some&two=20"
The rfind method returns the index of the right-most occurrence of a substring:
>>> s.rfind("&")
29
You can take all elements up to a given index with the slicing operator:
>>> "foobar"[:4]
'foob'
Putting the two together:
>>> s[:s.rfind("&")]
'http://www.domain.com/?s=some'
If you are dealing with URLs in particular, you might want to use built-in libraries that deal with URLs. If, for example, you wanted to remove two from the above query string:
First, parse the URL as a whole:
>>> from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit
>>> parse_result = urlsplit("http://www.domain.com/?s=some&two=20")
>>> parse_result
SplitResult(scheme='http', netloc='www.domain.com', path='/', query='s=some&two=20', fragment='')
Take out just the query string:
>>> query_s = parse_result.query
>>> query_s
's=some&two=20'
Turn it into a dict:
>>> query_d = parse_qs(parse_result.query)
>>> query_d
{'s': ['some'], 'two': ['20']}
>>> query_d['s']
['some']
>>> query_d['two']
['20']
Remove the 'two' key from the dict:
>>> del query_d['two']
>>> query_d
{'s': ['some']}
Put it back into a query string:
>>> new_query_s = urlencode(query_d, doseq=True)
>>> new_query_s
's=some'
And now stitch the URL back together:
>>> result = urlunsplit((
...     parse_result.scheme, parse_result.netloc,
...     parse_result.path, new_query_s, parse_result.fragment))
>>> result
'http://www.domain.com/?s=some'
The benefit of this is that you have more control over the URL. Like, if you always wanted to remove the two argument, even if it was put earlier in the query string ("two=20&s=some"), this would still do the right thing. It might be overkill depending on what you want to do.
You need to split the string:
>>> s = 'http://www.domain.com/?s=some&two=20'
>>> s.split('&')
['http://www.domain.com/?s=some', 'two=20']
That will return a list as you can see so you can do:
>>> s2 = s.split('&')[0]
>>> print(s2)
http://www.domain.com/?s=some
string = 'http://www.domain.com/?s=some&two=20'
cut_string = string.split('&')
new_string = cut_string[0]
print(new_string)
You can use find()
>>> s = 'http://www.domain.com/?s=some&two=20'
>>> s[:s.find('&')]
'http://www.domain.com/?s=some'
Of course, if there is a chance that the searched-for text will not be present, then you need to write slightly more code:
pos = s.find('&')
if pos != -1:
    s = s[:pos]
Whilst you can make some progress using code like this, more complex situations demand a true URL parser.
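str.partition is another stdlib option that handles the missing-& case in one step; a minimal sketch:

```python
s = 'http://www.domain.com/?s=some&two=20'
# partition returns (head, sep, tail); head is the whole string
# when '&' is absent, so no special-casing is needed.
print(s.partition('&')[0])  # http://www.domain.com/?s=some
print('no-amp'.partition('&')[0])  # no-amp
```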
>>> s = "http://www.domain.com/?s=some&two=20"
>>> s.split("&")
['http://www.domain.com/?s=some', 'two=20']
s[0:s.index("&")]
What does this do?
It takes a slice from the string starting at index 0, up to, but not including, the index of & in the string.
