Python getting the name of a website from its url [duplicate] - python

This question already has answers here:
Extract domain name from URL in Python
(8 answers)
Closed 5 months ago.
I want to get the name of a website from a url in a very simple way. Like, I have the URL "https://www.google.com/" or any other url, and I want to get the "google" part.
The issue is that there could be many pitfalls. Like, it could be www3 or it could be http for some reason. It could also be like the python docs where it says "https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse". I only want "python" in that case.
Is there a simple way to do it? The only one I can think of is just doing lots and lots of string.removeprefix or something like that, but that's ugly. I could not find anything that resembled what I searched for in the urllib library, but maybe there is another one?

Here's an idea:
import re
url = 'https://python.org'
url_ext = ['.com', '.org', '.edu', '.net', '.co.uk', '.cc', '.info', '.io']
web_name = ''
# Cut off the extension and everything after it
for ext in url_ext:
    if ext in url:
        web_name = url.split(ext)[0]
# Reverse the string to find the first non-alphanumeric character
web_name = web_name[::-1]
final = re.search(r'\W+', web_name).start()
final = web_name[0:final]
# Reverse the string again and print the final result
print(final[::-1])
The code starts by cutting off the extension of the website and everything that follows it. It then reverses the string, looks for the first non-alphanumeric character using the re module, and cuts off everything after that. Finally it reverses the string again to print out the result.
This code is probably not going to work on every single website, as there are a million different ways to structure a URL, but it should work for you to some degree.
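For comparison, here is a minimal sketch using only the standard library's urllib.parse; the suffix list and the site_name helper are my own illustrative additions, not a complete solution (a third-party package such as tldextract handles public suffixes like .co.uk more robustly):
from urllib.parse import urlparse

def site_name(url, suffixes=('.co.uk', '.com', '.org', '.edu', '.net', '.io')):
    host = urlparse(url).hostname or ''   # e.g. 'docs.python.org'
    for suffix in suffixes:
        if host.endswith(suffix):
            host = host[:-len(suffix)]    # drop the known suffix
            break
    return host.rsplit('.', 1)[-1]        # keep the last remaining label

print(site_name('https://www.google.com/'))                              # google
print(site_name('https://docs.python.org/3/library/urllib.parse.html'))  # python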

Related

Python string.split() How to ignore empty spaces [duplicate]

This question already has answers here:
Why are empty strings returned in split() results?
(9 answers)
Closed 2 years ago.
I'm using split to parse HTTP requests and came across something that I don't like, but I don't know a better way.
Imagine I have this GET: /url/hi
I'm splitting the URL simply like so:
fields = request['url'].split('/')
It's simple and it works, but it also leaves an empty string in the first position of the list. I know this is expected behavior.
The question is: can I change the call to split to account for this, or do I just live with it?
If you always want to remove the first entry of the list, you can do this:
fields = request['url'].split('/')[1:]
If you instead just want to drop that empty string from the list, you can follow your initial call with this:
fields.remove('')
Hope it helps!
OK, if you're sure your string starts with '/',
you can skip the first character like this:
url = request['url']
fields = url[1:].split('/')  # [1: to end]
If you're not sure, simply check first:
url = request['url']
if url.startswith('/'):
    url = url[1:]
fields = url.split('/')
Happy coding 😎
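If empty segments can show up anywhere (a trailing '/' or a double '//', for example), a list comprehension drops them all in one pass; a small sketch with a made-up request dict:
request = {'url': '/url//hi/'}   # hypothetical request dict, as in the question
fields = [part for part in request['url'].split('/') if part]
print(fields)                    # ['url', 'hi']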

find substrings between two strings [duplicate]

This question already has answers here:
Parsing HTML using Python
(7 answers)
Closed 3 years ago.
I have a string like this:
string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''
I am trying to get anything that appears as title (title="Anything here")
I have already tried this but it does not work correctly.
re.findall(r'title=\"(.*)\"',string)
I think your regex is too greedy. You can try something like this:
re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. See other SO answers for more context. There are a few common tools for this, like:
https://docs.python.org/3/library/html.parser.html
https://www.simplifiedpython.net/parsing-html-in-python/
https://github.com/psf/requests-html / Get html using Python requests?
If you would like to read more on performance testing of different Python HTML parsers, you can learn more here.
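For illustration, a rough sketch of that idea using the standard library's html.parser; the TitleCollector class and the sample markup are mine, not from the question (for messy real-world markup a more forgiving parser such as BeautifulSoup may be a better fit):
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the value of every title="..." attribute encountered."""
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'title' and value is not None:
                self.titles.append(value)

html = '<img src="a.png" title="email example"><img src="b.png" title="second example title">'
parser = TitleCollector()
parser.feed(html)
print(parser.titles)   # ['email example', 'second example title']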
As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. I stand by this too, but if you want to get it done through regex, this may help:
c = re.finditer(r'title=[\"]([a-zA-Z0-9\s]+)[\" ]', string)
for i in c:
    print(i.group(1))
The problem here is that the next " symbol is treated as just another character and ends up inside the (.*) of your RE. For your use case, you can match only letters, digits, and whitespace.
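Another common fix, not from the answer above, is to make the quantifier non-greedy so the match stops at the very next quote; against the string variable from the question this gives all three titles:
import re
print(re.findall(r'title="(.*?)"', string))   # 'string' is the HTML snippet from the question
# ['email example', 'second example title', 'one more title']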

Extract URL's inclusive with fragments in string using Python with Regex

OK, I know people are going to say this question has been asked a million times... but my question is DIFFERENT. I have searched Stack Overflow many, many times to ensure this is not a duplicate.
I want a regex in Python that also helps to extract the URL from a string INCLUDING FRAGMENTS.
What I have done so far is:
import re
test = 'This is a string with my URL as follows http://www.example.org/foo.html#bar and here i continue with my string'
test = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', test)
print(test)
The output I get for the above code is ['http://www.example.org/foo.html'],
which is not what I want.
I want the output to be ['http://www.example.org/foo.html#bar']
Your original regex is this:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
Couldn't you just add '#', like this?
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),#]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
I am unclear as to what you mean by 'fragments'... Do you mean anything up to the space in the string?
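For what it's worth, pasting the modified pattern into a runnable snippet against the string from the question shows the fragment being kept:
import re

test = 'This is a string with my URL as follows http://www.example.org/foo.html#bar and here i continue with my string'
pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),#]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
print(re.findall(pattern, test))   # ['http://www.example.org/foo.html#bar']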

How to use extended ascii with bs4 url

I've been reluctant to post a question about this, but after 3 days of Google I can't get this to work. Long story short, I'm making a raid gear tracker for WoW.
I'm using BS4 to handle the webscraping, I'm able to pull the page and scrape the info I need from it. The problem I'm having is when there is an extended ascii character in the player's name, ex: thermíte. (the i is alt+161)
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
I'm trying to figure out how to re-encode the url so it is more like this:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
I'm using tkinter for the GUI. I have the user select their realm from a dropdown and then type the character name in an entry field.
namefield = Entry(window, textvariable=toonname)
I have a scraping function that performs the initial scrape of the main profile page. This is where I assign the value of namefield to a global variable. (I tried passing it directly to the scraper with this:
namefield = Entry(window, textvariable=toonname, command=firstscrape)
I thought I was close, because when it passed "thermíte", the scrape function would print out "therm\xC3\xADte"; all I needed to do was replace the '\x' with '%' and I'd be golden. But it wouldn't work. I could use mastername.find('\x') and it would find instances of it in the string, but doing mastername.replace('\x','%') wouldn't actually replace anything.
I tried various combinations of r'\x', '\%', r'\x', etc. No dice.
Lastly, when I try to do things like encode into Latin-1 and then decode back into UTF-8, I get errors about how it can't handle the extended ASCII character.
urlpart1 = "http://us.battle.net/wow/en/character/garrosh/"
urlpart2 = mastername
urlpart3 = "/advanced"
url = urlpart1 + urlpart2 + urlpart3
That's what I've been using to try and rebuild the final URL (at the moment I'm leaving the realm constant until I can get the name problem fixed).
TL;DR:
I'm trying to take a url with extended ascii like:
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
And have it become a url that a browser can easily process like:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
with all of the normal extended ascii characters.
I hope this made sense.
Here is a pastebin for the full script at the moment; there are some things in it that aren't utilized until later on: pastebin link
There shouldn't be non-ASCII characters in the resulting URL. Make sure mastername is a Unicode string (isinstance(mastername, str) on Python 3):
#!/usr/bin/env python3
from urllib.parse import quote
mastername = "thermíte"
assert isinstance(mastername, str)
url = "http://us.battle.net/wow/en/character/garrosh/{mastername}/advanced"\
.format(mastername=quote(mastername, safe=''))
# -> http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
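One usage note on the snippet above: quote only the name, not the whole URL, because safe='' would also percent-encode the slashes in the path. For example:
print(quote("thermíte", safe=''))   # therm%C3%ADte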
You can try something like this:
>>> import urllib
>>> 'http://' + '/'.join([urllib.quote(x) for x in url.strip('http://').split('/')])
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
urllib.quote() URL-encodes the characters of a string, leaving any characters listed in "safe" untouched. You don't want all the characters to be affected, just everything between the '/' characters and excluding the initial 'http://'. So the strip and split calls take those out of the equation, and then you concatenate them back in with the + operator and join.
EDIT: This one is on me for not reading the docs... Much cleaner:
>>> url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'
>>> urllib.quote(url, safe=':/')
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
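For completeness, on Python 3 the same idea looks like this (urllib.quote moved to urllib.parse.quote); quoting the raw, unencoded URL while keeping ':' and '/' safe yields the desired form:
from urllib.parse import quote
url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'
print(quote(url, safe=':/'))
# http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced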

regex regarding symbols in urls

I want to replace consecutive symbols with just one, such as:
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
However, I notice that this might also replace symbols in URLs that happen to appear in my text,
like http://example.com/this--is-a-page.html
Can someone give me some advice on how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix)  # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
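One possible workaround, sketched under the assumption of Python 3: hand re.sub a replacement function instead of a replacement string, so no group has to be referenced when it did not participate in the match. The pattern below is a simplified variant of the answer's regex, not the original:
import re

# URLs (simplified): http(s)/ftp/file schemes, or bare www./ftp. hosts, up to whitespace
url_part = r'\b(?:(?:https?|ftp|file)://|www\.|ftp\.)\S+'
# A repeated non-word symbol, possibly separated by whitespace
sym_part = r'(?P<sym>[^\s\w])(?:\s*(?P=sym))+'

pattern = re.compile(r'(?P<url>' + url_part + r')|' + sym_part)

def collapse(match):
    # Leave URLs untouched; otherwise keep a single copy of the repeated symbol.
    return match.group('url') if match.group('url') else match.group('sym')

text = 'Is this a dog??? See http://example.com/this--is-a-page.html !!'
print(pattern.sub(collapse, text))
# Is this a dog? See http://example.com/this--is-a-page.html !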
