How to use extended ascii with bs4 url - python

I've been reluctant to post a question about this, but after 3 days of Googling I can't get this to work. Long story short, I'm making a raid gear tracker for WoW.
I'm using BS4 to handle the web scraping, and I'm able to pull the page and scrape the info I need from it. The problem I'm having is when there is an extended ASCII character in the player's name, e.g. thermíte (the í is alt+161).
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
I'm trying to figure out how to re-encode the url so it is more like this:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
I'm using tkinter for the gui, I have the user select their realm from a dropdown and then type in the character name in an entry field.
namefield = Entry(window, textvariable=toonname)
I have a scraping function that performs the initial scrape of the main profile page. This is where I assign the value of namefield to a global variable. I also tried passing it directly to the scraper with this:
namefield = Entry(window, textvariable=toonname, command=firstscrape)
I thought I was close, because when I passed "thermíte", the scrape function would print out "therm\xC3\xADte". All I needed to do was replace the '\x' with '%' and I'd be golden. But it wouldn't work. I could use mastername.find('\x') and it would find instances of it in the string, but doing mastername.replace('\x','%') wouldn't actually replace anything.
I tried various combinations of r'\x', '\%', etc. No dice.
Lastly, when I try to do things like encode into Latin-1 and then decode back into UTF-8, I get errors about how it can't handle the extended ASCII character.
urlpart1 = "http://us.battle.net/wow/en/character/garrosh/"
urlpart2 = mastername
urlpart3 = "/advanced"
url = urlpart1 + urlpart2 + urlpart3
That's what I've been using to try and rebuild the final URL (for now I'm leaving the realm constant until I can get the name problem fixed).
TL;DR:
I'm trying to take a url with extended ascii like:
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
And have it become a url that a browser can easily process like:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
with all of the normal extended ascii characters.
I hope this made sense.
Here is a pastebin for the full script as it stands. Some things in it aren't used until later on. pastebin link

There shouldn't be non-ascii characters in the result url. Make sure mastername is a Unicode string (isinstance(mastername, str) on Python 3):
#!/usr/bin/env python3
from urllib.parse import quote
mastername = "thermíte"
assert isinstance(mastername, str)
url = "http://us.battle.net/wow/en/character/garrosh/{mastername}/advanced".format(
    mastername=quote(mastername, safe=''))
# -> http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
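As to why swapping '\x' for '%' in the printed string gets nowhere: the \xC3\xAD in the output is most likely just the printed representation of two raw bytes, not literal backslash-x text, so there is nothing for replace() to find. Percent-encoding works on the UTF-8 bytes themselves. A minimal sketch, assuming Python 3; percent_encode_char and build_profile_url are hypothetical helper names used only for illustration:

```python
from urllib.parse import quote

def percent_encode_char(ch):
    """Percent-encode one character by hand: one %XX per UTF-8 byte."""
    return ''.join('%{:02X}'.format(b) for b in ch.encode('utf-8'))

def build_profile_url(realm, name):
    """Build the profile URL, quoting each user-supplied path segment."""
    base = "http://us.battle.net/wow/en/character/{realm}/{name}/advanced"
    return base.format(realm=quote(realm, safe=''), name=quote(name, safe=''))

print(percent_encode_char('í'))                  # same bytes quote() produces
print(build_profile_url('garrosh', 'thermíte'))
```

Quoting the realm as well as the name is a design choice of this sketch, so that realm names with spaces or apostrophes would be handled the same way.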

You can try something like this:
>>> import urllib
>>> 'http://' + '/'.join([urllib.quote(x) for x in url.strip('http://').split('/')])
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
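urllib.quote() (urllib.parse.quote() on Python 3) percent-encodes the unsafe characters of a string. You don't want all the characters to be affected, just everything between the '/' characters and excluding the initial 'http://'. So the strip and split functions take those out of the equation, and then you concatenate them back in with the + operator and join. (Note that strip('http://') removes any of those characters from both ends rather than a literal prefix; it happens to work for this URL.)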
EDIT: This one is on me for not reading the docs... Much cleaner:
>>> url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'
>>> urllib.quote(url, safe=':/')
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
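One pitfall with quoting a whole URL is double encoding: if the input is already percent-encoded, each '%' is itself encoded as '%25'. A minimal sketch of both cases, assuming Python 3 (urllib.parse.quote rather than Python 2's urllib.quote); the variable names are illustrative only:

```python
from urllib.parse import quote, unquote

raw_url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'

# Quote once, leaving ':' and '/' untouched.
encoded_once = quote(raw_url, safe=':/')

# Quoting an already-encoded URL turns every '%' into '%25'.
encoded_twice = quote(encoded_once, safe=':/')

# One possible guard: decode first, then encode, so the result is the same
# whether or not the input was already encoded.
normalized = quote(unquote(encoded_once), safe=':/')

print(encoded_once)
print(encoded_twice)
print(normalized)
```

The decode-then-encode guard assumes the input never contains a literal '%' that is meant to be kept as-is.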

Related

Python getting the name of a website from its url [duplicate]

This question already has answers here:
Extract domain name from URL in Python
I want to get the name of a website from a url in a very simple way. Like, I have the URL "https://www.google.com/" or any other url, and I want to get the "google" part.
The issue is that there could be many pitfalls. Like, it could be www3 or it could be http for some reason. It could also be like the python docs where it says "https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse". I only want "python" in that case.
Is there a simple way to do it? The only one I can think of is just doing lots and lots of string.removeprefix or something like that, but that's ugly. I could not find anything that resembled what I was looking for in the urllib library, but maybe there is another one?
Here's an idea:
import re
url = 'https://python.org'
url_ext = ['.com', '.org', '.edu', '.net', '.co.uk', '.cc', '.info', '.io']
web_name = ''
# Cuts off extension and everything after
for ext in url_ext:
    if ext in url:
        web_name = url.split(ext)[0]
# Reverse the string to find first non-alphanumeric character
web_name = web_name[::-1]
final = re.search(r'\W+', web_name).start()
final = web_name[0 : final]
# Reverse string again, return final
print(final[::-1])
The code starts by cutting off the extension of the website and everything that follows it. It then reverses the string and looks for the first non-alphanumeric character and cuts off everything after that utilizing the regex library. It then reverses the string again to print out the final result.
This code is probably not going to work on every single website, as there are a million different ways to structure a URL, but it should work for you to some degree.
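An alternative that avoids the extension list is to let urllib.parse pull out the host name and then take the label just before the top-level domain. This is only a rough sketch, assuming Python 3; site_name is a hypothetical helper, and it deliberately ignores multi-part suffixes such as '.co.uk' (for those, a public-suffix-aware package such as tldextract would be needed):

```python
from urllib.parse import urlparse

def site_name(url):
    """Return the label just before the last dot-separated part of the host."""
    host = urlparse(url).hostname or ''
    labels = host.split('.')
    # 'docs.python.org' -> ['docs', 'python', 'org'] -> 'python'
    return labels[-2] if len(labels) >= 2 else host

print(site_name('https://www.google.com/'))
print(site_name('https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse'))
```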

Is there a urllib bug with unquoting 'ß'?

I'm scraping a German website with Python and want to save files with the original name, which I get as a URL-encoded string.
e.g. what I start with:
'GA_Pottendorf_Wiener%20Stra%C3%9Fe%2035_signed.pdf'
what I want:
'GA_Pottendorf_Wiener Straße 35_signed.pdf'
import urllib.parse
urllib.parse.unquote(file_name_quoted)
The above code works fine with 99% of the names until they contain an 'ß', which should be a legit UTF-8 character.
But I get this:
'GA_Pottendorf_Wiener Stra�e 35_signed.pdf'
>>> urllib.parse.quote('ß')
'%C3%9F'
>>> urllib.parse.unquote('%C3%9F')
'�'
Is this a bug, or a feature that I don't understand?
(and please feel free to correct my way of asking this question, it's my first one)
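For comparison, here is what the round trip is expected to look like on Python 3, plus one way a single replacement character can arise: a name that was percent-encoded from Latin-1 bytes ('%DF') but decoded as UTF-8, which is unquote's default. Whether that (or a console that cannot display 'ß') is what happened in the case above is an assumption; the sketch only shows the mechanism:

```python
from urllib.parse import quote, unquote

quoted = 'GA_Pottendorf_Wiener%20Stra%C3%9Fe%2035_signed.pdf'
decoded = unquote(quoted)                  # UTF-8 is the default encoding

# The same letter percent-encoded from Latin-1 is a single byte.
latin1_quoted = quote('ß', encoding='latin-1')

# Decoding that byte as UTF-8 fails, and errors='replace' (the default)
# substitutes U+FFFD, the replacement character.
broken = unquote(latin1_quoted)

# Telling unquote which encoding the bytes are in recovers the letter.
fixed = unquote(latin1_quoted, encoding='latin-1')

print(decoded)
print(latin1_quoted, repr(broken), fixed)
```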

Python Web Scraping: Extracting the Area of a Region in Wikipedia from the Infobox Geography Vcard

I know this sort of question has been dealt with numerous times, but after combing through answers and guides for hours, I just can't crack this and would be enormously grateful for some help.
Ideally, I want to extract the area in square kilometers as listed in the Infobox on Wikipedia. For example, the code I run on https://en.wikipedia.org/wiki/Sandton should produce something along the lines of "143.54 km".
The code I've put together using numerous guides seems to work only on Wikipedia sites for whole countries where the "Area" is actually a link. Trying this on Spain's Wikipedia page:
from bs4 import BeautifulSoup
import requests
def getAdditionalDetails(URL):
    try:
        soup = BeautifulSoup(requests.get(URL).text, 'lxml')
        table = soup.find('table', {'class': 'infobox geography vcard'})
        additional_details = []
        read_content = False
        for tr in table.find_all('tr'):
            if (tr.get('class') == ['mergedtoprow'] and not read_content):
                link = tr.find('th')
                if (link.get_text().strip() == 'Area'):
                    read_content = True
                if (link.get_text().strip() == 'Population'):
                    read_content = False
            elif ((tr.get('class') == ['mergedrow'] or tr.get('class') == ['mergedbottomrow']) and read_content):
                additional_details.append(tr.find('td').get_text().strip('\n'))
                if (tr.find('div').get_text().strip() != '•\xa0Total area'):
                    read_content = False
        return additional_details
    except Exception as error:
        print('Error occurred: {}'.format(error))
        return []
URL = "https://en.wikipedia.org/wiki/Spain"
print(getAdditionalDetails(URL))
This outputs the almost usable:
['505,990[6]\xa0km2 (195,360\xa0sq\xa0mi) (51st)']
Can anyone much smarter than I assist?
Thank you.
Not the cleanest way to do this but here goes. If you want a specific row, start with that as the CSS selector.
Code Example
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Sandton'
html = requests.get(url)
soup = BeautifulSoup(html.text,'html.parser')
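area = soup.select('table > tbody > tr')[9].get_text(strip=True)
# Replace non-breaking spaces with ordinary spaces, then drop the "(… sq mi)" part
area = area.replace('\xa0', ' ').split('(')[0]
# Drop the leading '• Total' (7 characters)
cleaned_area = area[7:]
Output
143.54 km2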
Explanation
For the area variable, we select the table rows with the CSS selector and pick the one we want by index.
get_text(strip=True) grabs the text and strips the white space from the start and end of each piece of text. You should know that \xa0 is the non-breaking space (byte 0xA0 in Latin-1, code point U+00A0 in Unicode). strip=True only removes it at the start and end of the string.
The output of the area variable without strip=True looks like this
'\xa0•\xa0Total143.54\xa0km2 (55.42\xa0sq\xa0mi)'
With strip=True
'•\xa0Total143.54\xa0km2(55.42\xa0sq\xa0mi)'
So you're still stuck with \xa0 inside the string.
Using the replace string method, we can replace \xa0 with a space.
So the output becomes
'• Total143.54 km2(55.42 sq mi)'
Then, because we don't actually need the first 7 characters ('• Total'), we just take everything from the 8th character onwards using string slicing.
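Fixed slice offsets break as soon as the row text changes, so a slightly more tolerant variant is to normalise the non-breaking spaces and then pick out the number in front of "km2" with a regular expression. This is only a sketch that works on the two sample strings quoted above; extract_area_km2 is a hypothetical helper and the pattern assumes the "number, optional footnote marker, km2" layout:

```python
import re

def extract_area_km2(raw):
    """Return e.g. '143.54 km2' from an infobox row's text, or None."""
    text = raw.replace('\xa0', ' ')        # non-breaking space -> space
    text = re.sub(r'\[\d+\]', '', text)    # drop footnote markers like [6]
    match = re.search(r'([\d,]+(?:\.\d+)?)\s*km2', text)
    return match.group(1) + ' km2' if match else None

print(extract_area_km2('•\xa0Total143.54\xa0km2(55.42\xa0sq\xa0mi)'))
print(extract_area_km2('505,990[6]\xa0km2 (195,360\xa0sq\xa0mi) (51st)'))
```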
Additional Information
Encoding is a huge topic within Python and computing in general, and knowing a little bit about it is important. Essentially, encoding exists because everything in a computer is bytes, whether we like it or not. There has to be a translation from hardware to software, and encoding is part of that step.
We want to be able to convert characters into bits so that the computer can do something with them when we write code.
The simplest type of encoding is ASCII which you may have already come across at some point. The entire ASCII table has 128 characters which correspond to 'Code Points'
ASCII Code Point: 97
Character: a
Now you might ask, what is the point of that? Well, we can turn these characters into code points, which are easily translated to binary, that is, into bits (ones and zeros) for the computer to do something with at the hardware level.
Now the problem with ASCII is that there are far more than 128 characters in human languages. So enter new types of encodings, of which there are many. The most common are the Unicode encodings such as UTF-8, and I've provided some resources to learn a little more about that.
Latin-1 (ISO-8859-1) is the historical default encoding for text in HTTP, and the requests library falls back to it when a text response's headers don't declare a charset.
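To make the code point and encoding idea concrete, here is a small sketch, assuming Python 3, showing a character's code point, its bits, and how the same character becomes different bytes under different encodings (the variable names are illustrative only):

```python
# Character -> code point -> bits
code_point = ord('a')
bits = format(code_point, '08b')

# The same non-ASCII character under two encodings
utf8_bytes = 'í'.encode('utf-8')       # two bytes
latin1_bytes = 'í'.encode('latin-1')   # one byte

# The non-breaking space from the example above
nbsp_utf8 = '\xa0'.encode('utf-8')

# Decoding bytes with the wrong encoding gives garbage rather than an error
mojibake = utf8_bytes.decode('latin-1')

print(code_point, bits)
print(utf8_bytes, latin1_bytes, nbsp_utf8)
print(repr(mojibake))
```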
Some resources:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Pragmatic Unicode
Real Python | Encoding
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

How do I get a regular expression to recognize non-ASCII characters as letters?

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.
My problem is that when I print the information the öäå are gone.
I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.
So if I'm correct, the regexes are removing the öäå, right? I mean, the only thing that should be left of the hex char after the regex is the x, but there is no x in place of the öäå on my page, so maybe this little theory is not correct. Anyway, right or wrong, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).
EDIT: The encoding on the Swedish site is utf-8 and the encoding on my site is also utf-8.
EDIT2: You can use ISO-8859-10 for Swedish, but according to google chrome the encoding is Unicode(utf-8) on this specific site
Always work in unicode and only convert to an encoded representation when necessary.
For this particular situation, you also need to use the re.U flag so \w matches unicode letters:
#coding: utf-8
import re
location = "öäå".decode('utf-8')
location = re.sub(r'([^\w])+', '', location, flags=re.U)
print location # prints öäå
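The snippet above is Python 2. On Python 3, strings are already Unicode and \w matches Unicode letters by default for str patterns, so no decode step or re.U flag is needed; the re.ASCII flag restores the old ASCII-only behaviour. A minimal sketch, with a made-up sample string:

```python
import re

location = 'Göteborg, Västra Götaland (Sverige)!'

# Default on Python 3: \w includes letters such as ö, ä, å
letters_only = re.sub(r'[^\w]+', '', location)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so ö and ä are stripped
ascii_only = re.sub(r'[^\w]+', '', location, flags=re.ASCII)

print(letters_only)
print(ascii_only)
```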
It would help if you could dump the strings before and after each step.
Check your value of re.UNICODE first, see this

Python regex convert youtube url to youtube video

I'm making a regex so I can find YouTube links (there can be several) in a piece of HTML text posted by a user.
Currently I'm using the following regex to change 'http://www.youtube.com/watch?v=-JyZLS2IhkQ' into displaying the corresponding youtube video:
return re.compile('(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)').sub(tag, value)
(where the variable 'tag' is a bit of html so the video works and 'value' a user post)
Now this works.. until the url is like this:
'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature...'
Now I'm hoping you guys could help me figure how to also match the '&feature...' part so it disappears.
Example HTML:
No replies to this post..
Youtube vid:
http://www.youtube.com/watch?v=-JyZLS2IhkQ
More blabla
Thanks for your thoughts, much appreciated
Stefan
Here's how I'm solving it:
import re
def youtube_url_validation(url):
    # Group 6 captures the 11-character video id
    youtube_regex = (
        r'(https?://)?(www\.)?'
        r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
        r'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')
    # re.match returns a match object, or None when the URL doesn't match
    return re.match(youtube_regex, url)
TESTS:
youtube_urls_test = [
    'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
    'http://youtu.be/5Y6HSHwhVlY',
    'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
    'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
    'http://www.youtube.com/',
    'http://www.youtube.com/?feature=ytca']
for url in youtube_urls_test:
    m = youtube_url_validation(url)
    if m:
        print('OK {}'.format(url))
        print(m.groups())
        print(m.group(6))
    else:
        print('FAIL {}'.format(url))
You should specify your regular expressions as raw strings.
You don't have to escape every character that looks special, just the ones which are.
Instead of specifying an empty branch ((foo|)) to make something optional, you can use ?.
If you want to include - in a character set, you have to escape it or put it right after the opening bracket (or right before the closing one).
You can use special character sets like \w (equals [a-zA-Z0-9_]) to shorten your regex.
r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)'
Now, in order to match the whole URL, you have to think about what can or cannot follow it in the input. Then you put that into a lookahead group (you don't want to consume it).
In this example I took everything except -, =, %, & and alphanumerical characters to end the URL (too lazy to think about it any harder).
Everything between the v-argument and the end of the URL is non-greedily consumed by .*?.
r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%]|$)'
Still, I would not put too much faith into this general solution. User input is notoriously hard to parse robustly.
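As an illustration of how such a pattern could be used with re.sub, here is a minimal sketch assuming Python 3. The replacement text '[video ID]' is a stand-in for the real embed HTML, the sample post is made up from the question's example, and the lookahead here also accepts end of input (|$) so that a URL at the very end of the post matches:

```python
import re

YOUTUBE_RE = re.compile(
    r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)'
    r'(&.*?)?(?=[^-\w&=%]|$)')

def embed_videos(post):
    # Group 4 is the video id; trailing '&feature=...' is matched and dropped.
    return YOUTUBE_RE.sub(r'[video \4]', post)

post = ('Youtube vid:\n'
        'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related\n'
        'More blabla')
print(embed_videos(post))
```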
What if you used the urlparse module to pick apart the youtube address you find and put it back into the format you want? You could then simplify your regex so that it only finds the entire url and then use urlparse to do the heavy lifting of picking it apart for you.
from urlparse import urlparse,parse_qs,urlunparse
from urllib import urlencode
youtube_url = urlparse('http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index')
params = parse_qs(youtube_url.query)
new_params = {'v': params['v'][0]}
cleaned_youtube_url = urlunparse((youtube_url.scheme,
                                  youtube_url.netloc,
                                  youtube_url.path,
                                  None,
                                  urlencode(new_params),
                                  youtube_url.fragment))
It's a bit more code, but it allows you to avoid regex madness.
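The snippet above uses the Python 2 module names. On Python 3 the same functions live in urllib.parse; a rough port, wrapped in a hypothetical helper (clean_youtube_url is not part of any library) that returns None when there is no v parameter, might look like this:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def clean_youtube_url(url):
    """Rebuild the URL keeping only the 'v' query parameter."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    if 'v' not in params:
        return None
    query = urlencode({'v': params['v'][0]})
    # Tuple order: scheme, netloc, path, params, query, fragment
    return urlunparse((parts.scheme, parts.netloc, parts.path,
                       '', query, parts.fragment))

print(clean_youtube_url(
    'http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index'))
```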
And as hop said, you should use raw strings for the regex.
Here's how I implemented it in my script:
import re
string = "Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg"
youtube = re.findall(r'(https?://)?(www\.)?((youtube\.(com))/watch\?v=([-\w]+)|youtu\.be/([-\w]+))', string)
if youtube:
    print(youtube)
That outputs:
[('https://', 'www.', 'youtube.com/watch?v=bS5P_LAqiVg', 'youtube.com', 'com', 'bS5P_LAqiVg', '')]
If you just wanted to grab the video id, for example, you would do:
video_id = [c for c in youtube[0] if c] # Get rid of empty strings
video_id = video_id[-1] # Return the last item in the list
