I've read that BeautifulSoup has problems with ampersands (&), which are not strictly correct in HTML but are still interpreted correctly by most browsers. However, weirdly, I'm getting different behaviour on a Mac system and on an Ubuntu system, both using bs4 version 4.3.2:
html='<td>S&P500</td>'
s=bs4.BeautifulSoup(html)
On the Ubuntu system s is equal to:
<td>S&amp;P500;</td>
Notice the added semicolon at the end, which is a real problem.
On the mac system:
<html><head></head><body>S&amp;P500</body></html>
Never mind the html/head/body tags, I can deal with that, but notice that S&P500 is correctly interpreted this time, without the added ";".
Any idea what's going on? How to make cross-platform code without resorting to an ugly hack? Thanks a lot,
First, I can't reproduce the Mac results using Python 2.7.1 and BeautifulSoup 4.3.2; that is, I am getting the extra semicolon on all systems.
The easy fix is a) use strictly valid HTML, or b) add a space after the ampersand. Chances are you can't change the source, and if you could parse out and replace these in Python you wouldn't be needing BeautifulSoup ;)
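If you can preprocess the markup anyway, a rough sketch that escapes bare ampersands before parsing (my illustration, not part of the original answer) could look like this:
import re
import bs4
html = '<td>S&P500</td>'
# Rough heuristic: escape any '&' that does not already start a character
# reference such as '&amp;' or '&#38;'.
safe_html = re.sub(r'&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)', '&amp;', html)
print(bs4.BeautifulSoup(safe_html))  # the text node is now 'S&P500', no stray semicolon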
So the problem is that the BeautifulSoupHTMLParser first converts S&P500 to S&P500; because it assumes P500 is the character name and you just forgot the semicolon.
Then later it reparses the string and finds &P500;. Now it doesn't recognize P500 as a valid name and converts the & to &amp; without touching the rest.
Here is a stupid monkeypatch only to demonstrate my point. I don't know the inner workings of BeautifulSoup well enough to propose a proper solution.
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import BeautifulSoupHTMLParser
from bs4.dammit import EntitySubstitution
def handle_entityref(self, name):
    character = EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # Previously was
        # data = "&%s;" % name
        data = "&%s" % name
    self.handle_data(data)
html = '<td>S&P500</td>'
# Pre monkeypatching
# <td>S&amp;P500;</td>
print(BeautifulSoup(html))
BeautifulSoupHTMLParser.handle_entityref = handle_entityref
# Post monkeypatching
# <td>S&amp;P500</td>
print(BeautifulSoup(html))
Hopefully someone more versed in bs4 can give you a proper solution, good luck.
I'm scraping a German website with Python and want to save files with the original name that I get as a URL-encoded string.
e.g. what I start with:
'GA_Pottendorf_Wiener%20Stra%C3%9Fe%2035_signed.pdf'
what I want:
'GA_Pottendorf_Wiener Straße 35_signed.pdf'
import urllib.parse
urllib.parse.unquote(file_name_quoted)
The above code works fine for 99% of the names, but fails when they contain an 'ß', which should be a legitimate UTF-8 character.
But I get this:
'GA_Pottendorf_Wiener Stra�e 35_signed.pdf'
>>> urllib.parse.quote('ß')
'%C3%9F'
>>> urllib.parse.unquote('%C3%9F')
'�'
Is this a bug, or a feature that I don't understand?
(and please feel free to correct my way of asking this question, it's my first one)
After doing some research I found out that the best way to get the Ethernet MAC address under Windows is the "getmac" command (the getmac module of Python does not produce the same output!). Now I want to use this command from within Python code to get the MAC address. I figured out that my code should start something like this:
import os
import sys

if sys.platform == 'win32':
    os.system("getmac")
    # do something here to get the first MAC address that appears in the results
Here is an example output:
Physical Address    Transport Name
=================== ==========================================================
1C-69-7A-3A-E3-40   Media disconnected
54-8D-5A-CE-21-1A   \Device\Tcpip_{82B01094-C274-418F-AB0A-BC4F3660D6B4}
I finally want to get 1C-69-7A-3A-E3-40, preferably without the dashes.
Thanks in advance.
Two things. First of all, I recommend you find ways of getting the mac address more elegantly. This question's answer seems to use the uuid module, which is perhaps a good cross-platform solution.
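For reference, a minimal sketch of the uuid-based approach (my illustration, not code from the linked answer) might look like this:
import uuid
# uuid.getnode() returns the hardware address as a 48-bit integer; note that
# it can fall back to a random number if no hardware address can be found.
node = uuid.getnode()
mac = '{:012X}'.format(node)  # e.g. '1C697A3AE340', no dashes
print(mac)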
Having said that, if you want to proceed with parsing the output of a system call, I recommend the use of Python's subprocess module. For example:
import subprocess
output_of_command = subprocess.check_output("getmac")
This will run getmac and the output of that command will go into a variable. From there, you can parse the string.
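One small caveat (assuming Python 3 here): check_output returns bytes, so you may want to decode it before doing any string parsing:
output_of_command = subprocess.check_output("getmac").decode(errors="replace")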
Here's how you might extract the mac address from that string:
# I'm setting this directly to provide a clear example of the parsing, separate
# from the first part of this answer.
my_string = """Physical Address Transport Name
=================== ==========================================================
1C-69-7A-3A-E3-40 Media disconnected
54-8D-5A-CE-21-1A \Device\Tcpip_{82B01094-C274-418F-AB0A-BC4F3660D6B4}"""
my_mac_address = my_string.rsplit('=', 1)[-1].split(None, 1)[0]
The first split is a right split. It's breaking up the string by the '=' character, once, starting from the end of the string. Then, I'm splitting the output of that by whitespace, limiting to one split, and taking the first string value.
Again, however, I would discourage this approach to getting a mac address. Parsing the human-readable output of command line scripts is seldom advisable because the output can unexpectedly be different than what your script is expecting. You can assuredly get the mac address in a more robust way.
I'm trying to trim a sub-string from the beginning of a string based on a condition:
For instance, if the input is a domain name prefixed with http, https and/or www, it needs to strip these and return only the domain.
Here's what I have so far:
if my_domain.startswith("http://"):
my_domain = my_domain[7:]
elif my_domain.startswith("https://"):
my_domain = my_domain[8:]
if my_domain.startswith("www."):
my_domain = my_domain[4:]
print my_domain
I've tried to use these inbuilt functions (.startswith) instead of trying to use regex.
While the code above works, I'm wondering if there is a more efficient way to combine the conditions to make the code shorter or have multiple checks in the same conditional statement?
I know regex is computationally slower than a lot of the built-in methods, but it is a lot easier to write, code-wise :)
import re
re.sub("http[s]*://|www\." , "", my_domain)
edit:
As mentioned by @Dunes, a more correct way of answering this problem is:
re.sub(r"^https?://(www\.)?" , "" , my_domain)
Old answer left for reference so that Dunes comment still has some context.
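For example, the corrected pattern behaves like this (the sample domain is just for illustration):
import re
my_domain = "https://www.example.com"
print(re.sub(r"^https?://(www\.)?", "", my_domain))  # example.com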
Use urllib.parse (Python 3).
>>> from urllib import parse
>>> components = parse.urlsplit('http://stackoverflow.com/questions/38187220/stripping-multiple-characters-from-the-start-of-a-string')
>>> components[1]
'stackoverflow.com'
The Python 2.7 equivalent is named urlparse.
To cover the 'www.' case, you could simply do
*subdomains, domain, ending = components[1].split('.')
return '.'.join((domain, ending))
In Python 2.7 you don’t have access to * unpacking but you can use a list slice instead to get the same effect.
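A rough Python 2.7 equivalent using a slice (my sketch, continuing from the urlsplit result above and assuming the host has at least two labels) would be:
parts = components[1].split('.')
print '.'.join(parts[-2:])  # 'stackoverflow.com'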
I've been reluctant to post a question about this, but after 3 days of Google I can't get this to work. Long story short, I'm making a raid gear tracker for WoW.
I'm using BS4 to handle the web scraping, and I'm able to pull the page and scrape the info I need from it. The problem I'm having is when there is an extended ASCII character in the player's name, e.g. thermíte (the í is Alt+161).
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
I'm trying to figure out how to re-encode the url so it is more like this:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
I'm using tkinter for the gui, I have the user select their realm from a dropdown and then type in the character name in an entry field.
namefield = Entry(window, textvariable=toonname)
I have a scraping function that performs the initial scrape of the main profile page; this is where I assign the value of namefield to a global variable. I also tried passing it directly to the scraper with this:
namefield = Entry(window, textvariable=toonname, command=firstscrape)
I thought I was close, because when it passed "thermíte", the scrape function would print out "therm\xC3\xADte", so all I needed to do was replace the '\x' with '%' and I'd be golden. But it wouldn't work. I could use mastername.find('\x') and it would find instances of it in the string, but mastername.replace('\x','%') wouldn't actually replace anything.
I tried various combinations of r'\x', '\%', r'\x', etc. No dice.
Lastly, when I try to do things like encode into Latin-1 and then decode back into UTF-8, I get errors about how it can't handle the extended ASCII character.
urlpart1 = "http://us.battle.net/wow/en/character/garrosh/"
urlpart2 = mastername
urlpart3 = "/advanced"
url = urlpart1 + urlpart2 + urlpart3
That's what I've been using to try to rebuild the final URL (at the moment I'm leaving the realm constant until I can get the name problem fixed).
TL;DR:
I'm trying to take a url with extended ascii like:
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
And have it become a url that a browser can easily process like:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
with all of the normal extended ascii characters.
I hope this made sense.
Here is a pastebin for the full script at the moment; there are some things in it that aren't utilized until later on: pastebin link
There shouldn't be non-ascii characters in the result url. Make sure mastername is a Unicode string (isinstance(mastername, str) on Python 3):
#!/usr/bin/env python3
from urllib.parse import quote
mastername = "thermíte"
assert isinstance(mastername, str)
url = "http://us.battle.net/wow/en/character/garrosh/{mastername}/advanced"\
.format(mastername=quote(mastername, safe=''))
# -> http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
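Applied to the URL-building code from the question (variable names taken from there, with mastername as in the snippet above), it would look roughly like:
from urllib.parse import quote
urlpart1 = "http://us.battle.net/wow/en/character/garrosh/"
urlpart2 = quote(mastername, safe='')  # percent-encode only the character name
urlpart3 = "/advanced"
url = urlpart1 + urlpart2 + urlpart3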
You can try something like this:
>>> import urllib
>>> 'http://' + '/'.join([urllib.quote(x) for x in url.strip('http://').split('/')])
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
urllib.quote() URL-encodes the characters of a string, except those marked as safe. You don't want all the characters to be affected, just everything between the '/' characters and excluding the initial 'http://'. So the strip and split calls take those out of the equation, and then you concatenate them back in with the + operator and join.
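To illustrate the effect of the safe parameter (Python 2 session, the values are my own examples):
>>> import urllib
>>> urllib.quote('a b/c')            # '/' is safe by default
'a%20b/c'
>>> urllib.quote('a b/c', safe='')   # nothing is safe, so '/' is encoded too
'a%20b%2Fc'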
EDIT: This one is on me for not reading the docs... Much cleaner:
>>> url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'
>>> urllib.quote(url, safe=':/')
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
I am getting xml data from an application, which I want to parse in python:
#!/usr/bin/python
import xml.etree.ElementTree as ET
import re
xml_file = 'tickets_prod.xml'
xml_file_handle = open(xml_file,'r')
xml_as_string = xml_file_handle.read()
xml_file_handle.close()
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
root = ET.fromstring(xml_cleaned)
It works for smaller datasets with example data, but when I go to real live data, I get
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72
Looking at the xml file, I see this line 364658:
WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>
I guess it is the ^[ which makes python choke - it is also highlighted blue in vim. Now I was hoping that I could clean the data with my regex substitution, but that did not work.
The best thing would be fixing the application which generated the xml, but that is out of scope. So I need to deal with the data as it is. How can I work around this? I could live with just throwing away "illegal" characters.
You already do:
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
but the character ^[ is probably Python's \x1b. If xml.parsers.expat chokes on it, you simply need to clean up more, by accepting only a few characters below 0x20 (space). For example:
xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+',u'',xml_as_string)
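Put together with the code from the question (same file name as there), a rough sketch would be:
import re
import xml.etree.ElementTree as ET

with open('tickets_prod.xml', 'r') as xml_file_handle:
    xml_as_string = xml_file_handle.read()

# Keep tab, newline and carriage return, drop other control characters and
# anything outside the printable ASCII range before parsing.
xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+', u'', xml_as_string)
root = ET.fromstring(xml_cleaned)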
I know this is pretty old, but I stumbled upon the following URL, which has a list of all of the primary characters and their encodings:
https://medium.com/interview-buddy/handling-ascii-character-in-python-58993859c38e