Part of the code below is sourced from another example. It's modified a bit and used to read an HTML file and output the contents into a spreadsheet.
As it's just a local file, using Selenium is maybe overkill, but I want to learn through this example.
from selenium import webdriver
import lxml.html as LH
import lxml.html.clean as clean
import xlwt
book = xlwt.Workbook(encoding='utf-8', style_compression = 0)
sheet = book.add_sheet('SeaWeb', cell_overwrite_ok = True)
driver = webdriver.PhantomJS()
ignore_tags=('script','noscript','style')
results = []
driver.get("source_file.html")
content = driver.page_source
cleaner = clean.Cleaner()
content = cleaner.clean_html(content)
doc = LH.fromstring(content)
for elt in doc.iterdescendants():
    if elt.tag in ignore_tags: continue
    text = elt.text or ''  # question 1
    tail = elt.tail or ''  # question 1
    words = ''.join((text, tail)).strip()
    if words:  # extra question
        words = words.encode('utf-8')  # question 2
        results.append(words)  # question 3
        results.append('; ')  # question 3
sheet.write(0, 0, results)
book.save("C:\\ source_output.xls")
The lines text = elt.text or '' and tail = elt.tail or '' – why do both .text and .tail contain text? And why is the or '' part important here?
The texts in the HTML file contain special characters like ° (temperature degrees) – the .encode('utf-8') doesn't give a perfect output, either in IDLE or in the Excel spreadsheet. What's the alternative?
Is it possible to join the output into a string instead of a list? As it is now, I have to call .append twice to add the text and the '; '.
elt is an HTML node. It contains certain attributes and a text section. lxml provides a way to extract all the attributes and the text, using .text or .tail depending on where the text sits.
<a attribute1='abc'>
    some text                     ----> .text gets this
    <p attributeP='def'> </p>
    some tail                     ----> .tail gets this
</a>
The idea behind the or '' is that if no text/tail is found in the current HTML node, lxml returns None, and a later attempt to concatenate or append that None value will complain. So, to avoid future errors, if the text/tail is None, an empty string '' is used instead.
The degree character is a one-character unicode string, but when you .encode('utf-8') it, it becomes a 2-byte UTF-8 byte string: the two bytes \xc2\xb0. So basically you do not have to do any encoding for the ° character, and the Python interpreter interprets the encoding correctly. If it doesn't, provide the correct encoding declaration at the top of your Python script. Check PEP 263:
# -*- coding: UTF-8 -*-
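As a quick sanity check (my own illustration, in Python 2 to match the code above), you can see the one-character/two-byte relationship directly:
text = u'25\xb0C'                # u'25°C', one unicode degree sign (U+00B0)
encoded = text.encode('utf-8')   # the degree sign becomes the two bytes '\xc2\xb0'
print len(text), len(encoded)    # 4 5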
Yes, you can also join the output into a string; just use +, since there is no append for string types, e.g.
results = ''
results = results + 'whatever you want to join'
You can keep the list and combine your 2 lines:
results.append(words + '; ')
Note: I just checked the xlwt documentation, and sheet.write() accepts only strings. So basically you cannot pass results, which is a list.
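A minimal sketch of how the write could look instead, reusing the variables above (and assuming results then holds only the words, without the separately appended '; ' entries):
# Sketch: collapse the collected words into one string,
# since sheet.write() will not accept a list.
sheet.write(0, 0, '; '.join(results))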
A simple example for Q1
from lxml import etree
test = etree.XML("<main>placeholder</main>")
print test.text #prints placeholder
print test.tail #prints None
print test.tail or '' #prints empty string
test.text = "texter"
print etree.tostring(test) #prints <main>texter</main>
test.tail = "tailer"
print etree.tostring(test) #prints <main>texter</main>tailer
Related
I am creating a bot to automate my work and copy particular values from a particular website. Everything works fine except the last line of my code: w.text produces text, and I need a number. Each element I need the value of looks like this on inspection:
<span class="good">€25,217.65</span>
How do I get the value as a number instead of as text? I tried w.value and w.get_attribute('value') but it doesn't work.
Here is my program (excluding downloads of libraries and files)
driver = webdriver.Chrome(driver_path)
driver.get('https://seabass-admin.igp.cloud/')
# waiting for login table to load
try:
    element = WebDriverWait(driver, 10).until(
        ec.presence_of_element_located((By.XPATH, '//*[@id="email"]'))
    )
except:
    driver.quit()
#entering sensitive info
driver.find_element_by_id("email").send_keys(pwx.em)     # login details
driver.find_element_by_id("password").send_keys(pwx.pw)  # password details
driver.find_element_by_xpath('//*[@id="appContainer"]/div/form/button').click()  # click sign in
# waiting for page to load
try:
    element = WebDriverWait(driver, 10).until(
        ec.presence_of_element_located((By.XPATH, '//*[@id="testing"]/section/section[4]/div/table/tbody/tr[2]/td[3]/span'))
    )
except:
    driver.quit()
# getting info from the page
w = driver.find_element_by_xpath('//*[@id="testing"]/section/section[4]/div/table/tbody/tr[2]/td[3]/span')
cell = outcome['import']
cell[withdrawal_cell].value = w.text
You could use some of Python's built in functions for that:
str.strip() to remove any leading or trailing '€' character, then
str.replace() to remove ',' (replace it with an empty string '')
Specifically:
str_w = w.text  # this is the '€25,217.65' string
digits = str_w.strip('€').replace(',', '')  # use the functions above to get a number-like string
cell[withdrawal_cell].value = float(digits)  # convert to float number
As per the HTML you have shared:
<span class="good">€25,217.65</span>
The text €25,217.65 is the innerHTML.
So, you can extract the text €25,217.65 using either:
w.get_attribute("innerHTML")
the text attribute.
Now to get the value €25,217.65 as a number instead of text you need to:
Remove the € and , character using re.sub():
import re
string = "€25,217.65"
my_string = re.sub('[€,]', '', string)
Finally, to convert the string to float you need to pass the string as an argument to the float() as follows:
my_number = float(my_string)
So the entire operation in a single line:
import re
string = "€25,217.65"
print(float(re.sub('[€,]', '', string)))
Effectively, your line of code can be either of the following:
Using text attribute:
cell[withdrawal_cell].value = float(re.sub('[€,]', '', w.text))
Using get_attribute("innerHTML"):
cell[withdrawal_cell].value = float(re.sub('[€,]', '', w.get_attribute("innerHTML")))
I'm trying to scrape the position off of this webpage using BeautifulSoup. Here is my relevant code.
info_panel = soup.find("div",{"id":"meta"})
info_panel_rows = info_panel.find_all("p")
if info_panel_rows[2].find("strong") is not None:
    position = info_panel_rows[2].find("strong").next_sibling
    position = str(position).strip()
else:  # executing on this path in my current problem
    position = info_panel_rows[3].find("strong").next_sibling
    position = str(position).strip()
print(position)
When I scrape it though, it prints like such:
Small Forward
▪
How would I go about stripping this down to just "Small Forward"? I've looked all over Stack Overflow and couldn't find a clear answer.
Thanks for any help you can provide!
Are you having issues with newline and tab characters in position? If so, do
position = str(position).strip('\n\t ')
and if that dot is also an issue, copy it from the printed output and paste it into strip(). When you don't put anything in strip(), it only removes whitespace from both sides; you need to specify what you want removed. The example above removes newlines, tabs and whitespace.
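A small demonstration of the difference (illustrative string, not the actual page output):
s = '\n\tSmall Forward \u25aa\n'
print(s.strip())               # 'Small Forward ▪' -- only the surrounding whitespace goes
print(s.strip('\n\t \u25aa'))  # 'Small Forward' -- the bullet is stripped too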
If this does not solve your problem, you can try regex
import re
string_patterns = re.compile(r'\b[0-9a-zA-Z]*\b')
position = info_panel_rows[3].find("strong").next_sibling
results = string_patterns.findall(str(position))
results = ' '.join([item for item in results if len(item)])
print(results)
Hope this helps
If you encode it to ASCII, ignoring errors, then call strip(), you get the desired output.
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.basketball-reference.com/players/y/youngtr01.html').text
soup = BeautifulSoup(html, 'html.parser')
info_panel = soup.find("div", {"id": "meta"})
info_panel_rows = info_panel.find_all("p")
if info_panel_rows[2].find("strong") is not None:
    position = info_panel_rows[2].find("strong").next_sibling
else:
    position = info_panel_rows[3].find("strong").next_sibling
print(position.encode('ascii', 'ignore').strip())
Outputs:
Point Guard
Encoding to ascii gets rid of the bullet point.
Or if you just want to print the second line:
print(position.splitlines()[1].strip())
Also outputs:
Point Guard
I have some raw HTML scraped from a random website, possibly messy, with some scripts, self-closing tags, etc. Example:
ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
I want to return the HTML DOM without any strings, attributes or such stuff, only the tag structure, as a string showing the relation between parents, children and siblings; this would be my expected output (though the use of brackets is a personal choice):
'[html[head[meta, title], body[h1, p[span]]]]'
So far I tried using BeautifulSoup (this answer was helpful). I figured I should split the work into two steps:
- extract the tag "skeleton" of the HTML DOM, emptying everything like strings, attributes, and whatever comes before the <html>;
- return the flat HTML DOM, structured with tree-like delimiters such as brackets indicating children and siblings.
I posted the code as a self-answer.
You can use recursion. The name attribute gives you the name of a tag, and you can check whether an element's type is bs4.element.Tag to confirm it is a tag.
import bs4
ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
soup = bs4.BeautifulSoup(ex, 'html.parser')
result = ''
def recursive_child_search(tag):
    global result
    result += tag.name
    child_tag_list = [x for x in tag.children if type(x) == bs4.element.Tag]
    if len(child_tag_list) > 0:
        result += '['
    for i, child in enumerate(child_tag_list):
        recursive_child_search(child)
        if not i == len(child_tag_list) - 1:  # if not the last child
            result += ', '
    if len(child_tag_list) > 0:
        result += ']'
    return
recursive_child_search(soup.find())
print(result)
# html[head[meta, title], body[h1, p[span]]]
print('[' + result + ']')
# [html[head[meta, title], body[h1, p[span]]]]
I post here my first solution, which is still a bit messy and uses a lot of regex. The first function gets the emptied DOM structure and outputs it as a raw string, the second function modifies the string to add the delimiters.
import re
from bs4 import BeautifulSoup as soup

def clear_tags(htmlstring, remove_scripts=False):
    htmlstring = re.sub("^.*?(<html)", r"\1", htmlstring, flags=re.DOTALL)
    finishyoursoup = soup(htmlstring, 'html.parser')
    for tag in finishyoursoup.find_all():
        tag.attrs = {}
        for sub in tag.contents:
            if sub.string:
                sub.string.replace_with('')
    if remove_scripts:
        [tag.extract() for tag in finishyoursoup.find_all(['script', 'noscript'])]
    return str(finishyoursoup)

clear_tags(ex)
# '<html><head><meta/><title></title></head><body><h1></h1><p><span></span></p></body></html>'

def flattened_html(htmlstring):
    squeletton = clear_tags(htmlstring)
    step1 = re.sub(r"<([^/]*?)>", r"[\1", squeletton)  # replace beginning of tag
    step2 = re.sub(r"</(.*?)>", r"]", step1)           # replace end of tag
    step3 = re.sub(r"<(.*?)/>", r"[\1]", step2)        # deal with self-closing tags
    step4 = re.sub(r"\]\[", ", ", step3)               # gather sibling tags with a comma
    return step4

flattened_html(ex)
# '[html[head[meta, title], body[h1, p[span]]]]'
I needed to take an XML file and replace certain values with other values.
This was easy enough, parsing through the XML (as text) and replacing the old values with the new.
The issue is that the new text file is in the wrong format:
it's all encased in square brackets and has "\n" characters instead of line breaks.
I did try the xml.dom.minidom lib, but it's not working.
I could also parse the resulting file and remove the "\n" and square brackets, but I don't want to do that, as I am not sure those are the only artifacts this format has introduced.
Source code:
import json
import shutil
import itertools
import datetime
import time
import calendar
import sys
import string
import random
import uuid
import xml.dom.minidom
inputfile = open('data.txt')
outputfile = open('output.xml','w')
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
my_text = str(inputfile.readlines())
my_text2 = ''
#print (my_text)
#Just replicate the session logs x times ...
#print ("UUID random : " + str(uuid.uuid4()).replace("-","")[0:28])
for i in range(0, number_of_sessions):
    new_sess_id = str(uuid.uuid4()).replace("-", "")[0:28]
    my_text2 = my_text + my_text2
    my_text2 = my_text2.replace(sess_id, new_sess_id)
#xml = xml.dom.minidom.parseString(my_text2)
outputfile.write(my_text2)
print (my_text)
inputfile.close()
outputfile.close()
The original text is in XML format, but the output looks like:
time is it</span></div><div
class=\\"di_transcriptAvatarAnswerEntry\\"><span
class=\\"di_transcriptAvatarTitle\\">[AVATAR] </span> <span
class=\\"di_transcriptAvatarAnswerText\\">My watch says it\'s 6:07
PM.<br/>Was my answer helpful? No Yes</span></div>\\r\\n"
</variable>\n', '</element>\n', '</path>\n', '</transaction>\n',
'</session>\n', '<session>']
You are currently using readlines(). This reads each line of your file and returns a Python list, one line per entry (complete with the \n on the end of each entry). You were then using str() to convert the list representation into a string, for example:
text = str(['line1\n', 'line2\n', 'line3\n'])
text would now be a string that looks like your list, complete with all the [ and quote characters. Rather than using readlines(), you probably just need read(), which returns the whole file contents as a single string for you to work with.
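A quick way to see the difference (assuming a small data.txt of a few lines):
with open('data.txt') as f:
    print(str(f.readlines()))   # e.g. "['<session>\\n', '</session>\\n']" -- brackets and \n artifacts
with open('data.txt') as f:
    print(f.read())             # the original text, line breaks intact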
Try using the following type approach which also uses the preferred with context manager for dealing with files (it closes them automatically for you).
import uuid
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
with open('data.txt') as inputfile, open('output.xml', 'w') as outputfile:
    my_text = inputfile.read()
    my_text2 = ''
    #Just replicate the session logs x times ...
    for i in range(0, number_of_sessions):
        new_sess_id = str(uuid.uuid4()).replace("-", "")[0:28]
        my_text2 = my_text + my_text2
        my_text2 = my_text2.replace(sess_id, new_sess_id)
    outputfile.write(my_text2)
I'm doing some web scraping and sites frequently use HTML entities to represent non ascii characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?
For example:
I get back:
&#x01ce;
which represents an "ǎ" with a tone mark. In binary, this is represented as the 16-bit value 01ce. I want to convert the HTML entity into the value u'\u01ce'.
The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:
up to Python 3.4:
import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010')  # u'\xa9 2010'
h.unescape('&#169; 2010')  # u'\xa9 2010'
Python 3.4+:
import html
html.unescape('&copy; 2010')  # '\xa9 2010'
html.unescape('&#169; 2010')  # '\xa9 2010'
Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.
Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:
import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
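For instance (Python 2), applied to the hex entity from the question, its decimal form, and a named entity:
print repr(unescape(u'&#x01ce; &#462; &copy;'))  # u'\u01ce \u01ce \xa9'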
Use the builtin unichr -- BeautifulSoup isn't necessary:
>>> entity = '&#x01ce;'
>>> unichr(int(entity[3:-1], 16))
u'\u01ce'
If you are on Python 3.4 or newer, you can simply use html.unescape:
import html
s = html.unescape(s)
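Applied to the entity from the question:
import html
print(html.unescape('&#x01ce;'))  # ǎ, i.e. '\u01ce'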
An alternative, if you have lxml:
>>> import lxml.html
>>> lxml.html.fromstring('&#x01ce;').text
u'\u01ce'
You could find an answer here -- Getting international characters from a web page?
EDIT: It seems like BeautifulSoup doesn't convert entities written in hexadecimal form. It can be fixed:
import copy, re
from BeautifulSoup import BeautifulSoup
hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
# replace hexadecimal character reference by decimal one
hexentityMassage += [(re.compile('&#x([^;]+);'),
    lambda m: '&#%d;' % int(m.group(1), 16))]

def convert(html):
    return BeautifulSoup(html,
        convertEntities=BeautifulSoup.HTML_ENTITIES,
        markupMassage=hexentityMassage).contents[0].string

html = '<html>&#x01ce;&#462;</html>'
print repr(convert(html))
# u'\u01ce\u01ce'
EDIT:
The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.
This is a function which should help you to get it right and convert entities back to utf-8 characters.
def unescape(text):
    """Removes HTML or XML character references
    and entities from a text string.

    @param text The HTML (or XML) source text.
    @return The plain text, as a Unicode string, if necessary.

    from Fredrik Lundh
    2008-01-03: input only unicode characters string.
    http://effbot.org/zone/re-sub.htm#unescape-html
    """
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                print "Value Error"
                pass
        else:
            # named entity
            # re-escape the reserved characters.
            try:
                if text[1:-1] == "amp":
                    text = "&amp;"
                elif text[1:-1] == "gt":
                    text = "&gt;"
                elif text[1:-1] == "lt":
                    text = "&lt;"
                else:
                    print text[1:-1]
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                print "keyerror"
                pass
        return text  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
Not sure why the Stack Overflow thread does not include the ';' in the search/replace (i.e. lambda m: '&#%d;'). If you don't, BeautifulSoup can barf, because the adjacent character can be interpreted as part of the HTML code (i.e. &#x27B for &#x27;Blackout).
This worked better for me:
import re
from BeautifulSoup import BeautifulSoup
html_string = '&#x27;Blackout in a can; on some shelves despite ban'
hexentityMassage = [(re.compile('&#x([^;]+);'),
    lambda m: '&#%d;' % int(m.group(1), 16))]
soup = BeautifulSoup(html_string,
    convertEntities=BeautifulSoup.HTML_ENTITIES,
    markupMassage=hexentityMassage)
The int(m.group(1), 16) converts the number (specified in base 16) back to an integer.
m.group(0) returns the entire match; m.group(1) returns the first capturing group.
Basically, using markupMassage is the same as:
html_string = re.sub('&#x([^;]+);', lambda m: '&#%d;' % int(m.group(1), 16), html_string)
Another solution is the builtin library xml.sax.saxutils (both for HTML and XML). However, by default it converts only &gt;, &amp; and &lt;.
from xml.sax.saxutils import unescape
unescaped_text = unescape(text_to_unescape)
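A short demonstration; additional entities can be passed explicitly through the entities argument:
from xml.sax.saxutils import unescape

print(unescape('&lt;b&gt; &amp; &copy;'))            # <b> & &copy; -- &copy; is left untouched
print(unescape('&copy; 2010', {'&copy;': u'\xa9'}))  # © 2010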
Here is the Python 3 version of dF's answer:
import re
import html.entities

def unescape(text):
    """
    Removes HTML or XML character references and entities from a text string.

    :param text: The HTML (or XML) source text.
    :return: The plain text, as a Unicode string, if necessary.
    """
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return chr(int(text[3:-1], 16))
                else:
                    return chr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = chr(html.entities.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
The main changes are that htmlentitydefs is now html.entities and unichr is now chr. See this Python 3 porting guide.
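And a quick check with the entity from the question:
print(unescape('&#x01ce; &#462; &copy;'))  # prints: ǎ ǎ ©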