Python HTML print a list

Here is part of my script. I have a list called content that holds a bunch of ASCII characters. How do I print all of them in rows and columns? All I get on the page is the word "content" rather than what's in the list. Everything else works, so I just need to know how to display what's in that list.
print (''' <!DOCTYPE html>
<html>
<head>
<title>Lab 5.1.cgi</title>
<style type="text/css">
body {background-color:white;font-style:none;font-weight:normal; font-family:Gill Sans, Helvetica, Arial,sans-serif;font-weight:bold;font-size:30px;color:#c00; text-shadow:0 0 2px black,0px 0px 8px white; background:url(data:image/gif;base64,%s);text-align:center} h1 {font-size:72px;font-weight:bold} h2 {font-weight:normal} div {font-family:Arial;font-size:30px;color:black; text-shadow:none;background-color:white;width:440px;margin:0 auto} .source {width:960px}
</style>
</head>
<body>
<div class="row">
<div class="twelve columns">
<div class="globalWrapper">
<div>
<strong>lab5.1.cgi</strong>
<h1 class="title">Arial Unicode MS</h1>
<h2 class="small">Printable Characters: 32–126, 128 – 4000</h2>
%s
</div>
</div>
</div>
</div>
</body>
</html>''' % (external_link, content))
This is basically what's in the list:
import html
content = []
for i in range(33, 127):
    content.append('<div class="unicode-char">')
    # html.escape replaces cgi.escape, which was removed in Python 3.8
    char = html.escape(chr(i), quote=False).encode("ascii", "xmlcharrefreplace")
    content.append(char.decode())
    content.append('<div class="clearfix"></div><sub style="font-size: 12px">{sub_number}</sub>'.format(sub_number=i))
    content.append('</div>')
UPDATE: I found a way to print the list, but it all prints in one long row. What am I doing wrong? I thought I had set it to be 12 columns.

If content is a Python list, then you can convert that to HTML with
''.join(content)
If you want every element in the list on a different line, use:
'\n'.join(content)
If you see everything in a single line in the browser, show us the CSS for unicode-char
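As a minimal sketch of the join step: the template below is a shortened, hypothetical stand-in for the page in the question (TEMPLATE is not a name from the original script), and the range is cut down to keep the output small. The point is only that the list has to become one string before it goes into the %s slot.

```python
# Hypothetical, shortened stand-in for the full page template in the question.
TEMPLATE = """<div class="globalWrapper">
%s
</div>"""

content = []
for i in range(33, 36):  # short range, just for illustration
    content.append('<div class="unicode-char">')
    content.append("&#%d;" % i)          # numeric character reference
    content.append("<sub>%d</sub>" % i)  # the code point shown underneath
    content.append("</div>")

# A list is not a string: join the fragments first, then substitute.
body = "\n".join(content)
page = TEMPLATE % body
print(page)
```

Passing the list itself to % would insert its repr (brackets, quotes and commas) instead of the markup.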

Related

Python Beautifulsoup get texts before a certain tag

I want to parse the following HTML with BeautifulSoup:
<html>
<head>
<script> ... </script>
<title> ... </title>
<style> ... </style>
</head>
<body onload="nextHit()">
S. <a name="hit1"></a><span style="background-color: #FFFF00">NO</span>. 178 H. <a name="hit2"></a><span style="background-color: #FFFF00">NO</span>. 1323 / 46 OG <a name="hit3"></a><span style="background-color: #FFFF00">No</span>. 12, 5977 (December, 1950)
<center>
<h2>...</h2>
<h3>...</h3>
</center>
<br>
....Lines omitted for brevity (more brs, divs, prs)...
</body>
The thing is I only want to get the texts in the beginning of the body tag, just before the first center tag like so:
S. NO. 178 H. NO. 1323 / 46 OG No. 12, 5977 (December, 1950)
I have tried:
ogsourcing = soup.find('center').previousSibling
But I am getting just the last part like so:
. 12, 5977 (December, 1950)
Version 2; based on OP's comment
find() the <center> element
Use previous_siblings to get an iterator with all the siblings
Loop over them, append the .text to a list
Reverse the list since we're looping from bottom to top
''.join() the list to get the desired string
from bs4 import BeautifulSoup
html = """
<html>
<head>
<script></script>
<title></title>
<style></style>
</head>
<body onload="nextHit()">
S. <a name="hit1"></a><span style="background-color: #FFFF00">NO</span>. 178 H. <a name="hit2"></a><span style="background-color: #FFFF00">NO</span>. 1323 / 46 OG <a name="hit3"></a><span style="background-color: #FFFF00">No</span>. 12, 5977 (December, 1950)
<center>
<h2>foo</h2>
<h3>bar</h3>
</center>
<br>
<em>test</em>
<div>
<em>test</em>
</div>
</body>
</html>
"""
res = []
soup = BeautifulSoup(html, 'html.parser')
for sibling in soup.find('center').previous_siblings:
    res.append(sibling.text)
res.reverse()
res = ''.join(res)
print(res)
The above print() will output:
S. NO. 178 H. NO. 1323 / 46 OG No. 12, 5977 (December, 1950)
You might want to include a .strip() to get rid of any whitespaces and/or newlines

Python add text to a HTML table file file generated with to_html() method

I have a question that is probably an easy one, especially for those of you who are HTML experts.
I have a pandas dataframe 'df' and I convert it to an HTML document using the to_html() method:
html = df.to_html()
text_file = open('example.html', "w")
text_file.write(html)
text_file.close()
The problem I face is that I would need to add a paragraph (a simple sentence) before the table.
I tried to add the following code to my script:
title = """<head>
<title>Test title</title>
</head>
"""
html = html.replace('<table border="1" class="dataframe">', title + '<table border="1" class="dataframe">')
but it doesn't seem to do anything, plus in reality what I would need to add is not a title but a string containing the paragraph information.
Does anybody have a simple suggestion that doesn't involve using beautiful soup or other libraries?
Thank you.
This code does pretty much what I needed:
html = df.to_html()
msg = "custom messages"
title = """
<html>
<head>
<style>
thead {color: green;}
tbody {color: black;}
tfoot {color: red;}
table, th, td {
border: 1px solid black;
}
</style>
</head>
<body>
<h4>
""" + msg + "</h4>"
end_html = """
</body>
</html>
"""
html = title + html + end_html
text_file = open(file_name, "w")
text_file.write(html)
text_file.close()
You should consider using dominate. You can build html elements and combine raw html. As a proof of concept:
from dominate.tags import *
from dominate.util import raw
head_title = 'test' # Replace this with whatever content you like
raw_html_content = '<table border="1" class="dataframe"></table>' # Replace this with df.to_html()
print(html(head(title(head_title)), body(raw(raw_html_content))))
This will output:
<html>
<head>
<title>test</title>
</head>
<body><table border="1" class="dataframe"></table> </body>
</html>
Alternatively, you can build the HTML with BeautifulSoup. It is a lot more powerful, but you have to write a lot more code.
from bs4 import BeautifulSoup
raw_html_content = '<table border="1" class="dataframe"></table> '
some_content = 'TODO <a href="#">click here</a>'
soup = BeautifulSoup(raw_html_content, features='html.parser') # This would contain the table
paragraph = soup.new_tag('p') # To add content wrapped in p tag under table
paragraph.append(BeautifulSoup(some_content, features='html.parser'))
soup.append(paragraph)
print(soup.prettify())
This will output:
<table border="1" class="dataframe">
</table>
<p>
TODO
<a href="#">
click here
</a>
</p>
You can use Python's built-in f-strings to add replacement fields with variables. Put the character f in front of the string and wrap each variable in braces. This makes the HTML easier to read and edit. The downside is that literal braces in the content must be doubled (see thead below).
For example:
main_content = '<table border="1" class="dataframe"></table>' # Replace this with df.to_html()
msg = "custom messages"
html = f"""
<html>
<head>
<style>
thead {{color: green;}}
tbody {{color: black;}}
tfoot {{color: red;}}
table, th, td {{
border: 1px solid black;
}}
</style>
</head>
<body>
<h4>{msg}</h4>
{main_content}
</body>
</html>
"""
print(html)
This will output:
<html>
<head>
<style>
thead {color: green;}
tbody {color: black;}
tfoot {color: red;}
table, th, td {
border: 1px solid black;
}
</style>
</head>
<body>
<h4>custom messages</h4>
<table border="1" class="dataframe"></table>
</body>
</html>
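If doubling every CSS brace gets tedious, the standard library's string.Template is one possible alternative, since it uses $name placeholders and leaves braces alone. This is only a sketch: PAGE and build_page are names made up for the example, and the table string stands in for df.to_html().

```python
import html
from string import Template

# $msg and $table are placeholders; CSS braces need no doubling here.
PAGE = Template("""<html>
<head>
<style>
thead {color: green;}
table, th, td {border: 1px solid black;}
</style>
</head>
<body>
<h4>$msg</h4>
$table
</body>
</html>
""")


def build_page(msg, table_html):
    """Return the full page; msg is escaped, table_html is inserted as-is."""
    return PAGE.substitute(msg=html.escape(msg), table=table_html)


# The table string is a stand-in for df.to_html().
page = build_page("custom messages", '<table border="1" class="dataframe"></table>')
print(page)
```

Escaping msg with html.escape keeps characters such as < or & in the sentence from being read as markup.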

how to exclude content from html page and keeping only the html tags?

I have a huge corpus of HTML pages and I want to strip all the content from it, keeping only the HTML tags (I want the tags, not the contents). For instance, if I have this HTML element:
<div class="tensorsite-content__title ">
Differentiate yourself with the TensorFlow Developer Certificate </div>
I need to extract only :
<div class="tensorsite-content__title ">
</div>
I have tried a negative lookahead (?!) regex to exclude everything that is not a tag:
tags=re.sub('.*?!<[^<]+?>', '',htmlwithcontent )
It doesn't look smart or efficient, and it doesn't even work.
Do you have any ideas? Preferably in Python.
As Ivar commented, an HTML parser is really the only way to correctly deal with this class of problem:
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.indent = -1
    def handle_starttag(self, tag, attrs):
        self.indent += 1
        print(2 * self.indent * ' ', sep='', end='')
        print(f'<{tag}', sep='', end='')
        for attr in attrs:
            print(f' {attr[0]}="{attr[1]}"', sep='', end='')
        print('>', sep='')
    def handle_endtag(self, tag):
        print(2 * self.indent * ' ', sep='', end='')
        print(f'</{tag}>')
        self.indent -= 1
parser = MyHTMLParser()
parser.feed("""<html>
<head>
<title>Test</title>
</head>
<body>
<h1>Heading!</h1>
<p style="font-weight: bold; color: red;">
Some text
<BR/>
Some more text
</p>
<ol>
<li>Item 1</li>
<li>Item 2</li>
</ol>
</body>
</html>
""")
Prints:
<html>
  <head>
    <title>
    </title>
  </head>
  <body>
    <h1>
    </h1>
    <p style="font-weight: bold; color: red;">
      <br>
      </br>
    </p>
    <ol>
      <li>
      </li>
      <li>
      </li>
    </ol>
  </body>
</html>
Update
If the HTML is in a not-too-large file, it makes sense to read the entire file into memory and pass it to the parser:
parser = MyHTMLParser()
with open('test.html') as f:
    html = f.read()
parser.feed(html)
If the input is in an extremely large file, it might make sense to "feed" the parser line by line or in chunks rather than attempting to read the entire file into memory:
Line by Line:
parser = MyHTMLParser()
with open('test.html') as f:
    for line in f:
        parser.feed(line)
Or even more efficiently:
To Read in Chunks of 32K:
CHUNK_SIZE = 32 * 1024
parser = MyHTMLParser()
with open('test.html') as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if chunk == '':
            break
        parser.feed(chunk)
You can, of course, choose even larger chunk sizes.
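The parser above prints as it goes. For a corpus it may be handier to get the tag-only markup back as a string, so here is a minimal sketch of that variation. TagOnlyParser and strip_content are names invented for this example; the sketch keeps each start tag exactly as written in the source (via get_starttag_text()) and does no indentation.

```python
from html.parser import HTMLParser


class TagOnlyParser(HTMLParser):
    """Collects start and end tags and ignores all text content."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep the start tag exactly as it appeared in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept once, with no extra end tag.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")


def strip_content(markup):
    """Return only the tags found in markup, one per line."""
    parser = TagOnlyParser()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.parts)


sample = '<div class="tensorsite-content__title ">\nDifferentiate yourself </div>'
print(strip_content(sample))
```

Because the result is a plain string, it can be written to a file or collected per page instead of going to standard output.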

Python + BeautifulSoup: How to get wrapper out of HTML based on text?

I would like to get the wrapper element of a given text. For example, in HTML:
…
<div class="target">chicken</div>
<div class="not-target">apple</div>
…
Based on the text "chicken", I would like to get back <div class="target">chicken</div>.
Currently, I fetch the HTML like this:
import requests
from bs4 import BeautifulSoup
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
At the moment I call soup.find_all('div', …) and loop through every div to find the wrapper I am looking for.
Without looping through every div, what would be the proper and most efficient way to fetch the wrapper based on a given text?
Thank you in advance and will be sure to accept/upvote answer!
# coding: utf-8
html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title> Last chicken leg on stock! Only 500$ !!! </title>
</head>
<body>
<div id="layer1" class="class1">
<div id="layer2" class="class2">
<div id="layer3" class="class3">
<div id="layer4" class="class4">
<div id="layer5" class="class5">
<p>My chicken has <span style="color:blue">ONE</span> leg :P</p>
<div id="layer6" class="class6">
<div id="layer7" class="class7">
<div id="chicken_surname" class="chicken">eat me</div>
<div id="layer8" class="class8">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>"""
from bs4 import BeautifulSoup as BS
import re
soup = BS(html_doc, "lxml")
# (tag -> text) direction is pretty obvious that way
tag = soup.find('div', class_="chicken")
tag2 = soup.find('div', {'id':"chicken_surname"})
print('\n###### by_cls:')
print(tag)
print('\n###### by_id:')
print(tag2)
# but can be tricky when need to find tag by substring
tag_by_str = soup.find(string="eat me")
tag_by_sub = soup.find(string="eat")
tag_by_resub = soup.find(string=re.compile("eat"))
print('\n###### tag_by_str:')
print(tag_by_str)
print('\n###### tag_by_sub:')
print(tag_by_sub)
print('\n###### tag_by_resub:')
print(tag_by_resub)
# there are more than one way to access underlying strings
# both are different - see results
tag = soup.find('p')
print('\n###### .text attr:')
print( tag.text, type(tag.text) )
print('\n###### .strings generator:')
for s in tag.strings: # strings is a generator object
    print(s, type(s))
# note that the .strings generator yields bs4.element.NavigableString elements
# so we can use them to navigate, for example accessing their parents:
print('\n###### NavigableString parents:')
for s in tag.strings:
    print(s.parent)
# or even grandparents :)
print('\n###### grandparents:')
for s in tag.strings:
    print(s.parent.parent)
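Applied to the markup from the question, the same idea can be sketched as follows. It assumes the bs4 package is installed and uses the built-in html.parser so that lxml is not required; html_snippet and wrapper are names chosen for this example only.

```python
import re

from bs4 import BeautifulSoup

# The two divs from the question, with straight quotes.
html_snippet = '<div class="target">chicken</div><div class="not-target">apple</div>'
soup = BeautifulSoup(html_snippet, "html.parser")

# find(string=...) returns the matching text node, not the tag around it.
text_node = soup.find(string=re.compile("chicken"))

# .parent is the element that wraps the text node (None if nothing matched).
wrapper = text_node.parent if text_node is not None else None
print(wrapper)
```

The regular expression makes this a substring match. Passing a plain string such as string="chicken" matches only when the text node is exactly that string.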

element is not clicked [selenium]

I am using Selenium2Library (Python) for our automation. This is the method I use:
def get_appointment_from_manage(self, date, appt_id):
    ref_date = "//*[@data-date=\"%s\"]" % date
    time.sleep(2)
    logging.info(date)
    logging.info(appt_id)
    while not self.is_element_present_by_xpath(ref_date):
        self._current_browser().find_element_by_xpath("//*[@id=\"calendar1\"]/div[1]/div[3]/div/button[2]").click()
        time.sleep(2)
    element = self._current_browser().find_element_by_xpath("//*[@data-aid=\"%s\"]" % appt_id)
    logging.info(element)
    ActionChains(self._current_browser()).move_to_element(element).click().perform()
The log shows that the element was found, but it isn't clicked.
This is the part that isn't clicking:
element = self._current_browser().find_element_by_xpath("//*[@data-aid=\"%s\"]" % appt_id)
logging.info(element)
ActionChains(self._current_browser()).move_to_element(element).click().perform()
When I inspect the element, the whole element is highlighted in blue, so I don't know what I am missing. Firefox version is 28. Thanks in advance!
EDIT
This is the html
<div class="fc-event-container">
<div class="fc-event-box" style="position:relative;z-index:1"></div>
<div data-aid="31" class="fc-event-data-container fc-status-2" style="position:absolute;top:0px;right:0;bottom:-62px;left:0;z-index:1">
<div class="fc-event-data-box">
<a class="fc-time-grid-event fc-event fc-start fc-end evnt-1419408000000" style="top: 0px; bottom: -62px; z-index: 1; left: 0%; right: 0%;">
<div class="fc-content">
<div class="fc-time" data-start="8:00" data-full="8:00 AM - 8:30 AM" style="display:none;">
<span>8:00 - 8:30</span>
</div>
<div class="fc-title">Robot-FN</div>
<span class="fc-product">Home Loans</span>
</div>
<div class="fc-bg"></div>
</a>
</div>
</div>
</div>
I'm not sure this is what you are after, but if you want to click on the <a> tag (which is clickable), you need to target that element, not the <div> that contains it.
Try something like this (I haven't tested this XPath, so take it as a general idea):
element = self._current_browser().find_element_by_xpath("//*[@data-aid=\"%s\"]//a" % appt_id)
