BeautifulSoup4 findChildren() is empty


I'm trying to use the findChildren() function. I basically want all the <p> tags under a particular <h3> tag. I'm trying a simple piece of code, but the set of children I'm getting back is empty. h3 returns the correct line (see the print(h3) comment) and print(type(children)) prints <class 'bs4.element.ResultSet'>. Please tell me what I'm doing wrong.
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(contents, 'html.parser')
h3 = soup.find('h3', text=re.compile('chapter', re.IGNORECASE))
print(h3)  # prints <h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
children = h3.findChildren('p')
print(type(children))  # prints <class 'bs4.element.ResultSet'>
I also tried h3.findChildren('p', Recursive=True) and children = h3.findChildren(Recursive=True), both of which also come back empty.
Here's the section of HTML I'm trying to grab:
<h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
<p dir="ltr" style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;">
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">Stephanus Grayland did not try to hide his smile of satisfaction . He had “eaten” lunch, but now, he sensed, he would truly </span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">feast</span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span>
</p>
<p></p>

Thanks to those who responded. My problem is that <h3> and the sub <p>s are siblings not parent/child. I think these posts are what I'm after code-wise but my comment above remains. http://stackoverflow.com/questions/51571609/… and http://stackoverflow.com/questions/51852588/

In the sample you provided, the h3 node has no children. All of the p nodes are outside of that scope.
If you wrap your contents in a div (say), you can see that you're using the right technique:
>>> soup = BeautifulSoup('<div>' + contents + '</div>', 'html.parser')
>>> div = soup.find('div')
>>> div.findChildren('p')
[<p dir="ltr" style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">Stephanus Grayland did not try to hide his smile of satisfaction . He had “eaten” lunch, but now, he sensed, he would truly </span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">feast</span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span></p>, <p> </p>]
>>>
Edit
As you mention in your comments above, the h3 and p nodes are siblings in the content you've supplied. I'm not sure it makes sense to have p elements that are children of h3, but if you did it would look like
<h3>
This content is within the h3 tag
<p>this is a child of h3</p>
<p>another child</p>
</h3>
<p>this is not a child of h3 as it is after the h3 close tag</p>
It's not really clear what the conditions for selecting p nodes in your example content should be. A simple soup.find_all('p') would return all of those tags, but I suspect you need to limit it in some way to prevent other content from being included. Can you elaborate? You possibly just want something like:
>>> soup = BeautifulSoup(contents, 'html.parser')
>>> h3 = soup.find('h3')
>>> h3.find_next_sibling('p')
<p dir="ltr" style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;">
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">Stephanus Grayland did not try to hide his smile of satisfaction . He had “eaten” lunch, but now, he sensed, he would truly </span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">feast</span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span>
</p>
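If the goal is every <p> that follows the <h3> up to some stopping point, one possible approach is to walk the following siblings and stop at the first tag that marks the end of the content. The snippet below is only a minimal sketch: it assumes bs4 is installed, the inline HTML is a made-up sample rather than the real page, and using <script> as the stopping tag is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the h3 and the p tags are siblings inside one parent.
html = """
<article>
<h3>CHAPTER ONE</h3>
<p>first</p>
<span></span>
<p>second</p>
<script>var x = 1;</script>
<p>after script</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

paragraphs = []
# find_next_siblings() with no filter yields the tags that follow h3
# at the same nesting level, in document order.
for sibling in h3.find_next_siblings():
    if sibling.name == "script":
        break  # assumed end marker: stop at the first <script>
    if sibling.name == "p":
        paragraphs.append(sibling.get_text(strip=True))

print(paragraphs)  # ['first', 'second']
```

Swapping the stopping condition (for example, the next h3 instead of script) only requires changing the tag name in the break test.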

Thank you for your patience. I had to figure out how to get the HTML structure, prettify the HTML, and write it to a file to see the relationships better. The pages I need to process (I didn't write them) have the structure shown below. After building the bs4 structure, I figured out that my desired content starts at the <article..> tag and ends at the beginning of the next <script...> code here </script> <h3>Comments</h3>. I'm not sure how to terminate a search between two different tags. I was able to grab EVERYTHING between an <h3> tag and the next <h3> tag, but that pulls in the <script> section, which I don't want. Thanks again for the continuing help! -Meghan
....
<div id="rt-main" class="sa3-mb9">
<div class="rt-container">
<div class="rt-grid-9 rt-push-3">
<div class="rt-block">
<div id="rt-mainbody">
<div class="component-content">
<article class="item-pageDarkening">
<h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
<p> </p>
<p style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;" dir="ltr"><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">text.. </span></p>
<p> </p>
<p style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;" dir="ltr"><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">text here</span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"></span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span></p>
<p> </p>
<p>dljlg</p>
<span></span>
<p>dljlg</p>
<span></span>
<p style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;" dir="ltr"><em><span style="font-size: 16px; font-family: 'arial black', 'avant garde'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;"> </span></em></p>
<script type='text/javascript'>
Komento.ready(function($) {
// declare master namespace variable for shared values
Komento.component = "com_content";
Komento.cid = "1211";
Komento.contentLink = "...";
Komento.sort = "latest";
Komento.loadedCount = parseInt(10);
Komento.totalCount = parseInt(56);
if( Komento.options.konfig.enable_shorten_link == 0 ) {
Komento.shortenLink = Komento.contentLink;
}
});
</script>
<div id="section-kmt" class="theme-kuro">
<script type="text/javascript">
Komento.require()
.library('dialog')
.script(
'komento.language',
'komento.common',
'komento.commentform'
)
.done(function($) {
if($('.commentForm').exists()) {
Komento.options.element.form = new Komento.Controller.CommentForm($('.commentForm'));
Komento.options.element.form.kmt = Komento.options.element;
}
});
</script>
<div id="kmt-form" class="commentForm kmt-form clearfix">
<a class="addCommentButton kmt-form-addbutton" href="javascript:void(0);"><b>Add comment</b></a>
<div class="formArea kmt-form-area hidden">
<h3 class="kmt-title">Leave your comments</h3>

Related

How to click on element using find_elements_by_class_name() using Selenium and Python

I have the following HTML on a website I'm trying to scrape:
<div class="test-section-container">
<div>
<span class="test-section-title">Section Title</span>
<div style="display: inline-block; padding: 0.05rem;"></div>
</div>
<div style="cursor: pointer; background-color: rgb(248, 248, 248); display: flex; line-height: 1.2; margin-bottom: 0.07rem;">
<div style="width: 0.5rem; flex-shrink: 0; background-color: rgb(245, 222, 136);"></div>
<div style="padding: 0.07rem; overflow: hidden;">
<div style="font-size: 0.18rem; text-overflow: ellipsis; overflow: hidden; white-space: nowrap;">Newsletter 1</div>
<div style="font-size: 0.13rem; color: rgb(102, 102, 102);">2021 11 8</div>
</div>
</div>
<div style="cursor: pointer; background-color: rgb(248, 248, 248); display: flex; line-height: 1.2; margin-bottom: 0.07rem;">
<div style="width: 0.5rem; flex-shrink: 0; background-color: rgb(221, 221, 221);"></div>
<div style="padding: 0.07rem; overflow: hidden;">
<div style="font-size: 0.18rem; text-overflow: ellipsis; overflow: hidden; white-space: nowrap;">Newsletter 2 </div>
<div style="font-size: 0.13rem; color: rgb(102, 102, 102);">2021 11 3</div>
</div>
</div>
This is the selenium/python code that I'm using:
driver.get("http://www.testwesbite.org/#/newsarticles")
results = driver.find_elements_by_class_name('test-section-container')
texts = []
for result in results:
    text = result.text
    texts.append(text)
    print(text)
This gives me an output of:
Newsletter 1
2021 11 8
Newsletter 2
2021 11 3
If I use the following code:
first_result = results[0]
first_result.click()
It does click into the first article, but results[1] gives me an out-of-bounds error.
How would I go about clicking on the second article?

Because you used driver.find_elements_by_class_name('test-section-container'), all of the following text:
Newsletter 1
2021 11 8
Newsletter 2
2021 11 3
is within the results[0] element, and results[1] doesn't exist. Hence you get the out-of-bounds error.
Solution
To get one element per article, so that results[0] and results[1] can each be clicked, you can use:
driver.get("http://www.testwesbite.org/#/newsarticles")
results = driver.find_elements(By.CSS_SELECTOR, "div.test-section-container div[style*='nowrap']")
texts = []
for result in results:
    text = result.text
    texts.append(text)
    print(text)
Now you can click the individual items as:
first_result = results[0]
first_result.click()
and
second_result = results[1]
second_result.click()
Note: You have to add the following import:
from selenium.webdriver.common.by import By
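To see why the container class yields a single element while the narrower selector yields one element per article, the same CSS selectors can be checked offline with BeautifulSoup's select(). This is only a rough sketch under assumptions: bs4 is installed, and the HTML is a trimmed-down copy of the markup in the question (most inline styles removed), not the live page.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the markup from the question.
html = """
<div class="test-section-container">
  <div><span class="test-section-title">Section Title</span></div>
  <div style="cursor: pointer;">
    <div style="padding: 0.07rem;">
      <div style="font-size: 0.18rem; white-space: nowrap;">Newsletter 1</div>
      <div style="font-size: 0.13rem;">2021 11 8</div>
    </div>
  </div>
  <div style="cursor: pointer;">
    <div style="padding: 0.07rem;">
      <div style="font-size: 0.18rem; white-space: nowrap;">Newsletter 2</div>
      <div style="font-size: 0.13rem;">2021 11 3</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The class selector matches only the single outer container.
containers = soup.select("div.test-section-container")

# The narrower selector matches one title div per article.
titles = soup.select("div.test-section-container div[style*='nowrap']")
title_texts = [t.get_text(strip=True) for t in titles]

print(len(containers))  # 1
print(title_texts)      # ['Newsletter 1', 'Newsletter 2']
```

With one match per article, indexing the second result is no longer out of range.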

python requests captcha form

I'm filling in a form using requests and Python, but I'm blocked by the reCAPTCHA.
I need to send a g-recaptcha-response but I don't know how to get it.
Here is the website code:
<div class="g-recaptcha" data-callback="checkoutAfterCaptcha" data-sitekey="6LeWwRkUAAAAAOBsau7xxxx-xxxxxxxxxx" data-size="invisible">
<div class="grecaptcha-badge" data-style="bottomright" style="width: 256px; height: 60px; transition: right 0.3s ease; position: fixed; bottom: 14px; right: -186px; box-shadow: gray 0px 0px 5px;">
<div class="grecaptcha-logo">
<iframe src="https://www.google.com/recaptcha/api2/anchor?ar=1&k=6LeWwRkUAAAAAOBsau7KpuC9AV-6J8mhw4AjC3Xz&co=aHR0cHM6Ly93d3cuc3VwcmVtZW5ld3lvcmsuY29tOjQ0Mw..&hl=fr&v=v1525674693836&size=invisible&cb=g8s5582r6zik" width="256" height="60"
role="presentation" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox" kwframeid="3">
</iframe>
</div>
<div class="grecaptcha-error">
</div>
<textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="width: 250px; height: 40px; border: 1px solid #c1c1c1; margin: 10px 25px; padding: 0px; resize: none; display: none; ">
</textarea>
</div>
</div>
Here is my code. I manage to get the data-sitekey, but I don't understand how to get the g-recaptcha-response:
page = c.get(link_checkout)
soup = BeautifulSoup(page.text, 'html.parser')
find_class = soup.find(class_='g-recaptcha')
get_captcha_token = find_class.get('data-sitekey')
print(get_captcha_token)
# try:
#     content = requests.post(
#         'https://www.google.com/recaptcha/api/siteverify',
#         data={
#             'secret': RECAPTCHA_SECRET,
#             'response': get_captcha_token,
#             'remoteip': ip
#         }
#     ).content
# except:
#     print("fail")
# print(get_captcha_token)
c.post(url, data=payload_FORM, headers={"Referer": link_checkout})
page = c.get(link_checkout)
Thank you all for your help! This is the last problem I need to solve to finish my program, and there is very little about it on Google.
If you need more info, tell me in the comments and I will add it.

python-pdfkit (wkhtmltopdf) TOC overflow

I am currently creating a perfectly good PDF; there is nothing technically wrong with it. However, the TOC is ugly.
The TOC is generated via XSL, which is passed through jinja2 to fill in simple details in the top section of the page. I have modified the XSL to match the client's branding and design precisely. However, the list keeps growing in height.
Here is the current result (sorry for the blurred text). You can see the TOC picks up at the right spot on the new page, but there seems to be no way to apply a top margin to the new page:
The code:
Here is the xsl:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:outline="http://wkhtmltopdf.org/outline"
xmlns="http://www.w3.org/1999/xhtml">
<xsl:output doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
indent="yes" />
<xsl:template match="outline:outline">
<html>
<head>
<title>Table of Contents</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style>
body{
background-color: #fff;
margin-left: 0px;
margin-top: 0px;
color:#1e1e1e;
font-family: arial, verdana,sans-serif;
font-size: 90px;
}
.contentSection{
position:relative;
height:3200px;
width:6100px;
}
.profile{
position:absolute;
display:inline-block;
top:200px !important;
}
h1 {
text-align: left;
font-size: 70px;
font-family: arial;
color: #ef882d;
}
li {
border-bottom: 1px dashed rgb(45,117,183);
}
span {float: right;}
li {
list-style: none;
margin-top:30px;
}
ul {
font-size: 70px;
font-family: arial;
color:#2d75b7;
}
ul ul {font-size: 80%; padding-top:0px;}
ul {padding-left: 0em; padding-top:0px;}
ul ul {padding-left: 1em; padding-top:0px;}
a {text-decoration:none; color: color:#2d75b7;}
#topper{
width:100%;
border-bottom:8px solid #ef882d;
}
#title{
position:absolute;
top:60px;
font-size:60px;
left:150px;
color:#666666;
}
h1, h2{
font-size:60px;
-webkit-margin-before: 0px;
-webkit-margin-after: 0px;
-webkit-margin-start: 0px;
-webkit-margin-end: 0px;
}
#profile{
position:static;
-webkit-border-top-left-radius: 40px;
-webkit-border-bottom-left-radius: 40px;
-moz-border-radius-topleft: 40px;
-moz-border-radius-bottomleft: 40px;
border-top-left-radius: 40px;
border-bottom-left-radius: 40px;
right:-540px;
background-color: #2d75b7;
padding:4px;
padding-left:60px;
padding-right:250px;
color:#fff;
display:inline-block;
margin-top:200px;
float:right;
}
#room{
padding-top: 200px;
padding-left: 150px;
display:inline-block;
}
#section{
padding-left: 150px;
color: #ef882d;
text-transform: uppercase;
font-size:60px;
font-weight: bold;
display:inline-block;
margin-top: 30px;
margin-bottom: 5px;
}
#area{
padding-left: 150px;
font-size:60px;
color:#2d75b7;
margin-top: 15px;
}
#dims{
padding-left: 150px;
font-size:60px;
color:#2d75b7;
margin-top: 15px;
}
#toc{
width:50%;
margin-top:150px;
margin-left:300px;
}
</style>
<script>
var value = {{profile|e}};
</script>
</head>
<body>
<div class="contentSection">
<div id="title">A title here</div>
<div id="topper">
<div id="profile" class="profile">{{profile|e}}</div>
<div id="room"> {{profile|e}} </div>
<div id="area"> Revision Date </div>
<div id="dims"> {{area|e}} </div>
<div id="section">Table of Contents</div>
</div>
<div id="toc">
<ul><xsl:apply-templates select="outline:item/outline:item"/></ul>
</div>
</div>
</body>
</html>
</xsl:template>
<xsl:template match="outline:item">
<! begin LI>
<li>
<xsl:if test="@title!=''">
<div>
<a>
<xsl:if test="@link">
<xsl:attribute name="href"><xsl:value-of select="@link"/> .
</xsl:attribute>
</xsl:if>
<xsl:if test="@backLink">
<xsl:attribute name="name"><xsl:value-of select="@backLink"/> . </xsl:attribute>
</xsl:if>
<xsl:value-of select="@title" />
</a>
<span>
<xsl:value-of select="@page" />
</span>
</div>
</xsl:if>
<ul>
<xsl:comment>added to prevent self-closing tags in QtXmlPatterns</xsl:comment>
<xsl:apply-templates select="outline:item"/>
</ul>
</li>
</xsl:template>
</xsl:stylesheet>
I have dealt with content overflows in other areas of the PDF using traditional HTML, JavaScript, and a document ready flag. The TOC however requires an XSL file instead.
I tried to do this with the nth-child CSS selector, but nth-child is ignored.
The question:
Is there a way within wkhtmltopdf or python-pdfkit to deal with page breaks in the TOC specifically, and to apply a better top margin on the new page? Is there a way to supply the TOC as a traditional HTML page so that I can do this with JavaScript instead?

Code review
I did a quick code review of your XSL (and CSS) file.
Even if it doesn't solve your problem, it helps with reproducing and understanding it.
Here are my comments:
Your XSL has a typo: <! begin LI> is not a valid XML tag. Is it a comment?
I prefer using the concat() XPath function to append characters directly, because if you re-indent your code you may introduce extra whitespace.
So, I replaced:
<xsl:attribute name="href"><xsl:value-of select="@link"/> . </xsl:attribute>
By:
<xsl:attribute name="href">
<xsl:value-of select="concat(@link, ' . ')"/>
</xsl:attribute>
I added an xsl:if to avoid generating an empty <ul> when it is not needed:
<xsl:if test="count(outline:item)">
<ul>
<xsl:comment>added to prevent self-closing tags in QtXmlPatterns</xsl:comment>
<xsl:apply-templates select="outline:item"/>
</ul>
</xsl:if>
I also fixed duplicate or malformed CSS entries. I replaced:
li {
border-bottom: 1px dashed rgb(45, 117, 183);
}
span {
float: right;
}
li {
list-style: none;
margin-top: 30px;
}
ul ul {font-size: 80%; padding-top:0px;}
ul {padding-left: 0em; padding-top:0px;}
ul ul {padding-left: 1em; padding-top:0px;}
a {text-decoration:none; color: color:#2d75b7;}
by:
span {
float: right;
}
li {
list-style: none;
margin-top: 30px;
border-bottom: 1px dashed rgb(45, 117, 183);
}
ul {
font-size: 70px;
font-family: arial;
color: #2d75b7;
}
ul ul {
font-size: 80%;
padding-left: 1em;
padding-top: 0px;
}
a {
text-decoration: none;
color: #2d75b7;
}
If you target XHTML, the <style> tag has a mandatory type attribute. Same remark for the <script> tag.
<style type="text/css">...</style>
<script type="text/javascript">...</script>
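The two XSL changes above (concat() for the href and the xsl:if guard around the nested <ul>) can be checked in isolation with lxml. The following is a reduced sketch, not the full stylesheet from the question: the stripped-down XSL, the tiny outline document, and the link values are all made up for illustration, and it assumes lxml is installed.

```python
from lxml import etree

# Reduced, hypothetical stylesheet keeping only the two reviewed constructs.
XSL = b"""<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:outline="http://wkhtmltopdf.org/outline">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="outline:outline">
    <ul><xsl:apply-templates select="outline:item/outline:item"/></ul>
  </xsl:template>
  <xsl:template match="outline:item">
    <li>
      <a>
        <xsl:attribute name="href">
          <xsl:value-of select="concat(@link, ' . ')"/>
        </xsl:attribute>
        <xsl:value-of select="@title"/>
      </a>
      <xsl:if test="count(outline:item)">
        <ul><xsl:apply-templates select="outline:item"/></ul>
      </xsl:if>
    </li>
  </xsl:template>
</xsl:stylesheet>"""

# Made-up outline: "One" has a child item, "Two" has none.
XML = b"""<outline xmlns="http://wkhtmltopdf.org/outline">
  <item>
    <item title="One" link="#one">
      <item title="One.A" link="#one-a"/>
    </item>
    <item title="Two" link="#two"/>
  </item>
</outline>"""

transform = etree.XSLT(etree.XML(XSL))
result = transform(etree.XML(XML))

# Only the top-level list and the list under "One" should exist.
ul_count = len(result.xpath("//ul"))
# concat() keeps the href free of indentation whitespace.
hrefs = [str(h) for h in result.xpath("//a/@href")]

print(ul_count)
print(hrefs)
```

Because the href text comes from a single concat() call, re-indenting the stylesheet does not change the generated attribute value.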
Reproducing the problem
It was a little hard to reproduce your bug because of a lack of information, so I had to guess.
First, I created a sample TOC file, which looks like this:
outline.xml
<?xml version="1.0" encoding="UTF-8"?>
<outline xmlns="http://wkhtmltopdf.org/outline">
<item>
<item title="Lorem ipsum dolor sit amet, consectetur adipiscing elit." page="2"/>
<item title="Cras at odio ultrices, elementum leo at, facilisis nibh." page="8"/>
<item title="Vestibulum sed libero bibendum, varius massa vitae, dictum arcu." page="19"/>
...
<item title="Sed semper augue quis enim varius viverra." page="467"/>
</item>
</outline>
This file contains 70 items so that I can see the page breaks.
To build the HTML and PDF, I used your (fixed) XSL file and ran pdfkit:
import io
import os
import pdfkit
from lxml import etree
HERE = os.path.dirname(__file__)
def layout(src_path, dst_path):
    # load the XSL
    xsl_path = os.path.join(HERE, "layout.xsl")
    xsl_tree = etree.parse(xsl_path)
    # load the XML source
    src_tree = etree.parse(src_path)
    # transform (an lxml XSLT object is called directly on the source tree)
    transformer = etree.XSLT(xsl_tree)
    dst_tree = transformer(src_tree)
    # write the result
    with io.open(dst_path, mode="wb") as f:
        f.write(etree.tostring(dst_tree, encoding="utf-8", method="html"))
if __name__ == '__main__':
    layout(os.path.join(HERE, "outline.xml"), os.path.join(HERE, "outline.html"))
    pdfkit.from_file(os.path.join(HERE, "outline.html"),
                     os.path.join(HERE, "outline.pdf"),
                     options={'page-size': 'A1', 'orientation': 'landscape'})
Note: your page size looks huge…
Solution
You are right, wkhtmltopdf doesn't take into account the margin in your CSS:
li {
list-style: none;
border-bottom: 1px dashed rgb(45, 117, 183);
margin-top: 30px; /* <-- not working after page break */
}
This is normal behavior. Consider, for instance, headings (h1, h2, etc.).
A heading can have a top margin in order to add white space between a paragraph and the following heading,
but if the heading starts a new page, we want to get rid of the margin and have the heading touch the top margin of the page.
For your TOC, there is a solution. You can use padding (instead of margin):
li {
border-bottom: 1px dashed rgb(45, 117, 183);
list-style: none;
padding-top: 30px;
}
Actually, the TOC content (#toc element) is fixed:
#toc {
width: 50%;
margin-top: 150px;
margin-left: 300px;
}
So, you can reduce the margin-top to match your needs, for instance:
#toc {
width: 50%;
margin-top: 120px;
margin-left: 300px;
}

Not being able to print a value corresponding to a particular "div" element using Beautiful Soup

I want to print both CVE IDs, "CVE-2013-2566" and "CVE-2015-2808", under References, and "tcp 23", which corresponds to "Unencrypted Telnet Server", using Beautiful Soup. I couldn't think of the logic for that.
<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>
<div xmlns="" style="margin: 0 0 45px 0;">
<div class="details-header">Risk Factor<div class="clear"></div>
</div>
<div style="line-height: 20px; padding: 0 0 20px 0;">Medium<div class="clear"></div>
<div class="details-header">Plugin Information: <div class="clear"></div>
</div>
<div style="line-height: 20px; padding: 0 0 20px 0;">Published: 2009/10/27, Modified: 2015/10/21<div class="clear"></div>
</div>
<div class="details-header">**References**<div class="clear"></div>
</div>
<div id="idm8894160" style="display: block;" class="table-wrapper see-also">
<table cellpadding="0" cellspacing="0">
<thead><tr>
<th width="15%"></th>
<th width="85%"></th>
</tr></thead>
<tbody>
<tr class="">
<td class="#ffffff">CVE</td>
<td class="#ffffff">CVE-2013-2566</td>
</tr>
<tr class="">
<td class="#ffffff">CVE</td>
<td class="#ffffff">CVE-2015-2808</td>
</tr>
</tbody>
<div class="details-header">Plugin Output<div class="clear"></div>
</div>
<h2>tcp/23</h2>
This is what I have written; I am stuck where I have put the comments.
I am very much a beginner with bs4, so please bear with me. I have to submit a report tomorrow, so please help.
from bs4 import BeautifulSoup
import csv
import urllib.request as urllib2
with open(r"C:\Users\sourabhk076\Documents\CHIDRMUM_DR8016CHI1_CTSINWDB01_9xtqpj.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')
f = csv.writer(open("Report.csv", "w"))
f.writerow(["Observation", "Port", "CVE-ID"])
medium = soup.find_all('div', attrs={'style':'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'})
####this will search for text "Unencrypted telnet server"####
for x in medium:
    port = x.find('h2')
    cve = x.find('div', class_='table-wrapper see-also').findAll('tr')
    ######## don't know what to do next #############
    obsv = x.text
    portd = port.text
    print([obsv,portd,cve])

Code:
from bs4 import BeautifulSoup
with open('/path/to/some.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
service = soup.find('div', style='box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;').get_text(strip=True)
cve_ids = [cve_elem.text for cve_elem in soup.select('table > tbody > tr > td > a')]
protocol, port = soup.select_one('table > h2').text.split('/')
print('{}, {}/{}, CVE-IDs: {}'.format(service, protocol, port, cve_ids))
Output:
42263 - Unencrypted Telnet Server, tcp/23, CVE-IDs: ['CVE-2013-2566', 'CVE-2015-2808']
Notice the usage of select(), which works with CSS selectors. I also used >, which is the child combinator.
The child combinator (>) is placed between two CSS selectors. It
matches only those elements matched by the second selector that are
the children of elements matched by the first.

You can search your tags for child tags. So maybe something like:
tbody = cve.find("tbody")
for row in tbody.find_all("tr"):
    print(row.find_all("td")[1].text)
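The row-based idea above can be sketched end to end on a small inline sample. This is a hedged illustration, not a drop-in replacement for the report script: it assumes bs4 is installed, that the CVE IDs are plain text in the second cell of each row (as in the HTML shown in the question), and the sample markup is a shortened copy of that HTML.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the References table and port heading.
html = """
<div class="table-wrapper see-also">
<table><tbody>
<tr class=""><td class="#ffffff">CVE</td><td class="#ffffff">CVE-2013-2566</td></tr>
<tr class=""><td class="#ffffff">CVE</td><td class="#ffffff">CVE-2015-2808</td></tr>
</tbody></table>
</div>
<h2>tcp/23</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any one of the classes on a multi-class element.
wrapper = soup.find("div", class_="see-also")

# Take the second <td> of every row as the CVE ID.
cve_ids = [
    row.find_all("td")[1].get_text(strip=True)
    for row in wrapper.find_all("tr")
]

# The port heading has the form "protocol/port".
protocol, port = soup.find("h2").get_text(strip=True).split("/")

print(cve_ids)         # ['CVE-2013-2566', 'CVE-2015-2808']
print(protocol, port)  # tcp 23
```

The resulting values could then be passed to csv.writer's writerow() in the same column order as the header row.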

How to handle this alert or frame using python selenium?

https://niioa.immigration.gov.tw/NIA_OnlineApply_inter/visafreeApply/visafreeApplyForm.action
Something pops up after I select the first item, and I cannot handle the popup. I do not know what it is; it's not an alert, and I can't find a frame to switch to.
It's a Chinese website,
so I have pasted the elements that are loaded after I select the first item:
<div class="blockUI" style="display:none"></div>
<div class="blockUI blockOverlay" style="z-index: 1000; border: none; margin: 0px; padding: 0px; width: 100%; height: 100%; top: 0px; left: 0px; background-color: rgb(0, 0, 0); opacity: 0.6; cursor: wait; position: fixed;"></div>
<div class="blockUI blockMsg blockPage" style="z-index: 1011; position: fixed; padding: 0px; margin: 0px; width: 450px; top: 539.5px; left: 119.5px; text-align: center; color: rgb(0, 0, 0); border: 3px solid rgb(170, 170, 170); background-color: rgb(255, 255, 255); height: 140px; overflow: hidden;"><div id="showWarnMessage1" style="">
<table class="application" style="margin: 10px;">
<tbody><tr>
<td>
<p class="Prompt" style="text-align: center">注意</p>
<p>除香港居民持有BNO護照及澳門居民持有1999年前取得之葡萄牙護照外，持有外國護照，不適合辦理本許可。</p>
</td>
</tr>
</tbody></table>
<div>
<input class="btn" value="確認" type="button" onclick="$.unblockUI();">
</div>
</div></div>

This worked for me to get past the pop-up:
import os
from selenium import webdriver
chromedriver = "your_path"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
driver.implicitly_wait(15)
driver.get('https://niioa.immigration.gov.tw/NIA_OnlineApply_inter/visafreeApply/visafreeApplyForm.action')
driver.find_element_by_xpath('//*[@id="isHKMOVisaN"]').click()
And then this last line is what gets rid of the pop-up:
driver.find_element_by_xpath('//*[@id="showWarnMessage1"]/div/input').click()
