Beautiful Soup Table, stop getting info - python

Hey everyone, I have some HTML that I am parsing. Here it is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<table class="dayinner">
<tr class="lun">
<td class="mealname" colspan="3">LUNCH</td>
</tr>
<tr class="lun">
<td class="station"> Deli</td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000010000047598_35356" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000047598_35356');" onmouseout="pcls(this);"
onmouseover="ws(this);">Made to Order Deli Core</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td colspan="3" style="height:3px;"></td>
</tr>
<tr class="lun">
<td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
</tr>
<tr class="lun">
<td class="station"> Dessert</td>
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
</tr>
</table>
</body>
</html>
Here is the code I have. I want just the items under the Deli section, and normally I won't know how many there are. Is there a way to do this?
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("upperMenu.html"), "html.parser")
title = soup.find('td', class_='station').text.strip()
spans = soup.find_all('span', class_='ul')[:2]
This only works if there are exactly two items, though. How can I make it work when the number of items is unknown?
Thanks in advance.

You can use the text argument of find_all (named string in newer Beautiful Soup releases) to (1) find every station cell whose text contains the substring Deli, and then (2) loop over those cells, take the row each one belongs to, and find the spans in that row whose class is ul.
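import re
from bs4 import BeautifulSoup
# text holds the HTML shown in the question
soup = BeautifulSoup(text, 'html.parser')
# string= is the current name of the older text= argument
tds_deli = soup.find_all('td', class_='station', string=re.compile('Deli'))
for td in tds_deli:
    # the row that contains this station cell
    tr = td.find_parent('tr')
    for span in tr.find_all('span', class_='ul'):
        # do something with each item
        print(span.text)
    print('------------one row -------------')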
Sample Output in this case:
Made to Order Deli Core
------------one row -------------
Not sure if I am understanding the problem correctly but I think my code might help you get started.
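If the goal is every item in the Deli section, including the rows whose station cell is blank, one option is to walk the rows in order and remember the last non-empty station name. The following is only a sketch under that assumption: the function name items_by_station is made up for illustration, and the sample is a trimmed copy of the markup from the question.

```python
from bs4 import BeautifulSoup


def items_by_station(html):
    """Group menu items by the most recent non-empty station name."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None  # name of the station we are currently inside
    for tr in soup.find_all("tr", class_="lun"):
        station = tr.find("td", class_="station")
        if station is not None:
            name = station.get_text(strip=True)
            if name:  # a blank station cell continues the previous section
                current = name
        if current is None:
            continue  # rows before the first station (e.g. the meal name)
        for span in tr.find_all("span", class_="ul"):
            sections.setdefault(current, []).append(span.get_text(strip=True))
    return sections


# Trimmed copy of the question's table (attributes on the spans removed).
sample = """
<table class="dayinner">
<tr class="lun"><td class="mealname" colspan="3">LUNCH</td></tr>
<tr class="lun"><td class="station"> Deli</td>
<td class="menuitem"><span class="ul">Made to Order Deli Core</span></td></tr>
<tr class="lun"><td class="station"> </td>
<td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td></tr>
<tr class="lun"><td colspan="3" style="height:3px;"></td></tr>
<tr class="lun"><td class="station"> Dessert</td>
<td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td></tr>
</table>
"""

print(items_by_station(sample)["Deli"])
```

This relies on the page listing a station name once and leaving the cell blank for the following rows of that station, which is how the sample looks but may not hold for every menu page.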

Related

HTML table to database

At this point, my table looks as follows:
<table border="0" cellpadding="0" cellspacing="0" class="ms-formtable" id="formTbl" style="margin-top: 8px;" width="100%">
<tbody>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_FileLeafRef">
</a>
Name
</h3>
</td>
<td class="ms-formbody" id="SPFieldFile" valign="top" width="450px">
<a href="http://google.com" onclick="DispDocItemEx(this, 'FALSE', 'FALSE', 'FALSE', '');">
X
</a>
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_Owner">
</a>
Name#
</h3>
</td>
<td class="ms-formbody" id="SPFieldChoice" valign="top" width="450px">
Z
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_DirectiveRank">
</a>
Age
</h3>
</td>
<td class="ms-formbody" id="SPFieldChoice" valign="top" width="450px">
52
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_Number">
</a>
number
</h3>
</td>
<td class="ms-formbody" id="SPFieldText" valign="top" width="450px">
1
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_Title">
</a>
Name of File
</h3>
</td>
<td class="ms-formbody" id="SPFieldText" valign="top" width="450px">
Funny Names
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_EffectiveFrom">
</a>
date
</h3>
</td>
<td class="ms-formbody" id="SPFieldDateTime" valign="top" width="450px">
1.1.2022
</td>
</tr>
</tbody>
</table>
I basically need to open an HTML file, select the table with id "formTbl", and then either create JSON of the form {first td: second td}, e.g. {"Name": "Test", "Date": "Blank"}, or insert the data into a database. Each tr contains two td elements: the first is the column name and the second is the value, so the first td would go into table A and the second td into table B. Is there any way to do this? I've tried Python, where the JSON I get so far looks like [["","Name","","Test",""],["","Age","","12",""]], and in C# I tried HtmlAgilityPack, but it wasn't working.
Here is a solution with jQuery.
<html>
<body>
<table id="example-table">
<tr>
<th>Name</th>
<th>Name#</th>
<th>Age</th>
<th>Number</th>
<th>Name of file</th>
<th>Date</th>
</tr>
<tr>
<td>X</td>
<td>Z</td>
<td>52</td>
<td>1</td>
<td>Name of file</td>
<td>2021-22-10</td>
</tr>
</table>
<textarea rows="10" cols="50" id="jsonTextArea">
</textarea>
</body>
</html>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/table-to-json@1.0.0/lib/jquery.tabletojson.min.js"></script>
<script type="text/javascript">
var tableToJson = $('#example-table').tableToJSON();
var sendingData = JSON.stringify (tableToJson);
$('#jsonTextArea').val(sendingData);
// Send JSON data to backend
$.post('http://localhost/test.php', {sendingData}, function(data, textStatus, xhr) {
var backendResponse = data;
console.log(backendResponse);
});
</script>
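Since the question mentions trying Python, here is a hedged sketch of the label/value idea with Beautiful Soup. It assumes every data row of formTbl has exactly two td cells (column name, then value), as in the question's table. The helper name form_table_to_dict is hypothetical, and the sample is a shortened copy of that table.

```python
import json

from bs4 import BeautifulSoup


def form_table_to_dict(html, table_id="formTbl"):
    """Return {label: value} for each two-cell row of the given table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return {}
    record = {}
    for tr in table.find_all("tr"):
        # Only look at the row's own cells, not cells of nested tables.
        cells = tr.find_all("td", recursive=False)
        if len(cells) != 2:
            continue
        label = cells[0].get_text(strip=True)
        value = cells[1].get_text(strip=True)
        record[label] = value
    return record


# Shortened copy of the question's table (two of the six rows).
sample = """
<table class="ms-formtable" id="formTbl">
<tbody>
<tr>
<td class="ms-formlabel"><h3 class="ms-standardheader"><a name="SPBookmark_FileLeafRef">
</a>
Name
</h3></td>
<td class="ms-formbody" id="SPFieldFile"><a href="http://google.com">
X
</a></td>
</tr>
<tr>
<td class="ms-formlabel"><h3 class="ms-standardheader"><a name="SPBookmark_DirectiveRank">
</a>
Age
</h3></td>
<td class="ms-formbody" id="SPFieldChoice">
52
</td>
</tr>
</tbody>
</table>
"""

record = form_table_to_dict(sample)
print(json.dumps(record))
```

The resulting dictionary can be serialized as JSON or split into keys and values for two separate database inserts; the database side is left out here because the question does not say which database is in use.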

How to extract text from table moving between <tr> tags using BeautifulSoup

I need to extract text from a table using BeautifulSoup.
Below is the code which I have written and output
HTML:
<div class="Tech">
<div class="select">
<span>Selection is mandatory</span>
</div>
<table id="product">
<tbody>
<tr class="feature">
<td class="title" rowspan="3">
<h2>Information</h2>
</td>
<td class="label">
<h3>Design</h3>
</td>
<td class="checkbox">product</td>
</tr>
<tr>
<td class="label">
<h3>Marque</h3>
</td>
<td class="checkbox">
<input type="checkbox">
<label>retro</label>
<a href="link">
Landlord
</a>
</td>
</tr>
<tr>
<td class="label">
<h3>Model</h3>
</td>
<td class="checkbox">model123</td>
</tr>
import requests
from bs4 import BeautifulSoup
url='someurl.com'
source2= requests.get(url,timeout=30).text
soup2=BeautifulSoup(source2,'lxml')
element2= soup2.find('div',class_='Tech')
pin= element2.find('table',id='product').tbody.tr.text
print(pin)
Output that I am getting is:
Information
Design
product
How do I move between <tr>s? I need the output to be: model123.
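To get the output model123, you can try:
# find the <h3> that contains "Model"
# (:-soup-contains replaces the deprecated :contains in recent soupsieve releases)
h3 = soup.select_one('h3:-soup-contains("Model")')
# then take the next <td> after it
model = h3.find_next("td").text
print(model)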
Prints:
model123
Or without CSS selectors:
model = (
    soup.find(lambda tag: tag.name == "h3" and tag.text.strip() == "Model")
    .find_next("td")
    .text
)
print(model)
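To move between rows more generally, another option is to loop over every tr and pair each label cell with its value cell. The sketch below assumes each row has one td with class label and one with class checkbox, as in the question's HTML; the dictionary name features is made up for illustration.

```python
from bs4 import BeautifulSoup

# Condensed copy of the table from the question.
html = """
<table id="product">
<tbody>
<tr class="feature">
<td class="title" rowspan="3"><h2>Information</h2></td>
<td class="label"><h3>Design</h3></td>
<td class="checkbox">product</td>
</tr>
<tr>
<td class="label"><h3>Marque</h3></td>
<td class="checkbox"><input type="checkbox"><label>retro</label><a href="link">Landlord</a></td>
</tr>
<tr>
<td class="label"><h3>Model</h3></td>
<td class="checkbox">model123</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
features = {}
for tr in soup.find("table", id="product").find_all("tr"):
    label = tr.find("td", class_="label")
    value = tr.find("td", class_="checkbox")
    if label is not None and value is not None:
        # Use a space separator so nested pieces of text are not run together.
        features[label.get_text(strip=True)] = value.get_text(" ", strip=True)

print(features["Model"])
```

With the rows collected this way, any label can be looked up by name instead of by its position in the table.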

How to use XPath get content of same field?

I am an XPath beginner and just cannot match the content correctly. Here is my question:
How can I use XPath to get the dates '2010.09.07' (after 申请日:) and '2009.09.03'? There are actually 10 such items (g_item) below the g_list level; here I have listed just two of them. I tried copying the XPath from Chrome, but it doesn't work.
I also tried to use regex as below; however, it only matches the first one. Is there a way to return the dates of all items?
Thanks!
s.find(string=('申请日:')).find_next().text.replace('\n', '').strip()
<div class="g_list">
<div class="g_item">
<div class="g_tit">
<ul>
<li class="g_li0">
<input id="CN201010274593.21" name="recordno" type="checkbox" value="CN201010274593.2" pnm="CN102403785B" sysid="B58C6C20BB7D5998B03811E0866F5981" appid="201010274593.2" sectionName="FMSQ" onclick="checkall()"/></li>
<input id="tifPath1" name="tifPath" type="hidden" tifvalue="BOOKS/SD/2014/20140716/201010274593.2,12,CN201010274593.2" xmlvalue="FMSQ,CN201010274593.2,2014.07.16" pdfvalue="Granted_patent_for_invention/2014/20140716/CN102403785B/PDF_PID/CN102010000274593CN00001024037850BPDFZH20140716CN008.PDF,CN201010274593.2" pdfvalue2="CN102403785B,2014.07.16"/>
<li class="g_li" onclick="viewDetail(0)" style="cursor:pointer" name='patti' title="电源管理装置及其电源管理方法">
1.电源管理装置及其电源管理方法</li>
<li class="g_li1">发明授权 </li>
<li class="g_li2 cor3">无效</li>
<li class="g_li3">下载</li>
</ul>
<div class="clear"></div>
</div>
<div class="g_cont">
<div class="g_cont_left">
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td><span>申请号:</span> CN201010274593.2 </td>
<td><span>申请日:</span> 2010.09.07 </td>
</tr>
<tr>
<td><span>公开(公告)号:</span> CN102403785B </td>
<td><span>公开(公告)日:</span> 2014.07.16 </td>
</tr>
<tr>
<td><span>同日申请:
</td>
<td><span>分案原申请号:
</td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>申请(专利权)人:</span> 鸿富锦精密工业(深圳)有限公司;鸿海精密工业股份有限公司 </td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>分类号:</span> H02J13/00(2006.01) </td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>优先权:</span></td>
</tr>
<tr>
<td colspan="2"><span>摘要:</span><span name="patab" style="font-weight:normal"></span>
<a name="abmtlink" href="javascript:return false;" style="color:blue">机器翻译</a></td>
</tr>
</table>
</div>
<div class="g_cont_rig" id="pic1">
<img name="tifpath" src="http://pic.cnipr.com/XmlData/SQ\20140716\201010274593.2/201010274593.gif" class="imgstyle"/>
</div>
<div class="clear"></div>
</div>
</div>
<div class="g_item">
<div class="g_tit">
<ul>
<li class="g_li0">
<input id="CN200910171675.12" name="recordno" type="checkbox" value="CN200910171675.1" pnm="CN102006581B" sysid="E7025BBD105585DF6CE4193E52ECC322" appid="200910171675.1" sectionName="FMSQ" onclick="checkall()"/></li>
<input id="tifPath2" name="tifPath" type="hidden" tifvalue="BOOKS/SD/2013/20130911/200910171675.1,21,CN200910171675.1" xmlvalue="FMSQ,CN200910171675.1,2013.09.11" pdfvalue="Granted_patent_for_invention/2013/20130911/CN102006581B/PDF_PID/CN102009000171675CN00001020065810BPDFZH20130911CN008.PDF,CN200910171675.1" pdfvalue2="CN102006581B,2013.09.11"/>
<li class="g_li" onclick="viewDetail(1)" style="cursor:pointer" name='patti' title="IP地址强制续约的方法及装置">
2.IP地址强制续约的方法及装置</li>
<li class="g_li1">发明授权 </li>
<li class="g_li2 cor3">无效</li>
<li class="g_li3">下载</li>
</ul>
<div class="clear"></div>
</div>
<div class="g_cont">
<div class="g_cont_left">
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td><span>申请号:</span> CN200910171675.1 </td>
<td><span>申请日:</span> 2009.09.03 </td>
</tr>
<tr>
<td><span>公开(公告)号:</span> CN102006581B </td>
<td><span>公开(公告)日:</span> 2013.09.11 </td>
</tr>
<tr>
<td><span>同日申请:
</td>
<td><span>分案原申请号:
</td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>申请(专利权)人:</span> 中兴通讯股份有限公司 </td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>分类号:</span> H04W8/08(2009.01);H04W36/14(2009.01);H04W84/12(2009.01);H04L29/12(2006.01) </td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>优先权:</span></td>
</tr>
<tr>
<td colspan="2"><span>摘要:</span><span name="patab" style="font-weight:normal"></span>
<a name="abmtlink" href="javascript:return false;" style="color:blue">机器翻译</a></td>
</tr>
</table>
</div>
<div class="g_cont_rig" id="pic2">
<img name="tifpath" src="http://pic.cnipr.com/XmlData/SQ/20130911/200910171675.1/200910171675.gif" class="imgstyle"/>
</div>
<div class="clear"></div>
</div>
</div>
</div>
</div>
Your HTML has multiple markup validation errors; you can check for them using the W3 validator. However, if you fix the following errors, it is possible to parse the string using lxml.
Unclosed element span.
From line 43, column 37; to line 43, column 42
Unclosed element span.
From line 46, column 37; to line 46, column 42
Unclosed element span.
From line 119, column 37; to line 119, column 42
Unclosed element span.
From line 122, column 37; to line 122, column 42
Stray end tag div.
From line 159, column 5; to line 159, column 10
from lxml import etree
pageHTML = """
<div class="g_list">
<div class="g_item">
...
...
"""
root = etree.fromstring(pageHTML)
dateList = root.xpath("//*[@class='g_cont_left']/table/tr[1]/td[2]/text()")
print(dateList)
#[' 2010.09.07 ', ' 2009.09.03 ']
If you still want to use a regex (which I would advise against; for HTML, XML and similar formats, a parser written for that grammar is the safer choice), you can define a capturing group that allows only digits and a literal dot, ([\d.]+), surrounded by the exact text you expect to find.
import re
pageHTML = """
<div class="g_list">
<div class="g_item">
...
...
"""
date_Regex = re.findall(r"申请日:\s*</span>\s*([\d.]+)\s*</td>", pageHTML)
print(date_Regex)
# ['2010.09.07', '2009.09.03']
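The Beautiful Soup attempt in the question uses find, which returns only the first match. A hedged sketch of the same idea with find_all, which returns every match, is below. It assumes the date is always the text node directly after the 申请日: label span, and the sample is a cut-down copy of two g_item blocks from the question.

```python
from bs4 import BeautifulSoup

# Cut-down copy of two g_item blocks from the question.
html = """
<div class="g_list">
<div class="g_item"><table><tr>
<td><span>申请号:</span> CN201010274593.2 </td>
<td><span>申请日:</span> 2010.09.07 </td>
</tr></table></div>
<div class="g_item"><table><tr>
<td><span>申请号:</span> CN200910171675.1 </td>
<td><span>申请日:</span> 2009.09.03 </td>
</tr></table></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
dates = []
# find_all returns every label span, not just the first one.
for span in soup.find_all("span", string="申请日:"):
    sibling = span.next_sibling
    # The date is the text node right after the label inside the same <td>.
    if isinstance(sibling, str) and sibling.strip():
        dates.append(sibling.strip())

print(dates)
```

Because html.parser is lenient, this approach may also cope with the unclosed span tags in the original page, but that has not been checked against the full markup.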

Scraping the top ten stories of a website using Beautiful Soup

I'm trying to scrape the website http://edition.cnn.com/EVENTS/1996/year.in.review/
to get the top 10 stories. This is my attempt so far, and I'm wondering if there is an easier way that I'm overlooking to get this in one go. I'm also trying to remove the line breaks between each print, since I don't know why there is a gap between each headline.
import requests
from bs4 import BeautifulSoup
import lxml
html = """
<HTML>
<HEAD>
<TITLE>Top Ten Stories From 1996</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFCC" LINK="#162323" ALINK="#FFFFCE" VLINK="#162323">
<CENTER>
<P><BR>
<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0">
<TR>
<TD><IMG SRC="logos.gif" WIDTH="112" HEIGHT="60" ALIGN="TOP"></TD>
<TD><IMG SRC="banner.gif" WIDTH="360" HEIGHT="60" ALIGN="TOP"></TD>
</TR>
</TABLE>
</P>
</CENTER>
<BLOCKQUOTE>
<CENTER>
<TABLE BORDER="0" CELLPADDING="2">
<TR>
<TD WIDTH="90" VALIGN="TOP" ROWSPAN="11">
<P ALIGN="RIGHT"><B><TT>What were the biggest stories of the year?</TT></B><BR>
<BR>
<FONT SIZE="2">It's a question journalists like to ask themselves at the end of every
year. Now you can join in the process. Here are our selections for the top ten news
stories of 1996.<BR>
<BR>
Disagree with our choices? Then tell us what stories you think were most compelling
in the poll below.</FONT>
</TD>
<TD WIDTH="4" ROWSPAN="11"></TD>
<TD VALIGN="MIDDLE" ROWSPAN="11"><IMG SRC="generic/dot.gif" WIDTH="1" HEIGHT="250" ALIGN="MIDDLE"></TD>
<TD WIDTH="10" ROWSPAN="11"></TD>
<TD COLSPAN="4" VALIGN=TOP>
<P ALIGN="CENTER"><IMG SRC="generic/topten.gif" WIDTH="263" HEIGHT="24" ALIGN="MIDDLE" VSPACE="5">
</TD>
</TR>
<TR>
<TD><A HREF="topten/israel/israel.index.html" TARGET=_top><IMG SRC="generic/1.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/israel/israel.index.html" TARGET=_top><B>Israel</B> elects <B>Netanyahu</A></B></TD>
</TR>
<TR>
<TD><A HREF="topten/twa/twa.index.html" TARGET=_top><IMG SRC="generic/2.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/twa/twa.index.html" TARGET=_top>Crash of TWA Flight 800</A></TD>
</TR>
<TR>
<TD><A HREF="topten/yeltsin/yeltsin.index.html" TARGET=_top><IMG SRC="generic/3.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/yeltsin/yeltsin.index.html" TARGET=_top><B>Russia</B> elects <B>Yeltsin</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/clinton/clinton.index.html" TARGET=_top><IMG SRC="generic/4.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/clinton/clinton.index.html" TARGET=_top><B>U.S</B>. elects <B>Clinton</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/hutu/hutu.index.html" TARGET=_top><IMG SRC="generic/5.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/hutu/hutu.index.html" TARGET=_top><B>Hutu-Tutsi</B> conflict in central Africa</A></TD>
</TR>
<TR>
<TD><A HREF="topten/bosnia/bosnia.index.html" TARGET=_top><IMG SRC="generic/6.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/bosnia/bosnia.index.html" TARGET=_top>Peace, elections in <B>Bosnia</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/saudi/saudi.index.html" TARGET=_top><IMG SRC="generic/7.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/saudi/saudi.index.html" TARGET=_top><B>U.S</B>. base bombed in <B>Saudi Arabia</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/olympics/olympics.index.html" TARGET=_top><IMG SRC="generic/8.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/olympics/olympics.index.html" TARGET=_top>Centennial <B>Olympic</B> Games</A></TD>
</TR>
<TR>
<TD><A HREF="topten/aids/aids.index.html" TARGET=_top><IMG SRC="generic/9.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/aids/aids.index.html" TARGET=_top>Advances against <B>AIDS</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/unabomb/unabomb.index.html" TARGET=_top><IMG SRC="generic/10.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/unabomb/unabomb.index.html" TARGET=_top><B>Unabomb</B> suspect <B>Ted Kaczynski</B> arrested</A></TD>
</TR>
</TABLE>
<BR clear = "all">
<TABLE WIDTH=300>
<TR>
<TD>
<CENTER></CENTER>
</TD>
<TD>
<CENTER></CENTER>
</TD>
</TR>
<TR><TD COLSPAN=2><CENTER><A TARGET=_top HREF="http://www-cgi.cnn.com/cgi-bin/poll/heavypoll.pl?slug=9612%2Fyir_top_10">The top 10 stories according to our users</A></CENTER></TD></TR>
</TABLE>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE"><BR>
<BR><IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE"><BR>
<BR>
<CENTER>
<A HREF="http://pathfinder.com/time/bestof1996/index.html" TARGET=_top>
T I M E: The Best of 1996</A>
<BR clear = "all"><BR>
<A HREF="http://pathfinder.com/##qsdFOQcA62PJWEWu/time/moy/index.html" TARGET=_top>
T I M E: Man of the Year</A>
<BR clear = "all"><BR>
<A HREF="http://pathfinder.com/time/1996/" TARGET=_top>
<IMG SRC="time.gif" WIDTH="540" HEIGHT="50" ALIGN="MIDDLE" BORDER="0"></A>
<BR clear = "all"><BR><BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
</CENTER>
<BR clear = "all">
<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0" WIDTH="63%">
<TR>
<TD WIDTH="100%">
<P><B><TT>What makes a </TT></B><FONT SIZE="5"><TT><B>big</B></TT></FONT><TT><B>
story </B></TT><FONT SIZE="5"><TT><B>BIG?</B></TT></FONT>
<BLOCKQUOTE>
<P>It depends on your criteria, of course, and your perspective. That's why we offered
a poll to find out what you think.</P>
<P>For our list, we polled producers throughout the CNN/Pathfinder family of networks
and publications, and weighed such criteria as a story's long-term implications,
geopolitical significance, user interest, amount of coverage, and old-fashioned newsworthiness.
All these things help make a "big" story big.</P>
<P>By no means do we think our lists are the final word. Even our polls among CNN
producers turned up a wide variety of responses. The process is meant to encourage
you to reconsider the stories that dominated the media during the past year and determine
for yourself which were mere sensations and which were truly significant.
</BLOCKQUOTE>
</TD>
</TR>
</TABLE>
<BR CLEAR=ALL>
<BR>
<CENTER>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
<TABLE WIDTH=300><TR VALIGN=CENTER>
<TD ALIGN=CENTER><IMG SRC="what_you_think.gif" ALT="What you think" WIDTH="60" HEIGHT="59" BORDER="0"></TD>
<TD><STRONG><A NAME="_top" HREF="/feedback/index.html">Tell us what you think</A></STRONG><BR><BR>
<STRONG><A NAME="_top" HREF="/feedback/comments.html">You said it...</A></STRONG></TD>
</TR></TABLE>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
</CENTER>
<CENTER><A HREF="generic/credits.index.html" TARGET=_top><TT><B>C R E D I T S</B></TT></A></CENTER>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
<CENTER><TT><B>Back to top</B></TT></CENTER>
<BR CLEAR=ALL><BR>
<FONT SIZE=-1><P>© 1996 Cable News Network, Inc.<BR>
All Rights Reserved.</FONT>
<H6><A HREF="http://cnn.com/interactive_legal.html" target=_top>Terms</A> under which this
service is provided to you.</H6>
</CENTER>
</CENTER>
</BLOCKQUOTE>
</BODY>
</HTML>
"""
soup = BeautifulSoup(html, "lxml")
td_list = soup.find_all('td')
count = 0
for link in td_list:
    if count == 20:
        pass
    elif link.a is not None:
        print(link.text.strip())
        count += 1
Output:
Israel elects Netanyahu
Crash of TWA Flight 800
Russia elects Yeltsin
U.S. elects Clinton
Hutu-Tutsi conflict in central Africa
Peace, elections in Bosnia
U.S. base bombed in Saudi Arabia
Centennial Olympic Games
Advances against AIDS
Unabomb suspect Ted Kaczynski arrested
I used re to select every <a> tag whose href value starts with topten. You can also do the same thing with a CSS selector:
for item in soup.select("a[href^=topten]"):
I then took all the text inside each tag with strip=True and a single-space separator, so the pieces of text are not run together.
import re
import requests
from bs4 import BeautifulSoup
def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # anchors whose href starts with "topten" are the story links
    for item in soup.find_all("a", href=re.compile("^topten")):
        item = item.get_text(strip=True, separator=" ")
        if item:  # skip the image-only links, which have no text
            print(item)
main("http://edition.cnn.com/EVENTS/1996/year.in.review/main.html")
Output:
Israel elects Netanyahu
Crash of TWA Flight 800
Russia elects Yeltsin
U.S . elects Clinton
Hutu-Tutsi conflict in central Africa
Peace, elections in Bosnia
U.S . base bombed in Saudi Arabia
Centennial Olympic Games
Advances against AIDS
Unabomb suspect Ted Kaczynski arrested
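The gaps in the question's output most likely come from the first cell of each row, whose link contains only an image and therefore has empty text. A small offline sketch of the CSS-selector variant is below; it skips those empty links and normalizes whitespace instead of using a separator, which is a different choice from the answer above. The sample is a shortened copy of the page's table, and the list name headlines is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the story table from the question.
html = """
<table>
<tr>
<td><a href="topten/twa/twa.index.html"><img src="generic/2.gif"></a></td>
<td><a href="topten/twa/twa.index.html">Crash of TWA Flight 800</a></td>
</tr>
<tr>
<td><a href="topten/aids/aids.index.html"><img src="generic/9.gif"></a></td>
<td><a href="topten/aids/aids.index.html">Advances against <b>AIDS</b></a></td>
</tr>
<tr>
<td><a href="generic/credits.index.html">C R E D I T S</a></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
headlines = []
for a in soup.select('a[href^="topten"]'):
    # Collapse runs of whitespace; image-only links collapse to "".
    text = " ".join(a.get_text().split())
    if text:
        headlines.append(text)

for headline in headlines:
    print(headline)
```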

BeautifulSoup: How to parse an un-id'ed list of TDs in table

Using bs4, I'm able to use soup.find_all() to find each of the <tr>s for the table. The HTML is below.
However, how do I efficiently access specific columns within each <tr>? Say I only want the 1st, 3rd and 5th columns.
In other words, is there a way to do something similar to "date = row.td[1]" or "price_low = row.td[3]", etc.?
Thanks.
<tr class="cmc-table-row" style="display:table-row">
<td class="cmc-table__cell cmc-table__cell--sticky cmc-table__cell--left">
<div class="">Dec 23, 2019</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,508.90</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,656.18</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,326.19</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,355.63</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">27,831,788,041</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">133,275,709,111</div>
</td>
</tr>
from bs4 import BeautifulSoup
html = """<tr class="cmc-table-row" style="display:table-row">
<td class="cmc-table__cell cmc-table__cell--sticky cmc-table__cell--left">
<div class="">Dec 23, 2019</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,508.90</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,656.18</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,326.19</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,355.63</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">27,831,788,041</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">133,275,709,111</div>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all("div", {'class': ''})[0:5:2]:
    print(item.text)
output:
Dec 23, 2019
7,656.18
7,355.63
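For access closer to the "row.td[1]" style asked about, one option is to collect the td cells of each row into a list and index into it. This is a sketch only: the variable names first, third and fifth are placeholders because the question does not name the columns, and the sample is a shortened copy of the row above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question (first five cells).
html = """
<table>
<tr class="cmc-table-row">
<td class="cmc-table__cell"><div class="">Dec 23, 2019</div></td>
<td class="cmc-table__cell"><div class="">7,508.90</div></td>
<td class="cmc-table__cell"><div class="">7,656.18</div></td>
<td class="cmc-table__cell"><div class="">7,326.19</div></td>
<td class="cmc-table__cell"><div class="">7,355.63</div></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for row in soup.find_all("tr", class_="cmc-table-row"):
    # One string per column, in document order.
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    # Python lists are zero-based, so columns 1, 3 and 5 are indexes 0, 2 and 4.
    first, third, fifth = cells[0], cells[2], cells[4]
    rows.append((first, third, fifth))

print(rows)
```

Working row by row keeps the columns of each row together, whereas slicing one flat list of div elements only works for a single row.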
