Scraping the top ten stories of a website using Beautiful Soup

I'm trying to scrape the website http://edition.cnn.com/EVENTS/1996/year.in.review/
and extract the top 10 stories. This is my attempt so far. Is there an easier way that I'm overlooking to get them in one go? I'd also like to remove the blank lines between the printed headlines; I don't know why there is a gap between each one.
import requests
from bs4 import BeautifulSoup
import lxml
html = """
<HTML>
<HEAD>
<TITLE>Top Ten Stories From 1996</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFCC" LINK="#162323" ALINK="#FFFFCE" VLINK="#162323">
<CENTER>
<P><BR>
<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0">
<TR>
<TD><IMG SRC="logos.gif" WIDTH="112" HEIGHT="60" ALIGN="TOP"></TD>
<TD><IMG SRC="banner.gif" WIDTH="360" HEIGHT="60" ALIGN="TOP"></TD>
</TR>
</TABLE>
</P>
</CENTER>
<BLOCKQUOTE>
<CENTER>
<TABLE BORDER="0" CELLPADDING="2">
<TR>
<TD WIDTH="90" VALIGN="TOP" ROWSPAN="11">
<P ALIGN="RIGHT"><B><TT>What were the biggest stories of the year?</TT></B><BR>
<BR>
<FONT SIZE="2">It's a question journalists like to ask themselves at the end of every
year. Now you can join in the process. Here are our selections for the top ten news
stories of 1996.<BR>
<BR>
Disagree with our choices? Then tell us what stories you think were most compelling
in the poll below.</FONT>
</TD>
<TD WIDTH="4" ROWSPAN="11"></TD>
<TD VALIGN="MIDDLE" ROWSPAN="11"><IMG SRC="generic/dot.gif" WIDTH="1" HEIGHT="250" ALIGN="MIDDLE"></TD>
<TD WIDTH="10" ROWSPAN="11"></TD>
<TD COLSPAN="4" VALIGN=TOP>
<P ALIGN="CENTER"><IMG SRC="generic/topten.gif" WIDTH="263" HEIGHT="24" ALIGN="MIDDLE" VSPACE="5">
</TD>
</TR>
<TR>
<TD><A HREF="topten/israel/israel.index.html" TARGET=_top><IMG SRC="generic/1.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/israel/israel.index.html" TARGET=_top><B>Israel</B> elects <B>Netanyahu</A></B></TD>
</TR>
<TR>
<TD><A HREF="topten/twa/twa.index.html" TARGET=_top><IMG SRC="generic/2.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/twa/twa.index.html" TARGET=_top>Crash of TWA Flight 800</A></TD>
</TR>
<TR>
<TD><A HREF="topten/yeltsin/yeltsin.index.html" TARGET=_top><IMG SRC="generic/3.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/yeltsin/yeltsin.index.html" TARGET=_top><B>Russia</B> elects <B>Yeltsin</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/clinton/clinton.index.html" TARGET=_top><IMG SRC="generic/4.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/clinton/clinton.index.html" TARGET=_top><B>U.S</B>. elects <B>Clinton</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/hutu/hutu.index.html" TARGET=_top><IMG SRC="generic/5.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/hutu/hutu.index.html" TARGET=_top><B>Hutu-Tutsi</B> conflict in central Africa</A></TD>
</TR>
<TR>
<TD><A HREF="topten/bosnia/bosnia.index.html" TARGET=_top><IMG SRC="generic/6.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/bosnia/bosnia.index.html" TARGET=_top>Peace, elections in <B>Bosnia</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/saudi/saudi.index.html" TARGET=_top><IMG SRC="generic/7.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/saudi/saudi.index.html" TARGET=_top><B>U.S</B>. base bombed in <B>Saudi Arabia</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/olympics/olympics.index.html" TARGET=_top><IMG SRC="generic/8.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/olympics/olympics.index.html" TARGET=_top>Centennial <B>Olympic</B> Games</A></TD>
</TR>
<TR>
<TD><A HREF="topten/aids/aids.index.html" TARGET=_top><IMG SRC="generic/9.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/aids/aids.index.html" TARGET=_top>Advances against <B>AIDS</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/unabomb/unabomb.index.html" TARGET=_top><IMG SRC="generic/10.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/unabomb/unabomb.index.html" TARGET=_top><B>Unabomb</B> suspect <B>Ted Kaczynski</B> arrested</A></TD>
</TR>
</TABLE>
<BR clear = "all">
<TABLE WIDTH=300>
<TR>
<TD>
<CENTER></CENTER>
</TD>
<TD>
<CENTER></CENTER>
</TD>
</TR>
<TR><TD COLSPAN=2><CENTER><A TARGET=_top HREF="http://www-cgi.cnn.com/cgi-bin/poll/heavypoll.pl?slug=9612%2Fyir_top_10">The top 10 stories according to our users</A></CENTER></TD></TR>
</TABLE>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE"><BR>
<BR><IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE"><BR>
<BR>
<CENTER>
<A HREF="http://pathfinder.com/time/bestof1996/index.html" TARGET=_top>
T I M E: The Best of 1996</A>
<BR clear = "all"><BR>
<A HREF="http://pathfinder.com/##qsdFOQcA62PJWEWu/time/moy/index.html" TARGET=_top>
T I M E: Man of the Year</A>
<BR clear = "all"><BR>
<A HREF="http://pathfinder.com/time/1996/" TARGET=_top>
<IMG SRC="time.gif" WIDTH="540" HEIGHT="50" ALIGN="MIDDLE" BORDER="0"></A>
<BR clear = "all"><BR><BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
</CENTER>
<BR clear = "all">
<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0" WIDTH="63%">
<TR>
<TD WIDTH="100%">
<P><B><TT>What makes a </TT></B><FONT SIZE="5"><TT><B>big</B></TT></FONT><TT><B>
story </B></TT><FONT SIZE="5"><TT><B>BIG?</B></TT></FONT>
<BLOCKQUOTE>
<P>It depends on your criteria, of course, and your perspective. That's why we offered
a poll to find out what you think.</P>
<P>For our list, we polled producers throughout the CNN/Pathfinder family of networks
and publications, and weighed such criteria as a story's long-term implications,
geopolitical significance, user interest, amount of coverage, and old-fashioned newsworthiness.
All these things help make a "big" story big.</P>
<P>By no means do we think our lists are the final word. Even our polls among CNN
producers turned up a wide variety of responses. The process is meant to encourage
you to reconsider the stories that dominated the media during the past year and determine
for yourself which were mere sensations and which were truly significant.
</BLOCKQUOTE>
</TD>
</TR>
</TABLE>
<BR CLEAR=ALL>
<BR>
<CENTER>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
<TABLE WIDTH=300><TR VALIGN=CENTER>
<TD ALIGN=CENTER><IMG SRC="what_you_think.gif" ALT="What you think" WIDTH="60" HEIGHT="59" BORDER="0"></TD>
<TD><STRONG><A NAME="_top" HREF="/feedback/index.html">Tell us what you think</A></STRONG><BR><BR>
<STRONG><A NAME="_top" HREF="/feedback/comments.html">You said it...</A></STRONG></TD>
</TR></TABLE>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
</CENTER>
<CENTER><A HREF="generic/credits.index.html" TARGET=_top><TT><B>C R E D I T S</B></TT></A></CENTER>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
<CENTER><TT><B>Back to top</B></TT></CENTER>
<BR CLEAR=ALL><BR>
<FONT SIZE=-1><P>© 1996 Cable News Network, Inc.<BR>
All Rights Reserved.</FONT>
<H6><A HREF="http://cnn.com/interactive_legal.html" target=_top>Terms</A> under which this
service is provided to you.</H6>
</CENTER>
</CENTER>
</BLOCKQUOTE>
</BODY>
</HTML>
"""
soup = BeautifulSoup(html, "lxml")
td_list = soup.find_all('td')
count = 0
for link in td_list:
    if count == 20:
        pass
    elif link.a is not None:
        print(link.text.strip())
        count += 1
Output:
Israel elects Netanyahu
Crash of TWA Flight 800
Russia elects Yeltsin
U.S. elects Clinton
Hutu-Tutsi conflict in central Africa
Peace, elections in Bosnia
U.S. base bombed in Saudi Arabia
Centennial Olympic Games
Advances against AIDS
Unabomb suspect Ted Kaczynski arrested

I used re to select every a tag whose href value starts with topten. You can also do the same thing with a CSS selector, for example:
for item in soup.select("a[href^=topten]"):
I then take all the text within each tag, strip it with strip=True, and pass a single-space separator so the pieces of text don't run together. The links that contain only an image produce an empty string, so they are skipped.
import re
import requests
from bs4 import BeautifulSoup

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Select every <a> whose href starts with "topten"
    for item in soup.find_all("a", href=re.compile("^topten")):
        text = item.get_text(strip=True, separator=" ")
        # Image-only links have no text, so skip them
        if text:
            print(text)

main("http://edition.cnn.com/EVENTS/1996/year.in.review/main.html")
Output:
Israel elects Netanyahu
Crash of TWA Flight 800
Russia elects Yeltsin
U.S . elects Clinton
Hutu-Tutsi conflict in central Africa
Peace, elections in Bosnia
U.S . base bombed in Saudi Arabia
Centennial Olympic Games
Advances against AIDS
Unabomb suspect Ted Kaczynski arrested
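
The gaps in the question's output most likely come from the table cells whose link contains only an image: their text is empty, so an empty line is printed. The "U.S ." spacing in the output above comes from joining the stripped text pieces with a space. A minimal sketch that avoids both, assuming a shortened two-row sample copied from the question's HTML (the function name top_stories is hypothetical), could look like this:

```python
from bs4 import BeautifulSoup

# Two rows taken from the question's HTML, shortened for illustration.
SAMPLE = """
<TABLE>
<TR>
<TD><A HREF="topten/clinton/clinton.index.html"><IMG SRC="generic/4.gif"></A></TD>
<TD><A HREF="topten/clinton/clinton.index.html"><B>U.S</B>. elects <B>Clinton</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/twa/twa.index.html"><IMG SRC="generic/2.gif"></A></TD>
<TD><A HREF="topten/twa/twa.index.html">Crash of TWA Flight 800</A></TD>
</TR>
</TABLE>
"""


def top_stories(html):
    """Return the headline text of every link whose href starts with 'topten'."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    for a in soup.select('a[href^="topten"]'):
        # Take the raw text, then collapse any runs of whitespace.
        text = " ".join(a.get_text().split())
        # Image-only links have no text; skipping them removes the blank lines.
        if text:
            headlines.append(text)
    return headlines


print(top_stories(SAMPLE))
```

Because the text is not split into separately stripped pieces, "U.S" and ". elects" stay adjacent.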

Related

HTML table to database

At this point, my table looks as follows:
<table border="0" cellpadding="0" cellspacing="0" class="ms-formtable" id="formTbl" style="margin-top: 8px;" width="100%">
<tbody>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_FileLeafRef">
</a>
Name
</h3>
</td>
<td class="ms-formbody" id="SPFieldFile" valign="top" width="450px">
<a href="http://google.com" onclick="DispDocItemEx(this, 'FALSE', 'FALSE', 'FALSE', '');">
X
</a>
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_Owner">
</a>
Name#
</h3>
</td>
<td class="ms-formbody" id="SPFieldChoice" valign="top" width="450px">
Z
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_DirectiveRank">
</a>
Age
</h3>
</td>
<td class="ms-formbody" id="SPFieldChoice" valign="top" width="450px">
52
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_Number">
</a>
number
</h3>
</td>
<td class="ms-formbody" id="SPFieldText" valign="top" width="450px">
1
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_Title">
</a>
Name of File
</h3>
</td>
<td class="ms-formbody" id="SPFieldText" valign="top" width="450px">
Funny Names
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_EffectiveFrom">
</a>
date
</h3>
</td>
<td class="ms-formbody" id="SPFieldDateTime" valign="top" width="450px">
1.1.2022
</td>
</tr>
</tbody>
</table>
I basically need to open an HTML file, pick out the table with id "formTbl", and then either create JSON of the form {Firsttd: Secondtd}, e.g. {"Name":"Test", "Date":"Blank"}, or insert the data into a database, with the first td going into table A and the second td into table B (each tr holds two td elements: the first is the column name and the second is the value). Is there any way to do this? I've tried Python, but the JSON I get so far looks like [["","Name","","Test",""],["","Age","","12",""]]. In C# I tried HTMLAgilityPack, but it wasn't working.

Here is a solution with jQuery.
<html>
<body>
<table id="example-table">
<tr>
<th>Name</th>
<th>Name#</th>
<th>Age</th>
<th>Number</th>
<th>Name of file</th>
<th>Date</th>
</tr>
<tr>
<td>X</td>
<td>Z</td>
<td>52</td>
<td>1</td>
<td>Name of file</td>
<td>2021-22-10</td>
</tr>
</table>
<textarea rows="10" cols="50" id="jsonTextArea">
</textarea>
</body>
</html>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/table-to-json@1.0.0/lib/jquery.tabletojson.min.js"></script>
<script type="text/javascript">
var tableToJson = $('#example-table').tableToJSON();
var sendingData = JSON.stringify (tableToJson);
$('#jsonTextArea').val(sendingData);
// Send JSON data to backend
$.post('http://localhost/test.php', {sendingData}, function(data, textStatus, xhr) {
var backendResponse = data;
console.log(backendResponse);
});
</script>
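
Since the question also mentions Python, here is a rough sketch of the same idea with Beautiful Soup: it reads the label/value pairs of the "formTbl" table into a dictionary and dumps it as JSON. The shortened SAMPLE and the function name form_table_to_dict are assumptions made for illustration, and the sketch assumes every data row has exactly two td cells, as in the question.

```python
import json

from bs4 import BeautifulSoup

# Shortened version of the question's table (two rows only).
SAMPLE = """
<table class="ms-formtable" id="formTbl">
<tbody>
<tr>
<td class="ms-formlabel"><h3 class="ms-standardheader"><a name="SPBookmark_FileLeafRef"></a>
Name
</h3></td>
<td class="ms-formbody" id="SPFieldFile"><a href="http://google.com">
X
</a></td>
</tr>
<tr>
<td class="ms-formlabel"><h3 class="ms-standardheader"><a name="SPBookmark_DirectiveRank"></a>
Age
</h3></td>
<td class="ms-formbody" id="SPFieldChoice">
52
</td>
</tr>
</tbody>
</table>
"""


def form_table_to_dict(html):
    """Map the first cell of each row (label) to the second cell (value)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="formTbl")
    result = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td", recursive=False)
        # Rows that do not hold exactly a label and a value are ignored.
        if len(cells) == 2:
            label = cells[0].get_text(strip=True)
            value = cells[1].get_text(strip=True)
            result[label] = value
    return result


print(json.dumps(form_table_to_dict(SAMPLE)))
```

The resulting dictionary can be serialized as shown or passed to whatever database layer is in use; all values are kept as strings.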

How to beautifulsoup in this case without class or id

How can I get the text 'Wow, you get it!'? I can print the date, but I can't get the td that comes right after the date.
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Account Here
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td colspan="2">
There is nothing
</td>
</tr>
</table>
<br/>
<br/>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Death
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td valign="top" width="25%">
Aug 15 2021, 18:36:22 CEST
</td>
<td>
Wow, you get it!
</td>
</tr>
<tr bgcolor="#D4C0A1">
<td valign="top" width="25%">
Aug 01 2021, 21:25:39 CEST
</td>
<td>
Next Time
</td>
</tr>
</table>
I got the date with this code:
print(soup.find_all('td', {'valign': 'top'})[0].get_text())
which shows:
Aug 15 2021, 18:36:22 CEST
But I can't find a way to get the td that follows the date.

If html_doc contains the HTML snippet from the question:
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('td[valign="top"] + td').get_text(strip=True)
print(txt)
Prints:
Wow, you get it!
Or:
txt = soup.find("td", {"valign": "top"}).find_next("td").get_text(strip=True)
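
If every date/message pair is needed rather than only the first, one possible sketch is to loop over the date cells and take each one's next sibling td. The shortened SAMPLE and the function name date_message_pairs are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# The two data rows from the question's second table, shortened.
SAMPLE = """
<table>
<tr><td valign="top" width="25%">Aug 15 2021, 18:36:22 CEST</td><td>Wow, you get it!</td></tr>
<tr><td valign="top" width="25%">Aug 01 2021, 21:25:39 CEST</td><td>Next Time</td></tr>
</table>
"""


def date_message_pairs(html):
    """Return (date, message) tuples for every td that has valign="top"."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for date_td in soup.find_all("td", {"valign": "top"}):
        # find_next_sibling stays inside the same row, unlike find_next.
        message_td = date_td.find_next_sibling("td")
        if message_td is not None:
            pairs.append((date_td.get_text(strip=True),
                          message_td.get_text(strip=True)))
    return pairs


for date, message in date_message_pairs(SAMPLE):
    print(date, "->", message)
```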

Is there any way to programmatically edit nested tables in an HTML file using BeautifulSoup?

I am scraping a table in a webpage with BeautifulSoup. I managed to put the text in a txt file.
However, some cells contain multiple tables. I guess the developers had some aesthetic requirement and couldn't lay out the cell any other way. I have many problems scraping the tables the way they are, so I was wondering whether there is a way to programmatically edit the HTML so that the text from those nested tables is moved into the original cell.
Here is an example of what I mean.
From a nested table like this
<tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which:</p>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the materials of Chapter 4 used are wholly obtained,</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,</p>
<p class="normal">and</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>
</td>
</tr>
</tbody>
</table>
</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr>
I would like to edit the HTML file in order to get
<tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which: all the materials of Chapter 4 used are wholly obtained, — all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating, — the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>
</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr>
from all the nested tables in the cells.

Yes, you can do that, provided your HTML always has this structure.
Find all the columns inside each row and check whether the column contains a child table.
Then collect the text of all the p tags in that column and replace the text of the first p tag with it.
Finally, decompose() every table tag in the column.
Code:
html='''<tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which:</p>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the materials of Chapter 4 used are wholly obtained,</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,</p>
<p class="normal">and</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>
</td>
</tr>
</tbody>
</table>
</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for row in soup.find_all('tr', class_='table'):
    for col in row.find_all('td'):
        if col.find_all("table"):
            # Get the text of every p tag in the column that contains a table
            ptag_text = ''.join([i.text for i in col.find_all('p')])
            # Replace the text of the first p tag with the combined text
            col.find('p').next_element.replace_with(ptag_text)
            # Remove the nested tables from the column
            for item in col.find_all("table"):
                item.decompose()
print(soup)
Output:
<html><body><tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>
</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr></body></html>
If you don't want those newlines, remove them all with replace, as below.
finalhtml=str(soup).replace('\n','')
print(finalhtml)
Output:
<html><body><tr class="table"><td class="table" valign="top"><p class="tbl-cod">0403</p></td><td class="table" valign="top"><p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p></td><td class="table" valign="top"><p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p></td><td class="table" valign="top"><p class="normal"> </p></td></tr></body></html>
If you then want to pretty-print the result again, try this:
finalhtml=str(soup).replace('\n','')
soup=BeautifulSoup(finalhtml,'lxml')
print(soup.prettify(formatter=None))
Output:
<html>
<body>
<tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">
0403
</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">
Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa
</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">
Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product
</p>
</td>
<td class="table" valign="top">
<p class="normal">
</p>
</td>
</tr>
</body>
</html>
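
The output above joins the pieces without spaces, while the question's desired result separates them with spaces. A variant that joins with a single space might look like the following sketch. It is an assumption-laden illustration rather than the answer's method: it uses html.parser instead of lxml, a shortened SAMPLE, and a hypothetical function name flatten_nested_tables.

```python
from bs4 import BeautifulSoup

# One outer cell with one nested table, shortened from the question's HTML.
SAMPLE = """
<table><tr class="table">
<td class="table"><p class="tbl-txt">Manufacture in which:</p>
<table><tr><td><p class="normal">&mdash;</p></td>
<td><p class="normal">all the materials of Chapter 4 used are wholly obtained</p></td></tr></table>
</td>
</tr></table>
"""


def flatten_nested_tables(html):
    """Move the text of nested tables into the first <p> of the outer cell."""
    soup = BeautifulSoup(html, "html.parser")
    # Only the direct cells of rows with class "table" are outer cells.
    for cell in soup.select("tr.table > td"):
        nested = cell.find_all("table", recursive=False)
        if not nested:
            continue
        # Text of every <p> in the cell, including those in nested tables.
        parts = [p.get_text(strip=True) for p in cell.find_all("p")]
        # Overwrite the first <p> with the space-joined text.
        cell.find("p").string = " ".join(part for part in parts if part)
        for table in nested:
            table.decompose()
    return soup


print(flatten_nested_tables(SAMPLE))
```

Joining with a space keeps the dash markers separated from the surrounding text; any other separator can be substituted in the join call.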

Beautiful Soup Table, stop getting info

Hey everyone, I have some HTML that I am parsing. Here it is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<table class="dayinner">
<tr class="lun">
<td class="mealname" colspan="3">LUNCH</td>
</tr>
<tr class="lun">
<td class="station"> Deli</td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000010000047598_35356" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000047598_35356');" onmouseout="pcls(this);"
onmouseover="ws(this);">Made to Order Deli Core</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td colspan="3" style="height:3px;"></td>
</tr>
<tr class="lun">
<td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
</tr>
<tr class="lun">
<td class="station"> Dessert</td>
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
</tr>
</table>
</body>
</html>
Here is the code I have. I want just the items under the Deli section, and normally I won't know how many there are. Is there a way to do this?
soup = BeautifulSoup(open("upperMenu.html"))
title = soup.find('td', class_='station').text.strip()
spans = soup.find_all('span', class_='ul')[:2]
But this only works if there are two items. How can I make it work when the number of items is unknown?
Thanks in advance

You can use the string argument of find_all (called text in older versions of Beautiful Soup) to 1. find all the rows whose station column contains the substring Deli, and 2. loop through every such row and find the spans within that row whose class is ul.
import re
from bs4 import BeautifulSoup

# text holds the HTML from the question
soup = BeautifulSoup(text, 'html.parser')
tds_deli = soup.find_all(name='td', attrs={'class': 'station'}, string=re.compile('Deli'))
for td in tds_deli:
    # Go up to the row that contains the matching station cell
    tr = td.find_parent('tr')
    spans = tr.find_all('span', {'class': 'ul'})
    for span in spans:
        # do something
        print(span.text)
    print('------------one row -------------')
Sample Output in this case:
Made to Order Deli Core
------------one row -------------
Not sure if I am understanding the problem correctly but I think my code might help you get started.
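
The code above only reaches the row that carries the Deli label itself. In the question's HTML, the following row has a blank station cell, which suggests it belongs to the same section. Under that assumption (a blank station cell continues the previous section), a sketch that collects every item of a section could look like this. The function name items_for_station and the "Fruit Cup" item in the shortened SAMPLE are hypothetical:

```python
from bs4 import BeautifulSoup

# Shortened version of the question's menu; "Fruit Cup" is a made-up item.
SAMPLE = """
<table class="dayinner">
<tr class="lun"><td class="station">&nbsp;Deli</td>
<td class="menuitem"><span class="ul">Made to Order Deli Core</span></td></tr>
<tr class="lun"><td class="station">&nbsp;</td>
<td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td></tr>
<tr class="lun"><td colspan="3"></td></tr>
<tr class="lun"><td class="station">&nbsp;Dessert</td>
<td class="menuitem"><span class="ul">Fruit Cup</span></td></tr>
</table>
"""


def items_for_station(html, station_name):
    """Collect the menu items of one station, however many rows it spans."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    current = None  # name of the station section currently being read
    for row in soup.find_all("tr"):
        station = row.find("td", class_="station")
        if station is not None:
            name = station.get_text(strip=True)
            # Assumption: a blank station cell continues the previous section.
            if name:
                current = name
        if current == station_name:
            items.extend(span.get_text(strip=True)
                         for span in row.find_all("span", class_="ul"))
    return items


print(items_for_station(SAMPLE, "Deli"))
```

Tracking the current section while walking the rows means the number of items does not need to be known in advance.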

beautiful soup get children that are Tags (not Navigable Strings) from a Tag

The Beautiful Soup documentation provides the attributes .contents and .children to access the children of a given tag (a list and an iterator, respectively); both include Navigable Strings as well as Tags. I want only the children of type Tag.
I'm currently accomplishing this using a list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better/more pythonic/built-in way to get just Tag children.

Thanks to J.F. Sebastian, the following will work:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed actual rows in the table, so I ended up using the following, which is more precise and I think more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag.
Worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100">
<div style="position:relative;">
President
<span class="pct-rpt">
99% reporting
</span>
</div>
</th>
</tr>
<tr class="header-row">
<th class="photo first">
</th>
<th class="candidate ">
Candidate
</th>
<th class="party ">
Party
</th>
<th class="votes ">
Votes
</th>
<th class="pct ">
Pct.
</th>
<th class="change ">
Change from ‘08
</th>
<th class="evotes last">
Electoral Votes
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first">
<div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div>
</td>
<td class="candidate ">
<div class="winner dem"><img alt="Hp-checkmark@2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark@2x.png?1352320690" width="10" />Barack Obama</div>
</td>
<td class="party ">
Dem.
</td>
<td class="votes ">
2,916,811
</td>
<td class="pct ">
57.3%
</td>
<td class="change ">
-4.6%
</td>
<td class="evotes last">
20
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Mitt Romney</div>
</td>
<td class="party ">
Rep.
</td>
<td class="votes ">
2,090,116
</td>
<td class="pct ">
41.1%
</td>
<td class="change ">
+4.3%
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Gary Johnson</div>
</td>
<td class="party ">
Lib.
</td>
<td class="votes ">
54,798
</td>
<td class="pct ">
1.1%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="last-row">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Jill Stein</div>
</td>
<td class="party ">
Green
</td>
<td class="votes ">
29,336
</td>
<td class="pct ">
0.6%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr>
<td class="footer" colspan="100">
President Map |
President Big Board |
Exit Polls
</td>
</tr>
</tbody>
</table>
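
To make the difference concrete, the following sketch compares .children with find_all(True, recursive=False) on a shortened table. The SAMPLE markup is an assumption made for illustration; the isinstance check is shown only as an equivalent of the list comprehension in the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened table: three rows separated by newlines inside <tbody>.
SAMPLE = """
<table>
<tbody>
<tr class=""><td class="candidate">Barack Obama</td></tr>
<tr class=""><td class="candidate">Mitt Romney</td></tr>
<tr><td class="footer">Exit Polls</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
tbody = soup.table.tbody

# .children yields the newline strings between rows as well as the rows.
all_children = list(tbody.children)

# find_all(True, recursive=False) returns only the direct children that are Tags.
tag_children = tbody.find_all(True, recursive=False)

# Equivalent filter written with isinstance.
same_via_isinstance = [c for c in tbody.children if isinstance(c, Tag)]

print(len(all_children), len(tag_children))
```

Passing recursive=False matters here: without it, find_all(True) would also return the td elements nested inside each row.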
