How to get a text of certain elements BeautifulSoup Python - python

I have this kind of html code
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
I need to get the text of every 3rd and 5th td of every tr
Apparently this doesn't work:)
from bs4 import BeautifulSoup
import index
soup = BeautifulSoup(index.index_doc, 'lxml')
for i in soup.find_all('tr')[2:]:
print(i[2].text, i[4].text)

You could use css selectors and pseudo classe :nth-of-type() to select your elements (assumed you need the date, so I selected the 6th td):
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
And to get a list of tuples:
list(zip(data, data[1:]))
Example
from bs4 import BeautifulSoup
html = '''
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
'''
soup = BeautifulSoup(html)
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
list(zip(data, data[1:]))

Related

How to beautifulsoup in this case without class or id

How to get the text of 'Wow, you get it!' i can print the Date, but i cant get the td that come next of the date.
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Account Here
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td colspan="2">
There is nothing
</td>
</tr>
</table>
<br/>
<br/>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Death
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td valign="top" width="25%">
Aug 15 2021, 18:36:22 CEST
</td>
<td>
Wow, you get it!
</td>
</tr>
<tr bgcolor="#D4C0A1">
<td valign="top" width="25%">
Aug 01 2021, 21:25:39 CEST
</td>
<td>
Next Time
</td>
</tr>
</table>
i got the date with this code:
print(soup.find_all('td', {'valign': 'top'})[0].get_text())
show this
Aug 15 2021, 18:36:22 CEST
but i cant find any solution to get the next td of the date
If html_doc contains the HTML snippet from the question:
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('td[valign="top"] + td').get_text(strip=True)
print(txt)
Prints:
Wow, you get it!
Or:
txt = soup.find("td", {"valign": "top"}).find_next("td").get_text(strip=True)

How to extract text from table moving between<tr> tags using Beautifulsoup

I need to extract text from a table using BeautifulSoup.
Below is the code which I have written and output
HTML:
<div class="Tech">
<div class="select">
<span>Selection is mandatory</span>
</div>
<table id="product">
<tbody>
<tr class="feature">
<td class="title" rowspan="3">
<h2>Information</h2>
</td>
<td class="label">
<h3>Design</h3>
</td>
<td class="checkbox">product</td>
</tr>
<tr>
<td class="label">
<h3>Marque</h3>
</td>
<td class="checkbox">
<input type="checkbox">
<label>retro</label>
<a href="link">
Landlord
</a>
</td>
</tr>
<tr>
<td class="label">
<h3>Model</h3>
</td>
<td class="checkbox">model123</td>
</tr>
import requests
from bs4 import BeautifulSoup
url='someurl.com'
source2= requests.get(url,timeout=30).text
soup2=BeautifulSoup(source2,'lxml')
element2= soup2.find('div',class_='Tech')
pin= element2.find('table',id='product').tbody.tr.text
print(pin)
Output that I am getting is:
Information
Design
product
How to do I move between <tr>s? I need the output as: model123.
To get output model123, you can try:
# search <h3> that contains "Model"
h3 = soup.select_one('h3:contains("Model")')
# search next <td>
model = h3.find_next("td").text
print(model)
Prints:
model123
Or without CSS selectors:
model = (
soup.find(lambda tag: tag.name == "h3" and tag.text.strip() == "Model")
.find_next("td")
.text
)
print(model)

Retrieving the name of a class attribute with lxml

I am working on a python project using lxml to scrap a page and I am having the challenge of retrieving the name of a span class attribute. The html snippet is below:
<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>
....
How do I retrieve the value of the span's class attribute below:
<span class="brand">carlos santos</span>
You can use the following XPath to get class attribute of span element that is direct child of td with class product :
//td[#class="product"]/span/#class
working demo example :
from lxml import html
raw = '''<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>'''
root = html.fromstring(raw)
span = root.xpath('//td[#class="product"]/span/#class')[0]
print span
output :
Brand
from bs4 import BeautifulSoup
lxml = '''<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
<tr>'''
soup = BeautifulSoup(lxml, 'lxml')
result = soup.find('span')['class'] # result = 'brand'

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the ‘Model Type’ values on company’s webpages which from codes like below:
it forms 2 tables shown on the webpage, side by side.
updated source code of the webpage
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using following however it doesn’t get the ‘VIP QB662FG’ wanted:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
in what way I can get it? I'm using Python 2.7.
You are looking for the next row, then the next cell in the same position. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
value = header_cell.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
Find all tr's and output it's third child unless it's first row
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data)
#table = soup.find('tr', {'class':'tableheader'}).parent
table = soup.find('table', {'class':'tableforms'})
for i,tr in enumerate(table.findChildren()):
if i>0:
for idx,td in enumerate(tr.findChildren()):
if idx==2:
print td.get_text().replace('(Registered)','').strip()
I think you can do as follows :
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print dico['Model Type']
It prints : u'VIP QB662FG'

BeautifulSoup return next sibling after using findAll(text=' ')

How would I go about getting the next sibling using bs4 after I've located the contents that I want by searching the HTML using soup.findAll
<td class="name">David<span class="flag away"</span>
</td>
<td class="team">b<span class="team b"></span></td>
<td class="time">99'</td>
<td class="name">James<span class="flag home"</span>
</td>
<td class="team">a<span class="team a"></span></td>
<td class="time">99'</td>
using find all I can locate the text
for t in soup.findAll(text='David'):
>> David
However my desired outupt is
<td class="team">b<span class="team b"></span></td>
<td class="time">99'</td>
from bs4 import BeautifulSoup as soup, Tag
input = """<td class="name">David<span class="flag away"</span>
</td>
<td class="team">b<span class="team b"></span></td>
<td class="time">99'</td>
<td class="name">James<span class="flag home"</span>"""
web_soup = soup(input)
for t in web_soup.findAll(text='David'):
for item in t.parent.next_siblings:
if isinstance(item, Tag):
if 'class' in item.attrs and 'name' in item.attrs['class']:
break
print item
prints:
<td class="team">b<span class="team b"></span></td>
<td class="time">99'</td>
Hope that is what you wanted.

Categories