Retrieving the name of a class attribute with lxml - python

I am working on a Python project that uses lxml to scrape a page, and I am having trouble retrieving the name of a span's class attribute. The HTML snippet is below:
<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>
....
How do I retrieve the value of the span's class attribute below:
<span class="brand">carlos santos</span>

You can use the following XPath to get the class attribute of the span element that is a direct child of the td with class product:
//td[@class="product"]/span/@class
Working demo:
from lxml import html
raw = '''<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>'''
root = html.fromstring(raw)
span = root.xpath('//td[@class="product"]/span/@class')[0]
print(span)
Output:
brand
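If you also need the element itself (for example, to read its text), a minimal sketch is to select the span and read the attribute with .get(). The trimmed-down markup and the enclosing table wrapper here are my own assumptions, added only to keep the snippet well-formed:

```python
from lxml import html

# Shortened version of the markup from the question; the <table> wrapper
# is an assumption added so the fragment is well-formed.
raw = '''<table><tr class="nogrid">
<td class="product">
<span class="brand">carlos santos</span>
</td>
</tr></table>'''

root = html.fromstring(raw)

# Select the element rather than the attribute node.
span = root.xpath('//td[@class="product"]/span')[0]

class_name = span.get('class')  # attribute value as a plain string
text = span.text                # text content of the span

print(class_name, text)
```

.get() returns None when the attribute is missing, which is handy if some rows have a span without a class.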

from bs4 import BeautifulSoup
lxml = '''<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>'''
soup = BeautifulSoup(lxml, 'lxml')
result = soup.find('span')['class']  # result = ['brand'] (bs4 returns class as a list)
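Note that BeautifulSoup treats class as a multi-valued attribute, so you get a list back. A small sketch of how that looks in practice; the second class name "large" is made up for illustration and is not in the original page:

```python
from bs4 import BeautifulSoup

# "large" is a hypothetical extra class, added only to show the list behaviour.
markup = '<td class="product"><span class="brand large">carlos santos</span></td>'
soup = BeautifulSoup(markup, 'html.parser')

span = soup.find('span')
classes = span['class']       # list of class names
first = classes[0]            # first class name as a string
joined = ' '.join(classes)    # original attribute value rebuilt as one string

print(classes, first, joined)
```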

Related

How to get a text of certain elements BeautifulSoup Python

I have this kind of html code
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
I need to get the text of every 3rd and 5th td of every tr
Apparently this doesn't work:)
from bs4 import BeautifulSoup
import index
soup = BeautifulSoup(index.index_doc, 'lxml')
for i in soup.find_all('tr')[2:]:
    print(i[2].text, i[4].text)
You could use CSS selectors and the pseudo-class :nth-of-type() to select your elements (I assumed you need the date, so I selected the 6th td):
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
And to get a list of tuples, pair each name with the date that follows it:
list(zip(data[::2], data[1::2]))
Example
from bs4 import BeautifulSoup
html = '''
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
'''
soup = BeautifulSoup(html, 'html.parser')
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
list(zip(data[::2], data[1::2]))
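If some rows might be missing cells, pairing values from one flat list can drift out of step. A possible alternative is to go row by row; this is only a sketch, and the sample rows below are invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented sample rows: two complete rows and one short row.
html_doc = '''<table>
<tr><td>a</td><td>b</td><td><p><sup> Name One </sup></p></td><td>c</td><td>d</td><td><p><sup>25.01.1980</sup></p></td></tr>
<tr><td>a</td><td>b</td><td><p><sup>Name Two</sup></p></td><td>c</td><td>d</td><td><p><sup>01.02.1990</sup></p></td></tr>
<tr><td>only one cell</td></tr>
</table>'''

soup = BeautifulSoup(html_doc, 'html.parser')

pairs = []
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    # Skip rows that do not have at least six cells.
    if len(cells) >= 6:
        pairs.append((cells[2].get_text(strip=True),
                      cells[5].get_text(strip=True)))

print(pairs)
```

Each tuple then always comes from a single row, and short rows are simply skipped.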

how to find class with beautifulsoup

<hknbody>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
date
</td>
<td class="font_34 xicolor_42">
19 Eylül 2013
</td>
</tr>
<tr>
<td style="height:10px" colspan="3"></td>
</tr>
<tr>
<td class="bgcolor_09" style="height:5px" colspan="3"></td>
</tr>
<tr>
<td style="height:10px" colspan="3"></td>
</tr>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
Size
</td>
<td class="font_34 xicolor_42">
650 cm
The class names are the same, and the cells are all in the same table.
How can I find the correct data? For example, if "date" does not appear in a <td class="padding_25 font_7 bold xicolor_07">, skip the date and move on to the next item.
If this is your HTML and you can change it, you should be using semantic HTML to mark up your elements with class, id, or name attributes that describe the meaning of the data, not its appearance. Then you will have an unambiguous way of selecting the required tags.
As it is, you will have to do something like this:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# find the first <td> whose text is "date", ignoring surrounding whitespace
date_tag = soup.find('td', text=re.compile(r'^\s*date\s*$'))
if date_tag:
    date_value = date_tag.find_next_sibling('td').text.strip()
>>> print(date_value)
19 Eylül 2013
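The same idea can be wrapped in a small helper that returns None when a label is absent, so you can skip it and move on. This is a sketch: value_for is a hypothetical name, and the table below is a condensed copy of the markup from the question:

```python
import re
from bs4 import BeautifulSoup

# Condensed version of the question's markup.
html_doc = '''<table>
<tr><td class="padding_25 font_7 bold xicolor_07">date</td><td class="font_34 xicolor_42"> 19 Eylül 2013 </td></tr>
<tr><td class="padding_25 font_7 bold xicolor_07">Size</td><td class="font_34 xicolor_42"> 650 cm </td></tr>
</table>'''

soup = BeautifulSoup(html_doc, 'html.parser')

def value_for(label):
    """Return the text of the cell next to the label cell, or None."""
    pattern = re.compile(r'^\s*' + re.escape(label) + r'\s*$')
    label_cell = soup.find('td', string=pattern)
    if label_cell is None:
        return None  # label not present, nothing to pull
    value_cell = label_cell.find_next_sibling('td')
    return value_cell.get_text(strip=True) if value_cell else None

print(value_for('date'), value_for('Size'), value_for('Weight'))
```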

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the 'Model Type' values on a company's webpages, which are generated from code like the snippet below.
The page shows 2 tables, side by side.
Updated source code of the webpage:
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using the following, but it doesn't return the 'VIP QB662FG' I want:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
How can I get it? I'm using Python 2.7.
You are looking for the next row, then the next cell in the same position. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
value = header_text.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
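If you would rather not hard-code the column position, one option is to look up the index of the header cell and read the same index from the next row. A sketch, using a condensed copy of the table from the question and assuming the value row directly follows the header row:

```python
from bs4 import BeautifulSoup

# Condensed copy of the table from the question.
html_doc = '''<table class="tableforms">
<tr class="tableheader"><td> </td><td>Group </td><td>Model Type </td><td>Design Year </td></tr>
<tr class="row1"><td> </td><td class="row1">South West</td><td>VIP QB662FG (Registered) </td><td>2013 (Registered) </td></tr>
</table>'''

soup = BeautifulSoup(html_doc, 'html.parser')

header_row = soup.find('tr', class_='tableheader')
headers = [td.get_text(strip=True) for td in header_row.find_all('td')]

# Position of the wanted column in the header row.
col = headers.index('Model Type')

# Assumption: the data row is the next <tr> sibling of the header row.
value_row = header_row.find_next_sibling('tr')
value = value_row.find_all('td')[col].get_text(strip=True)

print(value)
```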
Find all tr elements and output each one's third cell, skipping the first row:
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data, 'html.parser')
# the snippet above has no enclosing <table>, so iterate over the rows directly
for i, tr in enumerate(soup.find_all('tr')):
    if i > 0:  # skip the header row
        for idx, td in enumerate(tr.find_all('td')):
            if idx == 2:  # the third cell holds the model type
                print(td.get_text().replace('(Registered)', '').strip())
I think you can do it as follows:
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print(dico['Model Type'])
It prints: VIP QB662FG

BeautifulSoup help, how to extract content from not proper tags text in html file?

<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>
<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap=""> </td>
</tr0>
How can I parse this with Python? Please help.
I want to get the result as the list ['good1',1,'good2',None]
Find all tr tags and get all the td cells from them:
from bs4 import BeautifulSoup
page = """<tr>
<td nowrap> good1 </td>
<td nowrap class = "td_left"> 1 </td>
</tr>
<tr>
<td nowrap> good2 </td>
<td nowrap class = "td_left"> 2 </td>
</tr>"""
soup = BeautifulSoup(page, 'html.parser')
rows = soup.find_all('tr')
print([td.text.strip() for row in rows for td in row.find_all('td')])
prints:
['good1', '1', 'good2', '2']
Note, strip() gets rid of leading and trailing whitespace.
Hope that helps.
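To get closer to the list asked for in the question, including the odd tr0 tag and the empty cell, something like the following might work. It is a sketch: convert is a hypothetical helper, and it assumes empty cells should become None and all-digit cells should become integers:

```python
from bs4 import BeautifulSoup

# Markup as posted in the question, including the non-standard <tr0> tag.
page = """<tr>
<td nowrap> good1 </td>
<td class="td_left" nowrap=""> 1 </td>
</tr>
<tr0>
<td nowrap> good2 </td>
<td class="td_left" nowrap=""> </td>
</tr0>"""

soup = BeautifulSoup(page, 'html.parser')

def convert(cell_text):
    """Map empty text to None and all-digit text to int; keep the rest."""
    text = cell_text.strip()
    if not text:
        return None
    return int(text) if text.isdigit() else text

# Search for td directly, so the unknown <tr0> wrapper does not matter.
result = [convert(td.get_text()) for td in soup.find_all('td')]

print(result)
```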

BeautifulSoup return next sibling after using findAll(text=' ')

How would I go about getting the next siblings using bs4 after I've located the content I want by searching the HTML with soup.findAll?
<td class="name">David<span class="flag away"</span>
</td>
<td class="team">b<span class="team b"></span></td>
<td class="time">99'</td>
<td class="name">James<span class="flag home"</span>
</td>
<td class="team">a<span class="team a"></span></td>
<td class="time">99'</td>
Using findAll I can locate the text:
for t in soup.findAll(text='David'):
>> David
However, my desired output is
<td class="team">b<span class="team b"></span></td>
<td class="time">99'</td>
from bs4 import BeautifulSoup as soup, Tag
input = """<td class="name">David<span class="flag away"</span>
</td>
<td class="team">b<span class="team b"></span></td>
<td class="time">99'</td>
<td class="name">James<span class="flag home"</span>"""
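web_soup = soup(input)
for t in web_soup.findAll(text='David'):
    # walk the siblings that follow the <td> containing "David"
    for item in t.parent.next_siblings:
        if isinstance(item, Tag):
            # stop at the next "name" cell, which starts the next record
            if 'class' in item.attrs and 'name' in item.attrs['class']:
                break
            print(item)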
prints:
<td class="team">b<span class="team b"></span></td>
<td class="time">99'</td>
Hope that is what you wanted.
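If every name cell is always followed by exactly two cells (an assumption on my part), a shorter variant is find_next_siblings with a limit. In this sketch the span tags from the question are closed properly and the cells are wrapped in a table row:

```python
from bs4 import BeautifulSoup

# Same cells as in the question, with the <span> tags closed properly
# and an enclosing <table><tr> added (both are assumptions).
markup = '''<table><tr>
<td class="name">David<span class="flag away"></span></td>
<td class="team">b<span class="team b"></span></td>
<td class="time">99'</td>
<td class="name">James<span class="flag home"></span></td>
<td class="team">a<span class="team a"></span></td>
<td class="time">99'</td>
</tr></table>'''

soup = BeautifulSoup(markup, 'html.parser')

# Locate the text node, then step up to the <td> that contains it.
name_cell = soup.find(string='David').parent

# Take only the two <td> siblings that follow the name cell.
following = name_cell.find_next_siblings('td', limit=2)

classes = [cell['class'] for cell in following]
texts = [cell.get_text(strip=True) for cell in following]

for cell in following:
    print(cell)
```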
