how to find class with beautifulsoup - python

<hknbody>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
date
</td>
<td class="font_34 xicolor_42">
19 Eylül 2013
</td>
</tr>
<tr>
<td style="height:10px" colspan="3"></td>
</tr>
<tr>
<td class="bgcolor_09" style="height:5px" colspan="3"></td>
</tr>
<tr>
<td style="height:10px" colspan="3"></td>
</tr>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
Size
</td>
<td class="font_34 xicolor_42">
650 cm
The class names are the same, and the cells are all in the same table.
How can I find the correct data? For example, if "date" does not appear in a <td class="padding_25 font_7 bold xicolor_07">, skip the date and move on to the next field.

If this is your HTML and you can change it, you should be using semantic HTML to markup your elements with class, id, or name attributes that describe the meaning of the data, not its appearance. Then you will have an unambiguous way of selecting the required tags.
As it is, you will have to do something like this:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
date_tag = soup.find('td', string=re.compile(r'^\s*date\s*$'))  # find first <td> whose text is "date"
if date_tag:
    date_value = date_tag.find_next_sibling('td').text.strip()
>>> print(date_value)
19 Eylül 2013
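The same lookup can be wrapped in a small helper so that a missing label simply yields None and the next field can be tried. This is only a sketch: the helper name value_for_label is hypothetical, the sample rows are modelled on the question's table with the closing tags assumed, and the 'price' label is made up to show the missing case.

```python
import re
from bs4 import BeautifulSoup

# Sample rows modelled on the question's table; the closing tags are assumed.
html = """
<table>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
date
</td>
<td class="font_34 xicolor_42">
19 Eylül 2013
</td>
</tr>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
Size
</td>
<td class="font_34 xicolor_42">
650 cm
</td>
</tr>
</table>
"""


def value_for_label(soup, label):
    """Return the text of the cell after the <td> whose text is `label`, or None."""
    pattern = re.compile(r'^\s*' + re.escape(label) + r'\s*$')
    label_tag = soup.find('td', string=pattern)
    if label_tag is None:
        return None  # label not present: the caller can move on to the next field
    value_tag = label_tag.find_next_sibling('td')
    return value_tag.get_text(strip=True) if value_tag else None


soup = BeautifulSoup(html, 'html.parser')
date_value = value_for_label(soup, 'date')
size_value = value_for_label(soup, 'Size')
price_value = value_for_label(soup, 'price')  # hypothetical label, absent from the sample
print(date_value, size_value, price_value)
```

Because the match is on the label text rather than on the shared class names, it does not matter that every label cell carries the same classes.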

Related

Parsing webpage with robobrowser and beautifulsoup

I'm new to web scraping and am trying to parse a website after submitting a form with robobrowser. I get the correct data back (I can view it when I do: print(browser.parsed)) but am having trouble parsing it. The relevant part of the page source looks like this:
<div id="ii">
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
If I do
in: browser.select('#t1b')
I get:
out: [<td id="t1b" scope="row">Participant Name</td>]
instead of JONES, JOHN.
The only way I've been able to get the relevant data is doing:
browser.select('tr')
This returns a list of all 29 'tr' elements, each of which I can convert to text and search for the relevant info.
I've also tried creating a BeautifulSoup object:
x = browser.select('#ii')
soup = BeautifulSoup(x[0].text, "html.parser")
but it loses all tags/ids and so I can't figure out how to search within it.
Is there an easy way to loop through each 'tr' element and get the actual data rather than the label, as opposed to repeatedly converting to a string variable and searching through it?
Thanks
Get all the "label" td elements and, for each one, take the value of the next td sibling, collecting the results into a dict:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
</table>
"""
soup = BeautifulSoup(data, 'html5lib')
data = {
    label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
    for label in soup.select("tr > td[scope=row]")
}
pprint(data)
Prints:
{'Date of Birth': '11/25/2016',
'ID (ID Number)': '1234567',
'Participant Name': 'JONES, JOHN',
'Race / Ethnicity': 'White',
'Sex': 'MALE'}
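Since each value cell in the question's markup carries a headers attribute naming the id of its label cell, a single value can also be selected directly with an attribute selector. The sketch below assumes a trimmed copy of the question's table, uses the built-in html.parser, and introduces a hypothetical helper value_for_id; 't1z' is a made-up id used to show the missing case.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (two rows only).
html = """
<table>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')


def value_for_id(soup, label_id):
    """Return the text of the <td> whose headers attribute equals label_id, or None."""
    cell = soup.select_one('td[headers="%s"]' % label_id)
    return cell.get_text(strip=True) if cell else None


name = value_for_id(soup, 't1b')
sex = value_for_id(soup, 't1c')
missing = value_for_id(soup, 't1z')  # hypothetical id, not in the sample
print(name, sex, missing)
```

The selector '#t1b' matches the label cell because that is the element carrying the id; 'td[headers="t1b"]' matches the cell that refers to it.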

Retrieving the name of a class attribute with lxml

I am working on a Python project that uses lxml to scrape a page, and I am having trouble retrieving the name of a span's class attribute. The HTML snippet is below:
<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>
....
How do I retrieve the value of the span's class attribute below:
<span class="brand">carlos santos</span>
You can use the following XPath to get the class attribute of the span element that is a direct child of the td with class product:
//td[@class="product"]/span/@class
Working demo:
from lxml import html
raw = '''<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>'''
root = html.fromstring(raw)
span = root.xpath('//td[@class="product"]/span/@class')[0]
print(span)
Output:
brand
Alternatively, using BeautifulSoup:
from bs4 import BeautifulSoup
lxml = '''<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>'''
soup = BeautifulSoup(lxml, 'lxml')
result = soup.find('span')['class']  # result == ['brand'] (class is multi-valued, so a list is returned)
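BeautifulSoup treats class as a multi-valued attribute, so indexing a tag with 'class' gives a list rather than a string. A minimal sketch, assuming the question's row and the built-in html.parser, that scopes the lookup to the product cell and joins the list back into a single name:

```python
from bs4 import BeautifulSoup

raw = '''<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>'''

soup = BeautifulSoup(raw, 'html.parser')

# Select the span that is a direct child of the td with class "product".
span = soup.select_one('td.product > span')

class_list = span['class']                    # list of class names
class_name = ' '.join(span.get('class', []))  # single space-separated string
print(class_list, class_name)
```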

scraping tables with beautifulsoup

I seem to be stuck. If I had the following table:
<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
What would be the ideal way to use find_all to pull the 44,000 td from the table?
If the value always sits at the same position in the table, I would use Beautiful Soup to extract all the cells in the table and then pick out that one. See the sketch below.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # html holds the table markup
known_position = 5  # zero-based index of the 44,000 cell
tds = soup.find_all('td')
number = tds[known_position].text.strip()
On the other hand, if you are searching for a specific number, I would just iterate over the list.
tds = soup.find_all('td')
for td in tds:
    if td.text.strip() == 'number here':
        pass  # do your stuff
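To make the positional approach concrete, here is a small sketch that collects the stripped text of every cell, row by row, and then reads the sixth cell. It assumes the question's row wrapped in a plain table element (the table attributes are dropped and the closing tag is added) and the built-in html.parser; the conversion to int is an extra step not asked for in the question.

```python
from bs4 import BeautifulSoup

# The question's row, with the table attributes dropped and the table closed.
html = """<table>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')

# One list of stripped cell texts per table row.
rows = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in soup.find_all('tr')
]

cell = rows[0][5]                     # sixth cell of the first row
number = int(cell.replace(',', ''))   # drop the thousands separator
print(cell, number)
```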

BeautifulSoup help, how to extract content from not proper tags text in html file?

<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>
<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap=""> </td>
</tr0>
How can I parse this using Python? Please help.
I want to get the result as the list ['good1',1,'good2',None]
Find all tr tags and get all the tds from each one:
from bs4 import BeautifulSoup
page = """<tr>
<td nowrap> good1 </td>
<td nowrap class = "td_left"> 1 </td>
</tr>
<tr>
<td nowrap> good2 </td>
<td nowrap class = "td_left"> 2 </td>
</tr>"""
soup = BeautifulSoup(page, 'html.parser')
rows = soup.find_all('tr')
print([td.text.strip() for row in rows for td in row.find_all('td')])
prints:
['good1', '1', 'good2', '2']
Note that strip() removes leading and trailing whitespace.
Hope that helps.
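The markup in the question also contains a non-standard tr0 tag and an empty cell, and the requested result mixes strings, an integer, and None. One way to get closer to that, sketched here under the assumption that the built-in html.parser is used, is to collect every td regardless of its parent and convert each text afterwards; the helper name convert is hypothetical.

```python
from bs4 import BeautifulSoup

# Markup as given in the question, including the non-standard <tr0> tag.
page = """<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>
<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap=""> </td>
</tr0>"""

soup = BeautifulSoup(page, 'html.parser')


def convert(text):
    """Map empty text to None, digit-only text to int, and keep anything else as str."""
    text = text.strip()
    if not text:
        return None
    return int(text) if text.isdigit() else text


# Search for <td> anywhere, so the parent tag name (tr or tr0) does not matter.
result = [convert(td.get_text()) for td in soup.find_all('td')]
print(result)
```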

How to get text content of multiple <td> tags inside a table using PyQuery?

How do I select the text of each attribute in the book-details table below, where each value is the text of a table cell?
<table cellspacing="0" class="fk-specs-type2">
<tr>
<th class="group-head" colspan="2">Book Details</th>
</tr>
<tr>
<td class="specs-key">Publisher</td>
<td class="specs-value fk-data">HARPER COLLINS INDIA</td>
</tr>
<tr>
<td class="specs-key">ISBN-13</td>
<td class="specs-value fk-data">9789350291924</td>
</tr>
</table>
You can use the following code snippet to get the Publisher and ISBN-13 data:
from pyquery import PyQuery
html = """<table cellspacing="0" class="fk-specs-type2">
<tr>
<th class="group-head" colspan="2">Book Details</th>
</tr>
<tr>
<td class="specs-key">Publisher</td>
<td class="specs-value fk-data">HARPER COLLINS INDIA</td>
</tr>
<tr>
<td class="specs-key">ISBN-13</td>
<td class="specs-value fk-data">9789350291924</td>
</tr>
</table>
"""
doc = PyQuery(html)
for td in doc("table.fk-specs-type2").find("td.specs-key"):
    print(td.text, td.getnext().text)
It should print the following two lines:
Publisher HARPER COLLINS INDIA
ISBN-13 9789350291924
