Not getting desired text from BeautifulSoup

<h3 class="jd_header3 text" style="font-size: 12px;">
Shift Pattern:
</h3>
<ul class="jd_NoBulletinRight">
<li style="font-size:11px;">
<span class="text">
No Shift
</span>
</li>
</ul>
<h3 class="jd_header3 text" style="align:left;font-size:12px;">
Salary:
</h3>
<ul class="jd_NoBulletinRight">
<li>
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td align="left" style="word-wrap: break-word;font-size: 11px;" valign="top">
<span class="text">
S$3,500.00
<span class="text">
-
</span>
S$5,400.00
</span>
</td>
</tr>
</tbody>
</table>
</li>
</ul>
This is a part of my BeautifulSoup tree. I wish to get the salary range S$3500 - S$5400. Following the suggestion here I use the following code:
salary = bsObj.find(text="Salary:").parent.nextSibling.find("td").get_text()
print(salary)
I get the error:
AttributeError: 'int' object has no attribute 'get_text'
But when I simply print out the integer:
salary = bsObj.find(text="Salary:").parent.nextSibling.find("td")
print(salary)
I get:
-1
That is not what I want. I used Selenium to obtain the page, so any JavaScript has already run.
Any ideas?

Try the following code; I don't think your code can produce the output you expect:
>>> bsObj.find('td', {'align': "left"}).text
'\n\n S$3,500.00\n \n -\n \n S$5,400.00\n \n'
>>> ' '.join(bsObj.find('td', {'align': "left"}).text.split())
'S$3,500.00 - S$5,400.00'

Not sure about this "get_text" attribute, but with BeautifulSoup, I rely heavily on .text as shown below. Is this what you're looking for?
from bs4 import BeautifulSoup
s = '''<html here>'''
soup = BeautifulSoup(s, 'html.parser')
bsObj = soup.findAll('td')
for i in bsObj:
    print(i.text)
>>>
S$3,500.00
-
S$5,400.00
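
A likely explanation for the -1, offered as a sketch rather than a confirmed diagnosis: .nextSibling of the <h3> is the whitespace text node between the tags, not the <ul>. That node is a NavigableString, which behaves like a Python str, so .find("td") ends up as str.find and returns -1. The example below assumes the markup from the question (slightly trimmed) and uses find_next_sibling("ul") to skip the text node; the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<h3 class="jd_header3 text" style="font-size: 12px;">
Shift Pattern:
</h3>
<ul class="jd_NoBulletinRight">
<li style="font-size:11px;"><span class="text">No Shift</span></li>
</ul>
<h3 class="jd_header3 text" style="align:left;font-size:12px;">
Salary:
</h3>
<ul class="jd_NoBulletinRight">
<li>
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td align="left" valign="top">
<span class="text">
S$3,500.00
<span class="text">
-
</span>
S$5,400.00
</span>
</td>
</tr>
</tbody>
</table>
</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Match the heading with a regex so surrounding whitespace does not matter.
heading = soup.find("h3", string=re.compile(r"Salary:"))

# find_next_sibling("ul") skips text nodes and returns the next <ul> tag.
salary_list = heading.find_next_sibling("ul")

# Collapse the whitespace left over from the nested <span> elements.
salary = " ".join(salary_list.find("td").get_text().split())
print(salary)
```

Anchoring on the "Salary:" heading, rather than on the first left-aligned td, should keep the lookup working if other tables appear earlier on the page.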

Related

Parsing nested tags with BeautifulSoup and requests

I'm new to BeautifulSoup. I am trying to parse an HTML page fetched with requests. The code I have so far is:
import requests
from bs4 import BeautifulSoup
link = "SOME_URL"
f = requests.get(link)
soup = BeautifulSoup(f.text, 'html.parser')
for el in soup.findAll("td", {"class": "g-res-tab-cell"}):
    print(el)
The output is as follows:
<td class="g-res-tab-cell">
<div style="padding:8px;">
<div style="padding-top:8px;">
<table cellspacing="0" cellpadding="0" border="0" style="width:100%;">
<tr>
<td valign="top">
<div itemscope itemtype="URL">
<table cellspacing="0" cellpadding="0" style="width:100%;">
<tr>
<td valign="top" class="g-res-tab-cell" style="width:100%;">
<div style="width:100%;padding-left:4px;">
<div class="subtext_view_med" itemprop="name">
NAME1
</div>
<div style="direction:ltr;padding-left:5px;margin-bottom:2px;" class="smtext">
<span class="Gray">In English:</span> ENGLISH_NAME1
</div>
<div style="padding-bottom:2px;padding-top:8px;font-size:14px;text-align:justify;min-height:158px;" itemprop="description">DESCRIPTION1</div>
</div>
</td>
</tr>
</table>
</div>
</td>
</tr>
</table>
<table cellspacing="0" cellpadding="0" border="0" style="width:100%;">
<tr>
<td valign="top">
<div itemscope itemtype="URL">
<table cellspacing="0" cellpadding="0" style="width:100%;">
<tr>
<td valign="top" class="g-res-tab-cell" style="width:100%;">
<div style="width:100%;padding-left:4px;">
<div class="subtext_view_med" itemprop="name">
NAME2
</div>
<div style="direction:ltr;padding-left:5px;margin-bottom:2px;" class="smtext">
<span class="Gray">In English:</span> ENGLISH_NAME2
</div>
</div>
<div style="padding-bottom:2px;padding-top:8px;font-size:14px;text-align:justify;min-height:158px;" itemprop="description">DESCRIPTION2</div>
</td>
</tr>
</table>
</div>
</td>
</tr>
</table>
</div>
</div>
</td>
Now I'm stuck. I'm trying to parse the NAME, ENGLISH_NAME and DESCRIPTION for each block and print each of them, so the output will be:
name = NAME1
en_name = ENGLISH_NAME1
description = DESCRIPTION1
name = NAME2
en_name = ENGLISH_NAME2
description = DESCRIPTION2
I tried to read the docs, but I could not find how to handle nested tags, especially ones without a class or id. As I understand it, each block starts with <table cellspacing="0" cellpadding="0" border="0" style="width:100%;">. In each block I should find the a tag that has itemprop="url" and get the NAME. Then I should get the en_name from the text after <span class="Gray">In English:</span>, and the description from itemprop="description". But I feel like BeautifulSoup can't do it (or at least makes it very hard). How can I solve this?

You can iterate over each td with class g-res-tab-cell using soup.find_all:
from bs4 import BeautifulSoup as soup
d = soup(content, 'html.parser').td.find_all('td', {'class':'g-res-tab-cell'})
results = [[i.find('div', {'class':'subtext_view_med'}).a.text, i.find('div', {'class':'smtext'}).contents[1].text, i.find('div', {'itemprop':'description'}).text] for i in d]
Output:
[['NAME1', 'In English:', 'DESCRIPTION1'], ['NAME2', 'In English:', 'DESCRIPTION2']]
Edit: from link:
import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.sratim.co.il/browsenewmovies.php?page=1').text, 'html.parser')
movies = d.find_all('div', {'itemtype':'http://schema.org/Movie'})
result = [[getattr(i.find('a', {'itemprop':'url'}), 'text', 'N/A'), getattr(i.find('div', {'class':'smtext'}), 'text', 'N/A'), getattr(i.find('div', {'itemprop':'description'}), 'text', 'N/A')] for i in movies]

Here is another way. As that information is present for all films you should have a fully populated result set.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://www.sratim.co.il/browsenewmovies.php?page=1')
soup = bs(r.content, 'lxml')
names = [item.text for item in soup.select('[itemprop=url]')] #32
english_names = [item.next_sibling for item in soup.select('.smtext:contains("In English: ") span')]
descriptions = [item.text for item in soup.select('[itemprop=description]')]
results = list(zip(names, english_names, descriptions))
df = pd.DataFrame(results, columns = ['Name', 'English_Name', 'Description'])
print(df)
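
A further sketch, assuming the markup looks like the sample in the question (where the name sits in a div with itemprop="name" rather than in an a tag): treat each div carrying itemscope as one block and look up the three fields inside it. The trimmed HTML, the text_or_default helper and the "N/A" placeholder are illustrative assumptions, not part of the original page.

```python
from bs4 import BeautifulSoup

# Trimmed version of the sample output from the question.
html = """
<div itemscope itemtype="URL">
  <div class="subtext_view_med" itemprop="name">
    NAME1
  </div>
  <div class="smtext"><span class="Gray">In English:</span> ENGLISH_NAME1</div>
  <div itemprop="description">DESCRIPTION1</div>
</div>
<div itemscope itemtype="URL">
  <div class="subtext_view_med" itemprop="name">
    NAME2
  </div>
  <div class="smtext"><span class="Gray">In English:</span> ENGLISH_NAME2</div>
  <div itemprop="description">DESCRIPTION2</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_default(tag, default="N/A"):
    """Return the stripped text of a tag, or a default if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


records = []
for block in soup.select("div[itemscope]"):
    # The English name is the bare text node right after the "In English:" label.
    label = block.select_one("div.smtext span.Gray")
    sibling = label.next_sibling if label is not None else None
    en_name = sibling.strip() if isinstance(sibling, str) and sibling.strip() else "N/A"
    records.append({
        "name": text_or_default(block.select_one('[itemprop="name"]')),
        "en_name": en_name,
        "description": text_or_default(block.select_one('[itemprop="description"]')),
    })

for record in records:
    for key, value in record.items():
        print(f"{key} = {value}")
```

Collecting the fields per block, instead of zipping three page-wide lists, keeps each name paired with its own description even when one field is missing from a block.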

Python, Beautiful Soup: how to get the desired element

I am trying to reach a certain element while parsing the source code of a site.
This is a snippet of the part I'm trying to parse (shown here up to Friday); the structure is the same for all the days of the week:
<div id="intForecast">
<h2>Forecast for Rome</h2>
<table cellspacing="0" cellpadding="0" id="nonCA">
<tr>
<td onclick="showDetails('1');return false" id="day1" class="on">
<span>Thursday</span>
<div class="intIcon"><img src="http://icons.wunderground.com/graphics/conds/2005/sunny.gif" alt="sunny" /></div>
<div>Clear</div>
<div><span class="hi">H <span>22</span>°</span> / <span class="lo">L <span>11</span>°</span></div>
</td>
<td onclick="showDetails('2');return false" id="day2" class="off">
<span>Friday</span>
<div class="intIcon"><img src="http://icons.wunderground.com/graphics/conds/2005/partlycloudy.gif" alt="partlycloudy" /></div>
<div>Partly Cloudy</div>
<div><span class="hi">H <span>21</span>°</span> / <span class="lo">L <span>15</span>°</span></div>
</td>
</tr>
</table>
</div>
....and so on for all the days
I actually got my result, but in what I think is an ugly way:
forecastFriday = soup.find('span', text='Friday').findNext('div').findNext('div').string
As you can see, I go down through the elements by repeating .findNext('div') and finally arrive at .string.
I want to get the "Partly Cloudy" information for Friday.
Is there a more Pythonic way to do this?
Thanks!

Simply find all of the <td>s and iterate over them:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html, 'html.parser')
div = soup('div', {'id': 'intForecast'})[0]
tds = div.find('table').findAll('td')
for td in tds:
    day = td('span')[0].text
    forecast = td('div')[1].text
    print(day, forecast)
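
If only one day is needed, a small variation on the loop above is to build a dictionary keyed by day name. This is a minimal sketch that assumes every td under #intForecast has the same layout as in the question (day span, icon div, condition div, temperature div); the shortened image paths and the forecasts name are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div id="intForecast">
<h2>Forecast for Rome</h2>
<table cellspacing="0" cellpadding="0" id="nonCA">
<tr>
<td id="day1" class="on">
<span>Thursday</span>
<div class="intIcon"><img src="sunny.gif" alt="sunny" /></div>
<div>Clear</div>
<div><span class="hi">H <span>22</span>°</span> / <span class="lo">L <span>11</span>°</span></div>
</td>
<td id="day2" class="off">
<span>Friday</span>
<div class="intIcon"><img src="partlycloudy.gif" alt="partlycloudy" /></div>
<div>Partly Cloudy</div>
<div><span class="hi">H <span>21</span>°</span> / <span class="lo">L <span>15</span>°</span></div>
</td>
</tr>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

forecasts = {}
for td in soup.select("#intForecast td"):
    day = td.find("span").get_text(strip=True)
    # Direct <div> children are: icon, condition text, high/low temperatures.
    condition = td.find_all("div", recursive=False)[1].get_text(strip=True)
    forecasts[day] = condition

print(forecasts["Friday"])
```

Looking a day up by name avoids chaining findNext calls whose meaning depends on the exact order of the elements.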

Parsing for text under specific tags in HTML, Python

How can I find all the text on a page that meets these criteria using Beautiful Soup?
<tr>
<td class="d_g_l_e" style="border-right:none;">
<img src="/d2l/img/LP/pixel.gif" width="20" height="20" alt="">
</td>
<th scope="row" class="d_gt d_ich" style="border-left:none;">
<div class="dco">
<div class="dco_c">
<div class="dco">
<div class="dco_c">
<strong> **EXTRACT THIS (NAME)** </strong>
</div>
</div>
</div>
</div>
</th>
<td class="d_gn d_gr d_gt">
<div class="dco">
<div class="dco_c">
<div class="dco">
<div class="dco_c" style="text-align:right;">
<div style="text-align:center;display:inline;">
<label id="z_c"> **EXTRACT THIS (GRADE)** </label>
</div>
</div>
</div>
</div>
</div>
</td>
<td class="d_gn d_gr d_gt"> </td>
</tr>
I want the program to scan the whole HTML page and collect all of the values that appear in this form. If a "tr" tag (the main tag I'm looking for) has both a NAME and a GRADE underneath it, add the name to one list (List1) and the grade to a separate list (List2). If either of the two is missing underneath the "tr" tag, skip the row and don't record anything. By the time the script has finished scanning the page, the lists would look something like:
List1 = [Grade 1, Grade 2, Grade 3, Grade 4]
List2 = [10/20, 20/40, 50/50, 33/44]
Also, the "z" label ID for the grade text changes from grade to grade, e.g. z_a, z_b, z_c.

For each tr on the page, find strong tag inside the th and label tag inside the td:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for row in soup.find_all('tr'):
    name = row.select('th strong')
    grade = row.select('td label')
    if name and grade:
        print(name[0].text, grade[0].text)
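
To end up with the two lists described in the question, the same loop can append instead of print. The sketch below uses hypothetical rows modelled on the question's markup (the second row deliberately has no grade); names and grades stand in for List1 and List2.

```python
from bs4 import BeautifulSoup

# Hypothetical rows modelled on the question's markup; the second row has no grade.
html = """
<table>
<tr>
  <th scope="row" class="d_gt d_ich"><div class="dco"><strong> Grade 1 </strong></div></th>
  <td class="d_gn d_gr d_gt"><div class="dco"><label id="z_a"> 10/20 </label></div></td>
</tr>
<tr>
  <th scope="row" class="d_gt d_ich"><div class="dco"><strong> Grade 2 </strong></div></th>
  <td class="d_gn d_gr d_gt"> </td>
</tr>
<tr>
  <th scope="row" class="d_gt d_ich"><div class="dco"><strong> Grade 3 </strong></div></th>
  <td class="d_gn d_gr d_gt"><div class="dco"><label id="z_c"> 50/50 </label></div></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

names, grades = [], []
for row in soup.find_all("tr"):
    name = row.select_one("th strong")
    grade = row.select_one("td label")
    # Record a row only when both pieces are present.
    if name is not None and grade is not None:
        names.append(name.get_text(strip=True))
        grades.append(grade.get_text(strip=True))

print(names)
print(grades)
```

Because the selectors match on tag structure rather than on the label id, the changing z_a, z_b, z_c ids do not matter.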

Python xpath and conditionals

I'm trying to find all h3 elements with class="threadtitle" and, for each one that contains the text "NSW", return the value of its <a> element.
<h3 class="threadtitle">
<img border="0" alt="MARKET PLACE/AUCTIONS" src="vbcover/ibid/images/auction_open.png" title="MARKET PLACE/AUCTIONS">
<span class="prefix understate">
<b>
<font size="2" face="arial" color="#0000FF">NSW</font>
</b>
</span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
This is what I have so far:
I can find individual elements like this:
import requests
from lxml import etree, html
response = '''
<h3 class="threadtitle">
<img border="0" alt="MARKET PLACE/AUCTIONS" src="vbcover/ibid/images/auction_open.png" title="MARKET PLACE/AUCTIONS">
<span class="prefix understate">
<b>
<font size="2" face="arial" color="#0000FF">NSW</font>
</b>
</span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
'''
tree = html.fromstring(response)
test = tree.xpath("//font[text()='NSW']")
#or
test2 = tree.xpath("//h3[@class='threadtitle']")
for i in test:
    print(i.text)
NSW
But I don't know how to combine these.
The above example should return 'Banana man'

Try this XPath:
//h3[@class='threadtitle'][descendant::font/text() = 'NSW']/a/text()
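
A minimal runnable sketch of that expression with lxml. The first h3 is taken from the question; the second one (prefix "VIC", title "Apple woman") is a made-up entry added only to show that the predicate filters it out.

```python
from lxml import html

# Two thread titles; only the first carries the "NSW" prefix.
# The second <h3> is hypothetical and exists only to exercise the filter.
snippet = """
<div>
<h3 class="threadtitle">
<span class="prefix understate"><b><font size="2" face="arial" color="#0000FF">NSW</font></b></span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
<h3 class="threadtitle">
<span class="prefix understate"><b><font size="2" face="arial" color="#0000FF">VIC</font></b></span>
<a id="thread_title_5678" class="title" href="showthread.php?t=5678">Apple woman</a>
</h3>
</div>
"""

tree = html.fromstring(snippet)

# First predicate: the h3 must have class "threadtitle".
# Second predicate: some descendant <font> must have the text "NSW".
titles = tree.xpath(
    "//h3[@class='threadtitle'][descendant::font/text() = 'NSW']/a/text()"
)
print(titles)
```

Chaining two predicates on the same step is what combines the two separate queries from the question into one.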

Parse the date string from html in lxml

s = """
<tbody>
<tr>
<td style="border-bottom: none">
<span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
<span class="graytext" style="font-size: 11px">
05/13/09 2:02am
<br>
</span>
</td>
</tr>
</tbody>
"""
From the HTML string above, I need to extract the date string.
I tried it this way:
import lxml.html
doc = lxml.html.fromstring(s)
doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
But this is not working. I need to get only the date string.

Your query selects the span element; you need to grab the text from it:
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]
Most queries return a sequence, so I normally use a helper function that gets the first item.
from lxml import etree
s = """
<tbody>
<tr>
<td style="border-bottom: none">
<span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
<span class="graytext" style="font-size: 11px">
05/13/09 2:02am
<br>
</span>
</td>
</tr>
</tbody>
"""
doc = etree.HTML(s)
def first(sequence, default=None):
    for item in sequence:
        return item
    return default
Then:
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')
['\n 05/13/09 2:02am\n ']
>>> first(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()'),'').strip()
'05/13/09 2:02am'

Try the following instead of the last line:
print(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')[0])
The first part of the XPath expression is correct: //span[@class="graytext" and @style="font-size: 11px"] selects all matching span nodes. You then need to specify what you want to select from each node; text(), used here, selects the text content of the node.
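
Another option, sketched here under a couple of assumptions, is to let XPath do the trimming with normalize-space(), which returns a plain string instead of a list. The snippet wraps the question's rows in a table element for well-formedness, and the strptime format assumes the timestamp is month/day/two-digit-year on a 12-hour clock, which the question does not confirm.

```python
from datetime import datetime
from lxml import etree

s = """
<table>
<tbody>
<tr>
<td style="border-bottom: none">
<span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
<span class="graytext" style="font-size: 11px">
05/13/09 2:02am
<br>
</span>
</td>
</tr>
</tbody>
</table>
"""

doc = etree.HTML(s)

# normalize-space() converts the first matching node to a string and trims it,
# so the result is a plain string rather than a list of text nodes.
date_text = doc.xpath(
    'normalize-space(//span[@class="graytext" and @style="font-size: 11px"])'
)
print(date_text)

# Assumption: month/day/two-digit-year with a 12-hour clock and an am/pm suffix.
parsed = datetime.strptime(date_text, "%m/%d/%y %I:%M%p")
print(parsed)
```

If no span matches, normalize-space() yields an empty string, so it is worth checking date_text before calling strptime.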
