I am trying to read the value 9.692 from the following HTML:
<li><span class="tab-box">Deposit:</span> 9.692</li>
I can't seem to get the text outside the <span> tag. I can retrieve the "Deposit:" text with the following:
driver.find_elements_by_xpath("//span[@class='tab-box']")
The text 9.692 is in the parent <li>. You can get the <li> tag with this XPath:
driver.find_elements_by_xpath("//li[.//span[@class='tab-box']]")
Then remove the <span> text to get the result:
deposit_text = driver.find_element_by_xpath('//span[@class="tab-box"]').text
all_text = driver.find_element_by_xpath('//li[.//span[@class="tab-box"]]').text
number_text = all_text.replace(deposit_text, '')
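With the sample <li> above, all_text should be "Deposit: 9.692", so number_text should come out as " 9.692"; call .strip() on it if you want just "9.692".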
This is the HTML code:
<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>
I used it like this:
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    print(item.find('div', class_='salary-snippet'))
I got a list of results such as:
<div aria-label="RM 3,500 to RM 8,000 a month" class="salary-snippet"><span>RM 3,500 - RM 8,000 a month</span></div>
If I use
print(item.find('div', class_='salary-snippet').text.strip())
it returns the error:
AttributeError: 'NoneType' object has no attribute 'text'
So how can I get only the <span> text? It's my first time web scraping.
Maybe this is what you are looking for.
First, select all the <div> tags with the class salary-snippet, as this is the parent of the <span> tag you are looking for. Use .find_all().
Now iterate over all the selected <div> tags and find the <span> inside each one.
Based on your question, I assume that not all of these <div> tags have a <span>. In that case, print the text only if the <div> contains a <span> tag. See below:
# Find all the divs
d = soup.find_all('div', class_='salary-snippet')

# Iterate over the <div> tags
for item in d:
    # Find the <span> in each item. If it does not exist, x will be None
    x = item.find('span')
    # Print only if x is not None
    if x:
        print(x.text.strip())
Here is the complete code.
from bs4 import BeautifulSoup
s = """<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>"""
soup = BeautifulSoup(s, 'lxml')
d = soup.find_all('div', class_='salary-snippet')
for item in d:
    x = item.find('span')
    if x:
        print(x.text.strip())
RM 6,000 a month
I believe the line should be:
print(item.find('div', {'class':'salary-snippet'}).text.strip())
Alternatively, if there is only the span you can simply use:
item.find("span").text.strip()
Considering you used the .find_all() method, you might want to ensure that every div returned by
soup.find_all('div', class_='job_seen_beacon')
contains the element you are looking for, as this error could arise if even one of them doesn't.
i.e.
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    try:
        print(item.find('div', {'class':'salary-snippet'}).text.strip())
    except AttributeError:
        print("Item Not available")
This will try to get the text, but if that fails it will print "Item Not available" for the item that failed so you can identify why... perhaps it doesn't have the element you are searching for.
We have the following HTML:
<a class="link contact-info__link" href="tel:+99999999999">
<svg class="icon icon--telephone contact-info__link-icon contact-info__link-icon--phone">
<use xlink:href="/local/templates/.default/img/icon-font/icon-font.svg#icon-phone"></use>
</svg>
<span class="contact-info__link-text">+9 (999) 999-99-99</span>
</a>
I need to get this dictionary:
{"tel:+99999999999": "+9 (999) 999-99-99"}
That is, I need the href link and the respective text, regardless of how many "child" tags there are after the href. In this case, I need the href link itself and the text in the span, but consider that it could be span or any other type of tag.
I am currently using this code to get all href + text from any page (as this is the goal):
for r in response.css('a'):
    url = r.css('::attr(href)').get()
    txt = r.css('::text').get()
That works for simple cases where the text sits directly inside the <a> tag, e.g. a link whose visible text is just "This is my phone".
But not when the content is nested, as in the HTML above; there it just returns this:
{"tel:+99999999999": "\n"}
To get the whole text under a tag you can use the getall() method and then join all the text into one string.
You can use this example:
url = r.css('::attr(href)').get()
txt = r.css('::text').getall()
txt = ''.join([t.strip() for t in txt if t.strip()]) if txt else txt
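Putting that together with your original loop, here is a minimal sketch (assuming `response` is the Scrapy response for the page) that builds the dictionary you described:

results = {}
for r in response.css('a'):
    url = r.css('::attr(href)').get()
    # getall() collects the text of all descendants, not just the direct children
    texts = r.css('::text').getall()
    results[url] = ''.join(t.strip() for t in texts if t.strip())
# e.g. {"tel:+99999999999": "+9 (999) 999-99-99", ...}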
Try this
tel_s = response.css('a.link.contact-info__link')
yield {tel_s.css('::attr(href)').get(): tel_s.css('span::text').get()}
output:
{"tel:+99999999999": "+9 (999) 999-99-99"}
I have an XML file from which I would like to extract heading tags (h1, h2, ... as well as their text) which fall between </span> ... <span class='classy'> tags (in that order). I want to do this in Python 2.7, and I have tried BeautifulSoup and ElementTree but couldn't work it out.
The file contains sections like this:
<section>
<p>There is some text <span class='classy' data-id='234'></span> and there is more text now.</p>
<h1>A title</h1>
<p>A paragraph</p>
<h2>Some second title</h2>
<p>Another paragraph with random tags like <img />, <table> or <div></p>
<div>More random content</div>
<h2>Other title.</h2>
<p>Then more paragraph <span class='classy' data-id='235'></span> with other stuff.</p>
<h2>New title</h2>
<p>Blhablah, followed by a div like that:</p>
<div class='classy' data-id='236'></div>
<p>More text</p>
<h3>A new title</h3>
</section>
I would like to write in a csv file like this:
data-id,heading.name,heading.text
234,h1,A title
234,h2,Some second title
234,h2,Another title.
235,h2,New title
236,h3,A new title
and ideally I would write this:
id,h1,h2,h3
234,A title,Some second title,
234,A title,Another title,
235,A title,New title,
236,A title,New title,A new title
but then I guess I can always reshape it afterwards.
I have tried to iterate through the file, but I only seem to be able to keep all the text without the heading tags. Also, to make things more annoying, sometimes it is not a span but a div, which has the same class and attributes.
Any suggestion on what would be the best tool for this in Python?
I have two pieces of code that partly work:
- finding the text with itertools.takewhile
- finding all h1, h2, h3, but without the span id.
from bs4 import BeautifulSoup
import itertools

soup = BeautifulSoup(open(xmlfile, 'r'), 'lxml')
spans = soup('span', {'class': 'classy'})
for el in spans:
    els = [i for i in itertools.takewhile(lambda x: x not in [el, 'script'], el.next_siblings)]
    print els
This gives me a list of the text contained between spans. I wanted to iterate through it, but there are no more HTML tags.
To find the h1,h2,h3 I used:
import csv

h1text = h2text = h3text = ''
with open('titles.csv', 'wb') as f:
    csv_writer = csv.writer(f)
    for header in soup.find_all(['h1', 'h2', 'h3']):
        if header.name == 'h1':
            h1text = header.get_text()
        elif header.name == 'h2':
            h2text = header.get_text()
        elif header.name == 'h3':
            h3text = header.get_text()
        csv_writer.writerow([h1text, h2text, h3text, header.name])
I've now tried with XPath, without much luck.
Since it's an xhtml document, I used:
from lxml import etree
with open('myfile.xml', 'rt') as f:
    tree = etree.parse(f)
root = tree.getroot()
spans = root.xpath('//xhtml:span', namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
This gives me the list of span objects, but I don't know how to iterate between two spans.
Any suggestion?
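One possible approach, as a sketch only (not from the original thread), assuming the sample <section> above and that the markers can be either <span> or <div> with class 'classy': walk the section's elements in document order, remember the data-id of the last marker seen, and record every heading together with that id.

# Minimal sketch for Python 2.7 with BeautifulSoup; 'myfile.xml' is the file from the question.
import csv
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(open('myfile.xml', 'r'), 'lxml')

rows = []
current_id = None
for el in soup.section.descendants:
    if not isinstance(el, Tag):
        continue
    # class 'classy' marks a new data-id; this covers both <span> and <div> markers
    if el.get('class') == ['classy']:
        current_id = el.get('data-id')
    elif el.name in ('h1', 'h2', 'h3') and current_id is not None:
        rows.append([current_id, el.name, el.get_text().strip()])

with open('titles.csv', 'wb') as f:  # Python 2: open CSV files in 'wb'
    writer = csv.writer(f)
    writer.writerow(['data-id', 'heading.name', 'heading.text'])
    writer.writerows(rows)

With the sample section this should produce rows like "234,h1,A title" and "235,h2,New title", which can then be reshaped into the wide format afterwards.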
I am using BeautifulSoup to get the title of a book from a goodreads page.
Sample HTML -
<td class="field title"><a href="/book/show/12996.Othello" title="Othello">
Othello
</a></td>
I want to get the text between the anchor tags. Using the code below, I can get all the children of the <td> with class="field title" as a list:
for txt in soup.findAll('td',{'class':"field title"}):
    child = txt.findAll('a')
which gives this output:
[<a href="/book/show/12996.Othello" title="Othello">
Othello
</a>]
...
How do I get the 'Othello' part only? This regex doesn't work -
for ch in child:
    match = re.search(r"([.]*)title=\"<name>\"([.]*)", str(ch))
    print(match.group('name'))
Edited:
Just print the text of txt (thanks to @angurar for clarifying OP's requirements):
for txt in soup.findAll('td',{'class':"field title"}):
    print txt.string
Or if you're after the title attribute of <a>:
for txt in soup.findAll('td',{'class':"field title"}):
    print [a.get('title') for a in txt.findAll('a')]
It will return a list of each <a> tag's title attribute.
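For reference (not part of the original answer): with the sample <td> above, the second snippet should print ['Othello'], and for the first one txt.string.strip() (or txt.get_text(strip=True)) will drop the whitespace around "Othello".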
If I have the following bs4 element, called tab_window_uls[1]:
<ul>
<li><b>Cut:</b> Sits low on the waist.</li>
<li><b>Fit:</b> Skinny through the leg.</li>
<li><b>Leg opening:</b> Skinny.</li>
</ul>
How can I add a new <li> to the <ul>?
Currently my code looks like:
lines = ['a', 'b']
li_tag = tab_window_uls[1].new_tag('li')
for i in lines:
    li_tag.string = i
    tab_window_uls[1].b.string.insert_before(li_tag)
You have to create a new tag, as I did, and insert it into the <ul>. I load the soup and create a tag, append one tag inside the other (the <b> tag within the <li> tag), then select the <ul> and insert the newly created <li> into the tree at whatever position you want. NOTE: you can't place it at the end this way; if you want it to be the last <li> in the list, use append.
from bs4 import BeautifulSoup
htmlText = '''
<ul>
<li><b>Cut:</b> Sits low on the waist.</li>
<li><b>Fit:</b> Skinny through the leg.</li>
<li><b>Leg opening:</b> Skinny.</li>
</ul>
'''
bs = BeautifulSoup(htmlText, 'html.parser')
li_new_tag = bs.new_tag('li')
li_new_tag.string = 'Size:'
b_new_tag = bs.new_tag('b')
b_new_tag.string = '0 through 12.'
li_new_tag.append(b_new_tag)
tags = bs.ul
tags.insert(1, li_new_tag)
print bs.prettify()
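If you instead want the new item to be the last <li> in the list, as noted above, append it to the <ul> (a small usage example, not from the original answer):
bs.ul.append(li_new_tag)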