We have the following HTML:
<a class="link contact-info__link" href="tel:+99999999999">
    <svg class="icon icon--telephone contact-info__link-icon contact-info__link-icon--phone">
        <use xlink:href="/local/templates/.default/img/icon-font/icon-font.svg#icon-phone"></use>
    </svg>
    <span class="contact-info__link-text">+9 (999) 999-99-99</span>
</a>
I need to get this dictionary:
{"tel:+99999999999": "+9 (999) 999-99-99"}
That is, I need the href link and its corresponding text, regardless of how many "child" tags sit under the <a>. In this case the text is inside a span, but it could be a span or any other type of tag.
I am currently using this code to get all href + text from any page (as this is the goal):
for r in response.css('a'):
    url = r.css('::attr(href)').get()
    txt = r.css('::text').get()
That works for a simple case like this one, where the text sits directly inside the <a>:
This is my phone
But not when the markup is nested, as in the HTML above; in that case it just returns this:
{"tel:+99999999999": "\n"}
To get the whole text under a tag you can use the getall() method and then join all the pieces into one string. For example:
url = r.css('::attr(href)').get()
txt = r.css('::text').getall()
txt = ''.join([t.strip() for t in txt if t.strip()]) if txt else txt
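Putting it together, here is a minimal sketch of how this could build the desired {href: text} dictionary for every anchor on a page. It assumes a Scrapy spider; the spider name and start URL are placeholders, not from the question:

import scrapy

class LinksSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['https://example.com/']  # placeholder

    def parse(self, response):
        links = {}
        for r in response.css('a'):
            url = r.css('::attr(href)').get()
            # getall() collects text from every descendant node; the join drops whitespace-only pieces
            txt = ''.join(t.strip() for t in r.css('::text').getall() if t.strip())
            if url:
                links[url] = txt
        yield links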
Try this:
tel_s = response.css('a.link.contact-info__link')
yield {tel_s.css('::attr(href)').get(): tel_s.css('span::text').get()}
output:
{"tel:+99999999999": "+9 (999) 999-99-99"}
I am trying to read the value 9.692 from the following HTML:
<li><span class="tab-box">Deposit:</span> 9.692</li>
I can't seem to get the text outside the span tag. I can retrieve the "Deposit:" text with the following:
driver.find_elements_by_xpath("//span[@class='tab-box']")
The text 9.692 is in the parent <li>. You can get the <li> tag with this XPath:
driver.find_element_by_xpath("//li[.//span[@class='tab-box']]")
And remove the <span> text to get the result:
deposit_text = driver.find_element_by_xpath('//span[@class="tab-box"]').text
all_text = driver.find_element_by_xpath('//li[.//span[@class="tab-box"]]').text
number_text = all_text.replace(deposit_text, '').strip()
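If you prefer a single pass over the page source, an alternative sketch (assuming lxml is installed and driver.page_source holds the rendered HTML shown above) is to take the text node that follows the span directly:

from lxml import html

tree = html.fromstring(driver.page_source)
# the "9.692" is the text node that follows the <span> inside the <li>
tails = tree.xpath('//li[span[@class="tab-box"]]/text()')
number_text = ''.join(tails).strip()  # "9.692"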
I am trying to find one attribute value inside a div element that carries multiple data attributes, using BS4. The HTML is:
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk. My current code is:
soup = bs(size.text, "html.parser")
sizes = soup.find_all("div", {"class": "size"})
size = sizes[0]["data-test5-uk"]
size.text is from a GET request to the site with the HTML; however, it returns:
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Help is appreciated!
Explanation first, then the solution.
.find_all('tag') is used to find all instances of that tag, and we can later loop through them.
.find('tag') is used to find only the first instance.
We can extract an attribute's value either with ['arg'] or with .get('arg'); they are the same.
from bs4 import BeautifulSoup
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>'''
soup = BeautifulSoup(html, 'lxml')
one_div = soup.find('div', class_='size')
print(one_div.find('a')['data-test5-uk'])
# your code didn't work because you weren't looking inside the <a> tag
# the attribute lives on the <a>, so we find it first: .find('a')['data-test5-uk']
# for multiple divs
for each in soup.find_all('div', class_='size'):
    # we loop through each instance and do the same
    datauk = each.find('a')['data-test5-uk']
    print('data-test5-uk:', datauk)
Output:
data-test5-uk: 7
Additional
Why didn't your ['arg'] work? You tried to extract ["data-test5-uk"] from the div itself, but <div class="size "> has no attribute with that name; its only attribute is class="size ".
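If you want to avoid the KeyError entirely, .get() with a default is a forgiving alternative. A small sketch, reusing the soup built above:

# .get() returns None (or a supplied default) instead of raising KeyError
a_tag = soup.find('a', class_='selectSize')
print(a_tag.get('data-test5-uk'))             # 7
print(soup.div.get('data-test5-uk', 'n/a'))   # n/a - the div has no such attribute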
I'm scraping multiple pages that all have a similar format, but it changes a little here and there, and there are no classes I can use to search for what I need.
The format looks like this:
<div id="mainContent">
<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<span> More text I don't want</span>
<ul>...unordered-list items..</ul>
<p>Text I WANT</p>
<ol>...ordered-list items..</ol>
<p>Text I WANT</p>
<ol>...ordered-list items..</ol>
</div>
The number of ordered/unordered lists and other tags changes depending on the page, but what stays the same is I always want the text from the <p> tag that is the previous sibling of the <ol> tag.
What I'm trying (and isn't working) is:
main = soup.find("div", {"id": "mainContent"})
for d in main.children:
    if d.name == 'p' and d.nextSibling.name == 'ol':
        print(d.text)
    else:
        print("fail")
The output of this is fail for every iteration. In trying to figure out why this isn't working I tried:
for d in main.children:
    if d.name == 'p':
        print(d.nextSibling.name)
    else:
        print("fail")
The output of this is something like:
fail
None
fail
None
fail
None
fail
fail
fail
fail
fail
None
fail
etc...
Why is this not working like I think it would? How could I get the text from a <p> element only if the next tag is <ol>?
You want only the p tags that come right before an ol tag. Find the ol tags first, then walk back to the previous Tag object, which in this case is the p tag. Your code is not working because there are newlines between the Tag elements, and those are NavigableString objects; d.nextSibling yields those newlines as well, so you have to check the type of the object here.
from bs4 import Tag

# create soup
# find the ols
ols = soup.find_all('ol')
for ol in ols:
    prev = ol.previous_sibling
    while not isinstance(prev, Tag):
        prev = prev.previous_sibling
    print(prev.text)
This will give you the text you want.
Text I WANT
Text I WANT
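To see the newline issue described above, here is a tiny standalone sketch; the markup is a made-up fragment, not the OP's page:

from bs4 import BeautifulSoup, NavigableString

snippet = '<div><p>Text I WANT</p>\n<ol><li>item</li></ol></div>'
soup = BeautifulSoup(snippet, 'html.parser')
p = soup.p
print(repr(p.next_sibling))                          # '\n' - a NavigableString, not the <ol>
print(isinstance(p.next_sibling, NavigableString))   # True
print(p.find_next_sibling('ol').name)                # 'ol' - tag-aware navigation skips the newline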
You can use a CSS selector, i.e. ul ~ p, to find all the p tags preceded by the ul:
html = """<div id="mainContent">
<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<span> More text I don't want</span>
<ul>...unordered-list items..</ul>
<p>Text I WANT</p>
<ol>...ordered-list items..</ol>
<p>Text I WANT</p>
<ol>...ordered-list items..</ol>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print([p.text for p in soup.select("#mainContent ul ~ p")])
Which will give you:
['Text I WANT', 'Text I WANT']
Or find the ol's and then look for the previous sibling p:
print([ol.find_previous_sibling("p").text for ol in soup.select("#mainContent ol")])
Which would also give you:
['Text I WANT', 'Text I WANT']
I have an XML file from which I would like to extract heading tags (h1, h2, ..., as well as their text) that sit between a closing </span> and the next <span class='classy'> tag (in that order). I want to do this in Python 2.7, and I have tried BeautifulSoup and ElementTree but couldn't work it out.
The file contains sections like this:
<section>
<p>There is some text <span class='classy' data-id='234'></span> and there is more text now.</p>
<h1>A title</h1>
<p>A paragraph</p>
<h2>Some second title</h2>
<p>Another paragraph with random tags like <img />, <table> or <div></p>
<div>More random content</div>
<h2>Other title.</h2>
<p>Then more paragraph <span class='classy' data-id='235'></span> with other stuff.</p>
<h2>New title</h2>
<p>Blhablah, followed by a div like that:</p>
<div class='classy' data-id='236'></div>
<p>More text</p>
<h3>A new title</h3>
</section>
I would like to write a csv file like this:
data-id,heading.name,heading.text
234,h1,A title
234,h2,Some second title
234,h2,Another title.
235,h2,New title
236,h3,A new title
and ideally I would write this:
id,h1,h2,h3
234,A title,Some second title,
234,A title,Another title,
235,A title,New title,
236,A title,New title,A new title
but then I guess I can always reshape it afterwards.
I have tried to iterate through the file, but I only seem to be able to keep all the text without the heading tags. Also, to make things more annoying, sometimes it is not a span but a div, which has the same class and attributes.
Any suggestion on what would be the best tool for this in Python?
I've two pieces of code that work:
- finding the text with itertools.takewhile
- finding all h1,h2,h3 but without the span id.
soup = BeautifulSoup(open(xmlfile, 'r'), 'lxml')
spans = soup('span', {'class': 'page-break'})
for el in spans:
    els = [i for i in itertools.takewhile(lambda x: x not in [el, 'script'], el.next_siblings)]
    print els
This gives me a list of text contained between spans. I wanted to iterate through it, but there are no more html tags.
To find the h1,h2,h3 I used:
with open('titles.csv', 'wb') as f:
    csv_writer = csv.writer(f)
    for header in soup.find_all(['h1', 'h2', 'h3']):
        if header.name == 'h1':
            h1text = header.get_text()
        elif header.name == 'h2':
            h2text = header.get_text()
        elif header.name == 'h3':
            h3text = header.get_text()
        csv_writer.writerow([h1text, h2text, h3text, header.name])
I've now tried with xpath without much luck.
Since it's an xhtml document, I used:
from lxml import etree

with open('myfile.xml', 'rt') as f:
    tree = etree.parse(f)

root = tree.getroot()
spans = root.xpath('//xhtml:span', namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
This gives me the list of span objects, but I don't know how to iterate over the content between two spans.
Any suggestion?
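One possible approach is sketched below: walk the document in order, remember the data-id of the last span or div with class "classy", and emit one CSV row per heading seen after it. The file names are placeholders, and it assumes BeautifulSoup with lxml can parse the document:

import csv
from bs4 import BeautifulSoup

with open('myfile.xml') as f:            # placeholder file name
    soup = BeautifulSoup(f, 'lxml')

rows = []
current_id = None
# spans/divs with class "classy" and the headings are visited in document order
for tag in soup.find_all(['span', 'div', 'h1', 'h2', 'h3']):
    if tag.name in ('span', 'div') and 'classy' in tag.get('class', []):
        current_id = tag.get('data-id')
    elif tag.name in ('h1', 'h2', 'h3') and current_id is not None:
        rows.append([current_id, tag.name, tag.get_text(strip=True)])

with open('titles.csv', 'wb') as f:      # 'wb' because the question targets Python 2.7
    writer = csv.writer(f)
    writer.writerow(['data-id', 'heading.name', 'heading.text'])
    writer.writerows(rows)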
I am using BeautifulSoup to get the title of a book from a goodreads page.
Sample HTML -
<td class="field title"><a href="/book/show/12996.Othello" title="Othello">
Othello
</a></td>
I want to get the text between the anchor tags. Using the code below, I can get all the children of the <td> with class="field title" in list form.
for txt in soup.findAll('td', {'class': "field title"}):
    child = txt.findAll('a')
which gives this output:
[<a href="/book/show/12996.Othello" title="Othello">
Othello
</a>]
...
How do I get just the 'Othello' part? This regex doesn't work:
for ch in child:
    match = re.search(r"([.]*)title=\"<name>\"([.]*)", str(ch))
    print(match.group('name'))
Edited:
Just print the text of txt (thanks to @angurar for clarifying OP's requirements):
for txt in soup.findAll('td', {'class': "field title"}):
    print txt.string
Or if you're after the title attribute of <a>:
for txt in soup.findAll('td', {'class': "field title"}):
    print [a.get('title') for a in txt.findAll('a')]
It will return a list of the title attribute of every <a>.
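For completeness, a small end-to-end sketch using the sample HTML from the question:

from bs4 import BeautifulSoup

html = '''<td class="field title"><a href="/book/show/12996.Othello" title="Othello">
Othello
</a></td>'''
soup = BeautifulSoup(html, 'html.parser')
for td in soup.findAll('td', {'class': 'field title'}):
    print(td.get_text(strip=True))   # Othello
    print(td.a['title'])             # Othello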