I don't know how to scrape this text:
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB
Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
<div class="npi_name">
<h2>
<a href="/solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html">
<span style="color:red">Stoc limitat!</span>
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
</a>
</h2>
</div>
What I've tried:
for n in j.find_all("div", "npi_name"):
    n2 = n.find("a", href=True, text=True)
    try:
        n1 = n2['href']
    except:
        n2 = n.find("a")
        n1 = n2['href']
    n3 = n2.string
    print(n3)
Output:
None
Try:
from bs4 import BeautifulSoup
html_doc = """
<div class="npi_name">
<h2>
<a href="/solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html">
<span style="color:red">Stoc limitat!</span>
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
</a>
</h2>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
t = "".join(soup.select_one(".npi_name a").find_all(text=True, recursive=False))
print(t.strip())
Prints:
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
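The recursive=False argument is what skips the span's "Stoc limitat!" text: only the anchor's direct string children are joined. As a side note, newer bs4 releases (4.4+) accept string= as the preferred name for the text= filter, so this spelling should be equivalent:

t = "".join(soup.select_one(".npi_name a").find_all(string=True, recursive=False))
print(t.strip())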
I've made a few assumptions but something like this should work:
for n in j.find_all("div", {"class": "npi_name"}):
    print(n.find("a").contents[2].strip())
This is how I arrived at my answer (the HTML you provided was entered into a.html):
from bs4 import BeautifulSoup

def main():
    with open("a.html", "r") as file:
        html = file.read()
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", {"class": "npi_name"})
    for div in divs:
        a = div.find("a").contents[2].strip()
        # Testing
        print(a)

if __name__ == "__main__":
    main()
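For reference, .contents[2] works because the anchor holds exactly three nodes: a leading newline string, the span, and the product-name string. A quick way to check this (a sketch, reusing the soup built above):

a = soup.find("div", {"class": "npi_name"}).find("a")
for i, node in enumerate(a.contents):
    print(i, repr(node)[:60])
# 0 '\n'
# 1 <span style="color:red">Stoc limitat!</span>
# 2 '\nTelefon Mobil Apple iPhone 13, Super Retina XDR OLED ...'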
Another option is to take the last node of each anchor's contents:

texts = []
for a in soup.select("div.npi_name a[href]"):
    texts.append(a.contents[-1].strip())
Or more explicitly:
texts = []
for a in soup.select("div.npi_name a[href]"):
    if a.span:
        text = a.span.next_sibling
    else:
        text = a.string
    texts.append(text.strip())
Select your elements more specifically, e.g. with CSS selectors, and use stripped_strings to get the text, assuming it is always the last node in your element:

for e in soup.select('div.npi_name a[href]'):
    text = list(e.stripped_strings)[-1]
    print(text)

This way you can also process other information if needed, e.g. the href, the span text, and so on.
Example
Select multiple items, store the information in a list of dicts, and convert it into a DataFrame:
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<div class="npi_name">
<h2>
<a href="/solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html">
<span style="color:red">Stoc limitat!</span>
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
</a>
</h2>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('div.npi_name a[href]'):
    data.append({
        'url': e['href'],
        'stock': s.text if (s := e.span) else None,
        'label': list(e.stripped_strings)[-1]
    })
pd.DataFrame(data)
Output

url       /solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html
stock     Stoc limitat!
label     Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
Related
I have the following HTML for an item in a list of products:
<div class="nice_product_item">
<div class="npi_name">
<h2>
<a href="/solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html">
<span style="color:red">Stoc limitat!</span>
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
</a>
</h2>
</div>
<div class="price_block_list">
<span class="old_price"> 999,00 Lei </span>
<span class="price_discount">-12%</span>
<span class="cheaper_by">mai ieftin cu 120,00 lei</span>
<span class="real_price">879,00 Lei</span>
<span class="evo-credit">evoCREDIT</span></div>
</div>
</div>
Some products have the price_discount span, while others don't:
<span class="price_discount">-12%</span>
I use the following code to scrape the names of products:
texts = []
for a in soup.select("div.npi_name a[href]"):
    if a.span:
        text = a.span.next_sibling
    else:
        text = a.string
    texts.append(text.strip())
I don't know what conditions I need to get the names of the products with discounts.
Note: it has to work for a list of products.
A way to process the data could be to select all items with discounts:
soup.select('div.nice_product_item:has(.price_discount):has(a[href])')
Iterate over the ResultSet, pick the information you need, and store it in a structured way, like a list of dicts, so you can process it later, e.g. as a DataFrame saved to CSV, JSON, ...
Example
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<div class="nice_product_item">
<div class="npi_name">
<h2>
<a href="/solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html">
<span style="color:red">Stoc limitat!</span>
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
</a>
</h2>
</div>
<div class="price_block_list">
<span class="old_price"> 999,00 Lei </span>
<span class="price_discount">-12%</span>
<span class="cheaper_by">mai ieftin cu 120,00 lei</span>
<span class="real_price">879,00 Lei</span>
<span class="evo-credit">evoCREDIT</span></div>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('div.nice_product_item:has(.price_discount):has(a[href])'):
    data.append({
        'url': e.a['href'],
        'label': s[-1] if (s := list(e.a.stripped_strings)) else None,
        'price': s.text if (s := e.select_one('span.real_price')) else None,
        'discount': s.text if (s := e.select_one('span.price_discount')) else None,
        'other': 'edit for elements you need'
    })
pd.DataFrame(data)
Output

url       /solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html
label     Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
price     879,00 Lei
discount  -12%
other     edit for elements you need
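If you then want to save the collected data as suggested above, pandas' writers cover CSV and JSON (the file names here are placeholders):

df = pd.DataFrame(data)
df.to_csv('products.csv', index=False)
df.to_json('products.json', orient='records', force_ascii=False)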
I am currently trying to scrape football match data from the following URL:
https://liveonsat.com/uk-england-all-football.php
I am able to scrape the match names, start times and channel names correctly. Unfortunately I seem to be having an issue scraping the correct match date. With help from Stack Overflow I previously identified that the element containing the match date can be reached with
parent.find
The issue I am having, though, is that the first date scraped persists through all the matches, even when a particular game is not on that date. For instance, if I run the code today, it shows the match date for all matches as Saturday 11th July, even though some of the scraped matches are on different dates.
I am unsure at this point what the problem could be and would be extremely grateful if someone could assist me or point me in the right direction. I first thought the problem was the HTML element selected to grab the match date, but when I changed it to earlier parent elements to test, no date was scraped at all. So the element currently selected seems correct, but it is possibly not used correctly by me.
To help, I have left a comment beside the match date element that I am having the issue with.
import requests
import time
import csv
import sys
from bs4 import BeautifulSoup
import tkinter as tk
from tkinter import messagebox
from tkinter import *
from PIL import ImageTk, Image
def makesoup(url):
    page = requests.get(url)
    return BeautifulSoup(page.text, "lxml")

def matchscrape(g_data):
    for match in g_data:
        competitors = match.find('div', class_='fix').text
        match_date = match.parent.find('h2', class_='time_head').text  # this is used to scrape the match date, as it is not contained within "div", {"class": "blockfix"}
        match_time = match.find('div', class_='fLeft_time_live').text.strip()
        print("Competitors ", competitors)
        print("Match date", match_date)
        print("Match time", match_time)
        # Match time
        channel = match.find_all("td", {"class": "chan_col"})
        for i in channel:
            print(i.get_text().strip())

def matches():
    soup = makesoup(url="https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data=soup.findAll("div", {"class": "blockfix"}))

root = tk.Tk()
root.resizable(False, False)
root.geometry("600x600")
root.wm_title("liveonsat scraper")
Label = tk.Label(root, text='liveonsat scraper', font=('Comic Sans MS', 18))
button = tk.Button(root, text="Scrape Matches", command=matches)
button3 = tk.Button(root, text="Quit Program", command=quit)
Label.pack()
button.pack()
button3.pack()
status_label = tk.Label(text="")
status_label.pack()
root.mainloop()
Below is the relevant example HTML code of the site I am scraping:
<div style="clear:right"> <div class=floatAndClearL><h2 class = sport_head >Football</h2></div> <!-- sport_head -->
<div class=floatAndClearL><h2 class = time_head>Saturday, 11th July</h2></div> <!-- time_head --> <div><span class = comp_head>English Championship - Week 43</span></div>
<div class = blockfix > <!-- block 1-->
<div class=fix> <!-- around fixture and notes 2-->
<div class=fix_text> <!-- around fixture text 3-->
<div class = imgCenter><span><img src="../img/team/england.gif"></span></div>
<div class = fLeft style="width:270px;text-align:center;background-color:#ffd379;color:#800000;font-size:10pt;font-family:Tahoma, Geneva, sans-serif">Derby County v Brentford</div>
<div class = imgCenter><img src="../img/team/england.gif"></div>
</div> <!-- around fixture text 3 ENDS-->
<div class=notes></div>
</div> <!-- around fixture and notes 2 ENDS-->
<div class = fLeft> <!-- around all of channel types 2--> <div> <!-- around channel type group 3-->
<div class=fLeft_icon_live_l> <!-- around icon 4-->
<img src="../img/icon/live3.png"/>
</div>
<div class=fLeft_time_live> <!-- around icon 4-->
ST: 12:30
</div> <!-- around icon 4 ENDS--> <div class = fLeft_live> <!-- around all tables of a channel type 4--> <table border="0" cellspacing="0" cellpadding="0"><tr><td class=chan_col> <a href="https://connect.bein.net/" target="_blank" class = chan_live_iptvcable> beIN Connect MENA 📺</a></td><td width = 0></td>
</tr></table> <table border="0" cellspacing="0" cellpadding="0"><tr><td class=chan_col> <a href="https://tr.beinsports.com/kullanici/giris?ReturnUrl=" target="_blank" class = chan_live_iptvcable> beIN Connect TURKEY 📺</a></td><td width = 0></td>
Instead of .parent.find() use .find_previous(), because the parent is common (and thus the same) for all <div class="blockfix"> elements:
import requests
from bs4 import BeautifulSoup
url = 'https://liveonsat.com/uk-england-all-football.php'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for match in soup.select('div.blockfix'):
    competitors = match.find('div', class_='fix').text.strip()
    match_date = match.find_previous('h2', class_='time_head').text.strip()  # <-- use .find_previous()
    match_time = match.find('div', class_='fLeft_time_live').text.strip()
    channels = match.select('.chan_col')

    print("Competitors ", competitors)
    print("Match date", match_date)
    print("Match time", match_time)
    print('Channels:\n\t' + '\n\t'.join(c.get_text(strip=True) for c in channels))
    print('-' * 80)
Prints:
Competitors Derby County v Brentford
Match date Saturday, 11th July
Match time ST: 13:30
Channels:
beIN Connect MENA 📺
beIN Connect TURKEY 📺
beIN Sports MENA 5 HD
beIN Sports Turkey 4 HD
Eleven Sports 1 Portugal HD
Nova Sport (serbia) HD
Nova Sports 1 HD (Cyprus)
Nova Sports 1 HD (Hellas)
Sky Sports Football UK / HD
Sport 4 Israel / HD
Sportdigital TV HD
SportsMax 2 HD
Stöd 2 Sport 2 / HD
SuperSport 9 RSA
Telekanal Futbol
TV3 Sport HD Sweden
V Sport 1 HD (norge)
V Sport Extra HD (sweden)
ViaPlay (denmark) / HD
ViaPlay (finland) / HD
ViaPlay (norway) / HD
ViaPlay (sweden) / HD
--------------------------------------------------------------------------------
Competitors Watford v Newcastle United
Match date Saturday, 11th July
Match time ST: 13:30
Channels:
Amazon Prime UK Only [$]
beIN Connect MENA 📺
beIN Sports MENA 12 HD
beIN Sports MENA 2 HD
Belarus 5 TV
Canal+ Now HD (poland)
Cosmote Sport 7 HD
Cytavision Sports 1 HD
DAZN Canada [$] (geo/R)
DAZN España [$] (geo/R)
Diema Sport 2 HD
ESPN Brasil HD
EuroSport 1 Romania / HD
Premier Sports 1 HD (ROI only)
QazSport / HD
RMC Sport 2 HD
Setanta Qazaqstan HD
Setanta Sports Ukraine+ HD
Sky Sport 1 / HD Germany
Sky Sport Arena Italia / HD
Sky Sport Austria 1 HD
Sky Sport Football Italia / HD
Sport 2 Israel / HD
Sport TV2 (portugal) / HD
SportKlub 2 (serbia) HD
Spíler 1 TV / HD
SuperSport 4 RSA / HD
TRT Spor / HD 📺
TSN Malta 2 HD
TV2 Sport Premium 2 HD
TV2sumo.no [$] (geo/R)
TV3 MAX (denmark) / HD
V Sport Premium HD
V Sport Urheilu / HD
ViaPlay (denmark) / HD
ViaPlay (finland) / HD
ViaPlay (sweden) / HD
VOOsport World 1 / HD
--------------------------------------------------------------------------------
... and so on.
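To see why .parent.find() repeats the first date, here is a minimal, self-contained demo (hypothetical two-date HTML):

from bs4 import BeautifulSoup

html = '''
<div>
  <h2 class="time_head">Saturday, 11th July</h2>
  <div class="blockfix">Match A</div>
  <h2 class="time_head">Sunday, 12th July</h2>
  <div class="blockfix">Match B</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
for match in soup.select('div.blockfix'):
    via_parent = match.parent.find('h2', class_='time_head').text      # always the first h2 in the shared parent
    via_previous = match.find_previous('h2', class_='time_head').text  # nearest h2 before this block
    print(match.text, '->', via_parent, '|', via_previous)
# Match A -> Saturday, 11th July | Saturday, 11th July
# Match B -> Saturday, 11th July | Sunday, 12th July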
I am trying to extract information from the latest SEC EDGAR Schedule 13 forms filings.
Link of the filing as an example:
1) Saba Capital_27-Dec-2019_SC13
The information I am trying to extract (and the parts of the filing containing it):
1) Names of reporting persons: Saba Capital Management, L.P.
<p style="margin-bottom: 0pt;">NAME OF REPORTING PERSON</p>
<p style="margin-top: 0pt; margin-left: 18pt;">Saba Capital Management GP, LLC<br><br/>
2) Name of issuer : WESTERN ASSET HIGH INCOME FUND II INC
<p style="text-align: center;"><b><font size="5"><u>WESTERN ASSET HIGH INCOME FUND II INC.</u></font><u><br/></u>(Name of Issuer)</b>
3) CUSIP Number: 95766J102 (managed to get)
<p style="text-align: center;"><b><u>95766J102<br/></u>(CUSIP Number)</b>
4) Percentage of class represented by amount : 11.3% (managed to get)
<p style="margin-bottom: 0pt;">PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)</p>
<p style="margin-top: 0pt; margin-left: 18pt;">11.3%<br><br/>
5) Date of Event Which requires filing of this statement: December 24, 2019
<p style="text-align: center;"><b><u>December 24, 2019<br/></u>(Date of Event Which Requires Filing of This Statement)</b>
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'xml')
## get CUSIP number
CUSIP = re.findall(r'[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*##]{3}[0-9]', soup.text)
### get %
regex = r"(?<=PERCENT OF CLASS|Percent of class)(.*)(?=%)"
percent = re.findall(r'\d+.\d+', re.search(regex, soup.text, re.DOTALL).group().split('%')[0])
How can I extract the 5 pieces of information from the filing? Thanks in advance
Try the following code to get all the values, using find() and the CSS selector method select_one():
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-child(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-child(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-child(11) > b > u").text.strip()
print(Dateof)
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
UPDATED
If you don't want to rely on the element's position, try the version below.
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:contains(Issuer)').find_next('u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one('p:contains(CUSIP)').find_next('u').text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one('p:contains(Event)').find_next('u').text.strip()
print(Dateof)
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
Update 2, using nth-of-type (which counts only elements of the same tag among siblings) instead of nth-child:
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-of-type(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-of-type(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-of-type(11) > b > u").text.strip()
print(Dateof)
Using lxml, it should work this way:
import requests
import lxml.html
url = 'https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)
name = doc.xpath('//p[text()="NAME OF REPORTING PERSON"]/following-sibling::p/text()')[0]
issuer = doc.xpath('//p[contains(text(),"(Name of Issuer)")]//u/text()')[0]
cusip = doc.xpath('//p[contains(text(),"(CUSIP Number)")]//u/text()')[0]
perc = doc.xpath('//p[contains(text(),"PERCENT OF CLASS REPRESENTED")]/following-sibling::p/text()')[0]
event = doc.xpath('//p[contains(text(),"(Date of Event Which Requires")]//u/text()')[0]
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
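One caveat when re-running any of the snippets above: SEC.gov now rejects scripted requests that do not declare a User-Agent, so if requests.get returns a 403, pass an identifying header (the value below is a placeholder):

headers = {'User-Agent': 'Sample Company admin@example.com'}
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm', headers=headers)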
I can't figure out how to get the title on the anchor.
Here is my code:
from flask import Flask
import requests
from bs4 import BeautifulSoup
laptops = 'http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
def scrape():
    page = requests.get('http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops')
    soup = BeautifulSoup(page.content, "lxml")
    links = soup("a", {"class": "title"})
    for link in links:
        print(link.prettify())

scrape()
Example of result:
<a class="title" href="/test-sites/e-commerce/allinone/product/251" title="Asus VivoBook X441NA-GA190">
Asus VivoBook X4...
</a>
<a class="title" href="/test-sites/e-commerce/allinone/product/252" title="Prestigio SmartBook 133S Dark Grey">
Prestigio SmartB...
</a>
<a class="title" href="/test-sites/e-commerce/allinone/product/253" title="Prestigio SmartBook 133S Gold">
Prestigio SmartB...
</a>
How do I get the "title"?
Attributes like title are accessible via subscription or the .attrs dictionary on an element:
for link in links:
    print(link['title'])
See the BeautifulSoup documentation on Attributes.
For the given URL this produces:
Asus VivoBook X441NA-GA190
Prestigio SmartBook 133S Dark Grey
Prestigio SmartBook 133S Gold
Aspire E1-510
Lenovo V110-15IAP
Lenovo V110-15IAP
Hewlett Packard 250 G6 Dark Ash Silver
# ... etc
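If an anchor might lack the attribute, Tag.get() avoids the KeyError that subscription raises; the .attrs dictionary mentioned above is the underlying store. A small sketch, reusing the links from scrape():

for link in links:
    print(link.attrs)                     # full attribute dict, e.g. {'class': ['title'], 'href': ..., 'title': ...}
    print(link.get('title', 'no title'))  # returns the default instead of raising KeyError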
I am trying to extract the string enclosed by the span with id="titleDescription" using BeautifulSoup.
<div class="itemText">
<div class="wrapper">
<span class="itemPromo">Customer Choice Award Winner</span>
<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details" >
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
<span class="itemDescription" id="lineDescriptionID" style="display:none">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
</a>
</div>
Code snippet
f = open('egg.data', 'rb')
content = f.read()
content = content.decode('utf-8', 'replace')
content = ''.join([x for x in content if ord(x) < 128])
soup = bs(content)
for itemText in soup.find_all('div', attrs={'class': 'itemText'}):
    wrapper = itemText.div
    wrapper_href = wrapper.a
    for child in wrapper_href.descendants:
        if child['id'] == 'titleDescriptionID':
            print(child, "\n")
Traceback Error:
Traceback (most recent call last):
File "egg.py", line 66, in <module>
if child['id'] == 'titleDescriptionID':
TypeError: string indices must be integers
spans = soup.find_all('span', attrs={'id': 'titleDescriptionID'})
for span in spans:
    print(span.string)

In your code, wrapper_href.descendants contains at least 4 elements: the 2 span tags and the 2 strings enclosed by them, because .descendants walks the children recursively.
wrapper_href.descendants includes any NavigableString objects, which is what you are tripping over. NavigableStrings are essentially string objects, and indexing one with child['id'] is what raises the TypeError:
>>> next(wrapper_href.descendants)
u'\n'
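If you do want to walk .descendants, guard against those string nodes first; a minimal sketch, reusing the soup from the question:

from bs4 import Tag  # NavigableStrings fail the isinstance check and are skipped

for child in wrapper_href.descendants:
    if isinstance(child, Tag) and child.get('id') == 'titleDescriptionID':
        print(child)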
Why not just load the tag directly using itemText.find('span', id='titleDescriptionID')?
Demo:
>>> for itemText in soup.find_all('div', attrs={'class': 'itemText'}):
...     print(itemText.find('span', id='titleDescriptionID'))
...     print(itemText.find('span', id='titleDescriptionID').text)
...
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K
from bs4 import BeautifulSoup

pool = BeautifulSoup(html, 'html.parser')  # where html contains the whole HTML as a string
for item in pool.findAll('span', attrs={'id': 'titleDescriptionID'}):
    print(item.string)
When we search for a tag with BeautifulSoup, we get a Tag object, which can be used directly to access its other attributes, such as the inner content, style, href, etc.
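For instance, a small sketch using the pool above:

span = pool.find('span', attrs={'id': 'titleDescriptionID'})
print(span.string)    # the enclosed text
print(span['style'])  # display:inline
print(span['class'])  # ['itemDescription'] -- class is multi-valued in bs4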