Extracting anchor text from span class with BeautifulSoup - python

This is the html I am trying to scrape:
<span class="meta-attributes__attr-tags">
cinematic,
dissolve,
epic,
fly,
</span>
I want to get the anchor text for each a href: cinematic, dissolve, epic, etc.
This is the code I have:
url = urllib2.urlopen("http: example.com")
content = url.read()
soup = BeautifulSoup(content)
links = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for link in links:
print link.find_all('a')['href']
If I do it with "link.find_all" I get error: TypeError: List indices must be integers, not str.
But if I do print link.find('a')['href'] I get the first one only.
How can I get all of them ?

You could do the following:
from bs4 import BeautifulSoup
content = '''
<span class="meta-attributes__attr-tags">
cinematic,
dissolve,
epic,
fly,
</span>
'''
soup = BeautifulSoup(content)
spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for span in spans:
links = span.find_all('a')
for link in links:
print link['href']
Output
/tags/cinematic
/tags/dissolve
/tags/epic
/tags/fly

from bs4 import BeautifulSoup
html = """
<span class="meta-attributes__attr-tags">
cinematic,
dissolve,
epic,
fly,
</span>
"""
soup = BeautifulSoup(html, "lxml")
spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for span in spans:
for link in span.find_all('a'):
print link.text, link['href']
Another, pricier, way could be:
from bs4 import BeautifulSoup
html = """
<span class="meta-attributes__attr-tags">
cinematic,
dissolve,
epic,
fly,
</span>
"""
soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a")
for link in links:
if 'meta-attributes__attr-tags' not in link.parent.get('class', []):
continue
print link.text, link['href']

link.find_all('a') returns a list with bs4 Tags. You probably want to index each of this links by href. So maybe this comes closer to your needs:
span = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for links in span:
for link in links.find_all('a'):
print(link['href'])

You may avoid nested loops or any additional if checks inside a loop by using a CSS selector:
for link in soup.select(".meta-attributes__attr-tags a[href]"):
print(link["href"], link.get_text())

Related

Trying to find all <a> elements without a *specific* class

I'm trying web scraping for the first time and I'm using BeautifulSoup to gather bits of information from a website. I'm trying to get all the elements which has one class but not another. For example:
from bs4 import BeautifulSoup
html = """
<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html)
In this example, I want to get all the elements with the something class. However, when finding all elements containing that class I also get the element containing the somethingelse class, and I do not want these.
The code I'm using to get it is:
results = soup.find_all("a", {"class": "something"})
Any help is appreciated! Thanks.
This will work fine:
from bs4 import BeautifulSoup
text = '''<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>'''
soup = BeautifulSoup(text, 'html.parser')
r1 = soup.find_all("a", {"class": "something"})
r2 = soup.find_all("a", {"class": "somethingelse"})
for item in r2:
if item in r1:
r1.remove(item)
print(r1)
Output
[<a class="something">Information I want</a>]
For extracting the text present in the tags, just add this lines:
for item in r1:
print(item.text)
Output
Information I want
For this task, you can find elements by lambda function, for example:
from bs4 import BeautifulSoup
html_doc = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")
a = soup.find(
lambda tag: tag.name == "a" and tag.get("class", []) == ["something"]
)
print(a)
Prints:
<a class="something">Information I want</a>
Or: specify "class" as a list:
a = soup.find("a", {"class": ["something"]})
print(a)
Prints:
<a class="something">Information I want</a>
EDIT:
For filtering type-icon type-X:
from bs4 import BeautifulSoup
html_doc = """
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
<a class="type-icon type-2">Information I want 2</a>
<a class="type-icon type-2 type-cell type-abbr">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")
my_types = ["type-icon", "type-1", "type-2"]
def my_filter(tag):
if tag.name != "a":
return False
c = tag.get("class", [])
return "type-icon" in c and not set(c).difference(my_types)
a = soup.find_all(my_filter)
print(a)
Prints:
[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]
Or extract tags you don't want first:
soup = BeautifulSoup(html_doc, "html.parser")
# extract tags I don't want:
for t in soup.select(".type-cell.type-abbr"):
t.extract()
print(soup.select(".type-icon.type-1, .type-icon.type-2"))
Prints:
[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]

BeautifulSoup parsing issues some div not showing

I'm trying to parse this page: https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/
The problem is, in this element: https://gyazo.com/e544be64a41a121bdb0c0f71aef50692 ,
I want the div that contains the price. If you inspect the page, you can see the html code for this part, shows like this:
<div class="price">
<div class"price">
"thePrice"
<sup>93</sup>
</div>
</div>
BUT, when using page_soup = soup(my_html_page, 'html.parser') or page_soup = soup(my_html_page, 'lxml') or page_soup = soup(my_html_page, 'html5lib') I only get this as the result for that part:
<div class="price"></div>
And that's it. I've been searching for hours on the internet to figure out why that inner div doesn't get parsed.
Three different parsers, and none seems to get passed the fact that the inner child shares the same class name than its parent, if this is the issue.
Hope its help you.
from bs4 import BeautifulSoup
import requests
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
html = BeautifulSoup(requests.get(url).content, 'html.parser')
prices = html.find_all("div", {"class": "price"})
for price in prices:
print(price.text)
print output
561€95
169€94
165€95
1 165€94
7 599€95
267€95
259€94
599€95
511€94
1 042€94
2 572€94
783€95
2 479€94
2 699€95
499€94
386€95
169€94
2 343€95
783€95
499€94
499€94
259€94
267€95
165€95
169€94
2 399€95
561€95
2 699€95
2 699€95
6 059€95
7 589€95
10 991€95
9 619€94
2 479€94
3 135€95
7 589€95
511€94
1 042€94
386€95
599€95
1 165€94
2 572€94
783€95
2 479€94
2 699€95
499€94
169€94
2 343€95
2 699€95
3 135€95
6 816€95
7 589€95
561€95
267€95
To scrape all prices where class="price"> see this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Select all the 'price' classes
for tag in soup.select('div.price'):
print(tag.text)

python nested Tags (beautiful Soup)

I used beautiful soup using python to get data from a specific website
but I don't know how to get one of these prices but I want the price in gram (g)
AS shown below this is the HTML codeL:
<div class="promoPrice margBottom7">16,000
L.L./200g<br/><span class="kiloPrice">79,999
L.L./Kg</span></div>
I use this code:
p_price = product.findAll("div{"class":"promoPricemargBottom7"})[0].text
my result was:
16,000 L.L./200g 79,999 L.L./Kg
but i want to have:
16,000 L.L./200g
only
You will need to first decompose the span inside the div element:
from bs4 import BeautifulSoup
h = """
<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>
"""
soup = BeautifulSoup(h, "html.parser")
element = soup.find("div", {'class': 'promoPrice'})
element.span.decompose()
print(element.text)
#16,000 L.L./200g
Try using soup.select_one('div.promoPrice').contents[0]
from bs4 import BeautifulSoup
html = """<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>"""
soup = BeautifulSoup(html, features='html.parser')
# value = soup.select('div.promoPrice > span') # for 79,999 L.L./Kg
value = soup.select_one('div.promoPrice').contents[0]
print(value)
Prints
16,000 L.L./200g

How to extract link under a <li> tag with a specific class?

<li class="a-last">Buy Now</li>
How can you extract the link /macbook-pro inside the class a-last? Efficiency is a consideration.
One possibility is CSS selectors:
data = '''<li class="a-last">Buy Now</li>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('li.a-last [href]')['href'])
Prints:
/macbook-pro
li.a-last [href] will select tag with attribute href that is under <li> tag with class a-last.
If you want to be more specific and want to extract only <a> tag directly under <li class="a-last">, you can use:
print(soup.select_one('li.a-last > a[href]')['href'])
You can do this:
from bs4 import BeautifulSoup
html = """<li class="a-last">Buy Now</li>"""
soup = BeautifulSoup(html, 'html.parser')
href = soup.find('li', {'class': 'a-last'}).find('a').get('href')
print(href)
RESULTS:
/macbook-pro
This is the list of all needed hrefs:
[el.find('a').get('href') for el in soup.find_all('li', {'class': 'a-last'})]

Web crawling using python beautifulsoup

How to extract data that is inside <p> paragraph tags and <li> which are under a named <div> class?
Use the functions find() and find_all():
import requests
from bs4 import BeautifulSoup
url = '...'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', {'class':'class-name'})
ps = div.find_all('p')
lis = div.find_all('li')
# print the content of all <p> tags
for p in ps:
print(p.text)
# print the content of all <li> tags
for li in lis:
print(li.text)

Categories