When I run this code:
import requests
from bs4 import BeautifulSoup as bsoup

url = "https://www.researchgate.net/profile/David_Severson"
r = requests.get(url)
data = r.text
soup = bsoup(data, "lxml")
for a in soup.find_all('a', {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True):
    print(a['href'])
It returns all of the links no problem.
When I put it inside a slightly more complicated loop with other elements:
item = soup.find("div", {"class": "section section-research"})
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
for p in papers:
    title = p.find("div", {"class": "nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__title nova-v-publication-item__title--clamp-3"})
    abstract = p.find("div", {"class": "nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__description nova-v-publication-item__description--clamp-3"})
    views = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__metrics"})
    date = p.find("li", {"class": "nova-e-list__item publication-item-meta-items__meta-data-item"})
    authors = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__person-list"})
    link = p.find("a", {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True)
    print(p['href'])
    print(p.text)
It no longer returns the href (link) that I want, and instead gives me KeyError: 'href'.
Why is it no longer returning the links?
Fetch the href from the link element.
Ex:
from bs4 import BeautifulSoup
import requests
url = "https://www.researchgate.net/profile/David_Severson"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
item = soup.find("div", {"class": "section section-research"})
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
for p in papers:
    title = p.find("div", {"class": "nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__title nova-v-publication-item__title--clamp-3"})
    abstract = p.find("div", {"class": "nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__description nova-v-publication-item__description--clamp-3"})
    views = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__metrics"})
    date = p.find("li", {"class": "nova-e-list__item publication-item-meta-items__meta-data-item"})
    authors = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__person-list"})
    link = p.find("a", {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True)
    if link:
        print(link["href"])
Your p in papers is a div element, not an a element as in the previous code snippet; that is why you get the href KeyError. You probably want link['href'] instead, assuming link is not None.
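A minimal sketch of the difference (the markup below is invented for illustration, not taken from ResearchGate): indexing a Tag that lacks the attribute raises KeyError, while Tag.get() returns None.

```python
from bs4 import BeautifulSoup

html = '<div class="item"><a href="/paper/1">Article</a></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")           # the container <div> has no href attribute
link = div.find("a", href=True)  # the nested <a> does

print(div.get("href"))   # None -- safe lookup, no exception
print(link["href"])      # /paper/1
# div["href"] would raise KeyError: 'href', just like in the question
```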
I am trying to scrape the text between divs here:
I tried to use .next_sibling as mentioned in this post: get text after specific tag with beautiful soup
But it didn't work.
My current code:
import requests
from bs4 import BeautifulSoup

for pageNumber in range(1565, 1566):
    address = "https://dojrzewamy.pl/cat/3/nowe/%d/seks" % pageNumber
    page = requests.get(address)
    soup = BeautifulSoup(page.content, 'html.parser')
    containers = soup.findAll("div", {"class": "question"})
    for container in containers:
        h2 = container.find("div", {"class": "info"}).find("h2")
        content = container.find("div", {"class": "info"}).find("div", {"class": "clear:both"})
        desc = content.next_sibling
        print(desc)
Could you guide me on how to access the text between divs using BeautifulSoup4?
The class attribute is not there on the second div you are searching for; the attribute is style.
You also need to check that the element is present before taking next_sibling.
Try now:
for pageNumber in range(1565, 1566):
    address = "https://dojrzewamy.pl/cat/3/nowe/%d/seks" % pageNumber
    page = requests.get(address)
    soup = BeautifulSoup(page.content, 'html.parser')
    containers = soup.findAll("div", {"class": "question"})
    for container in containers:
        h2 = container.find("div", {"class": "info"}).find("h2")
        content = container.find("div", {"class": "info"}).find("div", {"style": "clear:both"})
        if content:
            desc = content.next_sibling
            print(desc)
Here is a simpler option using a CSS selector:
for pageNumber in range(1565, 1566):
    address = "https://dojrzewamy.pl/cat/3/nowe/%d/seks" % pageNumber
    page = requests.get(address)
    soup = BeautifulSoup(page.content, 'html.parser')
    containers = soup.findAll("div", {"class": "question"})
    for container in containers:
        h2 = container.find("div", {"class": "info"}).find("h2")
        content = container.select_one("div[style='clear:both']")
        if content:
            desc = content.next_sibling
            print(desc)
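A small sketch of what .next_sibling returns in this situation (markup invented for illustration): when text sits directly after a tag, the sibling is a bare NavigableString rather than a Tag.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the text we want follows the clear:both div directly.
html = '<div class="info"><div style="clear:both"></div>Text between divs</div>'
soup = BeautifulSoup(html, "html.parser")

marker = soup.select_one("div[style='clear:both']")
print(marker.next_sibling)  # the bare string following the div
```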
Okay, I found another solution:
for pageNumber in range(1565, 1566):
    address = "https://dojrzewamy.pl/cat/3/nowe/%d/seks" % pageNumber
    page = requests.get(address)
    soup = BeautifulSoup(page.content, 'html.parser')
    containers = soup.findAll("div", {"class": "question"})
    for container in containers:
        h2 = container.find("div", {"class": "info"}).find("h2")
        info = container.find("div", {"class": "info"})
        print(info(text=True, recursive=False))
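For context, calling a Tag is shorthand for find_all(), so info(text=True, recursive=False) returns only the strings that are direct children of the tag. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

html = '<div class="info"><h2>Title</h2>Question text<div>nested</div></div>'
soup = BeautifulSoup(html, "html.parser")
info = soup.find("div", {"class": "info"})

# Only NavigableStrings that are direct children survive the filter;
# "Title" and "nested" are skipped because they sit inside child tags.
print(info(text=True, recursive=False))  # ['Question text']
```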
How can I iterate through all tags under the found tag?
This gives me only top-level tags:
description = soup.find("div", {"class": "description"})
for tag in description:
    print(tag)
This gives me iteration until the end of the html:
description = soup.find("div", {"class": "description"})
while description:
    description = description.next_element
    print(description)
find() returns only the first matching tag from the soup, so iterating over description walks that single tag; if you want every matching div, use the findAll() method:
descriptions = soup.findAll("div", {"class": "description"})
for description in descriptions:
    print(description)
Are you looking for .descendants?
description = soup.find("div", {"class": "description"})
for tag in description.descendants:
    print(tag)
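A quick sketch of the difference between the two (markup invented for illustration): iterating a Tag yields only its direct children, while .descendants walks the entire subtree.

```python
from bs4 import BeautifulSoup

html = '<div class="description"><p>outer <b>inner</b></p></div>'
soup = BeautifulSoup(html, "html.parser")
description = soup.find("div", {"class": "description"})

# Direct children: just the single <p> tag.
print(len(list(description.children)))     # 1
# Full subtree: <p>, 'outer ', <b>, and 'inner'.
print(len(list(description.descendants)))  # 4
```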
I want to assign every href I find with BeautifulSoup to a variable, for example:
link1, link2, link3 ....
My function looks like this now:
def board_list():
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.findAll('a', {'class': 'board-tile'}):
        href = driver.current_url + link.get('href')
        title = link.string
        print(title)
        print(href)
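Rather than numbered variables, it is usually cleaner to collect the hrefs into a list. A sketch with invented markup (the board-tile class comes from the question; the Selenium driver is replaced by a static string so the example is self-contained):

```python
from bs4 import BeautifulSoup

html = '''
<a class="board-tile" href="/b/1">First</a>
<a class="board-tile" href="/b/2">Second</a>
'''
soup = BeautifulSoup(html, "html.parser")

# One list entry per link instead of link1, link2, ... variables.
links = [link.get("href") for link in soup.find_all("a", {"class": "board-tile"})]
titles = [link.string for link in soup.find_all("a", {"class": "board-tile"})]
print(links)   # ['/b/1', '/b/2']
print(titles)  # ['First', 'Second']
```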
I wrote a for-loop that I thought was extracting the text from the html elements that I had indicated using the BeautifulSoup library. It looks like this:
import urllib.parse
import requests
from bs4 import BeautifulSoup as bsoup

url = "https://www.researchgate.net/profile/David_Severson"
r = requests.get(url)
data = r.text
soup = bsoup(data, "lxml")
item = soup.find("div", {"class": "section section-research"})
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
for p in papers:
    title = p.find("div", {"class": "nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__title nova-v-publication-item__title--clamp-3"})
    abstract = p.find("div", {"class": "nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__description nova-v-publication-item__description--clamp-3"})
    views = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__metrics"})
    date = p.find("li", {"class": "nova-e-list__item publication-item-meta-items__meta-data-item"})
    authors = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__person-list"})
    link = p.find("a", {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True)
    if link:
        full_link = urllib.parse.urljoin("https://www.researchgate.net/", link["href"])
        print(full_link)
    print(p.text)
I noticed that it was printing out more than what I had indicated in the contents of the loop. After trying to debug each of the individual items (title, abstract, etc.), I realized the loop was not accessing the items therein at all.
For example, if I commented them all out, or totally removed them, it still gave the exact same output:
for p in papers:
    print(p.text)
    print("")
(This ^ gives me the exact same output as the code with the contents in the body.)
Somehow the loop is not even reading the elements it's supposed to be using to iterate through p... How can I get it to recognize the script contained therein and extract the desired elements as defined in the body of the loop?
The problem is that you had a space in the class you specified:
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
I removed the space and your code worked, so retry it using this line.
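A sketch of why the space matters (markup invented for illustration): BeautifulSoup splits the class attribute into a list of values, so a name with a trailing space matches none of them.

```python
from bs4 import BeautifulSoup

html = '<div class="item highlight">hello</div>'
soup = BeautifulSoup(html, "html.parser")

# "item " (trailing space) is not one of the class values, so nothing matches.
print(len(soup.find_all("div", {"class": "item "})))  # 0
print(len(soup.find_all("div", {"class": "item"})))   # 1
# CSS selectors also handle multi-valued classes cleanly.
print(len(soup.select("div.item.highlight")))         # 1
```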
I am trying to loop through multiple tags in the HTML so I can print all the IDs.
My code right now prints only the first ID; how can I print the second, third, fourth, and so on?
soup = BeautifulSoup(r.content, "html.parser")
product_div = soup.find_all('div', {'class': 'valu '})
product_tag = product_div[0].find('a')
products = product_tag.attrs['val']
print(products)
This should help:
soup = BeautifulSoup(r.content, "html.parser")
for product_div in soup.find_all('div', {'class': 'size '}):
    product_tag = product_div.find('a')
    if product_tag:
        print(product_tag.attrs['id'])