I am trying to scrape the text between divs here:
I tried to use .next_sibling like mentioned in this post: get text after specific tag with beautiful soup
But it didn't work.
My current code:
import requests
from bs4 import BeautifulSoup

for pageNumber in range(1565, 1566):
    address = "https://dojrzewamy.pl/cat/3/nowe/%d/seks" % pageNumber
    page = requests.get(address)
    soup = BeautifulSoup(page.content, 'html.parser')
    containers = soup.findAll("div", {"class": "question"})
    for container in containers:
        h2 = container.find("div", {"class": "info"}).find("h2")
        content = container.find("div", {"class": "info"}).find("div", {"class": "clear:both"})
        desc = content.next_sibling
        print(desc)
Could you guide me on how to access the text between divs using BeautifulSoup4?
The class attribute is not there on the second div you are searching for; the attribute is style. You also need to add a check that the element is present before taking its next_sibling.
Try now:
for pageNumber in range(1565, 1566):
    address = "https://dojrzewamy.pl/cat/3/nowe/%d/seks" % pageNumber
    page = requests.get(address)
    soup = BeautifulSoup(page.content, 'html.parser')
    containers = soup.findAll("div", {"class": "question"})
    for container in containers:
        h2 = container.find("div", {"class": "info"}).find("h2")
        content = container.find("div", {"class": "info"}).find("div", {"style": "clear:both"})
        if content:
            desc = content.next_sibling
            print(desc)
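One caveat, since I can't check the live markup: next_sibling frequently returns the bare newline between tags rather than the text you want. If the description sits a little further along, walking next_siblings and keeping the first non-empty text node is a safer sketch:
from bs4 import NavigableString

if content:
    desc = None
    for sib in content.next_siblings:
        # keep the first sibling that is a non-empty text node
        if isinstance(sib, NavigableString) and sib.strip():
            desc = sib.strip()
            break
    print(desc)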
Here you go with a simple CSS selector option:
for pageNumber in range(1565, 1566):
    address = "https://dojrzewamy.pl/cat/3/nowe/%d/seks" % pageNumber
    page = requests.get(address)
    soup = BeautifulSoup(page.content, 'html.parser')
    containers = soup.findAll("div", {"class": "question"})
    for container in containers:
        h2 = container.find("div", {"class": "info"}).find("h2")
        content = container.select_one("div[style='clear:both']")
        if content:
            desc = content.next_sibling
            print(desc)
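If the inline style isn't exactly clear:both (extra spaces, additional properties), a substring attribute selector is more forgiving. A hedged variant:
content = container.select_one("div[style*='clear:both']")  # matches any style value containing the substring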
Okay, I found another solution:
for pageNumber in range(1565, 1566):
    address = "https://dojrzewamy.pl/cat/3/nowe/%d/seks" % pageNumber
    page = requests.get(address)
    soup = BeautifulSoup(page.content, 'html.parser')
    containers = soup.findAll("div", {"class": "question"})
    for container in containers:
        h2 = container.find("div", {"class": "info"}).find("h2")
        info = container.find("div", {"class": "info"})
        print(info(text=True, recursive=False))
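Calling a tag like info(text=True, recursive=False) is shorthand for info.find_all(text=True, recursive=False): it returns a list of the tag's direct text-node children, whitespace included. To collapse that list into one clean string, something like this should work:
texts = info(text=True, recursive=False)
desc = " ".join(t.strip() for t in texts if t.strip())
print(desc)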
I tried to do this:
URL = str(browser.current_url)
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
imena = soup.findAll('a', class_='text-headline')
print(imena)
Assuming the URL is for the Members tab of the Strava club Россия 2021, i.e., https://www.strava.com/clubs/236545/members, the following should work to get all of the members across the 193 pages (you really should be using the Strava API...):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

BASE_URL = "https://www.strava.com/clubs/236545/members?page="

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(f"{BASE_URL}{1}")
driver.find_element_by_id("email").send_keys("<your-email>")
driver.find_element_by_id("password").send_keys("<your-password>")
driver.find_element_by_id("login-button").click()
time.sleep(1)

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
num_pages = int(soup.find("ul", "pagination").find_all("li")[-2].text)

# Ignore the admins shown on each page
athletes = soup.find_all("ul", {"class": "list-athletes"})[1]
members = [
    avatar.attrs['title']
    for avatar in athletes.find_all("div", {"class": "avatar"})
    if 'title' in avatar.attrs
]

for page in range(2, num_pages + 1):
    time.sleep(1)
    driver.get(f"{BASE_URL}{page}")
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    athletes = soup.find_all("ul", {"class": "list-athletes"})[1]
    for avatar in athletes.find_all("div", {"class": "avatar"}):
        if 'title' in avatar.attrs:
            members.append(avatar.attrs['title'])

# Print first 10 members
print('\n'.join(m.strip() for m in members[:10]))

driver.close()
Output (first 10 members):
- Victor Koldaev - ♥LCHF Runners♥
Antonio Raposo ®️
Vadim Issin
"DuSenna🇧🇷 Vá com Garra e a Felicidade te Agarra 😉
#MIX MIX
#RunВасяRun ...
$ерЖ 🇷🇺 КЛИМoff
'Luis Fernando Osorio' MTB
( CE )Faisal ALShammary "حائل $الشرقية "
(# Monique #) bermudez
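As an aside on the Selenium snippet above: the fixed time.sleep(1) calls can be replaced with an explicit wait, so the script pauses only as long as the page actually takes to render. A sketch, assuming the pagination list is a reliable readiness signal:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until the pagination list is present instead of sleeping a fixed second
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "ul.pagination"))
)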
You first need to get the div elements with the class "text-headline" and then loop through each of them to get the anchor links.
URL = str(browser.current_url)
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
members = soup.findAll('div', {'class': 'text-headline'})
for member in members:
    name = member.find("a")
    print(name.get_text())
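Equivalently, as a list comprehension that also guards against a div without an anchor (hedged, since I can't see the live markup):
names = [m.find("a").get_text(strip=True) for m in members if m.find("a")]
print(names)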
import requests
from bs4 import BeautifulSoup

def page(current_page):
    current = "h2"
    while current == current_page:
        url = 'https://vishrantkhanna.com/?s=' + str(current)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('h2', {'class': 'entry-title'}):
            href = "https://vishrantkhanna.com/" + link.get('href')
            title = link.string
            print(href)
            print(title)

page("h2")
I'm trying to copy and print the article title and the href link associated with it.
You need to extract the <a> tag from the heading:
import requests
from bs4 import BeautifulSoup

URL = 'https://vishrantkhanna.com/?s=1'
html = requests.get(URL).text
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('h2', {'class': 'entry-title'}):
    a = link.find('a', href=True)
    href = "https://vishrantkhanna.com/" + a.get('href')
    title = link.string
    print(href)
    print(title)
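One caveat: link.string returns None whenever the h2 contains nested tags. If that bites here, get_text is the safer way to pull the title; a hedged alternative:
title = link.get_text(strip=True)  # works even if the <h2> wraps nested tags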
When I run this code:
url = "https://www.researchgate.net/profile/David_Severson"
r = requests.get(url)
data = r.text
soup = bsoup(data, "lxml")
for a in soup.find_all('a', {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True):
print (a['href'])
It returns all of the links no problem.
When I put it inside a somewhat more complicated loop with other elements:
item = soup.find("div", {"class": "section section-research"})
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
for p in papers:
    title = p.find("div", {"class": "nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__title nova-v-publication-item__title--clamp-3"})
    abstract = p.find("div", {"class": "nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__description nova-v-publication-item__description--clamp-3"})
    views = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__metrics"})
    date = p.find("li", {"class": "nova-e-list__item publication-item-meta-items__meta-data-item"})
    authors = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__person-list"})
    link = p.find("a", {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True)
    print(p['href'])
    print(p.text)
It no longer returns the href (link) that I want, and instead gives me KeyError: 'href'.
Why is it no longer returning the links?
Fetch the href from the link element.
Ex:
from bs4 import BeautifulSoup
import requests

url = "https://www.researchgate.net/profile/David_Severson"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
item = soup.find("div", {"class": "section section-research"})
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
for p in papers:
    title = p.find("div", {"class": "nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__title nova-v-publication-item__title--clamp-3"})
    abstract = p.find("div", {"class": "nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__description nova-v-publication-item__description--clamp-3"})
    views = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__metrics"})
    date = p.find("li", {"class": "nova-e-list__item publication-item-meta-items__meta-data-item"})
    authors = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__person-list"})
    link = p.find("a", {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True)
    if link:
        print(link["href"])
Your p in papers is a div element, not an a element as in the previous code snippet; that is why you get the href KeyError. You probably want link['href'], assuming link is not None.
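In general, indexing a Tag like tag['href'] raises KeyError when the attribute is missing, while tag.get('href') returns None, just like a dict. A minimal self-contained illustration:
from bs4 import BeautifulSoup

tag = BeautifulSoup('<div>no href here</div>', 'html.parser').div
print(tag.get('href'))  # None, no exception
try:
    tag['href']
except KeyError as e:
    print('KeyError:', e)  # this is what the loop above runs into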
I am trying to loop through multiple tags in the HTML so I can print all the IDs.
My code right now prints only the first ID; how can I print the second, third, fourth, and so on?
soup = BeautifulSoup(r.content, "html.parser")
product_div = soup.find_all('div', {'class': 'valu '})
product_tag = product_div[0].find('a')
products = product_tag.attrs['val']
print products
This should help:
soup = BeautifulSoup(r.content, "html.parser")
for product_div in soup.find_all('div', {'class': 'size '}):
    product_tag = product_div.find('a')
    if product_tag:
        print product_tag.attrs['id']
I'm pretty new to Python and mainly need it for getting information from websites.
import time
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.example.com'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'c5'}):
            href = link.get('href')
            time.sleep(0.3)
            # print(href)
            single_item(href)
        page += 1

def single_item(item_url):
    s_code = requests.get(item_url)
    p_text = s_code.text
    soup = BeautifulSoup(p_text, "html.parser")
    upc = ('div', {'class': 'product-upc'})
    for upc in soup.findAll('span', {'class': 'upcNum'}):
        print(upc.string)
    sku = ('span', {'data-selenium': 'bhSku'})
    for sku in soup.findAll('span', {'class': 'fs16 c28'}):
        print(sku.text)
    price = ('span', {'class': 'price'})
    for price in soup.findAll('meta', {'itemprop': 'price'}):
        print(price)
    outFile = open(r'C:\Users\abc.txt', 'a')
    outFile.write(str(upc))
    outFile.write("\n")
    outFile.write(str(sku))
    outFile.write("\n")
    outFile.write(str(price))
    outFile.write('\n')
    outFile.close()

spider(1)
What I want to get is "UPC: 813066012487, price: 26.45 and SKU: KBPTMCC2" without any span, meta, or content attributes. I attached my output below.
Here is my output: [screenshot]
Where did I go wrong?
Hope someone can figure it out! Thanks!!
The data you want is in the div attribute data-itemdata; you can call json.loads on it and it will give you a dict that you can access to get what you want:
from bs4 import BeautifulSoup
import requests
import json

soup = BeautifulSoup(requests.get("https://www.bhphotovideo.com/c/buy/accessories/ipp/100/mnp/25/Ns/p_PRICE_2%7c0/ci/20861/pn/1/N/4005352853+35").content, "html.parser")

for d in soup.select("div[data-selenium=itemDetail]"):
    data = json.loads(d["data-itemdata"])
    print(data)
Each data dict will look like:
{u'catagoryId': u'20861',
u'inCart': False,
u'inWish': False,
u'is': u'REG',
u'itemCode': u'KBPTMCC2',
u'li': [],
u'price': u'26.45',
u'searchTerm': u'',
u'sku': u'890522'}
So just access it by key, i.e. price = data["price"].
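For example, pulling out the fields the question asked about (key names taken from the sample dict above, including the site's own catagoryId spelling):
for d in soup.select("div[data-selenium=itemDetail]"):
    data = json.loads(d["data-itemdata"])
    # itemCode holds the KBPTMCC2-style SKU the question wants
    print("SKU:", data["itemCode"], "price:", data["price"])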
To get the UPC we need to visit each item's page; we can get the URL from the h3 with the data-selenium attribute:
for d in soup.select("div[data-selenium=itemDetail]"):
    url = d.select_one("h3[data-selenium] a")["href"]
    upc = BeautifulSoup(requests.get(url).content, "html.parser").select_one("span.upcNum").text.strip()
    data = json.loads(d["data-itemdata"])
Not all pages have a UPC value, so you will have to decide what to do; if you just want products with UPCs, first check whether the select finds anything:
for d in soup.select("div[data-selenium=itemDetail]"):
    url = d.select_one("h3[data-selenium] a")["href"]
    upc = BeautifulSoup(requests.get(url).content, "html.parser").select_one("span.upcNum")
    if upc:
        data = json.loads(d["data-itemdata"])
        text = upc.text.strip()
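Putting the pieces together, a sketch of the whole loop printing the three fields the question asked for; each item page is a separate request, so a requests.Session is used to reuse the connection (an assumption for speed, not required):
import json
import requests
from bs4 import BeautifulSoup

session = requests.Session()
listing = session.get("https://www.bhphotovideo.com/c/buy/accessories/ipp/100/mnp/25/Ns/p_PRICE_2%7c0/ci/20861/pn/1/N/4005352853+35")
soup = BeautifulSoup(listing.content, "html.parser")
for d in soup.select("div[data-selenium=itemDetail]"):
    url = d.select_one("h3[data-selenium] a")["href"]
    upc = BeautifulSoup(session.get(url).content, "html.parser").select_one("span.upcNum")
    if upc:
        data = json.loads(d["data-itemdata"])
        print("UPC:", upc.text.strip(), "price:", data["price"], "SKU:", data["itemCode"])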