How to scrape text in title attribute using beautiful soup? - python

I am scraping this website using beautiful soup and I need to get the product name completely.
When I use the h2 tag, I end up getting names such as "NIVEA Soft Light Moisturizing Cream Berry Blossom Fragrance ...".
I don't want these dots at the end, just the complete name.
Here is my code snippet for scraping the data:
div_soup = data_soup.findAll('div', {'class': 'product-list-box card desktop-cart'})
table_rows = []
for div in div_soup:
    current_row = []
    product_name = div.findAll('h2', {})
    product_price = div.findAll('span', {'class': 'post-card__content-price-offer'})
    for idx, data in enumerate(product_name):
        current_row.append(data.text)
    for idx, data in enumerate(product_price):
        current_row.append(data.text)
    table_rows.append(current_row)
I can't figure out the appropriate tag to use and also if I should pass something in the dictionary.
URL of the website I am scraping: https://www.nykaa.com/skin/moisturizers/face-moisturizer-day-cream/c/8394?root=nav_3

for idx, data in enumerate(product_name):
    if data.get('title') is not None:
        current_row.append(data['title'])
should do what you want.
It might also be cleaner to refactor your code as
product_name = div.find('h2', {'title': True}).get('title')
so you just look for an h2 tag with a title attribute and can avoid the for loop.
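Putting it together, a self-contained sketch of the approach looks like this. The HTML below is invented to mirror the structure described in the question, not fetched from nykaa.com, so the class names and product text are only illustrative:

```python
from bs4 import BeautifulSoup

# Invented markup mirroring the question's structure; the title attribute
# carries the untruncated product name, the visible text is truncated.
html = """
<div class="product-list-box card desktop-cart">
  <h2 title="NIVEA Soft Light Moisturizing Cream Berry Blossom Fragrance 200 ml">
    NIVEA Soft Light Moisturizing Cream Berry Blossom Fragrance ...
  </h2>
  <span class="post-card__content-price-offer">Rs. 199</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for div in soup.find_all("div", {"class": "product-list-box card desktop-cart"}):
    # Prefer the title attribute over the truncated visible text
    h2 = div.find("h2", {"title": True})
    name = h2["title"]
    price = div.find("span", {"class": "post-card__content-price-offer"})
    rows.append([name, price.get_text(strip=True)])

print(rows)
```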

Related

get text from <p> on a website with class or id in python

Is there a way in Python to get the text from an element on a website by class name or id?
I want to check whether a text element on a website has the same text as a given string.
Thanks in advance!
Yes, you can, using beautiful soup.
For example, you can refine your search to only find those divs with a given class:
mydivs = soup.find_all("div", {"class": "class_to_find"})
Take a look here : https://beautiful-soup-4.readthedocs.io/en/latest/
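As a minimal sketch of both lookups and the text comparison (the HTML, the class name "intro", and the id "greeting" are all invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented markup for illustration
html = '<div><p class="intro" id="greeting">Hello world</p></div>'
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find("p", {"class": "intro"})  # look up by class name
by_id = soup.find(id="greeting")               # look up by id

# Compare the element's text against an expected string
matches = by_id.get_text(strip=True) == "Hello world"
print(matches)  # True
```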

Scrape clickable link or xpath

I have this html tree in a web app:
I have scraped all the text for all the league names.
But I also need an XPath or some other locator so that I can tell Selenium: if I choose, for example, EFL League Two (ENG 4) in my GUI from e.g. a drop-down menu, then use the corresponding XPath to select the right league in the web app.
I have no idea how I could extract an XPath from that tree, nor any other solution that would work for my scenario.
Any idea how I could fix this?
If I try to extract an 'href', it just prints "None".
This is my code so far:
def scrape_test():
    leagues = []
    # click the dropdown menu to open the folder with all the leagues
    league_dropdown_menu = driver.find_element_by_xpath('/html/body/main/section/section/div[2]/div/div[2]/div/div[1]/div[1]/div[7]/div')
    league_dropdown_menu.click()
    time.sleep(1)
    # get all league names as text
    scrape_leagues = driver.find_elements_by_xpath("//li[@class='with-icon' and contains(text(), '')]")
    for league in scrape_leagues:
        leagues.append(league.text)
    print('\n')
    # HERE I NEED HELP! - I try to get a link/xpath for each corresponding league to use later with selenium
    scrape_leagues_xpath = driver.find_elements_by_xpath("//li[@class='with-icon']")
    for xpath in scrape_leagues_xpath:
        leagues.append(xpath.get_attribute('xpath'))  # neither xpath, text, nor href is working here
    print(leagues)
The li node doesn't have a text, href, or xpath attribute (xpath isn't a valid HTML attribute in the first place). You can scrape and parse its style attribute instead.
Try this approach to extract the background-image URL (note that str.strip() removes a set of characters rather than a prefix, so splitting on the delimiters is safer):
leagues.append(xpath.get_attribute('style').split('url("')[1].split('")')[0])
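As an aside, rather than trying to read an "xpath" attribute off each element, you can build an XPath expression from the league name chosen in the GUI. A minimal sketch using the standard library's ElementTree XPath subset (the <ul> markup is invented; with Selenium's full XPath engine you would pass an equivalent expression such as //li[@class='with-icon' and text()='...'] to find_element):

```python
import xml.etree.ElementTree as ET

# Invented markup standing in for the dropdown's list items
menu = ET.fromstring("""
<ul>
  <li class="with-icon">Premier League (ENG 1)</li>
  <li class="with-icon">EFL League Two (ENG 4)</li>
</ul>
""")

chosen = "EFL League Two (ENG 4)"
# Build an XPath that matches the <li> whose text equals the GUI selection
xpath = f".//li[@class='with-icon'][.='{chosen}']"
match = menu.find(xpath)
print(match.text)  # EFL League Two (ENG 4)
```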

How to find specific header in a div when web scraping using Python?

I'm currently trying to scrape the headers within a div, shown in the following screenshot:
I am trying to find the text highlighted in light blue from the div class in green, but instead my code (below) prints the code boxed in pink.
My code
Any help is appreciated. Thanks.
Edit: The headers do not contain any tags or class.
If what you want is just the club name and amount for each separate club, then you need to .findAll() the <h3> tags, rather than requesting the content of a specific box. The <h3> tags contain the information you want, and .findAll() pulls them out as a list.
Here's a minimal example:
# Import statements
import requests
from bs4 import BeautifulSoup

# Get the page
page = requests.get("https://talksport.com/football/572055/")
soup = BeautifulSoup(page.content, "html.parser")

# Find all the H3 tags and store them in a list
clubs = soup.findAll("h3")

# Print out the text of each heading
# I've picked the list indexes to ignore the irrelevant H3 tags. I could
# also have filtered by the text itself, but for a single page, this works.
for club in clubs[17:-2]:
    print(club.text)
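Hard-coded indexes like [17:-2] break silently if the page layout changes. A more robust variant filters on the heading text itself; the HTML below is invented to mirror the kind of headings described above:

```python
from bs4 import BeautifulSoup

# Invented headings: two club entries plus two irrelevant H3s
html = """
<h3>Trending</h3>
<h3>Arsenal - £156m</h3>
<h3>Chelsea - £144m</h3>
<h3>More stories</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the headings that look like club entries (contain a £ figure)
clubs = [h3.get_text(strip=True) for h3 in soup.find_all("h3") if "£" in h3.get_text()]
print(clubs)  # ['Arsenal - £156m', 'Chelsea - £144m']
```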

BeautifulSoup children of div

I am trying to find the children div for a specific div on a website using beautifulSoup.
I have inspired myself from this answer : Beautiful Soup find children for particular div
However, when I try to retrieve the content of all divs with class='row' whose parent div has class="container search-results-wrapper endless_page_template" (as seen below), it only retrieves the content of the first div with class='row'.
I am using the following code :
boatContainer = page_soup.find_all('div', class_='container search-results-wrapper endless_page_template')
for row in boatContainer:
    all_boats = row.find_all('div', class_='row')
    for boat in all_boats:
        print(boat.text)
I apply this on this website.
What can I do so that my solution retrieves the data of the divs from class='row' which belong in the div class='container search-results-wrapper endless_page_template' ?
Use response.content instead of response.text.
You're also not requesting the correct URL in your code. https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&skipper=False&search_src=home only displays a single boat, hence your code is only returning one row.
Use https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&guests_count=&order_by=-rank&is_roundtrip=&coupon_code=&skipper=None instead in this case.
You'll probably find it useful to adjust the URL parameters to filter boats at some point!
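For what it's worth, the nested find_all pattern itself does return every row once the container holds more than one; here is a self-contained sketch against invented markup:

```python
from bs4 import BeautifulSoup

# Invented markup mirroring the container/row structure in the question
html = """
<div class="container search-results-wrapper endless_page_template">
  <div class="row">Boat A</div>
  <div class="row">Boat B</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
containers = soup.find_all('div', class_='container search-results-wrapper endless_page_template')
boats = [row.get_text(strip=True) for c in containers for row in c.find_all('div', class_='row')]
print(boats)  # ['Boat A', 'Boat B']
```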

Printing all occurence of certain document elements of a webpage

So I was scraping this particular webpage https://www.zomato.com/srijata , for all the "restaurant reviews"(not the self comments on her own reviews) posted by user "Sri".
zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)
soup = BeautifulSoup(open('zomato_info.html'))
soup.find('div','mtop0 rev-text').text
This prints her first restaurant review i.e. - "Sri reviewed Big Straw - Chew On This" as :-
u'Rated This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol..'
I also tried another selector :-
I have this question:
How can I print the next restaurant review? I tried findNextSiblings etc., but none seem to work.
First of all, you don't need to write the output to the file, pass the result of urlopen() call to the BeautifulSoup constructor.
To get the reviews, you need to iterate over all div tags with class rev-text, and get the .next_sibling of the div element inside:
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('https://www.zomato.com/srijata'))
for div in soup.find_all('div', class_='rev-text'):
    print div.div.next_sibling
Prints:
This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol..
The ambience is good. The food quality is good. I Didn't find anything to complain. I wanted to visit the place fir a very long time and had dinner today. The meals are very good and if u want the better quality compared to other Andhra restaurants then this is the place. It's far better than nandhana. The staffs are very polite too.
...
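In Python 3, the same next_sibling pattern looks like this; the markup below is a simplified, invented stand-in for Zomato's review divs (an inner "Rated" div followed by the review text as a sibling text node):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the review markup
html = """
<div class="rev-text"><div>Rated</div>Great food and friendly staff.</div>
<div class="rev-text"><div>Rated</div>Too crowded on weekends.</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The review text is the sibling node right after the inner <div>Rated</div>
reviews = [div.div.next_sibling.strip() for div in soup.find_all("div", class_="rev-text")]
print(reviews)
```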
You should make a for loop and use find_all instead of find:
zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)
soup = BeautifulSoup(open('zomato_info.html'))
for div in soup.find_all('div', 'rev-text'):
    print div.text
Also, one question: why are you saving the HTML to a file and then reading the file back into a soup object?