Extracting the titles of the websites mentioned in the link - python

https://www.g2.com/categories/marketing-automation
I am trying webscrap the above link that has list of 350+ websites i need to extract the title of the websites mentioned
But I am failing to get any results i have tried with using requests and beautiful soup
then with selenium and all i am getting is empty list "[]" or none
import requests
from bs4 import BeautifulSoup
# Send a GET request to the URL and parse the HTML content
url = 'https://www.g2.com/categories/marketing-automation'
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
name = soup.find(class_ = "product-card__product-name")
print(name)
This above code is just a test code to check if the data is being pulled or not and the response is 'None'
From this code i am expecting to see the results of the class mentioned upon calling print

I kind of got this code to get something. Im still working on it.
from selenium import webdriver
from selenium.webdriver.common.by import By
# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
# Navigate to the webpage
driver.get('https://www.g2.com/categories/marketing-automation')
# Wait for the page to load
driver.implicitly_wait(10)
# Find all the product cards on the page
product_cards = driver.find_elements(By.CLASS_NAME, 'product-card__product-name')
# Iterate over the product cards and extract the title from each one
for product_card in product_cards:
title = product_card.text
print(title)
# Close the browser
driver.quit()

Related

Python - Item Price Web Scraping for Target

I'm trying to get any item's price from Target website. I did some examples for this website using selenium and Redsky API but now I tried to wrote bs4 code below:
import requests
from bs4 import BeautifulSoup
url = "https://www.target.com/p/ruffles-cheddar-38-sour-cream-potato-chips-2-5oz/-/A-14930847#lnk=sametab"
r= requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
price = soup.find("div",class_= "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
print(price)
But it returns me None .
I tried soup.find("div",{'class': "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp"})
What am I missing?
I can accept any selenium code or Redsky API code but my priority is bs4
The page is dynamic. The data is rendered after the initial request is made. You can use selenium to load the page, and once it's rendered, then you can pull out the relevant tag. API though is always the preferred way to go if it's available.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
# If you don't want to open a browser, comment out the line above and uncomment below
#options = webdriver.ChromeOptions()
#options.add_argument('headless')
#driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe', options=options)
url = "https://www.target.com/p/ruffles-cheddar-38-sour-cream-potato-chips-2-5oz/-/A-14930847#lnk=sametab"
driver.get(url)
r = driver.page_source
soup = BeautifulSoup(r, "lxml")
price = soup.find("div",class_= "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
print(price.text)
Output:
$1.99
You are simply using wrong locator.
Try this
price_css_locator = 'div[data-test=product-price]'
or in XPath style
price_xpath_locator = '//div[#data-test="product-price"]'
With bs4 it should be something like this:
soup.select('div[data-test="product-price"]')
to get the element get you just need to add .text
price = soup.select('div[data-test="product-price"]').text
print(price)
use .text
price = soup.find("div",class_= "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
print(price.text)

BeatifulSoap find() returns "None" with any name/attributes

I'm trying to get some informations about a product i'm interested in, on Amazon.
I'm using BeatifulSoap library for webscraping :
URL = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
page = requests.get(URL,headers=headers)
soup = BeautifulSoup(page.content,'html.parser')
title = soup.find('span',class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
In the pic, the highlined row it's the one i want to select, but when i run my script i get 'None' everytime. (Printing the entire output after BeatifulSoap call, give me the entire HTML source, so i'm using the right URL)
Any solutions?
You need to use .text() to get the text of an element.
so change:
print(title)
to:
print(title.text)
Output:
EUR 1.153,00
I wouldn't use BS alone in this case. You can easily use add Selenium to scrape the website:
from bs4 import BeautifulSoup
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium import webdriver
url = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
driver = webdriver.Safari()
driver.get(url)
html_content = driver.page_source
soup = BeautifulSoup(html_content, "html.parser")
title = soup.find('span',class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
If you don't can use Safari you have to download the webdriver for Chrome, Firefox etc. but there is plenty of reading material on this topic.

Beautifulsoup can not find table containing specific class

from bs4 import BeautifulSoup
import requests
url="https://www.calculator.net/currency-calculator.html"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content,'html5lib')
print(soup.prettify()) # print the parsed data of html
conv_table = soup.find("table", attrs={"class":"cinfoT "})
conv_data = gdp_table.tbody.find_all("tr")
I have written the above script to get the table listed on this particular website.
when i run the same conv_table comes as None type object.
If you visit the website, basically i want to extract the 2nd table bigger table and its class name contains "cinfoT ". Also i have checked that there are some blank spaces in the class name.
Please help me out.
Thanks in advance.
It is because this data is loaded by javascript. Try selenium. requests will give you plain html file
from selenium import webdriver
from bs4 import BeautifulSoup
DRIVER_PATH="Your selenium chrome driver path"
url = 'https://www.calculator.net/currency-calculator.html'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get(url)
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
table = html_soup.find_all('table', class_ = 'cinfoT ')
driver.quit()
print(table[0].tbody)

Beautifulsoup not returning complete HTML of the page

I have been digging on the site for some time and im unable to find the solution to my issue. Im fairly new to web scraping and trying to simply extract some links from a web page using beautiful soup.
url = "https://www.sofascore.com/pt/futebol/2018-09-18"
page = urlopen(url).read()
soup = BeautifulSoup(page, "lxml")
print(soup)
At the most basic level, all im trying to do is access a specific tag within the website. I can work out the rest for myself, but the part im struggling with is the fact that a tag that I am looking for is not in the output.
For example: using the built in find() I can grab the following div class tag:
class="l__grid js-page-layout"
However what i'm actually looking for are the contents of a tag that is embedded at a lower level in the tree.
js-event-list-tournament-events
When I perform the same find operation on the lower-level tag, I get no results.
Using Azure-based Jupyter Notebook, i have tried a number of the solutions to similar problems on stackoverflow and no luck.
Thanks!
Kenny
The page use JS to load the data dynamically so you have to use selenium. Check below code.
Note you have to install selenium and chromedrive (unzip the file and copy into python folder)
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.sofascore.com/pt/futebol/2018-09-18"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={
'class':'js-event-list-tournament-events'})
print(container)
or you can use their json api
import requests
url = 'https://www.sofascore.com/football//2018-09-18/json'
r = requests.get(url)
print(r.json())
I had the same problem and the following code worked for me. Chromedriver must be installed!
import time
from bs4 import BeautifulSoup
from selenium import webdriver
chromedriver_path= "/Users/.../chromedriver"
driver = webdriver.Chrome(chromedriver_path)
url = "https://yourURL.com"
driver.get(url)
time.sleep(3) #if you want to wait 3 seconds for the page to load
page_source = driver.page_source
soup = bs4.BeautifulSoup(page_source, 'lxml')
This soup you can use as usual.

Python Selenium accessing HTML source - After Search

Source Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
path = "C:\\Python27\\chromedriver\\chromedriver"
driver = webdriver.Chrome(executable_path=path)
# Open Chrome
driver.get("http://www.thehindu.com/")
# 10 Second Delay
time.sleep(10)
elem = driver.find_element_by_id("searchString")
# Enter Keyword
elem.send_keys("unilever")
elem.send_keys(Keys.RETURN)
time.sleep(10)
# Problem Here
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
print soup
Above it the code.
I want to scrap data from "http://www.thehindu.com/", It searches for "unilever" word in search box and redirect to result page
Link for Search Page
Now I have a question for this, How can I get Source code of the searched Page.
Basically I want news related to "Unilever".
You can get text inside <body>:
body = driver.find_element_by_tag_name("body")
bodyText = body.get_attribute("innerText")
Then you can find your keyword in bodyText.

Categories