Unable to parse href using beautifulsoup Python

I am trying to scrape the following webpage to get a specific href using BS4. They've just changed the page layout, and because of that I can no longer parse it correctly. I hope someone can help.
Webpage trying to scrape: https://www.temit.co.uk/investor/resources/temit-literature
Tag trying to get: href="https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheet"
Currently I am using the following code (BS4); however, I am no longer able to get to the specific element.
url = "https://www.temit.co.uk/investor/resources/temit-literature"
page = requests.get(url) # Requests website
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find_all('div', attrs={'class':'row ng-star-inserted'})
url_TEM = table[1].find_all('a')[0].get('href')
url_TEM = 'https://www.temit.co.uk' + url_TEM

The URL is dynamic, which is why I used Selenium together with BS4; it produces the desired output as follows:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
url = "https://www.temit.co.uk/investor/resources/temit-literature"
driver.get(url)
time.sleep(8)  # wait for the JavaScript-rendered content to load
soup = BeautifulSoup(driver.page_source, 'html.parser')
table_urls = soup.select('table.table-striped tbody tr td a')
for table_url in table_urls:
    url = table_url['href']
    print(url)
Output:
https://franklintempletonprod.widen.net/s/gd2tmrc8cl/final-temit-annual-report-31-march-2021
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://www.londonstockexchange.com/stock/TEM/templeton-emerging-markets-investment-trust-plc/analysis
https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheet
https://franklintempletonprod.widen.net/s/bdxrtlljxg/temit_manager_update
https://franklintempletonprod.widen.net/s/6djgk6xknx/temit-holdings-report
/content-kid/kid/en-GB/KID-GB0008829292-GB-en-GB.pdf
https://franklintempletonprod.widen.net/s/flblzqxmcg/temit-investor-disclosure-document.pdf-26-04-2021
https://franklintempletonprod.widen.net/s/l5gmdbf6vp/agm-results-announcement
https://franklintempletonprod.widen.net/s/qmkphz9s5s/agm-shareholder-documentation
https://franklintempletonprod.widen.net/s/b258tgpjb7/agm-uk-voting-guide
https://franklintempletonprod.widen.net/s/c5zhbbnxql/agm-nz-voting-guide
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/5bjq2qkmh5/temit-annual-report-mar-2018
https://franklintempletonprod.widen.net/s/bnx9mfwlzw/temit-annual-report-mar-2017
https://franklintempletonprod.widen.net/s/rfqc7xrnfn/temit-annual-report-mar-2016
https://franklintempletonprod.widen.net/s/zfzxlflxnq/temit-annual-report-mar-2015
https://franklintempletonprod.widen.net/s/dj9zl8rpcm/temit-annual-report-mar-2014
https://franklintempletonprod.widen.net/s/7xshxmkpnh/temit-annual-report-mar-2013
https://franklintempletonprod.widen.net/s/7gwx2qmcdr/temit-annual-report-mar-2012
https://franklintempletonprod.widen.net/s/drpd7gbvxl/temit-annual-report-mar-2011
https://franklintempletonprod.widen.net/s/2pb2kxkgbl/temit-annual-report-mar-2010
https://franklintempletonprod.widen.net/s/g6pdr9hq2d/temit-annual-report-mar-2009
https://franklintempletonprod.widen.net/s/7pvjf6fhl9/temit-annual-report-mar-2008
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://franklintempletonprod.widen.net/s/xwvrncvkj2/temit-half-year-report-sept-2019
https://franklintempletonprod.widen.net/s/lbp5ssv8mc/temit-half-year-report-sept-2018
https://franklintempletonprod.widen.net/s/hltddqhqcf/temit-half-year-report-sept-2017
https://franklintempletonprod.widen.net/s/2tlqxxflgn/temit-half-year-report-sept-2016
https://franklintempletonprod.widen.net/s/lbcgztjjkj/temit-half-year-report-sept-2015
https://franklintempletonprod.widen.net/s/2tjxzgbdvx/temit-half-year-report-sept-2014
https://franklintempletonprod.widen.net/s/gzrpjwb7bf/temit-half-year-report-sept-2013
https://franklintempletonprod.widen.net/s/lxhbdrmc8z/temit-half-year-report-sept-2012
https://franklintempletonprod.widen.net/s/zzpxrrrpmc/temit-half-year-report-sept-2011
https://franklintempletonprod.widen.net/s/zjdd2gn5jc/temit-half-year-report-sept-2010
https://franklintempletonprod.widen.net/s/7sbqfxxkrd/temit-half-year-report-sept-2009
https://franklintempletonprod.widen.net/s/pvswpqkdvb/temit-half-year-report-sept-2008-1
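Since the original question only needs the single retail-investor factsheet link, one way to narrow the output (a minimal sketch, assuming the table markup and the "fact-sheet" slug in the URL stay the same) is to filter the scraped anchors:
# Sketch: keep only hrefs that contain "fact-sheet" (assumed to identify the factsheet link).
factsheet_urls = [
    a['href'] for a in soup.select('table.table-striped tbody tr td a')
    if 'fact-sheet' in (a.get('href') or '')
]
print(factsheet_urls)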

Related

BeautifulSoup.select classname not working

I am trying to find tags by CSS class using BeautifulSoup.
I read the documentation and tried different approaches, but the code below returns new_elem : [].
Could you help me understand what I am doing wrong? Thanks.
import requests
from bs4 import BeautifulSoup
url = "https://solanamonkeysclub.com/#/#mint"
response = requests.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'html.parser')
new_elems = str(soup.select('.ant-card-body'))
print(f'{"new_elem":10} : {new_elems}')
As the URL is dynamic, I used Selenium with BS4 and got the following output:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
url = "https://solanamonkeysclub.com/#/#mint"
driver.get(url)
time.sleep(8)  # give the JavaScript time to render the cards
soup = BeautifulSoup(driver.page_source, 'html.parser')
new_elems = soup.select('.ant-card-body')
for new_elem in new_elems:
    print(f'{"new_elem":10} : {new_elem.text}')
OUTPUT:
new_elem : 0
new_elem : 0
Have you looked at the output at all? You should bring up this page in a browser and do a "view source", or do print(response.text) after you fetch it. The page as delivered contains no HTML elements; the entire page is built dynamically using JavaScript.
You will need to use something like Selenium to scrape it.
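To confirm this for yourself, a quick check (a sketch using only requests, no Selenium) is to fetch the raw HTML and see whether the class you are selecting appears in it at all:
import requests

response = requests.get("https://solanamonkeysclub.com/#/#mint")
# The class is absent from the delivered HTML because the page is built client-side.
print('ant-card-body' in response.text)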

Python - Item Price Web Scraping for Target

I'm trying to get any item's price from the Target website. I have done some examples for this site using Selenium and the Redsky API, but now I tried to write the BS4 code below:
import requests
from bs4 import BeautifulSoup
url = "https://www.target.com/p/ruffles-cheddar-38-sour-cream-potato-chips-2-5oz/-/A-14930847#lnk=sametab"
r= requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
price = soup.find("div",class_= "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
print(price)
But it returns None.
I also tried soup.find("div", {'class': "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp"}).
What am I missing?
I can accept any Selenium or Redsky API code, but my priority is BS4.
The page is dynamic: the data is rendered after the initial request is made. You can use Selenium to load the page and, once it's rendered, pull out the relevant tag. An API, though, is always the preferred way to go if one is available.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
# If you don't want to open a browser, comment out the line above and uncomment below
#options = webdriver.ChromeOptions()
#options.add_argument('headless')
#driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe', options=options)
url = "https://www.target.com/p/ruffles-cheddar-38-sour-cream-potato-chips-2-5oz/-/A-14930847#lnk=sametab"
driver.get(url)
r = driver.page_source
soup = BeautifulSoup(r, "lxml")
price = soup.find("div",class_= "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
print(price.text)
Output:
$1.99
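If you'd rather not rely on the page being ready the instant driver.get() returns, a variant (a sketch, not part of the original answer) is to wait explicitly for the price node before parsing; the div[data-test=product-price] locator used here is the one suggested in the next answer and is assumed to match Target's markup:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get("https://www.target.com/p/ruffles-cheddar-38-sour-cream-potato-chips-2-5oz/-/A-14930847#lnk=sametab")
# Wait up to 10 seconds for the price element to appear before reading the page source.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-test="product-price"]'))
)
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.select_one('div[data-test="product-price"]').text)
driver.quit()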
You are simply using the wrong locator.
Try this
price_css_locator = 'div[data-test=product-price]'
or in XPath style
price_xpath_locator = '//div[@data-test="product-price"]'
With bs4 it should be something like this:
soup.select('div[data-test="product-price"]')
To get the element's text, add .text; note that select() returns a list, so use select_one() for a single element:
price = soup.select_one('div[data-test="product-price"]').text
print(price)
use .text
price = soup.find("div",class_= "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
print(price.text)
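Note that .text only works if the tag was actually found; with a plain requests fetch the price div is missing and find() returns None, so a guarded sketch looks like this:
price = soup.find("div", class_="web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
if price is not None:
    print(price.text)
else:
    print("Price tag not found - the HTML was probably fetched before the JavaScript ran.")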

Scraper not extracting url link

Hi, I am trying to scrape the Amazon URL linked under "View item on Amazon" on this site.
My code is below; I get no results. I'd appreciate any assistance. Thanks.
import requests
from bs4 import BeautifulSoup

url = "https://app.jumpsend.com/deals/230513"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
The Amazon link (https://www.amazon.com/dp/B07MH9DK5B) isn't in the HTML page source. You need to use Selenium in order to read in the HTML of all the elements that are set by JavaScript:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://app.jumpsend.com/deals/230513"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
soup.find('a', attrs={'class': 'deal-modal-link'})['href']
The above code prints out the Amazon link:
'https://www.amazon.com/dp/B07MH9DK5B'
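If you don't want a visible browser window, a headless variant of the same scrape (a sketch; assumes geckodriver is on your PATH and that Firefox accepts the -headless flag) would be:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
driver.get("https://app.jumpsend.com/deals/230513")
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find('a', attrs={'class': 'deal-modal-link'})['href'])
driver.quit()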

Unable to extract download link using beautifulsoup

I am trying to fetch the download CSV file link from this: https://patents.google.com/?assignee=intel
This is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://patents.google.com/?assignee=intel")
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('a', class_='style-scope search-results')
soup.find_all('a', class_='style-scope')
But the last two lines return an empty list. What am I missing here?
Even this does not return anything:
soup.find(id="resultsLayout")
That's because the elements are being generated by JavaScript. You can use Selenium to get the fully rendered page source.
Here's an edited version of your code using Selenium.
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://patents.google.com/?assignee=intel')
page = browser.page_source
browser.quit()
soup = BeautifulSoup(page, 'html.parser')
soup.find_all('a', class_='style-scope search-results')
soup.find_all('a', class_='style-scope')
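If you specifically want the CSV download link rather than every anchor, a follow-up sketch (assuming the rendered link's visible text starts with "Download" - verify this in a browser) is:
# Narrow the anchors to the one whose visible text looks like the download control.
csv_links = [a.get('href') for a in soup.find_all('a')
             if a.get_text(strip=True).lower().startswith('download')]
print(csv_links)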
Let me know if you need clarifications. Thanks!

BeautifulSoup doesn't shows all the elements in tag

I'm trying to parse this website and get information about the car in content-box card__body with BeautifulSoup.find, but it doesn't find all the classes. I also tried webdriver.PhantomJS(), but that showed nothing either.
Here is my code:
from bs4 import BeautifulSoup
from selenium import webdriver
url='http://www.autobody.ru/catalog/10230/217881/'
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html5lib')
JTitems = soup.find("div", attrs={"class":"content-box__strong red-text card__price"})
JTitems
or
w = soup.find("div", attrs={"class":"content-box card__body"})
w
Why doesn't this approach work? What should I do to get all the information about the car? I am using Python 2.7.
Find the table that holds the information you need, then find all the tr elements and loop through them to get the text.
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.autobody.ru/catalog/10230/217881/'
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')

price = soup.find('div', class_='price').get_text()
print(price)

for tr in soup.find('table', class_='tech').find_all('tr'):
    print(tr.get_text())

browser.close()
browser.quit()
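If you want the spec rows as structured data instead of raw text, a small follow-up sketch (assuming each row is a label cell followed by a value cell - check the live markup) could be:
specs = {}
for tr in soup.find('table', class_='tech').find_all('tr'):
    cells = [c.get_text(strip=True) for c in tr.find_all(['th', 'td'])]
    if len(cells) >= 2:
        specs[cells[0]] = cells[1]
print(specs)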
