I dont know why beautiful soup is nt scraping this webpage

I dont know why beautiful soup is nt scraping this webpage - python

I cant get beautiful soup to locate a certain element
I have tried changing the parser it uses and what its searching for
import urllib2
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
url= driver.current_url
#sets the array counter
i = 1
#opens the url
page = urllib2.urlopen(url)
#reads the url
soup = BeautifulSoup(page, 'lxml')
print(soup)
h2 = soup.find_all("h3", {"class": "nbf_carparking_title nbf_tooltip"}) [i]
#sets strong value
strong = soup.find_all("div", {"class": "nbf_fancy_product_results_totalcost2"})[i]
#prints h2 (text)
print(h2).text
print(strong).text
I am expecting it to bring up the prices of certain products but instead it prints nothing
the url is https://www.weholiday.co.uk/travelreq.php?cd4504d8-b2a2-11e9-8114-3ca82a238f71

Related

Unable to parse href using beautifulsoup Python

I am trying to webscrape the following webpage to get a specific href using BS4. They've just changed the page layout and due to that I am unable to parse it correctly. Hope anyone can help.
Webpage trying to scrape: https://www.temit.co.uk/investor/resources/temit-literature
Tag trying to get: href="https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheet"
Currently I am using the following code (BS4), however, I am unable to get the specific element anymore.
url = "https://www.temit.co.uk/investor/resources/temit-literature"
page = requests.get(url) # Requests website
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find_all('div', attrs={'class':'row ng-star-inserted'})
url_TEM = table[1].find_all('a')[0].get('href')
url_TEM = 'https://www.temit.co.uk' + url_TEM

The url is dynamic that's why I use selenium with bs4 and getting the desired output as follows:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
url = "https://www.temit.co.uk/investor/resources/temit-literature"
driver.get(url)
time.sleep(8)
soup = BeautifulSoup(driver.page_source, 'html.parser')
table_urls = soup.select('table.table-striped tbody tr td a')
for table_url in table_urls:
url = table_url['href']
print(url)
Output:
https://franklintempletonprod.widen.net/s/gd2tmrc8cl/final-temit-annual-report-31-march-2021
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://www.londonstockexchange.com/stock/TEM/templeton-emerging-markets-investment-trust-plc/analysis
https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheethttps://franklintempletonprod.widen.net/s/bdxrtlljxg/temit_manager_update
https://franklintempletonprod.widen.net/s/6djgk6xknx/temit-holdings-report
/content-kid/kid/en-GB/KID-GB0008829292-GB-en-GB.pdf
https://franklintempletonprod.widen.net/s/flblzqxmcg/temit-investor-disclosure-document.pdf-26-04-2021
https://franklintempletonprod.widen.net/s/l5gmdbf6vp/agm-results-announcement
https://franklintempletonprod.widen.net/s/qmkphz9s5s/agm-shareholder-documentation
https://franklintempletonprod.widen.net/s/b258tgpjb7/agm-uk-voting-guide
https://franklintempletonprod.widen.net/s/c5zhbbnxql/agm-nz-voting-guide
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/5bjq2qkmh5/temit-annual-report-mar-2018
https://franklintempletonprod.widen.net/s/bnx9mfwlzw/temit-annual-report-mar-2017
https://franklintempletonprod.widen.net/s/rfqc7xrnfn/temit-annual-report-mar-2016
https://franklintempletonprod.widen.net/s/zfzxlflxnq/temit-annual-report-mar-2015
https://franklintempletonprod.widen.net/s/dj9zl8rpcm/temit-annual-report-mar-2014
https://franklintempletonprod.widen.net/s/7xshxmkpnh/temit-annual-report-mar-2013
https://franklintempletonprod.widen.net/s/7gwx2qmcdr/temit-annual-report-mar-2012
https://franklintempletonprod.widen.net/s/drpd7gbvxl/temit-annual-report-mar-2011
https://franklintempletonprod.widen.net/s/2pb2kxkgbl/temit-annual-report-mar-2010
https://franklintempletonprod.widen.net/s/g6pdr9hq2d/temit-annual-report-mar-2009
https://franklintempletonprod.widen.net/s/7pvjf6fhl9/temit-annual-report-mar-2008
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://franklintempletonprod.widen.net/s/xwvrncvkj2/temit-half-year-report-sept-2019
https://franklintempletonprod.widen.net/s/lbp5ssv8mc/temit-half-year-report-sept-2018
https://franklintempletonprod.widen.net/s/hltddqhqcf/temit-half-year-report-sept-2017
https://franklintempletonprod.widen.net/s/2tlqxxflgn/temit-half-year-report-sept-2016
https://franklintempletonprod.widen.net/s/lbcgztjjkj/temit-half-year-report-sept-2015
https://franklintempletonprod.widen.net/s/2tjxzgbdvx/temit-half-year-report-sept-2014
https://franklintempletonprod.widen.net/s/gzrpjwb7bf/temit-half-year-report-sept-2013
https://franklintempletonprod.widen.net/s/lxhbdrmc8z/temit-half-year-report-sept-2012
https://franklintempletonprod.widen.net/s/zzpxrrrpmc/temit-half-year-report-sept-2011
https://franklintempletonprod.widen.net/s/zjdd2gn5jc/temit-half-year-report-sept-2010
https://franklintempletonprod.widen.net/s/7sbqfxxkrd/temit-half-year-report-sept-2009
https://franklintempletonprod.widen.net/s/pvswpqkdvb/temit-half-year-report-sept-2008-1

remove html tags from string using bs4

I'm trying to make a program to read the price of bitcoin from a website. I used bs4 and was bale to get the section I was looking for but its surrounded by the html tags.
output: <div class="priceValue___11gHJ">$52,693.18</div>
I just want the price and i have tried the regex and lxml methods, but I keep getting errors
import requests
from bs4 import BeautifulSoup
#get url
url = "https://coinmarketcap.com/currencies/bitcoin/"
r = requests.get(url)
#parse html
soup = BeautifulSoup(r.content, 'html5lib')
#find div
find_div = soup.find('div', {"class": "priceValue___11gHJ"})
print(find_div)

You need to do .text:
import requests
from bs4 import BeautifulSoup
#get url
url = "https://coinmarketcap.com/currencies/bitcoin/"
r = requests.get(url)
#parse html
soup = BeautifulSoup(r.content, 'html5lib')
#find div
find_div = soup.find('div', {"class": "priceValue___11gHJ"})
print(find_div.text) # $52,693.18

Scraping tracks' links from YouTube playlist with Beautiful Soup

I am trying to scrape all links for tracks from my playlist.
This is my code
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
playlist = 'minimal_house'
url = 'https://www.youtube.com/channel/UCt2GxiTBN_RiE-cbP0cmk5Q/playlists'
html = urlopen(url)
soup = BeautifulSoup(html , 'html.parser')
tracks = soup.find(title = playlist).get('href')
print(tracks)
url = url + tracks
print(url)
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a',attrs={'class':'yt-simple-endpoint style-scope ytd-playlist-panel-video-renderer'})
print(links)
I couldn't scrape 'a'; nor by id; nor by class name.

it's my messy code that works for me:
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
playlist = 'minimal_house'
url = 'https://www.youtube.com/channel/UCt2GxiTBN_RiE-cbP0cmk5Q/playlists'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
tracks = soup.find('a', attrs={'title': playlist}).get('href')
print(tracks)
url = 'https://www.youtube.com' + str(tracks)
print(url)
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')
links = set([link.get('href') for link in links if link.get('href').count('watch')])
print(links)
since class names change base on device request, it's better to get all links in this case.
and you need to use selenium to scroll down to fetch all the list.

Scraper not extracting url link:

Hi I am trying to scrape the Amazon url address on this site linked under "View item on Amazon".
My code is below, I get zero response. Appreciate any assistance. Thanks
import requests
url = "https://app.jumpsend.com/deals/230513"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
print(tag.get('href'))

The Amazon link (https://www.amazon.com/dp/B07MH9DK5B) isn't in the html page source. You need to use Selenium in order to read-in the html of all the elements that are set by Java script:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://app.jumpsend.com/deals/230513"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
soup.find('a', attrs={'class': 'deal-modal-link'})['href']
The above code prints out the Amazon link:
'https://www.amazon.com/dp/B07MH9DK5B'

BeautifulSoup doesn't shows all the elements in tag

I'm trying to parse this website and get information about auto in content-box card__body with BeautifulSoup.find but it doesn't find all classes. I also tried webdriver.PhantomJS(), but it also showed nothing.
Here is my code:
from bs4 import BeautifulSoup
from selenium import webdriver
url='http://www.autobody.ru/catalog/10230/217881/'
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html5lib')
JTitems = soup.find("div", attrs={"class":"content-box__strong red-text card__price"})
JTitems
or
w = soup.find("div", attrs={"class":"content-box card__body"})
w
Why doesn't this approach work? What should I do to get all the information about auto? I am using Python 2.7.

Find the table where your necessary info are. Then find all the tr and loop through them to get the texts.
from bs4 import BeautifulSoup
from selenium import webdriver
url='http://www.autobody.ru/catalog/10230/217881/'
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
price = soup.find('div', class_='price').get_text()
print(price)
for tr in soup.find('table', class_='tech').find_all('tr'):
print(tr.get_text())
browser.close()
browser.quit()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

I dont know why beautiful soup is nt scraping this webpage - python

Related

Unable to parse href using beautifulsoup Python

remove html tags from string using bs4

Scraping tracks' links from YouTube playlist with Beautiful Soup

Scraper not extracting url link:

BeautifulSoup doesn't shows all the elements in tag

Categories

Resources