I'm trying to parse this URL.
First, I tried to use requests with bs4, but the resulting page differed from the content shown in the browser.
import requests
from bs4 import BeautifulSoup

cont = requests.get(path).content
soup = BeautifulSoup(cont, "html.parser")  # parse the bytes we actually fetched
print(soup.prettify())
Next, I tried to use Selenium:
import time
from selenium import webdriver

def render_page(path):
    driver = webdriver.PhantomJS()
    driver.get(path)
    time.sleep(3)  # crude wait for the JavaScript to finish rendering
    r = driver.page_source
    driver.quit()
    return r
r = render_page(path)
soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())
But it also returns different content.
After that, I tried adding this to my code:
# getElementsByTagName returns a collection, so index it before reading innerHTML
js_code = "return document.getElementsByTagName('html')[0].innerHTML"
your_elements = driver.execute_script(js_code)
but it didn't help.
So, is there any way to get the page content exactly as in the browser, using requests, Selenium, or perhaps some other parser?
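PhantomJS is deprecated and often renders pages differently from a current browser, which may explain part of the mismatch. Below is a minimal sketch of the same approach with headless Chrome (assuming Chrome and a matching chromedriver are installed; path stands in for the URL from the question):
from bs4 import BeautifulSoup
from selenium import webdriver
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get(path)            # `path` is the URL you are trying to parse
time.sleep(3)               # give the page's JavaScript time to render
html = driver.page_source   # the DOM after rendering, as the browser sees it
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())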
I am trying to web-scrape the following webpage to get a specific href using BS4. They've just changed the page layout, and because of that I can no longer parse it correctly. I hope someone can help.
Webpage trying to scrape: https://www.temit.co.uk/investor/resources/temit-literature
Tag trying to get: href="https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheet"
Currently I am using the following code (BS4); however, I am unable to get the specific element anymore.
import requests
from bs4 import BeautifulSoup

url = "https://www.temit.co.uk/investor/resources/temit-literature"
page = requests.get(url)  # request the website
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find_all('div', attrs={'class': 'row ng-star-inserted'})
url_TEM = table[1].find_all('a')[0].get('href')
url_TEM = 'https://www.temit.co.uk' + url_TEM
The page content is loaded dynamically, which is why I use Selenium together with bs4 and get the desired output as follows:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
url = "https://www.temit.co.uk/investor/resources/temit-literature"
driver.get(url)
time.sleep(8)  # wait for the JavaScript-rendered table to load

soup = BeautifulSoup(driver.page_source, 'html.parser')
table_urls = soup.select('table.table-striped tbody tr td a')
for table_url in table_urls:
    url = table_url['href']
    print(url)
Output:
https://franklintempletonprod.widen.net/s/gd2tmrc8cl/final-temit-annual-report-31-march-2021
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://www.londonstockexchange.com/stock/TEM/templeton-emerging-markets-investment-trust-plc/analysis
https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheet
https://franklintempletonprod.widen.net/s/bdxrtlljxg/temit_manager_update
https://franklintempletonprod.widen.net/s/6djgk6xknx/temit-holdings-report
/content-kid/kid/en-GB/KID-GB0008829292-GB-en-GB.pdf
https://franklintempletonprod.widen.net/s/flblzqxmcg/temit-investor-disclosure-document.pdf-26-04-2021
https://franklintempletonprod.widen.net/s/l5gmdbf6vp/agm-results-announcement
https://franklintempletonprod.widen.net/s/qmkphz9s5s/agm-shareholder-documentation
https://franklintempletonprod.widen.net/s/b258tgpjb7/agm-uk-voting-guide
https://franklintempletonprod.widen.net/s/c5zhbbnxql/agm-nz-voting-guide
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/5bjq2qkmh5/temit-annual-report-mar-2018
https://franklintempletonprod.widen.net/s/bnx9mfwlzw/temit-annual-report-mar-2017
https://franklintempletonprod.widen.net/s/rfqc7xrnfn/temit-annual-report-mar-2016
https://franklintempletonprod.widen.net/s/zfzxlflxnq/temit-annual-report-mar-2015
https://franklintempletonprod.widen.net/s/dj9zl8rpcm/temit-annual-report-mar-2014
https://franklintempletonprod.widen.net/s/7xshxmkpnh/temit-annual-report-mar-2013
https://franklintempletonprod.widen.net/s/7gwx2qmcdr/temit-annual-report-mar-2012
https://franklintempletonprod.widen.net/s/drpd7gbvxl/temit-annual-report-mar-2011
https://franklintempletonprod.widen.net/s/2pb2kxkgbl/temit-annual-report-mar-2010
https://franklintempletonprod.widen.net/s/g6pdr9hq2d/temit-annual-report-mar-2009
https://franklintempletonprod.widen.net/s/7pvjf6fhl9/temit-annual-report-mar-2008
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://franklintempletonprod.widen.net/s/xwvrncvkj2/temit-half-year-report-sept-2019
https://franklintempletonprod.widen.net/s/lbp5ssv8mc/temit-half-year-report-sept-2018
https://franklintempletonprod.widen.net/s/hltddqhqcf/temit-half-year-report-sept-2017
https://franklintempletonprod.widen.net/s/2tlqxxflgn/temit-half-year-report-sept-2016
https://franklintempletonprod.widen.net/s/lbcgztjjkj/temit-half-year-report-sept-2015
https://franklintempletonprod.widen.net/s/2tjxzgbdvx/temit-half-year-report-sept-2014
https://franklintempletonprod.widen.net/s/gzrpjwb7bf/temit-half-year-report-sept-2013
https://franklintempletonprod.widen.net/s/lxhbdrmc8z/temit-half-year-report-sept-2012
https://franklintempletonprod.widen.net/s/zzpxrrrpmc/temit-half-year-report-sept-2011
https://franklintempletonprod.widen.net/s/zjdd2gn5jc/temit-half-year-report-sept-2010
https://franklintempletonprod.widen.net/s/7sbqfxxkrd/temit-half-year-report-sept-2009
https://franklintempletonprod.widen.net/s/pvswpqkdvb/temit-half-year-report-sept-2008-1
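If you only need the specific fact-sheet href from the question, you can filter the anchors collected above instead of printing them all. A small sketch, assuming the 'fact-sheet' fragment of the target URL stays stable:
# hypothetical filter keyed on a fragment of the wanted URL
fact_sheet_urls = [a['href'] for a in table_urls if 'fact-sheet' in a['href']]
print(fact_sheet_urls[0])  # https://franklintempletonprod.widen.net/s/c667rc7chx/...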
I'm trying to scrape a link in the video description on YouTube, but the list always returns empty.
I've tried changing the tag I scrape from, but there is no change in either the output or the error message.
Here's the code I'm using:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/watch?v=gqUqGaXipe8').text
soup = BeautifulSoup(source, 'lxml')
link = [i['href'] for i in soup.findAll('a', class_='yt-simple-endpoint style-scope yt-formatted-string', href=True)]
print(link)
What is wrong, and how can I solve it?
In your case, requests doesn't return the full HTML structure of the page. YouTube fills in the data with JavaScript, so the page must be run through a real browser to get its rendered source, for example headless Chrome via the Selenium library. Here is the general solution:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

url = "https://www.youtube.com/watch?v=Oh1nqnZAKxw"
driver.get(url)
time.sleep(2)  # let the description render

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

link = [i['href'] for i in soup.select('div#meta div#description [href]')]
print(link)
I can't get Beautiful Soup to locate a certain element.
I have tried changing the parser it uses and what it's searching for.
import urllib2
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
url = driver.current_url  # the URL of whatever page the driver currently shows

# sets the index into the result lists
i = 1

# opens and reads the url
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
print(soup)

# the product title and its total cost
h2 = soup.find_all("h3", {"class": "nbf_carparking_title nbf_tooltip"})[i]
strong = soup.find_all("div", {"class": "nbf_fancy_product_results_totalcost2"})[i]

# prints the text of both elements
print(h2.text)
print(strong.text)
I am expecting it to bring up the prices of certain products, but instead it prints nothing.
The URL is https://www.weholiday.co.uk/travelreq.php?cd4504d8-b2a2-11e9-8114-3ca82a238f71
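This page also fills in its results with JavaScript, so urllib2 only ever sees the empty shell; on top of that, the code never calls driver.get(), so driver.current_url is still about:blank. A sketch of the Selenium approach used elsewhere on this page (assuming the class names from the question are still present in the rendered markup):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.weholiday.co.uk/travelreq.php?cd4504d8-b2a2-11e9-8114-3ca82a238f71'
driver = webdriver.Firefox()
driver.get(url)    # actually navigate to the page
time.sleep(5)      # crude wait for the JavaScript results to load

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

titles = soup.find_all("h3", {"class": "nbf_carparking_title nbf_tooltip"})
costs = soup.find_all("div", {"class": "nbf_fancy_product_results_totalcost2"})
for title, cost in zip(titles, costs):
    print(title.get_text(), cost.get_text())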
Hi, I am trying to scrape the Amazon URL on this site, linked under "View item on Amazon".
My code is below; I get zero response. I'd appreciate any assistance. Thanks.
import requests
from bs4 import BeautifulSoup

url = "https://app.jumpsend.com/deals/230513"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')

tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
The Amazon link (https://www.amazon.com/dp/B07MH9DK5B) isn't in the HTML page source. You need to use Selenium in order to read the HTML of the elements that are set by JavaScript:
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://app.jumpsend.com/deals/230513"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('a', attrs={'class': 'deal-modal-link'})['href'])
The above code prints out the Amazon link:
'https://www.amazon.com/dp/B07MH9DK5B'
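If you don't want a browser window popping up, the same approach should also work headlessly. A small variation, assuming a reasonably recent Selenium and geckodriver:
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')   # run Firefox without a visible window
driver = webdriver.Firefox(options=options)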
I'm trying to parse this website and get information about the car in the content-box card__body block with BeautifulSoup.find, but it doesn't find all the classes. I also tried webdriver.PhantomJS(), but it likewise showed nothing.
Here is my code:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.autobody.ru/catalog/10230/217881/'
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html5lib')

JTitems = soup.find("div", attrs={"class": "content-box__strong red-text card__price"})
print(JTitems)
or
w = soup.find("div", attrs={"class": "content-box card__body"})
print(w)
Why doesn't this approach work? What should I do to get all the information about the car? I am using Python 2.7.
Find the table that holds the information you need, then find all the tr elements and loop through them to get the text.
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.autobody.ru/catalog/10230/217881/'
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')

price = soup.find('div', class_='price').get_text()
print(price)

# each row of the specs table holds one attribute of the car
for tr in soup.find('table', class_='tech').find_all('tr'):
    print(tr.get_text())

browser.quit()  # quit() alone is enough; it also closes the window
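One note on the time.sleep() calls used throughout these answers: a fixed sleep is either wasted time or too short on a slow connection. A sketch of an explicit wait instead, keyed on the table.tech selector from the answer above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('http://www.autobody.ru/catalog/10230/217881/')

# block for up to 10 seconds until the specs table is actually in the DOM
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table.tech'))
)
html = browser.page_source
browser.quit()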