Scraper not extracting URL link - Python

Hi, I am trying to scrape the Amazon URL linked under "View item on Amazon" on this site.
My code is below, but I get zero results. I'd appreciate any assistance. Thanks.
import requests
from bs4 import BeautifulSoup  # missing import added

url = "https://app.jumpsend.com/deals/230513"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))

The Amazon link (https://www.amazon.com/dp/B07MH9DK5B) isn't in the HTML page source. You need to use Selenium in order to read the HTML of the elements that are set by JavaScript:
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://app.jumpsend.com/deals/230513"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('a', attrs={'class': 'deal-modal-link'})['href'])
The above code prints out the Amazon link:
https://www.amazon.com/dp/B07MH9DK5B
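As a side note, page_source is grabbed as soon as get() returns, so on a slow connection the JavaScript-rendered link may not exist yet. A minimal sketch of the same scrape with an explicit wait (assuming the same deal-modal-link class):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://app.jumpsend.com/deals/230513")

# Wait up to 10 seconds for the JavaScript-rendered link to appear
link = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a.deal-modal-link"))
)
print(link.get_attribute("href"))
driver.quit()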

Related

Unable to parse href using beautifulsoup Python

I am trying to scrape the following webpage to get a specific href using BS4. They've just changed the page layout, and because of that I am unable to parse it correctly. I hope someone can help.
Webpage trying to scrape: https://www.temit.co.uk/investor/resources/temit-literature
Tag trying to get: href="https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheet"
Currently I am using the following code (BS4); however, I am unable to get the specific element anymore.
url = "https://www.temit.co.uk/investor/resources/temit-literature"
page = requests.get(url) # Requests website
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find_all('div', attrs={'class':'row ng-star-inserted'})
url_TEM = table[1].find_all('a')[0].get('href')
url_TEM = 'https://www.temit.co.uk' + url_TEM
The page is rendered dynamically, which is why I use Selenium with bs4, getting the desired output as follows:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
url = "https://www.temit.co.uk/investor/resources/temit-literature"
driver.get(url)
time.sleep(8)  # wait for the JavaScript-rendered content to load
soup = BeautifulSoup(driver.page_source, 'html.parser')
table_urls = soup.select('table.table-striped tbody tr td a')
for table_url in table_urls:
    url = table_url['href']
    print(url)
Output:
https://franklintempletonprod.widen.net/s/gd2tmrc8cl/final-temit-annual-report-31-march-2021
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://www.londonstockexchange.com/stock/TEM/templeton-emerging-markets-investment-trust-plc/analysis
https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheet
https://franklintempletonprod.widen.net/s/bdxrtlljxg/temit_manager_update
https://franklintempletonprod.widen.net/s/6djgk6xknx/temit-holdings-report
/content-kid/kid/en-GB/KID-GB0008829292-GB-en-GB.pdf
https://franklintempletonprod.widen.net/s/flblzqxmcg/temit-investor-disclosure-document.pdf-26-04-2021
https://franklintempletonprod.widen.net/s/l5gmdbf6vp/agm-results-announcement
https://franklintempletonprod.widen.net/s/qmkphz9s5s/agm-shareholder-documentation
https://franklintempletonprod.widen.net/s/b258tgpjb7/agm-uk-voting-guide
https://franklintempletonprod.widen.net/s/c5zhbbnxql/agm-nz-voting-guide
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/5bjq2qkmh5/temit-annual-report-mar-2018
https://franklintempletonprod.widen.net/s/bnx9mfwlzw/temit-annual-report-mar-2017
https://franklintempletonprod.widen.net/s/rfqc7xrnfn/temit-annual-report-mar-2016
https://franklintempletonprod.widen.net/s/zfzxlflxnq/temit-annual-report-mar-2015
https://franklintempletonprod.widen.net/s/dj9zl8rpcm/temit-annual-report-mar-2014
https://franklintempletonprod.widen.net/s/7xshxmkpnh/temit-annual-report-mar-2013
https://franklintempletonprod.widen.net/s/7gwx2qmcdr/temit-annual-report-mar-2012
https://franklintempletonprod.widen.net/s/drpd7gbvxl/temit-annual-report-mar-2011
https://franklintempletonprod.widen.net/s/2pb2kxkgbl/temit-annual-report-mar-2010
https://franklintempletonprod.widen.net/s/g6pdr9hq2d/temit-annual-report-mar-2009
https://franklintempletonprod.widen.net/s/7pvjf6fhl9/temit-annual-report-mar-2008
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://franklintempletonprod.widen.net/s/xwvrncvkj2/temit-half-year-report-sept-2019
https://franklintempletonprod.widen.net/s/lbp5ssv8mc/temit-half-year-report-sept-2018
https://franklintempletonprod.widen.net/s/hltddqhqcf/temit-half-year-report-sept-2017
https://franklintempletonprod.widen.net/s/2tlqxxflgn/temit-half-year-report-sept-2016
https://franklintempletonprod.widen.net/s/lbcgztjjkj/temit-half-year-report-sept-2015
https://franklintempletonprod.widen.net/s/2tjxzgbdvx/temit-half-year-report-sept-2014
https://franklintempletonprod.widen.net/s/gzrpjwb7bf/temit-half-year-report-sept-2013
https://franklintempletonprod.widen.net/s/lxhbdrmc8z/temit-half-year-report-sept-2012
https://franklintempletonprod.widen.net/s/zzpxrrrpmc/temit-half-year-report-sept-2011
https://franklintempletonprod.widen.net/s/zjdd2gn5jc/temit-half-year-report-sept-2010
https://franklintempletonprod.widen.net/s/7sbqfxxkrd/temit-half-year-report-sept-2009
https://franklintempletonprod.widen.net/s/pvswpqkdvb/temit-half-year-report-sept-2008-1
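As an aside, the fixed time.sleep(8) works but wastes eight seconds even when the page loads faster. A minimal sketch of the same idea with an explicit wait instead (assuming the same driver object and the table.table-striped selector used above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until at least one link inside the striped table is present
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, "table.table-striped tbody tr td a")
    )
)
soup = BeautifulSoup(driver.page_source, 'html.parser')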

BeautifulSoup scraping binance BTC_USDT page and not every <div> is found

I'm trying to scrape the BTC_USDT page from Binance.
In the resulting HTML I couldn't find the tags that I am looking for.
For example, the live price is under <div class="subPrice css-4lmq3e">; it is found when inspected but not when scraped, although I could find some other tags like id="__APP".
Here is my code:
import requests
from bs4 import BeautifulSoup

url = "https://www.binance.com/en/trade/BTC_USDT?type=spot"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "lxml")
div = soup.find("div", {"class": "subPrice css-4lmq3e"})
content = f'Content : {str(div)}'
print(content)
You can get the current price of BTC_USDT from Binance using their API:
https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT
https://github.com/binance/binance-spot-api-docs/blob/master/rest-api.md
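For instance, a minimal sketch of calling that endpoint with requests (the response is a small JSON object with symbol and price fields):

import requests

# Query Binance's public REST API for the latest BTC/USDT price
resp = requests.get(
    "https://api.binance.com/api/v3/ticker/price",
    params={"symbol": "BTCUSDT"},
)
data = resp.json()  # e.g. {"symbol": "BTCUSDT", "price": "43000.12000000"}
print(data["price"])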

Python Href scraping

I'm trying to loop over hrefs and get the URL. I've managed to extract the href, but I need the full URL to get into the link. This is my code at the minute:
import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('http://www.harness.org.au/racing/results/?activeTab=tab')
soup = BeautifulSoup(webpage_response.content, "html.parser")
# only finding one track
# soup.table to find all links for days racing
harness_table = soup.table
# scrapes a href that is an incomplete URL that I'm trying to get to
for link in soup.select(".meetingText > a"):
    link.insert(0, "http://www.harness.org.au")
    webpage = requests.get(link)
    new_soup = BeautifulSoup(webpage.content, "html.parser")
    # work through table to get links to tracks
    print(new_soup)
You can store the base URL of the website in a variable, and then once you get the href from the link you can join the two to create the next URL.
import requests
from bs4 import BeautifulSoup

base_url = "http://www.harness.org.au"
webpage_response = requests.get('http://www.harness.org.au/racing/results/?activeTab=tab')
soup = BeautifulSoup(webpage_response.content, "html.parser")
# only finding one track
# soup.table to find all links for days racing
harness_table = soup.table
# scrapes a href that is an incomplete URL
for link in soup.select(".meetingText > a"):
    webpage = requests.get(base_url + link["href"])
    new_soup = BeautifulSoup(webpage.content, "html.parser")
    # work through table to get links to tracks
    print(new_soup)
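A slightly more robust variant of the same idea, assuming some hrefs may be relative or already absolute, is to let the standard library do the joining:

from urllib.parse import urljoin

# urljoin handles absolute hrefs, leading slashes and relative paths correctly
for link in soup.select(".meetingText > a"):
    full_url = urljoin(base_url, link["href"])
    print(full_url)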
Try this solution. Maybe you'll like this library.
from simplified_scrapy import SimplifiedDoc, req

url = 'http://www.harness.org.au/racing/results/?activeTab=tab'
html = req.get(url)
doc = SimplifiedDoc(html)
links = [doc.absoluteUrl(url, ele.a['href']) for ele in doc.selects('td.meetingText')]
print(links)
Result:
['http://www.harness.org.au/racing/fields/race-fields/?mc=BA040320', 'http://www.harness.org.au/racing/fields/race-fields/?mc=BH040320', 'http://www.harness.org.au/racing/fields/race-fields/?mc=RE040320']

Selenium: different content of url between selenium result and browser

I'm trying to parse this URL.
First, I tried to use requests with bs4, but the resulting page differed from the content in the browser.
cont = requests.get(path).content
soup = BeautifulSoup(cont, "html.parser")
print(soup.prettify())
Next I tried to use Selenium:
def render_page(path):
    driver = webdriver.PhantomJS()
    driver.get(path)
    time.sleep(3)
    r = driver.page_source
    return r

r = render_page(path)
soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())
But it returns different content as well.
After that I tried adding this to my code:
js_code = "return document.getElementsByTagName('html')[0].innerHTML"
your_elements = driver.execute_script(js_code)
but it didn't help.
So, is there any way to get the same page content as in the browser, using requests, Selenium, or some other parser?
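One thing worth checking, as a sketch rather than a definitive fix: PhantomJS is deprecated and renders many modern pages differently from a real browser, so a headless Chrome with an explicit wait may come closer to what you see in the browser:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")  # render with real Chrome, no window
driver = webdriver.Chrome(options=options)
driver.get(path)

# Wait until the body has been populated by JavaScript instead of sleeping
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "body"))
)
html = driver.page_source
driver.quit()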

How to get Html code after crawling with python

https://plus.google.com/s/casasgrandes27%40gmail.com/top
I need to crawl the following page with Python, but I need its rendered HTML, not the generic page source of the link.
For example:
Open the link plus.google.com/s/casasgrandes27%40gmail.com/top without logging in; the second-to-last thumbnail will be "G Suite".
<div class="Wbuh5e" jsname="r4nke">G Suite</div>
I am unable to find the above line of HTML code after executing this Python code:
from bs4 import BeautifulSoup
import requests

r = requests.get("https://plus.google.com/s/casasgrandes27%40gmail.com/top")
data = r.text
soup = BeautifulSoup(data, "lxml")
print(soup)
To get the soup object, try the following:
import requests
from bs4 import BeautifulSoup

url = "https://plus.google.com/s/casasgrandes27%40gmail.com/top"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can try this code to read an HTML page:
import urllib.request

urls = "https://plus.google.com/s/casasgrandes27%40gmail.com/top"
html_file = urllib.request.urlopen(urls)
html_text = html_file.read().decode('utf-8')  # decode the response bytes to str
print(html_text)
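Note that neither requests nor urllib executes JavaScript, which is why the "G Suite" markup never shows up in either answer's output. A minimal sketch of rendering the page in a real browser first, along the lines of the Selenium answers above (assuming geckodriver is installed):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://plus.google.com/s/casasgrandes27%40gmail.com/top")
# The jsname attribute comes from the markup quoted in the question
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find("div", {"jsname": "r4nke"}))
driver.quit()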
