I am currently trying to scrape live stock market data from the Yahoo Finance page, using bs4. My current issue is that whenever I run my script, it does not update to reflect the current price of the stock. If anybody has any advice on how to fix that, it would be appreciated.
import requests
from bs4 import BeautifulSoup
while True:
    page = requests.get("https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X")
    soup = BeautifulSoup(page.text, "html.parser")
    price = soup.find("div", {"class": "My(6px) Pos(r) smartphone_Mt(6px)"}).find("span").text
    print(price)
NOT POSSIBLE WITH BS4 ALONE
This website in particular uses JavaScript to update the page; requests/urllib only fetch the initial HTML, not the JavaScript- or AJAX-rendered content. PhantomJS or Selenium drive a real (optionally headless) browser that can run the page's JavaScript, which makes dynamic websites scrapable. Try using one of those :)
Using Selenium, it can be done like this:
import time
from selenium import webdriver               # the browser-automation library
from bs4 import BeautifulSoup as soup

# We are going to use the Chrome browser
chrome_options = webdriver.ChromeOptions()
# Hide the Chrome window (headless mode)
chrome_options.add_argument("--headless")
# Start Chrome with the options we need; executable_path points at chromedriver
driver = webdriver.Chrome(chrome_options=chrome_options,
                          executable_path='C:/Users/shary/Downloads/chromedriver.exe')
# The URL we need to open
url = 'https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X'
# Access the url .. this opens the page just as you would open it in Chrome
driver.get(url)
while True:
    # Re-read the rendered HTML on every pass .. so you get the changing price
    html = driver.page_source
    page_soup = soup(html, features="lxml")
    price = page_soup.find("div", {"class": "D(ib) Mend(20px)"}).text
    print(price)
    time.sleep(5)
Not the best comments, but I hope you will understand it :) Otherwise, watch a YouTube tutorial to get a proper idea of what a Selenium bot does.
Hope this helps. It's working perfectly for me :) Accept this answer if it helps you.
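Note: on recent Selenium releases (4.x), executable_path and chrome_options are deprecated. A minimal sketch of the equivalent setup, assuming Selenium Manager (4.6+) can locate chromedriver for you:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
# No executable_path needed: Selenium Manager finds the driver
driver = webdriver.Chrome(options=options)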
from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver

driver = webdriver.Chrome(r'C:\chromedriver.exe')
url = 'https://www.sambav.com/hyderabad/doctors'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
for links in soup.find_all('div', class_='sambavdoctorname'):
    link = links.find('a')
    print(link['href'])
driver.close()
I am trying to scrape this page; the URL stays the same across all of the pages. I am trying to extract the links from all of the pages, but it gives no output and shows no error; the program just ends.
If you inspect that website with your browser's developer tools (Chrome, Firefox, whatever), you will see that before the page renders, it fetches data from a few sources. One of those sources is "https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=". Your code can be simplified (and there is no need to use Selenium):
import requests

r = requests.get('https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=')
BASE_URL_DOCTOR = 'https://www.sambav.com/hyderabad/doctor/'
for item in r.json():
    print(BASE_URL_DOCTOR + item['uniqueName'])
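A slightly more defensive variant (just a sketch) checks the HTTP status and guards against records that lack the key, in case the endpoint's response shape ever changes:

import requests

BASE_URL_DOCTOR = 'https://www.sambav.com/hyderabad/doctor/'
r = requests.get('https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=')
r.raise_for_status()  # fail loudly on a non-2xx response
for item in r.json():
    name = item.get('uniqueName')  # skip records without 'uniqueName'
    if name:
        print(BASE_URL_DOCTOR + name)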
I am attempting to pull the title and link of each so-called raffle in the list on this website. However, when I try to scrape this data, it can't seem to be found.
I have tried scraping all links on the page, but I think these "boxes" may be loaded via JavaScript.
The results I am receiving are a few links unrelated to what I want. There should be 40+ links in this list, but the majority are not showing up. Any help would be great; I've been stuck on this for a while.
For some reason, this link and many others aren't showing up when I scrape.
My code:
def raffle_page_collection():
    chrome_driver()
    page = requests.get('https://www.soleretriever.com/yeezy-boost-350-v2-black/')
    soup = BeautifulSoup(page.text, 'html.parser')
    product_header = soup.find('h1').text
    product_colorway = soup.find('h2').text
    product_sku_and_release_date_and_price = soup.find('h3').text
    container = soup.find(class_='main-container')
    raffles = container.find_all('a')
    raffle_list = []
    for items in raffles:
        raffle_list.append(items.get('href'))
    print(raffle_list)
You should try the Selenium browser-automation library; it lets you scrape pages whose content is rendered dynamically (JS or AJAX).
Try this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome()
browser.get('https://www.soleretriever.com/yeezy-boost-350-v2-black/')
time.sleep(3)
soup = BeautifulSoup(browser.page_source, 'html.parser')
product_header = soup.find('h1').text
product_colorway = soup.find('h2').text
product_sku_and_release_date_and_price = soup.find('h3').text
container = soup.find(class_='main-container')
raffles = container.find("div", {"class": "vc_pageable-slide-wrapper vc_clearfix"})
raffle_list = []
for items in raffles.find_all("a", href=True):
    raffle_list.append(items.get('href'))
print(product_header)
print(product_colorway)
print(product_sku_and_release_date_and_price)
print(raffle_list)
Output:
Yeezy Boost 350 v2 Black
Black/ Black/ Black
FU9006 | 07/06/19 | $220
['https://www.43einhalb.com/en/adidas-yeezy-boost-350-v2-black-328672#SR', 'https://www.adidas.co.uk/yeezy#SR', 'https://www.allikestore.com/default/adidas-yeezy-boost-350-v2-static-black-black-fu9006-105308.html#SR', 'https://archive1820.com/en/forms/6/yeezy-raffle#SR', 'https://drops.blackboxstore.com/blackbox_launches_view/catalog/product/view/id/22296/s/yeezy-boost-350-v2#SR', 'https://woobox.com/4szm9v#SR', 'https://launches.endclothing.com/product/yeezy-boost-350-v2-fu9006#SR', 'https://www.instagram.com/p/ByEIHSHDSY6/', 'https://www.instagram.com/p/ByFG1G0lWf7/', 'https://releases.footshop.com/adidas-yeezy-boost-350-v2-agqn6WoBJZ9y4RSnzw9G#SR', 'https://launches.goodhoodstore.com/launches/yeezy-boost-350-v2-black-33#SR', 'https://www.hervia.com/launches/yeezy-350#SR', 'https://www.hibbett.com/adidas-yeezy-350-v2-black-mens-shoe/M0991.html#SR', 'https://reporting.jdsports.co.uk/cgi-bin/msite?yeezy_comp+a+0+0+0+0+0&utm_source=RedEye&utm_medium=Email&utm_campaign=Yeezy%20Boost%20351%20Clay&utm_content=0905%20Yeezy%20Clay#SR', 'https://www.instagram.com/p/ByDnK6uH6kE/', 'https://www.nakedcph.com/yeezy-boost-v2-350-static-black-raffle/s/635#SR', 'https://www.instagram.com/p/ByIXT8zHvYz/', 'https://launches.sevenstore.com/launch/yeezy-boost-350-v2-black-4033024#SR', 'https://shelta.eu/news/adidas-yeezy-boost-350-v2-black-fu9006x#SR', 'https://www.instagram.com/p/ByDI_6JAfty/', 'https://www.sneakersnstuff.com/en/product/38889/adidas-yeezy-350-v2#SR', 'https://www.instagram.com/p/ByHtt3HFkE0/', 'https://www.instagram.com/p/ByCaKR7Cde1/', 'https://tres-bien.com/adidas-yeezy-boost-350-v2-black-fu9006-fw19#SR', 'https://yeezysupply.com/products/yeezy-boost-350-v2-black-june-7-2019#SR']
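One hedged refinement: instead of the fixed time.sleep(3), you can use Selenium's explicit waits so the script continues as soon as the content is actually present. A sketch, reusing the main-container class from the snippet above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.soleretriever.com/yeezy-boost-350-v2-black/')
# wait up to 10 seconds for the container to appear, then continue
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'main-container'))
)
html = browser.page_source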
ChromeDriver download for the Chrome browser: http://chromedriver.chromium.org/downloads
Installing the web driver for Chrome: https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial: https://selenium-python.readthedocs.io/
I'm practicing web scraping with Python at the moment, and I've run into a problem: I wanted to scrape a website that lists the anime I've watched, but when I try to scrape it (via requests or Selenium) it only gets around 30 of the 110 anime names on the page.
Here is my code with Selenium:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get("https://anilist.co/user/Agusmaris/animelist/Completed")
data = BeautifulSoup(browser.page_source, 'lxml')
for title in data.find_all(class_="title"):
    print(title.getText())
When I run it, the page source only goes up to an anime called 'Golden Time', even though there are 70 or more left on the page.
Thanks
Edit: Code that works now thanks to 'supputuri':
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
print(str(footer))
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    print('loading')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for title in soup.find_all(class_="title"):
    print(title.getText())
driver.close()
driver.quit()
ret = input()
Here is the solution.
Make sure to add import time
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    time.sleep(1)
print(str(driver.page_source))
This will keep iterating until all the anime are loaded, then grab the page source.
Let us know if this was helpful.
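An equivalent, commonly used pattern (a sketch, not specific to this site) scrolls via JavaScript and stops once the page height stops growing; it reuses the driver from the snippet above:

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give the lazy-loaded titles time to arrive
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded, we are at the real bottom
    last_height = new_height
html = driver.page_source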
So, this is the gist of what I get when I load the page source:
AniListwindow.al_token = 'E1lPa1kzYco5hbdwT3GAMg3OG0rj47Gy5kF0PUmH';Sorry, AniList requires Javascript.Please enable Javascript or http://outdatedbrowser.com>upgrade to a modern web browser.Sorry, AniList requires a modern browser.Please http://outdatedbrowser.com>upgrade to a newer web browser.
Since I know damn well that JavaScript is enabled and my Chrome version is fully up to date, and the URL listed takes you to an insecure website to "download" a new version of your browser, I think this is a spam site. I'm not sure if you were aware of that when posting, so I won't flag it as such, but I wanted you and others who come across this to be aware.
I'm trying to scrape website traffic from semrush.com.
My current code using BeautifulSoup is:
from bs4 import BeautifulSoup
import urllib.request
import json

req = urllib.request.Request('https://www.semrush.com/info/burton.com', headers={'User-Agent': 'Magic Browser'})
response = urllib.request.urlopen(req)
raw_data = response.read()
response.close()
soup = BeautifulSoup(raw_data, 'html.parser')
I've been trying data = soup.findAll("a", {"href":"/info/burton.com+(by+organic)"}) and data = soup.findAll("span", {"class":"sem-report-counter"}) without much luck.
I can see the numbers on the webpage that I would like to get. Is there a way to pull this information out? I'm not seeing it in the HTML I pull.
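A quick sanity check (a sketch): search the raw response for a value you can see in the browser; if it's missing, the number is rendered client-side, and urllib + BeautifulSoup alone will never see it:

import urllib.request

req = urllib.request.Request('https://www.semrush.com/info/burton.com',
                             headers={'User-Agent': 'Magic Browser'})
raw_data = urllib.request.urlopen(req).read().decode('utf-8', errors='replace')
# replace '356K' with a number you actually see on the page
print('356K' in raw_data)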
I went the extra mile and set up a working example of how you can use Selenium to scrape that page. Install Selenium and try it out!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://www.semrush.com/info/burton.com'  # your url
options = Options()                  # set up options
options.add_argument('--headless')   # add --headless mode to options
# note: executable_path will depend on where your chromedriver is located
driver = webdriver.Chrome(executable_path='/opt/ChromeDriver/chromedriver',
                          chrome_options=options)
driver.get(url)            # get response
driver.implicitly_wait(1)  # wait to load content
# grab that stuff you wanted
elements = driver.find_elements_by_xpath('//a[@href="/info/burton.com+(by+organic)"]')
for e in elements:
    print(e.get_attribute('text').strip())  # print text fields
driver.quit()  # close the driver when you're done
Output that I see in my terminal:
356K
6.5K
59.3K
$usd305K
Organic keywords
Organic
Top Organic Keywords
View full report
Organic Position Distribution
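Heads-up: find_elements_by_xpath was removed in Selenium 4. A sketch of the equivalent lookup on newer versions, reusing the driver and xpath from above:

from selenium.webdriver.common.by import By

elements = driver.find_elements(By.XPATH, '//a[@href="/info/burton.com+(by+organic)"]')
for e in elements:
    print(e.get_attribute('text').strip())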
I was able to scrape the following website before, using driver = webdriver.PhantomJS() (for work reasons). What I was scraping were the price and the date.
https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf
This stopped working some days ago due to a disclaimer page that I have to agree to first.
https://www.cash.ch/fonds-investor-disclaimer?redirect=fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf
Once I agree, I visually see the real content; however, the driver apparently does not, since the printout is [], so it must still be on the disclaimer URL.
Please see the code below.
from selenium import webdriver
from bs4 import BeautifulSoup
import csv
import os

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)

# Swisscanto
driver.get("https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf")
s_swisscanto = BeautifulSoup(driver.page_source, 'lxml')
nav_sc = s_swisscanto.find_all('span', {"data-field-entry": "value"})
date_sc = s_swisscanto.find_all('span', {"data-field-entry": "datetime"})
print(nav_sc)
print(date_sc)
print("Done Swisscanto")
This should work (I think the button you want to click is "zustimmen"):
driver = webdriver.PhantomJS()
driver.get("https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf")
accept_button = driver.find_element_by_link_text('zustimmen')
accept_button.click()
content = driver.page_source
More details here: python selenium click on button
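If the exact link text doesn't match (extra whitespace, capitalization), a slightly more robust sketch waits for the consent link and matches the text partially; the 'zustimmen' label is assumed from the disclaimer page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf")
# wait until the consent link is clickable, matching its text partially
accept_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'zustimmen'))
)
accept_button.click()
content = driver.page_source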