I am attempting to pull the title and link of each so called raffle in the list on this website. However, when i try to scrape this data, it can't seem to be found.
I have tried scraping all links on the page, but I think these "boxes" may be loaded via javascript.
The results I am receiving are a few links unrelated to what I want to get. There should be 40+ links that show up in this list, but the majority are not showing. Any help would be great, been stuck on this for a while
For some reason, this link and many others aren't showing up when I am scraping:
my code:
def raffle_page_collection():
chrome_driver()
page = requests.get('https://www.soleretriever.com/yeezy-boost-350-v2-black/')
soup = BeautifulSoup(page.text,'html.parser')
product_header = soup.find('h1').text
product_colorway = soup.find('h2').text
product_sku_and_release_date_and_price = soup.find('h3').text
container = soup.find(class_='main-container')
raffles = container.find_all('a')
raffle_list = []
for items in raffles:
raffle_list.append(items.get('href'))
print(raffle_list)
You should try automation selenium library. it allows you to scrape dynamic rendering request(js or ajax) page data.
Try this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
browser.get('https://www.soleretriever.com/yeezy-boost-350-v2-black/')
time.sleep(3)
soup = BeautifulSoup(browser.page_source,'html.parser')
product_header = soup.find('h1').text
product_colorway = soup.find('h2').text
product_sku_and_release_date_and_price = soup.find('h3').text
container = soup.find(class_='main-container')
raffles = container.find("div",{"class":"vc_pageable-slide-wrapper vc_clearfix"})
raffle_list = []
for items in raffles.find_all("a",href=True):
raffle_list.append(items.get('href'))
print(product_header)
print(product_colorway)
print(product_sku_and_release_date_and_price)
print(raffle_list)
O/P:
Yeezy Boost 350 v2 Black
Black/ Black/ Black
FU9006 | 07/06/19 | $220
['https://www.43einhalb.com/en/adidas-yeezy-boost-350-v2-black-328672#SR', 'https://www.adidas.co.uk/yeezy#SR', 'https://www.allikestore.com/default/adidas-yeezy-boost-350-v2-static-black-black-fu9006-105308.html#SR', 'https://archive1820.com/en/forms/6/yeezy-raffle#SR', 'https://drops.blackboxstore.com/blackbox_launches_view/catalog/product/view/id/22296/s/yeezy-boost-350-v2#SR', 'https://woobox.com/4szm9v#SR', 'https://launches.endclothing.com/product/yeezy-boost-350-v2-fu9006#SR', 'https://www.instagram.com/p/ByEIHSHDSY6/', 'https://www.instagram.com/p/ByFG1G0lWf7/', 'https://releases.footshop.com/adidas-yeezy-boost-350-v2-agqn6WoBJZ9y4RSnzw9G#SR', 'https://launches.goodhoodstore.com/launches/yeezy-boost-350-v2-black-33#SR', 'https://www.hervia.com/launches/yeezy-350#SR', 'https://www.hibbett.com/adidas-yeezy-350-v2-black-mens-shoe/M0991.html#SR', 'https://reporting.jdsports.co.uk/cgi-bin/msite?yeezy_comp+a+0+0+0+0+0&utm_source=RedEye&utm_medium=Email&utm_campaign=Yeezy%20Boost%20351%20Clay&utm_content=0905%20Yeezy%20Clay#SR', 'https://www.instagram.com/p/ByDnK6uH6kE/', 'https://www.nakedcph.com/yeezy-boost-v2-350-static-black-raffle/s/635#SR', 'https://www.instagram.com/p/ByIXT8zHvYz/', 'https://launches.sevenstore.com/launch/yeezy-boost-350-v2-black-4033024#SR', 'https://shelta.eu/news/adidas-yeezy-boost-350-v2-black-fu9006x#SR', 'https://www.instagram.com/p/ByDI_6JAfty/', 'https://www.sneakersnstuff.com/en/product/38889/adidas-yeezy-350-v2#SR', 'https://www.instagram.com/p/ByHtt3HFkE0/', 'https://www.instagram.com/p/ByCaKR7Cde1/', 'https://tres-bien.com/adidas-yeezy-boost-350-v2-black-fu9006-fw19#SR', 'https://yeezysupply.com/products/yeezy-boost-350-v2-black-june-7-2019#SR']
for chrome browser:
http://chromedriver.chromium.org/downloads
Install web driver for chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
selenium tutorial
https://selenium-python.readthedocs.io/
Related
I am currently trying to scrape live stock market data from the yahoo finance page.
I am using bs4. My current issue is that whenever I run my script, it does not update properly to reflect the current price of the stock.
If anybody has any advice on how to change that it would be appreciated.
import requests
from bs4 import BeautifulSoup
while True:
page = requests.get("https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X")
soup = BeautifulSoup(page.text, "html.parser")
price = soup.find("div", {"class": "My(6px) Pos(r) smartphone_Mt(6px)"}).find("span").text
print(price)
NOT POSSIBLE WITH BS4 ALONE
This website particularly uses JavaScript to update the page and urlib etc. just parses the html content of the page not Java Script or AJAX content.
PhantomJs or Selenium Web Browser provide a more mechanized browser that often can run the JavaScript codes enabling dynamic websites. Try Using this :)
Using Selenium It can be done as:
from selenium import webdriver #its the library
import time
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup
#it Says that we are going to Use chrome browser
chrome_options = webdriver.ChromeOptions()
#hiding the Chrome Browser
chrome_options.add_argument("--headless")
#Initiating Chrome with all properties we need (in this case we use no specific properties
driver = webdriver.Chrome(chrome_options=chrome_options,executable_path='C:/Users/shary/Downloads/chromedriver.exe')
#URL We need to open
url = 'https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X'
#Starting Our Browser
driver = webdriver.Chrome()
#Accessing the url .. this will open the page just as you open in Chrome etc.
driver.get(url)
while 1:
#it will get you the html content repeatedly .. So you can get the changing price
html = driver.page_source
page_soup = soup(html,features="lxml")
price = page_soup.find("div", {"class": "D(ib) Mend(20px)"}).text
print(price)
time.sleep(5)
Note the Best Comments But Hope this you will understand it :) Else Watch a youtube tutorial to get proper idea what a Selenium Bot does
Hope This will Help. Its working perfect for me :) Accept This Answer if it helps you
from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver
driver = webdriver.Chrome(r'C:\chromedriver.exe')
url ='https://www.sambav.com/hyderabad/doctors'
driver.get(url)
soup = BeautifulSoup(driver.page_source,'html.parser')
for links in soup.find_all('div',class_='sambavdoctorname'):
link = links.find('a')
print(link['href'])
driver.close()
I am trying to scrape this page, the link is same in all pages. I am trying to extract the links from all mutiple pages but it's not giving any output nor showing any error just the program gets end.
If you check that website by developer tools in browser ( chrome or mozilla or whatever), before loading website, the website fetch data from few sources. One of this sources is "https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=" . Your code could be simplified (and there is no need to use selenium):
import requests
r = requests.get('https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=')
BASE_URL_DOCTOR = 'https://www.sambav.com/hyderabad/doctor/'
for item in r.json():
print(BASE_URL_DOCTOR + item['uniqueName'])
I'm practicing web scraping with Python atm and I found a problem, I wanted to scrape one website that has a list of anime that I watched before but when I try to scrape it (via requests or selenium) it only gets around 30 of 110 anime names from the page.
Here is my code with selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get("https://anilist.co/user/Agusmaris/animelist/Completed")
data = BeautifulSoup(browser.page_source, 'lxml')
for title in data.find_all(class_="title"):
print(title.getText())
And when I run it, the page source only shows up until an anime called 'Golden time' when there are like 70 or more left that are in the page.
Thanks
Edit: Code that works now thanks to 'supputuri':
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Firefox()
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
print(str(footer))
while footer.rect['y'] != preY:
preY = footer.rect['y']
footer.location_once_scrolled_into_view
print('loading')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for title in soup.find_all(class_="title"):
print(title.getText())
driver.close()
driver.quit()
ret = input()
Here is the solution.
Make sure to add import time
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer =driver.find_element_by_css_selector("div.footer")
preY =0
while footer.rect['y']!=preY:
preY = footer.rect['y']
footer.location_once_scrolled_into_view
time.sleep(1)
print(str(driver.page_source))
This will iterate until all the anime is loaded and then gets the page source.
Let us know if this was helpful.
So, this is the jist of what I get when I load the page source:
AniListwindow.al_token = 'E1lPa1kzYco5hbdwT3GAMg3OG0rj47Gy5kF0PUmH';Sorry, AniList requires Javascript.Please enable Javascript or http://outdatedbrowser.com>upgrade to a modern web browser.Sorry, AniList requires a modern browser.Please http://outdatedbrowser.com>upgrade to a newer web browser.
Since I know damn well that Javascript is enabled and my Chrome version is fully up to date, and the URL listed takes one to a nonsecure website to "download" a new version of your browser, I think this is a spam site. Not sure if you were aware of that when posting so I won't flag as such, but I wanted you and others who come across this to be aware.
I want to scrape, e.g., the title of the first 200 questions under the web page https://www.quora.com/topic/Stack-Overflow-4/all_questions. And I tried the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
print("url")
print(url)
r = requests.get(url) # HTTP request
print("r")
print(r)
html_doc = r.text # Extracts the html
print("html_doc")
print(html_doc)
soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
print("soup")
print(soup)
It gave me a text https://pastebin.com/9dSPzAyX. If we search href='/, we can see that the html does contain title of some questions. However, the problem is that the number is not enough; actually on the web page, a user needs to manually scroll down to trigger extra load.
Does anyone know how I could mimic "scrolling down" by the program to load more content of the page?
Infinite scrolls on a webpage is based on the Javascript functionality. Therefore, to find out what URL we need to access and what parameters to use, we need to either thoroughly study the JS code working inside the page or, and preferably, examine the requests that the browser does when you scroll down the page. We can study requests using the Developer Tools.
See example for quora
the more you scroll down, the more requests generated. so now your requests will be done to that url instead of normal url but keep in mind to send correct headers and playload.
other easier solution will be by using selenium
Couldn't find a response using request. But you can use Selenium. First printed out the number of questions at first load, then send the End key to mimic scrolling down. You can see number of questions went from 20 to 40 after sending the End key.
I used driver.implicitly wait for 5 seconds before loading the DOM again in case the script load to fast before the DOM was loaded. You can improve by using EC with selenium.
The page loads 20 questions per scroll. So if you are looking to scrape 100 questions, then you need to send the End key 5 times.
To use the code below you need to install chromedriver.
http://chromedriver.chromium.org/downloads
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
CHROMEDRIVER_PATH = ""
CHROME_PATH = ""
WINDOW_SIZE = "1920,1080"
chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH
prefs = {'profile.managed_default_content_settings.images':2}
chrome_options.add_experimental_option("prefs", prefs)
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
def scrape(url, times):
if not url.startswith('http'):
raise Exception('URLs need to start with "http"')
driver = webdriver.Chrome(
executable_path=CHROMEDRIVER_PATH,
chrome_options=chrome_options
)
driver.get(url)
counter = 1
while counter <= times:
q_list = driver.find_element_by_class_name('TopicAllQuestionsList')
questions = [x for x in q_list.find_elements_by_xpath('//div[#class="pagedlist_item"]')]
q_len = len(questions)
print(q_len)
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)
wait = WebDriverWait(driver, 5)
time.sleep(5)
questions2 = [x for x in q_list.find_elements_by_xpath('//div[#class="pagedlist_item"]')]
print(len(questions2))
counter += 1
driver.close()
if __name__ == '__main__':
scrape(url, 5)
I recommend using selenium rather than bs.
selenium can control browser and parsing. like scroll down, click button, etc…
this example is for scroll down for get all liker user in instagram.
https://stackoverflow.com/a/54882356/5611675
If the content only loads on "scrolling down", this probably means that the page is using Javascript to dynamically load the content.
You can try using a web client such as PhantomJS to load the page and execute the javascript in it, and simulate the scroll by injecting some JS such as document.body.scrollTop = sY; (Simulate scroll event using Javascript).
I am learning Python scraping technique but I am stuck with the problem of scraping an Ajax page like this one.
I want to scrape all the medicines name and details coming in the page. Since I read most of the answer on the stack overflow but I am not getting the right data after scraping. I also tried to scrape using selenium or send a forge post request but it failed.
So please help me on this Ajax scraping topic specially this page because ajax is triggered on selecting an option from dropdown options.
Also please provide me with some resources for ajax page scraping.
//using selenium
from selenium import webdriver
import bs4 as bs
import lxml
import requests
path_to_chrome = '/home/brutal/Desktop/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chrome)
url = 'https://www.gianteagle.com/Pharmacy/Savings/4-10-Dollar-Drug-Program/Generic-Drug-Program/'
browser.get(url)
browser.find_element_by_xpath('//*[#id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option[contains(text(), "Ohio")]').click()
new_url = browser.current_url
r = requests.get(new_url)
print(r.content)
ChromeDriver you can download here
normalize-space is used in order to remove trash from web text, such as x0
from time import sleep
from selenium import webdriver
from lxml.html import fromstring
data = {}
driver = webdriver.Chrome('PATH TO YOUR DRIVER/chromedriver') # i.e '/home/superman/www/myproject/chromedriver'
driver.get('https://www.gianteagle.com/Pharmacy/Savings/4-10-Dollar-Drug-Program/Generic-Drug-Program/')
# Loop states
for i in range(2, 7):
dropdown_state = driver.find_element(by='id', value='ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList')
# open dropdown
dropdown_state.click()
# click state
driver.find_element_by_xpath('//*[#id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option['+str(i)+']').click()
# let download the page
sleep(3)
# prepare HTML
page_content = driver.page_source
tree = fromstring(page_content)
state = tree.xpath('//*[#id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option['+str(i)+']/text()')[0]
data[state] = []
# Loop products inside the state
for line in tree.xpath('//*[#id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_gridSearchResults"]/tbody/tr[#style]'):
med_type = line.xpath('normalize-space(.//td[#class="medication-type"])')
generic_name = line.xpath('normalize-space(.//td[#class="generic-name"])')
brand_name = line.xpath('normalize-space(.//td[#class="brand-name hidden-xs"])')
strength = line.xpath('normalize-space(.//td[#class="strength"])')
form = line.xpath('normalize-space(.//td[#class="form"])')
qty_30_day = line.xpath('normalize-space(.//td[#class="30-qty"])')
price_30_day = line.xpath('normalize-space(.//td[#class="30-price"])')
qty_90_day = line.xpath('normalize-space(.//td[#class="90-qty hidden-xs"])')
price_90_day = line.xpath('normalize-space(.//td[#class="90-price hidden-xs"])')
data[state].append(dict(med_type=med_type,
generic_name=generic_name,
brand_name=brand_name,
strength=strength,
form=form,
qty_30_day=qty_30_day,
price_30_day=price_30_day,
qty_90_day=qty_90_day,
price_90_day=price_90_day))
print('data:', data)
driver.quit()