I'm trying to scrape all of the images for a specific TripAdvisor page, but when using find_elements_by_class_name in Selenium it returns no elements at all. I am confused, because that is the exact class name of the elements I want to iterate through and append to a list; here is the site. Any help would be greatly appreciated!
# importing dependencies
import re
import selenium
import io
import pandas as pd
import urllib.request
import urllib.parse
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import time
from datetime import datetime
from selenium.webdriver.common.keys import Keys
#setup opening url window of website to be scraped
options = webdriver.ChromeOptions()
options.headless=False
prefs = {"profile.default_content_setting_values.notifications" : 2}
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome("/Users/rishi/Downloads/chromedriver 3") #possible issue by not including the file extension
driver.maximize_window()
time.sleep(5)
driver.get("""https://www.tripadvisor.com/""") #get the information from the page
#automate searching for hotels in specific city
driver.find_element_by_xpath('/html/body/div[2]/div/div[6]/div[1]/div/div/div/div/span[1]/div/div/div/a').click() #clicks on hotels option
driver.implicitly_wait(12) #allows xpath to be found
driver.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/div[12]/div/div/div[1]/div[1]/div/input').send_keys("Washington D.C.", Keys.ENTER) #change string to get certain city
time.sleep(8)
#now get current url
url = driver.current_url
response = requests.get(url)
response = response.text
data = BeautifulSoup(response, 'html.parser')
#get list of all hotels
hotels = driver.find_elements_by_class_name("prw_rup prw_meta_hsx_responsive_listing ui_section listItem")
print("Total Number of Hotels: ", len(hotels))
If you are using Selenium, I would recommend not mixing requests/BeautifulSoup into it, because you can get everything you want from the rendered page with Selenium alone. Also note that find_elements_by_class_name expects a single class name, so a compound value like "prw_rup prw_meta_hsx_responsive_listing ui_section listItem" will never match anything; use a CSS selector or an XPath instead.
You can achieve your goal as follows:
driver = webdriver.Chrome("/Users/rishi/Downloads/chromedriver 3")
driver.maximize_window()
driver.get("https://www.tripadvisor.ca/Hotels")
time.sleep(1)
driver.implicitly_wait(12)
driver.find_element_by_xpath('//*[@class="typeahead_input"]').send_keys("Washington D.C.", Keys.ENTER)
time.sleep(1)
hotels = driver.find_elements_by_xpath('//*[@class="listing collapsed"]')
print("Total Number of Hotels: ", len(hotels))
Please note that with this code you only get the first 30 hotels (i.e., the first page). To get them all, you would need to loop through every results page for the specified city, as sketched below.
Hope it helps.
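A minimal sketch of that pagination loop, reusing the listing locator from above. The "Next" link text is an assumption for illustration, so adjust the locator to whatever pagination element the page actually uses:
from selenium.common.exceptions import NoSuchElementException

all_hotels = []
while True:
    # grab the hotel cards on the current results page (same locator as above)
    for hotel in driver.find_elements_by_xpath('//*[@class="listing collapsed"]'):
        all_hotels.append(hotel.text)
    try:
        # assumption: the results page has a clickable "Next" pagination link
        driver.find_element_by_link_text('Next').click()
        time.sleep(3)  # crude wait for the next page to render
    except NoSuchElementException:
        break  # no "Next" link on the last page
print("Total Number of Hotels: ", len(all_hotels))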
I am working on a project analyzing the Supercluster Astronaut Database. I posted a StackOverflow question a few weeks ago about scraping the data from the website and got the code below from one of the helpful posters.
My only remaining issue with this process is that when I run the code, a browser window pops open, linked to the data source I am trying to scrape. I've tried tinkering with the code to stop the browser window from popping up by commenting out a few lines here and there, but nothing I've tried works properly. Can someone point me in the right direction and show how to modify the code below so that a browser doesn't pop up?
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
time.sleep(5)
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    #print(name)
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country = country.get_text()
    #print(country)
    data.append([name, country])
cols=['name','country']
df = pd.DataFrame(data,columns=cols)
print(df)
I think you're looking to run your code in headless mode. You can add the --headless argument to the Chrome Options() object to achieve this.
Code:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.maximize_window()
driver.get(url)
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    #print(name)
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country = country.get_text()
    #print(country)
    data.append([name, country])
cols=['name','country']
df = pd.DataFrame(data,columns=cols)
print(df)
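One small note: maximize_window() has no visible effect in headless mode, since there is no screen to maximize to. If the scrape depends on the viewport size, you can set it explicitly via an option instead, for example:
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")  # fixed viewport instead of maximize_window()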
I am trying to web scrape Aliexpress using Selenium and Python. I'm doing it by following a youtube tutorial, I have followed every steps but I just can't seem to get it to work.
I also tried requests and BeautifulSoup, but it seems like AliExpress lazy-loads its product listings. I tried using the window-scroll script, but that didn't work either; the content doesn't seem to load until I scroll over it myself.
This is the url for the page I would like to web scrape
https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText=dog+supplies&ltype=wholesale&SortType=default&g=n
This is the code I have currently. It doesn't return anything in the output. I think that's because it's trying to go through all the product listings but can't find any, since they haven't loaded...
Any suggestions/help would be greatly appreciated, sorry for the bad formatting and the bad code in advance.
Thank you!
"""
To do
HOT PRODUCT FINDER Enter: Keyword, to generate a url
Product Name
Product Image
Product Link
Sales Number
Price
Create an excel file that contains these data
Sort the list by top selling orders
Develop an algorithm for the velocity of the product (total sales increased / time?)
Scrape site every day """
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import requests
import lxml
#Starting Up the web driver
driver = webdriver.Chrome()
# grab Keywords
search_term = input('Keywords: ')
# url generator
def get_url(search_term):
    """Generate a url link using the search term provided"""
    url_template = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText={}&ltype=wholesale&SortType=default&g=n'
    search_term = search_term.replace(" ", "+")
    return url_template.format(search_term)
url = get_url(search_term)  # pass the variable, not the literal string 'search_term'
driver.get(url)
#scrolling down to the end of the page
time.sleep(2)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
#Extracting the Collection
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')
productlist = soup.find_all('div', class_='list product-card')
print(productlist)
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import requests
import lxml
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("disable-infobars")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(executable_path = 'chromedriver.exe',options = chrome_options)
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
# grab Keywords
search_term = input('Keywords: ')
# url generator
driver.get('https://www.aliexpress.com')
driver.implicitly_wait(10)
p = driver.find_element_by_name('SearchText')
p.send_keys(search_term)
p.send_keys(Keys.ENTER)
productlist = []
product = driver.find_element_by_xpath('//*[@id="root"]/div/div/div[2]/div[2]/div/div[2]/ul')
height = driver.execute_script("return document.body.scrollHeight")
# scroll down in small steps so the lazy-loaded listings actually render
for scrol in range(100, height - 1800, 100):
    driver.execute_script(f"window.scrollTo(0,{scrol})")
    time.sleep(0.5)
# driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
div = []
list_i = []
item_title = []
a = []
# each of the first 15 product cards sits in its own div under the ul
for z in range(1, 16):
    div.append(product.find_element_by_xpath(f'//*[@id="root"]/div/div/div[2]/div[2]/div/div[2]/ul/div[{z}]'))
for pr in div:
    list_i.append(pr.find_elements_by_class_name('list-item'))
for pc in list_i:
    for p in pc:
        item_title.append(p.find_element_by_class_name('item-title-wrap'))
for pt in item_title:
    a.append(pt.find_element_by_tag_name('a'))
for prt in a:
    productlist.append(prt.text)
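If you then want those titles in a spreadsheet (one of the items on your to-do list), a minimal sketch using pandas; the filename and single-column layout are just assumptions for illustration:
import pandas as pd

# hypothetical output file; extend the dict with price / orders once you scrape those too
df = pd.DataFrame({'product_title': productlist})
df.to_csv('aliexpress_products.csv', index=False)  # or df.to_excel(...), which needs openpyxl installed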
I am trying to crawl all the links of jobs from https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam using BeautifulSoup and Selenium. The problem is that I can only crawl the links on the first page, and I don't know how to crawl the links from every subsequent page.
This is the code I have tried:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
import time
import requests
from bs4 import BeautifulSoup
import array as arr
import pandas as pd
#The first line import the Web Driver, and the second import Chrome Options
#-----------------------------------#
#Chrome Options
all_link = []
chrome_options = Options()
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--window-size=1920,1080")
chrome_options.add_argument('--headless')
#-----------------------------------#
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path="C:/webdriver/chromedriver.exe")
#Open url
url = "https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam"
driver.get(url)
time.sleep(2)
#-----------------------------------#
page_source = driver.page_source
page = page_source
soup = BeautifulSoup(page_source,"html.parser")
block_job_list = soup.find_all("div",{"class":"d-flex justify-content-center align-items-center logo-area-wrapper logo-border"})
for i in block_job_list:
    link = i.find("a")
    all_link.append("https://www.vietnamworks.com/" + link.get("href"))
Since your problem is traversing the pages, this code will help you do that. Insert your scraping code inside the while loop, as indicated by the comments.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
from webdriver_manager.chrome import ChromeDriverManager # use pip install webdriver_manager if not installed
option = webdriver.ChromeOptions()
CDM = ChromeDriverManager()
driver = webdriver.Chrome(CDM.install(),options=option)
url = 'https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam'
driver.get(url)
time.sleep(3)
page_num = 1
links = []
driver.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
while True:
    # create the soup element here so that it picks up the page source of every page
    # sample scraping of the urls of the jobs posted
    for i in driver.find_elements_by_class_name('job-title '):
        links.append(i.get_attribute('href'))
    # moves to the next page
    try:
        print(f'On page {str(page_num)}')
        print()
        page_num += 1
        driver.find_element_by_link_text(str(page_num)).click()
        time.sleep(3)
    # triggers only at the end of the pages
    except NoSuchElementException:
        print('End of pages')
        break
driver.quit()
EDIT:
Simplified and modified the pagination method
If you are using BeautifulSoup, you have to move the page_source and soup variables inside the while loop, because the page source changes after every iteration. In your code you extracted only the first page's source, which is why you got the same output repeated once for every page.
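A minimal sketch of what that looks like if you prefer the BeautifulSoup route; the selector for the job links is an assumption, so swap in whatever matches the actual markup:
from bs4 import BeautifulSoup

while True:
    # re-read the page source on every iteration so the soup reflects the current page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for a in soup.select('a.job-title'):  # hypothetical selector for the job links
        links.append('https://www.vietnamworks.com' + a.get('href'))
    # ... then the same pagination try/except as in the loop above ...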
By using ChromeDriverManager from the webdriver-manager package, there is no need to specify the location/executable path of the driver. You can just copy-paste this code and run it on any machine that has Chrome installed. If you don't have the package installed, run pip install webdriver_manager in cmd before running the code.
Warning: AVOID DISPLAYING your actual username and password of any of your accounts like you have in your GitHub code.
I'm trying to loop through two sets of links: starting from https://cuetracker.net/seasons, click through each season link (the last 5 seasons), then click through each tournament link within each season and scrape the match data from each tournament.
Using the code below I have managed to get the list of season links I want, but when I try to grab the tournament links and put them into a list, I only get the last season's tournament links rather than every season's.
I'd guess it has something to do with driver.get completing before the next lines of code run, and that I need to loop/iterate using indexes, but I'm a complete novice so I'm not too sure.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
Chrome_Path = r"C:\Users\George\Desktop\chromedriver.exe"
Browser = webdriver.Chrome(Chrome_Path)
Browser.get("https://cuetracker.net/seasons")
links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs=[]
for link in links:
    hrefs.append(link.get_attribute("href"))
hrefs = hrefs[1:5]
for href in hrefs:
    Browser.get(href)
    links2 = Browser.find_elements_by_partial_link_text("20")
    hrefs2 = []
    for link in links2:
        hrefs2.append(link.get_attribute("href"))
You are pretty close and you are right about "you just need to wait a bit".
You could wait for the page to load: wait_for_page_load below checks the document readyState, and once everything is loaded you are good to go. Check this thread for more. :)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup
import os
import re
import time
import pandas as pd
def wait_for_page_load():
    timer = 10
    start_time = time.time()
    page_state = None
    while page_state != 'complete':
        time.sleep(0.5)
        page_state = Browser.execute_script('return document.readyState;')
        if time.time() - start_time > timer:
            raise Exception('Timeout :(')
Chrome_Path = r"C:\Users\George\Desktop\chromedriver.exe"
Browser = webdriver.Chrome(Chrome_Path)  # use the driver path defined above
Browser.get("https://cuetracker.net/seasons")
links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs = []
for link in links:
    hrefs.append(link.get_attribute("href"))
hrefs = hrefs[1:5]
hrefs2 = {}
for href in hrefs:
    hrefs2[href] = []
    Browser.get(href)
    wait_for_page_load()
    links2 = Browser.find_elements_by_partial_link_text("20")
    for link in links2:
        hrefs2[href].append(link.get_attribute("href"))
A few notes if you don't mind:
Browser should be lowercase browser or driver (Python naming convention); the same applies to Chrome_Path
check out XPath, it is awesome
EDIT:
I was sloppy the first time, so I've updated the answer to actually answer the question :D. Waiting for the page to load is still a good idea :)
The problem was that you re-defined hrefs2 in each cycle so it always contained the result of the last iteration.
About why XPath:
If you would like to collect seasons before 2000 as well, your URL-collecting logic (matching on the link text "20") would break. You could still do this:
table = Browser.find_element_by_xpath('//*[@class="table table-striped"]')
all_urls = [x.get_attribute('href') for x in table.find_elements_by_xpath('.//tr/td[2]/a')]
Where you find the table by the class name, then collect the urls from the second column of the table.
If you know the url pattern you can even do this:
all_urls = [x.get_attribute('href') for x in Browser.find_elements_by_xpath('//td//a[contains(@href, "https://cuetracker.net/tournaments")]')]
The Xpath above:
//td <- in any depth of the document tree find td tagged elements
//a <- in collected td elements get all children which are a tagged (in any depth)
[contains(@href, "https://cuetracker.net/tournaments")] <- from the collected a elements, keep only those whose href attribute contains the text "https://cuetracker.net/tournaments" (partial match)
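From here, looping over the collected tournament links follows the same pattern. I don't know the exact markup of a cuetracker tournament page, so treat the following purely as a hedged sketch: it assumes the match data is rendered as ordinary HTML tables that pandas.read_html can pick up.
import pandas as pd

match_tables = {}
for season, tournament_urls in hrefs2.items():
    for t_url in tournament_urls:
        Browser.get(t_url)
        wait_for_page_load()
        try:
            # assumption: match data sits in plain <table> elements on the tournament page
            match_tables[t_url] = pd.read_html(Browser.page_source)
        except ValueError:
            pass  # no tables found on this tournament page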
I am very new to web scraping. I have the following url:
https://www.bloomberg.com/markets/symbolsearch
So, I use Selenium to enter the Symbol Textbox and press Find Symbols to get the details. This is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.bloomberg.com/markets/symbolsearch/")
element = driver.find_element_by_id("query")
element.send_keys("WMT:US")
driver.find_element_by_name("commit").click()
It returns the table. How can I retrieve that? I am pretty clueless.
Second question,
Can I do this without Selenium, since it slows things down? Is there a way to find an API that returns JSON?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
import requests
driver = webdriver.Firefox()
driver.get("https://www.bloomberg.com/markets/symbolsearch/")
element = driver.find_element_by_id("query")
element.send_keys("WMT:US")
driver.find_element_by_name("commit").click()
time.sleep(5)
url = driver.current_url  # URL of the results page after the search has been submitted
time.sleep(5)
parsed = requests.get(url)
soup = BeautifulSoup(parsed.content, 'html.parser')
a = soup.findAll("table", {"class": "dual_border_data_table"})
print(a)
Here is the complete code with which you can get the table you are looking for; do whatever you need with it once you have the table. Hope it helps.
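If you then want the table as structured data rather than raw HTML, a minimal sketch handing it to pandas, assuming the first matching table is the one you want:
import pandas as pd

if a:  # 'a' is the list of matching <table> tags from the soup above
    df = pd.read_html(str(a[0]))[0]  # parse the first table into a DataFrame
    print(df.head())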