I want to extract all the fantasy teams that have been entered for past contests. To loop through the dates, I just change a small part of the URL as shown in my code below:
#Packages:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
# Driver
chromedriver = "C:/Users/Michel/Desktop/python/package/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(chromedriver)
# Dataframes that will be used later
results = pd.DataFrame()
best_lineups = pd.DataFrame()
opti_lineups = pd.DataFrame()
#For loop over all DATES:
calendar = []
calendar.append("2019-01-10")
calendar.append("2019-01-11")
for d in calendar:
    driver.get("https://rotogrinders.com/resultsdb/date/" + d + "/sport/4/")
Then, to access the different contests of that day, you need to click on the contest tab. I use the following code to locate and click on it.
# Find "Contest" tab
contest = driver.find_element_by_xpath("//*[@id='root']/div/main/main/div[2]/div[3]/div/div/div[1]/div/div/div/div/div[3]")
contest.click()
I simply inspected and copied the XPath of the tab. Most of the time this works, but sometimes I get the error message "Unable to locate element...". Moreover, it seems to work only for the first date in my calendar loop and always fails on the next iteration, and I do not know why. I tried to locate it differently, but I feel I am missing something, for example:
contests = driver.find_element_by_xpath("//*[@role='tab']")
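For what it's worth, the pattern I think I need is an explicit wait before the click. This is an untested sketch, and the contains(., 'Contest') filter is just my guess at narrowing down the right tab:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

# Sketch: wait up to 20s for the tab to become clickable instead of
# grabbing it right after driver.get(); the text filter is an assumption.
contest_tab = WebDriverWait(driver, 20).until(
    ec.element_to_be_clickable((By.XPATH, "//*[@role='tab'][contains(., 'Contest')]"))
)
contest_tab.click()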
Once the contest tab is successfully clicked, all contests of that day are shown, and you can click a link to access all the entries of that contest. I stored the contest links in order to iterate through all of them, as follows:
list_links = driver.find_elements_by_tag_name('a')
hlink = []
for ii in list_links:
    hlink.append(ii.get_attribute("href"))
sub = "https://rotogrinders.com/resultsdb"
con = "contest"
contest_list = []
for text in hlink:
    if text is None:  # some anchors have no href
        continue
    if sub in text and con in text:
        contest_list.append(text)
# Iterate through all the entries (users) of a contest and extract the information of the team entered by each user
for c in contest_list:
    driver.get(c)
Then, I want to extract every participant's team entered in the contest and store them in a dataframe. I am able to do this successfully for the first page of the contest.
# Waits until tables are loaded and have text. Times out after 60 seconds
while WebDriverWait(driver, 60).until(ec.presence_of_element_located((By.XPATH, './/tbody//tr//td//span//a[text() != ""]'))):
    # while ????:
    # Get tables to get the user names
    tables = pd.read_html(driver.page_source)
    users_df = tables[0][['Rank', 'User']]
    users_df['User'] = users_df['User'].str.replace(' Member', '')
    # Initialize results dataframe and iterate through users
    for i, row in users_df.iterrows():
        rank = row['Rank']
        user = row['User']
        # Find the user name and click on the name
        user_link = driver.find_elements(By.XPATH, "//a[text()='%s']" % (user))[0]
        user_link.click()
        # Get the lineup table after clicking on the user name
        tables = pd.read_html(driver.page_source)
        lineup = tables[1]
        #print (user)
        #print (lineup)
        # Restructure to put into results dataframe
        lineup.loc[9, 'Name'] = lineup.iloc[9]['Salary']
        lineup.loc[10, 'Name'] = lineup.iloc[9]['Pts']
        temp_df = pd.DataFrame(lineup['Name'].values.reshape(-1, 11),
                               columns=lineup['Pos'].iloc[:9].tolist() + ['Total_$', 'Total_Pts'])
        temp_df.insert(loc=0, column='User', value=user)
        temp_df.insert(loc=0, column='Rank', value=rank)
        temp_df["Date"] = d
        results = results.append(temp_df)
    #next_button = driver.find_elements_by_xpath("//button[@type='button']")
    #next_button[2].click()
results = results.reset_index(drop=True)
driver.close()
However, there are other pages, and to access them you need to click the small arrow (next) button at the bottom. Moreover, you can click that button indefinitely, even when there are no more entries. Therefore, I would like to loop through all pages with entries, stop when there are no more, and then change contest. I tried to implement a while loop to do so, but my code did not work...
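What I have in mind is something like the sketch below. Note this is untested: the [2] index for the next button comes straight from my commented-out attempt, and the "compare the first Rank cell" trick is just my idea for detecting the last page:

import time

# Keep clicking the "next" arrow until the first visible Rank stops changing.
while True:
    first_rank_before = pd.read_html(driver.page_source)[0]['Rank'].iloc[0]
    next_button = driver.find_elements(By.XPATH, "//button[@type='button']")[2]  # index assumed
    next_button.click()
    time.sleep(2)  # crude; a smarter wait would check for a re-rendered table
    if pd.read_html(driver.page_source)[0]['Rank'].iloc[0] == first_rank_before:
        break  # the page did not change, so that was the last page
    # ... process the newly loaded page here (same parsing as above) ...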
You must really make sure that the page loads completely before you do anything on it.
Moreover, it seems to work only for the first date in my calendar loop
and always fails in the next iteration
Usually when Selenium loads a browser page, it tries to look for the element even if the page has not loaded all the way. I suggest rechecking the XPath of the element you are trying to click.
Also, try to see when the page loads completely and use time.sleep(number_of_seconds)
to make sure you hit the element, or check for a particular element, or a property of an element, that would tell you the page has been loaded.
One more suggestion: you can use driver.current_url to see which page you are targeting. I have had this issue while working on multiple tabs, and I had to tell Python/Selenium to manually switch to that tab.
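For example, here is a rough sketch of both suggestions combined (the locator at the end is only a placeholder, not the one for your page):

import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

time.sleep(5)  # crude fixed wait; sometimes enough
# Wait until the browser reports the document is fully loaded...
WebDriverWait(driver, 30).until(
    lambda d: d.execute_script("return document.readyState") == "complete")
# ...then wait for the specific element you need (placeholder locator).
tab = WebDriverWait(driver, 30).until(
    EC.element_to_be_clickable((By.XPATH, "//*[@role='tab']")))
print(driver.current_url)  # sanity-check which page you are actually on
tab.click()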
Related
I have been looking everywhere for help with any method in Python to scrape all NBA props from app.prizepicks.com. I came down to two potential methods: the API with pandas, or Selenium. I believe PrizePicks recently shut down their API to stop users from scraping the NBA props, so to my knowledge using selenium-stealth is the only way to scrape the PrizePicks NBA board. Can anyone please help me with, or provide, code that scrapes PrizePicks for all NBA props? The information needed would be the player name, the prop type (such as points, rebounds, 3-pt made, free throws made, fantasy, pts+rebs, etc.) and the prop line (such as 34.5 or 8.5, which could belong to a prop type such as points or rebounds, respectively). I would need this to work decently quickly and refresh every set number of minutes. I found something similar to what I want in another thread, posted by 'C. Peck', which I will provide (hopefully; I don't really know how to use Stack Overflow). But the code that C. Peck provided does not work on my device, and I was wondering if anyone here could write functional code or fix that code to work for me. I have a MacBook Pro, so I don't know if that affects anything.
EDIT
After a lot of trial and error, and help from the thread, I have managed to complete the first step. I am able to scrape the "Points" tab of the PrizePicks NBA board, but I want to scrape the info from every tab, not just Points. I honestly don't know why my code isn't fully working; I basically want it to scrape points, rebounds, assists, fantasy, etc. Let me know any fixes I should make to be able to scrape for every stat_element in the stat_container, or other methods too! I'll update the code below:
EDIT AGAIN
It seems like the problem lies in the "stat-container" and "stat-elements". I checked which elements "stat-elements" holds, and it is only Points. I checked which elements "stat-container" holds, and it gave me an error. I believe if someone helps me with that, the problem will be fixed. This is the error I get when I try to iterate over "stat-container" (line 27):

for element in stat_container:
TypeError: 'WebElement' object is not iterable
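For context, I gather that find_element (and the visibility_of_element_located wait) returns a single WebElement, while only find_elements returns a list you can loop over; a minimal illustration:

from selenium.webdriver.common.by import By

stat_container = driver.find_element(By.CLASS_NAME, "stat-container")  # one WebElement -- not iterable
stat_divs = driver.find_elements(By.CSS_SELECTOR, "div.stat")  # a list of WebElements -- iterable, possibly empty
for div in stat_divs:
    print(div.text)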
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://app.prizepicks.com/")
driver.find_element(By.CLASS_NAME, "close").click()
time.sleep(2)
driver.find_element(By.XPATH, "//div[#class='name'][normalize-space()='NBA']").click()
time.sleep(2)
# Wait for the stat-container element to be present and visible
stat_container = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "stat-container")))
# Find all stat elements within the stat-container
stat_elements = driver.find_elements(By.CSS_SELECTOR, "div.stat")
# Initialize empty list to store data
nbaPlayers = []
# Iterate over each stat element
for stat in stat_elements:
    # Click the stat element
    stat.click()
    projections = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".projection")))
    for projection in projections:
        names = projection.find_element(By.XPATH, './/div[@class="name"]').text
        points = projection.find_element(By.XPATH, './/div[@class="presale-score"]').get_attribute('innerHTML')
        text = projection.find_element(By.XPATH, './/div[@class="text"]').text
        print(names, points, text)
        players = {
            'Name': names,
            'Prop': points, 'Line': text
        }
        nbaPlayers.append(players)
df = pd.DataFrame(nbaPlayers)
print(df)
driver.quit()
Answer updated with the working code. I made some small changes:
- Replaced stat_elements with categories, a list containing the stat names.
- Looped over categories and clicked the div button whose label equals the current category name.
- Added .replace('\n','') at the end of the text variable.
driver.get("https://app.prizepicks.com/")
driver.find_element(By.CLASS_NAME, "close").click()
time.sleep(2)
driver.find_element(By.XPATH, "//div[#class='name'][normalize-space()='NBA']").click()
time.sleep(2)
# Wait for the stat-container element to be present and visible
stat_container = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "stat-container")))
# Find all stat elements within the stat-container
# i.e. categories is the list ['Points','Rebounds',...,'Turnovers']
categories = driver.find_element(By.CSS_SELECTOR, ".stat-container").text.split('\n')
# Initialize empty list to store data
nbaPlayers = []
# Iterate over each stat element
for category in categories:
# Click the stat element
line = '-'*len(category)
print(line + '\n' + category + '\n' + line)
driver.find_element(By.XPATH, f"//div[text()='{category}']").click()
projections = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".projection")))
for projection in projections:
names = projection.find_element(By.XPATH, './/div[#class="name"]').text
points= projection.find_element(By.XPATH, './/div[#class="presale-score"]').get_attribute('innerHTML')
text = projection.find_element(By.XPATH, './/div[#class="text"]').text.replace('\n','')
print(names, points, text)
players = {'Name': names, 'Prop':points, 'Line':text}
nbaPlayers.append(players)
pd.DataFrame(nbaPlayers)
I am only a hobbyist with Python, so please bear with me. I am trying to run this script to collect TripAdvisor reviews and write them to Excel.
But once it opens the website, it throws this error:

NoSuchElementException: no such element: Unable to locate element: {"method":"xpath","selector":".//q[@class='IRsGHoPm']"}

Anyone got any ideas on what is going wrong?
import csv #This package lets us save data to a csv file
from selenium import webdriver #The Selenium package we'll need
from selenium.webdriver.support import expected_conditions
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
import time #This package lets us pause execution for a bit
path_to_file = "E:Desktop/Data/Reviews.csv"
pages_to_scrape = 3
url = "https://www.tripadvisor.com/Hotel_Review-g60982-d209422-Reviews-Hilton_Waikiki_Beach-Honolulu_Oahu_Hawaii.html"
# import the webdriver
driver = webdriver.Chrome()
driver.get(url)
# open the file to save the review
csvFile = open(path_to_file, 'a', encoding="utf-8")
csvWriter = csv.writer(csvFile)
# change the value inside the range to save the number of reviews we're going to grab
for i in range(0, pages_to_scrape):
    # give the DOM time to load
    time.sleep(5)
    # Click the "expand review" link to reveal the entire review.
    driver.find_element_by_xpath(".//div[contains(@data-test-target, 'expand-review')]").click()
    # Now we'll ask Selenium to look for elements on the page and save them to a variable. First, define a container that holds all the reviews on the page; in a moment we'll parse these and save them:
    container = driver.find_elements_by_xpath("//div[@data-reviewid]")
    # Next we'll grab the date of each review:
    dates = driver.find_elements_by_xpath(".//div[@class='_2fxQ4TOx']")
    # Now we'll look at the reviews in the container and parse them out
    for j in range(len(container)):  # A loop defined by the number of reviews
        # Grab the rating
        rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
        # Grab the title
        title = container[j].find_element_by_xpath(".//div[contains(@data-test-target, 'review-title')]").text
        # Grab the review
        review = container[j].find_element_by_xpath(".//q[@class='IRsGHoPm']").text.replace("\n", " ")
        # Grab the date
        date = " ".join(dates[j].text.split(" ")[-2:])
        # Save that data in the csv, then continue to the next review
        csvWriter.writerow([date, rating, title, review])
    # When all the reviews in the container have been processed, change the page and repeat
    driver.find_element_by_xpath('.//a[@class="ui_button nav next primary "]').click()
# When all pages have been processed, quit the driver
driver.quit()
I tried finding the "_2fxQ4TOx" class name on the page, but it came up with 0 results. These are likely auto-generated class names produced by the build process for compression and its own CSS references.
Instead, try to find elements by attributes that actually exist in the frontend, for example:
container = driver.find_element_by_css_selector('div[data-test-target="reviews-tab"]')
reviews = container.find_elements_by_css_selector('div[data-test-target="HR_CC_CARD"]')
for review in reviews:
    # get date
    # get description
    # ...
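For instance, reusing the data-test-target and bubble-rating hooks that already appear in your own code, the loop body might look something like this (a sketch; I have not verified every attribute inside each card):

for review in reviews:
    # title via the same data-test-target hook used in the question's code
    title = review.find_element_by_xpath(".//div[contains(@data-test-target, 'review-title')]").text
    # the rating is encoded in the bubble class name, e.g. "... bubble_40" -> "40"
    rating = review.find_element_by_xpath(
        ".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
    print(title, rating)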
you can also refer to this link:
Is there a way to find an element by attributes in Python Selenium?
For the site https://www.wsop.com/tournaments/results/, the objective is to download all available PDFs in the REPORTS section, behind all the different drop-down options where they are available.
Currently I am trying to do this using Selenium, because I couldn't find an API, but I am open to other suggestions. For now the code is a bunch of copy-paste from relevant questions and YT videos.
My plan of attack is to select an option in the drop-down menu, press 'GO' (to load the results), navigate to 'REPORTS' (if available) and download all the PDFs available, and then iterate over all options. Challenge 2 is then to get the PDFs into something like a dataframe to do some analysis.
Below is my current code, which only manages to download the top PDF of the option selected by default in the drop-down:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
import os
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
#settings and loading webpage
options=Options()
options.headless=True
CD=ChromeDriverManager().install()
driver=webdriver.Chrome(CD,options=options)
params={'behavior':'allow','downloadPath':os.getcwd()+'\\PDFs'}
driver.execute_cdp_cmd('Page.setDownloadBehavior',params)
driver.get('https://www.wsop.com/tournaments/results/')
#Go through the dropdown
drp=Select(driver.find_element_by_id("CPHbody_aid"))
drp.select_by_index(0)
drp=Select(driver.find_element_by_id("CPHbody_grid"))
drp.select_by_index(1)
drp=Select(driver.find_element_by_id("CPHbody_tid"))
drp.select_by_index(5)
#Click the necessary buttons (section with issues)
driver.find_element_by_xpath('//*[#id="nav-tabs"]/a[6]').click()
#driver.find_element_by_name('GO').click()
#WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "GO"))).click()
#WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "REPORTS"))).click()
a=driver.find_element_by_id("reports").click()
I can navigate through the drop-downs just fine (and it should be easy to iterate over them). However, I cannot get the 'GO' button pressed. I tried it a bunch of different ways, a few of which I left as comments in the code.
I am able to press the REPORTS tab, but I think that breaks down when there are different numbers of tabs; the line in the comments might work better. For now I am not able to download all PDFs anyway; it just takes the first PDF of the page.
Many thanks to whoever can help:)
The website is structured in such a way that you can loop through the years in which a WSOP was played, then within each year loop through every event and read the data on the page into a pandas dataframe. This is far more efficient than downloading and parsing the PDFs.
You can edit how far you want to go back with the from_year variable in line 5; going way back will obviously take more time. See the script below, which outputs all the data to a CSV. Note that not every event has POY points available. Also, you'll need to pip install requests, pandas and bs4 if you haven't already.
import requests
import pandas as pd
from bs4 import BeautifulSoup

from_year = 2020
wsop_tournament_url = 'https://www.wsop.com/tournaments/GetTournaments.aspx?aid=2'
wsop_resp = requests.get(wsop_tournament_url)
soup = BeautifulSoup(wsop_resp.text, 'html.parser')
years = [x['value'] for x in soup.find_all('option') if str(from_year) in x.text]
event_dfs = []
for year in years:
    event_resp = requests.get(f'https://www.wsop.com/tournaments/GetEvents.aspx?grid={str(year)}')
    soup = BeautifulSoup(event_resp.text, 'html.parser')
    event_ids = [x['value'] for x in soup.find_all('option')]
    for event in event_ids:
        page = 1
        while True:
            url = f'https://www.wsop.com/tournaments/results/?aid=2&grid={year}&tid={event}&rr=5&curpage={page}'
            results = requests.get(url)
            soup = BeautifulSoup(results.text, 'html.parser')
            info = soup.find('div', {'id': 'eventinfo'})
            dates = info.find('p').text
            name = info.find('h1').text
            year_name = soup.find('div', {'class': 'content'}).find('h3').text.strip()
            table = soup.find('div', {'id': 'results'})
            size = int(table.find('ul')['class'][0][-1])
            rows = table.find_all('li')
            if len(rows) <= size + 1:
                break
            print(f'processing {year_name} - {name} - page {page}')
            output = []
            headers = []
            for x in range(size):
                series = []
                for i, row in enumerate(rows):
                    if i == x:
                        headers.append(row.text)
                        continue
                    if i % size == x:
                        series.append(row.text)
                output.append(series)
            df = pd.DataFrame(output)
            df = df.transpose()
            df.columns = headers
            df['year_name'] = year_name
            df['event_id'] = event
            df['year_id'] = year
            df['event_name'] = name
            df['dates'] = dates
            df['url'] = url
            event_dfs.append(df)
            page += 1
    print(f'Scraped {year_name} successfully')
final_df = pd.concat(event_dfs)
final_df.to_csv('wsop_output.csv', index=False)
I am not going to write the whole script for you, but here's how to click on the "GO" button:
We can see from the Developer tools that the button is the only element with the class "submit-red-button", so we can access it with: driver.find_elements_by_class_name('submit-red-button')[0].click()
You say that you can access the REPORTS tab, but it did not work when I tested your program, so just in case, you can use driver.find_elements_by_class_name('taboff')[4] to get it.
Then, all you need to do is click on each PDF link in order to download the files.
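Putting those pieces together, a rough sketch of the whole loop could look like this (the dropdown ID comes from your own code; the taboff index and the .pdf filter are assumptions on my part):

from selenium.webdriver.support.ui import Select

# Sketch only: iterate every option of the tournament dropdown, press GO,
# open REPORTS, and click every link that ends in .pdf.
drp = Select(driver.find_element_by_id("CPHbody_tid"))
for i in range(len(drp.options)):
    drp = Select(driver.find_element_by_id("CPHbody_tid"))  # re-find after each page load
    drp.select_by_index(i)
    driver.find_elements_by_class_name('submit-red-button')[0].click()
    try:
        driver.find_elements_by_class_name('taboff')[4].click()  # REPORTS tab, index assumed
    except IndexError:
        continue  # no REPORTS tab for this selection
    for link in driver.find_elements_by_tag_name('a'):
        href = link.get_attribute('href')
        if href and href.lower().endswith('.pdf'):
            link.click()  # downloads into the folder set via Page.setDownloadBehavior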
I am trying to refine my results after using Selenium and Chrome with Python to automate Google searches and get the sorted links. I can successfully get initial search results with the script and automatically click the 'Tools' button.
The bottom line is that I can't figure out the HTML tags required to select/click the time-frame drop-down (defaulted to 'Any time') and then select/click the 'Relevance' drop-down to sort by date. I have tried Select, but am using the wrong tags for that method. I have used Inspect Element and Katalon Recorder to figure it out, but I get errors such as "element not found". Any help is appreciated.
driver.get('https://www.google.com/search')
search_field = driver.find_element_by_name("q")
search_field.send_keys("cheese")
search_field.submit()
# Clicks the Tools button, activates sort dropdowns
driver.find_element_by_id("hdtb-tls").click()
# Need to sort results by last 24, week, month, etc.
driver.find_element_by_class_name('hdtb-mn-hd')
driver.find_element_by_link_text('Past month').click()
# Need to sort results date
driver.find_element_by_xpath('(.//*[normalize-space(text()) and normalize-space(.)="To"])[1]/following::div[5]')
driver.find_element_by_link_text('Sorted by date').click()
Are you missing the .click() for driver.find_element_by_class_name('hdtb-mn-hd')?
driver = webdriver.Chrome()
driver.get('https://www.google.com/search')
search_field = driver.find_element_by_name("q")
search_field.send_keys("cheese")
search_field.submit()
# Clicks the Tools button, activates sort dropdowns
driver.find_element_by_id("hdtb-tls").click()
# Need to sort results by last 24, week, month, etc.
driver.find_element_by_class_name('hdtb-mn-hd').click()
driver.find_element_by_link_text('Past month').click()
Here's a full script that works all the way through:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('https://www.google.com/search')
search_field = driver.find_element_by_name("q")
search_field.send_keys("cheese")
search_field.submit()
# Clicks the Tools button, activates sort dropdowns
time.sleep(1)
driver.find_element_by_id("hdtb-tls").click()
# Need to sort results by last 24, week, month, etc.
time.sleep(1)
driver.find_element_by_class_name('hdtb-mn-hd').click()
time.sleep(1)
driver.find_element_by_link_text('Past month').click()
# Need to sort results date
time.sleep(1)
driver.find_elements_by_xpath('//*[#id="hdtbMenus"]/div/div[3]/div')[0].click()
time.sleep(1)
driver.find_elements_by_xpath('//*[#id="sbd_1"]')[0].click()
I am using selenium to navigate to a webpage and store the page source in a variable.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://google.com")
html1 = driver.page_source
html1 now contains the page source of http://google.com.
My question is: how can I return HTML selectors such as id="id" or name="name"?
EDIT:
For example:
The webpage I navigated to with Selenium has a menu bar with 4 tabs. Each tab has an id attribute: id="tab1", id="tab2", and so on. I would like to return each id value, so I want tab1, tab2, and so on.
Edit#2:
Another example:
The homepage of my website (http://chrisarroyo.me) has several clickable links with ids. I would like to be able to return/print those ids to my console.
So I would like to return the ids for the Learn More button and the ids for the links in the footer (facebookLnk, githubLnk, etc..)
If you are looking for a list of WebElements that have an ID, use:
elements = driver.find_elements_by_xpath("//*[@id]")
You can then iterate over that list and use get_attribute("id") to pull out each element's specific ID.
For name, it's pretty much the same code: just change id to name and you're set.
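For example, a quick sketch of the name variant:

# Same pattern, keyed on the name attribute instead of id
named_elements = driver.find_elements_by_xpath("//*[@name]")
for element in named_elements:
    print(element.get_attribute("name"))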
Thank you @stewartm, your comment helped.
This ended up giving me the results I was looking for:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://chrisarroyo.me")
id_elements = driver.find_elements_by_xpath("//*[@id]")
for eachElement in id_elements:
    individual_ids = eachElement.get_attribute("id")
    print(individual_ids)
After running the above ^^ the output listed each of the ids on the webpage specified.
output:
navbarNavAltMarkup
learnBtn
githubLnk
facebookLnk
linkedinLnk