Python - Scraping tables with dependent dropdown lists

I am trying to obtain the electoral results of the 2016 elections in Peru disaggregated by district. As you can see at this link (https://www.web.onpe.gob.pe/modElecciones/elecciones/elecciones2016/PRPCP2016/Resultados-Ubigeo-Presidencial.html#posicion), when selecting the "scope", three more boxes appear with the titles "department", "province" and "district". I want to extract all the results from the image table at the "district" level and save them in a table, but to achieve this I must first select a "department", "province" and "district".
So far, I have written this code that selects options I specify in the dropdown lists, but I was wondering if I can automate the selection of options. There are 1874 districts, so doing it manually is not an option.
I hope you can help me. Thanks!
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
driver = webdriver.Chrome()
driver.get('https://www.web.onpe.gob.pe/modElecciones/elecciones/elecciones2016/PRPCP2016/Resultados-Ubigeo-Presidencial.html#posicion')
time.sleep(4)
selectAmbit = Select(driver.find_element_by_id("cdgoAmbito"))
selectAmbit.select_by_value('P')
time.sleep(2)
selectRegion = Select(driver.find_element_by_id("cdgoDep"))
selectRegion.select_by_value('010000')
time.sleep(2)
selectProvin = Select(driver.find_element_by_id("cdgoProv"))
selectProvin.select_by_value('010200')
time.sleep(2)
selectDistr = Select(driver.find_element_by_id("cdgoDist"))
selectDistr.select_by_value('010202')

A simple loop gets you all of the selectDistr options. You'd just nest the same pattern to select every option of every dropdown, as sketched after the snippet.
selectDistr = Select(driver.find_element_by_id("cdgoDist"))
for option in selectDistr.options:
    selectDistr.select_by_value(option.get_attribute('value'))
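For example, here is a minimal sketch of the nested version, reusing the element IDs from the question. The option values are collected before each loop because selecting a parent refreshes its child dropdown, the empty-value filter is an assumption meant to skip any placeholder option, and the actual scraping step is left as a comment since the question notes the results are rendered as an image:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('https://www.web.onpe.gob.pe/modElecciones/elecciones/elecciones2016/PRPCP2016/Resultados-Ubigeo-Presidencial.html#posicion')
time.sleep(4)

Select(driver.find_element_by_id("cdgoAmbito")).select_by_value('P')
time.sleep(2)

dep_values = [o.get_attribute('value') for o in Select(driver.find_element_by_id("cdgoDep")).options if o.get_attribute('value')]
for dep in dep_values:
    Select(driver.find_element_by_id("cdgoDep")).select_by_value(dep)
    time.sleep(2)
    prov_values = [o.get_attribute('value') for o in Select(driver.find_element_by_id("cdgoProv")).options if o.get_attribute('value')]
    for prov in prov_values:
        Select(driver.find_element_by_id("cdgoProv")).select_by_value(prov)
        time.sleep(2)
        dist_values = [o.get_attribute('value') for o in Select(driver.find_element_by_id("cdgoDist")).options if o.get_attribute('value')]
        for dist in dist_values:
            Select(driver.find_element_by_id("cdgoDist")).select_by_value(dist)
            time.sleep(2)
            # Scrape the results shown for this district here, e.g. save a
            # screenshot or the page source keyed by the district code `dist`.
            print(dep, prov, dist)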

Related

Methods to webscrape Prizepicks for all NBA props

I have been looking everywhere for help with any Python method to web scrape all NBA props from app.prizepicks.com. I came down to two potential methods: the API with pandas, or Selenium. I believe PrizePicks recently shut down their API to stop users from scraping the NBA props, so to my knowledge using selenium-stealth is the only way to scrape the PrizePicks NBA board. Can anyone please help me with, or provide, code that scrapes PrizePicks for all NBA props? The information needed would be the player name, the prop type (such as points, rebounds, 3-pt made, free throws made, fantasy, pts+rebs, etc.), and the prop line (such as 34.5 or 8.5, which could belong to a prop type such as points or rebounds, respectively). I would need this to work reasonably quickly and refresh every set number of minutes. I found something similar to what I would want in another thread by 'C. Peck', which I will reference (hopefully; I don't really know how to use Stack Overflow). But the code that C. Peck provided does not work on my device, and I was wondering if anyone here could write functional code or fix that code to work for me. I have a MacBook Pro, so I don't know if that affects anything.
EDIT
After a lot of trial and error, and help from the thread, I have managed to complete the first step. I am able to scrape the "Points" tab of the PrizePicks NBA board, but I want to scrape the info from every tab, not just points. I honestly don't know why my code isn't fully working; I basically want it to scrape points, rebounds, assists, fantasy, etc. Let me know any fixes I should make to be able to scrape for every stat_element in the stat_container, or other methods too! I'll update the code below:
EDIT AGAIN
It seems like the problem lies in the "stat-container" and "stat-elements". I checked which elements "stat-elements" contains, and it is only Points. I checked which elements "stat-container" contains, and it gave me an error. I believe if someone helps me with that, then the problem will be fixed. This is the error I get when I try to iterate over the elements inside "stat-container":
line 27, in <module>
    for element in stat_container:
TypeError: 'WebElement' object is not iterable
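As a side note on that TypeError: find_element and visibility_of_element_located return a single WebElement, which is not iterable, while the plural find_elements / presence_of_all_elements_located return a list you can loop over. A minimal illustration:
# Singular: one WebElement (not iterable)
container = driver.find_element(By.CLASS_NAME, "stat-container")
# Plural: a (possibly empty) list of WebElements you can iterate over
stats = container.find_elements(By.CSS_SELECTOR, "div.stat")
for stat in stats:
    print(stat.text)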
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://app.prizepicks.com/")
driver.find_element(By.CLASS_NAME, "close").click()
time.sleep(2)
driver.find_element(By.XPATH, "//div[#class='name'][normalize-space()='NBA']").click()
time.sleep(2)
# Wait for the stat-container element to be present and visible
stat_container = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "stat-container")))
# Find all stat elements within the stat-container
stat_elements = driver.find_elements(By.CSS_SELECTOR, "div.stat")
# Initialize empty list to store data
nbaPlayers = []
# Iterate over each stat element
for stat in stat_elements:
    # Click the stat element
    stat.click()
    projections = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".projection")))
    for projection in projections:
        names = projection.find_element(By.XPATH, './/div[@class="name"]').text
        points = projection.find_element(By.XPATH, './/div[@class="presale-score"]').get_attribute('innerHTML')
        text = projection.find_element(By.XPATH, './/div[@class="text"]').text
        print(names, points, text)
        players = {
            'Name': names,
            'Prop': points, 'Line': text
        }
        nbaPlayers.append(players)
df = pd.DataFrame(nbaPlayers)
print(df)
driver.quit()
Answer updated with the working code. I made some small changes:
- Replaced stat_elements with categories, a list containing the stat names.
- Looped over categories and clicked the div button whose label equals the current category name.
- Added .replace('\n','') at the end of the text variable.
driver.get("https://app.prizepicks.com/")
driver.find_element(By.CLASS_NAME, "close").click()
time.sleep(2)
driver.find_element(By.XPATH, "//div[@class='name'][normalize-space()='NBA']").click()
time.sleep(2)
# Wait for the stat-container element to be present and visible
stat_container = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "stat-container")))
# Find all stat elements within the stat-container
# i.e. categories is the list ['Points','Rebounds',...,'Turnovers']
categories = driver.find_element(By.CSS_SELECTOR, ".stat-container").text.split('\n')
# Initialize empty list to store data
nbaPlayers = []
# Iterate over each stat element
for category in categories:
    # Click the tab for this stat category
    line = '-' * len(category)
    print(line + '\n' + category + '\n' + line)
    driver.find_element(By.XPATH, f"//div[text()='{category}']").click()
    projections = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".projection")))
    for projection in projections:
        names = projection.find_element(By.XPATH, './/div[@class="name"]').text
        points = projection.find_element(By.XPATH, './/div[@class="presale-score"]').get_attribute('innerHTML')
        text = projection.find_element(By.XPATH, './/div[@class="text"]').text.replace('\n', '')
        print(names, points, text)
        players = {'Name': names, 'Prop': points, 'Line': text}
        nbaPlayers.append(players)
pd.DataFrame(nbaPlayers)

Python - Downloading PDFs from website behind dropdown

For the site https://www.wsop.com/tournaments/results/, the objective is to download all available PDFs in the REPORTS section, behind all the different drop-down options where they are available.
Currently I am trying to do this using Selenium, because I couldn't find an API, but I am open to other suggestions. For now the code is a bunch of copy-paste from relevant questions and YT videos.
My plan of attack is to select an option in the drop-down menu, press 'GO' (to load the results), navigate to 'REPORTS' (if available) and download all the PDFs available, and then iterate over all options. Challenge 2 is to get the PDFs into something like a dataframe to do some analysis.
Below is my current code, that only manages to download the top PDF of the by default selected option in the drop-down:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
import os
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
#settings and loading webpage
options=Options()
options.headless=True
CD=ChromeDriverManager().install()
driver=webdriver.Chrome(CD,options=options)
params={'behavior':'allow','downloadPath':os.getcwd()+'\\PDFs'}
driver.execute_cdp_cmd('Page.setDownloadBehavior',params)
driver.get('https://www.wsop.com/tournaments/results/')
#Go through the dropdown
drp=Select(driver.find_element_by_id("CPHbody_aid"))
drp.select_by_index(0)
drp=Select(driver.find_element_by_id("CPHbody_grid"))
drp.select_by_index(1)
drp=Select(driver.find_element_by_id("CPHbody_tid"))
drp.select_by_index(5)
#Click the necessary buttons (section with issues)
driver.find_element_by_xpath('//*[@id="nav-tabs"]/a[6]').click()
#driver.find_element_by_name('GO').click()
#WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "GO"))).click()
#WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "REPORTS"))).click()
a=driver.find_element_by_id("reports").click()
I can navigate through the drop-downs just fine (and it should be easy to iterate over them). However, I cannot get the 'GO' button pressed. I tried it a bunch of different ways, a few of which I showed as comments in the code.
I am able to press the REPORTS tab, but I think that breaks down when there is a different number of tabs; the line in the comments might work better. For now I am not able to download all the PDFs anyway; it just takes the first PDF of the page.
Many thanks to whoever can help:)
The website is structured in such a way that you can loop through the years in which a WSOP was played, then within each year loop through every event and get the data from the page into a pandas dataframe. This is far more efficient than downloading the PDFs.
You can control how far back you want to go with the from_year variable near the top of the script; going way back will obviously take more time. See the script below, which writes all the data to a CSV. Note that not every event has POY points available. Also, you'll need to pip install requests, pandas and bs4 if you haven't already.
import requests
import pandas as pd
from bs4 import BeautifulSoup

from_year = 2020

wsop_tournament_url = 'https://www.wsop.com/tournaments/GetTournaments.aspx?aid=2'
wsop_resp = requests.get(wsop_tournament_url)
soup = BeautifulSoup(wsop_resp.text, 'html.parser')
years = [x['value'] for x in soup.find_all('option') if str(from_year) in x.text]

event_dfs = []
for year in years:
    event_resp = requests.get(f'https://www.wsop.com/tournaments/GetEvents.aspx?grid={str(year)}')
    soup = BeautifulSoup(event_resp.text, 'html.parser')
    event_ids = [x['value'] for x in soup.find_all('option')]
    for event in event_ids:
        page = 1
        while True:
            url = f'https://www.wsop.com/tournaments/results/?aid=2&grid={year}&tid={event}&rr=5&curpage={page}'
            results = requests.get(url)
            soup = BeautifulSoup(results.text, 'html.parser')
            info = soup.find('div', {'id': 'eventinfo'})
            dates = info.find('p').text
            name = info.find('h1').text
            year_name = soup.find('div', {'class': 'content'}).find('h3').text.strip()
            table = soup.find('div', {'id': 'results'})
            size = int(table.find('ul')['class'][0][-1])
            rows = table.find_all('li')
            if len(rows) <= size + 1:
                break
            print(f'processing {year_name} - {name} - page {page}')
            output = []
            headers = []
            for x in range(size):
                series = []
                for i, row in enumerate(rows):
                    if i == x:
                        headers.append(row.text)
                        continue
                    if i % size == x:
                        series.append(row.text)
                output.append(series)
            df = pd.DataFrame(output)
            df = df.transpose()
            df.columns = headers
            df['year_name'] = year_name
            df['event_id'] = event
            df['year_id'] = year
            df['event_name'] = name
            df['dates'] = dates
            df['url'] = url
            event_dfs.append(df)
            page += 1
    print(f'Scraped {year_name} successfully')

final_df = pd.concat(event_dfs)
final_df.to_csv('wsop_output.csv', index=False)
I am not going to write you the whole script, but here's how to click on the "GO" button:
We can see from the developer tools that the button is the only element with the class "submit-red-button", so we can access it with: driver.find_elements_by_class_name('submit-red-button')[0].click()
You say that you can access the Reports tab, but it did not work when I tested your program, so just in case, you can use driver.find_elements_by_class_name('taboff')[4] to get it.
Then, all you need to do is click on each PDF link in order to download the files.
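Putting those pieces together, a rough sketch of one iteration, continuing from the question's script once the dropdowns have been set; the a[href$='.pdf'] locator for the report links is an assumption and may need adjusting to whatever the REPORTS tab actually renders:
import time

# Press GO (the only element with class "submit-red-button")
driver.find_elements_by_class_name('submit-red-button')[0].click()
time.sleep(2)
# Open the REPORTS tab; the index may shift when a tournament has fewer tabs
driver.find_elements_by_class_name('taboff')[4].click()
time.sleep(2)
# Assumption: the report downloads are plain links ending in .pdf
for link in driver.find_elements_by_css_selector("a[href$='.pdf']"):
    link.click()  # files land in the folder set via Page.setDownloadBehavior
    time.sleep(1)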

Webscraping - Selenium - Python

I want to extract all the fantasy teams that have been entered for past contests. To loop through the dates, I just change a small part of the URL as shown in my code below:
#Packages:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
# Driver
chromedriver =("C:/Users/Michel/Desktop/python/package/chromedriver_win32/chromedriver.exe")
driver = webdriver.Chrome(chromedriver)
# Dataframe that will be use later
results = pd.DataFrame()
best_lineups=pd.DataFrame()
opti_lineups=pd.DataFrame()
#For loop over all DATES:
calendar=[]
calendar.append("2019-01-10")
calendar.append("2019-01-11")
for d in calendar:
driver.get("https://rotogrinders.com/resultsdb/date/"+d+"/sport/4/")
Then, to access the different contests of that day, you need to click on the contest tab. I use the following code to locate and click on it.
# Find "Contest" tab
contest = driver.find_element_by_xpath("//*[@id='root']/div/main/main/div[2]/div[3]/div/div/div[1]/div/div/div/div/div[3]")
contest.click()
I simply inspected and copied the XPath of the tab. Most of the time it works, but sometimes I get the error message "Unable to locate element...". Moreover, it seems to work only for the first date in my calendar loop and always fails in the next iteration; I do not know why. I tried to locate it differently, but I feel I am missing something, such as:
contests = driver.find_element_by_xpath("//*[@role='tab']")
Once the contest tab is successfully clicked, all the contests of that day are there, and you can click a link to access all the entries of a contest. I stored the contest links in order to iterate through them all, as follows:
list_links = driver.find_elements_by_tag_name('a')
hlink=[]
for ii in list_links:
    hlink.append(ii.get_attribute("href"))
sub = "https://rotogrinders.com/resultsdb"
con = "contest"
contest_list = []
for text in hlink:
    if sub in text:
        if con in text:
            contest_list.append(text)
# Iterate through all the entries (users) of a contest and extract the information of the team entered by the user
for c in contest_list:
    driver.get(c)
Then, I want to extract every participant's team entered in the contest and store it in a dataframe. I am able to do this successfully for the first page of the contest.
# Wait until tables are loaded and have text. Times out after 60 seconds
while WebDriverWait(driver, 60).until(ec.presence_of_element_located((By.XPATH, './/tbody//tr//td//span//a[text() != ""]'))):
    # while ????:
    # Get tables to get the user names
    tables = pd.read_html(driver.page_source)
    users_df = tables[0][['Rank', 'User']]
    users_df['User'] = users_df['User'].str.replace(' Member', '')
    # Initialize results dataframe and iterate through users
    for i, row in users_df.iterrows():
        rank = row['Rank']
        user = row['User']
        # Find the user name and click on the name
        user_link = driver.find_elements(By.XPATH, "//a[text()='%s']" %(user))[0]
        user_link.click()
        # Get the lineup table after clicking on the user name
        tables = pd.read_html(driver.page_source)
        lineup = tables[1]
        #print (user)
        #print (lineup)
        # Restructure to put into results dataframe
        lineup.loc[9, 'Name'] = lineup.iloc[9]['Salary']
        lineup.loc[10, 'Name'] = lineup.iloc[9]['Pts']
        temp_df = pd.DataFrame(lineup['Name'].values.reshape(-1, 11),
                               columns=lineup['Pos'].iloc[:9].tolist() + ['Total_$', 'Total_Pts'])
        temp_df.insert(loc=0, column='User', value=user)
        temp_df.insert(loc=0, column='Rank', value=rank)
        temp_df["Date"] = d
        results = results.append(temp_df)
    #next_button = driver.find_elements_by_xpath("//button[@type='button']")
    #next_button[2].click()
results = results.reset_index(drop=True)
driver.close()
However, there are other pages, and to access them you need to click on the small arrow (next button) at the bottom. Moreover, you can click that button indefinitely, even when there are no more entries. Therefore, I would like to loop through all pages with entries, stop when there are no more entries, and move on to the next contest. I tried to implement a while loop to do so, but my code did not work.
You must really make sure that the page loads completely before you do anything on it.
"Moreover, it seems to work only for the first date in my calendar loop and always fails in the next iteration"
Usually when Selenium loads a browser page, it tries to look for the element even if the page is not loaded all the way. I suggest you recheck the XPath of the element you are trying to click.
Also try to see when the page loads completely and use time.sleep(number of seconds) to make sure you hit the element, or check for a particular element, or a property of an element, that would let you know the page has loaded.
One more suggestion: you can use driver.current_url to see which page you are targeting. I had this issue while I was working with multiple tabs, and I had to tell Python/Selenium to manually switch to that tab.
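As a concrete illustration of that advice, a sketch of waiting for the Contests tab to become clickable on each date instead of relying on the long copied XPath; the [contains(., 'Contest')] condition is an assumption about the tab's text and may need adjusting:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

for d in calendar:
    driver.get("https://rotogrinders.com/resultsdb/date/" + d + "/sport/4/")
    # Wait until the tab is actually clickable rather than assuming the page is ready
    contest_tab = WebDriverWait(driver, 30).until(
        ec.element_to_be_clickable((By.XPATH, "//*[@role='tab'][contains(., 'Contest')]"))
    )
    contest_tab.click()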

Using selenium in python to get data from a dynamic website: how to discover the way database queries are done?

I had some experience with coding before, but not specifically for web applications. I have been tasked with getting data from this website: http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/
They are available on a day-to-day basis. I have used Selenium in Python, and so far the results are good: I can get the entire table, store it in a pandas dataframe, and then push it to a MySQL database and so on. The problem is: the result from the website is always the same!
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
import time

def GetDataFromWeb(day, month, year):
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    # had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")
    # the table is in an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)
    # getting to the place where I should input the date
    date = driver.find_element_by_id("Data")
    date.send_keys("/".join((str(day), str(month), str(year))))
    date = driver.find_element_by_tag_name("button").click()
    # I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(5)
    page = bs(driver.page_source, "html.parser")
    table = page.find(id='tb_principal1')
    headers = ['Dias Corridos', '252', '360']
    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',', '.'))
        matrix.append(values)
    df = pd.DataFrame(data=matrix, columns=headers)
    driver.close()
    # only the first 2 columns are interesting for my purposes
    return df.iloc[:, 0:2]
The table resulting from this function is always the same, no matter what inputs I send to it, and it always seems to correspond to the date 06/09/2018 (month=09, day=06). I think the main problem is that I don't know how the queries to their database are done, so this always runs with a "default date". I have read some people talking about Ajax and JavaScript requests, but I don't know if that's the case. How can I tell?
This code will work (I updated a few lines in your code):
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time
import pandas as pd

def GetDataFromWeb(day, month, year):
    # to avoid a data error in the date handler
    if month < 10:
        month = "0" + str(month)
    if day < 10:
        day = "0" + str(day)
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    # had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")
    # the table is in an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)
    # getting to the place where I should input the date
    date = driver.find_element_by_id("Data")
    date.clear()  # to clear auto-populated data
    date.send_keys(((str(day), str(month), str(year))))  # removed the join part
    driver.find_element_by_tag_name("button").click()
    # I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(50)
    page = bs(driver.page_source, "html.parser")
    table = page.find(id='tb_principal1')
    headers = ['Dias Corridos', '252', '360']
    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',', '.'))
        matrix.append(values)
    df = pd.DataFrame(data=matrix, columns=headers)
    driver.close()
    # only the first 2 columns are interesting for my purposes
    return df.iloc[:, 0:2]

print(GetDataFromWeb(3, 9, 2018))
It will print the matching data for the required date.
I have added:
# to avoid a data error in the date handler
if month < 10:
    month = "0" + str(month)
if day < 10:
    day = "0" + str(day)
and:
date.clear()  # to clear auto-populated data
date.send_keys(((str(day), str(month), str(year))))  # removed the join part
Note: the problem in your code was that the day and month fields take two-digit numbers, and the date.send_keys("/".join((str(day), str(month), str(year)))) line was generating an error, because of which the system date was picked up and you always saw the same data for any input. Also, when you click on the date field it picks up a default date, so we first have to clear that and then send the custom date. Hope this helps.
Update for the additional query: add these imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Add this line in place of the wait:
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR,'#divContainerIframeBmf > form > div > div > div:nth-child(1) > div:nth-child(3) > div > div > p')))

How to loop through HTML and return id values

I am using selenium to navigate to a webpage and store the page source in a variable.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://google.com")
html1 = driver.page_source
html1 now contains the page source of http://google.com.
My question is: how can I return HTML selectors such as id="id" or name="name"?
EDIT:
For example:
The webpage I navigated to with selenium has a menu bar with 4 tabs. Each tab has an id attribute: id="tab1", id="tab2", and so on. I would like to return each id value, so I want tab1, tab2, and so on.
Edit#2:
Another example:
The homepage of my website (http://chrisarroyo.me) has several clickable links with ids. I would like to be able to return/print those ids to my console.
So I would like to return the id for the Learn More button and the ids for the links in the footer (facebookLnk, githubLnk, etc.).
If you are looking for a list of WebElements that have an ID, use:
elements = driver.find_elements_by_xpath("//*[@id]")
You can then iterate over that list and use get_attribute("id") to pull out each element's specific ID.
For name, it's pretty much the same code: just change id to name and you're set.
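For instance, a minimal sketch of the name variant, following the same pattern as the id example above:
# Collect every element that declares a name attribute and print it
named_elements = driver.find_elements_by_xpath("//*[@name]")
for element in named_elements:
    print(element.get_attribute("name"))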
Thank you @stewartm, your comment helped.
This ended up giving me the results I was looking for:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://chrisarroyo.me")
id_elements = driver.find_elements_by_xpath("//*[@id]")
for eachElement in id_elements:
    individual_ids = eachElement.get_attribute("id")
    print(individual_ids)
After running the above, the output listed each of the ids on the specified webpage.
output:
navbarNavAltMarkup
learnBtn
githubLnk
facebookLnk
linkedinLnk
