Hi, I'm trying to extract some data tables from a website (https://www.anbima.com.br/pt_br/informar/curvas-de-juros-fechamento.htm), but as we can see, the data is inside an iframe. Since I'm not an expert at web scraping, it took me a while to click the "Consultar" button to reach the page I want. That page loads the data (4 tables), which is inside an iframe too.
The problem is that I still haven't had any successful attempt at getting the tables, maybe because of the iframe.
For example, I tried to use XPath on the first table, without success:
driver.find_element_by_xpath('//*[@id="Parametros"]/table').text
Here's the code to reach the page that I mentioned:
from selenium import webdriver
import time
import re
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as expectedCondition
from selenium.webdriver.chrome.options import Options
import pandas as pd
import numpy as np
#----------------------- SCRAPING INITIALIZATION -----------------------------#
want_to_scrape = True
if want_to_scrape:
    options = Options()
    #options.add_argument('--headless')
    driver = webdriver.Chrome("C:\\Users\\......\\chromedriver.exe", options=options)
    now = time.time()
    dataset_list = []
    url = 'https://www.anbima.com.br/pt_br/informar/curvas-de-juros-fechamento.htm'
    driver.get(url)
    #element = driver.find_element_by_class_name('full')
    #driver.switch_to.frame(element)
    driver.switch_to.frame(0)
    element = driver.find_elements_by_name('Consultar')
    element[0].click()
    time.sleep(1)
    try:
        alert = driver.switch_to.alert
        alert.accept()
        print("alert accepted")
    except:
        print("no alert")
    time.sleep(1)
    driver.switch_to.frame(0)
    driver.find_element_by_xpath  # this is where I'm stuck
Try replacing your driver.switch_to.frame(0) line with this:
# Get the iframe element - note, may need to use more specialized selector here
iframe = driver.find_element_by_tag_name('iframe')
driver.switch_to.frame(iframe)
That will get your driver into the frame context so you can fetch the tables. You may need to use a different selector to get the iframe element. If you post the iframe HTML here, I can help you write a selector.
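To make the whole idea concrete, here is a minimal sketch assuming the result page nests a second iframe around the tables (the frame index comes from the question's code, and the decimal-comma number format is an assumption about the Brazilian data; adjust both to the actual page):
import pandas as pd

# after clicking "Consultar" and accepting the alert:
driver.switch_to.default_content()                  # back to the top-level document
driver.switch_to.frame(0)                           # outer iframe (index taken from the question)
inner = driver.find_element_by_tag_name('iframe')   # inner iframe, if the result page nests one
driver.switch_to.frame(inner)

# parse every <table> in the frame into DataFrames in one call
tables = pd.read_html(driver.page_source, decimal=',', thousands='.')
print(len(tables))  # the question expects 4 tables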
Related
I'm trying to scrape information from clickable popups in a table on a website into a pandas DataFrame using Selenium in Python, and it seems to work as long as the popups contain information.
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
import pandas as pd
import time

driver = webdriver.Chrome()
driver.get('https://mspotrace.org.my/Sccs_list')
time.sleep(20)

# Select maximum number of entries
elem = driver.find_element_by_css_selector('select[name=dTable_length]')
select = Select(elem)
select.select_by_value('500')
time.sleep(15)

# Get list of elements
elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@title='View on Map']")))

# Loop through element popups and pull details of facilities into DF
pos = 0
df = pd.DataFrame(columns=['facility_name', 'other_details'])
try:
    for element in elements:
        data = []
        element.click()
        time.sleep(3)
        facility_name = driver.find_element_by_xpath('//h4[@class="modal-title"]').text
        other_details = driver.find_element_by_xpath('//div[@class="modal-body"]').text
        data.append(facility_name)
        data.append(other_details)
        df.loc[pos] = data
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[aria-label='Close'] > span"))).click()  # close popup window
        time.sleep(10)
        pos += 1
except:
    print("No geo location information")
    pass

print(df)
However, there are cases when a dialog window appears and I need to click 'OK' on it to resume scraping the other rows on the page, but I can't seem to find the element to click on to do this.
Can you try this for Python:
driver.switch_to.alert.accept()
But your test scenario should be deterministic: you should know where this popup appears. If you don't know and it really is random, you can add a hook that checks for the alert after each test step.
The Selenium driver provides a way to switch to the alert context and work with it:
driver.switch_to.alert
After that, you can do whatever you want, depending on the alert type. To simulate clicking "OK":
driver.switch_to.alert.accept()
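If the timing of the popup is unpredictable, a minimal sketch using Selenium's built-in alert_is_present condition (the 5-second timeout is an arbitrary choice):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    # wait up to 5 seconds for a JavaScript alert/confirm dialog to show up
    WebDriverWait(driver, 5).until(EC.alert_is_present())
    driver.switch_to.alert.accept()  # click "OK"
except TimeoutException:
    pass  # no dialog appeared; carry on scraping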
Cannot find the element
The code was written in Python using Visual Studio Code.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
paginaHit = 'https://hit.com.do/solicitud-de-verificacion/'
driver.get(paginaHit)
driver.maximize_window()
time.sleep(5)

bl = 'SMLU7318830A'
elementoBL = driver.find_element(By.XPATH, '//*[@id="billoflanding"]').send_keys(bl)
# WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.NAME, "bl"))).click()
The code runs OK, but it cannot find the element on the webpage.
The portion of the page you are trying to access is inside an EMBED tag. It looks similar to an IFRAME so I would start by switching the context to the EMBED tag and then try searching for the element.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
paginaHit = 'https://hit.com.do/solicitud-de-verificacion/'
driver.get(paginaHit)
driver.maximize_window()

embed = driver.find_element(By.CSS_SELECTOR, "embed")
driver.switch_to.frame(embed)

bl = 'SMLU7318830A'
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.ID, "billoflanding"))).send_keys(bl)
A couple of additional points:
Don't use sleeps... sleeps are a bad practice. Instead, use WebDriverWait when you need to wait for something to happen (see the sketch after these points).
If you are using an ID to find an element, use By.ID and not XPath. ID should be preferred when available; next a CSS selector; and finally XPath only when needed, e.g. to locate elements by contained text or to do complicated DOM traversal.
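To illustrate the first point, a minimal sketch replacing the fixed time.sleep(5) from the question with an explicit wait (same billoflanding field as above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# instead of time.sleep(5) followed by find_element:
wait = WebDriverWait(driver, 10)  # polls the condition until it holds, up to 10 s
element = wait.until(EC.element_to_be_clickable((By.ID, "billoflanding")))
element.send_keys(bl)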
I'm practicing by trying to scrape my university's course catalog. I have a few lines in Python that open the URL in Chrome and click the search button to bring up the course catalog. When I go to extract the text using find_elements_by_xpath(), it returns blank. When I use the dev tools in Chrome, there definitely is text there.
from selenium import webdriver
import time

driver = webdriver.Chrome()
url = 'https://courses.osu.edu/psp/csosuct/EMPLOYEE/PUB/c/COMMUNITY_ACCESS.OSR_CAT_SRCH.GBL?'
driver.get(url)
time.sleep(3)

iframe = driver.find_element_by_id('ptifrmtgtframe')
driver.switch_to.frame(iframe)
element = driver.find_element_by_xpath('//*[@id="OSR_CAT_SRCH_WK_BUTTON1"]')
element.click()
course = driver.find_elements_by_xpath('//*[@id="OSR_CAT_SRCH_OSR_CRSE_HEADER$0"]')
print(course)
I'm trying to extract the text from the element 'OSR_CAT_SRCH_OSR_CRSE_HEADER$0'. I don't understand why it's not returning the text values, especially when I can see that it contains text with dev tools.
You are not calling .text, which is the reason you are not getting the text.
course = driver.find_element_by_xpath('//*[@id="OSR_CAT_SRCH_OSR_CRSE_HEADER$0"]').text
Try the above change in the second-to-last line.
Below is the full code after the changes
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome()
url = 'https://courses.osu.edu/psp/csosuct/EMPLOYEE/PUB/c/COMMUNITY_ACCESS.OSR_CAT_SRCH.GBL?'
driver.get(url)
time.sleep(3)

iframe = driver.find_element_by_id('ptifrmtgtframe')
driver.switch_to.frame(iframe)
element = driver.find_element_by_xpath('//*[@id="OSR_CAT_SRCH_WK_BUTTON1"]')
element.click()

# wait up to 10 seconds for the first result to render
course = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="OSR_CAT_SRCH_OSR_CRSE_HEADER$0"]'))
).text
print(course)
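If the goal is every course header rather than just the first, a sketch under the assumption that the ids are suffixed $0, $1, $2, and so on (judging by the $0 above):
# collect every element whose id starts with the header prefix
courses = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, '//*[starts-with(@id, "OSR_CAT_SRCH_OSR_CRSE_HEADER$")]')
    )
)
for c in courses:
    print(c.text)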
I've been trying to scrape data from a table using selenium, but when I run the code, it only gets the header of the table.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
driver.implicitly_wait(100)
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)
I also tried finding the element by tag name ('table'), without luck.
You should try this:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
driver.implicitly_wait(100)
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
number = 2
while number < 12:
    content = driver.find_element_by_xpath('//*[@id="body"]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[' + str(number) + ']')
    print(content.text)
    number += 1
The XPath in 'table' points at just the header; the actual content is at '//*[@id="body"]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[' + str(number) + ']'. That's why you are not getting any content other than the header. Since the row XPaths look like .../tr[2], .../tr[3], .../tr[4], etc., I'm using str(number) with number < 12 to get all the rows; you can also try 50 rows at a time, it's up to you.
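As a variation on the same idea, a sketch that avoids hard-coding the row count by letting find_elements return however many rows exist (the short CSS locator assumes this is the only table on the page):
# grab every populated row of the results table in one call
rows = driver.find_elements_by_css_selector('table tbody tr')
for row in rows:
    print(row.text)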
I would use requests and mimic the POST request made by the page, as it's much faster:
import requests

s = requests.Session()
data = {'METHOD': '0', 'VALUE': '{"BusquedaRubros":"true","IdRubro":"41","Inicio":0}'}
r = s.post('http://www.panamacompra.gob.pa/Security/AmbientePublico.asmx/cargarActosOportunidadesDeNegocio', data=data).json()
print(r['listActos'])
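If the result needs to end up tabular, a follow-on sketch, assuming listActos is a list of flat JSON objects (the key name comes from the answer above; the columns are whatever fields the endpoint returns):
import pandas as pd

df = pd.DataFrame(r['listActos'])  # one column per JSON field
print(df.head())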
You need to wait until the loader disappears; you can use invisibility_of_element_located together with WebDriverWait and expected_conditions. For the table, you can use a CSS selector instead of your XPath.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
time.sleep(2)
WebDriverWait(driver, 50).until(EC.invisibility_of_element_located((By.XPATH, '//img[@src="images/loading.gif"]')))
table = driver.find_element_by_css_selector('.table_asearch.table.table-bordered.table-striped.table-hover.table-condensed')
print(table.text)
driver.quit()
Selenium loads the table header (which happens fairly quickly) and then assumes the page is done, because it is never given a chance to load the table rows (which happens more slowly). One way around this is to repeatedly try to find an element that won't appear until the table has finished loading.
This is FAR from the most elegant solution (and there are probably Selenium helpers that do it better), but you can wait for the table by checking whether a new table row can be found, and if not, sleep for 1 second before trying again.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')

loaded = False
while not loaded:
    try:
        # try loading one of the elements we want to read
        el = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[3]')
        loaded = True
    except NoSuchElementException:
        # not loaded yet
        print('table body empty, waiting...')
        time.sleep(1)
print('table loaded!')

# element got loaded; reload the table
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)
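For the record, the more elegant version this answer alludes to is Selenium's own explicit wait; a sketch using the same row XPath (the 60-second timeout is an arbitrary choice):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 60 s for the third data row to appear (WebDriverWait polls internally)
WebDriverWait(driver, 60).until(EC.presence_of_element_located(
    (By.XPATH, '/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[3]')))
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)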
I want to grab the page source of the page after I make a click, and then go back using the browser.back() function. But Selenium doesn't wait for the page to fully load after the click, so the content generated by JavaScript isn't included in that page's source.
element[i].click()
#Need to wait here until the content is fully generated by JS.
#And then grab the page source.
scoreCardHTML = browser.page_source
browser.back()
As Alan mentioned, you can wait for some element to be loaded. Below is example code:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox()
element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "element_id")))
You can also use Selenium's staleness_of. The helper below is a context manager: the code run inside the with block triggers the navigation, and on exit it waits until the old page's <html> element has gone stale.
from contextlib import contextmanager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

@contextmanager
def wait_for_page_load(browser, timeout=30):
    old_page = browser.find_element_by_tag_name('html')
    yield
    WebDriverWait(browser, timeout).until(
        staleness_of(old_page)
    )
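A usage sketch, assuming element[i] is the link being clicked, as in the question:
with wait_for_page_load(browser, timeout=30):
    element[i].click()                   # navigation is triggered inside the block
scoreCardHTML = browser.page_source      # old page has gone stale; new one is loaded
browser.back()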
You can do it with a try-and-wait loop, an easy method to implement:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("url")

button = None
while not button:
    try:
        button = browser.find_element_by_name('NAME OF ELEMENT')
        button.click()
    except:
        continue
Assuming "pass" is an element in the current page and won't be present at the target page.
I mostly use Id of the link I am going to click on. because it is rarely present at the target page.
while True:
try:
browser.find_element_by_id("pass")
except:
break
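Putting the thread's pieces together, a sketch of the full click / wait / grab / go-back flow ('NAME OF ELEMENT' and the "pass" id are placeholders carried over from the answers above):
import time
from selenium.common.exceptions import NoSuchElementException

n = len(browser.find_elements_by_name('NAME OF ELEMENT'))
for i in range(n):
    # re-find the links on every pass: references go stale after browser.back()
    link = browser.find_elements_by_name('NAME OF ELEMENT')[i]
    link.click()
    # spin until the old page's marker element ("pass") is gone
    while True:
        try:
            browser.find_element_by_id("pass")
            time.sleep(0.5)
        except NoSuchElementException:
            break
    scoreCardHTML = browser.page_source  # JS-generated content now present
    browser.back()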