How do I scrape a JS popup table with Selenium? - python

The goal: I need to select an option from a dropdown menu, and when a list gets populated below, click on each entry iteratively and scrape all the given data. Thankfully the elements have proper class/ID names, so it should be doable, but I am facing some issues as described below.
You can understand it better if you visit the website here: www.psx.com.pk/psx/resources-and-tools/listings/listed-companies
Messy code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select

chromedriver = "chromedriver.exe"
driver = webdriver.Chrome(chromedriver)
driver.get("https://www.psx.com.pk/psx/resources-and-tools/listings/listed-companies")
select = Select(driver.find_element_by_id("sector"))
for opt in select.options:  # loop through all the dropdown options on the site
    opt.click()  # in the source code the table class gets populated here
    table = driver.find_elements_by_class_name("addressbook")
    for index in range(len(table)):
        # if index % 2 == 0:
        elem = table[index].text
        print(elem)
        elem.click()  # note: elem is a string at this point, so .click() raises AttributeError
        data = driver.find_elements_by_class_name("addressbookdata")
        print(data)
If you run this code on your end, the output is very erratic. If everything worked correctly I would get index/company names in my table text, so I thought a quick and dirty solution to get just the IDs would be to take every second index (% 2) instead of populating a DataFrame first and then dropping duplicates. After I've gotten all the IDs, I need to click on each of them, then extract the data under the addressbookdata class and append it all into one DataFrame. I don't think there's any logical problem in my code right now, but I can't make this work. It's my first time using Selenium as well; I'm much more comfortable with BeautifulSoup.
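For reference, a minimal sketch of the intended click-each-row flow with explicit waits, assuming the "addressbook" and "addressbookdata" class names from the question match the live page (untested against the site):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.psx.com.pk/psx/resources-and-tools/listings/listed-companies")
wait = WebDriverWait(driver, 30)

select = Select(wait.until(EC.presence_of_element_located((By.ID, "sector"))))
values = [o.get_attribute("value") for o in select.options if o.get_attribute("value")]
for value in values:
    Select(driver.find_element(By.ID, "sector")).select_by_value(value)  # re-find to avoid stale references
    rows = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "addressbook")))
    for i in range(len(rows)):
        rows = driver.find_elements(By.CLASS_NAME, "addressbook")  # re-query; clicking may re-render the list
        rows[i].click()  # open the detail popup for this company
        details = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "addressbookdata")))
        print([d.text for d in details])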

Select the dropdown option by value and pull the table data with Selenium and pandas:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
import pandas as pd

driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'https://www.psx.com.pk/psx/resources-and-tools/listings/listed-companies'
driver.get(url)
driver.maximize_window()
wait = WebDriverWait(driver, 30)
# select the option from the dropdown
Select(wait.until(EC.visibility_of_element_located((By.XPATH, "//select[@id='sector']")))).select_by_value("0801")
dptable = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@class="table-responsive"]'))).get_attribute("outerHTML")
df = pd.read_html(dptable)
print(df)
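To cover every sector rather than just "0801", the same pattern can be looped over the option values. A sketch continuing from the code above, assuming the table re-renders after each selection:
sector_select = Select(wait.until(EC.visibility_of_element_located((By.ID, "sector"))))
values = [o.get_attribute("value") for o in sector_select.options if o.get_attribute("value")]
frames = []
for v in values:
    Select(driver.find_element(By.ID, "sector")).select_by_value(v)  # re-find each time to avoid stale references
    html = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@class="table-responsive"]'))).get_attribute("outerHTML")
    frames.append(pd.read_html(html)[0])
df = pd.concat(frames, ignore_index=True)
print(df)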

Related

How to close clickable popup to continue scraping through Selenium in python

I'm trying to scrape some information from clickable popups in a table on a website into a pandas dataframe using Selenium in python and it seems to be able to do this if the popups have information.
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
import pandas as pd
import time
driver = webdriver.Chrome()
driver.get('https://mspotrace.org.my/Sccs_list')
time.sleep(20)
# Select maximum number of entries
elem = driver.find_element_by_css_selector('select[name=dTable_length]')
select = Select(elem)
select.select_by_value('500')
time.sleep(15)
# Get list of elements
elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@title='View on Map']")))
# Loop through element popups and pull details of facilities into DF
pos = 0
df = pd.DataFrame(columns=['facility_name','other_details'])
try:
    for element in elements:
        data = []
        element.click()
        time.sleep(3)
        facility_name = driver.find_element_by_xpath('//h4[@class="modal-title"]').text
        other_details = driver.find_element_by_xpath('//div[@class="modal-body"]').text
        data.append(facility_name)
        data.append(other_details)
        df.loc[pos] = data
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[aria-label='Close'] > span"))).click()  # close popup window
        time.sleep(10)
        pos += 1
except:
    print("No geo location information")
    pass
print(df)
However, there are cases when a window like the one below appears, and I need to click 'OK' on it to resume scraping the other rows on the web page, but I can't seem to find the element to click on to do this.
Can you try this for Python:
driver.switch_to.alert.accept()
But your test scenario should be clear, and you should know where this pop-up appears. If you don't know and it is really random, you can add a check that runs after each test step.
Selenium provides methods to switch to the alert context and work with it. In Python, switch_to is a property:
driver.switch_to.alert
After that, you can do whatever you want, depending on the alert type. To simulate clicking on "OK":
driver.switch_to.alert.accept()
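If the timing of the alert is unpredictable, one option is to wait for it explicitly with Selenium's built-in alert_is_present condition and carry on when none shows up (a sketch):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 5).until(EC.alert_is_present())  # wait up to 5 seconds for an alert
    driver.switch_to.alert.accept()  # simulate clicking "OK"
except TimeoutException:
    pass  # no alert appeared, resume scraping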

Turn a table in Selenium into a pandas DataFrame?

I am trying to scrape a table consisting of 45 columns and 7 rows. The table is loaded using AJAX and I can't access the API, so I needed to use Selenium in Python. I am close to getting what I want, but I don't know how I can turn my Selenium find_elements results into a pandas DataFrame. So far, my code looks like this:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
driver = webdriver.Chrome()
url = "http://www.hctiming.com/myphp/resources/login/browse_results.php?live_action=yes&smartphone_action=no" #a redirect to a login page occurs
driver.get(url)
driver.find_element_by_id("open").click()
user = driver.find_element_by_name("username")
password = driver.find_element_by_name("password")
user.clear()
user.send_keys("MyUserNameWhichIWillNotShare")
password.clear()
password.send_keys("myPasswordWhicI willNotShare")
driver.find_element_by_name("submit").click()
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "Results Services"))  # I must first click this link
    )
    element.click()
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "View Live"))  # then I must click this link; now I have access to the results database
    )
    element.click()
except:
    driver.quit()
time.sleep(5)  # I have set a sleep of 5 seconds. There must be a better way to accomplish this; I just want to make sure that the table is loaded when I try to scrape it
columns = len(driver.find_elements_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th"))
rows = len(driver.find_elements_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr"))
print(columns, rows)
The last line prints 45 and 7, so this seems to work. However, I don't understand how I can make a DataFrame out of it. Thank you.
It's hard to tell without seeing the data structure, but if the table is simple, you can try to parse it directly with pandas read_html:
df = pd.read_html(driver.page_source)[0]
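If the page contains several tables, read_html returns one DataFrame per table; the match argument can pick the right one by a string it contains (the header text below is a placeholder, not from the actual page):
import pandas as pd

dfs = pd.read_html(driver.page_source, match="SomeColumnHeader")  # placeholder; use a string from your table
df = dfs[0]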
You can also create a DataFrame by iterating through all table elements, manipulating the XPath appropriately:
df = pd.DataFrame()
for i in range(rows):
    s = pd.Series()
    for c in range(columns):
        s[c] = driver.find_element_by_xpath(f"/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr[{i+1}]/td[{c+1}]").text
    df = df.append(s, ignore_index=True)
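Building a list of lists and constructing the DataFrame once is usually faster than appending row by row; a sketch using the same XPath:
data = []
for i in range(rows):
    row = []
    for c in range(columns):
        cell = driver.find_element_by_xpath(f"/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr[{i+1}]/td[{c+1}]")
        row.append(cell.text)
    data.append(row)
df = pd.DataFrame(data)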

How to get data from table using selenium in Python

I have this URL which has a table in it. I need to get all the row and column data from the table, across all of its pages. I am not able to understand how I can get the data from the table. Below is the code I have:
from selenium import webdriver
import os
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import Select
from pynput.keyboard import Key, Controller
curr_path = os.path.dirname(os.path.abspath(__file__))
keyboard = Controller()
driver = webdriver.Firefox()
driver.get("http://silk.dephut.go.id/index.php/info/iuiphhk")
driver.maximize_window()
The above code opens Firefox and loads up the URL. The code below I am using to click to the next page:
next_btn = (By.XPATH, "//div[@id='silk_content_wrapper']//ul[1]//li[4]//a[1]")
WebDriverWait(driver, 30).until(ec.element_to_be_clickable(next_btn)).click()
But I am unable to understand how to get the data from the table. I am not from the web development field, so I am not able to understand the website's code. I referred to this question's accepted answer and extracted the ID of the table:
table_id = driver.find_element(By.ID, 'diviuiphhk')
But I didn't find the IDs of the rows to get the values. To find the ID or XPath of any object on a URL, I use ChroPath. Can anyone please help me understand how to get data from the table? Please help. Thanks
I was able to solve it. Below is the code:
table_id = driver.find_element(By.XPATH, "//table[@class='table']")
for row in range(1, 11):
    rows = table_id.find_elements(By.XPATH, "//body//tbody//tr[" + str(row) + "]")
    for row_data in rows:
        col = row_data.find_elements(By.TAG_NAME, "td")
        for i in range(len(col)):
            print(col[i].text)
First I used ChroPath to get the XPath of the table. Then I also got the XPath of a row. This XPath was the same for all the rows of the table; I just had to increase the number from 1 to 10. The columns inside the rows are referenced by the td tag name, so I used this tag name to get the values of the columns.
Thanks
This will give you all the cells of the table, and you can extract the data:
driver.find_elements(By.XPATH, "//table[@class='table']/tbody/tr/td")
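Putting the row extraction together with the Next-page click from the question, a sketch (the page count of 5 is an assumption; adjust it to the real number of pages):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import time

all_rows = []
for page in range(5):  # assumed page count
    table = driver.find_element(By.XPATH, "//table[@class='table']")
    for tr in table.find_elements(By.TAG_NAME, "tr"):
        cells = [td.text for td in tr.find_elements(By.TAG_NAME, "td")]
        if cells:  # header rows contain th cells only, so they come back empty
            all_rows.append(cells)
    next_btn = (By.XPATH, "//div[@id='silk_content_wrapper']//ul[1]//li[4]//a[1]")
    WebDriverWait(driver, 30).until(ec.element_to_be_clickable(next_btn)).click()
    time.sleep(2)  # crude wait for the next page to render; waiting on row content would be more robust
print(all_rows)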

Scrape data from table whose elements don't load immediately

I've been trying to scrape data from a table using selenium, but when I run the code, it only gets the header of the table.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
driver.implicitly_wait(100)
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)
I also tried finding the element by the table tag name, without luck.
you should try this:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
driver.implicitly_wait(100)
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
number = 2
while number < 12:
    content = driver.find_element_by_xpath('//*[@id="body"]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[' + str(number) + ']')
    print(content.text)
    number += 1
The XPath in 'table' is just the header; the actual content is '//*[@id="body"]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr['+str(number)+']', which is why you are not getting any content other than the header. Since the XPaths of the rows are like ...../tr[2], ...../tr[3], ...../tr[4], etc., I'm using str(number) < 12 to get all the rows; you can also try 50 rows at a time, it's up to you.
I would use requests and mimic the POST request made by the page, as it is much faster:
import requests

s = requests.Session()
data = {'METHOD': '0', 'VALUE': '{"BusquedaRubros":"true","IdRubro":"41","Inicio":0}'}
r = s.post('http://www.panamacompra.gob.pa/Security/AmbientePublico.asmx/cargarActosOportunidadesDeNegocio', data=data).json()
print(r['listActos'])
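The Inicio field in the payload looks like a result offset, so paging could plausibly be done by incrementing it. This is an assumption about an undocumented endpoint, so verify it against the actual responses:
import requests

s = requests.Session()
url = 'http://www.panamacompra.gob.pa/Security/AmbientePublico.asmx/cargarActosOportunidadesDeNegocio'
all_actos = []
for offset in (0, 20, 40):  # assumed page size of 20; check the response length
    data = {'METHOD': '0', 'VALUE': '{"BusquedaRubros":"true","IdRubro":"41","Inicio":%d}' % offset}
    r = s.post(url, data=data).json()
    all_actos.extend(r['listActos'])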
You need to wait until the loader disappears; you can use invisibility_of_element_located with WebDriverWait and expected_conditions. For the table you can use a css_selector instead of your XPath:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
time.sleep(2)
WebDriverWait(driver, 50).until(EC.invisibility_of_element_located((By.XPATH, '//img[@src="images/loading.gif"]')))
table = driver.find_element_by_css_selector('.table_asearch.table.table-bordered.table-striped.table-hover.table-condensed')
print(table.text)
driver.quit()
Selenium is loading the table (which happens fairly quickly) and then assuming it is done, since it's never given a chance to load the table rows (which happens more slowly). One way around this is to repeatedly try to find an element that won't appear until the table has finished loading.
This is FAR from the most elegant solution (and there are probably Selenium facilities that do it better; see the WebDriverWait sketch after the code), but you can wait for the table by checking whether a new table row can be found, and if not, sleeping for 1 second before trying again.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
wvar = 0
while wvar == 0:
    try:
        # try loading one of the elements we want to read
        el = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[3]')
        wvar = 1
    except NoSuchElementException:
        # not loaded yet
        print('table body empty, waiting...')
        time.sleep(1)
print('table loaded!')
#element got loaded; reload the table
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)
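For reference, the same wait can be written with WebDriverWait, which polls internally instead of the manual sleep loop:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 60 seconds for a row to appear, then read the whole table body
WebDriverWait(driver, 60).until(EC.presence_of_element_located(
    (By.XPATH, '/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[3]')))
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)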

need to click on a particular object on a website to load more content multiple times using python and chrome driver

I need to click on a particular object multiple times, but it is executing only once. Please help me.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
chrome_path=r"C:\Users\Bhanwar\Desktop\New folder (2)\chromedriver.exe"
driver =webdriver.Chrome(chrome_path)
driver.get("https://priceraja.com/mobile/pricelist/samsung-mobile-price-list-in-india")
driver.implicitly_wait(10)
i = 0
while i < 4:
    driver.find_element_by_css_selector('.loadMore').click()
    driver.implicitly_wait(10)
    i += 1
I think that the line driver.implicitly_wait(10) doesn't actually make the driver wait at that point, and because there is an element with a CSS selector of .loadMore on the page the whole time, that element may be getting clicked faster than the page can handle. Another way to check would be to get a CSS selector of a new element that only appears after the page changes. Each phone item (on this page) has a specific ID, so you can check for the ID of the last item to see how many items are on the page. The ID seems to be "product-itmes-"; itmes is correct, it is not items.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By
import time

chrome_path = r"C:\Users\Bhanwar\Desktop\New folder (2)\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://priceraja.com/mobile/pricelist/samsung-mobile-price-list-in-india")
driver.implicitly_wait(10)

i = 0
timeout = 5  # renamed from "time" so it doesn't shadow the time module
by = By.ID
hook = "product-itmes-"  # the ID of one item: this prefix plus the number of the item
button = '.loadMore'
while i < 4:
    element_number = 25 * i  # it looks like 25 items are added each time, starting at 25
    WebDriverWait(driver, timeout).until(
        ec.presence_of_element_located((by, hook + str(element_number))))
    driver.find_element_by_css_selector(button).click()
    time.sleep(1)  # give the page a moment to change
    i += 1
