I have a URL with a table in it, spread across multiple pages, and I need to get all the row and column data from every page. I can't work out how to read the data from the table. Below is the code I have:
from selenium import webdriver
import os
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import Select
from pynput.keyboard import Key, Controller
curr_path = os.path.dirname(os.path.abspath(__file__))
keyboard = Controller()
driver = webdriver.Firefox()
driver.get("http://silk.dephut.go.id/index.php/info/iuiphhk")
driver.maximize_window()
The code above opens Firefox and loads the URL. The code below is what I use to click through to the next page:
next_btn = (By.XPATH, "//div[@id='silk_content_wrapper']//ul[1]//li[4]//a[1]")
WebDriverWait(driver, 30).until(ec.element_to_be_clickable(next_btn)).click()
But I can't work out how to get the data from the table. I am not from a web development background, so I can't make sense of the website's code. I referred to the accepted answer of this question and extracted the ID of the table:
table_id = driver.find_element(By.ID, 'diviuiphhk')
But I couldn't find IDs on the rows to get their values. To find the ID or XPath of any element on a page, I use ChroPath. Can anyone please help me understand how to get the data from the table? Thanks.
I was able to solve it. Below is the code:
table_id = driver.find_element(By.XPATH, "//table[@class='table']")
for row in range(1, 11):
    # an XPath starting with // searches the whole document, so make it
    # relative (.//) to stay inside the table element found above
    rows = table_id.find_elements(By.XPATH, ".//tbody/tr[" + str(row) + "]")
    for row_data in rows:
        col = row_data.find_elements(By.TAG_NAME, "td")
        for i in range(len(col)):
            print(col[i].text)
First I used ChroPath to get the XPath of the table, then the XPath of a row. The row XPath is the same for every row of the table; you just increase the index from 1 to 10. The cells inside each row are plain td elements, so I used that tag name to get the column values.
Thanks
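For completeness, here is a hedged sketch that combines the per-page scrape with the next-page click from the question to walk multiple pages; the page count of 5 is an assumption you would replace with the real total or an end-of-pagination check:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

next_btn = (By.XPATH, "//div[@id='silk_content_wrapper']//ul[1]//li[4]//a[1]")
all_rows = []
for page in range(5):  # assumed page count; replace with the real total or detect the last page
    # re-find the table on every page so we never hold a stale reference
    table = WebDriverWait(driver, 30).until(
        ec.presence_of_element_located((By.XPATH, "//table[@class='table']"))
    )
    for tr in table.find_elements(By.XPATH, ".//tbody/tr"):
        all_rows.append([td.text for td in tr.find_elements(By.TAG_NAME, "td")])
    WebDriverWait(driver, 30).until(ec.element_to_be_clickable(next_btn)).click()
print(all_rows)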
This will give you all the cells of the table, and you can extract the data from them:
driver.find_elements(By.XPATH, "//table[@class='table']/tbody/tr/td")
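If you would rather keep the cells grouped by row instead of one flat list, a minimal variant of the same idea (same table selector as above):
from selenium.webdriver.common.by import By

rows = driver.find_elements(By.XPATH, "//table[@class='table']/tbody/tr")
data = [[td.text for td in row.find_elements(By.TAG_NAME, "td")] for row in rows]
print(data)  # list of rows, each a list of cell texts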
The goal: I need to select an option from a dropdown menu, and when a list gets populated below, click on each entry iteratively and scrape all the given data. Thankfully the elements have proper ID and class names, so it should be doable, but I am facing some issues as described below.
You can understand it better if you visit the website here: www.psx.com.pk/psx/resources-and-tools/listings/listed-companies
Messy code:
chromedriver = "chromedriver.exe"
driver = webdriver.Chrome(chromedriver)
driver.get("https://www.psx.com.pk/psx/resources-and-tools/listings/listed-companies")
select = Select(driver.find_element_by_id("sector"))
for opt in select.options:  # this will loop through all the dropdown options from the site
    opt.click()  # in source code table class gets populated here
    table = driver.find_elements_by_class_name("addressbook")
    for index in range(len(table)):
        # if index % 2 == 0:
        elem = table[index].text
        print(elem)
        elem.click()
        data = driver.find_elements_by_class_name("addressbookdata")
        print(data)
If you run this code on your end, the output is very erratic. If everything worked correctly I would get the index/company names in my table text values, so I thought a quick and dirty solution to get just the IDs would be to take index % 2 instead of populating a DataFrame first and then dropping duplicates. After I've got all the IDs I need to click on each of them, then extract the data from the addressbookdata element and append it to a DataFrame. I don't think there's any logical problem in my code, but I can't make it work. It's my first time using Selenium as well; I am much more comfortable with BeautifulSoup.
I select the dropdown option by value and pull the table data with Selenium and pandas:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
import pandas as pd
driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'https://www.psx.com.pk/psx/resources-and-tools/listings/listed-companies'
driver.get(url)
driver.maximize_window()
wait = WebDriverWait(driver,30)
#select from dropdown pop up option
Select(wait.until(EC.visibility_of_element_located((By.XPATH, "//select[@id='sector']")))).select_by_value("0801")
dptable = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@class="table-responsive"]'))).get_attribute("outerHTML")
df = pd.read_html(dptable)[0]  # read_html returns a list of DataFrames; take the first
print(df)
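To cover every sector rather than just "0801", a hedged sketch that loops over all option values and concatenates the frames; it assumes each selection repopulates the same table-responsive container and that the select element stays attached, so a stale-element retry may be needed in practice:
sector_xpath = "//select[@id='sector']"
options = Select(wait.until(EC.visibility_of_element_located((By.XPATH, sector_xpath)))).options
values = [o.get_attribute("value") for o in options if o.get_attribute("value")]
frames = []
for v in values:
    # re-find the select each time to avoid a stale reference
    Select(driver.find_element(By.XPATH, sector_xpath)).select_by_value(v)
    html = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@class="table-responsive"]'))).get_attribute("outerHTML")
    frames.append(pd.read_html(html)[0])
all_sectors = pd.concat(frames, ignore_index=True)
print(all_sectors)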
I am very new to Python and I am trying to get all store locations from the Muthoot website. The following is the code I wrote, but I am not getting any output. Please let me know what is wrong and what I need to correct.
As I understand it, the code never gets the search button clicked, so nothing moves forward. But how do I do that?
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
import pandas as pd
driver = webdriver.Chrome(executable_path=r"D:\Chromedriverpath\chromedriver_win32\chromedriver.exe")
driver.get("https://www.muthootfinance.com/branch-locator")
#Saving this element in a variable
drp = Select(driver.find_element_by_id("statelist"))
slist = drp.options
for ele in slist:
    table = driver.select("table.table")
    columns = table.find("thead").find_all("th")
    column_names = [c.string for c in columns]
    table_rows = table.find("tbody").find_all("tr")
    l = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [str(tr.get_text()).strip() for tr in td]
        l.append(row)
    df = pd.DataFrame(l, columns=column_names)
    df.head()
I think this will work for you now. I copied your code and it seems to work!
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
import pandas as pd
driver = webdriver.Chrome("C:\Program Files (x86)\chromedriver.exe")
driver.get("https://www.muthootfinance.com/branch-locator")
# Saving this element in a variable
html_list = driver.find_element_by_id("state_branch_list")
items = html_list.find_elements_by_tag_name("li")
places = []
for item in items:
    places.append(item.text)  # collect every branch, not just the last one
    print(item.text)
df = pd.DataFrame(places)
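To get branches for every state (the original goal), a hedged sketch that walks the statelist dropdown from the question; it assumes selecting a state refreshes state_branch_list while the select itself stays attached, and the fixed sleep is a crude stand-in for a proper WebDriverWait:
from selenium.webdriver.support.ui import Select
import time

drp = Select(driver.find_element_by_id("statelist"))
all_places = []
for i in range(len(drp.options)):
    drp.select_by_index(i)  # pick the next state
    time.sleep(2)           # crude wait for the branch list to refresh
    branch_list = driver.find_element_by_id("state_branch_list")
    all_places += [li.text for li in branch_list.find_elements_by_tag_name("li")]
df = pd.DataFrame(all_places)
print(df.head())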
Hello, I want to scrape data from one website. I scrape my data with BeautifulSoup, and this is the code that I use (without the imports):
df = pd.read_html(requests.get('myurl').text, flavor="bs4")
df = pd.concat(df)
df.to_csv("mycsv.csv", index=False)
So far I don't have problems with this code, but when I want to scrape data from this site, the program above raises an error saying no table was found. So I used Selenium to solve my problem. Below is the code:
driver = webdriver.Firefox(executable_path=r'C:\Users\myfolders\geckodriver.exe')
driver.get("https://www.nba.com/stats/teams/traditional/?sort=W_PCT&dir=-1")
html = driver.page_source
tables = pd.read_html(html)
data = tables[1]
driver.close()
But again, when I execute the above code I have the same problem:
ValueError: No tables found
When I check the HTML of the page I can find the table elements. Can anyone help me with this problem?
You may need to wait for the tables to load before reading driver.page_source.
Tested the following on my machine and was able to pick up two tables. You may want to add additional waits depending on your needs.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
driver = webdriver.Chrome()
driver.get("https://www.nba.com/stats/teams/traditional/?sort=W_PCT&dir=-1")
table = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)
html = driver.page_source
tables = pd.read_html(html)
driver.close()
print(tables)
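pd.read_html returns a list of DataFrames, one per table tag it finds, so the last step is picking the one you want. The earlier snippet in this thread indexed tables[1]; which index holds the stats table on your run is worth verifying:
for i, t in enumerate(tables):
    print(i, t.shape)  # inspect which frame is the stats table
stats = tables[1]      # index taken from the earlier snippet; confirm it on your run
print(stats.head())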
I am trying to scrape a table consisting of 45 columns and 7 rows. The table is loaded using AJAX and I can't access the API directly, so I need to use Selenium in Python. I am close to getting what I want, but I don't know how to turn my Selenium find_elements results into a pandas DataFrame. So far, my code looks like this:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
driver = webdriver.Chrome()
url = "http://www.hctiming.com/myphp/resources/login/browse_results.php?live_action=yes&smartphone_action=no" #a redirect to a login page occurs
driver.get(url)
driver.find_element_by_id("open").click()
user = driver.find_element_by_name("username")
password = driver.find_element_by_name("password")
user.clear()
user.send_keys("MyUserNameWhichIWillNotShare")
password.clear()
password.send_keys("myPasswordWhicI willNotShare")
driver.find_element_by_name("submit").click()
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "Results Services"))  # I must first click this link
    )
    element.click()
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "View Live"))  # then I must click this link; now I have access to the results database
    )
    element.click()
except:
    driver.quit()
time.sleep(5)  # a 5-second sleep; there must be a better way, I just want to make sure the table is loaded before I try to scrape it
columns = len(driver.find_elements_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th"))
rows = len(driver.find_elements_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr"))
print(columns, rows)
The last line prints 45 and 7, so this seems to work. However, I don't understand how I can turn this into a DataFrame. Thank you.
It's hard to tell without seeing the data structure, but if the table is simple, you can try to parse it directly with pandas' read_html:
df = pd.read_html(driver.page_source)[0]
You can also create the DataFrame by iterating through all table elements, manipulating the XPath appropriately:
df = pd.DataFrame()
for i in range(rows):
    s = pd.Series(dtype=object)
    for c in range(columns):
        # find_element (singular) returns the cell itself; .text extracts its value
        s[c] = driver.find_element_by_xpath(f"/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr[{i+1}]/td[{c+1}]").text
    df = df.append(s, ignore_index=True)
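If you also want column names, the header cells can be read the same way with the thead XPath from the question (tr[2] is where the question's own XPath found the header row):
header_xpath = "/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th"
df.columns = [th.text for th in driver.find_elements_by_xpath(header_xpath)]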
I've been trying to scrape data from a table using selenium, but when I run the code, it only gets the header of the table.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
driver.implicitly_wait(100)
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)
I also tried finding the element by the table tag name, without luck.
You should try this:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
driver.implicitly_wait(100)
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
number = 2
while number < 12:
    content = driver.find_element_by_xpath('//*[@id="body"]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[' + str(number) + ']')
    print(content.text)
    number += 1
The XPath in 'table' is just the header; the actual content lives at '//*[@id="body"]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[' + str(number) + ']', which is why you are not getting any content other than the header. Since the row XPaths look like .../tr[2], .../tr[3], .../tr[4], etc., I'm using str(number) with number < 12 to get all the rows; you can also try 50 rows at a time, it's up to you.
I would use requests and mimic the POST request made by the page, as it's much faster:
import requests
data = {'METHOD': '0','VALUE': '{"BusquedaRubros":"true","IdRubro":"41","Inicio":0}'}
r = requests.post('http://www.panamacompra.gob.pa/Security/AmbientePublico.asmx/cargarActosOportunidadesDeNegocio', data=data).json()
print(r['listActos'])
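The "Inicio" key in the payload looks like a paging offset, so later pages can presumably be fetched by increasing it. That is an assumption, not something confirmed from the API, and the step size here is a guess; it also assumes listActos is a list of result rows:
import requests
import json

url = 'http://www.panamacompra.gob.pa/Security/AmbientePublico.asmx/cargarActosOportunidadesDeNegocio'
acts = []
for inicio in range(0, 60, 20):  # assumed offset and step; adjust after inspecting real responses
    payload = {'METHOD': '0', 'VALUE': json.dumps({"BusquedaRubros": "true", "IdRubro": "41", "Inicio": inicio})}
    r = requests.post(url, data=payload).json()
    acts.extend(r['listActos'])  # assumption: listActos is a list
print(len(acts))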
You need to wait until the loader disappears; you can use invisibility_of_element_located with WebDriverWait and expected_conditions. For the table you can use a css_selector instead of your XPath.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
time.sleep(2)
WebDriverWait(driver, 50).until(EC.invisibility_of_element_located((By.XPATH, '//img[@src="images/loading.gif"]')))
table = driver.find_element_by_css_selector('.table_asearch.table.table-bordered.table-striped.table-hover.table-condensed')
print(table.text)
driver.quit()
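If you want a DataFrame rather than raw text, one option is to hand the located table's HTML to pandas; run this before the driver.quit() call, while the element is still attached (a small sketch, assuming pandas is installed):
import pandas as pd

df = pd.read_html(table.get_attribute('outerHTML'))[0]
print(df.head())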
Selenium loads the table (which happens fairly quickly) and then assumes it is done, since it's never given a chance to load the table rows (which happens more slowly). One way around this is to repeatedly try to find an element that won't appear until the table has finished loading.
This is far from the most elegant solution (and there are probably Selenium helpers that do it better), but you can wait for the table by checking whether a new table row can be found, and if not, sleeping for 1 second before trying again.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
wvar = 0
while wvar == 0:
    try:
        # try loading one of the elements we want to read
        el = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[3]')
        wvar = 1
    except NoSuchElementException:
        # not loaded yet
        print('table body empty, waiting...')
        time.sleep(1)
print('table loaded!')
#element got loaded; reload the table
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)
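For reference, the same polling can be done with Selenium's built-in WebDriverWait, which retries for you; a sketch reusing the row XPath from above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 30 seconds for a data row to appear, then read the whole body
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.XPATH, '/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[3]'))
)
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)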