How to collect data in Python using Selenium and geckodriver - python

I tried to scrape data from a website using Selenium with Firefox and geckodriver.
I want to navigate from one page to another and extract information. I am unable to move from the current page to another, force the driver back to the original page, and move on to the next element in the list. I wrote the code below, which clicks on the first element and goes to the specific page to collect data. Thank you
import time
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

binary = FirefoxBinary('/usr/bin/firefox')
driver = webdriver.Firefox(firefox_binary=binary, executable_path=r'/home/twitter/geckodriver')
try:
    driver.get('https://www..........')
    list_element = driver.find_elements_by_xpath("//span[@class='icon icon-email']")
    for element in list_element:
        x = driver.current_url
        element.click()
        time.sleep(5)
        for ex in driver.find_elements_by_xpath('//span[@class="valeur"]'):
            print(ex.text)
        driver.back()
except Exception as e:
    print(e)
driver.quit()

This might happen because your driver/browser didn't pick up the new page (current page).
Add one line after element.click() or time.sleep(5):
driver.switch_to.window(driver.current_window_handle)
then try to run your code again.
Hope this helps you! :)
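For reference, here is a minimal sketch of where that line could sit inside the loop from the question. Note that elements located before navigating away usually go stale after driver.back(), so this sketch also re-locates the list by index on each pass; the selectors and sleep times are copied from the question:

# Sketch only: the suggested switch_to.window call placed in the loop,
# plus re-locating the elements on each pass, because references found
# before navigation typically go stale after driver.back().
count = len(driver.find_elements_by_xpath("//span[@class='icon icon-email']"))
for i in range(count):
    element = driver.find_elements_by_xpath("//span[@class='icon icon-email']")[i]
    element.click()
    time.sleep(5)
    driver.switch_to.window(driver.current_window_handle)  # focus the current window
    for ex in driver.find_elements_by_xpath('//span[@class="valeur"]'):
        print(ex.text)
    driver.back()
    time.sleep(5)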

Related

How to extract the URL

I am trying to extract data from this website - https://hk.centanet.com/findproperty/list/transaction?q=xXHFRIuxWUSNboTJYGkUIg. I have the following problems:
When I click a line item, the website goes to another URL (https://hk.centanet.com/findproperty/transaction-detail/-%E5%BE%A1%E7%9A%87%E5%BA%AD_AJP202209S0604). After pressing the "back" arrow, the website always goes back to the first page.
I can't find the href of the line item.
I am using Python and Selenium.
Does anyone have an idea how to solve these problems? Thanks in advance.
========
The program below is the best workaround I can think of so far... haha
row_num = 1
web_click = "//*[@class='cv-structured-list-item cv-structured-list-item--standard bx--structured-list-row'][{}]/div[1]".format(row_num)
click_date = driver.find_element_by_xpath(web_click)
click_date.click()
time.sleep(3)
driver.back()
time.sleep(3)
link = driver.find_element_by_xpath("//div[@class='el-pagination el-pagination--small']/ul/li[2]")
link.click()  # click on "Next Page" link
time.sleep(3)

row_num = 2
web_click = "//*[@class='cv-structured-list-item cv-structured-list-item--standard bx--structured-list-row'][{}]/div[1]".format(row_num)
click_date = driver.find_element_by_xpath(web_click)
click_date.click()
time.sleep(3)
driver.back()
time.sleep(3)
link = driver.find_element_by_xpath("//div[@class='el-pagination el-pagination--small']/ul/li[2]")
link.click()  # click on "Next Page" link
time.sleep(3)
You need to load each page with driver.get(url). I think this will solve your problem.
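As a rough illustration of that advice, the sketch below reloads the listing URL with driver.get() before clicking each row, instead of relying on driver.back(); the XPaths come from the question, the listing URL is the one linked in the question, and the count of 10 rows per result page is an assumption:

# Sketch only: reload the listing page with driver.get() for every row
# instead of using driver.back(). XPaths are taken from the question;
# the assumption of 10 rows per result page may need adjusting.
listing_url = "https://hk.centanet.com/findproperty/list/transaction?q=xXHFRIuxWUSNboTJYGkUIg"
row_xpath = ("//*[@class='cv-structured-list-item cv-structured-list-item--standard "
             "bx--structured-list-row'][{}]/div[1]")

for row_num in range(1, 11):
    driver.get(listing_url)      # always start from a freshly loaded list page
    time.sleep(3)
    driver.find_element_by_xpath(row_xpath.format(row_num)).click()
    time.sleep(3)
    # ... scrape the transaction-detail page here ...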

I use Selenium (Python) to retrieve some data from WOS top papers, but when I use click() to open the sub-links, I can only open the first URL

My task is to open each URL on the following website and retrieve some evaluation data for each paper. I have located the elements successfully, which means I get 10 elements. However, when Selenium starts imitating a human clicking the URLs, it can only open the first of the ten links.
https://esi.clarivate.com/DocumentsAction.action
The code is as follows.
import time
from selenium import webdriver

driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get('https://esi.clarivate.com/IndicatorsAction.action?Init=Yes&SrcApp=IC2LS&SID=H3-M1jrs4mSS2O3WTFbtdrUJugtDvogGRIM-18x2dx2B1ubex2Bo9Y5F6ZPQtUZbfUAx3Dx3Dp1StTsneXx2B7vu85UqXoaoQx3Dx3D-03Ff2gF3hTJGBPDScD1wSwx3Dx3D-cLUx2FoETAVeN3rTSMreq46gx3Dx3D')

# add filter -> research fields -> "clinical medicine"
target = driver.find_element_by_id("ext-gen1065")
time.sleep(1)
target.click()
time.sleep(1)

n = driver.window_handles
driver.switch_to.window(n[-1])
links = driver.find_elements_by_class_name("docTitle")
length = len(links)

for i in range(0, length):
    item = links[i]
    item.click()
    time.sleep(1)
    handles = driver.window_handles
    index_handle = driver.current_window_handle
    for handle in handles:
        if handle != index_handle:
            driver.switch_to.window(handle)
        else:
            continue
    time.sleep(1)
    u1 = driver.find_elements_by_class_name("large-number")[2].text
    u2 = driver.find_elements_by_class_name("large-number")[3].text
    print(u1, u2)
    print("\n")
    driver.close()
    time.sleep(1)
    driver.switch_to.window(index_handle)

driver.quit()
print("————finished————")
The error page screenshot is omitted. To find the problem, I tested this code:
links=driver.find_elements_by_class_name("docTitle")
length=len(links)
print(length)
print(links[1].text)
#links[0].click()
links[1].click()
The result shows that it had already found the element but failed to open it (when using links[0].text, it works fine).
Any ideas about this?

Python Selenium: stale element reference: element is not attached to the page document

My program is throwing the error message "stale element reference: element is not attached to the page document". When I looked at previous posts (such as Python Selenium stale element fix), I found that I was not updating the URL after calling the click function. I updated the URL; however, it didn't fix the issue. Could anyone point out where I am making a mistake, please? Here is my code:
import time
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-infobars")
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path="path of driver here")
driver.get("https://stackoverflow.com/users/37181/alex-gaynor?tab=topactivity")

if driver.find_elements_by_xpath("//a[@class='grid--cell fc-white js-notice-close']"):
    driver.find_element_by_xpath("//a[@class='grid--cell fc-white js-notice-close']").click()

inner_tabs = driver.find_elements_by_xpath("//div[@class='tabs']//a")
for inner_tab in inner_tabs:
    if inner_tab.text == "answers":
        inner_tab.click()
        time.sleep(3)
        driver.get(driver.current_url)
        continue
    if inner_tab.text == "questions":
        inner_tab.click()
        time.sleep(3)
        driver.get(driver.current_url)
        continue
driver.quit()
When you open a new URL, either by clicking a link or with driver.get(), a new document is created, so the old element references (inner_tab) become invalid. To solve this, first collect all the URLs, then open them in a loop.
urls_to_visit = []
for inner_tab in inner_tabs:
    if inner_tab.text in ["questions", "answers"]:
        urls_to_visit.append(inner_tab.get_attribute("href"))

for url in urls_to_visit:
    driver.get(url)
    time.sleep(3)
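The key point is that inner_tabs is only read before any navigation happens; once the href values are stored as plain strings, later driver.get() calls cannot invalidate them, so no stale references can occur.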
This is one of the most frustrating errors you can get with Selenium.
I recommend trying it like this:
for tab in ['answers', 'questions']:
    js = "window.tab = [...document.querySelectorAll('div.tabs > a')].filter(a => a.innerText === '" + tab + "')[0]"
    driver.execute_script(js)
    driver.execute_script("if(window.tab) window.tab.click()")
    time.sleep(3)
    print(driver.current_url)
By selecting the element inside the browser context, you avoid holding stale references in Python.

Unable to get text using selenium web driver while iterating over multiple links stored in an array

This is a continuation of my previous question "Web scraping using selenium and beautifulsoup.. trouble in parsing and selecting button". I was able to solve the previous problem, but now I am stuck on the issue below.
I have the links from before stored in an array.
Now I am trying to visit all the links stored in a list named StartupLink.
The information I need to scrape and store in an array is in the div class="content" tag. For some links, that div contains a div "hidden_more" with JavaScript-enabled click events, so I handle that case with an exception. The loop runs fine and visits the links, but after the first two links it gives "NA" output even though the div "content" tag is present, and it shows no error (which is unacceptable).
The array contains 400 links to visit, each with a similar div "content" element.
Where am I going wrong here?
from time import sleep
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Description = []
driver = webdriver.Chrome()
for link in StartupLink:
    try:
        driver.get(link)
        sleep(5)
        more = driver.find_element_by_xpath('//a[@class="hidden_more"]')
        element = WebDriverWait(driver, 10).until(EC.visibility_of(more))
        sleep(5)
        element.click()
        sleep(5)
        page = driver.find_element_by_xpath('//div[@class="content"]').text
        sleep(5)
    except Exception as e:  # NoSuchElementException
        driver.start_session()
        sleep(5)
        page = driver.find_element_by_xpath('//div[@class="content"]').text
        sleep(5)
        print(str(e))
    if page == '':
        page = "NA"
        Description.append(page)
    else:
        Description.append(page)
    print(page)

Waiting for page to load selenium, Python. All the pages have the same structure

I am trying to scrape some data using selenium and python. I have a list with some links and I have to go through every link. What I do now is the following:
for link in links:
    self.page_driver.get(link)
    time.sleep(5)
    # scrape data
It works just fine; the problem is that I have a lot of links, and waiting 5 seconds for each one is a waste of time. That's why I decided to try something like this:
self.driver.get(link)
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'cell-box'))
    WebDriverWait(self.driver, 10).until(element_present)
except TimeoutException:
    logging.info("Timed out waiting for page to load")
The problem is that every link has the exact same structure inside, only the data changes, so the element is found even if the page hasn't actually changed yet. What I would like to do is save the name of the product on the current page in a variable, change page, and wait until the name of the product is different from the saved one, which means the new page has loaded. Any help would be really appreciated.
You can add the staleness_of Expected Condition
wait = WebDriverWait(self.driver, 10)
element = None
for link in links:
    self.page_driver.get(link)
    if element is not None:
        wait.until(EC.staleness_of(element))
    try:
        element = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'cell-box')))
    except TimeoutException:
        logging.info("Timed out waiting for page to load")
    # scrape data
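The idea is that the 'cell-box' element found on the previous page detaches from the DOM as soon as the browser navigates away, so waiting for it to become stale before locating the new one guarantees you are looking at the freshly loaded page rather than the old, identically structured one.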
