How to extract the URL - Python

I tried to extract data from this website - https://hk.centanet.com/findproperty/list/transaction?q=xXHFRIuxWUSNboTJYGkUIg - and I have the following problems:
When I click a line item, the website navigates to another URL (https://hk.centanet.com/findproperty/transaction-detail/-%E5%BE%A1%E7%9A%87%E5%BA%AD_AJP202209S0604), and after pressing the "back" arrow, the website always goes to the first page.
I can't find the href of the line item.
I use Python and Selenium.
Does anyone have an idea how to solve these problems? Thanks in advance.
========
The program below is as far as I have gotten with this issue so far.
import time

# driver is assumed to be an already-initialized WebDriver on the listing page
row_num = 1
web_click = "//*[@class='cv-structured-list-item cv-structured-list-item--standard bx--structured-list-row'][{}]/div[1]".format(row_num)
click_date = driver.find_element_by_xpath(web_click)
click_date.click()  # open the detail page of row 1
time.sleep(3)
driver.back()       # always lands back on the first page
time.sleep(3)
link = driver.find_element_by_xpath("//div[@class='el-pagination el-pagination--small']/ul/li[2]")
link.click()  # click on "Next Page" link
time.sleep(3)

row_num = 2
web_click = "//*[@class='cv-structured-list-item cv-structured-list-item--standard bx--structured-list-row'][{}]/div[1]".format(row_num)
click_date = driver.find_element_by_xpath(web_click)
click_date.click()  # open the detail page of row 2
time.sleep(3)
driver.back()
time.sleep(3)
link = driver.find_element_by_xpath("//div[@class='el-pagination el-pagination--small']/ul/li[2]")
link.click()  # click on "Next Page" link
time.sleep(3)

You need to load each page directly with driver.get(url) instead of relying on the back button. I think this will solve your problem.
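For illustration, here is a minimal sketch of that idea: click each row, record driver.current_url, and return to the listing with driver.get() instead of driver.back(). The XPaths are reused from the question; the timings and the assumption that the detail page opens in the same tab are untested.

import time
from selenium import webdriver

driver = webdriver.Chrome()
listing_url = "https://hk.centanet.com/findproperty/list/transaction?q=xXHFRIuxWUSNboTJYGkUIg"
driver.get(listing_url)
time.sleep(3)

row_xpath = ("//*[@class='cv-structured-list-item cv-structured-list-item--standard "
             "bx--structured-list-row']")
detail_urls = []
num_rows = len(driver.find_elements_by_xpath(row_xpath))
for row_num in range(1, num_rows + 1):
    # re-locate the row on every pass: the DOM is rebuilt after each navigation
    driver.find_element_by_xpath("({})[{}]/div[1]".format(row_xpath, row_num)).click()
    time.sleep(3)
    detail_urls.append(driver.current_url)  # the detail URL that has no href to read
    driver.get(listing_url)                 # return by URL instead of driver.back()
    time.sleep(3)

# visit each detail page directly, with no back button involved
for url in detail_urls:
    driver.get(url)
    time.sleep(3)
    # ... scrape the detail page here ...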

Related

How to find the URL of some elements of a webpage?

The webpage is: https://www.vpgame.com/market/gold?order_type=pro_price&order=desc&offset=0
As you can see, there are 25 items in the selling part of this page; when you click one, it opens a new tab and shows you that specific item's details.
Now I want to write a program that gets those 25 item URLs and saves them in a list. My problem is that, as you can see in the page inspector, their tags are <div> elements where I would expect <a> elements, and I can't find any 'href' attribute related to them.
# using selenium and driver = webdriver.Chrome()
link = driver.find_elements_by_tag_name('a')
link2 = [l.get_attribute('href') for l in link]
I thought I could do it with the code above, but the problem is what I said. Any suggestions?
Looks like you are trying to scrape a page that is powered by React. There are no href attributes because JavaScript is powering all the linking. Your best bet is to use Selenium to execute a click on each of the div objects, switch to the newly opened tab, and use something like this code to get the URL of the page it has taken you to:
import time

links = driver.find_elements_by_class_name('card-header')
urls = []
for link in links:
    link.click()  # opens the item in a new tab
    driver.switch_to.window(driver.window_handles[1])
    urls.append(driver.current_url)
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
    time.sleep(1)
Note that the code closes the new tab each time and goes back to the main tab. I added time.sleep() so it doesn't go too fast.
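If the fixed sleep ever proves flaky, one refinement (my sketch, not part of the original answer) is to wait explicitly until the new tab exists before switching to it:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

links = driver.find_elements_by_class_name('card-header')
urls = []
for link in links:
    link.click()
    # block until the second window handle actually appears (up to 10 s)
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    driver.switch_to.window(driver.window_handles[1])
    urls.append(driver.current_url)
    driver.close()
    driver.switch_to.window(driver.window_handles[0])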

How to stop clicking the same button when the button always exists, with Selenium

I've run into an issue: previously, when I scraped multiple pages with Selenium, I just clicked the next-page button and used NoSuchElementException to stop.
But on the URL I'm facing now, the element always exists: on the last page, clicking the next-page button just reloads the current page.
Can anyone help me figure out how to stop clicking the same button?
items = driver.find_elements_by_class_name('item')
while True:
    try:
        # click next page
        driver.find_element_by_link_text('下一页').click()
        sleep(5)
        # scrape data here
        items = driver.find_elements_by_class_name('item')
        for i in range(0, len(items)):
            results.append(items[i])
            print(items[i])
    except NoSuchElementException:
        break
For the page details you can check the picture below.
[Picture: the pager on the last page, where the next-page link is rendered as <a class='disable'>…</a>]
[Edited]
You can solve it by comparing the current page URL with the URL after clicking the next-page link.
If the URL after the click matches the previous page's URL, you are on the last page; otherwise, continue scraping.
Keep a variable that stores the current page URL; when you click the next-page link with Selenium, read the new page URL and compare it with the previous one.
This is what I mean:
url = "https://humkinar.com.pk/"
driver.get(url)
items=driver.find_elements_by_class_name('item')
current_page_url = ""
prev_page_url = url
while True:
try:
driver.find_element_by_link_text('下一页').click()
current_page_url = driver.current_url
if current_page_url != prev_page_url:
time.sleep(5)
items=driver.find_elements_by_class_name('item')
for i in range(0, len(items)):
results.append(items[i])
print(items[i])
prev_page_url = current_page_url
else:
break
except NoSuchElementException:
break
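One caveat with this approach (my note, not the answerer's): driver.current_url is read immediately after the click, possibly before the next page has loaded, so the comparison can race. A sketch of a safer variant waits briefly for the URL to change:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_link_text('下一页').click()
try:
    # give the browser up to 5 s to navigate away from the previous URL
    WebDriverWait(driver, 5).until(EC.url_changes(prev_page_url))
except TimeoutException:
    pass  # the URL never changed, so treat this as the last page
current_page_url = driver.current_url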
As I see in the picture (I suppose the picture you shared is of the last page), check for class == 'disable' on the <a class='disable'>…</a> next-page link and break.
UPDATE:
items = driver.find_elements_by_class_name('item')
while True:
    try:
        # locate the next-page link and stop once it has been disabled
        next_link = driver.find_element_by_link_text('下一页')
        if next_link.get_attribute('class') == 'disable':
            break  # last page reached
        next_link.click()
        sleep(5)
        # scrape data here
        items = driver.find_elements_by_class_name('item')
        for i in range(0, len(items)):
            results.append(items[i])
            print(items[i])
    except NoSuchElementException:
        break
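One detail worth hedging: the class attribute can hold several space-separated names (e.g. class='btn disable'), in which case the exact == 'disable' comparison misses it. A safer check might be:

def next_is_disabled(driver):
    # split the class attribute so 'disable' is found among multiple names
    link = driver.find_element_by_link_text('下一页')
    return 'disable' in (link.get_attribute('class') or '').split()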

Python Selenium: stale element reference: element is not attached to the page document

My program is throwing the error message "Stale element reference: element is not attached to the page document". When I looked at previous posts (such as Python Selenium stale element fix), I found that I was not updating the URL after calling the click function. I updated the URL; however, it didn't fix the issue. Could anyone point out where I am making a mistake? Here is my code:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-infobars")
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path="path of driver here")
driver.get("https://stackoverflow.com/users/37181/alex-gaynor?tab=topactivity")
if driver.find_elements_by_xpath("//a[@class='grid--cell fc-white js-notice-close']"):
    driver.find_element_by_xpath("//a[@class='grid--cell fc-white js-notice-close']").click()
inner_tabs = driver.find_elements_by_xpath("//div[@class='tabs']//a")
for inner_tab in inner_tabs:
    if inner_tab.text == "answers":
        inner_tab.click()
        time.sleep(3)
        driver.get(driver.current_url)
        continue
    if inner_tab.text == "questions":
        inner_tab.click()
        time.sleep(3)
        driver.get(driver.current_url)
        continue
driver.quit()
When you open a new URL by clicking a link or calling driver.get(), the browser creates a new document, so the old element references (inner_tab) become invalid. To solve this, first collect all the URLs, then open them in a loop.
urls_to_visit = []
for inner_tab in inner_tabs:
    if inner_tab.text in ["questions", "answers"]:
        urls_to_visit.append(inner_tab.get_attribute("href"))

for url in urls_to_visit:
    driver.get(url)
    time.sleep(3)
This is one of the most frustrating errors you can get with Selenium.
I recommend trying it like this:
for tab in ['answers', 'questions']:
    js = "window.tab = [...document.querySelectorAll('div.tabs > a')].filter(a => a.innerText === '" + tab + "')[0]"
    driver.execute_script(js)
    driver.execute_script("if(window.tab) window.tab.click()")
    time.sleep(3)
    print(driver.current_url)
By selecting the elements inside the browser context, you can avoid the stale references.

How to collect data in Python using Selenium and geckodriver

I tried to scrape data from a website using Selenium with Firefox and geckodriver.
I want to navigate from one page to another and extract information: move from the current page to a detail page, force the driver back to the original page, and move on to the next element in the list, but I am unable to get past the first element. I wrote the code below, which clicks on the first element and goes to its specific page to collect data. Thank you.
binary = FirefoxBinary('/usr/bin/firefox')
driver = webdriver.Firefox(firefox_binary=binary, executable_path=r'/home/twitter/geckodriver')
try:
    driver.get('https://www..........')
    list_element = driver.find_elements_by_xpath("//span[@class='icon icon-email']")
    for element in list_element:
        x = driver.current_url
        element.click()
        time.sleep(5)
        for ex in driver.find_elements_by_xpath('//span[@class = "valeur"]'):
            print(ex.text)
        driver.back()
except Exception as e:
    print(e)
driver.quit()
This might happen because your driver/browser didn't pick up the new page (the current page).
Add one line after element.click() or time.sleep(5):
driver.switch_to.window(driver.current_window_handle)
then try to run your code again.
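If that alone doesn't help, another likely culprit here (my reading, not confirmed in the thread) is that driver.back() rebuilds the page, so every element found before the first click goes stale on the next loop pass. A sketch that re-locates the elements by index on each iteration:

import time
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

binary = FirefoxBinary('/usr/bin/firefox')
driver = webdriver.Firefox(firefox_binary=binary, executable_path=r'/home/twitter/geckodriver')
try:
    driver.get('https://www..........')  # URL elided in the original question
    count = len(driver.find_elements_by_xpath("//span[@class='icon icon-email']"))
    for i in range(count):
        # re-find the list after every driver.back(): old references are stale
        elements = driver.find_elements_by_xpath("//span[@class='icon icon-email']")
        elements[i].click()
        time.sleep(5)
        for ex in driver.find_elements_by_xpath('//span[@class="valeur"]'):
            print(ex.text)
        driver.back()
        time.sleep(5)
except Exception as e:
    print(e)
driver.quit()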
Hope this helps you! :)

How to save the state of a Selenium web driver in Python?

I am trying to scrape this website: http://www.infoempleo.com/ofertas-internacionales/.
I wanted to scrape with the "Last 15 days" radio button selected, so I wrote this code.
browser = webdriver.Chrome('C:\Users\Junaid\Downloads\chromedriver\chromedriver_win32\chromedriver.exe')
new_urls = deque(['http://www.infoempleo.com/ofertas-internacionales/'])
processed_urls = set()
while len(new_urls):
    print "------ URL LIST -------"
    print new_urls
    print "-----------------------"
    print
    time.sleep(5)
    url = new_urls.popleft()
    processed_urls.add(url)
    try:
        print "----------- Scraping ==>", url
        browser.get(url)
        elem = browser.find_elements_by_id("fechapublicacion")[-1]
        if elem.is_selected():
            print "already selected"
        else:
            elem.click()
        html = browser.page_source
    except:
        print "-------- Failed to Scrape, Moving to Next"
        continue
    soup = BeautifulSoup(html)
I have been able to select the radio button and scrape the first page.
There is a list of pages at the end, like 1, 2, 3...
When moving to the next page, browser.get(url) is called, which resets the radio button to 'Any Date' instead of 'Last 15 Days'. That makes the code execute the else branch, else: elem.click(), to select the radio button again, which opens the first page that has already been scraped.
Is there a way around this? Help will be appreciated.
I have found a workaround for this problem. Instead of saving links to the next pages in a list, I select the next-page button/element and call .click() on it. This way browser.get(url) does not need to be called again and the page is not reloaded.
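A minimal sketch of that workaround, assuming the pager exposes a clickable next link (the 'Siguiente' link text is my guess for this Spanish-language site; inspect the real pager for its locator):

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome(r'C:\Users\Junaid\Downloads\chromedriver\chromedriver_win32\chromedriver.exe')
browser.get('http://www.infoempleo.com/ofertas-internacionales/')

# select the "Last 15 days" radio button once; clicking the pager preserves it
elem = browser.find_elements_by_id("fechapublicacion")[-1]
if not elem.is_selected():
    elem.click()

while True:
    soup = BeautifulSoup(browser.page_source)
    # ... scrape the current page from soup here ...
    try:
        # hypothetical locator: replace 'Siguiente' with the real next-page link text
        browser.find_element_by_link_text('Siguiente').click()
        time.sleep(5)
    except NoSuchElementException:
        break  # no next-page link left, so we are done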
