I am trying to scrape this website: http://www.infoempleo.com/ofertas-internacionales/.
I want to scrape the results with the "Last 15 days" radio button selected, so I wrote this code:
import time
from collections import deque

from bs4 import BeautifulSoup
from selenium import webdriver

# raw string so the backslashes in the Windows path are not treated as escapes
browser = webdriver.Chrome(r'C:\Users\Junaid\Downloads\chromedriver\chromedriver_win32\chromedriver.exe')
new_urls = deque(['http://www.infoempleo.com/ofertas-internacionales/'])
processed_urls = set()
while len(new_urls):
    print "------ URL LIST -------"
    print new_urls
    print "-----------------------"
    print
    time.sleep(5)
    url = new_urls.popleft()
    processed_urls.add(url)
    try:
        print "----------- Scraping ==>",url
        browser.get(url)
        # the last radio button with this id is "Last 15 days"
        elem = browser.find_elements_by_id("fechapublicacion")[-1]
        if ( elem.is_selected() ):
            print "already selected"
        else:
            elem.click()
        html = browser.page_source
    except:
        print "-------- Failed to Scrape, Moving to Next"
        continue
    soup = BeautifulSoup(html)
I have been able to select the radio button and scrape the first page.
At the bottom of the page there is a list of page links: 1, 2, 3, ...
When moving to the next page, browser.get(url) is called, which resets the radio button to "Any Date" instead of "Last 15 Days". The code then takes the else branch (elem.click()) to select the radio button again, and that opens the first page, which has already been scraped.
Is there a way around this? Any help will be appreciated.
I have found a workaround for this problem. Instead of saving the links to the next pages in a list, I select the next-page button element and call .click() on it. This way browser.get(url) does not need to be called again and the page is not reloaded.
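The workaround can be sketched roughly as follows. This is only an illustration, not the code I actually ran: scrape_by_clicking_next, next_locator, max_pages and pause are hypothetical names, and the "a.next" selector in the docstring is an assumption that has to be adapted to the real page. The driver argument only needs Selenium's find_elements and page_source.

```python
import time


def scrape_by_clicking_next(driver, next_locator, max_pages=50, pause=0.0):
    """Collect page_source for every result page by clicking the next-page
    element instead of calling driver.get() again (which resets the filter).

    next_locator is a (strategy, value) pair as accepted by Selenium's
    find_elements, e.g. ("css selector", "a.next"). That selector is an
    assumption, not taken from the real site.
    """
    pages = []
    for _ in range(max_pages):  # hard cap so a broken selector cannot loop forever
        pages.append(driver.page_source)
        # find_elements returns an empty list when nothing matches,
        # so reaching the last page ends the loop without raising.
        candidates = driver.find_elements(*next_locator)
        if not candidates:
            break
        candidates[0].click()
        time.sleep(pause)  # crude wait; an explicit wait is usually more robust
    return pages
```

With the radio button selected once on the first page, each string in the returned list can be passed to BeautifulSoup as before.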
Related
This is my first time using Selenium. The website I'm scraping doesn't have a next-page button, and the pagination links don't change until you click the "..." link, which then shows the next set of 10 pagination links. How do I loop through the pages by clicking?
I've seen a few answers online, but I couldn't adapt them to my code because the links only come in sets. This is the code:
from selenium.webdriver import Chrome
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
driver_path = r'Projects\Selenium Driver\chromedriver_win32'  # raw string: backslashes are literal
driver = Chrome(executable_path=driver_path)
driver.get('https://business.nh.gov/nsor/search.aspx')
drop_down = driver.find_element(By.ID, 'ctl00_cphMain_lstStates')
select = Select(drop_down)
select.select_by_visible_text('NEW HAMPSHIRE')
driver.find_element(By.ID, 'ctl00_cphMain_btnSubmit').click()
content = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender a')
hrefs = []
for link_el in content:
    href = link_el.get_attribute('href')
    hrefs.append(href)
offenders_href = hrefs[:10]
pagination_links = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender tbody tr td table tbody a')
With your current code, the pagination elements are already captured in the list slice content[10:], and the trailing "..." link simply points to the next page in sequence. Using this fact, we can keep a current-page variable to track the page being visited and use it to pick the right anchor element from content for the next page.
With do-while style loop logic, and reusing your code to scrape the required elements, here is the main code:
from time import sleep

offenders_href = list()
curr_page = 1
while True:
    # find all anchor tags within this table
    content = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender a')
    hrefs = []
    for link_el in content:
        href = link_el.get_attribute('href')
        hrefs.append(href)
    offenders_href += hrefs[:10]
    curr_page += 1
    # find next page element
    for page_elem in content[10:]:
        if page_elem.get_attribute("href").endswith('$'+str(curr_page)+"')"):
            next_page = page_elem
            break
    else:
        # last page reached, break out of while
        break
    print(f'clicking {next_page.text}...')
    next_page.click()
    sleep(1)
I placed this code in a function, launch_click_pages. Launched with your URL, it is able to step through the pages (it kept going, but I stopped it after a few pages):
>>> launch_click_pages('https://business.nh.gov/nsor/search.aspx')
clicking 2...
clicking 3...
clicking 4...
clicking 5...
clicking 6...
clicking 7...
clicking 8...
clicking 9...
clicking 10...
clicking ......
clicking 12...
clicking 13...
clicking 14...
clicking 15...
^C
You can also try executing the postback script directly, e.g. driver.execute_script("javascript:__doPostBack('ctl00$cphMain$gvwOffender','Page$5')"), and you will be taken to the fifth page.
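A small helper can build that script for any page number. This is a minimal sketch: build_postback_script and go_to_page are hypothetical names, and whether the site accepts a jump to a page outside the currently visible set of pagination links has not been verified here.

```python
def build_postback_script(event_target, page_number):
    """Return the JavaScript that asks an ASP.NET WebForms grid to show a page.

    event_target is the control name with '$' separators,
    e.g. 'ctl00$cphMain$gvwOffender' as in the call above.
    """
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    return "__doPostBack('{0}','Page${1}')".format(event_target, page_number)


def go_to_page(driver, event_target, page_number):
    """Run the postback in the browser; driver is a Selenium WebDriver."""
    driver.execute_script(build_postback_script(event_target, page_number))
```

For example, go_to_page(driver, 'ctl00$cphMain$gvwOffender', 5) runs the same postback as the call above, without the optional javascript: prefix.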
I am trying to extract data from this website: https://hk.centanet.com/findproperty/list/transaction?q=xXHFRIuxWUSNboTJYGkUIg. I have the following problems:
When I click a line item, the website navigates to another URL (https://hk.centanet.com/findproperty/transaction-detail/-%E5%BE%A1%E7%9A%87%E5%BA%AD_AJP202209S0604). After I press the browser's back arrow, the website always returns to the first page.
I can't find the href of the line item.
I am using Python and Selenium.
Does anyone have an idea how to solve these problems? Thanks in advance.
========
The program below is the best I have come up with so far to work around this issue:
row_num=1
web_click = "//*[@class='cv-structured-list-item cv-structured-list-item--standard bx--structured-list-row'][{}]/div[1]".format(row_num)
click_date = driver.find_element_by_xpath(web_click)
click_date.click()
time.sleep(3)
driver.back()
time.sleep(3)
link = driver.find_element_by_xpath("//div[@class='el-pagination el-pagination--small']/ul/li[2]")
link.click() # click on "Next Page" link
time.sleep(3)
row_num=2
web_click = "//*[@class='cv-structured-list-item cv-structured-list-item--standard bx--structured-list-row'][{}]/div[1]".format(row_num)
click_date = driver.find_element_by_xpath(web_click)
click_date.click()
time.sleep(3)
driver.back()
time.sleep(3)
link = driver.find_element_by_xpath("//div[@class='el-pagination el-pagination--small']/ul/li[2]")
link.click() # click on "Next Page" link
time.sleep(3)
You need to load each page with driver.get(url). I think this will solve your problem.
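One way to read this suggestion, sketched under assumptions: since the rows expose no href, click each row once only to record the URL it leads to, and later load those URLs with driver.get(url) instead of relying on the back button. collect_detail_urls and the go_to_page callback are hypothetical names; go_to_page(driver, n) stands for whatever pagination clicks bring the list to page n on the real site.

```python
def collect_detail_urls(driver, row_locator, go_to_page, page_number):
    """Record the detail URL behind every row of one result page.

    row_locator is a (strategy, value) pair for find_elements, e.g.
    ("xpath", "//*[@class='...']"); the value must come from the real page.
    go_to_page(driver, n) is a caller-supplied function (an assumption)
    that navigates the list view to page n.
    """
    go_to_page(driver, page_number)
    row_count = len(driver.find_elements(*row_locator))
    urls = []
    for index in range(row_count):
        # Re-locate the rows every time: references found before a
        # navigation become stale afterwards.
        rows = driver.find_elements(*row_locator)
        if index >= len(rows):
            break
        rows[index].click()
        # On a slow page a wait may be needed before reading current_url.
        urls.append(driver.current_url)
        driver.back()
        # back() returns to the first page, so move forward again.
        go_to_page(driver, page_number)
    return urls
```

Once the URLs are collected, each detail page can be scraped with driver.get(url) in a separate loop, so the list view never has to be restored in between.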
the webpage is : https://www.vpgame.com/market/gold?order_type=pro_price&order=desc&offset=0
As you can see, there are 25 items in the selling part of this page. When you click one, it opens a new tab and shows the details of that specific item.
Now I want to write a program that gets those 25 item URLs and saves them in a list. My problem, as you can see in the page inspector, is that their tags are <div> where I would expect <a>, and I also can't find any 'href' attributes related to them.
# using selenium and driver = webdriver.Chrome()
link = driver.find_elements_by_tag_name('a')
link2 = [l.get_attribute('href') for l in link]
I thought I could do it with the code above, but the problem is the one I described. Any suggestions?
It looks like you are trying to scrape a page that is powered by React. There are no href attributes because JavaScript handles all the linking. Your best bet is to use Selenium to click each of the div objects, switch to the newly opened tab, and use something like this code to get the URL of the page it has taken you to:
import time
links = driver.find_elements_by_class_name('card-header')
urls = []
for link in links:
    link.click()  # opens the item in a new tab
    driver.switch_to.window(driver.window_handles[1])  # switch to the new tab
    url = driver.current_url
    urls.append(url)
    driver.close()  # close the new tab
    driver.switch_to.window(driver.window_handles[0])  # back to the main tab
    time.sleep(1)
Note that the code closes the new tab each time and goes back to the main tab. I added time.sleep() so it doesn't go too fast.
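A slightly more defensive variant of the same idea is sketched below. It compares the window handles before and after the click instead of assuming the new tab is always at index 1. collect_urls_from_new_tabs is a hypothetical name, and the driver argument only needs Selenium's window_handles, current_window_handle, switch_to.window, current_url and close.

```python
def collect_urls_from_new_tabs(driver, cards):
    """Click each element in cards, read the URL of the tab it opens,
    close that tab and return to the original one."""
    main_handle = driver.current_window_handle
    urls = []
    for card in cards:
        before = set(driver.window_handles)
        card.click()
        # Any handle that was not there before the click is the new tab.
        opened = [h for h in driver.window_handles if h not in before]
        if not opened:
            # No new tab appeared (yet); skip rather than close the main tab.
            continue
        driver.switch_to.window(opened[0])
        urls.append(driver.current_url)
        driver.close()
        driver.switch_to.window(main_handle)
    return urls
```

It could be called as collect_urls_from_new_tabs(driver, links) with the links list from the code above. If the tab opens slowly, a short wait after the click may still be needed.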
I am trying to scrape a long list of books spread over 10 web pages. When the loop clicks the next > button for the first time, the website displays a login overlay, so Selenium cannot find the target elements.
I have tried several possible solutions:
1. Using some Chrome options.
2. Using try-except to click the X button on the overlay. The overlay appears only once (when next > is clicked for the first time). The problem is that when I put this try-except block at the end of the while True: loop, the loop became infinite, because I use continue in the except clause since I do not want to break out of the loop.
3. Adding popup-blocker extensions to Chrome. They do not work when I run the code, even though I add the extension using options.add_argument('load-extension=' + ExtensionPath).
This is my code:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('disable-avfoundation-overlays')
options.add_argument('disable-internal-flash')
options.add_argument('no-proxy-server')
options.add_argument("disable-notifications")
options.add_argument("disable-popup")
Extension = (r'C:\Users\DELL\AppData\Local\Google\Chrome\User Data\Profile 1\Extensions\ifnkdbpmgkdbfklnbfidaackdenlmhgh\1.1.9_0')
options.add_argument('load-extension=' + Extension)
options.add_argument('--disable-overlay-scrollbar')
driver = webdriver.Chrome(options=options)
driver.get('https://www.goodreads.com/list/show/32339._50_?page=')
wait = WebDriverWait(driver, 2)
review_dict = {'title':[], 'author':[],'rating':[]}
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('table', class_ = 'tableList js-dataTooltip')
while True:
    table = driver.find_element_by_xpath('//*[@id="all_votes"]/table')
    for product in table.find_elements_by_xpath(".//tr"):
        for td in product.find_elements_by_xpath('.//td[3]/a'):
            title = td.text
            review_dict['title'].append(title)
        for td in product.find_elements_by_xpath('.//td[3]/span[2]'):
            author = td.text
            review_dict['author'].append(author)
        for td in product.find_elements_by_xpath('.//td[3]/div[1]'):
            rating = td.text[0:4]
            review_dict['rating'].append(rating)
    # try to close the login overlay
    try:
        close = wait.until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[3]/div/div/div[1]/button')))
        close.click()
    except NoSuchElementException:
        continue
    # go to the next page, or stop when there is none
    try:
        element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'next_page')))
        element.click()
    except TimeoutException:
        break
df = pd.DataFrame.from_dict(review_dict)
df
Any help is appreciated. For example: can I replace the while loop with a for loop that clicks the next > button until the end, where should I put the try-except block that closes the overlay, or is there a Chrome option that disables the overlay?
Thanks in advance
Thank you for sharing your code and the website you are having trouble with. I was able to close the login modal using XPath. I took this on as a challenge and broke the code up into classes: one wraps selenium.webdriver.chrome.webdriver, and the other represents the page you want to scrape ( https://www.goodreads.com/list/show/32339 ). In the following methods, I used the JavaScript call return arguments[0].scrollIntoView(); to scroll to the last book displayed on the page. After that, I was able to click the next button.
def scroll_to_element(self, xpath : str):
    element = self.chrome_driver.find_element(By.XPATH, xpath)
    self.chrome_driver.execute_script("return arguments[0].scrollIntoView();", element)

def get_book_count(self):
    return len(self.chrome_driver.find_elements(By.XPATH, "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr"))

def click_next_page(self):
    # Scroll to last record and click "next page"
    xpath = "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr[{0}]".format(self.get_book_count())
    self.scroll_to_element(xpath)
    self.chrome_driver.find_element(By.XPATH, "//div[@id='all_votes']//div[@class='pagination']//a[@class='next_page']").click()
Once I clicked on the "Next" button, I saw the modal display. I was able to find the xpath for the modal and was able to close the modal.
def is_displayed(self, xpath: str, timeout: int = 5):
    # wait up to `timeout` seconds for the element to be present in the DOM
    try:
        webElement = DriverWait(self.chrome_driver, timeout).until(
            DriverConditions.presence_of_element_located(locator = (By.XPATH, xpath))
        )
        return webElement is not None
    except Exception:
        return False

def is_modal_displayed(self):
    return self.is_displayed("//body[@class='modalOpened']")

def close_modal(self):
    self.chrome_driver.find_element(By.XPATH, "//div[@class='modal__content']//div[@class='modal__close']").click()
    if(self.is_modal_displayed()):
        raise Exception("Modal Failed To Close")
I hope this helps you to solve your problem.
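If you would rather keep your original while True: loop, a smaller change may also be enough. The sketch below is an assumption-laden illustration, not part of the classes above: dismiss_modal_if_present is a hypothetical helper that uses find_elements, which returns an empty list when the overlay is absent, so no try-except and no continue are needed.

```python
def dismiss_modal_if_present(driver, close_locator):
    """Click the modal's close button when it exists; return True if clicked.

    close_locator is a (strategy, value) pair for find_elements. The string
    "xpath" is the value of Selenium's By.XPATH constant.
    """
    for button in driver.find_elements(*close_locator):
        if button.is_displayed():
            button.click()
            return True
    return False
```

Calling it once per iteration, just before locating next_page, for example with ("xpath", "//div[@class='modal__content']//div[@class='modal__close']"), lets the loop carry on whether or not the overlay showed up. The overlay may take a moment to appear after the click on next >, so a short wait before the check might still be required.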
I am using the Firefox browser with Selenium. I am scraping a website that has multiple pages, like a Google search, where you can pick the page at the bottom. On each page, I click an element (again, like a Google result) and scrape data from that element's information page. If I am viewing an element's information from the third page and click the back button in my regular Firefox browser, it goes back to the third page. But when I go back in Selenium with driver.back(), it takes me to the first page. Does anyone know how to fix this?
count = 1
while 1:
    try:
        pages = driver.find_elements_by_css_selector("a.page-number.gradient")
    except:
        break
    for page in pages:
        if page.text==str(count):
            page.click()
            print count
            break
    states = driver.find_elements_by_xpath("//*[@id='table_div']/div/div/table/tbody/tr/td[19]")
    fails = []
    i = 1
    for state in states:
        if state.text == "FAILED":
            fails.append(i)
        i+=1
    for fail in fails:
        print driver.find_element_by_xpath("//*[@id='table_div']/div/div/table/tbody/tr[" + str(fail) + "]/td[19]").text
        driver.find_element_by_xpath("//*[@id='table_div']/div/div/table/tbody/tr[" + str(fail) + "]/td[1]/input").click()
        time.sleep(2)
        errors = driver.find_element_by_name("errors")
        if "\n" in errors.text:
            fixedText = errors.text.split("\n")[0]
            errors.clear()
            errors.send_keys(fixedText)
            time.sleep(1)
            driver.find_element_by_name('post_type').click()
            time.sleep(5)
            driver.switch_to_alert().accept()
            driver.switch_to_alert().accept()
            driver.back()
            driver.back()
        else:
            driver.back()
            driver.switch_to_alert().accept()
            driver.switch_to_alert().accept()
    count+=1
The code is really complicated, but basically it's the driver.back() lines that aren't working