Can't print in Python

I've been studying Python for the last few weeks to automate some work for my business.
Basically I have to do web scraping, but I'm having trouble with the print call on the next-to-last line of the code...
def search_time(self):
    for item in self.items:
        print(f"Searching for {item}.")
        self.driver.get(self.bot_url)
        cpfBox = self.driver.find_element_by_xpath('//*[@id="search"]/div/div[1]/input')
        cpfBox.send_keys(item)
        time.sleep(2)
        cpfButton = self.driver.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
        cpfButton.click()
        time.sleep(2)
        self.delay = 3  # seconds
        try:
            WebDriverWait(self.driver, self.delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="main"]/div[1]/h2')))
            print('Valid CPF')
        except TimeoutException:
            print('Invalid CPF')
        time.sleep(2)
        name = self.driver.find_element_by_xpath("/html/body/main[1]/div[1]/div[1]/div[1]/div[1]/h2").text
        print(name)
        time.sleep(2)

items = ["32911769953"]
bot_url = BOT(items)
bot_url.search_time()

Check whether the XPath is indeed the correct path to the text you want to get through Selenium.
If this is a website, you can go to the element you are trying to find in the browser's developer tools, right-click it, and select Copy XPath.
try:
    WebDriverWait(self.driver, self.delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="main"]/div[1]/h2')))
    name = self.driver.find_element_by_xpath("/html/body/main[1]/div[1]/div[1]/div[1]/div[1]/h2").text
    print(name)
    time.sleep(2)
    print('Valid CPF')
except TimeoutException:
    print('Invalid CPF')
    time.sleep(2)

First, I would recommend generalizing your XPath query. Currently, it is heavily dependent on the entire page's structure: a small layout change in an unrelated part of the page could give you undesirable results. Reviewing XPath syntax, in particular the use of // and predicates, will save you many headaches in the future.
If that doesn't help, please post the HTML that you are attempting to parse.
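For example (a sketch on my part; the id "main" is taken from your wait call above), anchoring the query on a stable id and letting // skip the intermediate layers is far less brittle than an absolute path:

# Brittle: depends on every ancestor element between /html/body and the h2
name = self.driver.find_element_by_xpath("/html/body/main[1]/div[1]/div[1]/div[1]/div[1]/h2").text

# More robust: anchor on the stable id and search relative to it
name = self.driver.find_element_by_xpath('//*[@id="main"]//h2').text
print(name)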

Related

scraping a page that updates using python selenium

It has now been weeks that I have been trying to scrape all the information from this website. The website is the profile of a company, and I'm trying to get all the information in the id="profile-basic" section and the id="profile-addresses" section.
I am looping through 1000 of these profiles and this is only one of them. The reason why I'm not showing the code is that it's very basic and doesn't affect my question; for those who want to know, it's just a simple for loop that goes through a list one by one.
The problem is that a lot of the elements on the page don't appear in some profiles but do in others. I tried solving that by writing down the XPath of every possible element and then using try: to check all of them, and it worked just fine. The only problem was that a given XPath did not always point at the same piece of information: for example, the XPath for the address could be //*[@id="profile-addresses"]/div/div/div/div[1]/p, but sometimes it could be //*[@id="profile-addresses"]/div/div/div/div[2]/p, or many other XPaths. Since I'm trying to put the address inside the address variable, it is impossible to tell which XPath will be the address on a given page.
I tried using this code:
names = {"آدرس تولیدی :": "Address", "آدرس دفتر :": "Address", "تلفن :": "Phone2",
         "تعداد پرسنل :": "StaffNumber", "کدپستی :": "PostalCode", "توضیحات :": "Description2"}
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[1]/span').text
    _1 = names.get(e)
    __1 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[1]/p').text
    exec(f"global {_1}\n{_1} = Smalify('{__1}')")
except:
    pass
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[2]/span').text
    _2 = names.get(e)
    __2 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[2]/p').text
    exec(f"global {_2}\n{_2} = Smalify('{__2}')")
except:
    pass
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[3]/span').text
    _3 = names.get(e)
    __3 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[3]/p').text
    exec(f"global {_3}\n{_3} = Smalify('{__3}')")
except:
    pass
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[4]/span').text
    _4 = names.get(e)
    __4 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[4]/p').text
    exec(f"global {_4}\n{_4} = Smalify('{__4}')")
except:
    pass
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[5]/span').text
    _5 = names.get(e)
    __5 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[5]/p').text
    exec(f"global {_5}\n{_5} = Smalify('{__5}')")
except:
    pass
The code above reads the span in front of the main element, then finds the matching variable name in the names dictionary, and sets the value of the main element to that variable name using the exec() function.
This code did not work at all, for two reasons: A) it always returned None, even when it could find the elements; B) it took way too long.
I was wondering if there is any way other than my code to do this efficiently.
You can always try to search by ID rather than XPath. Since the XPath varies between pages, try to find something that is static, such as an ID name.
There is more information about the different ways you can locate specific HTML elements using Selenium at this link. I definitely recommend checking it out.
Here is an example of searching for your elements by their IDs:
browser.find_element(By.ID, "profile-addresses").text
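Building on that example (a sketch, assuming each row under profile-addresses holds one span label and one p value, as your XPaths suggest), you can locate the container once, iterate its rows, and fill an ordinary dict keyed through your names mapping instead of using exec():

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

data = {}
container = browser.find_element(By.ID, "profile-addresses")
for row in container.find_elements(By.XPATH, "./div/div/div/div"):
    try:
        label = row.find_element(By.TAG_NAME, "span").text
        value = row.find_element(By.TAG_NAME, "p").text
    except NoSuchElementException:
        continue  # skip rows without a span/p pair
    key = names.get(label)
    if key:
        data[key] = Smalify(value)  # Smalify as in your snippet

Because rows are matched by their label rather than their index, it no longer matters which div[n] holds the address on a given page, and a single pass replaces all five try blocks.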
Good luck!

Blocking login overlay window when scraping web page using Selenium

I am trying to scrape a long list of books across 10 web pages. When the loop clicks the next > button for the first time, the website displays a login overlay, so Selenium cannot find the target elements.
I have tried all the possible solutions:
Use some Chrome options.
Use try-except to click the X button on the overlay. But it appears only once (when clicking next > for the first time). The problem is that when I put this try-except block at the end of the while True: loop, it became infinite, as I use continue in the except block because I do not want to break the loop.
Add some popup-blocker extensions to Chrome, but they do not work when I run the code, although I add the extension using options.add_argument('load-extension=' + ExtensionPath).
This is my code:
options = Options()
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('disable-avfoundation-overlays')
options.add_argument('disable-internal-flash')
options.add_argument('no-proxy-server')
options.add_argument("disable-notifications")
options.add_argument("disable-popup")
Extension = (r'C:\Users\DELL\AppData\Local\Google\Chrome\User Data\Profile 1\Extensions\ifnkdbpmgkdbfklnbfidaackdenlmhgh\1.1.9_0')
options.add_argument('load-extension=' + Extension)
options.add_argument('--disable-overlay-scrollbar')

driver = webdriver.Chrome(options=options)
driver.get('https://www.goodreads.com/list/show/32339._50_?page=')
wait = WebDriverWait(driver, 2)
review_dict = {'title': [], 'author': [], 'rating': []}
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('table', class_='tableList js-dataTooltip')

while True:
    table = driver.find_element_by_xpath('//*[@id="all_votes"]/table')
    for product in table.find_elements_by_xpath(".//tr"):
        for td in product.find_elements_by_xpath('.//td[3]/a'):
            title = td.text
            review_dict['title'].append(title)
        for td in product.find_elements_by_xpath('.//td[3]/span[2]'):
            author = td.text
            review_dict['author'].append(author)
        for td in product.find_elements_by_xpath('.//td[3]/div[1]'):
            rating = td.text[0:4]
            review_dict['rating'].append(rating)
    try:
        close = wait.until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[3]/div/div/div[1]/button')))
        close.click()
    except NoSuchElementException:
        continue
    try:
        element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'next_page')))
        element.click()
    except TimeoutException:
        break

df = pd.DataFrame.from_dict(review_dict)
df
Any help would be appreciated: can I change the loop so that a for loop clicks the next > button until the end rather than using a while loop, where should I put the try-except block that closes the overlay, or is there a Chrome option that can disable the overlay?
Thanks in advance.
Thank you for sharing your code and the website that you are having trouble with. I was able to close the login modal by using XPath. I took this challenge and broke up the code using class objects: one object is for the selenium.webdriver.chrome.webdriver, and the other object is for the page that you wanted to scrape the data from (https://www.goodreads.com/list/show/32339). In the following methods, I used the JavaScript return arguments[0].scrollIntoView(); call to scroll to the last book displayed on the page. After I did that, I was able to click the next button:
def scroll_to_element(self, xpath: str):
    element = self.chrome_driver.find_element(By.XPATH, xpath)
    self.chrome_driver.execute_script("return arguments[0].scrollIntoView();", element)

def get_book_count(self):
    return self.chrome_driver.find_elements(By.XPATH, "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr").__len__()

def click_next_page(self):
    # Scroll to last record and click "next page"
    xpath = "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr[{0}]".format(self.get_book_count())
    self.scroll_to_element(xpath)
    self.chrome_driver.find_element(By.XPATH, "//div[@id='all_votes']//div[@class='pagination']//a[@class='next_page']").click()
Once I clicked on the "Next" button, I saw the modal display. I was able to find the XPath for the modal and close it.
def is_displayed(self, xpath: str, timeout: int = 5):
    try:
        webElement = DriverWait(self.chrome_driver, timeout).until(
            DriverConditions.presence_of_element_located(locator=(By.XPATH, xpath))
        )
        return webElement is not None
    except:
        return False

def is_modal_displayed(self):
    return self.is_displayed("//body[@class='modalOpened']")

def close_modal(self):
    self.chrome_driver.find_element(By.XPATH, "//div[@class='modal__content']//div[@class='modal__close']").click()
    if self.is_modal_displayed():
        raise Exception("Modal Failed To Close")
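For reference, the imports behind DriverWait and DriverConditions are not shown above; my assumption is that they are aliases along these lines:

from selenium.webdriver.support.ui import WebDriverWait as DriverWait
from selenium.webdriver.support import expected_conditions as DriverConditions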
I hope this helps you to solve your problem.

Can't find the element using xpath and I'm sure it exists before the driver looks for it

I am trying to download Excel files from a website using Selenium in headless mode. While it's working perfectly fine in most cases, there are a few cases (some months of a year) where driver.find_element_by_xpath() fails to work as expected. I have been through many posts and thought that the element might not have appeared by the time the driver was looking for it, but that isn't the case, as I thoroughly checked it and also tried to slow down the process using time.sleep(). On a side note, I also use driver.implicitly_wait() to make things easier, as the website actually takes a while to load content on the page. I couldn't use requests because it doesn't show any data in the response to the GET request. My script is as follows:
from selenium import webdriver
import datetime
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
import os
import shutil
import time
import calendar

currentdir = os.path.dirname(__file__)
Initial_path = 'whateveritis'
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_experimental_option("prefs", {
    "download.default_directory": f"{Initial_path}",
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "safebrowsing.enabled": True
})

def save_hist_data(year, months):
    def waitUntilDownloadCompleted(maxTime=1200):
        driver.execute_script("window.open()")
        # switch to new tab
        driver.switch_to.window(driver.window_handles[-1])
        # navigate to chrome downloads
        driver.get('chrome://downloads')
        # define the endTime
        endTime = time.time() + maxTime
        while True:
            try:
                # get the download percentage
                downloadPercentage = driver.execute_script(
                    "return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('#progress').value")
                # check if downloadPercentage is 100 (otherwise the script will keep waiting)
                if downloadPercentage == 100:
                    # exit the method once it's completed
                    return downloadPercentage
            except:
                pass
            # wait for 1 second before checking the percentage next time
            time.sleep(1)
            # exit method if the download not completed with in MaxTime.
            if time.time() > endTime:
                break

    starts_on = 1
    for month in months:
        no_month = datetime.datetime.strptime(month, "%b").month
        no_of_days = calendar.monthrange(year, no_month)[1]
        print(f"{no_of_days} days in {month}-{year}")
        driver = webdriver.Chrome(executable_path="whereeveritexists", options=chrome_options)
        driver.maximize_window()  # For maximizing window
        driver.implicitly_wait(20)
        driver.get("https://www.iexindia.com/marketdata/areaprice.aspx")
        select = Select(driver.find_element_by_name('ctl00$InnerContent$ddlPeriod'))
        select.select_by_visible_text('-Select Range-')
        driver.find_element_by_xpath("//input[@name='ctl00$InnerContent$calFromDate$txt_Date']").click()
        select = Select(driver.find_element_by_xpath("//td[@class='scwHead']/select[@id='scwYears']"))
        select.select_by_visible_text(str(year))
        select = Select(driver.find_element_by_xpath("//td[@class='scwHead']/select[@id='scwMonths']"))
        select.select_by_visible_text(month)
        # PROBLEM IS WITH THIS BLOCK
        test = None
        while not test:
            try:
                driver.find_element_by_xpath(f"//td[@class='scwCells' and contains(text(),'{starts_on}')]").click()
                test = True
            except IndentationError:
                print('Entered except block -IE')
                driver.find_element_by_xpath(f"//td[@class='scwCellsWeekend' and contains(text(), '{starts_on}')]").click()
                test = True
            except:
                print('Entered except block -IE-2')
                driver.find_element_by_xpath(f"//td[@class='scwInputDate' and contains(text(), '{starts_on}')]").click()
                test = True
        driver.find_element_by_xpath("//input[@name='ctl00$InnerContent$calToDate$txt_Date']").click()
        select = Select(driver.find_element_by_xpath("//td[@class='scwHead']/select[@id='scwYears']"))
        select.select_by_visible_text(str(year))
        select = Select(driver.find_element_by_xpath("//td[@class='scwHead']/select[@id='scwMonths']"))
        select.select_by_visible_text(month)
        # PROBLEM IS WITH THIS BLOCK
        test = None
        while not test:
            try:
                driver.find_element_by_xpath(f"//td[@class='scwCells' and contains(text(), '{no_of_days}')]").click()
                # time.sleep(4)
                test = True
            except IndentationError:
                print('Entered except block -IE')
                driver.find_element_by_xpath(f"//td[@class='scwCellsWeekend' and contains(text(), '{no_of_days}')]").click()
                # time.sleep(4)
                test = True
            except:
                # time.sleep(2)
                driver.find_element_by_xpath(f"//td[@class='scwInputDate' and contains(text(), '{no_of_days}')]").click()
                test = True
        driver.find_element_by_xpath("//input[@name='ctl00$InnerContent$btnUpdateReport']").click()
        driver.find_element_by_xpath("//a[@title='Export drop down menu']").click()
        print("Right before excel button click")
        driver.find_element_by_xpath("//a[@title='Excel']").click()
        waitUntilDownloadCompleted(180)
        print("After the download potentially!")
        filename = max([Initial_path + f for f in os.listdir(Initial_path)], key=os.path.getctime)
        shutil.move(filename, os.path.join(Initial_path, f"{month}{year}.xlsx"))
        driver.quit()

def main():
    # years = list(range(2013,2015))
    # months = ['Jan', 'Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
    # for year in years:
    #     try:
    save_hist_data(2018, ['Mar'])
    #     except:
    #         pass

if __name__ == '__main__':
    main()
The while loops are basically being used to select the date element on the calendar (month and year are already being selected from the drop-downs). Because the website uses different tags depending on whether the date falls on a weekday or a weekend, I used try and except blocks to try all possible XPaths, but the weird thing is that some months of a year simply don't work as expected. This is the link, by the way: https://www.iexindia.com/marketdata/areaprice.aspx. Especially in the case of Mar-2018: searching for the XPaths manually in the Chrome browser works and locates 31st of Mar-2018, but when the Python script is executed it throws an error saying
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//td[@class='scwInputDate' and contains(text(), '31')]"}
(Session info: headless chrome=84.0.4147.105)
The issue is with the except exception handling. As per your code block, the element is first searched for with //td[@class='scwCells' and contains(text(), '{no_of_days}')]. Since 31st March falls on a weekend, its class is scwCellsWeekend, so this element is not found.
The first except handles an IndentationError. Since "element not found" is not an IndentationError, execution moves on to the next except handler.
Since no condition is mentioned for the second except, the NoSuchElementException is handled inside it. As per the code given here, it then tries to search for an element with the XPath //td[@class='scwInputDate' and contains(text(), '31')], which again cannot be found; as a result, you are getting a NoSuchElementException.
Instead of using so many exception-handling scenarios, you can combine the alternatives with the XPath union operator | as below:
driver.find_element_by_xpath(f"//td[@class='scwCellsWeekend' and contains(text(), '{no_of_days}')] | //td[@class='scwCells' and contains(text(), '{no_of_days}')] | //td[@class='scwInputDate' and contains(text(), '{no_of_days}')]").click()
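Another option (a sketch that relies only on the class names from your code, and on the fact that 'scwCellsWeekend' contains 'scwCells' as a substring) is a single predicate using contains() on the class attribute:

driver.find_element_by_xpath(f"//td[(contains(@class, 'scwCells') or @class='scwInputDate') and contains(text(), '{no_of_days}')]").click()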

Waiting for page to load selenium, Python. All the pages have the same structure

I am trying to scrape some data using selenium and python. I have a list with some links and I have to go through every link. What I do now is the following:
for link in links:
    self.page_driver.get(link)
    time.sleep(5)
    # scrape data
It works just fine; the problem is that I have a lot of links, and waiting 5 seconds for each one is a waste of time. That's why I decided to try something like:
self.driver.get(link)
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'cell-box'))
    WebDriverWait(self.driver, 10).until(element_present)
except TimeoutException:
    logging.info("Timed out waiting for page to load")
The problem is that every link has the exact same structure inside, only the data changes, so the element is found even if the page hasn't changed yet. What I would like to do is save the name of the product on the current page in a variable, change page, and wait until the name of the product differs from the one saved, which means the new page has loaded. Any help would be really appreciated.
You can add the staleness_of expected condition:
wait = WebDriverWait(self.driver, 10)
element = None
for link in links:
    self.page_driver.get(link)
    if element is not None:
        wait.until(EC.staleness_of(element))
    try:
        element = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'cell-box')))
    except TimeoutException:
        logging.info("Timed out waiting for page to load")
    # scrape data
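If you prefer the exact approach you described (a sketch; 'cell-box' comes from your snippet, and I am assuming the product name is readable from that element), you can save the text before navigating and wait until it changes:

old_name = None
for link in links:
    self.page_driver.get(link)
    if old_name is not None:
        # wait until the freshly located element no longer shows the previous name;
        # raises TimeoutException if the name never changes
        WebDriverWait(self.page_driver, 10).until(
            lambda d: d.find_element(By.CLASS_NAME, 'cell-box').text != old_name
        )
    old_name = self.page_driver.find_element(By.CLASS_NAME, 'cell-box').text
    # scrape data

Note that this breaks if two consecutive products share the same name, which is why staleness_of is the more robust condition.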

python selenium firefox behavior

I am using a Firefox browser with Selenium. I am scraping a website that has multiple pages, like a Google search, where you can pick the page at the bottom. On each page I click an element, again like Google, and scrape data from that element's information. If I am on an element's information page reached from the third page and click the back button in my regular Firefox browser, it goes back to the third page. But when I press the back button in Selenium with driver.back(), it takes me back to the first page. Does anyone know how to fix this?
count = 1
while 1:
    try:
        pages = driver.find_elements_by_css_selector("a.page-number.gradient")
    except:
        break
    for page in pages:
        if page.text == str(count):
            page.click()
            print count
            break
    states = driver.find_elements_by_xpath("//*[@id='table_div']/div/div/table/tbody/tr/td[19]")
    fails = []
    i = 1
    for state in states:
        if state.text == "FAILED":
            fails.append(i)
        i += 1
    for fail in fails:
        print driver.find_element_by_xpath("//*[@id='table_div']/div/div/table/tbody/tr[" + str(fail) + "]/td[19]").text
        driver.find_element_by_xpath("//*[@id='table_div']/div/div/table/tbody/tr[" + str(fail) + "]/td[1]/input").click()
        time.sleep(2)
        errors = driver.find_element_by_name("errors")
        if "\n" in errors.text:
            fixedText = errors.text.split("\n")[0]
            errors.clear()
            errors.send_keys(fixedText)
            time.sleep(1)
            driver.find_element_by_name('post_type').click()
            time.sleep(5)
            driver.switch_to_alert().accept()
            driver.switch_to_alert().accept()
            driver.back()
            driver.back()
        else:
            driver.back()
            driver.switch_to_alert().accept()
            driver.switch_to_alert().accept()
    count += 1
The code is really complicated, but basically it's the driver.back() lines that aren't working.
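One thing to try (a hedged sketch that reuses the pagination selector from the code above): JavaScript-driven pagination often doesn't push history entries that driver.back() can return to, so after going back, re-click the saved page number instead of relying on browser history alone:

def go_to_page(driver, number):
    # Re-click the pagination link whose text matches the saved page number
    for page in driver.find_elements_by_css_selector("a.page-number.gradient"):
        if page.text == str(number):
            page.click()
            return

driver.back()               # may land on page 1
go_to_page(driver, count)   # jump back to the page being processed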
