Same strings in Python aren't matching in web scraping - python

Hi, I am trying to scrape the Gujarat RERA site using Selenium with Python.
main.py:
import csv
import os
import time
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support.select import Select

WORK_DONE = False
chromedriver_autoinstaller.install()
driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://gujrerar1.gujarat.gov.in/home")
time.sleep(3)

close = driver.find_elements(By.CSS_SELECTOR, ".close")  # click close on popup.
for i in close:
    try:
        i.click()
        time.sleep(0.5)
    except:
        pass

close = driver.find_elements(By.CSS_SELECTOR, ".close")  # click close on popup.
for i in close:
    try:
        i.click()
        time.sleep(0.5)
    except:
        pass

search_bar = driver.find_element(By.ID, "password")
data = ["031218"]
for reg_no in data:
    last_thing_of_reg_no = reg_no.split("/")[5]
    search_bar.send_keys(Keys.CONTROL + "A")
    search_bar.send_keys(last_thing_of_reg_no)
    driver.find_element(By.NAME, "btn1").click()  # click on search button.
    time.sleep(1.5)
    driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)  # scroll the page to end
    time.sleep(1.5)
    total_projects = driver.find_elements(By.CSS_SELECTOR, ".search_result_list")
    for projects in total_projects:
        all_paragraphs = projects.find_elements(By.TAG_NAME, "p")
        for i in all_paragraphs:
            text = i.text.replace("Reg No. : ", "")
            print(text)
            print(reg_no)
            if reg_no == text:
                print("yes")
                print("done")
                time.sleep(100)

print("done")
time.sleep(50)
First, it opens the website, closes the pop-ups, finds the search bar, and sends reg_no. Then it gets all the paragraph tags on the results page and iterates through each of them, and that is where the problem is: both strings look exactly the same, but the code never enters the if statement, and I don't know why.
I want this code to print yes and continue inside that block, but it never prints yes, even though both strings appear to be identical.
I don't know what else to add, but feel free to ask more questions.
Many thanks for considering my request.
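One thing worth checking is whether the two strings differ only by invisible characters (trailing whitespace, newlines, non-breaking spaces) or because the paragraph text contains more than the registration number. A hypothetical debugging sketch (the example value below is made up, not real site data):

# Hypothetical debugging sketch (not a verified fix): repr() exposes hidden
# whitespace or unicode characters that make visually identical strings unequal.
def normalise(s):
    return " ".join(s.replace("\xa0", " ").split())   # collapse spaces, NBSPs, newlines

scraped_text = "PR/GJ/EXAMPLE/031218 \n"   # stand-in for i.text.replace("Reg No. : ", "")
expected = "PR/GJ/EXAMPLE/031218"          # stand-in for reg_no
print(repr(scraped_text), repr(expected))  # shows the invisible difference
print(normalise(scraped_text) == normalise(expected))  # True once both sides are normalised

Printing repr() of both sides usually makes the mismatch obvious at a glance.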

Related

How to close clickable popup to continue scraping through Selenium in python

I'm trying to scrape some information from clickable popups in a table on a website into a pandas dataframe using Selenium in python and it seems to be able to do this if the popups have information.
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
import pandas as pd
import time

driver = webdriver.Chrome()
driver.get('https://mspotrace.org.my/Sccs_list')
time.sleep(20)

# Select maximum number of entries
elem = driver.find_element_by_css_selector('select[name=dTable_length]')
select = Select(elem)
select.select_by_value('500')
time.sleep(15)

# Get list of elements
elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@title='View on Map']")))

# Loop through element popups and pull details of facilities into DF
pos = 0
df = pd.DataFrame(columns=['facility_name', 'other_details'])
try:
    for element in elements:
        data = []
        element.click()
        time.sleep(3)
        facility_name = driver.find_element_by_xpath('//h4[@class="modal-title"]').text
        other_details = driver.find_element_by_xpath('//div[@class="modal-body"]').text
        data.append(facility_name)
        data.append(other_details)
        df.loc[pos] = data
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[aria-label='Close'] > span"))).click()  # close popup window
        time.sleep(10)
        pos += 1
except:
    print("No geo location information")
    pass

print(df)
However, there are cases when a window like the one below appears, and I need to click 'OK' on it to resume scraping the other rows on the web page, but I can't seem to find the element to click to do this.
Can you try this for Python:
driver.switch_to.alert.accept()
But your test scenario should be clear, and you should know where this pop-up appears. If you don't know and it is really random, you can add a check that runs after each test step.
The Selenium driver provides a way to switch to the alert context and work with it:
driver.switch_to.alert
After that, you can do whatever you want, depending on the alert type. To simulate clicking on "OK":
driver.switch_to.alert.accept()
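If the pop-up really is a JavaScript alert, one way to avoid racing it is to wait for it explicitly before accepting. A minimal sketch along those lines (the URL is the one from the question; if the pop-up is actually an HTML modal rather than a true alert, this will simply time out):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://mspotrace.org.my/Sccs_list")

try:
    # Wait up to 5 seconds for an alert to appear, then accept it (click "OK").
    WebDriverWait(driver, 5).until(EC.alert_is_present())
    driver.switch_to.alert.accept()
except Exception:
    pass  # no alert appeared within 5 seconds; continue scraping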

Why exactly do I get an IndexError?

I am trying to use this Python code from here. I am using the Firefox geckodriver instead. I get an IndexError from line 43, which is log_in[0].click(). Here is the code for convenience:
# importing necessary classes
# from different modules
from lib2to3.pgen2 import driver
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Firefox()
prefs = {"profile.default_content_setting_values.notifications": 2}

# open facebook.com using get() method
browser.get('https://www.facebook.com/')

# user_name or e-mail id
username = "argleblargle@gmail.com"

# getting password from text file
with open('test.txt', 'r') as myfile:
    password = myfile.read().replace('\n', '')

print("Let's Begin")

element = browser.find_elements_by_xpath('//*[@id="email"]')
element[0].send_keys(username)
print("Username Entered")

element = browser.find_element_by_xpath('//*[@id="pass"]')
element.send_keys(password)
print("Password Entered")

# logging in
log_in = browser.find_elements_by_id('loginbutton')
log_in[0].click()
print("Login Successful")

browser.get('https://www.facebook.com/events/birthdays/')
feed = 'Hap Borth! Hope you have an amazing day!'

element = browser.find_elements_by_xpath("//*[@class='enter_submit\
uiTextareaNoResize uiTextareaAutogrow uiStreamInlineTextarea\
inlineReplyTextArea mentionsTextarea textInput']")

cnt = 0
for el in element:
    cnt += 1
    element_id = str(el.get_attribute('id'))
    XPATH = '//*[@id="' + element_id + '"]'
    post_field = browser.find_element_by_xpath(XPATH)
    post_field.send_keys(feed)
    post_field.send_keys(Keys.RETURN)
    print("Birthday Wish posted for friend" + str(cnt))

# Close the browser
browser.close()
As you can see from the code, it prints a message when each step is completed. It printed "Username Entered" and "Password Entered", but did not print "Login Successful". I get an IndexError: line 43, in <module> log_in[0].click()
Is that because the login button is somewhere different from when the code was first written? Is it 2FA shenanigans? I am doing this for fun, thanks for reading.
EDIT: the original error was because of the s in find_elements_by_id. There is only one element. Oops.
The error is now selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: [id="loginbutton"]
The log_in variable is empty because many websites, including Facebook, first load a bare page and only then render the layout and all the elements with JavaScript.
Your code tries to interact with Facebook before it is fully loaded and therefore cannot find the login button.
You can use an explicit wait to resolve your problem (see the sketch below), or just write a while loop that checks whether the button has been found.
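A minimal sketch of such an explicit wait, assuming the button's id is still loginbutton (the selector is taken from the question and Facebook's markup may have changed since):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get('https://www.facebook.com/')

# Wait up to 20 seconds for the login button to become clickable before using it.
log_in = WebDriverWait(browser, 20).until(
    EC.element_to_be_clickable((By.ID, 'loginbutton'))
)
log_in.click()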

How to make Selenium only click a button and nothing else? Inconsistent clicking

My goal: to scrape the number of projects done by a user on Khan Academy.
To do so I need to parse the user's profile page. But I need to click on "show more" to see all the projects a user has done, and then scrape them.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from bs4 import BeautifulSoup

# here is one example of a user
driver = webdriver.Chrome()
driver.get('https://www.khanacademy.org/profile/trekcelt/projects')

# to infinitely click on the show more button until there is none
while True:
    try:
        showmore_project = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'showMore_17tx5ln')))
        showmore_project.click()
    except TimeoutException:
        break
    except StaleElementReferenceException:
        break

# parsing the profile
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get a list of all the projects
project = soup.find_all(class_='title_1usue9n')
# get the number of projects
print(len(project))
This code returns 0 for print(len(project)), and that's not right, because when you manually check https://www.khanacademy.org/profile/trekcelt/projects you can see that the number of projects is definitely not 0.
The weird thing: at first you can see (through the webdriver) that this code works fine, then Selenium clicks on something other than the show more button, for example on one of the project links, which changes the page, and that's why we get 0.
I don't understand how to correct my code so that Selenium clicks only on the right button and nothing else.
Check out the following implementation to get the desired behavior. When the script is running, take a closer look at the scroll bar to see the progress.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

with webdriver.Chrome() as driver:
    wait = WebDriverWait(driver, 10)
    driver.get('https://www.khanacademy.org/profile/trekcelt/projects')
    while True:
        try:
            showmore = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '[class^="showMore"] > a')))
            driver.execute_script("arguments[0].click();", showmore)
        except Exception:
            break

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    project = soup.find_all(class_='title_1usue9n')
    print(len(project))
Another way would be:
while True:
    try:
        showmore = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '[class^="showMore"] > a')))
        showmore.location_once_scrolled_into_view
        showmore.click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '[class^="spinnerContainer"] > img[class^="loadingSpinner"]')))
    except Exception:
        break
Output at this moment:
381
I have modified the accepted answer to improve the performance of your script. Comments on how this is achieved are in the code.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import time

start_time = time.time()
# here is one example of a user
with webdriver.Chrome() as driver:
    driver.get('https://www.khanacademy.org/profile/trekcelt/projects')
    # This code will wait until the first Show More is displayed (after the page has loaded)
    showmore_project = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME,
                                                                                       'showMore_17tx5ln')))
    showmore_project.click()
    # to infinitely click on the show more button until there is none
    while True:
        try:
            # We retrieve and click until we no longer find the element.
            # NoSuchElementException is raised when we reach the end. This saves the 10-second wait.
            showmore_project = driver.find_element_by_css_selector('.showMore_17tx5ln [role="button"]')
            # Using JS to send the click avoids Selenium throwing an exception when the click
            # would not be performed on the right element.
            driver.execute_script("arguments[0].click();", showmore_project)
        except StaleElementReferenceException:
            continue
        except NoSuchElementException:
            break

    # parsing the profile
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # get a list of all the projects
    project = soup.find_all(class_='title_1usue9n')
    # get the number of projects
    print(len(project))
    print(time.time() - start_time)
Execution Time1: 14.343502759933472
Execution Time2: 13.955228090286255
Hope this helps you!

How to scroll down a twitter page to load next pages and extract the data

I am trying to scroll down the comments on a Twitter status, and to extract the page with all the comments (or at least the first 5 pages). I am using the Selenium driver for it, but I have not been successful with the scrolling part, so I have to do it manually and then extract. I am using Python 3.6.5. Please help...
For example, for this tweet - https://twitter.com/TeamYouTube/status/1012415985184206848
Can anyone help me with the code?
My code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome(executable_path="...../chromedriver")
driver.get('https://twitter.com/TeamYouTube/status/1012415985184206848')

for i in range(1, 10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

ip = input("Enter y to proceed: ")
if ip == 'y':
    page = driver.page_source
    filename = input('Enter file name : ')
    path = 'D:/page_' + filename + '.html'
    f = open(path, 'w', encoding='utf-8')
    f.write(page)
    f.close()

driver.close()
Try this:
driver.execute_script("arguments[0].scrollTo(0, document.body.scrollHeight);", driver.find_element_by_id("permalink-overlay-dialog"))
Explanation: you have to scroll a particular div. To do that, you have to find this element on the page and then scroll only this element to its end.
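A minimal Python sketch of that idea, assuming the permalink-overlay-dialog id from the snippet above still exists on the page (Twitter's markup changes often, so treat the selector as illustrative):

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://twitter.com/TeamYouTube/status/1012415985184206848')
time.sleep(5)

# Scroll the overlay element itself (not the window) a few times so more replies load.
overlay = driver.find_element_by_id("permalink-overlay-dialog")
for _ in range(10):
    driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight;", overlay)
    time.sleep(2)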
Second suggestion is to use:
from selenium.webdriver.common.keys import Keys
# locate element and simulate 'END' button press
driver.find_element_by_id("permalink-overlay-dialog").send_keys(Keys.END)
If it doesn't work, also try to extend it with ActionChains:
from selenium.webdriver.common.action_chains import ActionChains
element = driver.find_element_by_id("permalink-overlay-dialog")
action = ActionChains(driver)
action.move_to_element(element).perform()
element.send_keys(Keys.END)

StaleElementReferenceException in Python

I am trying to scrape data from the Sunshine List website (http://www.sunshinelist.ca/) using the Selenium package, but I get the error below. From several other related posts I understand that I need to use WebDriverWait to explicitly ask the driver to wait/refresh, but I am unable to identify where and how I should call that function.
Screenshot of the error:
StaleElementReferenceException: Message: The element reference of <tr class="even"> is stale: either the element is no longer attached to the DOM or the page has been refreshed
import numpy as np
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

ffx_bin = FirefoxBinary(r'C:\Users\BhagatM\AppData\Local\Mozilla Firefox\firefox.exe')
ffx_caps = DesiredCapabilities.FIREFOX
ffx_caps['marionette'] = True
driver = webdriver.Firefox(capabilities=ffx_caps, firefox_binary=ffx_bin)

driver.get("http://www.sunshinelist.ca/")
driver.maximize_window()

tablewotags1 = []
while True:
    divs = driver.find_element_by_id('datatable-disclosures')
    divs1 = divs.find_elements_by_tag_name('tbody')
    for d1 in divs1:
        div2 = d1.find_elements_by_tag_name('tr')
        for d2 in div2:
            tablewotags1.append(d2.text)
    try:
        driver.find_element_by_link_text('Next →').click()
    except NoSuchElementException:
        break

year1 = tablewotags1[0::10]
name1 = tablewotags1[3::10]
position1 = tablewotags1[4::10]
employer1 = tablewotags1[1::10]

df1 = pd.DataFrame({'Year': year1, 'Name': name1, 'Position': position1, 'Employer': employer1})
df1.to_csv('Sunshine List-1.csv', index=False)
If your problem is to click the "Next" button, you can do that with the xpath:
driver = webdriver.Firefox(executable_path=r'/pathTo/geckodriver')
driver.get("http://www.sunshinelist.ca/")
wait = WebDriverWait(driver, 20)
el = wait.until(EC.presence_of_element_located((By.XPATH, "//ul[@class='pagination']/li[@class='next']/a[@href='#' and text()='Next → ']")))
el.click()
For each click on the "Next" button -- you should find that button and click on it.
Or do something like this:
max_attemps = 10
while True:
    next = self.driver.find_element_by_css_selector(".next>a")
    if next is not None:
        break
    else:
        time.sleep(0.5)
        max_attemps -= 1
        if max_attemps == 0:
            self.fail("Cannot find element.")
And after this code, do the click action.
PS: Also try adding just time.sleep(x) after finding the element and before performing the click action.
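As an alternative to a fixed time.sleep(x), an explicit wait for clickability often works. A minimal sketch, reusing the .next>a selector from the snippet above (assumed, not verified against the current site):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://www.sunshinelist.ca/")

# Wait until the "Next" link is actually clickable instead of sleeping a fixed time.
next_link = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, ".next > a"))
)
next_link.click()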
Try this code below.
When the element is no longer attached to the DOM and a StaleElementReferenceException is raised, search for the element again to re-reference it.
Please note that I checked this with Chrome:
try:
    driver.find_element_by_css_selector('div[id="datatable-disclosures_wrapper"] li[class="next"]>a').click()
except StaleElementReferenceException:
    driver.find_element_by_css_selector('div[id="datatable-disclosures_wrapper"] li[class="next"]>a').click()
except NoSuchElementException:
    break
Stale exceptions can be handled by catching StaleElementReferenceException so that the for loop keeps executing when you fetch the element with any find_element method inside the loop.
from selenium.common import exceptions
and customize your for loop as:
# inside the for loop:
try:
    driver.find_elements_by_id("data")  # method to find the element
    # ... your code ...
except exceptions.StaleElementReferenceException:
    pass
When a StaleElementReferenceException is raised, that means something changed on the site, but not in the list you hold. So the trick is to refresh that list every time inside the loop, like this:
while True:
    driver.implicitly_wait(4)
    for d1 in driver.find_element_by_id('datatable-disclosures').find_element_by_tag_name('tbody').find_elements_by_tag_name('tr'):
        tablewotags1.append(d1.text)
    try:
        driver.switch_to.default_content()
        driver.find_element_by_xpath('//*[@id="datatable-disclosures_wrapper"]/div[2]/div[2]/div/ul/li[7]/a').click()
    except NoSuchElementException:
        print('Don\'t be so cryptic about error messages, they are good\n'
              '...Script broke clicking next')  # jk aside, put some info there
        break
Hope this helps you, cheers.
Edit:
So I went to the said website; the layout is pretty straightforward, but the structure repeats itself about four times, so when you crawl the site like that, something is bound to change.
So I've edited the code to only scrape one tbody tree, the one that comes from the first datatable-disclosures element, and added some waits (see the sketch below).
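A minimal sketch of what that scoping plus an explicit wait could look like, assuming the datatable-disclosures id from the question (illustrative only, not the answerer's exact edit):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://www.sunshinelist.ca/")
wait = WebDriverWait(driver, 20)

# Wait for the first disclosures table, then read rows only from its first tbody.
table = wait.until(EC.presence_of_element_located((By.ID, "datatable-disclosures")))
tbody = table.find_element(By.TAG_NAME, "tbody")
rows = [tr.text for tr in tbody.find_elements(By.TAG_NAME, "tr")]
print(len(rows))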
