Scraping a page that updates using Python Selenium

I have been trying for weeks to scrape all the information from this website. The website is a company's profile page, and I'm trying to get all the information in the id="profile-basic" section and the id="profile-addresses" section.
I am looping through 1000 of these profiles, and this is only one of them. The reason I'm not showing that code is that it's very basic and doesn't affect my question; for those who want to know, it's just a simple for loop that goes through a list one by one.
The problem is that many of the elements on the page don't appear in some profiles but do in others. I tried solving that by writing down the XPaths of all possible elements and then using try: to check each of them, and that worked just fine. The only remaining problem is that a given XPath does not always point to the same piece of information: for example, the XPath for the address might be //*[@id="profile-addresses"]/div/div/div/div[1]/p, but sometimes it could be //*[@id="profile-addresses"]/div/div/div/div[2]/p, or many other XPaths. Since I'm trying to put the address inside the address variable, it is impossible to tell in advance which XPath will hold the address on a given page.
I tried using this code:
names = {"آدرس تولیدی :": "Address", "آدرس دفتر :": "Address", "تلفن :": "Phone2",
         "تعداد پرسنل :": "StaffNumber", "کدپستی :": "PostalCode",
         "توضیحات :": "Description2"}
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[1]/span').text
    _1 = names.get(e)
    __1 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[1]/p').text
    exec(f"global {_1}\n{_1} = Smalify('{__1}')")
except:
    pass
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[2]/span').text
    _2 = names.get(e)
    __2 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[2]/p').text
    exec(f"global {_2}\n{_2} = Smalify('{__2}')")
except:
    pass
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[3]/span').text
    _3 = names.get(e)
    __3 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[3]/p').text
    exec(f"global {_3}\n{_3} = Smalify('{__3}')")
except:
    pass
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[4]/span').text
    _4 = names.get(e)
    __4 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[4]/p').text
    exec(f"global {_4}\n{_4} = Smalify('{__4}')")
except:
    pass
try:
    e = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[5]/span').text
    _5 = names.get(e)
    __5 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[5]/p').text
    exec(f"global {_5}\n{_5} = Smalify('{__5}')")
except:
    pass
The code above reads the span in front of the main element, then finds the matching variable name in the names dictionary, and sets the value of the main element under that variable name using the exec() function.
This code did not work at all, for two reasons: A) it always returned None even when it could find the elements, and B) it took way too long.
I was wondering if there is any way other than my code to do this efficiently.

You can always try to search by ID rather than XPath. Since the XPath varies between pages, try to find something that is static, such as an ID.
There is more information about the different ways you can locate specific HTML elements with Selenium at this link. I definitely recommend checking it out.
Here is an example of searching for your elements by their IDs:
browser.find_element(By.ID, "profile-addresses").text
Good luck!
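If the label rows move around between profiles, you can also avoid exec() entirely by iterating over the rows inside the static #profile-addresses container and letting the names dictionary map each span label to a field. A minimal sketch, assuming each row holds one span label and one p value as in the question's XPaths:

from selenium.webdriver.common.by import By

profile = {}
rows = browser.find_elements(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div')
for row in rows:
    try:
        label = row.find_element(By.TAG_NAME, 'span').text
        value = row.find_element(By.TAG_NAME, 'p').text
    except Exception:
        continue  # skip rows without the expected span/p pair
    key = names.get(label)  # names is the question's label-to-field dictionary
    if key:
        profile[key] = Smalify(value)  # Smalify is the question's own helper

One find_elements call per profile replaces the five try blocks, and profile.get("Address") replaces the exec-created globals.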

Related

Selenium cannot locate element

I am using Selenium to create a Kahoot bot flooder (kahoot.it). I am trying to use Selenium to locate the input box, as well as the confirm button. Whenever I try to define them as variables, I get this: "Command raised an exception: TimeoutException: Message:", which I think means the 5 seconds I set have expired, i.e. the element was never located.
for idr in tabs:
    num += 1
    drv.switch_to.window(idr)
    time.sleep(0.3)
    gameid = WebDriverWait(drv, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "sc-bZSQDF bXdUBZ")))
    gamebutton = WebDriverWait(drv, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "sc-iqHYGH eMQRbB sc-geEHAE kTTBHH")))
    gameid.send_keys(gamepin)
    gamebutton.click()
    time.sleep(0.8)
    try:
        nick = WebDriverWait(drv, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "sc-bZSQDF bXdUBZ")))
        nickbutton = WebDriverWait(drv, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "sc-iqHYGH eMQRbB sc-ja-dpGc gYusMa")))
        nick.send_keys(f'{name}{num - 1}')
        nickbutton.click()
    except:
I tried locating an iframe, which wasn't really successful (I might have done it wrong), but I have been searching for hours and haven't found any answers. Any help would be appreciated.
The class names for the input and button tags have spaces in them, so By.CLASS_NAME cannot match them directly.
For the input tag you can use the name attribute, and for the button tag you can use the tag name, since it's the only button tag in the DOM.
gameinput = wait.until(EC.presence_of_element_located((By.NAME,"gameId")))
gameinput.send_keys("Sample Text")
submit = wait.until(EC.presence_of_element_located((By.TAG_NAME,"button")))
submit.click()
#It also worked with below line:
gameinput = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,".sc-bZSQDF.bXdUBZ")))
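For completeness, the wait object used above comes from a setup along these lines (a sketch; it simply mirrors the question's 5-second timeout):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

drv = webdriver.Chrome()
drv.get("https://kahoot.it")
wait = WebDriverWait(drv, 5)  # same 5-second timeout as the question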

Clear search bar using selenium

I'm using Selenium to check whether FB pages exist. When I enter the page title in the search bar it works fine, but from the second loop onwards the name of the page gets appended to the previous search, and I can't find a way to clear the previous input.
For example, it looks for
xyz the first time,
then it looks for
xyzabc, when I just want to look for abc this time.
How can I clear the search bar so I can enter the new input without the previous one?
Here is my code
for page_target in df.page_name.values:
    time.sleep(3)
    inputElement = driver.find_element_by_name("q")
    inputElement.send_keys(page_target)
    inputElement.submit()
    time.sleep(5)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser').get_text()
    title = soup.find(page_target)
    # if page exists add 1 to the dic otherwise -1
    if title > 0:
        dic_holder[page_target] = 1
    else:
        dic_holder[page_target] = -1
    driver.find_element_by_name("q").clear()
    time.sleep(3)
You can use
WebElement.clear(); // to clear the previous search item
WebElement.sendKeys(abc); // to insert the new search
Also, I guess you have a sticky search in your application, hence I recommend using this method every time you insert something in the search box.
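Since the question is in Python, the same idea there looks like this (a sketch reusing the question's "q" search box and loop variable):

search_box = driver.find_element_by_name("q")
search_box.clear()                  # drop the previous search term
search_box.send_keys(page_target)   # type the new one
search_box.submit()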
A few ways to do it:
Use element.clear(). I see that you already tried it in your code; I'm not sure why it didn't work, but perhaps it is not a text box or input element?
Use JavaScript: driver.execute_script('document.getElementsByName("q")[0].value=""')
Emulate Ctrl+A:
from selenium.webdriver.common.keys import Keys
elem.send_keys(Keys.CONTROL, 'a')
elem.send_keys("page 1")

Stale Element Reference error when iterating over CSS Selector to extract links with Python

I am trying to retrieve all the links to the posts of an Instagram account. The structure is a bit nested: first I find by XPath the class where all of those links are located, and then I iterate over the web elements (posts) to extract the links. However, this approach throws a StaleElementReferenceException.
My question is: how should I design a loop, with a WebDriverWait implementation and By.CSS_SELECTOR, to extract the links and store them in one list?
I've read about WebDriverWait and tried to implement it, yet I am stuck doing it properly, since none of my attempts seem to work.
I've searched the existing questions and found two links that were very helpful; however, neither deals with By.CSS_SELECTOR to extract an href.
These are the links:
StaleElementException when iterating with Python
My current code, which goes into an infinite loop:
def getting_comment(instagram_page, xpath_to_links, xpath_to_comments):
    global allComments
    links = []
    scheight = .1
    posts = []
    browser = webdriver.Chrome('/Users/marialavrovskaa/desktop/chromedriver')
    browser.get(f"{instagram_page}")
    while scheight < 9.9:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight/%s);" % scheight)
        scheight += .01
    posts = browser.find_elements_by_xpath(f"//div[@class='{xpath_to_links}']")
    for elem in posts:
        while True:
            try:
                WebDriverWait(elem, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".a")))
                links.append(elem.find_element_by_css_selector('a').get_attribute('href'))
            except TimeoutException:
                break
instagram_page = "https://www.instagram.com/titovby/?hl=ru"
xpath_to_links = "v1Nh3 kIKUG _bz0w"
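One stale-safe pattern, offered here as a sketch rather than the thread's accepted fix: read each href immediately after locating its anchor, so no WebElement is held across Instagram's re-renders (the class name below comes from the question and may have changed since):

links = []
anchors = browser.find_elements_by_css_selector("div.v1Nh3.kIKUG._bz0w a")
for a in anchors:
    href = a.get_attribute('href')  # read right away; don't keep the element
    if href and href not in links:
        links.append(href)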

Can't print in Python

I've been studying Python for the last few weeks to automate some work for my business.
Basically I have to do web scraping, but I'm having trouble with a print call on the next-to-last line of the code below...
def search_time(self):
    for item in self.items:
        print(f"Procurando {item}.")
        self.driver.get(self.bot_url)
        cpfBox = self.driver.find_element_by_xpath('//*[@id="search"]/div/div[1]/input')
        cpfBox.send_keys(item)
        time.sleep(2)
        cpfButton = self.driver.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
        cpfButton.click()
        time.sleep(2)
        self.delay = 3  # seconds
        try:
            WebDriverWait(self.driver, self.delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="main"]/div[1]/h2')))
            print('CPF Valido')
        except TimeoutException:
            print('CPF Invalido')
            time.sleep(2)
        name = self.driver.find_element_by_xpath("/html/body/main[1]/div[1]/div[1]/div[1]/div[1]/h2").text
        print(name)
        time.sleep(2)

items = ["32911769953"]
bot_url = BOT(items)
bot_url.search_time()
Check whether the XPath is indeed the correct path to the text you want to get through Selenium.
If this is a website, you can go to the element you are trying to find, right-click it, and select Copy XPath.
try:
    WebDriverWait(self.driver, self.delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="main"]/div[1]/h2')))
    name = self.driver.find_element_by_xpath("/html/body/main[1]/div[1]/div[1]/div[1]/div[1]/h2").text
    print(name)
    time.sleep(2)
    print('CPF Valido')
except TimeoutException:
    print('CPF Invalido')
    time.sleep(2)
First, I would recommend generalizing your XPath query. Currently it is heavily dependent on the entire page's structure, so a small layout change in an unrelated part of the page could give you undesirable results. Reviewing XPath syntax, in particular the use of // and predicates, will save you many headaches in the future.
If that doesn't help, please post the HTML that you are attempting to parse.
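For instance, a relative query anchored on a stable landmark survives layout changes that would break the absolute chain above (a sketch; the real anchor depends on the page):

# Any h2 under main, instead of /html/body/main[1]/div[1]/...
name = self.driver.find_element_by_xpath("//main//h2").text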

Selenium stops after input

I am new to Python and Selenium coding, but I think I am getting the hang of it; I tried to build some examples for myself to learn from. I have 2 questions.
First of all, for some reason my code stops after my input; it never reaches the yalla() function:
yallaurl = str(input('Your URL + ' + ""))
browser = webdriver.Chrome()
browser.get(yallaurl)
browser.maximize_window()
yalla()
Other than that, my second question is about browser.find_element_by_xpath. After I go to an HTML element and click Copy XPath, I get something like this:
/html/body/table[2]/tbody/tr/td/form/table[4]/tbody/tr[2]/td/table/tbody/tr[2]/td[2]
So how does this line of code work? Is this legit?
def yalla():
    sleep(2)
    count = len(browser.find_elements_by_class_name('flyingCart'))
    email = browser.find_element_by_xpath('/html/body/table[2]/tbody/tr/td/form/table[4]/tbody/tr[2]/td/table/tbody/tr[2]/td[2]')
    for x in range(2, count):
        itemdesc[x] = browser.find_element_by_xpath(
            "/html/body/table[2]/tbody/tr/td/form/table[1]/tbody/tr[2]/td[2]/table/tbody/tr[x]/td[2]/a[1]/text()")
        priceper[x] = browser.find_element_by_xpath(
            "/html/body/table[2]/tbody/tr/td/form/table[1]/tbody/tr[2]/td[2]/table/tbody/tr[x]/td[5]/text()")
        amount[x] = browser.find_element_by_xpath(
            "/html/body/table[2]/tbody/tr/td/form/table[1]/tbody/tr[2]/td[2]/table/tbody/tr[x]/td[6]")
    browser.navigate().to('https://www.greeninvoice.co.il/app/documents/new#type=100')
    checklogininvoice()
Yes, your code will run just fine and is legit, but it is not recommended.
As described, the absolute path works, but it would break if the HTML were changed even slightly.
Reference: https://selenium-python.readthedocs.io/locating-elements.html
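As an illustration, a relative query keyed on the form rather than the document root is less brittle (a sketch; the exact anchor is an assumption to adapt to the real page):

# Instead of the absolute /html/body/table[2]/... chain:
email = browser.find_element_by_xpath("//form//table[4]//tr[2]/td[2]")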
Firstly, this code is confusing:
yallaurl = str(input('Your URL + ' + ""))
This is essentially equivalent to:
yallaurl = input('Your URL: ')
Yes, this code is correct:
browser.find_element_by_xpath('/html/body/table[2]/tbody/tr/td/form/table[4]/tbody/tr[2]/td/table/tbody/tr[2]/td[2]')
Please refer to the docs for proper usage.
Here is the suggested use of this method:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '/html/body/table[2]/tbody/tr/td/form/table[4]/tbody/tr[2]/td/table/tbody/tr[2]/td[2]')
This code will return an object of the element you have selected. To print the HTML of the element itself, this should work:
print(element.get_attribute('outerHTML'))
For further information on page objects, please refer to this page of the docs.
Since you have not provided the code for your 'yalla' function, it is hard to diagnose the problem there.
