How to get data from Airbnb with selenium - python

I am trying to web-scrape Airbnb with Selenium. However, it's been a HUGE impossible mission.
First, I create a driver, where the argument "executable_path" points to where my chromedriver is installed.
driver = webdriver.Chrome(executable_path=r'C:\directory\directory\directory\chromedriver.exe')
Secondly, I do the other stuff:
driver.get('https://www.airbnb.com.br/')
a = driver.find_element(By.CLASS_NAME, "cqtsvk7 dir dir-ltr")
a.click()
a.send_keys('Poland')
Here I received the error: NoSuchWindowException: Message: no such window: target window already closed from unknown error: web view not found
Moreover, when I create the variables to store the HTML elements, it doesn't work either:
title = driver.find_elements(By.CLASS_NAME, 'a-size-base-plus a-color-base a-text-normal')
place = driver.find_elements(By.ID, 'title_49247685')
city = driver.find_elements(By.CLASS_NAME, 'f15liw5s s1cjsi4j dir dir-ltr')
price = driver.find_elements(By.CLASS_NAME, 'p11pu8yw dir dir-ltr')
Could someone please help me? How can I get the place, city and price for all of the results of my search for a place to travel on Airbnb? (I know how to store everything in a pandas df; my problem is the use of Selenium. Those find_elements calls don't seem to work properly on Airbnb.)

I received the error: NoSuchWindowException: Message: no such window: target window already closed from unknown error: web view not found
Which line is raising this error? I don't see anything in your snippets that could be causing it, but is there anything in your code [before the included snippet], or some external factor, that could be causing the automated window to get closed? You could see if any of the answers to this help you with the issue, especially if you're using .switch_to.window anywhere in your code.
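For illustration, a minimal sketch of keeping hold of a window handle you can safely switch back to (the tab-opening code in the middle is hypothetical; only the handle bookkeeping is the point):
original = driver.current_window_handle   # remember the starting window

# ... code that opens or closes other windows/tabs ...

# switch back to a handle that still exists before calling find_element again
if original in driver.window_handles:
    driver.switch_to.window(original)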
Searching
(You should include screenshots or better descriptions of the fields you are targeting, especially when the issue is that you're having difficulty targeting them.)
Secondly, I do the other stuff:
driver.get('https://www.airbnb.com.br/')
a = driver.find_element(By.CLASS_NAME, "cqtsvk7 dir dir-ltr")
I want Selenium to search for the country where I want to extract the data (Poland, in this case)
If you mean that you're trying to enter "Poland" into this input field, then the class cqtsvk7 in cqtsvk7 dir dir-ltr appears to change. The id attribute might be more reliable; but it also seems you need to click on the search area to make the input interactable, and after entering "Poland" you have to click on the search icon and wait for the results to load.
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
def search_airbnb(search_for, browsr, wait_timeout=5):
    wait_til = WebDriverWait(browsr, wait_timeout).until
    browsr.get('https://www.airbnb.com.br/')

    wait_til(EC.element_to_be_clickable(
        (By.CSS_SELECTOR, 'div[data-testid="little-search"]')))
    search_area = browsr.find_element(
        By.CSS_SELECTOR, 'div[data-testid="little-search"]')
    search_area.click()
    print('CLICKED search_area')

    wait_til(EC.visibility_of_all_elements_located(
        (By.ID, "bigsearch-query-location-input")))
    a = browsr.find_element(By.ID, "bigsearch-query-location-input")
    a.send_keys(search_for)
    print(f'ENTERED "{search_for}"')

    wait_til(EC.element_to_be_clickable((By.CSS_SELECTOR,
        'button[data-testid="structured-search-input-search-button"]')))
    search_btn = browsr.find_element(By.CSS_SELECTOR,
        'button[data-testid="structured-search-input-search-button"]')
    search_btn.click()
    print('CLICKED search_btn')
searchFor = 'Poland'
search_airbnb(searchFor, driver) # , 15) # adjust wait_timeout if necessary
Notice that for the clicked elements, I used By.CSS_SELECTOR; if you're unfamiliar with CSS selectors, you can consult this reference. You can also use By.XPATH in these cases; this XPath cheatsheet might help then.
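For example, the search button used above could be targeted either way (a small sketch; both selectors point at the same element):
# equivalent ways of locating the search button
btn_css = driver.find_element(By.CSS_SELECTOR, 'button[data-testid="structured-search-input-search-button"]')
btn_xpath = driver.find_element(By.XPATH, '//button[@data-testid="structured-search-input-search-button"]')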
Scraping Results
How can I get the place, city and price for all of the results of my search for a place to travel on Airbnb?
Again, you can use CSS selectors [or XPaths] as they're quite versatile. If you use a function like
def select_get(elem, sel='', tAttr='innerText', defaultVal=None, isv=False):
    try:
        el = elem.find_element(By.CSS_SELECTOR, sel) if sel else elem
        rVal = el.get_attribute(tAttr)
        if isinstance(rVal, str): rVal = rVal.strip()
        return defaultVal if rVal is None else rVal
    except Exception as e:
        if isv: print(f'failed to get "{tAttr}" from "{sel}"\n', type(e), e)
        return defaultVal
then even if a certain element or attribute is missing in any of the cards, it'll just fill in with defaultVal and all the other cards will still be scraped instead of raising an error and crashing the whole program.
You can get a list of dictionaries in listings by looping through the result cards with list comprehension like
listings = [{
    'name': select_get(el, 'meta[itemprop="name"]', 'content'),  # SAME TEXT AS
    # 'title_sub': select_get(el, 'div[id^="title_"]+div+div>span'),
    'city_title': select_get(el, 'div[id^="title_"]'),
    'beds': select_get(el, 'div[id^="title_"]+div+div+div>span'),
    'dates': select_get(el, 'div[id^="title_"]+div+div+div+div>span'),
    'price': select_get(el, 'div[id^="title_"]+div+div+div+div+div div+span'),
    'rating': select_get(el, 'div[id^="title_"]~span[aria-label]', 'aria-label'),
    # 'url': select_get(el, 'meta[itemprop="url"]', 'content', defaultVal='').split('?')[0],
} for el in driver.find_elements(
    By.CSS_SELECTOR, 'div[itemprop="itemListElement"]'  ## RESULT CARD SELECTOR
)]
Dealing with Pagination
If you want to scrape multiple pages, you can loop through them. [You can also use while True (instead of a for loop as below) for unlimited pages, but I feel it's safer like this, even if you set an absurdly high limit like maxPages=5000 or something; either way, it should break out of the loop once it reaches the last page.]
maxPages = 50  # adjust as preferred
wait = WebDriverWait(driver, 3)  # adjust timeout as necessary
listings, addedIds = [], []
isFirstPage = True
for pgi in range(maxPages):
    prevLen = len(listings)  # just for printing progress

    ## wait to load all the cards ##
    try:
        wait.until(EC.visibility_of_all_elements_located(
            (By.CSS_SELECTOR, 'div[itemprop="itemListElement"]')))
    except Exception as e:
        print(f'[{pgi}] Failed to load listings', type(e), e)
        continue  # losing one loop for additional wait time

    ## check current page number according to driver ##
    try:
        pgNum = driver.find_element(
            By.CSS_SELECTOR, 'button[aria-current="page"]'
        ).text.strip() if not isFirstPage else '1'
    except Exception as e:
        print('Failed to find pgNum', type(e), e)
        pgNum = f'?{pgi+1}?'

    ## collect listings ##
    pgListings = [{
        'listing_id': select_get(
            el, 'div[role="group"]>a[target^="listing_"]', 'target',
            defaultVal='').replace('listing_', '', 1).strip(),
        # 'position': 'pg_' + str(pgNum) + '-pos_' + select_get(
        #     el, 'meta[itemprop="position"]', 'content', defaultVal=''),
        'name': select_get(el, 'meta[itemprop="name"]', 'content'),
        #####################################################
        ### INCLUDE ALL THE key-value pairs THAT YOU WANT ###
        #####################################################
    } for el in driver.find_elements(
        By.CSS_SELECTOR, 'div[itemprop="itemListElement"]'
    )]

    ## [ only checks for duplicates against listings from previous pages ] ##
    listings += [pgl for pgl in pgListings if pgl['listing_id'] not in addedIds]
    addedIds += [l['listing_id'] for l in pgListings]
    ### [OR] check for duplicates within the same page as well ###
    ## for pgl in pgListings:
    ##     if pgl['listing_id'] not in addedIds:
    ##         listings.append(pgl)
    ##         addedIds.append(pgl['listing_id'])

    print(f'[{pgi}] extracted', len(listings)-prevLen,
          f'listings [of {len(pgListings)} total] from page', pgNum)

    ## go to next page ##
    nxtPg = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label="Próximo"]')
    if not nxtPg:
        print(f'No more next page [{len(listings)} listings so far]\n')
        break  ### [OR] START AGAIN FROM page1 WITH:
        ## try: _, isFirstPage = search_airbnb(searchFor, driver), True
        ## except Exception as e: print('Failed to search again', type(e), e)
        ## continue
        ### bc airbnb doesn't show all results even across all pages
        ### so you can get a few more every re-scrape [but not many - less than 5 per page]
    try: _, isFirstPage = nxtPg[0].click(), False
    except Exception as e: print('Failed to click next', type(e), e)

dMsg = f'[reduced from {len(addedIds)} after removing duplicates]'
print('extracted', len(listings), 'listings with', dMsg)
[listing_id seems to be the easiest way to ensure that only unique listings are collected. You can also form a link to that listing like f'https://www.airbnb.com.br/rooms/{listing_id}'.]
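For instance, a quick way to attach such a link to each collected dictionary (a small sketch, assuming listings was built with the listing_id field as in the loop above):
# sketch: derive a room URL from each collected listing_id
for l in listings:
    l['url'] = f"https://www.airbnb.com.br/rooms/{l['listing_id']}" if l.get('listing_id') else None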
Combining with Old Data [Load & Save]
If you want to save to CSV and also load previous data from the same file, with old and new data combined without duplicates, you can do something like
# import pandas as pd
# import os
fileName = 'pol_airbnb.csv'
maxPages = 50

try:
    listings = pd.read_csv(fileName).to_dict('records')
    addedIds = [str(l['listing_id']).strip() for l in listings]
    print(f'loaded {len(listings)} previously extracted listings')
except Exception as e:
    print('failed to load previous data', type(e), e)
    listings, addedIds = [], []

#################################################
# for pgi... ## LOOP THROUGH PAGES AS ABOVE #####
#################################################

dMsg = f'[reduced from {len(addedIds)} after removing duplicates]'
print('extracted', len(listings), 'listings with', dMsg)

pd.DataFrame(listings).set_index('listing_id').to_csv(fileName)
print('saved to', os.path.abspath(fileName))
Note that keeping the old data might mean that some of the listings are no longer available.
View pol_airbnb.csv for my results with maxPages=999, searching again instead of break-ing in the if not nxtPg block.

Related

How to loop through indeed job pages using selenium

I am trying to make a Selenium Python script to collect data from each job in an Indeed job search. I can easily get the data from the first and second pages. The problem I am running into is that, while looping through the pages, the script only clicks the next page and then the previous page, in that order, going from page 1 -> 2 -> 1 -> 2 -> etc. I know it is doing this because both the next and previous buttons have the same class name. So when I redeclare the webelement variable after the page loads, it hits the previous button because that is the first occurrence of the class in the stack. I tried making it always click the next button by using the xpath, but I still run into the same errors. I would inspect the next button element and copy the full xpath. My code is below; I am using Python 3.7.9 and pip version 21.2.4.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
HTTPS = "https://"

# hard coded data to test
siteDomain = "indeed.com"
jobSearch = "Software Developer"
locationSearch = "Richmond, VA"
listOfJobs = []

def if_exists_by_id(id):
    try:
        driver.find_element_by_id(id)
    except NoSuchElementException:
        return False
    return True

def if_exists_by_class_name(class_name):
    try:
        driver.find_element_by_class_name(class_name)
    except NoSuchElementException:
        return False
    return True

def if_exists_by_xpath(xpath):
    try:
        driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True

def removeSpaces(strArray):
    newjobCounter = 0
    jobCounter = 0
    for i, word in enumerate(strArray):
        jobCounter += 1
        if strArray[i].__contains__("\n"):
            strArray[i] = strArray[i].replace("\n", " ")
        if strArray[i].__contains__("new"):
            newjobCounter += 1
        print(strArray[i] + "\n")
    if newjobCounter == 0:
        print("Unfortunately, there are no new jobs for this search")
    else:
        print("With " + str(newjobCounter) + " out of " + str(jobCounter) + " new jobs!")
    return strArray

try:
    # Goes to Site
    driver.get(HTTPS + siteDomain)
    # obtains access to elements from website
    searchJob = driver.find_element_by_name("q")
    searchLocation = driver.find_element_by_name("l")
    # clear text field
    searchJob.send_keys(Keys.CONTROL, "a", Keys.BACK_SPACE)
    searchLocation.send_keys(Keys.CONTROL, "a", Keys.BACK_SPACE)
    # inputs values into website elements
    searchJob.send_keys(jobSearch)
    searchLocation.send_keys(locationSearch)
    # presses button to search
    searchLocation.send_keys(Keys.RETURN)

    # Begin looping through pages
    pageList = driver.find_element_by_class_name("pagination")
    page = pageList.find_elements_by_tag_name("li")
    numPages = 0
    for i, x in enumerate(page):
        time.sleep(1)
        # checks for popup; if there is a popup, close it and sleep
        if if_exists_by_id("popover-x"):
            driver.find_element_by_id("popover-x").click()
            time.sleep(1)
        # increment page counter variable
        numPages += 1
        # obtains data in class name value
        jobCards = driver.find_elements_by_class_name("jobCard_mainContent")
        # prints number of jobs returned
        print(str(len(jobCards)) + " jobs in: " + locationSearch)
        # inserts each job into list of jobs array
        # commented out to make debugging easier
        # for jobCard in jobCards:
        #     listOfJobs.append(jobCard.text)

        # supposed to click the next page, but keeps alternating
        # between next page and previous page
        driver.find_element_by_class_name("np").click()
        print("On page number: " + str(numPages))
    # print(removeSpaces(listOfJobs))
except ValueError:
    print(ValueError)
finally:
    driver.quit()
Any help will be greatly appreciated. Also, if I am implementing bad coding practices in the structure of the script, please let me know, as I am trying to learn as much as possible! :)
I have tested your code. The thing is, there are 2 'np' class elements once we go to the 2nd page. What you can do is, the first time, use find_element_by_class_name('np'), and every other time use find_elements_by_class_name('np')[1], which will select the next button. You can use find_elements_by_class_name('np')[0] for the previous button if needed. Here is the code!
if i == 0:
    driver.find_element_by_class_name("np").click()
else:
    driver.find_elements_by_class_name("np")[1].click()
Just replace the line driver.find_element_by_class_name("np").click() with the code snippet above. I have tested it and it worked like a charm.
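For clarity, this is roughly where that replacement sits inside the existing for loop from the question (a sketch; the surrounding lines are unchanged):
for i, x in enumerate(page):
    time.sleep(1)
    # ... popup check and job-card scraping as before ...
    if i == 0:
        driver.find_element_by_class_name("np").click()  # first page: only one 'np' element
    else:
        driver.find_elements_by_class_name("np")[1].click()  # later pages: the second 'np' is the next button
    print("On page number: " + str(numPages))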
Also, I am not as experienced as the other devs here, but I am glad if I could help you. (This is my first answer ever on Stack Overflow.)

Analysing YouTube comments using Python -- parameter has disabled comments

I'm trying to get into text analysis using YouTube comments. I've been using the code from the following website to scrape YouTube:
https://www.pingshiuanchua.com/blog/post/using-youtube-api-to-analyse-youtube-comments-on-python
The script starts working, but there is a section of the code that generates an error if comments have been disabled, and I can't find a way to check whether comments are disabled (or whether any comments exist) and simply skip that video if there are no comments to scrape, then continue on to the next video.
The code chunk in question creating the error is:
# =============================================================================
# Get Comments of Top Videos
# =============================================================================
video_id_pop = []
channel_pop = []
video_title_pop = []
video_desc_pop = []
comments_pop = []
comment_id_pop = []
reply_count_pop = []
like_count_pop = []

from tqdm import tqdm

for i, video in enumerate(tqdm(video_id, ncols = 100)):
    response = service.commentThreads().list(
        part = 'snippet',
        videoId = video,
        maxResults = 100,  # Only take top 100 comments...
        order = 'relevance',  # ... ranked on relevance
        textFormat = 'plainText',
    ).execute()

    comments_temp = []
    comment_id_temp = []
    reply_count_temp = []
    like_count_temp = []
    for item in response['items']:
        comments_temp.append(item['snippet']['topLevelComment']['snippet']['textDisplay'])
        comment_id_temp.append(item['snippet']['topLevelComment']['id'])
        reply_count_temp.append(item['snippet']['totalReplyCount'])
        like_count_temp.append(item['snippet']['topLevelComment']['snippet']['likeCount'])

    comments_pop.extend(comments_temp)
    comment_id_pop.extend(comment_id_temp)
    reply_count_pop.extend(reply_count_temp)
    like_count_pop.extend(like_count_temp)

    video_id_pop.extend([video_id[i]]*len(comments_temp))
    channel_pop.extend([channel[i]]*len(comments_temp))
    video_title_pop.extend([video_title[i]]*len(comments_temp))
    video_desc_pop.extend([video_desc[i]]*len(comments_temp))

query_pop = [query] * len(video_id_pop)
Edited to add:
The person who created the code left a message to fix the error saying:
"You can wrap the query part of the code in a try...except statement, where if the try statement (the query part) failed, you can push an except of blank response or "error" string into the list."
I have no idea how to carry this out, if that makes sense to anyone else...
Note: this is not necessarily "good" coding style, but it's the sort of thing I would do if I ran into this problem when I was writing a script for my own short-term, personal use.
Python (and many other languages) has a way to catch exceptions and handle them without crashing. Used properly, this can be a very nice way to handle bad data.
https://docs.python.org/3.8/tutorial/errors.html is a good overview of exceptions. In general, the format they take is something like
try:
    code_that_can_error()
except ExceptionThatWillBeThrown as ex:
    handle_exception()
    print(ex)  # ex is an object that has information about what went wrong
finally:
    clean_up()
(finally is particularly useful if you have something you need to call close on, like a file. If an exception is thrown, you might never reach the close call, but a finally block is guaranteed to run, even if an exception is thrown.)
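As a tiny illustration of that point (a sketch; the filename is just an example):
f = open('example.txt')   # hypothetical file
try:
    data = f.read()       # may raise, e.g. on a decode error
finally:
    f.close()             # runs whether or not an exception was thrown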
In your case, all we need is to ignore the error and move on to the next video.
for i, video in enumerate(tqdm(video_id, ncols = 100)):
    try:
        response = service.commentThreads().list(
            part = 'snippet',
            videoId = video,
            maxResults = 100,  # Only take top 100 comments...
            order = 'relevance',  # ... ranked on relevance
            textFormat = 'plainText',
        ).execute()
        comments_temp = []
        [...]
        video_desc_pop.extend([video_desc[i]]*len(comments_temp))
    except:
        # Something threw an error. Skip that video and move on
        print(f"{video} has comments disabled, or something else went wrong")

query_pop = [query] * len(video_id_pop)
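If you'd prefer not to use a bare except, one option is to catch the API client's own error type (a sketch, assuming the google-api-python-client library the tutorial uses; the rest of the loop stays as above):
from googleapiclient.errors import HttpError  # ships with google-api-python-client

for i, video in enumerate(tqdm(video_id, ncols = 100)):
    try:
        response = service.commentThreads().list(
            part = 'snippet',
            videoId = video,
            maxResults = 100,
            order = 'relevance',
            textFormat = 'plainText',
        ).execute()
    except HttpError as ex:
        # videos with comments disabled (or any other API failure) end up here
        print(f"{video} skipped: {ex}")
        continue
    # ... process response['items'] and extend the *_pop lists as before ...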

Scroll Height Returns "None" in Selenium: [ arguments[0].scrollHeight ]

I'm working on a Python bot with Selenium, and infinite scrolling in a dialog box isn't working due to a "None" return from "arguments[0].scrollHeight":
dialogBx = driver.find_element_by_xpath("//div[@role='dialog']/div[2]")
print(dialogBx) #<selenium.webdriver.remote.webelement.WebElement (session="fcec89cc11fa5fa5eaf29a8efa9989f9", element="31bfd470-de78-XXXX-XXXX-ac1ffa6224c4")>
print(type(dialogBx)) #<class 'selenium.webdriver.remote.webelement.WebElement'>
sleep(5)
last_height=driver.execute_script("arguments[0].scrollHeight",dialogBx);
print("Height : ",last_height) #None
I need the last height to compare; please suggest a solution.
Ok, to answer your question: since you are inside a dialog, we should focus on it. When you execute last_height=driver.execute_script("arguments[0].scrollHeight",dialogBx); I believe you are executing that in the main page or in the wrong div (not 100% sure). Either way, I took a different approach: we are going to select the last <li> item currently available in the dialog and scroll down to its position, which will force the dialog to update. I will extract a piece of the full code you will see below:
last_li_item = driver.find_element_by_xpath('/html/body/div[4]/div/div[2]/ul/div/li[{p}]'.format(p=start_pos))
last_li_item.location_once_scrolled_into_view
We first select the last list item and then access the property location_once_scrolled_into_view. This property scrolls our dialog down to the last item, which then loads more items. start_pos is just the position of the last li element currently available, i.e.: for <div><li></li><li></li><li></li></div>, start_pos=2, which is the last li item counting from 0. I used this variable name because it is inside a for loop that watches the changes of li items inside the div; you will get it once you see the full code.
To execute this, simply change the parameters at the top and run the test function test(). If you are already logged in to Instagram, you can just run get_list_of_followers().
Note: this function uses a Follower class that is also in this code. You can remove it if you wish, but then you will need to modify the function.
IMPORTANT:
When you execute this program, the dialog box items will keep increasing until there are no more items to load, so a TODO would be to remove the elements you have already processed; otherwise I believe performance will get slower when you start hitting big numbers!
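A rough sketch of that TODO (hypothetical and untested against Instagram's current DOM; li_element stands for a <li> WebElement you have already processed):
# sketch: drop an already-processed <li> from the DOM to keep the dialog light
driver.execute_script("arguments[0].remove();", li_element)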
Let me know if you need any other explanation. Now the code:
import time
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement

# instagram url as our base
base_url = "https://www.instagram.com"

# =====================MODIFY THESE TO YOUR NEED=========
# the user we wish to get the followers from
base_user = "/nasa/"
# how much do you wish to sleep to wait for loading (seconds)
sleep_time = 3
# True will attempt login with facebook, False with instagram
login_with_facebook = True
# Credentials here
username = "YOUR_USERNAME"
password = "YOUR_PASSWORD"
# How many users do you wish to retrieve? -1 = all or n>0
get_users = 10
# ==========================================================

# This is the div that contains all the followers info, not the dialog box itself
dialog_box_xpath = '/html/body/div[4]/div/div[2]/ul/div'
total_followers_xpath = '/html/body/div[1]/section/main/div/header/section/ul/li[2]/a/span'
followers_button_xpath = '/html/body/div[1]/section/main/div/header/section/ul/li[2]/a'
insta_username_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[2]/div/label/input'
insta_pwd_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[3]/div/label/input'
insta_login_button_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[4]/button'
insta_fb_login_button_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[6]/button'
fb_username_xpath = '/html/body/div[1]/div[3]/div[1]/div/div/div[2]/div[1]/form/div/div[1]/input'
fb_pwd_xpath = '/html/body/div[1]/div[3]/div[1]/div/div/div[2]/div[1]/form/div/div[2]/input'
fb_login_button_xpath = '/html/body/div[1]/div[3]/div[1]/div/div/div[2]/div[1]/form/div/div[3]/button'

u_path = fb_username_xpath if login_with_facebook else insta_username_xpath
p_path = fb_pwd_xpath if login_with_facebook else insta_pwd_xpath
lb_path = fb_login_button_xpath if login_with_facebook else insta_login_button_xpath


# Simple class for a follower; you don't actually need this, but it helps for explanation.
class Follower:
    def __init__(self, user_name, href):
        self.username = user_name
        self.href = href

    @property
    def get_username(self):
        return self.username

    @property
    def get_href(self):
        return self.href

    def __repr__(self):
        return self.username


def test():
    base_user_path = base_url + base_user
    driver = webdriver.Chrome()
    driver.get(base_user_path)
    # click the followers button; it will ask for login
    driver.find_element_by_xpath(followers_button_xpath).click()
    time.sleep(sleep_time)
    # now we decide if we will login with facebook or instagram
    if login_with_facebook:
        driver.find_element_by_xpath(insta_fb_login_button_xpath).click()
        time.sleep(sleep_time)
    username_input = driver.find_element_by_xpath(u_path)
    username_input.send_keys(username)
    password_input = driver.find_element_by_xpath(p_path)
    password_input.send_keys(password)
    driver.find_element_by_xpath(lb_path).click()
    # We need to wait a little longer for the page to load. Feel free to change this to your needs.
    time.sleep(10)
    # click the followers button again
    driver.find_element_by_xpath(followers_button_xpath).click()
    time.sleep(sleep_time)
    # now we get the list of followers from the dialog box. This function will return a list of follower objects.
    followers: list[Follower] = get_list_of_followers(driver, dialog_box_xpath, get_users)
    # close the driver, we do not need it anymore.
    driver.close()
    for follower in followers:
        print(follower, follower.get_href)


def get_list_of_followers(driver, d_xpath=dialog_box_xpath, get_items=10):
    """
    Get a list of followers from instagram
    :param driver: driver instance
    :param d_xpath: dialog box xpath. By default it gets the global parameter but you can change it
    :param get_items: how many items do you wish to obtain? -1 = try to get all of them. Any positive number will be
        the number of followers to obtain
    :return: list of follower objects
    """
    # getting the dialog content element
    dialog_box: WebElement = driver.find_element_by_xpath(d_xpath)
    # getting all the list items (<li></li>) inside the dialog box.
    dialog_content: list[WebElement] = dialog_box.find_elements_by_tag_name("li")
    # Get the total number of followers. Since we get a string we need to convert to int by int(<str>)
    total_followers = int(driver.find_element_by_xpath('/html/body/div[1]/section/main/div/header/section/ul/li['
                                                       '2]/a/span').get_attribute("title").replace(".", ""))
    # how many items do we have without scrolling down?
    li_items = len(dialog_content)
    # We are trying to get n elements (n=get_items variable). Now we need to check if there are enough followers to
    # retrieve from; if not, we will get the max quantity of followers. This applies only if n is >=0. If -1 then the
    # total amount of followers is n
    if get_items == -1:
        get_items = total_followers
    elif -1 < get_items <= total_followers:
        # no need to change anything, it is ok to work with get_items
        pass
    else:
        # if it is not -1 and not between 0 and total followers then we raise an error
        raise IndexError
    # You can start from greater than 0 but that will give you a shorter list of followers than what you wish if
    # there are not enough followers available. i.e.: total_followers = 10, get_items=10, start_from=1. This will only
    # return 9 followers, not 10, even if get_items is 10.
    return generate_followers(0, get_items, total_followers, dialog_box, driver)


def generate_followers(start_pos, get_items, total_followers, dialog_box_element: WebElement, driver):
    """
    Generate followers based on the parameters
    :param start_pos: index of where to start getting the followers from
    :param get_items: total items to get
    :param total_followers: total number of followers
    :param dialog_box_element: dialog box to get the list items count
    :param driver: driver object
    :return: followers list
    """
    if -1 < start_pos < total_followers:
        # we want to count items from our current position until the last element available without scrolling. We do
        # it this way so when we scroll down, the list items will be greater but we will start generating followers
        # from our last current position not from the beginning!
        first = dialog_box_element.find_element_by_xpath("./li[{pos}]".format(pos=start_pos + 1))
        li_items = dialog_box_element.find_elements_by_xpath("./li[position()={pos}][last("
                                                             ")]/following-sibling::li"
                                                             .format(pos=(start_pos + 1)))
        li_items.insert(0, first)
        print("Generating followers from position: {pos} with {li_count} list items"
              .format(pos=(start_pos + 1), li_count=len(li_items)))
        followers = []
        for i in range(len(li_items)):
            anchors = li_items[i].find_elements_by_tag_name("a")
            anchor = anchors[0] if len(anchors) == 1 else anchors[1]
            follower = Follower(anchor.text, anchor.get_attribute("href"))
            followers.append(follower)
            get_items -= 1
            start_pos += 1
            print("Follower {f} added to the list".format(f=follower))
            # we break the loop if our starting position has reached the total number of followers or if get_items
            # has reached 0 (meaning if we requested 10 items we got them all, no need to continue)
            if start_pos >= total_followers or get_items == 0:
                print("finished")
                return followers
        print("finished loop, executing scroll down...")
        last_li_item = driver.find_element_by_xpath('/html/body/div[4]/div/div[2]/ul/div/li[{p}]'.format(p=start_pos))
        last_li_item.location_once_scrolled_into_view
        time.sleep(sleep_time)
        followers.extend(generate_followers(start_pos, get_items, total_followers, dialog_box_element, driver))
        return followers
    else:
        raise IndexError

Looking for an element across several pages with a "while" loop - Python, Selenium

In my application, I have a table of employees, but the table can have more than one page of employees. I want to check whether the new employee (which I created) was added; I want to find that employee in the table and click on it, with Selenium WebDriver in Python. The whole idea is: check the first page; if there is no employee with the id I'm looking for, click the second page and check there, and if it's not there, click the 3rd page, etc. I have a function that goes through the pages one by one, but it doesn't check for the employee I need on those pages:
id = ()

def add_new_employee(driver, first_name, last_name):
    driver.find_element_by_css_selector("#menu_pim_viewPimModule").click()
    driver.find_element_by_css_selector("[name='btnAdd']").click()
    driver.find_element_by_css_selector("#firstName").send_keys(first_name)
    driver.find_element_by_css_selector("#lastName").send_keys(last_name)
    driver.find_element_by_css_selector("#photofile").\
        send_keys(os.path.abspath("cloud-computing-IT.jpg"))
    global id
    id = driver.find_element_by_css_selector("#employeeId").get_attribute("value")

def new_employee_added(driver):
    global id
    driver.find_element_by_css_selector("#menu_pim_viewPimModule").click()
    el = len(driver.find_elements_by_link_text("%s" % id))
    while el < 1:
        try:
            driver.find_element_by_link_text("%s" % id).click()
        except NoSuchElementException:
            try:
                for i in range(1, 50):
                    driver.find_element_by_link_text("%s" % i).click()
            except NoSuchElementException:
                return False

def test_new_employee(driver, first_name="Patrick", last_name="Patterson"):
    login(driver, username="Admin", password="Password")
    add_new_employee(driver, first_name, last_name)
    new_employee_added(driver)
    logout(driver)
The problem is in this function:
def new_employee_added(driver):
    global id
    driver.find_element_by_css_selector("#menu_pim_viewPimModule").click()
    el = len(driver.find_elements_by_link_text("%s" % id))
    while el < 1:
        try:
            driver.find_element_by_link_text("%s" % id).click()
        except NoSuchElementException:
            try:
                for i in range(1, 50):
                    driver.find_element_by_link_text("%s" % i).click()
            except NoSuchElementException:
                return False
The loop should try to find the element on the 1st page; if it's not there, go to the 2nd page and check there. But it seems like it tries to find it on the 1st page and then just runs this piece of the loop:
for i in range(1, 50):
    driver.find_element_by_link_text("%s" % i).click()
clicking through the pages rather than trying to find the employee.
Your main loop should iterate through all pages and break once you find the required id. In your code you're trying to find the id on the first page, and if it's absent from the first page you just iterate through the other pages without searching for the required id.
As it's hard to provide a good solution without the actual HTML source, the lines below are kind of pseudo code:
for i in range(1, 50):
    try:
        # search for required element
        driver.find_element_by_link_text("%s" % id).click()
        break
    except NoSuchElementException:
        # go to next page as there is no required element on current page
        driver.find_element_by_link_text("%s" % i).click()
P.S. It seems that there is no reason to use global in your function; you can simply add a parameter to new_employee_added, as in new_employee_added(driver, id), and call it with the appropriate value.
P.P.S. Do not use "id" as a variable name, as id() is a Python built-in function.
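Putting both P.S. points together, a rough sketch of what the function could look like (employee_id is a hypothetical parameter name replacing the global; the 50-page limit mirrors the question):
def new_employee_added(driver, employee_id):
    driver.find_element_by_css_selector("#menu_pim_viewPimModule").click()
    for i in range(1, 50):
        try:
            # search for the employee link on the current page
            driver.find_element_by_link_text("%s" % employee_id).click()
            return True
        except NoSuchElementException:
            # not on this page; click the link to page i and try again
            driver.find_element_by_link_text("%s" % i).click()
    return False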

How to get around "MissingSchema" error in Python?

After running my script I noticed that my "parse_doc" function throws an error when it finds any url that is None. It turns out that my "process_doc" function was supposed to produce 25 links, but it produces only 19, because a few pages don't have any link leading to another page. However, when my second function receives a link with a None value, it produces that error indicating "MissingSchema". How can I get around this so that when it finds any link with a None value it will move on to the next one? Here is the partial portion of my script, which will give you an idea of what I meant:
def process_doc(medium_link):
    page = requests.get(medium_link).text
    tree = html.fromstring(page)
    try:
        name = tree.xpath('//span[@id="titletextonly"]/text()')[0]
    except IndexError:
        name = ""
    try:
        link = base + tree.xpath('//section[@id="postingbody"]//a[@class="showcontact"]/@href')[0]
    except IndexError:
        link = ""
    parse_doc(name, link)  # All links get to this function, whereas some links have a None value

def parse_doc(title, target_link):
    page = requests.get(target_link).text  # Error thrown here when it finds any link with None value
    tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
    print(title, tel)
The error I'm getting:
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?
Btw, in my first function there is a variable named "base" which is concatenated with the produced result to make a full-fledged link.
If you want to avoid cases where your target_link == None, then try
def parse_doc(title, target_link):
    if target_link:
        page = requests.get(target_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(tel)
        print(title)
First of all, make sure that your schema, meaning the URL, is correct. Sometimes you are just missing a character, or have one too many, in https://.
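Since the question mentions building the link by concatenating a base variable with the scraped href, a small sketch using urllib.parse.urljoin (the base and href values here are hypothetical) can help avoid malformed URLs:
from urllib.parse import urljoin

base = "https://example.org"  # hypothetical site root
href = "/path/to/contact"     # hypothetical value scraped from the page
link = urljoin(base, href)    # -> "https://example.org/path/to/contact"
print(link.startswith(("http://", "https://")))  # quick schema sanity check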
If you do have to handle the exception, though, you can do it like this:
import requests
from requests.exceptions import MissingSchema
...
try:
    res = requests.get(linkUrl)
    print(res)
except MissingSchema:
    print('URL is not complete')
