I have written a simple web scraping script using Selenium, but I want to scrape only the portion of the page that is visible 'before scroll'.
Say I want to scrape this page - https://en.wikipedia.org/wiki/Pandas_(software) - Selenium reads information down to the absolute last element/text, which for me is the 'Powered by MediaWiki' button at the far bottom-right of the page.
What I want Selenium to do is stop after DataFrames (see screenshot) and not scroll down to the bottom.
I also want to know where on the page it stops. I have checked multiple sources, and most of them deal with infinite-scroll websites; nobody asks about just the 'visible' half of a page.
This is my code now:
from selenium import webdriver
EXECUTABLE = r"chromedriver.exe"
# get the URL
url = "https://en.wikipedia.org/wiki/Pandas_(software)"
# open the chromedriver
driver = webdriver.Chrome(executable_path = EXECUTABLE)
# google window is maximized so that all webpages are rendered in the same size
driver.maximize_window()
# make the driver wait for 30 seconds before throwing a time-out exception
driver.implicitly_wait(30)
# get URL
driver.get(url)
for element in driver.find_elements_by_xpath("//*"):
    try:
        pass  # stuff
    except Exception:
        continue
driver.close()
Absolutely any direction is appreciated. I have tried to be as clear as possible here but let me know if any more details are required.
I don't think that is possible. Observe the DOM: all the informational elements are under one section, i.e. one tag div[@id='content'], which is already visible to Selenium. Even if you try with //*, div[@id='content'] is visible.
And checking whether an element is visible, even though it has not been scrolled to, will also return True. (If someone knows how to do what you are asking for, even I would like to know.)
from selenium import webdriver
from selenium.webdriver.support.expected_conditions import _element_if_visible

driver = webdriver.Chrome(executable_path='path to chromedriver.exe')
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://en.wikipedia.org/wiki/Pandas_(software)")

elements = driver.find_elements_by_xpath("//div[@id='content']//*")
for element in elements:
    try:
        if _element_if_visible(element):
            print(element.get_attribute("innerText"))
    except:
        break
driver.quit()
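That said, if a heuristic is enough, one direction that might work (a rough sketch I have not verified on this page; run it before the driver.quit() above) is to ask the browser itself whether each element's box starts inside the initial, unscrolled viewport, via getBoundingClientRect():
# Sketch: keep only elements whose top edge lies inside the unscrolled viewport.
IN_VIEWPORT_JS = """
const rect = arguments[0].getBoundingClientRect();
return rect.top >= 0 && rect.top < window.innerHeight;
"""

visible_before_scroll = []
for element in driver.find_elements_by_xpath("//div[@id='content']//*"):
    if driver.execute_script(IN_VIEWPORT_JS, element):
        visible_before_scroll.append(element.get_attribute("innerText"))

# The last entry is roughly "where on the page it stops".
if visible_before_scroll:
    print(visible_before_scroll[-1])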
I want to log into a forum. I have already bypassed the cookie message by switching to its iframe, but I can't access anything on the page. This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# set path, open Firefox
path = r'C:\Users\Anwender\Downloads\geckodriver-v0.30.0-win64\geckodriver.exe'
driver = webdriver.Firefox(executable_path=path)

# open bym forum
driver.get('https://www.bym.de/forum/')
driver.maximize_window()

# deal with the cookie banner
wait = WebDriverWait(driver, 20)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[title='SP Consent Message']")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[title='Zustimmen']"))).click()
driver.implicitly_wait(30)
So far, so good. Now a ton of ads pop up. I try to get to the login:
username = driver.find_element_by_id('navbar_username')
username.send_keys("name")
password = driver.find_element_by_id('navbar_password')
password.send_keys("pw")
driver.find_element_by_xpath("/html/body/div[1]/div[2]/main/div[3]/div[1]/ul/li[2]/form/fieldset/div/div/input[4]").click()
I have tried different variants of this using CSS selectors or XPath. Didn't work. Then I tried waiting 20 seconds until everything had loaded. Didn't work. I tried accessing a different element (a link to a subforum). Didn't work.
I tried:
try:
    wait.until(
        EC.presence_of_element_located((By.ID, "navbar_username"))
    )
finally:
    driver.quit()
Selenium just cannot find this element; the browser simply closes. I have looked for iframes but couldn't find any in the HTML. Is it possible I am not even on the main "window"? "Frame"? "Site"? I don't know what to call it.
Any help would be much appreciated!
You switched to an iframe and closed it, but you never switched back! Selenium is still pointed at the closed, now-absent iframe. Switch back to the main page using:
driver.switch_to.default_content()
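Roughly, the full sequence looks like this (a sketch that reuses the selectors and IDs from your own snippet):
# accept the cookie banner inside its iframe
wait.until(EC.frame_to_be_available_and_switch_to_it(
    (By.CSS_SELECTOR, "iframe[title='SP Consent Message']")))
wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, "button[title='Zustimmen']"))).click()

# switch back to the top-level document before touching the login form
driver.switch_to.default_content()

# now the login fields are reachable again
wait.until(EC.presence_of_element_located((By.ID, "navbar_username"))).send_keys("name")
driver.find_element_by_id("navbar_password").send_keys("pw")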
I'm trying to load all products of a website where around 20 products are shown and a "show more" button has to be clicked to load more products. I can see that the button is clicked when I execute the code, but the new content of the website is not loaded. In the end a loop is to be built to load all the products, but to do so the content first has to be visible.
import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from time import sleep
url = 'https://www.galaxus.ch/de/s2/sector/haushalt-2'
driver = webdriver.Chrome(r'C:\DRIVERS\chromedriver_win32\chromedriver.exe')
driver.get(url)
sleep(3)
element = driver.find_element_by_id('pageFoot')
driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
sel = Selector(text=driver.page_source)
sleep(10)
driver.find_element_by_xpath('//*[@id="pageContent"]/div[2]/div/div[5]/div[5]/div/button/span').click()
sleep(20)
driver.close()
I've also tried to click on other objects within the website but the access is then denied.
Welcome to the site. Try this locator for your button:
driver.find_element_by_css_selector(".styled__DeprecatedStyledButton-sc-1wug1tr-1.isotoG")
If you still have a problem, then scroll to a different element instead, for example to this button; as far as I can see, it is not visible if you scroll to the very bottom.
Also, time.sleep() is OK for debugging, but in real projects you should use only explicit/implicit waits, with a few exceptions where time.sleep() is the only option.
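For example, a sketch of the click using explicit waits instead of the sleeps (the button class is the auto-generated one from above and may change; the product locator is an assumption you should adjust to the real markup):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)

# count the products currently shown (assumed locator, adjust to the page)
before = len(driver.find_elements(By.CSS_SELECTOR, "article"))

# wait until the "show more" button is clickable, then click it
wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, ".styled__DeprecatedStyledButton-sc-1wug1tr-1.isotoG"))).click()

# wait until more products than before have been rendered
wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, "article")) > before)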
I'm pretty new to Selenium and I have searched a lot, but I can't find an answer to my problem.
I want to open Firefox, go to Google and search for something, store everything in a list and print it to my console.
But every time I open Firefox with Selenium, a pop-up window opens asking me to confirm Google's privacy regulations. I don't know how to close it.
Here is what I tried:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
# create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.maximize_window()
# Navigate to the application home page
driver.get("http://www.google.com")
main_page = driver.current_window_handle
time.sleep(3)
# Tried to find out which browser window I am in right now -> terminal says '19'
print('The actual browser window is: {}'.format(main_page))
driver.find_element_by_xpath("//*[@id='introAgreeButton']/span/span").click()
# get the search textbox
search_field = driver.find_element_by_id("lst-ib")
search_field.clear()
# enter search keyword and submit
search_field.send_keys("Flowers")
search_field.submit()
# get the list of elements which are displayed after the search
# currently on result page using find_elements_by_class_name method
lists = driver.find_elements_by_class_name("_Rm")
# get the number of elements found
print ("Found " + str(len(lists)) + " searches:")
# iterate through each element and print the text that is
# name of the search
i = 0
for listitem in lists:
    print(listitem.get_attribute("innerHTML"))
    i = i + 1
    if i > 10:
        break
# close the browser window
driver.quit()
The Google privacy popup is contained in an iframe. You must switch to the iframe and then look for the accept (or deny) button, like so:
wait = WebDriverWait(driver, 5)
iframe = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#cnsw > iframe')))
driver.switch_to.frame(iframe)
Then, once inside the iframe, look for the accept button and click it.
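Something like this (a sketch; the button ID is taken from your own XPath and may differ by region or language):
# click the consent button inside the iframe
wait.until(EC.element_to_be_clickable((By.ID, 'introAgreeButton'))).click()

# switch back to the main document before using the search box
driver.switch_to.default_content()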
I'm new to Python and Selenium. I'm trying to do something, which I'm sure I'm going about in a very round-about way; any help is greatly appreciated.
The page I'm trying to parse has different cards that need to be clicked on; I need to go to each card and, from there, grab the name (h1) and the URL. I haven't gotten very far, and this is what I have so far.
I go through the first page, grab all the URLs and add them to a list. Then I want to go through the list, go to each URL (opening a new tab) and, from there, grab the h1 and URL. It doesn't seem like I'm even able to grab the h1; it opens a new tab, then hangs, then opens the same tab.
Thank you in advance!
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise//') #main URL
title_links = driver.find_elements_by_css_selector('ul.n4 a')
urls = [] #list of URLs
# main = driver.find_elements_by_id('enterprise-list')
for item in title_links:
    urls.append(item.get_attribute('href'))
# print(urls)

for url in urls:
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(url)
    print(driver.find_element_by_css_selector('div.info h1'))
Well, there are a few issues here:
You should be much more specific with your selector for grabbing the URLs. Yours returns multiple copies of the same URL; that's why it keeps opening the same pages.
You should give the site enough time to load before trying to grab objects; that may be why it hangs, and it is always good to be on the safe side before grabbing objects.
You have to shift focus back to the original page to continue iterating over the list.
You don't need to inject JS to open an empty tab and then use a separate Python call to load the URL; one window.open(url) does both, and the JS formatting could be cleaner.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise/')  # main URL

# Be much more specific or you'll get multiple returns of the same link
urls = driver.find_elements(By.CSS_SELECTOR, 'ul.n4 li div.img a')

for url in urls:
    # get the href to print
    print(url.get_attribute('href'))
    # Inject JS to open the link in a new tab (an anchor element
    # passed to window.open() stringifies to its href)
    driver.execute_script("window.open(arguments[0])", url)
    # Switch focus to the newest tab ([-1], since tabs accumulate
    # while driver.close() below stays commented out)
    driver.switch_to.window(driver.window_handles[-1])
    # Make sure what we want has loaded and exists before trying to grab it
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    # Grab it and print its contents
    print(driver.find_element(By.CSS_SELECTOR, 'div.info h1').text)
    # Uncomment the next line to do one tab at a time. Slower, but uses less RAM.
    # driver.close()
    # Focus back on the first window
    driver.switch_to.window(driver.window_handles[0])

# Close the browser
driver.quit()
I am trying to run a Selenium WebDriver script in Python, where I am trying to click on a search field, but it always throws the exception "An element could not be located on the page using the given search parameters."
Here is the script:
from selenium import webdriver
from selenium.webdriver.common.by import By

class Exercise:
    def safari(self):
        driver = webdriver.Safari()
        driver.maximize_window()
        url = "https://www.airbnb.com"
        driver.implicitly_wait(15)
        driver.get(url)
        Title = driver.title
        CurrentURL = driver.current_url
        print("Current URL is " + CurrentURL)
        SearchButton = driver.find_element(By.XPATH, "//*[@id='GeocompleteController-via-SearchBarV2-SearchBarV2']")
        SearchButton.click()

note = Exercise()
note.safari()
Please tell me where I am going wrong.
There appear to be two matching cases:
The one that actually matches the search bar is the second one, so you'd edit your XPath as follows:
SearchButton = driver.find_element(By.XPATH, "(//*[@id='GeocompleteController-via-SearchBarV2-SearchBarV2'])[2]")
Or simply:
SearchButton = driver.find_element_by_xpath("(//*[@id='GeocompleteController-via-SearchBarV2-SearchBarV2'])[2]")
You can paste your XPath into Chrome's Inspector tool (as seen above) by loading the same website in Google Chrome and hitting F12 (or just right-clicking anywhere and clicking "Inspect"). This shows you the matching elements; if you scroll to 2 of 2, it highlights the search bar. Therefore, we want the second result. XPath indices start at 1, unlike most languages (whose indices usually start at 0), so to get the second match, wrap the entire original XPath in parentheses and append [2].
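As a generic illustration of the indexing rule (hypothetical selector, just to show the pattern):
# XPath positions are 1-based: [1] is the first match, [2] the second
first = driver.find_element(By.XPATH, "(//a[@class='nav-item'])[1]")
second = driver.find_element(By.XPATH, "(//a[@class='nav-item'])[2]")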