I'm new to Python and Selenium, and I'm sure I'm going about this in a very round-about way; any help is greatly appreciated.
The page I'm trying to parse has different cards that need to be clicked on; I need to go to each card and grab the name (h1) and the URL. I haven't gotten very far, and this is what I have so far.
I go through the first page, grab all the URLs, and add them to a list. Then I want to iterate over the list, open each URL in a new tab, and grab the h1 and URL from there. It doesn't seem like I'm even able to grab the h1: it opens a new tab, hangs, then opens the same tab again.
Thank you in advance!
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise//') #main URL
title_links = driver.find_elements_by_css_selector('ul.n4 a')
urls = [] #list of URLs
# main = driver.find_elements_by_id('enterprise-list')
for item in title_links:
    urls.append(item.get_attribute('href'))
# print(urls)
for url in urls:
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(url)
    print(driver.find_element_by_css_selector('div.info h1'))
Well, there are a few issues here:
You should be much more specific with your selector for grabbing URLs. Being too broad leads to multiple copies of the same URL; that's why it is opening the same pages again.
You should give the site enough time to load before trying to grab elements. That may be why it's hanging, and it's always good to be on the safe side before grabbing objects.
You have to shift focus back to the original page to continue iterating the list.
You don't need to inject JS to open a blank tab and then make a separate Python call to load the URL; a single window.open with the URL as an argument is cleaner.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise/') # main URL
# Be much more specific or you'll get multiple returns of the same link
urls = driver.find_elements(By.CSS_SELECTOR, 'ul.n4 li div.img a')
for url in urls:
    # get href to print
    print(url.get_attribute('href'))
    # Inject JS to open the link in a new tab (an anchor element stringifies to its href)
    driver.execute_script("window.open(arguments[0])", url)
    # Switch focus to the newest tab
    driver.switch_to.window(driver.window_handles[-1])
    # Make sure what we want has time to load and exists before trying to grab it
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    # Grab it and print its contents
    print(driver.find_element(By.CSS_SELECTOR, 'div.info h1').text)
    # Uncomment the next line to close each tab as you go; slower, but uses less RAM
    # driver.close()
    # Focus back on the first window
    driver.switch_to.window(driver.window_handles[0])
# Close the browser
driver.quit()
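As a side note, if you collect the hrefs up front (as in the question), you can skip tabs entirely and navigate in a single window. A minimal sketch reusing the same selectors and imports as above:

links = [a.get_attribute('href') for a in driver.find_elements(By.CSS_SELECTOR, 'ul.n4 li div.img a')]
for link in links:
    # Visit each detail page directly; no tab juggling needed
    driver.get(link)
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    print(driver.find_element(By.CSS_SELECTOR, 'div.info h1').text)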
I am trying to download the daily report from the NSE-India website using Selenium and Python.
Approach to download the daily report:
The website loads with no data.
After some time, the page is loaded with the report information.
Once the page is loaded with report data, the table //table[@id='etfTable'] appears.
An explicit wait is added in the code to wait until //table[@id='etfTable'] loads.
Code for the explicit wait:
element=WebDriverWait(driver,50).until(EC.visibility_of_element_located(By.xpath,"//table[@id='etfTable']"))
Extract the element with the onclick event using XPath:
downloadcsv= driver.find_element_by_xpath("//div[@id='esw-etf']/div[2]/div/div[3]/div/ul/li/a")
Trigger the click to download the file.
Full code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
options =webdriver.ChromeOptions();
prefs={"download.default_directory":"/Volumes/Project/WebScraper/downloadData"};
options.binary_location=r'/Applications/Google Chrome 2.app/Contents/MacOS/Google Chrome'
chrome_driver_binary =r'/usr/local/Caskroom/chromedriver/94.0.4606.61/chromedriver'
options.add_experimental_option("prefs",prefs)
driver =webdriver.Chrome(chrome_driver_binary,options=options)
try:
    #driver.implicitly_wait(10)
    driver.get('https://www.nseindia.com/market-data/exchange-traded-funds-etf')
    element =WebDriverWait(driver,50).until(EC.visibility_of_element_located(By.xpath,"//table[@id='etfTable']"))
    downloadcsv= driver.find_element_by_xpath("//div[@id='esw-etf']/div[2]/div/div[3]/div/ul/li/a")
    print(downloadcsv)
    downloadcsv.click()
    time.sleep(5)
    driver.close()
except:
    print("Invalid URL")
The issue I am facing:
The page keeps on loading when launched via Selenium, but when launched without Selenium the daily report loads fine.
(Screenshots: normal load vs. loading via Selenium.)
I am not able to download the daily report.
There are some syntax errors in the program, like the semicolons on a few lines, and the brackets that are missing around the locator tuple when finding the element with WebDriverWait.
Try it like below and confirm.
You can use JavaScript to click on that element.
driver.get("https://www.nseindia.com/market-data/exchange-traded-funds-etf")
element =WebDriverWait(driver,50).until(EC.visibility_of_element_located((By.XPATH,"//table[#id='etfTable']/tbody/tr[2]")))
downloadcsv= driver.find_element_by_xpath("//img[#title='csv']/parent::a")
print(downloadcsv)
driver.execute_script("arguments[0].click();",downloadcsv)
It's not an issue with your code; it's an issue with the website. When I checked, most of the time it did not allow me to click on the CSV file. Instead of downloading the CSV file, you can scrape the table.
from time import sleep
from bs4 import BeautifulSoup

# When going directly to the page, deleting cookies is very important, otherwise access is denied
browser.delete_all_cookies()
browser.get('https://www.nseindia.com/market-data/exchange-traded-funds-etf')
sleep(5)
soup = BeautifulSoup(browser.page_source, 'html.parser')
# scrape the table from the soup
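As a rough sketch of that last step (the etfTable id is taken from the question's XPath; the exact cell layout is an assumption), the table could be dumped to CSV like this:

import csv

table = soup.find('table', id='etfTable')  # id assumed from the question's XPath
with open('etf_report.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in table.find_all('tr'):
        # each header/data cell in the row becomes one CSV field
        cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        if cells:
            writer.writerow(cells)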
I have written a simple web-scraping script using Selenium, but I want to scrape only the portion that is visible 'before scroll'.
Say it is this page I want to scrape - https://en.wikipedia.org/wiki/Pandas_(software) - Selenium reads information up to the absolute last element/text, which for me is the 'Powered by MediaWiki' button on the far bottom-right of the page.
What I want Selenium to do is stop after DataFrames (see screenshot) and not scroll down to the bottom.
And I also want to know where on the page it stops. I have checked multiple sources and most of them ask for infinite scroll websites. No one asks for just the 'visible' half of a page.
This is my code now:
from selenium import webdriver
EXECUTABLE = r"chromedriver.exe"
# get the URL
url = "https://en.wikipedia.org/wiki/Pandas_(software)"
# open the chromedriver
driver = webdriver.Chrome(executable_path = EXECUTABLE)
# google window is maximized so that all webpages are rendered in the same size
driver.maximize_window()
# make the driver wait for 30 seconds before throwing a time-out exception
driver.implicitly_wait(30)
# get URL
driver.get(url)
for element in driver.find_elements_by_xpath("//*"):
    try:
        pass  # stuff: process each element here
    except:
        continue
driver.close()
Absolutely any direction is appreciated. I have tried to be as clear as possible here but let me know if any more details are required.
I don't think that is possible. Observe the DOM: all the informational elements are under one tag, div[@id='content'], which is already visible to Selenium. Even if you try with //*, div[@id='content'] is visible.
And trying to check whether an element is visible though not yet scrolled to will also return True. (If someone knows how to do what you are asking for, even I would like to know.)
from selenium import webdriver
from selenium.webdriver.support.expected_conditions import _element_if_visible
driver = webdriver.Chrome(executable_path = 'path to chromedriver.exe')
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://en.wikipedia.org/wiki/Pandas_(software)")
elements = driver.find_elements_by_xpath("//div[@id='content']//*")
for element in elements:
    try:
        if _element_if_visible(element):
            print(element.get_attribute("innerText"))
    except:
        break
driver.quit()
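That said, if the goal is strictly "what is inside the initial viewport", one possible workaround (not from the answer above, and only a sketch) is to compare each element's position against window.innerHeight via JavaScript; getBoundingClientRect is a standard DOM API:

# Assumes the page has just loaded and has not been scrolled
viewport_height = driver.execute_script("return window.innerHeight;")
for element in driver.find_elements_by_xpath("//div[@id='content']//*"):
    # getBoundingClientRect().top is measured relative to the current viewport
    top = driver.execute_script("return arguments[0].getBoundingClientRect().top;", element)
    if 0 <= top < viewport_height:
        print(element.get_attribute("innerText"))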
I am parsing a file with a ton of colleges. Selenium googles "Admissions " + college_name then clicks the first link and gets some data from each page. The issue is that the list of college names I am pulling from is very rough (technically a list of all accredited institutions in America), so some of the links are broken or get stuck in a load loop. How do I set some sort of timer that basically says
if page load time > x seconds:
    go to next element in list
You could invoke WebDriverWait on the page, and if the wait raises a TimeoutException you will know it took too long to load, so you can proceed to the next one.
Given you do not know what each page HTML will look like, this is a very challenging problem.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# list of college names
names = []
for name in names:
    # search for the college here
    # get list of search results
    WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='rc']")))
    search_results = driver.find_elements_by_xpath("//div[@class='rc']")
    # get first result
    search_result = search_results[0]
    # attempt to load the page
    try:
        search_result.click()
    except TimeoutException:
        # click operation should time out if the next page does not load
        # pass to move on to next URL
        pass
This is a very rough, general outline. As I mentioned, without knowing what the expected page title will be, or what the expected page content will look like, it's incredibly difficult to write a generic method that will successfully accomplish this. This code is meant to be just a starting point for you.
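If you want the literal "page load time > x seconds" behaviour from the question, Selenium also offers driver.set_page_load_timeout, which makes navigations (and clicks that trigger them) raise TimeoutException on their own. A minimal sketch, with an arbitrary 10-second limit:

# Any navigation that takes longer than 10 seconds now raises TimeoutException
driver.set_page_load_timeout(10)
try:
    search_result.click()
except TimeoutException:
    # took longer than 10 seconds; stop loading and move on to the next college
    driver.execute_script("window.stop();")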
I'm pretty new to Python and just completed the 'Automate the Boring Stuff with Python' course. I have a script that works well for going to a site, grabbing all the necessary data I need, and printing it to my console. I'm having a problem, though, with how to actually save/export that data to a file. For now I'd like to be able to export it to a .txt or a .csv file. Any help is appreciated, as I can't find a straightforward answer on the web. I just need that last step to complete my project, thanks!
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
browser = webdriver.Chrome()
def getTen():
    # Opens the browser and navigates to the page
    browser.get('http://taxsales.lgbs.com/map?lat=29.437693458470175&lon=-98.4618145&zoom=9&offset=0&ordering=sale_date,street_name,address_full,uid&sale_type=SALE,RESALE,STRUCK%20OFF,FUTURE%20SALE&county=BEXAR%20COUNTY&state=TX&in_bbox=-99.516502,28.71637426279382,-97.407127,30.153924134433552')
    # Waits until the page is loaded, then clicks the accept button on the popup window
    WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[2]/div/div/div[2]/button[1]"))).click()
    # Loops through 10 times, changing the listing number to correspond with i
    for i in range(1, 11):
        clickable = "/html/body/ui-view/div[2]/main/aside/div[2]/property-listing[" + str(i) + "]/article/div/a"
        # Waits until the page is loaded, then clicks the view-more-details button on the result
        WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, clickable))).click()
        # Waits until the page is loaded, then pulls all the info from the top section of the page
        info = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "/html/body/ui-view/div/main/div/div/div[2]/property-detail-info/dl[1]")))
        # prints info to the console
        print(info.text)
        # goes back a page and repeats the process for the next listing
        browser.back()
getTen()
If you are trying to save info.text, you can just open a local file and write to it. For example:
with open('output.txt', 'w') as f:
    f.write(info.text)
More info on reading and writing to files can be found here: Reading and Writing Files in Python
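Since you mentioned .csv as well, here is a hedged sketch using the standard csv module; collecting rows inside the loop and splitting info.text on newlines are both assumptions about your data:

import csv

rows = []
# hypothetical: inside getTen()'s loop, replace print(info.text) with:
# rows.append(info.text.splitlines())

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        # one listing per CSV row
        writer.writerow(row)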
You can simply take the text you output/get in the console and use the code below to write it to a .txt file.
output.txt is the file name you want to save to, and elements.text is the element or text you want to save in the .txt file.
with open('output.txt', 'w') as f:
    f.write(elements.text)
I've written a script in Python in combination with Selenium to get the titles of some images from a webpage. The thing is, the content I would like to parse is located near the bottom of that page, so if I try to grab it the conventional way, the browser fails.
So I used JavaScript within my scraper to make the browser scroll to the bottom, and it worked.
However, I don't think it's a good solution to keep, so I tried .scrollIntoView(), but that didn't work either. What would be the ideal way to achieve this?
This is my script:
from selenium import webdriver
import time
URL = "https://www.99acres.com/supertech-cape-town-sector-74-noida-npxid-r922?sid=UiB8IFFTIHwgUyB8IzMxIyAgfCAxIHwgNyM0MyMgfCA4MjEyIHwjNSMgIHwg"
driver = webdriver.Chrome()
driver.get(URL)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") #I don't wish to keep this line
time.sleep(3)
for item in driver.find_elements_by_css_selector("#carousel img"):
    print(item.get_attribute("title"))
driver.quit()
Try the code below; it should allow you to scroll to the required node and scrape the images:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
banks = driver.find_element_by_id("xidBankSection")
driver.execute_script("arguments[0].scrollIntoView();", banks)
images = WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#carousel img")))
for image in images:
    print(image.get_attribute("title"))
Some explanation: initially those images are absent from the source code and are generated inside the BankSection only once you have scrolled to it, so you need to scroll down to the BankSection and wait until the images are generated.
You can try the below lines of code:
recentList = driver.find_elements_by_css_selector("#carousel img")
for item in recentList:
    driver.execute_script("arguments[0].scrollIntoView();", item)
    print(item.get_attribute("title"))