Unable to scrape kosis.kr even with selenium - python

I am trying to scrape data from the link below, but I cannot get the HTML elements. I am using Selenium with Python. When I do print(driver.page_source), it prints just a bunch of JS, like when you try to scrape a JavaScript-driven website with BeautifulSoup. I waited longer for the whole page to render, but the Selenium driver still cannot get the rendered HTML elements. So how do I scrape it?
https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1JH20151&vw_cd=MT_ETITLE&list_id=J1_10&scrId=&language=en&seqNo=&lang_mode=en&obj_var_id=&itm_id=&conn_path=MT_ETITLE&path=%252Feng%252FstatisticsList%252FstatisticsListIndex.do
I am trying to scrape kosis.kr, but selenium driver.page_source is giving nothing.

The data you are interested in is located within nested iframes on that page. Try this to get the tabular content from there:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1JH20151&vw_cd=MT_ETITLE&list_id=J1_10&scrId=&language=en&seqNo=&lang_mode=en&obj_var_id=&itm_id=&conn_path=MT_ETITLE&path=%252Feng%252FstatisticsList%252FstatisticsListIndex.do"
with webdriver.Chrome() as driver:
    driver.get(link)
    # the table sits inside two nested iframes, so switch into each frame in turn
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe#iframe_rightMenu")))
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe#iframe_centerMenu1")))
    # collect every row of the main table once it is present
    for item in WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table[id='mainTable'] tr"))):
        data = [i.text for i in item.find_elements(By.CSS_SELECTOR, 'th,td')]
        print(data)
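If you need to interact with the top-level page again afterwards, note that the driver is still focused on the innermost frame; switching back out is a single call:
# hop back out of the nested iframes to the top-level document
driver.switch_to.default_content()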

Related

can't get page source from selenium

Purpose: get the entire page source using Selenium.
Problem: the loaded page does not contain the content, only JavaScript and CSS files.
Target site: https://www.warcraftlogs.com
Test code (needs pip install selenium):
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.warcraftlogs.com/zone/rankings/29#boss=2512&metric=hps&difficulty=3&class=Priest&spec=Discipline")
pageSource = driver.page_source
fileToWrite = open("page_source.html", "w",encoding='utf-8')
fileToWrite.write(pageSource)
fileToWrite.close()
Things I tried:
I tried the Python requests library with the same result: the response contained no content, only JS and CSS.
It's a personal opinion, but I think this site deliberately hides its content data.
I want to scrape this site's data. How can I do it?
Here is a way of getting the page source after all elements have loaded:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
[...]
wait = WebDriverWait(driver, 5)
url='https://www.warcraftlogs.com/zone/rankings/29#boss=2512&metric=hps&difficulty=3&class=Priest&spec=Discipline'
driver.get(url)
stuffs = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="top-100-details-number kill"]')))
t.sleep(5)
print(driver.page_source)
You can then write the page source to a file, etc. Selenium documentation: https://www.selenium.dev/documentation/
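For example, writing it out the same way the question does (the file name is arbitrary):
# persist the rendered page for offline parsing
with open("page_source.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)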

Selenium not loading full dynamic html webpage despite waiting

I'm trying to load the videos page of a youtube channel and parse it to extract recent video information. I want to avoid using the API since it has a daily usage quota.
The problem I'm having is that Selenium does not seem to load the full HTML of the webpage when printing driver.page_source:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
driver = Chrome(executable_path='chromedriver')
driver.get('https://www.youtube.com/c/Oxylabs/videos')
# Agree to youtube cookie popup
try:
    consent = driver.find_element_by_xpath(
        "//*[contains(text(), 'I agree')]")
    consent.click()
except:
    pass

# Parse html
WebDriverWait(driver, 100).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="show-more-button"]')))
print(driver.page_source)
I have tried to implement WebDriverWait as seen above, but this results in a timeout exception. However, the following xpath (/html, the end of the webpage) does not result in a timeout exception:
WebDriverWait(driver,100).until(EC.visibility_of_element_located((By.XPATH, '/html')))
-but this does not load the full html either.
I have also tried to implement time.sleep(100) instead of WebDriverWait, but this too results in the incomplete html. Any help would be greatly appreciated.
The element you are looking for is not on the page; that is the reason for the timeout:
//*[@id="show-more-button"]
Have you tried scrolling to the page bottom or looking for some other element?
driver.execute_script("arguments[0].scrollIntoView();", element)
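A minimal sketch of the scrolling idea, reusing the question's driver (the loop count and pause are arbitrary guesses):
import time

# scroll to the bottom repeatedly so the lazy loader fetches more videos
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)  # give the newly requested items time to render
print(driver.page_source)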

Get fully generated DOM elements inside iframe using selenium and phantomJS with python

OK, I'm stuck. I'm making a little web-scraping Python script using Selenium and PhantomJS. The page that I'm working on has the data I want inside an iframe document that my webdriver does not execute.
<main page head etc.>
<blah>
  <iframe1 src="src1" ... etc etc>
    #document
      <tag>
      <tag>
      <iframe2 src="src2"></iframe2>
  </iframe1>
<blah>
<end of webpage DOM>
I want to get the src of iframe2. I tried to run the src1 URL through my webdriver, but all I get out is the raw page HTML, not the loaded webpage elements. iframe2 must be created by some script inside iframe1, but I can't get my webdriver to run that script.
Any ideas?
This is what I'm doing to run the JavaScript on webpages to get the compiled page DOM:
from bs4 import BeautifulSoup
from selenium import webdriver

self.driver = webdriver.PhantomJS()
self.driver.get(url)
page = self.driver.page_source
soup = BeautifulSoup(page, 'html.parser')
You can't get a full page_source in the case of an iframe; you should use the following command: switch_to.frame(iframe_element), so you can get the elements inside:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

self.driver = webdriver.PhantomJS()
self.driver.get(url)
# wait until the iframe is present in the DOM
WebDriverWait(self.driver, 50).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, '//iframe[@id="iframegame"]'))
)
iframe_element = self.driver.find_element_by_xpath('//iframe[@id="iframegame"]')
self.driver.switch_to.frame(iframe_element)
tag = self.driver.find_element_by_xpath('//tag')
And to get back out of the iframe again, you can use the following command:
self.driver.switch_to.default_content()
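Applied to the question's nested structure, a minimal sketch (the locators here are assumptions; adjust them to the real page):
# switch into the outer iframe so its document becomes the search context
outer = self.driver.find_element_by_xpath('(//iframe)[1]')
self.driver.switch_to.frame(outer)

# iframe2 only exists after iframe1's scripts have run, so wait for it
inner = WebDriverWait(self.driver, 50).until(
    EC.presence_of_element_located((By.TAG_NAME, 'iframe')))
print(inner.get_attribute('src'))  # the src of iframe2

# return to the top-level document
self.driver.switch_to.default_content()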

How to get_attribute('innerHTML') from a list of URLs - Selenium?

I am web scraping using Selenium in Python, and I'm using XPath to extract part of the contents of the website.
I want to know how to use a loop to extract content from a list of URLs and save the results into a dictionary.
mylist_URLs = ['https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001560258',
'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000034088',
'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0001048911']
My code below only works for one URL...
driver = webdriver.Chrome(r'xxx\chromedriver.exe')
driver.get('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0000104169')
driver.find_elements_by_xpath('/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[1]/td')[0].get_attribute('innerHTML')
Thank you for the help.
You can use a simple for-each loop with WebDriverWait to make sure the table is loaded before getting the innerHTML.
Add the imports below:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Script:
mylist_URLs = ['https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001560258',
               'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000034088',
               'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0001048911']

# open the browser
driver = webdriver.Chrome(r'xxx\chromedriver.exe')

# iterate through all the urls
for url in mylist_URLs:
    print(url)
    driver.get(url)
    # wait for the table to be present
    element = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
        (By.XPATH, "(//table[1]/tbody/tr[2]/td/table/tbody/tr[1]/td)[1]")))
    # now get the element's innerHTML
    print(element.get_attribute('innerHTML'))
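Since the question asks for the results in a dictionary, a minimal sketch of that variant (keying by URL is my own choice):
results = {}
for url in mylist_URLs:
    driver.get(url)
    element = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
        (By.XPATH, "(//table[1]/tbody/tr[2]/td/table/tbody/tr[1]/td)[1]")))
    # map each URL to the innerHTML of its matched table cell
    results[url] = element.get_attribute('innerHTML')
print(results)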

BeautifulSoup scraping from a web page already opened by Selenium

I would like to scrape a web page which was opened by Selenium from a different webpage.
I entered a search term into a website using Selenium, and this landed me on a new page. My aim is to create soup out of this new page, but the soup is getting created out of the previous page where I entered my search term. Help, please!
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
inputElement.send_keys(Keys.ENTER)
driver.wait.until(staleness_of('txtStock')
source = driver.page_source
soup = BeautifulSoup(source)
You need to know the exact company name for your search. After the send_keys, you tried to check for the staleness of an element; I did not understand how that statement is supposed to work, so I added a WebDriverWait for an element of the new page instead.
The following works for me regarding the Selenium part, up to getting the page source:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries Ltd.')
inputElement.send_keys(Keys.ENTER)
company = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'lblCompany')))
source = driver.page_source
You should add exception handling.
@Jens Dibbern has given a working solution, but it is not necessary to give the exact name of the company in the search. What happens is that when you type a non-exact name, a drop-down pops up.
I have observed that the enter key does not work until this drop-down is present. You can check this by going to the site, pasting the name and pressing the enter key as fast as possible without waiting: nothing happens.
You can instead wait for this drop-down to be visible and then send the enter key. This also works perfectly. Note that this will end up selecting the first item in the drop-down if more than one is present.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
drop_down=driver.find_element_by_css_selector("#listPlacementStock")
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#listPlacementStock:not([style*="display: none"])')))
inputElement.send_keys(Keys.ENTER)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="CompanyLink"]')))
source = driver.page_source
soup = BeautifulSoup(source,'html.parser')
print(soup)
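From there, pulling a specific field out of the soup is straightforward; a minimal sketch using the same element the last wait targets:
# the company link that the WebDriverWait above waited for
company = soup.find(id='CompanyLink')
print(company.get_text(strip=True))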
