Bypass Referral Denied error in Selenium using Python

I was writing a script to download images from Comic Naver and I'm mostly done with it; however, I can't seem to save the images.
I successfully grabbed the image URLs via urllib and BeautifulSoup, but now it seems they've introduced hotlink blocking, and I can't save the images to my system via urllib or selenium.
Update: I tried changing the user agent to see if that was causing problems... still the same.
Any fix or solution?
My code right now:
import requests
from bs4 import BeautifulSoup
import re
import urllib
import urllib2
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Spoof the user agent for the PhantomJS headless browser
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Chrome/15.0.87"
)

url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver = webdriver.PhantomJS(desired_capabilities=dcap)

soup = BeautifulSoup(urllib.urlopen(url).read())
scripts = soup.findAll('img', alt='comic content')

# Screenshot each comic image in the headless browser
for links in scripts:
    Imagelinks = links['src']
    filename = Imagelinks.split('_')[-1]
    print 'Downloading Image : ' + filename
    driver.get(Imagelinks)
    driver.save_screenshot(filename)

driver.close()
Following MAI's reply, I tried what I could with selenium and got what I wanted. It's solved now. My code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver.get(url)

# Find every comic image inside the viewer div
elem = driver.find_elements_by_xpath("//div[@class='wt_viewer']//img[@alt='comic content']")
for links in elem:
    print links.get_attribute('src')

driver.quit()
But when I try to take screenshots of these, it says the "element is not attached to the page". Now how am I supposed to solve that? :/
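Edit: I suspect the screenshots fail because driver.get() navigates away from the comic page, so the elements found earlier become detached; copying the src strings into a plain list before navigating should sidestep that, since strings can't go stale. A sketch, continuing from the code above:

# Collect the URLs first; navigating invalidates the WebElements, not the strings
links = [img.get_attribute('src') for img in elem]
for src in links:
    driver.get(src)
    driver.save_screenshot(src.split('_')[-1])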

(Note: Apologies, I'm not able to comment, so I have to make this an answer.)
To answer your original question: I've just been able to download an image from Naver Webtoons (the English site) with cURL by adding a Referer: http://www.webtoons.com header, like so:
curl -H "Referer: http://www.webtoons.com" [link to image] > img.jpg
I haven't tried, but you'll probably want to use http://comic.naver.com instead. To do this with urllib, create a Request object with the header required:
import shutil
import urllib.request

req = urllib.request.Request(url, headers={"Referer": "http://comic.naver.com"})
with urllib.request.urlopen(req) as response, open("image.jpg", "wb") as outfile:
    shutil.copyfileobj(response, outfile)
The file is saved with shutil.copyfileobj(src, dest), as above. So instead of taking screenshots, you can simply get a list of all the images to download, then make a request for each one using the Referer header.
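A minimal sketch of that whole loop (Python 3; it assumes the img alt="comic content" markup from the question and html.parser as the BeautifulSoup backend):

import shutil
import urllib.request

from bs4 import BeautifulSoup

PAGE = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
HEADERS = {"Referer": "http://comic.naver.com"}

req = urllib.request.Request(PAGE, headers=HEADERS)
soup = BeautifulSoup(urllib.request.urlopen(req).read(), "html.parser")

for img in soup.find_all("img", alt="comic content"):
    src = img["src"]
    filename = src.split("_")[-1]
    print("Downloading image:", filename)
    # The Referer header is what gets us past the hotlink blocking
    img_req = urllib.request.Request(src, headers=HEADERS)
    with urllib.request.urlopen(img_req) as response, open(filename, "wb") as outfile:
        shutil.copyfileobj(response, outfile)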
Edit: I have a working script on GitHub which only requires urllib and BeautifulSoup.

I took a short look at the website with Chrome dev tools.
I would suggest downloading the images directly instead of taking screenshots. Selenium WebDriver actually runs the JavaScript in the PhantomJS headless browser, so you should get the images loaded by JavaScript at the following path.
The path that I am getting by eye-balling the html is
html body #wrap #container #content div #comic_view_area div img
The image tags at the last level have IDs like content_image_N, with N counting from 0, so you can also get a specific picture using img#content_image_0, for example.
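For example, a quick sketch of pulling those URLs out with Selenium (the selector is assembled from the description above, not verified against the live page):

from selenium import webdriver

driver = webdriver.PhantomJS()  # any driver works; the thread uses PhantomJS
driver.get("http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue")

# Match every img whose id starts with content_image_ inside the viewer div
for img in driver.find_elements_by_css_selector("#comic_view_area img[id^='content_image_']"):
    print(img.get_attribute("src"))

driver.quit()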

Related

can't get page source from selenium

Purpose: get the entire page source using selenium.
Problem: the loaded page does not contain the content, only JavaScript and CSS files.
Target site: https://www.warcraftlogs.com
Test code (needs pip install selenium):
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.warcraftlogs.com/zone/rankings/29#boss=2512&metric=hps&difficulty=3&class=Priest&spec=Discipline")
pageSource = driver.page_source
fileToWrite = open("page_source.html", "w", encoding='utf-8')
fileToWrite.write(pageSource)
fileToWrite.close()
Things I've tried: the python requests library gives the same result; the response contains only JS and CSS references, no content.
My personal opinion is that the site deliberately hides its content data.
I want to scrape this site's data. How can I do it?
Here is a way of getting the page source after all elements have loaded:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
[...]
wait = WebDriverWait(driver, 5)
url='https://www.warcraftlogs.com/zone/rankings/29#boss=2512&metric=hps&difficulty=3&class=Priest&spec=Discipline'
driver.get(url)
stuffs = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="top-100-details-number kill"]')))
t.sleep(5)
print(driver.page_source)
You can then write the page source to a file, etc., as sketched below. Selenium documentation: https://www.selenium.dev/documentation/
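For instance, continuing from the snippet above:

# Dump the rendered source to disk once the wait has passed
with open("page_source.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)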

Download PDF file from .cfm URL

I'm trying to download a PDF file from this address: https://aisweb.decea.mil.br/inc/notam/gerar-boletim/reports/report-notam.cfm
I wrote some code that first fills out some information (correctly) on this page, https://aisweb.decea.mil.br/?i=notam, and then clicks a button that opens a new tab with the generated PDF file. The problem is that when the script tries to save the PDF at the end, it downloads directly from the .cfm address, resulting in an empty PDF template (you can see this by clicking the first link).
How can I download the PDF that is currently being shown to me on the page, instead of accessing the first URL directly?
This is my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
from selenium.webdriver.common.print_page_options import PrintOptions
from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib
import time
import requests
from urllib.parse import urljoin
aerodromos = "SBNT,SBJP,SBFZ,SBRF"  # TEST

# Create the driver once, passing both the executable path and the options
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(executable_path=r'C:\Windows\chromedriver.exe', options=options)
driver.get("https://aisweb.decea.mil.br/?i=notam")
driver.maximize_window()
caixaTexto = driver.find_element(By.XPATH,'//*[#id="icaocode"]')
caixaTexto.send_keys(aerodromos)
botao = driver.find_element(By.XPATH, '//*[#id="a"]/form/div/div[3]/div/input[2]')
botao.click()
botao = driver.find_element(By.XPATH, '//*[#id="select-all"]')
botao.click()
botao = driver.find_element(By.XPATH, '/html/body/div/div/div/div/div[2]/div/div/form/input[3]')
botao.click()
response = urllib.request.urlretrieve('https://aisweb.decea.mil.br/inc/notam/gerar-boletim/reports/report-notam.cfm', filename='relatorio1.pdf')
I did it! Changing the settings in Chrome itself to download PDFs instead of opening them made no difference, but I ended up finding a solution while searching for another way to do it:
Unable to access the modal elements to download pdf with selenium
I changed the Chrome experimental options profile in my code and it worked! Now it opens the tab, immediately downloads the file, and closes the tab!
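For reference, a minimal sketch of that kind of profile change (these are the commonly used Chrome preference names; the download directory is a hypothetical path):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "plugins.always_open_pdf_externally": True,  # download PDFs instead of opening the built-in viewer
    "download.default_directory": r"C:\Users\me\Downloads",  # hypothetical target folder
    "download.prompt_for_download": False,  # save without asking where
})
driver = webdriver.Chrome(options=options)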

Selenium not loading full dynamic html webpage despite waiting

I'm trying to load the videos page of a youtube channel and parse it to extract recent video information. I want to avoid using the API since it has a daily usage quota.
The problem I'm having is that Selenium does not seem to load the full HTML of the page when printing driver.page_source:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
driver = Chrome(executable_path='chromedriver')
driver.get('https://www.youtube.com/c/Oxylabs/videos')

# Agree to youtube cookie popup
try:
    consent = driver.find_element_by_xpath(
        "//*[contains(text(), 'I agree')]")
    consent.click()
except:
    pass

# Parse html
WebDriverWait(driver, 100).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="show-more-button"]')))
print(driver.page_source)
I have tried to implement WebDriverWait, as seen above. This results in a TimeoutException. However, the following XPath (/html, the end of the webpage) does not result in a timeout:
WebDriverWait(driver,100).until(EC.visibility_of_element_located((By.XPATH, '/html')))
but this does not load the full HTML either.
I have also tried time.sleep(100) instead of WebDriverWait, but this too results in incomplete HTML. Any help would be greatly appreciated.
The element you are looking for is not on the page; this is the reason for the timeout:
//*[@id="show-more-button"]
Have you tried scrolling to the page bottom or looking for some other element?
driver.execute_script("arguments[0].scrollIntoView();", element)
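If the goal is to get the whole videos list into the DOM, a scroll-to-bottom loop is one option. A sketch, assuming driver already has the channel page open (the 2-second pause is an arbitrary allowance for lazy loading):

import time

last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    # Jump to the bottom so the next batch of videos lazy-loads
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded; we've reached the end
    last_height = new_height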

Why is HTML returned by requests different from the real page HTML?

I'm trying to scrape a web page to get some data to work with. One of the pages I want to scrape is https://www.etoro.com/people/sparkliang/portfolio. The problem comes when I scrape it using:
import requests
h=requests.get('https://www.etoro.com/people/sparkliang/portfolio')
h.content
This gives me HTML completely different from the original page: for example, it adds a lot of meta tags and drops the text and elements I am searching for.
For example imagine I want to scrape:
<p ng-if=":: item.IsStock" class="i-portfolio-table-hat-fullname ng-binding ng-scope">Shopify Inc.</p>
I use a command like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.etoro.com/people/sparkliang/portfolio').text
print(html_text)
soup = BeautifulSoup(html_text,'lxml')
job = soup.find('p', class_='i-portfolio-table-hat-fullname ng-binding ng-scope').text
This will return me Shopify Inc.
But it doesn't, because the HTML I load from the web page with the requests library is completely different.
I want to know how to get the original html code from the web page.
If you use Ctrl-F to search for a keyword like Shopify Inc, it isn't even in the code I get from the requests library.
It happens because the page uses JavaScript to create the DOM elements dynamically, so you won't be able to accomplish this with requests. Instead, you should use selenium with a webdriver and wait for the elements to be created before scraping.
You can try downloading ChromeDriver executable here. And if you paste it in the same folder as your script you can run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920x1080")
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe" # CHANGE THIS IF NOT SAME FOLDER
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)
url = 'https://www.etoro.com/people/sparkliang/portfolio'
driver.get(url)
html_text = driver.page_source
jobs = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.i-portfolio-table-hat-fullname'))
)
for job in jobs:
    print(job.text)
Here we use selenium with WebDriverWait and EC to ensure that all the elements will exist when we try to scrape the info we're looking for.
Outputs
Facebook
Apple
Walt Disney
Alibaba
JD.com
Mastercard
...

web scraping w/age verification

Hello, I want to scrape data from a site with an age-verification pop-up, using Python 3.x and BeautifulSoup. I can't get to the underlying text and images without clicking "yes" for "are you over 21". Thanks for any support.
EDIT: Thanks, with some help from a comment I see that I can use cookies, but I am not sure how to manage/store/call cookies with the requests package.
EDIT 2: With some help from another user, I am now using the selenium package so that it will also work in case it's a graphical overlay (I think?). I'm having trouble getting it to work with the gecko driver but will keep trying! Thanks for all the advice again, everyone.
EDIT 3: OK, I have made progress and I can get the browser window to open using the gecko driver! Unfortunately it doesn't like that link specification, so I'm posting again. The link to click "yes" on the age verification is buried on that page as something called mlink...
EDIT 4: Made some progress; updated code is below. I managed to find the element in the XML code, now I just need to manage to click the link.
#
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'/Users/jeff/Documents/geckodriver')  # optional argument; if not specified, selenium will search the PATH
url = 'https://www.shopharborside.com/oakland/#/shop/412'
driver.get(url)
#
driver.find_element_by_class_name('hhc_modal-body').click()
#wait.1.second
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource)
#you.can.now.enjoy.soup
print(soup.prettify())
Edit new: Stuck again. Here is the current code. I seem to have isolated the element "myBtnYes", but I get an error when running the code:
ElementClickInterceptedException: Message: Element is not clickable at point (625,278.5500030517578) because another element obscures it
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'/Users/jeff/Documents/geckodriver')  # optional argument; if not specified, selenium will search the PATH
url = 'https://www.shopharborside.com/oakland/#/shop/412'
driver.get(url)
#
driver.find_element_by_id('myBtnYes').click()
#wait.1.second
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource)
#you.can.now.enjoy.soup
print(soup.prettify())
If your aim is to click through the verification, go with selenium:
PS: install selenium (pip install selenium) and get geckodriver (Firefox) or chromedriver (Chrome).
#Mossein~King(hi i'm here to help)
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
#this.is.for.headless.This.will.save.you.a.bunch.of.research.time(Trust.me)
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(firefox_options=options)
#for.graphical(you.need.gecko.driver.for.firefox)
# driver = webdriver.Firefox()
url = 'your-url'
driver.get(url)
#get.the.link.to.clicking
#exaple if<a class='MosseinKing'>
driver.find_element_by_xpath("//a[#class='MosseinKing']").click()
#wait.1.secong.in.case.of.transitions
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource)
#you.can.now.enjoy.soup
print(soup.prettify())
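As for the ElementClickInterceptedException in the question: one common workaround (an assumption on my part; I haven't run it against this exact site) is to click through JavaScript, which bypasses the check that another element covers the click point:

# Click via JS so the overlaying element can't intercept the click
button = driver.find_element_by_id('myBtnYes')
driver.execute_script("arguments[0].click();", button)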
