Python Selenium Get PageSource of XHTML

Python Selenium Get PageSource of XHTML - python

I was wondering if there was a way to print the entire html path. I am trying to verify some text in a pdf xhtml file pop-up and can not get to to. My hope is to get the entire page source and verify the text is in there. However .page_source seems to only give me the url and description and I am looking to get each line of code.

A possible approach is to make selenium find the starting page tag (html) and get all the source related code.
driver = webdriver.Firefox()
driver.get("http://stackoverflow.com/")
driver.find_element_by_tag_name("html").get_attribute('outerHTML')
Documentation
Output example:
<html webdriver="true"><head>
<title>Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com">
<meta property="og:type" content="website">
<meta name="description" content="Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers">
<meta property="og:image" itemprop="image primaryImageOfPage" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon#2.png?v=73d79a89bded">
<meta name="twitter:title" property="og:title" itemprop="title name" content="Stack Overflow">
<meta name="twitter:description" property="og:description" itemprop="description" content="Q&A for professional and enthusiast programmers">
<meta property="og:url" content="http://stackoverflow.com/">
......

Related

Emulating CORS issues with pytest

I need to test requests that can be sent through iframe. For example: i have some page on domain_01:
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport"
content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<style>
body {
margin: 0 auto;
}
</style>
<body>
<iframe id="inlineFrameExample"
title="Inline Frame Example"
width="1600"
height="900"
src="http://domain_02:8000/app/dashboard">
</iframe>
</body>
</html>
And as you can see here this page contains iframe with link to page on domain_02. I try to understand: is it possible to emulate request that goes to domain_02 through this iframe on doamin_01 with pytest.
Main task what i need to solve it's create tests with different requests and check that there is no CORS issues with it.
How i check it now: manually only. I run second web-server through inner python server (python -m http.server 8090) and set dns-record on local dns-server to emulate domain_01. It will be so cool to run this tests with pytest.

How to Get Title and Brief Description from a URL in Python?

In Discord, if you post a link, Discord will find the title, and a brief summary of the linked webpage:
How can I replicate this behavior in Python?

Similar to this SO: https://stackoverflow.com/a/43154489/9964778
You need to get the value of some meta tags which starts with "og". In your example, you have in the source code below with the related metadata for title, description, image, among other fields
<meta property="og:site_name" content="livescience.com">
<meta property="og:image" content="https://cdn.mos.cms.futurecdn.net/W3wpBCQ4hEL4dthLYnsosK-1200-80.gif">
<meta property="og:image:width" content="1200">
<meta property="og:type" content="article">
<meta property="article:publisher" content="https://www.facebook.com/livescience?cmpid=556687">
<meta property="og:title" content="Alien shopping-bag ocean weirdo has glowing Cheetos for guts">
<meta property="og:url" content="https://www.livescience.com/alien-glowing-cheeto-sea-cucumber">
<meta property="og:description" content="The deep-sea creature surprised scientists.">

ERROR: 'NoneType' object has no attribute 'find_all'

I'm doing web scraping of a web page called: CVE Trends
import bs4, requests,webbrowser
LINK = "https://cvetrends.com/"
PRE_LINK = "https://nvd.nist.gov/"
response = requests.get(LINK)
response.raise_for_status()
soup=bs4.BeautifulSoup(response.text,'html.parser')
div_tweets=soup.find('div',class_='tweet_text')
a_tweets=div_tweets.find_all('a')
link_tweets =[]
for a_tweet in a_tweets:
link_tweet= str(a_tweet.get('href'))
if PRE_LINK in link_tweet:
link_tweets.append(link_tweet)
from pprint import pprint
pprint(link_tweets)
This is the code that I've written so far. I've tried in many ways but it gives always the same error:
'NoneType' object has no attribute 'find_all'
Can someone help me please? I really need this.
Thanks in advance for any answer.

This is due to not getting response you exactly want.
https://cvetrends.com/
This website have java-script loaded content,so you will not get data in request.
instead of scraping website you will get data from https://cvetrends.com/api/cves/24hrs
here is some solution:
import requests
import json
from urlextract import URLExtract
LINK = "https://cvetrends.com/api/cves/24hrs"
PRE_LINK = "https://nvd.nist.gov/"
link_tweets = []
# library for url extraction
extractor = URLExtract()
# ectract response from LINK (json Response)
html = requests.get(LINK).text
# convert string to json object
twitt_json = json.loads(html)
twitt_datas = twitt_json.get('data')
for twitt_data in twitt_datas:
# extract tweets
twitts = twitt_data.get('tweets')
for twitt in twitts:
# extract tweet texts and validate condition
twitt_text = twitt.get('tweet_text')
if PRE_LINK in twitt_text:
# find urls from text
urls_list = extractor.find_urls(twitt_text)
for url in urls_list:
if PRE_LINK in url:
link_tweets.append(twitt_text)
print(link_tweets)

This is happening because soup.find("div", class_="tweet_text") is not finding anything, so it returns None. This is happening because the site you're trying to scrape is populated using javascript, so when you send a get request to the site, this is what you're getting back:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<title>
CVE Trends - crowdsourced CVE intel
</title>
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="description"/>
<meta content="trending CVEs, CVE intel, CVE trends" name="keywords"/>
<meta content="CVE Trends - crowdsourced CVE intel" name="title" property="og:title">
<meta content="Simon Bell" name="author"/>
<meta content="website" property="og:type">
<meta content="https://cvetrends.com/images/cve-trends.png" name="image" property="og:image">
<meta content="https://cvetrends.com" property="og:url">
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." property="og:description"/>
<meta content="en_GB" property="og:locale"/>
<meta content="en_US" property="og:locale:alternative"/>
<meta content="CVE Trends" property="og:site_name"/>
<meta content="summary_large_image" name="twitter:card"/>
<meta content="#SimonByte" name="twitter:creator"/>
<meta content="CVE Trends - crowdsourced CVE intel" name="twitter:title"/>
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="twitter:description"/>
<meta content="https://cvetrends.com/images/cve-trends.png" name="twitter:image"/>
<link href="https://cvetrends.com/favicon.ico" id="favicon" rel="icon" sizes="32x32"/>
<link href="https://cvetrends.com/apple-touch-icon.png" id="apple-touch-icon" rel="apple-touch-icon"/>
<link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/5.1.0/css/bootstrap.min.css" rel="stylesheet"/>
</meta>
</meta>
</meta>
</meta>
</head>
<body>
<div id="root">
</div>
<noscript>
Please enable JavaScript to run this app.
</noscript>
<script src="https://cvetrends.com/js/main.d0aa7136854f54748577.bundle.js">
</script>
</body>
</html>
You can verify this using print(soup.prettify()).
To be able to scrape this site you'll probable have to use something like Selenium.

How to match the first part of string with the two same substring?

I have text as below,
<meta name="description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。">
<meta name="Keywords" content="天気,天気予報,気象,情報,台風,地震,津波,週間,ウェザー,ウェザーニュース,ウェザーニューズ,今日の天気,明日の天気"><meta property="og:type" content="article">
<meta property="og:title" content="【天地始粛】音や景色から感じる秋の気配"><meta property="og:description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta property="og:url" content="https://weathernews.jp/s/topics/201807/300285/">
<meta property="og:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
<meta name="twitter:title" content="【天地始粛】音や景色から感じる秋の気配">
<meta name="twitter:description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
<link rel="canonical" href="https://weathernews.jp/s/topics/201807/300285/">
<link rel="amphtml" href="https://weathernews.jp/s/topics/201807/300285/amp.html">
<script async="async" src="https://www.googletagservices.com/tag/js/gpt.js">
I used pattern = re.compile(r'(https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+)\?[0-9]+') to match it, and I want to get https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg, but I got
https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><meta name="twitter:title" content="【天地始粛】音や景色から感じる秋の気配"><meta name="twitter:description content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta name="twitter:image" content="https://smtgs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg
how can I modify my Regex pattern?

You may try this:
this captures the url until it reaches file extensions[inclusive],
(https:\/\/smtgvs\.weathernews\.jp\/s\/topics\/img\/\d+\/\w+\.[jpng]{3})
demo

Download PDF with chrome plugin in python selenium

I'm trying to extract a PDF from this site that uses the native Google Chrome pdf viewer tool to open the pdf in the first place, it's content type is /application/pdf. The issue is that the site URLs that I get aren't actually links to the PDF but rather to a .zul site where the js will load the pdf, or fetch it.
Here's my download code below:
def download_pdf(url, idx, save_dir):
options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list": [{"enabled":False,"name":"Chrome PDF Viewer"}],
"download.default_directory" : save_dir}
options.add_experimental_option("prefs",profile)
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)
driver.get(url)
The problem that Im encountering with the above code is that I get the following readout from driver.source_page:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="-1" />
<title>Document Viewer</title>
<link rel="stylesheet" type="text/css" href="/eSMARTContracts/zkau/web/9776a7f0/zul/css/zk.wcs;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1"/>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zk.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zul.lang.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<!-- ZK 6.0.2 EE 2012072410 -->
</head>
<body>
<div id="j4AP_" class="z-temp"></div>
<script class="z-runonce" type="text/javascript">zk.pi=1;zkmx(
[0,'j4AP_',{dt:'z_2m1',cu:'/eSMARTContracts;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',uu:'/eSMARTContracts/zkau;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',ru:'/service/dpsweb/ViewDPSWeb.zul'},[
['zul.wnd.Window','j4AP0',{$$onSize:false,$$onMaximize:false,$$onOpen:false,$$onMinimize:false,$$onZIndex:false,$onClose:true,$$onMove:false,width:'100%',height:'100%',prolog:'\
'},[]]]]);
</script>
<noscript>
<div class="noscript"><p>Sorry, JavaScript must be enabled.<br/>Change your browser options, then try again.</p></div>
</noscript>
</body>
</html>
EDIT: Included the link

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Selenium Get PageSource of XHTML - python

Related

Emulating CORS issues with pytest

How to Get Title and Brief Description from a URL in Python?

ERROR: 'NoneType' object has no attribute 'find_all'

How to match the first part of string with the two same substring?

Download PDF with chrome plugin in python selenium

Categories

Resources