How to Get Title and Brief Description from a URL in Python?

In Discord, if you post a link, Discord finds the title and a brief summary of the linked webpage and shows them as a preview.
How can I replicate this behavior in Python?

Similar to this SO answer: https://stackoverflow.com/a/43154489/9964778
You need to read the values of the Open Graph meta tags, whose property attribute starts with "og:". In your example, the page source contains the following metadata for the title, description, and image, among other fields:
<meta property="og:site_name" content="livescience.com">
<meta property="og:image" content="https://cdn.mos.cms.futurecdn.net/W3wpBCQ4hEL4dthLYnsosK-1200-80.gif">
<meta property="og:image:width" content="1200">
<meta property="og:type" content="article">
<meta property="article:publisher" content="https://www.facebook.com/livescience?cmpid=556687">
<meta property="og:title" content="Alien shopping-bag ocean weirdo has glowing Cheetos for guts">
<meta property="og:url" content="https://www.livescience.com/alien-glowing-cheeto-sea-cucumber">
<meta property="og:description" content="The deep-sea creature surprised scientists.">
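As a minimal sketch of reading those tags, the snippet below parses an HTML string with the standard library's html.parser and collects every meta tag whose property starts with "og:". The names OpenGraphParser and extract_open_graph are hypothetical helpers introduced for illustration, and the sample HTML is a shortened copy of the tags above; in practice the HTML would come from downloading the URL first.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> values into a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value if a property appears more than once.
            self.tags.setdefault(prop, content)


def extract_open_graph(html):
    """Return a dict mapping Open Graph property names to their content."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


# Shortened copy of the meta tags shown above.
sample_html = """
<meta property="og:site_name" content="livescience.com">
<meta property="og:title" content="Alien shopping-bag ocean weirdo has glowing Cheetos for guts">
<meta property="og:description" content="The deep-sea creature surprised scientists.">
"""

preview = extract_open_graph(sample_html)
print(preview["og:title"])
print(preview["og:description"])
```

Pages that do not publish Open Graph tags return an empty dict here, so a fallback to the title element or the description meta tag may be needed.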

Related

Emulating CORS issues with pytest

I need to test requests that are sent through an iframe. For example, I have a page on domain_01:
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport"
content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<style>
body {
margin: 0 auto;
}
</style>
<body>
<iframe id="inlineFrameExample"
title="Inline Frame Example"
width="1600"
height="900"
src="http://domain_02:8000/app/dashboard">
</iframe>
</body>
</html>
As you can see, this page contains an iframe that links to a page on domain_02. I am trying to understand whether it is possible, with pytest, to emulate a request that goes to domain_02 through this iframe on domain_01.
The main task I need to solve is to create tests with different requests and check that they cause no CORS issues.
How I check it now: manually only. I run a second web server with Python's built-in server (python -m http.server 8090) and set a DNS record on a local DNS server to emulate domain_01. It would be great to run these tests with pytest.
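One possible starting point, offered as a rough sketch rather than a full answer: a pytest test can send a request with an Origin header and assert on the Access-Control-Allow-Origin header of the response. Everything below is an assumption made for illustration: DashboardStub is a hypothetical stand-in for the app on domain_02, ALLOWED_ORIGIN is an assumed allow-list of one origin, and the helper names are invented. This only checks response headers; the browser is what actually enforces CORS, and whether a page may be framed at all is controlled by X-Frame-Options or the CSP frame-ancestors directive, not by CORS.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_ORIGIN = "http://domain_01"  # assumption: the only origin allowed to call the app


class DashboardStub(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the application served on domain_02."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        # Send the CORS header only when the request comes from the allowed origin.
        if self.headers.get("Origin") == ALLOWED_ORIGIN:
            self.send_header("Access-Control-Allow-Origin", ALLOWED_ORIGIN)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_stub_server():
    """Start the stub on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), DashboardStub)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def allowed_origin_for(port, path, origin):
    """Return the Access-Control-Allow-Origin header sent back for `origin`."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path, headers={"Origin": origin})
        response = conn.getresponse()
        response.read()
        return response.getheader("Access-Control-Allow-Origin")
    finally:
        conn.close()


def test_cors_headers():
    server = start_stub_server()
    try:
        port = server.server_port
        assert allowed_origin_for(port, "/app/dashboard", "http://domain_01") == ALLOWED_ORIGIN
        assert allowed_origin_for(port, "/app/dashboard", "http://other.example") is None
    finally:
        server.shutdown()
        server.server_close()
```

To test the real application, the stub would be replaced by the actual server address. Checking what a browser does with the iframe itself would need a browser-driven tool instead.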

ERROR: 'NoneType' object has no attribute 'find_all'

I'm scraping a web page called CVE Trends:
import bs4, requests,webbrowser
LINK = "https://cvetrends.com/"
PRE_LINK = "https://nvd.nist.gov/"
response = requests.get(LINK)
response.raise_for_status()
soup=bs4.BeautifulSoup(response.text,'html.parser')
div_tweets=soup.find('div',class_='tweet_text')
a_tweets=div_tweets.find_all('a')
link_tweets =[]
for a_tweet in a_tweets:
    link_tweet= str(a_tweet.get('href'))
    if PRE_LINK in link_tweet:
        link_tweets.append(link_tweet)
from pprint import pprint
pprint(link_tweets)
This is the code that I've written so far. I've tried many approaches, but it always gives the same error:
'NoneType' object has no attribute 'find_all'
Can someone help me please? I really need this.
Thanks in advance for any answer.
This happens because the response does not contain the content you expect.
https://cvetrends.com/
This website loads its content with JavaScript, so the data is not in the HTML that requests receives.
Instead of scraping the page, you can get the data from https://cvetrends.com/api/cves/24hrs
Here is one solution:
import requests
import json
from urlextract import URLExtract

LINK = "https://cvetrends.com/api/cves/24hrs"
PRE_LINK = "https://nvd.nist.gov/"
link_tweets = []
# library for URL extraction
extractor = URLExtract()
# fetch the response from LINK (a JSON response)
html = requests.get(LINK).text
# convert the string to a JSON object
twitt_json = json.loads(html)
twitt_datas = twitt_json.get('data')
for twitt_data in twitt_datas:
    # extract the tweets
    twitts = twitt_data.get('tweets')
    for twitt in twitts:
        # extract the tweet text and check the condition
        twitt_text = twitt.get('tweet_text')
        if PRE_LINK in twitt_text:
            # find the URLs in the text
            urls_list = extractor.find_urls(twitt_text)
            for url in urls_list:
                if PRE_LINK in url:
                    link_tweets.append(twitt_text)
print(link_tweets)
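The filtering step can also be tried offline. The sketch below assumes the payload shape implied by the code above (a "data" list whose items hold a "tweets" list with "tweet_text" strings), swaps urlextract for a simple regular expression from the standard library, and collects the matching URLs rather than the tweet texts. The function name extract_nvd_links, the URL pattern, and the sample payload (including the placeholder CVE identifier) are made up for illustration.

```python
import re

PRE_LINK = "https://nvd.nist.gov/"
# Simplified URL pattern (an assumption); urlextract handles many more cases.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_nvd_links(payload):
    """Return the unique NVD URLs found in the tweet texts of `payload`."""
    links = []
    for item in payload.get("data") or []:
        # Missing or null "tweets" entries are treated as empty lists.
        for tweet in item.get("tweets") or []:
            text = tweet.get("tweet_text") or ""
            for url in URL_PATTERN.findall(text):
                if url.startswith(PRE_LINK) and url not in links:
                    links.append(url)
    return links


# Made-up payload with a placeholder CVE identifier.
sample_payload = {
    "data": [
        {
            "tweets": [
                {"tweet_text": "Details: https://nvd.nist.gov/vuln/detail/CVE-0000-0000 and https://example.com/post"},
                {"tweet_text": "No links here"},
            ]
        },
        {"tweets": None},
    ]
}

print(extract_nvd_links(sample_payload))
```

Guarding each lookup with "or []" avoids the same kind of NoneType error when a field is missing from the response.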
This happens because soup.find("div", class_="tweet_text") finds nothing, so it returns None. The site you're trying to scrape is populated using JavaScript, so when you send a GET request to it, this is what you get back:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<title>
CVE Trends - crowdsourced CVE intel
</title>
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="description"/>
<meta content="trending CVEs, CVE intel, CVE trends" name="keywords"/>
<meta content="CVE Trends - crowdsourced CVE intel" name="title" property="og:title">
<meta content="Simon Bell" name="author"/>
<meta content="website" property="og:type">
<meta content="https://cvetrends.com/images/cve-trends.png" name="image" property="og:image">
<meta content="https://cvetrends.com" property="og:url">
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." property="og:description"/>
<meta content="en_GB" property="og:locale"/>
<meta content="en_US" property="og:locale:alternative"/>
<meta content="CVE Trends" property="og:site_name"/>
<meta content="summary_large_image" name="twitter:card"/>
<meta content="#SimonByte" name="twitter:creator"/>
<meta content="CVE Trends - crowdsourced CVE intel" name="twitter:title"/>
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="twitter:description"/>
<meta content="https://cvetrends.com/images/cve-trends.png" name="twitter:image"/>
<link href="https://cvetrends.com/favicon.ico" id="favicon" rel="icon" sizes="32x32"/>
<link href="https://cvetrends.com/apple-touch-icon.png" id="apple-touch-icon" rel="apple-touch-icon"/>
<link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/5.1.0/css/bootstrap.min.css" rel="stylesheet"/>
</meta>
</meta>
</meta>
</meta>
</head>
<body>
<div id="root">
</div>
<noscript>
Please enable JavaScript to run this app.
</noscript>
<script src="https://cvetrends.com/js/main.d0aa7136854f54748577.bundle.js">
</script>
</body>
</html>
You can verify this using print(soup.prettify()).
To be able to scrape this site you'll probably have to use something like Selenium.

How do I eliminate extra line in Python multi-line string?

I made a basic HTML markup string in Python and split it over multiple lines, but I ran into a problem. I want the string to appear like this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Untitled</title>
</head>
<body>
</body>
</html>
So I created a string in python and this is what it looks like:
HTML_Basic_Markup = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Untitled</title>
</head>
<body>
</body>
</html>
"""
When I print HTML_Basic_Markup I get an extra blank line at the top, so to fix this I did this:
HTML_Basic_Markup = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Untitled</title>
</head>
<body>
</body>
</html>
"""
However, I want to make the code look neat and want the Doctype to be aligned with the rest of the code, so how would I remove the line which is created at the top?
String objects support a strip method that removes leading and trailing whitespace (including newlines) by default, or the specific characters you pass to it.
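A short sketch of the options, using a shortened version of the markup from the question: strip() removes whitespace at both ends, lstrip("\n") removes only the leading newline, and a backslash right after the opening quotes prevents the newline from being added in the first place. The variable names stripped, leading_only, and escaped are introduced here for illustration.

```python
HTML_Basic_Markup = """
<!DOCTYPE html>
<html lang="en">
<body>
</body>
</html>
"""

# Remove whitespace (including newlines) from both ends.
stripped = HTML_Basic_Markup.strip()

# Remove only the leading newline and keep the trailing one.
leading_only = HTML_Basic_Markup.lstrip("\n")

# A backslash after the opening quotes continues the line,
# so no newline is stored at the start of the string.
escaped = """\
<!DOCTYPE html>
<html lang="en">
<body>
</body>
</html>
"""

print(stripped)
```

With the backslash form, the doctype stays aligned with the rest of the markup in the source file and no call is needed afterwards.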

How to match the first part of string with the two same substring?

I have text as below,
<meta name="description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。">
<meta name="Keywords" content="天気,天気予報,気象,情報,台風,地震,津波,週間,ウェザー,ウェザーニュース,ウェザーニューズ,今日の天気,明日の天気"><meta property="og:type" content="article">
<meta property="og:title" content="【天地始粛】音や景色から感じる秋の気配"><meta property="og:description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta property="og:url" content="https://weathernews.jp/s/topics/201807/300285/">
<meta property="og:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
<meta name="twitter:title" content="【天地始粛】音や景色から感じる秋の気配">
<meta name="twitter:description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
<link rel="canonical" href="https://weathernews.jp/s/topics/201807/300285/">
<link rel="amphtml" href="https://weathernews.jp/s/topics/201807/300285/amp.html">
<script async="async" src="https://www.googletagservices.com/tag/js/gpt.js">
I used pattern = re.compile(r'(https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+)\?[0-9]+') to match it, and I want to get https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg, but I got
https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><meta name="twitter:title" content="【天地始粛】音や景色から感じる秋の気配"><meta name="twitter:description content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta name="twitter:image" content="https://smtgs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg
How can I modify my regex pattern?
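You may try the pattern below. In your pattern, .+ is greedy, so it runs on to the last ?digits in the text. This pattern instead captures the URL up to and including the file extension. Note that [jpng]{3} matches any three of the letters j, p, n, g, which covers jpg and png: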
(https:\/\/smtgvs\.weathernews\.jp\/s\/topics\/img\/\d+\/\w+\.[jpng]{3})
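To see the difference in Python, the sketch below runs the original greedy pattern and a bounded alternative on a shortened, single-line version of the input. The bounded pattern, which replaces .+ with [^"?]+ so that the match cannot cross a quote or a question mark, is an alternative suggested here and not part of the answer above.

```python
import re

# Two of the tags from the question, joined on one line as in the reported output.
text = (
    '<meta property="og:image" content="https://smtgvs.weathernews.jp/s/topics/img/'
    '201807/201807300285_sns_img_A.jpg?1532940869">'
    '<meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/'
    '201807/201807300285_sns_img_A.jpg?1532940869">'
)

# Original pattern: the greedy .+ extends to the last "?digits" in the text.
greedy = re.compile(r'(https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+)\?[0-9]+')

# Bounded alternative: [^"?]+ stops at the first quote or question mark.
bounded = re.compile(r'(https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/[^"?]+)\?[0-9]+')

greedy_result = greedy.search(text).group(1)
bounded_results = bounded.findall(text)

print(greedy_result)
print(bounded_results)
```

Because both tags carry the same image URL, findall returns it twice; wrapping the result in set() or taking the first item gives a single value.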

Python Selenium Get PageSource of XHTML

I was wondering if there is a way to print the entire HTML source. I am trying to verify some text in a PDF/XHTML pop-up and cannot get to it. My hope is to get the entire page source and verify that the text is in there. However, .page_source seems to give me only the URL and description, and I am looking to get each line of code.
A possible approach is to have Selenium find the root element of the page (html) and read its outerHTML, which contains the full markup.
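from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://stackoverflow.com/")
# find_element_by_tag_name was removed in Selenium 4; use find_element with By.TAG_NAME
print(driver.find_element(By.TAG_NAME, "html").get_attribute('outerHTML'))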
Output example:
<html webdriver="true"><head>
<title>Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com">
<meta property="og:type" content="website">
<meta name="description" content="Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers">
<meta property="og:image" itemprop="image primaryImageOfPage" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon#2.png?v=73d79a89bded">
<meta name="twitter:title" property="og:title" itemprop="title name" content="Stack Overflow">
<meta name="twitter:description" property="og:description" itemprop="description" content="Q&A for professional and enthusiast programmers">
<meta property="og:url" content="http://stackoverflow.com/">
......
