How to convert HTML to text in Python?

I know there are a lot of answers on this question, but many of them are outdated, and when I found one that "worked", it did not work well enough.
This is my current code:
import requests
from bs4 import BeautifulSoup
url = "http://example.com"
req = requests.get(url)
html = req.text
PlainText = BeautifulSoup(html, 'lxml')
print (PlainText.get_text())
This is the output I get:
Example Domain
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...
This is the output I want:
Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
More information...
How can I get only the text I can read printed out from a website?

Something like this should work, as long as the "Plain text" part doesn't contain the character '}'.
import requests
from bs4 import BeautifulSoup
url = "http://example.com"
req = requests.get(url)
html = req.text
PlainText = BeautifulSoup(html, 'lxml')
text = PlainText.get_text()
split = text.split('}')
withoutCss = split[-1]
print (withoutCss)

Here is a Python program with a function that removes everything between each < and > and returns only the text that lies outside the tags.
def striphtmltags(s):
    b=True  # True while we are outside a tag
    r=''
    for i in range(0, len(s)):
        if(s[i]=='<'): b=False
        if(b): r+=s[i]
        if(s[i]=='>'): b=True
    return(r.strip())
html="<html><body><h1>this is the header</h1>this is the main body<font color=blue>this is blue</font><h6>this is the footer</h6></body></html>"
text=striphtmltags(html)
print("text:", text)
This produces:
text: this is the headerthis is the main bodythis is bluethis is the footer
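Both approaches above have a weak spot: splitting on '}' breaks if the page text itself contains that character, and stripping tags character by character keeps the CSS that sits inside <style>. One possible alternative is to skip the contents of <style> and <script> while parsing. The following is only a minimal sketch using the standard library's html.parser; the names VisibleTextExtractor and html_to_text and the sample string are made up for illustration, and it has not been tuned for malformed pages.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <style> or <script>."""

    SKIPPED_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the readable text of an HTML string, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


# Hypothetical sample, loosely modelled on the example.com page.
sample = (
    "<html><head><title>Example Domain</title>"
    "<style>body { margin: 0; }</style></head>"
    "<body><h1>Example Domain</h1>"
    "<p>This domain is for use in examples.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

With the question's code, the string passed to html_to_text would be req.text. Note that the <title> text is still included, so the title appears once from the head and once from the <h1>.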

Related

How do I download a csv file from github? [duplicate]

Let's say there's a file that lives in a GitHub repo:
https://github.com/someguy/brilliant/blob/master/somefile.txt
I'm trying to use requests to fetch this file and write its contents to disk in the current working directory, where it can be used later. Right now, I'm using the following code:
import requests
from os import getcwd
url = "https://github.com/someguy/brilliant/blob/master/somefile.txt"
directory = getcwd()
filename = directory + 'somefile.txt'
r = requests.get(url)
f = open(filename,'w')
f.write(r.content)
Undoubtedly ugly, and more importantly, not working. Instead of the expected text, I get:
<!DOCTYPE html>
<!--
Hello future GitHubber! I bet you're here to remove those nasty inline styles,
DRY up these templates and make 'em nice and re-usable, right?
Please, don't. https://github.com/styleguide/templates/2.0
-->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<title>Page not found · GitHub</title>
<style type="text/css" media="screen">
body {
background: #f1f1f1;
font-family: "HelveticaNeue", Helvetica, Arial, sans-serif;
text-rendering: optimizeLegibility;
margin: 0; }
.container { margin: 50px auto 40px auto; width: 600px; text-align: center; }
a { color: #4183c4; text-decoration: none; }
a:visited { color: #4183c4 }
a:hover { text-decoration: none; }
h1 { letter-spacing: -1px; line-height: 60px; font-size: 60px; font-weight: 100; margin: 0px; text-shadow: 0 1px 0 #fff; }
p { color: rgba(0, 0, 0, 0.5); margin: 20px 0 40px; }
ul { list-style: none; margin: 25px 0; padding: 0; }
li { display: table-cell; font-weight: bold; width: 1%; }
#error-suggestions { font-size: 14px; }
#next-steps { margin: 25px 0 50px 0;}
#next-steps li { display: block; width: 100%; text-align: center; padding: 5px 0; font-weight: normal; color: rgba(0, 0, 0, 0.5); }
#next-steps a { font-weight: bold; }
.divider { border-top: 1px solid #d5d5d5; border-bottom: 1px solid #fafafa;}
#parallax_wrapper {
position: relative;
z-index: 0;
}
#parallax_field {
overflow: hidden;
position: absolute;
left: 0;
top: 0;
height: 370px;
width: 100%;
}
etc etc.
Content from Github, but not the content of the file. What am I doing wrong?

The content of the file in question is included in the returned data. You are getting the full GitHub view of that file, not just the contents.
If you want to download just the file, you need to use the Raw link at the top of the page, which will be (for your example):
https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt
Note the change in domain name, and the blob/ part of the path is gone.
To demonstrate this with the requests GitHub repository itself:
>>> import requests
>>> r = requests.get('https://github.com/kennethreitz/requests/blob/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
>>> r = requests.get('https://raw.githubusercontent.com/kennethreitz/requests/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/plain; charset=utf-8'
>>> print(r.text)
Requests: HTTP for Humans
=========================
.. image:: https://travis-ci.org/kennethreitz/requests.png?branch=master
[... etc. ...]
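The URL change described above (different domain, no blob/ segment) can also be done in code. This is a rough sketch using only urllib.parse; the function name github_blob_to_raw is made up here, and it assumes the URL has the user/repo/blob/branch/path shape shown in the question.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url):
    """Turn a github.com ".../blob/..." page URL into its raw-content URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: user/repo/blob/branch/path/to/file
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a GitHub blob URL: %s" % url)
    # Drop the "blob" segment and switch to the raw-content host.
    raw_path = "/" + "/".join(segments[:2] + segments[3:])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


page_url = "https://github.com/someguy/brilliant/blob/master/somefile.txt"
print(github_blob_to_raw(page_url))
```

The result can then be passed to requests.get() exactly as in the session above.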

You need to request the raw version of the file, from https://raw.githubusercontent.com.
See the difference:
https://raw.githubusercontent.com/django/django/master/setup.py vs. https://github.com/django/django/blob/master/setup.py
Also, you should probably add a / between your directory and the filename:
>>> getcwd()+'foo.txt'
'/Users/burhanfoo.txt'
>>> import os
>>> os.path.join(getcwd(),'foo.txt')
'/Users/burhan/foo.txt'
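Putting the path fix together with the write step: r.content is bytes, so under Python 3 the file should be opened in binary mode, and a with block makes sure it gets closed. A minimal sketch follows; save_download is a made-up helper name, and the bytes literal stands in for requests.get(raw_url).content so the example runs without network access.

```python
import os
import tempfile


def save_download(content, filename, directory=None):
    """Write downloaded bytes to directory/filename and return the full path."""
    if directory is None:
        directory = os.getcwd()
    path = os.path.join(directory, filename)
    # "wb" because response.content is bytes; the with block closes the file.
    with open(path, "wb") as f:
        f.write(content)
    return path


# Stand-in for `requests.get(raw_url).content` (assumption for illustration).
fake_content = b"hello from somefile.txt\n"
target_dir = tempfile.mkdtemp()
saved_path = save_download(fake_content, "somefile.txt", target_dir)
print(saved_path)
```

Leaving out the directory argument writes to the current working directory, which is what the question asked for.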

Just as an update, https://raw.github.com was migrated to https://raw.githubusercontent.com. So the general format is:
url = "https://raw.githubusercontent.com/user/repo/branch/[subfolders]/file"
E.g. https://raw.githubusercontent.com/earnestt1234/seedir/master/setup.py. Still use requests.get(url) as in Martijn's answer.

Adding a working example ready for copy+paste:
import requests
from requests.structures import CaseInsensitiveDict
url = "https://raw.githubusercontent.com/organization/repo/branch/folder/file"
# If repo is private - we need to add a token in header:
headers = CaseInsensitiveDict()
headers["Authorization"] = "token TOKEN"
resp = requests.get(url, headers=headers)
print(resp.status_code)
(*) If repo is not private - remove the headers part.
Bonus:
Check out this curl <--> Python-requests online converter.

How to parse specific values from a CSS URL

I'm trying to parse some specific hex color values from a css URL (not all the contents), but don't know how to figure that out using Python.
The URL looks like the below:
https://abcdomain.com/styles/theme.css
And its contents are :
@charset "UTF-8";
/* CSS Document */
.bg-primary {
background-color: #2ccfff;
color: white;
}
.bg-success {
background-color: #8b88ff;
color: white;
}
.bg-info {
background-color: #66ccff;
color: white;
}
.bg-warning {
background-color: #ff9900;
color: white;
}
.bg-danger {
background-color: #7bb31a;
color: white;
}
.bg-orange {
background-color: #f98e33;
color: white;
}
I just need to parse the "background-color" hex values for specific entries, starting from "warning" until "orange" ONLY.
I tried urllib.request, but it didn't work properly for me.
I would be very grateful if anyone could help me get these values using a Python script.
Thanks,
Ahmed

I added an extra 'f' to your CSS code, because it didn't validate.
You can download a file using requests and parse CSS using cssutils. The following code finds all background-color instances and puts them in a dict with the CSS selector.
import requests
import cssutils
# Use this instead of requests if you want to read from a local file
# css = open('test.css').read()
url = 'https://abcdomain.com/styles/theme.css'
r = requests.get(url)
css = r.content
sheet = cssutils.parseString(css)
results = {}
for rule in sheet:
    if rule.type == rule.STYLE_RULE:
        for prop in rule.style:
            if prop.name == 'background-color':
                results[rule.selectorText] = prop.value
print(results)
This prints the following result:
{
'.bg-primary': '#2ccfff',
'.bg-success': '#8b88ff',
'.bg-info': '#6cf',
'.bg-warning': '#f90',
'.bg-danger': '#7bb31a',
'.bg-orange': '#f98e33'
}
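The result above contains every rule, while the question only asked for the entries from "warning" through "orange". If installing cssutils is not an option, a regex over the stylesheet text may be enough for a file this simple. The sketch below is not a real CSS parser: it assumes plain class selectors with one flat block each, and the names RULE_RE, BG_RE and background_colors are made up for illustration. SAMPLE_CSS is a shortened copy of the stylesheet from the question; in practice it would be r.text.

```python
import re

SAMPLE_CSS = """
@charset "UTF-8";
/* CSS Document */
.bg-primary { background-color: #2ccfff; color: white; }
.bg-success { background-color: #8b88ff; color: white; }
.bg-info { background-color: #66ccff; color: white; }
.bg-warning { background-color: #ff9900; color: white; }
.bg-danger { background-color: #7bb31a; color: white; }
.bg-orange { background-color: #f98e33; color: white; }
"""

# A class selector followed by one { ... } block without nested braces.
RULE_RE = re.compile(r"(\.[\w-]+)\s*\{([^}]*)\}")
BG_RE = re.compile(r"background-color\s*:\s*(#[0-9a-fA-F]{3,8})")


def background_colors(css, first, last):
    """Return {selector: hex} for the rules from `first` through `last`."""
    results = {}
    collecting = False
    for selector, body in RULE_RE.findall(css):
        if selector == first:
            collecting = True
        if collecting:
            match = BG_RE.search(body)
            if match:
                results[selector] = match.group(1)
        if selector == last:
            break
    return results


print(background_colors(SAMPLE_CSS, ".bg-warning", ".bg-orange"))
```

Unlike cssutils, this keeps the hex values exactly as written in the file (for example #ff9900 rather than #f90).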

Beautiful soup returns nothing

This is the HTML code:
<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>
I am trying to print 42263 - Unencrypted Telnet Server using Beautiful Soup, but the output is an empty list, i.e. []
This is my Python code:
from bs4 import BeautifulSoup
import csv
import urllib.request as urllib2
with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')
divs = soup.find_all('div', attrs={'background':'#fdc431'})
print(divs)

background is not an attribute of the div tag. The attributes of the div tag are:
{'xmlns': '', 'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'}
So, either you'll have to use
soup.find_all('div', attrs={'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'})
or, you can use the lambda function to check if background: #fdc431 is in the style attribute value, like this:
soup = BeautifulSoup('<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>', 'html.parser')
print(soup.find(lambda t: t.name == 'div' and 'background: #fdc431' in t.get('style', '')).text)
# 42263 - Unencrypted Telnet Server
or, you can use RegEx, as shown by Jatimir in his answer.

Solution with regexes:
from bs4 import BeautifulSoup
import re
with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')
Let's find the div whose style matches the regular expression background:\s*#fdc431;. \s matches a single Unicode whitespace character. I assumed there can be zero or more whitespace characters after the colon, so I added the * quantifier, which matches zero or more repetitions of the preceding expression. Regexes are worth reading up on, as they sometimes come in handy, and an online regex tester helps when experimenting.
div = soup.find('div', attrs={'style': re.compile(r'background:\s*#fdc431;')})
This however is equivalent to:
div = soup.find('div', style=re.compile(r'background:\s*#fdc431;'))
You can read about that in the official BeautifulSoup documentation.
The sections about the kinds of filters you can pass to find and similar methods are also worth reading.
You can supply a string, a regular expression, a list, True, or a function, as shown by Keyur Potdar in his answer.
Assuming the div exists we can get its text by:
>>> div.text
'42263 - Unencrypted Telnet Server'
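Both answers match against the raw style string, which depends on the exact spacing and order of the declarations. A slightly more tolerant check is to split the style into property/value pairs first. The following is a small standard-library sketch; parse_style and has_background are made-up names, and splitting on ';' is a simplification that would break on values that themselves contain a semicolon.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skips the empty piece after the trailing ';'
        name, value = declaration.split(":", 1)
        declarations[name.strip().lower()] = value.strip()
    return declarations


def has_background(style, color):
    """True if the style's background property equals `color`."""
    # `style` may be None for tags that have no style attribute.
    return parse_style(style or "").get("background", "").lower() == color.lower()


style = (
    "box-sizing: border-box; width: 100%; margin: 0 0 10px 0; "
    "padding: 5px 10px; background: #fdc431; font-weight: bold; "
    "font-size: 14px; line-height: 20px; color: #fff;"
)
print(has_background(style, "#fdc431"))
```

Since Beautiful Soup accepts a function as an attribute filter, this could presumably be used as soup.find_all('div', style=lambda s: has_background(s, '#fdc431')).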

html script csv to html table

I don't know anything about HTML. Somehow I got this code to convert CSV to HTML.
Below is the code:
import sys
import csv
# Open the CSV file for reading
def populate_table(csv_fl):
    reader = csv.reader(open(csv_fl))
    # Create the HTML file for output
    html_table = ''
    # initialize rownum variable
    rownum = 0
    # write <table> tag
    html_table = '<table>\n'
    # generate table contents
    for row in reader:  # Read a single row from the CSV file
        # write header row. assumes first row in csv contains header
        if rownum == 0:
            html_table += '<tr>\n'  # write <tr> tag
            for column in row:
                html_table += '<th>' + column + '</th>\n'
            html_table += '</tr>\n'
        # write all other rows
        else:
            html_table += '<tr>\n'
            for column in row:
                if 'fail' in column or 'Fail' in column:
                    html_table += "<td style='color:red'>" + column + '</td>\n'
                    continue
                html_table += '<td>' + column + '</td>\n'
            html_table += '</tr>\n'
        # increment row count
        rownum += 1
    # write </table> tag
    html_table += '</table>\n'
    return html_table
With the code above, any cell whose text contains Fail or fail is colored red.
I need help making the whole row red (not just the single cell).
Below is the code that fills in the HTML.
I execute it like this:
python2.7 fil.py test.csv test.html
import csv2html
import sys
class Sketch:
    def __init__(self):
        """
        Returns html sketch for a defined scenario
        Scenarios accessible as functions.
        supported ones are:
        -fail
        -pass
        -status_update
        -final
        """
    def _style(self):
        body = """
        <style>
        p {
            font-family : Calibri;
            font-size: 14px;
            font-weight: bolder;
            text-align : left;
        }
        p.fade {
            color : #CCCCCC;
            font-size: 14px;
        }
        em {
            font-style : italic ;
            font-size : 16px;
            font-weight: lighter ;
        }
        em.pass {
            font-style : italic ;
            font-size : 16px;
            color: green ;
        }
        em.fail {
            font-style : italic ;
            font-size : 16px;
            color: red ;
        }
        a {
            text-decoration: none;
        }
        a:hover {
            text-decoration: underline;
        }
        hr {
            align: left ;
            margin-left: 0px ;
            width: 500px;
            height:1px;
        }
        table {
            border-collapse: collapse;
        }
        tr {
            padding: 4px;
            text-align: center;
            border-right:2px solid #FFFFFF;
        }
        tr:nth-child(even){background-color: #f2f2f2}
        th {
            background-color: #cceeff;
            color: black;
            padding: 4px;
            border-right:2px solid #FFFFFF;
        }
        </style>
        """
        return body
    def _start(self):
        return """
        <!DOCTYPE html>
        <html>
        """
    def _end(self):
        body = """
        <hr/>
        <p class="fade">Note: Link might be disabled,
        please put me in safe sender list, by right
        click on message.
        This is a system generated mail, please don't
        respond to it.</p>
        </html>
        """
        return body
    def _fail(self):
        body = """
        <p>STATUS :
        <em class="fail">failed</em>
        </p>
        """
        return body
    def _critical_fail(self):
        str_ = 'Failure is critical, terminating the run.'
        body = """
        <p>
        <em class="fail">%s</em>
        </p>
        """ % str_
        return body
    def _pass(self):
        body = """
        <p>STATUS :
        <em class="pass">passed</em>
        </p>
        """
        return body
    def _type(self, title, val):
        body = """
        <p>%s :
        <em>%s</em>
        </p>
        """ % (title.upper(), val)
        return body
    def _loglink(self, logs):
        body = """ <p> LOGS :</p>
        <a href=%s>%s</a>
        """ % (logs, logs)
        return body
    def render(self, test_id, descr, platform=None, pass_=True,
               logs=None, critical=False):
        body = self._start() + \
            self._style() + \
            self._type("test id", test_id) + \
            self._type("description", descr) + \
            self._type("platform", platform)
        if pass_ == True:
            body += self._pass()
        else:
            body += self._fail()
            if critical:
                body += self._critical_fail()
        body += self._loglink(logs)
        body += self._end()
        return body
    def status_update(self):
        pass
    def final(self, logs):
        # was "body += ...", which fails because body is not defined yet
        body = self._end()
        return body
def add_html_header(csv_fl, fname):
    """ html data returned by sqlite needs to be enclosed in
    some of the mandatory tags for the web to parse it
    properly. ! """
    sketch = Sketch()
    content = """
    %s %s
    <body>
    %s
    </body>
    </html>
    """ % (sketch._start(), sketch._style(), csv2html.populate_table(csv_fl))
    open(fname, 'w').write(content)
if len(sys.argv) < 3:
    print("Usage: csvToTable.py csv_file html_file")
    sys.exit(1)
csv_fl = sys.argv[1]
html_fl = sys.argv[2]
add_html_header(csv_fl, html_fl)

To color the whole row red, put the style on the row itself instead of on a cell:
use <tr style="color:red"> for each <tr> you want to color.
p {
font-family: Calibri;
font-size: 14px;
font-weight: bolder;
text-align: left;
}
p.fade {
color: #CCCCCC;
font-size: 14px;
}
em {
font-style: italic;
font-size: 16px;
font-weight: lighter;
}
em.pass {
font-style: italic;
font-size: 16px;
color: green;
}
em.fail {
font-style: italic;
font-size: 16px;
color: red;
}
a {
text-decoration: none;
}
a:hover {
text-decoration: underline;
}
hr {
align: left;
margin-left: 0px;
width: 500px;
height: 1px;
}
table {
border-collapse: collapse;
}
tr {
padding: 4px;
text-align: center;
border-right: 2px solid #FFFFFF;
}
tr:nth-child(even) {
background-color: #f2f2f2
}
th {
background-color: #cceeff;
color: black;
padding: 4px;
border-right: 2px solid #FFFFFF;
}
<table>
<tr>
<th>AAA</th>
<th>BBB</th>
</tr>
<tr>
<td>CCC</td>
<td>DDD</td>
</tr>
<tr style="color:red"> <!-- Here -->
<td>EEE</td>
<td>FFF</td>
</tr>
<tr>
<td>GGG</td>
<td>HHH</td>
</tr>
</table>
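On the Python side, this means deciding about the color once per row, before the <tr> tag is written, rather than per cell. A possible sketch of how the body-row branch of populate_table could be factored out is shown below; row_to_html is a made-up helper name, the case-insensitive match (which also catches FAIL) and the html.escape call are additions that are not in the original code.

```python
from html import escape


def row_to_html(row):
    """Render one CSV row as a <tr>, colouring the whole row red on failure."""
    # Look at every cell first, so the style can go on the <tr> itself.
    failed = any("fail" in cell.lower() for cell in row)
    open_tag = "<tr style='color:red'>\n" if failed else "<tr>\n"
    cells = "".join("<td>" + escape(cell) + "</td>\n" for cell in row)
    return open_tag + cells + "</tr>\n"


print(row_to_html(["test_1", "Fail"]))
print(row_to_html(["test_2", "Pass"]))
```

Inside populate_table, the else branch would then shrink to html_table += row_to_html(row).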
