How to parse specific values from a CSS URL - python

I'm trying to parse some specific hex color values from a css URL (not all the contents), but don't know how to figure that out using Python.
The URL looks like the below:
https://abcdomain.com/styles/theme.css
And its contents are :
@charset "UTF-8";
/* CSS Document */
.bg-primary {
background-color: #2ccfff;
color: white;
}
.bg-success {
background-color: #8b88ff;
color: white;
}
.bg-info {
background-color: #66ccff;
color: white;
}
.bg-warning {
background-color: #ff9900;
color: white;
}
.bg-danger {
background-color: #7bb31a;
color: white;
}
.bg-orange {
background-color: #f98e33;
color: white;
}
I only need the "background-color" hex values for specific entries: from "warning" through "orange" ONLY.
I tried urllib.request, but it didn't give me accurate results.
I would be very grateful if anyone could help me get these values using a Python script.
Thanks,
Ahmed

I added an extra 'f' to your CSS code, because it didn't validate.
You can download a file using requests and parse CSS using cssutils. The following code finds all background-color instances and puts them in a dict with the CSS selector.
import requests
import cssutils

# Use this instead of requests if you want to read from a local file
# css = open('test.css').read()
url = 'https://abcdomain.com/styles/theme.css'
r = requests.get(url)
css = r.content

sheet = cssutils.parseString(css)
results = {}
for rule in sheet:
    # Only style rules (selector + declarations) carry properties
    if rule.type == rule.STYLE_RULE:
        for prop in rule.style:
            if prop.name == 'background-color':
                results[rule.selectorText] = prop.value
print(results)
This prints the following result:
{
'.bg-primary': '#2ccfff',
'.bg-success': '#8b88ff',
'.bg-info': '#6cf',
'.bg-warning': '#f90',
'.bg-danger': '#7bb31a',
'.bg-orange': '#f98e33'
}
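The question asks only for the entries from warning through orange, which the dict above does not narrow down. One possible way to do that, sketched here, is to slice the dict by key position. This assumes Python 3.7+ (dicts keep insertion order) and that the rules appear in the stylesheet in the order shown; `slice_by_selector` is a hypothetical helper name, not part of cssutils.

```python
def slice_by_selector(colors, first, last):
    """Return the entries of `colors` from key `first` through key `last`, inclusive.

    Hypothetical helper: relies on dict insertion order (Python 3.7+).
    Raises ValueError if either selector is missing.
    """
    keys = list(colors)
    start = keys.index(first)
    stop = keys.index(last)
    return {key: colors[key] for key in keys[start:stop + 1]}


# The dict printed by the cssutils loop above
results = {
    '.bg-primary': '#2ccfff',
    '.bg-success': '#8b88ff',
    '.bg-info': '#6cf',
    '.bg-warning': '#f90',
    '.bg-danger': '#7bb31a',
    '.bg-orange': '#f98e33',
}

wanted = slice_by_selector(results, '.bg-warning', '.bg-orange')
print(wanted)
# {'.bg-warning': '#f90', '.bg-danger': '#7bb31a', '.bg-orange': '#f98e33'}
```

If the order of the rules in the file is not guaranteed, filtering on an explicit list of selectors would be the safer choice.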

Related

how to I download csv file from github [duplicate]

Let's say there's a file that lives in this GitHub repo:
https://github.com/someguy/brilliant/blob/master/somefile.txt
I'm trying to use requests to request this file, write the content of it to disk in the current working directory where it can be used later. Right now, I'm using the following code:
import requests
from os import getcwd
url = "https://github.com/someguy/brilliant/blob/master/somefile.txt"
directory = getcwd()
filename = directory + 'somefile.txt'
r = requests.get(url)
f = open(filename,'w')
f.write(r.content)
Undoubtedly ugly, and more importantly, not working. Instead of the expected text, I get:
<!DOCTYPE html>
<!--
Hello future GitHubber! I bet you're here to remove those nasty inline styles,
DRY up these templates and make 'em nice and re-usable, right?
Please, don't. https://github.com/styleguide/templates/2.0
-->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<title>Page not found · GitHub</title>
<style type="text/css" media="screen">
body {
background: #f1f1f1;
font-family: "HelveticaNeue", Helvetica, Arial, sans-serif;
text-rendering: optimizeLegibility;
margin: 0; }
.container { margin: 50px auto 40px auto; width: 600px; text-align: center; }
a { color: #4183c4; text-decoration: none; }
a:visited { color: #4183c4 }
a:hover { text-decoration: none; }
h1 { letter-spacing: -1px; line-height: 60px; font-size: 60px; font-weight: 100; margin: 0px; text-shadow: 0 1px 0 #fff; }
p { color: rgba(0, 0, 0, 0.5); margin: 20px 0 40px; }
ul { list-style: none; margin: 25px 0; padding: 0; }
li { display: table-cell; font-weight: bold; width: 1%; }
#error-suggestions { font-size: 14px; }
#next-steps { margin: 25px 0 50px 0;}
#next-steps li { display: block; width: 100%; text-align: center; padding: 5px 0; font-weight: normal; color: rgba(0, 0, 0, 0.5); }
#next-steps a { font-weight: bold; }
.divider { border-top: 1px solid #d5d5d5; border-bottom: 1px solid #fafafa;}
#parallax_wrapper {
position: relative;
z-index: 0;
}
#parallax_field {
overflow: hidden;
position: absolute;
left: 0;
top: 0;
height: 370px;
width: 100%;
}
etc etc.
Content from Github, but not the content of the file. What am I doing wrong?
The content of the file in question is included in the returned data. You are getting the full GitHub view of that file, not just the contents.
If you want to download just the file, you need to use the Raw link at the top of the page, which will be (for your example):
https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt
Note the change in domain name, and the blob/ part of the path is gone.
To demonstrate this with the requests GitHub repository itself:
>>> import requests
>>> r = requests.get('https://github.com/kennethreitz/requests/blob/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
>>> r = requests.get('https://raw.githubusercontent.com/kennethreitz/requests/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/plain; charset=utf-8'
>>> print(r.text)
Requests: HTTP for Humans
=========================
.. image:: https://travis-ci.org/kennethreitz/requests.png?branch=master
[... etc. ...]
You need to request the raw version of the file, from https://raw.githubusercontent.com.
See the difference:
https://raw.githubusercontent.com/django/django/master/setup.py vs. https://github.com/django/django/blob/master/setup.py
Also, you should probably add a / between your directory and the filename:
>>> getcwd()+'foo.txt'
'/Users/burhanfoo.txt'
>>> import os
>>> os.path.join(getcwd(),'foo.txt')
'/Users/burhan/foo.txt'
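Putting the two points together (use the raw URL, and join the path properly), a minimal sketch could look like the following. `to_raw_url` and `save_bytes` are hypothetical helper names introduced for illustration, and the URL rewrite assumes the simple `user/repo/blob/branch/path` layout described above; a branch name containing a slash would still produce a usable URL, because the segments after `blob` are passed through unchanged.

```python
import os
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(blob_url):
    """Turn a github.com '/user/repo/blob/branch/path' URL into its raw equivalent.

    Hypothetical helper: it only swaps the host and drops the 'blob' segment.
    """
    parts = urlsplit(blob_url)
    segments = parts.path.strip('/').split('/')
    # Expected layout: user / repo / 'blob' / branch / path...
    if parts.netloc != 'github.com' or len(segments) < 5 or segments[2] != 'blob':
        raise ValueError('not a GitHub blob URL: %r' % blob_url)
    raw_path = '/' + '/'.join(segments[:2] + segments[3:])
    return urlunsplit(('https', 'raw.githubusercontent.com', raw_path, '', ''))


def save_bytes(content, directory, name):
    """Write `content` (bytes) to directory/name and return the full path."""
    path = os.path.join(directory, name)
    # Binary mode, because requests' r.content is bytes
    with open(path, 'wb') as f:
        f.write(content)
    return path


print(to_raw_url('https://github.com/someguy/brilliant/blob/master/somefile.txt'))
# https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt
```

With requests installed, the download itself would then be something like `save_bytes(requests.get(to_raw_url(url)).content, os.getcwd(), 'somefile.txt')`. Opening the file in `'wb'` mode matters because `r.content` is bytes; with `'w'` you would write `r.text` instead.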
Just as an update, https://raw.github.com was migrated to https://raw.githubusercontent.com. So the general format is:
url = "https://raw.githubusercontent.com/user/repo/branch/[subfolders]/file"
E.g. https://raw.githubusercontent.com/earnestt1234/seedir/master/setup.py. Still use requests.get(url) as in Martijn's answer.
Adding a working example ready for copy+paste:
import requests
from requests.structures import CaseInsensitiveDict
url = "https://raw.githubusercontent.com/organization/repo/branch/folder/file"
# If repo is private - we need to add a token in header:
headers = CaseInsensitiveDict()
headers["Authorization"] = "token TOKEN"
resp = requests.get(url, headers=headers)
print(resp.status_code)
(*) If repo is not private - remove the headers part.
Bonus:
Check out this Curl < --> Python-requests online converter.

remove index & headers from dataframe while styling

I am reading an xlsx & creating a html while applying some style using jinja2
import pandas
import jinja2

df = pandas.read_excel('C:\\Users...\\2020.xlsx', 'TEST',
                       usecols = 'A:J')
pandas.set_option('precision', 2)
df_dropna = df.dropna(how = 'all')
df_fillna = df_dropna.fillna('')
#html = df_fillna.to_html(index=0,header=False,border=0)

def highlight(val):
    if (val in ('USERID','Name')) :
        return 'background-color: yellow'
    else:
        return 'background-color: white'

styler = (df_fillna.style.applymap(highlight))

# Template handling
env = jinja2.Environment(loader=jinja2.FileSystemLoader(searchpath=''))
template = env.get_template('template2.html')
outputText = template.render(my_table=styler.render())
html_file = open('trial.html', 'w')
html_file.write(outputText)
html_file.close()
Code works perfectly fine, except that I am not able to get rid of header & index. Anything that can help remove index & header? Please help!
The solution was not in the Python code; I had to put it in the CSS of my template.
Below is the code to hide the index & header:
table td.first { display: none;}
When calling the render method, override the head parameter with an empty list.
https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.render.html
Something like:
styler.render(head=[])
Adding this to the style.css in the Table settings worked for me.
th:nth-of-type(1){ display: none; }
table { margin-bottom: 0em; width:100%;}
th { font-weight: bold; text-align: center;}
thead th { background: #c3d9ff; }
th:nth-of-type(1){ display: none; }
th,td,caption { padding: 4px 5px 4px 5px; }

How to convert a Python string into its escaped version in C++?

I'm trying to write a Python program that reads in a file and prints the contents as a single string as it would be escaped in a C++ format. This is because the string will be copied from Python output and pasted into a C++ program (C++ string variable definition).
Basically, I want to convert
<!DOCTYPE html>
<html>
<style>
.card{
max-width: 400px;
min-height: 250px;
background: #02b875;
padding: 30px;
box-sizing: border-box;
color: #FFF;
margin:20px;
box-shadow: 0px 2px 18px -4px rgba(0,0,0,0.75);
}
</style>
<body>
<div class="card">
<h4>The ESP32 Update web page without refresh</h4><br>
<h1>Sensor Value:<span id="ADCValue">0</span></h1><br>
</div>
</body>
<script>
setInterval(function() {
// Call a function repetatively with 0.1 Second interval
getData();
}, 100); //100mSeconds update rate
function getData() {
var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function() {
if (this.readyState == 4 && this.status == 200) {
document.getElementById("ADCValue").innerHTML =
this.responseText;
}
};
xhttp.open("GET", "readADC", true);
xhttp.send();
}
</script>
</html>
to this
<!DOCTYPE html>\n<html>\n<style>\n.card{\n max-width: 400px;\n min-height: 250px;\n background: #02b875;\n padding: 30px;\n box-sizing: border-box;\n color: #FFF;\n margin:20px;\n box-shadow: 0px 2px 18px -4px rgba(0,0,0,0.75);\n}\n</style>\n\n<body>\n<div class=\"card\">\n <h4>The ESP32 Update web page without refresh</h4><br>\n <h1>Sensor Value:<span id=\"ADCValue\">0</span></h1><br>\n</div>\n</body>\n\n<script>\nsetInterval(function() {\n // Call a function repetatively with 0.1 Second interval\n getData();\n}, 100); //100mSeconds update rate\n\nfunction getData() {\n var xhttp = new XMLHttpRequest();\n xhttp.onreadystatechange = function() {\n if (this.readyState == 4 && this.status == 200) {\n document.getElementById(\"ADCValue\").innerHTML =\n this.responseText;\n }\n };\n xhttp.open(\"GET\", \"readADC\", true);\n xhttp.send();\n}\n</script>\n</html>
Using this Python program:
if __name__ == '__main__':
    with open(<filepath>) as html:
        contents = html.read().replace('"', r'\"')
    print(contents)
    print('')
    print(repr(contents))
I get exactly what I want minus double backslashes when "escaping" the double quotes. I've tried a few random things, but all the attempts either get rid of both backslashes or don't change the string at all.
I simply want to add a single backslash before all the double quotes in my string. Is this even possible in Python?
You can use str.translate to map the troublesome characters to their escaped equivalents. Since Python's rules on escape and quote characters can be a bit baroque, I've just brute-forced them for consistency.
# escapes for C literal strings
_c_str_trans = str.maketrans({"\n": "\\n", "\"":"\\\"", "\\":"\\\\"})

if __name__ == '__main__':
    with open(<filepath>) as html:
        contents = html.read().translate(_c_str_trans)
    print(contents)
    print('')
    print(repr(contents))
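On the "double backslashes" in the question: those come from repr(), which shows the string as a Python literal and therefore escapes every backslash again. The string itself holds only one backslash per quote, and print() shows it that way. A small sketch to illustrate, using a one-line sample instead of the file; `to_c_literal` and `C_ESCAPES` are hypothetical names, and the extra `\r` and `\t` entries are my own additions to the table above.

```python
# Translation table for C/C++ string literals (character -> escaped form).
# str.translate works character by character, so the backslashes it inserts
# are never escaped a second time.
C_ESCAPES = str.maketrans({
    '\\': '\\\\',
    '"': '\\"',
    '\n': '\\n',
    '\r': '\\r',
    '\t': '\\t',
})


def to_c_literal(text):
    """Return `text` wrapped in double quotes and escaped for pasting into C++."""
    return '"' + text.translate(C_ESCAPES) + '"'


sample = '<div class="card">\n'
escaped = to_c_literal(sample)

print(escaped)        # "<div class=\"card\">\n"   <- what you paste into C++
print(repr(escaped))  # '"<div class=\\"card\\">\\n"'  <- Python's own escaping
print(len('\\"'))     # 2: one backslash plus one quote
```

So the original replace('"', r'\"') was already adding a single backslash; only the repr() output made it look doubled.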

How to convert HTML to text in Python?

I know there are a lot of answers on this question, but many of them are outdated, and when I found one that "worked", it did not work well enough.
This is my current code:
import requests
from bs4 import BeautifulSoup
url = "http://example.com"
req = requests.get(url)
html = req.text
PlainText = BeautifulSoup(html, 'lxml')
print (PlainText.get_text())
This is the output I get:
Example Domain
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...
This is the output I want:
Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
More information...
How can I get only the text I can read printed out from a website?
Something like this should work, as long as the "Plain text" part doesn't contain the character '}'.
import requests
from bs4 import BeautifulSoup
url = "http://example.com"
req = requests.get(url)
html = req.text
PlainText = BeautifulSoup(html, 'lxml')
text = PlainText.get_text()
split = text.split('}')
withoutCss = split[len(split) - 1]
print (withoutCss)
Here is a Python program with a function that drops everything between each < and the following >, and returns only the text outside the tags. Note that it does not skip the contents of <style> or <script> elements.
def striphtmltags(s):
    b=True
    r=''
    for i in range(0, len(s)):
        if(s[i]=='<'): b=False
        if(b): r+=s[i]
        if(s[i]=='>'): b=True
    return(r.strip())
html="<html><body><h1>this is the header</h1>this is the main body<font color=blue>this is blue</font><h6>this is the footer</h6></body></html>"
text=striphtmltags(html)
print("text:", text)
This produces:
text: this is the headerthis is the main bodythis is bluethis is the footer
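Neither approach above really skips the stylesheet: one relies on '}' never appearing in the text, the other keeps whatever sits between <style> tags. As an alternative sketch, here is a version built on the standard library's html.parser instead of BeautifulSoup. It ignores the contents of <style>, <script> and <title>; `VisibleTextParser` and `visible_text` are hypothetical names, and skipping <title> is my assumption based on the desired output in the question.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside the SKIPPED elements."""

    SKIPPED = {'style', 'script', 'title'}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Collapse internal line breaks and runs of spaces
            self.chunks.append(' '.join(data.split()))


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.chunks)


sample = """<html><head><title>Example Domain</title>
<style>body { margin: 0; }</style></head>
<body><h1>Example Domain</h1><p>This domain is for use in
illustrative examples.</p></body></html>"""

print(visible_text(sample))
# Example Domain
# This domain is for use in illustrative examples.
```

With requests, you would pass `req.text` to `visible_text`. Each text node ends up on its own line, so inline tags such as <a> or <b> will split a sentence across lines; adjust the join if that matters for your pages.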
