How to download and write a file from Github using Requests - python

Lets say there's a file that lives at the github repo:
https://github.com/someguy/brilliant/blob/master/somefile.txt
I'm trying to use requests to request this file, write the content of it to disk in the current working directory where it can be used later. Right now, I'm using the following code:
import requests
from os import getcwd
url = "https://github.com/someguy/brilliant/blob/master/somefile.txt"
directory = getcwd()
filename = directory + 'somefile.txt'
r = requests.get(url)
f = open(filename,'w')
f.write(r.content)
Undoubtedly ugly, and more importantly, not working. Instead of the expected text, I get:
<!DOCTYPE html>
<!--
Hello future GitHubber! I bet you're here to remove those nasty inline styles,
DRY up these templates and make 'em nice and re-usable, right?
Please, don't. https://github.com/styleguide/templates/2.0
-->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<title>Page not found · GitHub</title>
<style type="text/css" media="screen">
body {
background: #f1f1f1;
font-family: "HelveticaNeue", Helvetica, Arial, sans-serif;
text-rendering: optimizeLegibility;
margin: 0; }
.container { margin: 50px auto 40px auto; width: 600px; text-align: center; }
a { color: #4183c4; text-decoration: none; }
a:visited { color: #4183c4 }
a:hover { text-decoration: none; }
h1 { letter-spacing: -1px; line-height: 60px; font-size: 60px; font-weight: 100; margin: 0px; text-shadow: 0 1px 0 #fff; }
p { color: rgba(0, 0, 0, 0.5); margin: 20px 0 40px; }
ul { list-style: none; margin: 25px 0; padding: 0; }
li { display: table-cell; font-weight: bold; width: 1%; }
#error-suggestions { font-size: 14px; }
#next-steps { margin: 25px 0 50px 0;}
#next-steps li { display: block; width: 100%; text-align: center; padding: 5px 0; font-weight: normal; color: rgba(0, 0, 0, 0.5); }
#next-steps a { font-weight: bold; }
.divider { border-top: 1px solid #d5d5d5; border-bottom: 1px solid #fafafa;}
#parallax_wrapper {
position: relative;
z-index: 0;
}
#parallax_field {
overflow: hidden;
position: absolute;
left: 0;
top: 0;
height: 370px;
width: 100%;
}
etc etc.
Content from Github, but not the content of the file. What am I doing wrong?

The content of the file in question is included in the returned data. You are getting the full GitHub view of that file, not just the contents.
If you want to download just the file, you need to use the Raw link at the top of the page, which will be (for your example):
https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt
Note the change in domain name, and the blob/ part of the path is gone.
To demonstrate this with the requests GitHub repository itself:
>>> import requests
>>> r = requests.get('https://github.com/kennethreitz/requests/blob/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
>>> r = requests.get('https://raw.githubusercontent.com/kennethreitz/requests/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/plain; charset=utf-8'
>>> print r.text
Requests: HTTP for Humans
=========================
.. image:: https://travis-ci.org/kennethreitz/requests.png?branch=master
[... etc. ...]

You need to request the raw version of the file, from https://raw.githubusercontent.com.
See the difference:
https://raw.githubusercontent.com/django/django/master/setup.py vs. https://github.com/django/django/blob/master/setup.py
Also, you should probably add a / between your directory and the filename:
>>> getcwd()+'foo.txt'
'/Users/burhanfoo.txt'
>>> import os
>>> os.path.join(getcwd(),'foo.txt')
'/Users/burhan/foo.txt'

Just as an update, https://raw.github.com was migrated to https://raw.githubusercontent.com. So the general format is:
url = "https://raw.githubusercontent.com/user/repo/branch/[subfolders]/file"
E.g. https://raw.githubusercontent.com/earnestt1234/seedir/master/setup.py. Still use requests.get(url) as in Martijn's answer.

Adding a working example ready for copy+paste:
import requests
from requests.structures import CaseInsensitiveDict
url = "https://raw.githubusercontent.com/organization/repo/branch/folder/file"
# If repo is private - we need to add a token in header:
headers = CaseInsensitiveDict()
headers["Authorization"] = "token TOKEN"
resp = requests.get(url, headers=headers)
print(resp.status_code)
(*) If repo is not private - remove the headers part.
Bonus:
Check out this Curl < --> Python-requests online converter.

Related

how to I download csv file from github [duplicate]

Lets say there's a file that lives at the github repo:
https://github.com/someguy/brilliant/blob/master/somefile.txt
I'm trying to use requests to request this file, write the content of it to disk in the current working directory where it can be used later. Right now, I'm using the following code:
import requests
from os import getcwd
url = "https://github.com/someguy/brilliant/blob/master/somefile.txt"
directory = getcwd()
filename = directory + 'somefile.txt'
r = requests.get(url)
f = open(filename,'w')
f.write(r.content)
Undoubtedly ugly, and more importantly, not working. Instead of the expected text, I get:
<!DOCTYPE html>
<!--
Hello future GitHubber! I bet you're here to remove those nasty inline styles,
DRY up these templates and make 'em nice and re-usable, right?
Please, don't. https://github.com/styleguide/templates/2.0
-->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<title>Page not found · GitHub</title>
<style type="text/css" media="screen">
body {
background: #f1f1f1;
font-family: "HelveticaNeue", Helvetica, Arial, sans-serif;
text-rendering: optimizeLegibility;
margin: 0; }
.container { margin: 50px auto 40px auto; width: 600px; text-align: center; }
a { color: #4183c4; text-decoration: none; }
a:visited { color: #4183c4 }
a:hover { text-decoration: none; }
h1 { letter-spacing: -1px; line-height: 60px; font-size: 60px; font-weight: 100; margin: 0px; text-shadow: 0 1px 0 #fff; }
p { color: rgba(0, 0, 0, 0.5); margin: 20px 0 40px; }
ul { list-style: none; margin: 25px 0; padding: 0; }
li { display: table-cell; font-weight: bold; width: 1%; }
#error-suggestions { font-size: 14px; }
#next-steps { margin: 25px 0 50px 0;}
#next-steps li { display: block; width: 100%; text-align: center; padding: 5px 0; font-weight: normal; color: rgba(0, 0, 0, 0.5); }
#next-steps a { font-weight: bold; }
.divider { border-top: 1px solid #d5d5d5; border-bottom: 1px solid #fafafa;}
#parallax_wrapper {
position: relative;
z-index: 0;
}
#parallax_field {
overflow: hidden;
position: absolute;
left: 0;
top: 0;
height: 370px;
width: 100%;
}
etc etc.
Content from Github, but not the content of the file. What am I doing wrong?
The content of the file in question is included in the returned data. You are getting the full GitHub view of that file, not just the contents.
If you want to download just the file, you need to use the Raw link at the top of the page, which will be (for your example):
https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt
Note the change in domain name, and the blob/ part of the path is gone.
To demonstrate this with the requests GitHub repository itself:
>>> import requests
>>> r = requests.get('https://github.com/kennethreitz/requests/blob/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
>>> r = requests.get('https://raw.githubusercontent.com/kennethreitz/requests/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/plain; charset=utf-8'
>>> print r.text
Requests: HTTP for Humans
=========================
.. image:: https://travis-ci.org/kennethreitz/requests.png?branch=master
[... etc. ...]
You need to request the raw version of the file, from https://raw.githubusercontent.com.
See the difference:
https://raw.githubusercontent.com/django/django/master/setup.py vs. https://github.com/django/django/blob/master/setup.py
Also, you should probably add a / between your directory and the filename:
>>> getcwd()+'foo.txt'
'/Users/burhanfoo.txt'
>>> import os
>>> os.path.join(getcwd(),'foo.txt')
'/Users/burhan/foo.txt'
Just as an update, https://raw.github.com was migrated to https://raw.githubusercontent.com. So the general format is:
url = "https://raw.githubusercontent.com/user/repo/branch/[subfolders]/file"
E.g. https://raw.githubusercontent.com/earnestt1234/seedir/master/setup.py. Still use requests.get(url) as in Martijn's answer.
Adding a working example ready for copy+paste:
import requests
from requests.structures import CaseInsensitiveDict
url = "https://raw.githubusercontent.com/organization/repo/branch/folder/file"
# If repo is private - we need to add a token in header:
headers = CaseInsensitiveDict()
headers["Authorization"] = "token TOKEN"
resp = requests.get(url, headers=headers)
print(resp.status_code)
(*) If repo is not private - remove the headers part.
Bonus:
Check out this Curl < --> Python-requests online converter.

How to render a 3D object to HTML file by using pythonOCC in Django?

I have a Django application and I'm using pythonOCC package in it. I have to display the 3D .stl, .stp, .igs files in my template. Normally, when I call the render() function, the following outputs appear on my vscode console and since flask app created by pythonocc instead of django starts running in localhost, my index.html is never rendered. However I need to display the files in a Django template. That's why I have extended the X3DomRenderer Class such as below.
My custom X3DomRenderer class:
class CustomX3DomRenderer(x3dom_renderer.X3DomRenderer):
def render_to_string(self):
self.generate_html_file(self._axes_plane, self._axes_plane_zoom_factor)
return open(self._html_filename, 'r').read()
the HTML codes that returned from render_to_string() function:
<html lang="en">
<head>
<title>pythonOCC 7.4.0 x3dom renderer</title>
<meta name='Author' content='Thomas Paviot - tpaviot#gmail.com'>
<meta name='Keywords' content='WebGl,pythonOCC'>
<meta charset="utf-8">
<link rel="stylesheet" type="text/css" href="https://x3dom.org/release/x3dom.css">
<script src="https://x3dom.org/release/x3dom.js"></script>
<style>
body {
background: linear-gradient(#ced7de, #808080);
margin: 0px;
overflow: hidden;
}
#pythonocc_rocks {
padding: 5px;
position: absolute;
left: 1%;
bottom: 2%;
height: 38px;
width: 280px;
border-radius: 5px;
border: 2px solid #f7941e;
opacity: 0.7;
font-family: Arial;
background-color: #414042;
color: #ffffff;
font-size: 14px;
opacity: 0.5;
}
#commands {
padding: 5px;
position: absolute;
right: 1%;
top: 2%;
height: 65px;
width: 180px;
border-radius: 5px;
border: 2px solid #f7941e;
opacity: 0.7;
font-family: Arial;
background-color: #414042;
color: #ffffff;
font-size: 14px;
opacity: 0.5;
}
a {
color: #f7941e;
text-decoration: none;
}
a:hover {
color: #ffffff;
}
</style>
</head>
<body>
<x3d id="pythonocc-x3d-scene" style="width:100%;border: none" >
<Scene>
<transform scale="1,1,1">
<transform id="plane_smallaxe_Id" rotation="1 0 0 -1.57079632679">
<inline url="https://rawcdn.githack.com/x3dom/component-editor/master/static/x3d/plane.x3d" mapDEFToID="true" namespaceName="plane"></inline>
<inline url="https://rawcdn.githack.com/x3dom/component-editor/master/static/x3d/axesSmall.x3d" mapDEFToID="true" namespaceName="axesSmall"></inline>
</transform>
<inline url="https://rawcdn.githack.com/x3dom/component-editor/master/static/x3d/axes.x3d" mapDEFToID="true" namespaceName="axes"></inline>
</transform>
<transform id="glbal_scene_rotation_Id" rotation="1 0 0 -1.57079632679"> <Inline onload="fitCamera()" mapDEFToID="true" url="shp6b8ef6d6e61744489de16a6798cfe998.x3d"></Inline>
</transform> </Scene>
</x3d>
<div id="pythonocc_rocks">
pythonocc-7.4.0 x3dom renderer
<br>Check our blog at
<a href=http://www.pythonocc.org>http://www.pythonocc.org</a>
</div>
<div id="commands">
<b>t</b> view/hide shape<br>
<b>r</b> reset view<br>
<b>a</b> show all<br>
<b>u</b> upright<br>
</div>
<script>
var selected_target_color = null;
var current_selected_shape = null;
var current_mat = null;
function fitCamera()
{
var x3dElem = document.getElementById('pythonocc-x3d-scene');
x3dElem.runtime.fitAll();
}
function select(the_shape) // called whenever a shape is clicked
{
// restore color for previous selected shape
if (current_mat) {
current_mat.diffuseColor = selected_target_color;
}
// store the shape for future process
current_selected_shape = the_shape;
console.log(the_shape);
// store color, to be restored later
appear = current_selected_shape.getElementsByTagName("Appearance")[0];
mat = appear.getElementsByTagName("Material")[0];
current_mat = mat;
console.log(mat);
selected_target_color = mat.diffuseColor;
mat.diffuseColor = "1, 0.65, 0";
//console.log(the_shape.getElementsByTagName("Appearance"));//.getAttribute('diffuseColor'));
}
function onDocumentKeyPress(event) {
event.preventDefault();
if (event.key=="t") { // t key
if (current_selected_shape) {
if (current_selected_shape.render == "true") {
current_selected_shape.render = "false";
}
else {
current_selected_shape.render = "true";
}
}
}
}
// add events
document.addEventListener('keypress', onDocumentKeyPress, false);
</script>
</body>
</html>
and here is my view:
def occ_viewer(request):
shape = read_step_file(os.path.join('C:/Users/imgea/desktop/bgtask/bgtask/ThreeDFile', 'splinecage.stp'))
my_renderer = extend_x3dom.CustomX3DomRenderer(path='C:/Users/imgea/desktop/bgtask/bgtask/ThreeDFile')
my_renderer.DisplayShape(shape)
context = {'viewer': my_renderer.render_to_string()}
return render(request, 'success.html', context)
and I have added these HTML codes that I got from render_to_string() function to my template file. The viewer's grid has shown but the 3D object hasn't because of the error below.
Page not found: http://127.0.0.1:8000/file/occview/shp6b8ef6d6e61744489de16a6798cfe998.x3d
The library creates that .x3d file in the same directory with the file that I wanna render to template but I guess the viewer is looking for this .x3d file in the error which I mentioned before even I sent the directory. I couldn't find the cause of this error. Am I missing something?
Thank you!!

How to convert HTML to text in Python?

I know there are a lot of answers on this question, but many of them are outdated, and when I found one that "worked", it did not work well enough.
This is my current code:
import requests
from bs4 import BeautifulSoup
url = "http://example.com"
req = requests.get(url)
html = req.text
PlainText = BeautifulSoup(html, 'lxml')
print (PlainText.get_text())
This is the output I get:
Example Domain
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
#media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...
This is the output I want:
Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
More information...
How can I get only the text I can read printed out from a website?
Something like this should work, as long as the "Plain text" part doesn't contain the character '}'.
import requests
from bs4 import BeautifulSoup
url = "http://example.com"
req = requests.get(url)
html = req.text
PlainText = BeautifulSoup(html, 'lxml')
text = Plaintext.get_text()
split = text.split('}')
withoutCss = split[len(split) - 1]
print (withoutCss)
Here is a python program that uses a function to remove everything between the < tags and the > tags, and returns just the text that is not between these tags.
def striphtmltags(s):
b=True
r=''
for i in range(0, len(s)):
if(s[i]=='<'): b=False
if(b): r+=s[i]
if(s[i]=='>'): b=True
return(r.strip())
html="<html><body><h1>this is the header</h1>this is the main body<font color=blue>this is blue</font><h6>this is the footer</h6></body></html>"
text=striphtmltags(html)
print("text:", text)
This produces:
text: this is the headerthis is the main bodythis is bluethis is the footer

API question - How to REST request with Python

I'm making an API REST request using Python. And I encountered the following html result that says "Service - Endpoint not found. Please see the service help page for constructing valid requests to the service"
How can I fix this issue?
Note: This API can help determine whether an individual address is up to date by inputting individual address, first name, last name, etc.
Python query
import requests
import json
url = 'https://smartmover.melissadata.net/v3/WEB/SmartMover/doSmartMover/'
payload = {'t': '1353', 'id': 'sw38hs47u', 'jobid': '1', 'act': 'NCOA, CCOA', 'cols': 'TransmissionResults,TransmissionReference, Version, TotalRecords,CASSReportLink,NCOAReportLink,Records,AddressExtras,AddressKey,AddressLine1,AddressLine2,AddressTypeCode,BaseMelissaAddressKey,CarrierRoute,City,CityAbbreviation,CompanyName,CountryCode,CountryName,DeliveryIndicator,DeliveryPointCheckDigit,DeliveryPointCode,MelissaAddressKey,MoveEffectiveDate,MoveTypeCode,PostalCode,RecordID,Results,State,StateName,Urbanization', 'opt': 'ProcessingType: Standard', 'List': 'test', 'full': 'PATEL MANISH', 'first':'MANISH','last':'PATEL', 'a1':'1600 S 5TH ST', 'a2':'1600 S 5TH ST', 'city':'Austin', 'state': 'TX', 'postal': '78704', 'ctry': 'USA'}
response = requests.get(url, params=payload)
print (response.text)
Result
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Service</title>
<style>BODY { color: #000000; background-color: white; font-family: Verdana; margin-left: 0px; margin-top: 0px; } #content { margin-left: 30px; font-size: .70em; padding-bottom: 2em; } A:link { color: #336699; font-weight: bold; text-decoration: underline; } A:visited { color: #6699cc; font-weight: bold; text-decoration: underline; } A:active { color: #336699; font-weight: bold; text-decoration: underline; } .heading1 { background-color: #003366; border-bottom: #336699 6px solid; color: #ffffff; font-family: Tahoma; font-size: 26px; font-weight: normal;margin: 0em 0em 10px -20px; padding-bottom: 8px; padding-left: 30px;padding-top: 16px;} pre { font-size:small; background-color: #e5e5cc; padding: 5px; font-family: Courier New; margin-top: 0px; border: 1px #f0f0e0 solid; white-space: pre-wrap; white-space: -pre-wrap; word-wrap: break-word; } table { border-collapse: collapse; border-spacing: 0px; font-family: Verdana;} table th { border-right: 2px white solid; border-bottom: 2px white solid; font-weight: bold; background-color: #cecf9c;} table td { border-right: 2px white solid; border-bottom: 2px white solid; background-color: #e5e5cc;}</style>
</head>
<body>
<div id="content">
<p class="heading1">Service</p>
<p xmlns="">Endpoint not found. Please see the <a rel="help-page" href="https://smartmover.melissadata.net/v3/WEB/SmartMover/help">service help page</a> for constructing valid requests to the service.</p>
</div>
</body>
</html>
[Finished in 0.9s]
Remove the / at the end of the url:
import requests
import json
url = 'https://smartmover.melissadata.net/v3/WEB/SmartMover/doSmartMover' # <<<
payload = {'t': '1353', 'id': 'sw38hs47u', 'jobid': '1', ...}
response = requests.get(
url, params=payload,
headers={'Content-Type': 'application/json'} # Using JSON here for readability in the response
)
print (response.text)
Output:
{
"CASSReportLink": "",
"NCOAReportLink": "",
"Records": [],
"TotalRecords": "",
"TransmissionReference": "1353",
"TransmissionResults": "SE20",
"Version": "4.0.4.48"
}

xhtml2pdf doesn't embed Helvetica

I'm creating a PDF with xhtml2pdf using Django. I'm sending that PDF to print, but then they say that some Fonts are not embed. I have a Helvetica font, but I didn't use Helvetica in the PDFs.
Here you have a Screen Shot of the properties of the PDF
As you see, Guilles'ComicFont and TF2Secondary are correclty embbeded, but not with Helvetica.
Here you have my view that generates the PDF:
def generate_pdf(request, book, order):
try:
book = Book.objects.get(pk=int(book))
order = Order.objects.get(identificador=order, cuento=book)
except ObjectDoesNotExist:
raise Http404
data = {}
data = ast.literal_eval(order.datos_variables)
data['order'] = order
template = get_template(order.book.plantilla_html)
html = template.render(Context(data))
url = '/Users/vergere/dev/media/pdfs/%s.pdf' % order.identificador
fichero = open(url, "w+b")
pisaStatus = pisa.CreatePDF(html.encode('utf-8'), dest=fichero, encoding='utf-8')
fichero.seek(0)
fichero.close()
order.url_pdf = '%s.pdf' % order.identificador
order.contador_libro = numero
order.codigo_libro = country_code
order.save()
return HttpResponseRedirect(reverse('order_single', kwargs={'pedido': order.pk}))
And here my HTML:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Background</title>
<style>
#page {
background-image: url("cuento1/img/portada.jpg");
size: 213mm 216mm;
}
#font-face {
font-family: Helvetica;
src: url(pdf_generator/helvetica.ttf);
}
#font-face {
font-family: 'Gilles';
src: url(pdf_generator/cuento1/gilles/gilles.ttf);
font-weight: normal;
font-style: normal;
}
#font-face {
font-family: 'Secondary';
src: url(pdf_generator/cuento1/tf2_secondary/tf2_secondary.ttf);
font-weight: normal;
font-style: normal;
}
* {
box-shadow:none !important;
margin: 0;
padding: 0;
text-shadow: none !important;
font-family: 'Gilles';
}
body{
font-family: 'Gilles';
}
p {
color: #464439;
display: block;
font-weight: normal;
position: absolute;
}
.page-1 p, .page-2 p{
font-family: 'Secondary';
font-size: 40px;
line-height: 1.3em;
position: absolute;
}
</style>
</head>
<body>
<pdf:nextpage name="p1" />
<pdf:nextpage name="p2" />
<div class="page-6 page-dedicatoria">
{{order.dedicatoria}} <br /> {{order.de}}
</div>
<p> </p>
</body>
</html>
Anyone knows why is using Helvetica? Or there is any way to embed Helvetica? I'm trying with "#font-face" but It's not working.
Helvetica is one of the standard fonts that every PDF renderer has to have available. Therefore it doesn't have to be embedded.
A possible solution would be to use another sans-serif font instead of Helvetica. On windows e.g. Arial. On OSX e.g. Helvetica Neue or Avenir. They look a lot like Helvetica, but are not standard PDF fonts.
In your stylesheet, specify a new font for all elements;
* {
font-family: Avenir;
}

Categories