Python HTML - Get element by attribute - python

There is a music website I regularly read, and it has a section where users post their own fictional music-related stories. There is a 91-part series (written over a length of time and uploaded part by part) whose URLs always follow this convention:
http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_#.html.
I would like to be able to get just the formatted text from every part and put it into one html file.
Conveniently, there is a link to a print version, correctly formatted for my purposes. All I would have to do is write a script to download all of the parts and then dump them into a file. Not hard.
Unfortunately, the url for a print version is as follows:
www.ultimate-guitar.com/print.php?what=article&id=95932
The only way to know what article corresponds to what ID field is to look at the value attribute of a certain input tag in the original article.
What I want to do is this:
Go to each page, incrementing through the part numbers.
Find the <input> tag with the attribute name="rowid" and get the number in its value attribute.
Go to www.ultimate-guitar.com/print.php?what=article&id=<value>.
Append everything (minus <html>, <head> and <body>) to an HTML file.
Rinse and repeat.
Is this possible? And is Python the right language? Also, what DOM/HTML/XML library should I use?
Thanks for any help.

With lxml and urllib2:
import lxml.html
import urllib2
# Implement the logic to download each page, with HTML strings in a sequence named pages
url = "http://www.ultimate-guitar.com/print.php?what=article&id=%s"
for page in pages:
    html = lxml.html.fromstring(page)
    # The value attribute of <input name="rowid"> holds the article ID
    ID = html.find(".//input[@name='rowid']").value
    article = urllib2.urlopen(url % ID).read()
    article_html = lxml.html.fromstring(article)
    with open(ID + ".html", "w") as html_file:
        html_file.write(article_html.find(".//body").text_content())
edit: Upon running this, it seems there may be some Unicode characters in the page. One way to get around this is to do article = article.encode("ascii", "ignore") or to put the encode method after .read(), to force ASCII and ignore Unicode, though this is a lazy fix.
This is assuming you just want the text content of everything inside the body tag. This will save files with the format of storyID.html (so "95932.html") in the local directory of the Python file. Change the save semantics if you like.
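On Python 3, where urllib2 no longer exists, the ID lookup can also be done with the standard library alone. The following is a minimal sketch rather than the code from the answer above: the class name RowIdFinder, the helper find_rowid and the sample markup are made up for illustration, and it assumes each article page contains one <input name="rowid"> tag.

```python
from html.parser import HTMLParser


class RowIdFinder(HTMLParser):
    """Remember the value attribute of the first <input name="rowid"> tag."""

    def __init__(self):
        super().__init__()
        self.rowid = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "rowid" and self.rowid is None:
            self.rowid = attributes.get("value")


def find_rowid(page_html):
    """Return the rowid value found in page_html, or None if it is absent."""
    finder = RowIdFinder()
    finder.feed(page_html)
    return finder.rowid


# Hypothetical markup standing in for one downloaded article page
sample = '<html><body><form><input type="hidden" name="rowid" value="95932"></form></body></html>'
print_url = "http://www.ultimate-guitar.com/print.php?what=article&id=%s" % find_rowid(sample)
```

Downloading the pages themselves would go through urllib.request.urlopen on Python 3; that part is left out here so the sketch runs without network access.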

You could actually do this in javascript/jquery without too much trouble. javascripty-pseudocode, appending to an empty document:
for (var pageNum = 1; pageNum <= 91; pageNum++) {
    $.ajax({
        url: url + pageNum,
        async: false,
        success: function(page) {
            // Look for the input in the fetched page, not in the current document
            var printId = $(page).find('input[name="rowid"]').val();
            $.ajax({
                url: printUrl + printId,
                async: false,
                success: function(data) {
                    $('body').append($(data).find('body').contents());
                }
            });
        }
    });
}
After the loading completes you could save the resultant HTML to a file.

Related

How to display HTML using LXML in Python

So what I am trying to achieve is really simple.
I want to call python test.py and would like to go to my local host and see the html result. However I keep getting an error ValueError: Invalid tag name u'<html><body><h1>Test!</h1></body></html>'
Below is my code. What's the problem here?
import lxml.etree as ETO
html = ETO.Element("<html><body><h1>Test!</h1></body></html>")
self.wfile.write(ETO.tostring(html, xml_declaration=False, pretty_print=True))
You have to create each element in turn, and put them in the structure that you want them to have:
html = ETO.Element('html')
body = ETO.SubElement(html, 'body')
h1 = ETO.SubElement(body, 'h1')
h1.text = 'Test!'
Then ETO.tostring(html) will return a bytestring that looks like this:
>>> ETO.tostring(html)
b'<html><body><h1>Test!</h1></body></html>'
Since you are reading an existing file, Element isn't useful here; try changing this
html = ETO.Element("<html><body><h1>Test!</h1></body></html>")
to this
html = ETO.fromstring("<html><body><h1>Test!</h1></body></html>")
and see if it works for you.
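The two approaches from the answers can be compared side by side. This is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (an assumption made so it runs without lxml installed); Element, SubElement, fromstring and tostring exist under the same names in both, although the standard-library module may not raise the same ValueError for a bad tag name.

```python
import xml.etree.ElementTree as ET

# Option 1: build the tree element by element
html = ET.Element("html")
body = ET.SubElement(html, "body")
h1 = ET.SubElement(body, "h1")
h1.text = "Test!"
built = ET.tostring(html)

# Option 2: parse markup that already exists as a string
parsed = ET.tostring(ET.fromstring("<html><body><h1>Test!</h1></body></html>"))
```

Both variables end up holding the same serialized bytes, so either one can be passed to self.wfile.write.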

Flask and Jinja2 url_for error - concatenating json object into url_for

I am using Flask with Jinja2 and MapBox on a project which involves plotting data on a map using GeoJSON derived from model data. Example of how this is loaded:
$.getJSON("{{ url_for(".geojson") }}", function(data) {
    var geojson = L.geoJson(data, {
        onEachFeature: function (feature, layer) {
            //do stuff
        }
    });
    markers.addLayer(geojson);
    var map = L.map('map', {maxZoom: 9, minZoom: 3}).fitBounds(markers.getBounds());
    baseLayer.addTo(map);
    markers.addTo(map);
});
An example of using this JSON data within my JS:
var feature = e.layer.feature;
//print item name
console.log(feature.properties.name)
//print item latitude
console.log(feature.properties.latitude)
//print item category info
console.log(feature.properties.category.name)
This works great. My dataset has now been extended to include image URLs (example 09379_580_360.jpg), and the images themselves are hosted in a static/images/eol folder. I'd like to include these as an image within a DIV whose content I am setting dynamically via JS like so...
var commoncontent = '<div class="panel-heading"><h3>'+feature.properties.name+'</h3></div>'
$('#common').html(commoncontent)
However, when I attempt to concatenate my image data into jinja's url_for...
var commoncontent = '<div><img src="{{ url_for("static", filename="images/eol/thumbs/big/'+feature.properties.category.localimageurl.jpg+'") }}"></div>'
... I get this error in my console
GET http://127.0.0.1:5000/static/images/eol/thumbs/big/feature.properties.category.localimageurl.jpg 404 (NOT FOUND)
I know that feature.properties.category.localimageurl is correct as it prints to my console when I console.log() it. However, I have no idea why the interpreter is taking it directly as a string and not concatenating it?
feature is a JavaScript object. Jinja doesn't have access to those; it runs on the server, whereas your JavaScript runs in the client. feature doesn't exist when your template is rendered. You will need to handle the concatenation with JavaScript.
var commoncontent = '<div><img src="{{ url_for("static", filename="images/eol/thumbs/big/") }}' + feature.properties.category.localimageurl + '"></div>'

json unicode to string so i can use that in django html page

Edited:
Can anybody tell me how to decode the Unicode? I just want to print the JSON Unicode text in the HTML page I developed. The data comes from an API hosted on Heroku.
I followed pretty much every step correctly, but the output is Unicode and I don't know how to extract the content and display it in my page.
I need to print the content. How do I do that?
my views.py
template_vars['kural'] = json.dumps(thirukural[x])
t = loader.get_template('index.html')
c = Context(template_vars)
#pprint.pprint(c)
return HttpResponse(t.render(c))
Html Page
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head><body>
<p id="p"></p>
<script type="text/javascript">
var t = {{kural|safe}}
var text = eval(t);
var p = document.getElementById("p");
p.innerHTML=t.kural;
</script>
</body></html>
It's currently printed like this:
யாதனின் யாதனின் நீங்கியான் நோதல் அதனின் அதனின் இலன்.
but in the heroku api page the sample output printed like this
{
"id": "213",
"kural": "புத்தே ளுலகத்தும் ஈண்டும் பெறலரிதே\n\nஒப்புரவின் நல்ல பிற."
}
You can see that my output doesn't have the line breaks (the \n). How can I get them?
The kural variable is a dict; if you want to display kural in your view, I think you need to serialize it with json:
import json
template_vars['kural'] = json.dumps(thirukural[x])
I believe what you'll want to do is change that first line you show of your views code to:
template_vars['kural'] = thirukural[x].encode('ascii', 'xmlcharrefreplace')
That should change everything into HTML entities and it will end up looking something like this:
'உலகம் தழீஇய தொட்பம் மலர்தலும்\n\nகூம்பலும் இல்ல தறிவு.'
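A possible way to combine the entity conversion with the line breaks the question asks about is sketched below for Python 3, where encode returns bytes and therefore needs a decode afterwards. The function name to_html_entities is made up for this example, and replacing \n with <br> is an assumption about how the breaks should be shown.

```python
import html


def to_html_entities(text):
    """Escape markup, turn non-ASCII characters into numeric entities, keep line breaks."""
    escaped = html.escape(text)  # neutralize &, <, > and quotes first
    ascii_only = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # HTML collapses raw newlines, so make them explicit
    return ascii_only.replace("\n", "<br>")


kural = "புத்தே ளுலகத்தும் ஈண்டும் பெறலரிதே\n\nஒப்புரவின் நல்ல பிற."
markup = to_html_entities(kural)
```

The resulting string is plain ASCII and can be assigned to innerHTML. In a Django template, the built-in linebreaksbr filter is another way to handle the newline part.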

How to return JSON data from Flask without ajax

Suppose I want to render a page (not just JSON) using Flask with some specific data that I fetch from the database. For example
display_data.html includes:
<script src='display_data.js'></script>
...
<h1>Data display page!</h1>
<div id="chartContainer"></div>
display_data.js:
$(function() {
    draw_chart($("#chartContainer"), json_data);
    //draw_chart is defined elsewhere and json_data is what I want to pass in
});
Python:
@app.route('/<data_id>')
def get_display_data_page(data_id):
    data = get_data_by_id(data_id)
    return render_template('display_data.html', data=data)
I think that if I want to just "render template", I'd have to include elsewhere in display_data.html the following:
<script>window.json_data = {{ data | tojson | safe}}</script>
This pattern smells bad: I'm leaving an object in the global namespace (so that my JS file can access it), exposing the data as plain text in the page source, and rendering a string into the page that is then parsed as JSON so the JS can use it. It looks bad, but it does work.
Two other options:
Return the data with AJAX. Given the title of this post I'm specifically trying to avoid ajax. The reason for this is mainly that I'm building a mobile site and want to reduce the number of pings back to the server. I'm also thinking (perhaps more metaphysically) about encapsulating the page: once you have it, you have all of it.
Render my JS file via Flask and Jinja. This seems like a bummer because I'd have to then write a route down and render the JS based on the same logic that I have in the get_display_data_page: looking up the data by its id, etc. Code duplication and dynamic JS sound like big no-no's to me.
Is there a known pattern to doing this well?
There's no need to leave data in the global scope if you don't want to. In your template you can do something like this:
<script>
function registerTask(f, args) {
    $(function() {
        f.call(this, args);
    });
}
{% for name, args in js_tasks %}
registerTask({{name}}, {{ args | tojson | safe }});
{% endfor %}
</script>
Then, in your JS file, redefine draw_chart to just take the data (or have a wrapper around it that you use as your task registry name):
function draw_chart_task(data) {
    draw_chart($('#chartContainer'), data);
}
Finally, in your controller, simply provide the data and the task name as a tuple:
return render_template('display_data.html', js_tasks=[('draw_chart_task', data)])
This ensures that your JavaScript is not just plucking its dependencies out of the global scope, and you are not making extra network calls.
The data is visible in the raw text output of the page, but it is visible if you make an AJAX call too, you just have to look in a different panel of your browser's developer's tools to see it.
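Whichever variant is used, the data ends up inside a <script> element, so the serialized JSON must not be able to close that element early. The sketch below shows the general idea with the standard library only; it is not Flask's implementation of tojson, and the name script_safe_json and the sample data are made up for illustration.

```python
import json


def script_safe_json(data):
    """Serialize data to JSON that can sit inside a <script> element."""
    text = json.dumps(data)
    # Escape characters that could close the script element or start markup;
    # in JSON output they can only occur inside string values
    return (
        text.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


# Hypothetical data containing a string that would otherwise end the script block
data = {"title": "</script><b>chart</b>", "points": [1, 2, 3]}
payload = script_safe_json(data)
```

The escaped text is still valid JSON and valid JavaScript, so the browser sees the original values once the script runs.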

Generating pdf using web2py-appreport (xhtmltopdf) in Python Web2py webapp

I come from a non-coding background, so Python and web2py are very new to me.
My app needs to export textarea content (using the Redactor RTE) to PDF. I get HTML content from the textarea (Redactor); can you please advise me on how to use pyfpdf to generate a PDF file on button click?
I don't know how to get the HTML content (images and text) on button click in the view to generate a PDF using appreport.
I was able to use app-report to generate a PDF (using PISA; PYPDF does not work) from an existing HTML file without CSS. If the HTML file has CSS, it throws an error:
***<class 'sx.w3c.cssParser.CSSParseError'> Terminal function expression expected closing ')':: (u'Alpha(Opacity', u'=0); }\n\n\n\n.ui-state-')***
This might be due to a mistake in the controller code:
def myreport():
    html = response.render('myreport.html', dict())
    return plugin_appreport.REPORTPISA(html = html)
Another thing I tried was passing the html from my view to the controller using ajax post (in Javascript). Redactor is the textarea RTE I am using and alert gives me the desired html result.
View:
function getContent() {
    var t= jQuery('#redactor_content').getCode();
    alert(t);
    jQuery.ajax({
        type: "POST",
        url: "http://127.0.0.1:8000/Test50/default/myreport2",
        data: "{g : 'jQuery('#redactor_content').getCode()'}"
    });
}
Controller:
def myreport2():
    g = request.get_vars
    html = response.render(g)
    return plugin_appreport.REPORTPISA(html = html)
Because of my limited coding knowledge, I am not able to find and correct my mistake. I would be thankful if anybody could help me with this problem.
Regards,
Akash
Could it be this post request:
jQuery.ajax({
    type: "POST",
    url: "http://127.0.0.1:8000/Test50/default/myreport2",
    data: "{g : 'jQuery('#redactor_content').getCode()'}"
});
I think the 'data' parameter should be an object literal, not a string. Change this line like this (remove all but one set of quotes):
data: {g : jQuery('#redactor_content').getCode() }
This should properly send the request. jQuery encodes an object of key-value pairs for you, whereas a string is sent as-is, so the quoted version sends the literal text instead of the editor content.
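To see what "key-value pairs" means on the wire, the sketch below uses Python's standard library to form-encode a pair and decode it again. It is an illustration only, not web2py or jQuery code, and the editor_html value is a made-up stand-in for whatever getCode() returns.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical editor content standing in for getCode()'s return value
editor_html = "<p>Hello & welcome</p>"

# A form-encoded POST body for the key-value pair {g: editor_html}
body = urlencode({"g": editor_html})

# What the receiving side recovers after decoding that body
received = parse_qs(body)["g"][0]
```

The markup survives the round trip because special characters such as < and & are percent-encoded in the body.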
