I have a Python server-side application that generates a simple HTML page containing a large block of client-side JavaScript. That script builds the DOM tree shown to the user from a big blob of JSON data assigned to a JS variable. Some of that JSON data contains strings, some of which contain HTML tags. It all boils down to something like this:
<html>
...
var tmp = "<p>some text</p>";
...
</html>
Unsurprisingly, the above does not work since it should look like the following to make the browser HTML parser happy:
<html>
...
var tmp = "<p>some text<\/p>";
...
</html>
(notice the escaped forward slash)
The JSON inserted in the HTML is generated with Python's default json library, namely with json.dumps, which is explicitly designed not to escape the forward slash in strings.
I tried to subclass json.JSONEncoder to override its behavior for Python strings, but this does not work since it does not allow specializing the serialization of basic Python types.
I tried to use a variety of other python json libraries without much luck: it seems that since most people hate the escaped forward slashes, most libraries do not generate them.
I could escape the strings by hand before stuffing them in my python data structures before calling json.dumps. I could also write a function to recursively iterate over the data structure, spot strings, and escape them automatically (nicer over the long run). I could maybe escape the string generated by json.dumps before stuffing it in the HTML (I am not sure that this could not lead to invalid JSON being inserted in the HTML).
Which leads me to my question: is there a JSON serialization library for Python that can be coerced to escape forward slashes in strings?
The best way I've found is to just do a replacement on the resulting string.
out = json.dumps(obj)
out = out.replace("/", "\\/")
Escaping forward slashes is optional within the JSON spec, and doing so ensures that you won't get bitten by "</script>" attacks in the string.
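To make the replacement approach concrete, here is a minimal sketch. The helper name dumps_for_script and the sample payload are made up for illustration. It relies on the fact that a forward slash can only appear inside string values in JSON output, so a blanket replace still yields valid JSON that parses back to the same data.

```python
import json


def dumps_for_script(obj):
    """Serialize obj to JSON, then backslash-escape every forward slash."""
    out = json.dumps(obj)
    # json.dumps leaves "/" unescaped, so each "/" here is a literal slash
    # inside a string value and can safely be prefixed with a backslash.
    return out.replace("/", "\\/")


# Hypothetical payload: HTML tags, a closing script tag, and a string that
# already contains a backslash followed by a slash.
payload = {
    "html": "<p>some text</p>",
    "danger": "</script>",
    "path": "a\\/b",
}
encoded = dumps_for_script(payload)
decoded = json.loads(encoded)
```

Any standard JSON parser should turn the escaped slashes back into plain ones, so decoded is expected to be equal to payload.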
I have a program that can output results either as JSON or Python data structure literals. I am wondering how to succinctly name the latter option.
JavaScript literals are not called JSON. JSON derived its name and syntax from JavaScript, but they’re not the same thing. Use “Python literals”.
I do not understand why, when I make an HTTP request using the Requests library and then display the response's .text, special characters (such as accents) come back encoded (é = &eacute;, for example).
Yet when I try r.encoding, I get utf-8.
In addition, the problem occurs only on some websites. Sometimes I have the correct characters, but other times, not at all.
Try as follows:
r = requests.get("https://gks.gs/login")
print r.text
Encoded characters are displayed; for example, we can see Mot de passe oubli&eacute; ?.
I do not understand why. Do you think it may be because of https? How to fix this please?
These are HTML character entity references. The easiest way to decode them is:
In Python 2.x:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('oubli&eacute;')
'oublié'
In Python 3.x (3.4 and later; HTMLParser.unescape was removed in 3.9):
>>> import html
>>> html.unescape('oubli&eacute;')
'oublié'
These are HTML escape codes, defined in the HTML Coded Character Set. Even though a certain document may be encoded in UTF-8, HTML (and its grandparent, SGML) were defined back in the good old days of ASCII. A system accessing an HTML page on the WWW may or may not natively support extended characters, and the developers needed a way to define "advanced" characters for some users, while failing gracefully for other users whose systems could not support them. Since UTF-8 standardization was only a gleam in its founders' eyes at that point, an encoding system was developed to describe characters that weren't part of ASCII. It was up to the browser developers to implement a way of displaying those extended characters, either through glyphs or through extended fonts.
Encoding special characters as &something; is "legal" in any HTML document, and although such sequences look a bit strange, they are valid.
The text is meant to be rendered by an HTML browser, which displays the correct result regardless of whether a character is written directly or encoded with this construct.
For instructions on how to convert these encoded characters, see HTML Entity Codes to Text.
Those are HTML escape codes, often referred to as HTML entities. As you see, HTML uses its own code to replace reserved symbols.
You can use the HTMLParser library (Python 2):
import HTMLParser
parser = HTMLParser.HTMLParser()  # instantiate the class; unescape is an instance method
parsed = parser.unescape(r.text)
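For Python 3, a minimal sketch of the same decoding step using the standard library's html.unescape. The sample string stands in for r.text and is taken from the question; no network request is made here.

```python
import html

# Stand-in for r.text: the fragment quoted in the question.
raw_text = "Mot de passe oubli&eacute; ?"

# html.unescape converts named and numeric character references to text.
decoded_text = html.unescape(raw_text)
print(decoded_text)
```

Numeric references such as &#233; are handled by the same call.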
I am using the Instagram API, and when I make a request to the API, something like
https://api.instagram.com/v1/tags/cats/media/recent?user_id=myUserId&count=1
I get the correct response back as JSON data, except that all of the // characters are escaped and shown as \/\/
This is true on the command line using curl and when I type that URL directly into the browser.
If it makes any difference, I will ultimately be calling a function and navigating to that URL, so I need it unescaped.
Furthermore, I will be accessing that URL with Python, so a Python method would be good, but ideally I would just get the response back unchanged.
The JSON standard allows (but does not require) / to be escaped. If you use any standard-compliant JSON parser (i.e. pretty much any JSON parser), it will do the unescaping for you.
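As a small illustration with Python's built-in json module: the response body below is made up (it is not real API output), but it uses the same \/ escaping described in the question.

```python
import json

# Hypothetical response body; the key and URL are invented for illustration.
body = r'{"link": "http:\/\/example.com\/media\/1"}'

data = json.loads(body)
link = data["link"]
print(link)  # http://example.com/media/1
```

If the response is fetched with Requests, calling r.json() performs the same parse, so the value you get back is already unescaped.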
I am sending a GET request to a Python server, and my query string is:
"http://192.168.4.106:3333/xx/xx/xx/xx?excelReport**&detail=&#tt**=475&dee=475&empi=&qwer=&start_date=03/01/2014&end_date=03/13/2014&SearchVar=0&report_format=D"
My query string contains a # character, so when I call request.keys() on my server, it does not show any of the params I passed. It works with other special characters.
I have been stuck on this problem for quite a long time.
I am using the Zope framework.
Please suggest a fix.
The # character cannot be used like that in a query string.
You should encode it with %23 and decode it when you parse the string.
The reason for this is explained on the W3 site.
# marks the end of the 'query' part of a URL and the start of the 'fragment'. If you need to have a '#' inside your query (that is, the GET params that you get with request.keys()), you need to encode it (with the standard urllib.urlencode, which is urllib.parse.urlencode in Python 3, or with whatever your framework provides).
I'm not sure what's the purpose of # in that URL, though. Is it supposed to be a key #tt** in your request.keys()? Is it in fact the start of the fragment?
Nowadays fragments are often used to have some routing in the client side of a webapp, since if you go from #a to #b inside a webpage, you don't need to reload the page. So if that may be the case then you can't encode the #, since it would lose its meaning. You would need then to extract the parameters you want from the fragment part manually.
You can use urllib.quote (urllib.parse.quote in Python 3) to solve your problem generally.
>>> import urllib
>>> urllib.quote('#')
'%23'
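A Python 3 sketch of the same idea, using urllib.parse. The host and the shortened parameter list are placeholders, not the asker's real URL. It first shows how a raw # cuts the query short, then builds a query in which the # is percent-encoded.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit

# Placeholder URL with a raw '#': everything after it becomes the fragment,
# which browsers do not send to the server.
raw_url = "http://example.invalid/report?detail=&#tt=475&dee=475"
parts = urlsplit(raw_url)
# parts.query    -> "detail=&"
# parts.fragment -> "tt=475&dee=475"

# Encoding the parameters keeps '#tt' inside the query part.
params = {"detail": "", "#tt": "475", "dee": "475"}
safe_query = urlencode(params)
safe_url = "http://example.invalid/report?" + safe_query

# What a server-side parser would recover from the encoded query.
recovered = parse_qsl(urlsplit(safe_url).query, keep_blank_values=True)
```

Here quote('#') gives '%23', matching the Python 2 session above, and recovered contains the '#tt' key again after decoding.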
I want to parse yaml documents like the following
meta-info-1: val1
meta-info-2: val2
---
Plain text/markdown content!
jhaha
If I load_all this with PyYAML, I get the following
>>> list(yaml.load_all(open('index.yml')))
[{'meta-info-1': 'val1', 'meta-info-2': 'val2'}, 'Plain text/markdown content! jhaha']
What I am trying to achieve here is that the yaml file should contain two documents, and the second one is supposed to be interpreted as a single string document, more specifically any large body of text with markdown formatting. I don't want it to be parsed as YAML syntax.
In the above example, PyYAML returns the second document as a single string. But if the second document has a : character in place of the ! for instance, I get a syntax error. This is because PyYAML is parsing the stuff in that document.
Is there a way I can tell PyYAML that the second document is just a raw string and should not be parsed?
Edit: A few excellent answers there. While using quotes or the literal syntax solves the said problem, I'd like the users to be able to write the plain text without any extra cruft. Just the three -'s (or .'s) and write away a large body of plain text. Which might also include quotes too. So, I'd like to know if I can tell PyYAML to parse only one document, and give the second to me raw.
Edit 2: So, adapting agf's idea, but without the try/except, since the second document could happen to be valid YAML syntax:
config_content, body_content = open(filename).read().split('\n---', 1)
config = yaml.safe_load(config_content)  # PyYAML has no yaml.loads
body = body_content  # keep the second document as a raw string
Thanks agf.
If you don't have control over the format of the original document, you can do:
import yaml

raw = open(filename).read()
docs = []
for raw_doc in raw.split('\n---'):
    try:
        docs.append(yaml.safe_load(raw_doc))
    except yaml.YAMLError:
        # PyYAML reports parse failures as yaml.YAMLError, not SyntaxError
        docs.append(raw_doc)
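If the goal is to parse only the first document and keep the rest raw, the split step can be sketched on its own with plain string handling. The function name split_front_matter and the sample text are hypothetical; YAML parsing is deliberately left out so the sketch only needs the standard library.

```python
def split_front_matter(text, marker="\n---"):
    """Split text into (header, body) at the first document marker.

    The header is meant to be handed to a YAML parser; the body is
    returned as a raw string. Without a marker, body is empty.
    """
    header, _sep, body = text.partition(marker)
    # Drop the single newline that ends the marker line, if present.
    if body.startswith("\n"):
        body = body[1:]
    return header, body


# Sample document: the body contains a colon and quotes, which would
# trip up a YAML parser if it were parsed as a plain scalar.
sample = 'meta-info-1: val1\nmeta-info-2: val2\n---\nTitle: with a colon\n"quotes" too\n'
header, body = split_front_matter(sample)
```

The header string can then be passed to yaml.safe_load, while body is used as-is.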
From the PyYAML docs,
Double-quoted is the most powerful style and the only style that can express any scalar value. Double-quoted scalars allow escaping. Using escaping sequences \x** and \u****, you may express any ASCII or Unicode character.
So it sounds like there is no way to represent an arbitrary scalar in the parsing if it's not double quoted.
If all you want is to escape the colon character in YAML, then enclose the text in single or double quotes. You can also try the literal style for your second document, which should then be treated as a single scalar.