How to handle \n in json files in python

I am trying to read a JSON file like this:
{
"Name": "hello",
"Source": " import json \n x= 10, .... "
}
I am trying to read it with Python's json library, so my code looks something like this:
import json
input =''' {
"Name": "python code",
"Source": " import json \n x= 10, .... "
}'''
output = json.load(input)
print(output)
The problem is that Source contains the invalid character sequence "\n". I don't want to replace \n with \\n, as this code will later be run in another program.
I know that json.JSONDecoder is able to handle \n, but I am not sure how to use it.

You need to escape the backslash in the input string, so that it will be taken literally.
import json
input =''' {
"Name": "python code",
"Source": " import json \\n x= 10, .... "
}'''
output = json.loads(input)
print(output)
Also, you should be using json.loads to parse JSON held in a string; json.load reads it from a file object.
Note that if you're actually getting the JSON from a file or URL, you don't need to worry about this. Backslash only has special meaning to Python when it's in a string literal in the program, not when it's read from somewhere else.
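A minimal sketch of that last point, assuming the JSON lives in a file (the temporary file and its name are made up for the illustration). The file holds a real backslash followed by n, and json.load decodes it without any manual escaping:

```python
import json
import os
import tempfile

# The characters from the question: a backslash followed by "n" inside the JSON string.
# The raw string is used only to place those characters into the file unchanged.
json_text = r'{"Name": "python code", "Source": " import json \n x= 10, .... "}'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(json_text)

    # json.load takes a file object; the backslash needs no special handling here.
    with open(path, encoding="utf-8") as fp:
        data = json.load(fp)

# The JSON escape has been decoded to a real newline character in the Python string.
print(repr(data["Source"]))
```

After loading, the value of "Source" contains an actual newline, which is usually what you want if the text is going to be executed as code later.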

Alternatively, you can define the string in raw format using r:
import json
# note the r
input = r''' {
"Name": "python code",
"Source": " import json \n x= 10, .... "
}'''
# loads (not load) is used to parse a string
output = json.loads(input)
print(output)

Related

PDF long text extraction to JSON in Python

I'm trying to create a python script that extracts text from a PDF then converts it to a correctly formatted JSON file (see below).
The text extraction is not a problem. I'm using PyPDF2 to extract the text from user inputted pdf - which will often result in a LONG text string. I would like to add this text as a 'value' to a json 'key' (see 2nd example below).
My code:
# Writing all data to JSON file
# Data to be written
dictionary ={
"company": str(company),
"document": str(document),
"text": str(text) # This is what would be a LONG string of text
}
# Serializing json
json_object = json.dumps(dictionary, indent = 4)
print(json_object)
with open('company_document.json', 'w') as fp:
    json.dump(json_object, fp)
The ideal output would be a JSON file that is structured like this:
[
{
"company": 1,
"document-name": "Orlando",
"text": " **LONG_TEXT_HERE** "
}
]
I'm not getting the right json structure as an output. Also, the long text string most likely contains some punctuation or special characters that can affect the json - such as closing the string too early. I could take this out before, but is there a way to keep it in for the json file so I can address it in the next step (in Neo4j) ?
This is my output at the moment:
"{\n \"company\": \"Stack\",\n \"document\": \"Overflow Report\",\n \"text\": \"Long text 2020\\nSharing relevant and accountable information about our development "quotes and things...
Does anyone have an idea on how this can be achieved?

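Like many people, you are confusing the CONTENT of your data with the REPRESENTATION of your data. The code you have is nearly right; the one change needed is to pass the dictionary itself (wrapped in a list) to json.dump, rather than the string that json.dumps already produced. Notice: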
import json
# Data to be written
dictionary ={
"company": 1,
"document": "Orlando",
"text": """Long text 2020
Sharing relevant and accountable information about our development.
This is a complicated text string with "quotes and things".
"""
}
# Serializing json
json_object = json.dumps([dictionary], indent = 4)
print(json_object)
with open('company_document.json', 'w') as fp:
    json.dump([dictionary], fp)
When executed, this produces the following on stdout:
[
{
"company": 1,
"document": "Orlando",
"text": "Long text 2020\nSharing relevant and accountable information about our development.\nThis is a complicated text string with \"quotes and things\".\n"
}
]
Notice that the embedded quotes are escaped. That's what the standard requires. The file does not have the indentation, because you didn't ask for it, but it's still quite valid JSON.
[{"company": 1, "document": "Orlando", "text": "Long text 2020\nSharing relevant and accountable information about our development.\nThis is a complicated text string with \"quotes and things\".\n"}]
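To see the content/representation distinction directly, here is a small round-trip sketch (the shortened sample text is an assumption for the illustration). The escapes exist only in the serialized form; loading the JSON gives back the original characters:

```python
import json

# Text with a newline and embedded double quotes, like extracted PDF text might have.
text = 'Long text 2020\nSharing "quotes and things".\n'
record = {"company": 1, "document": "Orlando", "text": text}

serialized = json.dumps([record])   # representation: quotes and newlines are escaped
restored = json.loads(serialized)   # content: the original characters come back

print(serialized)
print(restored[0]["text"] == text)
```

Whatever reads the file with a JSON parser (Neo4j tooling included, assuming it parses the file as JSON) sees the unescaped text, so there is no need to strip punctuation beforehand.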
FOLLOWUP
This version reads in whatever was in the file before, adds a new record to the list, and saves the whole thing out.
import os
import json
# Read existing data.
MASTER = "company_document.json"
if os.path.exists(MASTER):
    with open(MASTER, 'r') as fp:
        database = json.load(fp)
else:
    database = []
# Data to be written
dictionary ={
"company": 1,
"document": "Orlando",
"text": """Long text 2020
Sharing relevant and accountable information about our development.
This is a complicated text string with "quotes and things".
"""
}
# Serializing json
json_object = json.dumps([dictionary], indent = 4)
print(json_object)
database.append(dictionary)
with open(MASTER, 'w') as fp:
    json.dump(database, fp)

How to load a JSON file in Python keeping some escaped Unicode characters

I'm trying to parse a JSON file that has some escaped Unicode characters in its values for some processing.
When I open the file to work on it, Python converts the escaped characters to the actual ASCII characters, but I need to keep the characters escaped, due to a requirement of the tool that will later process this file.
I tried opening the file with different encodings and also tried both options (True and False) for the ensure_ascii option of the json.dump function.
But the problem really seems to occur when I load the file, as in the example below:
This is a simple JSON file to be used as input:
{
"id": "my_current_id",
"name": "dockeradmin \u003e aclpolicy",
"content": "description: Read \u0026 Execute permissions."
}
This is a model that I'm using to simplify the test of different solutions
import json
with open('./test.json', encoding='utf-8') as tfstateFile:
    tfstateData = json.load(tfstateFile)
tfstateData['id'] = 'my_new_id'
print(tfstateData['id'])
print(tfstateData['name'])
print(tfstateData['content'])
resultFilePath = '/result/test_result.json'
with open(resultFilePath, 'w', encoding='utf-8') as resultStateFile:
    json.dump(tfstateData, resultStateFile, indent=4, sort_keys=False, ensure_ascii=False)
When I use the provided json as input, this is the output I'm receiving:
>>> my_new_id
>>> dockeradmin > aclpolicy
>>> description: Read & Execute permissions.
The output file is also decoding the escaped characters:
{
"id": "my_new_id",
"name": "dockeradmin > aclpolicy",
"content": "description: Read & Execute permissions."
}
Is it possible to load the json file and keep the escaped characters on the strings values?
The resultant output should be:
{
"id": "my_new_id",
"name": "dockeradmin \u003e aclpolicy",
"content": "description: Read \u0026 Execute permissions."
}
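No answer to this question appears in the thread. One possible approach, offered only as a sketch: json.load always decodes escapes, so the escaping has to be reapplied when writing. The helper name dumps_keep_escapes and the choice of characters to escape (<, > and &) are assumptions based on the sample input, not something the json module provides.

```python
import json

# Characters to keep escaped in the output (an assumption based on the sample input).
ESCAPES = {"<": "\\u003c", ">": "\\u003e", "&": "\\u0026"}

def dumps_keep_escapes(obj, **kwargs):
    """Serialize obj, then re-escape selected characters (hypothetical helper)."""
    text = json.dumps(obj, **kwargs)
    for char, escaped in ESCAPES.items():
        text = text.replace(char, escaped)
    return text

# The input from the question, as it would appear in the file.
source = r'{"id": "my_current_id", "name": "dockeradmin \u003e aclpolicy", "content": "description: Read \u0026 Execute permissions."}'

data = json.loads(source)
data["id"] = "my_new_id"

result = dumps_keep_escapes(data, indent=4, ensure_ascii=False)
print(result)
```

Replacing on the serialized text is safe for these three characters because they never occur in JSON syntax outside of string values. The result can then be written with a plain file write instead of json.dump.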

Unable to read regex in a JSON file as a string in Python

I need to read the below JSON file into Python with regex intact. I will be using the regular expressions in the program.
{
"Title": "Sample Compliance Check",
"Checks": {
"6": {
"+": ["^interfa.*", "^ip address 192\.168\.0"],
"description": "All interfaces with IP Address 192.168.0",
"action": "aaa new-model"
}
}
}
When I try to read this using the json module I get the error of invalid json.
json.decoder.JSONDecodeError: Invalid \escape:
I tried converting the backslash to double backslash
{
"Title": "Sample Compliance Check",
"Checks": {
"6": {
"+": ["^interfa.*", "^ip address 192\\.168\\.0"],
"description": "All interfaces with IP Address 192.168.0",
"action": "aaa new-model"
}
}
}
Now, it gets read in Python but I get the same output with double-backslashes.
Is there any way to encode regex in JSON and read it like it's encoded (in raw regex form)?

Raw string formatting and "normal string formatting" in python do not change the way the string is stored, but only how you can enter the string. So the following strings are both equal:
"\\a" == r"\a"
Python shows you backslashes escaped, but if you try the regex you will see that it matches what you want to match.
>>> import re
>>> bool(re.match("^ip address 192\\.168\\.0", "ip address 192.168.0"))
True
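The same point can be checked end to end with the JSON from the question. This is a minimal sketch with the document trimmed to the one key that matters; the doubled backslashes exist only in the JSON text, and the loaded pattern contains single backslashes:

```python
import json
import re

# JSON text as stored in the file: each regex backslash is written as two backslashes.
json_text = r'{"+": ["^interfa.*", "^ip address 192\\.168\\.0"]}'

patterns = json.loads(json_text)["+"]

print(patterns[1])        # print shows the real content (single backslashes)
print(repr(patterns[1]))  # repr doubles them for display only

# The loaded string works directly as a regex: the dots are literal dots.
matches = bool(re.match(patterns[1], "ip address 192.168.0"))
rejects = re.match(patterns[1], "ip address 192x168y0") is None
print(matches, rejects)
```

If the output still shows double backslashes, it is most likely the repr of the string (for example from echoing it in the interpreter or printing the containing list) rather than the string itself.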

Non-latin text outputting as nonsense in Python

I've got a script which makes a JSON request that may return text in any script, then outputs the text (I don't have any control over the text being returned).
It works fine with Latin characters, but other scripts output as mojibake, and I'm not sure what's going wrong.
In the response, the problematic characters are encoded using \u syntax. In particular, I have a string containing \u00d0\u00b8\u00d1\u0081\u00d0\u00bf\u00d1\u008b\u00d1\u0082\u00d0\u00b0\u00d0\u00bd\u00d0\u00b8\u00d0\u00b5 which should output as испытание but instead outputs as Ð¸ÑÐ¿ÑÑÐ°Ð½Ð¸Ðµ.
Obviously this has something to do with how Python deals with Unicode and UTF-8, but despite all I've read I don't understand what's going on well enough to know how to solve it.
I've tried to extract the salient points from the code below:
response = requests.get(url, params=params, cookies=self.cookies, auth=self.auth)
text = response.text
print text
status = json.loads(text)
print status
for folder in status['folders']:
    print folder['name']
Output:
{ "folders": [ { "name": "\u00d0\u00b8\u00d1\u0081\u00d0\u00bf\u00d1\u008b\u00d1\u0082\u00d0\u00b0\u00d0\u00bd\u00d0\u00b8\u00d0\u00b5" } ] }
{u'folders': [{ u'name': u'\xd0\xb8\xd1\x81\xd0\xbf\xd1\x8b\xd1\x82\xd0\xb0\xd0\xbd\xd0\xb8\xd0\xb5' }]}
Ð¸ÑÐ¿ÑÑÐ°Ð½Ð¸Ðµ
I've also tried
status = response.json()
for folder in status['folders']:
    print folder['name']
With the same result.
Note, I'm really passing the string to a GTKMenuItem to be displayed, but the output from printing the string is the same as from showing it in the menu.

As @Ricardo Cárdenes said in the comments, the server sends an incorrect response. The response that you've got is double-encoded:
>>> u = u'\xd0\xb8\xd1\x81\xd0\xbf\xd1\x8b\xd1\x82\xd0\xb0\xd0\xbd\xd0\xb8\xd0\xb5'
>>> print u.encode('latin-1').decode('utf-8')
испытание
The correct string would look like:
>>> s = {u"name": u"испытание"}
>>> import json
>>> print json.dumps(s)
{"name": "\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435"}
>>> print s['name']
испытание
>>> print s['name'].encode('unicode-escape')
\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435
>>> print s['name'].encode('utf-8')
испытание
>>> s['name'].encode('utf-8')
'\xd0\xb8\xd1\x81\xd0\xbf\xd1\x8b\xd1\x82\xd0\xb0\xd0\xbd\xd0\xb8\xd0\xb5'
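The session above is Python 2. A rough Python 3 equivalent of the same repair might look like the sketch below; the helper name repair_double_encoding is made up for the illustration, and the fallback behaviour (returning the text unchanged when the round trip fails) is an assumption about what a caller would want.

```python
import json

# The response body from the question: each UTF-8 byte was escaped as its own code point.
body = r'{"folders": [{"name": "\u00d0\u00b8\u00d1\u0081\u00d0\u00bf\u00d1\u008b\u00d1\u0082\u00d0\u00b0\u00d0\u00bd\u00d0\u00b8\u00d0\u00b5"}]}'

def repair_double_encoding(text):
    """Undo one layer of UTF-8 bytes misread as Latin-1 (hypothetical helper)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not double-encoded, or not repairable this way: return it unchanged.
        return text

status = json.loads(body)
names = [repair_double_encoding(folder["name"]) for folder in status["folders"]]
print(names[0])
```

This is a workaround on the client side. A correctly encoded string made only of Latin-1 characters could, in rare cases, also survive the round trip and be altered, so fixing the server remains the proper solution.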

Confused by Python returning JSON as string instead of literal

I've done some coding in RoR, and in Rails, when I return a JSON object via an API call, it returns as
{ "id" : "1", "name" : "Dan" }.
However, in Python (with Flask and Flask-SQLAlchemy), when I return a JSON object via json.dumps or jsonpickle.encode it is returned as
"{ \"id\" : \"1\", \"name\": \"Dan\" }" which seems very unwieldy, as it can't easily be parsed on the other end (by an iOS app in this case - Obj-C).
What am I missing here, and what should I do to return it as a JSON literal, rather than a JSON string?
This is what my code looks like:
people = models.UserRelationships.query.filter_by(user_id=user_id, active=ACTIVE_RECORD)
friends = people.filter_by(friends=YES)
json_object = jsonpickle.encode(friends.first().as_dict(), unpicklable=False, keys=True)
print(json_object) # this prints here, i.e. { "id" : "1", "name" : "Dan" }
return json_object # this returns "{ \"id\" : \"1\", \"name\": \"Dan\" }" to the browser

What is missing in your understanding here is that when you use the JSON modules in Python, you're not working with a JSON object. JSON is by definition just a string that matches a certain standard.
Let's say you have the string:
friends = '{"name": "Fred", "id": 1}'
If you want to work with this data in python, you will want to load it into a python object:
import json
friends_obj = json.loads(friends)
At this point friends_obj is a python dictionary.
If you want to convert it (or any other Python dictionary or list) back to a JSON string, this is where json.dumps comes in handy:
friends_str = json.dumps(friends_obj)
print(friends_str)
{"name": "Fred", "id": 1}
However if we attempt to "dump" the original friends string you'll see you get a different result:
dumped_str = json.dumps(friends)
print(dumped_str)
"{\"name\": \"Fred\", \"id\": 1}"
This is because you're basically attempting to encode an ordinary string as JSON and it is escaping the characters. I hope this helps make sense of things!
Cheers

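If you are using Django, do something like this (django.utils.simplejson was removed in later Django versions, so the standard json module is used here):
import json
from django.http import HttpResponse
...
return HttpResponse(json.dumps(friends.first().as_dict()), content_type="application/json")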

This is almost always a sign that you're double-encoding your data somewhere. For example:
>>> obj = { "id" : "1", "name" : "Dan" }
>>> j = json.dumps(obj)
>>> jj = json.dumps(j)
>>> print(obj)
{'id': '1', 'name': 'Dan'}
>>> print(j)
{"id": "1", "name": "Dan"}
>>> print(jj)
"{\"id\": \"1\", \"name\": \"Dan\"}"
Here, jj is a perfectly valid JSON string representation—but it's not a representation of obj, it's a representation of the string j, which is useless.
Normally you don't do this directly; instead, either you started with a JSON string rather than an object in the first place (e.g., you got it from a client request or from a text file), or you called some function in a library like requests or jsonpickle that implicitly calls json.dumps with an already-encoded string. But either way, it's the same problem, with the same solution: Just don't double-encode.
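When you are on the receiving end and cannot fix the sender right away, a quick diagnostic is to decode repeatedly until the result stops being a string. The sketch below is only that, a diagnostic aid; decode_fully and its max_layers parameter are hypothetical names, not part of the json module.

```python
import json

def decode_fully(text, max_layers=5):
    """Call json.loads until the result is no longer a str (hypothetical helper)."""
    value = text
    for _ in range(max_layers):
        if not isinstance(value, str):
            break
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            break  # a plain string that is not itself JSON
    return value

obj = {"id": "1", "name": "Dan"}
double_encoded = json.dumps(json.dumps(obj))

# One decode still leaves a str, which is the symptom described above.
print(type(json.loads(double_encoded)).__name__)
print(decode_fully(double_encoded) == obj)
```

Note that this would also turn a legitimate string value such as "123" into a number, so it is better suited to finding out how many layers of encoding were applied than to production use.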

You should be using flask.jsonify, which will not only encode correctly, but also set the content-type headers accordingly.
from flask import jsonify
people = models.UserRelationships.query.filter_by(user_id=user_id, active=ACTIVE_RECORD)
friends = people.filter_by(friends=YES)
return jsonify(friends.first().as_dict())
