Parsing JSON data with double quotes - Python

How do you parse JSON data with double quotes inside a value:
json.loads('
{
    "time":"1410661614",
    "text":"This is great",
    "from":
    {
        "username":"mrb",
        "id":"5071",
        "full_name":"Free "Mrb""   # here is the problem
    },
    "id":"8090107"
}
')
Python returns:
ValueError: Expecting ',' delimiter: line 1 column 107 (char 106)

You can easily fix this issue by escaping the double quotes (\"):
import json
json.loads("""
{
    "time":"1410661614",
    "text":"This is great",
    "from":
    {
        "username":"mrb",
        "id":"5071",
        "full_name":"Free \\"Mrb\\""
    },
    "id":"8090107"
}
""")
As said in the comments, it's better to fix the JSON generator so it properly escapes the double quotes; otherwise the malformed JSON will be hard to parse and repair after the fact.
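For completeness, here is a minimal sketch of what the generator side should look like: build the record as a dict and let json.dumps do the escaping (the field values are taken from the question).
import json

record = {
    "time": "1410661614",
    "text": "This is great",
    "from": {"username": "mrb", "id": "5071", "full_name": 'Free "Mrb"'},
    "id": "8090107",
}
print(json.dumps(record))  # ... "full_name": "Free \"Mrb\"" ...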

Whoever wrote the program that emits those unescaped quotes inside strings needs a serious talking to...
As Martijn said, parsing arbitrary crazy quotes is not easy.
OTOH, if the JSON is otherwise well-formed, and the offending strings don't cross line boundaries, then it's not so bad. E.g.:
#! /usr/bin/env python
''' Escape quotes in malformed JSON value strings
Written by PM 2Ring 2014.09.19
'''
import re

data = [
    ''' "evil_name":"Free "Mrb"",''',
    ''' "good_name":"Alan Turing",''',
]

for line in data:
    # Split at the first colon only, in case a value also contains ':'
    pre, val = line.split(':', 1)
    parts = re.split('(")', val)
    n = parts.count('"')
    if n > 2:
        # Escape every quote except the first and the last
        i = 1
        a = []
        for c in parts:
            if c == '"':
                if 1 < i < n:
                    c = '\\"'
                i += 1
            a.append(c)
        line = pre + ':' + ''.join(a)
    print line
Output
"evil_name":"Free \"Mrb\"",
"good_name":"Alan Turing",

Related

How to convert binary data to json

I want to convert the below data to JSON in Python. I have the data in the following format:
b'{"id": "1", "name": " value1"}\n{"id":"2", name": "value2"}\n{"id":"3", "name": "value3"}\n'
This has multiple JSON objects separated by \n. I was trying to load this as JSON: I converted the data into a string first and loaded it with json.loads, but I'm getting this exception:
my_json = content.decode('utf8')
json_data = json.loads(my_json)
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 2306)
You need to decode it, then split on '\n' and load each JSON object separately. If you store your byte string in a variable called byte_string you could do something like:
json_str = byte_string.decode('utf-8')
json_objs = json_str.split('\n')
for obj in json_objs:
    json.loads(obj)
For the particular string that you have posted here, though, you will get an error on the second object, because its second key is missing the opening double quote: it appears as name" in the string you posted.
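A slightly more defensive version of that loop (a sketch: skipping blank lines and reporting bad objects instead of crashing are my additions, not part of the original answer):
import json

byte_string = b'{"id": "1", "name": " value1"}\n{"id":"2", name": "value2"}\n{"id":"3", "name": "value3"}\n'

parsed = []
for line_no, obj in enumerate(byte_string.decode('utf-8').split('\n'), 1):
    if not obj.strip():
        continue  # the trailing \n leaves one empty entry behind
    try:
        parsed.append(json.loads(obj))
    except json.JSONDecodeError as err:
        print(f"skipping malformed object {line_no}: {err}")

print(parsed)  # objects 1 and 3; the broken second object is reported and skipped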
First, this isn't valid json since it's not a single object. Second, there is a typo: the "id":"2" entry is missing a double-quote on the name property element.
As an alternative to processing one dict at a time, you can replace the newlines with "," and turn the whole thing into an array. This is a fragile solution, since it requires exactly one newline between each dict, but it is compact:
s = b'{"id": "1", "name": " value1"}\n{"id":"2", "name": "value2"}\n{"id":"3", "name": "value3"}\n'
my_json = s.decode('utf8')
json_data = json.loads("[" + my_json.rstrip().replace("\n", ",") + "]")
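With the missing quote restored in s above, this yields a single list:
print(json_data)
# [{'id': '1', 'name': ' value1'}, {'id': '2', 'name': 'value2'}, {'id': '3', 'name': 'value3'}]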
You have to first decode your JSON bytes to a string. So you can just say:
your_json_string = the_json.decode()
Now you have a string. Next, strip out the newlines:
your_json_string = your_json_string.replace("\n", "")
You are replacing each \n with nothing, basically. Note that a single backslash is correct here: after decoding, the string contains real newline characters, not a literal backslash followed by an n. Now you can just say:
your_json = json.loads(your_json_string)
Be aware that this only parses if the result is a single JSON document; with several objects back to back you still need to split them or wrap them in an array, as shown above.

What is wrong with my Python syntax: I am trying to use multiple quotation marks plus variables in a string

I am trying to use Python to write to a file. However, the code has multiple " characters in it and also interpolates a variable. I simply cannot manage the syntax.
The code should result in:
{
"Name of site": "https://.google.com",
where the website comes from a variable, not a hard-coded string.
The code attempt is below. It never resolves the variable and just displays it as a string called host_name. I have attempted to add backslashes and quotations (various types of single and double) but whatever I try does not work.
with open("new_file.txt", "a") as f:
    f.write("{ \n")
    f.write("\"Name of site\": \"https://" + host_name + ", \n")
The new_file.txt shows:
"Name of site": "https:// + host_name + "\," + "
I have no idea where the "\," comes from.
You can use f-strings, and take advantage of the fact that both '' and "" create string literals.
>>> host_name = example.com
>>> output = "{\n"+ f'"Name of site": "https://{host_name}",' + "\n"
>>> print(output)
{
"Name of site": "https://example.com",
Note that in that example you also have to concatenate plain strings around the f-string, because f-strings require literal braces to be doubled and (before Python 3.12) don't allow backslashes in the expression part; however, there is even a way around that:
newline = '\n'
l_curly = "{"
output = f'{l_curly}{newline}"Name of site": "https://{host_name}", {newline}'
So that's how you'd build the string directly. But it also seems more likely that what you really want to do is construct a dictionary and then write it out as JSON:
>>> import json
>>> host_name = 'example.com'
>>> data = {"Name of site": f"https://{host_name}"}
>>> output = json.dumps(data, indent=4)
>>> print(output)
{
    "Name of site": "https://example.com"
}
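To tie that back to the original file-writing code, a minimal sketch (reusing the question's new_file.txt; "w" rather than "a" is an assumption here, so repeated runs don't append duplicate objects):
import json

host_name = 'example.com'  # stand-in value for the question's variable
data = {"Name of site": f"https://{host_name}"}

with open("new_file.txt", "w") as f:
    json.dump(data, f, indent=4)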

How to read csv with multiple quoted delimiters in single field?

I'd like to be able to split a string which contains the delimiter quoted multiple times. Is there an argument to handle this type of string with the csv module? Or is there another way to process it?
text = '"a,b"-"c,d","a,b"-"c,d"'
next(csv.reader(StringIO(text), delimiter=",", quotechar='"', quoting=csv.QUOTE_NONE))
Expected output: ['"a,b"-"c,d"', '"a,b"-"c,d"']
Actual output: ['"a', 'b"-"c', 'd"', '"a', 'b"-"c', 'd"']
EDIT:
The example above is simplified, but apparently too simplified as some comments provided solutions for the simplified version but not for the full version. Below is the actual data I want to process.
import csv
text = '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
next(csv.reader(StringIO(text), delimiter=",", quotechar='"', quoting=csv.QUOTE_NONE))
Expected output
[
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0',
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
]
Actual output
[
    '"3-Amino-1',
    '2',
    '4-triazole"-text-0-"3-Amino-1',
    '2',
    '4-triazole"-CD-0',
    '"3-Amino-1',
    '2',
    '4-triazole"-text-0-"3-Amino-1',
    '2',
    '4-triazole"-LS-0'
]
I'll only answer the first part of your question: there is no way to do this with the built-in csv module.
Looking at the CPython source code, the quotechar option is only processed at the start of a field:
case START_FIELD:
    /* expecting field */
    ...
    else if (c == dialect->quotechar &&
             dialect->quoting != QUOTE_NONE) {
        /* start quoted field */
        self->state = IN_QUOTED_FIELD;
    }
    ...
    break;
Inside a field, there is no such check:
case IN_FIELD:
    /* in unquoted field */
    if (c == '\n' || c == '\r' || c == '\0') {
        /* end of line - return [fields] */
        if (parse_save_field(self) < 0)
            return -1;
        self->state = (c == '\0' ? START_RECORD : EAT_CRNL);
    }
    else if (c == dialect->escapechar) {
        /* possible escaped character */
        self->state = ESCAPED_CHAR;
    }
    else if (c == dialect->delimiter) {
        /* save field - wait for new field */
        if (parse_save_field(self) < 0)
            return -1;
        self->state = START_FIELD;
    }
    else {
        /* normal character - save in field */
        if (parse_add_char(self, module_state, c) < 0)
            return -1;
    }
    break;
There is a check for quotechar while the parser is in the IN_QUOTED_FIELD state; however, upon encountering a quote, it goes back to the IN_FIELD state indicating we're inside an unquoted field. So this is possible:
>>> import csv
>>> import io
>>> print(next(csv.reader(io.StringIO('"a,b"cd,e'))))
['a,bcd', 'e']
But once the parser has reached the end of the initial quoted section, it will consider any subsequent quotes as part of the data. I don't know if this behaviour is to conform with any (written or unwritten) CSV specification, or if it's just a bug.
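To check that claim directly (continuing the interactive session above), a quote that appears after the quoted section has ended is simply kept as data:
>>> print(next(csv.reader(io.StringIO('"a,b"c"d,e'))))
['a,bc"d', 'e']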
The data is in a non-standard format, and so any solution would need to be tested on the full dataset. A possible workaround could be to first replace the ," character pair with ;" and then simply split on the ; (this assumes ; never occurs in the data itself). It can be done without using csv or re:
tests = [
    '"a,b"-"c,d","a,b"-"c,d"',
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0',
]

for test in tests:
    row = test.replace(',"', ';"').split(';')
    print(len(row), row)
Giving:
2 ['"a,b"-"c,d"', '"a,b"-"c,d"']
2 ['"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0', '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
If the structure is always the same, with the comma sandwiched between a digit and a '"', you can use a regular expression:
import re
re.split('(?<=[0-9]),(?=")', text)
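Applied to the full example from the question (a quick check: internal commas like "1,2" fail the lookahead, while the record-separating comma sits between a trailing 0 and an opening quote):
import re

text = ('"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,'
        '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0')
for field in re.split('(?<=[0-9]),(?=")', text):
    print(field)
# "3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0
# "3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0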

String from file to string utf-8 in python

So I am reading and manipulating a file with:
base_file = open(path + '/' + base_name, "r")
lines = base_file.readlines()
After this I search for the line that starts with "raw_data":
if re.match(r"\s{0,100}raw_data: ", line):
    split_line = line.split("raw_data:")
    print(split_line)
    raw_string = split_line[1]
One example of raw_data is:
raw_data: "&\276!\300\307 =\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\}\277\210\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
And raw_string will be:
print(raw_string)
"&\276!\300\307
=\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\}\277\210\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
If I read this from the file, each escape sequence comes through character by character, so \300 arrives as four separate characters rather than one.
So my question is: how do I transform this plain text into a proper string, so that reading \300 gives one character and not four?
I tried passing encoding='utf-8' to the open() call, but that does not work.
I have made the same example passing raw_data as a variable, and it works properly:
RAW_DATA = "&\276!\300\307 =\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\\}\277\210\\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300<I>>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
print(f"Qnt -> {len(RAW_DATA)}") # Qnt -> 256
print(type(RAW_DATA))
at = 0
total = 0
while at < len(RAW_DATA):
fin = at+4
substrin = RAW_DATA[at:fin]
resu = FourString_float(substrin)
at = fin
For this example \300 is only one char.
Hope someone can help me.
The problem is that in the file that is read, the escape \ symbols come in as literal \ characters, but in the example you've provided they are evaluated as part of the numeric escapes that follow them; i.e., \276 is read as a single character.
If you run:
RAW_DATA = r"&\276!\300\307 =\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\\}\277\210\\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300<I>>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
print(f"Qnt -> {len(RAW_DATA)}") # Qnt -> 256
print(type(RAW_DATA))
at = 0
total = 0
while at < len(RAW_DATA):
fin = at+4
substrin = RAW_DATA[at:fin]
resu = FourString_float(substrin)
at = fin
you should be getting the same error that you were getting originally. Notice that we are using a raw-string literal instead of a regular string literal; this ensures the backslashes don't get interpreted as escapes.
You need to evaluate RAW_DATA to force Python to process those escapes.
You can do something like RAW_DATA = eval(f'"{RAW_DATA}"') or
import ast
RAW_DATA = ast.literal_eval(f'"{RAW_DATA}"')
Note, the second option is a bit more secure than doing a straight eval, as you are limiting the scope of what can be executed.
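An alternative sketch that avoids eval-style execution altogether: since the file stores the escapes as literal text, the 'unicode_escape' codec can interpret them (this assumes the values are octal escapes of single bytes, as in the question):
import codecs

raw_text = r'&\276!\300\307'  # literal backslash escapes, as read from the file
decoded = codecs.decode(raw_text, 'unicode_escape')
print(len(raw_text))  # 14 -- each \NNN escape is still four separate characters
print(len(decoded))   # 5  -- each \NNN escape collapsed into a single character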

How do I decode unicode characters via python?

I am trying to import the following json file using python:
The file is called new_json.json:
{
    "nextForwardToken": "f/3208873243596875673623625618474139659",
    "events": [
        {
            "ingestionTime": 1045619,
            "timestamp": 1909000,
            "message": "2 32823453119 eni-889995t1 54.25.64.23 156.43.12.120 3389 23 6 342 24908 143234809 983246 ACCEPT OK"
        }
    ]
}
I have the following code to read the json file, and remove the unicode characters:
JSON_FILE = "new_json.json"
with open(JSON_FILE) as infile:
print infile
print '\n type of infile is \n', infile
data = json.load(infile)
str_data = str(data) # convert to string to remove unicode characters
wo_unicode = str_data.decode('unicode_escape').encode('ascii','ignore')
print 'unicode characters have been removed \n'
print wo_unicode
But print wo_unicode still prints with the unicode markers (i.e. the u prefix) in it.
The unicode characters cause a problem when trying to treat the json as a dictionary:
for item in data:
    iden = item.get['nextForwardToken']
...results in an error:
AttributeError: 'unicode' object has no attribute 'get'
This has to work in Python2.7. Is there an easy way around this?
The error has nothing to do with unicode: you are trying to treat the keys as dicts. Just use data to get 'nextForwardToken':
print data.get('nextForwardToken')
When you iterate over data, you are iterating over the keys so 'nextForwardToken'.get('nextForwardToken'), "events".get('nextForwardToken') etc.. are obviously not going to work even with the correct syntax.
Whether you access by data.get(u'nextForwardToken') or data.get('nextForwardToken'), both will return the value for the key:
In [9]: 'nextForwardToken' == u'nextForwardToken'
Out[9]: True
In [10]: data[u'nextForwardToken']
Out[10]: u'f/3208873243596875673623625618474139659'
In [11]: data['nextForwardToken']
Out[11]: u'f/3208873243596875673623625618474139659'
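So the fix for the original loop is to index data directly, and to iterate over data['events'] when you want the per-event dicts (a Python 2 sketch against the new_json.json from the question):
import json

with open("new_json.json") as infile:
    data = json.load(infile)

print data['nextForwardToken']   # f/3208873243596875673623625618474139659
for event in data['events']:
    print event['message']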
This code will give you the values as str, without the unicode prefix:
import json

JSON_FILE = "/tmp/json.json"
with open(JSON_FILE) as infile:
    print infile
    print '\n type of infile is \n', infile
    data = json.load(infile)
    print data
    str_data = json.dumps(data)
    print str_data
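And if you genuinely need plain str objects throughout the parsed structure, the usual Python 2 approach is to walk it recursively (a sketch; the helper name byteify is my own, not from the answer):
def byteify(value):
    # Recursively convert unicode objects from json.load into UTF-8 str (Python 2).
    if isinstance(value, dict):
        return {byteify(k): byteify(v) for k, v in value.iteritems()}
    if isinstance(value, list):
        return [byteify(item) for item in value]
    if isinstance(value, unicode):
        return value.encode('utf-8')
    return value

data = byteify(json.load(open(JSON_FILE)))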
