Regular Expression to remove selective string - python

I want to remove a particular substring that occurs inside a JSON string.
For example, my JSON string is:
{"tableName":"avzConf","rows":[{"Comp":"mster","Conf": "[{\"name\": \"state\", \"dispN\": \"c_d_test\", \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}}, {\"name\": \"stClu\", \"dNme\": \"tab(s) Updatedd\", \"\": {\"updated_at\": \"2020-09-21T10:17:48.307874Z\", \"updated_by\": \"Def Ghi<def_ghi#uuvvww.com>\"}}
}]
}
I want to remove: \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}
Expected output :
{"tableName":"avzConf","rows":[{"Comp":"mster","Conf": "[{\"name\": \"state\", \"dispN\": \"c_d_test\"}, {\"name\": \"stClu\", \"dNme\": \"tab(s) Updatedd\"}
}]
}
I tried with ( \\"\\": {\\"updated_\w+)(.*)(>\\")
used in my code:
import re
line = re.sub(r"updated_\w+(.*)(.com>)", '', json_str)
But it also selects the text in between, because "": {"updated_at" and "updated_by" each occur twice.
It also leaves the stray characters "": {""} behind.
How can I completely remove \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}?

Try this:
\{\"updated_at[^{]+\}
This matches from the relevant opening { to the relevant closing } by allowing any character except { to occur one or more times in between.
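To see why a negated character class helps here, below is a minimal sketch on a simplified, already-unescaped sample (the text value and its short field values are made up for illustration). It uses [^{}], a slightly stricter variant of the pattern above that also excludes }, so each match ends at the first closing brace:

```python
import re

# Simplified stand-in for the decoded "Conf" value (hypothetical short values).
text = ('[{"name": "state", "": {"updated_at": "t1", "updated_by": "u1"}}, '
        '{"name": "stClu", "": {"updated_at": "t2", "updated_by": "u2"}}]')

# Greedy .* runs to the last "}" in the text, swallowing both records.
greedy = re.findall(r'\{"updated_at.*\}', text)

# A negated class cannot cross a brace, so each match stays inside one object.
bounded = re.findall(r'\{"updated_at[^{}]+\}', text)

print(len(greedy))
print(bounded)
```

The greedy pattern yields a single oversized match, while the bounded one yields one match per nested object.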

With the JSON held as a Python string, I was able to remove the unwanted fields as shown below.
The pattern matches the leading comma, the empty key, and its nested object up to the first closing brace, so replacing the match with an empty string removes the unwanted key and leaves the surrounding braces intact.
Regex: ,\s\\"\\":\s\{\\"updated_at[^{}]+\}
import re
# Raw string so the backslash-escaped quotes stay literal, as in the original payload
json_str = r'''{"tableName":"avzConf","rows":[{"Comp":"mster","Conf": "[{\"name\": \"state\", \"dispN\": \"c_d_test\", \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}}, {\"name\": \"stClu\", \"dNme\": \"tab(s) Updatedd\", \"\": {\"updated_at\": \"2020-09-21T10:17:48.307874Z\", \"updated_by\": \"Def Ghi<def_ghi#uuvvww.com>\"}} }] }'''
# [^{}]+ stops at the first closing brace, so only the nested object is removed
line = re.sub(r',\s\\"\\":\s\{\\"updated_at[^{}]+\}', '', json_str)
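If the payload is well-formed, parsing it may be less fragile than a regex. The sketch below assumes Conf holds a JSON-encoded list of objects and that the unwanted data always sits under the empty key; the helper name drop_empty_key and the shortened updated_by values are made up for illustration:

```python
import json

def drop_empty_key(doc_text):
    """Remove the "" entry from every object in each row's Conf list."""
    doc = json.loads(doc_text)
    for row in doc["rows"]:
        items = json.loads(row["Conf"])      # Conf is itself a JSON string
        for item in items:
            item.pop("", None)               # ignore objects without the key
        row["Conf"] = json.dumps(items)      # re-encode as a string
    return json.dumps(doc)

# Build a small sample with the same shape as the question's data.
conf = [
    {"name": "state", "dispN": "c_d_test",
     "": {"updated_at": "2020-09-16T06:33:07.684504Z", "updated_by": "x"}},
    {"name": "stClu", "dNme": "tab(s) Updatedd",
     "": {"updated_at": "2020-09-21T10:17:48.307874Z", "updated_by": "y"}},
]
sample = json.dumps({"tableName": "avzConf",
                     "rows": [{"Comp": "mster", "Conf": json.dumps(conf)}]})

cleaned = json.loads(drop_empty_key(sample))
print(cleaned["rows"][0]["Conf"])
```

This avoids having to reason about backslash escaping at all, at the cost of requiring valid JSON on the way in.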

Related

Format a JSON and then open it with json.load() [duplicate]

I currently have JSON in the below format.
Some of the Key values are NOT properly formatted as they are missing double quotes (")
How do I fix these key values to have double-quotes on them?
{
Name: "test",
Address: "xyz",
"Age": 40,
"Info": "test"
}
Required:
{
"Name": "test",
"Address": "xyz",
"Age": 40,
"Info": "test"
}
Using the below post, I was able to find such key values in the above INVALID JSON.
However, I could NOT find an efficient way to replace these found values with double-quotes.
s = "Example: String"
out = re.findall(r'\w+:', s)
How to Escape Double Quote inside JSON
Using Regex:
import re
data = """{ Name: "test", Address: "xyz"}"""
print(re.sub(r"(\w+):", r'"\1":', data))
Output:
{ "Name": "test", "Address": "xyz"}
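One caveat: a bare (\w+): also matches keys that are already quoted, such as "Age": in the question, and would quote them a second time. A possible refinement is sketched below. It assumes keys are plain identifiers and that no string value contains a comma or brace followed by an identifier and a colon; the names BARE_KEY and quote_bare_keys are made up for illustration:

```python
import json
import re

# An unquoted identifier that directly follows "{" or "," and precedes ":".
BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)\s*:')

def quote_bare_keys(text):
    """Wrap unquoted keys in double quotes, leaving quoted keys untouched."""
    return BARE_KEY.sub(r'\1"\2":', text)

dirty = '''{
Name: "test",
Address: "xyz",
"Age": 40,
"Info": "test"
}'''

fixed = quote_bare_keys(dirty)
parsed = json.loads(fixed)
print(parsed)
```

Anchoring on the preceding { or , is what keeps already-quoted keys from being touched.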
You can use PyYAML. Since JSON is (nearly) a subset of YAML, PyYAML can cope with the missing quotes.
Example
import yaml
dirty_json = """
{
key: "value",
"key2": "value"
}
"""
yaml.load(dirty_json, yaml.SafeLoader)
I faced a few more issues in my JSON, so I thought I'd share the final solution that worked for me.
jsonStr = re.sub(r"((?=\D)\w+):", r'"\1":', jsonStr)
jsonStr = re.sub(r": ((?=\D)\w+)", r':"\1"', jsonStr)
The first line fixes the missing double quotes on keys, e.g.
Name: "test"
The second line fixes missing double quotes on values, e.g. "Info": test
The (?=\D) lookahead requires each match to start with a non-digit, so the digits around the : (colon) in date timestamps are left unquoted.
You can use an online formatter. I know most of them throw an error when double quotes are missing, but the one below seems to handle it nicely!
JSON Formatter
The regex approach can be brittle. I suggest you find a library that can parse the JSON text that is missing quotes.
For example, in Kotlin 1.4, the standard way to parse a JSON string is using Json.decodeFromString. However, you can use Json { isLenient = true }.decodeFromString to relax the requirements for quotes. Here is a complete example in JUnit.
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json
import org.junit.jupiter.api.Assertions
import org.junit.jupiter.api.Test

@Serializable
data class Widget(val x: Int, val y: String)

class JsonTest {
    @Test
    fun `Parsing Json`() {
        val w: Widget = Json.decodeFromString("""{"x":123, "y":"abc"}""")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }

    @Test
    fun `Parsing Json missing quotes`() {
        // Json.decodeFromString("{x:123, y:abc}") failed to decode due to missing quotes
        val w: Widget = Json { isLenient = true }.decodeFromString("{x:123, y:abc}")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }
}

Parse JSON structures in a txt file containing JSON and text structures

I have a txt file with JSON structures. The problem is that the file contains not only JSON structures but also raw text such as log lines:
2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "1111",
"results": [{
"filename": "xxxx",
"numberID": "7412"
}, {
"filename": "xgjhh",
"numberID": "E52"
}]
}
2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "jfkjgjkf",
"results": [{
"filename": "hhhhh",
"numberID": "478962"
}, {
"filename": "jkhgfc",
"number": "12544"
}]
}
I read the .txt file, but when I try to parse the JSON structures I get an error:
IN :
import json
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    json_data = json.load(f)
OUT : json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)
I would like to parse the JSON and save it as a CSV file.
A more general solution to parsing a file with JSON objects mixed with other content without any assumption of the non-JSON content would be to split the file content into fragments by the curly brackets, start with the first fragment that is an opening curly bracket, and then join the rest of fragments one by one until the joined string is parsable as JSON:
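import json
import re

with open("data.txt", "r", encoding="utf-8", errors="ignore") as f:
    # The capturing group keeps each brace as its own fragment
    fragments = iter(re.split('([{}])', f.read()))
while True:
    try:
        # Skip ahead to the next opening brace
        while True:
            candidate = next(fragments)
            if candidate == '{':
                break
        # Keep appending fragments until the candidate parses as JSON
        while True:
            candidate += next(fragments)
            try:
                print(json.loads(candidate))
                break
            except json.decoder.JSONDecodeError:
                pass
    except StopIteration:
        break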
This outputs:
{'name': '1111', 'results': [{'filename': 'xxxx', 'numberID': '7412'}, {'filename': 'xgjhh', 'numberID': 'E52'}]}
{'name': 'jfkjgjkf', 'results': [{'filename': 'hhhhh', 'numberID': '478962'}, {'filename': 'jkhgfc', 'number': '12544'}]}
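A different way to get the same result, swapped in here as an alternative rather than a restatement of the answer above, is the standard library's json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and reports where it ended. A minimal sketch, assuming every JSON structure of interest starts with {; the function name extract_json_objects and the shortened log sample are made up for illustration:

```python
import json

def extract_json_objects(text):
    """Yield every top-level JSON object found in text, skipping other content."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1      # not valid JSON here, try the next brace
            continue
        yield obj
        pos = end                # resume right after the parsed object

log = '''2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
{
"name": "1111",
"results": [{"filename": "xxxx", "numberID": "7412"}]
}
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{"name": "jfkjgjkf", "results": []}
'''

objs = list(extract_json_objects(log))
print(objs)
```

Because the decoder does the brace matching itself, the text is scanned once instead of being re-parsed after every appended fragment.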
This solution strips out the non-JSON lines and wraps the remaining JSON structures in a containing JSON structure. It assumes the structures in data.txt are separated by blank lines:
import json
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    cleaned = ''.join([item.strip() if item.strip() != '' else '-split_here-' for item in f.readlines() if '|INFO|' not in item]).split('-split_here-')
json_data = json.loads('{"entries":[' + ''.join([entry + ', ' for entry in cleaned])[:-2] + ']}')
The wrapped string passed to json.loads:
{"entries":[{"name": "1111","results": [{"filename": "xxxx","numberID": "7412"}, {"filename": "xgjhh","numberID": "E52"}]}, {"name": "jfkjgjkf","results": [{"filename": "hhhhh","numberID": "478962"}, {"filename": "jkhgfc","number": "12544"}]}]}
What's going on here?
In the cleaned = ... line, we're using a list comprehension that creates a list of the lines in the file (f.readlines()) that do not contain the string |INFO| and adds the string -split_here- to the list whenever there's a blank line (where .strip() yields '').
Then, we're converting that list of lines (''.join()) into a string.
Finally, we're converting that string (.split('-split_here-')) into a list of strings, one per JSON structure, using the blank lines in data.txt as separators.
In the json_data = ... line, we're appending a ', ' to each of the JSON structures using a list comprehension.
Then, we convert that list back into a single string and strip off the last ', ' (.join()[:-2]; the [:-2] slices the last two characters off the string).
We then wrap the string with '{"entries":[' and ']}' to make the whole thing a valid JSON structure, and feed it to json.loads to load your data as a Python object.
You could do one of several things:
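On the command line, remove all lines where, say, "|INFO|Technical|" appears (assuming this appears in every line of raw text). The | is left unescaped because it is a literal character in sed's basic regular expressions, whereas GNU sed treats \| as alternation:
sed -i '' -e '/|INFO|Technical/d' yourfilename (if on Mac),
sed -i '/|INFO|Technical/d' yourfilename (if on Linux).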
Move these raw lines into their own JSON fields
Use the "text structures" as a delimiter between JSON objects.
Iterate over the lines in the file, saving them to a buffer until you encounter a line that is a text line, at which point parse the lines you've saved as a JSON object.
import re
import json

def is_text(line):
    # returns a match (truthy) if line starts with a date and time in "YYYY-MM-DD HH:MM:SS" format
    line = line.lstrip('|')  # you said some lines start with a leading |, remove it
    return re.match(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})", line)

json_objects = []
with open("data.txt") as f:
    json_lines = []
    for line in f:
        if not is_text(line):
            json_lines.append(line)
        else:
            # if there's multiple text lines in a row json_lines will be empty
            if json_lines:
                json_objects.append(json.loads("".join(json_lines)))
                json_lines = []
    # we still need to parse the remaining object in json_lines
    # if the file doesn't end in a text line
    if json_lines:
        json_objects.append(json.loads("".join(json_lines)))
print(json_objects)
Repeating the logic in the last two lines is a bit ugly, but you need to handle the case where the last line in your file is not a text line, so when you're done with the for loop you need to parse the last object sitting in json_lines, if there is one.
I'm assuming there's never more than one JSON object between text lines, and also my regular expression for a date will break in 8,000 years.
You could count curly brackets in your file to find beginning and ending of your jsons, and store them in list, here found_jsons.
import json

with open("data.txt", encoding="utf-8", errors="ignore") as f:
    content = f.read()

open_chars = 0       # number of currently unclosed curly brackets
saved_content = []   # lines of the JSON structure being collected
found_jsons = []
for i in content.splitlines():
    open_chars += i.count('{')
    if open_chars:
        saved_content.append(i)
    open_chars -= i.count('}')
    if open_chars == 0 and saved_content:
        found_jsons.append(json.loads('\n'.join(saved_content)))
        saved_content = []
for i in found_jsons:
    print(json.dumps(i, indent=4))
Output
{
    "results": [
        {
            "numberID": "7412",
            "filename": "xxxx"
        },
        {
            "numberID": "E52",
            "filename": "xgjhh"
        }
    ],
    "name": "1111"
}
{
    "results": [
        {
            "numberID": "478962",
            "filename": "hhhhh"
        },
        {
            "number": "12544",
            "filename": "jkhgfc"
        }
    ],
    "name": "jfkjgjkf"
}
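Since the question also asks for a CSV file, here is one possible way to flatten the parsed objects, sketched under the assumption that each object has a name and a results list of flat dictionaries. The function name objects_to_csv and the one-row-per-result layout are illustrative choices, not something the question specifies:

```python
import csv
import io

def objects_to_csv(json_objects, out):
    """Write one CSV row per entry in each object's "results" list."""
    rows = []
    for obj in json_objects:
        for result in obj.get("results", []):
            row = {"name": obj.get("name")}
            row.update(result)
            rows.append(row)
    # Collect column names in first-seen order, since keys vary between entries.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return rows

found = [
    {"name": "1111", "results": [{"filename": "xxxx", "numberID": "7412"}]},
    {"name": "jfkjgjkf", "results": [{"filename": "jkhgfc", "number": "12544"}]},
]
buffer = io.StringIO()
objects_to_csv(found, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("out.csv", "w", newline="") in place of the StringIO buffer.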

How to remove the first and last portion of a string in Python?

How can I cut from such a (JSON) string everything before and including the first [ and everything after and including the last ] with Python?
{
"Customers": [
{
"cID": "w2-502952",
"soldToId": "34124"
},
...
...
],
"status": {
"success": true,
"message": "Customers: 560",
"ErrorCode": ""
}
}
I want to keep only
{
"cID" : "w2-502952",
"soldToId" : "34124",
}
...
...
String manipulation is not the way to do this. You should parse your JSON into Python and extract the relevant data using normal data structure access.
obj = json.loads(data)
relevant_data = obj["Customers"]
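As a fuller illustration of the parsing approach, here is a self-contained sketch. The data string is a shortened stand-in for the question's document (one customer, and a made-up message value), so treat the sample values as assumptions:

```python
import json

# Shortened stand-in for the document in the question.
data = '''{
  "Customers": [
    {"cID": "w2-502952", "soldToId": "34124"}
  ],
  "status": {"success": true, "message": "Customers: 1", "ErrorCode": ""}
}'''

obj = json.loads(data)
customers = obj["Customers"]           # a plain Python list of dicts

# Re-serialize just the customer records if text output is needed.
for customer in customers:
    print(json.dumps(customer, indent=4))
```

Working on the parsed list means brackets inside string values or nested arrays cannot throw off the result.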
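In addition to @Daniel Rosman's answer, if you want the items of every list in the JSON:
result = []
obj = json.loads(data)
for value in obj.values():
    if isinstance(value, list):
        # extend adds each item of the list; append(*value) fails for lists with more than one item
        result.extend(value)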
While I agree that Daniel's answer is the absolute best way to go, if you must use string splitting, you can try .find() and .rfind()
# string holds the JSON text, however you are loading it
start = string.find('[') + 1   # just past the first [
end = string.rfind(']')        # position of the last ]
customers = string[start:end]
print(customers)
The output will be everything between the first [ and the last ].
If you really want to do this via string manipulation (which I don't recommend), you can do it this way:
start = s.find('[') + 1
finish = s.rfind(']')
inner = s[start : finish]

In JSON output, force every opening curly brace to appear in a new separate line

With json.dumps(some_dict,indent=4,sort_keys=True) in my code:
I get something like this:
{
    "a": {
        "x":1,
        "y":2
    },
    "b": {
        "z":3,
        "w":4
    }
}
But I want something like this:
{
    "a":
    {
        "x":1,
        "y":2
    },
    "b":
    {
        "z":3,
        "w":4
    }
}
How can I force each opening curly brace to appear at the beginning of a new separate line?
Do I have to write my own JSON serializer, or is there a special argument that I can use when calling json.dumps?
You can use a regular expression replacement on the result.
better_json = re.sub(r'^((\s*)".*?":)\s*([\[{])', r'\1\n\2\3', json_str, flags=re.MULTILINE)  # json_str is the output of json.dumps
The first capture group matches everything up to the : after the property name, the second capture group matches the whitespace before the property name, and the third capture group captures the { or [ before the object or array. The whitespace is then copied after the newline, so that the indentation will match properly.
Building on Barmar's excellent answer, here's a more complete demo showing how you can convert and customize your JSON in Python:
import json
import re

# JSONifies dict and saves it to file
def save(data, filename):
    with open(filename, "w") as write_file:
        write_file.write(jsonify(data))

# Converts Python dict to a JSON string. Indents with tabs and puts opening
# braces on their own line.
def jsonify(data):
    default_json = json.dumps(data, indent = '\t')
    better_json = re.sub(
        r'^((\s*)".*?":)\s*([\[{])',
        r'\1\n\2\3',
        default_json,
        flags=re.MULTILINE
    )
    return better_json

# Sample data for demo
data = {
    "president":
    {
        "name": "Zaphod Beeblebrox",
        "species": "Betelgeusian"
    }
}
filename = 'test.json'

# Demo
print("Here's your pretty JSON:")
print(jsonify(data))
print()
print('Saving to file:', filename)
save(data, filename)

Parse non-standard semicolon separated "JSON"

I have a non-standard "JSON" file to parse. Each item is semicolon separated instead of comma separated. I can't simply replace ; with , because there might be some value containing ;, ex. "hello; world". How can I parse this into the same structure that JSON would normally parse it?
{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world";
...
}
Use the Python tokenize module to transform the text stream to one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input too, even including semicolons. The tokenizer presents strings as whole tokens, and 'raw' semicolons are in the stream as single token.OP tokens for you to replace:
import tokenize
import json
corrected = []
with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')
        else:
            corrected.append(token[1])
data = json.loads(''.join(corrected))
This assumes that the format becomes valid JSON once you've replaced the semicolons with commas; e.g. no trailing commas before a closing ] or } allowed, although you could even track the last comma added and remove it again if the next non-newline token is a closing brace.
Demo:
>>> import tokenize
>>> import json
>>> open('semi.json', 'w').write('''\
... {
... "client" : "someone";
... "server" : ["s1"; "s2"];
... "timestamp" : 1000000;
... "content" : "hello; world"
... }
... ''')
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print ''.join(corrected)
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}
>>> json.loads(''.join(corrected))
{u'content': u'hello; world', u'timestamp': 1000000, u'client': u'someone', u'server': [u's1', u's2']}
Inter-token whitespace was dropped, but could be re-instated by paying attention to the tokenize.NL tokens and the (lineno, start) and (lineno, end) position tuples that are part of each token. Since the whitespace around the tokens doesn't matter to a JSON parser, I've not bothered with this.
You can do some odd things and (probably) get it right.
Because strings in JSON cannot contain raw control characters such as a tab, you can replace every ; with a tab followed by a comma. The file will then parse correctly if your JSON parser can load non-strict JSON (such as Python's).
After that, you only need to convert your data back to JSON, replace every \t, with ; again, and use a normal JSON parser to finally load the correct object.
Some sample code in Python:
data = '''{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world"
}'''
import json
dec = json.JSONDecoder(strict=False).decode(data.replace(';', '\t,'))
enc = json.dumps(dec)
out = json.loads(enc.replace('\\t,', ';'))
Using a simple character state machine, you can convert this text back to valid JSON. The basic thing we need to handle is to determine the current "state" (whether we are escaping a character, in a string, list, dictionary, etc), and replace ';' by ',' when in a certain state.
I don't know if this is the proper way to write it; there is probably a way to make it shorter, but I don't have enough programming skill to write an optimal version.
I tried to comment as much as I could:
def filter_characters(text):
    # we use this dictionary to match opening/closing tokens
    STATES = {
        '"': '"', "'": "'",
        "{": "}", "[": "]"
    }
    # these two variables represent the current state of the parser
    escaping = False
    state = list()
    # we iterate through each character
    for c in text:
        if escaping:
            # if we are currently escaping, no special treatment
            escaping = False
        else:
            if c == "\\":
                # character is a backslash, set the escaping flag for the next character
                escaping = True
            elif state and c == state[-1]:
                # character is expected closing token, update state
                state.pop()
            elif c in STATES:
                # character is known opening token, update state
                state.append(STATES[c])
            elif c == ';' and state == ['}']:
                # this is the delimiter we want to change
                c = ','
        yield c
    assert not state, "unexpected end of file"

def filter_text(text):
    return ''.join(filter_characters(text))
Testing with :
{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world";
...
}
Returns :
{
"client" : "someone",
"server" : ["s1"; "s2"],
"timestamp" : 1000000,
"content" : "hello; world",
...
}
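Note that in the output above the semicolon inside the "server" list is left alone, because the delimiter is only replaced directly inside the top-level object. If the list separators should be converted too, a simpler variant is to track only whether the current character is inside a double-quoted string. This is a sketch of that idea rather than a drop-in replacement for the code above; the function name semicolons_to_commas is made up, and it assumes there is no trailing ; before a closing bracket:

```python
import json

def semicolons_to_commas(text):
    """Replace every ';' that is outside a double-quoted string with ','."""
    out = []
    in_string = False
    escaping = False
    for ch in text:
        if in_string:
            if escaping:
                escaping = False        # this character was escaped, keep it as-is
            elif ch == "\\":
                escaping = True         # next character is escaped
            elif ch == '"':
                in_string = False       # closing quote
        elif ch == '"':
            in_string = True            # opening quote
        elif ch == ";":
            ch = ","                    # separator outside any string
        out.append(ch)
    return "".join(out)

sample = '''{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world"
}'''

parsed = json.loads(semicolons_to_commas(sample))
print(parsed)
```

Since JSON only has one kind of string delimiter, a single in_string flag is enough and no bracket stack is needed.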
Pyparsing makes it easy to write a string transformer. Write an expression for the string to be changed, and add a parse action (a parse-time callback) to replace the matched text with what you want. If you need to avoid some cases (like quoted strings or comments), then include them in the scanner, but just leave them unchanged. Then, to actually transform the string, call scanner.transformString.
(It wasn't clear from your example whether you might have a ';' after the last element in one of your bracketed lists, so I added a term to suppress these, since a trailing ',' in a bracketed list is also invalid JSON.)
sample = """
{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world";
}"""
from pyparsing import Literal, replaceWith, Suppress, FollowedBy, quotedString
import json
SEMI = Literal(";")
repl_semi = SEMI.setParseAction(replaceWith(','))
term_semi = Suppress(SEMI + FollowedBy('}'))
qs = quotedString
scanner = (qs | term_semi | repl_semi)
fixed = scanner.transformString(sample)
print(fixed)
print(json.loads(fixed))
prints:
{
"client" : "someone",
"server" : ["s1", "s2"],
"timestamp" : 1000000,
"content" : "hello; world"}
{'content': 'hello; world', 'timestamp': 1000000, 'client': 'someone', 'server': ['s1', 's2']}
