Regular expression in python replace - python

Is there any way, using a regular expression in Python, to remove every , (comma) that follows a closing curly brace }?
The data is in a file, abc.json, in the following format:
{
"Key1":"value1",
"Key2":"value2"
},
{
"Key1":"value3",
"Key2":"value4"
},
{
"Key1":"value5",
"Key2":"value6"
}
This should result in the following:
{
"Key1":"value1",
"Key2":"value2"
}
{
"Key1":"value3",
"Key2":"value4"
}
{
"Key1":"value5",
"Key2":"value6"
}
As you can see, the , (comma) after every closing brace } has been removed.
It would be helpful if this could also be achieved with jq, apart from a Python regex.

Test Source: https://regex101.com/r/wT6uU2/1
import re
p = re.compile(r'},')
test_str = "{\n\"Key1\":\"value1\",\n\"Key2\":\"value2\"\n},\n\n{\n\"Key1\":\"value3\",\n\"Key2\":\"value4\"\n},\n\n{\n\"Key1\":\"value5\",\n\"Key2\":\"value6\"\n}"
# Use sub instead of findall to replace every }, with }
print(p.sub('}', test_str))

This works:
import re
s="""{
"Key1":"value1",
"Key2":"value2"
},
{
"Key1":"value3",
"Key2":"value4"
},
{
"Key1":"value5",
"Key2":"value6"
}"""
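# Capture each {...} block lazily (re.S lets . span newlines), followed by a comma
pattern = re.compile(r'(?P<data>{.*?}),', re.S)
print(pattern.findall(s))
# Replace each match with just the captured block, dropping the comma
s1 = pattern.sub(r'\g<data>', s)
print(s1)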

If you intend to process the resulting JSON in jq, it's probably easier to wrap it in brackets [{...}, {...}] to make it a JSON array. Then, you can use .[] in jq to unwrap the array.

Before you even consider other options, you really should go back to the source that generated that file and make sure it actually outputs valid JSON.
That said, you could use jq to manipulate the contents as a raw string to add brackets, then parse it as an array and spit out the contents.
$ jq -Rs '"[\(.)]" | fromjson[]' abc.json
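The same bracket-wrapping idea can be sketched in plain Python, without any regex. This is only a sketch: it assumes the text is a comma-separated run of otherwise valid JSON objects, and the helper name split_objects is made up for illustration.

```python
import json

def split_objects(text):
    """Wrap a comma-separated run of JSON objects in [ ] and parse it as an array."""
    return json.loads("[" + text + "]")

# Sample text in the same shape as abc.json from the question (shortened)
sample = """{
"Key1":"value1",
"Key2":"value2"
},
{
"Key1":"value3",
"Key2":"value4"
}"""

objects = split_objects(sample)
for obj in objects:
    # Emit one object per line, with no separating commas
    print(json.dumps(obj))
```

Because the text goes through a real JSON parser, a comma inside a string value is left alone, which a plain }, substitution cannot guarantee.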

Related

List of dictionaries where value has two double quoted values

I came up with list of dictionaries as a string. I wanted to convert this string to dictionary but it gives error.
data = '{
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"",
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}'
After checking it, I found out a value enclosed in double quotes two times.
data = {
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"", # this value
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}
I want to turn "Alte Mühle" into a single-quoted 'Alte Mühle' or just Alte Mühle. I tried converting the dictionary to a str and using the string.replace() function, but it didn't work. Since the value is dynamic, I can't just change it in a static way, i.e.
string.replace('"Alte Mühle"', 'Alte Mühle') # will only change this value
Is there any way to get rid of this?
Not enough rep to comment, so I'm assuming you are starting with a bunch of string literals you typed manually into your code. If not, there are other ways to handle this, or it may not have been an issue to start with.
Here is a solution that doesn't require manually searching for problem strings. Enclose your dictionary string literal in triple quotes (either """ or ''' is permitted) instead of a single ' or ". This prevents the interpreter from getting confused by ' or " inside the string literal.
data = """{
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"",
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}"""
Next, the double-quote problem can be handled using regular expressions (re). I have to leave this as an exercise as I am on a phone, but you can replace every " that lies inside a dictionary value with ', using a search pattern such as ": "(.+?)", to find the values. Find this pattern, modify the substring, then replace the old substring with the corrected one.
Finally, to interpret it as a dictionary, call ast.literal_eval(...) on the corrected string (a safer version of eval(...) that only interprets literals). This requires importing ast from the standard library.
Consider comparing this workload against manually fixing your strings, or loading the strings or key/value pairs from a database, avoiding these string-literal issues altogether.
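The steps described in this answer could be sketched roughly as follows. This is one possible approach under stated assumptions, not the answerer's exact code: it assumes one "key": "value" pair per line and values that contain no apostrophes, and the names VALUE_LINE and fix_inner_quotes are invented for illustration.

```python
import ast
import re

data = """{
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"",
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}"""

# Assumed layout: each string value sits on its own line.
# group 1: key, colon and the opening quote of the value
# group 2: the value itself (greedy, so it keeps any inner quotes)
# group 3: the closing quote plus an optional comma and closing brace
VALUE_LINE = re.compile(
    r'^([ \t]*"[^"]+"[ \t]*:[ \t]*")(.*)("[ \t]*,?[ \t]*\}?[ \t]*)$',
    re.M,
)

def fix_inner_quotes(text):
    """Turn double quotes inside string values into single quotes."""
    def repl(match):
        return match.group(1) + match.group(2).replace('"', "'") + match.group(3)
    return VALUE_LINE.sub(repl, text)

# literal_eval accepts the trailing comma before the closing brace
fixed = ast.literal_eval(fix_inner_quotes(data))
print(fixed["name"])
```

Numeric lines such as "lat" are not touched, because the pattern requires a double quote right after the colon.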

Converting string containing double quotes to json

Python Escape Double quote character and convert the string to json
I have tried escaping the double quotes with escape characters, but that didn't work either.
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20"x30"","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
new_data = json.loads(raw_string)
Loading fails with the error Expecting ',' delimiter: line 1 column 180 (char 179)
The expected output is a JSON string.
The correct JSON string, with escaped quotes, should look like this:
[{
"Attribute": "color",
"Keywords": "green",
"AttributeComments": null
}, {
"Attribute": " season",
"Keywords": ["Holly Berry"],
"AttributeComments": null
}, {
"Attribute": " size",
"Keywords": "20\"x30",
"AttributeComments": null
}, {
"Attribute": " unit",
"Keywords": "1",
"AttributeComments": null
}]
Edit:
You can use a regular expression to correct the string in Python, resulting in valid JSON:
import re
import json
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20"x30"","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
# Match digits, a stray quote, x, digits, and the doubled closing quote
pattern = r'"Keywords":"(\d+)"x(\d+)""'
correctedString = re.sub(pattern, r'"Keywords": "\g<1>x\g<2>"', raw_string)
print(json.loads(correctedString))
Output (Python 3):
[{'Attribute': 'color', 'Keywords': 'green', 'AttributeComments': None}, {'Attribute': ' season', 'Keywords': ['Holly Berry'], 'AttributeComments': None}, {'Attribute': ' size', 'Keywords': '20x30', 'AttributeComments': None}, {'Attribute': ' unit', 'Keywords': '1', 'AttributeComments': None}]
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20x30","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
new_data = json.loads(raw_string)
First of all, change the key-value pair "Keywords":"20"x30"" to "Keywords":"20x30".
The formatting is invalid in your code. If this JSON was not written by you but generated by some other source, check that source. You can check whether the JSON is valid using JSONLint: just paste your JSON there.
As for your code:
import json
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20x30","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
new_data = json.loads(raw_string)
new_data is a list. If you check the type of its first element with print(type(new_data[0])), you'll find it is the dict you wanted.
EDIT: Since you say you are fetching this JSON from a database, check whether the JSONs there all carry this type of formatting error. If so, you'd want to check where these JSONs are being generated. Your options are either to correct it at the source, or to fix it manually by adding escape characters if this is a one-off problem. I strongly suggest the former.
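If the JSON is produced by Python code you control, one way to correct it at the source is to let json.dumps do the escaping instead of building the string by hand. A minimal sketch, where record is a hypothetical stand-in for one row from the database:

```python
import json

# Hypothetical row; the value contains literal double quotes (inch marks)
record = {"Attribute": " size", "Keywords": '20"x30"', "AttributeComments": None}

# json.dumps escapes the inner quotes as \" and writes None as null
encoded = json.dumps(record)
print(encoded)

# The encoded text parses back to the original value
decoded = json.loads(encoded)
print(decoded["Keywords"])
```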

How to remove characters from a string until a special character?

I have a string like this.
config =\
{"status":"None",
"numbers":["123", "123", "123"],
"schedule":None,
"data":{
"x": "y"
}
}
I would like to remove the config =\ from the string and get a result like this.
{"status":"None",
"numbers":["123", "123", "123"],
"schedule":None,
"data":{
"x": "y"
}
}
How can I get this using python regex? Would like to consider the multiline factor as well!!
I am using this method
re.sub(r'.*{"', '{"', script_config, flags=re.MULTILINE)
But the code considers each line separately. Also I would like to remove only the
You don't need a regexp for it:
string = string.replace('config =\\', '')
If the first word is not specified:
string = string[string.find('\\')+1:] if '\\' in string else string
But if you want to use regexps:
string = re.sub(r'^.*\\', '', string)
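A variant that does not depend on the backslash at all is to drop everything before the first {. The sketch below shows this both with and without a regex; it assumes the payload always starts at the first { and reuses the variable name script_config from the question.

```python
import re

script_config = (
    'config =\\\n'
    '{"status":"None",\n'
    '"numbers":["123", "123", "123"],\n'
    '"schedule":None,\n'
    '"data":{\n'
    '"x": "y"\n'
    '}\n'
    '}'
)

# Option 1: no regex, keep everything from the first "{" onwards
_, brace, rest = script_config.partition("{")
without_prefix = brace + rest

# Option 2: \A anchors at the start of the whole string (not of each line),
# and [^{]* also matches newlines, so the multi-line prefix goes in one step
with_regex = re.sub(r'\A[^{]*', '', script_config, count=1)

print(with_regex)
```

Unlike the pattern in the question, neither option removes anything past the first brace, so nested objects such as "data" are left intact.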

Python can't parse JSON with extra trailing comma

This code:
import json
s = '{ "key1": "value1", "key2": "value2", }'
json.loads(s)
produces this error in Python 2:
ValueError: Expecting property name: line 1 column 16 (char 15)
Similar result in Python 3:
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 16 (char 15)
If I remove that trailing comma (after "value2"), I get no error. But my code will process many different JSONs, so I can't do it manually. Is it possible to setup the parser to ignore such last commas?
Another option is to parse it as YAML; YAML accepts valid JSON but also accepts all sorts of variations.
import yaml
s = '{ "key1": "value1", "key2": "value2", }'
yaml.safe_load(s)
The JSON specification doesn't allow trailing commas. The parser throws because it encounters an invalid syntax token.
You might be interested in using a different parser for those files, e.g. a parser built for the JSON5 spec, which allows such syntax.
It could be that this data stream is JSON5, in which case there's a parser for that: https://pypi.org/project/json5/
This situation can be alleviated by a regex substitution that looks for ", } and replaces it with " }, allowing for any amount of whitespace between the quote, the comma and the closing curly brace.
>>> import re
>>> s = '{ "key1": "value1", "key2": "value2", }'
>>> re.sub(r"\"\s*,\s*\}", "\" }", s)
'{ "key1": "value1", "key2": "value2" }'
Giving:
>>> import json
>>> s2 = re.sub(r"\"\s*,\s*\}", "\" }", s)
>>> json.loads(s2)
{'key1': 'value1', 'key2': 'value2'}
EDIT: as commented, this is not a good practice unless you are confident your JSON data contains only simple words, and this change is not corrupting the data-stream further. As I commented on the OP, the best course of action is to repair the up-stream data source. But sometimes that's not possible.
I wrote a regex that finds and removes every comma followed by ] or } in the JSON, while skipping the ones inside strings.
It seems to work well and quickly.
import re, json
s = r'''
[
123, true, false, null,
{
"\n\\\",]\\": "\n\\\",]\\",
"\n\\\",}\\": "\n\\\",}\\",
},
]
'''
r = json.loads(re.sub(r'("(?:\\?.)*?")|,\s*([]}])', r'\1\2', s))
print(r) # [123, True, False, None, {'\n\\",]\\': '\n\\",]\\', '\n\\",}\\': '\n\\",}\\'}]
That's because an extra , is invalid according to the JSON standard:
An object is an unordered set of name/value pairs. An object begins
with { (left brace) and ends with } (right brace). Each name is
followed by : (colon) and the name/value pairs are separated by ,
(comma).
If you really need this, you could wrap Python's json parser with jsoncomment. But I would try to fix the JSON at its origin.
I suspect it doesn't parse because "it's not json", but you could pre-process the strings, using a regular expression to replace , } with } and , ] with ]
How about using the following regex?
s = re.sub(r",\s*}", "}", s)
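The caveat raised in this thread, that a naive substitution can corrupt the data, can be seen in a small sketch. The input string is made up for illustration; the second pattern is the string-aware regex from the answer above, with the character class written as [\]}].

```python
import json
import re

# Made-up input: one string value happens to contain ", }"
s = '{"note": "a, }", "key": "value", }'

# Naive: removes every comma before }, even inside string values
naive = re.sub(r',\s*}', '}', s)

# String-aware: the first alternative consumes whole string literals and puts
# them back unchanged (group 1); only commas outside strings reach group 2.
# An unmatched group expands to an empty string in Python 3.5 and later.
STRING_OR_TRAILING_COMMA = re.compile(r'("(?:\\?.)*?")|,\s*([\]}])')
careful = STRING_OR_TRAILING_COMMA.sub(r'\1\2', s)

print(json.loads(naive)["note"])    # value was silently altered
print(json.loads(careful)["note"])  # value preserved
```

Both results parse as JSON, which is what makes the naive version risky: the corruption goes unnoticed unless the values are checked.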

Regular expression to check if IP is found on 2 ranges

Is it possible to write a regular expression as one expression to check if an IP is found on 2 ranges?
I can do this in 2 steps:
if ($ip =~ /$range1/ and $ip =~ /$range2/ ) {
print "intersection"
}
but I wonder if it's possible to do this in one regex:
if ($ip =~ /$my_regex/ ) {
print "intersection";
}
You can use the Module NetAddr::IP:
use strict;
use warnings;
use NetAddr::IP;
my @addresses = (
NetAddr::IP->new('192.168.172.1/255.255.0.0'),
NetAddr::IP->new('10.1.0.0/255.0.0.0'),
);
my $address_to_check = NetAddr::IP->new($IP_TO_CHECK);
foreach my $address_in_list (@addresses) {
if ($address_to_check->within($address_in_list)) {
# do something
}
}
Below is a solution in Perl.
Why not use NetAddr::IP and let it handle this? For example:
#!/usr/bin/perl
use strict;
use warnings;
use NetAddr::IP;
my @addresses = (
NetAddr::IP->new('216.239.32.0/255.255.32.0'),
NetAddr::IP->new('64.157.227.255/255.255.252.0')
);
my $banned = 0;
my $visitor_address = NetAddr::IP->new($visitor_ip);
foreach my $banned_address (@addresses) {
if ($visitor_address->within($banned_address)) {
$banned = 1;
last;
}
}
Read the documentation and available methods at: https://metacpan.org/pod/NetAddr::IP
Yes, it is possible to join two independent subexpressions into a single regex using lookahead assertions:
if ($ip =~ /^(?=.*$range1)(?=.*$range2)/s ) {
print "intersection"
}
However, if you really are dealing with IP addresses, you should use a module like NetAddr::IP.
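For comparison, the same "let a library handle it" idea can be sketched in Python with the standard-library ipaddress module. This swaps Perl's NetAddr::IP for a Python equivalent and is not part of the original answers; the function name in_all_ranges and the second range are assumptions made for illustration.

```python
import ipaddress

def in_all_ranges(ip, ranges):
    """Return True only if ip lies inside every one of the given networks."""
    address = ipaddress.ip_address(ip)
    # strict=False accepts a host address with a mask, e.g. 192.168.172.1/255.255.0.0
    return all(address in ipaddress.ip_network(r, strict=False) for r in ranges)

# First range is taken from the answer above; the second is hypothetical
ranges = ["192.168.172.1/255.255.0.0", "192.168.172.0/24"]

print(in_all_ranges("192.168.172.40", ranges))  # inside both ranges
print(in_all_ranges("192.168.1.40", ranges))    # inside the first range only
```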
