I have a long text and a list of dicts holding offsets into that text. I want to insert some strings at these offsets. If I do this in a loop, every insertion shifts the later offsets, so I have to recalculate them each time, which I find very confusing. Is there a way to insert different strings at different offsets in a single pass?
My sample data:
main_str = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'
My indexes list:
indexes_list = [
{
"type": "first_type",
"endOffset": 5,
"startOffset": 0,
},
{
"type": "second_type",
"endOffset": 22,
"startOffset": 16,
}
]
My main purpose: I want to wrap the given ranges in <span> tags with color styles based on their types, and then render the result directly in a template. Do you have another suggestion?
For example, I want to create the following from the variables main_str and indexes_list above (please ignore the color part of the styles; I provide it dynamically from the value of type in indexes_list):
new_str = '<span style="color:#FFFFFF">Lorem</span> Ipsum is <span style="color:#FFFFFF">simply</span> dummy text of the printing and typesetting industry.'
Build a new string so that main_str is left unchanged:
main_str = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'
indexes_list = [
{
"type": "first_type",
"startOffset": 0,
"endOffset": 5,
},
{
"type": "second_type",
"startOffset": 16,
"endOffset": 22,
}
]
new_str = ""
index = 0
for i in indexes_list:
    start = i["startOffset"]
    end = i["endOffset"]
    new_str += main_str[index: start] + "<span>" + main_str[start:end] + "</span>"
    index = end
new_str += main_str[index:]
print(new_str)
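Since the stated goal is a colored <span> per type, the same single-pass loop can be sketched with a color lookup. The TYPE_COLORS mapping, its color values and the name wrap_spans are made up for illustration. The sketch also assumes the ranges do not overlap, and it escapes the text with html.escape because the result is rendered directly in a template:

```python
from html import escape

main_str = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'
indexes_list = [
    {"type": "first_type", "startOffset": 0, "endOffset": 5},
    {"type": "second_type", "startOffset": 16, "endOffset": 22},
]

# Hypothetical mapping from type to color; the real values come from your application.
TYPE_COLORS = {"first_type": "#FF0000", "second_type": "#0000FF"}


def wrap_spans(text, spans, colors):
    """Wrap each (non-overlapping) range in a colored span in one pass."""
    parts = []
    last = 0
    for span in sorted(spans, key=lambda s: s["startOffset"]):
        start, end = span["startOffset"], span["endOffset"]
        parts.append(escape(text[last:start]))  # untouched text before the span
        color = colors.get(span["type"], "inherit")
        parts.append('<span style="color:{}">{}</span>'.format(color, escape(text[start:end])))
        last = end
    parts.append(escape(text[last:]))  # remaining tail
    return "".join(parts)


new_str = wrap_spans(main_str, indexes_list, TYPE_COLORS)
print(new_str)
```

Collecting the pieces in a list and joining once avoids rebuilding the string on every iteration.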
Here is a solution without any explicit for statements, although the list comprehensions still loop under the hood.
# Get all the indices and label them as starts or ends.
starts = [(o['startOffset'], True) for o in indexes_list]
ends = [(o['endOffset'], False) for o in indexes_list]
# Add both ends of the string so that the text before the first span
# and after the last span is not dropped.
bounds = [(0, False), (len(main_str), False)]
# Sort everything...
all_indices = sorted(starts + ends + bounds)
# ...so it is possible to zip together adjacent pairs and extract substrings.
pieces = [
    (s[1], main_str[s[0]:e[0]])
    for s, e in zip(all_indices, all_indices[1:])
]
# And then join all the pieces together with a bit of conditional formatting.
formatted = ''.join([
    f"<span>{part}</span>" if is_start else part
    for is_start, part in pieces
])
formatted
# '<span>Lorem</span> Ipsum is s<span>imply </span>dummy text of the printing and typesetting industry.'
Also, although you said you do not want for loops, it is important to note that you do not have to do any index modification if you do the updates in reverse order.
def update_str(s, spans):
    for lookup in sorted(spans, reverse=True, key=lambda o: o['startOffset']):
        start = lookup['startOffset']
        end = lookup['endOffset']
        before, span, after = s[:start], s[start:end], s[end:]
        s = f'{before}<span>{span}</span>{after}'
    return s
update_str(main_str, indexes_list)
# '<span>Lorem</span> Ipsum is s<span>imply </span>dummy text of the printing and typesetting industry.'
The unvisited insertion indices won't change if you iterate backwards. This is true for all such problems. It sometimes even lets you modify sequences during iteration if you're careful (not that I'd ever recommend it).
You can find all insertion points from the dict, sort them backwards, and then do the insertion. For example:
items = ['<span ...>', '</span>']
keys = ['startOffset', 'endOffset']
insertion_points = [(d[key], item) for d in indexes_list for key, item in zip(keys, items)]
insertion_points.sort(reverse=True)
for index, content in insertion_points:
    main_str = main_str[:index] + content + main_str[index:]
The reason not to do that is that it's inefficient. For reasonable sized text that's not a huge problem, but keep in mind that you are chopping up and reallocating an ever increasing string with each step.
A much more efficient approach would be to chop up the entire string once at all the insertion points. Adding list elements at the right places with the right content would be much cheaper that way, and you would only have to rejoin the whole thing once:
items = ['<span ...>', '</span>']
keys = ['startOffset', 'endOffset']
insertion_points = [(d[key], item) for d in indexes_list for key, item in zip(keys, items)]
insertion_points.sort()
last = 0
chopped_str = []
for index, content in insertion_points:
    chopped_str.append(main_str[last:index])
    chopped_str.append(content)
    last = index
# Append the remaining tail (a call, not a subscript).
chopped_str.append(main_str[last:])
main_str = ''.join(chopped_str)
Related
Take this example:
{
    "something": {
        "random": 0,
        "bag": {
            "papers": 0,
            "pencils": 0
        },
        "PAINT": {
            "COLORS": [
                "A WHITE",
                "B MAPLE",
                "B LOTUS",
                "A OLIVE"
            ],
            "CANS": [
                "SOMETHING"
            ]
        }
    }
}
Ignore everything and focus on the COLORS list in the PAINT dictionary... I want to print all colors that have the color A before them, as a code. In other words I want to print "A WHITE" and "A OLIVE". Here's what happens when I do this:
import json

with open("somethings.json", "r") as f:
    data = json.load(f)
print(data["something"]["PAINT"]["COLORS"])
This is the output:
["A WHITE", "B MAPLE", "B LOTUS", "A OLIVE"]
but like I said, I do not want that... I want only A colors to be printed...
I also do not want THIS:
["A WHITE", "A OLIVE"]
the output that I really want (which is quite specific) is this:
OLIVE
WHITE
With line breaks (optionally also in alphabetical order): that is the output I want. So how can I print this output? Is it possible without using any 'for' loops? This is a very specific question; I would appreciate some help. Thanks.
Try this code:
import json

with open("somethings.json", "r") as f:
    data = json.load(f)
a_colors = [color for color in data["something"]["PAINT"]["COLORS"] if color.startswith("A ")]
colors = [a_color.replace("A ", "") for a_color in a_colors]
print(colors)
How it works
1. Opens and loads the JSON data.
2. Uses a list comprehension to keep only the entries that start with "A ". The .startswith() method of a string returns True if the string begins with the characters passed as an argument, and False otherwise.
3. Uses another list comprehension to get each string from step 2 without the "A ". Replacing "A " with an empty string via the .replace() method is a hacky way of deleting part of a string.
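A small sketch of why replace() is called hacky here: it removes every occurrence of "A ", not just the leading one, whereas slicing off the prefix only touches the start. The "A MAGENTA PINK" entry is invented for the demonstration and is not in the original data:

```python
colors = ["A WHITE", "B MAPLE", "B LOTUS", "A OLIVE", "A MAGENTA PINK"]  # last entry is hypothetical
prefix = "A "

# replace() also removes the "A " in the middle of "MAGENTA PINK"
replaced = [c.replace(prefix, "") for c in colors if c.startswith(prefix)]

# slicing only drops the leading prefix; sorted() gives alphabetical order
sliced = sorted(c[len(prefix):] for c in colors if c.startswith(prefix))

print(replaced)
print("\n".join(sliced))  # one color per line, no explicit for statement
```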
It can be done without list comprehensions using a for loop as well
See code below:
import json

with open("somethings.json", "r") as f:
    data = json.load(f)
a_colors = []
for color in data["something"]["PAINT"]["COLORS"]:
    if color.startswith("A "):
        color_without_a = color.replace("A ", "")
        a_colors.append(color_without_a)
print(a_colors)
This solution uses a for loop rather than a list comprehension but is otherwise the same. (If you are confused, see below for a solution which is an exact replica of the list comprehension one but implemented with for loops).
If you are interested, here is a lengthier solution more similar to the list comprehension one, using for loops:
import json

with open("somethings.json", "r") as f:
    data = json.load(f)
a_colors = []
for color in data["something"]["PAINT"]["COLORS"]:
    if color.startswith("A "):
        a_colors.append(color)
colors = []
for a_color in a_colors:
    colors.append(a_color.replace("A ", ""))
print(colors)
To sort alphabetically, use the sorted() function, like this for the list comprehension solution and the second for loop solution:
sorted_list = sorted(colors)
print(sorted_list)
For the first for loop solution:
sorted_list = sorted(a_colors)
print(sorted_list)
Recommended reading
Python Data Structures documentation
Examples of list comprehensions for practice
Beginner's list comprehension tutorial
Filtering lists in Python
Other helpful resources
List slicing
Sorting Lists
I strongly recommend watching this video as well:
Python Tutorial for Beginners 7: Loops and Iterations - For/While Loops
Well, you can't really avoid a loop: you need to iterate over all elements in your COLORS array.
So, what you want to do is:
Iterate over all elements
Check if the first character of each element (e.g. A WHITE) is the desired character (e.g. A)
either print the output directly or store it in list without the A (notice the space)
So:
import json

with open("somethings.json", "r") as f:
    data = json.load(f)
colors = data["something"]["PAINT"]["COLORS"]
best_colors = []
for color in colors:
    if color[0] == "A":  # or any character you want; CASE SENSITIVE!
        best_colors.append(color[2:])  # this adds the color without the A and the space to the list
# Optionally: Sort alphabetically
sorted_colors = sorted(best_colors)
Additional resources to help you to understand the code better:
List slicing
Sorting Lists
Based on Unix Doughnut's answer:
import json

# Read JSON File
with open("file_name.json", "r") as f:
    data = json.load(f)
# Sort the selected elements starting with A without the leading A
colors = sorted([color.replace("A ", "") for color in data["something"]["PAINT"]["COLORS"] if color.startswith("A ")])
# Print list elements separated by line break (without for loop)
print(*colors, sep='\n')
I have a .json file that looks like this:
{
    "aaa":"xxx",
    "bbb":"yyy"
},
{
    "ccc":"zzz",
    "ddd":"qqq"
}
I don’t know in advance how many sets of curly braces { } are in the file (items). Might be 0, 1 or any other natural number.
I also don't know whether the whole file is enclosed in square brackets [ ] or not.
I want to read it as a list of dictionaries, so I use the following:
try:
    pvt = json.loads(data)
except:
    pvt = json.loads("["+data+"]")
This solution works when the file is not enclosed in square brackets [ ] and has more than one item, or when it is enclosed in brackets with any number of items. The only case where it fails (i.e. the content is read as a dict instead of a list) is when there are no square brackets [ ] and only one item in the file.
Could you suggest a way to read my file as a list of dictionaries in every case?
Thank you!
What about this?
def fix_my_json(json_string):
    if len(json_string) == 0 or (json_string[0] != "[" and json_string[-1] != "]"):
        return "["+json_string+"]"
    return json_string
and then you can do
pvt = json.loads(fix_my_json(data))
When parsing, you could simply assume that the file content is not surrounded by brackets and flatten the resulting list if necessary:
pvt = json.loads("[" + data + "]")
if len(pvt) > 0 and isinstance(pvt[0], list):
    # if the file content was already a list, i.e. surrounded by brackets
    pvt = pvt[0]
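Another way to cover all the cases from the question is to decide after parsing instead of inspecting the raw text. This is only a sketch: the name load_as_list is made up, and it assumes the items in the file are JSON objects separated by commas:

```python
import json


def load_as_list(data):
    """Parse text that may or may not be wrapped in [ ] and always return a list."""
    text = data.strip()
    if not text:
        return []  # zero items
    try:
        parsed = json.loads(text)  # bracketed list, or exactly one object
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        parsed = json.loads("[" + text + "]")  # several objects without brackets
    # A single unbracketed object parses as a dict, so wrap it.
    return parsed if isinstance(parsed, list) else [parsed]


print(load_as_list('{"aaa": "xxx"}'))
print(load_as_list('{"aaa": "xxx"}, {"ccc": "zzz"}'))
print(load_as_list('[{"aaa": "xxx"}]'))
```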
So I'm making a program where it reads a text file and I need to separate all the info into their own variables. It looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title, the next bit that looks like this "A.41,52" are numbered positions in the sequence I need to save to use, and everything after that is an amino acid sequence. I know how to deal with the amino acid sequence, I just need to know how to separate the important numbers in the first line.
In the past when I just had a title and sequence I did something like this:
for line in nucfile:
    if line.startswith(">"):
        headerline=line.strip("\n")[1:]
    else:
        nucseq+=line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)
I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the sequence by a colon, you could first pass ":" as your separator. You could then split the remainder of the sequence to recover the numbered positions (e.g. A.41,52) using ";" as a separator.
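As a rough sketch of those two splits, using the header line from the question (the variable names are only illustrative):

```python
header = ">1EK9:A.41,52; B.61,74; C.247,257; D.279,289"

# First split on the colon: title on the left, positions on the right.
title, rest = header.lstrip(">").split(":", 1)

# Then split the remainder on semicolons and trim the surrounding spaces.
positions = [part.strip() for part in rest.split(";")]

print(title)
print(positions)
```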
I hope this helps!
I think what you are trying to do is extract certain parts of the sequence based on their identifiers given to you on the first line (the line starting with >).
This line contains your title, then a sequence name and the data range you need to extract.
Try this:
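sequence_pairs = {}
with open('somefile.txt') as f:
    header_line = next(f).strip()
    # Join the remaining lines into one string without line breaks.
    sequence = ''.join(line.strip() for line in f)
title, components = header_line[1:].split(':')
for pair in components.split(';'):
    name, positions = pair.strip().split('.')
    start, end = (int(p) for p in positions.split(','))
    # Treats the positions as 0-based with an inclusive end; adjust if yours are 1-based.
    sequence_pairs[name] = sequence[start:end + 1]
for name, data in sequence_pairs.items():
    print('{} - {}'.format(name, data))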
The other answer may well solve the assumed problem in its entirety, but the OP asked for pointers or an example of the typical split/join transform, which is so often useful. So here are some ideas and working code to show it, based on the example in the question.
So let us focus on the else branch below:
from __future__ import print_function
nuc_seq = []  # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
    for line in f.readlines():
        s_line = line.strip()  # this strips whitespace
        if line.startswith(title_token):
            headerline = line.strip("\n")[1:]
        else:
            nuc_seq.append(s_line)  # build list
# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
#  'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
#  ...
# ]
demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is a fast and widely used paradigm in Python programming (and in programming with powerful data types in general).
If the split/join method is still unclear, just ask, or search SO for the excellent answers to related questions.
Also note that there is no need for line.strip('\n'), as \n is considered whitespace just like ' ' (a string containing only a space) or a tab '\t'. For example:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears if there are at least two elements to join. In this case, split() removed all the whitespace and left no elements, so the result is the empty string.
Update:
As requested, a further analysis of the "coordinate part" of the line called headerline in the question:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have the complete line, without the leading '>', in the headerline string (as above):
title, coordinates = headerline.split(':')
# so now title is '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semicolons and trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'A', 'B', 'C', and 'D' are well-known dimensions, then you can "lose" the ordering information from the input file (you can always reinstate what you already know ;-) and map the coordinates as key: (ordered coordinate pair):
>>> coord_map = dict(
...     (a, tuple(int(k) for k in bc.split(',')))
...     for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In context of a micro program:
#! /usr/bin/env python
from __future__ import print_function
het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
coord_map = dict(
    (a, tuple(int(k) for k in bc.split(',')))
    for a, bc in (abc.split('.') for abc in het_seq))
print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
One could write this as an explicit nested for loop, but it is a late European evening, so the trick is to read it from the right:
1. For all elements of het_seq,
2. split on the dot and store the left part in a and the right part in bc,
3. then split bc further into a sequence of k's, convert each to an integer and put them into a tuple (an ordered pair of integer coordinates).
4. Arriving on the left, you build a tuple of a (the dimension, like 'A') and the coordinate tuple from step 3.
5. In the end, call the dict() function, which here constructs a dictionary from an iterable of (key, value) pairs, i.e. dict([(key_1, value_1), (key_2, value_2), ...]), which gives {key_1: value_1, ...}.
So all coordinates are integers, stored as ordered pairs in tuples.
I'd prefer tuples here, although split() generates lists, because:
You will keep those two coordinates as a pair, not extend or append to it.
In Python, mapping and remapping is performed often, and a hashable (that is, immutable) type is ready to become a key in a dict.
One last variant (with no nested comprehensions):
coord_map = {}
for abc in het_seq:
    a, bc = abc.split('.')
    coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same result as the slightly obnoxious "one-liner" above (which was already written across three lines kept together within parentheses).
HTH.
So I'm assuming you are trying to process a FASTA-like file. The way I would do it is to first get the header and separate the pieces with a regex. Following that, you can store the A.41,52, B.61,74, ... entries in a list for easy access. The code is as follows.
import re

def processHeader(line):
    positions = re.search(r':(.*)', line).group(1)
    positions = positions.split('; ')
    return positions

dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
    for line in infile:
        if '>' in line:
            positions = processHeader(line)
        else:
            dnaSeq += line.strip()
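If the numbers themselves are needed rather than strings like 'A.41,52', the regex can capture them directly. This is a sketch that assumes every entry has the form letter, dot, start, comma, end, as in the header from the question:

```python
import re

header = ">1EK9:A.41,52; B.61,74; C.247,257; D.279,289"

# Title: everything between ">" and the first colon.
title = re.match(r">([^:]+):", header).group(1)

# Each entry: one letter, a dot, then two integers separated by a comma.
coords = {
    name: (int(start), int(end))
    for name, start, end in re.findall(r"([A-Za-z])\.(\d+),(\d+)", header)
}

print(title)
print(coords)
```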
I am not sure I completely understand the goal (and I think this post is more suitable as a comment, but I do not have enough privileges), but I think that the key to your solution is using .split(). You can then join elements of the resulting list with +, like this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual following syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together:
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
Remember that + on lists produces a list.
By the way, I would not remove '\n' to begin with, as you can use it to extract the first line in the same way that the space is used above to extract "words".
UPDATE (starting from result):
from functools import reduce  # reduce is a builtin only in Python 2
# getting the A indexes
letter_seq = result[4:]  # result[4] is the first sequence line
ind = result[:4]
Aind = ind[0].split('.')[1].replace(';', '')
# getting one long letter seq
long_letter_seq = reduce(lambda x, y: x + y, letter_seq)
# extracting the final seq from long_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
The last line just combines several operations that were also used earlier.
The same goes for B, C, D, etc., so that is a lot of manual work and calculation...
BE CAREFUL with the indexes of A: numbering in Python starts from 0, which may not be the case in your numbering system.
A more elegant solution would be to use re (https://docs.python.org/2/library/re.html) to find the pattern using a mask, but this requires very well-defined rules for how to look up the sequence needed.
UPDATE2: it is also not clear to me what the role of the spaces is. So far I have removed them, but they may matter when counting the letters in the original string.
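To illustrate the indexing warning: if the positions in the header are 1-based and inclusive (an assumption; the question does not say), the conversion to a Python slice could look like this. The helper name is made up:

```python
def slice_one_based(seq, start, end):
    """Return seq[start..end], where start and end are 1-based and inclusive."""
    # Python slices are 0-based with an exclusive end, so only the start shifts by one.
    return seq[start - 1:end]


demo = "ABCDEFGHIJ"
print(slice_one_based(demo, 2, 4))  # positions 2, 3 and 4
print(demo[2:4])                    # plain 0-based slice, for comparison
```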
I have to strip whitespace for extracted strings, one string at a time for which I'm using split(). The split() function returns list after removing white spaces. I want to store this in my own dynamic list since I have to aggregate all of the strings.
The snippet of my code:
while rec_id != "ffff":  # assumption: "ffff" marks the last record; the original condition was ambiguous
    output = procs.run_cmd("get sensor info", command)
    sdr_li = []
    if output:
        byte_str = output[0]
        str_1 = byte_str.split(' ')
        for byte in str_1:
            sdr_li.append(byte)
    rec_id = get_rec_id()
Output = ['23 0a 06 01 52 2D 12']
str_1 = ['23','0a','06','01','52','2D','12']
This does not look very elegant, copying from one list to another. Is there another way to achieve this?
list.extend():
sdr_li.extend(str_1)
str.split() returns a list, so just add that list's items to the main list. Use extend: https://docs.python.org/2/tutorial/datastructures.html
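A minimal sketch of the difference, with two made-up outputs standing in for the results of procs.run_cmd:

```python
# Hypothetical command outputs, shaped like the Output example in the question.
outputs = [['23 0a 06 01'], ['52 2D 12']]

sdr_li = []
nested = []
for output in outputs:
    tokens = output[0].split()
    sdr_li.extend(tokens)   # adds each token to the aggregate list
    nested.append(tokens)   # append would add the whole list as one element

print(sdr_li)
print(nested)
```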
So, rewriting your code into something legible and properly indented, you'd get:
my_list = []
while rec_id != "ffff":  # assumption: loop until the end marker; the original condition was ambiguous
    output = procs.run_cmd("get sensor info", command)
    if output:
        result_string = output[0]
        # extend my_list with the list resulting from the whitespace
        # separated tokens of the output
        my_list.extend(result_string.split())
        pass  # end if
    rec_id = get_rec_id()
    ...
    pass  # end while
How can I convert the following line (I'm not sure what format this is) to JSON?
[root=Root [key1=value1, key2=value2, key3=Key3 [key3_1=value3_1, key3_2=value3_2, key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]
where Root, Key3, Key3_3 denote complex elements.
to
{
    "root": {
        "key1" : "value1",
        "key2" : "value2",
        "key3" : {
            "key3_1" : "value3_1",
            "key3_2" : "value3_2",
            "key3_3" : {
                "key3_3_1" : "value3_3_1"
            }
        },
        "key4" : "value4"
    }
}
I am looking for an approach, not a solution. If you are down-voting this question, please comment on why you are doing so.
Let x be a string with the above serialization.
First, let's replace the occurrences of Root, Key3 and Key3_3 with empty strings:
# String fragments like "root=Root [" need to be replaced by "root=[".
# To achieve this, we match the regex pattern r'\w+ [[]', i.e. a word followed by
# a single space and then "[" (the character class [[] matches a literal "[").
# This matches ALL instances in the input string where a word sits between "=" and " [",
# i.e. "Root [", "Key3 [" and "Key3_3 [" are all matched, as will any other example
# where the word is composed of letters, digits or underscores.
# We replace this fragment with "[" (which we will later replace with "{"),
# giving us the transformation "root=Root [" => "root=["
import re
o = re.compile(r'\w+ [[]')
y = o.sub('[', x)
Then, let's split the resulting string into words and non-words:
# Here we split the string into two lists, one containing adjacent tokens (nonwords)
# and the other containing the words
# The idea is to split / recombine the source string with quotes around all our words
w = re.compile(r'\W+')
nw = re.compile(r'\w+')
words = w.split(y)[1:-1] # ignore the end elements which are empty.
nonwords = nw.split(y) # list elements are contiguous non-word characters, i.e not a-Z_0-9
struct = '"{}"'.join(nonwords) # format structure of final output with quotes around the word's placeholder.
almost_there = struct.format(*words) # insert words into the string
And finally, replace the square brackets with curly ones, and = with ::
jeeson = almost_there.replace(']', '}').replace('=', ':').replace('[', '{')
# '{"root":{"key1":"value1", "key2":"value2", "key3":{"key3_1":"value3_1", "key3_2":"value3_2", "key3_3":{"key3_3_1":"value3_3_1"}}, "key4":"value4"}}'
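Put together, the three steps can be checked by feeding the result to json.loads. This is a sketch of the same regex approach on the string from the question. It relies on keys and values consisting only of letters, digits and underscores, so a value containing spaces or punctuation would break it:

```python
import json
import re

x = ('[root=Root [key1=value1, key2=value2, key3=Key3 [key3_1=value3_1, '
     'key3_2=value3_2, key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]')

# Step 1: drop the type names, e.g. "Root [" -> "[".
y = re.sub(r'\w+ \[', '[', x)

# Step 2: separate words from the punctuation between them and quote the words.
words = re.split(r'\W+', y)[1:-1]   # first and last elements are empty strings
nonwords = re.split(r'\w+', y)      # the runs of punctuation around the words
quoted = '"{}"'.join(nonwords).format(*words)

# Step 3: switch to JSON punctuation.
json_text = quoted.replace('[', '{').replace(']', '}').replace('=', ':')

parsed = json.loads(json_text)      # raises ValueError if the text is not valid JSON
print(json.dumps(parsed, indent=4))
```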
I had to spend around two hours on this, but I think I have something which should work for all the cases based on the format you provided. If not, I am sure it'll need only a minor change. Even though you asked only for the idea, since I coded it up anyway, here's the Python code.
import json

def to_json(cust_str):
    from_index = 0
    left_indices = []
    levels = {}
    level = 0
    for i, char in enumerate(cust_str):
        if char == '[':
            level += 1
            left_indices.append(i)
            if level in levels:
                levels[level] += 1
            else:
                levels[level] = 1
        elif char == ']':
            level -= 1
    level = max(levels.keys())
    value_stack = []
    while True:
        left_index = left_indices.pop()
        right_index = cust_str.find(']', left_index) + 1
        values = {}
        pairs = cust_str[left_index:right_index][1:-1].split(',')
        if levels[level] > 0:
            for pair in pairs:
                pair = pair.split('=')
                values[pair[0].strip()] = pair[1]
        else:
            level -= 1
            for pair in pairs:
                pair = pair.split('=')
                if pair[1][-1] == ' ':
                    values[pair[0].strip()] = value_stack.pop()
                else:
                    values[pair[0].strip()] = pair[1]
        value_stack.append(values)
        levels[level] -= 1
        cust_str = cust_str[:left_index] + cust_str[right_index:]
        if levels[1] == 0:
            return json.dumps(values)

if __name__ == '__main__':
    # Data in custom format
    cust_str = '[root=Root [key1=value1, key2=value2, key3=Key3 [key3_1=value3_1, key3_2=value3_2, key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]'
    # Data in JSON format
    json_str = to_json(cust_str)
    print(json_str)
The idea is that, we map the number of levels the dicts go to in the custom format and the number of values which are not strings corresponding to those levels. Along with that, we keep track of the indices of the [ character in the given string. We then start from the innermost dict representation by popping the stack containing the [ (left) indices and parse them. As each of them is parsed, we remove them from the string and continue. The rest you can probably read in the code.
I ran it for the data you gave and the result is as follows.
{
    "root":{
        "key2":"value2",
        "key3":{
            "key3_2":"value3_2",
            "key3_3":{
                "key3_3_1":"value3_3_1"
            },
            "key3_1":"value3_1"
        },
        "key1":"value1",
        "key4":"value4"
    }
}
Just to make sure it works for more general cases, I used this custom string.
[root=Root [key1=value1, key2=Key2 [key2_1=value2_1], key3=Key3 [key3_1=value3_1, key3_2=Key3_2 [key3_2_1=value3_2_1], key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]
And parsed it.
{
    "root":{
        "key2":{
            "key2_1":"value2_1"
        },
        "key3":{
            "key3_2":{
                "key3_2_1":"value3_2_1"
            },
            "key3_3":{
                "key3_3_1":"value3_3_1"
            },
            "key3_1":"value3_1"
        },
        "key1":"value1",
        "key4":"value4"
    }
}
Which, as far as I can see, is how it should be parsed. Also, remember, do not strip the values since the logic depends on the whitespace at the end of values which should have the dicts as values (if that makes any sense).
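For comparison, the same format can also be handled with a small recursive-descent parser, which avoids the index and whitespace bookkeeping. This is a different technique from the code above, sketched for Python 3 under the assumption that keys and values contain no spaces, commas, brackets or equals signs. The names TOKEN and parse_custom are made up:

```python
import json
import re

# A token is one of the structural characters [ ] , = or a run of other non-space characters.
TOKEN = re.compile(r"[\[\],=]|[^\[\],=\s]+")


def parse_custom(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_block():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError("expected '['")
        pos += 1
        result = {}
        while tokens[pos] != "]":
            key, equals, value = tokens[pos:pos + 3]
            if equals != "=":
                raise ValueError("expected '=' after key")
            pos += 3
            if tokens[pos] == "[":
                # "key=Name [ ... ]": the name is dropped and the block becomes the value
                value = parse_block()
            result[key] = value
            if tokens[pos] == ",":
                pos += 1
        pos += 1  # skip the closing "]"
        return result

    return parse_block()


cust_str = ('[root=Root [key1=value1, key2=value2, key3=Key3 [key3_1=value3_1, '
            'key3_2=value3_2, key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]')
parsed = parse_custom(cust_str)
print(json.dumps(parsed, indent=4))
```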