Parsing Erlang data to Python dictionary

I have an Erlang script from which I would like to extract some data and store it in a Python dictionary.
It is easy to parse the script to get a string like this:
{userdata,
 [{tags,
   [#dt{number=111},
    #mp{id='X23.W'}]},
  {log,
   'LG22'},
  {instruction,
   "String that can contain characters like -, _ or numbers"}
 ]
}.
Desired result:
userdata = {"tags": {"dt": {"number": 111}, "mp": {"id": "X23.W"}},
"log": "LG22",
"instruction": "String that can contain characters like -, _ or numbers"}
# "#" mark for data in "tags" is not required in this structure.
# Also value for "tags" can be any iterable structure: tuple, list or dictionary.
But I am not sure how to transfer this data into a Python dictionary. My first idea was to use json.loads, but that requires many modifications (putting words into quotation marks, replacing "," with ":", and more).
Moreover, the keys in userdata are not limited to a fixed pool. In this case there are 'tags', 'log' and 'instruction', but there can be many more, e.g. 'slogan', 'ids', etc.
Also, I am not sure about the order; I assume the keys can appear in any order.
My code (it does not work for id='X23.W', so I removed the '.' from the input):
import re
import json

in_ = """{userdata, [{tags, [#dt{number=111}, #mp{id='X23W'}]}, {log, 'LG22'}, {instruction, "String that can contain characters like -, _ or numbers"}]}"""
buff = in_.replace("{userdata, [", "")[:-2]
re_helper = re.compile(r"(#\w+)")
buff = re_helper.sub(r'\1:', buff)
partition = buff.partition("instruction")
section_to_replace = partition[0]
replacer = re.compile(r"(\w+)")
match = replacer.sub(r'"\1"', section_to_replace)
buff = ''.join([match, '"instruction"', partition[2]])
buff = buff.replace("#", "")
buff = buff.replace('",', '":')
buff = buff.replace("}, {", "}, \n{")
buff = buff.replace("=", ":")
buff = buff.replace("'", "")
temp = buff.split("\n")
userdata = {}
buff = temp[0][:-2]
buff = buff.replace("[", "{")
buff = buff.replace("]", "}")
userdata.update(json.loads(buff))
for i, v in enumerate(temp[1:]):
    v = v.strip()
    if v.endswith(","):
        v = v[:-1]
    userdata.update(json.loads(v))
print(userdata)
Output:
{'tags': {'dt': {'number': '111'}, 'mp': {'id': 'X23W'}}, 'instruction': 'String that can contain characters like -, _ or numbers', 'log': 'LG22'}

import json
import re

in_ = """{userdata, [{tags, [#dt{number=111}, #mp{id='X23.W'}]}, {log, 'LG22'}, {instruction, "String that can contain characters like -, _ or numbers"}]}"""
quoted_headers = re.sub(r"\{(\w+),", r'{"\1":', in_)
changed_hashed_list_to_dict = re.sub(r"\[(#.*?)\]", r'{\1}', quoted_headers)
hashed_variables = re.sub(r'#(\w+)', r'"\1":', changed_hashed_list_to_dict)
equality_signs_replaced_and_quoted = re.sub(r'{(\w+)=', r'{"\1":', hashed_variables)
replace_single_quotes = equality_signs_replaced_and_quoted.replace('\'', '"')
result = json.loads(replace_single_quotes)
print(result)
Produces:
{'userdata': [{'tags': {'dt': {'number': 111}, 'mp': {'id': 'X23.W'}}}, {'log': 'LG22'}, {'instruction': 'String that can contain characters like -, _ or numbers'}]}
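Note that this keeps userdata as a list of one-key dicts rather than the flat mapping shown in the desired result. If you want the flat form, a small merge step works; this is a sketch that starts from the parsed result literal above:

```python
result = {'userdata': [{'tags': {'dt': {'number': 111}, 'mp': {'id': 'X23.W'}}},
                       {'log': 'LG22'},
                       {'instruction': 'String that can contain characters like -, _ or numbers'}]}

# Merge the list of single-key dicts into one flat dict.
userdata = {}
for entry in result['userdata']:
    userdata.update(entry)

print(userdata)
```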

Related

convert string which contains sub string to dictionary

I am trying to convert particular strings, which are in a particular format, to a Python dictionary.
The string format is like below:
st1 = 'key1 key2=value2 key3="key3.1, key3.2=value3.2 , key3.3 = value3.3, key3.4" key4'
I want to parse it and convert it to a dictionary as below:
dict1 = {
    key1: None,
    key2: value2,
    key3: {
        key3.1: None,
        key3.2: value3.2,
        key3.3: value3.3,
        key3.4: None
    },
    key4: None,
}
I tried to use the Python re package and the string split function, but was not able to achieve the result. I have thousands of strings in the same format and am trying to automate this. Could someone help?
If all your strings are consistent and only have one layer of sub-dict, the code below should do the trick; you may need to make tweaks/changes to it.
import json

st1 = 'key1 key2=item2 key3="key3.1, key3.2=item3.2 , key3.3 = item3.3, key3.4" key4'
st1 = st1.replace(' = ', '=')
st1 = st1.replace(' ,', ',')

new_dict = {}
no_keys = False
while not no_keys:
    st1 = st1.lstrip()
    if " " in st1:
        item = st1.split(" ")[0]
    else:
        item = st1
    if '=' in item:
        if '="' in item:
            item = item.split('=')[0]
            new_dict[item] = {}
            st1 = st1.replace(f'{item}=', '')
            sub_items = st1.split('"')[1]
            sub_values = sub_items.split(',')
            for sub_item in sub_values:
                if "=" in sub_item:
                    sub_key, sub_value = sub_item.split('=')
                    new_dict[item].update({sub_key.strip(): sub_value.strip()})
                else:
                    new_dict[item].update({sub_item.strip(): None})
            st1 = st1.replace(f'"{sub_items}"', '')
        else:
            key, value = item.split('=')
            new_dict.update({key: value})
            st1 = st1.replace(f"{item} ", "")
    else:
        new_dict.update({item: None})
        st1 = st1.replace(f"{item}", "")
    if st1 == "":
        no_keys = True

print(json.dumps(new_dict, indent=4))
Consider using a parsing tool like lark. A simple example for your case:
_grammar = r'''
?start: value
?value: object
| NON_SEPARATOR_STRING?
object : "\"" [pair (_SEPARATOR pair)*] "\""
pair : NON_SEPARATOR_STRING [_PAIRTOR] value
NON_SEPARATOR_STRING: /[a-zA-Z0-9\.]+/
_SEPARATOR: /[, ]+/
| ","
_PAIRTOR: " = "
| "="
'''
from lark import Lark

parser = Lark(_grammar)
st1 = 'key1 key2=value2 key3="key3.1, key3.2=value3.2 , key3.3 = value3.3, key3.4" key4'
tree = parser.parse(f'"{st1}"')
print(tree.pretty())
"""
object
pair
key1
value
pair
key2
value2
pair
key3
object
pair
key3.1
value
pair
key3.2
value3.2
pair
key3.3
value3.3
pair
key3.4
value
pair
key4
value
"""
Then you can write your own Transformer to transform this tree into your desired data type.

How to merge common strings with different values between parenthesis in Python

I am processing some strings within lists that look like these:
['COLOR INCLUDES (40)', 'LONG_DESCRIPTION CONTAINS ("BLACK")', 'COLOR INCLUDES (38)']
['COLOR INCLUDES (30,31,32,33,56,74,84,85,93,99,184,800,823,830,833,838,839)', 'COLOR INCLUDES (30,31,32,33,56,74,84,85,93,99,184,409,800,823,830,833,838,839)', 'COLOR INCLUDES (800)']
Thing is, I want to merge similar strings with their values into one, for each list. Expecting something like this:
['COLOR INCLUDES (40,38)', 'LONG_DESCRIPTION CONTAINS ("BLACK")']
['COLOR INCLUDES (30,31,32,33,56,74,84,85,93,99,184,409,800,823,830,833,838,839)']
And some strings may have values without ():
['FAMILY EQUALS 1145']
What could be the more pythonic and fastest (lazy :P) way of doing this?
I have tried using regex to match strings until a "(" appears, but some strings don't have values between (), and can't find a fitting solution.
I have also tried the STree class from the suffix_trees lib, which finds the longest common substring in a list of strings, but then I ran out of ideas about handling the values and the closing parenthesis:
from suffix_trees import STree

st = STree.STree(['COLOR INCLUDES(30,31,32,33,56,74,84,85,93,99,184,800,823,830,833,838,839)',
                  'COLOR INCLUDES(30,31,32,33,56,74,84,85,93,99,184,409,800,823,830,833,838,839)',
                  'COLOR INCLUDES (800)'])
st.lcs()
# out: 'COLOR INCLUDES ('
EDIT: SOLVED
As @Stef said in the answer, I broke the problem into smaller pieces and solved it with his help. Let me paste here the class Rule_process and the result:
import re

class Rule_process:
    def __init__(self):
        self.rules = '(COLOR INCLUDES (40)) OR (LONG_DESCRIPTION CONTAINS ("BLACK")):1|||COLOR INCLUDES (30,31,32,33,56,74,84,85,93,99,184,800,823,830,833,838,839):0|||COLOR INCLUDES (30,31,32,33,56,74,84,85,93,99,184,409,800,823,830,833,838,839):0|||COLOR INCLUDES (40):1|||COLOR INCLUDES (800):0'
        self.rules_dict = {
            0: None,
            1: None,
            2: None,
            4: None,
        }

    def append_rules(self):
        rules = self.rules.split("|||")
        values_0 = []
        values_1 = []
        values_2 = []
        values_4 = []
        for rule in range(len(rules)):
            if rules[rule][-1] == '0':
                rules[rule] = rules[rule][:-2]
                # self.rules_dict[0].append(rules[rule])
                values_0.append(rules[rule])
            elif rules[rule][-1] == '1':
                rules[rule] = rules[rule][:-2]
                # self.rules_dict[1].append(rules[rule])
                values_1.append(rules[rule])
            elif rules[rule][-1] == '2':
                rules[rule] = rules[rule][:-2]
                # self.rules_dict[2].append(rules[rule])
                values_2.append(rules[rule])
            elif rules[rule][-1] == '4':
                rules[rule] = rules[rule][:-2]
                # self.rules_dict[4].append(rules[rule])
                values_4.append(rules[rule])
        if values_0 != []:
            self.rules_dict[0] = values_0
        if values_1 != []:
            self.rules_dict[1] = values_1
        if values_2 != []:
            self.rules_dict[2] = values_2
        if values_4 != []:
            self.rules_dict[4] = values_4
        regex = r'^\('
        # for rules in self.rules_dict.values():
        for key in self.rules_dict.keys():
            if self.rules_dict[key] is not None:
                for rule in range(len(self.rules_dict[key])):
                    new_rule = self.rules_dict[key][rule].split(' OR ')
                    if len(new_rule) > 1:
                        joined_rule = []
                        for r in new_rule:
                            r = r.replace("))", ")")
                            r = re.sub(regex, "", r)
                            joined_rule.append(r)
                        self.rules_dict[key].remove(self.rules_dict[key][rule])
                        self.rules_dict[key].extend(joined_rule)
                        self.rules_dict[key] = list(set(self.rules_dict[key]))
                    else:
                        new_rule = [r.replace("))", ")") for r in new_rule]
                        new_rule = [re.sub(regex, "", r) for r in new_rule]
                        new_rule = ", ".join(new_rule)
                        self.rules_dict[key][rule] = new_rule
                        self.rules_dict[key] = list(set(self.rules_dict[key]))
        return self.rules_dict

    def split_rule(self):
        # COLOR INCLUDES (30,31,32,33) -> name = 'COLOR INCLUDES', values = [30,31,32,33]
        # LONG_DESCRIPTION CONTAINS ("BLACK") -> name = 'LONG_DESCRIPTION CONTAINS', values = '"BLACK"'
        new_dict = {
            0: None,
            1: None,
            2: None,
            4: None,
        }
        for key in self.rules_dict.keys():
            pql_dict = {}
            if self.rules_dict[key] is not None:
                for rule in range(len(self.rules_dict[key])):
                    # self.rules_dict[key][rule] -> COLOR INCLUDES (30,31,32,33,...)
                    rule = self.rules_dict[key][rule]
                    name = rule.rsplit(maxsplit=1)[0]  # -> COLOR INCLUDES
                    values_as_str = rule.rsplit(maxsplit=1)[1].replace("(", "")
                    values_as_str = values_as_str.replace(")", "")  # -> 30,31,32,33,...
                    try:
                        values = list(map(int, values_as_str.split(",")))  # [30, 31, 32, 33, ...]
                    except ValueError:
                        values = values_as_str  # '"BLACK"'
                    if name in pql_dict.keys():
                        pql_dict[name] = pql_dict[name] + values
                        pql_dict[name] = list(set(pql_dict[name]))
                    else:
                        pql_dict.setdefault(name, values)
                # pql_dict = {'COLOR INCLUDES': [32, 33, 800, 99, 833, 838, 839, ...]}
                for name in pql_dict.keys():
                    values = pql_dict[name]
                    joined_rule = name + " " + str(values)
                    if new_dict[key] is not None:
                        new_dict[key] = new_dict[key] + [joined_rule]
                    else:
                        new_dict[key] = [joined_rule]
        self.rules_dict = new_dict
And the result:
process = Rule_process()
process.append_rules()
process.split_rule()
process.rules_dict
OUT:
{0: ['COLOR INCLUDES [32, 33, 800, 99, 833, 838, 839, 74, 84, 85, 30, 823, 184, 409, 56, 93, 830, 31]'],
1: ['COLOR INCLUDES [40]', 'LONG_DESCRIPTION CONTAINS "BLACK"'],
2: None,
4: None}
Split this task into smaller, simpler tasks.
First task:
Write a function that takes a string and returns a pair (name, list_of_values) where name is the first part of the string and list_of_values is a python list of integers.
Hint: You can use '(' in s to test whether string s contains an opening parenthesis; you can use s.split() to split on whitespace or s.rsplit(maxsplit=1) to only split on the last whitespace; s.split('(') to split on opening parenthesis; and s.split(',') to split on comma.
Second task:
Write a function that takes a list of pairs (name, list_of_values) and merges the lists when the names are equal.
Hint: This is extremely easy in python using a dict with name as key and list_of_values as value. You can use if name in d: ... else: to test whether a name is already in the dict or not; or you can use d.get(name, []) or d.setdefault(name, []) to automatically add a name: [] entry in the dict when name is not already in the dict.
Third task:
Write a function to convert back, from the pairs (name, list_of_values) to the strings "name (value1, value2, ...)". This task is easier than the first task, so I suggest doing it first.
Hint: ' '.join(...) and ','.join(...) can both be useful.
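Following that plan, the three tasks could be sketched as below. The function names are my own, and as a small extension beyond the hints, non-numeric values such as '"BLACK"' are kept as strings instead of integers:

```python
def parse_rule(s):
    """Task 1: 'COLOR INCLUDES (40,38)' -> ('COLOR INCLUDES', [40, 38])."""
    if '(' in s:
        name, _, rest = s.partition('(')
        raw = rest.rstrip(')').split(',')
    else:
        # No parentheses, e.g. 'FAMILY EQUALS 1145'.
        name, _, value = s.rpartition(' ')
        raw = [value]
    values = []
    for v in raw:
        v = v.strip()
        try:
            values.append(int(v))
        except ValueError:
            values.append(v)  # non-numeric values like '"BLACK"' stay strings
    return name.strip(), values

def merge_rules(pairs):
    """Task 2: merge value lists for equal names, dropping duplicates."""
    d = {}
    for name, values in pairs:
        bucket = d.setdefault(name, [])
        for v in values:
            if v not in bucket:
                bucket.append(v)
    return d

def format_rule(name, values):
    """Task 3: ('COLOR INCLUDES', [40, 38]) -> 'COLOR INCLUDES (40,38)'."""
    return '{} ({})'.format(name, ','.join(str(v) for v in values))

rules = ['COLOR INCLUDES (40)', 'LONG_DESCRIPTION CONTAINS ("BLACK")', 'COLOR INCLUDES (38)']
merged = merge_rules(parse_rule(r) for r in rules)
print([format_rule(n, v) for n, v in merged.items()])
```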

Converting a list of dictionaries into RDF format

Goal (automation): when there is a large list of dictionaries, I want to generate a specific format of data.
this is the input:
a = [{'et2': 'OBJ Type',
      'e2': 'OBJ',
      'rel': 'rel',
      'et1': 'SUJ Type',
      'e1': 'SUJ'},
     {'et2': 'OBJ Type 2',
      'e2': 'OBJ',
      'rel': 'rel',
      'et1': 'SUJ Type',
      'e1': 'SUJ'}
    ]
The expected output is this :
:Sub a :SubType.
:Sub :rel "Obj".
This is what I have tried:
Sub = 0
for i in a:
    entity_type1 = i["EntityType1"]
    entity1 = i["Entity1"]
    entity_type2 = i["EntityType2"]
    entity2 = i["Entity2"]
    relation = i["Relation"]
    if 'Sub' in entity_type1 or entity_type2:
        if entity1 == Sub and Sub <= 0:
            Sub += 1
            sd_line1 = ""
            sd_line2 = ""
    sd_line1 = ":" + entity1 + " a " + ":" + entity_type1 + "."
    relation = ":" + relation
    sd_line2 = "\n" + ":" + entity1 + " " + relation + " \"" + entity2 + "\"."
    sd_line3 = sd_line1 + sd_line2
    print(sd_line3)
A bit of advice: when doing such a transformation workflow, try to separate the major steps, e.g.: loading from a system, parsing data in one format, extracting, transforming, serializing to another format, loading to another system.
In your code example, you are mixing the extraction, transformation and serialization steps. Separating those steps will make your code easier to read and, thus, easier to maintain or reuse.
Below, I give you two solutions: the first is extracting data to a simple dict-based subject-predicate-object graph, the second one to a real RDF graph.
In both cases, you'll see that I separated the extraction/transformation steps (that returns a graph) and serialization steps (that uses the graph), making them more reusable:
the dict-based transformation is implemented with a simple dict or with a defaultdict. The serialization step is common to both.
the rdflib.Graph-based transformation is common to two serializations: one to your format, the other one to any available rdflib.Graph serializations.
This will build a simple dict-based graph from your a dictionary:
graph = {}
for e in a:
    subj = e["Entity1"]
    graph[subj] = {}
    # :Entity1 a :EntityType1.
    obj = e["EntityType1"]
    graph[subj]["a"] = obj
    # :Entity1 :Relation "Entity2".
    pred, obj = e["Relation"], e["Entity2"]
    graph[subj][pred] = obj

print(graph)
like this:
{'X450-G2': {'a': 'switch',
             'hasFeatures': 'Role-Based Policy',
             'hasLocation': 'WallJack'},
 'ers 3600': {'a': 'switch',
              'hasFeatures': 'ExtremeXOS'},
 'slx 9540': {'a': 'router',
              'hasFeatures': 'ExtremeXOS',
              'hasLocation': 'Chasis'}}
Or, in a shorter form, with a defaultdict:
from collections import defaultdict

graph = defaultdict(dict)
for e in a:
    subj = e["Entity1"]
    # :Entity1 a :EntityType1.
    graph[subj]["a"] = e["EntityType1"]
    # :Entity1 :Relation "Entity2".
    graph[subj][e["Relation"]] = e["Entity2"]

print(graph)
And this will print your subject predicate object. triples from the graph:
def normalize(text):
    return text.replace(' ', '')

for subj, po in graph.items():
    subj = normalize(subj)
    # :Entity1 a :EntityType1.
    print(':{} a :{}.'.format(subj, po.pop("a")))
    for pred, obj in po.items():
        # :Entity1 :Relation "Entity2".
        print(':{} :{} "{}".'.format(subj, pred, obj))
    print()
like this:
:X450-G2 a :switch.
:X450-G2 :hasFeatures "Role-Based Policy".
:X450-G2 :hasLocation "WallJack".
:ers3600 a :switch.
:ers3600 :hasFeatures "ExtremeXOS".
:slx9540 a :router.
:slx9540 :hasFeatures "ExtremeXOS".
:slx9540 :hasLocation "Chasis".
This will build a real RDF graph using the rdflib library:
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDF

A = RDF.type

graph = Graph()
for d in a:
    subj = URIRef(normalize(d["Entity1"]))
    # :Entity1 a :EntityType1.
    graph.add((
        subj,
        A,
        URIRef(normalize(d["EntityType1"]))
    ))
    # :Entity1 :Relation "Entity2".
    graph.add((
        subj,
        URIRef(normalize(d["Relation"])),
        Literal(d["Entity2"])
    ))
This:
print(graph.serialize(format="n3").decode("utf-8"))
will print the graph in the N3 serialization format:
<X450-G2> a <switch> ;
    <hasFeatures> "Role-Based Policy" ;
    <hasLocation> "WallJack" .

<ers3600> a <switch> ;
    <hasFeatures> "ExtremeXOS" .

<slx9540> a <router> ;
    <hasFeatures> "ExtremeXOS" ;
    <hasLocation> "Chasis" .
And this will query the graph to print it in your format:
for subj in set(graph.subjects()):
    po = dict(graph.predicate_objects(subj))
    # :Entity1 a :EntityType1.
    print(":{} a :{}.".format(subj, po.pop(A)))
    for pred, obj in po.items():
        # :Entity1 :Relation "Entity2".
        print(':{} :{} "{}".'.format(subj, pred, obj))
    print()

How to parse a string in Python

How do I parse a string composed of n parameters in random order, such as:
{ UserID : 36875; tabName : QuickAndEasy}
{ RecipeID : 1150; UserID : 36716}
{ isFromLabel : 0; UserID : 36716; type : recipe; searchWord : soup}
{ UserID : 36716; tabName : QuickAndEasy}
Ultimately I'm looking to output the parameters in separate columns of a table.
The regex ([^{}\s:]+)\s*:\s*([^{}\s;]+) works on your examples. You need to be aware, though, that all the matches will be strings, so if you want to store 36875 as a number, you'll need to do some additional processing.
import re

regex = re.compile(
    r"""(            # Match and capture in group 1:
    [^{}\s:]+        # One or more characters except braces, whitespace or :
    )                # End of group 1
    \s*:\s*          # Match a colon, optionally surrounded by whitespace
    (                # Match and capture in group 2:
    [^{}\s;]+        # One or more characters except braces, whitespace or ;
    )                # End of group 2""",
    re.VERBOSE)
You can then do
>>> dict(regex.findall("{ isFromLabel : 0; UserID : 36716; type : recipe; searchWord : soup}"))
{'UserID': '36716', 'isFromLabel': '0', 'searchWord': 'soup', 'type': 'recipe'}
Test it live on regex101.com.
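For the additional processing mentioned above, one option (a sketch) is a dict comprehension that converts digit-only values to int and leaves everything else as strings:

```python
import re

regex = re.compile(r"([^{}\s:]+)\s*:\s*([^{}\s;]+)")

raw = dict(regex.findall("{ isFromLabel : 0; UserID : 36716; type : recipe; searchWord : soup}"))
# Convert purely numeric values to int, leave the rest as strings.
parsed = {k: int(v) if v.isdigit() else v for k, v in raw.items()}
print(parsed)
```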
lines = "{ UserID : 36875; tabName : QuickAndEasy } ", \
        "{ RecipeID : 1150; UserID : 36716}", \
        "{ isFromLabel : 0; UserID : 36716; type : recipe; searchWord : soup}", \
        "{ UserID : 36716; tabName : QuickAndEasy}"

counter = 0
mappedLines = {}
for line in lines:
    counter = counter + 1
    lineDict = {}
    line = line.replace("{", "")
    line = line.replace("}", "")
    line = line.strip()
    fieldPairs = line.split(";")
    for pair in fieldPairs:
        fields = pair.split(":")
        key = fields[0].strip()
        value = fields[1].strip()
        lineDict[key] = value
    mappedLines[counter] = lineDict

def printField(key, lineSets, comma_desired=True):
    if key in lineSets:
        print(lineSets[key], end="")
    if comma_desired:
        print(",", end="")
    else:
        print()

for key in range(1, len(mappedLines) + 1):
    lineSets = mappedLines[key]
    printField("UserID", lineSets)
    printField("tabName", lineSets)
    printField("RecipeID", lineSets)
    printField("type", lineSets)
    printField("searchWord", lineSets)
    printField("isFromLabel", lineSets, False)
CSV output:
36875,QuickAndEasy,,,,
36716,,1150,,,
36716,,,recipe,soup,0
36716,QuickAndEasy,,,,
The code above was Python 3.4. You can get similar output with 2.7 by replacing the function and the last for loop with this:
def printFields(keys, lineSets):
    output_line = ""
    for key in keys:
        if key in lineSets:
            output_line = output_line + lineSets[key] + ","
        else:
            output_line += ","
    print output_line[0:len(output_line) - 1]

fields = ["UserID", "tabName", "RecipeID", "type", "searchWord", "isFromLabel"]
for key in range(1, len(mappedLines) + 1):
    lineSets = mappedLines[key]
    printFields(fields, lineSets)

Learning Python: Store values in dict from stdout

How can I do the following in Python:
I have a command whose output looks like this:
Datexxxx
Clientxxx
Timexxx
Datexxxx
Client2xxx
Timexxx
Datexxxx
Client3xxx
Timexxx
And I want to turn this into a dict like:
Client:(date,time), Client2:(date,time) ...
After reading the data into a string subject, you could do this:
import re

d = {}
for match in re.finditer(
        r"""(?mx)
        ^Date(.*)\r?\n
        Client\d*(.*)\r?\n
        Time(.*)""",
        subject):
    d[match.group(2)] = (match.group(1), match.group(3))
How about something like:
rows = {}
thisrow = []
for line in output.split('\n'):
    if line[:4].lower() == 'date':
        thisrow.append(line)
    elif line[:6].lower() == 'client':
        thisrow.append(line)
    elif line[:4].lower() == 'time':
        thisrow.append(line)
    elif line.strip() == '':
        rows[thisrow[1]] = (thisrow[0], thisrow[2])
        thisrow = []
print(rows)
Assumes a trailing newline, no spaces before lines, etc.
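If the output really is a strict repetition of Date/Client/Time triples, grouping every three lines also works. This is a sketch assuming that fixed structure, with the sample output from the question inlined as a string:

```python
output = """Datexxxx
Clientxxx
Timexxx
Datexxxx
Client2xxx
Timexxx
Datexxxx
Client3xxx
Timexxx"""

# Drop blank lines, then zip the flat list into (date, client, time) triples.
lines = [l for l in output.split('\n') if l.strip()]
d = {client: (date, time)
     for date, client, time in zip(lines[::3], lines[1::3], lines[2::3])}
print(d)
```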
What about using a dict with tuples?
Create a dictionary and add the entries (d is used instead of dict so the built-in name is not shadowed):
d = {}
d['Client'] = ('date1', 'time1')
d['Client2'] = ('date2', 'time2')
Accessing the entries:
d['Client']
>>> ('date1', 'time1')
