Goal (automation): given a large list of dictionaries, I want to generate data in a specific format.
this is the input:
a = [{'et2': 'OBJ Type',
      'e2': 'OBJ',
      'rel': 'rel',
      'et1': 'SUJ Type',
      'e1': 'SUJ'},
     {'et2': 'OBJ Type 2',
      'e2': 'OBJ',
      'rel': 'rel',
      'et1': 'SUJ Type',
      'e1': 'SUJ'}
]
The expected output is this:
:Sub a :SubType.
:Sub :rel "Obj".
This is what I have tried:
Sub = 0
for i in a:
    entity_type1 = i["EntityType1"]
    entity1 = i["Entity1"]
    entity_type2 = i["EntityType2"]
    entity2 = i["Entity2"]
    relation = i["Relation"]
    if 'Sub' in entity_type1 or entity_type2:
        if entity1 == Sub and Sub <= 0 :
            Sub +=1
            sd_line1 = ""
            sd_line2 = ""
            sd_line1 = ":" + entity1 + " a " + ":" + entity_type1 + "."
        relation = ":"+relation
        sd_line2 ="\n" ":" + entity1 + " " + relation + " \"" + entity2 + "\"."
        sd_line3 = sd_line1 + sd_line2
        print(sd_line3)
A bit of advice: when doing such a transformation workflow, try to separate the major steps, e.g.: loading from a system, parsing data in one format, extracting, transforming, serializing to another format, loading to another system.
In your code example, you are mixing the extraction, transformation and serialization steps. Separating those steps will make your code easier to read and, thus, easier to maintain or reuse.
Below, I give you two solutions: the first is extracting data to a simple dict-based subject-predicate-object graph, the second one to a real RDF graph.
In both cases, you'll see that I separated the extraction/transformation step (which returns a graph) from the serialization step (which uses the graph), making them more reusable:
the dict-based transformation is implemented with either a plain dict or a defaultdict; the serialization step is common to both.
the rdflib.Graph-based transformation feeds two serializations: one to your format, the other to any serialization format that rdflib.Graph supports.
This will build a simple dict-based graph from your list of dictionaries a:
graph = {}
for e in a:
    subj = e["Entity1"]
    # create the entry only once, so earlier relations of the same subject are kept
    graph.setdefault(subj, {})
    # :Entity1 a :EntityType1.
    obj = e["EntityType1"]
    graph[subj]["a"] = obj
    # :Entity1 :Relation "Entity2".
    pred, obj = e["Relation"], e["Entity2"]
    graph[subj][pred] = obj
print(graph)
like this:
{'X450-G2': {'a': 'switch',
             'hasFeatures': 'Role-Based Policy',
             'hasLocation': 'WallJack'},
 'ers 3600': {'a': 'switch',
              'hasFeatures': 'ExtremeXOS'},
 'slx 9540': {'a': 'router',
              'hasFeatures': 'ExtremeXOS',
              'hasLocation': 'Chasis'}}
Or, in a shorter form, with a defaultdict:
from collections import defaultdict
graph = defaultdict(dict)
for e in a:
    subj = e["Entity1"]
    # :Entity1 a :EntityType1.
    graph[subj]["a"] = e["EntityType1"]
    # :Entity1 :Relation "Entity2".
    graph[subj][e["Relation"]] = e["Entity2"]
print(graph)
And this will print your subject-predicate-object triples from the graph:
def normalize(text):
    # remove spaces so the value can be used as a name
    return text.replace(' ', '')

for subj, po in graph.items():
    subj = normalize(subj)
    # :Entity1 a :EntityType1.
    print(':{} a :{}.'.format(subj, po.pop("a")))
    for pred, obj in po.items():
        # :Entity1 :Relation "Entity2".
        print(':{} :{} "{}".'.format(subj, pred, obj))
    print()
like this:
:X450-G2 a :switch.
:X450-G2 :hasFeatures "Role-Based Policy".
:X450-G2 :hasLocation "WallJack".
:ers3600 a :switch.
:ers3600 :hasFeatures "ExtremeXOS".
:slx9540 a :router.
:slx9540 :hasFeatures "ExtremeXOS".
:slx9540 :hasLocation "Chasis".
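As a self-contained sketch of the same dict-based approach, here is a version that uses the short key names from the sample input in the question (e1, et1, rel, e2) instead of Entity1, EntityType1, and so on. The function names build_graph and serialize are illustrative, and normalizing the type name is an assumption made here so that "SUJ Type" becomes a single token:

```python
from collections import defaultdict

# Sample input from the question (short key names).
a = [
    {'et2': 'OBJ Type', 'e2': 'OBJ', 'rel': 'rel', 'et1': 'SUJ Type', 'e1': 'SUJ'},
    {'et2': 'OBJ Type 2', 'e2': 'OBJ', 'rel': 'rel', 'et1': 'SUJ Type', 'e1': 'SUJ'},
]

def normalize(text):
    """Remove spaces so the value can be used as a name."""
    return text.replace(' ', '')

def build_graph(records):
    """Extraction/transformation step: return a subject -> {predicate: object} dict."""
    graph = defaultdict(dict)
    for rec in records:
        subj = rec['e1']
        graph[subj]['a'] = rec['et1']          # :e1 a :et1.
        graph[subj][rec['rel']] = rec['e2']    # :e1 :rel "e2".
    return graph

def serialize(graph):
    """Serialization step: return the triples as a list of text lines."""
    lines = []
    for subj, po in graph.items():
        subj = normalize(subj)
        po = dict(po)  # work on a copy so the graph itself is not modified
        lines.append(':{} a :{}.'.format(subj, normalize(po.pop('a'))))
        for pred, obj in po.items():
            lines.append(':{} :{} "{}".'.format(subj, pred, obj))
    return lines

lines = serialize(build_graph(a))
print('\n'.join(lines))
```

Both sample records describe the same subject, relation and object, so only two lines are printed.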
This will build a real RDF graph using the rdflib library:
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDF

A = RDF.type
graph = Graph()
for d in a:
    # normalize() is the helper defined above
    subj = URIRef(normalize(d["Entity1"]))
    # :Entity1 a :EntityType1.
    graph.add((
        subj,
        A,
        URIRef(normalize(d["EntityType1"]))
    ))
    # :Entity1 :Relation "Entity2".
    graph.add((
        subj,
        URIRef(normalize(d["Relation"])),
        Literal(d["Entity2"])
    ))
This:
print(graph.serialize(format="n3").decode("utf-8"))
will print the graph in the N3 serialization format (with rdflib 6 or later, serialize() already returns a str, so drop the .decode("utf-8") call):
<X450-G2> a <switch> ;
    <hasFeatures> "Role-Based Policy" ;
    <hasLocation> "WallJack" .
<ers3600> a <switch> ;
    <hasFeatures> "ExtremeXOS" .
<slx9540> a <router> ;
    <hasFeatures> "ExtremeXOS" ;
    <hasLocation> "Chasis" .
And this will query the graph to print it in your format:
for subj in set(graph.subjects()):
    po = dict(graph.predicate_objects(subj))
    # :Entity1 a :EntityType1.
    print(":{} a :{}.".format(subj, po.pop(A)))
    for pred, obj in po.items():
        # :Entity1 :Relation "Entity2".
        print(':{} :{} "{}".'.format(subj, pred, obj))
    print()
I have a class, and in that class I have a method that calls several other methods.
The problem I am facing is that the combined method and one of the methods it calls both take the same parameter, file_name.
So when I call the combined method, it returns an empty list: [].
This is the combined method:
def show_extracted_data_from_file(self, file_name):
    self.extractingText.extract_text_from_image(file_name)
    total_fruit = self.filter_verdi_total_number_fruit()
    fruit_name = self.filter_verdi_fruit_name()
    fruit_total_cost = self.filter_verdi_total_fruit_cost(file_name)
    return "\n".join("{} \t {} \t {}".format(a, b, c) for a, b, c in zip(total_fruit, fruit_name, fruit_total_cost))
And this is the method filter_verdi_total_fruit_cost:
def filter_verdi_total_fruit_cost(self, file_name):
    locale.setlocale(locale.LC_ALL, locale='Dutch')
    self.extractingText.extract_text_from_image(file_name)
    return [
        locale.atof(items[-1]) for items in (
            token.split() for token in file_name.split('\n')
        ) if len(items) > 2 and items[1] in self.extractingText.list_fruit
    ]
this method returns the following data:
[123.2, 2772.0, 46.2, 577.5, 69.3, 3488.16, 137.5, 500.0, 1000.0, 2000.0, 1000.0, 381.25]
As you can see, file_name is passed twice.
So when I call the method show_extracted_data_from_file in views.py:
if uploadfile.image.path.endswith('.pdf'):
    content = filter_text.show_extracted_data_from_file(uploadfile.image.path)
    print(content)
it produces an empty list: []
Question: how can I avoid passing file_name twice so that it returns the correct results?
These are the two other methods that I call in the combined method:
def filter_verdi_total_number_fruit(self):
    regex = r"(\d*(?:\.\d+)*)\s*\W+(?:" + '|'.join(re.escape(word)
        for word in self.extractingText.list_fruit) + ')'
    return re.findall(regex, self.extractingText.text_factuur_verdi[0])

def filter_verdi_fruit_name(self):
    regex = r"(?:\d*(?:\.\d+)*)\s*\W+(" + '|'.join(re.escape(word)
        for word in self.extractingText.list_fruit) + ')'
    return re.findall(regex, self.extractingText.text_factuur_verdi[0])
So this is the other class:
class ExtractingTextFromFile:
    def extract_text_from_image(self, filename):
        self.text_factuur_verdi = []
        pdf_file = wi(filename=filename, resolution=300)
        all_images = pdf_file.convert('jpeg')
        for image in all_images.sequence:
            image = wi(image=image)
            image = image.make_blob('jpeg')
            image = Image.open(io.BytesIO(image))
            text = pytesseract.image_to_string(image, lang='eng')
            self.text_factuur_verdi.append(text)
        return self.text_factuur_verdi

    def __init__(self):
        # instance attributes:
        self.text_factuur_verdi = []
        self.list_fruit = ['Appels', 'Ananas', 'Peen Waspeen',
                           'Tomaten Cherry', 'Sinaasappels',
                           'Watermeloenen', 'Rettich', 'Peren', 'Peen',
                           'Mandarijnen', 'Meloenen', 'Grapefruit', 'Rettich']
@AndrewRyan has the right idea.
I presume calling extract_text_from_image just stores the extracted text on the object (in the class shown, in text_factuur_verdi), so it only needs to be called once.
There are two routes you can take. From what you are commenting, you'll probably just go with #1, but I give #2 as another option in case you ever want to call filter_verdi_total_fruit_cost by itself.
Path 1: just remove it.
Note: this assumes filter_verdi_total_fruit_cost is only called from show_extracted_data_from_file.
def show_extracted_data_from_file(self, file_name):
    # extract text
    # Note: stores the text in `self.extractingText.text_factuur_verdi`
    self.extractingText.extract_text_from_image(file_name)
    total_fruit = self.filter_verdi_total_number_fruit()
    fruit_name = self.filter_verdi_fruit_name()
    fruit_total_cost = self.filter_verdi_total_fruit_cost()
    return "\n".join("{} \t {} \t {}".format(a, b, c) for a, b, c in zip(total_fruit, fruit_name, fruit_total_cost))

def filter_verdi_total_fruit_cost(self):
    # Note: the text should already be extracted at this point
    locale.setlocale(locale.LC_ALL, locale='Dutch')
    # split the extracted text, not the file name
    # (assumes the first page, as in the other filter methods)
    text = self.extractingText.text_factuur_verdi[0]
    return [
        locale.atof(items[-1]) for items in (
            token.split() for token in text.split('\n')
        ) if len(items) > 2 and items[1] in self.extractingText.list_fruit
    ]
Path 2: check whether the text has already been extracted; if not, extract it, otherwise continue.
Note: use this if you want to be able to call filter_verdi_total_fruit_cost on its own.
def show_extracted_data_from_file(self, file_name):
    # extract text
    # Note: stores the text in `self.extractingText.text_factuur_verdi`
    self.extractingText.extract_text_from_image(file_name)
    total_fruit = self.filter_verdi_total_number_fruit()
    fruit_name = self.filter_verdi_fruit_name()
    fruit_total_cost = self.filter_verdi_total_fruit_cost(file_name)
    return "\n".join("{} \t {} \t {}".format(a, b, c) for a, b, c in zip(total_fruit, fruit_name, fruit_total_cost))

def filter_verdi_total_fruit_cost(self, file_name):
    locale.setlocale(locale.LC_ALL, locale='Dutch')
    if not getattr(self.extractingText, 'text_factuur_verdi', None):
        # file hasn't been extracted yet.. extract it
        self.extractingText.extract_text_from_image(file_name)
    # split the extracted text, not the file name
    text = self.extractingText.text_factuur_verdi[0]
    return [
        locale.atof(items[-1]) for items in (
            token.split() for token in text.split('\n')
        ) if len(items) > 2 and items[1] in self.extractingText.list_fruit
    ]
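To see why the original call can come back empty, here is a minimal sketch, independent of the OCR code. It assumes the cost filter is meant to run over the extracted text rather than over the path string. The path, the sample text and the helper parse_costs are hypothetical, and the Dutch number format is parsed by hand so that the snippet does not depend on an installed locale:

```python
list_fruit = ['Appels', 'Peren']

def parse_costs(text, fruit_names):
    """Return the last token of each line whose second token is a known fruit name."""
    costs = []
    for line in text.split('\n'):
        items = line.split()
        if len(items) > 2 and items[1] in fruit_names:
            # Dutch-style number: '.' groups thousands, ',' is the decimal separator
            costs.append(float(items[-1].replace('.', '').replace(',', '.')))
    return costs

file_name = '/tmp/invoice.pdf'                          # hypothetical path
extracted_text = "3 Appels 123,20\n16 Peren 2.772,00"   # hypothetical OCR output

# Splitting the path yields a single token, so no line passes the filter.
costs_from_path = parse_costs(file_name, list_fruit)
# Splitting the extracted text yields one entry per matching line.
costs_from_text = parse_costs(extracted_text, list_fruit)

print(costs_from_path)
print(costs_from_text)
```

With an empty cost list, zip() in show_extracted_data_from_file has nothing to pair up, so the joined result is empty as well.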
I have an Erlang script from which I would like to get some data and store it in a Python dictionary.
It is easy to parse the script to get a string like this:
{userdata,
 [{tags,
   [#dt{number=111},
    #mp{id='X23.W'}]},
  {log,
   'LG22'},
  {instruction,
   "String that can contain characters like -, _ or numbers"}
 ]
}.
desired result:
userdata = {"tags": {"dt": {"number": 111}, "mp": {"id": "X23.W"}},
"log": "LG22",
"instruction": "String that can contain characters like -, _ or numbers"}
# "#" mark for data in "tags" is not required in this structure.
# Also value for "tags" can be any iterable structure: tuple, list or dictionary.
But I am not sure how to transfer this data into a Python dictionary. My first idea was to use json.loads, but that requires many modifications (putting words in quotation marks, replacing "," with ":" and many more).
Moreover, the keys in userdata are not limited to a fixed set. In this case there are 'tags', 'log' and 'instruction', but there can be many more, e.g. 'slogan', 'ids', etc.
Also, I am not sure about the order. I assume that the keys can appear in any order.
My code (it is not working for id='X23.W' so I removed '.' from input):
import re
import json
in_ = """{userdata, [{tags, [#dt{number=111}, #mp{id='X23W'}]}, {log, 'LG22'}, {instruction, "String that can contain characters like -, _ or numbers"}]}"""
buff = in_.replace("{userdata, [", "")[:-2]
re_helper = re.compile(r"(#\w+)")
buff = re_helper.sub(r'\1:', buff)
partition = buff.partition("instruction")
section_to_replace = partition[0]
replacer = re.compile(r"(\w+)")
match = replacer.sub(r'"\1"', section_to_replace)
buff = ''.join([match, '"instruction"', partition[2]])
buff = buff.replace("#", "")
buff = buff.replace('",', '":')
buff = buff.replace("}, {", "}, \n{")
buff = buff.replace("=", ":")
buff = buff.replace("'", "")
temp = buff.split("\n")
userdata = {}
buff = temp[0][:-2]
buff = buff.replace("[", "{")
buff = buff.replace("]", "}")
userdata.update(json.loads(buff))
for i, v in enumerate(temp[1:]):
    v = v.strip()
    if v.endswith(","):
        v = v[:-1]
    userdata.update(json.loads(v))
print(userdata)
Output:
{'tags': {'dt': {'number': '111'}, 'mp': {'id': 'X23W'}}, 'instruction': 'String that can contain characters like -, _ or numbers', 'log': 'LG22'}
import json
import re
in_ = """{userdata, [{tags, [#dt{number=111}, #mp{id='X23.W'}]}, {log, 'LG22'}, {instruction, "String that can contain characters like -, _ or numbers"}]}"""
qouted_headers = re.sub(r"\{(\w+),", r'{"\1":', in_)
changed_hashed_list_to_dict = re.sub(r"\[(#.*?)\]", r'{\1}', qouted_headers)
hashed_variables = re.sub(r'#(\w+)', r'"\1":', changed_hashed_list_to_dict)
equality_signes_replaced_and_quoted = re.sub(r'{(\w+)=', r'{"\1":', hashed_variables)
replace_single_qoutes = equality_signes_replaced_and_quoted.replace('\'', '"')
result = json.loads(replace_single_qoutes)
print(result)
Produces:
{'userdata': [{'tags': {'dt': {'number': 111}, 'mp': {'id': 'X23.W'}}}, {'log': 'LG22'}, {'instruction': 'String that can contain characters like -, _ or numbers'}]}
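The result above keeps userdata as a list of single-key dictionaries, whereas the desired structure in the question is one flat dictionary. A minimal sketch of that last step, assuming every list entry is a dict and that no key occurs twice (a later duplicate would overwrite an earlier one):

```python
# `result` is the structure printed above, copied here so the snippet runs on its own.
result = {'userdata': [
    {'tags': {'dt': {'number': 111}, 'mp': {'id': 'X23.W'}}},
    {'log': 'LG22'},
    {'instruction': 'String that can contain characters like -, _ or numbers'},
]}

# Merge the single-key dicts into one dictionary.
userdata = {}
for entry in result['userdata']:
    userdata.update(entry)

print(userdata)
```

Because the merge does not depend on the key names or their order, it should also cover extra keys such as 'slogan' or 'ids'.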
I am trying to rephrase the implementation found here. This is what I have so far:
import csv
import math
import random

training_set_ratio = 0.67
training_set = []
test_set = []

class IrisFlower:
    def __init__(self, petal_length, petal_width, sepal_length, sepal_width, flower_type):
        self.petal_length = petal_length
        self.petal_width = petal_width
        self.sepal_length = sepal_length
        self.sepal_width = sepal_width
        self.flower_type = flower_type

    def __hash__(self) -> int:
        return hash((self.petal_length, self.petal_width, self.sepal_length, self.sepal_width))

    def __eq__(self, other):
        return (self.petal_length, self.petal_width, self.sepal_length, self.sepal_width) \
               == (other.petal_length, other.petal_width, other.sepal_length, other.sepal_width)

def load_data():
    with open('dataset.csv') as csvfile:
        rows = csv.reader(csvfile, delimiter=',')
        for row in rows:
            iris_flower = IrisFlower(float(row[0]), float(row[1]), float(row[2]), float(row[3]), row[4])
            if random.random() < training_set_ratio:
                training_set.append(iris_flower)
            else:
                test_set.append(iris_flower)

def euclidean_distance(flower_one: IrisFlower, flower_two: IrisFlower):
    distance = 0.0
    distance = distance + math.pow(flower_one.petal_length - flower_two.petal_length, 2)
    distance = distance + math.pow(flower_one.petal_width - flower_two.petal_width, 2)
    distance = distance + math.pow(flower_one.sepal_length - flower_two.sepal_length, 2)
    distance = distance + math.pow(flower_one.sepal_width - flower_two.sepal_width, 2)
    return distance

def get_neighbors(test_flower: IrisFlower):
    distances = []
    for training_flower in training_set:
        dist = euclidean_distance(test_flower, training_flower)
        d = dict()
        d[training_flower] = dist
        print(d)
    return

load_data()
get_neighbors(test_set[0])
Currently, the print statement in the following code block:
def get_neighbors(test_flower: IrisFlower):
    distances = []
    for training_flower in training_set:
        dist = euclidean_distance(test_flower, training_flower)
        d = dict()
        d[training_flower] = dist
        print(d)
    return
will have outputs similar to
{<__main__.IrisFlower object at 0x107774fd0>: 0.25999999999999945}
which is ok. But I do not want to create the dictionary first, and then append the key value, as in:
d = dict()
d[training_flower] = dist
So this is what I am trying:
d = dict(training_flower = dist)
However, the dict constructor does not seem to use the instance as the key, but rather a string, because what I see printed is as follows:
{'training_flower': 23.409999999999997}
{'training_flower': 16.689999999999998}
How do I create the dictionary by using the object as key in one statement?
In your snippet, where you write d = dict(training_flower=dist), training_flower is a keyword argument name for the dict constructor, not your object. It is equivalent to writing d = {'training_flower': dist}, with the key as a string. To use the object itself as the key in one statement, use the literal syntax with the variable name unquoted:
d = {training_flower: dist}
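The difference can be checked with a short, self-contained sketch. The Flower class below is a hypothetical stand-in for IrisFlower, and the distance value is made up:

```python
class Flower:
    """Stand-in for IrisFlower; the default identity-based hash is enough here."""

flower = Flower()
dist = 0.26

by_keyword = dict(flower=dist)       # keyword argument: the key is the string 'flower'
by_literal = {flower: dist}          # literal syntax: the key is the Flower instance
by_pairs = dict([(flower, dist)])    # key/value pairs: also keyed by the instance

print(list(by_keyword))
print(flower in by_literal, flower in by_pairs)
```

Passing key/value pairs to dict() also keeps the object as the key, but the literal form is the shortest one-statement option.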
To directly create a dict with a key which is not a valid keyword, use the {} syntax like:
Code:
d = {training_flower: 'a_value'}
Test Code:
training_flower = 'a key'
d = {training_flower: 'a_value'}
print(d)
Results:
{'a key': 'a_value'}
To initialize a dictionary with an object as a key (edit: the string in Stephen's example is an object anyway):
class Flower:
    def __repr__(self):
        return 'i am flower'

flower1 = Flower()
d = {flower1: 4}
print(d)
outputs
{i am flower: 4}
This is my first post here, and I know I'm late; sorry if it's a duplicate solution. It is just to show that it works with an object.
I would upvote Stephen's answer, but I can't yet.
I have a question about summing the values of duplicate keys into one key with the aggregate total. For example:
1:5
2:4
3:2
1:4
Very basic but I'm looking for an output that looks like:
1:9
2:4
3:2
In the two files I am using, I am dealing with a list of 51 users (column 1 of user_artists.dat), each with an artistID (column 2) and the number of times that user has listened to that artist, given by the weight (column 3).
I am attempting to aggregate the total number of times each artist has been played across all users and display it in a format such as:
Britney Spears (289) 2393140. Any help or input would be much appreciated.
import codecs
#from collections import defaultdict
with codecs.open("artists.dat", encoding = "utf-8") as f:
    artists = f.readlines()
with codecs.open("user_artists.dat", encoding = "utf-8") as f:
    users = f.readlines()
artist_list = [x.strip().split('\t') for x in artists][1:]
user_stats_list = [x.strip().split('\t') for x in users][1:]
artists = {}
for a in artist_list:
    artistID, name = a[0], a[1]
    artists[artistID] = name
grouped_user_stats = {}
for u in user_stats_list:
    userID, artistID, weight = u
    grouped_user_stats[artistID] = grouped_user_stats[artistID].astype(int)
    grouped_user_stats[weight] = grouped_user_stats[weight].astype(int)
    for artistID, weight in u:
        grouped_user_stats.groupby('artistID')['weight'].sum()
        print(grouped_user_stats.groupby('artistID')['weight'].sum())
    #if userID not in grouped_user_stats:
        #grouped_user_stats[userID] = { artistID: {'name': artists[artistID], 'plays': 1} }
    #else:
        #if artistID not in grouped_user_stats[userID]:
            #grouped_user_stats[userID][artistID] = {'name': artists[artistID], 'plays': 1}
        #else:
            #grouped_user_stats[userID][artistID]['plays'] += 1
            #print('this never happens')
#print(grouped_user_stats)
How about:
import codecs
from collections import defaultdict
# read stuff
with codecs.open("artists.dat", encoding = "utf-8") as f:
    artists = f.readlines()
with codecs.open("user_artists.dat", encoding = "utf-8") as f:
    users = f.readlines()
# transform artist data into a dict with "artist id" as key and "artist name" as value
artist_repo = dict(x.strip().split('\t')[:2] for x in artists[1:])
user_stats_list = [x.strip().split('\t') for x in users][1:]
grouped_user_stats = defaultdict(int)
for u in user_stats_list:
    #userID, artistID, weight = u
    # accumulate weights in a dict with artist id (column 2) as key and sum of weights as value
    grouped_user_stats[u[1]] += int(u[2])
# extra: "fancying" the data by transforming the keys of the dict into "<artist name> (artist id)" format
grouped_user_stats = dict(("%s (%s)" % (artist_repo.get(k, "Unknown artist"), k), v) for k, v in grouped_user_stats.items())
# lastly print it
for k, v in grouped_user_stats.items():
    print(k, v)
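The accumulation step on its own, applied to the small 1:5, 2:4, 3:2, 1:4 example from the question, can be sketched like this. The pairs list is typed in by hand here instead of being read from a file:

```python
from collections import defaultdict

# Key/value pairs from the question's small example; key 1 occurs twice.
pairs = [(1, 5), (2, 4), (3, 2), (1, 4)]

totals = defaultdict(int)   # missing keys start at 0
for key, value in pairs:
    totals[key] += value

for key, value in totals.items():
    print("{}:{}".format(key, value))
```

defaultdict(int) removes the need to check whether a key is already present before adding to it.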
This is the code I'm using:
where_con=''
#loop on model name
# getting all info for one model
where_con = {}
for k in model_k_j:
    type_val = type(model_k_j[k])
    if type_val== dict:
        print "dictonary type"
        """
        for model_field_dict in model_k_j[k]:
            start= model_k_j[k][model_field_dict]
            end= model_k_j[k][model_field_dict]
            where_con[k] = medical_home_last_visit__range=[start,end ]
            break
        """
    else:
        col_name.append(k)
        where_con[k] = model_k_j[k]
# covert data type
# **where_con {unpack tuple}
# where_con =str(where_con)
# print where_con
qs_new = model_obj.objects.filter(**where_con)
The field name medical_home_last_visit is not static; it is supplied dynamically.
How do I append it? I have tried something like:
colname_variable = medical_home_last_visit
where_con[k] = colname_variable + __range=[start,end ]
but it is not working properly, and gives this error :
where_con[k] = colname_variable + __range=[start,end ]
^
SyntaxError: invalid syntax
where_con is a dict, and the key should be the column name followed by __range, built as a string:
# colname_variable must hold the field name as a string
k = colname_variable + '__range'  # e.g. 'medical_home_last_visit__range'
where_con[k] = (start, end)
qs_new = model_obj.objects.filter(**where_con)
This is equivalent to:
model_obj.objects.filter(medical_home_last_visit__range=(start, end))
Any other filter arguments should also be keys in where_con, for example:
k = 'some_date__lte'
where_con[k] = datetime.datetime.now()
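The key-building step can be tried without a database. The sketch below assumes that a dict value carries hypothetical 'start' and 'end' entries (the question does not show the real shape of model_k_j), and it uses a stand-in function in place of model_obj.objects.filter:

```python
def build_filter_kwargs(conditions):
    """Build keyword arguments for a filter call from a {field: value} mapping."""
    kwargs = {}
    for field, value in conditions.items():
        if isinstance(value, dict):
            # Hypothetical shape: {'start': ..., 'end': ...} means a range lookup.
            kwargs[field + '__range'] = (value['start'], value['end'])
        else:
            kwargs[field] = value
    return kwargs

def fake_filter(**kwargs):
    """Stand-in for model_obj.objects.filter; it only echoes what it receives."""
    return kwargs

# Hypothetical conditions: one range field and one plain field.
conditions = {
    'medical_home_last_visit': {'start': '2020-01-01', 'end': '2020-12-31'},
    'status': 'active',
}

where_con = build_filter_kwargs(conditions)
received = fake_filter(**where_con)
print(received)
```

Because the lookup name is an ordinary string key, it can be built from any dynamic field name before the dict is unpacked with **.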