I have a database schema in Postgres that looks like this (in pseudo code):
users (table):
    pk (field, unique)
    name (field)
permissions (table):
    pk (field, unique)
    permission (field, unique)
addresses (table):
    pk (field, unique)
    address (field, unique)
association1 (table):
    user_pk (field, foreign_key)
    permission_pk (field, foreign_key)
association2 (table):
    user_pk (field, foreign_key)
    address_pk (field, foreign_key)
Hopefully this makes intuitive sense. It's a users table that has a many-to-many relationship with a permissions table as well as a many-to-many relationship with an addresses table.
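For context, the query is roughly of this shape. This is only a sketch: it assumes users, permissions, addresses, association1 and association2 are SQLAlchemy Table objects matching the schema above, engine is an existing Engine, and 1.4-style Core syntax.
from sqlalchemy import select

# Sketch only: all Table objects and `engine` are assumed to already exist.
joined = (
    users
    .join(association1, association1.c.user_pk == users.c.pk)
    .join(permissions, permissions.c.pk == association1.c.permission_pk)
    .join(association2, association2.c.user_pk == users.c.pk)
    .join(addresses, addresses.c.pk == association2.c.address_pk)
)
stmt = select(
    users.c.pk, users.c.name, permissions.c.permission, addresses.c.address
).select_from(joined)

with engine.connect() as conn:
    results = [dict(row._mapping) for row in conn.execute(stmt)]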
In Python, when I perform the correct SQLAlchemy query incantations, I get back results that look something like this (after converting them to a list of dictionaries in Python):
results = [
    {'pk': 1, 'name': 'Joe', 'permission': 'user', 'address': 'home'},
    {'pk': 1, 'name': 'Joe', 'permission': 'user', 'address': 'work'},
    {'pk': 1, 'name': 'Joe', 'permission': 'admin', 'address': 'home'},
    {'pk': 1, 'name': 'Joe', 'permission': 'admin', 'address': 'work'},
    {'pk': 2, 'name': 'John', 'permission': 'user', 'address': 'home'},
]
So in this contrived example, Joe is both a user and an admin. John is only a user. Both Joe's home and work addresses exist in the database; only John's home address exists.
So the question is, does anybody know the best way to go from these SQL query 'results' to the more compact 'desired_results' below?
desired_results = [
    {
        'pk': 1,
        'name': 'Joe',
        'permissions': ['user', 'admin'],
        'addresses': ['home', 'work']
    },
    {
        'pk': 2,
        'name': 'John',
        'permissions': ['user'],
        'addresses': ['home']
    },
]
Additional information required: Small list of dictionaries describing the 'labels' I would like to use in the desired_results for each of the fields that have many-to-many relationships.
relationships = [
    {'label': 'permissions', 'back_populates': 'permission'},
    {'label': 'addresses', 'back_populates': 'address'},
]
Final consideration: I've put together a concrete example for the purposes of this question, but in general I'm trying to solve the problem of querying SQL databases with an arbitrary number of relationships. SQLAlchemy ORM solves this well, but I'm limited to using SQLAlchemy Core, so I'm trying to build my own solution.
Update
Here's an answer, but I'm not sure it's the best / most efficient solution. Can anyone come up with something better?
# step 1: generate the set of keys that will be replaced by new keys in desired_results
back_populates = set(rel['back_populates'] for rel in relationships)

# step 2: delete from results the keys generated in step 1
intermediate_results = [
    {k: v for k, v in res.items() if k not in back_populates}
    for res in results]

# step 3: eliminate duplicates
intermediate_results = [
    dict(t)
    for t in set(tuple(ires.items()) for ires in intermediate_results)]

# step 4: add back the information from the deleted fields, but in the desired form
# (converted from a set to a list so it matches desired_results and stays JSON-serializable)
for ires in intermediate_results:
    for rel in relationships:
        ires[rel['label']] = list({
            res[rel['back_populates']]
            for res in results
            if res['pk'] == ires['pk']})

# done
desired_results = intermediate_results
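One alternative I've considered (a sketch only, assuming pk uniquely identifies a user across all rows) is a single pass that groups rows in a dict keyed on pk, which avoids re-scanning results for every intermediate entry:
merged = {}
for res in results:
    # create the base entry once per pk, with an empty set per relationship label
    entry = merged.setdefault(res['pk'], {
        **{k: v for k, v in res.items() if k not in back_populates},
        **{rel['label']: set() for rel in relationships},
    })
    for rel in relationships:
        entry[rel['label']].add(res[rel['back_populates']])

# convert the sets to lists so the output matches desired_results
desired_results = [
    {k: (list(v) if isinstance(v, set) else v) for k, v in entry.items()}
    for entry in merged.values()]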
Iterating over the groups of partial entries looks like a job for itertools.groupby.
But first let's put relationships into a format that is easier to use, perhaps a back_populates:label dictionary?
conversions = {d["back_populates"]: d['label'] for d in relationships}
Next, because we will be using itertools.groupby, it needs a key function to distinguish between the different groups of entries.
So given one entry from the initial results, this function will return a dictionary with only the pairs that will not be condensed/converted:
def grouper(entry):
    # each group is identified by all key:values that are not identified in conversions
    return {k: v for k, v in entry.items() if k not in conversions}
Now we will be able to traverse the results in groups something like this:
for base_info, group in itertools.groupby(old_results, grouper):
    # base_info is a dict with the info common to all entries in the group
    for partial in group:
        # partial is one entry from results that will contribute to the final result
        # but wait, what do we add it to?
The only issue is that if we build our entry directly on base_info, mutating it will confuse groupby (it holds on to that dict as the current group key), so we need to make a separate entry to work with:
entry = {new_field:set() for new_field in conversions.values()}
entry.update(base_info)
Note that I am using sets here because they are the natural container when all contents are unique;
however, sets are not JSON-serializable, so we will need to change them into lists at the end.
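(A side note of mine, not part of the original answer: if the only consumer is json.dumps, you can also leave the sets in place and tell the encoder how to serialize them.)
import json

# default= is called for anything json.dumps cannot serialize natively;
# list() turns each set into a JSON array at dump time
json.dumps(entry, default=list)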
Now that we have an entry to build, we can just iterate through the group and add to each new field from the original:
for partial in group:
    for original, new in conversions.items():
        entry[new].add(partial[original])
Then, once the final entry is constructed, all that is left is to convert the sets back into lists:
for new in conversions.values():
    entry[new] = list(entry[new])
And that entry is done. We could append it to a list called new_results, but since this process essentially generates results one at a time, it makes more sense to write it as a generator,
making the final code look something like this:
import itertools

results = [
    {'pk': 1, 'name': 'Joe', 'permission': 'user', 'address': 'home'},
    {'pk': 1, 'name': 'Joe', 'permission': 'user', 'address': 'work'},
    {'pk': 1, 'name': 'Joe', 'permission': 'admin', 'address': 'home'},
    {'pk': 1, 'name': 'Joe', 'permission': 'admin', 'address': 'work'},
    {'pk': 2, 'name': 'John', 'permission': 'user', 'address': 'home'},
]

relationships = [
    {'label': 'permissions', 'back_populates': 'permission'},
    {'label': 'addresses', 'back_populates': 'address'},
]

# first we put the "relationships" in a format that is much easier to use
conversions = {d["back_populates"]: d['label'] for d in relationships}


def grouper(entry):
    # each group is identified by all key:values that are not identified in conversions
    return {k: v for k, v in entry.items() if k not in conversions}


def parse_results(old_results, conversions=conversions):
    for base_info, group in itertools.groupby(old_results, grouper):
        entry = {new_field: set() for new_field in conversions.values()}
        entry.update(base_info)
        for partial in group:  # for each entry in the original results set
            for original, new in conversions.items():  # for each field that will be condensed
                entry[new].add(partial[original])
        # convert sets back to lists so it can be put back into json
        for new in conversions.values():
            entry[new] = list(entry[new])
        yield entry
Then the new_results can be gotten like this:
>>> new_results = list(parse_results(results))
>>> from pprint import pprint #for demo purpose
>>> pprint(new_results,width=50)
[{'addresses': ['home', 'work'],
  'name': 'Joe',
  'permissions': ['admin', 'user'],
  'pk': 1},
 {'addresses': ['home'],
  'name': 'John',
  'permissions': ['user'],
  'pk': 2}]
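One caveat, added here rather than in the original answer: itertools.groupby only merges consecutive rows, so this relies on the query results already being ordered by the non-relationship columns (as they are in the example). If they might not be, sort on the same key first:
def sort_key(entry):
    # order rows so that all entries belonging to one group are adjacent;
    # this uses the same non-relationship fields that grouper() looks at
    return sorted(grouper(entry).items())

new_results = list(parse_results(sorted(results, key=sort_key)))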
Related
I'm receiving many CSV-files that contain orders for different products. Those CSV-files need to be "converted" into a specific JSON-structure.
Each row of the CSV-file represents the order of one product. This means that if I would order two products, the CSV would contain two rows.
A simplified version of the CSV-file may look like this (please note the orderId "111" in the first and third row):
orderId,itemNumber,itemName,name,street
111,123,testitem,john doe,samplestreet 1
222,345,anothertestitem,jane doe,samplestreet 1
111,345,anothertestitem,john doe,samplestreet 1
My current solution works but I think I'm overcomplicating things.
Currently, I'm iterating over each CSV row and creating the JSON structure using a helper function that either adds a new order or appends to the list of ordered items, like so:
def add_orderitem(orderitem, order, all_orders):
    """ Adds an ordered product to the order or "create" a new order if it doesn't exist """
    for row in all_orders:
        # Order already exists
        if any(order["orderNumber"] == value for field, value in row.items()):
            print(f"Order '{order['orderNumber']}' already exists, adding product #{orderitem['sku']}")
            row["orderItems"].append(orderitem)
            return all_orders
    # New order
    print(f"New Order found, creating order '{order['orderNumber']}' and adding product #{orderitem['sku']}")
    all_orders.append(order)
    order["orderItems"].append(orderitem)
    return all_orders


def parse_orders():
    """ Converts CSV-orders into JSON """
    results = []
    orders = read_csv("testorder.csv")  # helper-function returns CSV-dictreader (list of dicts)
    for order in orders:
        # Create basic structure
        orderdata = {
            "orderNumber": order["orderId"],
            "address": {
                "name": order["orderId"],
                "street": order["street"]
            },
            "orderItems": []  # <-- this will be filled later
        }
        # Extract product-information that will be inserted in above 'orderItems' list
        product = {
            "sku": order["itemNumber"],
            "name": order["itemName"]
        }
        # Add order to final list or add item if order already exists
        results = add_orderitem(product, orderdata, results)
    return results


def main():
    from pprint import pprint
    parsed_orders = parse_orders()
    pprint(parsed_orders)


if __name__ == "__main__":
    main()
The script works fine; the output below is what I'm expecting:
New Order found, creating order '111' and adding product #123
New Order found, creating order '222' and adding product #345
Order '111' already exists, adding product #345
[{'address': {'name': '111', 'street': 'samplestreet 1'},
  'orderItems': [{'name': 'testitem', 'sku': '123'},
                 {'name': 'anothertestitem', 'sku': '345'}],
  'orderNumber': '111'},
 {'address': {'name': '222', 'street': 'samplestreet 1'},
  'orderItems': [{'name': 'anothertestitem', 'sku': '345'}],
  'orderNumber': '222'}]
Is there a way to do this "smarter"?
IMO a namedtuple and a groupby would make your code clearer:
from collections import namedtuple
from itertools import groupby
# csv data or file
data = """orderId,itemNumber,itemName,name,street
111,123,testitem,john doe,samplestreet 1
222,345,anothertestitem,jane doe,samplestreet 1
111,345,anothertestitem,john doe,samplestreet 1
"""
# the Order tuple
Order = namedtuple('Order', 'orderId itemNumber itemName name street')
# load the csv into orders
orders = [Order(*values) for line in data.split("\n")[1:] if line for values in [line.split(",")]]
# sort it by orderId so groupby sees each order's rows consecutively
orders = sorted(orders, key=lambda order: order.orderId)
# group it by orderId
output = list()
for key, values in groupby(orders, key=lambda order: order.orderId):
    items = list(values)
    dct = {"address": {"name": items[0].name, "street": items[0].street},
           "orderItems": [{"name": item.itemName, "sku": item.itemNumber} for item in items]}
    output.append(dct)
print(output)
This yields
[{'address': {'name': 'john doe', 'street': 'samplestreet 1'}, 'orderItems': [{'name': 'testitem', 'sku': '123'}, {'name': 'anothertestitem', 'sku': '345'}]},
{'address': {'name': 'jane doe', 'street': 'samplestreet 1'}, 'orderItems': [{'name': 'anothertestitem', 'sku': '345'}]}]
You could even cram it all into one big comprehension, but that would not make it more readable.
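A variation of mine, not required by the question: if the CSV is large or not already sorted by orderId, csv.DictReader plus a defaultdict groups the rows in one pass without sorting (same column names as above):
import csv
import io
from collections import defaultdict

# group rows by orderId in a single pass; no sorting required
grouped = defaultdict(list)
for row in csv.DictReader(io.StringIO(data)):
    grouped[row["orderId"]].append(row)

output = [
    {"orderNumber": order_id,
     "address": {"name": rows[0]["name"], "street": rows[0]["street"]},
     "orderItems": [{"sku": r["itemNumber"], "name": r["itemName"]} for r in rows]}
    for order_id, rows in grouped.items()
]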
I am getting a dict of users and their information
{'Username': 'username', 'Attributes': [{'Name': 'sub', 'Value': 'userSub'}, {'Name': 'email', 'Value': 'email'}]}
I want to restructure this into an array of objects
ex) [{username: 'username', sub: 'userSub', email: 'email'}, {username: 'secondUsername', sub: 'secondSub'...}]
How do I accomplish this without manually putting in every value, given that there may be different Attributes for each user?
I have this so far
for user in response['Users']:
    userList.append({
        'username': user['Username'],
        user['Attributes'][0]['Name']: user['Attributes'][0]['Value'],
    })
This will return the correct structure, but I need to dynamically add the user attributes instead of manually putting in each index or string value
I would initially create each dict with just its username key, then use the update method to add the remaining keys.
from operator import itemgetter
get_kv_pairs = itemgetter('Name', 'Value')
# e.g.
# get_kv_pairs({'Name': 'sub', 'Value': 'userSub'}) == ('sub', 'userSub')
user_list = []
for user in response['Users']:
    d = {'username': user['Username']}
    kv_pairs = map(get_kv_pairs, user['Attributes'])
    d.update(kv_pairs)
    user_list.append(d)
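An equivalent one-liner per user, a sketch that makes the same assumption about the shape of response, uses a dict comprehension over the attributes:
# hypothetical sample in the shape described above
response = {'Users': [
    {'Username': 'username',
     'Attributes': [{'Name': 'sub', 'Value': 'userSub'},
                    {'Name': 'email', 'Value': 'email'}]},
]}

user_list = [
    {'username': user['Username'],
     **{attr['Name']: attr['Value'] for attr in user['Attributes']}}
    for user in response['Users']
]
# -> [{'username': 'username', 'sub': 'userSub', 'email': 'email'}]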
I have this data
data = [
    {
        'id': 'abcd738asdwe',
        'name': 'John',
        'mail': 'test@test.com',
    },
    {
        'id': 'ieow83janx',
        'name': 'Jane',
        'mail': 'test@foobar.com',
    }
]
The ids are unique; it's impossible for multiple dictionaries to have the same id.
For example I want to get the item with the id "ieow83janx".
My current solution looks like this:
search_id = 'ieow83janx'
item = [x for x in data if x['id'] == search_id][0]
Do you think that's the best solution, or does anyone know an alternative?
Since the ids are unique, you can store the items in a dictionary to achieve O(1) lookup.
lookup = {ele['id']: ele for ele in data}
then you can do
user_info = lookup[user_id]
to retrieve it
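If the id might be missing, dict.get avoids the KeyError:
user_info = lookup.get(user_id)  # None when the id is not present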
If you are going to perform this kind of operation more than once on this particular object, I would recommend translating it into a dictionary with the id as the key.
data = [
    {
        'id': 'abcd738asdwe',
        'name': 'John',
        'mail': 'test@test.com',
    },
    {
        'id': 'ieow83janx',
        'name': 'Jane',
        'mail': 'test@foobar.com',
    }
]
data_dict = {item['id']: item for item in data}
#=> {'ieow83janx': {'mail': 'test@foobar.com', 'id': 'ieow83janx', 'name': 'Jane'}, 'abcd738asdwe': {'mail': 'test@test.com', 'id': 'abcd738asdwe', 'name': 'John'}}
data_dict['ieow83janx']
#=> {'mail': 'test@foobar.com', 'id': 'ieow83janx', 'name': 'Jane'}
In this case, the lookup operation will cost you constant O(1) time instead of O(N).
How about the next() built-in function (docs):
>>> data = [
... {
... 'id': 'abcd738asdwe',
... 'name': 'John',
... 'mail': 'test@test.com',
... },
... {
... 'id': 'ieow83janx',
... 'name': 'Jane',
... 'mail': 'test@foobar.com',
... }
... ]
>>> search_id = 'ieow83janx'
>>> next(x for x in data if x['id'] == search_id)
{'id': 'ieow83janx', 'name': 'Jane', 'mail': 'test@foobar.com'}
EDIT:
It raises StopIteration if no match is found; you can catch that to handle the absence explicitly:
>>> search_id = 'does_not_exist'
>>> try:
...     next(x for x in data if x['id'] == search_id)
... except StopIteration:
...     print('Handled absence!')
...
Handled absence!
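Alternatively, and this is an addition to the answer above, next() accepts a default value, which avoids the try/except entirely:
>>> next((x for x in data if x['id'] == search_id), None) is None
True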
Without creating a new dictionary or writing several lines of code, you can simply use the built-in filter function together with next to get the item lazily; it stops checking as soon as it finds the match.
next(filter(lambda d: d['id'] == search_id, data))
should work just fine.
Would this not achieve your goal?
for i in data:
    if i.get('id') == 'ieow83janx':
        print(i)
(xenial)vash@localhost:~/python$ python3.7 split.py
{'id': 'ieow83janx', 'name': 'Jane', 'mail': 'test@foobar.com'}
Using comprehension:
[i for i in data if i.get('id') == 'ieow83janx']
if any(item['id'] == 'ieow83janx' for item in data):
    # return item
The any function returns True if the iterable (a list of dictionaries in your case) contains a matching value.
With a generator expression there is no need to create an intermediate list. Since there are no duplicate values for id in the list of dictionaries, any stops iterating as soon as the condition returns true, i.e. the generator expression short-circuits. A list comprehension would build the entire list in memory, whereas the generator expression produces elements on the fly, which uses less memory when you have many items.
How do I merge the JSON data rows as shown below using the merge function below with pyspark?
Note: Assume this is just a minimal example and I have thousands of rows of data to merge. What is the most performant solution? For better or for worse, I must use pyspark.
Input:
data = [
    {'timestamp': '20080411204445', 'address': '100 Sunder Ct', 'name': 'Joe Schmoe'},
    {'timestamp': '20040218165319', 'address': '100 Lee Ave', 'name': 'Joe Schmoe'},
    {'timestamp': '20120309173318', 'address': '1818 Westminster', 'name': 'John Doe'},
    ... More ...
]
Desired Output:
combined_result = [
    {'name': 'Joe Schmoe', 'addresses': [('20080411204445', '100 Sunder Ct'), ('20040218165319', '100 Lee Ave')]},
    {'name': 'John Doe', 'addresses': [('20120309173318', '1818 Westminster')]},
    ... More ...
]
Merge function:
def reduce_on_name(a, b):
    '''Combines two JSON data rows based on name'''
    merged = {}
    if a['name'] == b['name']:
        addresses = (a['timestamp'], a['address']), (b['timestamp'], b['address'])
        merged['name'] = a['name']
        merged['addresses'] = addresses
    return merged
I think it would be something like this:
(sc.parallelize(data)
   .groupBy(lambda x: x['name'])
   .map(lambda t: {'name': t[0],
                   'addresses': [(x['timestamp'], x['address']) for x in t[1]]})
   .collect())
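A possible variation (my sketch, not part of the original answer): when individual names have many rows, a reduceByKey-style aggregation combines values map-side before the shuffle, which can be cheaper than groupBy:
combined_result = (
    sc.parallelize(data)
      .map(lambda x: (x['name'], [(x['timestamp'], x['address'])]))
      .reduceByKey(lambda a, b: a + b)
      .map(lambda kv: {'name': kv[0], 'addresses': kv[1]})
      .collect()
)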
All right, using maxymoo's example, I put together my own reusable code. It's not exactly what I was looking for, but it gets me closer to how I want to solve this particular problem: without lambdas and with reusable code.
#!/usr/bin/env pyspark
# -*- coding: utf-8 -*-

data = [
    {'timestamp': '20080411204445', 'address': '100 Sunder Ct', 'name': 'Joe Schmoe'},
    {'timestamp': '20040218165319', 'address': '100 Lee Ave', 'name': 'Joe Schmoe'},
    {'timestamp': '20120309173318', 'address': '1818 Westminster', 'name': 'John Doe'},
]


def combine(field):
    '''Returns a function which reduces on a specific field

    Args:
        field(str): data field to use for merging

    Returns:
        func: returns a function which supplies the data for the field
    '''
    def _reduce_this(data):
        '''Returns the field value using data'''
        return data[field]
    return _reduce_this


def aggregate(*fields):
    '''Merges data based on a list of fields

    Args:
        fields(list): a list of fields that should be used as a composite key

    Returns:
        func: a function which does the aggregation
    '''
    def _merge_this(iterable):
        name, iterable = iterable
        new_map = dict(name=name, window=dict(max=None, min=None))
        for data in iterable:
            for field, value in data.iteritems():
                if field in fields:
                    new_map[field] = value
                else:
                    new_map.setdefault(field, set()).add(value)
        return new_map
    return _merge_this


# sc provided by pyspark context
combined = sc.parallelize(data).groupBy(combine('name'))
reduced = combined.map(aggregate('name'))
output = reduced.collect()
I have these models:
class Sub(EmbeddedDocument):
    name = StringField()

class Main(Document):
    subs = ListField(EmbeddedDocumentField(Sub))
When I use this query, it returns all of the Main data, but I only need the subs whose name is 'foo'.
query: Main.objects(__raw__={'subs': {'$elemMatch': {'name': 'foo'}}})
For example with this data:
{
    subs: [
        {'name': 'one'},
        {'name': 'two'},
        {'name': 'foo'},
        {'name': 'bar'},
        {'name': 'foo'}
    ]
}
The result must be:
{
    subs: [
        {'name': 'foo'},
        {'name': 'foo'}
    ]
}
Note that in the MongoDB client, that query returns these values.
If you are allowed to change your data model, then try this:
class Main(Document):
    subs = ListField(StringField())

Main.objects.filter(subs__ne="foo")
I propose this approach assuming that the embedded document only has one field, in which case the embedded document is redundant.
MongoEngine provides the .aggregate(*pipeline, **kwargs) method, which runs an aggregation pipeline.
MongoDB 3.2 or newer
match = {"$match": {"subs.name": "foo"}}
project = {'$project': {'subs': {'$filter': {'as': 'sub',
                                             'cond': {'$eq': ['$$sub.name', 'foo']},
                                             'input': '$subs'}}}}
pipeline = [match, project]
Main.objects.aggregate(*pipeline)
MongoDB version <= 3.0
redact = {'$redact': {'$cond': [{'$or': [{'$eq': ['$name', 'foo']}, {'$not': '$name'}]},
                                '$$DESCEND',
                                '$$PRUNE']}}
pipeline = [match, redact]
Main.objects.aggregate(*pipeline)
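In either case (a usage note added here, not part of the original answer), aggregate() returns a cursor of plain dicts rather than Main documents, so the results are consumed like this:
for doc in Main.objects.aggregate(*pipeline):
    print(doc['subs'])  # e.g. [{'name': 'foo'}, {'name': 'foo'}]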