MongoDB - Update different arrays simultaneously with update_many() - python

First, some background.
I have a function in Python which consults an external API to retrieve some information associated with an ID. The function takes an ID as its argument and returns a list of numbers (they correspond to metadata associated with that ID).
For example, suppose we pass the IDs {0001, 0002, 0003} to the function. Let's say that the function returns the following arrays for each ID:
0001 → [45,70,20]
0002 → [20,10,30,45]
0003 → [10,45]
My goal is to implement a collection which structures the data as follows:
{
"_id":45,
"list":[0001,0002,0003]
},
{
"_id":70,
"list":[0001]
},
{
"_id":20,
"list":[0001,0002]
},
{
"_id":10,
"list":[0002,0003]
},
{
"_id":30,
"list":[0002]
}
As can be seen, I want my collection to index the information by the metadata itself. With this structure, the document with _id 45 contains a list of all the IDs that have metadata 45 associated with them. This way I can retrieve, with a single request to the collection, all IDs mapped to a particular metadata value.
The class method in charge of inserting IDs and metadata in the collection is the following:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    for data in metadataVector:
        self.SegmentDB.update_one(
            filter={"_id": data},
            update={"$addToSet": {"list": id}},
            upsert=True
        )
    end = time.time()
    duration = end - start
    return duration
metadataVector is the list which contains all the metadata (integers) associated with a given ID (e.g. [45,70,20]).
id is the ID associated with the metadata in metadataVector (e.g. 0001).
This method currently iterates through the list and performs an operation for every element (every metadata value) in the list. This method implements the collection I want: it updates the document whose "_id" is a given metadata value and adds to its list the ID from which that metadata originated (if such a document doesn't exist yet, it inserts it - that is what upsert=True is for).
However, this implementation ends up being somewhat slow in the long run. metadataVector usually has around 1000-3000 items for each ID (metadata integers which can range from 800 to 23,000,000), and I have around 40,000 IDs to analyze. As a result, the collection grows quickly. At the moment, I have around 3.2 million documents in the collection (one dedicated to each individual metadata integer). I would like to implement a faster solution; if possible, I would like to insert all the metadata in a single DB request instead of calling an update for each item in metadataVector individually.
I tried this approach but it doesn't seem to work as I intended:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    self.SegmentDB.update_many(
        filter={"_id": {"$in": metadataVector}},
        update={"$addToSet": {"list": id}},
        upsert=True
    )
    end = time.time()
    duration = end - start
    return duration
I tried using update_many (as it seemed the natural way to tackle the problem), specifying a filter which, to my understanding, states "any document whose _id is in metadataVector". That way, every document involved would add the originating ID to its list (or the document would be created if it didn't exist, thanks to the upsert option). But instead the collection ends up being filled with documents containing a single element in the list and an ObjectId() as _id.
Picture showing the final result.
Is there a way to implement what I want? Should I restructure the DB differently altogether?
Thanks a lot in advance!

Here is an example, and it uses Bulk Write operations. Bulk operations submit multiple inserts, updates and deletes (it can be a combination) as a single call to the database and return a single result. This is more efficient than multiple single calls to the database.
Scenario 1:
Input: 3 -> [10, 45]
def some_fn(id):
    # id = 3; and after some process... returns a dictionary
    return { 10: [ 3 ], 45: [ 3 ], }
Scenario 2:
Input (as a list):
3 -> [10, 45]
1 -> [45, 70, 20]
def some_fn(ids):
    # ids are 1 and 3; and after some process... returns a dictionary
    return { 10: [ 3 ], 45: [ 3, 1 ], 20: [ 1 ], 70: [ 1 ] }
Perform Bulk Write
Now, perform the bulk operation on the database using the returned value from some_fn.
from pymongo import UpdateOne

data = some_fn(id)  # or some_fn(ids)
requests = []
for k, v in data.items():
    op = UpdateOne({ '_id': k }, { '$push': { 'list': { '$each': v }}}, upsert=True)
    requests.append(op)
result = db.collection.bulk_write(requests, ordered=False)
Note the ordered=False - this option is used, again, for better performance: with an unordered bulk write the server can apply the operations in parallel, and it keeps processing the remaining operations if one of them fails.
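Applied to the question's add_entries method, the same idea might look like the sketch below; it assumes the same self.SegmentDB collection and the pymongo driver, and it keeps $addToSet from the original code (untested):
from pymongo import UpdateOne

def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    # one UpdateOne per metadata value, sent to the server in a single call
    requests = [
        UpdateOne({"_id": data}, {"$addToSet": {"list": id}}, upsert=True)
        for data in metadataVector
    ]
    self.SegmentDB.bulk_write(requests, ordered=False)
    end = time.time()
    return end - start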
References:
collection.bulk_write

Related

Get list value by comparing values

I have a list like this:
data.append(
    {
        "type": type,
        "description": description,
        "amount": 1,
    }
)
Every time there is a new object I want to check whether there is already an entry in the list with the same description. If there is, I need to add 1 to the amount.
How can I do this most efficiently? Is the only way to go through all the entries?
I suggest making data a dict and using the description as a key.
If you are concerned about the efficiency of using the string as a key, read this: efficiency of long (str) keys in python dictionary.
Example:
data = {}
while loop():  # your code here
    existing = data.get(description)
    if existing is None:
        data[description] = {
            "type": type,
            "description": description,
            "amount": 1,
        }
    else:
        existing["amount"] += 1
In either case you should first benchmark the two solutions (the other one being the iterative approach) before reaching any conclusions about efficiency.
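If you want a quick, rough benchmark, a sketch with timeit could look like this (the sample descriptions and iteration count are made up for illustration):
import timeit

setup = "descriptions = [str(i % 100) for i in range(10000)]"

dict_version = """
data = {}
for d in descriptions:
    existing = data.get(d)
    if existing is None:
        data[d] = {"description": d, "amount": 1}
    else:
        existing["amount"] += 1
"""

list_version = """
data = []
for d in descriptions:
    for entry in data:
        if entry["description"] == d:
            entry["amount"] += 1
            break
    else:
        data.append({"description": d, "amount": 1})
"""

print("dict:", timeit.timeit(dict_version, setup, number=100))
print("list:", timeit.timeit(list_version, setup, number=100))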

Combine and Subtract items from different Collections

Excuse my ignorance, I am new to MongoDB. I have three collections, where one is a superset of the other two, whose elements do not overlap. Each item is distinguished by a unique string id. What I want is to get the items of the superset that are not included in the other two collections. Could you please give me some hints on how to do this efficiently?
Thanks.
EDIT:
Superset structure:
{ "_id" : 1, "str_id" : "ABC1fd3fsewer", "date": "a day" }
Subset 1 structure: { "_id" : 1, "str_id" : "ABre1fd3fsewer", "description" : "product" }
Subset 2 structure: { "_id" : 1, "str_id" : "ABC1fd3fsewfe"}
Each collection has a different structure but they all have a common field, str_id.
EDIT: Improved following @Neel's suggestion
I have the following format:
parent = [{'str_id':'a', 'tag1':'parent_random', 'tag2': 'parent_random', 'tag3':'parent_random'},{'str_id':'b',...},{'str_id':'c',...},{'str_id':'d',...}...]
child1 = [{'str_id':'a', 'tag2': 'child1_random'},{'str_id':'b', 'tag2': 'child1_random'}]
child2 = [{'str_id':'c', 'tag1':'child2_random'}]
and I want
outcome = [{'str_id':'c', 'tag1':'parent_random', 'tag2': 'parent_random', 'tag3':'parent_random'},{'str_id':'d', 'tag1':'parent_random', 'tag2': 'parent_random', 'tag3':'parent_random'}]
It sounds like you'll need an aggregate operation.
This document might help you:
Lookup in an array
You can do multiple lookups with one aggregate operation so you can check both the subset collections.
I am going to assume you are working with a REST API and that the client is sending a request for a subset of documents from the superset collection. You can send the array of documents you want to check from superset from the client then:
1 - match all the documents in superset to the array of documents you're sending
2 - unwind your superset document array
3 - lookup each subset collection on the "str_id" field and store the result in a field, like "subset_one_results" (and "subset_two_results" for the second lookup).
4 - do a match operation that requires both subset result fields to be empty arrays... this will match all superset documents that are contained in neither subset1 nor subset2. For example:
$match: { $and: [ { "subset_one_results": { $eq: [] } }, { "subset_two_results": { $eq: [] } } ] }
5 - group them in a new array if you want to return them as an array to the client.
To increase the performance of your operations, you have to determine how often this request will be made. If it is likely to be frequent, be sure to create an index on the field that will be queried, if it is not an ObjectId field. I can't tell from your code whether you are using a custom string field or an ObjectId, which is why I'm bringing up this point.
I don't know what you're using for making your queries (pure MongoDB query language, driver, etc.) so I am not sure how to answer with code hence delineating the steps up above.
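For illustration only, here is a rough sketch of such a pipeline in pymongo; the collection names superset, subset1 and subset2 are assumptions, and the initial match/unwind and final group steps are omitted for brevity:
pipeline = [
    # attach matching subset1 documents to each superset document
    {"$lookup": {"from": "subset1", "localField": "str_id",
                 "foreignField": "str_id", "as": "subset_one_results"}},
    # attach matching subset2 documents
    {"$lookup": {"from": "subset2", "localField": "str_id",
                 "foreignField": "str_id", "as": "subset_two_results"}},
    # keep only superset documents that matched nothing in either subset
    {"$match": {"subset_one_results": {"$eq": []},
                "subset_two_results": {"$eq": []}}},
    # drop the helper arrays from the output
    {"$project": {"subset_one_results": 0, "subset_two_results": 0}},
]
outcome = list(db.superset.aggregate(pipeline))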

Boto3 - Can I get items with batch_get_item whose sort key begins with a variable?

I'm using boto3 to deal with AWS Dynamo DB.
I'd like to query once to get items whose sort key begins with a given prefix, across two partition keys.
I've read the documentation many times, but the examples are all about getting items that exactly match the partition key and sort key.
I know it's possible for one partition key to get items whose sort key begins with ABC_.
response = table.query(
    KeyConditionExpression=Key('partition_key').eq(partition_key1) &
                           Key('sort_key').begins_with('ABC_')
)
response2 = table.query(
    KeyConditionExpression=Key('partition_key').eq(partition_key2) &
                           Key('sort_key').begins_with('ABC_')
)
But is it also possible to query once to get multiple items in two partition keys whose sort key begins with ABC_?
response = dynamodb.batch_get_item(
    RequestItems={
        'test_table': {
            'Keys': [
                {
                    'partition_key': partition_key1,
                    'sort_key': 'ABC_1',  # begins with 'ABC_'
                },
                {
                    'partition_key': partition_key2,
                    'sort_key': 'ABC_2',  # begins with 'ABC_'
                },
            ],
        }
    }
)
No, you will need to use two queries. Execute them in parallel if you need a faster response.
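If you do run them in parallel, a sketch with a thread pool might look like this; test_table, partition_key1 and partition_key2 come from the question, and a separate resource is created per call because boto3 resources are not thread-safe:
import boto3
from concurrent.futures import ThreadPoolExecutor
from boto3.dynamodb.conditions import Key

def query_partition(pk):
    # create a fresh resource in each thread; boto3 resources are not thread-safe
    table = boto3.resource('dynamodb').Table('test_table')
    response = table.query(
        KeyConditionExpression=Key('partition_key').eq(pk) &
                               Key('sort_key').begins_with('ABC_')
    )
    return response['Items']

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(query_partition, [partition_key1, partition_key2]))

items = results[0] + results[1]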

Populating a variably nested dictionary with multiple API calls

I am using a public API at www.gpcontract.co.uk to populate a large variably nested dictionary representing a hierarchy of UK health organisations.
Some background information
The top level of the hierarchy is the four UK countries (England, Scotland, Wales and Northern Ireland), then regional organisations all the way down to individual clinics. The depth of the hierarchy is different for each of the countries and can change depending on the year. Each organisation has a name, orgcode and dictionary listing its child organisations.
Unfortunately, the full nested hierarchy is not available from the API, but calls to http://www.gpcontract.co.uk/api/children/[organisation code]/[year] will return the immediate child organisations of any given organisation.
So that the hierarchy can be easily navigated in my app, I want to generate an offline dictionary of this full hierarchy (on a per year basis) which will be saved using pickle and bundled with the app.
Getting this means a lot of API calls, and I am having trouble converting the returned JSON into the dictionary object I require.
Here is an example of one tiny part of the hierarchy (I have only shown a single child organisation as an example).
JSON hierarchy example
{
    "eng": {
        "name": "England",
        "orgcode": "eng",
        "children": {}
    },
    "sco": {
        "name": "Scotland",
        "orgcode": "sco",
        "children": {}
    },
    "wal": {
        "name": "Wales",
        "orgcode": "wal",
        "children": {}
    },
    "nir": {
        "name": "Northern Ireland",
        "orgcode": "nir",
        "children": {
            "blcg": {
                "name": "Belfast Local Commissioning Group",
                "orgcode": "blcg",
                "children": {
                    "abc": {
                        "name": "Random Clinic",
                        "orgcode": "abc",
                        "children": {}
                    }
                }
            }
        }
    }
}
Here’s the script I’m using to make the API calls and populate the dictionary:
My script
import json, pickle, urllib.request, urllib.error, urllib.parse

# Organisation hierarchy may vary between years. Set the year here.
year = 2017

# This function returns a list containing a dictionary for each child organisation with keys for name and orgcode
def get_child_orgs(orgcode, year):
    orgcode = str(orgcode)
    year = str(year)
    # Correct 4-digit year to 2-digit
    if len(year) > 2:
        year = year[2:]
    try:
        child_data = json.loads(urllib.request.urlopen('http://www.gpcontract.co.uk/api/children/' + str(orgcode) + '/' + year).read())
        output = []
        if child_data != []:
            for item in child_data['children']:
                output.append({'name' : item['name'], 'orgcode' : str(item['orgcode']).lower(), 'children' : {}})
        return output
    except urllib.error.HTTPError:
        print('HTTP error!')
    except:
        print('Other error!')

# I start with a template of the top level of the hierarchy and then populate it
hierarchy = {'eng' : {'name' : 'England', 'orgcode' : 'eng', 'children' : {}}, 'nir' : {'name' : 'Northern Ireland', 'orgcode' : 'nir', 'children' : {}}, 'sco' : {'name' : 'Scotland', 'orgcode' : 'sco', 'children' : {}}, 'wal' : {'name' : 'Wales', 'orgcode' : 'wal', 'children' : {}}}

print('Loading data...\n')

# Here I use nested for loops to make API calls and populate the dictionary down the levels of the hierarchy. The bottom level contains the most items.
for country in ('eng', 'nir', 'sco', 'wal'):
    for item1 in get_child_orgs(country, year):
        hierarchy[country]['children'][item1['orgcode']] = item1
        for item2 in get_child_orgs(item1['orgcode'], year):
            hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']] = item2
            # Only England and Wales hierarchies go deeper than this
            if country in ('eng', 'wal'):
                level3 = get_child_orgs(item2['orgcode'], year)
                # Check not empty array
                if level3 != []:
                    for item3 in level3:
                        hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']] = item3
                        level4 = get_child_orgs(item3['orgcode'], year)
                        # Check not empty array
                        if level4 != []:
                            for item4 in level4:
                                hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']]['children'][item4['orgcode']] = item4

# Save the completed hierarchy with pickle
file_name = 'hierarchy_' + str(year) + '.dat'
with open(file_name, 'wb') as out_file:
    pickle.dump(hierarchy, out_file)

print('Success!')
The problem
This seems to work most of the time, but it feels hacky and sometimes crashes when a nested for loop raises a "'NoneType' object is not iterable" error. I realise this makes a lot of API calls and takes several minutes to run, but I cannot see a way around that, as I want the completed hierarchy available offline so the user can search the data quickly. I will then use the API in a slightly different way to get the actual healthcare data for the chosen organisation.
My question
Is there a cleaner and more flexible way to do this that would accommodate the variable nesting of the organisation hierarchy?
Is there a way to do this significantly more quickly?
I am relatively inexperienced with JSON so any help would be appreciated.
I think this question may be better suited over on the Code Review Stack Exchange, but as you mention that your code sometimes crashes and returns NoneType errors I'll give it the benefit of the doubt.
Looking at your description, this is what stands out to me
Each organisation has a name, orgcode and dictionary listing its child organisations. [API calls] will return the immediate child organisations of any other.
So, what this suggests to me (and how it looks in your sample data) is that all your data is exactly equivalent; the hierarchy only exists due to the nesting of the data and is not enforced by the format of any particular node.
This, consequently, means that you should be able to have a single piece of code which handles the nesting of an infinitely (or arbitrarily, if you prefer) deep tree. Obviously, you do this for the API call itself (get_child_orgs()), so just replicate that for constructing the tree.
def populate_hierarchy(organization, year):
    """ Recursively Populate the Organization Hierarchy

    organization should be a dict with an "orgcode" key with a string value
    and a "children" key with a dict value.
    year should be a 2-4 character string representing a year.
    """
    orgcode = organization['orgcode']
    ## get_child_orgs returns a list of organizations
    children = get_child_orgs(orgcode, year)
    ## get_child_orgs returns None on Errors
    if children:
        for child in children:
            ## Add child to the current organization's children, using
            ## orgcode as its key
            organization['children'][child['orgcode']] = child
            ## Recursively populate the child's sub-hierarchy
            populate_hierarchy(child, year)
    ## Technically, the way this is written, returning organization is
    ## pointless because we're modifying organization in place, but I'm
    ## doing it anyway to explicitly denote the end of the function
    return organization

for country in hierarchy.values():
    populate_hierarchy(country, year)
It's worth noting (since you were checking for empty lists prior to iterating in your original code) that for x in y still functions correctly if y is an empty list, so you don't need to check.
The NoneType error likely arises because you catch the error in get_child_orgs and then implicitly return None. Therefore, for example, level3 = get_child_orgs(etc...) results in level3 = None; this makes if None != []: on the next line True, and then you try to iterate over None with for item3 in None:, which raises the error. As noted in the code above, this is why I check the truthiness of children instead.
As for whether this can be done more quickly, you can try working with the threading/multiprocessing modules. I just don't know how profitable either of those will be for three reasons:
I haven't tried out the API, so I don't know how much time you have to gain from implementing multiple threads/processes
I have seen APIs which time out or throttle requests from IP addresses when you query too quickly/too often (which would make the implementation pointless)
You say you're only running this process once per year, so runtime in perspective of a full year seems pretty insignificant (obviously, unless the current API calls are taking literal days to complete)
Finally, I would simply question whether pickle is the appropriate method of storing the information, or if you wouldn't just be better off using json.dump/load (for the record, the json module doesn't care if you change the extension to .dat if you're partial to that extension name).
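For example, a minimal sketch of swapping pickle for json at the end of the script (the .json extension is just a choice):
import json

file_name = 'hierarchy_' + str(year) + '.json'
with open(file_name, 'w') as out_file:
    json.dump(hierarchy, out_file)

# ...and to load it back later:
with open(file_name) as in_file:
    hierarchy = json.load(in_file)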

Python using array elements as indices in multidimension array

I'm parsing results from a SOAP query (wsdl) and getting the results in an array of arrays like:
[(item){
item[] =
(item){
key[] =
"cdr_id",
value[] =
"201407000000000431",
},
(item){
key[] =
"cdr_date",
value[] =
"2014-07-07 07:47:12",
},
... (snipped for brevity - 81 items in total)
(item){
key[] =
"extradata",
value[] = <empty>
},
}]
I need to extract a single value that corresponds to a particular key.
I have encountered 2 issues:
How to map keys to values? (Otherwise I nest for loops over result.item[][])
How to return value as integer?
I am quite new to python, so sorry in advance if question seems too simple.
My current code looks a bit like:
success_calls = client.service.selectRowset(tables[table], sfilter, None, None, None)[1]
total_time = calls_num = 0
for call in success_calls:
    for key in range(len(call.item)):
        if call.item[key][0] is "elapsed_time":
            item_value = call.item[key][1]
            total_time += int(item_value)
You can create a dictionary for all these values so you can access specific values by a specific key. Dictionaries are Python's implementation of hash tables. Note that this only makes sense if you want to access the dictionary multiple times, because you need to run through your array once during the creation of the hash table and calculate a hash value for every entry. If you only want to extract a single value, the best solution is to run through the array as you already do in your code example. In the worst case the complexity is O(n).
You can easily parse a string to an integer in Python using int("12345"). But be careful: this raises a ValueError exception if your string is not parseable as an integer. If your elapsed_time is a real number rather than a natural one, you can parse it using float("22.2123").
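As a rough sketch, assuming each entry of call.item can be indexed as a (key, value) pair the way the question's code does, the dictionary approach could look like this:
total_time = calls_num = 0
for call in success_calls:
    # build a key -> value mapping once per call
    call_dict = {item[0]: item[1] for item in call.item}
    elapsed = call_dict.get("elapsed_time")
    if elapsed is not None:
        total_time += int(elapsed)  # use float(elapsed) if the value can be fractional
        calls_num += 1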
