I am using a public API at www.gpcontract.co.uk to populate a large variably nested dictionary representing a hierarchy of UK health organisations.
Some background information
The top level of the hierarchy is the four UK countries (England, Scotland, Wales and Northern Ireland), then regional organisations all the way down to individual clinics. The depth of the hierarchy is different for each of the countries and can change depending on the year. Each organisation has a name, orgcode and dictionary listing its child organisations.
Unfortunately, the full nested hierarchy is not available from the API, but calls to http://www.gpcontract.co.uk/api/children/[organisation code]/[year] will return the immediate child organisations of any given organisation.
So that the hierarchy can be easily navigated in my app, I want to generate an offline dictionary of this full hierarchy (on a per year basis) which will be saved using pickle and bundled with the app.
Building this means a lot of API calls, and I am having trouble converting the returned JSON into the dictionary object I require.
Here is an example of one tiny part of the hierarchy (only a single child organisation is shown as an example).
JSON hierarchy example
{
    "eng": {
        "name": "England",
        "orgcode": "eng",
        "children": {}
    },
    "sco": {
        "name": "Scotland",
        "orgcode": "sco",
        "children": {}
    },
    "wal": {
        "name": "Wales",
        "orgcode": "wal",
        "children": {}
    },
    "nir": {
        "name": "Northern Ireland",
        "orgcode": "nir",
        "children": {
            "blcg": {
                "name": "Belfast Local Commissioning Group",
                "orgcode": "blcg",
                "children": {
                    "abc": {
                        "name": "Random Clinic",
                        "orgcode": "abc",
                        "children": {}
                    }
                }
            }
        }
    }
}
Here’s the script I’m using to make the API calls and populate the dictionary:
My script
import json, pickle, urllib.request, urllib.error, urllib.parse

# Organisation hierarchy may vary between years. Set the year here.
year = 2017

# This function returns a list containing a dictionary for each child organisation with keys for name and orgcode
def get_child_orgs(orgcode, year):
    orgcode = str(orgcode)
    year = str(year)
    # Correct 4-digit year to 2-digit
    if len(year) > 2:
        year = year[2:]
    try:
        child_data = json.loads(urllib.request.urlopen('http://www.gpcontract.co.uk/api/children/' + orgcode + '/' + year).read())
        output = []
        if child_data != []:
            for item in child_data['children']:
                output.append({'name': item['name'], 'orgcode': str(item['orgcode']).lower(), 'children': {}})
        return output
    except urllib.error.HTTPError:
        print('HTTP error!')
    except:
        print('Other error!')

# I start with a template of the top level of the hierarchy and then populate it
hierarchy = {'eng': {'name': 'England', 'orgcode': 'eng', 'children': {}},
             'nir': {'name': 'Northern Ireland', 'orgcode': 'nir', 'children': {}},
             'sco': {'name': 'Scotland', 'orgcode': 'sco', 'children': {}},
             'wal': {'name': 'Wales', 'orgcode': 'wal', 'children': {}}}

print('Loading data...\n')

# Here I use nested for loops to make API calls and populate the dictionary down the levels of the hierarchy. The bottom level contains the most items.
for country in ('eng', 'nir', 'sco', 'wal'):
    for item1 in get_child_orgs(country, year):
        hierarchy[country]['children'][item1['orgcode']] = item1
        for item2 in get_child_orgs(item1['orgcode'], year):
            hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']] = item2
            # Only England and Wales hierarchies go deeper than this
            if country in ('eng', 'wal'):
                level3 = get_child_orgs(item2['orgcode'], year)
                # Check not empty array
                if level3 != []:
                    for item3 in level3:
                        hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']] = item3
                        level4 = get_child_orgs(item3['orgcode'], year)
                        # Check not empty array
                        if level4 != []:
                            for item4 in level4:
                                hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']]['children'][item4['orgcode']] = item4

# Save the completed hierarchy with pickle
file_name = 'hierarchy_' + str(year) + '.dat'
with open(file_name, 'wb') as out_file:
    pickle.dump(hierarchy, out_file)

print('Success!')
The problem
This seems to work most of the time, but it feels hacky and sometimes crashes in one of the nested for loops with a "'NoneType' object is not iterable" error. I realise this makes a lot of API calls and takes several minutes to run, but I cannot see a way around that, as I want the completed hierarchy available offline so the user can search the data quickly. I will then use the API in a slightly different way to get the actual healthcare data for the chosen organisation.
My question
Is there a cleaner and more flexible way to do this that would accommodate the variable nesting of the organisation hierarchy?
Is there a way to do this significantly more quickly?
I am relatively inexperienced with JSON so any help would be appreciated.
I think this question may be better suited to Code Review Stack Exchange, but since you mention that your code sometimes crashes with NoneType errors, I'll give it the benefit of the doubt.
Looking at your description, this is what stands out to me
Each organisation has a name, orgcode and dictionary listing its child organisations. [API calls] will return the immediate child organisations of any other.
So, what this suggests to me (and how it looks in your sample data) is that every node in your data has exactly the same structure; the hierarchy only exists through the nesting of the data and is not enforced by the format of any particular node.
This, consequently, means that you should be able to have a single piece of code which handles the nesting of an infinitely (or arbitrarily, if you prefer) deep tree. Obviously, you do this for the API call itself (get_child_orgs()), so just replicate that for constructing the tree.
def populate_hierarchy(organization, year):
    """ Recursively Populate the Organization Hierarchy

    organization should be a dict with an "orgcode" key with a string value
    and a "children" key with a dict value.
    year should be a 2-4 character string representing a year.
    """
    orgcode = organization['orgcode']
    ## get_child_orgs returns a list of organizations
    children = get_child_orgs(orgcode, year)
    ## get_child_orgs returns None on Errors
    if children:
        for child in children:
            ## Add child to the current organization's children, using
            ## orgcode as its key
            organization['children'][child['orgcode']] = child
            ## Recursively populate the child's sub-hierarchy
            populate_hierarchy(child, year)
    ## Technically, the way this is written, returning organization is
    ## pointless because we're modifying organization in place, but I'm
    ## doing it anyway to explicitly denote the end of the function
    return organization

for country in hierarchy.values():
    populate_hierarchy(country, year)
It's worth noting (since you were checking for empty lists prior to iterating in your original code) that for x in y still functions correctly if y is an empty list, so you don't need to check.
The NoneType error likely arises because you catch the error in get_child_orgs and then implicitly return None. Therefore, for example, level3 = get_child_orgs(...) results in level3 = None; the if None != []: check on the next line is then True, and you end up trying to iterate over None with for item3 in None:, which raises the error. As noted in the code above, this is why I check the truthiness of children.
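A minimal fix that keeps your original structure is to make get_child_orgs return an empty list from its error paths, so every caller can iterate over the result unconditionally. A sketch (same request logic as in the question, only the returns are made explicit):

import json
import urllib.error
import urllib.request

def get_child_orgs(orgcode, year):
    # Same behaviour as the question's version, except that the error
    # branches return an empty list instead of the implicit None.
    orgcode = str(orgcode)
    year = str(year)
    if len(year) > 2:      # correct 4-digit year to 2-digit
        year = year[2:]
    try:
        child_data = json.loads(urllib.request.urlopen(
            'http://www.gpcontract.co.uk/api/children/' + orgcode + '/' + year).read())
        if child_data:
            return [{'name': item['name'],
                     'orgcode': str(item['orgcode']).lower(),
                     'children': {}}
                    for item in child_data['children']]
    except urllib.error.HTTPError:
        print('HTTP error!')
    except Exception as err:
        print('Other error!', err)
    return []   # callers can now always iterate safely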
As for whether this can be done significantly more quickly, you can try working with the threading/multiprocessing modules (a rough sketch follows the list below). I just don't know how profitable either of those will be, for three reasons:
I haven't tried out the API, so I don't know how much time you stand to gain from using multiple threads/processes.
I have seen APIs that throttle or reject requests from IP addresses that query too quickly or too often (which would make the implementation pointless).
You say you're only running this process once per year, so the runtime, put in the perspective of a full year, seems pretty insignificant (unless, obviously, the current API calls are taking literal days to complete).
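If you do want to experiment with it, here is a rough, untested sketch using concurrent.futures to fetch the children of several organisations at once; the function and argument names are just illustrative, and it assumes the get_child_orgs function from the question (or the empty-list variant above):

from concurrent.futures import ThreadPoolExecutor

def fetch_children_parallel(organizations, year, max_workers=8):
    # organizations: a list of dicts with 'orgcode' and 'children' keys.
    # Fires one API request per organisation, up to max_workers at a time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda org: get_child_orgs(org['orgcode'], year),
                           organizations)
    # Attach each batch of children to its parent organisation.
    for org, children in zip(organizations, results):
        for child in (children or []):
            org['children'][child['orgcode']] = child

Whether this actually helps depends entirely on how the API responds to concurrent requests.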
Finally, I would simply question whether pickle is the appropriate way to store the information, or whether you wouldn't be better off using json.dump/json.load (for the record, the json module doesn't care if you change the extension to .dat, if you're partial to that extension name).
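For example, swapping pickle for json would look roughly like this (the .json file name is just a suggestion; hierarchy and year are the same objects as in your script):

import json

file_name = 'hierarchy_' + str(year) + '.json'
with open(file_name, 'w') as out_file:
    json.dump(hierarchy, out_file)

# ...and later, in the app:
with open(file_name) as in_file:
    hierarchy = json.load(in_file)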
Related
I am currently querying Bugzilla as follows:
r = requests.get(
    "https://bugzilla.mozilla.org/rest/bug",
    params={
        "chfield": "[Bug creation]",
        "chfieldfrom": "2015-01-01",
        "chfieldto": "2016-01-01",
        "resolution": "FIXED",
        "limit": 200,
        "api_key": api_key,
        "include_fields": [
            "id",
            "description",
            "creation_time",
        ],
    },
)
and all I would like to add to my query is a method for ordering the bug reports. I have scoured the web for a method for ordering these results: ultimately, I would like them to be ordered from "2016-01-01" descending. I have tried adding the following key-value pairs to params:
"order": "creation_time desc"
"sort_by": "creation_time", "order" : "desc"
"chfieldorder": "desc"
and I've tried editing the URL to be https://bugzilla.mozilla.org/rest/bug?orderBy=creation_time:desc, but none of these approaches has worked. Unfortunately, adding invalid keys fails without error: results are returned, just not in sorted order.
Ordering and ranges (i.e. chfieldfrom and chfieldto) were not in any of the documentation that I found either.
I am aware that a hacked method of gathering ordered results would be to specify a narrow range of dates to get bug reports from, but I'm hoping there exists an actual key-value pair that can be specified to achieve the task.
Notably, of course, sorting the results in r after the request returns is not an option, because the results in r do not contain the most recent bugs.
You need to add
"order": [
"opendate DESC",
],
to your params.
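In context, keeping your other parameters exactly as they are, the request would look roughly like this:

import requests

r = requests.get(
    "https://bugzilla.mozilla.org/rest/bug",
    params={
        # ...your existing chfield/resolution/limit/api_key/include_fields entries...
        "order": [
            "opendate DESC",
        ],
    },
)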
Quick test
To see more easily that it works, just run something like this after you received the response in r:
data = json.loads(r.content)
bugs = data['bugs']
times = [x['creation_time'] for x in bugs]
print(times)
gives:
['2016-01-01T21:53:20Z', '2016-01-01T21:37:58Z', '2016-01-01T20:12:07Z', '2016-01-01T19:29:30Z', '2016-01-01T19:10:46Z', '2016-01-01T15:56:35Z',...
Details
If you are interested in the details: it looks like some fields in the Bugzilla codebase still go by different (older) field names internally.
Take a look here https://github.com/bugzilla/bugzilla/blob/5.2/Bugzilla/Search.pm#L557:
# Backward-compatibility for old field names. Goes new_name => old_name.
# These are here and not in _translate_old_column because the rest of the
# code actually still uses the old names, while the fielddefs table uses
# the new names (which is not the case for the fields handled by
# _translate_old_column).
my %old_names = (
    creation_ts => 'opendate',
    delta_ts    => 'changeddate',
    work_time   => 'actual_time',
);
First, some background.
I have a function in Python which consults an external API to retrieve some information associated with an ID. The function takes an ID as its argument and returns a list of numbers (they correspond to some metadata associated with that ID).
For example, suppose we call the function with the IDs {0001, 0002, 0003} and that it returns the following arrays for each ID:
0001 → [45,70,20]
0002 → [20,10,30,45]
0003 → [10,45]
My goal is to implement a collection which structures the data like so:
{
    "_id": 45,
    "list": [0001, 0002, 0003]
},
{
    "_id": 70,
    "list": [0001]
},
{
    "_id": 20,
    "list": [0001, 0002]
},
{
    "_id": 10,
    "list": [0002, 0003]
},
{
    "_id": 30,
    "list": [0002]
}
As can be seen, I want my collection to index the information by the metadata itself. With this structure, the document with _id 45 contains a list of all the IDs that have metadata 45 associated with them. This way I can retrieve, with a single request to the collection, all the IDs mapped to a particular metadata value.
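For illustration, that single-request lookup would be something like this (pymongo syntax; the database and collection names here are hypothetical, with SegmentDB mirroring the attribute used in the method below):

from pymongo import MongoClient

SegmentDB = MongoClient()["mydb"]["SegmentDB"]   # hypothetical connection/names
doc = SegmentDB.find_one({"_id": 45})
# e.g. {"_id": 45, "list": [1, 2, 3]}  (the IDs are stored as integers)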
The class method in charge of inserting IDs and metadata in the collection is the following:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    for data in metadataVector:
        self.SegmentDB.update_one(
            filter={"_id": data},
            update={"$addToSet": {"list": id}},
            upsert=True
        )
    end = time.time()
    duration = end - start
    return duration
metadataVector is the list containing all the metadata (integers) associated with a given ID (e.g. [45, 70, 20]).
id is the ID associated with the metadata in metadataVector (e.g. 0001).
This method currently iterates through the list and performs an operation for every element (every metadata value) in the list. It implements the collection I want: it updates the document whose "_id" is a given metadata value and adds to its list the ID from which that metadata originated (and if such a document doesn't exist yet, it inserts it - that's what upsert=True is for).
However, this implementation ends up being somewhat slow in the long run. metadataVector usually has around 1000-3000 items for each ID (metadata integers which can range from 800 to 23000000), and I have around 40000 IDs to analyse, so the collection grows quickly. At the moment I have around 3.2 million documents in the collection (one dedicated to each individual metadata integer). I would like to implement a faster solution; if possible, I would like to insert all the metadata in a single DB request instead of calling an update for each item in metadataVector individually.
I tried this approach but it doesn't seem to work as I intended:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    self.SegmentDB.update_many(
        filter={"_id": {"$in": metadataVector}},
        update={"$addToSet": {"list": id}},
        upsert=True
    )
    end = time.time()
    duration = end - start
    return duration
I tried using update_many (as it seemed the natural approach to the problem), specifying a filter which, to my understanding, states "any document whose _id is in metadataVector". That way, every document involved would add the originating ID to its list (or be created if it didn't exist yet, thanks to the upsert option). Instead, the collection ends up being filled with documents containing a single element in the list and an ObjectId() _id.
Picture showing the final result.
Is there a way to implement what I want? Should I restructure the DB differently all together?
Thanks a lot in advance!
Here is an example using Bulk Write operations. A bulk operation submits multiple inserts, updates and deletes (possibly a combination) as a single call to the database and returns a single result. This is more efficient than making many individual calls to the database.
Scenario 1:
Input: 3 -> [10, 45]
def some_fn(id):
    # id = 3; and after some process... returns a dictionary
    return { 10: [ 3 ], 45: [ 3 ], }
Scenario 2:
Input (as a list):
3 -> [10, 45]
1 -> [45, 70, 20]
def some_fn(ids):
    # ids are 1 and 3; and after some process... returns a dictionary
    return { 10: [ 3 ], 45: [ 3, 1 ], 20: [ 1 ], 70: [ 1 ] }
Perform Bulk Write
Now, perform the bulk operation on the database using the returned value from some_fn.
from pymongo import UpdateOne

data = some_fn(id)  # or some_fn(ids)

requests = []
for k, v in data.items():
    op = UpdateOne({ '_id': k }, { '$push': { 'list': { '$each': v } } }, upsert=True)
    requests.append(op)

result = db.collection.bulk_write(requests, ordered=False)
Note the ordered=False - again, this option is used for better performance, as the writes can happen in parallel.
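If you want to verify what the call actually did, the returned BulkWriteResult carries a few counters, for example:

# result comes from the bulk_write call above
print(result.matched_count, result.modified_count, result.upserted_count)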
References:
collection.bulk_write
I create a dictionary, ss, which has a total number of shares for a specific stock. Then I need to do some calculations and create a new dictionary, temp, which I will feed into my .html page for presentation. The update function does not work as I intend because all that is in temp is the last stock, not a listing of all the stocks I have purchased. When I print ss, there is [{key:value, etc}], but when I print temp there is no [ ] around the {key:value, etc}. I think I am missing something basic.
Also my .html page is not reading the temp dictionary as the page is empty. Here is the code:
#app.route("/")
#login_required
def index():
"""Show portfolio of stocks"""
#dictionary to feed data into index
temp={}
#Select from trades all stocks held by this user
ss = db.execute("SELECT SUM(shares), symbol FROM trades WHERE id=? GROUP BY symbol", session["user_id"])
print(ss)
#lookup current price for each stock and create index of all stocks held
for row in ss:
data=lookup(row["symbol"])
totval=row["SUM(shares)"]*data["price"]
temp1={"symbol":data["symbol"], "name":data["name"], "shares":row["SUM(shares)"], "price":data["price"], "total value":totval}
temp.update(temp1)
print(temp)
return render_template("index.html", temp=temp)
Any direction would be great. Thanks.
TL;DR
# Note use of the data["symbol"]
temp.update({
    data["symbol"]: {
        "symbol": data["symbol"],
        "name": data["name"],
        "shares": row["SUM(shares)"],
        "price": data["price"],
        "total value": totval
    }
})
There are two generic manners of referencing related data, lists and dictionaries. In the simplest manner, consider:
apple
orange
pear
Using basic grammatical syntax, we can understand this is a list that "enumerates" its parts; the adjacency is the meaningful relationship between the individual parts. The context and use of the list relate to its external (variable) context.
A dictionary, on the other hand, attaches a specific meaning to each named entry in a list of definitions. Consider:
fruit: apple, orange, pear
Here, fruit is an enumeration of different types of fruit; to define "fruit" is to give a list of qualifying "fruit" names relative to an external (variable) context. Or:
fruit: An edible, usually sweet and fleshy form of such a structure.
Maybe a "real" definition.
So consider how we refer to a list versus a dictionary. To add items to a list, we create a new list by (generally) appending a new item:
apple
orange
pear
+ kiwi
Before we had three, now we have four (contextually).
Whereas we append a new definition by specifying its definition and naming it:
fruit: An edible, usually sweet and fleshy form of such a structure.
+ vegetable: A plant cultivated for its edible parts.
We could, if we want, update fruit by redefining it:
fruit: An edible, usually sweet and fleshy form of such a structure.
vegetable: A plant cultivated for its edible parts.
+ fruit: I like fruit.
Which gives us a dictionary of only its constituent parts:
vegetable: A plant cultivated for its edible parts.
fruit: I like fruit.
Because you can only define (and update) the internal reference (fruit).
In pseudocode, a list:
fruits = []
fruits.add('apple')
fruits.add('orange')
fruits.add('pear')
// ['apple','orange','pear']
Likewise, a definition list works on the "key" relationship, and thus you may add or redefine a key relation:
foods = {}
foods['fruit'] = 'I like fruit.'
foods['vegetables'] = 'Gotta eat your veggies.'
// {
//   fruit: 'I like fruit.',
//   vegetables: 'Gotta eat your veggies.',
// }
In this sense, "updating" a dictionary means redefining and/or providing a new "key" relationship (internally).
Consider:
fruits = []
fruits.append('apple')
fruits.append('orange')
fruits.append('pear')

print(', '.join(fruits))
# apple, orange, pear

foods = {
    'fruits': 'Fruits are good.'
}

# Adding a definition
foods['vegetables'] = 'Gotta eat your veggies!'

# Updating a definition
foods.update({
    'fruits': 'I like fruit!',
    'meats': 'Can haz meat?'
})

for food in foods.values():
    print(food)

# I like fruit!
# Gotta eat your veggies!
# Can haz meat?
https://onlinegdb.com/SkneEJsw_
What you really need, then, are unique keys for your dictionary. Unique in the sense that, within the dictionary's context, one key equals one definition. Which I think will look like this:
# Note use of the data["symbol"]
temp.update({
    data["symbol"]: {
        "symbol": data["symbol"],
        "name": data["name"],
        "shares": row["SUM(shares)"],
        "price": data["price"],
        "total value": totval
    }
})
Or directly:
temp[data["symbol"]] = {
"symbol": data["symbol"],
"name": data["name"],
"shares": row["SUM(shares)"],
"price": data["price"],
"total value": totval
}
Now you're updating your dictionary with meaningfully defined terms that resolve to a specific definition based on a key term.
There already is an individual row for each stock in ss. Remember, key/value pairs can be added to dictionaries quite simply by "declaring" them, e.g. row["totval"] = {value}. Hint: SELECT as much as possible in the SQL, e.g. symbol and name.
When I print ss, there is [{key:value, etc}], but when I print temp there is no [] around the {key:value, etc}. I think I am missing something basic.
I think you're mismatching types, which is a common mistake. I'm not sure what API/package you're using for db.execute, but that method seems to assign a list ([]) to ss. On the other hand, your temp value is a dict, ({}). I suggest one of two solutions.
If render_template expects temp to be a dict instead of a list, try this, as DinoCoderSaurus suggests:
def index():
    # other code here
    for row in ss:
        data = lookup(row["symbol"])
        totval = row["SUM(shares)"] * data["price"]
        # notice omission of data["symbol"] in temp1 assignment
        temp1 = { "name": data["name"], "shares": row["SUM(shares)"], "price": data["price"], "total value": totval }
        # assign temp1 to data["symbol"] key in new dict
        temp[data["symbol"]] = temp1
On the other hand, if render_template expects temp to be a list like ss seems to be, try:
def index():
    # list to feed data into index (note change in data structure)
    temp = []
    # other code
    for row in ss:
        data = lookup(row["symbol"])
        totval = row["SUM(shares)"] * data["price"]
        temp1 = { "symbol": data["symbol"], "name": data["name"], "shares": row["SUM(shares)"], "price": data["price"], "total value": totval }
        temp.append(temp1)
I have a list like this:
data.append(
    {
        "type": type,
        "description": description,
        "amount": 1,
    }
)
Every time there is a new object, I want to check whether there is already an entry in the list with the same description. If there is, I need to add 1 to that entry's amount.
How can I do this most efficiently? Is the only way to go through all the entries?
I suggest making data a dict and using the description as a key.
If you are concerned about the efficiency of using the string as a key, read this: efficiency of long (str) keys in python dictionary.
Example:
data = {}

while loop():  # your code here
    existing = data.get(description)
    if existing is None:
        data[description] = {
            "type": type,
            "description": description,
            "amount": 1,
        }
    else:
        existing["amount"] += 1
In either case you should first benchmark the two solutions (the other one being the iterative approach) before reaching any conclusions about efficiency.
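For reference, a minimal sketch of that iterative approach over the original list (the function name is just illustrative) could be:

def add_or_increment(data, type, description):
    # Scan the list for an existing entry with the same description.
    for entry in data:
        if entry["description"] == description:
            entry["amount"] += 1
            return
    # No match found: append a new entry.
    data.append({"type": type, "description": description, "amount": 1})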
Excuse my ignorance, I am new to MongoDB. I have three collections, where one is a superset of the other two, whose elements do not overlap. Each item is distinguished by a unique string id. What I want is to get the items of the superset that are not included in the other two collections. Could you please give me some hint on how to do this efficiently?
Thanks.
EDIT:
Superset structure:
{ "_id" : 1, "str_id" : "ABC1fd3fsewer", "date": "a day" }
Subset 1 structure: { "_id" : 1, "str_id" : "ABre1fd3fsewer", "description" : "product" }
Subset 2 structure: { "_id" : 1, "str_id" : "ABC1fd3fsewfe"}
Each collection has a different structure, but all have a common field, the str_id.
EDIT: Improved by @Neel's suggestion
I have the following format:
parent = [{'str_id':'a', 'tag1':'parent_random', 'tag2': 'parent_random', 'tag3':'parent_random'},{'str_id':'b',...},{'str_id':'c',...},{'str_id':'d',...}...]
child1 = [{'str_id':'a', 'tag2': 'child1_random'},{'str_id':'b', 'tag2': 'child1_random'}]
child2 = [{'str_id':'c', 'tag1':'child2_random'}]
and I want
outcome = [{'str_id':'c', 'tag1':'parent_random', 'tag2': 'parent_random', 'tag3':'parent_random'},{'str_id':'d', 'tag1':'parent_random', 'tag2': 'parent_random', 'tag3':'parent_random'}]
It sounds like you'll need an aggregate operation.
This document might help you:
Lookup in an array
You can do multiple lookups with one aggregate operation so you can check both the subset collections.
I am going to assume you are working with a REST API and that the client is sending a request for a subset of documents from the superset collection. You can send the array of documents you want to check from superset from the client, and then:
1 - match all the documents in superset to the array of documents you're sending
2 - unwind your superset document array
3 - lookup the subset collections on the "str_id" field and store the results in a field, like "subset_one_results".
4 - do a match operation on both subset result fields that matches an empty array, e.g. on "subset_one_results"... this will match all superset documents that are not contained in subset1.
$match({ $and : [ { "subset_one_results" : { $eq : [] } }, { "subset_two_results" : { $eq : [] } } ] })
5 - group them in a new array if you want to return them as an array to the client.
To increase the performance of your operations, you have to determine how often this request will be made. If it is likely to be frequent, be sure to create an index on the field that will be queried, if it's not an ObjectId field. I can't tell from your code whether you are using a custom string field or an ObjectId, which is why I'm bringing up this point.
I don't know what you're using for making your queries (pure MongoDB query language, driver, etc.) so I am not sure how to answer with code hence delineating the steps up above.
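Purely as an illustration, in pymongo the steps above might look roughly like this (the database and collection names superset, subset1 and subset2 are assumptions based on your description; steps 2 and 5 are optional depending on how you send and return the data):

from pymongo import MongoClient

db = MongoClient()["mydb"]   # hypothetical connection and database name

pipeline = [
    # (1) optionally restrict which superset documents to check, e.g. by str_id
    # {"$match": {"str_id": {"$in": str_ids_from_client}}},
    # (3) look up matches in each subset collection on the str_id field
    {"$lookup": {"from": "subset1", "localField": "str_id",
                 "foreignField": "str_id", "as": "subset_one_results"}},
    {"$lookup": {"from": "subset2", "localField": "str_id",
                 "foreignField": "str_id", "as": "subset_two_results"}},
    # (4) keep only documents that matched nothing in either subset
    {"$match": {"subset_one_results": {"$eq": []},
                "subset_two_results": {"$eq": []}}},
    # drop the helper arrays before returning the documents
    {"$project": {"subset_one_results": 0, "subset_two_results": 0}},
]

outcome = list(db["superset"].aggregate(pipeline))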