I'm trying to update an item in DynamoDB that has a somewhat complicated data structure.
Item:
{
    'user_id': 'abc123',
    'groups': [
        {
            'group_id': 'Group1',
            'games_won': [],
            'games_lost': []
        },
        {
            'group_id': 'Group2',
            'games_won': [],
            'games_lost': []
        }
    ]
}
I am trying to append a string to games_won for a specific group_id. I am trying to use a conditional to avoid multiple DB queries, but I can't seem to figure out how to iterate over groups in my conditional.
Basically, I want to do this:
for g in groups:
    if g.group_id == 'Group2':
        g.games_won.append('game12345')
Sorry for the complicated title. I'm a bit new to DynamoDB and NoSQL in general.
You could read the 'groups' attribute, change the data outside of the query, and when you're done write the whole thing back. That way, no matter how many groups you change, you always have just one read and one write action. The number of read or write capacity units consumed is of course related to the size of your 'groups' attribute.
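A minimal boto3 sketch of that read-modify-write pattern (the table name 'Users' is an assumption; error handling omitted):

import boto3

# 'Users' is an assumed table name for this sketch.
table = boto3.resource('dynamodb').Table('Users')

# One read: fetch the whole item.
item = table.get_item(Key={'user_id': 'abc123'})['Item']

# Modify the data locally, exactly like the loop in the question.
for g in item['groups']:
    if g['group_id'] == 'Group2':
        g['games_won'].append('game12345')

# One write: put the modified 'groups' attribute back.
# '#g' avoids clashing with DynamoDB reserved words.
table.update_item(
    Key={'user_id': 'abc123'},
    UpdateExpression='SET #g = :g',
    ExpressionAttributeNames={'#g': 'groups'},
    ExpressionAttributeValues={':g': item['groups']},
)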
First, some background.
I have a function in Python that consults an external API to retrieve some information associated with an ID. The function takes an ID as its argument and returns a list of numbers (they correspond to some metadata associated with that ID).
For example, suppose we pass the IDs {0001, 0002, 0003} to the function, and it returns the following arrays:
0001 → [45,70,20]
0002 → [20,10,30,45]
0003 → [10,45]
My goal is to implement a collection which structures data as so:
{
    "_id": 45,
    "list": [0001, 0002, 0003]
},
{
    "_id": 70,
    "list": [0001]
},
{
    "_id": 20,
    "list": [0001, 0002]
},
{
    "_id": 10,
    "list": [0002, 0003]
},
{
    "_id": 30,
    "list": [0002]
}
As can be seen, I want the collection to index the information by the metadata itself. With this structure, the document with _id 45 contains a list of all the IDs that have metadata 45 associated with them. This way I can retrieve, with a single request to the collection, all IDs mapped to a particular metadata value.
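For example, fetching all IDs for metadata value 45 is then a single find_one (a sketch using the same SegmentDB collection as the code below):

# Single request: all IDs associated with metadata value 45.
doc = self.SegmentDB.find_one({"_id": 45})
ids = doc["list"] if doc else []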
The class method in charge of inserting IDs and metadata in the collection is the following:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    for data in metadataVector:
        self.SegmentDB.update_one(
            filter={"_id": data},
            update={"$addToSet": {"list": id}},
            upsert=True
        )
    end = time.time()
    duration = end - start
    return duration
metadataVector is the list containing all metadata (integers) associated with a given ID (e.g. [45,70,20]).
id is the ID associated with the metadata in metadataVector (e.g. 0001).
This method currently iterates through the list and performs an operation for every element (every metadata value) in the list. It builds the collection I want: it updates the document whose "_id" is a given metadata value and adds the originating ID to its list (if such a document doesn't exist yet, it inserts it; that's what upsert=True is for).
However, this implementation ends up being somewhat slow in the long run. metadataVector usually has around 1000-3000 items for each ID (metadata integers that can range from 800 to 23000000), and I have around 40000 IDs to analyze, so the collection grows quickly. At the moment I have around 3.2M documents in the collection (one dedicated to each individual metadata integer). I would like a faster solution; if possible, I would like to insert all metadata in a single DB request instead of calling an update for each item in metadataVector individually.
I tried this approach but it doesn't seem to work as I intended:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    self.SegmentDB.update_many(
        filter={"_id": {"$in": metadataVector}},
        update={"$addToSet": {"list": id}},
        upsert=True
    )
    end = time.time()
    duration = end - start
    return duration
I tried using update_many (it seemed the natural approach to tackle the problem) with a filter which, to my understanding, states "any document whose _id is in metadataVector". That way, every matching document would add the originating ID to its list (and, thanks to the upsert condition, a document would be created if it didn't exist yet). Instead, the collection ends up being filled with documents containing a single element in the list and an ObjectId() _id.
Picture showing the final result.
Is there a way to implement what I want? Should I restructure the DB altogether?
Thanks a lot in advance!
Here is an example, and it uses Bulk Write operations. A bulk operation submits multiple inserts, updates, and deletes (they can be combined) as a single call to the database and returns a single result. This is more efficient than making many individual calls to the database.
Scenario 1:
Input: 3 -> [10, 45]
def some_fn(id):
    # id = 3; and after some process... returns a dictionary
    return { 10: 3, 45: 3 }
Scenario 2:
Input (as a list):
3 -> [10, 45]
1 -> [45, 70, 20]
def some_fn(ids):
    # ids are 1 and 3; and after some process... returns a dictionary
    return { 10: [ 3 ], 45: [ 3, 1 ], 20: [ 1 ], 70: [ 1 ] }
Perform Bulk Write
Now, perform the bulk operation on the database using the returned value from some_fn.
from pymongo import UpdateOne

data = some_fn(id)  # or some_fn(ids)

requests = []
for k, v in data.items():
    # $each expects a list, so wrap scalar values (scenario 1) in one.
    values = v if isinstance(v, list) else [v]
    op = UpdateOne({ '_id': k }, { '$push': { 'list': { '$each': values }}}, upsert=True)
    requests.append(op)

result = db.collection.bulk_write(requests, ordered=False)
Note the ordered=False - this option is used, again, for better performance, as the writes can happen in parallel.
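Applied to the add_entries method from the question, a sketch that keeps its original $addToSet semantics:

import time
from pymongo import UpdateOne

def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    # One operation per metadata value, but a single round trip to the server.
    requests = [
        UpdateOne({"_id": data}, {"$addToSet": {"list": id}}, upsert=True)
        for data in metadataVector
    ]
    if requests:
        self.SegmentDB.bulk_write(requests, ordered=False)
    return time.time() - start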
References:
collection.bulk_write
I have code that returns all entries from a table:
all_entries = Entry.objects.all()
and I have the following array:
exclusion_list = [
    {
        "username": "Tom",
        "start_date": "01/03/2019",
        "end_date": "29/02/2020",
    },
    {
        "username": "Mark",
        "start_date": "01/02/2020",
        "end_date": "29/02/2020",
    },
    {
        "username": "Pam",
        "start_date": "01/03/2019",
        "end_date": "29/02/2020",
    }
]
I want to exclude all of Tom's records from "01/03/2019" to "29/02/2020", all of Mark's records from "01/02/2020" to "29/02/2020", and all of Pam's records from "01/03/2019" to "29/02/2020".
I want to do that in a loop, so I believe I should do something like:
for entry in all_entries:
    filtered_entry = all_entries.exclude(username=entry.username).filter(date__gte=entry.start_date, date__lte=entry.end_date)
Is this approach correct? I am new to Django ORM. Is there a better and more efficient solution?
Thank you for your help
Yes, you can do this with a loop.
This results in a query whose WHERE-clause gets extended every cycle of your loop. But to do this, you have to use the filtered queryset of your previous cycle:
filtered_entry = all_entries
for exclude_entry in exclusion_list:
    filtered_entry = filtered_entry.exclude(
        username=exclude_entry["username"],
        date__gte=exclude_entry["start_date"],
        date__lte=exclude_entry["end_date"],
    )
Notes
The same queryset reference is reused, so the results are narrowed further on every loop cycle.
To combine multiple criteria with AND, just pass multiple keyword arguments to exclude() (look into the docs [here][1]).
Be aware that this can result in a large WHERE-clause, and your database may impose limits on it.
So if your exclusion_list is not too big, I think you can use this without concerns.
If your exclusion_list grows, the best option would be to store it in the database itself. That way the ORM can generate subqueries instead of single values. Just an example:
exclusion_query = ExclusionEntry.objects.all().values('username')
filtered = all_entries.exclude(username__in=exclusion_query)
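ExclusionEntry is a hypothetical model here; a minimal sketch of what it could look like:

from django.db import models

# Hypothetical model holding the exclusion rules (all names are assumptions).
class ExclusionEntry(models.Model):
    username = models.CharField(max_length=150)
    start_date = models.DateField()
    end_date = models.DateField()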
[1]: https://docs.djangoproject.com/en/3.1/topics/db/queries/#retrieving-specific-objects-with-filters
I am trying to get a value from JSON data. I have successfully traversed deep into the JSON and almost have what I need!
Running this command in Python :
autoscaling_name = response['Reservations'][0]['Instances'][0]['Tags']
Gives me this :
'Tags': [{'Key': 'Name', 'Value': 'Trove-Dev-Inst : App WebServer'}, {'Key': 'aws:autoscaling:groupName', 'Value': 'CodeDeploy_Ernie-dev-Autoscaling-Deploy_d-4WTRTRTRT'}, {'Key': 'CodeDeployProvisioningDeploymentId', 'Value': 'd-4WTRTRTRT'}, {'Key': 'Environment', 'Value': 'ernie-dev'}]
I only want to get the value "CodeDeploy_Ernie-dev-Autoscaling-Deploy_d-4WTRTRTRT". This is from the key "aws:autoscaling:groupName".
How can I further my command to only return the value "CodeDeploy_Ernie-dev-Autoscaling-Deploy_d-4WTRTRTRT"?
Is this the full output? This is a dictionary containing a list of nested dictionaries, so you should treat it that way. Suppose it is called:
A = {
    "Tags": [
        {
            "Key": "Name",
            "Value": "Trove-Dev-Inst : App WebServer"
        },
        {
            "Key": "aws:autoscaling:groupName",
            "Value": "CodeDeploy_Ernie-dev-Autoscaling-Deploy_d-4WTRTRTRT"
        },
        {
            "Key": "CodeDeployProvisioningDeploymentId",
            "Value": "d-4WTRTRTRT"
        },
        {
            "Key": "Environment",
            "Value": "ernie-dev"
        }
    ]
}
You first address the object, then its key in the dictionary, then the index within the list, and finally the key within that inner dictionary:
print(A['Tags'][1]['Value'])
Output:
CodeDeploy_Ernie-dev-Autoscaling-Deploy_d-4WTRTRTRT
EDIT: Based on what you are getting then you should try:
autoscaling_name = response['Reservations'][0]['Instances'][0]['Tags'][1]['Value']
You could also use glom; it's great for deeply nested data and has so many uses that make complicated nested tasks easy.
For example, translating @Celius's answer:
from glom import glom

glom(A, 'Tags.1.Value')
Returns the same thing:
CodeDeploy_Ernie-dev-Autoscaling-Deploy_d-4WTRTRTRT
So to answer your original question you'd use:
glom(response, 'Reservations.0.Instances.0.Tags.1.Value')
The final code for this is:
tags = response['Reservations'][0]['Instances'][0]['Tags']
autoscaling_name = next(t["Value"] for t in tags if t["Key"] == "aws:autoscaling:groupName")
This also ensures that if the order of the data is moved in the JSON data it will still find the correct one.
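If the tag can be missing entirely, next() also accepts a default, so the lookup doesn't raise StopIteration:

autoscaling_name = next(
    (t["Value"] for t in tags if t["Key"] == "aws:autoscaling:groupName"),
    None,  # returned when no matching tag exists
)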
For anyone struggling to get their heads around list comprehensions and iterators, the cherrypicker package (pip install --user cherrypicker) does this sort of thing for you pretty easily:
from cherrypicker import CherryPicker
tags = CherryPicker(response['Reservations'][0]['Instances'][0]['Tags'])
tags(Key="aws:autoscaling:groupName")[0]["Value"].get()
which gives you 'CodeDeploy_Ernie-dev-Autoscaling-Deploy_d-4WTRTRTRT'. If you're expecting multiple values, omit the [0] to get back a list of all values that have an associated "aws:autoscaling:groupName" key.
This is probably all a bit overkill for your question, which can be solved easily with a simple list comprehension. But this approach might come in handy if you need to do more complicated things later, like matching on partial keys only (e.g. aws:* or something more complicated like a regular expression), or you need to filter based on the values in an intermediate layer of the nested object. This sort of task could lead to lots of complicated nested for loops or list comprehensions, whereas with CherryPicker it stays as a simple, potentially one-line command.
You can find out more about advanced usage at https://cherrypicker.readthedocs.io.
I have a simple dictionary in Python. For each item in the dictionary, I have another dictionary I need to attach to it (e.g. 5 contacts, where each contact has FirstName, LastName, and Gender, plus 'other' fields which all fall into a single embedded dictionary).
I have attached the loop I am using. The resulting output is exactly how I want it, but when I run type() on it, Python reports a list rather than a dictionary. How can I convert it back to a dictionary?
itemcount = 0
for item in dict_primarydata:
    dict_primarydata[itemcount]['otherData'] = dict_otherdata[itemcount]
    itemcount = itemcount + 1
I'm going to hazard a guess and say dict_primarydata and dict_otherdata look something like this to start out:
dict_primarydata = [
    {
        'first_name': 'Kaylee',
        'last_name': 'Smith',
        'gender': 'f'
    },
    {
        'first_name': 'Kevin',
        'last_name': 'Hoyt',
        'gender': 'm'
    }
]
dict_otherdata = [
    {
        'note': 'Note about kaylee'
    },
    {
        'note': 'Note about kevin'
    }
]
It looks like dict_primarydata and dict_otherdata are initialized as lists of dicts. In other words, dict_primarydata is not actually a dict; it's a list containing several dicts.
If you want your output to be a dict containing dicts, you need to perform a conversion. Before you can do the conversion, you need to decide what to use as the key for your outer dict.
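For example, assuming the list index (or any other unique field) works as that key, a conversion sketch:

# Convert the two parallel lists into a single dict of dicts,
# keyed by position; swap the key for any unique field you prefer.
combined = {
    i: {**primary, 'otherData': dict_otherdata[i]}
    for i, primary in enumerate(dict_primarydata)
}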
Sidenote
Since you are iterating over two lists, a range-based for loop would be a bit more readable:
for i in range(len(dict_primarydata)):
    dict_primarydata[i]['otherData'] = dict_otherdata[i]
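zip() avoids the manual index entirely:

for primary, other in zip(dict_primarydata, dict_otherdata):
    primary['otherData'] = other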
I am scraping a collection of text documents and building a JSON object to query with python-lifter. I currently have data like:
[
    [
        {"name": "dad"},
        {"name": "son", "dob": "2/24/2000"}
    ],
    [
        {"name": "forever_alone", "cats": 12}
    ]
]
I would like to do two different queries based on the existence of the dob key: 1) to get the son and 2) to get the family that contains the son (dad and son). As I understand it, a list of lists of dictionaries is not well supported in lifter. Suspending for a moment the issue that lifter does not yet allow queries on fields that are not on every record, what would be a better structure for lifter?
a list of dictionaries of dictionaries?
[
    {
        0: {"name": "dad"},
        1: {"name": "son", "dob": "2/24/2000"}
    },
    {
        0: {"name": "forever_alone", "cats": 12}
    }
]
or a dictionary of lists of dictionaries?
{
    18283923: [
        {"name": "dad"},
        {"name": "son", "dob": "2/24/2000"}
    ],
    18283927: [
        {"name": "forever_alone", "cats": 12}
    ]
}
And, given an ideal nested data structure, what are the two queries that would return 1) the son and 2) the family containing the son?
[Disclaimer: lifter maintainer here]
This kind of request is not supported by lifter right now, because lifter will try to look up queried fields on each object and will raise an error if a field does not exist.
Support for querying against iterable fields is not good at the moment either.
An issue has been opened regarding the missing-fields problem, but in any case, your data structure is not really suited for such queries.
A better data structure would be:
families = [
    {
        'id': 1,
        'members': [
            {'name': 'dad'},
            {'name': 'son', 'dob': '2/24/2000'}
        ]
    },
    {
        'id': 2,
        'members': [
            {'name': 'forever_alone', 'cats': 12}
        ]
    }
]
Then, once the previously linked issues have been solved, you could query with something like:
Family = lifter.models.Model('Family')
manager = Family.load(families)

# get families with son/dob members
son_dob_families = manager.filter(Family.members.name == 'son', Family.members.dob.exists())\
                          .values(Family.id, Family.members)

# keep only son members with dob
Member = lifter.models.Model('Member')
members = [member for family in son_dob_families for member in family['members']]
sons_with_dob = Member.load(members).filter(Member.name == 'son', Member.dob.exists())
This is a theoretical API though; it's not implemented yet.
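Until then, plain list comprehensions over the suggested structure give the same two results (a runnable sketch):

# 1) families that contain a member named 'son' with a 'dob' field
son_dob_families = [
    family for family in families
    if any(m.get('name') == 'son' and 'dob' in m for m in family['members'])
]

# 2) the matching son members themselves
sons_with_dob = [
    m for family in son_dob_families for m in family['members']
    if m.get('name') == 'son' and 'dob' in m
]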