I'm wondering if there's a Pythonic way to squash this nested for loop:
dict = {
    "keyA": { "subkey1": { "A1a": "frog", "A1b": "dog", "A1c": "airplane" },
              "subkey2": { "A2a": "cat" } },
    "keyB": { "subkey1": { "B1a": "Zorba", "B1q": ["popcorn", -34] },
              "subkey2": { "B2z": "A Man A Plan A Canal", "B2e": "armadillo", "B2w": [1, 3, "jump"] } },
    "keyC": { "subkey1": { "C1a": 3.14, "C1z": { "aaa": "dishwater", "bbb": "Dishwalla" }, "C1x": "bat" },
              "subkey2": { "C2a": None, "C2b": 123 } }
}
for key in dict.keys():
    for subsubkey in dict[key]["subkey2"].keys():
        print(key + ":" + subsubkey)
Output:
keyA:A2a
keyB:B2z
keyB:B2e
keyB:B2w
keyC:C2a
keyC:C2b
One Pythonic way to solve this is to use a list comprehension. It lets you build the list in a single expression, following the same for-loop structure you have already laid out. A working version might look something like:
final_keys = [(first_key, second_key)
              for first_key in dict.keys()
              for second_key in dict[first_key]['subkey2'].keys()]
Outputting (from your dataset):
[('keyA', 'A2a'), ('keyB', 'B2z'), ('keyB', 'B2e'), ('keyB', 'B2w'), ('keyC', 'C2a'), ('keyC', 'C2b')]
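If you prefer not to index back into the dictionary, an items()-based variant of the same comprehension might look like this (a sketch against the data above; the name pairs is mine):
pairs = [(key, subsubkey)
         for key, subdict in dict.items()
         for subsubkey in subdict["subkey2"].keys()]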
Related
I have a nested dict (as below). The goal is to extract the values of "req_key1" and "rkey1" and append them to a list.
raw_json = {
    "first_key": {
        "f_sub_key1": "some_value",
        "f_sub_key2": "some_value"
    },
    "second_key": {
        "another_key": [{
            "s_sub_key1": [{
                "date": "2022-01-01",
                "day": {
                    "key1": "value_1",
                    "keyn": "value_n"
                }
            }],
            "s_sub_key2": [{
                "req_key1": "req_value1",
                "req_key2": {
                    "rkey1": "rvalue_1",
                    "rkeyn": "rvalue_n"
                }
            }]
        }]
    }
}
I am able to append the values to a list and below is my approach.
emp_ls = []
filtered_key = raw_json["second_key"]["another_key"]
for i in filtered_key:
    for k in i.get("s_sub_key2"):
        emp_ls.append({"first_val": k.get("req_key1"), "second_val": k["req_key2"].get("rkey1")})
print(emp_ls)
Is this a good approach, i.e. can it be used in production, or is there another approach for this task?
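For comparison, the same extraction can also be written as a single list comprehension (a sketch against the raw_json above, keeping .get() so missing keys do not raise):
emp_ls = [
    {"first_val": k.get("req_key1"), "second_val": k.get("req_key2", {}).get("rkey1")}
    for i in raw_json["second_key"]["another_key"]
    for k in i.get("s_sub_key2", [])
]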
I have a collection where the objects have a structure similar to
{'_id': ObjectId('5e691cb9e73282f624362221'),
'created_at': 'Tue Mar 10 09:23:54 +0000 2020',
'id': 1237308186757120001,
'id_str': '1237308186757120001',
'full_text': 'See you in July'}
I am struggling to keep only the objects which have a unique full text. Using distinct only gives me a list of the distinct full_text field values, whereas I want to keep only the objects in the collection with unique full texts.
There is; the code should look like this:
dict = {"a": 1, "b": 2, "c": 3, "a": 5, "d": 4, "e": 5, "c": 8}
#New clean dictionary
unique = {}
#Go through the original dictionary's items
for key, value in dict.items():
if(key in unique.keys()):
#If the key already exists in the new dictionary
continue
else:
#Otherwise
unique[key] = value
print(unique)
I hope this helps you!
There are 2 ways:
MongoDB way
We perform a MongoDB aggregation where we group records by full_text, keep only the documents whose text occurs once, and insert them into a new collection (run in the shell).
db.collection.aggregate([
  {
    $group: {
      _id: "$full_text",
      data: { $push: "$$ROOT" },
      count: { $sum: 1 }
    }
  },
  { $match: { count: { $eq: 1 } } },
  { $addFields: { data: { $arrayElemAt: ["$data", 0] } } },
  { $replaceRoot: { newRoot: "$data" } },
  { $out: "tmp" }
])
When you run this query, it will create a new collection containing only the documents with unique full_text values. You can then drop the old collection and rename the new one.
You may also put your original collection name straight into the $out operator, like this {$out: "collection"}, but there is no going back.
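If you would rather drive this first approach from PyMongo instead of the shell, a sketch might look like this (the database name test and the temporary collection name tmp are assumptions carried over from the example above):
from pymongo import MongoClient

client = MongoClient()
db = client["test"]  # assumption: the database holding the collection

pipeline = [
    {"$group": {"_id": "$full_text", "data": {"$push": "$$ROOT"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$eq": 1}}},
    {"$addFields": {"data": {"$arrayElemAt": ["$data", 0]}}},
    {"$replaceRoot": {"newRoot": "$data"}},
    {"$out": "tmp"},
]
db.collection.aggregate(pipeline)

# Swap the collections once you are happy with the result in "tmp"
db.collection.drop()
db.tmp.rename("collection")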
Python way
We perform a MongoDB aggregation that groups by the full_text field, filters for duplicated documents, and collects all of their _id values into a single array. Once MongoDB returns the result, we execute a remove command for those duplicate documents.
db.collection.aggregate([
  {
    $group: {
      _id: "$full_text",
      data: { $push: "$_id" },
      count: { $sum: 1 }
    }
  },
  { $match: { count: { $gt: 1 } } },
  {
    $group: {
      _id: null,
      data: { $push: "$data" }
    }
  },
  {
    $addFields: {
      data: {
        $reduce: {
          input: "$data",
          initialValue: [],
          in: { $concatArrays: ["$$value", "$$this"] }
        }
      }
    }
  }
])
Pseudocode
data = list(collection.aggregate(...))
if len(data) > 0:
    # remove() is the legacy call; delete_many() is its modern PyMongo equivalent
    collection.remove({'_id': {'$in': data[0]["data"]}})
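A fuller PyMongo version of that pseudocode might look like this (collection is assumed to be a pymongo Collection object; delete_many is the modern spelling of remove):
pipeline = [
    {"$group": {"_id": "$full_text", "data": {"$push": "$_id"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$group": {"_id": None, "data": {"$push": "$data"}}},
    {"$addFields": {"data": {"$reduce": {
        "input": "$data",
        "initialValue": [],
        "in": {"$concatArrays": ["$$value", "$$this"]},
    }}}},
]
data = list(collection.aggregate(pipeline))
if data:
    # data[0]["data"] now holds every _id whose full_text occurs more than once
    collection.delete_many({"_id": {"$in": data[0]["data"]}})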
I have a python dictionary,
dict = {
    "A": [
        { "264": "0.1965" },
        { "289": "0.1509" },
        { "192": "0.1244" }
    ]
}
I have a collection in mongoDB that has,
{
    "_id": ObjectId("5d5a7f474c55b68a873f9602"),
    "A": [
        { "264": "0.5700" },
        { "175": "0.321" }
    ]
}
{
    "_id": ObjectId("5d5a7f474c55b68a873f9610"),
    "B": [
        { "152": "0.2826" },
        { "012": "0.1234" }
    ]
}
I want to see if the key "A" from dict is available in mongodb. If yes, I want to loop over the keys in the list i.e.
[
    { "264": "0.19652049960139123" },
    { "289": "0.1509138215380371" },
    { "192": "0.12447470015715734" }
]
and check if 264 is available in MongoDB; if it is, update the key's value, otherwise append it.
Expected output in mongodb:
{
    "_id": ObjectId("5d5a7f474c55b68a873f9602"),
    "A": [
        { "264": "0.1965" },
        { "175": "0.321" },
        { "289": "0.1509" },
        { "192": "0.1244" }
    ]
}
{
    "_id": ObjectId("5d5a7f474c55b68a873f9610"),
    "B": [
        { "152": "0.2826" },
        { "012": "0.1234" }
    ]
}
The value for key 264 is updated. Kindly help.
Assuming you are looking for the python part and not the mongoDB, try:
for entry in dict['A']:           # dict['A'] is a list of single-key dicts
    for k, v in entry.items():    # k is key, v is value
        process_entry(k, v)       # do what you want with the database
assuming your mongodb collection is called your_collection
data = your_collection.find_one({'A': {'$exists': 1}})
if data:
    # loop over the keys
    for item in data['A']:
        # check whether a certain key is available
        if 'some_key' not in item:
            do_something()  # update
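To then do the "update if the key exists, otherwise append" part against MongoDB, a sketch with update_one might look like this (it reuses your_collection and dict from above; the positional $ operator targets the array element matched in the query):
for entry in dict['A']:
    for k, v in entry.items():
        # Try to update an element of the "A" array that already contains this key
        res = your_collection.update_one(
            {'A.{}'.format(k): {'$exists': True}},
            {'$set': {'A.$.{}'.format(k): v}}
        )
        if res.matched_count == 0:
            # No element had the key, so append a new {key: value} entry instead
            your_collection.update_one(
                {'A': {'$exists': True}},
                {'$push': {'A': {k: v}}}
            )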
I have a collection with fields like this:
{
    "_id": "5cf54857bbc85fd0ff5640ba",
    "book_id": "5cf172220fb516f706d00591",
    "tags": {
        "person": [
            { "start_match": 209, "length_match": 6, "word": "kimmel" }
        ],
        "organization": [
            { "start_match": 107, "length_match": 12, "word": "philadelphia" },
            { "start_match": 209, "length_match": 13, "word": "kimmel center" }
        ],
        "location": [
            { "start_match": 107, "length_match": 12, "word": "philadelphia" }
        ]
    },
    "deleted": false
}
I want to collect the different words in the categories and count them.
So, the output should be like this:
{
    "response": [
        {
            "tag": "location",
            "tag_list": [
                { "count": 31, "phrase": "philadelphia" },
                { "count": 15, "phrase": "usa" }
            ]
        },
        {
            "tag": "organization",
            "tag_list": [ ... ]
        },
        {
            "tag": "person",
            "tag_list": [ ... ]
        }
    ]
}
A pipeline like this works:
def pipeline_func(tag):
    return [
        {'$replaceRoot': {'newRoot': '$tags'}},
        {'$unwind': '${}'.format(tag)},
        {'$group': {'_id': '${}.word'.format(tag), 'count': {'$sum': 1}}},
        {'$project': {'phrase': '$_id', 'count': 1, '_id': 0}},
        {'$sort': {'count': -1}}
    ]
But it makes a request for each tag. I want to know how to do it in one request.
Thank you for your attention.
As noted, there is a slight mismatch between the question data and the pipeline currently claimed to work, since $unwind can only be used on arrays, and tags as presented in the question is not an array.
For the data presented in the question you basically want a pipeline like this:
db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
Again, as per the note, since tags is in fact an object, what you actually need in order to collect data based on its sub-keys, as the question asks, is to turn it into an array of items.
The usage of $replaceRoot in your current pipeline suggests that $objectToArray is a fair fit here, as it is available from the later patch releases of MongoDB 3.4, the bare minimum version you should be using in production right now.
$objectToArray does pretty much what the name says and produces an array (or "list", to be more pythonic) of entries broken into key and value pairs. These are essentially a "list" of objects (or "dict" entries) which have the keys k and v respectively. The output of the first pipeline stage would look like this for the supplied document:
{
"book_id": "5cf172220fb516f706d00591",
"tags": [
{
"k": "person",
"v": [
{
"start_match": 209,
"length_match": 6,
"word": "kimmel"
}
]
}, {
"k": "organization",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}, {
"start_match": 209,
"length_match": 13,
"word": "kimmel center"
}
]
}, {
"k": "location",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}
]
}
],
"deleted" : false
}
So you should be able to see how you can now easily access those k values and use them in grouping, and of course the v is a standard array as well. So it is just the two $unwind stages as shown and then two $group stages: the first $group collects over the combination of keys, and the second collects per the main grouping key whilst pushing the other accumulations into a "list" within that entry.
Of course the output of the above listing is not exactly what you asked for in the question, but the data is basically there. You can optionally add an $addFields or $project stage as the final aggregation stage to essentially rename the _id key:
{ "$addFields": {
"_id": "$$REMOVE",
"tag": "$_id"
}}
Or simply do something pythonic with a little list comprehension on the cursor output:
cursor = db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
output = [{ 'tag': doc['_id'], 'tag_list': doc['tag_list'] } for doc in cursor]
print({ 'response': output })
And final output as a "list" you can use for response:
{
"tag_list": [
{
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "location"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel"
}
],
"tag": "person"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel center"
}, {
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "organization"
}
Note that with the list comprehension approach you have a bit more control over the order of "keys" in the output, since MongoDB itself would simply append new key names in a projection, keeping existing keys ordered first, if that sort of thing is important to you. Though it really should not be, since Object/Dict-like structures should not be considered to have any set order of keys; that is what arrays (or lists) are for.
I am using MongoDB 3.4 and PyMongo. I have a set of keywords:
keywords = [ 'bar', 'foo', ..., 'zoo' ]
I also have a collection:
docs = [ { 'data': ' ... bar foo ... ' },
         { 'data': ' ... foo ... ' },
         { 'data': ' ... zoo ... ' } ]
I am looking for a PyMongo aggregation query which is going to give me a dict:
{ 'bar' : 0, 'foo' : 2, ..., 'zoo' : 0 }
There isn't anything language-specific about this, as the only solutions are either all in the aggregation framework or using mapReduce, where the latter is defined as JavaScript functions anyway.
Just setting up some sample data:
db.wordstuff.insertMany([
{ 'data': "foo brick bar" },
{ 'data': "brick foo" },
{ 'data': "bar brick baz" },
{ 'data': "bax" },
{ 'data': "brin brok fu foo" }
])
Aggregation Framework
Then you can run the aggregation statement:
db.wordstuff.aggregate([
{ "$project": {
"_id": 0,
"split": {
"$filter": {
"input": { "$split": [ "$data", " " ] },
"cond": { "$in": [ "$$this", ["bar","foo","baz","blat"] ] }
}
}
}},
{ "$unwind": "$split" },
{ "$group": { "_id": "$split", "count": { "$sum": 1 } }},
{ "$group": {
"_id": null,
"data": { "$push": { "k": "$_id", "v": "$count" } }
}},
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": {
"$map": {
"input": ["bar","foo","baz","blat"],
"as": "d",
"in": {
"$cond": {
"if": { "$ne": [{ "$indexOfArray": ["$data.k","$$d"] },-1] },
"then": {
"$arrayElemAt": [
"$data",
{ "$indexOfArray": ["$data.k","$$d"] }
]
},
"else": { "k": "$$d", "v": 0 }
}
}
}
}
}
}}
])
In reality, all of the real work is done by this point:
db.wordstuff.aggregate([
{ "$project": {
"_id": 0,
"split": {
"$filter": {
"input": { "$split": [ "$data", " " ] },
"cond": { "$in": [ "$$this", ["bar","foo","baz","blat"] ] }
}
}
}},
{ "$unwind": "$split" },
{ "$group": { "_id": "$split", "count": { "$sum": 1 } }},
])
Which gives you output like:
{ "_id" : "baz", "count" : 1.0 }
{ "_id" : "bar", "count" : 2.0 }
{ "_id" : "foo", "count" : 3.0 }
So the real work here is being done by $split, and that is the main dependency for using the aggregation framework, so you need at least MongoDB 3.4 in order to do this. The very simple premise is to $split the words out individually as array members, then $filter the content to match the input array of words.
That $filter uses $in, which is another addition as of MongoDB 3.4 to match against each listed word. There are other operators that can do this with longer syntax, but we know we already need MongoDB 3.4 so this is the shortest syntax.
All that is really done after that is to $unwind the matched array of words from each document, then $group to obtain those matched words as a distinct list, along with the count of the occurrences.
That really is all there is to it from the main perspective of the database.
The following parts are actually "optional", since they are easy to reproduce in code and probably look a lot clearer and cleaner that way. But they demonstrate the newer operators, which require at least MongoDB 3.4.4 for the introduction of $arrayToObject.
Again the basics are that the next $group "rolls up" the matched words from the cursor into an array within a single document. There is also a very specific key naming applied of "k" and "v" for later reasons.
Then you use a $replaceRoot stage, since the content of the returned document is evaluated from an expression. This expression uses $map to iterate over the "input array" of words and match those to the entries created by the aggregation. This matching is done using $indexOfArray to return the matched index of the compared value.
You use this within $cond, as you either want to transform that value into a matched element using $arrayElemAt, or alternately recognize that the index was not a match. This either returns the aggregated entry (obtained from the earlier matches) or a "default" value of 0 for the given word.
The final part uses $arrayToObject, which transforms an array of objects with properties "k" and "v" into "key/value" pairs as an object.
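As a rough illustration of that last step in Python terms (the sample values here are just made-up counts):
# $arrayToObject turns a list of {"k": ..., "v": ...} entries into one object, roughly:
kv_pairs = [{"k": "foo", "v": 3}, {"k": "bar", "v": 2}, {"k": "baz", "v": 1}]
as_object = {item["k"]: item["v"] for item in kv_pairs}
# as_object == {"foo": 3, "bar": 2, "baz": 1}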
So you can ask MongoDB to do it, but the data is actually reduced by the minimal pipeline as shown, so you may as well do it in client code. It's pretty simple, and for JavaScript you just do:
var words = db.wordstuff.aggregate([
{ "$project": {
"_id": 0,
"split": {
"$filter": {
"input": { "$split": [ "$data", " " ] },
"cond": { "$in": [ "$$this", ["bar","foo","baz","blat"] ] }
}
}
}},
{ "$unwind": "$split" },
{ "$group": { "_id": "$split", "count": { "$sum": 1 } }},
]).toArray();
var result = ["bar","foo","baz","blat"].map(
w => ( words.map(wd => wd._id).indexOf(w) !== -1)
? words[words.map(wd => wd._id).indexOf(w)]
: { _id: w, count: 0 }
).reduce((acc,curr) => Object.assign(acc,{ [curr._id]: curr.count }),{})
So if there is anything language-specific at all, then that would be the part. If you choose to run the aggregation at its basics and process the resulting cursor, then the Python code would be:
input = ["bar","foo","baz","blat"]
words = list(db.wordstuff.aggregate([
{ "$project": {
"_id": 0,
"split": {
"$filter": {
"input": { "$split": [ "$data", " " ] },
"cond": { "$in": [ "$$this", input ] }
}
}
}},
{ "$unwind": "$split" },
{ "$group": { "_id": "$split", "count": { "$sum": 1 } }},
]))
result = reduce(
lambda x,y:
dict(x.items() + { y['_id']: y['count'] }.items()),
map(lambda w: words[map(lambda wd: wd['_id'],words).index(w)]
if w in map(lambda wd: wd['_id'],words)
else { '_id': w, 'count': 0 },
input
),
{}
)
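If you do not need to mirror the JavaScript structure, the same post-processing collapses to a couple of dictionary comprehensions (a sketch using the words and keywords names from above):
counts = {doc['_id']: doc['count'] for doc in words}
result = {w: counts.get(w, 0) for w in keywords}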
And either method pulls out the same result:
{
"bar" : 2.0,
"foo" : 3.0,
"baz" : 1.0,
"blat" : 0.0
}
MapReduce
The alternate case, where you don't even have the minimum MongoDB 3.4.0 available, is to use mapReduce for the process instead. Again, this needs to be sent to the server as JavaScript, which is generally represented as "strings" in most language implementations (other than JavaScript itself):
db.wordstuff.mapReduce(
function() {
this.data.split(' ')
.filter( w => words.indexOf(w) !== -1 )
.forEach( w => emit(null,{ [w]: 1 }) );
},
function(key,values) {
return [].concat.apply([],
values.map(v => Object.keys(v).map(k => ({ k: k, v: v[k] })))
).reduce((acc,curr) => Object.assign(acc,{
[curr.k]: (acc.hasOwnProperty(curr.k))
? acc[curr.k] + curr.v : curr.v
}),{});
},
{
"out": { "inline": 1 },
"scope": { "words": ["bar","foo","baz","blat"] },
"finalize": function(key,value) {
return words.map( w => (value.hasOwnProperty(w))
? { [w]: value[w] } : { [w]: 0 }
).reduce((acc,curr) => Object.assign(acc,curr),{})
}
}
)
And that gives you the same results and really does exactly the same thing, just a little slower, because MongoDB needs to evaluate and process the JavaScript, as compared to using its own natively coded methods with the aggregation framework.
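For completeness, a sketch of sending that same mapReduce from PyMongo, wrapping the JavaScript in bson.Code (the collection name wordstuff and the word list are carried over from above):
from bson.code import Code

mapper = Code("""
function() {
    this.data.split(' ')
        .filter(function(w) { return words.indexOf(w) !== -1; })
        .forEach(function(w) { var o = {}; o[w] = 1; emit(null, o); });
}
""")

reducer = Code("""
function(key, values) {
    var out = {};
    values.forEach(function(v) {
        Object.keys(v).forEach(function(k) {
            out[k] = (out.hasOwnProperty(k) ? out[k] : 0) + v[k];
        });
    });
    return out;
}
""")

finalizer = Code("""
function(key, value) {
    var out = {};
    words.forEach(function(w) { out[w] = value.hasOwnProperty(w) ? value[w] : 0; });
    return out;
}
""")

result = db.command(
    "mapReduce", "wordstuff",
    map=mapper, reduce=reducer, finalize=finalizer,
    out={"inline": 1},
    scope={"words": ["bar", "foo", "baz", "blat"]},
)
# With inline output the counts come back under results[0]["value"]
print(result["results"][0]["value"])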