Mongodb: How to change an element of a nested array? - python

From what I have read it is impossible to update an element in a nested array using the positional operator $ in mongo. The $ operator only works one level deep. I see it is a requested feature for mongo 2.7.
Updating the whole document one level up is not an option because of write conflicts. I need to be able to change just the 'username' for a particular reward program, for instance.
One idea would be to pull, modify, and push the entire 'reward_programs' element, but then I would lose the order. Order is important.
Consider this document:
{
    "_id": 0,
    "firstname": "Tom",
    "profiles": [
        {
            "profile_name": "tom",
            "reward_programs": [
                {
                    "program_name": "American",
                    "username": "tomdoe"
                },
                {
                    "program_name": "Delta",
                    "username": "tomdoe"
                }
            ]
        }
    ]
}
How would you go about specifically changing the 'username' of 'program_name'=Delta?

After doing more reading it looks like this is unsupported in mongodb at the moment. Positional updates are only supported one level deep. The feature might be added in mongodb 2.7.
There are a couple of workarounds.
1) Flatten out your database structure. In this case, make 'reward_programs' its own collection and do your operation on that.
2) Instead of arrays of dicts, use dicts of dicts. That way you have an absolute path down to the object you need to modify. This can have drawbacks for query flexibility.
3) Seems hacky to me, but you can also walk the nested array, find the element's position index, and do something like this (see the sketch after this list):
users.update({'_id': request._id, 'profiles.profile_name': profile_name}, {'$set': {'profiles.$.reward_programs.{}.username'.format(index): new_username}})
4) Read in the whole document, modify, write back. However, this has possible write conflicts.
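A rough sketch of option 3 with pymongo, using the names from the example document; the two-step read-then-update is an assumption about how you would wire it up, and it is not atomic:
# Look up the index of the Delta entry, then build the dotted path by hand.
doc = users.find_one({'_id': 0, 'profiles.profile_name': 'tom'})
programs = doc['profiles'][0]['reward_programs']
index = next(i for i, p in enumerate(programs)
             if p['program_name'] == 'Delta')

users.update_one(
    {'_id': 0, 'profiles.profile_name': 'tom'},
    {'$set': {'profiles.$.reward_programs.{}.username'.format(index): 'new_username'}}
)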
Setting up your database structure well initially is extremely important. It really depends on how you are going to use it.

A simple way to do this:
doc = collection.find_one({'_id': 0})
doc['profiles'][0]["reward_programs"][1]['username'] = 'new user name'
# replace the whole document
collection.replace_one({'_id': 0}, doc)
This is the read-modify-write approach (workaround 4 above), so it is still subject to the write-conflict concern from the question.

Related

Generate JSON object from json-path in python

I have a list of JSON paths and a value for each path, for example:
bla.[0].ble with a value: 3
and I would like to generate a JSON object where the output will look like this:
{
    "bla": [
        {
            "ble": 3
        }
    ]
}
To find the expression in the JSON I used the jsonpath-ng library, but now I want to go in the other direction and build JSON from JSON paths.
Can you give me some advice on how to make this JSON generator, which can be used for any JSON path?
I tried to just loop through the keys and create lists where needed, but maybe there is a more generic solution for this? (Any open-source library is also perfect if there is one.)
As a workaround my solution was to build a new dictionary using the expressions (or their hashes) as keys and the values as the values:
generated_json[hash('bla.[0].ble')] = 3
So even though the JSON object doesn't match the expected output format, I can use it to look up my expressions, as they describe unique paths.
Please feel free to suggest a better solution, as this is just a workaround.
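For the generic direction, something along these lines could serve as a starting point. It is only a sketch: the set_path helper is hypothetical and understands just the plain-key and [index] segments used in the example path, not the full jsonpath-ng grammar.
import re

def set_path(root, path, value):
    # Split "bla.[0].ble" into ['bla', 0, 'ble'].
    parts = [int(p[1:-1]) if re.fullmatch(r"\[\d+\]", p) else p
             for p in path.split('.')]
    node = root
    for part, nxt in zip(parts, parts[1:]):
        container = [] if isinstance(nxt, int) else {}
        if isinstance(part, int):
            while len(node) <= part:          # grow the list as needed
                node.append(None)
            if node[part] is None:
                node[part] = container
            node = node[part]
        else:
            node = node.setdefault(part, container)
    last = parts[-1]
    if isinstance(last, int):
        while len(node) <= last:
            node.append(None)
        node[last] = value
    else:
        node[last] = value
    return root

set_path({}, 'bla.[0].ble', 3)   # -> {'bla': [{'ble': 3}]}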

How to paginate an aggregation pipeline result in pymongo?

I have a web app where I store some data in Mongo, and I need to return a paginated response from a find or an aggregation pipeline. I use Django Rest Framework and its pagination, which in the end just slices the Cursor object. This works seamlessly for Cursors, but aggregation returns a CommandCursor, which does not implement __getitem__().
cursor = collection.find({})
cursor[10:20] # works, no problem
command_cursor = collection.aggregate([{'$match': {}}])
command_cursor[10:20] # throws not subscriptable error
What is the reason behind this? Does anybody have an implementation for CommandCursor.__getitem__()? Is it feasible at all?
I would like to find a way to not fetch all the values when I need just a page. Converting to a list and then slicing it is not feasible for large (100k+ docs) pipeline results. There is a workaround based on this answer, but it only works for the first few pages, and the performance drops rapidly for pages near the end.
Mongo has aggregation pipeline stages to deal with this, namely $skip and $limit, which you can use like so:
aggregation_results = list(collection.aggregate([{'$match': {}}, {'$skip': 10}, {'$limit': 10}]))
Specifically, as you noticed, pymongo's command_cursor has no implementation of __getitem__, hence the slicing syntax does not work as expected. I would personally recommend not tampering with their code unless you're interested in becoming a contributor to their package.
MongoDB cursors for find and aggregate work differently because the cursor from an aggregation query is the result of processed data (in most cases), which is not the case for find cursors: those are static, so documents can be skipped and limited at will.
You can add the paginator limits as $skip and $limit stages in the aggregation pipeline.
For Example:
command_cursor = collection.aggregate([
    {
        "$match": {
            # Match conditions
        }
    },
    {
        "$skip": 10   # No. of documents to skip (should be 0 for page 1)
    },
    {
        "$limit": 10  # No. of documents to be displayed on your webpage
    }
])
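To wire this into a paginator, you only need to compute the skip from the page number. A minimal sketch, assuming 1-based page numbers (the paginate_pipeline helper is just for illustration):
def paginate_pipeline(pipeline, page, page_size):
    # Append $skip/$limit stages for a 1-based page number.
    return pipeline + [
        {"$skip": (page - 1) * page_size},
        {"$limit": page_size},
    ]

# e.g. page 3, 10 documents per page:
results = list(collection.aggregate(paginate_pipeline([{"$match": {}}], 3, 10)))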

Check if JSON var has nullable key (Twitter Streaming API)

I'm downloading tweets from the Twitter Streaming API using Tweepy. I manage to check whether the downloaded data has keys such as 'extended_tweet', but I'm struggling with a specific key inside another key.
def on_data(self, data):
    savingTweet = {}
    if not "retweeted_status" in data:
        dataJson = json.loads(data)
        if 'extended_tweet' in dataJson:
            savingTweet['text'] = dataJson['extended_tweet']['full_text']
        else:
            savingTweet['text'] = dataJson['text']
        if 'coordinates' in dataJson:
            if 'coordinates' in dataJson['coordinates']:
                savingTweet['coordinates'] = dataJson['coordinates']['coordinates']
        else:
            savingTweet['coordinates'] = 'null'
I'm checking 'extended_tweet' properly, but when I try to do the same with ['coordinates']['coordinates'] I get the following error:
TypeError: argument of type 'NoneType' is not iterable
Twitter documentation says that key 'coordinates' has the following structure:
"coordinates":
{
"coordinates":
[
-75.14310264,
40.05701649
],
"type":"Point"
}
I managed to solve it by just putting the problematic check in a try/except, but I think this is not the most suitable approach to the problem. Any other ideas?
So the Twitter API docs are probably lying a bit about what they return (shock horror!) and it looks like you're getting a None in place of the expected data structure. You've already decided against using try/except, so I won't go over that, but here are a few other suggestions.
Using dict get() defaults
There are a couple of options that occur to me. The first is to make use of the default argument of dict's get() method. You can provide a fallback if the expected key does not exist, which allows you to chain together multiple calls.
For example, you can achieve most of what you are trying to do with the following:
return {
    'text': data.get('extended_tweet', {}).get('full_text', data['text']),
    # `or {}` also covers the case where 'coordinates' is present but None
    'coordinates': (data.get('coordinates') or {}).get('coordinates', 'null')
}
It's not super pretty, but it does work. It's also likely to be a little slower than what you are doing.
Using JSONPath
Another option, which is likely overkill for this situation, is to use a JSONPath library, which will let you search within data structures for items matching a query. Something like:
from jsonpath_rw import parse

matches = parse('extended_tweet.full_text').find(data)
if matches:
    print(matches[0].value)
This is going to be a lot slower than what you are doing, and for just a few fields it is overkill, but if you are doing a lot of this kind of work it could be a handy tool in the box. JSONPath can also express much more complicated paths, or very deeply nested paths where the get method might not work or would be unwieldy.
Parse the JSON first!
The last thing I would mention is to make sure you parse your JSON before you do your test for "retweeted_status". If that text appears anywhere in the raw string (say, inside the text of a tweet), the test will trigger.
JSON parsing with a competent library is usually extremely fast too, so unless you are having real speed problems it's not necessarily worth worrying about.
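Putting those suggestions together, the handler from the question could be restructured roughly like this. This is only a sketch: it mirrors the question's on_data method and simply returns the dict instead of saving it.
import json

def on_data(self, data):
    tweet = json.loads(data)            # parse first, then test real keys
    if "retweeted_status" in tweet:     # a key test now, not a substring test
        return
    return {
        'text': tweet.get('extended_tweet', {}).get('full_text', tweet.get('text')),
        # 'coordinates' may exist but be None, hence the `or {}`
        'coordinates': (tweet.get('coordinates') or {}).get('coordinates', 'null'),
    }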

Convert data between databases with different schemes and value codings

As part of my job I am meant to write a script to convert data from an available database here on-site to an external database for publication which is similar but far less detailed. I want to achieve this in Python.
Some columns can just be converted by copying their content and changing the coding, i.e. 2212 in the original database becomes 2 in column X of the external database. To achieve this I wrote the codings in JSON, e.g.
{
    "2212": 2,
    "2213": 1,
    "2214": 2,
    ...
}
This leads to some repetition, as you can see, but since lists cannot be keys in JSON I don't see a better way to do it that is simple and clean. Sure, I could use the right-hand side as the key, but then instead of jsonParsedDict["2212"] I would have to go through all the keys 1, 2, ... and find my original key on the right-hand side.
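For what it's worth, that dismissed alternative stays convenient if the list-valued mapping is inverted once at load time; a small sketch (the exact file layout here is an assumption):
import json

# External code as the key, list of original codes as the value:
raw = json.loads('{"2": ["2212", "2214"], "1": ["2213"]}')

# Invert it once so lookups stay as simple as mapping["2212"]:
mapping = {orig: int(new) for new, origs in raw.items() for orig in origs}
# mapping == {'2212': 2, '2214': 2, '2213': 1}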
Where it gets ugly (in my opinion) is when information from multiple columns in the original database needs to be combined to get the new column. Right now I just wrote a Python function doing a lot of if checks. It works, and I will finish the job this way, but it just seems aesthetically wrong and I want to learn more about Python's possibilities for this task.
Imagine for example I have two columns in the original database, X and Y. Based on the value in X I either do nothing (for values that are not coded in the external database), return a value directly, or return a result based on the value of Y in the same row. Right now this leads to quite a few if statements. My other idea was to have nested entries in the JSON file, e.g.
{
    "X": {
        "2211": 1,
        "2212": null,
        "2213": {
            "Y": {
                "3112": 1,
                "3212": 2
            }
        },
        "2214": {
            "Y": {
                "3112": 2,
                "3212": 1
            }
        },
        "2215": {
            "Y": {
                "3112": 1,
                "3212": 2
            }
        }
    }
}
But this approach really blows up the JSON file, and the repetition gets even more painful. Alas, I cannot think of any other way to encode these kinds of conditions, apart from ifs in the code.
Is this a feasible way to do it, or is there a better solution? It would be great if I could specify the variables, and the associated variables which are part of the decision process, only in the JSON file. I want to abstract the conversion process so that it is mostly driven by these JSON files and the Python code stays quite general. If there is a better format for this than JSON, suggestions are very welcome.
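One possible way to keep the Python generic and the decision logic in the JSON is a small interpreter that walks the nested mapping. This is a minimal sketch, assuming a codings.json file laid out exactly like the nested example above and rows given as dicts keyed by column name (the convert helper is just for illustration):
import json

def convert(row, mapping):
    # Resolve one row against the nested mapping: start at column X, and
    # whenever the entry is a dict, its single key names the next column
    # to consult (e.g. "Y"). Uncoded values resolve to None.
    entry = mapping["X"].get(str(row["X"]))
    while isinstance(entry, dict):
        (column, sub_mapping), = entry.items()
        entry = sub_mapping.get(str(row[column]))
    return entry

with open("codings.json") as f:       # the nested JSON shown above
    mapping = json.load(f)

convert({"X": 2214, "Y": 3112}, mapping)   # -> 2
convert({"X": 2212, "Y": 3112}, mapping)   # -> None (not coded externally)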

Reach a string behind unknown value in JSON

I use Wikipedia's API to get information about a page.
The API gives me JSON like this:
"query":{
"pages":{
"188791":{
"pageid":188791,
"ns":0,
"title":"Vanit\u00e9",
"langlinks":[
{
"lang":"bg",
"*":"Vanitas"
},
{
"lang":"ca",
"*":"Vanitas"
},
ETC.
}
}
}
}
You can see the full JSON response.
I want to obtain all entries like:
{
    "lang": "ca",
    "*": "Vanitas"
}
but the number key ("188791") in the pages object is the problem.
I found Find a value within nested json dictionary in python, which explains how to enumerate the values.
Unfortunately I get the following exception:
TypeError: 'dict_values' object does not support indexing
My code is:
json["query"]["pages"].values()[0]["langlinks"]
It's probably a dumb question but I can't find a way to pass in the page id value.
One solution is to use the indexpageids parameter, e.g.: http://fr.wikipedia.org/w/api.php?action=query&titles=Vanit%C3%A9&prop=langlinks&lllimit=500&format=jsonfm&indexpageids. It will add an array of pageids to the response. You can then use that to access the dictionary.
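A rough sketch of that approach with the requests library; the parameters mirror the URL above (switched to format=json), and the pageids lookup relies on what indexpageids adds to the response:
import requests

params = {
    "action": "query",
    "titles": "Vanité",
    "prop": "langlinks",
    "lllimit": 500,
    "format": "json",        # jsonfm is only the pretty-printed HTML variant
    "indexpageids": 1,
}
data = requests.get("http://fr.wikipedia.org/w/api.php", params=params).json()

page_id = data["query"]["pageids"][0]                     # e.g. "188791"
langlinks = data["query"]["pages"][page_id]["langlinks"]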
As long as you're only querying one page at a time, Simeon Visser's answer will work. However, as a matter of good style, I'd recommend structuring your code so that you iterate over all the returned results, even if you know there should be only one:
for page in data["query"]["pages"].values():
    title = page["title"]
    langlinks = page["langlinks"]
    # do something with langlinks...
In particular, by writing your code this way, if you ever find yourself needing to run the query for multiple pages, you can do it efficiently with a single MediaWiki API request.
You're using Python 3, where values() returns a dict_values object instead of a list. This is a view on the values of the dictionary.
That's why you're getting the error: indexing is possible on a list but not on a view.
To fix it:
list(json["query"]["pages"].values())[0]["langlinks"]
If you really want just one page arbitrarily, do that the way Simeon Visser suggested.
But I suspect you want all langlinks in all pages, yes?
For that, you want a comprehension:
[page["langlinks"] for page in json["query"]["pages"].values()]
But of course that gives you a 2D list. If you want to iterate over each page's links, that's perfect. If you want to iterate over all of the langlinks at once, you want to flatten the list:
[langlink for page in json["query"]["pages"].values()
          for langlink in page["langlinks"]]
… or…
itertools.chain.from_iterable(page["langlinks"]
for page in json["query"]["pages"].values())
(The latter gives you an iterator; if you need a list, wrap the whole thing in list. Conversely, for the first two, if you don't need a list, just any iterable, use parens instead of square brackets to get a generator expression.)
