MongoDB: how to update many and set a profile specific to each id - Python

I have a list of ids called batch. I want to update all of the matching documents to set a field called fetched to true.
Original Test Collection
[{
    "user_id": 1
},
{
    "user_id": 2
}]
batch variable
[1, 2]
UpdateMany:
mongodb["test"].update_many({"user_id": {"$in": batch}}, {"$set": {"fetched": True}})
I can do that using the above statement.
I also have another variable called user_profiles, which is a list of JSON objects. I now ALSO want to set a field profile to the profile found in user_profiles whose id matches the user_id of the document I am updating.
user_profiles
[{
    "id": 1,
    "name": "john"
},
{
    "id": 2,
    "name": "jane"
}]
Expected Final Result
[{
    "user_id": 1,
    "fetched": true,
    "profile": {
        "id": 1,
        "name": "john"
    }
},
{
    "user_id": 2,
    "fetched": true,
    "profile": {
        "id": 2,
        "name": "jane"
    }
}]
I have millions of these documents, so I am trying to keep performance in mind.

You'll want to use db.collection.bulkWrite; see the updateOne example in the docs.
If you've got millions, you'll want to batch the bulkWrites into smaller chunks that work with your database server's capabilities.
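In pymongo, a minimal sketch of that approach (assuming the mongodb["test"] handle and the batch/user_profiles variables from the question, and that each profile's id lines up with a user_id):

from pymongo import UpdateOne

# One UpdateOne per profile: match on user_id, set fetched and the
# matching profile object. The chunk size is an assumption -- tune it
# to your server.
CHUNK = 1000

ops = [
    UpdateOne({"user_id": profile["id"]},
              {"$set": {"fetched": True, "profile": profile}})
    for profile in user_profiles
]
for i in range(0, len(ops), CHUNK):
    mongodb["test"].bulk_write(ops[i:i + CHUNK], ordered=False)

ordered=False lets the server keep going past individual failures and apply the writes more freely, which is usually what you want for a bulk backfill.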
Edit:
@Kay I just re-read the second part of your question, which I didn't address earlier. You may want to try the $out stage of the aggregation pipeline. Be careful, though, since it will overwrite the existing collection, so if you don't project all fields you could lose data. Definitely worth using a temporary collection for testing first.
Finally, you could also just create a view based on the aggregation query (with $lookup) if you don't absolutely need that data physically stored in the same collection.
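A sketch of the view option in pymongo, assuming the profiles live in their own collection (the user_profiles and users_with_profiles names here are assumptions):

# Create a read-only view that joins the profile on the fly; nothing
# extra is physically stored.
mongodb.command("create", "users_with_profiles", viewOn="test", pipeline=[
    {"$lookup": {
        "from": "user_profiles",
        "localField": "user_id",
        "foreignField": "id",
        "as": "profile",
    }},
    {"$unwind": {"path": "$profile", "preserveNullAndEmptyArrays": True}},
])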

MongoDB $lookup (aggregation): Put multiple matches into array rather than create multiple documents?

Using Ubuntu 21.04, MongoDB Community 4.4.9, pymongo in Python 3.9:
I'm merging data from two collections on one shared key, membershipNumber. membershipNumber is associated with a different user-level identifier, an_user_id, in another collection, and should be unique. However, in many cases, there are n an_user_ids for a single membershipNumber. Right now, this means that I have many duplicate membershipNumbers, causing there to be duplicate documents where everything - apart from an_user_id - is the same in my newly created collection.
In order to circumvent this issue, I want the following to happen:
whenever there are >1 an_user_ids which match a given membershipNumber, I want to create an array that holds ALL an_user_ids that match a given membershipNumber in a newly created collection (using $out)
that way, every membershipNumber in the collection will be unique.
One question re the practicality of this also remains: Will this mean I'll be able to $merge or $insert data which is linked via an_user_id and from a different collection/aggregation onto this newly created collection?
Any help would be hugely appreciated. Thanks!
Working code that I have (which however doesn't prevent duplication):
p = [
    {
        '$project': {
            '_id': 0,
            'membershipNumber': 1,
            'address': 1,
            'joinDate': 1,
            'membershipType': 1
        }
    },
    # THE JOIN!!
    {
        '$lookup': {
            'from': "an_users",  # the other collection
            'localField': 'membershipNumber',
            'foreignField': 'memref',
            'as': "details"
        }
    },
    # retain unmatchable cases
    {
        '$unwind': {
            'path': '$details',
            'preserveNullAndEmptyArrays': True
        }
    },
    {
        '$project': {
            '_id': 0,
            'membershipNumber': 1,
            'home': 1,
            'joinDate': 1,
            'membershipType': 1,
            'an_user_id': '$details.user_id'
        }
    },
    {
        '$out': {
            'db': 'mydb',
            'coll': 'new_coll'
        }
    }
]
members.aggregate(pipeline=p)
And this is what the (unwanted) duplicate data look like in the new collection:
{
    "_id": 1,
    "membershipNumber": "123456",
    "membershipType": "STD",
    "home": "Hogwarts",
    "joinDate": {
        "$date": "2000-01-01T00:00:00.000Z"
    },
    "an_user_id": "12345"
},
{
    "_id": 2,
    "membershipNumber": "123456",
    "membershipType": "STD",
    "home": "Hogwarts",
    "joinDate": {
        "$date": "2000-01-01T00:00:00.000Z"
    },
    "an_user_id": "12346"
}
And this is what I'd like it to look like...
{
    "_id": 1,
    "membershipNumber": "123456",
    "membershipType": "STD",
    "home": "Hogwarts",
    "joinDate": {
        "$date": "2000-01-01T00:00:00.000Z"
    },
    "an_user_id": ["12345", "12346"]
}
Not exactly sure how the $out conditionally comes into play here, but given two collections as follows:
db.foo.insert([
    {_id: 1, membershipNumber: 1, type: "STD"},
    {_id: 3, membershipNumber: 5, type: "STD"},
    {_id: 8, membershipNumber: 8, type: "STD"}
]);
db.foo2.insert([
    {_id: 1, memref: 1, an_user_id: 1},
    {_id: 2, memref: 1, an_user_id: 2},
    {_id: 3, memref: 1, an_user_id: 3},
    {_id: 4, memref: 5, an_user_id: 5}
    // No lookup for memref 8, just to test
]);
Then this pipeline produces the target output. No initial $project is required.
db.foo.aggregate([
    // Call the join field "an_user_id" because we are going to OVERWRITE
    // it in the next stage. This avoids creating extra fields that we will
    // want to $unset later to minimize clutter.
    {$lookup: {
        from: "foo2",
        localField: "membershipNumber",
        foreignField: "memref",
        as: "an_user_id"
    }},
    // Turn the big array of objects into an array of just an_user_id:
    {$addFields: {an_user_id: {$map: {
        input: "$an_user_id",
        in: "$$this.an_user_id"
    }}}}
]);
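Translated to pymongo against the question's real collections, with the $out stage from the question appended (a sketch; note that $out's db/coll document form needs MongoDB 4.4+, which matches the setup above):

p = [
    {'$lookup': {
        'from': 'an_users',
        'localField': 'membershipNumber',
        'foreignField': 'memref',
        'as': 'an_user_id'
    }},
    # Collapse the array of joined documents down to just the ids.
    {'$addFields': {'an_user_id': {'$map': {
        'input': '$an_user_id',
        'in': '$$this.user_id'
    }}}},
    {'$out': {'db': 'mydb', 'coll': 'new_coll'}}
]
members.aggregate(pipeline=p)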

JSON array out of order

I have a React/Django application where users can answer multiple choice questions. I have the "choices" array rendered onto the UI in this exact order.
{
    "id": 2,
    "question_text": "Is Lebron James the GOAT?",
    "choices": [
        {
            "id": 5,
            "choice_text": "No",
            "votes": 0,
            "percent": 0
        },
        {
            "id": 4,
            "choice_text": "Yes",
            "votes": 1,
            "percent": 100
        }
    ]
}
When I select a choice in development mode, I send a request to Django to increment the votes counter for that choice and it will send back a response with updated votes in the same order. When I try to select a choice in production mode using npm run build, the order becomes switched.
{
    "id": 2,
    "question_text": "Is Lebron James the GOAT?",
    "choices": [
        {
            "id": 4,
            "choice_text": "Yes",
            "votes": 1,
            "percent": 50
        },
        {
            "id": 5,
            "choice_text": "No",
            "votes": 1,
            "percent": 50
        }
    ]
}
I thought the order of a JSON array must be preserved. Can anyone explain why this is happening? I'm almost positive that this issue is originating from Django. Here is the view function in Django.
@api_view(['POST'])
def vote_poll(request, poll_id):
    if request.method == 'POST':
        poll = Poll.objects.get(pk=poll_id)
        selected_choice = Choice.objects.get(pk=request.data['selected_choice_id'])
        selected_choice.votes += 1
        selected_choice.save()
        poll_serializer = PollAndChoicesSerializer(poll)
        return Response({'poll': poll_serializer.data})
You need to set the ordering option in your Choice model Meta if you want to have a consistent order.
class Choice(models.Model):
    class Meta:
        ordering = ['-id']
From the docs:
Warning
Ordering is not a free operation. Each field you add to the ordering incurs a cost to your database. Each foreign key you add will implicitly include all of its default orderings as well.
If a query doesn’t have an ordering specified, results are returned from the database in an unspecified order. A particular ordering is guaranteed only when ordering by a set of fields that uniquely identify each object in the results. For example, if a name field isn’t unique, ordering by it won’t guarantee objects with the same name always appear in the same order.
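If you'd rather not pay that ordering cost on every Choice query, you can order only where the choices are serialized; a sketch, assuming a nested setup along these lines (the field names, the choice_set related name, and the ChoiceSerializer for the nested objects are assumptions):

from rest_framework import serializers

class PollAndChoicesSerializer(serializers.ModelSerializer):
    choices = serializers.SerializerMethodField()

    class Meta:
        model = Poll
        fields = ['id', 'question_text', 'choices']

    def get_choices(self, poll):
        # An explicit order_by makes the response order deterministic
        # without adding a default ordering to the model.
        qs = poll.choice_set.order_by('id')
        return ChoiceSerializer(qs, many=True).data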

How to parse specific data from JSON request

I'm getting into coding, and I'm wondering how I'd go about retrieving the data for "tag_id": 4 specifically.
I know how to get the data for status, but how would I go about getting specific data if there are multiple entries?
r = requests.get('url.com', headers=user_agent).json()
event = r['status']
print(event)
The JSON response looks like this:
{
    "status": "SUCCESS",
    "status_message": "blah blah blah",
    "pri_tag": [
        {
            "tag_id": 1,
            "name": "Tag1"
        },
        {
            "tag_id": 2,
            "name": "Tag2"
        },
        {
            "tag_id": 3,
            "name": "Tag3"
        },
        {
            "tag_id": 4,
            "name": "Tag4"
        }
    ]
}
The for loop answer is sufficient, but this is a good chance to learn how to use list comprehensions, which are ubiquitous and "pythonic":
desired_tag_name = [tag["name"] for tag in r["pri_tag"] if tag["tag_id"] == 4]
List comprehensions are advantageous for readability (I know it may not seem so the first time you look at one) and because they tend to be faster.
There is a wealth of documentation and blog posts out there to understand the syntax better, and I don't prefer any particular one over another.
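If tag_id values are unique and you just want the single value (or a default), a next() over a generator expression is a handy variant:

# First matching name, or None if tag_id 4 is absent.
tag_name = next((tag["name"] for tag in r["pri_tag"] if tag["tag_id"] == 4), None)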
I think you're looking for something like:
tags = r["pri_tag"]
for tag in tags:
    if tag['tag_id'] == 4:
        print(tag['name'])
Output:
Tag4

Accessing nested objects with python

I have a response that I receive from Foursquare in the form of JSON. I have tried to access certain parts of the object but have had no success. How would I access, say, the address of the object? Here is the code that I have tried.
url = 'https://api.foursquare.com/v2/venues/explore'
params = dict(client_id=foursquare_client_id,
              client_secret=foursquare_client_secret,
              v='20170801', ll=lat + ',' + long,
              query=mealType, limit=100)
resp = requests.get(url=url, params=params)
data = json.loads(resp.text)
msg = '{} {}'.format("Restaurant Address:",
                     data['response']['groups'][0]['items'][0]['venue']['location']['address'])
print(msg)
Here is an example of json response:
"items": [
{
"reasons": {
"count": 0,
"items": [
{
"summary": "This spot is popular",
"type": "general",
"reasonName": "globalInteractionReason"
}
]
},
"venue": {
"id": "412d2800f964a520df0c1fe3",
"name": "Central Park",
"contact": {
"phone": "2123106600",
"formattedPhone": "(212) 310-6600",
"twitter": "centralparknyc",
"instagram": "centralparknyc",
"facebook": "37965424481",
"facebookUsername": "centralparknyc",
"facebookName": "Central Park"
},
"location": {
"address": "59th St to 110th St",
"crossStreet": "5th Ave to Central Park West",
"lat": 40.78408342593807,
"lng": -73.96485328674316,
"labeledLatLngs": [
{
"label": "display",
"lat": 40.78408342593807,
"lng": -73.96485328674316
}
],
The full response can be found here.
Like so:
addrs = data['items'][2]['venue']['location']['address']
Your code (at least as far as loading and accessing the object) looks correct to me. I loaded the JSON from a file (since I don't have your Foursquare id) and it worked fine. You are correctly using object/dictionary keys and array positions to navigate to what you want. However, you misspelled "address" in the line where you drill down to the data. Adding the missing 'a' made it work. I'm also correcting the typo in the URL you posted.
I answered this assuming that the example JSON you linked to is what is stored in data. If that isn't the case, a relatively easy way to see exactly what Python has stored in data is to import pprint and use it like so: pprint.pprint(data).
You could also start an interactive python shell by running the program with the -i switch and examine the variable yourself.
data["items"][2]["location"]["address"]
This will access the address for you.
You can go to any level of nesting by using an integer index in the case of an array and a string key in the case of a dict.
Like in your case, items is an array, so we use an integer index:
#items[int index]
items[0]
Now items[0] is a dictionary, so we access it by string keys:
items[0]['venue']['location']
Now again it's an object, so we use a string key:
items[0]['venue']['location']['address']
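When the path gets long, a small helper keeps the drilling readable and tolerates missing keys; a sketch (the dig helper is mine, not from any library):

def dig(obj, *path, default=None):
    # Walk nested dicts/lists by string keys and integer indexes,
    # returning default instead of raising if a step is missing.
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

address = dig(data, 'response', 'groups', 0, 'items', 0,
              'venue', 'location', 'address')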

Parsing child nodes from JSON file with Python

I'm trying to parse specific child nodes from a JSON file using Python.
I know similar questions have been asked and answered before, but I simply haven't been able to translate those solutions to my own problem (disclaimer: I'm not a developer).
This is the beginning of my JSON file (each new "entry" starts at "_index"):
{
    "took": 83,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "failed": 0
    },
    "hits": {
        "total": 713628,
        "max_score": 1.3753585,
        "hits": [{
            "_index": "offentliggoerelser-prod-20161006",
            "_type": "offentliggoerelse",
            "_id": "urn:ofk:oid:5135592",
            "_score": 1.3753585,
            "_source": {
                "cvrNummer": 89986915,
                "indlaesningsId": "AUzWhUXw3pscZq1LGK_z",
                "sidstOpdateret": "2015-04-20T10:53:09.154Z",
                "omgoerelse": false,
                "regNummer": null,
                "offentliggoerelsestype": "regnskab",
                "regnskab": {
                    "regnskabsperiode": {
                        "startDato": "2014-01-01",
                        "slutDato": "2014-12-31"
                    }
                },
                "indlaesningsTidspunkt": "2015-04-20T11:10:53.529Z",
                "sagsNummer": "X15-AA-66-TA",
                "dokumenter": [{
                    "dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzdlL2I5L2U2LzlkLzIxN2EtNDA1OC04Yjg0LTAwZGJlNzUwMjU3Yw.pdf",
                    "dokumentMimeType": "application/pdf",
                    "dokumentType": "AARSRAPPORT"
                }, {
                    "dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzk0LzNlL2RjL2Q4L2I1NjUtNGJjZC05NzJmLTYyMmE4ZTczYWVhNg.xhtml",
                    "dokumentMimeType": "application/xhtml+xml",
                    "dokumentType": "AARSRAPPORT"
                }, {
                    "dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzc5LzM3LzUwLzMxL2NjZWQtNDdiNi1hY2E1LTgxY2EyYjRmOGYzMw.xml",
                    "dokumentMimeType": "application/xml",
                    "dokumentType": "AARSRAPPORT"
                }],
                "offentliggoerelsesTidspunkt": "2015-04-20T10:53:09.075Z"
            }
        },
More specifically, I'm trying to extract all "dokumentUrl" where "dokumentMimeType" is equal to "application/xhtml+xml".
When I use something simple like this:
import json
from pprint import pprint

with open('output.json') as data_file:
    data = json.load(data_file)

pprint(data['hits']['hits'][0]['_source']['dokumenter'][1]['dokumentUrl'])
I get the first URL that matches my criteria. But how do I create a list of all URLs (all 713,628 of them) from the file with the criteria mentioned above and export it to a CSV file?
I should probably mention that my end goal is to create a program that can loop over and scrape my list of URLs (I'll save that for another post!).
Hopefully I understand this right; @roganjosh has a similar idea. You can loop through the specific parts which contain lists of useful things. So, we can do something like:
myURL = []
hits = data['hits']['hits']
for hit in hits:
    # Making the assumption here that you want all of the URLs associated with a given document
    document = hit['_source']['dokumenter']
    for url in document:
        if url['dokumentMimeType'] == "application/xhtml+xml":
            myURL.append(url['dokumentUrl'])
Again, I am hoping that I understand your JSON schema enough that this does what you want it to. At least it should get you close.
Also, I just saw another part of your question regarding CSV output.
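A minimal sketch of that part with the standard library, writing one URL per row (the output filename is an assumption):

import csv

with open('urls.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['dokumentUrl'])  # header row
    for url in myURL:
        writer.writerow([url])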
