I have a collection of documents as follows:
{_id: [unique_id], kids: ['unique_id', 'unique_id', 'unique_id', 'unique_id']}
Each document has multiple fields, but I am only concerned with _id and kids, so I am bringing those into focus (_id being the _id of the parent and kids an array of ids corresponding to the kids).
I have 2 million plus documents in the collection, and what I am looking for is the best possible (i.e. quickest possible) way to retrieve these records. What I have tried initially for 100k documents goes as follows:
-- Plain aggregation:
t = coll.aggregate([{'$project': {'_id': 1, 'kids': 1}},
{'$limit' : 100000},
{'$group': {'_id': '$_id', 'kids': {'$push': "$kids"}}}
])
This takes around 85 seconds to aggregate the 100k documents.
-- Aggregation with a condition:
In some of the documents the kids field is missing, so to get only the relevant documents I add a $match stage with $exists:
t = coll.aggregate([{'$project': {'_id': 1, 'kids': 1}},
{'$match': {'kids': {'$exists': True}}},
{'$limit' : 100000},
{'$group': {'_id': '$_id', 'kids': {'$push': "$kids"}}}
])
This is taking around 190 seconds for 100k records.
-- Recursion with find:
The third technique is to find all the documents and append them to a dictionary, making sure that none of them repeats (since some kids are also parents), so beginning with two parents and recursing:
def agg(qtc):
    qtc = [_id, _id]  # the two starting parent ids
    for a in qtc:
        for b in coll.find({'kids': {'$exists': True}, '_id': ObjectId(a)},
                           {'_id': 1, 'kids': 1}):
            t.append({'_id': str(b['_id']), 'kids': [str(c) for c in b['kids']]})
    # The two lines below assign each id to each 'kid', so that if an id has
    # four kids, the resulting array contains four records for that id, each
    # with a single kid.
    t = [dict(_id=d['_id'], kid=v) for d in t for v in d['kids']]
    t = [dict(tupleized) for tupleized in set(tuple(item.items()) for item in t)]
    for a in flatten(t):
        if a['kid'] in qtc:
            print('skipped')
            continue
        else:
            t.extend(similar_once([k['kid'] for k in t]))
    return t
For this particular one the time remains unknown, as I am not able to figure out how exactly to achieve this.
So, the objective is to get all the kids of all the parents (where some kids are also parents) in the minimum possible time. To mention again, I have tested up to 100k records, and I have 2 million+ records. Any help would be great. Thanks.
Each document's _id is unique, so grouping by the _id does nothing: unique documents go in, and the same unique documents come out. A simple find does all you need:
for document in coll.find(
        {'_id': {'$in': [parent_id1, parent_id2]}},
        {'_id': True, 'kids': True}):
    print(document)
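Since some kids are themselves parents, fetching everything is a breadth-first walk over the parent/kid graph. Here is a minimal sketch with PyMongo, assuming kids stores ObjectIds (wrap them in ObjectId() first if they are strings); each level costs a single indexed $in query, which scales far better than one find() per parent:
def collect_kids(coll, root_ids):
    # Breadth-first walk: fetch each frontier of parents in one $in query.
    seen = set(root_ids)       # ids already visited (kids can repeat)
    frontier = list(root_ids)  # parents still to expand
    edges = []                 # (parent_id, kid_id) pairs collected
    while frontier:
        cursor = coll.find({'_id': {'$in': frontier}, 'kids': {'$exists': True}},
                           {'_id': 1, 'kids': 1})
        next_frontier = []
        for doc in cursor:
            for kid in doc['kids']:
                edges.append((doc['_id'], kid))
                if kid not in seen:
                    seen.add(kid)
                    next_frontier.append(kid)
        frontier = next_frontier
    return edges

# edges = collect_kids(coll, [parent_id1, parent_id2])
Because _id is indexed by default, the number of queries equals the depth of the hierarchy rather than the number of documents.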
Using this view.py query, my output looks like the JSON below. As you can see, the choices field contains multiple arrays, and I want to normalize them one row per choice. Here is my JSON:
{"pages":[{"name":"page1","title":"SurveyWindow Pvt. Ltd. Customer Feedback","description":"Question marked * are compulsory.",
"elements":[{"type":"radiogroup","name":"question1","title":"Do you like our product? *","isRequired":true,
"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]},{"type":"checkbox","name":"question2","title":"Please Rate Our PM Skill","isRequired":false,"choices":[{"value":"High","text":"High"},{"value":"Low","text":"Low"},{"value":"Medium","text":"Medium"}]},{"type":"radiogroup","name":"question3","title":"Do you like our services? *","isRequired":true,"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]}]}]}
This is my view.py:
jsondata=SurveyMaster.objects.all().filter(survey_id='1H2711202014572740')
q = jsondata.values('survey_json_design')
qs_json = pd.DataFrame.from_records(q)
datatotable = pd.json_normalize(qs_json['survey_json_design'], record_path=['pages','elements'])
qs_json = datatotable.to_html()
Based on your comments and picture, here's what I would do to go from the picture to something more SQL-friendly (what you refer to as "normalization"), but keep in mind this might blow up if you don't have sufficient memory.
Create a new list which you'll fill with the new data, then iterate over the pandas table's rows, and then over every item in your list. For every iteration of the inner loop, use the data from the row (minus the column you're iterating over). For convenience I added it as the last element.
# Example data
df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "rajesh"]})
data = []
for i, row in df.iterrows():
    for val in row["choices"]:
        data.append((*row.drop("choices").values, val))
df = pd.DataFrame(data, columns=["names", "choices"])
print(df)
    names                          choices
0  kostas  {'text': 'yes', 'value': 'yes'}
1  kostas    {'text': 'no', 'value': 'no'}
2  rajesh             {'ch1': 1, 'ch2': 2}
3  rajesh                   {'ch3': 'ch3'}
This is where I guess you want to go. All that's left is to just modify the column / variable names with your own data.
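If you are on pandas 0.25 or newer, DataFrame.explode does the same row expansion in one call; a minimal sketch on the example data above:
import pandas as pd

df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "rajesh"]})

# One row per list element; the other columns are repeated automatically.
exploded = df.explode("choices").reset_index(drop=True)
print(exploded)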
I have a dataframe that contains data I want to upload into MongoDB. Below is the data:
MongoRow = pd.DataFrame.from_dict({'school': {1: schoolID}, 'student': {1: student}, 'date': {1: dateToday}, 'Probability': {1: probabilityOfLowerThanThreshold}})
                     school                   student        date  Probability
1  5beee5678d62101c9c4e7dbb  5bf3e06f9a892068705d8420  2020-03-27     0.000038
I have the following code which checks if a row in mongo contains the same student ID and date, if it doesn't then it adds the row:
def getPredictions(school):
    schoolDB = DB[school['database']['name']]
    schoolPredictions = schoolDB['session_attendance_predicted']
    Predictions = schoolPredictions.aggregate([{
        '$project': {
            'school': '$school',
            'student': '$student',
            'date': '$date'
        }
    }])
    return list(Predictions)
Predictions = getPredictions(school)
Predictions = pd.DataFrame(Predictions)
schoolDB = DB[school['database']['name']]
collection = schoolDB['session_attendance_predicted']
import json
for i in Predictions.index:
    schoolOld = Predictions.loc[i, 'school']
    studentOld = Predictions.loc[i, 'student']
    dateOld = Predictions.loc[i, 'date']
    if studentOld == student and dateToday == dateOld:
        print("Student Exists")
        # UPDATE THE ROW WITH NEW VALUES
    else:
        print("Student Doesn't Exist")
        records = json.loads(MongoRow.T.to_json()).values()
        collection.insert(records)
However if it does exist, I want it to update the row with the new values. Does anyone know how to do this? I have looked at pymongo upsert but I'm not sure how to use it. Can anyone help?
UPDATE:
The above is partly working now; however, I am now getting an error with the following code:
dateToday = datetime.datetime.combine(dateToday, datetime.time(0, 0))
MongoRow = pd.DataFrame.from_dict({'school': {1: schoolID}, 'student': {1: student}, 'date': {1: dateToday}, 'Probability': {1: probabilityOfLowerThanThreshold}})
data_dict = MongoRow.to_dict()
for i in Predictions.index:
    print(Predictions)
    collection.replace_one({'student': student, 'date': dateToday}, data_dict, upsert=True)
Error:
InvalidDocument: documents must have only string keys, key was 1
Probably a number of people are going to be confused by the accepted answer, as it suggests using replace_one with the upsert flag.
Upserting means 'update or insert' (up = update, sert = insert). Most people looking to upsert should be using update_one with the upsert flag.
For example:
collection.update_one({'matchable_field': field_data_to_match}, {"$set": upsertable_data}, upsert=True)
To upsert you cannot use insert() (deprecated), insert_one(), or insert_many(). You must use one of the collection-level operators that supports upserting.
To get started I would point you towards reading the dataframe line by line and using replace_one() on each line. There are more advanced ways of doing this but this is the easiest.
Your code will look a bit like:
collection.replace_one({'Student': student, 'Date': date}, record, upsert=True)
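As for the InvalidDocument error in the update: DataFrame.to_dict() defaults to orient='dict', which makes the integer row index (the 1) a key in each nested dict. A sketch of the row-by-row upsert using orient='records' instead, reusing the collection and MongoRow names from the question:
# to_dict(orient='records') yields a list of plain {column: value} dicts,
# so document keys are the column names (strings), not the row index.
for record in MongoRow.to_dict(orient='records'):
    collection.update_one(
        {'student': record['student'], 'date': record['date']},
        {'$set': record},
        upsert=True)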
I have a list of dictionaries as shown below, and I would like to extract the partID and the corresponding quantity for a specific orderID using Python, but I don't know how to do it.
dataList = [{'orderID': 'D00001', 'customerID': 'C00001', 'partID': 'P00001', 'quantity': 2},
{'orderID': 'D00002', 'customerID': 'C00002', 'partID': 'P00002', 'quantity': 1},
{'orderID': 'D00003', 'customerID': 'C00003', 'partID': 'P00001', 'quantity': 1},
{'orderID': 'D00004', 'customerID': 'C00004', 'partID': 'P00003', 'quantity': 3}]
So for example, when I search my dataList for a specific orderID == 'D00003', I would like to receive both the partID ('P00001') and the corresponding quantity (1) of the specified order. How would you go about this? Any help is much appreciated.
It depends.
If you are not going to do this a lot of times, you can just iterate over the list of dictionaries until you find the "correct" one:
search_for_order_id = 'D00001'
for d in dataList:
    if d['orderID'] == search_for_order_id:
        print(d['partID'], d['quantity'])
        break  # assuming orderID is unique
Outputs
P00001 2
Since this solution is O(n), if you are going to do this search a lot of times it will add up.
In that case it will be better to transform the data to a dictionary of dictionaries, with orderID being the outer key (again, assuming orderID is unique):
better = {d['orderID']: d for d in dataList}
This is also O(n) but you pay it only once. Any subsequent lookup is an O(1) dictionary lookup:
search_for_order_id = 'D00001'
print(better[search_for_order_id]['partID'], better[search_for_order_id]['quantity'])
Also outputs
P00001 2
I believe you would like to familiarize yourself with the pandas package, which is very useful for data analysis. If these are the kinds of problems you're up against, I advise you to take the time to work through a pandas tutorial. It can do a lot, and is very popular.
Your dataList is very similar to a DataFrame structure, so what you're looking for would be as simple as:
import pandas as pd
df = pd.DataFrame(dataList)
df[df['orderID']=='D00003']
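To pull just the two fields out of the filtered frame, a small follow-up on the same df:
row = df.loc[df['orderID'] == 'D00003', ['partID', 'quantity']]
print(row.iloc[0]['partID'], row.iloc[0]['quantity'])  # P00001 1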
You can use this:
results = [[x['orderID'], x['partID'], x['quantity']] for x in dataList]
for i in results:
    print(i)
Also,
results = [['Order ID: ' + x['orderID'], 'Part ID: ' + x['partID'],
            'Quantity: ' + str(x['quantity'])] for x in dataList]
To get the partID you can make use of the filter function.
myData = [{"x": 1, "y": 1}, {"x": 2, "y": 5}]
filtered = filter(lambda item: item["x"] == 1, myData)  # search for an object with x equal to 1
# Get the next item from the filter (the matching item) and get the y property.
print(next(filtered)["y"])
You should be able to apply this to your situation.
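Applied to the dataList from the question, that would look roughly like this:
match = next(filter(lambda d: d['orderID'] == 'D00003', dataList), None)
if match is not None:
    print(match['partID'], match['quantity'])  # P00001 1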
I'm trying to return a list of unit ids where the date field is None.
The example below is just a snippet. A company can have several hundred unit ids, but I only want to return a list of active units (where 'validUntil' is None).
'_source': {'company': {'companyId': 1,
                        'unit': [{'unitId': 1,
                                  'period': {'validUntil': '2016-02-07'}},
                                 {'unitId': 2,
                                  'period': {'validUntil': None}}]}}
payload = {
    "size": 200,
    "_source": "company.companyId.unitId",
    "query": {
        "term": {
            "company.companyId": "1"
        }
    }
}
I have tried several different things (filter, must_not exists etc.), but either the searches return all unit id's pertaining to that company id or nothing, making me suspect that I'm not filtering correctly.
The date format is 'dateOptionalTime' if that is any help.
It looks like your problem might not be in the query itself.
As far as I know, you cannot return only part of the array if its type is not nested.
I recommend looking at this question:
select matching objects from array in elasticsearch
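For completeness, if company.unit were mapped as a nested type, a query along these lines should return only the active units via inner_hits. This is only a sketch; the field paths are assumptions based on the snippet above:
payload = {
    "size": 200,
    "_source": False,  # rely on inner_hits for the matching units
    "query": {
        "bool": {
            "filter": [{"term": {"company.companyId": "1"}}],
            "must": [{
                "nested": {
                    "path": "company.unit",
                    "query": {
                        "bool": {
                            "must_not": {
                                "exists": {"field": "company.unit.period.validUntil"}
                            }
                        }
                    },
                    "inner_hits": {"_source": ["company.unit.unitId"]}
                }
            }]
        }
    }
}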
I have been searching for an answer, probably just not using the right verbiage, and have only come up with using lists as dictionary key values.
I need to take 20 csv files and anonymize identifying student, teacher, school and district information for research purposes on testing data. The csv files range anywhere from 20K to 50K rows and 11 to 20 columns, and not all have identical information.
One file may have:
studid, termdates, testname, score, standarderr
And another may have:
termdates, studid, studfirstname, studlastname, studdob, ethnicity, grade
And yet another may have:
termdates, studid, teacher, classname, schoolname, districtname
I am putting the varying data into dictionaries for each type of file/dataset (maybe this isn't the best approach), but I am getting stuck when trying to use a dictionary as a key value for when a student may have taken multiple tests, i.e. Language, Reading, Math, etc.
For instance:
studDict{studid{'newid': 12345, 'dob': 1/1/1, test1:{'score': 50, 'date': 1/1/15}, test2:{'score': 50, 'date': 1/1/15}, 'school': 'Hard Knocks'},
studid1{'newid': 12345, 'dob': 1/1/1, test1:{'score': 50, 'date': 1/1/15}, test2:{'score': 50, 'date': 1/1/15}, 'school': 'Hard Knocks'}}
Any guidance on which libraries or a brief direction to a method would be greatly appreciated. I understand enough Python that I do not need a full hand holding, but helping me get across the street would be great. :D
CLARIFICATION
I have a better chance of winning the lottery than this project does of being used more than once, so the simpler the method the better. If it would be a repeating project I would most likely dump the data into db tables and work from there.
A dictionary cannot be a key, but a dictionary can be a value for some key in another dictionary (a dict-of-dicts). However, instantiating dictionaries of varying length for every tuple is probably going to make your data analysis very difficult.
Consider using Pandas to read the tuples into a DataFrame with null values where appropriate.
dict API: https://docs.python.org/2/library/stdtypes.html#mapping-types-dict
Pandas Data handling package: http://pandas.pydata.org/
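A sketch of that Pandas route, with hypothetical file names; an outer merge on the shared columns leaves NaN where a file lacks a field:
import pandas as pd
from functools import reduce

# Hypothetical file names; substitute your 20 csv files.
files = ["scores.csv", "demographics.csv", "schools.csv"]
frames = [pd.read_csv(f) for f in files]

# Outer merge keeps every student even when a file has no row for them.
merged = reduce(lambda a, b: a.merge(b, on=["studid", "termdates"], how="outer"),
                frames)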
You cannot use a dictionary as a key to a dictionary. Keys must be hashable, and dictionaries are mutable and unhashable, so they cannot be used as keys.
You can store a dictionary in another dictionary just the same as any other value. You can, for example do
studDict = {studid: {'newid': 12345, 'dob': '1/1/1', 'test1': {'score': 50, 'date': '1/1/15'}, 'test2': {'score': 50, 'date': '1/1/15'}, 'school': 'Hard Knocks'},
            studid1: {'newid': 12345, 'dob': '1/1/1', 'test1': {'score': 50, 'date': '1/1/15'}, 'test2': {'score': 50, 'date': '1/1/15'}, 'school': 'Hard Knocks'}}
assuming you have defined studid and studid1 elsewhere.
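If what you actually need is a compound key, use a tuple rather than a dict; tuples of hashable values are hashable:
# A (studid, testname) tuple works as a key where a dict would not.
scores = {}
scores[('12345', 'Reading')] = {'score': 50, 'date': '1/1/15'}
scores[('12345', 'Math')] = {'score': 47, 'date': '1/2/15'}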
If I interpret you correctly, in the end you want a dict with students (i.e. studid) as keys and different student-related data as values? This is probably not exactly what you want, but I think it will point you in the right direction (adapted from this answer):
import csv
from collections import namedtuple, defaultdict
D = defaultdict(list)
for filename in files:  # 'files' is your list of csv file paths
    with open(filename, mode="r") as infile:
        reader = csv.reader(infile)
        Data = namedtuple("Data", next(reader))  # field names from the header row
        for row in reader:
            data = Data(*row)
            D[data.studid].append(data)
In the end that should give you a dict D with studids as keys and a list of test results as values. Each test result is a namedtuple. This assumes that every file has a studid column!
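Looking up a student then gives you all of their rows, with columns addressable by name; for example, with the header of the first file (the studid is hypothetical):
for result in D['12345']:
    print(result.testname, result.score)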
If you can know the order of a file ahead of time, it's not hard to make a dictionary for it with help from csv.
File tests.csv:
12345,2015-05-19,AP_Bio,96,0.12
67890,2015-04-28,AP_Calc,92,0.17
In a Python file in the same directory as tests.csv:
import csv

with open("tests.csv") as tests:
    # Change the fields for files that follow a different form
    fields = ["studid", "termdates", "testname", "score", "standarderr"]
    students_data = list(csv.DictReader(tests, fieldnames=fields))

# Just a pretty show
print(*students_data, sep="\n")
# {'studid': '12345', 'testname': 'AP_Bio', 'standarderr': '0.12', 'termdates': '2015-05-19', 'score': '96'}
# {'studid': '67890', 'testname': 'AP_Calc', 'standarderr': '0.17', 'termdates': '2015-04-28', 'score': '92'}
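For the anonymization step itself, one simple approach is a single mapping from real ids to generated ones, shared across all 20 files so a student keeps the same new id everywhere. A sketch building on students_data from above:
import itertools

counter = itertools.count(10000)
id_map = {}  # real studid -> anonymized id, shared across all files

def anonymize(real_id):
    # Return a stable fake id for this real id.
    if real_id not in id_map:
        id_map[real_id] = str(next(counter))
    return id_map[real_id]

for student in students_data:
    student['studid'] = anonymize(student['studid'])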
Please be more explicit; the solution depends on the design.
In a district you have schools, and in each school you have teachers or students.
First, order your data by district and school:
districts = {
"name_district1":{...},
"name_district2":{...},
...,
"name_districtn":{...},
}
For each district:
# "name_districtn"
{
"name_school1": {...},
"name_school2": {...},
...,
"name_schooln": {...},
}
For each school:
#"name_schooln"
{
id_student1: {...},
id_student2: {...},
...,
id_studentn: {...}
}
And for each student you define their elements.
You can also define one dictionary for all the students, but then you have to design a unique id for each student, for example:
uniq_Id = "".join(("name_district","name_school", str(student_id)))
Total = {
    uniq_Id: {'dob': '1/1/1', 'test1': {'score': 50, 'date': '1/1/15'}, 'test2': {'score': 50, 'date': '1/1/15'}, 'school': 'Hard Knocks'},
    ...,
}
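If the composite ids also need to be non-reversible for research use, hashing the joined string is one small step further; a sketch using hashlib:
import hashlib

def uniq_id(district, school, student_id):
    # Truncated SHA-256 of the composite key: stable across files,
    # and not trivially reversible back to the source values.
    raw = "".join((district, school, str(student_id)))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]

print(uniq_id("name_district1", "name_school1", 42))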