How to import data from mongodb to pandas? - python

I have a large amount of data in a collection in MongoDB which I need to analyze. How do I import that data into pandas?
I am new to pandas and numpy.
EDIT:
The mongodb collection contains sensor values tagged with date and time. The sensor values are of float datatype.
Sample Data:
{
    "_cls" : "SensorReport",
    "_id" : ObjectId("515a963b78f6a035d9fa531b"),
    "_types" : [
        "SensorReport"
    ],
    "Readings" : [
        {
            "a" : 0.958069536790466,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
            "b" : 6.296118156595,
            "_cls" : "Reading"
        },
        {
            "a" : 0.95574014778624,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
            "b" : 6.29651468650064,
            "_cls" : "Reading"
        },
        {
            "a" : 0.953648289182713,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
            "b" : 7.29679823731148,
            "_cls" : "Reading"
        },
        {
            "a" : 0.955931884300997,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
            "b" : 6.29642922525632,
            "_cls" : "Reading"
        },
        {
            "a" : 0.95821381,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
            "b" : 7.28956613,
            "_cls" : "Reading"
        },
        {
            "a" : 4.95821335,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
            "b" : 6.28956574,
            "_cls" : "Reading"
        },
        {
            "a" : 9.95821341,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
            "b" : 0.28956488,
            "_cls" : "Reading"
        },
        {
            "a" : 1.95667927,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
            "b" : 0.29115237,
            "_cls" : "Reading"
        }
    ],
    "latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
    "sensorName" : "56847890-0",
    "reportCount" : 8
}

pymongo might give you a hand; the following is some code I'm using:
import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)
    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and store into DataFrame """
    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)
    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))
    # Delete the _id
    if no_id:
        del df['_id']
    return df
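A hedged usage sketch against the sample data above (the database name is an assumption):
# 'sensor_db' is a placeholder database name; 'SensorReport' matches the sample document
df = read_mongo('sensor_db', 'SensorReport', query={'sensorName': '56847890-0'})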

You can load your MongoDB data into a pandas DataFrame using this code. It works for me; hopefully it will for you too.
import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))

As per PEP 20 (the Zen of Python), simple is better than complex:
import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())
You can include conditions as you would when working with a regular MongoDB database, or even use find_one() to get only one element from the database, etc.
and voila!
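For example, a hedged sketch that pushes a filter and a projection into find(), and fetches a single document with find_one() (the database and collection names are assumptions):
import pandas as pd
from pymongo import MongoClient

collection = MongoClient()['sensor_db']['SensorReport']   # placeholder names

# Only load matching documents, and only the fields you need
df = pd.DataFrame.from_records(
    collection.find({'reportCount': {'$gt': 5}}, {'_id': 0, 'sensorName': 1, 'reportCount': 1})
)

# Or fetch just one document
doc = collection.find_one({'sensorName': '56847890-0'})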

Monary does exactly that, and it's super fast.
See this cool post, which includes a quick tutorial and some timings.

Another option I found very useful is:
from pandas.io.json import json_normalize
cursor = my_collection.find()
df = json_normalize(cursor)
(or json_normalize(list(cursor)), depending on your python/pandas versions).
This way you get the unfolding of nested mongodb documents for free.
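For the nested Readings array in the sample document above, a hedged sketch (the record_path and meta field names are taken from that sample; connection names are placeholders):
import pandas as pd
from pymongo import MongoClient

collection = MongoClient()['sensor_db']['SensorReport']   # placeholder names

cursor = collection.find()
# One row per element of the nested "Readings" array,
# carrying the parent-level sensorName along as metadata
df = pd.json_normalize(list(cursor), record_path='Readings', meta=['sensorName'])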

import pandas as pd
from odo import odo
data = odo('mongodb://localhost/db::collection', pd.DataFrame)

For dealing with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try Python Blaze ecosystem: Blaze / Dask / Odo.
Blaze (and Odo) has out-of-the-box functions to deal with MongoDB.
A few useful articles to start off:
Introducing Blaze Expressions (with MongoDB query example)
ReproduceIt: Reddit word count
Difference between Dask Arrays and Blaze
And an article which shows what amazing things are possible with Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 Gb of Reddit comments in seconds).
P.S. I'm not affiliated with any of these technologies.

Using pandas.DataFrame(list(...)) will consume a lot of memory if the iterator/generator result is large; it is better to generate small chunks and concatenate them at the end:
def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames)
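A hedged usage sketch with a pymongo cursor (database/collection names and chunk size are placeholders):
import pandas as pd
from pymongo import MongoClient

cursor = MongoClient()['sensor_db']['SensorReport'].find()
df = iterator2dataframes(cursor, 10000)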

You can also use PyMongoArrow -- it's an official library offered by MongoDB for exporting MongoDB data to pandas, NumPy, Parquet files, etc.
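A minimal sketch based on the PyMongoArrow docs (the connection string and database/collection names are assumptions):
from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()  # adds find_pandas_all()/aggregate_pandas_all() to pymongo collections

collection = MongoClient('mongodb://localhost:27017')['sensor_db']['SensorReport']
df = collection.find_pandas_all({'reportCount': {'$gte': 1}})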

http://docs.mongodb.org/manual/reference/mongoexport
Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records().
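For example, one might export with something like mongoexport --db sensor_db --collection SensorReport --type=csv --fields sensorName,reportCount --out sensors.csv (the field list and names here are assumptions) and then load the file:
import pandas as pd

# Load the CSV produced by mongoexport (path and columns are assumptions)
df = pd.read_csv('sensors.csv')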

You can achieve what you want with pdmongo in three lines:
import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")
If your data is very large, you can run an aggregate query first, filtering out the data you do not want and mapping it to your desired columns.
Here is an example of mapping Readings.a to column a and filtering by reportCount column:
import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")
read_mongo accepts the same arguments as pymongo aggregate

Following this great answer by waitingkuo, I would like to add the possibility of doing that using chunksize, in line with .read_sql() and .read_csv(). I extend the answer from Deu Leung by avoiding going one by one through each 'record' of the 'iterator'/'cursor'.
I will borrow the previous read_mongo function.
def read_mongo(db,
               collection, query={},
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo and store into DataFrame in chunks """
    # Connect to MongoDB
    client = MongoClient(host=host, port=port)
    db_aux = client[db]

    # Number of matching documents (count_documents replaces the removed cursor.count())
    n_docs = db_aux[collection].count_documents(query)

    # Chunk boundaries: 0, chunksize, 2*chunksize, ..., n_docs
    skips_variable = list(range(0, n_docs, int(chunksize)))
    skips_variable.append(n_docs)

    # Iteration to create the dataframe in chunks.
    for i in range(1, len(skips_variable)):
        # Expand the cursor slice and construct a partial DataFrame
        df_aux = pd.DataFrame(list(db_aux[collection].find(query)[skips_variable[i - 1]:skips_variable[i]]))
        if no_id:
            del df_aux['_id']
        # Concatenate the chunks into a single df
        if 'df' not in locals():
            df = df_aux
        else:
            df = pd.concat([df, df_aux], ignore_index=True)
    return df
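A hedged usage sketch (database and collection names are placeholders):
df = read_mongo('mydb', 'SensorReport', query={'reportCount': {'$gte': 1}}, chunksize=5000)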

A similar approach to Rafael Valero, waitingkuo and Deu Leung, using pagination:
def read_mongo(db,
               collection, query=None,
               host='localhost', port=27017, username=None, password=None,
               chunksize=100, page_num=1, no_id=True):
    """ Read one page of a Mongo collection into a DataFrame """
    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    # Calculate number of documents to skip
    skips = chunksize * (page_num - 1)
    # query=None avoids a mutable default argument, a common Python pitfall; see (in Spanish):
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}
    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)
    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))
    # Delete the _id
    if no_id:
        del df['_id']
    return df
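A hedged usage sketch that pages through a whole collection and concatenates the pages (the database/collection names and page size are placeholders; it reuses _connect_mongo from the first answer to count the documents):
import pandas as pd
from math import ceil

coll = _connect_mongo('localhost', 27017, None, None, 'mydb')['SensorReport']
n_pages = ceil(coll.count_documents({}) / 1000)

df = pd.concat(
    (read_mongo('mydb', 'SensorReport', chunksize=1000, page_num=p) for p in range(1, n_pages + 1)),
    ignore_index=True,
)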

Start the mongo shell with:
mongosh
Scroll up in the shell until you see the connection string mongo is connected to. It should look something like this:
mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4
Copy and paste that connection string into MongoClient.
Here is the code:
from pymongo import MongoClient
import pandas as pd
client = MongoClient('mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4')
mydatabase = client.yourdatabasename
mycollection = mydatabase.yourcollectionname
cursor = mycollection.find()
listofDocuments = list(cursor)
df = pd.DataFrame(listofDocuments)
df

You can use the "pandas.json_normalize" method:
import pandas as pd
display(pd.json_normalize( x ))
display(pd.json_normalize( x , record_path="Readings" ))
It should display two tables, where x is your cursor or, for example, the sample document below:
from bson import ObjectId

def ISODate(st):
    return st
x = {
    "_cls" : "SensorReport",
    "_id" : ObjectId("515a963b78f6a035d9fa531b"),
    "_types" : [
        "SensorReport"
    ],
    "Readings" : [
        {
            "a" : 0.958069536790466,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
            "b" : 6.296118156595,
            "_cls" : "Reading"
        },
        {
            "a" : 0.95574014778624,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
            "b" : 6.29651468650064,
            "_cls" : "Reading"
        },
        {
            "a" : 0.953648289182713,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
            "b" : 7.29679823731148,
            "_cls" : "Reading"
        },
        {
            "a" : 0.955931884300997,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
            "b" : 6.29642922525632,
            "_cls" : "Reading"
        },
        {
            "a" : 0.95821381,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
            "b" : 7.28956613,
            "_cls" : "Reading"
        },
        {
            "a" : 4.95821335,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
            "b" : 6.28956574,
            "_cls" : "Reading"
        },
        {
            "a" : 9.95821341,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
            "b" : 0.28956488,
            "_cls" : "Reading"
        },
        {
            "a" : 1.95667927,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
            "b" : 0.29115237,
            "_cls" : "Reading"
        }
    ],
    "latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
    "sensorName" : "56847890-0",
    "reportCount" : 8
}

Related

Removing items from JSON using Python loop

How do I iterate over the data and keep object keys that have the string "Java" in the value, and remove keys with the string "Javascript" in the value, in addition to the iterations I already have in my code? For example, this key has the word 'Java' in the value:
"value" : "A vulnerability in the encryption implementation of EBICS messages in the open source librairy ebics-java/ebics-java-client allows an attacker sniffing network traffic to decrypt EBICS payloads. This issue affects: ebics-java/ebics-java-client versions prior to 1.2."
The current code below iterates through other JSON items (that are also needed), but does not handle the Java/Javascript issue.
from encodings import utf_8
import json
from zipfile import ZipFile
from urllib.request import urlretrieve
from io import BytesIO
import os

url = "https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2022.json.zip"
urlretrieve(url, "nvdcve-1.1-2022.json.zip")

with ZipFile('nvdcve-1.1-2022.json.zip', 'r') as zip:
    zip.extractall('.')

# direct python to open json file using specific encoding to avoid encoding error
with open('nvdcve-1.1-2022.json', encoding='utf-8') as x:
    data = json.load(x)

# function to sort through without rewriting code with parameters/arguments passed into the function(variable)
def base_score(metric):
    if 'baseMetricV3' not in metric['impact']:
        # no values = 0 so it will auto sort by ID
        return (0, metric['cve']['CVE_data_meta']['ID'])
    # sorts by ID if two or more base scores are equal
    # return allows assignment of function output to new variable
    return (metric['impact']['baseMetricV3']['cvssV3']['baseScore'], metric['cve']['CVE_data_meta']['ID'])

for CVE_Item in data['CVE_Items']:
    for node in CVE_Item['configurations']['nodes']:
        # removes items while iterating through them
        node['cpe_match'][:] = [item for item in node['cpe_match'] if item['vulnerable']]
        # also check children objects for vulnerable
        if node['children']:
            for children_node in node['children']:
                children_node['cpe_match'][:] = [item for item in children_node['cpe_match'] if item['vulnerable']]

# sorts data in descending order using reverse
data['CVE_Items'].sort(reverse=True, key=base_score)

# write file to current working directory
with open('sorted_nvdcve-1.1-2022.json', 'w') as new_file:
    new_file.write(json.dumps(data, indent=4))

if os.path.exists('nvdcve-1.1-2022.json.zip'):
    os.remove('nvdcve-1.1-2022.json.zip')
else:
    print("The file does not exist")

if os.path.exists('nvdcve-1.1-2022.json'):
    os.remove('nvdcve-1.1-2022.json')
else:
    print("The file does not exist")
Here is the link to the original JSON file (too large to post the entire text here):
https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2022.json.zip
The key 'value' is located in the 'description' list.
Here is a sample of the JSON text:
{
    "CVE_data_type" : "CVE",
    "CVE_data_format" : "MITRE",
    "CVE_data_version" : "4.0",
    "CVE_data_numberOfCVEs" : "15972",
    "CVE_data_timestamp" : "2022-11-01T07:00Z",
    "CVE_Items" : [ {
        "cve" : {
            "data_type" : "CVE",
            "data_format" : "MITRE",
            "data_version" : "4.0",
            "CVE_data_meta" : {
                "ID" : "CVE-2022-0001",
                "ASSIGNER" : "secure#intel.com"
            },
            "problemtype" : {
                "problemtype_data" : [ {
                    "description" : [ {
                        "lang" : "en",
                        "value" : "NVD-CWE-noinfo"
                    } ]
                } ]
            },
            "references" : {
                "reference_data" : [ {
                    "url" : "https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00598.html",
                    "name" : "https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00598.html",
                    "refsource" : "MISC",
                    "tags" : [ "Vendor Advisory" ]
                }, {
                    "url" : "http://www.openwall.com/lists/oss-security/2022/03/18/2",
                    "name" : "[oss-security] 20220318 Xen Security Advisory 398 v2 - Multiple speculative security issues",
                    "refsource" : "MLIST",
                    "tags" : [ "Mailing List", "Third Party Advisory" ]
                }, {
                    "url" : "https://www.oracle.com/security-alerts/cpujul2022.html",
                    "name" : "N/A",
                    "refsource" : "N/A",
                    "tags" : [ "Patch", "Third Party Advisory" ]
                }, {
                    "url" : "https://security.netapp.com/advisory/ntap-20220818-0004/",
                    "name" : "https://security.netapp.com/advisory/ntap-20220818-0004/",
                    "refsource" : "CONFIRM",
                    "tags" : [ "Third Party Advisory" ]
                } ]
            },
            "description" : {
                "description_data" : [ {
                    "lang" : "en",
                    "value" : "JavaScript sharing of branch predictor selectors between contexts in some Intel(R) Processors may allow an authorized user to potentially enable information disclosure via local access."
                } ]
            }
Add this inside the for CVE_Item loop.
CVE_Item['cve']['description']['description_data'] = [
    d for d in CVE_Item['cve']['description']['description_data']
    if 'Java' in d['value'] and 'JavaScript' not in d['value']]
The modified loop looks like:
for CVE_Item in data['CVE_Items']:
    CVE_Item['cve']['description']['description_data'] = [
        d for d in CVE_Item['cve']['description']['description_data']
        if 'Java' in d['value'] and 'JavaScript' not in d['value']]
    for node in CVE_Item['configurations']['nodes']:
        # removes items while iterating through them
        node['cpe_match'][:] = [item for item in node['cpe_match'] if item['vulnerable']]
        # also check children objects for vulnerable
        if node['children']:
            for children_node in node['children']:
                children_node['cpe_match'][:] = [item for item in children_node['cpe_match'] if item['vulnerable']]

Trouble understanding mongo aggregation

I am trying to list all the virtual machines (vms) in my mongo database that use a certain data store, EMC_123.
I have this script, but it lists vms that do not use the data store EMC_123.
#!/usr/bin/env python
import pprint
import pymongo


def run_query():
    server = '127.0.0.1'
    client = pymongo.MongoClient("mongodb://%s:27017/" % server)
    db = client["data_center_test"]
    collection = db["data_centers"]
    pipeline = [
        {"$match": {"clusters.hosts.vms.data_stores.name": "EMC_123"}},
        {"$group": {"_id": "$clusters.hosts.vms.name"}}
    ]
    for doc in list(db.data_centers.aggregate(pipeline)):
        pp = pprint.PrettyPrinter()
        pp.pprint(doc)
    pp.pprint(db.command('aggregate', 'data_centers', pipeline=pipeline, explain=True))


def main():
    run_query()
    return 0


# Start program
if __name__ == "__main__":
    main()
I assume there is something wrong with my pipeline.
Here is the plan that gets printed out:
{u'ok': 1.0,
 u'stages': [{u'$cursor': {u'fields': {u'_id': 0,
                                       u'clusters.hosts.vms.name': 1},
                           u'query': {u'clusters.hosts.vms.data_stores.name': u'EMC_123'},
                           u'queryPlanner': {u'indexFilterSet': False,
                                             u'namespace': u'data_center_test.data_centers',
                                             u'parsedQuery': {u'clusters.hosts.vms.data_stores.name': {u'$eq': u'EMC_123'}},
                                             u'plannerVersion': 1,
                                             u'rejectedPlans': [],
                                             u'winningPlan': {u'direction': u'forward',
                                                              u'filter': {u'clusters.hosts.vms.data_stores.name': {u'$eq': u'EMC_123'}},
                                                              u'stage': u'COLLSCAN'}}}},
             {u'$group': {u'_id': u'$clusters.hosts.vms.name'}}]}
UPDATE:
Here is a skeleton of what the document looks like:
{
    "name" : "data_center_name",
    "clusters" : [
        {
            "hosts" : [
                {
                    "name" : "esxi-hostname",
                    "vms" : [
                        {
                            "data_stores" : [ { "name" : "EMC_123" } ],
                            "name" : "vm-name1",
                            "networks" : [ { "name" : "vlan334" } ]
                        },
                        {
                            "data_stores" : [ { "name" : "some_other_data_store" } ],
                            "name" : "vm-name2",
                            "networks" : [ { "name" : "vlan334" } ]
                        }
                    ]
                }
            ],
            "name" : "cluster_name"
        }
    ]
}
The problem I am seeing is that vm-name2 shows up in the results when it doesn't have EMC_123 as a data store.
Update 2:
OK, I am able to write a mongo shell query that does what I want. It is a little ugly:
db.data_centers.aggregate({$unwind: '$clusters'}, {$unwind: '$clusters.hosts'}, {$unwind: '$clusters.hosts.vms'}, {$unwind: '$clusters.hosts.vms.data_stores'}, {$match: {"clusters.hosts.vms.data_stores.name": "EMC_123"}})
I came across this in the second answer of this SO question: MongoDB Projection of Nested Arrays
Based on the answers in MongoDB Projection of Nested Arrays I had to change my pipeline to this:
pipeline = [
    {'$unwind': '$clusters'},
    {'$unwind': '$clusters.hosts'},
    {'$unwind': '$clusters.hosts.vms'},
    {'$unwind': '$clusters.hosts.vms.data_stores'},
    {'$match': {"clusters.hosts.vms.data_stores.name": "EMC_123"}}
]
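If the goal is just the list of VM names, a hedged sketch adds a $group stage after the $match (field paths follow the skeleton document above; connection details match the question's script):
import pymongo

client = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
db = client["data_center_test"]

pipeline = [
    {'$unwind': '$clusters'},
    {'$unwind': '$clusters.hosts'},
    {'$unwind': '$clusters.hosts.vms'},
    {'$unwind': '$clusters.hosts.vms.data_stores'},
    {'$match': {'clusters.hosts.vms.data_stores.name': 'EMC_123'}},
    # collapse back to one row per distinct VM name
    {'$group': {'_id': '$clusters.hosts.vms.name'}}
]
for doc in db.data_centers.aggregate(pipeline):
    print(doc['_id'])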

inserting multiple rows into mongo

I have a pandas data frame like the one below, and I am using the code below to insert the data into MongoDB:
mydb = conn["mydatabase"]
mycol = mydb["test"]
x = results_df["user"] # result_df is the data frame.
for item in x:
mycol.collection.insert({"user" : item , },check_keys= False)
It produces documents in the below format:
{ "_id" : ObjectId("5bc0df186b3f65f926bceaeb"), "user" : ".287aa7e54ebe4088ac0a7983df4e4a28.#fnwp.vivox.com" }
{ "_id" : ObjectId("5bc0df186b3f65f926bceaec"), "user" : ".8f47cf677f9b429ab13245e12ce2fdda.#fnwp.vivox.com" }
{ "_id" : ObjectId("5bc0df186b3f65f926bceaed"), "user" : ".9ab4cdcc2cd24c9688f162817cbbbf34.#fnwp.vivox.com" }
I want to insert more fields into each document, like below:
{ "_id" : ObjectId("5bc0df186b3f65f926bceaeb"), "user" : ".287aa7e54ebe4088ac0a7983df4e4a28.#fnwp.vivox.com", "ua":"Vivox-SDK-4.9.0002.29794O" , "type":"vx_pp_log"}
I want to insert billions of rows like this, and I would like to keep it dynamic, as I may add more fields in the future.
Here you go:
mydb = conn["testdb"]
mycol = mydb["test"]

user = results_df['user']
ua = results_df['ua']
time = results_df['#timestamp']

df = pd.DataFrame({'user': user, 'ua': ua, 'time': time})  # keep increasing the columns
mycol.insert_many(df.to_dict('records'))  # insert_many() replaces the deprecated insert()

PyMongo iterate through documents / arrays within documents, replace found values

I have the following data in Mongodb:
{ "_id" : 1, "items" : [ "apple", "orange", "plum" ] }
{ "_id" : 2, "items" : [ "orange", "apple", "pineapple" ] }
{ "_id" : 3, "items" : [ "cherry", "carrot", "apple" ] }
{ "_id" : 4, "items" : [ "sprouts", "pear", "lettuce" ] }
I am trying to make a function using Python / PyMongo that takes 2 arguments, an old and a new string. I would like to find all the "apple" in all of the arrays across all of the documents and replace them with the string "banana".
Below is the code I have so far:
def update(old, new):
    for result in db.collection.find({"items": old}):
        for i in result["items"]:
            if i == old:
                db.collection.update({"_id": result["._id"]}, {"$set": {"items": new}})
Finally figured this out; it is actually really simple:
from pymongo import MongoClient

mongo_client = MongoClient('127.0.0.1', 27017)
db = mongo_client.my_database

def update(old, new):
    for result in db.collection.find({'items': old}):
        db.collection.update_one({'_id': result['_id'], 'items': old}, {'$set': {'items.$': new}})
Use update_many() to update multiple documents in one query instead of looping through the documents.
from pymongo import MongoClient

mongo_client = MongoClient('127.0.0.1', 27017)
db = mongo_client.my_database

def update(old, new):
    db.collection.update_many({'items': old}, {'$set': {'items.$': new}})
For pymongo less than 3.2, use update with the multi flag:
def update(old, new):
    db.collection.update({'items': old}, {'$set': {'items.$': new}}, multi=True)
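Note that the positional $ operator only rewrites the first matching array element in each document. On MongoDB 3.6+ the filtered positional operator can replace every occurrence; a hedged sketch (database name is a placeholder):
from pymongo import MongoClient

db = MongoClient('127.0.0.1', 27017).my_database  # placeholder database name

def update_all(old, new):
    # $[elem] + array_filters rewrites every matching element, not just the first
    db.collection.update_many(
        {'items': old},
        {'$set': {'items.$[elem]': new}},
        array_filters=[{'elem': old}]
    )

update_all('apple', 'banana')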

Remove duplicate values in mongodb

I am learning MongoDB using Python with Tornado. I have a MongoDB collection; when I do
db.cal.find()
{
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "Period" : "10-2015",
    "AOs" : [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015",
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked" : [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015",
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA" : [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015",
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr" : [
        "23-10-2015",
        "27-10-2015",
        "23-10-2015",
        "27-10-2015"
    ]
}
I need an operation to remove the duplicate values from Booked, NA, AOs and AOr. Finally it should be:
{
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "AOs" : [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked" : [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA" : [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr" : [
        "23-10-2015",
        "27-10-2015"
    ]
}
How do I achieve this in mongodb?
Working solution
I have created a working solution based on JavaScript, which is available on the mongo shell:
var codes = ["AOs", "Booked", "NA", "AOr"]

// Use bulk operations for efficiency
var bulk = db.dupes.initializeUnorderedBulkOp()

db.dupes.find().forEach(
  function(doc) {
    // Needed to prevent unnecessary operations
    changed = false
    codes.forEach(
      function(code) {
        var values = doc[code]
        var uniq = []
        for (var i = 0; i < values.length; i++) {
          // If the current value can not be found, it is unique
          // in the "uniq" array after insertion
          if (uniq.indexOf(values[i]) == -1) {
            uniq.push(values[i])
          }
        }
        doc[code] = uniq
        if (uniq.length < values.length) {
          changed = true
        }
      }
    )
    // Update the document only if something was changed
    if (changed) {
      bulk.find({"_id": doc._id}).updateOne(doc)
    }
  }
)

// Apply all changes
bulk.execute()
Resulting document with your sample input:
replset:PRIMARY> db.dupes.find().pretty()
{
    "_id" : ObjectId("567931aefefcd72d0523777b"),
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "Period" : "10-2015",
    "AOs" : [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked" : [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA" : [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr" : [
        "23-10-2015",
        "27-10-2015"
    ]
}
Using indices with dropDups
This simply does not work. First, as per version 3.0, this option no longer exists. Since we have 3.2 released, we should find a portable way.
Second, even with dropDups, the documentation clearly states that:
dropDups boolean : MongoDB indexes only the first occurrence of a key and removes all documents from the collection that contain subsequent occurrences of that key.
So if there would be another document which has the same values in one of the billing codes as a previous one, the whole document would be deleted.
You can't use the "dropDups" syntax here first because it has been "deprecated" as of MongoDB 2.6 and removed in MongoDB 3.0 and will not even work.
To remove the duplicates from each list, you can use Python's set class.
import pymongo

fields = ['Booked', 'NA', 'AOs', 'AOr']
client = pymongo.MongoClient()
db = client.test
collection = db.cal

bulk = collection.initialize_ordered_bulk_op()
count = 0
for document in collection.find():
    update = dict(zip(fields, [list(set(document[field])) for field in fields]))
    bulk.find({'_id': document['_id']}).update_one({'$set': update})
    count = count + 1
    if count % 200 == 0:
        bulk.execute()
        bulk = collection.initialize_ordered_bulk_op()

if count > 0:
    bulk.execute()
MongoDB 3.2 deprecates Bulk() and its associated methods and provides the .bulkWrite() method. This method is available from Pymongo 3.2 as bulk_write(). The first thing to do using this method is to import the UpdateOne class.
from pymongo import UpdateOne

requests = []  # list of write operations
for document in collection.find():
    update = dict(zip(fields, [list(set(document[field])) for field in fields]))
    requests.append(UpdateOne({'_id': document['_id']}, {'$set': update}))
collection.bulk_write(requests)
The two queries give the same and expected result:
{'AOr': ['27-10-2015', '23-10-2015'],
'AOs': ['15-10-2015', '14-10-2015', '18-10-2015'],
'Booked': ['7-10-2015', '5-10-2015', '8-10-2015'],
'NA': ['1-10-2015', '4-10-2015', '3-10-2015', '2-10-2015'],
'Period': '10-2015',
'Pid': '5652f92761be0b14889d9854',
'Registration': 'TN 56 HD 6766',
'Vid': '56543ed261be0b0a60a896c9',
'_id': ObjectId('567f808fc6e11b467e59330f')}
Have you tried distinct()?
Link: https://docs.mongodb.org/v3.0/reference/method/db.collection.distinct/
Specify Query with distinct
The following example returns the distinct values for the field sku, embedded in the item field, from the documents whose dept is equal to "A":
db.inventory.distinct( "item.sku", { dept: "A" } )
The method returns the following array of distinct sku values:
[ "111", "333" ]
Assuming that you want to remove duplicate dates from the collection, so you can add a unique index with the dropDups: true option:
db.bill_codes.ensureIndex({"fieldName":1}, {unique: true, dropDups: true})
For more reference:
db.collection.ensureIndex() - MongoDB Manual 3.0
Note: Back up your database first in case it doesn't do exactly as you're expecting.
