Inserting multiple rows into MongoDB - Python

I have a pandas data frame like the one below:
I am using the code below to insert the data into MongoDB:
mydb = conn["mydatabase"]
mycol = mydb["test"]
x = results_df["user"]  # results_df is the data frame.
for item in x:
    mycol.collection.insert({"user": item}, check_keys=False)
The data gets inserted in the following format:
{ "_id" : ObjectId("5bc0df186b3f65f926bceaeb"), "user" : ".287aa7e54ebe4088ac0a7983df4e4a28.#fnwp.vivox.com" }
{ "_id" : ObjectId("5bc0df186b3f65f926bceaec"), "user" : ".8f47cf677f9b429ab13245e12ce2fdda.#fnwp.vivox.com" }
{ "_id" : ObjectId("5bc0df186b3f65f926bceaed"), "user" : ".9ab4cdcc2cd24c9688f162817cbbbf34.#fnwp.vivox.com" }
I want to insert more fields into each document, like below:
{ "_id" : ObjectId("5bc0df186b3f65f926bceaeb"), "user" : ".287aa7e54ebe4088ac0a7983df4e4a28.#fnwp.vivox.com", "ua":"Vivox-SDK-4.9.0002.29794O" , "type":"vx_pp_log"}
I want to insert billions of rows like this and would like to keep it dynamic, as I may add more columns in the future.

Here you go:
mydb = conn["testdb"]
mycol = mydb["test"]
user = results_df['user']
ua = results_df['ua']
time = results_df['#timestamp']
df = pd.DataFrame({'user': user, 'ua': ua, 'time': time}) # keep increasing the columns
mycol.collection.insert(df.to_dict('records'))
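For billions of rows you probably do not want to materialize one giant list of records in a single call. A minimal sketch of batched inserts, assuming the DataFrame (or its source) can be processed in pieces; the chunk size and the ordered=False choice are illustrative assumptions, not requirements:
CHUNK = 50_000  # arbitrary batch size, tune for your memory and throughput

records = df.to_dict('records')
for start in range(0, len(records), CHUNK):
    batch = records[start:start + CHUNK]
    # ordered=False lets MongoDB continue inserting the rest of a batch
    # even if individual documents fail
    mycol.insert_many(batch, ordered=False)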

Related

MongoDB values to dict in Python

Basically I need to connect to MongoDB, read the document records, and put the values into a dict.
**MongoDB Values**
{ "_id" : "LAC1397", "code" : "MIS", "label" : "Marshall Islands", "mappingName" : "RESIDENTIAL_COUNTRY" }
{ "_id" : "LAC1852", "code" : "COP", "label" : "Colombian peso", "mappingName" : "FOREIGN_CURRENCY_CODE"}
How do I map it to a dict in the fashion below in Python?
**syntax :**
dict = {"mappingName|Code" : "Value" }
**Example :**
dict = { "RESIDENTIAL_COUNTRY|MIS" : "Marshall Islands" , "FOREIGN_CURRENCY_CODE|COP" : "Colombian peso" , "COMM_LANG|ENG" : "English" }
**Python Code**
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client.mongo
collection = db.masters
for post in collection.find():
I got stuck after this; I'm not sure how to build the dict in the format mentioned above.
post will be a dict with the values from Mongo, so you can loop over the records and add entries to a new dictionary. As the comments mention, any duplicate keys would be overwritten by the last value found. If this might be an issue, consider adding a sort() to the find() call.
Sample code:
from pymongo import MongoClient

db = MongoClient()['mydatabase']
db.mycollection.insert_one({"_id": "LAC1397", "code": "MIS", "label": "Marshall Islands", "mappingName": "RESIDENTIAL_COUNTRY"})
db.mycollection.insert_one({"_id": "LAC1852", "code": "COP", "label": "Colombian peso", "mappingName": "FOREIGN_CURRENCY_CODE"})

mydict = {}
for post in db.mycollection.find():
    k = f"{post.get('mappingName')}|{post.get('code')}"
    mydict[k] = post.get('label')

print(mydict)
Gives:
{'RESIDENTIAL_COUNTRY|MIS': 'Marshall Islands', 'FOREIGN_CURRENCY_CODE|COP': 'Colombian peso'}
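If duplicates are possible and you want a deterministic winner, a small sketch of sorting the cursor so the "last value found" is predictable; sorting on _id ascending here is only an illustrative assumption, sort on whatever field defines "latest" in your data:
import pymongo

mydict = {}
# Oldest documents come first, so the entry with the highest _id wins when keys collide
for post in db.mycollection.find().sort('_id', pymongo.ASCENDING):
    mydict[f"{post.get('mappingName')}|{post.get('code')}"] = post.get('label')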

SQLite3 DB to JSON, used for Highcharts?

I'm currently working with a database and I would like to display its values on a webpage using Highcharts.
Here is what I use to fetch the data in the web app:
@app.route("/data.json")
def data():
    connection = sqlite3.connect("/home/pi/database/Main_Database.db")
    cursor = connection.cursor()
    cursor.execute("SELECT epochTime, data_x from table")
    results = cursor.fetchall()
    return json.dumps(results)
Then I currently get this value by doing this in my HTML:
$.getJSON('http://192.168.1.xx/data.json', function (data) {
    // Create the chart
    $('#container').highcharts('StockChart', {
        rangeSelector : {
            selected : 1
        },
        title : {
            text : 'title'
        },
        series : [{
            name : 'Value',
            data : data,
            tooltip: {
                valueDecimals: 2
            }, .......
This works if I want to display only one data array.
If I want to display more than one array, it looks like each array must be preceded by its name and follow a certain format (I checked the data samples used by Highcharts).
Example:
data1:[(epochTime, 200),(epochTime,400)];data2:[(epochTime, 2),(epochTime,4)]
I have some trouble using json.dumps on two arrays from two different tables, for example. I tried the following: json.dumps({data1:results}).
But the result is still not readable.
Do you have any advice? Or examples/templates of a web app using Highcharts with SQLite?
Thanks a lot!
I think this should work.
In the controller, fetch the two result sets and put them in a dictionary:
@app.route("/data.json")
def data():
    connection = sqlite3.connect("/home/pi/database/Main_Database.db")
    cursor = connection.cursor()
    cursor.execute("SELECT epochTime, data_x from table")
    results1 = cursor.fetchall()
    cursor.execute("SELECT epochTime, data_x from table2")
    results2 = cursor.fetchall()
    return json.dumps({'result1': results1,
                       'result2': results2})
On the page:
$.getJSON('http://192.168.1.xx/data.json', function (data) {
    // Create the chart
    $('#container').highcharts('StockChart', {
        rangeSelector : {
            selected : 1
        },
        title : {
            text : 'title'
        },
        series : [{
            name : 'Value1',
            data : data.result1, // read result1
            tooltip: {
                valueDecimals: 2
            }
        }, {
            name : 'Value2',
            data : data.result2, // read result2
            tooltip: {
                valueDecimals: 2
            }, .......
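As a quick check of what the route returns: json.dumps turns the dictionary of fetchall() tuples into nested JSON arrays, which is the [x, y] pair format a Highcharts series accepts. A small illustration with made-up values:
import json

# fetchall() returns tuples; json.dumps renders them as JSON arrays
results1 = [(1428405600, 200), (1428409200, 400)]
results2 = [(1428405600, 2), (1428409200, 4)]

print(json.dumps({'result1': results1, 'result2': results2}))
# {"result1": [[1428405600, 200], [1428409200, 400]],
#  "result2": [[1428405600, 2], [1428409200, 4]]}
Note that Highcharts datetime axes expect timestamps in milliseconds, so epoch seconds may need multiplying by 1000 before plotting.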

Aggregating over map keys using mongo aggregation framework

I have a use case where I have documents stored in a Mongo collection with one of the fields being a map. For example:
{ "_id" : ObjectId("axa"), "date" : "2015-08-05", "key1" : "abc", "aggregates" : { "x" : 12, "y" : 1 } }
{ "_id" : ObjectId("axa1"), "date" : "2015-08-04", "key1" : "abc", "aggregates" : { "x" : 4, "y" : 19 } }
{ "_id" : ObjectId("axa2"), "date" : "2015-08-03", "key1" : "abc", "aggregates" : { "x" : 3, "y" : 13 } }
One thing to note is that the keys inside the aggregates sub-document could change; for example, instead of x and y it could be z and k, or any combination and any number of keys.
Now I am pulling that data over an API and need to use the Mongo aggregation framework to aggregate over a date range. For instance, for the above example, I want to run a query for dates 08/03 - 08/05, group by key1, and sum x and y, and the result should be
{ "key1" : "abc", "aggregates" : { "x" : 19, "y" : 33 } }
How can I do it?
First you should update your documents, because date is stored as a string. You can do that using the Bulk() API:
from datetime import datetime
import pymongo

conn = pymongo.MongoClient()
db = conn.test
col = db.collection
bulk = col.initialize_ordered_bulk_op()
count = 0

for doc in col.find():
    conv_date = datetime.strptime(doc['date'], '%Y-%m-%d')
    bulk.find({'_id': doc['_id']}).update_one({'$set': {'date': conv_date}})
    count = count + 1
    if count % 500 == 0:
        # Execute per 500 operations and re-init.
        bulk.execute()
        bulk = col.initialize_ordered_bulk_op()

# Clean up queues
if count % 500 != 0:
    bulk.execute()
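Note that the Bulk() API used above (initialize_ordered_bulk_op / execute) is no longer available in PyMongo 4. A roughly equivalent sketch using bulk_write, keeping the same 500-operation batching assumption:
from datetime import datetime
from pymongo import MongoClient, UpdateOne

col = MongoClient().test.collection
ops = []
for doc in col.find({}, {'date': 1}):
    conv_date = datetime.strptime(doc['date'], '%Y-%m-%d')
    ops.append(UpdateOne({'_id': doc['_id']}, {'$set': {'date': conv_date}}))
    if len(ops) == 500:
        col.bulk_write(ops)   # send a batch of 500 updates
        ops = []
if ops:
    col.bulk_write(ops)       # flush the remainder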
Then comes the aggregation part:
You need to filter your documents by date using the $match operator. Next, $group your documents by a specified identifier (key1) and apply the $sum accumulator. With $project you can reshape your documents.
x = 'x'
y = 'y'

col.aggregate([
    {'$match': {'date': {'$lte': datetime(2015, 8, 5), '$gte': datetime(2015, 8, 3)}}},
    {'$group': {'_id': '$key1', 'x': {'$sum': '$aggregates.' + x}, 'y': {'$sum': '$aggregates.' + y}}},
    {'$project': {'key1': '$_id', 'aggregates': {'x': '$x', 'y': '$y'}, '_id': 0}}
])
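Since the keys inside aggregates can change, one option is to build the $group and $project stages from a list of keys. A sketch, assuming you know (or can discover) the key names for a given query:
from datetime import datetime


def aggregate_keys(col, keys, start, end):
    # Build {'x': {'$sum': '$aggregates.x'}, ...} for whichever keys are passed in
    group = {'_id': '$key1'}
    project = {'key1': '$_id', '_id': 0, 'aggregates': {}}
    for k in keys:
        group[k] = {'$sum': '$aggregates.' + k}
        project['aggregates'][k] = '$' + k
    return list(col.aggregate([
        {'$match': {'date': {'$gte': start, '$lte': end}}},
        {'$group': group},
        {'$project': project},
    ]))

# e.g. aggregate_keys(col, ['x', 'y'], datetime(2015, 8, 3), datetime(2015, 8, 5))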

Strange pymongo behaviour when adding a Python list

I have this code:
for product_code in product_codes:
    product_categories = []
    product_belongs_to = []
    get_categories = """SELECT * FROM stock_groups_styles_map WHERE stock_groups_styles_map.style ='%s'""" % (product_code,)
    for category in sql_query(get_categories):
        if {product_code: category[1]} in product_categories:
            pass
        else:
            product_categories.append({product_code: category[1]})
    for category in product_categories:
        category_group = get_group(category.values()[0])
        if category_group:
            category_name = category_group.replace("-", " ").title()
            if category_name:
                if category_name == "Vests":
                    product_belongs_to.append(get_category_ids("Tanks"))
                else:
                    cat_value = get_category_ids(category_name)
                    if cat_value:
                        cat_id = get_category_ids(category_name)
                        product_belongs_to.append(cat_id[0])
    ccc_products = {
        '_id': ObjectId(),
        'collectionId': collectionId,
        'categoryIds': product_belongs_to,
        'visible': 'true',
    }
    products.save(ccc_products)
When I look at the MongoDB collection, I have:
{
"_id" : ObjectId("53aaa4e1d901f2430f25a6ba"),
"collectionId" : ObjectId("53aaa4d6d901f2430f25a604"),
"visible" : "true",
"categoryIds" : [
ObjectId("53aaa4d6d901f2430f25a5fc"),
ObjectId("53aaa4d3d901f2430f25a5f9")
]
}
This is correct, but if I only have one item in the product_belongs_to list, I get:
{
"_id" : ObjectId("53aaa4e1d901f2430f25a6bd"),
"collectionId" : ObjectId("53aaa4d6d901f2430f25a604"),
"visible" : "true",
"categoryIds" : [
[
ObjectId("53aaa4d6d901f2430f25a5fe")
]
]
}
basically, "categoryIds" is an array containing an array
the only way to fix this is to do the following:
if len(product_belongs_to) == 1:
product_belongs_to = product_belongs_to[0]
what am i missing?
any advice much appreciated.
I suspect that this line is the problematic one:
product_belongs_to.append(get_category_ids("Tanks"))
get_category_ids is returning a list which you're appending to product_belongs_to.
You probably wanted to merge the results instead, so that they contain unique values:
product_belongs_to = list(set(product_belongs_to + get_category_ids("Tanks")))
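To make the difference concrete, a small sketch (the id values are made up) showing why append nests the returned list while extend, or the set union above, keeps it flat:
ids_from_lookup = ['id1', 'id2']   # what a get_category_ids-style helper returns (illustrative values)

nested = []
nested.append(ids_from_lookup)     # -> [['id1', 'id2']]  (a list inside a list)

flat = []
flat.extend(ids_from_lookup)       # -> ['id1', 'id2']    (elements merged in)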

How to import data from mongodb to pandas?

I have a large amount of data in a collection in MongoDB which I need to analyze. How do I import that data into pandas?
I am new to pandas and NumPy.
EDIT:
The MongoDB collection contains sensor values tagged with date and time. The sensor values are of float datatype.
Sample Data:
{
    "_cls" : "SensorReport",
    "_id" : ObjectId("515a963b78f6a035d9fa531b"),
    "_types" : [
        "SensorReport"
    ],
    "Readings" : [
        {
            "a" : 0.958069536790466,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
            "b" : 6.296118156595,
            "_cls" : "Reading"
        },
        {
            "a" : 0.95574014778624,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
            "b" : 6.29651468650064,
            "_cls" : "Reading"
        },
        {
            "a" : 0.953648289182713,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
            "b" : 7.29679823731148,
            "_cls" : "Reading"
        },
        {
            "a" : 0.955931884300997,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
            "b" : 6.29642922525632,
            "_cls" : "Reading"
        },
        {
            "a" : 0.95821381,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
            "b" : 7.28956613,
            "_cls" : "Reading"
        },
        {
            "a" : 4.95821335,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
            "b" : 6.28956574,
            "_cls" : "Reading"
        },
        {
            "a" : 9.95821341,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
            "b" : 0.28956488,
            "_cls" : "Reading"
        },
        {
            "a" : 1.95667927,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
            "b" : 0.29115237,
            "_cls" : "Reading"
        }
    ],
    "latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
    "sensorName" : "56847890-0",
    "reportCount" : 8
}
pymongo might give you a hand; the following is some code I'm using:
import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)

    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df
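A hypothetical call, assuming a database named sensordb and a collection named reports holding documents like the sample above (both names are placeholders):
df = read_mongo('sensordb', 'reports', query={'sensorName': '56847890-0'})
print(df.head())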
You can load your MongoDB data into a pandas DataFrame using this code. It works for me; hopefully for you too.
import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))
As per PEP 20 (the Zen of Python), simple is better than complex:
import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())
You can include conditions as you would when working with a regular MongoDB database, or even use find_one() to get only one element from the database, etc.
and voila!
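For example, a sketch using the sample document's reportCount field as the filter; the database and collection names are placeholders, and db is assumed to be a MongoClient as in the snippet above:
import pandas as pd

# Only documents with more than 5 readings, projecting away the _id
df = pd.DataFrame.from_records(
    db.mydatabase.mycollection.find({'reportCount': {'$gt': 5}}, {'_id': 0})
)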
Monary does exactly that, and it's super fast.
See this cool post, which includes a quick tutorial and some timings.
Another option I found very useful is:
from pandas.io.json import json_normalize
cursor = my_collection.find()
df = json_normalize(cursor)
(or json_normalize(list(cursor)), depending on your python/pandas versions).
This way you get the unfolding of nested mongodb documents for free.
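For the nested sample above, a sketch (field names taken from the sample document, collection name is a placeholder) that unfolds the Readings array while keeping a couple of top-level fields:
import pandas as pd

docs = list(my_collection.find())
# One row per reading, with sensorName/reportCount repeated on each row
df = pd.json_normalize(docs, record_path='Readings',
                       meta=['sensorName', 'reportCount'])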
import pandas as pd
from odo import odo
data = odo('mongodb://localhost/db::collection', pd.DataFrame)
For dealing with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try Python Blaze ecosystem: Blaze / Dask / Odo.
Blaze (and Odo) has out-of-the-box functions to deal with MongoDB.
A few useful articles to start off:
Introducing Blaze Expressions (with MongoDB query example)
ReproduceIt: Reddit word count
Difference between Dask Arrays and Blaze
And an article which shows what amazing things are possible with Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 Gb of Reddit comments in seconds).
P.S. I'm not affiliated with any of these technologies.
Using
pandas.DataFrame(list(...))
will consume a lot of memory if the iterator/generator result is large.
It is better to generate small chunks and concat them at the end:
def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency.
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames)
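A hypothetical call against a PyMongo cursor (the collection name and chunk size are placeholders):
df = iterator2dataframes(db.mycollection.find(), chunk_size=10000)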
You can also use pymongoarrow; it's an official library offered by MongoDB for exporting MongoDB data to pandas, NumPy, Parquet files, etc.
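A minimal sketch of what that looks like, assuming pymongoarrow is installed and using placeholder database/collection names; find_pandas_all runs the query and returns a pandas DataFrame:
from pymongo import MongoClient
from pymongoarrow.api import find_pandas_all

coll = MongoClient()['mydatabase']['mycollection']
# Executes the find() and materializes the result as a DataFrame
df = find_pandas_all(coll, {'reportCount': {'$gt': 5}})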
http://docs.mongodb.org/manual/reference/mongoexport
Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records().
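A sketch of that route; the file name, database/collection names, and exported field list are placeholders (mongoexport requires an explicit field list for CSV output):
# Shell: mongoexport --db=mydatabase --collection=mycollection \
#                    --type=csv --fields=sensorName,reportCount --out=data.csv
import pandas as pd

df = pd.read_csv('data.csv')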
You can achieve what you want with pdmongo in three lines:
import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")
If your data is very large, you can run an aggregate query first to filter out the data you do not want, then map it to your desired columns.
Here is an example of mapping Readings.a to column a and filtering by the reportCount column:
import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")
read_mongo accepts the same arguments as pymongo aggregate
Following this great answer by waitingkuo, I would like to add the possibility of doing this with a chunksize, in line with .read_sql() and .read_csv(). I enlarge the answer from Deu Leung by avoiding going one by one through each 'record' of the 'iterator'/'cursor'.
I will borrow the previous read_mongo function.
def read_mongo(db,
               collection, query={},
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    # db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    client = MongoClient(host=host, port=port)

    # Make a query to the specific DB and Collection
    db_aux = client[db]

    # Some variables to create the chunks
    skips_variable = range(0, db_aux[collection].find(query).count(), int(chunksize))
    if len(skips_variable) <= 1:
        skips_variable = [0, len(skips_variable)]

    # Iteration to create the dataframe in chunks.
    for i in range(1, len(skips_variable)):

        # Expand the cursor and construct the DataFrame
        # df_aux = pd.DataFrame(list(cursor_aux[skips_variable[i-1]:skips_variable[i]]))
        df_aux = pd.DataFrame(list(db_aux[collection].find(query)[skips_variable[i-1]:skips_variable[i]]))

        if no_id:
            del df_aux['_id']

        # Concatenate the chunks into a unique df
        if 'df' not in locals():
            df = df_aux
        else:
            df = pd.concat([df, df_aux], ignore_index=True)

    return df
A similar approach to those of Rafael Valero, waitingkuo and Deu Leung, using pagination:
def read_mongo(
        db,
        collection, query=None,
        host='localhost', port=27017, username=None, password=None,
        chunksize=100, page_num=1, no_id=True):

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Calculate number of documents to skip
    skips = chunksize * (page_num - 1)

    # Default query here rather than in the signature to avoid a mutable default argument
    # (sorry, the linked explanation is in Spanish):
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df
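A hypothetical usage that counts the documents first and then concatenates the pages; the database/collection names and page size are placeholders:
import math
import pandas as pd
from pymongo import MongoClient

total = MongoClient()['mydatabase']['mycollection'].count_documents({})
pages = math.ceil(total / 1000)   # 1000 documents per page, matching chunksize below

frames = [read_mongo('mydatabase', 'mycollection', chunksize=1000, page_num=p)
          for p in range(1, pages + 1)]
df = pd.concat(frames, ignore_index=True)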
Start the Mongo shell with:
mongosh
Scroll up in the shell until you see where mongo is connected to. It should look something like this:
mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4
Copy and paste that into MongoClient.
Here is the code:
from pymongo import MongoClient
import pandas as pd
client = MongoClient('mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4')
mydatabase = client.yourdatabasename
mycollection = mydatabase.yourcollectionname
cursor = mycollection.find()
listofDocuments = list(cursor)
df = pd.DataFrame(listofDocuments)
df
You can use the "pandas.json_normalize" method:
import pandas as pd
display(pd.json_normalize( x ))
display(pd.json_normalize( x , record_path="Readings" ))
It should display two tables, where x is either your cursor or the document below:
from bson import ObjectId

def ISODate(st):
    return st

x = {
    "_cls" : "SensorReport",
    "_id" : ObjectId("515a963b78f6a035d9fa531b"),
    "_types" : [
        "SensorReport"
    ],
    "Readings" : [
        {
            "a" : 0.958069536790466,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
            "b" : 6.296118156595,
            "_cls" : "Reading"
        },
        {
            "a" : 0.95574014778624,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
            "b" : 6.29651468650064,
            "_cls" : "Reading"
        },
        {
            "a" : 0.953648289182713,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
            "b" : 7.29679823731148,
            "_cls" : "Reading"
        },
        {
            "a" : 0.955931884300997,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
            "b" : 6.29642922525632,
            "_cls" : "Reading"
        },
        {
            "a" : 0.95821381,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
            "b" : 7.28956613,
            "_cls" : "Reading"
        },
        {
            "a" : 4.95821335,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
            "b" : 6.28956574,
            "_cls" : "Reading"
        },
        {
            "a" : 9.95821341,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
            "b" : 0.28956488,
            "_cls" : "Reading"
        },
        {
            "a" : 1.95667927,
            "_types" : [
                "Reading"
            ],
            "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
            "b" : 0.29115237,
            "_cls" : "Reading"
        }
    ],
    "latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
    "sensorName" : "56847890-0",
    "reportCount" : 8
}
