I want to store a JSON document larger than 16MB (MongoDB's per-document size limit), but because of that limit I am unable to do so. How can I store such large documents in MongoDB? I know the GridFS API may be an option, but after a lot of struggle I am still unable to figure out how to use GridFS and what the right commands are to insert and retrieve data with it. Any help in using GridFS, or any other alternative for storing large JSON documents, would be much appreciated.
I am using Python's PyMongo package.
Thanks!
There are several methods to store large data in MongoDB. For example, you can insert a nested document directly:
var Data = {
  "userID": "1",
  "userData": {
    "firstName": "Test First Name",
    "lastName": "Test Last Name",
    "number": {
      "phNumber": "9999999991",
      "cellNumber": "8888888888"
    },
    "address": {
      "Geo": {
        "latitude": 15.40,
        "longitude": -70.90
      },
      "city": "surat",
      "state": "gujarat",
      "country": "india"
    },
    "product": {
      "game": {
        "GTA": true,
        "DOTA": true
      },
      "television": {
        "TV": true,
        "PlayStation": true,
        "Xbox": false
      }
    }
  },
  "key": "ANbcsgYSIDncsSK"
};
// insertOne, since Data is a single document (insertMany expects an array)
db.collection("[collection Name]").insertOne(Data);
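Since the question specifically asks about GridFS with PyMongo, here is a minimal sketch; the connection string, database name, and file name are placeholders. GridFS splits the payload into chunks, so the 16MB per-document limit does not apply.

import json

import gridfs
from pymongo import MongoClient

# Placeholder connection string and database name.
client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]
fs = gridfs.GridFS(db)

# Store: serialize the large dict to bytes and hand it to GridFS,
# which splits it into 255 kB chunks under the hood.
large_doc = {"userID": "1", "userData": {"firstName": "Test First Name"}}
file_id = fs.put(json.dumps(large_doc).encode("utf-8"),
                 filename="large_doc.json")

# Retrieve: read the bytes back and deserialize.
raw = fs.get(file_id).read()
restored = json.loads(raw.decode("utf-8"))

The trade-off is that the document is stored as an opaque blob, so you can't query inside it with normal find() filters; if you need that, splitting the document into smaller related documents is the usual alternative.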
I have hundreds of documents following the format below. Basically, each doc contains objects with the types "BehavioralData", "Notes", etc.
{
  "_id": "123",
  "Notes": {
    "1222": "Something is here"
  },
  "BehavioralData": {
    "Folder1": {
      "Sex": "Male",
      "Age": "22",
      "Date": "",
      "ResearchGroup": "",
      "Institution": "University of Manitoba"
    },
    "MoCA": {
      "Visual-Executive": "",
      "Naming": "NameHere",
      "Attention": "",
      "Language": "",
      "Abstraction": "",
      "Delayed Recall": "",
      "Orientation": "",
      "Education": "",
      "Total": ""
    }
  }
}
Is there a way to query the collection using PyMongo in Python to get the following result:
{
  "NotesLength": 1,
  "BehavioralLength": 2
}
What I did before was fetch the documents one by one into my Python script and then measure the lengths of the dictionaries inside each document. This takes quite a while because my Python program queries the whole collection. Is there a way to do it faster?
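One way to speed this up is to let the server compute the counts with an aggregation pipeline instead of shipping whole documents to Python. A minimal sketch with PyMongo, assuming placeholder database and collection names: $objectToArray turns each subdocument into an array of key/value pairs, and $size counts its entries.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["mydb"]["docs"]                      # placeholder names

pipeline = [
    {"$project": {
        # Count the keys of each subdocument server-side.
        "NotesLength": {"$size": {"$objectToArray": "$Notes"}},
        "BehavioralLength": {"$size": {"$objectToArray": "$BehavioralData"}},
    }}
]

for doc in coll.aggregate(pipeline):
    print(doc)  # e.g. {'_id': '123', 'NotesLength': 1, 'BehavioralLength': 2}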
Currently I'm pulling data from the QuickBooks Online API, parsing it, and storing it in a database. My problem is that right now, with the small amount of data I'm pulling, I just delete my db tables and repopulate them with the updated data. Is there any way I can do this more optimally?
QuickBooks provides an API that is exactly what you're looking for. It's called Change Data Capture and is a common pattern for time-based updates like you're describing.
If you refer to Intuit's docs they tell you all about it:
https://developer.intuit.com/app/developer/qbo/docs/learn/explore-the-quickbooks-online-api/change-data-capture
Basically you make requests like this, providing a date/time you want data changed since:
https://quickbooks.api.intuit.com/v3/company/<realmId>/cdc?entities=<entityList>&changedSince=<dateTime>
And you get back a list of changed objects that you can then update your local database with:
{
  "CDCResponse": [{
    "QueryResponse": [{
      "Customer": [{
        ...
        "Id": "63",
        ...
      },
      {
        ...
        "Id": "99",
        ...
      }],
      "startPosition": 1,
      "maxResults": 2
    },
    {
      "Estimate": [{
        ...
        "Id": "34",
        ...
      },
      {
        ...
        "Id": "123",
        ...
      },
      {
        "domain": "QBO",
        "status": "Deleted",
        "Id": "979",
        "MetaData": {
          "LastUpdatedTime": "2015-12-23T12:55:50-08:00"
        }
      }],
      "startPosition": 1,
      "maxResults": 3,
      "totalCount": 5
    }]
  }],
  "time": "2015-12-23T10:00:01-07:00"
}
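As a rough sketch of what that request might look like from Python, using the requests package; the realm ID, access token, and changedSince value are placeholders, and a real call needs OAuth2 credentials obtained through Intuit's authorization flow:

import requests

# Placeholder values; obtain a real OAuth2 access token and realm ID
# from Intuit's authorization flow.
REALM_ID = "1234567890"
ACCESS_TOKEN = "your-oauth2-access-token"

url = f"https://quickbooks.api.intuit.com/v3/company/{REALM_ID}/cdc"
params = {
    "entities": "Customer,Estimate",
    "changedSince": "2015-12-23T00:00:00-08:00",
}
headers = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",
    "Accept": "application/json",
}

resp = requests.get(url, params=params, headers=headers)
resp.raise_for_status()

# Walk the changed objects and upsert them into the local database,
# honoring "status": "Deleted" entries by removing the local rows.
for query_response in resp.json()["CDCResponse"][0]["QueryResponse"]:
    for entity_name, records in query_response.items():
        if isinstance(records, list):  # skip startPosition/maxResults
            for record in records:
                print(entity_name, record.get("Id"), record.get("status"))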
First of all, I am not getting a proper error response on the web platform either (https://jsonschemalint.com). I am using jsonschema in Python, and I have a proper JSON schema and JSON data that work.
The problem I'd like to solve is the following: before we deliver JSON files with example data, we need to run them through SoapUI to test that they are valid. We are dealing with huge files and our devs sometimes make errors in generating them, so we do the final check.
I'd like to create a script to automate this, avoiding SoapUI. After googling, I came across jsonschema and tried to use it. I get all the proper results, etc., and I get errors when I delete certain elements as usual, but the biggest issues are the following.
Example:
I have a sub-sub-sub object in my JSON schema, let's call it Test1, which contains the following:
**Schema**
{
  "exname": "2",
  "info": {},
  "consumes": {},
  "produces": {},
  "schemes": {},
  "tags": {},
  "parameters": {},
  "paths": {},
  "definitions": {
    "MainTest1": {
      "description": "",
      "minProperties": 1,
      "properties": {
        "test1": {
          "items": {
            "$ref": "#//Test1"
          },
          "maxItems": 10,
          "minItems": 1,
          "type": "array"
        },
        "test2": {
          "items": {
            "$ref": "#//"
          },
          "maxItems": 10,
          "minItems": 1,
          "type": "array"
        }
      }
    },
    "Test1": {
      "description": "test1des",
      "minProperties": 1,
      "properties": {
        "prop1": {
          "description": "prop1des",
          "example": "prop1exam",
          "maxLength": 10,
          "minLength": 2,
          "type": "string"
        },
        "prop2": {
          "description": "prop2des",
          "example": "prop2example",
          "maxLength": 200,
          "minLength": 2,
          "type": "string"
        },
        "prop3": {
          "enum": [
            "enum1",
            "enum2",
            "enum3"
          ],
          "example": "enum1",
          "type": "string"
        }
      },
      "required": [
        "prop3"
      ],
      "type": "object"
    }
  }
}
**Proper example for Test1**
{
  "Test1": [{
    "prop1": "TestStr",
    "prop2": "Test and Test",
    "prop3": "enum1"
  }]
}
**Improper example that still passes validation for Test1**
{
  "test1": [{
    "prop1": "TestStr123456", [wrong: exceeds the max length]
    "prop2": "Test and Test",
    "prop3": " enum1" [wrong: has a leading whitespace character]
  }]
}
The first issue I ran across is that the enum in prop3 isn't validated correctly: when I use " enum1", "enumruwehrqweur", or literally anything else, the tests pass. In addition, the min/max character limits are not checked anywhere in my JSON; no matter how many characters I use in any field, I do not get an error. Does anyone have an idea how to fix this, or has anyone found a better workaround for what I'm trying to do? Thank you in advance!
There were a few issues with your schema. I'll address each of them.
In your schema, you have "Test1". In your JSON instance, you have "test1". Case is important. I would guess this is just an error in creating your example.
In your schema, you have "Test1" at the root level. Because this is not a schema key word, it is ignored, and has no effect on validation. You need to nest it inside a "properties" object, as you have done elsewhere.
{
  "properties": {
    "test1": {
Your validation would still not work correctly. If you want to validate each item in an array, you need to use the items keyword.
{
  "properties": {
    "test1": {
      "items": {
        "description": "test1des",
Finally, you'll need to nest the required and type keywords inside the items object.
Here's the complete schema:
{
  "properties": {
    "test1": {
      "items": {
        "description": "test1des",
        "minProperties": 1,
        "properties": {
          "prop1": {
            "description": "prop1des",
            "example": "prop1exam",
            "maxLength": 10,
            "minLength": 2,
            "type": "string"
          },
          "prop2": {
            "description": "prop2des",
            "example": "prop2example",
            "maxLength": 200,
            "minLength": 2,
            "type": "string"
          },
          "prop3": {
            "enum": [
              "enum1",
              "enum2",
              "enum3"
            ],
            "example": "enum1",
            "type": "string"
          }
        },
        "required": [
          "prop3"
        ],
        "type": "object"
      }
    }
  }
}
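With the schema corrected, the SoapUI step can be automated from Python with the jsonschema package. A minimal sketch, assuming Draft 7 semantics and placeholder file names:

import json

from jsonschema import Draft7Validator

# Placeholder file names.
with open("schema.json") as f:
    schema = json.load(f)
with open("example.json") as f:
    instance = json.load(f)

validator = Draft7Validator(schema)
for error in sorted(validator.iter_errors(instance),
                    key=lambda e: list(e.path)):
    # e.g. "test1 -> 0 -> prop1: 'TestStr123456' is too long"
    path = " -> ".join(str(p) for p in error.path)
    print(f"{path}: {error.message}")

Using iter_errors rather than validate reports every violation in one run, which is handy when checking large generated files.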
I have a problem in Elasticsearch with determining whether an element already exists. For practical reasons I store image metadata in Elasticsearch for each user (the author), and to prevent certain attacks I need to be sure an image hasn't already been saved. Here is a standard JSON document:
{
  "user_group": "user",
  "user_data": {
    "name": "myname"
  },
  "user_image": [
    {
      "size": "1920x1080",
      "location": "my_city",
      "description": "my_desc",
      "name": "myname",
      "md5_checksum_image": "a_md5"
    },
    {
      "size": "1920x1080",
      "location": "my_city",
      "description": "my_desc",
      "name": "myothername",
      "md5_checksum_image": "a_md5"
    }
  ]
}
My problem is that I could have thousands of images, so how could I test with a query whether I have already registered a given image by looking at its checksum?
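One possible approach, sketched with the official Python client and assuming user_image is mapped as a nested field with md5_checksum_image as a keyword (the host, index name, and checksum are placeholders, and the call style assumes the 7.x client API):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

# Checksum of the image about to be saved.
checksum = "a_md5"

# Nested term query: matches users that already have an image
# with this exact checksum.
query = {
    "nested": {
        "path": "user_image",
        "query": {
            "term": {"user_image.md5_checksum_image": checksum}
        }
    }
}

# count() avoids fetching whole documents; a count > 0 means the
# checksum is already registered.
result = es.count(index="users", body={"query": query})
if result["count"] > 0:
    print("image already registered")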
I have a JSON file with the following example entry:
{
  "title": "Test prod",
  "leafPage": true,
  "type": "product",
  "product": {
    "title": "test product",
    "offerPrice": "$19.95",
    "offerPriceDetails": {
      "amount": 19.95,
      "text": "$19.95",
      "symbol": "$"
    },
    "media": [
      {
        "link": "http://www.test.com/cool.jpg",
        "primary": true,
        "type": "image",
        "xpath": "/html[1]/body[1]/div[1]/div[3]/div[2]/div[1]/div[1]/div[1]/div[1]/a[1]/img[1]"
      }
    ],
    "availability": true
  },
  "human_language": "en",
  "url": "http://www.test.com"
}
I can post this to my test server via a Python script perfectly well when I use:
"text": entry.get("title"),
"url": entry.get("url"),
"type": entry.get("type"),
However, I cannot get the following nested item to upload its values. How do I structure the Python call to get at a nested JSON entry?
I've tried the below without success. I need to use .get because the fields present vary between entries in the JSON file, and it errors out without the .get call.
"Amount": entry.get("product"("offerPrice"))
Any help on how to structure the nested JSON access would be very much appreciated.
You need to do:
"Amount": entry.get("product", {}).get("offerPrice")
entry.get("product", {}) returns a product dictionary (or an empty dictionary if there is no product key).
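The same pattern chains for deeper levels. For instance, to pull the numeric amount out of offerPriceDetails from the entry above (a standalone sketch, independent of whatever upload library you're using):

entry = {
    "title": "Test prod",
    "product": {
        "offerPrice": "$19.95",
        "offerPriceDetails": {"amount": 19.95, "text": "$19.95", "symbol": "$"},
    },
}

payload = {
    "text": entry.get("title"),
    "Amount": entry.get("product", {}).get("offerPrice"),
    # Chaining .get with {} defaults means a missing level yields
    # None instead of raising an AttributeError.
    "AmountValue": entry.get("product", {})
                        .get("offerPriceDetails", {})
                        .get("amount"),
}
print(payload)  # {'text': 'Test prod', 'Amount': '$19.95', 'AmountValue': 19.95}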