How to create a document and collection in MongoDB to drive Python code configuration, i.e. get the attribute name, datatype, and function to be called from MongoDB?
Sample MongoDB collection:
db.attributes.insertMany([
{ attributes_names: "email", attributes_datype: "string", attributes_isNull: "false", attributes_std_function: "email_valid" },
{ attributes_names: "address", attributes_datype: "string", attributes_isNull: "false", attributes_std_function: "address_valid" }
]);
Python script and function:
from pyspark.sql.functions import regexp_replace, lower, expr

def email_valid(df):
    df1 = df.withColumn(df.columns[0], regexp_replace(lower(df.columns[0]), "[^a-zA-Z0-9@\._\-| ]", ""))
    extract_expr = expr(
        "regexp_extract_all(emails, '(\\\w+([\\\.-]?\\\w+)*@\\[A-Za-z\-\.]+([\\\.-]?\\\w+)*(\\\.\\\w{2,3})+)', 0)")
    df2 = df1.withColumn(df.columns[0], extract_expr) \
        .select(df.columns[0])
    return df2
How do I get all the MongoDB values into the Python script and call the function according to the attributes?
To create a MongoDB collection from a Python script:
import pymongo
# connect to your mongodb client
client = pymongo.MongoClient(connection_url)
# connect to the database
db = client[database_name]
# get the collection
mycol = db[collection_name]
from bson import ObjectId
from random_object_id import generate
# create a sample dictionary for the collection data
mydict = { "_id": ObjectId(generate()),
           "attributes_names": "email",
           "attributes_datype": "string",
           "attributes_isNull": "false",
           "attributes_std_function": "email_valid" }
# insert the dictionary into the collection
mycol.insert_one(mydict)
To insert multiple values into MongoDB, use insert_many() instead of insert_one() and pass it a list of dictionaries. Your list of dictionaries will look like this:
mydict = [{ "_id": ObjectId(generate()),
            "attributes_names": "email",
            "attributes_datype": "string",
            "attributes_isNull": "false",
            "attributes_std_function": "email_valid" },
          { "_id": ObjectId(generate()),
            "attributes_names": "email",
            "attributes_datype": "string",
            "attributes_isNull": "false",
            "attributes_std_function": "email_valid" }]
To get all the data from the MongoDB collection into the Python script:
data = list()
for x in mycol.find():
    data.append(x)

import pandas as pd
data = pd.json_normalize(data)
Before normalizing, you can access the data as you would access an element of a list of dictionaries:
value = data[0]["attributes_names"]
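To call the function named in each document, as the question asks, one option is a dispatch dictionary that maps the stored attributes_std_function string to the actual callable. A minimal sketch, assuming email_valid and address_valid from the question are in scope and df is the Spark DataFrame being validated:

# map function names stored in MongoDB to the Python callables
dispatch = {
    "email_valid": email_valid,
    "address_valid": address_valid,
}

for attr in mycol.find():
    func = dispatch.get(attr["attributes_std_function"])
    if func is not None:
        # pass a single-column DataFrame for the attribute being validated
        validated = func(df.select(attr["attributes_names"]))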
I have a SQL class in Python which inserts data into my DB. In my table, one column is a JSON field, and when I insert data into that table I get an error (psycopg2.ProgrammingError: can't adapt type 'dict').
I have used json.load, json.loads, json.dump, and json.dumps; none of them worked. I also tried string formatting, which did not work either.
Any idea how to do this?
My demo code is:
json_data = {
    "key": "value"
}
query = """INSERT INTO table(json_field) VALUES(%s)"""
self.cursor.execute(query, ([json_data,]))
self.connection.commit()
The block of code below worked for me:
import psycopg2
import json
json_data = {
    "key": "value"
}
# serialize the dict to a JSON string, which psycopg2 can adapt
json_object = json.dumps(json_data, indent=4)
query = """INSERT INTO json_t(field) VALUES(%s)"""
dbConn = psycopg2.connect(database='test', port=5432, user='username')
cursor = dbConn.cursor()
cursor.execute(query, ([json_object,]))
dbConn.commit()
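As an alternative, psycopg2 ships a Json adapter that wraps the dictionary so you don't have to serialize it yourself. A minimal sketch of the same insert:

import psycopg2
from psycopg2.extras import Json

json_data = {
    "key": "value"
}
query = """INSERT INTO json_t(field) VALUES(%s)"""
dbConn = psycopg2.connect(database='test', port=5432, user='username')
cursor = dbConn.cursor()
# Json() adapts the dict to a JSON literal for the driver
cursor.execute(query, (Json(json_data),))
dbConn.commit()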
I have a table that already exists with the following schema:
{
  "schema": {
    "fields": [
      {
        "mode": "required",
        "name": "full_name",
        "type": "string"
      },
      {
        "mode": "required",
        "name": "age",
        "type": "integer"
      }
    ]
  }
}
It already contains entries like:
{'full_name': 'John Doe',
'age': int(33)}
I want to insert a new record with a new field and have the load job automatically add the new column as it loads. The new format looks like this:
record = {'full_name': 'Karen Walker',
'age': int(48),
'zipcode': '63021'}
My code is as follows:
from google.cloud import bigquery
client = bigquery.Client(project=projectname)
table = client.get_table(table_id)
config = bigquery.LoadJobConfig()
config.autodetect = True
config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
config.schema_update_options = [
bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
]
job = client.load_table_from_json([record], table, job_config=config)
job.result()
This results in the following error:
400 Provided Schema does not match Table my_project:my_dataset:mytable. Field age has changed mode from REQUIRED to NULLABLE
I can fix this by changing config.schema_update_options as follows:
config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
]
This allows me to insert the new record, with zipcode added to the schema, but it causes both full_name and age to become NULLABLE, which is not the behavior I want. Is there a way to prevent schema auto-detect from changing the existing columns?
If you need to add fields to your schema, you can do the following:
from google.cloud import bigquery
client = bigquery.Client()
table = client.get_table("your-project.your-dataset.your-table")
original_schema = table.schema # Get your current table's schema
new_schema = original_schema[:] # Creates a copy of the schema.
# Add new field to schema
new_schema.append(bigquery.SchemaField("new_field", "STRING"))
# Set new schema in your table object
table.schema = new_schema
# Call API to update your table with the new schema
table = client.update_table(table, ["schema"])
After updating your table's schema, you can load your new records with the additional field without any schema update options.
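For instance, the load job from the question can then be rerun without schema_update_options, since the new column already exists. A sketch, assuming record and table_id from the question:

from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
# no schema_update_options needed: the new column is already part of the table schema
job = client.load_table_from_json([record], table_id, job_config=config)
job.result()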
I have a small JSON file with the following content:
{
    "IdTitulo": "Jaws",
    "IdDirector": "Steven Spielberg",
    "IdNumber": 8,
    "IdDecimal": "2.33"
}
And there is a schema on my db collection, named test_dec. This is what I've used to create the schema:
db.createCollection("test_dec",
{validator: {
$jsonSchema: {
bsonType: "object",
required: ["IdTitulo","IdDirector"],
properties: {
IdTitulo: {
"bsonType": "string",
"description": "string type, nombre de la pelicula"
},
IdDirector: {
"bsonType": "string",
"description": "string type, nombre del director"
},
IdNumber : {
"bsonType": "int",
"description": "number type to test"
},
IdDecimal : {
"bsonType": "decimal",
"description": "decimal type"
}
}
}}
})
I've made multiple attempts to insert the data; the problem is in the IdDecimal field value. Some of the trials replaced the IdDecimal line with:
"IdDecimal": 2.33
"IdDecimal": {"$numberDecimal": "2.33"}
"IdDecimal": NumberDecimal("2.33")
None of them work. The second one is the formal solution provided by the MongoDB manuals (mongodb-extended-json), and the error is the output I've placed in my question: bson.errors.InvalidDocument: key '$numberDecimal' must not start with '$'.
I am currently using Python to load the JSON. I've been playing around with this script:
import os, sys
import re
import io
import json
from pymongo import MongoClient
from bson.raw_bson import RawBSONDocument
from bson.json_util import CANONICAL_JSON_OPTIONS, dumps, loads
import bsonjs as bs

# connection
client = MongoClient('localhost', 27018, document_class=RawBSONDocument)
db = client['myDB']
coll = db['test_dec']
other_col = db['free']

for fname in os.listdir('/mnt/win/load'):
    num = re.findall("\d+", fname)
    if num:
        with io.open(fname, encoding="ISO-8859-1") as f:
            doc_data = loads(dumps(f, json_options=CANONICAL_JSON_OPTIONS))
            print(doc_data)

test = '{"idTitulo":"La pelicula","idRelease":2019}'
raw_bson = bs.loads(test)
load_raw = RawBSONDocument(raw_bson)
db.other_col.insert_one(load_raw)
client.close()
I am using a JSON file, so if I try to put anything like Decimal128('2.33') in it, the output is "ValueError: No JSON object could be decoded", because that makes the JSON invalid.
The result of
db.other_col.insert_one(load_raw)
Is that the content of "test" is inserted.
But I cannot use doc_data with RawBSONDocument, because it fails with:
TypeError: unpack_from() argument 1 must be string or buffer, not list
When I do manage to parse the JSON directly into a RawBSONDocument, all the raw text ends up inside it, and the record in the database looks like this sample:
{
"_id" : ObjectId("5eb2920a34eea737626667c2"),
"0" : "{\n",
"1" : "\t\"IdTitulo\": \"Gremlins\",\n",
"2" : "\t\"IdDirector\": \"Joe Dante\",\n",
"3" : "\t\"IdNumber\": 6,\n",
"4" : "\"IdDate\": {\"$date\": \"2010-06-18T:00.12:00Z\"}\t\n",
"5" : "}\n"
}
It seems it is not that simple to load an extended JSON into MongoDB. I am using the extended version because I want to use schema validation.
Oleg pointed out that it is $numberDecimal and not NumberDecimal as I had it before. I've fixed the JSON file, but nothing changed.
Executed:
with io.open(fname, encoding="ISO-8859-1") as f:
    doc_data = json.load(f)
    coll.insert(doc_data)
And the json file:
{
    "IdTitulo": "Gremlins",
    "IdDirector": "Joe Dante",
    "IdNumber": 6,
    "IdDecimal": {"$numberDecimal": "3.45"}
}
One more roll of the dice from me. Since you are using schema validation, I would recommend defining a class and being explicit about each field and how you propose to convert it to the relevant Python datatype. While your solution is generic, the data structure has to be rigid to match the validation.
IMO this is clearer, and you have control over any errors etc. within the class.
Just to confirm, I ran the schema validation and this works with the supplied validation.
from pymongo import MongoClient
import bson.json_util
import dateutil.parser
import json

class Film:
    def __init__(self, file):
        data = file.read()
        loaded = json.loads(data)
        self.IdTitulo = loaded.get('IdTitulo')
        self.IdDirector = loaded.get('IdDirector')
        self.IdDecimal = bson.json_util.Decimal128(loaded.get('IdDecimal'))
        self.IdNumber = int(loaded.get('IdNumber'))
        self.IdDateTime = dateutil.parser.parse(loaded.get('IdDateTime'))

    def insert_one(self, collection):
        collection.insert_one(self.__dict__)

client = MongoClient()
mycollection = client.mydatabase.test_dec

with open('c:/temp/1.json', 'r') as jfile:
    film = Film(jfile)
    film.insert_one(mycollection)
gives:
> db.test_dec.findOne()
{
"_id" : ObjectId("5eba79eabf951a15d32843ae"),
"IdTitulo" : "Jaws",
"IdDirector" : "Steven Spielberg",
"IdDecimal" : NumberDecimal("2.33"),
"IdNumber" : 8,
"IdDateTime" : ISODate("2020-05-12T10:08:21Z")
}
>
JSON file used:
{
"IdTitulo": "Jaws",
"IdDirector": "Steven Spielberg",
"IdNumber": 8,
"IdDecimal": "2.33",
"IdDateTime": "2020-05-12T11:08:21+0100"
}
JSON with type information is called Extended JSON. Following the examples, construct extended JSON for your data:
ext_json = '''
{
"IdTitulo": "Jaws",
"IdDirector": "Steven Spielberg",
"IdNumber": 8,
"IdDecimal": {"$numberDecimal":"2.33"}
}
'''
In Python, use json_util to load extended json into a Python dictionary:
from bson.json_util import loads
doc = loads(ext_json)
print(doc)
# {u'IdTitulo': u'Jaws', u'IdDirector': u'Steven Spielberg', u'IdDecimal': Decimal128('2.33'), u'IdNumber': 8}
The result of this load is sometimes referred to as a "BSON document", but it is not BSON, which is binary. "BSON" in this context really means that some values are not of Python standard library types. The "document" part basically means the object is a dictionary.
You will notice that IdDecimal is of a non-standard library type:
print(type(doc['IdDecimal']))
# <class 'bson.decimal128.Decimal128'>
To insert this dictionary into MongoDB, follow the pymongo tutorial:
from pymongo import MongoClient
client = MongoClient('localhost', 14420)
db = client.test_database
collection = db.test_collection
collection.insert_one(doc)
print(doc)
Finally, I've got the solution, and it uses RawBSONDocument.
First the json file:
{
    "IdTitulo": "Dead Snow",
    "IdDirector": "Tommy Wirkola",
    "IdNumber": 11,
    "IdDecimal": {"$numberDecimal": "2.22"}
}
And the validation schema is the same $jsonSchema shown in the question, created with db.createCollection("test_dec", {validator: ...}).
So, the collection in this case is "test_dec".
And here is the Python script that opens the .json file, reads it, and parses it to be imported into MongoDB:
import json
from bson.raw_bson import RawBSONDocument
from pymongo import MongoClient
import bsonjs

# connection
client = MongoClient('localhost', 27018)
db = client['movieDB']
coll = db['test_dec']

# open and read the file, then round-trip it through json and bsonjs
with open('1.json', 'r') as jfile:
    data = jfile.read()
    loaded = json.loads(data)
    dumped = json.dumps(loaded, indent=4)
    bson_bytes = bsonjs.loads(dumped)
    coll.insert_one(RawBSONDocument(bson_bytes))
client.close()
The inserted document:
{
"_id" : ObjectId("5eb971ec6fbab859dfae8a6f"),
"IdTitulo" : "Dead Snow",
"IdDirector" : "Toomy Wirkola",
"IdDecimal" : NumberDecimal("2.22"),
"IdNumber" : 11
}
I don't know how it flipped the fields IdDecimal and IdNumber, but it passes the validation and I am really happy.
I tried a document with 'hello' instead of a number in the $numberDecimal field, and the insertion resulted in:
{
"_id" : ObjectId("5eb973b76fbab859dfae8ecd"),
"IdTitulo" : "Shining",
"IdDirector" : "Stanley Kubrick",
"IdDecimal" : NumberDecimal("NaN"),
"IdNumber" : 19
}
Thanks to all who tried to help, especially Oleg! Thank you for being so patient.
Could you not just use bson.decimal128.Decimal128? Or am I missing something?
from pymongo import MongoClient
from bson.decimal128 import Decimal128

db = MongoClient()['mydatabase']
data = {
    "IdTitulo": "Jaws",
    "IdDirector": "Steven Spielberg",
    "IdNumber": 8,
    "IdDecimal": "2.33"
}
# convert the string to Decimal128 before inserting so it passes the validator
data['IdDecimal'] = Decimal128(data['IdDecimal'])
db.other_col.insert_one(data)
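When you read the document back, the field comes out as bson.decimal128.Decimal128; if you need a standard Python Decimal again, to_decimal() converts it:

doc = db.other_col.find_one({"IdTitulo": "Jaws"})
print(type(doc['IdDecimal']))         # <class 'bson.decimal128.Decimal128'>
print(doc['IdDecimal'].to_decimal())  # Decimal('2.33')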
I am using MongoDB v3.6.2 and using $set to update a field, and it just doesn't work; I am clueless as to why. Any pointers are appreciated.
from pymongo import MongoClient
from bson import ObjectId
import os, pymongo

dbuser = os.environ.get('user', '')
dbpass = os.environ.get('pwd', '')
uri = 'mongodb://{dbuser}:{dbpass}@machineip/data'.format(**locals())
client = MongoClient(uri)
db = client.data
collection = db['test']
print db.version
db.collection.update(
    { "_id" : ObjectId("5a95a1c32a2e2e0025e6d6e2") },
    { "$set":
        {
            "status": "submission"
        }
    }
)
Document:
{
"_id" : ObjectId("5a95a1c32a2e2e0025e6d6e2"),
"status" : "Submitting",
"endRev" : "9531c3448d3f7713dc74c4b05d177ecf0c6e4df6",
"chip" : "4364",
}
Your update isn't working because of the match portion of your query:
{ "_id": "5a95a1c32a2e2e0025e6d6e2" }
That is searching for a document with a string _id. You must cast to an ObjectId in order for it to find the matching document and perform the update.
{ "_id" : ObjectId("5a95a1c32a2e2e0025e6d6e2") }
Also be sure to include from bson import ObjectId (ObjectId lives in bson, not pymongo).
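Putting both fixes together, a minimal sketch of the corrected call (using update_one with the collection handle from the question):

from bson import ObjectId

collection = db['test']
collection.update_one(
    {"_id": ObjectId("5a95a1c32a2e2e0025e6d6e2")},
    {"$set": {"status": "submission"}}
)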
Use update_many to update more than one document; if you want to update one document, use update_one. (The older update method is deprecated.)
from bson import ObjectId
db.collection.update_many({"_id": ObjectId("5a95a1c32a2e2e0025e6d6e2")}, {"$set": {"status": "submission"}})
I hope it helps.