I'm quite new to MongoDB so this might be a rookie question.
I've written a script to insert a few records in my database but I want it to be transactional. So that even if an exception arises, all the changes made before this is rollbacked.
try:
# CONNECTIONS
db_sql = pymysql.connect(host, user, password, name)
mongo_client = pymongo.MongoClient(MONGO_URL, retryWrites=True)
mongo_db = mongo_client[MONGO_DATABASE_NAME]
mongo_collection = mongo_db[MONGO_COLLECTION]
input_list = get_input(db_sql)
doc_list = get_docs(mongo_collection)
added_by = "SYSTEM"
for doc in doc_list:
for input in input_list:
bot_ref = input['bot_ref']
new_doc = {
'customer_id': doc['customer_id'],
'channel': doc['channel'],
'locale': doc['locale'],
'available_for': bot_ref,
'display_name': doc['display_name'],
'added_by': added_by,
'reply_text': doc['reply_text'],
'created_on': doc['created_on'],
'created_by': doc['created_by'],
'updated_on': doc['updated_on'],
'updated_by': doc['updated_by']
}
mongo_collection.insert_one(new_doc)
except Exception as e:
print(traceback.format_exc())
raise
finally:
db_sql.close()
get_input() and get_docs() are functions I've written to fetch data from respective DBs.
In mongoDB version >=4.0(For replicaSets) and >=4.2(For Sharded cluster) multi document transactions are supported , so you can do it following way:
with client.start_session() as s:
def cb(s):
collection_one.insert_one(doc_one, session=s)
collection_two.insert_one(doc_two, session=s)
s.with_transaction(cb)
If something happen and operation is not successful during the transaction commit phase , all the changes will be rolled back automatically ...
Related
I have a sufficiently large dataset that I would like to bulk index the JSON objects in AWS OpenSearch.
I cannot see how to achieve this using any of: boto3, awswrangler, opensearch-py, elasticsearch, elasticsearch-py.
Is there a way to do this without using a python request (PUT/POST) directly?
Note that this is not for: ElasticSearch, AWS ElasticSearch.
Many thanks!
I finally found a way to do it using opensearch-py, as follows.
First establish the client,
# First fetch credentials from environment defaults
# If you can get this far you probably know how to tailor them
# For your particular situation. Otherwise SO is a safe bet :)
import boto3
credentials = boto3.Session().get_credentials()
region='eu-west-2' # for example
auth = AWSV4SignerAuth(credentials, region)
# Now set up the AWS 'Signer'
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
auth = AWSV4SignerAuth(credentials, region)
# And finally the OpenSearch client
host=f"...{region}.es.amazonaws.com" # fill in your hostname (minus the https://) here
client = OpenSearch(
hosts = [{'host': host, 'port': 443}],
http_auth = auth,
use_ssl = True,
verify_certs = True,
connection_class = RequestsHttpConnection
)
Phew! Let's create the data now:
# Spot the deliberate mistake(s) :D
document1 = {
"title": "Moneyball",
"director": "Bennett Miller",
"year": "2011"
}
document2 = {
"title": "Apollo 13",
"director": "Richie Cunningham",
"year": "1994"
}
data = [document1, document2]
TIP! Create the index if you need to -
my_index = 'my_index'
try:
response = client.indices.create(my_index)
print('\nCreating index:')
print(response)
except Exception as e:
# If, for example, my_index already exists, do not much!
print(e)
This is where things go a bit nutty. I hadn't realised that every single bulk action needs an, er, action e.g. "index", "search" etc. - so let's define that now
action={
"index": {
"_index": my_index
}
}
You can read all about the bulk REST API, there.
The next quirk is that the OpenSearch bulk API requires Newline Delimited JSON (see https://www.ndjson.org), which is basically JSON serialized as strings and separated by newlines. Someone wrote on SO that this "bizarre" API looked like one designed by a data scientist - far from taking offence, I think that rocks. (I agree ndjson is weird though.)
Hideously, now let's build up the full JSON string, combining the data and actions. A helper fn is at hand!
def payload_constructor(data,action):
# "All my own work"
action_string = json.dumps(action) + "\n"
payload_string=""
for datum in data:
payload_string += action_string
this_line = json.dumps(datum) + "\n"
payload_string += this_line
return payload_string
OK so now we can finally invoke the bulk API. I suppose you could mix in all sorts of actions (out of scope here) - go for it!
response=client.bulk(body=payload_constructor(data,action),index=my_index)
That's probably the most boring punchline ever but there you have it.
You can also just get (geddit) .bulk() to just use index= and set the action to:
action={"index": {}}
Hey presto!
Now, choose your poison - the other solution looks crazily shorter and neater.
PS The well-hidden opensearch-py documentation on this are located here.
conn = wr.opensearch.connect(
host=self.hosts, # URL
port=443,
username=self.username,
password=self.password
)
def insert_index_data(data, index_name='stocks', delete_index_data=False):
""" Bulk Create
args: body [{doc1}{doc2}....]
"""
if delete_index_data:
index_name = 'symbol'
self.delete_es_index(index_name)
resp = wr.opensearch.index_documents(
self.conn,
documents=data,
index=index_name
)
print(resp)
return resp
I have used below code to bulk insert records from postgres into OpenSearch ( ES 7.2 )
import sqlalchemy as sa
from sqlalchemy import text
import pandas as pd
import numpy as np
from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk
import json
engine = sa.create_engine('postgresql+psycopg2://postgres:postgres#127.0.0.1:5432/postgres')
host = 'search-xxxxxxxxxx.us-east-1.es.amazonaws.com'
port = 443
auth = ('username', 'password') # For testing only. Don't store credentials in code.
# Create the client with SSL/TLS enabled, but hostname verification disabled.
client = OpenSearch(
hosts = [{'host': host, 'port': port}],
http_compress = True,
http_auth = auth,
use_ssl = True,
verify_certs = True,
ssl_assert_hostname = False,
ssl_show_warn = False
)
with engine.connect() as connection:
result = connection.execute(text("select * from account_1_study_1.stg_pred where domain='LB'"))
records = []
for row in result:
record = dict(row)
record.update(record['item_dataset'])
del record['item_dataset']
records.append(record)
df = pd.DataFrame(records)
#df['Date'] = df['Date'].astype(str)
df = df.fillna("null")
print(df.keys)
documents = df.to_dict(orient='records')
#bulk(es ,documents, index='search-irl-poc-dump', raise_on_error=True)\
#response=client.bulk(body=documents,index='sample-index')
bulk(client, documents, index='search-irl-poc-dump', raise_on_error=True, refresh=True)
This is only my second task (bug I need to fix) in a Python\Flask\SQLAlchemy\Marshmallow system I need to work on. So please try to be easy with me :)
In short: I'd like to approve an apparently invalid request.
In details:
I need to handle a case in which a user might send a request with some json in which he included by mistake a duplicate value in a list.
For example:
{
"ciphers": [
"TLS_AES_256_GCM_SHA384",
"AES256-SHA256"
],
"is_default": true,
"tls_versions": [
"tls10",
"tls10",
"tls11",
]
}
What I need to do is to eliminate one of the duplicated tls1.0 values, but consider the request as valid, update the db with the correct and distinct tls versions, and in the response return the non duplicated json in body.
Current code segments are as follows:
tls Controller:
...
#client_side_tls_bp.route('/<string:tls_profile_id>', methods=['PUT'])
def update_tls_profile_by_id(tls_profile_id):
return update_entity_by_id(TlsProfileOperator, entity_name, tls_profile_id)
...
general entity controller:
...
def update_entity_by_id(operator, entity_name, entity_id):
"""flask route for updating a resource"""
try:
entity_body = request.get_json()
except Exception:
return make_custom_response("Bad Request", HTTPStatus.BAD_REQUEST)
entity_obj = operator.get(g.tenant, entity_id, g.correlation)
if not entity_obj:
response = make_custom_response(http_not_found_message(entity_name, entity_id), HTTPStatus.NOT_FOUND)
else:
updated = operator.update(g.tenant, entity_id, entity_body, g.correlation)
if updated == "accepted":
response = make_custom_response("Accepted", HTTPStatus.ACCEPTED)
else:
response = make_custom_response(updated, HTTPStatus.OK)
return response
...
tls operator:
...
#staticmethod
def get(tenant, name, correlation_id=None):
try:
tls_profile = TlsProfile.get_by_name(tenant, name)
return schema.dump(tls_profile)
except NoResultFound:
return None
except Exception:
apm_logger.error(f"Failed to get {name} TLS profile", tenant=tenant,
consumer=LogConsumer.customer, correlation=correlation_id)
raise
#staticmethod
def update(tenant, name, json_data, correlation_id=None):
schema.load(json_data)
try:
dependant_vs_names = VirtualServiceOperator.get_dependant_vs_names_locked_by_client_side_tls(tenant, name)
# locks virtual services and tls profile table simultaneously
to_update = TlsProfile.get_by_name(tenant, name)
to_update.update(json_data, commit=False)
db.session.flush() # TODO - need to change when 2 phase commit will be implemented
snapshots = VirtualServiceOperator.get_snapshots_dict(tenant, dependant_vs_names)
# update QWE
# TODO handle QWE update atomically!
for snapshot in snapshots:
QWEController.update_abc_services(tenant, correlation_id, snapshot)
db.session.commit()
apm_logger.info(f"Update successfully {len(dependant_vs_names)} virtual services", tenant=tenant,
correlation=correlation_id)
return schema.dump(to_update)
except Exception:
db.session.rollback()
apm_logger.error(f"Failed to update {name} TLS profile", tenant=tenant,
consumer=LogConsumer.customer, correlation=correlation_id)
raise
...
and in the api schema class:
...
#validates('_tls_versions')
def validate_client_side_tls_versions(self, value):
if len(noDuplicatatesList) < 1:
raise ValidationError("At least a single TLS version must be provided")
for tls_version in noDuplicatatesList:
if tls_version not in TlsProfile.allowed_tls_version_values:
raise ValidationError("Not a valid TLS version")
...
I would have prefer to solve the problem in the schema level, so it won't accept the duplication.
So, as easy as it is to remove the duplication from the "value" parameter value, how can I propagate the non duplicates list back in order to use it to update the db and the response?
Thanks.
I didn't test but I think mutating value in the validation function would work.
However, this is not really guaranteed by marshmallow's API.
The proper way to do it would be to add a post_load method to de-duplicate.
#post_load
def deduplicate_tls(self, data, **kwargs):
if "tls_versions" in data:
data["tls_version"] = list(set(data["tls_version"]))
return data
This won't maintain the order, so if the order matters, or for issues related to deduplication itself, see https://stackoverflow.com/a/7961390/4653485.
I'm trying to migrate some models from OpenERP 7 to Odoo 8 by code. I want to insert objects into new table maintaining the original id number, but it doesn't do it.
I want to insert the new object including its id number.
My code:
import openerp
from openerp import api, modules
from openerp.cli import Command
import psycopg2
class ImportCategory(Command):
"""Import categories from source DB"""
def process_item(self, model, data):
if not data:
return
# Model structure
model.create({
'id': data['id'],
'parent_id': None,
'type': data['type'],
'name': data['name']
})
def run(self, cmdargs):
# Connection to the source database
src_db = psycopg2.connect(
host="127.0.0.1", port="5432",
database="db_name", user="db_user", password="db_password")
src_cr = src_db.cursor()
try:
# Query to retrieve source model data
src_cr.execute("""
SELECT c.id, c.parent_id, c.name, c.type
FROM product_category c
ORDER BY c.id;
""")
except psycopg2.Error as e:
print e.pgerror
openerp.tools.config.parse_config(cmdargs)
dbname = openerp.tools.config['db_name']
r = modules.registry.RegistryManager.get(dbname)
cr = r.cursor()
with api.Environment.manage():
env = api.Environment(cr, 1, {})
# Define target model
product_category = env['product.category']
id_ptr = None
c_data = {}
while True:
r = src_cr.fetchone()
if not r:
self.process_item(product_category, c_data)
break
if id_ptr != r[0]:
self.process_item(product_category, c_data)
id_ptr = r[0]
c_data = {
'id': r[0],
'parent_id': r[1],
'name': r[2],
'type': r[3]
}
cr.commit()
How do I do that?
The only way I could find was to use reference attributes in others objects to relate them in the new database. I mean create relations over location code, client code, order number... and when they are created in the target database, look for them and use the new ID.
def run(self, cmdargs):
# Connection to the source database
src_db = psycopg2.connect(
host="localhost", port="5433",
database="bitnami_openerp", user="bn_openerp", password="bffbcc4a")
src_cr = src_db.cursor()
try:
# Query to retrieve source model data
src_cr.execute("""
SELECT fy.id, fy.company_id, fy.create_date, fy.name,
p.id, p.code, p.company_id, p.create_date, p.date_start, p.date_stop, p.special, p.state,
c.id, c.name
FROM res_company c, account_fiscalyear fy, account_period p
WHERE p.fiscalyear_id = fy.id AND c.id = fy.company_id AND p.company_id = fy.company_id
ORDER BY fy.id;
""")
except psycopg2.Error as e:
print e.pgerror
openerp.tools.config.parse_config(cmdargs)
dbname = openerp.tools.config['db_name']
r = modules.registry.RegistryManager.get(dbname)
cr = r.cursor()
with api.Environment.manage():
env = api.Environment(cr, 1, {})
# Define target model
account_fiscalyear = env['account.fiscalyear']
id_fy_ptr = None
fy_data = {}
res_company = env['res.company']
r = src_cr.fetchone()
if not r:
self.process_fiscalyear(account_fiscalyear, fy_data)
break
company = res_company.search([('name','like',r[13])])
print "Company id: {} | Company name: {}".format(company.id,company.name)
The previous code is only an extract from the whole source code.
I'm trying to run a multi search request on the Elasticsearch Python client. I can run the singular search correctly but can't figure out how to format the request for a msearch. According to the documentation, the body of the request needs to be formatted as:
The request definitions (metadata-search request definition pairs), as
either a newline separated string, or a sequence of dicts to serialize
(one per row).
What's the best way to create this request body? I've been searching for examples but can't seem to find any.
If you follow the demo of official doc(even thought it's for BulkAPI) , you will find how to construct your request in python with the Elasticsearch client:
Here is the newline separated string way:
def msearch():
es = get_es_instance()
search_arr = []
# req_head
search_arr.append({'index': 'my_test_index', 'type': 'doc_type_1'})
# req_body
search_arr.append({"query": {"term" : {"text" : "bag"}}, 'from': 0, 'size': 2})
# req_head
search_arr.append({'index': 'my_test_index', 'type': 'doc_type_2'})
# req_body
search_arr.append({"query": {"match_all" : {}}, 'from': 0, 'size': 2})
request = ''
for each in search_arr:
request += '%s \n' %json.dumps(each)
# as you can see, you just need to feed the <body> parameter,
# and don't need to specify the <index> and <doc_type> as usual
resp = es.msearch(body = request)
As you can see, the final-request is constructed by several req_unit.
Each req_unit construct shows below:
request_header(search control about index_name, optional mapping-types, search-types etc.)\n
reqeust_body(which involves query detail about this request)\n
The sequence of dicts to serialize way is almost same with the previous one, except that you don't need to convert it to string:
def msearch():
es = get_es_instance()
request = []
req_head = {'index': 'my_test_index', 'type': 'doc_type_1'}
req_body = {
'query': {'term': {'text' : 'bag'}},
'from' : 0, 'size': 2 }
request.extend([req_head, req_body])
req_head = {'index': 'my_test_index', 'type': 'doc_type_2'}
req_body = {
'query': {'range': {'price': {'gte': 100, 'lt': 300}}},
'from' : 0, 'size': 2 }
request.extend([req_head, req_body])
resp = es.msearch(body = request)
Here is the structure it returns. Read more about msearch.
If you are using elasticsearch-dsl, you can use the class MultiSearch.
Example from the documentation:
from elasticsearch_dsl import MultiSearch, Search
ms = MultiSearch(index='blogs')
ms = ms.add(Search().filter('term', tags='python'))
ms = ms.add(Search().filter('term', tags='elasticsearch'))
responses = ms.execute()
for response in responses:
print("Results for query %r." % response.search.query)
for hit in response:
print(hit.title)
Here is what I came up with. I am using the same document type and index so I optimized the code to run multiple queries with the same header:
from elasticsearch import Elasticsearch
from elasticsearch import exceptions as es_exceptions
import json
RETRY_ATTEMPTS = 10
RECONNECT_SLEEP_SECS = 0.5
def msearch(es_conn, queries, index, doc_type, retries=0):
"""
Es multi-search query
:param queries: list of dict, es queries
:param index: str, index to query against
:param doc_type: str, defined doc type i.e. event
:param retries: int, current retry attempt
:return: list, found docs
"""
search_header = json.dumps({'index': index, 'type': doc_type})
request = ''
for q in queries:
# request head, body pairs
request += '{}\n{}\n'.format(search_header, json.dumps(q))
try:
resp = es_conn.msearch(body=request, index=index)
found = [r['hits']['hits'] for r in resp['responses']]
except (es_exceptions.ConnectionTimeout, es_exceptions.ConnectionError,
es_exceptions.TransportError): # pragma: no cover
logging.warning("msearch connection failed, retrying...") # Retry on timeout
if retries > RETRY_ATTEMPTS: # pragma: no cover
raise
time.sleep(RECONNECT_SLEEP_SECS)
found = msearch(queries=queries, index=index, retries=retries + 1)
except Exception as e: # pragma: no cover
logging.critical("msearch error {} on query {}".format(e, queries))
raise
return found
es_conn = Elasticsearch()
queries = []
queries.append(
{"min_score": 2.0, "query": {"bool": {"should": [{"match": {"name.tokenized": {"query": "batman"}}}]}}}
)
queries.append(
{"min_score": 1.0, "query": {"bool": {"should": [{"match": {"name.tokenized": {"query": "ironman"}}}]}}}
)
queries.append(
{"track_scores": True, "min_score": 9.0, "query":
{"bool": {"should": [{"match": {"name": {"query": "not-findable"}}}]}}}
)
q_results = msearch(es_conn, queries, index='pipeliner_current', doc_type='event')
This may be what some of you are looking for if you want to do multiple queries on the same index and doc type.
Got it! Here's what I did for anybody else...
query_list = ""
es = ElasticSearch("myurl")
for obj in my_list:
query = constructQuery(name)
query_count += 1
query_list += json.dumps({})
query_list += json.dumps(query)
if query_count <= 19:
query_list += "\n"
if query_count == 20:
es.msearch(index = "m_index", body = query_list)
I was beging screwed up by having to add the index twice. Even when using the Python client you still have to include the index part described in the original docs. Works now though!
I am trying to bulk insert a lot of documents into elastic search using the Python API.
import elasticsearch
from pymongo import MongoClient
es = elasticsearch.Elasticsearch()
def index_collection(db, collection, fields, host='localhost', port=27017):
conn = MongoClient(host, port)
coll = conn[db][collection]
cursor = coll.find({}, fields=fields, timeout=False)
print "Starting Bulk index of {} documents".format(cursor.count())
def action_gen():
"""
Generator to use for bulk inserts
"""
for n, doc in enumerate(cursor):
op_dict = {
'_index': db.lower(),
'_type': collection,
'_id': int('0x' + str(doc['_id']), 16),
}
doc.pop('_id')
op_dict['_source'] = doc
yield op_dict
res = bulk(es, action_gen(), stats_only=True)
print res
The documents come from a Mongodb collection and I amusing the function above to do the bulk indexing according to the way explained in the docs.
the bulk indexing goes on filling elastic search with thousands of empty documents. Can anyone tell me what am I doing wrong?
I've never seen the bulk data put together that way, especially what you're doing with "_source". There may be a way to get that to work, I don't know off-hand, but when I tried it I got weird results.
If you look at the bulk api, ES is expecting a meta-data document, then the document to be indexed. So you need two entries in your bulk data list for each document. So maybe something like:
import elasticsearch
from pymongo import MongoClient
es = elasticsearch.Elasticsearch()
def index_collection(db, collection, fields, host='localhost', port=27017):
conn = MongoClient(host, port)
coll = conn[db][collection]
cursor = coll.find({}, fields=fields, timeout=False)
print "Starting Bulk index of {} documents".format(cursor.count())
bulk_data = []
for n, doc in enumerate(cursor):
bulk_data.append({
'_index': db.lower(),
'_type': collection,
'_id': int('0x' + str(doc['_id']), 16),
})
bulk_data.append(doc)
es.bulk(index=index_name,body=bulk_data,refresh=True)
I didn't try to run that code, though. Here is a script I know works, that you can play with, if it helps:
from elasticsearch import Elasticsearch
es_client = Elasticsearch(hosts = [{ "host" : "localhost", "port" : 9200 }])
index_name = "test_index"
if es_client.indices.exists(index_name):
print("deleting '%s' index..." % (index_name))
print(es_client.indices.delete(index = index_name, ignore=[400, 404]))
print("creating '%s' index..." % (index_name))
print(es_client.indices.create(index = index_name))
bulk_data = []
for i in range(4):
bulk_data.append({
"index": {
"_index": index_name,
"_type": 'doc',
"_id": i
}
})
bulk_data.append({ "idx": i })
print("bulk indexing...")
res = es_client.bulk(index=index_name,body=bulk_data,refresh=True)
print(res)
print("results:")
for doc in es_client.search(index=index_name)['hits']['hits']:
print(doc)