Compare data between MongoDB and MySQL using python script - python

I am working on a Django application that uses both MySQL and MongoDB to store its data. What I need to do is compare the data stored in a MongoDB collection with the data stored in a MySQL table.
For example, my MySQL database contains the table "relations", which is created using:
CREATE TABLE relations (service_id int, beneficiary_id int, PRIMARY KEY (service_id, beneficiary_id));
My MongoDB database contains a collection called "relation", which is expected to hold the same data as the relations table in MySQL. Here is one document from that collection:
{'_id': 0, 'service_id': 1, 'beneficiary_id': 32}
I wrote a Python script that compares the data in the MySQL relations table with the data in the Mongo relation collection. It works as follows:
# Key pairs known to MySQL, as dicts via the Django ORM
mysql_relations = Relations.objects.values('beneficiary_id', 'service_id')
# Mongo documents whose key pair matches none of the MySQL rows
mongo_relations_not_in_mysql = relations_mongodb.find({'$nor': list(mysql_relations)})
# Key pairs known to Mongo
mongo_relations = relations_mongodb.find({}, {'_id': 0, 'beneficiary_id': 1, 'service_id': 1})
# Build one big OR filter and exclude every pair that exists in Mongo
filter_list = Q()
for mongo_relation in mongo_relations:
    filter_list |= Q(**mongo_relation)
mysql_relations_not_in_mongo = Relations.objects.exclude(filter_list)
However, this code takes forever.
I think the main problem is the composite primary key (two columns), which forced me to use Q() objects and the '$nor' operator.
What do you suggest?
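For reference, a lighter-weight alternative (a minimal sketch, assuming both key sets fit comfortably in memory, and using the same Relations model and relations_mongodb collection) is to pull the (service_id, beneficiary_id) pairs from both databases into Python sets and diff them there:
# Key pairs from MySQL via the Django ORM, and from MongoDB via a projection
mysql_pairs = set(Relations.objects.values_list('service_id', 'beneficiary_id'))
mongo_pairs = {
    (doc['service_id'], doc['beneficiary_id'])
    for doc in relations_mongodb.find({}, {'_id': 0, 'service_id': 1, 'beneficiary_id': 1})
}

mongo_pairs_not_in_mysql = mongo_pairs - mysql_pairs
mysql_pairs_not_in_mongo = mysql_pairs - mongo_pairs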

Just in case someone is interested, I used the following solution to optimize the data comparison.
(The idea was to create a temporary MySQL table to store Mongo's data, then do the comparison between the two MySQL tables.) The code is below:
Get the relations from MongoDB:
mongo_relations = relations_mongodb.find({}, {'_id': 0, 'service_id': 1, 'beneficiary_id': 1})
Create a temporary MySQL table to store MongoDB's relations:
cursor = connection.cursor()
cursor.execute(
    "CREATE TEMPORARY TABLE temp_relations (service_id int, beneficiary_id int, INDEX `id_related` (`service_id`, `beneficiary_id`) );"
)
Insert MongoDB's relations into the temporary table just created:
cursor.executemany(
    'INSERT INTO temp_relations (service_id, beneficiary_id) VALUES (%(service_id)s, %(beneficiary_id)s)',
    list(mongo_relations)
)
Get the MongoDB relations that do not exist in MySQL:
cursor.execute(
    "SELECT service_id, beneficiary_id FROM temp_relations WHERE (service_id, beneficiary_id) NOT IN ("
    "SELECT service_id, beneficiary_id FROM relations);"
)
mongo_relations_not_in_mysql = cursor.fetchall()
Get the MySQL relations that do not exist in MongoDB:
cursor.execute(
    "SELECT id, service_id, beneficiary_id, date FROM relations WHERE (service_id, beneficiary_id) NOT IN ("
    "SELECT service_id, beneficiary_id FROM temp_relations);"
)
mysql_relations_not_in_mongo = cursor.fetchall()
cursor.close()  # Close the MySQL cursor
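One small follow-up: a MySQL temporary table lives only for the current session, so it disappears on its own when the connection closes; but if the same connection is reused for repeated comparisons, it can be dropped explicitly before closing the cursor:
cursor.execute("DROP TEMPORARY TABLE IF EXISTS temp_relations;")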

Related

Postgresql database error: column does not exist

I'm using PostgreSQL via Python, but I'm getting an error on the following insertion, alleging that 'column "\ufeff61356169" does not exist'.
c_id = "\ufeff61356169"
chunk = "engine"
query = f"""
INSERT INTO company_chunks(company_id, chunk)
VALUES(`{c_id}`, `{chunk}`);
"""
c.execute(query)
>>>
DatabaseError: {'S': 'ERROR', 'V': 'ERROR', 'C': '42703', 'M': 'column "\ufeff61356169" does not exist', 'P': '88', 'F': 'parse_relation.c', 'L': '3514', 'R': 'errorMissingColumn'}
One key note: "\ufeff61356169" is the value that is supposed to be inserted into the column, so the error confuses me: it's mistaking the insertion value for the name of the column that should receive it. Any thoughts?
Just to verify that everything else is in working order, I checked that my table was successfully created.
query = """
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'company_chunks';
"""
c.execute(query)
c.fetchall()
>>>
(['company_id'], ['chunk'])
So the table does exist, and it has the columns I'm trying to insert into. Where am I going wrong here?
Btw, I'm connecting to this database, which is hosted in GCP, via the Cloud SQL Python Connector. However, the connector was able to create the table, so I believe the problem is specific to Python syntax and/or Postgres.
Edit: For the sake of understanding what this table looks like, here's the creation query.
query= """
CREATE TABLE company_chunks
(
company_id VARCHAR(25) NOT NULL,
chunk VARCHAR(100) NOT NULL
);
"""
c.execute(query)
conn.commit()
It's better to do it with a %s placeholder: the backtick-quoted value in the f-string ends up being parsed as an identifier (a column name) rather than a string literal, which is why Postgres complains that column "\ufeff61356169" does not exist. In PostgreSQL, string literals take single quotes; better still, let the driver handle the quoting with a parameterized query (which also protects against SQL injection):
sql = "INSERT INTO company_chunks(company_id, chunk) VALUES (%s, %s)"
var = (c_id, chunk)
mycursor.execute(sql, var)
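Applied to the snippet in the question (same c cursor and conn connection), that would look roughly like:
sql = "INSERT INTO company_chunks(company_id, chunk) VALUES (%s, %s)"
c.execute(sql, (c_id, chunk))
conn.commit()  # persist the row, just as after CREATE TABLE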

Delete values from db2 table if exist in schema in Python

I want to ask for a little help with my problem. I have a SQL query that gets all the tables from some schema and puts those tables in a list in Python. For example:
tablesList = ['TABLE1','TABLE2',...]
After I get this list of tables, I go through each table in a for loop, for example:
for table in tablesList:
    ...
    # here I want to check whether this table exists in some DB2 schema and, if it does,
    # delete the content of this table; otherwise move on to the next table without deleting anything
The query for the check would be:
sql = """SELECT COUNT(*) FROM SYSIBM.SYSTABLES
         WHERE TYPE = 'T'
           AND CREATOR = 'MY_SCHEMA'
           AND NAME = '{table}';""".format(table=table)
cursor.execute(sql)
(rows_count,) = cursor.fetchone()  # COUNT(*) always returns exactly one row
if rows_count == 0:
    pass  # table not in the schema, skip it
else:
    delete...
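For what it's worth, a minimal sketch of the whole check-then-delete loop, assuming the same cursor and tablesList from above, an ibm_db_dbi-style driver that uses ? parameter markers, and that conn is the underlying connection:
check_sql = ("SELECT COUNT(*) FROM SYSIBM.SYSTABLES "
             "WHERE TYPE = 'T' AND CREATOR = 'MY_SCHEMA' AND NAME = ?")

for table in tablesList:
    cursor.execute(check_sql, (table,))
    (table_count,) = cursor.fetchone()
    if table_count == 0:
        continue  # table does not exist in MY_SCHEMA, leave it alone
    # Table exists: clear its content (the identifier comes from the catalog, not user input)
    cursor.execute('DELETE FROM MY_SCHEMA.{}'.format(table))
conn.commit()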

How to automap the result set of a custom SQL query in SQLAlchemy

I'd like to run raw SQL queries through SQLAlchemy and have the resulting rows use Python types which are automatically mapped from the database types. This AutoMap functionality is available for tables in the database. Is it available for any arbitrary result set?
As an example, we build a small SQLite database:
import sqlite3
con = sqlite3.connect('test.db')
cur = con.cursor()
cur.execute("CREATE TABLE Trainer (id INTEGER PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), dob DATE, tiger_skill FLOAT);")
cur.execute("INSERT INTO Trainer VALUES (1, 'Joe', 'Exotic', '1963-03-05', 0.6)")
cur.execute("INSERT INTO Trainer VALUES (2, 'Carole', 'Baskin', '1961-06-06', 0.3)")
cur.close()
con.commit()
con.close()
And using SQLAlchemy, I query the newly created database "test.db":
from sqlalchemy import create_engine
engine = create_engine("sqlite:///test.db")
connection = engine.connect()
CUSTOM_SQL_QUERY = "SELECT count(*) as total_trainers, min(dob) as first_dob from Trainer"
result = connection.execute(CUSTOM_SQL_QUERY)
for r in result:
    print(r)
>>> (2, '1961-06-06')
Notice that the second column in the result set is a python string, not a python datetime.date object. Is there a way for sqlalchemy to automap an arbitrary result set? Or is this automap reflection capability limited to just actual tables in the database?
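For what it's worth, one way to get typed rows back from an arbitrary query (a sketch, assuming SQLAlchemy 1.4+) is to declare the expected column types on the text() construct; the dialect then applies its usual result processing, e.g. turning SQLite date strings into datetime.date objects:
from sqlalchemy import create_engine, text, column, Integer, Date

engine = create_engine("sqlite:///test.db")
stmt = text(
    "SELECT count(*) AS total_trainers, min(dob) AS first_dob FROM Trainer"
).columns(
    column("total_trainers", Integer),  # declare the type of each result column
    column("first_dob", Date),
)
with engine.connect() as connection:
    for r in connection.execute(stmt):
        print(r)  # first_dob now comes back as a datetime.date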

Bulk Insert Data from List of Dictionaries into Postgresql database [Faster Way]?

For Example:
books = [{'name':'pearson', 'price':60, 'author':'Jesse Pinkman'},{'name':'ah publications', 'price':80, 'author':'Gus Fring'},{'name':'euclidean', 'price':120, 'author':'Skyler White'},{'name':'Nanjial', 'price':260, 'author':'Saul Goodman'}]
I need to insert each dictionary into an already created table, taking only 'author' and 'price'.
I have around 100k records to insert into the table.
Right now I loop through the list of dictionaries, take the required key/value pairs, and insert them one by one:
def insert_books(self, val):
    cur = self.con.cursor()
    sql = """insert into testtable values {}""".format(val)
    cur.execute(sql)
    self.con.commit()
    cur.close()

for i in books:
    result = i['author'], i['price']
    db_g.insert_books(result)  # db_g is class - connection properties
So is there a faster and easier way to bulk insert the data like 10k at a time?
I think a bulk insert that dumps the whole DataFrame will be much faster. (See: Why Bulk Import is faster than bunch of INSERTs?)
import pandas as pd
import sqlalchemy

def db_conn():
    connection = sqlalchemy.create_engine(//connection string)
    return connection

books = [{'name':'pearson', 'price':60, 'author':'Jesse Pinkman'},{'name':'ah publications', 'price':80, 'author':'Gus Fring'},{'name':'euclidean', 'price':120, 'author':'Skyler White'},{'name':'Nanjial', 'price':260, 'author':'Saul Goodman'}]
df_to_ingest = pd.DataFrame(books)
df_to_ingest = df_to_ingest[['author', 'price']]
df_to_ingest.to_sql('tablename', db_conn(), if_exists='append', index=False)
Hope this helps
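Another option that skips pandas entirely (a sketch, assuming a plain psycopg2 connection and that testtable has author and price columns; adjust the column list to your actual table) is psycopg2's execute_values helper, which sends the rows in large batches:
from psycopg2.extras import execute_values

def insert_books_bulk(con, books, page_size=10000):
    # Build the (author, price) tuples once, then insert them in batches
    rows = [(b['author'], b['price']) for b in books]
    with con.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO testtable (author, price) VALUES %s",
            rows,
            page_size=page_size,  # rows per INSERT statement sent to the server
        )
    con.commit()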

Getting the id of the last record inserted for Postgresql SERIAL KEY with Python

I am using SQLAlchemy without the ORM, i.e. using hand-crafted SQL statements to directly interact with the backend database. I am using PG as my backend database (psycopg2 as DB driver) in this instance - I don't know if that affects the answer.
I have statements like this (for brevity, assume that conn is a valid connection to the database):
conn.execute("INSERT INTO user (name, country_id) VALUES ('Homer', 123)")
Assume also that the user table consists of the columns (id [SERIAL PRIMARY KEY], name, country_id)
How may I obtain the id of the new user, ideally, without hitting the database again?
You might be able to use the RETURNING clause of the INSERT statement like this:
result = conn.execute("INSERT INTO user (name, country_id) VALUES ('Homer', 123)
RETURNING *")
If you only want the resulting id:
result = conn.execute("INSERT INTO user (name, country_id) VALUES ('Homer', 123)
RETURNING id")
[new_id] = result.fetchone()
Use lastrowid:
result = conn.execute("INSERT INTO user (name, country_id) VALUES ('Homer', 123)")
result.lastrowid
Current SQLAlchemy documentation suggests
result.inserted_primary_key should work!
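A minimal Core-level sketch of that (assuming SQLAlchemy 1.4+, with the user table reflected into a Table object and a placeholder connection URL):
from sqlalchemy import create_engine, MetaData, Table

engine = create_engine("postgresql+psycopg2://localhost/test")  # placeholder URL
metadata = MetaData()
user = Table("user", metadata, autoload_with=engine)  # reflect the existing table

with engine.begin() as conn:
    result = conn.execute(user.insert().values(name="Homer", country_id=123))
    print(result.inserted_primary_key)  # e.g. (42,)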
Python + SQLAlchemy
After the commit, you get the primary key column id (auto-incremented) updated in your object:
db.session.add(new_usr)
db.session.commit() #will insert the new_usr data into database AND retrieve id
idd = new_usr.usrID # usrID is the autoincremented primary_key column.
return jsonify(idd),201 #usrID = 12, correct id from table User in Database.
This question has been asked many times on Stack Overflow, and no answer I have seen is comprehensive. Googling 'sqlalchemy insert get id of new row' brings up a lot of them.
There are three levels to SQLAlchemy.
Top: the ORM.
Middle: Database abstraction (DBA) with Table classes etc.
Bottom: SQL using the text function.
To an OO programmer the ORM level looks natural, but to a database programmer it looks ugly and the ORM gets in the way. The DBA layer is an OK compromise. The SQL layer looks natural to database programmers and would look alien to an OO-only programmer.
Each level has its own syntax, similar but different enough to be frustrating. On top of this there is almost too much documentation online, which makes it very hard to find the answer.
I will describe how to get the inserted id AT THE SQL LAYER for the RDBMSs I use.
Table: User(user_id integer primary autoincrement key, user_name string)
conn: a Connection obtained within SQLAlchemy to the DBMS you are using.
SQLite
======
insstmt = text(
'''INSERT INTO user (user_name)
VALUES (:usernm) ''' )
# Execute within a transaction (optional)
txn = conn.begin()
result = conn.execute(insstmt, usernm='Jane Doe')
# The id!
recid = result.lastrowid
txn.commit()
MS SQL Server
=============
insstmt = text(
'''INSERT INTO user (user_name)
OUTPUT inserted.user_id
VALUES (:usernm) ''' )
txn = conn.begin()
result = conn.execute(insstmt, usernm='Jane Doe')
# The id!
recid = result.fetchone()[0]
txn.commit()
MariaDB/MySQL
=============
insstmt = text(
'''INSERT INTO user (user_name)
VALUES (:usernm) ''' )
txn = conn.begin()
result = conn.execute(insstmt, usernm='Jane Doe')
# The id!
recid = conn.execute(text('SELECT LAST_INSERT_ID()')).fetchone()[0]
txn.commit()
Postgres
========
insstmt = text(
'''INSERT INTO user (user_name)
VALUES (:usernm)
RETURNING user_id ''' )
txn = conn.begin()
result = conn.execute(insstmt, usernm='Jane Doe')
# The id!
recid = result.fetchone()[0]
txn.commit()
result.inserted_primary_key
Worked for me. The only thing to note is that this returns a list that contains the last_insert_id.
Make sure you use fetchrow/fetch to receive the returned row:
insert_stmt = user.insert().values(name="homer", country_id="123").returning(user.c.id)
row_id = await conn.fetchrow(insert_stmt)
For Postgres inserts from Python code, it is simple to use the RETURNING keyword with the col_id (the name of the column whose last inserted row id you want) at the end of the insert statement.
Syntax:
from sqlalchemy import create_engine
conn_string = "postgresql://USERNAME:PSWD@HOSTNAME/DATABASE_NAME"
db = create_engine(conn_string)
conn = db.connect()

INSERT INTO emp_table (col_id, Name, Age)
VALUES (3, 'xyz', 30) RETURNING col_id;
or (if the col_id column is auto-increment):
insert_sql = ("INSERT INTO emp_table (Name, Age) VALUES ('xyz', 30) RETURNING col_id;")
result = conn.execute(insert_sql)
[last_row_id] = result.fetchone()
print(last_row_id)
# output = 3
