My problem is: I have multiple identical databases and I want to merge them into one, but I may have duplicate primary key entries. What I'm trying to do is handle the duplicates before inserting the rows into MySQL.
My current code is:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

df = pd.DataFrame()
duplicates = pd.DataFrame()
size = []
lignes = []
chunksize = 100000

for numligne, db in enumerate(dbtuple):  # for each database in the tuple given as input
    engine = create_engine(URL(
        drivername="mysql",
        username="xxx",
        password="xxx",
        host="localhost",
        database=db
    ))
    conn = engine.connect()
    # Get the data from the table given as input
    sql = "SELECT * FROM " + tableName
    # Because the query can return a very large number of rows, the result
    # is read in chunks so that each dataframe holds at most 100k rows
    generator_df = pd.read_sql(sql=sql, con=conn, chunksize=chunksize)
    sizechunk = 0
    for dataframe in generator_df:
        df = pd.concat([df, dataframe], ignore_index=True, axis=0, sort=False)
        # Add the size of the chunk to know how many rows we have per database
        sizechunk += dataframe.shape[0]
    size.append(sizechunk)

    if tableName == 'table1':
        duplicates = df.duplicated(subset='id')
        for i in range(0, len(df)):
            if duplicates[i]:
                df.id[i] = str(numligne) + '_' + df.id[i]
    # same for all tables
But this is not a Pythonic way to do it at all, and it takes very long to execute. Do you have any suggestions on how to improve the code and make it faster?
Here is my DB schema for better understanding:
table1 = Table('table1', metadata,
    Column('id', VARCHAR(40), primary_key=True, nullable=False),
    mysql_engine='InnoDB'
)
table2 = Table('table2', metadata,
    Column('id', VARCHAR(40), primary_key=True, nullable=False),
    Column('id_of', VARCHAR(20), ForeignKey("table1.id"), nullable=False, index=True)
)
table3 = Table('table3', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True),
    Column('id_produit_enfant', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
)
table4 = Table('table4', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
)
table5 = Table('table5', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
)
table6 = Table('table6', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
)
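One way to speed this up is to avoid both growing df with pd.concat inside the loop (which copies the whole frame on every iteration) and the row-by-row duplicate loop. A minimal sketch of a vectorized alternative, assuming numligne is the position of the database in dbtuple and make_engine is a hypothetical helper wrapping the create_engine(URL(...)) call above:

import pandas as pd

frames = []
for numligne, db in enumerate(dbtuple):
    engine = make_engine(db)  # hypothetical helper wrapping create_engine(URL(...))
    for chunk in pd.read_sql("SELECT * FROM " + tableName, con=engine, chunksize=100000):
        chunk["db_index"] = numligne  # remember which database each row came from
        frames.append(chunk)

# Concatenate once instead of inside the loop (avoids quadratic copying)
df = pd.concat(frames, ignore_index=True)

# Boolean mask of every row whose id already appeared in an earlier row
dup_mask = df.duplicated(subset="id", keep="first")

# Prefix only the duplicated ids in a single vectorized assignment
df.loc[dup_mask, "id"] = (
    df.loc[dup_mask, "db_index"].astype(str) + "_" + df.loc[dup_mask, "id"]
)

Note that renaming a duplicated id in table1 also breaks the rows in table2 whose id_of still points at the old value, so the same prefixed value has to be propagated to the child tables as well.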
I'm trying to understand what set_ means in SQLAlchemy's on_conflict_do_update method. I have the following table:
Table(
    "test",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("firstname", String(100)),
    Column("lastname", String(100)),
)
and I want to insert something like this (if I wrote it in psql):

INSERT INTO test (id, firstname, lastname) VALUES (1, 'John', 'Doe')
ON CONFLICT (id) DO UPDATE SET firstname = EXCLUDED.firstname, lastname = EXCLUDED.lastname

I did some due diligence and saw people build the set_ like this:
from sqlalchemy.dialects import postgresql

insert_stmt = postgresql.insert(target).values([{'id': 1, 'firstname': 'John', 'lastname': 'Doe'}])
primary_keys = [key.name for key in inspect(target).primary_key]
update_dict = {c.name: c for c in insert_stmt.excluded if not c.primary_key}
stmt = insert_stmt.on_conflict_do_update(index_elements=primary_keys, set_=update_dict)
engine.execute(stmt)
Is the update_dict just looking at the EXCLUDED values (the ones I want to update with) that I set in my insert_stmt? If I str(update_dict) I get a dictionary of details about each column: {'firstname': Column('firstname', VARCHAR(length=100), table=<excluded>), 'lastname': Column('lastname', VARCHAR(length=100), table=<excluded>)}. Is the method above the only way to retrieve the data? Can you write it out manually?
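As for writing it out manually: set_ is just a dictionary mapping column names to SQL expressions, and insert_stmt.excluded exposes the conflicting row (PostgreSQL's EXCLUDED). A minimal sketch, assuming the test table above is bound to the variable target:

from sqlalchemy.dialects import postgresql

insert_stmt = postgresql.insert(target).values(
    [{"id": 1, "firstname": "John", "lastname": "Doe"}]
)

# set_ written out by hand: each key is a column of "test", each value is
# the corresponding EXCLUDED column (the row that hit the conflict)
stmt = insert_stmt.on_conflict_do_update(
    index_elements=["id"],
    set_={
        "firstname": insert_stmt.excluded.firstname,
        "lastname": insert_stmt.excluded.lastname,
    },
)

So the inspect/dict-comprehension version is only a convenience for building that dictionary from all non-primary-key columns; spelling it out by hand is equally valid.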
Consider the following database table:
ID ticker description
1 GDBR30 30YR
2 GDBR10 10YR
3 GDBR5 5YR
4 GDBR2 2YR
It can be replicated with this piece of code:
from sqlalchemy import (
    Column,
    Integer,
    MetaData,
    String,
    Table,
    create_engine,
    insert,
    select,
)

engine = create_engine("sqlite+pysqlite:///:memory:", echo=True, future=True)
metadata = MetaData()

# Creating the table
tickers = Table(
    "tickers",
    metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("ticker", String, nullable=False),
    Column("description", String(), nullable=False),
)
metadata.create_all(engine)

# Populating the table
with engine.connect() as conn:
    result = conn.execute(
        insert(tickers),
        [
            {"ticker": "GDBR30", "description": "30YR"},
            {"ticker": "GDBR10", "description": "10YR"},
            {"ticker": "GDBR5", "description": "5YR"},
            {"ticker": "GDBR2", "description": "2YR"},
        ],
    )
    conn.commit()
I need to filter tickers for some values:
search_list = ["GDBR10", "GDBR5", "GDBR30"]

records = conn.execute(
    select(tickers.c.description).where(tickers.c.ticker.in_(search_list))
)
print(records.fetchall())
# Result
# [('30YR',), ('10YR',), ('5YR',)]
However, I need the resulting list of tuples ordered in the way search_list has been ordered. That is, I need the following result:
print(records.fetchall())
# Expected result
# [('10YR',), ('5YR',), ('30YR',)]
Using SQLite, you could create a CTE with two columns (id and ticker). Applying the following code leads to the expected result (see Maintain order when using SQLite WHERE-clause and IN operator). Unfortunately, I am not able to transfer the SQLite solution to SQLAlchemy.
WITH cte(id, ticker) AS (VALUES (1, 'GDBR10'), (2, 'GDBR5'), (3, 'GDBR30'))
SELECT t.*
FROM tbl t INNER JOIN cte c
ON c.ticker = t.ticker
ORDER BY c.id
Suppose I have search_list_tuple as follows; how am I supposed to write the SQLAlchemy query?

search_list_tuple = [(1, 'GDBR10'), (2, 'GDBR5'), (3, 'GDBR30')]
The following works and is actually equivalent to the VALUES (...) on SQLite, albeit somewhat more verbose:
from sqlalchemy import literal, union_all

# construct the CTE
sub_queries = [
    select(literal(i).label("id"), literal(v).label("ticker"))
    for i, v in enumerate(search_list)
]
cte = union_all(*sub_queries).cte("cte")

# desired query
records = conn.execute(
    select(tickers.c.description)
    .join(cte, cte.c.ticker == tickers.c.ticker)
    .order_by(cte.c.id)
)
print(records.fetchall())
# [('10YR',), ('5YR',), ('30YR',)]
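If the pairs are already given as search_list_tuple, the same CTE can be built straight from it; a small variation on the sketch above:

# Same construction, taking the (id, ticker) pairs directly from the tuple list
sub_queries = [
    select(literal(i).label("id"), literal(v).label("ticker"))
    for i, v in search_list_tuple
]
cte = union_all(*sub_queries).cte("cte")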
Below uses the values() construct; unfortunately the resulting query fails on SQLite, but it works perfectly on PostgreSQL:
from sqlalchemy import column, values

cte = select(
    values(
        column("id", Integer), column("ticker", String), name="subq"
    ).data(list(zip(range(len(search_list)), search_list)))
).cte("cte")

qq = (
    select(tickers.c.description)
    .join(cte, cte.c.ticker == tickers.c.ticker)
    .order_by(cte.c.id)
)
records = conn.execute(qq)
print(records.fetchall())
I am studying SQLAlchemy and I cannot understand the reason for this error.
from sqlalchemy import create_engine, Table, Column, MetaData, INTEGER, String, ForeignKey
from sqlalchemy.sql import Select

engine = create_engine('sqlite:///:memory:')
with engine.connect() as conn:
    meta = MetaData(engine)
    cars = Table('Cars', meta,
        Column('Id', INTEGER, primary_key=True, autoincrement=True),
        Column('Name', String, nullable=False),
        Column('BrandId', INTEGER, ForeignKey('Brands.Id')))
    brands = Table('Brands', meta,
        Column('Id', INTEGER, primary_key=True, autoincrement=True),
        Column('Name', String))
    meta.create_all()

    cars_values = [{'Name': 'Escort', 'BrandId': 1}]
    brands_values = [{'Id': 1, 'Name': 'Ford'}]
    insert1 = brands.insert().values(brands_values)
    insert2 = cars.insert().values(cars_values)
    conn.execute(insert1)
    conn.execute(insert2)

    query = Select([cars]).join(brands, brands.c.Id == cars.c.BrandId)
    #query = 'select * from cars c JOIN brands b on b.id = c.brandid'
    result = conn.execute(query)
    print(result.fetchall())
When I run this way, I get an error
Select([cars]).join(brands, brands.c.Id == cars.c.BrandId)
sqlalchemy.exc.ObjectNotExecutableError: Not an executable object: <sqlalchemy.sql.selectable.Join at 0x1e62654ed88; Join object on Select object(2087997204936) and Brands(2087997203848)>
But if I run the raw SQL, the JOIN is accepted:
'select * from cars c JOIN brands b on b.id = c.brandid'
[(1, 'Escort', 1, 1, 'Ford')]
Calling .join() on the Select produces a Join construct, and a Join by itself is not an executable statement (hence the ObjectNotExecutableError). The join belongs inside select_from():

query = select(['*']).select_from(cars.join(brands, brands.c.Id == cars.c.BrandId))
# print(query)
# SELECT * FROM "Cars" JOIN "Brands" ON "Brands"."Id" = "Cars"."BrandId"

see Using Joins
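For completeness, the same join can also be written with explicit columns instead of '*'. A sketch in the newer select() style (SQLAlchemy 1.4+, where select() takes the columns positionally rather than in a list):

from sqlalchemy import select

query = (
    select(cars.c.Id, cars.c.Name, brands.c.Name)
    .select_from(cars.join(brands, brands.c.Id == cars.c.BrandId))
)
result = conn.execute(query)
print(result.fetchall())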
This is another way:

from sqlalchemy import text

query = text("select c.Id, c.Name, c.BrandId, b.Id, b.Name "
             "from Cars c left join Brands b on b.Id = c.BrandId")
result = engine.execute(query)
I can create a new table in a tablespace using a raw query:
engine = sqlalchemy.engine.create_engine(ENGINE_PATH_DDS, encoding="utf-8")
connection = engine.connect()
conn = engine.raw_connection()
cur = conn.cursor()
sql = """
CREATE TABLE {schema}.{table_name}
(
    ID NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    STATUS NVARCHAR2(32) NOT NULL,
    DESCR NVARCHAR2(256) DEFAULT NULL
)
TABLESPACE SOME_TABLESPACE
""".format(table_name=table_name, schema=SCHEMA)
cur.execute(sql)
The created table is accessible to all users with the right privileges. But how do I do the same thing without raw_connection? With this approach:
metadata = MetaData(engine)
table = Table('Example', metadata,
    Column('id', Integer, primary_key=True),
    Column('status', String))
metadata.create_all()
How do I specify the tablespace for the new table?
Such code works:

df = pd.DataFrame(columns=['id', 'status', 'descr'])  # pandas data frame
df.to_sql(con=connection, name=tbl, schema=SCHEMA, index=False, dtype=data_types, if_exists='replace')

but when trying to insert any data I get:
ORA-01950: no privileges on tablespace 'USERS'
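On the tablespace question: the Oracle dialect has no table-level tablespace keyword comparable to mysql_engine, so one workaround is to hook into DDL compilation and append the clause yourself. A sketch, assuming the tablespace name is stashed in Table.info (both the hook and the info key are choices of this example, not a built-in API):

from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.schema import CreateTable

@compiles(CreateTable, "oracle")
def _create_table_with_tablespace(element, compiler, **kw):
    # Render the regular CREATE TABLE, then append the TABLESPACE clause
    ddl = compiler.visit_create_table(element, **kw)
    tablespace = element.element.info.get("tablespace")
    if tablespace:
        ddl = ddl.rstrip() + "\nTABLESPACE %s\n" % tablespace
    return ddl

metadata = MetaData()
table = Table('Example', metadata,
    Column('id', Integer, primary_key=True),
    Column('status', String(32)),
    info={"tablespace": "SOME_TABLESPACE"})  # picked up by the hook above

metadata.create_all(engine)

The ORA-01950 error itself is a privilege problem rather than a SQLAlchemy one: the user has no quota on the USERS tablespace, which a DBA fixes with something like ALTER USER ... QUOTA UNLIMITED ON USERS.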
I'm having problems with SQLAlchemy's select_from statement when using the Core component. I'm trying to construct an outer join query, which currently looks like:
query = select([b1.c.id, b1.c.num, n1.c.name, n1.c.num, ...]
).where(and_(
    ... some conditions ...
)
).select_from(
    ???.outerjoin(
        n1,
        and_(
            ... some conditions ...
        )
    )
).select_from(... more outer joins similar to the above ...)
According to the docs, the structure should look like this:
table1 = table('t1', column('a'))
table2 = table('t2', column('b'))

s = select([table1.c.a]).\
    select_from(
        table1.join(table2, table1.c.a == table2.c.b)
    )
My problem is that I don't have a table1 object in this case, as the select ... part consists of columns and not a single table (see question marks in my query). I've tried using n1.outerjoin(n1..., but that caused an exception (Exception: (ProgrammingError) table name "n1" specified more than once).
The above snippet is derived from a working session-based (ORM) query, which I try to convert (with limited success).
b = Table('b', metadata,
    Column('id', Integer, Sequence('seq_b_id')),
    Column('num', Integer, nullable=False),
    Column('active', Boolean, default=False),
)
n = Table('n', metadata,
    Column('b_id', Integer, nullable=False),
    Column('num', Integer, nullable=False),
    Column('active', Boolean, default=False),
)
p = Table('p', metadata,
    Column('b_id', Integer, nullable=False),
    Column('num', Integer, nullable=False),
    Column('active', Boolean, default=False),
)

n1 = aliased(n, name='n1')
n2 = aliased(n, name='n2')
b1 = aliased(b, name='b1')
b2 = aliased(b, name='b2')
p1 = aliased(p, name='p1')
p2 = aliased(p, name='p2')
result = sess.query(b1.c.id, b1.c.num, n1.c.name, n1.c.num, p1.c.par, p1.c.num).filter(
    b1.c.active == False,
    b1.c.num == sess.query(func.max(b2.c.num)).filter(
        b2.c.id == b1.c.id
    )
).outerjoin(
    n1,
    and_(
        n1.c.b_id == b1.c.id,
        n1.c.num <= num,
        n1.c.active == False,
        n1.c.num == sess.query(func.max(n2.c.num)).filter(
            n2.c.id == n1.c.id
        )
    )
).outerjoin(
    p1,
    and_(
        p1.c.b_id == b1.c.id,
        p1.c.num <= num,
        p1.c.active == False,
        p1.c.num == sess.query(func.max(p2.c.num)).filter(
            p2.c.id == p1.c.id
        )
    )
).order_by(b1.c.id)
How do I go about converting this ORM query into a plain Core query?
Update:
I was able to narrow down the problem. It seems that a combination of two select_from calls causes the problem.
customer = Table('customer', metadata,
    Column('id', Integer),
    Column('name', String(50)),
)
order = Table('order', metadata,
    Column('id', Integer),
    Column('customer_id', Integer),
    Column('order_num', Integer),
)
address = Table('address', metadata,
    Column('id', Integer),
    Column('customer_id', Integer),
    Column('city', String(50)),
)
metadata.create_all(db)

customer1 = aliased(customer, name='customer1')
order1 = aliased(order, name='order1')
address1 = aliased(address, name='address1')

columns = [
    customer1.c.id, customer.c.name,
    order1.c.id, order1.c.order_num,
    address1.c.id, address1.c.city
]
query = select(columns)
query = query.select_from(
    customer1.outerjoin(
        order1,
        and_(
            order1.c.customer_id == customer1.c.id,
        )
    )
)
query = query.select_from(
    customer1.outerjoin(
        address1,
        and_(
            customer1.c.id == address1.c.customer_id
        )
    )
)
result = connection.execute(query)
for r in result.fetchall():
    print(r)
The above code causes the following exception:
ProgrammingError: (ProgrammingError) table name "customer1" specified more than once
'SELECT customer1.id, customer.name, order1.id, order1.order_num, address1.id, address1.city \nFROM customer, customer AS customer1 LEFT OUTER JOIN "order" AS order1 ON order1.customer_id = customer1.id, customer AS customer1 LEFT OUTER JOIN address AS address1 ON customer1.id = address1.customer_id' {}
If I was a bit more experienced in using SQLAlchemy, I would say this could be a bug...
I finally managed to solve the problem. Instead of cascading select_from calls, additional joins need to be chained onto the actual join; each select_from call adds another copy of customer1 to the FROM list (visible in the generated SQL above), which is why the alias is reported more than once. The above query would read:
query = select(columns)
query = query.select_from(
    customer1.outerjoin(
        order1,
        and_(
            order1.c.customer_id == customer1.c.id,
        )
    ).outerjoin(
        address1,
        and_(
            customer1.c.id == address1.c.customer_id
        )
    )
)