I use the following SQLAlchemy code to retrieve some data from a database:
q = session.query(hd_tbl).\
    join(dt_tbl, hd_tbl.c['data_type'] == dt_tbl.c['ID']).\
    filter(or_(and_(hd_tbl.c['object_id'] == get_id(row['object']),
                    hd_tbl.c['data_type'] == get_id(row['type']),
                    hd_tbl.c['data_provider'] == get_id(row['provider']),
                    hd_tbl.c['data_account'] == get_id(row['account']))
               for index, row in data.iterrows())).\
    with_entities(hd_tbl.c['ID'], hd_tbl.c['object_id'],
                  hd_tbl.c['data_type'], hd_tbl.c['data_provider'],
                  hd_tbl.c['data_account'], dt_tbl.c['value_type'])
where hd_tbl and dt_tbl are two tables in the SQL database, and data is a pandas DataFrame typically containing around 1k-9k entries. hd_tbl currently contains around 90k rows.
The execution time seems to grow exponentially with the length of data. The corresponding SQL statement (generated by SQLAlchemy) looks as follows:
SELECT data_header.`ID`, data_header.object_id, data_header.data_type, data_header.data_provider, data_header.data_account, basedata_data_type.value_type
FROM data_header INNER JOIN basedata_data_type ON data_header.data_type = basedata_data_type.`ID`
WHERE data_header.object_id = %s AND data_header.data_type = %s AND data_header.data_provider = %s AND data_header.data_account = %s OR
data_header.object_id = %s AND data_header.data_type = %s AND data_header.data_provider = %s AND data_header.data_account = %s OR
...
data_header.object_id = %s AND data_header.data_type = %s AND data_header.data_provider = %s AND data_header.data_account = %s OR
The tables and columns are fully indexed, and performance is still not satisfying. Currently it is way faster to read all the data of hd_tbl and dt_tbl into memory and merge them with the pandas merge function. However, this seems suboptimal. Does anyone have an idea how to improve the SQLAlchemy call?
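For reference, a minimal sketch of the in-memory workaround mentioned above, assuming an existing SQLAlchemy Engine named engine and the table names from the generated SQL; the final filtering against data (after the get_id mapping) would likewise be a pandas merge:

import pandas as pd

# Read both tables into memory (assumes an Engine named `engine`).
hd_df = pd.read_sql_table("data_header", engine)
dt_df = pd.read_sql_table("basedata_data_type", engine)

# Equivalent of the INNER JOIN in the generated SQL above.
merged = hd_df.merge(dt_df[["ID", "value_type"]],
                     left_on="data_type", right_on="ID",
                     suffixes=("", "_dt"))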
EDIT:
I was able to improve performance significantly by using SQLAlchemy's tuple_ in the following way:
header_tuples = [tuple([int(y) for y in tuple(x)]) for x in data_as_int.values]

q = session.query(hd_tbl). \
    join(dt_tbl, hd_tbl.c['data_type'] == dt_tbl.c['ID']). \
    filter(tuple_(hd_tbl.c['object_id'], hd_tbl.c['data_type'],
                  hd_tbl.c['data_provider'],
                  hd_tbl.c['data_account']).in_(header_tuples)). \
    with_entities(hd_tbl.c['ID'], hd_tbl.c['object_id'],
                  hd_tbl.c['data_type'], hd_tbl.c['data_provider'],
                  hd_tbl.c['data_account'], dt_tbl.c['value_type'])
with the corresponding query:
SELECT data_header.`ID`, data_header.object_id, data_header.data_type, data_header.data_provider, data_header.data_account, basedata_data_type.value_type
FROM data_header INNER JOIN basedata_data_type ON data_header.data_type = basedata_data_type.`ID`
WHERE (data_header.object_id, data_header.data_type, data_header.data_provider, data_header.data_account) IN ((%(param_1)s, %(param_2)s, %(param_3)s, %(param_4)s), (%(param_5)s, ...))
I'd recommend creating a composite index on the fields object_id, data_type, data_provider, ... in the same order in which they are placed in the table, and making sure they follow the same order in your WHERE condition. It may speed up your queries a bit at the cost of disk space.
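A minimal sketch of creating such an index with SQLAlchemy Core, assuming the four columns used in the question's filter and an existing Engine named engine (the index name is illustrative only):

from sqlalchemy import Index

# Composite index over the columns used in the WHERE condition,
# in the same order as they appear in the filter.
composite_idx = Index(
    "ix_data_header_lookup",          # illustrative name
    hd_tbl.c["object_id"],
    hd_tbl.c["data_type"],
    hd_tbl.c["data_provider"],
    hd_tbl.c["data_account"],
)
composite_idx.create(bind=engine)     # assumes an existing Engine `engine`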
You could also issue several consecutive small SQL queries instead of one large query with a complex OR condition, and accumulate the extracted data on the application side or, if the amount is large enough, in fast temporary storage (a temporary table, a NoSQL store, etc.).
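A minimal sketch of that chunking idea, reusing the tuple_ query from the edit above (the chunk size is arbitrary, and accumulation happens on the application side):

def chunks(seq, size=500):
    """Yield consecutive slices of `seq` of at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

rows = []
for chunk in chunks(header_tuples):
    q = (session.query(hd_tbl)
         .join(dt_tbl, hd_tbl.c['data_type'] == dt_tbl.c['ID'])
         .filter(tuple_(hd_tbl.c['object_id'], hd_tbl.c['data_type'],
                        hd_tbl.c['data_provider'],
                        hd_tbl.c['data_account']).in_(chunk))
         .with_entities(hd_tbl.c['ID'], hd_tbl.c['object_id'],
                        hd_tbl.c['data_type'], hd_tbl.c['data_provider'],
                        hd_tbl.c['data_account'], dt_tbl.c['value_type']))
    rows.extend(q.all())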
In addition, you can check the MySQL configuration and increase the values related to memory per thread, per request, etc. It is also a good idea to check whether your composite index fits into the available memory; if it does not, it will be of little use.
I suspect DB tuning may help a lot here. Otherwise you may want to analyze your application's architecture to get more significant results.
Related
I am using the peewee ORM for a Python application and I am trying to write code to fetch batches of records from a SQLite database. I have a subquery that seems to work by itself, but when it is added to an update query the fn.EXISTS(sub_query) clause seems to have no effect, as every record in the database is updated.
Note: I am using the APSW extension for peewee.
def batch_logic(self, id_1, path_1, batch_size=1000, **kwargs):
    sub_query = (self.select(ModelClass.granule_id).distinct().where(
        (ModelClass.status == 'old_status') &
        (ModelClass.collection_id == id_1) &
        (ModelClass.name.contains(path_1))
    ).order_by(ModelClass.discovered_date.asc()).limit(batch_size))
    print(f'len(sub_query): {len(sub_query)}')

    fb_st_2 = time.time()
    updated_records = list(
        self.update(status='new_status').where(fn.EXISTS(sub_query)).returning(ModelClass)
    )
    print(f'update {len(updated_records)}: {time.time() - fb_st_2}')
    db.close()
    return updated_records
Below is output from testing locally:
id_1: id_1_1676475997_PQXYEQGJWR
len(sub_query): 2
update 20000: 1.0583274364471436
fetch_batch 20000: 1.1167597770690918
count_things 0: 0.02147078514099121
processed_things: 20000
The subquery correctly returns 2 rows, but the where(fn.EXISTS(sub_query)) on the update query seems to be ignored. Have I made a mistake in my understanding of how this works?
Edit 1: I believe GROUP BY is needed, as rows can have the same granule_id and I need to fetch rows for up to batch_size granule_ids.
I think your use of UPDATE...WHERE EXISTS is incorrect or inappropriate here. This may work better for you:
# Unsure why you have a GROUP BY with no aggregation, that seems
# incorrect possibly, so I've removed it.
sub_query = (self.select(ModelClass.id)
             .where(
                 (ModelClass.status == 'old_status') &
                 (ModelClass.collection_id == id_1) &
                 (ModelClass.name.contains(path_1)))
             .order_by(ModelClass.discovered_date.asc())
             .limit(batch_size))

update = (self.update(status='new_status')
          .where(self.id.in_(sub_query))
          .returning(ModelClass))

cursor = update.execute()  # It's good to explicitly execute().
updated_records = list(cursor)
The key idea, at any rate, is that I'm correlating the update with the subquery.
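As a usage note, a hedged sketch of driving this in batches until the subquery is exhausted; the caller and handle_batch below are hypothetical, and batch_logic is assumed to be exposed on ModelClass as in the question:

# Hypothetical driver loop: keep claiming batches of rows until none are left.
while True:
    batch = ModelClass.batch_logic(id_1, path_1, batch_size=1000)
    if not batch:
        break
    handle_batch(batch)  # hypothetical downstream processing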
For a university project I am using Neo4j together with Flask and py2neo for a shift-scheduling algorithm. On saving the scheduled shifts to Neo4j I realized that relationships go missing: of 330, only 91 get inserted.
When printing them before/after inserting, they are all in the list to be inserted, and I also moved the transaction around to check whether this changes the result.
I have the following structure:
(w:Worker)-[r:works_during]->(s:Shift)
with r.day, r.month, r.year set as properties on the relationship, and multiple relationships between each worker and each shift, which can then be filtered via those properties.
My code looks like the following:
header = df.columns.tolist()
header.remove("index")
header.remove("worker")
tuplelist = []
for index, row in df.iterrows():
    for i in header:
        worker = self.driver.nodes.match("Worker", id=int(row["worker"])).first()
        if row[i] == 1:
            # Shifts are in the format {day}_{shift_of_day}
            shift_id = str(i).split("_")[1]
            shift_day = str(i).split("_")[0]
            shift = self.driver.nodes.match("Shift", id=int(shift_id)).first()
            rel = Relationship(worker, "works_during", shift)
            rel["day"] = int(shift_day)
            rel["month"] = int(month)
            rel["year"] = int(year)
            tuplelist.append(rel)
print(len(tuplelist))

for i in tuplelist:
    connection = self.driver.begin()
    connection.create(i)
    connection.commit()
Is there any special behaviour in py2neo which I need to be aware of that could cause this issue?
py2neo allows just one relationship of a given type between node A and node B.
If multiple relationships of the same type (even with different attributes) are needed, you have to use plain Cypher queries, as py2neo will merge these edges into a single edge.
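A minimal sketch of the Cypher route, assuming self.driver is a py2neo Graph and reusing the relationship objects built in the question's loop (the parameter names are illustrative):

# Plain Cypher CREATE adds a new relationship every time, so several
# works_during edges with different day/month/year properties can coexist.
cypher = """
MATCH (w:Worker {id: $worker_id}), (s:Shift {id: $shift_id})
CREATE (w)-[:works_during {day: $day, month: $month, year: $year}]->(s)
"""

tx = self.driver.begin()
for rel in tuplelist:  # assumes the list holds py2neo Relationship objects as above
    tx.run(cypher,
           worker_id=rel.start_node["id"],
           shift_id=rel.end_node["id"],
           day=rel["day"], month=rel["month"], year=rel["year"])
tx.commit()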
I need to store a defaultdict containing ~20M entries in a database. The dictionary maps a string to a string, so the table has two columns and no primary key, because it's constructed later.
Things I've tried:
executemany, passing in the set of keys and values in the dictionary. Works well when the number of values is below ~1M.
Executing single statements. Works, but slow.
Using transactions:
con = sqlutils.getconnection()
cur = con.cursor()
print len(self.table)
cur.execute("SET FOREIGN_KEY_CHECKS = 0;")
cur.execute("SET UNIQUE_CHECKS = 0;")
cur.execute("SET AUTOCOMMIT = 0;")
i = 0
for k in self.table:
    cur.execute("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s);", (k, str(self.hashtable[k])))
    i += 1
    if i % 10000 == 0:
        print i
#cur.executemany("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s)", [(k, str(self.table[k])) for k in self.table])
cur.execute("SET UNIQUE_CHECKS = 1;")
cur.execute("SET FOREIGN_KEY_CHECKS = 1;")
cur.execute("COMMIT")
con.commit()
cur.close()
con.close()
print "Finished", self.sequence, "in %.3f sec" % (time.time() - t)
This is a recent conversion from SQLite to MySQL. Oddly enough, I'm getting much better performance when I use SQLite (30s to insert 3M rows in SQLite, 480s in MySQL). Unfortunately, MySQL is a necessity because the project will be scaled up in the future.
Edit
Using LOAD DATA INFILE works like a charm. Thanks to all who helped! Inserting 3.2M rows takes me ~25s.
MySQL can insert multiple rows with one query: INSERT INTO table (key1, key2) VALUES ("value_key1", "value_key2"), ("another_value_key1", "another_value_key2"), ("and_again", "and_again...");
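A rough sketch of that idea from Python, chunking the dictionary so each executemany() call stays well below the ~1M-value region that caused trouble (with MySQLdb/pymysql, executemany() on an INSERT ... VALUES statement is sent as a multi-row insert); names follow the question's code:

rows = [(k, str(self.table[k])) for k in self.table]
sql = ("INSERT INTO " + sqlutils.gettablename(self.sequence) +
       " (`key`, matches) VALUES (%s, %s)")

chunk_size = 10000  # arbitrary; keeps each statement a manageable size
for start in range(0, len(rows), chunk_size):
    cur.executemany(sql, rows[start:start + chunk_size])
con.commit()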
Also, you could write your data to a file and use MySQL's LOAD DATA, which is designed to insert with "very high speed" (according to MySQL).
I don't know whether "write a file" + "MySQL LOAD DATA" will be faster than inserting multiple values in one query (or in several queries, if MySQL has a limit on it).
It depends on your hardware (writing a file is "fast" with an SSD), your file-system configuration, the MySQL configuration, etc. So you have to test in an environment like your production one to see which solution is fastest for you.
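A hedged sketch of the LOAD DATA route, reusing the cursor/connection names from the question; it assumes the MySQL server and client both allow LOCAL INFILE and that keys and values contain no tabs or newlines:

import tempfile

# Dump the mapping to a tab-separated file...
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    for k in self.table:
        f.write("%s\t%s\n" % (k, str(self.table[k])))
    path = f.name

# ...and bulk-load it in one statement.
cur.execute(
    "LOAD DATA LOCAL INFILE %s INTO TABLE " + sqlutils.gettablename(self.sequence) +
    " FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' (`key`, matches)",
    (path,),
)
con.commit()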
Instead of inserting directly, generate a SQL file (using extended inserts etc.) and then feed it to MySQL; this will save you quite a lot of overhead.
NB: you'll still save some execution time if you avoid recomputing constant values in your loop, i.e.:
for k in self.table:
    xxx = sqlutils.gettablename(self.sequence)
    do_something_with(xxx, k)

=>

xxx = sqlutils.gettablename(self.sequence)
for k in self.table:
    do_something_with(xxx, k)
I have this sqlalchemy query:
query = session.query(Store).options(joinedload('salesmen').
                                     joinedload('comissions').
                                     joinedload('orders')).\
    filter(Store.store_code.in_(selected_stores))
stores = query.all()
for store in stores:
    for salesman in store.salesmen:
        for comission in salesman.comissions:
            # generate html for comissions for each salesman in each store
            # print html document using PySide
This was working perfectly; however, I added two new filters:
filter(Comissions.payment_status == 0).\
filter(Order.order_date <= self.dateEdit.date().toPython())
If I add just the first filter, the application hangs for a couple of seconds; if I add both, the application hangs indefinitely.
What am I doing wrong here? How do I make this query fast?
Thank you for your help
EDIT: This is the SQL generated. Unfortunately the class and variable names are in Portuguese; I translated them to English above so it would be easier to understand:
Loja = Store, Vendedores = Salesmen, Pedidos = Orders, Comissao = Comission
Query generated:
SELECT "Loja"."CodLoja", "Vendedores_1"."CodVendedor", "Vendedores_1"."NomeVendedor", "Vendedores_1"."CodLoja", "Vendedores_1"."PercentualComissao",
"Vendedores_1"."Ativo", "Comissao_1"."CodComissao", "Comissao_1"."CodVendedor", "Comissao_1"."CodPedido",
"Pedidos_1"."CodPedido", "Pedidos_1"."CodLoja", "Pedidos_1"."CodCliente", "Pedidos_1"."NomeCliente", "Pedidos_1"."EnderecoCliente", "Pedidos_1"."BairroCliente",
"Pedidos_1"."CidadeCliente", "Pedidos_1"."UFCliente", "Pedidos_1"."CEPCliente", "Pedidos_1"."FoneCliente", "Pedidos_1"."Fone2Cliente", "Pedidos_1"."PontoReferenciaCliente",
"Pedidos_1"."DataPedido", "Pedidos_1"."ValorProdutos", "Pedidos_1"."ValorCreditoTroca",
"Pedidos_1"."ValorTotalDoPedido", "Pedidos_1"."Situacao", "Pedidos_1"."Vendeu_Teflon", "Pedidos_1"."ValorTotalTeflon",
"Pedidos_1"."DataVenda", "Pedidos_1"."CodVendedor", "Pedidos_1"."TipoVenda", "Comissao_1"."Valor", "Comissao_1"."DataPagamento", "Comissao_1"."StatusPagamento"
FROM "Comissao", "Pedidos", "Loja" LEFT OUTER JOIN "Vendedores" AS "Vendedores_1" ON "Loja"."CodLoja" = "Vendedores_1"."CodLoja"
LEFT OUTER JOIN "Comissao" AS "Comissao_1" ON "Vendedores_1"."CodVendedor" = "Comissao_1"."CodVendedor" LEFT OUTER JOIN "Pedidos" AS "Pedidos_1" ON "Pedidos_1"."CodPedido" = "Comissao_1"."CodPedido"
WHERE "Loja"."CodLoja" IN (:CodLoja_1) AND "Comissao"."StatusPagamento" = :StatusPagamento_1 AND "Pedidos"."DataPedido" <= :DataPedido_1
Your FROM clause is producing a Cartesian product and includes each table twice, once for filtering the result and once for eagerly loading the relationship.
To stop this, use contains_eager instead of joinedload in your options. This makes SQLAlchemy look for the related attributes in the query's columns instead of constructing an extra join. You will also need to join explicitly to the other tables in your query, e.g.:
query = session.query(Store)\
    .join(Store.salesmen)\
    .join(Store.comissions)\
    .join(Store.orders)\
    .options(contains_eager('salesmen'),
             contains_eager('comissions'),
             contains_eager('orders'))\
    .filter(Store.store_code.in_(selected_stores))\
    .filter(Comissions.payment_status == 0)\
    .filter(Order.order_date <= self.dateEdit.date().toPython())
Here is my query, written with Python and SQLAlchemy, but I don't think this is SQLAlchemy being slow, just me not knowing how to write fast queries. The query takes about 8 seconds and returns 45,000 results.
games = s.query(Box_Score, Game).join(Game, Box_Score.espn_game_id==Game.espn_game_id)\
    .filter(Game.total!=999)\
    .filter(Game.a_line!=999)\
    .order_by(Box_Score.date.desc()).all()
This is the query in regular SQL:
SELECT box_scores.date AS box_scores_date, box_scores.id AS box_scores_id, box_scores.player_name AS box_scores_player_name, box_scores.team_name AS box_scores_team_name, box_scores.espn_player_id AS box_scores_espn_player_id, box_scores.espn_game_id AS box_scores_espn_game_id, box_scores.pass_attempt AS box_scores_pass_attempt, box_scores.pass_made AS box_scores_pass_made, box_scores.pass_yards AS box_scores_pass_yards, box_scores.pass_td AS box_scores_pass_td, box_scores.pass_int AS box_scores_pass_int, box_scores.pass_longest AS box_scores_pass_longest, box_scores.run_carry AS box_scores_run_carry, box_scores.run_yards AS box_scores_run_yards, box_scores.run_td AS box_scores_run_td, box_scores.run_longest AS box_scores_run_longest, box_scores.reception AS box_scores_reception, box_scores.reception_yards AS box_scores_reception_yards, box_scores.reception_td AS box_scores_reception_td, box_scores.reception_longest AS box_scores_reception_longest, box_scores.interception_lost AS box_scores_interception_lost, box_scores.interception_won AS box_scores_interception_won, box_scores.fg_attempt AS box_scores_fg_attempt, box_scores.fg_made AS box_scores_fg_made, box_scores.fg_longest AS box_scores_fg_longest, box_scores.punt AS box_scores_punt, box_scores.first_down AS box_scores_first_down, box_scores.penalty AS box_scores_penalty, box_scores.penalty_yards AS box_scores_penalty_yards, box_scores.fumbles AS box_scores_fumbles, box_scores.possession AS box_scores_possession, games.id AS games_id, games.espn_game_id AS games_espn_game_id, games.date AS games_date, games.status AS games_status, games.time AS games_time, games.season AS games_season, games.h_name AS games_h_name, games.a_name AS games_a_name, games.league AS games_league, games.h_q1 AS games_h_q1, games.h_q2 AS games_h_q2, games.h_q3 AS games_h_q3, games.h_q4 AS games_h_q4, games.h_ot AS games_h_ot, games.h_score AS games_h_score, games.a_q1 AS games_a_q1, games.a_q2 AS games_a_q2, games.a_q3 AS games_a_q3, games.a_q4 AS games_a_q4, games.a_ot AS games_a_ot, games.a_score AS games_a_score, games.possession_h2 AS games_possession_h2, games.d_yards_h1 AS games_d_yards_h1, games.f_yards_h1 AS games_f_yards_h1, games.h_ml AS games_h_ml, games.a_ml AS games_a_ml, games.h_h1_ml AS games_h_h1_ml, games.a_h1_ml AS games_a_h1_ml, games.h_q1_ml AS games_h_q1_ml, games.a_q1_ml AS games_a_q1_ml, games.h_h2_ml AS games_h_h2_ml, games.a_h2_ml AS games_a_h2_ml, games.h_line AS games_h_line, games.h_price AS games_h_price, games.a_line AS games_a_line, games.a_price AS games_a_price, games.h_open_line AS games_h_open_line, games.h_open_price AS games_h_open_price, games.a_open_line AS games_a_open_line, games.a_open_price AS games_a_open_price, games.h_h1_line AS games_h_h1_line, games.h_h1_price AS games_h_h1_price, games.a_h1_line AS games_a_h1_line, games.a_h1_price AS games_a_h1_price, games.h_q1_line AS games_h_q1_line, games.h_q1_price AS games_h_q1_price, games.a_q1_line AS games_a_q1_line, games.a_q1_price AS games_a_q1_price, games.h_h2_line AS games_h_h2_line, games.h_h2_price AS games_h_h2_price, games.a_h2_line AS games_a_h2_line, games.a_h2_price AS games_a_h2_price, games.total AS games_total, games.o_price AS games_o_price, games.u_price AS games_u_price, games.total_h1 AS games_total_h1, games.o_h1_price AS games_o_h1_price, games.u_h1_price AS games_u_h1_price, games.total_q1 AS games_total_q1, games.o_q1_price AS games_o_q1_price, games.u_q1_price AS games_u_q1_price, games.total_h2 AS games_total_h2, games.o_h2_price AS 
games_o_h2_price, games.u_h2_price AS games_u_h2_price
FROM box_scores JOIN games ON box_scores.espn_game_id = games.espn_game_id
WHERE games.total != :total_1 AND games.a_line != :a_line_1 ORDER BY box_scores.date DESC
For comparison, this simpler query takes over 3 seconds and returns 55,000 results:
box_scores = s.query(Box_Score).all()
I must be doing something wrong. I know people regularly use databases with millions of entries, so I don't get why selecting 50,000 rows should be a big deal. I also tried joining on Box_Score as opposed to Game, and taking out the order_by() part, and neither sped up performance.
UPDATE: I am trying to learn what fragmentation is in order to answer the question below. I don't understand it yet, but I did run PRAGMA page_count -> 64,785, and that doesn't seem like a big number. I also ran sqlite3 nfl.db "VACUUM"; and then ran the query again, and there was no performance improvement.