I have a table in SQL Server:
id count
1 1
2 1
3 1
4 1
5 1
I have another table in a pandas DataFrame (df), with updated counts:
id count
1 1
2 1
3 2
4 3
5 4
I want to apply these changes to my database using an UPDATE query, and I am thinking of defining a function to do this.
I am using pypyodbc for my connection:
conn = pypyodbc.connect("Driver={SQL Server};Server=<YourServer>;Database=<YourDatabase>;uid=<YourUserName>;pwd=<YourPassword>")
I tried using
for row in df.iterrows():
    updateQuery = "update "+db_table+" set count="+str(row[1][1])+" where id= '"+str(row[1][0])+"'"
    cursor.execute(updateQuery)
    conn.commit()
But is there any better way of doing this?
What you are attempting in your question is an iterative update, looping through the rows one by one. SQL databases are very inefficient at this kind of operation, but very efficient at set-based updates.
In practice, this means writing a single statement that applies to the whole table and running it once.
You can demonstrate this with table variables in a query window in SQL Server Management Studio (SSMS):
First, create your test data:
declare @t table(id int, [count] int);
insert into @t values(1,1),(2,1),(3,1),(4,1),(5,1);
declare @p table(id int, [count] int);
insert into @p values(1,1),(2,1),(3,2),(4,3),(5,4);
The first two select statements show you what the data looks like before your update:
select *
from @t;
select *
from @p;
To update all the rows in your SQL table, join it to the data that was in your pandas table, and then select the data again so you can see the change the update made:
update t
set [count] = p.[count]
from @t as t
join @p as p
on(t.id = p.id);
select *
from @t;
select *
from @p;
I am not very familiar with pandas DataFrames, so I do not know exactly how you would access and query this data. If you are working with large datasets, I would recommend importing the pandas data as-is into a staging table in your SQL Server database and running the above type of query within SQL Server.
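As a rough sketch of the staging-table idea from the Python side: the example below uses the standard-library sqlite3 module purely as a stand-in so that it runs anywhere, and the names target, staging and new_rows are made up for illustration. With pypyodbc against SQL Server the flow would be the same (bulk-load, then one UPDATE), but the UPDATE itself would be the join form shown above. The rows in new_rows are assumed to come from the DataFrame, for example via list(df.itertuples(index=False)).

```python
import sqlite3

# sqlite3 is only a stand-in here; the original question uses SQL Server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Target table holding the old counts (same data as in the question).
cur.execute('CREATE TABLE target (id INTEGER PRIMARY KEY, "count" INTEGER)')
cur.executemany("INSERT INTO target VALUES (?, ?)", [(i, 1) for i in range(1, 6)])

# Hypothetical rows taken from the DataFrame as (id, count) pairs.
new_rows = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 4)]

# Step 1: bulk-load the new values into a staging table.
cur.execute('CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, "count" INTEGER)')
cur.executemany("INSERT INTO staging VALUES (?, ?)", new_rows)

# Step 2: one set-based UPDATE instead of one UPDATE per row.
cur.execute(
    """
    UPDATE target
    SET "count" = (SELECT s."count" FROM staging AS s WHERE s.id = target.id)
    WHERE id IN (SELECT id FROM staging)
    """
)
conn.commit()

updated = cur.execute('SELECT id, "count" FROM target ORDER BY id').fetchall()
print(updated)
```

The values are passed as bound parameters (the ? placeholders) rather than concatenated into the SQL string, which also avoids the quoting problems of the loop in the question.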
With a SQL database (in my case Sqlite, using Python), what is a standard way to have a column which is a set of elements?
id name items_set
1 Foo apples,oranges,tomatoes,ananas
2 Bar tomatoes,bananas
...
A simple implementation is using
CREATE TABLE data(id int, name text, items_set text);
but there are a few drawbacks:
to query all rows that have ananas, we have to use items_set LIKE '%ananas%' and some tricks with separators to avoid querying "ananas" to also return rows with "bananas", etc.
when we insert a new item in one row, we have to load the whole items_set, and see if the item is already in the list or not, before concatenating ,newitem at the end.
etc.
There is surely something better. What is the standard SQL solution for a column which is a list or set?
Note: I don't know in advance all the possible values for the set/list.
I can see a solution with a few additional tables, but in my tests, it multiplies the size on disk by a factor x2 or x3, which is a problem with many gigabytes of data.
Is there a better solution?
To have a well-structured SQL database, you should extract the items into their own table and use a join table between the main table and the items table.
I'm not familiar with the SQLite syntax, but you should be able to create the tables with
CREATE TABLE entities(id INTEGER PRIMARY KEY, name text);
CREATE TABLE entity_items(entity_id int, item_id int);
CREATE TABLE items(id INTEGER PRIMARY KEY, name text);
Add data:
INSERT INTO entities (name) VALUES ('Foo'), ('Bar');
INSERT INTO items (name) VALUES ('tomatoes'), ('ananas'), ('bananas');
INSERT INTO entity_items (entity_id, item_id) VALUES (
(SELECT id from entities WHERE name='Foo'),
(SELECT id from items WHERE name='bananas')
);
Query data:
SELECT * FROM entities
LEFT JOIN entity_items
ON entities.id = entity_items.entity_id
LEFT JOIN items
ON items.id = entity_items.item_id
WHERE items.name = 'bananas';
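For what it's worth, here is a minimal sketch of that three-table layout driven from Python's built-in sqlite3 module. The UNIQUE constraints, the composite primary key on entity_items and the helper entities_with are my own additions for illustration, not part of the schema above; the sample data is the one from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# INTEGER PRIMARY KEY lets SQLite assign the ids automatically.
cur.executescript(
    """
    CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE entity_items (
        entity_id INTEGER REFERENCES entities(id),
        item_id INTEGER REFERENCES items(id),
        PRIMARY KEY (entity_id, item_id)
    );
    """
)

# Sample data from the question.
data = {
    "Foo": ["apples", "oranges", "tomatoes", "ananas"],
    "Bar": ["tomatoes", "bananas"],
}

for entity, item_names in data.items():
    cur.execute("INSERT OR IGNORE INTO entities (name) VALUES (?)", (entity,))
    for item in item_names:
        # OR IGNORE makes adding an already-known item (or pair) a no-op.
        cur.execute("INSERT OR IGNORE INTO items (name) VALUES (?)", (item,))
        cur.execute(
            """
            INSERT OR IGNORE INTO entity_items (entity_id, item_id)
            SELECT e.id, i.id FROM entities AS e, items AS i
            WHERE e.name = ? AND i.name = ?
            """,
            (entity, item),
        )
conn.commit()


def entities_with(item_name):
    """Return the names of all entities whose set contains item_name."""
    rows = cur.execute(
        """
        SELECT e.name FROM entities AS e
        JOIN entity_items AS ei ON ei.entity_id = e.id
        JOIN items AS i ON i.id = ei.item_id
        WHERE i.name = ?
        ORDER BY e.name
        """,
        (item_name,),
    )
    return [name for (name,) in rows]


print(entities_with("ananas"))
```

Because the match is an exact comparison on items.name, looking up "ananas" does not also return rows containing "bananas", which was the problem with LIKE in the question.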
You probably have two options. The standard, more conventional approach is a many-to-many relationship. For example, you would have three tables: Employees, Projects, and ProjectEmployees. The last one describes the many-to-many relationship (each employee can work on multiple projects, and each project has a team).
Storing a set in a single value denormalizes the table and will complicate things either way. But if you must, use the JSON format and the JSON functionality provided by SQLite. If your SQLite version is not recent, it may not have the JSON extension built in. You would need to either update it (the best option) or load the JSON extension dynamically. I am not sure whether you can do that with the SQLite copy supplied with Python.
To elaborate on what @ussu said, ideally your table would have one row per thing & item pair, using IDs instead of names:
id thing_id item_id
1 1 1
2 1 2
3 1 3
4 1 4
5 2 3
6 2 4
Then look-up tables for the thing and item names:
id name
1 Foo
2 Bar
id name
1 apples
2 oranges
3 tomatoes
4 bananas
In MySQL, you have the SET type.
Creation:
CREATE TABLE myset (col SET('a', 'b', 'c', 'd'));
Select:
mysql> SELECT * FROM tbl_name WHERE FIND_IN_SET('value',set_col)>0;
mysql> SELECT * FROM tbl_name WHERE set_col LIKE '%value%';
Insertion:
INSERT INTO myset (col) VALUES ('a,d'), ('d,a'), ('a,d,a'), ('a,d,d'), ('d,a,d');
I am writing large amounts of data to a sqlite database. I am using a temporary dataframe to find unique values.
This sql code takes forever in conn.execute(sql)
if upload_to_db == True:
    print(f'########################################WRITING TO TEMP TABLE: {symbol} #######################################################################')
    master_df.to_sql(name='tempTable', con=engine, if_exists='replace')
    with engine.begin() as cn:
        sql = """INSERT INTO instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
                 SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
                 FROM tempTable t
                 WHERE NOT EXISTS
                     (SELECT 1 FROM instrumentsHistory f
                      WHERE t.datetime = f.datetime
                      AND t.instrumentSymbol = f.instrumentSymbol
                      AND t.observation = f.observation
                      AND t.observationColName = f.observationColName)"""
        print(f'##############################################WRITING TO FINAL TABLE: {symbol} #################################################################')
        cn.execute(sql)
Running this takes forever to write to the database. Can someone help me understand how to speed it up?
Edit 1:
How many rows roughly? About 15,000 at a time. Basically it pulls data into a pandas dataframe, makes some transformations and then writes it to a sqlite database. There are probably 600 different instruments, each with about 15,000 rows, so roughly 9M rows ultimately, give or take a million.
Depending on your SQL database, you could try using something like INSERT IGNORE INTO (MySQL) or MERGE (e.g. on Oracle), which would do the insert only if it would not violate a primary key or unique constraint. This assumes that such a constraint exists on the 4 columns you are checking.
In the absence of MERGE, you could try adding the following index to the instrumentsHistory table:
CREATE INDEX idx ON instrumentsHistory (datetime, instrumentSymbol, observation,
observationColName);
This index would allow for rapid lookup of each incoming record, coming from the tempTable, and so might speed up the insert process.
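To make that concrete, here is a small self-contained sketch using Python's built-in sqlite3 module. The column types and the two sample rows are invented for the example; the index and the INSERT ... WHERE NOT EXISTS statement are the ones discussed here. It only shows that the statement still behaves the same with the index in place; the actual speed-up would have to be measured on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Column types are an assumption; the question does not show the schema.
cols = "datetime TEXT, instrumentSymbol TEXT, observation REAL, observationColName TEXT"
cur.execute(f"CREATE TABLE instrumentsHistory ({cols})")
cur.execute(f"CREATE TABLE tempTable ({cols})")

# Composite index over the four columns compared in the NOT EXISTS subquery.
cur.execute(
    "CREATE INDEX idx ON instrumentsHistory "
    "(datetime, instrumentSymbol, observation, observationColName)"
)

# One row already in the final table, two rows staged (one of them a duplicate).
cur.execute("INSERT INTO instrumentsHistory VALUES ('2020-01-01', 'ABC', 1.5, 'close')")
cur.executemany(
    "INSERT INTO tempTable VALUES (?, ?, ?, ?)",
    [("2020-01-01", "ABC", 1.5, "close"), ("2020-01-02", "ABC", 1.7, "close")],
)

insert_sql = """
INSERT INTO instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
FROM tempTable t
WHERE NOT EXISTS
    (SELECT 1 FROM instrumentsHistory f
     WHERE t.datetime = f.datetime
       AND t.instrumentSymbol = f.instrumentSymbol
       AND t.observation = f.observation
       AND t.observationColName = f.observationColName)
"""
cur.execute(insert_sql)
conn.commit()

row_count = cur.execute("SELECT COUNT(*) FROM instrumentsHistory").fetchone()[0]
print(row_count)  # only the new row was added
```

Since the database in the question is SQLite, another option would be to declare the index as UNIQUE and use INSERT OR IGNORE, which is SQLite's counterpart to the MySQL syntax mentioned above.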
This subquery
WHERE NOT EXISTS
(SELECT 1 FROM instrumentsHistory f
WHERE t.datetime = f.datetime
AND t.instrumentSymbol = f.instrumentSymbol
AND t.observation = f.observation
AND t.observationColName = f.observationColName)
has to check every row in the table - and match four columns - until a match is found. In the worst case, there is no match and a full table scan must be completed. Therefore, the performance of the query will deteriorate as the table grows in size.
The solution, as mentioned in Tim's answer, is to create an index over the four columns so that the db can quickly determine whether a match exists.
I want to check which values from a pandas dataframe are not in a SQL database. So basically a left join (the left side being the pandas df) where the right (SQL DB) key is NULL.
The DB is quite big, about 5 million entries, but I'm only interested in the primary key.
The pandas dataframe (50k rows) is much smaller than the SQL DB (5M), so I'd rather move data to the batabase than bring all of it over.
I've thought about creating a temporary table in SQL and doing a LEFT JOIN, but it might be possible to do it with just a query.
pandas dataframe:
index
0
1
2
3
4
sql database:
index(primary key)
1
2
3
result: 0, 4
I wonder what's "batabase"... Google didn't help me on that :-)
However, if I understand that correctly, I think you should create a one-column temporary table in SQL server out of the dataframe (as you suggested yourself) and then of course it would be easy to find it like that:
SELECT P.Index
FROM PandasTable AS P
WHERE P.Index NOT IN
(SELECT B.Index FROM BatabaseTable AS B)
Should be pretty fast with indexed primary keys.
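A minimal sketch of the temporary-table approach, assuming Python's built-in sqlite3 module as a stand-in for the real database. The table names big_table and pandas_keys and the list df_keys are hypothetical; df_keys plays the role of the dataframe's index column. The column name is quoted because index is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-in for the large table in the database (primary keys 1, 2, 3).
cur.execute('CREATE TABLE big_table ("index" INTEGER PRIMARY KEY)')
cur.executemany("INSERT INTO big_table VALUES (?)", [(1,), (2,), (3,)])

# Hypothetical keys taken from the dataframe.
df_keys = [0, 1, 2, 3, 4]

# Upload only the small side into a temporary table.
cur.execute('CREATE TEMP TABLE pandas_keys ("index" INTEGER PRIMARY KEY)')
cur.executemany("INSERT INTO pandas_keys VALUES (?)", [(k,) for k in df_keys])

# Anti-join: keep the uploaded keys that have no match in the big table.
missing = [
    k
    for (k,) in cur.execute(
        """
        SELECT p."index"
        FROM pandas_keys AS p
        LEFT JOIN big_table AS b ON b."index" = p."index"
        WHERE b."index" IS NULL
        ORDER BY p."index"
        """
    )
]
print(missing)  # [0, 4]
```

This is the LEFT JOIN ... IS NULL form from the question; the NOT IN form above should give the same result here because the key column contains no NULLs.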
I am currently using this code:
while True:
    col_num = 0
    for table in table_names:
        cursor.execute("INSERT INTO public.{0} VALUES(CURRENT_TIMESTAMP, 999999)".format(table))
        cursor.connection.commit()
        col_num += 1
    row_num += 1
And this is pretty slow. One of the problems I see is that it's committing once for every table. If I could commit for all tables in one go, I think that would improve performance. How should I go about this?
You can commit outside the loop:
for table in table_names:
    cursor.execute("INSERT INTO public.{0} VALUES(CURRENT_TIMESTAMP, 999999)".format(table))
cursor.connection.commit()
However, there is a side effect. The first column (the timestamp) will have different values across tables when each insert is committed separately, but the same value when they are committed together. This is because CURRENT_TIMESTAMP returns the time at the start of the transaction.
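A runnable sketch of the commit-once pattern, with Python's built-in sqlite3 standing in for the real database. The table names and the helper insert_into_all are hypothetical. Note that computing the timestamp in Python and binding it as a parameter is my own variation, not something from the code above; it simply makes the shared timestamp explicit instead of relying on how the database evaluates CURRENT_TIMESTAMP.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables, each with a timestamp column and a value column.
table_names = ["sensor_a", "sensor_b", "sensor_c"]
for table in table_names:
    cur.execute(f"CREATE TABLE {table} (ts TEXT, value INTEGER)")


def insert_into_all(cursor, tables, value):
    """Insert one row per table, then commit once for the whole batch."""
    now = datetime.now(timezone.utc).isoformat()
    for table in tables:
        # Table names cannot be bound as parameters, so only pass trusted names.
        cursor.execute(f"INSERT INTO {table} VALUES (?, ?)", (now, value))
    cursor.connection.commit()  # a single commit after the loop
    return now


stamp = insert_into_all(cur, table_names, 999999)
print(stamp)
```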
I have 2 columns in a database(sent_time, accept_time) which each include time stamps, as well as a field that can have 2 different values (ref_col). I would like to find a way in my query to make a new column (result_col), which will check the value of ref_col, and copy sent_time if the value is 1, and copy accept_time if the value is 2.
I am using pandas to query the database in python, if that has any bearing on the answer.
Just use a CASE expression:
SELECT sent_time,
       accept_time,
       ref_col,
       CASE WHEN ref_col = 1 THEN sent_time
            ELSE accept_time
       END AS result_col
FROM Your_Table
When you say "I have 2 columns in a database", what you actually mean is that you have 2 columns in a table, right?
In SQL for PostgreSQL it would be something like:
select (case when ref_col = 1 then sent_time else accept_time end) as result_col
from mytable
I don't know how close that is to the SQL standard, but I would assume it's not far off.
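Here is a small sketch of running such a query from Python, using the built-in sqlite3 module and two invented sample rows. Unlike the queries above, it lists both values of ref_col explicitly, so any other value would give NULL rather than silently falling through to accept_time; that is a design choice on my part, not a requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Made-up sample data: one row per possible ref_col value.
cur.execute("CREATE TABLE mytable (sent_time TEXT, accept_time TEXT, ref_col INTEGER)")
cur.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("2021-01-01 10:00", "2021-01-01 10:05", 1),
        ("2021-01-02 11:00", "2021-01-02 11:30", 2),
    ],
)

query = """
SELECT sent_time, accept_time, ref_col,
       CASE ref_col
           WHEN 1 THEN sent_time
           WHEN 2 THEN accept_time
       END AS result_col
FROM mytable
ORDER BY sent_time
"""

# result_col is the fourth column of each returned row.
results = [row[3] for row in cur.execute(query)]
print(results)
```

Since the question mentions pandas, the same query string could be passed to pandas.read_sql_query together with the connection to get the result as a dataframe.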