I have a new csv file every day with 400 million+ entries which I need to upsert into my database (3 tables with 2 foreign keys, indexed). The majority of the entries are already in the table, in which case I need to update a column. Some entries, which are not already in the table need to be inserted.
I tried to insert the CSV each day into a temptable then run:
INSERT INTO restaurants (name, food_id, street_id, datecreated, lastdayobservedopen)
SELECT DISTINCT temptable.name, typesoffood.food_id, location.street_id, temptable.datecreated, temptable.lastdayobservedopen
FROM temptable
INNER JOIN typesoffood ON typesoffood.food_type = temptable.food_type
INNER JOIN location ON location.street_name = temptable.street_name
ON CONFLICT ON CONSTRAINT restaurants_pk DO UPDATE SET lastdayobservedopen = EXCLUDED.lastdayobservedopen
But it takes over 6 hrs.
Is it possible to make this faster?
Edit:
Some more details: 3 tables- restaurants(name, food_id, street_id, datecreated, lastdayobservedopen) with pk (name, street_id) and fks (food_id and street_id); typesoffood(food_id, food_type) with pk (food_id) and index on food_type; location(street_id, street_name) with pk (street_id) and index on street_name; as for the csv file, I don’t know which are new or old entries, but I do know that the majority of the entries are already in the database which would require me to update the lastdayobserved date. The rest are to be inserted with the lastdayobserved date as today. This is supposed to help distinguish between restaurants that are no longer in operation (in which case their lastdayobserved column would not be updated) and currently operating restaurants whose date in that column should always match today’s date. Open to more efficient schema suggestions, as well. Thanks to all!
SQL Server has a statement called BULK INSERT that can handle large volumes of data (PostgreSQL does not support it; its counterpart is COPY):
bulk insert #temp
from "file location path"
If you can change your Postgres settings, you could take advantage of parallelism in Postgres. Otherwise, you could at least speed up the CSV upload using Postgres's bulk-load facility, the COPY command.
Without more details it's hard to give better advice.
Hello, and thanks for taking the time to go through my question. I work in the budget space for a small city and, during these precarious times, I am learning some Python that may help me with financial data modelling in the future. We currently use SAP, but I also wanted to learn a new language.
I need some pointers on where to look for certain answers.
For example, I made a database with a few million records, sorted by date and time. I was able to strip off the data I did not need and now have a clean database to work on.
At a high level, I want to know, based on the first record in a day, whether there is another entry on the same day that is at least double the first record.
Date|time|dept|Value1
01/01/2019|11:00|BUD|51.00
01/01/2019|11:30|CSD|101.00
01/01/2019|11:50|BUD|102.00
01/02/2019|10:00|BUD|200.00
01/02/2019|10:31|BUD|201.00
01/02/2019|11:51|POL|400.00
01/03/2019|11:00|BUD|100.00
01/03/2019|11:30|PWD|101.00
01/03/2019|11:50|BUD|110.00
based on the data above and the requirement, I want to get an output of
Date|time|dept|Value| Start Value
01/01/2019|11:50|BUD|102.00|51.00
01/02/2019|11:51|POL|400.00|200.00
01/03/2019|NONE|NONE|NONE|100.00
On day 3, no value was at least double the first one, so we have NONE (null).
What I have done so far
1. I have been able to connect to the database [Python]
2. I was able to strip off the unnecessary information and depts from the database [sqlite]
3. I have been able to create new tables for result [Python]
Questions / best Practices
How to get the first line per day. Do I start off with a variable before the loop that is assigned to Jan 1, 2019 and then pick the row number and store it in another table or what other options do we have here.
Once the first row per day is stored/captured in another table or a array, How do I get the first occurrence of a value at least twice of the first line.
ex? begin meta code***********
Start from Line 1 to end
table2.date[] Should be equal to 01/01/2019
table2.value[] Should be equal to 51.00
look through each line if date = table2.date and value >= 2* (table2.value[])
*if successful, get record line number and department and value and store in new table
else
goto next line
Then increase table2.date and table2.value by 1 and do the loop again.
end meta code*****************
Is this the right approach? I feel like going through millions of records for each date change is not very optimized.
I can probably add a condition to exit if the date is not equal to table2.date[1], but I am still not sure if this is the right way to approach this problem. This will be run only once or twice a year, so system performance is not that important, but I would still like to approach it the right way.
Should I export the final data to Excel for analysis, or are there good analysis and modelling tools in Python? What would the professionals recommend?
You could use exists to check whether another record exists on the same day with a value at least twice as large, and a window function to pick the first record per day. Note that the row number has to be computed before the exists filter is applied, otherwise rn = 1 would be the first of the matching rows rather than the first row of the day:
select t.*
from (
    select
        t.*,
        row_number() over(partition by date order by time) rn
    from mytable t
) t
where rn = 1
    and exists (
        select 1 from mytable t1 where t1.date = t.date and t1.value >= 2 * t.value
    )
In versions of SQLite where row_number() is not available (before 3.25), another option is to filter with a correlated subquery:
select t.*
from mytable t
where
    exists(select 1 from mytable t1 where t1.date = t.date and t1.value >= 2 * t.value)
    and t.time = (select min(t1.time) from mytable t1 where t1.date = t.date)
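For reference, here is a minimal sketch that runs this kind of query from Python with the standard sqlite3 module and also returns the matching row, using the sample data from the question. It assumes the table and column names used in the queries above (mytable, date, time, dept, value), that times are stored as zero-padded HH:MM text so that string comparison orders them correctly, and that the match has to occur later on the same day. The helper names build_sample_db and first_doubling_per_day are made up for illustration.

```python
import sqlite3

SAMPLE_ROWS = [
    ("01/01/2019", "11:00", "BUD", 51.00),
    ("01/01/2019", "11:30", "CSD", 101.00),
    ("01/01/2019", "11:50", "BUD", 102.00),
    ("01/02/2019", "10:00", "BUD", 200.00),
    ("01/02/2019", "10:31", "BUD", 201.00),
    ("01/02/2019", "11:51", "POL", 400.00),
    ("01/03/2019", "11:00", "BUD", 100.00),
    ("01/03/2019", "11:30", "PWD", 101.00),
    ("01/03/2019", "11:50", "BUD", 110.00),
]

QUERY = """
WITH starts AS (
    -- First record of each day (assumes one row per date/time pair).
    SELECT t.date, t.time, t.value AS start_value
    FROM mytable t
    WHERE t.time = (SELECT MIN(t1.time) FROM mytable t1 WHERE t1.date = t.date)
)
SELECT s.date, m.time, m.dept, m.value, s.start_value
FROM starts s
LEFT JOIN mytable m
       ON m.date = s.date
      AND m.time = (
            -- Earliest later row that is at least double the start value.
            SELECT MIN(m1.time)
            FROM mytable m1
            WHERE m1.date = s.date
              AND m1.time > s.time
              AND m1.value >= 2 * s.start_value
          )
ORDER BY s.date
"""


def build_sample_db():
    """Create an in-memory database holding the sample rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (date TEXT, time TEXT, dept TEXT, value REAL)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return con


def first_doubling_per_day(con):
    """Return one row per day; the match columns are None when nothing doubles."""
    return con.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in first_doubling_per_day(build_sample_db()):
        print(row)
```

The LEFT JOIN keeps days without a match, which is what produces the NONE row for day 3 in the desired output. On a table with millions of rows, an index on (date, time) would probably be needed for the correlated subqueries to run in reasonable time.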
You could do it that way, but you're correct, it would take a long time. I don't know if SQLite has the capabilities to do what you want effectively, but I know Python does. It sounds like you might want to use the Python Data Analysis Library, Pandas. You can find out how to get your SQLite into Pandas here:
How to open and convert sqlite database to pandas dataframe
Once you have it in a Pandas Dataframe, there are tons of functions to get the first occurrence of something, find duplicates, find unique values, and even generate other dataframes with only unique values.
I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).
Instead of using the UPSERT command, why don't you write your own routine that updates a row if its date and time are found and inserts a new row otherwise? Check out the code I wrote for you, and let me know if you are still confused. You can even do this for hundreds of tables by replacing the table name with a variable and looping over your list of table names.
import sqlite3
import pandas as pd

connection_str = "my_database.db"  # Placeholder: path to your SQLite .db file
csv_data = pd.read_csv("my_CSV_file.csv")  # Your CSV data path

def manual_upsert():
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("SELECT my_date_column FROM my_CSV_data")  # Dates already in the table
    old_dates = {row[0] for row in cur.fetchall()}  # A set gives fast membership tests
    # Iterating a DataFrame directly yields column names, so use itertuples() for rows.
    for new_data in csv_data.itertuples(index=False):
        if new_data[0] in old_dates:  # Assumes the date column is at index 0
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",  # Update the row if the date exists
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            cur.execute("INSERT INTO my_CSV_data VALUES(?,?,?,?)",  # Insert a new row if the date is not found
                        (new_data[0], new_data[1], new_data[2], new_data[3]))
    con.commit()
    con.close()

manual_upsert()
First, even though the questions are related, ask them separately in the future.
There is documentation on UPSERT handling in SQLite that documents how to use it but it is a bit abstract. You can check examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Use a transaction and the statements are going to be executed in bulk.
As the presence of this library suggests, to_sql does not create UPSERT commands (only INSERT).
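As a rough illustration of the first two points, the sketch below upserts a batch of rows inside a single transaction using SQLite's INSERT ... ON CONFLICT ... DO UPDATE syntax, which requires SQLite 3.24.0 or newer. The table name events and the columns event_time, col1 and col2 are placeholders rather than the asker's real schema, and bulk_upsert is a hypothetical helper name.

```python
import sqlite3

# Placeholder schema: event_time is the primary key, col1/col2 stand in
# for the remaining data columns.
UPSERT_SQL = """
    INSERT INTO events (event_time, col1, col2)
    VALUES (?, ?, ?)
    ON CONFLICT(event_time) DO UPDATE SET
        col1 = excluded.col1,
        col2 = excluded.col2
"""


def bulk_upsert(con, rows):
    """Upsert all rows in one transaction (committed on success, rolled back on error)."""
    with con:
        con.executemany(UPSERT_SQL, rows)


def demo():
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE events (event_time TEXT PRIMARY KEY, col1 REAL, col2 REAL)"
    )
    bulk_upsert(con, [("2020-01-01 00:00:00", 1.0, 2.0)])
    # The first row already exists and is updated; the second one is inserted.
    bulk_upsert(con, [
        ("2020-01-01 00:00:00", 10.0, 20.0),
        ("2020-01-02 00:00:00", 3.0, 4.0),
    ])
    return con.execute("SELECT * FROM events ORDER BY event_time").fetchall()


if __name__ == "__main__":
    print(demo())
```

If the new data is in a pandas DataFrame whose columns are in the same order as the placeholders, passing df.itertuples(index=False, name=None) as rows should work, since it yields plain tuples.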
I have a database, which I store as a .db file on my disk. I implemented all the functions necessary for managing this database using sqlite3. However, I noticed that updating the rows in the table takes a large amount of time. My database currently has 608042 rows. The database has one table; let's call it Table1. This table consists of the following columns:
id | name | age | address | job | phone | income
(The id value is generated automatically when a row is inserted into the database.)
After reading-in all the rows I perform some operations (ML algorithms for predicting the income) on the values from the rows, and next I have to update (for each row) the value of income (thus, for each one from 608042 rows I perform the SQL update operation).
In order to update, I'm using the following function (copied from my class):
def update_row(self, new_value, idkey):
    update_query = "UPDATE Table1 SET income = ? WHERE name = ?"
    self.cursor.execute(update_query, (new_value, idkey))
    self.db.commit()
And I call this function for each person registered in the database.
for each i out of 608042 rows:
update_row(new_income_i, i.name)
(values of new_income_i are different for each i).
This takes a huge amount of time, even though the dataset is not giant. Is there any way to speed up the updating of the database? Should I use something else than sqlite3? Or should I instead of storing the database as a .db file store it in memory (using sqlite3.connect(":memory:"))?
Each UPDATE statement must scan the entire table to find any row(s) that match the name.
An index on the name column would prevent this and make the search much faster. (See Query Planning and How does database indexing work?)
However, if the name column is not unique, then that value is not even suitable to find individual rows: each update with a duplicate name would modify all rows with the same name. So you should use the id column to identify the row to be updated; and as the primary key, this column already has an implicit index.
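A minimal sketch of that advice, assuming the Table1 layout from the question (shortened here to id, name and income) and a hypothetical helper named update_incomes: rows are addressed by id, and the whole batch goes through executemany inside one transaction. Committing once instead of after every row is an addition to the answer above; in SQLite it usually makes a large difference because each commit has to be flushed to disk.

```python
import sqlite3


def update_incomes(con, new_incomes):
    """Update income by primary key; new_incomes is an iterable of (id, income)."""
    params = [(income, row_id) for row_id, income in new_incomes]
    with con:  # single transaction for the whole batch
        con.executemany("UPDATE Table1 SET income = ? WHERE id = ?", params)


def demo():
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Table1 (id INTEGER PRIMARY KEY, name TEXT, income REAL)"
    )
    # Two people share a name, so only the id identifies them unambiguously.
    con.executemany(
        "INSERT INTO Table1 (name, income) VALUES (?, ?)",
        [("Ann", 0.0), ("Ann", 0.0), ("Bob", 0.0)],
    )
    update_incomes(con, [(1, 100.0), (2, 200.0), (3, 300.0)])
    return con.execute("SELECT id, name, income FROM Table1 ORDER BY id").fetchall()


if __name__ == "__main__":
    print(demo())
```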
I have got a table with auto increment primary key. This table is meant to store millions of records and I don't need to delete anything for now. The problem is, when new rows are getting inserted, because of some error, the auto increment key is leaving some gaps in the auto increment ids.. For example, after 5, the next id is 8, leaving the gap of 6 and 7. Result of this is when I count the rows, it results 28000, but the max id is 58000. What can be the reason? I am not deleting anything. And how can I fix this issue.
P.S. I am using insert ignore while inserting records so that it doesn't give error when I try to insert duplicate entry in unique column.
This is by design and will always happen.
Why?
Let's take 2 overlapping transactions that are doing INSERTs
Transaction 1 does an INSERT, gets the value (let's say 42), does more work
Transaction 2 does an INSERT, gets the value 43, does more work
Then
Transaction 1 fails. Rolls back. 42 stays unused
Transaction 2 completes with 43
If consecutive values were guaranteed, every transaction would have to happen one after the other. Not very scalable.
Also see Do Inserted Records Always Receive Contiguous Identity Values (SQL Server but same principle applies)
You can create a trigger to handle the auto increment as:
CREATE DEFINER=`root`@`localhost` TRIGGER `mytable_before_insert` BEFORE INSERT ON `mytable` FOR EACH ROW
BEGIN
    SET NEW.id = (SELECT IFNULL(MAX(id), 0) + 1 FROM mytable);
END
This is a problem in the InnoDB, the storage engine of MySQL.
It really isn't a problem: the docs on “AUTO_INCREMENT Handling in InnoDB” basically say that InnoDB keeps a special in-memory counter for auto-increment values, which it initializes at startup.
And the query it uses is something like
SELECT MAX(ai_col) FROM t FOR UPDATE;
This improves concurrency without really having an effect on your data.
To avoid this behavior, use MyISAM instead of InnoDB as the storage engine.
Perhaps (I haven't tested this) a solution is to set innodb_autoinc_lock_mode to 0.
According to http://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html this might make things a bit slower (if you perform inserts of multiple rows in a single query) but should remove gaps.
You can try insert like :
insert ignore into table select (select max(id)+1 from table), "value1", "value2" ;
This will:
insert the new data with the next unused id (not the auto-increment value)
ignore the row if a duplicate entry is found in a unique field
otherwise insert the new data normally
(but this method does not support updating fields when a duplicate entry is found)
cursor.execute('UPDATE emp SET name = %(name)s',{"name": name} where ?)
I don't understand how to get the primary key of a particular record.
I have some number N of records in the DB. I want to access those records and manipulate them.
Through a SELECT query I got all the records, but I want to update each of those records individually.
Can someone lend a helping hand?
Thanks in Advance!
Table structure:
ID CustomerName ContactName
1 Alfreds Futterkiste
2 Ana Trujillo
Here ID is auto-generated by the system in Postgres.
I am accessing the CustomerName of two records and updating it. When I update
those records, the last update also overwrites the first record.
I want to set a condition so that the update query only affects the intended record.
After Table structure:
ID CustomerName ContactName
1 xyz Futterkiste
2 xyz Trujillo
Here I want to set the first record to 'abc' and the second record to 'xyz'.
Note: I know this can be done using the PK, but I don't know how to get that PK.
You mean you want to use the UPDATE SQL command with a WHERE clause:
cursor.execute("UPDATE emp SET CustomerName='abc' WHERE ID=1")
cursor.execute("UPDATE emp SET CustomerName='xyz' WHERE ID=2")
This way you will UPDATE rows with specific IDs.
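To show where the IDs come from, here is a small sketch that selects the primary keys first and then updates each row by its own ID with a parameterized query. It uses the standard sqlite3 module as a stand-in so that it runs without a server; with psycopg2 against Postgres the placeholder would be %s instead of ?. The helper name rename_by_id and the in-memory table are assumptions made for illustration.

```python
import sqlite3


def rename_by_id(con, new_names):
    """Set CustomerName per row; new_names maps primary-key ID -> new name."""
    params = [(name, pk) for pk, name in new_names.items()]
    with con:  # one transaction for all updates
        con.executemany("UPDATE emp SET CustomerName = ? WHERE ID = ?", params)


def demo():
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE emp (ID INTEGER PRIMARY KEY, CustomerName TEXT, ContactName TEXT)"
    )
    con.executemany(
        "INSERT INTO emp (CustomerName, ContactName) VALUES (?, ?)",
        [("Alfreds", "Futterkiste"), ("Ana", "Trujillo")],
    )
    # Select the ID together with the data so each row can be addressed later.
    ids = [row[0] for row in con.execute("SELECT ID FROM emp ORDER BY ID")]
    rename_by_id(con, dict(zip(ids, ["abc", "xyz"])))
    return con.execute(
        "SELECT ID, CustomerName, ContactName FROM emp ORDER BY ID"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

Because the WHERE clause matches on ID, the second update no longer overwrites the first record.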
Maybe you won't like this, but you should not use autogenerated keys in general. The only exception is when you want to insert some rows and do not do anything else with them. The proper solution is this:
Create a sequence for your table. http://www.postgresql.org/docs/9.4/static/sql-createsequence.html
Whenever you need to insert a new row, get the next value from the generator (using select nextval('generator_name')). This way you will know the ID before you create the row.
Then insert your row by specifying the id value explicitly.
For the updates:
You can create unique constraints (or unique indexes) on sets of columns that are known to be unique
But you should identify the rows with the identifiers internally.
When referring records in other tables, use the identifiers, and create foreign key constraints. (Not always, but usually this is good practice.)
Now, when you need to update a row (for example, a customer), you should already know which customer needs to be modified. Because all records are identified by the primary key id, you should already know the id for that row. If you don't know it, but you have a unique index on a set of fields, then you can try to get the id. For example:
select id from emp where CustomerName='abc' -- but only if you have a unique constraint on CustomerName!
In general, if you want to update a single row, then you should NEVER update this way:
update emp set CustomerName='newname' where CustomerName='abc'
even if you have a unique constraint on CustomerName. The explanation is not easy, and won't fit here. But think about this: you may be sending changes in a transaction block, and there can be many open transactions at the same time...
Of course, it is fine to update rows this way if your intention is to update all rows that satisfy your condition.