I am trying to automate a Python script with a batch file. The code works fine on my own computer, but it runs into an operational error, "too many SQL variables", when I run it via the batch file on a remote desktop.
This is apparently because the limit on the number of parameters in a SQLite query is 999, and mine has more than that. How do I actually increase this limit, or break the data into chunks of 999 columns? I came across many posts saying to increase the limit at compile time, but I don't know how to do so, and to_sql has a chunksize argument, but that is for rows, not columns. I'm using SQLite.
My python code to insert data is:
df.tail(1).to_sql("table", sqlcon, if_exists="append", index=True)
Thanks!
A schema with more than 999 columns should probably be rethought. That said, here's how to work around it.
You can upgrade to SQLite 3.32.0 or later, where SQLITE_MAX_VARIABLE_NUMBER defaults to 32766. And if you need more than that, you are not allowed to design databases.
Otherwise, if for some reason upgrading is not an option: the hard-coded limits can only be lowered at runtime, so if you want to raise them you will have to recompile SQLite with a higher SQLITE_MAX_VARIABLE_NUMBER. That will make your program difficult to deploy with standard dependency managers.
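For reference, here is a minimal sketch to check which limit your remote environment is actually running with (the getlimit call needs Python 3.11+; on older Pythons you can at least check the library version):

import sqlite3

print(sqlite3.sqlite_version)               # version of the SQLite library in use
con = sqlite3.connect(":memory:")
# Python 3.11+ exposes the per-connection limits directly:
print(con.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER))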
Yes, I've thought about that, but for this purpose I don't think I can really change it: the rows are basically dates and the column names are securities that we need to store data for.
This is a job for a join table.
create table securities (
    id integer primary key,
    symbol text not null unique,
    name text not null
);

create table security_prices (
    security_id integer not null references securities(id),
    retrieved_at datetime not null,
    price integer not null
);

select symbol, price
from security_prices sp
join securities s on s.id = sp.security_id
where retrieved_at = ?;
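With that schema, the wide DataFrame from the question (index = dates, one column per security) can be reshaped to long form before writing, so each inserted row only binds a handful of parameters. A minimal sketch, writing the symbol text into a staging table for brevity (the staging table name is an assumption; mapping symbols to security_id would be one extra lookup or join step):

long_df = (
    df.tail(1)
      .reset_index()                                    # assumes the date index is unnamed
      .melt(id_vars="index", var_name="symbol", value_name="price")
      .rename(columns={"index": "retrieved_at"})
)
long_df.to_sql("security_prices_staging", sqlcon, if_exists="append", index=False)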
Related
I have a new csv file every day with 400 million+ entries which I need to upsert into my database (3 tables with 2 foreign keys, indexed). The majority of the entries are already in the table, in which case I need to update a column. Some entries, which are not already in the table need to be inserted.
I tried inserting the CSV into a temptable each day and then running:
INSERT INTO restaurants (name, food_id, street_id, datecreated, lastdayobservedopen)
SELECT DISTINCT temptable.name, typesoffood.food_id, location.street_id,
                temptable.datecreated, temptable.lastdayobservedopen
FROM temptable
INNER JOIN typesoffood ON typesoffood.food_type = temptable.food_type
INNER JOIN location ON location.street_name = temptable.street_name
ON CONFLICT ON CONSTRAINT restaurants_pk
DO UPDATE SET lastdayobservedopen = EXCLUDED.lastdayobservedopen
But it takes over 6 hrs.
Is it possible to make this faster?
Edit:
Some more details on the 3 tables:
restaurants(name, food_id, street_id, datecreated, lastdayobservedopen) with pk (name, street_id) and fks (food_id, street_id)
typesoffood(food_id, food_type) with pk (food_id) and an index on food_type
location(street_id, street_name) with pk (street_id) and an index on street_name
As for the CSV file, I don't know which entries are new and which are old, but I do know that the majority are already in the database, which would require me to update the lastdayobserved date. The rest are to be inserted with the lastdayobserved date set to today. This is supposed to help distinguish between restaurants that are no longer in operation (in which case their lastdayobserved column would not be updated) and currently operating restaurants, whose date in that column should always match today's date. Open to more efficient schema suggestions as well. Thanks to all!
There is a command in SQL Server called BULK INSERT that can handle large volumes of data:
bulk insert #temp
from "file location path"
If you can change your Postgres settings, you could take advantage of parallelism in Postgres. Otherwise, you could at least speed up the CSV upload using Postgres's bulk-load mechanism, the COPY command.
Without more details it's hard to give better advice.
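If the slow part includes getting the CSV into temptable in the first place, here is a minimal psycopg2 sketch of the COPY approach (the file name, connection string and HEADER option are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur, open("restaurants_daily.csv") as f:
    # COPY streams the whole file in one command instead of one INSERT per row
    cur.copy_expert(
        "COPY temptable (name, food_type, street_name, datecreated, lastdayobservedopen) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )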
I am trying something like
select customer_id, order_id from order_table where purchase_id = 10 OR
purchase_id = 25 OR
...
purchase_id = 25432;
Since the query is too big, I am running into a variety of problems. If I run the entire query on a single line, I get the error:
SP2-0027: Input is too long (> 2499 characters) - line ignored
If I split the query across multiple lines, it gets corrupted because the line numbers printed for each line of the entered query interfere with it. If I disable line numbers, the SQL> prompt shown at each line causes the same trouble.
I get the same error if I run the query from a text file: SQL> @query.sql
(I did not face such issues with MySQL in the past, but I do now with SQL*Plus.)
I am not an expert in shell scripting or Python. It would be of great help if I could get pointers on how to put all the purchase_ids in a text file, one purchase_id per line, and supply them to the sqlplus query at script run time.
I have done some research already, but I would still appreciate pointers.
1) Syntax change:
Try using 'in (10, 25, 2542, ...)' instead of a series of 'OR' conditions. It can reduce the size of the SQL statement.
2) Logic change:
The syntax change may delay the inevitable, but the error will still occur if there are a lot of IDs.
2a)
A straightforward fix is to break the query down into batches. You can issue a select query per 50 purchase IDs until all IDs are covered (see the Python sketch after this answer).
2b)
Or you can look into a more generalised way to retrieve the same query result. Let's assume what you actually want to see is a list of 'unconfirmed orders'. Then, instead of using a set of purchase IDs in the where clause, you can add a boolean field 'confirmed' to the order_table and select on that criterion.
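Building on 2a), here is a minimal Python sketch, assuming the cx_Oracle driver and a purchase_ids.txt file with one ID per line (both assumptions). It queries in batches of 1,000, which also stays within Oracle's 1,000-expression IN-list limit; smaller batches such as the 50 suggested above work the same way:

import cx_Oracle

with open("purchase_ids.txt") as f:
    ids = [int(line) for line in f if line.strip()]

conn = cx_Oracle.connect("user/password@host/service")   # placeholder credentials
cur = conn.cursor()
rows = []
for i in range(0, len(ids), 1000):
    batch = ids[i:i + 1000]
    placeholders = ",".join(":%d" % (n + 1) for n in range(len(batch)))
    cur.execute(
        "select customer_id, order_id from order_table "
        "where purchase_id in (%s)" % placeholders,
        batch,
    )
    rows.extend(cur.fetchall())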
Another idea:
Create a table "query_ids" (one column) and insert all the purchase_ids from your WHERE clause into it.
The new query would be:
select customer_id, order_id from order_table where purchase_id in (select * from query_ids);
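A minimal sketch of loading the IDs into that table from a text file, again assuming cx_Oracle and one purchase_id per line (file name and credentials are placeholders):

import cx_Oracle

conn = cx_Oracle.connect("user/password@host/service")
cur = conn.cursor()
with open("purchase_ids.txt") as f:
    ids = [(int(line),) for line in f if line.strip()]
cur.executemany("INSERT INTO query_ids VALUES (:1)", ids)
conn.commit()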
I have a database with roughly 30 million entries, which is a lot, and I expect nothing but trouble working with result sets that large.
But using py-postgresql and its .prepare() statement, I would hope I could fetch entries on a "yield" basis and thus avoid filling up my memory with the full result set from the database, which apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")
uniqueue_days = []
with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])
print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering results() probably fetches the entire result set before looping through it?
Is there a way to get the postgresql library to "page" or batch the results, say 60k rows per round, or perhaps to rework the query so the database does more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format before adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers fairly easily, but py-postgresql doesn't seem to follow the Python DB-API, so switching will take a few more changes. I still recommend it.
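For example, with psycopg2 a named cursor is backed by a server-side portal and streams rows in batches rather than loading everything into memory. A minimal sketch (connection details are placeholders, and the batch size matches the 60k mentioned in the question):

import datetime
import psycopg2

conn = psycopg2.connect(host="192.168.1.1", dbname="mydb", user="test", password="test")
cur = conn.cursor(name="stream_times")   # giving the cursor a name makes it server-side
cur.itersize = 60000                     # rows fetched per network round trip
cur.execute("SELECT time FROM mytable")

unique_days = set()
for (ts,) in cur:
    unique_days.add(datetime.date.fromtimestamp(ts).strftime("%Y-%m-%d"))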
You could let the database do all the heavy lifting.
For example: instead of reading all the data into Python and then calculating unique_dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce sort order on unique_dates returned then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates from the above query to subset your results further down the line, e.g.:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j];
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert the dates back into Unix timestamps.
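A minimal psycopg2 sketch of that chunked approach, using a parameterized query instead of string concatenation (connection details are placeholders):

import psycopg2

conn = psycopg2.connect(host="192.168.1.1", dbname="mydb", user="test", password="test")
cur = conn.cursor()

cur.execute("SELECT DISTINCT DATE(to_timestamp(time)) FROM mytable ORDER BY 1")
unique_dates = [row[0] for row in cur.fetchall()]

for day in unique_dates:
    cur.execute("SELECT * FROM mytable WHERE to_timestamp(time)::date = %s", (day,))
    rows = cur.fetchall()   # process one day's worth of rows at a time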
I have a MySQL table which I want to populate with some dummy data for testing (50+).
This table has a foreign key to another table so the dummy data must cross reference from that table but again be random i.e. can't all be the same foreign key.
It also has a date added field which I want to populate with a random date within a year span e.g. any date in the year 2010.
my table structure is:
id, customer_id, date_added, title, total_cost
where id is the primary key, customer_id is the foreign key and date_added is the date field.
What is the best way of doing this? I'd prefer to do it directly in MySQL, but if not, my site runs on Python, so a way of doing it there would also work.
I would not do this in MySQL without outside help from an application written in Python.
There are several requirements built into your statement that are best expressed in a procedural style. SQL is a set-based language; I don't think it lends itself as nicely to the task at hand.
You'll want an application to take in data from a source, do whatever randomization and PII removal that you need, and then construct the test data according to your requirements.
If it's a database intended just for testing, you might consider an in-memory database that you can populate, modify all you like, and then blow away for your next test. I'm thinking of something like Hypersonic or Derby or TimesTen.
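In the Python world, the closest equivalent of that idea is SQLite's in-memory mode; a minimal sketch (the schema is a simplified stand-in for the one in the question):

import sqlite3

con = sqlite3.connect(":memory:")   # the whole database disappears when the connection closes
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        date_added TEXT NOT NULL,
        title TEXT,
        total_cost REAL
    );
""")
con.execute("INSERT INTO customers (name) VALUES ('Test customer')")
# ... populate, run your tests, then simply close and start over
con.close()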
quick and dirty solution:
drop table if exists orders;
drop table if exists customers;
create table customers
(
    cust_id int unsigned not null auto_increment primary key,
    name varchar(255) not null
)
engine=innodb;

create table orders
(
    order_id int unsigned not null auto_increment primary key,
    cust_id int unsigned not null,
    order_date datetime not null,
    foreign key (cust_id) references customers(cust_id) on delete cascade
)
engine=innodb;
drop procedure if exists load_test_data;
delimiter #
create procedure load_test_data()
begin

    declare v_max_customers int unsigned default 0;
    declare v_max_orders int unsigned default 0;
    declare v_counter int unsigned default 0;
    declare v_rnd_cust_id int unsigned default 0;
    declare v_base_date datetime;

    set foreign_key_checks = 0;
    truncate table orders;
    truncate table customers;
    set foreign_key_checks = 1;

    set v_base_date = "2010-01-01 00:00:00";
    set v_max_customers = 1000;
    set v_max_orders = 10000;

    start transaction;
    set v_counter = 0;
    while v_counter < v_max_customers do
        insert into customers (name) values (concat('Customer ', v_counter+1));
        set v_counter = v_counter + 1;
    end while;
    commit;

    start transaction;
    set v_counter = 0;
    while v_counter < v_max_orders do
        set v_rnd_cust_id = floor(1 + (rand() * v_max_customers));
        insert into orders (cust_id, order_date) values (v_rnd_cust_id, v_base_date + interval v_counter hour);
        set v_counter = v_counter + 1;
    end while;
    commit;

end #
delimiter ;
call load_test_data();
select * from customers order by cust_id desc limit 10;
select * from orders order by order_id desc limit 10;
For testing business rules, I actually prefer carefully thought out data over random data. Either from excel->csv->db or manually created insert statements.
One row for each boundary condition, say:
Customer without orders
One Customer with zero total cost
One customer with foreign characters in the name (because I always forget to deal with it)
One customer with max length name
One Customer with shit loads of orders (to make sure that the GUI still looks nice)
It makes it really easy to run regression tests because you "know" what the data should look like.
For performance testing, you can do pretty good with random data as long as the data distribution is realistic (which affects the usefulness of indexes). If you have very advanced requirements, your best bet is to use some software built for this purpose.
But often you can generate all the data you need from one single table of integers and clever use of built-in functions:
rand() -> Generate random number.
mod() -> Used to create repeating sequences (1,2,3,1,2,3)
lpad() and rpad() -> For padding strings to specified lengths
If you really want to get serious about setting up test data, you should go the fixture route. This will help you set up a pretty nice development environment and may integrate very nicely with your website's framework, if you're using one.
You can find a link to the documentation of the fixture module here
If you think that's a little too much work to get all working, look into the MySQLdb module which will help you insert data into your table.
It may be in poor taste to link back to another Stack Overflow question, but someone has already answered the random-date part of your question. You can find that here.
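If you go the MySQLdb route, here is a minimal sketch using the columns from the question; the orders/customers table names, connection details and row count are assumptions:

import datetime
import random
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()

# Pick foreign keys only from customers that actually exist
cur.execute("SELECT id FROM customers")
customer_ids = [row[0] for row in cur.fetchall()]

start = datetime.date(2010, 1, 1)
rows = [
    (
        random.choice(customer_ids),                              # random existing FK
        start + datetime.timedelta(days=random.randrange(365)),   # random day in 2010
        "Test order %d" % i,
        round(random.uniform(5, 500), 2),
    )
    for i in range(50)
]
cur.executemany(
    "INSERT INTO orders (customer_id, date_added, title, total_cost) "
    "VALUES (%s, %s, %s, %s)",
    rows,
)
conn.commit()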
This question is old and already answered, but I assume you may still want a stored procedure that loads dummy data into MySQL, runs from MySQL itself, and auto-populates the dummy data according to the column datatypes.
All you need to specify is the database name, the table name, and the number of records to populate.
call populate('sakila','film',1000,'N');
(You might want to follow on the Git-Repo for updates as well.)
I am currently analyzing a wikipedia dump file; I am extracting a bunch of data from it using python and persisting it into a PostgreSQL db. I am always trying to make things go faster for this file is huge (18GB). In order to interface with PostgreSQL, I am using psycopg2, but this module seems to mimic many other such DBAPIs.
Anyway, I have a question concerning cursor.executemany(command, values); it seems to me like executing an executemany once every 1000 values or so is better than calling cursor.execute(command % value) for each of these 5 million values (please confirm or correct me!).
But, you see, I am using executemany to INSERT 1000 rows into a table which has a UNIQUE integrity constraint; this constraint is not verified in Python beforehand, since that would either require me to SELECT all the time (which seems counterproductive) or require more than 3 GB of RAM. All this to say that I count on Postgres to warn me when my script tries to INSERT an already-existing row, by catching the psycopg2.DatabaseError.
When my script detects such a non-UNIQUE INSERT, it calls connection.rollback() (which discards up to 1000 rows every time, and kind of makes the executemany worthless) and then INSERTs all values one by one.
Since psycopg2 is so poorly documented (as are so many great modules...), I cannot find an efficient and effective workaround. I have reduced the number of values INSERTed per executemany from 1000 to 100 in order to reduce the likelihood of a non-UNIQUE INSERT per executemany, but I am pretty certain there is a way to just tell psycopg2 to ignore these exceptions, or to tell the cursor to continue the executemany.
Basically, this seems like the kind of problem which has a solution so easy and popular, that all I can do is ask in order to learn about it.
Thanks again!
Just copy all the data into a scratch table with the psql \copy command, or use the psycopg cursor.copy_in() method. Then:
insert into mytable
select * from (
    select distinct *
    from scratch
) uniq
where not exists (
    select 1
    from mytable
    where mytable.mykey = uniq.mykey
);
This will dedup and run much faster than any combination of inserts.
-dg
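A minimal psycopg2 sketch of the same approach (psycopg2 exposes the copy as copy_from(); the file, table and key names are assumptions):

import psycopg2

conn = psycopg2.connect(dbname="wikidb", user="me")
cur = conn.cursor()
with open("extracted_rows.tsv") as f:
    cur.copy_from(f, "scratch")   # tab-separated values, one row per line
cur.execute("""
    insert into mytable
    select * from (select distinct * from scratch) uniq
    where not exists (
        select 1 from mytable where mytable.mykey = uniq.mykey
    )
""")
conn.commit()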
I had the same problem and searched here for many days, collecting a lot of hints to form a complete solution. Even if the question is outdated, I hope this will be useful to others.
1) Forget about removing indexes/constraints and recreating them later; the benefits are marginal or worse.
2) executemany is better than execute because it prepares the statement for you. You can get the same result yourself with a command like the following, and gain around 300% in speed:
# To run only once:
sqlCmd = """PREPARE myInsert (int, timestamp, real, text) AS
    INSERT INTO myBigTable (idNumber, date_obs, result, user)
    SELECT $1, $2, $3, $4 WHERE NOT EXISTS
        (SELECT 1 FROM myBigTable WHERE (idNumber, date_obs, user) = ($1, $2, $4));"""
curPG.execute(sqlCmd)
cptInsert = 0  # To let you commit from time to time

# ... inside the big loop:
curPG.execute("EXECUTE myInsert(%s,%s,%s,%s);", myNewRecord)
alreadyExists = (curPG.rowcount < 1)
if not alreadyExists:
    cptInsert += 1
    if cptInsert % 10000 == 0:
        conPG.commit()
This dummy table example has a unique constraint on (idNumber, date_obs, user).
3) The best solution is to use COPY_FROM and a TRIGGER to manage the unique key BEFORE INSERT. This gave me a 36x speed-up: I started with normal inserts at 500 records/sec, and with "copy" I got over 18,000 records/sec. Sample code in Python with psycopg2:
import StringIO  # Python 2

ioResult = StringIO.StringIO()  # Use an in-memory buffer as a virtual file
cptInsert = 0  # To let you commit from time to time - memory has limitations

# ... inside the big loop:
print >> ioResult, "\t".join(map(str, myNewRecord))
cptInsert += 1
if cptInsert % 10000 == 0:
    ioResult = flushCopyBuffer(ioResult, curPG)

# ... after the loop:
ioResult = flushCopyBuffer(ioResult, curPG)

def flushCopyBuffer(bufferFile, cursorObj):
    bufferFile.seek(0)  # Easy to overlook: rewind the buffer before copy_from
    cursorObj.copy_from(bufferFile, 'myBigTable',
                        columns=('idNumber', 'date_obs', 'value', 'user'))
    cursorObj.connection.commit()
    bufferFile.close()
    bufferFile = StringIO.StringIO()
    return bufferFile
That's it for the Python part. Now the PostgreSQL trigger, so that you don't get a psycopg2.IntegrityError that causes all of the COPY command's records to be rejected:
CREATE OR REPLACE FUNCTION chk_exists()
RETURNS trigger AS $BODY$
DECLARE
    curRec RECORD;
BEGIN
    -- Check if the record's key already exists or is empty (the file's last line is)
    IF NEW.idNumber IS NULL THEN
        RETURN NULL;
    END IF;
    SELECT INTO curRec * FROM myBigTable
        WHERE (idNumber, date_obs, user) = (NEW.idNumber, NEW.date_obs, NEW.user);
    IF NOT FOUND THEN  -- OK, keep it
        RETURN NEW;
    ELSE
        RETURN NULL;   -- Oops: discard it (or update the existing record)
    END IF;
END;
$BODY$ LANGUAGE plpgsql;
Now link this function to the trigger of your table:
CREATE TRIGGER chk_exists_before_insert
BEFORE INSERT ON myBigTable FOR EACH ROW EXECUTE PROCEDURE chk_exists();
This seems like a lot of work but Postgresql is a very fast beast when it doesn't have to interpret SQL over and over. Have fun.
"When my script detects such a non-UNIQUE INSERT, it connection.rollback() (which makes ups to 1000 rows everytime, and kind of makes the executemany worthless) and then INSERTs all values one by one."
The question doesn't really make a lot of sense.
Does EVERY block of 1,000 rows fail due to non-unique rows?
Does 1 block of 1,000 rows fail (out of 5,000 such blocks)? If so, then executemany helps for 4,999 out of 5,000 blocks and is far from "worthless".
Are you worried about this non-UNIQUE INSERT? Or do you have actual statistics on how often it happens?
If you've switched from 1,000-row blocks to 100-row blocks, you can obviously determine whether there's a performance advantage for 1,000-row blocks, 100-row blocks and 1-row blocks.
Please actually run the real program against the real database with different block sizes and post the numbers.
Using a MERGE statement instead of a plain INSERT would solve your problem.
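Note that MERGE only arrived in PostgreSQL 15; on older versions, INSERT ... ON CONFLICT DO NOTHING (available since 9.5) has the same effect for this case and lets executemany skip duplicates instead of aborting the whole batch. A minimal sketch (table, column and connection names are assumptions):

import psycopg2

conn = psycopg2.connect(dbname="wikidb", user="me")
cur = conn.cursor()
batch_of_values = [(1, "first"), (2, "second")]   # stand-in for rows parsed from the dump
cur.executemany(
    "INSERT INTO mytable (mykey, payload) VALUES (%s, %s) "
    "ON CONFLICT (mykey) DO NOTHING",
    batch_of_values,
)
conn.commit()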