automatically populate table with dummy data in mysql - python

I have a MySQL table which I want to populate with some dummy data for testing (50+).
This table has a foreign key to another table so the dummy data must cross reference from that table but again be random i.e. can't all be the same foreign key.
It also has a date added field which I want to populate with a random date within a year span e.g. any date in the year 2010.
my table structure is:
id, customer_id, date_added, title, total_cost
where id is the primary key, customer_id is the foreign key and date_added is the date field.
What is the best way of doing this? I'd prefer to do it directly in MySQL, but failing that, my site runs on Python, so a way of doing it there would also work.

I would not do this in MySQL without outside help from an application written in Python.
There are several requirements built into your statement that are best expressed in a procedural style. SQL is a set-based language; I don't think it lends itself as nicely to the task at hand.
You'll want an application to take in data from a source, do whatever randomization and PII removal that you need, and then construct the test data according to your requirements.
If it's a database intended just for testing, you might consider an in-memory database that you can populate, modify all you like, and then blow away for your next test. I'm thinking of something like Hypersonic or Derby or TimesTen.
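For illustration, here is a minimal Python sketch of that approach using the MySQLdb module. The parent table/column names and the connection details are assumptions (only the orders-style table was described in the question), so adjust them to your schema:

import random
from datetime import datetime, timedelta

import MySQLdb  # pip install mysqlclient

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()

# Pull the real customer ids so every generated row references an existing customer.
cur.execute("SELECT id FROM customers")  # assumed parent table/column names
customer_ids = [row[0] for row in cur.fetchall()]

def random_date_in_2010():
    """Any date/time inside the year 2010."""
    return datetime(2010, 1, 1) + timedelta(days=random.randint(0, 364),
                                            seconds=random.randint(0, 86399))

rows = [(random.choice(customer_ids),        # random but valid foreign key
         random_date_in_2010(),
         "Dummy order %d" % i,
         round(random.uniform(5, 500), 2))
        for i in range(50)]

cur.executemany(
    "INSERT INTO orders (customer_id, date_added, title, total_cost) "
    "VALUES (%s, %s, %s, %s)",               # 'orders' stands in for your table name
    rows)
conn.commit()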

quick and dirty solution:
drop table if exists orders;
drop table if exists customers;
create table customers
(
cust_id int unsigned not null auto_increment primary key,
name varchar(255) not null
)
engine=innodb;
create table orders
(
order_id int unsigned not null auto_increment primary key,
cust_id int unsigned not null,
order_date datetime not null,
foreign key (cust_id) references customers(cust_id) on delete cascade
)
engine=innodb;
drop procedure if exists load_test_data;
delimiter #
create procedure load_test_data()
begin
  declare v_max_customers int unsigned default 0;
  declare v_max_orders int unsigned default 0;
  declare v_counter int unsigned default 0;
  declare v_rnd_cust_id int unsigned default 0;
  declare v_base_date datetime;

  set foreign_key_checks = 0;
  truncate table orders;
  truncate table customers;
  set foreign_key_checks = 1;

  set v_base_date = "2010-01-01 00:00:00";
  set v_max_customers = 1000;
  set v_max_orders = 10000;

  start transaction;
  set v_counter = 0;
  while v_counter < v_max_customers do
    insert into customers (name) values (concat('Customer ', v_counter + 1));
    set v_counter = v_counter + 1;
  end while;
  commit;

  start transaction;
  set v_counter = 0;
  while v_counter < v_max_orders do
    set v_rnd_cust_id = floor(1 + (rand() * v_max_customers));
    insert into orders (cust_id, order_date) values (v_rnd_cust_id, v_base_date + interval v_counter hour);
    set v_counter = v_counter + 1;
  end while;
  commit;
end #
delimiter ;
call load_test_data();
select * from customers order by cust_id desc limit 10;
select * from orders order by order_id desc limit 10;

For testing business rules, I actually prefer carefully thought out data over random data, either from Excel -> CSV -> DB or from manually created insert statements.
One row for each boundary condition, say:
One customer without orders
One customer with zero total cost
One customer with foreign characters in the name (because I always forget to deal with those)
One customer with a maximum-length name
One customer with a huge number of orders (to make sure the GUI still looks nice)
It makes it really easy to run regression tests because you "know" what the data should look like.
For performance testing, you can do pretty good with random data as long as the data distribution is realistic (which affects the usefulness of indexes). If you have very advanced requirements, your best bet is to use some software built for this purpose.
But often you can generate all the data you need from one single table of integers and clever use of built-in functions:
rand() -> Generate random number.
mod() -> Used to create repeating sequences (1,2,3,1,2,3)
lpad() and rpad() -> For padding strings to specified lengths
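For example, a rough sketch of that idea from Python, wrapping a single INSERT ... SELECT. It assumes a helper table integers(n) already populated with 0..9999 and customer ids 1..1000; both are assumptions here:

import MySQLdb  # pip install mysqlclient

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()

# One statement generates 10,000 rows from the integers table.
cur.execute("""
    INSERT INTO orders (customer_id, date_added, title, total_cost)
    SELECT 1 + MOD(n, 1000),                                          -- repeating customer ids
           DATE_ADD('2010-01-01', INTERVAL FLOOR(RAND() * 365) DAY),  -- random day in 2010
           CONCAT('Order ', LPAD(n, 6, '0')),                         -- fixed-width title
           ROUND(RAND() * 500, 2)
    FROM integers
""")
conn.commit()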

If you really want to get serious about setting up test data, you should go the fixture route. It will help you set up a pretty nice development environment and may integrate very nicely with your website's framework, if you're using one.
You can find a link to the documentation of the fixture module here.
If you think that's a little too much work to get going, look into the MySQLdb module, which will help you insert data into your table.
It may be in poor taste to link back to another Stack Overflow question, but someone has already answered the date question you are asking. You can find that here.

This question is old and answered, but I assume you may still want a stored procedure that loads dummy data into MySQL: it runs from within MySQL and auto-populates dummy data according to the column datatypes.
All you need to specify is the database name, the table name and the number of records to populate:
call populate('sakila','film',1000,'N');
(You might want to follow the Git repo for updates as well.)


Overwrite row if exists [duplicate]

Several months ago I learned from an answer on Stack Overflow how to perform multiple updates at once in MySQL using the following syntax:
INSERT INTO table (id, field, field2) VALUES (1, A, X), (2, B, Y), (3, C, Z)
ON DUPLICATE KEY UPDATE field=VALUES(field), field2=VALUES(field2);
I've now switched over to PostgreSQL and apparently this is not correct. It's referring to all the correct tables so I assume it's a matter of different keywords being used but I'm not sure where in the PostgreSQL documentation this is covered.
To clarify, I want to insert several things and if they already exist to update them.
PostgreSQL, since version 9.5, has UPSERT syntax with an ON CONFLICT clause, similar to MySQL's:
INSERT INTO the_table (id, column_1, column_2)
VALUES (1, 'A', 'X'), (2, 'B', 'Y'), (3, 'C', 'Z')
ON CONFLICT (id) DO UPDATE
SET column_1 = excluded.column_1,
column_2 = excluded.column_2;
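If you are driving this from Python, a minimal psycopg2 sketch of that statement might look like the following (the connection string is a placeholder; the table and column names are taken from the example above):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection string
rows = [(1, 'A', 'X'), (2, 'B', 'Y'), (3, 'C', 'Z')]

with conn, conn.cursor() as cur:                 # the with-block commits on success
    execute_values(cur,
        """
        INSERT INTO the_table (id, column_1, column_2)
        VALUES %s
        ON CONFLICT (id) DO UPDATE
        SET column_1 = excluded.column_1,
            column_2 = excluded.column_2
        """,
        rows)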
Searching PostgreSQL's mailing list archives for "upsert" leads to an example in the manual of what you possibly want to do:
Example 38-2. Exceptions with UPDATE/INSERT
This example uses exception handling to perform either UPDATE or INSERT, as appropriate:
CREATE TABLE db (a INT PRIMARY KEY, b TEXT);
CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
$$
BEGIN
LOOP
-- first try to update the key
-- note that "a" must be unique
UPDATE db SET b = data WHERE a = key;
IF found THEN
RETURN;
END IF;
-- not there, so try to insert the key
-- if someone else inserts the same key concurrently,
-- we could get a unique-key failure
BEGIN
INSERT INTO db(a,b) VALUES (key, data);
RETURN;
EXCEPTION WHEN unique_violation THEN
-- do nothing, and loop to try the UPDATE again
END;
END LOOP;
END;
$$
LANGUAGE plpgsql;
SELECT merge_db(1, 'david');
SELECT merge_db(1, 'dennis');
There's possibly an example of how to do this in bulk, using CTEs in 9.1 and above, in the hackers mailing list:
WITH foos AS (SELECT (UNNEST(%foo[])).*)
updated as (UPDATE foo SET foo.a = foos.a ... RETURNING foo.id)
INSERT INTO foo SELECT foos.* FROM foos LEFT JOIN updated USING(id)
WHERE updated.id IS NULL;
See a_horse_with_no_name's answer for a clearer example.
Warning: this is not safe if executed from multiple sessions at the same time (see caveats below).
Another clever way to do an "UPSERT" in postgresql is to do two sequential UPDATE/INSERT statements that are each designed to succeed or have no effect.
UPDATE table SET field='C', field2='Z' WHERE id=3;
INSERT INTO table (id, field, field2)
SELECT 3, 'C', 'Z'
WHERE NOT EXISTS (SELECT 1 FROM table WHERE id=3);
The UPDATE will succeed if a row with "id=3" already exists, otherwise it has no effect.
The INSERT will succeed only if row with "id=3" does not already exist.
You can combine these two into a single string and run them both with a single SQL statement execution from your application. Running them together in a single transaction is highly recommended.
This works very well when run in isolation or on a locked table, but is subject to race conditions that mean it might still fail with duplicate key error if a row is inserted concurrently, or might terminate with no row inserted when a row is deleted concurrently. A SERIALIZABLE transaction on PostgreSQL 9.1 or higher will handle it reliably at the cost of a very high serialization failure rate, meaning you'll have to retry a lot. See why is upsert so complicated, which discusses this case in more detail.
This approach is also subject to lost updates in read committed isolation unless the application checks the affected row counts and verifies that either the insert or the update affected a row.
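A rough Python/psycopg2 sketch of that pattern, including the affected-row-count check mentioned above (mytable and the column names are stand-ins for the example's table):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection string

def upsert_row(row_id, field, field2):
    with conn, conn.cursor() as cur:             # both statements in one transaction
        cur.execute(
            "UPDATE mytable SET field = %s, field2 = %s WHERE id = %s",
            (field, field2, row_id))
        if cur.rowcount == 0:                    # nothing updated, so try the insert
            cur.execute(
                "INSERT INTO mytable (id, field, field2) "
                "SELECT %s, %s, %s "
                "WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE id = %s)",
                (row_id, field, field2, row_id))
            if cur.rowcount == 0:
                # A concurrent session slipped a row in between our two statements;
                # per the caveats above, the caller should retry.
                raise RuntimeError("upsert race detected, retry")

upsert_row(3, 'C', 'Z')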
With PostgreSQL 9.1 this can be achieved using a writeable CTE (common table expression):
WITH new_values (id, field1, field2) as (
values
(1, 'A', 'X'),
(2, 'B', 'Y'),
(3, 'C', 'Z')
),
upsert as
(
update mytable m
set field1 = nv.field1,
field2 = nv.field2
FROM new_values nv
WHERE m.id = nv.id
RETURNING m.*
)
INSERT INTO mytable (id, field1, field2)
SELECT id, field1, field2
FROM new_values
WHERE NOT EXISTS (SELECT 1
FROM upsert up
WHERE up.id = new_values.id)
See these blog entries:
Upserting via Writeable CTE
WAITING FOR 9.1 – WRITABLE CTE
WHY IS UPSERT SO COMPLICATED?
Note that this solution does not prevent a unique key violation but it is not vulnerable to lost updates.
See the follow up by Craig Ringer on dba.stackexchange.com
In PostgreSQL 9.5 and newer you can use INSERT ... ON CONFLICT UPDATE.
See the documentation.
A MySQL INSERT ... ON DUPLICATE KEY UPDATE can be directly rephrased to an ON CONFLICT UPDATE. Neither is SQL-standard syntax; they're both database-specific extensions. There are good reasons MERGE wasn't used for this; a new syntax wasn't created just for fun. (MySQL's syntax also has issues that mean it wasn't adopted directly.)
e.g. given setup:
CREATE TABLE tablename (a integer primary key, b integer, c integer);
INSERT INTO tablename (a, b, c) values (1, 2, 3);
the MySQL query:
INSERT INTO tablename (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
becomes:
INSERT INTO tablename (a, b, c) values (1, 2, 10)
ON CONFLICT (a) DO UPDATE SET c = tablename.c + 1;
Differences:
You must specify the column name (or unique constraint name) to use for the uniqueness check. That's the ON CONFLICT (columnname) DO part.
The keyword SET must be used, as if this were a normal UPDATE statement.
It has some nice features too:
You can have a WHERE clause on your UPDATE (letting you effectively turn ON CONFLICT UPDATE into ON CONFLICT IGNORE for certain values)
The proposed-for-insertion values are available as the row-variable EXCLUDED, which has the same structure as the target table. You can get the original values in the table by using the table name. So in this case EXCLUDED.c will be 10 (because that's what we tried to insert) and tablename.c will be 3 because that's the current value in the table. You can use either or both in the SET expressions and WHERE clause.
For background on upsert see How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
I was looking for the same thing when I came here, but the lack of a generic "upsert" function bothered me a bit, so I thought you could just pass the update and insert SQL as arguments to that function from the manual.
That would look like this:
CREATE FUNCTION upsert (sql_update TEXT, sql_insert TEXT)
RETURNS VOID
LANGUAGE plpgsql
AS $$
BEGIN
LOOP
-- first try to update
EXECUTE sql_update;
-- check if the row is found
IF FOUND THEN
RETURN;
END IF;
-- not found so insert the row
BEGIN
EXECUTE sql_insert;
RETURN;
EXCEPTION WHEN unique_violation THEN
-- do nothing and loop
END;
END LOOP;
END;
$$;
And perhaps to do what you initially wanted, a batch "upsert", you could use Tcl to split the sql_update and loop over the individual updates; the performance hit will be very small, see http://archives.postgresql.org/pgsql-performance/2006-04/msg00557.php
The highest cost is executing the query from your code; on the database side the execution cost is much smaller.
There is no simple command to do it.
The most correct approach is to use a function, like the one from the docs.
Another solution (although not that safe) is to do an update with returning, check which rows were updated, and insert the rest of them.
Something along the lines of:
update table
set column = x.column
from (values (1,'aa'),(2,'bb'),(3,'cc')) as x (id, column)
where table.id = x.id
returning id;
assuming id:2 was returned:
insert into table (id, column) values (1, 'aa'), (3, 'cc');
Of course it will bail out sooner or later (in a concurrent environment), as there is a clear race condition here, but usually it will work.
Here's a longer and more comprehensive article on the topic.
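A rough Python sketch of that update-returning-then-insert idea using psycopg2 (the table and column names are made up, and it inherits the race condition just described):

import psycopg2
from psycopg2.extras import execute_values

rows = [(1, 'aa'), (2, 'bb'), (3, 'cc')]           # (id, value) pairs to upsert

conn = psycopg2.connect("dbname=mydb user=me")     # placeholder connection string
with conn, conn.cursor() as cur:
    # Step 1: update what already exists and collect the ids that were hit.
    updated = execute_values(cur,
        "UPDATE mytable m SET val = x.val "
        "FROM (VALUES %s) AS x (id, val) "
        "WHERE m.id = x.id RETURNING m.id",
        rows, fetch=True)
    updated_ids = {r[0] for r in updated}

    # Step 2: insert only the rows whose id the UPDATE did not return.
    missing = [r for r in rows if r[0] not in updated_ids]
    if missing:
        execute_values(cur,
            "INSERT INTO mytable (id, val) VALUES %s", missing)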
I use this merge function:
CREATE OR REPLACE FUNCTION merge_tabla(key INT, data TEXT)
RETURNS void AS
$BODY$
BEGIN
IF EXISTS(SELECT a FROM tabla WHERE a = key)
THEN
UPDATE tabla SET b = data WHERE a = key;
RETURN;
ELSE
INSERT INTO tabla(a,b) VALUES (key, data);
RETURN;
END IF;
END;
$BODY$
LANGUAGE plpgsql
Personally, I've set up a "rule" attached to the insert statement. Say you had a "dns" table that recorded dns hits per customer on a per-time basis:
CREATE TABLE dns (
"time" timestamp without time zone NOT NULL,
customer_id integer NOT NULL,
hits integer
);
You wanted to be able to re-insert rows with updated values, or create them if they didn't exist already. Keyed on the customer_id and the time. Something like this:
CREATE RULE replace_dns AS
ON INSERT TO dns
WHERE (EXISTS (SELECT 1 FROM dns WHERE ((dns."time" = new."time")
AND (dns.customer_id = new.customer_id))))
DO INSTEAD UPDATE dns
SET hits = new.hits
WHERE ((dns."time" = new."time") AND (dns.customer_id = new.customer_id));
Update: This has the potential to fail if simultaneous inserts are happening, as it will generate unique_violation exceptions. However, the non-terminated transaction will continue and succeed, and you just need to repeat the terminated transaction.
However, if there are tons of inserts happening all the time, you will want to put a table lock around the insert statements: SHARE ROW EXCLUSIVE locking will prevent any operations that could insert, delete or update rows in your target table. However, updates that do not update the unique key are safe, so if no operation will do this, use advisory locks instead.
Also, the COPY command does not use RULES, so if you're inserting with COPY, you'll need to use triggers instead.
Similar to the most-liked answer, but works slightly faster:
WITH upsert AS (UPDATE spider_count SET tally=1 WHERE date='today' RETURNING *)
INSERT INTO spider_count (spider, tally) SELECT 'Googlebot', 1 WHERE NOT EXISTS (SELECT * FROM upsert)
(source: http://www.the-art-of-web.com/sql/upsert/)
Here is my custom "upsert" function, based on the one above, if you want to INSERT AND REPLACE:
CREATE OR REPLACE FUNCTION upsert(sql_insert text, sql_update text)
RETURNS void AS
$BODY$
BEGIN
-- first try to insert and after to update. Note : insert has pk and update not...
EXECUTE sql_insert;
RETURN;
EXCEPTION WHEN unique_violation THEN
EXECUTE sql_update;
IF FOUND THEN
RETURN;
END IF;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION upsert(text, text)
OWNER TO postgres;
And after that, execute it with something like this:
SELECT upsert($$INSERT INTO ...$$,$$UPDATE... $$)
It is important to use double dollar quoting ($$ ... $$) to avoid compiler errors.
Check the speed...
According to the PostgreSQL documentation of the INSERT statement, handling the ON DUPLICATE KEY case is not supported. That part of the syntax is a proprietary MySQL extension.
I have the same issue for managing account settings as name value pairs.
The design criteria is that different clients could have different settings sets.
My solution, similar to JWP is to bulk erase and replace, generating the merge record within your application.
This is pretty bulletproof, platform independent and since there are never more than about 20 settings per client, this is only 3 fairly low load db calls - probably the fastest method.
The alternative of updating individual rows - checking for exceptions, then inserting - or some combination thereof, is hideous code, slow and often breaks because (as mentioned above) non-standard SQL exception handling changes from db to db - or even release to release.
#This is pseudo-code - within the application:
BEGIN TRANSACTION - get transaction lock
SELECT all current name value pairs where id = $id into a hash record
create a merge record from the current and update record
(set intersection where shared keys in new win, and empty values in new are deleted).
DELETE all name value pairs where id = $id
COPY/INSERT merged records
END TRANSACTION
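A rough Python translation of that pseudo-code using psycopg2 (the account_settings table layout and all names are invented for the example):

import psycopg2
from psycopg2.extras import execute_values

def save_settings(conn, account_id, new_settings):
    """Replace the name/value pairs for one account with a merged set."""
    with conn, conn.cursor() as cur:                       # single transaction
        cur.execute("SELECT name, value FROM account_settings "
                    "WHERE account_id = %s FOR UPDATE", (account_id,))
        current = dict(cur.fetchall())

        # Merge: keys in the new record win, empty new values mean "delete this setting".
        merged = {**current, **new_settings}
        merged = {k: v for k, v in merged.items() if v not in ("", None)}

        cur.execute("DELETE FROM account_settings WHERE account_id = %s",
                    (account_id,))
        execute_values(cur,
            "INSERT INTO account_settings (account_id, name, value) VALUES %s",
            [(account_id, k, v) for k, v in merged.items()])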
CREATE OR REPLACE FUNCTION save_user(_id integer, _name character varying)
RETURNS boolean AS
$BODY$
BEGIN
UPDATE users SET name = _name WHERE id = _id;
IF FOUND THEN
RETURN true;
END IF;
BEGIN
INSERT INTO users (id, name) VALUES (_id, _name);
EXCEPTION WHEN OTHERS THEN
UPDATE users SET name = _name WHERE id = _id;
END;
RETURN TRUE;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
For merging small sets, using the above function is fine. However, if you are merging large amounts of data, I'd suggest looking into http://mbk.projects.postgresql.org
The current best practice that I'm aware of is:
COPY new/updated data into temp table (sure, or you can do INSERT if the cost is ok)
Acquire Lock [optional] (advisory is preferable to table locks, IMO)
Merge. (the fun part)
UPDATE will return the number of modified rows. If you use JDBC (Java), you can then check this value against 0 and, if no rows have been affected, fire an INSERT instead. If you use some other programming language, the number of modified rows can probably still be obtained; check the documentation.
This may not be as elegant, but you end up with much simpler SQL that is more trivial to use from the calling code. In contrast, if you write the ten-line script in PL/pgSQL, you probably should have a unit test of one kind or another just for it alone.
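The same check works from Python: DB-API cursors expose rowcount, so a sketch (with made-up table and column names, and %s placeholders as used by psycopg2 or MySQLdb) would be:

def upsert_user(cur, user_id, name):
    # Run the UPDATE first and look at how many rows it touched.
    cur.execute("UPDATE users SET name = %s WHERE id = %s", (name, user_id))
    if cur.rowcount == 0:
        # Nothing was updated, so the row doesn't exist yet: insert it.
        cur.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (user_id, name))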
Edit: This does not work as expected. Unlike the accepted answer, this produces unique key violations when two processes repeatedly call upsert_foo concurrently.
Eureka! I figured out a way to do it in one query: use UPDATE ... RETURNING to test if any rows were affected:
CREATE TABLE foo (k INT PRIMARY KEY, v TEXT);
CREATE FUNCTION update_foo(k INT, v TEXT)
RETURNS SETOF INT AS $$
UPDATE foo SET v = $2 WHERE k = $1 RETURNING $1
$$ LANGUAGE sql;
CREATE FUNCTION upsert_foo(k INT, v TEXT)
RETURNS VOID AS $$
INSERT INTO foo
SELECT $1, $2
WHERE NOT EXISTS (SELECT update_foo($1, $2))
$$ LANGUAGE sql;
The UPDATE has to be done in a separate procedure because, unfortunately, this is a syntax error:
... WHERE NOT EXISTS (UPDATE ...)
Now it works as desired:
SELECT upsert_foo(1, 'hi');
SELECT upsert_foo(1, 'bye');
SELECT upsert_foo(3, 'hi');
SELECT upsert_foo(3, 'bye');
PostgreSQL >= v15
Big news on this topic: as of PostgreSQL v15 it is possible to use the MERGE command. In fact, this long-awaited feature was listed first among the improvements of the v15 release.
This is similar to INSERT ... ON CONFLICT but more batch-oriented. It has a powerful WHEN MATCHED vs WHEN NOT MATCHED structure that gives the ability to INSERT, UPDATE or DELETE on such conditions.
It not only eases bulk changes, it even adds more control than the traditional UPSERT and INSERT ... ON CONFLICT.
Take a look at this very complete sample from the official page:
MERGE INTO wines w
USING wine_stock_changes s
ON s.winename = w.winename
WHEN NOT MATCHED AND s.stock_delta > 0 THEN
INSERT VALUES(s.winename, s.stock_delta)
WHEN MATCHED AND w.stock + s.stock_delta > 0 THEN
UPDATE SET stock = w.stock + s.stock_delta
WHEN MATCHED THEN
DELETE;
PostgreSQL v9, v10, v11, v12, v13, v14
If your version is below v15 and at least v9.5, probably the best choice is to use the UPSERT syntax with the ON CONFLICT clause.
Here is an example of how to do an upsert with parameters and without special SQL constructions, for when you have a special condition (sometimes you can't use ON CONFLICT because you can't create the constraint):
WITH upd AS
(
update view_layer set metadata=:metadata where layer_id = :layer_id and view_id = :view_id returning id
)
insert into view_layer (layer_id, view_id, metadata)
(select :layer_id layer_id, :view_id view_id, :metadata metadata FROM view_layer l
where NOT EXISTS(select id FROM upd WHERE id IS NOT NULL) limit 1)
returning id
maybe it will be helpful

SQLITE_MAX_VARIABLE_NUMBER increase or break sql query into chunks

I am trying to automate a Python script with a batch file. The code works fine on my own computer but runs into an operational error, "too many SQL variables", when I run it with the batch file on a remote desktop.
This is apparently because the limit on a SQL query is 999 parameters, and mine has more than that. How do I actually increase this limit, or break the data into chunks of 999 columns? I came across many posts saying to increase this limit at compile time, but I don't know how to do so, and to_sql has a chunksize argument, but that is for rows and not columns. I'm using SQLite.
My python code to insert data is:
df.tail(1).to_sql("table", sqlcon, if_exists="append", index=True)
Thanks !
A schema with more than 999 columns should probably be rethought. That said, here's how to work around it.
You can upgrade to SQLite 3.32.0 or later, where SQLITE_MAX_VARIABLE_NUMBER defaults to 32766. If you need more than that, the schema is the real problem.
Otherwise, if for some reason upgrading is not an option, the hard-coded limits can only be lowered at runtime. If you want to raise them you will have to recompile SQLite with a higher SQLITE_MAX_VARIABLE_NUMBER, which will make your program difficult to deploy using standard dependency managers.
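To see what limit your deployed Python is actually linked against, you can check at runtime. The getlimit call below requires Python 3.11+; on older versions only the version string is available:

import sqlite3

print(sqlite3.sqlite_version)        # version of the SQLite library Python is linked against

con = sqlite3.connect(":memory:")
if hasattr(con, "getlimit"):         # Connection.getlimit exists on Python 3.11+
    print(con.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER))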
Yes I've thought about that, but for this purpose, because the rows are basically dates and the column names are securities that we need to store stuff for, I don't think I can really change it.
This is a job for a join table.
create table securities (
id integer primary key,
symbol text not null unique,
name text not null
);
create table security_prices (
security_id integer not null references securities(id),
retrieved_at datetime not null,
price integer not null
);
select symbol, price
from security_prices sp
join securities s on s.id = sp.security_id
where retrieved_at = ?
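If the wide DataFrame (dates as the index, one column per security) is what you already have, you can reshape it into this long layout before calling to_sql. A sketch, where security_prices_staging is an assumed staging table keyed by symbol (you would still map symbol to security_id afterwards):

import pandas as pd

# df: index = dates, columns = security symbols, values = prices (assumed layout)
long_df = (
    df.tail(1)
      .rename_axis("retrieved_at")     # give the date index a column name
      .reset_index()
      .melt(id_vars="retrieved_at", var_name="symbol", value_name="price")
)

# The long format needs only 3 bind parameters per row instead of one per column,
# so the 999-variable limit is no longer an issue.
long_df.to_sql("security_prices_staging", sqlcon, if_exists="append", index=False)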

How to insert data into another table with same id by mysql or python

I have the following table:
I want like this:
Thanks!
I don't recommend this process because it is extremely messy, but if it helps, more power to you, I guess. Also, make sure that when you post a question, you include the code that the question is referring to. It helps other users a lot more and makes them much more willing to help!
Because MySQL will not let you create duplicate column names, I suggest you either uniquely name the columns based on why they're separated this way, or make separate tables.
Unique Column Names:
CREATE TABLE tableName (
  id INT NOT NULL,
  time1 TIME NOT NULL,         -- just continue naming your columns in a distinct way
  action1 CHAR(6) NOT NULL,
  station1 VARCHAR(50) NOT NULL,
  PRIMARY KEY(id)              -- use this to uniquely identify each row by a distinct column (probably id)
);
To make separate tables, just repeat the table above with a different table name as many times as you need. To connect all your tables together, use a foreign key. That key column needs to have entries that exist in all tables so that the table data is connected. Read more about it here: Foreign Keys and How to Use Them
Good Luck!

How to normalize data efficiently while INSERTing into SQL table (Postgres)

I want to import a large log file into (Postgres-)SQL
Certain string columns are very repetitive; for example, the column 'event_type' has 1 of 10 different string values.
I have a rough understanding of normalizing data.
Firstly, is it correct to assume that it's beneficial (for storage size, indexing and query speed) to store event_type in a separate table (possibly with a foreign key relation)?
In order to normalize, I would have to check for the distinct values of event_type in the raw log and insert them into the event_types table.
There are many field types like event_type.
So, secondly: is there a way to tell the database to create and maintain this kind of table when inserting the data?
Are there other strategies to accomplish this? I'm working with pandas.
This is a typical situation when starting to build a database from data hitherto stored otherwise, such as in a log file. There is a solution - as usual - but it is not a very fast one. Perhaps you can write a log message handler to process messages as they come in; provided the flux (messages/second) is not too large you won't notice the overhead, especially if you can forget about writing the message to a flat text file.
Firstly, on the issue of normalization. Yes, you should always normalize, and to the so-called 3rd Normal Form (3NF). This basically implies that any kind of real-world data (such as your event_type) is stored once and once only. (There are cases where you could relax this a little and go to 2NF - usually only when the real-world data requires very little storage, such as an ISO country code, an M/F (male/female) choice, etc - but in most other cases 3NF will be better.)
In your specific case, let's say that your event_type is a char(20) type. Ten such events with their corresponding int codes easily fit on a single database page, typically 4kB of disk space. If you have 1,000 log messages with event_type as a char(20) then you need 20kB just to store that information, or five database pages. If you have other such items in your log message then the storage reduction becomes correspondingly larger. Other items such as date or timestamp can be stored in their native format (4 and 8 bytes, respectively) for smaller storage, better performance and increased functionality (such as comparing dates or looking at ranges).
Secondly, you cannot tell the database to create such tables; you have to do that yourself. But once created, a stored procedure can parse your log messages and put the data in the right tables.
In the case of log messages, you can do something like this (assuming you want to do the parsing in the database and thus not in python):
CREATE FUNCTION ingest_log_message(mess text) RETURNS int AS $$
DECLARE
parts text[];
et_id int;
log_id int;
BEGIN
parts := regexp_split_to_array(mess, ','); -- Whatever your delimiter is
-- Assuming:
-- parts[1] is a timestamp
-- parts[2] is your event_type
-- parts[3] is the actual message
-- Get the event_type identifier. If event_type is new, INSERT it, else just get the id.
-- Do likewise with other log message parts whose unique text is located in a separate table.
SELECT id INTO et_id
FROM event_type
WHERE type_text = quote_literal(parts[2]);
IF NOT FOUND THEN
INSERT INTO event_type (type_text)
VALUES (quote_literal(parts[2]))
RETURNING id INTO et_id;
END IF;
-- Now insert the log message
INSERT INTO log_message (dt, et, msg)
VALUES (parts[1]::timestamp, et_id, quote_literal(parts[3]))
RETURNING id INTO log_id;
RETURN log_id;
END; $$ LANGUAGE plpgsql STRICT;
The tables you need for this are:
CREATE TABLE event_type (
id serial PRIMARY KEY,
type_text char(20)
);
and
CREATE TABLE log_message (
id serial PRIMARY KEY,
dt timestamp,
et integer REFERENCES event_type,
msg text
);
You can then invoke this function as a simple SELECT statement, which will return the id of the newly inserted log message:
SELECT * FROM ingest_log_message(the_message);
Note the use of the quote_literal() function in the function body. This has two important functions: (1) Quotes inside the string are properly escaped (so that words like "isn't" don't mess up the command); and (2) It guards against SQL-injection by malicious generators of log messages.
All of the above obviously needs to be tailored to your specific situation.
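From Python (the question mentions pandas), each raw line can then be handed to the function; a psycopg2 sketch with placeholder connection details and an assumed log file name:

import psycopg2

conn = psycopg2.connect("dbname=logs user=me")    # placeholder connection string

with conn, conn.cursor() as cur:
    with open("app.log") as fh:                   # assumed log file
        for line in fh:
            cur.execute("SELECT ingest_log_message(%s)", (line.rstrip("\n"),))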

Filtering sqlite - performing actions one by one

I'm working on a Python program that interacts with a simple sqlite database. I'm trying to build a search tool that will be able to, depending on user input, interactively "filter" the database and then return rows (items) that match the search. For example...
My Python program (through if statements, cgi.FieldStorage(), and whatnot) should be able to accept user input and then hunt through the database. Here's the general code for the program:
import cgitb; cgitb.enable()
import cgi
import sys
import sqlite3 as lite
con = lite.connect('bikes.db')
form = cgi.FieldStorage()
terrain_get = form.getlist("terrain")
terrains = ",".join(terrain_get)
handlebar_get = form.getlist("handlebar")
handlebars = ",".join(handlebar_get)
kickstand = form['kickstand'].value
As you can see, that part is what receives the user's input; works fine (I think). Next, where I need help:
if 'dirtrocky' not in terrains:
FILTER the database to not return items that have "dirtrocky' in their terrain field
And then later in the program, I want to be able to extend on my filter:
if 'drop' not in handlebars:
FILTER the database to, much like in previous one, not return items that have 'drop' in their 'handlebar' field
My question is, HOW can I filter the database? My end result should ideally be a tuple of IDs for rows that are left after I 'filter away' the above.
Thanks!
First, you should define your database schema. The most common approach is to create a fully normalized database, something like:
CREATE TABLE bikes (
bike_id INTEGER PRIMARY KEY AUTOINCREMENT,
manufacturer VARCHAR(20),
price FLOAT,
...
);
CREATE TABLE terrains (
terrain_id INTEGER PRIMARY KEY AUTOINCREMENT,
terrain VARCHAR(20),
...
);
CREATE TABLE handlebars (
handlebar_id INTEGER PRIMARY KEY AUTOINCREMENT,
handlebar VARCHAR(20),
...
);
CREATE TABLE bike_terrain (
bike_id INTEGER,
terrain_id INTEGER
);
CREATE TABLE bike_handlebar (
bike_id INTEGER,
handlebar_id INTEGER
);
Note that the bikes table does not contain anything about terrain types or handlebars: that info is stored in connecting tables like bike_terrain.
This fully normalized design makes the database a little bit cumbersome to populate, but on the other hand it makes it much easier to query.
How do you query it for multi-valued fields?
You will need to construct your SQL statement dynamically, something like this:
SELECT
b.manufacturer,
b.price
FROM bikes b,
terrains t,
bike_terrain bt
WHERE b.bike_id = bt.bike_id
AND t.terrain_id = bt.terrain_id
AND t.terrain IN ('mountain', 'dirt', ...) -- this will be built dynamically
... -- add more for handlebars, etc...
Almost the whole WHERE clause will have to be built and added dynamically, by constructing your SQL statement on the fly.
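Building on that, here is a sketch of how the statement could be assembled from the form values in your code, following the IN-list approach above (the schema is the one just defined; terrain_get, handlebar_get and con come from your question; the ? placeholders keep it safe from SQL injection):

sql = """
    SELECT DISTINCT b.bike_id
    FROM bikes b
    JOIN bike_terrain bt ON bt.bike_id = b.bike_id
    JOIN terrains t ON t.terrain_id = bt.terrain_id
    JOIN bike_handlebar bh ON bh.bike_id = b.bike_id
    JOIN handlebars h ON h.handlebar_id = bh.handlebar_id
    WHERE 1 = 1
"""
params = []

if terrain_get:                                   # only filter when the user picked something
    sql += " AND t.terrain IN (%s)" % ",".join("?" * len(terrain_get))
    params.extend(terrain_get)

if handlebar_get:
    sql += " AND h.handlebar IN (%s)" % ",".join("?" * len(handlebar_get))
    params.extend(handlebar_get)

cur = con.cursor()
cur.execute(sql, params)
matching_ids = tuple(row[0] for row in cur.fetchall())   # the tuple of ids you wanted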
I highly recommend getting a good SQLite GUI to work on this. On Windows, SQLite Expert Personal is superb, and on Linux sqliteman is great.
Once your database is populated and has more than a few hundred rows, you should add proper indexes so it stays fast. Good luck!
