How to normalize data efficiently while INSERTing into SQL table (Postgres) - python

I want to import a large log file into a (Postgres-)SQL database.
Certain string columns are very repetitive; for example, the column 'event_type' holds one of only 10 distinct string values.
I have a rough understanding of normalizing data.
Firstly, is it correct to assume that it's beneficial (for storage size, indexing and query speed) to store event_type in a separate table (possibly with a foreign key relation)?
In order to normalize I would have to check for the distinct values of event_type in the raw log and insert them into the event_types table.
There are many fields like event_type.
So, secondly: is there a way to tell the database to create and maintain this kind of table when inserting the data?
Are there other strategies to accomplish this? I'm working with pandas.

This is a typical situation when starting to build a database from data hitherto stored otherwise, such as in a log file. There is a solution - as usual - but it is not a very fast one. Perhaps you can write a log message handler to process messages as they come in; provided the flux (messages/second) is not too large, you won't notice the overhead, especially if you can forget about writing the message to a flat text file.
Firstly, on the issue of normalization: yes, you should always normalize, and to the so-called 3rd Normal Form (3NF). This basically implies that any kind of real-world data (such as your event_type) is stored once and once only. (There are cases where you could relax this a little and go to 2NF - usually only when the real-world data requires very little storage, such as an ISO country code, an M/F (male/female) choice, etc. - but in most other cases 3NF will be better.)
In your specific case, let's say that your event_type is a char(20) type. Ten such events with their corresponding int codes easily fit on a single database page (8 kB in PostgreSQL by default). If you have 1,000 log messages storing event_type as a char(20), you need 20 kB just for that column, or roughly three database pages. If you have other such items in your log messages, the storage reduction becomes correspondingly larger. Other items such as dates or timestamps can be stored in their native format (4 and 8 bytes, respectively) for smaller storage, better performance and increased functionality (such as comparing dates or looking at ranges).
Secondly, you cannot tell the database to create such tables; you have to do that yourself. But once they exist, a stored procedure can parse your log messages and put the data in the right tables.
In the case of log messages, you can do something like this (assuming you want to do the parsing in the database rather than in Python):
CREATE FUNCTION ingest_log_message(mess text) RETURNS int AS $$
DECLARE
    parts  text[];
    et_id  int;
    log_id int;
BEGIN
    parts := regexp_split_to_array(mess, ','); -- Whatever your delimiter is
    -- Assuming:
    -- parts[1] is a timestamp
    -- parts[2] is your event_type
    -- parts[3] is the actual message

    -- Get the event_type identifier. If the event_type is new, INSERT it, else just get the id.
    -- Do likewise with other log message parts whose unique text is located in a separate table.
    SELECT id INTO et_id
    FROM event_type
    WHERE type_text = parts[2];
    IF NOT FOUND THEN
        INSERT INTO event_type (type_text)
        VALUES (parts[2])
        RETURNING id INTO et_id;
    END IF;

    -- Now insert the log message
    INSERT INTO log_message (dt, et, msg)
    VALUES (parts[1]::timestamp, et_id, parts[3])
    RETURNING id INTO log_id;
    RETURN log_id;
END; $$ LANGUAGE plpgsql STRICT;
The tables you need for this are:
CREATE TABLE event_type (
    id serial PRIMARY KEY,
    type_text char(20)
);
and
CREATE TABLE log_message (
    id serial PRIMARY KEY,
    dt timestamp,
    et integer REFERENCES event_type,
    msg text
);
You can then invoke this function as a simple SELECT statement, which will return the id of the newly inserted log message:
SELECT * FROM ingest_log_message(the_message);
Note that the function body needs no quote_literal() calls or other manual escaping: plpgsql passes variables such as parts[2] to the embedded SQL statements as bound values, so quotes inside the string (words like "isn't") are handled correctly and malicious log messages cannot perform SQL injection.
All of the above obviously needs to be tailored to your specific situation.
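Since the question mentions doing the load from Python/pandas, here is a minimal sketch of driving that function from Python (psycopg2 is an assumed driver choice; the connection string and log file name are hypothetical):
import psycopg2

conn = psycopg2.connect("dbname=logs")      # hypothetical connection string
with conn, conn.cursor() as cur:
    with open("events.log") as fh:          # hypothetical log file name
        for line in fh:
            cur.execute("SELECT ingest_log_message(%s)", (line.rstrip("\n"),))
Each call returns the id of the inserted log message, so you can collect those ids if you need them later.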

Related

SQLITE_MAX_VARIABLE_NUMBER increase or break sql query into chunks

I am trying to automate a Python script with a batch file, and the code works fine on my own computer but runs into an operational error "too many SQL variables" when I run it with the batch file on a remote desktop.
This is apparently because the limit on an SQL query is 999 parameters, and mine has more than that. How do I actually increase this limit or break the data into chunks of 999 columns? I came across many posts saying to increase this limit at compile time, but I don't know how to do so, and to_sql has a chunksize parameter, but that applies to rows, not columns. I'm using SQLite.
My python code to insert data is:
df.tail(1).to_sql("table", sqlcon, if_exists="append", index=True)
Thanks!
A schema with more than 999 columns should probably be rethought. That said, here's how to work around it.
You can upgrade to SQLite 3.32.0 or later, where SQLITE_MAX_VARIABLE_NUMBER defaults to 32766. And if you need more than that, you really shouldn't be designing databases.
Otherwise, if for some reason upgrading is not an option, the hard-coded limits can only be lowered at runtime. If you want to raise them, you will have to recompile SQLite with a higher SQLITE_MAX_VARIABLE_NUMBER, which will make your program difficult to deploy using standard dependency managers.
Yes, I've thought about that, but for this purpose, because the rows are basically dates and the column names are securities that we need to store data for, I don't think I can really change it.
This is a job for a join table.
create table securities (
id integer primary key,
symbol text not null unique,
name text not null
);
create table security_prices (
security_id integer not null references securities(id),
retrieved_at datetime not null,
price integer not null
);
select symbol, price
from security_prices sp
join securities s on s.id = sp.security_id
where retrieved_at = ?
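If the wide DataFrame in the question has dates as its index and one column per security, a rough pandas sketch of reshaping it for that join table could look like this (table and column names follow the answer above; sqlcon is assumed to be the plain sqlite3 connection from the question):
import pandas as pd

# df: index = dates, one column of prices per security symbol (assumed shape)
long_df = (
    df.rename_axis("retrieved_at")   # name the date index
      .reset_index()
      .melt(id_vars="retrieved_at", var_name="symbol", value_name="price")
)
# Map each symbol to its id in the securities table, then append.
symbol_to_id = dict(sqlcon.execute("SELECT symbol, id FROM securities").fetchall())
long_df["security_id"] = long_df["symbol"].map(symbol_to_id)
long_df[["security_id", "retrieved_at", "price"]].to_sql(
    "security_prices", sqlcon, if_exists="append", index=False
)
Each row now binds only three parameters, so the 999-variable limit never comes into play.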

Creating an id based on schema data that is both unique and prevents duplicates

So I am inserting data that looks like this into my MongoDB collection; it is some polling data.
Link to sample insertions
What I plan on doing is combining the "Poll_Name", "Date", "Sample_Size" and "MoE" values into a unique string and then using some function to convert it into a unique id value.
What I want from this function is the ability to create an id for each poll, and to produce the same id again if a duplicate string is given.
So, for example, let's say I wish to add this poll to my database:
{'Poll_Name': 'NBC News/Marist', 'Date': '2020-03-10', 'Sample_Size': '2523 RV', 'MoE': '2.7', 'Biden (D)': '47', 'Trump(R)': '46', 'Spread': 'Biden +1'}
and I create an id from this poll using its "Poll_Name", "Date", "Sample_Size" and "MoE" values,
so the string would come out something like this:
poll_String = "NBC News/Marist2020-03-102523RV2.7"
And then I put it through the function that creates an id, and let's say it spits out the value "12345" (for simplicity's sake).
Then let's say that later on in the insertions I am adding an exact duplicate of this poll, so the "poll_String" comes out exactly the same for the duplicate.
I need the id-creation function to return the exact same value, i.e. 12345, so that I then know the poll being added here is a duplicate. And obviously, in the process, it must keep the created id completely unique with respect to polls that differ, so as not to create incorrect duplicate ids.
Is this possible? Or am I asking for something too advanced?
You can use a hashing function to create a hash of the data.
But you will need to consider that hashing the data will not guarantee that another piece of data won't have the same hash; it is just very unlikely.
So consider the following code
import hashlib
some_string = "Some test string I want to generate an ID from"
new_id = hashlib.md5(some_string.encode()).hexdigest()
print(new_id)
This snippet will print 051ba4078ab8419b76388ee9173dac1a.
Please note that md5 hashes should not be used to store passwords.
Also, if you want the id to be shorter than this, you can simply take the first x characters of the hash. But remember: the shorter the id, the higher the chance that two pieces of data end up with the same auto-generated id.
The odds that two given pieces of data get the same auto-generated id with this method are roughly 1 in 16^x. Consider how much data you have and how unlikely you need id collisions to be; aiming for well over 99% confidence of no collisions over the lifetime of the application is reasonable in my opinion.
So if you have, say, 100M existing items, taking the first 10 hexadecimal characters of the md5 hash means each new item has only about a 0.01% chance of colliding with one of them (assuming of course no two items are identical). Across that many items, though, the birthday effect makes some collision somewhere quite likely, so don't truncate further than you have to.
Also, it isn't random: for the same string you will always get the same hash value, which is exactly what you need for spotting duplicates.
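As a rough sketch of how this could plug into the MongoDB side (pymongo, with hypothetical database and collection names), using the hash as the document _id lets the server reject exact duplicates for you:
import hashlib
from pymongo import MongoClient   # assumes pymongo is installed

poll = {'Poll_Name': 'NBC News/Marist', 'Date': '2020-03-10', 'Sample_Size': '2523 RV',
        'MoE': '2.7', 'Biden (D)': '47', 'Trump(R)': '46', 'Spread': 'Biden +1'}

poll_string = poll['Poll_Name'] + poll['Date'] + poll['Sample_Size'] + poll['MoE']
poll_id = hashlib.md5(poll_string.encode()).hexdigest()

polls = MongoClient()["polling"]["polls"]   # hypothetical db and collection names
# $setOnInsert only writes the document the first time this _id is seen,
# so inserting an exact duplicate poll later is a harmless no-op.
polls.update_one({"_id": poll_id}, {"$setOnInsert": poll}, upsert=True)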

Profile Millions of Text Files In Parallel Using An Sqlite Counter?

A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type spec has had enhancements such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decade-long evolution of these file types, it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.
I built a running counter such that every time I see properties (familiar or not) I add 1 to its tally. The sqlite tally board looks like this:
In the special event I see an unfamiliar property I add them to the tally. On a typeA file that looks like:
I've got this system down! But it's slow: about 3M files per 36 hours in one process. Originally I was using this trick to pass sqlite a list of properties needing an increment.
placeholder = '?'  # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder for dummy_var in properties)
sql = """UPDATE tally_board
SET %s = %s + 1
WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)
I learned that's a bad idea because
sqlite string search is much slower than indexed search
several hundreds of properties (some 160 characters long) make for really long sql queries
using %s instead of ? is bad security practice... (not a concern ATM)
A "fix" was to maintain a script side property-rowid hash of the tally used in this loop:
Read file for new_properties
Read tally_board for rowid, property
Generate script side client_hash from 2's read
Write rows to tally_board for every new_property not in property (nothing incremented yet). Update client_hash with new properties
Lookup rowid for every row in new_properties using the client_hash
Write increment to every rowid (now a proxy for property) to tally_board
Step 6 looks like:
sql = """UPDATE tally_board
SET %s = %s + 1
WHERE rowid IN %s""" % (type_name, type_name, tuple(target_rows))
cur.execute(sql)
The problems with this are:
It's still slow!
It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 right before threadB completes step 6.
A solution to the race condition is to give steps 2-6 an exclusive lock on the db, though it doesn't look like reads can acquire such a lock.
Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.
There may be luck in something like this, but I'm unsure how to rewrite it to increment the tally.
How about a change of table schema? Instead of a column per type, have a type column. Then you have unique rows identified by property and type, like this:
|rowid|prop    |type |count|
=============================
|1    |prop_foo|typeA|215  |
|2    |prop_foo|typeB|456  |
This means you can enter a transaction for each and every property of each and every file separately and let sqlite worry about races. So for each property you encounter, immediately issue a complete transaction that computes the next total and upserts the record identified by the property name and file type.
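A minimal sqlite3 sketch of that per-property transaction (schema and names assumed from the table above; the UPSERT syntax needs SQLite 3.24 or newer):
import sqlite3

conn = sqlite3.connect("tally.db")   # hypothetical database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS tally_board (
        prop  TEXT NOT NULL,
        type  TEXT NOT NULL,
        count INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (prop, type)
    )
""")

def bump(prop, file_type):
    # One short transaction per property; SQLite serialises concurrent writers.
    with conn:
        conn.execute(
            """INSERT INTO tally_board (prop, type, count) VALUES (?, ?, 1)
               ON CONFLICT(prop, type) DO UPDATE SET count = count + 1""",
            (prop, file_type),
        )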
The following sped things up immensely:
Wrote less often to SQLite. Holding most of my intermediate results in memory and then updating the DB with them every 50k files (sketched below) cut the execution time to about a third (35 hours to 11.5 hours).
Moved the data onto my PC (for some reason my USB 3.0 port was transferring data well below USB 2.0 rates). This cut the execution time to about a fifth of that (11.5 hours to 2.5 hours).
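A rough sketch of that batching pattern (the helper names are hypothetical and the flush reuses the tally_board upsert from above):
from collections import Counter

BATCH = 50_000
pending = Counter()   # (prop, type) -> increments not yet written to the DB

def flush(conn):
    with conn:        # one transaction per flush
        conn.executemany(
            """INSERT INTO tally_board (prop, type, count) VALUES (?, ?, ?)
               ON CONFLICT(prop, type) DO UPDATE SET count = count + excluded.count""",
            [(prop, ftype, n) for (prop, ftype), n in pending.items()],
        )
    pending.clear()

for i, path in enumerate(all_files, 1):          # all_files: your 14M file paths
    file_type = type_of(path)                    # hypothetical type detector
    for prop in read_properties(path):           # hypothetical parser
        pending[(prop, file_type)] += 1
    if i % BATCH == 0:
        flush(conn)
flush(conn)                                      # write out the remainder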

SQLite get id and insert when using executemany

I am optimising my code and reducing the number of queries. These used to be in a loop, but I am trying to restructure my code as shown below. How do I get the second query to use the id generated by the first query for each row? Assume that the datasets are in the right order too.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be in a for loop but were taking too long, so I am trying to avoid the loop and the one-query-at-a-time approach. I believe there might be a way of combining this into one query, but I am unsure how it would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
You can do this implicitly if your program is single-threaded as follows:
Set the AUTOINCREMENT keyword in your table definition. This will guarantee that any generated row IDs will be higher than any that appear in the table currently.
Immediately before the first statement, determine the highest ROWID in use in the table.
oldmax ← Execute("SELECT max(ROWID) from nodes").
Perform the first insert as before.
Read back the row IDs that were actually assigned with a select statement:
NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax) .
Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
Perform the second insert as before.
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.
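A rough Python sketch of those steps with the sqlite3 module (table and column names follow the question, new_values is assumed to be the same list of (node_value, node_id) tuples, and nodes is assumed to be declared with INTEGER PRIMARY KEY AUTOINCREMENT):
import sqlite3

conn = sqlite3.connect("nodes.db")   # hypothetical database file
cur = conn.cursor()

new_values = [("value A", 1), ("value B", 2)]   # stand-in for the real data

# Remember the highest rowid before the bulk insert.
oldmax = cur.execute("SELECT COALESCE(MAX(rowid), 0) FROM nodes").fetchone()[0]

# First bulk insert, unchanged from the question.
cur.executemany(
    "INSERT INTO nodes (node_value, node_group) "
    "VALUES (?, (SELECT node_group FROM nodes WHERE node_id = ?) + 1)",
    new_values,
)

# Read back the ids SQLite actually assigned, in insertion order.
new_ids = [row[0] for row in cur.execute(
    "SELECT rowid FROM nodes WHERE rowid > ? ORDER BY rowid ASC", (oldmax,))]

# Pair each parent id from new_values with its freshly assigned child id.
connection_values = [(parent_id, child_id)
                     for (_, parent_id), child_id in zip(new_values, new_ids)]

# Second bulk insert, unchanged.
cur.executemany(
    "INSERT INTO connections (parent, child, strength) VALUES (?, ?, 1)",
    connection_values,
)
conn.commit()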

automatically populate table with dummy data in mysql

I have a MySQL table which I want to populate with some dummy data for testing (50+ rows).
This table has a foreign key to another table, so the dummy data must cross-reference rows from that table, but again be random, i.e. the rows can't all have the same foreign key.
It also has a date_added field which I want to populate with a random date within a year span, e.g. any date in the year 2010.
my table structure is:
id, customer_id, date_added, title, total_cost
where id is the primary key, customer_id is the foreign key and date_added is the date field.
What is the best way of doing this? I'd prefer to do it directly in MySQL, but failing that, my site runs on Python, so a way of doing it there would also work.
I would not do this in MySQL without outside help from an application written in Python.
There are several requirements built into your statement that are best expressed in a procedural style. SQL is a set-based language; I don't think it lends itself as nicely to the task at hand.
You'll want an application to take in data from a source, do whatever randomization and PII removal that you need, and then construct the test data according to your requirements.
If it's a database intended just for testing, you might consider an in-memory database that you can populate, modify all you like, and then blow away for your next test. I'm thinking of something like Hypersonic, Derby or TimesTen.
quick and dirty solution:
drop table if exists orders;
drop table if exists customers;
create table customers
(
    cust_id int unsigned not null auto_increment primary key,
    name varchar(255) not null
)
engine=innodb;

create table orders
(
    order_id int unsigned not null auto_increment primary key,
    cust_id int unsigned not null,
    order_date datetime not null,
    foreign key (cust_id) references customers(cust_id) on delete cascade
)
engine=innodb;
drop procedure if exists load_test_data;
delimiter #
create procedure load_test_data()
begin
    declare v_max_customers int unsigned default 0;
    declare v_max_orders int unsigned default 0;
    declare v_counter int unsigned default 0;
    declare v_rnd_cust_id int unsigned default 0;
    declare v_base_date datetime;

    set foreign_key_checks = 0;
    truncate table orders;
    truncate table customers;
    set foreign_key_checks = 1;

    set v_base_date = "2010-01-01 00:00:00";
    set v_max_customers = 1000;
    set v_max_orders = 10000;

    start transaction;
    set v_counter = 0;
    while v_counter < v_max_customers do
        insert into customers (name) values (concat('Customer ', v_counter+1));
        set v_counter=v_counter+1;
    end while;
    commit;

    start transaction;
    set v_counter = 0;
    while v_counter < v_max_orders do
        set v_rnd_cust_id = floor(1 + (rand() * v_max_customers));
        insert into orders (cust_id, order_date) values (v_rnd_cust_id, v_base_date + interval v_counter hour);
        set v_counter=v_counter+1;
    end while;
    commit;
end #
delimiter ;
call load_test_data();
select * from customers order by cust_id desc limit 10;
select * from orders order by order_id desc limit 10;
For testing business rules, I actually prefer carefully thought-out data over random data, either loaded via excel -> csv -> db or from manually created insert statements.
One row for each boundary condition, say:
A customer without orders
One customer with zero total cost
One customer with foreign characters in the name (because I always forget to deal with that)
One customer with a maximum-length name
One customer with loads of orders (to make sure that the GUI still looks nice)
It makes it really easy to run regression tests because you "know" what the data should look like.
For performance testing, you can do pretty well with random data, as long as the data distribution is realistic (which affects the usefulness of indexes). If you have very advanced requirements, your best bet is to use some software built for this purpose.
But often you can generate all the data you need from one single table of integers and clever use of built-in functions:
rand() -> Generate random number.
mod() -> Used to create repeating sequences (1,2,3,1,2,3)
lpad() and rpad() -> For padding strings to specified lengths
If you really want to get serious about setting up test data, you should go the fixture route. This will help you set up a pretty nice development environment and may integrate very nicely with your website's framework, if you're using one.
You can find a link to the documentation of the fixture module here
If you think that's a little too much work to get everything going, look into the MySQLdb module, which will help you insert data into your table.
It may be in poor taste to link back to another Stack Overflow question, but someone has already answered the random-date question you are asking. You can find that here.
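Putting those two pieces together, a rough Python sketch (MySQLdb with hypothetical connection details; 'orders' and 'customers' are assumed table and column names based on the question):
import datetime
import random
import MySQLdb   # the MySQLdb / mysqlclient package

conn = MySQLdb.connect(user="test", passwd="test", db="testdb")   # hypothetical credentials
cur = conn.cursor()

# Pick foreign keys only from customers that actually exist.
cur.execute("SELECT id FROM customers")          # assumed referenced table and column
customer_ids = [row[0] for row in cur.fetchall()]

def random_2010_date():
    return datetime.date(2010, 1, 1) + datetime.timedelta(days=random.randrange(365))

rows = [
    (random.choice(customer_ids),
     random_2010_date(),
     "Dummy order %d" % i,
     round(random.uniform(5, 500), 2))
    for i in range(50)
]
cur.executemany(
    "INSERT INTO orders (customer_id, date_added, title, total_cost) "
    "VALUES (%s, %s, %s, %s)",
    rows,
)
conn.commit()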
This question is old and answered, but in case anyone still needs it: there is a stored procedure that runs inside MySQL and auto-populates dummy data according to the column datatypes.
All you need to specify is the database name, the table name and the number of records to populate.
call populate('sakila','film',1000,'N');
(You might want to follow the Git repo for updates as well.)
