Have a question about what sort of approach to take on a process I am trying to structure. Working with PostgreSQL and Python.
Scenario:
I have two databases A and B.
B is a processed version of A.
Data continually streams into A, which needs to be processed in a certain way (using multi-processing) and is then stored in B.
Each new row in A needs to be processed only once.
So:
streamofdata ===> [database A] ----> process ----> [database B]
Database A is fairly large (40 GB) and growing. My question is about determining which data is new and has not yet been processed and stored in B. What is the best way to determine which rows still have to be processed?
Matching primary keys each time to find what has not yet been processed is not the way to go, I am guessing.
So let's say new rows 120 to 130 come into database A over some time period, and the last row I processed was 119. Is it a correct approach to look at the last processed row id (the primary key), 119, and say that anything beyond that should now be processed?
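Roughly, the approach I have in mind (just a sketch; the table and column names are made up):

import psycopg2

def fetch_unprocessed(last_processed_id):
    # Connect to database A (hypothetical DSN) and pull anything past the
    # last id that was already handled.
    conn = psycopg2.connect("dbname=database_a user=me")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload FROM incoming_data WHERE id > %s ORDER BY id",
            (last_processed_id,),
        )
        return cur.fetchall()

rows = fetch_unprocessed(119)   # 119 = last row already processed and stored in B
for row_id, payload in rows:
    pass  # process payload (multi-processing) and write the result to database B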
I'm also wondering whether anyone has any further resources on this sort of 'realtime' data processing. I'm not exactly sure what I am looking for, technically speaking.
Well, there are a few ways you could handle this problem. As a reminder, the process you are describing is basically re-implementing a form of database replication, so you may want to familiarize yourself with the various popular replication options out there for Postgres and how they work; Slony in particular might be of interest to you. You didn't specify what sort of database "database B" is, so I'll assume it's a separate PostgreSQL instance, though that assumption won't change a whole lot about the decisions below other than ruling out some canned solutions like Slony.
1. Set up a FOR EACH ROW trigger on the important table(s) you have in database A which need to be replicated. Your trigger would take each new row INSERTed (and/or UPDATEd, DELETEd, if you need to catch those) in those tables and send them off to database B appropriately. You mentioned using Python, so as a reminder, you can certainly write these trigger functions in PL/Python if that makes life easier for you, i.e. you should hopefully be able to more-or-less easily tweak your existing code so that it runs inside the database as a PL/Python trigger function.
2. If you read up on Slony, you might notice that proposal #1 is very similar to how Slony works; consider whether it would be easy or helpful to have Slony take over the replication of the necessary tables from database A to database B. If you then need to further move/transform the data into other tables inside database B, you might do that with triggers on those tables in database B.
3. Set up a trigger or RULE which will send out a NOTIFY with a payload indicating the row which has changed. Your code will LISTEN for these notifications and know immediately which rows have changed; the psycopg2 adapter has good support for LISTEN and NOTIFY (a minimal sketch of the listener side follows after these options). N.B. you will need to exercise some care to handle the case where your listener code has crashed, gets disconnected from the database, or otherwise misses some notifications.
4. If you have control over the code streaming data into database A, you could have that code take over the job of replicating its new data into database B.
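For option 3, a minimal sketch of the listener side with psycopg2; the channel name and payload format here are made up and must match whatever your trigger or RULE sends with NOTIFY:

import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=database_a user=me")  # hypothetical DSN
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN table_a_changes;")

while True:
    # Wait for the connection's socket to become readable, then drain
    # any pending notifications.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timed out, loop and wait again
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        changed_row_id = notify.payload  # e.g. the primary key sent by the trigger
        print("row changed:", changed_row_id)  # hand off to your processing code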
Related
Short version: Need a faster/better way to update many column comments at once in spark/databricks. I have a pyspark notebook that can do this sequentially across many tables, but if I call it from multiple tasks they take so long waiting on a hive connection that I get timeout failures.
Command used: ALTER TABLE my_db_name.my_table_name CHANGE my_column COMMENT "new comment" (docs)
Long version: I have a data dictionary notebook where I maintain column descriptions that are reused across multiple tables. If I run the notebook directly it successfully populates all my database table and column comments by issuing the above command sequentially for every column across all tables (and the corresponding table description command once).
I'm trying to move this to a by-table call. In the Databricks tasks that populate the tables I have a check to see if the output table exists. If not, it's created, and at the end I call the dictionary notebook (using dbutils.notebook.run("Data Dictionary Creation", 600, {"db": output_db, "update_table": output_table})) to populate the comments for that particular table. If this happens simultaneously for multiple tables, however, the notebook calls now time out, as most of the tasks spend a lot of time waiting for a client connection with Hive. This is true even though there's only one call of the notebook per table.
Solution Attempts:
I tried many variations of the above command to update all column comments in one call per table, but it's either impossible or my syntax is wrong.
It's unclear to me how to avoid the timeout issues (I've doubled the timeout to 10 minutes and it still fails, while the original notebook takes much less time than that to run across all tables!). I need to wait for completion before continuing to the next task (otherwise I'd spawn it as a process).
Update: I think what's happening here is that the ALTER command above is being called in a loop, and when I schedule a job this loop is being distributed and called in parallel. What I may actually need is a way to call it, or a function containing it, without letting the loop be distributed. Is there a way to force sequential execution for a single function?
In the end I found a solution for this issue.
First, the problem seems to have been that the loop with the ALTER command was getting parallelized by Spark, and thus firing multiple (conflicting) commands simultaneously on the same table.
The answer to this was two-fold:
1. Add a .coalesce(1) to the end of the function I was calling with the ALTER line. This limits the function to sequential execution.
2. Return a newly-created empty dataframe from the function to avoid coalesce-based errors.
Part 2 seems to have been necessary because, I think, this command is meant to get a result back for aggregation. I couldn't find a way to make it work without that (.repartition(1) had the same issue), so in the end I returned spark.createDataFrame([(1, "foo")], ["id", "label"]) from the function and things then worked.
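For reference, a rough sketch of the shape this ends up taking (database, table, and column names are placeholders):

def update_comments_for_table(spark, full_table_name, column_comments):
    # Issue the ALTER statements one at a time from the driver.
    for column, comment in column_comments.items():
        spark.sql(
            f'ALTER TABLE {full_table_name} CHANGE {column} COMMENT "{comment}"'
        )
    # Return a dummy DataFrame coalesced to a single partition, per the
    # two-part workaround described above.
    return spark.createDataFrame([(1, "foo")], ["id", "label"]).coalesce(1)

# Example call (in a Databricks notebook, `spark` is already defined):
# update_comments_for_table(spark, "my_db_name.my_table_name", {"my_column": "new comment"})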
This gets me to my desired end goal of working through all the alter commands without conflict errors.
It's clunky as hell though; I'd still love improvements or alternative approaches if anyone has one.
If you want to change multiple columns at once, why not recreate the table? (This trick will only work if table 'B' is an external table. Here table 'B' is the 'B'ad table with outdated comments, and table 'A' is the good table with good comments.)
drop table ('B')
create table with required comments ( 'A' )
If this table is NOT external, then you might want to create a view and start using that. This would enable you to add updated comments without altering the original table's data.
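For the drop-and-recreate path, a rough sketch (the file format, location, and all names here are just assumptions for illustration; in a Databricks notebook `spark` is already defined):

# Dropping an external table only removes the metadata; the files at LOCATION remain.
spark.sql("DROP TABLE IF EXISTS my_db_name.my_table_name")
spark.sql("""
    CREATE TABLE my_db_name.my_table_name (
        my_column    STRING COMMENT 'new comment',
        other_column INT    COMMENT 'another comment'
    )
    USING PARQUET
    LOCATION 's3://my-bucket/path/to/existing/data'
""")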
Have you considered using table properties instead of comments?
I'm using Postgres and I have multiple schemas with identical tables, which are dynamically added by the application code:
foo, bar, baz, abc, xyz, ...
I want to be able to query all the schemas as if they were a single table.
To be clear: I don't want to query all the schemas one by one and combine the results.
I want to "combine" the tables across schemas (not sure if this would be considered a huge join) and then run the query.
For example, the results of an ORDER BY query shouldn't look like
1. schema_A.result_1
2. schema_A.result_3
3. schema_B.result_2
4. schema_B.result_4
but instead it should be
1. schema_A.result_1
2. schema_B.result_2
3. schema_A.result_3
4. schema_B.result_4
If possible I don't want to generate a query that goes like
SELECT schema_A.table_X.field_1, schema_B.table_X.field_1 FROM schema_A.table_X, schema_B.table_X
Instead, I want that to be taken care of in PostgreSQL, in the database.
Generating a query with all the schemas (namespaces) appended can make my queries huge, with ~50 fields and ~50 schemas.
Since these tables are generated, I also cannot have them inherit from some global table and query that instead.
I'd also like to know if this is simply not possible at a reasonable speed.
EXTRA:
I'm using Django and django-tenants, so I'd also accept any answer that actually helps me generate the entire query and run it to get a global queryset, EVEN THOUGH it would be really slow.
Your question isn't as much a question as it is an admission that you've got a really terrible database and application design. It's as if you partitioned something that didn't need to be partitioned, or partitioned it in the wrong way.
Since you're doing something awkward, the database itself won't provide you with any elegant solution. Instead, you'll have to get more and more awkward until the regret becomes too much to bear and you redesign your database and/or your application.
I urge you to repent now, the sooner the better.
After that giant caveat based on a haughty moral position, I acknowledge that the only reason we answer questions here is to get imaginary internet points. And so, my answer is this: use a view that unions all of the values together and presents them as if they came from one table. I can't make any sense of the "order by query", so I'll just ignore it for now. Maybe you mean that you want the results in a certain order; if so, you can add a constant to each SELECT operand of the UNION ALL and ORDER BY that constant column coming out of the union. But if the order of the rows matters, I'd assert that you are showing yet another symptom of a poor database design.
You can programmatically update the view whenever you update or create new schemas and their catalogs.
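For example, a sketch of regenerating such a view from application code with psycopg2 (the table and view names are placeholders, and schema names are assumed to be simple, unquoted identifiers):

import psycopg2

def rebuild_all_tenants_view(conn, table_name="table_x", view_name="all_tenants_table_x"):
    with conn.cursor() as cur:
        # Find every schema that actually contains the tenant table.
        cur.execute(
            "SELECT table_schema FROM information_schema.tables "
            "WHERE table_name = %s AND table_schema NOT IN ('pg_catalog', 'information_schema')",
            (table_name,),
        )
        schemas = [row[0] for row in cur.fetchall()]
        if not schemas:
            return
        selects = [
            f"SELECT '{schema}' AS source_schema, * FROM {schema}.{table_name}"
            for schema in schemas
        ]
        cur.execute(
            f"CREATE OR REPLACE VIEW {view_name} AS " + " UNION ALL ".join(selects)
        )
    conn.commit()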
A working example is here: http://sqlfiddle.com/#!17/c09265/1
with this schema creation and population code:
CREATE Schema Fooey;
CREATE SCHEMA Junk;
CREATE TABLE Fooey.Baz (SomeInteger INT);
CREATE TABLE Junk.Baz (SomeInteger INT);
INSERT INTO Fooey.Baz (SomeInteger) VALUES (17), (34), (51);
INSERT INTO Junk.Baz (SomeInteger) VALUES (13), (26), (39);
CREATE VIEW AllOfThem AS
SELECT 'FromFooey' AS SourceSchema, SomeInteger FROM Fooey.Baz
UNION ALL
SELECT 'FromJunk' AS SourceSchema, SomeInteger FROM Junk.Baz;
and this query:
SELECT *
FROM AllOfThem
ORDER BY SourceSchema;
Why are per-tenant schemas a bad design?
This design favors laziness over scalability. If you don't want to make changes to your application, you can simply slam connections to a particular schema and keep working without any code changes. Adding more tenants means adding more schemas, which it sounds like you've automated. Adding many schemas will eventually make database management cumbersome (what if you have thousands or millions of tenants?), and even if you have only a few, the dynamic nature of the list and the difficulty of writing system-wide queries are problems you've already discovered.
Consider instead combining everything and adding the tenant ID as part of a key on each table. In that case, adding more tenants means adding more rows. Any summary queries trivially come from single tables, and all of the features and power of the database implementation and its query language are at your fingertips without any fuss whatsoever.
It's simply false that a database design can't be changed, even in an existing and busy system. It takes a lot of effort to do it, but it can be done and people do it all the time. That's why getting the database design right as early as possible is important.
The README of the django-tenants package you're using describes their decision to trade off towards laziness, and cites a whitepaper that outlines many of the shortcomings and alternatives of that method.
I'm looking for ideas on how to improve a report that takes up to 30 minutes to process on the server. I'm currently working with Django and MySQL, but if there is a solution that requires changing the language or SQL database I'm open to it.
The report I'm talking about reads multiple Excel files and inserts all the rows from those files into a table (the report table), somewhere between 12K and 15K records; the table has around 50 columns. This part doesn't take that much time.
Once I have all the records in the report table, I start applying multiple phases of business logic, so I end up having something like this:
def create_report():
business_logic_1()
business_logic_2()
business_logic_3()
business_logic_4()
Each business_logic_X function does something very similar: it starts by doing a ReportModel.objects.all() and then applies multiple calculations, like checking dates, quantities, etc., and updates the records. Since it's a 12K-record table, this quickly starts adding time to the complete report.
The reason I'm running multiple functions separately, and not doing all the processing in one pass, is that the logic from the first function needs to be completed so that the logic in the next functions works (e.g. the first function finds all related records and applies the same status to all of them).
The first thing I know could be optimized is somehow caching the objects.all() result instead of calling it in each function, but I'm not sure how to pass it to the next function without saving the records first.
I already optimized the report a bit by using update_fields on the save method of the functions and that saved a bit of time.
My question is, is there a better approach to this kind of problem? Is Django/MySQL the right stack for this?
What takes time is the business logic that you're doing in Django, since it makes several round trips between the database and the application.
It sounds like there are several tables involved, so I would suggest that you write your query in raw SQL, and once you have the results, bring them into the application if you need them there.
The ORM has a raw() method that you can use, or you could drop down to an even lower level and interface with your database directly.
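A minimal sketch of what that could look like with raw(); the model, table, and column names are placeholders:

from myapp.models import ReportModel  # hypothetical app and model

rows = ReportModel.objects.raw(
    """
    SELECT id, quantity, created_at,
           CASE WHEN quantity > 0 THEN 'ok' ELSE 'empty' END AS computed_status
    FROM myapp_reportmodel
    """
)
for row in rows:
    # Extra columns selected in the query show up as attributes on each instance.
    print(row.id, row.computed_status)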
Without seeing more of what you do, I can't give any more specific advice.
I am implementing an import tool (Django 1.6) that takes a potentially very large CSV file, validates it and depending on user confirmation imports it or not. Given the potential large filesize, the processing of the file is done via flowy (a python wrapper over Amazon's SWF). Each import job is saved in a table in the DB and the workflow, which is quite simple and consists of only one activity, basically calls a method that runs the import and saves all necessary information about the processing of the file in the job's record in the database.
The tricky thing is: We now have to make this import atomic. Either all records are saved or none. But one of the things saved in the import table is the progress of the import, which is calculated based on the position of the file reader:
progress = (raw_data.tell() * 100.0) / filesize
And this progress is used by an AJAX progress bar widget on the client side. So simply adding @transaction.atomic to the method that loops through the file and imports the rows is not a solution, because the progress will only be saved on commit.
The CSV files only contain one type of record and affect a single table. If I could somehow do a transaction only on this table, leaving the job table free for me to update the progress column, it would be ideal. But from what I've found so far it seems impossible. The only solution I could think of so far is opening a new thread and a new database connection inside it every time I need to update the progress. But I keep wondering… will this even work? Isn't there a simpler solution?
One simple approach would be to use the READ UNCOMMITTED transaction isolation level. That could allow dirty reads, which would allow your other processes to see the progress even though the transaction hasn't been committed. However, whether this works or not will be database-dependent. (I'm not familiar with MySQL, but this wouldn't work in PostgreSQL because READ UNCOMMITTED works the same way as READ COMMITTED.)
Regarding your proposed solution, you don't necessarily need a new thread, you really just need a fresh connection to the database. One way to do that in Django might be to take advantage of the multiple database support. I'm imagining something like this:
As described in the documentation, add a new entry to DATABASES with a different name, but the same setup as default. From Django's perspective we are using multiple databases, even though we in fact just want to get multiple connections to the same database.
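A sketch of what that settings entry might look like, using 'second_db' as the alias referenced below (the engine and credentials simply mirror whatever default uses):

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "myapp",
        "USER": "myapp",
        "PASSWORD": "secret",
        "HOST": "localhost",
    },
    # Same database, second connection: only the alias differs.
    "second_db": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "myapp",
        "USER": "myapp",
        "PASSWORD": "secret",
        "HOST": "localhost",
    },
}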
When it's time to update the progress, do something like:
JobData.objects.using('second_db').filter(id=5).update(progress=0.5)
That should take place in its own autocommitted transaction, allowing the progress to be seen by your web server.
Now, does this work? I honestly don't know, I've never tried anything like it!
I am in the middle of a project that involves grabbing numerous pieces of information out of 70GB worth of XML documents and loading it into a relational database (in this case PostgreSQL). I am currently using Python scripts and psycopg2 to do the inserts and so on. I have found that as the number of rows in some of the tables increases (the largest is at around 5 million rows), the speed of the script (the inserts) has slowed to a crawl. What once took a couple of minutes now takes about an hour.
What can I do to speed this up? Was I wrong to use Python and psycopg2 for this task? Is there anything I can do to the database that might speed up this process? I get the feeling I am going about this in entirely the wrong way.
Considering the process was fairly efficient before and only slowed down once the dataset grew, my guess is that it's the indexes. You may try dropping the indexes on the table before the import and recreating them after it's done. That should speed things up.
What are the settings for wal_buffers and checkpoint_segments? For large transactions, you have to tweak some settings. Check the manual.
Consider the book PostgreSQL 9.0 High Performance as well, there is much more to tweak than just the database configuration to get high performance.
I'd try to use COPY instead of inserts; this is what backup tools use for fast loading (there is a sketch after this list of tips).
Check whether all foreign keys from this table have a corresponding index on the target table. Or better, drop them temporarily before copying and recreate them afterwards.
Increase checkpoint_segments from the default 3 (which means 3 * 16 MB = 48 MB) to a much higher number; try, for example, 32 (512 MB). Make sure you have enough space for this much additional data.
If you can afford to recreate or restore your database cluster from scratch in case of a system crash or power failure, then you can start Postgres with the "-F" option, which will enable the OS write cache.
Take a look at http://pgbulkload.projects.postgresql.org/
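A sketch of the COPY suggestion with psycopg2 (table name, column names, and the sample rows are placeholders):

import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN

parsed_records = [("a", 1, "x"), ("b", 2, "y")]  # stand-in for rows parsed from the XML

buf = io.StringIO()
for record in parsed_records:
    buf.write("\t".join(str(value) for value in record) + "\n")
buf.seek(0)

with conn.cursor() as cur:
    # One COPY instead of many single-row INSERTs.
    cur.copy_from(buf, "target_table", columns=("col_a", "col_b", "col_c"))
conn.commit()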
There is a list of hints on this topic in the Populating a Database section of the documentation. You might speed up general performance using the hints in Tuning Your PostgreSQL Server as well.
The overhead of checking foreign keys might be growing as the table size increases, which is made worse because you're loading a single record at a time. If you're loading 70GB worth of data, it will be far faster to drop foreign keys during the load, then rebuild them when it's imported. This is particularly true if you're using single INSERT statements. Switching to COPY instead is not a guaranteed improvement either, due to how the pending trigger queue is managed--the issues there are discussed in that first documentation link.
From the psql prompt, you can find the name of the constraint enforcing your foreign key and then drop it using that name like this:
\d tablename
ALTER TABLE tablename DROP CONSTRAINT constraint_name;
When you're done with loading, you can put it back using something like:
ALTER TABLE tablename ADD CONSTRAINT constraint_name FOREIGN KEY (other_table) REFERENCES other_table (join_column);
One useful trick to find out the exact syntax to use for the restore is to do pg_dump --schema-only on your database. The dump from that will show you how to recreate the structure you have right now.
I'd look at the rollback logs. They've got to be getting pretty big if you're doing this in one transaction.
If that's the case, perhaps you can try committing a smaller transaction batch size. Chunk it into smaller blocks of records (1K, 10K, 100K, etc.) and see if that helps.
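A rough sketch of chunked commits with psycopg2 (batch size, table, and column names are placeholders):

import psycopg2

BATCH_SIZE = 10_000
INSERT_SQL = "INSERT INTO target_table (col_a, col_b, col_c) VALUES (%s, %s, %s)"

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
cur = conn.cursor()

parsed_records = [("a", 1, "x"), ("b", 2, "y")]  # stand-in for rows parsed from the XML

batch = []
for record in parsed_records:
    batch.append(record)
    if len(batch) >= BATCH_SIZE:
        cur.executemany(INSERT_SQL, batch)
        conn.commit()  # commit each chunk instead of one giant transaction
        batch = []
if batch:
    cur.executemany(INSERT_SQL, batch)
    conn.commit()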
First, 5 million rows is nothing; insert performance should not change much whether the table holds 100K or 1 million rows.
One or two indexes won't slow it down that much (if the fill factor is set to 70-90, considering each major import is about 1/10 of the table).
Python with psycopg2 is quite fast.
A small tip: you could use the xml2 database extension to read/work with the data; there is a small example at
https://dba.stackexchange.com/questions/8172/sql-to-read-xml-from-file-into-postgresql-database
duffymo is right: try to commit in chunks of 10,000 inserts (committing only at the end, or after every single insert, is quite expensive).
Autovacuum might be a factor if you do a lot of deletes and updates; you can turn it off temporarily for certain tables at the start. Set work_mem and maintenance_work_mem according to your server's available resources.
For inserts, increase wal_buffers (on 9.0 and higher it is set automatically by default, -1); if you use version 8 of PostgreSQL, you should increase it manually.
You could also turn fsync off and test wal_sync_method (be cautious: changing this may make your database crash-unsafe if a sudden power failure or hardware crash occurs).
Try to drop foreign keys, disable triggers, or set conditions so triggers don't run / skip execution.
Use prepared statements for the inserts, and cast the variables.
You could try inserting the data into an UNLOGGED table to temporarily hold it (see the sketch after these tips).
Do the inserts have WHERE conditions, or take values from a sub-query, functions, or the like?
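For the UNLOGGED-table tip, a rough sketch (the table names are placeholders and target_table is assumed to already exist):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
cur = conn.cursor()

# Stage the rows in an UNLOGGED table first (no WAL overhead), then move them over.
cur.execute("CREATE UNLOGGED TABLE IF NOT EXISTS staging_table (LIKE target_table INCLUDING DEFAULTS)")
# ... bulk-load into staging_table here, e.g. with COPY ...
cur.execute("INSERT INTO target_table SELECT * FROM staging_table")
cur.execute("DROP TABLE staging_table")
conn.commit()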