Multi-Threaded data insertion in MySQL using python - python

I am working on a project that involves inserting a lot of data into the database. I am wondering if anybody knows how to fill 2 or 3 tables in the database at the same time. An example or pseudocode would be helpful.
Thanks

If you have a lot of data to insert into the database all at once, then you are probably interested in bulk loading data. The ideal tool for that is the bulk loader that likely comes with your database: Oracle, Microsoft SQL Server, Sybase SQL Server, and MySQL (to name the ones that come to mind) all have bulk loaders. For example, Microsoft has the BULK INSERT statement and the bcp program to perform this task, and MySQL has the LOAD DATA statement and the mysqlimport program. I recommend you look into that rather than rigging up some tool in Python, with or without threads.
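For MySQL, a rough sketch of the bulk-loader route from Python might look like the following. The helper name write_csv_and_build_load_sql and the table and column names are hypothetical, chosen only for illustration. The sketch dumps rows to a CSV file and builds the matching LOAD DATA LOCAL INFILE statement; actually executing that statement needs a MySQL connection with local_infile enabled on both client and server, which is not shown here.

```python
import csv
import os
import tempfile


def write_csv_and_build_load_sql(table, columns, rows, directory=None):
    """Write rows to a CSV file and return (csv_path, LOAD DATA statement).

    `table` and `columns` are interpolated into the SQL text, so they must
    come from trusted code, never from user input.
    """
    fd, path = tempfile.mkstemp(suffix=".csv", dir=directory)
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerows(rows)

    # Use forward slashes so a Windows path needs no backslash escaping.
    sql_path = path.replace("\\", "/")
    statement = (
        f"LOAD DATA LOCAL INFILE '{sql_path}' "
        f"INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' "
        f"({', '.join(columns)})"
    )
    return path, statement
```

To fill 2 or 3 tables, you would call the helper once per table and run the resulting statements one after another. Data containing backslashes would need extra care, because MySQL treats the backslash as an escape character by default.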

Related

Python GUI connection to SQL Server database

I'm developing a GUI in Python in Visual Studio Code, which is connected to SQL Server 2019 database. I created a number of tables in the database, taking normalization into consideration.
I'm not quite sure which approach I should take. I need to perform CRUD operations on multiple tables simultaneously from the GUI. Would using stored procedures be the best option, and will I have to create the stored procedure in SQL Server 2019 and then call the procedure by name in VSC? Is there a more effective way to achieve this?
Thanks

Remote Postgres to Postgres data

I am working on a project where I need to load daily data from one psql database into another one (the two databases are on separate remote machines).
The Postgres version I'm using is 9.5, and due to our infrastructure, I am currently doing this using Python scripts, which works fine for now, although I was wondering:
Is it possible to do this using psql commands that I can easily schedule? Or is Python a flexible enough approach for future developments?
EDIT:
The main database is the backend connected directly to a website, and the other contains an analytics system which basically only needs to read the main db's data and store future transformations of it.
Latency is not very important; what is important is reliability and simplicity.
Sure, you can use psql and an ssh connection if you want.
This approach (or using pg_dump) can be useful as a way to reduce the effects of latency.
However, note that the SQL INSERT ... VALUES command can insert several rows in a single command. When I use Python scripts to migrate data, I build INSERT commands that insert up to 1000 rows each, thus cutting the number of round trips, and the latency they cost, by a factor of up to 1000.
Another approach worth considering is dblink, which allows Postgres to query a remote Postgres directly, so you could do a SELECT from the remote database and insert the result into a local table.
postgres_fdw (the Postgres foreign data wrapper) may be worth a look too.
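The multi-row INSERT idea from the answer above could be sketched roughly like this. The generator name batched_inserts is hypothetical, and the %s placeholders assume a driver with the psycopg2 parameter style; this is an illustration of the batching, not the answerer's actual script.

```python
from itertools import islice


def batched_inserts(table, columns, rows, batch_size=1000):
    """Yield (sql, params) pairs, each inserting up to batch_size rows.

    Uses %s placeholders (the psycopg2 paramstyle); `table` and `columns`
    are interpolated directly and must come from trusted code.
    """
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        sql = (
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
            + ", ".join([row_placeholder] * len(batch))
        )
        # Flatten the rows so the parameters line up with the placeholders.
        params = [value for row in batch for value in row]
        yield sql, params
```

With an open cursor, each pair would be passed along as cursor.execute(sql, params), so 2500 rows cost three round trips instead of 2500.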

Output SQL as string from pandas.DataFrame.to_sql

Is there a way of making pandas (or sqlalchemy) output the SQL that would be executed by a call to to_sql() instead of actually executing it? This would be handy in the many cases where I need to update multiple databases with the same data but Python and pandas exist on only one of my machines.
According to the doc, use the echo parameter as:
engine = create_engine("mysql://scott:tiger@hostname/dbname", echo=True)
This is more a process question than a programming one. The first issue is the use of multiple databases. Relational database management systems (RDBMSs) are designed as multi-user systems for many simultaneous users/apps/clients/machines. Designed to run as ONE system, the database serves as the central repository for related applications. Some argue databases should be agnostic to apps and be data-centric (Postgres folks) and others believe databases should be app-centric (MySQL folks). Overall, understand that they are more involved than a flat-file spreadsheet or data frame.
Usually, RDBMSs come in two structural types:
file-level systems like SQLite and MS Access (where the database resides in a file saved to a directory on disk); these systems, though still powerful and multi-user, mostly serve smaller business applications with a relative handful of users or small teams
server-level systems like SQL Server, MySQL, PostgreSQL, DB2, and Oracle (where the database runs over a network without any localized file); these systems serve as enterprise-level systems that run full-scale business operations over LAN intranets or web networks.
Meanwhile, pandas is not a database but a data analysis toolkit (much like MS Excel), though it can import/export queried result sets from RDBMSs. Therefore, it maintains no native SQL dialect for DDL/DML procedures. Moreover, pandas runs in memory on the machine running the Python script and cannot be shared by other clients/machines. Pandas also does not track changes, so it cannot tell you the different states of a data frame during the script's run unless you design that in yourself with before-and-after copies and identify the column/row changes.
With that mouthful said, why not use ONE database and have your Python script serve as just another of the many clients that connect to the database to import/export data into a data frame? Then, after every data frame change, actually run to_sql(). Recall that pandas' to_sql takes the if_exists argument:
# DROPS TABLE, RECREATES IT, AND INSERTS THE DF DATA
df.to_sql(name='tablename', con=conn, if_exists='replace')
# APPENDS DF DATA TO EXISTING TABLE
df.to_sql(name='tablename', con=conn, if_exists='append')
In turn, every app/machine that connects to the centralized database only needs to refresh its instance, and current data is available in real time for its end-use needs. Of course, table locking can be an issue in multi-user environments if another user has a table record in edit mode while your script tries to update it, but transactions may help here.
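As a rough illustration of how a transaction can help, here is a minimal sketch that uses the standard-library sqlite3 module as a stand-in for the central database. The function name refresh_table and the id/value columns are hypothetical; a server database would use its own driver, but the commit-or-roll-back pattern is the same idea.

```python
import sqlite3


def refresh_table(conn, rows):
    """Replace the contents of `tablename` in one transaction."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises, so other clients never see
    # a half-refreshed table.
    with conn:
        conn.execute("DELETE FROM tablename")
        conn.executemany(
            "INSERT INTO tablename (id, value) VALUES (?, ?)", rows
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, value TEXT)")
    refresh_table(conn, [(1, "old")])
    try:
        refresh_table(conn, [(2, "a"), (2, "b")])  # duplicate key fails
    except sqlite3.IntegrityError:
        pass
    # The failed refresh was rolled back, so the old row is still there.
    print(conn.execute("SELECT id, value FROM tablename").fetchall())
```

If the refresh fails partway through, the delete is undone along with the partial insert, so the previous data stays available to every other client.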

Are SQLite reads always hitting disk?

I have a Pylons application using SQLAlchemy with SQLite as backend. I would like to know if every read operation going to SQLite will always lead to a hard disk read (which is very slow compared to RAM) or some caching mechanisms are already involved.
Does SQLite maintain a subset of the database in RAM for faster access?
Can the OS (Linux) do that automatically?
How much speedup could I expect by using a production database (MySQL or PostgreSQL) instead of SQLite?
Yes, SQLite has its own memory cache. Check PRAGMA cache_size, for instance. Also, if you're looking for speedups, check PRAGMA temp_store. There is also an API for implementing your own cache.
The SQLite database is just a file to the OS. Nothing is 'automatically' done for it. To ensure caching does happen, there are sqlite.h defines and runtime pragma settings.
It depends; there are a lot of cases where you'll get a slowdown instead.
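The two pragmas mentioned above can be set from Python's built-in sqlite3 module. A minimal sketch, assuming a hypothetical helper open_with_cache and an arbitrary cache size of about 8000 KiB:

```python
import sqlite3


def open_with_cache(path, cache_kib=8000):
    """Open a SQLite database with a larger page cache for this connection."""
    conn = sqlite3.connect(path)
    # A negative cache_size is a size in KiB; a positive one is a page count.
    conn.execute(f"PRAGMA cache_size = {-int(cache_kib)}")
    # Keep temporary tables and indices in memory instead of temp files.
    conn.execute("PRAGMA temp_store = MEMORY")
    return conn
```

Both settings apply only to the connection they are issued on, so an application using a connection pool would need to set them on every new connection.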
How much speedup could I expect by using a production database (Mysql or postgres) instead of sqlite?
Are you using sqlite in a production server environment? You probably shouldn't be:
From Appropriate Uses For SQLite:
SQLite will normally work fine as the database backend to a website.
But if your website is so busy that you are thinking of splitting the
database component off onto a separate machine, then you should
definitely consider using an enterprise-class client/server database
engine instead of SQLite.
SQLite was not designed for, and was never intended to, scale well; it trades performance for convenience. If performance is a concern, you should consider another DBMS.

Real-time SQLite and PostgreSQL bi-directional synchronization using python

Is there any python library that can keep a client-side SQLite database in sync with a server-side PostgreSQL database?
There are solutions for Java, such as Daffodil or SymmetricDS. Is there something similar for python?
SymmetricDS is a server-side synchronization solution that gets triggered regardless of which language is used to access the database. You should still be able to use it to synchronize the databases while using Python libraries to actually query them. I would recommend SQLAlchemy as a good database-independent query layer for Python.
