Output SQL as string from pandas.DataFrame.to_sql - python

Is there a way of making pandas (or SQLAlchemy) output the SQL that would be executed by a call to to_sql() instead of actually executing it? This would be handy in many cases where I actually need to update multiple databases with the same data, but Python and pandas only exist on one of my machines.

According to the docs, use the echo parameter:
engine = create_engine("mysql://scott:tiger@hostname/dbname", echo=True)
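Note that echo=True prints the SQL to stdout while still executing it. If you want the statements as a string, one option (a minimal sketch, relying on SQLAlchemy's standard sqlalchemy.engine logger and a throwaway in-memory SQLite engine) is to attach your own logging handler:

import io
import logging

import pandas as pd
from sqlalchemy import create_engine

# Route everything the engine logs (i.e. the emitted SQL) into a string buffer.
buffer = io.StringIO()
logger = logging.getLogger("sqlalchemy.engine")
logger.addHandler(logging.StreamHandler(buffer))
logger.setLevel(logging.INFO)

engine = create_engine("sqlite://")  # throwaway in-memory database
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df.to_sql("tablename", con=engine, if_exists="replace", index=False)

print(buffer.getvalue())  # CREATE TABLE ... / INSERT INTO ... statements

Keep in mind the captured DDL is in the dialect of whatever engine you point it at, so the SQLite output here is only illustrative.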

This is more a process question than a programming one. First, there is the use of multiple databases. Relational database management systems (RDBMS) are designed as multi-user systems for many simultaneous users/apps/clients/machines. Designed to run as ONE system, the database serves as the central repository for related applications. Some argue databases should be agnostic to apps and data-centric (the Postgres folks) and others believe databases should be app-centric (the MySQL folks). Overall, understand they are more involved than a flat-file spreadsheet or data frame.
Usually, RDBMSs come in two structural types:
file-level systems like SQLite and MS Access (where the database resides in a file saved to a local directory); though still powerful and multi-user, these systems mostly serve smaller business applications with a relative handful of users or small team sizes
server-level systems like SQL Server, MySQL, PostgreSQL, DB2, and Oracle (where databases run over a network without any localized file); these serve as enterprise-level systems for full-scale business operations over LAN intranets or web networks
Meanwhile, pandas is not a database but a data analysis toolkit (much like MS Excel), though it can import/export queried result sets from RDBMSs. Therefore, it maintains no native SQL dialect for DDL/DML procedures. Moreover, pandas runs in memory on the machine running the Python script and cannot be shared by other clients/machines. Pandas does not track changes the way you intend; it cannot know the different states of a data frame during a script's run unless you design it that way, taking before and after snapshots and identifying column/row changes.
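For what it's worth, a minimal sketch of that before/after approach (using DataFrame.compare, available in pandas 1.1+; the sample data is made up):

import pandas as pd

before = pd.DataFrame({"id": [1, 2], "qty": [10, 5]})
after = before.copy()
after.loc[after["id"] == 2, "qty"] = 7  # some change made during the script run

# DataFrame.compare shows exactly which cells differ between the two snapshots.
print(before.compare(after))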
With that mouthful said, why not use ONE database and have your Python script serve as just another of the many clients that connect to it to import/export data into a data frame? Hence, after every data frame change, actually run to_sql(). Recall pandas' to_sql uses the if_exists argument:
# DROPS TABLE, RECREATES IT, AND UPDATES IT
df.to_sql(name='tablename', con=conn, if_exists='replace')
# APPENDS DF DATA TO EXISTING TABLE
df.to_sql(name='tablename', con=conn, if_exists='append')
In turn, every app/machine that connects to the centralized database only needs to refresh its instance, and current data is available in real time for its end-use needs. Of course, table locking can be an issue in multi-user environments if another user has a record open for editing while your script tries updating it, but transactions may help there (see the sketch below).
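A minimal sketch of wrapping the refresh in a transaction (assuming a recent pandas/SQLAlchemy; the connection URL is the one from the question and the sample frame is made up):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql://scott:tiger@hostname/dbname")
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# engine.begin() opens a transaction that commits on success and rolls back
# on error, so other clients never see a half-finished table refresh.
with engine.begin() as conn:
    df.to_sql(name="tablename", con=conn, if_exists="replace", index=False)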

Related

Remote Postgres to Postgres data

I am working on a project now where I need to load daily data from one psql database into another one (both databases are on separate remote machines).
The Postgres version I'm using is 9.5, and due to our infrastructure, I am currently doing this using Python scripts, which works fine for now, although I was wondering:
Is it possible to do this using psql commands that I can easily schedule? Or is Python a flexible enough approach for future developments?
EDIT:
The main database contains a backend connected directly to a website and the other contains an analytics system which basically only needs to read the main db's data and store future transformations of it.
Latency is not very important; what matters is reliability and simplicity.
Sure, you can use psql over an SSH connection if you want.
This approach (or using pg_dump) can be useful as a way to reduce the effects of latency.
However, note that the SQL INSERT ... VALUES statement can insert several rows in a single command. When I use Python scripts to migrate data, I build INSERT statements that insert up to 1000 rows each, thus reducing the impact of latency by a factor of up to 1000.
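A minimal sketch of that batching idea with psycopg2 (connection strings, table, and column names are all hypothetical):

import psycopg2
from psycopg2.extras import execute_values

# Hypothetical connection details; substitute your own.
src = psycopg2.connect("host=source.example dbname=main user=reader password=secret")
dst = psycopg2.connect("host=analytics.example dbname=analytics user=writer password=secret")

with src.cursor() as read_cur, dst.cursor() as write_cur:
    read_cur.execute(
        "SELECT id, amount, created_at FROM transactions "
        "WHERE created_at::date = current_date"
    )
    rows = read_cur.fetchall()
    # execute_values packs many rows into each INSERT ... VALUES statement,
    # so only a handful of round trips cross the network instead of one per row.
    execute_values(
        write_cur,
        "INSERT INTO transactions_copy (id, amount, created_at) VALUES %s",
        rows,
        page_size=1000,
    )
dst.commit()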
Another approach worth considering is dblink which allows postgres to query a remote postgres directly, so you could do a select from the remote database and insert the result into a local table.
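For reference, a rough sketch of the dblink route driven from Python (it assumes the dblink extension is installed on the local database; names and credentials are made up):

import psycopg2

local = psycopg2.connect("host=analytics.example dbname=analytics user=writer password=secret")
with local.cursor() as cur:
    # dblink lets the local Postgres query the remote one directly; the
    # column list after AS must match the remote SELECT's columns and types.
    cur.execute("""
        INSERT INTO transactions_copy (id, amount)
        SELECT id, amount
        FROM dblink(
            'host=source.example dbname=main user=reader password=secret',
            'SELECT id, amount FROM transactions'
        ) AS t(id integer, amount numeric)
    """)
local.commit()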
Postgres-FDW may be worth a look too.

Why do Django and Python MySQLdb have one cursor per database?

Example scenario:
MySQL running on a single server -> HOSTNAME
Two MySQL databases on that server -> USERS, GAMES.
Task -> Fetch the 10 newest games from GAMES.my_games_table, and fetch the users playing those games from USERS.my_users_table (assume no joins)
In Django as well as Python MySQLdb, why is having one cursor for each database preferable?
What is the disadvantage of a single extended cursor per MySQL server that can switch databases (e.g. by issuing "USE USERS;") and then work on the corresponding database?
MySQL connections are cheap, but isn't a single connection better than many if there is a linear flow and no complex transactions which might need two cursors?
A shorter answer would be, "MySQL doesn't support that type of cursor," so neither does Python-MySQL, and the reason one connection per database is preferred is that that's the way MySQL works. Which is sort of a tautology.
However, the longer answer is:
A 'cursor', by your definition, would be some type of object accessing tables and indexes within an RDBMS, capable of maintaining its state.
A 'connection', by your definition, would accept commands, and either allocate or reuse a cursor to perform the action of the command, returning its results to the connection.
By your definition, a 'connection' would/could manage multiple cursors.
You believe this would be the preferred/performant way to access a database as 'connections' are expensive, and 'cursors' are cheap.
However:
A cursor in MySQL (and other RDBMSs) is not the user-accessible mechanism for performing operations. MySQL (and others) perform operations on a "set" basis, or rather, they compile your SQL command into an internal list of commands and do numerous, complex things depending on the nature of your SQL command and your table structure.
A cursor is a specific mechanism, utilized within stored procedures (and there only), giving the developer a way to work with data in a procedural way.
A 'connection' in MySQL is, sort of, what you think of as a 'cursor'. MySQL does not expose its internals to you as an iterator, or a pointer, that is merely moving over tables. It exposes its internals as a 'connection' which accepts SQL and other commands, translates those commands into an internal action, performs that action, and returns its result to you.
This is the difference between a 'set' and a 'procedural' execution style (which is really about the granularity of control you, the user, are given access to, or at least, the granularity inherent in how the RDBMS abstracts away its internals when it exposes them via an API).
As you say, MySQL connections are cheap, so for your case, I'm not sure there is a technical advantage either way, outside of code organization and flow. It might be easier to manage two cursors than to keep track of which database a single cursor is currently talking to by painstakingly tracking SQL 'USE' statements. Mileage with other databases may vary -- remember that Django strives to be database-agnostic.
Also, consider the case where two different databases, even on the same server, require different access credentials. In such a case, two connections will be necessary, so that each connection can successfully authenticate.
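For the scenario in the question, the two-connection version might look like this sketch (credentials and column names are hypothetical; it uses the MySQLdb/mysqlclient driver):

import MySQLdb  # the mysqlclient package

# One connection (and cursor) per database, as the driver expects.
games_conn = MySQLdb.connect(host="HOSTNAME", user="scott", passwd="tiger", db="GAMES")
users_conn = MySQLdb.connect(host="HOSTNAME", user="scott", passwd="tiger", db="USERS")

games_cur = games_conn.cursor()
games_cur.execute("SELECT id FROM my_games_table ORDER BY created_at DESC LIMIT 10")
game_ids = [row[0] for row in games_cur.fetchall()]

players = []
if game_ids:
    placeholders = ", ".join(["%s"] * len(game_ids))
    users_cur = users_conn.cursor()
    users_cur.execute(
        f"SELECT user_id, game_id FROM my_users_table WHERE game_id IN ({placeholders})",
        game_ids,
    )
    players = users_cur.fetchall()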
One cursor per database is not necessarily preferable, it's just the default behavior.
The rationale is that different databases are more often than not on different servers, use different engines, and/or need different initialization options. (Otherwise, why should you be using different "databases" in the first place?)
In your case, if your two databases are just namespaces of tables (what should be called "schemas" in SQL jargon) but reside on the same MySQL instance, then by all means use a single connection. (How to configure Django to do so is actually an altogether different question.)
You are also right that a single connection is better than two, if you only have a single thread and don't actually need two database workers at the same time.
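Conversely, if both databases live on the same MySQL instance and share credentials, a single connection with schema-qualified table names works; a sketch with hypothetical credentials, using plain MySQLdb rather than Django configuration:

import MySQLdb

conn = MySQLdb.connect(host="HOSTNAME", user="scott", passwd="tiger")
cur = conn.cursor()

# With one connection you can reference both schemas by qualifying the table
# names, so there is no USE-statement bookkeeping to get wrong.
cur.execute("SELECT id FROM GAMES.my_games_table ORDER BY id DESC LIMIT 10")
newest_games = cur.fetchall()
cur.execute("SELECT user_id FROM USERS.my_users_table LIMIT 10")
users = cur.fetchall()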

DB Connectivity from multiple modules

I am working on a system where a bunch of modules connect to an MS SQL Server DB to read/write data. Each of these modules is written in a different language (C#, Java, C++), as each language serves the purpose of that module best.
My question, however, is about the DB connectivity. As of now, all these modules use the language-specific SQL connectivity API to connect to the DB. Is this a good way of doing it?
Or alternatively, is it better to have a Python (or some other scripting lang) script take over the responsibility of connecting to the DB? The modules would then send in input parameters and the name of a stored procedure to the Python Script and the script would run it on the database and send the output back to the respective module.
Are there any advantages of the second method over the first ?
Thanks for helping out!
If we assume that each language you use will have an optimized set of classes to interact with databases, then there shouldn't be a real need to pass all database calls through a centralized module.
Using a "middle-ware" for database manipulation does offer a very significant advantage. You can control, monitor and manipulate your database calls from a central and single location. So, for example, if one day you wake up and decide that you want to log certain elements of the database calls, you'll need to apply the logical/code change only in a single piece of code (the middle-ware). You can also implement different caching techniques using middle-ware, so if the different systems share certain pieces of data, you'd be able to keep that data in the middle-ware and serve it as needed to the different modules.
The above is a very advanced edge-case and it's not commonly used in small applications, so please evaluate the need for the above in your specific application and decide if that's the best approach.
Doing things the way you do them now is fine (if we follow the above assumption) :)

Are SQLite reads always hitting disk?

I have a Pylons application using SQLAlchemy with SQLite as the backend. I would like to know if every read operation going to SQLite will always lead to a hard disk read (which is very slow compared to RAM), or whether some caching mechanism is already involved.
Does SQLite maintain a subset of the database in RAM for faster access?
Can the OS (Linux) do that automatically?
How much speedup could I expect by using a production database (MySQL or PostgreSQL) instead of SQLite?
Yes, SQLite has its own memory cache. Check PRAGMA cache_size, for instance. Also, if you're looking for speedups, check PRAGMA temp_store. There is also an API for implementing your own cache.
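A minimal sketch of tweaking those pragmas from Python's sqlite3 module (the database filename and cache size are arbitrary):

import sqlite3

conn = sqlite3.connect("app.db")  # hypothetical database file

# Negative cache_size values are interpreted as KiB, so this asks SQLite to
# keep roughly 64 MB of pages cached in this connection's memory.
conn.execute("PRAGMA cache_size = -65536")
# Keep temporary tables and indices in memory instead of on disk.
conn.execute("PRAGMA temp_store = MEMORY")

print(conn.execute("PRAGMA cache_size").fetchone())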
The SQLite database is just a file to the OS. Nothing is 'automatically' done for it. To ensure caching does happen, there are sqlite.h defines and runtime pragma settings.
It depends, there are a lot of cases when you'll get a slowdown instead.
How much speedup could I expect by using a production database (MySQL or PostgreSQL) instead of SQLite?
Are you using SQLite in a production server environment? You probably shouldn't be.
From Appropriate Uses For SQLite:
SQLite will normally work fine as the database backend to a website. But if your website is so busy that you are thinking of splitting the database component off onto a separate machine, then you should definitely consider using an enterprise-class client/server database engine instead of SQLite.
SQLite was not designed for, and was never intended to, scale well; it trades performance for convenience. If performance is a concern, you should consider another DBMS.

How to interface with another database effectively using Python

I have an application that needs to interface with another app's database. I have read access but not write.
Currently I'm using SQL statements via pyodbc to grab the rows and using Python to manipulate the data. Since I don't cache anything, this can be quite costly.
I'm thinking of using an ORM to solve my problem. The question is, if I use an ORM like SQLAlchemy, would it be smart enough to pick up changes in the other database?
E.g. SQLAlchemy accesses a table and retrieves a row. If that row got modified outside of SQLAlchemy, would it be smart enough to pick it up?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit: To be more clear
I have one application that is simply a reporting tool; let's call it App A.
I have another application that handles various financial transactions, called App B.
A has access to B's database to retrieve the transactions and generates various reports. There are hundreds of thousands of transactions. We're currently caching this info manually in Python; if we need an updated report, we refresh the cache. If we get rid of the cache, the SQL queries combined with the calculations become unscalable.
I don't think an ORM is the solution to your performance problem. By default, ORMs tend to be less efficient than raw SQL because they might fetch data that you're not going to use (e.g. doing a SELECT * when you need only one field), although SQLAlchemy allows fine-grained control over the SQL generated.
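For instance, a minimal sketch of fetching only the needed columns with SQLAlchemy Core (the connection URL and the table/column names are hypothetical):

from sqlalchemy import MetaData, Table, create_engine, select

engine = create_engine("mssql+pyodbc://reader:secret@appB_dsn")  # hypothetical URL
meta = MetaData()
transactions = Table("transactions", meta, autoload_with=engine)

# Select only the two columns the report needs instead of SELECT *.
stmt = select(transactions.c.id, transactions.c.amount)
with engine.connect() as conn:
    rows = conn.execute(stmt).fetchall()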
Now to implement a caching mechanism, depending on your application, you could use a simple dictionary in memory or a specialized system such as memcached or Redis.
To keep your cached data relatively fresh, you can poll the source at regular intervals, which might be OK if your application can tolerate a little delay. Otherwise you'll need the application that has write access to the db to notify your application or your cache system when an update occurs.
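A minimal sketch of that polling approach (the DSN, query, and refresh interval are all made up; App A's report code would read from CACHE):

import threading
import time

import pyodbc

CONN_STR = "DSN=appB_read_only"  # hypothetical read-only DSN for App B's database
CACHE = {"transactions": [], "refreshed_at": None}

def refresh_cache():
    # Pull the rows once and keep them in memory for the report calculations.
    with pyodbc.connect(CONN_STR) as conn:
        rows = conn.cursor().execute(
            "SELECT id, amount, booked_at FROM transactions"
        ).fetchall()
    CACHE["transactions"] = rows
    CACHE["refreshed_at"] = time.time()

def poll_forever(interval_seconds=300):
    # Reports read from CACHE and tolerate data at most interval_seconds old.
    while True:
        refresh_cache()
        time.sleep(interval_seconds)

threading.Thread(target=poll_forever, daemon=True).start()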
Edit: since you seem to have control over app B, and you've already got a cache system in app A, the simplest way to solve your problem is probably to create a callback in app A that app B can call to expire cached items. Both apps need to agree on a convention to identify cached items.
