Example scenario:
MySQL running on a single server -> HOSTNAME
Two MySQL databases on that server -> USERS, GAMES.
Task -> Fetch the 10 newest games from GAMES.my_games_table, and fetch the users playing those games from USERS.my_users_table (assume no joins).
In Django as well as Python MySQLdb, why is having one cursor per database preferable?
What is the disadvantage of a single cursor per MySQL server that can switch databases (e.g. by issuing "USE USERS;") and then work on the corresponding database?
MySQL connections are cheap, but isn't a single connection better than many, if there is a linear flow and no complex transactions which might need two cursors?
A shorter answer would be, "MySQL doesn't support that type of cursor", so neither does Python-MySQL, so the reason one connection per database is preferred is that that's simply the way MySQL works. Which is sort of a tautology.
However, the longer answer is:
A 'cursor', by your definition, would be some type of object accessing tables and indexes within an RDBMS, capable of maintaining its state.
A 'connection', by your definition, would accept commands, and either allocate or reuse a cursor to perform the action of the command, returning its results to the connection.
By your definition, a 'connection' would/could manage multiple cursors.
You believe this would be the preferred/performant way to access a database as 'connections' are expensive, and 'cursors' are cheap.
However:
A cursor in MySQL (and other RDBMSs) is not the user-accessible mechanism for performing operations. MySQL (and others) perform operations on a "set", or rather, they compile your SQL command into an internal list of commands, and do numerous, complex bits depending on the nature of your SQL command and your table structure.
A cursor is a specific mechanism, utilized within stored procedures (and there only), giving the developer a way to work with data in a procedural way.
A 'connection' in MySQL is what you think of as a 'cursor', sort of. MySQL does not expose its internals for you as an iterator, or pointer, that is merely moving over tables. It exposes its internals as a 'connection' which accepts SQL and other commands, translates those commands into an internal action, performs that action, and returns its result to you.
This is the difference between a 'set' and a 'procedural' execution style (which is really about the granularity of control you, the user, are given access to, or at least, the granularity inherent in how the RDBMS abstracts away its internals when it exposes them via an API).
As you say, MySQL connections are cheap, so for your case, I'm not sure there is a technical advantage either way, outside of code organization and flow. It might be easier to manage two cursors than to keep track of which database a single cursor is currently talking to by painstakingly tracking SQL 'USE' statements. Mileage with other databases may vary -- remember that Django strives to be database-agnostic.
Also, consider the case where two different databases, even on the same server, require different access credentials. In such a case, two connections will be necessary, so that each connection can successfully authenticate.
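To make the trade-off concrete, here is a minimal MySQLdb sketch of both approaches, using the names from the question (HOSTNAME, GAMES.my_games_table, USERS.my_users_table); the credentials and column names are placeholders I made up:
import MySQLdb

# Approach 1: one connection (and cursor) per database.
games_conn = MySQLdb.connect(host="HOSTNAME", user="user", passwd="secret", db="GAMES")
users_conn = MySQLdb.connect(host="HOSTNAME", user="user", passwd="secret", db="USERS")
games_cur = games_conn.cursor()
users_cur = users_conn.cursor()

# Hypothetical columns; adjust to the real schema. Assumes at least one game exists.
games_cur.execute("SELECT game_id FROM my_games_table ORDER BY created_at DESC LIMIT 10")
game_ids = [row[0] for row in games_cur.fetchall()]

placeholders = ", ".join(["%s"] * len(game_ids))
users_cur.execute("SELECT user_id, name FROM my_users_table WHERE game_id IN (%s)" % placeholders, game_ids)
users = users_cur.fetchall()

# Approach 2: a single connection, switching schemas with USE (or by fully
# qualifying names like GAMES.my_games_table in every statement).
conn = MySQLdb.connect(host="HOSTNAME", user="user", passwd="secret")
cur = conn.cursor()
cur.execute("USE GAMES")
cur.execute("SELECT game_id FROM my_games_table ORDER BY created_at DESC LIMIT 10")
game_ids = [row[0] for row in cur.fetchall()]
placeholders = ", ".join(["%s"] * len(game_ids))
cur.execute("USE USERS")  # easy to forget in a longer script
cur.execute("SELECT user_id, name FROM my_users_table WHERE game_id IN (%s)" % placeholders, game_ids)
users = cur.fetchall()
Functionally both return the same rows; the difference is only in how much connection and schema bookkeeping the surrounding code has to do.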
One cursor per database is not necessarily preferable; it's just the default behavior.
The rationale is that different databases are more often than not on different servers, use different engines, and/or need different initialization options. (Otherwise, why should you be using different "databases" in the first place?)
In your case, if your two databases are just namespaces of tables (what should be called "schemas" in SQL jargon) but reside on the same MySQL instance, then by all means use a single connection. (How to configure Django to do so is actually an altogether different question.)
You are also right that a single connection is better than two, if you only have a single thread and don't actually need two database workers at the same time.
Related
I had a question regarding the Python ibm_db package (but I think it could apply to any of the packages that employ the connection/cursor logic, e.g. pyodbc).
When the cursor.execute() method is called, it executes an SQL query on the database. However, to access this data, you would need to use fetchall() or the other fetch methods. I want to time the hit on the database.
Does the query completely finish running at the execute level, and it is in memory just for python to fetch? Or does the fetch method continue calling the database? I have scoured the documentation and am unable to find anything definitive on this subject.
Most or all of the Db2 open source drivers are based on the Call Level Interface (CLI). The CLI functions and details are part of the overall Db2 documentation. A Fetch() from a result set retrieves the next row.
AFAIK the result set can be served from a client-side cache or go back to the engine. It makes sense to bring in a few (dozen) rows at a time, but not some millions of rows.
You would need insights and understanding of how drivers and database query processing work in order to measure something useful and interpret it correctly.
BTW: There is some form of CLI tracing available.
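If all you need is a rough client-side split between the execute and fetch phases, a DB-API-level timing sketch like the one below works with pyodbc (and the same pattern applies to ibm_db_dbi, which also follows DB-API 2.0). Keep in mind that what you measure includes driver-side buffering, not purely server time; the connection string and table name are placeholders:
import time
import pyodbc

conn = pyodbc.connect("DSN=mydb;UID=user;PWD=secret")  # placeholder connection string
cur = conn.cursor()

t0 = time.perf_counter()
cur.execute("SELECT * FROM some_table")   # hypothetical table
t1 = time.perf_counter()
rows = cur.fetchall()
t2 = time.perf_counter()

print("execute():  %.3f s" % (t1 - t0))
print("fetchall(): %.3f s" % (t2 - t1))

cur.close()
conn.close()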
Is there a way of making pandas (or SQLAlchemy) output the SQL that would be executed by a call to to_sql() instead of actually executing it? This would be handy in many cases where I need to update multiple databases with the same data, but Python and pandas only exist on one of my machines.
According to the doc, use the echo parameter as:
engine = create_engine("mysql://scott:tiger@hostname/dbname", echo=True)
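As a minimal sketch of how that looks in practice (the credentials and table name are placeholders), every statement the engine runs on behalf of to_sql() is then logged to stdout as it executes:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql://scott:tiger@hostname/dbname", echo=True)

df = pd.DataFrame({"a": [1, 2, 3]})
# The CREATE TABLE and INSERT statements appear in the engine's log output.
df.to_sql("my_table", con=engine, if_exists="replace", index=False)
Note that echo=True logs the SQL while it is executed; it does not suppress execution.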
This is more a process question than a programming one. First, consider the use of multiple databases. Relational database management systems (RDBMSs) are designed as multi-user systems for many simultaneous users/apps/clients/machines. Designed to run as ONE system, the database serves as the central repository for related applications. Some argue databases should be agnostic to apps and be data-centric (the Postgres folks) and others believe databases should be app-centric (the MySQL folks). Overall, understand they are more involved than a flat-file spreadsheet or data frame.
Usually, RDBMSs come in two structural types:
file-level systems like SQLite and MS Access (where databases reside in a file saved to a local directory); though still powerful and multi-user, these systems mostly serve smaller business applications with a relative handful of users or small team sizes
server-level systems like SQL Server, MySQL, PostgreSQL, DB2, Oracle (where databases run over a network without any localized file); these serve as enterprise-level systems that run full-scale business operations over LAN intranets or web networks
Meanwhile, pandas is not a database but a data analysis toolkit (much like MS Excel), though it can import/export queried result sets from RDBMSs. Therefore, it maintains no native SQL dialect for DDL/DML procedures. Moreover, pandas runs in memory on the OS calling the Python script and cannot be shared by other clients/machines. Pandas does not track changes the way you intend, so it cannot know the different states of a data frame during a script's runtime unless you design it that way, keeping a before and after and identifying column/row changes.
With that mouthful said, why not use ONE database and have your Python script serve as just another of the many clients that connect to the database to import/export data into a data frame? Hence, after every data frame change, actually run to_sql(). Recall pandas' to_sql uses the if_exists argument:
# DROPS TABLE, RECREATES IT, AND UPDATES IT
df.to_sql(name='tablename', con=conn, if_exists='replace')
# APPENDS DF DATA TO EXISTING TABLE
df.to_sql(name='tablename', con=conn, if_exists='append')
In turn, every app/machine that connects to the centralized database will only need to refresh their instance and current data would be available in real-time for their end use needs. Though of course, table-locking states can be an issue in multi-user environments if another user had a table record in edit mode while your script tried updating it. But transactions here may help.
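A minimal sketch of that workflow from one client's point of view (the connection string, table, and column names are placeholders):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql://user:secret@dbserver/reporting")  # placeholder

# Pull the current data, work on it in pandas, then push the result back
# so every other client that refreshes sees the same state.
df = pd.read_sql("SELECT * FROM sales", con=engine)
df["total"] = df["quantity"] * df["unit_price"]  # hypothetical columns
df.to_sql("sales_enriched", con=engine, if_exists="replace", index=False)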
I'm coming from a very heavy Python->Oracle development environment and have been playing around with Clojure quite a bit. I love the ease of access that cx_Oracle gives me to the database on the Python end and was wondering if Clojure has something similar.
Specifically what I'm looking for is something that gives me easy access to a database connection, à la cx_Oracle's "username/password@tns_name" format.
The best I've come up with so far is:
(defn get-datasource [user password server service]
{:datasource (clj-dbcp.core/make-datasource {:adapter :oracle
:style :service-name
:host server
:service-name service
:user user
:password password})})
This requires the server name, however, and 95% of my users don't know what server they're hitting, just the TNS name from tnsnames.ora.
In addition, I don't understand when I have a database connection and when it disconnects. With cx_Oracle I either had to do a with cx_Oracle.connect()... or a connection.close() to close the connection.
Can someone give me guidance as to how datasources work as far as connections go and the easiest way to connect to a database given a username, password, and tns alias?
Thanks!!
The best option is Clojure's most idiomatic database library, clojure.java.jdbc.
First, because the Oracle driver isn't available from a Maven repository, we need to download the latest one and install it in our local repository, using the lein-localrepo plugin:
lein localrepo install -r D:\Path\To\Repo\ D:\Path\To\ojdbc6.jar oracle.jdbc/oracledriver "12.1.0.1"
Now we can reference it in our project.clj, together with clojure.java.jdbc.
(defproject oracle-connect "0.1.0-SNAPSHOT"
:dependencies [[org.clojure/java.jdbc "0.3.3"]
[oracle.jdbc/oracledriver "12.1.0.1"]])
After starting a REPL we can connect to the database through a default host/port/SID connection
(ns oracle-connect
(:require [clojure.java.jdbc :as jdbc]))
(def db
  {:classname "oracle.jdbc.OracleDriver"
   :subprotocol "oracle:thin"
   :subname "@hostname:port:sid"
   :user "username"
   :password "password"})
(jdbc/query db ["select ? as one from dual" 1])
db is just a basic map, referred to as the db-spec. It is not a real connection, but has all the information needed to make one. Clojure.java.jdbc makes one when needed, for instance in (query db ..).
We need to enter the classname manually because clojure.java.jdbc doesn't have a default mapping between the subprotocol and the classname for Oracle. This is probably because the Oracle JDBC driver has both thin and OCI JDBC connection options.
To make a connection with a TNS named database, the driver needs the location of the tnsnames.ora file. This is done by setting a system property called oracle.net.tns_admin.
(System/setProperty "oracle.net.tns_admin"
"D:/oracle/product/12.1.0.1/db_1/NETWORK/ADMIN")
Once this is set, all we need for the subname is the TNS name of the database.
(def db
  {:classname "oracle.jdbc.OracleDriver"
   :subprotocol "oracle:thin"
   :subname "@tnsname"
   :user "username"
   :password "password"})
(jdbc/query db ["select ? as one from dual" 1])
Now on to the 'how do connections work' part. As stated earlier, clojure.java.jdbc creates connections when needed, for instance within the query function.
If all you want to do is transform the results of a query, you can pass in two extra optional named parameters: :row-fn and :result-set-fn. Every row is transformed with the row-fn, after which the whole result set is transformed with the result-set-fn.
Both of these are executed within the context of the connection, so the connection is guaranteed to be open until all these actions have been performed, unless these functions return lazy sequences.
By default the :result-set-fn is defined as doall, guaranteeing all results are realized, but if you redefine it, be sure to realize all lazy results. Usually, when you get a 'connection closed' or 'result set closed' exception while using results outside of that scope, the problem is that you didn't.
The connection only exists within the scope of the query function. At the end it is closed. This means that every query opens (and closes) its own connection. If you want multiple queries done within one connection, you can wrap them in a with-db-connection:
(jdbc/with-db-connection [c db]
(doall (map #(jdbc/query c ["select * from EMP where DEPTNO = ?" %])
(jdbc/query c ["select * from DEPT"] :row-fn :DEPTNO))))
In the with-db-connection binding you bind the db-spec to a var, and use that var instead of the db-spec in statements inside the binding scope. It creates a connection and adds that to the var. The other statements will use that connection. This is especially handy when creating dynamic queries based on the result of other queries.
The same thing goes for with-db-transaction. It has the same semantics as with-db-connection, however here the scope not only guarantees the same connection is used, but also that either all statements or none succeed by wrapping them in a transaction block. Both with-db-connection and with-db-transaction are nestable.
There are also more advanced options like creating connection pools and instead of having query et al. create or reuse single connections, have them draw a connection from the pool. See the clojure-doc.org documentation for those.
It seems that both MongoClient and MongoReplicaSetClient can connect to mongo replica sets. In fact, their documentation pages are nearly identical - same options, same methods, etc - except that the latter's constructor requires me to specify a replicaSet.
In both cases, we may specify a read preference. In both cases, we must handle the AutoReconnect exception if a stepdown occurs.
So my questions are:
Why would one use one versus the other, since one can perform the exact same operations with both?
Both can perform secondary reads, correct? The documentation says that the advantage of a ReplicaSetClient is that we can do secondary reads, but clearly they are supported in both.
The documentation says that the ReplicaSetClient features "replica set health monitoring." What exactly does that mean? Are there new methods I can invoke which tell me about a replset's health that I cannot otherwise do with MongoClient?
In theory a MongoReplicaSetClient will connect to all members of the replset, rather than just one. This is false: you may munge or omit any of the servers in the connection string, and both MongoClient and MongoReplicaSetClient are still able to connect. Am I missing something?
This was a confusing API choice that we regret in PyMongo 2.x. We will merge all the client classes into MongoClient in PyMongo 3, in April 2015:
http://emptysqua.re/blog/good-idea-at-the-time-pymongo-mongoreplicasetclient/
Meanwhile:
Use MongoReplicaSetClient when you plan to connect to a whole replica set. MongoClient only connects to one member.
A single MongoReplicaSetClient can be used to perform primary or secondary reads, as well as more sophisticated decision-making with read preferences, see my blog post on the subject. A MongoClient will connect to one member of the replica set (the primary) and always read from it, unless you make a direct connection to a secondary using MongoClient, in which case it will always read from that secondary.
MongoReplicaSetClient monitors the set's health with a background thread that periodically checks on all the members. The client tracks whether members are up, it tracks their ping times, and it notices when a member is added. This will reduce the number of exceptions you see on a flaky network or when the replica set's configuration changes, and it allows the client to correctly implement read preferences.
A MongoReplicaSetClient does in fact connect to all members, whereas a MongoClient only connects to one member. MongoReplicaSetClient tries to connect to each member listed in the connection string; as soon as it connects to one it asks that member for a list of all other members. From this point forward it ignores your connection string and uses the list it got from the member it connected to.
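A minimal PyMongo 2.x sketch of the two clients side by side (the hosts, replica set name, and database/collection names are placeholders):
from pymongo import MongoClient, MongoReplicaSetClient, ReadPreference

# Connects to the whole set, monitors it, and can route reads to secondaries.
rs_client = MongoReplicaSetClient(
    "host1:27017,host2:27017,host3:27017",
    replicaSet="rs0",
    read_preference=ReadPreference.SECONDARY_PREFERRED)

# Connects to a single member only.
single_client = MongoClient("host1:27017")

doc = rs_client.mydb.mycollection.find_one()      # may be served by a secondary
doc = single_client.mydb.mycollection.find_one()  # always the member you connected to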
I have an application that needs to interface with another app's database. I have read access but not write.
Currently I'm using SQL statements via pyodbc to grab the rows and using Python to manipulate the data. Since I don't cache anything this can be quite costly.
I'm thinking of using an ORM to solve my problem. The question is, if I use an ORM like SQLAlchemy, would it be smart enough to pick up changes in the other database?
E.g. SQLAlchemy accesses a table and retrieves a row. If that row got modified outside of SQLAlchemy, would it be smart enough to pick it up?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit: To be more clear
I have one application that is simply a reporting tool; let's call it App A.
I have another application that handles various financial transactions called App B.
A has access to B's database to retrieve the transactions and generates various reports. There are hundreds of thousands of transactions. We're currently caching this info manually in Python; if we need an updated report we refresh the cache. If we get rid of the cache, the SQL queries combined with the calculations become unscalable.
I don't think an ORM is the solution to your performance problem. By default ORMs tend to be less efficient than raw SQL because they might fetch data that you're not going to use (e.g. doing a SELECT * when you need only one field), although SQLAlchemy allows fine-grained control over the SQL generated.
Now to implement a caching mechanism, depending on your application, you could use a simple dictionary in memory or a specialized system such as memcached or Redis.
To keep your cached data relatively fresh, you can poll the source at regular intervals, which might be OK if your application can tolerate a little delay. Otherwise you'll need the application that has write access to the db to notify your application or your cache system when an update occurs.
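A minimal sketch of the polling approach (the DSN, table name, and refresh interval are placeholders; swap the dictionary for memcached or Redis as needed):
import time
import pyodbc

CACHE_TTL = 300  # seconds; tune to how stale a report is allowed to be
_cache = {"rows": None, "loaded_at": 0.0}

def get_transactions(conn_str="DSN=appb;UID=report;PWD=secret"):  # placeholder DSN
    """Return cached rows, re-querying app B's database when the cache is stale."""
    now = time.time()
    if _cache["rows"] is None or now - _cache["loaded_at"] > CACHE_TTL:
        conn = pyodbc.connect(conn_str)
        try:
            cur = conn.cursor()
            cur.execute("SELECT * FROM transactions")  # hypothetical table
            _cache["rows"] = cur.fetchall()
            _cache["loaded_at"] = now
        finally:
            conn.close()
    return _cache["rows"]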
Edit: since you seem to have control over app B, and you've already got a cache system in app A, the simplest way to solve your problem is probably to create a callback in app A that app B can call to expire cached items. Both apps need to agree on a convention to identify cached items.