Using Python to Query multiple SQL databases on different servers - python

I have been doing a fair amount of manual data analysis, reporting and dashboarding recently via SQL and wonder if perhaps Python would be able to automate a lot of this. I am not familiar with Python at all so I hope my question makes sense. For security/performance reasons, we store databases on a number of servers (more than 5) which contain data that would be pertinent to a query. Unfortunately, these servers are set up so they cannot talk to each other, so I can't pull data from two servers in the same query. I believe this is a limitation due to using Windows credentials/security.
For my data analysis and reporting needs, I need to be able to grab pertinent data from two or more of these so the way I currently do this is by running a query, grabbing the results, running another query with the results, doing some formula work in excel, and then running another query and so on and so forth until I get what I need.
Unfortunately this is both time-consuming and makes me pull massive datasets (multiple millions of rows), which I then have to continually narrow down based on criteria that are in said databases.
I know Python has the ability to query SQL Server, however I figured I would ask the experts:
Can I manipulate the data in the background with Python similar to how I can with Excel (lookups, statistical functions, etc., perhaps even XML/web APIs)?
Can Python handle connections to multiple different database servers at the same time?
Does Python handle Windows credentials well?
If Python is not the tool for this, can you name one that would work better?
Please let me know if I can provide additional pertinent details.
Ideally, I would like to end up creating our own separate database and creating automated processes to pull everything from other databases but currently that is not possible due to project constraints.
Thanks!

I didn't use Windows credentials, but I have used Python to work with multiple MS SQL databases at the same time, and it worked very well. You can use the pymssql library or, better, SQLAlchemy.
But I think you should start with a basic Python tutorial first. Because you want to work with millions of rows, it's very important to understand list, set, tuple and dict in Python. For good performance, you should use the right type.
A basic example with pymssql:
import pymssql

# One connection per server; the two are independent of each other.
conn1 = pymssql.connect("Host1", "user1", "password1", "db1")
conn2 = pymssql.connect("Host2", "user2", "password2", "db2")
cursor1 = conn1.cursor()
cursor2 = conn2.cursor()

# SQL Server has no LIMIT clause; TOP restricts the row count instead.
cursor1.execute('SELECT TOP 10 * FROM TABLE1')
cursor2.execute('SELECT TOP 10 * FROM TABLE2')
result1 = cursor1.fetchall()
result2 = cursor2.fetchall()

# print each row from the first server
for row in result1:
    print(row)

# print each row from the second server
for row in result2:
    print(row)

conn1.close()
conn2.close()
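
The point about picking the right type can be sketched with the standard library alone. This is only an illustration: the rows, names and IDs below are made up, shaped like what fetchall() returns from two different servers. It shows a dict used as an Excel-style lookup, a set used as a filter, and a statistical function applied to the joined result.

```python
import statistics

# Hypothetical rows, shaped like cursor.fetchall() output (a list of tuples).
customers = [(1, "Acme"), (2, "Globex"), (3, "Initech")]   # from server 1
orders = [(101, 1, 250.0), (102, 3, 80.0),
          (103, 1, 120.0), (104, 2, 300.0)]                # from server 2

# A dict gives fast lookups by key, much like a lookup formula in Excel.
customer_name = dict(customers)

# A set is the right type for fast membership tests.
wanted_ids = {1, 3}

# Join the two result sets in memory and keep only the wanted customers.
joined = [
    (order_id, customer_name[cust_id], amount)
    for order_id, cust_id, amount in orders
    if cust_id in wanted_ids
]

# Statistical functions work on plain lists of numbers.
mean_amount = statistics.mean(amount for _, _, amount in joined)

print(joined)
print(mean_amount)
```

For millions of rows the same idea applies, but a library such as pandas will usually be more convenient than hand-written loops.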

You can do all of what you asked. Python lets you create multiple connection objects via a library. For example, with a MySQL library for Python you would create two different objects like this:
NOT ACTUAL CODE, JUST EXAMPLE
conn1 = mysqlConnect(server1, user, password)
conn2 = mysqlConnect(server2, user, password)
This way, conn1 connects to one database and conn2 connects to a different one. Usually you would then do:
conn1.execute(query_to_server_1)
conn2.execute(query_to_server_2)
This keeps two different connections open in the same script. If you are looking for multithreading, Python's standard library has modules for it (threading, concurrent.futures) that will help you execute multiple tasks from one master script.
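
A runnable sketch of the same two-connection idea, under some loud assumptions: sqlite3 in-memory databases stand in for the two servers, and the accounts and sales tables are invented for illustration. With real servers you would swap in your SQL Server driver's connect call. The point it tries to show is how keys fetched from one server can be passed as parameters to the other, so the second query returns only the rows you need instead of millions.

```python
import sqlite3

# Stand-ins for two separate servers (assumption: real code would use a
# SQL Server driver here instead of sqlite3).
server1 = sqlite3.connect(":memory:")
server2 = sqlite3.connect(":memory:")

# Hypothetical tables and data, just so the example runs.
server1.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, region TEXT)")
server1.executemany("INSERT INTO accounts VALUES (?, ?)",
                    [(1, "EU"), (2, "US"), (3, "EU")])
server2.execute("CREATE TABLE sales (account_id INTEGER, amount REAL)")
server2.executemany("INSERT INTO sales VALUES (?, ?)",
                    [(1, 10.0), (2, 20.0), (3, 30.0), (3, 5.0)])

# Step 1: fetch only the keys that matter from the first server.
eu_ids = [row[0] for row in server1.execute(
    "SELECT id FROM accounts WHERE region = ? ORDER BY id", ("EU",))]

# Step 2: hand those keys to the second server as query parameters.
placeholders = ", ".join("?" for _ in eu_ids)
query = (
    "SELECT account_id, SUM(amount) FROM sales "
    f"WHERE account_id IN ({placeholders}) "
    "GROUP BY account_id ORDER BY account_id"
)
totals = server2.execute(query, eu_ids).fetchall()
print(totals)
```

Drivers and servers cap the number of parameters per query, so a long key list would have to be sent in chunks.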

Related

How to use multiple dbs in one query?

How would one write this piece of code using influxdb-python client?
SELECT column1 INTO 'db2.retention_policy2.measurement2.' FROM 'db1.retention_policy1.measurement1.' WHERE time > '2019-01-01';
I get the fact that you can create two connections, or even just use one connection to query a db.
One approach could be this:
Get the data required from db1
Switch db using connection.switch_database("db2")
Then what are some way(s) to push the data into db2?
Thank you.
I just had to give read/write permissions to both databases.
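
For completeness, the client-side approach described in the question (read from db1, then write into db2) might look roughly like the sketch below. It assumes client is an influxdb.InfluxDBClient and relies on the database keyword of its query and write_points methods, so no switch_database call is needed. The copy_points helper and its parameter names are hypothetical, not part of the library.

```python
def copy_points(client, query, source_db, target_db, target_measurement,
                target_retention_policy=None):
    """Read points from source_db with `query` and write them to target_db.

    `client` is assumed to be an influxdb.InfluxDBClient.
    Returns the number of points written.
    """
    result = client.query(query, database=source_db)

    points = []
    for row in result.get_points():
        row = dict(row)
        timestamp = row.pop("time")
        # Drop null values; a point needs at least one field to be written.
        fields = {key: value for key, value in row.items() if value is not None}
        if fields:
            points.append({
                "measurement": target_measurement,
                "time": timestamp,
                "fields": fields,
            })

    if points:
        client.write_points(points, database=target_db,
                            retention_policy=target_retention_policy)
    return len(points)
```

Note that every returned column other than time is written back as a field here, so tags in the query result would need separate handling. When the server-side SELECT ... INTO works, as it did once permissions were fixed, it avoids moving the data through the client at all.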

Flask website backend structure guidance assistance?

I have a basic personal project website that I am looking to learn some web dev fundamentals with and database (SQL) fundamentals as well (If SQL is even the right technology to use??).
I have the basic skeleton up and running but as I am new to this, I want to make sure I am doing it in the most efficient and "correct" way possible.
Currently the site has a main index (landing) page and from there the user can select one of a few subpages. For the sake of understanding, each of these sub pages represents a different surf break and they each display relevant info about that particular break i.e. wave height, wind, tide.
As I have already been able to successfully scrape this data, my main questions revolve around how would I go about inserting this data into a database for future use (historical graphs, trends)? How would I ensure data is added to this database in a continuous manner (once/day)? How would I use data that was scraped from an earlier time, say at noon, to be displayed/used at 12:05 PM rather than scraping it again?
Any other tips, guidance, or resources you can point me to are much appreciated.
This kind of data is called a time series. There are specialized database engines for time series, but with a modest volume of observations, i.e. (timestamp, wave height, wind, tide, which break it is) tuples, a SQL database will be perfectly fine.
Try to model your data as a table in Postgres or MySQL. Start by making a table and manually inserting some fake data in a GUI client for your database. When it looks right, you have your schema. The corresponding CREATE TABLE statement is your DDL. You should be able to write SELECT queries against your table that yield the data you want to show on your webapp. If these queries are awkward, it's a sign that your schema needs revision. Save your DDL. It's (sort of) part of your source code. I imagine two tables: a listing of surf breaks, and a listing of observations. Each row in the listing of observations would reference the listing of surf breaks. If you're on a Mac, Sequel Pro is a decent tool for playing around with a MySQL database, and playing around is probably the best way to learn to use one.
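
The two-table layout described above could be sketched like this. Everything here is an assumption for illustration: the column names, the units, and the use of SQLite (via Python's built-in sqlite3 module) so that the example runs without a server. The DDL for Postgres or MySQL would differ slightly in its types.

```python
import sqlite3

# Hypothetical schema: one row per surf break, one row per observation.
DDL = """
CREATE TABLE surf_breaks (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE observations (
    surf_break_id INTEGER NOT NULL REFERENCES surf_breaks(id),
    observed_at   TEXT    NOT NULL,  -- ISO 8601 timestamp
    wave_height_m REAL,
    wind_kts      REAL,
    tide_m        REAL,
    UNIQUE (surf_break_id, observed_at)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Fake data, inserted by hand to check that the schema feels right.
conn.execute("INSERT INTO surf_breaks (id, name) VALUES (1, 'Example Point')")
conn.execute(
    "INSERT INTO observations VALUES (1, '2024-01-01T12:00:00', 1.5, 10.0, 0.8)"
)

# The kind of SELECT the web app would eventually run.
rows = conn.execute(
    "SELECT b.name, o.observed_at, o.wave_height_m "
    "FROM observations o JOIN surf_breaks b ON b.id = o.surf_break_id"
).fetchall()
print(rows)
```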
Next, try to insert data to the table from a Python script. Starting with fake data is fine, but mold your Python script to read from your upstream source (the result of scraping) and insert into the table. What does your scraping code output? Is it a function you can call? A CSV you can read? That'll dictate how this script works.
It'll help if this import script is idempotent: you can run it multiple times and it won't make a mess by inserting duplicate rows. It'll also help if it is incremental: once your dataset grows large, it will be very expensive to recompute the whole thing. Try to deal with importing a specific interval at a time. A command-line tool is fine. You can specify the interval as a command-line argument, or figure it out from the current time.
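
One way to get idempotence, sketched with SQLite so it runs anywhere: put a UNIQUE constraint on whatever identifies an observation and let the database skip rows it already has. The table layout and the import_observations name are assumptions. INSERT OR IGNORE is SQLite syntax; Postgres spells it INSERT ... ON CONFLICT DO NOTHING and MySQL uses INSERT IGNORE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observations (
        surf_break_id INTEGER NOT NULL,
        observed_at   TEXT    NOT NULL,
        wave_height_m REAL,
        UNIQUE (surf_break_id, observed_at)
    )
""")

def import_observations(conn, rows):
    """Insert rows, skipping any (break, timestamp) pair already stored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO observations VALUES (?, ?, ?)", rows
        )

# Hypothetical scraper output for one interval.
scraped = [(1, "2024-01-01T12:00:00", 1.5), (1, "2024-01-02T12:00:00", 1.2)]

import_observations(conn, scraped)
import_observations(conn, scraped)  # running it again adds nothing

count = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
print(count)
```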
The general problem here, loading data from one system into another on a regular schedule, is called ETL. You have a very simple case of it, and can use very simple tools, but if you want to read about it, that's what it's called. If instead you could get a continuous stream of observations - say, straight from the sensors - you would have a streaming ingestion problem.
You can use the Linux subsystem cron to make this script run on a schedule. You'll want to know whether it ran successfully - this opens a whole other can of worms about monitoring and alerting. There are various open-source systems that will let you emit metrics from your programs, basically a "hey, this happened" tick, see these metrics plotted on graphs, and ask to be emailed/texted/paged if something is happening too frequently or too infrequently. (These systems are, incidentally, one of the main applications of time-series databases). Don't get bogged down with this upfront, but keep it in mind. Statsd, Grafana, and Prometheus are some names to get you started Googling in this direction. You could also simply have your script send an email on success or failure, but people tend to start ignoring such emails.
You'll have written some functions to interact with your database engine. Extract these in a Python module. This forms the basis of your Data Access Layer. Reuse it in your Flask application. This will be easiest if you keep all this stuff in the same Git repository. You can use your chosen database engine's Python client directly, or you can use an abstraction layer like SQLAlchemy. This decision is controversial and people will have opinions, but just pick one. Whatever database API you pick, please learn what a SQL injection attack is and how to use user-supplied data in queries without opening yourself up to SQL injection. Your database API's documentation should cover the latter.
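
As a small illustration of the SQL injection point, here is what a Data Access Layer function with a parameterized query might look like. The table and the get_observations name are hypothetical, and sqlite3 is used only so the sketch is runnable. The ? placeholder is sqlite3's style; other drivers, pymssql and the common MySQL clients among them, use %s instead.

```python
import sqlite3

def get_observations(conn, surf_break_id):
    """Return (observed_at, wave_height_m) rows for one surf break."""
    # The value is passed separately from the SQL text, so the driver
    # treats it as data and never as part of the statement.
    return conn.execute(
        "SELECT observed_at, wave_height_m FROM observations "
        "WHERE surf_break_id = ? ORDER BY observed_at",
        (surf_break_id,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations "
    "(surf_break_id INTEGER, observed_at TEXT, wave_height_m REAL)"
)
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [(1, "2024-01-01T12:00:00", 1.5), (2, "2024-01-01T12:00:00", 0.7)],
)

safe = get_observations(conn, 1)
# A hostile value is compared as a plain string and matches nothing.
hostile = get_observations(conn, "1 OR 1=1")
print(safe, hostile)
```

Had the query been built with string formatting instead, the second call would have returned every row in the table.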
The / page of your Flask application will be based on a SQL query like SELECT * FROM surf_breaks. Render a link to the break-specific page for each one.
You'll have another page like /breaks/n where n identifies a surf break (an integer that increments as you insert surf break rows is customary). This page will be based on a query like SELECT * FROM observations WHERE surf_break_id = n. In each case, you'll call functions in your Data Access Layer for a list of rows, and then in a template, iterate through those rows and render some HTML. There are various Javascript and Python graphing libraries you can feed this list of rows into and get graphs out of (client side or server side). If you're interested in something like a week-over-week change, you should be able to express that in one SQL query and get that dataset directly from the database engine.
For performance, try not to get into a situation where more than one SQL query happens during a page load. By default, you'll be doing some unnecessary work by going back to the database and recomputing the page every time someone requests it. If this becomes a problem, you can add a reverse proxy cache in front of your Flask app. In your case this is easy, since nothing users do to the app causes its content to change. Simply invalidate the cache when you import new data.

Python and SQLAlchemy: How to detect external changes on database

Some devices are asynchronously storing values on a common remote MySQL database server.
I would like to write a supervisor app in Python (and possibly SQLAlchemy) to recognize the external INSERT events on the database and act upon the last rows' data. This is to avoid a long manual test to see if every table is being updated regularly or a logger crashed.
Can somebody just tell me where to search online for this kind of info and, even better, give an example?
EDIT
I already read all tables periodically using a datetime primary key ({date_time}), loading the last row of each table, and comparing to the previous values:
SELECT * FROM table ORDER BY date_time DESC LIMIT 1
but it looks very cumbersome and doesn't guarantee that I don't lose some rows between successive database checks.
The engine is an old version of InnoDB that I cannot upgrade: I cannot use the UPDATE field in schema because it simply doesn't work.
To reword my question:
How can I listen for any database event with a daemon-like Python application (sleeping thread) and wake up only when something happens?
I also want to avoid SQL triggers because they would be just too heavy to manage: there are hundreds of tables, and they are added/removed very often according to the active loggers.
I had a look at SQLAlchemy, but all the references I could find, unless I misunderstood them, are decorators that act on INSERTs made by SQLAlchemy itself. I didn't find anything about external changes to the database.
About the example request: I am not interested in a copy-and-paste, because first I want to understand how stuff works. I prefer (even incomplete) examples because SQLAlchemy documentation is far too deep for my knowledge and I simply cannot put the pieces together.
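
Not an answer to the listening part, but the polling approach from the EDIT can at least be tightened so that no rows are skipped between checks: remember the last date_time seen per table and ask for everything newer, instead of only the latest row. The sketch below is incomplete on purpose and makes assumptions: sqlite3 stands in for MySQL, logger_a and its value column are invented, and fetch_new_rows is a hypothetical helper.

```python
import sqlite3

def fetch_new_rows(conn, table, last_seen):
    """Return rows newer than last_seen, plus the updated high-water mark."""
    # Table names cannot be bound as parameters, so `table` must come from
    # a trusted list of known logger tables, never from user input.
    query = (f"SELECT date_time, value FROM {table} "
             "WHERE date_time > ? ORDER BY date_time")
    rows = conn.execute(query, (last_seen,)).fetchall()
    if rows:
        last_seen = rows[-1][0]
    return rows, last_seen

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logger_a (date_time TEXT PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO logger_a VALUES (?, ?)",
                 [("2024-01-01 00:00:00", 1.0), ("2024-01-01 00:01:00", 2.0)])

first, mark = fetch_new_rows(conn, "logger_a", "")

# Two rows arrive between checks; both are picked up, not just the last one.
conn.executemany("INSERT INTO logger_a VALUES (?, ?)",
                 [("2024-01-01 00:02:00", 3.0), ("2024-01-01 00:03:00", 4.0)])
second, mark = fetch_new_rows(conn, "logger_a", mark)
print(second, mark)
```

This still misses rows that arrive late with a timestamp older than the mark, and it is still polling rather than waking up on an event.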

Python ORM - save or read sql data from/to files

I'm completely new to managing data using databases so I hope my question is not too stupid but I did not find anything related using the title keywords...
I want to set up a SQL database to store computation results; these are performed using a Python library. My idea was to use a Python ORM like SQLAlchemy or peewee to store the results in a database.
However, the computations are done by several people on many different machines, including some that are not directly connected to internet: it is therefore impossible to simply use one common database.
What would be useful to me would be a way of saving the data in the ORM's format to be able to read it again directly once I transfer the data to a machine where the main database can be accessed.
To summarize, I want to do:
On the 1st machine: Python data -> ORM object -> ORM.fileformat
After transfer on a connected machine: ORM.fileformat -> ORM object -> SQL database
Would anyone know if existing ORMs offer that kind of feature?
Is there a reason why some of the machines cannot be connected to the internet?
If you really can't, what I would do is set up a database and the Python app on each machine where data is collected/generated. Have each machine use the app to store into its own local database; later you can create a dump of each machine's database and import those dumps into one database.
Not the ideal solution but it will work.
OK, thanks to MAhsan's and Padraic's answers I was able to find out how this can be done: the CSV format is indeed easy to use for import/export from a database.
Here are examples for SQLAlchemy (import 1, import 2, and export) and peewee
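
Without any ORM, the CSV round trip can be sketched with the standard library. The results table, its columns, and the function names are assumptions made up for this example. An in-memory buffer stands in for the file you would carry between machines, and sqlite3 stands in for both the local and the main database.

```python
import csv
import io
import sqlite3

SCHEMA = "CREATE TABLE results (run_id TEXT PRIMARY KEY, result REAL)"

def export_csv(conn, out_file):
    """Write every row of the results table to a CSV file object."""
    cursor = conn.execute("SELECT run_id, result FROM results ORDER BY run_id")
    writer = csv.writer(out_file)
    writer.writerow([column[0] for column in cursor.description])  # header
    writer.writerows(cursor)

def import_csv(conn, in_file):
    """Load rows from a CSV file object into the results table."""
    reader = csv.reader(in_file)
    next(reader)  # skip the header row
    with conn:
        conn.executemany(
            "INSERT INTO results (run_id, result) VALUES (?, ?)", reader
        )

# On the offline machine: store results locally, then export.
local = sqlite3.connect(":memory:")
local.execute(SCHEMA)
local.executemany("INSERT INTO results VALUES (?, ?)",
                  [("a1", 0.25), ("a2", 0.75)])
buffer = io.StringIO()  # stand-in for a real file on disk
export_csv(local, buffer)

# On the connected machine: import into the main database.
buffer.seek(0)
main = sqlite3.connect(":memory:")
main.execute(SCHEMA)
import_csv(main, buffer)
merged = main.execute(
    "SELECT run_id, result FROM results ORDER BY run_id").fetchall()
print(merged)
```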

Multiple pandas users connecting to SQL DB

New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 GB of data in ~6 GB .csv chunks.
School advised syncing data via Dropbox, but this seemed impractical for a 4-person team.
So, current solution is AWS EC2 & RDS instance (MySQL, I think it'll be, 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the db. Once you read the data in, it sits in memory locally, and nothing in the database is locked or frozen. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing SQL tables, see the pandas documentation.
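
A small demonstration of that behaviour, with assumptions stated up front: sqlite3 stands in for the MySQL RDS instance, both users share one connection for brevity, and the records table is invented. It shows that the frame is a snapshot, and that writing it back with if_exists="replace" silently overwrites whatever another user changed in the meantime.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO records VALUES (?, ?)", [(1, 1.0), (2, 2.0)])
conn.commit()

# User A builds a DataFrame from a query: a local, in-memory copy.
df = pd.read_sql_query("SELECT id, value FROM records ORDER BY id", conn)

# User B modifies the table afterwards; nothing stops them.
conn.execute("UPDATE records SET value = 99.0 WHERE id = 1")
conn.commit()

# The frame still holds the old value, the database holds the new one.
frame_value = float(df.loc[df["id"] == 1, "value"].iloc[0])
db_value = conn.execute(
    "SELECT value FROM records WHERE id = 1").fetchone()[0]

# User A writes the frame back and wipes out user B's change.
df.to_sql("records", conn, if_exists="replace", index=False)
after_write = conn.execute(
    "SELECT value FROM records WHERE id = 1").fetchone()[0]
print(frame_value, db_value, after_write)
```

So conflicts are not managed for you. If several people modify the same rows, they need a convention of their own, such as writing back only the rows they changed with UPDATE statements, or giving each person their own slice of the data.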
