Insert pandas dataframes to SQL - python

I have 10,000 dataframes (which can all be transformed into JSON). Each dataframe has 5,000 rows, so eventually it adds up to quite a lot of data that I would like to insert into my AWS RDS database.
I want to insert them into my database, but the process using PyMySQL is a bit too slow, since I iterate through every single row and insert it.
First question: is there a way to insert a whole dataframe into a table straight away? I've tried the "to_sql" method on the dataframe, but it doesn't seem to work, as I am using Python 3.6.
Second question: should I use NoSQL instead of RDS? What would be the best way to structure my (big) data?
Many thanks
from sqlalchemy import create_engine
engine = create_engine("mysql://......rds.amazonaws.com")
con = engine.connect()
my_df.to_sql(name='Scores', con=con, if_exists='append')
The table "Scores" already exists, and I would like to put all of my dataframes into this specific table. Or is there a better way to organise my data?

It seems like you're either missing the package or the package is installed in a different directory. Use a file manager to look for the missing library libmysqlclient.21.dylib and then copy it to the location that /Users/anaconda3/lib/python3.6/site-packages/MySQLdb/_mysql.cpython-36m-darwin.so expects to load it from.
My best guess is that it's in either your lib or MySQLdb directory. You may also be able to find it in a virtual environment that you have set up.
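Once the driver issue is sorted out, to_sql can batch the inserts itself, which is usually much faster than inserting row by row with PyMySQL. A minimal sketch, assuming the PyMySQL driver and a reasonably recent pandas; the connection string, database name and chunk size are placeholders, and list_of_dataframes stands in for however you hold your 10,000 dataframes:
from sqlalchemy import create_engine

# Minimal sketch: driver, connection string and chunk size are assumptions.
engine = create_engine("mysql+pymysql://user:password@......rds.amazonaws.com/dbname")

for df in list_of_dataframes:  # your 10,000 dataframes
    df.to_sql(
        name='Scores',
        con=engine,
        if_exists='append',
        index=False,
        chunksize=1000,   # send rows in batches rather than one INSERT per row
        method='multi',   # pack many rows into each INSERT statement (pandas >= 0.24)
    )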

Related

Creating an empty database in SQLite from a template SQLite file

I have a question about a problem that I'm sure comes up pretty often, but I still couldn't find any satisfying answers.
Suppose I have a huge original SQLite database. Now I want to create an empty database that is initialized with all the tables, columns, and primary/foreign keys, but has no rows in the tables. Basically, I want to create an empty database using the original one as a template.
Running "DELETE FROM {table_name}" for every table in the initial database won't do it, because the resulting empty database ends up being very large, just like the original one. As I understand it, this is because the pages freed by the deletes are kept inside the file rather than released. Also, I don't know exactly what else is being stored in the original SQLite file; I believe it may store other logs/garbage that I don't need, so I would prefer to just create an empty one from scratch if possible. FWIW, I eventually need to do this inside a Python script with the sqlite3 library.
Would anyone please provide an example of how this can be done? A set of SQLite queries or a Python script that would do the job?
You can use the command-line shell for SQLite:
sqlite3 old.db .schema | sqlite3 new.db
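If you would rather do it entirely from a Python script with the sqlite3 library, a rough sketch along these lines could replay the schema stored in sqlite_master into a fresh file (the file names are placeholders):
import sqlite3

# Rough sketch: 'old.db' and 'new.db' are placeholder file names.
src = sqlite3.connect("old.db")
dst = sqlite3.connect("new.db")

# sqlite_master holds the original CREATE statements for tables, indexes, etc.
schema = src.execute(
    "SELECT sql FROM sqlite_master "
    "WHERE sql IS NOT NULL AND name NOT LIKE 'sqlite_%'"
).fetchall()

for (create_sql,) in schema:
    dst.execute(create_sql)

dst.commit()
src.close()
dst.close()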

How to create a SQL table from several SQL files?

All of this is in the context of an ETL process. I have a git repository full of SQL files. I need to put all those SQL files (once pulled) into a SQL table with two columns, name and query, so that I can later access each file with a SQL query instead of loading it from its file path. How can I do this? I am free to use whatever tool I want, but I only know Python and Pentaho.
Maybe the assumption that this method would require less computation time than simply accessing the pulled file on the hard drive is wrong. In that case, let me know.
You can first define the table you're interested in, using something along the lines of the following (you did not mention which database you are using):
CREATE TABLE queries (
    name TEXT PRIMARY KEY,
    query TEXT
);
After creating the table, you can use, for example, os.walk to iterate through the files in your repository and insert both the contents (e.g. file.read()) and the name of each file into the table you created previously, as sketched below.
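A rough sketch of that loop, using sqlite3 purely as an example backend since the target database was not specified (the repository path and database file name are placeholders):
import os
import sqlite3

# Rough sketch: sqlite3 is used only as an example backend, and the
# repository path is a placeholder.
conn = sqlite3.connect("queries.db")
conn.execute("CREATE TABLE IF NOT EXISTS queries (name TEXT PRIMARY KEY, query TEXT)")

for root, dirs, files in os.walk("/path/to/repo"):
    for filename in files:
        if filename.endswith(".sql"):
            with open(os.path.join(root, filename)) as f:
                conn.execute(
                    "INSERT OR REPLACE INTO queries (name, query) VALUES (?, ?)",
                    (filename, f.read()),
                )

conn.commit()
conn.close()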
It sounds like you're trying to solve a different problem, though. It seems like you're interested in speeding up some process, because you asked whether accessing queries through a table would be faster than opening a file on disk. To investigate that (separate!) question further, see this.
I would recommend that you profile the existing process you are trying to speed up using profiling tools. After that, you can see whether I/O really is your bottleneck; otherwise, you may do all of this work without any benefit.
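For example, a minimal profiling sketch with the standard-library cProfile module, where run_etl is a hypothetical stand-in for your existing entry point:
import cProfile
import pstats

# Minimal sketch: run_etl is a hypothetical stand-in for your existing code path.
cProfile.run('run_etl()', 'etl_profile.stats')

stats = pstats.Stats('etl_profile.stats')
stats.sort_stats('cumulative').print_stats(20)  # show the 20 most expensive calls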
As a side note, if you are looking up queries in this way, it may indicate that you need to rearchitect your application. Please consider that possibility as well.

Python pandas large database using excel

I am comfortable using Python / Excel / pandas for my dataframes. I do not know SQL or other database languages.
I am about to start a new project that will involve around 4,000 different Excel files that I have. I will open each file, save it as a dataframe, and then do my math on all 4,000 of them. This will include many computations such as sums, linear regression, and other normal stats.
My question is: I know how to do this with 5-10 files, no problem. Am I going to run into trouble with memory, or with the program taking hours to run? The files are around 300-600 kB each. I don't use any formulas in Excel; the files only hold data. Would I be better off having 4,000 separate files or 4,000 tabs? Or is this something a computer can handle without a problem? Thanks for looking into this; I have not worked with a lot of data before and would like to know if I am really screwing up before I begin.
You definitely want to use a database. At nearly 2 GB of raw data, you won't be able to do much with it without choking your computer; even reading it in would take a while.
If you feel comfortable with Python and pandas, I guarantee you can learn SQL very quickly. The basic syntax can be learned in an hour, and you won't regret learning it for future jobs; it's a very useful skill.
I'd recommend you install PostgreSQL locally and then use SQLAlchemy to create a database connection (or engine) to it. Then you'll be happy to hear that pandas actually has df.to_sql and pd.read_sql, making it really easy to push and pull data to and from it as you need. Also, SQL can do any basic math you want, like summing, counting, etc.
Connecting and writing to a SQL database is as easy as:
from sqlalchemy import create_engine
my_db = create_engine('postgresql+psycopg2://username:password@localhost:5432/database_name')
df.to_sql('table_name', my_db, if_exists='append')
I add the last if_exists='append' because you'll most likely want to add all 4,000 files to one table.
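A rough sketch of the loading loop for the 4,000 files might look like this; the directory path, table name and credentials are placeholders, and it assumes the files share a compatible column layout:
import glob
import pandas as pd
from sqlalchemy import create_engine

# Rough sketch: directory, credentials and table name are placeholders.
my_db = create_engine('postgresql+psycopg2://username:password@localhost:5432/database_name')

for path in glob.glob('/path/to/excel_files/*.xlsx'):
    df = pd.read_excel(path)
    df['source_file'] = path  # keep track of which workbook each row came from
    df.to_sql('table_name', my_db, if_exists='append', index=False)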

Updating mysql database using your findings in python

I am new to Python, so this might have wrong syntax as well. I want to bulk update my SQL database. I have already created a column in the database with null values. My aim is to compute the moving average in Python and then use those values to update the database. The problem is that the moving averages I have computed are in a data-table format where columns are time blocks and rows are dates, but in the database dates and time blocks are separate columns.
MA30 = pd.rolling_mean(df, 30)
cur.execute(""" UPDATE feeder_ndmc_copy
SET time_block = CASE date, (MA30)""")
db.commit()
This is the error I am getting.
self.errorhandler(self, exc, value)
I have seen a lot of other questions and answers, but there is no example of how to use the results of a Python computation to update the database. Any suggestions?
OK, so it is very hard to give you a complete answer with the little information your question contains, but I'll try my best to explain how I would deal with this.
The easiest way is probably to automatically write a separate UPDATE query for each row you want to update. If I'm not mistaken, this will be relatively efficient on the database side of things, but it will produce some overhead in your Python program. I'm not a database guy, but since you didn't mention performance optimality in your question, I will assume that any solution that works will do for now.
I will be using sqlalchemy to handle interactions with the database. Take care that if you want to copy my code, you will need to install sqlalchemy and import the module in your code.
First, we will need to create a SQLAlchemy engine. I will assume that you use a local database; if not, you will need to edit this part.
import sqlalchemy

engine = sqlalchemy.create_engine('mysql://localhost/yourdatabase')
Now let's create a string containing all our queries (I don't know the names of the columns you want to update, so I'll use placeholders; I also do not know the format of your time index, so I'll have to guess):
queries = ''
for index, value in MA30.iterrows():
    queries += "UPDATE feeder_ndmc_copy SET column_name = {} WHERE index_column_name = '{}';\n".format(value, index.strftime('%Y-%m-%d %H:%M:%S'))
You will need to adapt this part heavily to conform with your requirements. I simply can't do any better without you supplying the proper schema of your database.
With the list of queries complete, we proceed to put the data into SQL:
with engine.connect() as connection:
    with connection.begin():
        connection.execute(queries)
Edit:
Obviously my solution does not deal in any way with cases where your pandas operations create data points for timestamps that are not yet in MySQL, and so on. You need to keep that in mind. If that is a problem, you will need to use queries of the form
INSERT INTO table (id,Col1,Col2) VALUES (1,1,1),(2,2,3),(3,9,3),(4,10,12)
ON DUPLICATE KEY UPDATE Col1=VALUES(Col1),Col2=VALUES(Col2);
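As a rough sketch, such an upsert could also be issued from Python with parameter binding instead of string formatting; the column names and credentials are placeholders, and it assumes PyMySQL and that MA30 has been reduced to a single pandas Series indexed by timestamp:
import pymysql

# Rough sketch: column names and credentials are placeholders, and MA30 is
# assumed to be a single pandas Series indexed by timestamp.
conn = pymysql.connect(host='localhost', user='user', password='password', db='yourdatabase')

rows = [(ts.strftime('%Y-%m-%d %H:%M:%S'), float(val)) for ts, val in MA30.items()]

with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO feeder_ndmc_copy (index_column_name, column_name) "
        "VALUES (%s, %s) "
        "ON DUPLICATE KEY UPDATE column_name = VALUES(column_name)",
        rows,
    )
conn.commit()
conn.close()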

Use Python to load data into Mysql

Is it possible to set up tables for MySQL in Python?
Here's my problem: I have a bunch of .txt files which I want to load into a MySQL database. Instead of creating tables in phpMyAdmin manually, is it possible to do the following things all in Python?
Create the table, including the data type definition.
Load many files one by one. I only know the LOAD DATA LOCAL INFILE command, which loads one file.
Many thanks
Yes, it is possible. You'll need to read the data from the files using the csv module:
http://docs.python.org/library/csv.html
And then insert the data using a Python MySQL binding. Here is a good starter tutorial:
http://zetcode.com/databases/mysqlpythontutorial/
If you already know Python, it will be easy.
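As a rough sketch of both steps, following the csv-module suggestion above (the table layout, delimiter, file paths and credentials are all placeholders):
import csv
import glob
import pymysql

# Rough sketch: table layout, delimiter, file paths and credentials are placeholders.
conn = pymysql.connect(host='localhost', user='user', password='password', db='mydb')

with conn.cursor() as cur:
    # 1. Create the table, including the data type definition.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS measurements ("
        " id INT AUTO_INCREMENT PRIMARY KEY,"
        " recorded_at DATETIME,"
        " value DOUBLE)"
    )

    # 2. Load many files one by one.
    for path in glob.glob('/path/to/files/*.txt'):
        with open(path, newline='') as f:
            reader = csv.reader(f, delimiter='\t')  # adjust the delimiter to your files
            rows = [(recorded_at, value) for recorded_at, value in reader]
        cur.executemany(
            "INSERT INTO measurements (recorded_at, value) VALUES (%s, %s)",
            rows,
        )

conn.commit()
conn.close()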
It is. Typically what you want to do is use an Object-Relational Mapping (ORM) library.
Probably the most widely used in the Python ecosystem is SQLAlchemy, but there is a lot of magic going on in it, so if you want to keep tighter control over your DB schema, or if you are learning about relational DBs and want to follow along with what the code does, you might be better off with something lighter like Canonical's Storm.
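For reference, a minimal sketch of the ORM route with SQLAlchemy's declarative mapping; the table layout and connection string are placeholders, and the imports assume SQLAlchemy 1.4 or later:
from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.orm import declarative_base, Session

# Minimal sketch: table layout and connection string are placeholders.
Base = declarative_base()

class Measurement(Base):
    __tablename__ = 'measurements'
    id = Column(Integer, primary_key=True)
    label = Column(String(100))
    value = Column(Float)

engine = create_engine('mysql+pymysql://user:password@localhost/mydb')
Base.metadata.create_all(engine)  # creates the table with the declared types

with Session(engine) as session:
    session.add(Measurement(label='example', value=1.23))
    session.commit()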
EDIT: Just thought I'd add: the reason to use ORMs is that they provide a very handy way to manipulate data and interface with the DB. But if all you ever want to do is write a script to convert textual data to MySQL tables, then you might get along with something even easier. Check the tutorial linked from the official MySQL website, for example.
HTH!
