I am new to SQL and am working on a research project. We have years' worth of data from different sources, adding up to hundreds of terabytes. I currently have the data parsed as Python data frames. I need help setting up SQL from scratch, and I also need help compiling all our data into a SQL database. Please tell me everything I need to know about SQL as a beginner.
It is probably easiest to get started with one of the free RDBMS options, MySQL (https://www.mysql.com/) or PostgreSQL (https://www.postgresql.org/).
Once you've got that installed and configured, and have created the tables you wish to load, you can go with one of two routes to get your data in.
Either you can install the appropriate python libraries to connect to the server you've installed and then INSERT the data in.
Or, if there is a lot of data, look at dumping the data out into a flat file (.csv) and then use the bulk loader to push it into your tables (this is more hassle, but for larger data sets it will be faster).
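As a rough sketch of the flat-file route, the snippet below writes rows out to a CSV file using only the standard library. The function name dump_to_csv, the table name measurements, and the file paths are made up for illustration, and the bulk-load statements in the comments are examples you would adapt to your own server and schema.

```python
import csv

def dump_to_csv(rows, header, path):
    """Write an iterable of row tuples to a CSV flat file and return the row count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count

# Once the file exists, the server's bulk loader can read it in one statement, e.g.
# (hypothetical table and path names):
#   PostgreSQL: COPY measurements FROM '/path/to/measurements.csv'
#               WITH (FORMAT csv, HEADER true);
#   MySQL:      LOAD DATA LOCAL INFILE 'measurements.csv' INTO TABLE measurements
#               FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES;
```

Because the rows are written one at a time, this works with a generator, so the full data set never has to sit in memory at once.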
I am comfortable using Python, Excel and pandas for my DataFrames. I do not know SQL or any database languages.
I am about to start a new project that will involve around 4,000 different Excel files. I plan to open each of the 4,000 files, save it as a DataFrame, and then do my math on them. This will include many computations such as sums, linear regression, and other standard statistics.
I know how to do this with 5-10 files without any problem. My question is: am I going to run into memory problems, or will the program take hours to run? The files are around 300-600 kB each. I don't use any Excel functions; the files only hold data. Would I be better off with 4,000 separate files or 4,000 tabs? Or is this something a computer can handle without a problem? Thanks for looking into this. I have not worked with a lot of data before and would like to know if I am really screwing up before I begin.
You definitely want to use a database. At nearly 2 GB of raw data, you won't be able to do much with it without choking your computer; even reading it in would take a while.
If you feel comfortable with Python and pandas, I guarantee you can learn SQL very quickly. The basic syntax can be learned in an hour, and you won't regret learning it for future jobs; it's a very useful skill.
I'd recommend you install PostgreSQL locally and then use SQLAlchemy to create a database connection (or engine) to it. You'll be happy to hear that pandas has df.to_sql and pd.read_sql, which make it really easy to push and pull data to and from the database as you need it. SQL can also do any basic math you want, like summing, counting, etc.
Connecting and writing to a SQL database is as easy as:
from sqlalchemy import create_engine
my_db = create_engine('postgresql+psycopg2://username:password@localhost:5432/database_name')
df.to_sql('table_name', my_db, if_exists='append')
I add the last if_exists='append' because you'll want to add all 4000 to one table most likely.
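To give a feel for the "basic math" SQL can do, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so it runs without installing a server. The table name readings and its columns are invented for the example; the same SELECT would work on a table filled by df.to_sql.

```python
import sqlite3

# In-memory database used only as a stand-in for a real PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (source_file TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a.xlsx", 1.0), ("a.xlsx", 2.0), ("b.xlsx", 3.0), ("b.xlsx", 5.0)],
)

# Count, sum and average per source file, computed by the database itself.
summary = conn.execute(
    "SELECT source_file, COUNT(*), SUM(value), AVG(value) "
    "FROM readings GROUP BY source_file ORDER BY source_file"
).fetchall()
conn.close()

print(summary)
```

Keeping a column such as source_file is one way to put all 4,000 files in one table and still tell them apart later.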
I'm doing analysis on data from a MySQL database in Python. I query the database for about 200,000 rows of data, then analyze them in Python using pandas. I will often do many iterations over the same data, changing different variables, parameters, and such. Each time I run the program, I query the remote database (about a 10 second query), then discard the query results when the program finishes. I'd like to save the results of the last query in a local file, then check each time I run the program whether the query is the same, and if so just use the saved results. I guess I could just write the pandas DataFrame to a CSV, but is there a better/easier/faster way to do this?
If for any reason MySQL Query Cache doesn't help, then I'd recommend to save the latest result set either in HDF5 format or in Feather format. Both formats are pretty fast. You may find some demos and tests here:
Just use pickle to write the dataframe to a file, and to read it back out ("unpickle").
This would be the "easy way".
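A minimal sketch of that pickle approach, including the "is the query the same?" check from the question, might look like the following. The function name cached_query, the run_query callable, and the query_cache directory are all assumptions for illustration; the cache is keyed on a hash of the SQL text only.

```python
import hashlib
import pickle
from pathlib import Path

def cached_query(sql, run_query, cache_dir="query_cache"):
    """Return run_query(sql), reusing a pickled result if this exact SQL text was run before."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache key depends only on the SQL text, not on the data in the database.
    key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    result = run_query(sql)
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result
```

With pandas, run_query could be something like lambda q: pd.read_sql(q, engine). Note that this sketch will not notice when the remote data changes; deleting the cache directory forces a fresh query.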
More specifically, can I use some kind of bridge for this? For example, I could first copy the data from MongoDB to Excel, and then import the Excel sheets into MySQL with a script, for instance in Python.
MongoDB does not offer any direct tool to do this, but you have many options to achieve this.
You can:
Write your own tool, in your favorite language, that connects to both MongoDB and MySQL and copies the data
Use mongoexport to create files and mysqlimport to import them into MySQL
Use an ETL (Extract, Transform, Load) tool that connects to MongoDB, lets you transform the data, and pushes it into MySQL. For example, you can use Talend, which has a connector for MongoDB, but there are many other solutions.
Note: Keep in mind that a single document can contain complex structures such as arrays/lists, sub-documents, and even arrays of sub-documents. These structures cannot be imported directly into a single table record, which is why most of the time you need a small transformation/mapping layer.
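To make the transformation/mapping layer concrete, here is one possible sketch in plain Python. The function flatten_document and the sample document are hypothetical; the idea is that nested sub-documents become prefixed columns of the parent row, while arrays are set aside so they can go into a separate child table.

```python
def flatten_document(doc, parent_key="", sep="_"):
    """Flatten nested sub-documents into one flat dict; lists are returned separately."""
    flat, child_lists = {}, {}
    for key, value in doc.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Sub-document: recurse and prefix its keys with the parent key.
            sub_flat, sub_lists = flatten_document(value, column, sep)
            flat.update(sub_flat)
            child_lists.update(sub_lists)
        elif isinstance(value, list):
            # Array: cannot live in a single row, so hand it back for a child table.
            child_lists[column] = value
        else:
            flat[column] = value
    return flat, child_lists

# Hypothetical document, shaped like something mongoexport might produce.
doc = {"_id": 1, "name": "Ann", "address": {"city": "Oslo", "zip": "0150"}, "tags": ["a", "b"]}
row, children = flatten_document(doc)
```

Here row would map onto one record in the main MySQL table, and each entry in children would typically become rows in its own table that reference the parent's _id.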
Is it possible to set up tables for MySQL in Python?
Here's my problem: I have a bunch of .txt files which I want to load into a MySQL database. Instead of creating the tables manually in phpMyAdmin, is it possible to do all of the following in Python?
Create a table, including the data type definitions.
Load many files one by one. I only know the LOAD DATA LOCAL INFILE command, which loads one file.
Many thanks
Yes, it is possible. You'll need to read the data from the files using the csv module,
and then insert the data using a Python MySQL binding. Here is a good starter tutorial:
If you already know Python, it will be easy.
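As a hedged illustration of both steps (creating the table and loading many files), the sketch below uses the built-in sqlite3 module as a stand-in for a MySQL binding so it can run anywhere. The table name samples, its columns, and the assumption that each file has two tab-separated columns with no header are all invented for the example.

```python
import csv
import glob
import os
import sqlite3

def load_text_files(conn, pattern, delimiter="\t"):
    """Create the table if needed, then load every file matching pattern; return rows loaded."""
    # Step 1: create the table, including column types, from Python.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " source_file TEXT NOT NULL,"
        " name TEXT,"
        " value REAL)"
    )
    total = 0
    # Step 2: loop over the files and insert the rows of each one.
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            rows = [
                (os.path.basename(path), name, float(value))
                for name, value in csv.reader(f, delimiter=delimiter)
            ]
        conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
        total += len(rows)
    conn.commit()
    return total
```

With a real MySQL driver the structure stays the same, but the connection comes from that driver and the placeholders are usually %s rather than ?.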
It is. Typically what you want to do is use an Object-Relational Mapping (ORM) library.
Probably the most widely used one in the Python ecosystem is SQLAlchemy, but there is a lot of magic going on in it. If you want to keep tighter control over your DB schema, or if you are learning about relational DBs and want to follow what the code does, you might be better off with something lighter like Canonical's Storm.
EDIT: Just thought to add: the reason to use ORMs is that they provide a very handy way to manipulate data and interface with the DB. But if all you will ever want to do is write a script to convert textual data to MySQL tables, then you might get along with something even easier. Check the tutorial linked from the official MySQL website, for example.
I will be writing a little Python script tomorrow to retrieve all the data from an old MS Access database into a CSV file first, and then, after some data cleansing, munging, etc., I will import the data into a MySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSV because it is quite simple and straightforward (and I am a Python newbie) - but I would like to hear from someone who may have done something similar before.
Memory usage for csv.reader and csv.writer isn't proportional to the number of records, as long as you iterate correctly and don't try to load the whole file into memory. That's one reason the iterator protocol exists. Similarly, csv.writer writes directly to the file; it's not limited by available memory. You can process any number of records with these without memory limitations.
For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than more complicated formats like XML (tip: pulldom is painfully slow).
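A small sketch of what "iterating correctly" can look like is below: each row is read, cleaned, and written before the next one is touched, so memory use stays flat regardless of file size. The function name clean_csv and the particular cleaning rules (strip whitespace, skip blank rows) are just placeholders for whatever cleansing you actually need.

```python
import csv

def clean_csv(src_path, dst_path):
    """Copy src to dst one row at a time, stripping whitespace and skipping blank rows."""
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        # csv.reader is an iterator: only the current row is held in memory.
        for row in csv.reader(src):
            cleaned = [cell.strip() for cell in row]
            if not any(cleaned):
                continue  # skip rows that are empty after stripping
            writer.writerow(cleaned)
            kept += 1
    return kept
```

The thing to avoid is calling list(csv.reader(src)) or similar, which pulls the whole file into memory at once.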
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be operating system file size.
That said, make sure when you send the data to the new database, you're writing it a few records at a time; I've seen people do things where they try to load the entire file first, then write it.
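One way to write "a few records at a time" is sketched below, again with sqlite3 standing in for the MySQL connection so the example is runnable as-is. The function name insert_in_batches, the records table, and the batch size of 1000 are assumptions chosen for illustration.

```python
import sqlite3
from itertools import islice

def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows from any iterable in fixed-size batches, committing after each batch."""
    rows = iter(rows)
    total = 0
    while True:
        # Pull at most batch_size rows; the rest of the source stays unread.
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO records (id, payload) VALUES (?, ?)", batch)
        conn.commit()
        total += len(batch)
    return total

# Stand-in database and a generator that plays the role of the Access result set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
inserted = insert_in_batches(conn, ((i, f"row {i}") for i in range(2500)), batch_size=1000)
```

Since rows can be a generator or a database cursor, only one batch is ever held in memory, which is the point of the advice above.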