Python pandas large database using excel - python

I am comfortable using Python / Excel / pandas for my DataFrames. I do not know SQL or other database languages.
I am about to start on a new project that will include around 4,000 different Excel files I have. I will open each of the 4,000 files, save it as a DataFrame, and then do my math on them. This will include many computations such as sums, linear regression, and other normal stats.
My question is: I know how to do this with 5-10 files no problem. Am I going to run into a problem with memory, or with the program taking hours to run? The files are around 300-600 kB each. I don't use any functions in Excel, only holding data. Would I be better off having 4,000 separate files or 4,000 tabs? Or is this something a computer can handle without a problem? Thanks for looking into this; I have not worked with a lot of data before and would like to know if I am really screwing up before I begin.

You definitely want to use a database. At nearly 2 GB of raw data (4,000 files at 300-600 kB each), you won't be able to do much with it in memory without choking your computer; even reading it all in would take a while.
If you feel comfortable with Python and pandas, I guarantee you can learn SQL very quickly. The basic syntax can be learned in an hour, and you won't regret learning it for future jobs; it's a very useful skill.
I'd recommend you install PostgreSQL locally and then use SQLAlchemy to create a database connection (or engine) to it. Then you'll be happy to hear that pandas actually has df.to_sql and pd.read_sql, making it really easy to push and pull data to and from it as you need it. SQL can also do any basic math you want, like summing, counting, etc.
Connecting and writing to a SQL database is as easy as:
from sqlalchemy import create_engine
my_db = create_engine('postgresql+psycopg2://username:password@localhost:5432/database_name')
df.to_sql('table_name', my_db, if_exists='append')
I added if_exists='append' at the end because you'll most likely want to add all 4,000 files to one table.
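Put together, the loop over all the files might look like the sketch below. It uses an in-memory SQLite database and two tiny hand-made frames as stand-ins for the real PostgreSQL engine and Excel files (the file names and column are made up); in practice each frame would come from pd.read_excel(path) and the connection would be the SQLAlchemy engine above:

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for PostgreSQL; to_sql/read_sql accept
# either a SQLAlchemy engine or a plain sqlite3 connection.
conn = sqlite3.connect(':memory:')

# Two tiny frames stand in for pd.read_excel(path) on 4,000 files.
fake_files = {'sales_001.xlsx': pd.DataFrame({'x': [1, 2]}),
              'sales_002.xlsx': pd.DataFrame({'x': [3, 4]})}

for name, df in fake_files.items():
    df['source_file'] = name  # remember where each row came from
    df.to_sql('table_name', conn, if_exists='append', index=False)

# Once everything sits in one table, the math can happen in SQL or pandas.
total = pd.read_sql('SELECT SUM(x) AS s FROM table_name', conn)['s'][0]
```

The source_file column is optional, but it makes it easy to trace any row back to the spreadsheet it came from.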

Related

How to set up SQL and compile data frames into a SQL database?

I am new to SQL. I am working on a research project; we have years' worth of data from different sources summing up to hundreds of terabytes. I currently have it parsed as Python data frames. I need help to literally set up SQL from scratch, and I also need help compiling all our data into a SQL database. Please tell me everything I need to know about SQL as a beginner.
Probably the easiest way to get started is with one of the free RDBMS options, MySQL (https://www.mysql.com/) or PostgreSQL (https://www.postgresql.org/).
Once you've got that installed and configured, and have created the tables you wish to load, you can go with one of two routes to get your data in.
Either you can install the appropriate python libraries to connect to the server you've installed and then INSERT the data in.
Or, if there is a lot of data, look at dumping the data out into a flat file (.csv) and then using the bulk loader to push it into your tables (this is more hassle, but for larger data sets it will be faster).
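The first route (connect and INSERT) can be sketched in a few lines. The table name and columns below are invented for illustration, and stdlib sqlite3 stands in for the MySQL/PostgreSQL driver; any DB-API 2.0 driver exposes the same cursor interface:

```python
import csv
import io
import sqlite3

# sqlite3 stands in for the server you installed; a real MySQL or
# PostgreSQL driver exposes the same connect/execute/executemany calls.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE measurements (site TEXT, value REAL)')

# A small in-memory CSV stands in for one of your dumped flat files.
flat_file = io.StringIO('site,value\nA,1.5\nB,2.5\n')
rows = list(csv.DictReader(flat_file))

# executemany batches the INSERTs -- far faster than one execute() per row.
conn.executemany('INSERT INTO measurements VALUES (:site, :value)', rows)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM measurements').fetchone()[0]
```

For terabyte-scale data you would still prefer the bulk loader, but batched executemany() calls like this get you surprisingly far.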

Store MySQL query results for faster reuse

I'm doing analysis on data from a MySql database in python. I query the database for about 200,000 rows of data, then analyze in python using Pandas. I will often do many iterations over the same data, changing different variables, parameters, and such. Each time I run the program, I query the remote database (about 10 second query), then discard the query results when the program finishes. I'd like to save the results of the last query in a local file, then check each time I run the program to see if the query is the same, then just use the saved results. I guess I could just write the Pandas dataframe to a csv, but is there a better/easier/faster way to do this?
If for any reason the MySQL Query Cache doesn't help, then I'd recommend saving the latest result set either in HDF5 format or in Feather format. Both formats are pretty fast. You may find some demos and tests here:
https://stackoverflow.com/a/37929007/5741205
https://stackoverflow.com/a/42750132/5741205
https://stackoverflow.com/a/42022053/5741205
Just use pickle to write the dataframe to a file, and to read it back out ("unpickle").
https://docs.python.org/3/library/pickle.html
This would be the "easy way".
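A small helper that keys the cache file on a hash of the SQL text covers the "check each time whether the query is the same" part too. Everything here is illustrative (the function names are made up, and fake_query stands in for the real 10-second pd.read_sql call against MySQL):

```python
import hashlib
import os
import pickle
import tempfile

import pandas as pd

def cached_query(sql, run_query, cache_dir):
    """Return the cached result for `sql` if present; else run and cache it.

    `run_query` is whatever actually hits MySQL, e.g.
    lambda q: pd.read_sql(q, engine). Hypothetical helper, not a real API.
    """
    key = hashlib.sha256(sql.encode()).hexdigest()
    path = os.path.join(cache_dir, key + '.pkl')
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)  # same query as last time: reuse the result
    df = run_query(sql)
    with open(path, 'wb') as f:
        pickle.dump(df, f)
    return df

calls = []
def fake_query(sql):
    # Stand-in for the slow remote query.
    calls.append(sql)
    return pd.DataFrame({'n': [1, 2, 3]})

cache_dir = tempfile.mkdtemp()
first = cached_query('SELECT n FROM t', fake_query, cache_dir)
second = cached_query('SELECT n FROM t', fake_query, cache_dir)  # from disk
```

Swapping pickle for df.to_feather / pd.read_feather in the two marked spots gives you the faster Feather variant with the same logic.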

Something wrong with using CSV as a database for a webapp?

I am using Flask to make a small webapp to manage a group project; in this website I need to manage attendance and also meeting reports. I don't have the time to get into SQLAlchemy, so I need to know what the bad things about using CSV as a database might be.
Just don't do it.
The problem with CSV is …
a, concurrency is not possible: What this means is that when two people access your app at the same time, there is no way to make sure that they don't interfere with each other, making changes to each other's data. There is no way to solve this when using a CSV file as a backend.
b, speed: Whenever you make changes to a CSV file, you need to reload more or less the whole file. Parsing the file eats up both memory and time.
Databases were made to solve these issues.
I agree however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for is an ORM - Object-Relational Mapping - which translates Python code into SQL and manages the SQL database for you.
Two good options are Peewee and Pony ORM. Both are easy to use and translate between SQL and Python in both directions. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend Peewee. You can start using SQLite as a backend with Peewee, and if your app grows larger, you can plug in MySQL or PostgreSQL easily.
Don't do it, CSV that is.
There are many other possibilities, for instance the sqlite database, python shelve, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as postgres for which there are a number of python libraries.
I think there's nothing wrong with that as long as you abstract away from it, i.e. make sure you have a clean separation between what you write and how you implement it. That will bloat your code a bit, but it will make sure you can swap out your CSV storage in a matter of days.
I.e. pretend that you can persist your data as if you're keeping it in memory. Don't write openCSVFile() in your Flask app; use initPersistence(). Don't write csvFile.appendRecord(); use persister.saveNewReport(). When and if you actually find CSV to be a bottleneck, you can just write a new persister plugin.
There are added benefits: for example, you don't have to use a mock library in tests to make them faster. You just provide another persister.
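A minimal sketch of that abstraction (all class and method names here are hypothetical, chosen to match the examples above):

```python
import csv

class CsvPersister:
    """CSV-backed storage. The Flask views only ever call save_report /
    all_reports, so replacing this class later touches no app code."""

    def __init__(self, path):
        self.path = path

    def save_report(self, report):
        # Append one meeting report as a CSV row.
        with open(self.path, 'a', newline='') as f:
            csv.writer(f).writerow([report['date'], report['text']])

    def all_reports(self):
        with open(self.path, newline='') as f:
            return [{'date': d, 'text': t} for d, t in csv.reader(f)]

class MemoryPersister:
    """Drop-in replacement for tests -- no mock library needed."""

    def __init__(self):
        self.reports = []

    def save_report(self, report):
        self.reports.append(report)

    def all_reports(self):
        return self.reports
```

When CSV becomes the bottleneck, an SqlitePersister with the same two methods slots straight in.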
I am absolutely baffled by how many people discourage using CSV as a database storage back-end format.
Concurrency: There is NO reason why CSV can not be used with concurrency. Just like how a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same binary file. Databases can do EXACTLY the same thing with CSV files. Just as a journal is used to maintain the atomic nature of individual transactions, the same exact thing can be done with CSV.
Speed: Why on earth would a database read and write a WHOLE file at a time, when it can do what it does for ALL other storage formats: look up the starting byte of a record in an index file, SEEK to it in constant time, overwrite the data, comment out anything left over, and record the free space for later use in a separate index file, just as it could zero out the bytes of any unneeded areas of a binary "row"? I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments, etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be accessed quickly in the case of an emergency... which is the primary reason I do not EVER use binary storage for essential data that should be quickly accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you made changes to it in a spreadsheet program, but that is no different from having to re-index a binary database after the index/table gets corrupted/deleted/out-of-sync/etc.

sqlite3 or CSV files

First of all, I'm a total noob at this. I've been working on setting up a small GUI app to play with databases, mostly Microsoft Excel files with large numbers of rows. I want to be able to display a portion of the data, and I wanna be able to choose the columns I'm working with through the menu so I can perform different tasks very efficiently.
I've been looking into .CSV files. I can create some sort of list or dictionary with them (not sure), or I could just import the Excel table into a database and then do whatever I need to with my GUI. Now my question is: for the type of task I just described, which of the 2 methods would be best suited? (Feel free to tell me if there is a better one.)
It will depend upon the requirements of your application and how you plan to extend or maintain it in the future.
A few points in favour of sqlite:
standardized interface, SQL - with CSV you would create some custom logic to select columns or filter rows
performance on bigger data sets - it might be difficult to load 10M rows of CSV into memory, whereas handling 10M rows in sqlite won't be a problem
sqlite3 is in the python standard library (but then, CSV is too)
That said, also take a look at pandas, which makes working with tabular data that fits in memory a breeze. Plus pandas will happily import data directly from Excel and other sources: http://pandas.pydata.org/pandas-docs/stable/io.html
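To make the "standardized interface" point concrete, here is a tiny sketch: selecting columns and filtering rows is one SQL statement instead of hand-written CSV logic (the table and sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, age INTEGER, city TEXT)')
conn.executemany('INSERT INTO people VALUES (?, ?, ?)',
                 [('Ann', 34, 'Oslo'), ('Bob', 19, 'Lima'), ('Cy', 42, 'Oslo')])

# Choose columns and filter in one statement -- exactly the kind of query
# a GUI column-picker menu could assemble.
rows = conn.execute('SELECT name, age FROM people WHERE city = ? ORDER BY age',
                    ('Oslo',)).fetchall()
```

With a CSV you would be writing the loop, the column selection, and the sort yourself.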

Use Python to load data into Mysql

Is it possible to set up tables for MySQL in Python?
Here's my problem: I have a bunch of .txt files which I want to load into a MySQL database. Instead of creating tables in phpMyAdmin manually, is it possible to do the following things all in Python?
Create table, including data type definition.
Load many files one by one. I only know the LOAD DATA LOCAL INFILE command, which loads one file.
Many thanks
Yes, it is possible. You'll need to read the data from the files using the csv module.
http://docs.python.org/library/csv.html
And then insert the data using a Python MySQL binding. Here is a good starter tutorial:
http://zetcode.com/databases/mysqlpythontutorial/
If you already know python it will be easy
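Both steps (create the table with its data types, then loop over the files) can be sketched as below. The table name, columns, and in-memory "files" are invented for illustration, and stdlib sqlite3 stands in for the MySQL binding; any DB-API 2.0 driver offers the same cursor calls:

```python
import csv
import io
import sqlite3

# Stand-in for the MySQL connection; a real driver's cursor
# exposes the same execute/executemany interface (DB-API 2.0).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Step 1: define the table, data types included, from Python.
cur.execute('CREATE TABLE readings (station TEXT, value REAL)')

# Step 2: loop over the .txt files; two in-memory files stand in here.
# In practice you would iterate over glob.glob('*.txt') and open() each.
files = {'a.txt': 'station,value\nX,1.0\n', 'b.txt': 'station,value\nY,2.0\n'}
for name in files:
    reader = csv.reader(io.StringIO(files[name]))
    next(reader)  # skip the header row
    cur.executemany('INSERT INTO readings VALUES (?, ?)', reader)
conn.commit()

n = cur.execute('SELECT COUNT(*) FROM readings').fetchone()[0]
```

This replaces one LOAD DATA LOCAL INFILE per file with a loop you fully control from Python.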
It is. Typically what you want to do is use an Object-Relational Mapping library.
Probably the most widely used in the Python ecosystem is SQLAlchemy, but there is a lot of magic going on in it, so if you want to keep tighter control of your DB schema, or if you are learning about relational DBs and want to follow along with what the code does, you might be better off with something lighter like Canonical's Storm.
EDIT: Just thought to add: the reason to use ORMs is that they provide a very handy way to manipulate data and interface with the DB. But if all you will ever want to do is write a script to convert textual data to MySQL tables, then you might get along with something even easier. Check the tutorial linked from the official MySQL website, for example.
HTH!