I'm building a flask application that allows users to upload CSV files (with varying columns), preview uploaded files, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded file is being read into a pandas DataFrame, which allows me to elegantly handle most of the complicated data work.
I'd like these DataFrames along with associated metadata (time uploaded, ID of user uploading the file, etc.) to persist and be available for multiple users to pass around to various views. However, I'm not sure how best to incorporate the data into my SQLAlchemy models (I'm using PostgreSQL on the backend).
Three approaches I've considered:
Cramming the DataFrame into a PickleType and storing it directly in the DB. This seems to be the most straightforward solution, but means I'll be sticking large binary objects into the database.
Pickling the DataFrame, writing it to the filesystem, and storing the path as a string in the model. This keeps the database small but adds some complexity when backing up the database and allowing users to do things like delete previously uploaded files.
Converting the DataFrame to JSON (DataFrame.to_json()) and storing it as a json type (maps to PostgreSQL's json type). This adds the overhead of parsing JSON each time the DataFrame is accessed, but it also allows the data to be manipulated directly via PostgreSQL JSON operators.
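For concreteness, the three approaches could be sketched as alternative columns on one SQLAlchemy model (a hedged sketch; the model, table, and column names are illustrative, not from my actual code):

```python
from sqlalchemy import Column, DateTime, Integer, PickleType, String
from sqlalchemy.dialects.postgresql import JSON
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Upload(Base):  # hypothetical model name
    __tablename__ = "uploads"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, nullable=False)       # ID of the uploading user
    uploaded_at = Column(DateTime, nullable=False)  # time uploaded
    # Approach 1: pickle the DataFrame straight into the row
    frame_pickle = Column(PickleType, nullable=True)
    # Approach 2: store only the filesystem path to a pickled frame
    frame_path = Column(String, nullable=True)
    # Approach 3: store DataFrame.to_json() output in a Postgres json column
    frame_json = Column(JSON, nullable=True)
```

In a real model only one of the three frame columns would exist, of course.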
Given the advantages and drawbacks of each (including those I'm unaware of), is there a preferred way to incorporate pandas DataFrames into SQLAlchemy models?
Go with the JSON and PostgreSQL solution. I am on a pandas project that started with pickles on the filesystem, loading the data into a class object for processing with pandas. However, as the data grew, we moved to SQLAlchemy / SQLite3. Now we are finding that SQLAlchemy / PostgreSQL works even better. I think our next step will be JSON.
Have fun! Pandas rocks!
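The JSON round trip itself is short; note that the orient used by to_json must match on read, and dtype fidelity (dates, categoricals, etc.) is not guaranteed, so treat this as a minimal sketch:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Serialize for storage in a PostgreSQL json/jsonb column
payload = df.to_json(orient="split")

# Restore with the same orient; wrap in StringIO for newer pandas versions
restored = pd.read_json(io.StringIO(payload), orient="split")
```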
EDIT 28/04/2021
No luck with my question so far ;)
Does no one deal with this problem?
I want to develop an app for automatic data quality monitoring (DQM) based on different queries:
missing data
outliers
inconsistent data within the same model
inconsistent data between models
missing form
Users should be able to 'customize' their own queries in a structured settings file (an Excel file loaded into a Django model, or entered directly into a Django model via the UI).
The DQM app should read this settings file and run the queries, with results stored in a model.
Users can review the list of query results and make corrections in the database in order to resolve issues and improve data quality.
I looked for a Django package that already does this but could not find any, so I would appreciate some help with the design.
I have designed the settings file as below:
I've read about data quality with pandas, but nothing that covers all the data quality queries mentioned above.
Nevertheless, pandas could be used to read the Excel settings file into a DataFrame.
Then I need to construct queries based on the settings file. I thought about two options:
use raw SQL: concatenate SQL statements with data read from the DataFrame and pass them to raw()
use a Django queryset with dictionary unpacking to provide keyword arguments: qs = mymodel.objects.filter(**kwargs)
Is there a cleaner way to achieve data quality monitoring?
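Of the two options, the **kwargs route seems safer, since concatenating raw SQL invites injection and quoting bugs. The filter keywords can be assembled from one parsed settings row; the sketch below builds only the kwargs dict (the rule layout and field names are hypothetical, and the final .filter(**kwargs) call would run inside Django):

```python
# One rule parsed from the Excel settings file, e.g. via pandas
rule = {"field": "age", "lookup": "gt", "value": 120}  # flag implausible ages

# Assemble the Django-style lookup keyword, e.g. "age__gt"
key = f"{rule['field']}__{rule['lookup']}" if rule["lookup"] else rule["field"]
kwargs = {key: rule["value"]}

# Inside Django this would become:
#   suspects = MyModel.objects.filter(**kwargs)
```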
I'm trying to build a REST API in Python that relies on large data being loaded dynamically into memory and processed. The data is loaded into pandas DataFrames, but my question is not specific to pandas and I might need other data structures.
After a request to the API, I would like to load the useful data (e.g., read from disk or from a DB) and keep it in memory because other requests relying on the same data should follow. After some time, I would need to drop the data in order to save memory.
In practice, I would like to keep a list of pandas DataFrames in memory. The DataFrames in the list would be those needed to fulfill the latest requests. Some DataFrames can be very large (e.g., several GB), so I don't think I can afford to retrieve them from a DB on every request without significant overhead. This is why I want to keep them in memory for subsequent requests.
I started with Flask when the API relied on a single, fixed DataFrame. But now I cannot find a way to load new DataFrames dynamically and make them persist across multiple requests. The loading of a new DataFrame should be triggered inside a request when necessary, and the new DataFrame should then be available to subsequent requests. I don't know how to achieve that with Flask or any other framework.
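One framework-agnostic pattern is a process-level LRU cache guarded by a lock, living at module scope so it survives between requests within one worker process (note this breaks down with multiple worker processes, where a shared store such as Redis or memory-mapped files would be needed). A sketch, with a hypothetical loader callable that reads a DataFrame from disk or a DB:

```python
import threading
from collections import OrderedDict

class FrameCache:
    """Keep the N most recently used objects (e.g. DataFrames) in memory."""

    def __init__(self, loader, max_items=4):
        self._loader = loader        # callable: key -> DataFrame
        self._cache = OrderedDict()  # insertion order tracks recency
        self._lock = threading.Lock()
        self._max_items = max_items

    def get(self, key):
        with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)  # mark as recently used
                return self._cache[key]
        frame = self._loader(key)             # slow load, outside the lock
        with self._lock:
            self._cache[key] = frame
            self._cache.move_to_end(key)
            while len(self._cache) > self._max_items:
                self._cache.popitem(last=False)  # evict least recently used
            return self._cache[key]
```

A module-level instance, e.g. `cache = FrameCache(load_frame)`, would then be shared by all view functions; each request calls `cache.get(dataset_id)` and either hits memory or triggers a load.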
I have a task to import multiple Excel files into their respective SQL Server tables. The Excel files have different schemas, and I need a mechanism to create each table dynamically so that I don't have to write a CREATE TABLE query. I use SSIS, and I have seen some SSIS articles on this. However, it looks like I still have to define the table anyway, and OpenRowSet doesn't work well with large Excel files.
You can try using BiML, which dynamically creates packages based on meta data.
The only other possible solution is to write a script task.
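If the script-task route is taken, a small Python script can sidestep hand-written DDL: pandas reads each workbook, and pd.io.sql.get_schema emits a CREATE TABLE statement inferred from the column dtypes (a sketch; the file path and table name are placeholders, and the generated generic types may need adjusting for SQL Server):

```python
import pandas as pd

# In practice: df = pd.read_excel("incoming/book1.xlsx")  # placeholder path
df = pd.DataFrame({"name": ["a", "b"], "qty": [1, 2]})

# Generate a CREATE TABLE statement from the inferred column dtypes
ddl = pd.io.sql.get_schema(df, "staging_book1")
```

The resulting statement can then be executed against the target database before loading the rows with DataFrame.to_sql.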
New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 GB of data in ~6 GB .csv chunks.
School advised syncing data via dropbox, but this seemed impractical for a 4-person team.
So, the current solution is an AWS EC2 & RDS instance (MySQL, I think it'll be; 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the database. Once you read it in, it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL database. For more information on reading and writing SQL tables, see the pandas documentation.
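The read/modify/write-back cycle can be seen with an in-memory SQLite database standing in for MySQL (a minimal sketch):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": [1, 2], "score": [10, 20]}).to_sql(
    "results", conn, index=False)

# User A pulls a frame: from here on it is a private in-memory copy
df = pd.read_sql("SELECT * FROM results", conn)
df["score"] = df["score"] + 5  # the database is untouched by this

# Changes only reach the database on an explicit write-back
df.to_sql("results", conn, index=False, if_exists="replace")
final = pd.read_sql("SELECT * FROM results", conn)
```

Nothing in this cycle locks the rows for other users, which is exactly the behavior the question describes: last write-back wins unless you add your own coordination.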
I have a huge database, and I'd like to create a dump of only the latest data (say, from the last 3 weeks, or the last million rows or fewer from each table).
I've collected all the models I need (with the help of django.apps.get_app_configs()) and all the relationships (say, from the User model I need User.objects.filter(id__gt=1000000)). What would be the best way to create a dump (or export) that can be loaded easily into an empty DB?
What would be the best approach: creating a real MySQL dump, or using shutil, pickle, or cPickle?
The best would be to have an export and an import management command for this purpose, but I'm interested at first in the techniques that can be used for this.
The goal is to have a consistent database with a minimal set of data.
Django: 1.7.4
DB: MySQL
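Rather than a raw MySQL dump or pickles, writing the filtered querysets out in Django's own fixture format keeps the result loadable with loaddata into an empty database. django.core.serializers can produce that format directly; the sketch below reproduces the fixture shape with plain json so the structure is visible (the model label and fields are illustrative):

```python
import json

# Inside Django this would be:
#   from django.core import serializers
#   data = serializers.serialize("json", User.objects.filter(id__gt=1000000))
# The fixture format it produces is a list of records like:
records = [
    {"model": "auth.user", "pk": 1000001,
     "fields": {"username": "alice", "is_active": True}},
]

fixture = json.dumps(records, indent=2)
# Written to e.g. dump.json, this loads with: manage.py loaddata dump.json
```

An export management command would iterate over the collected models, serialize each filtered queryset this way, and write one fixture file per app or model; the import side is then just loaddata.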