I have a huge database, and I'd like to create a dump of only the latest data (let's say from the last 3 weeks, or the last million rows or fewer from each table).
I've collected all the models I need (with the help of django.apps.get_app_configs()) and all the relationships (let's say from the User model I need User.objects.filter(id__gt=1000000)). What would be the best way to create a dump (or an export) that can be loaded easily into an empty DB?
Should I create a real MySQL dump, or use something like shutil, pickle or cPickle?
The best would be to have an export and an import management command for this purpose, but first I'm interested in the techniques that can be used for this.
The goal is to have a consistent database with a minimal set of data.
Django: 1.7.4
DB: MySQL
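The kind of approach I have in mind would use Django's serialization framework, which dumpdata/loaddata are built on. A minimal sketch (the User filter is just the example above; the other collected models would be handled the same way):

from django.contrib.auth.models import User
from django.core import serializers

def export_latest(path):
    # Serialize only the rows we want to keep, e.g. the newest users.
    queryset = User.objects.filter(id__gt=1000000)
    with open(path, "w") as f:
        serializers.serialize("json", queryset, stream=f)

def import_latest(path):
    # Load the dump back into an empty database.
    with open(path) as f:
        for deserialized in serializers.deserialize("json", f):
            deserialized.save()

This is essentially what dumpdata and loaddata do, but with control over the queryset per model, so it could be wrapped in custom export/import management commands.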
EDIT 28/04/2021
No luck with my question so far ;)
Does no one deal with this problem?
I want to develop an app for automatic data quality monitoring (DQM) based on different queries:
missing data
outliers
inconsistent data within the same model
inconsistent data between models
missing form
Users should be able to 'customize' their own queries in a structured settings file (an Excel file loaded into a Django model, or directly in a Django model with a UI).
The DQM app should read this settings file and run the queries, storing the results in a model.
Users can then review the list of query results and make corrections in the database in order to resolve issues and improve data quality.
I looked for a Django package that already does this but could not find any, so I would appreciate some help with the design.
I have designed the settings file as below:
I've read about data quality with pandas, but found nothing that covers all the data quality queries mentioned above.
Nevertheless, pandas could be used to read the Excel settings file into a DataFrame.
Then I need to 'construct' the queries based on the settings file. I thought about 2 options:
use raw SQL: concatenate SQL statements with the data read from the DataFrame and pass the statement to raw()
use a Django queryset, unpacking a dictionary to provide the keyword arguments: qs = mymodel.objects.filter(**kwargs) (a minimal sketch of this follows below)
Is there a cleaner way to achieve data quality?
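For the queryset option, what I have in mind looks roughly like this. It is a minimal sketch; the column names (app, model, field, lookup, value) are assumptions about how the settings file could be structured:

import pandas as pd
from django.apps import apps

def run_check(rule):
    # rule is one row of the settings DataFrame, e.g.
    # {"app": "myapp", "model": "Visit", "field": "weight",
    #  "lookup": "gt", "value": 300}
    model = apps.get_model(rule["app"], rule["model"])
    kwargs = {"%s__%s" % (rule["field"], rule["lookup"]): rule["value"]}
    return model.objects.filter(**kwargs)

def run_all_checks(settings_path):
    # Read the Excel settings file into a DataFrame, then run every rule.
    settings = pd.read_excel(settings_path)
    return {idx: run_check(row) for idx, row in settings.iterrows()}

Each returned queryset would then be stored (for example its primary keys plus the rule that flagged them) in the results model.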
I'm updating from an ancient language to Django, and I want to keep the data from the old project in the new one.
But the old project is MySQL, and I'm currently using SQLite3 in dev mode. I've read that PostgreSQL is the most capable, so my first question is: is it better to set up PostgreSQL already during development, or is the transition from SQLite3 to PostgreSQL easy?
As for the data in the old project: I am reworking the table structure from the old MySQL schema, since it has many relation tables, and these are handled internally with ForeignKey and ManyToMany fields in SQLite3 (same in PostgreSQL, I guess).
So I'm thinking about how to transfer the data. It's not really much data, maybe 3,000-5,000 rows.
The problem is that I don't want to keep the same table structure, so a straight import would be a terrible idea. I want the sweet functionality provided by SQLite3/PostgreSQL.
One idea I had was to join all the data and create a nested JSON document for each post, and then define which table each part belongs to so that the relations are kept.
But this is just my guessing, so I'm asking you: is there a proper way to do this?
Thanks!
Better to create the Postgres database first, then write a Python script that takes the data from the MySQL database and imports it into the Postgres database.
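A minimal sketch of such a script, assuming it is run via manage.py shell or wrapped in a management command; the legacy table old_posts, the credentials and the Post model are placeholders for illustration:

import pymysql

from myapp.models import Post  # placeholder Django model in the new project

def migrate_posts():
    legacy = pymysql.connect(host="localhost", user="olduser",
                             password="secret", database="old_db")
    try:
        with legacy.cursor() as cursor:
            cursor.execute("SELECT title, body FROM old_posts")
            for title, body in cursor.fetchall():
                # get_or_create keeps the script idempotent if it is re-run
                Post.objects.get_or_create(title=title, defaults={"body": body})
    finally:
        legacy.close()

Because the data goes through the Django ORM, the new table structure (ForeignKey, ManyToMany) does not have to match the old MySQL schema at all.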
I'm completely new to managing data using databases, so I hope my question is not too stupid, but I did not find anything related using the title keywords...
I want to set up a SQL database to store computation results; these are performed using a Python library. My idea was to use a Python ORM like SQLAlchemy or peewee to store the results in a database.
However, the computations are done by several people on many different machines, including some that are not directly connected to the internet: it is therefore impossible to simply use one common database.
What would be useful to me would be a way of saving the data in the ORM's format to be able to read it again directly once I transfer the data to a machine where the main database can be accessed.
To summarize, I want to do:
On the 1st machine: Python data -> ORM object -> ORM.fileformat
After transfer on a connected machine: ORM.fileformat -> ORM object -> SQL database
Would anyone know if existing ORMs offer that kind of feature?
Is there a reason why some of the machines cannot be connected to the internet?
If you really can't connect them, what I would do is set up a database and the Python app on each machine where data is collected/generated. Have each machine use the app to store results into its own local database, and then later you can create a dump of each database from each machine and import those results into one database.
Not the ideal solution but it will work.
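A minimal sketch of the dump step, assuming each machine keeps its results in a local SQLite file (replaying the dump into another database engine usually needs some dialect adjustments):

import sqlite3

def dump_local_db(db_path, dump_path):
    # Write the whole local database out as SQL statements that can be
    # replayed later against the central database.
    with sqlite3.connect(db_path) as conn, open(dump_path, "w") as f:
        for statement in conn.iterdump():
            f.write(statement + "\n")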
Ok,
thanks to MAhsan's and Padraic's answers I was able to find out how this can be done: the CSV format is indeed easy to use for import/export from a database.
Here are examples for SQLAlchemy (import 1, import 2, and export) and peewee.
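A minimal sketch of the round-trip with SQLAlchemy (1.4+ style; the Result model and its columns are placeholders for illustration):

import csv

from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Result(Base):
    __tablename__ = "results"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    value = Column(Float)

def export_to_csv(db_url, path):
    # Dump every stored result from the local database into a CSV file.
    engine = create_engine(db_url)
    with Session(engine) as session, open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "value"])
        for row in session.query(Result):
            writer.writerow([row.name, row.value])

def import_from_csv(db_url, path):
    # Load the transferred CSV file into the central database.
    engine = create_engine(db_url)
    Base.metadata.create_all(engine)
    with Session(engine) as session, open(path, newline="") as f:
        for record in csv.DictReader(f):
            session.add(Result(name=record["name"], value=float(record["value"])))
        session.commit()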
I'm building a flask application that allows users to upload CSV files (with varying columns), preview uploaded files, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded file is being read into a pandas DataFrame, which allows me to elegantly handle most of the complicated data work.
I'd like these DataFrames along with associated metadata (time uploaded, ID of user uploading the file, etc.) to persist and be available for multiple users to pass around to various views. However, I'm not sure how best to incorporate the data into my SQLAlchemy models (I'm using PostgreSQL on the backend).
Three approaches I've considered:
Cramming the DataFrame into a PickleType and storing it directly in the DB. This seems to be the most straightforward solution, but means I'll be sticking large binary objects into the database.
Pickling the DataFrame, writing it to the filesystem, and storing the path as a string in the model. This keeps the database small but adds some complexity when backing up the database and allowing users to do things like delete previously uploaded files.
Converting the DataFrame to JSON (DataFrame.to_json()) and storing it as a json type (maps to PostgreSQL's json type). This adds the overhead of parsing JSON each time the DataFrame is accessed, but it also allows the data to be manipulated directly via PostgreSQL JSON operators (a minimal sketch of this option follows below).
Given the advantages and drawbacks of each (including those I'm unaware of), is there a preferred way to incorporate pandas DataFrames into SQLAlchemy models?
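For reference, option 3 would look roughly like this with Flask-SQLAlchemy. It is a minimal sketch; the model, its columns and the "split" orientation are assumptions for illustration:

from datetime import datetime, timezone

import pandas as pd
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy.dialects.postgresql import JSON

db = SQLAlchemy()

class UploadedFrame(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    uploaded_by = db.Column(db.Integer)   # ID of the uploading user
    uploaded_at = db.Column(db.DateTime)
    payload = db.Column(JSON)             # the serialized DataFrame

def save_frame(df, user_id):
    # to_dict(orient="split") keeps columns, index and data separately, which
    # round-trips cleanly (to_json(orient="split") yields the same structure
    # as a string).
    record = UploadedFrame(uploaded_by=user_id,
                           uploaded_at=datetime.now(timezone.utc),
                           payload=df.to_dict(orient="split"))
    db.session.add(record)
    db.session.commit()
    return record.id

def load_frame(record_id):
    record = UploadedFrame.query.get(record_id)
    return pd.DataFrame(record.payload["data"],
                        index=record.payload["index"],
                        columns=record.payload["columns"])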
Go towards the JSON and PostgreSQL solution. I am on a pandas project that started with pickle on the filesystem and loaded the data into a class object for the data processing with pandas. However, as the data became large, we played with SQLAlchemy / SQLite3. Now we are finding that working with SQLAlchemy / PostgreSQL is even better. I think our next step will be JSON.
Have fun! Pandas rocks!
I know how to import txt files to mysql. But since I'm using Django now, I created a model called dataset. How can I import the data.txt to the dataset model in Django? Each row of the data.txt file consists of several columns, each of which corresponds to one field of the model.
I'm using Django with MySQL.
Any code or hint would be appreciated.
I have done something similar.
You should create a Python script that reads your input file and creates the model objects (each of which translates into a record in your database). It would be the same thing as using the Django shell.
This link is a bit old but explains the concept. You will want to adapt it to your version of Django (it uses setup_environ, which newer Django versions have replaced with django.setup()):
http://www.cotellese.net/2007/09/27/running-external-scripts-against-django-models/
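For recent Django versions, a minimal sketch of such a script; the settings module, the app/model names, the field names and the tab delimiter are placeholders to adapt to your data.txt and your dataset model:

import csv
import os

import django

# Bootstrap Django so the models can be imported outside manage.py.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
django.setup()

from myapp.models import Dataset  # the model the rows should go into

def load_file(path):
    with open(path, newline="") as f:
        # Assumes tab-separated columns; change the delimiter if needed.
        for row in csv.reader(f, delimiter="\t"):
            Dataset.objects.create(field_a=row[0],
                                   field_b=row[1],
                                   field_c=row[2])

if __name__ == "__main__":
    load_file("data.txt")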