New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 GB of data in ~6 GB .csv chunks.
School advised syncing data via dropbox, but this seemed impractical for a 4-person team.
So the current plan is an AWS EC2 instance plus an RDS instance (probably MySQL, with a single table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the db. Once you read it in, it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information, see the pandas documentation on reading and writing SQL tables.
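For illustration, a minimal sketch of the round trip, assuming a SQLAlchemy engine pointed at the RDS instance (the connection string, table, and column names below are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the RDS MySQL instance.
engine = create_engine("mysql+pymysql://user:password@your-rds-host/dbname")

# Read: the resulting DataFrame is a local, in-memory copy of the query result.
# Nothing in MySQL is locked or frozen while you hold this DataFrame.
df = pd.read_sql("SELECT * FROM records WHERE year = 2016", engine)

# Local modifications do not touch the database...
df["value"] = df["value"] * 1.1

# ...until you explicitly write back. Note that to_sql appends or replaces
# rows; it does not merge with edits other users made in the meantime.
df.to_sql("records_user_a", engine, if_exists="replace", index=False)
```

So if two users both read and later write back the same rows, the last write wins; any conflict handling has to be coordinated at the application or database level.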
Disclaimer: I am still pretty new to Django and am no veteran.
I am in the midst of building the "next generation" of a software package I built 10 years ago. The original software was built using CodeIgniter and the LAMP stack. The current software still works great, but it's just time to move on; the tech is now old. I have been looking at Django to write the new software in, but I have concerns about using the ORM and about the models file getting out of control.
So here's my situation: each client must have their own database. No exceptions, due to data confidentiality and contracts. Each database mainly stores weather forecast data. There is a skeleton database that is currently used to set up a client, and this skeleton does have tables that are common across all clients. What I am concerned about are the forecast data tables I have to create dynamically. Each forecast table is unique, with the exception of the first four columns, which are used for referencing/indexing and for recording when the data was added. The rest of the columns are forecast values in a real/float datatype. There could be anything from 12 forecast data columns to over 365. Between all clients, there are hundreds of different/unique forecast tables.
I am trying to wrap my head around how I can use the ORM without having hundreds of model classes in models.py. Even if I made a subdirectory and then a "models.py" for each client, I'd still have tons of model classes to deal with.
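One idea I've been toying with is not writing the classes out by hand at all, but building them at runtime with type(), roughly like this (all names here are invented, and the app_label is assumed to belong to an installed app):

```python
from django.db import models

def make_forecast_model(table_name, forecast_columns, app_label="forecasts"):
    """Build a Django model class for one existing forecast table."""
    attrs = {
        "__module__": __name__,
        # The four common reference/indexing columns (names are assumptions).
        "location_id": models.IntegerField(db_index=True),
        "issue_time": models.DateTimeField(db_index=True),
        "valid_time": models.DateTimeField(db_index=True),
        "created_at": models.DateTimeField(auto_now_add=True),
        # Point at the existing table; don't let migrations manage it.
        "Meta": type("Meta", (), {"db_table": table_name,
                                  "app_label": app_label,
                                  "managed": False}),
    }
    # One float column per forecast value (anywhere from 12 to 365+).
    for col in forecast_columns:
        attrs[col] = models.FloatField(null=True)
    return type(table_name.title().replace("_", ""), (models.Model,), attrs)

# e.g. Client1Temps = make_forecast_model("client1_temps",
#                                         [f"hour_{i:03d}" for i in range(24)])
```

I don't know yet whether registering hundreds of these at runtime is sane, which is partly what I'm asking.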
I have been reading up on how the ORM works for Django, but I haven't found anything (yet) out there that helps with my kind of situation. It's not the norm.
Without getting any more long-winded about this: should I skip the ORM because of all these complexities, or is there some stable way to deal with this besides going with raw SQL queries and stored procedures to get some performance gains?
Things to note: I did thorough benchmarking between MySQL and Postgres and will be using Postgres for the new project. I also tested using an array column vs. having a column for each forecast value in Postgres, hoping this would help with the potential model bloat issue. To my surprise, having a column for each forecast value provided faster querying than storing everything in an array column, so array storage is not a viable option for my data.
I was wondering if there is a way to allow a user to export a SQLite database as a .csv file, make some changes to it in a program like Excel, then upload that .csv file back to the table it came from using a record UPDATE method.
Currently I have a client that needed an inventory and pricing management system for their e-commerce store. I designed a database system and logic in Python 3 and SQLite. The system from a programming standpoint works flawlessly.
The problem I have is that there are some less-than-technical office staff who need to edit things like product markup within the database. Currently, I have them set up with SQLite DB Browser; from there they can edit products one at a time and write the changes to the database. They can also export tables to a .csv file for data manipulation in Excel.
The main issue is getting that .csv file back into the table it was exported from using an UPDATE method. When importing a .csv file to a table in SQLite DB Browser, there is no way to perform an update import. It can only insert new rows by default, and due to my table constraints that is a problem.
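What I'd effectively want the import to do is something like this, sketched with Python's sqlite3 and csv modules (the table and column names here are placeholders, not my real schema):

```python
import csv
import sqlite3

conn = sqlite3.connect("inventory.db")

# Read the edited spreadsheet back in.
with open("products_edited.csv", newline="") as f:
    reader = csv.DictReader(f)
    rows = [(row["markup"], row["sku"]) for row in reader]

# Update existing rows keyed on the primary key instead of inserting new ones.
conn.executemany("UPDATE products SET markup = ? WHERE sku = ?", rows)
conn.commit()
conn.close()
```

I'd rather not have to ship and maintain a script like that alongside the GUI tool, though, hence the question.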
I like SQLite DB Browser because it is clean and simple and does exactly what I need. However, as soon as you have to edit more than one thing at a time and filter information in more complicated ways, it starts to lack the functionality needed.
Is there a solution out there for SQLite DB Browser to tackle this problem? Is there a better software option all together to interact with a SQLite database that would give me that last bit of functionality?
Have you tried SQLiteForExcel? Some coding is required, however.
So after researching some off-the-shelf options, I found that the Devart Excel Add-ins did exactly what I needed. They are paid add-ins, but they seem to support almost all modern databases, including SQLite. Once the add-in is installed, you can connect to a database and manipulate the data returned just like normal in Excel, including bulk edits and advanced filtering. All changes are highlighted and can easily be written to the database with one click.
Overall I thought it was a pretty solid solution, and everyone seems to be very happy with it, as it made interacting with a database intuitive and non-threatening to the more technically challenged.
I'm building a flask application that allows users to upload CSV files (with varying columns), preview uploaded files, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded file is being read into a pandas DataFrame, which allows me to elegantly handle most of the complicated data work.
I'd like these DataFrames along with associated metadata (time uploaded, ID of user uploading the file, etc.) to persist and be available for multiple users to pass around to various views. However, I'm not sure how best to incorporate the data into my SQLAlchemy models (I'm using PostgreSQL on the backend).
Three approaches I've considered:
Cramming the DataFrame into a PickleType and storing it directly in the DB. This seems to be the most straightforward solution, but means I'll be sticking large binary objects into the database.
Pickling the DataFrame, writing it to the filesystem, and storing the path as a string in the model. This keeps the database small but adds some complexity when backing up the database and allowing users to do things like delete previously uploaded files.
Converting the DataFrame to JSON (DataFrame.to_json()) and storing it as a json type (maps to PostgreSQL's json type). This adds the overhead of parsing JSON each time the DataFrame is accessed, but it also allows the data to be manipulated directly via PostgreSQL JSON operators.
Given the advantages and drawbacks of each (including those I'm unaware of), is there a preferred way to incorporate pandas DataFrames into SQLAlchemy models?
Go with the JSON and PostgreSQL solution. I am on a pandas project that started with pickles on the file system and loaded the data into a class object for the data processing with pandas. However, as the data became large, we played with SQLAlchemy / SQLite3. Now we are finding that working with SQLAlchemy / PostgreSQL is even better. I think our next step will be JSON.
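For illustration, a rough sketch of what the JSON-column route could look like (the model, table, and column names are made up, and the exact serialization details depend on your driver):

```python
import json

import pandas as pd
from sqlalchemy import Column, DateTime, Integer, func
from sqlalchemy.dialects.postgresql import JSON
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class UploadedFrame(Base):
    __tablename__ = "uploaded_frames"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, nullable=False)
    uploaded_at = Column(DateTime, server_default=func.now())
    data = Column(JSON)  # the serialized DataFrame

# Storing: round-trip through to_json so pandas controls the layout.
#   record.data = json.loads(df.to_json(orient="split"))
# Loading: psycopg2 typically hands the json column back as a dict.
#   df = pd.DataFrame(record.data["data"],
#                     index=record.data["index"],
#                     columns=record.data["columns"])
```

That keeps the data reachable with Postgres's JSON operators while still letting pandas do the heavy lifting once it's loaded.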
Have fun! Pandas rocks!
I have some programming background, but I'm in the process of both learning Python and making a web app. I'm a long-time lurker but first-time poster on Stack Overflow, so please bear with me.
I know that SQLite (or another database, seems like PostgreSQL is popular) is the way to store data between sessions. But what's the most efficient way to store large amounts of data during a session?
I'm building a script to identify the strongest groups of employees to work on various projects in a company. I have received one SQLite database per department containing employee data including skill sets, achievements, performance, and pay.
My script currently runs one SQL query on each database in response to an initial query by the user, pulling all the potentially-relevant employees and their data. It stores all of that data in a list of Python dicts so the end-user can mix-and-match relevant people.
I see two other options. I could still run the comprehensive initial queries, but instead of storing the results in Python dicts, dump everything into SQLite temporary tables; my guess is that this would save some space and computation because I wouldn't have to store all the joins with each record. Or I could just load employee names and column/row references, which would save a lot of joins on the first pass, then pull the data on the fly from the original databases as the user requests additional data, storing little if any data in Python data structures.
What's going to be the most efficient? Or, at least, what is the most common/proper way of handling large amounts of data during a session?
Thanks in advance!
Aren't you over-optimizing? You don't need the best solution, you need a solution which is good enough.
Implement the simplest one, using dicts; it has a fair chance to be adequate. If you test it and then find it inadequate, try SQLite or Mongo (both have downsides) and see if it suits you better. But I suspect that buying more RAM instead would be the most cost-effective solution in your case.
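For what it's worth, the dict approach you already describe can stay very small; a sketch, with made-up file, table, and column names:

```python
import sqlite3

def gather_candidates(db_paths, min_performance):
    """Run one query per department database and pool the rows as dicts."""
    candidates = []
    for path in db_paths:
        conn = sqlite3.connect(path)
        conn.row_factory = sqlite3.Row  # rows can be turned into dicts
        cursor = conn.execute(
            "SELECT name, skills, achievements, performance, pay "
            "FROM employees WHERE performance >= ?",
            (min_performance,))
        candidates.extend(dict(row) for row in cursor)
        conn.close()
    return candidates

# e.g. people = gather_candidates(["sales.db", "engineering.db"], 3.5)
```

If that ever gets too slow or too big for RAM, swapping the list out for an in-memory SQLite table is a fairly contained change.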
(Not-a-real-answer disclaimer applies.)
My Python project involves an externally provided database: a text file of approximately 100K lines.
This file will be updated daily.
Should I load it into an SQL database, and deal with the diff daily? Or is there an effective way to "query" this text file?
ADDITIONAL INFO:
Each "entry", or line, contains three fields - any one of which can be used as an index.
The update is in the form of the entire database; I would have to generate a diff manually.
The queries are just looking up records and displaying the text.
Querying the database will be a fundamental task of the application.
How often will the data be queried? At one extreme, if it's only once per day, a sequential search might be more efficient than maintaining a database or index.
For more frequent queries with a daily update, you could build and maintain your own index for more efficient lookups. Most likely, though, it would be worth a negligible (if any) sacrifice in speed to use an SQL database (or another database, depending on your needs) in return for simpler and more maintainable code.
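For the once-a-day extreme, a sequential search over the raw file can be this small (the tab delimiter and field order are assumptions about your file):

```python
def lookup(path, value):
    """Return the lines whose first, second, or third field equals value."""
    matches = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")[:3]
            if value in fields:
                matches.append(line.rstrip("\n"))
    return matches
```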
What I've done before is create SQLite databases from txt files which were created from database extracts, one SQLite db for each day.
You can query across the SQLite databases to check values, etc., and create additional tables of data.
I added an additional column containing the SHA1 of each text line so that I could easily identify lines that were different.
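Roughly, the load step looked something like this (the field names and the tab delimiter here are placeholders rather than my actual layout):

```python
import hashlib
import sqlite3

def load_day(txt_path, db_path):
    """Load one day's text extract into its own SQLite database."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS entries (
                        field1 TEXT, field2 TEXT, field3 TEXT, sha1 TEXT)""")
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            field1, field2, field3 = line.split("\t")[:3]
            # Hash the whole line so changed records are easy to spot later.
            digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
            conn.execute("INSERT INTO entries VALUES (?, ?, ?, ?)",
                         (field1, field2, field3, digest))
    conn.commit()
    conn.close()
```

Comparing two days then mostly reduces to comparing the sets of SHA1 values in the two databases.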
It worked in my situation and hopefully may form the barest sniff of an acorn of an idea for you.