Using metadata to format columns of a database in Python

I am working with a database that originally comes in SPSS '.sav' format. Its columns do not use memory efficiently (e.g., string variables are padded with blank spaces, and numeric variables carry decimals where the values should be integers).
My process is to clean the database (correct the formats) and then export it to a flat text file that is processed in a text editor. The text editor is not very powerful, so assigning the columns and formats correctly is a critical step, as is reducing memory usage.
I have written a Python program (using the pyreadstat library) to reduce memory usage. It works, with one limitation: it only considers the values found in the columns of the database, not the metadata that is also imported from the SPSS file. My clients usually send two databases: a preliminary one, and, a few weeks later, the final data that will be presented.
I use the preliminary data to keep programming in the text editor and save time until the final data arrives. I found a potential problem with this process: the final data may contain values that do not fit the format derived from the preliminary data.
For example, take an age variable. In the preliminary data the oldest person might be 99, so the column is formatted as a two-digit integer; but if a 100-year-old subject appears in the final data, the format would change to a three-digit integer. That would break everything I have formatted in the text editor, because the columns of the database would then be pulled or extracted in the wrong positions.
One way to avoid this problem would be to use the database metadata. The metadata contains all the values and labels of the variables in my database, and this information has been double-checked by the client. Using it would make the memory-usage reduction optimal.
My question is: is there any way to use the metadata imported from the SPSS file to format the columns of my database in a Python program?
Also, is this the most efficient way to solve the problem, or is there a more appropriate procedure for this task?
Thank you very much for your help; getting this procedure working, or hearing some ideas from people more experienced with the subject, would save me a lot of time in my work.
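For illustration, here is a minimal sketch of the kind of approach I have in mind, assuming pyreadstat's read_sav and the original_variable_types metadata it returns (SPSS formats such as "F3.0" or "A10"); the file name and casting rules are only illustrative, not a full solution:

import pyreadstat

# Read both the data and the metadata (metadataonly=True would read just the metadata).
df, meta = pyreadstat.read_sav("preliminary.sav")

widths = {}
for name, spss_fmt in meta.original_variable_types.items():
    if spss_fmt.startswith("A"):
        # String variable, e.g. "A10" -> declared width of 10 characters.
        widths[name] = int(spss_fmt[1:])
        df[name] = df[name].astype("string").str.strip()
    elif spss_fmt.startswith("F"):
        # Numeric variable, e.g. "F3.0" -> 3 digits, 0 decimals.
        digits, _, decimals = spss_fmt[1:].partition(".")
        widths[name] = int(digits)
        if decimals in ("", "0"):
            df[name] = df[name].astype("Int64")  # nullable integer dtype

# 'widths' now reflects the client-declared column widths (e.g. age declared as
# F3.0 even if the preliminary data only contains two-digit ages), so a
# fixed-width export based on it should not shift when the final data arrives.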

Related

How can I pass through dataset metadata, like a hash or timestamp, in a Python Transform in Palantir Foundry?

I want to be able to do two things:
Store a hash of a dataset's contents (so I can decide whether it has been updated). To date, I have done this via a second output dataset with a single row that stores the hash and row count. In my Transform I can read that output and compare it to the current build's hash and row count to decide whether the data has updated. This works fine, but I'd like to avoid having a second dataset if possible.
Pass through timestamps from upstream dependencies so that in downstream workflows I can answer "when did dependency X last update?"
It seems like both of these could be solved by some sort of key-value metadata store on the dataset.
You're correct that one of the most straightforward ways to do this is to decorate the rows with a timestamp value, and in fact with Foundry's Parquet storage system, this will be encoded using Dictionary Encoding, a highly efficient mechanism to store repeated values.
The problem with this approach is you'll have to stack a new column for each phase of updating you want to keep track of. This might prove annoying to maintain in practice.
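As a minimal illustration of the row-decoration approach (the column name and timestamp source here are placeholders, not Foundry-specific API):

import datetime
from pyspark.sql import functions as F

def add_build_timestamp(df):
    """Stamp every row with the time this build ran; Parquet dictionary
    encoding keeps the storage cost of the repeated value negligible."""
    build_time = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return df.withColumn("build_timestamp", F.lit(build_time))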
However, if you don't want to add this data to your rows and instead simply want to store your metadata, you have two options, one of which you've already found:
Store metadata in a separate dataset
Write an 'unused' file (probably .csv or .txt) to your output keeping track of this information
Foundry won't consider the extra .csv or .txt file on the output if you're writing a standard DataFrame to it, since the schema by default only reads Parquet files. This means you can store this little snippet of information without affecting your output. If you check the platform documentation, you can confirm that it's possible to write both a DataFrame and a file of your own to an output.
It may be simpler to interact with a second output, however, since the mechanics of incremental transforms and schema handling are taken care of for you, so I'd recommend proceeding with option 1 as you are doing right now.

How to continuously save a pandas dataframe to a file?

I have a Python program that controls some machines and stores some data. The data is produced at a rate of about 20 rows per second (with about 10 columns or so). A run of this program can last as long as one week, so the resulting dataframe gets large.
What are safe and correct ways to store this data? By safe I mean that if something fails on day 6, I still have all the data from days 1 to 6. By correct I mean not rewriting the whole dataframe to a file in each loop.
My current solution is a CSV file; I just write out each row manually. This solution is both safe and correct, but the problem is that CSV does not preserve data types and also takes up more space. So I would like to know whether there is a binary alternative. I like the Feather format because it is really fast, but it does not allow appending rows.
I can think of two easy options:
store chunks of data (e.g. every 30 seconds or whatever suits your use case) into separate files; you can then postprocess them back into a single dataframe.
store each row into an SQL database as it comes in (sketched below). SQLite will likely be a good start, but I'd perhaps go straight for PostgreSQL. That's what databases are meant for, after all.
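A minimal sketch of the second option, assuming a plain sqlite3 connection (the table and column names are only examples):

import sqlite3

conn = sqlite3.connect("run_data.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS samples (ts REAL, sensor TEXT, value REAL)"
)

def store_row(ts, sensor, value):
    # One INSERT per row; the commit makes the row durable immediately,
    # so a crash on day 6 keeps everything written up to that point.
    conn.execute("INSERT INTO samples VALUES (?, ?, ?)", (ts, sensor, value))
    conn.commit()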

Make hash on excel data to detect data changed with openpyxl

I have an Excel file with a lot of sheets (100+). Each sheet is independent. I would like to know whether the data in a specific sheet has been altered since it was last opened. At the moment, I have a solution based on a for loop over all the relevant cells that calculates a checksum from them. If the checksum differs, the sheet has been changed. The problem is that I need to access a lot of cells, and Python is notoriously slow at that kind of task.
My question is: would you people have a better solution than my very naive one that would be more efficient?
I am using openpyxl, but I could use another library for this specific task, as long as it is a Python library.
The data is not of a single kind: there is a mix of numbers and strings in each sheet. But every sheet is formatted with the same pattern (i.e., always the same data type at a given coordinate).
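For reference, a minimal sketch of this kind of per-sheet checksum, assuming openpyxl's read-only mode and iter_rows(values_only=True) to keep cell access as cheap as possible (the file path and sheet name are only examples):

import hashlib
from openpyxl import load_workbook

def sheet_checksum(path, sheet_name):
    # read_only mode streams the sheet instead of building full cell objects.
    wb = load_workbook(path, read_only=True, data_only=True)
    ws = wb[sheet_name]
    h = hashlib.sha256()
    for row in ws.iter_rows(values_only=True):
        h.update(repr(row).encode("utf-8"))
    wb.close()
    return h.hexdigest()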

Using Pandas to create, read, and update hdf5 file structure

We would like to be able to allow the HDF5 files themselves to define their columns, indexes, and column types instead of maintaining a separate file that defines structure of the HDF5 data.
How can I create an empty HDF5 file from Pandas with a specific table structure like:
Columns
id (Int)
name (Str)
update_date (datetime)
some_float (float)
Indexes
id
name
Once the HDF5 file is created and saved to disk, how do I retrieve the column and index information without having to open the file completely each time, since it will likely contain several GB of data?
Many thanks in advance...
-- UPDATE --
Thanks for the comments. To clarify a bit more:
We do have some experience with Pandas, but we are by no means proficient. The part that is tripping us up is creating an empty data structure and reading that structure back from a file that you do not want to open fully. All of the Pandas examples contain data, and they only show two ways to retrieve data/structure: read the entire frame into memory, or issue a where clause. In this case, we would like to be able to see the table structure without query operations, if possible.
I know this is an odd case. Why on earth would you want an empty dataframe? Well, we want a great deal of flexibility in moving data around, and we want to be able to define a target dataframe structure prior to writing the data, which could take place much later (e.g., hours or days). Since the HDF5 specification maintains all of that information, it seems directionally incorrect to store the table structure information separately. Hence our desire to crack the code on this subject.
-- UPDATE 2 --
To add more detail, as @Jeff requested.
We would like to abstract some of the common Pandas functions like summing data or merging two frames. Thus we would like to be able to ask each frame what their columns are so we can present a view for the user to select the result frame columns.
For example, if we imported a CSV with columns A, B, C, D, and V and saved the frame to HDF5 as my_csv.hdf then we would be able to determine the columns by opening the file.
However, in our use case it is likely that the import frame for the CSV could be cleared periodically and no longer contain the data. Knowing that the my_csv frame has certain columns and types is important because we want to let a user select those columns for summing in a downstream operation. Let's say a user wants to sum column V by the values in columns A and B only, and save the frame as my_sum. Since we can't ensure my_csv will always have data, we would like to ensure it at least contains the structure.
We are obviously open to other suggestions. It is also possible to store the table structure info in the user_block. This, again, is not ideal because the structure would then be kept in two different places, but I suppose we could always update the user_block on save with the latest column and index information for the frame, although I believe the to_* operations in Pandas blow away the user_block, so... blah. I feel like I'm talking myself into maintaining a peer structure definition, but I would REALLY love some suggestions that avoid having to do that.
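For illustration, this is the kind of access we are after, assuming the frame is written in table format (the file name, key, and data columns are only examples):

import pandas as pd

# Assumes the frame was stored with something like:
# df.to_hdf("my_csv.hdf", key="my_csv", format="table", data_columns=["id", "name"])

# Reading zero rows returns an empty, correctly typed frame without loading the data.
empty = pd.read_hdf("my_csv.hdf", "my_csv", stop=0)
print(empty.dtypes)                  # column names and dtypes only

# The storer metadata exposes the queryable (indexed) columns.
with pd.HDFStore("my_csv.hdf", mode="r") as store:
    print(store.get_storer("my_csv").data_columns)   # e.g. ['id', 'name']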

Generate table schema inspecting Excel(CSV) and import data

How would I go about creating a MySQL table schema by inspecting an Excel (or CSV) file?
Are there any ready Python libraries for the task?
Column headers would be sanitized to column names. Datatype would be estimated based on the contents of the spreadsheet column. When done, data would be loaded to the table.
I have an Excel file of ~200 columns that I want to start normalizing.
Use the xlrd module; start here. [Disclaimer: I'm the author]. xlrd classifies cells into text, number, date, boolean, error, blank, and empty. It distinguishes dates from numbers by inspecting the format associated with the cell (e.g. "dd/mm/yyyy" versus "0.00").
The job of writing code that wades through user-entered data to decide what DB datatype to use for each column is not something that can be easily automated. You should be able to eyeball the data, assign types like integer, money, text, date, datetime, time, etc., and write code to check your guesses. Note that you need to be able to cope with things like numeric or date data entered in text fields (which can look OK in the GUI). You need a strategy to handle cells that don't fit the "estimated" datatype. You need to validate and clean your data. Make sure you normalize text strings (strip leading/trailing whitespace, replace runs of whitespace with a single space). Excel text is (BMP-only) Unicode; don't bash it into ASCII or "ANSI" -- work in Unicode and encode in UTF-8 to put it in your database.
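For illustration, a small sketch of the cell-type inspection described above (the file name and sheet layout are assumptions, and note that newer xlrd releases only read legacy .xls files):

import xlrd

book = xlrd.open_workbook("survey.xls")
sheet = book.sheet_by_index(0)

for col in range(sheet.ncols):
    # Skip the header row and collect the cell types seen in this column.
    types = {sheet.cell_type(row, col) for row in range(1, sheet.nrows)}
    if xlrd.XL_CELL_DATE in types:
        guess = "date"
    elif types <= {xlrd.XL_CELL_NUMBER, xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_BLANK}:
        guess = "number"
    else:
        guess = "text"
    print(sheet.cell_value(0, col), "->", guess)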
A quick and dirty workaround with phpMyAdmin:
Create a table with the right number of columns. Make sure the data fits the columns.
Import the CSV into the table.
Use the 'Propose table structure' function.
As far as I know, there is no tool that can automate this process (I would love for someone to prove me wrong as I've had this exact problem before).
When I did this, I came up with two options:
(1) Manually create the columns in the db with the appropriate types and then import, or
(2) Write some kind of filter that could "figure out" what data types the columns should be.
I went with the first option mainly because I didn't think I could actually write a program to do the type inference.
If you do decide to write a type inference tool/conversion, here are a couple of issues you may have to deal with:
(1) Excel dates are actually stored as the number of days since December 31st, 1899; how does one infer then that a column is dates as opposed to some piece of numerical data (population for example)?
(2) For text fields, do you just make the columns of type varchar(n) where n is the longest entry in that column, or do you make it an unbounded char field if one of the entries is longer than some upper limit? If so, what's a good upper limit?
(3) How do you automatically convert a float to a decimal with the correct precision and without losing any places?
Obviously, this doesn't mean that you won't be able to (I'm a pretty bad programmer). I hope you do, because it'd be a really useful tool to have.
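As a hedged sketch of issue (2), one simple rule is to size VARCHAR from the longest observed entry and fall back to TEXT beyond an arbitrary cutoff (the cutoff and sample data below are illustrative):

def text_column_type(values, varchar_cutoff=255):
    # Size the column from the longest non-null entry actually seen.
    longest = max((len(v) for v in values if v is not None), default=0)
    return f"VARCHAR({longest})" if longest <= varchar_cutoff else "TEXT"

print(text_column_type(["alice", "bob", None]))   # VARCHAR(5)
print(text_column_type(["x" * 1000]))             # TEXT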
Just for (my) reference, I documented below what I did:
xlrd is practical; however, I just saved the Excel data as CSV so I could use LOAD DATA INFILE.
I copied the header row and started writing the import and normalization script.
The script does: CREATE TABLE with all columns as TEXT, except for the primary key.
Query MySQL: LOAD DATA LOCAL INFILE, loading all CSV data into the TEXT fields.
Based on the output of PROCEDURE ANALYSE, I was able to ALTER TABLE to give the columns the right types and lengths. PROCEDURE ANALYSE returns ENUM for any column with few distinct values, which is not what I needed, but I found that useful later for normalization. Eyeballing 200 columns was a breeze with PROCEDURE ANALYSE; the output of phpMyAdmin's 'Propose table structure' was junk.
I wrote some normalization, mostly using SELECT DISTINCT on columns and INSERTing the results into separate tables. I first added an FK column to the old table. Just after each INSERT, I took its ID and UPDATEd the FK column. When the loop finished, I dropped the old column, leaving only the FK column. I did the same for multiple dependent columns. It was much faster than I expected.
I ran (Django's) python manage.py inspectdb, copied the output to models.py and added all the ForeignKey fields, since FKs do not exist on MyISAM. I wrote a little views.py, urls.py, a few templates... TADA.
Pandas can return a schema:
pandas.read_csv('data.csv').dtypes
References:
pandas.read_csv
pandas.DataFrame
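A hedged sketch extending that dtypes idea, mapping pandas dtypes to rough MySQL column types (the mapping and table name are only illustrative):

import pandas as pd

TYPE_MAP = {"int64": "BIGINT", "float64": "DOUBLE", "bool": "TINYINT(1)",
            "datetime64[ns]": "DATETIME", "object": "TEXT"}

df = pd.read_csv("data.csv")
columns_sql = ",\n  ".join(
    f"`{name}` {TYPE_MAP.get(str(dtype), 'TEXT')}" for name, dtype in df.dtypes.items()
)
print(f"CREATE TABLE `data` (\n  {columns_sql}\n);")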
