I have a Python script that loads an .RData file, reads it, and then writes it out to an Excel file. Unfortunately, one table contains 11 variables and 144 objects with mixed types (IntVector, FactorVector, FloatVector, etc.).
When the table writes to Excel, the column names and data are preserved, except for the column that is a four-level FactorVector. Instead of returning the metadata (a,a,a,a,b,b,b,b,c,c,c,c,d,d,d,d...etc.) associated with the four levels, it returns integer values associated with each level (1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4...etc.)
I found this on the rpy2 sourceforge website, which pretty much explains my problem.
Since a FactorVector is an IntVector with attached metadata (the levels), getting items Python-style was not changed from what happens when getting items from an IntVector. A consequence of that is that information about the levels is then lost.
It goes on below this to explain using levels, at which point I get lost as to what exactly I should do or use to keep the metadata levels intact for the FactorVector variable in question.
I presume there is some sort of rpy2.robjects "switch" that will preserve this metadata when it gets translated into Python? What would be the most efficient way to apply this? Thanks!
The conversion layer customized for pandas DataFrames in rpy2-2.6.0 should take care of converting R factors to pandas categoricals (the pandas equivalent of factors).
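For example, a minimal sketch assuming the rpy2 2.6-era API; 'mydata.RData' and 'my_table' are placeholders for your file and the data.frame's name inside it, and to_excel needs an Excel writer such as openpyxl installed:

    import rpy2.robjects as robjects
    from rpy2.robjects import pandas2ri

    pandas2ri.activate()                      # enable R <-> pandas conversion

    robjects.r['load']('mydata.RData')        # loads objects into R's global environment
    r_obj = robjects.globalenv['my_table']    # placeholder name for your data.frame

    # Depending on the rpy2 version this is already a pandas DataFrame;
    # if not, pandas2ri.ri2py() performs the conversion explicitly.
    pdf = r_obj if hasattr(r_obj, 'to_excel') else pandas2ri.ri2py(r_obj)

    print(pdf.dtypes)                         # the four-level factor should show up as categorical
    pdf.to_excel('mydata.xlsx', index=False)  # labels a/b/c/d are written instead of 1/2/3/4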
I want to be able to do two things:
Store a hash of a dataset's contents (so I can decide whether it has updated). To date, I have done this via a second output dataset with a single row that stores the hash and row count. In my Transform I can read that output and compare it to the current build's hash and row count to decide if the data has updated. This works fine, but I'd like to avoid having a second dataset if possible.
Pass through timestamps from upstream dependencies so that in downstream workflows I can answer "when did dependency X last update?"
It seems like both of these could be solved by some sort of key-value metadata store on the dataset.
You're correct that one of the most straightforward ways to do this is to decorate the rows with a timestamp value, and in fact with Foundry's Parquet storage system, this will be encoded using Dictionary Encoding, a highly efficient mechanism to store repeated values.
The problem with this approach is you'll have to stack a new column for each phase of updating you want to keep track of. This might prove annoying to maintain in practice.
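As a rough illustration of that row-decoration approach in a Foundry Python transform (a sketch only; the dataset paths and column name are placeholders):

    from pyspark.sql import functions as F
    from transforms.api import transform_df, Input, Output


    @transform_df(
        Output("/examples/decorated_output"),       # placeholder paths
        source=Input("/examples/upstream_dataset"),
    )
    def compute(source):
        # One repeated timestamp per row; Parquet dictionary-encodes this cheaply.
        return source.withColumn("last_built_at", F.current_timestamp())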
However, if you don't want to add this data to your rows and instead simply want to store your metadata, you have two options, one of which you've already found:
Store metadata in a separate dataset
Write an 'unused' file (probably .csv or .txt) to your output keeping track of this information
Foundry won't consider your .csv or .txt extra file on the output if you're writing a standard DataFrame to it since your schema by default will only read Parquet files. This means you can store this little snippet of information without affecting your output. If you check platform documentation, you can confirm that it's possible to write both a DataFrame to an output and a file of your own.
However, it may be simpler to interact with a second output, since the mechanics of Incremental Transforms and schema handling will be taken care of for you, so I'd recommend proceeding with option 1 as you are doing right now.
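That said, if you do want to experiment with option 2, here is a hedged sketch assuming the transforms.api filesystem interface on the output (paths and the hashing scheme are placeholders):

    import json
    from transforms.api import transform, Input, Output


    @transform(
        out=Output("/examples/output_dataset"),      # placeholder paths
        source=Input("/examples/upstream_dataset"),
    )
    def compute(out, source):
        df = source.dataframe()
        out.write_dataframe(df)                      # normal Parquet output; the schema only reads Parquet

        # Extra non-Parquet file on the same output, ignored by the schema
        meta = {"row_count": df.count(), "content_hash": "<your hash here>"}
        with out.filesystem().open("_metadata.json", "w") as fh:
            fh.write(json.dumps(meta))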
I've used TERR to make calculated columns and other types of data functions in Spotfire and am happy to hear you can now use Python. I did a simple test to ensure things are working (x3 = x2*2) - that's literally the script I wrote in the Python data function window. I then set up the input parameter (x2 as a column) and the output parameter (x3) to be a new column. The values come out fine, but the newly calculated column comes out named x2(2). I looked into the input/output parameters and all the names are correct, yet the column still comes out named that way. My concern is that this is uber simple, yet the column is not being named what is in the script even though everything is set up correctly. There is even a YouTube video by a Spotfire employee where the same thing happens to them, and they don't mention it at all.
Has anybody else run across this?
It does seem to differ from how the equivalent TERR data function works. I consulted with the Spotfire engineering team, and here is what they suggest. It has to do with how a Column input is handled internally in Python vs TERR. In both Python and TERR, inputs (and outputs) are passed over as a table. In TERR's case a data.frame, and in Python's case a pandas.DataFrame. In TERR's case though, if the Data Function says the input is a Column, this is actually converted from a 1-column data.frame to a vector of the equivalent type; similarly, for a Value it is converted from its 1x1 data.frame to a scalar type. In Python, Value inputs are treated the same, but Column inputs are left as a pandas.Series, which retains the column name from the original input column.
Maybe you can try something different. You wouldn't want to convert it to a standard Python list, because in that case, x2*2 would actually make the column twice as long, rather than a vectorised arithmetic operation. But you could make it a straight numpy array instead. You can try adding "x2 = x2.to_numpy()" at the top of your example, and see if the result matches what you expected.
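For example, a minimal stand-alone sketch (in Spotfire the x2 Series would be injected for you rather than constructed by hand):

    import pandas as pd

    # Stand-in for the Column input Spotfire would pass in as a pandas.Series
    x2 = pd.Series([1.0, 2.0, 3.0], name="x2")

    x2 = x2.to_numpy()   # drop the Series (and its "x2" name)
    x3 = x2 * 2          # vectorised; the output column is then named by the x3 output parameter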
We would like to be able to allow the HDF5 files themselves to define their columns, indexes, and column types instead of maintaining a separate file that defines structure of the HDF5 data.
How can I create an empty HDF5 file from Pandas with a specific table structure like:
Columns
id (Int)
name (Str)
update_date (datetime)
some_float (float)
Indexes
id
name
Once the HDF5 file is created and saved to disk, how do I retrieve the column and index information without having to open the file completely each time, since it will likely contain several GB of data?
Many thanks in advance...
-- UPDATE --
Thanks for the comments. To clarify a bit more:
We do have some experience with Pandas but are by no means proficient. The part that is tripping us up is creating an empty data structure and then reading that structure back from a file we will not want to fully open. In all of the Pandas examples there is data. The Pandas examples also only show two ways to retrieve data/structure: read the entire frame into memory, or issue a where clause. In this case, we would like to be able to see the table structure without query operations if possible.
I know this is an odd case. Why the heck would you want an empty dataframe?? Well, we want a great deal of flexibility in moving data around and want to be able to define a target dataframe structure prior to data writing, which could take place much later (e.g. hours or days). Since the HDF5 specification maintains all that information, it seems directionally incorrect to store the table structure information separately. Thus our desire to crack the code on this subject.
-- UPDATE 2 --
To add more detail, as @Jeff requested.
We would like to abstract some of the common Pandas functions like summing data or merging two frames. Thus we would like to be able to ask each frame what its columns are so we can present a view for the user to select the result frame columns.
For example, if we imported a CSV with columns A, B, C, D, and V and saved the frame to HDF5 as my_csv.hdf then we would be able to determine the columns by opening the file.
However, in our use case it is likely that the import frame for the CSV could be cleared periodically and no longer contain the data. The reason it is important to know that the my_csv frame has certain columns and types is that we want to enable a user to then select those columns for summing in a downstream operation. Let's say a user wants to sum column V by the values in columns A and B only and save the frame as my_sum. Since we can't ensure my_csv will always have data, we would like to ensure it at least contains the structure.
Open to other suggestions, obviously. It is also possible to store the table structure info in the user_block. This, again, is not ideal because the structure is now being kept in two different areas, but I guess it would be possible to always update the user_block on save using the latest column and index information for the frame, although I believe the to_* operations in Pandas will blow away the user_block, so... blah. I feel like I'm talking myself into maintaining a peer structure definition, but I REALLY would love some suggestions to avoid having to do that.
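To make the intent concrete, here is roughly what I imagine doing (untested on our side; in particular I'm not certain every pandas version will accept a zero-row 'table'-format write):

    import pandas as pd

    # The desired (empty) structure
    empty = pd.DataFrame({
        "id": pd.Series(dtype="int64"),
        "name": pd.Series(dtype="object"),
        "update_date": pd.Series(dtype="datetime64[ns]"),
        "some_float": pd.Series(dtype="float64"),
    })

    with pd.HDFStore("structure.h5") as store:
        # 'table' format with data_columns gives queryable, indexed columns
        store.put("my_table", empty, format="table", data_columns=["id", "name"])

    # Later: peek at the structure without pulling GBs of rows into memory
    with pd.HDFStore("structure.h5") as store:
        print(store.select("my_table", stop=0).dtypes)   # zero-row read, dtypes only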
I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into in all three approaches: I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments, then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]], you would retrieve values by asking for the value at an address, array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use one big table to keep all 500+ million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will be deduplicated, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simple solution works best ;-)
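A minimal sketch of that single-big-table approach with Blosc filters (column widths and the file name are just placeholders):

    import tables as tb


    class Result(tb.IsDescription):
        stringArg1 = tb.StringCol(32)
        stringArg2 = tb.StringCol(32)
        stringArg3 = tb.StringCol(32)
        stringArg4 = tb.StringCol(32)
        intArg1 = tb.Int32Col()
        results = tb.Float64Col(shape=(12,))   # the 12 float results per entry


    filters = tb.Filters(complevel=5, complib="blosc")
    with tb.open_file("simulations.h5", "w") as h5:
        table = h5.create_table("/", "results", Result,
                                filters=filters, expectedrows=600_000_000)
        row = table.row
        row["stringArg1"], row["stringArg2"] = "a", "b"
        row["stringArg3"], row["stringArg4"] = "c", "d"
        row["intArg1"] = 1
        row["results"] = [0.0] * 12
        row.append()
        table.flush()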
Is there a reason the basic 6 table approach doesn't apply?
i.e. Tables 1-5 would be single column tables defining the valid values for each of the fields, and then the final table would be a 5 column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string values as you describe, the 6th table could just consist of 3 columns (string1, string2, int1) and you generate the combinations with string3 and string4 dynamically via a Cartesian join.
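For illustration only, here is what that six-table layout could look like in SQLite (table and column names are invented; a real deployment might use integer surrogate keys instead):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE string_arg1 (value TEXT PRIMARY KEY);
    CREATE TABLE string_arg2 (value TEXT PRIMARY KEY);
    CREATE TABLE string_arg3 (value TEXT PRIMARY KEY);
    CREATE TABLE string_arg4 (value TEXT PRIMARY KEY);
    CREATE TABLE int_arg1    (value INTEGER PRIMARY KEY);

    -- Only combinations that actually exist get a row here.
    CREATE TABLE results (
        string_arg1 TEXT REFERENCES string_arg1(value),
        string_arg2 TEXT REFERENCES string_arg2(value),
        string_arg3 TEXT REFERENCES string_arg3(value),
        string_arg4 TEXT REFERENCES string_arg4(value),
        int_arg1    INTEGER REFERENCES int_arg1(value),
        float_result_01 REAL,   -- ... through float_result_12
        PRIMARY KEY (string_arg1, string_arg2, string_arg3, string_arg4, int_arg1)
    );
    """)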
I'm not entirely sure what you're trying to do here, but it looks like you're trying to create a (potentially) sparse multidimensional array. So I won't go into details for solving your specific problem, but the best package I know of that deals with this is NumPy. NumPy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used Numpy many times for simulation data processing and it provides many useful tools including easy file storage/access.
Hopefully you'll find something in its very easy to read documentation:
Numpy Documentation with Examples
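For what it's worth, the file storage/access mentioned above can be as simple as this (array shape and file name are arbitrary for the example):

    import numpy as np

    # A dense container for the results (shape chosen arbitrarily for the example)
    results = np.zeros((50, 20, 6, 24), dtype=np.float64)
    results[0, 1, 2, 3] = 42.0

    np.save("results.npy", results)     # binary .npy file on disk
    loaded = np.load("results.npy")
    assert loaded[0, 1, 2, 3] == 42.0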
How would I go about creating a MySQL table schema by inspecting an Excel (or CSV) file?
Are there any ready Python libraries for the task?
Column headers would be sanitized to column names. Datatype would be estimated based on the contents of the spreadsheet column. When done, data would be loaded to the table.
I have an Excel file of ~200 columns that I want to start normalizing.
Use the xlrd module; start here. [Disclaimer: I'm the author]. xlrd classifies cells into text, number, date, boolean, error, blank, and empty. It distinguishes dates from numbers by inspecting the format associated with the cell (e.g. "dd/mm/yyyy" versus "0.00").
The job of programming some code to wade through user-entered data to decide on what DB datatype to use for each column is not something that can be easily automated. You should be able to eyeball the data, assign types like integer, money, text, date, datetime, time, etc., and write code to check your guesses. Note that you need to be able to cope with things like numeric or date data entered in text fields (it can look OK in the GUI). You need a strategy to handle cells that don't fit the "estimated" datatype. You need to validate and clean your data. Make sure you normalize text strings (strip leading/trailing whitespace, replace multiple whitespace characters with a single space). Excel text is (BMP-only) Unicode; don't bash it into ASCII or "ANSI" -- work in Unicode and encode in UTF-8 to put it in your database.
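A small sketch of reading those cell classifications with xlrd (the file name and the naive "first data row decides" heuristic are just for illustration):

    import xlrd

    book = xlrd.open_workbook("spreadsheet.xls")
    sheet = book.sheet_by_index(0)

    for col in range(sheet.ncols):
        header = sheet.cell_value(0, col)        # header row -> candidate column name
        ctype = sheet.cell_type(1, col)          # first data row -> naive type guess
        value = sheet.cell_value(1, col)
        if ctype == xlrd.XL_CELL_DATE:
            # dates are stored as numbers; the workbook datemode disambiguates the epoch
            value = xlrd.xldate_as_tuple(value, book.datemode)
        print(header, ctype, value)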
Quick and dirty workaround with phpmyadmin:
Create a table with the right amount of columns. Make sure the data fits the columns.
Import the CSV into the table.
Use the "Propose table structure" feature.
As far as I know, there is no tool that can automate this process (I would love for someone to prove me wrong as I've had this exact problem before).
When I did this, I came up with two options:
(1) Manually create the columns in the db with the appropriate types and then import, or
(2) Write some kind of filter that could "figure out" what data types the columns should be.
I went with the first option mainly because I didn't think I could actually write a program to do the type inference.
If you do decide to write a type inference tool/conversion, here are a couple of issues you may have to deal with:
(1) Excel dates are actually stored as the number of days since December 31st, 1899; how does one infer, then, that a column contains dates as opposed to some other piece of numerical data (population, for example)?
(2) For text fields, do you just make the columns of type varchar(n) where n is the longest entry in that column, or do you make it an unbounded char field if one of the entries is longer than some upper limit? If so, what's a good upper limit?
(3) How do you automatically convert a float to a decimal with the correct precision and without losing any places?
Obviously, this doesn't mean that you won't be able to (I'm a pretty bad programmer). I hope you do, because it'd be a really useful tool to have.
Just for (my) reference, I documented below what I did:
xlrd is practical; however, I just saved the Excel data as CSV so I could use LOAD DATA INFILE.
I've copied the header row and started writing the import and normalization script
Script does: CREATE TABLE with all columns as TEXT, except for Primary key
query mysql: LOAD DATA LOCAL INFILE loading all CSV data into TEXT fields.
Based on the output of PROCEDURE ANALYSE, I was able to ALTER TABLE to give the columns the right types and lengths (see the SQL sketch after this list). PROCEDURE ANALYSE returns ENUM for any column with few distinct values, which is not what I needed, but I found that useful later for normalization. Eyeballing 200 columns was a breeze with PROCEDURE ANALYSE. The output from phpMyAdmin's "propose table structure" was junk.
I wrote some normalization, mostly using SELECT DISTINCT on columns and INSERTing the results into separate tables. First I added an FK column to the old table. Just after each INSERT, I got its ID and UPDATEd the FK column. When the loop finished I dropped the old column, leaving only the FK column. Similarly with multiple dependent columns. It was much faster than I expected.
I ran (Django) python manage.py inspectdb, copied the output to models.py and added all those ForeignKey fields, as FKs do not exist on MyISAM. I wrote a little Python views.py, urls.py, a few templates... TADA
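For reference, a rough reconstruction of the core SQL from the steps above, driven from Python via pymysql (names, credentials, and the column list are placeholders; note that PROCEDURE ANALYSE exists in MySQL 5.x but was removed in 8.0):

    import pymysql

    conn = pymysql.connect(host="localhost", user="me", password="secret",
                           database="imports", local_infile=True)
    with conn.cursor() as cur:
        cur.execute("""CREATE TABLE raw_csv (
                           id INT AUTO_INCREMENT PRIMARY KEY,
                           col_a TEXT, col_b TEXT, col_c TEXT)""")
        cur.execute("""LOAD DATA LOCAL INFILE 'export.csv' INTO TABLE raw_csv
                       FIELDS TERMINATED BY ',' ENCLOSED BY '"'
                       IGNORE 1 LINES
                       (col_a, col_b, col_c)""")
        # Suggested optimal types (and lengths) per column
        cur.execute("SELECT col_a, col_b, col_c FROM raw_csv PROCEDURE ANALYSE()")
        for suggestion in cur.fetchall():
            print(suggestion)
    conn.commit()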
Pandas can return a schema:
pandas.read_csv('data.csv').dtypes
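If you want to go a step further, a hedged sketch mapping those inferred dtypes to rough MySQL column types (the mapping dict is invented and deliberately crude):

    import pandas as pd

    df = pd.read_csv("data.csv")
    print(df.dtypes)

    mysql_types = {"int64": "BIGINT", "float64": "DOUBLE",
                   "datetime64[ns]": "DATETIME", "bool": "TINYINT(1)", "object": "TEXT"}
    columns = ", ".join("`{}` {}".format(col, mysql_types.get(str(dtype), "TEXT"))
                        for col, dtype in df.dtypes.items())
    print("CREATE TABLE my_table ({});".format(columns))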
References:
pandas.read_csv
pandas.DataFrame