Pandas to_hdf and import to Matlab - python

I encountered a weird problem and can't figure out what I'm doing wrong:
In Python, I have a simple matrix as a pandas DataFrame (6000 x 1500). Since I want to read this into Matlab, I'm saving the DataFrame as HDF5 as follows:
df.to_hdf("output.hdf","mytable", format="table")
Saving works fine, and reading back into Python with pd.read_hdf also works fine. But when I try to import the same file into Matlab as follows:
data = h5read('output.hdf','/mytable')
I just get an error:
H5Dopen2 not a dataset
Somewhere I read that one should leave a space in the dataset name ('/ mytable'), but that just returns an "object doesn't exist" error.
Any hints on what might be going wrong here are highly appreciated.

Playing around with h5info in Matlab, I figured out that in Matlab I need to explicitly append "table" to the dataset path:
data = h5read('output.hdf','/mytable/table')
At least this imports the HDF5 file. Strangely, I have not seen this mentioned anywhere.
However, it now seems that some rows are not imported correctly, which I need to investigate further.
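For reference, the layout pandas creates can be inspected from Python as well. Here is a minimal sketch using h5py (assuming the 'output.hdf' file from the question):
import h5py

# List every group/dataset in the file; with format="table" pandas
# stores the data as a dataset named 'table' inside the '/mytable'
# group, hence the extra path component needed in Matlab.
with h5py.File('output.hdf', 'r') as f:
    f.visit(print)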

Related

Large overhead for reading pickled pandas into R

I have a 10GB .pkl file containing four pandas DataFrames. Following the advice in this blog post, I try to load it in R. R takes up to 25GB of memory for this process, and I wonder where the computational overhead arises. How can I make this more feasible? I could save as CSV, but there should be a better solution, no?
For reference, the post suggests saving a Python function in read_pickle_file.py:
import pandas as pd

def read_pickle_file(file):
    pickle_data = pd.read_pickle(file)
    return pickle_data
and then in R:
require("reticulate")
source_python("pickle_reader.py")
pickle_data <- read_pickle_file("C:/tsa/dataset.pickle")
The only workable solution for me was to export from Python using feather, which can then also be imported from R. This resulted in (almost) no overhead (though it is not a proper answer to my question; if you have a better one I'd be happy to accept it).
A feather tutorial can be found here.
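For completeness, a minimal sketch of the feather route (file names are illustrative; to_feather requires pyarrow and expects a default integer index):
import pandas as pd

# Load the pickle once in Python, then re-export it in a columnar,
# language-agnostic format.
df = pd.read_pickle('dataset.pickle')
df.reset_index(drop=True).to_feather('dataset.feather')

# In R, the same file can then be read with, e.g.:
#   library(arrow)
#   df <- read_feather("dataset.feather")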

Python pandas.read_csv rounds, causing import errors?

I have a 10000 x 250 dataset in a CSV file. When I use the command
data = pd.read_csv('pool.csv', delimiter=',',header=None)
while in the correct path, I actually import the values.
First I get the DataFrame. Since I want to work with the NumPy package, I need to convert it to its values using
data = data.values
And this is where it gets weird. At position [9999,0] in the file I have the value -0.3839. However, after importing and calculating with it, I noticed that Python (or NumPy) does something strange during the import.
Calling data[9999,0] SHOULD give the expected -0.3839, but gives something like -0.383899892....
I already imported the file in other languages like Matlab and there was no issue with rounding those values. I also tried the .to_csv command from the pandas package instead of .values, but there is the exact same problem.
The last 10 elements of the first column are
-0.2716
0.3711
0.0487
-1.518
0.5068
0.4456
-1.753
-0.4615
-0.5872
-0.3839
Is there any import routine which does not have those rounding errors?
Passing float_precision='round_trip' should solve this issue:
data = pd.read_csv('pool.csv', delimiter=',', header=None, float_precision='round_trip')
That's a floating-point error, and it is a consequence of how computers represent numbers in binary. (You can look it up if you really want to know how it works.) Don't be bothered by it; it is very small.
If you really want exact precision (because you are testing for exact values), you can look at Python's decimal module, but your program will be a lot slower (probably around 100 times slower).
You can read more here: https://docs.python.org/3/tutorial/floatingpoint.html
You should know that all languages have this problem; some are just better at hiding it. (Also note that in Python 3 this "hiding" of the floating-point error has been improved.)
Since this problem has no ideal solution, it is up to you to choose the most appropriate workaround for your situation.
I don't know about 'round_trip' and its limitations, but it can probably help you. Another option is the float_format argument of the to_csv method (https://docs.python.org/3/library/string.html#format-specification-mini-language).
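To illustrate the decimal suggestion, here is a small sketch showing the difference between parsing the text directly and going through a binary float first:
from decimal import Decimal

# Parsing the text keeps the decimal value exact
exact = Decimal("-0.3839")

# Converting an existing float exposes its binary representation error
inexact = Decimal(-0.3839)

print(exact)    # -0.3839
print(inexact)  # the long exact expansion of the nearest binary float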

Large File crashing on Jupyter Notebook

I have a very simple task: I need to take the sum of one column in a file that has many columns and thousands of rows. However, every time I open the file in Jupyter, it crashes, since I cannot go over 100 MB per file.
Is there any workaround for such a task? I feel I shouldn't have to open the entire file, since I need just one column.
Thanks!
I'm not sure if this will work since the information you have provided is somewhat limited, but if you're using Python 3, I had a similar issue. Try adding this at the top and see if it helps:
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
The above solution is sort of a band-aid; it isn't supported and may cause undefined behavior. If your data is too big for your memory, try reading it in with dask:
import dask.dataframe as dd
df = dd.read_csv(path)  # lazy, partitioned read; pass read_csv keyword arguments as needed
You have to open the file even if you want just one column; opening it loads it into memory, and that is your problem.
You can either split the file into smaller pieces outside of IPython, OR
use a library like pandas and read it in chunks, as in the linked answer.
You should slice through the rows, put the pieces into separate DataFrames, and then work on those DataFrames.
The hanging issues are caused by insufficient RAM in your system.
Use the new_dataframe = dataframe.iloc[:, :] or new_dataframe = dataframe.loc[:, :] methods for slicing in pandas:
rows are sliced before the colon, columns after it.
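To address the original task directly, here is a minimal sketch that sums a single column without ever holding the whole file in memory (the file name 'large_file.csv' and column name 'target' are placeholders):
import pandas as pd

total = 0
# usecols keeps only the one column; chunksize streams the file
# in pieces of 100,000 rows instead of loading it all at once.
for chunk in pd.read_csv('large_file.csv', usecols=['target'], chunksize=100_000):
    total += chunk['target'].sum()

print(total)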

How to modify the default mapper of the StringConverter class?

I am trying to read a CSV file in Python 3 using the NumPy genfromtxt function. In my CSV file I have a field that is a string looking like the following: "0x30375107333f3333".
I need to use the "dtype=None" option because I need this section of code to work with many different CSV files, only some of which have such a field. Unfortunately, NumPy interprets this as a float128, which is a pain because 1) it is not a float and 2) I cannot find a way to convert it to an int after it has been read as a float128 (without losing precision).
What I would like instead is to interpret this as a string, because that is enough for me. I found in the NumPy documentation that there is a way around this, but the instructions are cryptic:
This behavior may be changed by modifying the default mapper of the StringConverter class.
Unfortunately, whenever I Google anything related to this, I land back on that same documentation page.
I would greatly appreciate either an explanation of what the above quote means or a solution to my stated problem.
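One possible workaround, sidestepping StringConverter entirely, is to force the affected column to stay a string via genfromtxt's converters argument. A minimal sketch (column index 1 is assumed here purely for illustration):
import numpy as np
from io import StringIO

data = StringIO("1.5,0x30375107333f3333\n2.5,0x30375107333f3334")

# The converter returns the raw token unchanged, so the column is
# kept as a string instead of being parsed as a float.
arr = np.genfromtxt(data, delimiter=',', dtype=None,
                    converters={1: lambda s: s}, encoding='utf-8')
print(arr)
# The hex strings can later be converted exactly with int(s, 16).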

Importing a lisp data structure into python

Quick aside: I'm a bit of a rookie with Python, so forgive my imprecise descriptions AND ask me questions if I don't provide enough information.
As my title indicates, I'm attempting to bring in a data set that is a Lisp data structure. I'm trying to start small and work with a smaller data set (as I'm going to be dealing with much larger ones eventually); however, I'm unclear as to how I should set up the separators for pandas.
So, I'm bringing in a .dat file from a Lisp data structure and reading it with pandas (or attempting to).
My goal is to turn it into a normal data set, where I can associate a given function, say, with its respective outputs.
My Lisp Data set looks like the following:
(setf nameoffile?'
((function-1 output1) (function-2 output2 output3 output4) (function-3 output5 output6 output7...)
(function-4 output)
...
(function-N outputN outputM ... )) )
Hopefully this is not too cryptic. Please, let me know if I'm not providing enough information.
Lastly, my goal is to have all of the functions, let's say, in rows, with the outputs reading across each row in a pandas DataFrame (since I'm used to that); for example:
function-1: output1
function-2: output2 and so on and so forth...
Again, please let me know if I'm being confusing or did not provide enough information.
Thank you so much in advance!
EDIT:
My specific question is: how can I get this somewhat ambiguous Lisp data structure into a pandas DataFrame? Additionally, I don't know how to arrange the pieces into the desired rows, or how to separate them (delimiter/sep = ?). When I read this in via pandas, I get a very jumbled DataFrame. I think a key issue is: how do I separate the fields appropriately?
As noted by @molbdnilo and @sds, it's probably easier to export the data from Lisp in a common format and then import it into Python using an existing parser.
For example, you can save it to a CSV file from Lisp using the cl-csv library, which is also available on Quicklisp.
As you can see from the cl-csv tests, you can get a CSV string from your data using the write-csv function:
(write-csv *your-data-rows* :always-quote t)
Or, if you want to proceed line by line, you can use the write-csv-row function.
Then it will be easy to save the resulting string to a file and read the CSV from Python.
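On the Python side, here is a minimal sketch of reading such an export, assuming each CSV row starts with the function name followed by a variable number of outputs (the file name is illustrative):
import csv
import pandas as pd

rows = {}
with open('exported.csv', newline='') as f:
    for record in csv.reader(f):
        # first field = function name, the rest are its outputs
        rows[record[0]] = record[1:]

# Ragged rows are padded with NaN so every row has the same width
df = pd.DataFrame.from_dict(rows, orient='index')
print(df)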
If your Lisp program isn't already too large, consider rewriting it in Hy. Hy is a Lisp dialect, so you can continue writing in Lisp. But also,
Hy maintains, over everything else, 100% compatibility in both directions with Python itself.
This means you can use Python libraries when writing Hy, and you can write a module in Hy to use in Python.
I don't know how your project is set up (and I don't know pandas), but perhaps you can use this to communicate directly with pandas?
