I am trying to read a csv file in Python3 using the numpy genfromtxt function. In my csv file I have a field which is a string that looks like the following: "0x30375107333f3333".
I need to use the "dtype=None" option because I need this section of code to work with many different csv files, only some of them having such a field. Unfortunately numpy interprets this as a float128, which is a pain because 1) it is not a float and 2) I cannot find a way to convert it to an int after it has been read as a float128 (without losing precision).
What I would like to do instead is interpret this as a string, because that is enough for me. I found in the NumPy documentation that there is a way of getting around this, but the instructions are cryptic:
This behavior may be changed by modifying the default mapper of the StringConverter class.
Unfortunately whenever I Google something related to this I fall back to this documentation page.
I would greatly appreciate either an explanation of what they mean in the above quoted text or a solution to my above stated problem.
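One possible workaround, sketched here under assumptions, is to give up type inference and read every field as text with dtype=str, then convert the hex column yourself. The CSV content and column layout below are made up for illustration, and this does not touch the StringConverter mapper mentioned in the documentation; it simply sidesteps it.

```python
import io

import numpy as np

# Hypothetical two-column CSV; the second column holds hex identifiers.
csv_text = "name,id\nfirst,0x30375107333f3333\nsecond,0x00000000000000ff\n"

# dtype=str keeps every field as text, so nothing is parsed as a float.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=str, skip_header=1)

hex_ids = raw[:, 1]

# int(text, 16) accepts the "0x" prefix and returns an exact Python int,
# so no precision is lost on the way.
int_ids = [int(text, 16) for text in hex_ids]
```

The trade-off is that numeric columns also come back as strings and would need an explicit conversion afterwards, for example with astype(float) on the relevant columns.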
I am trying to format a pandas DataFrame value representation.
Basically, all I want is to get the "Thousand" separator on my values.
I managed to do it using the pd.style.format function. It does the job, but it also "breaks" my table's original design.
Here is an example of what is going on:
Is there anything I can do to avoid this? I want to keep the original table format and only change the format of the values.
PS: Don't know if it makes any difference, but I am using Google Colab.
In case anyone is having the same problem as I was using Colab, I have found a solution:
.set_table_attributes('class="dataframe"') seems to solve the problem
More information can be found here: https://github.com/googlecolab/colabtools/issues/1687
For this case you could do:
pdf.assign(a=pdf['a'].map("{:,.0f}".format))
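As a small illustration of the assign/map approach above, here is a sketch with a made-up DataFrame (the column names and values are assumptions). Note that the formatted column holds strings, so it is suited to display rather than further arithmetic.

```python
import pandas as pd

# Hypothetical data; only column "a" should get thousands separators.
pdf = pd.DataFrame({"a": [1234.0, 1000000.0], "b": ["x", "y"]})

# "{:,.0f}".format inserts "," as the thousands separator and drops decimals.
# assign returns a new DataFrame, so the numeric original stays untouched.
formatted = pdf.assign(a=pdf["a"].map("{:,.0f}".format))

print(formatted)
```

Because no Styler is involved, the table keeps whatever default rendering the notebook applies to a plain DataFrame.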
Today, and on several other occasions, I received an error like this:
{TypeError}ufunc subtract cannot use operands with types dtype('<M8[us]') and dtype('O').
On other days, I'd want to use some printf-type command and be at a loss as to which character stood for some obscure data type (e.g. signed octal value).
I always had a hard time finding the definitions of what I now know are called "type codes" or "Array-protocol type strings" in the first example, not to be confused with "printf-style String Formatting conversion characters" in the latter case. They are single characters in string-literal quotes, so Googling them is a mess, as is trying to find synonyms for a term I didn't know. Maybe I'm just bad at RegEx and can't navigate man pages well enough, but I wanted to post a possibly self-answered question in order to tag a bunch of synonyms for the things I was trying to find, which in the end turned out to be type codes. I knew I was looking for Python or NumPy data types and scoured the internet for dtype('<M8[us]') for the longest time, so I thought I'd help those who end up in a similar situation by providing a would-be online bookmark.
I had already read about various data types and this syntax in the past from various sources. I knew about the little-endian symbol '<' and that '8' had something to do with the size (though its meaning changes depending on the dtype), but I had no idea what 'M' or '[us]' was defining. In my late-night stupidity I looked over the NumPy and Python docs, but for earlier versions than the ones in my current env. It looks like 'M' did not appear until recently, so I was left thinking all the tables in the docs were non-exhaustive and that there was some other Unix- or C-based definition of all these type codes (which I still have not ruled out, but I assume this is not the case now that I've found 'M' in the docs for my current NumPy version).
I will put the various resources that I've located regarding these type codes in Python and associated libraries here, but I'm sure there are plenty more, so I would welcome others' additions/edits. I'll add all my links as an answer, and who knows, if others also found themselves in this situation, maybe I'll make a type code cheat sheet or something as a general resource online somewhere. Anyway, I think it would be helpful to gather them in a place tagged with the keywords I was using, to no avail, when trying to find them, like: python numpy data type shorthand definitions, python numpy dtype abbreviations, python array dtype codes, etc. If any other words come to mind for labeling these un-googleable terms, feel free to edit and add.
General notes:
Make sure you are reading the doc for the right version of python, numpy, etc.
The codes used depend on the use case (i.e. numpy array-protocol type strings are different than those used to define the types in general python arrays)
Even worse, some of the same characters are used to mean different things depending on the use case ('b' and 'B' for example if you compare numpy and python arrays, or 'd' if comparing python printf and array codes).
Numpy 1.17: Array-protocol type strings and the 'M' type
Python 3.8.0: printf conversion types
Python 3.8.0 Array type codes. Edit: This class is not used often, but just wanted here for comparative and exhaustive reference.
Python 3.8.0 string formatter "mini language" syntax, aka "presentation types"
I won't go to the trouble of reiterating the docs, even though my answer is primarily links, since I don't expect the docs to go down anytime soon. But for the main point of how I got here: 'M' stands for a datetime type in numpy and '[us]' means microsecond resolution.
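To make the reading of '<M8[us]' concrete, here is a short sketch that asks NumPy itself what each part means, and contrasts the letter 'd' in the array module with 'd' in printf-style formatting. The variable names are only for illustration, and the equality check assumes a little-endian machine, where '<' is the native byte order.

```python
import array

import numpy as np

dt = np.dtype("<M8[us]")

kind = dt.kind                       # 'M' is the kind code for datetime64
size = dt.itemsize                   # the '8': bytes per element
unit, count = np.datetime_data(dt)   # the '[us]': microsecond resolution

# The type string is just shorthand for the long dtype name.
# (Assumes a little-endian machine, where '<' is the native order.)
same = dt == np.dtype("datetime64[us]")

# The same letter means different things in different code systems:
doubles = array.array("d", [1.5])    # array type code 'd' = C double
as_decimal = "%d" % 42               # printf 'd' = signed integer decimal
as_octal = "%o" % 8                  # printf 'o' = signed octal value
```

The TypeError quoted in the question mentions dtype('O'), which is NumPy's code for generic Python objects, so one operand was a datetime64 array and the other an object array.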
I have a 10000 x 250 dataset in a csv file. When I use the command
data = pd.read_csv('pool.csv', delimiter=',',header=None)
from the correct path, the values are imported as expected.
First I get the Dataframe. Since I want to work with the numpy package I need to convert this to its values using
data = data.values
And this is when it gets weird. The file has the value -0.3839 at position [9999,0]. However, after importing and calculating with it, I noticed that Python (or numpy) does something strange while importing.
Calling the value of data[9999,0] SHOULD give the expected -0.3839, but gives something like -0.383899892....
I already imported the file in other languages like Matlab and there was no issue with rounding those values. I also tried to use the .to_csv command from the pandas package instead of .values. However, the exact same problem occurs.
The last 10 elements of the first column are
-0.2716
0.3711
0.0487
-1.518
0.5068
0.4456
-1.753
-0.4615
-0.5872
-0.3839
Is there any import routine, which does not have those rounding errors?
Passing float_precision='round_trip' should solve this issue:
data = pd.read_csv('pool.csv',delimiter=',',header=None,float_precision='round_trip')
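A small self-contained check of this option, using an in-memory stand-in for pool.csv (the two rows below are made up, taken from the values quoted in the question). With 'round_trip', the C parser is meant to produce the same float that Python's own float() would give for the text.

```python
import io

import pandas as pd

# Stand-in for pool.csv: two rows, two columns, no header.
csv_text = "-0.5872,1.0\n-0.3839,2.0\n"

data = pd.read_csv(
    io.StringIO(csv_text),
    delimiter=",",
    header=None,
    float_precision="round_trip",  # only affects the default C parser engine
).values

last = data[-1, 0]
print(repr(float(last)))
```

If the value still looks off after this, it may be worth checking whether a later step converts the array to a lower-precision dtype such as float32.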
That's a floating-point error. Most decimal fractions, such as -0.3839, cannot be represented exactly in binary floating point, so the nearest representable value is stored instead. Don't be bothered by it; the difference is very small.
If you really want to use exact precision (because you are testing for exact values) you can look at the decimal module of Python, but your program will be a lot slower (probably like 100 times slower).
You can read more here: https://docs.python.org/3/tutorial/floatingpoint.html
You should know that all languages have this problem; some are just better at hiding it. (Also note that in Python3 this "hiding" of the floating point error has been improved.)
Since this problem has no ideal solution, it is up to you to choose the most appropriate solution for your situation.
I don't know about 'round_trip' and its limitations, but it probably can help you. Other solutions would be to use float_format from the to_csv method. (https://docs.python.org/3/library/string.html#format-specification-mini-language)
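To illustrate the points above about binary floats, the decimal module, and format specifications, here is a minimal sketch using only the standard library. The numbers are arbitrary examples, not taken from the asker's file.

```python
from decimal import Decimal

# Binary floats: 0.1 and 0.2 are stored as nearby approximations,
# so their sum is not exactly the stored approximation of 0.3.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

# Decimal works in base 10, so the same comparison holds exactly.
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# A format specification controls how many digits are shown,
# which is often all that is needed for output.
shown = format(-0.3839, ".4f")

# Asking for more digits reveals the stored approximation.
full = format(-0.3839, ".20f")

print(float_sum_is_exact, decimal_sum_is_exact, shown, full)
```

For tests on computed floats, comparing with a tolerance (for example math.isclose) is usually more practical than switching everything to Decimal.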
I encountered a weird problem, and can't find what I'm doing wrong:
In Python, I have a simple matrix as pandas dataframe (6000 x 1500 matrix). As I want to read this into Matlab I'm saving the dataframe as HDF5 as follows:
df.to_hdf("output.hdf","mytable", format="table")
Saving works fine, and reading back to Python with pd.read_hdf, also works fine. But when I try to import the same file into Matlab as follows:
data = h5read('output.hdf','/mytable')
I just get an error:
H5Dopen2 not a dataset
Somewhere I read to leave a space in the dataset name ('/ mytable') but that just returns an "object doesn't exist" error.
Any hints on what might be going wrong here are highly appreciated.
Playing around with h5info in Matlab, I figured out that in Matlab I need to explicitly specify "table" in the dataset:
data = h5read('output.hdf','/mytable/table')
At least this imports the HDF5. Strange though that I have not seen this mentioned anywhere.
However, now it seems that some rows are not imported correctly, which I need to further investigate.
Quick aside: I'm a bit of a rookie with Python, so forgive my incorrect ways of describing things AND ask me questions if I don't provide enough information.
As my title indicates, I'm attempting to bring in a data set that is a Lisp data structure. I'm trying to start small and work with a smaller data set (as I'm going to be dealing with much larger ones eventually); however, I'm unclear as to how I should set up my separators for pandas.
So, I'm bringing in a .dat file containing a Lisp data structure, and reading it with pandas (or attempting to).
My goal is to have it be a normal data set, where I can separate a given, say, function from its respective outputs.
My Lisp Data set looks like the following:
(setf nameoffile?'
((function-1 output1) (function-2 output2 output3 output4) (function-3 output5 output 6 output7...)
(function-4 output)
...
(function-N outputN outputM ... )) )
Hopefully this is not too cryptic. Please, let me know if I'm not providing enough information.
Lastly, my goal is to have all of the functions, let's say, in a row and have the outputs read across the row in a pandas dataframe (since I'm used to that); for example:
function-1: output1
function-2: output2 and so on and so forth...
Again, please let me know if I'm a bit confusing, or did not provide enough information.
Thank you so much in advance!
EDIT:
My specific question is: how can I insert this somewhat ambiguous Lisp data structure into a pandas dataframe? Additionally, I don't know how to arrange what I want into the desired rows or how to separate the values (delimiter/sep = ?). When I read this in via pandas, I get a very jumbled dataframe. I think the key issue is how to separate the values appropriately.
As noted by @molbdnilo and @sds, it's probably easier to export the data from Lisp in a common format and then import it in Python using an existing parser.
For example, you can save the data to a CSV file from Lisp using the cl-csv library, which is also available on Quicklisp.
As you can see from the cl-csv tests, you can get a CSV string from your data using the write-csv function:
(write-csv *your-data-rows* :always-quote t)
Or, if you want to proceed line-by-line, you can use the write-csv-row function.
Then it will be easy to save the resulting string into a file and read this CSV from Python.
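On the Python side, reading such a file could look roughly like the sketch below. It assumes one CSV row per function, with the function name first and a varying number of outputs after it; the sample text stands in for the exported file and is made up to mirror the structure in the question. The csv module is used because rows of unequal length are awkward for pd.read_csv without extra arguments.

```python
import csv
import io

import pandas as pd

# Stand-in for the CSV exported from Lisp (fully quoted, ragged rows).
csv_text = (
    '"function-1","output1"\n'
    '"function-2","output2","output3","output4"\n'
    '"function-3","output5"\n'
)

# csv.reader copes with rows of different lengths.
rows = list(csv.reader(io.StringIO(csv_text)))

# DataFrame pads the shorter rows with missing values;
# the first column becomes the row label.
table = pd.DataFrame(rows).set_index(0)
table.index.name = "function"

print(table)
```

With a real file, io.StringIO(csv_text) would be replaced by an open file handle, ideally opened with newline="" as the csv documentation recommends.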
If your Lisp program isn't already too large, consider rewriting it in Hy. Hy is a Lisp dialect, so you can continue writing in Lisp. But also,
Hy maintains, over everything else, 100% compatibility in both directions with Python itself.
This means you can use Python libraries when writing Hy, and you can write a module in Hy to use in Python.
I don't know how your project is set up (and I don't know Pandas), but perhaps you can use this to communicate directly with Pandas?