I'm wondering why HDFStore gives warnings on string columns in pandas. I thought it may be NaNs in my real database, but trying it here gives me the warning for both columns even though one is not mixed and is simply strings.
Using .13.1 pandas and 3.1.1 tables
In [75]: d1 = {1:{'Mix': 'Hello', 'Good': 'Hello'}}
In [76]: d2 = {2:{'Good':'Goodbye'}}
In [77]: d2_df = pd.DataFrame.from_dict(d2,orient='index')
In [78]: d_df = pd.DataFrame.from_dict(d1,orient='index')
In [80]: d = pd.concat([d_df,d2_df])
In [81]: d
Out[81]:
Good Mix
1 Hello Hello
2 Goodbye NaN
[2 rows x 2 columns]
In [84]: d.to_hdf('test_.h5','d')
/home/cschwalbach/venv/lib/python2.7/site-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py:2446: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Good', 'Mix']]
warnings.warn(ws, PerformanceWarning)
When storing using the fixed format (which if you don't specify format, defaults to fixed), you are storing object dtypes (strings are stored as object dtypes in pandas). These are variable length formats which are not supported by PyTables in the Array types (CArray, EArray), see the warning here
You can however store in a format='table'; see here for the docs on storing fixed-length strings.
The NaN value is the issue here. If you manage to replace with an empty string, the warning will go away.
Related
Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?
In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like to be int.
Thoughts?
Things tried:
I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case):
https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
)
This capability has been added to pandas beginning with version 0.24.
At this point, it requires the use of extension dtype 'Int64' (capitalized), rather than the default dtype 'int64' (lowercase).
If performance is not the main issue, you can store strings instead.
df.col = df.col.dropna().apply(lambda x: str(int(x)) )
Then you can mix then with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.
You can also temporarily duplicate the columns: one as you have, with floats; the other one experimental, with ints or strings. Then inserts asserts in every reasonable place checking that the two are in sync. After enough testing you can let go of the floats.
In case you are trying to convert a float (1.143) vector to integer (1), and that vector has NAs, converting it to the new 'Int64' dtype will give you an error. In order to solve this you have to round the numbers and then do ".astype('Int64')"
s1 = pd.Series([1.434, 2.343, np.nan])
#without round() the next line returns an error
s1.astype('Int64')
#cannot safely cast non-equivalent float64 to int64
##with round() it works
s1.round().astype('Int64')
0 1
1 2
2 NaN
dtype: Int64
My use case is that I have a float series that I want to round to int, but when you do .round() still has decimals, you need to convert to int to remove decimals.
This is not a solution for all cases, but mine (genomic coordinates) I've resorted to using 0 as NaN
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows for the proper 'native' column type to be used, operations like subtraction, comparison etc work as expected
Pandas v0.24+
Functionality to support NaN in integer series will be available in v0.24 upwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
Pandas v0.23 and earlier
In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.
The docs do suggest : "One possibility is to use dtype=object arrays instead." For example:
s = pd.Series([1, 2, 3, np.nan])
print(s.astype(object))
0 1
1 2
2 3
3 NaN
dtype: object
For cosmetic reasons, e.g. output to a file, this may be preferable.
Pandas v0.23 and earlier: background
NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcasted to float:
In the absence of high performance NA support being built into NumPy
from the ground up, the primary casualty is the ability to represent
NAs in integer arrays.
This trade-off is made largely for memory and performance reasons, and
also so that the resulting Series continues to be “numeric”.
The docs also provide rules for upcasting due to NaN inclusion:
Typeclass Promotion dtype for storing NAs
floating no change
object no change
integer cast to float64
boolean cast to object
New for Pandas v1.00 +
You do not (and can not) use numpy.nan any more.
Now you have pandas.NA.
Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
IntegerArray is currently experimental. Its API or implementation may
change without warning.
Changed in version 1.0.0: Now uses pandas.NA as the missing value
rather than numpy.nan.
In Working with missing data, we saw that pandas primarily uses NaN to
represent missing data. Because NaN is a float, this forces an array
of integers with any missing values to become floating point. In some
cases, this may not matter much. But if your integer column is, say,
an identifier, casting to float can be problematic. Some integers
cannot even be represented as floating point numbers.
If there are blanks in the text data, columns that would normally be integers will be cast to floats as float64 dtype because int64 dtype cannot handle nulls. This can cause inconsistent schema if you are loading multiple files some with blanks (which will end up as float64 and others without which will end up as int64
This code will attempt to convert any number type columns to Int64 (as opposed to int64) since Int64 can handle nulls
import pandas as pd
import numpy as np
#show datatypes before transformation
mydf.dtypes
for c in mydf.select_dtypes(np.number).columns:
try:
mydf[c] = mydf[c].astype('Int64')
print('casted {} as Int64'.format(c))
except:
print('could not cast {} to Int64'.format(c))
#show datatypes after transformation
mydf.dtypes
This is now possible, since pandas v 0.24.0
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values.
I know that OP has asked for NumPy or Pandas only, but I think it is worth mentioning polars as an alternative that supports the requested feature.
In Polars any missing values in an integer column are simply null values and the column remains an integer column.
See Polars - User Guide > Coming from Pandas for more info.
On the documentation, it says
Numpy representation of NDFrame -- Source
What does "Numpy representation of NDFrame" mean? Will modifying this numpy representation affect my original dataframe? In other words, will .values return a copy or a view?
There are answers to questions in StackOverflow implicitly suggesting (relying on) that a view be returned. For example, in the accepted answer of Set values on the diagonal of pandas.DataFrame,np.fill_diagonal(df.values, 0) is used to set all values on the diagonal part of df to 0. That is a view is returned in this case. However, as shown in #coldspeed's answer, sometimes a copy is returned.
This feels very basic. It is just a bit weird to me because I do not have a more detailed source of .values.
Another experiment that returns a view in addition to the current experiments in #coldspeed's answer:
df = pd.DataFrame([["A", "B"],["C", "D"]])
df.values[0][0] = 0
We get
df
0 1
0 0 B
1 C D
Even though it is mixed type now, we can still modify original df by setting df.values
df.values[0][1] = 5
df
0 1
0 0 5
1 C D
TL;DR:
It's an implementation detail if a copy is returned (then changing the values would not change the DataFrame) or if values returns a view (then changing the values would change the DataFrame). Don't rely on any of these cases. It could change if the pandas developers think it would be beneficial (for example if they changed the internal structure of DataFrame).
I guess the documentation has changed since the question was asked, currently it reads:
pandas.DataFrame.values
Return a Numpy representation of the DataFrame.
Only the values in the DataFrame will be returned, the axes labels will be removed.
It doesn't mention NDFrame anymore - but simply mentions a "NumPy representation of the DataFrame". A NumPy representation could be either a view or a copy!
The documentation also contains a Note about mixed dtypes:
Notes
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64 and uint64 will result in a float64 dtype.
From these Notes it's obvious that accessing the values of a DataFrame that contains different dtypes can (almost) never return a view. Simply because it needs to put the values into an array of the "lowest-common-denominator" dtype and that involves a copy.
However it doesn't say anything about the view / copy behavior and that's by design. jreback mentioned on the pandas issue tracker 1 that this really is just an implementation detail:
this is an implementation detail. since you are getting a single dtyped numpy array, it is upcast to a compatible dtype. if you have mixed dtypes, then you almost always will have a copy (the exception is mixed float dtypes will not copy I think), but this is a numpy detail.
I agree this is not great, but it has been there from the beginning and will not change in current pandas. If exporting to numpy you need to take care.
Even the documentation of Series mentions nothing about a view:
pandas.Series.values
Return Series as ndarray or ndarray-like depending on the dtype
It even mentions that it might not even return a plain array depending on the dtype. And that certainly includes the possibility (even if it's only hypothetical) that it returns a copy. It does not guarantee that you get a view.
When does .values return a view and when does it return a copy?
The answer is simply: It's an implementation detail and as long as it's an implementation detail there won't be any guarantees. The reason it's an implementation detail is because the pandas developers want to make certain that they can change the internal storage if they want to.
However in some cases it's impossible to create a view. For example with a DataFrame containing columns of different dtypes.
There might be advantages if you analyze the behavior to date. But as long as that's an implementation detail you shouldn't really rely on it anyways.
However if you're interested: Pandas currently stores columns with the same dtype internally as multi-dimensional array. That has the advantage that you can operate on rows and columns very efficiently (at least as long as they have the same dtype). But if the DataFrame contains mixed types it will have several internal multi-dimensional arrays. One for each dtype. It's not possible to create a view that points into two distinct arrays (at least for NumPy) so when you have mixed dtypes you'll get a copy when you want the values.
A side-note, your example:
df = pd.DataFrame([["A", "B"],["C", "D"]])
df.values[0][0] = 0
Isn't mixed-dtype. It has a specific dtype: object. However object arrays can contain any Python object, so I can see why you would say/assume that it's of mixed types.
Personal note:
Personally I would have preferred that the values property only ever returns views or errors when it cannot return a view and an additional method (e.g. as_array) that only ever returns copies even if it would be possible to get a view. That would certainly make the behavior more predictable and avoid some surprises like having a property doing an expensive copy is certainly unexpected.
1 This question has been mentioned in the issue post, so maybe the docs changed because of this question.
Let's test it out.
First, with pd.Series objects.
In [750]: s = pd.Series([1, 2, 3])
In [751]: v = s.values
In [752]: v[0] = 10000
In [753]: s
Out[753]:
0 10000
1 2
2 3
dtype: int64
Now, for DataFrame objects. First, consider non-mixed dtypes -
In [780]: df = pd.DataFrame(1 - np.eye(3, dtype=int))
In [781]: df
Out[781]:
0 1 2
0 0 1 1
1 1 0 1
2 1 1 0
In [782]: v = df.values
In [783]: v[0] = 12345
In [784]: df
Out[784]:
0 1 2
0 12345 12345 12345
1 1 0 1
2 1 1 0
Modifications are made, so that means .values returned a view.
Now, consider a scenario with mixed dtypes -
In [755]: df = pd.DataFrame({'A' :[1, 2], 'B' : ['ccc', 'ddd']})
In [756]: df
Out[756]:
A B
0 1 ccc
1 2 ddd
In [757]: v = df.values
In [758]: v[0] = 123
In [759]: v[0, 1] = 'zzxxx'
In [760]: df
Out[760]:
A B
0 1 ccc
1 2 ddd
Here, .values returns a copy.
Observation
.values for Series returns a view regardless of dtypes of each row, whereas for DataFrames this depends. For homogenous dtypes, a view is returned. Otherwise, a copy.
Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?
In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like to be int.
Thoughts?
Things tried:
I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case):
https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
)
This capability has been added to pandas beginning with version 0.24.
At this point, it requires the use of extension dtype 'Int64' (capitalized), rather than the default dtype 'int64' (lowercase).
If performance is not the main issue, you can store strings instead.
df.col = df.col.dropna().apply(lambda x: str(int(x)) )
Then you can mix then with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.
You can also temporarily duplicate the columns: one as you have, with floats; the other one experimental, with ints or strings. Then inserts asserts in every reasonable place checking that the two are in sync. After enough testing you can let go of the floats.
In case you are trying to convert a float (1.143) vector to integer (1), and that vector has NAs, converting it to the new 'Int64' dtype will give you an error. In order to solve this you have to round the numbers and then do ".astype('Int64')"
s1 = pd.Series([1.434, 2.343, np.nan])
#without round() the next line returns an error
s1.astype('Int64')
#cannot safely cast non-equivalent float64 to int64
##with round() it works
s1.round().astype('Int64')
0 1
1 2
2 NaN
dtype: Int64
My use case is that I have a float series that I want to round to int, but when you do .round() still has decimals, you need to convert to int to remove decimals.
This is not a solution for all cases, but mine (genomic coordinates) I've resorted to using 0 as NaN
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows for the proper 'native' column type to be used, operations like subtraction, comparison etc work as expected
Pandas v0.24+
Functionality to support NaN in integer series will be available in v0.24 upwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
Pandas v0.23 and earlier
In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.
The docs do suggest : "One possibility is to use dtype=object arrays instead." For example:
s = pd.Series([1, 2, 3, np.nan])
print(s.astype(object))
0 1
1 2
2 3
3 NaN
dtype: object
For cosmetic reasons, e.g. output to a file, this may be preferable.
Pandas v0.23 and earlier: background
NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcasted to float:
In the absence of high performance NA support being built into NumPy
from the ground up, the primary casualty is the ability to represent
NAs in integer arrays.
This trade-off is made largely for memory and performance reasons, and
also so that the resulting Series continues to be “numeric”.
The docs also provide rules for upcasting due to NaN inclusion:
Typeclass Promotion dtype for storing NAs
floating no change
object no change
integer cast to float64
boolean cast to object
New for Pandas v1.00 +
You do not (and can not) use numpy.nan any more.
Now you have pandas.NA.
Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
IntegerArray is currently experimental. Its API or implementation may
change without warning.
Changed in version 1.0.0: Now uses pandas.NA as the missing value
rather than numpy.nan.
In Working with missing data, we saw that pandas primarily uses NaN to
represent missing data. Because NaN is a float, this forces an array
of integers with any missing values to become floating point. In some
cases, this may not matter much. But if your integer column is, say,
an identifier, casting to float can be problematic. Some integers
cannot even be represented as floating point numbers.
If there are blanks in the text data, columns that would normally be integers will be cast to floats as float64 dtype because int64 dtype cannot handle nulls. This can cause inconsistent schema if you are loading multiple files some with blanks (which will end up as float64 and others without which will end up as int64
This code will attempt to convert any number type columns to Int64 (as opposed to int64) since Int64 can handle nulls
import pandas as pd
import numpy as np
#show datatypes before transformation
mydf.dtypes
for c in mydf.select_dtypes(np.number).columns:
try:
mydf[c] = mydf[c].astype('Int64')
print('casted {} as Int64'.format(c))
except:
print('could not cast {} to Int64'.format(c))
#show datatypes after transformation
mydf.dtypes
This is now possible, since pandas v 0.24.0
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values.
I know that OP has asked for NumPy or Pandas only, but I think it is worth mentioning polars as an alternative that supports the requested feature.
In Polars any missing values in an integer column are simply null values and the column remains an integer column.
See Polars - User Guide > Coming from Pandas for more info.
When reading the pandas doc on pandas.Series, it states
A Series is like a fixed-size dict in that you can get and set values by index label:
http://pandas.pydata.org/pandas-docs/version/0.13.1/dsintro.html#series-is-dict-like
Question: What does it mean by pandas.Series being a fixed-size dict, and if its not like a fixed size dict, what can it do?
Like a fixed-sized dict, you can get and set (already existing) keys efficiently.
What you cannot do (efficiently) is add an element. Though you can do it:
In [11]: s
Out[11]:
a -1.344
b 0.845
c 1.076
d -0.109
e 12.000
dtype: float64
In [12]: s["f"] = 3.14 # works but slow (copies all the data)
In [13]: s
Out[13]:
a -1.344
b 0.845
c 1.076
d -0.109
e 12.000
f 3.140
dtype: float64
As this creates a new Series (i.e. via creating copy the old).
What Series (and general numpy and pandas objects allow efficient aggregations e.g. sums and groupby operations. Where using a python dict similar aggregations would be very slow.
A hand-wavy reason for this is that it's mostly due to the way that data can be stored in memory (contiguously and with known types) rather than with python objects whether they are pointers to a pointer for the type and pointer for the data (this misdirection means things are slower)...
Pandas also comes with a lot of efficiently written functions, and a neat API, so you don't have to rewrite all the functionality you need yourself...
I have a data file (csv) with Nilsimsa hash values. Some of them would have as long as 80 characters. I wish to read them in Python for data analysis tasks. Is there a way to import the data in python without information loss?
EDIT: I have tried the implementations proposed in the comments but that does not work for me.
Example data in csv file would be: 77241756221441762028881402092817125017724447303212139981668021711613168152184106
Start with a simple text file to read in, just one variable and one row.
%more foo.txt
x
77241756221441762028881402092817125017724447303212139981668021711613168152184106
In [268]: df=pd.read_csv('foo.txt')
Pandas will read it in as a string because it's too big to store as a core number type like int64 or float64. But the info is there, you didn't lose anything.
In [269]: df.x
Out[269]:
0 7724175622144176202888140209281712501772444730...
Name: x, dtype: object
In [270]: type(df.x[0])
Out[270]: str
And you can use plain python to treat it as a number. Recall the caveats from the links in the comments, this isn't going to be as fast as stuff in numpy and pandas where you have stored a whole column as int64. This is using the more flexible but slower object mode to handle things.
You can change a column to be stored as longs (long integers) like this. (But note that the dtype is still object because everything except the core numpy types (int32, int64, float64, etc.) are stored as objects.)
In [271]: df.x = df.x.map(int)
And then can more or less treat it like a number.
In [272]: df.x * 2
Out[272]:
0 1544835124428835240577628041856342500354488946...
Name: x, dtype: object
You'll have to do some formatting to see the whole number. Or go the numpy route which will default to showing the whole number.
In [273]: df.x.values * 2
Out[273]: array([ 154483512442883524057762804185634250035448894606424279963336043423226336304368212L], dtype=object)
As explained by #JohnE in his answer that we do not lose any information while reading big numbers using Pandas. They are stored as dtype=object, to make numerical computation on them we need to transform this data into numerical type.
For series:
We have to apply the map(func) to the series in the dataframe:
df['columnName'].map(int)
Whole dataframe:
If for some reason, our entire dataframe is composed of columns with dtype=object, we look at applymap(func)
from the documentation of Pandas:
DataFrame.applymap(func): Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame
so to transform all columns in dataframe:
df.applymap(int)