I have a CSV data file containing Nilsimsa hash values. Some of them are as long as 80 characters. I want to read them into Python for data analysis. Is there a way to import the data into Python without information loss?
EDIT: I have tried the implementations proposed in the comments, but they do not work for me.
Example data in csv file would be: 77241756221441762028881402092817125017724447303212139981668021711613168152184106
Start with a simple text file to read in, just one variable and one row.
%more foo.txt
x
77241756221441762028881402092817125017724447303212139981668021711613168152184106
In [268]: df=pd.read_csv('foo.txt')
Pandas will read it in as a string because it's too big to store as a core number type like int64 or float64. But the info is there, you didn't lose anything.
In [269]: df.x
Out[269]:
0 7724175622144176202888140209281712501772444730...
Name: x, dtype: object
In [270]: type(df.x[0])
Out[270]: str
And you can use plain Python to treat it as a number. Recall the caveats from the links in the comments: this isn't going to be as fast as operations in numpy and pandas where a whole column is stored as int64. This uses the more flexible but slower object mode.
You can change a column to be stored as Python longs (arbitrary-precision integers) like this. (Note that the dtype remains object, because everything except the core numpy types (int32, int64, float64, etc.) is stored as an object.)
In [271]: df.x = df.x.map(int)
And then can more or less treat it like a number.
In [272]: df.x * 2
Out[272]:
0 1544835124428835240577628041856342500354488946...
Name: x, dtype: object
You'll have to do some formatting to see the whole number, or go the numpy route, which defaults to showing the whole number.
In [273]: df.x.values * 2
Out[273]: array([ 154483512442883524057762804185634250035448894606424279963336043423226336304368212L], dtype=object)
As explained by @JohnE in his answer, we do not lose any information when reading big numbers with Pandas. They are stored as dtype=object; to do numerical computation on them, we need to transform the data into a numerical type.
For series:
We apply map(func) to the series in the dataframe:
df['columnName'].map(int)
Whole dataframe:
If, for some reason, our entire dataframe is composed of columns with dtype=object, we can use applymap(func). From the Pandas documentation:
DataFrame.applymap(func): Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame
So, to transform all columns in the dataframe:
df.applymap(int)
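Putting the two together, a minimal self-contained sketch (the file name hashes.csv and the column name hash_a are hypothetical):

import pandas as pd

# Hypothetical CSV of 80-digit hash values; read everything as strings to be safe.
df = pd.read_csv('hashes.csv', dtype=str)

# Convert one column to arbitrary-precision Python ints (dtype stays object):
df['hash_a'] = df['hash_a'].map(int)

# Or, if every column holds big numbers, convert the whole frame at once:
df = df.applymap(int)

print(df.dtypes)  # all columns remain dtype=object, but the values are Python ints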
Related
I have a pandas series that is the return value of a groupby operation. The data looks like this:
year  month
2010  8       -0.062911
      9        0.057910
      10       0.033854
      11      -0.000337
      12       0.044635
2011  1        0.012829
      2        0.018433
...
How do I get the third column (the values, NOT the year/month) of this series?
If you want the values of the series you can type:
my_series.values
This returns a NumPy array of the values.
Depending on the intended usage and how close to the underlying data you need the values to be:
Although .values is commonly seen, what it returns may have been converted from the actual underlying data into a different form.
From the official document of .values, you can see that there is a warning regarding the usage of .values:
Warning
We recommend using Series.array or Series.to_numpy(), depending on
whether you need a reference to the underlying data or a NumPy array.
1) If you need a reference to the underlying Series data (the actual data), you should use:
Series.array (Zero-copy view to the array backing the Series)
The official doc mentions that:
[Series.array] returns an ExtensionArray of the values stored within.
For [Pandas] extension types, this is the actual array. For NumPy native types,
this is a thin (no copy) wrapper around numpy.ndarray.
.array differs from .values, which may require converting the data to a
different form.
Note the highlighted text, which reveals that .array presents more original / real / actual data than .values.
Thus, if you want to get the original / real data values of the Series, Series.array is the choice.
From your sample data, calling Series.array gives:
<PandasArray>
[-0.062911, 0.05791, 0.033854, -0.000337, 0.044635, 0.012829, 0.018433]
Length: 7, dtype: float64
2) If you need to get a numpy array, as seen from the warning of the official document of .values shown above, Series.to_numpy() is the official recommended choice.
From your sample data, calling Series.to_numpy() gives:
array([-0.062911, 0.05791 , 0.033854, -0.000337, 0.044635, 0.012829,
0.018433])
The array([...]) notation indicates that it is a numpy array.
3) If you need to get a Python list (e.g. for compatibility purposes with other Python code), you can use Series.to_list()
From your sample data, calling Series.to_list() gives:
[-0.062911, 0.05791, 0.033854, -0.000337, 0.044635, 0.012829, 0.018433]
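To try these yourself, here is a small sketch that rebuilds a series with a similar (year, month) MultiIndex from the sample data and calls each accessor (the variable names are just for illustration):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2010, 8), (2010, 9), (2010, 10), (2010, 11), (2010, 12), (2011, 1), (2011, 2)],
    names=['year', 'month'])
my_series = pd.Series(
    [-0.062911, 0.057910, 0.033854, -0.000337, 0.044635, 0.012829, 0.018433],
    index=idx)

print(my_series.array)       # ExtensionArray view of the backing data
print(my_series.to_numpy())  # NumPy array
print(my_series.to_list())   # plain Python list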
I am working on a dataset with mixed sparse / dense columns. As the sparse columns greatly outnumber the dense ones, I wanted to see if I could store them efficiently using sparse data structures in pandas. However, while testing the functionality I found that dataframes with sparse columns appear to take up more memory. Consider the following example:
import numpy as np
import pandas as pd
a = np.zeros(10000000)
b = np.zeros(10000000)
a[3000:3100] = 2
b[300:310] = 1
df = pd.DataFrame({'a':pd.SparseArray(a), 'b':pd.SparseArray(b)})
print(df.info())
This prints memory usage: 228.9 MB.
Next:
df = pd.DataFrame({'a':a, 'b':b})
print(df.info())
This prints memory usage: 152.6 MB.
Does the non-sparse dataframe take up less space? Am I misunderstanding?
Installation info:
pandas 0.25.0
python 3.7.2
I've reproduced those exact numbers. From the docs:
Pandas provides data structures for efficiently storing sparse data.
These are not necessarily sparse in the typical “mostly 0”. Rather,
you can view these objects as being “compressed” where any data
matching a specific value (NaN / missing value, though any value can
be chosen, including 0) is omitted. The compressed values are not
actually stored in the array.
Which means you have to specify that it's the 0 elements that should be compressed. You can do that by using fill_value=0, like so:
df = pd.DataFrame({'a':pd.SparseArray(a, fill_value=0), 'b':pd.SparseArray(b, fill_value=0)})
The result of df.info() is 1.4 KB of memory usage in this case, quite a dramatic difference.
As to why it's initially bigger in your example than a normal "uncompressed" array, my guess is that it's the compression metadata (the indices of the stored values) added on top of all the values that are still stored (including the zeros in your case). Anyway, that's just a guess.
Additional reading in the docs will tell you that 0 is the default fill_value only for arrays of integer dtype, which yours weren't.
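A minimal sketch of the difference, using the newer pd.arrays.SparseArray spelling (pd.SparseArray worked in 0.25 but was later deprecated):

import numpy as np
import pandas as pd

a = np.zeros(10_000_000)
a[3000:3100] = 2

# Default fill_value for float data is NaN, so every zero is stored explicitly
# (values plus indices), which is why this ends up larger than a dense column.
df_default = pd.DataFrame({'a': pd.arrays.SparseArray(a)})

# With fill_value=0 the zeros are the compressed value and are omitted.
df_sparse = pd.DataFrame({'a': pd.arrays.SparseArray(a, fill_value=0)})

print(df_default.memory_usage(deep=True))
print(df_sparse.memory_usage(deep=True))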
Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?
In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like to be int.
Thoughts?
Things tried:
I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case):
https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
)
This capability has been added to pandas beginning with version 0.24.
At this point, it requires the use of extension dtype 'Int64' (capitalized), rather than the default dtype 'int64' (lowercase).
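A minimal example of the nullable dtype (requires pandas 0.24 or later):

import numpy as np
import pandas as pd

# Note the capital "I" in Int64: this is the nullable extension dtype.
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s.dtype)   # Int64
print(s)         # the missing entry prints as NaN in 0.24/0.25, <NA> from 1.0 on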
If performance is not the main issue, you can store strings instead.
df.col = df.col.dropna().apply(lambda x: str(int(x)))
Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.
You can also temporarily duplicate the columns: one as you have, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place checking that the two stay in sync. After enough testing you can let go of the floats.
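A small sketch of the sentinel-value idea (the column name and the sentinel -1 are placeholders for whatever fits your application):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, np.nan, 3.0]})

# Use a dedicated sentinel instead of NaN so the column can stay int:
df['col'] = df['col'].fillna(-1).astype(int)
print(df['col'].dtype)  # int64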
If you are trying to convert a float vector (e.g. 1.143) to integer (1), and that vector has NAs, converting it directly to the new 'Int64' dtype will give you an error. To solve this, you have to round the numbers first and then do .astype('Int64'):
s1 = pd.Series([1.434, 2.343, np.nan])
#without round() the next line returns an error
s1.astype('Int64')
#cannot safely cast non-equivalent float64 to int64
##with round() it works
s1.round().astype('Int64')
0 1
1 2
2 NaN
dtype: Int64
My use case is a float series that I want to round to int; .round() alone still leaves the decimal part, so you need the conversion to int to remove it.
This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 as NaN:
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows the proper 'native' column type to be used; operations like subtraction, comparison, etc. work as expected.
Pandas v0.24+
Functionality to support NaN in integer series will be available in v0.24 upwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
Pandas v0.23 and earlier
In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be required.
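For example, a tiny illustration of the upcast and of the vectorised arithmetic that still works on the float series:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4])
print(s.dtype)   # float64, upcast because of the NaN
print(s * 10)    # vectorised; NaN simply propagates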
The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:
s = pd.Series([1, 2, 3, np.nan])
print(s.astype(object))
0 1
1 2
2 3
3 NaN
dtype: object
For cosmetic reasons, e.g. output to a file, this may be preferable.
Pandas v0.23 and earlier: background
NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcast to float:
In the absence of high performance NA support being built into NumPy
from the ground up, the primary casualty is the ability to represent
NAs in integer arrays.
This trade-off is made largely for memory and performance reasons, and
also so that the resulting Series continues to be “numeric”.
The docs also provide rules for upcasting due to NaN inclusion:
Typeclass   Promotion dtype for storing NAs
floating    no change
object      no change
integer     cast to float64
boolean     cast to object
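These promotion rules are easy to verify directly; for instance:

import numpy as np
import pandas as pd

print(pd.Series([1, 2, np.nan]).dtype)          # float64 (integer upcast to float)
print(pd.Series([True, False, np.nan]).dtype)   # object  (boolean upcast to object)
print(pd.Series([1.5, np.nan]).dtype)           # float64 (no change)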
New for Pandas v1.0.0+
For nullable integer columns you no longer use numpy.nan as the missing value.
Now you have pandas.NA.
Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
IntegerArray is currently experimental. Its API or implementation may
change without warning.
Changed in version 1.0.0: Now uses pandas.NA as the missing value
rather than numpy.nan.
In Working with missing data, we saw that pandas primarily uses NaN to
represent missing data. Because NaN is a float, this forces an array
of integers with any missing values to become floating point. In some
cases, this may not matter much. But if your integer column is, say,
an identifier, casting to float can be problematic. Some integers
cannot even be represented as floating point numbers.
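A brief sketch of what that looks like in practice with pandas 1.0+:

import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype='Int64')
print(s)              # the missing entry prints as <NA>
print(s[2] is pd.NA)  # True: the missing value is pandas.NA, not numpy.nan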
If there are blanks in the text data, columns that would normally be integers will be cast to float64, because int64 cannot handle nulls. This can cause an inconsistent schema if you are loading multiple files, some with blanks (which will end up as float64) and others without (which will end up as int64).
This code will attempt to convert any numeric columns to Int64 (as opposed to int64), since Int64 can handle nulls:
import pandas as pd
import numpy as np
#show datatypes before transformation
mydf.dtypes
for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('cast {} as Int64'.format(c))
    except (TypeError, ValueError):
        print('could not cast {} to Int64'.format(c))
#show datatypes after transformation
mydf.dtypes
This is now possible, since pandas v0.24.0.
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values."
I know that the OP asked about NumPy or Pandas only, but I think it is worth mentioning Polars as an alternative that supports the requested feature.
In Polars any missing values in an integer column are simply null values and the column remains an integer column.
See Polars - User Guide > Coming from Pandas for more info.
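As an illustration (a minimal sketch, assuming a recent Polars version):

import polars as pl

df = pl.DataFrame({'a': [1, 2, None]})
print(df.dtypes)          # [Int64]: the column stays an integer column
print(df['a'].is_null())  # the missing entry is simply null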
I have a very large dataset, so I have compressed the data as best I can by converting dtypes to the smallest possible type and using Series.to_sparse on some columns (achieving around 80% compression). However, when I try to concatenate two of these datasets (both with the same compression and dtypes), I lose this compression. I.e.:
dfmerged = pd.concat((df1, df2), 0)
dfmerged.dtypes now shows all float64. I tried doing the following:
for c in df1.columns:
    if dfmerged[c].dtype != df1[c].dtype:
        dfmerged[c] = dfmerged[c].astype(df1[c].dtype)
However, this gives me an error: 'Can only support floating point data for now'.
The original columns can be uint8, uint16, etc., so I guess casting float64 to uint8 is not supported?
What is the best way I can achieve this? Will I need to create the DataFrame from scratch by creating individual Series with correct type information?
Pandas Version: 0.15.0-6-g403f38d
Python: 2.7