How to check if float pandas column contains only integer numbers? - python

I have a dataframe
df = pd.DataFrame(data=np.arange(10),columns=['v']).astype(float)
How to make sure that the numbers in v are whole numbers?
I am very concerned about rounding/truncation/floating point representation errors

Comparison with astype(int)
Tentatively convert your column to int and test with np.array_equal:
np.array_equal(df.v, df.v.astype(int))
True
float.is_integer
You can use this Python float method in conjunction with apply:
df.v.apply(float.is_integer).all()
True
Or, using Python's all with a generator expression, for space efficiency:
all(x.is_integer() for x in df.v)
True

Here's a simpler, and probably faster, approach:
(df[col] % 1 == 0).all()
To ignore nulls:
(df[col].fillna(-9999) % 1 == 0).all()
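As a quick usage sketch against the question's example (column name v comes from the question; the appended 10.5 is just an added counterexample to show the False case):
(df['v'] % 1 == 0).all()                                  # True for 0.0 .. 9.0
(pd.concat([df['v'], pd.Series([10.5])]) % 1 == 0).all()  # False once a fractional value appears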

If you want to check multiple float columns in your dataframe, you can do the following:
col_should_be_int = df.select_dtypes(include=['float']).applymap(float.is_integer).all()
float_to_int_cols = col_should_be_int[col_should_be_int].index
df.loc[:, float_to_int_cols] = df.loc[:, float_to_int_cols].astype(int)
Keep in mind that a float column containing only integers will not be selected if it has np.NaN values. To cast float columns with missing values to integer, you need to fill/remove the missing values, for example with median imputation:
float_cols = df.select_dtypes(include=['float'])
float_cols = float_cols.fillna(float_cols.median().round()) # median imputation
col_should_be_int = float_cols.applymap(float.is_integer).all()
float_to_int_cols = col_should_be_int[col_should_be_int].index
df.loc[:, float_to_int_cols] = float_cols[float_to_int_cols].astype(int)

For completeness, Pandas v1.0+ offers the convert_dtypes() utility, which (among other conversions) performs the requested operation for all DataFrame columns (or a Series) containing only integer numbers.
If you want to limit the conversion to a single column, you can do the following:
>>> df.dtypes # inspect previous dtypes
v float64
>>> df["v"] = df["v"].convert_dtype()
>>> df.dtypes # inspect converted dtypes
v Int64
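To convert every eligible column at once, the same method can be applied to the whole DataFrame (a minimal sketch):
df = df.convert_dtypes()   # float columns holding only whole numbers become nullable Int64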

On 27,331,625 rows this works well, taking about 1.3 s:
df['is_float'] = df[field_fact_qty] != df[field_fact_qty].astype(int)
This way took about 4.9 s:
df[field_fact_qty].apply(lambda x: x.is_integer())

Related

Python TypeError: cannot convert the series to <class 'int'> when using math.floor() for iloc index lookup value

I'm having an issue where I need a function to look up a corresponding value from a dataframe with many rows, which looks similar to:
Value
0 1.2332165631653
1 6.5651324661235
2 2.3651432415454
3 1.6566584651432
4 9.5168743514354
5 ...
My function looks like this:
import math
import pandas as pd

df1 = pd.read_csv('Data1.csv')
df2 = pd.read_csv('Data2.csv')

def dfFunction(A, B):
    Step = 10
    AB = A * B
    ABInt = math.floor(AB / Step)
    dfValue = df1.iloc[ABInt]
    return AB / dfValue
When I input A and B values as int or float, the function works, but when I try to apply the function to df2 (similar to df1 in terms of layout, just additional columns of floats), I'm returning this error.
I've tried df2.apply(dfFunction(df2.ColumnA, df2.ColumnB), axis = 1) and simply dfFunction(df2.ColumnA, df2.ColumnB).
I fundamentally understand the error, since it's highlighting the math.floor() line, but I can't look up a row index of df1 with a float. Is there another way I can write the function or look up the data value? I'd just use iloc() if the floats didn't have so many decimal places, but the values are means from another portion of the code.
Please let me know if further clarification is needed; I'm only a beginner with Python and Stack :)
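One possible fix, sketched here (not from the original thread): compute the indices element-wise instead of calling math.floor on a whole Series. The column names ColumnA, ColumnB and Value and Step = 10 are taken from the question; treat this as an illustration, not the asker's code.
import numpy as np

Step = 10
ab = df2['ColumnA'] * df2['ColumnB']
idx = np.floor(ab / Step).astype(int)              # element-wise floor, then integer positions
values = df1['Value'].to_numpy()[idx.to_numpy()]   # positional lookup into df1
result = ab / values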

Pandas ValueError: cannot convert float NaN to integer [duplicate]

I get ValueError: cannot convert float NaN to integer for following:
df = pandas.read_csv('zoom11.csv')
df[['x']] = df[['x']].astype(int)
The "x" is a column in the csv file, I cannot spot any float NaN in the file, and I don't understand the error or why I am getting it.
When I read the column as string, it has values like -1, 0, 1, ..., 2000, which all look like perfectly normal ints to me.
When I read the column as float, it loads fine and shows values like -1.0, 0.0, etc., and still there are no NaNs.
I tried error_bad_lines=False and the dtype parameter in read_csv, to no avail. Loading is still cancelled with the same exception.
The file is not small (10+ M rows), so I cannot inspect it manually. When I extract a small header part, there is no error, but it happens with the full file, so it is something in the file, but I cannot detect what.
Logically the csv should not have missing values, but even if there is some garbage I would be OK with skipping those rows, or at least identifying them, but I do not see a way to scan through the file and report conversion errors.
Update: Using the hints in comments/answers I got my data clean with this:
# x contained NaN
df = df[~df['x'].isnull()]
# Y contained some other garbage, so null check was not enough
df = df[df['y'].str.isnumeric()]
# final conversion now worked
df[['x']] = df[['x']].astype(int)
df[['y']] = df[['y']].astype(int)
For identifying NaN values use boolean indexing:
print(df[df['x'].isnull()])
Then, for removing all non-numeric values, use to_numeric with the parameter errors='coerce' to replace non-numeric values with NaNs:
df['x'] = pd.to_numeric(df['x'], errors='coerce')
And to remove all rows with NaNs in column x, use dropna:
df = df.dropna(subset=['x'])
Finally, convert the values to ints:
df['x'] = df['x'].astype(int)
ValueError: cannot convert float NaN to integer
From v0.24, you actually can. Pandas introduces Nullable Integer Data Types which allow integers to coexist with NaNs.
Given a series of whole float numbers with missing data,
s = pd.Series([1.0, 2.0, np.nan, 4.0])
s
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
s.dtype
# dtype('float64')
You can convert it to a nullable int type (choose from one of Int16, Int32, or Int64) with,
s2 = s.astype('Int32') # note the 'I' is uppercase
s2
0 1
1 2
2 NaN
3 4
dtype: Int32
s2.dtype
# Int32Dtype()
Your column needs to have whole numbers for the cast to happen. Anything else will raise a TypeError:
s = pd.Series([1.1, 2.0, np.nan, 4.0])
s.astype('Int32')
# TypeError: cannot safely cast non-equivalent float64 to int32
Also, even on the latest versions of pandas, if the column is of object type you have to convert it into float first, something like:
df['column_name'].astype(float).astype("Int32")
NB: You have to go through float first and then to the nullable Int32, for some reason.
Whether you need Int32 or Int64 depends on your data; be aware you may lose some precision if your numbers are too big for the format.
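For instance (an added illustration, not from the original answer), values above roughly 2.1 billion no longer fit in Int32, so the wider type is needed:
s = pd.Series([3_000_000_000.0])   # larger than the Int32 maximum (about 2.15e9)
s.astype("Int64")                  # works; casting this to "Int32" would not be a safe cast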
I know this has been answered, but wanted to provide an alternate solution for anyone in the future:
You can use .loc to subset the dataframe by only values that are notnull(), and then subset out the 'x' column only. Take that same vector, and apply(int) to it.
If column x is float:
df.loc[df['x'].notnull(), 'x'] = df.loc[df['x'].notnull(), 'x'].apply(int)
If you have null values, then doing mathematical operations will give you this error. To resolve it, filter them out first, e.g. df[~df['x'].isnull()][['x']].astype(int), if you want to leave the original dataset unchanged.

How to convert Decimal128 to decimal in pandas dataframe

I have a dataframe with many (but not all) Decimal128 columns (taken from a mongodb collection). I can't perform any math or comparisons on them (e.g. '<' not supported between instances of 'Decimal128' and 'float').
What is the quickest/easiest way to convert all of these to float or some simpler built-in type that I can work with?
There is the Decimal128 to_decimal() method, and pandas astype(), but how can I do it for all (the decimal128) columns in one step/helper method?
Edit, I've tried:
testdf = my_df.apply(lambda x: x.astype(str).astype(float) if isinstance(x, Decimal128) else x)
testdf[testdf["MyCol"] > 80].head()
but I get:
TypeError: '>' not supported between instances of 'Decimal128' and 'int'
Converting a single column using .astype(str).astype(float) works.
Casting the full DataFrame:
df = df.astype(str).astype(float)
For a single column, where IDs is the name of the column:
df["IDs"] = df.IDs.astype(str).astype(float)
Test implementation
from pprint import pprint
import bson
import pandas as pd

df = pd.DataFrame()
y = []
for i in range(1, 6):
    i = i * 2 / 3.5
    y.append(bson.decimal128.Decimal128(str(i)))
pprint(y)
df["D128"] = y
df["D128"] = df.D128.astype(str).astype(float)
print("\n", df)
Output:
[Decimal128('0.5714285714285714'),
Decimal128('1.1428571428571428'),
Decimal128('1.7142857142857142'),
Decimal128('2.2857142857142856'),
Decimal128('2.857142857142857')]
D128
0 0.571429
1 1.142857
2 1.714286
3 2.285714
4 2.857143
Just use:
df = df.astype(float)
You can also use apply or applymap (applying element-wise operations), although these are inefficient compared to the previous method.
df = df.applymap(float)
I can't reproduce a Decimal128 number in my system. Can you please check if the next line works for you?
df = df.apply(lambda col: col.astype(str).astype(float) if col.apply(lambda v: isinstance(v, bson.decimal128.Decimal128)).all() else col)
It will check if a column is of type Decimal128 and then convert it to float.

Optimize a for loop applied on all elements of a df

EDIT: here are the first lines:
df = pd.read_csv(os.path.join(path, file), dtype = str,delimiter = ';',error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x:dt.datetime.strptime(x,'%d/%m/%Y'))
df = df.fillna(0)
I have a csv file that has 1500 columns and 35000 rows. It contains values, but in the form 1.700,35 for example, whereas in Python I need 1700.35. When I read the csv, all values come in as str.
To solve this I wrote this function:
def format_nombre(df):
    length, width = df.shape
    for i in range(length):
        for j in range(width):
            element = df.iloc[i, j]
            if type(element) != type(df.iloc[1, 0]):  # skip the date column
                a = element.replace(".", "")
                b = float(a.replace(",", "."))
                df.iloc[i, j] = b
Basically, I select each intersection of all rows and columns, I replace the problematic characters, I turn the element into a float and I replace it in the dataframe. The if ensures that the function doesn't consider dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my csv would take a little less than 60h.
I realize this is far from being optimized, but I struggled and failed to find a way that suited my needs and (scarce) skills.
How about:
import numpy as np

def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    # regex=False: treat '.' and ',' as literal characters, not regex
    return column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)

df = df.apply(to_numeric)
That's assuming all strings are valid. Otherwise use pd.to_numeric instead of astype(float).
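A sketch of that pd.to_numeric variant (the helper name to_numeric_lenient is just illustrative; errors='coerce' turns any string that cannot be parsed into NaN):
import numpy as np
import pandas as pd

def to_numeric_lenient(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    cleaned = column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False)
    return pd.to_numeric(cleaned, errors='coerce')   # unparseable strings become NaN

df = df.apply(to_numeric_lenient)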

Numbers with hyphens or strings of numbers with hyphens

I need to make a pandas DataFrame that has a column filled with hyphenated numbers. The only way I could think of to do this was to use strings. This all worked fine, until I needed to sort them to get them back into order after a regrouping. The problem is that strings sort like this:
['100-200','1000-1100','1100-1200','200-300']
This is clearly not how I want it sorted. I want it sorted numerically. How would I get this to work? I am willing to change anything. Keeping the hyphenated string as an integer or float would be best, but I am unsure how to do that.
You could try something like this:
>>> t = ['100-200','1000-1100','1100-1200','200-300']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100-200', '200-300', '1000-1100', '1100-1200']
This would allow you to sort on integers, and if a hyphen exists, it will sort first by the first integer in the key list and then by the second. If no hyphen exists, you will sort just on the integer equivalent of the string:
>>> t = ['100-200','1000-1100','1100-1200','200-300', '100']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100', '100-200', '200-300', '1000-1100', '1100-1200']
If you have any float equivalents in any strings, simply change int to float like this:
>>> t = ['100-200.3','1000.5-1100','1100.76-1200','200-300.75', '100.35']
>>> t.sort(key=lambda x: [float(y) for y in x.split('-')])
>>> t
['100-200.3', '100.35', '200-300.75', '1000.5-1100', '1100.76-1200']
You could use sorted to construct a new ordering for the index, and then perform the sort (reordering) using df.take:
import pandas as pd
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']})
order = sorted(range(len(df)),
               key=lambda idx: [int(y) for y in df.loc[idx, 'foo'].split('-')])
df = df.take(order)
print(df)
yields
foo
0 100-200
3 200-300
1 1000-1100
2 1100-1200
This is similar to #275365's solution, but note that the sorting is done on range(len(df)), not on the strings. The strings are only used in the key parameter to determine the order in which range(len(df)) should be rearranged.
Using sorted works fine if the DataFrame is small. You can get better performance when the DataFrame is of moderate size (for example, a few hundred rows on my machine), by using numpy.argsort instead:
import pandas as pd
import numpy as np
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']*100})
arr = df['foo'].map(lambda item: [int(x) for x in item.split('-')]).values
order = np.argsort(arr)
df = df.take(order)
Alternatively, you could split your string column into two integer-valued columns, and then sort with df.sort_values:
import pandas as pd

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300']})
df[['start', 'end']] = df['foo'].apply(lambda val: pd.Series([int(x) for x in val.split('-')]))
df.sort_values(['start', 'end'], inplace=True)
print(df)
yields
foo start end
0 100-200 100 200
3 200-300 200 300
1 1000-1100 1000 1100
2 1100-1200 1100 1200
