Convert float to int and leave nulls - python

I have the following dataframe, I want to convert values in column 'b' to integer
   a       b   c
0  1     NaN   3
1  5  7200.0  20
2  5   580.0  20
The following code throws an exception:
df['b'] = df['b'].astype(int)
"ValueError: Cannot convert NA to integer"
How do I convert only the floats to int and leave the nulls as is?

When your series contains floats and NaNs and you want to convert to integers, you will get an error when you try to cast to a numpy integer, because NaN cannot be represented as one.
DON'T DO:
df['b'] = df['b'].astype(int)
From pandas >= 0.24 there is a built-in pandas integer dtype that does allow integer NaNs. Notice the capital in 'Int64': this is the pandas nullable integer, not the numpy integer.
SO, DO THIS:
df['b'] = df['b'].astype('Int64')
More info on pandas integer na values:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
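For example, a minimal demonstration of the nullable dtype (assuming pandas >= 0.24; the missing value prints as <NA> on pandas >= 1.0):
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 5, 5], 'b': [np.nan, 7200.0, 580.0], 'c': [3, 20, 20]})
df['b'] = df['b'].astype('Int64')
print(df['b'])
# 0    <NA>
# 1    7200
# 2     580
# Name: b, dtype: Int64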

np.NaN is a floating-point-only kind of thing, so it has to be removed in order to create an integer pd.Series. Jeon's suggestion works great if 0 isn't a valid value in df['b']. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 5, 5], 'b': [np.NaN, 7200.0, 580.0], 'c': [3, 20, 20]})
print(df, '\n\n')
df['b'] = np.nan_to_num(df['b']).astype(int)
print(df)
If there are valid 0's, then you could first replace them all with some unique sentinel value (e.g., -999999999), then apply the conversion above, and then replace these sentinel values with 0's.
Either way, you have to remember that you now have 0's where there were once NaNs, so you will need to be careful to filter these out when doing numerical analyses (e.g., mean).
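A minimal sketch of the sentinel approach described above, assuming -999999999 never occurs as a real value in df['b']:
import pandas as pd
import numpy as np

df = pd.DataFrame({'b': [np.nan, 0.0, 7200.0, 580.0]})

sentinel = -999999999
df['b'] = df['b'].replace(0, sentinel)        # protect the valid 0's
df['b'] = np.nan_to_num(df['b']).astype(int)  # now every 0 marks a former NaN
# ... do whatever bookkeeping you need on the former NaNs here ...
df['b'] = df['b'].replace(sentinel, 0)        # restore the valid 0's
print(df)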

Similar answer to TSeymour's, but now using pandas' fillna:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 5, 5], 'b': [np.NaN, 7200.0, 580.0], 'c': [3, 20, 20]})
print(df, '\n\n')
df['b'] = df['b'].fillna(0).astype(int)
print(df)
Which gives:
   a       b   c
0  1     NaN   3
1  5  7200.0  20
2  5   580.0  20

   a     b   c
0  1     0   3
1  5  7200  20
2  5   580  20

Select the values that are not NaN using the pandas notnull function, then cast those values to int using the astype function:
df[df[0].notnull()] = df[df[0].notnull()].astype(int)
I used the index number to make this solution more general. Of course, you can always specify the column by name instead, like this: df['name_of_column']
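A sketch of the same idea using the column-name form; note that as long as any NaN remains in the column, its dtype itself will stay float:
mask = df['b'].notnull()
df.loc[mask, 'b'] = df.loc[mask, 'b'].astype(int)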

Related

Why is DataFrame int column value sometimes returned as float?

I add a calculated column c, containing only integers, to a DataFrame.
df = pd.DataFrame(data=list(zip(*[np.random.randint(1,3,5), np.random.random(5)])), columns=['a', 'b'])
df['c'] = np.ceil(df.a/df.b).astype(int)
df.dtypes
The DataFrame reports that the column type of c is indeed int:
a      int64
b    float64
c      int32
dtype: object
If I access a value from c like this then I get an int:
df.c.values[0] # Returns "3"
type(df.c.values[0]) # Returns "numpy.int32"
But if I access the same value using loc I get a float:
df.iloc[0].c # Returns "3.0"
type(df.iloc[0].c) # Returns "numpy.float64"
Why is this?
I would like to be able to access the value using indexes without having to cast it (again) to an int.
What's happening is that df.iloc[0].c first evaluates df.iloc[0], which includes all three columns. df.iloc[0] is then cast to a type that can represent all three columns, which is numpy.float64.
Interestingly enough, I can avoid this by adding a string column.
df = pd.DataFrame(data=list(zip(*[np.random.randint(1,3,5), np.random.random(5)])), columns=['a', 'b'])
df['c'] = np.ceil(df.a/df.b).astype(int)
df['d'] = ['hi', 'bye', 'hello', 'cya', 'sup']
print(df.iloc[0].c)
print(type(df.iloc[0].c))
print(df.dtypes)
To your end question, you can avoid this whole mess by using df.loc[0, 'c'] instead of iloc.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=list(zip(*[np.random.randint(1,3,5), np.random.random(5)])), columns=['a', 'b'])
df['c'] = np.ceil(df.a/df.b).astype(int)
print(df.loc[0, 'c'])
print(df.loc[0, 'c'].dtype)
15
int32
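A side-by-side check of the three access paths, using the df built above (the exact integer width is platform-dependent, e.g. int64 on Linux):
print(type(df.c.values[0]))   # <class 'numpy.int32'>
print(type(df.iloc[0].c))     # <class 'numpy.float64'>
print(type(df.loc[0, 'c']))   # <class 'numpy.int32'>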
When I execute your code, the result is this dataframe:
df
   a         b   c
0  1  0.315388   4
1  1  0.111275   9
2  1  0.251253   4
3  2  0.043162  47
4  1  0.047985  21
When I type df['c'].values in the interpreter I get:
array([ 4,  9,  4, 47, 21])
That is to say, all the c-column values.
When I type df.iloc[0] in the interpreter I get the dataframe's first-row values:
a    1.000000
b    0.315388
c    4.000000
Name: 0, dtype: float64
What we can notice
All c-column values are integers, while the first-row values are not all of the same type: we have two integers and a float value.
This fact is very important.
Indeed, by definition an array is a collection of elements of the same type.
So, to represent a float in a collection of values that are otherwise integers, every element must be converted to float, because a float can hold an integer value but the reverse is not true.
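You can check this promotion rule with numpy directly (a minimal sketch, not from the original answer):
import numpy as np

print(np.array([1, 2, 3]).dtype)    # int64 on most platforms: all integers
print(np.array([1.0, 2, 3]).dtype)  # float64: one float promotes them all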
Conclusion
Type of a collection of integers is int...
Type of a collection of floats is float...
Type of a collection of integers containing at least one float is converted to float...
Quote
"An array is a concept that stores different items of the same type together as one and makes calculating the stance of each element easier by adding an offset to the base number." (codeinstitute.net)
To check this and go further
# case A : value 2 is an integer
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},]
df = pd.DataFrame(mydict)
df.iloc[0]
a    1
b    2
c    3
d    4
Name: 0, dtype: int64
# case B : value '2' is a string
mydict = [{'a': 1, 'b': '2', 'c': 3, 'd': 4},]
df = pd.DataFrame(mydict)
df.iloc[0]
a    1
b    2
c    3
d    4
Name: 0, dtype: object
In case A all elements are integers, so the dtype remains int64.
In case B the collection contains a string, which can't be converted to a float, so all elements are converted to the object dtype.

Can a pd.Series be assigned to a column in an out-of-order pd.DataFrame without mapping to index (i.e. without reordering the values)?

I discovered some unexpected behavior when creating or assigning a new column in Pandas. When I filter or sort the pd.DataFrame (thus mixing up the indexes) and then create a new column from a pd.Series, Pandas reorders the series to map to the DataFrame index. For example:
df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']},
                  index=[2, 0, 1])
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])
index  a      b
2      alpha  gamma
0      beta   alpha
1      gamma  beta
I think this is happening because the pd.Series has an index [0, 1, 2] which is getting mapped to the pd.DataFrame index. But I wanted to create the new column with values in the correct "order" ignoring index:
index  a      b
2      alpha  alpha
0      beta   beta
1      gamma  gamma
Here's a convoluted example showing how unexpected this behavior is:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
    .assign(num_times_two=lambda x: pd.Series(list(x['num']*2)))
index  num  num_times_two
2      1    6
0      2    2
1      3    4
If I use any function that strips the index off the original pd.Series and then returns a new pd.Series, the values get out of order.
Is this a bug in Pandas or intentional behavior? Is there any way to force Pandas to ignore the index when I create a new column from a pd.Series?
If you don't want unwanted dtype conversions between pandas and numpy (for example, with datetimes), you can set the index of the Series to be the same as the index of the DataFrame before assigning it to a column:
either with .set_axis()
The original Series will have its index preserved - by default this operation is not in place:
ser = pd.Series(['alpha', 'beta', 'gamma'])
df['b'] = ser.set_axis(df.index)
or you can change the index of the original Series:
ser.index = df.index # ser.set_axis(df.index, inplace=True) # alternative
df['b'] = ser
OR:
Use a numpy array instead of a Series. It doesn't have indices, so there is nothing to be aligned by.
Any Series can be converted to a numpy array with .to_numpy():
df['b'] = ser.to_numpy()
Any other array-like also can be used, for example, a list.
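Putting the pieces together, a minimal runnable sketch using the DataFrame from the question:
import pandas as pd

df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']}, index=[2, 0, 1])
ser = pd.Series(['alpha', 'beta', 'gamma'])

df['b'] = ser.set_axis(df.index)  # or equivalently: df['b'] = ser.to_numpy()
print(df)
#        a      b
# 2  alpha  alpha
# 0   beta   beta
# 1  gamma  gamma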
I don't know if it is on purpose, but new column assignment is aligned on the index. Do you need to maintain the old indexes?
If the answer is no, you can simply reset the index before adding the new column:
df = df.reset_index(drop=True)
In your example, I don't see any reason to make it a new Series at all; anything that strips the index, like converting to a list, avoids the alignment:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
    .assign(num_times_two=lambda x: list(x['num']*2))
print(df)
Output:
   num  num_times_two
2    1              2
0    2              4
1    3              6

Why does pandas .sum(axis=1) return 0 when one row has numpy datetime64 values?

I have a pandas dataframe with floating-point numbers in some rows and numpy datetime64 values in others.
df2 = pd.DataFrame(
    [[np.datetime64('2021-01-01'), np.datetime64('2021-01-01')], [2, 3]],
    columns=['A', 'B'])
when I sum each row (axis=1), I get 0 for all rows:
df2.sum(axis=1)
0    0.0
1    0.0
dtype: float64
Why does this happen? I have tried the numeric_only=True option with the same result.
I would expect each row to be handled individually, and get 5 as result for the second row, as happens if I replace the datetime64 objects with strings:
df = pd.DataFrame(
    {'A': ['2021-01-01', 2],
     'B': ['2021-01-01', 3]})
print(df.sum(axis=1))
0    2021-01-012021-01-01
1                       5
dtype: object
Thanks!
You can get something like what you're after if you make your rows columns:
df2.transpose().sum(axis=0)
Your rows won't get coerced to a numerical dtype, i.e.
df2.loc[1]
results in
A    2
B    3
Name: 1, dtype: object
Rows and columns are not treated equally in pandas, rightly or wrongly.
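One way to see why the zeros appear: both columns hold mixed values, so they end up with object dtype, and row-wise summation finds no numeric columns to add (a minimal check, reusing df2 from the question):
print(df2.dtypes)
# A    object
# B    object
# dtype: object

# coercing just the numeric row gives the expected result
print(pd.to_numeric(df2.loc[1]).sum())  # 5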

Pandas astype throwing invalid literal for int() with base 10 error

I have a pandas dataframe df whose column names and dtypes are specified in another file (read as data_dict). So to get the data properly I am using the code below:
col_list = data_dict['name'].tolist()
dtype_list = data_dict['type'].tolist()
dtype_dict = {col_list[i]: dtype_list[i] for i in range(len(col_list))}
df.columns = col_list
df = df.fillna(0)
df = df.astype(dtype_dict)
But it is throwing this error:
invalid literal for int() with base 10: '2.230'
Most of the answers I searched online recommended using pd.to_numeric() or something like df[col1].astype(float).astype(int). The issue here is that df contains 50+ columns out of which around 30 should be converted to integer type. Therefore I don't want to convert the data types one column at a time.
So how can I easily fix this error?
Try via boolean masking:
mask = df.apply(lambda x: x.str.isalpha(), axis=1).fillna(False)
Finally:
df[~mask] = df[~mask].astype(float).astype(int)
Or
cols = df[~mask].dropna(axis=1).columns
df[cols] = df[cols].astype(float).astype(int)
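Alternatively, a hedged sketch of the two-step cast mentioned in the question, applied to every intended integer column at once (assuming the values in dtype_dict are dtype strings like 'int64'):
# cast through float first so strings like '2.230' survive the trip
# to int (truncating the fractional part)
int_cols = [c for c, t in dtype_dict.items() if 'int' in str(t).lower()]
df[int_cols] = df[int_cols].astype(float).astype(int)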
df[col_list] = df[col_list].apply(pd.to_numeric)
(pd.to_numeric itself only accepts one-dimensional input, hence the apply.)
You can set the data type of the whole dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': map(str, np.random.rand(10)), 'B': np.random.rand(10)})
df.apply(pd.to_numeric)
          A         B
0  0.493771  0.389934
1  0.991265  0.387819
2  0.398947  0.128031
3  0.869156  0.007609
4  0.129748  0.532235
5  0.993632  0.882933
6  0.244311  0.213737
7  0.773192  0.229257
8  0.392530  0.339418
9  0.732609  0.685258
and for just some columns like this:
df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric)
In case you want to have a way to convert types to float for whole dataframe where you do not know which column has numbers, you can use this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': map(str, np.random.rand(10)),
                   'B': np.random.rand(10),
                   'C': [x for x in 'ABCDEFGHIJ']})

def to_num(df):
    # try to convert each column; leave columns that fail as-is
    for col in df:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            continue
    return df

df.pipe(to_num)
          A         B  C
0  0.762027  0.095877  A
1  0.647066  0.931435  B
2  0.016939  0.806675  C
3  0.260255  0.346676  D
4  0.561694  0.551960  E
5  0.561363  0.675580  F
6  0.312432  0.498806  G
7  0.353007  0.203697  H
8  0.418549  0.128924  I
9  0.728632  0.600307  J

Converting Pandas Dataframe types

I have a pandas DataFrame created through a MySQL query, which returns the data as object dtype.
The data is mostly numeric, with some 'na' values.
How can I cast the types of the DataFrame so the numeric values are appropriately typed (floats) and the 'na' values are represented as numpy NaN values?
Use the replace method on dataframes:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'k1': ['na'] * 3 + ['two'] * 4,
    'k2': [1, 'na', 2, 'na', 3, 4, 4]})
print(df)
df = df.replace('na', np.nan)
print(df)
I think it's helpful to point out that df.replace('na', np.nan) by itself won't work. You must assign it back to the existing dataframe.
df = df.convert_objects(convert_numeric=True) will work in most cases.
I should note that this copies the data. It would be preferable to get it to a numeric type on the initial read. If you post your code and a small example, someone might be able to help you with that.
This is what Tom suggested, and it is correct:
In [134]: s = pd.Series(['1', '2.', 'na'])

In [135]: s.convert_objects(convert_numeric=True)
Out[135]:
0      1
1      2
2    NaN
dtype: float64
As Andy points out, this doesn't work directly (I think that's a bug), so convert all elements to strings first, then convert:
In [136]: s2 = pd.Series(['1', '2.', 'na', 5])

In [138]: s2.astype(str).convert_objects(convert_numeric=True)
Out[138]:
0      1
1      2
2    NaN
3      5
dtype: float64
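Note that convert_objects has since been deprecated and removed; on modern pandas the equivalent is pd.to_numeric with errors='coerce':
import pandas as pd

s = pd.Series(['1', '2.', 'na'])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.0
# 1    2.0
# 2    NaN
# dtype: float64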
