How to convert string columns to numeric values without getting NaN values

How to convert string columns to numeric values without getting NaN values - python

enter image description here
I have columns of strings and I have to convert it into values. I used this code
and unfortunately the fillna method don't work at this example.
How can I fix the problem?
Here's the head()
Head()
data['country_txt'] = data['country_txt'].astype('float64')
data['city'] = data['city'].astype('float64')
I expected a normal result but the actual output is all fulled with NaN values:
country_txt 0 non-null float64
city 0 non-null float64

Apparently, you need to map your strings to integer representations.
There are many different ways to do that.
1 pd.factorize
df['country_as_int'] = pd.factorize(df['country_txt'])[0]
2 LabelEncoder
from sklearn.preprocessing import LabelEncoder
f = LabelEncoder()
df['country_as_int'] = f.fit_transform(df['country_txt'])
3 np.unique
df['country_as_int'] = np.unique(df['country_txt'], return_inverse=True)[-1]

Related

Pandas ValueError: cannot convert float NaN to integer [duplicate]

I get ValueError: cannot convert float NaN to integer for following:
df = pandas.read_csv('zoom11.csv')
df[['x']] = df[['x']].astype(int)
The "x" is a column in the csv file, I cannot spot any float NaN in the file, and I don't understand the error or why I am getting it.
When I read the column as String, then it has values like -1,0,1,...2000, all look very nice int numbers to me.
When I read the column as float, then this can be loaded. Then it shows values as -1.0,0.0 etc, still there are no any NaN-s
I tried with error_bad_lines = False and dtype parameter in read_csv to no avail. It just cancels loading with same exception.
The file is not small (10+ M rows), so cannot inspect it manually, when I extract a small header part, then there is no error, but it happens with full file. So it is something in the file, but cannot detect what.
Logically the csv should not have missing values, but even if there is some garbage then I would be ok to skip the rows. Or at least identify them, but I do not see way to scan through file and report conversion errors.
Update: Using the hints in comments/answers I got my data clean with this:
# x contained NaN
df = df[~df['x'].isnull()]
# Y contained some other garbage, so null check was not enough
df = df[df['y'].str.isnumeric()]
# final conversion now worked
df[['x']] = df[['x']].astype(int)
df[['y']] = df[['y']].astype(int)

For identifying NaN values use boolean indexing:
print(df[df['x'].isnull()])
Then for removing all non-numeric values use to_numeric with parameter errors='coerce' - to replace non-numeric values to NaNs:
df['x'] = pd.to_numeric(df['x'], errors='coerce')
And for remove all rows with NaNs in column x use dropna:
df = df.dropna(subset=['x'])
Last convert values to ints:
df['x'] = df['x'].astype(int)

ValueError: cannot convert float NaN to integer
From v0.24, you actually can. Pandas introduces Nullable Integer Data Types which allows integers to coexist with NaNs.
Given a series of whole float numbers with missing data,
s = pd.Series([1.0, 2.0, np.nan, 4.0])
s
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
s.dtype
# dtype('float64')
You can convert it to a nullable int type (choose from one of Int16, Int32, or Int64) with,
s2 = s.astype('Int32') # note the 'I' is uppercase
s2
0 1
1 2
2 NaN
3 4
dtype: Int32
s2.dtype
# Int32Dtype()
Your column needs to have whole numbers for the cast to happen. Anything else will raise a TypeError:
s = pd.Series([1.1, 2.0, np.nan, 4.0])
s.astype('Int32')
# TypeError: cannot safely cast non-equivalent float64 to int32

Also, even at the lastest versions of pandas if the column is object type you would have to convert into float first, something like:
df['column_name'].astype(np.float).astype("Int32")
NB: You have to go through numpy float first and then to nullable Int32, for some reason.
The size of the int if it's 32 or 64 depends on your variable, be aware you may loose some precision if your numbers are to big for the format.

I know this has been answered but wanted to provide alternate solution for anyone in the future:
You can use .loc to subset the dataframe by only values that are notnull(), and then subset out the 'x' column only. Take that same vector, and apply(int) to it.
If column x is float:
df.loc[df['x'].notnull(), 'x'] = df.loc[df['x'].notnull(), 'x'].apply(int)

if you have null value then in doing mathematical operation you will get this error to resolve it use df[~df['x'].isnull()]df[['x']].astype(int) if you want your dataset to be unchangeable.

Pandas: ValueError: cannot convert float NaN to integer

I get ValueError: cannot convert float NaN to integer for following:
df = pandas.read_csv('zoom11.csv')
df[['x']] = df[['x']].astype(int)
The "x" is a column in the csv file, I cannot spot any float NaN in the file, and I don't understand the error or why I am getting it.
When I read the column as String, then it has values like -1,0,1,...2000, all look very nice int numbers to me.
When I read the column as float, then this can be loaded. Then it shows values as -1.0,0.0 etc, still there are no any NaN-s
I tried with error_bad_lines = False and dtype parameter in read_csv to no avail. It just cancels loading with same exception.
The file is not small (10+ M rows), so cannot inspect it manually, when I extract a small header part, then there is no error, but it happens with full file. So it is something in the file, but cannot detect what.
Logically the csv should not have missing values, but even if there is some garbage then I would be ok to skip the rows. Or at least identify them, but I do not see way to scan through file and report conversion errors.
Update: Using the hints in comments/answers I got my data clean with this:
# x contained NaN
df = df[~df['x'].isnull()]
# Y contained some other garbage, so null check was not enough
df = df[df['y'].str.isnumeric()]
# final conversion now worked
df[['x']] = df[['x']].astype(int)
df[['y']] = df[['y']].astype(int)

For identifying NaN values use boolean indexing:
print(df[df['x'].isnull()])
Then for removing all non-numeric values use to_numeric with parameter errors='coerce' - to replace non-numeric values to NaNs:
df['x'] = pd.to_numeric(df['x'], errors='coerce')
And for remove all rows with NaNs in column x use dropna:
df = df.dropna(subset=['x'])
Last convert values to ints:
df['x'] = df['x'].astype(int)

ValueError: cannot convert float NaN to integer
From v0.24, you actually can. Pandas introduces Nullable Integer Data Types which allows integers to coexist with NaNs.
Given a series of whole float numbers with missing data,
s = pd.Series([1.0, 2.0, np.nan, 4.0])
s
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
s.dtype
# dtype('float64')
You can convert it to a nullable int type (choose from one of Int16, Int32, or Int64) with,
s2 = s.astype('Int32') # note the 'I' is uppercase
s2
0 1
1 2
2 NaN
3 4
dtype: Int32
s2.dtype
# Int32Dtype()
Your column needs to have whole numbers for the cast to happen. Anything else will raise a TypeError:
s = pd.Series([1.1, 2.0, np.nan, 4.0])
s.astype('Int32')
# TypeError: cannot safely cast non-equivalent float64 to int32

Also, even at the lastest versions of pandas if the column is object type you would have to convert into float first, something like:
df['column_name'].astype(np.float).astype("Int32")
NB: You have to go through numpy float first and then to nullable Int32, for some reason.
The size of the int if it's 32 or 64 depends on your variable, be aware you may loose some precision if your numbers are to big for the format.

I know this has been answered but wanted to provide alternate solution for anyone in the future:
You can use .loc to subset the dataframe by only values that are notnull(), and then subset out the 'x' column only. Take that same vector, and apply(int) to it.
If column x is float:
df.loc[df['x'].notnull(), 'x'] = df.loc[df['x'].notnull(), 'x'].apply(int)

if you have null value then in doing mathematical operation you will get this error to resolve it use df[~df['x'].isnull()]df[['x']].astype(int) if you want your dataset to be unchangeable.

pandas cut(): how to convert nans? Or to convert the output to non-categorical?

I am using pandas.cut() on dataframe columns with nans. I need to run groupby on the output of pandas.cut(), so I need to convert nans to something else (in the output, not in the input data), otherwise groupby will stupidly and infuriatingly ignore them.
I understand that cut() now outputs categorical data, but I cannot find a way to add a category to the output. I have tried add_categories(), which runs with no warning nor errors, but doesn't work because the categories are not added and, indeed, fillna fails for this very reason. A minimalist example is below.
Any ideas?
Or is there maybe an easy way to convert this categorical object to a non-categorical one? I have tried np.asarray() but with no luck - it becomes an array containing an Interval object
import pandas as pd
import numpy as np
x=[np.nan,4,6]
intervals =[-np.inf,4,np.inf]
out_nolabels=pd.cut(x,intervals)
out_labels=pd.cut(x,intervals, labels=['<=4','>4'])
out_nolabels.add_categories(['missing'])
out_labels.add_categories(['missing'])
print(out_labels)
print(out_nolabels)
out_labels=out_labels.fillna('missing')
out_nolabels=out_nolabels.fillna('missing')

As the documentation say out of the bounds data will be consider as Na categorical object, so you cant use fillna's with some constant in categorical data since the new value you are filling is not in that categories
Any NA values will be NA in the result. Out of bounds values will be
NA in the resulting Categorical object
You cant use x.fillna('missing') because missing is not in the category of x but you can do x.fillna('>4') because >4 is in the category.
We can use np.where here to overcome that
x = pd.cut(df['id'],intervals, labels=['<=4','>4'])
np.where(x.isnull(),'missing',x)
array(['<=4', '<=4', '<=4', '<=4', 'missing', 'missing'], dtype=object)
Or add_categories to the values i.e
x = pd.cut(df['id'],intervals, labels=['<=4','>4']).values.add_categories('missing')
x.fillna('missing')
[<=4, <=4, <=4, <=4, missing, missing]
Categories (3, object): [<=4 < >4 < missing]
If you want to group nan's and keep the dtype one way of doing it is by casting it to str i.e If you have a dataframe
df = pd.DataFrame({'id':[1,1,1,4,np.nan,np.nan],'value':[4,5,6,7,8,1]})
df.groupby(df.id.astype(str)).mean()
Output :
id value
id
1.0 1.0 5.0
4.0 4.0 7.0
nan NaN 4.5

Convert pandas series float dtype to int dtype while preserving NA's

Is it possible to change a column in a data frame that is float64 and holds some null values to an integer dtype? I get the following error
raise ValueError('Cannot convert NA to integer')

It is not possible, even if you try do some work around. Generally, NaN are more efficient when it comes to show missing values. So people try to do this,
Let's check what will happen if we try same.
Convert all NaN values to 0 (if your data does not have this
value), if 0 is not possible in your case use a very large number in
negative or positive, say 9999999999
df['x'].dtype output: dtype('float64')
df.loc[df['x'].notnull(),'x'] = 9999999999 or
df.loc[df['x'].notnull(),'x'] = 0
Convert all non NaN values to int only.
df['x'] = df['x'].astype('int64') converting to int64, now dtype is int64.
Put back your NaN values:
df.loc[df['x']==0,'x'] = np.nan
df['x'].dtype
output: dtype('float64')
Above technique can also be used to convert float column to integer column if it contains NaN and raising errors. But you will have to lose NaN anyway.

How to force pandas DataFrame use the desired dtypes when it is constructed?

For example:
raw = {'x':[1,2,3,4], 'y':[None,]*4, 'z':[datetime.now()] *4, 'e':[1,2,3,4]}
a = pd.DataFrame(raw, dtype={'x':float, 'y':float, 'z':object, 'e':int})
This doesn't work.
Currently I have to do:
a = pd.DataFrame(raw, dtype=object)
a['x'] = a['x'].astype(float)
a['y'] = a['y'].astype(float)
a['z'] = pd.to_date_time(a['z'], utc=True)
a['e'] = a['e'].astype(int)
Since I have a number of raw objects I would like to cast into dataframe, is there an easy way to force the right dtypes at constructing time, instead of transforming them later, which takes 2x time needed.
#Jeff has a good way to deal with raw if it is in dict format.
But what if raw is in records format, like:
raw = [(1,None,datetime.now(),1),
(2,None,datetime.now(),2),
(3,None,datetime.now(),3),
(4,None,datetime.now(),4)]
Do I have to zip it? Perhaps the time taken for zip would cost more than cast again afterwards?
DataFrame.from_records doesn't seem to accept a dtype parameter at all.

The constructor will infer non-ambiguous types correctly. You cannot specify a compound dtype mapping ATM, issue is here, pull-requests are welcome to implement this.
Don't use None, instead use np.nan (otherwise it will infer to object dtype)
Specify floats with a decimal point (or wrap as a Series, e.g. Series([1,2,3,4],dtype='float')
datetimes will automatically infer to datetime64[ns] which is almost always what you want unless you need to specify a timezone
Here's your example
In [20]: DataFrame({
'x':Series([1,2,3,4],dtype='float'),
'y':Series([None,]*4,dtype='float'),
'z':[datetime.datetime.now()] *4,
'e':[1,2,3,4]})
Out[20]:
e x y z
0 1 1 NaN 2014-06-17 07:40:42.188422
1 2 2 NaN 2014-06-17 07:40:42.188422
2 3 3 NaN 2014-06-17 07:40:42.188422
3 4 4 NaN 2014-06-17 07:40:42.188422
In [21]: DataFrame({
'x':Series([1,2,3,4],dtype='float'),
'y':Series([None,]*4,dtype='float'),
'z':[datetime.datetime.now()] *4,
'e':[1,2,3,4]}).dtypes
Out[21]:
e int64
x float64
y float64
z datetime64[ns]
dtype: object

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to convert string columns to numeric values without getting NaN values - python

Related

Pandas ValueError: cannot convert float NaN to integer [duplicate]

Pandas: ValueError: cannot convert float NaN to integer

pandas cut(): how to convert nans? Or to convert the output to non-categorical?

Convert pandas series float dtype to int dtype while preserving NA's

How to force pandas DataFrame use the desired dtypes when it is constructed?

Categories

Resources