I have the DataFrame below, with 3 columns:
df = DataFrame(query, columns=["Processid", "Processdate", "ISofficial"])
With the code below, I get Processdate for Processid == 204 (without column names):
result = df[df.Processid == 204].Processdate.to_string(index=False)
But I want the same result for two columns at once, without column names, something like the code below:
result = df[df.Processid == 204].df["Processdate","ISofficial"].to_string(index=False)
I know how to get the above result, but I don't want the column names, index, or data type in the output.
Can someone help?
I think you are looking for the header argument of to_string. Set it to False.
df[df.Processid==204][['Processdate', 'ISofficial']].to_string(index=False, header=False)
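As a minimal sketch of that answer, assuming a small made-up DataFrame (only the three column names come from the question; the values are invented for illustration), selecting both columns with a list and passing index=False and header=False leaves only the cell values in the string:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Processid": [204, 205, 204],
    "Processdate": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "ISofficial": [True, False, True],
})

# Filter the rows, pick the two columns with a list, then render the
# values without the index or the header row.
result = df.loc[df["Processid"] == 204, ["Processdate", "ISofficial"]].to_string(
    index=False, header=False
)
print(result)
```

Note the double brackets (or a list passed to .loc): a bare df["Processdate", "ISofficial"] would be looked up as a single column label and fail.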
I'm finding several answers to this question, but none that seem to address or solve the error that pops up when I apply them. Per e.g. this answer I have a dataframe df and a function my_func(string_1,string_2) and I'm attempting to create a new column with the following:
df['new_column'] = df.apply(lambda x: my_func(x['old_col_1'], x['old_col_2']), axis=1)
I'm getting an error originating inside my_func telling me that old_col_1 is type float and not a string as expected. In particular, the first line of my_func is old_col_1 = old_col_1.lower(), and the error is
AttributeError: 'float' object has no attribute 'lower'
By including debug statements using dataframe printouts I've verified old_col_1 and old_col_2 are indeed both strings. If I explicitly cast them to strings when passing as arguments, then my_func behaves as you would expect if it were being fed numeric data cast as strings, though the column values are decidedly not numeric.
Per this answer I've even explicitly ensured these columns are not being "intelligently" cast incorrectly when creating the dataframe:
df = pd.read_excel(file_name, sheetname,header=0,converters={'old_col_1':str,'old_col_2':str})
The function my_func works very well when it's called on its own. All this is making me suspect that the indices or some other numeric data from the dataframe is being passed, and not (exclusively) the column values.
Other implementations seem to give the same problem. For instance,
df['new_column'] = np.vectorize(my_func)(df['old_col_1'],df['old_col_2'])
produces the same error. Variations (e.g. using df['old_col_1'].to_numpy() or df['old_col_1'].values in place of df['old_col_1']) don't change this.
Is it possible that you have np.nan/None/null values in your columns? If so, you might be getting an error similar to the one this data causes:
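import numpy as np
import pandas as pd
data = {
    'Column1': ['1', '2', np.nan, '3']
}
df = pd.DataFrame(data)
# Raises AttributeError: 'float' object has no attribute 'lower' (np.nan is a float)
df['Column1'] = df['Column1'].apply(lambda x: x.lower())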
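If missing values are the cause, one way to confirm it is to look for cells that are not str instances. The sketch below uses invented data with the question's column names; whether it matches the asker's spreadsheet is an assumption. It also shows that the .str accessor passes missing values through instead of raising:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one cell in old_col_1 is missing.
df = pd.DataFrame({
    "old_col_1": ["Abc", np.nan, "Def"],
    "old_col_2": ["x", "y", "z"],
})

# Rows whose old_col_1 value is not a Python string (NaN is a float).
not_strings = df[df["old_col_1"].map(lambda v: not isinstance(v, str))]
print(not_strings)

# The vectorized string method leaves the missing value in place.
lowered = df["old_col_1"].str.lower()
print(lowered.tolist())
```

If not_strings is non-empty, my_func needs to handle those rows explicitly, or they need to be filled or dropped first.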
I load data into a DataFrame:
dfzips = pd.read_excel(filename_zips, dtype='object')
The DataFrame has a column with the value 00590.
After loading, I get 590 instead.
I have tried dtype='object', but it does not help.
Have you tried using str instead of object?
If you use str (string), it keeps the leading zeros.
It can also help to name the specific column you want read as str or object (the type itself is written without quotation marks).
dfzips = pd.read_excel(filename_zips,dtype=str)
It also supports a dict mapping, where the keys are the column names and the values are the data types to set. You didn't specify the column name, so I just called it "column_1":
dfzips = pd.read_excel(filename_zips,dtype={"column_1":str})
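read_csv accepts the same dtype argument, so the effect can be reproduced without an Excel file. The sketch below uses an invented in-memory CSV with a column named zip (both are assumptions for illustration):

```python
import io

import pandas as pd

# Hypothetical two-row file standing in for the spreadsheet.
csv_text = "zip,city\n00590,Somewhere\n10001,Elsewhere\n"

# Default inference turns the zip column into integers and drops the zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the text exactly as stored.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

# If the zeros are already gone, they can be restored by padding to a known width.
padded = inferred["zip"].astype(str).str.zfill(5)

print(inferred["zip"].tolist(), as_text["zip"].tolist(), padded.tolist())
```

With Excel, if a cell is stored as a number and only displayed with leading zeros, the zeros may not be in the data at all; in that case padding with str.zfill, as in the last step, is one option.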
This is a well-known issue in the pandas library.
Try the code below:
dfzips = pd.read_excel(filename_zips, dtype={'name of the column': str})
Alternatively, try passing converters when reading the Excel file.
Example:
df = pd.read_excel(file, dtype='string', header=headers,
                   converters={'name of the column': decimal_converter})
The decimal_converter function:
def decimal_converter(value):
    try:
        return str(float(value))
    except ValueError:
        return value
The general form is converters={'column name': function}.
You can modify the converter function to suit your requirements.
Try the solutions above; I hope they work.
I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.
When I try to cast the id column to integer while reading the .csv, I get:
df= pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
Alternatively, I tried to convert the column type after reading as below, but this time I get:
df= pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
How can I tackle this?
Since version 0.24, pandas has been able to hold integer dtypes with missing values.
Nullable Integer Data Type.
Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series:
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)
0 1
1 2
2 NaN
dtype: Int64
To convert a column to nullable integers, use:
df['myCol'] = df['myCol'].astype('Int64')
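A minimal sketch of that conversion, assuming a made-up float column of the kind read_csv produces when an integer column has gaps (the values are invented; myCol is the answer's placeholder name):

```python
import numpy as np
import pandas as pd

# A float column, as read_csv would infer for integers with missing entries.
df = pd.DataFrame({"myCol": [1.0, 2.0, np.nan]})

# Cast to the nullable integer dtype; the gap stays missing instead of
# forcing the whole column to float.
df["myCol"] = df["myCol"].astype("Int64")

print(df["myCol"].dtype)
print(df["myCol"].isna().tolist())
```

The cast only succeeds when every non-missing float is a whole number; a value such as 1.5 cannot be converted safely.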
The lack of NaN rep in integer columns is a pandas "gotcha".
The usual workaround is to simply use floats.
My use case is munging data prior to loading into a DB table:
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)
Remove NaNs, convert to int, convert to str and then reinsert NANs.
It's not pretty but it gets the job done!
It is now possible to create a pandas column with an integer dtype that contains NaNs; this was officially added in pandas 0.24.0.
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values."
Whether your pandas Series has object dtype or simply float dtype, the method below will work:
df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float).astype('Int64')
I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.
for col in discrete:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())
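To illustrate what errors='coerce' does here, the following sketch runs the same two steps on one invented object column (the column name feature and its values are assumptions): anything that cannot be parsed as a number becomes missing before the cast to the nullable integer dtype.

```python
import pandas as pd

# Hypothetical object column mixing digit strings, a gap, and junk text.
df = pd.DataFrame({"feature": pd.Series(["1", "2", None, "n/a"], dtype=object)})

# Unparseable entries are coerced to NaN, then the column is cast to Int64.
converted = pd.to_numeric(df["feature"], errors="coerce").astype("Int64")

print(converted.dtype)
print(converted.isna().tolist())
```

Because coercion silently discards unparseable text, it may be worth counting the missing values before and after the conversion.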
If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:
df['col'] = (
    df['col'].fillna(0).astype(int).astype(object).where(df['col'].notnull()))
This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.
You could use .dropna() if it is OK to drop the rows with the NaN values.
df = df.dropna(subset=['id'])
Alternatively, use .fillna() and .astype() to replace the NaNs with a value and convert the column to int.
I ran into this problem when processing a CSV file with large integers, some of which were missing (NaN). Using float as the type was not an option, because I might lose precision.
My solution was to use str as the intermediate type.
Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.
df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)
As an illustration, here is an example of how floats may lose precision:
s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print(f, i, i2)
And the output is:
1.2345678901234567e+19 12345678901234567168 12345678901234567890
As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.
When reading in your data all you have to do is:
df= pd.read_csv("data.csv", dtype={'id': 'Int64'})
Notice that 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes pandas' 'Int64' from NumPy's int64.
As a side note, this will also work with .astype()
df['id'] = df['id'].astype('Int64')
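A self-contained sketch of the read-time variant, using an invented in-memory CSV in place of data.csv (the file contents are an assumption for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: the second row has no id.
csv_text = "id,name\n1,a\n,b\n3,c\n"

# Ask for the nullable integer dtype while parsing.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": "Int64"})

print(df["id"].dtype)
print(df["id"].isna().sum())
```

The empty field is read as a missing value, and the remaining ids stay integers rather than becoming 1.0 and 3.0.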
Documentation here
If you can modify your stored data, use a sentinel value for missing ids. A common case, suggested by the column name, is that id is an integer strictly greater than zero; you could then use 0 as the sentinel value so that you can write:
if row['id']:
Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you can't be sure that integer won't show up in your source data, though. My method formats floats without their decimal part and converts nulls to None. The result is an object column that will look like an integer field with null values when written to a CSV.
keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
import pandas as pd
df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])
If you want to use it when you chain methods, you can use assign:
df = (
    df.assign(col=lambda x: x['col'].astype('Int64'))
)
The issue with Int64, as with many others' solutions, is that null values get replaced with <NA> values, which do not work with pandas' default NaN functions such as isnull() or fillna(). And if you convert the missing values to -1, you may end up overwriting real information. My solution is a little lame, but it provides int values alongside np.nan, so the NaN functions work without compromising your values.
def to_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

df[column] = df[column].apply(to_int)
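For comparison, in recent pandas versions isna() and fillna() do appear to recognize <NA> in an Int64 column. A minimal check, using invented values:

```python
import pandas as pd

# Hypothetical nullable integer column with one missing entry.
s = pd.Series([1, None, 3], dtype="Int64")

# isna() flags the <NA> entry.
mask = s.isna()

# fillna() replaces it and the column keeps its Int64 dtype.
filled = s.fillna(0)

print(mask.tolist())
print(filled.tolist())
```

If this holds in your pandas version, the nullable dtype can be used directly and the element-wise conversion above may not be needed.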
Use .fillna() to replace all NaN values with 0 and then convert it to int using astype(int)
df['id'] = df['id'].fillna(0).astype(int)
For anyone needing to have int values within NULL/NaN-containing columns, but working under the constraint of being unable to use the pandas 0.24.0 nullable integer features mentioned in other answers, I suggest converting the columns to object type using DataFrame.where:
df = df.where(pd.notnull(df), None)
This converts all NaNs in the dataframe to None, treating mixed-type columns as objects, but leaving the int values as int, rather than float.
First you need to specify the newer integer type, Int8 (...Int64) that can handle null integer data (pandas version >= 0.24.0)
df = df.astype('Int8')
But you may want to only target specific columns which have integer data mixed with NaN/nulls:
df = df.astype({'col1': 'Int8', 'col2': 'Int8', 'col3': 'Int8'})
At this point, the NaN's are converted into <NA> and if you want to change the default null value with df.fillna(), you need to coerce the object datatype on the columns you wish to change, otherwise you will see
TypeError: <U1 cannot be converted to an IntegerDtype
You can do this by
df = df.astype(object) if you don't mind changing every column datatype to object (individually, each value's type is still preserved) ... OR
df = df.astype({"col1": object,"col2": object}) if you prefer to target individual columns.
This should help with forcing your integer columns mixed with nulls to stay formatted as integers and change the null values to whatever you like. I can't speak to the efficiency of this method, but it worked for my formatting and printing purposes.
I ran into this issue working with PySpark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pandas' pd.read_csv in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:
def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    assert 'dtype' not in kwargs.keys()
    df = pd.read_csv(file_path, dtype={}, **kwargs)
    for col, typ in custom_dtype.items():
        # Fall back to -1 when no fill value is given for this column.
        if fill_values is None or col not in fill_values.keys():
            fill_val = -1
        else:
            fill_val = fill_values[col]
        df[col] = df[col].fillna(fill_val).astype(typ)
    return df
Try this:
df[['id']] = df[['id']].astype(pd.Int64Dtype())
If you print its dtypes, you will get id Int64 instead of the usual int64.
First remove the rows that contain NaN, then do the integer conversion on the remaining rows.
Finally, insert the removed rows again.
I hope this works.
I had a similar problem. This was my solution:
def toint(zahl=1.1):
    try:
        zahl = int(zahl)
    except (ValueError, TypeError):
        zahl = np.nan
    return zahl
print(toint(4.776655), toint(np.nan), toint('test'))
4 nan nan
df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float)
df['id'] = df['id'].apply(toint)
Since I didn't see the answer here, I might as well add it:
One-liner to convert NaNs to empty strings if, for some reason, you still can't handle NaN or pd.NA (like me, when relying on a library with an older version of pandas):
df.select_dtypes('number').fillna(-1).astype(str).replace('-1', '')
I think #Digestible1010101's approach is the most appropriate for pandas 1.2+ versions; something like this should do the job:
df = df.astype({
    'col_1': 'Int64',
    'col_2': 'Int64',
    'col_3': 'Int64',
    'col_4': 'Int64',
})
Similar to #hibernado's answer, but keeping it as integers (instead of strings)
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = np.where(df[col] == -1, np.nan, df[col])
df.loc[~df['id'].isna(), 'id'] = df.loc[~df['id'].isna(), 'id'].astype('int')
This assumes your DateColumn values, formatted like 3312018.0, should be converted to strings like 03/31/2018, and that some records are missing or 0.
df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
Use pd.to_numeric():
df["DateColumn"] = pd.to_numeric(df["DateColumn"])
Simple and clean.