how to deal with strings on a numeric column in pandas? - python

I have a big dataset and I cannot convert the dtype from object to int because of the error "invalid literal for int() with base 10:" I did some research and it is because there are some strings within the column.
How can I find those strings and replace them with numeric values?

You might be looking for .str.isnumeric(), which lets you filter the rows whose strings are (or are not) numeric and act on them independently. You'll still need to decide what the non-numeric values should be:
converted (maybe they're money and you want to strip the €, or a date format that's not a UNIX epoch, or any number of other possibilities)
dropped (just throw them away)
something else
>>> df = pd.DataFrame({"a":["1", "2", "x"]})
>>> df
a
0 1
1 2
2 x
>>> df[df["a"].str.isnumeric()]
a
0 1
1 2
>>> df[~df["a"].str.isnumeric()]
a
2 x
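One caveat worth checking before relying on this filter: .str.isnumeric() tests whether every character is a numeric character, so signs and decimal points make it return False. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up sample: a plain integer, a negative number, a decimal, and a word.
s = pd.Series(["1", "-1", "1.5", "x"])

# Only the string made up entirely of digits passes the check.
flags = s.str.isnumeric()
print(flags.tolist())
```

If the column can contain negative or fractional numbers, pd.to_numeric with errors='coerce' is probably the safer test.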

Assuming 'col' is the column name.
Just force the conversion to numeric, with NaN wherever it fails:
df['col_num'] = pd.to_numeric(df['col'], errors='coerce')
If needed you can check which original values gave NaNs using:
df.loc[df['col'].notna()&df['col_num'].isna(), 'col']
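On a big dataset it may help to summarise the offending strings rather than list every row. The sketch below uses the same mask and counts each distinct bad value; the sample data and the names bad_mask and bad_counts are made up for illustration.

```python
import pandas as pd

# Made-up sample: numeric strings, a missing value, and some garbage.
df = pd.DataFrame({"col": ["1", "2", "x", None, "7.5", "x", "N/A"]})

# Anything that cannot be parsed becomes NaN.
df["col_num"] = pd.to_numeric(df["col"], errors="coerce")

# Values that were present originally but failed to convert.
bad_mask = df["col"].notna() & df["col_num"].isna()

# Count each distinct offending string.
bad_counts = df.loc[bad_mask, "col"].value_counts()
print(bad_counts)
```

The counts show which strings need a replacement rule and how often each one occurs.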

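Base 10 only means that int() is parsing the text as a decimal number; the error says the text is not a valid integer literal. If the strings are floats written as text (for example "3.0"), in Python you would do
int(float(____))
Since you used int(), I'm guessing you needed an integer value. Note that this still fails for strings that are not numbers at all.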
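A small sketch of how int() and float() treat string input may make the difference concrete. The helper name to_int is hypothetical, not part of any library.

```python
def to_int(text):
    """Parse text as an integer, falling back to float for strings like '42.0'."""
    try:
        return int(text)
    except ValueError:
        # int() rejects a decimal point; float() accepts it, then int() truncates.
        return int(float(text))


print(to_int("42"))
print(to_int("42.0"))
print(to_int("3.9"))  # truncates toward zero, it does not round
```

A string such as 'EC180' still raises ValueError here, because float() cannot parse it either.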

Related

invalid literal for int() with base 10: 'EC180'

All that i tried was:
df['buyer_zip']=df['buyer_zip'].replace('-', 0)
df['buyer_zip']=df['buyer_zip'].replace('', 0)
df['buyer_zip']=df['buyer_zip'].str[:5]
df["buyer_zip"].fillna( method ='ffill', inplace = True)
df["buyer_zip"].apply(int)
I have two columns in a pandas dataframe called Buyer_zip and Item_zip, which are the zip codes of buyers and items respectively. These zip codes come in 4 formats: a 5-digit zip code (ex: 12345), a 5+4 digit zip code (12345-1234), a 9-digit zip code (123456789), and 'EC180'. So the last format is alphanumeric. There are 15 million records in total. I am stuck at the point where I have to convert all those alphanumeric values to numeric. When trying to do so, I encountered the error: invalid literal for int() with base 10: 'EC180'. Could someone help me find all the words in my data column and replace them with 00000? None of what I tried above showed how to find the words in that column and replace them with numbers. Appreciate any help.
Sample data:
buyer_zip
97219
11415-3528
EC180
907031234
Expected output
buyer_zip
0 97219
1 114153528
2 0
3 907031234
Pandas has several different "replace" methods. On a DataFrame or a Series, replace is meant to match and replace entire values. For instance, df['buyer_zip'].replace('-', 0) looks for a column value that is literally the single character "-" and replaces it with the integer 0. That's not what you want. The series also has a .str attribute which holds functions for strings, and its replace is closer to what you want.
Whole-value replacement is what you want, however, when a string starts with a non-digit character. That one should be completely replaced with "00000".
Finally, astype is a faster way to convert the column to an int.
import pandas as pd
df = pd.DataFrame({"buyer_zip":["12345", "123451234", "123456789", "EC180"]})
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "")
df["buyer_zip"] = df["buyer_zip"].replace(r"[^\d].*$", "00000", regex=True)
df["buyer_zip"] = df["buyer_zip"].astype(int)
The operations can be chained. Apply the second operation to the result of the first, etc, and you can condense the conversion
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "").replace(r"[^\d].*$", "00000", regex=True).astype(int)
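As an alternative sketch that avoids the regular expression, the hyphens can be stripped and everything that still fails to parse can be coerced to NaN and then filled with 0. This uses the sample data from the question; it assumes that 0 is an acceptable stand-in for "00000" once the column is an integer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"buyer_zip": ["97219", "11415-3528", "EC180", "907031234"]})

# Remove the hyphen from the 5+4 format (literal match, not a regex).
digits = df["buyer_zip"].str.replace("-", "", regex=False)

# Unparseable values such as 'EC180' become NaN, then 0.
df["buyer_zip"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype("int64")
print(df)
```

One thing to keep in mind: to_numeric also accepts strings like "1e5" or "12.5", so if those can occur in the data, the regex approach is stricter.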

Remove scientific notation floats in a dataframe

I am receiving different series from a source. Some of those series have the values in big numbers (X billions). I then combine all the series to a dataframe with individual columns for each series.
Now, when I print the dataframe, the big numbers in the series are showed in scientific notation. Even printing the series individually shows the numbers in scientific notation.
Dataframe df (multiindex) output is:
Values
Item Sub
A 1 1.396567e+12
B 1 2.868929e+12
I have tried this:
pd.set_option('display.float_format', lambda x: '%,.2f' % x)
This doesn't work as:
it converts everywhere. I only need the conversion in that specific dataframe.
it tries to convert all kinds of floats, and not just those in scientific. So, even if the float is 89.142, it will try to convert the format and as there's no digit to put ',' it shows an error.
Then I tried these:
df.round(2)
This only converted numeric floats to 2 decimals from existing 3 decimals. Didn't do anything to scientific values.
Then I tried:
df.astype(float)
Doesn't do anything visible. Output stayed the same.
How else can we change the scientific notation to normal float digits inside the dataframe. I do not want to create a new list with the converted values. The dataframe itself should show the values in normal terms.
Can you guys please help me find a solution for this?
Thank you.
Try df['column'] = df['column'].astype(str). If that does not work, convert the numbers to strings before creating the pandas dataframe from your data.
I would suggest keeping everything in a float type and adjusting the display setting.
For example, I have generated a df with some random numbers.
df = pd.DataFrame({"Item": ["A", "B"], "Sub": [1,1],
"Value": [float(31132314122123.1), float(324231235232315.1)]})
# Item Sub Value
#0 A 1 3.113231e+13
#1 B 1 3.242312e+14
If we print(df.dtypes), we can see that the Sub values are ints and the Value values are floats.
Item object
Sub int64
Value float64
dtype: object
You can then call pd.options.display.float_format = '{:.1f}'.format to suppress the scientific notation of the floats, while retaining the float format.
# Item Sub Value
#0 A 1 31132314122123.1
#1 B 1 324231235232315.1
Item object
Sub int64
Value float64
dtype: object
If you want the scientific notation back, you can call pd.reset_option('display.float_format')
Okay. I found something called option_context in pandas that lets you change the display options just for a particular case or action, using a with statement.
with pd.option_context('display.float_format', '{:.2f}'.format):
    print(df)
So we do not have to reset the options afterwards, and the options stay at their defaults for all other data in the file.
Sadly though, I could find no way to store different columns in different float format (for example one column with currency - comma separated and 2 decimals, while next column in percentage - non-comma and 2 decimals.)
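For display purposes only, one possible route is DataFrame.to_string with its formatters argument, which takes a dict mapping column names to formatting functions. The sketch below is an illustration with made-up column names (Revenue, Share) and values; the stored data stays float64 and only the rendered text changes.

```python
import pandas as pd

# Hypothetical data: one currency-like column and one fraction column.
df = pd.DataFrame({"Revenue": [1.396567e12, 2.868929e12],
                   "Share": [0.3274, 0.6726]})

# One formatting function per column: comma-separated with 2 decimals,
# and a percentage with 2 decimals.
formatters = {"Revenue": "{:,.2f}".format, "Share": "{:.2%}".format}

text = df.to_string(formatters=formatters)
print(text)
```

Because the formatting is applied per call, no global option is touched and other dataframes print as before.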

Pandas ValueError: cannot convert float NaN to integer [duplicate]

I get ValueError: cannot convert float NaN to integer for following:
df = pandas.read_csv('zoom11.csv')
df[['x']] = df[['x']].astype(int)
The "x" is a column in the csv file, I cannot spot any float NaN in the file, and I don't understand the error or why I am getting it.
When I read the column as a string, it has values like -1,0,1,...2000, which all look like perfectly good ints to me.
When I read the column as float, it loads fine. It then shows values such as -1.0,0.0 etc., and there are still no NaNs.
I tried with error_bad_lines = False and dtype parameter in read_csv to no avail. It just cancels loading with same exception.
The file is not small (10+ M rows), so cannot inspect it manually, when I extract a small header part, then there is no error, but it happens with full file. So it is something in the file, but cannot detect what.
Logically the csv should not have missing values, but even if there is some garbage then I would be ok to skip the rows. Or at least identify them, but I do not see way to scan through file and report conversion errors.
Update: Using the hints in comments/answers I got my data clean with this:
# x contained NaN
df = df[~df['x'].isnull()]
# Y contained some other garbage, so null check was not enough
df = df[df['y'].str.isnumeric()]
# final conversion now worked
df[['x']] = df[['x']].astype(int)
df[['y']] = df[['y']].astype(int)
For identifying NaN values use boolean indexing:
print(df[df['x'].isnull()])
Then, to deal with non-numeric values, use to_numeric with the parameter errors='coerce', which replaces non-numeric values with NaN:
df['x'] = pd.to_numeric(df['x'], errors='coerce')
To remove all rows with NaN in column x, use dropna:
df = df.dropna(subset=['x'])
Finally, convert the values to ints:
df['x'] = df['x'].astype(int)
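To report which rows fail before dropping them, one possible sketch is to read every column as a string, coerce, and then compare. The CSV content below is made up and read from memory via io.StringIO; with a real file the path would go to read_csv instead.

```python
import io

import pandas as pd

# Made-up CSV: row 1 has garbage in y, row 2 has a missing x.
csv_text = "x,y\n-1,10\n0,abc\n,30\n2000,40\n"

# Read everything as text so that loading never fails on bad values.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Coerce each column; anything unparseable or missing becomes NaN.
converted = raw.apply(pd.to_numeric, errors="coerce")

# Rows where at least one column failed to convert, shown with their original text.
bad_rows = raw[converted.isna().any(axis=1)]
print(bad_rows)

# Keep only the fully valid rows and cast them to integers.
clean = converted.dropna().astype("int64")
print(clean)
```

Printing bad_rows shows the original text of the problem rows, which is usually enough to decide whether to skip or repair them.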
ValueError: cannot convert float NaN to integer
From v0.24, you actually can. Pandas introduces Nullable Integer Data Types which allows integers to coexist with NaNs.
Given a series of whole float numbers with missing data,
s = pd.Series([1.0, 2.0, np.nan, 4.0])
s
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
s.dtype
# dtype('float64')
You can convert it to a nullable int type (choose from one of Int16, Int32, or Int64) with,
s2 = s.astype('Int32') # note the 'I' is uppercase
s2
0 1
1 2
2 NaN
3 4
dtype: Int32
s2.dtype
# Int32Dtype()
Your column needs to have whole numbers for the cast to happen. Anything else will raise a TypeError:
s = pd.Series([1.1, 2.0, np.nan, 4.0])
s.astype('Int32')
# TypeError: cannot safely cast non-equivalent float64 to int32
Also, even in the latest versions of pandas, if the column is of object type you have to convert it to float first, something like:
df['column_name'].astype(float).astype("Int32")
NB: You have to go through float first and then to nullable Int32, for some reason. (np.float was only an alias for the built-in float and has been removed from recent NumPy releases, so plain float is used here.)
Whether the int should be 32 or 64 bits depends on your data; be aware that you may lose some precision if your numbers are too big for the format.
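For an object column that holds numbers as text, another possible route is pd.to_numeric followed by a cast to a nullable integer type. A minimal sketch with made-up values, assuming every non-missing value is a whole number:

```python
import pandas as pd

# Made-up object column: numeric strings plus a missing value.
s = pd.Series(["1", "2", None, "4"], dtype=object)

# to_numeric parses the strings (missing stays NaN), then the nullable
# Int64 type keeps the gap as a missing value instead of raising.
s_int = pd.to_numeric(s, errors="coerce").astype("Int64")
print(s_int)
```

Int64 is chosen here only to leave plenty of headroom; a smaller nullable type works the same way if the values fit.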
I know this has been answered but wanted to provide alternate solution for anyone in the future:
You can use .loc to subset the dataframe by only values that are notnull(), and then subset out the 'x' column only. Take that same vector, and apply(int) to it.
If column x is float:
df.loc[df['x'].notnull(), 'x'] = df.loc[df['x'].notnull(), 'x'].apply(int)
If the column contains null values, you will get this error when converting. To resolve it, use df[~df['x'].isnull()][['x']].astype(int), which leaves the original dataset unchanged.

Error when converting dataframe object into int64

Good day. I am working on 2 dataframes that I will later be comparing, playersData and allStar. playersData['Year'] is type int64, while allStar is type object. I tried to convert playersData['Year'] using the following code:
playersData['Year'] = playersData['Year'].astype(str).astype(int)
but it raises an error saying:
ValueError: invalid literal for int() with base 10: 'nan'
the code I used is from the link: https://www.kite.com/python/answers/how-to-convert-a-pandas-dataframe-column-from-object-to-int-in-python
Try dropping all the NaN values from the dataset.
playersData.dropna(inplace=True)
You can either drop rows containing NaN values or replace them with a constant (In case there were few other columns containing valuable info, dropping rows might not be a good option).
If you want to drop
playersData.dropna(inplace=True) or playersData = playersData.dropna()
Replacing with a constant (Ex: 0)
playersData['Year'].fillna(0, inplace=True) or playersData['Year'] = playersData['Year'].fillna(0)
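If other columns hold valuable information, dropna can be limited to the column being converted through its subset argument. The sketch below uses a made-up stand-in for playersData (the Player and Team columns and all values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for playersData.
players = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Year": [1990.0, np.nan, 2001.0],
    "Team": ["X", "Y", None],
})

# Drop only the rows where Year is missing; a missing Team is tolerated.
with_year = players.dropna(subset=["Year"]).copy()

# With no NaN left in Year, the integer cast succeeds.
with_year["Year"] = with_year["Year"].astype(int)
print(with_year)
```

Player C is kept even though Team is missing, which a plain dropna() would not do.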
