Good day. I am working on two dataframes that I will later be comparing, playersData and allStar. playersData['Year'] is type float64, while allStar['Year'] is type object. I tried to convert playersData['Year'] using the following code:
playersData['Year'] = playersData['Year'].astype(str).astype(int)
but it shows an error saying:
ValueError: invalid literal for int() with base 10: 'nan'
The code I used is from this link: https://www.kite.com/python/answers/how-to-convert-a-pandas-dataframe-column-from-object-to-int-in-python
Here are reference pics showing the dtypes of my dataframes.
Try dropping all the NaN values from the dataset:
playersData.dropna(inplace=True)
You can either drop the rows containing NaN values or replace them with a constant (if the other columns contain valuable info, dropping whole rows might not be a good option).
If you want to drop:
playersData.dropna(inplace=True) or playersData = playersData.dropna()
If you want to replace with a constant (e.g. 0):
playersData['Year'].fillna(0, inplace=True) or playersData['Year'] = playersData['Year'].fillna(0)
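Putting the two options together, a minimal sketch (with a made-up stand-in for playersData) might look like this:

import numpy as np
import pandas as pd

# hypothetical stand-in for playersData; the real frame has more columns
playersData = pd.DataFrame({'Year': [2000.0, 2001.0, np.nan, 2003.0]})

# Option 1: drop the rows whose 'Year' is NaN, then convert
dropped = playersData.dropna(subset=['Year']).copy()
dropped['Year'] = dropped['Year'].astype(int)

# Option 2: replace NaN with a constant (0 here), then convert
filled = playersData.copy()
filled['Year'] = filled['Year'].fillna(0).astype(int)

print(dropped['Year'].dtype, filled['Year'].dtype)  # int64 int64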
Related
I have a CSV file that contains 6 columns in the following format:
2021-04-13 11:03:13+02:00,3.0,3.0,3.0,12.0,12.0
Now I want to remove the decimal point of each column except for the first one.
I already tried df.style.format as well as df.astype(int) and pd.set_option('precision', 0), but the latter two always get stuck on the first column, since that doesn't quite fit (ValueError: invalid literal for int() with base 10: '2021-04-13 11:03:13+02:00').
Any help would be much appreciated!
With your shown samples, please try the following, using pandas' iloc functionality.
df.iloc[:,1:] = df.iloc[:,1:].astype(int)
df will be as follows after running the above code:
0,2021-04-13 11:03:13+02:00,3,3,3,12,12
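As a self-contained sketch (the column names below are invented, since the sample CSV has no header). Note that on some pandas versions assigning back through iloc can silently keep the original float dtype, so converting column by column is a safe fallback:

import pandas as pd

# hypothetical reconstruction of the sample row; the column names are made up
df = pd.DataFrame(
    [['2021-04-13 11:03:13+02:00', 3.0, 3.0, 3.0, 12.0, 12.0]],
    columns=['timestamp', 'c1', 'c2', 'c3', 'c4', 'c5'],
)

# convert every column except the first one to int, one column at a time
for col in df.columns[1:]:
    df[col] = df[col].astype(int)

print(df)
print(df.dtypes)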
I get ValueError: cannot convert float NaN to integer for the following:
df = pandas.read_csv('zoom11.csv')
df[['x']] = df[['x']].astype(int)
The "x" is a column in the csv file, I cannot spot any float NaN in the file, and I don't understand the error or why I am getting it.
When I read the column as String, then it has values like -1,0,1,...2000, all look very nice int numbers to me.
When I read the column as float, then this can be loaded. Then it shows values as -1.0,0.0 etc, still there are no any NaN-s
I tried with error_bad_lines = False and dtype parameter in read_csv to no avail. It just cancels loading with same exception.
The file is not small (10+ M rows), so cannot inspect it manually, when I extract a small header part, then there is no error, but it happens with full file. So it is something in the file, but cannot detect what.
Logically the csv should not have missing values, but even if there is some garbage then I would be ok to skip the rows. Or at least identify them, but I do not see way to scan through file and report conversion errors.
Update: Using the hints in the comments/answers, I got my data clean with this:
# x contained NaN
df = df[~df['x'].isnull()]
# Y contained some other garbage, so null check was not enough
df = df[df['y'].str.isnumeric()]
# final conversion now worked
df[['x']] = df[['x']].astype(int)
df[['y']] = df[['y']].astype(int)
For identifying NaN values use boolean indexing:
print(df[df['x'].isnull()])
Then, to remove all non-numeric values, use to_numeric with errors='coerce' to replace them with NaNs:
df['x'] = pd.to_numeric(df['x'], errors='coerce')
And to remove all rows with NaNs in column x, use dropna:
df = df.dropna(subset=['x'])
Last, convert the values to ints:
df['x'] = df['x'].astype(int)
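Putting those steps together, a minimal sketch (with a made-up 'x' column standing in for the real CSV) could look like this:

import numpy as np
import pandas as pd

# hypothetical stand-in for the real zoom11.csv data
df = pd.DataFrame({'x': ['-1', '0', '2000', 'garbage', np.nan]})

# show the rows where 'x' is missing
print(df[df['x'].isnull()])

# coerce non-numeric values (like 'garbage') to NaN
df['x'] = pd.to_numeric(df['x'], errors='coerce')

# drop the rows that are NaN in 'x', then convert to int
df = df.dropna(subset=['x'])
df['x'] = df['x'].astype(int)
print(df.dtypes)  # x    int64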
ValueError: cannot convert float NaN to integer
From v0.24, you actually can. Pandas introduced nullable integer data types, which allow integers to coexist with NaNs.
Given a series of whole float numbers with missing data,
s = pd.Series([1.0, 2.0, np.nan, 4.0])
s
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
s.dtype
# dtype('float64')
You can convert it to a nullable int type (choose one of Int16, Int32, or Int64) with:
s2 = s.astype('Int32') # note the 'I' is uppercase
s2
0 1
1 2
2 NaN
3 4
dtype: Int32
s2.dtype
# Int32Dtype()
Your column needs to have whole numbers for the cast to happen. Anything else will raise a TypeError:
s = pd.Series([1.1, 2.0, np.nan, 4.0])
s.astype('Int32')
# TypeError: cannot safely cast non-equivalent float64 to int32
Also, even on the latest versions of pandas, if the column is of object type you would have to convert it to float first, something like:
df['column_name'].astype(float).astype("Int32")
NB: You have to go through a float dtype first and then to nullable Int32, for some reason.
Whether to use a 32- or 64-bit int depends on your data; be aware you may lose precision if your numbers are too big for the chosen format.
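For example, a minimal sketch starting from an object column (the values here are made up):

import numpy as np
import pandas as pd

# hypothetical object column of whole numbers with a missing value
df = pd.DataFrame({'column_name': ['1', '2', np.nan, '4']})

# object -> float -> nullable Int32; the intermediate float step handles the NaN
df['column_name'] = df['column_name'].astype(float).astype('Int32')

print(df['column_name'])
print(df['column_name'].dtype)  # Int32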
I know this has been answered but wanted to provide alternate solution for anyone in the future:
You can use .loc to subset the dataframe to only the rows where 'x' is notnull(), and then select just the 'x' column. Take that same vector and apply(int) to it.
If column x is float:
df.loc[df['x'].notnull(), 'x'] = df.loc[df['x'].notnull(), 'x'].apply(int)
If you have null values, then mathematical operations will raise this error. To resolve it, use df[~df['x'].isnull()][['x']].astype(int) if you want the original dataset to stay unchanged.
I'm cleaning a data file with some irregularities in it. I have a list of values like so:
import numpy as np
import pandas as pd
dataset = pd.DataFrame.from_dict({'data':['1','2','3','Third Street',np.nan]})
My goal is to filter out the "Third Street" value while retaining the NaN value.
dataset['data'].astype(int)
ValueError: invalid literal for int() with base 10: 'Third Street'
Which makes a lot of sense, since that value can't be converted to an integer.
Trying to filter out the non-digit values also filters out the NaN value, which I want to keep:
digitFilter = dataset['data'].str.isdigit()
dataset[digitFilter]
ValueError: cannot index with vector containing NA / NaN values
I've also tried stacking filters, but the NaN seems to get in the way there, too. I'm sure there's an easy way to do this that I'm overlooking. I'd appreciate any wisdom anyone can offer.
You can use | (the OR operator) to keep a value if it is a digit string or NaN. Here isnull() matches the actual NaN values (comparing against the string 'NaN' would miss them), and fillna(False) clears the NaN that str.isdigit() returns for the missing entry so the mask can be used for indexing:
digitFilter = dataset['data'].str.isdigit().fillna(False) | dataset['data'].isnull()
dataset[digitFilter]
Perhaps you could write a function that wraps what you are doing above in a try/except?
Then apply this function to the column containing 'Third Street'.
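A minimal sketch of that idea (the helper name is_int_or_nan is made up):

import numpy as np
import pandas as pd

dataset = pd.DataFrame.from_dict({'data': ['1', '2', '3', 'Third Street', np.nan]})

def is_int_or_nan(value):
    """Return True if the value is NaN or can be converted to an int."""
    if pd.isnull(value):
        return True
    try:
        int(value)
        return True
    except ValueError:
        return False

# keeps '1', '2', '3' and the NaN; drops 'Third Street'
filtered = dataset[dataset['data'].apply(is_int_or_nan)]
print(filtered)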
I'm importing data from a CSV file which has text, date and numeric columns. I'm using pandas.read_csv() to read it in, but I'm not specifying what each column's dtype should be. Here's a cut of that csv file (apologies for the shoddy formatting).
Now these two columns (total_imp_pma, char_value_aa503) are imported very differently. I import all the number fields and create a new dataframe called base_varlist4, which only contains the number columns.
When I run base_varlist4.dtypes, I get:
total_imp_pma object
char_value_aa503 float64
So as you can see, total_imp_pma was imported as an object. The problem is that if I run this:
#calculate max, and group by obs_date
output_max_temp=base_varlist4.groupby('obs_date').max(skipna=True)
#reset obs_date to be treated as a column rather than an index
output_max_temp.reset_index()
#reshape temporary output to have 2 columns corresponding to variable and value
output_max=pd.melt(output_max_temp, id_vars='obs_date', value_vars=varlist4)
Where varlist4 is just my list of columns, I get the wrong max value for total_imp_pma but the correct max value for char_value_aa503.
Logically, this means I should change the object total_imp_pma to either a float or an integer. However, when I run:
base_varlist4[varlist4] = base_varlist4[varlist4].apply(pd.to_numeric, errors='coerce')
and then proceed to take the max, I still get an incorrect result.
What's going on here? Why does pandas.read_csv() import some columns as an object dtype, and others as an int64 or float64 dtype? Why does conversion not work?
I have a theory, but I'm not sure how to work around it. The only difference I see between the two columns in my source data is that total_imp_pma has mixed-type cells all the way down. For example, 66979 is a General cell, while a cell a little further down has the value 1,760.60 formatted as a number.
I think the mixed cell types in certain columns are causing pandas.read_csv() to get confused and just say "welp, dunno what this is, import it as an object".
... how do I fix this?
EDIT: Here's an MCVE as per the request below.
Data in CSV is:
Char_Value_AA503  Total_IMP_PMA
1293              19.9
1831              0.9
                  1.2
243               2,666.50
Code is:
import pandas as pd
loc = r"xxxxxxxxxxxxxx"
source_data_name = 'import_problem_example.csv'
reporting_date = '01Feb2018'
source_data = pd.read_csv(loc + source_data_name)
source_data.columns = source_data.columns.str.lower()
varlist4 = ["char_value_aa503","total_imp_pma"]
base_varlist4 = source_data[varlist4]
base_varlist4['obs_date'] = reporting_date
base_varlist4[varlist4] = base_varlist4[varlist4].apply(pd.to_numeric, errors='coerce')
output_max_temp=base_varlist4.groupby('obs_date').max(skipna=True)
#reset obs_date to be treated as a column rather than an index
output_max_temp.reset_index()
#reshape temporary output to have 2 columns corresponding to variable and value
output_max=pd.melt(output_max_temp, id_vars='obs_date', value_vars=varlist4)
""" Test some stuff"""
source_data.dtypes
output_max
source_data.dtypes
As you can see, the max value of total_imp_pma comes out as 19.9, when it should be 2666.50.
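A likely culprit is the thousands separator: pd.to_numeric(..., errors='coerce') turns '2,666.50' into NaN because of the comma, so the max is computed only over the values that survived. As a hedged sketch (assuming the commas really are thousands separators), telling read_csv about the separator avoids the problem:

import pandas as pd
from io import StringIO

# hypothetical stand-in for the CSV shown above
csv_data = '''char_value_aa503,total_imp_pma
1293,19.9
1831,0.9
,1.2
243,"2,666.50"
'''

# thousands=',' lets read_csv parse "2,666.50" as the number 2666.50
source_data = pd.read_csv(StringIO(csv_data), thousands=',')
print(source_data.dtypes)                  # both columns come out numeric
print(source_data['total_imp_pma'].max())  # 2666.5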
I want to create a new data frame with two columns: Striker_Id, and a second column holding the sum of 'Batsman_Scored' for each grouped 'Striker_Id'.
Eg:
Striker_ID Batsman_Scored
1 0
2 8
...
I tried ball.groupby(['Striker_Id'])['Batsman_Scored'].sum(), but this is what I get:
Striker_Id
1 0000040141000010111000001000020000004001010001...
2 0000000446404106064011111011100012106110621402...
3 0000121111114060001000101001011010010001041011...
4 0114110102100100011010000000006010011001111101...
5 0140016010010040000101111100101000111410011000...
6 1100100000104141011141001004001211200001110111...
It doesn't sum; it just concatenates all the numbers as strings. What's the alternative?
For some reason, your columns were loaded as strings. While loading them from a CSV, try applying a converter -
df = pd.read_csv('file.csv', converters={'Batsman_Scored' : int})
Or,
df = pd.read_csv('file.csv', converters={'Batsman_Scored' : pd.to_numeric})
If that doesn't work, then convert to integer after loading -
df['Batsman_Scored'] = df['Batsman_Scored'].astype(int)
Or,
df['Batsman_Scored'] = pd.to_numeric(df['Batsman_Scored'], errors='coerce')
Now, performing the groupby should work -
r = df.groupby('Striker_Id')['Batsman_Scored'].sum()
Without access to your data, I can only speculate. But it seems like, at some point, your data contains non-numeric values that prevent pandas from performing the conversion, so those columns are retained as strings. It's a little difficult to pinpoint the problematic data until you actually load it in and do something like
(~df.col.str.isdigit()).any()
That'll tell you if there are any non-numeric items. Note that this only works for integer-like strings; float columns cannot be debugged like this.
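For a check that also works when the values are float-like, a hedged sketch using to_numeric:

import pandas as pd

# hypothetical column mixing numeric strings and garbage
col = pd.Series(['1', '2.5', 'garbage', '4'])

# entries that are present but fail numeric conversion are the corrupt ones
bad = col.notnull() & pd.to_numeric(col, errors='coerce').isnull()
print(col[bad])  # shows only 'garbage'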
Also, another way of seeing what columns have corrupt data would be to query dtypes -
df.dtypes
Which will give you a listing of all columns and their datatypes. Use this to figure out what columns need parsing -
for c in df.columns[df.dtypes == object]:
    print(c)
You can then apply the methods outlined above to fix them.
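Putting the diagnosis and the fix together, a small end-to-end sketch (with made-up data standing in for ball):

import pandas as pd

# hypothetical stand-in for the ball DataFrame, with scores loaded as strings
ball = pd.DataFrame({
    'Striker_Id': [1, 1, 2, 2],
    'Batsman_Scored': ['0', '4', '1', '6'],
})

# find the object columns that need parsing
for c in ball.columns[ball.dtypes == object]:
    print(c)  # Batsman_Scored

# convert, then the groupby sum behaves numerically
ball['Batsman_Scored'] = pd.to_numeric(ball['Batsman_Scored'], errors='coerce')
r = ball.groupby('Striker_Id')['Batsman_Scored'].sum()
print(r)  # 1 -> 4, 2 -> 7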