How to define the default missing value in pandas dataframe - python

I want to read a dataframe with given dtypes and missing values, but the following code fails and I have no idea why!
from io import StringIO
import pandas as pd

myText = StringIO(r"""1,2
3,\N
5,6""")
myDf = pd.read_csv(myText, header=None, names=["a1","a2"], na_values=[r"\N"], dtype={"a1":"int", "a2":"int"})
I got the error message:
ValueError: Integer column has NA values in column 1
If I remove the dtype option dtype={"a1":"int", "a2":"int"}, it works fine. Do integer columns not allow missing values?

Integer dtypes don't allow missing values; float dtypes do. If you need integers, you'll need a sentinel value for the missing ones, like 0 or 99999999 (not recommended). Otherwise, use a dtype like float64 that allows an out-of-band value such as NaN.
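As an aside, newer pandas versions (0.24+) also ship a nullable integer extension dtype, "Int64" (capital I), which keeps integers while allowing missing values; a minimal sketch reusing the question's data:
from io import StringIO
import pandas as pd

myText = StringIO(r"""1,2
3,\N
5,6""")
# the nullable extension dtype keeps integers and shows missing entries as <NA>
myDf = pd.read_csv(myText, header=None, names=["a1","a2"], na_values=[r"\N"], dtype={"a1":"Int64", "a2":"Int64"})
print(myDf)
#    a1    a2
# 0   1     2
# 1   3  <NA>
# 2   5     6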

Related

How can I convert an object to float?

I tried making a boxplot for 'horsepower', but the column shows up as object dtype, so I tried converting it to float, but that raises an error.
The column horsepower seems to hold some non-numeric values. In this case, I suggest using pandas.to_numeric instead of pandas.Series.astype.
Replace this:
df_T['horsepower'] = df_T['horsepower'].astype(float)
with this:
df_T['horsepower'] = pd.to_numeric(df_T['horsepower'], errors='coerce')
As the docs put it: if 'coerce', then invalid parsing will be set as NaN.
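To see what errors='coerce' does in practice, here is a tiny sketch (the '?' placeholder is just an assumption about what the non-numeric entries might look like):
import pandas as pd

s = pd.Series(['130', '165', '?', '150'])  # '?' stands in for a non-numeric entry
print(pd.to_numeric(s, errors='coerce'))
# 0    130.0
# 1    165.0
# 2      NaN
# 3    150.0
# dtype: float64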

.tolist() converting pandas dataframe values to bytes

I read a csv file in as a pandas dataframe but now need to use the data to do some calculations.
import pandas as pd
### LOAD DAQ FILE
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(cal_file,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
### DEFINE SIGNAL CHANNELS
ch104 = df.ch104.dropna()  # gets rid of the None value at the end of the column
print(ch104)
When I print ch104 I get the following.
But I cannot do math on it yet, because the values are strings rather than numbers; the datatype is not correct.
The error if I do calculations is this:
can't multiply sequence by non-int of type 'float'
So I tried using .tolist() (or even list()) on the data, but then ch104 looks like this.
I believe the values are being written as bytes and then stored as a list of strings.
Does anyone know how I can get around this or fix this issue? It may be because the original file is UTF-16 LE encoded, but I cannot change this and need to use the files as is.
I need the values for simple calculations later on but they would need to be a float or a double for that. What can I do?
You probably get this error because you're doing calculations on columns that pandas considers non-numeric: the values (numbers to you) are for some reason being interpreted as strings.
To fix that, you can change the type of those columns using pandas.to_numeric:
import re

# strip the extra whitespace from the string columns
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

# convert every ch*/alarm* column to a numeric dtype
cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

ch104 = df.ch104.dropna()
# do your calculations on the ch104 series here
Note that the 'coerce' argument will put NaN in place of every bad value in your columns. From the docs:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
If 'coerce', then invalid parsing will be set as NaN.
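After the conversion, arithmetic on the series works; a quick check with made-up values:
import pandas as pd

ch104 = pd.to_numeric(pd.Series(['1.5', '2.0', '2.5']), errors='coerce')
print(ch104 * 2.0)  # 3.0, 4.0, 5.0 -- no more "can't multiply sequence" error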

Do not convert numerical column names to float in pandas read_excel

I have an Excel file where a column name might be a number, e.g. 2839238. I am reading it using pd.read_excel(bytes(filedata), engine='openpyxl') and, for some reason, this column name gets converted to the float 2839238.0. How can I disable this conversion?
This is an issue for me because I then operate on column names using string-only methods like df = df.loc[:, ~df.columns.str.contains('^Unnamed')], and it gives me the following error:
TypeError: bad operand type for unary ~: 'float'
Column names are arbitrary.
Try changing the type of the columns:
df['col'] = df['col'].astype(int)
The number you gave in the example suggests you may have big numbers, which end up handled as float or double rather than int; check the ranges of the data types and compare them to your data to see which one you can use.
Verify that you don't have any duplicate column names. Pandas will add .0 or .1 if there is another instance of 2839238 as a header name.
See the description of the mangle_dupe_cols parameter, which says:
Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
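If the float headers are the real problem, another option is to normalize all column names back to strings right after reading; a sketch (assuming the numeric headers are whole numbers; the sample frame just mimics the read_excel result):
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=[2839238.0, 'Unnamed: 1'])
df.columns = [str(int(c)) if isinstance(c, float) and c.is_integer() else str(c)
              for c in df.columns]
# string-only methods on the columns work again:
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
print(df.columns)  # Index(['2839238'], dtype='object')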

Pandas adding NA changes column from float to object

I have a dataframe with a column that is of type float64. Then when I run the following it converts to object, which ends up messing with a downstream calculation where this is the denominator and there are some 0s:
df.loc[some criteria,'my_column'] = pd.NA
After this statement, my column is of type object. Can someone explain why that is, and how to correct it? I'd rather do it right the first time than add this at the end:
df['my_column'] = df['my_column'].astype(float)
Thanks.
In pandas, you can use either pd.NA or np.nan to represent missing values. np.nan is a float, so assigning it keeps the column as float64, which is what you want and what the documentation describes. pd.NA promotes the column to object dtype and is meant for the nullable extension dtypes (such as 'string' or 'Int64').
df.loc[some criteria,'my_column'] = np.nan
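A quick sketch of the difference (the column and criteria are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'my_column': [1.0, 2.0, 3.0]})
df.loc[df['my_column'] > 2, 'my_column'] = np.nan
print(df['my_column'].dtype)   # float64 -- the column stays numeric

df2 = pd.DataFrame({'my_column': [1.0, 2.0, 3.0]})
df2.loc[df2['my_column'] > 2, 'my_column'] = pd.NA
print(df2['my_column'].dtype)  # object, as observed in the question (newer pandas may coerce differently)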

Change formatting from float to int, in columns of a pandas pivot table

I have a pivot table, unfortunately I am unable to cast the column to an int value due to NaN values, and it represents a year in the data. Is there a way to use a function to manipulate the columns (lambda?) in the creation of the pivot table?
submissions_by_country = df_maa_lu.pivot_table(index=["COUNTRY_DISPLAY_LABEL"], columns=["APPROVAL_YEAR"], values='LU_NUMBER_NO_SUFFIX', aggfunc='nunique', fill_value=0)
@smackenzie,
Is it possible to replace the values and recast? For example, assuming your dataframe is called df:
import numpy as np
df = df.replace(to_replace=np.nan, value=0.0)  # replace NaN first...
df = df.astype(int)                            # ...then the cast to int succeeds
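Applied to the pivot in the question, which already passes fill_value=0, the cast alone may be enough (variable name from the question):
submissions_by_country = submissions_by_country.astype(int)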
If retaining np.nan is important, you can replace it with a unique sentinel value like -999.0, change the datatype, and then replace the sentinel again afterwards.
If you just wanted to update the values in a single column, i.e. a pd.Series instead of the entire dataframe, you could try it like this, using replace rather than map (map would turn every value missing from the dictionary into NaN, and NaN works fine as a replace key):
df['Afghanistan'] = df['Afghanistan'].replace({np.nan: 0.0})
Can you post a sample dataset to work with?
