I tried making a boxplot for 'horsepower', but it shows up as an object type, so I tried converting it to float but it displays that error.
The horsepower column seems to hold some non-numeric values. In that case, I suggest using pandas.to_numeric instead of pandas.Series.astype.
Replace this:
df_T['horsepower'] = df_T['horsepower'].astype(float)
With this:
df_T['horsepower'] = pd.to_numeric(df_T['horsepower'], errors='coerce')
If ‘coerce’, then invalid parsing will be set as NaN.
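As a minimal sketch of the difference between the two calls, assuming a hypothetical frame whose horsepower column uses '?' as a placeholder for missing values (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample: numeric strings plus one non-numeric placeholder.
df_T = pd.DataFrame({"horsepower": ["130", "165", "?", "150"]})

# astype(float) raises ValueError as soon as it meets a value it cannot parse.
try:
    df_T["horsepower"].astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

# to_numeric with errors="coerce" converts what it can and stores NaN elsewhere.
df_T["horsepower"] = pd.to_numeric(df_T["horsepower"], errors="coerce")
```

The resulting column has dtype float64, so a boxplot can be drawn from it; the row that held the placeholder becomes NaN.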
I read a csv file in as a pandas dataframe but now need to use the data to do some calculations.
import pandas as pd
### LOAD DAQ FILE
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(cal_file,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
### DEFINE SIGNAL CHANNELS
ch104 = df.ch104.dropna() #gets rid of None value at the end of the column
print(ch104)
When I print ch104 I get the following.
But I cannot do math on it currently as it is a pandas.Series or a string. The datatype is not correct yet.
The error if I do calculations is this:
can't multiply sequence by non-int of type 'float'
So I tried using .tolist() or even list() on the data, but then ch104 looks like this.
I believe the values are now being written as bytes and then stored as a list of strings.
Does anyone know how I can get around this or fix this issue? It may be because the original file is UTF-16 LE encoded, but I cannot change this and need to use the files as is.
I need the values for simple calculations later on but they would need to be a float or a double for that. What can I do?
You probably get this error because you are trying to do calculations on columns that pandas considers non-numeric. The values (numbers, from your point of view) are for some reason interpreted as strings by pandas.
To fix that, you can change the type of those columns with pandas.to_numeric:
import re

# strip the extra whitespace from every string (object) column
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

# convert every column whose name starts with "ch" or "alarm" to numbers
cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

ch104 = df.ch104.dropna()
# do your calculations on the ch104 Series here
Note that the 'coerce' argument will put NaN instead of every bad value in your columns.
errors : {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
If ‘coerce’, then invalid parsing will be set as NaN.
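The steps above can be sketched on a small stand-in frame. The column values here are invented, since the real file's contents are not shown in the question:

```python
import pandas as pd

# Invented stand-in for the DAQ data: numbers stored as padded strings.
df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "ch104": [" 1.5", "2.25 ", None],
    "alarm104": ["0", "0", "x"],
})

# Pick the channel and alarm columns by name prefix.
cols = [c for c in df.columns if c.startswith(("ch", "alarm"))]

# Strip whitespace, then convert; unparseable entries become NaN.
df[cols] = df[cols].apply(lambda s: pd.to_numeric(s.str.strip(), errors="coerce"))

ch104 = df["ch104"].dropna()
scaled = ch104 * 2.0  # arithmetic now works on the float Series
```

Multiplying the original string values by a float is what produced the "can't multiply sequence by non-int of type 'float'" error; after the conversion the same multiplication is ordinary float arithmetic.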
I have an Excel file where column name might be a number, i.e. 2839238. I am reading it using pd.read_excel(bytes(filedata), engine='openpyxl') and, for some reason, this column name gets converted to a float 2839238.0. How to disable this conversion?
This is an issue for me because I then operate on column names using string-only methods like df = df.loc[:, ~df.columns.str.contains('^Unnamed')], and it gives me the following error:
TypeError: bad operand type for unary ~: 'float'
Column names are arbitrary.
Try changing the type of the column:
df['col'] = df['col'].astype(int)
The number in your example suggests you may have large values. Check the ranges of the integer and float dtypes against your data to see which one you can use.
Verify that you don't have any duplicate column names. Pandas will add .0 or .1 if there is another instance of 2839238 as a header name.
See the description of mangle_dupe_cols (bool), which says:
Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
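Separately from the question of why the label becomes a float, the TypeError itself appears when a string method meets a non-string column label. A minimal sketch of one workaround, assuming it is acceptable to compare the labels as text (the frame below is made up):

```python
import pandas as pd

# Made-up frame with one numeric column label and one "Unnamed" label.
df = pd.DataFrame([[1, 2, 3]], columns=["name", 2839238.0, "Unnamed: 2"])

# Cast the labels to str only for the test, so .str.contains sees text everywhere.
is_unnamed = df.columns.astype(str).str.contains("^Unnamed")

# Keep every column whose label does not start with "Unnamed".
df = df.loc[:, ~is_unnamed]
```

The original labels are left unchanged; only the boolean mask is computed from their string form.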
I am working with a dataset that contains numerical values stored as type object. Below you can see what the data looks like in the pandas.core.frame.DataFrame.
So I want to convert these columns from object type to int64. To do this, I tried this line of code:
X['duration']=X['duration'].astype('float64')
But this line does not work and gives this error message:
could not convert string to float: '18`'
Can anybody help me solve this and convert the column to numerical values?
Try pd.to_numeric:
X['duration'] = X['duration'].str.extract(r'(\d+)', expand=False)
X['duration'] = pd.to_numeric(X['duration'], errors='coerce')
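A small sketch of what these two lines do, using made-up values that include the stray backtick from the error message:

```python
import pandas as pd

# Made-up values, including the stray backtick from the error message.
X = pd.DataFrame({"duration": ["18`", "25", "n/a"]})

# Take the first run of digits; rows without any digits become NaN.
digits = X["duration"].str.extract(r"(\d+)", expand=False)
X["duration"] = pd.to_numeric(digits, errors="coerce")
```

Note that the pattern (\d+) keeps only the first run of digits, so a value such as 18.5 would be reduced to 18; a different pattern would be needed if the column holds decimals.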
I am trying to convert a pandas DataFrame (read in from a .csv file) from string to float. Columns 1 to 20 are recognized as strings by the system, but they hold float values in the format "10,93847722". Therefore, I tried the following code:
new_df = df[df.columns[1:20]].transform(lambda col: col.str.replace(',','.').astype(float))
The last line causes the error:
AttributeError: 'DataFrame' object has no attribute 'transform'
It may be important to know that I can only use pandas version 0.16.2.
Thank you very much for your help!
#all: Short extract from one of the columns
23,13854599
23,24945831
23,16853714
23,0876255
23,05908775
Use DataFrame.apply:
df[df.columns[1:20]].apply(lambda col: col.str.replace(',','.').astype(float))
EDIT: If some non-numeric values are possible, use to_numeric with errors='coerce' to replace those values with NaN:
df[df.columns[1:20]].apply(lambda col: pd.to_numeric(col.str.replace(',','.'),errors='coerce'))
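A self-contained sketch of this approach, assuming a recent pandas release rather than 0.16.2. The frame, the column names and the "bad" entry are made up; the numbers are taken from the extract in the question:

```python
import pandas as pd

# Made-up frame: an id column followed by two decimal-comma string columns.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "a": ["23,13854599", "23,24945831", "bad"],
    "b": ["23,0876255", "23,05908775", "23,16853714"],
})

cols = df.columns[1:]

# Swap the decimal comma for a point, then coerce; "bad" becomes NaN.
# regex=False treats "," as a literal character.
df[cols] = df[cols].apply(
    lambda col: pd.to_numeric(col.str.replace(",", ".", regex=False), errors="coerce")
)
```

Assigning the result back to df[cols] keeps the converted values; the one-liner above only returns a new frame.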
You should load them directly as numbers:
pd.read_csv(..., decimal=',')
This will treat , as the decimal point in every column.
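A minimal sketch of loading such a file directly, assuming the fields are separated by semicolons (the file content and column names below are made up):

```python
import io

import pandas as pd

# Made-up file content: semicolon-separated, with decimal commas.
raw = "id;a;b\n1;23,13854599;23,0876255\n2;23,24945831;23,05908775\n"

# decimal="," tells the parser to read "23,13854599" as 23.13854599.
df = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
```

This only works cleanly when the field separator is something other than a comma; otherwise the parser cannot tell a decimal comma from a column boundary.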
I want to read a dataframe with given datatypes and missing values, but the following code fails. I have no idea why this happens!
from io import StringIO
import pandas as pd

myText = StringIO(r"""1,2
3,\N
5,6""")
myDf = pd.read_csv(myText, header=None, names=["a1","a2"], na_values=[r"\N"], dtype={"a1":"int", "a2":"int"})
I got the error message:
ValueError: Integer column has NA values in column 1
If I remove the dtype option dtype={"a1":"int", "a2":"int"}, then it works fine. Do integer columns not allow missing values?
Integer columns don't allow missing values; float columns do. If you need integers, you'll have to use a sentinel for the missing ones, like 0 or 99999999 (not recommended). Otherwise, use a type like float64 that allows out-of-band values like NaN.
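A minimal sketch of the float64 route, reusing the question's data. The final line additionally assumes pandas 0.24 or later, where a nullable integer dtype named "Int64" exists; that dtype is not part of the answer above.

```python
import io

import pandas as pd

# Same data as in the question; raw strings keep the backslash in \N literal.
text = io.StringIO(r"""1,2
3,\N
5,6""")

# float64 can hold NaN, so the missing entry no longer raises.
df = pd.read_csv(text, header=None, names=["a1", "a2"],
                 na_values=[r"\N"], dtype={"a1": "int64", "a2": "float64"})

# Assumption: pandas 0.24 or later, which provides the nullable "Int64" dtype.
a2_nullable = df["a2"].astype("Int64")
```

Here a1 stays a plain integer column because it has no missing entries, while a2 is read as float64 with NaN in the second row.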