I am working with the measurements.csv file from this dataset:
https://www.kaggle.com/anderas/car-consume/data
It has values like 21,5, but a Python float must be written as 21.5. Therefore Python says: "ValueError: could not convert string to float: '21,5'"
My code is as follows:
# get data ready
data = pd.read_csv('measurements.csv')
data.shape
# split out features and label
X = data.iloc[:, :-5].values
y = data.iloc[:, -4]
# map category to binary
y = np.where(y == 'E10', 1, 0)
enc = OneHotEncoder()
Second question:
I also want to use other columns that contain string values or nulls (empty cells). How should I transform them to fit my input shape?
You can tell read_csv which character is used for the decimal point:
data = pd.read_csv('measurements.csv', decimal=',')
From https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
In read_csv, you can specify the decimal character as
data = pd.read_csv('measurements.csv', decimal=",")
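For the second question, one common approach (a sketch on a toy frame with hypothetical column names, not the actual Kaggle file) is to fill the nulls with a placeholder and one-hot encode the string column, e.g. with pd.get_dummies:

```python
import pandas as pd

# Toy frame standing in for measurements.csv; column names are assumptions
df = pd.DataFrame({"gas_type": ["E10", "SP98", None, "E10"],
                   "consume": [4.5, 5.0, 4.2, 4.8]})

# Fill nulls with a placeholder category, then one-hot encode the strings
df["gas_type"] = df["gas_type"].fillna("unknown")
dummies = pd.get_dummies(df["gas_type"], prefix="gas")

# Replace the string column with its numeric dummy columns
X = pd.concat([df.drop(columns="gas_type"), dummies], axis=1)
```

This leaves X entirely numeric, so it can be fed to an estimator directly.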
I am using the Beers dataset, in which I want to encode the columns with dtype 'object'.
Following is my code.
from sklearn import preprocessing
df3 = BeerDF.select_dtypes(include=['object']).copy()
label_encoder = preprocessing.LabelEncoder()
df3 = df3.apply(label_encoder.fit_transform)
The following error occurs:
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
Any insights are helpful!
Use:
df3 = df3.astype(str).apply(label_encoder.fit_transform)
From the TypeError, it seems that the column you want to transform into labels has two different data types (dtypes), in your case string and float, which raises the error. To avoid this, modify the column to have a uniform dtype (all strings or all floats). For example, in the Iris classification dataset, class = ['Setosa', 'Versicolour', 'Virginica'] and not ['Setosa', 3, 5, 'Versicolour'].
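A minimal reproduction of the situation and the fix, on toy data rather than the actual Beers dataset (the column name is an assumption):

```python
import pandas as pd
from sklearn import preprocessing

# An object column mixing strings and a float, as in the question
df3 = pd.DataFrame({"style": ["IPA", 1.5, "Stout", "IPA"]})

label_encoder = preprocessing.LabelEncoder()

# df3.apply(label_encoder.fit_transform) would raise the TypeError here;
# casting everything to str first gives the encoder a uniform dtype
encoded = df3.astype(str).apply(label_encoder.fit_transform)
```

LabelEncoder sorts the stringified classes ('1.5' < 'IPA' < 'Stout') before assigning integer labels, so the float ends up as its own category rather than causing an error.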
I am going to train an SVM on an X dataframe and a y series I have.
X dataframe is shown below:
x:
Timestamp Location of sensors Pressure or Flow values
0.00000 138.22, 1549.64 28.92
0.08333 138.22, 1549.64 28.94
0.16667 138.22, 1549.64 28.96
In the X dataframe, the locations of the sensors are represented as node coordinates.
Y series is shown below:
y:
0
0
0
But when I fit the SVM to the training set, it raised ValueError: could not convert string to float: '361.51,1100.77', where (361.51, 1100.77) are the coordinates of a node.
Could you please give me some ideas to fix this problem?
Any advice is appreciated.
'361.51,1100.77' is actually two numbers, right? A latitude (361.51) and a longitude (1100.77). You would first need to split it into two strings. Here is one way to do it:
data = pd.DataFrame(data=[[0, "138.22,1549.64", 28.92]], columns=["Timestamp", "coordinate", "flow"])
data["latitude"] = data["coordinate"].apply(lambda x: float(x.split(",")[0]))
data["longitude"] = data["coordinate"].apply(lambda x: float(x.split(",")[1]))
This will give you two new columns in your dataframe each with a float of the values in the string.
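Equivalently, pandas can split and cast the whole column in one vectorized step with str.split(expand=True), which avoids the per-row lambdas:

```python
import pandas as pd

data = pd.DataFrame(data=[[0, "138.22,1549.64", 28.92]],
                    columns=["Timestamp", "coordinate", "flow"])

# Split each string on the comma into two columns, then cast both to float
data[["latitude", "longitude"]] = (
    data["coordinate"].str.split(",", expand=True).astype(float)
)
```

After this you can drop the original "coordinate" column and pass the purely numeric frame to the SVM.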
I'm assuming that you are trying to convert the entire string "361.51,1100.77" into a float, and you can see why that's a problem: Python sees two decimal points and a comma in between, so it has no idea what to do.
Assuming you want the numbers to be separate, you could do something like this:
myStr = "361.51,1100.77"
x = float(myStr[0:myStr.index(",")])
y = float(myStr[myStr.index(",")+1:])
print(x)
print(y)
Which would get you an output of
361.51
1100.77
Assigning x to be myStr[0:myStr.index(",")] takes a substring of the original string from the start up to the first occurrence of a comma, getting you the first number.
Assigning y to be myStr[myStr.index(",")+1:] takes a substring of the original string starting after the first comma and to the end of the string, getting you the second number.
Both are converted to floats with float(), getting you two separate numbers.
Here is a helpful link to understand string slicing: https://www.geeksforgeeks.org/string-slicing-in-python/
I'm using the TensorFlow Datasets API, and I have data with a string column that represents a binary option (something like "yes" or "no").
I'm wondering how to convert it to 1 and 0 (integer values) respectively, while leaving the other columns unchanged.
My skeleton function is:
def mapper(features, target):
    # TODO: map features["str_col"] from "yes" to 1 and "no" to 0
    # TODO: return features with str_col transformed
Can you assist?
You can compare against the string to get a boolean tensor, then cast it to int:
y = tf.equal(features["str_col"], 'yes')
y = tf.cast(y, tf.int32)
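Outside the TensorFlow graph, the same yes/no mapping can be sketched with NumPy for comparison (the column name str_col is an assumption):

```python
import numpy as np

str_col = np.array(["yes", "no", "yes"])

# Elementwise equality gives booleans; casting yields 1/0 integers
y = (str_col == "yes").astype(np.int32)
```

This is the identical compare-then-cast idea the tf.equal/tf.cast pair expresses inside a dataset mapper.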
I have a dataset with 4 variables (Bearing 1 to Bearing 4) and 20,152,319 observations. It looks like this:
Now, I am trying to find the correlation matrix of the 4 variables. The code I use is this:
corr_mat = Data.corr(method = 'pearson')
print(corr_mat)
However, in the result I get the correlation information only for Bearing 2 to Bearing 4; Bearing 1 is nowhere to be seen. I am providing a snapshot of the result below:
I have tried removing NULL values from each of the variables and also looking for missing values, but nothing works. What is interesting is that if I isolate the first two variables (Bearing 1 and Bearing 2) and then try to find the correlation matrix between them, Bearing 1 still does not come up, and the matrix is a 1x1 matrix with only Bearing 2.
Any explanation of why this occurs and how to solve it would be appreciated.
First check whether the column 'Bearing 1' is numeric:
Data.dtypes  # shows the dtype of each column
cols = Data.columns  # save the column names
Data[cols] = Data[cols].apply(pd.to_numeric, errors='coerce')  # convert the columns to numeric (note: the result must be assigned back)
Now apply your Calculations,
corr_mat = Data.corr(method = 'pearson')
print(corr_mat)
The dtype of the first column is object, so pandas omits it by default. The solution is to convert it to numeric:
Data['Bearing 1'] = Data['Bearing 1'].astype(float)
Or, if there are some non-numeric values, use to_numeric with errors='coerce' to parse those values to NaNs:
Data['Bearing 1'] = pd.to_numeric(Data['Bearing 1'], errors='coerce')
If you want to convert all columns to numeric:
Data = Data.astype(float)
Or:
Data = Data.apply(pd.to_numeric, errors='coerce')
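A small sketch of the behaviour on toy data (values are made up): an object-dtype column is silently dropped by corr(), and reappears once it is numeric:

```python
import pandas as pd

# "Bearing 1" holds strings, so its dtype is object, as in the question
Data = pd.DataFrame({"Bearing 1": ["0.1", "0.2", "0.3"],
                     "Bearing 2": [0.2, 0.4, 0.5]})

# Only the numeric column survives: the matrix is 1x1
before = Data.corr(numeric_only=True).shape

# Convert the object column, then correlate again: now 2x2
Data["Bearing 1"] = pd.to_numeric(Data["Bearing 1"], errors="coerce")
after = Data.corr().shape
```

This also explains the 1x1 matrix seen when isolating Bearing 1 and Bearing 2.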
I'm trying to convert a set of cells in a dataframe called "GDP" from floats to integers. The cells I want to convert are selected by GDP.iloc[4, -10:]. The data can be found here: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
GDP = pd.read_csv('world_bank.csv', header=None)
I have tried the following methods:
Method 1:
for x in GDP.iloc[4,-10:]:
    pd.to_numeric(x, downcast='signed')
Method 2:
GDP.iloc[4,-10:] = GDP.iloc[4,-10:].astype(int)
Method 3:
GDP.iloc[4,-10:] = int(GDP.iloc[4,-10:])
However, none of them seems to convert the floats to integers. No errors appear for methods 1 and 2, but method 3 raises:
TypeError: cannot convert the series to <class 'int'>
Can someone help me out? Much appreciated.
You can use astype(np.int64) to convert the column to int:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# df.head()
df = df.fillna('custom_none_values')
# df.head()
df = df[df['1960'] != 'custom_none_values']
df['1960'] = df['1960'].astype(np.int64)
df.head()
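To see why the first two methods appeared to do nothing, here is a sketch on a toy frame (values and column names are made up, not the real World Bank data):

```python
import pandas as pd

# Toy stand-in for the GDP frame: all-float columns
GDP = pd.DataFrame([[1.0, 2.0, 3.0]] * 5, columns=["1990", "1991", "1992"])

# Method 1's mistake: pd.to_numeric returns the converted values, so
# calling it in a loop without keeping the result changes nothing.
row = pd.to_numeric(GDP.iloc[4, :], downcast="signed")  # integer-dtype Series

# Method 2's mistake: writing ints back into float64 columns upcasts
# them to float again. Converting the columns themselves does stick:
GDP = GDP.astype(int)
```

A pandas column holds a single dtype, so individual cells of a float64 column cannot be integers; convert whole columns, or keep the converted row in its own Series as above.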