CreateML won't accept float64? - python

I am trying out Create ML for the first time, using a CSV file. The problem is that the target column in the CSV is a float type, and Create ML only accepts int or string targets. So, I used pandas to convert the column to a string with:
df.pris_yrs = df.pris_yrs.astype(str)
# I have also tried
df.pris_yrs = df.pris_yrs.apply(str)
Checking the dtype of the dataframe returns object, which is how pandas stores strings, but Create ML still gives the same error.
Question: How do I get an object-dtype dataframe column to work as the target in Create ML?

To transform one column of a dataframe to int, I recommend:
df["pris_yrs"] = df["pris_yrs"].astype(int)
For every ML model you should use numerical targets (even if you have a categorical feature, you can transform it easily by label-encoding it).
You probably get the error because your ML model doesn't support a string as a target.
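A minimal sketch of both conversions (the inline dataframe is a stand-in for the question's CSV; this assumes the target has no missing values, since astype(int) fails on NaN):
import pandas as pd

df = pd.DataFrame({"pris_yrs": [1.0, 2.5, 1.0]})  # stand-in for the CSV data

# Cast the float target to int (truncates the decimals).
df["pris_yrs_int"] = df["pris_yrs"].astype(int)

# Or label-encode the values for a classification-style target:
# factorize returns integer codes plus the mapping back to the labels.
codes, labels = pd.factorize(df["pris_yrs"].astype(str))
df["pris_yrs_encoded"] = codes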

Related

Can't show spark dataframe after scoring data - incompatible input types for column x

I have created an XGBoost model in Databricks. I am trying to score the model on production data. The same data-prep code is used before training the model and for scoring.
import mlflow
from pyspark.sql.functions import struct
# Load the registered model as a Spark UDF and apply it to every row.
model_uri = f"models:/{model_name}/1"
predict = mlflow.pyfunc.spark_udf(spark, model_uri, result_type="double")
spark_df = spark.createDataFrame(table)
output_df = spark_df.withColumn("prediction", predict(struct(*spark_df.columns)))
The code runs without giving me any errors, but if I try
output_df.show(20)
I get an error:
mlflow.exceptions.MlflowException: Incompatible input types for column x. Can not safely convert int64 to int32.
This says that the model expects an int but was passed a long. For the offending column: what type is it in spark_df, and what type does the model signature logged with the model expect? If you are sure it is safe, you can cast the column to int before applying the model.
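A minimal sketch of that cast, reusing predict and spark_df from the question and assuming the column really is named x as in the error message:
from pyspark.sql.functions import col, struct

# Narrow the long column to int so it matches the logged model signature.
# Only safe if every value actually fits in 32 bits.
spark_df = spark_df.withColumn("x", col("x").cast("int"))
output_df = spark_df.withColumn("prediction", predict(struct(*spark_df.columns)))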
It seems I had the order of operations in the data prep mixed up: I converted floats to int and replaced missing numerical values before doing the one-hot encoding of a categorical variable. That last variable did not get the proper treatment, since its columns were created after the rest of the data had been cleaned up.
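For reference, a hedged sketch of the corrected ordering (column names are hypothetical): one-hot encode first, so the dummy columns get the same numeric cleanup as everything else.
import pandas as pd

df = pd.DataFrame({"x": [1.0, None, 3.0], "category_col": ["a", "b", None]})
df = pd.get_dummies(df, columns=["category_col"])      # encode first...
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(0).astype("int32")  # ...then impute and cast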

What is a pandas.core.Frame.DataFrame, and how to convert it to pd.DataFrame?

I am currently trying to do a machine learning classification of 6 time series datasets (in .csv format) using MiniRocket, an sktime machine learning package. However, when I import the .csv files using pd.read_csv and run them through MiniRocket, the error "TypeError: X must be in an sktime compatible format" pops up, saying that the following data types are sktime compatible:
['pd.Series', 'pd.DataFrame', 'np.ndarray', 'nested_univ', 'numpy3D', 'pd-multiindex', 'df-list', 'pd_multiindex_hier']
Then I checked the data type of my imported .csv files and got "pandas.core.Frame.DataFrame", a data type I had never seen before and one that is obviously different from the sktime-compatible pd.DataFrame. What is the difference between pandas.core.Frame.DataFrame and pd.DataFrame, and how do I convert pandas.core.Frame.DataFrame to the sktime-compatible pd.DataFrame?
I tried to convert pandas.core.Frame.DataFrame to pd.DataFrame using the df.join and df.pop functions, but neither of them converted my data (after the attempted conversion I checked the type again and it was still the same).
If you just take the values from your old DataFrame with .values, you can create a new DataFrame the standard way. If you want to keep the same columns and index values, just set those when you declare your new DataFrame.
df_new = pd.DataFrame(df_old.values, columns=df_old.columns, index=df_old.index)
Most of the pandas classes are defined under the pandas/core folder: https://github.com/pandas-dev/pandas/tree/main/pandas/core.
For example, the DataFrame class is defined in pandas/core/frame.py:
class DataFrame(NDFrame, OpsMixin):
    ...
    def __init__(...):
        ...
Pandas is not yet a py.typed library (PEP 561), so while the public API documentation says pandas.DataFrame, error messages internally still refer to the source-file structure, such as pandas.core.frame.DataFrame.
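A quick check confirms the two names refer to the same class; one is the public alias, the other the internal module path:
import pandas as pd

print(pd.DataFrame is pd.core.frame.DataFrame)  # True
print(type(pd.DataFrame({"a": [1]})))           # <class 'pandas.core.frame.DataFrame'>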

How could I put a large number in a datatype in Pandas?

I have numbers in the range from 0 to 3.4e+23.
The "maximum" integer dtype in this library is 'int64' (about 9.2e+18).
Help me, please. How can I 'read' that data? I want to train on it with Sklearn, and I can't apply StandardScaler/Normalizer to the data because the numbers are too large!
I change the datatype like this:
df['df'] = df['df'].astype('int64')
Do you know other ways to change the datatype?
Or a way to do something for the whole DataFrame?
Have you tried first declaring a custom type based on Python's built-in arbitrary-precision int?
I haven't tested it, but there is a way to use custom data types in Pandas.
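Not from the answer, but two common workarounds, sketched under the assumption that some precision loss is acceptable: float64 covers magnitudes up to about 1.8e+308 while rounding to roughly 15-16 significant digits, and an object column of Python ints stays exact but has to be cast to float before scikit-learn can use it.
import pandas as pd

df = pd.DataFrame({"big": ["340000000000000000000000", "17"]})

# Option 1: float64 holds the magnitude but rounds the low digits.
df["big_float"] = df["big"].astype("float64")

# Option 2: Python ints are exact, but live in an object column and
# must be converted to float64 before being fed to scikit-learn.
df["big_exact"] = df["big"].apply(int)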

Converting oracle.sql.STRUCT# using python (into geojson or dataframe)

I would like to convert an Oracle DB frame that includes a 'SHAPE' column holding 'oracle.sql.STRUCT#' data into something more accessible, either a GeoJSON/shapefile/dataframe, using Python/R or SQL.
Any ideas?
Create your frame with a query that uses one of the SDO_UTIL functions to convert the shape (an SDO_GEOMETRY type) into a type easily consumed by Python/R, e.g. WKB, WKT, or GeoJSON. For example: SDO_UTIL.TO_WKTGEOMETRY(shape). See the conversion functions reference here: https://docs.oracle.com/en/database/oracle/oracle-database/19/spatl/SDO_UTIL-reference.html
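A sketch of how that might look from Python (connection details, table, and column names are hypothetical; assumes the python-oracledb and shapely packages are installed):
import oracledb                # python-oracledb driver
import pandas as pd
from shapely import wkt        # parses WKT strings into geometry objects

oracledb.defaults.fetch_lobs = False   # return CLOBs as plain strings

conn = oracledb.connect(user="scott", password="tiger", dsn="host/service")

# Let Oracle convert SDO_GEOMETRY to WKT on the way out.
df = pd.read_sql(
    "SELECT id, SDO_UTIL.TO_WKTGEOMETRY(shape) AS shape_wkt FROM my_table",
    conn,
)
df["geometry"] = df["shape_wkt"].apply(wkt.loads)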

Reading Date times from Excel to Python using Pandas

I'm reading an Excel file into Python and then splitting the data into numbers (integers and floats) and everything else. There are numerous columns of different types.
I currently bring in the data with
pd.read_excel
and then split the data up with
DataFrame.select_dtypes("number")
When users upload a time (say 12:30:00) they expect it to be recognized as a time. However, Python currently treats it as dtype object.
If I specify the column with parse_dates then it works; however, since I don't know what the data is in advance, I ideally want this to be done automatically. I've tried setting parse_dates=True, but it doesn't seem to make a difference.
I'm not sure if there is a way to recognize the datetime after the file is uploaded. Again, I would want this done without having to specify the column (so anything that can be converted is).
Many Thanks
If your data contains only one column with dtype object (I assume it is a string), you can do the following:
1) Filter out the column with dtype object:
import pandas as pd
datetime_col = df.select_dtypes(object).iloc[:, 0]
2) Convert it to seconds:
datetime_col_in_seconds = pd.to_timedelta(datetime_col).dt.total_seconds()
Then you can re-append the converted column to your original data and/or do whatever processing you want.
If needed, you can convert it back to a datetime:
datetime_col = pd.to_datetime(datetime_col_in_seconds, unit='s')
If you have more than one column with dtype object you might have to do some more pre-processing, but I guess this is a good way to start tackling your particular case.
This does what I need
for column_name in df.columns:
    try:
        df.loc[:, column_name] = pd.to_timedelta(df.loc[:, column_name].astype(str))
    except ValueError:
        pass
This tries to convert every column to a timedelta. If a column can't be converted, pd.to_timedelta raises a ValueError and the loop moves on to the next column.
After it runs, any columns that could be parsed as timedeltas have been converted.
