Create LabeledPoint from rdd data which has both strings and numbers - PySpark - python

I have lines like this in my data:
0,tcp,http,SF,181,5450,0,0,0.5,normal.
I want to use decision tree algorithm for training. I couldn't create LabeledPoints, so I want to try HashingTF for strings but I couldn't handle it. "normal" is my target label. How can I create a LabeledPoint RDD data to use in pyspark? Also, Label for LabeledPoint requires double, should I just create some double values for labels or should it be hashed?

I come up with the solution.
First of all, Spark's Decision tree classifier has already a parameter for this: categoricalFeaturesInfo. In the pyspark api documentation:
categoricalFeaturesInfo - Map from categorical feature index to number of categories. Any feature not in this map is treated as continuous.
However, before doing this, we first should simply replace the strings to numbers for pypsark to understand them.
Then we create for the above example data categoricalFeaturesInfo as in the definition like this:
categoricalFeaturesInfo = {1:len(feature1), 2:len(feature2), 3:len(feature3), 9:len(labels)}
Simply, first ones are the indexes of the categorical features and the second ones are the number of categories in that feature.
Note that converting strings to numbers is enough for the trainer algorithm but if you declare the categorical features like this, it would train faster.

Related

is there anyway to extract numerical values (or equivalently eliminate the text) of CSV so that the numerical values can be used as a regular feature

how to clean all numerical features and the target variable price so that they can be used in training algorithms. For instance, host_response_rate feature is in object format containing both numerical values and text.
You should elaborate, But from what i understand you want to convert
Object to integer
You could try:
df['column'].astype(str).astype(int)

BatchDataset not subscriptable when trying to format Python dictionary as table

I'm working through the TensorFlow Load pandas.DataFrame tutorial, and I'm trying to modify the output from a code snippet that creates the dictionary slices:
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
print (dict_slice)
I find the following output sloppy, and I want to put it into a more readable table format.
I tried to format the for loop, based on this recommendation
Which gave me the error that the BatchDataset was not subscriptable
Then I tried to use the range and leng function on the dict_slices, so that i would be an integer index and not a slice
Which gave me the following error (as I understand, because the dict_slices is still an array, and each iteration is one vector of the array, not one index of the vector):
Refer here for solution. To summarize we need to use as_numpy_iterator
example = list(dict_slices.as_numpy_iterator())
example[0]['age']
BatchDataset is a tf.data.Dataset instance that has been batches by calling it's .batch(..) method. You cannot "index" a tensorflow Dataset, or call the len function on it. I suggest iterating through it like you did in the first code snippet.
However in your dataset you are using .to_dict('list'), which means that a key in your dictionary is mapped to a list as value. Basically you have "columns" for every key and not rows, is this what you want? This would make printing line-by-line (shown in the table printing example you linked) a lot more difficult, since you do not have different features in a row. Also it is different from the example in the official Tensorflow code, where a datapoint consists of multiple features, and not one feature with multiple values.
Combining the Tensorflow code and pretty printing:
columns = list(df.columns.values)+['target']
dict_slices = tf.data.Dataset.from_tensor_slices((df.values, target.values)).batch(1) # batch = 1 because otherwise you will get multiple dict_slice - target pairs in one iteration below!
print(*columns, sep='\t')
for dict_slice, target in dict_slices.take(1):
print(*dict_slice.numpy(), target.numpy(), sep='\t')
This needs a bit of formatting, because column widths are not equal.

How to handle string data in ML classification

Hello I am a beginner in Machine Learning, I have previously worked with some binary ml tasks where the data was numerical. Now I am facing an issue where I have to find the probability of a particular combination. I can not disclose the dataset or the code at this point. My data is a dataframe of 10 columns. I have to train my model on 8 columns and predict the possibility of the last 2 columns. That is my labels are a combination of the last 2 columns. What I am facing a problem with is, these column values are not numerical. I have tried everything I came across but can't find any suitable means of converting this to numerical values. I have tried LabelEncoder from sklearn,which works with the labels, but throws memory error if I use it again. I have tried to_numeric from pandas, which reads all the values as Nan. The values are in the form '2be74fad-4d4'. Any suggestions would be highly appreciated about how to handle this issue.
To convert categorical data to numerical, you can try these approaches in sklearn:
Label Encoding
Label Binarizer
OneHot Encoding
Now, for your problem, you can use LabelEncoder. But there is a catch. In other sklearn models, you can declare it once and then use it to fit and then transform on a number of columns.
In LabelEncoding, you have to fit_transform the model on one column in train data and then transform the same column in test data. Then the same process for the next categorial column.
You can iterate over a list of categorical columns to make it simple. Consider the snippet below:
cat_cols = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_Combined']
enc = LabelEncoder()
for col in cat_cols:
train[col] = train[col].astype('str')
test[col] = test[col].astype('str')
train[col] = enc.fit_transform(train[col])
test[col] = enc.transform(test[col])
You can create a dictionary with the mapping form a string to integer. An example can be found here: enter link description here. Then you use onehot encoding or just feed the integer to the neural network. If the characters have some meaning you could also do it on a per character base instead of wordbased. But that depends on the task. If this String is a unique identifier of the column or so, just leave it away and don't feed it to your model.

Patsy: New levels in categorical fields in test data

I am trying to use Patsy (with sklearn, pandas) for creating a simple regression model. The R style formula creation is a major draw.
My data contains a field called 'ship_city' which can have any city from India. Since I am partitioning the data into train and test sets, there are several cities which appear only in one of the sets. A code snippet is given below:
df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')
The last line throws the following error:
patsy.PatsyError: Error converting data to categorical: observation
with value 'Kolkata' does not match any of the expected levels
I believe this is a very common use case where training data will not have all levels of all categorical fields. Sklearn's DictVectorizer handles this quite well.
Is there any way I can make this work with Patsy?
The problem of course is that if you just give patsy a raw list of values, it has no way to know that there are other values that could potentially happen as well. You have to somehow tell it what the complete set of possible values is.
One way is by using the levels= argument to C(...), like:
# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))
dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)
Another option if you're using pandas's default categorical support is to record the set of possible values when you set up your data frame; if patsy detects that the object you've passed it is a pandas categorical then it automatically uses the pandas categories attribute instead of trying to guess what the possible categories are by looking at the data.
I ran into a similar problem and I built the design matrices prior to splitting the data.
df_Y, df_X = dmatrices(formula, data=df, return_type='dataframe')
df_train_X, df_test_X, df_train_Y, df_test_Y = \
train_test_split(df_X, df_Y, test_size=test_size)
Then as an example of applying a fit:
model = smf.OLS(df_train_Y, df_train_X)
model2 = model.fit()
predicted = model2.predict(df_test_X)
Technically I haven't built a test case, but I haven't run into the Error converting data to categorical error again since implementing the above.

geoJSON: How to create FeatureCollection with dynamic number of Features?

I have a list of of Features (all Points) in a list in Python. The Features are dynamic stemming from database data which is updated on a 30 minutes interval.
Hence I never have a static number of features.
I need to generate a Feature Collection with all Features in my list.
However (as far as I know) the syntax for creating a FeatureCollection wants you to pass it all the features.
ie:
FeatureClct = FeatureCollection(feature1, feature2, feature3)
How does one generate a FeatureCollection without knowing how many features there will be beforehand? Is there a way to append Features to an existing FeatureCollection?
According to the documentation of python-geojson (which i guess you are using, you didn't mention it) you can also pass a list to FeatureCollection, just put all the results into a list and you're good to go:
feature1 = Point((45, 45));
feature2 = Point((-45, -45));
features = [feature1, feature2];
collection = FeatureCollection(features);
https://github.com/frewsxcv/python-geojson#featurecollection

Categories