PySpark replace() function does not replace integer with NULL value - python

Notice: this is for Spark version 2.1.1.2.6.1.0-129
I have a spark dataframe (Python). I would like to replace all instances of 0 across the entirety of the dataframe (without specifying particular column names), with NULL values.
The following is the code that I have written:
my_df = my_df.na.replace(0, None)
The following is the error that I receive:
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1634, in replace
return self.df.replace(to_replace, value, subset)
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1323, in replace
raise ValueError("value should be a float, int, long, string, list, or tuple")
ValueError: value should be a float, int, long, string, list, or tuple

Apparently, in Spark 2.1.1 df.na.replace does not support None as the replacement value. That option is only available from 2.3.0 onward, so it does not apply in your case.
To replace values dynamically (i.e., without typing the column names manually), you can use either df.columns or df.dtypes. The latter also lets you check each column's data type.
from pyspark.sql import functions as F
for c in df.dtypes:
    if c[1] == 'bigint':
        df = df.withColumn(c[0], F.when(F.col(c[0]) == 0, F.lit(None)).otherwise(F.col(c[0])))
# Input
# +---+---+
# | id|val|
# +---+---+
# | 0| a|
# | 1| b|
# | 2| c|
# +---+---+
# Output
# +----+---+
# | id|val|
# +----+---+
# |null| a|
# | 1| b|
# | 2| c|
# +----+---+
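The loop above only touches columns whose type is 'bigint'. As a minimal plain-Python sketch (no Spark needed), the column selection and the zero-to-null rule can be illustrated as follows. The helper names integer_columns and replace_zero_with_none are hypothetical, and the set of type names assumes the strings that df.dtypes reports for Spark's integer types.

```python
# Type names as they appear in a df.dtypes-style list, e.g. [("id", "bigint"), ("val", "string")].
# Assumption: these four strings are the ones Spark reports for its integer types.
INTEGER_TYPES = {"tinyint", "smallint", "int", "bigint"}


def integer_columns(dtypes):
    """Return the names of the integer-typed columns from a df.dtypes-style list."""
    return [name for name, type_name in dtypes if type_name in INTEGER_TYPES]


def replace_zero_with_none(rows, columns):
    """Plain-Python analogue of when(col == 0, None).otherwise(col) on the given columns."""
    return [
        {key: (None if key in columns and value == 0 else value) for key, value in row.items()}
        for row in rows
    ]


dtypes = [("id", "bigint"), ("val", "string")]
rows = [{"id": 0, "val": "a"}, {"id": 1, "val": "b"}, {"id": 2, "val": "c"}]
cleaned = replace_zero_with_none(rows, integer_columns(dtypes))
print(cleaned)
```

In the Spark loop, the same idea would amount to testing c[1] against a set of type names instead of the single string 'bigint'.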

Related

Translating a SAS Ranking with Tie set to HIGH into PySpark

I'm trying to replicate the following SAS code in PySpark:
PROC RANK DATA = aud_baskets OUT = aud_baskets_ranks GROUPS=10 TIES=HIGH;
BY customer_id;
VAR expenditure;
RANKS basket_rank;
RUN;
The idea is to rank all expenditures under each customer_id block. The data would look like this:
+-----------+--------------+-----------+
|customer_id|transaction_id|expenditure|
+-----------+--------------+-----------+
| A| 1| 34|
| A| 2| 90|
| B| 1| 89|
| A| 3| 6|
| B| 2| 8|
| B| 3| 7|
| C| 1| 96|
| C| 2| 9|
+-----------+--------------+-----------+
In PySpark, I tried this:
spendWindow = Window.partitionBy('customer_id').orderBy(col('expenditure').asc())
aud_baskets = (aud_baskets_ranks.withColumn('basket_rank', ntile(10).over(spendWindow)))
The problem is that, as far as I know, PySpark doesn't let the user change how ties are handled, the way SAS does. I need to set this behavior in PySpark so that values are moved up to the next tier each time one of those edge cases occurs, as opposed to being dropped to the rank below.
Or is there a way to custom write this approach?
Use dense_rank: it gives tied rows the same rank, and the next rank is not skipped.
The ntile function splits the records in each partition into n parts, which is 10 in your case.
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank
spendWindow = Window.partitionBy('customer_id').orderBy(col('expenditure').asc())
aud_baskets = aud_baskets_ranks.withColumn('basket_rank', dense_rank().over(spendWindow))
Try the following code. It was generated by an automated tool called SPROCKET, and it should take care of ties.
from pyspark.sql import Window
from pyspark.sql.functions import asc, expr, rank
df = (aud_baskets)
for (colToRank,rankedName) in zip(['expenditure'],['basket_rank']):
    wA = Window.orderBy(asc(colToRank))
    df_w_rank = (df.withColumn('raw_rank', rank().over(wA)))
    ties = df_w_rank.groupBy('raw_rank').count().filter("""count > 1""")
    df_w_rank = (df_w_rank.join(ties,['raw_rank'],'left').withColumn(rankedName,expr("""case when count is not null
        then (raw_rank + count - 1) else
        raw_rank end""")))
    rankedNameGroup = rankedName
    n = df_w_rank.count()
    df_with_rank_groups = (df_w_rank.withColumn(rankedNameGroup,expr("""FLOOR({rankedName}
        *{k}/({n}+1))""".format(k=10, n=n,
        rankedName=rankedName))))
    df = df_with_rank_groups
aud_baskets_ranks = df_with_rank_groups.drop('raw_rank', 'count')
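The tie handling in the generated code may be easier to follow outside Spark. The plain-Python sketch below assumes the same two steps as the code above: give every tied value the highest rank within its tie group, then bucket with FLOOR(rank * k / (n + 1)). The function names are hypothetical, and the sketch ranks a single list, so to mimic the SAS BY statement it would have to be applied once per customer_id.

```python
from collections import Counter


def rank_ties_high(values):
    """Return, for each value, the largest 1-based ascending rank among its ties."""
    counts = Counter(values)
    high_rank = {}
    running = 0
    for value in sorted(counts):
        running += counts[value]      # rank of the last element in this tie group
        high_rank[value] = running
    return [high_rank[value] for value in values]


def rank_groups(values, groups=10):
    """Bucket the tie-adjusted ranks with FLOOR(rank * groups / (n + 1))."""
    n = len(values)
    return [rank * groups // (n + 1) for rank in rank_ties_high(values)]


expenditures = [34, 90, 6, 6]
print(rank_ties_high(expenditures))
print(rank_groups(expenditures, groups=10))
```

Here the two tied values of 6 both receive rank 2 rather than rank 1, which corresponds to raw_rank + count - 1 in the Spark version.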

how to identify people's relationship based on name, address and then assign a same ID through linux command or Pyspark

I have one csv file.
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address
2,66M,J,Rock,F,1995,201211.0,J
3,David,HM,Lee,M,1991,201211.0,J
6,66M,,Rock,F,1990,201211.0,J
0,David,H M,Lee,M,1990,201211.0,B
3,Marc,H,Robert,M,2000,201211.0,C
6,Marc,M,Robert,M,1988,201211.0,C
6,Marc,MS,Robert,M,2000,201211.0,D
I want to assign the same ID (or index) to people who share a last name and live at the same address. Ideally the ID consists only of digits.
If people at the same address have different last names, their IDs should be different.
The IDs must be unique: people who differ in either address or last name must get different IDs.
My expected output is
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address,ID
2,66M,J,Rock,F,1995,201211.0,J,11
3,David,HM,Lee,M,1991,201211.0,J,12
6,66M,,Rock,F,1990,201211.0,J,11
0,David,H M,Lee,M,1990,201211.0,B,13
3,Marc,H,Robert,M,2000,201211.0,C,14
6,Marc,M,Robert,M,1988,201211.0,C,14
6,Marc,MS,Robert,M,2000,201211.0,D,15
My data file is around 30 GB. I am thinking of using Spark's groupBy with a key consisting of LNAME and Address to group those observations together, and then assigning an ID per key, but I don't know how to do this. After that, maybe I can use flatMap to split the groups and return the observations with their ID, but I am not sure about it. Can this also be done with Linux command-line tools? Thank you.
As discussed in the comments, the basic idea is to partition the data properly so that records with the same LNAME+Address stay in the same partition, run Python code to generate separate idx on each partition and then merge them into the final id.
Note: I added some new rows in your sample records, see the result of df_new.show() shown below.
from pyspark.sql import Window, Row
from pyspark.sql.functions import coalesce, sum as fsum, col, max as fmax, lit, broadcast
# ...skip code to initialize the dataframe
# tweak the number of repartitioning N based on actual data size
N = 5
# Python function to iterate through the sorted list of elements in the same
# partition and assign an in-partition idx based on Address and LNAME.
def func(partition_id, it):
    idx, lname, address = (1, None, None)
    for row in sorted(it, key=lambda x: (x.LNAME, x.Address)):
        if lname and (row.LNAME != lname or row.Address != address): idx += 1
        yield Row(partition_id=partition_id, idx=idx, **row.asDict())
        lname = row.LNAME
        address = row.Address
# Repartition based on 'LNAME' and 'Address' and then run mapPartitionsWithIndex()
# function to create in-partition idx. Adjust N so that records in each partition
# should be small enough to be loaded into the executor memory:
df1 = df.repartition(N, 'LNAME', 'Address') \
    .rdd.mapPartitionsWithIndex(func) \
    .toDF()
Get the number of unique keys per partition (based on Address+LNAME) as cnt, which is the maximum idx in that partition, and then compute the running sum of cnt over the preceding partitions as rcnt.
# idx: calculated in-partition id
# cnt: number of unique ids in the same partition: fmax('idx')
# rcnt: starting_id for a partition(something like a running count): coalesce(fsum('cnt').over(w1),lit(0))
# w1: WindowSpec to calculate the above rcnt
w1 = Window.partitionBy().orderBy('partition_id').rowsBetween(Window.unboundedPreceding,-1)
df2 = df1.groupby('partition_id') \
    .agg(fmax('idx').alias('cnt')) \
    .withColumn('rcnt', coalesce(fsum('cnt').over(w1),lit(0)))
df2.show()
+------------+---+----+
|partition_id|cnt|rcnt|
+------------+---+----+
| 0| 3| 0|
| 1| 1| 3|
| 2| 1| 4|
| 4| 1| 5|
+------------+---+----+
Join df1 with df2 and create the final id which is idx + rcnt
df_new = df1.join(broadcast(df2), on=['partition_id']).withColumn('id', col('idx')+col('rcnt'))
df_new.show()
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
#|partition_id|Address| D| DOB|FNAME|GENDER| LNAME|MNAME|idx|snapshot|cnt|rcnt| id|
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
#| 0| B| 0|1990|David| M| Lee| H M| 1|201211.0| 3| 0| 1|
#| 0| J| 3|1991|David| M| Lee| HM| 2|201211.0| 3| 0| 2|
#| 0| D| 6|2000| Marc| M|Robert| MS| 3|201211.0| 3| 0| 3|
#| 1| C| 3|2000| Marc| M|Robert| H| 1|201211.0| 1| 3| 4|
#| 1| C| 6|1988| Marc| M|Robert| M| 1|201211.0| 1| 3| 4|
#| 2| J| 6|1991| 66M| F| Rek| null| 1|201211.0| 1| 4| 5|
#| 2| J| 6|1992| 66M| F| Rek| null| 1|201211.0| 1| 4| 5|
#| 4| J| 2|1995| 66M| F| Rock| J| 1|201211.0| 1| 5| 6|
#| 4| J| 6|1990| 66M| F| Rock| null| 1|201211.0| 1| 5| 6|
#| 4| J| 6|1990| 66M| F| Rock| null| 1|201211.0| 1| 5| 6|
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
df_new = df_new.drop('partition_id', 'idx', 'rcnt', 'cnt')
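The offset arithmetic behind rcnt and the final id can be sketched in plain Python. This is only an illustration of the running-sum step, using the cnt values from the df2.show() output above; partition_offsets and global_id are hypothetical names.

```python
from itertools import accumulate


def partition_offsets(cnt_by_partition):
    """Map each partition_id to the number of distinct keys in all earlier partitions."""
    partition_ids = sorted(cnt_by_partition)
    counts = [cnt_by_partition[p] for p in partition_ids]
    # Exclusive running sum: the first partition starts at 0.
    starts = [0] + list(accumulate(counts))[:-1]
    return dict(zip(partition_ids, starts))


def global_id(partition_id, idx, offsets):
    """Final id = in-partition idx + starting offset (rcnt) of that partition."""
    return idx + offsets[partition_id]


# cnt values taken from the df2.show() output above
offsets = partition_offsets({0: 3, 1: 1, 2: 1, 4: 1})
print(offsets)
print(global_id(4, 1, offsets))
```

Because every partition starts where the previous ones ended, ids from different partitions cannot overlap, and because repartitioning keeps equal (LNAME, Address) pairs in one partition, equal keys always share an id.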
Some notes:
In practice, you will need to clean up and normalize the LNAME and Address columns before using them for the uniqueness check. For example, create a separate column uniq_key that combines LNAME and Address into the unique key of the dataframe. Below is an example with some basic data cleansing steps:
from pyspark.sql.functions import coalesce, lit, concat_ws, upper, regexp_replace, trim
#(1) convert NULL to '': coalesce(col, '')
#(2) concatenate LNAME and Address using NULL char '\x00' or '\0'
#(3) convert to uppercase: upper(text)
#(4) remove all non-[word/whitespace/NULL_char]: regexp_replace(text, r'[^\x00\w\s]', '')
#(5) convert consecutive whitespaces to a SPACE: regexp_replace(text, r'\s+', ' ')
#(6) trim leading/trailing spaces: trim(text)
df = (df.withColumn('uniq_key',
    trim(
        regexp_replace(
            regexp_replace(
                upper(
                    concat_ws('\0', coalesce('LNAME', lit('')), coalesce('Address', lit('')))
                ),
                r'[^\x00\s\w]+',
                ''
            ),
            r'\s+',
            ' '
        )
    )
))
Then, in the code above, replace 'LNAME' and 'Address' with 'uniq_key' when computing idx.
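To try out the cleansing steps (1) to (6) on a few strings before running them on 30 GB, a plain-Python approximation may help. This is a sketch under assumptions: the function name uniq_key is only borrowed from the column name above, and re.ASCII is used on the assumption that Spark's regex engine treats \w and \s as ASCII-only by default.

```python
import re


def uniq_key(lname, address):
    """Build a normalized LNAME + Address key, mirroring steps (1) to (6) above."""
    text = "\x00".join([lname or "", address or ""])          # (1) NULL -> '', (2) join with NUL char
    text = text.upper()                                        # (3) uppercase
    text = re.sub(r"[^\x00\s\w]+", "", text, flags=re.ASCII)   # (4) drop punctuation and symbols
    text = re.sub(r"\s+", " ", text, flags=re.ASCII)           # (5) collapse whitespace runs
    return text.strip(" ")                                     # (6) trim leading/trailing spaces


print(repr(uniq_key("O'Neil", "12  Main St.")))
print(uniq_key(" rock", "j ") == uniq_key("Rock", "J"))
```

Note that the trim is applied after concatenation, so spaces directly next to the NUL separator survive; trimming each column before joining would be a stricter variant.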
As mentioned by cronoik in the comments, you can also try one of the Window rank functions to calculate the in-partition idx. For example:
from pyspark.sql.functions import spark_partition_id, dense_rank
# use dense_rank to calculate the in-partition idx
w2 = Window.partitionBy('partition_id').orderBy('LNAME', 'Address')
df1 = df.repartition(N, 'LNAME', 'Address') \
    .withColumn('partition_id', spark_partition_id()) \
    .withColumn('idx', dense_rank().over(w2))
After you have df1, use the same method as above to calculate df2 and df_new. This should be faster than using mapPartitionsWithIndex(), which is basically an RDD-based method.
For your real data, adjust N to fit the actual data size. N only influences the initial partitioning; after the dataframe join, the number of partitions is reset to the default (200). You can adjust this with spark.sql.shuffle.partitions, for example when you initialize the Spark session:
spark = SparkSession.builder \
    ....
    .config("spark.sql.shuffle.partitions", 500) \
    .getOrCreate()
Since you have 30GB of input data, you probably don't want something that'll attempt to hold it all in in-memory data structures. Let's use disk space instead.
Here's one approach that loads all your data into a sqlite database, and generates an id for each unique last name and address pair, and then joins everything back up together:
#!/bin/sh
csv="$1"
# Use an on-disk database instead of in-memory because source data is 30gb.
# This will take a while to run.
db=$(mktemp -p .)
sqlite3 -batch -csv -header "${db}" <<EOF
.import "${csv}" people
CREATE TABLE ids(id INTEGER PRIMARY KEY, lname, address, UNIQUE(lname, address));
INSERT OR IGNORE INTO ids(lname, address) SELECT lname, address FROM people;
SELECT p.*, i.id AS ID
FROM people AS p
JOIN ids AS i ON (p.lname, p.address) = (i.lname, i.address)
ORDER BY p.rowid;
EOF
rm -f "${db}"
Example:
$ ./makeids.sh data.csv
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address,ID
2,66M,J,Rock,F,1995,201211.0,J,1
3,David,HM,Lee,M,1991,201211.0,J,2
6,66M,"",Rock,F,1990,201211.0,J,1
0,David,"H M",Lee,M,1990,201211.0,B,3
3,Marc,H,Robert,M,2000,201211.0,C,4
6,Marc,M,Robert,M,1988,201211.0,C,4
6,Marc,MS,Robert,M,2000,201211.0,D,5
It's better that ID is made up of only numbers.
If that restriction can be relaxed, you can do it in a single pass by using a cryptographic hash of the last name and address as the ID:
$ perl -MDigest::SHA=sha1_hex -F, -lane '
  BEGIN { $" = $, = "," }
  if ($. == 1) { print @F, "ID" }
  else { print @F, sha1_hex("@F[3,7]") }' data.csv
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address,ID
2,66M,J,Rock,F,1995,201211.0,J,5c99211a841bd2b4c9cdcf72d7e95e46b2ae08b5
3,David,HM,Lee,M,1991,201211.0,J,c263f9d1feb4dc789de17a8aab8f2808aea2876a
6,66M,,Rock,F,1990,201211.0,J,5c99211a841bd2b4c9cdcf72d7e95e46b2ae08b5
0,David,H M,Lee,M,1990,201211.0,B,e86e81ab2715a8202e41b92ad979ca3a67743421
3,Marc,H,Robert,M,2000,201211.0,C,363ed8175fdf441ed59ac19cea3c37b6ce9df152
6,Marc,M,Robert,M,1988,201211.0,C,363ed8175fdf441ed59ac19cea3c37b6ce9df152
6,Marc,MS,Robert,M,2000,201211.0,D,cf5135dc402efe16cd170191b03b690d58ea5189
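For readers more comfortable with Python than Perl, the same hashing idea can be sketched with the standard hashlib module. The helper name hashed_id is hypothetical, and the sketch assumes the key is the last name and address joined by a comma, as in the one-liner above; no claim is made here that the digests match the listing byte for byte.

```python
import hashlib


def hashed_id(lname, address):
    """SHA-1 hex digest of "LNAME,Address", used as a non-numeric ID."""
    key = f"{lname},{address}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


same_a = hashed_id("Rock", "J")
same_b = hashed_id("Rock", "J")
other = hashed_id("Lee", "J")
print(same_a == same_b, same_a != other, len(same_a))
```

One caveat of joining with a comma is that a field which itself contains a comma could make two different pairs produce the same key string; a separator that cannot occur in the data avoids that.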
Or if the number of unique lname, address pairs is small enough that they can reasonably be stored in a hash table on your system:
#!/usr/bin/gawk -f
BEGIN {
    FS = OFS = ","
}
NR == 1 {
    print $0, "ID"
    next
}
!(($4, $8) in ids) {
    ids[$4, $8] = ++counter
}
{
    print $0, ids[$4, $8]
}
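The hash-table approach of the gawk script can also be sketched in Python. This is a minimal illustration that, like the script, assumes the set of distinct (LNAME, Address) pairs fits in memory; add_ids is a hypothetical name, and the sample lines are the ones from the question.

```python
import csv


def add_ids(lines):
    """Append a numeric ID column; equal (LNAME, Address) pairs share an ID."""
    reader = csv.reader(lines)
    header = next(reader)
    lname_col, addr_col = header.index("LNAME"), header.index("Address")
    ids = {}
    out = [header + ["ID"]]
    for row in reader:
        key = (row[lname_col], row[addr_col])
        # setdefault stores the next counter value only when the key is new
        row_id = ids.setdefault(key, len(ids) + 1)
        out.append(row + [str(row_id)])
    return out


sample = [
    "D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address",
    "2,66M,J,Rock,F,1995,201211.0,J",
    "3,David,HM,Lee,M,1991,201211.0,J",
    "6,66M,,Rock,F,1990,201211.0,J",
    "0,David,H M,Lee,M,1990,201211.0,B",
    "3,Marc,H,Robert,M,2000,201211.0,C",
    "6,Marc,M,Robert,M,1988,201211.0,C",
    "6,Marc,MS,Robert,M,2000,201211.0,D",
]
result = add_ids(sample)
for row in result:
    print(",".join(row))
```

Since csv.reader accepts any iterable of lines, an open file object could be passed instead of the list, so the rows themselves never need to be held in memory at once if the output is written as it is produced.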
$ sort -t, -k8,8 -k4,4 <<EOD | awk -F, ' $8","$4 != last { ++id; last = $8","$4 }
{ NR!=1 && $9=id; print }' id=9 OFS=,
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address
2,66M,J,Rock,F,1995,201211.0,J
3,David,HM,Lee,M,1991,201211.0,J
6,66M,,Rock,F,1990,201211.0,J
0,David,H M,Lee,M,1990,201211.0,B
3,Marc,H,Robert,M,2000,201211.0,C
6,Marc,M,Robert,M,1988,201211.0,C
6,Marc,MS,Robert,M,2000,201211.0,D
> EOD
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address
0,David,H M,Lee,M,1990,201211.0,B,11
3,Marc,H,Robert,M,2000,201211.0,C,12
6,Marc,M,Robert,M,1988,201211.0,C,12
6,Marc,MS,Robert,M,2000,201211.0,D,13
3,David,HM,Lee,M,1991,201211.0,J,14
2,66M,J,Rock,F,1995,201211.0,J,15
6,66M,,Rock,F,1990,201211.0,J,15
$

pyspark `substr' without length

Is there a way, in pyspark, to perform the substr function on a DataFrame column, without specifying the length? Namely, something like df["my-col"].substr(begin).
I am not sure why this function is not exposed as an API in the pyspark.sql.functions module.
Spark SQL supports calling the substring function without the len argument of substring(str, pos, len).
You can use it through the expr API of the functions module, as below, to achieve the same:
df.withColumn('substr_name', f.expr("substring(name, 2)")).show()
+----------+---+-----------+
| name| id|substr_name|
+----------+---+-----------+
|Alex Shtof| 1| lex Shtof|
| SMaZ| 2| MaZ|
+----------+---+-----------+
How Spark does it internally:
If you look at the physical plan of the statement above, you will notice that when len is not passed, Spark automatically adds 2147483647.
As @pault said in a comment, 2147483647 is the maximum positive value for a 32-bit signed binary integer (2^31 - 1).
df.withColumn('substr_name', f.expr("substring(name, 2)")).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 2147483647) AS substr_name#169]
+- Scan ExistingRDD[name#140,id#141L] --> 2147483647 is automatically added
The substring API in the functions module expects us to pass the length explicitly. If you want, you can pass any number for len that is large enough to cover the maximum length of your column.
df.withColumn('substr_name', f.substring('name', 2, 100)).show()
+----------+---+-----------+
| name| id|substr_name|
+----------+---+-----------+
|Alex Shtof| 1| lex Shtof|
| SMaZ| 2| MaZ|
+----------+---+-----------+
>>> df.withColumn('substr_name', f.substring('name', 2, 100)).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 100) AS substr_name#189]
+- Scan ExistingRDD[name#140,id#141L] --> 100 is what we passed
If the objective is to make a substring from a position given by a parameter begin to the end of the string, then you can do it as follows:
import pyspark.sql.functions as f
l = [(1, 'Prague'), (2, 'New York')]
df = spark.createDataFrame(l, ['id', 'city'])
begin = 2
l = (f.length('city') - f.lit(begin) + 1)
(
    df
    .withColumn('substr', f.col('city').substr(f.lit(begin), l))
).show()
+---+--------+-------+
| id| city| substr|
+---+--------+-------+
| 1| Prague| rague|
| 2|New York|ew York|
+---+--------+-------+
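The length arithmetic used above (length minus begin plus one) is easy to check in plain Python. The sketch below mimics a 1-based substr(pos, len) on an ordinary string; substr_from is a hypothetical helper, not part of PySpark.

```python
def substr_from(text, begin):
    """Return text from 1-based position begin to the end, via an explicit length."""
    length = len(text) - begin + 1      # number of characters from begin onward
    start = begin - 1                   # convert the 1-based position to a 0-based index
    return text[start:start + length]


print(substr_from("Prague", 2))
print(substr_from("New York", 2))
```

Dropping the "+ 1" would cut off the last character, which is the usual off-by-one when translating between 1-based positions and lengths.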
I'd create udf.
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import StringType
>>> df = spark.createDataFrame([('Alice', 23), ('Brian', 25)], schema=["name", "age"])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 23|
|Brian| 25|
+-----+---+
>>> @F.udf(returnType=StringType())
... def substr_udf(col):
...     return str(col)[2:]
>>> df = df.withColumn('substr', substr_udf('name'))
>>> df.show()
+-----+---+------+
| name|age|substr|
+-----+---+------+
|Alice| 23| ice|
|Brian| 25| ian|
+-----+---+------+
No, we need to specify both parameters, pos and len.
But make sure that both are of the same type (both int or both Column); otherwise you will get an error.
Error: Column not iterable.
You can do it this way:
# length - 4 equals length - 5 + 1, i.e. every character from position 5 to the end
df = df.withColumn("new", F.col("previous").substr(F.lit(5), F.length("previous") - 4))

Adding column to dataframe and updating in pyspark

I have a dataframe in pyspark:
ratings = spark.createDataFrame(
    sc.textFile("transactions.json").map(lambda l: json.loads(l)),
)
ratings.show()
+--------+-------------------+------------+----------+-------------+-------+
|click_id| created_at| ip|product_id|product_price|user_id|
+--------+-------------------+------------+----------+-------------+-------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3|
+--------+-------------------+------------+----------+-------------+-------+
ratings.registerTempTable("transactions")
final_df = sqlContext.sql("select * from transactions");
I want to add a new column to this data frame called status and then update the status column based on created_at and user_id.
The created_at and user_id values are read from the given table transactions and passed to a function get_status(user_id, created_at), which returns the status. This status needs to be put into the transactions table as a new column for the corresponding user_id and created_at.
Can I run alter and update commands in pyspark?
How can this be done using pyspark?
It's not clear what exactly you want to do. You should check out window functions; they allow you to compare, sum, etc. rows within a frame.
For instance
import pyspark.sql.functions as psf
from pyspark.sql import Window
w = Window.partitionBy("user_id").orderBy(psf.desc("created_at"))
ratings.withColumn(
    "status",
    psf.when(psf.row_number().over(w) == 1, "active").otherwise("inactive")).sort("click_id").show()
+--------+-------------------+------------+----------+-------------+-------+--------+
|click_id| created_at| ip|product_id|product_price|user_id| status|
+--------+-------------------+------------+----------+-------------+-------+--------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|inactive|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|inactive|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1| active|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|inactive|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|inactive|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2| active|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|inactive|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|inactive|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3| active|
+--------+-------------------+------------+----------+-------------+-------+--------+
This marks each user's most recent click as active.
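What the window expression computes can be sketched without Spark. The plain-Python function below is a hypothetical stand-in for the row_number() logic above: it finds the latest created_at per user_id and labels that row "active". Unlike row_number(), it would label every row that ties for the latest timestamp as active.

```python
def label_latest_click(rows):
    """Mark each user's most recent row as 'active' and all others as 'inactive'."""
    latest = {}
    for row in rows:
        user = row["user_id"]
        # 'YYYY-MM-DD HH:MM:SS' strings sort chronologically as plain text
        if user not in latest or row["created_at"] > latest[user]:
            latest[user] = row["created_at"]
    return [
        dict(row, status="active" if row["created_at"] == latest[row["user_id"]] else "inactive")
        for row in rows
    ]


clicks = [
    {"click_id": 123, "created_at": "2016-10-03 12:50:33", "user_id": 1},
    {"click_id": 124, "created_at": "2017-02-03 11:51:33", "user_id": 1},
    {"click_id": 125, "created_at": "2017-10-03 12:52:33", "user_id": 1},
    {"click_id": 129, "created_at": "2017-08-03 14:50:33", "user_id": 3},
    {"click_id": 131, "created_at": "2017-10-03 14:52:33", "user_id": 3},
]
labeled = label_latest_click(clicks)
for row in labeled:
    print(row["click_id"], row["status"])
```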
If you want to use a UDF to create a new column from two existing ones:
Say you have a function that takes user_id and created_at as arguments:
from pyspark.sql.types import *
def get_status(user_id,created_at):
    ...
# StringType() or whichever datatype your function outputs
get_status_udf = psf.udf(get_status, StringType())
ratings.withColumn("status", get_status_udf("user_id", "created_at"))

pyspark: how do you convert a column from a string to a categorical variable? [duplicate]

How do I handle categorical data with spark-ml and not spark-mllib ?
Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.
However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.
How should I proceed?
I just wanted to complete Holden's answer.
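Since Spark 2.3.0, OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. (In Spark 3.0.0, OneHotEncoderEstimator was itself renamed to OneHotEncoder, so on 3.x import OneHotEncoder.)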
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")
val indexer = new StringIndexer().setInputCol("category1").setOutputCol("category1Index")
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol, "category2"))
  .setOutputCols(Array("category1Vec", "category2Vec"))
val pipeline = new Pipeline().setStages(Array(indexer, encoder))
pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
// | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+
In Python:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator
df = spark.createDataFrame([(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)], ["id", "category1", "category2"])
indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
# | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
Since Spark 1.4.0, MLlib also supplies a OneHotEncoder feature, which maps a column of label indices to a column of binary vectors with at most a single one-value.
This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features
Let's consider the following DataFrame:
val df = Seq((0, "a"),(1, "b"),(2, "c"),(3, "a"),(4, "a"),(5, "c"))
  .toDF("id", "category")
The first step would be to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| a| 0.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
You can then encode the categoryIndex with OneHotEncoder :
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show
// +---+-------------+
// | id| categoryVec|
// +---+-------------+
// | 0|(2,[0],[1.0])|
// | 1| (2,[],[])|
// | 2|(2,[1],[1.0])|
// | 3|(2,[0],[1.0])|
// | 4|(2,[0],[1.0])|
// | 5|(2,[1],[1.0])|
// +---+-------------+
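To see why "b" ends up as the empty vector (2,[],[]) in the output above, the two steps can be imitated in plain Python. This is a rough sketch, not Spark's implementation: it assumes labels are indexed by descending frequency, that the last category is dropped from the one-hot vector, and that frequency ties are broken alphabetically (that tie-break is an assumption of the sketch). The function names are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Assign index 0 to the most frequent label, 1 to the next, and so on."""
    counts = Counter(values)
    # Alphabetical tie-break is an assumption made for this sketch only.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the final category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if position == index else 0.0 for position in range(size)]


categories = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_index(categories)
print(mapping)
for label in categories:
    print(label, one_hot(mapping[label], len(mapping)))
```

With three categories and the last one dropped, the vectors have length 2, so the least frequent label "b" (index 2) has no position of its own and is encoded as all zeros.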
I am going to provide an answer from another perspective, since I was also wondering about categorical features with regards to tree-based models in Spark ML (not MLlib), and the documentation is not that clear how everything works.
When you transform a column in your dataframe using pyspark.ml.feature.StringIndexer extra meta-data gets stored in the dataframe that specifically marks the transformed feature as a categorical feature.
When you print the dataframe you will see a numeric value (which is an index that corresponds with one of your categorical values) and if you look at the schema you will see that your new transformed column is of type double. However, this new column you created with pyspark.ml.feature.StringIndexer.transform is not just a normal double column, it has extra meta-data associated with it that is very important. You can inspect this meta-data by looking at the metadata property of the appropriate field in your dataframe's schema (you can access the schema objects of your dataframe by looking at yourdataframe.schema)
This extra metadata has two important implications:
When you call .fit() on a tree-based model, it will scan the metadata of your dataframe and recognize fields that you encoded as categorical with transformers such as pyspark.ml.feature.StringIndexer (other transformers, such as pyspark.ml.feature.VectorIndexer, also have this effect). Because of this, you DO NOT have to one-hot encode your features after you have transformed them with StringIndexer when using tree-based models in Spark ML (however, you still have to perform one-hot encoding when using other models that do not naturally handle categoricals, like linear regression, etc.).
Because this metadata is stored in the data frame, you can use pyspark.ml.feature.IndexToString to reverse the numeric indices back to the original categorical values (which are often strings) at any time.
There is a component of the ML pipeline called StringIndexer you can use to convert your strings to Double's in a reasonable way. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer has more documentation, and http://spark.apache.org/docs/latest/ml-guide.html shows how to construct pipelines.
I use the following method for oneHotEncoding a single column in a Spark dataFrame:
def ohcOneColumn(df, colName, debug=False):
    colsToFillNa = []
    if debug: print("Entering method ohcOneColumn")
    countUnique = df.groupBy(colName).count().count()
    if debug: print(countUnique)
    collectOnce = df.select(colName).distinct().collect()
    for uniqueValIndex in range(countUnique):
        uniqueVal = collectOnce[uniqueValIndex][0]
        if debug: print(uniqueVal)
        newColName = str(colName) + '_' + str(uniqueVal) + '_TF'
        df = df.withColumn(newColName, df[colName]==uniqueVal)
        colsToFillNa.append(newColName)
    df = df.drop(colName)
    df = df.na.fill(False, subset=colsToFillNa)
    return df
I use the following method for oneHotEncoding Spark dataFrames:
from pyspark.sql.functions import col, countDistinct, approxCountDistinct
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoderEstimator
def detectAndLabelCat(sparkDf, minValCount=5, debug=False, excludeCols=['Target']):
    if debug: print("Entering method detectAndLabelCat")
    newDf = sparkDf
    colList = sparkDf.columns
    for colName in sparkDf.columns:
        uniqueVals = sparkDf.groupBy(colName).count()
        if debug: print(uniqueVals)
        countUnique = uniqueVals.count()
        dtype = str(sparkDf.schema[colName].dataType)
        #dtype = str(df.schema[nc].dataType)
        if (colName in excludeCols):
            if debug: print(str(colName) + ' is in the excluded columns list.')
        elif countUnique == 1:
            newDf = newDf.drop(colName)
            if debug:
                print('dropping column ' + str(colName) + ' because it only contains one unique value.')
            #end if debug
        #elif (1==2):
        elif ((countUnique < minValCount) | (dtype=="String") | (dtype=="StringType")):
            if debug:
                print(len(newDf.columns))
                oldColumns = newDf.columns
            newDf = ohcOneColumn(newDf, colName, debug=debug)
            if debug:
                print(len(newDf.columns))
                newColumns = set(newDf.columns) - set(oldColumns)
                print('Adding:')
                print(newColumns)
                for newColumn in newColumns:
                    if newColumn in newDf.columns:
                        try:
                            newUniqueValCount = newDf.groupBy(newColumn).count().count()
                            print("There are " + str(newUniqueValCount) + " unique values in " + str(newColumn))
                        except:
                            print('Uncaught error discussing ' + str(newColumn))
                    #else:
                    #    newColumns.remove(newColumn)
                print('Dropping:')
                print(set(oldColumns) - set(newDf.columns))
        else:
            if debug: print('Nothing done for column ' + str(colName))
        #end if countUnique == 1, elif countUnique other condition
    #end outer for
    return newDf
You can cast a string column type in a spark data frame to a numerical data type using the cast function.
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType, IntegerType
sqlContext = SQLContext(sc)
dataset = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('./data/titanic.csv')
dataset = dataset.withColumn("Age", dataset["Age"].cast(DoubleType()))
dataset = dataset.withColumn("Survived", dataset["Survived"].cast(IntegerType()))
In the above example, we read in a csv file as a data frame, cast the default string datatypes into integer and double, and overwrite the original data frame. We can then use the VectorAssembler to merge the features in a single vector and apply your favorite Spark ML algorithm.
