Convert StringType to ArrayType in PySpark - python

I am trying to Run the FPGrowth algorithm in PySpark on my Dataset.
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5,minConfidence=0.6)
model = fpGrowth.fit(df)
I am getting the following error:
An error occurred while calling o2139.fit.
: java.lang.IllegalArgumentException: requirement failed: The input
column must be ArrayType, but got StringType.
at scala.Predef$.require(Predef.scala:224)
My Dataframe df is in the form:
df.show(2)
+---+---------+--------------------+
| id| name| actor|
+---+---------+--------------------+
| 0|['ab,df']| tom|
| 1|['rs,ce']| brad|
+---+---------+--------------------+
only showing top 2 rows
The FP algorithm works if my data in column "name" is in the form:
name
[ab,df]
[rs,ce]
How do I get it in this form that is convert from StringType to ArrayType
I formed the Dataframe from my RDD:
rd2=rd.map(lambda x: (x[1], x[0][0] , [x[0][1]]))
rd3 = rd2.map(lambda p:Row(id=int(p[0]),name=str(p[2]),actor=str(p[1])))
df = spark.createDataFrame(rd3)
rd2.take(2):
[(0, 'tom', ['ab,df']), (1, 'brad', ['rs,ce'])]

Split by comma for each row in the name column of your dataframe. e.g.
from pyspark.sql.functions import pandas_udf, PandasUDFType
#pandas_udf('list', PandasUDFType.SCALAR)
def split_comma(v):
return v[1:-1].split(',')
df.withColumn('name', split_comma(df.name))
Or better, don't defer this. Set name directly to the list.
rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(',')))
rd3 = rd2.map(lambda p:Row(id=int(p[0]), name=p[2], actor=str(p[1])))

Based on your previous question, it seems as though you are building rdd2 incorrectly.
Try this:
rd2 = rd.map(lambda x: (x[1], x[0][0] , x[0][1].split(",")))
rd3 = rd2.map(lambda p:Row(id=int(p[0]), name=p[2], actor=str(p[1])))
The change is that we call str.split(",") on x[0][1] so that it will convert a string like 'a,b' to a list: ['a', 'b'].

Related

Length of object (3) does not match with length of fields (1) Pyspark

I am getting problem with below code. I want to create a single column dataframe.
May I know what wrong I am doing here.
from pyspark.sql import functions as F from pyspark.sql.types import IntegerType,ArrayType,StructType,StructField,StringType data = [ (["James","Jon","Jane"]), (["Miken","Mik","Mike"]), (["John","Johns"])]
cols = StructType([ StructField("Name",ArrayType(StringType()),True) ])
df = spark.createDataFrame(data=data,schema=cols)
df.printSchema()
df.show()
output:
Name
["James","Jon","Jane"]
["Miken","Mik","Mike"]
["John","Johns"]
I am getting a error below.
Length of object (3) does not match with length of fields (1)
This error is because you added data in form of a multiple-column structure and your requirement is in single-column data value.
So for get data in single column values you need to [(row,) for row in data]:
data = [(["James","Jon","Jane"]), (["Miken","Mik","Mike"]), (["John","Johns"])]
cols = StructType([ StructField("Name",ArrayType(StringType()),True) ])
df = spark.createDataFrame(data=[(row,) for row in data], schema=cols)
df.printSchema()
df.show()
Output:
Pyspark has this problem. The way I go about is introduce and ID column, drop it ones the df is created
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType,ArrayType,StructType,StructField,StringType
data = [ (1,["James","Jon","Jane"]), (2,["Miken","Mik","Mike"]), (3,["John","Johns"])]
cols = StructType([ StructField("ID",IntegerType(),True), StructField("Name",ArrayType(StringType()),True) ])
df = spark.createDataFrame(data=data,schema=cols).drop('ID')
df.printSchema()
df.show()

How do I count the number of occurrences in a spark RDD and return it as a dictionary?

I am loading a csv file as a dataframe and I converted it to a RDD. This RDD contains the list of cities like [NY,NY,NY,LA,LA,LA,Detroit,Miami]. I want to be able to extract the frequency of each city like so:
NY: 3
LA: 3
Detroit: 1
Miami: 1
I know I can do this with dataframe functions, but I need to specifically do this with RDD functions like map,filter, etc.
Here's what I tried so far:
df= spark.read.format("csv").option("header", "true").load(filename)
newRDD = df.rdd.map(lambda x: x[6]).filter(lambda x: x!=None)
I am just getting the 6th column in the above code in my dataframe that contains the cities
You can try reduceByKey.
>>> df = spark.createDataFrame(["NY","NY","NY","LA","LA","LA","Detroit","Miami"], StringType())
>>> rdd2 = df.rdd.map(lambda x: (x[0],1)).reduceByKey(lambda x, y: x+y)
>>> rdd2.collect()
[('Detroit', 1), ('NY', 3), ('LA', 3), ('Miami', 1)]

How to use pandas UDF in pyspark and return result in StructType

How can I drive a column based on panda-udf in pyspark. I've written udf as below:
from pyspark.sql.functions import pandas_udf, PandasUDFType
#pandas_udf("in_type string, in_var string, in_numer int", PandasUDFType.GROUPED_MAP)
def getSplitOP(in_data):
if in_data is None or len(in_data) < 1:
return None
#Input/variable.12-2017
splt=in_data.split("/",1)
in_type=splt[0]
splt_1=splt[1].split(".",1)
in_var = splt_1[0]
splt_2=splt_1[1].split("-",1)
in_numer=int(splt_2[0])
return (in_type, in_var, in_numer)
#Expected output: ("input", "variable", 12)
df = df.withColumn("splt_col", getSplitOP(df.In_data))
Can someone help me out to identify, what's wrong with above code, and why it's not working.
This will work:
df = spark.createDataFrame([("input/variable.12-2017",), ("output/invariable.11-2018",)], ("in_data",))
df.show()
from pyspark.sql.functions import pandas_udf, PandasUDFType
#pandas_udf("in_type string, in_var string, in_numer int", PandasUDFType.GROUPED_MAP)
def getSplitOP(pdf):
in_data = pdf.in_data
#Input/variable.12-2017
splt = in_data.apply(lambda x: x.split("/",1))
in_type = splt.apply(lambda x: x[0])
splt_1 = splt.apply(lambda x: x[1].split(".",1))
in_var = splt_1.apply(lambda x: x[0])
splt_2 = splt_1.apply(lambda x: x[1].split("-",1))
in_numer = splt_2.apply(lambda x: int(x[0]))
return pd.DataFrame({"in_type": in_type, "in_var": in_var, "in_numer": in_numer})
#Expected output: ("input", "variable", 12)
df = df.groupBy().apply(getSplitOP)
df.show()
There must not be a blank line after #pandas_udf.
pandas Series objects don't directly support string functions such as split. Use apply to operate elementwise on each Series.
You used a GROUPED_MAP in order to return multiple columns, but your code isn't inherently grouped by anything. Note that groupBy is used without any arguments here. This requires all the data to fit on a single processor.

PySpark - Convert to JSON row by row

I have a very large pyspark data frame. I need to convert the dataframe into a JSON formatted string for each row then publish the string to a Kafka topic. I originally used the following code.
for message in df.toJSON().collect():
kafkaClient.send(message)
However the dataframe is very large so it fails when trying to collect().
I was thinking of using a UDF since it processes it row by row.
from pyspark.sql.functions import udf, struct
def get_row(row):
json = row.toJSON()
kafkaClient.send(message)
return "Sent"
send_row_udf = F.udf(get_row, StringType())
df_json = df.withColumn("Sent", get_row(struct([df[x] for x in df.columns])))
df_json.select("Sent").show()
But I am getting an error because the column is inputed to the function and not the row.
For illustrative purposes, we can use the df below where we can assume Col1 and Col2 must be send over.
df= spark.createDataFrame([("A", 1), ("B", 2), ("D", 3)],["Col1", "Col2"])
The JSON string for each row:
'{"Col1":"A","Col2":1}'
'{"Col1":"B","Col2":2}'
'{"Col1":"D","Col2":3}'
You cannot use select like this. Use foreach / foreachPartition:
import json
def send(part):
kafkaClient = ...
for r in part:
kafkaClient.send(json.dumps(r.asDict()))
If you need diagnostic information just use Accumulator.
In current releases I would use Kafka source directly (2.0 and later):
from pyspark.sql.functions import to_json, struct
(df.select(to_json(struct([df[x] for x in df.columns])).alias("value"))
.write
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap_servers)
.option("topic", topic)
.save())
You'll need Kafka SQL package for example:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.1
Here is an approach that should work for you.
Collect the column names (keys) and the column values into lists (values) for each row. Then rearrange these into a list of key-value-pair tuples to pass into the dict constructor. Finally, convert the dict to a string using json.dumps().
Collect Keys and Values into Lists
Collect the column names and the values into a single list, but interleave the keys and values.
import pyspark.sql.functions as f
def kvp(cols, *args):
a = cols
b = map(str, args)
c = a + b
c[::2] = a
c[1::2] = b
return c
kvp_udf = lambda cols: f.udf(lambda *args: kvp(cols, *args), ArrayType(StringType()))
df.withColumn('kvp', kvp_udf(df.columns)(*df.columns)).show()
#+----+----+------------------+
#|Col1|Col2| kvp|
#+----+----+------------------+
#| A| 1|[Col1, A, Col2, 1]|
#| B| 2|[Col1, B, Col2, 2]|
#| D| 3|[Col1, D, Col2, 3]|
#+----+----+------------------+
Pass the Key-Value-Pair column into dict constructor
Use json.dumps() to convert the dict into JSON string.
import json
df.withColumn('kvp', kvp_udf(df.columns)(*df.columns))\
.select(
f.udf(lambda x: json.dumps(dict(zip(x[::2],x[1::2]))), StringType())(f.col('kvp'))\
.alias('json')
)\
.show(truncate=False)
#+--------------------------+
#|json |
#+--------------------------+
#|{"Col2": "1", "Col1": "A"}|
#|{"Col2": "2", "Col1": "B"}|
#|{"Col2": "3", "Col1": "D"}|
#+--------------------------+
Note: Unfortunately, this will convert all datatypes to strings.

How to change dataframe column names in PySpark?

I come from pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:
df.columns = new_column_name_list
However, the same doesn't work in PySpark dataframes created using sqlContext.
The only solution I could figure out to do this easily is the following:
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i,k in enumerate(oldSchema.fields):
k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
This is basically defining the variable twice and inferring the schema first then renaming the column names and then loading the dataframe again with the updated schema.
Is there a better and more efficient way to do this like we do in pandas?
My Spark version is 1.5.0
There are many ways to do that:
Option 1. Using selectExpr.
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
["Name", "askdaosdka"])
data.show()
data.printSchema()
# Output
#+-------+----------+
#| Name|askdaosdka|
#+-------+----------+
#|Alberto| 2|
#| Dakota| 2|
#+-------+----------+
#root
# |-- Name: string (nullable = true)
# |-- askdaosdka: long (nullable = true)
df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Option 2. Using withColumnRenamed, notice that this method allows you to "overwrite" the same column. For Python3, replace xrange with range.
from functools import reduce
oldColumns = data.schema.names
newColumns = ["name", "age"]
df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data)
df.printSchema()
df.show()
Option 3. using
alias, in Scala you can also use as.
from pyspark.sql.functions import col
data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
data.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.
sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
df2.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
df = df.withColumnRenamed("colName", "newColName")\
.withColumnRenamed("colName2", "newColName2")
Advantage of using this way: With long list of columns you would like to change only few column names. This can be very convenient in these scenarios. Very useful when joining tables with duplicate column names.
If you want to change all columns names, try df.toDF(*cols)
In case you would like to apply a simple transformation on all column names, this code does the trick: (I am replacing all spaces with underscore)
new_column_name_list= list(map(lambda x: x.replace(" ", "_"), df.columns))
df = df.toDF(*new_column_name_list)
Thanks to #user8117731 for toDf trick.
df.withColumnRenamed('age', 'age2')
If you want to rename a single column and keep the rest as it is:
from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])
this is the approach that I used:
create pyspark session:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()
create dataframe:
df = spark.createDataFrame(data = [('Bob', 5.62,'juice'), ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])
view df with column names:
df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob| 5.62|juice|
| Sue| 0.85| milk|
+----+------+-----+
create a list with new column names:
newcolnames = ['NameNew','AmountNew','ItemNew']
change the column names of the df:
for c,n in zip(df.columns,newcolnames):
df=df.withColumnRenamed(c,n)
view df with new column names:
df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
| Bob| 5.62| juice|
| Sue| 0.85| milk|
+-------+---------+-------+
I made an easy to use function to rename multiple columns for a pyspark dataframe,
in case anyone wants to use it:
def renameCols(df, old_columns, new_columns):
for old_col,new_col in zip(old_columns,new_columns):
df = df.withColumnRenamed(old_col,new_col)
return df
old_columns = ['old_name1','old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)
Be careful, both lists must be the same length.
Another way to rename just one column (using import pyspark.sql.functions as F):
df = df.select( '*', F.col('count').alias('new_count') ).drop('count')
Method 1:
df = df.withColumnRenamed("old_column_name", "new_column_name")
Method 2:
If you want to do some computation and rename the new values
df = df.withColumn("old_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name"))
df = df.drop("new_column_name", "old_column_name")
You can use the following function to rename all the columns of your dataframe.
def df_col_rename(X, to_rename, replace_with):
"""
:param X: spark dataframe
:param to_rename: list of original names
:param replace_with: list of new names
:return: dataframe with updated names
"""
import pyspark.sql.functions as F
mapping = dict(zip(to_rename, replace_with))
X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
return X
In case you need to update only a few columns' names, you can use the same column name in the replace_with list
To rename all columns
df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])
To rename a some columns
df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])
we can use col.alias for renaming the column:
from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()
We can use various approaches to rename the column name.
First, let create a simple DataFrame.
df = spark.createDataFrame([("x", 1), ("y", 2)],
["col_1", "col_2"])
Now let's try to rename col_1 to col_3. PFB a few approaches to do the same.
# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()
# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col3"), "col_2").show()
# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()
# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()
Here is the output.
+-----+-----+
|col_3|col_2|
+-----+-----+
| x| 1|
| y| 2|
+-----+-----+
I hope this helps.
A way that you can use 'alias' to change the column name:
col('my_column').alias('new_name')
Another way that you can use 'alias' (possibly not mentioned):
df.my_column.alias('new_name')
You can put into for loop, and use zip to pairs each column name in two array.
new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]
new_df = df
for old, new in zip(df.columns, new_name):
new_df = new_df.withColumnRenamed(old, new)
I like to use a dict to rename the df.
rename = {'old1': 'new1', 'old2': 'new2'}
for col in df.schema.names:
df = df.withColumnRenamed(col, rename[col])
For a single column rename, you can still use toDF(). For example,
df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()
There are multiple approaches you can use:
df1=df.withColumn("new_column","old_column").drop(col("old_column"))
df1=df.withColumn("new_column","old_column")
df1=df.select("old_column".alias("new_column"))
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
CreatingDataFrame = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]
schema = StructType([ \
StructField("employee_name",StringType(),True), \
StructField("department",StringType(),True), \
StructField("state",StringType(),True), \
StructField("salary", IntegerType(), True), \
StructField("age", StringType(), True), \
StructField("bonus", IntegerType(), True) \
])
OurData = spark.createDataFrame(data=CreatingDataFrame,schema=schema)
OurData.show()
# COMMAND ----------
GrouppedBonusData=OurData.groupBy("department").sum("bonus")
# COMMAND ----------
GrouppedBonusData.show()
# COMMAND ----------
GrouppedBonusData.printSchema()
# COMMAND ----------
from pyspark.sql.functions import col
BonusColumnRenamed = GrouppedBonusData.select(col("department").alias("department"), col("sum(bonus)").alias("Total_Bonus"))
BonusColumnRenamed.show()
# COMMAND ----------
GrouppedBonusData.groupBy("department").count().show()
# COMMAND ----------
GrouppedSalaryData=OurData.groupBy("department").sum("salary")
# COMMAND ----------
GrouppedSalaryData.show()
# COMMAND ----------
from pyspark.sql.functions import col
SalaryColumnRenamed = GrouppedSalaryData.select(col("department").alias("Department"), col("sum(salary)").alias("Total_Salary"))
SalaryColumnRenamed.show()
Try the following method. The following method can allow you rename columns of multiple files
Reference: https://www.linkedin.com/pulse/pyspark-methods-rename-columns-kyle-gibson/
df_initial = spark.read.load('com.databricks.spark.csv')
rename_dict = {
'Alberto':'Name',
'Dakota':'askdaosdka'
}
df_renamed = df_initial \
.select([col(c).alias(rename_dict.get(c, c)) for c in df_initial.columns])
rename_dict = {
'FName':'FirstName',
'LName':'LastName',
'DOB':'BirthDate'
}
return df.select([col(c).alias(rename_dict.get(c, c)) for c in df.columns])
df_renamed = spark.read.load('/mnt/datalake/bronze/testData') \
.transform(renameColumns)
The simplest solution is using withColumnRenamed:
renamed_df = df.withColumnRenamed(‘name_1’, ‘New_name_1’).withColumnRenamed(‘name_2’, ‘New_name_2’)
renamed_df.show()
And if you would like to do this like we do with Pandas, you can use toDF:
Create an order of list of new columns and pass it to toDF
df_list = ["newName_1", “newName_2", “newName_3", “newName_4"]
renamed_df = df.toDF(*df_list)
renamed_df.show()
This is an easy way to rename multiple columns with a loop:
cols_to_rename = ["col1","col2","col3"]
for col in cols_to_rename:
df = df.withColumnRenamed(col,"new_{}".format(col))
List comprehension + f-string:
df = df.toDF(*[f'n_{c}' for c in df.columns])
Simple list comprehension:
df = df.toDF(*[c.lower() for c in df.columns])
The closest statement to df.columns = new_column_name_list is:
import pyspark.sql.functions as F
df = df.select(*[F.col(name_old).alias(name_new)
for (name_old, name_new)
in zip(df.columns, new_column_name_list)]
This doesn't require any rarely-used functions, and emphasizes some patterns that are very helpful in Spark. You could also break up the steps if you find this one-liner to be doing too many things:
import pyspark.sql.functions as F
column_mapping = [F.col(name_old).alias(name_new)
for (name_old, name_new)
in zip(df.columns, new_column_name_list)]
df = df.select(*column_mapping)

Categories