PySpark - Convert to JSON row by row

I have a very large pyspark data frame. I need to convert the dataframe into a JSON formatted string for each row then publish the string to a Kafka topic. I originally used the following code.
for message in df.toJSON().collect():
    kafkaClient.send(message)
However, the dataframe is very large, so it fails when trying to collect().
I was thinking of using a UDF since it processes it row by row.
from pyspark.sql.functions import udf, struct

def get_row(row):
    json = row.toJSON()
    kafkaClient.send(message)
    return "Sent"

send_row_udf = F.udf(get_row, StringType())
df_json = df.withColumn("Sent", get_row(struct([df[x] for x in df.columns])))
df_json.select("Sent").show()
But I am getting an error because the column is passed to the function rather than the row.
For illustrative purposes, we can use the df below, where we can assume Col1 and Col2 must be sent over.
df= spark.createDataFrame([("A", 1), ("B", 2), ("D", 3)],["Col1", "Col2"])
The JSON string for each row:
'{"Col1":"A","Col2":1}'
'{"Col1":"B","Col2":2}'
'{"Col1":"D","Col2":3}'

You cannot use select like this. Use foreach / foreachPartition:
import json

def send(part):
    kafkaClient = ...
    for r in part:
        kafkaClient.send(json.dumps(r.asDict()))
If you need diagnostic information, just use an Accumulator.
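For completeness, here is a minimal sketch of wiring this together, with an accumulator for basic diagnostics. The kafka-python client, topic name, and broker address are assumptions for illustration, not part of the original answer.
import json
from kafka import KafkaProducer  # assumption: the kafka-python client is installed

sent_counter = spark.sparkContext.accumulator(0)  # optional diagnostics

def send_partition(part):
    # One producer per partition, created on the executor.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder address
    for r in part:
        producer.send("my_topic", json.dumps(r.asDict()).encode("utf-8"))  # placeholder topic
        sent_counter.add(1)
    producer.flush()

df.rdd.foreachPartition(send_partition)
print(sent_counter.value)  # number of rows sent, readable on the driver after the action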
In current releases (Spark 2.0 and later) I would use the Kafka data source directly:
from pyspark.sql.functions import to_json, struct

(df.select(to_json(struct([df[x] for x in df.columns])).alias("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("topic", topic)
    .save())
You'll need the Kafka SQL package, for example:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.1

Here is an approach that should work for you.
Collect the column names (keys) and the column values (values) into lists for each row. Then rearrange these into a list of key-value-pair tuples to pass into the dict constructor. Finally, convert the dict to a string using json.dumps().
Collect Keys and Values into Lists
Collect the column names and the values into a single list, but interleave the keys and values.
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

def kvp(cols, *args):
    a = list(cols)
    b = list(map(str, args))
    c = a + b
    c[::2] = a
    c[1::2] = b
    return c

kvp_udf = lambda cols: f.udf(lambda *args: kvp(cols, *args), ArrayType(StringType()))

df.withColumn('kvp', kvp_udf(df.columns)(*df.columns)).show()
#+----+----+------------------+
#|Col1|Col2| kvp|
#+----+----+------------------+
#| A| 1|[Col1, A, Col2, 1]|
#| B| 2|[Col1, B, Col2, 2]|
#| D| 3|[Col1, D, Col2, 3]|
#+----+----+------------------+
Pass the Key-Value-Pair column into dict constructor
Use json.dumps() to convert the dict into JSON string.
import json

df.withColumn('kvp', kvp_udf(df.columns)(*df.columns)) \
    .select(
        f.udf(lambda x: json.dumps(dict(zip(x[::2], x[1::2]))), StringType())(f.col('kvp'))
         .alias('json')
    ) \
    .show(truncate=False)
#+--------------------------+
#|json |
#+--------------------------+
#|{"Col2": "1", "Col1": "A"}|
#|{"Col2": "2", "Col1": "B"}|
#|{"Col2": "3", "Col1": "D"}|
#+--------------------------+
Note: Unfortunately, this will convert all datatypes to strings.
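If keeping the original datatypes matters, the built-in to_json used in the earlier answer avoids the string conversion; a minimal sketch on the example df:
from pyspark.sql import functions as F

# Col2 stays numeric inside the JSON, e.g. {"Col1":"A","Col2":1}
df.select(F.to_json(F.struct(*df.columns)).alias('json')).show(truncate=False)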

Related

Give two different data frame inputs to PySpark UDF and save output in new data frame

I am trying to use a Python function with a PySpark data frame. I need to give two data frames as input and would like to store the results in another data frame.
Python function I want to use:
@udf(StringType())
def fuzz_ratio(df1, df2):
    return np.vectorize(fuzz.token_sort_ratio(df1, df2))
This is how I am trying to use the above function:
result_df.withcolumn("VAL", fuzz_ratio(col(df1.VAL), col(df2.VAL)))
df1 and df2 are inputs. The VAL columns of both these data frames contain the values I need to pass to the function fuzz_ratio. The output should be saved in the VAL column of result_df.
VAL is the column name in all the dataframes; in df1 and df2 it is of string type.
When you move both columns to the same dataframe, something like the following pandas_udf could be used. pandas_udf is vectorized for performance; it's not the same as a regular Spark udf.
Input:
from pyspark.sql import functions as F
import pandas as pd
from fuzzywuzzy import fuzz
df = spark.createDataFrame([('danial khilji', 'danial'), ('a','as',)], ['col1', 'col2'])
df.show()
# +-------------+------+
# | col1| col2|
# +-------------+------+
# |danial khilji|danial|
# | a| as|
# +-------------+------+
Script:
@F.pandas_udf('long')
def fuzz_ratio(c1: pd.Series, c2: pd.Series) -> pd.Series:
    return pd.concat([c1, c2], axis=1).apply(lambda x: fuzz.token_sort_ratio(x[0], x[1]), axis=1)
df.withColumn('fuzz_ratio', fuzz_ratio('col1', 'col2')).show()
# +-------------+------+----------+
# | col1| col2|fuzz_ratio|
# +-------------+------+----------+
# |danial khilji|danial| 63|
# | a| as| 67|
# +-------------+------+----------+

Merge multiple dataframes outputted via a FOR loop function into one single dataframe

I have a FOR loop function that iterates over a list of tables and columns (zip) to get minimum and maximum values. The output is a separate dataframe for each combination rather than one single dataframe/table. Is there a way to combine the results of the FOR loop into one final output within the function?
from pyspark.sql import functions as f

def minmax(tables, cols):
    for table, column in zip(tables, cols):
        minmax = spark.table(table).where(f.col(column).isNotNull()).select(
            f.lit(table).alias("table"),
            f.lit(column).alias("col"),
            f.min(f.col(column)).alias("min"),
            f.max(f.col(column)).alias("max"))
        minmax.show()
tables = ["sales_123", "sales_REW"]
cols = ["costs", "price"]
minmax(tables, cols)
Output from the function:
+---------+-----+---+---+
| table| col|min|max|
+---------+-----+---+---+
|sales_123|costs| 0|400|
+---------+-----+---+---+
+----------+-----+---+---+
| table| col|min|max|
+----------+-----+---+---+
|sales_REW |price| 0|400|
+----------+-----+---+---+
Desired Output:
+---------+-----+---+---+
| table| col|min|max|
+---------+-----+---+---+
|sales_123|costs| 0|400|
|sales_REW|price| 0|400|
+---------+-----+---+---+
Put all the dataframes into a list and do the union after the for-loop:
from functools import reduce
from pyspark.sql import functions as f
from pyspark.sql import DataFrame

def minmax(tables, cols):
    dfs = []
    for table, column in zip(tables, cols):
        minmax = spark.table(table).where(f.col(column).isNotNull()).select(
            f.lit(table).alias("table"),
            f.lit(column).alias("col"),
            f.min(f.col(column)).alias("min"),
            f.max(f.col(column)).alias("max"))
        dfs.append(minmax)
    df = reduce(DataFrame.union, dfs)
    return df
Note that the column order needs to be the same in all the involved dataframes (as it is here); otherwise the union can produce unexpected results.
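If the column order might differ between the dataframes, unionByName (available since Spark 2.3) matches columns by name instead of position; a minimal variant of the last line:
from functools import reduce
from pyspark.sql import DataFrame

# Matches columns by name rather than by position
df = reduce(DataFrame.unionByName, dfs)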

PySpark converting a column of type 'map' to multiple columns in a dataframe

Input
I have a column Parameters of type map of the form:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
df = sqlContext.createDataFrame(d)
df.collect()
# [Row(Parameters={'foo': '1', 'bar': '2', 'baz': 'aaa'})]
df.printSchema()
# root
# |-- Parameters: map (nullable = true)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
Output
I want to reshape it in PySpark so that all the keys (foo, bar, etc.) would become columns, namely:
[Row(foo='1', bar='2', baz='aaa')]
Using withColumn works:
(df
.withColumn('foo', df.Parameters['foo'])
.withColumn('bar', df.Parameters['bar'])
.withColumn('baz', df.Parameters['baz'])
.drop('Parameters')
).collect()
But I need a solution that doesn't explicitly mention the column names, as I have dozens of them.
Since the keys of the MapType are not part of the schema, you'll have to collect them first, for example like this:
from pyspark.sql.functions import explode

keys = (df
    .select(explode("Parameters"))
    .select("key")
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect())
Once you have this, all that is left is a simple select:
from pyspark.sql.functions import col
exprs = [col("Parameters").getItem(k).alias(k) for k in keys]
df.select(*exprs)
Performant solution
One of the question constraints is to dynamically determine the column names, which is fine, but be warned that this can be really slow. Here's how you can avoid typing and write code that'll execute quickly.
import pyspark.sql.functions as F

cols = list(map(
    lambda f: F.col("Parameters").getItem(f).alias(str(f)),
    ["foo", "bar", "baz"]))

df.select(cols).show()
+---+---+---+
|foo|bar|baz|
+---+---+---+
| 1| 2|aaa|
+---+---+---+
Notice that this runs a single select operation. Don't run withColumn multiple times because that's slower.
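For contrast, the discouraged pattern would look roughly like this (a hypothetical loop, shown only to illustrate the point):
import pyspark.sql.functions as F

# Each iteration adds another projection to the query plan
df_slow = df
for k in ["foo", "bar", "baz"]:
    df_slow = df_slow.withColumn(k, F.col("Parameters").getItem(k))
df_slow = df_slow.drop("Parameters")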
The fast solution is only possible if you know all the map keys. You'll need to revert to the slower solution if you don't know all the unique values for the map keys.
Slower solution
The accepted answer is good. My solution is a bit more performant because it doesn't call .rdd or flatMap().
import pyspark.sql.functions as F
d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
df = spark.createDataFrame(d)
keys_df = df.select(F.explode(F.map_keys(F.col("Parameters")))).distinct()
keys = list(map(lambda row: row[0], keys_df.collect()))
key_cols = list(map(lambda f: F.col("Parameters").getItem(f).alias(str(f)), keys))
df.select(key_cols).show()
+---+---+---+
|bar|foo|baz|
+---+---+---+
| 2| 1|aaa|
+---+---+---+
Collecting results to the driver node can be a performance bottleneck. It's good to execute this code list(map(lambda row: row[0], keys_df.collect())) as a separate command to make sure it's not running too slowly.
If you want to avoid hard-coding the column names while keeping good performance, use this:
from pyspark.sql import functions as F
df = df.withColumn("_c", F.to_json("Parameters"))
json_schema = spark.read.json(df.rdd.map(lambda r: r._c)).schema
df = df.withColumn("_c", F.from_json("_c", json_schema))
df = df.select("_c.*")
df.show()
# +----+----+---+
# | bar| baz|foo|
# +----+----+---+
# | 2| aaa| 1|
# |null|null| 1|
# +----+----+---+
It uses neither distinct nor collect. It calls rdd only once, so that the extracted schema has a suitable format to use in from_json.

How to change dataframe column names in PySpark?

I come from pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:
df.columns = new_column_name_list
However, the same doesn't work in PySpark dataframes created using sqlContext.
The only solution I could figure out to do this easily is the following:
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
This is basically defining the variable twice and inferring the schema first then renaming the column names and then loading the dataframe again with the updated schema.
Is there a better and more efficient way to do this like we do in pandas?
My Spark version is 1.5.0
There are many ways to do that:
Option 1. Using selectExpr.
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                  ["Name", "askdaosdka"])
data.show()
data.printSchema()
# Output
#+-------+----------+
#| Name|askdaosdka|
#+-------+----------+
#|Alberto| 2|
#| Dakota| 2|
#+-------+----------+
#root
# |-- Name: string (nullable = true)
# |-- askdaosdka: long (nullable = true)
df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Option 2. Using withColumnRenamed, notice that this method allows you to "overwrite" the same column. For Python3, replace xrange with range.
from functools import reduce
oldColumns = data.schema.names
newColumns = ["name", "age"]
df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data)
df.printSchema()
df.show()
Option 3. Using alias; in Scala you can also use as.
from pyspark.sql.functions import col
data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
data.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.
sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
df2.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
df = df.withColumnRenamed("colName", "newColName") \
       .withColumnRenamed("colName2", "newColName2")
Advantage of this approach: when you have a long list of columns and only want to change a few names, this is very convenient. It is also very useful when joining tables with duplicate column names.
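For instance, a small sketch of renaming one side before a join to avoid an ambiguous column name (the dataframes and key names here are made up for illustration):
# df_a and df_b both have an 'id' column; rename one side before joining
left = df_a.withColumnRenamed("id", "a_id")
joined = left.join(df_b, left["a_id"] == df_b["id"])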
If you want to change all columns names, try df.toDF(*cols)
In case you would like to apply a simple transformation to all column names, this code does the trick (here I am replacing all spaces with underscores):
new_column_name_list= list(map(lambda x: x.replace(" ", "_"), df.columns))
df = df.toDF(*new_column_name_list)
Thanks to @user8117731 for the toDF trick.
df.withColumnRenamed('age', 'age2')
If you want to rename a single column and keep the rest as it is:
from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])
this is the approach that I used:
create pyspark session:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()
create dataframe:
df = spark.createDataFrame(data = [('Bob', 5.62,'juice'), ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])
view df with column names:
df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob| 5.62|juice|
| Sue| 0.85| milk|
+----+------+-----+
create a list with new column names:
newcolnames = ['NameNew','AmountNew','ItemNew']
change the column names of the df:
for c, n in zip(df.columns, newcolnames):
    df = df.withColumnRenamed(c, n)
view df with new column names:
df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
| Bob| 5.62| juice|
| Sue| 0.85| milk|
+-------+---------+-------+
I made an easy-to-use function to rename multiple columns of a pyspark dataframe, in case anyone wants to use it:
def renameCols(df, old_columns, new_columns):
    for old_col, new_col in zip(old_columns, new_columns):
        df = df.withColumnRenamed(old_col, new_col)
    return df
old_columns = ['old_name1','old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)
Be careful, both lists must be the same length.
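An optional guard (not part of the original function) makes the length requirement explicit, since zip silently truncates the longer list:
def renameCols(df, old_columns, new_columns):
    # Fail fast instead of silently renaming only a subset of columns
    if len(old_columns) != len(new_columns):
        raise ValueError("old_columns and new_columns must have the same length")
    for old_col, new_col in zip(old_columns, new_columns):
        df = df.withColumnRenamed(old_col, new_col)
    return df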
Another way to rename just one column (using import pyspark.sql.functions as F):
df = df.select( '*', F.col('count').alias('new_count') ).drop('count')
Method 1:
df = df.withColumnRenamed("old_column_name", "new_column_name")
Method 2:
If you want to do some computation and rename the result:
import pyspark.sql.functions as F

df = df.withColumn("new_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name")))
df = df.drop("old_column_name")
You can use the following function to rename all the columns of your dataframe.
def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    import pyspark.sql.functions as F
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X
In case you need to update only a few columns' names, you can use the same column name in the replace_with list
To rename all columns
df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])
To rename a some columns
df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])
we can use col.alias for renaming the column:
from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()
We can use various approaches to rename the column name.
First, let create a simple DataFrame.
df = spark.createDataFrame([("x", 1), ("y", 2)],
["col_1", "col_2"])
Now let's try to rename col_1 to col_3. Below are a few approaches to do the same.
# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()
# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col3"), "col_2").show()
# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()
# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()
Here is the output.
+-----+-----+
|col_3|col_2|
+-----+-----+
| x| 1|
| y| 2|
+-----+-----+
I hope this helps.
A way that you can use 'alias' to change the column name:
col('my_column').alias('new_name')
Another way that you can use 'alias' (possibly not mentioned):
df.my_column.alias('new_name')
You can put it in a for loop, and use zip to pair each column name from the two arrays.
new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]
new_df = df
for old, new in zip(df.columns, new_name):
    new_df = new_df.withColumnRenamed(old, new)
I like to use a dict to rename the df.
rename = {'old1': 'new1', 'old2': 'new2'}
for col in df.schema.names:
    df = df.withColumnRenamed(col, rename.get(col, col))  # keep the name if it is not in the dict
For a single column rename, you can still use toDF(). For example,
df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()
There are multiple approaches you can use:
from pyspark.sql.functions import col

df1 = df.withColumn("new_column", col("old_column")).drop("old_column")
df1 = df.withColumn("new_column", col("old_column"))
df1 = df.select(col("old_column").alias("new_column"))
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
CreatingDataFrame = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]
schema = StructType([
    StructField("employee_name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("state", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("age", StringType(), True),
    StructField("bonus", IntegerType(), True)
])

OurData = spark.createDataFrame(data=CreatingDataFrame, schema=schema)
OurData.show()

GrouppedBonusData = OurData.groupBy("department").sum("bonus")
GrouppedBonusData.show()
GrouppedBonusData.printSchema()

from pyspark.sql.functions import col
BonusColumnRenamed = GrouppedBonusData.select(col("department").alias("department"), col("sum(bonus)").alias("Total_Bonus"))
BonusColumnRenamed.show()

GrouppedBonusData.groupBy("department").count().show()

GrouppedSalaryData = OurData.groupBy("department").sum("salary")
GrouppedSalaryData.show()

SalaryColumnRenamed = GrouppedSalaryData.select(col("department").alias("Department"), col("sum(salary)").alias("Total_Salary"))
SalaryColumnRenamed.show()
Try the following method. It allows you to rename the columns of multiple files.
Reference: https://www.linkedin.com/pulse/pyspark-methods-rename-columns-kyle-gibson/
from pyspark.sql.functions import col

df_initial = spark.read.load('com.databricks.spark.csv')

rename_dict = {
    'Alberto': 'Name',
    'Dakota': 'askdaosdka'
}

df_renamed = df_initial \
    .select([col(c).alias(rename_dict.get(c, c)) for c in df_initial.columns])


def renameColumns(df):
    rename_dict = {
        'FName': 'FirstName',
        'LName': 'LastName',
        'DOB': 'BirthDate'
    }
    return df.select([col(c).alias(rename_dict.get(c, c)) for c in df.columns])


df_renamed = spark.read.load('/mnt/datalake/bronze/testData') \
    .transform(renameColumns)
The simplest solution is using withColumnRenamed:
renamed_df = df.withColumnRenamed('name_1', 'New_name_1').withColumnRenamed('name_2', 'New_name_2')
renamed_df.show()
And if you would like to do this like we do with Pandas, you can use toDF:
Create an ordered list of the new column names and pass it to toDF:
df_list = ["newName_1", "newName_2", "newName_3", "newName_4"]
renamed_df = df.toDF(*df_list)
renamed_df.show()
This is an easy way to rename multiple columns with a loop:
cols_to_rename = ["col1","col2","col3"]
for col in cols_to_rename:
    df = df.withColumnRenamed(col, "new_{}".format(col))
List comprehension + f-string:
df = df.toDF(*[f'n_{c}' for c in df.columns])
Simple list comprehension:
df = df.toDF(*[c.lower() for c in df.columns])
The closest statement to df.columns = new_column_name_list is:
import pyspark.sql.functions as F

df = df.select(*[F.col(name_old).alias(name_new)
                 for (name_old, name_new)
                 in zip(df.columns, new_column_name_list)])
This doesn't require any rarely-used functions, and emphasizes some patterns that are very helpful in Spark. You could also break up the steps if you find this one-liner to be doing too many things:
import pyspark.sql.functions as F

column_mapping = [F.col(name_old).alias(name_new)
                  for (name_old, name_new)
                  in zip(df.columns, new_column_name_list)]

df = df.select(*column_mapping)

Create Spark DataFrame. Can not infer schema for type

Could someone help me solve this problem I have with Spark DataFrame?
When I do myFloatRDD.toDF() I get an error:
TypeError: Can not infer schema for type: &lt;type 'float'&gt;
I don't understand why...
Example:
myFloatRdd = sc.parallelize([1.0,2.0,3.0])
df = myFloatRdd.toDF()
Thanks
SparkSession.createDataFrame, which is used under the hood, requires an RDD or list of Row/tuple/list/dict*, or a pandas.DataFrame, unless a schema with DataType is provided. Try to convert the float to a tuple like this:
myFloatRdd.map(lambda x: (x, )).toDF()
or even better:
from pyspark.sql import Row
row = Row("val") # Or some other column name
myFloatRdd.map(row).toDF()
To create a DataFrame from a list of scalars you'll have to use SparkSession.createDataFrame directly and provide a schema***:
from pyspark.sql.types import FloatType
df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())
df.show()
## +-----+
## |value|
## +-----+
## | 1.0|
## | 2.0|
## | 3.0|
## +-----+
but for a simple range it would be better to use SparkSession.range:
from pyspark.sql.functions import col
spark.range(1, 4).select(col("id").cast("double"))
* No longer supported.
** Spark SQL also provides a limited support for schema inference on Python objects exposing __dict__.
*** Supported only in Spark 2.0 or later.
from pyspark.sql.types import IntegerType, Row
mylist = [1, 2, 3, 4, None ]
l = map(lambda x : Row(x), mylist)
# notice the parens after the type name
df=spark.createDataFrame(l,["id"])
df.where(df.id.isNull() == False).show()
Basically, you need to wrap your int in a Row(); then we can use the schema.
Inferring the Schema Using Reflection
from pyspark.sql import Row
# spark - sparkSession
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
orders = sc.textFile("/practicedata/orders")
#Split on delimiters
parts = orders.map(lambda l: l.split(","))
#Convert to Row
orders_struct = parts.map(lambda p: Row(order_id=int(p[0]), order_date=p[1], customer_id=p[2], order_status=p[3]))
for i in orders_struct.take(5): print(i)
#convert the RDD to DataFrame
orders_df = spark.createDataFrame(orders_struct)
Programmatically Specifying the Schema
from pyspark.sql import Row
# spark - sparkSession
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
orders = sc.textFile("/practicedata/orders")
#Split on delimiters
parts = orders.map(lambda l: l.split(","))
#Convert to tuple
orders_struct = parts.map(lambda p: (p[0], p[1], p[2], p[3].strip()))
#convert the RDD to DataFrame
orders_df = spark.createDataFrame(orders_struct)
from pyspark.sql.types import StructType, StructField, StringType

# The schema is encoded in a string.
schemaString = "order_id order_date customer_id status"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

ordersDf = spark.createDataFrame(orders_struct, schema)
from pyspark.sql import Row
myFloatRdd.map(lambda x: Row(x)).toDF()
