Flattening nested JSON: Scala code in PySpark - python

I am trying to do the equivalent of the following Scala code in PySpark:

val maxJsonParts = 3 // whatever that number is...
val jsonElements = (0 until maxJsonParts)
  .map(i => get_json_object($"Payment", s"$$[$i]"))

val newDF = dataframe
  .withColumn("Payment", explode(array(jsonElements: _*)))
  .where(!isnull($"Payment"))
For example, I want a nested column such as the payment column below:

id | name  | payment
---+-------+----------------------------------------------------------------
1  | James | [ {"#id": 1, "currency":"GBP"},{"#id": 2, "currency": "USD"} ]
to become:
id | name  | payment
---+-------+------------------------------
1  | James | {"#id": 1, "currency":"GBP"}
1  | James | {"#id": 2, "currency":"USD"}
The table schema:
root
|-- id: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Payment: string (nullable = true)
I tried writing this in PySpark, but it just turns the nested column (payment) to null:
lst = [range(0,10)]
jsonElem = [F.get_json_object(F.col("payment"), f"$[{i}]") for i in lst]
bronzeDF = bronzeDF.withColumn("payment2", F.explode(F.array(*jsonElem)))
bronzeDF.show()
Any help is highly appreciated.

Here is another approach, which parses the given JSON with the right schema in order to generate the payment array. The solution is based on the from_json function, which parses a JSON string into a struct type.
from pyspark.sql.types import IntegerType, StringType, ArrayType, StructType, StructField
from pyspark.sql.functions import from_json, explode

data = [
    (1, 'James', '[ {"#id": 1, "currency":"GBP"},{"#id": 2, "currency": "USD"} ]'),
    (2, 'Tonny', '[ {"#id": 3, "currency":"EUR"},{"#id": 4, "currency": "USD"} ]'),
]
df = spark.createDataFrame(data, ['id', 'name', 'payment'])

str_schema = 'array<struct<`#id`:int,`currency`:string>>'

# st_schema = ArrayType(StructType([
#                 StructField('#id', IntegerType()),
#                 StructField('currency', StringType())]))

df = df.withColumn("payment", explode(from_json(df["payment"], str_schema)))
df.show()
# +---+-----+--------+
# | id| name| payment|
# +---+-----+--------+
# | 1|James|[1, GBP]|
# | 1|James|[2, USD]|
# | 2|Tonny|[3, EUR]|
# | 2|Tonny|[4, USD]|
# +---+-----+--------+
Note: as you can see, you can choose between the string representation of the schema and the ArrayType definition; both produce the same result.
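If you also want each struct field as its own top-level column after the explode, here is a small follow-up sketch (continuing from the df produced above):
# Expand the exploded struct's fields (#id, currency) into separate columns
df.select("id", "name", "payment.*").show()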

I came to the following solution:
First, convert the column to a string type as follows:
bronzeDF = bronzeDF.withColumn("payment2", F.to_json("payment")).drop("payment")
Then you can run the following code on the column to stack the n nested JSON objects as separate rows with the same outer key values:
max_json_parts = 50
lst = list(range(max_json_parts))
jsonElem = [F.get_json_object(F.col("payment2"), f"$[{i}]") for i in lst]
bronzeDF = bronzeDF.withColumn("payment2", F.explode(F.array(*jsonElem))).where(F.col("payment2").isNotNull())
Then convert back to a struct with a defined schema and expand the object keys into separate columns:
bronzeDF = bronzeDF.withColumn("temp", F.from_json("payment2", jsonSchemaPayment)).select("*", "temp.*").drop("payment2")

Related

Comparing schema of dataframe using Pyspark

I have a data frame (df).
To show its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Sometimes the schema changes (the column type or name):
df2.printSchema()
#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
I would like to compare between the two schemas (df1 and df2) and get only the differences in types and columns names (Sometimes the column can move to another position).
The results should be a table (or data frame) something like this:
column | df1    | df2     | diff
-------+--------+---------+-----------
name   | string | array   | type
gender | N/A    | integer | new column
(The age column is the same and didn't change. If a column is omitted, there should be an 'omitted' indication.)
How can I do this efficiently if I have many columns in each?
Without any external library, we can find the schema difference using
from pyspark.sql.session import SparkSession
from pyspark.sql import DataFrame

def schema_diff(spark: SparkSession, df_1: DataFrame, df_2: DataFrame):
    s1 = spark.createDataFrame(df_1.dtypes, ["d1_name", "d1_type"])
    s2 = spark.createDataFrame(df_2.dtypes, ["d2_name", "d2_type"])
    difference = (
        # join on both name and type, so a column whose type changed fails to match as well
        s1.join(s2, (s1.d1_name == s2.d2_name) & (s1.d1_type == s2.d2_type), how="outer")
        .where(s1.d1_type.isNull() | s2.d2_type.isNull())
        .select(s1.d1_name, s1.d1_type, s2.d2_name, s2.d2_type)
        .fillna("")
    )
    return difference
fillna is optional; I prefer to view the missing values as empty strings.
In the where clause we filter on the type columns: since the join matches on both name and type, a column that exists in both dataframes but with a different type also shows up.
This will also show all columns that are in the second dataframe but not in the first.
Usage:
diff = schema_diff(spark, df_1, df_2)
diff.show(diff.count(), truncate=False)
You can try creating two pandas dataframes with the metadata from both DF1 and DF2 like below:
import pandas as pd

pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])
and then join those two pandas dataframes through an 'outer' join, as sketched below.
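A minimal sketch of that outer join, continuing from the two pandas dataframes above (the indicator column is my addition, used to flag the rows that differ):
# Outer merge on both column name and type; rows not marked 'both' differ between the schemas
diff = pd_df1.merge(pd_df2, on=['column', 'data_type'], how='outer', indicator=True)
diff = diff[diff['_merge'] != 'both']
print(diff)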
A custom function that could be useful for someone.
def SchemaDiff(DF1, DF2):
    # Getting the schema of both dataframes as a dictionary
    DF1Schema = {x[0]: x[1] for x in DF1.dtypes}
    DF2Schema = {x[0]: x[1] for x in DF2.dtypes}

    # Columns present in DF1 but not in DF2
    DF1MinusDF2 = dict.fromkeys((set(DF1.columns) - set(DF2.columns)), '')
    for column_name in DF1MinusDF2:
        DF1MinusDF2[column_name] = DF1Schema[column_name]

    # Columns present in DF2 but not in DF1
    DF2MinusDF1 = dict.fromkeys((set(DF2.columns) - set(DF1.columns)), '')
    for column_name in DF2MinusDF1:
        DF2MinusDF1[column_name] = DF2Schema[column_name]

    # Find data types changed in DF1 as compared to DF2
    UpdatedDF1Schema = {k: v for k, v in DF1Schema.items() if k not in DF1MinusDF2}
    UpdatedDF1Schema = {**UpdatedDF1Schema, **DF2MinusDF1}
    DF1DataTypesChanged = {}
    for column_name in UpdatedDF1Schema:
        if UpdatedDF1Schema[column_name] != DF2Schema[column_name]:
            DF1DataTypesChanged[column_name] = DF2Schema[column_name]

    return DF1MinusDF2, DF2MinusDF1, DF1DataTypesChanged
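A hypothetical usage sketch for the function above, using the df1/df2 names from the question:
# Returns three dicts: columns only in DF1, columns only in DF2, and columns whose type changed
only_in_df1, only_in_df2, type_changes = SchemaDiff(df1, df2)
print(only_in_df1)    # columns present in df1 but missing from df2, with their df1 types
print(only_in_df2)    # columns present in df2 but missing from df1, with their df2 types
print(type_changes)   # columns whose type differs, mapped to their df2 types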
you can simply use
df1.schema == df2.schema
(Note that printSchema() only prints the schema and returns None, so comparing its return values would always be equal.)

Change column type from string to date in Pyspark

I'm trying to change my column type from string to date. I have consulted answers from:
How to change the column type from String to Date in DataFrames?
Why I get null results from date_format() PySpark function?
When I tried to apply the answers from link 1, I got a null result instead, so I referred to the answer from link 2, but I don't understand this part:
output_format = ... # Some SimpleDateFormat string
from pyspark.sql.functions import col, unix_timestamp, to_date

# sample data
df = sc.parallelize([['12-21-2006'],
                     ['05-30-2007'],
                     ['01-01-1984'],
                     ['12-24-2017']]).toDF(["date_in_strFormat"])
df.printSchema()

df = df.withColumn('date_in_dateFormat',
                   to_date(unix_timestamp(col('date_in_strFormat'), 'MM-dd-yyyy').cast("timestamp")))
df.show()
df.printSchema()
Output is:
root
|-- date_in_strFormat: string (nullable = true)
+-----------------+------------------+
|date_in_strFormat|date_in_dateFormat|
+-----------------+------------------+
| 12-21-2006| 2006-12-21|
| 05-30-2007| 2007-05-30|
| 01-01-1984| 1984-01-01|
| 12-24-2017| 2017-12-24|
+-----------------+------------------+
root
|-- date_in_strFormat: string (nullable = true)
|-- date_in_dateFormat: date (nullable = true)
simple way:
from pyspark.sql.types import DateType
df_1 = df.withColumn("col_with_date_format",
                     df["col_with_date_format"].cast(DateType()))
Here is an easier way, using the default to_date function:
from pyspark.sql import functions as F
df= df.withColumn('col_with_date_format',F.to_date(df.col_with_str_format))
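For non-ISO strings like the MM-dd-yyyy values in the question, to_date also accepts an explicit format (available since Spark 2.2), which avoids the unix_timestamp round trip; a short sketch:
from pyspark.sql import functions as F
# Parse MM-dd-yyyy strings directly into a date column (Spark 2.2+)
df = df.withColumn('date_in_dateFormat', F.to_date('date_in_strFormat', 'MM-dd-yyyy'))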

Selecting only numeric/string columns names from a Spark DF in pyspark

I have a Spark DataFrame in Pyspark (2.1.0) and I am looking to get the names of numeric columns only or string columns only.
For example, this is the Schema of my DF:
root
|-- Gender: string (nullable = true)
|-- SeniorCitizen: string (nullable = true)
|-- MonthlyCharges: double (nullable = true)
|-- TotalCharges: double (nullable = true)
|-- Churn: string (nullable = true)
This is what I need:
num_cols = ['MonthlyCharges', 'TotalCharges']
str_cols = ['Gender', 'SeniorCitizen', 'Churn']
How can I make it?
dtypes is a list of tuples (columnName, type), so you can use a simple filter:
columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
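Extending the same idea to build both lists from the question (the set of numeric type-name prefixes below is my assumption, covering the common numeric types):
# Split column names by their dtype string
numeric_prefixes = ('int', 'bigint', 'smallint', 'tinyint', 'float', 'double', 'decimal')
num_cols = [name for name, dtype in df.dtypes if dtype.startswith(numeric_prefixes)]
str_cols = [name for name, dtype in df.dtypes if dtype == 'string']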
PySpark provides a rich API for working with schema types. As @DanieldePaula mentioned, you can access the fields' metadata through df.schema.fields.
Here is a different approach, based on checking the fields' data types:
from pyspark.sql.types import StringType, DoubleType

df = spark.createDataFrame([
    [1, 2.3, "t1"],
    [2, 5.3, "t2"],
    [3, 2.1, "t3"],
    [4, 1.5, "t4"]
], ["cola", "colb", "colc"])

# get string columns
str_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
# ['colc']

# or double columns
dbl_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DoubleType)]
# ['colb']
You can do what zlidme suggested to get only string (categorical) columns. To extend the answer given, take a look at the example below. It will give you all numeric (continuous) columns in a list called continuousCols, all categorical columns in a list called categoricalCols, and all columns in a list called allCols.
import pandas as pd

data = {'mylongint': [0, 1, 2],
        'shoes': ['blue', 'green', 'yellow'],
        'hous': ['furnitur', 'roof', 'foundation'],
        'C': [1, 0, 0]}

play_df = pd.DataFrame(data)
play_ddf = spark.createDataFrame(play_df)

# store all column names in a list
allCols = [item[0] for item in play_ddf.dtypes]

# store all column names that are categorical in a list
categoricalCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('string')]

# store all column names that are continuous in a list
continuousCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('bigint')]

print(len(allCols), ' - ', len(continuousCols), ' - ', len(categoricalCols))
This will give the result: 4 - 2 - 2

How to change dataframe column names in PySpark?

I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:
df.columns = new_column_name_list
However, the same doesn't work in PySpark dataframes created using sqlContext.
The only solution I could figure out to do this easily is the following:
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
This basically means defining the variable twice: inferring the schema first, then renaming the column names, and then loading the dataframe again with the updated schema.
Is there a better and more efficient way to do this like we do in pandas?
My Spark version is 1.5.0
There are many ways to do that:
Option 1. Using selectExpr.
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                  ["Name", "askdaosdka"])
data.show()
data.printSchema()
# Output
#+-------+----------+
#| Name|askdaosdka|
#+-------+----------+
#|Alberto| 2|
#| Dakota| 2|
#+-------+----------+
#root
# |-- Name: string (nullable = true)
# |-- askdaosdka: long (nullable = true)
df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Option 2. Using withColumnRenamed, notice that this method allows you to "overwrite" the same column. For Python3, replace xrange with range.
from functools import reduce
oldColumns = data.schema.names
newColumns = ["name", "age"]
df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data)
df.printSchema()
df.show()
Option 3. Using alias; in Scala you can also use as.
from pyspark.sql.functions import col
data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
data.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.
sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
df2.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
df = df.withColumnRenamed("colName", "newColName")\
.withColumnRenamed("colName2", "newColName2")
Advantage of this approach: with a long list of columns, you may want to change only a few column names. This can be very convenient in those scenarios, and it is very useful when joining tables with duplicate column names.
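For the duplicate-column-name case, a hypothetical sketch of renaming before a join so the result is unambiguous (df_left, df_right, and the column names are made-up examples):
# Rename the clashing 'id' columns before joining so both survive with distinct names
left = df_left.withColumnRenamed("id", "left_id")
right = df_right.withColumnRenamed("id", "right_id")
joined = left.join(right, left["left_id"] == right["right_id"])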
If you want to change all columns names, try df.toDF(*cols)
In case you would like to apply a simple transformation to all column names, this code does the trick (here I am replacing all spaces with underscores):
new_column_name_list= list(map(lambda x: x.replace(" ", "_"), df.columns))
df = df.toDF(*new_column_name_list)
Thanks to @user8117731 for the toDF trick.
df.withColumnRenamed('age', 'age2')
If you want to rename a single column and keep the rest as it is:
from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])
this is the approach that I used:
create pyspark session:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()
create dataframe:
df = spark.createDataFrame(data = [('Bob', 5.62,'juice'), ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])
view df with column names:
df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob| 5.62|juice|
| Sue| 0.85| milk|
+----+------+-----+
create a list with new column names:
newcolnames = ['NameNew','AmountNew','ItemNew']
change the column names of the df:
for c, n in zip(df.columns, newcolnames):
    df = df.withColumnRenamed(c, n)
view df with new column names:
df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
| Bob| 5.62| juice|
| Sue| 0.85| milk|
+-------+---------+-------+
I made an easy to use function to rename multiple columns for a pyspark dataframe,
in case anyone wants to use it:
def renameCols(df, old_columns, new_columns):
    for old_col, new_col in zip(old_columns, new_columns):
        df = df.withColumnRenamed(old_col, new_col)
    return df

old_columns = ['old_name1', 'old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)
Be careful, both lists must be the same length.
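If you want to guard against mismatched lengths, a one-line check could be added at the top of the function (my addition, not part of the original answer):
# Fail fast if the two name lists cannot be zipped one-to-one
assert len(old_columns) == len(new_columns), "old_columns and new_columns must have the same length"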
Another way to rename just one column (using import pyspark.sql.functions as F):
df = df.select( '*', F.col('count').alias('new_count') ).drop('count')
Method 1:
df = df.withColumnRenamed("old_column_name", "new_column_name")
Method 2:
If you want to do some computation and rename the new values:
df = df.withColumn("new_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name")))
df = df.drop("old_column_name")
You can use the following function to rename all the columns of your dataframe.
def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    import pyspark.sql.functions as F
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X
In case you need to update only a few column names, keep the unchanged names the same in the replace_with list.
To rename all columns
df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])
To rename some columns
df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])
We can use col.alias to rename a column:
from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()
We can use various approaches to rename the column name.
First, let create a simple DataFrame.
df = spark.createDataFrame([("x", 1), ("y", 2)],
["col_1", "col_2"])
Now let's try to rename col_1 to col_3. Below are a few approaches to do the same.
# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()
# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col3"), "col_2").show()
# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()
# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()
Here is the output.
+-----+-----+
|col_3|col_2|
+-----+-----+
| x| 1|
| y| 2|
+-----+-----+
I hope this helps.
A way that you can use 'alias' to change the column name:
col('my_column').alias('new_name')
Another way that you can use 'alias' (possibly not mentioned):
df.my_column.alias('new_name')
You can put it in a for loop, and use zip to pair each column name across the two arrays.
new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]
new_df = df
for old, new in zip(df.columns, new_name):
    new_df = new_df.withColumnRenamed(old, new)
I like to use a dict to rename the df.
rename = {'old1': 'new1', 'old2': 'new2'}
for col in df.schema.names:
    df = df.withColumnRenamed(col, rename.get(col, col))  # keep the name if it is not in the dict
For a single column rename, you can still use toDF(). For example,
df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()
There are multiple approaches you can use:
from pyspark.sql.functions import col
df1 = df.withColumn("new_column", col("old_column")).drop("old_column")
df1 = df.withColumn("new_column", col("old_column"))
df1 = df.select(col("old_column").alias("new_column"))
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

CreatingDataFrame = [("James","Sales","NY",90000,34,10000),
                     ("Michael","Sales","NY",86000,56,20000),
                     ("Robert","Sales","CA",81000,30,23000),
                     ("Maria","Finance","CA",90000,24,23000),
                     ("Raman","Finance","CA",99000,40,24000),
                     ("Scott","Finance","NY",83000,36,19000),
                     ("Jen","Finance","NY",79000,53,15000),
                     ("Jeff","Marketing","CA",80000,25,18000),
                     ("Kumar","Marketing","NY",91000,50,21000)
                    ]

schema = StructType([
    StructField("employee_name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("state", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("age", IntegerType(), True),   # ages are integers in the data above
    StructField("bonus", IntegerType(), True)
])

OurData = spark.createDataFrame(data=CreatingDataFrame, schema=schema)
OurData.show()

GrouppedBonusData = OurData.groupBy("department").sum("bonus")
GrouppedBonusData.show()
GrouppedBonusData.printSchema()

from pyspark.sql.functions import col
BonusColumnRenamed = GrouppedBonusData.select(col("department").alias("department"), col("sum(bonus)").alias("Total_Bonus"))
BonusColumnRenamed.show()

GrouppedBonusData.groupBy("department").count().show()

GrouppedSalaryData = OurData.groupBy("department").sum("salary")
GrouppedSalaryData.show()

SalaryColumnRenamed = GrouppedSalaryData.select(col("department").alias("Department"), col("sum(salary)").alias("Total_Salary"))
SalaryColumnRenamed.show()
Try the following method; it allows you to rename the columns of multiple files.
Reference: https://www.linkedin.com/pulse/pyspark-methods-rename-columns-kyle-gibson/
from pyspark.sql.functions import col

# Rename the columns of a single dataframe with a dict of old -> new names
df_initial = spark.read.load('com.databricks.spark.csv')

rename_dict = {
    'Alberto': 'Name',
    'Dakota': 'askdaosdka'
}

df_renamed = df_initial \
    .select([col(c).alias(rename_dict.get(c, c)) for c in df_initial.columns])

# The same idea packaged as a function and applied with transform
def renameColumns(df):
    rename_dict = {
        'FName': 'FirstName',
        'LName': 'LastName',
        'DOB': 'BirthDate'
    }
    return df.select([col(c).alias(rename_dict.get(c, c)) for c in df.columns])

df_renamed = spark.read.load('/mnt/datalake/bronze/testData') \
    .transform(renameColumns)
The simplest solution is using withColumnRenamed:
renamed_df = df.withColumnRenamed('name_1', 'New_name_1').withColumnRenamed('name_2', 'New_name_2')
renamed_df.show()
And if you would like to do this like we do with pandas, you can use toDF:
Create an ordered list of the new column names and pass it to toDF:
df_list = ["newName_1", "newName_2", "newName_3", "newName_4"]
renamed_df = df.toDF(*df_list)
renamed_df.show()
This is an easy way to rename multiple columns with a loop:
cols_to_rename = ["col1","col2","col3"]
for col in cols_to_rename:
    df = df.withColumnRenamed(col, "new_{}".format(col))
List comprehension + f-string:
df = df.toDF(*[f'n_{c}' for c in df.columns])
Simple list comprehension:
df = df.toDF(*[c.lower() for c in df.columns])
The closest statement to df.columns = new_column_name_list is:
import pyspark.sql.functions as F
df = df.select(*[F.col(name_old).alias(name_new)
                 for (name_old, name_new)
                 in zip(df.columns, new_column_name_list)])
This doesn't require any rarely-used functions, and emphasizes some patterns that are very helpful in Spark. You could also break up the steps if you find this one-liner to be doing too many things:
import pyspark.sql.functions as F
column_mapping = [F.col(name_old).alias(name_new)
                  for (name_old, name_new)
                  in zip(df.columns, new_column_name_list)]
df = df.select(*column_mapping)

Create Spark DataFrame. Can not infer schema for type

Could someone help me solve this problem I have with Spark DataFrame?
When I do myFloatRDD.toDF() I get an error:
TypeError: Can not infer schema for type: <type 'float'>
I don't understand why...
Example:
myFloatRdd = sc.parallelize([1.0,2.0,3.0])
df = myFloatRdd.toDF()
Thanks
SparkSession.createDataFrame, which is used under the hood, requires an RDD / list of Row/tuple/list/dict* or pandas.DataFrame, unless schema with DataType is provided. Try to convert float to tuple like this:
myFloatRdd.map(lambda x: (x, )).toDF()
or even better:
from pyspark.sql import Row
row = Row("val") # Or some other column name
myFloatRdd.map(row).toDF()
To create a DataFrame from a list of scalars you'll have to use SparkSession.createDataFrame directly and provide a schema***:
from pyspark.sql.types import FloatType
df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())
df.show()
## +-----+
## |value|
## +-----+
## | 1.0|
## | 2.0|
## | 3.0|
## +-----+
but for a simple range it would be better to use SparkSession.range:
from pyspark.sql.functions import col
spark.range(1, 4).select(col("id").cast("double"))
* No longer supported.
** Spark SQL also provides a limited support for schema inference on Python objects exposing __dict__.
*** Supported only in Spark 2.0 or later.
from pyspark.sql.types import IntegerType, Row
mylist = [1, 2, 3, 4, None ]
l = map(lambda x : Row(x), mylist)
# notice the parens after the type name
df=spark.createDataFrame(l,["id"])
df.where(df.id.isNull() == False).show()
Basically, you need to wrap your values in Row() first; then you can supply the schema.
Inferring the Schema Using Reflection
from pyspark.sql import Row
# spark - sparkSession
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
orders = sc.textFile("/practicedata/orders")
#Split on delimiters
parts = orders.map(lambda l: l.split(","))
#Convert to Row
orders_struct = parts.map(lambda p: Row(order_id=int(p[0]), order_date=p[1], customer_id=p[2], order_status=p[3]))
for i in orders_struct.take(5): print(i)
#convert the RDD to DataFrame
orders_df = spark.createDataFrame(orders_struct)
Programmatically Specifying the Schema
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType
# spark - sparkSession
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
orders = sc.textFile("/practicedata/orders")
#Split on delimiters
parts = orders.map(lambda l: l.split(","))
#Convert to tuple
orders_struct = parts.map(lambda p: (p[0], p[1], p[2], p[3].strip()))
#convert the RDD to DataFrame
orders_df = spark.createDataFrame(orders_struct)
# The schema is encoded in a string.
schemaString = "order_id order_date customer_id status"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

ordersDf = spark.createDataFrame(orders_struct, schema)
from pyspark.sql import Row
myFloatRdd.map(lambda x: Row(x)).toDF()
