Add RDD to DataFrame Column in PySpark

I want to create a DataFrame from the columns of two RDDs.
The first is an RDD that I read from a CSV file; the second is another RDD holding the cluster prediction for each row.
My schema is:
customSchema = StructType([ \
    StructField("Area", FloatType(), True), \
    StructField("Perimeter", FloatType(), True), \
    StructField("Compactness", FloatType(), True), \
    StructField("Lenght", FloatType(), True), \
    StructField("Width", FloatType(), True), \
    StructField("Asymmetry", FloatType(), True), \
    StructField("KernelGroove", FloatType(), True)])
I map my RDD and create the DataFrame:
FN2 = rdd.map(lambda x: (float(x[0]), float(x[1]),float(x[2]),float(x[3]),float(x[4]),float(x[5]),float(x[6])))
df = sqlContext.createDataFrame(FN2, customSchema)
And my cluster prediction:
result = Kmodel.predict(rdd)
To sum up, I want a DataFrame with the rows of my CSV and their cluster prediction as a final column.
I tried to add a new column with .withColumn() but got nothing.
Thanks.

If you have a common field in both DataFrames, join on that key; otherwise create a unique id and join the two DataFrames on it to get the CSV rows and their cluster prediction in a single DataFrame.
Below is Scala code that generates a unique id for each row; try converting it to PySpark. You need to generate an increasing row id and join the two on that id.
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(("abc", 2), ("def", 1), ("hij", 3))).toDF("word", "count")
val wcschema = df.schema
val inputRows = df.rdd.zipWithUniqueId.map{
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
val wcID = sqlContext.createDataFrame(inputRows, StructType(StructField("id", LongType, false) +: wcschema.fields))
Or use a SQL query:
val tmpTable1 = sqlContext.sql("select row_number() over (order by count) as rnk,word,count from wordcount")
tmpTable1.show()
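For reference, a rough PySpark sketch of the same zipWithUniqueId idea (a conversion of the Scala snippet above, not a tested answer; sqlContext is assumed as in the question):
from pyspark.sql.types import StructType, StructField, LongType

# Same wordcount example as the Scala snippet: prepend a unique row id.
df = sqlContext.createDataFrame([("abc", 2), ("def", 1), ("hij", 3)], ["word", "count"])
wcschema = df.schema

input_rows = df.rdd.zipWithUniqueId().map(lambda pair: (pair[1],) + tuple(pair[0]))
wcID = sqlContext.createDataFrame(
    input_rows,
    StructType([StructField("id", LongType(), False)] + wcschema.fields),
)
wcID.show()
Doing the same on the prediction RDD and joining the two DataFrames on id should give the CSV rows with their cluster prediction in one DataFrame. Note the ids only line up if both RDDs keep the same ordering and partitioning; zipWithIndex() is the safer choice when that matters.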

Related

Ambiguous columns error in pyspark while iteratively joining dataframes

I am currently writing code to left-join two dataframes multiple times, iteratively, based on a set of columns corresponding to the two dataframes on each iteration. For one iteration it works fine, but on the second iteration I get an ambiguous columns error.
These are the sample dataframes I am working with:
sample_data = [("Amit","","Gupta","36678","M",4000),
("Anita","Mathews","","40299","F",5000),
("Ram","","Aggarwal","42124","M",5000),
("Pooja","Anne","Goel","39298","F",5000),
("Geeta","Banuwala","Brown","12345","F",-2)
]
sample_schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("id", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True)
])
df1 = spark.createDataFrame(data = sample_data, schema = sample_schema)
sample_data = [("Amit", "ABC","MTS","36678",10),
("Ani", "DEF","CS","40299",200),
("Ram", "ABC","MTS","421",40),
("Pooja", "DEF","CS","39298",50),
("Geeta", "ABC","MTS","12345",-20)
]
sample_schema = StructType([
StructField("firstname",StringType(),True),
StructField("Company",StringType(),True),
StructField("position",StringType(),True),
StructField("id", StringType(), True),
StructField("points", IntegerType(), True)
])
df2 = spark.createDataFrame(data = sample_data, schema = sample_schema)
The code I used for this is
from pyspark.sql.functions import col, lit

def joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep):
    resultant_df = None
    df1_cols = df1.columns
    df2 = df2.withColumn("flag", lit(True))

    for i in range(len(cols_to_join)):
        joined_df = df1.join(df2, [(df1[col_1] == df2[col_2]) for col_1, col_2 in cols_to_join[i].items()], 'left')
        joined_df = joined_df.select(*[df1[column] if column in cols_df1_to_keep else df2[column] for column in cols_df1_to_keep + cols_df2_to_keep])
        df1 = (joined_df
               .filter("flag is NULL")
               .select(df1_cols)
               )
        resultant_df = (joined_df.filter(col("flag") == True) if i == 0
                        else resultant_df.filter(col("flag") == True).union(resultant_df)
                        )
    return resultant_df
cols_to_join = [{"id": "id"}, {"firstname":"firstname"}]
cols_df1_to_keep = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
cols_df2_to_keep = ["company", "position", "points"]
x = joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep)
It works fine for a single run, but on the second iteration, when the remaining rows that were not joined on column "id" are joined again on column "firstname", it throws the following error:
Column position#29518, company#29517, points#29520 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
This is an example of how you can do an OR conditional join.
df1.join(df2, on=(df1.id == df2.id) | (df1.firstname == df2.firstname), how='left')
To make the condition dynamic, you can use reduce to chain the conditions.
from functools import reduce
from pyspark.sql import functions as F

def chain_join_cond(prev, value):
    (lcol, rcol) = list(value.items())[0]
    return prev | (df1[lcol] == df2[rcol])
# If your condition is OR, use False for initial condition.
# If your condition is AND, use True for the initial condition (and use & to concatenate the conditions).
cond = reduce(chain_join_cond, cols_to_join, F.lit(False))
# Use the cond for `on` option in join.
# df1.join(df2, on=cond, how='left')
Then to get a specific column set from df1 or df2 use list comprehensions to generate the select statement.
df = (df1.join(df2, on=cond, how='left')
.select(*[df1[c] for c in cols_df1_to_keep], *[df2[c] for c in cols_df2_to_keep]))
If you have cols_to_join as tuples instead of dicts, you can simplify the code slightly.
cols_to_join = [("id", "id"), ("firstname", "firstname")]
cond = reduce(lambda p, v: p | (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(False))
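Putting it together with the tuple form, the whole join-and-select might look like this (a sketch reusing the variables defined above; cols_df2_to_keep uses lowercase "company" while df2 defines "Company", which works because Spark resolves column names case-insensitively by default):
from functools import reduce
from pyspark.sql import functions as F

cols_to_join = [("id", "id"), ("firstname", "firstname")]
cond = reduce(lambda p, v: p | (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(False))

result = (
    df1.join(df2, on=cond, how="left")
       .select(*[df1[c] for c in cols_df1_to_keep],
               *[df2[c] for c in cols_df2_to_keep])
)
result.show()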

Iterate through each column and find the max length

I want to get the maximum length of each column of a PySpark dataframe.
Following is the sample dataframe:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000),
("Michael","Rose","","40288","M",4000),
("Robert","","Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1)
]
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
])
df = spark.createDataFrame(data=data2,schema=schema)
I tried to implement the solution provided in Scala but could not convert it.
This would work
from pyspark.sql.functions import col, length, max
df=df.select([max(length(col(name))) for name in df.schema.names])
Result
Edit: For reference, converting the result to rows (as asked here and updated there as well: pyspark max string length for each column in the dataframe):
from pyspark.sql import Row

df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row = df.first().asDict()
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
Output:

filter on the pyspark dataframe schema to get new dataframe with columns having specific type

I want to create a generic function in PySpark that takes a dataframe and a datatype as parameters and filters out the columns that do not satisfy the criteria. I am not very good at Python and I am stuck at the point where I cannot figure out how to do that.
I have a Scala version of the code that does the same thing.
//sample data
val df = Seq(("587","mumbai",Some(5000),5.05),("786","chennai",Some(40000),7.055),("432","Gujarat",Some(20000),6.75),("2","Delhi",None,10.0)).toDF("Id","City","Salary","Increase").withColumn("RefID",$"Id")
import org.apache.spark.sql.functions.col
def selectByType(colType: DataType, df: DataFrame) = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols:_*)
}
val res = selectByType(IntegerType, df)
res is the dataframe that has only the integer columns, in this case the salary column; all the other columns with different types have been dropped dynamically.
I want the same behaviour in PySpark but I am not able to accomplish it.
This is what I have tried
# sample data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([ \
    StructField("firstname", StringType(), True), \
    StructField("middlename", StringType(), True), \
    StructField("lastname", StringType(), True), \
    StructField("id", StringType(), True), \
    StructField("gender", StringType(), True), \
    StructField("salary", IntegerType(), True), \
    StructField("raise", DoubleType(), True) \
])

data2 = [("James","","Smith","36636","M",3000,2.5),
         ("Michael","Rose","","40288","M",4000,4.7),
         ("Robert","","Williams","42114","M",4000,8.9),
         ("Maria","Anne","Jones","39192","F",4000,0.0),
         ("Jen","Mary","Brown","","F",-1,-1.2)
        ]
df = spark.createDataFrame(data=data2, schema=schema)

# getting the column list from the schema of the dataframe
pschema = df.schema.fields
datatypes = [IntegerType, DoubleType]  # column datatypes that I want
out = filter(lambda x: x.dataType.isin(datatypes), pschema)  # gives invalid syntax error
Can someone help me understand what I am doing wrong? The Scala code only passes a single datatype, but for my use case I want to handle passing multiple datatypes and get a dataframe with the columns of those specified datatypes.
If someone can first give me an idea of how to make it work for a single datatype, I can try to do the same for multiple datatypes.
Note: the sample data for Scala and PySpark is different because I copied the PySpark sample data from somewhere just to speed things up; I am only concerned about the final output requirement.
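One approach that might work in PySpark (a sketch, not a verified answer): check each schema field's dataType with isinstance instead of calling .isin on it. The selectByType name below just mirrors the Scala version.
from pyspark.sql.types import IntegerType, DoubleType

def selectByType(df, datatypes):
    # Keep only the columns whose dataType is an instance of one of the given classes.
    wanted = tuple(datatypes)
    cols = [f.name for f in df.schema.fields if isinstance(f.dataType, wanted)]
    return df.select(*cols)

res = selectByType(df, [IntegerType, DoubleType])
res.show()  # for the sample data above this should keep only "salary" and "raise"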

How to read in JSON so each element of dict/hash is a new row in dataframe?

I'm attempting to read a large dataset written in JSON into a dataframe.
A minimal working example of this dataframe:
{"X":{"sex":"Male","age":57,"BMI":"19.7"},"XX":{"BMI":"30.7","age":44,"sex":"Female"},"XXX":{"age":18,"sex":"Female","BMI":"22.3"},"XXXX":{"sex":"Male","age":29,"BMI":"25.7"},"ZZZ":{"sex":"Male","age":61,"BMI":"40.5"}}
However, the dataset is not being read correctly, as it should have about 10,999 elements, and I'm only getting 1.
The JSON is a hash/dict where each element should be a new row.
I've tried
df = spark.read.json("dbfs:/FileStore/shared_uploads/xyz/data.json")
df = spark.read.option("multiline", "true").json("dbfs:/FileStore/shared_uploads/xyz/data.json")
I've also tried inferSchema, but it doesn't interpret the schema even close to correctly: I still get 1 row.
I also made a custom schema, where each field is a sub-key of each row, e.g.
e.g.
custom_schema = StructType([
    StructField('Admission_Date', StringType(), True),
    StructField('BMI', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('latest_date', StringType(), True),
    ...
    StructField('sex', StringType(), True)
])
and then load with the custom schema:
df = spark.read.option("multiline", "true").schema(custom_schema).json("dbfs:/FileStore/shared_uploads/xyz/data.json")
but this again yields a single row.
How can I load this JSON so that every key is considered a single row?
You can create an array column from all the dataframe columns, explode it, and star-expand the resulting struct column:
from pyspark.sql import functions as F
df1 = df.select(
    F.explode(F.array(*df.columns)).alias("rows")
).select("rows.*")

df1.show()
#+----+---+------+
#| BMI|age| sex|
#+----+---+------+
#|19.7| 57| Male|
#|30.7| 44|Female|
#|22.3| 18|Female|
#|25.7| 29| Male|
#|40.5| 61| Male|
#+----+---+------+
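If you also want to keep the top-level key ("X", "XX", ...) as a column, a variant of the same idea is to wrap each column in a struct carrying its name before exploding (a hedged sketch; the id and v aliases are just illustrative):
from pyspark.sql import functions as F

df2 = df.select(
    F.explode(F.array(*[
        F.struct(F.lit(c).alias("id"), F.col(c).alias("v")) for c in df.columns
    ])).alias("kv")
).select("kv.id", "kv.v.*")
df2.show()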

PySpark: Cannot create small dataframe

I'm trying to create a small dataframe so that I can save two scalars (doubles) and a string.
Following How to create spark dataframe with column name which contains dot/period?:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
input_data = ([output_stem, paired_p_value, scalar_pearson])
schema = StructType([StructField("Comparison", StringType(), False), \
                     StructField("Paired p-value", DoubleType(), False), \
                     StructField("Pearson coefficient", DoubleType(), True)])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)
producing the error message:
TypeError: StructType can not accept object 's3://sanford-biofx-dev/con/dev3/dev' in type <class 'str'>
which doesn't make any sense to me; this column was meant for strings.
My other solution is from
Add new rows to pyspark Dataframe:
columns = ["comparison", "paired p", "Pearson coefficient"]
vals = [output_stem, paired_p_value, scalar_pearson]
df = spark.createDataFrame(vals, columns)
display(df)
but this gives an error: TypeError: Can not infer schema for type: <class 'str'>
I just want a small dataframe:
comparison | paired p-value | Pearson Coefficient
-------------------------------------------------
s3://sadf | 0.045 | -0.039
The solution is to put a trailing comma at the end of input_data, thanks to user #10465355 says Reinstate Monica:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
input_data = ([output_stem, paired_p_value, scalar_pearson],)
schema = StructType([StructField("Comparison", StringType(), False), \
                     StructField("Paired p-value", DoubleType(), False), \
                     StructField("Pearson coefficient", DoubleType(), True)])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)
I don't understand why this comma is necessary, or what it does, but it seems to do the job. (It turns input_data into a one-element tuple, so createDataFrame sees a single row of three values instead of a plain list of three scalars, each of which it would otherwise try to treat as its own row.)
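A small illustration of what the comma changes, in plain Python (values are placeholders):
row = ["s3://bucket/path", 0.045, -0.039]

without_comma = (row)   # parentheses do nothing: still a list of 3 scalars,
                        # so createDataFrame treats each scalar as its own row
with_comma = (row,)     # a 1-element tuple: one row containing 3 values
as_list = [row]         # equivalent and arguably clearer

# sqlContext.createDataFrame(with_comma, schema)  # works
# sqlContext.createDataFrame(as_list, schema)     # works too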
