How to union Spark SQL DataFrames in Python

Here are several ways of creating a union of dataframes. Which (if any) is best/recommended when we are talking about big dataframes? Should I create an empty dataframe first, or continuously union onto the first dataframe created?
Empty Dataframe creation
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("A", StringType(), False),
    StructField("B", StringType(), False),
    StructField("C", StringType(), False)
])
pred_union_df = spark_context.parallelize([]).toDF(schema)
Method 1 - Union as you go:
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    pred_union_df = pred_union_df.union(pred[['A', 'B', 'C']])
Method 2 - Union at the end:
all_pred = []
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    all_pred.append(pred)
pred_union_df = pred_union_df.union(all_pred)
Or do I have it all wrong?
Edit:
Method 2 did not work the way I expected based on this answer; I had to loop through the list and union each dataframe.

Method 2 is always preferred since it avoids the long-lineage issue.
Although DataFrame.union only takes one DataFrame as an argument, SparkContext.union does accept a list of RDDs. Given your sample code, you could try to union them before calling toDF.
If your data is on disk, you could also try to load them all at once to achieve union, e.g.,
dataframe = spark.read.csv([path1, path2, path3])
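For example, a minimal sketch of what the edited Method 2 looks like in practice (reusing the question's hypothetical helpers), collecting the per-indication dataframes first and then folding them together with functools.reduce:
from functools import reduce
from pyspark.sql import DataFrame

# build the list of prediction dataframes first, then union them in one pass at the end
all_pred = [
    get_predictions(get_fitted_model(pipeline, train_balanced_df, ind), pred_output_df, ind)
        .select('A', 'B', 'C')
    for ind in indications
]
pred_union_df = reduce(DataFrame.union, all_pred)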

Related

Ambiguous columns error in pyspark while iteratively joining dataframes

I am currently writing code to left-join two dataframes multiple times, iteratively, based on a set of columns corresponding to the two dataframes on each iteration. For one iteration it works fine, but on the second iteration I get an ambiguous-columns error.
This is the sample data I am working with:
sample_data = [("Amit","","Gupta","36678","M",4000),
("Anita","Mathews","","40299","F",5000),
("Ram","","Aggarwal","42124","M",5000),
("Pooja","Anne","Goel","39298","F",5000),
("Geeta","Banuwala","Brown","12345","F",-2)
]
sample_schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])
df1 = spark.createDataFrame(data = sample_data, schema = sample_schema)
sample_data = [("Amit", "ABC","MTS","36678",10),
("Ani", "DEF","CS","40299",200),
("Ram", "ABC","MTS","421",40),
("Pooja", "DEF","CS","39298",50),
("Geeta", "ABC","MTS","12345",-20)
]
sample_schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("Company", StringType(), True),
    StructField("position", StringType(), True),
    StructField("id", StringType(), True),
    StructField("points", IntegerType(), True)
])
df2 = spark.createDataFrame(data = sample_data, schema = sample_schema)
The code I used for this is
def joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep):
    resultant_df = None
    df1_cols = df1.columns
    df2 = df2.withColumn("flag", lit(True))
    for i in range(len(cols_to_join)):
        joined_df = df1.join(df2, [(df1[col_1] == df2[col_2]) for col_1, col_2 in cols_to_join[i].items()], 'left')
        joined_df = joined_df.select(*[df1[column] if column in cols_df1_to_keep else df2[column] for column in cols_df1_to_keep + cols_df2_to_keep])
        df1 = (joined_df
               .filter("flag is NULL")
               .select(df1_cols)
               )
        resultant_df = (joined_df.filter(col("flag") == True) if i == 0
                        else resultant_df.filter(col("flag") == True).union(resultant_df)
                        )
    return resultant_df
cols_to_join = [{"id": "id"}, {"firstname":"firstname"}]
cols_df1_to_keep = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
cols_df2_to_keep = ["company", "position", "points"]
x = joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep)
It works fine for a single iteration, but on the second iteration, when the remaining rows that were not joined on column "id" are joined again on column "firstname", it throws the following error:
Column position#29518, company#29517, points#29520 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
This is an example of how you can do an OR conditional join:
df1.join(df2, on=(df1.id == df2.id) | (df1.firstname == df2.firstname), how='left')
To make the condition dynamic, you can use reduce to chain the conditions.
from functools import reduce
from pyspark.sql import functions as F

def chain_join_cond(prev, value):
    (lcol, rcol) = list(value.items())[0]
    return prev | (df1[lcol] == df2[rcol])

# If your condition is OR, use False for the initial condition.
# If your condition is AND, use True for the initial condition (and use & to concatenate the conditions).
cond = reduce(chain_join_cond, cols_to_join, F.lit(False))
# Use the cond for `on` option in join.
# df1.join(df2, on=cond, how='left')
Then, to get a specific column set from df1 or df2, use list comprehensions to generate the select statement:
df = (df1.join(df2, on=cond, how='left')
      .select(*[df1[c] for c in cols_df1_to_keep], *[df2[c] for c in cols_df2_to_keep]))
If you have cols_to_join as tuples instead of dicts, you can simplify the code slightly:
cols_to_join = [("id", "id"), ("firstname", "firstname")]
cond = reduce(lambda p, v: p | (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(False))
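As a small follow-up sketch of the AND variant mentioned in the comments above (my own illustration, not part of the original answer): start from lit(True) and chain the equality checks with &.
cond_and = reduce(lambda p, v: p & (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(True))
# df1.join(df2, on=cond_and, how='left')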

filter on the pyspark dataframe schema to get new dataframe with columns having specific type

I want to create a generic function in pyspark that takes a dataframe and a datatype as parameters and filters out the columns that do not satisfy the criteria. I am not very good at Python and I am stuck at the point where I cannot figure out how to do that.
I have a Scala version of the code that does the same thing:
// sample data
val df = Seq(("587","mumbai",Some(5000),5.05),("786","chennai",Some(40000),7.055),("432","Gujarat",Some(20000),6.75),("2","Delhi",None,10.0))
  .toDF("Id","City","Salary","Increase")
  .withColumn("RefID", $"Id")

import org.apache.spark.sql.functions.col

def selectByType(colType: DataType, df: DataFrame) = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols:_*)
}

val res = selectByType(IntegerType, df)
res is the dataframe that has only the integer columns, in this case the salary column, and all the other columns with different types have been dropped dynamically.
I want the same behaviour in pyspark, but I am not able to accomplish it.
This is what I have tried:
# sample data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("raise", DoubleType(), True)
])
data2 = [("James","","Smith","36636","M",3000,2.5),
         ("Michael","Rose","","40288","M",4000,4.7),
         ("Robert","","Williams","42114","M",4000,8.9),
         ("Maria","Anne","Jones","39192","F",4000,0.0),
         ("Jen","Mary","Brown","","F",-1,-1.2)
        ]
df = spark.createDataFrame(data=data2, schema=schema)
# getting the column list from the schema of the dataframe
pschema = df.schema.fields
datatypes = [IntegerType, DoubleType]  # column datatypes that I want
out = filter(lambda x: x.dataType.isin(datatypes), pschema)  # gives invalid syntax error
Can someone help me understand what I am doing wrong? The Scala code only passes a single datatype, but for my use case I want to handle the scenario in which we can pass multiple datatypes and get back a dataframe with only the columns of those specified datatypes.
Initially, if someone can give me an idea of how to make it work for a single datatype, I can then try to do the same for multiple datatypes.
Note: the sample data for Scala and PySpark is different because I copied the PySpark sample data from somewhere else just to speed things up; I am only concerned about the final output requirement.
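A minimal sketch of one possible PySpark equivalent (an illustration, not an answer from the original thread), assuming the df built above: filter df.schema.fields by the requested type classes and select only the matching columns.
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, DoubleType

def select_by_types(df, *col_types):
    # keep only the fields whose dataType is an instance of one of the requested types
    cols = [col(f.name) for f in df.schema.fields if isinstance(f.dataType, col_types)]
    return df.select(*cols)

res = select_by_types(df, IntegerType, DoubleType)  # keeps "salary" and "raise"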

How to read in JSON so each element of dict/hash is a new row in dataframe?

I'm attempting to read a large dataset written in JSON into a dataframe.
A minimal working example of the JSON:
{"X":{"sex":"Male","age":57,"BMI":"19.7"},"XX":{"BMI":"30.7","age":44,"sex":"Female"},"XXX":{"age":18,"sex":"Female","BMI":"22.3"},"XXXX":{"sex":"Male","age":29,"BMI":"25.7"},"ZZZ":{"sex":"Male","age":61,"BMI":"40.5"}}
However, the dataset is not being read correctly, as it should have about 10,999 elements, and I'm only getting 1.
The JSON is a hash/dict where each element should be a new row.
I've tried:
df = spark.read.json("dbfs:/FileStore/shared_uploads/xyz/data.json")
df = spark.read.option("multiline", "true").json("dbfs:/FileStore/shared_uploads/xyz/data.json")
I've also tried inferSchema, but this doesn't interpret the schema even close to correctly: I still get 1 row.
I also made a custom schema, where each field is a sub-key of each row, e.g.:
custom_schema = StructType([
    StructField('Admission_Date', StringType(), True),
    StructField('BMI', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('latest_date', StringType(), True),
    ...
    StructField('sex', StringType(), True)
])
and then load with the custom schema:
df = spark.read.option("multiline", "true").schema(custom_schema).json("dbfs:/FileStore/shared_uploads/xyz/data.json")
but this again yields a single row.
How can I load this JSON so that every key is considered a single row?
You can create an array column from all the dataframe columns, explode it, and star-expand the resulting struct column:
from pyspark.sql import functions as F
df1 = df.select(
    F.explode(F.array(*df.columns)).alias("rows")
).select("rows.*")
df1.show()
#+----+---+------+
#| BMI|age| sex|
#+----+---+------+
#|19.7| 57| Male|
#|30.7| 44|Female|
#|22.3| 18|Female|
#|25.7| 29| Male|
#|40.5| 61| Male|
#+----+---+------+
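For completeness, a brief usage sketch of how this fits with the question's read step (assuming the whole file is a single JSON object, so the multiline read yields one wide row that the explode then turns into one row per key):
df = spark.read.option("multiline", "true").json("dbfs:/FileStore/shared_uploads/xyz/data.json")
df1 = df.select(F.explode(F.array(*df.columns)).alias("rows")).select("rows.*")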

Pyspark / Dataframe: Add new column that keeps nested list as nested list

I have a basic question about dataframes and adding a column that should contain a nested list. This is basically the problem:
from pyspark.sql import Row

b = [[['url.de'],['name']],[['url2.de'],['name2']]]
a = sc.parallelize(b)
a = a.map(lambda p: Row(URL=p[0], name=p[1]))
df = sqlContext.createDataFrame(a)

list1 = [[['a','s', 'o'],['hallo','ti']],[['a','s', 'o'],['hallo','ti']]]
c = [b[0] + [list1[0]], b[1] + [list1[1]]]

# Output looks like this:
# [[['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]],
#  [['url2.de'], ['name2'], [['a', 's', 'o'], ['hallo', 'ti']]]]
To create a new dataframe from this output, I'm trying to create a new schema:
schema = df.withColumn('NewColumn',array(lit("10"))).schema
I then use it to create the new DataFrame:
df = sqlContext.createDataFrame(c,schema)
df.map(lambda x: x).collect()
#Output
[Row(URL=[u'url.de'], name=[u'name'], NewColumn=[u'[a, s, o]', u'[hallo, ti]']),
Row(URL=[u'url2.de'], name=[u'name2'], NewColumn=[u'[a, s, o]', u'[hallo, ti]'])]
The problem now is that the nested list was transformed into a list with two unicode string entries instead of keeping the original format.
I think this is due to my definition of the new column: "... array(lit("10"))".
What do I have to use in order to keep the original format?
You can directly inspect the schema of the dataframe by calling df.schema. You can see that in the given scenario we have the following:
StructType(
    List(
        StructField(URL,ArrayType(StringType,true),true),
        StructField(name,ArrayType(StringType,true),true),
        StructField(NewColumn,ArrayType(StringType,false),false)
    )
)
The NewColumn that you added is an ArrayType column whose entries are all StringType, so anything contained in the array will be converted to a string, even if it is itself an array. If you want nested arrays (two layers), then you need to change your schema so that the NewColumn field has an ArrayType(ArrayType(StringType,False),False) type. You can do this by explicitly defining the schema:
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("URL", ArrayType(StringType(), True), True),
    StructField("name", ArrayType(StringType(), True), True),
    StructField("NewColumn", ArrayType(ArrayType(StringType(), False), False), False)
])
Or by changing your code so that NewColumn is defined by nesting the array function, array(array()):
df.withColumn('NewColumn',array(array(lit("10")))).schema
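As a brief usage sketch (my own check, assuming the c list built in the question): recreating the dataframe with the explicit nested schema keeps the nested lists intact.
df = sqlContext.createDataFrame(c, schema)
df.collect()
# roughly:
# [Row(URL=[u'url.de'], name=[u'name'], NewColumn=[[u'a', u's', u'o'], [u'hallo', u'ti']]),
#  Row(URL=[u'url2.de'], name=[u'name2'], NewColumn=[[u'a', u's', u'o'], [u'hallo', u'ti']])]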

Correctly reading the types from file in PySpark

I have a tab-separated file containing lines as
id1 name1 ['a', 'b'] 3.0 2.0 0.0 1.0
that is, an id, a name, a list with some strings, and a series of 4 float attributes.
I am reading this file as
rdd = sc.textFile('myfile.tsv') \
    .map(lambda row: row.split('\t'))
df = sqlc.createDataFrame(rdd, schema)
where I give the schema as
schema = StructType([
    StructField('id', StringType(), True),
    StructField('name', StringType(), True),
    StructField('list', ArrayType(StringType()), True),
    StructField('att1', FloatType(), True),
    StructField('att2', FloatType(), True),
    StructField('att3', FloatType(), True),
    StructField('att4', FloatType(), True)
])
The problem is that neither the list nor the attributes get read properly, judging from a collect on the DataFrame; in fact, I get None for all of them:
Row(id=u'id1', brand_name=u'name1', list=None, att1=None, att2=None, att3=None, att4=None)
What am I doing wrong?
It is read properly; it just doesn't work the way you expect. The schema argument declares the types in order to avoid expensive schema inference, not how to cast the data. Providing input that matches the declared schema is your responsibility.
This can also be handled by the data source (take a look at spark-csv and its inferSchema option), though it won't handle complex types like arrays.
Since your schema is mostly flat and you know the types, you can try something like this:
df = rdd.toDF([f.name for f in schema.fields])

exprs = [
    # you should exclude casting
    # on other complex types as well
    col(f.name).cast(f.dataType) if f.dataType.typeName() != "array"
    else col(f.name)
    for f in schema.fields
]
df.select(*exprs)
and handle the complex types separately using string-processing functions or UDFs. Alternatively, since you read the data in Python anyway, just enforce the desired types before you create the DataFrame.
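For example, a rough sketch of that last option under the assumptions above (plain Python parsing before createDataFrame; the list column is parsed with ast.literal_eval, which may need adjusting to your real file format):
import ast

def parse(fields):
    id_, name, lst, a1, a2, a3, a4 = fields
    # convert each field to the Python type the declared schema expects
    return (id_, name, ast.literal_eval(lst), float(a1), float(a2), float(a3), float(a4))

rdd = sc.textFile('myfile.tsv').map(lambda row: parse(row.split('\t')))
df = sqlc.createDataFrame(rdd, schema)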
