Spark SQL Dataset: Split Multiple Array Columns to individual rows - python

I'm new to Spark SQL and the Dataset / DataFrame API.
My Dataset has two columns that each hold multiple values (arrays).
I want to step through the arrays in each row positionally and output a new row for each pair of corresponding entries. The two tables below show the input and the expected output.
For example:
Input dataframe / dataset
+---+---------+-----+
| id| le|leloc|
+---+---------+-----+
| 1|[aaa,bbb]|[1,2]|
| 2|[ccc,ddd]|[3,4]|
+---+---------+-----+
Expected Output dataset
I need the output below, where each array position becomes its own row:
+---+---------+-----+
| id| le|leloc|
+---+---------+-----+
| 1|aaa |1 |
| 1|bbb |2 |
| 2|ccc |3 |
| 2|ddd |4 |
+---+---------+-----+

%python
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Gen some data
df1 = spark.createDataFrame([ ( 1, list(['A','B','X']), list(['1', '2', '8']) ) for x in range(2)], ['value1', 'array1', 'array2'] )
df2 = spark.createDataFrame([ ( 2, list(['C','D','Y']), list(['3', '4', '9']) ) for x in range(2)], ['value1', 'array1', 'array2'] )
df = df1.union(df2).distinct()
# from here specifically for you
col_temp_expr = "transform(array1, (x, i) -> concat(x, ',', array2[i]))"
dfA = df.withColumn("col_temp", expr(col_temp_expr))
dfB = dfA.select("value1", "array2", explode((col("col_temp")))) # Not an array
dfC = dfB.withColumn('tempArray', split(dfB['col'], ',')) # Now an array
dfC.select("value1", dfC.tempArray[0], dfC.tempArray[1]).show()
returns:
+------+------------+------------+
|value1|tempArray[0]|tempArray[1]|
+------+------------+------------+
| 1| A| 1|
| 1| B| 2|
| 1| X| 8|
| 2| C| 3|
| 2| D| 4|
| 2| Y| 9|
+------+------------+------------+
You can rename the columns as needed. This example uses three elements per array instead of two.
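To check the expected rows without a Spark session, the positional pairing can be modelled in plain Python. This is only a sketch of the logic, not Spark code: the function name explode_zipped is hypothetical, and it assumes both arrays in a row have the same length.

```python
def explode_zipped(rows):
    """Yield one output row per array position.

    rows: iterable of (id, le, leloc) tuples, where le and leloc are lists.
    Assumption: le and leloc have equal length within each row.
    """
    for row_id, le, leloc in rows:
        # zip pairs the entries that share the same position
        for left, right in zip(le, leloc):
            yield (row_id, left, right)


rows = [(1, ["aaa", "bbb"], [1, 2]), (2, ["ccc", "ddd"], [3, 4])]
result = list(explode_zipped(rows))
for r in result:
    print(r)
```

In Spark 2.4 and later, arrays_zip followed by explode should give the same pairing without building and splitting a comma-joined string.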

Related

How to generate the max values for new columns in PySpark dataframe?

Suppose I have a pyspark dataframe df.
+---+---+
| a| b|
+---+---+
| 1|200|
| 2|300|
| 4| 50|
+---+---+
I'd like to add new column c.
column c = max(0, column b - 100)
+---+---+---+
| a| b| c|
+---+---+---+
| 1|200|100|
| 2|300|200|
| 4| 50| 0|
+---+---+---+
How should I generate the new column c in pyspark dataframe? Thanks in advance!
I hope this is what you are looking for:
from pyspark.sql.functions import col, lit, greatest
df = spark.createDataFrame(
    [
        (1, 200),
        (2, 300),
        (4, 50),
    ],
    ["a", "b"]
)
# greatest() returns the larger of 0 and (b - 100) for each row
df_new = df.withColumn("c", greatest(lit(0), col("b") - lit(100)))
df_new.show()

pyspark iterate N rows from Data Frame to each execution

def fun_1(csv):
    # returns int[] of length = number of newlines in the string csv
    ...  # body not shown in the question
def fun_2(csv):  # my workaround to pass one CSV line at a time
    return fun_1(csv)[0]
The Input Data Frame is df
+----+----+-----+
|col1|col2|CSVs |
+----+----+-----+
| 1| a|2,0,1|
| 2| b|2,0,2|
| 3| c|2,0,3|
| 4| a|2,0,1|
| 5| b|2,0,2|
| 6| c|2,0,3|
| 7| a|2,0,1|
+----+----+-----+
Below is a code snippet that works but takes a long time:
from pyspark.sql.functions import udf
from pyspark.sql import functions as sf
funudf = udf(fun_2) # wish it could be fun_1
df = df.withColumn('pred', funudf(sf.col('CSVs')))
fun_1 has a memory issue and can handle at most 50000 rows at a time. I would like to use funudf = udf(fun_1) instead.
How can I split the PySpark DataFrame into segments of 50000 rows and call funudf -> fun_1 on each segment?
The output has two columns: 'col1' from the input and the funudf return value.
You can achieve the desired result of forcing PySpark to operate on fixed batches of rows by using the groupByKey method exposed in the RDD API. Using groupByKey will force PySpark to shuffle all the data for a single key to a single executor.
NOTE: for this very same reason using groupByKey is often discouraged because of the network cost.
Strategy:
1. Add a column that groups your data into batches of the desired size, then groupByKey on it.
2. Define a function that reproduces the logic of your UDF (and also returns an id for joining later on). It operates on pyspark.resultiterable.ResultIterable, the result of groupByKey. Apply the function to your groups using mapValues.
3. Convert the resulting RDD into a DataFrame and join it back in.
Example:
import pandas as pd
from pyspark.sql import types as t
# Synthesize DF
data = {'_id': range(9), 'group': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'], 'vals': [2.0*i for i in range(9)]}
df = spark.createDataFrame(pd.DataFrame(data))
df.show()
##
# Step - 1 Convert to rdd and groupByKey to force each group to separate executor
##
kv = df.rdd.map(lambda r: (r.group, [r._id, r.group, r.vals]))
groups = kv.groupByKey()
##
# Step 2 - Calculate function
##
# Dummy function that multiplies each group's values by 3
def mult3(ditr):
    data = ditr.data
    ids = [v[0] for v in data]
    vals = [3*v[2] for v in data]
    return list(zip(ids, vals))
# run mult3 and flatten results
mv = groups.mapValues(mult3).map(lambda r: r[1]).flatMap(lambda r: r) # rdd[(id, val)]
##
# Step 3 - Join results back into base DF
##
# convert results into a DF and join back in
schema = t.StructType([t.StructField('_id', t.LongType()), t.StructField('vals_x_3', t.FloatType())])
df_vals = spark.createDataFrame(mv, schema)
joined = df.join(df_vals, '_id')
joined.show()
>>>
+---+-----+----+
|_id|group|vals|
+---+-----+----+
| 0| a| 0.0|
| 1| b| 2.0|
| 2| c| 4.0|
| 3| a| 6.0|
| 4| b| 8.0|
| 5| c|10.0|
| 6| a|12.0|
| 7| b|14.0|
| 8| c|16.0|
+---+-----+----+
+---+-----+----+--------+
|_id|group|vals|vals_x_3|
+---+-----+----+--------+
| 0| a| 0.0| 0.0|
| 7| b|14.0| 42.0|
| 6| a|12.0| 36.0|
| 5| c|10.0| 30.0|
| 1| b| 2.0| 6.0|
| 3| a| 6.0| 18.0|
| 8| c|16.0| 48.0|
| 2| c| 4.0| 12.0|
| 4| b| 8.0| 24.0|
+---+-----+----+--------+
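The example above groups by an existing column. For the fixed-size batches the question asks about, the batch key can be derived from a running row index. The plain-Python sketch below models only that batching logic, with no Spark involved: fun_1_stub is a hypothetical stand-in for fun_1 (assumed to take newline-separated CSV text and return one int per line), and run_in_batches is an illustrative name.

```python
from collections import defaultdict

calls = []  # records the size of every batch handed to the stub


def fun_1_stub(csv_text):
    """Hypothetical stand-in for fun_1: returns one int per line of csv_text."""
    lines = csv_text.split("\n")
    calls.append(len(lines))
    return [sum(int(field) for field in line.split(",")) for line in lines]


def run_in_batches(rows, batch_size):
    """rows: list of (col1, csv_line). Returns a list of (col1, prediction)."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        # Integer division turns a running row index into a batch key.
        groups[index // batch_size].append(row)
    results = []
    for key in sorted(groups):
        batch = groups[key]
        preds = fun_1_stub("\n".join(line for _, line in batch))
        results.extend(zip((col1 for col1, _ in batch), preds))
    return results


rows = [(1, "2,0,1"), (2, "2,0,2"), (3, "2,0,3"), (4, "2,0,1"),
        (5, "2,0,2"), (6, "2,0,3"), (7, "2,0,1")]
predictions = run_in_batches(rows, batch_size=3)  # 50000 in the question
print(predictions)
```

In Spark, one way to obtain the running index would be rdd.zipWithIndex(), using index // 50000 as the key passed to groupByKey.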

How to merge two dataframe in pyspark when one column is an array and another column is string?

df1:
+---+------+
| id| code|
+---+------+
| 1|[A, F]|
| 2| [G]|
| 3| [A]|
+---+------+
df2:
+--------+----+
| col1|col2|
+--------+----+
| Apple| A|
| Google| G|
|Facebook| F|
+--------+----+
I want df3 to look like this, built from the columns of df1 and df2:
+---+------+-----------------+
| id| code| changed|
+---+------+-----------------+
| 1|[A, F]|[Apple, Facebook]|
| 2| [G]| [Google]|
| 3| [A]| [Apple]|
+---+------+-----------------+
I know this can be achieved if the code column is NOT an array. I don't know how to iterate over the code array for this purpose.
Try:
import pyspark.sql.functions as f
# explode the array, join each code to its name, then collect back per id
res = (df1
    .select(f.col("id"), f.explode(f.col("code")).alias("code"))
    .join(df2, f.col("code") == df2.col2)
    .groupBy("id")
    .agg(f.collect_list(f.col("code")).alias("code"),
         f.collect_list(f.col("col1")).alias("changed"))
)
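As a way to reason about what the explode, join, and collect_list chain computes, here is a plain-Python model of the same steps. It is a sketch rather than Spark code: map_codes is a hypothetical name, and it assumes each code appears only once in df2.

```python
def map_codes(df1_rows, df2_rows):
    """df1_rows: list of (id, [codes]); df2_rows: list of (name, code)."""
    lookup = {code: name for name, code in df2_rows}
    out = []
    for row_id, codes in df1_rows:
        # The inner join drops codes that have no match in df2.
        matched = [c for c in codes if c in lookup]
        if matched:  # an id with no matching code disappears after the join
            out.append((row_id, matched, [lookup[c] for c in matched]))
    return out


df1_rows = [(1, ["A", "F"]), (2, ["G"]), (3, ["A"])]
df2_rows = [("Apple", "A"), ("Google", "G"), ("Facebook", "F")]
df3_rows = map_codes(df1_rows, df2_rows)
for row in df3_rows:
    print(row)
```

The model makes one consequence of the inner join visible: rows of df1 whose codes are all missing from df2 do not appear in the result.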

Find count of distinct values between two same values in a csv file using pyspark

I'm working with PySpark to process large CSV files of more than 50 GB.
Now I need to find the number of distinct values between two references to the same value.
For example:
input dataframe:
+----+
|col1|
+----+
| a|
| b|
| c|
| c|
| a|
| b|
| a|
+----+
output dataframe:
+----+-----+
|col1|col2 |
+----+-----+
| a| null|
| b| null|
| c| null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+-----+
I have been struggling with this for the past week. I tried window functions and several other approaches in Spark without success. It would be a great help if someone knows how to solve this. Thank you.
Please comment if the question needs any clarification.
Here is a solution that rests on one assumption.
It assumes the previous reference can always be found within the previous 'n' rows. If 'n' is reasonably small, I think this is a good solution.
In the code below I assumed the previous reference is at most 5 rows back.
from pyspark.sql import Window
from pyspark.sql.functions import array, col, lag, monotonically_increasing_id, udf
from pyspark.sql.types import IntegerType
def get_distincts(values, current_value):
    # values holds the previous rows, most recent first
    seen = set()
    for v in values:
        if v == current_value:
            # found the previous reference: report the distinct values in between
            return len(seen)
        seen.add(v)
    # no previous reference within the lookback window
    return None
get_distincts_udf = udf(get_distincts, IntegerType())
df = spark.createDataFrame([["a"],["b"],["c"],["c"],["a"],["b"],["a"]]).toDF("col1")
# You can replace this if you have some unique id column
df = df.withColumn("seq_id", monotonically_increasing_id())
window = Window.orderBy("seq_id")
# collect the previous 5 values of col1 into an array column
df = df.withColumn("list", array([lag(col("col1"), i, None).over(window) for i in range(1, 6)]))
df = df.withColumn("col2", get_distincts_udf(col('list'), col('col1'))).drop('seq_id', 'list')
df.show()
which results in:
+----+----+
|col1|col2|
+----+----+
| a|null|
| b|null|
| c|null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+----+
You can try the following approach:
add a monotonically increasing id column to keep track of the order of rows
find prev_id for each col1 and save the result to a new df
for the new DF (alias 'd1'), make a LEFT JOIN to the DF itself (alias 'd2') with a condition (d2.id > d1.prev_id) & (d2.id < d1.id)
then groupby('d1.col1', 'd1.id') and aggregate on the countDistinct('d2.col1')
The code based on the above logic and your sample data is shown below:
from pyspark.sql import functions as F, Window
df1 = spark.createDataFrame([ (i,) for i in list("abccaba")], ["col1"])
# create a WinSpec partitioned by col1 so that we can find the prev_id
win = Window.partitionBy('col1').orderBy('id')
# set up id and prev_id
df11 = df1.withColumn('id', F.monotonically_increasing_id()) \
    .withColumn('prev_id', F.lag('id').over(win))
# check the newly added columns
df11.sort('id').show()
# +----+---+-------+
# |col1| id|prev_id|
# +----+---+-------+
# | a| 0| null|
# | b| 1| null|
# | c| 2| null|
# | c| 3| 2|
# | a| 4| 0|
# | b| 5| 1|
# | a| 6| 4|
# +----+---+-------+
# let's cache the new dataframe
df11.persist()
# do a self-join on id and prev_id and then do the aggregation
df12 = df11.alias('d1') \
    .join(df11.alias('d2'),
          (F.col('d2.id') > F.col('d1.prev_id')) & (F.col('d2.id') < F.col('d1.id')),
          how='left') \
    .select('d1.col1', 'd1.id', F.col('d2.col1').alias('ids')) \
    .groupBy('col1','id') \
    .agg(F.countDistinct('ids').alias('distinct_values'))
# display the result
df12.sort('id').show()
# +----+---+---------------+
# |col1| id|distinct_values|
# +----+---+---------------+
# | a| 0| 0|
# | b| 1| 0|
# | c| 2| 0|
# | c| 3| 0|
# | a| 4| 2|
# | b| 5| 2|
# | a| 6| 1|
# +----+---+---------------+
# release the cached df11
df11.unpersist()
Note you will need to keep this id column to sort rows, otherwise your resulting rows will be totally messed up each time you collect them.
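When validating either Spark approach on a small sample, a plain-Python reference for the expected column can help. The sketch below is not from the answers above; distinct_between is a hypothetical name, and it holds the whole sequence in memory, so it is only suitable for test data rather than the 50 GB files.

```python
def distinct_between(values):
    """For each position, count the distinct values seen since the previous
    occurrence of the same value, or None if there is no previous occurrence."""
    last_seen = {}  # value -> index of its most recent occurrence
    out = []
    for i, v in enumerate(values):
        if v in last_seen:
            # Slice covers the rows strictly between the two references.
            out.append(len(set(values[last_seen[v] + 1:i])))
        else:
            out.append(None)
        last_seen[v] = i
    return out


col1 = list("abccaba")
col2 = distinct_between(col1)
print(col2)
```

Note that this reference returns None for a first occurrence, matching the output requested in the question, whereas the self-join above reports 0 for those rows.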
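# Here block is nothing but the characters you want to read and search from the csv.
# Assumption: blocks is the sequence of those values in file order
# (the loop was not shown in the original snippet).
reuse_distance = []  # one [block, stack_dist, reuse_dist] entry per block read
block_dict = {}      # block -> position of its most recent occurrence
stack_list = []      # blocks ordered from least to most recently used
counter_reuse = 0    # number of blocks read so far
counter_stack = 0    # number of distinct blocks seen so far
for block in blocks:
    stack_dist = -1
    reuse_dist = -1
    if block in block_dict:
        # rows between this block and its previous occurrence
        reuse_dist = counter_reuse - block_dict[block] - 1
        block_dict[block] = counter_reuse
        counter_reuse += 1
        # distinct blocks between this block and its previous occurrence
        stack_dist_ind = stack_list.index(block)
        stack_dist = counter_stack - stack_dist_ind - 1
        del stack_list[stack_dist_ind]
        stack_list.append(block)
    else:
        block_dict[block] = counter_reuse
        counter_reuse += 1
        counter_stack += 1
        stack_list.append(block)
    reuse_distance.append([block, stack_dist, reuse_dist])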

PySpark: Add a column to DataFrame when column is a list

I have read similar questions but couldn't find a solution to my specific problem.
I have a list
l = [1, 2, 3]
and a DataFrame
df = sc.parallelize([
['p1', 'a'],
['p2', 'b'],
['p3', 'c'],
]).toDF(('product', 'name'))
I would like to obtain a new DataFrame where the list l is added as a further column, namely
+-------+----+---------+
|product|name| new_col |
+-------+----+---------+
| p1| a| 1 |
| p2| b| 2 |
| p3| c| 3 |
+-------+----+---------+
Approaches with JOIN, where I was joining df with an
sc.parallelize([[1], [2], [3]])
have failed. Approaches using withColumn, as in
new_df = df.withColumn('new_col', l)
have failed because the list is not a Column object.
So, from reading some interesting stuff here, I've ascertained that you can't simply append an arbitrary column to a given DataFrame object. It appears what you want is more of a zip than a join. I looked around and found this ticket, which makes me think you won't be able to zip, given that you have DataFrame rather than RDD objects.
The only way I've been able to solve your issue involves leaving the world of DataFrame objects and returning to RDD objects. I also needed to create an index for the join, which may or may not work for your use case.
l = sc.parallelize([1, 2, 3])
index = sc.parallelize(range(0, l.count()))
z = index.zip(l)
rdd = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']])
rdd_index = index.zip(rdd)
# just in case!
assert(rdd.count() == l.count())
# perform an inner join on the index we generated above, then map it to look pretty.
# Python 3 does not allow tuple parameters in a lambda, so index into the pair instead
new_rdd = rdd_index.join(z).map(lambda kv: [kv[1][0][0], kv[1][0][1], kv[1][1]])
new_df = new_rdd.toDF(["product", 'name', 'new_col'])
When I run new_df.show(), I get:
+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
| p1| a| 1|
| p2| b| 2|
| p3| c| 3|
+-------+----+-------+
Sidenote: I'm really surprised this didn't work. Without a join condition, it behaves like a cross join (Cartesian product) rather than a positional zip:
from pyspark.sql import Row
l = sc.parallelize([1, 2, 3])
new_row = Row("new_col_name")
l_as_df = l.map(new_row).toDF()
new_df = df.join(l_as_df)
When I run new_df.show(), I get:
+-------+----+------------+
|product|name|new_col_name|
+-------+----+------------+
| p1| a| 1|
| p1| a| 2|
| p1| a| 3|
| p2| b| 1|
| p3| c| 1|
| p2| b| 2|
| p2| b| 3|
| p3| c| 2|
| p3| c| 3|
+-------+----+------------+
If the product column is unique then consider the following approach:
original dataframe:
df = spark.sparkContext.parallelize([
['p1', 'a'],
['p2', 'b'],
['p3', 'c'],
]).toDF(('product', 'name'))
df.show()
+-------+----+
|product|name|
+-------+----+
| p1| a|
| p2| b|
| p3| c|
+-------+----+
new column (and new index column):
lst = [1, 2, 3]
indx = ['p1','p2','p3']
create a new dataframe from the list above (with an index):
from pyspark.sql.types import *
myschema= StructType([ StructField("indx", StringType(), True),
StructField("newCol", IntegerType(), True)
])
df1=spark.createDataFrame(zip(indx,lst),schema = myschema)
df1.show()
+----+------+
|indx|newCol|
+----+------+
| p1| 1|
| p2| 2|
| p3| 3|
+----+------+
join this to the original dataframe, using the index created:
dfnew = df.join(df1, df.product == df1.indx,how='left')\
.drop(df1.indx)\
.sort("product")
to get:
dfnew.show()
+-------+----+------+
|product|name|newCol|
+-------+----+------+
| p1| a| 1|
| p2| b| 2|
| p3| c| 3|
+-------+----+------+
This is achievable via RDDs.
1 Convert dataframes to indexed rdds:
df_rdd = df.rdd.zipWithIndex().map(lambda row: (row[1], (row[0][0], row[0][1])))
l_rdd = sc.parallelize(l).zipWithIndex().map(lambda row: (row[1], row[0]))
2 Join two RDDs on index, drop index and rearrange elements:
res_rdd = df_rdd.join(l_rdd).map(lambda row: [row[1][0][0], row[1][0][1], row[1][1]])
3 Convert result to Dataframe:
res_df = res_rdd.toDF(['product', 'name', 'new_col'])
res_df.show()
+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
| p1| a| 1|
| p2| b| 2|
| p3| c| 3|
+-------+----+-------+
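The index-and-join idea in the steps above can be modelled in plain Python to see why the index matters. This is a sketch only, not Spark code, and add_column_by_position is a hypothetical name. It sorts by index after the join because a distributed join does not promise to preserve row order.

```python
def add_column_by_position(rows, values):
    """Attach values[i] to rows[i] by joining on a generated index."""
    if len(rows) != len(values):
        raise ValueError("rows and values must have the same length")
    indexed_rows = {i: row for i, row in enumerate(rows)}  # index -> row
    indexed_vals = {i: v for i, v in enumerate(values)}    # index -> new value
    # Inner join on the index, then restore the original order.
    joined = [(i, indexed_rows[i], indexed_vals[i])
              for i in indexed_rows if i in indexed_vals]
    joined.sort(key=lambda item: item[0])
    # Drop the index and append the new value to each row.
    return [tuple(row) + (value,) for _, row, value in joined]


rows = [("p1", "a"), ("p2", "b"), ("p3", "c")]
new_rows = add_column_by_position(rows, [1, 2, 3])
for row in new_rows:
    print(row)
```

In the RDD version, keeping the index through the join and calling sortByKey before dropping it would be one way to get the same ordering guarantee.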
