I have a Spark dataframe to which I need to make some changes.
Input dataframe:
+-----+-----+----------+
| name|Index|Value     |
+-----+-----+----------+
|name1|1    |ab        |
|name2|1    |vf        |
|name2|2    |ee        |
|name2|3    |id        |
|name3|1    |bd        |
+-----+-----+----------+
For every name there are multiple values that need to be collected together, as shown below.
Output dataframe:
+-----+----------+
| name|value     |
+-----+----------+
|name1|[ab]      |
|name2|[vf,ee,id]|
|name3|[bd]      |
+-----+----------+
Thank you
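A minimal sketch of one way to get that output, assuming the input dataframe is called df and the columns are named name / Index / Value exactly as shown above:
from pyspark.sql import functions as F
# Group by name and gather the values into an array column
result = df.groupBy("name").agg(F.collect_list("Value").alias("value"))
result.show()
Note that collect_list does not guarantee element order after a shuffle; if the Index order matters, collect structs of (Index, Value), sort them with sort_array, and then pull out the Value field.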
I am working on 2 datasets in PySpark, let's say Dataset_A and Dataset_B. I want to check whether the 'P/N' column values in Dataset_A are present in the 'Assembly_P/N' column of Dataset_B, and then create a new column in Dataset_A titled 'Present or Not' with the value 'Present' or 'Not Present' depending on the result.
P.S. Both datasets are huge and I am trying to figure out an efficient way to do this without actually joining the tables.
Sample:
Dataset_A
| P/N |
| -------- |
| 1bc |
| 2df |
| 1cd |
Dataset_B
| Assembly_P/N |
| -------- |
| 1bc |
| 6gh |
| 2df |
Expected Result
Dataset_A
| P/N | Present or Not |
| -------- | -------- |
| 1bc | Present |
| 2df | Present |
| 1cd | Not Present |
from pyspark.sql.functions import udf
from pyspark.sql.functions import when, col, lit

def check_value(PN):
    if dataset_B(col("Assembly_P/N")).isNotNull().rlike("%PN%"):
        return 'Present'
    else:
        return 'Not Present'

check_value_udf = udf(check_value, StringType())
dataset_A = dataset_A.withColumn('Present or Not', check_value_udf(dataset_A.P/N))
I am getting a PicklingError.
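The PicklingError comes from referencing dataset_B inside the UDF: a DataFrame cannot be serialized and shipped to the executors. A hedged sketch of one common alternative, a left join against the distinct part numbers (dataset_A, dataset_B and the column names are taken from the question; the helper column _match is hypothetical):
from pyspark.sql import functions as F
matches = (dataset_B
           .select(F.col("Assembly_P/N").alias("P/N"))
           .distinct()
           .withColumn("_match", F.lit(1)))          # hypothetical marker column
dataset_A = (dataset_A
             .join(matches, on="P/N", how="left")
             .withColumn("Present or Not",
                         F.when(F.col("_match").isNotNull(), "Present")
                          .otherwise("Not Present"))
             .drop("_match"))
If the distinct Assembly_P/N values are small enough, wrapping matches in F.broadcast() keeps the large side from being shuffled; avoiding a join entirely is hard for this kind of lookup.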
For a project with table features, I am trying to create a new table with pivot_table.
Small problem, however: one of my columns contains an array. Here is an example:
| House | Job |
| ----- | --- |
| Gryffindor | ["Head of Auror Office", " Minister for Magic"]|
| Gryffindor | ["Auror"]|
| Slytherin | ["Auror","Student"] |
Ideally, I would like with a pivot table to create a table that looks like this
| House | Head of Auror Office | Minister for Magic | Auror | Student |
|:----- |:--------------------:|:------------------:|:-----:|:-------:|
| Gryffindor | 1 | 1| 1| 0|
| Slytherin | 0 | 0| 1| 1|
Of course the array can contain 2, 3 or 4 values, so the length is not fixed. Does anyone have a solution? Maybe pivot_table is not the best tool for this.
Suppose your table is a DataFrame df with the two columns House and Job:
(df.explode('Job')
   .groupby(['House', 'Job']).size().reset_index(name='count')
   .pivot(index='House', columns='Job', values='count').fillna(0))
The code first expands each list into one row per job, then counts the (House, Job) pairs, and finally pivots the counts into the table shape you want.
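For reference, here is the same idea run end to end on the sample data from the question; pd.crosstab is a slightly shorter alternative to the groupby/pivot combination (the stray leading space in ' Minister for Magic' is assumed to be a typo and dropped here):
import pandas as pd
df = pd.DataFrame({
    "House": ["Gryffindor", "Gryffindor", "Slytherin"],
    "Job": [["Head of Auror Office", "Minister for Magic"], ["Auror"], ["Auror", "Student"]],
})
exploded = df.explode("Job")                  # one row per (House, Job) pair
print(pd.crosstab(exploded["House"], exploded["Job"]))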
I have a pyspark dataframe that looks like this.
+--------------------+-------+--------------------+
| ID |country| attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2,3,4...] |
|3de27656 | US|[1,7,2,4...] |
|75ce4e58 | US|[1,2,1,4...] |
|908df65c | US|[1,8,3,0...] |
|f0503257 | US|[1,2,3,2...] |
|2tBxD6j | US|[1,2,3,4...] |
|33811685 | US|[1,5,3,5...] |
|aad21639 | US|[7,8,9,4...] |
|e3d9e3bb | US|[1,10,9,4...] |
|463f6f69 | US|[12,2,13,4...] |
+--------------------+-------+--------------------+
I also have a set that looks like this
reference_set = (1,2,100,500,821)
What I want to do is create a new list as a column in the dataframe, using maybe a list comprehension like this: [attr for attr in attrs if attr in reference_set]
So my final dataframe should be something like this:
+--------------------+-------+--------------------+
| ID |country| filtered_attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2] |
|3de27656 | US|[1,2] |
|75ce4e58 | US|[1,2] |
|908df65c | US|[1] |
|f0503257 | US|[1,2] |
|2tBxD6j | US|[1,2] |
|33811685 | US|[1] |
|aad21639 | US|[] |
|e3d9e3bb | US|[1] |
|463f6f69 | US|[2] |
+--------------------+-------+--------------------+
How can I do this? As I'm new to PySpark, I can't quite work out the logic.
Edit: I posted an approach below; if there's a more efficient way of doing this, please let me know.
You can use the built-in function array_intersect.
from pyspark.sql.functions import array_intersect, lit, split

# Sample dataframe
df = spark.createDataFrame([('ffae10af', 'US', [1, 2, 3, 4])], ["ID", "Country", "attrs"])
reference_set = {1, 2, 100, 500, 821}

# This step adds the set as an array column in the dataframe
set_to_string = ",".join([str(x) for x in reference_set])

(df.withColumn('reference_set', split(lit(set_to_string), ',').cast('array<bigint>'))
   .withColumn('filtered_attrs', array_intersect('attrs', 'reference_set'))
   .show(truncate=False))
+--------+-------+------------+---------------------+--------------+
|ID |Country|attrs |reference_set |filtered_attrs|
+--------+-------+------------+---------------------+--------------+
|ffae10af|US |[1, 2, 3, 4]|[1, 2, 100, 500, 821]|[1, 2] |
+--------+-------+------------+---------------------+--------------+
I managed to use the filter function paired with a UDF to make this work.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, LongType

def filter_items(item):
    return item in reference_set

# Declaring the element type makes the result a real array column rather than a string
custom_udf = udf(lambda attributes: list(filter(filter_items, attributes)),
                 ArrayType(LongType()))
processed_df = df.withColumn('filtered_attrs', custom_udf(col('attrs')))
This gives me the required output.
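If you are on Spark 3.1 or later, the built-in higher-order function pyspark.sql.functions.filter gives almost exactly the list-comprehension style you asked for, without the serialization overhead of a Python UDF (a sketch, assuming the same df and reference_set as above):
from pyspark.sql import functions as F
reference_set = {1, 2, 100, 500, 821}
# The lambda is translated into a Spark SQL expression, so no Python UDF is involved
df = df.withColumn("filtered_attrs",
                   F.filter("attrs", lambda x: x.isin(list(reference_set))))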
We're trying to migrate a vanilla Python code base to PySpark. The plan is to do some filtering on a dataframe (previously pandas, now Spark), then group it by user-id, and finally apply mean-shift clustering on top.
I'm using pandas_udf(df.schema, PandasUDFType.GROUPED_MAP) on the grouped data, but there's a problem with how the final output should be represented.
Let's say we have two columns in the input dataframe, user-id and location. For each user we need to compute all clusters (on the location), retain only the biggest one, and then return its attributes, which form a 3-dimensional vector. Let's call the columns of this 3-tuple col-1, col-2 and col-3. The only approach I can think of is creating the original dataframe with 5 columns, with these 3 fields set to None using something like withColumn('col-i', lit(None).astype(FloatType())), and then populating the three columns in the first row for each user. But this seems like a really ugly way of doing it, and it would unnecessarily waste a lot of space, because apart from the first row, all entries in col-1, col-2 and col-3 would be null. The output dataframe would look something like this:
+---------+----------+-------+-------+-------+
| user-id | location | col-1 | col-2 | col-3 |
+---------+----------+-------+-------+-------+
| 02751a9 | 0.894956 | 21.9 | 31.5 | 54.1 |
| 02751a9 | 0.811956 | null | null | null |
| 02751a9 | 0.954956 | null | null | null |
| ... |
| 02751a9 | 0.811956 | null | null | null |
+--------------------------------------------+
| 0af2204 | 0.938011 | 11.1 | 12.3 | 53.3 |
| 0af2204 | 0.878081 | null | null | null |
| 0af2204 | 0.933054 | null | null | null |
| 0af2204 | 0.921342 | null | null | null |
| ... |
| 0af2204 | 0.978081 | null | null | null |
+--------------------------------------------+
This feels so wrong. Is there an elegant way of doing it?
What I ended up doing was grouping the df by user-id and applying functions.collect_list on the columns, so that each cell contains a list. Now each user has only one row. Then I applied mean-shift clustering on each row's data.
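A minimal sketch of that collect_list approach, assuming scikit-learn's MeanShift and assuming the attributes kept are the centre of the biggest cluster; the column names user-id and location come from the question, and the body of biggest_cluster is a stand-in to be adapted to whatever attributes you actually return:
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType
from sklearn.cluster import MeanShift

@udf(returnType=ArrayType(FloatType()))
def biggest_cluster(locations):
    X = np.array(locations).reshape(-1, 1)
    ms = MeanShift().fit(X)
    labels, counts = np.unique(ms.labels_, return_counts=True)
    winner = labels[np.argmax(counts)]                 # label of the largest cluster
    return [float(v) for v in ms.cluster_centers_[winner]]

result = (df.groupBy("user-id")
            .agg(F.collect_list("location").alias("locations"))
            .withColumn("cluster_attrs", biggest_cluster("locations")))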
This question already has answers here: Split Spark dataframe string column into multiple columns (5 answers). Closed 4 years ago.
How can we split a Spark DataFrame column based on a separator like '/' using PySpark 2.4?
My column contains:
+-------------------+
|           timezone|
+-------------------+
|   America/New_York|
|  Africa/Casablanca|
|      Europe/Madrid|
|      Europe/Madrid|
|                   |
|               null|
+-------------------+
Thanks
# Creating a dataframe
values = [('America/New_York',),('Africa/Casablanca',),('Europe/Madrid',),('Europe/Madrid',),('Germany',),('',),(None,)]
df = sqlContext.createDataFrame(values,['timezone',])
df.show(truncate=False)
+-----------------+
|timezone |
+-----------------+
|America/New_York |
|Africa/Casablanca|
|Europe/Madrid |
|Europe/Madrid |
|Germany |
| |
|null |
+-----------------+
from pyspark.sql.functions import col, instr, split, when
df = df.withColumn('separator_if_exists',(instr(col('timezone'),'/') > 0) & instr(col('timezone'),'/').isNotNull())
df = df.withColumn('col1',when(col('separator_if_exists') == True,split(col('timezone'),'/')[0]).otherwise(None))
df = df.withColumn('col2',when(col('separator_if_exists') == True,split(col('timezone'),'/')[1]).otherwise(None)).drop('separator_if_exists')
df.show(truncate=False)
+-----------------+-------+----------+
|timezone |col1 |col2 |
+-----------------+-------+----------+
|America/New_York |America|New_York |
|Africa/Casablanca|Africa |Casablanca|
|Europe/Madrid |Europe |Madrid |
|Europe/Madrid |Europe |Madrid |
|Germany |null |null |
| |null |null |
|null |null |null |
+-----------------+-------+----------+
Documentation: split() and instr(). Note that instr() uses 1-based indexing; if the substring to be searched is not found, it returns 0.
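As a side note, the separator_if_exists helper column isn't strictly required: when() already falls back to null when its condition is false or null, so the same result can be written a bit more compactly (a sketch against the same df):
from pyspark.sql import functions as F
df = (df
      .withColumn("col1", F.when(F.instr("timezone", "/") > 0, F.split("timezone", "/")[0]))
      .withColumn("col2", F.when(F.instr("timezone", "/") > 0, F.split("timezone", "/")[1])))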