How can I transfer the exploded object back to a PySpark DataFrame?

I am trying to calculate the 2-gram distribution and then convert the exploded result back to a PySpark DataFrame. I am currently using:
new_df = sample_df_updated.select(['ngrams'])
from pyspark.sql.functions import explode
new_df.select(explode(new_df.ngrams)).show(truncate=False)
+------------------+
|col |
+------------------+
|the project |
|project gutenberg |
|gutenberg ebook |
|ebook of |
|of alice’s |
|alice’s adventures|
|adventures in |
|in wonderland, |
|wonderland, by |
|by lewis |
|lewis carroll |
|this ebook |
|ebook is |
|is for |
|for the |
|the use |
|use of |
|of anyone |
|anyone anywhere |
|anywhere at |
+------------------+
To get the counts, I am trying code like this:
df2 = new_df.select(explode(new_df.ngrams)).show(truncate=False)
df2.groupBy('col').count().show()
But it results in the error
'NoneType' object has no attribute 'show'
How can I turn it back into a DataFrame?

.show() returns None, so assigning its result to df2 leaves you with None rather than a DataFrame.
Try:
df2 = new_df.select(explode(new_df.ngrams))
df2.show(truncate=False)
df2.groupBy('col').count().show()
It can also be helpful to rename the exploded column for clarity:
df2 = new_df.select(explode(new_df.ngrams).alias('exploded_ngrams'))
df2.show(truncate=False)
df2.groupBy('exploded_ngrams').count().show()

Why not just explode the column directly on the DataFrame?
new_df.withColumn("ngrams", explode("ngrams")).show(truncate=False)

Related

How to map 2 dataset to check if a value from Dataset_A is present in Dataset_B and create a new column in Dataset_A as 'Present or Not'?

I am working on 2 datasets in PySpark, let's say Dataset_A and Dataset_B. I want to check whether the 'P/N' column in Dataset_A is present in the 'Assembly_P/N' column in Dataset_B. Then I need to create a new column in Dataset_A titled 'Present or Not' with the values 'Present' or 'Not Present' depending on the search result.
PS. Both Datasets are huge and I am trying to figure out an efficient solution to do this without actually joining the tables.
sample
Dataset_A
| P/N |
| -------- |
| 1bc |
| 2df |
| 1cd |
Dataset_B
| Assembly_P/N |
| -------- |
| 1bc |
| 6gh |
| 2df |
Expected Result
Dataset_A
| P/N | Present or Not |
| -------- | -------- |
| 1bc | Present |
| 2df | Present |
| 1cd | Not Present |
from pyspark.sql.functions import udf
from pyspark.sql.functions import when, col, lit

def check_value(PN):
    if dataset_B(col("Assembly_P/N")).isNotNull().rlike("%PN%"):
        return 'Present'
    else:
        return 'Not Present'

check_value_udf = udf(check_value, StringType())
dataset_A = dataset_A.withColumn('Present or Not', check_value_udf(dataset_A.P/N))
I am getting a PicklingError.
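A UDF cannot reference a second DataFrame; Spark tries to pickle dataset_B into the function, which is typically where the PicklingError comes from. Even though the question asks to avoid a join, a left join on the distinct keys (broadcast if they fit in memory) is the usual scalable alternative. A minimal sketch, assuming dataset_A has a 'P/N' column and dataset_B an 'Assembly_P/N' column:
from pyspark.sql import functions as F

# Distinct assembly part numbers, renamed to match dataset_A's key column
matches = (dataset_B
           .select(F.col("Assembly_P/N").alias("P/N"))
           .distinct()
           .withColumn("_hit", F.lit(1)))

# Left join keeps every row of dataset_A; _hit stays null where there is no match
dataset_A = (dataset_A
             .join(matches, on="P/N", how="left")
             .withColumn("Present or Not",
                         F.when(F.col("_hit").isNotNull(), "Present")
                          .otherwise("Not Present"))
             .drop("_hit"))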

Python pivot-table with array value

For a project with table features, I try to create a new table with pivot_table.
Small problem however, one of my columns contains an array. Here an ex
| House | Job |
| ----- | --- |
| Gryffindor | ["Head of Auror Office", " Minister for Magic"]|
| Gryffindor | ["Auror"]|
| Slytherin | ["Auror","Student"] |
Ideally, I would like with a pivot table to create a table that looks like this
| House | Head of Auror Office | Minister for Magic | Auror | Student |
|:----- |:--------------------:|:------------------:|:-----:|:-------:|
| Gryffindor | 1 | 1| 1| 0|
| Slytherin | 0 | 0| 1| 1|
Of course I can have a value like 2,3 or 4 in the array so something that is not fixed. Anyone have a solution? Maybe the pivot_table is not the best solution :/
Sorry for the arrays, the formatting isn't working :(
Suppose your table is df with two columns:
(df.explode('Job')
   .groupby(['House', 'Job']).size().reset_index()
   .pivot(index='House', columns='Job').fillna(0))
The code first expands the lists into rows, then counts each (House, Job) pair, and finally pivots into the table.
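If pandas' crosstab is an option, it gets to the same result with one call on the exploded frame. A self-contained sketch with the example data (the DataFrame name df and the cleaned-up job strings are assumptions):
import pandas as pd

df = pd.DataFrame({
    "House": ["Gryffindor", "Gryffindor", "Slytherin"],
    "Job": [["Head of Auror Office", "Minister for Magic"], ["Auror"], ["Auror", "Student"]],
})

exploded = df.explode("Job")                              # one row per (House, Job) pair
result = pd.crosstab(exploded["House"], exploded["Job"])  # counts, 0 where a job is absent
print(result)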

Modify Spark Dataframe

I have a Spark DataFrame to which I need to make some changes.
Input DataFrame:
+-----+-----+----------+
| name|Index| Value |
+-----+-----+----------+
|name1|1 |ab |
|name2|1 |vf |
|name2|2 |ee |
|name2|3 |id |
|name3|1 |bd |
+-----+-----+----------+
For every name there are multiple values, which need to be collected together as shown below.
Output DataFrame:
+-----+----------+
| name|value |
+-----+----------+
|name1|[ab] |
|name2|[vf,ee,id]|
|name3|[bd] |
+-----+----------+
Thank you
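A minimal sketch, assuming the input DataFrame is named df with the columns name, Index and Value shown above; the struct/sort step is only there because collect_list on its own does not guarantee the Index order:
from pyspark.sql import functions as F

# Collect (Index, Value) pairs per name, sort them by Index, then keep only the values
result = (
    df.groupBy("name")
      .agg(F.sort_array(F.collect_list(F.struct("Index", "Value"))).alias("pairs"))
      .select("name", F.col("pairs.Value").alias("value"))
)
result.show(truncate=False)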

Dict2Columns - PySpark

I would like to expand a column containing dict values into separate columns. The input looks like this:
+-------+--------------------------------------------+
| Idx| value |
+-------+--------------------------------------------+
| 123|{'country_code': 'gb','postal_area': 'CR'} |
| 456|{'country_code': 'cn','postal_area': 'RS'} |
| 789|{'country_code': 'cl','postal_area': 'QS'} |
+-------+--------------------------------------------+
Then I would like to get something like this:
display(df)
+-------+-------------------------------+
| Idx| country_code | postal_area |
+-------+-------------------------------+
| 123| gb | CR |
| 456| cn | RS |
| 789| cl | QS |
+-------+-------------------------------+
I tried to do it for just one line, something like this:
# PySpark code
import json
sc = spark.sparkContext
dict_lst = {'country_code': 'gb', 'postal_area': 'CR'}
rdd = sc.parallelize([json.dumps(dict_lst)])
df = spark.read.json(rdd)
display(df)
and I've got:
+-------------+-------------+
|country_code | postal_area |
+-------------+-------------+
| gb | CR |
+-------------+-------------+
So here I maybe have part of the solution. Now I would like to know how I can concat df with the resulting DataFrame.
Well, after trying... the best solution is getting the values with the regexp_extract function from PySpark:
from pyspark.sql.functions import regexp_extract

(df
 .withColumn("country_code", regexp_extract('value', "(?<=.country_code.:\s.)(.*?)(?=\')", 0))
 .withColumn("postal_area", regexp_extract('value', "(?<=.postal_area.:\s.)(.*?)(?=\')", 0)))
Hope this helps future questions about getting values from a string dictionary.
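If the strings are this close to JSON, from_json with an explicit schema is usually sturdier than a regex. A hedged sketch, assuming df has the Idx and value columns shown above and that the values only ever use single quotes:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("country_code", StringType()),
    StructField("postal_area", StringType()),
])

# Swap the single quotes for double quotes so the string parses as JSON,
# then expand the resulting struct into separate columns.
parsed = df.withColumn("parsed", F.from_json(F.regexp_replace("value", "'", '"'), schema))
result = parsed.select("Idx", "parsed.country_code", "parsed.postal_area")
result.show()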

Create new columns based on group by with Pyspark

I've got a scenario where I have to take the results from a group by and create new columns.
For example, say I have this data:
| Tool | Category | Price |
| -------- | -------- | -------- |
| Hammer | Hand Tool | 25.00 |
| Drill | Power Tool | 56.33 |
| Screw Driver | Hand Tool | 4.99 |
My output should look like:
| Tool | Hand Tool | Power Tool |
| -------- | -------- | -------- |
| Hammer | 25.00 | NULL |
| Drill | NULL | 56.33 |
| Screw Driver | 4.99 | NULL |
I'm not sure how to get this output. I'm trying something like the snippet below, but it blows up with 'Column is not iterable'.
def get_tool_info():
    return tool_table.groupBy('Category').pivot('Price', 'Category')
What is the best way to dynamically generate these new columns and assign the price values?
Try this:
from pyspark.sql.types import StructType, StructField, StringType, FloatType
import pyspark.sql.functions as F
schema = StructType([StructField("Tool", StringType()), StructField("Category", StringType()), StructField("Price", FloatType())])
data = [["Hammer", "Hand Tool", 25.00], ["Drill", "Power Tool", 56.33], ["Screw Driver", "Hand Tool", 4.99]]
df = spark.createDataFrame(data, schema)
df.groupby("Tool").pivot("Category").agg(F.first("Price")).show()
Output :
+------------+---------+----------+
| Tool|Hand Tool|Power Tool|
+------------+---------+----------+
| Drill| null| 56.33|
|Screw Driver| 4.99| null|
| Hammer| 25.0| null|
+------------+---------+----------+
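A small follow-up on the same df: when the category values are known up front, passing them to pivot() saves Spark an extra pass over the data to discover the distinct values:
# Explicit pivot values avoid the extra distinct() scan
df.groupby("Tool").pivot("Category", ["Hand Tool", "Power Tool"]).agg(F.first("Price")).show()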
