Dict2Columns - PySpark - python

I would like to convert a column with dict values into expanded columns holding those values, as follows:
+-------+--------------------------------------------+
| Idx| value |
+-------+--------------------------------------------+
| 123|{'country_code': 'gb','postal_area': 'CR'} |
| 456|{'country_code': 'cn','postal_area': 'RS'} |
| 789|{'country_code': 'cl','postal_area': 'QS'} |
+-------+--------------------------------------------+
Then I would like to get something like this:
display(df)
+-----+--------------+-------------+
| Idx | country_code | postal_area |
+-----+--------------+-------------+
| 123 | gb           | CR          |
| 456 | cn           | RS          |
| 789 | cl           | QS          |
+-----+--------------+-------------+
I tried this for just one line, like so:
# PySpark code
import json

sc = spark.sparkContext
dict_lst = {'country_code': 'gb', 'postal_area': 'CR'}
rdd = sc.parallelize([json.dumps(dict_lst)])
df = spark.read.json(rdd)
display(df)
and I got:
+-------------+-------------+
|country_code | postal_area |
+-------------+-------------+
| gb          | CR          |
+-------------+-------------+
So here I may have part of the solution. Now I would like to know how I can concatenate df with the resulting DataFrame.

Well, after trying... the best solution is extracting the values with PySpark's regexp_extract function:
from pyspark.sql.functions import regexp_extract

df = df.withColumn("country_code", regexp_extract("value", r"(?<=.country_code.:\s.)(.*?)(?=')", 0)) \
       .withColumn("postal_area", regexp_extract("value", r"(?<=.postal_area.:\s.)(.*?)(?=')", 0))
Hope this helps with future questions about getting values from a string dictionary.
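As an alternative (just a sketch, under the assumption that the value column always looks like the example above), the string could be parsed with from_json and an explicit schema instead of regexes; Spark's JSON parser should accept the single-quoted fields by default (allowSingleQuotes):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Schema of the dict held in the string column (assumed from the example above)
value_schema = StructType([
    StructField("country_code", StringType()),
    StructField("postal_area", StringType()),
])

# from_json parses each string into a struct; then pull the fields out as columns
parsed = df.withColumn("parsed", F.from_json("value", value_schema))
result = parsed.select("Idx", "parsed.country_code", "parsed.postal_area")
result.show()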

Related

How can I transfer the exploded object back to a pyspark dataframe?

I am trying to calculate the 2-gram distribution and need to convert the exploded result back into a PySpark DataFrame. I am currently using:
new_df = sample_df_updated.select(['ngrams'])
from pyspark.sql.functions import explode
new_df.select(explode(new_df.ngrams)).show(truncate=False)
+------------------+
|col |
+------------------+
|the project |
|project gutenberg |
|gutenberg ebook |
|ebook of |
|of alice’s |
|alice’s adventures|
|adventures in |
|in wonderland, |
|wonderland, by |
|by lewis |
|lewis carroll |
|this ebook |
|ebook is |
|is for |
|for the |
|the use |
|use of |
|of anyone |
|anyone anywhere |
|anywhere at |
+------------------+
I am trying to use code like this:
df2 = new_df.select(explode(new_df.ngrams)).show(truncate=False)
df2.groupBy('col').count().show()
But it results in the error
'NoneType' object has no attribute 'show'
How can I turn it back into a DataFrame?
.show() only prints the DataFrame and returns None, which is why df2 is not a DataFrame.
Try:
df2 = new_df.select(explode(new_df.ngrams))
df2.show(truncate=False)
df2.groupBy('col').count().show()
It can also be helpful to rename the exploded column for clarity:
df2 = new_df.select(explode(new_df.ngrams).alias('exploded_ngrams'))
df2.show(truncate=False)
df2.groupBy('exploded_ngrams').count().show()
What about just exploding the column on the DataFrame?
new_df.withColumn("ngrams", explode("ngrams")).show(truncate=False)
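For the 2-gram distribution mentioned in the question, the counts can then be taken over the exploded column; here is a sketch assuming the same new_df as above:

from pyspark.sql.functions import col, explode

# One row per 2-gram, then count occurrences to get the distribution
ngram_counts = (
    new_df.select(explode(col("ngrams")).alias("ngram"))
          .groupBy("ngram")
          .count()
          .orderBy(col("count").desc())
)
ngram_counts.show(truncate=False)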

Modify Spark Dataframe

I have a Spark DataFrame to which I need to make some changes.
Input dataframe :
+-----+-----+----------+
| name|Index| Value    |
+-----+-----+----------+
|name1|1    |ab        |
|name2|1    |vf        |
|name2|2    |ee        |
|name2|3    |id        |
|name3|1    |bd        |
+-----+-----+----------+
For every name there are multiple values, which need to be combined together as shown below.
Output dataframe :
+-----+----------+
| name|value     |
+-----+----------+
|name1|[ab]      |
|name2|[vf,ee,id]|
|name3|[bd]      |
+-----+----------+
Thank you
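A minimal sketch, assuming the column names above: group by name and collect the values with collect_list (the struct/sort step is only there to keep the Index order, which collect_list alone does not guarantee):

from pyspark.sql import functions as F

# Sample data matching the input above
df = spark.createDataFrame(
    [("name1", 1, "ab"), ("name2", 1, "vf"), ("name2", 2, "ee"),
     ("name2", 3, "id"), ("name3", 1, "bd")],
    ["name", "Index", "Value"],
)

# Collect the values per name, preserving the Index order via a sorted struct
result = (
    df.groupBy("name")
      .agg(F.sort_array(F.collect_list(F.struct("Index", "Value"))).alias("pairs"))
      .withColumn("value", F.col("pairs.Value"))
      .drop("pairs")
)
result.show(truncate=False)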

How to use list comprehension on a column with array in pyspark?

I have a pyspark dataframe that looks like this.
+--------------------+-------+--------------------+
| ID |country| attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2,3,4...] |
|3de27656 | US|[1,7,2,4...] |
|75ce4e58 | US|[1,2,1,4...] |
|908df65c | US|[1,8,3,0...] |
|f0503257 | US|[1,2,3,2...] |
|2tBxD6j | US|[1,2,3,4...] |
|33811685 | US|[1,5,3,5...] |
|aad21639 | US|[7,8,9,4...] |
|e3d9e3bb | US|[1,10,9,4...] |
|463f6f69 | US|[12,2,13,4...] |
+--------------------+-------+--------------------+
I also have a set that looks like this:
reference_set = (1,2,100,500,821)
What I want to do is create a new list column in the dataframe, using something like the list comprehension [attr for attr in attrs if attr in reference_set].
So my final dataframe should be something like this:
+--------------------+-------+--------------------+
| ID |country| filtered_attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2] |
|3de27656 | US|[1,2] |
|75ce4e58 | US|[1,2] |
|908df65c | US|[1] |
|f0503257 | US|[1,2] |
|2tBxD6j | US|[1,2] |
|33811685 | US|[1] |
|aad21639 | US|[] |
|e3d9e3bb | US|[1] |
|463f6f69 | US|[2] |
+--------------------+-------+--------------------+
How can I do this? As I'm new to PySpark, I can't work out the logic.
Edit: I posted my approach below; if there's a more efficient way of doing this, please let me know.
You can use the built-in function array_intersect.
from pyspark.sql.functions import array_intersect, lit, split

# Sample dataframe
df = spark.createDataFrame([('ffae10af', 'US', [1, 2, 3, 4])], ["ID", "Country", "attrs"])
reference_set = {1, 2, 100, 500, 821}

# Add the set as an array column in the dataframe, then intersect it with attrs
set_to_string = ",".join([str(x) for x in reference_set])
df.withColumn('reference_set', split(lit(set_to_string), ',').cast('array<bigint>')) \
  .withColumn('filtered_attrs', array_intersect('attrs', 'reference_set')) \
  .show(truncate=False)
+--------+-------+------------+---------------------+--------------+
|ID |Country|attrs |reference_set |filtered_attrs|
+--------+-------+------------+---------------------+--------------+
|ffae10af|US |[1, 2, 3, 4]|[1, 2, 100, 500, 821]|[1, 2] |
+--------+-------+------------+---------------------+--------------+
I managed to use the filter function paired with a UDF to make this work.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, LongType

def filter_items(item):
    return item in reference_set

# Declaring the element type keeps the result an array column rather than a string
custom_udf = udf(lambda attributes: list(filter(filter_items, attributes)),
                 ArrayType(LongType()))
processed_df = df.withColumn('filtered_attrs', custom_udf(col('attrs')))
This gives me the required output.
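On Spark 2.4+, the SQL higher-order function filter could achieve the same result without a Python UDF; this is only a sketch that inlines the (assumed small) reference_set into the expression:

from pyspark.sql.functions import expr

# Keep only the array elements that appear in reference_set
in_list = ",".join(str(x) for x in reference_set)
filtered = df.withColumn(
    "filtered_attrs",
    expr(f"filter(attrs, a -> a IN ({in_list}))"),
)
filtered.show(truncate=False)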

Create new columns based on group by with Pyspark

I've got a scenario where I have to take the results from a group by and create new columns.
For example, say I have this data:
| Tool | Category | Price |
| Hammer | Hand Tool | 25.00 |
| Drill | Power Tool | 56.33 |
| Screw Driver | Hand Tool | 4.99 |
My output should look like:
| Tool | Hand Tool | Power Tool |
| Hammer | 25.00 | NULL |
| Drill | NULL | 56.33 |
| Screw Driver | 4.99 | NULL |
I'm not sure how to get this output. I'm trying something like the snippet below, but it blows up with a 'Column is not iterable' error.
def get_tool_info():
    return tool_table.groupBy('Category').pivot('Price', 'Category')
What is the best way to dynamically generate these new columns and assign the price values?
Try this:
from pyspark.sql.types import StructType, StructField, StringType, FloatType
import pyspark.sql.functions as F
schema = StructType([StructField("Tool", StringType()), StructField("Category", StringType()), StructField("Price", FloatType())])
data = [["Hammer", "Hand Tool", 25.00], ["Drill", "Power Tool", 56.33], ["Screw Driver", "Hand Tool", 4.99]]
df = spark.createDataFrame(data, schema)
df.groupby("Tool").pivot("Category").agg(F.first("Price")).show()
Output:
+------------+---------+----------+
| Tool|Hand Tool|Power Tool|
+------------+---------+----------+
| Drill| null| 56.33|
|Screw Driver| 4.99| null|
| Hammer| 25.0| null|
+------------+---------+----------+
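One small design note on the answer above: passing the category values to pivot explicitly avoids the extra pass Spark makes to discover the distinct categories and fixes the column order (the list here is simply taken from the sample data):

# Explicit pivot values skip the distinct-value scan and pin the column order
df.groupby("Tool").pivot("Category", ["Hand Tool", "Power Tool"]).agg(F.first("Price")).show()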

Data Profiling using Python

I have a data frame as below:
member_id | loan_amnt | Age | Marital_status
AK219 | 49539.09 | 34 | Married
AK314 | 1022454.00 | 37 | NA
BN204 | 75422.00 | 34 | Single
I want to create an output file in the below format
Columns | Null Values | Duplicate |
member_id | N | N |
loan_amnt | N | N |
Age | N | Y |
Marital Status| Y | N |
I know about the Python package pandas-profiling, but I want to build this myself in the above manner so that I can extend the code for my data sets.
Use something like:
import pandas as pd

m = df.apply(lambda x: x.duplicated())
n = df.isna()
df_new = (pd.concat([pd.Series(n.any(), name='Null_Values'),
                     pd.Series(m.any(), name='Duplicates')], axis=1)
            .replace({True: 'Y', False: 'N'}))
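A quick usage sketch against the sample data above, assuming the 'NA' marital status arrives as a real missing value (for example via na_values when reading the file):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "member_id": ["AK219", "AK314", "BN204"],
    "loan_amnt": [49539.09, 1022454.00, 75422.00],
    "Age": [34, 37, 34],
    "Marital_status": ["Married", np.nan, "Single"],
})
# With the snippet above, df_new marks Age as Duplicates == 'Y'
# and Marital_status as Null_Values == 'Y', matching the expected output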
Here is a pandas one-liner:
pd.concat([df.isnull().any(), df.apply(lambda x: x.count() != x.nunique())], axis=1).replace({True: "Y", False: "N"})
Actually, pandas-profiling gives you multiple options for figuring out whether there are repeated values.
