I need to merge multiple columns of a dataframe into a single column whose value is a list (or tuple) of those columns, using PySpark in Python.
Input dataframe:
+-------+-------+-------+-------+-------+
| name |mark1 |mark2 |mark3 | Grade |
+-------+-------+-------+-------+-------+
| Jim | 20 | 30 | 40 | "C" |
+-------+-------+-------+-------+-------+
| Bill | 30 | 35 | 45 | "A" |
+-------+-------+-------+-------+-------+
| Kim | 25 | 36 | 42 | "B" |
+-------+-------+-------+-------+-------+
Output dataframe should be
+-------+-----------------+
| name |marks |
+-------+-----------------+
| Jim | [20,30,40,"C"] |
+-------+-----------------+
| Bill | [30,35,45,"A"] |
+-------+-----------------+
| Kim | [25,36,42,"B"] |
+-------+-----------------+
Columns can be merged with Spark's array function:
import pyspark.sql.functions as f
columns = [f.col("mark1"), ...]
output = input.withColumn("marks", f.array(columns)).select("name", "marks")
You might need to change the type of the entries for the merge to succeed, since an array column must hold a single element type.
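For example, a minimal sketch (assuming the marks are integers and Grade is a string, as in the sample) casts every column to string first so the array elements share one type:

import pyspark.sql.functions as f

# cast each column to string so the array has a single element type
columns = [f.col(c).cast("string") for c in ["mark1", "mark2", "mark3", "Grade"]]
output = input.withColumn("marks", f.array(*columns)).select("name", "marks")
output.show(truncate=False)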
Look at this doc: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")

output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)
You can do it in a select like the following:
from pyspark.sql.functions import *

df.select(
    'name',
    concat(
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade")
    ).alias('marks')
)
If the brackets [ ] are necessary, they can be added with the lit function:
from pyspark.sql.functions import *

df.select(
    'name',
    concat(
        lit("["),
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade"), lit("]")
    ).alias('marks')
)
If this is still relevant, you can use StringIndexer to encode your string values to float substitutes.
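VectorAssembler only accepts numeric input columns, so the string Grade column would need to be indexed first. A minimal sketch, assuming the column names from the question (the Grade_idx output name is made up here):

from pyspark.ml.feature import StringIndexer, VectorAssembler

# turn the Grade strings into numeric indices
indexer = StringIndexer(inputCol="Grade", outputCol="Grade_idx")
indexed = indexer.fit(df).transform(df)

# now all four inputs are numeric and can be assembled into one vector column
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3", "Grade_idx"],
    outputCol="marks")
assembler.transform(indexed).select("name", "marks").show(truncate=False)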
I am working on 2 datasets in PySpark, let's say Dataset_A and Dataset_B. I want to check whether the 'P/N' column in Dataset_A is present in the 'Assembly_P/N' column in Dataset_B. Then I need to create a new column in Dataset_A titled 'Present or Not' with the values 'Present' or 'Not Present' depending on the search result.
PS. Both Datasets are huge and I am trying to figure out an efficient solution to do this without actually joining the tables.
Sample:
Dataset_A
| P/N |
| -------- |
| 1bc |
| 2df |
| 1cd |
Dataset_B
| Assembly_P/N |
| -------- |
| 1bc |
| 6gh |
| 2df |
Expected Result
Dataset_A
| P/N | Present or Not |
| -------- | -------- |
| 1bc | Present |
| 2df | Present |
| 1cd | Not Present |
from pyspark.sql.functions import udf
from pyspark.sql.functions import when, col, lit
from pyspark.sql.types import StringType

def check_value(PN):
    if dataset_B(col("Assembly_P/N")).isNotNull().rlike("%PN%"):
        return 'Present'
    else:
        return 'Not Present'

check_value_udf = udf(check_value, StringType())
dataset_A = dataset_A.withColumn('Present or Not', check_value_udf(dataset_A.P/N))
I am getting a PicklingError.
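The PicklingError comes from referencing dataset_B inside the UDF; a DataFrame cannot be serialized into a UDF. A sketch of one alternative, assuming the distinct Assembly_P/N values are small enough to collect into memory (otherwise a join, which Spark can broadcast for a small side, is the usual route):

from pyspark.sql import functions as F

# collect the distinct assembly part numbers into a plain Python list
assembly_pns = [row[0] for row in dataset_B.select("Assembly_P/N").distinct().collect()]

# flag each P/N in Dataset_A depending on membership in that list
dataset_A = dataset_A.withColumn(
    "Present or Not",
    F.when(F.col("P/N").isin(assembly_pns), "Present").otherwise("Not Present"))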
I would like to expand one column with dict values into separate columns, as follows:
+-------+--------------------------------------------+
| Idx| value |
+-------+--------------------------------------------+
| 123|{'country_code': 'gb','postal_area': 'CR'} |
| 456|{'country_code': 'cn','postal_area': 'RS'} |
| 789|{'country_code': 'cl','postal_area': 'QS'} |
+-------+--------------------------------------------+
Then I would like to get something like this:
display(df)
+-------+-------------------------------+
| Idx| country_code | postal_area |
+-------+-------------------------------+
| 123| gb | CR |
| 456| cn | RS |
| 789| cl | QS |
+-------+-------------------------------+
I just tried to do it for only one line, like this:
# PySpark code
import json

sc = spark.sparkContext
dict_lst = {'country_code': 'gb', 'postal_area': 'CR'}
rdd = sc.parallelize([json.dumps(dict_lst)])
df = spark.read.json(rdd)
display(df)
and I got:
+-------------+-------------+
|country_code | postal_area |
+-------------+-------------+
| gb          | CR          |
+-------------+-------------+
So here I may have part of the solution. Now I would like to know how I can combine this df with the result dataframe.
Well, after trying, the best solution is getting the values with PySpark's regexp_extract function:
from pyspark.sql.functions import regexp_extract
df.withColumn("country_code", regexp_extract('value', "(?<=.country_code.:\s.)(.*?)(?=\')", 0)).withColumn("postal_area", regexp_extract('value', "(?<=.postal_area.:\s.)(.*?)(?=\')", 0))
Hope this helps with future questions about getting values from a string dictionary.
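For reference, a cleaner variant (just a sketch, assuming the value column is a JSON-like string; the single quotes are replaced with double quotes so it parses) uses from_json with an explicit schema:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("country_code", StringType()),
    StructField("postal_area", StringType()),
])

# make the string valid JSON, parse it into a struct, then flatten the struct
parsed = df.withColumn("parsed", F.from_json(F.regexp_replace("value", "'", '"'), schema))
result = parsed.select("Idx", "parsed.country_code", "parsed.postal_area")
result.show()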
I have a data frame as below:
member_id | loan_amnt | Age | Marital_status
AK219 | 49539.09 | 34 | Married
AK314 | 1022454.00 | 37 | NA
BN204 | 75422.00 | 34 | Single
I want to create an output file in the below format
Columns | Null Values | Duplicate |
member_id | N | N |
loan_amnt | N | N |
Age | N | Y |
Marital Status| Y | N |
I know about a Python package called pandas-profiling, but I want to build this myself in the above manner so that I can enhance the code for my data sets.
Use something like:
import pandas as pd

m = df.apply(lambda x: x.duplicated())
n = df.isna()
df_new = (pd.concat([pd.Series(n.any(), name='Null_Values'),
                     pd.Series(m.any(), name='Duplicates')], axis=1)
            .replace({True: 'Y', False: 'N'}))
Here is a Python one-liner:
pd.concat([df.isnull().any(), df.apply(lambda x: x.count() != x.nunique())], axis=1).replace({True: "Y", False: "N"})
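To match the headings in the requested output, the same idea can be written with named Series (a sketch using the sample frame above):

import pandas as pd

# build the report with the column headings from the question
report = pd.concat(
    [df.isnull().any().rename("Null Values"),
     df.apply(lambda x: x.duplicated().any()).rename("Duplicate")],
    axis=1).replace({True: "Y", False: "N"})
report.index.name = "Columns"
print(report)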
Actually, pandas-profiling gives you multiple options for figuring out whether there are repetitive values.
I have the following table:
df = spark.createDataFrame([(2, 'john', 1, '131434234342'),
                            (2, 'john', 1, '10-22-2018'),
                            (3, 'pete', 8, '10-22-2018'),
                            (3, 'pete', 8, '3258958304'),
                            (5, 'steve', 9, '124324234')],
                           ['id', 'name', 'value', 'date'])
df.show()
+----+-------+-------+--------------+
| id | name | value | date |
+----+-------+-------+--------------+
| 2 | john | 1 | 131434234342 |
| 2 | john | 1 | 10-22-2018 |
| 3 | pete | 8 | 10-22-2018 |
| 3 | pete | 8 | 3258958304 |
| 5 | steve | 9 | 124324234 |
+----+-------+-------+--------------+
I want to remove all rows that are duplicated in id, name, and value (but NOT necessarily in date), so that I end up with:
+----+-------+-------+-----------+
| id | name | value | date |
+----+-------+-------+-----------+
| 5 | steve | 9 | 124324234 |
+----+-------+-------+-----------+
How can I do this in PySpark?
You could groupBy id, name and value and filter on the count column:
df = df.groupBy('id','name','value').count().where('count = 1')
df.show()
+---+-----+-----+-----+
| id| name|value|count|
+---+-----+-----+-----+
| 5|steve| 9| 1|
+---+-----+-----+-----+
You could then drop the count column if needed.
Do a groupBy on the columns you want, count, filter where the count is equal to 1, and then drop the count column, like below:
import pyspark.sql.functions as f
df = df.groupBy("id", "name", "value").agg(f.count("*").alias('cnt')).where('cnt = 1').drop('cnt')
You can add the date column in the GroupBy condition if you want
Hope this helps you
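If the date column needs to survive, as in the expected output above, one option (a sketch) is to count over a window instead of aggregating, so no columns are dropped:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# count rows per (id, name, value) without collapsing them,
# then keep only the combinations that occur exactly once
w = Window.partitionBy("id", "name", "value")
df_unique = (df.withColumn("cnt", f.count("*").over(w))
               .where("cnt = 1")
               .drop("cnt"))
df_unique.show()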
I am having an issue creating a new column from the ordered concatenation of two existing columns on a PySpark dataframe, i.e.:
+------+------+--------+
| Col1 | Col2 | NewCol |
+------+------+--------+
| ORD | DFW | DFWORD |
| CUN | MCI | CUNMCI |
| LAX | JFK | JFKLAX |
+------+------+--------+
In other words, I want to grab Col1 and Col2, order them alphabetically, and concatenate them.
Any suggestions?
Combine concat_ws, array and sort_array:
from pyspark.sql.functions import concat_ws, array, sort_array

df = spark.createDataFrame(
    [("ORD", "DFW"), ("CUN", "MCI"), ("LAX", "JFK")],
    ("Col1", "Col2"))

df.withColumn("NewCol", concat_ws("", sort_array(array("Col1", "Col2")))).show()
# +----+----+------+
# |Col1|Col2|NewCol|
# +----+----+------+
# | ORD| DFW|DFWORD|
# | CUN| MCI|CUNMCI|
# | LAX| JFK|JFKLAX|
# +----+----+------+