I need to merge multiple columns of a dataframe into a single column whose value is a list (or tuple) of those columns, using PySpark in Python.
Input dataframe:
+-------+-------+-------+-------+-------+
| name |mark1 |mark2 |mark3 | Grade |
+-------+-------+-------+-------+-------+
| Jim | 20 | 30 | 40 | "C" |
+-------+-------+-------+-------+-------+
| Bill | 30 | 35 | 45 | "A" |
+-------+-------+-------+-------+-------+
| Kim | 25 | 36 | 42 | "B" |
+-------+-------+-------+-------+-------+
Output dataframe should be
+-------+-----------------+
| name |marks |
+-------+-----------------+
| Jim | [20,30,40,"C"] |
+-------+-----------------+
| Bill | [30,35,45,"A"] |
+-------+-----------------+
| Kim | [25,36,42,"B"] |
+-------+-----------------+
Columns can be merged with Spark's array function:
import pyspark.sql.functions as f
columns = [f.col("mark1"), ...]
output = input.withColumn("marks", f.array(columns)).select("name", "marks")
You might need to change the type of the entries for the merge to succeed, since an array column must hold a single element type.
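For example, a minimal sketch (assuming the marks are integers and Grade is a string, as in the sample) casts every column to string first so the array elements share one type:

import pyspark.sql.functions as f

# cast each column to string so the array has a single element type
columns = [f.col(c).cast("string") for c in ["mark1", "mark2", "mark3", "Grade"]]
output = input.withColumn("marks", f.array(*columns)).select("name", "marks")
output.show(truncate=False)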
Look at this doc: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")

output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)
You can do it in a select like the following:
from pyspark.sql.functions import *

df.select(
    'name',
    concat(
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade")
    ).alias('marks')
)
If the brackets [ ] are necessary, they can be added with the lit function:
from pyspark.sql.functions import *

df.select(
    'name',
    concat(
        lit("["),
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade"), lit("]")
    ).alias('marks')
)
If this is still relevant, you can use StringIndexer to encode your string values to float substitutes.
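VectorAssembler only accepts numeric input columns, so the string Grade column would need to be indexed first. A minimal sketch, assuming the column names from the question (the Grade_idx output name is made up here):

from pyspark.ml.feature import StringIndexer, VectorAssembler

# turn the Grade strings into numeric indices
indexer = StringIndexer(inputCol="Grade", outputCol="Grade_idx")
indexed = indexer.fit(df).transform(df)

# now all four inputs are numeric and can be assembled into one vector column
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3", "Grade_idx"],
    outputCol="marks")
assembler.transform(indexed).select("name", "marks").show(truncate=False)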
I am working on 2 datasets in PySpark, let's say Dataset_A and Dataset_B. I want to check whether the 'P/N' column in Dataset_A is present in the 'Assembly_P/N' column in Dataset_B. Then I need to create a new column in Dataset_A titled 'Present or Not' with the values 'Present' or 'Not Present' depending on the search result.
PS. Both Datasets are huge and I am trying to figure out an efficient solution to do this without actually joining the tables.
Sample:
Dataset_A
| P/N |
| -------- |
| 1bc |
| 2df |
| 1cd |
Dataset_B
| Assembly_P/N |
| -------- |
| 1bc |
| 6gh |
| 2df |
Expected Result
Dataset_A
| P/N | Present or Not |
| -------- | -------- |
| 1bc | Present |
| 2df | Present |
| 1cd | Not Present |
from pyspark.sql.functions import udf
from pyspark.sql.functions import when, col, lit
from pyspark.sql.types import StringType

def check_value(PN):
    if dataset_B(col("Assembly_P/N")).isNotNull().rlike("%PN%"):
        return 'Present'
    else:
        return 'Not Present'

check_value_udf = udf(check_value, StringType())
dataset_A = dataset_A.withColumn('Present or Not', check_value_udf(dataset_A.P/N))
I am getting a PicklingError.
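The PicklingError comes from referencing dataset_B inside the UDF; a DataFrame cannot be serialized into a UDF. A sketch of one alternative, assuming the distinct Assembly_P/N values are small enough to collect into memory (otherwise a join, which Spark can broadcast for a small side, is the usual route):

from pyspark.sql import functions as F

# collect the distinct assembly part numbers into a plain Python list
assembly_pns = [row[0] for row in dataset_B.select("Assembly_P/N").distinct().collect()]

# flag each P/N in Dataset_A depending on membership in that list
dataset_A = dataset_A.withColumn(
    "Present or Not",
    F.when(F.col("P/N").isin(assembly_pns), "Present").otherwise("Not Present"))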
I would like to expand one column with dict values into separate columns, as follows:
+-------+--------------------------------------------+
| Idx| value |
+-------+--------------------------------------------+
| 123|{'country_code': 'gb','postal_area': 'CR'} |
| 456|{'country_code': 'cn','postal_area': 'RS'} |
| 789|{'country_code': 'cl','postal_area': 'QS'} |
+-------+--------------------------------------------+
Then I would like to get something like this:
display(df)
+-------+-------------------------------+
| Idx| country_code | postal_area |
+-------+-------------------------------+
| 123| gb | CR |
| 456| cn | RS |
| 789| cl | QS |
+-------+-------------------------------+
I just tried to do it for only one line, like this:
# PySpark code
import json

sc = spark.sparkContext
dict_lst = {'country_code': 'gb', 'postal_area': 'CR'}
rdd = sc.parallelize([json.dumps(dict_lst)])
df = spark.read.json(rdd)
display(df)
and I got:
+-------------+-------------+
|country_code | postal_area |
+-------------+-------------+
| gb          | CR          |
+-------------+-------------+
So here I may have part of the solution. Now I would like to know how I can combine this df with the result dataframe.
Well, after trying, the best solution is getting the values with PySpark's regexp_extract function:
from pyspark.sql.functions import regexp_extract
df.withColumn("country_code", regexp_extract('value', "(?<=.country_code.:\s.)(.*?)(?=\')", 0)).withColumn("postal_area", regexp_extract('value', "(?<=.postal_area.:\s.)(.*?)(?=\')", 0))
Hope this helps with future questions about getting values from a string dictionary.
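For reference, a cleaner variant (just a sketch, assuming the value column is a JSON-like string; the single quotes are replaced with double quotes so it parses) uses from_json with an explicit schema:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("country_code", StringType()),
    StructField("postal_area", StringType()),
])

# make the string valid JSON, parse it into a struct, then flatten the struct
parsed = df.withColumn("parsed", F.from_json(F.regexp_replace("value", "'", '"'), schema))
result = parsed.select("Idx", "parsed.country_code", "parsed.postal_area")
result.show()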
I have a data frame as below:
member_id | loan_amnt | Age | Marital_status
AK219 | 49539.09 | 34 | Married
AK314 | 1022454.00 | 37 | NA
BN204 | 75422.00 | 34 | Single
I want to create an output file in the below format
Columns | Null Values | Duplicate |
member_id | N | N |
loan_amnt | N | N |
Age | N | Y |
Marital Status| Y | N |
I know about a Python package called pandas-profiling, but I want to build this myself in the above manner so that I can enhance the code for my data sets.
Use something like:
import pandas as pd

m = df.apply(lambda x: x.duplicated())
n = df.isna()
df_new = (pd.concat([pd.Series(n.any(), name='Null_Values'),
                     pd.Series(m.any(), name='Duplicates')], axis=1)
            .replace({True: 'Y', False: 'N'}))
Here is a Python one-liner:
pd.concat([df.isnull().any(), df.apply(lambda x: x.count() != x.nunique())], axis=1).replace({True: "Y", False: "N"})
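To match the headings in the requested output, the same idea can be written with named Series (a sketch using the sample frame above):

import pandas as pd

# build the report with the column headings from the question
report = pd.concat(
    [df.isnull().any().rename("Null Values"),
     df.apply(lambda x: x.duplicated().any()).rename("Duplicate")],
    axis=1).replace({True: "Y", False: "N"})
report.index.name = "Columns"
print(report)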
Actually, pandas-profiling gives you multiple options for figuring out whether there are repetitive values.
I have the following table:
df = spark.createDataFrame([(2, 'john', 1, '131434234342'),
                            (2, 'john', 1, '10-22-2018'),
                            (3, 'pete', 8, '10-22-2018'),
                            (3, 'pete', 8, '3258958304'),
                            (5, 'steve', 9, '124324234')],
                           ['id', 'name', 'value', 'date'])
df.show()
+----+-------+-------+--------------+
| id | name | value | date |
+----+-------+-------+--------------+
| 2 | john | 1 | 131434234342 |
| 2 | john | 1 | 10-22-2018 |
| 3 | pete | 8 | 10-22-2018 |
| 3 | pete | 8 | 3258958304 |
| 5 | steve | 9 | 124324234 |
+----+-------+-------+--------------+
I want to remove all rows that are duplicated in id, name, and value (but NOT necessarily in date), so that I end up with:
+----+-------+-------+-----------+
| id | name | value | date |
+----+-------+-------+-----------+
| 5 | steve | 9 | 124324234 |
+----+-------+-------+-----------+
How can I do this in PySpark?
You could groupBy id, name and value and filter on the count column:
df = df.groupBy('id','name','value').count().where('count = 1')
df.show()
+---+-----+-----+-----+
| id| name|value|count|
+---+-----+-----+-----+
| 5|steve| 9| 1|
+---+-----+-----+-----+
You could then drop the count column if needed.
Do a groupBy on the columns you want, count, filter where the count is equal to 1, and then drop the count column, like below:
import pyspark.sql.functions as f
df = df.groupBy("id", "name", "value").agg(f.count("*").alias('cnt')).where('cnt = 1').drop('cnt')
You can add the date column in the GroupBy condition if you want
Hope this helps you
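If the date column needs to survive, as in the expected output above, one option (a sketch) is to count over a window instead of aggregating, so no columns are dropped:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# count rows per (id, name, value) without collapsing them,
# then keep only the combinations that occur exactly once
w = Window.partitionBy("id", "name", "value")
df_unique = (df.withColumn("cnt", f.count("*").over(w))
               .where("cnt = 1")
               .drop("cnt"))
df_unique.show()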
I am having an issue creating a new column from the ordered concatenation of two existing columns on a PySpark dataframe, i.e.:
+------+------+--------+
| Col1 | Col2 | NewCol |
+------+------+--------+
| ORD | DFW | DFWORD |
| CUN | MCI | CUNMCI |
| LAX | JFK | JFKLAX |
+------+------+--------+
In other words, I want to grab Col1 and Col2, order them alphabetically, and concatenate them.
Any suggestions?
Combine concat_ws, array and sort_array:
from pyspark.sql.functions import concat_ws, array, sort_array

df = spark.createDataFrame(
    [("ORD", "DFW"), ("CUN", "MCI"), ("LAX", "JFK")],
    ("Col1", "Col2"))

df.withColumn("NewCol", concat_ws("", sort_array(array("Col1", "Col2")))).show()
# +----+----+------+
# |Col1|Col2|NewCol|
# +----+----+------+
# | ORD| DFW|DFWORD|
# | CUN| MCI|CUNMCI|
# | LAX| JFK|JFKLAX|
# +----+----+------+