Remove duplicates from a dataframe in PySpark - python

I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error:
"AttributeError: 'list' object has no attribute 'dropDuplicates'"
Not quite sure why as I seem to be following the syntax in the latest documentation.
#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()
#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()
#dropping duplicates from the dataframe
df1.dropDuplicates().show()

It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While the class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:
df1 = (sqlContext
    .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
    .dropDuplicates())
df1.collect()
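For completeness, a minimal end-to-end sketch (same file path and column names as in the question) that keeps everything as a DataFrame until the final action, splitting each line only once as a small tidy-up:
# load the CSV and split each line into a 4-tuple, without collecting
rdd1 = sc.textFile(r"C:\myfilename.csv").map(lambda line: tuple(line.split(",")[:4]))
# build the DataFrame and drop duplicate rows before materialising anything
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
df1.dropDuplicates().show()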

If you have a data frame and want to remove all duplicates -- with reference to duplicates in a specific column (called 'colName'):
Count before dedupe:
df.count()
Do the de-dupe (convert the column you are de-duping to string type):
from pyspark.sql.functions import col
df = df.withColumn('colName', col('colName').cast('string'))
df.drop_duplicates(subset=['colName']).count()
You can use a sorted groupBy to check that duplicates have been removed:
df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
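If the frame is too large to pull back with toPandas(), the same check can stay in Spark; a minimal sketch using the same hypothetical 'colName' column:
from pyspark.sql.functions import desc
df.groupBy('colName').count().orderBy(desc('count')).show()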

In summary, the distinct() and dropDuplicates() methods both remove duplicates, with one essential difference:
dropDuplicates() is more suitable when you want to consider only a subset of the columns.
data = [("James","","Smith","36636","M",60000),
("James","Rose","","40288","M",70000),
("Robert","","Williams","42114","",400000),
("Maria","Anne","Jones","39192","F",500000),
("Maria","Mary","Brown","","F",0)]
columns = ["first_name","middle_name","last_name","dob","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)
df.groupBy('first_name').agg(count(
'first_name').alias("count_duplicates")).filter(
col('count_duplicates') >= 2).show()
df.dropDuplicates(['first_name']).show()
# output
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob |gender|salary|
+----------+-----------+---------+-----+------+------+
|James | |Smith |36636|M |60000 |
|James |Rose | |40288|M |70000 |
|Robert | |Williams |42114| |400000|
|Maria |Anne |Jones |39192|F |500000|
|Maria |Mary |Brown | |F |0 |
+----------+-----------+---------+-----+------+------+
+----------+----------------+
|first_name|count_duplicates|
+----------+----------------+
| James| 2|
| Maria| 2|
+----------+----------------+
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name| dob|gender|salary|
+----------+-----------+---------+-----+------+------+
| James| | Smith|36636| M| 60000|
| Maria| Anne| Jones|39192| F|500000|
| Robert| | Williams|42114| |400000|
+----------+-----------+---------+-----+------+------+
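For contrast, a short sketch of distinct() on the same df: it compares entire rows, so here, where no two rows are fully identical, nothing is dropped, whereas dropDuplicates(['first_name']) keeps one row per first_name:
df.distinct().count()                      # 5 -- all rows are unique as whole rows
df.dropDuplicates().count()                # 5 -- with no subset it behaves like distinct()
df.dropDuplicates(['first_name']).count()  # 3 -- one row per first_name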

Related

How to Update a dataframe column by taking value from another dataframe?

I have two dataframes df_1 and df_2:
rdd = spark.sparkContext.parallelize([
(1, '', '5647-0394'),
(2, '', '6748-9384'),
(3, '', '9485-9484')])
df_1 = spark.createDataFrame(rdd, schema=['ID', 'UPDATED_MESSAGE', 'ZIP_CODE'])
# +---+---------------+---------+
# | ID|UPDATED_MESSAGE| ZIP_CODE|
# +---+---------------+---------+
# | 1| |5647-0394|
# | 2| |6748-9384|
# | 3| |9485-9484|
# +---+---------------+---------+
rdd = spark.sparkContext.parallelize([
('JAMES', 'INDIA_WON', '6748-9384')])
df_2 = spark.createDataFrame(rdd, schema=['NAME', 'CODE', 'ADDRESS_CODE'])
# +-----+---------+------------+
# | NAME| CODE|ADDRESS_CODE|
# +-----+---------+------------+
# |JAMES|INDIA_WON| 6748-9384|
# +-----+---------+------------+
I need to update the df_1 column 'UPDATED_MESSAGE' with the value 'INDIA_WON' from the df_2 column 'CODE'. Currently the column 'UPDATED_MESSAGE' is null, and I need to update every row with the value 'INDIA_WON'. How can we do it in PySpark?
The condition here is: if we find the 'ADDRESS_CODE' value in the df_1 column 'ZIP_CODE', we need to populate all the values in 'UPDATED_MESSAGE' with 'INDIA_WON'.
I hope I've interpreted what you need correctly. If so, your logic seems strange. Your tables appear to be very small, and Spark is an engine for big data (millions to billions of records); if your tables really are small, consider doing this in pandas.
from pyspark.sql import functions as F
df_2 = df_2.groupBy('ADDRESS_CODE').agg(F.first('CODE').alias('CODE'))
df_joined = df_1.join(df_2, df_1.ZIP_CODE == df_2.ADDRESS_CODE, 'left')
df_filtered = df_joined.filter(~F.isnull('ADDRESS_CODE'))
if bool(df_filtered.head(1)):
    df_1 = df_1.withColumn('UPDATED_MESSAGE', F.lit(df_filtered.head()['CODE']))
df_1.show()
# +---+---------------+---------+
# | ID|UPDATED_MESSAGE| ZIP_CODE|
# +---+---------------+---------+
# | 1| INDIA_WON|5647-0394|
# | 2| INDIA_WON|6748-9384|
# | 3| INDIA_WON|9485-9484|
# +---+---------------+---------+
The Python method below returns either the original df_1, when no ZIP_CODE match has been found in df_2, or a modified df_1 whose UPDATED_MESSAGE column is filled in with the value from the df_2.CODE column:
from pyspark.sql.functions import col, lit
def update_df1(df_1, df_2):
    if df_1.join(df_2, on=(col("ZIP_CODE") == col("ADDRESS_CODE")), how="inner").count() == 0:
        return df_1
    code = df_2.collect()[0]["CODE"]
    return df_1.withColumn("UPDATED_MESSAGE", lit(code))
update_df1(df_1, df_2).show()
+---+---------------+---------+
| ID|UPDATED_MESSAGE| ZIP_CODE|
+---+---------------+---------+
| 1| INDIA_WON|5647-0394|
| 2| INDIA_WON|6748-9384|
| 3| INDIA_WON|9485-9484|
+---+---------------+---------+
I propose using a broadcast join in this case to avoid an excessive shuffle.
Code and logic below:
from pyspark.sql.functions import broadcast
new = (df_1.drop('UPDATED_MESSAGE')                               # drop the null column
    .join(broadcast(df_2.drop('NAME')),                           # broadcast the small frame and join
          how='left', on=df_1.ZIP_CODE == df_2.ADDRESS_CODE)
    .drop('ADDRESS_CODE')                                         # drop the column no longer needed
    .toDF('ID', 'ZIP_CODE', 'UPDATED_MESSAGE')                    # rename the columns of the new df
).show()
Why use dataframes when Spark SQL is so much easier?
Turn data frames into temporary views.
%python
df_1.createOrReplaceTempView("tmp_zipcodes")
df_2.createOrReplaceTempView("tmp_person")
Write simple Spark SQL to get the answer.
%sql
select
a.id,
case when b.code is null then '' else b.code end as update_message,
a.zip_code
from tmp_zipcodes as a
left join tmp_person as b
on a.zip_code = b.address_code
Use spark.sql() to make a dataframe if you need to write the result to disk; this overwrites the whole data frame with the new answer:
sql_txt = """
select
a.id,
case when b.code is null then '' else b.code end as update_message,
a.zip_code
from tmp_zipcodes as a
left join tmp_person as b
on a.zip_code = b.address_code
"""
df_1 = spark.sql(sql_txt)

Transform nested dictionary key values to pyspark dataframe

I have a PySpark dataframe (a sample is shown below). I would like to extract the nested dictionaries in the "dic" column and transform them into PySpark dataframe columns. Please let me know how I can achieve this.
Thanks!
from pyspark.sql import functions as F
df.show() #sample dataframe
+---------+----------------------------------------------------------------------------------------------------------+
|timestmap|dic |
+---------+----------------------------------------------------------------------------------------------------------+
|timestamp|{"Name":"David","Age":"25","Location":"New York","Height":"170","fields":{"Color":"Blue","Shape":"round"}}|
+---------+----------------------------------------------------------------------------------------------------------+
For Spark 2.4+, you could use from_json and schema_of_json:
schema = df.select(F.schema_of_json(df.select("dic").first()[0])).first()[0]
df.withColumn("dic", F.from_json("dic", schema))\
    .selectExpr("dic.*").selectExpr("*", "fields.*").drop("fields").show()
#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25| 170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+
You could also go the RDD way with spark.read.json if you don't have Spark 2.4. There will be a performance hit from the DataFrame-to-RDD conversion.
df1 = spark.read.json(df.rdd.map(lambda r: r.dic))
df1.select(*[x for x in df1.columns if x != 'fields'], F.col("fields.*")).show()
#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25| 170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+

PySpark - Date 0000.00.00 imported differently via function .to_date() and .csv() module

I am importing data which has a date column in yyyy.MM.dd format. Missing values have been marked as 0000.00.00. This 0000.00.00 is treated differently depending on the function/module employed to bring the data into the dataframe.
The .csv file looks like this:
2016.12.23,2016.12.23
0000.00.00,0000.00.00
Method 1: .csv()
from pyspark.sql.types import StructType, StructField, StringType, DateType
schema = StructType([
    StructField('date', StringType()),
    StructField('date1', DateType()),
])
df = spark.read.schema(schema)\
    .format('csv')\
    .option('header', 'false')\
    .option('sep', ',')\
    .option('dateFormat', 'yyyy.MM.dd')\
    .load(path + 'file.csv')
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00|0002-11-30|
+----------+----------+
Method 2: .to_date()
from pyspark.sql.functions import to_date, col
df = sqlContext.createDataFrame([('2016.12.23','2016.12.23'),('0000.00.00','0000.00.00')],['date','date1'])
df = df.withColumn('date1',to_date(col('date1'),'yyyy.MM.dd'))
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00| null|
+----------+----------+
Question: Why do the two methods give different results? I would have expected to get null in both cases; instead, in the first case I get 0002-11-30. Can anyone explain this anomaly?

Compare a pyspark dataframe to another dataframe

I have 2 data frames to compare. Both have the same number of columns, and the comparison result should contain the mismatching field and the values, along with the ID.
Dataframe one
+-----+---+--------+
| name| id| City|
+-----+---+--------+
| Sam| 3| Toronto|
| BALU| 11| YYY|
|CLAIR| 7|Montreal|
|HELEN| 10| London|
|HELEN| 16| Ottawa|
+-----+---+--------+
Dataframe two
+-------------+-----------+-------------+
|Expected_name|Expected_id|Expected_City|
+-------------+-----------+-------------+
| SAM| 3| Toronto|
| BALU| 11| YYY|
| CLARE| 7| Montreal|
| HELEN| 10| Londn|
| HELEN| 15| Ottawa|
+-------------+-----------+-------------+
Expected Output
+---+------------+--------------+-----+
| ID|Actual_value|Expected_value|Field|
+---+------------+--------------+-----+
| 7| CLAIR| CLARE| name|
| 3| Sam| SAM| name|
| 10| London| Londn| City|
+---+------------+--------------+-----+
Code
Create example data
from pyspark.sql import SQLContext
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession
sc = SparkContext()
sql_context = SQLContext(sc)
spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR") # log only on fails
df_Actual = sql_context.createDataFrame(
[("Sam", 3,'Toronto'), ("BALU", 11,'YYY'), ("CLAIR", 7,'Montreal'),
("HELEN", 10,'London'), ("HELEN", 16,'Ottawa')],
["name", "id","City"]
)
df_Expected = sql_context.createDataFrame(
[("SAM", 3,'Toronto'), ("BALU", 11,'YYY'), ("CLARE", 7,'Montreal'),
("HELEN", 10,'Londn'), ("HELEN", 15,'Ottawa')],
["Expected_name", "Expected_id","Expected_City"]
)
Create empty dataframe for Result
field = [
StructField("ID",StringType(), True),
StructField("Actual_value", StringType(), True),
StructField("Expected_value", StringType(), True),
StructField("Field", StringType(), True)
]
schema = StructType(field)
Df_Result = sql_context.createDataFrame(sc.emptyRDD(), schema)
Join expected and actual on id's
df_combined = df_Actual.join(df_Expected, (df_Actual.id == df_Expected.Expected_id))
col_names=df_Actual.schema.names
Loop through each column to find mismatches
for col_name in col_names:
    # Filter for column values not matching
    df_comp = df_combined.filter(col(col_name) != col("Expected_" + col_name))\
        .select(col('id'), col(col_name), col("Expected_" + col_name))
    # Add the name of the mismatching column
    df_comp = df_comp.withColumn("Field", lit(col_name))
    # Add to the final result
    Df_Result = Df_Result.union(df_comp)
Df_Result.show()
This code works as expected. However, in the real case I have more columns and millions of rows to compare, and with this code the comparison takes a long time to finish. Is there a better way to increase the performance and get the same result?
One way to avoid doing the union is the following:
Create a list of columns to compare: to_compare
Next select the id column and use pyspark.sql.functions.when to compare the columns. For those with a mismatch, build an array of structs with 3 fields: (Actual_value, Expected_value, Field) for each column in to_compare
Explode the temp array column and drop the nulls
Finally select the id and use col.* to expand the values from the struct into columns.
Code (each mismatch is stored as a struct of the three result fields):
import pyspark.sql.functions as f
# these are the fields you want to compare
to_compare = [c for c in df_Actual.columns if c != "id"]
df_new = df_combined.select(
"id",
f.array([
f.when(
f.col(c) != f.col("Expected_"+c),
f.struct(
f.col(c).alias("Actual_value"),
f.col("Expected_"+c).alias("Expected_value"),
f.lit(c).alias("Field")
)
).alias(c)
for c in to_compare
]).alias("temp")
)\
.select("id", f.explode("temp"))\
.dropna()\
.select("id", "col.*")
df_new.show()
#+---+------------+--------------+-----+
#| id|Actual_value|Expected_value|Field|
#+---+------------+--------------+-----+
#| 7| CLAIR| CLARE| name|
#| 10| London| Londn| City|
#| 3| Sam| SAM| name|
#+---+------------+--------------+-----+
Join only those records where the expected id equals the actual id and there is a mismatch in any other column:
df1.join(df2, (df1.id == df2.id) & ((df1.name != df2.name) | (df1.age != df2.age) | ...))
This way the per-column loop runs only across mismatched rows instead of the whole dataset.
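Applied to the example frames in this thread, that condition would look something like the sketch below (just an illustration of the idea, using the Expected_* column names from the question):
mismatches = df_Actual.join(
    df_Expected,
    (df_Actual.id == df_Expected.Expected_id) &
    ((df_Actual.name != df_Expected.Expected_name) |
     (df_Actual.City != df_Expected.Expected_City))
)
mismatches.show()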
For those who are looking for an answer: I transposed the data frames and then did the comparison.
from pyspark.sql.functions import array, col, explode, struct, lit
def Transposedf(df, by, colheader):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([struct(lit(c).alias("Field"), col(c).alias(colheader)) for c in cols])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.Field", "kvs." + colheader])
Then the comparison looks like this
def Compare_df(df_Expected, df_Actual):
    df_combined = (df_Actual
        .join(df_Expected, ((df_Actual.id == df_Expected.id)
            & (df_Actual.Field == df_Expected.Field)
            & (df_Actual.Actual_value != df_Expected.Expected_value)))
        .select([df_Actual.id, df_Actual.Field, df_Actual.Actual_value, df_Expected.Expected_value])
    )
    return df_combined
I called these 2 functions as
df_Actual=Transposedf(df_Actual, ["id"],'Actual_value')
df_Expected=Transposedf(df_Expected, ["id"],'Expected_value')
#Compare the expected and actual
df_result=Compare_df(df_Expected,df_Actual)
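To make the transpose step concrete, after Transposedf each frame has one row per (id, column) pair; an abridged sketch of df_Actual derived from the example data above:
# +---+-----+------------+
# | id|Field|Actual_value|
# +---+-----+------------+
# |  3| name|         Sam|
# |  3| City|     Toronto|
# | 11| name|        BALU|
# | 11| City|         YYY|
# |...|  ...|         ...|
# +---+-----+------------+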

Pyspark: explode json in column to multiple columns

The data looks like this -
+---+-----+-----------------------------+
| id|point|                         data|
+---+-----+-----------------------------+
|abc|    6|{"key1":"124", "key2": "345"}|
|dfl|    7|{"key1":"777", "key2": "888"}|
|4bd|    6|{"key1":"111", "key2": "788"}|
+---+-----+-----------------------------+
I am trying to break it into the following format.
+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc|    6| 124| 345|
|dfl|    7| 777| 888|
|4bd|    6| 111| 788|
+---+-----+----+----+
The explode function explodes the dataframe into multiple rows, but that is not the desired solution here.
Note: this related question does not answer mine:
PySpark "explode" dict in column
As long as you are using Spark version 2.1 or higher, pyspark.sql.functions.from_json should get you your desired result, but you would need to first define the required schema
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType(
[
StructField('key1', StringType(), True),
StructField('key2', StringType(), True)
]
)
df.withColumn("data", from_json("data", schema))\
.select(col('id'), col('point'), col('data.*'))\
.show()
which should give you
+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc| 6| 124| 345|
|dfl|    7| 777| 888|
|4bd| 6| 111| 788|
+---+-----+----+----+
As suggested by @pault, the data field is a string field. Since the keys are the same across rows (i.e. 'key1', 'key2') in the JSON strings, you can also use json_tuple() (this function is new in version 1.6, according to the documentation):
from pyspark.sql import functions as F
df.select('id', 'point', F.json_tuple('data', 'key1', 'key2').alias('key1', 'key2')).show()
Below is my original post, which is most likely wrong if the original table comes from df.show(truncate=False), because then the data field is NOT a Python data structure.
Since you had exploded the data into rows, I assumed the column data was a Python data structure instead of a string:
from pyspark.sql import functions as F
df.select('id', 'point', F.col('data').getItem('key1').alias('key1'), F.col('data')['key2'].alias('key2')).show()
As mentioned by @jxc, json_tuple should work fine if you are not able to define the schema beforehand and you only need to deal with a single level of JSON string. I think it's more straightforward and easier to use. Strangely, I didn't find anyone else mentioning this function before.
In my use case, the original dataframe schema is StructType(List(StructField(a,StringType,true))), and the JSON string column is shown as:
+---------------------------------------+
|a |
+---------------------------------------+
|{"k1": "v1", "k2": "2", "k3": {"m": 1}}|
|{"k1": "v11", "k3": "v33"} |
|{"k1": "v13", "k2": "23"} |
+---------------------------------------+
Expand json fields into new columns with json_tuple:
from pyspark.sql import functions as F
df = df.select(F.col('a'),
F.json_tuple(F.col('a'), 'k1', 'k2', 'k3') \
.alias('k1', 'k2', 'k3'))
df.schema
df.show(truncate=False)
The documentation doesn't say much about it, but at least in my use case, the new columns extracted by json_tuple are StringType, and it only extracts a single level of the JSON string.
StructType(List(StructField(k1,StringType,true),StructField(k2,StringType,true),StructField(k3,StringType,true)))
+---------------------------------------+---+----+-------+
|a |k1 |k2 |k3 |
+---------------------------------------+---+----+-------+
|{"k1": "v1", "k2": "2", "k3": {"m": 1}}|v1 |2 |{"m":1}|
|{"k1": "v11", "k3": "v33"} |v11|null|v33 |
|{"k1": "v13", "k2": "23"} |v13|23 |null |
+---------------------------------------+---+----+-------+
This works for my use case:
from pyspark.sql.functions import from_json
data1 = spark.read.parquet(path)
json_schema = spark.read.json(data1.rdd.map(lambda row: row.json_col)).schema
data2 = data1.withColumn("data", from_json("json_col", json_schema))
col1 = data2.columns
col1.remove("data")
col2 = data2.select("data.*").columns
append_str = "data."
col3 = [append_str + val for val in col2]
col_list = col1 + col3
data3 = data2.select(*col_list).drop("json_col")
All credit to Shrikant Prabhu.
You can simply use SQL:
SELECT id, point, data.*
FROM original_table
This way the schema of the new table will adapt if the data changes, and you won't have to change anything in your pipeline.
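To run that statement from PySpark you first need the dataframe registered as a view; a minimal sketch, assuming the data column has already been parsed into a struct (e.g. with from_json as shown earlier) and using original_table as a hypothetical view name:
df.createOrReplaceTempView("original_table")
spark.sql("SELECT id, point, data.* FROM original_table").show()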
In this approach you just need to set the name of the column with the JSON content.
There is no need to set up the schema; everything is derived automatically.
json_col_name = 'data'
keys = df.head()[json_col_name].keys()
jsonFields= [f"{json_col_name}.{key} {key}" for key in keys]
main_fields = [key for key in df.columns if key != json_col_name]
df_new = df.selectExpr(main_fields + jsonFields)
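A usage sketch against the example data in this thread: the head()[...].keys() trick assumes the JSON column already holds a map or dict-like value, so if 'data' is still a raw JSON string you would parse it first (here with from_json into a struct, reading the keys via asDict()); this is an illustration of the idea, not the original poster's exact setup:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType
json_col_name = 'data'
schema = StructType([StructField('key1', StringType()), StructField('key2', StringType())])
parsed = df.withColumn(json_col_name, from_json(json_col_name, schema))
keys = parsed.head()[json_col_name].asDict().keys()
jsonFields = [f"{json_col_name}.{key} {key}" for key in keys]
main_fields = [c for c in parsed.columns if c != json_col_name]
parsed.selectExpr(main_fields + jsonFields).show()
# expected columns: id, point, key1, key2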
