I have a Pyspark dataframe that looks like this:
I would like to extract the nested dictionaries in the "dic" column and transform them into a PySpark dataframe, like this:
Please let me know how I can achieve this.
Thanks!
from pyspark.sql import functions as F
df.show() #sample dataframe
+---------+----------------------------------------------------------------------------------------------------------+
|timestmap|dic |
+---------+----------------------------------------------------------------------------------------------------------+
|timestamp|{"Name":"David","Age":"25","Location":"New York","Height":"170","fields":{"Color":"Blue","Shape":"round"}}|
+---------+----------------------------------------------------------------------------------------------------------+
For Spark 2.4+, you could use from_json and schema_of_json.
# infer the JSON schema from the first row of the "dic" column
schema = df.select(F.schema_of_json(df.select("dic").first()[0])).first()[0]

# parse the JSON string, then flatten the struct and its nested "fields" struct
df.withColumn("dic", F.from_json("dic", schema))\
    .selectExpr("dic.*").selectExpr("*", "fields.*").drop("fields").show()
#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25| 170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+
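If you also want to keep the original timestamp column (shown as timestmap in the sample), a minimal sketch reusing the same schema from above:
df.withColumn("dic", F.from_json("dic", schema))\
    .select("timestmap", "dic.*")\
    .select("*", "fields.*").drop("fields").show()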
If you don't have Spark 2.4, you could also go through an RDD and spark.read.json. There will be a performance hit from the DataFrame-to-RDD conversion.
# read each JSON string as its own row (note: the timestamp column is not carried over)
df1 = spark.read.json(df.rdd.map(lambda r: r.dic))
df1.select(*[x for x in df1.columns if x != 'fields'], F.col("fields.*")).show()
#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25| 170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+
I am trying to use a Python function with PySpark dataframes. I need to pass two dataframes as input and would like to store the result in another dataframe.
Python function I want to use:
@udf(StringType())
def fuzz_ratio(df1, df2):
    return np.vectorize(fuzz.token_sort_ratio(df1, df2))
This is how I am trying to use the above function:
result_df.withcolumn("VAL", fuzz_ratio(col(df1.VAL), col(df2.VAL)))
df1 and df2 are the inputs. The VAL columns of both dataframes contain the values I need to pass to the fuzz_ratio function. The output should be saved in the VAL column of result_df.
Example:
VAL is the column name in all the dataframes. The VAL column in df1 and df2 is of string type.
Once you move both columns into the same dataframe, something like the following pandas_udf could be used. A pandas_udf is vectorized for performance; it's not the same as a regular Spark udf.
Input:
from pyspark.sql import functions as F
import pandas as pd
from fuzzywuzzy import fuzz
df = spark.createDataFrame([('danial khilji', 'danial'), ('a','as',)], ['col1', 'col2'])
df.show()
# +-------------+------+
# | col1| col2|
# +-------------+------+
# |danial khilji|danial|
# | a| as|
# +-------------+------+
Script:
@F.pandas_udf('long')
def fuzz_ratio(c1: pd.Series, c2: pd.Series) -> pd.Series:
    return pd.concat([c1, c2], axis=1).apply(lambda x: fuzz.token_sort_ratio(x[0], x[1]), axis=1)
df.withColumn('fuzz_ratio', fuzz_ratio('col1', 'col2')).show()
# +-------------+------+----------+
# | col1| col2|fuzz_ratio|
# +-------------+------+----------+
# |danial khilji|danial| 63|
# | a| as| 67|
# +-------------+------+----------+
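If the two VAL columns start out in separate dataframes, as in the question, they first need to land in one dataframe, for example via a join. A sketch, assuming df1 and df2 share an id join key (that key is an assumption, it is not given in the question):
# sketch: join on a hypothetical shared 'id' key, then apply the pandas_udf above
joined = df1.alias('a').join(df2.alias('b'), on='id') \
    .select(F.col('a.VAL').alias('col1'), F.col('b.VAL').alias('col2'))
result_df = joined.withColumn('VAL', fuzz_ratio('col1', 'col2'))
result_df.show()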
I have a Spark dataframe with a column emailID: ram.shyam.78uy#testing.com. I would like to extract the string between "." and "#", i.e. 78uy, and store it in a column.
I tried:
split_for_alias = split(rs_csv['emailID'],'[.]')
rs_csv_alias= rs_csv.withColumn('alias',split_for_alias.getItem(size(split_for_alias) -2))
It's adding 78uy#testing as the alias. I could add another column and chop off the extra values, but is it possible to do this in a single statement?
Extract the alphanumeric substring immediately to the left of the special character "." and immediately followed by the special character "#".
DataFrame
data= [
(1,"am.shyam.78uy#testing.com"),
(2, "j.k.kilo#jom.com")
]
df=spark.createDataFrame(data, ("id",'emailID'))
df.show()
+---+--------------------+
| id| emailID|
+---+--------------------+
| 1|am.shyam.78uy#tes...|
| 2| j.k.kilo#jom.com|
+---+--------------------+
Code
from pyspark.sql.functions import regexp_extract

df.withColumn('name', regexp_extract('emailID', r'(?<=\.)(\w+)(?=#)', 1)).show()
outcome
+---+--------------------+----+
| id| emailID|name|
+---+--------------------+----+
| 1|am.shyam.78uy#tes...|78uy|
| 2| j.k.kilo#jom.com|kilo|
+---+--------------------+----+
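If you prefer to stay with the split() approach from the question, a single-statement sketch (Spark 2.4+, using element_at with a negative index) could look like this:
from pyspark.sql import functions as F

# sketch: take everything before '#', split it on '.', and keep the last element
df.withColumn(
    'name',
    F.element_at(F.split(F.split('emailID', '#').getItem(0), r'\.'), -1)
).show()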
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python, and Fugue can then port it to Spark for you with one function call.
First we setup a Pandas DataFrame to test:
import pandas as pd
df = pd.DataFrame({"id":[1,2],"email": ["am.shyam.78uy#testing.com", "j.k.kilo#jom.com"]})
Next, we make a native Python function. The logic is clear this way.
from typing import List, Dict, Any
def extract(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        # take the part before '#', then the last '.'-separated token
        email = row["email"].split("#")[0].split(".")[-1]
        row["new_col"] = email
    return df
Then we can test on the Pandas engine:
from fugue import transform
transform(df, extract, schema="*, new_col:str")
Because it works, we can bring it to Spark by supplying an engine:
import fugue_spark
transform(df, extract, schema="*, new_col:str", engine="spark").show()
+---+--------------------+-------+
| id| email|new_col|
+---+--------------------+-------+
| 1|am.shyam.78uy#tes...| 78uy|
| 2| j.k.kilo#jom.com| kilo|
+---+--------------------+-------+
Note .show() is needed because Spark evaluates lazily. This transform can take in both Pandas and Spark DataFrames and will output a Spark DataFrame if using the Spark engine.
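For completeness, a minimal sketch of the Spark-DataFrame-in path (assuming an active SparkSession named spark); the call is the same, only the input type changes:
# sketch: passing a Spark DataFrame in; on the Spark engine the result is a Spark DataFrame
sdf = spark.createDataFrame(df)  # df is the Pandas DataFrame from above
transform(sdf, extract, schema="*, new_col:str", engine="spark").show()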
I have a dataframe
and I want to add a € sign and a % sign to my resultant dataframe where there are values, not to all rows. My final dataframe would be:
Here is what I tried:
df = lit(col('€'+'Currency'))
df= lit(col('Average'+'%'))
Thanks in advance
In PySpark this is a simple when().otherwise() implementation. Make sure the columns are cast to StringType() rather than DoubleType().
from pyspark.sql import functions as F
# Sample Dataframe
data = [(None,"55.6"),("492.38",None)]
columns=["Currency","Average"]
df=spark.createDataFrame(data=data, schema=columns)
# Implementation
df = df.withColumn("Currency", F.when(df.Currency.isNotNull(), F.concat(F.lit("$"),df.Currency)).otherwise(df.Currency))\
.withColumn("Average", F.when(df.Average.isNotNull(), F.concat(df.Average, F.lit("%"))).otherwise(df.Average))
df.show()
#+--------+-------+
#|Currency|Average|
#+--------+-------+
#| null| 55.6%|
#| €492.38| null|
#+--------+-------+
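Since F.concat returns null whenever any of its inputs is null, the when()/otherwise() can also be collapsed into a shorter equivalent. A sketch:
# sketch: F.concat propagates nulls, so null rows stay null without when()/otherwise()
df = df.withColumn("Currency", F.concat(F.lit("€"), "Currency"))\
       .withColumn("Average", F.concat("Average", F.lit("%")))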
I am importing data which has a date column in yyyy.MM.dd format. Missing values have been marked as 0000.00.00. This 0000.00.00 is treated differently depending on the function/module employed to bring the data into the dataframe.
The .csv file looks like this:
2016.12.23,2016.12.23
0000.00.00,0000.00.00
Method 1: .csv()
from pyspark.sql.types import StructType, StructField, StringType, DateType

schema = StructType([
    StructField('date', StringType()),
    StructField('date1', DateType()),
])

df = spark.read.schema(schema)\
    .format('csv')\
    .option('header', 'false')\
    .option('sep', ',')\
    .option('dateFormat', 'yyyy.MM.dd')\
    .load(path + 'file.csv')
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00|0002-11-30|
+----------+----------+
Method 2: .to_date()
from pyspark.sql.functions import to_date, col
df = sqlContext.createDataFrame([('2016.12.23','2016.12.23'),('0000.00.00','0000.00.00')],['date','date1'])
df = df.withColumn('date1',to_date(col('date1'),'yyyy.MM.dd'))
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00| null|
+----------+----------+
Question: Why do the two methods give different results? I would have expected to get null for both; in the first case I instead get 0002-11-30. Can anyone explain this anomaly?
I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error:
"AttributeError: 'list' object has no attribute 'dropDuplicates'"
Not quite sure why, as I seem to be following the syntax in the latest documentation.
#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()
#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()
#dropping duplicates from the dataframe
df1.dropDuplicates().show()
It is not an import problem. You simply call .dropDuplicates() on the wrong object. While sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:
df1 = (sqlContext
       .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
       .dropDuplicates())
df1.collect()
If you have a dataframe and want to remove all duplicates, with reference to duplicates in a specific column (called 'colName'):
count before dedupe:
df.count()
do the de-dupe (convert the column you are de-duping to string type):
from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))
df.drop_duplicates(subset=['colName']).count()
You can use a sorted groupby to check that the duplicates have been removed:
df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
In summary, the distinct() and dropDuplicates() methods both remove duplicates, with one essential difference: dropDuplicates() can restrict the comparison to a subset of the columns.
data = [("James","","Smith","36636","M",60000),
("James","Rose","","40288","M",70000),
("Robert","","Williams","42114","",400000),
("Maria","Anne","Jones","39192","F",500000),
("Maria","Mary","Brown","","F",0)]
columns = ["first_name","middle_name","last_name","dob","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)
df.groupBy('first_name').agg(count('first_name').alias("count_duplicates"))\
    .filter(col('count_duplicates') >= 2).show()
df.dropDuplicates(['first_name']).show()
# output
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob |gender|salary|
+----------+-----------+---------+-----+------+------+
|James | |Smith |36636|M |60000 |
|James |Rose | |40288|M |70000 |
|Robert | |Williams |42114| |400000|
|Maria |Anne |Jones |39192|F |500000|
|Maria |Mary |Brown | |F |0 |
+----------+-----------+---------+-----+------+------+
+----------+----------------+
|first_name|count_duplicates|
+----------+----------------+
| James| 2|
| Maria| 2|
+----------+----------------+
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name| dob|gender|salary|
+----------+-----------+---------+-----+------+------+
| James| | Smith|36636| M| 60000|
| Maria| Anne| Jones|39192| F|500000|
| Robert| | Williams|42114| |400000|
+----------+-----------+---------+-----+------+------+
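For contrast, distinct() compares entire rows, so on this sample (where every full row is unique) it removes nothing. A quick sketch:
# sketch: distinct() compares all columns; dropDuplicates(['first_name']) only that one
df.distinct().count()                       # 5, every full row is unique
df.dropDuplicates(['first_name']).count()   # 3: James, Maria, Robert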