PySpark: extracting a string using Python

I have a Spark dataframe with a column emailID, e.g. ram.shyam.78uy#testing.com. I would like to extract the string between "." and "#", i.e. 78uy, and store it in a column.
I tried:
from pyspark.sql.functions import split, size
split_for_alias = split(rs_csv['emailID'], '[.]')
rs_csv_alias = rs_csv.withColumn('alias', split_for_alias.getItem(size(split_for_alias) - 2))
It adds 78uy#testing as the alias. I could add another column and chop off the extra values, but is it possible to do this in a single statement?

Extract the alphanumeric token immediately preceded by the special character "." and immediately followed by the special character "#".
DataFrame
data= [
(1,"am.shyam.78uy#testing.com"),
(2, "j.k.kilo#jom.com")
]
df=spark.createDataFrame(data, ("id",'emailID'))
df.show()
+---+--------------------+
| id| emailID|
+---+--------------------+
| 1|am.shyam.78uy#tes...|
| 2| j.k.kilo#jom.com|
+---+--------------------+
Code
from pyspark.sql.functions import regexp_extract
df.withColumn('name', regexp_extract('emailID', r'(?<=\.)(\w+)(?=\#)', 1)).show()
outcome
+---+--------------------+----+
| id| emailID|name|
+---+--------------------+----+
| 1|am.shyam.78uy#tes...|78uy|
| 2| j.k.kilo#jom.com|kilo|
+---+--------------------+----+
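If you prefer the split-based approach from the question, the same result can also be had in a single statement by splitting off the "#" part first and taking the last "."-separated token. A minimal sketch, assuming Spark 2.4+ for element_at:
from pyspark.sql import functions as F
# split at '#' and keep the local part, then split on '.' and take the last token
df.withColumn('alias', F.element_at(F.split(F.split('emailID', '#').getItem(0), r'\.'), -1)).show()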

We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
First we set up a Pandas DataFrame to test:
import pandas as pd
df = pd.DataFrame({"id":[1,2],"email": ["am.shyam.78uy#testing.com", "j.k.kilo#jom.com"]})
Next, we make a native Python function. The logic is clear this way.
from typing import List, Dict, Any

def extract(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        # take the part before "#", then the last "."-separated token
        email = row["email"].split("#")[0].split(".")[-1]
        row["new_col"] = email
    return df
Then we can test on the Pandas engine:
from fugue import transform
transform(df, extract, schema="*, new_col:str")
Because it works, we can bring it to Spark by supplying an engine:
import fugue_spark
transform(df, extract, schema="*, new_col:str", engine="spark").show()
+---+--------------------+-------+
| id| email|new_col|
+---+--------------------+-------+
| 1|am.shyam.78uy#tes...| 78uy|
| 2| j.k.kilo#jom.com| kilo|
+---+--------------------+-------+
Note .show() is needed because Spark evaluates lazily. This transform can take in both Pandas and Spark DataFrames and will output a Spark DataFrame if using the Spark engine.
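As a minimal sketch (assuming an active SparkSession named spark), transform can also be fed a Spark DataFrame directly:
# start from a Spark DataFrame instead of Pandas; the output is again a Spark DataFrame
sdf = spark.createDataFrame(df)
result = transform(sdf, extract, schema="*, new_col:str", engine="spark")
result.show()  # still lazy until an action such as show() runs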

Related

WHEN function condition not getting honoured in pyspark [duplicate]

I have a udf function which takes the key and returns the corresponding value from name_dict.
from pyspark.sql import *
from pyspark.sql.functions import udf, when, col
name_dict = {'James': "manager", 'Robert': 'director'}
func = udf(lambda name: name_dict[name])
The original dataframe: James and Robert are in the dict, but Michael is not.
data = [("James","M"),("Michael","M"),("Robert",None)]
test = spark.createDataFrame(data = data, schema = ['name', 'gender'])
test.show()
+-------+------+
| name|gender|
+-------+------+
| James| M|
|Michael| M|
| Robert| null|
+-------+------+
To prevent KeyError, I use the when condition to filter the rows before any operation, but it does not work.
test.withColumn('senior', when(col('name').isin(['James', 'Robert']), func(col('name'))).otherwise(col('gender'))).show()
PythonException: An exception was thrown from a UDF: 'KeyError:
'Michael'', from , line 8. Full traceback
below...
What is the cause of this and are there any feasible ways to solve this problem? Assume that not all the names are keys of the dictionary and for those that are not included, I would like to copy the value from another column, say gender here.
This is actually the behavior of user-defined functions in Spark. You can read in the docs:
The user-defined functions do not support conditional expressions or
short circuiting in boolean expressions and it ends up with being
executed all internally. If the functions can fail on special rows,
the workaround is to incorporate the condition into the functions.
So in your case you need to rewrite your UDF as:
func = udf(lambda name: name_dict.get(name, "NA"))
Then calling it using:
test.withColumn('senior', func(col('name'))).show()
#+-------+------+--------+
#| name|gender| senior|
#+-------+------+--------+
#| James| M| manager|
#|Michael| M| NA|
#| Robert| null|director|
#+-------+------+--------+
However, in your case you can actually do this without having to use a udf, by using a map column:
from itertools import chain
from pyspark.sql.functions import col, create_map, lit
map_col = create_map(*[lit(x) for x in chain(*name_dict.items())])
test.withColumn('senior', map_col[col('name')]).show()
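Note that the map lookup returns null for names that are not keys (Michael here). If you want the fallback to the gender column that the question asks for, one option is to wrap the lookup in coalesce; a minimal sketch:
from pyspark.sql.functions import coalesce
# fall back to the gender column when the name is not a key of the map
test.withColumn('senior', coalesce(map_col[col('name')], col('gender'))).show()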

Get last / delimited value from Dataframe column in PySpark

I am trying to get the last string after '/'.
The column can look like this: "lala/mae.da/rg1/zzzzz" (not necessarily only 3 slashes), and I'd like to return: zzzzz
In SQL and Python it's very easy, but I would like to know if there is a way to do it in PySpark.
Solving it in Python:
original_string = "lala/mae.da/rg1/zzzzz"
last_char_index = original_string.rfind("/")
new_string = original_string[last_char_index+1:]
or directly:
new_string = original_string.rsplit('/', 1)[1]
And in SQL:
RIGHT(MyColumn, CHARINDEX('/', REVERSE(MyColumn))-1)
For PySpark I was thinking something like this:
df = df.select(col("MyColumn").rsplit('/', 1)[1])
but I get the following error: TypeError: 'Column' object is not callable and I am not even sure Spark allows me to do rsplit at all.
Do you have any suggestion on how I can solve this?
Adding another solution even though Pav3k's answer is great: element_at, which gets an item at a specific position in a list:
from pyspark.sql import functions as F
df = (df.withColumn('my_col_split', F.split(df['MyColumn'], '/'))
        .select('MyColumn', F.element_at(F.col('my_col_split'), -1).alias('rsplit')))
>>> df.show(truncate=False)
+---------------------+------+
|MyColumn |rsplit|
+---------------------+------+
|lala/mae.da/rg1/zzzzz|zzzzz |
|fefe |fefe |
|fe/fe/frs/fs/fe32/4 |4 |
+---------------------+------+
Pav3k's DF used.
import pandas as pd
from pyspark.sql import functions as F
df = pd.DataFrame({"MyColumn": ["lala/mae.da/rg1/zzzzz", "fefe", "fe/fe/frs/fs/fe32/4"]})
df = spark.createDataFrame(df)
df.show(truncate=False)
# output
+---------------------+
|MyColumn |
+---------------------+
|lala/mae.da/rg1/zzzzz|
|fefe |
|fe/fe/frs/fs/fe32/4 |
+---------------------+
(
df
.withColumn("NewCol",
F.split("MyColumn", "/")
)
.withColumn("NewCol", F.col("Newcol")[F.size("NewCol") -1])
.show()
)
# output
+--------------------+------+
| MyColumn|NewCol|
+--------------------+------+
|lala/mae.da/rg1/z...| zzzzz|
| fefe| fefe|
| fe/fe/frs/fs/fe32/4| 4|
+--------------------+------+
Since Spark 2.4, you can use the split built-in function to split your string, then the element_at built-in function to get the last element of the resulting array, as follows:
from pyspark.sql import functions as F
df = df.select(F.element_at(F.split(F.col("MyColumn"), '/'), -1))
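Alternatively, the substring_index built-in function (available since Spark 1.5) returns everything after the last delimiter when given a negative count; a minimal sketch:
from pyspark.sql import functions as F
# count = -1 keeps everything to the right of the last '/'
df = df.withColumn("last_part", F.substring_index(F.col("MyColumn"), "/", -1))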

How to prefix € sign and suffix % sign for 2 column in python dataframe

I have a dataframe, and I want to prefix a € sign to the Currency column and suffix a % sign to the Average column, but only where there are values, not for all rows.
Here is what I tried:
df = lit(col('€'+'Currency'))
df= lit(col('Average'+'%'))
Thanks in advance
In PySpark this is a simple when()/otherwise() implementation. Make sure the columns are cast to StringType() instead of DoubleType().
from pyspark.sql import functions as F
# Sample Dataframe
data = [(None,"55.6"),("492.38",None)]
columns=["Currency","Average"]
df=spark.createDataFrame(data=data, schema=columns)
# Implementation
df = df.withColumn("Currency", F.when(df.Currency.isNotNull(), F.concat(F.lit("$"),df.Currency)).otherwise(df.Currency))\
.withColumn("Average", F.when(df.Average.isNotNull(), F.concat(df.Average, F.lit("%"))).otherwise(df.Average))
df.show()
#+--------+-------+
#|Currency|Average|
#+--------+-------+
#| null| 55.6%|
#| $492.38| null|
#+--------+-------+
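Since concat itself returns null when any of its inputs is null, the when()/otherwise() guard can arguably be dropped; a minimal sketch of the same idea:
# concat propagates nulls, so rows without a value stay null automatically
df = df.withColumn("Currency", F.concat(F.lit("€"), df.Currency))\
       .withColumn("Average", F.concat(df.Average, F.lit("%")))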

Programatically select columns from a dataframe with udf

I am new to pyspark.
I am trying to extract columns of a dataframe using a config file which contains a UDF.
If I define the select columns as a list on the client it works, but if I import the list from a config file, the entries in the column list are of type string.
Is there an alternate way?
Opening the shell using pyspark:
*******************************************************************
version 2.2.0
Using Python version 2.7.16 (default, Mar 18 2019 18:38:44)
SparkSession available as 'spark'
*******************************************************************
jsonDF = spark.read.json("/tmp/people.json")
jsonDF.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
jsonDF.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
jsonCurDF = jsonDF.filter(jsonDF.age.isNotNull()).cache()
# Define the UDF
from pyspark.sql.functions import udf

@udf("long")
def squared_udf(s):
    return s * s
# Selecting the columns from a list.
colSelList = ['age', 'name', squared_udf('age')]
jsonCurDF.select(colSelList).show()
+---+------+----------------+
|age| name|squared_udf(age)|
+---+------+----------------+
| 30| Andy| 900|
| 19|Justin| 361|
+---+------+----------------+
# If I use an external config file
colSelListStr = ["age", "name" , "squared_udf('age')"]
jsonCurDF.select(colSelListStr).show()
The above command fails with "cannot resolve '`squared_udf('age')`'".
I tried registering the function, tried selectExpr, and tried using the column function.
In colSelList the udf call is translated to a Column type.
print colSelList[2]
Column<squared_udf(age)>
print colSelListStr[2]
squared_udf('age')
print column(colSelListStr[2])
Column<squared_udf('age')>
What am I doing wrong here? or is there an alternate solution?
It's because squared_udf('age') is treated as a plain string, not as a function call, when you pass it in from a list.
There is a roundabout way you can do this, and you don't need to import a UDF for it.
Assume col_list is the list of columns you need to select (something like ["age", "name", "squared_age"]).
Passing this list directly will result in an error because squared_age is not a column of this data frame.
So first take all the columns of the existing df into a list:
existing_cols = df.columns
existing_cols holds the columns you have, and col_list holds the columns you need.
Now take the intersection of both lists, which gives you a list of the common elements:
intersection = list(set(existing_cols) & set(col_list))
Now try it like this:
newDF = df.select(intersection).rdd.map(lambda x: (x["age"], x["name"], x["age"]*x["age"])).toDF(col_list)
which will give you the existing columns plus the squared age.
Hope this helped.
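Another option, if registering the UDF is acceptable, is to keep the config entries as SQL expression strings and resolve them with expr(). A minimal sketch, assuming the function is registered before the select runs and the column reference inside the string is not quoted (i.e. squared_udf(age) rather than squared_udf('age')):
from pyspark.sql.functions import expr
# register the plain Python function under a SQL name so string expressions can call it
spark.udf.register("squared_udf", lambda s: s * s, "long")
colSelListStr = ["age", "name", "squared_udf(age)"]
jsonCurDF.select([expr(c) for c in colSelListStr]).show()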

Remove duplicates from a dataframe in PySpark

I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error:
"AttributeError: 'list' object has no attribute 'dropDuplicates'"
Not quite sure why as I seem to be following the syntax in the latest documentation.
#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()
#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()
#dropping duplicates from the dataframe
df1.dropDuplicates().show()
It is not an import problem. You simply call .dropDuplicates() on the wrong object. While the class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:
df1 = (sqlContext
       .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
       .dropDuplicates())
df1.collect()
If you have a data frame and want to remove all duplicates with reference to duplicates in a specific column (called 'colName'):
Count before the de-dupe:
df.count()
Do the de-dupe (convert the column you are de-duping to string type):
from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))
df.drop_duplicates(subset=['colName']).count()
You can use a sorted groupby to check that the duplicates have been removed:
df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
In summary, the distinct() and dropDuplicates() methods both remove duplicates, with one essential difference:
dropDuplicates() is more suitable because it can consider only a subset of the columns.
data = [("James","","Smith","36636","M",60000),
("James","Rose","","40288","M",70000),
("Robert","","Williams","42114","",400000),
("Maria","Anne","Jones","39192","F",500000),
("Maria","Mary","Brown","","F",0)]
columns = ["first_name","middle_name","last_name","dob","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)
from pyspark.sql.functions import col, count
df.groupBy('first_name').agg(count('first_name').alias("count_duplicates")) \
  .filter(col('count_duplicates') >= 2).show()
df.dropDuplicates(['first_name']).show()
# output
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob |gender|salary|
+----------+-----------+---------+-----+------+------+
|James | |Smith |36636|M |60000 |
|James |Rose | |40288|M |70000 |
|Robert | |Williams |42114| |400000|
|Maria |Anne |Jones |39192|F |500000|
|Maria |Mary |Brown | |F |0 |
+----------+-----------+---------+-----+------+------+
+----------+----------------+
|first_name|count_duplicates|
+----------+----------------+
| James| 2|
| Maria| 2|
+----------+----------------+
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name| dob|gender|salary|
+----------+-----------+---------+-----+------+------+
| James| | Smith|36636| M| 60000|
| Maria| Anne| Jones|39192| F|500000|
| Robert| | Williams|42114| |400000|
+----------+-----------+---------+-----+------+------+
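For comparison, a minimal sketch of the other side of that difference: distinct() (and dropDuplicates() with no argument) only drops rows that are identical across all columns, so it keeps all five rows above, while dropDuplicates(['first_name']) keeps one row per first_name:
df.distinct().count()                      # 5 - no row is a full duplicate of another
df.dropDuplicates().count()                # 5 - same as distinct()
df.dropDuplicates(['first_name']).count()  # 3 - one row kept per first_name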
